Science in China Series A: Mathematics, Mar. 2009, Vol. 52, No. 3, 539–560
www.scichina.com  math.scichina.com  www.springerlink.com
Dimension reduction based on weighted variance estimate

ZHAO JunLong¹† & XU XingZhong²

¹ Department of Mathematics, Beihang University; Laboratory of Mathematics, Information and Behavior of the Ministry of Education, Beijing 100083, China
² Department of Mathematics, Beijing Institute of Technology, Beijing 100081, China
(email: [email protected], [email protected])
Abstract  In this paper, we propose a new estimate for dimension reduction, called the weighted variance estimate (WVE), which includes the Sliced Average Variance Estimate (SAVE) as a special case. A bootstrap method is used to select the best estimate from the family of WVE and to estimate the structure dimension, and this selected estimate usually performs better than existing methods such as Sliced Inverse Regression (SIR) and SAVE. Many methods, such as SIR and SAVE, put the same weight on each observation when estimating the central subspace (CS). By introducing a weight function, WVE puts different weights on different observations according to their distance from the CS. The weight function gives WVE very good performance in general and complicated situations, for example, when the distribution of the regressor deviates severely from the elliptical distributions on which many methods, such as SIR, are based. Compared with many existing methods, WVE is insensitive to the distribution of the regressor. The consistency of WVE is established. Simulations comparing the performance of WVE with other existing methods confirm the advantage of WVE.
Keywords: central subspace, contour regression, sliced average variance estimate, sliced inverse regression, sufficient dimension reduction, weight function
MSC(2000): 62G08, 62H05

1 Introduction
In the regression setting, the goal is to infer about the conditional distribution of the univariate response variable Y given the p × 1 vector of predictors X. In this paper we consider sufficient dimension reduction, the basic idea of which is to seek a matrix β of dimension p × k (k < p) such that the p-dimensional predictor X can be replaced by β^T X without loss of information in predicting the response Y. In other words, the distribution of Y|X is the same as that of Y|β^T X. For the convenience of further discussion, we introduce the following notation used throughout. The notation U ⊥ V means that the random vectors U and V are independent. A subspace will be denoted by S, and S(B) means the subspace spanned by the columns of the matrix or vector B. P_B denotes the projection operator onto S(B) with respect to the usual inner product and Q_B = I − P_B. R(U) denotes the support set of the distribution of the variable U.

Received March 3, 2008; accepted June 2, 2008; published online December 5, 2008
DOI: 10.1007/s11425-008-0130-z
† Corresponding author
This work was supported by National Natural Science Foundation of China (Grant No. 10771015)
Following [1], a matrix β of dimension p × k can be defined by

Y ⊥ X | β^T X,    (1.1)

that is, given β^T X, Y is independent of X. S(β) is called a dimension reduction subspace (DRS). If the intersection of all DRS's is still a DRS, it is called the central dimension reduction subspace (briefly, CS), denoted by S_{Y|X} (see [2, 3]). The existence of the CS is assured under mild regularity conditions (see [2, 3]). Assume that Var(X) is nonsingular and let Z = [Var(X)]^{−1/2}(X − E(X)). Since S_{Y|X} = [Var(X)]^{−1/2} S_{Y|Z} (see [1, 2]), we need only estimate S_{Y|Z}. Several methods are available for estimating S_{Y|Z}, including Sliced Inverse Regression (SIR) (see [4]), SIR-II, Sliced Average Variance Estimation (SAVE) (see [5]), Principal Hessian Directions (PHD) (see [6]), Contour Regression (see [7]), the Minimum Average Variance Estimator (MAVE) (see [8]), etc. The effectiveness of many of these methods, such as SIR and SAVE, rests on one or both of the following two strict assumptions (see [9]):

C1. Linearity Condition: E(Z | β^T Z) = P_β Z;
C2. Constant Covariance Condition: Var(Z | β^T Z) ≡ Q_β, a constant matrix,

where the columns of the matrix β form a standardized orthogonal basis of S_{Y|Z}. The estimate matrices used by SIR and SAVE, respectively, are

M_SIR = Cov[E(Z|Y)],    M_SAVE = E[(I_p − Cov(Z|Y))²].

SIR needs only assumption C1, which approximately amounts to requiring Z to have an elliptical distribution. SAVE is based on both C1 and C2, which approximately amounts to requiring Z to have a normal distribution. Some methods may be used to achieve ellipticity, such as the Box-Cox transformation and re-weighting the data by Voronoi weights (see [10]). However, these methods need intensive computation, the selection of some of the parameters involved is empirical, and not all data can reach ellipticity by the re-weighting method (see [10]).

SIR is an effective method for the dimension reduction problem. However, when Y is symmetric with respect to Z, SIR fails to find any nonzero direction. To overcome the limits of SIR, Cook and Weisberg [5] developed the SAVE method. SAVE is more comprehensive than SIR (see [9, 11]); it has been proved that the directions found by SIR can also be found by SAVE (see [9, 11, 12]). However, there are several problems with SAVE in application. First, SAVE requires assumptions C1 and C2, or a normal distribution, which are very strong restrictions for application. Second, SAVE is sensitive to the distribution of X (see [13, 14]) and to the number of slices (see [15]). Several methods have been developed to improve the performance of SAVE, such as Kernel SAVE (KSAVE) (see [16]) and Contour Projection SAVE (CP-SAVE) (see [17]).

In this paper, a new family of estimates, called weighted variance estimates (briefly, WVE), is proposed, which includes SAVE as a special case. The most important point distinguishing WVE from SAVE and SIR is that different weights are assigned to different observations. WVE has several advantages. First, the weight function makes WVE work well under a much weaker assumption, removing both C1 and C2. Second, we can obtain the best estimate in the family of WVE, which usually performs better than SAVE, SIR, etc. Third, WVE is robust compared with many methods such as SAVE and SIR.

The main contents of this paper are arranged as follows. In Section 2, the minimum norm estimate is introduced, which is the basis of this
article. The weighted variance estimate is developed in Section 3. In Section 4, we discuss the selection of the best estimate from the family of WVE. In Section 5, consistency of the estimate is established. The simulation results and a real data analysis are presented in Sections 6 and 7. Some relevant problems are discussed in Section 8. Finally, the proofs are presented in Section 9.

2 The estimate with minimum norm
2.1 Minimum norm estimate
In this section, we introduce a new estimate, the minimum norm estimate (MNE), which is the basis of the weighted variance estimate in Section 3. Li et al. [7] proposed the General Contour Regression (GCR) method. Suppose that X and X̃ are independent identically distributed variables and that x and x̃ are the corresponding observations of the two variables, respectively. GCR views x − x̃ as an empirical direction and uses the variance of Y along the line crossing x and x̃ to decide whether x − x̃ is a direction in the CS. Let R(Z) be the support of Z. In a spirit similar to that of Contour Regression, every point z in R(Z) is a vector and may be viewed as a direction, and in this paper we directly select from R(Z) the points z^(0) that fall in S_{Y|Z} as estimates of directions in the CS. Different from [13], however, we use the norm to decide whether a point z falls in the CS. Since the estimate z^(0) can be described by a minimum norm, we call it the minimum norm estimate (MNE). MNE needs only a very weak assumption on Z and is very robust to the distribution of Z.

Of course, each point z in R(Z), viewed as a direction, is not necessarily contained in the CS. However, in many cases there indeed exist points in R(Z) that fall in the CS. For example, in the case where R(Z) = R^p, since the CS is always a subspace of R^p, it is obvious that in R(Z) there must exist points which are contained in the CS. Of course, the assumption R(Z) = R^p is only used to illustrate our idea rather than being a necessary condition for our method. The remaining problems are to find the situation or assumption under which there exist points of R(Z) falling in the CS, and to identify such points.

Let us consider the first problem. By (1.1), Y depends only on β^T Z; therefore, a given Y is equivalent to restricting the values of β^T Z to the corresponding specific set. At the same time, for any direction b ∈ S^⊥_{Y|Z}, b^T Z is still a variable. It is possible that b^T Z can reach 0. Given β^T Z = c (constant), if b^T Z can reach 0, then the point z* ∈ R(Z), satisfying β^T z* = c and b^T z* = 0, is a direction in the CS as desired. Therefore, we need the following assumption.

Assumption A. Suppose that β is a standardized orthogonal basis of S_{Y|Z} and that δ(0, ε) is a neighborhood of 0 with radius ε > 0. For any b ∈ S^⊥_{Y|Z} and any ε > 0, we have P(b^T Z ∈ δ(0, ε) | β^T Z) > 0.

C1 requires that E(b^T Z | β^T Z) = 0 for any b ∈ S(β)^⊥, whereas Assumption A only requires that the conditional distribution b^T Z | β^T Z can take values in δ(0, ε) with positive probability.

Remark 2.1. Assumption A is equivalent to saying that 0 is in the support of the distribution b^T Z | β^T Z. Therefore, if R(Z) is convex, Assumption A is much weaker than C1 or C2, which approximately mean that Z has an elliptical or normal distribution. It is easy to see that if the support of Z is the whole R^p, then Assumption A holds.

Remark 2.2. Due to Y ⊥ Z | β^T Z, Assumption A can be relaxed further to requiring Assumption A′: P(b^T Z ∈ δ(0, ε) | Y) > 0. In the regression setting where the coordinates of Z are
related, if the relation among the regressors is not too strong, Assumption A or A′ generally holds. If the linearity and nonlinearity among the regressors are very strong, the model may become unidentifiable. For example, consider Y = X1 + ε and X2 = X1² + aε1, with X = (X1, . . . , Xp) and ε, ε1 being independent normal variables. As a → 0, the CS will not exist. As pointed out by Li [18], it seems reasonable that both (1, 0, . . . , 0) and (0, 1, 0, . . . , 0) should be used to construct the model.

From the above argument, we remove the conditions C1 and C2, which are the fundamental assumptions of many methods. The next problem is to identify the points which fall in the CS. Consider the dimension reduction model (see [4])

Y = f(β^T Z, ε),    ε ⊥ Z,

where f is an unspecified function. Let E and R(Y) denote the supports of ε and Y, respectively. Let I0 be any Borel measurable set in R(Y). In particular, I0 can be a single-element set, for example, I0 = {y0} for any y0 ∈ R(Y). For any ε ∈ E, let

L(ε) = {z : y = f(β^T z, ε) ∈ I0, with ε fixed}

be the inverse image of the set {y : y ∈ I0} for the given ε, and let

C0 = {z | y ∈ I0} = {z : y = f(β^T z, ε) ∈ I0, ε ∈ E} = ∪_{ε∈E} L(ε)
be the inverse image of the set {y : y ∈ I0} with respect to z; that is, C0 is the collection of all possible z in R(Z) that make f(β^T z, ε) ∈ I0. Let a(I0) = inf{‖z‖² : z ∈ C0} and consider the points in the following set:

C0^min = {z^(0) : z^(0) ∈ C̄0 and ‖z^(0)‖² = a(I0)},    (2.1)
where C̄0 is the closure of C0. In fact, under Assumption A, for any z^(0) ∈ C0^min, we have z^(0) ∈ S_{Y|Z}. Let us take a simple example for illustration. Suppose that Y = Z1 + e, where Z = (Z1, Z2) ∼ N(0, I2) and e ∼ U(0, 1), the uniform distribution on [0, 1]. For this example, S_{Y|Z} = span{(1, 0)^T}. For any given y, let I0 = {y} and let y^{−1} = {z1 : z1 = y − e for some e ∈ [0, 1]} be the inverse image of y with respect to z1. Let C0 = {z = (z1, z2) : z ∈ R² and z1 ∈ y^{−1}} and consider the point z^(0) ∈ C̄0 satisfying ‖z^(0)‖ = inf{‖z‖ : z ∈ C̄0}. For example, given y = 2, we have C̄0 = {z = (z1, z2) : z1 = y − e ∈ [1, 2] and z ∈ R²}. For any z ∈ C̄0, since ‖z‖² = z1² + z2² = (2 − e)² + z2², it is easy to see that inf{‖z‖ : z ∈ C̄0} = 1 and z^(0) = (1, 0). Therefore z^(0) ∈ S_{Y|Z}. In general, we have the following conclusion; the proof is presented in Section 9.

Theorem 2.1. Suppose that Assumption A holds. Then for any z^(0) ∈ C0^min, it holds that z^(0) ∈ S_{Y|Z} and

‖z^(0)‖² = inf{‖P_β z‖² : z ∈ C̄0}.
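As a purely illustrative check (our own sketch, not part of the original argument), the following Python snippet simulates the toy model above and picks the minimum-norm observation in a thin slice of Y around y = 2; it should land near (1, 0).

```python
import numpy as np

# Numerical check of the toy example: Y = Z1 + e with Z ~ N(0, I2) and
# e ~ U(0, 1).  Within a thin slice of Y around y = 2, the observation of
# smallest Euclidean norm should lie close to z(0) = (1, 0), a direction
# in S_{Y|Z} = span{(1, 0)^T}.
rng = np.random.default_rng(0)
n = 200000
Z = rng.standard_normal((n, 2))
e = rng.uniform(0.0, 1.0, size=n)
Y = Z[:, 0] + e

in_slice = np.abs(Y - 2.0) < 0.01          # proxy for the set I0 = {2}
Z_slice = Z[in_slice]
z0 = Z_slice[np.argmin(np.sum(Z_slice ** 2, axis=1))]
print("minimum norm point in the slice:", z0)   # expected to be near (1, 0)
```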
Representing the above by variables, we obtain the minimum norm estimate (MNE) Z^(0)(I0), which takes values only in the set C0^min. Obviously, given I0, Z^(0)(I0) is a constant. However, it is possible that Z^(0)(I0) = 0. Therefore, we consider the possibility of Z^(0)(I0) being a nonzero vector. In fact, we can get a nonzero estimate in general cases. Consider the dimension reduction model (see [4]) Y = f(β^T Z, ε), where ε ⊥ Z and f is an unspecified function.

Case I. ε is bounded.

For any ρ > 0, let δ(0, ρ) denote the closed sphere centered at 0 with radius ρ in R^p. Let A = {z : z ∈ R(Z) ∩ δ(0, ρ)} and let Y|_{Z∈A} = f(β^T Z, ε)|_{Z∈A} = f_A(Z, ε) be the conditional model
of Y given Z ∈ A. Let a1 = inf_{Z∈A, ε∈E} f_A(Z, ε) and b1 = sup_{Z∈A, ε∈E} f_A(Z, ε). If R(Y)\[a1, b1] ≠ ∅, then for any y ∈ R(Y)\[a1, b1] and any z ∈ C0 with I0 = {y}, we have ‖z‖ ≥ ρ. Therefore, for any Borel measurable set I0 ⊆ R(Y)\[a1, b1] and any z ∈ C0, we still have ‖z‖ ≥ ρ. Consequently, by the definition of z^(0) in (2.1), it follows that ‖z^(0)‖ ≥ ρ. This, along with Theorem 2.1, leads to the following conclusion.

Proposition 2.1. Suppose that Assumption A holds and that R(Y)\[a1, b1] ≠ ∅. Let I0 be any Borel measurable set in R(Y)\[a1, b1]. Then for any z^(0) ∈ C0^min, it holds that z^(0) ∈ S_{Y|Z} with ‖z^(0)‖ ≥ ρ > 0.

Case II. ε is unbounded.

For any ρ > 0, let A = {Z : Z ∈ R(Z) ∩ δ(0, ρ)}, A^c = {Z : Z ∈ R(Z)\δ(0, ρ)}, and Y|_{Z∈A} = f(β^T Z, ε)|_{Z∈A} = f_A(Z, ε). In this case, f_A(Z, ε) may be unbounded, for example, for the local structure Y = f(β^T Z) + ε. However, the variable Y|_{Z∈A} is contained in a closed bounded interval with very high probability, that is, for any sufficiently small α > 0, there exists [a2, b2] with a2, b2 finite, such that

P(Y|_{Z∈A} ∈ [a2, b2]) > 1 − α.    (2.2)

Suppose that R(Y)\[a2, b2] ≠ ∅. As argued in Case I, if Y|_{Z∈A} ∈ [a2, b2], then for any y ∈ R(Y)\[a2, b2] and any z ∈ C0 with I0 = {y}, it holds that ‖z‖ > ρ, that is, ‖Z‖|y > ρ, where ‖Z‖|y is the conditional norm of Z given y. Therefore, for any y ∈ R(Y)\[a2, b2], the conditional probability

P(‖Z‖|y > ρ | Y|_{Z∈A} ∈ [a2, b2]) = 1.    (2.3)

Therefore, for any y ∈ R(Y)\[a2, b2], by (2.2) and (2.3), we have

P(‖Z‖|y > ρ) = P(Y|_{Z∈A} ∈ [a2, b2]) P(‖Z‖|y > ρ | Y|_{Z∈A} ∈ [a2, b2])
      + P(Y|_{Z∈A} ∈ R(Y)\[a2, b2]) P(‖Z‖|y > ρ | Y|_{Z∈A} ∈ R(Y)\[a2, b2])
   ≥ P(Y|_{Z∈A} ∈ [a2, b2]) > 1 − α.

That is, for any y ∈ R(Y)\[a2, b2],

P(Z ∈ A^c | y) = E(I_{Z∈A^c} | y) > 1 − α.    (2.4)
Recall the assumption R(Y)\[a2, b2] ≠ ∅. Then for any Borel measurable set I0 ⊆ R(Y)\[a2, b2], by (2.4), we have P(Z ∈ A^c | Y ∈ I0) = E(I_{Z∈A^c} | Y ∈ I0) > 1 − α, that is, P(‖Z‖ > ρ | Y ∈ I0) > 1 − α. Therefore, by Theorem 2.1 and the definition of Z^(0)(I0), we have the following conclusion.

Proposition 2.2. Suppose that Assumption A holds. For any small α ∈ (0, 1), if R(Y)\[a2, b2] ≠ ∅, then, for any Borel measurable set I0 ⊆ R(Y)\[a2, b2], it holds that Z^(0)(I0) ∈ S_{Y|Z} and

P(‖Z^(0)(I0)‖ > ρ | Y ∈ I0) > 1 − α.
Remark 2.3. From Propositions 2.1 and 2.2, we see that, for I0 ⊆ R(Y)\[ai, bi], i = 1, 2, Z^(0)(I0) can be a nonzero estimate with probability at least 1 − α. In the case of I0 ⊆ [ai, bi], i = 1, 2, we cannot guarantee that there exists ρ > 0 such that ‖Z^(0)(I0)‖ > ρ with very high probability.
However, this does not mean that Z^(0)(I0) must be zero when I0 ⊆ [a1, b1] or I0 ⊆ [a2, b2]; it is possible to draw nonzero directions in those cases as well. Therefore, the probability of Z^(0)(I0) being nonzero varies for different I0.

2.2 The robustness of the minimum norm estimate
For simplicity, take I0 = {y} for any fixed y ∈ R(Y) and compare the robustness of the MNE Z^(0)(y) with the estimators of other well-known methods. Note that Z^(0)(y) is defined by inf{‖z‖ : z ∈ C0}, where C0, according to the definition in Section 2, is the support set of the conditional distribution Z|Y = y. We only need to consider the robustness of T = inf{‖z‖ : z is in the support of Z|Y = y}. Suppose that F is the c.d.f. of Z and that R_F denotes the support of F. Define T(F) = inf{‖z‖² : z ∈ R_F}, which is a functional of F. Let

P_ε = {F : F = (1 − ε1) F_{z|y}(z) + ε1 G(z), 0 ≤ ε1 < ε},

where G(z) is from the class M = {all cumulative distributions G with R_G ⊆ R_{F_{z|y}}}, and F_{z|y}(z) denotes the c.d.f. of the conditional distribution Z|Y = y. For any F ∈ P_ε, by the definition of T(F), we have

T(F) = inf{T(F_{z|y}), T(G)} = T(F_{z|y}) if ε1 < 1, and T(G) if ε1 = 1.

Therefore, it is easy to see that the breakdown point is 1. Let δ_{z0} denote the degenerate c.d.f. at z0 and F_ε = (1 − ε) F_{z|y}(z) + ε δ_{z0}, 0 ≤ ε ≤ 1. In the case of 0 < ε < 1, it is easy to see that the functional

T(F_ε) = T(F_{z|y}) if ‖z0‖² ≥ T(F_{z|y}), and ‖z0‖² if ‖z0‖² < T(F_{z|y}).

And in the case of ε = 0 or 1, T(F_ε) ≡ T(F_{z|y}) or ‖z0‖², respectively. Therefore the influence function IC(z0, F_{z|y}, T) ≡ 0, a.e. Hence T(F), or equivalently the estimate Z^(0)(y), is a very robust estimate and is insensitive to outliers.

3 Dimension reduction with weighted variance estimate
In Section 2, we have shown that z^(0) ∈ S_{Y|Z} under Assumption A. In fact, other points in C0 still carry information about the CS. Consider the points z ∈ C0 satisfying ‖z‖² < ‖z^(0)‖² + d, for some d ≥ 0. If d > 0, it is possible that z is not contained in S_{Y|Z}. However, the following proposition shows that the distance by which z deviates from the true CS is less than d^{1/2}; that is, the points in C0 with smaller norm are closer to the true CS.

Proposition 3.1. Under Assumption A, for any z ∈ C0 with ‖z‖² ≤ ‖z^(0)‖² + d, it follows that

0 ≤ ‖Q_β z‖ ≤ d^{1/2}.    (3.1)
Specifically, as d → 0, we have Theorem 2.1. Theorem 2.1 and Proposition 3.1 tell us that the points with smaller norm are more important in estimating CS. There are two reasons. First, Proposition 3.1 actually provides a measure of the extent of a point z ∈ C0 deviating from true CS. Second, since MNE Z (0) (I0 ) is very robust and its assumption is much weaker (only Assumption A is needed), for the goal of obtaining
the estimate that can work in general cases, more attention should be paid to the points of smaller norm. Therefore, we should allocate larger weights to the points of smaller norm, rather than put the same weight on each point, which is the way of SIR, SAVE, etc.

Let I0 = {y} for any fixed y ∈ R(Y) and, for clarity, rewrite the corresponding z^(0), defined in Section 2, as z^(0)(y). For any z ∈ C0, take the kernel function to be the Gaussian function

w(z|y, a) = (1/a) · K( [ ‖z − z^(0)(y)‖ / a ]² ),    z ∈ C0,

where K(u) = (1/√(2π)) exp(−u) and a ∈ (0, +∞) is a tuning parameter. For the convenience of selecting a better a, let 1/a = c/(1 − c), c ∈ (0, 1). Therefore, the kernel function can be represented as

w(z|y, c) = [c/(1 − c)] · K( [ (c/(1 − c)) ‖z − z^(0)(y)‖ ]² ) I_{z∈C0}.

Representing the above by variables, we have

w(Z|y, c) = [c/(√(2π)(1 − c))] · exp( −[ (c/(1 − c)) ‖Z − Z^(0)(y)‖ ]² ) I_{Y=y},

where, by the definition of Z^(0)(y), Z^(0)(y) is a constant depending only on y. Let W(y, c) = E_Z(w(Z|y, c)), i.e., the expectation taken with respect to Z over the set C0. Put the weight g(z|y, c) = w(z|y, c)/W(y, c) on the point z ∈ C̄0. The weight function g(z|y, c) is a decreasing function of ‖z‖, that is, points of larger norm receive smaller weights. The largest weight c/[√(2π)(1 − c)W(y, c)] is put on the points with the smallest norm ‖z^(0)(y)‖. In addition, it is easy to see that c/[√(2π)(1 − c)W(y, c)] is an increasing function of c. Thus, as c increases, more and more weight is put on the points of minimum norm.

Remark 3.1. In this paper, we select the Gaussian function to construct the weight function. How to select the weight function is a problem worthy of further effort. Generally speaking, it seems reasonable that K(u) should be a positive decreasing function, so that points of smaller norm have larger weights. Note that the weight function w(z|y, c) is different from the one usually used in nonparametric estimation. Given y, w(z|y, c) depends only on ‖z‖, that is, points of the same norm are of equal importance.

Remark 3.2. Recently developed methods, such as the Fourier method (see [19]) and dOPG, dMAVE (see [20]), also use a weight function to improve the estimate. These methods mainly focus on estimating the gradient of some specified function, such as E(T(Y)|X) for some T(Y), and different weights are allocated to different gradient directions to reach a better estimate. For example, let d(x) be the gradient at x and f(x) be the density function of X. dOPG and dMAVE allocate large weights to d(x) if the density f(x) at x is large; in short, large weights are assigned to those d(x) whose estimates are more accurate. In this paper, every point in R(Z) can be viewed as an estimate of directions in the CS, and the weight function g(z|y, c) assigns larger weights to the points closer to the true CS. That is, more accurate estimates have larger weights. In this sense, this paper shares a spirit similar to those of [19] and [20]. In addition, dOPG, dMAVE, etc. make nearly no assumption on the distribution of X and can be used in the non-elliptical case, which is also the goal of this paper.
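The following minimal sketch (an illustration we add here, not the authors' code) computes a sample version of the weights g(z|y, c) for the observations of one slice, with z^(0)(y) approximated by the minimum-norm point of the slice.

```python
import numpy as np

def slice_weights(Z_slice, c):
    """Sample version of the weights g(z|y, c) for observations falling in one slice of Y.

    w(z|y, c) = c/(1-c) * K((c/(1-c) * ||z - z0||)^2), K(u) = exp(-u)/sqrt(2*pi),
    with z0 approximated by the minimum-norm point of the slice (a stand-in for z^(0)(y));
    g = w / W, where W is the slice average of w.
    """
    a_inv = c / (1.0 - c)
    z0 = Z_slice[np.argmin(np.sum(Z_slice ** 2, axis=1))]
    dist = np.linalg.norm(Z_slice - z0, axis=1)
    w = a_inv * np.exp(-(a_inv * dist) ** 2) / np.sqrt(2.0 * np.pi)
    return w / w.mean()

# toy check: the weight mass concentrates on small-norm points as c increases
rng = np.random.default_rng(1)
Z_slice = rng.standard_normal((50, 6))
for c in (0.3, 0.7, 0.9):
    g = slice_weights(Z_slice, c)
    print(c, round(g.max(), 2))
```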
SAVE, based on C1 and C2, is more comprehensive than SIR (see [9, 11]). We extend SAVE further and take our estimate, represented by variables, as

A(y, c) = E_Y{E_Z[ZZ^T g(Z|Y, c)]} − E_Z[ZZ^T g(Z|y, c)] + E_Z[Zg(Z|y, c)] E_Z[Z^T g(Z|y, c)].    (3.2)

In (3.2), E_Y{E_Z[ZZ^T g(Z|Y, c)]}, the expectation of E_Z[ZZ^T g(Z|Y, c)] with respect to Y, is a constant matrix for any fixed c. A(y, c) has two very important extreme cases.

Proposition 3.2.
By the definition in (3.2), we have

lim_{c→0} A(y, c) = I_p − Cov(Z|y),
lim_{c→1} A(y, c) = E_Y{E[Z^(0)(Y) Z^(0)T(Y)]} − E[Z^(0)(y) Z^(0)T(y)] + E[Z^(0)(y)] E[Z^(0)T(y)].    (3.3)
Define A(y, 0) = lim_{c→0} A(y, c) = I_p − Cov(Z|y), the matrix used by SAVE. In the case of c = 0, we put the same weight on each point, just as SAVE does. Define A(y, 1) = lim_{c→1} A(y, c); that is, as c = 1, only the points of minimum norm are used to estimate the central subspace. The following theorem shows that the parameter c provides us with more choices of estimates and makes A(y, c) an effective estimate in general situations.

Theorem 3.1. If Z has a normal distribution, then S(A(y, c)) ⊆ S_{Y|Z} for any y ∈ R(Y) and any c ∈ [0, 1]. If only Assumption A holds, then S(A(y, 1)) ⊆ S_{Y|Z}.

From Theorem 3.1, similar to SAVE, we use the matrix H(c) = E[A(Y, c)]² and make the eigenvalue decomposition of H(c). Since c ∈ [0, 1], we actually get a family of estimates, called the weighted variance estimate (WVE). From Theorem 3.1, it is easy to see the following conclusion.

Corollary 3.1. Let S(H(c)) denote the subspace spanned by H(c). If Z has a normal distribution, then S(H(c)) ⊆ S_{Y|Z} for any c ∈ [0, 1]. And under Assumption A, S(H(1)) ⊆ S_{Y|Z}.

Remark 3.3. Note that H(c) is continuous with respect to c. Under Assumption A, by Corollary 3.1, there must exist some c0 ∈ [0, 1] such that, for any c ∈ [c0, 1], the eigenvalues of H(c) which correspond to S^⊥_{Y|Z} are smaller than those corresponding to S_{Y|Z}. Therefore, if Z deviates from the normal distribution but Assumption A holds, then selecting c from a small neighborhood of 1 still leads to a good estimate.

Remark 3.4. Wang et al. [17] proposed CP-SAVE to improve the performance of SAVE in the case where the distribution of Z has heavy tails. They used the kernel matrix E[I_p/p − Cov(Z/‖Z‖ | Y)]². Their method is based on the assumption of ellipticity, and each point is also viewed as equally important in estimating the CS. Our simulations in Section 6 show that CP-SAVE is usually better than SAVE when Z has heavy tails, whereas in the case of Z being normal, CP-SAVE is not necessarily better than SAVE. Our simulations also show that the performance of WVE is usually much better than SAVE and CP-SAVE in both the elliptical and non-elliptical cases.

Remark 3.5. Some recently developed methods, such as dMAVE (see [20]) and GCR (see [7]), can recover the CS completely. In Corollary 3.1, we show that WVE is contained in the CS under some assumptions, but we do not guarantee that WVE can recover the CS completely. As pointed out by Li [21], SAVE
can recover the CS completely under very weak assumptions, and many applications and simulations have shown that SAVE can generally do so. It is reasonable to believe that WVE, which includes SAVE as a special case, can recover the CS completely in general situations. Our simulations confirm this conjecture; the theoretical research is under investigation.

For WVE, we have H(0) = E(I_p − Cov(Z|Y))², the matrix used by the SAVE method. In practice, similar to SIR and SAVE, we use the sliced version of H(c). Slice the range of Y into h slices I1, . . . , Ih and let Ỹ = j if Y ∈ Ij, j = 1, . . . , h. Therefore, we have the estimate matrix H1(c) = E[A(Ỹ, c)]². SAVE, which always puts the same weight 1 on each point, is actually the special case of H(c) with c = 0. However, it may be inappropriate to put the same weight on each point in all cases. For example, when C1 or C2 fails, SAVE may fail to find the directions correctly. Proper selection of c may lead to a better estimate than SAVE.

Remark 3.6. WVE differs from the kernel SAVE (KSAVE) of [16] mainly in the weight functions used. KSAVE makes a kernel estimate of M_SAVE, but does not consider the distance of a sample point from the CS. One of the referees proposed that we may avoid slicing by following the way of equation (2.5) of [16]. This may increase the number of z^(0)(y) and further improve the performance. However, it may also decrease the accuracy of the estimate of z^(0)(y). For example, if we adopt the kernel used in [16], for a fixed bandwidth the estimate of z^(0)(y) for some y may be poor, since the number of points in the neighborhood of y varies with y. Therefore, a proper selection of the bandwidth to make a compromise is important. This issue is under investigation.

It is well known that the mean is sensitive to outliers and is not robust. Therefore, E(Z|Y), E(ZZ^T|Y), and furthermore Cov(Z|Y), are not robust. Thus, SAVE is not a robust estimate, which is also true for moment-based methods such as SIR, PHD, etc. By Proposition 3.2, as c → 1, the weights on the points with large norm become sufficiently small, so the outliers are removed to some extent; hence proper selection of c can lead to a robust estimate. A complete study of the robustness of WVE may be conducted by referring to Prendergast [22]; it is one topic of our future work. In Section 4, we discuss the best selection of the parameter c.

4 The selection of the parameters c, h and the dimension
Ye and Weiss [11] proposed a bootstrap method to select the parameters and the dimension when estimating the CS. We adopt their main idea to solve our problem here; more details of this method can be found in [11]. For any two k-dimensional subspaces S(M_{p×k}) and S(N_{p×k}), define the vector correlation coefficient as the square root q of q²(M, N) = ∏_{i=1}^{k} ρ_i², where 1 ≥ ρ_1² ≥ ρ_2² ≥ ··· ≥ ρ_k² ≥ 0 are the eigenvalues of M^T N N^T M. Ye and Weiss [11] used 1 − q as an estimate of the distance between the two subspaces, a smaller value of 1 − q implying stronger correlation. Let (Xi, Yi), i = 1, 2, . . . , n, be i.i.d. observations and obtain (Xi^{*b}, Yi^{*b}), i = 1, . . . , n, b = 1, . . . , B, by resampling B times from (Xi, Yi), i = 1, 2, . . . , n.

4.1 Selecting c and the structure dimension with fixed h
For any fixed h, suppose that the dimension k of the CS is known and that c is fixed. Then we can get an estimate Ŝ_{Y|Z} of S_{Y|Z} from the data {(Xi, Yi), i = 1, 2, . . . , n} and bootstrap estimates Ŝ^b_{Y|Z} from the data {(Xi^{*b}, Yi^{*b}), i = 1, . . . , n}, b = 1, . . . , B. Let 1 − q^b, b = 1, . . . , B, be the distance
between Ŝ_{Y|Z} and Ŝ^b_{Y|Z}. Define Mq(c, k) = Σ_{b=1}^{B} (1 − q^b)/B as the variability of the estimate Ŝ_{Y|Z} with parameter c and dimension k. Let V = [0, 1] if Z is normal, and V = [1 − ε, 1] with ε > 0 sufficiently small if only Assumption A holds.

Step 1. For any given dimension of the CS, select the parameter c. Suppose that dim(S_{Y|Z}) = k_h. Let c(k_h) denote the best selection of c corresponding to the given k_h and h; then, according to [11], c(k_h) = arg min_{c∈V} {Mq(c, k_h)}.

Step 2. Select the dimension for fixed h. For any fixed c, compare all the dimensions k_h = 1, . . . , (p − 1). By [11], the choice of the dimension d̂_h(c) for the given c will be d̂_h(c) = arg min_{k_h} {Mq(c, k_h)}. Our selection of the dimension with fixed h is K̂_h = max_{c∈V} d̂_h(c).

Step 3. For the selected K̂_h, apply Step 1 to select the corresponding best parameter c(K̂_h).

4.2 Choosing the number of slices h
In WVE, the number of slices h plays the same role as in SIR and SAVE. Too large or too small an h is not a good choice. We only discuss the best selection of h from several of its fixed values, say, h1, . . . , hm.

Step 1. For any fixed hj, j = 1, . . . , m, apply the selection steps of Subsection 4.1 to get K̂_{hj}, c(K̂_{hj}) and Mq(c(K̂_{hj}), K̂_{hj}).

Step 2. Select the best slice number ĥ0 = arg min_{hj} {Mq(c(K̂_{hj}), K̂_{hj})}. For the selected ĥ0, apply the steps in Subsection 4.1 to get the corresponding K̂_{ĥ0} and c(K̂_{ĥ0}).
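A minimal sketch of the criterion Mq(c, k) described above is given below; it assumes a routine wve_directions(X, Y, c, k, t) returning a p × k basis (a placeholder name, not something defined in the paper; one possible rendering is sketched at the end of Section 5), and computes q from the eigenvalues of M^T N N^T M as in the definition above.

```python
import numpy as np

def q_corr(M, N):
    """Vector correlation q between the column spaces of M and N (both p x k)."""
    Qm, _ = np.linalg.qr(M)
    Qn, _ = np.linalg.qr(N)
    eigs = np.linalg.eigvalsh(Qm.T @ Qn @ Qn.T @ Qm)   # the rho_i^2 of the definition above
    return float(np.sqrt(np.prod(np.clip(eigs, 0.0, 1.0))))

def Mq(X, Y, c, k, t, wve_directions, B=400, seed=0):
    """Bootstrap variability Mq(c, k) = sum_b (1 - q^b)/B for a candidate (c, k)."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    S_hat = wve_directions(X, Y, c, k, t)              # estimate from the full sample
    total = 0.0
    for _ in range(B):
        idx = rng.integers(0, n, size=n)               # one bootstrap resample
        total += 1.0 - q_corr(S_hat, wve_directions(X[idx], Y[idx], c, k, t))
    return total / B
```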
5 Sample estimate and the consistency of the estimate
5.1 Sample estimate
Suppose that (Xi, Yi), i = 1, . . . , n, is an i.i.d. sample. Obtain (Zi, Yi) by standardizing Xi to Zi (i = 1, . . . , n).

(1) Slice the set R(Y) into h intervals denoted by I1, . . . , Ih. Let Ỹ = j denote that Y ∈ Ij and let Cj = {Zl : Yl ∈ Ij, l = 1, . . . , n}, j = 1, 2, . . . , h. Let Z^(j) denote the minimum norm estimate corresponding to Ij, j = 1, . . . , h.

(2) For j = 1, . . . , h, let Ẑ_n^(j) be the estimate of Z^(j), that is, Ẑ_n^(j) is one of the elements of Cj with ‖Ẑ_n^(j)‖² = min{‖Zl‖², l = 1, . . . , n, Zl ∈ Cj}. Let nj = Σ_{i=1}^{n} I_{Yi∈Ij}, for j = 1, 2, . . . , h, and let

ŵ(Zi | Yi = j, c) = [c/(1 − c)] K( [ (c/(1 − c)) ‖Zi − Ẑ_n^(j)‖ ]² ) I_{Yi∈Ij},

Ŵ(Ỹ = j, c) = (1/nj) Σ_{i=1}^{n} ŵ(Zi | Ỹ = j, c),

ĝ(Zi | Ỹ = j, c) = ŵ(Zi | Yi = j, c) / Ŵ(Ỹ = j, c),

Â(Ỹ = j, c) = (1/h) Σ_{j=1}^{h} [ (1/nj) Σ_{i=1}^{n} Zi Zi^T ĝ(Zi | Ỹ = j, c) ] − (1/nj) Σ_{i=1}^{n} Zi Zi^T ĝ(Zi | Ỹ = j, c)
        + (1/nj²) [ Σ_{i=1}^{n} Zi ĝ(Zi | Ỹ = j, c) ] [ Σ_{i=1}^{n} Zi^T ĝ(Zi | Ỹ = j, c) ],   j = 1, . . . , h,

where I_A is the indicator function of the set A.
(3) Compute the matrix

Ĥ1(c) = (1/h) Σ_{j=1}^{h} [Â(Ỹ = j, c)]².
(4) Make the eigenvalue decomposition of the matrix Ĥ1(c). The eigenvectors corresponding to the nonzero or larger eigenvalues are the estimated directions in S_{Y|Z}.

5.2 Consistency of the estimate
Note that in Subsection 5.1, nj denotes the number of observations in Ij (j = 1, . . . , h). The following theorems, proved in Section 9, describe the consistency of the estimates.

Theorem 5.1. Suppose that R(X), the support of X, contains all possible values of X and that the fourth moments of the components of X exist. If nj → +∞ (j = 1, . . . , h) as n → +∞, then, for j = 1, . . . , h, it follows that Ẑ_n^(j) → Z^(j), a.s.

Based on Theorem 5.1, we have the consistency of the estimate Ĥ1(c).

Theorem 5.2. Under the assumptions of Theorem 5.1, as n → +∞, we have Ĥ1(c) → H1(c), a.s.
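To make the sample estimate above concrete, here is a minimal Python sketch of steps (1)-(4) (our own rendering, with equal-count slicing on the ranks of Y as in Section 6; it is not the authors' implementation). Its signature matches the placeholder used in the Section 4 sketch.

```python
import numpy as np

def wve_directions(X, Y, c, k, t):
    """Steps (1)-(4) of Subsection 5.1: top-k eigenvectors of H1_hat(c), on the Z scale."""
    n, p = X.shape
    # standardize X to Z
    Sigma = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(Sigma)
    root_inv = vecs @ np.diag(vals ** -0.5) @ vecs.T
    Z = (X - X.mean(axis=0)) @ root_inv

    h = max(n // t, 1)                                   # number of slices, h = [n/t]
    slices = np.array_split(np.argsort(Y), h)            # roughly t observations per slice

    a_inv = c / (1.0 - c)
    seconds, outers = [], []
    for idx in slices:
        Zj = Z[idx]
        z0 = Zj[np.argmin(np.sum(Zj ** 2, axis=1))]      # minimum norm estimate for slice j
        w = np.exp(-(a_inv * np.linalg.norm(Zj - z0, axis=1)) ** 2)
        g = w / w.mean()                                 # g_hat; constant factors of w cancel
        seconds.append((Zj * g[:, None]).T @ Zj / len(idx))   # (1/nj) sum Zi Zi^T g
        mz = (Zj * g[:, None]).mean(axis=0)                   # (1/nj) sum Zi g
        outers.append(np.outer(mz, mz))
    overall = sum(seconds) / h                           # (1/h) sum_j of the slice means
    A = [overall - s + o for s, o in zip(seconds, outers)]
    H1 = sum(a @ a for a in A) / h                       # H1_hat(c) = (1/h) sum_j A_j^2
    _, evecs = np.linalg.eigh(H1)
    return evecs[:, -k:]                                 # directions estimating S_{Y|Z}
```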
6 Simulation results
In order to measure the performance of the different methods, we use the measure of distance between two subspaces of R^p introduced by Li et al. [7]. Let S1, S2 be two q-dimensional subspaces and P_{S1}, P_{S2} be the orthogonal projections onto S1, S2, respectively. The distance between S1 and S2 is defined as dist(S1, S2) = ‖P_{S1} − P_{S2}‖, where ‖·‖ is the Frobenius norm, i.e., for a matrix A = (a_ij)_{p×p}, ‖A‖ = (Σ_{i,j} a_ij²)^{1/2}. We compute the distance between the true CS and its estimate with the above definition.

In all examples, we slice R(Y) in the following way. Suppose that n is the sample size. Given t, take the number of slices h = [n/t] and let the number of observations be tj = t in slice Ij (j = 1, 2, . . . , h − 1) and th = n − t(h − 1) in slice Ih, where [a] denotes the largest integer equal to or less than a. In the following simulations, we consider the performance of WVE under four cases of slicing: I: t = 15, h = [n/15]; II: t = 30, h = [n/30]; III: t = 50, h = [n/50]; and IV: t = 100, h = [n/100]. For each case, we take the number of bootstrap resamples B = 400. For each example, we perform a simulation to compute Mq(c, k) and select K̂h and c(K̂h), and we compare the performance of existing methods with that of WVE under the selected K̂h, c(K̂h) with 200 replicas. For each method, we compute the mean and standard deviation of the 200 distances between the estimates and the true CS under the distance defined above (DIST and STD denote the mean and the standard deviation of the distance, respectively).

In the following simulations, we take the sample size n = 600. In Model 1, Z is normal and, in Model 2, Z has an elliptical distribution whose deviation from the normal distribution is small, so we take V = [0, 1] for Models 1 and 2. In Models 3 and 4, we take V = [0.7, 1], since Z deviates greatly from the normal distribution. We search for the best c with step length 0.05 in the set V. In the following tables, c(K̂h), K̂h and Mq(c(K̂h), K̂h) are selected in one simulation, and DIST and STD are computed, respectively, for SIR, SAVE, CP-SIR, CP-SAVE, dMAVE, KSAVE, GCR and for WVE with the selected c(K̂h) and K̂h from 200 replicas. In the following models, we suppose that X = (X1, . . . , Xp)^T with p = 6, Z = Var(X)^{−1/2}[X − E(X)], e ∼ N(0, 1) and
e ⊥ X. In addition, U[a, b], F_{a,b} and t_a denote, respectively, the uniform distribution on [a, b], the F distribution with parameters (a, b) and the t distribution with parameter a. Consider the following models.

Model 1. Y = Z1²/2 + 2Z2 + 0.4e, with X ∼ N(0, I6).

Model 2. Y = (Z1 + 0.5)² + Z2³ + 0.4e, with X ∼ Mt(8), an elliptical distribution called the multivariate t distribution.

Model 3. Y = 2Z1 + Z2³ e, where X has the following distributions: X1 ∼ Ga(5, 2), X2 ∼ t5, X3 ∼ X2² + 0.5ε, ε ∼ N(0, 1) and ε ⊥ X, X4 ∼ X1 + t4, X5 ∼ F10,5(1), X6 ∼ t5.

Model 4. Y = sign(Z1)Z2² + Z1 e, where X has the following distributions: X1 ∼ U[0, 5], X2 ∼ U[0, 4], X3 ∼ 0.5X1³ + ε1, X4 ∼ X2 + ε2, X5 ∼ N(0, 1), X6 ∼ t8, εi ∼ N(0, 1) and εi ⊥ X, i = 1, 2.

For all models, S_{Y|Z} = span{(1, 0, . . . , 0)^T, (0, 1, . . . , 0)^T}. Table 1 reports the simulation results of WVE.
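For reference, a small sketch (our own, not from the paper) of the subspace distance and the slicing rule used in the simulations; the matrix `truth` below encodes S_{Y|Z} = span{(1, 0, . . . , 0)^T, (0, 1, 0, . . . , 0)^T} of the models above.

```python
import numpy as np

def subspace_distance(B1, B2):
    """dist(S1, S2) = ||P_S1 - P_S2||_F for the spaces spanned by the columns of B1, B2."""
    def proj(B):
        Q, _ = np.linalg.qr(B)
        return Q @ Q.T
    return np.linalg.norm(proj(B1) - proj(B2), ord='fro')

# true CS of Models 1-4 in the Z scale: span{(1,0,...,0)^T, (0,1,0,...,0)^T}, p = 6
p = 6
truth = np.eye(p)[:, :2]
print(subspace_distance(truth, np.eye(p)[:, 1:3]))   # distance to a wrong 2-dim subspace

# slicing of Section 6: given t, use h = [n/t] slices with (roughly) t points each
n, t = 600, 30
print(n // t)                                         # 20 slices for case II
```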
Table 1  The selected K̂h, c(K̂h) for WVE and 200 replicas

Model    h    t    K̂h   c(K̂h)   Mq(c(K̂h), K̂h)   DIST     STD
  1     40   15     2    0.45       0.1195        0.3876   0.1518
        20   30     2    0.40       0.0594        0.3453   0.1207
        12   50     2    0.30       0.0350        0.2960   0.1158
         6  100     2    0.25       0.0286        0.2838   0.1117
  2     40   15     2    0.80       0.0942        0.4573   0.1967
        20   30     2    0.55       0.0358        0.3360   0.0978
        12   50     2    0.45       0.0258        0.2989   0.0957
         6  100     2    0.40       0.0216        0.2881   0.0822
  3     40   15     2    0.75       0.0429        0.3883   0.1122
        20   30     2    0.75       0.0332        0.3895   0.1205
        12   50     2    0.70       0.0542        0.4028   0.1120
         6  100     2    0.70       0.0784        0.5538   0.2699
  4     40   15     2    0.75       0.1035        0.5608   0.1393
        20   30     2    0.70       0.0734        0.4225   0.1128
        12   50     2    0.70       0.0582        0.3846   0.1209
         6  100     2    0.70       0.1096        0.5021   0.1896
Column 4 in Table 1 gives the dimensions selected for different h, and the last two columns are the results of 200 replicas of WVE with the selected parameters. From the values of Mq(c(K̂h), K̂h), it is easy to see that the best slice numbers ĥ0 are 6, 6, 20 and 12 for the four models, respectively, and for each model the performance of WVE corresponding to ĥ0 is nearly the best. Also, from Table 1, WVE is insensitive to h; the reason is that Z^(0)(y) is robust to the conditional distribution Z|Y. For example, in Model 2, where Z is heavy-tailed, the estimate of Cov(Z|Y) is very unstable for h = 40; consequently, the performance of SAVE is bad for h = 40 (see Table 2). In this case, putting large weights on the points around z^(0)(y), e.g. with c(K̂h) = 0.8 (see Table 1), still leads to a good estimate. The simulation results for SIR, SAVE, CP-SIR,
CP-SAVE are presented in Table 2.

Table 2  CP-SIR, CP-SAVE, SAVE and SIR with 200 replicas

                      CP-SAVE            CP-SIR             SAVE               SIR
Model    h    t    DIST     STD      DIST     STD      DIST     STD      DIST     STD
  1     40   15   0.9776   0.3434   1.2483   0.1961   0.4753   0.2059   1.2894   0.1654
        20   30   0.8835   0.3644   1.2510   0.1941   0.3865   0.1549   1.2667   0.1853
        12   50   0.8551   0.3344   1.2628   0.2045   0.3378   0.1239   1.2520   0.2055
         6  100   0.7596   0.3267   1.2667   0.1763   0.3124   0.1109   1.2657   0.1930
  2     40   15   0.6098   0.2664   0.5309   0.2212   1.2625   0.2889   0.6958   0.2479
        20   30   0.4339   0.1999   0.4629   0.1793   0.7526   0.3656   0.5967   0.2337
        12   50   0.3519   0.1329   0.4333   0.1506   0.5077   0.2522   0.5678   0.2077
         6  100   0.3499   0.1515   0.4802   0.1535   0.3890   0.1631   0.5664   0.1732
  3     40   15   1.4680   0.1402   1.3811   0.0640   1.6868   0.2264   1.3753   0.0623
        20   30   1.4445   0.1033   1.3955   0.0349   1.6225   0.2473   1.3865   0.0460
        12   50   1.4373   0.0920   1.4013   0.0274   1.6004   0.2496   1.3912   0.0382
         6  100   1.4449   0.1076   1.4061   0.0206   1.5418   0.2204   1.3914   0.0394
  4     40   15   1.2793   0.2196   1.2441   0.2254   1.4609   0.0497   1.3103   0.1724
        20   30   1.3073   0.1933   1.2613   0.2261   1.4321   0.0823   1.2891   0.2134
        12   50   1.3160   0.1790   1.2867   0.1712   1.4151   0.0906   1.3068   0.1648
         6  100   1.3276   0.1522   1.3206   0.1509   1.3936   0.1163   1.3296   0.1445
From Table 2, it is easy to see that in Model 1, where Z is normal, SAVE is better than CP-SAVE, and that in Model 2, where Z is heavy-tailed, CP-SAVE is better than SAVE. In Model 1, the performances of SIR and CP-SIR are nearly the same, both methods losing one direction, and in Model 2, CP-SIR is better than SIR. For Models 3 and 4, SIR, SAVE, CP-SIR and CP-SAVE all perform badly. From the comparison of Tables 1 and 2, it is easy to see that WVE is much better than SIR, SAVE, CP-SIR and CP-SAVE in both the elliptical and non-elliptical cases.

Table 3 presents the simulation results of KSAVE, GCR and dMAVE. For GCR, let ρ be the radius of the cylinder around the empirical direction Zi − Zj (i, j = 1, . . . , n, i ≠ j). According to [7], we take ρ = 1.5 for Model 1 and ρ = 1.3 for the other three models, and use the 10 percent of empirical directions with the smallest variance of Y to make the estimate.
Table 3  KSAVE, dMAVE and GCR for 200 replicas

           KSAVE              dMAVE              GCR
Model   DIST     STD      DIST     STD      DIST     STD
  1    0.4060   0.1772   0.1009   0.0264   0.7087   0.2835
  2    1.0444   0.3922   0.1142   0.0250   0.6577   0.1968
  3    1.4508   0.1624   1.4134   0.0037   0.6678   0.3486
  4    1.4350   0.0967   0.2154   0.0604   0.4809   0.0805
From the comparison of Tables 1–3, we can approximately order the performance of the different methods. For Model 1, the performance in descending order is dMAVE, WVE, SAVE/KSAVE, GCR, CP-SAVE, SIR/CP-SIR, where SAVE/KSAVE means that there is no strict order between the two
methods, their performances being similar. For Model 2, the order is dMAVE, WVE, CP-SAVE/CP-SIR, SIR/GCR/SAVE/KSAVE. For Model 3, where the nonlinearity is very strong, the order is WVE, GCR, CP-SIR/CP-SAVE/SIR/SAVE/KSAVE/dMAVE. And for Model 4, where the nonlinearity is weaker than in Model 3, the order is dMAVE, WVE/GCR, CP-SIR/CP-SAVE/SIR/SAVE/KSAVE. dMAVE is the best except for Model 3, where the nonlinearity is very strong. SAVE, SIR, CP-SAVE, CP-SIR and KSAVE all perform badly for Models 3 and 4, where Z deviates greatly from the normal or elliptical distribution. In Models 1, 2 and 4, WVE is the second best, and in Model 3, WVE is the best. Therefore, WVE is very robust to the distribution of Z and insensitive to the number of slices h. And from Table 3, GCR is also insensitive to the distribution of the regressor.

Table 4 shows the first two directions picked by different methods for Model 3 in one simulation, which clearly demonstrates the impact of linearity or nonlinearity among the regressors on the different methods. The results of SAVE and WVE are obtained under h = 20, and SVi, KSi, dMi, WVEi, GCRi, i = 1, 2, are the first two directions selected by SAVE, KSAVE, dMAVE, WVE and GCR, respectively. Results of other methods are omitted for conciseness.
Table 4  The first two directions selected by different methods for Model 3

    SAVE                KSAVE               dMAVE               WVE                 GCR
  SV1      SV2       KS1      KS2       dM1      dM2       WVE1     WVE2      GCR1     GCR2
–0.1064   0.7379    0.6565  –0.3063   –0.9508   0.9191    0.3494   0.9946   –0.0253   0.8134
 0.0421  –0.2998    0.1229   0.0021   –0.0103  –0.3539    0.9216   0.0199    0.9016   0.0875
–0.9884   0.6007   –0.7416   0.9495   –0.3080   0.0467   –0.1476  –0.0428   –0.3917   0.1251
–0.0741  –0.0148    0.0600   0.0566    0.0057   0.1314    0.0392   0.0549   –0.1670   0.0826
 0.0110  –0.0674    0.0175   0.0374    0.0252  –0.0094    0.0238  –0.0723    0.0227   0.0232
 0.0661   0.0022    0.0033   0.0032    0.0173   0.1021    0.0683   0.0185    0.0684   0.5548
From Table 4, it is clear that WVE and GCR are little affected by the nonlinearity between Z2 and Z3, whereas the directions picked by the other methods are greatly affected by Z3. Therefore, WVE is very robust to the distribution of the regressor.

7 Real data analysis
We illustrate the proposed estimate on the Australian athletes data (see [23, p. 98]). We use LBM (lean body mass) as the response. The predictors are Bfat, Hematocrit (Hc), Hemoglobin (Hg), Height (Ht), Ferr (plasma ferritin concentration), RCC (red cell count), Sex, SSF (sum of skin folds), WCC (white cell count), and Weight in kilograms. We consider the four cases t = 15, 30, 50, 100. In all four cases, the dimension estimated by both SAVE and WVE is the same, K̂h = 1. In Table 5, we present the selected dimension K̂h, c(K̂h) and Mq(c(K̂h), K̂h) obtained by WVE in the four cases. From Table 5, we have ĥ0 = 2. We make plots only to compare SAVE and WVE; the plots for the other methods are omitted for conciseness. Let SAVE1 and WVE1 denote the directions obtained by SAVE and by WVE, respectively. Figure 1 presents the plots of the response LBM against SAVE1 and WVE1 for the selected c(K̂h), respectively.
Table 5  The selected dimension and parameter

 h     t    K̂h   c(K̂h)   Mq(c(K̂h), K̂h)
 8    15     1    0.40       0.1697
 6    30     1    0.40       0.0871
 4    50     1    0.65       0.0556
 2   100     1    0.85         –
Figure 1  Plots of LBM against SAVE1 and WVE1
In Figure 1, panels (b) and (d) plot LBM against SAVE1, where the males (denoted by +) and females (denoted by o) are mixed. From (b) and (d) we may infer that the males and females are located in one plane and cannot be separated, especially in (b), where no clear pattern can be drawn. However, if we plot LBM against the first two directions of SAVE, SAVE1 and SAVE2, and rotate this 3-D plot, we can find that the males and females are in fact located in two planes nearly parallel to each other. This has been pointed out by Cook and Lee [9]. In Figure 1, panels (a) and (c) plot LBM against WVE1, where the two groups are clearly separated and the relation between the two groups is much sharper. The direction of WVE provides more information than that of SAVE.

8 Discussion
In this paper, a new approach to dimension reduction is proposed. The minimum norm estimate Z^(0)(y) in Section 2 plays a fundamental role; based on it we develop a new class of estimates, WVE, which includes SAVE as a special case. Compared with other methods, the most important merit of WVE is that its requirements are very weak and that it is very robust, which makes it work better than SAVE and other methods in general and complicated situations. The introduction of the weight function plays a very important role in decreasing the influence of
the points outside of the CS. It is valuable to consider different ways of putting weights on points to get better estimates. The bootstrap method (see [11]) plays a very important role in selecting the dimension and the parameter c, which makes it possible to get the best estimate in the family of WVE. As pointed out by Ye and Weiss [11], it is possible that a poor estimator could exhibit low variability, although high bias with low variability has so far not appeared in the examples we tried; in this case, the comparison of different methods may help us to eliminate the poor estimate. When the number of slices is large, the estimate may be unstable, and other resampling plans, such as the jackknife, may serve as good alternatives. It is also worth the effort to find other ways of selecting the parameter c.

9 Proofs
For the simplicity of the proof of Theorem 2.1, we introduce the following lemma first. Lemma 9.1.
For any Borel measurable set I0 ⊆ R(Y), it holds that

inf{‖z‖² | y ∈ I0} = inf{‖z‖² : z ∈ C̄0} = inf{‖z‖² : z ∈ C0},
inf{‖P_β z‖² : z ∈ C̄0} = inf{‖P_β z‖² : z ∈ C0}.
Proof.
By the definition of C0, we have

inf{‖z‖² | y ∈ I0} = inf{‖z‖² : z ∈ C0}.    (9.1)

Due to C0 ⊆ C̄0, we have

inf{‖z‖² : z ∈ C̄0} ≤ inf{‖z‖² : z ∈ C0}.    (9.2)

In addition, C̄0 is closed; therefore, for any z ∈ C̄0, there exists {zn, n = 1, 2, . . .} ⊆ C0 such that zn → z. Therefore, for any ε > 0, there exists N > 0 such that, as n > N,

‖z‖² ≥ ‖zn‖² − ε ≥ inf{‖z‖² : z ∈ C0} − ε = inf{‖z‖² | y ∈ I0} − ε.

Consequently, we have

inf{‖z‖² : z ∈ C̄0} ≥ inf{‖z‖² : z ∈ C0} − ε.    (9.3)

Let ε → 0. By (9.1), (9.2) and (9.3), we have inf{‖z‖² | y ∈ I0} = inf{‖z‖² : z ∈ C̄0} = inf{‖z‖² : z ∈ C0}. The second equation can be proved similarly. This completes the proof.

Proof of Theorem 2.1. By definition, C0 = {z : y = f(β^T z, ε) ∈ I0, with ε ∈ E}. If z ∈ C0, then there exists ε0 such that y = f(β^T z, ε0) ∈ I0. Let P_β z = z′ and A(z′) = {z′ + z″ : for all z″ ∈ S(β)^⊥ such that z′ + z″ ∈ R(Z)}, where R(Z) is the support set of Z. Then, for any z1 = z′ + z″ ∈ A(z′), we still have

f(β^T z1, ε0) = f(β^T(z′ + z″), ε0) = f(β^T z, ε0) ∈ I0.
Therefore, if z ∈ C0, then, for all z1 ∈ A(z′), we have z1 ∈ C0. That is,

A(z′) ⊆ C0.    (9.4)

By Assumption A, for any given z′, inf{‖z″‖ : z″, z′ + z″ ∈ A(z′)} = 0. That is,

z′ ∈ Ā(z′),    (9.5)

where Ā(z′) is the closure of A(z′). Furthermore, by (9.4) and (9.5), we have z′ ∈ C̄0. Consequently, by the definition of z′, we have

{P_β z : z ∈ C0} ⊆ C̄0.    (9.6)

Therefore, by (9.6), we have

inf{‖z‖² : z ∈ C̄0} ≤ inf{‖P_β z‖² : z ∈ C̄0}.    (9.7)

On the other hand, due to the fact that, for any z ∈ C0, ‖P_β z‖² ≤ ‖z‖², we have

inf{‖P_β z‖² : z ∈ C0} ≤ inf{‖z‖² : z ∈ C0}.    (9.8)

By (9.8) and Lemma 9.1, we have

inf{‖P_β z‖² : z ∈ C̄0} ≤ inf{‖z‖² : z ∈ C̄0}.    (9.9)

Therefore, by (9.7), (9.9) and Lemma 9.1, it follows that

inf{‖P_β z‖² : z ∈ C̄0} = inf{‖z‖² : z ∈ C̄0} = inf{‖z‖² | y ∈ I0}.    (9.10)

Due to C̄0 being closed and z^(0) ∈ C̄0, we have

‖P_β z^(0)‖² ≥ inf{‖P_β z‖² : z ∈ C̄0}.    (9.11)

Therefore, by the definition of z^(0), it follows that

inf{‖z‖² | y ∈ I0} = ‖z^(0)‖² = ‖P_β z^(0)‖² + ‖Q_β z^(0)‖² ≥ inf{‖P_β z‖² : z ∈ C̄0}.

Combining the above with (9.10), we have ‖z^(0)‖² = inf{‖P_β z‖² : z ∈ C̄0}. In addition, by (9.11), we have ‖Q_β z^(0)‖² = 0. That is, z^(0) ∈ S_{Y|Z}. This completes the proof.

Proof of Proposition 3.1. For any I0 ⊆ R(Y), under Assumption A, by Lemma 9.1, Theorem 2.1 and the definition of z^(0), we have inf{‖z‖² : z ∈ C0} = ‖z^(0)‖² = inf{‖P_β z‖² : z ∈ C0}. Therefore, for any z ∈ C0 satisfying ‖z‖² < ‖z^(0)‖² + d, we have

‖z^(0)‖² ≤ ‖z‖² = ‖P_β z‖² + ‖Q_β z‖² < ‖z^(0)‖² + d.

Since ‖P_β z‖² ≥ inf{‖P_β z‖² : z ∈ C0} = ‖z^(0)‖², it follows that ‖Q_β z‖ < d^{1/2}. This completes the proof.
Proof of Proposition 3.2. We first prove that lim_{c→0} E{Zg(Z|y, c)} = E(Z|y) and lim_{c→1} E{Zg(Z|y, c)} = E(Z^(0)(y)). Note that Var(Z) = I_p, that is, the variance of each component of Z is 1. In addition, note that K(u) < 1/√(2π); then, by the Lebesgue theorem, we have

lim_{c→0} E{Zg(Z|y, c)} = lim_{c→0} E{Zw(Z|y, c)/W(y, c)}
  = lim_{c→0} E{Z K([(c/(1 − c))‖Z − z^(0)(y)‖]²) I_{Y=y}} / E{K([(c/(1 − c))‖Z − z^(0)(y)‖]²) I_{Y=y}}
  = E(Z|y),

where w(z|y, c) = [c/(1 − c)] K([(c/(1 − c))‖z − z^(0)(y)‖]²) I_{Y=y}. This completes the proof of the first conclusion. On the other hand, note that

lim_{c→1} K([(c/(1 − c))‖Z − z^(0)(y)‖]²) I_{Y=y} = 1/√(2π) if Z = z^(0)(y), and 0 if Z ≠ z^(0)(y).

Therefore, by the Lebesgue theorem, we have

lim_{c→1} E{Zg(Z|y, c)} = lim_{c→1} E{Zw(Z|y, c)/W(y, c)}
  = lim_{c→1} E{Z K([(c/(1 − c))‖Z − z^(0)(y)‖]²) I_{Y=y}} / E{K([(c/(1 − c))‖Z − z^(0)(y)‖]²) I_{Y=y}}
  = E(Z^(0)(y)).

By a similar approach, we have

lim_{c→0} E_Z(ZZ^T g(Z|Y, c)) = E_Z(ZZ^T | Y),
lim_{c→1} E_Z(ZZ^T g(Z|Y, c)) = E[Z^(0)(Y) Z^(0)T(Y)].

Consequently, by the Lebesgue theorem,

lim_{c→0} E_Y[E_Z(ZZ^T g(Z|Y, c))] = E_Y(E_Z(ZZ^T | Y)) = I_p,
lim_{c→1} E_Y[E_Z(ZZ^T g(Z|Y, c))] = E_Y{E[Z^(0)(Y) Z^(0)T(Y)]}.

Therefore, for any y ∈ R(Y), we have

lim_{c→0} A(y, c) = I_p − Cov(Z|y),
lim_{c→1} A(y, c) = E_Y{E[Z^(0)(Y) Z^(0)T(Y)]} − E[Z^(0)(y) Z^(0)T(y)] + E[Z^(0)(y)] E[Z^(0)T(y)].

This completes the proof.

Proof of Theorem 3.1. First, we prove that if Z is normal, then, for any y ∈ R(Y) and any c ∈ [0, 1], S(A(y, c)) ⊆ S_{Y|Z}. We only need to show that, for any γ ∈ S^⊥_{Y|Z}, γ^T A(y, c) = 0, where

A(y, c) = E_Y[E_Z(ZZ^T g(Z|Y, c))] − E_Z(ZZ^T g(Z|y, c)) + E_Z(Zg(Z|y, c)) E_Z(Z^T g(Z|y, c)).
By the property of the normal distribution, it holds that γ^T Z ⊥ β^T Z. Therefore, by (1.1), we have

E_Z(γ^T Z Z^T γ g(Z|Y, c)) = E{E_Z[γ^T Z Z^T γ g(Z|Y, c) | β^T Z]}
  = E{E_Z(γ^T Z Z^T γ | β^T Z) E_Z(g(Z|Y, c) | β^T Z)} = I_{p−k} E_Z(g(Z|Y, c)) ≡ I_{p−k},

E(γ^T Z g(Z|y, c)) = E{E[γ^T Z g(Z|y, c) | β^T Z]} = E{E(γ^T Z | β^T Z) E[g(Z|y, c) | β^T Z]} = 0.

Therefore, we have γ^T A(y, c)γ = 0. Second, we show that, if only Assumption A holds, then S(A(y, 1)) ⊆ S_{Y|Z}. Note that A(y, 1) = lim_{c→1} A(y, c); by Proposition 3.2 and Theorem 2.1, the conclusion is obvious. This completes the proof.

Proof of Theorem 5.1. We only need to prove the conclusion for any fixed j (j = 1, . . . , h). Let C_j^min denote the set defined in Section 2 corresponding to Ij, j = 1, . . . , h. By the definition of Z^(j), for any z^(j) ∈ C_j^min, Z^(j) = z^(j), a constant depending only on j. Since Z = Σx^{−1/2}(X − E(X)) and Σx > 0, for any z^(j) ∈ C_j^min, there exists x^(j) = Σx^{1/2} z^(j) + E(X). Let C_{j,x}^min = {x^(j) : x^(j) = Σx^{1/2} z^(j) + E(X), for all z^(j) ∈ C_j^min}. If ‖Ẑ_n^(j)‖ > ‖Z^(j)‖ + δ0 for some δ0 > 0, then for any Zl ∈ Cj we have ‖Zl‖ > ‖Z^(j)‖ + δ0. Therefore, there exists δ1 > 0 such that, for any Zl ∈ Cj,

(Xl − X̄n)^T Σ̂x^{−1} (Xl − X̄n) > ‖Z^(j)‖² + δ1 = ‖z^(j)‖² + δ1.

On the other hand, since the fourth moments of the components of X exist, we have

X̄n → E(X) and Σ̂x^{−1} → Σx^{−1},    a.s.

Therefore, there exists N such that, as n > N, for any Zl ∈ Cj, we have

(Xl − E(X))^T Σx^{−1} (Xl − E(X)) > ‖z^(j)‖² + δ1 = (x^(j) − E(X))^T Σx^{−1} (x^(j) − E(X)) + δ1,  for all x^(j) ∈ C_{j,x}^min, a.s.

Since Σx^{−1} is a constant matrix, there must exist δ2 > 0 such that

‖Xl − x^(j)‖ > δ2,    for any x^(j) ∈ C_{j,x}^min.

Therefore, as n > N, for any x^(j) ∈ C_{j,x}^min, we have

P(‖Ẑ_n^(j)‖ > ‖Z^(j)‖ + δ0) < P(Zl^T Zl > ‖Z^(j)‖² + δ1, for all Zl ∈ Cj)
  < P(‖Xl − x^(j)‖ > δ2, for all Zl ∈ Cj) = ∏_{Yl∈Ij} P(‖Xl − x^(j)‖ > δ2).    (9.12)

The last equation is due to the fact that the (Xl − x^(j)) I_{Yl∈Ij} are i.i.d. variables, following the conditional distribution of (X − x^(j)) I_{Y∈Ij}. In addition, by the assumption that all the possible
values of X are contained in R(X), i.e., the support of X, x^(j) is also contained in R(X). Thus, by the continuity of X, for any δ2 > 0, there exists ε > 0 such that

P(‖Xl − x^(j)‖ > δ2) < 1 − ε.    (9.13)

Recall that nj → +∞ (j = 1, . . . , h) as n → +∞. Therefore, by (9.12) and (9.13), as n → ∞, it follows that

P(‖Ẑ_n^(j)‖ > ‖Z^(j)‖ + δ0) < ∏_{Yl∈Ij} P(‖Xl − x^(j)‖ > δ2) = (1 − ε)^{nj} → 0.

Furthermore, it follows that

Σ_n P(‖Ẑ_n^(j)‖ > ‖Z^(j)‖ + δ0) ≤ Σ_n (1 − ε)^{nj} < +∞.

Thus, by the Borel–Cantelli theorem, we have Ẑ_n^(j) → Z^(j), a.s. This completes the proof.
Proof of Theorem 5.2.
Since the fourth moments of the components of X exist, we have

X̄n → E(X) and Σ̂x^{−1} → Σx^{−1},    a.s.    (9.14)

Note that Zi = Σ̂x^{−1/2}(Xi − X̄n). Let ŵ1(Xi | Yi = j, c) = w(Σ̂x^{−1/2}(Xi − X̄n) | Yi = j, c) = ŵ(Zi | Yi = j, c). By Theorem 5.1 and (9.14), for any ε > 0, there exists N such that, as n > N,

|ŵ1(Xi | Yi = j, c) − w1(Xi | Yi = j, c)| < ε,    a.s. for any i with Zi ∈ Cj,    (9.15)

where

w1(Xi | Yi = j, c) = [c/(1 − c)] K( [ (c/(1 − c)) ‖Σx^{−1/2}(Xi − E(X)) − Σx^{−1/2} x^(j)‖ ]² ) I_{Yi∈Ij}.

Note that, for any Zi ∉ Cj, ŵ1(Xi | Yi = j, c) = w1(Xi | Yi = j, c) = 0. Therefore, by (9.15), for any ε > 0, there exists N such that, as n > N,

|ŵ1(Xi | Yi = j, c) − w1(Xi | Yi = j, c)| < ε,    a.s. for i = 1, . . . , n.    (9.16)

Furthermore, as n > N, by (9.16), we have

| (1/nj) Σ_{i=1}^{n} ŵ1(Xi | Yi = j, c) − (1/nj) Σ_{i=1}^{n} w1(Xi | Yi = j, c) | < ε,    a.s.    (9.17)

Since {w1(Xi | Yi = j, c), i = 1, . . . , n with Zi ∈ Cj} are i.i.d. variables, as n → +∞ we have

(1/nj) Σ_{i=1}^{n} w1(Xi | Yi = j, c) − E(w1(X | Y = j, c)) → 0,    a.s.    (9.18)

Consequently, by (9.17) and (9.18), as n → +∞, it follows that

(1/nj) Σ_{i=1}^{n} ŵ1(Xi | Yi = j, c) − E(w1(X | Y = j, c)) → 0,    a.s.    (9.19)
Note that ŵ1(Xi | Yi = j, c) = ŵ(Zi | Yi = j, c). By (9.19), as n → +∞, it follows that

Ŵ(Ỹ = j, c) − E(w1(X | Y = j, c)) → 0,    a.s.    (9.20)

Noting that w1(X | Y = j, c) = w(Z | Ỹ = j, c), we have W(Ỹ = j, c) = E(w(Z | Ỹ = j, c)) = E(w1(X | Y = j, c)). Therefore, by (9.20), as n → +∞, we have

Ŵ(Ỹ = j, c) − W(Ỹ = j, c) → 0,    a.s.    (9.21)

Note that w1(Xi | Yi = j, c) = w(Zi | Yi = j, c) and that g is continuous with respect to w and W. By (9.16) and (9.21), there exists N1 such that, as n > N1,

|ĝ(Zi | Ỹ = j, c) − g1(Xi | Ỹ = j, c)| < ε,    a.s. for i = 1, . . . , n,    (9.22)

where

g1(Xi | Ỹ = j, c) = g(Σx^{−1/2}(Xi − E(X)) | Ỹ = j, c) = w1(Xi | Yi = j, c) / W(Ỹ = j, c).

For a matrix A = (a_ij), let ‖A‖_∞ = max{a_ij}. Note that g1(X | Ỹ = j, c) = g(Z | Ỹ = j, c). By (9.14) and (9.22), for any fixed j and c, as n → +∞, we have

‖ (1/nj) Σ_{i=1}^{n} Zi Zi^T ĝ(Zi | Ỹ = j, c) − E(ZZ^T g(Z | Ỹ = j, c)) ‖_∞ → 0,
‖ (1/nj) Σ_{i=1}^{n} Zi ĝ(Zi | Ỹ = j, c) − E(Zg(Z | Ỹ = j, c)) ‖_∞ → 0.

Therefore, as n → +∞, we have

Â(Ỹ = j, c) → (1/h) Σ_{j=1}^{h} E(ZZ^T g(Z | Ỹ = j, c)) − E(ZZ^T g(Z | Ỹ = j, c))
        + E(Zg(Z | Ỹ = j, c)) E(Z^T g(Z | Ỹ = j, c)),    a.s.    (9.23)

Since Ỹ is discrete, the right side of (9.23) can be written as

E_Ỹ[E(ZZ^T g(Z | Ỹ, c))] − E(ZZ^T g(Z | Ỹ = j, c)) + E(Zg(Z | Ỹ = j, c)) E(Z^T g(Z | Ỹ = j, c)).

Therefore, by the definition of Ĥ1(c) and the law of large numbers, as n → +∞, we have Ĥ1(c) → H1(c), a.s. This completes the proof.

Acknowledgements  The authors thank the editor and two referees for providing useful comments and suggestions which led to a great improvement of the presentation of the paper.

References
1 Cook R D. Regression Graphics. New York: Wiley, 1998
2 Cook R D. On the interpretation of regression plots. J Amer Statist Assoc, 89: 177–189 (1994)
3 Cook R D. Graphics for regressions with a binary response. J Amer Statist Assoc, 91: 983–992 (1996)
4 Li K C. Sliced inverse regression for dimension reduction. J Amer Statist Assoc, 86: 316–342 (1991)
5 Cook R D, Weisberg S. Discussion of "Sliced inverse regression for dimension reduction" by K-C Li. J Amer Statist Assoc, 86: 328–332 (1991)
6 Li K C. On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma. J Amer Statist Assoc, 87: 1025–1039 (1992)
7 Li B, Zha H, Chiaromonte F. Contour regression: a general approach to dimension reduction. Ann Statist, 33: 1580–1616 (2005)
8 Xia Y, Tong H, Li W K, et al. An adaptive estimation of dimension reduction space. J Roy Statist Soc Ser B, 64: 363–410 (2002)
9 Cook R D, Lee H. Dimension reduction in binary response regression. J Amer Statist Assoc, 94: 1187–1200 (1999)
10 Cook R D, Nachtsheim C J. Reweighting to achieve elliptically contoured covariates in regression. J Amer Statist Assoc, 89: 592–599 (1994)
11 Ye Z, Weiss R E. Using the bootstrap to select one of a new class of dimension reduction methods. J Amer Statist Assoc, 98: 968–979 (2003)
12 Zhu L X, Ohtaki M, Li Y X. On hybrid methods of inverse regression-based algorithms. Comput Statist Data Anal, 51: 2621–2635 (2007)
13 Cook R D. SAVE: a method for dimension reduction and graphics in regression. Comm Statist Theory Methods, 29: 2109–2121 (2000)
14 Zhao J L, Xu X Z, Ma J J. Extending SAVE and PHD. Comm Statist Theory Methods, 36: 1591–1606 (2007)
15 Li Y X, Zhu L X. Asymptotics for sliced average variance estimation. Ann Statist, 35: 41–69 (2007)
16 Zhu L P, Zhu L X. On kernel method for sliced average variance estimation. J Multivariate Anal, 98: 970–991 (2007)
17 Wang H S, Ni L Q, Tsai C L. Improving dimension reduction via contour-projection. Statist Sinica, 18: 299–311 (2008)
18 Li K C. Nonlinear confounding in high-dimensional regression. Ann Statist, 25: 577–612 (1997)
19 Zhu Y, Zeng P. Fourier methods for estimating the central subspace and the central mean subspace in regression. J Amer Statist Assoc, 101: 1638–1651 (2006)
20 Xia Y. A constructive approach to the estimation of dimension reduction directions. Ann Statist, 35: 2654–2690 (2007)
21 Li B, Wang S. On directional regression for dimension reduction. J Amer Statist Assoc, 102: 997–1008 (2007)
22 Prendergast L A. Implications of influence function analysis for sliced inverse regression and sliced average variance estimation. Biometrika, 94: 585–601 (2007)