Science in China Series A: Mathematics, Mar. 2009, Vol. 52, No. 3, 539–560
www.scichina.com  math.scichina.com  www.springerlink.com
Dimension reduction based on weighted variance estimate

ZHAO JunLong¹† & XU XingZhong²

¹ Department of Mathematics, Beihang University; Laboratory of Mathematics, Information and Behavior of the Ministry of Education, Beijing 100083, China
² Department of Mathematics, Beijing Institute of Technology, Beijing 100081, China
(email: [email protected], [email protected])
Abstract  In this paper, we propose a new estimate for dimension reduction, called the weighted variance estimate (WVE), which includes the Sliced Average Variance Estimate (SAVE) as a special case. A bootstrap method is used to select the best estimate from the family of WVE and to estimate the structure dimension, and this selected estimate usually performs better than existing methods such as Sliced Inverse Regression (SIR) and SAVE. Many methods, such as SIR and SAVE, put the same weight on each observation when estimating the central subspace (CS). By introducing a weight function, WVE puts different weights on different observations according to their distance from the CS. The weight function gives WVE very good performance in general and complicated situations, for example, when the distribution of the regressor deviates severely from the elliptical distributions on which many methods, such as SIR, are based. Compared with many existing methods, WVE is insensitive to the distribution of the regressor. The consistency of WVE is established. Simulations comparing the performance of WVE with other existing methods confirm the advantage of WVE.
Keywords: central subspace, contour regression, sliced average variance estimate, sliced inverse regression, sufficient dimension reduction, weight function
MSC(2000): 62G08, 62H05

1 Introduction
In the regression setting, the goal is to infer about the conditional distribution of the univariate response variable Y given the p × 1 vector of predictors X. In this paper we consider sufficient dimension reduction, the basic idea of which is to seek a matrix β of dimension p × k (k < p) such that the p-dimensional predictor X can be replaced by β^T X without loss of information in predicting the response Y. In other words, the distribution of Y|X is the same as that of Y|β^T X. For the convenience of further discussion, we introduce the following notation used throughout. The notation U ⊥ V means that the random vectors U and V are independent. A subspace will be denoted by S, and S(B) means the subspace spanned by the columns of the matrix or vector B. P_B denotes the projection operator onto S(B) with respect to the usual inner product and Q_B = I − P_B. R(U) denotes the support set of the distribution of the variable U.

Received March 3, 2008; accepted June 2, 2008; published online December 5, 2008
DOI: 10.1007/s11425-008-0130-z
† Corresponding author
This work was supported by National Natural Science Foundation of China (Grant No. 10771015)
Following [1], a matrix β of dimension p × k can be defined by

Y ⊥ X | β^T X,    (1.1)

that is, given β^T X, Y is independent of X. S(β) is called a dimension reduction subspace (DRS). If the intersection of all DRS's is still a DRS, it is called the central dimension reduction subspace (briefly, CS), denoted by S_{Y|X} (see [2, 3]). The existence of the CS is assured under mild regularity conditions (see [2, 3]). Assume that Var(X) is nonsingular and let Z = [Var(X)]^{−1/2}(X − E(X)). Since S_{Y|X} = [Var(X)]^{−1/2} S_{Y|Z} (see [1, 2]), we need only estimate S_{Y|Z}. Several methods are available for estimating S_{Y|Z}, including Sliced Inverse Regression (SIR) (see [4]), SIR-II, Sliced Average Variance Estimation (SAVE) (see [5]), Principal Hessian Directions (PHD) (see [6]), Contour Regression (see [7]), the Minimum Average Variance Estimator (MAVE) (see [8]), etc. The effectiveness of many of these methods, such as SIR and SAVE, rests on one or both of the following two strict assumptions (see [9]):

C1. Linearity Condition: E(Z | β^T Z) = P_β Z;
C2. Constant Covariance Condition: Var(Z | β^T Z) ≡ Q_β, a constant matrix,

where the columns of the matrix β form a standardized orthogonal basis of S_{Y|Z}. The estimate matrices used by SIR and SAVE, respectively, are

M_SIR = Cov[E(Z|Y)],    M_SAVE = E[(I_p − Cov(Z|Y))²].

SIR needs only assumption C1, which approximately amounts to requiring Z to have an elliptical distribution. SAVE is based on both C1 and C2, which approximately amounts to requiring Z to have a normal distribution. Some methods may be used to achieve ellipticity, such as the Box-Cox transformation and re-weighting the data by Voronoi weights (see [10]). However, these methods need intensive computation, the selection of some of the parameters involved is empirical, and not all data can reach ellipticity by the re-weighting method (see [10]).

SIR is an effective method for the dimension reduction problem. However, when Y is symmetric with respect to Z, SIR fails to find any nonzero direction. To overcome the limits of SIR, Cook and Weisberg [5] developed the SAVE method. SAVE is more comprehensive than SIR (see [9, 11]); it has been proved that the directions found by SIR can also be found by SAVE (see [9, 11, 12]). However, there are several problems with SAVE in application. First, SAVE requires assumptions C1 and C2, or a normal distribution, which are very strong restrictions for application. Second, SAVE is sensitive to the distribution of X (see [13, 14]) and to the number of slices (see [15]). Several methods have been developed to improve the performance of SAVE, such as Kernel SAVE (KSAVE) (see [16]) and Contour Projection SAVE (CP-SAVE) (see [17]).

In this paper, a new family of estimates, called weighted variance estimates (briefly, WVE), is proposed, which includes SAVE as a special case. The most important point distinguishing WVE from SAVE and SIR is that different weights are assigned to different observations. WVE has several advantages. First, the weight function makes WVE work well under a much weaker assumption, removing both C1 and C2. Second, we can obtain the best estimate in the family of WVE, which usually performs better than SAVE, SIR, etc. Third, WVE is robust compared with many methods such as SAVE and SIR.

The main contents of this paper are arranged as follows. In Section 2, the minimum norm estimate is introduced, which is the basis of this
article. The weighted variance estimate is developed in Section 3. In Section 4, we discuss the selection of the best estimate from the family of WVE. In Section 5, consistency of the estimate is established. The simulation results and a real data analysis are presented in Sections 6 and 7. Some relevant problems are discussed in Section 8. Finally, the proofs are presented in Section 9.

2 The estimate with minimum norm
2.1 Minimum norm estimate
In this section, we introduce a new estimate, the minimum norm estimate (MNE), which is the basis of the weighted variance estimate in Section 3. Li et al. [7] proposed the General Contour Regression (GCR) method. Suppose that X and X̃ are independent identically distributed variables and that x and x̃ are the corresponding observations of the two variables, respectively. GCR views x − x̃ as an empirical direction and uses the variance of Y along the line crossing x and x̃ to decide whether x − x̃ is a direction in the CS. Let R(Z) be the support of Z. In a spirit similar to that of Contour Regression, every point z in R(Z) is a vector and may be viewed as a direction, and in this paper we directly select from R(Z) the points z^(0) that fall in S_{Y|Z} as estimates of directions in the CS. Different from [13], however, we use the norm to decide whether a point z falls in the CS. Since the estimate z^(0) can be described by a minimum norm, we call it the minimum norm estimate (MNE). MNE needs only a very weak assumption on Z and is very robust to the distribution of Z.

Of course, each point z in R(Z), viewed as a direction, is not necessarily contained in the CS. However, in many cases there indeed exist points in R(Z) that fall in the CS. For example, in the case where R(Z) = R^p, since the CS is always a subspace of R^p, it is obvious that in R(Z) there must exist points which are contained in the CS. Of course, the assumption R(Z) = R^p is only used to illustrate our idea rather than being a necessary condition for our method. The remaining problems are to find the situation or assumption under which there exist points of R(Z) falling in the CS, and to identify such points.

Let us consider the first problem. By (1.1), Y depends only on β^T Z; therefore, a given Y is equivalent to restricting the values of β^T Z to the corresponding specific set. At the same time, for any direction b ∈ S^⊥_{Y|Z}, b^T Z is still a variable. It is possible that b^T Z can reach 0. Given β^T Z = c (constant), if b^T Z can reach 0, then the point z* ∈ R(Z), satisfying β^T z* = c and b^T z* = 0, is a direction in the CS as desired. Therefore, we need the following assumption.

Assumption A. Suppose that β is a standardized orthogonal basis of S_{Y|Z} and that δ(0, ε) is a neighborhood of 0 with radius ε > 0. For any b ∈ S^⊥_{Y|Z} and any ε > 0, we have P(b^T Z ∈ δ(0, ε) | β^T Z) > 0.

C1 requires that E(b^T Z | β^T Z) = 0 for any b ∈ S(β)^⊥, whereas Assumption A only requires that the conditional distribution b^T Z | β^T Z can take values in δ(0, ε) with positive probability.

Remark 2.1. Assumption A is equivalent to saying that 0 is in the support of the distribution b^T Z | β^T Z. Therefore, if R(Z) is convex, Assumption A is much weaker than C1 or C2, which approximately mean that Z has an elliptical or normal distribution. It is easy to see that if the support of Z is the whole R^p, then Assumption A holds.

Remark 2.2. Due to Y ⊥ Z | β^T Z, Assumption A can be relaxed further to requiring Assumption A′: P(b^T Z ∈ δ(0, ε) | Y) > 0. In the regression setting where the coordinates of Z are
related, if the relation among the regressors is not too strong, Assumption A or A′ generally holds. If the linearity and nonlinearity among the regressors are very strong, the model may become unidentifiable. For example, consider Y = X1 + ε and X2 = X1² + aε1, with X = (X1, . . . , Xp) and ε, ε1 being independent normal variables. As a → 0, the CS will not exist. As pointed out by Li [18], it seems reasonable that both (1, 0, . . . , 0) and (0, 1, 0, . . . , 0) should be used to construct the model.

From the above argument, we remove the conditions C1 and C2, which are the fundamental assumptions of many methods. The next problem is to identify the points which fall in the CS. Consider the dimension reduction model (see [4])

Y = f(β^T Z, ε),    ε ⊥ Z,

where f is an unspecified function. Let E and R(Y) denote the supports of ε and Y, respectively. Let I0 be any Borel measurable set in R(Y). In particular, I0 can be a single-element set, for example, I0 = {y0} for any y0 ∈ R(Y). For any ε ∈ E, let

L(ε) = {z : y = f(β^T z, ε) ∈ I0, with ε fixed}

be the inverse image of the set {y : y ∈ I0} for the given ε, and let

C0 = {z | y ∈ I0} = {z : y = f(β^T z, ε) ∈ I0, ε ∈ E} = ∪_{ε∈E} L(ε)
be the inverse image of the set {y : y ∈ I0} with respect to z; that is, C0 is the collection of all possible z in R(Z) that make f(β^T z, ε) ∈ I0. Let a(I0) = inf{‖z‖² : z ∈ C0} and consider the points in the following set:

C0^min = {z^(0) : z^(0) ∈ C̄0 and ‖z^(0)‖² = a(I0)},    (2.1)
where C̄0 is the closure of C0. In fact, under Assumption A, for any z^(0) ∈ C0^min, we have z^(0) ∈ S_{Y|Z}. Let us take a simple example for illustration. Suppose that Y = Z1 + e, where Z = (Z1, Z2) ∼ N(0, I2) and e ∼ U(0, 1), the uniform distribution on [0, 1]. For this example, S_{Y|Z} = span{(1, 0)^T}. For any given y, let I0 = {y} and let y^{−1} = {z1 : z1 = y − e for some e ∈ [0, 1]} be the inverse image of y with respect to z1. Let C0 = {z = (z1, z2) : z ∈ R² and z1 ∈ y^{−1}} and consider the point z^(0) ∈ C̄0 satisfying ‖z^(0)‖ = inf{‖z‖ : z ∈ C̄0}. For example, given y = 2, we have C̄0 = {z = (z1, z2) : z1 = y − e ∈ [1, 2] and z ∈ R²}. For any z ∈ C̄0, since ‖z‖² = z1² + z2² = (2 − e)² + z2², it is easy to see that inf{‖z‖ : z ∈ C̄0} = 1 and z^(0) = (1, 0). Therefore z^(0) ∈ S_{Y|Z}. In general, we have the following conclusion; the proof is presented in Section 9.

Theorem 2.1. Suppose that Assumption A holds. Then for any z^(0) ∈ C0^min, it holds that z^(0) ∈ S_{Y|Z} and

‖z^(0)‖² = inf{‖P_β z‖² : z ∈ C̄0}.
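As a purely illustrative check (our own sketch, not part of the original argument), the following Python snippet simulates the toy model above and picks the minimum-norm observation in a thin slice of Y around y = 2; it should land near (1, 0).

```python
import numpy as np

# Numerical check of the toy example: Y = Z1 + e with Z ~ N(0, I2) and
# e ~ U(0, 1).  Within a thin slice of Y around y = 2, the observation of
# smallest Euclidean norm should lie close to z(0) = (1, 0), a direction
# in S_{Y|Z} = span{(1, 0)^T}.
rng = np.random.default_rng(0)
n = 200000
Z = rng.standard_normal((n, 2))
e = rng.uniform(0.0, 1.0, size=n)
Y = Z[:, 0] + e

in_slice = np.abs(Y - 2.0) < 0.01          # proxy for the set I0 = {2}
Z_slice = Z[in_slice]
z0 = Z_slice[np.argmin(np.sum(Z_slice ** 2, axis=1))]
print("minimum norm point in the slice:", z0)   # expected to be near (1, 0)
```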
Representing the above by variables, we obtain the minimum norm estimate (MNE) Z^(0)(I0), which takes values only in the set C0^min. Obviously, given I0, Z^(0)(I0) is a constant. However, it is possible that Z^(0)(I0) = 0. Therefore, we consider the possibility of Z^(0)(I0) being a nonzero vector. In fact, we can get a nonzero estimate in general cases. Consider the dimension reduction model (see [4]) Y = f(β^T Z, ε), where ε ⊥ Z and f is an unspecified function.

Case I. ε is bounded.

For any ρ > 0, let δ(0, ρ) denote the closed sphere centered at 0 with radius ρ in R^p. Let A = {z : z ∈ R(Z) ∩ δ(0, ρ)} and let Y|_{Z∈A} = f(β^T Z, ε)|_{Z∈A} = f_A(Z, ε) be the conditional model
of Y given Z ∈ A. Let a1 = inf_{Z∈A, ε∈E} f_A(Z, ε) and b1 = sup_{Z∈A, ε∈E} f_A(Z, ε). If R(Y)\[a1, b1] ≠ ∅, then for any y ∈ R(Y)\[a1, b1] and any z ∈ C0 with I0 = {y}, we have ‖z‖ ≥ ρ. Therefore, for any Borel measurable set I0 ⊆ R(Y)\[a1, b1] and any z ∈ C0, we still have ‖z‖ ≥ ρ. Consequently, by the definition of z^(0) in (2.1), it follows that ‖z^(0)‖ ≥ ρ. This, along with Theorem 2.1, leads to the following conclusion.

Proposition 2.1. Suppose that Assumption A holds and that R(Y)\[a1, b1] ≠ ∅. Let I0 be any Borel measurable set in R(Y)\[a1, b1]. Then for any z^(0) ∈ C0^min, it holds that z^(0) ∈ S_{Y|Z} with ‖z^(0)‖ ≥ ρ > 0.

Case II. ε is unbounded.

For any ρ > 0, let A = {Z : Z ∈ R(Z) ∩ δ(0, ρ)}, A^c = {Z : Z ∈ R(Z)\δ(0, ρ)}, and Y|_{Z∈A} = f(β^T Z, ε)|_{Z∈A} = f_A(Z, ε). In this case, f_A(Z, ε) may be unbounded, for example, for the local structure Y = f(β^T Z) + ε. However, the variable Y|_{Z∈A} is contained in a closed bounded interval with very high probability, that is, for any sufficiently small α > 0, there exists [a2, b2] with a2, b2 finite, such that

P(Y|_{Z∈A} ∈ [a2, b2]) > 1 − α.    (2.2)

Suppose that R(Y)\[a2, b2] ≠ ∅. As argued in Case I, if Y|_{Z∈A} ∈ [a2, b2], then for any y ∈ R(Y)\[a2, b2] and any z ∈ C0 with I0 = {y}, it holds that ‖z‖ > ρ, that is, ‖Z‖|y > ρ, where ‖Z‖|y is the conditional norm of Z given y. Therefore, for any y ∈ R(Y)\[a2, b2], the conditional probability

P(‖Z‖|y > ρ | Y|_{Z∈A} ∈ [a2, b2]) = 1.    (2.3)

Therefore, for any y ∈ R(Y)\[a2, b2], by (2.2) and (2.3), we have

P(‖Z‖|y > ρ) = P(Y|_{Z∈A} ∈ [a2, b2]) P(‖Z‖|y > ρ | Y|_{Z∈A} ∈ [a2, b2])
      + P(Y|_{Z∈A} ∈ R(Y)\[a2, b2]) P(‖Z‖|y > ρ | Y|_{Z∈A} ∈ R(Y)\[a2, b2])
   ≥ P(Y|_{Z∈A} ∈ [a2, b2]) > 1 − α.

That is, for any y ∈ R(Y)\[a2, b2],

P(Z ∈ A^c | y) = E(I_{Z∈A^c} | y) > 1 − α.    (2.4)
Recall the assumption R(Y)\[a2, b2] ≠ ∅. Then for any Borel measurable set I0 ⊆ R(Y)\[a2, b2], by (2.4), we have P(Z ∈ A^c | Y ∈ I0) = E(I_{Z∈A^c} | Y ∈ I0) > 1 − α, that is, P(‖Z‖ > ρ | Y ∈ I0) > 1 − α. Therefore, by Theorem 2.1 and the definition of Z^(0)(I0), we have the following conclusion.

Proposition 2.2. Suppose that Assumption A holds. For any small α ∈ (0, 1), if R(Y)\[a2, b2] ≠ ∅, then, for any Borel measurable set I0 ⊆ R(Y)\[a2, b2], it holds that Z^(0)(I0) ∈ S_{Y|Z} and

P(‖Z^(0)(I0)‖ > ρ | Y ∈ I0) > 1 − α.
Remark 2.3. From Propositions 2.1 and 2.2, we see that, for I0 ⊆ R(Y)\[ai, bi], i = 1, 2, Z^(0)(I0) can be a nonzero estimate with probability at least 1 − α. In the case of I0 ⊆ [ai, bi], i = 1, 2, we cannot guarantee that there exists ρ > 0 such that ‖Z^(0)(I0)‖ > ρ with very high probability.
However, this does not mean that Z^(0)(I0) must be zero when I0 ⊆ [a1, b1] or I0 ⊆ [a2, b2]; it is possible to draw nonzero directions in those cases as well. Therefore, the probability of Z^(0)(I0) being nonzero varies for different I0.

2.2 The robustness of the minimum norm estimate
For simplicity, take I0 = {y} for any fixed y ∈ R(Y) and compare the robustness of the MNE Z^(0)(y) with the estimators of other well-known methods. Note that Z^(0)(y) is defined by inf{‖z‖ : z ∈ C0}, where C0, according to the definition in Section 2, is the support set of the conditional distribution Z|Y = y. We only need to consider the robustness of T = inf{‖z‖ : z is in the support of Z|Y = y}. Suppose that F is the c.d.f. of Z and that R_F denotes the support of F. Define T(F) = inf{‖z‖² : z ∈ R_F}, which is a functional of F. Let

P_ε = {F : F = (1 − ε1) F_{z|y}(z) + ε1 G(z), 0 ≤ ε1 < ε},

where G(z) is from the class M = {all cumulative distributions G with R_G ⊆ R_{F_{z|y}}}, and F_{z|y}(z) denotes the c.d.f. of the conditional distribution Z|Y = y. For any F ∈ P_ε, by the definition of T(F), we have

T(F) = inf{T(F_{z|y}), T(G)} = T(F_{z|y}) if ε1 < 1, and T(G) if ε1 = 1.

Therefore, it is easy to see that the breakdown point is 1. Let δ_{z0} denote the degenerate c.d.f. at z0 and F_ε = (1 − ε) F_{z|y}(z) + ε δ_{z0}, 0 ≤ ε ≤ 1. In the case of 0 < ε < 1, it is easy to see that the functional

T(F_ε) = T(F_{z|y}) if ‖z0‖² ≥ T(F_{z|y}), and ‖z0‖² if ‖z0‖² < T(F_{z|y}).

And in the case of ε = 0 or 1, T(F_ε) ≡ T(F_{z|y}) or ‖z0‖², respectively. Therefore the influence function IC(z0, F_{z|y}, T) ≡ 0, a.e. Hence T(F), or equivalently the estimate Z^(0)(y), is a very robust estimate and is insensitive to outliers.

3 Dimension reduction with weighted variance estimate
In Section 2, we have shown that z^(0) ∈ S_{Y|Z} under Assumption A. In fact, other points in C0 still carry information about the CS. Consider the points z ∈ C0 satisfying ‖z‖² < ‖z^(0)‖² + d, for some d ≥ 0. If d > 0, it is possible that z is not contained in S_{Y|Z}. However, the following proposition shows that the distance by which z deviates from the true CS is less than d^{1/2}; that is, the points in C0 with smaller norm are closer to the true CS.

Proposition 3.1. Under Assumption A, for any z ∈ C0 with ‖z‖² ≤ ‖z^(0)‖² + d, it follows that

0 ≤ ‖Q_β z‖ ≤ d^{1/2}.    (3.1)
Specifically, as d → 0, we have Theorem 2.1. Theorem 2.1 and Proposition 3.1 tell us that the points with smaller norm are more important in estimating CS. There are two reasons. First, Proposition 3.1 actually provides a measure of the extent of a point z ∈ C0 deviating from true CS. Second, since MNE Z (0) (I0 ) is very robust and its assumption is much weaker (only Assumption A is needed), for the goal of obtaining
the estimate that can work in general cases, more attention should be paid to the points of smaller norm. Therefore, we should allocate larger weights to the points of smaller norm, rather than put the same weight on each point, which is the way of SIR, SAVE, etc.

Let I0 = {y} for any fixed y ∈ R(Y) and, for clarity, rewrite the corresponding z^(0), defined in Section 2, as z^(0)(y). For any z ∈ C0, take the kernel function to be the Gaussian function

w(z|y, a) = (1/a) · K( [ ‖z − z^(0)(y)‖ / a ]² ),    z ∈ C0,

where K(u) = (1/√(2π)) exp(−u) and a ∈ (0, +∞) is a tuning parameter. For the convenience of selecting a better a, let 1/a = c/(1 − c), c ∈ (0, 1). Therefore, the kernel function can be represented as

w(z|y, c) = [c/(1 − c)] · K( [ (c/(1 − c)) ‖z − z^(0)(y)‖ ]² ) I_{z∈C0}.

Representing the above by variables, we have

w(Z|y, c) = [c/(√(2π)(1 − c))] · exp( −[ (c/(1 − c)) ‖Z − Z^(0)(y)‖ ]² ) I_{Y=y},

where, by the definition of Z^(0)(y), Z^(0)(y) is a constant depending only on y. Let W(y, c) = E_Z(w(Z|y, c)), i.e., the expectation taken with respect to Z over the set C0. Put the weight g(z|y, c) = w(z|y, c)/W(y, c) on the point z ∈ C̄0. The weight function g(z|y, c) is a decreasing function of ‖z‖, that is, points of larger norm receive smaller weights. The largest weight c/[√(2π)(1 − c)W(y, c)] is put on the points with the smallest norm ‖z^(0)(y)‖. In addition, it is easy to see that c/[√(2π)(1 − c)W(y, c)] is an increasing function of c. Thus, as c increases, more and more weight is put on the points of minimum norm.

Remark 3.1. In this paper, we select the Gaussian function to construct the weight function. How to select the weight function is a problem worthy of further effort. Generally speaking, it seems reasonable that K(u) should be a positive decreasing function, so that points of smaller norm have larger weights. Note that the weight function w(z|y, c) is different from the one usually used in nonparametric estimation. Given y, w(z|y, c) depends only on ‖z‖, that is, points of the same norm are of equal importance.

Remark 3.2. Recently developed methods, such as the Fourier method (see [19]) and dOPG, dMAVE (see [20]), also use a weight function to improve the estimate. These methods mainly focus on estimating the gradient of some specified function, such as E(T(Y)|X) for some T(Y), and different weights are allocated to different gradient directions to reach a better estimate. For example, let d(x) be the gradient at x and f(x) be the density function of X. dOPG and dMAVE allocate large weights to d(x) if the density f(x) at x is large; in short, large weights are assigned to those d(x) whose estimates are more accurate. In this paper, every point in R(Z) can be viewed as an estimate of directions in the CS, and the weight function g(z|y, c) assigns larger weights to the points closer to the true CS. That is, more accurate estimates have larger weights. In this sense, this paper shares a spirit similar to those of [19] and [20]. In addition, dOPG, dMAVE, etc. make nearly no assumption on the distribution of X and can be used in the non-elliptical case, which is also the goal of this paper.
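The following minimal sketch (an illustration we add here, not the authors' code) computes a sample version of the weights g(z|y, c) for the observations of one slice, with z^(0)(y) approximated by the minimum-norm point of the slice.

```python
import numpy as np

def slice_weights(Z_slice, c):
    """Sample version of the weights g(z|y, c) for observations falling in one slice of Y.

    w(z|y, c) = c/(1-c) * K((c/(1-c) * ||z - z0||)^2), K(u) = exp(-u)/sqrt(2*pi),
    with z0 approximated by the minimum-norm point of the slice (a stand-in for z^(0)(y));
    g = w / W, where W is the slice average of w.
    """
    a_inv = c / (1.0 - c)
    z0 = Z_slice[np.argmin(np.sum(Z_slice ** 2, axis=1))]
    dist = np.linalg.norm(Z_slice - z0, axis=1)
    w = a_inv * np.exp(-(a_inv * dist) ** 2) / np.sqrt(2.0 * np.pi)
    return w / w.mean()

# toy check: the weight mass concentrates on small-norm points as c increases
rng = np.random.default_rng(1)
Z_slice = rng.standard_normal((50, 6))
for c in (0.3, 0.7, 0.9):
    g = slice_weights(Z_slice, c)
    print(c, round(g.max(), 2))
```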
SAVE, based on C1 and C2, is more comprehensive than SIR (see [9, 11]). We extend SAVE further and take our estimate, represented by variables, as

A(y, c) = E_Y{E_Z[ZZ^T g(Z|Y, c)]} − E_Z[ZZ^T g(Z|y, c)] + E_Z[Zg(Z|y, c)] E_Z[Z^T g(Z|y, c)].    (3.2)

In (3.2), E_Y{E_Z[ZZ^T g(Z|Y, c)]}, the expectation of E_Z[ZZ^T g(Z|Y, c)] with respect to Y, is a constant matrix for any fixed c. A(y, c) has two very important extreme cases.

Proposition 3.2.
By the definition in (3.2), we have

lim_{c→0} A(y, c) = I_p − Cov(Z|y),
lim_{c→1} A(y, c) = E_Y{E[Z^(0)(Y) Z^(0)T(Y)]} − E[Z^(0)(y) Z^(0)T(y)] + E[Z^(0)(y)] E[Z^(0)T(y)].    (3.3)
Define A(y, 0) = lim_{c→0} A(y, c) = I_p − Cov(Z|y), the matrix used by SAVE. In the case of c = 0, we put the same weight on each point, just as SAVE does. Define A(y, 1) = lim_{c→1} A(y, c); that is, as c = 1, only the points of minimum norm are used to estimate the central subspace. The following theorem shows that the parameter c provides us with more choices of estimates and makes A(y, c) an effective estimate in general situations.

Theorem 3.1. If Z has a normal distribution, then S(A(y, c)) ⊆ S_{Y|Z} for any y ∈ R(Y) and any c ∈ [0, 1]. If only Assumption A holds, then S(A(y, 1)) ⊆ S_{Y|Z}.

From Theorem 3.1, similar to SAVE, we use the matrix H(c) = E[A(Y, c)]² and make the eigenvalue decomposition of H(c). Since c ∈ [0, 1], we actually get a family of estimates, called the weighted variance estimate (WVE). From Theorem 3.1, it is easy to see the following conclusion.

Corollary 3.1. Let S(H(c)) denote the subspace spanned by H(c). If Z has a normal distribution, then S(H(c)) ⊆ S_{Y|Z} for any c ∈ [0, 1]. And under Assumption A, S(H(1)) ⊆ S_{Y|Z}.

Remark 3.3. Note that H(c) is continuous with respect to c. Under Assumption A, by Corollary 3.1, there must exist some c0 ∈ [0, 1] such that, for any c ∈ [c0, 1], the eigenvalues of H(c) which correspond to S^⊥_{Y|Z} are smaller than those corresponding to S_{Y|Z}. Therefore, if Z deviates from the normal distribution but Assumption A holds, then selecting c from a small neighborhood of 1 still leads to a good estimate.

Remark 3.4. Wang et al. [17] proposed CP-SAVE to improve the performance of SAVE in the case where the distribution of Z has heavy tails. They used the kernel matrix E[I_p/p − Cov(Z/‖Z‖ | Y)]². Their method is based on the assumption of ellipticity, and each point is also viewed as equally important in estimating the CS. Our simulations in Section 6 show that CP-SAVE is usually better than SAVE when Z has heavy tails, whereas in the case of Z being normal, CP-SAVE is not necessarily better than SAVE. Our simulations also show that the performance of WVE is usually much better than SAVE and CP-SAVE in both the elliptical and non-elliptical cases.

Remark 3.5. Some recently developed methods, such as dMAVE (see [20]) and GCR (see [7]), can recover the CS completely. In Corollary 3.1, we show that WVE is contained in the CS under some assumptions, but we do not guarantee that WVE can recover the CS completely. As pointed out by Li [21], SAVE
can recover the CS completely under very weak assumptions, and many applications and simulations have shown that SAVE can generally do so. It is reasonable to believe that WVE, which includes SAVE as a special case, can recover the CS completely in general situations. Our simulations confirm this conjecture; the theoretical research is under investigation.

For WVE, we have H(0) = E(I_p − Cov(Z|Y))², the matrix used by the SAVE method. In practice, similar to SIR and SAVE, we use the sliced version of H(c). Slice the range of Y into h slices I1, . . . , Ih and let Ỹ = j if Y ∈ Ij, j = 1, . . . , h. Therefore, we have the estimate matrix H1(c) = E[A(Ỹ, c)]². SAVE, which always puts the same weight 1 on each point, is actually the special case of H(c) with c = 0. However, it may be inappropriate to put the same weight on each point in all cases. For example, when C1 or C2 fails, SAVE may fail to find the directions correctly. Proper selection of c may lead to a better estimate than SAVE.

Remark 3.6. WVE differs from the kernel SAVE (KSAVE) of [16] mainly in the weight functions used. KSAVE makes a kernel estimate of M_SAVE, but does not consider the distance of a sample point from the CS. One of the referees proposed that we may avoid slicing by following the way of equation (2.5) of [16]. This may increase the number of z^(0)(y) and further improve the performance. However, it may also decrease the accuracy of the estimate of z^(0)(y). For example, if we adopt the kernel used in [16], for a fixed bandwidth the estimate of z^(0)(y) for some y may be poor, since the number of points in the neighborhood of y varies with y. Therefore, a proper selection of the bandwidth to make a compromise is important. This issue is under investigation.

It is well known that the mean is sensitive to outliers and is not robust. Therefore, E(Z|Y), E(ZZ^T|Y), and furthermore Cov(Z|Y), are not robust. Thus, SAVE is not a robust estimate, which is also true for moment-based methods such as SIR, PHD, etc. By Proposition 3.2, as c → 1, the weights on the points with large norm become sufficiently small, so the outliers are removed to some extent; hence proper selection of c can lead to a robust estimate. A complete study of the robustness of WVE may be conducted by referring to Prendergast [22]; it is one topic of our future work. In Section 4, we discuss the best selection of the parameter c.

4 The selection of the parameters c, h and the dimension
Ye and Weiss [11] proposed a bootstrap method to select the parameters and the dimension when estimating the CS. We adopt their main idea to solve our problem here; more details of this method can be found in [11]. For any two k-dimensional subspaces S(M_{p×k}) and S(N_{p×k}), define the vector correlation coefficient as the square root q of q²(M, N) = ∏_{i=1}^{k} ρ_i², where 1 ≥ ρ_1² ≥ ρ_2² ≥ ··· ≥ ρ_k² ≥ 0 are the eigenvalues of M^T N N^T M. Ye and Weiss [11] used 1 − q as an estimate of the distance between the two subspaces, a smaller value of 1 − q implying stronger correlation. Let (Xi, Yi), i = 1, 2, . . . , n, be i.i.d. observations and obtain (Xi^{*b}, Yi^{*b}), i = 1, . . . , n, b = 1, . . . , B, by resampling B times from (Xi, Yi), i = 1, 2, . . . , n.

4.1 Selecting c and the structure dimension with fixed h
For any fixed h, suppose that the dimension k of the CS is known and that c is fixed. Then we can get an estimate Ŝ_{Y|Z} of S_{Y|Z} from the data {(Xi, Yi), i = 1, 2, . . . , n} and bootstrap estimates Ŝ^b_{Y|Z} from the data {(Xi^{*b}, Yi^{*b}), i = 1, . . . , n}, b = 1, . . . , B. Let 1 − q^b, b = 1, . . . , B, be the distance
between Ŝ_{Y|Z} and Ŝ^b_{Y|Z}. Define Mq(c, k) = Σ_{b=1}^{B} (1 − q^b)/B as the variability of the estimate Ŝ_{Y|Z} with parameter c and dimension k. Let V = [0, 1] if Z is normal, and V = [1 − ε, 1] with ε > 0 sufficiently small if only Assumption A holds.

Step 1. For any given dimension of the CS, select the parameter c. Suppose that dim(S_{Y|Z}) = k_h. Let c(k_h) denote the best selection of c corresponding to the given k_h and h; then, according to [11], c(k_h) = arg min_{c∈V} {Mq(c, k_h)}.

Step 2. Select the dimension for fixed h. For any fixed c, compare all the dimensions k_h = 1, . . . , (p − 1). By [11], the choice of the dimension d̂_h(c) for the given c will be d̂_h(c) = arg min_{k_h} {Mq(c, k_h)}. Our selection of the dimension with fixed h is K̂_h = max_{c∈V} d̂_h(c).

Step 3. For the selected K̂_h, apply Step 1 to select the corresponding best parameter c(K̂_h).

4.2 Choosing the number of slices h
In WVE, the number of slices h plays the same role as in SIR and SAVE. Too large or too small an h is not a good choice. We only discuss the best selection of h from several of its fixed values, say, h1, . . . , hm.

Step 1. For any fixed hj, j = 1, . . . , m, apply the selection steps of Subsection 4.1 to get K̂_{hj}, c(K̂_{hj}) and Mq(c(K̂_{hj}), K̂_{hj}).

Step 2. Select the best slice number ĥ0 = arg min_{hj} {Mq(c(K̂_{hj}), K̂_{hj})}. For the selected ĥ0, apply the steps in Subsection 4.1 to get the corresponding K̂_{ĥ0} and c(K̂_{ĥ0}).
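A minimal sketch of the criterion Mq(c, k) described above is given below; it assumes a routine wve_directions(X, Y, c, k, t) returning a p × k basis (a placeholder name, not something defined in the paper; one possible rendering is sketched at the end of Section 5), and computes q from the eigenvalues of M^T N N^T M as in the definition above.

```python
import numpy as np

def q_corr(M, N):
    """Vector correlation q between the column spaces of M and N (both p x k)."""
    Qm, _ = np.linalg.qr(M)
    Qn, _ = np.linalg.qr(N)
    eigs = np.linalg.eigvalsh(Qm.T @ Qn @ Qn.T @ Qm)   # the rho_i^2 of the definition above
    return float(np.sqrt(np.prod(np.clip(eigs, 0.0, 1.0))))

def Mq(X, Y, c, k, t, wve_directions, B=400, seed=0):
    """Bootstrap variability Mq(c, k) = sum_b (1 - q^b)/B for a candidate (c, k)."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    S_hat = wve_directions(X, Y, c, k, t)              # estimate from the full sample
    total = 0.0
    for _ in range(B):
        idx = rng.integers(0, n, size=n)               # one bootstrap resample
        total += 1.0 - q_corr(S_hat, wve_directions(X[idx], Y[idx], c, k, t))
    return total / B
```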
5 Sample estimate and the consistency of the estimate
5.1 Sample estimate
Suppose that (Xi, Yi), i = 1, . . . , n, is an i.i.d. sample. Obtain (Zi, Yi) by standardizing Xi to Zi (i = 1, . . . , n).

(1) Slice the set R(Y) into h intervals denoted by I1, . . . , Ih. Let Ỹ = j denote that Y ∈ Ij and let Cj = {Zl : Yl ∈ Ij, l = 1, . . . , n}, j = 1, 2, . . . , h. Let Z^(j) denote the minimum norm estimate corresponding to Ij, j = 1, . . . , h.

(2) For j = 1, . . . , h, let Ẑ_n^(j) be the estimate of Z^(j), that is, Ẑ_n^(j) is one of the elements of Cj with ‖Ẑ_n^(j)‖² = min{‖Zl‖², l = 1, . . . , n, Zl ∈ Cj}. Let nj = Σ_{i=1}^{n} I_{Yi∈Ij}, for j = 1, 2, . . . , h, and let

ŵ(Zi | Yi = j, c) = [c/(1 − c)] K( [ (c/(1 − c)) ‖Zi − Ẑ_n^(j)‖ ]² ) I_{Yi∈Ij},

Ŵ(Ỹ = j, c) = (1/nj) Σ_{i=1}^{n} ŵ(Zi | Ỹ = j, c),

ĝ(Zi | Ỹ = j, c) = ŵ(Zi | Yi = j, c) / Ŵ(Ỹ = j, c),

Â(Ỹ = j, c) = (1/h) Σ_{j=1}^{h} [ (1/nj) Σ_{i=1}^{n} Zi Zi^T ĝ(Zi | Ỹ = j, c) ] − (1/nj) Σ_{i=1}^{n} Zi Zi^T ĝ(Zi | Ỹ = j, c)
        + (1/nj²) [ Σ_{i=1}^{n} Zi ĝ(Zi | Ỹ = j, c) ] [ Σ_{i=1}^{n} Zi^T ĝ(Zi | Ỹ = j, c) ],   j = 1, . . . , h,

where I_A is the indicator function of the set A.
(3) Compute the matrix

Ĥ1(c) = (1/h) Σ_{j=1}^{h} [Â(Ỹ = j, c)]².
(4) Make the eigenvalue decomposition of the matrix Ĥ1(c). The eigenvectors corresponding to the nonzero or larger eigenvalues are the estimated directions in S_{Y|Z}.

5.2 Consistency of the estimate
Note that in Subsection 5.1, nj denotes the number of observations in Ij (j = 1, . . . , h). The following theorems, proved in Section 9, describe the consistency of the estimates.

Theorem 5.1. Suppose that R(X), the support of X, contains all possible values of X and that the fourth moments of the components of X exist. If nj → +∞ (j = 1, . . . , h) as n → +∞, then, for j = 1, . . . , h, it follows that Ẑ_n^(j) → Z^(j), a.s.

Based on Theorem 5.1, we have the consistency of the estimate Ĥ1(c).

Theorem 5.2. Under the assumptions of Theorem 5.1, as n → +∞, we have Ĥ1(c) → H1(c), a.s.
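To make the sample estimate above concrete, here is a minimal Python sketch of steps (1)-(4) (our own rendering, with equal-count slicing on the ranks of Y as in Section 6; it is not the authors' implementation). Its signature matches the placeholder used in the Section 4 sketch.

```python
import numpy as np

def wve_directions(X, Y, c, k, t):
    """Steps (1)-(4) of Subsection 5.1: top-k eigenvectors of H1_hat(c), on the Z scale."""
    n, p = X.shape
    # standardize X to Z
    Sigma = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(Sigma)
    root_inv = vecs @ np.diag(vals ** -0.5) @ vecs.T
    Z = (X - X.mean(axis=0)) @ root_inv

    h = max(n // t, 1)                                   # number of slices, h = [n/t]
    slices = np.array_split(np.argsort(Y), h)            # roughly t observations per slice

    a_inv = c / (1.0 - c)
    seconds, outers = [], []
    for idx in slices:
        Zj = Z[idx]
        z0 = Zj[np.argmin(np.sum(Zj ** 2, axis=1))]      # minimum norm estimate for slice j
        w = np.exp(-(a_inv * np.linalg.norm(Zj - z0, axis=1)) ** 2)
        g = w / w.mean()                                 # g_hat; constant factors of w cancel
        seconds.append((Zj * g[:, None]).T @ Zj / len(idx))   # (1/nj) sum Zi Zi^T g
        mz = (Zj * g[:, None]).mean(axis=0)                   # (1/nj) sum Zi g
        outers.append(np.outer(mz, mz))
    overall = sum(seconds) / h                           # (1/h) sum_j of the slice means
    A = [overall - s + o for s, o in zip(seconds, outers)]
    H1 = sum(a @ a for a in A) / h                       # H1_hat(c) = (1/h) sum_j A_j^2
    _, evecs = np.linalg.eigh(H1)
    return evecs[:, -k:]                                 # directions estimating S_{Y|Z}
```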
6 Simulation results
In order to measure the performance of the different methods, we use the measure of distance between two subspaces of R^p introduced by Li et al. [7]. Let S1, S2 be two q-dimensional subspaces and P_{S1}, P_{S2} be the orthogonal projections onto S1, S2, respectively. The distance between S1 and S2 is defined as dist(S1, S2) = ‖P_{S1} − P_{S2}‖, where ‖·‖ is the Frobenius norm, i.e., for a matrix A = (a_ij)_{p×p}, ‖A‖ = (Σ_{i,j} a_ij²)^{1/2}. We compute the distance between the true CS and its estimate with the above definition.

In all examples, we slice R(Y) in the following way. Suppose that n is the sample size. Given t, take the number of slices h = [n/t] and let the number of observations be tj = t in slice Ij (j = 1, 2, . . . , h − 1) and th = n − t(h − 1) in slice Ih, where [a] denotes the largest integer equal to or less than a. In the following simulations, we consider the performance of WVE under four cases of slicing: I: t = 15, h = [n/15]; II: t = 30, h = [n/30]; III: t = 50, h = [n/50]; and IV: t = 100, h = [n/100]. For each case, we take the number of bootstrap resamples B = 400. For each example, we perform a simulation to compute Mq(c, k) and select K̂h and c(K̂h), and we compare the performance of existing methods with that of WVE under the selected K̂h, c(K̂h) with 200 replicas. For each method, we compute the mean and standard deviation of the 200 distances between the estimates and the true CS under the distance defined above (DIST and STD denote the mean and the standard deviation of the distance, respectively).

In the following simulations, we take the sample size n = 600. In Model 1, Z is normal and, in Model 2, Z has an elliptical distribution whose deviation from the normal distribution is small, so we take V = [0, 1] for Models 1 and 2. In Models 3 and 4, we take V = [0.7, 1], since Z deviates greatly from the normal distribution. We search for the best c with step length 0.05 in the set V. In the following tables, c(K̂h), K̂h and Mq(c(K̂h), K̂h) are selected in one simulation, and DIST and STD are computed, respectively, for SIR, SAVE, CP-SIR, CP-SAVE, dMAVE, KSAVE, GCR and for WVE with the selected c(K̂h) and K̂h from 200 replicas. In the following models, we suppose that X = (X1, . . . , Xp)^T with p = 6, Z = Var(X)^{−1/2}[X − E(X)], e ∼ N(0, 1) and
e ⊥ X. In addition, U[a, b], F_{a,b} and t_a denote, respectively, the uniform distribution on [a, b], the F distribution with parameters (a, b) and the t distribution with parameter a. Consider the following models.

Model 1. Y = Z1²/2 + 2Z2 + 0.4e, with X ∼ N(0, I6).

Model 2. Y = (Z1 + 0.5)² + Z2³ + 0.4e, with X ∼ Mt(8), an elliptical distribution called the multivariate t distribution.

Model 3. Y = 2Z1 + Z2³ e, where X has the following distributions: X1 ∼ Ga(5, 2), X2 ∼ t5, X3 ∼ X2² + 0.5ε, ε ∼ N(0, 1) and ε ⊥ X, X4 ∼ X1 + t4, X5 ∼ F10,5(1), X6 ∼ t5.

Model 4. Y = sign(Z1)Z2² + Z1 e, where X has the following distributions: X1 ∼ U[0, 5], X2 ∼ U[0, 4], X3 ∼ 0.5X1³ + ε1, X4 ∼ X2 + ε2, X5 ∼ N(0, 1), X6 ∼ t8, εi ∼ N(0, 1) and εi ⊥ X, i = 1, 2.

For all models, S_{Y|Z} = span{(1, 0, . . . , 0)^T, (0, 1, . . . , 0)^T}. Table 1 reports the simulation results of WVE.
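For reference, a small sketch (our own, not from the paper) of the subspace distance and the slicing rule used in the simulations; the matrix `truth` below encodes S_{Y|Z} = span{(1, 0, . . . , 0)^T, (0, 1, 0, . . . , 0)^T} of the models above.

```python
import numpy as np

def subspace_distance(B1, B2):
    """dist(S1, S2) = ||P_S1 - P_S2||_F for the spaces spanned by the columns of B1, B2."""
    def proj(B):
        Q, _ = np.linalg.qr(B)
        return Q @ Q.T
    return np.linalg.norm(proj(B1) - proj(B2), ord='fro')

# true CS of Models 1-4 in the Z scale: span{(1,0,...,0)^T, (0,1,0,...,0)^T}, p = 6
p = 6
truth = np.eye(p)[:, :2]
print(subspace_distance(truth, np.eye(p)[:, 1:3]))   # distance to a wrong 2-dim subspace

# slicing of Section 6: given t, use h = [n/t] slices with (roughly) t points each
n, t = 600, 30
print(n // t)                                         # 20 slices for case II
```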
Table 1  The selected K̂h, c(K̂h) for WVE and 200 replicas

Model    h    t    K̂h   c(K̂h)   Mq(c(K̂h), K̂h)   DIST     STD
  1     40   15     2    0.45       0.1195        0.3876   0.1518
        20   30     2    0.40       0.0594        0.3453   0.1207
        12   50     2    0.30       0.0350        0.2960   0.1158
         6  100     2    0.25       0.0286        0.2838   0.1117
  2     40   15     2    0.80       0.0942        0.4573   0.1967
        20   30     2    0.55       0.0358        0.3360   0.0978
        12   50     2    0.45       0.0258        0.2989   0.0957
         6  100     2    0.40       0.0216        0.2881   0.0822
  3     40   15     2    0.75       0.0429        0.3883   0.1122
        20   30     2    0.75       0.0332        0.3895   0.1205
        12   50     2    0.70       0.0542        0.4028   0.1120
         6  100     2    0.70       0.0784        0.5538   0.2699
  4     40   15     2    0.75       0.1035        0.5608   0.1393
        20   30     2    0.70       0.0734        0.4225   0.1128
        12   50     2    0.70       0.0582        0.3846   0.1209
         6  100     2    0.70       0.1096        0.5021   0.1896
Column 4 in Table 1 gives the dimensions selected for different h, and the last two columns are the results of 200 replicas of WVE with the selected parameters. From the values of Mq(c(K̂h), K̂h), it is easy to see that the best slice numbers ĥ0 are 6, 6, 20 and 12 for the four models, respectively, and for each model the performance of WVE corresponding to ĥ0 is nearly the best. Also, from Table 1, WVE is insensitive to h; the reason is that Z^(0)(y) is robust to the conditional distribution Z|Y. For example, in Model 2, where Z is heavy-tailed, the estimate of Cov(Z|Y) is very unstable for h = 40; consequently, the performance of SAVE is bad for h = 40 (see Table 2). In this case, putting large weights on the points around z^(0)(y), e.g. with c(K̂h) = 0.8 (see Table 1), still leads to a good estimate. The simulation results for SIR, SAVE, CP-SIR,
CP-SAVE are presented in Table 2.

Table 2  CP-SIR, CP-SAVE, SAVE and SIR with 200 replicas

                      CP-SAVE            CP-SIR             SAVE               SIR
Model    h    t    DIST     STD      DIST     STD      DIST     STD      DIST     STD
  1     40   15   0.9776   0.3434   1.2483   0.1961   0.4753   0.2059   1.2894   0.1654
        20   30   0.8835   0.3644   1.2510   0.1941   0.3865   0.1549   1.2667   0.1853
        12   50   0.8551   0.3344   1.2628   0.2045   0.3378   0.1239   1.2520   0.2055
         6  100   0.7596   0.3267   1.2667   0.1763   0.3124   0.1109   1.2657   0.1930
  2     40   15   0.6098   0.2664   0.5309   0.2212   1.2625   0.2889   0.6958   0.2479
        20   30   0.4339   0.1999   0.4629   0.1793   0.7526   0.3656   0.5967   0.2337
        12   50   0.3519   0.1329   0.4333   0.1506   0.5077   0.2522   0.5678   0.2077
         6  100   0.3499   0.1515   0.4802   0.1535   0.3890   0.1631   0.5664   0.1732
  3     40   15   1.4680   0.1402   1.3811   0.0640   1.6868   0.2264   1.3753   0.0623
        20   30   1.4445   0.1033   1.3955   0.0349   1.6225   0.2473   1.3865   0.0460
        12   50   1.4373   0.0920   1.4013   0.0274   1.6004   0.2496   1.3912   0.0382
         6  100   1.4449   0.1076   1.4061   0.0206   1.5418   0.2204   1.3914   0.0394
  4     40   15   1.2793   0.2196   1.2441   0.2254   1.4609   0.0497   1.3103   0.1724
        20   30   1.3073   0.1933   1.2613   0.2261   1.4321   0.0823   1.2891   0.2134
        12   50   1.3160   0.1790   1.2867   0.1712   1.4151   0.0906   1.3068   0.1648
         6  100   1.3276   0.1522   1.3206   0.1509   1.3936   0.1163   1.3296   0.1445
From Table 2, it is easy to see that in Model 1, where Z is normal, SAVE is better than CP-SAVE, and that in Model 2, where Z is heavy-tailed, CP-SAVE is better than SAVE. In Model 1, the performances of SIR and CP-SIR are nearly the same, both methods losing one direction, and in Model 2, CP-SIR is better than SIR. For Models 3 and 4, SIR, SAVE, CP-SIR and CP-SAVE all perform badly. From the comparison of Tables 1 and 2, it is easy to see that WVE is much better than SIR, SAVE, CP-SIR and CP-SAVE in both the elliptical and non-elliptical cases.

Table 3 presents the simulation results of KSAVE, GCR and dMAVE. For GCR, let ρ be the radius of the cylinder around the empirical direction Zi − Zj (i, j = 1, . . . , n, i ≠ j). According to [7], we take ρ = 1.5 for Model 1 and ρ = 1.3 for the other three models, and use the 10 percent of empirical directions with the smallest variance of Y to make the estimate.
Table 3  KSAVE, dMAVE and GCR for 200 replicas

           KSAVE              dMAVE              GCR
Model   DIST     STD      DIST     STD      DIST     STD
  1    0.4060   0.1772   0.1009   0.0264   0.7087   0.2835
  2    1.0444   0.3922   0.1142   0.0250   0.6577   0.1968
  3    1.4508   0.1624   1.4134   0.0037   0.6678   0.3486
  4    1.4350   0.0967   0.2154   0.0604   0.4809   0.0805
From the comparison of Tables 1–3, we can approximately order the performance of the different methods. For Model 1, the performance in descending order is dMAVE, WVE, SAVE/KSAVE, GCR, CP-SAVE, SIR/CP-SIR, where SAVE/KSAVE means that there is no strict order between the two
methods, their performances being similar. For Model 2, the order is dMAVE, WVE, CP-SAVE/CP-SIR, SIR/GCR/SAVE/KSAVE. For Model 3, where the nonlinearity is very strong, the order is WVE, GCR, CP-SIR/CP-SAVE/SIR/SAVE/KSAVE/dMAVE. And for Model 4, where the nonlinearity is weaker than in Model 3, the order is dMAVE, WVE/GCR, CP-SIR/CP-SAVE/SIR/SAVE/KSAVE. dMAVE is the best except for Model 3, where the nonlinearity is very strong. SAVE, SIR, CP-SAVE, CP-SIR and KSAVE all perform badly for Models 3 and 4, where Z deviates greatly from the normal or elliptical distribution. In Models 1, 2 and 4, WVE is the second best, and in Model 3, WVE is the best. Therefore, WVE is very robust to the distribution of Z and insensitive to the number of slices h. And from Table 3, GCR is also insensitive to the distribution of the regressor.

Table 4 shows the first two directions picked by different methods for Model 3 in one simulation, which clearly demonstrates the impact of linearity or nonlinearity among the regressors on the different methods. The results of SAVE and WVE are obtained under h = 20, and SVi, KSi, dMi, WVEi, GCRi, i = 1, 2, are the first two directions selected by SAVE, KSAVE, dMAVE, WVE and GCR, respectively. Results of other methods are omitted for conciseness.
Table 4  The first two directions selected by different methods for Model 3

    SAVE                KSAVE               dMAVE               WVE                 GCR
  SV1      SV2       KS1      KS2       dM1      dM2       WVE1     WVE2      GCR1     GCR2
–0.1064   0.7379    0.6565  –0.3063   –0.9508   0.9191    0.3494   0.9946   –0.0253   0.8134
 0.0421  –0.2998    0.1229   0.0021   –0.0103  –0.3539    0.9216   0.0199    0.9016   0.0875
–0.9884   0.6007   –0.7416   0.9495   –0.3080   0.0467   –0.1476  –0.0428   –0.3917   0.1251
–0.0741  –0.0148    0.0600   0.0566    0.0057   0.1314    0.0392   0.0549   –0.1670   0.0826
 0.0110  –0.0674    0.0175   0.0374    0.0252  –0.0094    0.0238  –0.0723    0.0227   0.0232
 0.0661   0.0022    0.0033   0.0032    0.0173   0.1021    0.0683   0.0185    0.0684   0.5548
From Table 4, it is clear that WVE and GCR are little affected by the nonlinearity between Z2 and Z3, whereas the directions picked by the other methods are greatly affected by Z3. Therefore, WVE is very robust to the distribution of the regressor.

7 Real data analysis
We illustrate the proposed estimate on the Australian athletes data (see [23, p. 98]). We use LBM (lean body mass) as the response. The predictors are Bfat, Hematocrit (Hc), Hemoglobin (Hg), Height (Ht), Ferr (plasma ferritin concentration), RCC (red cell count), Sex, SSF (sum of skin folds), WCC (white cell count), and Weight in kilograms. We consider the four cases t = 15, 30, 50, 100. In all four cases, the dimension estimated by both SAVE and WVE is the same, K̂h = 1. In Table 5, we present the selected dimension K̂h, c(K̂h) and Mq(c(K̂h), K̂h) obtained by WVE in the four cases. From Table 5, we have ĥ0 = 2. We make plots only to compare SAVE and WVE; the plots for the other methods are omitted for conciseness. Let SAVE1 and WVE1 denote the directions obtained by SAVE and by WVE, respectively. Figure 1 presents the plots of the response LBM against SAVE1 and WVE1 for the selected c(K̂h), respectively.
Table 5  The selected dimension and parameter

 h     t    K̂h   c(K̂h)   Mq(c(K̂h), K̂h)
 8    15     1    0.40       0.1697
 6    30     1    0.40       0.0871
 4    50     1    0.65       0.0556
 2   100     1    0.85         –
Figure 1  Plots of LBM against SAVE1 and WVE1
In Figure 1, panels (b) and (d) plot LBM against SAVE1, where the males (denoted by +) and females (denoted by o) are mixed. From (b) and (d) we may infer that the males and females are located in one plane and cannot be separated, especially in (b), where no clear pattern can be drawn. However, if we plot LBM against the first two directions of SAVE, SAVE1 and SAVE2, and rotate this 3-D plot, we can find that the males and females are in fact located in two planes nearly parallel to each other. This has been pointed out by Cook and Lee [9]. In Figure 1, panels (a) and (c) plot LBM against WVE1, where the two groups are clearly separated and the relation between the two groups is much sharper. The direction of WVE provides more information than that of SAVE.

8 Discussion
In this paper, a new approach to dimension reduction is proposed. The minimum norm estimate Z^(0)(y) in Section 2 plays a fundamental role; based on it we develop a new class of estimates, WVE, which includes SAVE as a special case. Compared with other methods, the most important merit of WVE is that its requirements are very weak and that it is very robust, which makes it work better than SAVE and other methods in general and complicated situations. The introduction of the weight function plays a very important role in decreasing the influence of
the points outside of the CS. It is valuable to consider different ways of putting weights on points to get better estimates. The bootstrap method (see [11]) plays a very important role in selecting the dimension and the parameter c, which makes it possible to get the best estimate in the family of WVE. As pointed out by Ye and Weiss [11], it is possible that a poor estimator could exhibit low variability, although high bias with low variability has so far not appeared in the examples we tried; in this case, the comparison of different methods may help us to eliminate the poor estimate. When the number of slices is large, the estimate may be unstable, and other resampling plans, such as the jackknife, may serve as good alternatives. It is also worth the effort to find other ways of selecting the parameter c.

9 Proofs
For the simplicity of the proof of Theorem 2.1, we introduce the following lemma first. Lemma 9.1.
For any Borel measurable set I0 ⊆ R(Y), it holds that

inf{‖z‖² | y ∈ I0} = inf{‖z‖² : z ∈ C̄0} = inf{‖z‖² : z ∈ C0},
inf{‖P_β z‖² : z ∈ C̄0} = inf{‖P_β z‖² : z ∈ C0}.
Proof.
By the definition of C0, we have

inf{‖z‖² | y ∈ I0} = inf{‖z‖² : z ∈ C0}.    (9.1)

Due to C0 ⊆ C̄0, we have

inf{‖z‖² : z ∈ C̄0} ≤ inf{‖z‖² : z ∈ C0}.    (9.2)

In addition, C̄0 is closed; therefore, for any z ∈ C̄0, there exists {zn, n = 1, 2, . . .} ⊆ C0 such that zn → z. Therefore, for any ε > 0, there exists N > 0 such that, as n > N,

‖z‖² ≥ ‖zn‖² − ε ≥ inf{‖z‖² : z ∈ C0} − ε = inf{‖z‖² | y ∈ I0} − ε.

Consequently, we have

inf{‖z‖² : z ∈ C̄0} ≥ inf{‖z‖² : z ∈ C0} − ε.    (9.3)

Let ε → 0. By (9.1), (9.2) and (9.3), we have inf{‖z‖² | y ∈ I0} = inf{‖z‖² : z ∈ C̄0} = inf{‖z‖² : z ∈ C0}. The second equation can be proved similarly. This completes the proof.

Proof of Theorem 2.1. By definition, C0 = {z : y = f(β^T z, ε) ∈ I0, with ε ∈ E}. If z ∈ C0, then there exists ε0 such that y = f(β^T z, ε0) ∈ I0. Let P_β z = z′ and A(z′) = {z′ + z″ : for all z″ ∈ S(β)^⊥ such that z′ + z″ ∈ R(Z)}, where R(Z) is the support set of Z. Then, for any z1 = z′ + z″ ∈ A(z′), we still have

f(β^T z1, ε0) = f(β^T(z′ + z″), ε0) = f(β^T z, ε0) ∈ I0.
Therefore, if z ∈ C0, then, for all z1 ∈ A(z′), we have z1 ∈ C0. That is,

A(z′) ⊆ C0.    (9.4)

By Assumption A, for any given z′, inf{‖z″‖ : z″, z′ + z″ ∈ A(z′)} = 0. That is,

z′ ∈ Ā(z′),    (9.5)

where Ā(z′) is the closure of A(z′). Furthermore, by (9.4) and (9.5), we have z′ ∈ C̄0. Consequently, by the definition of z′, we have

{P_β z : z ∈ C0} ⊆ C̄0.    (9.6)

Therefore, by (9.6), we have

inf{‖z‖² : z ∈ C̄0} ≤ inf{‖P_β z‖² : z ∈ C̄0}.    (9.7)

On the other hand, due to the fact that, for any z ∈ C0, ‖P_β z‖² ≤ ‖z‖², we have

inf{‖P_β z‖² : z ∈ C0} ≤ inf{‖z‖² : z ∈ C0}.    (9.8)

By (9.8) and Lemma 9.1, we have

inf{‖P_β z‖² : z ∈ C̄0} ≤ inf{‖z‖² : z ∈ C̄0}.    (9.9)

Therefore, by (9.7), (9.9) and Lemma 9.1, it follows that

inf{‖P_β z‖² : z ∈ C̄0} = inf{‖z‖² : z ∈ C̄0} = inf{‖z‖² | y ∈ I0}.    (9.10)

Due to C̄0 being closed and z^(0) ∈ C̄0, we have

‖P_β z^(0)‖² ≥ inf{‖P_β z‖² : z ∈ C̄0}.    (9.11)

Therefore, by the definition of z^(0), it follows that

inf{‖z‖² | y ∈ I0} = ‖z^(0)‖² = ‖P_β z^(0)‖² + ‖Q_β z^(0)‖² ≥ inf{‖P_β z‖² : z ∈ C̄0}.

Combining the above with (9.10), we have ‖z^(0)‖² = inf{‖P_β z‖² : z ∈ C̄0}. In addition, by (9.11), we have ‖Q_β z^(0)‖² = 0. That is, z^(0) ∈ S_{Y|Z}. This completes the proof.

Proof of Proposition 3.1. For any I0 ⊆ R(Y), under Assumption A, by Lemma 9.1, Theorem 2.1 and the definition of z^(0), we have inf{‖z‖² : z ∈ C0} = ‖z^(0)‖² = inf{‖P_β z‖² : z ∈ C0}. Therefore, for any z ∈ C0 satisfying ‖z‖² < ‖z^(0)‖² + d, we have

‖z^(0)‖² ≤ ‖z‖² = ‖P_β z‖² + ‖Q_β z‖² < ‖z^(0)‖² + d.

Since ‖P_β z‖² ≥ inf{‖P_β z‖² : z ∈ C0} = ‖z^(0)‖², it follows that ‖Q_β z‖ < d^{1/2}. This completes the proof.
Proof of Proposition 3.2. We first prove that lim_{c→0} E{Zg(Z|y, c)} = E(Z|y) and lim_{c→1} E{Zg(Z|y, c)} = E(Z^(0)(y)). Note that Var(Z) = I_p, that is, the variance of each component of Z is 1. In addition, note that K(u) < 1/√(2π); then, by the Lebesgue theorem, we have

lim_{c→0} E{Zg(Z|y, c)} = lim_{c→0} E{Zw(Z|y, c)/W(y, c)}
  = lim_{c→0} E{Z K([(c/(1 − c))‖Z − z^(0)(y)‖]²) I_{Y=y}} / E{K([(c/(1 − c))‖Z − z^(0)(y)‖]²) I_{Y=y}}
  = E(Z|y),

where w(z|y, c) = [c/(1 − c)] K([(c/(1 − c))‖z − z^(0)(y)‖]²) I_{Y=y}. This completes the proof of the first conclusion. On the other hand, note that

lim_{c→1} K([(c/(1 − c))‖Z − z^(0)(y)‖]²) I_{Y=y} = 1/√(2π) if Z = z^(0)(y), and 0 if Z ≠ z^(0)(y).

Therefore, by the Lebesgue theorem, we have

lim_{c→1} E{Zg(Z|y, c)} = lim_{c→1} E{Zw(Z|y, c)/W(y, c)}
  = lim_{c→1} E{Z K([(c/(1 − c))‖Z − z^(0)(y)‖]²) I_{Y=y}} / E{K([(c/(1 − c))‖Z − z^(0)(y)‖]²) I_{Y=y}}
  = E(Z^(0)(y)).

By a similar approach, we have

lim_{c→0} E_Z(ZZ^T g(Z|Y, c)) = E_Z(ZZ^T | Y),
lim_{c→1} E_Z(ZZ^T g(Z|Y, c)) = E[Z^(0)(Y) Z^(0)T(Y)].

Consequently, by the Lebesgue theorem,

lim_{c→0} E_Y[E_Z(ZZ^T g(Z|Y, c))] = E_Y(E_Z(ZZ^T | Y)) = I_p,
lim_{c→1} E_Y[E_Z(ZZ^T g(Z|Y, c))] = E_Y{E[Z^(0)(Y) Z^(0)T(Y)]}.

Therefore, for any y ∈ R(Y), we have

lim_{c→0} A(y, c) = I_p − Cov(Z|y),
lim_{c→1} A(y, c) = E_Y{E[Z^(0)(Y) Z^(0)T(Y)]} − E[Z^(0)(y) Z^(0)T(y)] + E[Z^(0)(y)] E[Z^(0)T(y)].

This completes the proof.

Proof of Theorem 3.1. First, we prove that if Z is normal, then, for any y ∈ R(Y) and any c ∈ [0, 1], S(A(y, c)) ⊆ S_{Y|Z}. We only need to show that, for any γ ∈ S^⊥_{Y|Z}, γ^T A(y, c) = 0, where

A(y, c) = E_Y[E_Z(ZZ^T g(Z|Y, c))] − E_Z(ZZ^T g(Z|y, c)) + E_Z(Zg(Z|y, c)) E_Z(Z^T g(Z|y, c)).
By the property of the normal distribution, it holds that γ^T Z ⊥ β^T Z. Therefore, by (1.1), we have

E_Z(γ^T Z Z^T γ g(Z|Y, c)) = E{E_Z[γ^T Z Z^T γ g(Z|Y, c) | β^T Z]}
  = E{E_Z(γ^T Z Z^T γ | β^T Z) E_Z(g(Z|Y, c) | β^T Z)} = I_{p−k} E_Z(g(Z|Y, c)) ≡ I_{p−k},

E(γ^T Z g(Z|y, c)) = E{E[γ^T Z g(Z|y, c) | β^T Z]} = E{E(γ^T Z | β^T Z) E[g(Z|y, c) | β^T Z]} = 0.

Therefore, we have γ^T A(y, c)γ = 0. Second, we show that, if only Assumption A holds, then S(A(y, 1)) ⊆ S_{Y|Z}. Note that A(y, 1) = lim_{c→1} A(y, c); by Proposition 3.2 and Theorem 2.1, the conclusion is obvious. This completes the proof.

Proof of Theorem 5.1. We only need to prove the conclusion for any fixed j (j = 1, . . . , h). Let C_j^min denote the set defined in Section 2 corresponding to Ij, j = 1, . . . , h. By the definition of Z^(j), for any z^(j) ∈ C_j^min, Z^(j) = z^(j), a constant depending only on j. Since Z = Σx^{−1/2}(X − E(X)) and Σx > 0, for any z^(j) ∈ C_j^min, there exists x^(j) = Σx^{1/2} z^(j) + E(X). Let C_{j,x}^min = {x^(j) : x^(j) = Σx^{1/2} z^(j) + E(X), for all z^(j) ∈ C_j^min}. If ‖Ẑ_n^(j)‖ > ‖Z^(j)‖ + δ0 for some δ0 > 0, then for any Zl ∈ Cj we have ‖Zl‖ > ‖Z^(j)‖ + δ0. Therefore, there exists δ1 > 0 such that, for any Zl ∈ Cj,

(Xl − X̄n)^T Σ̂x^{−1} (Xl − X̄n) > ‖Z^(j)‖² + δ1 = ‖z^(j)‖² + δ1.

On the other hand, since the fourth moments of the components of X exist, we have

X̄n → E(X) and Σ̂x^{−1} → Σx^{−1},    a.s.

Therefore, there exists N such that, as n > N, for any Zl ∈ Cj, we have

(Xl − E(X))^T Σx^{−1} (Xl − E(X)) > ‖z^(j)‖² + δ1 = (x^(j) − E(X))^T Σx^{−1} (x^(j) − E(X)) + δ1,  for all x^(j) ∈ C_{j,x}^min, a.s.

Since Σx^{−1} is a constant matrix, there must exist δ2 > 0 such that

‖Xl − x^(j)‖ > δ2,    for any x^(j) ∈ C_{j,x}^min.

Therefore, as n > N, for any x^(j) ∈ C_{j,x}^min, we have

P(‖Ẑ_n^(j)‖ > ‖Z^(j)‖ + δ0) < P(Zl^T Zl > ‖Z^(j)‖² + δ1, for all Zl ∈ Cj)
  < P(‖Xl − x^(j)‖ > δ2, for all Zl ∈ Cj) = ∏_{Yl∈Ij} P(‖Xl − x^(j)‖ > δ2).    (9.12)

The last equation is due to the fact that the (Xl − x^(j)) I_{Yl∈Ij} are i.i.d. variables, following the conditional distribution of (X − x^(j)) I_{Y∈Ij}. In addition, by the assumption that all the possible
values of X are contained in R(X), i.e., the support of X, x^(j) is also contained in R(X). Thus, by the continuity of X, for any δ2 > 0, there exists ε > 0 such that

P(‖Xl − x^(j)‖ > δ2) < 1 − ε.    (9.13)

Recall that nj → +∞ (j = 1, . . . , h) as n → +∞. Therefore, by (9.12) and (9.13), as n → ∞, it follows that

P(‖Ẑ_n^(j)‖ > ‖Z^(j)‖ + δ0) < ∏_{Yl∈Ij} P(‖Xl − x^(j)‖ > δ2) = (1 − ε)^{nj} → 0.

Furthermore, it follows that

Σ_n P(‖Ẑ_n^(j)‖ > ‖Z^(j)‖ + δ0) ≤ Σ_n (1 − ε)^{nj} < +∞.

Thus, by the Borel–Cantelli theorem, we have Ẑ_n^(j) → Z^(j), a.s. This completes the proof.
Proof of Theorem 5.2.
Since the fourth moments of the components of X exist, we have

X̄n → E(X) and Σ̂x^{−1} → Σx^{−1},    a.s.    (9.14)

Note that Zi = Σ̂x^{−1/2}(Xi − X̄n). Let ŵ1(Xi | Yi = j, c) = w(Σ̂x^{−1/2}(Xi − X̄n) | Yi = j, c) = ŵ(Zi | Yi = j, c). By Theorem 5.1 and (9.14), for any ε > 0, there exists N such that, as n > N,

|ŵ1(Xi | Yi = j, c) − w1(Xi | Yi = j, c)| < ε,    a.s. for any i with Zi ∈ Cj,    (9.15)

where

w1(Xi | Yi = j, c) = [c/(1 − c)] K( [ (c/(1 − c)) ‖Σx^{−1/2}(Xi − E(X)) − Σx^{−1/2} x^(j)‖ ]² ) I_{Yi∈Ij}.

Note that, for any Zi ∉ Cj, ŵ1(Xi | Yi = j, c) = w1(Xi | Yi = j, c) = 0. Therefore, by (9.15), for any ε > 0, there exists N such that, as n > N,

|ŵ1(Xi | Yi = j, c) − w1(Xi | Yi = j, c)| < ε,    a.s. for i = 1, . . . , n.    (9.16)

Furthermore, as n > N, by (9.16), we have

| (1/nj) Σ_{i=1}^{n} ŵ1(Xi | Yi = j, c) − (1/nj) Σ_{i=1}^{n} w1(Xi | Yi = j, c) | < ε,    a.s.    (9.17)

Since {w1(Xi | Yi = j, c), i = 1, . . . , n with Zi ∈ Cj} are i.i.d. variables, as n → +∞ we have

(1/nj) Σ_{i=1}^{n} w1(Xi | Yi = j, c) − E(w1(X | Y = j, c)) → 0,    a.s.    (9.18)

Consequently, by (9.17) and (9.18), as n → +∞, it follows that

(1/nj) Σ_{i=1}^{n} ŵ1(Xi | Yi = j, c) − E(w1(X | Y = j, c)) → 0,    a.s.    (9.19)
Note that ŵ1(Xi | Yi = j, c) = ŵ(Zi | Yi = j, c). By (9.19), as n → +∞, it follows that

Ŵ(Ỹ = j, c) − E(w1(X | Y = j, c)) → 0,    a.s.    (9.20)

Noting that w1(X | Y = j, c) = w(Z | Ỹ = j, c), we have W(Ỹ = j, c) = E(w(Z | Ỹ = j, c)) = E(w1(X | Y = j, c)). Therefore, by (9.20), as n → +∞, we have

Ŵ(Ỹ = j, c) − W(Ỹ = j, c) → 0,    a.s.    (9.21)

Note that w1(Xi | Yi = j, c) = w(Zi | Yi = j, c) and that g is continuous with respect to w and W. By (9.16) and (9.21), there exists N1 such that, as n > N1,

|ĝ(Zi | Ỹ = j, c) − g1(Xi | Ỹ = j, c)| < ε,    a.s. for i = 1, . . . , n,    (9.22)

where

g1(Xi | Ỹ = j, c) = g(Σx^{−1/2}(Xi − E(X)) | Ỹ = j, c) = w1(Xi | Yi = j, c) / W(Ỹ = j, c).

For a matrix A = (a_ij), let ‖A‖_∞ = max{a_ij}. Note that g1(X | Ỹ = j, c) = g(Z | Ỹ = j, c). By (9.14) and (9.22), for any fixed j and c, as n → +∞, we have

‖ (1/nj) Σ_{i=1}^{n} Zi Zi^T ĝ(Zi | Ỹ = j, c) − E(ZZ^T g(Z | Ỹ = j, c)) ‖_∞ → 0,
‖ (1/nj) Σ_{i=1}^{n} Zi ĝ(Zi | Ỹ = j, c) − E(Zg(Z | Ỹ = j, c)) ‖_∞ → 0.

Therefore, as n → +∞, we have

Â(Ỹ = j, c) → (1/h) Σ_{j=1}^{h} E(ZZ^T g(Z | Ỹ = j, c)) − E(ZZ^T g(Z | Ỹ = j, c))
        + E(Zg(Z | Ỹ = j, c)) E(Z^T g(Z | Ỹ = j, c)),    a.s.    (9.23)

Since Ỹ is discrete, the right side of (9.23) can be written as

E_Ỹ[E(ZZ^T g(Z | Ỹ, c))] − E(ZZ^T g(Z | Ỹ = j, c)) + E(Zg(Z | Ỹ = j, c)) E(Z^T g(Z | Ỹ = j, c)).

Therefore, by the definition of Ĥ1(c) and the law of large numbers, as n → +∞, we have Ĥ1(c) → H1(c), a.s. This completes the proof.

Acknowledgements  The authors thank the editor and two referees for providing useful comments and suggestions which led to a great improvement of the presentation of the paper.

References
1 Cook R D. Regression Graphics. New York: Wiley, 1998
2 Cook R D. On the interpretation of regression plots. J Amer Statist Assoc, 89: 177–189 (1994)
3 Cook R D. Graphics for regressions with a binary response. J Amer Statist Assoc, 91: 983–992 (1996)
4 Li K C. Sliced inverse regression for dimension reduction. J Amer Statist Assoc, 86: 316–342 (1991)
5 Cook R D, Weisberg S. Discussion of "Sliced inverse regression for dimension reduction" by K-C Li. J Amer Statist Assoc, 86: 328–332 (1991)
6 Li K C. On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma. J Amer Statist Assoc, 87: 1025–1039 (1992)
7 Li B, Zha H, Chiaromonte F. Contour regression: a general approach to dimension reduction. Ann Statist, 33: 1580–1616 (2005)
8 Xia Y, Tong H, Li W K, et al. An adaptive estimation of dimension reduction space. J Roy Statist Soc Ser B, 64: 363–410 (2002)
9 Cook R D, Lee H. Dimension reduction in binary response regression. J Amer Statist Assoc, 94: 1187–1200 (1999)
10 Cook R D, Nachtsheim C J. Reweighting to achieve elliptically contoured covariates in regression. J Amer Statist Assoc, 89: 592–599 (1994)
11 Ye Z, Weiss R E. Using the bootstrap to select one of a new class of dimension reduction methods. J Amer Statist Assoc, 98: 968–979 (2003)
12 Zhu L X, Ohtaki M, Li Y X. On hybrid methods of inverse regression-based algorithms. Comput Statist Data Anal, 51: 2621–2635 (2007)
13 Cook R D. SAVE: a method for dimension reduction and graphics in regression. Comm Statist Theory Methods, 29: 2109–2121 (2000)
14 Zhao J L, Xu X Z, Ma J J. Extending SAVE and PHD. Comm Statist Theory Methods, 36: 1591–1606 (2007)
15 Li Y X, Zhu L X. Asymptotics for sliced average variance estimation. Ann Statist, 35: 41–69 (2007)
16 Zhu L P, Zhu L X. On kernel method for sliced average variance estimation. J Multivariate Anal, 98: 970–991 (2007)
17 Wang H S, Ni L Q, Tsai C L. Improving dimension reduction via contour-projection. Statist Sinica, 18: 299–311 (2008)
18 Li K C. Nonlinear confounding in high-dimensional regression. Ann Statist, 25: 577–612 (1997)
19 Zhu Y, Zeng P. Fourier methods for estimating the central subspace and the central mean subspace in regression. J Amer Statist Assoc, 101: 1638–1651 (2006)
20 Xia Y. A constructive approach to the estimation of dimension reduction directions. Ann Statist, 35: 2654–2690 (2007)
21 Li B, Wang S. On directional regression for dimension reduction. J Amer Statist Assoc, 102: 997–1008 (2007)
22 Prendergast L A. Implications of influence function analysis for sliced inverse regression and sliced average variance estimation. Biometrika, 94: 585–601 (2007)