Comput Stat (2014) 29:769–798 DOI 10.1007/s00180-013-0460-3 ORIGINAL PAPER
Using sliced mean variance–covariance inverse regression for classification and dimension reduction

Charles D. Lindsey · Simon J. Sheather · Joseph W. McKean
Received: 27 January 2012 / Accepted: 14 October 2013 / Published online: 27 October 2013 © Springer-Verlag Berlin Heidelberg 2013
Abstract The sliced mean variance–covariance inverse regression (SMVCIR) algorithm takes grouped multivariate data as input and transforms it to a new coordinate system where the group mean, variance, and covariance differences are more apparent. Other popular algorithms used for performing graphical group discrimination are sliced average variance estimation (SAVE, targeting the same differences but using a different arrangement for variances) and sliced inverse regression (SIR, which targets mean differences). We provide an improved SMVCIR algorithm and create a dimensionality test for the SMVCIR coordinate system. Simulations corroborating our theoretical results and comparing SMVCIR with the other methods are presented. We also provide examples demonstrating the use of SMVCIR and the other methods in visualization and group discrimination by k-nearest neighbors. The advantages and differences of SMVCIR from SAVE and SIR are shown clearly in these examples and simulations.

Keywords Discrimination · Singular value decomposition · SAVE · SIR · Visualization · SMVCIR
C. D. Lindsey (B)
StataCorp, College Station, TX 77845, USA
e-mail: [email protected]

S. J. Sheather
Department of Statistics, Texas A&M University, College Station, TX 77843, USA
e-mail: [email protected]

J. W. McKean
Department of Statistics, Western Michigan University, Kalamazoo, MI 49008, USA
e-mail: [email protected]
1 Introduction

We observe n independent and identically distributed (k + 1) × 1 random variates. We refer to the k variables x1, ..., xk that begin each observation as our predictor variables, and the final variable y of each observation as our response variable. The support of the response variable y can be partitioned into g groups. This partitioning is defined before sampling. Our goal is to transform the input data into a new coordinate system that highlights the mean, variance, and covariance differences between the response groups, and to determine the dimension of that coordinate system. So we will transform the raw data into the discrimination coordinate system and only examine dimensions that are deemed to be statistically and practically significant.

In our population, the predictors in response group i have mean μi and full rank variance matrix Σi. The probability of drawing an observation from group i when randomly sampling from the population is pi. We calculate the centered means μci by subtracting the marginal mean μ from each of the group means. The marginal mean is equivalent to the group proportion weighted average of the individual group means. Similarly to the centered means, we calculate the centered variances Σci by subtracting the group proportion weighted average of the individual group variances. By centering the parameters, we create differences from the average. Our goal is to form a new coordinate system with a spanning set formed by these differences.

SMVCIR (sliced mean variance–covariance inverse regression, Sheather et al. 2008) extracts the variance differences from Σci and stacks them together in the vectors σci. The group covariance difference matrices, with zeroes along the diagonal, are denoted Σ̆ci. In this way, in the circumstance that a group differs from the rest in several variances but no covariances, only one difference is added to the SMVCIR coordinate system. SAVE (sliced average variance estimation, Cook 2000) is another inverse regression algorithm. It targets similar difference information to SMVCIR, but it leaves the variance difference parameters with the covariance difference parameters. The other inverse regression algorithm we will study, SIR (sliced inverse regression, Li 1991), ignores group variance and covariance differences, focusing solely on group mean differences.

Both SIR and SAVE target subsets of the central subspace of the regression of y on x. When conditioning on all of the transformed coordinates from the central subspace, y and x are independent. So the transformed coordinates provide all the useful information about x as it relates to y. SMVCIR does not necessarily target the central subspace, because of the way it stacks variance differences. Instead, by considering variance differences separately from other covariance differences, SMVCIR is able to detect mean and covariance differences when variance differences exist across many of the variables.

Both Cook and Critchley (2000) and Zhu and Hastie (2003) documented that SAVE can have difficulty detecting mean differences. When variance differences exist across many of the variables, SAVE will use a dimension for each separate difference. In this case, mean differences may only be evident in higher SAVE coordinates. Rather than examining the first few most discriminating SAVE dimensions to see the
location difference between the groups, one would have to examine both these and higher (more weakly discriminatory) dimensions. By stacking the variance differences, SMVCIR avoids this problem. We will see in later examples how SMVCIR easily uncovers mean differences which are not readily detected by SAVE. SIR concurs with SMVCIR in these situations, but does not provide SMVCIR's variance/covariance discrimination.

Using the superscript s to denote the scaling standardization and m to denote the Mahalanobis standardized parameter matrices and vectors, we obtain the following spanning sets for SMVCIR, SAVE, and SIR. Note that we weight each difference with the square root of the population proportion for its group. This prevents the dominance of large differences in small groups or small differences in large groups.

$$
\begin{aligned}
K_{\mathrm{SMVCIR}} &= \left[\sqrt{p_1}\,\mu_1^{cs} \;\cdots\; \sqrt{p_g}\,\mu_g^{cs} \quad \sqrt{p_1}\,\sigma_1^{cs} \;\cdots\; \sqrt{p_g}\,\sigma_g^{cs} \quad \sqrt{p_1}\,\breve{\Sigma}_1^{cs} \;\cdots\; \sqrt{p_g}\,\breve{\Sigma}_g^{cs}\right] \\
K_{\mathrm{SAVE}} &= \left[\sqrt{p_1}\,\mu_1^{cm} \;\cdots\; \sqrt{p_g}\,\mu_g^{cm} \quad \sqrt{p_1}\,\Sigma_1^{cm} \;\cdots\; \sqrt{p_g}\,\Sigma_g^{cm}\right] \\
K_{\mathrm{SIR}} &= \left[\sqrt{p_1}\,\mu_1^{cm} \;\cdots\; \sqrt{p_g}\,\mu_g^{cm}\right]
\end{aligned}
\tag{1}
$$
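To make the construction concrete, the following rough R sketch assembles a scaling-standardized, proportion-weighted spanning set of mean, variance, and covariance differences in the spirit of K_SMVCIR in (1), and extracts data coordinates from the kernel Kn Kn'. It is an illustrative simplification written for this exposition (the exact centering, column ordering, and degrees-of-freedom conventions of the published algorithm may differ), not the authors' Stata implementation.

```r
# Illustrative SMVCIR-style spanning set for grouped data (x: n x k matrix, y: group labels).
smvcir_sketch <- function(x, y) {
  y <- factor(y)
  s <- sqrt(diag(cov(x)))                         # marginal scales (scaling standardization)
  z <- scale(x, center = colMeans(x), scale = s)
  p   <- sapply(levels(y), function(g) mean(y == g))
  mus <- lapply(levels(y), function(g) colMeans(z[y == g, , drop = FALSE]))  # centered group means
  Sgs <- lapply(levels(y), function(g) cov(z[y == g, , drop = FALSE]))
  Spool <- Reduce(`+`, Map(`*`, Sgs, p))          # proportion-weighted average variance matrix
  pieces <- Map(function(pg, mu.c, Sg) {
    Sc    <- Sg - Spool                           # centered group variance matrix
    sig.c <- diag(Sc)                             # variance differences, stacked in one column
    Cov.c <- Sc; diag(Cov.c) <- 0                 # covariance differences with a zero diagonal
    sqrt(pg) * cbind(mu.c, sig.c, Cov.c)          # weight by the square root of the proportion
  }, p, mus, Sgs)
  Kn  <- do.call(cbind, pieces)                   # estimated spanning set
  eig <- eigen(tcrossprod(Kn), symmetric = TRUE)  # kernel Kn Kn'
  list(evalues = eig$values, coords = z %*% eig$vectors)
}
```

On grouped data, the leading columns of coords then play the role of the most discriminating coordinates.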
In practice, we estimate each of the elements in these matrices using group means, variances, covariances, and proportions. We call the estimate of the spanning set Kn. We form a kernel matrix Kn K′n and obtain its eigen decomposition. Then the original data are standardized (in an analogous manner to the difference parameter standardization) and the new data coordinate system is formed by multiplying the eigenvectors of the kernel matrix by the standardized predictors. This process was explained in Muirhead (1982). The spanning set of differences forms the columns of a matrix, and spans the column space of this matrix. The new coordinates estimate a rotation of the original data to the central features of this column space, that is, to the central features of the spanning set of difference vectors. The strengths of these difference features are measured by the eigenvalues of the kernel (which are also estimated). So we can obtain a clear picture of how the groups differ by examining the final coordinates produced by the inverse regression algorithm. But how many coordinates do we need to examine? This is the dimension of the inverse regression coordinate system, d.

In the next section, we provide more detail on the SMVCIR algorithm, demonstrating its use on a generated dataset. There are two groups with mean, variance, and covariance differences in the setting. The SAVE and SIR algorithms are invoked on the dataset as well. We see that SAVE places the lone mean difference at the highest, or least discriminatory, coordinate. SIR finds a single dimension, which corresponds to the mean difference. SMVCIR places the mean difference in only the second coordinate. The choice of scaling in the SMVCIR algorithm is also discussed and a motivation for the dimensionality test for d is provided. In Sect. 3, we find the asymptotic null distribution of the SMVCIR dimensionality test statistic for d = d0.
Section 4 provides simulations evaluating the size and power of the test statistic. In our power evaluation simulations, we estimate the value of d and compare this estimate with that obtained under SIR and SAVE. In the final section, we see how all three algorithms perform on three examples. We first analyze the pen digit data from Zhu and Hastie (2003). Here the groups correspond to the handwritten digits 0–9. In its first dimension, SMVCIR clusters the digit group "5" into two separate groups. As an apparent explanation we find that each group corresponds to a distinct way of writing the digit. This clustering is not found in any of the strong early dimensions of SAVE and SIR. k-nearest neighbors discrimination is used on the output coordinates from each of the three algorithms, and the discrimination on the SMVCIR coordinates provides a significantly smaller misclassification error rate than SAVE and a comparable rate to SIR. Our second example examines chemical data recorded on three cultivars of wine grown in Italy. The data is taken from Izenman (2008). SMVCIR clearly shows variance, covariance, and location differences between the three cultivars. It also has a smaller misclassification error rate than SAVE. The third example examines sonic data recorded in Ermont, France. The data consists of bird, car, and plane sounds. This data is taken from the LDR package (Cook et al. 2011). We find that SMVCIR provides good graphical classification. It also has a smaller misclassification error rate than SAVE in this instance.

The statistical software Stata (StataCorp 2011) is used to perform SMVCIR and draw graphics. The statistical software R (R Development Core Team 2012) and the package dr (Weisberg 2009) are used to perform the SAVE and SIR algorithms. The SMVCIR code for Stata will soon be publicly available.

2 Generated example

In this section we invoke the three inverse regression algorithms on a generated example with mean, variance, and covariance differences. Further details are also provided about the standardization used by SMVCIR, along with a motivation for the dimensionality test for the dimension of the SMVCIR coordinate system. Suppose we observe a two-group population with these parameters.

$$
p_1 = .25 \quad p_2 = .75 \qquad
\mu_1 = (0, 0, 0, 0, 0, 0)' \quad
\mu_2 = (6, 0, 0, 0, 0, 0)' \qquad
\Sigma_1 = \begin{bmatrix}
4 & 0 & 0 & 0 & 0 & 0 \\
0 & 9 & 0 & 0 & 0 & 0 \\
0 & 0 & 16 & 0 & 0 & 0 \\
0 & 0 & 0 & 25 & 0 & 0 \\
0 & 0 & 0 & 0 & 36 & 20 \\
0 & 0 & 0 & 0 & 20 & 35
\end{bmatrix} \quad
\Sigma_2 = I_6
\tag{2}
$$
Each group is marginally multivariate normal. There are 4 difference dimensions in the SMVCIR coordinate system. The means are clearly different for the first variable
x1 between the two groups. The variance of all variables x1, ..., x6 is different across the two groups. The covariance between x5 and x6 is clearly different as well. The mean difference should contribute one unique difference to the SMVCIR coordinates. All six of the elemental variance differences will contribute in total one unique variance difference to the SMVCIR coordinate system (recall that variances are stacked together). We will see two dimensions for the covariance difference between x5 and x6, counting the 5th and 6th column of the covariance difference matrix. We will draw a sample from this population. Then we will invoke the three inverse regression algorithms on the sample.

2.1 Standardization

We use the scaling standardization in SMVCIR. This standardization puts the predictors on a common scale, but may result in correlated coordinates. Use of the Mahalanobis standardization would provide orthogonal coordinates, but could introduce covariance difference dimensions. Examples of this phenomenon are documented in Lindsey (2010). In the original SMVCIR algorithm, a Mahalanobis standardization was used. The original algorithm did not use statistical inference to choose the number of dimensions, relying only on a scree plot of the SMVCIR kernel's eigenvalues. The extra dimensions introduced by the standardization were weak in discriminatory power, and hence associated with small eigenvalues. They were thus ignored in analysis using the original algorithm.

Cook (1998, p. 102) suggested that orthogonal coordinates be used to assess the features of the central subspace of the predictors on the response. So linear combinations of the predictors that are linearly independent of each other are advocated. But Cook (1998) went on to say, "Interpretation may be facilitated at other times by letting the columns of A correspond to linear combinations of particular interest, provided the latter are not too strongly correlated." He was discussing the choice of vectors to multiply the data by to obtain the final transformed coordinates. So the final transformed coordinates (linear combinations of the predictors) may be formulated to correspond to difference relationships that are of interest, so long as they are not too strongly correlated. We take a similar approach in using SMVCIR, checking the correlation of the resulting coordinates before inference is performed. If there is heavy marginal correlation in the final SMVCIR coordinates, then use of the algorithm was not successful and the results are suspect. We have rarely found this problem in practice.

The marginal variance matrix is calculated as
$$
\Sigma = \sum_{i=1}^{2} \Sigma_i\, p_i + \sum_{i=1}^{2} \mu_i \mu_i'\, p_i - \mu\mu'
\tag{3}
$$

This leads to the following marginal variance vector in our two-group setting:

$$
\sigma = \begin{bmatrix} 8.5 \\ 3 \\ 4.75 \\ 7 \\ 9.75 \\ 9.5 \end{bmatrix}
\tag{4}
$$
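As a quick numerical check, the entries of (4) follow directly from the population parameters in (2); the short R sketch below (parameter objects defined inline) is purely illustrative.

```r
# Population parameters from (2)
p  <- c(.25, .75)
mu <- cbind(c(0, 0, 0, 0, 0, 0), c(6, 0, 0, 0, 0, 0))
S1 <- diag(c(4, 9, 16, 25, 36, 35)); S1[5, 6] <- S1[6, 5] <- 20
S2 <- diag(6)

# Marginal variance matrix, equation (3)
mu.bar <- drop(mu %*% p)
Sigma  <- p[1] * S1 + p[2] * S2 +
          p[1] * tcrossprod(mu[, 1]) + p[2] * tcrossprod(mu[, 2]) -
          tcrossprod(mu.bar)
diag(Sigma)  # 8.50 3.00 4.75 7.00 9.75 9.50, matching (4)
```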
When we invoke SMVCIR on a sample from this setting, we will standardize by dividing the predictors by estimates of the square roots of the elements of σ.

2.2 Dimensionality test

To test for the dimension d of the SMVCIR coordinate system, we propose the following iterated testing scheme. We perform each test one after the next, until we accept a null hypothesis. The index of that accepted test is our estimated value of d. If none of the tests are accepted, we infer that d = k.

$$
\begin{aligned}
\text{Test } 0 &: \quad H_0: d = 0 \ \text{ vs } \ H_1: d > 0 \\
&\;\;\vdots \\
\text{Test } k-1 &: \quad H_0: d = k-1 \ \text{ vs } \ H_1: d = k
\end{aligned}
\tag{5}
$$

For test i, we calculate the following statistic.

$$
\hat{\Lambda}_i = n \sum_{j=i+1}^{k} \hat{\lambda}_j
\tag{6}
$$
The scalar values λ̂1 ≥ ... ≥ λ̂k are the eigenvalues of the estimated kernel matrix Kn K′n. The eigenvalues are a good measure of the discrimination strength of their corresponding dimension. The higher the discrimination we see in dimensions that exceed i, the more likely we are to believe that d > i. We find that under H0, the test statistic is asymptotically distributed identically to a linear combination of independent χ²₁ random variables. Tests for the dimension of the SAVE and SIR coordinate systems have been developed in Shao et al. (2007) and Bura and Cook (2001) respectively. They both use a similar test statistic to ours, and both also find an asymptotic null distributional result that is a linear combination of independent χ²₁ variables.

Once we obtain an estimate of d, this provides an upper bound on the number of SMVCIR dimensions that we should examine. We represent the total discrimination strength of the SMVCIR coordinates as the sum of the k singular values of the spanning set. Recall that the eigenvalues of the SMVCIR kernel represent the discrimination strength of their associated dimensions. The singular values of the SMVCIR spanning set are the square roots of the SMVCIR kernel eigenvalues. We examine a scree plot of the singular values of the estimated SMVCIR spanning set. The horizontal axis of the plot represents the number of potential SMVCIR dimensions to examine, ranging from 1 to all k possible dimensions. For horizontal value j, the vertical axis shows the percentage of the total discrimination strength explained by the first j singular values of the estimated SMVCIR spanning set. We examine the first r SMVCIR dimensions, where r is the minimum of our estimate of d and the smallest horizontal value in the scree plot that adequately satisfies our discrimination strength restrictions and beyond which a higher value leads to only marginal discrimination strength gain. By allowing our d estimate to serve as an upper bound, we guarantee a dimension choice that is practically and statistically significant.
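The iterated scheme is mechanical once a p value is available for each test; a rough R sketch follows. The helper smvcir_pvalue is a hypothetical stand-in for the reference distributions developed in Sect. 3, not a function from any published package.

```r
# Iterated dimension test: accept the first non-rejected null, else conclude d = k.
# eigenvalues: eigenvalues of the estimated kernel Kn Kn', sorted decreasingly.
choose_dimension <- function(eigenvalues, n, alpha = 0.05, smvcir_pvalue) {
  k <- length(eigenvalues)
  for (i in 0:(k - 1)) {
    stat <- n * sum(eigenvalues[(i + 1):k])   # test statistic (6)
    pval <- smvcir_pvalue(i, eigenvalues, n, stat)
    if (pval > alpha) return(i)               # accept H0: d = i
  }
  k                                           # all tests rejected
}
```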
Fig. 1 Generated example, SMVCIR coordinates
Now we simulate 5000 draws from the two-group population using hierarchical sampling and then perform SMVCIR, SAVE, and SIR on the resulting sample. In this hierarchical sampling, we sample each observation by first drawing a group. We choose group 1 with probability .25 and group 2 with probability .75.

The plots of the SMVCIR coordinates are displayed in Fig. 1. Regression lines for each group are also drawn on the plot. The mean difference is demonstrated in the second dimension. The other dimensions correspond to the variance and covariance differences between the two groups. We see the covariance difference most clearly in the plot of the first and fourth dimensions. Execution of the iterated test algorithm (5) results in a choice of d = 4, so we only examine four dimensions in Fig. 1. Using the dimensionality test, we reject tests i = 0, ..., 3 at the .01 level, and accept the null hypothesis that d = 4 with a p value exceeding .58. The scree plot of the spanning set singular values (Fig. 2) shows that choosing r = 4 provides 98 % of the potential discrimination strength as well. We also checked the correlations of the first four SMVCIR coordinates and found none that exceed 0.1 in magnitude, so we are not concerned that marginal linear dependency in the raw data distorts the SMVCIR results.

SAVE finds six difference dimensions in this setting, shown in Fig. 3. At each stage of the iterated testing, we reject the null at the 0.01 level. Only in the highest, and least discriminatory, coordinate of its space does it show a location difference between the two groups. Each variable that has a different variance between groups 1 and 2 contributes a difference. The covariance differences between the groups are not immediately evident from the graph.
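A minimal R sketch of this hierarchical sampling (MASS::mvrnorm for the multivariate normal draws; the parameter objects p, mu, S1, and S2 are those defined in the earlier sketch for (2)):

```r
library(MASS)  # for mvrnorm

# Hierarchical sampling: draw a group label first, then x given the group.
n <- 5000
group <- sample(1:2, size = n, replace = TRUE, prob = p)
x <- matrix(NA_real_, nrow = n, ncol = 6)
x[group == 1, ] <- mvrnorm(sum(group == 1), mu = mu[, 1], Sigma = S1)
x[group == 2, ] <- mvrnorm(sum(group == 2), mu = mu[, 2], Sigma = S2)
```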
Fig. 2 Generated example, SMVCIR scree plot
Fig. 3 Generated example, SAVE coordinates
SAVE does not estimate covariance differences separately from variances, so the obvious variance differences may overshadow covariance differences in this case. SIR only finds one difference dimension at the 0.01 level, as it can only detect mean differences. In the transformed coordinates, group 1 has mean −4.6 and group 2 has mean 1.53.
We have demonstrated the SMVCIR algorithm and compared it to SAVE and SIR. Now we will justify the dimensionality test.

3 Distribution of SMVCIR dimensionality test statistic

We begin by finding the asymptotic distribution of the spanning set estimate Kn to be multivariate normal. Then we use this distributional result and theory about the distribution of eigen/singular values from an asymptotically multivariate normal random vector to obtain the asymptotic null distribution of Λ̂i under H0: d = i. We were inspired by the strategy employed by Yin (2005) in determining the asymptotic distribution of the SAVE spanning set. Of course the details of our derivations are very different. Bura and Yang (2011) explore dimensionality testing of dimension reduction kernels, including SAVE and SIR. They provide a general framework for the dimensionality test, once the distribution of the kernel/spanning set is established. The SMVCIR dimensionality test fits within this framework.

3.1 Spanning set distribution

In this section the spanning set estimate is constructed, and its asymptotic distribution is calculated. This asymptotic distribution applies to the "vec" of our estimate. Each component vector of the estimate is a function of sample group proportions and group or marginal means, variances, and covariances. This suggests that we could start with a central limit theorem and apply the delta method to obtain the asymptotic distribution of our estimate. This suggestion is correct, but neither the central limit theorem nor the delta method invocation is trivial.

We begin with a central limit theorem invoked on group pseudo-first and pseudo-second moments, group proportions, and marginal first and second moments. By pseudo-moment, we mean the moment with a different scaling constant. We actually divide the group j moment's summand by the total sample size n instead of the group sample size nj. Formulation of our central limit theorem is facilitated by work in Gannoun and Saracco (2003). Then we invoke a delta method with five separate stages. For convenience we approach this as 5 separate delta methods. The equivalence of this approach to using one non-decomposed function and one very complicated delta method is justified by the chain rule for partial derivatives (Lütkepohl 2007, p. 666).

In the first stage, group first and second moments are formed using the group proportions and pseudo-moments from the central limit theorem stage. In the next stage, we calculate the group/marginal variances/covariances using the second and first moments. Then we standardize the group means and variances using the marginal means and variances. This forms mean differences as well, as we subtract out the marginal mean. In the next stage, we create variance and covariance differences by subtracting out the pooled variance/covariance matrix. Following this, in the final delta method stage, we weight the calculated group mean, variance, and covariance differences using the group proportions. This ends the delta method. To obtain the final spanning set distribution we then use a permutation matrix to stack the variances and place zeroes across the diagonal of the covariance matrices.
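Each of these stages relies on the standard multivariate delta method (stated here in generic notation rather than the paper's): if $\sqrt{n}(\hat{\theta}_n - \theta) \to_d N(0, \Omega)$ and $g$ is continuously differentiable at $\theta$, then

$$
\sqrt{n}\left(g(\hat{\theta}_n) - g(\theta)\right) \;\to_d\; N\!\left(0,\; \frac{\partial g(\theta)}{\partial \theta'}\,\Omega\,\left(\frac{\partial g(\theta)}{\partial \theta'}\right)'\right),
$$

so the asymptotic covariance is carried from one stage to the next by pre- and post-multiplying with that stage's Jacobian.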
Fig. 4 Spanset schematic
The final asymptotic distribution of the SMVCIR spanning set is given in the following. Note that we obtain a singular asymptotic covariance matrix Γ. Many of our estimates within Kn are constrained to add up to some constant, since they have been centered or represent group proportions.

$$
\sqrt{n}\,\operatorname{vec}\left(K_n - K\right) \;\to_d\; N(0, \Gamma)
\tag{7}
$$
Figure 4 concisely summarizes how we obtain the asymptotic distribution of the spanning set. For brevity we have to use some abbreviation in the schematic. The letter "C" stands for centered. The letter "S" means scaled (standardized). So "C SMean" indicates that the mean has been centered and its scale standardized. The abbreviation "Mom" stands for moment, so "* Mom" stands for the second order product moments.

3.2 Final test statistic distribution

We proceed by writing Λ̂i in terms of the singular value decomposition of the spanning set estimate Kn. Then we find the null distribution of a similar expression that replaces the estimated spanning set statistic with the mean-adjusted estimated spanning set statistic from (7). We show how the distributions of our test statistic and the mean-adjusted statistic are equivalent. Finally we provide 3 reference distributions for the null distribution of our statistic based on these results.

The singular value decomposition of Kn yields the following. The 1-indexed matrices correspond to singular values σ̂1, ..., σ̂i, while the 0-indexed matrices correspond to singular values σ̂i+1, ..., σ̂k.
$$
K_n = \begin{bmatrix} \hat{U}_1 & \hat{U}_0 \end{bmatrix}
\begin{bmatrix} \hat{D}_1 & 0 \\ 0 & \hat{D}_0 \end{bmatrix}
\begin{bmatrix} \hat{V}_1 & \hat{V}_0 \end{bmatrix}'
= \hat{U}_1 \hat{D}_1 \hat{V}_1' + \hat{U}_0 \hat{D}_0 \hat{V}_0'
\tag{8}
$$
Now we rewrite our test statistic Λ̂i in terms of the above matrices.

$$
\hat{\Lambda}_i = n \sum_{j=i+1}^{k} \hat{\lambda}_j
= n \left[(\hat{V}_0 \otimes \hat{U}_0)' \operatorname{vec}(K_n)\right]' \left[(\hat{V}_0 \otimes \hat{U}_0)' \operatorname{vec}(K_n)\right]
\tag{9}
$$
Noting that vec(ABC) = (C′ ⊗ A) vec(B), we can apply Theorem 1 of Bura and Yang (2011).

Theory

$$
\hat{\Lambda}_i \;\approx_d\; \sum_{j=1}^{(k-i)(h-i)} \lambda_j \chi_j,
\qquad
\lambda_1 \geq \cdots \geq \lambda_{(k-i)(h-i)} \ \text{the eigenvalues of } (V_0 \otimes U_0)'\,\Gamma\,(V_0 \otimes U_0),
\qquad
\chi_j \sim \text{i.i.d. } \chi^2_1
\tag{10}
$$
Using Serfling (1980) and (7), we find that Kn →p K. It immediately follows that Û1, Û0, V̂1, V̂0, D̂1, D̂0 each converge in probability via the continuous mapping theorem. We can also obtain a consistent estimate of the spanning set covariance, Γ̂, by using sample moments (up to the fourth moment and including product moments). Given this, by using Slutsky's Theorem, Lemma 2.1 from Tyler (1981), and other theory from Eaton and Tyler (1994) we can approximate the distribution of Λ̂i by

Empirical

$$
\hat{\Lambda}_i \;\approx_d\; \sum_{j=1}^{(k-i)(h-i)} \hat{\lambda}_j \chi_j,
\qquad
\hat{\lambda}_1 \geq \cdots \geq \hat{\lambda}_{(k-i)(h-i)} \ \text{the eigenvalues of } (\hat{V}_0 \otimes \hat{U}_0)'\,\hat{\Gamma}\,(\hat{V}_0 \otimes \hat{U}_0),
\qquad
\chi_j \sim \text{i.i.d. } \chi^2_1
\tag{11}
$$

To use the empirical reference distribution we must simulate draws from the given distribution to determine a decision rule. This simulation is fairly straightforward and involves drawing a number of independent χ²₁ realizations. We also obtain a null distribution that we can use in practice without simulation. The approximation methods from Bentler and Xie (2000) are used. We correct our test statistic so that its distribution is close to that of a χ² random variable. This yields the approximate empirical reference distribution. As this reference distribution does not rely on simulation, it is preferred for use in practice.
(12)
4 Simulation study

Now we corroborate our theoretical results by simulation. In this section we will study the size and power of our SMVCIR dimensionality tests and compare the choice of SMVCIR dimension with SIR and SAVE dimensions under the iterated testing scheme (5). We will evaluate the approximate empirical (12) test using a family of 3 group and 8 predictor distributions. We will evaluate 6 different mean, variance, and covariance scenarios. The scenarios will be imposed upon 3 separate base distributions. The first is normal. The second is a multivariate Student's T distribution that is parameterized by the given mean and variance parameters and has 10 degrees of freedom (only 2 more than its variate dimension). Note that the actual variance of the T distribution will be 10/(10 − 2) = 1.25 times the given parameters. In determining group differences, this will have a negligible effect. The third is created by standardizing 8 independent exponential(1) variables. Three of these variables are scaled by −1, so their direction of skew is reversed.

First we define our 6 difference parameter scenarios. Then we will show simulations corroborating that our dimensionality test statistics have correct size, that is, that the postulated size for the test is close to the actual size of the test. We follow this by a dimension choice simulation, where we use the iterated testing scheme (5) for SMVCIR, SIR, and SAVE on samples from each of our population/difference parameter settings. This will provide a comparison of the three inverse regression algorithms. This simulation will also demonstrate that our SMVCIR dimensionality test is powerful, as choosing the correct dimension involves rejecting the tests of lower dimensions.

4.1 Mean, variance, covariance situations

We will use 6 separate situations, corresponding to SMVCIR dimension d = 0, 1, 2, 4, 5, 6. The group proportions are the same for each situation we study.

$$
p_1 = p_2 = .25 \qquad p_3 = .5
\tag{13}
$$

The group means for the d = 0 case are the following.

$$
\mu_1^0 = \mu_2^0 = \mu_3^0 = (5, -5, 3, 2, 1, 6, 7, 0)'
\tag{14}
$$

The group variances for the d = 0 case are given by the following.

$$
\Sigma_1^0 = \Sigma_2^0 = \Sigma_3^0 =
\begin{bmatrix}
10 & 0 & 0 & 0 & 0 & 0 & 0 & 6 \\
0 & 12 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 9 & 7 & -2 & 0 & 0 & 3 \\
0 & 0 & 7 & 11 & -3 & 0 & 0 & 6 \\
0 & 0 & -2 & -3 & 10 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 9 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 10 & 0 \\
6 & 0 & 3 & 6 & 1 & 0 & 0 & 14
\end{bmatrix}
\tag{15}
$$
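The three base distributions can be sketched as follows. This is an illustrative construction consistent with the description above (in particular, which three exponential coordinates have their skew reversed is not specified in the text, so the choice below is arbitrary), not the code used for the study.

```r
# Draw n observations with mean mu and scale from Sigma for each base family.
draw_group <- function(n, mu, Sigma, family = c("normal", "t10", "exp")) {
  family <- match.arg(family)
  k <- length(mu)
  R <- chol(Sigma)
  Z <- switch(family,
    normal = matrix(rnorm(n * k), n, k),
    # Multivariate T with 10 df, built from the given mean and variance parameters.
    t10    = matrix(rnorm(n * k), n, k) / sqrt(rchisq(n, df = 10) / 10),
    exp    = {
      E <- matrix(rexp(n * k) - 1, n, k)   # standardized exponential(1): mean 0, variance 1
      E[, 1:3] <- -E[, 1:3]                # reverse the skew of three coordinates (arbitrary choice)
      E
    })
  sweep(Z %*% R, 2, mu, "+")
}
```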
In the d = 1 case, we change the first group mean to μ1 = [−1 −1 0 2 1 6 0 0]'. Differing parameters are noted in bold. To get the d = 2 case, we change the diagonal of the group 3 variance matrix to v3 = [20 22 9 11 10 19 30 24]'. For the d = 4 case, we replace row and column eight of the group 3 variance matrix with cv38 = [−16 0 −3 −6 −1 0 0 24]'. Even though 4 covariance parameters are now different, we only have 2 SMVCIR dimensions of difference. The linear dependence of the differences within the same row leads to only one unique column of differences in the SMVCIR spanning set. To get d = 5, we change the diagonal of the group 2 variance matrix to v2 = [50 42 12 10 9 2 2 34]'. Finally, d = 6 is obtained by changing the group 3 mean to μ3 = [−5 −5 −5 2 −5 6 7 0]'.

These settings provide a variety of practically significant differences for SMVCIR to detect. When applied to our three population settings, normal, T10, and standardized exponential, they provide a broad group of settings for examination of our SMVCIR dimensionality test.

4.2 Size simulations

We generate samples of n = 100, 1000, 5000 from each of the populations and d = i (i = 0, 1, 2, 4, 5, 6) settings and evaluate Λ̂i on each sample. We may decide to reject or accept the null hypothesis H0: d = i. The approximate empirical (12) reference distribution is used. Here we calculate Λ̂i and the null diagonalized spanning set covariance estimate. Then we compute the correction scaling factor and corrected degrees of freedom according to (12). A decision rule is then formulated using the resulting χ² distribution.

In each of the simulations we create and test 1000 samples for each given sample size. We use α = 0.05. Note that this is just the nominal size; we are testing the actual size now. Also, treating the size as a proportion, we calculate an exact 95 % confidence interval using the Clopper–Pearson technique (Clopper and Pearson 1935). This helps us see how the size may range in repeated simulations.

Figure 5 contains the results of our size simulations, with a separate graph for each d = i test. Each graph has the size on the vertical axis and n on the horizontal axis. We draw a line at α = 0.05 in each plot. We find satisfactory size performance for the approximate empirical test in all settings for n ≥ 1000. Some settings have satisfactory size at n = 100 as well.

4.3 Power and dimension comparison

Now we present simulations testing the power of our approximate empirical SMVCIR dimensionality test (12) and comparing SMVCIR with SIR and SAVE. We use the same difference parameter/population settings as before, drawing 1000 samples from each and then performing the iterated tests (5) at level α = 0.05 for SMVCIR, SIR, and SAVE. The tests used for SIR and SAVE are implemented in the dr package (Weisberg 2009). These are Bura and Cook (2001) for SIR and Shao et al. (2007) for SAVE.

Table 1 contains the results for the normal population. The chosen dimensions are recorded in the columns, while the true dimension for SMVCIR is set in the rows of the table. Note that SIR and SAVE may have different true dimensions than SMVCIR.
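Looking back at the size study of Sect. 4.2, one cell of it can be sketched as below; draw_sample and test_pvalue are hypothetical stand-ins for the scenario's data generation and the approximate empirical test, and binom.test supplies the exact Clopper–Pearson interval.

```r
# Size check for one (population, d = i, n) cell: nominal level 0.05, 1000 replications.
size_cell <- function(nrep = 1000, n, i, draw_sample, test_pvalue) {
  reject <- replicate(nrep, test_pvalue(draw_sample(n), dim = i) < 0.05)
  list(size = mean(reject),
       ci   = binom.test(sum(reject), nrep)$conf.int)  # exact Clopper-Pearson 95% CI
}
```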
Fig. 5 Size simulation results

Table 1 True dimension by dimension choice: normal population
Table 2 True dimension by dimension choice: T10 population
The true dimension choices for each algorithm are bolded. We find that the SMVCIR tests perform very well at all sample sizes. Dimensions below the true dimension are soundly rejected, and higher dimensions are only chosen roughly 5 % of the time. This reinforces the positive results from our size simulations in the last section. SIR also performs very well. We see that SAVE is not very powerful at some of our higher dimensions. Recall that SAVE will use a separate dimension for each differing column of a group covariance matrix. This can lead to difference vectors in the spanning set that are zeroed in all but a few places. Hence their linear independence from the other differences may be quite weak. This results in very weak differences that SAVE should detect, and therefore a lowering of its power.

The results for the next population setting, the multivariate T10, are recorded in Table 2. SMVCIR and SIR both have excellent power here as well. We see that SAVE is even less powerful here. Spot checking of several cases in the SMVCIR d = 2 case (where the SAVE dimension was 6) suggested that the final SAVE dimension was so small that outliers were masking its detection. We found that several severe outliers in a group other than 2 (2 is the group that differs in variance from the others in this setting) could subtly bloat the variance of the group they belong to and make it equivalent to that of group 2 in this final SAVE dimension. The mean difference is regularly placed at the beginning of the dimension ordering here because of its linear independence from the other differences, so all discrimination for the other dimensions must be in variance.
Table 3 True dimension by dimension choice: standardized exponential(1) population
This phenomenon, outliers masking variance differences, appears to happen often in this setting. Therefore under the moderately heavy tailed T10 distribution we expect to see weaker power for SAVE than under a normal distribution. Do note that these results show the proper size of the SAVE test under the null hypothesis, though. As exemplified in the SMVCIR d = 2 case, 471/(471 + 30 + 1) = 0.938247.

The results for the standardized exponential population setting are recorded in Table 3. SMVCIR and SIR again have good power. SAVE is even less powerful here than under the T10 distribution. As before, we spot checked several examples and saw that outliers were masking the small differences. This exponential distribution has even more outliers than the T10, so our results are not surprising.

These results demonstrate how our SMVCIR dimensionality test has good power in rejecting the null hypotheses under the alternative. SIR is also shown to be quite powerful, and is mostly undeterred in its dimension choice by variance and covariance differences. We find that SAVE can have difficulty choosing a single dimension when the differences it finds are not strongly linearly independent or of sufficient individual strength. The results also demonstrate how outlier heavy distributions can mask slight group differences that light outlier distributions (normal) show with clarity. We also note that the size of the SMVCIR, SAVE, and SIR tests is appropriate in each of the examined power simulation situations.
5 Examples

In this section we will show how SMVCIR and its dimensionality test work on real datasets. We will also invoke SAVE and SIR on these datasets and compare the three algorithms. As before, we use the dr package (Weisberg 2009) implementation of SIR and SAVE. We will study three datasets. First we examine the training dataset from the pen digit handwriting data of Zhu and Hastie (2003) and Asuncion and Newman (2007). In its first dimension, SMVCIR finds interesting clustering involving the digit 5 that is not clearly noticeable in SAVE or SIR within the first four dimensions. The second dataset contains data on three wine cultivars grown in the same part of Italy during the 1970s. The third dataset contains sonic data recorded in Ermont, France. The goal is to discriminate between birds, cars, and planes using the noise they produce.

As an alternative or complement to visual examination, we can use a group discrimination algorithm on the SMVCIR coordinates. The misclassification error rate of the discrimination algorithm tells us how well the groups are separated by transformation to the SMVCIR coordinate system. We use the k-nearest neighbors group discrimination algorithm here. This classic algorithm dates back at least as far as Hodges and Fix (1951). An introductory treatment is provided in Rencher (2002), while a more detailed explanation is given in Hastie et al. (2009). Essentially, a predicted group is assigned to a point based on the majority group among the actual groups of its k neighbors. Determining which k neighbors to use in the calculation is based on a distance measure from the point in question. We use the Mahalanobis distance here. In the case of ties, where there is no clear majority group between the neighboring points, the predicted group is chosen randomly from the most likely groups. The misclassification error rate measures how often the predicted group of a point fails to match the actual group of the point.

To choose k, the number of neighbors to use in classification of a single point, we refer to the research of Loftsgaarden and Quesenberry (1965). Loftsgaarden and Quesenberry suggest that k should be chosen to be √ni, where ni is the typical group size of the data. This rule is mentioned in Rencher (2002), who also suggests trying several values of k and choosing the one which yields the smallest misclassification error rate. We use the Loftsgaarden and Quesenberry rule here.

The output coordinates of the three algorithms, SMVCIR, SAVE, and SIR, will be compared graphically and using k-nearest neighbors discrimination. In each example, we find that the misclassification error rate for the SMVCIR coordinates is smaller than that of SAVE. Also, in the pen digit data, the SMVCIR misclassification error rate is comparable to SIR.

5.1 Pen digit example

In Zhu and Hastie (2003), the pen digit dataset is described as: "The Pen Digit database from the UCI machine-learning repository contains 10,992 samples of handwritten digits (0, 1, 2, . . ., 9) collected from 44 different writers. Each digit is stored as a 16-dimensional vector. The 16 attributes are derived using standard temporal and spatial resampling techniques in order to create vectors of the same length for every character."
Fig. 6 Pen digit, SMVCIR singular values scree plot
These attributes are actually the coordinates of an individual penstroke on a digital tablet recorded in time and space, (x, y)T, T = 1, ..., 8. They divided the data into a training (7,494 observation) and test (3,498 observation) dataset. These datasets can be found on the Machine Learning Database website (Asuncion and Newman 2007). We focus on the training dataset here.

Using the SMVCIR dimensionality test, we find that all 16 SMVCIR dimensions are significant at the .01 level. Knowing this we could choose any dimension up to 16 for visual examination. As described earlier, we will use a scree plot to decide the number of dimensions, r, to examine within the 16 significant dimensions. Figure 6 contains this plot. Over half of the discrimination power is found in the first four dimensions. So the most important group differences should be found within the first four dimensions. Further, inspection of the scree plot shows that the distance from r = 2 to r = 4 is approximately 19 %, r = 4 to r = 6 approximately 13 %, and r = 6 to r = 8 is 10 %. The distance continues to decrease as we increase the chosen dimension r. This flattening means that we only gain marginal practical significance by choosing higher r. We choose r = 4 as a practically and statistically significant dimension that is also efficient, in that higher choices only provide marginal gains in significance. There are likely other interesting group differences to be found in the higher dimensions, and we would relax our restrictions on the choice of r if we were to give a definitive analysis of the pen digit data. For brevity we will focus on the important differences and maintain a fairly strict rule for choosing r.

When we invoke SAVE on the data, we also find all possible 16 dimensions are significant. Similarly, all 10 possible SIR dimensions are significant. We will focus on the strong differences when using the output of these algorithms as well. So we will look at the first four dimensions of each of the three inverse regression algorithms. As shown in Fig. 7, SAVE seems to find little location difference in the first four dimensions. Zhu and Hastie (2003) found in this example (pen digit data, restricted to the digits 0, 6, and 9), and generalized to other examples, that SAVE can have difficulty detecting mean group differences. We do see useful variance group
Fig. 7 Pen digit, SAVE D1–D4

Fig. 8 Pen digit, SIR D1–D4
Fig. 9 Pen digit, SMVCIR D1–D4

Table 4 Pen digit, k-nearest neighbors discrimination, LOO error rates

Dimensions   SIR    SMVCIR   SAVE
1            0.49   0.55     0.73
2            0.26   0.29     0.59
3            0.14   0.16     0.47
4            0.09   0.09     0.33
difference information in the plots, though. SIR, whose first four coordinates are shown in Fig. 8, provides strong location difference information for some of the digits. Finally, in Fig. 9, SMVCIR provides a balance of variance, covariance, and location difference information. Moreover, a clustering of points into two groups for the digits 5 and 9 is highlighted in the first SMVCIR dimension. For brevity we will ignore the clustering in the digit 9 and focus on digit 5. SAVE does not show this as important information in the first four dimensions, and SIR does not show the clustering very strongly. Our visual inference is valid as the highest correlation in the first four SMVCIR dimensions has magnitude less than 0.24. We will investigate the clustering that SMVCIR uncovered shortly.

First we see how well the three algorithms discriminate between the digits under the k-nearest neighbors method. The sample size is similar for each digit. Therefore, given that we have ten groups and 7494 observations, we use k = 27. The square root of 700 is 26.457513. The square root of 750 is 27.386128. Our choice of k is reasonable.
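For reference, leave-one-out k-nearest neighbors with Mahalanobis distance can be sketched in a few lines of R; this generic helper (not the authors' implementation) is what the error rates reported below amount to.

```r
# Leave-one-out k-nearest neighbors error rate using Mahalanobis distance.
loo_knn_error <- function(coords, group, k) {
  Sinv <- solve(cov(coords))
  pred <- character(nrow(coords))
  for (i in seq_len(nrow(coords))) {
    d2 <- mahalanobis(coords, center = coords[i, ], cov = Sinv, inverted = TRUE)
    nb <- order(d2)[2:(k + 1)]                    # nearest neighbors, skipping the point itself
    votes <- table(group[nb])
    best  <- names(votes)[votes == max(votes)]
    pred[i] <- if (length(best) == 1) best else sample(best, 1)  # break ties at random
  }
  mean(pred != as.character(group))
}
```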
Fig. 10 Pen digit, SMVCIR D1 clusters on digit 5
Table 4 shows the leave-one-out misclassification error rates for all three algorithms on the pen digit data. The lower the rate, the better discrimination the algorithm provides. The rows correspond to the different number of coordinates that could be used for discrimination. We find that SIR has the best discrimination. SMVCIR is competitive with SIR though. SMVCIR drastically outperforms SAVE. We see differences in misclassification error rate of 20 or 30 % when only a few coordinates are used. The difference shrinks as more dimensions are used, but it is still substantial until many coordinates are used.

Now we investigate the clustering in the digit 5. We investigate these clusters by looking at the original pen digit data. As described earlier, the variables are a space-time series. So we should be able to ascertain the shape of a drawn digit by connecting them in a line. We do this for the upper (D1 > 0) and lower (D1 ≤ 0) clusters of digit 5 in the SMVCIR coordinates. Figure 10 shows the results. We placed digits at the mean points of each of the (x, y)T, T = 1, ..., 8 coordinate pairs. These digits reflect the time sequence in the data. They were connected in a similar manner to the original coordinates. We omitted the connection from the fifth coordinate to the sixth coordinate in the large cluster for clarity. The points represent marks in time and space, and there is not necessarily a connection drawn between one mark and the next. This is likely the case with our omission. If the connection between (x, y)5 and (x, y)6 had been drawn, the resulting digit would look like a 6 rather than the digit 5.

We found that writers in the upper cluster began writing their 5 at the top left (1) and continued straight down (2) and drew the bottom curve (3–5). They came back later and drew the straight line (6–8) that tops the digit 5. Writers in the lower cluster began at the top right (1). They drew the digit's top (1–3) and then continued down drawing the left border (3–4) and finally the bottom curve (5–8).

As mentioned earlier, 44 writers were used to generate the pen digit data. Our assumption of independent observations may be inappropriate. SMVCIR has demonstrated that there may be some latent classes in the digit 5 that should be controlled for, perhaps by introducing new groups. This would also help our assumption of independence.
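A rough R sketch of how the mean stroke paths in Fig. 10 could be drawn. The column names x1–x8 and y1–y8 and the split on the sign of the first SMVCIR coordinate are assumptions about how the digit-5 records and coordinates are organized, made only for illustration.

```r
# Plot the mean (x, y) pen positions, in time order, for one digit-5 cluster.
plot_mean_stroke <- function(pen5, smvcir1, upper = TRUE) {
  keep <- if (upper) smvcir1 > 0 else smvcir1 <= 0
  xm <- colMeans(pen5[keep, paste0("x", 1:8)])
  ym <- colMeans(pen5[keep, paste0("y", 1:8)])
  plot(xm, ym, type = "n", xlim = c(0, 100), ylim = c(0, 100), xlab = "x", ylab = "y")
  lines(xm, ym)                # connect the eight time-ordered mean positions
  text(xm, ym, labels = 1:8)   # label each mark with its position in time
}
```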
Fig. 11 Wine, SMVCIR singular values scree plot
Digital handwriting discrimination can have application in a variety of settings, one of which is security. Many deliverymen now record digital signatures for package delivery, and most stores take digital signatures at checkout. Application of discrimination techniques to digital handwriting data can help us identify unique traits of writers, and thus identify writers. We have shown that SMVCIR is a useful technique to use in this type of analysis.
5.2 Wine example

Now we move to the wine example from Izenman (2008, p. 275). The wine dataset records chemical analyses of 178 wines. The wines were all grown in the same part of Italy during 1970–1979. They are classified according to 3 different cultivars: Barbera (A), Barolo (O), and Grignolino (G). There are 59 Barolo wines, 71 Grignolino wines, and 48 Barbera wines. The chemical attributes recorded were Alcohol, MalicAcid, Ash, AlcAsh (Alcalinity of the Ash), Mg (Magnesium), Phenols (Total Phenols), Flav (Flavanoids), NFP (Non-Flavanoid Phenols), Proa (Proanthocyanins), Color (Color Intensity), Hue, OD (OD280/OD315 of Diluted Wines), and Proline. So we have 13 predictors and 3 groups.

We invoke SMVCIR on the data. Both dimensionality tests find 11 (out of the possible 13) statistically significant dimensions at the .05 level. The scree plot of the spanning set singular values in Fig. 11 shows that the first five dimensions make up 70 % of the discrimination strength. Further, we see that the curve in Fig. 11 has already started to drastically flatten compared to its pattern for the first few SMVCIR dimensions. Therefore choosing a higher r than five will only gain us marginal discrimination power. So we choose to examine the first five dimensions. We find these dimensions all have correlation magnitude less than 0.12, so our visual inference is not distorted.
Fig. 12 Wine, SMVCIR D1 –D5
Figure 12 contains the SMVCIR coordinates. We see that the first and second SMVCIR dimensions provide a location difference between the three cultivar groups. We can also see variance differences in the second and third dimensions. The differing covariance between the Barolo and other cultivars can be seen as dimension 1 is plotted against the other coordinates. The difference in covariance between the Barbera and Grignolino cultivar groups can be seen in plots of the second, fourth, and fifth SMVCIR dimensions together.

To determine which predictors compose the differences that SMVCIR finds, we use standardized regression coefficients, predicting the SMVCIR dimension using the standardized predictors as regression predictors. The resulting coefficients are then scaled so that their squared values add to 1. Table 5 contains the standardized coefficients for the first five SMVCIR dimensions.

The sharpest covariance differences involving the first SMVCIR coordinate are found when plotting it versus the second and fourth SMVCIR coordinates. In the standardized coefficients of the first SMVCIR coordinate, particularly high mass is placed on the predictors Phenols, Flav, and OD. In the standardized coefficients for the second SMVCIR coordinate, we see high mass on Alcohol, Color, and Proline. We draw a matrix plot of the original values for these six predictors in Fig. 13. We have added regression lines this time to better show the covariance relationships. Many of the covariance differences highlighted by comparing the first and second SMVCIR coordinates contrast Barolo versus Grignolino and Barbera. We see this for the covariance of Proline with Alcohol, Phenols, and Flav. We do see some covariance differences contrasting Grignolino with Barbera and Barolo, however.
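A small R sketch of this standardized-coefficient computation (here d is a vector holding one SMVCIR coordinate and X the matrix of original predictors; an illustrative reading of the procedure, not published code):

```r
# Standardized coefficients for one SMVCIR dimension.
std_coefs <- function(d, X) {
  b <- coef(lm(d ~ scale(X)))[-1]   # regress the coordinate on standardized predictors
  b / sqrt(sum(b^2))                # rescale so the squared coefficients add to 1
}
```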
Table 5 Wine, SMVCIR Standardized Coefficients

Predictor    D1      D2      D3      D4      D5
Alcohol      0.11   −0.55    0.06   −0.08    0.15
MalicAcid   −0.21   −0.18   −0.32   −0.01   −0.16
Ash          0.05   −0.25   −0.50    0.23   −0.01
AlcAsh      −0.20    0.17   −0.50    0.10   −0.29
Mg           0.12   −0.20   −0.36    0.45    0.44
Phenols      0.38   −0.07   −0.16   −0.33   −0.11
Flav         0.44    0.01   −0.09    0.06   −0.16
NFP         −0.25   −0.01   −0.32   −0.59    0.50
Proa         0.30    0.01   −0.28   −0.31   −0.28
Color       −0.20   −0.53    0.10   −0.32   −0.20
Hue          0.33    0.23    0.00   −0.12    0.50
OD           0.42    0.13   −0.08   −0.16   −0.13
Proline      0.28   −0.44    0.19    0.16    0.03
Fig. 13 Wine, SMVCIR D1, D2 covariance differences
The covariance of Flav with OD and Color has this property. For brevity, we omit investigation of the other cultivar differences implied by SMVCIR.

We have found some interesting cultivar group differences by using SMVCIR on the wine data. Now we will see how SAVE and SIR perform.
Fig. 14 Wine, SIR coordinates D1 , D2
SIR finds two statistically significant (at level .05) location difference dimensions between the cultivars in Fig. 14. Like SMVCIR, SIR separates the groups in location well, but it provides less covariance difference information. Using the iterated test scheme (5), SAVE finds only 7 statistically significant difference dimensions at the 0.05 level. We examine the first five in Fig. 15, and there is little location difference in the three groups, unlike SMVCIR.

Using k = 7, we find in Table 6 that SMVCIR has a significantly smaller misclassification error rate than SAVE when one to three dimensions are used. It has a slightly higher rate when five dimensions are used. SIR has a higher rate than SMVCIR when only one dimension is used, but it gives perfect classification when two dimensions are used.
5.3 Birds, cars, and planes example

Our third example contains sonic data recorded in Ermont, France. The data consists of 58 bird sounds, 43 car sounds, and 64 plane sounds. This data is taken from the LDR package (Cook et al. 2011). Each recording is represented by 13 SDMFCCs (Scale Dependent Mel-Frequency Cepstrum Coefficients). These are the predictor values on which we will invoke SAVE, SIR, and SMVCIR.

Invoking SMVCIR on the data, we find that the first three dimensions represent over fifty percent of the discrimination strength. The dimensionality test shows that all thirteen dimensions are significant at the .01 level. We choose r = 3 as a practically and statistically significant dimension. It is also efficient, as the growth of discrimination slows down at higher dimensions (Fig. 16).
Fig. 15 Wine, SAVE D1–D5

Table 6 Wine, k-nearest neighbors discrimination, LOO error rates

Dimensions   SIR    SMVCIR   SAVE
1            0.57   0.14     0.52
2            0      0.02     0.30
3                   0.02     0.09
4                   0.02     0.02
5                   0.03     0.02
Both SIR and SAVE also find all dimensions statistically significant. The maximum correlation among the first three SMVCIR coordinates is 0.2163793, so visual inference using the SMVCIR coordinates is valid. We will examine the first three coordinates from SMVCIR and SAVE. SIR only produces two coordinates, since there are three groups, and these will be examined. The SAVE coordinates in Fig. 17 show variance and covariance discrimination, but little location discrimination. SIR provides excellent location discrimination in this setting, as shown in Fig. 18. SMVCIR provides variance, covariance, and location discrimination, as shown in Fig. 19. For brevity we will forego investigation of which covariates cause these group differences. The investigation would be similar to that of the wine example, involving the calculation of standardized coefficients and investigation of covariates having particularly large coefficient values.
123
795
0
20
40
%
60
80
100
Classification and dimension reduction
0
5
Dimensions
10
15
SAVE 1
−.1−.05 0 .05 .1 .15
−.1−.05 0 .05 .1 .15
Fig. 16 Birds, cars, and planes, SMVCIR singular values scree plot
SAVE 2
−.1−.05 0 .05 .1 .15 −.1−.05 0 .05 .1 .15
−.1−.05 0 .05 .1 .15
−.1−.05 0 .05 .1 .15
SAVE 3 Plane Bird
Car
Fig. 17 Birds, cars, and planes, SAVE D1 –D3
Using k = 7, we find in Table 7 that SIR has the lowest misclassification error rate under k-nearest neighbors discrimination. SMVCIR is competitive with SIR in this measure, and outperforms SAVE. We see differences in rate of 20 % when only a few coordinates are used.
Fig. 18 Birds, cars, and planes, SIR D1 –D2
Fig. 19 Birds, cars, and planes, SMVCIR D1 –D3
Table 7 Birds, Cars, and Planes, k-nearest neighbors discrimination, LOO error rates

Dimensions   SIR    SMVCIR   SAVE
1            0.35   0.4      0.53
2            0.07   0.28     0.45
3                   0.11     0.39
6 Conclusion

In this article we provided an improved SMVCIR algorithm and a dimensionality test for its coordinate system. Using a generated example, we demonstrated how SMVCIR can easily uncover location differences in its early, strongly discriminating coordinates that SAVE will only detect in its high, weakly discriminating coordinates. Simulations corroborating our theoretical results were also presented. They demonstrated that the SMVCIR dimensionality test has appropriate size and power in a variety of group difference and population settings. The attributes of SIR and SAVE were also documented in the simulations. We found that dimensionality tests for SIR have appropriate size and power, even in the face of variance/covariance differences. We found that SAVE, while having appropriate size, can have difficulty choosing a single dimension when the differences it finds are not strongly linearly independent or of sufficient individual strength. In this situation, outliers can trick SAVE into adding dimensions that should not be present.

We also explored how SMVCIR performed versus SIR and SAVE in three real data examples. In the pen digit example, SMVCIR found clustering in the digit 5 that was not easily detectable as an important feature in SIR and SAVE. We found an apparent explanation in that individuals in one of the digit 5 clusters wrote the top horizontal line of the digit 5 last, while individuals in the other cluster wrote it first. In the wine example, we found that SAVE did provide discrimination on location, but only in higher dimensions. SMVCIR placed it first. In this example we also explored the important predictors in discrimination by using standardized coefficients. In the planes, cars, and birds example, we found SMVCIR provided location discrimination that was missed by SAVE. SMVCIR also provided variance/covariance discrimination analogous to SAVE's in this example.

In all three examples, k-nearest neighbors discrimination was used on the output coordinates of the SAVE, SIR, and SMVCIR algorithms. We found that SIR in general had the lowest misclassification error rate. SMVCIR was competitive with SIR and provided significantly smaller misclassification error rates than SAVE. As the literature and our discussions have shown, real data sets exist where one of the three procedures (SIR, SAVE, or SMVCIR) is more proficient than the others. As future work, we intend to investigate adaptive procedures which attempt to choose a "best" method among the three (SIR, SAVE, or SMVCIR) for a given data set.
References

Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
Bentler PM, Xie J (2000) Corrections to test statistics in principal hessian directions. Stat Probab Lett 47:381–389
Bura E, Cook RD (2001) Extending sliced inverse regression: the weighted chi-squared test. J Am Stat Assoc 96:996–1003
Bura E, Yang J (2011) Dimension estimation in sufficient dimension reduction: a unifying approach. J Multivar Anal 102:130–142
Clopper CJ, Pearson ES (1935) The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26:404–413
Cook RD (1998) Regression graphics: ideas for studying regressions through graphics. Wiley, New York
Cook RD (2000) Save: a method for dimension reduction and graphics in regression. Commun Stat Theory Methods 29:2109–2121
Cook RD, Critchley F (2000) Detecting regression outliers and mixtures graphically. J Am Stat Assoc 95:781–794
Cook RD, Forzani LM, Tomassi DR (2011) Ldr: a package for likelihood-based sufficient dimension reduction. J Stat Softw 39:1–20. http://www.jstatsoft.org/v39/i03
Eaton ML, Tyler DE (1994) Asymptotic distributions of singular values with applications to canonical correlations and correspondence analysis. J Multivar Anal 50:238–264
Gannoun A, Saracco J (2003) An asymptotic theory for sirα method. Stat Sinica 13:297–310
Hastie T, Tibshirani RJ, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer, New York
Hodges JL, Fix E (1951) Discriminatory analysis: nonparametric discrimination, consistency properties. Tech. Rep. 4, Project No. 21–49-004, Randolph Field, Texas: Brooks Air Force Base, USAF School of Aviation Medicine
Izenman AJ (2008) Modern multivariate statistical techniques: regression, classification, and manifold learning. Springer, New York
Li KC (1991) Sliced inverse regression for dimension reduction. J Am Stat Assoc 86:316–327
Lindsey CD (2010) Sliced mean variance-covariance inverse regression dimensionality test. PhD thesis, Texas A&M University, available by request to the author
Loftsgaarden DO, Quesenberry CP (1965) A nonparametric estimate of a multivariate density function. Ann Math Stat 36:1049–1051
Lütkepohl H (2007) New introduction to multiple time series analysis, 2nd edn. Springer, New York
Muirhead RJ (1982) Aspects of multivariate statistical analysis. Wiley, New York
R Development Core Team (2012) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org, ISBN 3-900051-07-0
Rencher AC (2002) Methods of multivariate analysis, 2nd edn. Wiley, New York
Serfling RJ (1980) Approximation theorems of mathematical statistics. Wiley, New York
Shao Y, Cook RD, Weisberg S (2007) Marginal tests with sliced average variance estimation. Biometrika 94:285–296
Sheather SJ, McKean JW, Crimin K (2008) Sliced mean variance-covariance inverse regression. Comput Stat Data Anal 52:1908–1927
StataCorp (2011) Release 12. College Station, Texas. http://www.stata.com
Tyler DE (1981) Asymptotic inference for eigenvectors. Ann Stat 9:725–736
Weisberg S (2009) dr: methods for dimension reduction for regression. http://CRAN.R-project.org/package=dr, R package version 3.0.4
Yin X (2005) Tests of dimension with sliced average variance estimation. Research Report 13, Department of Statistics, University of Georgia
Zhu M, Hastie TJ (2003) Feature extraction for nonparametric discriminant analysis. J Comput Graph Stat 12:101–120