ANNALS OF BIOMEDICAL ENGINEERING 6, 108--116
(1978)
Preprocessing by Factor Analysis of Centro-Occipital EEG Power and Asymmetry from Three Subject Groups
A. M. DYMOND,1 R. W. COGER,1,2 AND E. A. SERAFETINIDES1,2,3
1 Veterans Administration Hospital (Brentwood), Los Angeles, California 90073, 2 Department of Psychiatry, UCLA School of Medicine, Los Angeles, California 90024, and 3 Brain Research Institute, UCLA Center for the Health Sciences, Los Angeles, California 90024
Received June 10, 1977

Principal components analysis with Varimax rotation was applied to centro-occipital EEG power spectral density (PSD) and asymmetry variables in order to detect the subsets of intercorrelated variables. Seven orthogonal variable subsets (factors) were found: left and right PSDs from 0 to 8, 6 to 12, 12 to 20, and 12 to 30 Hz, and asymmetries from 0 to 6, 6 to 14, and 14 to 30 Hz. Attempts to validate these results suggest that the factors are satisfactorily orthogonal and account for a large part of the original data variance. The same set of factors appears to be present in normal subjects, stabilized alcoholics, and some chronic schizophrenics. More effective use of multivariate data in later statistical tests may be made possible by replacing the original variables with the factor scores computed from a weighted sum of the variables in the factors. Also, factor extraction allows comparison of the variable organization in different subject groups. This can have both physiological and statistical significance.

The human electroencephalogram (EEG) is examined as a source of information about neurophysiological processes. Although visual inspection of EEG records has provided much valuable information, it is desirable to quantify the data and apply objective statistical evaluations. Ongoing EEG data reduction is often accomplished by spectral transformation which produces power spectral density (PSD) estimates in a series of narrow frequency bands. For only a few seconds of record from only one EEG channel, spectral transformation can result in a large number of PSD variables. Depending on parameters such as epoch length and spectral bandwidth, these PSDs can have substantial variance (Blackman and Tukey, 1958; Jenkins and Watts, 1968). High dimensionality and variance can have adverse effects on subsequent statistical data analysis (Meisel, 1972; Chen, 1973).
For example, a multivariate technique such as discriminant analysis finds ways to assign subjects to groups by looking at differences between the groups for each of the variables. If only a small number of subjects were initially recorded, then the mean and the variance of any given variable might be quite different from the values found when a larger sample was measured. This can result in spurious inclusion of this variable
0090-6964/78/0062-0108$02.00/0 Copyright © 1978 by Academic Press, Inc. All rights of reproduction in any form reserved.
FACTOR ANALYSIS OF C-O EEG
for separation of groups, even though no such information exists in the variable. The chance of something like this happening increases as smaller numbers of subjects are used or as larger numbers of variables are measured. Since it is known that there are high correlations between many of the adjacent EEG PSD bands, it may be advantageous to apply data preprocessing techniques to the data prior to statistical evaluations. Specifically, it might be possible to find clusters of highly correlated variables, and then use all of the variables in a cluster to compute a single score which would represent the cluster. Since this new variable would be calculated from a weighted sum of the original variables, it might have less variance and might be more normally distributed. This approach would automatically result in a significant dimensionality reduction since original variables would be combined into clusters. The new variables would be used in place of the original variables in subsequent statistical evaluations. A further advantage of this data preprocessing would arise from comparison of the variable subsets found in different subject groups. Many multivariate statistical procedures assume a similar variance-covariance behavior for different groups, and extracting the clusters of correlated variables within each group will allow investigation of the validity of this assumption. If different variable clusters are found in different subject groups, this could guide the choice of subsequent statistical tests. In itself, it would suggest that the subject groups were different. It could also provide an alternate approach to assigning cases to groups by means of evaluating which type of clusters best describes an individual subject's variables. Finding different variable clusters in different subject groups could have physiological significance since it may suggest the operation of different neuronal mechanisms. 
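The suggestion above, that a score computed from a weighted sum over a cluster of correlated variables may be less noisy than any single member, can be illustrated with a small simulation. This is a minimal sketch under assumed conditions, not the procedure used in this study: each of five hypothetical variables is taken to be one shared signal plus independent unit-variance noise, and the cluster score is an equal-weight mean. The names `signal`, `variables`, and `cluster_score`, the sample size, and the noise level are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_vars = 2000, 5

# Assumed model: every variable in the cluster = common signal + its own noise.
signal = rng.normal(size=(n_cases, 1))
noise = rng.normal(size=(n_cases, n_vars))
variables = signal + noise

# Equal-weight cluster score; a factor analysis would supply unequal weights.
cluster_score = variables.mean(axis=1)

# How well does one variable, or the cluster score, track the shared signal?
r_single = np.corrcoef(signal[:, 0], variables[:, 0])[0, 1]
r_cluster = np.corrcoef(signal[:, 0], cluster_score)[0, 1]
print(f"single variable r = {r_single:.2f}, cluster score r = {r_cluster:.2f}")
```

Under this assumed model the expected correlations are 1/sqrt(2), about 0.71, for a single variable and 1/sqrt(1.2), about 0.91, for the five-variable mean, because averaging divides the independent noise variance by the number of variables while leaving the shared signal intact.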
This report describes an attempt to detect subsets of intercorrelated variables from human centro-occipital EEGs. The method chosen to extract these subsets or clusters of variables was factor analysis, using the principal components method with Varimax rotation. The clusters of intercorrelated variables found by factor analysis are termed factors. They can be identified as the orthogonal eigenvectors describing the greatest percentages of the variance in the data space of the original variables (Harmon, 1967). This factor extraction can produce as many factors as the rank of the correlation matrix. Since each successive factor accounts for a decreasing proportion of the total variance, the usual procedure is to end factor extraction when the amount of the variance being explained by new factors drops below some preset limit. Following factor extraction, the factor axes are rotated by the Varimax method in order to arrange the variable loadings so that each variable is primarily associated with just one of the factors. The result of these procedures is a set of equations of the form $z_{ji} = \sum_{m} a_{jm} F_{mi}$, where $z_{ji}$ is the jth variable (in the form of a standard score) for the ith case, $a_{jm}$ is the factor loading coefficient for the jth variable and the mth factor, and $F_{mi}$ is the factor score (also a standard score) of the mth factor for the ith case. For orthogonal factors, the factor loading coefficients can be identified as the simple correlation coefficients between the variable and the factor. This fact can be used to assign a specific interpretation to each of the factors. This can be
DYMOND, COGER, AND SERAFETINIDES
done by plotting the values of the factor loading coefficients versus the variables for each of the factors. Those variables having high factor loading coefficients on the factor can then be presumed to be the primary contributors to the factor.

The three subject groups studied consisted of 30 control subjects, 44 stabilized alcoholics, and 18 schizophrenics. All subjects were male. The mean ages of the three groups were 38.2, 44.7, and 41.4 yr, and the standard deviations were 11.1, 8.2, and 8.9 yr, respectively. The alcoholic subjects had normal physical examinations and were free of obvious cognitive defects. They were all stabilized inpatients and were medicated only with Antabuse (disulfiram). The schizophrenics were all chronic inpatients, with a normal physical examination and no complicating secondary diagnosis. They were drug-free except for Penfluridol (diphenylbutylpiperidine). The control subjects were chosen from hospital employees.

Recordings were taken from bilateral centro-occipital leads (C3-O1, C4-O2) while the subjects were resting with eyes closed in a comfortable chair in a dimly illuminated room. Amplifier half-amplitude filter settings of 0.3 and 100 Hz were used. Recordings were made on paper and on magnetic tape. At a later time, the records on magnetic tape were analyzed for PSDs and asymmetries on a Nicolet MED-80 minicomputer, using the fast Fourier transform. The data were prefiltered with analog Butterworth low-pass filters having 24 dB/octave roll-offs, which were 3 dB down at 50 Hz. The digitizing rate was 100 Hz. Three consecutive 8-sec, artifact-free epochs of EEG were spectrally analyzed and integrated, and the average power for each 2-Hz band up to 30 Hz was stored. These averaged values were used to compute an asymmetry score, based on the formula $(L - R)/(L + R)$, where $L$ and $R$ refer to the left and right PSDs for a given band. The PSD variables underwent a log transformation before factor analysis (Blackman and Tukey, 1958; Benignus et al., 1970).
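The reduction from digitized record to PSD and asymmetry variables described above can be sketched as follows. This is an illustrative reconstruction, not the Nicolet MED-80 program: the digitizing rate, epoch length, number of epochs, band width, and the $(L - R)/(L + R)$ formula come from the text, while the raw periodogram without windowing, the synthetic 10-Hz test signals, the base-10 logarithm, and the name `band_power` are assumptions made for the example.

```python
import numpy as np

FS = 100.0      # digitizing rate in Hz (from the text)
EPOCH_S = 8     # epoch length in seconds (from the text)
N_EPOCHS = 3    # consecutive epochs averaged (from the text)
BAND_HZ = 2.0   # band width in Hz (from the text)
MAX_HZ = 30.0   # upper limit of the analysis (from the text)


def band_power(channel):
    """Average power in 2-Hz bands from 0 to 30 Hz over three 8-s epochs."""
    n = int(FS * EPOCH_S)
    epochs = channel[: n * N_EPOCHS].reshape(N_EPOCHS, n)
    # Raw periodogram of each epoch; no window or detrending (an assumption).
    spectrum = np.abs(np.fft.rfft(epochs, axis=1)) ** 2 / n
    freqs = np.fft.rfftfreq(n, d=1.0 / FS)
    edges = np.arange(0.0, MAX_HZ + BAND_HZ, BAND_HZ)
    # Integrate the spectrum over each band, separately for each epoch.
    bands = [spectrum[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
             for lo, hi in zip(edges[:-1], edges[1:])]
    return np.mean(bands, axis=1)  # average over epochs, shape (15,)


# Synthetic left and right channels: a 10-Hz rhythm, stronger on the left, plus noise.
rng = np.random.default_rng(1)
t = np.arange(int(FS * EPOCH_S * N_EPOCHS)) / FS
left = 2.0 * np.sin(2 * np.pi * 10.0 * t) + rng.normal(size=t.size)
right = 1.0 * np.sin(2 * np.pi * 10.0 * t) + rng.normal(size=t.size)

psd_left, psd_right = band_power(left), band_power(right)
asymmetry = (psd_left - psd_right) / (psd_left + psd_right)  # (L - R) / (L + R)
log_left, log_right = np.log10(psd_left), np.log10(psd_right)  # log base assumed
```

With 15 bands per channel this yields the 45 variables per case (15 left PSDs, 15 right PSDs, 15 asymmetries). The asymmetry score is bounded between -1 and +1, and in this synthetic case it is clearly positive only in the 10 to 12 Hz band, where the left channel carries more power.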
All variables were screened for distribution and outliers on an IBM 360/91 at the UCLA Health Sciences Computing Facility (HSCF). Factor analysis, including both factor extraction and rotation, was done using the BMDP4M program at HSCF (Dixon, 1975). Factor analysis was performed in several different ways during the evaluation of the data from the three patient groups. For each group, principal components analysis was first carried out for all 45 variables (15 left PSDs, 15 right PSDs, and 15 asymmetries). Variable sets were also subdivided for individual factor analyses for left power alone, right power alone, and asymmetry alone. All analyses were conducted for only one patient group at a time. These various analyses disclosed a set of factors which was consistent between the different groups, and which was consistent with previous findings (Coger et al., 1976; Defayolle and Dinand, 1974; Dymond et al., 1975; Dolce and Decker, 1975).

Results of a factor analysis which was representative of all groups are shown in Fig. 1. These factors resulted from factor analysis of all 45 variables from the alcoholic patients. The figure shows the factor loading coefficients for the seven factors plotted as a function of frequency. For clarity, some variables which had no significant loadings on a given factor are not shown. The order of appearance of a factor in the program, and therefore the amount of total variance accounted for by the factor, is shown to the left of each graph. The graphs have
[Figure 1 graphic: seven stacked panels, from top to bottom F2, F4, F7, F1, F5, F6, F3. Vertical axes are marked at 0, 0.5, and 1.0; the horizontal axis is FREQUENCY (2 HZ BANDS), from 2 to 30 Hz. Legend: LEFT POWER, RIGHT POWER, ASYMMETRY.]
FIG. 1. Seven factors found from principal components analysis of C3-O1 and C4-O2 EEG power and asymmetry in 2-Hz bands up to 30 Hz in 44 stabilized alcoholics. Each graph is one of the factors and shows factor loading coefficients (the correlation between the factor and the variables) plotted versus frequency. Factor loadings for asymmetry on the power factors and vice versa are not shown for simplicity since they were small. The factor numbers at the left of each graph show the order of factor appearance in the analysis. The horizontal bar at the bottom of each graph indicates the group of variables selected as represented by the factor.

been arranged sequentially according to the frequency bands accounted for by the factors. The horizontal bar at the bottom of each graph corresponds to the frequency band having high loadings on the factor. The top graph in Fig. 1 represents a grouping of EEG PSD data below 8 Hz. The second factor is composed of the EEG PSDs between 6 and 12 Hz, and can be identified as overlying the alpha band. The third factor, which was the last factor to come out of the analysis, covers the range between 12 and 20 Hz, and is associated with beta-1 activity. The fourth factor illustrates a grouping of EEG activity between 12 and 30 Hz, and can be labeled as overall beta activity. All
four of the factors are very similar to those found in previous work, and are consistent with the EEG bands which have become traditional, based on visual observation in classical EEG studies. The last three graphs in Fig. 1 show three factors associated with C-O asymmetry. These three factors have been tentatively assigned the names of low-range, mid-range, and high-range asymmetry. Although there is not as much past experience to use in judging asymmetry factors as there is for PSD factors, these three factors appear reasonable, and can be accepted, at least on a temporary basis.

To evaluate these seven factors, it is necessary to know how well they account for the information contained in the original 45 variables. Also, if the factors are to be used to compare the three groups, it is necessary to know how well they apply to each of the groups. There is no simple way to completely answer these questions, but there are several examinations which can be performed to provide some practical information. For example, as they are returned from the program, the factors are orthogonal. However, this orthogonality is based on the computation of factor scores using all of the n (= 45) variables in the analysis, as
$$F_{pi} = \sum_{j=1}^{n} b_{pj} z_{ji},$$
where $F_{pi}$ is the score for the pth factor of the ith case, $b_{pj}$ is the factor score coefficient for the pth factor and the jth variable, and $z_{ji}$ is the jth variable for the ith case. Since the factors represent smaller subsets of variables, it is appropriate to recompute the factor scores based on these variables alone. Factor scores computed in this way can then be used to retest the orthogonality of the factors. If the factors are well represented by the high loading variables, orthogonality should be preserved.

The success of the factor analysis can also be evaluated by examining the percentage of the original data variance accounted for by the factors. This can be estimated by computing the multiple correlation coefficients squared ($R^2$) between each of the variables and the factors. If the analysis had been completely successful and all the variance in the original data was accounted for, the $R^2$ for each of the variables would be equal to unity, and the sum of all the $R^2$ values would be equal to the total number of variables analyzed. In practice, the $R^2$ for each variable is less than unity, and their sum divided by the total number of variables then gives a measure of the total variance in the original data set being accounted for by the factors.

All of these tests were applied to the data. The factor analysis shown in Fig. 1 was used as the basis to compute factor scores for each of the three patient groups. The variables for a group were converted to standard scores independently of the other groups. The same set of factor score coefficients was used to score each group. These coefficients were computed by multiple correlation between each factor and only those variables considered to be reflected in the factor, as indicated by the horizontal bars in Fig. 1. The alcoholic group's variables were used to calculate these coefficients.
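The checks just described (recomputing each factor score from its high-loading variables only, retesting orthogonality, and averaging the $R^2$ values) can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the BMDP4M computation: the data are three independent latent sources with four noisy variables each, the rotation is raw Varimax without Kaiser normalization, the 0.5 loading cut-off stands in for the horizontal bars of Fig. 1, and every name (`varimax`, `restricted`, `variance_explained`) is hypothetical.

```python
import numpy as np


def varimax(loadings, n_iter=100, tol=1e-8):
    """Raw Varimax rotation by the usual SVD iteration (no Kaiser normalization)."""
    n_vars, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    criterion = 0.0
    for _ in range(n_iter):
        rotated = loadings @ rotation
        target = rotated ** 3 - rotated * (rotated ** 2).sum(axis=0) / n_vars
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        if s.sum() < criterion * (1.0 + tol):
            break
        criterion = s.sum()
    return loadings @ rotation


rng = np.random.default_rng(2)
n_cases, n_factors, per_cluster = 300, 3, 4

# Synthetic data (assumed): each latent source drives four noisy variables.
latent = rng.normal(size=(n_cases, n_factors))
data = np.repeat(latent, per_cluster, axis=1)
data = data + 0.6 * rng.normal(size=data.shape)
z = (data - data.mean(axis=0)) / data.std(axis=0)  # standard scores

# Principal components of the correlation matrix, then rotation of the loadings.
corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1][:n_factors]
loadings = varimax(eigvecs[:, order] * np.sqrt(eigvals[order]))

# Scores from all variables (regression method); these are exactly orthogonal.
full_scores = z @ np.linalg.solve(corr, loadings)

# Restricted scores: regress each full score on its high-loading variables only.
restricted = np.empty_like(full_scores)
for m in range(n_factors):
    members = np.abs(loadings[:, m]) > 0.5  # hypothetical cut-off
    coef, *_ = np.linalg.lstsq(z[:, members], full_scores[:, m], rcond=None)
    restricted[:, m] = z[:, members] @ coef

# Orthogonality retest: correlations among the restricted factor scores.
score_corr = np.corrcoef(restricted, rowvar=False)

# R^2 of every variable on the restricted scores, and their mean.
beta = np.linalg.lstsq(restricted, z, rcond=None)[0]
residual = z - restricted @ beta
r_squared = 1.0 - (residual ** 2).sum(axis=0) / (z ** 2).sum(axis=0)
variance_explained = r_squared.mean()
```

In this sketch the off-diagonal entries of `score_corr` play the role of a factor score correlation matrix, and `variance_explained` is the sum of the $R^2$ values divided by the number of variables. Because the synthetic clusters are independent by construction, the restricted scores stay close to orthogonal; real EEG variables with overlapping bands need not behave so cleanly.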
Table 1 shows the factor score correlation matrices for each of the three subject groups. Examination of these matrices indicates fairly good factor orthogonality. The few factors which do approach significant correlations are generally from adjacent frequency bands. As examples, factors 1 and 7, and factors 3 and 6 of the schizophrenic group show the highest correlations. Referring to Fig. 1, factors 1 and 7 both involve beta activity and overlap in frequency. Factors 3 and 6 are from adjacent asymmetry bands, and also show some overlap in the factor loading coefficients. Since only a few correlations approach significance, the conclusion is that, in practice, the factors are all satisfactorily orthogonal.

Multiple correlation coefficients squared ($R^2$) were next calculated between the factors and each of the variables, for each of the groups. These $R^2$ values were summed separately for the left power variables, the right power variables, and the asymmetries, and the results are shown in Table 2. Also shown in Table 2 are the estimates of the variance explained obtained from the eigenvalues produced during factor analysis of each of the same sets of variables. Referring

TABLE 1
Factor Score Correlation Matrices for the Three Subject Groups (a)
                    F1       F2       F3       F4       F5       F6       F7
Controls
  F1             1.000
  F2             0.117    1.000
  F3            -0.214   -0.040    1.000
  F4             0.376    0.282    0.071    1.000
  F5            -0.144   -0.018    0.331   -0.059    1.000
  F6            -0.110   -0.283    0.330   -0.283    0.503    1.000
  F7             0.089    0.387   -0.158    0.257    0.156    0.174    1.000
Alcoholics
  F1             1.000
  F2             0.149    1.000
  F3             0.032    0.116    1.000
  F4             0.332    0.155   -0.006    1.000
  F5             0.021    0.082    0.311   -0.035    1.000
  F6             0.060    0.031    0.368    0.057    0.142    1.000
  F7             0.324    0.191    0.066    0.214   -0.116   -0.091    1.000
Schizophrenics
  F1             1.000
  F2             0.338    1.000
  F3            -0.298    0.196    1.000
  F4             0.442    0.357    0.219    1.000
  F5             0.060    0.224    0.303    0.277    1.000
  F6            -0.379    0.036    0.619    0.269    0.564    1.000
  F7             0.619    0.469   -0.180    0.585    0.159   -0.163    1.000

(a) These scores were calculated using the same factor score coefficients and only the variables considered to be involved in a given factor, as is indicated by the horizontal bars in Fig. 1.
TABLE 2
Percent Variance Explained, Based on Summed Eigenvalues and Summed R^2 Values, for Left PSDs, Right PSDs, and Asymmetry Variable Sets in the Three Subject Groups (a)

                       Left PSD                Right PSD               Asymmetry
                  Summed       Summed     Summed       Summed     Summed       Summed
                  eigenvalues  R^2        eigenvalues  R^2        eigenvalues  R^2
Controls              80         75           81         74           70         53
Alcoholics            83         85           81         83           66         66
Schizophrenics        86         88           86         88           76         81
(a) The eigenvalues were obtained from principal components analysis of each separate set of variables. The $R^2$ values were obtained from multiple correlation between each of the variables and all of the factors for that subject group, with all factor scores being based on the same set of factor score coefficients.

to the summed $R^2$ values, the left and right power variables are similar and account for slightly more than 80% of the variance for the alcoholics and schizophrenics, and slightly less than 80% of the variance for the control subjects. Eighty-one percent of the variance for the schizophrenics' asymmetry data is accounted for, while smaller amounts are found for the alcoholics and controls. Comparison of the variances explained from the summed eigenvalues and from the summed $R^2$ values shows a close correspondence, except for the control group asymmetry data. However, the higher value seen for these eigenvalues is based on a return of five factors with significant eigenvalues from the factor analysis. Examination of the factors suggested that three of them were comparable to the asymmetry factors in the other two groups, while two of them seemed to be collections of scattered variables. If only the factors which appeared meaningful are considered, the proportion of the variance explained is in accord with the value obtained from the summed $R^2$ values. The general agreement between variance explained from summed eigenvalues and from summed $R^2$ values suggests that the factors are indeed accounting for a large proportion of the total variance, and that the method of computing factor scores, based on only those variables considered to be associated with the factors, is appropriate.
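Why the two estimates should agree can be made explicit for an idealized case. The following is a sketch under stated assumptions rather than a description of the computation reported in Table 2: a single principal components analysis of $n$ standardized variables, $k$ retained components with eigenvalues $\lambda_1, \dots, \lambda_k$, orthogonally rotated loadings $a_{jm}$, and factor scores computed from all $n$ variables so that they are exactly uncorrelated. Each loading is then the correlation between a variable and a factor, and

```latex
R_j^2 = \sum_{m=1}^{k} a_{jm}^2 ,
\qquad
\sum_{j=1}^{n} R_j^2
  = \sum_{j=1}^{n} \sum_{m=1}^{k} a_{jm}^2
  = \sum_{m=1}^{k} \lambda_m ,
\qquad
\frac{1}{n} \sum_{j=1}^{n} R_j^2 = \frac{1}{n} \sum_{m=1}^{k} \lambda_m .
```

Under these assumptions the two percentages coincide exactly, since an orthogonal rotation leaves each variable's sum of squared loadings unchanged. In Table 2 they can differ because the eigenvalues come from separate analyses of each variable set, while the $R^2$ values use factor scores built only from the variables associated with each factor and from one common set of coefficients.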
The amount of actual variance being accounted for by these procedures is probably higher than indicated since part of the original variance in the raw variables was due to random sources. The differences in variance explained for the PSD data between the different patient groups are not large. The differences between the variance explained for the asymmetry data between the schizophrenics and the other groups may suggest that although the same asymmetry factors are present in each group, the strengths of the factors differ. This possibility is supported by inspection of the asymmetry variable correlation matrices corresponding to the three asymmetry factors for the three groups, which show average correlations paralleling the percentages of variance explained by the asymmetry factors.

The PSD factors correspond to power below 8 Hz, alpha power, beta-1 power,
and overall beta power. Defayolle and Dinand (1974) found essentially the same four PSD factors, using similar techniques in a group of normal subjects. The nature of the first factor, 0 to 8 Hz PSDs, requires further consideration since it overlaps the traditional delta (0-4 Hz) and theta (4-8 Hz) bands. Defayolle and Dinand chose to name this factor as theta, but in any event, since this 0- to 8-Hz factor has been found by them, by Dolce and Decker (1975), and in the present study, it appears that splitting this frequency range from C-O leads in these resting subjects into delta and theta bands is not necessary.

The general grouping of EEG frequencies into bands has been established for some time, and the practicality of using these bands in place of individual frequency components in later statistical tests has been demonstrated by the many investigations which have used frequency banding. Even with this approximate approach to data preprocessing, results superior to those obtained when using the original variables have been found (Larsen and Walter, 1970). Factor analysis provides an objective method of identifying the frequency bands, and provides a set of weighting coefficients to use in calculating a factor score which characterizes the data in the band.

Using an objective method to identify the correlated subsets of variables resulting from EEG spectral analysis may be of additional value in the study of CNS processes. There is no guarantee that the same groupings of variables will be found from different leads, different subject groups, or during different tasks. The presence of different factors under such conditions would be an interesting finding in itself. Finding factors unique to a particular experimental condition could provide a useful tool for separating this condition from others, or for assigning cases to it.
The same set of factors appears to exist under the recording conditions used for the three groups studied here, and the use of factor analysis seems to have satisfied the goals of preprocessing. There has been a substantial reduction in the

[Figure 2 graphic: average factor scores for controls, alcoholics, and schizophrenics. Left and right EEG power: F2 (0-8 Hz), F4 (6-12 Hz), F7 (12-20 Hz), F1 (12-30 Hz). EEG asymmetry: F5 (0-6 Hz), F6 (6-14 Hz), F3 (14-30 Hz).]
FIG. 2. Comparison of average factor scores for the three subject groups on the seven factors found to describe their resting C-O PSD and asymmetry variables. For this comparison, all original variables were first converted to standard scores based on the control group's variables and then factor scores were calculated using the same set of factor score coefficients derived from the analysis shown in Fig. 1.
dimensionality of the data, with 45 original variables being reduced to seven factors. Unlike the original variables, the factors have very little correlation between them. Very little of the original data has been lost in the preprocessing, as is indicated by the generally high values of the variance explained. The mathematical and physiological manageability of the seven factors is illustrated by Fig. 2, which compares the mean factor scores for the three groups. For this figure only, the original variables' standard scores used to compute the factor scores for each group were calculated based on the means and standard deviations of the control group's variables. The factor score coefficients were the same as before.

Preprocessing will allow the use of factor scores in place of the original variables in later statistical tests of the data. Factor scores should allow better results to be obtained than might be possible when using the original variables. Some preliminary studies using factor scores as input to discriminant analysis for separation of groups have suggested that this approach is very effective (Coger et al., 1976). It is proposed that data preprocessing techniques be further explored as an aid in EEG data analysis.

ACKNOWLEDGMENTS

This work was supported by the Medical Research Service of the Veterans Administration. Computing assistance was obtained from the Health Sciences Computing Facility, UCLA, supported by NIH Special Research Resources Grant RR-3. The authors wish to thank Ms. D. Adams and Dr. J. Frane for their advice and assistance.

REFERENCES

Benignus, V. A., Bunce, A., Bremner, F. J., Barrat, E. S., and Frazier, T. W. Significance test for characteristics among multiple spectrum estimates. Currents in Modern Biology 1970, 3, 269-276.
Blackman, R. B., and Tukey, J. W. The measurement of power spectra. New York: Dover, 1958.
Chen, C. Statistical pattern recognition. Rochelle Park, N.Y.: Hayden, 1973.
Coger, R. W., Dymond, A. M., and Serafetinides, E. A. Classification of psychiatric patients with factor analytic EEG variables. Proceedings of the San Diego Biomedical Symposium 1976, 15, 279-284.
Defayolle, M., and Dinand, J. P. Application de l'analyse factorielle a l'etude de la structure de l'EEG. Electroencephalography and Clinical Neurophysiology 1974, 36, 319-322.
Dixon, W. J. Biomedical computer programs. Los Angeles: Univ. of California Press, 1975.
Dolce, G., and Decker, H. Application of multivariate statistical methods in analysis of spectral values of the EEG. In G. Dolce and H. Kunkel (Eds.), CEAN--Computerized EEG analysis. Stuttgart: Gustav Fisher Verlag, 1975.
Dymond, A. M., Coger, R. W., and Serafetinides, E. A. Extension to three psychiatric populations of a factor analysis grouping of ongoing EEG and visual evoked response variables. Proceedings of the 28th Annual Conference on Engineering in Medicine and Biology 1975, 17, 51.
Harmon, H. H. Modern factor analysis. Chicago: Chicago Univ. Press, 1967.
Jenkins, G. M., and Watts, D. G. Spectral analysis and its applications. San Francisco: Holden-Day, 1968.
Larsen, L. E., and Walter, D. O. On automatic methods of sleep staging by EEG spectra. Electroencephalography and Clinical Neurophysiology 1970, 28, 459-467.
Meisel, W. S. Computer-oriented approaches to pattern recognition. New York: Academic Press, 1972.