PSYCHOMETRIKA--VOL. 10, NO. 3 SEPTEMBER, 1945
APPROXIMATE METHODS IN CALCULATING DISCRIMINANT FUNCTIONS
GEOFFREY BEALL
INSTITUTE OF PAPER CHEMISTRY, APPLETON, WISCONSIN*
Approximate methods of solving for discriminant functions have been tried on three sets of data. The principal illustration is the problem of finding a weighted sum of scores, on four psychological tests, so that men and women may be distinguished most clearly. The work starts from the complete solution, due to R. A. Fisher, where it is necessary to solve as many simultaneous equations, dependent on the standard deviations of the tests and their mutual correlations, as there are tests. It is proposed, by way of numerical simplification, that a set of equations be substituted where some one quantity replaces all the correlations. A solution is obtained where the weights to be assigned the tests are very simply expressed in terms of differences between the mean values of tests, the standard deviations of tests, and the said quantity. The difficulty remains of finding an estimate of the arbitrary constant that will give good discrimination. If an optimal solution is made, a result is obtained which, in the three sets of data considered, is almost indistinguishable from that yielded by the complete solution. The calculation of this optimal common quantity is, however, itself so considerable that another estimate, previously suggested by R. W. B. Jackson, appears more profitable. This estimate is derived simply from the variability between the total scores for each subject and the variability of each test. Using this estimate, the discriminant functions can be rapidly calculated; the results compare very favorably, in the case of the data considered, with those from the complete solution.
1. The Problem of Discriminating Between Two Groups

The problem of discrimination, i.e., of so combining various considerations on a given object (or subject) that objects belonging to one group may be characterized to distinguish them most clearly from those of an alternate group, has been treated, in a general way, by Fisher (1937). Fisher illustrated his method with the discrimination of two species of Iris on the basis of sepal and petal size. The statistical problem is the same when it is required to discriminate between, say, normal and psychotic men on the basis of various psychological tests. The present work is concerned with approximate methods, on the line suggested by Jackson (1943), which may be justified by their facilitating practice.

The data used to establish the method of discrimination must consist of an observation on the $i$th ($i = 1, \ldots, m$) consideration of the $j$th ($j = 1, \ldots, n_1 + n_2$) object, as a score, $x_{ij}$, when the objects have been assigned to Group I or Group II, respectively, for

$$1 \le j \le n_1, \qquad n_1 + 1 \le j \le n_1 + n_2. \tag{1}$$

* The present work was done while the writer was employed by the Ontario Department of Health.
Fisher (1937) has solved the problem of discrimination by supposing that there is required a linear function,

$$y_j = \sum_{i=1}^{m} a_i x_{ij}, \tag{2}$$

subsequently termed the discriminant, where the values $a_i$ are to be chosen in such proportion that $D^2/S$ shall be maximal where, for the discriminant values, $D$ depends on the variability between groups and $S$ on the variability within groups. Specifically, set

$$D = \bar{y}_1 - \bar{y}_2 = \sum_{i=1}^{m} a_i (\bar{x}_{i1} - \bar{x}_{i2}), \tag{3}$$
where, for Group I,

$$\bar{y}_1 = \sum_{j=1}^{n_1} y_j / n_1 = \sum_{i=1}^{m} a_i \bar{x}_{i1} \tag{4}$$

is the mean discriminant value for Group I and

$$\bar{x}_{i1} = \sum_{j=1}^{n_1} x_{ij} / n_1 \tag{5}$$
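is the mean value of the $i$th consideration for Group I, and the values $\bar{y}_2$ and $\bar{x}_{i2}$ are similarly defined for Group II. Secondly,

$$S = (n_1 + n_2) \left\{ \sum_{i=1}^{m} a_i^2 s_i^2 + \sum_{i=1}^{m} \sum_{i'=1}^{m} a_i a_{i'} r_{ii'} s_i s_{i'} \right\} \tag{6}$$

for $i' \neq i$, where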
$$s_i = \left\{ \frac{1}{n_1 + n_2} \left[ \sum_{j=1}^{n_1} (x_{ij} - \bar{x}_{i1})^2 + \sum_{j=n_1+1}^{n_1+n_2} (x_{ij} - \bar{x}_{i2})^2 \right] \right\}^{1/2} \tag{7}$$
is the standard deviation within groups for the $i$th consideration and where

$$r_{ii'} = \frac{\sum_{j=1}^{n_1} (x_{ij} - \bar{x}_{i1})(x_{i'j} - \bar{x}_{i'1}) + \sum_{j=n_1+1}^{n_1+n_2} (x_{ij} - \bar{x}_{i2})(x_{i'j} - \bar{x}_{i'2})}{(n_1 + n_2)\, s_i s_{i'}} \tag{8}$$
is the corresponding correlation coefficient within groups. The required maximization is given by the proportional solution, for the quantities $a_i$, of the $m$ equations,

$$a_i s_i + \sum_{\substack{i'=1 \\ i' \neq i}}^{m} a_{i'} r_{ii'} s_{i'} = t_i, \tag{9}$$

where

$$t_i = (\bar{x}_{i1} - \bar{x}_{i2}) / s_i. \tag{10}$$
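As a check on (9) and (10), the following is a minimal sketch in Python, not part of the original paper, that solves the equations for the four-test example of Section 3. Multiplying (9) through by $s_i$ gives a linear system whose matrix is proportional to the within-group sums of squares and cross-products and whose right-hand side is the vector of mean differences. The sketch assumes that the Table 6 entries are, as the text accompanying that table says, multiples by $n$ of those sums, and it takes the mean differences from the column totals of Table 4. The helper `solve_linear` and the final rescaling are illustrative choices, since only the proportions of the $a_i$ matter.

```python
# n times the within-group sums of squares and cross-products for the four
# tests, transcribed from Table 6 and filled out symmetrically.
W = [
    [14214.0, 11998.0, 11295.0, 9326.0],
    [11998.0, 31534.0, 16849.0, 11618.0],
    [11295.0, 16849.0, 58243.0, 27738.0],
    [9326.0, 11618.0, 27738.0, 44284.0],
]

# Column totals of Table 4 for men and women; n = 32 subjects per group.
n = 32
men_totals = [511, 509, 870, 728]
women_totals = [395, 445, 533, 702]
d = [(m - w) / n for m, w in zip(men_totals, women_totals)]  # mean differences


def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (aug[r][size] - tail) / aug[r][r]
    return x


raw = solve_linear(W, d)
# Only the proportions matter, so rescale to the first published coefficient.
scale = 2.6344 / raw[0]
a = [scale * value for value in raw]
print([round(value, 4) for value in a])
```

Under these assumptions the remaining three coefficients should land close to the proportional solution quoted in Section 3.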
The work of setting up the numerical equations involved in (9) may be very heavy, since first $m(m + 1)/2$ sums of squares or of cross-products must be found and then $m$ equations must be solved. Indeed the procedure would probably become impracticable for $m$ greater than about 8.

2. Approximate Solutions
To avoid the work required by a complete solution of such a set of equations as (9), for $m$ great, Jackson (1943) has suggested on empirical grounds an approximate solution based on the assumption that for all $i$ and $i'$

$$r_{ii'} = r, \tag{11}$$

so that from (6)

$$S = (n_1 + n_2) \left\{ \sum_{i=1}^{m} a_i^2 s_i^2 + r \sum_{i=1}^{m} \sum_{\substack{i'=1 \\ i' \neq i}}^{m} a_i a_{i'} s_i s_{i'} \right\}. \tag{12}$$
If (12) be used instead of (6) and the ratio $D^2/S$ be maximized with respect to the coefficients $a_i$, we obtain, for any given value of $r$, the $m$ equations,

$$a_i s_i + r \sum_{\substack{i'=1 \\ i' \neq i}}^{m} a_{i'} s_{i'} = t_i, \tag{13}$$
in place of (9). The most convenient proportional solution of (13) is

$$a_i = \{ (1 - r)\, t_i + m r\, t'_i \} / s_i, \tag{14}$$

where

$$t'_i = t_i - \bar{t} \tag{15}$$

and

$$\bar{t} = \frac{1}{m} \sum_{i=1}^{m} t_i \tag{16}$$

is the mean value of $t_i$. Equation (14) makes the coefficients dependent on one variable, $r$, which must still be determined for any data. The best value of
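$r$ may be found by substituting from (14) in (3) and (6) for $a_i$ and then maximizing $D^2/S$ with respect to $r$, to get

$$r = \frac{m \bar{t}^2 C + m \bar{t} D - \bar{t} A B - A C}{-m^2 \bar{t} A + m \bar{t}^2 C - m(m - 1) \bar{t} D - \bar{t} A B + (m - 1) A C}, \tag{17}$$

where $A$, $B$, $C$, and $D$ are sums over $i$ and $i'$ of products of the quantities $t_i$, $t'_i$, and $r_{ii'}$ [definitions (18) illegible in the source].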
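The maximization described above can also be carried out numerically. The following is a rough sketch in Python, not the author's procedure: it replaces the closed form (17) by a brute-force search over a grid of common values $r$, using (14) for the coefficients. It assumes the Table 6 entries are $n$ times the within-group sums of squares and cross-products and takes the mean differences from the totals of Table 4; the function names `coefficients` and `ratio` and the grid step are illustrative.

```python
import math

n = 32  # subjects per group in the Section 3 example
m = 4   # number of tests

# n times the within-group sums of squares and cross-products (Table 6).
W = [
    [14214.0, 11998.0, 11295.0, 9326.0],
    [11998.0, 31534.0, 16849.0, 11618.0],
    [11295.0, 16849.0, 58243.0, 27738.0],
    [9326.0, 11618.0, 27738.0, 44284.0],
]
# Mean differences between men and women, from the totals of Table 4.
d = [(511 - 395) / n, (509 - 445) / n, (870 - 533) / n, (728 - 702) / n]

# Within-group standard deviations as in (7) and standardized differences (10).
s = [math.sqrt(W[i][i] / (2 * n * n)) for i in range(m)]
t = [d[i] / s[i] for i in range(m)]
t_bar = sum(t) / m


def coefficients(r):
    """Equation (14): a_i = {(1 - r) t_i + m r t'_i} / s_i."""
    return [((1 - r) * t[i] + m * r * (t[i] - t_bar)) / s[i] for i in range(m)]


def ratio(a):
    """Between-groups over within-groups sum of squares, n D^2 / (2 S)."""
    D = sum(a[i] * d[i] for i in range(m))
    S = sum(a[i] * W[i][j] * a[j] for i in range(m) for j in range(m)) / n
    return n * D * D / (2 * S)


# Brute-force search over a grid of common r values (a stand-in for (17)).
grid = [k / 1000 for k in range(1000)]
best_r = max(grid, key=lambda r: ratio(coefficients(r)))
print(best_r, round(ratio(coefficients(best_r)), 2))
```

A grid search is chosen here only because it needs none of the sums entering (17); for these data it should agree with the tabled value of $r$ from (17) to about two decimals.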
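Obviously, (17) cannot be used in practice, since it requires all the correlations $r_{ii'}$ that the approximation is meant to avoid. For practice, a useful relation is suggested by the fact that insofar as (12) can be identified with (6),

$$\sum_{i=1}^{m} \sum_{\substack{i'=1 \\ i' \neq i}}^{m} a_i a_{i'} r_{ii'} s_i s_{i'} = r \sum_{i=1}^{m} \sum_{\substack{i'=1 \\ i' \neq i}}^{m} a_i a_{i'} s_i s_{i'}, \tag{19}$$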
where the coefficients, $a_i$, are not limited. It is convenient to make

$$a_i = 1, \tag{20}$$

for all $i$, when from (19)

$$r = \frac{s_0^2 - \sum_{i=1}^{m} s_i^2}{s^2 - \sum_{i=1}^{m} s_i^2}, \tag{21}$$
where

$$s_0^2 = \frac{1}{n_1 + n_2} \left\{ \sum_{j=1}^{n_1} (x_j - \bar{x}_1)^2 + \sum_{j=n_1+1}^{n_1+n_2} (x_j - \bar{x}_2)^2 \right\}, \tag{22}$$

where

$$x_j = \sum_{i=1}^{m} x_{ij}, \tag{23}$$
$$\bar{x}_1 = \frac{1}{n_1} \sum_{j=1}^{n_1} x_j, \tag{24}$$

and $\bar{x}_2$ is similarly defined. The quantity $s_0^2$ is the squared standard deviation within the groups for the total score of each object (or subject) on all considerations, and

$$s = \sum_{i=1}^{m} s_i. \tag{25}$$
It will be seen that in (21) $r$ is the mean of the quantities $r_{ii'}$, weighted by the products $s_i s_{i'}$ of the standard deviations. The estimate of (21) is very similar to one previously proposed by Jackson (1943), i.e., his $r''$, which we shall write

$$r'' = \frac{z_0^2 - \sum_{i=1}^{m} z_i^2}{z^2 - \sum_{i=1}^{m} z_i^2}, \tag{26}$$
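where

$$z_i^2 = \frac{1}{n_1 + n_2} \sum_{j=1}^{n_1+n_2} (x_{ij} - \bar{x}_i)^2 \tag{27}$$

is the squared standard deviation over all the material for the $i$th consideration (embracing both groups, with $\bar{x}_i$ the mean of the $i$th consideration over all $n_1 + n_2$ objects),

$$z_0^2 = \frac{1}{n_1 + n_2} \sum_{j=1}^{n_1+n_2} (x_j - \bar{x})^2, \tag{28}$$

where

$$\bar{x} = \frac{1}{n_1 + n_2} \sum_{j=1}^{n_1+n_2} x_j, \tag{29}$$

is the squared standard deviation over all the material for the total score of each subject, and

$$z = \sum_{i=1}^{m} z_i. \tag{30}$$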
It will be seen that (26) differs from (21) in that the variability is calculated over all the $n_1 + n_2$ objects, rather than within the groups of $n_1$ and $n_2$, respectively. Jackson's estimate of (26) is related by

$$r'' = \frac{\sum_{i=1}^{m} \sum_{i' \neq i} s_i s_{i'} \left( r_{ii'} + t_i t_{i'} / 4 \right)}{\sum_{i=1}^{m} \sum_{i' \neq i} s_i s_{i'} \sqrt{(1 + t_i^2/4)(1 + t_{i'}^2/4)}} \tag{31}$$

to the values $r_{ii'}$.
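The agreement between (26) and (31) can be checked numerically on the men and women example of Section 3. The Python sketch below is an illustration under stated assumptions rather than the author's computation: it takes the Table 6 entries as $n$ times the within-group sums of squares and cross-products, takes the mean differences from the totals of Table 4, and relies on the two groups being of equal size, so that each overall variance is the within-group variance plus one quarter of the squared mean difference. All variable names are illustrative.

```python
import math

n = 32  # subjects per group
m = 4   # number of tests

# n times the within-group sums of squares and cross-products (Table 6).
W = [
    [14214.0, 11998.0, 11295.0, 9326.0],
    [11998.0, 31534.0, 16849.0, 11618.0],
    [11295.0, 16849.0, 58243.0, 27738.0],
    [9326.0, 11618.0, 27738.0, 44284.0],
]
# Mean differences between men and women, from the totals of Table 4.
d = [(511 - 395) / n, (509 - 445) / n, (870 - 533) / n, (728 - 702) / n]

s = [math.sqrt(W[i][i] / (2 * n * n)) for i in range(m)]   # as in (7)
t = [d[i] / s[i] for i in range(m)]                        # as in (10)
r_within = [[W[i][j] / math.sqrt(W[i][i] * W[j][j]) for j in range(m)]
            for i in range(m)]                             # as in (8)
pairs = [(i, j) for i in range(m) for j in range(m) if i != j]

# Equation (31), summed over all pairs with i' != i.
num = sum(s[i] * s[j] * (r_within[i][j] + t[i] * t[j] / 4) for i, j in pairs)
den = sum(s[i] * s[j] * math.sqrt((1 + t[i] ** 2 / 4) * (1 + t[j] ** 2 / 4))
          for i, j in pairs)
r_31 = num / den

# Equation (26), built from overall quantities (equal group sizes assumed).
z_i = [math.sqrt(s[i] ** 2 + d[i] ** 2 / 4) for i in range(m)]
z = sum(z_i)
sum_zi_sq = sum(v * v for v in z_i)
s0_sq = sum(W[i][j] for i in range(m) for j in range(m)) / (2 * n * n)
z0_sq = s0_sq + sum(d) ** 2 / 4
r_26 = (z0_sq - sum_zi_sq) / (z * z - sum_zi_sq)

print(round(r_31, 4), round(r_26, 4))
```

With these inputs the intermediate quantities `z0_sq`, `sum_zi_sq`, and `z` should reproduce, to rounding, the three figures quoted for this example in Section 3.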
TABLE 1
The Relative Success of Various Sets of Discriminant Coefficients

The ratio of the sum of squares between groups to the sum within groups

                  Complete    On basis of common r                                                 Using
                  solution                                    Limiting cases                       a_i = 1
Example           (9)         r of (17)  r of (21)  r'' of (26)  t'_i/s_i of (32)  t_i/s_i of (33)
Fisher (1937)     26.34       26.28      25.97      26.03        22.15             17.46           3.85
Travers (1939)     1.72        1.72       1.64       1.68         1.64              1.62            .43
Present Data       1.57        1.57       1.54       1.55         1.18               .89           [illegible]

The square root of the ratio of the sum of squares between groups to the total sum of squares, or the coefficient, R, of multiple correlation

                  Complete    On basis of common r                                                 Using
                  solution                                    Limiting cases                       a_i = 1
Example           (9)         r of (17)  r of (21)  r'' of (26)  t'_i/s_i of (32)  t_i/s_i of (33)
Fisher (1937)      .98         .98        .98        .98          .98               .97             .89
Travers (1939)     .80         .79        .79        .79          .79               .79             .55
Present Data       .78         .78        .78        .78          .74               .69             .56
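The two panels of Table 1 carry the same information: if the first panel gives the ratio of the between-groups to the within-groups sum of squares, then the coefficient $R$ of the second panel is the square root of that ratio divided by one plus the ratio, since the total sum of squares is the sum of the two parts. A small Python sketch of the conversion follows; the function name `multiple_r` is illustrative and the inputs are the complete-solution entries of the first panel.

```python
import math


def multiple_r(ratio):
    """Convert a between/within ratio into R = sqrt(ratio / (1 + ratio))."""
    return math.sqrt(ratio / (1.0 + ratio))


# Complete-solution ratios from the first panel of Table 1.
complete = [("Fisher (1937)", 26.34), ("Travers (1939)", 1.72), ("Present Data", 1.57)]
for example, ratio in complete:
    print(example, round(multiple_r(ratio), 2))
```

Rounded to two decimals, these conversions give the complete-solution column of the second panel.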
For three problems of discrimination the satisfaction to be obtained by determining variously the coefficients, $a_i$, was found with the results shown in Table 1. First there are data due to Fisher (1937) on the discrimination of two species of Iris, setosa and versicolor, by measurements on length and breadth of sepals and petals of 50 specimens of each species. Secondly, there are data due to Travers (1939), on the discrimination of engineers and air pilots by the scores of 20 men in each group on 6 tests involving understanding, co-ordination, etc. These data of Travers have already been reconsidered by Jackson (1943). Thirdly, there are some fresh data,* as shown in Table 4, on psychological tests given 32 men and 32 women. For each lot of data, discriminant coefficients were found from the complete solution of (9), from the best value of $r$, as given by (17), and from the estimate of $r$, as given by (21). Further, the estimated $r''$ of Jackson (1943), as in (26), was used throughout because he had already used it similarly for Travers' data. Also the relation

$$a_i = t'_i / s_i, \tag{32}$$

which is approached as $r$ or $m$ increases in (14), and

$$a_i = t_i / s_i, \tag{33}$$

which obtains for $r = 0$ in (14), were used. Finally, $a_i$ was determined as in (20). For each set of coefficients, for each lot of data, there was found the ratio of the sum of squares between means to that within groups; originally, as in connection with (2), it was desired to make this ratio maximal. In the present cases, where $n_1 = n_2 = n$, the ratio tabled is

$$n D^2 / 2S. \tag{34}$$

In addition the values of

$$R = \{ n D^2 / (2S + n D^2) \}^{1/2}, \tag{35}$$
as used by Jackson (1943), are shown. From Table 1, it is apparent that using $r$ of (17) the results were almost as good as from the complete solution, although the coefficients were not identical. Using $r$ of (21) a very good result was obtained; but Jackson's $r''$ of (26) worked even better in all three cases considered. The very convenient coefficient $t'_i/s_i$, of (32), gave a result not far from that obtained by (21) in (14), as was anticipated, since often in (14) the second term in the numerator must be predominant, when, effectively, (32) obtains. The equally convenient coefficient $t_i/s_i$, of (33), gave a poorer result. The result from (20) is not of particular interest except to show how bad discrimination may be when, as in school-room practice, the results of examinations are combined by simple addition.

The encouraging performance of a common estimate of $r$ may be appreciated better when one notes that for the present three examples the correlations for the various considerations were as shown in Table 2. The second case (Travers 1939) is rather a stringent test of the utility of an assumption of a common $r$ because, as will be seen in Table 2, the sign of the quantities $r_{ii'}$ varies. Travers notes that his fifth test was scored in a negative sense, i.e., "a low score indicates a good performance." In practice, where one knew such a condition to exist, one would, presumably, reverse the sign of such scores before considering a common value of $r$. If such a reversal is made in the present case, (21) gives $r = +.09$ and the ratio, as in Table 1, is but little affected. The estimates of a common $r$ from the two relations (17) and (21), together with Jackson's suggestion, $r''$ of (26), are shown in Table 3.

* The present data were kindly made available by Dr. L. S. Penrose.

TABLE 2
The Correlations of the Various Considerations

Fisher (1937)
Measurement      2       3       4
     1         +.60    +.64    +.47
     2                 +.38    +.46
     3                         +.71

Travers (1939)
Test             2       3       4       5       6
     1         +.25    +.05    +.07    -.02    +.38
     2                 +.08    -.04    +.03    -.41
     3                         -.05    +.14    -.19
     4                                 -.02    +.07
     5                                         -.02

New Data
Test             2       3       4
     1         +.57    +.39    +.37
     2                 +.39    +.31
     3                         +.55

TABLE 3
Various Estimates of a Common Value r

                 Fisher (1937)   Travers (1939)   New Data
r of (17)            +.44            +.20           +.54
r of (21)            +.55            +.06           +.48
r'' of (26)          +.36            +.14           +.47
3. An Illustrative Example

The results, as summarized in Table 1, suggest that in practice an investigator might use as discriminant coefficients the values $t'_i/s_i$, as of (32), a common value of $r$, as from (26) in (14), or the complete solution as in (9), depending on circumstances. Below are illustrated in some detail the three alternative procedures to be followed and the results to be obtained by each procedure. The example chosen is the discrimination of men from women by four psychological tests, as first mentioned in connection with Table 1. Test 1 is on pictorial absurdities, 2 on paper form board, 3 on tool recognition, and 4 on vocabulary. The data on women are for 32 applicants for a professional position requiring 10 or more years of successful schooling (the completion of second-year high school in Ontario, up to a University degree). Against these data were set results for men chosen with the same academic restrictions; from a very large group, 32 men were drawn randomly. The 4 tests were each scored according to the number of questions answered successfully. The correlations between the tests are shown in Table 2. The data are set out for reference in Table 4. The distribution of the original measurements by sex may be of some interest and is set out in Figure 1. It will be noted that the main discrimination is by Test 3.

FIGURE 1*
The Distribution of the Original Measurements on Four Tests, by Sex.

* The present illustrations are due to Mr. J. K. McClung of the Ontario Department of Health.

TABLE 4
The Scores of 32 Men and 32 Women on Four Psychological Tests

                Men                              Women
Test      1    2    3    4   Total         1    2    3    4   Total
         15   17   24   14     70         13   14   12   21     60
         17   15   32   26     90         14   12   14   26     66
         15   14   29   23     81         12   19   21   21     73
         13   12   10   16     51         12   13   10   16     51
         20   17   26   28     91         11   20   16   16     63
         15   21   26   21     83         12    9   14   18     53
         15   13   26   22     76         10   13   18   24     65
         13    5   22   22     62         10    8   13   23     54
         14    7   30   17     68         12   20   19   23     74
         17   15   30   27     89         11   10   11   27     59
         17   17   26   20     80         12   18   25   25     80
         17   20   28   24     89         14   18   13   26     71
         15   15   29   24     83         14   10   25   28     77
         18   19   32   28     97         13   16    8   14     51
         18   18   31   27     94         14    8   18   25     65
         15   14   26   21     76         13   16   23   28     80
         18   17   33   26     94         16   21   26   26     89
         10   14   19   17     60         14   17   14   14     59
         18   21   30   29     98         16   16   15   23     70
         18   21   34   26     99         13   16   23   24     76
         13   17   30   24     84          2    6   16   21     45
         16   16   16   16     64         14   16   22   26     78
         11   15   25   23     74         17   17   22   28     84
         16   13   26   16     71         16   13   16   14     59
         16   13   23   21     73         15   14   20   26     75
         18   18   34   24     94         12   10   12    9     43
         16   15   28   27     86         14   17   24   23     78
         15   16   29   24     84         13   15   18   20     66
         18   19   32   23     92         11   16   18   28     73
         18   16   33   23     90          7    7   19   18     51
         17   20   21   21     79         12   15    7   28     62
         19   19   30   28     96          6    5   [illegible]
Total   511  509  870  728   2618        395  445  533  702   2075

The discrimination to be undertaken would depend in large part on practical considerations. If the data were preliminary or the investigator were pressed, he might decide to use simply the values $t'_i/s_i$ (multiplied by 10 in the present case) as discriminant coefficients and could anticipate reasonably good results. Then for the data of Table 4, he would need only make the calculations as shown in the first panel of Table 5. If the investigator could go to more pains, he might estimate, using (26) from the data of Table 4,

$$r'' = \frac{231.12669 - 104.57690}{(19.40928)^2 - 104.57690} = +.4650,$$
where $z_0^2 = 231.12669$ is calculated from the totals over all 4 tests for individuals, i.e., from the 5th and 10th columns of Table 4; $\sum_{i=1}^{4} z_i^2 = 104.57690$ is based on the preliminary calculation of standard deviations within the 1st and 6th columns, etc., of Table 4; and $z = 19.40928$, the sum of the standard deviations as in (30), is derived similarly. Using the estimate, $r'' = +.4650$, in (14), the calculations shown in the second panel of Table 5 are necessary. The values, $a_i$, again include a factor of 10. For both the sets of $a_i$, the ratio of sum of squares between and within groups is, of course, shown in Table 1. The discriminant distributions have been, further, plotted in Figure 2, to aid in appreciation of the discrimination obtained.

TABLE 5
The Calculation of Discriminant Coefficients for Two Approximate Solutions

The coefficients a_i = t'_i/s_i of Equation (32)
Test     t_i       t'_i      s_i       a_i = t'_i/s_i
  1     1.3760    +.3672    2.6345     +1.3938
  2      .5097    -.4991    3.9291     -1.2703
  3     1.9748    +.9660    5.3328     +1.8114
  4      .1747    -.8341    4.6501     -1.7937

The coefficients on the common r of (26) in (14)
Test     (1 - r)t_i    m r t'_i    a_i s_i = (1 - r)t_i + m r t'_i     a_i
  1       +.7362        +.6830        +1.4192                        +5.3870
  2       +.2727        -.9283         -.6556                        -1.6686
  3      +1.0565       +1.7968        +2.8533                        +5.3505
  4       +.0935       -1.5514        -1.4579                        -3.1352

Finally, conditions might be such that the investigator would elect to make the complete solution of (9). It would then be necessary to find the values $s_i$ and $r_{ii'} s_{i'}$ from (7) and (8). The quantities $n^2 r_{ii'} s_i s_{i'}$ and $n^2 s_i^2$ (multiples by $n$ of the sums of squares and cross-products about group means), as shown in Table 6, are most conveniently first found. Substituting from this table in (9), a proportional solution is as follows:

$$a_1 = +2.6344, \quad a_2 = -1.0493, \quad a_3 = +2.4054, \quad a_4 = -1.5983.$$

TABLE 6
Values of n^2 s_i^2 and n^2 r_ii' s_i s_i' Required for the Complete Discriminant

Test        1          2          3          4
  1      14,214
  2      11,998     31,534
  3      11,295     16,849     58,243
  4       9,326     11,618     27,738     44,284

For the three sets of discriminant coefficients discussed, the distribution of the resulting discriminant values is shown in Figure 2. To aid comparison, the mean value, $(\bar{y}_1 + \bar{y}_2)/2$, i.e., the point halfway between the means for males and females, respectively, has been made the same, as has also the value of $S$, i.e., the sum of squares within groups. It will be seen how the two constituent distributions separate as the discriminant function is improved and also how the constituent distributions become compact, particularly in comparison with the corresponding result for Test 3 in Figure 1. It would, of course, be unnecessary to calculate the actual discriminant values, used to obtain Figure 2, in many and probably most problems that would be treated.

FIGURE 2
The Discrimination Obtained by Various Methods of Calculating the Coefficients

4. Summary

The weighted sum of observations on an object, so that the distinction between two groups to which such objects may have been assigned is great, has been found variously. By assuming that the correlation between various observations (as between various psychological tests) is common, results have been obtained that permit two groups to be distinguished readily and effectively; the discrimination approaches closely the maximum possible with the data given in these problems. A still more simple weighting, involving simply differences between means for the two groups on the various scores, is also very satisfactory.

REFERENCES

Fisher, R. A. The use of multiple measurements in taxonomic problems. Ann. Eugenics, 1937, 7, 179-188.
Jackson, R. W. B. Approximate multiple regression weights. J. exp. Education, 1943, 11, 221-225.
Travers, R. M. W. The use of a discriminant function in the treatment of psychological group differences. Psychometrika, 1939, 4, 25-32.