Fresenius Zeitschrift für Analytische Chemie
Fresenius Z. Anal. Chem. 298, 97-109 (1979)
© by Springer-Verlag 1979

Information Theory Applied to Qualitative Analysis

P. Cleij and A. Dijkstra
Laboratory of Analytical Chemistry, State University of Utrecht, Croesestraat 77A, 3522 AD Utrecht, Netherlands
Summary. Fundamental aspects of information theory and its usefulness for qualitative analysis are discussed in view of the applications described so far in the literature. General expressions for the uncertainty, information, equivocation and information content are described and related to the more commonly used expressions for the information parameters.
It is shown that the information content (or equivocation) can serve, with certain restrictions, as an important criterion for the evaluation, selection or optimization of analytical procedures. The information theoretical concepts are easy to apply to one-channel identification or classification procedures. However, when multi-channel procedures (combinations of one-channel procedures) are involved, a straightforward use of information theory is to a large extent restricted by excessive computation. Therefore, the further development of simple methods for calculating reliable estimates of information contents, and of optimization strategies requiring the calculation of only small numbers of information contents, is of great importance for the future use of information theory in qualitative analysis.
Key words: Information theory in qualitative analysis
1. Introduction

According to a recent definition, Analytical Chemistry should provide optimal strategies for the acquisition of relevant information about material systems [18]. This is an example of the use of the term information in the analytical chemical literature. Often the term is used in a colloquial and qualitative sense, but the quantification of information in Analytical Chemistry has also been the subject of several studies. However, a comparison of the various approaches to the quantification of information is difficult. The terminology introduced is far from uniform and is thoroughly confusing. Terms such as (amount of) information, (average) information content, average information, specific information, (information) entropy and mutual information, partly covering the same quantities, have been introduced. The principles used for the quantification of these information parameters mainly originate
from information and communication theory [31]. Although analyzing can be compared in a way with communicating, a straightforward translation of an analytical process into a communication process is not (always) possible. Therefore, the analytical chemist has to be aware of a number of pitfalls when applying information theory to analytical problems. In addition, it can be questioned whether and to what extent quantification of information is possible and useful for analytical practice. The aim of this paper is to evaluate the applications of information theory to analytical problems. As far as possible the treatment presented here will be related to concepts that have already been introduced in the analytical literature. The discussion will be limited to the application in the area of qualitative analysis of pure compounds (or elements). The main reason for the limitation to qualitative analysis is that, in our opinion, quantitative analysis can be described adequately by a set of more or less generally accepted characteristics (precision, accuracy, selectivity, etc.). Application of information theory to quantitative analysis yields a performance characteristic which is not essentially different from the precision (or signal-to-noise ratio). It probably does contribute to the philosophy of analyzing and Analytical Chemistry, but not to a better use of existing procedures, or to the development of improved techniques for quantitative analysis. The application of information theoretical principles to qualitative analysis, however, can, in our opinion, lead to a better description and handling of identification problems and of procedures suitable for solving such problems. A qualitative analysis leads to either an identification or a classification of an unknown compound. For practical reasons only the identification will be treated explicitly. However, the theory described in this paper can easily be applied to classification problems with only minor changes in terminology. Generally speaking, one-dimensional physical quantities (melting points, retention indices, etc.) or spectra are used for identification. It is evident that in principle the number of chemical compounds or elements that can be distinguished by these physical quantities or spectra increases when the signal-to-noise ratio and/or the resolution of the (spectro)meter is increased. This has already been discussed adequately by Kaiser [24], who introduced for this purpose the term informing power. The informing power can be calculated by application of the sampling theorem as used in communication theory [3]. It essentially is a measure of the number of spectra that in principle might be distinguished. However, in practice not every spectrum that might be conceived will be found. The shape of the bands or peaks, the (non-)simultaneous occurrence
(correlations) of certain peaks and empty spectral regions will reduce the number of different spectra and thus the number of compounds that can be distinguished by their spectra. For that reason the treatment presented here is based upon Shannon's theory of information [31] and as such only the spectra (or values of physical quantities) met in practice are taken into account. A term frequently met in the literature is the term pre-information. It is used for the information known prior to the analysis, and usually it is considered to reduce the efforts required to arrive at a solution of the analytical problem. This pre-information will play an important role in this theoretical approach to qualitative analysis. In addition, a (statistical) description of the analytical procedure and a method of interpreting analytical signals will be important items.
2. Qualitative Analysis and Uncertainty

As mentioned before, only the identification of pure compounds (or elements) will be considered. In such cases the goal of an analysis is to achieve a (nearly) unambiguous assessment of the identity of the unknown compound. Aiming at such an identification can be considered as aiming at a (nearly) zero uncertainty pertaining to the identity of the analysed sample. This statement implies that the uncertainty remaining after analysis can be regarded as an important parameter to describe the quality of the analytical result. (In quantitative analysis the corresponding uncertainty is described by the precision.) Similarly, the uncertainty parameter can be used to describe the situation before analysis. Consequently an analysis can be regarded as a process of changing the degree of uncertainty with respect to the unknown identity. The goal of an analysis is then to gain a maximum decrease of uncertainty. This implies that the decrease of uncertainty also describes the (in)adequacy of the analysis. It seems reasonable to state that the process of reducing the uncertainty corresponds with a process of gathering information. Therefore the decrease of uncertainty will be considered as a measure for the (amount of) information obtained. Consequently, the (in)adequacy of an analysis can be described by two essentially equivalent parameters, i.e. the (amount of) uncertainty after analysis and the decrease of uncertainty or the (amount of) information.

3. The Analytical Problem

From an information theoretical point of view the most important aspect of the qualitative analytical problem is the pre-information about the identity of the unknown compound ($X$). This pre-information should be used to specify the analytical problem in terms of a set
of possible identities $X_1, X_2, \ldots, X_i, \ldots, X_n$ together with the set of a priori probabilities $p(X_1), p(X_2), \ldots, p(X_i), \ldots, p(X_n)$. Here $p(X_i)$ is the probability (before analysis) of the identity of the unknown compound being $X_i$. Pre-information can be gathered in several ways. An analytical laboratory usually operates in a more or less specified environment. From the laboratory records (the history of the laboratory) it may be concluded that certain compounds are more frequently identified than others. Moreover, often it is known that an unknown compound belongs to a certain class, for instance pesticides or hydrocarbons. Also the pre-information can be drawn from a preliminary analysis or from the origin of the unknown compound, leaving only a limited number of a priori possible identities. Instances where pre-information is not available at all can be considered as extremely rare. It is often impossible to formulate the analytical problem in terms of probabilities that are based upon relative frequencies. In such instances probabilities should be interpreted in a Bayesian way [29], i.e. as a measure of subjective expectation or a subjective estimate of relative occurrence, deduced from qualitative or semi-quantitative pre-information. The magnitudes of such subjective or 'personal' probabilities will depend not only upon the quantity and quality of one's personal pre-information, but also upon one's personal judgement. Subjective a priori probabilities can explain different interpretations of the same analytical result for the same unknown compound by different analysts. It has to be emphasized that the quantification of the a priori probabilities is necessary in order to quantify several essential parameters in information theory, as will be shown in the next paragraphs. The subjective aspect of this quantification causes the rather subjective character of information theory, and it is mainly at this point that the quantification of information parameters is questionable. A priori probabilities were explicitly considered by several authors [9, 13, 26, 28, 30], although these were always taken equal to $1/n$ for a set of $n$ possible (group-)identities. Also, a priori probability distribution functions, concerning concentrations or amounts, were taken into account in an information theoretical approach to quantitative analysis [14, 15].
4. The Analytical Procedure
The model of an identification procedure used in this study is a black box with a certain input, output and input-output (I/O) relation. The input of a procedure is the unknown compound $X$, which is a member of the set of identities $\{X_i\}$. The output is one of the signals of a set of possible signals $Y_1, Y_2, \ldots, Y_k, \ldots, Y_m$. Such a signal can
[Fig. 1. Schematic representation of the I/O-relation of an analytical procedure. For symbols see text.]
be a reading of a thermometer (melting point, boiling point), a (coded) spectrum (infrared, mass, etc.), or simply a colour. The number of possible signals depends on the number of decimals used for the actual reading. In the case of coded spectra it depends on the details of the code. The I/O-relation can be represented by a set of conditional probabilities $p(Y_1|X_1), p(Y_2|X_1), \ldots, p(Y_m|X_1), p(Y_1|X_2), \ldots, p(Y_k|X_i), \ldots, p(Y_m|X_n)$, where $p(Y_k|X_i)$ is the probability of measuring a signal $Y_k$ when $X_i$ is the analysed compound (see Fig. 1). These conditional probabilities have to satisfy

$$\sum_{k=1}^{m} p(Y_k|X_i) = 1 \quad (1)$$

for $i = 1, 2, \ldots, n$. The probabilities of the I/O-relation can be obtained by calibration and by establishing the reproducibility of the measuring process. The result of the calibration is usually represented in tables (libraries, files) of, for instance, melting points, $R_f$-values or mass spectra.

5. The Interpretation of Analytical Signals
The final step of an analysis consists of interpreting the measured signal $Y$, which is one of the members of the set $\{Y_k\}$. The interpretation leads to a new set of probabilities for the several identities. In other words, if a signal $Y_k$ has been measured, the interpretation leads to a set of a posteriori probabilities $p(X_1|Y_k), p(X_2|Y_k), \ldots, p(X_i|Y_k), \ldots, p(X_n|Y_k)$, where $p(X_i|Y_k)$ is defined as the probability of $X = X_i$ if $Y = Y_k$. It is evident that usually the majority of a posteriori probabilities will be
zero. In case of an unambiguous identification, only one a posteriori probability is one, all others being zero. The interpretation of the signals not only requires the characteristics of the analytical procedure (represented by the I/O-relation), but also the pre-information (represented by the a priori probabilities). This is expressed by Bayes' theorem [34] as follows:

$$p(X_i|Y_k) = \frac{p(X_i)\, p(Y_k|X_i)}{p(Y_k)} \quad (2)$$
whereby $p(Y_k)$ is the probability of measuring a signal $Y_k$. This probability can be determined by

$$p(Y_k) = \sum_{i=1}^{n} p(X_i)\, p(Y_k|X_i). \quad (3)$$
Eq. (2) will be called the interpretation relation.
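As an illustration of Eqs. (1)-(3), the following sketch computes the a posteriori probabilities from a set of a priori probabilities and an I/O-relation. The sketch is in Python; all numbers are hypothetical and not taken from the paper.

```python
import numpy as np

# Hypothetical problem: 3 possible identities, 4 possible signals.
p_X = np.array([0.5, 0.3, 0.2])  # a priori probabilities p(X_i)

# I/O-relation p(Y_k|X_i): row i = compound X_i, column k = signal Y_k.
# Each row sums to 1, as required by Eq. (1).
p_Y_given_X = np.array([
    [0.8, 0.2, 0.0, 0.0],
    [0.1, 0.7, 0.2, 0.0],
    [0.0, 0.1, 0.3, 0.6],
])

# Eq. (3): p(Y_k) = sum_i p(X_i) p(Y_k|X_i)
p_Y = p_X @ p_Y_given_X

# Eq. (2), the interpretation relation: p(X_i|Y_k) = p(X_i) p(Y_k|X_i) / p(Y_k)
p_X_given_Y = p_X[:, None] * p_Y_given_X / p_Y[None, :]

print(p_Y)                # [0.43 0.33 0.12 0.12]
print(p_X_given_Y[:, 0])  # a posteriori probabilities after measuring Y_1
```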
6. Quantification of Uncertainty

Uncertainty can be used as a criterion for judging the quality of the analytical result, provided that a quantification of the uncertainty is possible. This can be done by making use of a well defined uncertainty function which relates the uncertainty to a measurable quantity. It is possible to arrive at such a function by starting from the principle that there exists an uncertainty pertaining to a subject (identity, signal) when there are several alternatives, each with a certain probability. For $n$ such alternatives, $1, 2, \ldots, i, \ldots, n$, with probabilities $p(1), p(2), \ldots, p(i), \ldots, p(n)$ respectively, the degree of uncertainty $H$ will be a function $h$ of these probabilities

$$H = h[p(1), p(2), \ldots, p(i), \ldots, p(n)] = h\{p(i)\} \quad (4)$$

with

$$\sum_{i=1}^{n} p(i) = 1. \quad (5)$$

In order to arrive at an acceptable function for the uncertainty one can formulate a set of desirable properties for this function. For instance, $H$ has to be zero in case of complete certainty (one possibility only), and positive in case of uncertainty (more possibilities). Another desirable property is that in case of equal probabilities ($p(i) = 1/n$ for all $i$) $H$ increases with increasing $n$. A more complete list of desirable properties for an uncertainty function is given in Appendix I. A frequently used uncertainty function is the Shannon function on the basis of the dual logarithm ($\mathrm{ld} = \log_2$), originating from communication and information theory [31]. In this paper this function will be used as the expression for the uncertainty $H$, i.e.

$$H = -\sum_{i=1}^{n} p(i)\, \mathrm{ld}\, p(i) \quad (6)$$

with $0\, \mathrm{ld}\, 0 = 0$. The unit of uncertainty is called bit (not to be confused with computer bits). It should be realized that this definition of $H$ is, to some extent, arbitrary. There are other functions with the properties given in Appendix I [1]. The quantity $H$, as defined by Eq. (6), is often called entropy because of the analogy between the Shannon function and the expressions for the entropy as used in statistical thermodynamics. In certain instances the term entropy will be used in this paper. In case of $n$ equal probabilities ($1/n$) the expression for $H$ reduces to

$$H = \mathrm{ld}\, n. \quad (7)$$

7. Uncertainty and Information

As has already been discussed, three important quantities can be used to describe the analysis, i.e. the uncertainty before analysis, the uncertainty after analysis and the difference between these uncertainties. The difference was considered as a measure of the amount of information obtained from the analysis. The uncertainty after the analysis and the information can be used for judging the quality of the analysis. The uncertainty before analysis pertaining to the identity of $X$, $H(X)$, is, according to the above definitions, given by

$$H(X) = -\sum_{i=1}^{n} p(X_i)\, \mathrm{ld}\, p(X_i). \quad (8)$$

If a signal $Y_k$ has been measured, the uncertainty after analysis, $H(X|Y_k)$, is given by

$$H(X|Y_k) = -\sum_{i=1}^{n} p(X_i|Y_k)\, \mathrm{ld}\, p(X_i|Y_k). \quad (9)$$
H(X) and H(XI Yk) will be called 'compound entropies', because these quantities represent uncertainties with respect to the identity of the unknown compound. The amount of information obtained in case of an output signal Yk, I(Xk Yk), is defined as the decrease of uncertainty [4, 10, 12, 14] i.e. I(Xl Yk) = H(X) - H(XI Y,~).
(1 O)
It is important to observe that in case of unambiguous identifications H(X]Yk) = 0 and thus I(Xt Yk) = H(X). Therefore H(X) can also be considered as the information required for an unambiguous identification (missing information). Defining information according to Eq. (10) in some instances leads to negative values of I(X[ Yk). For instance, in case of two possible identities, X1 and X2, with a priori probabilities of 1/s and 4/s respectively, and p(YllX1) = 4/s and p(Y1LX2) = l/s, the information from signal 171 equals - 0 . 2 2 bit.
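This example can be verified directly from Eqs. (2), (3) and (8)-(10); a minimal sketch in plain Python (the dual logarithm ld is log2):

```python
from math import log2

def H(probs):
    # Shannon uncertainty (Eq. 6), in bits; terms with p = 0 are dropped (0 ld 0 = 0)
    return -sum(p * log2(p) for p in probs if p > 0)

p_X = [1/5, 4/5]           # a priori probabilities of X1 and X2
p_Y1_given_X = [4/5, 1/5]  # p(Y1|X1) and p(Y1|X2)

p_Y1 = sum(px * py for px, py in zip(p_X, p_Y1_given_X))              # Eq. (3)
p_X_given_Y1 = [px * py / p_Y1 for px, py in zip(p_X, p_Y1_given_X)]  # Eq. (2)

I = H(p_X) - H(p_X_given_Y1)  # Eq. (10): H(X) - H(X|Y1)
print(round(I, 2))            # -0.28: measuring Y1 here increases the uncertainty
```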
A definition of information that always yields positive values is given by [1]

$$I' = \sum_{i=1}^{n} p(X_i|Y_k)\, \mathrm{ld}\, \frac{p(X_i|Y_k)}{p(X_i)} \quad (11)$$

which is analogous to the formula used by Eckschlager and Vajda [14, 15] for a description of the quantitative analysis. It can be shown that for equal a priori probabilities ($1/n$), $I'$ and $I(X|Y_k)$ are equivalent. However, we prefer to stick to the quantity $I(X|Y_k)$ as an important parameter for qualitative analysis because of the direct relation between the decrease of uncertainty and the aim of the analysis, i.e. to attain a (nearly) unambiguous identification (zero uncertainty after analysis). The information defined by Eq. (10) can be considered as a generalization of the definition given by Brillouin [3, see also 8, 14, 19, 23]. The definition according to Brillouin is only defined in case of $n$ equally probable identities before and $n_k$ equally probable identities after analysis and is given by

$$I'' = -\mathrm{ld}\, \frac{n_k}{n}. \quad (12)$$

With the condition of equal a priori and equal a posteriori probabilities it can easily be shown that the quantities $I(X|Y_k)$ and $I''$ are equivalent. Yet another definition of information [5, 16, 18, 23] is given by

$$I''' = -\mathrm{ld}\, p(Y_k). \quad (13)$$

In cases of equal a priori probabilities and in the absence of noise, $I'''$ and $I(X|Y_k)$ are equivalent. In our information theoretical approach to qualitative analysis this definition has not been considered as a generally applicable alternative to the definition of information according to Eq. (10). The reason is that in general a low probability of a signal does not imply, if the signal is actually measured, the acquisition of a large amount of information (interpreted in a colloquial way). In some instances even the opposite is true. Finally an important remark should be made. The information defined in this paragraph (and also the uncertainty after analysis) is an a posteriori quantity in the sense that it can only be established after performance of the analysis. It therefore is not a quality criterion that can be used for choosing an analytical procedure.

8. Equivocation and Information Content

In general the analyst will use an analytical procedure that results in a (nearly) unambiguous identification for every possible (magnitude of the) analytical signal. For such 'ideal' procedures $H(X|Y_k) = 0$ for $k = 1 \ldots m$, which is identical with $I(X|Y_k) = H(X)$ for $k = 1 \ldots m$. Therefore it seems reasonable to describe the quality of an analytical procedure for a given analytical problem by the expected value (average) of the uncertainty after analysis, $H(X|Y)$. This expected value will be called equivocation, $E$, in analogy with the use in communication theory [31]. It is defined by

$$E = H(X|Y) = \sum_{k=1}^{m} p(Y_k)\, H(X|Y_k). \quad (14)$$

Similarly, the quality of an analytical procedure can be described by the expected value of the information [29], which will be called the information content, $I$, of the procedure. It is defined by

$$I = I(X|Y) = \sum_{k=1}^{m} p(Y_k)\, I(X|Y_k). \quad (15)$$

It can also be expressed [2, 3] by

$$I = H(X) - H(X|Y). \quad (16)$$

Eqs. (14) and (16) show that information content and equivocation are complementary quantities, i.e.

$$I + E = H(X). \quad (17)$$

For an 'ideal' procedure $E$ equals zero and $I$ equals $H(X)$. In all other cases $E > 0$ and $I < H(X)$. Introducing probabilities, $I$ can be expressed as follows:

$$I = -\sum_{i=1}^{n} p(X_i)\, \mathrm{ld}\, p(X_i) + \sum_{k=1}^{m} p(Y_k) \sum_{i=1}^{n} p(X_i|Y_k)\, \mathrm{ld}\, p(X_i|Y_k). \quad (18)$$

$p(X_i|Y_k)$ and $p(Y_k)$ can be calculated with Eqs. (2) and (3), i.e. from the pre-information and the I/O-relation. Together with Eq. (3) the following relation can be derived:

$$I = \sum_{k=1}^{m} \sum_{i=1}^{n} p(Y_k)\, p(X_i|Y_k)\, \mathrm{ld}\, \frac{p(X_i|Y_k)}{p(X_i)}. \quad (19)$$

This expression clearly shows that the information content can also be regarded as the expected value of the information parameter defined by Eq. (11), which implies that the information content is always non-negative. A proof of this non-negativity can also be given with property D of the Shannon function (Appendix II).
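A compact sketch of Eqs. (14)-(18) (Python with NumPy; the a priori probabilities and I/O-relation are the same hypothetical numbers as in the earlier sketch) that computes the information content and equivocation of a discrete one-channel procedure:

```python
import numpy as np

def information_content(p_X, p_Y_given_X):
    # Returns (I, E) of a one-channel procedure with discrete signals,
    # from the a priori probabilities p(X_i) and the I/O-relation p(Y_k|X_i).
    p_X = np.asarray(p_X, dtype=float)
    p_Y_given_X = np.asarray(p_Y_given_X, dtype=float)

    def H(p):  # Shannon function; zero probabilities dropped (0 ld 0 = 0)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    p_Y = p_X @ p_Y_given_X  # Eq. (3)
    E = 0.0                  # equivocation, Eq. (14): sum_k p(Y_k) H(X|Y_k)
    for k, pyk in enumerate(p_Y):
        if pyk > 0:
            posterior = p_X * p_Y_given_X[:, k] / pyk  # Eq. (2)
            E += pyk * H(posterior)
    I = H(p_X) - E           # Eqs. (16) and (17): I = H(X) - E
    return I, E

I, E = information_content([0.5, 0.3, 0.2],
                           [[0.8, 0.2, 0.0, 0.0],
                            [0.1, 0.7, 0.2, 0.0],
                            [0.0, 0.1, 0.3, 0.6]])
print(I, E)  # I + E reproduces H(X), in accordance with Eq. (17)
```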
9. Signal Entropy and Information Content

As in communication theory, information parameters in analytical chemistry are often defined in terms of possible signals rather than in terms of possible identities. The situation before the analysis was described by the uncertainty with respect to the identity of the unknown compound, which was called compound entropy. Similarly, it is possible to represent the uncertainty with respect to the identity of the unknown signal, the signal entropy $H(Y)$, by

$$H(Y) = -\sum_{k=1}^{m} p(Y_k)\, \mathrm{ld}\, p(Y_k). \quad (20)$$

In addition, the uncertainty pertaining to the unknown signal if the compound is known to be $X_i$, the conditional entropy $H(Y|X_i)$, is given by

$$H(Y|X_i) = -\sum_{k=1}^{m} p(Y_k|X_i)\, \mathrm{ld}\, p(Y_k|X_i). \quad (21)$$

This uncertainty can be considered as a measure of the noise. Using these signal entropies, it is possible to define the following expected values (compare with Eqs. 14, 15 and 16):

$$H(Y|X) = \sum_{i=1}^{n} p(X_i)\, H(Y|X_i) \quad (22)$$

and

$$I(Y|X) = H(Y) - H(Y|X). \quad (23)$$

$I(Y|X)$ can be regarded as the expected value of the information about the unknown signal $Y$ that is obtained when the identity of the unknown compound is made available (by other means than the experiment). Because of property B of the Shannon function (Appendix II) it can be shown that [2, 31]

$$I(X|Y) = I(Y|X) \quad (24)$$

which implies that the information content can also be expressed in terms of signal entropies:

$$I = H(Y) - H(Y|X). \quad (25)$$

This expression is essentially the same as that for the mutual information used by Ritter [30] for classification problems. With Eqs. (20), (21) and (22) it is possible to derive

$$I = -\sum_{k=1}^{m} p(Y_k)\, \mathrm{ld}\, p(Y_k) + \sum_{i=1}^{n} p(X_i) \sum_{k=1}^{m} p(Y_k|X_i)\, \mathrm{ld}\, p(Y_k|X_i). \quad (26)$$

This expression can also be derived from Eq. (18) or Eq. (19) by making use of Eqs. (2) and (3). In the absence of noise ($p(Y_k|X_i) = 1$ or $0$), Eq. (26) reduces to

$$I = -\sum_{k=1}^{m} p(Y_k)\, \mathrm{ld}\, p(Y_k). \quad (27)$$

This relation is used in Analytical Chemistry for a number of information parameters such as the (amount of) information [19, 22, 25, 32], average information [5], information content [20, 26], average information content [18] and average selective information [23]. When the possible compounds can be divided into $r$ groups of $n_k$ compounds in such a way that compounds within a group are completely indistinguishable and compounds of different groups are completely distinguishable by the analytical procedure, and when all compounds are a priori equally probable, Eq. (26) reduces to

$$I = -\sum_{k=1}^{r} \frac{n_k}{n}\, \mathrm{ld}\, \frac{n_k}{n}. \quad (28)$$

A division into groups as described is for instance possible in the case of absence of noise. As will be shown in paragraph 11, Eq. (28) can also be interpreted in another way. Eq. (28) has been used by several authors [6, 7, 10, 26-28]. Cleij and Dijkstra [4] compared the use of Eqs. (26) and (28) for the calculation of information contents of TLC identification procedures.

10. Combination of Procedures

When it is not possible to solve an analytical problem with one procedure, the solution may be provided by using a combination of two or more procedures. The analysis of an unknown compound with two analytical procedures results in two signals, i.e. a signal $Y_k$ for procedure K and a signal $Y_l'$ for procedure L. The I/O-relation of the combination of two procedures is represented by the conditional probability $p(Y_k, Y_l'|X_i)$, which is the probability of measuring the signals $Y_k$ and $Y_l'$ if measurements are done with compound $X_i$. Then the information content of the combination of two procedures can be expressed by

$$I_{K,L} = -\sum_{k=1}^{m} \sum_{l=1}^{m'} p(Y_k, Y_l')\, \mathrm{ld}\, p(Y_k, Y_l') + \sum_{i=1}^{n} p(X_i) \sum_{k=1}^{m} \sum_{l=1}^{m'} p(Y_k, Y_l'|X_i)\, \mathrm{ld}\, p(Y_k, Y_l'|X_i) \quad (29)$$

with

$$p(Y_k, Y_l') = \sum_{i=1}^{n} p(X_i)\, p(Y_k, Y_l'|X_i). \quad (30)$$

When the errors of both procedures are not correlated, the following relation holds:

$$p(Y_k, Y_l'|X_i) = p(Y_k|X_i)\, p(Y_l'|X_i). \quad (31)$$
Then

$$I_{K,L} = -\sum_{k=1}^{m} \sum_{l=1}^{m'} p(Y_k, Y_l')\, \mathrm{ld}\, p(Y_k, Y_l') + \sum_{i=1}^{n} p(X_i) \sum_{k=1}^{m} p(Y_k|X_i)\, \mathrm{ld}\, p(Y_k|X_i) + \sum_{i=1}^{n} p(X_i) \sum_{l=1}^{m'} p(Y_l'|X_i)\, \mathrm{ld}\, p(Y_l'|X_i). \quad (32)$$

Using property C of the Shannon function (Appendix II), it can be shown that in the case of non-correlated errors the following inequality holds:

$$I_{K,L} \le I_K + I_L \quad (33)$$

with $I_K$ and $I_L$ being the information contents of procedures K and L, respectively. The equality holds if the analytical quantities measured with the procedures are not correlated, i.e. if

$$p(Y_k, Y_l') = p(Y_k)\, p(Y_l') \quad (34)$$

for $k = 1 \ldots m$ and $l = 1 \ldots m'$. For the combination of more than two procedures it is possible to derive equations similar to Eqs. (29) and (32). Correlations between analytical quantities can be taken into account implicitly [4, 10] and explicitly [11-13, 17, 21, 25]. In the latter case correlation parameters, i.e. the covariance or the correlation coefficient, are used in the ultimate expressions for the information content.

11. Continuously Variable Signals

In the preceding paragraphs procedures that can produce only a limited number of different signals (discrete signals) were discussed. However, in Analytical Chemistry often signals that can take any value, i.e. continuously variable signals, are measured. The I/O-relation for such procedures can be represented by the set of conditional probability functions $p(y|X_1), p(y|X_2), \ldots, p(y|X_i), \ldots, p(y|X_n)$, with $p(y|X_i)\,dy$ being the probability of measuring a signal with magnitude between $y$ and $y + dy$ when the measurement is performed with compound $X_i$. Then the uncertainty after measuring a signal with magnitude $y$, $H(X|y)$, is given by

$$H(X|y) = -\sum_{i=1}^{n} p(X_i|y)\, \mathrm{ld}\, p(X_i|y) \quad (35)$$

where $p(X_i|y)$ is the probability that the unknown identity is $X_i$ after measuring a signal with magnitude $y$. This probability can be calculated with the interpretation relation (Bayes' theorem) for continuously variable signals:

$$p(X_i|y) = \frac{p(X_i)\, p(y|X_i)}{\sum_{i=1}^{n} p(X_i)\, p(y|X_i)}. \quad (36)$$

The information obtained about the identity of the unknown compound is given by

$$I(X|y) = H(X) - H(X|y). \quad (37)$$

The equivocation $E$ and information content $I$ for an analytical procedure with continuously variable signals are defined, in analogy with Eqs. (14) and (15), by

$$E = \int p(y)\, H(X|y)\, dy \quad (38)$$

and

$$I = \int p(y)\, I(X|y)\, dy \quad (39)$$

with $p(y)\,dy$ being the probability (before analysis) of measuring a signal with magnitude between $y$ and $y + dy$. The probability distribution function $p(y)$ can be calculated with

$$p(y) = \sum_{i=1}^{n} p(X_i)\, p(y|X_i). \quad (40)$$

In terms of compound entropies the information content equals

$$I = -\sum_{i=1}^{n} p(X_i)\, \mathrm{ld}\, p(X_i) + \int p(y) \sum_{i=1}^{n} p(X_i|y)\, \mathrm{ld}\, p(X_i|y)\, dy. \quad (41)$$

By using Eqs. (36) and (40), this expression for $I$ can be replaced by

$$I = -\int p(y)\, \mathrm{ld}\, p(y)\, dy + \sum_{i=1}^{n} p(X_i) \int p(y|X_i)\, \mathrm{ld}\, p(y|X_i)\, dy. \quad (42)$$

When the entropy of a probability function $p(x)$ is, according to Shannon [31], defined as equal to $-\int p(x)\, \mathrm{ld}\, p(x)\, dx$, Eq. (42) can, analogously to Eq. (26), be considered as the signal entropy form of the expression for the information content in the continuous case. However, it should be observed that the entropy of a probability distribution function cannot be regarded as an absolute measure of the uncertainty. The first reason for this is that this entropy can be negative. Secondly, the value of the entropy depends on the units in which $x$ is expressed. Consequently the entropy is not defined unambiguously, and it would also be more correct to replace Eq. (42) by

$$I = \sum_{i=1}^{n} p(X_i) \int p(y|X_i)\, \mathrm{ld}\, \frac{p(y|X_i)}{p(y)}\, dy \quad (43)$$

which shows that the information content is independent of the units in which the measurement is expressed. When the error distribution function is equal for all compounds ($X_i$), Eq. (42) can be replaced by

$$I = -\int p(y)\, \mathrm{ld}\, p(y)\, dy + \int p_e(y)\, \mathrm{ld}\, p_e(y)\, dy \quad (44)$$

where $p_e(y)$ is the error distribution function. This equation was used by Dupuis and Dijkstra [11, see also 28].
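As a numerical illustration of Eqs. (40), (44) and (45), the following sketch evaluates the information content of a continuous-signal procedure by numerical integration. The signal positions, the error standard deviation and the equal a priori probabilities are hypothetical:

```python
import numpy as np

signals = np.array([10.0, 12.0, 12.5, 18.0])  # noise-free signal positions
sigma_e = 0.5                                 # Gaussian error, Eq. (45)
p_X = np.full(len(signals), 1 / len(signals))

y = np.linspace(0.0, 30.0, 20001)
dy = y[1] - y[0]

def gauss(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Eq. (40): p(y) = sum_i p(X_i) p(y|X_i)
p_y = sum(px * gauss(y, mu, sigma_e) for px, mu in zip(p_X, signals))

# First term of Eqs. (44)/(45): -integral p(y) ld p(y) dy
mask = p_y > 0
H_y = -np.sum(p_y[mask] * np.log2(p_y[mask])) * dy

# Second term of Eq. (45): the entropy of the Gaussian error distribution
I = H_y - np.log2(sigma_e * np.sqrt(2 * np.pi * np.e))
print(I)  # at most ld 4 = 2 bits; lower because two of the signals overlap
```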
Table 1. Symbols (one-channel procedures, discrete signals)

Identities and signals
$X_i, i = 1 \ldots n$: possible identities of the unknown compound (before analysis)
$X$: identity of the unknown compound
$Y_k, k = 1 \ldots m$: possible signals
$Y$: measured signal

Probabilities
$p(X_i)$: a priori (before analysis) probability of $X_i$ being the unknown identity
$p(X_i|Y_k)$: a posteriori (after measuring a signal $Y_k$) probability of $X_i$ being the unknown identity
$p(Y_k)$: probability of measuring a signal $Y_k$
$p(Y_k|X_i)$: probability of measuring a signal $Y_k$ when compound $X_i$ is analyzed (I/O-relation)
Table 2. Definitions (one-channel procedures, discrete signals)

Uncertainties
Uncertainty with respect to the unknown identity $X$ before analysis:
$$H(X) = -\sum_{i=1}^{n} p(X_i)\, \mathrm{ld}\, p(X_i)$$
Uncertainty with respect to the unknown identity $X$ after measuring a signal $Y_k$:
$$H(X|Y_k) = -\sum_{i=1}^{n} p(X_i|Y_k)\, \mathrm{ld}\, p(X_i|Y_k)$$

Information
Amount of information about the unknown identity $X$ provided by measuring a signal $Y_k$:
$$I(X|Y_k) = H(X) - H(X|Y_k)$$

Equivocation
Equivocation or expected value of the uncertainty after analysis:
$$E = \sum_{k=1}^{m} p(Y_k)\, H(X|Y_k)$$

Information content
Information content or expected value of the amount of information obtained:
$$I = \sum_{k=1}^{m} p(Y_k)\, I(X|Y_k)$$
By assuming that the error distribution function is Gaussian (with standard deviation $\sigma_e$), Eq. (44) reduces to

$$I = -\int p(y)\, \mathrm{ld}\, p(y)\, dy - \mathrm{ld}\, \sigma_e \sqrt{2 \pi e}. \quad (45)$$

This equation was used by Cleij and Dijkstra [4] in calculations showing that the continuously variable signal can be replaced by a discrete one with minor changes in information content. When the signal axis is divided into $r$ intervals of equal length $\Delta y$ and when the a priori probabilities are the same, Eq. (45) can be replaced by Eq. (46) for the calculation of approximate information contents [11, 28]:

$$I \approx -\sum_{k=1}^{r} \frac{n_k}{n}\, \mathrm{ld}\, \frac{n_k}{n} + \mathrm{ld}\, \Delta y - \mathrm{ld}\, \sigma_e \sqrt{2 \pi e} \quad (46)$$

with $n_k$ being the number of compounds with their signal magnitudes in interval $k$. When choosing $\Delta y$ equal to $\sigma_e \sqrt{2 \pi e} = 4.13\, \sigma_e$, Eq. (46) reduces to Eq. (28).

12. Summary of the Theory

For the case of one-channel identification procedures yielding discrete signals a survey of symbols, definitions
and expressions for the information parameters is given in Tables 1, 2 and 3, respectively. Use of these tables will facilitate the reading of the text. It also provides an easy comparison of the meaning of the different symbols and concepts, and of the conditions that are relevant for the use of the different expressions for the information parameters.

Table 3. Expressions (one-channel procedures, discrete signals)

Information
General expression:
$$I(X|Y_k) = -\sum_{i=1}^{n} p(X_i)\, \mathrm{ld}\, p(X_i) + \sum_{i=1}^{n} p(X_i|Y_k)\, \mathrm{ld}\, p(X_i|Y_k)$$
$n$, $n_k$: number of possibilities for the unknown identity with equal probabilities, before and after analysis respectively (Brillouin):
$$I(X|Y_k) = -\mathrm{ld}\, \frac{n_k}{n}$$
Equal a priori probabilities ($p(X_i) = 1/n$) and absence of noise ($p(Y_k|X_i) = 1$ or $0$):
$$I(X|Y_k) = -\mathrm{ld}\, p(Y_k)$$

Information content
General expression, compound entropy form:
$$I = -\sum_{i=1}^{n} p(X_i)\, \mathrm{ld}\, p(X_i) + \sum_{k=1}^{m} p(Y_k) \sum_{i=1}^{n} p(X_i|Y_k)\, \mathrm{ld}\, p(X_i|Y_k)$$
General expression, signal entropy form:
$$I = -\sum_{k=1}^{m} p(Y_k)\, \mathrm{ld}\, p(Y_k) + \sum_{i=1}^{n} p(X_i) \sum_{k=1}^{m} p(Y_k|X_i)\, \mathrm{ld}\, p(Y_k|X_i)$$
Absence of noise:
$$I = -\sum_{k=1}^{m} p(Y_k)\, \mathrm{ld}\, p(Y_k)$$
Equal a priori probabilities, grouping of possible compounds:
$$I = -\sum_{k=1}^{r} \frac{n_k}{n}\, \mathrm{ld}\, \frac{n_k}{n}$$
with $n_k$: number of possible compounds in group $k$ (for details, see text)

Equivocation
Relation with information content:
$$E = H(X) - I$$
13. Applications In order to illustrate the usefulness of applying information theory to analytical problems, a short survey is presented of applications to qualitative analysis described in analytical literature.
13.1. Thin-Layer Chromatography

The first application of information theory to TLC was reported by Souto and De Valesi [32]. The capabilities of eleven solvent systems to identify antibiotic substances were compared by using a quantity that can be considered as the ratio of the information content of a system to that of a system with equally spread $R_f$-values. For each system the 35 antibiotics were divided into groups corresponding with a division of the $R_f$-scale into equal intervals. The information content of each system was calculated with Eq. (28). Several solvent systems could be designated as the most suitable systems for the identification of antibiotics. Massart, De Clercq and Smits [6, 7, 26, 28] evaluated the suitability of a number of TLC systems for solving several identification problems by calculating information contents in a way similar to that of Souto and De Valesi. Two methods of dividing the compounds into groups were applied [26, 28] and the use of the information content was compared with the use of chi-square and discriminating power as quality criteria for identification procedures [7, 28]. The same method of calculating information contents was also used by Massart and De Clercq [27] for the selection of (near) optimal combinations of TLC systems. Numerical taxonomy was applied for dividing the TLC systems into groups of systems with similar properties. An approximately optimal combination was made by selecting from each group the system with the highest information content. As an example, a combination of 3 out of 10 solvent systems was selected, permitting the unambiguous identification of 26 synthetic food dyes by their $R_f$-values. Considering the identification of DDT and 12 related compounds by TLC, Cleij and Dijkstra [4] used Eq. (27) for the calculation of information contents for 33 solvent systems. A comparison was made with the calculation method of Massart [26]. Eqs. (32) and (17) were used to calculate the equivocations for the 528 possible combinations of two solvent systems, enabling the selection of the best combination.
13.2. Gas-Liquid Chromatography

Information theory was applied to GLC by Dupuis and Dijkstra [11] and by Eskes et al. [17]. Aiming at an optimal identification by retrieval of retention indices, (combinations of) stationary phases were selected by using the information content as a criterion. It appears that different (combinations of) phases are selected for different identification problems, i.e. the unknown compound being an alcohol, ester, etc. The optimal combinations of two or three columns were selected by comparing the information contents for all possible combinations. In order to avoid excessive computations, near-optimal combinations of four or more phases were established by adding to an already selected combination the phase that yields the largest increment of the information content (ΔI-method).
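The ΔI-method amounts to a greedy search: start from the procedure with the highest information content and repeatedly add the procedure that gives the largest increment ΔI for the combination. A toy sketch of this strategy (Python; the discretized, noise-free signal values and the equal a priori probabilities are hypothetical, and combinations are scored with the grouping formula, Eq. (28)):

```python
from math import log2

def I_grouping(keys):
    # Eq. (28): equally probable compounds grouped by identical signal tuples
    n = len(keys)
    counts = {}
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
    return -sum(nk / n * log2(nk / n) for nk in counts.values())

# Hypothetical discretized signals (one list per candidate procedure,
# one value per compound; 6 compounds, 3 candidate procedures).
procedures = {
    "A": [1, 1, 2, 2, 3, 3],
    "B": [1, 2, 1, 2, 1, 2],
    "C": [1, 1, 1, 2, 2, 2],
}

selected = []
combined = [()] * 6  # combined signal tuple per compound
for _ in range(2):   # greedily select a (near-)optimal pair
    best = max((name for name in procedures if name not in selected),
               key=lambda name: I_grouping(
                   [s + (v,) for s, v in zip(combined, procedures[name])]))
    selected.append(best)
    combined = [s + (v,) for s, v in zip(combined, procedures[best])]
    print(selected, round(I_grouping(combined), 3))
# Picks A first (1.585 bit), then B: A+B separates all 6 compounds (2.585 bit).
```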
13.3. Mass Spectrometry

Grotch [20] calculated the information contents of identification procedures based upon the retrieval of mass spectra with binary coded intensities. The influence of the intensity threshold for coding the intensities upon the information content was investigated, whereas correlations and errors were neglected. Wangen et al. [33] used information theoretical principles for minimizing the number of (computer) bits required for coding mass spectra by combining mass positions. A reduction from 352 to 48 bits per spectrum was achieved, resulting in only a slight decrease of the recall (percentage of correct compounds or close isomers in first position) from 84 to 74%. Van Marlen and Dijkstra [25] described a method of calculating information contents of identification procedures based upon retrieval of mass spectra with binary coded intensities, taking the correlations between the intensities into account. Due to these correlations a set of 120 (out of 300) masses could be selected with hardly any loss of information. The method of calculating information contents and the selection strategy (ΔI-method) were essentially the same as the methods applied to GLC by Dupuis and Dijkstra [11]. It should be observed that the use of one peak intensity for retrieval purposes can be considered as a one-channel procedure. Consequently, the use of a spectrum is to be regarded as a combination of one-channel procedures. Erni [16] presented an application of information theory using the parameter information instead of the information content. The information obtained when a feature in an 'unknown' spectrum is absent or present, $I^-$ or $I^+$, was used as a weight factor in a matching criterion for the retrieval of binary coded mass spectra. De Jong et al. [23] used the information [Eq. (13)], in their paper called selective information, for the development of an efficient 'reverse search' method aiming at the identification of pure pesticides and mixtures. A 'reverse search' method uses coded reference spectra containing only intensities for a restricted set of masses. By using the information as a selection criterion, small numbers of masses (less than 15) could be selected for every reference spectrum, without affecting the 'uniqueness' of these spectra. The reference file contained spectra of 350 pesticides.
13.4. Infrared Spectrometry

Dupuis and Dijkstra [12] evaluated the applicability of the ASTM infrared data base (binary coded spectra with 140 wavelength intervals) for retrieval purposes. The effects of correlations between the occurrence of peaks in different intervals and of errors in the coded spectra upon the information content of a retrieval method were calculated. It was concluded that especially the coding errors have a pronounced influence upon the quality of the retrieval. Dupuis and Dijkstra [13] also evaluated the retrieval with accurately measured and (binary) coded spectra. The effects of the intensity threshold and of the analytical problem (unknown compound being an alcohol, hydrocarbon, etc.) were quantified in terms of differences in information content. Usually all wavelength intervals into which the IR wavelength band is divided are used for the retrieval. However, especially when small reference files are involved, it is not necessary to use all intervals for retrieval. Two methods based upon information theory for the selection of a restricted set of intervals, to be used for retrieval purposes, have been published.
The method described by Heite et al. [21] uses numerical taxonomy and information theory. It is similar to the approach of Massart and De Clercq [6] for selecting optimal combinations of TLC systems. By applying the method, 97.7% of 5100 ASTM spectra were uniquely coded with 40 intervals. A second method was described by Dupuis and Dijkstra [10]. The intervals were selected with the ΔI-method. As an example, 160 hydrocarbons could be coded uniquely by using 12 intervals (threshold 3%), whereas for the coding of 100 alcohol spectra 10 intervals were sufficient. The mutual information was used by Ritter et al. [30] for the classification with binary coded infrared spectra. This information parameter corresponds to the information content of a classification procedure. It was shown that the mutual information is to a large extent correlated with the 'recognition ability' (for two-class classifications). Another application of information theory to classification problems was presented by Dupuis [9]. Procedures for two-class classifications and four-class classifications were developed, making use of the absence or presence of peaks for only a limited number of wavelength intervals. The classification procedures were generated by an adapted ΔI-method. These procedures are similar to the interpretation rules used in artificial intelligence programs.
13.5. Remarks
Some general remarks about the application of information theory to qualitative analysis should be made.
a) All applications published so far, and most likely all future applications, essentially deal with the selection or optimization of (combinations of) analytical procedures. The information content serves as a selection criterion that describes the (expected or average) performance of an analytical procedure in relation to the identification problem that has to be solved. For such problems also the equivocation, which is complementary to the information content, can be used. It leads to the same solutions of selection and optimization problems. The use of the equivocation has the advantage of directly showing the deviation from an ideal procedure, which for every measured signal leads to an unambiguous identification. However, the information content has been widely used in Analytical Chemistry. As shown by De Jong et al. [23], the information can also be used as a parameter for selection or optimization problems, although it is not a quality criterion for the whole identification procedure.
b) The use of the information content as a selection or optimization criterion is relatively easy when applied to one-channel procedures (measurement of one variable). Then only the information contents of the several alternatives have to be calculated. Difficulties arise when optimal multi-channel procedures, composed of a limited number of one-channel procedures, have to be developed, especially when the number of one-channel procedures being considered is large. Then the calculation of the information content for all possible combinations is virtually impossible. For instance, the selection of a number of spectral features (channels, wavenumbers) to be used for the identification by means of retrieval of spectral data is difficult because of the large number of features available. A solution of this problem can be an a priori reduction of the number of available channels. However, usually such a reduction will be rather arbitrary. Another, supplementary, solution is the application of a selection strategy requiring the calculation of a relatively small number of information contents. Such strategies are the ΔI-method [9-13, 17, 25] and the taxonomic approach [7, 21, 27].

c) Another difficulty in applying information theory to the combination of one-channel procedures is the large number of combinations of possible signals that has to be taken into account. For instance, if the $R_f$-value is regarded as a (discrete) quantity measured in two decimals, the combination of two TLC procedures leads to $101^2 \approx 10^4$ possible combinations of signals. For continuously variable signals the application of information theory requires the solution of multi-dimensional integrals, which can be very (calculation) time consuming and often is impossible. For GLC [11, 17] this problem was solved by describing the retention indices for several columns by a multi-dimensional normal distribution function. This model leads to integrals that can be expressed in terms of variances and covariances of the retention indices. The assessment of the information content of retrieval procedures using spectra with binary coded intensities [12, 13, 25] can be based upon the same model. Also replacement of continuously variable signals by discrete signals will reduce the calculation efforts [4].

d) The information theoretical approach presented in this paper has been restricted to the identification of a single compound. However, for a number of identification procedures it can be expected that the better the procedure is for identifying single compounds, the better it is for identifying the components of a mixture. This especially applies to the identification through the retrieval of chromatographic data such as $R_f$-values and retention indices. Consequently, in some instances the information content or equivocation can also be
used for mixtures. As such the information content was used by Massart, De Clercq and Smits [6, 7, 26-28].

e) The general formulae for the calculation of information contents are difficult to apply. Therefore it is not surprising that in papers dealing with applications simplified expressions have been used. The information parameters presented in the literature can often be related to the information content as described in this paper by assuming a simple model for the analytical problem (equal a priori probabilities of the possible identities) and for the analytical procedure (normal error distribution function, error independent of the compound being analyzed, or absence of noise). It is obvious that the simplified expressions have a limited applicability.
14. Conclusions

Information theory yields a criterion that describes the ability of analytical procedures for the identification (or classification) of unknown compounds. This criterion (information content or equivocation) can be an important tool to evaluate, select or optimize identification procedures. The basic parameter of the theory of uncertainty and information, as described in this paper, is the uncertainty pertaining to the identity after analyzing the unknown compound. It can be compared with the precision as used in quantitative analysis. The equivocation is the expected value of the uncertainty after analysis. The information, i.e. the decrease of the uncertainty as a result of the analysis, and the information content, i.e. the expected value of the information, are not essentially different from the uncertainty after analysis and the equivocation. Only the zero level is different for these quantities. The information parameters presented in the literature can often be related to the information content as described in this paper by assuming a simple model for the analytical problem or procedure. Information theory can (relatively) easily be applied to one-channel procedures. Application to combinations of one-channel procedures or multi-channel procedures is much more difficult. For instance, when the optimization of a retrieval procedure using spectral data is considered, the large number of possible combinations of peak positions requires the calculation of large numbers of information contents. Also the straightforward calculation of information contents of multi-channel procedures can be very (calculation) time consuming. Therefore the role that information theory will play in the evaluation, selection and optimization of identification (or classification) procedures will to a large extent depend on the further development of
models that enable faster calculations of information contents. It also depends on the further development of selection or optimization strategies that require the calculation of only a limited number of information contents.
Appendix I

Desirable properties [1, 2, 3, 31] of an uncertainty parameter $H$ and an uncertainty function $h$ are:

1. $H$ has to be equal to zero when there is only one possibility (with probability 1), so $h\{1\} = 0$.
2. $H$ has to be positive in case of two or more possibilities (with probabilities larger than zero).
3. In case of $n$ possibilities with equal probabilities ($p(i) = 1/n$, for $i = 1 \ldots n$) $H$ has to increase with increasing $n$. To put it differently: $h\{\frac{1}{n}\}$ has to be a strictly monotonically increasing function of $n$.
4. For equal probabilities $H$ has to be infinite when the number of possibilities is infinite. For $h$ we write: $\lim_{n \to \infty} h\{\frac{1}{n}\} = \infty$.
5. Considering situations with equal numbers of possibilities, $H$ has to be maximum when all probabilities are equal. So:
$$h[p(1), p(2), \ldots, p(i), \ldots, p(n)] \le h\left[\frac{1}{n}, \ldots, \frac{1}{n}\right]$$
with equality if and only if $p(i) = \frac{1}{n}$ for $i = 1 \ldots n$.
6. The function $h\{p(i)\}$ has to be continuous in $p(i)$.
7. The function $h\{p(i)\}$ has to be symmetrical. So permutations in the set $\{p(i)\}$ should not influence the value of $H$.
8. The value of $H$ for a set of probabilities should not change if the set is extended with a zero probability: $h[p(1), p(2), \ldots, p(i), \ldots, p(n)] = h[p(1), p(2), \ldots, p(i), \ldots, p(n), 0]$.
9. In case of two possibilities with probabilities equal to $p$ and $1-p$, $h[p, 1-p]$ should be strictly monotonically increasing with $p$ for $0 \le p \le \frac{1}{2}$.
10. $h[\frac{1}{2}, \frac{1}{2}] = 1$.

This last property clearly is not essential. It only has a practical meaning through the effect of limiting the number of functions without excluding functions that differ in an essential way from the functions following this property. It should be regarded as a normalization condition.
Appendix II

Properties of Shannon's Function h

Property A. Shannon's inequality [3, 1, 2]: For two sets of probabilities $\{p(i)\}$ and $\{q(i)\}$ satisfying

$$\sum_{i=1}^{n} p(i) = \sum_{i=1}^{n} q(i) = 1,$$

Shannon's inequality is given by

$$-\sum_{i=1}^{n} p(i)\, \mathrm{ld}\, p(i) \le -\sum_{i=1}^{n} p(i)\, \mathrm{ld}\, q(i)$$

with equality only if $p(i) = q(i)$, for $i = 1 \ldots n$.

To describe the other properties the following is given: two subjects $X$ and $Y$; for the first subject there exist the possibilities $X_1, X_2, \ldots, X_i, \ldots, X_n$, for the second the possibilities $Y_1, Y_2, \ldots, Y_j, \ldots, Y_m$; $p(i)$ is the probability that $X = X_i$, $p(j)$ is the probability that $Y = Y_j$, $p(i,j)$ is the probability that $X = X_i$ and $Y = Y_j$, and $p(j/i)$ is the probability that $Y = Y_j$ if it is known that $X = X_i$. Relations between the probabilities are:

$$p(i) = \sum_j p(i,j), \quad p(j) = \sum_i p(i,j), \quad p(j/i) = \frac{p(i,j)}{p(i)}.$$

Property B. $h$ is strongly additive [1, see also 3, 31, 2]:

$$h\{p(i,j)\} = h\{p(i)\} + \sum_i p(i) \cdot h\{p(j/i)\}$$

Property C [3, 31, 1, 2]: The following property can be proven using only Shannon's inequality [3, 2] and can therefore be regarded as a different form of Shannon's inequality:

$$h\{p(i,j)\} \le h\{p(i)\} + h\{p(j)\}$$

with equality only if $p(i,j) = p(i) \cdot p(j)$ for all $i$ and $j$.

Property D [3, 31, 1, 2]: The following property can be proven using properties B and C [2, 3, 31]:

$$\sum_i p(i) \cdot h\{p(j/i)\} \ge h\{p(j)\}$$

with equality only if $p(i,j) = p(i) \cdot p(j)$ ($\rightarrow p(j) = p(j/i)$) for all $i$ and $j$.
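Property A can be spot-checked numerically; a small illustrative Python snippet with random probability vectors:

```python
from math import log2
import random

random.seed(1)
for _ in range(1000):
    raw_p = [random.random() for _ in range(5)]
    raw_q = [random.random() for _ in range(5)]
    p = [x / sum(raw_p) for x in raw_p]  # two normalized probability vectors
    q = [x / sum(raw_q) for x in raw_q]
    lhs = -sum(pi * log2(pi) for pi in p)              # -sum p(i) ld p(i)
    rhs = -sum(pi * log2(qi) for pi, qi in zip(p, q))  # -sum p(i) ld q(i)
    assert lhs <= rhs + 1e-12
print("Shannon's inequality held in all trials")
```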
References

1. Aczél, J., Daróczy, Z.: On measures of information and their characterizations. New York: Academic Press 1975
2. Ash, R.: Information theory. New York, London, Sydney: Interscience Publishers 1965
3. Brillouin, L.: Science and information theory, 2nd ed. New York: Academic Press 1967
4. Cleij, P., Dijkstra, A.: Fresenius Z. Anal. Chem. 294, 361 (1979)
5. Clerc, J. T., Kaiser, R., Rendl, J., Spitzy, H., Zettler, H., Gottschalk, G., Malissa, H., Schwarz-Bergkampf, E., Werder, R. D.: Fresenius Z. Anal. Chem. 272, 1 (1974)
6. de Clercq, H., Massart, D. L.: J. Chromatogr. 93, 243 (1974)
7. de Clercq, H., Massart, D. L.: J. Chromatogr. 115, 1 (1975)
8. Doerffel, K., Hildebrandt, W.: Wissenschaftl. Z. 11, 30 (1969)
9. Dupuis, F.: Thesis, Chapter V, State Univ. Utrecht 1977
10. Dupuis, F., Cleij, P., van 't Klooster, H. A., Dijkstra, A.: Anal. Chim. Acta 112, 83 (1979)
11. Dupuis, F., Dijkstra, A.: Anal. Chem. 47, 379 (1975)
12. Dupuis, F., Dijkstra, A.: Fresenius Z. Anal. Chem. 290, 357 (1978)
13. Dupuis, F., Dijkstra, A., van der Maas, J. H.: Fresenius Z. Anal. Chem. 291, 27 (1978)
14. Eckschlager, K.: Fresenius Z. Anal. Chem. 277, 1 (1975)
15. Eckschlager, K., Vajda, I.: Collect. Czechoslov. Chem. Commun. 39, 3076 (1974)
16. Erni, F.: Thesis No. 4296, Eidgen. Techn. Hochschule, Zürich 1972
17. Eskes, A., Dupuis, F., Dijkstra, A., de Clercq, H., Massart, D. L.: Anal. Chem. 47, 2168 (1975)
18. Gottschalk, G.: Fresenius Z. Anal. Chem. 258, 1 (1972)
19. Griepink, B., Dijkstra, G.: Fresenius Z. Anal. Chem. 257, 269 (1971)
20. Grotch, S. L.: Anal. Chem. 42, 1214 (1970)
21. Heite, F. H., Dupuis, P. F., van 't Klooster, H. A., Dijkstra, A.: Anal. Chim. Acta 103, 313 (1978)
22. Isenhour, T. L., Justice, J. B.: Adv. Mass Spectrom. 6, 981 (1974)
23. de Jong, G., van Bekkum, J., van 't Klooster, H. A., Freudenthal, J.: Adv. Mass Spectrom. 7, 1091 (1978)
24. Kaiser, H.: Anal. Chem. 42, 24A (1970)
25. van Marlen, G., Dijkstra, A.: Anal. Chem. 48, 595 (1976)
26. Massart, D. L.: J. Chromatogr. 79, 157 (1973)
27. Massart, D. L., de Clercq, H.: Anal. Chem. 46, 1988 (1974)
28. Massart, D. L., de Clercq, H., Smits, R.: Euroanalysis II Conference, Budapest 1975
29. Raeside, D. E.: Med. Physics 3, 1 (1976)
30. Ritter, G. L., Lowry, S. R., Woodruff, H. B., Isenhour, T. L.: Anal. Chem. 48, 1027 (1976)
31. Shannon, C. E., Weaver, W.: The mathematical theory of communication. Urbana: The University of Illinois Press 1949
32. Souto, J., de Valesi, A. G.: J. Chromatogr. 46, 274 (1970)
33. Wangen, L. E., Woodward, W. S., Isenhour, T. L.: Anal. Chem. 43, 1605 (1971)
34. Weinberg, F.: Grundlagen der Wahrscheinlichkeitsrechnung und Statistik sowie Anwendungen in Operations Research. Berlin, Heidelberg, New York: Springer 1968
Received April 6, 1979