Arch. Environ. Contam. Toxicol. 39, 289–298 (2000) DOI: 10.1007/s002440010107
Archives of Environmental Contamination and Toxicology © 2000 Springer-Verlag New York Inc.
Modeling the Toxicity of Chemicals to Tetrahymena pyriformis Using Molecular Fragment Descriptors and Probabilistic Neural Networks

S. P. Niculescu,1 K. L. E. Kaiser,2 T. W. Schultz3

1 TerraBase Inc., 3350 Fairview St., Suite 160, Burlington, Ontario L7N 3L5, Canada
2 National Water Research Institute, P.O. Box 5050, Burlington, Ontario L7R 4A6, Canada
3 College of Veterinary Medicine, University of Tennessee, Knoxville, Tennessee 37901-1071, USA
Received: 16 November 1999/Accepted: 9 April 2000
Abstract. The results of an investigation into the use of a probabilistic neural network (PNN)-based methodology to model the 48-h ICG50 (inhibitory concentration for population growth) sublethal toxicity of 825 chemicals to the ciliate Tetrahymena pyriformis are presented. The information fed into the neural networks is based solely on simple molecular descriptors that can be derived from the chemical structure. In contrast to most other toxicological models, the octanol/water partition coefficient is not used as an input parameter, and no rules of thumb or other substance selection criteria are employed. The cross-validation and external validation experiments confirmed the excellent recognitive and predictive capabilities of the resulting models and recommend their future use in evaluating the potential of most organic molecules to be toxic to Tetrahymena.
Correspondence to: K. L. E. Kaiser

The large-scale production and use of certain chemicals has caused widespread concern about their environmental and health effects. Pesticides are applied in increasing amounts to large parts of the world’s agricultural soils. Frequently, their residues and metabolites enter ground and surface waters and the atmosphere and are thus transported into different ecosystems. Industrial compounds and their by-products enter the environment primarily through spills and untreated effluents. Household detergents and related materials are used around the globe and find their way into the environment through insufficient or absent waste water treatment. Recent changes in legislation provide an impetus for government to curb or control substance release, particularly for compounds with little or known effects. Of the 65,000+ substances on the Domestic and Non-Domestic Substances Lists (Canada), only a small percentage have any measured toxicity, bioaccumulation, and biodegradation data; for most substances, no such information is available. To assess these compounds’ potential to cause undesirable environmental
and health effects, certain key properties need to be estimated. The science of estimating or predicting variations in certain properties of narrowly defined data sets is well established in the form of quantitative structure-activity relationships (QSARs), most of which rely on linear relationships between a physicochemical property and the biological effect of interest. In contrast, when it comes to estimating the effects of a wide variety of chemicals, most linear relationships fail to provide reliable predictions. A new perspective is needed.

Among the protozoa, members of the class Ciliophora, especially free-living Holotrichia, are commonly used in population-based aquatic toxicity assessments (Gilron and Lynn 1998). Among the holotrichs, the organism most often used in aquatic toxicity testing has become the common freshwater hymenostome, Tetrahymena. Tetrahymena is a large, motile, phagotrophic cell. The ability to culture Tetrahymena axenically led to an enormous expansion of knowledge of its physiology and biochemistry. This plethora of basic knowledge was garnered from work with strain GL of the species T. pyriformis. This strain lacks a micronucleus and has no sexual stage in its life cycle. While knowledge of the organism’s biology has made the modern-day use of T. pyriformis as a toxicity test organism possible, its exceptionally fast growth rate under simple and inexpensive culture conditions has been the key to its use as a standardized toxicity assessment system (Larsen et al. 1997).

The present use of Tetrahymena in aquatic toxicity testing has its roots in the early 1970s, especially the work of Cooley et al. (1972). In the late 1970s, work began to focus on the use of T. pyriformis in toxicity testing. While early studies examined a number of biological endpoints, the adoption of population growth impairment as the endpoint was a major development. The first systematic effort to evaluate aquatic toxicity using a T.
pyriformis growth assay examined nitrogen-containing heterocyclics (Schultz et al. 1980). Schultz (1983) first presented an overview of this method. Briefly, this assay used 250-ml Erlenmeyer flasks containing 50 ml of toxicant/proteose-peptone medium solution, and the method involved using graded concentration series evaluated in replicates. Population densities
were estimated by absorbance at 540 nm. From these data, the 50% population growth impairment concentration was determined by probit analysis. Although this assay has undergone modifications (e.g., Schultz et al. 1990; Schultz 1997), the basic design has stayed the same for the last 20 years.

The primary goal of the present paper is to present the results of a PNN-based modeling experiment for the 48-h ICG50 (population growth inhibitory concentration) sublethal toxicity to Tetrahymena pyriformis. The model is based solely on simple molecular descriptors derived from the chemical structure; it does not use the octanol/water partition coefficient, does not involve any rules of thumb, and does not require a substance classification for handling the information or generating the results. We have recently succeeded in constructing such unified PNN-based models for the prediction of toxic effects to the fathead minnow (Pimephales promelas) and to the crustacean Daphnia magna. Here we investigate the potential of using the same unified approach to model the sublethal toxicity of substances to the ciliate T. pyriformis. The results confirm the general validity and applicability of the method to this species.
Materials and Methods

Data

The modeling study we report on is based on the 825 organic compounds for which 48-h ICG50 sublethal toxicity data for T. pyriformis have been published and are available in the TerraTox™ database (TerraBase 1999). No further selection of the compounds was made, and this data set represents most of the T. pyriformis 48-h ICG50 values published in the literature. It covers a large variety of chemical classes or groups and includes chemicals that have been described as fathead minnow toxicants acting as neutral and polar narcotics (narcosis I, II, and III modes), oxidative phosphorylation uncouplers, acetylcholinesterase inhibitors, respiratory inhibitors, electrophile/proelectrophile reactive agents, and probably some compounds effective via other specific modes of action (Russom et al. 1997). The chemical classes include alkanes, ethers, alcohols, aldehydes, ketones, acids, esters, lactones, nitriles, amines, phenols, nitro compounds, thiocyanate and isothiocyanate compounds, oxygen-, sulfur-, and nitrogen-containing heterocyclics, ureas, quinones, enols, ynols, and others. The substances also include fluorine-, chlorine-, bromine-, iodine-, and sulfur-containing compounds, estrogenic surfactant derivatives, such as 4-nonylphenol, and pesticidal compounds, such as Tralkoxydim and pentachlorophenol.

All data were applied in or converted to the pT notation, i.e., the toxicity is expressed as pT = −log(mmol/L). The associated structural information in the form of SMILES strings, chemical formula, and molecular weight was secured from the TerraTox database. As model input to all artificial neural network models used in this study we consider data vectors whose components consist of the values associated with the molecular descriptors and the number of occurrences of the functional groups in Table 1, and the number of individual occurrences of C, H, Br, Cl, F, I, N, O, and S atoms.
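The atom-count part of the input vector can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the authors' code: the function name `atom_counts` and the regular-expression parsing are hypothetical, and the parser handles only simple molecular formulas without parentheses, charges, or isotopes.

```python
import re
from collections import Counter

# Elements counted individually in the model input, in the order given in the text.
ELEMENTS = ("C", "H", "Br", "Cl", "F", "I", "N", "O", "S")

def atom_counts(formula):
    """Return the counts of the tracked elements in a simple molecular formula.

    Hypothetical helper: assumes a flat formula such as "C6HCl5O".
    """
    counts = Counter()
    # An element symbol is an uppercase letter plus an optional lowercase letter,
    # followed by an optional integer (a missing integer means 1).
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return [counts[element] for element in ELEMENTS]

# Pentachlorophenol, C6HCl5O
print(atom_counts("C6HCl5O"))  # -> [6, 1, 0, 5, 0, 0, 0, 1, 0]
```

In a full input vector, these nine counts would be joined with the Table 1 descriptor values, which require substructure analysis that this sketch does not attempt.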
The list of structural and functional groups shown in Table 1 is essentially the same as has been used in some of our earlier investigations on the toxicity of large numbers of chemicals to the bacterium Vibrio fischeri, the fathead minnow (P. promelas), and the water flea D. magna (Kaiser and Niculescu 1999a, 1999b; Niculescu et al. 1998). This list is based on the most prevalent structural features found in the compounds selected by various researchers for the toxicity tests for these species. Other groups (Verhaar et al. 1992; Neuwen et al. 1997;
Russom et al. 1997) have used somewhat similar lists of substructure fragments to classify substances as to their toxic mode of action; however, these schemes do not lend themselves to developing quantitative predictions for all structures. While additional (and/or better) fragment descriptors are quite conceivable, our experience indicates that the present list is quite encompassing for most of the smaller molecules commonly tested in aquatic bioassays. The output from all models is the predicted 48-h ICG50 sublethal toxicity value to T. pyriformis in pT notation.
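The pT notation used for the model output is a simple logarithmic transform of the molar concentration. The sketch below shows the conversion; the function names and the example values are hypothetical and are not taken from the data set.

```python
import math

def pt_from_mmol_per_l(conc_mmol_per_l):
    """pT = -log10 of the concentration expressed in mmol/L."""
    if conc_mmol_per_l <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(conc_mmol_per_l)

def pt_from_mg_per_l(conc_mg_per_l, mol_weight_g_per_mol):
    """Convert a mass concentration to pT.

    mg/L divided by g/mol gives mmol/L, so no extra scaling factor is needed.
    """
    return pt_from_mmol_per_l(conc_mg_per_l / mol_weight_g_per_mol)

# Hypothetical example: 0.01 mmol/L corresponds to a pT of about 2.
print(round(pt_from_mmol_per_l(0.01), 6))
```

Higher pT values therefore correspond to lower effect concentrations, that is, to more toxic compounds.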
Methodological Considerations

Scientists use various approaches in building toxicity models. At a time when only statistical tools were available, such models involved variations on the classical multivariate regression paradigm, later extended to include partial least squares and principal component analysis. The main problem with such types of models is their limited scope: they are valid only for very narrow classes of compounds. Furthermore, the mathematics behind such models requires the validity of several hidden assumptions, conditions that were often completely ignored by the model users, or worse, by the modelers themselves.

In most of the published (quantitative) structure-activity relationship (QSAR) models, the structure part of the relationship de facto consists of a set of molecular structure characteristics (presence of substructures or structural similarity), thus narrowing the class of compounds that the model is supposed to handle. The quantitative aspect of such QSARs consists in using one or more physicochemical parameters as the models’ input variable(s). When applying such a model to make predictions, the first step is to verify the substructural requirements imposed by its scope (the class or classes of compounds for which the model is supposed to be valid). The next step is to prepare the input data corresponding to the variables involved in the relationship and feed them into the QSAR equation(s) in order to generate the prediction. Typically, such equations use the structure information indirectly (for example, one of the input variables in the Hansch-type relationships is the octanol/water partition coefficient or an incremental derivative thereof, such as a substituent’s hydrophobic fragment value).
The operation of preparing the needed input frequently requires computation of the associated physicochemical parameters, which uses (more or less) the structural information as direct input in other structure-activity relationships, which are themselves subject to various restrictions (candidly ignored by most users) and are themselves generators of serious additional errors. Though some of these QSARs perform well on very narrow classes of compounds, the larger challenge is to handle a variety of more complex substances simultaneously, whether or not they share common molecular backbones, fragments, or characteristics of a congeneric class. A very serious problem arises at this level: how to choose between the available relationships for the compounds common to two or more of such classes. Until now, no one has succeeded in finding a satisfactory solution. Most available solutions are empirical and consist essentially of various rules of thumb on how to handle various classes of compounds. For instance, a notorious combination of such rules of thumb has been implemented inside the US EPA’s ECOSAR toxicity assessment expert system and has recently been subject to very serious criticism because it involves the use of a very large number of statistically insignificant QSARs (Kaiser et al. 1999). Further proof of this assertion can be found in the recent comparison of the predictive capability of a number of widely used modeling methodologies on an independent data set (Moore et al. 1999). These findings showed that the ASTER, CNN, ECOSAR, OASIS, and TOPKAT methods provided poor predictions when compared with the PNN methodology used here. Undoubtedly, much of that is due to their heavy reliance on the octanol/water partition coefficient as an independent variable.
PNN Modeling of Toxicity to Tetrahymena
Table 1. Functional group descriptors used in this study

Group      Description
5ringY     aromatic (two double bonds) 5-rings containing one or more hetero atoms (e.g., N, O, S)
6ringY     aromatic (three double bonds) 6-rings containing one or more hetero atoms (e.g., N, O, S)
ar-OH      aryl hydroxy (-OH), sulfhydryl (-SH), and ar-O− groups
Alk-OH     alkyl hydroxy (-OH) groups, includes -OH in =N-OH and S-OH groups
=O         keto (>C=O) groups, includes =O in S=O, and -N=O (nitroso) groups
-COOH      acid groups where C = C, S
-COOR      ester groups where C = C, S
-CONH2     amide groups where C = C, S
ar-NH2     aryl amino (-NH2) groups
Alk-NH2    alkyl amino (-NH2) groups
-NO2       nitro (-NO2) groups
-CN        cyano (-CN) groups
ar-X       halide atoms on aromatic rings or other sp2 carbon atoms with X = Br, Cl, and/or I, but not F
Alk-X      alkyl halide (Alk-X) where X = Br, Cl, and/or I, but not F
-CF3       trifluoromethyl (CF3) groups
-F         number of aryl and/or alkyl fluoride (Ar-F and/or Alk-F) groups, other than in CF3 groups
6ring      aromatic 6 member rings, excluding those containing nitrogen, sulfur, and oxygen
5ring      aromatic 5 member rings, excluding those containing nitrogen, sulfur, and oxygen
-O-        ether (R-O-R′) linkages
=CR2       number of terminal (=CR2) groups where R = H, Cl, or Br
Thio       sulfide bridge or thiol, e.g., -S-
Thion      thiocarbonyl group, e.g., -(C=S)-
Sox        sulfoxide (-SO-), sulfone (-SO2-), or sulfate (-OSO3-)
-N=        sp2 hybridized nitrogen atoms, such as in azo (-N=N-), (=N-N=), azoxy (-N=N(O)-), and (=N+=) in aromatic dye compounds
Qu         quinone (value 1), hydroquinone (value 0.5), or peroxide (value 0.5) character
Ionic      ionic state
=CH-CO-    compound is conjugated to produce an acidic (CH) group, e.g., =CH-C(=O)-R
Ibut       tertiary butyl (value 1) or isopropyl (value 0.5) group
n-C        the longest aliphatic chain, not terminating with an aromatic ring, carboxy group or other large substituent
-yne       number of carbon-carbon triple bonds (acetylenic bonds) in the compound
NOC        sum of nitrogen and oxygen atoms, minus three times the number of nitro groups, divided by the number of carbon atoms
Anrg       degree of fixed connectivity of ring fragments
MW         molecular weight
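Most entries in Table 1 are occurrence counts, but NOC is a computed ratio. A minimal sketch of that arithmetic follows, using the definition given in the table; the function name `noc` and the example compounds are illustrative choices, not taken from the paper's data files.

```python
def noc(n_nitrogen, n_oxygen, n_nitro, n_carbon):
    """NOC descriptor as defined in Table 1:
    (N atoms + O atoms - 3 * nitro groups) / C atoms.

    Subtracting three per nitro group removes the one N and two O atoms
    that each -NO2 group contributes to the atom totals.
    """
    if n_carbon <= 0:
        raise ValueError("NOC is undefined without carbon atoms")
    return (n_nitrogen + n_oxygen - 3 * n_nitro) / n_carbon

# Nitrobenzene, C6H5NO2: one N, two O, one nitro group, six C
print(noc(1, 2, 1, 6))  # -> 0.0
# Ethanol, C2H6O: no N, one O, no nitro group, two C
print(noc(0, 1, 0, 2))  # -> 0.5
```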
For the last 30 years, as part of the Medicinal Chemistry Project at Pomona College, Hansch and his group have been developing the C-QSAR computer package distributed by BioByte Corp. The most valuable part of this software consists of a database summarizing the available information on more than 14,000 QSARs, including equations, parameters, lists of compounds on which the models are based, and sources of publication. It is a highly valuable source of QSAR equations, and it also reveals the same picture: most of these models are based on very small, sometimes statistically insignificant data sets. The Internet version of this database (queried for the octanol/water partition era) returned references to 53 such QSARs for the toxicity to Tetrahymena of a variety of classes of congeners. Most attempts to build a prediction system for the toxic potential of chemicals that simultaneously involves more than one substance class QSAR have resulted in failure, for obvious reasons, the most important being the limited scope of the individual QSARs and conflicting predictions for compounds sharing features common to several classes.

A more orderly and logical set of rules of thumb is implemented in version 3.1 of the M-CASE program created by Klopman et al. (2000). The first step consists of identifying the baseline relationship between the octanol/water partition coefficient and toxicity. All compounds whose toxicity is not accounted for by this relationship are assumed to derive their activity from other factors. The deviations between measured toxicity and baseline predictions form a new training set, based on which the program hierarchically identifies the substructures that appear mostly in active molecules and may be responsible for these deviations. After identifying the top biophore (most active fragment), the compounds containing this fragment are eliminated from the training set. The next strongest biophore is identified in the same way based on the remaining training data. The procedure is repeated until all activity of the molecules in the training set has been accounted for or no other statistically significant substructure can be found. The program then builds a multiple regression-type QSAR for each congeneric subset corresponding to each of these biophores using modulators (indicator variables for the presence of substructures of interest, or calculated physicochemical parameters, such as highest occupied molecular orbital and lowest unoccupied molecular orbital, octanol/water partition coefficient, etc.). The final QSAR summarizes the predictions of the baseline QSAR and all deviations generated by the QSARs corresponding to biophores present in the molecule. Obviously, its scope is limited to the scope of the underlying baseline relationship and that of the congeneric QSARs involved in performing the computation. There are also unanswered questions about the proper handling of compounds containing multiple biophores. Nonlinear contributions of biophores to the toxicity have not been investigated. A complete cross-validation study revealing a more detailed and more comprehensive picture of the recognitive and predictive capabilities of M-CASE, based exclusively on the analysis of model errors and not on indicator values (based on subjective performance criteria), would be welcome. The classical example of the GABA data set, which the M-CASE methodology is not able to handle (no substructures are statistically significant), is analyzed by Downs et al. (1995). We discuss the M-CASE program because it is representative of a period in toxicology research classified by Hansch and co-workers at Pomona
Table 2. The list of the 750 randomly selected compounds used for the cross-validation experiment (i = the index of the corresponding T set)
Compound (CAS registry number), followed by the corresponding values of i
50-30-6 50-43-1 50-45-3 50-79-3 50-84-0 51-28-5 51-36-5 51-44-5 51-79-6 55-21-0 57-06-7 57-57-8 58-27-5 58-90-2 59-50-7 60-09-3 60-12-8 62-23-7 64-17-5 65-45-2 65-85-0 66-25-1 66-76-2 67-36-7 67-56-1 67-63-0 67-64-1 67-68-5 69-72-7 71-23-8 71-36-3 71-41-0 74-11-3 75-05-8 75-31-0 75-64-9 75-65-0 75-84-3 75-85-4 78-81-9 78-83-1 78-92-2 78-93-3 78-94-4 78-96-6
5 3 2 5 3 1 1 5 2 1 2 5 1 5 1 1 3 2 3 2 4 1 4 1 3 2 3 2 1 1 4 1 2 4 5 2 4 3 2 4 5 2 5 5 5
79-06-1 79-20-9 80-62-6 83-42-1 83-72-7 84-11-7 84-66-2 84-74-2 86-53-3 86-56-6 86-74-8 87-25-2 87-60-5 87-62-7 87-63-8 87-86-5 87-87-6 88-18-6 88-65-3 88-69-7 88-72-2 88-73-3 88-75-5 88-89-1 89-59-8 89-61-2 89-95-2 90-01-7 90-02-8 90-12-0 90-15-3 90-30-2 90-41-5 90-43-7 90-72-2 90-90-4 91-15-6 91-19-0 91-20-3 91-22-5 91-57-6 91-63-4 91-66-7 92-52-4 92-67-1
2 2 4 4 4 1 3 1 5 4 5 5 3 3 5 4 4 1 3 2 5 4 5 4 1 3 1 3 4 2 5 1 4 3 4 3 1 4 5 1 4 3 5 2 5
92-69-3 92-82-0 92-92-2 93-55-0 93-58-3 93-89-0 94-09-7 94-67-7 94-71-3 95-01-2 95-15-8 95-16-9 95-50-1 95-51-2 95-53-4 95-55-6 95-57-8 95-65-8 95-68-1 95-69-2 95-74-9 95-75-0 95-76-1 95-77-2 95-78-3 95-81-8 95-82-9 95-88-5 95-92-1 95-95-4 96-05-9 96-15-1 96-17-3 96-22-0 96-48-0 97-00-7 97-02-9 97-54-1 97-88-1 97-96-1 98-06-6 98-54-4 98-82-8 98-86-2 98-95-3
5 5 4 1 1 4 3 3 1 3 3 5 2 4 2 2 5 2 4 5 4 3 2 3 2 2 1 4 2 1 1 2 1 2 2 5 3 1 2 3 2 4 2 5 4
99-04-7 99-05-8 99-06-9 99-08-1 99-09-2 99-28-5 99-51-4 99-61-6 99-65-0 99-71-8 99-75-2 99-76-3 99-77-4 99-88-7 99-89-8 99-90-1 99-93-4 99-94-5 99-99-0 100-00-5 100-01-6 100-02-7 100-10-7 100-12-9 100-14-1 100-25-4 100-29-8 100-39-0 100-44-7 100-46-9 100-47-0 100-48-1 100-51-6 100-52-7 100-54-9 100-61-8 100-66-3 100-68-5 100-70-9 101-41-7 103-05-9 103-16-2 103-63-9 103-69-5 103-72-0
5 4 3 3 3 1 2 1 3 5 3 3 2 4 1 5 3 4 1 2 1 1 2 4 3 1 1 3 2 3 5 5 2 3 4 2 2 2 2 2 5 5 5 4 3
103-73-1 103-83-3 103-84-4 103-85-5 104-40-5 104-50-7 104-51-8 104-61-0 104-76-7 104-85-8 104-86-9 104-87-0 104-88-1 104-90-5 104-91-6 104-94-9 105-07-7 105-37-3 105-46-4 105-50-0 105-53-3 105-59-9 105-75-9 105-76-0 106-40-1 106-44-5 106-47-8 106-49-0 106-50-3 106-51-4 106-63-8 106-65-0 106-79-6 107-02-8 107-12-0 107-18-6 107-19-7 107-85-7 107-87-9 108-03-2 108-10-1 108-29-2 108-39-4 108-42-9 108-43-0
2 3 2 4 3 4 5 1 4 5 4 1 1 2 4 2 1 3 3 3 5 4 3 3 3 2 3 4 5 1 1 2 5 3 1 1 5 2 1 2 5 4 3 2 4
108-44-1 1 108-45-2 1 108-46-3 4 108-48-5 5 108-59-8 3 108-68-9 2 108-69-0 3 108-86-1 1 108-89-4 2 108-90-7 2 108-93-0 4 108-95-2 3 108-99-6 4 109-08-0 3 109-09-1 2 109-49-9 5 109-73-9 4 109-74-0 3 109-75-1 4 109-83-1 3 109-97-7 4 110-40-7 4 110-42-9 5 110-43-0 3 110-58-7 3 110-59-8 3 110-62-3 1 110-65-6 5 110-68-9 5 110-73-6 5 110-74-7 3 110-86-1 2 111-11-5 3 111-13-7 2 111-26-2 5 111-27-3 5 111-42-2 1 111-68-2 5 111-70-6 2 111-71-7 1 111-86-4 4 111-87-5 3 112-12-9 5 112-17-4 1 112-20-9 4 (Continued)
College as the octanol/water partition era in contrast to the pre-octanol/water partition era. Like any historical period, an era has a beginning and also an end. Recent developments reported by Eldred et al. (1999), Kaiser and Niculescu (1999a), and Niculescu et al. (1998) proved the possibility of building very competitive models for the toxicity of chemicals to various species based exclusively on simple structural molecular fragment information, thus opening the door for a post-octanol/water partition era. Finding a more comprehensive solution to the toxicity effects modeling problem for thousands of substances of all kinds of structures (such as on the DSL and NDSL lists) requires fundamental changes in our attitude toward the problem itself. In our opinion, the best solution is to not get involved in any attempt to classify substances in the first
place. Obviously, the key to the toxic effects of chemicals resides exclusively in their molecular structure. Any relationship based on the popular physicochemical parameters, such as log P, various kinetic parameters, orbital energies, and so on, in practice involves a large range of QSARs, each using the molecular structure information as input and each contributing additional uncertainties to the final prediction. So, the best way to eliminate this additional source of errors is to eliminate the intermediate QSARs from the relationship and to use directly as input to the model only information reflecting structural features, such as the number of occurrences of specific molecular groups/atoms and information reflecting how they are connected inside the molecule. Naturally, the larger the portion of the chemical universe that we consider for the modeling exercise, the larger the number of
Table 2. (Continued)
112-30-1 112-31-2 112-42-5 112-44-7 112-53-8 112-54-9 112-70-9 115-19-5 117-80-6 118-75-2 118-79-6 118-91-2 118-92-3 118-93-4 119-61-9 120-47-8 120-51-4 120-72-9 120-80-9 120-82-1 120-83-2 121-14-2 121-69-7 121-71-1 121-87-9 121-89-1 122-03-2 122-79-2 122-97-4 123-07-9 123-08-0 123-15-9 123-19-3 123-25-1 123-30-8 123-31-9 123-51-3 123-54-6 123-66-0 123-72-8 123-86-4 124-12-9 124-13-0 124-19-6 124-68-5 126-98-7 127-06-0 127-66-2 128-37-0
4 3 2 1 4 1 1 4 5 4 4 3 2 4 2 3 4 3 1 3 4 4 1 5 2 4 4 4 4 4 4 3 2 2 1 5 1 4 4 5 2 5 4 2 2 1 5 4 4
130-15-4 131-11-3 131-16-8 131-58-8 132-64-9 133-08-4 133-11-9 133-13-1 134-84-9 134-85-0 136-60-7 136-77-6 137-18-8 137-19-9 137-32-6 139-59-3 140-66-9 140-88-5 141-28-6 141-32-2 141-43-5 141-78-6 141-79-7 142-62-1 142-92-7 143-08-8 150-13-0 150-76-5 156-87-6 253-52-1 260-94-6 271-89-6 288-32-4 289-80-5 289-95-2 290-37-9 305-85-1 329-71-5 342-24-5 350-46-9 363-03-1 367-12-4 367-27-1 371-40-4 371-41-5 372-19-0 372-20-3 392-71-2 393-39-5
1 3 1 1 2 2 1 2 4 3 5 2 1 3 1 5 5 1 5 3 3 4 4 5 1 2 3 2 1 3 4 2 5 2 4 2 5 2 2 2 5 5 2 1 5 2 5 5 5
394-32-1 402-45-9 446-51-5 446-52-6 456-47-3 459-56-3 459-57-4 481-39-0 481-42-5 490-79-9 498-00-0 499-83-2 500-22-1 500-66-3 500-99-2 501-94-0 502-44-3 502-56-7 524-42-5 526-75-0 527-17-3 527-54-8 527-60-6 527-61-7 528-29-0 529-20-4 530-55-2 534-52-1 535-80-8 536-60-7 536-74-3 536-90-3 538-68-1 540-37-4 540-38-5 540-88-5 542-54-1 542-85-8 552-16-9 552-41-0 552-89-6 553-90-2 553-97-9 554-00-7 554-12-1 554-84-7 555-03-3 555-16-8 555-21-5
3 1 3 2 5 4 3 4 1 3 1 5 5 5 5 5 3 4 5 3 2 4 5 1 5 3 2 4 3 4 1 3 4 5 4 4 2 1 3 1 3 2 3 1 2 3 3 4 5
563-80-4 571-58-4 576-24-9 576-55-6 577-19-5 578-54-1 579-66-8 580-51-8 582-22-9 582-33-2 583-58-4 583-78-8 585-34-2 585-79-5 586-76-5 586-78-7 587-03-1 590-86-3 591-19-5 591-27-5 591-35-5 591-50-4 591-78-6 592-82-5 592-84-7 593-08-8 594-39-8 597-97-7 598-56-1 598-75-4 605-69-6 606-22-4 607-81-8 608-27-5 608-71-9 609-08-5 609-89-2 609-93-8 610-15-1 610-71-9 611-06-3 611-20-1 612-24-8 612-25-9 613-45-6 613-90-1 615-36-1 615-43-0 615-58-7
5 3 2 2 1 4 5 1 4 5 2 4 4 5 5 2 4 4 3 5 1 1 3 3 5 4 4 1 5 5 5 4 4 2 2 2 1 1 1 4 3 4 4 1 1 2 2 2 3
615-74-7 615-93-0 615-94-1 616-24-0 616-86-4 618-36-0 618-45-1 618-62-2 618-87-1 619-24-9 619-25-0 619-42-1 619-45-4 619-50-1 619-72-7 619-80-7 620-17-7 620-22-4 620-24-6 622-62-8 622-78-6 623-00-7 623-03-0 623-04-1 623-12-1 623-91-6 624-48-6 624-95-3 625-28-5 625-30-9 625-33-2 626-43-7 626-89-1 627-05-4 627-35-0 628-05-7 628-30-8 628-99-9 629-12-9 630-18-2 634-67-3 634-83-3 634-91-3 634-93-5 635-12-1 636-30-6 643-28-7 644-08-6 645-09-0
3 5 4 1 5 3 5 2 2 1 3 3 5 2 5 1 1 4 2 1 3 4 2 1 4 4 1 5 1 2 1 3 3 2 1 5 3 1 4 5 4 2 1 4 2 1 2 2 4
645-56-7 2 653-37-2 2 681-58-6 1 693-16-3 2 693-54-9 5 695-06-7 5 695-99-8 5 697-82-5 4 705-86-2 4 706-14-9 1 732-26-3 2 762-21-0 2 762-42-5 1 764-01-2 2 764-60-3 4 766-51-8 5 766-84-7 3 767-00-0 2 768-59-2 5 769-28-8 3 769-39-1 4 771-60-8 4 771-61-9 2 818-61-1 1 818-72-4 5 821-09-0 4 821-41-0 2 821-55-6 5 827-23-6 3 831-82-3 2 836-30-6 4 844-51-9 4 868-77-9 3 873-32-5 1 873-62-1 2 873-63-2 5 873-74-5 1 873-75-6 5 873-76-7 1 874-42-0 1 874-90-8 1 875-59-2 4 875-79-6 1 877-43-0 3 877-65-6 1 927-74-2 5 928-80-3 3 928-90-5 3 928-95-0 4 (Continued)
these parameters will have to be. For example, the descriptors in Table 1 alone accommodate in excess of 2^30 − 1 classes of organic compounds. This is obviously a huge number of structures, but still far from sufficient to cover the whole chemical universe. For example, inorganics, organometallics, and organophosphorus compounds, three very important groups of compounds, are not represented at all in this analysis, as there are no available compatible data.

Next, let us assume we have access to the appropriate choice of parameters suited to properly represent the corner of the chemical universe we are interested in. The type of relationship we attempt to identify is completely unknown, so we must start from the assumption that this
relationship is of a nonlinear type. The complexity and peculiarities of the available data make it practically impossible to use classical nonlinear multivariate statistical methodologies, because the mathematical assumptions these methodologies require are usually not satisfied or are impossible to verify. Therefore, modeling must be handled in a completely different way. The most affordable alternative is to use mixed models where the nonlinear part of the relationship is managed through one or more feed-forward neural networks combined with statistical linear corrections based on the errors generated by these neural networks. However, it is still an open question as to which is the best choice of neural networks.
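To make the idea of a Gaussian-kernel probabilistic neural network more concrete, the following sketch shows one common way such a network produces a continuous output: a kernel-weighted average of the training targets, often described as a general regression neural network. It is an assumption that this form matches the implementation used in the study; the function names, the single smoothing parameter `sigma`, and the toy data are all hypothetical, and only a plain Z-transform is shown as preprocessing.

```python
import math

def z_transform(rows):
    """Standardize each column to zero mean and unit standard deviation.

    Constant columns are left centered with a divisor of 1.0.
    """
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
        for col, m in zip(columns, means)
    ]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds

def pnn_predict(train_x, train_y, query, sigma):
    """Predict a target as the Gaussian-kernel weighted mean of training targets.

    "Training" amounts to storing the data: each stored pattern contributes a
    weight exp(-d^2 / (2 * sigma^2)), where d is its distance to the query.
    """
    weights = []
    for x in train_x:
        d2 = sum((a - b) ** 2 for a, b in zip(x, query))
        weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        # Query is far from every pattern: fall back to the plain mean.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total

# Toy data (hypothetical): one descriptor, two training compounds.
train_x = [[0.0], [1.0]]
train_y = [1.0, 3.0]
print(pnn_predict(train_x, train_y, [0.5], sigma=0.5))
```

In this form, adding new compounds later only means appending them to the stored training patterns, which is consistent with the single-pass character of PNN training.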
Table 2. (Continued)
928-96-1 935-95-5 942-92-7 999-61-1 1002-28-4 1002-36-4 1009-14-9 1016-78-0 1072-97-5 1119-40-0 1119-44-4 1122-91-4 1126-46-1 1129-35-7 1129-37-9 1137-41-3 1194-02-1 1198-55-6 1423-60-5 1443-80-7 1450-72-2 1462-12-0 1484-26-0 1504-58-1 1518-83-8 1527-89-5 1570-64-5 1576-95-0 1629-58-9 1629-60-3 1669-44-9 1671-75-6 1674-37-9 1679-36-3 1679-47-6 1689-82-3 1731-84-6 1732-09-8
2 5 4 5 1 3 5 1 1 4 5 3 5 3 5 4 2 5 5 3 2 5 3 3 2 1 5 3 3 1 2 5 2 5 5 4 4 3
1745-81-9 1821-39-2 1877-77-6 1885-29-6 2028-63-9 2038-57-5 2046-18-6 2049-80-1 2050-20-6 2050-23-9 2050-60-4 2051-31-2 2113-58-8 2117-11-5 2138-22-9 2163-48-6 2176-62-7 2227-79-4 2243-27-8 2357-47-3 2370-63-0 2409-55-4 2416-94-6 2437-25-4 2443-47-2 2495-37-6 2497-21-4 2499-95-8 2508-29-4 2613-23-2 2683-43-4 2696-84-6 2845-89-8 2905-69-3 2920-38-9 2928-43-0 2978-58-7 3012-37-1
5 1 2 3 1 5 3 2 4 2 3 3 2 1 3 5 1 1 4 2 3 1 4 5 3 2 1 4 2 2 2 3 1 1 3 1 1 5
3031-66-1 3034-34-2 3068-88-0 3117-03-1 3209-22-1 3240-09-3 3261-62-9 3360-41-6 3441-01-8 3481-20-7 3544-25-0 3597-91-9 4048-33-3 4097-49-8 4117-14-0 4128-31-8 4214-79-3 4312-99-6 4344-55-2 4383-06-6 4404-45-9 4421-08-3 4620-70-6 4655-34-9 4798-44-1 4901-51-3 4916-57-8 5159-41-1 5332-73-0 5344-90-1 5446-02-6 5665-74-7 5683-33-0 5728-52-9 5910-89-4 5921-73-3 5922-60-1 5965-53-7
3 5 3 4 2 4 2 3 5 1 3 1 1 5 2 5 1 1 5 1 5 1 3 2 5 3 2 1 5 4 3 4 4 3 1 4 4 3
6032-29-7 6153-39-5 6168-72-5 6175-49-1 6261-22-9 6284-83-9 6291-85-6 6373-50-8 6418-38-8 6602-32-0 6627-55-0 6636-78-8 6641-64-1 6802-75-1 7149-10-2 7153-22-2 7244-78-2 7251-61-8 7307-55-3 7383-19-9 7452-79-1 7533-40-6 7781-98-8 10229-10-4 13037-86-0 13054-87-0 13214-66-9 13325-10-5 13952-84-6 14321-27-8 14548-45-9 14898-87-4 15128-82-2 15852-73-0 15892-23-6 16090-77-0 16245-79-7 16369-21-4
4 4 4 1 1 5 5 4 3 3 3 3 1 2 2 5 4 1 1 2 2 4 3 4 1 1 3 1 5 5 3 3 5 3 5 2 4 2
16879-02-0 17041-60-0 17849-38-6 18368-63-3 18979-55-0 20296-29-1 20739-58-6 24544-04-5 24629-25-2 24964-64-5 26734-09-8 28689-08-9 33228-44-3 33719-74-3 33966-50-6 35161-71-8 39905-50-5 40420-22-2 42882-31-5 51887-25-3 53222-92-7 54135-80-7 55815-20-8 57455-06-8 58175-57-8 64063-37-2 65337-13-5 66086-33-7 66793-96-2 78056-39-0 87820-88-0 100900-16-1 109704-32-7 112245-13-3
4 4 3 4 5 3 3 1 2 5 1 2 4 3 5 1 5 5 3 4 1 2 4 5 2 4 1 3 3 5 2 2 4 1
The feed-forward backpropagation neural network (BNN) and the probabilistic neural network (PNN) are the two best candidates to perform this task. The training/learning paradigms used by these two neural networks are completely different. Consequently, they behave in very different ways. In terms of practical accuracy, the two neural networks produce similar results. Their big difference is in the speed of training. For the BNN, the training/learning involves multiple passes through the training set. Choosing the best network architecture in terms of the number of hidden layers and hidden neurons, as well as the best moment to stop the training, are additional problems the BNN has to deal with. Fine-tuning a BNN can be a very time-consuming endeavor. None of these problems exist for the PNNs, because their architecture is governed by very precise rules and their training/learning phase involves only a single pass through the training data. No rules of thumb and no artificial classification are involved. When additional data become available at a later time, they are easily incorporated into a trained PNN model by updating its training using only the new data. This is a very big advantage when compared to the classical models, which must be completely rebuilt to accommodate new data. For reasons described by Niculescu et al. (1998) we selected, as our preferred modeling tool, the basic PNN with Gaussian kernel and data preprocessing consisting of a combination of Z-transform, hyperbolic
tangent, and convenient linear compression functions. A detailed presentation of the PNN paradigm and the associated computer algorithms, including the principles of data preprocessing, may be found in the work of Masters (1993). We have successfully used this methodology to model the acute toxicity of chemicals to the fathead minnow (P. promelas), the bacterium V. fischeri (formerly known as Photobacterium phosphoreum), and to D. magna (Kaiser et al. 1997b; Niculescu et al. 1998; Kaiser and Niculescu 1999a, 1999b). As a first step in building a predictive model based on artificial neural networks, it is vital to investigate if the network’s architecture and computational paradigm are appropriate for the task at hand. This is done through cross-validation and/or classical validation. For the specifics of the cross-validation in the case of large data sets we refer to Kaiser and Niculescu (1999a). Practical details on this subject will be presented in the next section. Classical validation using an external test set can be seen as the first of the two steps involved in performing an unbalanced cross-validation experiment. The second step consists of switching the training with the test set. From this perspective, it appears that clearly the assessment of the model stability and predictive performance produced by cross-validation will be more accurate than the one using exclusively external validation. Nevertheless, when it is to compare the performance of different models, the real picture is revealed by external validation, which consists of a comparison of
Table 3. Statistical estimators associated with the cross-validation performed on the 750-compound data set and the errors generated by them on the corresponding training and cross-validation test sets

                                          m1        m2        m3        m4        m5
Training
  Average errors                     -0.0285   -0.0228   -0.0313   -0.0285   -0.0268
  Standard deviation errors           0.3285    0.2756    0.3065    0.3365    0.3092
  Sum square errors                  65.2223   45.8898   56.9665   68.4438   57.7996
  Average square error                0.1087    0.0765    0.0949    0.1141    0.0963
  Skewness errors                    -0.7618   -0.4627   -0.6668   -0.4650   -0.7879
  Kurtosis errors                     3.1005    4.3474    2.7081    2.6901    3.4024
  Pearson's r² (training)             0.9301    0.9457    0.9409    0.9282    0.9379
  Training set size                      600       600       600       600       600
Test
  Average errors                     -0.1435    0.0070    0.0248   -0.0570   -0.1217
  Standard deviation errors           0.4392    0.4802    0.4509    0.5022    0.4974
  Sum square errors                  32.0168   34.5996   30.5931   38.3226   39.3289
  Average square error                0.2134    0.2307    0.2040    0.2555    0.2622
  Skewness errors                    -0.2777   -0.3714   -0.5110   -0.6758   -0.8452
  Kurtosis errors                     0.1022    0.7910    4.4838    1.3199    1.6017
  Pearson's r² (test)                 0.8547    0.8284    0.8023    0.8059    0.7971
  Test set size                          150       150       150       150       150
Training + Test
  Average errors                     -0.0515   -0.0168   -0.0201   -0.0342   -0.0458
  Standard deviation errors           0.3564    0.3272    0.3411    0.3757    0.3569
  Sum square errors                  97.2391   80.4894   87.5596  106.7663   97.1285
  Average square error                0.1297    0.1073    0.1167    0.1424    0.1295
  Skewness errors                    -0.6871   -0.3909   -0.5463   -0.6233   -1.0174
  Kurtosis errors                     2.0568    3.3942    4.4248    2.7135    3.7140
  Pearson's r² (training + test)      0.9137    0.9202    0.9173    0.9030    0.9085
  Whole set size                         750       750       750       750       750
Table 4. Statistical estimators associated with the models (M1)–(M4) on the training set

Training (750 compounds)                  M1        M2        M3        M4
  Average errors                           0         0         0         0
  Standard deviation errors           0.2641    0.2775    0.2693    0.2708
  Sum square errors                  52.3217   57.7389   54.3791   55.0107
  Average square error                0.0698    0.0770    0.0725    0.0733
  Skewness errors                    -0.1846   -0.1288   -0.2195   -0.1873
  Kurtosis errors                     1.2057    0.9694    1.2292    1.5406
  Pearson's r² (training)             0.9358    0.9293    0.9333    0.9314
the prediction errors for compounds that were not used in the training phase by any of the applied models. For the models considered in this study, the output consists of only one variable. Consequently, their recognitive and predictive performance is best reflected by basic statistical indicators, such as the square of the Pearson’s correlation coefficient between measured and predicted values, the average and standard deviations of the errors, skewness, and kurtosis of the errors, as well as the sum of square errors and, associated with the latter and the sample size, the average square error. We will use them extensively in presenting the results.
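The indicators listed above can be computed along the following lines. This is only a minimal sketch, not the authors' code: the function name `error_statistics`, the sign convention (error = predicted minus measured), and the use of excess kurtosis are assumptions made for illustration. The standard deviation uses the divisor n, which appears to be consistent with the relation between the tabulated sums of square errors, average errors, and standard deviations, although the paper does not state it. The sketch needs at least two compounds with non-identical errors and values.

```python
import math


def error_statistics(measured, predicted):
    """Summary statistics of the errors, taken here as predicted - measured."""
    n = len(measured)
    errors = [p - m for m, p in zip(measured, predicted)]
    mean_err = sum(errors) / n

    # Central moments of the errors (divisor n; an assumption, see lead-in).
    m2 = sum((e - mean_err) ** 2 for e in errors) / n
    m3 = sum((e - mean_err) ** 3 for e in errors) / n
    m4 = sum((e - mean_err) ** 4 for e in errors) / n

    sse = sum(e * e for e in errors)

    # Pearson correlation between measured and predicted values, squared.
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)

    return {
        "average_error": mean_err,
        "std_error": math.sqrt(m2),
        "sum_square_errors": sse,
        "average_square_error": sse / n,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis (assumed convention)
        "pearson_r2": cov * cov / (var_m * var_p),
    }
```

Calling `error_statistics(measured, predicted)` on two equal-length lists of pT values returns the seven indicators as a dictionary keyed by name.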
Results and Discussion

Cross-Validation

Following the methodological considerations presented in the previous section, the first step is to assess how appropriate the PNN is, in architecture and choice of parameters, for the task of modeling the 48-h ICG50 pT T. pyriformis toxicity endpoints. With this in mind, we performed a cross-validation modeling experiment using a leave-20%-out cross-validation procedure on a subset of 750 randomly selected compounds from the 825-compound data set for which such measured toxicity endpoints were available. For this purpose, the whole data set attached to the 750 compounds was randomly split into five disjoint subsets, T_1, T_2, T_3, T_4, and T_5, each containing 20% of the data (150 compounds). For each 1 ≤ i ≤ 5, the subset T_i was used as a test set for a PNN-based model m_i trained on D_i, the remaining 80% of the data. All these models used the same input data preprocessing and corresponding output data reverse-processing algorithms. The list of the CAS registry numbers (or substitutes) corresponding to the 750 compounds and information about their grouping during this cross-validation experiment are presented in Table
Table 5. List of the 75 compounds used for testing, together with the measured 48-h ICG50 pT toxicity values to Tetrahymena pyriformis (m-pT) and the predictions generated by the models (M1)–(M4)
Compound    m-pT    M1    M2    M3    M4
62-53-3 78-82-0 78-84-2 80-46-6 86-00-0 87-28-5 88-04-0 88-06-2 88-74-4 89-69-0 91-10-1 95-64-7 95-79-4 97-53-0 99-92-3 100-17-4 100-83-4 104-13-2 105-54-4 105-67-9 106-48-9 107-10-8 109-60-4 110-12-3 110-93-0 118-31-0 121-73-3 121-92-6 122-94-1 123-38-6 150-19-6 253-82-7 288-13-1 348-54-9 475-38-7 495-40-9 529-19-1 573-56-8
⫺0.290 0.220 0.640 1.230 2.040 0.700 ⫺0.490 1.780 ⫺1.089 0.540 1.380 0.420 0.550 0.240 0.910 1.200 ⫺0.750 1.120 ⫺0.050 ⫺0.750 ⫺0.290 0.040 1.118 ⫺0.080 0.570 0.840 ⫺0.710 ⫺1.240 0.080 ⫺0.450 ⫺0.600 ⫺0.240 0.330 1.410 ⫺0.380 1.530 ⫺0.310 0.470
⫺0.147 ⫺1.485 ⫺0.823 1.185 1.373 0.208 1.107 1.790 0.529 1.363 ⫺0.191 ⫺0.176 0.557 0.756 ⫺0.103 0.833 0.186 0.822 ⫺0.914 0.246 0.960 ⫺0.904 ⫺1.201 ⫺0.668 ⫺0.458 0.429 1.075 ⫺0.372 1.299 ⫺1.035 ⫺0.242 ⫺0.278 ⫺1.337 0.132 2.703 0.543 ⫺0.246 1.009
⫺0.145 ⫺1.509 ⫺0.850 1.206 1.319 0.131 1.138 1.816 0.599 1.398 ⫺0.201 ⫺0.167 0.579 0.774 ⫺0.134 0.854 0.174 0.821 ⫺0.939 0.263 0.989 ⫺0.928 ⫺1.229 ⫺0.679 ⫺0.454 0.430 1.113 ⫺0.364 1.293 ⫺1.052 ⫺0.244 ⫺0.284 ⫺1.395 0.165 2.532 0.569 ⫺0.234 1.030
⫺0.130 ⫺1.489 ⫺0.911 1.180 1.335 0.221 1.105 1.767 0.543 1.391 ⫺0.192 ⫺0.172 0.569 0.756 ⫺0.122 0.845 0.174 0.882 ⫺0.935 0.253 0.940 ⫺0.931 ⫺1.216 ⫺0.701 ⫺0.446 0.430 1.086 ⫺0.351 1.251 ⫺1.070 ⫺0.243 ⫺0.282 ⫺1.361 0.132 2.543 0.502 ⫺0.226 1.016
⫺0.141 ⫺1.451 ⫺0.792 1.175 1.241 0.122 1.086 1.750 0.603 1.324 ⫺0.211 ⫺0.157 0.545 0.753 ⫺0.136 0.826 0.198 0.867 ⫺0.883 0.280 0.942 ⫺0.871 ⫺1.176 ⫺0.653 ⫺0.436 0.415 1.038 ⫺0.352 1.264 ⫺0.959 ⫺0.247 ⫺0.276 ⫺1.319 0.232 2.761 0.558 ⫺0.242 0.960
584-02-1 587-02-0 589-16-2 589-18-4 608-31-1 609-09-6 613-94-5 615-65-6 619-73-8 626-01-7 626-02-8 628-73-9 646-14-0 830-81-9 928-92-7 1137-42-4 1472-87-3 2016-57-1 2150-47-2 2237-30-1 2379-55-7 2973-76-4 3066-71-5 3217-15-0 3218-36-8 4146-04-7 4460-86-0 4748-78-1 5390-04-5 6361-21-3 10031-82-0 14191-95-8 14309-57-0 18982-54-2 19438-10-9 33228-45-4 39905-57-2
0.080 ⫺0.530 0.120 2.060 2.830 0.760 ⫺1.240 ⫺0.380 0.100 ⫺1.710 1.650 0.610 0.280 ⫺0.270 0.070 ⫺1.500 ⫺0.470 ⫺0.520 ⫺0.490 ⫺0.470 0.070 ⫺1.620 ⫺0.670 0.040 0.530 ⫺0.380 0.100 0.620 1.020 ⫺0.120 0.250 1.050 0.980 0.510 ⫺0.100 1.301 ⫺0.750
⫺1.136 ⫺0.114 ⫺0.078 ⫺0.484 0.793 ⫺0.520 ⫺0.860 0.472 0.019 0.951 1.147 ⫺0.759 ⫺0.554 0.659 ⫺1.004 0.871 0.822 1.553 0.369 ⫺0.365 0.339 0.565 0.656 1.949 1.348 ⫺0.983 ⫺0.016 0.125 ⫺0.834 0.965 0.023 0.031 0.884 0.595 0.329 1.507 1.029
⫺1.149 ⫺0.106 ⫺0.069 ⫺0.478 0.812 ⫺0.513 ⫺0.822 0.494 0.127 1.013 1.216 ⫺0.757 ⫺0.609 0.665 ⫺1.012 0.920 0.821 1.542 0.403 ⫺0.344 0.345 0.569 0.646 1.977 1.394 ⫺1.016 0.005 0.167 ⫺0.855 0.994 0.040 0.018 0.904 0.594 0.290 1.443 1.144
⫺1.141 ⫺0.113 ⫺0.070 ⫺0.468 0.793 ⫺0.503 ⫺0.853 0.492 0.071 0.945 1.163 ⫺0.789 ⫺0.565 0.619 ⫺1.024 0.903 0.799 1.543 0.368 ⫺0.377 0.333 0.551 0.659 1.913 1.363 ⫺0.999 0.013 0.173 ⫺0.823 0.979 0.030 0.038 0.917 0.574 0.321 1.570 1.048
⫺1.107 ⫺0.100 ⫺0.065 ⫺0.433 0.770 ⫺0.454 ⫺0.806 0.465 0.134 1.089 1.107 ⫺0.656 ⫺0.530 0.728 ⫺1.010 0.860 0.807 1.559 0.407 ⫺0.288 0.332 0.553 0.667 1.914 1.392 ⫺0.945 0.003 0.148 ⫺0.818 0.915 0.032 0.007 0.867 0.579 0.265 1.582 1.091
2. Limited space does not allow us to show the exact mathematical equations governing the five resulting PNNs, but we present the values of some relevant statistical estimators associated with them in Table 3. The values of the average and standard deviation of the errors and of the average square error produced by all partial cross-validation PNN models on their associated training sets indicate excellent recognitive performance. The values of the same statistical indicators for the test sets indicate very good predictive capabilities. The distribution of the prediction errors generated by all cross-validation models on their Training, Test, and Training + Test sets is consistently skewed in the negative direction. For both the Training and Training + Test sets, this distribution is in the range of the normal distribution with the same mean and variance. The overall picture confirms that both the choice of input parameters and the choice of the PNN as model are appropriate for the task of computing 48-h ICG50 pT T. pyriformis toxicity endpoints. Moreover, the high values of the square of Pearson's correlation coefficient for the Training and Training + Test sets in all models built as part of this cross-validation experiment strongly suggest the possibility of improving the quality of the PNN-based predictions by using appropriate linear corrections.
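The leave-20%-out procedure described in this section can be illustrated with a small sketch. It is not the authors' network: the predictor below is a generic Gaussian-kernel weighted average of the training targets, used only as a stand-in for a kernel-based model trained in a single pass, and it omits the data preprocessing mentioned in the text. The names `gaussian_kernel_predict`, `five_fold_split`, and `cross_validate`, the smoothing parameter `sigma`, and the random seed are all hypothetical.

```python
import math
import random


def gaussian_kernel_predict(train_x, train_y, x, sigma):
    """Gaussian-kernel weighted average of the training targets at point x."""
    weights = []
    for xi in train_x:
        d2 = sum((a - b) ** 2 for a, b in zip(xi, x))
        weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        # Query too far from every training point: fall back to the mean.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


def five_fold_split(n, seed=0):
    """Randomly split indices 0..n-1 into five disjoint subsets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::5] for k in range(5)]


def cross_validate(xs, ys, sigma, seed=0):
    """Average square test error of each of the five leave-20%-out models."""
    results = []
    for test_idx in five_fold_split(len(xs), seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        sq_errors = [
            (gaussian_kernel_predict(train_x, train_y, xs[i], sigma) - ys[i]) ** 2
            for i in test_idx
        ]
        results.append(sum(sq_errors) / len(sq_errors))
    return results
```

With 750 compounds, `five_fold_split(750)` yields five subsets of 150 indices each, mirroring the T_1 to T_5 grouping; each model is fitted on the remaining 600 compounds and evaluated on the held-out 150.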
Corrected Models and Comparisons

The four PNN models involving linear corrections we will discuss are represented by the following equations:

$$\mathrm{TH} = 1.2423 \cdot m_0 - 0.0343 \tag{M1}$$

$$\mathrm{TH} = 1.2812 \cdot \mathrm{Average}(m_i,\ 1 \le i \le 5) - 0.0376 \tag{M2}$$

$$\mathrm{TH} = 1.2606 \cdot \mathrm{Median}(m_i,\ 1 \le i \le 5) - 0.0361 \tag{M3}$$

$$\mathrm{TH} = 0.2247 \cdot m_1 + 0.4738 \cdot m_2 + 0.3639 \cdot m_3 - 0.0371 \cdot m_4 + 0.1909 \cdot m_5 - 0.0277 \tag{M4}$$
Table 6. Values of statistical estimators associated with the prediction errors generated by models (M1)–(M4) on the external test set of 75 compounds

Test (75 compounds)                       M1        M2        M3        M4
  Average errors                      0.0450    0.0499    0.0420    0.0567
  Standard deviation errors           0.3062    0.3121    0.3079    0.2955
  Sum square errors                   7.1824    7.4937    7.2420    6.7898
  Average square error                0.0958    0.0999    0.0966    0.0905
  Skewness errors                    -0.1691   -0.2499   -0.1975   -0.1820
  Kurtosis errors                    -0.0349    0.0230    0.0334    0.2980
  Pearson's r² (test)                 0.8823    0.8783    0.8809    0.8896
Fig. 1. Plot of predicted versus measured Tetrahymena ICG50 data for model M1 on the training data set of 750 compounds listed in Table 2
Fig. 2. Plot of predicted versus measured Tetrahymena ICG50 data for model M1 on the external validation data set of 75 compounds listed in Table 5
Here TH stands for the predicted 48-h ICG50 pT toxicity endpoint to T. pyriformis, m_0 is the PNN trained on the whole 750-compound data set, and m_1 to m_5 are the five PNN cross-validation models discussed in the previous section. The first three models, (M1)–(M3), use geometric regression forcing the intercept to zero as linear correction, while (M4) uses classical multivariate regression. The 750-compound data set functions, in practice, as the training set for all of these models. Various relevant statistical estimators associated with these models on the training set are summarized in Table 4. All models, (M1)–(M4), show very similar recognitive performance, but model (M1) seems to do best. The distribution of the errors generated by all models is consistently skewed in the negative direction, with a sharpness similar to that of the standard normal distribution. To evaluate the predictive performance of these models, we use the remaining 75 compounds as an external test set. The predictions generated by the four corrected models, (M1) to (M4), for all compounds in this test set are presented in Table 5. We summarize the values of the statistical estimators associated with the prediction errors generated by these models in Table 6. All models, (M1)–(M4), reveal excellent predictive performance and are very similar to each other. The distributions of the predictive errors on the external test set are slightly skewed to the right and considerably flatter than the standard normal distribution. The values of the standard deviation and the average square error estimators point toward (M4) as the best predictor. Figures 1 and 2 give plots of predicted versus measured values for model (M1) for the training and external validation sets, respectively. Plots for the other models are very similar.
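To make the role of the linear corrections concrete, the sketch below applies equations (M1)–(M4) to the raw outputs of the underlying networks for a single compound. The coefficients are those given in the equations; the function name `corrected_predictions` and its argument layout (the output of m_0 as a number, the outputs of m_1 to m_5 as a list in that order) are assumptions made for illustration, and the PNNs themselves are not reproduced here.

```python
from statistics import mean, median


def corrected_predictions(m0, m):
    """Apply the linear corrections (M1)-(M4) to raw PNN outputs.

    m0 -- output of the PNN trained on all 750 compounds
    m  -- outputs of the five cross-validation PNNs, ordered m_1 .. m_5
    """
    if len(m) != 5:
        raise ValueError("expected the outputs of exactly five models")
    return {
        "M1": 1.2423 * m0 - 0.0343,
        "M2": 1.2812 * mean(m) - 0.0376,
        "M3": 1.2606 * median(m) - 0.0361,
        "M4": (0.2247 * m[0] + 0.4738 * m[1] + 0.3639 * m[2]
               - 0.0371 * m[3] + 0.1909 * m[4] - 0.0277),
    }
```

Because (M2) and (M3) collapse the five cross-validation outputs to a single number before scaling, they differ from each other only when those five outputs are spread asymmetrically around their median.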
Conclusions

When compared with the traditional type of QSARs, the neural network models we focus on have a larger scope and excellent performance, and they do not involve any rules of thumb to accommodate a larger variety of compounds. At the same time, we must also be aware of their peculiarities. The quality of neural network–based models is conditioned by the quality and variety of the training/learning data. Clearly, the presence of errors in the data will reduce the performance of the resulting model, but on the whole these models are less sensitive to outliers than the classical statistical models. To avoid superfluous variables, the
choice of input variables must be justified by the existence of consistent data to be used in the training/learning phase of building the model. As all neural network models are based on learning/training, the breadth of the training data will strongly influence the final recognitive and predictive capabilities of the model. Consequently, the best predictions will be generated for the compounds well represented inside the training set. For the rest of the compounds, the predictions will be based on the characteristics shared with the well-represented ones. Of course, when using such models for substance toxicity screening, we must first check whether the associated chemical structures satisfy the scope requirements. Future research will focus on enlarging the scope of these models by adding new input parameters reflecting the three-dimensional aspect of the molecular structures and refining the existing ones, as well as on using larger data sets for the training/learning phase of building the embedded neural networks.
References

BioByte Corp. (1999) C-QSAR software system. Medicinal Chemistry Project/Pomona College, and BioByte Corp., Claremont, CA
Cooley NR, Keitner JM Jr, Forester J (1972) Mirex and Aroclor 1254: Effects on and accumulation by Tetrahymena pyriformis strain W. J Protozool 18:636–638
Downs GM, Gill GS, Willett P, Walsh P (1995) Automated descriptor selection and hyperstructure generation to assist SAR studies. SAR and QSAR Environ Res 3:253–264
Eldred DV, Weikel CL, Jurs PC, Kaiser KLE (1999) Prediction of fathead minnow acute toxicity of organic compounds from molecular structure. Chem Res Toxicol 12:670–678
Gilron GL, Lynn DH (1998) Ciliated protozoa as test organisms in toxicity assessments. In Wells PG, Lee K, Blaise C (eds) Microscale testing in aquatic toxicology. CRC Press, Boca Raton, FL, pp 323–336
Kaiser KLE, Dearden JC, Klein W, Schultz TW (1999) A note of caution to users of ECOSAR. Water Qual Res J Can 34:179–182
Kaiser KLE, Niculescu SP (1999a) Using probabilistic neural networks to model the toxicity of chemicals to the fathead minnow (Pimephales promelas): a study based on 865 compounds. Chemosphere 38:3237–3245
Kaiser KLE, Niculescu SP (1999b) Modeling the acute toxicity of chemicals to Daphnia magna: A probabilistic neural network approach. Environ Toxicol Chem (in press)
Kaiser KLE, Niculescu SP, McKinnon MB (1997a) On the simple linear regression, the multiple linear regression and the elementary probabilistic neural network with Gaussian kernel's performance in modeling toxicity values to fathead minnow based on Microtox data, the octanol/water partition coefficient and various structural descriptors for a 419 compound data set. In Chen F, Schüürmann G (eds) QSAR in environmental sciences - VII. SETAC Press, Pensacola, FL, pp 285–297
Kaiser KLE, Niculescu SP, Schüürmann G (1997b) Feed forward back-propagation neural networks and their use in predicting the acute toxicity of chemicals to the fathead minnow. Wat Qual Res J Can 32:637–657
Klopman G, Saiakhov R, Rosenkranz HR (2000) Multiple computer-automated structure evaluation study of aquatic toxicity II. Fathead minnow. Environ Toxicol Chem 19:441–447
Larsen J, Schultz TW, Rasmussen L, Hooftman R, Pauli W (1997) Progress in an ecotoxicological standard protocol with protozoa: results from a pilot ring test with Tetrahymena pyriformis. Chemosphere 35:1023–1041
Masters T (1993) Practical neural network recipes in C++. Academic Press, San Diego, CA
Moore DRJ, Breton RI, McDonald D, Harris PW, Chenier R, Sutcliffe R, Sanderson J, Bradbury S, Russom C, Jurs PC (1999) Ability of ECOSAR, TOPKAT, neural networks, and ASTER to predict toxicity of chemicals to aquatic biota. SETAC 20th Annual Meeting, Philadelphia, Abstract Book, SETAC, Pensacola, FL, Abstract 260
Neuwen J, Lindgren F, Hansen B, Karcher W, Verhaar HJM, Hermens JLM (1997) Classification of environmentally occurring chemicals using structural fragments and PLS discriminant analysis. Environ Sci Technol 31:2313–2318
Niculescu SP, Kaiser KLE, Schüürmann G (1998) Influence of data preprocessing and kernel selection on probabilistic neural network modeling of the acute toxicity of chemicals to the fathead minnow and Vibrio fischeri bacteria. Wat Qual Res J Can 33:153–165
Russom CL, Bradbury SP, Broderius SJ, Hammermeister DE, Drummond RA (1997) Predicting modes of toxic action from chemical structure: acute toxicity in the fathead minnow (Pimephales promelas). Environ Toxicol Chem 16:948–967
Schultz TW (1983) Aquatic toxicology of nitrogen heterocyclic molecules: quantitative structure-activity relationships. In Nriagu JO (ed) Aquatic toxicology. John Wiley and Sons, New York, NY, pp 579–612
Schultz TW (1997) Tetratox: Tetrahymena pyriformis population growth impairment endpoint - a surrogate for fish lethality. Toxicol Meth 7:289–309
Schultz TW, Cajina-Quezada M, Dumont JN (1980) Structure toxicity relationships of selected nitrogenous heterocyclic compounds. Arch Environ Contam Toxicol 9:951–998
Schultz TW, Lin DT, Wilke TS, Arnold LM (1990) Quantitative structure-activity relationships for the Tetrahymena pyriformis population growth endpoint: a mechanism of toxic action approach. In Karcher W, Devillers J (eds) Practical applications of quantitative structure-activity relationships (QSAR) in environmental chemistry and toxicology. Kluwer, Dordrecht, Holland, pp 241–262
TerraBase (1999) TerraTox™ 2000: explorer. CD-ROM, TerraBase Inc., Burlington, Ontario, Canada
Verhaar HJM, Van Leeuwen CJ, Hermens JLM (1992) Classifying environmental pollutants. 1. Structure-activity relationships for prediction of aquatic toxicity. Chemosphere 25:471–491