Environ Sci Pollut Res DOI 10.1007/s11356-017-8667-4
RESEARCH ARTICLE
Examining predictors of chemical toxicity in freshwater fish using the random forest technique
Baigal-Amar Tuulaikhuu 1,2 & Helena Guasch 1 & Emili García-Berthou 1
Received: 18 July 2016 / Accepted: 20 February 2017
© Springer-Verlag Berlin Heidelberg 2017
Abstract Chemical pollution is one of the main issues globally threatening the enormous biodiversity of freshwater ecosystems. The toxicity of substances depends on many factors such as the chemical itself, the species affected, environmental conditions, exposure duration, and concentration. We used the random forest technique to examine the factors that mediate toxicity in a set of widespread fishes and analyses of covariance to further assess the importance of differential sensitivity among fish species. Among 13 variables, the 5 most important predictors of toxicity with random forests were, by order of importance, the chemical substance itself (i.e., Chemical Abstracts Service number considered as a categorical factor), octanol-water partition coefficient (log P), pollutant prioritization, ecological structure-activity relationship (ECOSAR) classification, and fish species for 50% lethal concentrations (LC50) and the chemical substance, fish species, log P, ECOSAR classification, and water temperature for no observed effect concentrations (NOECs). Fish species was a very important predictor for both endpoints and with the two contrasting statistical techniques used. Different fish species displayed very different relationships with log P, often with different slopes and with as much importance as the partition coefficient. Therefore, caution should be exercised when extrapolating toxicological results or relationships among species. In addition, further research is needed to determine species-specific sensitivities and unravel the mechanisms behind them.
Keywords Ecotoxicology . Octanol-water partition coefficient . Species-specific sensitivity
Responsible editor: Henner Hollert
Electronic supplementary material The online version of this article (doi:10.1007/s11356-017-8667-4) contains supplementary material, which is available to authorized users.
* Emili García-Berthou [email protected]
1 GRECO, Institute of Aquatic Ecology, University of Girona, 17003 Girona, Spain
2 Department of Ecology, School of Agroecology, Mongolian University of Life Sciences, Ulaanbaatar, Mongolia
Introduction Chemical pollution is one of the main issues globally threatening the enormous biodiversity of freshwater ecosystems (Dudgeon et al. 2006). There is a large number of different chemicals (over 100,000; Hansen et al. 1999), differing in their toxicity in the aquatic environment. The toxicity of substances depends on many factors such as the characteristics of the chemical substance itself (e.g., its mode of action; Vaal et al. 1997); the species affected and its life stage (Woltering 1984); the exposure duration and concentration; and the environmental conditions such as temperature (Li et al. 2014), pH (Thurston and Russo 1981), salinity (Grosell et al. 2007), hardness, or alkalinity (Riethmuller et al. 2001). Thousands of toxicological assays have been performed to determine the effects of different chemical compounds on a multitude of organisms and conditions, and syntheses and quantitative comparisons are needed to provide further understanding. The availability of databases with extensive data provides increasing opportunities to statistically evaluate the most important predictors of chemical toxicities and their interactions (Tebby et al. 2011). As shown by previous ecotoxicological studies, contrasting sensitivity to different types of chemicals has been observed at different trophic levels, such as algae, crustaceans, and fish (Henegar et al. 2011; Riethmuller et al.
2001; Tebby et al. 2011), and also among different fish species (Vittozzi and De Angelis 1991). Quantitative structure-activity relationships (QSARs) are often used to predict toxicity from physicochemical properties (e.g., molecular descriptors) for chemicals not well known or to reduce the number of assays (Netzeva et al. 2008). Among many molecular descriptors, the most widely used include the molecular weight and the octanol-water partition coefficient (log P) (Levet et al. 2013). Log P is a measure of hydrophobicity, which mediates several processes, including sorption and accumulation (Katritzky et al. 2010; Meylan et al. 1999), and is also a key parameter for the environmental fate and effects of toxicants (Lifongo and Nfon 2009). Identifying the potential mechanisms or modes of action and closest structural similarity of new chemicals is an initial step for predicting their toxicity with QSARs. For acute aquatic toxicity effects, several endpoint-specific profilers, including ecological structure-activity relationship (ECOSAR) (Mayo-Bean et al. 2012) classification, Optimized Approach based on Structural Indices Set (OASIS) acute toxicity mode of action, Verhaar classification, and Cramer classification have been suggested (OECD 2009). In organic chemistry, functional groups are a set of specific atoms (e.g., aldehydes, ethers, ketones, or phenols) that occur in a wide range of compounds and confer upon them a common kind of reactivity (Vollhardt and Schore 2011) and are thus helpful in predicting their toxicity (Mayo-Bean et al. 2012). 
In addition, regulations often distinguish between “priority pollutants,” i.e., human-produced pollutants, frequent in streams, with published analytical test methods and more well-known toxicity, and “emerging pollutants,” i.e., substances that have been discovered in the environment more recently (often because of improved analytical chemistry detection levels) and potentially cause deleterious effects in aquatic life at environmentally relevant concentrations (USEPA 2014). Comparing priority and emerging contaminants may help to understand the environmental significance of the latter. Although physicochemical properties of substances are considered good predictors of the toxicity of industrial chemicals, ecological effects of a specific chemical may be under- or over-estimated and differences among species are often neglected. Fish are excellent ecological indicators for a number of reasons (e.g., greater life span and high trophic level, which often imply integration of perturbations at larger spatial and temporal scales), and differences in tolerance among species are widely used in biomonitoring and ecosystem health assessment (Maceda-Veiga and De Sostoa 2011; Oberdorff et al. 2002). For example, Fedorenkova et al. (2013) reported a difference of three to four orders of magnitude in 50% lethal concentrations (LC50) among fish species in the River Rhine and its tributaries, based on data from the ECOTOX database. Differences in species sensitivity can be very substantial, and the number of species for which some
toxicological information is available represents only a small fraction of the total number of species existing (Yang and Randall 1997). For this reason, toxicity is often inferred from model species or from taxonomically related species, implying high uncertainty in predictions. Random forests (RFs) are a recent machine-learning technique increasingly used in many scientific areas due to their high accuracy and ability to characterize complex interactions among predictors (Breiman 2001; Strobl et al. 2008). RFs are an extension of classification or regression trees, which in turn are a form of binary recursive partitioning that builds a decision tree by repeatedly partitioning the data set into a nested series of mutually exclusive groups, selecting the best candidate split at each step and then selecting the optimal tree. RFs fit many regression trees to a data set and then combine the predictions from all the trees (Breiman 2001). RFs have many advantages over more conventional statistical techniques: they run efficiently on large databases with many correlated predictors, give estimates of which variables are more important, have an effective method for estimating missing data while maintaining accuracy, and handle non-linearities and interactions particularly well (Prasad et al. 2006; Cutler et al. 2007). In this paper, we aimed to (i) rank the importance of predictors of aquatic toxicity of the most frequently tested chemicals in a set of widespread fishes using RF and (ii) quantify the differences in sensitivity among these species and test whether the relationship with log P varies among them. We hypothesized that (i) species might rank among the most important predictors for modeling toxicity and (ii) log P is well related to toxicity but this relationship varies markedly among species.
Our aim was not to develop another predictive model of toxicity but to compare the usefulness of several previous toxicological tools and other predictors in a large set of fish and substances. Thus, this study exemplifies how random forests can contribute to understanding the effects of chemicals on the environment.
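As an illustration of the binary recursive partitioning described above, the following is a minimal sketch of how a single regression-tree node might choose its split on one numeric predictor, by maximizing the decrease in the residual sum of squares. The function names (`sum_sq`, `best_split`) and the toy values are hypothetical and are not taken from the study; a random forest would repeat this search over bootstrap samples and random subsets of predictors and then average the resulting trees.

```python
from statistics import fmean


def sum_sq(values):
    """Sum of squared deviations from the mean (node impurity for regression)."""
    if not values:
        return 0.0
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, impurity_decrease) for the best binary split on x."""
    pairs = sorted(zip(x, y))
    parent = sum_sq([v for _, v in pairs])
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # identical predictor values cannot be separated
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        decrease = parent - sum_sq(left) - sum_sq(right)
        if decrease > best[1]:
            # Place the threshold midway between the two neighbouring values
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, decrease)
    return best


# Hypothetical toy data: log P values and log10(LC50) responses
log_p = [0.5, 1.0, 2.0, 3.5, 4.0, 5.0]
log_lc50 = [4.0, 4.2, 3.9, 1.0, 1.2, 0.8]
threshold, decrease = best_split(log_p, log_lc50)
print(threshold, round(decrease, 3))
```

In this toy case the split falls between the low and high log P values, where the response changes most sharply.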
Methods Data compilation We compiled ecotoxicological data mainly from the ECOTOX database (accessed through https://cfpub.epa.gov/ecotox/ in March 2014). The ECOTOX database was created and is maintained by the United States Environmental Protection Agency and provides single chemical toxicity information for aquatic and terrestrial organisms (USEPA 2015). In ECOTOX, we searched for toxicological information of all fish species native or naturalized (i.e., alien established) to Spain. The list of fish species in Spain was based on the last published atlas (Doadrio 2002). Initially, 73,734 cases describing every single
result of toxicity tests for 37 fish species were found. Within them, we chose for analyses only trials done in freshwater and with endpoints of LC50 of test animals or no observed effect concentrations (NOECs), since those two endpoints had many more cases than other endpoints such as EC50 and LOEC. Although many toxicologists consider that NOEC and related endpoints have many serious drawbacks and should not be used (van der Hoeven 1997; Crane and Newman 2000; Laskowski et al. 2010), we considered it because of its extensive ongoing use (Lewis et al. 1994) and to test whether it provided similar conclusions to LC50. In the downloaded ECOTOX data, we ignored subspecies names and corrected some obvious taxonomic misspellings; we converted all concentration units into μg/L of toxic substance concentrations in water; and a few unrealistic concentrations were compared with the original references and corrected or excluded from the analyses. This led to a database with 34,118 cases of 87 substances and 25 freshwater fish species (Tables S1 and S2). Afterwards, we assigned substances to the following classifications, using the QSAR 3.0 toolbox (Bhatia et al. 2015):
i) Acute aquatic toxicity classification by Verhaar et al. (1992), which separates them into inert chemicals, less inert chemicals, reactive chemicals, specifically acting chemicals, and “not possible to classify”;
ii) Acute aquatic toxicity mode of action (MOA) by OASIS (Russom et al. 1997), which divides them into aldehydes, phenols and anilines, esters, alpha, beta-unsaturated alcohols, base surface narcotics, narcotic amines, and “reactive unspecified.”
iii) Aquatic toxicity classification by ECOSAR (Mayo-Bean et al. 2012), which identifies chemicals in 118 classes. When the program provided results with multiple classes for the input substance, we chose the one that exhibited greatest toxicity. We pooled various esters (e.g., mono- or di-thiophosphate esters) in a single class, “esters.” Therefore, the substances that we used for analyses finally corresponded to 16 classes, which were acid moieties, aliphatic amines, amides, aromatic triazines, carbamate esters, esters, inorganic compounds, neutral organics, phenols, polynitrobenzenes, pyrethroids, substituted ureas, thiocarbamate, vinyl/allyl ethers, vinyl/allyl halides, and substances that “should not be profiled.” The substances that contain a metal atom are classified by ECOSAR in the group named “should not be profiled” because sufficient toxicological knowledge is not available regarding this type of compounds.
iv) Toxic hazard classification by Cramer (Cramer et al. 1978), which places substances in three classes (classes I, II, and III). Class I substances are those with structures and related data suggesting a low order of toxicity;
if combined with low exposure, they should enjoy a very low priority for investigation. Class III substances are those that permit no strong initial presumptions of safety, or that may even suggest significant toxicity and thus deserve the highest priority for investigation, whereas class II are intermediate. The classification was developed by expert judgment and consists of a decision tree based on 33 questions mostly about features of chemical structure (Cramer et al. 1978). We also classified compounds on the basis of their principal functional groups and substitutive chemical nomenclature (Leigh et al. 1998) into acids, alcohols, aldehydes, alkanes, alkenes, amines and amides, aromatics, azides, esters, ethers, halogens, inorganic compounds, ketones, metallic or organometallic compounds, nitriles, and organosulfur compounds. In substitutive nomenclature, the suffix of a compound is given by the principal functional group (Leigh et al. 1998). We also classified toxicants in three groups, which are priority pollutants, contaminants of emerging concern (hereafter, “emerging contaminants”), and non-classified, based on the lists by the USEPA (2014). Octanol-water partition coefficients (log P), estimated by the atomic contribution method (Ghose et al. 1998), were obtained from the European Inventory of Existing Commercial Chemical Substances (European Union Reference Laboratory for alternatives to animal testing (EURL-ECVAM) 2015). The effect types (e.g., mortality, accumulation, and behavior) were directly obtained from the ECOTOX database. In the database, NOEC was related to 20 different effect types; most of them were biochemical, cellular, and growth responses, and a few of them were behavioral or physiological measurements. Water temperature was converted to degrees Celsius and hardness units to ppm CaCO3.
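The harmonization of units described in this section could be sketched as follows. This is only an illustration under stated assumptions: the unit labels, the conversion table `TO_UG_PER_L`, and the function names are hypothetical and do not reproduce the field codes of the ECOTOX database or the procedure actually used. Hardness conversions are omitted because they depend on the reporting convention of each source.

```python
# Multipliers to micrograms per litre; the unit labels are illustrative assumptions
TO_UG_PER_L = {"ng/L": 0.001, "ug/L": 1.0, "mg/L": 1_000.0, "g/L": 1_000_000.0}


def to_ug_per_l(value, unit):
    """Convert a mass-per-volume concentration in water to micrograms per litre."""
    try:
        return value * TO_UG_PER_L[unit]
    except KeyError:
        raise ValueError(f"unsupported concentration unit: {unit!r}") from None


def to_celsius(value, unit):
    """Convert a water temperature given in C, F, or K to degrees Celsius."""
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    if unit == "K":
        return value - 273.15
    raise ValueError(f"unsupported temperature unit: {unit!r}")


print(to_ug_per_l(2.5, "mg/L"))  # concentration expressed in ug/L
print(to_celsius(68.0, "F"))     # temperature expressed in degrees Celsius
```

Rejecting unknown units explicitly, rather than passing values through unchanged, makes it easier to spot the unrealistic concentrations that then need checking against the original references.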
Data analyses We used random forests (Breiman 2001), as implemented in the package “randomForest” (Liaw and Wiener 2002) of the R software (R Development Core Team 2015), to analyze the factors that best predict toxicity (LC50, NOEC) in the compiled dataset. We used as categorical predictors the Chemical Abstracts Service (CAS) registry number, pollutant prioritization, ECOSAR classification, functional group, Cramer classification, Verhaar classification, and acute aquatic toxicity MOA by OASIS as toxicant classifications and fish family, fish species, and effect type as features of the test organism and their response. As numerical predictors, we used the octanol-water partition coefficient (log P) as a key physicochemical parameter of chemicals and water temperature and hardness as environmental conditions of the toxicological test. We excluded many other predictors because they had
many missing values, which would reduce the sample size, and because in preliminary RF analyses, they were not among the 15 most important predictors. Since randomForest cannot handle categorical predictors with more than 53 categories and to improve robustness, we selected substances that had more than 49 assays for LC50 and more than 7 assays for NOEC, where assay is an independent datum or row in the ECOTOX database (e.g., test of a substance at a different concentration, water temperature or of a different fish species). On the other hand, to avoid excluding many fish species, we included fish species with more than three assays. After this selection, the database consisted of 7892 cases with 52 substances with >49 assays for LC50 and 1987 cases with 51 substances with >7 assays for NOEC. We ran analyses with and without CAS number to test the ability of commonly used predictors without the knowledge of the chemical substance itself and with a larger dataset for comparative purposes. Note that for categorical predictors such as CAS number, numbers are not used in the analysis and only the affiliation to a certain group is used to assess the relevance as a predictor. We used 500 trees to build the RF because increasing this default number did not substantially change the results of variable importance or explained variation (Liaw and Wiener 2002). As the number of variables randomly sampled as candidates at each split, we used the recommended default of the square root of number of predictors (Liaw and Wiener 2002). Note that the out-of-bag estimate of variance used in RF is as accurate as using a test set of the same size as the training set and thus removes the need for a set-aside test set in standard applications (Breiman 2001; Prasad et al. 2006). 
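The selection thresholds and the square-root rule for the number of candidate variables per split, as described in the text, can be sketched as below. The record layout (dictionaries with hypothetical `cas` and `species` keys) and the function names are assumptions for illustration; in particular, counting assays on the unfiltered records is an assumption, since the text does not state the order in which the two filters were applied.

```python
from collections import Counter
from math import floor, sqrt


def filter_by_min_assays(rows, min_substance=49, min_species=3):
    """Keep rows whose substance has more than `min_substance` assays and whose
    species has more than `min_species` assays (counts taken on unfiltered rows)."""
    substance_n = Counter(r["cas"] for r in rows)
    species_n = Counter(r["species"] for r in rows)
    return [
        r for r in rows
        if substance_n[r["cas"]] > min_substance
        and species_n[r["species"]] > min_species
    ]


def mtry_sqrt(n_predictors):
    """Number of candidate predictors per split under the square-root rule."""
    return max(1, floor(sqrt(n_predictors)))


# Hypothetical toy records: substance A is well tested, substance B is not
rows = (
    [{"cas": "A", "species": "trout"}] * 50
    + [{"cas": "A", "species": "pike"}] * 2
    + [{"cas": "B", "species": "trout"}] * 10
)
kept = filter_by_min_assays(rows)
print(len(kept), mtry_sqrt(13))
```

With the 13 predictors used here, the square-root rule gives three candidate variables at each split.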
RFs provide a measure of percentage of variance explained or pseudo-R2 for the model obtained and a measure of variable importance, i.e., the importance of the predictor variables, which can be obtained as the total decrease in node impurities from splitting on the variable, averaged over all trees (Liaw and Wiener 2002). RF partial dependence plots (Friedman 2001) were also obtained for the most important predictors. These plots give a graphical depiction of the marginal effect of a predictor on the response variable, i.e., the dependence of the toxicity response on a specific predictor after partialling out the effects of the other predictors in the model. Analysis of covariance (ANCOVA) was also used to further compare the effect concentrations among fish species and chemical substances, using log P as a covariate. An ANCOVA design with the factor × covariate interaction allows us to test the assumption of homogeneity of regression coefficients of the standard ANCOVA design, i.e., to compare slopes (García-Berthou and Moreno-Amich 1993). Quantitative variables were log10-transformed for ANCOVA because residual plots suggested that the assumptions (normality, homoscedasticity, and linearity) were thus satisfied.
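The partial dependence plots mentioned above can be illustrated with a short sketch: for each value of the predictor of interest, that value is imposed on every record and the model predictions are averaged, so that the other predictors keep their observed distribution. The function `partial_dependence` and the stand-in model `toy_predict` are hypothetical; a fitted random forest would take the place of the toy model.

```python
from statistics import fmean


def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        # Copy each record and overwrite only the predictor of interest
        predictions = [predict({**row, feature: value}) for row in rows]
        curve.append((value, fmean(predictions)))
    return curve


def toy_predict(row):
    """Hypothetical stand-in for a fitted model of log10(LC50)."""
    return 5.0 - 0.5 * row["log_p"] + 0.02 * row["temp_c"]


rows = [{"log_p": 1.0, "temp_c": 10.0}, {"log_p": 3.0, "temp_c": 20.0}]
curve = partial_dependence(toy_predict, rows, "log_p", [0.0, 2.0, 4.0])
for value, mean_prediction in curve:
    print(value, round(mean_prediction, 3))
```

Because the averaging runs over the observed values of the remaining predictors, the resulting curve is a marginal effect and will generally span a narrower range than the raw group means.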
Results Random forests RF explained most of the variation in toxicity (89.3% for LC50 and 94.6% for NOEC). The five most important variables to predict LC50 with RF were the chemical substance (i.e., CAS number), log P, pollutant prioritization, ECOSAR classification, and fish species (Fig. 1a). The results for NOEC were similar, but water temperature was among the first five predictors, replacing pollutant prioritization, which was the third most important for LC50. The order of the most important variables was moderately different for the two endpoints; although four of the predictors were among the top five for both statistics and CAS number was always the most important, log P was the second for LC50 and third for NOEC, whereas fish species was ranked as the fifth and second most important predictor for LC50 and NOEC, respectively (Fig. 1a, b). Priority pollutants were the most toxic, followed by emerging pollutants and substances “not classified,” the differences being also less apparent for NOEC (Fig. 2). For both endpoints, pyrethroids and vinyl/allyl halides were the most toxic compounds, metal-containing substances were also among the most toxic, whereas aliphatic amines were the least toxic (Fig. S3). We also ran random forests without CAS number as a predictor, for the same dataset as above (51 substances, 7843 cases for LC50 and 1987 cases for NOEC) and for a larger dataset (320 substances, 10,788 and 2174 cases for LC50 and NOEC, respectively). For both LC50 and NOEC, the five most important predictors and their order were very similar in the two analyses that excluded CAS number (Fig. 1). However, the percentage of explained variation ignoring the chemical substance (CAS number) was slightly lower (89.3 vs. 87.8% or lower for LC50 and 94.6 vs. 93.3% or lower for NOEC), and CAS number explained most of the variation when used (Fig.
1), suggesting that predictive models perform relatively well but do not capture all toxicological information of substances. Therefore, we used the models with CAS number to further analyze the effects of predictor variables. Partial dependence on octanol-water partition coefficient showed that toxicity increased (LC50 values decreased) with log P but rather sharply at values of log P > 3 (Fig. 3). For NOEC, the relationship was weaker and non-monotonic, with toxicity first decreasing and then increasing. Fish species was the second most important predictor for NOEC and the fifth one for LC50 (Fig. 1). The two endpoints agreed well in the order of species-specific sensitivities, with rainbow trout (Oncorhynchus mykiss), northern pike (Esox lucius), coho salmon (Oncorhynchus kisutch), brown trout (Salmo trutta), and Atlantic salmon (Salmo salar) being less tolerant and goldfish (Carassius auratus), guppy (Poecilia reticulata), bream (Abramis brama), common carp (Cyprinus carpio), and European perch (Perca fluviatilis) being more resistant (Fig. 4). Although the five salmonids are among the eight most sensitive species according to both endpoints (Fig. 4), fish family is not as good a predictor as species (Fig. 1), because of the contrasting tolerance among species within families. In particular, within cyprinids, tench showed high sensitivity, whereas goldfish and common carp were among the most tolerant. The mean values predicted by RF (partial dependence plots) were significantly correlated with the observed means for chemical substances and fish species, and the order of predicted and observed values was very similar (Figs. S4–S7). However, the range of predicted values was much reduced compared to observed values, because the former adjust for other predictors (i.e., predicted values are the expected values after accounting for all other predictors); this fact suggests that substances with quite different toxicity have been tested with different fish species and that many factors contribute to the variability in toxicological databases.
Analyses of covariance ANCOVAs (Table 1) were quite in agreement with RF, as they showed that (i) chemical substance and secondarily fish species and log P explained most of the variation (85–91%) of both endpoints; (ii) explained variation was slightly lower for LC50; (iii) for NOEC, fish species is a more important predictor than log P; and (iv) for a given log P, widely tested species such as rainbow trout or brown trout generally displayed low values (more sensitivity), whereas some cyprinids such as goldfish and common carp showed much higher values (Table S3). However, ANCOVAs also showed that (i) the slopes are significantly heterogeneous (see log P × fish species in Table 1) and vary markedly among species (Figs. 5 and 6; see also Table S3); (ii) the relationships with log P are more heterogeneous for NOEC, where some slopes are flat or even markedly positive and that fish species explains a much higher proportion of the variation (see SS in Table 1); and (iii) although generally significant, the species-specific relationships with log P that ignore the chemical substance are weak (Figs. 5 and 6), with explained variation generally less than 20% and hence low predictive power (Table S3).
Fig. 1 Variable importance of predictors of LC50 (left) and NOEC (right) values according to the random forest technique. a, b Included CAS number as a categorical predictor. c, d Corresponded to the same data but not using CAS number as a predictor. e, f Did not include CAS number as predictor and for a larger data set. Percentages of explained variation were 89.3, 87.8, and 86.4 for LC50 and 94.6, 93.3, and 91.7% for NOEC
Fig. 2 Partial dependence of LC50 and NOEC on pollutant prioritization based on the random forest prediction
Fig. 3 Partial dependence of LC50 and NOEC on octanol-water partition coefficients (log P) based on the random forest prediction
Fig. 4 Partial dependencies of LC50 and NOEC on fish species based on the random forest prediction
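The species-specific relationships with log P summarized in the analyses of covariance can be illustrated with a minimal sketch that fits one least-squares line per species. This is not the ANCOVA itself (it does not include the chemical substance as a factor and performs no test of slope heterogeneity); the function names and the toy records are hypothetical. Only the effect concentration is log10-transformed here, on the assumption that log P, being already a logarithm, enters untransformed.

```python
from collections import defaultdict
from math import log10
from statistics import fmean


def ols_slope_intercept(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("predictor has no variation; slope is undefined")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def slopes_by_species(records):
    """Fit log10(concentration) against log P separately for each species.

    `records` holds (species, log_p, concentration_ug_per_l) tuples.
    """
    groups = defaultdict(lambda: ([], []))
    for species, log_p, conc in records:
        xs, ys = groups[species]
        xs.append(log_p)
        ys.append(log10(conc))
    return {sp: ols_slope_intercept(xs, ys) for sp, (xs, ys) in groups.items()}


# Hypothetical toy records: one species with a steep negative slope, one flat
records = [
    ("species_a", 1.0, 10000.0), ("species_a", 2.0, 1000.0), ("species_a", 3.0, 100.0),
    ("species_b", 1.0, 1000.0), ("species_b", 2.0, 1000.0), ("species_b", 3.0, 1000.0),
]
fits = slopes_by_species(records)
for species, (slope, intercept) in fits.items():
    print(species, round(slope, 3), round(intercept, 3))
```

Slopes that differ this much between groups are what the log P × fish species interaction term in an ANCOVA is designed to detect.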
Table 1 Analyses of covariance of LC50 and NOEC values with chemical substance (CAS number) and fish species as categorical factors and the octanol-water partition coefficient (log P) as a covariate
                        LC50 (R2adj = 0.85)              NOEC (R2adj = 0.91)
Source                  SS          df      P            SS         df      P
log P                   1,100.2     1       <0.0005      173.1      1       <0.0005
Fish species            1,015.1     24      <0.0005      1,196.5    14      <0.0005
CAS                     10,094.9    50      <0.0005      3,479.7    47      <0.0005
log P × fish species    40.6        17      <0.0005      67.7       8       <0.0005
CAS × fish species      847.7       280     <0.0005      97.6       18      <0.0005
Error                   0.53        7,519                0.5        1,898
SS sum of squares, df degrees of freedom, P P value, R2adj adjusted coefficient of determination
Fig. 5 Relationship of LC50 with octanol-water partition coefficient (log P) among fish species. Upper panel is for the species belonging to the families Cyprinidae and Salmonidae; the lower panel is for the rest of species. Note that both axes were log-transformed. The regression lines by species are also shown. See ST3 for regression (axes: log10(LC50) against octanol-water partition coefficient, log P)
Fig. 6 Relationship of NOEC with octanol-water partition coefficient (log P) among fish species. Note that both axes were log-transformed. The regression lines by species are also shown. See ST3 for regression (axes: log10(NOEC) against octanol-water partition coefficient, log P)
Discussion RF showed that the chemical substance itself, octanol-water partition coefficient (log P), pollutant prioritization, ECOSAR classification, and fish species for LC50 and also water temperature for NOEC were the best predictors of toxicity. Chemical substance (CAS number as a categorical factor) was the most important predictor, and in agreement with other studies, no other correlates are as good predictors as the chemical itself (Vaal et al. 1997, 2000). However, RF without CAS numbers explained almost as much variability in toxicity (only about 1.3–1.5% less), suggesting that the predictors used in the RF models explained much of the variability in toxicity of the chemicals. Simple models such as ANCOVAs with only three predictors (and their interactions) explained as much variation as RFs (although note that the R2 measures are not the same). Log P, which is the parameter most commonly used for toxicity prediction in QSARs and often shows a positive relationship with toxicity (Blum and Speece 1991; Meylan et al. 1999), was after CAS number the most important predictor of LC50 both with RF and ANCOVA, in agreement with previous studies that suggest that predictions based on this parameter are reasonable (Levet et al. 2013). However, the explained variation by log P was much lower than for CAS number, and the relationship of toxicological endpoints with log P varied markedly with ECOSAR groups (Fig. S2), further illustrating potential predictive problems and that chemicals with similar structures can have very different toxicity. For instance, dieldrin (CAS 60571) and endrin (CAS 72208) are in the same ECOSAR chemical class (vinyl/allyl halides) with almost the same log P values (3.4) but marked toxicity differences (Figs. S1 and S2). This toxicity difference
was also observed in a rat study by Allen et al. (2013), who found that small structural changes in dieldrin (compared to the stereoisomer endrin) yielded significant differences in toxicity. Pollutant prioritization (i.e., the lists of priority and emerging pollutants), which is mostly based on their occurrence in aquatic environments and their toxicity risk, was an important predictor for LC50 but much less for NOEC, probably in part because regulatory rules and previous predictive tools have focused on lethal effects rather than non-lethal effects, the latter likely being less understood. Priority pollutants were more toxic than emerging pollutants, suggesting that the former include the most toxic compounds; the number of cases in the original database was similar for the priority vs. emerging pollutants (9287 vs. 9144) so does not affect this conclusion. ECOSAR classification was the best predictor among the aquatic toxicity classifications. Moore et al. (2003) evaluated model performance for six QSAR packages that predict acute toxicity to fish and showed that ECOSAR and OASIS had higher correlations between predicted and measured toxicities. The ECOSAR model has been shown to correctly classify about 65% of a large test set of industrial chemicals into defined classes of aquatic toxicity for six fish species (Reuschenbach et al. 2008). Cramer classification, mode of action by OASIS, fish family, and Verhaar classification were less useful predictors, while water temperature, hardness, and chemical functional group also influenced the toxicity results. Evaluation of compounds’ hazard through Cramer scheme has been assumed useful (Patlewicz et al. 2008), but it was among the least important predictors among the variables that we analyzed. Bhatia et al. 
(2015) evaluated 1016 fragrance materials by conducting Cramer classification using Toxtree, the OECD QSAR toolbox, and expert judgment and recommended possible coding changes to reduce disparities among classifications. Aquatic toxicity classifications based on modes of action (MOA by OASIS and Verhaar classification) did not perform so well for the substances included in the present analyses; 70–75% of the substances were assigned as reactive unspecified by OASIS and 72–84% to the class “not possible to classify according to these rules” by Verhaar classification, possibly explaining the low importance of these two classifications for toxicity prediction in the RF models. We were not able to study many environmental conditions (e.g., pH, oxygen, nutrient concentrations, and salinity) that are known to affect toxicity (Grosell et al. 2007; Pickering 1968; Thurston and Russo 1981), because they were generally less reported in the database and likely vary less in laboratory conditions. In real environmental exposure, they may have significant effects on toxicity of chemical substances. In general, LC50 and NOEC yielded similar results (e.g., explained variance by RF and ANCOVA, importance order of predictors, relationship with log P). As hypothesized, fish species was one of the most important factors, being the fifth for LC50 and second for NOEC, suggesting that this factor (and its interactions with log P and
chemical substance) is fundamental for prediction and study of chemical toxicity to fish. Our results suggest a good correlation between log P and toxicity, but the interaction between species, chemical substance, and log P as well as the interaction between species and chemicals are pervasive and important. Thus, using only log P to predict toxicity of non-tested substances might be inaccurate. Although for more than half of the fish species (13 out of 25 species) analyzed here, LC50 decreased with log P, many did not show significant relationships and explained variation was generally low, particularly for NOEC, indicating that chemicals with similar log P may differ widely in their toxicity. Vittozzi and De Angelis (1991) also showed species-dependent acute toxicity of chemicals among fish species. Although fish species sensitivity to toxicants varies among substances (Ibrahim et al. 2014), our results confirm that salmonids and also northern pike are more intolerant species and can live in narrower ranges of water and habitat quality (Hung et al. 2004; Kennard et al. 2005; Oberdorff et al. 2001). Goldfish, common carp, guppy, roach, and bream were the most tolerant species, in partial agreement with traditional views (Lyons 2006; Maceda-Veiga and De Sostoa 2011). However, the toxicity of substances was barely dependent on fish family, suggesting that this is not a good surrogate of species and that more understanding of species-specific differences and the mechanisms that cause them is needed. Species-specific tolerance might be further explained by their morphology, physiology, and ecological condition. Guénard et al. (2011) proposed a method to predict species tolerance using phylogenetic information, for 25 aquatic species and some pesticides (carbaryl, malathion, DDT, and lindane). Sensitivity differences between taxa may depend on life history traits (e.g., respiratory strategy and body size).
In conclusion, we have illustrated that modern machine-learning techniques such as RF can help to understand the complexity of toxicological processes and quantify the importance of the multitude of factors that mediate them. Our results confirm the hypothesis that fish species is among the most important predictors for modeling toxicity. Therefore, researchers should be cautious when generalizing ecotoxicological results of models that use only a few predictors and species, since chemicals with very similar structure and log P can have different toxicity, and different species in the same taxonomic family can display different sensitivity.
Acknowledgments This research was financially supported by the Spanish Ministry of Economy and Competitiveness (projects CGL2013-43822-R, CGL2015-69311-REDT, and CGL2016-80820-R), the Government of Catalonia (ref. 2014 SGR 484), the University of Girona (MPCUdG2016/120), and the European Commission (COST Action TD1209). BT benefited from a doctoral fellowship from the European Commission (Erasmus Mundus Partnership "Techno II," 372228-1-2012-1-FR-ERA MUNDUS-EMA21). We thank Dr. Pao Srean for the help in using the R software and anonymous reviewers for the comments on the manuscript.
References
Allen EM, Florang VR, Davenport LL, Jinsmaa Y, Doorn JA (2013) Cellular localization of dieldrin and structure–activity relationship of dieldrin analogues in dopaminergic cells. Chem Res Toxicol 26:1043–1054
Bhatia S, Schultz T, Roberts D, Shen J, Kromidas L, Api AM (2015) Comparison of Cramer classification between Toxtree, the OECD QSAR toolbox and expert judgment. Regul Toxicol Pharmacol 71:52–62. doi:10.1016/j.yrtph.2014.11.005
Blum DJW, Speece RE (1991) Quantitative relationships for chemical toxicity to environmental bacteria. Ecotox Environ Safe 22:198–224
Breiman L (2001) Random forests. Mach Learn 45:5–32
Cramer GM, Ford RA, Hall RL (1978) Estimation of toxic hazard—a decision tree approach. Food Cosmet Toxicol 16:255–276
Crane M, Newman MC (2000) What level of effect is a no observed effect? Environ Toxicol Chem 19:516–519
Cutler D, Edwards T, Beard K, Cutler A, Hess K, Gibson J, Lawler J (2007) Random forests for classification in ecology. Ecology 88:2783–2792
Doadrio I (2002) Atlas and red book of the inland fish of Spain. Ministry of Environment, Madrid (in Spanish)
Dudgeon D, Arthington AH, Gessner MO, Kawabata Z-I, Knowler DJ, Lévêque C, Naiman RJ, Prieur-Richard A-H, Soto D, Stiassny MLJ, Sullivan CA (2006) Freshwater biodiversity: importance, threats, status and conservation challenges. Biol Rev 81:163–182. doi:10.1017/S1464793105006950
European Union Reference Laboratory for alternatives to animal testing (EURL-ECVAM) (2015) EC Inventory; EINECS. https://eurlecvam.jrc.ec.europa.eu/laboratories-research/predictive_toxicology/information-sources/ec_inventory. Accessed 15 Sep 2015
Fedorenkova A, Vonk JA, Breure AM, Hendriks AJ, Leuven R (2013) Tolerance of native and non-native fish species to chemical stress: a case study for the river Rhine. Aquat Invasions 8:231–241. doi:10.3391/ai.2013.8.2.10
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232
García-Berthou E, Moreno-Amich R (1993) Multivariate analysis of covariance in morphometric studies of the reproductive cycle. Can J Fish Aquat Sci 50:1394–1399
Ghose A, Viswanadhan V, Wendoloski J (1998) Prediction of hydrophobic (lipophilic) properties of small organic molecules using fragmental methods: an analysis of ALOGP and CLOGP methods. J Phys Chem A 102:3762–3772
Grosell M, Blanchard J, Brix KV, Gerdes R (2007) Physiology is pivotal for interactions between salinity and acute copper toxicity to fish and invertebrates. Aquat Toxicol 84:162–172. doi:10.1016/j.aquatox.2007.03.026
Guénard G, von der Ohe PC, de Zwart D, Legendre P, Lek S (2011) Using phylogenetic information to predict species tolerances to toxic chemicals. Ecol Appl 21:3178–3190. doi:10.1890/10-2242.1
Hansen BJ, van Haelst AG, van Leeuwen K, van der Zandt P (1999) Priority setting for existing chemicals: European Union risk ranking method. Environ Toxicol Chem 18:772–779. doi:10.1002/etc.5620180425
Henegar A, Mombelli E, Pandard P, Péry ARR (2011) What can be learnt from an ecotoxicity database in the framework of the REACh regulation? Sci Total Environ 409:489–494. doi:10.1016/j.scitotenv.2010.10.028
Hung DQ, Nekrassova O, Compton RG (2004) Analytical methods for inorganic arsenic in water: a review. Talanta 64:269–277. doi:10.1016/j.talanta.2004.01.027
Ibrahim L, Preuss TG, Schaeffer A, Hommen U (2014) A contribution to the identification of representative vulnerable fish species for pesticide risk assessment in Europe—a comparison of population resilience using matrix models. Ecol Model 280:65–75. doi:10.1016/j.ecolmodel.2013.08.001
Katritzky AR, Kuanar M, Slavov S, Hall CD, Karelson M, Kahn I, Dobchev DA (2010) Quantitative correlation of physical and chemical properties with chemical structure: utility for prediction. Chem Rev 110:5714–5789. doi:10.1021/cr900238d
Kennard MJ, Arthington AH, Pusey BJ, Harch BD (2005) Are alien fish a reliable indicator of river health? Freshw Biol 50:174–193. doi:10.1111/j.1365-2427.2004.01293.x
Laskowski R, Bednarska AJ, Kramarz PE, Loureiro S, Scheil V, Kudłek J, Holmstrup M (2010) Interactions between toxic chemicals and natural environmental factors—a meta-analysis and case studies. Sci Total Environ 408:3763–3774. doi:10.1016/j.scitotenv.2010.01.043
Leigh GJ, Favre HA, Metanomski WV (1998) Principles of chemical nomenclature: a guide to IUPAC recommendations. Blackwell, Oxford. doi:10.1515/ci.2007.29.4.23
Levet A, Bordes C, Clément Y, Mignon P, Chermette H, Marote P, Cren-Olivé C, Lantéri P (2013) Quantitative structure–activity relationship to predict acute fish toxicity of organic solvents. Chemosphere 93:1094–1103. doi:10.1016/j.chemosphere.2013.06.002
Lewis PA, Klemm DJ, Lazorchak JM, Norberg-King TJ, Peltier WH, Heber MA (1994) Short-term methods for estimating the chronic toxicity of effluents and receiving waters to freshwater organisms. U.S. Environmental Protection Agency, Cincinnati
Li AJ, Leung PTY, Bao VWW, Yi AXL, Leung KMY (2014) Temperature-dependent toxicities of four common chemical pollutants to the marine medaka fish, copepod and rotifer. Ecotoxicology 23:1564–1573. doi:10.1007/s10646-014-1297-4
Liaw A, Wiener M (2002) Classification and regression by random forest. R news 2:18–22
Lifongo L, Nfon E (2009) Evaluating the fate of organic compounds in the Cameroon environment using a level III multimedia fugacity model. African J Environ Sci Technol 3:376–386
Lyons J (2006) A fish-based index of biotic integrity to assess intermittent headwater streams in Wisconsin, USA. Environ Monit Assess 122:239–258. doi:10.1007/s10661-005-9178-1
Maceda-Veiga A, De Sostoa A (2011) Observational evidence of the sensitivity of some fish species to environmental stressors in Mediterranean rivers. Ecol Indic 11:311–317. doi:10.1016/j.ecolind.2010.05.009
Mayo-Bean K, Kendra Moran L, Meylan B, Ranslow P (2012) Methodology document for the ecological structure-activity relationship model (ECOSAR) class program; estimating toxicity of industrial chemicals to aquatic organisms. U.S. Environmental Protection Agency, Washington. https://www.epa.gov/sites/production/files/2015-09/documents/ecosartechfinal.pdf. Accessed 26 July 2015
Meylan WM, Howard PH, Boethling RS, Aronson D, Printup H, Gouchiel S (1999) Improved method for estimating bioconcentration/bioaccumulation factor from octanol/water partition coefficient. Environ Toxicol Chem 18:664–672
Moore DRJ, Breton RL, MacDonald DB (2003) A comparison of model performance for six quantitative structure-activity relationship packages that predict acute toxicity to fish. Environ Toxicol Chem 22:1799–1809
Netzeva TI, Pavan M, Worth AP (2008) Review of (quantitative) structure–activity relationships for acute aquatic toxicity. QSAR Comb Sci 27:77–90. doi:10.1002/qsar.200710099
Oberdorff T, Pont D, Hugueny B, Chessel D (2001) A probabilistic model characterizing riverine fish communities of French rivers: a framework for environmental assessment. Freshw Biol 46:399–415
Oberdorff T, Pont D, Hugueny B, Porcher JP (2002) Development and validation of a fish-based index for the assessment of river health in France. Freshw Biol 47:1720–1734
OECD (2009) Guidance document for using the OECD (Q)SAR application toolbox to develop chemical categories according to the OECD guidance on grouping of chemicals. http://www.oecd.org/officialdocuments/. Accessed 14 June 2015
Patlewicz G, Jeliazkova N, Safford RJ, Worth AP, Aleksiev B (2008) An evaluation of the implementation of the Cramer classification scheme in the Toxtree software. SAR QSAR Environ Res 19:495–524. doi:10.1080/10629360802083871
Pickering QH (1968) Some effects of dissolved oxygen concentrations upon the toxicity of zinc to the bluegill, Lepomis macrochirus Raf. Water Res 2:187–194
Prasad AM, Iverson LR, Liaw A (2006) Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9:181–199. doi:10.1007/s10021-005-0054-1
R Development Core Team (2015) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Reuschenbach P, Silvani M, Dammann M, Warnecke D, Knacker T (2008) ECOSAR model performance with a large test set of industrial chemicals. Chemosphere 71:1986–1995. doi:10.1016/j.chemosphere.2007.12.006
Riethmuller N, Markich SJ, Van Dam RA, Parry D (2001) Effects of water hardness and alkalinity on the toxicity of uranium to a tropical freshwater hydra (Hydra viridissima). Biomarkers 6:45–51. doi:10.1080/135475001452788
Russom CL, Bradbury SP, Broderius SJ, Hammermeister DE, Drummond RA (1997) Predicting modes of toxic action from chemical structure: acute toxicity in the fathead minnow (Pimephales promelas). Environ Toxicol Chem 16:948–967
Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A (2008) Conditional variable importance for random forests. BMC Bioinformatics 9:307. doi:10.1186/1471-2105-9-307
Tebby C, Mombelli E, Pandard P, Péry ARR (2011) Exploring an ecotoxicity database with the OECD (Q)SAR toolbox and DRAGON descriptors in order to prioritise testing on algae, daphnids, and fish. Sci Total Environ 409:3334–3343. doi:10.1016/j.scitotenv.2011.05.029
Thurston RV, Russo RC (1981) Ammonia toxicity to fishes. Effect of pH on the toxicity of the un-ionized ammonia species. Environ Sci Technol 15:837–840
USEPA (U.S. Environmental Protection Agency) (2014) Aquatic life criteria development documents. https://www.epa.gov/wqc/aquatic-life-criteria-development-documents. Accessed 9 Feb 2017
USEPA (U.S. Environmental Protection Agency) (2015) ECOTOX user guide: ECOTOXicology database system. Version 4.0. Available at: https://cfpub.epa.gov/ecotox/. Accessed 9 Feb 2017
Vaal MA, Van Leeuwen CJ, Hoekstra JA, Hermens JLM (2000) Variation in sensitivity of aquatic species to toxicants: practical consequences for effect assessment of chemical substances. Environ Manag 25:415–423. doi:10.1007/s002679910033
Vaal MA, Wall T, Hoekstra JA, Hermens JLM (1997) Variation in the sensitivity of aquatic species in relation to the classification of environmental pollutants. Chemosphere 35:1311–1327
van der Hoeven N (1997) How to measure no effect. Part III: statistical aspects of NOEC, ECx and NEC estimates. Environmetrics 8(3):255–261
Verhaar HJM, Leeuwen CJV, Hermens JLM (1992) Classifying environmental pollutants. 1: structure-activity relationships for prediction of aquatic toxicity. Chemosphere 25:471–491
Vittozzi L, De Angelis G (1991) A critical review of comparative acute toxicity data on freshwater fish. Aquat Toxicol 19:167–204. doi:10.1016/0166-445X(91)90017-4
Vollhardt P, Schore N (2011) Organic chemistry, Sixth edn. WH Freeman and Company, New York
Woltering DM (1984) The growth response in fish chronic and early life stage toxicity tests: a critical review. Aquat Toxicol 5:1–21
Yang R, Randall DJ (1997) Biomarkers for rainbow trout (Oncorhynchus mykiss) and coho salmon (Oncorhynchus kisutch) exposed to 1,2,4,5-tetrachlorobenzene and tetrachloroguaiacol. Chemosphere 34:1167–1180