Genetica (2007) 131:151–156 DOI 10.1007/s10709-006-9125-2
ORIGINAL PAPER
Understanding relationship between sequence and functional evolution in yeast proteins
Seong-Ho Kim · Soojin V. Yi
Received: 25 July 2006 / Accepted: 9 November 2006 / Published online: 12 December 2006 © Springer Science+Business Media B.V. 2006
Abstract The underlying relationship between functional variables and sequence evolutionary rates is often assessed by partial correlation analysis. However, this strategy is impeded by the difficulty of conducting meaningful statistical analysis using noisy biological data. A recent study suggested that the partial correlation analysis is misleading when data is noisy and that the principal component regression analysis is a better tool to analyze biological data. In this paper, we evaluate how these two statistical tools (partial correlation and principal component regression) perform when data are noisy. Contrary to the earlier conclusion, we found that these two tools perform comparably in most cases. Furthermore, when there is more than one ‘true’ independent variable, partial correlation analysis delivers a better representation of the data. Employing both tools may provide a more complete and complementary representation of the real data. In this light, and with new analyses, we suggest that protein length and gene dispensability play significant, independent roles in yeast protein evolution.
Electronic supplementary material Supplementary material is available in the online version of this article at http://dx.doi.org/10.1007/s10709-006-9125-2 and is accessible for authorized users.
S.-H. Kim · S. V. Yi (✉) School of Biology, Georgia Institute of Technology, 310 Ferst Drive, Atlanta, GA 30332, USA e-mail: [email protected]
Present Address: S.-H. Kim Division of Biostatistics, School of Medicine, Indiana University, 1050 Wishard Boulevard, RG 4101, Indianapolis, IN 46202-2872, USA e-mail: [email protected]
Keywords Partial correlation · Principal component regression · Functional genomic data · Yeast protein evolution
Introduction
Understanding constraints on functional evolution is an important goal in molecular biology and evolutionary genetics. Before functional genomic data were available, constraints were often measured by evolutionary rates, because sequence data were more easily obtained than functional data. Recently a plethora of functional data has become available for several model species, enabling evolutionary biologists to directly assess the relationships between evolutionary rates and functional variables. The usual strategy in such studies is to look for significant relationships between evolutionary rates (such as dN and dS) and the measures of a functional variable (such as levels of expression or gene dispensability). This is usually done by computing correlation, a bivariate measure of association between two variables. Many significant relationships have been discovered as a result (e.g., Duret and Mouchiroud 1999; Rocha and Danchin 2004; Lemos et al. 2005; Zhang and He 2005; Drummond et al. 2006; Kim and Yi 2006). However, since many functional variables are correlated with each other, a significant pairwise relationship does not necessarily connote an independent effect of the specific variable of interest. This problem is well recognized and addressed (Lemos et al. 2005; Drummond et al. 2006). A commonly used method is partial correlation, which is the correlation of two
variables while controlling for a third or more other variables. However, a recent study (Drummond et al. 2006) proposed that the partial correlation analysis (which we will refer to as ‘PC’ throughout this paper) is misleading when the data is noisy. Instead, they suggest the principal component regression analysis (referred to as ‘PCR’ throughout this paper) as a better tool to analyze noisy data. Here, we investigated the performances of these two analytical tools under a variety of biologically relevant situations, and show that in most cases they provide comparable results. Therefore, not all significant results from partial correlation analyses can be considered as spurious correlation of noisy data. Rather, partial correlation and principal component regression reveal different aspects of data structure, and provide complementary information. By comparing the results of these two analyses, we propose that besides expression-related variables, gene dispensability and protein length significantly affect yeast protein evolutionary rates.
Materials and methods
Data
All data on evolutionary and functional measures of yeast, transformed to approximate normality, are obtained from the supplementary material of Drummond et al. (2006). Briefly, five measures of evolutionary rates (dN, dS, dN/dS, dS', dN/dS') are obtained from a four-way yeast species alignment for 3,306 S. cerevisiae genes (Wall et al. 2005). There are seven functional variables (Dispensability, CAI, Degree, Centrality, Expression, Abundance, Length). Data on the average growth rates of homozygous deletion strains, downloaded from http://chemogenomics.stanford.edu/supplements/01yfh/files/orfgenedata.txt, were used to derive measures of protein dispensability (Dispensability). CAI values (CAI) are originally published in Coghlan and Wolfe (2000). The filtered yeast interactome data set (Han et al. 2004) was used to obtain the number of interactions in the protein-protein network (Degree) and to calculate centrality (Centrality). Data on protein abundance (Abundance) are from Ghaemmaghami et al. (2003). Expression levels (Expression) for each ORF are obtained from Holstege et al. (1998). After removing missing data, there are 568 proteins that have data for all seven variables. Some analyses are performed excluding the variables Degree and Centrality (see Table 2).
Analyses
Statistical analyses and simulations are performed using the statistical packages S-Plus (S-PLUS 6.2 for Windows, Insightful, Seattle, WA) and R (R Development Core Team 2004). For simulations in Fig. 1, we first found the variance-covariance matrix to calculate the eigenvectors and then the principal components. Then, we determined the coefficients of determination between the response variable and each principal component (details are shown in the Supplementary Material). Mathematical expressions and variances of each variable were used to compute the partial correlation between the response variable and each putative determinant (p 134, Whittaker 1996).
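The sequence of steps described for the principal component analysis (variance-covariance matrix, eigenvectors, principal components, then the coefficient of determination against the response) can be sketched for the two-predictor case. The following is a minimal illustration in Python, not the S-Plus/R code used in the study. It assumes a hypothetical model in which two noisy predictors and the response each equal a shared variable of variance 1 plus independent noise; the function names and the closed-form 2x2 eigen-decomposition are choices made here for illustration only.

```python
import math


def principal_components_2x2(var_a, var_b, cov_ab):
    """Eigen-decomposition of the covariance matrix [[var_a, cov_ab], [cov_ab, var_b]].

    Returns [(eigenvalue, (weight_a, weight_b)), ...], largest eigenvalue first.
    Assumes cov_ab != 0 so that the eigenvector formula below is well defined.
    """
    mean = (var_a + var_b) / 2.0
    half_gap = math.hypot((var_a - var_b) / 2.0, cov_ab)
    components = []
    for eigval in (mean + half_gap, mean - half_gap):
        # The vector (cov_ab, eigval - var_a) satisfies (S - eigval * I) v = 0.
        v = (cov_ab, eigval - var_a)
        norm = math.hypot(v[0], v[1])
        components.append((eigval, (v[0] / norm, v[1] / norm)))
    return components


def pcr_r2_percent(noise_var_p1, noise_var_p2, noise_var_resp, var_shared=1.0):
    """Population R^2 (%) between the response and each principal component.

    Hypothetical model: predictor1 = S + e1, predictor2 = S + e2, response = S + e3,
    with independent noise terms, so every pairwise covariance equals Var(S).
    """
    var_p1 = var_shared + noise_var_p1
    var_p2 = var_shared + noise_var_p2
    var_resp = var_shared + noise_var_resp
    results = []
    for eigval, (w1, w2) in principal_components_2x2(var_p1, var_p2, var_shared):
        cov_pc_resp = (w1 + w2) * var_shared  # Cov(w1*p1 + w2*p2, response)
        results.append(100.0 * cov_pc_resp ** 2 / (eigval * var_resp))
    return results


# Equal noise variances (1.2^2) on both predictors and on the response.
equal_noise = pcr_r2_percent(1.2 ** 2, 1.2 ** 2, 1.2 ** 2)
# Nearly clean first predictor and response (0.05^2), noisy second predictor (1.2^2).
unequal_noise = pcr_r2_percent(0.05 ** 2, 1.2 ** 2, 0.05 ** 2)
print([round(v, 1) for v in equal_noise], [round(v, 1) for v in unequal_noise])
```

Because the calculation works on the population covariance matrix, it involves no sampling noise; the components are taken from the covariance matrix rather than the correlation matrix, which is an assumption of this sketch.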
Results and discussion
Comparative analyses of PC and PCR
Let us first review Drummond et al.’s (2006) analysis on spurious partial correlation (PC). We will use the same notations as therein. Assume two variables D and K, both affected by an independent variable X so that $D = X + \varepsilon_D$ and $K = X + \varepsilon_K$. Both error terms have mean 0 and independent variances $\sigma_D^2$ and $\sigma_K^2$, respectively. Now, instead of X, one has a noisy $X'$ so that $X' = X + \varepsilon_{X'}$. Drummond et al. (2006) demonstrate that in such a case one still finds a significant partial correlation $r_{DK|X'}$ after removing the effect of $X'$, and this spurious partial correlation increases when the data are very noisy ($\sigma_{X'} \to \infty$), or when D and K are perfectly clean ($\sigma_D \to 0$ and $\sigma_K \to 0$). These findings are intuitive, given that the variables D and K are no longer independent after removing $X'$: now we have $D = X' + \varepsilon_D - \varepsilon_{X'}$ and $K = X' + \varepsilon_K - \varepsilon_{X'}$. Therefore, $r_{DK|X'}$ will increase as $\sigma_{X'}^2 \to \infty$ or $\sigma_{D,K} \to 0$, since in either case $\varepsilon_{X'}$ will dominate the relationship between D and K after removing the effect of $X'$.
We investigated whether PCR is free from errors when data are noisy, using a model of two independent variables and one response variable, as in Drummond et al. (2006). Extending this model to include more variables is straightforward. Consider a variable X determining two variables, a putative determinant Y and a response variable Z, so that $Y = X + \varepsilon_Y$ and $Z = X + \varepsilon_Z$ (noise terms have variances $\sigma_Y^2$ and $\sigma_Z^2$). Further, we have a noisy $X' = X + \varepsilon_{X'}$ where $\varepsilon_{X'}$ has variance $\sigma_{X'}^2$. There is clearly only one ‘true’ determinant (X) that contributes to Z. Now we will discuss several specific examples and compare the results of PC and PCR. For simplicity, we consider the case when
[Fig. 1 plot: vertical axis $\sigma_{X'}$ and horizontal axis $\sigma_Y$, each from 0.0 to 3.0; dashed line $\sigma_Y = \sigma_{X'}$; regions labeled “PCR is better than PC”, “PC is better than PCR”, and “PC and PCR have one significant contributor”]
Fig. 1 Comparison of PC and PCR using models of ‘noisy data’, with two putative determinants $X'$ and Y (governed by one true underlying determinant X) and a response variable Z (not shown). The two axes are the standard deviations ($\sigma_{X'}$ and $\sigma_Y$) of the noise terms of $X'$ and Y. Sample size is 3,000. Shaded areas near the two axes represent cases where PC identifies one significant variable but PCR finds two significant principal components. In the shaded area near the dashed line ($\sigma_{X'} = \sigma_Y$), PCR identifies one significant principal component while PC finds two significant contributors. In all other cases, both analyses identify the same numbers of significant relationships. Only in the dark area near the origin do both analyses correctly identify only one significant contributor
X has variance 1 and every noise term is independent of the others and of X. Please see the Supplementary Material for a more in-depth discussion of the statistical significance of the principal components determined by PCR in general. In the first example, $\sigma_{X'}^2 = \sigma_Z^2 = 0.05^2$ and $\sigma_Y^2 = 1.2^2$ (Table 1A; the number of samples (N) is 3,000 in all examples in this study unless otherwise noted). PCR falsely identifies two significant principal components, while PC correctly identifies one significant relationship. However, if $\sigma_{X'}^2 = \sigma_Z^2 = \sigma_Y^2 = 1.2^2$, then PCR correctly identifies one significant principal component while PC suggests two significant relationships (Table 1B). This latter case is identical to the example in Box 1 of Drummond et al. (2006). These contradictory observations arise because both PC and PCR can produce erroneous results. Specifically, when the ratio of the variances of the independent variables is large, PCR may be worse than PC. On the other hand, when this ratio is close to 1, PCR may present a more reasonable picture of the data (see below). However, in most cases, both PC and PCR identify two significant contributors.
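The partial correlation side of the two examples above can be made concrete with the standard first-order formula $r_{AZ|B} = (r_{AZ} - r_{AB} r_{BZ}) / \sqrt{(1 - r_{AB}^2)(1 - r_{BZ}^2)}$. The sketch below, in Python, evaluates it for the one-determinant model under the assumptions that X has variance 1 and all noise terms are independent. The function names are hypothetical, and the code reports population values only, not sample estimates or P-values.

```python
import math


def partial_corr(r_az, r_bz, r_ab):
    """First-order partial correlation between A and Z, controlling for B."""
    return (r_az - r_ab * r_bz) / math.sqrt((1.0 - r_ab ** 2) * (1.0 - r_bz ** 2))


def partial_r2_percent(s2_xprime, s2_y, s2_z, var_x=1.0):
    """Squared partial correlations (%) of X' and of Y with Z.

    Model: X' = X + e_X', Y = X + e_Y, Z = X + e_Z with independent noise,
    so Cov(X', Y) = Cov(X', Z) = Cov(Y, Z) = Var(X).
    """
    var_xp = var_x + s2_xprime
    var_y = var_x + s2_y
    var_z = var_x + s2_z
    r_xp_z = var_x / math.sqrt(var_xp * var_z)
    r_y_z = var_x / math.sqrt(var_y * var_z)
    r_xp_y = var_x / math.sqrt(var_xp * var_y)
    xp_given_y = partial_corr(r_xp_z, r_y_z, r_xp_y)  # X' with Z, controlling for Y
    y_given_xp = partial_corr(r_y_z, r_xp_z, r_xp_y)  # Y with Z, controlling for X'
    return 100.0 * xp_given_y ** 2, 100.0 * y_given_xp ** 2


# First example: noise variances 0.05^2 on X' and Z, 1.2^2 on Y.
case_a = partial_r2_percent(0.05 ** 2, 1.2 ** 2, 0.05 ** 2)
# Second example: noise variance 1.2^2 on all three variables.
case_b = partial_r2_percent(1.2 ** 2, 1.2 ** 2, 1.2 ** 2)
print([round(v, 2) for v in case_a], [round(v, 2) for v in case_b])
```

In the first case almost all of the explained variance is attributed to X' and almost none to Y, while in the second case X' and Y receive equal, non-zero shares, which is the pattern the text describes for Table 1A and 1B.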
Table 1 Specific examples comparing partial correlation analysis and principal component analysis

        Partial correlation                 Principal component regression
        Predictor   R^2 (%)   P-value       Principal component   R^2 (%)   P-value
A. $\sigma_{X'}^2 = \sigma_Z^2 = 0.05^2$ and $\sigma_Y^2 = 1.2^2$
        X'          99.2      <<10^-9       0.46X' + 0.89Y        61.2      <<10^-9
        Y           0.09      0.1076        0.89X' - 0.46Y        38.3      <<10^-9
B. $\sigma_{X'}^2 = \sigma_Z^2 = \sigma_Y^2 = 1.2^2$
        X'          8.45      <<10^-9       0.71Y + 0.71X'        23.8      <<10^-9
        Y           8.45      <<10^-9       0.71Y - 0.71X'        0         1
C. $\sigma_{X'}^2 = 0.07$ and $\sigma_Y^2 = \sigma_Z^2 = 1$
        X'          29.2      <<10^-9       0.54X' + 0.84Y        36.1      <<10^-9
        Y           0.39      <0.05         0.84X' - 0.54Y        10.8      <<10^-9
D. $\sigma_A^2 = \sigma_B^2 = \sigma_C^2 = \sigma_X^2 = \sigma_Y^2 = \sigma_Z^2 = 1^2$
        X           6.26      <<10^-9       0.71X + 0.71Y         16.7      <<10^-9
        Y           6.26      <<10^-9       0.71X - 0.71Y         0.0       1

R^2 (%): % variance in Z explained. Significant contributors are shown in bold. Sample sizes (N) are 3,000 except (C), which used 1,000 samples (see text for models used)
In Fig. 1, we illustrate a comparative view of the performances of PC and PCR with varying levels of noise. The standard deviations of the noise terms range from 0.01 to 3.00. This range is chosen so that every correlation among variables is significantly greater than 0.1. PCR identifies one significant principal component while PC suggests two significant variables near the dashed line where $\sigma_{X'}$ is equal to $\sigma_Y$. On the other hand, around the two axes, PC identifies one significant variable while PCR falsely identifies two significant principal components. In all other cases, both analyses suggest the same number of significant contributors. In the dark area near the origin, both analyses correctly identify one significant contributor. However, both analyses find two significant contributors in the white area. For example, Drummond et al. (2006) show that a spurious partial correlation is found even with a relatively modest noise level, specifically when $\sigma_{X'}^2 = 0.07$ and $\sigma_Y^2 = \sigma_Z^2 = 1$ (N = 1,000). In this case, PCR also finds two significant principal components (Table 1C): both PC and PCR are misleading in such cases.
Our results show that these two methods are generally comparable. The accuracies of PC and PCR are largely determined by the error terms associated with the data. Unfortunately we currently have no means to assess the error structure inherent in biological data. Hence, one analytical tool may not be branded error-prone compared to others.
Pitfalls of principal component regression analysis in another situation
Furthermore, PCR may be inferior to PC when there is more than one ‘true’ independent variable. Suppose two variables A and B, together with a third variable C, determine variables X, Y, and Z, where $X = A + C + \varepsilon_X$, $Y = B + C + \varepsilon_Y$, and $Z = A + B + \varepsilon_Z$ (the noise terms $\varepsilon_X$, $\varepsilon_Y$, and $\varepsilon_Z$ have variances $\sigma_X^2$, $\sigma_Y^2$, and $\sigma_Z^2$), so that C affects only X and Y. Here, X and Y are putative determinants of a response variable Z. For example, evolutionary rates (Z) may be governed by two (unknown) biological factors (A and B), each together with another common factor (C) affecting expression (Y) and codon usage bias (X). In an example where all noise terms have variance 1, PCR finds that Z is determined by just one hidden variable, even though there are two determinants A and B contributing to Z (Table 1D). In contrast, PC shows that both X and Y contribute to Z independently. Therefore, PC gives a more reasonable picture of the relationships in this case.
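The two-determinant example can also be checked by simulation. The following is a minimal Monte Carlo sketch in Python, not the authors' code. It assumes that A, B, C and all noise terms are independent normal variables with variance 1 (as in Table 1D), and, as a simplification, it uses the population principal component directions (X + Y)/√2 and (X − Y)/√2 instead of estimating them from the sample. The seed and helper names are arbitrary.

```python
import math
import random


def pearson(u, v):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


def partial_corr(r_az, r_bz, r_ab):
    """First-order partial correlation between A and Z, controlling for B."""
    return (r_az - r_ab * r_bz) / math.sqrt((1.0 - r_ab ** 2) * (1.0 - r_bz ** 2))


rng = random.Random(2006)  # arbitrary seed, fixed for reproducibility
n_samples = 3000
x, y, z = [], [], []
for _ in range(n_samples):
    a, b, c = (rng.gauss(0.0, 1.0) for _ in range(3))  # hidden factors A, B, C
    x.append(a + c + rng.gauss(0.0, 1.0))  # X = A + C + noise
    y.append(b + c + rng.gauss(0.0, 1.0))  # Y = B + C + noise
    z.append(a + b + rng.gauss(0.0, 1.0))  # Z = A + B + noise

r_xz, r_yz, r_xy = pearson(x, z), pearson(y, z), pearson(x, y)
pc_x = 100.0 * partial_corr(r_xz, r_yz, r_xy) ** 2  # X with Z, controlling for Y
pc_y = 100.0 * partial_corr(r_yz, r_xz, r_xy) ** 2  # Y with Z, controlling for X

# Var(X) = Var(Y) in this model, so the population components are (X + Y)/sqrt(2)
# and (X - Y)/sqrt(2); sample-based components would differ slightly.
comp1 = [(a + b) / math.sqrt(2.0) for a, b in zip(x, y)]
comp2 = [(a - b) / math.sqrt(2.0) for a, b in zip(x, y)]
pcr_1 = 100.0 * pearson(comp1, z) ** 2
pcr_2 = 100.0 * pearson(comp2, z) ** 2
print(round(pc_x, 2), round(pc_y, 2), round(pcr_1, 2), round(pcr_2, 2))
```

With N = 3,000 the squared partial correlations should land near 6% for both X and Y, while the first component should explain roughly 17% of the variance in Z and the second close to none, in line with Table 1D.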
PC and PCR provide complementary information on the determinants of yeast protein evolutionary rates
We have shown above that PC and PCR may both fail in some biological situations. Generally speaking, PCR can find the main hidden factors. However, it is difficult to interpret the true meaning of these factors. In comparison, PC provides information on observable factors. Hence, if we cannot clearly interpret hidden factors, or if we want to detect the main observable factors, PC may be a better choice. Further, combining PC and PCR can provide a complementary view of the data. We emphasize that it is possible to investigate the influence of main hidden variables without principal component regression, by comparing normal and partial correlations. If the relationship between independent variables and a response variable is mainly governed by a hidden variable, then the amount of total variance explained by partial correlation will be substantially lower than that explained by normal correlation (Weisberg 1985; also, see Supplementary Material). We will demonstrate this by revisiting the determinants of yeast protein evolution below. In Table 2, we compared the results of normal correlation and partial correlation analysis on the determinants of yeast protein evolutionary rates. Overall, both analyses show that variables related to expression level are the strongest determinants of yeast protein evolution (Table 2), confirming earlier results (Drummond et al. 2006; Kim and Yi 2006). Further, by analyzing normal and partial correlations, we can investigate the presence of hidden variables. In Table 2, we can see that partial correlations between evolutionary rates and the variables Expression, CAI, and Abundance tend to be substantially lower than normal correlations, as if there is a major hidden variable behind these expression-related variables. 
This is similar to the results of PCR, in which the principal component 1 (Expression, CAI, and Abundance) had a main effect. Interestingly, the contributions from Length and Dispensability do not change much when either normal or partial correlation is used (Table 2). This suggests that their effects on evolutionary rates are independent of other factors such as those related to expression levels (Wall et al. 2005; Zhang and He 2005). However, the effects of Degree are reduced substantially when we compare normal and partial correlations, suggesting that its effect is not independent of other factors. Again, this is in accord with other analyses that concluded that Degree has at most
Table 2 Correlation and partial correlation analysis of determinants of evolutionary rates in yeast

(A) Relationship between evolutionary rates and seven functional measures (predictors)
Predictor        dN                         dS                         dN/dS                      dS'                        dN/dS'
                 Normal    Partial   Red.   Normal    Partial   Red.   Normal    Partial   Red.   Normal    Partial   Red.   Normal    Partial   Red.
Expression       34.37***  2.36*     93     42.44***  3.16*     93     19.42***  1.15      94     9.49***   4.20**    56     32.92***  1.68#     95
CAI              31.93***  8.24***   74     55.00***  31.88***  42     14.70***  1.47#     90     1.15      0.48      58     33.21***  9.17***   72
Abundance        26.22***  1.26#     95     24.88***  0.75      97     16.85***  1.82#     89     2.23*     0.02      99     26.85***  1.52#     94
Length           4.95**    3.74*     24     3.75*     5.28**    -      3.54*     1.87#     47     8.46***   3.53*     58     3.68*     2.78*     24
Dispensability   2.83*     2.00*     29     0.02      0.16      -      3.66*     2.37*     35     0.07      0.07      0      2.98*     2.26*     24
Degree           4.54**    0.85      81     3.11*     1.15*     63     3.34*     0.41      88     3.02*     0.66      78     4.03*     0.71      82
Centrality       0.35      0         100    0.02      0         100    0.41      0.01      98     0.5       0.06      88     0.28      0         100

(B) Relationship between evolutionary rates and five functional measures
Predictor        dN                         dS                         dN/dS                      dS'                        dN/dS'
                 Normal    Partial   Red.   Normal    Partial   Red.   Normal    Partial   Red.   Normal    Partial   Red.   Normal    Partial   Red.
Expression       26.19***  1.64**    94     29.53***  1.88**    94     15.51***  0.81*     95     2.85***   2.34***   18     25.64***  1.15*     96
CAI              30.65***  8.25***   73     39.21***  17.13***  56     17.05***  2.88***   83     0.07      1.68**    -      33.25***  10.21***  69
Abundance        24.17***  1.91***   92     20.97***  0.02      100    16.09***  1.80**    89     0.64*     0.12      81     24.93***  1.97***   92
Length           1.69**    1.08*     36     2.98***   2.89***   3      0.78*     0.31      60     4.81***   2.29***   52     1.02*     0.62*     39
Dispensability   3.05***   1.95***   36     0.54#     0.11      80     3.06***   1.82**    41     0.16      0.07      56     3.07***   1.98***   36

‘Normal’ refers to squared Pearson’s correlation coefficients $\rho^2$ (%) and ‘Partial’ refers to squared Pearson’s partial correlation coefficients $\rho_p^2$ (%) between each measure of evolutionary rates and each predictor. (A) Analyses including all seven functional genomic measures (predictors). Sample size is 568 after removing missing data. (B) Results when only five predictors are considered, excluding the variables Degree and Centrality. Sample size is 1,939.
Significance codes: # P < 0.01; * P < 10^-3; ** P < 10^-6; *** P < 10^-9
Red.: reduction in R^2 explained by this variable (%). Cases where the reduction is >50% are shown in bold
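The reduction values in Table 2 can be reproduced from the normal and partial values. The sketch below assumes, based on the tabulated numbers rather than on a formula stated in the text, that the reduction is the relative drop (normal − partial) / normal expressed in percent and rounded to the nearest integer. The function name is hypothetical.

```python
def r2_reduction_percent(normal_r2, partial_r2):
    """Relative drop (%) from the squared normal to the squared partial correlation.

    Returns None when the partial value exceeds the normal one, mirroring the
    '-' entries in the table (an assumption about what those entries mean).
    """
    if partial_r2 > normal_r2:
        return None
    return 100.0 * (normal_r2 - partial_r2) / normal_r2


# (normal, partial) values for dN with seven predictors, from panel (A) of Table 2.
dn_panel_a = {
    "Expression": (34.37, 2.36),
    "CAI": (31.93, 8.24),
    "Length": (4.95, 3.74),
    "Dispensability": (2.83, 2.00),
}
reductions = {
    name: round(r2_reduction_percent(normal, partial))
    for name, (normal, partial) in dn_panel_a.items()
}
print(reductions)
```

The expression-related predictors lose most of their explained variance once the other predictors are controlled for, whereas Length and Dispensability lose comparatively little.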
a weak relationship with protein evolutionary rates (Jordan, Wolf and Koonin 2003; Hahn, Conant and Wagner 2004). There are further similarities between the results of PC and PCR. For example, PCR detects two significant principal components for dS' (which stands for the synonymous rates (dS) adjusted for codon usage bias: Hirsh, Fraser and Wall 2005), the principal component 1 (combined effects of Expression, CAI, and Abundance) and the principal component 3 (Length) (Table 2 in Drummond et al. 2006). In PC (Table 2), the two most important determinants of dS' are Expression and Length. Both analyses indicate that Length and Expression are two major determinants of dS, independent of codon usage bias. Similarly, when only five predictors are considered in PCR (Table 3 of Drummond et al. 2006), the strongest contribution to dS' is from the principal components 2 and 3 (combined effects of Length and Dispensability). The principal component 5 (Expression and CAI) makes the next contribution. PC with the same data reveals that Expression and Length are the main determinants of dS' (Table 2B). Therefore, the results of PC and PCR correspond well.
In summary, our study demonstrated that noisy data do not necessarily preclude the use of PC in biological data analyses. Employing PC and PCR together can provide complementary results. Our study also confirmed that expression is the major variable in predicting yeast evolutionary rates. In addition, we conclude that the variables Length and Dispensability are significantly and independently related to evolutionary rates.
Acknowledgements We thank D. Allan Drummond and Claus Wilke for helpful personal communications, and Charles Warden for critical reading of the manuscript. SY is supported by funds from the Georgia Institute of Technology.
References
Coghlan A, Wolfe KH (2000) Yeast 16:1131–1145
Drummond DA, Raval A, Wilke CO (2006) A single determinant dominates the rate of yeast protein evolution. Mol Biol Evol 23:327–337
Duret L, Mouchiroud D (1999) Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc Natl Acad Sci USA 96:4482–4487
Ghaemmaghami S, Huh W-K, Bower K, Howson RW, Belle A, Dephoure N, O’Shea EK, Weissman JS (2003) Global analysis of protein expression in yeast. Nature 425:737–741
Hahn MW, Conant GC, Wagner A (2004) Molecular evolution in large genetic networks: does connectivity equal constraint? J Mol Evol 58:203–211
Han JD, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, Dupuy D, Walhout AJ, Cusick ME, Roth FP, Vidal M (2004) Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 430:88–93
Hirsh AE, Fraser HB, Wall DP (2005) Adjusting for selection on synonymous sites in estimates of evolutionary distance. Mol Biol Evol 22:174–177
Holstege FC, Jennings EG, Wyrick JJ, Lee TI, Hengartner CJ, Green MR, Golub TR, Lander ES, Young RA (1998) Dissecting the regulatory circuitry of a eukaryotic genome. Cell 95:717–728
Jordan IK, Wolf YI, Koonin EV (2003) No simple dependence between protein evolution and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evol Biol 3:1
Kim S-H, Yi S (2006) Correlated asymmetry between sequence and functional divergence of duplicate proteins in Saccharomyces cerevisiae. Mol Biol Evol 23:1068–1075
Lemos B, Bettencourt BR, Meiklejohn CD, Hartl DL (2005) Evolution of proteins and gene expression levels are coupled in Drosophila and are independently associated with mRNA abundance, protein length, and number of protein-protein interactions. Mol Biol Evol 22:1345–1354
R Development Core Team (2004) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-00-3, URL http://www.R-project.org
Rocha EP, Danchin A (2004) An analysis of determinants of amino acids substitution rates in bacteria. Mol Biol Evol 21:108–116
Wall DP, Hirsh AE, Fraser HB, Kumm J, Giaever G, Eisen MB, Feldman MW (2005) Functional genomic analysis of the rates of protein evolution. Proc Natl Acad Sci USA 102:5483–5488
Weisberg S (1985) Applied linear regression. John Wiley and Sons, 336 pp
Whittaker J (1996) Graphical models in applied multivariate statistics. John Wiley and Sons, New York, 466 pp
Zhang JG, He X (2005) Significant impact of protein dispensability on the instantaneous rate of protein evolution. Mol Biol Evol 22:1147–1155