ISSN 1022-7954, Russian Journal of Genetics, 2006, Vol. 42, No. 10, pp. 1199–1207. © Pleiades Publishing, Inc., 2006. Original Russian Text © L.A. Zhivotovsky, 2006, published in Genetika, 2006, Vol. 42, No. 10, pp. 1426–1436.
METHODS
Population Aspects of Forensic Genetics
L. A. Zhivotovsky
Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, 119991 Russia; e-mail: [email protected]
Received March 29, 2006
Abstract—The paper presents the methodology of forensic genetics as a synthesis of population genetics and forensic medicine. Main population genetic problems, appearing in calculation of probability statistics and interpretation of the results of forensic genetic investigations, are discussed in detail. DOI: 10.1134/S1022795406100127
INTRODUCTION
In 1870, the French lawyer and anthropologist Alphonse Bertillon developed the first system of morphometric measurements and descriptions of human body parts, which was aimed at identifying a person on the basis of impartial individual traits rather than a verbal portrait. This system, albeit rather cumbersome, immediately proved successful in personal identification. Two decades later, it was replaced by dactyloscopy, a more reliable method of personal identification based on finger skin patterns, which had been developed by the British anthropologist Francis Galton. In 1987, the English geneticist Alec Jeffreys devised the DNA fingerprinting procedure for detecting highly individual multiple patterns of DNA fragments of different lengths. However, this method required large amounts of freshly isolated DNA, and the components of the pattern were labile. Since the mid-1990s, analysis of amplified DNA fragments, which provides reliable detection of polymorphic loci, has been the main method of genotyping in forensic studies [1].
Analysis of DNA markers in forensic examination is a powerful tool for investigating crimes (murders, rapes, kidnappings), identifying human remains at sites of catastrophes and military actions, determining paternity, reuniting families, etc. It permits establishing the “genetic passport” of an individual on the basis of any DNA-containing biological sample (blood drops, sperm, hair roots, etc.).
Forensic studies use various types of DNA markers. The most frequently employed are microsatellites, which are highly polymorphic and have numerous alleles at each locus, thus providing a high probability of personal identification and determination of biological kinship [1, 2]. Because microsatellite loci are small, the polymerase chain reaction can be used for amplification (a manifold increase in the copy number) of a given DNA fragment [3].
For this, only a small amount of DNA is needed, which can be extracted even from greatly degraded biological material. Also important for forensic studies is analysis of mitochondrial DNA, which occurs in multiple copies and can therefore be present in sufficient amounts even in decayed tissues and osseous remains, in which examination of other DNA types becomes problematic [4].
Note that the exceptional power of DNA identification methods is also their disadvantage: they are extremely sensitive to errors committed at any stage of a forensic genetic investigation, which may result in false expert conclusions.
DNA analysis consists of three main stages. The first stage is collection and storage of DNA samples. Pooled collection of blood, saliva, or sperm traces from the crime scene, which may put a mixture of samples from several individuals into the same tube, as well as mix-ups or substitution of samples, may lead to serious judicial errors. The second stage is laboratory analysis of the biological samples with the aim of genotyping, i.e., determining the genotypes of the samples. This stage may involve technical errors, which can happen in any laboratory. To reduce them, developed countries practice large-scale testing of the laboratories that perform forensic studies. The third stage consists in comparing the genotypes of the biological samples and of the persons involved in the given case; verifying the consistency between the data and the circumstances of the case; and reaching an expert decision on the significance of their consistency or inconsistency.
The key aspect of the third stage is that the employed methods of DNA typing cannot guarantee that a given genotype is unique and that no other person carries the same markers. Because of this, probabilities are computed: e.g., the probability that the person left the biological sample found at the crime scene, or the probability that the man is the biological father of the child. Estimation of these probabilities is based on knowledge of the frequencies of the genotypes found in the forensic investigation in the population to which the people involved in the case belong.
The interpopulation differences in allele or genotype frequencies, genetic disequilibrium in the populations, relatedness of the persons involved in the case, and other factors require thorough population genetic analysis, since disregarding them may significantly affect the probability value. The
probability estimate is recorded in the expert conclusions and is decisive for the legal verdict, particularly when other, nongenetic, evidence is absent; hence, a detailed population genetic analysis of the DNA evidence is required in each case [1, 2, 5–8].
The aim of this article is to focus on the major population problems that arise in interpreting the results of forensic genetic investigations and that determine the methodology, approaches, and probability methods of forensic genetics, the science combining population genetics and forensic medicine.*
TWO MAJOR TYPES OF FORENSIC DNA INVESTIGATION
There are two kinds of forensic genetic examination. One of them is personal identification, in which the biological samples are individualized, i.e., their origin is established. The second is determination of the biological relatedness of persons. The two differ somewhat genetically. In personal identification, the task is to establish whether samples with identical genotypes come from the same individual, i.e., the decision is based on comparing genotypes. In determining biological relatedness, the allele composition of individual genotypes (biological samples) is compared to establish their identity by origin, i.e., the decision is based on comparing alleles. This issue is discussed in detail in [2, 9].
Personal identification. Let us consider a hypothetical case in which the expert is asked whether a blood trace found in a stolen car could belong to a victim discovered in a wood. For this, the genotype (or genotypic profile) of the blood trace and the genotype of the victim at a number of DNA markers are established and compared. If these genotypes differ from one another at even one locus, the answer to this question is strictly negative (provided that the genotyping was done without mistakes, the biological samples were not mixed up or substituted, and the samples are not a mixture of biological material from different persons).
However, if the genotypes coincide at all loci, the answer is “yes, it could.” Nevertheless, coincidence of the genotypes even at a large number of loci does not mean that the blood found in the car is indeed the blood of the victim. It may well be that the population contains several individuals with identical genotypes at the examined loci. Naturally, the more loci are analyzed, the less probable is accidental concordance of the genotypes of the blood and the victim. However, it should be borne in mind that cases are known in world practice in which genotypes were concordant at many loci, but additional analysis of other loci revealed a difference. Therefore, a statement on the source of the blood sample in question can be made only in terms of probabilities.
* In the text, the basic terms are italicized, and important issues, underlined.
This means that the
probability of the given person leaving the blood trace should be estimated and presented in the expert conclusion. The less frequent the genotype of the blood trace found in the car is in the population, the more conclusive is this “genetic evidence,” because the probability that the blood in fact belongs to the victim increases. Conversely, the more frequent this genotype is in the population, the less significant is the genetic evidence. Consequently, the expert conclusion must present the calculation of a probability value that quantitatively assesses the genotype concordance.
Biological relatedness. An example of such problems is disputed paternity: the expert is to determine whether the putative father is the biological father of the child (maternity being considered established). After the genotypes (genotypic profiles) of the mother, the child, and the putative father are determined, they are compared with one another, and the alleles that the child could have received from the mother are identified. The other, nonmaternal, alleles must have been transmitted from the biological father. If the child’s set of nonmaternal alleles includes alleles absent in the putative father, his biological paternity is rejected, unless the incongruence is explained by mutations. If, however, the genotypic profile of the putative father is concordant with those of the mother and the child, his biological paternity is not rejected. This does not mean that the putative father is indeed the biological father of the child. Even if the biological samples in question were not substituted, several males in the population may carry the same alleles that the child inherited from its biological father. The probability of this is higher if the biological father of the child is a relative of the putative father.
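The two comparison rules described above (genotype against genotype for personal identification, nonmaternal alleles against the putative father's alleles for paternity) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the two-locus profiles, and the allele values are hypothetical, and mutations, null alleles, and mixed samples are deliberately ignored.

```python
from typing import Dict, Set, Tuple

Genotype = Tuple[int, int]      # two alleles at one locus (repeat counts)
Profile = Dict[str, Genotype]   # locus name -> genotype


def profiles_match(a: Profile, b: Profile) -> bool:
    """Personal identification: identical unordered genotypes at every shared locus."""
    shared = a.keys() & b.keys()
    return bool(shared) and all(sorted(a[k]) == sorted(b[k]) for k in shared)


def paternal_candidates(child: Genotype, mother: Genotype) -> Set[int]:
    """Alleles the child may have received from the biological father."""
    c1, c2 = child
    candidates = set()
    if c1 in mother:        # c1 may be maternal, so c2 may be paternal
        candidates.add(c2)
    if c2 in mother:        # c2 may be maternal, so c1 may be paternal
        candidates.add(c1)
    return candidates


def paternity_not_excluded(child: Profile, mother: Profile, man: Profile) -> bool:
    """False as soon as one locus shows no allele the man could have transmitted.

    Mutations and null alleles are ignored in this sketch.
    """
    for locus, child_gt in child.items():
        if not paternal_candidates(child_gt, mother[locus]) & set(man[locus]):
            return False
    return True


# Hypothetical two-locus profiles, invented for illustration only.
child = {"D3S1358": (15, 17), "vWA": (16, 16)}
mother = {"D3S1358": (15, 16), "vWA": (16, 18)}
man_1 = {"D3S1358": (17, 18), "vWA": (14, 16)}
man_2 = {"D3S1358": (14, 16), "vWA": (16, 17)}

print(paternity_not_excluded(child, mother, man_1))  # man_1 carries 17 and 16
print(paternity_not_excluded(child, mother, man_2))  # man_2 lacks allele 17
```

A result of "not excluded" is only the starting point: as the text stresses, it must be followed by a probability calculation, because other men in the population may carry the same alleles.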
The current approaches, methods, and problems are discussed in a number of publications, including [1, 2, 6, 7, 9–11].
DNA Markers Used in Forensic Studies
Microsatellites. A DNA region with a definite genomic location that contains short tandem repeats is called a microsatellite locus (or microsatellite), often an STR locus (or simply STR, from Short Tandem Repeats) or SSR (Simple Sequence Repeats). A DNA fragment containing a microsatellite locus is amplified by means of the polymerase chain reaction (PCR) with a pair of DNA primers. An allele of an STR locus is a DNA fragment flanked by the given pair of primers and carrying a certain number of tandem repeats. The alleles are amplified from both homologous chromosomes. Since the choice of primers is largely arbitrary, the terms allele and locus here are of symbolic character, indicating only that the locus contains the DNA region with the repeats analyzed. If another pair of primers, flanking a larger or a smaller DNA fragment but containing the same repeats, is chosen, this will be the same STR locus, if the study is
aimed at estimating the number or nucleotide composition of the repeats in the given DNA region. The corresponding alleles detected by different primer pairs differ from one another by the same number of nucleotides, which is determined only by the difference in the end regions of the amplified DNA fragments. Regardless of whether the different primer pairs yield larger or smaller DNA fragments, it is important that these fragments contain the whole region with the repeats examined, which are the main characteristic of the microsatellite locus. So-called multiplex reactions have been developed that permit simultaneous genotyping at several loci in one reaction; the loci are distinguished by the color of the fluorescent primer labels and by nonoverlapping ranges of allele sizes.
Microsatellites are scattered over virtually the whole genome: at present, about 10 000 microsatellite loci have been examined in humans [3] (see [12] for a detailed review of human microsatellite variability). Of these, forensic examination uses microsatellite loci that are highly polymorphic (with many alleles in populations), give highly reproducible results, can be combined in multiplex reactions, and are covered by large population databases. Depending on the repeat length, microsatellites are classified into loci with mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeats. Loci with a larger repeat size provide reliable discrimination of alleles close in length. The loci most commonly used today in forensic studies are those with tetranucleotide repeats, although loci with a larger motif are also often used: for example, the repeat length at locus D1S80 (MCT118) is 16 bp (according to the formal classification, this locus is assigned to minisatellites).
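The statement that alleles detected with different primer pairs differ by a constant number of nucleotides can be illustrated with a small Python sketch. It rests on a simplifying assumption that is not taken from the paper: the amplified fragment is treated as flanking sequence plus an uninterrupted run of whole repeats. The flank lengths, the motif length, and the alleles are hypothetical.

```python
def repeat_count(fragment_bp: int, flank_bp: int, motif_bp: int) -> int:
    """Number of tandem repeats in an amplified fragment.

    Assumes fragment = flanking sequence + whole number of identical repeats.
    """
    repeat_region = fragment_bp - flank_bp
    if repeat_region < 0 or repeat_region % motif_bp != 0:
        raise ValueError("fragment is not flank + whole number of repeats")
    return repeat_region // motif_bp


MOTIF_BP = 4                    # hypothetical tetranucleotide locus
FLANK_A, FLANK_B = 100, 140     # hypothetical total flank lengths for two primer pairs
alleles = [12, 15]              # repeat numbers carried by one individual

sizes_a = [FLANK_A + MOTIF_BP * n for n in alleles]
sizes_b = [FLANK_B + MOTIF_BP * n for n in alleles]

# The fragments differ between primer pairs by one constant offset ...
offsets = {b - a for a, b in zip(sizes_a, sizes_b)}
print(offsets)

# ... while the inferred repeat numbers, i.e., the alleles, are the same.
print([repeat_count(s, FLANK_A, MOTIF_BP) for s in sizes_a])
print([repeat_count(s, FLANK_B, MOTIF_BP) for s in sizes_b])
```

Real loci may contain interrupted or compound repeats, so allele calling in practice relies on allelic ladders rather than on this arithmetic alone.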
For DNA identification, various groups of markers are employed: in the United States, markers of the CODIS system for forensic identification (http://www.fbi.gov/hq/lab/codis/index1.htm); in Europe, ENFSI; in Russia, systems provided by Promega are often used. These sets of loci partially overlap. Microsatellite loci of the Y chromosome are also utilized for personal identification and determination of biological relatedness [13].
Mitochondrial DNA. Another type of DNA marker used in forensic investigations is mitochondrial DNA (mtDNA), a circular molecule about 16 500 bp in length. As a rule, two mtDNA regions near the D-loop are used for identification. Because of their high mutation rate and the resulting high polymorphism, they are referred to as hypervariable segments: the first (denoted HVS1) lies between nucleotide positions 16024 and 16365; the second (HVS2), between positions 73 and 340. The main differences in mtDNA among individuals fall within these two segments, mainly HVS1. A particular nucleotide sequence in one or both segments is an identifying character; it is called a haplotype or mitotype. This character is ambiguous: an individual can have more than one haplotype (heteroplasmy). In
this case, it cannot be excluded that one haplotype is preferentially detected in one type of tissue and the other in other tissues or, for instance, that two hairs of the same person have different haplotypes [14]. Haplotypes of compared biological samples are considered matching if their nucleotide sequences in the examined mtDNA segments are identical. However, as in the case of autosomal microsatellite markers, a match between the haplotypes of two samples does not mean that these samples originated from the same person (see [7] for a detailed discussion). Only when identification is performed within a small group of maternally unrelated individuals with a precisely known familial composition can analysis of mtDNA from the remains and from the relatives of the deceased provide unambiguous identification. Similar problems arise in Y-chromosome analysis.
Note that haplotypic monolocus multiallelic markers, such as mitochondrial DNA and the Y chromosome, are not individualizing. They only detect haplotypes, which can be shared by all relatives in the maternal (or, correspondingly, paternal) lineage if no mutation has occurred. For example, to establish the paternity of the United States president Thomas Jefferson with regard to the son of his black slave, Y chromosomes of living relatives of Jefferson and of descendants of his putative son in the paternal lineage were tested. The examination confirmed identity at the loci examined but could not clarify who had been involved: the president himself, his brother, or his brother’s son (see [15]).
In addition to microsatellite loci in autosomes and sex chromosomes and mitochondrial DNA nucleotide sequences, any other markers, e.g., nucleotide sequences of nuclear DNA, can be used in forensic investigations. The main required properties are the same: high identifying power, reliable genotyping under a wide range of storage conditions, and available population databases.
In particularly important cases, reliability is ensured by genotyping with different pairs of primers covering the locus. This is done if a mutation (nucleotide substitution or deletion/insertion) in the primer region cannot be excluded; in that case, amplification of the microsatellite allele associated with the mutation yields no product or a smaller amount of product than the normal allele, so that in this procedure the allele behaves as a null allele. The presence of a null allele in an individual is suggested by apparent homozygosity at an autosomal locus, which in this case is false. Using another pair of primers, which do not cover the mutation, permits PCR to proceed and the actual genotype of the individual at the microsatellite locus to be established; the mutation in the primer region can serve as an additional marker. There are also null alleles at the amelogenin locus, which is employed for determining sex from biological traces: the absence of amplification of the corresponding Y-chromosome fragment is a false indication of female sex [16].
The Probability Theory in DNA Identification
Probability Model. After the biological samples are genotyped, the resultant genetic data are interpreted in terms of population genetics in accordance with the versions available for the case, and the probability of identification is determined [1, 2]. In essence, this probability gives a quantitative estimate of the significance of the biological trace. In population genetics, two fundamentally different causes of congruence of genotypes and alleles are distinguished. One is coincidence caused by identity by descent. For mitochondrial DNA and the Y chromosome, identity by descent is explained by belonging to the same maternal or paternal lineage; for autosomal loci, copies of alleles of parents or other relatives are identical by descent; genotypes of biological samples from the same individual are also regarded as identical by descent. The other cause is coincidence due to identity by state rather than by descent. Distinguishing these two cases in terms of mathematical probability theory and estimating the probability of coincidence due to identity by descent is the ultimate aim of population genetic analysis of DNA typing results in problems of identification and determination of biological relatedness.
Let us consider the case in which the expert must determine whether the biological sample in question originated from individual X, i.e., whether event D (designated by the first letter of the word descent) took place. If the genotypes of this sample and individual X do not coincide, and the accuracy of sample collection and molecular biological analysis is not in doubt, then individual X could not have left this biological trace (we omit complicated cases requiring detailed analysis, i.e., when the sample is in fact a mixture of samples from different individuals). What if the genotypes coincide? In this case, the following alternatives (versions) are possible.
One version, advanced by the prosecution, is as follows: the suspect, who has genotype G at the loci examined, is the source of the given biological sample, which also has genotype G. Thus, the prosecution states that event D took place. According to the defense, this biological sample originated from another, unknown person, who also has genotype G. In other words, the defense states that the biological trace was left by an unknown person and hence the genotype of the biological sample and the genotype of the suspect are identical by state rather than by descent (event S). This unknown person belongs to a certain population, i.e., a certain group of people (e.g., an ethnic group). This group is called the reference group, because probability analysis of forensic data is conducted with reference to the frequency of individuals with the given genotype in this population.
Denote by pG the frequency of individuals with genotype G in the population. This frequency can be interpreted as the probability that the biological sample was left by another individual from the reference population. However, this interpretation is not quite logical, because pG is the probability of having genotype G on condition that the biological sample was left by an unknown person, i.e., here the probability refers to the genotype of the sample, and the condition, to the individual. In terms of conditional probabilities, this is written as Pr(G|S), where Pr stands for probability and the vertical line separates the event from the condition. The logical slip lies in the fact that the value 1 – pG, the complement to 1, would then have to give the probability that the trace was left by the suspect. However, the value 1 – Pr(G|S) is the proportion of persons with other genotypes in the reference population, which does not reflect the probability that the suspect is the source of the biological sample (i.e., this proportion is not connected with event D).
Actually, forensic examination requires that the probability be related to the individual, and the condition, to the genotype of the sample, because the biological sample and its genotype are an established fact (in the absence of errors, degradation, or mixing of biological material from several individuals), whereas the origin of the sample is still in doubt. To put it differently, the value to be estimated is Pr(S|G) (the probability that the sample was left by an unknown person from the reference group on condition that the sample has genotype G) rather than Pr(G|S). Here, the value 1 – Pr(S|G), the complement to 1, is Pr(D|G), i.e., the probability that the biological sample comes from the suspect. In this case, it is the sought probability of identification.
Prior and posterior probabilities. So, two probabilities, Pr(D) and Pr(D|G), should be determined. The first is the so-called prior probability, i.e., the probability, assumed preliminarily (before the DNA test is conducted), that the biological sample belongs to the suspect.
This probability should be based on the totality of the material evidence and witness testimony obtained before DNA analysis. The second is the posterior probability, or the probability of identification, based on all of the available data, including the DNA evidence. It is the identification probability that is the final numerical result of a forensic DNA investigation and that is given in the conclusions of the expert report. When the version that the sample was left by the suspect is accepted, Pr(S|G) = 1 – Pr(D|G) is interpreted as the probability of identification error, i.e., the probability that this version is erroneous.
The theory of conditional probabilities and Bayes' theorem give an important expression for DNA identification [2], from which the posterior probability is estimated:
$$\Pr(D \mid G) = \frac{LR \times \Pr(D)}{1 + \Pr(D) \times (LR - 1)}. \qquad (1)$$
The value LR = Pr(G|D)/Pr(G|S) is the so-called likelihood ratio, i.e., the ratio of the probability that the
biological sample has genotype G on condition that this sample is from the suspect, to the same probability, but on condition that the sample is from an unknown individual from the reference population. Approaches to obtaining a formula for LR in the general form are presented in [2, 17]. In the simplest case, considered here, Pr(G|D) = 1, Pr(G|S) = pG and, consequently, the likelihood ratio is 1/pG. It is clear that the less frequent the genotype, the more probable it is that the individual in question is the source of the biological sample; conversely, the more frequent the genotype, the lower this probability. Thus, the most informative genotypes for forensic purposes are rare ones, but because of their rarity, their population frequencies are difficult to estimate with high precision. This is one of the population-related problems of DNA identification.
In determining biological relatedness, calculation of LR proves to be more complicated because of genetic segregation, since here the gamete serves as the genetic evidence. For instance, in determining paternity, this is the gamete received by the child from his (her) biological parent: it is described by the set of “nonmaternal” alleles detected in the child (based on the data on the paternal and maternal genotypes in the case, when maternity is undisputed), and the probability of receiving this gamete from the putative father is determined by his genotype. Consider two variants, using one locus as an example. Let the child, the mother, and the putative father all be homozygous for the same allele A. Then the likelihood ratio is 1/pA, where pA is the frequency of allele A in the reference population. Now consider the following situation: according to newly discovered circumstances, another man, heterozygous for this allele, is suspected to be the father. Since he could have transmitted his alternative allele to the child, the likelihood ratio for him is halved, becoming 1/(2pA).
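Formula (1) and the simple likelihood ratios given so far can be checked numerically. The Python sketch below uses exact fractions; the function names and all numerical frequencies and priors are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def posterior(lr, prior):
    """Formula (1): Pr(D|G) = LR * Pr(D) / (1 + Pr(D) * (LR - 1))."""
    return lr * prior / (1 + prior * (lr - 1))


def lr_identification(p_genotype):
    """Simplest case: Pr(G|D) = 1 and Pr(G|S) = pG, so LR = 1/pG."""
    return 1 / p_genotype


def lr_paternity_one_allele(p_allele, man_is_homozygous):
    """Child and mother homozygous for allele A; the putative father carries A.

    LR = 1/pA for a homozygous man and 1/(2 pA) for a heterozygous one.
    """
    return 1 / p_allele if man_is_homozygous else 1 / (2 * p_allele)


# Hypothetical numbers.
p_g = Fraction(1, 1000)      # genotype frequency in the reference population
prior = Fraction(1, 10)      # prior probability Pr(D)

lr = lr_identification(p_g)
print(lr, posterior(lr, prior), float(posterior(lr, prior)))

p_a = Fraction(1, 20)        # allele frequency
print(lr_paternity_one_allele(p_a, man_is_homozygous=True))
print(lr_paternity_one_allele(p_a, man_is_homozygous=False))
```

Two properties of formula (1) are easy to confirm with this sketch: with LR = 1 the evidence is uninformative and the posterior equals the prior, and the posterior grows with LR for any fixed prior between 0 and 1.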
If all three of them were heterozygous for alleles A1 and A2, then LR = 1/(pA1 + pA2). Thus, the lower the allele frequency in the population, the more informative this allele is for identifying the biological father; conversely, the higher the allele frequency, the lower its discriminating ability. In paternity cases, the so-called paternity index (PI) is used as LR; this is described in detail in [2]. Estimation of the probability of paternity is considerably complicated by the presence of relatives [2]. Apart from the Bayesian approach to analysis of conditional probabilities, approaches in terms of unconditional probabilities are considered in forensic studies [18].
Threshold probability levels. Critical for the court decision is the choice of the threshold probability whose exceedance can be sufficient for accepting the expert conclusion on the individual origin of the biological sample or on biological relatedness [19, 20]. A detailed discussion of this point is beyond the scope of this paper, but the issue is so important that it should at least be touched upon. The problem is related
to the psychology of perceiving identification probabilities close to 1 and the complementary small error probabilities. For example, is an identification probability of 0.9999 low or high, or, which is the same, is an identification error of 0.0001 small or large? Giving lectures on DNA identification to different audiences (geneticists, lawyers, mathematicians), I have repeatedly seen that, at first glance, even a probability of 0.99 seems to an inexperienced person high enough to accept that the biological trace was left by the individual in question. However, various interpretations of this value (e.g., explaining that an identification error of 0.01 means that in one case in a hundred an erroneous sentence is passed) change this first impression. The perception of a probability level as high or low also greatly depends on whether the respondent relates it to himself, to his relatives and friends, or to an unknown person, let alone an enemy.
For instance, in the United States, the target probability in paternity testing, on which the number of loci examined depends, is generally not lower than 99.99%, while the probability of paternity attained in practice is usually higher. Note that the value of 99.99% pertains to examinations of disputed paternity conducted for civil law cases. For grave crimes, when a guilty verdict may mean a death sentence or life imprisonment, the practice of the United States criminal courts has held that the reliability of DNA identification must be high enough to ensure that the genotype analyzed would not be found even in a population a hundred times larger than the global population. This recommendation means that the prosecution should examine enough loci to ensure an identification error not exceeding one in a trillion (1 in 10^12), with the corresponding identification probability being not lower than 99.9999999999%.
Order no. 161 of the Ministry of Health of the Russian Federation of April 24, 2003, “On the Guidelines for Organizing and Performing Expert Investigations in the Forensic Bureau,” takes a different view of the probability value sufficient for making conclusions. This document recommends extremely low probability levels for examinations of the parentage of children. Namely, item 7.3.7 states: “The level of reliability of an expert investigation in the case of not excluding paternity must be as follows: for the complete threesome (mother–child–putative father), on condition that the identity of the other parent is undoubted, not lower than 99.90%… For a twosome (child–putative father) in the absence of the other parent, not lower than 99.75%.” However, this document does not explain why different probability levels are accepted depending on the presence or absence of data on the other parent, and these probability levels (99.90 and 99.75%) are not substantiated, even though they are recommended for both civil and criminal court procedures (see [20] for a detailed discussion).
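To make the threshold levels quoted in this section easier to compare, the following Python sketch converts each probability (in percent) into an identification error of the form "one case in N". It also rearranges formula (1) to give the likelihood ratio needed to reach a threshold for a given prior, LR = [P/(1 – P)] / [Pr(D)/(1 – Pr(D))]. The function names are hypothetical, and the prior used in the last lines is an arbitrary illustration, not a recommended value.

```python
from fractions import Fraction


def error_one_in(probability_percent: str) -> Fraction:
    """N such that the identification error 1 - P equals one case in N."""
    p = Fraction(probability_percent) / 100
    return 1 / (1 - p)


def required_lr(probability_percent: str, prior: Fraction) -> Fraction:
    """Likelihood ratio at which formula (1) yields exactly the given posterior."""
    p = Fraction(probability_percent) / 100
    return (p / (1 - p)) / (prior / (1 - prior))


# Threshold levels mentioned in the text, written as strings to keep them exact.
for level in ["99", "99.75", "99.90", "99.99", "99.9999999999"]:
    print(f"{level}%  ->  one error in {error_one_in(level)}")

# Arbitrary illustrative prior of 1/10.
print(required_lr("99.99", Fraction(1, 10)))
```

Exact fractions are used instead of floating-point numbers because values such as 99.9999999999% lose precision when 1 – P is computed in binary floating point.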
Key Parameters Determining the Probability Value
Prior probability. Evaluating the identification significance of genetic data and the corresponding probabilities in forensic studies involves some methodological and technical difficulties. The prior probability Pr(D) is the most indefinite value in formula (1). However, knowing it is crucial for estimating the identification probability Pr(D|G). Some guidelines recommend taking this probability equal to 0.5, in the mistaken belief that this gives equal chances of considering the person guilty or not guilty. In actuality, the unfounded assumption that the prior probability is 0.5 violates the presumption of innocence, because it means that a priori, i.e., before DNA analysis, the suspect is assumed to be 50% guilty without any evidence. The a priori fate of the suspect is thereby decided by, so to speak, tossing a coin and seeing which side lands up. Of course, if nongenetic evidence indicates with certainty that the person in question was involved in the case, the prior probability can exceed 0.5. For example, if blood traces were found under the nails of the victim, the suspect’s face is scratched, and witnesses saw the two together before the crime, the prior probability is high, although hard to express quantitatively. However, if there are several (say, four) suspects, without decisive indications for or against the participation of any of them in the crime, the prior probability for each of them cannot exceed 1/(1 + 4) = 0.2. When a person was detained on suspicion, but there is no significant evidence against him, the prior probability must be of the order of 1/(N + 1), where N is the size of the population cohort to which the true criminal belongs (information as to which cohort or ethnic group the true criminal belongs to might be contained in the available material or testimonial evidence).
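The effect of the prior on the identification probability can be seen by applying formula (1) to one and the same likelihood ratio under the three priors discussed above. In the Python sketch below, the LR value and the cohort size are hypothetical; with a prior of 1/(N + 1), formula (1) simplifies algebraically to LR/(N + LR).

```python
from fractions import Fraction


def posterior(lr, prior):
    """Formula (1): Pr(D|G) = LR * Pr(D) / (1 + Pr(D) * (LR - 1))."""
    return lr * prior / (1 + prior * (lr - 1))


def uninformed_prior(cohort_size: int) -> Fraction:
    """Prior of the order of 1/(N + 1) when no other evidence points to the suspect."""
    return Fraction(1, cohort_size + 1)


LR = 1_000_000                       # hypothetical likelihood ratio
N = 1_000_000                        # hypothetical cohort size

priors = {
    "prior 0.5": Fraction(1, 2),
    "four suspects, 1/(1 + 4)": Fraction(1, 5),
    "cohort, 1/(N + 1)": uninformed_prior(N),
}

for label, prior in priors.items():
    print(f"{label:>26}: Pr(D|G) = {float(posterior(LR, prior)):.6f}")
```

With these hypothetical numbers, the same DNA evidence gives a posterior very close to 1 under the first two priors but exactly 1/2 under the cohort prior, which illustrates why the choice of Pr(D) cannot be treated as a formality.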
If the determination of paternity is required in a civil or criminal case in which one party argues that the biological father of the child is man X, and the other, man Y, then the ratio of their prior probabilities may be determined by the frequency of sexual intercourse of each with the child’s mother in the period of conception, medical records of aspermia or other dysfunctions preventing fertilization, etc. The above examples demonstrate that the numerical value of the prior probability is often difficult to estimate and may be disputed. A satisfactory compromise seems to be to give in the expert decision not only the concrete LR value, chosen by the expert on the basis of all available data, but also the range of possible identification probabilities computed by formula (1) for different prior probabilities (e.g., from Pr(D) = 0.01 to Pr(D) = 0.99). This allows the members of the court to be more aware of the significance of the DNA investigation results.
Choosing the reference population. The choice of the reference population is crucial in estimating the identification probability, because the genotype frequency is required for calculating the likelihood ratio
LR. Indeed, it follows from expression (1) that the posterior probability depends on the allele frequencies, i.e., is directly determined by the group of people chosen as the reference population. Imagine that the suspect belongs to a certain population (which is therefore chosen as the reference), whereas the person who actually left the biological trace belongs to another population. In this case, the wrong choice of the reference population may significantly bias the LR estimate if the genotype frequency varies among populations. The reference population can be selected only when there is evidence assigning the person who left the biological trace to a particular population group.

The reference population is the population genetic unit of forensic studies. Its boundaries are defined on the basis of testimonial and other evidence identifying the group of people to which, according to the legal version of events, the individual who left the biological trace belongs. The reference population does not necessarily coincide with any of the groups studied in ethnology, anthropology, demography, or other human sciences. For instance, if the evidence indicates a particular town as the place of residence of the person who left the biological trace figuring in the court case, then the probability estimates should be based on the allele and genotype frequencies likely in this town. If additional evidence points to a certain ethnic group living in this town and having arrived from another region, the allele and genotype frequencies employed should relate to this group rather than to the town population as a whole. The parties involved in the case may hold different views on what the reference group is.
The problem is further exacerbated when the case involves many biological samples assigned to different reference populations, each belonging to a different racial or ethnic group from various world regions. Clearly, no real population genetic database can completely meet the requirements of all forensic DNA investigations; practically every DNA examination will encounter gaps in the reference databases on allele and genotype frequencies.

One possible solution to the problem of reference DNA data is as follows. First, population genetic data on the DNA markers used in forensic studies should be accumulated, taking into account the ethnic (geographic) origin of the populations. The “population unit” of such a study may be rather large, for example, a whole ethnic group in a given geographic region. This is justified by the fact that groups of people of different origin can have different genetic compositions because of their different evolutionary histories [21, 22]. Second, the maximum degree to which different groups within the population (e.g., within the ethnic group) can differ genetically in allele and genotype frequencies should be determined. (Work on such fine-scale genetic differentiation of populations may be conducted selectively, on the basis of data on the demography and history of the regions examined.) Then, genetic differentiation should be
assessed and quantitatively expressed as an index (e.g., FST, see [23]). Finally, the allele and genotype frequencies of the population (ethnic/geographic group) retrieved from a database should be used as reference estimates, with an obligatory FST-based correction. This correction replaces the allele and genotype frequencies that are unknown for the reference population with values calculated to account for the level of differentiation measured by FST (see [2, 17]). These calculated values indicate the maximum range over which the reference population identified for the given legal case may differ from the empirical allele and genotype frequencies in the database. The probability values given in the expert decision must conform to the principle of presumption of innocence: for example, in criminal cases they should be the minimum probability of identification or relatedness possible within the limits of the FST-based correction. However, for the sake of objectivity, the expert decision should present the whole range of possible probabilities (from the minimum to the maximum) based on the calculated allele and genotype frequencies.

The need to know the level of genetic differentiation follows from the available data on human genetics, which indicate that differences between populations can be large [24], including populations of Russia [25]. This is all the more important because the population of Russia comprises diverse ethnic groups, whose different evolutionary histories could have resulted in genetic differentiation both among and within the groups. Among-population differences can lead to significant errors in the probabilities when the reference population is chosen incorrectly, or when allele frequency data for the reference population are absent and data from another population are substituted.
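The text does not state which FST-based correction it has in mind. As a hedged illustration only, the Python sketch below uses one widely used form, the Balding–Nichols conditional genotype frequency, in which theta plays the role of FST; the choice of this formula, the function names, the allele frequency, and the theta values are all assumptions made for the example.

```python
from typing import List, Optional, Tuple


def corrected_genotype_frequency(
    p_a: float, p_b: Optional[float] = None, theta: float = 0.0
) -> float:
    """Genotype frequency at one locus with a theta (FST) correction.

    Assumed form (Balding-Nichols conditional frequencies):
      homozygote   A/A: [2t + (1-t)p_a][3t + (1-t)p_a] / [(1+t)(1+2t)]
      heterozygote A/B: 2[t + (1-t)p_a][t + (1-t)p_b] / [(1+t)(1+2t)]
    Pass p_b=None for a homozygote. With theta = 0 this reduces to p^2 and 2pq.
    """
    if not 0.0 <= theta < 1.0:
        raise ValueError("theta must lie in [0, 1)")
    denom = (1.0 + theta) * (1.0 + 2.0 * theta)
    if p_b is None:  # homozygous genotype
        return (2 * theta + (1 - theta) * p_a) * (3 * theta + (1 - theta) * p_a) / denom
    return 2 * (theta + (1 - theta) * p_a) * (theta + (1 - theta) * p_b) / denom


def frequency_table(
    p_a: float, p_b: Optional[float], thetas: List[float]
) -> List[Tuple[float, float]]:
    """Return (theta, corrected frequency) pairs so the whole range can be reported."""
    return [(t, corrected_genotype_frequency(p_a, p_b, t)) for t in thetas]


if __name__ == "__main__":
    # Hypothetical homozygote with database allele frequency 0.1.
    for theta, freq in frequency_table(0.1, None, [0.0, 0.01, 0.03]):
        print(f"theta = {theta:.2f}: frequency = {freq:.5f}, 1/frequency = {1 / freq:.1f}")
```

In this example the corrected frequency grows with theta, so the evidential weight against the suspect shrinks; reporting the values over the whole range of theta corresponds to presenting both the minimum and the maximum probabilities.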
For groups inhabiting the same territory but considerably isolated genetically in past generations, the differences at DNA markers may be large and strongly affect the forensic estimates [26]. In particular, it was shown for two ethnic groups that the conclusion reversed depending on which of them was taken as the reference population [27].

It is important to know the extent of genetic differences among populations not only in regions long inhabited by members of different ethnic groups, but also in large cities with high immigration, where substantial migration and ethnic admixture may complicate the choice of the reference population for DNA identification. For instance, in many large cities migrants live in diasporas, but genetic differences may be observed even within a single ethnic group; for example, the Russian population of Moscow is heterogeneous owing to the constant inflow of migrants from different regions of the Russian Federation [28]. The fact that children of mixed marriages often identify with a particular ethnic group, depending on the region of residence, further complicates the choice of the reference population. For example, parents residing in Moscow, one of whom
is ethnic Russian, often consider their children Russian [28]; as a result, the allele frequencies in the Russian part of the Moscow population may differ from the corresponding frequencies in Russians arriving from other regions, which, in turn, may differ from one another and from those of other ethnic (regional) groups.

Genotypic structure of the reference population. The likelihood ratio LR is the parameter that incorporates population genetic data on the reference population and determines the probability value. It shows how many times the probability that the given biological sample originates from X exceeds the probability that its source is an unknown person from the reference population. For identification purposes, the LR value depends on the frequency of genotype G in the reference population, pG. Current databases include, as a rule, only allele frequencies, because with numerous loci, each having many alleles, any particular multilocus genotype occurs in the samples virtually only once. Consequently, the frequency of the observed genotype G in the reference population is deduced from the allele frequencies.
In doing so, two assumptions about the genetic structure of the reference population are necessarily made; namely, the reference population is assumed to be in genetic equilibrium both at individual loci and across loci: (1) Hardy–Weinberg proportions hold at each locus (this assumption allows the genotype frequency to be estimated at each of the examined loci constituting the genotype G: for a homozygote, as the squared allele frequency (see the modified homozygote frequencies in [1, 2]); for a heterozygote, as twice the product of the two allele frequencies); (2) genotype frequencies at the examined loci are statistically independent of one another (this allows the frequency of genotype G to be estimated as the product of the genotype frequencies at the individual loci).

In the great majority of forensic genetic investigations, these assumptions are tacitly taken to be true. However, this is not always so. Rather than being taken for granted, these assumptions should be tested statistically [29]. Moreover, the choice of loci in each DNA case depends on these assumptions. For instance, linked loci should not be chosen, which is especially evident in disputed paternity cases. Indeed, if loci are tightly linked, they are transmitted from father to child together, except in rare cases of recombination. Thus, in this example the information from the linked loci is not independent, and assumption (2) does not hold. For the same reason, investigations of Y-chromosomal loci, which are completely linked to one another, are based on the frequency of the haplotype, i.e., the set of alleles at the loci examined; in essence, they address the genotype (in this case, haploid genotype) frequency directly.
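Assumptions (1) and (2) translate directly into a calculation. The Python sketch below is a minimal illustration, valid only for unlinked autosomal loci and only if both assumptions hold; the locus names, allele labels, and frequencies are hypothetical, and taking LR = 1/pG for simple identification is likewise an assumption of this example.

```python
from math import prod
from typing import Dict, Tuple

AlleleFreqs = Dict[str, Dict[str, float]]  # locus -> allele -> frequency
Genotype = Dict[str, Tuple[str, str]]      # locus -> (allele 1, allele 2)


def locus_genotype_frequency(a1: str, a2: str, freqs: Dict[str, float]) -> float:
    """Assumption (1), Hardy-Weinberg: p^2 for a homozygote, 2pq for a heterozygote."""
    if a1 == a2:
        return freqs[a1] ** 2
    return 2.0 * freqs[a1] * freqs[a2]


def profile_frequency(genotype: Genotype, allele_freqs: AlleleFreqs) -> float:
    """Assumption (2), independence of loci: multiply the per-locus frequencies."""
    return prod(
        locus_genotype_frequency(a1, a2, allele_freqs[locus])
        for locus, (a1, a2) in genotype.items()
    )


def likelihood_ratio(p_g: float) -> float:
    """LR for simple identification, taken here as 1/pG (an assumption of this sketch)."""
    return 1.0 / p_g


if __name__ == "__main__":
    # Hypothetical reference-population allele frequencies (illustration only).
    allele_freqs = {
        "locus_1": {"a": 0.10, "b": 0.20},
        "locus_2": {"c": 0.30},
    }
    genotype = {"locus_1": ("a", "b"), "locus_2": ("c", "c")}
    p_g = profile_frequency(genotype, allele_freqs)
    print(f"pG = {p_g:.4f}, LR = {likelihood_ratio(p_g):.1f}")
```

For completely linked loci, such as those on the Y chromosome, the product over loci is not appropriate, and the haplotype frequency itself would take the place of pG.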
Thus, the probability value reported by forensic genetic studies depends not only on the allele frequencies but also on the overall composition of genotypes at the analyzed loci in the reference population. Knowledge of the genotypic composition of the reference populations is decisive in computing and formulating the conclusions of an expert investigation. Assuming genetic equilibrium when it is lacking may significantly bias the probability estimates. Therefore, databases should include not only allele frequencies in population samples but also individual-level and population data for all loci studied, so that the expert can make his own assessment of the allele frequencies in the population chosen as the reference, test it for genetic equilibrium, and determine the effects of interpopulation differences and deviations from equilibrium on the values of forensic parameters.

ACKNOWLEDGMENTS

This study was supported by the programs of the Presidium of the Russian Academy of Sciences Fundamental Science for Medicine (theme no. 5, Molecular Polymorphism in Humans) and Gene Pool Dynamics, and by the Russian Foundation for Basic Research (grant no. 04-04-48639).
REFERENCES

1. National Research Council, The Evaluation of Forensic DNA Evidence, Washington, DC: Natl. Acad., 1996, p. 254.
2. Evett, I.W. and Weir, B.S., Interpreting DNA Evidence, Sunderland, Massachusetts: Sinauer Associates, Inc., 1998, p. 278.
3. Weber, J.L. and Broman, K.W., Genotyping for Human Whole-Genome Scans: Past, Present, and Future, Adv. Genet., 2001, vol. 42, pp. 77–96.
4. Carracedo, A., Bar, W., Lincoln, P., et al., DNA Commission of the International Society for Forensic Genetics: Guidelines for Mitochondrial DNA Typing, Forensic Sci. Int., 2000, vol. 110, pp. 79–85.
5. Zhivotovsky, L.A., DNA Identification in Forensic Medical Examination: Should It be Unconditionally Accepted by the Court? Aktual’nye voprosy identifikatsii lichnosti (Proc. Conf. on the Problems of Personal Identification), St. Petersburg, 1999, pp. 75–76.
6. Zhivotovsky, L.A., Critical Comments to “Practical Guidelines” of P.L. Ivanov, “The Use of Individualizing Systems Based on the Amplified DNA Fragments Length Polymorphism (AFLP) in Forensic Medical Personal Identification and Kinship Tests”, Sib. Medits. J., 2002, no. 2, pp. 85–86.
7. Zhivotovsky, L.A., Mitochondrial DNA in Forensic Medical Investigations, Medits. Genet., 2003, vol. 2, no. 3, pp. 106–114.
8. Zhivotovsky, L.A., DNA Markers in Forensic Medical Examination: How the Choice of Reference Population Affects the Probabilistic Observations, in Kriminalisticheskie sredstva i metody v raskrytii i rassledovanii prestuplenii (Criminalistic Tools and Methods of Crime Disclosing and Investigation), Moscow: EKTs MVD RF, 2004, vol. 3, pp. 12–15.
9. Perepechina, I.O., DNA Typing upon the Expert Examination of Biological Objects, in Veshchestvennye dokazatel’stva. Informatsionnye tekhnologii protsessual’nogo dokazyvaniya (Material Evidence. Informational Technologies of Processing Proof), Koldin, V.Ya., Ed., Moscow, 2002, chapter XI, pp. 521–564.
10. Ivanov, P.L., The Use of Individualizing Systems Based on the Amplified DNA Fragments Length Polymorphism (AFLP) in Forensic Medical Personal Identification and Kinship Tests (Practical Guidelines), Sudebno-Medits. Ekspertiza, 1999, no. 5, pp. 35–41.
11. Ivanov, P.L., Metodicheskie ukazaniya. Primenenie molekulyarno-geneticheskoi individualiziruyushchei sistemy na osnove polimorfizma nukleotidnykh posledovatel’nostei mitokhondrial’noi DNK v sudebno-meditsinskoi ekspertize identifikatsii lichnosti i ustanovleniya biologicheskogo rodstva (Practical Guidelines. The Use of Molecular Genetic Individualizing System Based on the Polymorphism of Mitochondrial DNA Sequences in Forensic Medical Examination, Personal Identification, and Kinship Tests), Moscow: Ros. Tsentr Sudebno-Medits. Ekspertizy MZ RF, 2001, p. 16.
12. Zhivotovsky, L.A., Microsatellite Variability in Human Populations and the Methods of Its Analysis, Vestn. VOGi, 2006, vol. 10, no. 1, pp. 74–96.
13. Kayser, M., Caglia, A., Corach, D., et al., Evaluation of Y-Chromosomal STRs: A Multicenter Study, Int. J. Legal Med., 1997, vol. 110, pp. 125–133.
14. Bendall, K.E., Macaulay, V.A., and Sykes, B.C., Variable Levels of a Heteroplasmic Point Mutation in Individual Hair Roots, Am. J. Hum. Genet., 1997, vol. 61, pp. 1303–1308.
15. Zhivotovsky, L.A., DNA in the Court, Khimiya i Zhizn’, 2001, no. 12, pp. 23–27.
16. Cadenas, A.M., Regueiro, M., Gayden, T., et al., Male Amelogenin Dropouts: Origins and Implications, Forensic Sci. Int., 2006 (in press).
17. Human Identification: The Use of DNA Markers, Weir, B., Ed., London: Kluwer, 1995.
18. Perepechina, I.O. and Grishechkin, S.A., Veroyatnostnye raschety v DNK-daktiloskopii (Metodicheskie rekomendatsii) (Probability Calculations in DNA Fingerprinting (Practical Guidelines)), Moscow: Ekspertno-kriminalisticheskii tsentr MVD RF, 1996, p. 14.
19. Perepechina, I.O., DNA Typing in Forensic Medical Examination of Material Evidences: The Problem of Individualization, Sudebno-Medits. Ekspertiza, 2002, no. 4, pp. 29–35.
20. Perepechina, I.O. and Zhivotovsky, L.A., Estimating the Identification Significance of Genetic Data in Forensic Determination of Paternity (Maternity), in Kriminalisticheskie sredstva i metody v raskrytii i rassledovanii prestuplenii (Criminalistic Tools and Methods of Crime Disclosing and Investigation), Moscow: EKTs MVD RF, 2004, vol. 3, pp. 23–26.
21. Rosenberg, N.A., Pritchard, J.K., Weber, J.L., et al., Genetic Structure of Human Populations, Science, 2002, vol. 298, pp. 2381–2385.
22. Zhivotovsky, L.A., Rosenberg, N.A., and Feldman, M.W., Features of Evolution and Expansion of Modern Humans Inferred from Genome-Wide Microsatellite Markers, Am. J. Hum. Genet., 2003, vol. 72, pp. 1171–1186.
23. Weir, B., Analiz geneticheskikh dannykh (Genetic Data Analysis), Moscow: Mir, 1995.
24. Cavalli-Sforza, L.L., Menozzi, P., and Piazza, A., The History and Geography of Human Genes, Princeton, N.J.: Princ. Univ. Press, 1994, p. 413.
25. Gene Pool and Gene Geography of the Populations, in Genofond naseleniya Rossii i sopredel’nykh stran (Gene Pool of the Population of Russia and the Neighboring Countries), Rychkov, Yu.G., Ed., St. Petersburg: Nauka, 2000, vol. 1, p. 611.
26. Zhivotovsky, L.A., Ahmed, S., Wang, W., and Bittles, A.H., The Forensic DNA Implications of Genetic Differentiation Between Endogamous Communities, Forensic Sci. Int., 2001, vol. 119, pp. 269–272.
27. Zhivotovsky, L.A. and Khusnutdinova, E.K., Reference Population and Forensic Medical DNA Typing: Inter- and Intraethnic Differences over the DNA Markers and Evaluation of the Identification Probability, Medits. Genetika, 2003, vol. 2, no. 5, pp. 201–206.
28. Kurbatova, O.L., Ethnic Demographic Processes and Ecological Situation in Moscow in the Light of the Problem of the Genetic Security of the Population, in Bezopasnost’ Rossii: Pravovye, sotsial’no-ekonomicheskie i nauchno-tekhnicheskie aspekty. Bezopasnost’ i ustoichivoe razvitie krupnykh gorodov (Security of Russia: Legal, Social Economic and Scientific Technical Aspects. Security and Steady Development of Large Cities), Moscow: Znanie, 1998, pp. 311–335.
29. Zaykin, D., Zhivotovsky, L., and Weir, B.S., Exact Tests for Association between Alleles at Arbitrary Number of Loci, in Human Identification: The Use of DNA Markers, Weir, B., Ed., London: Kluwer, 1995, pp. 169–178.