Probabilities Associated with DNA Profiling

 

This talk was motivated by two columns by Keith Devlin which appear on the Mathematical Association of America (MAA) website.

 

Statisticians not wanted

 

http://www.maa.org/devlin/devlin_09_06.html

 

Damned lies

 

http://www.maa.org/devlin/devlin_10_06.html

 

In these columns he indicates there are some serious problems with calculating probabilities associated with DNA profiling.  By “DNA profiling” we mean an attempt to identify a person by matching his or her DNA to that of biological evidence left at a crime scene.

 

Recall that a DNA molecule consists of two strands twisted around each other in the familiar double helix pattern.  The strands are joined together by pairs of bases (aka nucleotides) which are of four types: adenine (A), guanine (G), cytosine (C), and thymine (T).  Each pair is of the form AT or GC, so knowing one side is enough to deduce the other.  So we can think of DNA as a sequence of the four letters A, C, G, and T.

 

Mind you, this sequence is a bit long: there are about 3 billion base pairs in the human genome.  Fortunately the DNA is arranged into large bodies called chromosomes, and as is well known, humans have 23 pairs of chromosomes.  A gene is a locus on a chromosome.  A gene may occur in different versions known as alleles.  The two chromosomes of a pair have the same loci along their length, but may have different alleles at some of the loci.  A genotype is a pair of alleles at one locus.  If a gene is observed to occur in n different alleles, then there are $n(n+1)/2$ possible genotypes at that locus: n pairs in which both alleles are the same, plus $n(n-1)/2$ pairs of two different alleles.  (For example, with 3 alleles A, B, C we could have the pairs AA, AB, AC, BB, BC, CC.)
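The count of genotypes can be checked by listing them.  The following is a minimal sketch in Python; the function name `genotypes` is just an illustrative choice, and alleles are represented by single letters as in the example above.

```python
from itertools import combinations_with_replacement


def genotypes(alleles):
    """Return every unordered pair of alleles (genotype) at one locus."""
    return ["".join(pair) for pair in combinations_with_replacement(alleles, 2)]


pairs = genotypes("ABC")
print(pairs)

# With n alleles the list should contain n(n+1)/2 genotypes.
n = 3
print(len(pairs) == n * (n + 1) // 2)
```

Unordered pairs are used because a genotype does not record which chromosome of the pair carries which allele.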

 

The FBI established the Combined DNA Index System (known by the acronym CODIS) in November 1997.  This system features the 13 loci CSF1PO, FGA, TH01, TPOX, VWA, D3S1358, D5S818, D7S820, D8S1179, D13S317, D16S539, D18S51, D21S11.  (They also test the amelogenin gene for sex determination.)  European agencies use a standard of 10 loci, of which 8 are part of the CODIS system (Butler online reference).

 

Before we get to the main topics recall that if $E_1, E_2, \dots, E_n$ are mutually independent events, then $P(E_1 \cap E_2 \cap \cdots \cap E_n) = P(E_1) P(E_2) \cdots P(E_n)$.  Much work centers around determining just when we have independence.

 

The distribution of alleles at a given locus is assumed to be in Hardy-Weinberg equilibrium.  This implies that if a fraction p of the population has a certain allele, then the probability a person has two of these in a pair is $p^2$, while if two alleles occur in fractions p and q of the population, then the probability a person has one of each allele in a pair is $2pq$.  Here is a hypothetical numerical example.

 

allele        a      b      c
frequency    .3     .5     .2

pair         aa     ab     ac     bb     bc     cc
freq.       .09    .30    .12    .25    .20    .04

 

If there are n alleles occurring with probabilities $p_1, p_2, \dots, p_n$, then the probability of having any one pair is the corresponding term in the expansion of $(p_1 + p_2 + \cdots + p_n)^2$ after like terms are collected: $p_i^2$ for two copies of allele i, and $2 p_i p_j$ for one copy each of alleles i and j.  Empirical results have shown that each of the 13 loci used in the CODIS system is in Hardy-Weinberg equilibrium.  However, we need to restrict to subgroups to obtain equilibrium (Caucasian, African-American, west coast Hispanic, east coast Hispanic, etc.).  Also, a locus which is in Hardy-Weinberg equilibrium will not have its distribution of alleles change over time.  Thus we do not have to worry about the age of a suspect while doing DNA profiling.
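The genotype frequencies in the hypothetical example can be reproduced with a few lines of Python.  This is only a sketch under the Hardy-Weinberg assumption; the function name `hardy_weinberg` and the dictionary layout are illustrative choices, not part of any forensic software.

```python
from itertools import combinations_with_replacement


def hardy_weinberg(allele_freqs):
    """Map each genotype to its expected frequency under Hardy-Weinberg equilibrium."""
    result = {}
    for x, y in combinations_with_replacement(sorted(allele_freqs), 2):
        p, q = allele_freqs[x], allele_freqs[y]
        # Two copies of the same allele: p^2.  One copy of each: 2pq.
        result[x + y] = p * p if x == y else 2 * p * q
    return result


table = hardy_weinberg({"a": 0.3, "b": 0.5, "c": 0.2})
for genotype, freq in table.items():
    print(genotype, round(freq, 2))

# The genotype frequencies should add up to 1.
print(round(sum(table.values()), 10))
```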

 

The probability of identity PI of a locus is the probability that two individuals share the same genotype at that locus, and equals the sum of the squares of the genotype frequencies.  In the above example we have $PI = .09^2 + .30^2 + .12^2 + .25^2 + .20^2 + .04^2 = .2166$.  The power of discrimination PD equals 1 - PI, and is the probability two individuals have different genotypes at that locus.  In general we prefer loci with many different alleles that occur with roughly equal probabilities in a system designed to distinguish between the DNA of different people.  That is, we want all alleles at the locus to have small probability of occurring.
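As a check, PI and PD can be computed directly from the definition: two individuals drawn independently share a genotype with probability equal to the sum of the squared genotype frequencies.  The sketch below uses the hypothetical frequencies from the table above, and also tries three equally likely alleles (an added comparison, not from the original example) to illustrate why roughly equal allele probabilities are preferred.

```python
def probability_of_identity(genotype_freqs):
    """Chance that two independently drawn individuals share a genotype."""
    return sum(f * f for f in genotype_freqs)


# Genotype frequencies aa, ab, ac, bb, bc, cc from the hypothetical example.
example = [0.09, 0.30, 0.12, 0.25, 0.20, 0.04]
pi = probability_of_identity(example)
pd = 1 - pi
print(round(pi, 4), round(pd, 4))

# Comparison (an assumption added here): three alleles with frequency 1/3 each,
# giving three homozygotes at 1/9 and three heterozygotes at 2/9.
equal = [1 / 9] * 3 + [2 / 9] * 3
pi_equal = probability_of_identity(equal)
print(pi_equal < pi)
```

With equal allele frequencies the probability of identity drops from about .217 to about .185, so the locus discriminates slightly better.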

 

Authorities use the CODIS system to test for a match between the DNA of a suspect and the DNA of biological evidence left at a crime scene.  The idea is that the probability of a match at any one locus should be fairly small, say .1.  Thus the probability of two different people matching DNA at all 13 loci is $10^{-13}$, or virtually zero, assuming that matches at different loci occur independently.  The practitioners of DNA profiling routinely cite match probabilities of $10^{-15}$ or so.  Since there are fewer than $10^{10}$ people in the world, there is only a very slight chance that two people match DNA at all 13 loci, if these probabilities are correct.
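The arithmetic behind the product rule is short enough to sketch.  The per-locus value .1 and the population bound of $10^{10}$ come from the paragraph above; the function name is an illustrative choice, and independence across loci is assumed throughout.

```python
import math


def random_match_probability(per_locus_match, loci):
    """Product rule: multiply per-locus match probabilities, assuming independence."""
    return per_locus_match ** loci


rmp = random_match_probability(0.1, 13)

# Expected number of other people in the world sharing one given profile,
# using the upper bound of 10^10 people.
world_population = 1e10
expected_other_matches = world_population * rmp

print(rmp, expected_other_matches)
print(math.isclose(expected_other_matches, 1e-3))
```

Note that this counts expected matches to one fixed profile; counting matching pairs within a whole database is a different calculation.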

 

It was actually another problem which aroused Devlin’s interest in the subject of DNA profiling, but he is quite clearly concerned about the assumption of independence in DNA profiling.  The 13 loci used in the CODIS system are located on 12 chromosomes, and the two loci which are on the same chromosome are separated by approximately 26.3 Mb (megabases; one megabase is one million base pairs; Butler online reference).  So independence is not out of the question.  However, there is some dependence, as shown by the fact that statistics are computed for different racial/ethnic groups.  For example, we usually partition the U.S. into African-American, Caucasian, Hispanic, and Asian categories, mainly to preserve Hardy-Weinberg equilibrium.  Butler even notes that, for example, one may have to partition Hispanic into East Coast and West Coast subpopulations to achieve Hardy-Weinberg equilibrium (Butler book, pages 161 and 162).  Empirical studies to determine allele distribution usually include about 100 to 150 subjects per category (Butler, online reference, page 16).  For example, in the Budowle et al. reference concerning the distribution of alleles for the two loci used in Europe but not the U.S., a total of 604 subjects were used to determine the distribution among 5 subgroups.  (Sample sizes ranged from 71 to 167.)

 

Devlin knows of only one instance of someone investigating a “large” DNA database to see how many matches occur.  This involved the CODIS database maintained by the state of Arizona.  The results were not encouraging.  In a database of 65,493 samples there were matches at 12 loci and at 11 loci (in each case between siblings), 20 matches at 10 loci, and 120 matches at 9 loci.  (The legal document of Sandler and Mercer says the chance of a nine-locus match among Caucasians was supposed to be 1 in 754,100,000; see their online reference.  Also see the testimony of Katherine Troyer.  In this legal document figures are stated of 1 in 750,000,000 for a Caucasian match and 1 in 560,000,000 for an African-American match.  The original nine-locus match was between a Caucasian and an African-American.)  This suggests that in a database containing several million entries (such as the FBI national database) there could easily be matches occurring at all 13 loci.  (In an article available on the web, Nick Paton Walsh reported that British forensic scientists admitted at a private conference they had found two people whose DNA profiles matched.)  Devlin raises the interesting question of why a check of the national database (or a collection of smaller databases) has never been done.  In any event the claims of extremely small probabilities of matches seem questionable.

 

Authorities do take into account lack of independence in the population as a whole.  It is certainly recognized that identical twins will match DNA profiles, and other close relatives might match as well.  The citing of probabilities for separate subpopulations (Caucasians, African-Americans, Filipinos, etc.) also indicates a general lack of independence.  In some cases investigators use “theta corrections” to obtain a probability of a match when the suspect and crime scene DNA both come from a well-defined small subgroup (see pages 29 and 30 of the 1996 National Research Council report).

 

By the way, it was extremely difficult to track down the exact numbers in the Arizona CODIS database study, partly because there has never been a formal, thoroughly reviewed, published study.  It really started with a clerk, Katherine Troyer, whose curiosity was aroused when a nine-locus match occurred.  She then sought out other matches in her spare time.  The best source is the online reference of Brenner, which appeared very recently in response to Devlin’s columns.

 

The question of independence is not what first prompted Devlin’s interest in DNA profiling.  His articles originated with the difference between a “hot hit” and a “cold hit”.  A hot hit refers to a DNA match with someone for whom other evidence indicates a connection to the crime.  Even if DNA matches truly occur with probabilities as high as $10^{-6}$ or $10^{-7}$, a DNA match coupled with other evidence of involvement pretty clearly indicates guilt.

 

A cold hit refers to a case in which the profile of DNA left at a crime scene is checked against profiles in some database and a match occurs.  Legal authorities have struggled for quite some time to interpret the chances of this occurring.  On two occasions the FBI has asked the National Research Council to study the issue and provide a report.  In the first report the National Research Council advised using additional loci to determine whether a true match has occurred.  This turned out not to always be practical, so in the second report the National Research Council gave a second suggestion of calculating a “database match probability” by multiplying the random match probability by the size of the database being searched.  For example, assume a random match probability of $10^{-15}$ and a database size of $10^{6}$.  Then the database match probability is $10^{-9}$.  By the way, frequently only partial samples are available due to degradation of DNA evidence.  So the random match probability might be, say, $10^{-7}$ and the database size $10^{5}$, which makes for a database match probability of .01.  This is strong but not necessarily exclusionary evidence of a match.  Some statisticians have disagreed with the National Research Council reports, and have advocated using the random match probability, or a calculation very close to it numerically (usually very small).  Due to the lack of certainty there has been a court case in California in which a panel of judges has issued a ruling that statistical considerations are not to be considered in this affair, that it is strictly a legal matter.  (They also appear to believe there is no difference between a hot hit and a cold hit, and thus place great emphasis on the low probability of a match.)  This is what raised Devlin’s concerns in the first place.  He believes that statisticians and geneticists need to decide the issue, and that current legal rulings should take their testimony into account.
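The two numerical examples of the database match probability can be reproduced as follows.  This is a sketch only: the first function is the multiplication described above, while the second function is an added comparison (not taken from the National Research Council report) giving the probability of at least one coincidental match among N unrelated profiles, assuming the profiles match independently.

```python
import math


def database_match_probability(random_match_probability, database_size):
    """Random match probability multiplied by the size of the database searched."""
    return random_match_probability * database_size


def prob_at_least_one_match(random_match_probability, database_size):
    """1 - (1 - p)^N, evaluated in a way that stays accurate for very small p."""
    return -math.expm1(database_size * math.log1p(-random_match_probability))


for p, n in [(1e-15, 10**6), (1e-7, 10**5)]:
    print(p, n, database_match_probability(p, n), prob_at_least_one_match(p, n))
```

The product p times N is never smaller than 1 - (1 - p)^N, and the two are nearly equal whenever p times N is small, as in both examples here.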

 

I believe that initially legal authorities assumed independence of matching at different loci so they could use the product rule to simplify calculations of probabilities.  At that time there were no large databases with which to check empirically for independence.  (The 1996 NRC report mentions two studies, but the largest still involved fewer than 11,000 people.)  I think now most people are hoping that DNA profile kits will be expanded to include so many loci that DNA profiles do become unique.  Consider this statement from the 1996 report of the National Research Council (page 8).

 

“The committee has not attempted to define a specific probability that corresponds to uniqueness, but the report outlines a framework for considering the issue in terms of probabilities, and it urges that research into new and cumulatively more powerful systems continue until a clear consensus emerges that DNA profiles, like dermal fingerprints, are unique.”

 

Postscript: The article by Brenner implies there is nothing very unusual about those hits in the Arizona CODIS database.  He mentions that we need to concentrate on pairs of entries in the database.  So the expected number of 9-loci matches (assuming his probability of 1/13.66 for a match at any one locus, and counting pairs that match at exactly 9 of the 13 loci) equals $\binom{65493}{2} \binom{13}{9} (1/13.66)^{9} (1 - 1/13.66)^{4} \approx 68$.
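The pair-counting idea can be sketched numerically.  The code below assumes that matches at the 13 loci are independent with the same probability 1/13.66 at each locus, and it reads a “9-loci match” as a match at exactly 9 of the 13 loci; both are assumptions made for this illustration, and the function name is hypothetical.

```python
from math import comb


def expected_matching_pairs(database_size, loci, matching_loci, per_locus_match):
    """Expected number of pairs of profiles matching at exactly `matching_loci` loci."""
    pairs = comb(database_size, 2)  # number of distinct pairs of entries
    # Binomial probability that one pair matches at exactly `matching_loci` loci.
    p_pair = (
        comb(loci, matching_loci)
        * per_locus_match ** matching_loci
        * (1 - per_locus_match) ** (loci - matching_loci)
    )
    return pairs * p_pair


expected_nine = expected_matching_pairs(65_493, 13, 9, 1 / 13.66)
print(round(expected_nine))
```

Under these assumptions the expected count comes out at roughly 68, which is the same order of magnitude as the number of 9-loci matches observed in the Arizona database.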

 

But this is still a bit troubling to me.  Suppose in the near future everyone’s DNA profile is in a national database.  Let us assume the population of the U.S. is 350,000,000 at that time (perhaps 2020?).  Then the expected number of full matches is $\binom{350{,}000{,}000}{2} (1/13.66)^{13} \approx 106$.  If we have profiles on several more loci perhaps then full matches will be rare.
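The same pair-counting calculation extends to the hypothetical national database.  The sketch below assumes independent loci, each with match probability 1/13.66, and it further assumes (purely for illustration) that any additional loci would have that same per-locus probability.  The function names are hypothetical.

```python
from math import comb


def expected_full_matches(population, loci, per_locus_match):
    """Expected number of pairs of people matching at every one of `loci` loci."""
    return comb(population, 2) * per_locus_match ** loci


def loci_needed(population, per_locus_match, threshold=1.0):
    """Smallest number of loci for which the expected full matches fall below threshold."""
    loci = 1
    while expected_full_matches(population, loci, per_locus_match) >= threshold:
        loci += 1
    return loci


print(round(expected_full_matches(350_000_000, 13, 1 / 13.66)))
print(loci_needed(350_000_000, 1 / 13.66))
```

Under these assumptions, 13 loci give an expected count of roughly 106 fully matching pairs, and 15 loci would be enough to bring the expected count below one.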

 

References

 

Butler, John M., Forensic DNA Typing,  RA1057.55.B88 2001

 

Kobilinsky, Lawrence, Liotti, Thomas, Oeser-Sweat, Jamel, DNA, Forensic and Legal Applications, KF9666.5.K63 2005

 

National Research Council, The Evaluation of Forensic DNA Evidence, National Academy Press, 1996

 

Rudin, Norah, Inman, Keith, An Introduction to Forensic DNA Analysis (second edition), RA1057.55.I65 2002

 

Butler, John M., Genetics and Genomics of Core STR Loci Used in Human Identity Testing, http://www.cstl.nist.gov/div831/strbase/pub_pres/Butler_coreSTRloci_JFS_Mar2006.pdf

 

Budowle, B. et al., Population Data on the STR Loci D2S1338 and D19S433, Forensic Science Communications, July 2001  (Tables are available online at http://www.fbi.gov/hq/lab/fsc/backissu/july2001/budtabs.htm.)

 

Sandler & Mercer, P.C., Supreme Court document at

http://www.sandlermercer.com/dna-certiorari-diffendal.html

 

Walsh, Nick Paton, False result fear over DNA tests, Guardian Unlimited Observer, January 27, 2002

http://www.portia.org/chapter10/DNAmis.html

 

Transcript of Katherine Troyer concerning Arizona CODIS database matches, http://www.nlada.org/Defender/forensics/for_lib/Index/DNA/DNA%20Database%20Issues/Arizona%20CODIS%20Match%20Information/advanced_search_form

 

Brenner, Charles, DNA Database Matches, http://dna-view.com/ArizonaMatch.htm