Probabilities Associated with DNA Profiling
This talk was motivated by
two columns by Keith Devlin which appear on the Mathematical Association of America
(MAA) website.
Statisticians not wanted
http://www.maa.org/devlin/devlin_09_06.html
Damned lies
http://www.maa.org/devlin/devlin_10_06.html
In these columns he indicates
there are some serious problems with calculating probabilities associated with
DNA profiling. By “DNA profiling” we
mean an attempt to identify a person by matching his or her DNA to that of
biological evidence left at a crime scene.
Recall that a DNA molecule
consists of two strands twisted around each other in the familiar double helix
pattern. The strands are joined together
by pairs of bases (aka nucleotides) which are of four types: adenine (A),
guanine (G), cytosine (C), and thymine (T).
Each pair is of the form AT or GC, so knowing one side is enough to
deduce the other. So we can think of DNA
as a sequence of the four letters A, C, G, and T.
Mind you, this sequence is a
bit long: there are about 3 billion base pairs in the human genome. Fortunately,
the DNA is arranged into large bodies called chromosomes, and as is
well known, humans have 23 pairs of chromosomes. A gene
occupies a locus on a chromosome. A gene may
occur in different versions known as alleles. The two chromosomes of a pair have
the same loci along their length, but may have different alleles at some of the
loci. A genotype is a pair of alleles at one locus. If a
gene is observed to occur in n
different alleles, then there are $n(n+1)/2$
possible genotypes at
that locus. (For example, with 3 alleles
we could have the pairs AA, AB, AC, BB, BC, CC.)
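The count of genotypes can be checked by brute force. The following is a minimal sketch, assuming a genotype is an unordered pair of alleles (so AB and BA are the same genotype); the function names are illustrative and not part of any forensic software. With n alleles there are n pairs of identical alleles plus n(n-1)/2 pairs of distinct alleles, for n(n+1)/2 in all.

```python
from itertools import combinations_with_replacement


def genotypes(alleles):
    """Return every unordered pair of alleles (genotype) at one locus."""
    return ["".join(pair) for pair in combinations_with_replacement(alleles, 2)]


def genotype_count(n):
    """Closed-form count of genotypes for n alleles: n(n+1)/2."""
    return n * (n + 1) // 2


pairs = genotypes("ABC")
print(pairs)              # ['AA', 'AB', 'AC', 'BB', 'BC', 'CC']
print(genotype_count(3))  # 6
```

The enumeration and the closed form agree for any number of alleles, which is a quick sanity check on the formula.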
The FBI established the
Combined DNA Index System (known by the acronym CODIS) in November 1997. This system features the 13 loci CSF1PO, FGA,
TH01, TPOX, VWA, D3S1358, D5S818, D7S820, D8S1179, D13S317, D16S539, D18S51,
D21S11. (They also test the amelogenin
gene for sex determination.) European
agencies use a standard of 10 loci, of which 8 are part of the CODIS system (
Before we get to the main
topics, recall that if $E_1, E_2, \ldots, E_n$ are mutually independent events, then
$P(E_1 \cap E_2 \cap \cdots \cap E_n) = P(E_1)\,P(E_2) \cdots P(E_n)$. Much work centers
around determining just when we have independence.
The distribution of alleles
at a given locus is assumed to be in Hardy-Weinberg equilibrium. This implies
that if a certain allele has frequency p in the population,
then the probability a person has two copies of it in a pair is $p^2$, while if
two alleles have frequencies p and q in the population, then the
probability a person has one of each in a pair is $2pq$. Here is a hypothetical
numerical example.
| allele    | a  | b  | c  |
| frequency | .3 | .5 | .2 |

| pair  | aa  | ab  | ac  | bb  | bc  | cc  |
| freq. | .09 | .30 | .12 | .25 | .20 | .04 |
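The genotype frequencies in the table follow mechanically from the allele frequencies. Here is a small sketch, assuming Hardy-Weinberg equilibrium (a pair of identical alleles has frequency p squared, a pair of distinct alleles has frequency 2pq); the function name `hardy_weinberg` is only illustrative.

```python
from itertools import combinations_with_replacement


def hardy_weinberg(allele_freqs):
    """Map each genotype to its expected frequency under Hardy-Weinberg equilibrium."""
    result = {}
    for x, y in combinations_with_replacement(sorted(allele_freqs), 2):
        p, q = allele_freqs[x], allele_freqs[y]
        # Identical alleles: p^2.  Distinct alleles: 2pq (either order of inheritance).
        result[x + y] = p * p if x == y else 2 * p * q
    return result


freqs = hardy_weinberg({"a": 0.3, "b": 0.5, "c": 0.2})
for genotype, f in freqs.items():
    print(genotype, round(f, 4))
print("total:", round(sum(freqs.values()), 4))
```

The frequencies sum to 1 because they are exactly the terms of the square of the sum of the allele frequencies.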
If there are n alleles occurring with probabilities
$p_1, p_2, \ldots, p_n$, then the probability of having any one pair is the
corresponding term in the expansion of
$(p_1 + p_2 + \cdots + p_n)^2$. Empirical results
have shown that each of the 13 loci used in the CODIS system is in
Hardy-Weinberg equilibrium. However, we
need to restrict to subgroups to obtain equilibrium (Caucasian,
African-American, west coast Hispanic, east coast Hispanic, etc.). Also, a
locus which is in Hardy-Weinberg equilibrium will not have its distribution of
alleles change over time. Thus we do not
have to worry about the age of a suspect while doing DNA profiling.
The probability of identity PI of a locus is the
probability that two individuals share the same genotype at that locus, and
equals the sum of the squares of the genotype probabilities. In the above example we have
$PI = .09^2 + .30^2 + .12^2 + .25^2 + .20^2 + .04^2 = .2166$. The power of
discrimination PD equals 1
- PI, and is the
probability two individuals have different genotypes at that locus. In general we prefer loci with many different
alleles that occur with roughly equal probabilities in a system designed to
distinguish between the DNA of different people. That is, we want all alleles at the locus to
have small probability of occurring.
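To see why roughly equal allele frequencies are preferred, one can compute PI and PD directly. This is a sketch under two assumptions: the locus is in Hardy-Weinberg equilibrium, and the two individuals are unrelated and drawn independently, so that PI is the sum over genotypes of the squared genotype frequency. The function names are illustrative.

```python
from itertools import combinations_with_replacement


def genotype_freqs(allele_freqs):
    """Hardy-Weinberg genotype frequencies for a list of allele frequencies."""
    out = []
    for i, j in combinations_with_replacement(range(len(allele_freqs)), 2):
        p, q = allele_freqs[i], allele_freqs[j]
        out.append(p * p if i == j else 2 * p * q)
    return out


def probability_of_identity(allele_freqs):
    """Chance that two independently drawn people share a genotype at this locus."""
    return sum(g * g for g in genotype_freqs(allele_freqs))


pi_example = probability_of_identity([0.3, 0.5, 0.2])    # the example locus above
pi_uniform = probability_of_identity([1 / 3, 1 / 3, 1 / 3])  # three equally common alleles
print(round(pi_example, 4), round(1 - pi_example, 4))
print(round(pi_uniform, 4), round(1 - pi_uniform, 4))
```

With three equally common alleles PI drops from about .217 to about .185, so PD rises; adding more alleles of small frequency lowers PI further.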
Authorities use the CODIS
system to test for a match between the DNA of a suspect and the DNA of
biological evidence left at a crime scene.
The idea is that the probability of a match at any one locus should be fairly small, say
.1. Thus the probability of two
different people matching DNA at all 13 loci is $10^{-13}$, or virtually
zero, assuming that matches at any two given loci occur independently. The practitioners of DNA profiling routinely
cite match probabilities of $10^{-15}$ or so. Since there are fewer than $10^{10}$
people in the world, there is only a very slight chance that two people can
independently match DNA at 13 loci, if these probabilities are correct.
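The arithmetic behind these figures is just the product rule. The sketch below assumes, as the text does for illustration, a match probability of .1 at each of 13 loci and independence between loci; the per-locus value and the population bound are the illustrative numbers from the paragraph above, not measured quantities.

```python
import math


def full_match_probability(per_locus_probs):
    """Product rule: valid only if matches at different loci are independent."""
    return math.prod(per_locus_probs)


p_full = full_match_probability([0.1] * 13)
world_population = 1e10  # upper bound used in the text, not a census figure

print(p_full)                     # about 1e-13
print(p_full * world_population)  # expected number of people matching a given profile
```

Under these assumptions the expected number of people in the world who match a given profile is about one in a thousand. If matches at different loci are positively correlated, the true probability is larger than the product.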
It was actually another
problem which aroused Devlin’s interest in the subject of DNA profiling, but he
is quite clearly concerned about the assumption of independence in DNA
profiling. The 13 loci used in the CODIS
system are located on 12 chromosomes, and the two loci which are on the same
chromosome are separated by approximately 26.3 Mb (megabases, where one megabase = one
million base pairs;
Devlin knows of only one
instance of someone investigating a “large” DNA database to see how many
matches occur. This involved the CODIS
database maintained by the state of
Authorities do take into
account lack of independence in the population as a whole. It is certainly recognized that identical
twins will match DNA profiles, and other close relatives might match as well. The citing of probabilities for separate
subpopulations (Caucasians, African-Americans, Filipinos, etc.) also indicates a
general lack of independence. In some
cases investigators use “theta corrections” to obtain a probability of a match
when the suspect and crime scene DNA both come from a well-defined small
subgroup (see pages 29 and 30 of the 1996 National Research Council report).
By the way, it was extremely
difficult to track down the exact numbers in the Arizona CODIS database study,
partly because there has never been a formal, thoroughly reviewed, published
study. It really started with a clerk,
Katherine Troyer, whose curiosity was aroused when a nine-locus match
occurred. She then sought out other
matches in her spare time. The best
source is the online reference of Brenner, which appeared very recently in response
to Devlin’s columns.
The question of independence
is not what originally prompted Devlin’s interest in DNA profiling. His articles
grew out of the difference between a “hot hit” and a “cold hit”. A hot hit
refers to a DNA match with someone for whom other evidence indicates a
connection to the crime. Even if DNA matches truly occur with
probabilities as high as $10^{-6}$ or $10^{-7}$, a DNA match
coupled with other evidence of involvement pretty clearly indicates guilt.
A cold hit refers to a case
in which the profile of DNA left at a crime scene is checked against profiles
in some database and a match occurs.
Legal authorities have struggled for quite some time to interpret the
chances of this occurring. On two
occasions the FBI has asked the National Research Council to study the issue
and provide a report. In the first
report the National Research Council advised using additional loci to determine
whether a true match has occurred. This
turned out not always to be practical, so in the second report the National
Research Council suggested instead calculating a “database match
probability” by multiplying the random match probability by the size of the
database being searched. For example, assume
a random match probability of $10^{-15}$ and a database size of $10^6$. Then the database match probability is $10^{-9}$. By the way, frequently only partial samples
are available due to degradation of DNA evidence. So the random match probability might be,
say, $10^{-7}$ and the database size $10^5$, which makes for a
database match probability of .01. This
is strong but not necessarily conclusive evidence of a match. Some statisticians have disagreed with the
National Research Council reports, and have advocated using the random match
probability, or a calculation very close
to it numerically (usually very small).
Due to the lack of certainty there has been a court case in
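The database match probability described above is simple enough to sketch in code. The first function is the product of the random match probability and the database size, as in the text. The second is an added comparison that is not taken from the NRC report: the probability of at least one coincidental match, assuming every profile in the database is unrelated to the crime scene sample and matches independently. Both function names are illustrative.

```python
import math


def database_match_probability(rmp, db_size):
    """Random match probability multiplied by the number of profiles searched."""
    return rmp * db_size


def prob_at_least_one_match(rmp, db_size):
    """1 - (1 - rmp)**db_size, computed stably for very small rmp.

    Assumption (not from the source): profiles are unrelated and independent.
    """
    return -math.expm1(db_size * math.log1p(-rmp))


# The two numerical examples from the text.
for rmp, n in [(1e-15, 10**6), (1e-7, 10**5)]:
    print(rmp, n, database_match_probability(rmp, n), prob_at_least_one_match(rmp, n))
```

For small values the two numbers nearly coincide; the product is always an upper bound on the probability of at least one coincidental match.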
I believe that initially
legal authorities assumed independence of matching at different loci so they could
use the product rule to simplify calculations of probabilities. At that time
there were no large databases with which to check independence empirically.
(The 1996 NRC report mentions two studies, but the largest still
involved fewer than 11,000 people.) I
think now most people are hoping that DNA profile kits will be expanded to
include so many loci that DNA profiles do become unique. Consider this statement from the 1996 report
of the National Research Council (page 8).
“The committee has not attempted
to define a specific probability that corresponds to uniqueness, but the report
outlines a framework for considering the issue in terms of probabilities, and
it urges that research into new and cumulatively more powerful systems continue
until a clear consensus emerges that DNA profiles, like dermal fingerprints,
are unique.”
Postscript: The article by
Brenner implies there is nothing very unusual about those hits in the Arizona
CODIS database. He mentions that we need
to concentrate on pairs of entries in the database. So the expected number of 9-locus matches
(assuming his probability of 1/13.66 for a match at any one locus) equals
.
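The pairwise counting argument can be sketched as follows. This is one plausible reading, not necessarily Brenner's exact formula: it assumes loci match independently with probability 1/13.66 each, counts pairs that match at exactly 9 of 13 loci, and multiplies by the number of pairs of profiles. The database size in the demonstration is a made-up illustrative value, not the Arizona figure.

```python
from math import comb


def expected_partial_matches(db_size, loci=13, matching=9, p=1 / 13.66):
    """Expected number of profile pairs matching at exactly `matching` of `loci` loci.

    Assumptions (hypothetical): independent loci, the same per-locus match
    probability p everywhere, and unrelated individuals.
    """
    pairs = comb(db_size, 2)  # number of distinct pairs of profiles
    per_pair = comb(loci, matching) * p**matching * (1 - p) ** (loci - matching)
    return pairs * per_pair


hypothetical_db_size = 50_000  # illustrative only, NOT the Arizona database size
print(comb(hypothetical_db_size, 2))
print(expected_partial_matches(hypothetical_db_size))
```

The key point is that the number of pairs grows like the square of the database size, so even a tiny per-pair probability can yield many partial matches in a large database.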
But this is still a bit
troubling to me. Suppose in the near
future everyone’s DNA profile is in a national database. Let us assume the population of the
. If we have profiles
on several more loci perhaps then full matches will be rare.
References
Kobilinsky,
National Research Council, The Evaluation of Forensic DNA Evidence,
National Academy Press, 1996
Rudin, Norah, Inman, Keith, An Introduction to Forensic DNA Analysis
(second edition), RA1057.55.I65 2002
Budowle, B. et al, Population Data on the STR Loci D2S1338
and D19S433, Forensic Science Communications, July 2001 (Tables are available online at
http://www.fbi.gov/hq/lab/fsc/backissu/july2001/budtabs.htm.)
Sandler & Mercer, P.C.,
Supreme Court document at
http://www.sandlermercer.com/dna-certiorari-diffendal.html
Walsh, Nick Paton, False result fear over DNA tests,
Guardian Unlimited Observer, January 27, 2002
http://www.portia.org/chapter10/DNAmis.html
Transcript of Katherine
Troyer concerning Arizona CODIS database matches, http://www.nlada.org/Defender/forensics/for_lib/Index/DNA/DNA%20Database%20Issues/Arizona%20CODIS%20Match%20Information/advanced_search_form
Brenner, Charles, DNA Database Matches, http://dna-view.com/ArizonaMatch.htm