Software for quantifying and simulating microsatellite genotyping error.

Microsatellite genetic marker data are exploited in a variety of fields, including forensics, gene mapping, kinship inference and population genetics. In all of these fields, inference can be thwarted by failure to quantify and account for data errors, and kinship inference in particular can benefit from separating errors into two distinct classes: allelic dropout and false alleles. Pedant is MS Windows software for estimating locus-specific maximum likelihood rates of these two classes of error. Estimation is based on comparison of duplicate error-prone genotypes: neither reference genotypes nor pedigree data are required. Other functions include: plotting of error rate estimates and confidence intervals; simulations for performing power analysis and for testing the robustness of error rate estimates to violation of the underlying assumptions; and estimation of expected heterozygosity, which is a required input. The program, documentation and source code are available from http://www.stats.gla.ac.uk/~paulj/pedant.html.


Introduction
Microsatellite genotype data underpin a wide diversity of genetic studies, in fi elds including individual identifi cation in forensics, mapping disease genes, kinship inference and population genetics. All but the smallest microsatellite data sets will contain genotyping errors, which can be substantially misleading when undetected. Awareness of the implications of genotyping error has increased greatly in recent years, and there is an emerging consensus that error rates should be estimated and reported, particularly when performing error-sensitive analyses such as parentage analysis or individual identifi cation (Bonin et al. 2004;Hoffman and Amos, 2005;Pompanon et al. 2005;DeWoody et al. 2006). Although the benefi ts of estimating and correcting for genotyping errors are most obvious when genotyping highly error-prone (e.g. non-invasive) samples, even low levels of genotyping error of 0.5-2% per genotype can mislead inference from microsatellite data (Feakes et al. 1999;Hoffman and Amos, 2005;Walters, 2005).
Ideally, error rates should be estimated by a combination of approaches, the most widely used being: (1) counting Mendelian inconsistencies in parent-offspring pairs; (2) comparing error-prone genotypes with high-quality reference samples; and (3) counting mismatches between duplicated genotypes. The fi rst two approaches will generally yield more reliable estimates than the third, but may not be feasible because pedigree data and reference samples will often be unavailable. The third option of re-genotyping a random subset of samples at all loci is widely recommended (Bonin et al. 2004;Hoffman and Amos, 2005;Pompanon et al. 2005;DeWoody et al. 2006) and will almost always be feasible but suffers from two disadvantages. Firstly, when errors are frequent a non-negligible proportion of repeat genotypes will sustain errors in both genotypes, leading to undercounting of errors (e.g. due to an underlying heterozygous genotype of AB being genotyped twice as AA). Secondly, many analyses benefi t from breaking down single error rates into two distinct classes of error-allelic dropout and false alleles (Broquet and Petit, 2004)-which is not straightforward when both duplicates are error-prone.
Pedant is a program for MS Windows that implements a maximum likelihood method for estimating rates of allelic dropout and false allele error from microsatellite genotype data (Johnson and Haydon, 2007). Allelic dropout occurs when a heterozygote is genotyped as a homozygote due to the random failure of one allele to PCR-amplify, and a false allele occurs when an allele is mis-genotyped as any other allele (Broquet and Petit, 2004). Allelic dropout and false allele error rates can be estimated as per-genotype probabilities, but here we treat them as per-allele probabilities. In a single genotype, ε 1 is the probability of an allele dropping out, and ε 2 is its probability of being mis-typed as another allele (Wang, 2004). Conversion between the perallele and per-genotype rates is simple, and both are calculated by Pedant.
Division of errors into allelic dropouts and false alleles is useful not only because allelic dropouts are typically more frequent than false alleles (Broquet and Petit, 2004), but also because the two classes of error affect analyses in fundamentally different ways (Wang, 2004;Hadfi eld et al. 2006). For example, consider a simple paternity analysis using one locus at which every allele has a 7% probability of error. There are three individuals: one offspring (genotype AB), and only two possible fathers (genotypes CD and CC). Neither candidate father matches the offspring, so an error must be invoked, but because we cannot say in which candidate the error is more likely to have occurred, the data are uninformative. However, suppose we can break down the total error probability into ε 1 = 5% for allelic dropout and ε 2 = 2% for false alleles. We will then have greater confi dence in genotypes AB and CD than in CC, because only CC can have been affected by allelic dropout (assuming only one error per genotype), and attributing AB or CD to error would require us to invoke a less likely class of error. Thus, when separate error rates are known, the data become more informative: the candidate with genotype CC is the more likely father. This principle has been successfully applied to pedigree reconstruction (Wang, 2004;Hadfi eld et al. 2006) and individual identifi cation (Kalinowski et al. 2006).
Unlike estimating a single error rate, estimating ε 1 and ε 2 is not a simple exercise in counting errors because when both repeat genotypes are error prone the class of error responsible for a mismatch can be ambiguous. For example, the mismatch AA/AB could be caused by allelic dropout in an underlying genotype of AB (giving AA) or a false allele in an underlying AA (giving AB). Instead, Pedant counts different types of matching and mismatching duplicate genotypes (AA/AA, AA/AB, AB/AB, AB/AC, etc) and searches for the pair of error rates that maximises the likelihood of the observed counts (Johnson and Haydon, 2007). For example, a duplicated data set with few AA/AB-type mismatches and many of type AB/AC supports a low allelic dropout rate and a high false allele rate.
The fi rst step in using Pedant is to ascertain whether the data fi t the underlying assumptions of the error model. The principal assumption is that the probability that an underlying genotype is heterozygous (observed heterozygosity, H o ) is known. However, as H o will be biased by allelic dropouts and false alleles, Pedant uses instead expected heterozygosity, H e , which will equal H o only when the study population is in Hardy Weinberg equilibrium (HWE). Deviations from HWE, such as a substantial heterozygote defi cit (F IS Ͼ 0.1), that cannot be corrected for using prior knowledge will lead to biased estimates (Johnson and Haydon, 2007). Known or suspected population structure can easily be corrected for by using the average within-subpopulation H e in place of the global H e . Pedant also assumes that errors occur with equal probability across samples and that each allele of a heterozygote is equally likely to drop out. Pedant is fairly robust to deviations from both of these assumptions, but a combination of severe variation in sample quality (Pompanon et al. 2005), short allele dominance (Wattier et al. 1998) and high allelic dropout rate (ε 1 Ͼ 0.1) can lead to underestimation of both ε 1 and ε 2 . Nevertheless, testing of Pedant on real and simulated data suggests that the method is robust to modest deviations from its underlying assumptions (Johnson and Haydon, 2007).
The second step in using Pedant is to decide how many genotypes to duplicate at each locus. This will depend mainly on the degree of accuracy required by the downstream analysis, but approximate estimates will generally be adequate (SanCristobal and Chevalet, 1997;Sieberts et al. 2002;Wang, 2004). The most important factor in determining accuracy is generally the number of visible errors in the input data. Thus, error rate estimation will be most accurate from data with high error rates and high H e (because at low H e many dropouts will be hidden), and when these conditions do not apply sample size should be increased to compensate. If average cross-locus error rate estimates are suffi cient, all loci can be pooled as a single locus, so with 10 loci perhaps only 25 to 50 duplicate genotypes per locus might be necessary. On the other hand, if locus-specifi c estimates are required then at least 100 repeats per locus might be required. These fi gures are very rough guidelines only. Specifi c scenarios can be explored using Pedant's simulation function, which allows users to simulate the estimation of error rates from multiple data sets while varying parameters (ε 1 , ε 2 , H e , F IS and sample size). For each error class the individual estimates and their mean and standard deviation are plotted, allowing rapid visual assessment of estimation accuracy under a variety of parameter combinations. Five hundred simulated data sets can be analysed in about 30 seconds.
When the data have been deemed suitable for analysis using Pedant and sample size has been decided, samples should be selected at random and repeat-genotyped blind at all loci. Pedant accepts genotypes in tab-delimited text format, which is easy to create from a spreadsheet. Alternatively, a Pedant input fi le can be converted from genepop format using an R script included in the download. The input fi le must contain H e estimates for each locus. To minimise the effect of sampling error, H e should be estimated from the whole data set, not just the subset that has been duplicated. H e (and its standard error) can be calculated in Pedant.
When the input fi le has been loaded, Pedant locates the maximum likelihood (ML) estimates, of ε1 and ε2, which takes about one second per locus, and outputs the estimates together with 95% confi dence intervals calculated using the relative log-likelihood function. Estimates and confi dence regions can also be plotted for individual loci (Fig.  1). Details of the ML search algorithm are given in the program documentation.
An alternative method for estimating ε 1 and ε 2 across markers is available in MasterBayes, an R package for Bayesian pedigree reconstruction (Hadfi eld et al. 2006). MasterBayes samples the posterior distributions of ε 1 and ε 2 , from which point estimates and credible intervals can be obtained using summary statistics describing the relevant marginal distribution. Although Pedant and MasterBayes differ in their approaches to parameter estimation (maximum likelihood versus Bayesian), their underlying error models are similar and they would be expected to reach similar estimates. We compared the two programs using two simulated data sets: low error (ε 1 = 0.01, ε 2 = 0.005) and high error (ε 1 = 0.05, ε 2 = 0.02). The low rates are typical of highly automated genotyping while the higher rates represent smaller, more error-prone studies (Pompanon et al. 2005). We set ε 1 higher than ε 2 because allelic dropouts are generally more frequent than false alleles (e.g. Broquet and Petit 2004). Comparison of the mean error estimates across 100 loci (Table 1) shows that both programs reach broadly similar and accurate estimates, particularly at higher error rates (for the reasons noted above). For both programs bias was highest in estimating the lowest of the four error rates (ε 2 = 0.005), with MasterBayes overestimating (ε 2 = 0.0066, bias = 31%) and Pedant underestimating (ε 2 = 0.005, ε 2 = 0.0041, bias = -18%). Overall Pedant recovered the population error rates more accurately, although it should be noted that the accuracy of both programs will vary with other parameters such as H e and sample size. For the small sample sizes that are likely to be feasible, by far the greatest source of estimation error will be sampling error (as refl ected by the width of the interquartile range), which is beyond the control of either method. Given that the two programs produce similar results, the choice of which to use should be guided by more pragmatic considerations. Pedant is simpler to use, in that it has a user-friendly interface (Fig. 1) and does not require familiarity with computer programming. Moreover, production of locus-specifi c estimates, which are used by some analyses (e.g. Wang, 2004), is more straightforward. However, error rate estimation is more fl exible in MasterBayes because it can use more than two repeat genotypes and it allows the incorporation of pedigree information. In addition, like other R programs it has the useful property of being adaptable by users with some knowledge of the R programming language. Thus, the choice of program will depend upon the experience and requirements of the user.
The MS Windows executable program, documentation, input fi le converter and source code for Pedant (written in Borland Delphi version 7.0) are available from http://www.stats.gla.ac.uk/~paulj/ pedant.html. a For each data set 100 duplicate genotypes were simulated in MasterBayes from 100 loci with 10 alleles per locus. Allele frequencies for each locus were generated randomly from the broken stick distribution. b In MasterBayes locus-specifi c error rates were estimated as the median of the relevant marginal distribution based on 100,000 MCMC iterations (after discarding 10,000 burn-in iterations). In Pedant the ML estimates located during 10,000 search iterations were used.