An analysis pipeline for genome-wide association studies.

We developed an efficient pipeline to analyze genome-wide association study single nucleotide polymorphism scan results. Purl scripts were used to convert genotypes called using the BRLMM algorithm into a modified PB format. We computed summary statistics characteristic of our case and control populations including allele counts, missing values, heterozygosity, measures of compliance with Hardy-Weinberg equilibrium, and several population difference statistics. In addition, we computed association tests, including exact tests of association for genotypes, alleles, the Cochran-Armitage linear trend test, and dominant, recessive, and over dominant models at every single nucleotide polymorphism (SNP). In addition, pairwise linkage disequilibrium statistics were elaborated, using the command line version of HaploView, which was possible by writing a reformatting script. Additional Perl scripts permit loading the results into a MySQL database conjoined with a Generic Genome Browser (gbrowse) for comprehensive visualization. This browser incorporates a download feature that provides actual case and control genotypes to users in associated genomic regions. Thus, re-analysis "on the fly" is possible for casual browser users from anywhere on the Internet.


Introduction
Genome-wide polymorphism and copy number variation arrays are an accepted and standardized approach to disease association mapping, which is gaining in popularity in the genetic research community. Nearly two hundred genome-wide association studies (GWAS) have now been cataloged by the National Human Genome Research Institute of the National Institutes of Health (NIH; see: http:// www.genome.gov/GWAstudies/). In addition, newer platforms provide detailed typing for both common and rare copy number variations (CNVs) in the human genome.
Before mid-2007, a 500K Affymetrix array was the most common platform used to study the genetic epidemiology of cancer and to help understand the population genetics of common diseases. Since that time, Illumina has released a 550 K, 650 K and a One Million (1 M) single nucleotide polymorphism (SNP) platform; and Affymetrix has released a 6.0 version that contains over 900,000 SNPs (with over 700,000 scorable) and approximately 950,000 CNV probes. Each platform provides copious genetic data on every typed research subject.
The mass application of the 500 K array around the world has produced an enormous amount of data susceptible to independent and creative analysis in many locations. Both population history studies and association analysis can be carried forward with these data. The ease and availability of analysis can contribute immensely into gaining new insights from the data provided from dense SNP and CNV panels.
A methodologic pipeline for analysis of the microarray genotype calls, performed directly on a Unix platform, using an average PC-Linux or Mac computer, avoids cumbersome and time-consuming statistical software programming and permits rapid conclusions to be drawn about population structure and associations. Here, we provide methodology and detailed instructions for implementing the analysis of GWAS results.

Construction of a single fi le in modifi ed PB format
The main feature of our analytical pipeline is the concentration of all available information into a single text fi le. We call it a modifi ed PB fi le. This single fi le comprises structured SNP calls, confi dence scores, and references for all mapped SNPs for hundreds or thousands of genotyping microarrays.
Our modified PB format is based on the Prettybase format developed by the SeattleSNPs project (http://pga.gs.washington.edu) to easily represent SNP information for groups of subjects over a given span of reference sequences. Our modifi ed PB format has an inherent feature of easy adaptability to the descriptive applications and tools written by Ross Lazarus and publicly available at http://innateimmunity.net.
The following Prettybase features were adapted to our modifi ed PB format: (Fig. 1, Panel B): 1. The fi le consists of rows with four fi elds separated by a white space. These fi elds are: SNP displacement, Patient Identifi er (PID), genotype call for the fi rst allele, and genotype call for the second allele.
2. A missing genotype, if present, is denoted N N in the third and fourth columns. 3. The fi rst letter of a PID indicates the population to which this subject belongs.
We added some other characteristics to the Prettybase format, thus rendering it into the modifi ed PB format. The fi rst of these characteristics is the 11-digit chromosome-displacement fi eld (ChrDisp) specifying the unique position of the SNP in the human genome. The fi rst two digits contain the chromosome number padded with leading zeros, if necessary, while the last nine digits contain the base pair position (again padded with leading zeros). [See Fig. 1] Also, a modifi ed PB fi le can have as many fi elds as necessary: While the four core fi elds do not alter relative position, additional fi elds containing meta-data can be appended on the right. Adding these fi elds may include new

ChrDisp PID
First Letter Indiv. PID Figure 1. Input File Formats. A) Schematic structure of a traditional Prettybase fi le with the four columns delineated. The leftmost column is the displacement relative to a reference value; the middle column is a patient identifi er (PID); and the third and fourth columns are the allele 1 and allele 2 base calls (A1 A2). B) A modifi ed format designated as a Modifi ed PB fi le. The leftmost displacement fi le consists of an 11 digit displacement, the fi rst two digits of which are the chromosome number (X is 23, Y is 24), and the rightmost 9 digits of which are the displacement relative to the beginning of the chromosome. The PID consists of alphanumeric identifi ers preceded by a group identifi er (for example, U for Utah or Z for Zuni Native American) with the third and fourth columns as allele 1 and allele 2 base calls (A1 A2). Additional fi elds appended to the right hand portion may include dbSNP identifi ers, Affymetrix probe identifi ers, or confi dence values, as shown in the right hand portion of panel B. C) Cartoon clarifying the relationship of the 11 digit chromosome-displacement identifi er (ChrDisp) with the population subgroup PID identifi er, ordered through a UNIX sort.
information but does not complicate the reading or parsing of the fi le for reformatting or computation. In modifi ed PB format ( Fig. 1, Panel B), fi elds for dbSNP RefSNP ID, Affymetrix SNP ID, and confi dence score of genotyping have been added. The modified PB file can be sorted with a simple "sort" command (UNIX, Perl or Excel type), which will group the record lines into blocks of the same SNP, sorted according their position in the genome. Each block, in turn, is sorted according to the PID (Fig. 1, Panel C).
This modifi ed PB fi le can have different numbers of fi elds and lines; a variety of chromosomedisplacements or population specifi cations. Thus, we defi ne a "regular" modifi ed PB as a fi le that contains an equal number of persons per SNP block and equal number of fi elds per record. In general, we always refer to that type of modifi ed PB fi le, since the number of the records equals the product of the number of SNPs multiplied by the number of genotyped research subjects.

Affymetrix 500K and Illumina
GoldenGate analysis using modifi ed PB fi le format and minimum software The modified PB-file pipeline is used in our laboratory and is the backbone for a recent study of genetic variation in Ashkenazi Jews (Olshen et al. 2008)  Two sources of input data were provided to us from Affymetrix 500 K scans: 1. Data: CEL fi les from the Affymetrix GeneChip reader from a genotyping facility: The CEL fi les are binary fi les containing the fl uorescence intensities for each probe on the microarray. Those fi les together with SNP-specifi c annotation fi les obtained from the Affymetrix website were processed in two steps into a modifi ed PB fi le within a few hours.
Step 1: Running BRLMM genotyping algorithm A command line program called apt-probesetgenotype was used. It is part of the open-source Affymetrix Power Tools (APT) software package (Affymetrix, 2008a). This is an application for making genotype calls from mapping arrays. The application supports the BRLMM genotype calling algorithm (a modifi cation of Rabbee and Speed, 2006). The user can select different outputs including SNP calls and confi dence scores. All SNP values are tagged with SNP Affymetrix ID.
Step 2: Converting retrieved data into the modifi ed PB format SNP Affymetrix IDs in the output data were mapped with their respective dbSNP RefSNP IDs (through a Perl script or MySQL) or were discarded. It was important to use dbSNP identifi ers instead of Affymetrix IDs for further processing with annotations. While each SNP is undoubtedly identifi ed by its dbSNP ID, it can have more than one Affymetrix ID depending upon the probeset used to detect it. Thus, different Affymetrix IDs on different chip releases may detect the same dbSNP ID. For Affymetrix genotyping microarrays, a Perl script was used to convert all data into a single modifi ed PB fi le with seven columns.
Optionally, a LINKAGE-format individuals' fi le could be used to select for PIDs.

Annotations: A NetAffx annotation information
file that provides consolidated data from multiple public sources, including probe sequences, gene annotations, and extensive annotation for each SNP (Affymetrix, 2008b) is required for further processing.
For the processing of Illumna GoldenGate or Infi nium data, A Perl script was constructed that permitted fl at BeadStudio V3.x "Full Data Table  Output" fi les to be converted to modifi ed PB format through accurate parsing and reformatting.
Statistical analysis using modifi ed PB fi les and SAS SAS software (SAS Institute Inc., Cary, NC) version 9.1.3, running on a SUSE Linux platform, converted the PB fi les into SAS data sets for further analysis. Commands and instructions were written according to the Language Reference (SAS Institute, 2004a) and Procedures Guide (SAS Institute, 2004b) and are available at the authors' FTP site (ftp://ftp.ncifcrf.gov/pub/users/goldb).
The genotypes were recoded so that the most common allele in a reference population (reference allele) was assigned a value of 0 and the other allele (variant allele) was assigned a value of 1. For each marker in each population summary statistics were computed that included the number of observed alleles (1 or 2), the allele and genotype counts and frequencies, the Hardy-Weinberg (H-W) disequilibrium coeffi cient (D), inbreeding coeffi cient (F IS ), and the test statistic and P-value for the chi-square test for Hardy-Weinberg equilibrium. The genotype counts were used to compute exact P-values by the method of Wiggington et al. (2005). The direction of H-W deviation was also assessed.
Several genetic distance statistics were computed for the populations including the Fixation Index of the Subpopulation within the Total (F ST ) (Weir, 1996) and Nei's standard genetic distance measure (D S ) (Nei, 1972(Nei, , 1978. In addition, several information theory-based statistics were computed, including entropy for admixed populations (Smith et al. 2003), Kullback-Leibler divergence (Cover and Thomas, 1991) and the informativeness (I n ) for assignment statistic of Rosenberg et al. (2003).
Three chi-square tests comparing marker allele and/or genotype frequencies were implemented with SAS software.  Table 1).

Discussion
Genome-wide association studies require teamwork between disparate centers, laboratories, and technologies. Samples located in one center may be received in another, typed in a third and analyzed in a fourth location, often across continents. Today, in spite of rapid Internet connections, the transfer of hundreds of gigabytes of row genotype data can be a serious time and security challenge. While the newest hardware confi gurations of PC and Mac computers could be found in nearly every research institution, this is not the case with expensive proprietary software, SNP genotyping arrays or medical samples. Therefore, we proposed a method for standard Analysis Pipeline Workfl ow. The pipeline was initially written for the analysis of Affymetrix genotyping microarrays, but has since been adapted to Illumina BeadArrays. The upper portion of the fi gure provides a workfl ow for dual style (500 K) microarrays, but has also been used for single-chip Affy 6.0 microarrays. Illumina BeadArray 317 K, 500 K, 550 K, 650 K and 1 M data feeds into the workfl ow on the right-hand side, after fl at table export of a BeadStudio V3.x "Full Data Table Output" fi le. The arrays provide a basis for assembly of the modifi ed PB fi le, which is the raw input for the analytic pipeline. assembly and quick extraction of genotype information from row data. This method provides easy transportability, a standardized, easily understandable and robust format that would be advantageous for any SNP-typing or mapping association study.
Our analytic platform allows collection of a huge amount of data in a short amount of time. Thus, the modifi ed PB fi le can be promptly forwarded for further statistical analysis.
As part of our effort, Perl scripts were written which converted and supplied the modifi ed PB for analysis using SAS, Haploview, PHASE and the R platform. This permitted the generation of three different types of statistics from the data assembled in the PB: In addition, we have implemented Perl scripts that permit calculation of statistics that report the consistency of our data with HapMap data and consistency of segregation in families. These statistics provide a basis for quality assurance of the typing data.
We have compared our GWAS analysis pipeline with the PLINK package authored by Purcell and his collaborators (Purcell et al. 2007). PLINK provides an outstanding collection of software routines aimed at GWAS analysis; however, its extensive association testing capabilities do not provide a uniform output format. In fact, each PLINK command yields a novel output, some organized by SNP, some by SNP group or sliding window, some by CNV, or by PID. Instead, we chose a method that provides a uniform output format with 112 columns or fi elds that could be uploaded into a database or browser for further analysis. In addition, several statistics are available through our pipeline that are not available through PLINK: Among these are some association statistics, such as the co-dominant and overdominant models, an elaboration of Hardy-Weinberg compliance statistics (Weir's D-coeffi cient for the direction and magnitude of H-W deviations [Weir, 1996]), and several statistics that describe differences in populations through allele frequency calculations. The latter include F ST between the ostensible case (S in fi rst PID column) and control (L in fi rst PID column) populations (Weir, 1996 equation 5.2, p. 173), entropy with a 10%, 50% or 90% control ancestry calculation (Smith et al. 2004, equation 1), the Kullback-Leibler divergence in both directions averaged (Rosenberg et al. 2003 at page 1403 a factor of 1/2 used; see note c) and a Kullback-Leibler divergencea "resistor" average (Johnson et al. 2001, equation 11); In, informativeness for assignment (Rosenberg et al. 2003, equation 4) and Nei's D S or standard genetic distance (Nei, 1972;Takezaki and Nei, 1996; equation 1).
The central element of our analysis is the modifi ed PB text fi le. This single fi le summarizes all data from the genotype calls and confi dence scores of many chips and individuals (grouped into populations). The fi le can be zipped and transported, and may be used as input for many types of analysis. Furthermore, with the wide availability of hard disks of 1.5 terabyte capacity, data may be analyzed all at once on PC-Linux and Macintosh computers using only UNIX, Perl, MySQL, R or other General Public License (GPL) software. This will permit users to avoid high-performance hardware and expensive proprietary software. The fi le can also be parsed and dispersed into 24 chromosome files (25 if mitochondrial variants are included) for individual chromosomal analysis. In our experience, although the Windows XP operating system is user-friendly compared to UNIX, the restrictions it puts on the size of input data for its basic applications (e.g. ∼65,000 lines for Excel) render it impractical for this volume of data analysis.
On a Linux PC with 2 gigabytes of Random Access Memory (RAM), the process of building a modifi ed PB fi le from Affymetrix 500 K CEL files for few hundred individuals may take 1-2 hours. This time estimate accounts for a minimum of other processes running on the PC. Although this is an imprecise performance measure, this processing time frame estimate should provide potential users with a "rule of thumb guide" for processing time requirements of the pipeline we have described.