Gene copy number analysis for family data using semiparametric copula model.

Gene copy number changes are common characteristics of many genetic disorders. A new technology, array comparative genomic hybridization (a-CGH), is widely used today to screen for gains and losses in cancers and other genetic diseases with high resolution at the genome level or for specific chromosomal region. Statistical methods for analyzing such a-CGH data have been developed. However, most of the existing methods are for unrelated individual data and the results from them provide explanation for horizontal variations in copy number changes. It is potentially meaningful to develop a statistical method that will allow for the analysis of family data to investigate the vertical kinship effects as well. Here we consider a semiparametric model based on clustering method in which the marginal distributions are estimated nonparametrically, and the familial dependence structure is modeled by copula. The model is illustrated and evaluated using simulated data. Our results show that the proposed method is more robust than the commonly used multivariate normal model. Finally, we demonstrated the utility of our method using a real dataset.


Introduction
The gene copy number (also called "copy number variants"-CNV) is the number of copies of a particular gene in the genome of an organism. Recent evidences show that gene copy number (GCN) amplifi cations and deletions are common characteristics of many genetic diseases. For example, GCN can be elevated in cancer cells as demonstrated in the epidermal growth factor receptor (EGFR) gene in patients with non-small cell lung cancer (Cappuzzo et al. 2005) and also higher copy number of CCL3L1 has been associated with susceptibility to human HIV infection (Gonzalez et al. 2005). Thus identifying these genetic gains and losses provides useful information about specifi c disease susceptibility or resistance. GCN analysis among normal people within the human genome is also of interest. However, these genetic characteristics are usually not directly observable. Recent technological development in array comparative genomic hybridization (a-CGH) provides scientists with an effi cient tool to conduct whole genome and high-density region specifi c investigation of GCN (Solinas et al. 1997;Pinkel et al. 1998;Snijders et al. 2001).
Briefl y, a-CGH technique involves the labeling of genomic DNA from disease tissues (e.g. cancer) and normal control tissue (reference) with different colors (fl uorochrome). These samples are then co-hybridize to a metaphase spread from a normal reference cell. After hybridization, emission from each of the two fl uorescent dyes is measured, and the signal intensity ratios are indicative of the relative copy number of the two samples. The ratio of the two fl uorochrome intensities is then calculated and regions where the disease DNA are amplifi ed or deleted are readily detected on the metaphase spread. The resulting data are in the form of microarrays. This technique not only gives us information about copy number gains and losses in the disease genomic DNA but also allows the identifi cation of the specifi c chromosomes and the regions of the chromosomes where these changes occurred.
However, the a-CGH data does not provide direct measurements of the GCN changes. Hence, several statistical approaches for analyzing and describing results from these experiments have been developed. Differences exist in these approaches and newer approaches addressing some of the limitations of existing method are needed. For example, some of these methods do not take into account the spatial dependence within the chromosome (Hodgson et al. 2001;Pollack et al. 2002;Cheng et al. 2003;Wang et al. 2004) while others have implemented such dependence structure into their models to enhance the inference (Jong et al. 2004;Picard et al. 2004;Fridlyand et al. 2004;Eilers et al. 2004). With the exception of a few that are Bayesian (Barash and Friedman, 2002;Daruwala et al. 2004;Broet and Richardson, 2005), most of the existing methods are frequentist's. All of the existing methods are designed for analyzing data from unrelated persons and are therefore effective in explaining horizontal changes in GCN. However and when available, family data present wonderful opportunity to investigate the vertical kinship effects of GCN as well as the horizontal changes. For this type of data, the main challenge is to model the high dimensional familial dependence structure, and no such approach was found following a careful review of the literature. In this paper, we present such a method in which we used the nonparametric approach to model the marginal distributions and then linked the joint distribution by a copula structure.
Typically, GCN changes observed from a-CGH experiments are classified into three groups corresponding to the three statuses of copy number changes-amplification, deletion and normal; Thus, allowing the microarray responses to have similar features. The practical challenge in the problem that we describe here is that of high dimensionality due to familial dependence among pedigree members. As we indicated above, several of the statistical tools for microarry data clustering deal with low dimensional data (usually one dimensional) and do not take into account the familial dependence among the pedigree members. Such methods can be divided into two main groups, the model based and non-model based (semiparametric). The former assumes specifi c probability models for the sub-distributions of response in each cluster and is effi cient when the specifi cations are correct but may be seriously biased if implemented specifi cations deviate from the true unknown models. The semiparametric does not make any assumption about the models except that of a mixing structure, in which the unknown sub-distributions are estimated nonparametrically from the data themselves, and the inference is robust. This method is adequate when the data size is large so that the sub-distributions can be estimated accurately. Yuan and He (2008) proposed such a method for low dimensional microarray clustering for independent data generated from unrelated persons. For data with high dimensionality, the commonly used multivariate normal model rarely fi ts the actual data, and the nonparametric method is not directly applicable in cluster analysis, so neither of the above models based or non-model based methods are suitable for analyzing the dependence and high dimensionality of family data.
In statistics, the copula is a widely used tool for modeling the dependence structure of high dimensional data (Sklar, 1959;Joe, 1997), and is particularly suitable for pedigree data modeling. Here we propose and implement a semiparametric copula model to address this problem. Specifically, the marginal distributions are estimated nonparametrically, the within pedigree dependence structure is modeled by copula with parameters specifi ed by the kinship coeffi cients. A penalty term is used against non-unimodality of the subdistributions. The optimal partition is performed using a classifi cation-estimation (of densities)maximization-estimation (of parameters) algorithm. The algorithm shares the property of ascending the penalized semiparametric likelihood, just like the well known EM algorithm for ascending the parametric likelihood, and thus, under fair conditions, converges to the optimal partition of the microarray.

The Method
In a-CGH data, the fl uorescence ratios between two samples, case and control, are measured across a genomic region. For loci with different copy number changes, the corresponding log-ratio measurement tend to be different. Thus in a-CGH data analysis, often a three-state mixture model is used: deletion state, normal state and amplifi cation state, and we arbitrarily lable them as state 1, 2 and 3. Genes with copy number deletion tend to have smaller log-ratio measurements, those with normal status tend to have moderate measurements, and those with amplifi cation tend to have larger measurements.
We focus on the case of a given chromosome. When there are more than one chromosome under consideration, the method is similar by modeling the chromosomes one by one. Suppose there are n loci of interest and r pedigrees of individuals. The measurement at each locus for each individual is observed. The j-th pedigree has s j individuals ( j = 1, ..., r), at locus k, the l-th individual of the j-th pedigree has microarray measurement y jkl . Denote y y y jk jk jks j = … ′ ( , , ) 1 be the measurements of the j-th pedigree at locus k, for each fixed pair ( j, k) the y jkl 's are familiarly dependent due to kinship.
Generally this question is formulated as a cluster problem, in which each of the gene locus in classifi ed into one of the clusters B 1 , B 2 and B 3 represent the three states deletion, normal and amplifi cation. Let y be a general random vector of the observation y jk 's, a mixture model on y is specifi ed as (1) Where f (⋅|B i ) is the sub-density of the responses in the i-th cluster, and the α i 's are the mixing proportions satisfying 0 Յ α i Յ 1, Σ i k = = 1 1 α . In the literature usually the f(⋅|B i )'s are specifi ed as multi dimensional normal density functions with cluster specifi c mean vectors and variance matrices. Typically for this type of pedigree data the dimension is around 10 to 15.
In practice, such high dimensional dependent data hardly conforms to a multivariate normal distribution. A commonly used statistical tool to deal with high dimensional dependence structure is the copula. In this method, it is only necessary to specify each of the marginal densities, and then use a link (copula) to compose all the marginal densities into a joint multivariate density with desired dependence structure. There are large number of copulas to be used, and some optimality criteria to select the best copula for a given problem and data. When the copula is selected, we can incorporate the kinship coeffi cients among the pedigree members into its dependence structure. Also, there is the question of how to specify the marginal densities. There are various parametric densities to choose from, but if the wrong one is used the results may be seriously biased. On the other hand, for data with large sample size, the nonparametric density can adapt to any distributional feature. Since we do not know the true subdensities we model each of the marginal densities by nonparametric method for robustness. Finally, a modified BIC criterion is used to select the optimal number of clusters. We describe each of the above steps in different sub-sections below.

The marginal distributions
Since commonly available pedigree data usually consist of three generations and to account for the age and gender difference, the distributions of the measurements are divided into six groups in the following order: fi rst generation male, fi rst generation female, second generation male, second generation female, third generation male and third generation female. We use G s to denote the s-th group. For example if an individual with observation y jkl is a second generation female in any given pedigree, she is in group 4, we simply denote y jkl ∈ G 4 , and so on. Denote f s (⋅|B i ) be the subdensity of array cluster i of group s.
Since the f s (⋅|B i )s are unknown, they can be estimated by the well known nonparametric estimator (Rosenblatt, 1969) where n is is the sample size (number of individuals) of group s in cluster i, K(⋅) is arbitrary given kernel density, and h n is is a given bandwidth to be specifi ed below.
In the density estimation literature, the choice of kernel is not of particular importance (Diggle, 1983;Silverman, 1986). Studies suggest that most unimodal densities perform about the same as the other when used as a kernel, and the choice between kernels can be made on other grounds such as computational effi ciency. However, there are some popular options in practice for different reasons. For some general introduction for the choice of kernels,we refer to Silverman (1986) and Scott and Wand (1991). The normal kernel (i.e. K(⋅) is the density function of the standard normal distribution) is widely used in practice for convenience and other nice features.
In contrast, the choice of bandwidth is crucial in density estimation (Silverman, 1986). Interesting proposals which address this problem can be found in Fan and Gijbels (1992). There is literature on automatic methods that attempt to minimize a lackof-fi t criterion, such as integrated squared error. From Silverman (1986), we choose to use the bandwidth where σ 2 is is the empirical variance of the y jkl 's in the s-th group and the i-th cluster.
In the copula formulation we also need the corresponding marginal distribution functions. Let Fs(⋅|B i ) denote the marginal distribution functions for cluster i and group s, where χ(⋅) is the indicator function.

The joint distribution
The copula is a commonly used statistical tool to model multivariate joint distribution, it appeared in the early work of Hoeffding, Fréchet and others and formally introduced by Sklar (1959). We fi rst give a very brief account of it and we refer to Hunchinson and Lai (1990); Joe (1997) and Nelson (1999) for detailed review.
This property provides a convenient way to construct a joint distribution with given marginal ones. On the other hand, given a set of marginal distribution is the true joint distribution of them (Sklar, 1959). Also, for any joint Given a copula, the dependence structure can be characterized in several ways, including Pearson's correlation, Kendall's tau, Spearman's rho, tail dependence, etc. Kendall's tau is generally easier to compute for copulas, so we use this dependence measure. For a two-dimensional copula, Kendall's tau is given by where (X 1 , Y 1 ) and (X, Y ) are independent with the same distribution. −1 Յ τ Յ 1, τ = 0 for independence, −1 and 1 for perfect negative and positive dependence. Genest et al. (1995) suggested a pseudo-likelihood approach to estimate the dependence parameters, in which the observed data is transformed via the empirical marginal distributions to obtain pseudo-data that are used in the estimation. Using the special relationships among relative pairs, we can implement the dependence parameters in the copula via the relationships among kinship coeffi cients, Kendall's tau and the copula dependence parameters without estimation. For pedigree data, the dependence relationships among familial members (i, j) are best described by the kinship coeffcients, γ ij = Δ 7ij /2 + Δ 8ij /4, where Δ 7ij , Δ 8ij , Δ 9ij are the condensed kinship coeffi cient (Jacquard, 1974) between relative pair i and j. The Δ kij s (k = 1, ..., 9) are the probabilities for the nine possible condensed identical by descent (IBD) status as in Jacquard (1974), in which Δ 7ij , Δ 8ij and Δ 9ij are commonly used in practice. They are the population probabilities of sharing 2, 1 and 0 genes IBD for relative pair (i, j), without regard to their particular genotypes, but only (i, j)'s kinship relationships, under the Mendelian inheritance. Also, 2γ ij is the expected proportion of gene IBD for relative pair (i, j) at this locus. For convenience we list the values of these coeffi cients for some common relative pairs (Lange, 1997), and we compute corresponding Kendall's tau in the last column after the computations below.
For trait underlined by single locus or multiple loci, Pearson's correlation for relative pair (i, j) is 2γ ij (Lange, 1997). Assume that gene copy number change statuses are determined only by the underlying genetic sources, and that the amounts of dependence among them are additive with respect to their shared genetic sources. Then at any fi xed locus, Kendall's tau between a fi xed type of relative pair As is true for Pearson's correlation, we postulate that Kendall's tau remain the same, or approximately so, when the trait is infl uenced by multiple loci. As the kinship coeffi cients are already known as in Table 1, we get Kendall's tau by the above relationships and in turn, the dependence parameters in the copula model is obtained from the relationship among the dependence parameters and kendal's tau for specifi ed copula. Thus we can easily implement the dependence kinship coefficients in the copula in terms of Kendall's tau without estimating them. For this, we fi rst need to review several commonly used copulas. Note, for family data the dependence are not constant among different relative types, hence copulas with constant dependence parameters, such as Clayton's copula or Frank's copula can not be used here.
By relationships (6) and (7), the dependence parameters θ ij 's in the multivariate normal copula are easily obtained given the τ ij 's, which are computed via the known kinship coeffi cients Δ kij 's, as long as we know the kin type of relative pair (i, j).

Multivariate T-copula
Let Θ be the correlation matrix given in the multivariate normal distribution, x = (x 1 , ..., x d )' The density function d-dimensional T-distribution with r degrees of freedom is The corresponding copula density is ( / ( ( ))), For this copula, the relationships between the θ ij 's and the τ ij 's are the same as for the multivariate normal copula.

Selection of copula
Given several candidate copulas C 1 , ..., C h with densities c 1 , ..., c h , a natural question is how to select the optimal copula for the data. Let ˆ( ) When there are parameters to be estimated in the copula, the optimal copula can be selected by AIC criteria (Oakes, 1989;Dias and Embrechts, 2003). Here, our copula has no parameters to be estimated, by the likelihood principle and (5), an intuitive way is to select the C with This is equivalent to choosing the copula with the largest likelihood. Now for given copula, the joint density for the data y = {y jkl } is modeled by  We point out that although we used the same notation c, for different families, the number of individuals may differ and so are the dimensionalities of the c's.
However, under the semi-parametric mixture model assumption, the sub-distributions can take any shape, even the shape of the entire distribution, and as a result any cluster partition will give about the same likelihood value via (9). So optimizing (9) over all possible cluster partitions will not be able to identify the desired clusters. Thus we put some constraints on the selection of clusters such that the sub-distribution is approximately unimodal and optimizing model (9) will give the desired clusters, as in Yuan and He (2008). The reference Yuan and He will be refered to as YH in subsequent citations. However there are two major differences between the method we are proposing and that of YH. Our method can handle high-dimensional data and the link among the marginal densities in copula.
Specifi cally, let g(⋅|B i ) be the multivariate normal density with mean given by the sample mean for data in B i , and covariance matrix Θ, for observations in B i (i = 1, ..., 3), and denote g = ( g 1 , g 2 , g 3 ), where g i = g (⋅|B i for some 0 Յ λ Յ 1 to be specifi ed. This model can be viewed as an extension of the traditional mixture model. When λ = 0, it corresponds to a nonparametric specifi cation of sub-distributions, when λ = 1 it is a full parametric model given by the g(⋅|B i )s, and when 0 Ͻ λ Ͻ 1 it corresponds to an intermediate model. By doing this, we are forcing the distributions to be close to normal, more than what is needed for unimodal. The tunning parameter λ is chosen according to simulation for the given type of data. The choice of a multi-variate normal here is for convenience as other choices could be made but may result in additional complication.

The CEME algorithm
However, directly optimizing the mixture model (10) is usually not easy. A common practice of estimating the cluster membership of each observation in the data while evaluating the maximum likelihood estimate α of α in (10) is the EM algorithm (Dempster et al. 1977). The EM algorithm is a much easier (though much slower) endeavor computationally than the direct optimization. For fi xed k, let u ij = 1 if the i-th locus belongs to the j-th cluster, u i = (u i1 , u i2 , u i3 ) be its membership vector, and u = {u ij }. Treating u as missing data, (y, u) is referred to as the "complete" data. Then the likelihood for the "complete" data is where g(y jk | B s ) is the analogue of ˆ( ) f y B jk s | . The above log-likelihood is optimized iteratively, with the clusters B s 's are classifi ed along each iteration. We specify the starting values at iteration zero as below. Given the current t-th iteration estimates α α α α

Starting values
s from the t-th iteration, we update them in the (t + 1)-th iteration according to the following CEME steps.
1. Classifi cation-step: Each response locus k, is classifi ed into a candidate cluster B s , if This is the optimal classifi cation rule in the sense of minimizing the expected loss (Anderson, 1984), and it is also the so-called Bayesian assignment. In the cases of ties, we use uniform random assignment among the tied clusters. Let Then use (5) to get the candidate sub-joint density for the j-th pedigree at locus k, as follows Let g be the reference densities corresponding to B . We update the quadruple (B (t + 1) , f (t + 1) , F (t + 1) , g (t + 1) ) as The estimate of f (⋅) at the (t + 1)-th iteration is then ˆ( )ˆˆ( | ). α Note that at each iteration t, α (t) , B (t) , and the u ij t ( ) s are updated, but not necessarily so for f (t) and g (t) .
The above four steps are iterated until convergence of (α (t) , f (t) , B (t) ). (Note by the following Proposition, we may check the stability of the α (t) as a simple criterion for the convergence of the triple). We may use the relative error criterion for the convergence of the (α (t) 's, that is, for some prespecified δ Ͼ 0, we stop the iteration when As in YH, for k = 1, ..., n; s = 1, 2, 3, we have  ( α (t) , f (t) , B (t) ) → (a * , f * , B * ).
In the above, λ = 1 corresponds to a normal model, and 0 Ͻ λ Ͻ 1 correspond to a mixed model. For this type of data, the model has difficulty in parameter convergence for small values of λ, refl ecting the fact that the multivariate data distribution is too noisy for nonparametric part of the model to work alone; thus a parametric unimodal component is needed to help cluster the data. The normal copula model has larger likelihood value in all these cases. This means the normal copula model is more robust than the multivariate normal model. For the data from multi-normal model, when λ = 0.5, 38 loci from cluster three are classifi ed to cluster two. Over all, λ = 0.75 performs well for all the data set, and so we recommend this value of λ in this analysis.
To assess the robustness of the method, we simulated larger data sets with family sizes of 4 and 5, each with 100 families and 200 candidate loci. The simulated clusters and means for male and female are: cluster one, 1-70, (4.9, 4.2); cluster two, 81-150, (8.9,8.2) and cluster three, 171-200, (12.9,12.2) respectively. To refl ect some complexity we added minor clusters to some of the clusters. The means for male and female for the minor clusters are 71-80, (5.4, 4.7) and 151-170, (12.4,11.7). The results are summarized in Table 3.
Overall, the results are consistent: the smaller the value of λ, the better the model fi tness, as indicated by larger likelihood value. This means that the nonparametric model component capture the data distribution in fine details. But in many cases, the computation breaks down for λ = 0 as pointed out earlier. It is seen that for either the data generated from multi-normal or normal copula distributions, the overall performances of the semiparametric model is robust for a range of the tuning parameter λ.

Real data analysis
We use the proposed method to analyze the Genetics Analysis Workshop 15 (GAW15) data set with 14 pedigrees of CEPH Utah families, each with three generations and about a dozen normal individuals. Expression level of genes in lymphoblastoid cells of the above subjects were obtained using the Affymetrix Human Focus Arrays that contain probes for 8,500 transcripts. Gene copy number variations in normal people within human genome has been the subject of study (Freeman et al. 2006;Pugh et al. 2008). For 3,554 of the 8,500 SNPs tested, Morley et al. (2004) found greater variation among individuals than between replicate determinations on the same individual. These 3,554 expression phenotypes (expressed genes) were chosen for copy number change analysis. The first step is to find out the best copula model for the data. We considered three different models, the multi-normal model, the semi-parametric multivariate normal-copula model, and the semi-parametric multivariate T-copula model. Then the criterion in (8) is used to select the optimal model. The average copula likelihood values for the three models are −3217389.15, −2094272.97, −2296408.96 respectively. Thus the semi-parametric multivariate normal-copula model is the best of the three and was used for clustering. The outcome of the analysis of the GAW15 data is displayed in the figure below. The horizontal axis represents the sequential numbering of genes from 1 to 3550, and the vertical axis indicates the classified states of the genes with 1, 2 and 3 representing deletion, normal and amplification. As shown in the fi gure, most of the SNPs are in clusters 1 and 3, this observation is consistent with the large variation of the expression levels. The SNPs with deletion status are more likely to be contained in cluster 1, and those with amplifi cation status are more likely to be in cluster 3.

Concluding Remarks
We proposed, studied and demonstrated a semiparametric copula method for microarray-SNP genomewide association analysis using pedigree data. We successfully implemented the kinship relationship into the model for more robust analysis of family data than the commonly used multivariate normal model.