A novel rare variants association test for binary traits in family-based designs via copulas

With the advent of cost-effective whole-genome sequencing technology, more sophisticated statistical methods for testing genetic association with both rare and common variants are being investigated to identify the genetic variation between individuals. Several methods that group variants, also called gene-based approaches, have been developed. For instance, advanced extensions of the sequence kernel association test, a widely used variant-set test, have been proposed for unrelated samples and extended to family data. Family data have been shown to be powerful when analyzing rare variants. However, most such methods capture familial relatedness using a random-effect component within the generalized linear mixed model framework. Therefore, there is a need to develop unified and flexible methods to study the association between a set of genetic variants and a trait, especially for a binary outcome. Copulas are multivariate distribution functions with uniform margins on the [0,1] interval, and they provide suitable models to capture the familial dependence structure. In this work, we propose a flexible family-based association test for both rare and common variants in the presence of binary traits. The method, termed novel rare variant association test (NRVAT), uses a marginal logistic model and a Gaussian copula; the latter models the dependence between relatives. An analytic score-type test is derived. Through simulations, we show that our method can achieve greater power than existing approaches. The proposed model is applied to investigate the association between schizophrenia and bipolar disorder in a family-based cohort consisting of 17 extended families from Eastern Quebec.


Power function
Empirical bias of the nuisance parameters and the polygenic heritability
Tables S5-S8 show the empirical bias (×100) of the nuisance parameters ($\gamma_0$, $\gamma_1$, $\gamma_2$), including the intercept, and of the polygenic heritability ($h^2$) under the null hypothesis of no SNP/phenotype association, where data are generated under the Gaussian copula model, the generalized linear mixed model (GLMM), the Student-t copula model and the Chi-square copula model, respectively.
Table S5: Empirical bias (×100) of the nuisance parameters, including the intercept, and of the polygenic heritability under the null hypothesis of no SNP/phenotype association within the Gaussian copula model, where the response variable is computed from 10 000 data sets generated under Setting 1, using the polygenic heritability parameter $h^2 \in \{0, 0.2, 0.5\}$; Sd: standard deviation; Se: standard error.
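As an illustration of how the quantities reported in such bias tables can be obtained from replicate estimates, the following is a minimal sketch. The function name `empirical_bias_summary` is hypothetical, and the reading of "Se" as the Monte Carlo standard error of the mean estimate is an assumption; if the tables instead report the average model-based standard error, the corresponding line would have to change.

```python
import numpy as np

def empirical_bias_summary(estimates, true_value):
    """Summarise the replicate estimates of a single parameter.

    estimates  : 1-D sequence, one estimate per simulated data set
    true_value : value of the parameter used to generate the data
    """
    est = np.asarray(estimates, dtype=float)
    n_rep = est.size
    bias = est.mean() - true_value      # empirical bias
    sd = est.std(ddof=1)                # standard deviation across replicates
    se = sd / np.sqrt(n_rep)            # ASSUMPTION: Monte Carlo standard error
    return {"bias_x100": 100.0 * bias, "sd": sd, "se": se}

# Toy usage with three made-up replicate estimates of a parameter equal to 0.2
summary = empirical_bias_summary([0.1, 0.2, 0.3], true_value=0.2)
```

The bias is multiplied by 100 only at the reporting stage, which matches the "(×100)" convention of the captions.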

Additional Simulations
Here, we present the algorithms of the four mechanisms used for the additional simulations.
• "Selection":
Step 1 Generate the genotypes of at least 2000 families composed of 3 individuals using Simulate3.
Step 2 iii. Use the new genotypes and Y of the three categories of 40 families to determine the results.
• "MAR (Missing At Random)": here, the deletion of the parental lines (to obtain the missing data in the families with 4 and 8 members, respectively) depends on the phenotype Y of the child, i.e., the last member of the family.
• "MCAR (Missing Completely At Random)": we assume that 20% of the data is missing in the families with 4 and 8 members, respectively.

QQ-plots under $H_0$ for selection bias and missing genotypes

Empirical bias of the nuisance parameters and the polygenic heritability under the additional simulations
Tables S9-S11 show the empirical bias (×100) of the nuisance parameters ($\gamma_0$, $\gamma_1$, $\gamma_2$), including the intercept, and of the polygenic heritability ($h^2$) under the null hypothesis of no SNP/phenotype association for the selection bias, the missing at random (MAR) and the missing completely at random (MCAR) mechanisms, respectively, where data are generated with the Gaussian copula model.
Table S9: Selection bias: Empirical bias (×100) of the nuisance parameters, including the intercept, and of the polygenic heritability under the null hypothesis of no SNP/phenotype association within the Gaussian copula model, where the response variable is computed from 10 000 data sets generated under the selection bias, using the polygenic heritability parameter $h^2 \in \{0, 0.2, 0.5\}$; Sd: standard deviation; Se: standard error.

Figure S1: QQ-plot under the null hypothesis of no SNP/phenotype association (τ = 0), with the heritability parameter $h^2 = 0$, where the data are generated under the Gaussian copula. Results are computed from 10 000 data sets generated under Setting 1. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.
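For readers who wish to reproduce this type of display, the coordinates of a QQ-plot of p-values under the null hypothesis can be sketched as follows. This is only an illustration: the function name `qq_points` and the plotting positions (i − 0.5)/n are assumptions, not necessarily the choices used for the figures.

```python
import numpy as np

def qq_points(p_values):
    """Return (expected, observed) -log10 p-values for a QQ-plot.

    Under the null hypothesis the p-values are uniform on (0, 1), so the
    i-th smallest p-value is compared with the plotting position (i - 0.5) / n.
    """
    p = np.sort(np.asarray(p_values, dtype=float))
    n = p.size
    expected = (np.arange(1, n + 1) - 0.5) / n   # ASSUMPTION: plotting positions
    return -np.log10(expected), -np.log10(p)

# Toy usage: p-values that sit exactly on the uniform plotting positions
toy_p = (np.arange(1, 6) - 0.5) / 5
x, y = qq_points(toy_p)
```

A well-calibrated test gives points close to the diagonal y = x; points above the diagonal in the upper tail indicate inflated type I error.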

Figure S2: QQ-plot under the null hypothesis of no SNP/phenotype association (τ = 0), with the heritability parameter $h^2 = 0.2$, where the data are generated under the Gaussian copula. Results are computed from 10 000 data sets generated under Setting 1. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Figure S3: QQ-plot under the null hypothesis of no SNP/phenotype association (τ = 0), with the heritability parameter $h^2 = 0$, where the data are generated under the generalized linear mixed model (GLMM). Results are computed from 10 000 data sets generated under Setting 2. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Figure S4: QQ-plot under the null hypothesis of no SNP/phenotype association (τ = 0), with the heritability parameter $h^2 = 0.2$, where the data are generated under the generalized linear mixed model (GLMM). Results are computed from 10 000 data sets generated under Setting 2. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Figure S5: QQ-plot under the null hypothesis of no SNP/phenotype association (τ = 0), with the heritability parameter $h^2 = 0$, where the data are generated under the Student-t copula model (df = 3). Results are computed from 10 000 data sets generated under Scenario 1 of Setting 3. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Figure S8: QQ-plot under the null hypothesis of no SNP/phenotype association (τ = 0), with the heritability parameter $h^2 = 0.2$, where the data are generated under the Chi-square copula model with a non-centrality parameter a = 1. Results are computed from 10 000 data sets generated under Scenario 2 of Setting 3. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Figures S9 and S10 show the power over a grid of values of the variance component τ for two values of the polygenic heritability, $h^2 \in \{0, 0.5\}$, under the Gaussian copula model (Setting 1). Again, these figures illustrate the substantial gain in power achieved by NRVAT with the IBS and the Gaussian kernel similarity matrices.
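To make the two best-performing similarity matrices concrete, the sketch below computes an IBS and a Gaussian kernel between individuals from a genotype matrix coded as minor-allele counts. It follows the usual definitions from the kernel association testing literature; whether NRVAT uses exactly these forms, and the bandwidth `rho`, are assumptions made for illustration, as are the function names.

```python
import numpy as np

def ibs_kernel(G):
    """IBS similarity between individuals.

    G : (n, p) genotype matrix with entries 0, 1 or 2 (minor-allele counts).
    Entry (i, j) is the average proportion of alleles shared identical by
    state over the p variants: sum_k (2 - |g_ik - g_jk|) / (2p).
    """
    G = np.asarray(G, dtype=float)
    p = G.shape[1]
    diff = np.abs(G[:, None, :] - G[None, :, :])    # shape (n, n, p)
    return (2.0 - diff).sum(axis=2) / (2.0 * p)

def gaussian_kernel(G, rho=1.0):
    """Gaussian similarity exp(-||g_i - g_j||^2 / rho); rho is a hypothetical bandwidth."""
    G = np.asarray(G, dtype=float)
    sq_dist = ((G[:, None, :] - G[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / rho)

# Toy usage: three individuals genotyped at three variants
G_toy = np.array([[0, 1, 2],
                  [0, 1, 2],
                  [2, 1, 0]])
K_ibs = ibs_kernel(G_toy)
K_gauss = gaussian_kernel(G_toy, rho=1.0)
```

Both kernels are nonlinear in the genotypes, which is one common explanation for why they can capture effects that a linear kernel misses.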

Figure S9: Power function under the alternative hypothesis of SNP/phenotype association over a grid of τ ∈ {0, 0.01, 0.05, 0.2} for the polygenic heritability parameter $h^2 = 0$, where the data are generated under the Gaussian copula. Results are computed from 1000 data sets generated with 25% of causal variants taken randomly from a region of size 20 under Setting 1. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Figure S10: Power function under the alternative hypothesis of SNP/phenotype association over a grid of τ ∈ {0, 0.01, 0.05, 0.2} for the polygenic heritability parameter $h^2 = 0.5$, where the data are generated under the Gaussian copula. Results are computed from 1000 data sets generated with 25% of causal variants taken randomly from a region of size 20 under Setting 1. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

"Selection", Step 2:
i. Simulate the phenotype Y of the individuals from each of the 2000 families according to the Gaussian copula.
ii. Retain the first 40 families for which the last individual (the child) has Y = 1.
Use the same procedure (Step 1 and Step 2) to obtain the 40 families composed of 4 individuals and those composed of 8 individuals.
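The selection steps described here (simulate Y with a Gaussian copula, then retain the first 40 families whose child is affected) can be sketched as follows. Several ingredients are assumptions made only for illustration: the constant marginal probability `prob` (in the paper it would come from the marginal logistic model), the correlation matrix h²·2Φ + (1 − h²)·I built from a trio kinship structure, and the function names. Genotype simulation with Simulate3 is not reproduced.

```python
import numpy as np
from math import erf, sqrt

def simulate_binary_family(prob, corr, rng):
    """Draw correlated binary traits for one family with a Gaussian copula."""
    z = rng.multivariate_normal(np.zeros(len(prob)), corr)
    # Standard normal CDF turns each latent value into a uniform margin
    u = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in z])
    # Y = 1 when the uniform falls below the marginal probability P(Y = 1)
    return (u <= np.asarray(prob)).astype(int)

def select_first_affected(n_keep, prob, corr, rng, n_families=2000):
    """Keep the first n_keep families whose last member (the child) has Y = 1."""
    kept = []
    for fam in range(n_families):
        y = simulate_binary_family(prob, corr, rng)
        if y[-1] == 1:
            kept.append((fam, y))
        if len(kept) == n_keep:
            break
    return kept

# ASSUMPTION: trio (father, mother, child); twice the kinship matrix
two_phi = np.array([[1.0, 0.0, 0.5],
                    [0.0, 1.0, 0.5],
                    [0.5, 0.5, 1.0]])
h2 = 0.5
corr = h2 * two_phi + (1.0 - h2) * np.eye(3)   # hypothetical dependence structure
rng = np.random.default_rng(2024)
kept = select_first_affected(40, prob=[0.3, 0.3, 0.3], corr=corr, rng=rng)
```

Because only families with an affected child are retained, the retained sample is enriched in cases, which is the source of the selection bias examined in the corresponding QQ-plots and bias table.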
"MAR":
• Step 1 Generate, using Simulate3, the genotypes for a total of 120 families, of which 40 are composed of 3 individuals, 40 are composed of 4 individuals and the remaining 40 are composed of 8 individuals.
• Step 2
i. Simulate Y according to the Gaussian copula.
ii. For families composed of 4 people, delete the first two rows (parents) of the family if the last member has Y = 0. Otherwise (Y = 1), keep the whole family.
iii. For families made up of 8 members, delete the first two rows (grandparents) of the family if the last member has Y = 0. Otherwise, keep the whole family.
iv. Use the new genotypes and Y (obtained after removing these parental lines) to determine the results.
"MCAR":
• Step 1 Generate, using Simulate3, the genotypes for a total of 120 families, of which 40 are composed of 3 individuals, 40 are composed of 4 individuals and the remaining 40 are composed of 8 individuals.
• Step 2
i. Simulate the phenotype Y according to the Gaussian copula.
ii. Randomly delete a few parental lines (for families of 4 and 8 people) with a probability of success equal to 0.2 (20%).
iii. Use the new genotypes and Y (obtained after removing these parental lines) to determine the results.
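The two deletion rules for the parental lines can be summarised in a short sketch. It assumes that each family is stored with its first two rows being the parents (or grandparents) and its last row being the child, and that under MCAR the first two rows of a family are removed together with probability 0.2; the function name `drop_parental_rows` and this per-family reading of the 20% rate are assumptions.

```python
import numpy as np

def drop_parental_rows(G, y, mechanism, rng=None, p_missing=0.2):
    """Apply the MAR or MCAR deletion rule to one family.

    G : (n_members, n_variants) genotype matrix, last row = child
    y : (n_members,) binary phenotypes
    Families of 3 members are always left intact.
    """
    if len(y) not in (4, 8):
        return G, y
    if mechanism == "MAR":
        # Deletion depends on the observed phenotype of the child
        drop = (y[-1] == 0)
    elif mechanism == "MCAR":
        # ASSUMPTION: one Bernoulli(p_missing) draw per family
        drop = rng.random() < p_missing
    else:
        raise ValueError("mechanism must be 'MAR' or 'MCAR'")
    return (G[2:], y[2:]) if drop else (G, y)

# Toy usage: a family of 4 whose child is unaffected loses its parents under MAR
G_fam = np.arange(8).reshape(4, 2)
y_fam = np.array([1, 0, 1, 0])
G_mar, y_mar = drop_parental_rows(G_fam, y_fam, "MAR")
```

The only difference between the two mechanisms is what drives the deletion: the child's phenotype under MAR, and an independent random draw under MCAR.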

Figures S11-S13, S14-S16 and S17-S19 show QQ-plots of the p-values of NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices under the selection bias, the missing at random (MAR) and the missing completely at random (MCAR) mechanisms, respectively, where data are generated under the Gaussian copula model.

Figure S11: Selection bias: QQ-plot under the null hypothesis of no SNP/phenotype association (τ = 0), with the heritability parameter $h^2 = 0$, where the data are generated under the Gaussian copula model. Results are computed from 10 000 data sets generated under the selection bias. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Figure S12: Selection bias: QQ-plot under the null hypothesis of no SNP/phenotype association (τ = 0), with the heritability parameter $h^2 = 0.2$, where the data are generated under the Gaussian copula model. Results are computed from 10 000 data sets generated under the selection bias. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Figure S15: MAR: QQ-plot under the null hypothesis of no SNP/phenotype association (τ = 0), with the heritability parameter $h^2 = 0.2$, where the data are generated under the Gaussian copula model. Results are computed from 10 000 data sets generated under the missing at random mechanism. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Figure S16: MAR: QQ-plot under the null hypothesis of no SNP/phenotype association (τ = 0), with the heritability parameter $h^2 = 0.5$, where the data are generated under the Gaussian copula model. Results are computed from 10 000 data sets generated under the missing at random mechanism. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Figure S18: MCAR: QQ-plot under the null hypothesis of no SNP/phenotype association (τ = 0), with the heritability parameter $h^2 = 0.2$, where the data are generated under the Gaussian copula model. Results are computed from 10 000 data sets generated under the missing completely at random mechanism. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Figure S19: MCAR: QQ-plot under the null hypothesis of no SNP/phenotype association (τ = 0), with the heritability parameter $h^2 = 0.5$, where the data are generated under the Gaussian copula model. Results are computed from 10 000 data sets generated under the missing completely at random mechanism. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Figures S20-S22 and S23-S25 show QQ-plots of all the considered methods for $d^2 = 0.25$ and $d^2 = 0.36$, respectively, where data are generated under the Gaussian copula model.

Figure S20: QQ-plot under the null hypothesis of no SNP/phenotype association (τ = 0), with the heritability parameter $h^2 = 0$, where the data are generated under the Gaussian copula model for $d^2 = 0.25$. Results are computed from 10 000 data sets. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Figure S21: QQ-plot under the null hypothesis of no SNP/phenotype association (τ = 0), with the heritability parameter $h^2 = 0.2$, where the data are generated under the Gaussian copula model for $d^2 = 0.25$. Results are computed from 10 000 data sets. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Table S1: Empirical type I error rate (×100) under the null hypothesis of no SNP/phenotype association (τ = 0), where the data are generated under the Gaussian copula model.
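The empirical type I error rate reported in such tables is the proportion of null data sets in which the test rejects, multiplied by 100. A minimal sketch is given below; the nominal level `alpha`, the accompanying Monte Carlo standard error, and the function name are illustrative assumptions rather than the settings used for the tables.

```python
import numpy as np

def type1_error_x100(p_values, alpha=0.05):
    """Empirical type I error rate (x100) and its Monte Carlo standard error (x100).

    p_values : p-values of one test over data sets simulated under tau = 0
    alpha    : nominal significance level (hypothetical default)
    """
    p = np.asarray(p_values, dtype=float)
    rate = np.mean(p <= alpha)                       # proportion of rejections
    mc_se = np.sqrt(rate * (1.0 - rate) / p.size)    # binomial standard error
    return 100.0 * rate, 100.0 * mc_se

# Toy usage with four made-up p-values
rate_x100, se_x100 = type1_error_x100([0.01, 0.20, 0.04, 0.50], alpha=0.05)
```

With 10 000 replicates and a nominal level of 0.05, the binomial standard error is about 0.22 on the ×100 scale, which gives a sense of how far from 5 a well-calibrated entry can reasonably fall.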

Table S3: Empirical type I error rate (×100) under the null hypothesis of no SNP/phenotype association (τ = 0), where the data are generated under the Student-t copula (df = 3). Results are computed from 10 000 data sets generated under Scenario 1 of Setting 3. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Table S4: Empirical type I error rate (×100) under the null hypothesis of no SNP/phenotype association (τ = 0), where the data are generated under the Chi-square copula with a non-centrality parameter a = 1. Results are computed from 10 000 data sets generated under Scenario 2 of Setting 3. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Table S6: Empirical bias (×100) of the nuisance parameters, including the intercept, and of the polygenic heritability under the null hypothesis of no SNP/phenotype association within the generalized linear mixed model (GLMM), where the response variable is computed from 10 000 data sets generated under Setting 2, using the polygenic heritability parameter $h^2 \in \{0, 0.2, 0.5\}$; Sd: standard deviation; Se: standard error.

Table S7: Empirical bias (×100) of the nuisance parameters, including the intercept, and of the polygenic heritability under the null hypothesis of no SNP/phenotype association within the Student-t copula (df = 3), where the response variable is computed from 10 000 data sets generated under Scenario 1 of Setting 3, using the polygenic heritability parameter $h^2 \in \{0, 0.2, 0.5\}$; Sd: standard deviation; Se: standard error.

Table S8: Empirical bias (×100) of the nuisance parameters, including the intercept, and of the polygenic heritability under the null hypothesis of no SNP/phenotype association within the Chi-square copula model with a non-centrality parameter a = 1, where the response variable is computed from 10 000 data sets generated under Scenario 2 of Setting 3, using the polygenic heritability parameter $h^2 \in \{0, 0.2, 0.5\}$; Sd: standard deviation; Se: standard error.

Table S10: MAR: Empirical bias (×100) of the nuisance parameters, including the intercept, and of the polygenic heritability under the null hypothesis of no SNP/phenotype association within the Gaussian copula model, where the response variable is computed from 10 000 data sets generated under the missing at random (MAR) mechanism, using the polygenic heritability parameter $h^2 \in \{0, 0.2, 0.5\}$; Sd: standard deviation; Se: standard error.

Table S11: MCAR: Empirical bias (×100) of the nuisance parameters, including the intercept, and of the polygenic heritability under the null hypothesis of no SNP/phenotype association within the Gaussian copula model, where the response variable is computed from 10 000 data sets generated under the missing completely at random (MCAR) mechanism, using the polygenic heritability parameter $h^2 \in \{0, 0.2, 0.5\}$; Sd: standard deviation; Se: standard error.

Table S12: Empirical type I error rate (×100) under the null hypothesis of no SNP/phenotype association (τ = 0), where the data are generated under the Gaussian copula model for $d^2 = 0.25$. Results are computed from 10 000 data sets generated under Setting 1. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.

Table S13: Empirical type I error rate (×100) under the null hypothesis of no SNP/phenotype association (τ = 0), where the data are generated under the Gaussian copula model for $d^2 = 0.36$. Results are computed from 10 000 data sets generated under Setting 1. The compared methods are: NRVAT with the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices; SMMAT with the hybrid test (O) and the efficient hybrid test (E); AFC with $X_c^2$ (Xc) and $W_{QLS}$ (QLS); and gSKAT with the Asymptotic and Perturbed tests. SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; gSKAT: burden and kernel-based gene set association tests for binary traits.