Sparse Regression in Cancer Genomics: Comparing Variable Selection and Predictions in Real World Data

Background: Evaluation of gene interaction models in cancer genomics is challenging, as the true distribution is uncertain. Previous analyses have benchmarked models using synthetic data or databases of experimentally verified interactions, approaches which are susceptible to misrepresentation and incompleteness, respectively. The objectives of this analysis are to (1) provide a real-world, data-driven approach for comparing the performance of genomic model inference algorithms, (2) compare the performance of LASSO, elastic net, best-subset selection, $L_0L_1$ penalisation and $L_0L_2$ penalisation in real genomic data and (3) compare algorithmic preselection according to performance in our benchmark datasets with algorithmic selection by internal cross-validation. Methods: Five large (n ≈ 4000) genomic datasets were extracted from Gene Expression Omnibus. 'Gold-standard' regression models were trained on subspaces of these datasets (n ≈ 4000, p = 500). Penalised regression models were trained on small samples from these subspaces (n ∈ {25, 75, 150}, p = 500) and validated against the gold-standard models. Variable selection performance and out-of-sample prediction were assessed. Penalty 'preselection' according to test performance in the other 4 datasets was compared to selection by internal cross-validation error minimisation. Results: $L_1L_2$-penalisation achieved the highest cosine similarity between estimated coefficients and those of the gold-standard models. $L_0L_2$-penalised models explained the greatest proportion of variance in test responses, though performance was unreliable in low signal-to-noise conditions. $L_0L_2$ also attained the highest overall median variable selection F1 score. Penalty preselection significantly outperformed selection by internal cross-validation on each of the 3 examined metrics. Conclusions: This analysis explores a novel approach for comparing model selection approaches in real genomic data from 5 cancers. Our benchmarking datasets have been made publicly available for use in future research. Our findings support the use of $L_0L_2$ penalisation for structural selection and $L_1L_2$ penalisation for coefficient recovery in genomic data. Evaluation of learning algorithms according to observed test performance in external genomic datasets yields valuable insights into actual test performance, providing a data-driven complement to internal cross-validation in genomic regression tasks. We demonstrate that modelling decisions may be supported by our evaluation method, an approach which may complement cross-validation.



Regression models in cancer genomics
High-dimensional regression problems are ubiquitous in modern oncological research, as datasets often contain fewer observations than variables. 1-7 The tractability of penalised regression approaches in this setting has led to a large volume of research into their applications. 1,7-9 Penalised regression offers robust predictions in high-dimensional data and mechanistic insights through the estimated coefficient vector. 1,7 $L_0$ and $L_1$ penalties perform variable selection inherently, by shrinking small dependencies to zero. 9-11 However, it is difficult to test the assumptions which penalised approaches require for valid model selection in real-world datasets. 12,13 Furthermore, standard model selection approaches such as cross-validation and the Bayesian information criterion may be unreliable in the high-dimensional setting. 14,15

Penalised regression
The matrix inverse $(X^TX)^{-1}$ required by ordinary least squares is undefined when $n < p$, precluding its use in this setting. 13,16 Penalised regression methods facilitate modelling in the high-dimensional setting through the addition of bias terms. $L_0$-, $L_1$- and $L_2$-penalised linear regression may be formulated generally such that:

$$\hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda_0 \lVert \beta \rVert_0 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2$$

Here, notation is conventionally abused such that the $L_0$ 'pseudo-norm' $\lVert \beta \rVert_0$ counts the number of nonzero elements in $\beta$. 10 Ridge regression 17 penalises the model by the squared $L_2$ norm of the coefficients ($\lambda_0 = \lambda_1 = 0$, $\lambda_2 \neq 0$), balancing predictive error against coefficient magnitude. The imposed preference for smaller coefficients is termed 'shrinkage', and the magnitude of the shrinkage effect is controlled by the $\lambda_2$ hyperparameter.
Ridge regression partially alleviates instability under collinearity by constraining coefficient magnitude. 16 The least absolute shrinkage and selection operator (LASSO) 11 penalises the model by the $L_1$ norm of the coefficients ($\lambda_0 = \lambda_2 = 0$, $\lambda_1 \neq 0$). The LASSO approach has 'oracle' properties under some conditions, meaning that predictions are nearly as good as if the true set of predictor variables were known. 18,19 An additional benefit of LASSO shrinkage is a tendency to shrink small coefficients exactly to zero, leading to a sparse $\hat{\beta}$ in which the non-zero coefficients are deemed predictive. Thus, the LASSO inherently performs variable selection. 11 This behaviour is highly useful in bioinformatics, where analytic tasks often require the selection of a small number of predictive variables from a large candidate set. However, the LASSO model structure is subject to inconsistency under subsampling. 12 The elastic net 20 combines the sparsity of $L_1$ penalisation with the consistency of $L_2$ penalisation ($\lambda_0 = 0$; $\lambda_1, \lambda_2 \neq 0$), with improved results in several bioinformatic studies. 1,21 The ridge, LASSO and elastic net penalties affect large coefficients more than small coefficients, biasing coefficient estimates. 'Best subset selection' provides a theoretical solution to this issue through the selection of the optimal model attainable with $k \in \mathbb{N}$ or fewer predictor variables, such that 10 :

$$\hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 \quad \text{subject to} \quad \lVert \beta \rVert_0 \leq k$$

Thus, for some $\lambda_0 \geq 0$, we have an equivalent Lagrangian expression:

$$\hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda_0 \lVert \beta \rVert_0$$

Best subset selection may therefore be approximated through $L_0$ penalisation ($\lambda_0 \neq 0$, $\lambda_1 = \lambda_2 = 0$) in some conditions. 10 $L_0$ penalisation applies no shrinkage to the selected predictors, resulting in unbiased coefficient estimates. 10 This combination of simplicity and unbiasedness has been described as a 'holy grail' of sparse modelling. 9 However, $L_0$-penalised models suffer from inconsistency. 22 Furthermore, issues such as non-convexity and NP-hardness complicate best-subset model selection. 9,23 Recent developments such as mixed integer optimisation 10 have facilitated best subset model learning. Combinations of the $L_0$ penalty with $L_1$ or $L_2$ penalties have been suggested to increase the consistency of best subset selection whilst maintaining minimal bias. 24
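For orientation, the correspondence between these penalties and common software settings can be sketched as follows. This is a minimal illustration in R, assuming the glmnet and L0Learn packages (widely used implementations of these penalties); the simulated data and settings are illustrative only and are not the configuration used in this study.

```r
# Illustrative R sketch: fitting each penalty on simulated data.
# glmnet and L0Learn are assumed; settings are illustrative only.
library(glmnet)
library(L0Learn)

set.seed(1)
n <- 75; p <- 500
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 10), rep(0, p - 10))          # sparse true coefficients
y <- drop(X %*% beta + rnorm(n))

ridge <- glmnet(X, y, alpha = 0)               # lambda2 > 0, lambda1 = 0
lasso <- glmnet(X, y, alpha = 1)               # lambda1 > 0, lambda2 = 0
enet  <- glmnet(X, y, alpha = 0.5)             # lambda1, lambda2 > 0
l0    <- L0Learn.fit(X, y, penalty = "L0")     # best-subset surrogate
l0l2  <- L0Learn.fit(X, y, penalty = "L0L2")   # combined L0 + L2 penalty
```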

Assessing variable selection in genomic models
The true generating distribution of observational biological data is typically uncertain, complicating validation of estimated coefficient vectors. Consequently, many model assessments have employed synthetic 9,15,24-27 or semi-synthetic 1,10,28-30 datasets to assess variable selection performance. Real-data analyses have focussed primarily on the models' predictive capacity. 31-33 Accurate predictions may not guarantee correct model structure, especially in the highly collinear conditions commonly encountered in genomics. The representativeness of synthetic datasets is both uncertain and untestable. 29 Furthermore, the results of these studies have been discordant, suggesting dependence on the benchmark datasets and validation techniques. 9,10 Genomic databases such as REACTOME 34 and KEGG 35 contain experimentally verified interactions, which may be used to externally validate genomic model structure. This approach has been used in previous analyses 27,29,36,37 but is limited by the uncertain completeness of such databases. Furthermore, the activity profile of interactions between a given set of genes may change with experimental conditions and unobserved confounders. 38,39 Consequently, the set of active predictors for a specific dataset may not align exactly with a static database. Finally, effect sizes may not be comparable between documented interactions, precluding the assessment of model coefficients by this method. Data partitioning facilitates model validation without ground-truth data by assessing model generalisability to unseen observations. As training and validation observations are sampled from the same data, their distributions are asymptotically identical. However, the distribution may be difficult to estimate when $n \ll p$, and data partitioning favours excessively complex models in this setting. 14,15 Given the limitations of currently available methods for assessing variable selection performance in genomic data, an urgent need exists for a novel approach.

Study objectives
The primary objectives of this study were to:

• Provide a real-world, data-driven approach for comparing the performance of high-dimensional model inference algorithms in cancer genomics, for both prediction and variable selection.
• Compare the performance of penalised linear regression methods for prediction and variable selection.
• Compare algorithmic selection by internal cross-validation to preselection according to performance in external test datasets under our validation approach.

These objectives are realised by subsampling real $n > p$ genomic datasets to simulate $n \ll p$ conditions, allowing for robust, data-driven validation of model structure and predictions in large-sample test partitions.

Materials and Methods Data
Five cancer genomics datasets were extracted from Gene Expression Omnibus 40 with the GEOquery library. 41 Local institutional review board approval and informed participant consent were documented in each data publication. 42
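As an illustration of how such series can be retrieved, the sketch below uses the GEOquery package named above; the accession shown is one of the study's datasets, but the authors' exact retrieval and parsing steps are not specified here, and for single-cell series the expression matrix often resides in supplementary files rather than the series matrix.

```r
# Illustrative GEO retrieval with GEOquery (Bioconductor); parsing details
# beyond this generic pattern are assumptions.
library(GEOquery)

gse  <- getGEO("GSE89567", GSEMatrix = TRUE)  # list of ExpressionSet objects
eset <- gse[[1]]
expr <- exprs(eset)   # expression matrix (may be empty for single-cell series)
meta <- pData(eset)   # sample-level metadata

# For single-cell series, counts/TPM tables are often distributed as
# supplementary files instead:
# supp <- getGEOSuppFiles("GSE89567")
```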

GSE89567
GSE89567 46 contains 6341 single-cell RNAseq profiles from patients with isocitrate dehydrogenase mutant astrocytoma at Massachusetts General Hospital. Tumour tissue was collected from surgical resections and malignancy was confirmed under frozen section. Following disaggregation, profiling was performed by Smart-seq2. Sequencing was performed on the Illumina NextSeq 500 and transcript-per-million values were reported.

Data preprocessing
Where datasets had > 5000 variables (GSE103322 and GSE146026), subspaces were extracted, retaining the 1000 variables with the fewest nonzero entries. Datasets were transformed with the Gaussian ECDF function 47,48 :

$$\tilde{x}_{ij} = \Phi^{-1}\big(\hat{F}_j(x_{ij})\big)$$

Here $\Phi(\cdot)$ is the standard normal cumulative distribution function and $\hat{F}_j$ is the empirical cumulative distribution function of variable $j$. To ensure uniqueness of the gold-standard model, a pivoted QR factorisation was performed and perfectly collinear variables were removed:

$$XP = QR$$

Here $Q$ is an orthogonal matrix, $R$ is an upper triangular matrix and $P$ is a permutation matrix. A full-rank subspace was then extracted from $X$ using this factorisation.
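A minimal R sketch of these two preprocessing steps, assuming the reconstruction above (inverse-normal transform of each variable's empirical CDF, followed by a pivoted QR factorisation to drop perfectly collinear columns); the stand-in matrix and tolerance handling are illustrative.

```r
# Stand-in for an observations x genes expression matrix.
set.seed(1)
X <- matrix(rexp(200 * 50), nrow = 200, ncol = 50)
X <- cbind(X, X[, 1])  # add a perfectly collinear column for illustration

# Gaussian ECDF transform: ranks rescaled to (0, 1), then mapped through the
# standard normal quantile function.
gaussian_ecdf <- function(x) qnorm(rank(x, ties.method = "average") / (length(x) + 1))
X_g <- apply(X, 2, gaussian_ecdf)

# Pivoted QR factorisation (X P = Q R): columns beyond the numerical rank are
# dropped so that the gold-standard OLS fit is unique.
qr_dec <- qr(X_g)
keep   <- sort(qr_dec$pivot[seq_len(qr_dec$rank)])
X_full_rank <- X_g[, keep, drop = FALSE]
dim(X_full_rank)  # 200 x 50 after removing the duplicated column
```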

Experiment setup
In each experiment, 500 design variables and a response variable were randomly selected from the available gene expression variables in 1 of the 5 datasets. A small proportion of the observations (n ∈ {25, 75, 150}) was sampled for training, and penalised regression models were fitted using default library parameters (Table 1). Regularisation hyperparameters were selected by either 5-fold or 10-fold cross-validation on the training observations, optimising the mean squared error, a typical approach in genomic analyses. 1,6,7,49,50 The same cross-validation folds were employed for each penalisation method in a given experiment. Predictive performance and variable selection performance were assessed using the remaining test observations. Experiments were repeated for 100 different training samples, for each of the 5 datasets and for both cross-validation routines, yielding 1000 experiments with which to compare penalisation methods at each sample size.
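The overall structure of a single experiment might look like the sketch below. The use of glmnet and L0Learn, the fold construction and the n = 75 training size are assumptions for illustration, not the study's exact configuration (in particular, fold sharing across packages is simplified here).

```r
library(glmnet)
library(L0Learn)

set.seed(42)
# Stand-in for a preprocessed, full-rank dataset with ~4000 observations.
full <- matrix(rnorm(4000 * 600), nrow = 4000)

# Randomly select 500 design variables and 1 response variable.
idx   <- sample(ncol(full), 501)
X_all <- full[, idx[1:500]]
y_all <- full[, idx[501]]

# Small training sample; the remaining observations form a large test partition.
n_train <- 75
tr   <- sample(nrow(full), n_train)
X_tr <- X_all[tr, ];  y_tr <- y_all[tr]
X_te <- X_all[-tr, ]; y_te <- y_all[-tr]

# Shared 5-fold assignment for the glmnet-based methods.
folds <- sample(rep(1:5, length.out = n_train))

cv_lasso <- cv.glmnet(X_tr, y_tr, alpha = 1,   foldid = folds)
cv_enet  <- cv.glmnet(X_tr, y_tr, alpha = 0.5, foldid = folds)
cv_ridge <- cv.glmnet(X_tr, y_tr, alpha = 0,   foldid = folds)
cv_l0l2  <- L0Learn.cvfit(X_tr, y_tr, penalty = "L0L2", nFolds = 5)

# Test predictions at the cross-validation-selected hyperparameters (lasso shown).
pred_lasso <- predict(cv_lasso, newx = X_te, s = "lambda.min")
```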

Metrics
Model assessment metrics and notation followed previous comparative analyses. 9,10 As the true coefficient vector, $\beta \in \mathbb{R}^p$, was unknown in our experiments, it was estimated by ordinary least squares regression (without intercept) on the whole dataset ($n \approx 4000$, $p = 500$), such that:

$$\beta^* = (X^T X)^{-1} X^T y$$

Thus, $\beta^*$ represents a noisy gold standard rather than strict ground truth. Here $x_0 \in \mathbb{R}^p$ denotes a test observation from the design matrix and $y_0 \in \mathbb{R}$ denotes the associated response. Hastie et al 9 measured 3 metrics of predictive performance: proportion of variance explained (PVE), relative risk (RR) and relative test error (RTE).
Higher PVE indicates a superior fit, and PVE is limited by the signal-to-noise ratio (SNR) such that 9 :

$$\text{PVE} \leq \frac{\text{SNR}}{1 + \text{SNR}}$$

Relative risk (RR) was employed as a performance metric in Bertsimas' analysis. 10 The optimal relative risk is 0 and that of the null model is 1.
Relative test error (RTE) compares the test error to the noise variance:

$$\text{RTE}(\hat{\beta}) = \frac{\mathbb{E}\big[(y_0 - x_0^T\hat{\beta})^2\big]}{\sigma^2}$$

Following calls for model coefficient similarity assessment, 9 we measured the cosine similarity of $\hat{\beta}$ and $\beta^*$, such that:

$$\cos(\hat{\beta}, \beta^*) = \frac{\hat{\beta}^T \beta^*}{\lVert \hat{\beta} \rVert_2 \, \lVert \beta^* \rVert_2}$$

Active (non-zero) variable selection performance was also estimated under $\beta^*$. Coefficient significance was estimated with t-tests:

$$t_j = \frac{\beta^*_j}{\widehat{\text{SE}}(\beta^*_j)}$$

Significance was adjusted for multiple comparisons using false-discovery-rate (FDR) control 53 and predictors were classified according to a cutoff of $\alpha = 0.05$. Precision, recall and F1 score were measured. Hereafter, these metrics are referred to collectively as the 'discrete' variable selection metrics. Undefined variable selection results (due to division-by-zero errors) were replaced with zeros. Figure 1 depicts the variable selection validation method graphically.
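Continuing the objects from the preceding sketch (X_all, y_all, X_te, y_te, cv_lasso), the metrics reconstructed above could be computed roughly as follows; this is an illustrative implementation under those assumptions, not the authors' code.

```r
# Gold-standard coefficients: OLS without intercept on the full subspace.
gold        <- lm(y_all ~ X_all - 1)
beta_star   <- coef(gold)
p_fdr       <- p.adjust(summary(gold)$coefficients[, 4], method = "fdr")
active_star <- p_fdr < 0.05                    # gold-standard active set

# Coefficients of a penalised model (lasso shown), dropping the intercept.
beta_hat <- as.numeric(as.matrix(coef(cv_lasso, s = "lambda.min")))[-1]

# Predictive metrics on the large test partition.
pred   <- X_te %*% beta_hat
pve    <- 1 - mean((y_te - pred)^2) / var(y_te)   # proportion of variance explained
sigma2 <- mean(residuals(gold)^2)                 # crude noise variance estimate
rte    <- mean((y_te - pred)^2) / sigma2          # relative test error
rr     <- mean((X_te %*% (beta_hat - beta_star))^2) /
          mean((X_te %*% beta_star)^2)            # empirical relative risk

# Cosine similarity between estimated and gold-standard coefficients.
cos_sim <- sum(beta_hat * beta_star) /
           (sqrt(sum(beta_hat^2)) * sqrt(sum(beta_star^2)))

# Discrete variable selection metrics against the FDR-thresholded gold standard.
sel       <- beta_hat != 0
tp        <- sum(sel & active_star)
precision <- ifelse(sum(sel) == 0, 0, tp / sum(sel))
recall    <- ifelse(sum(active_star) == 0, 0, tp / sum(active_star))
f1        <- ifelse(precision + recall == 0, 0,
                    2 * precision * recall / (precision + recall))
```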
To evaluate our model validation approach, we deployed it as a penalty preselection method and compared it to traditional selection by minimisation of the internal cross-validation error. For each experiment and each of 3 comparison metrics (PVE, F1 and coefficient similarity), a penalisation method was 'preselected' according to its performance in experiments of equivalent sample size in the other 4 datasets. Within each such external experiment, the penalisation methods' performances were ranked, and the method with the best (lowest) aggregate rank was selected. The test performance of this preselected method was compared to that of the penalisation method which yielded the lowest mean squared error under internal cross-validation. The overall performance of preselected penalties was compared with that of penalties selected by internal cross-validation using a 2-sided paired t-test over all 3000 experiments.
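The preselection rule can be sketched as follows; the data frame layout, column names, method labels and example values are hypothetical, and the real procedure aggregates ranks over 100 experiments per dataset at the matching sample size.

```r
# prior_results: hypothetical table of external experiments, one row per
# (experiment, method), where metric_value is the chosen comparison metric
# (higher is better).
preselect <- function(prior_results, holdout_dataset) {
  ext <- subset(prior_results, dataset != holdout_dataset)
  # Rank methods within each external experiment (rank 1 = best).
  ext$rank <- ave(-ext$metric_value, ext$experiment, FUN = rank)
  agg <- aggregate(rank ~ method, data = ext, FUN = sum)
  agg$method[which.min(agg$rank)]          # best aggregate rank wins
}

# Hypothetical usage with one experiment per external dataset.
prior_results <- data.frame(
  experiment   = rep(1:4, each = 3),
  dataset      = rep(c("datasetA", "datasetB", "datasetC", "datasetD"), each = 3),
  method       = rep(c("L1", "L1L2", "L0L2"), times = 4),
  metric_value = c(0.31, 0.35, 0.40, 0.22, 0.25, 0.28,
                   0.18, 0.21, 0.20, 0.30, 0.29, 0.33)
)
preselect(prior_results, holdout_dataset = "datasetE")
```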

Predictive performance
Predictive performance metrics are provided in Figure 2 and Table 2. $L_0L_2$-penalised models achieved the highest PVE overall.

Variable selection
Variable selection performance metrics are provided in Figure 4 and Table 3. Cumulative distribution functions of the performance improvements yielded under preselection are provided in Figure 5. In some experiments, internal validation outperformed preselection (Table 4).

Discussion
The optimal penalisation method for a particular dataset depends upon the project objectives, data distribution and noise levels. In most applications, reliability is paramount: the strong median predictive performance provided by $L_0L_1$ and $L_0L_2$ penalisation is unlikely to compensate for their poor worst-case performance, which may be undetectable in application. $L_1L_2$ penalisation offered strong coefficient similarity, though few coefficients were shrunk to zero, limiting its utility for the selection of parsimonious model structures. $L_1$ and $L_1L_2$ penalties also offered reliable test predictions in noisy data. $L_1$ penalisation is simpler to implement than the combined penalties, requiring tuning of a single hyperparameter. Furthermore, the theory surrounding $L_1$ penalisation in the $n \ll p$ setting is well studied. 1,7,12,54 Various computational implementations of this method are available, and it is the fundamental building block for graph inference methods such as the graphical LASSO 55 and the nodewise LASSO. 56 $L_0$ penalisation resulted in weakly predictive models and poor variable selection, due primarily to inadequate recall. These limitations overshadowed any potential advantage of theoretical unbiasedness. 10 Penalty preselection yielded small yet significant improvements over selection by internal cross-validation on each examined metric, demonstrating the value of external, data-driven preselection of model learning algorithms for $n \ll p$ datasets. This approach may serve as a complementary methodological validation measure for genomic datasets.

Related work
Bertsimas et al 10 found that $L_0$ penalisation outperformed $L_1$ penalisation and forward stepwise regression in their comparisons. However, this result was contested in the comparisons of Hastie et al, 9 who concluded that $L_1$ outperformed $L_0$ in all but high signal-to-noise conditions. Hazimeh and Mazumder 24 found that $L_0L_1$ and $L_0L_2$ penalties typically outperformed $L_1$, a finding which concurs with our experiments.

Limitations
The primary limitation of this analysis is uncertainty regarding the true generating distributions of the datasets. In place of ground truth, a 'gold standard' was set using a much larger number of observations. Thus, our analysis evaluates each method's capacity to recover the model which would have been found in a much larger study of the same population, a reasonable objective in many clinical studies. As the gold-standard models were fitted to a finite number of observations, they were susceptible to some degree of overfitting.

Figure 5. Cumulative distribution functions of performance improvement under penalty preselection, compared with selection by internal cross-validation. For each experiment and each comparison metric, the penalisation method with the best test performance in the other 4 datasets was selected. This 'preselected' penalisation method was compared to that which minimised the mean squared error in internal cross-validation. 3000 experiments were included in the comparison.
Observations were not strictly partitioned on a patient-disjoint basis. In the typical clinical modelling scenario, estimation of model generalisability to new patients would require patient-disjoint partitioning and validation. 57 However, distributional identicality of the training and test data would not have been guaranteed in such conditions, biasing assessment metrics in favour of underfitted models.
Bertsimas and Hastie both considered which SNR ranges were 'realistic': Bertsimas generated tasks with SNR ∈ [2, 10], whereas Hastie examined the SNR ∈ [0.05, 6] setting. 9,10 Our estimated SNRs align with those of Hastie. Had the gold-standard models overfitted, noise levels would have been underestimated; therefore, SNR estimates in this analysis are positively biased. Nonzero coefficients were defined according to a traditional, yet arbitrary, significance cutoff, so small effects may have been omitted erroneously. Likewise, some spuriously large coefficients may have been included.
Discrete variable selection metrics (precision, recall and F1 score) lacked the granularity required to compare penalisation methods at the $n = 25$ level. This limitation was particularly important because active variables were estimated according to a sharp significance cutoff. The coefficient similarity metric proved useful in this regard, as it was continuous and independent of any significance cutoff. However, coefficient similarity provides little insight into model complexity, a central aspect of genomic network inference. Indeed, although $L_1L_2$ penalisation optimised the coefficient similarity metric, it selected extremely complex models in most experiments, resulting in weak precision.
Real-world genomic datasets were employed in this analysis. Accordingly, our results are expected to be more representative of actual experimental modelling conditions. Data-driven model assessment was facilitated by the large number of observations available in these datasets. However, our results may not generalise to datasets with incomparably distributed signal or noise. Logistic and Cox regression tasks present additional challenges, such as class imbalance and censoring, which are beyond the scope of this analysis.

Conclusions
$L_0L_2$-penalised models provided the best test predictions, though performance was unreliable in noisy data. $L_0L_2$ penalisation also optimised the discrete variable selection metrics. $L_1L_2$ penalisation offered reliable test predictions in all settings and superior coefficient similarity. Further research is required to establish the performance of these penalties in classification and survival tasks. Evaluation of learning algorithms according to observed test performance in external genomic datasets yields valuable insights into actual test performance, providing a data-driven complement to internal cross-validation in genomic regression tasks.

Author Contributions
Conception and design: all authors. Administrative support: N/A. Provision of study materials or patients: N/A. Collection and assembly of data: Robert O'Shea. Data analysis and interpretation: Robert O'Shea. Manuscript writing: all authors. Final approval of manuscript: all authors.

Ethics Statement
This article does not contain any studies with human participants or animals performed by any of the authors.

Supplemental Material
Supplemental material for this article is available online.