Efficiency Analysis of Competing Tests for Finding Differentially Expressed Genes in Lung Adenocarcinoma

In this study, we introduce and use Efficiency Analysis to compare differences in the apparent internal and external consistency of competing normalization methods and tests for identifying differentially expressed genes. Using publicly available data, two lung adenocarcinoma datasets were analyzed using caGEDA (http://bioinformatics2.pitt.edu/GE2/GEDA.html) to measure the degree of differential expression of genes existing between two populations. The datasets were randomly split into at least two subsets, each analyzed for differentially expressed genes between the two sample groups, and the gene lists compared for overlapping genes. Efficiency Analysis is an intuitive method that compares the differences in the percentage of overlap of genes from two or more data subsets, found by the same test over a range of testing methods. Tests that yield consistent gene lists across independently analyzed splits are preferred to those that yield less consistent inferences. For example, a method that exhibits 50% overlap in the 100 top genes from two studies should be preferred to a method that exhibits 5% overlap in the top 100 genes. The same procedure was performed using all available normalization and transformation methods that are available through caGEDA. The ‘best’ test was then further evaluated using internal cross-validation to estimate generalizable sample classification errors using a Naïve Bayes classification algorithm. A novel test, termed D1 (a derivative of the J5 test) was found to be the most consistent, and to exhibit the lowest overall classification error, and highest sensitivity and specificity. The D1 test relaxes the assumption that few genes are differentially expressed. Efficiency Analysis can be misleading if the tests exhibit a bias in any particular dimension (e.g. expression intensity); we therefore explored intensity-scaled and segmented J5 tests using data in which all genes are scaled to share the same intensity distribution range. Efficiency Analysis correctly predicted the ‘best’ test and normalization method using the Beer dataset and also performed well with the Bhattacharjee dataset based on both efficiency and classification accuracy criteria.


Introduction
Cancer research has generated a rich and complex body of knowledge, revealing cancer to be a disease involving dynamic changes in the genome. 1 Research over the past decades has revealed a number of molecular, biochemical, and cellular traits shared by most and perhaps all types of human cancer. 2 However, today there is a need to look at all cancers from different perspectives, using genomic 1,3 and proteomic 4,5 molecular techniques and try to achieve results in ways that never seemed possible. 6 Lung cancer remains the leading cause of cancer death in industrialized countries, claiming more than 150,000 lives annually in the US. 7 The overall 10-year survival rate is a staggering 8%-10%, 8 and it is currently impossible to identify the high-risk patients. 9 Today, lung carcinoma classifi cation is based on clinicopathology. The subclassifi cation of lung cancers is very challenging, and studies have shown that different pathologists have differed on subclassifi cation of the same carcinomas more than half of the time. 10 Determining metastasis of non-lung origin from lung adenocarcinomas is also diffi cult. 11,12 A broader basis of the molecular biology of lung carcinomas could help aid in prediction of outcome, choice of therapies, and identifi cation of new biomarkers. 13 Microarray technology has made it possible for researchers to search for molecular markers of most cancers. 14-20 A number of lung cancer studies have settled upon variable numbers in different sets of genes for successful prognostic classifi ers, including Xi et al. 21 (318 genes), Lu et al. 22 (64 genes), Sun et al. 23 (50 genes), Jiang et al. 24 (36 genes), Bianchi et al. 25 (10 genes), Chen et al. 26 (5 genes), and Lau et al. 27 (3 genes). The resulting classifi ers may be used in all areas of prediction such as detection, choice of therapy, metastasis, and survival outcomes. While many of these models may exhibit very good accuracy, none have been clinically implemented.
By making their datasets publicly available, researchers provide opportunities for re-analysis of their data by others, including the opportunity for the comparison of a variety of distinct approaches to analysis. New methods of analysis can be developed and evaluated, hopefully leading to improvements in our understanding of the differences in the performance of distinct methods. In this study, we have re-analyzed the publicly available lung adenocarcinoma datasets of Beer et al. 9 and Bhattacharjee et al. 13 (both extensively studied at CAMDA 2003 conference; http://www. camda.duke.edu/camda03.html). Additionally, Effi ciency Analysis was also performed using data from the Guo et al. 28 rat toxicogenomic study data (MAQC Project). We will not try to duplicate or scrutinize their results, but rather use their data to assess a variety of analysis methods.
It is important to mention that the word 'efficiency' is used in statistics for describing the relative size of the variance of an estimator to a fully effi cient estimator like the maximum likelihood. We introduce the concept of an empirical method to study effi ciency through the consistency of statistical inferences made with any estimator. The term here is used in a slightly more generic manner as the relative internal consistency at a particular sample size. Methods that show higher internal consistency at a small N have more apparent power, and are therefore more efficient because they require lower N to achieve the same or higher observed degree of internal consistency compared to methods that require larger N.

Data retrieval
Portions of both datasets, Beer, et al. and Bhattacharjee, et al. are already available on the caGEDA website (http://bioinformatics2.pitt.edu/ GE2/GEDA.html), 29 and the data can also be obtained by links from the subsequent journal articles. 9,13 Each dataset contained 5377 genes. The Beer dataset used contains data from 69 neoplastic lung adenocarcinoma samples and 17 nonneoplastic samples. The Bhattacharjee dataset used contained 52 neoplastic and 17 non-neoplastic samples. The datasets used either the Affymetrix HUGeneFL or the HG_U95Av2 microarray chip platforms, respectively. Probe sets were merged by joining Unigene cluster ID's.

Data and statistical analysis
All test, normalization, and transformation analyses were performed using caGEDA, a freely available informatics tool. The datasets were analyzed for differentially expressed genes (DE genes). Effi ciency Analysis was performed followed by Random Resampling Validation (RRV) using a Naïve Bayes Classifier, and PACE Analysis (all described below).

Effi ciency analysis
Effi ciency Analysis is a new method for comparing the apparent internal, or external, consistency in the list of genes found by competing methods for feature selection. In its current application, Efficiency Analysis is implemented as a method that compares the differences in the percentage of overlap of two or more methods found at the same numerical index (number) of overlapping genes, in randomly split datasets. For example, a method that exhibits 50% overlap in the 100 top genes from two studies should be preferred to a method that exhibits 5% overlap in the top 100 genes. For any test (using the same starting list), the lists overlap is 100% when all genes are included (no feature selection), assuming the same chip content. As one applies increasing stringency to the criterion for feature selection to two datasets, the number of genes that are retained decreases, and therefore the number that overlap between the two lists decreases. For any method that is less than perfectly consistent, the proportion of overlap decays as well (as the number of genes approaches zero). Methods exhibiting the highest percent overlap, at a given threshold of a fi lter or test, (which defi nes that number of genes in the observer overlap), are considered the most effi cient. For internal Effi ciency Analysis, a dataset of N samples is split into n nonoverlapping sets, each with N/n samples. For external Effi ciency Analysis, two (or more) independent datasets are generated that address the same biological or clinical question. In either case (internal or external), a given test T (t-, F-, foldchange, etc) is applied to all genes in each dataset separately. The threshold (cut-point) associated with signifi cance level is increased in a stepwise manner over the range of significance levels (thresholds) for the test in n datasets. The percent overlap (O) in the independently determined ranked gene lists is determined at each threshold value (O is the relative size of the area of the overlap to the non-overlap in a Venn diagram (Fig. 1). N1 and N2 are the number of genes found to be signifi cant at a given threshold value in dataset 1 and 2, respectively, and N3 is the number of genes in the intersection of the datasets (size of overlap, then O = N3/ (N1 + N2 − N3). Plotting O VS. N3 creates an 'effi ciency curve' for a given test (Fig. 2). For all values of N3ϽϽM (the total number of genes), the test (t-, F-, fold-change, etc.) with the highest O at a given N3 is said to be the most effi cient. For any given test, the threshold associated with N3ϽϽM associated with the highest O is speculated to be optimal for that test. This approach does not attempt to estimate the false discovery rate (FDR; expected ratio of false positives among the signifi cant results) but instead uses empirical consistency to guide researchers in the selection of competing diverse methods. The curve also may point to a test-threshold that exhibits increased local effi ciency.
Initially, each dataset was split randomly into four representative groups; with each containing at least four normal samples and with one of the four groups containing five normals. Internal Effi ciency Analysis was performed within each dataset using the Effi ciency Analysis option in caGEDA ('internal Effi ciency Analysis'). For all of the tests described in this paper, external Efficiency Analysis was also performed using completely independent datasets ('external Effi ciency Analysis'). External efficiency curves were generated by pasting the ranked gene lists, with their associated test scores, into the Overlap4 tool http://bioinformatics.pitt.edu/GE2/Overlap4.html with a step of 0.1. The fi nal gene list contains: 1) percentage of overlap (O) at a given threshold, for a specifi ed number of genes (N3), and 2) for each selected gene, the percentage for degree of differential expression. As an independent check on the potential utility of the various gene lists, all tests were assessed for classifi cation error using internal cross-validation (RRV with a 70%-30% split, performed at 100 iterations) with a Naïve Bayes classifi er over the range of signifi cance level for each test.

Tests for differential expression
Lung cancer data Detailed descriptions about the tests examined can be found on the caGEDA website (http:// bioinformatics2.pitt.edu/GE2/GEDA.html) 29 and some brief descriptions of the threshold-based tests and normalization methods are described in Appendix A. For most tests, a threshold of 0 was used for all tests so that all genes were returned with a score. Random Feature Selection (RFS) used a threshold range that spanned from 0 to the total number of genes (5377 distinct genes) to produce an appropriate and comparable result. PPST 30 and SAM 31 are permutation tests and were performed as such using the following settings: PPST using 100 permutations, threshold 1, and 1-99 quantiles; SAM using 100 permutations, δ of −1, and a δ' of 0. The Segmented J5, described here for the fi rst time, used 100 percentiles and a threshold of 0 for analysis. Any missing values were estimated using the K-Nearest Neighbor method with 3 neighbors. 32 Nondistinct gene values were averaged. For initial assessment of the tests, no normalization/transformation method was used. For the normalizations/ transformations, the best performing test within the acceptable gene number range, D1, was used to find the best accompanying normalization/ transformation method.
Rat toxicogenomic data Data were obtained from the MAQC Project website (http://www.fda.gov/nctr/science/centers/ toxicoinformatics/maqc/). Effi ciency Analysis was performed for the desired tests using caGEDA and parameters described above for internal Efficiency Analysis and the individual tests. Because the distributed data were previously normalized and fi ltered by the authors, 28 additional normalizations were not performed in assessing the tests. Results from the J5 family, fold-change family, t-test, and random feature selection were compared.

Naïve Bayes classifi cation error estimation
Random Resampling Validation (RRV) was performed on the Beer et al. dataset with the Naïve Bayes classifi er to produce internal cross-validation estimates of classifi er performance characteristics. As a preliminary assessment of the variation in performance associated with different normalization methods, the D1 test was used in combination with transformation/normalization methods.
The following parameters were used: Mean as a measure of central tendency, Naïve Bayes Classifi er with the proportion of samples as the prior, 100 iterations, a 70%-30% training/test split, and a threshold range of 0 to 50 with a 1.0 step. The Bhattacharjee dataset threshold ranged from 0 to 1 with a step of 0.01. Unfortunately, tests that implement permutations (SAM and PPST) were not performed using RRV because caGEDA can not perform both resampling and randomization in stages simultaneously. Similar ranges and steps were used for the normalization methods, and a similar number of iterations were used in choosing the steps for the exceptions. The optimal threshold value was used for the classifi cation of the initial training/test set. The selected gene list, under the most effi cient normalization and test combination, was used to predict sample class labels using the Naïve Bayes Classifi er. Importantly, reciprocal external cross-validation was performed for each data set i.e. one dataset was used for training, and then tested on the other, and vice-versa.

Pace analysis
Additional computational validation was performed using PACE Analysis. 33 Permutation Achieved Classifi cation Error (PACE) uses permutations of the given data set to determine if the achieved classifi cation errors are signifi cant at the 95% and 99% levels.

Intensity-related bias
As our study proceeded, we observed an apparent signal intensity-related bias of the J5 and derived tests toward the high end of the intensity range. Other tests appear to favor genes either at the low end, or in the mid-intensity range. To evaluate the impact of this problem, and the effects of other potential intensity-related biases, we rescaled each gene by the formula: new_value = (old_value-min_value)/(max_ value-min_value) across all of the chips (within genes) and repeated Effi ciency Analysis on all methods.

J5 Test
The J5 test is a gene-specifi c ratio between the mean difference in expression intensity between two groups, A and B, to the average mean group difference of all M genes. The J5 is intended for use when t-tests are likely to exhibit unacceptably low specifi city (high false discovery rates) due to unstable estimates of the variance. 29 For a twogroup (A v B) comparison, the J5 test is described for the i th of m genes as: (1)

Segmented J5
The Segmented J5 genes are ranked by median intensity and the entire distribution is divided into quantiles (default 100). We then used the mean parameter estimates from within each quantile separately, therefore eliminating the signal intensity bias. The mean of the absolute value of the difference of means within a quantile is used for each gene as the J5 denominator for all genes in that quantile. 34 Intensity scaled J5 The Intensity Scaled J5 is similar to the segmented J5 but scaled using one of the mean values (in case vs. control comparisons, the control comparison). 34

D1 Test
The D1 test is a twice-iterated J5 test where the remaining genes are tested after removing the initially discovered signifi cant genes during the fi rst iteration. 29

Results
Because we were most interested in relative efficiency where N3ϽϽM, we compared methods over a range N3 = 0-250 for all tests. Three separate Effi ciency Analyses were performed for the lung adenocarcinoma data: 1) 'BETWEEN' datasets, 2) 'BEER only', and 3) 'BHATTACHARJEE only'. Two classifications were performed: 1) Beer-training/test with Bhattacharjee as a validation set, and 2) Bhattacharjee-training/test with Beer as the validation set.

Effi ciency analysis
Best test 'BETWEEN' the datasets In this exercise we compared the genelists from Beer et al. and Bhattacharjee et al. Effi ciency Analysis was performed to determine the most consistent test. Under our criterion of effi ciency, the D1 and J5, appeared to produce similar results over a range of Ͻ1000 genes. While apparently far superior to the other tests as well as random feature selection ( Fig. 2A, Appendix B1) within the Ͻ2500 gene range, the D1 and J5 were also superior in our range of interest (0-250) (Fig. 2B, Appendix B2). Further exploration of normalization was performed using the D1 test only.
Corrections for the intensity-related bias we observed did not infl uence the results greatly (for example, compare J5 to Segmented (Seg J5; Appendix B3). The results showed that the Intensity Scaled J5 seemed to be most consistent over the lower range of genes, 0-2000 genes. Other tests outperformed at higher values of N3, but not until several thousands of genes were included (Appendix B1 and B3), however, the D1, J5, and Segmented J5 outperformed Random Feature Selection greatly, and appeared to be the most consistent test at a lower number of desired genes.

Most effi cient transformation/ normalization methods, both datasets
To study the effects of competing normalization effects on the efficiency of the D1 test, we generated effi ciency curves for the D1 test under a variety of normalization methods. The locally optimized overlap at low N3 for the D1 test alone was 41.4% and occurred at 47 genes. Figure 3 shows the results from the normalization comparisons. Although none of the normalization methods led to marked improvement, (Appendix B4 and B5), the quantile normalizations appeared to improve overlap results slightly over our gene range (0-100). Quantiles 25, 75, and 99 all performed best at some point over the gene range. Therefore the "best" method of normalization depends on the number of genes in question. The Efficiency Analysis results from D1 alone improved minimally using these normalizations (Fig. 3).

Most internally effi cient methods, BEER dataset
We split the Beer dataset randomly three times producing four independent, representative sample groups. Each group produced a gene list from the two populations (normal-cancer). As before, Efficiency Analysis was performed to determine the most effi cient test. The D1, J5, and Segmented J5 all appeared to give similar results over the specifi ed range of genes (0-250), while being far superior to the other tests as well as random feature selection (Fig. 4A, Appendix B6). To reduce the slight variation in performance associated with any arbitrary split, each of the most effi cient tests were re-analyzed 3-fold (Fig. 4B). Because the D1 test was evidently the most effi cient test for this dataset (independent of intensity-scaling), further exploration of normalization was performed using the D1 test only.
As before, we wanted to assess the transformation/ normalization methods to see if, when used in combination with the D1 test, the percent overlap results would improve. The maximum overlap for the D1 test alone was 35% and occurred around 46 genes. The Z transform normalization method appeared to be most consistent over the entire range of genes (0-250). With Z transformation, the highest percent overlap occurred at 42 genes (Fig. 5, Appendix B7), with 21 of the genes being present on all lists 100% of the time (Appendix C, Table A). The normalization method increased O to almost 46% (Fig. 5, Appendix B7).

Most internally effi cient methods, BHATTACHARJEE dataset
Again, Effi ciency Analysis was performed to determine the most consistent test. As before, the D1, J5, and Segmented J5 all appeared to give similar results over the specifi ed range of genes (0-125), while being superior to the other tests as well as random feature selection (Fig. 6, Appendix B8). Again, these three tests were re-analyzed at 3-fold ( Fig. 6), with all three D1 tests most effi cient. Next, the transformation/ normalization methods were tested. The maximum overlap for the D1 test alone was 28.7% and occurred around 101 genes (∼100 genes desired).
The results from the normalization comparisons are summarized as Figure 7, Appendix B9. The Trimmed mean 5%-95% normalization method outperformed all others over almost the entire range of genes (0-250). With TM 5%-95%, the highest percent overlap occurred at 102 genes, with 32 of the genes being present on all lists 100% of the time (Appendix C, Table C). The normalization method increased the O to 29.2%.

Classifi cation
Classifi cation of the Beer dataset Initial training using Beer dataset The training and test error results using RRV (10 resampling iterations) for the Beer dataset also indicated that D1 might be most useful. The D1 (J5 family) test appeared to produce the lowest test Achieved Classifi cation Error (ACE) requiring only 4 genes, at a threshold of 31.0 (see Appendix D). The test ACE was ∼19% using the D1 alone (Fig. 8, Appendix B10) and improved to 18.4% with the  test was used with all normalization methods. The Subtract Global Minimum procedure was performed to eliminate negative values and allow for log transformation. At 100 genes, TM 5%-95% appeared to be the best normalization method.
Quantile 99 normalization method (Fig. 9, Appendix B11). Importantly, both the test and the normalization methods were predicted to be most efficient! The threshold value with maximum overlap at N3ϽϽN was used without RRV for classifi cation of individual samples for each test.
The results were as follows: A between-mean array correlation after normalization of 0.984, the between array coeffi cient of variation (COV) of 0.034 before normalization, and 0.062 after normalization, and a confounding index (CI; compares the average within-group correlation to the average between-group correlation; The CI should be as close to

Bhattacharjee external validation dataset
The data from only the four genes (Appendix B15) of interest from the Beer dataset were retained from the Bhattacharjee dataset, and the classifi cation performed without normalization. The ACE, sensitivity, and specificity can be found in Appendix C,  Figures 10 and 11, respectively. QC statistics were: between-mean array correlation after normalization of 0.996, the between array coefficient of variation (COV) of 0.028 before normalization, and 0.0030 after normalization, and a confounding index (CI) of 0.997 before and 0.998 after normalization. The lowest Achieved Classification Error (ACE) was 32.0% (Fig. 10, Appendix B12), was achieved using the N fold Ratio test, with six retained genes, at a threshold of 0.37 (see Appendix C).
This threshold was then used without RRV to classify individual samples. The combined score values from the six genes correctly classifi ed 73.9% of the samples correctly, with a sensitivity of 0.235 and a specifi city of 0.903. For the PACE Analysis, the N fold ratio test with SGM-Square Root normalization (Fig. 11, Appendix B13) and a threshold of 0.34 was used. 100 iterations were performed over a threshold of 0-10 with a step of 0.5. The PACE Analysis appeared to show fi ndings that are not signifi cant (Appendix B16).

Beer external validation dataset
The data from only the six returned genes (Appendix B17) from the Bhattacharjee dataset were removed from the Beer dataset, and the classification performed with all normalizations.
The externally valid results were as follows: 70.9% of samples were classified correctly, with a sensitivity of 0.411, and a specificity of 0.782. Again, PACE Analysis was performed with the N fold ratio test and SGM-Square Root normalization (as well as SGM-Log 2, not shown), over a range of 0-8 with a step of 0.5. PACE Analysis of the test set again seemed to show not significant fi ndings (Appendix B16). Additionally, the D1 test was performed with and without normalization and produced very similar results to the N fold ratio classifi cation. Xi et al. 35 reported the following fi ndings: a test lymph node classifi cation error of 70% for pathology-positive, and 38% for pathology-negative patients, and an overall cross-validated accuracy of 54%. Note that here the Bhattacharjee dataset was used for training, IntSc J5 Figure 10. Test ACE for tests using Bhattacharjee. As can be seen, the best ACE (30.7%) was found not using the D1 test, but the N fold ratio test. The J5 and D1 performed second and third best for this data set at 3-4 genes. This fi nding is not consistent with Effi cient Analysis.
utilizing 318 genes, and Beer was used as an external validation set.

Discussion
Microarray technologies generate large volumes of potentially useful information and there have been many papers that have described tools and approaches for data analysis. Unfortunately the relative performance of the methods implemented, in the majority of available tools have not been compared extensively. We report on Effi ciency Analysis, which we envision could possibly be an early step in a standardizing procedure for microarray data analysis. The Efficiency Analysis described in this paper is based on the assumption that the most consistent test will provide the greatest amount of overlap at a fi xed number of genes. We leave open the question of where the comparisons of the curves might be most meaningful, because this may depend entirely on the biology of a given study or investigation. An important advantage of this approach is that the identity of the genes are not examined in the comparison of different methods, thereby removing any possibility of imposing a biological bias in favoring one result over others. Choe et al. (2005) used a control dataset to evaluate different methods. Their study was more focused on combinations of normalization/ transformation/tests for differential expression, but really only compared three tests for differential expression (t-test, Cyber T, and SAM). 37 We have shown that if the appropriate test is chosen fi rst, transformations/normalizations may alter the accuracy of the results but not substantially. The only real difference is that our method chooses and assesses the test fi rst, where Choe's method follows a more classical progression of background correction to normalization to choosing the test last. By assessing the test at the beginning of the process, biases or other complications arising due to background subtraction, transformation/ normalization, and/or other variability may be avoided. Thus no "standard" method of data analysis currently exists and no approach has been described for determining the most efficient method for analysis of a given dataset.
The success of any research lies on reproducibility of results, as does Effi ciency Analysis. Dividing the dataset into additional variable groups a number of times had no signifi cant effect on the results obtained (data not shown), suggesting that the Effi ciency Analysis using the D1 test was robust. There is an issue of intensity-related bias in some tests for a certain number of genes, and this factor may impact Effi ciency Analysis. The Segmented J5 and Intensity Scaled J5 were created to further study, and may alleviate that problem (Appendix B3). By looking at the dataset over pre-assigned quantiles, the bias is addressed. The effects of this bias are evidently minimized by the Intensity Scaled J5 test (Appendix B3), and do not signifi cantly affect the results. The variation in performance among the different tests that we see has profound implications for studies that seek to determine whether a unique gene set might exist for a given clinical diagnosis, such as in breast cancer, and for metaanalysis. 38 Any metaanalysis of microarray data that simply uses the t-test, for example, without attempting to determine fi rst which of the scores of tests for fi nding differentially expressed genes might provide the highest true positive rate will be prone to fi nd vast differences among studies. In this study, only single normalization methods were considered (no combinations). There very well may be a combination of transformation/ normalization methods that would produce better results than any of those tested in this study. Also, Naïve Bayes was the only classifi er tested in our study. It may be that other classifi ers will produce even better accuracy (e.g. PAM 36 , SVM 39 , CART 40 , etc.) The classifi cation results were considered taking a combination of performance characteristics and parameters into account (number of samples classifi ed correctly, sensitivity, specifi city, and number of genes). Classifying one test as the most consistent could vary depending on the parameter that is considered the most important. For our purposes, a minimal gene number, a low percent of classification error, high sensitivity and specifi city were all considered in deciding which methods produced the 'most consistent' classifi cation. We recognize that there are many biological factors (e.g. age, smoking history, weight, sex, diet, and environmental exposures), each of which could have infl uenced the performance of the test significantly. We only stratified the cases into neoplastic and non-neoplastic groups, and did not take into consideration the potential effect of infl uences such as those listed above, or the presence of other complex diseases. Also, there very well could be misclassifi cation of the samples involved, of which we are unaware.
There is disappointment that accompanies the classifi cation results not being signifi cant according to PACE Analysis. Compared to other analyses using PAM 36 (i.e. , who achieved impressive classifi cation errors (training accuracy of 88.4% when compared to pathology), we fi nd our RRV results reassuring in that they were predicted by Effi ciency Analysis. It may simply not be possible to differentiate normal lung tissue and adenocarcinoma tissue based solely on a few genes with the data and samples that were used. Xi et al. used a 318-gene set, with generally the same samples (15 omitted from the test set), to discriminate samples based on lymph node status and although the training errors were signifi cant, the overall test validation accuracy was 39.1% (Beer-train, Bhattacharjee-test) and a cross-validated 54% (Bhattacharjee-train, Beer-test). In our study, the fi nal classifi cations were the product of the test/ normalization combination that produced the lowest test classifi cation error for the existing methods. The several genes that appear to be consistently differentially expressed may be of interest in lung adenocarcinoma. Three of the genes of interest appear to be closely related which will lead to further investigation. TFF3, CALCA, and PCSK1 all are listed by Bhattacharjee et al. as genes of interest in cluster C2 (an adenocarcinoma subclass). Of the remaining genes, only HLA-B appeared as being significant in other lung carcinoma papers investigated. 9,13,41,42 Sample and patient classifi cation is an important goal. We are encouraged by Effi ciency Analysis' ability to somewhat determine the "best" performing test and normalization methods to use for classification in a way that is independent of generating the classifi cation, with the Beer dataset. For the Bhattacharjee dataset, N fold ratio produced the best classifi cation using RRV, but it is important to note that when D1 was independently used for classifi cation it produced very similar results. D1 and J5 ranked second and third. When compared to the Beer dataset, the Bhattacharjee dataset appeared to be 'less consistent'. This may be due to experimental or biological effects, but we cannot say for certain. Obviously further investigation is required. External Effi ciency Analysis is available through the Overlap4 tool, and internal Effi ciency Analysis is automated in caGEDA under "Computational Validation" options. We invite others to participate in using caGEDA and Efficiency Analysis using their own datasets.
During the preparation of this manuscript, the Rat toxicogenomic MAQC study was released in Nature Biotechnology. 28 Guo et al. assessed the concordance in inter-site and cross-platform comparisons and the impact of gene selection methods on the reproducibility of profi ling data in terms of differentially expressed genes. 28 In order to validate their fi ndings, the authors used percent overlap of gene lists, between different laboratories and/or platforms, as their evaluation method. They concluded the same as we did (prior to reading their report) that, gene lists generated by fold-change were more reproducible than those obtained by t-test or SAM. 28 Additionally, conclusions reported by Guo et al. further support our claims as well: 1) Criteria used to defi ne differentially expressed genes can have a dramatic impact on the overlap of the resulting genelists, 2) Findings are reproducible across laboratories and platforms when the preferred gene selection criteria are used, 3) Normalization methods do not alter the gene lists unless a p-value criterion is involved in gene selection, 4) Fold-change performed the best, followed by SAM and lastly the t-test (with mention of the problems with gene selection methods based solely on t-test p-values), 5) The importance of appropriate data analysis procedure as a whole (and may we add the order). 28 Effi ciency analysis was performed on the Rat toxicogenomic study data, the results of which can be seen in Appendix E. In support of our previous conclusions, the J5 performed best overall, better than fold-change, and the t-test performed poorly in almost all situations. We believe that including an analysis of the MAQC Rat toxicogenomic data should clarify some doubt as to whether Effi ciency Analysis' has ability to predict the 'best test'. Although the results were minimally platformdependent, we believe that our previous assertions are further supported by these results. We extend their fi ndings and add that under a similar criterion, the J5 was more reproducible than fold change.
Our interpretation of the results must be stated with important caveats. Absent independent validation, for example, by RT-PCR, it may be premature to conclude the relative order of the reproducibility and consistency of the methods explored.
In conclusion, Effi ciency Analysis appears to distinguish among tests that are incapable of providing consistent results and seems to also be able to generally predict which feature selection methods yield the lowest sample classifi cation error. Although not always correct, the method does produce, if not the 'best' test, a general understanding of which tests yield similar internally and externally consistent results. We were surprised by the consistently poor performance of the t-test and variants of that test. We urge caution in the use of estimates of variance derived from small sample sets in high dimensions, and fi nding differentially expressed genes in solid tumors using tests derived from second central moment estimates. 30 Note the J5 and related tests avoid the use of estimated variance altogether. The D1 test gave the highest percent overlap in both datasets, and although the results were validated by independent datasets, the features selected did not lead to a signifi cant Naïve Bayes classifi er, but then neither did most other gene lists from other tests. This may be due to intrinsic molecular heterogeneity within lung cancer; the Naïve Bayes model, like other fi xedmarker intensity input additive linear models, uses all genes for all patients. Research is needed on classifi ers that are robust to the intrinsic molecular heterogeneity of cancers.

Appendix A
The following descriptions taken from the caGEDA website: Threshold-based tests J5: compares the difference of means in any gene to the average difference in means over the whole array. D1: performs the J5 once, removes all significant genes, and performs the J5 again to capture any genes that were missed initially.
Signal-to-Noise: similar to t test with a high positive false discovery rate.
Segmented J5: developed out of concern over intensity-related bias in the tests. The segmented J5 was designed to determine whether the J5 test was more effi cient than the other tests when the denominator was calculated only using genes with similar intensity. In the segmented J5, genes are ranked by median intensity and a number of quantiles are determined. The mean of the absolute value of the difference of means is used for each gene within a quantile as the J5 denominator.
Intensity Scaled J5: similar to segmented J5 but scaled using the entire range of intensities. Trimmed mean: all intensity values on the array are multiplied by the mean of all intensity values on the array between two specifi ed quantiles.

Among-array
Median mean: the mean intensity of each array is calculated, the median mean is identifi ed, and an array-specifi c multiplicative factor is defi ned as the ratio of that array's mean to the median mean Minimum mean: same as median mean but instead of normalizing to the median mean, the scaling factor is the minimum mean of all of the arrays.
Global mean adjustment: the global mean intensities of all arrays are calculated, then the difference between each individual array mean and the global mean is calculated, the array-specifi c difference value is then added to or subtracted from each individual expression intensity value on an array and the result is that all arrays now have the same overall mean.
Max1, Min0: re-scales each expression intensity profi le so the array-specifi c maxima and minima are equal in all arrays.
Z-transformation: similar to log transformation in that it can change the shape of the distribution, but can reduce performance of some tests for differentially expressed genes.       The J5 appears to perform the best in 3 out of 4 platforms (only being outperformed by nfold1 on the Affymetrix platform). Dark blue: J5; pink: simple t-test; yellow: nfold1; turquoise: nfold2; purple: nfold3; black: random feature selection. C-R = Comfrey Roots/Riddelliine.