Small sample sizes: A big data problem in high-dimensional data analysis

In many experiments and especially in translational and preclinical research, sample sizes are (very) small. In addition, data designs are often high dimensional, i.e. more dependent than independent replications of the trial are observed. The present paper discusses the applicability of max t-test-type statistics (multiple contrast tests) in high-dimensional designs (repeated measures or multivariate) with small sample sizes. A randomization-based approach is developed to approximate the distribution of the maximum statistic. Extensive simulation studies confirm that the new method is particularly suitable for analyzing data sets with small sample sizes. A real data set illustrates the application of the methods.


Introduction
Small sample sizes occur in various research experiments, especially in preclinical (animal) studies, due to ethical, financial, and general feasibility reasons. Such studies are essential to translational medicine and other areas (e.g. rare diseases). Often, fewer than 20 animals per group are involved, and making valid inferences in these studies therefore becomes challenging. In addition to the small sample sizes, repeated measurements as well as multiple endpoints are often observed on the experimental units (animals), naturally leading to a "large p, small n" situation and thus to a high-dimensional data design. Note that high-dimensional data do not only occur in animal studies; medical imaging and genomics are other well-known application areas. The first statistical problem at hand is neither the high dimensionality of the data nor the relatively low statistical power of the tests when sample sizes are very small; it is the accurate type-1 error rate control of the methods. Many existing statistical methods require moderate or large sample sizes and therefore tend not to control the type-1 error rate properly when sample sizes are very small; they either behave liberally and over-reject the null hypothesis or are conservative. Exact techniques (i.e. procedures that rely on the exact distribution of a test statistic for any finite sample size n) would be a great choice, but they typically rely on strict model assumptions that can hardly be verified, at least in more complex models. Indeed, making any assumption about the underlying distributions (e.g. based on boxplots), verifying equality of variances, specific covariance structures, etc. is practically impossible when sample sizes are so small. Methods that do not rely on strict model assumptions are therefore the methods of choice.
All in all, besides the often-discussed problem of high dimensionality, small sample sizes increase the challenge of a robust and, in particular, accurate data analysis in such situations.
Beyond these challenges, and even though the statistical designs themselves are usually complex, the research questions and study aims are often very specific. These may be tackled by applying global testing procedures, which have been developed for different high-dimensional repeated measures and multivariate ANOVA models by several authors. [1][2][3][4][5][6][7][8] Furthermore, multivariate tests based on interpoint distances have been proposed. [9][10][11][12] Testing global null hypotheses, and herewith answering the question whether any difference among the repeated measurements within or across endpoints exists, however, usually does not answer the main question of the practitioners: the specific localization of the experimental conditions responsible for the overall significance conclusion. A modern data analysis requires multiple comparison procedures that control the family-wise error rate in the strong sense, that are flexible enough to test arbitrary global and local null hypotheses, and that lead to compatible simultaneous confidence intervals (SCIs) for the underlying treatment effects. Furthermore, due to the often complex dependency structure across the repeated measurements and contrasts, a powerful multiple comparison method should take the correlations of the different test statistics into account. Such methods are also known as multiple contrast test procedures (MCTPs) and are based on the maximum of a vector of possibly correlated t-test-type statistics (max t-test statistic). Computing its exact distribution without making strict distributional assumptions is impossible in general designs. 13 Therefore, approximations of its asymptotic distribution are needed for making inferences. Recently, it has been suggested 14 to estimate the distribution of the maximum within a bootstrap simulation-based framework using the empirical correlation matrix.
Simulation studies indicate, however, that sample sizes n_i ≥ 50 are necessary for an accurate type-1 error rate control. When sample sizes are smaller, the methods tend to be liberal (see Section 5). In the present paper, a modification of the proposed method is introduced that does not require the estimation of the correlation matrix. Extensive numerical studies show that the new approach controls the type-1 error rate very accurately even when sample sizes are very small and data do not follow multivariate normal distributions with equal covariance matrices.
The paper is organized as follows. In Section 2, a high-dimensional preclinical study on Alzheimer's disease with small sample sizes is described. Existing methodology for the statistical evaluation of such designs is explained in Section 3. Here, its behavior in small sample size situations is also investigated, which motivates the development of a different approximation of the distribution of the max t-test in Section 4. The qualities of the competing approximations are compared in extensive simulation studies in Section 5. The paper closes with the evaluation of the example and a discussion of the results in Sections 6 and 7, respectively. Theoretical properties of the new approximation and proofs are provided in the supplementary material file. Throughout the paper, I_d denotes the d-dimensional unit matrix and 1_d = (1, …, 1)′_{d×1} the d-dimensional vector of ones.

A motivating example
As a motivating example, we consider a part of a preclinical study on Alzheimer's disease conducted at the Institute of Pharmacology of the Charité university medical center in Berlin, Germany. The study involves n_1 = 10 wild-type mice (group 1) and n_2 = 9 L1 tau-transgenic mice (group 2). As usual, the sample sizes of this preclinical research trial are rather small. The abundances of the six different proteins Syntaxin, SNAP25, VAMP2, Synaptophysin, Synapsin-1 and Alpha-synuclein were measured in six different regions of the brain of every mouse. The regions of interest were pre-defined as hippocampal CA1 region (CA1), visual cortex (VC), medial septum (MS), vertical limb of the diagonal band of Broca (VDB), primary motor cortex (M1) and nucleus accumbens (ACB), respectively. Thus, 36 observations were made on every mouse, so that the number of dependent replicates exceeds the number of independent replications of the trial. Therefore, the statistical design represents a classical "large p, small n" situation with small sample sizes. We note that generating such data requires very advanced technology. For graphical representation and easy display of the data, the protein abundances were log-transformed. The results are displayed in confidence interval plots (with chosen local level 95%) in Figure 1. For illustration, additional dotplots and boxplots of the data are displayed in the supplementary material file. The dotplots give the impression that the protein abundances are roughly symmetrically distributed. However, since sample sizes are so small, making any assumption about the underlying distribution would be questionable. It can also be seen that a few "outliers" are present. These values have been kept in the data set, because the range of protein abundance measurements is usually very wide. Therefore, these values are not outliers in the classical sense and provide useful information about the protein levels in the respective brain regions.
Furthermore, the confidence intervals displayed in Figure 1 show a fair amount of variance heteroscedasticity. The data of this trial are therefore modeled by independent and identically distributed random vectors with possibly group-specific covariance matrices, as in Equation (1), and all analyses are arranged according to this order. In general, data modeled by Equation (1) can either be repeated measures (measurements on the same scale), multivariate data (measurements on different scales) or combinations thereof. For a convenient notation of the hypotheses, let μ = (μ_1′, μ_2′)′ denote the combined vector of the expectations in both groups. Besides the question whether abundances of the proteins in the different regions of the brain differ, the study specifically aims to locate specific group × region interactions for each of the proteins. From a medical point of view, these would expose the proteins, and especially the regions of the brain, as biomedical biomarkers for Alzheimer's disease. To be more specific, let μ_{ij}^{(P)} denote the expected protein abundance in group i under region j of protein P, where i = 1, 2; j ∈ {CA1, VC, MS, VDB, M1, ACB} and P ∈ {Syntaxin, SNAP25, …, Alpha-synuclein}, respectively. For each single protein, the major aim is to (i) decide whether there is a group × region interaction and, if so, (ii) where. This can be achieved by simultaneously testing whether the group-wise differences μ_{1j}^{(P)} − μ_{2j}^{(P)} are identical for all regions j = 1, …, 6 for each protein P = 1, …, 6. This leads to testing a family of 72 multiple null hypotheses.

Figure 1. Confidence interval plot (95%) for each protein × region × group combination in the protein abundance trial. Each confidence interval has been computed by inverting the corresponding one-sample t-test statistic using 97.5%-t-quantiles from a t-distribution with n_i − 1 degrees of freedom.
Let the overbar quantities denote the corresponding means of expectations as known from linear model theory, where i = 1, 2; j = 1, …, 6; P ∈ {Syntaxin, …, Alpha-synuclein}. Thus, the hypotheses are nothing but testing whether the differences of the expectations μ_{1j}^{(P)} − μ_{2j}^{(P)}, j = 1, …, 6, are identical for each protein. For simplicity, we rewrite the above in matrix notation and equivalently obtain H_0: Cμ = 0, (2) where c_ℓ′ denotes the ℓth row vector of the contrast matrix C as used in the local hypotheses. Here, in Equation (2), P_m = I_m − (1/m) 1_m 1_m′ denotes the m-dimensional centering matrix, while ⊕ describes the direct sum used to build a block-diagonal matrix.
Note that this summarizing matrix notation enables us to describe the 72 null hypotheses equivalently by only q = 36 contrasts, which shall be tested using n_1 = 10 and n_2 = 9 independent replications. Therefore, the above is a high-dimensional multiple testing problem. An existing statistical method to analyze the data will be discussed in the next section.
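As an illustration of this matrix notation, the interaction contrast matrix can be assembled as a direct sum of centering matrices. The following sketch (in Python/NumPy rather than the paper's R; all names are illustrative) builds C = ⊕_{s=1}^{6} P_6 and checks that it vanishes exactly on vectors of group differences that are constant within each protein block, which is the interaction null hypothesis:

```python
import numpy as np

def centering_matrix(m):
    """P_m = I_m - (1/m) 1_m 1_m', the m-dimensional centering matrix."""
    return np.eye(m) - np.ones((m, m)) / m

# Direct sum of six 6x6 centering matrices (one block per protein);
# equivalently the Kronecker product I_6 (x) P_6.
C = np.kron(np.eye(6), centering_matrix(6))

# C annihilates any vector of region-wise group differences that is
# constant within each protein block, i.e. exactly the vectors allowed
# under the interaction null hypothesis H0: C mu = 0.
delta_null = np.repeat(np.arange(6.0), 6)  # one constant per protein block
assert C.shape == (36, 36)
assert np.allclose(C @ delta_null, 0.0)
```

Each 6 × 6 block of C has rank 5, so the 36 rows encode 30 linearly independent interaction contrasts.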

Existing methodology
The high dimensionality of the testing problem considered here makes the data analysis complex in the sense that the computation of critical values for statistical inference becomes an issue. Recently, Chang et al. 14 proposed a simulation-based inference method for high-dimensional data. The procedure is valid for large dimensions and sample sizes. The case of small sample sizes has not been considered, and its applicability in such situations therefore invites a detailed investigation. In their original paper, both studentized and non-studentized statistics have been considered. For ease of reading, we will concentrate on the studentized statistics only. By doing so, we follow the guidelines of resampling studentized statistics. [15][16][17][18] First, we will rewrite the null hypothesis and introduce the statistics in the same way as described by Chang et al., 14 who propose to test the equality of expectations of q-variate random vectors. These considerations show that the two statistical testing problems are identical. However, we will propose a different way of estimating the critical values for making reliable inference later in Section 4.
Note that the null hypothesis H_0: Cμ = 0 as given in Equation (2) can be equivalently written as the "standard" multivariate null hypothesis. The use of maximum t-statistics plays an important role in preclinical research for two reasons. First, local test decisions can be made using adjusted p-values for the individual comparisons. Second, each t-statistic T_ℓ as defined in Equation (3) describes the distance of the observed mean difference from its respective null hypothesis in units of standard deviations. However, for the computation of the local p-values, the distribution of T_0 must be known, at least approximately. Suppose for a moment that it is known; then the local hypothesis H_0^{(ℓ)} is rejected whenever |T_ℓ| > z_{1−α}(max), (4) where z_{1−α}(max) denotes the (1 − α)-quantile of the distribution of T_0. Compatible SCIs for the effects δ_ℓ = h_{1ℓ} − h_{2ℓ} are given in Equation (5). Finally, the global null hypothesis H_0: Cμ = 0 will be rejected if T_0 > z_{1−α}(max). (6) In such general models (even under the assumption of multivariate normality), however, the exact distribution of T_0 remains unknown 19 and approximate methods are needed for estimating the distribution of T_0. In low-dimensional designs (fixed dimension d and number of contrasts q), the vector of t-statistics follows, asymptotically, as N → ∞, a multivariate normal distribution with expectation 0 and correlation matrix R = D^{−1/2} V D^{−1/2}, (8) where V denotes the covariance matrix of the differences in means and D denotes the diagonal matrix obtained from the diagonal elements of V. These considerations show that the (asymptotic) joint distribution of the vector of t-statistics T (and therefore of T_0) depends on the unknown correlation matrix and is non-pivotal. This is intuitively clear: the more highly the statistics are correlated, the smaller the critical value z_{1−α}(max) should be. Indeed, in the case of perfect correlation, the above reduces to a univariate testing problem. The correlation matrix is unknown, however, and the above cannot be used for making inferences in its present form. Chang et al.
14 propose to first estimate the correlation matrix by its empirical counterpart R̂ = D̂^{−1/2} V̂ D̂^{−1/2}, where V̂ denotes the empirical covariance matrix of the random vectors Y_ik and D̂ denotes the diagonal matrix obtained from the diagonal elements of V̂. Next, they propose to generate M random vectors Y*_1, …, Y*_M from a multivariate normal distribution with expectation 0 and correlation matrix R̂ and to estimate the critical value z_{1−α}(max) by computing the (1 − α)-quantile y*_{1−α}(max) of the values Y*_b(max) = max{|Y*_{b1}|, …, |Y*_{bq}|}, b = 1, …, M, i.e. the maximum absolute coordinate of each random vector Y*_b. Finally, the unknown quantile z_{1−α}(max) is replaced with the observable estimator y*_{1−α}(max) in Equations (4) to (6), respectively. Note that the quantile y*_{1−α} can also be computed directly using the R function qmvnorm implemented in the R package mvtnorm 20 (if the dimension is not "too large").
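For illustration, the simulation-based approach can be sketched as follows. This is a Python/NumPy translation under our reading of the method (the paper's own computations use R with mvtnorm; the function name and jitter term are ours):

```python
import numpy as np

def chang_critical_value(Y1, Y2, alpha=0.05, M=10_000, rng=None):
    """Simulation-based critical value: estimate the correlation matrix
    of the mean differences, draw M multivariate normal vectors with
    that correlation, and return the (1 - alpha)-quantile of their
    maximum absolute coordinate."""
    rng = np.random.default_rng(rng)
    n1, n2 = len(Y1), len(Y2)
    # Covariance of the difference in means: V = S1/n1 + S2/n2
    V = np.cov(Y1, rowvar=False) / n1 + np.cov(Y2, rowvar=False) / n2
    d = np.sqrt(np.diag(V))
    R = V / np.outer(d, d)                    # empirical correlation matrix
    # Small jitter keeps the Cholesky factorization numerically stable
    L = np.linalg.cholesky(R + 1e-10 * np.eye(len(R)))
    Z = rng.standard_normal((M, len(R))) @ L.T   # draws from N(0, R)
    return np.quantile(np.max(np.abs(Z), axis=1), 1 - alpha)
```

Note that the studentization of the original statistics plays no role here: only the point estimate R̂ enters the reference distribution, which is exactly the feature criticized below.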
However, the present small sample sizes raise the question whether the method accurately controls the type-1 error rate and thus leads to reliable conclusions. Note that, in the data example, a 36 × 36 dimensional correlation matrix is estimated from n_1 = 10 and n_2 = 9 independent vectors per group. Roughly speaking, the estimator might be too inaccurate when sample sizes are so small. In order to answer this question, a motivating simulation study has been conducted. Data have been generated from (i) multivariate normal and (ii) multivariate T_3 distributions with df = 3 degrees of freedom, each with group-specific covariance matrices V̂_i, i.e.
Here, V̂_i denotes the empirical covariance matrix of group i in the Alzheimer's disease study. Data have been transformed to Y_ik = (⊕_{s=1}^{6} P_6) X_ik as described above. The T_3 distribution is heavy-tailed and might be a reasonable candidate to mimic the distributional shape of the protein abundance data. Note that V̂_i is singular and therefore data have been generated using a singular value decomposition of V̂_i via the rmvnorm function implemented in the mvtnorm R package. 21 The simulation results for varying sample sizes n_1 = n_2 = 8, 9, …, 50 at nominal significance level α = 5% are displayed in Table 1.
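Sampling from a singular covariance matrix, as done here via rmvnorm, can be mimicked with an eigenvalue (singular value) decomposition. The following Python sketch is illustrative only (it is not the paper's code and the function name is made up):

```python
import numpy as np

def rmvnorm_singular(n, mean, cov, rng=None):
    """Draw n vectors from N(mean, cov) when cov may be singular,
    via its eigendecomposition cov = Q diag(vals) Q'."""
    rng = np.random.default_rng(rng)
    vals, vecs = np.linalg.eigh(cov)
    vals = np.clip(vals, 0.0, None)       # discard tiny negative eigenvalues
    A = vecs * np.sqrt(vals)              # cov = A @ A.T
    return mean + rng.standard_normal((n, len(mean))) @ A.T
```

All draws then lie in the column space of the (possibly rank-deficient) covariance matrix, which is what a singular V̂_i requires.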
It follows that the procedure does not control the type-1 error rate appropriately when sample sizes are very small. With sample sizes n_i = 9, the empirical type-1 error rate is about 20% under normality and about 15% under the heavy-tailed T_3(0, V̂_i) distribution, and hence highly inflated. Only with larger sample sizes (n_i ≥ 50) does the method control the type-1 error rate quite appropriately under normality, while it tends to be slightly conservative under T_3(0, V̂_i). Digging for the reasons for this behavior, we find, first, that the procedure does not take into account the variability of the variance estimators used in the t-statistics in Equation (3) and, second, that the resampling algorithm is based on estimating the full correlation matrix of the vector of t-statistics. If a different resampling algorithm could be defined that overcomes both of these characteristics, a major improvement of the approximation might be available. In the next section, such a solution is proposed.

Approximating the distribution of T 0
The arising challenge is finding a good approximation of the joint distribution of T for estimating critical values and p-values. Resampling methods as above are an innovative way to do so. Roughly speaking, the corresponding test will work if both the limiting and the resampling distributions of the statistic coincide, at least asymptotically, under the null hypothesis of no treatment effect. As explained above, the vector of t-test-type statistics follows, asymptotically, a multivariate normal distribution with expectation 0 and correlation matrix R in low-dimensional settings (d fixed). This means that a proper resampling algorithm must be designed in such a way that the resampling distribution of T, say T*, converges to the N(0, R) distribution, where the correlation matrix must be identical to the one defined in Equation (8). Moreover, in high-dimensional settings (with d → ∞) similar observations apply; see the supplementary material, where it is, for example, shown that the distribution of T converges to a discrete Gaussian process. Detailed assumptions, especially on the covariance matrices, are listed in the supplementary material as well. With these thoughts in mind, not every resampling method is applicable in high-dimensional designs with an emphasis on small sample sizes. For example, the nonparametric bootstrap (drawing with replacement) shows poor finite sample performance in a similar setting under stronger conditions. 22 Moreover, the permutation method proposed therein for exchangeable designs is in general not applicable in our unbalanced heteroscedastic setting. 23,24 For more details on permutation tests, we refer to existing overviews and monographs. [25][26][27][28] Therefore, the generation of the resampling variables and especially the algorithmic build-up play an important role in achieving an adequate bootstrap test in high-dimensional designs.
Furthermore, even in low-dimensional settings (d fixed), estimating the approximate null N(0, R) distribution of T using a plug-in estimator R̂ usually requires large sample sizes for an appropriate approximation. We therefore propose to approximate the limiting distribution of T without estimating the parameters of that distribution, using a wild bootstrap randomization approach that is applicable in low- as well as high-dimensional situations. The method follows the same ideas as proposed for matched pairs 18,29 and for high-dimensional linear models, 30 and works as follows. Let Z_ik denote the centered random vectors and let W_ik denote N independent and identically distributed random signs with P(W_ik = ±1) = 1/2, and set Z*_ik = W_ik Z_ik. Further, the empirical variance of the resampling variables obtained under the ℓth condition is computed in the same way as for the original data. The resampling version of the original test statistic T_0 is then given by the maximum T*_0 of the resampled studentized statistics. In comparison to the existing methodology discussed in Section 3, the statistic T*_0 mimics the computational process that led to the original statistic T_0. Moreover, it is shown in the supplementary material that, for both low- and high-dimensional settings, the conditional distribution of the vector of statistics (given the data X) mimics the null distribution of T. For making statistical inference, the critical value z_{1−α}(max) is estimated by the following steps: 1. Fix the data X (or Y) and compute the centered variables Z_ik. 2. Generate random weights W_ik, compute the resampling variables Z*_ik and the test statistics T*, and save the value of T*_0. 3. Repeat step 2 a large number of times and compute the empirical (1 − α)-quantile z*_{1−α}(max) of the saved values. Finally, the unknown quantile z_{1−α}(max) is replaced with the observable value z*_{1−α}(max) in Equations (4) to (6), respectively. One-sided tests and p-values are estimated analogously. The estimation of z_{1−α}(max) thus gets by without estimating the full correlation matrix R and additionally takes the variability of the variance estimators into account. Note that the set {H_0^{(ℓ)}, T_ℓ; ℓ = 1, …
, q} consisting of the null hypotheses and corresponding test statistics constitutes a joint testing family in the sense of Gabriel. 31 Therefore, the simultaneous test procedure controls the family-wise error rate in the strong sense, asymptotically, in the case of fixed q. Its accuracy in terms of controlling the type-1 error rate, and its power to detect alternatives when sample sizes are small, will be investigated in the next section.
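The randomization steps above can be sketched in a few lines. The following Python/NumPy implementation is illustrative (the paper's own code is in R and the function name is ours); it centers each group, flips signs per subject, and re-studentizes in every run, so no correlation matrix is ever estimated:

```python
import numpy as np

def wild_bootstrap_quantile(X1, X2, alpha=0.05, B=1000, rng=None):
    """Wild bootstrap critical value for the studentized maximum
    statistic: center each group, multiply every subject's vector by an
    independent random sign, and recompute the max t-statistic B times."""
    rng = np.random.default_rng(rng)
    n1, n2 = len(X1), len(X2)
    Z1 = X1 - X1.mean(axis=0)                    # centered vectors
    Z2 = X2 - X2.mean(axis=0)
    t_max = np.empty(B)
    for b in range(B):
        W1 = rng.choice([-1.0, 1.0], size=n1)[:, None]   # random signs
        W2 = rng.choice([-1.0, 1.0], size=n2)[:, None]
        Z1s, Z2s = W1 * Z1, W2 * Z2
        diff = Z1s.mean(axis=0) - Z2s.mean(axis=0)
        # Studentization is redone in every run, so the variability of
        # the variance estimators enters the reference distribution.
        se = np.sqrt(Z1s.var(axis=0, ddof=1) / n1 +
                     Z2s.var(axis=0, ddof=1) / n2)
        t_max[b] = np.max(np.abs(diff / se))
    return np.quantile(t_max, 1 - alpha)
```

Because the signs are applied to whole subject vectors, the dependence between the coordinates is preserved in every resampling run without ever writing down R̂.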
Remark: Throughout the manuscript, we consider the maximum statistic T_0 as a combination of the q possibly correlated test statistics only. It originates from finding an appropriate real-valued c_{1−α} such that the family-wise error rate is controlled; the right-hand side holds with z_{1−α}(max) = c_{1−α}. The resulting test further allows inversion of the test statistics into simultaneous confidence intervals. However, we note that combining functions other than the maximum statistic would be possible, for instance Fisher's weighted combining function. A general overview of nonparametric combination methodology is provided in the monographs of Pesarin and Salmaso 32 and Salmaso et al. 27 and the references therein.

Simulations
In this section, we investigate the small sample properties of the proposed randomization technique within extensive simulation studies. The study aims to compare the two different approximations of the distribution of T_0 presented in the paper. As the true distribution of T_0 remains unknown, the type-1 error rate control of the competing methods is used as a quality criterion. Later, the all-pairs and any-pairs powers of the two methods are compared. We conducted the extensive simulation studies in R (version 3.6.1). Marozzi 12 discusses in detail different methods to compute the numbers nsim and nboot of simulation and resampling runs. Using his result and under some assumptions, nsim = 10,000 simulations lead to nboot = 8·√nsim = 800 resampling runs and a maximal simulation error of less than 1%. Since the methods proposed in the manuscript are computationally feasible, we chose nsim = 10,000 and nboot = 1000 runs for each setting. Furthermore, setting α = 5%, we can compute the 95% precision interval [5% ∓ 1.96·√(5%·95%)/√10,000] = [4.6%, 5.4%]. If the empirical type-1 error of a test is within this interval, the method can be seen as accurate. The simulation code is displayed in the supplementary material file for reproducibility.
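The precision interval above follows from the binomial standard error of an empirical rejection rate; a few lines of illustrative Python arithmetic make this explicit:

```python
import math

nsim, alpha = 10_000, 0.05
# Binomial standard error of the empirical rejection rate at the
# nominal level, times the 97.5% normal quantile.
half_width = 1.96 * math.sqrt(alpha * (1 - alpha) / nsim)
lo, hi = alpha - half_width, alpha + half_width   # roughly [4.6%, 5.4%]
```

An empirical type-1 error rate outside [lo, hi] therefore deviates from the nominal 5% by more than Monte Carlo noise alone would explain.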

Type-1 error simulation results
Due to the abundance of possible factorial designs and hypotheses, two-way designs with varying dimension d ∈ {2, 4, …, 150} are simulated and the hypothesis H_0: P_d(μ_1 − μ_2) = 0 of no interaction effect is tested at the 5% level of significance. Data were generated from a model where F_i(μ_0, R_i) represents a multivariate distribution with expectation vector μ_0 and correlation matrix R_i, and μ_i denotes the location shifts. As representative marginal data distributions, we selected three symmetric distributions with different tail behavior (normal, logistic, T_3) and three skewed distributions ranging from mildly to very skewed (χ²_7, χ²_15, exponential), each with sample sizes n_i ∈ {10, 20}. A major assessment criterion for the quality of the proposed approximations is the impact of both the chosen contrast and the dependency structure of the data, especially when the groups have different covariance matrices, thus covering a typical Behrens-Fisher situation. Here, we used normal copulas in order to generate rather complex dependency structures of the repeated measurements using the R package copula. 33 The different allocations of the correlation matrices used in the simulation studies are summarized in Table 2.
In Setting 1, both correlation matrices R_1 and R_2 are identical and represent an autoregressive structure. In Settings 2 and 3, the correlation matrices R_1 and R_2 have different off-diagonal elements: an autoregressive structure depending on the dimension d is modeled by R_1 in Setting 2, and a linearly decreasing (symmetric) Toeplitz structure by R_1 in Setting 3 (see Table 2); see Pauly et al. 7 for similar choices. Note that Setting 2 models a rather extreme scenario. For a detailed overview of copulas, we refer to Nelsen 34 or Marozzi. 35 All settings are simulated for both sample sizes (n_i ∈ {10, 20}), all dimensions (d ∈ {2, 4, …, 150}) and all distributional configurations described above. The type-1 error simulation results obtained under Setting 1 are displayed in Figure 2. All others are displayed in the supplementary material file.
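A normal (Gaussian) copula construction like the one used here can be sketched as follows. This Python/NumPy/SciPy version is illustrative only (the paper uses the R package copula); the AR(1) structure and χ²_7 margins mirror the settings above, and all function names are ours:

```python
import numpy as np
from scipy.stats import norm, chi2

def ar1_corr(d, rho):
    """Autoregressive correlation matrix, R[i, j] = rho ** |i - j|."""
    idx = np.arange(d)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def normal_copula_sample(n, R, marginal_ppf, rng=None):
    """Draw n d-variate vectors whose dependence is a Gaussian copula
    with correlation R and whose marginals follow marginal_ppf."""
    rng = np.random.default_rng(rng)
    L = np.linalg.cholesky(R)
    Z = rng.standard_normal((n, len(R))) @ L.T   # correlated normals
    U = norm.cdf(Z)                    # uniform margins, dependence kept
    return marginal_ppf(U)             # transform to the target margins

# e.g. chi-square(7) margins with AR(0.6) dependence in dimension d = 4:
X = normal_copula_sample(1000, ar1_corr(4, 0.6),
                         lambda u: chi2.ppf(u, df=7))
```

This separates the dependency structure (the copula correlation R) from the marginal shapes, which is exactly why skewed margins and complex correlation settings can be crossed freely in the study design.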
First, it can be seen that the underlying covariance matrices significantly impact the accuracy of the simulation-based procedure proposed by Chang et al. 14 in small sample size situations. It can also readily be seen that this test shows an increasingly liberal behavior with increasing dimension d. The over-rejection of the hypotheses occurs because the test decision is based upon quantiles from the N(0, R̂) distribution, which takes neither the variability nor the distribution of the variance estimators into account. By contrast, the randomization-based test T*_0 in Equation (13) tends to control the nominal type-1 error rate very well, even in the case of very small sample sizes and large dimensions. The underlying covariance structures seem to have only a minor impact on its results, if any. In the case of mildly skewed distributions, the simulation results indicate that the resampling test controls the type-1 error rate accurately. However, in the case of skewed data with different covariance matrices, the test might be very liberal when sample sizes are small. This behavior especially depends on the type of contrast of interest and whether it induces positive or negative correlations. The simulation results obtained under Setting 2 (see the supplementary material) indicate that none of the methods should be applied in these (rather extreme) cases in practice. Other methods, e.g. nonparametric rather than mean-based procedures, might be more appropriate for the analysis of small skewed data in such cases. However, the liberality vanishes with increasing sample sizes. 36 In the three other settings considered here, the randomization procedure controls the type-1 error rate accurately, even when sample sizes are small and data follow skewed distributions. Pauly et al. 24 report similar conclusions for general linear models with independent observations.
Furthermore, in all of the settings considered here, the randomization-based resampling method is accurate when data are heavy-tailed but symmetric. As many factors might impact the behavior of the tests, however, the procedure might be sensitive to such distributional shapes in scenarios other than the ones considered here. For the generation of other copula models, e.g. elliptical and Archimedean copulas, we refer to Marozzi. 35

Figure 2. Type-1 error simulation results for the randomization-based test in Equation (14) (Wild) and the simulation-based test in Equation (10) (Chang). Data have covariance matrices as described in Setting 1 in Table 2.
Remark: Instead of using copulas for the generation of the multivariate distributions, an alternative is to generate data from a model where the error terms are generated from standardized distributions, respectively. Additional simulation studies indicate that the empirical behavior of the test procedures is very similar.

Power simulation results
Next, the all-pairs power ("reject all false null hypotheses") as well as the any-pairs power ("reject any true or false null hypothesis") of the competing methods to reject the null hypothesis H_0: P_d(μ_1 − μ_2) = 0 (α = 5%) is simulated for selected alternatives. The aim of the simulation study is to investigate the impact of the underlying distributions, dependency structures of the data, sample size allocations and dimensions on the powers of the tests. Data have been generated (under the alternative) from the model in Equation (15) with the covariance settings of Table 2, respectively. The dimension of the random vectors was set to d = 30. Due to the liberality of Chang et al.'s method for small sample sizes, large sample sizes (n_i = 100) were simulated in order to compare the powers of the methods on a fair basis, i.e. in a situation where both of them control the type-1 error rate accurately. For illustration, an additional power simulation with small sample sizes (n_i = 10) has been conducted. First, it turns out that the types of covariance structures affect the powers of the tests. This is not surprising, because the higher the correlation, the smaller the variances of the effect size estimators. Overall, the simulation results indicate that the competing methods have comparable powers when sample sizes are large. Chang et al.'s method has slightly larger any-pairs and all-pairs powers than the randomization test (about 1% higher). However, when sample sizes are small, the randomization method controls the size and has reasonable power. The simulations of the all-pairs power furthermore indicate the strong control of the FWER by both methods. Under the situations considered here, the shapes of the underlying distributions impact the results. As expected, the power of the methods under χ² distributions appears to be rather low. The reason is the rather large variance of the χ² distribution compared with the other distributions.
It should be noted that the above findings only hold for the settings considered here and might be different under other scenarios. The all-pairs and the any-pairs power curves are displayed in the supplementary material file.
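The two power notions compared above can be computed from a matrix of local rejection indicators as follows (illustrative Python sketch; the function name is ours):

```python
import numpy as np

def empirical_powers(reject):
    """reject: boolean array of shape (nsim, q) with the local rejections
    of the q false null hypotheses in each simulation run.
    Any-pairs power = P(reject at least one false hypothesis);
    all-pairs power = P(reject all false hypotheses)."""
    any_pairs = np.mean(reject.any(axis=1))
    all_pairs = np.mean(reject.all(axis=1))
    return any_pairs, all_pairs
```

By construction, the all-pairs power can never exceed the any-pairs power, which is why the former is the stricter benchmark for a multiple testing procedure.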

Evaluation
The extensive simulation studies show that the newly proposed randomization test controls the type-1 error rate very satisfactorily, even when sample sizes are very small and data do not follow a multivariate normal distribution. In a first step, we perform a further type-1 error simulation study and investigate the accuracy of the method for analyzing this specific data set in the same way as presented in Table 1. As before in Equation (11), we mimic the data set using multivariate normal N(0, V̂_i) and multivariate T_3(0, V̂_i) distributions, respectively. The type-1 error simulation results are displayed in Table 3. It appears that the method controls the type-1 error rate accurately. The high-dimensional preclinical study on Alzheimer's disease introduced in Section 2 can therefore now be analyzed with this method. For comparison, the data set will be analyzed using both of the discussed approximations. In addition to testing for the interactions motivated in Equation (2), multiple comparisons inferring the region as well as the group effects are of interest. These are performed using the corresponding contrast matrices. Note that for testing the impact of the region, the test statistics are built in the same way; therefore, the correlation matrix of the vector of test statistics T = (T_1, …, T_q)′ is identical to the one using interaction contrasts as described in Section 3. The randomization approach for approximating the joint null distribution of these statistics is adapted accordingly. Testing for the group effects is the "standard" multivariate hypothesis. Means and empirical variances of the protein abundance data under each protein × region × group combination are provided in the supplementary material file.
As already indicated by the confidence interval plots in Figure 1, the data show a fair amount of variance heteroscedasticity. Therefore, assuming equal covariance matrices across the groups is doubtful. Next, the multiple hypotheses will be tested using the two different approaches. The point estimators of the contrasts in means d̂ℓ, the values of the test statistics Tℓ in equation (3), the estimated quantile z95%(max), as well as the 95% simultaneous confidence intervals from equation (5), obtained with both the simulation and the randomization technique, will be displayed for all three multiple hypotheses. In total, M = 100,000 simulation and randomization runs have been performed. The results are displayed in Table 4. Different decisions (at the 5% level) between the two competing methods are highlighted in boldface.
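A rough sketch of how these quantities arise: re-randomize the group assignment M times, record the maximum absolute t-statistic of each run, take its 95% quantile as z95%(max), and widen the componentwise intervals accordingly. The label-permutation scheme below is a generic stand-in for the paper's randomization technique, shown only to illustrate the mechanics.

```python
import numpy as np

rng = np.random.default_rng(2)

def t_stats(x, y):
    """Componentwise t-type statistics, estimates, and standard errors."""
    diff = x.mean(axis=0) - y.mean(axis=0)
    se = np.sqrt(x.var(axis=0, ddof=1) / len(x) + y.var(axis=0, ddof=1) / len(y))
    return diff / se, diff, se

def max_t_randomization(x, y, M=10000, alpha=0.05):
    """Approximate the (1-alpha)-quantile of max_l |T_l| by randomly
    re-assigning group labels; return it with simultaneous CIs."""
    _, diff, se = t_stats(x, y)
    pooled = np.vstack([x, y])
    n1 = len(x)
    maxima = np.empty(M)
    for m in range(M):
        idx = rng.permutation(len(pooled))
        tm, _, _ = t_stats(pooled[idx[:n1]], pooled[idx[n1:]])
        maxima[m] = np.abs(tm).max()
    z = np.quantile(maxima, 1 - alpha)      # estimated z_95%(max)
    return z, diff - z * se, diff + z * se  # simultaneous CIs
```

A larger estimated quantile z95%(max) directly translates into wider simultaneous intervals, which is the pattern reported in Table 4.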
First, for all three testing problems, the estimated quantiles z95%(max) of the maximum statistic are considerably larger with the randomization approach than with the simulation-based method. This is not surprising given the liberal behavior of the simulation-based test. The simultaneous confidence intervals obtained with the randomization procedure are therefore wider. In the following, the results for each of the three multiple null hypotheses will be discussed separately. Neither of the two competing methods detects an interaction between group and region under any of the six investigated proteins. The estimated quantiles differ remarkably, though (3.10 vs. 3.76). But, since the maximum t-statistic is T0 = 2.63 and thus T0 < z95%(max), the data do not provide evidence to reject the null hypothesis at the 5% level of significance. Investigating differences between the regions, the approximation methods provide different local conclusions at the 5% level of significance. The simulation-based method declares the regions ACB under Syntaxin, M1 under VAMP2, as well as M1 under Alpha-synuclein significantly different from the average of the others, while the randomization method does not. A look at the boxplots gives the impression that these values differ only slightly from the mean of the others. Given the amount of regional deviation, however, those regions clearly differ significantly on a pairwise level, and, overall, the protein abundances differ significantly across the regions. Investigating differences between the groups, no significant differences can be detected with either of the competing methods. It should be noted that the estimated quantile of the randomization method increases from 3.75 to 3.88, while the simulation-based estimator remains at about 3.10 (3.09).

Discussion and outlook
Research experiments in translational medicine and especially in preclinical areas are usually small due to ethical reasons and animal welfare. Ideally, animal studies would be abandoned altogether, but, roughly speaking, medical research has not yet arrived at the point where every experiment can be replaced or refined so as to avoid animals. Often, sample sizes of such trials are smaller than 20 per experimental group, which might be a reason to question the quality of the outcome. However, since animals are kept under homogeneous conditions, heteroscedasticity across the animals is usually smaller compared to studies in humans, depending on the outcome measures. In any case, since preclinical studies play a significant role in medical sciences in terms of transferring results towards the next phase, a major concern is the quality of the statistical methods used. Most of them control the type-1 error rate accurately with large sample sizes only and, indeed, show a very liberal or conservative behavior when sample sizes are small. This observation holds for a variety of statistical procedures designed for different questions and fields, including analysis of variance methods 24,37 as well as multiple contrast test procedures using maximum t-test-type statistics 38,39 for repeated measures and multivariate data. When the number of comparisons is "small" compared to the sample sizes, approximate and exact methods are available. 19,40-43 These methods are, however, limited by the number of comparisons to be made and are not applicable in high-dimensional situations. Note that they fail not because of the test statistic itself (the maximum t-test) or any computational difficulty, but because information about its distribution is only available for low-dimensional designs. Recently, Chang et al. 14 tackled the problem and proposed a simulation-based algorithm to approximate the distribution of the maximum statistic in high-dimensional designs.
Extensive simulations show, however, that large sample sizes are needed for an accurate type-1 error rate control, rendering their method inapplicable for trials with small sample sizes. In the present paper, we modified their strategy towards a robust randomization technique to approximate the null distribution of the max t-test. Simulation studies indicate that the method approximates the null distribution very satisfactorily and greatly improves the applicability of their method. Furthermore, the power of the method can be considerably improved by adapting it to a two-step screening procedure. For a given significance level α, define the index set S1 which contains the indices of all test statistics Tℓ that, in absolute value, do not exceed a given bound. If the cardinality of the set is equal to q, the null hypothesis H0: θ1 = θ2 is not rejected. Otherwise, if its cardinality is equal to s (say) and smaller than q (s < q), collect all remaining test statistics in the sub-vector T̃ = {T_s : s ∉ S1}.
The null hypothesis will be rejected if max{|T_s| : s ∉ S1} ≥ y*_{1−α}(S1). Here, y*_{1−α}(S1) denotes the (1 − α)-quantile of the values max_{s∉S1} |Y*_{1s}|, …, max_{s∉S1} |Y*_{Ms}|, where the Y*'s are defined in equation (10). Thus, the dimension of the testing problem is essentially reduced to testing the null hypotheses H0^(s): θ̃1 = θ̃2, where the hypothesis matrix C̃ has appropriate dimensions and μ̃ collects all corresponding μi's excluded from S1. Moreover, the dimension reduction implies that the critical value of the screening modification does not exceed the original one, i.e. y*_{1−α}(S1) ≤ y*_{1−α}. As the value of the test statistic does not change (max{|T_s| : s ∉ S1} = T0 by definition), screening indeed improves the power. For the same reason, however, screening might result in even more liberal test decisions than the original version without screening. In addition, the interpretation of the screening results may be challenging in the case of arbitrary contrasts or, especially, interaction effects. For these two reasons, we did not consider an additional screening stage. Detailed power investigations and dimension reductions will be part of future research. All of the methods considered in this paper are based on means of the data. Nonparametric methods, for instance based on ranks of the data, simultaneous methods based on interpoint distances, as well as combination functions other than the maximum 12 will be part of further investigations as well.
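The screening step itself is simple to code. In this sketch the names are illustrative: `T` is the observed (q,)-vector of statistics, `Ystar` an M × q matrix of randomization values Y* as in equation (10), and `bound` the first-stage cutoff defining S1.

```python
import numpy as np

def two_step_screening(T, Ystar, bound, alpha=0.05):
    """Two-step screening decision for H0.
    S1 collects the indices l with |T_l| <= bound (screened out).
    If every index is screened out, H0 is not rejected; otherwise the
    maximum statistic over the survivors is compared with the
    (1-alpha)-quantile of the corresponding randomization maxima."""
    S1 = np.abs(T) <= bound
    if S1.all():                       # cardinality of S1 equals q
        return False                   # do not reject H0
    keep = ~S1                         # survivors of the screening
    crit = np.quantile(np.abs(Ystar[:, keep]).max(axis=1), 1 - alpha)
    return bool(np.abs(T[keep]).max() >= crit)
```

Because the quantile is taken over maxima of a sub-vector, `crit` can only be smaller than or equal to the unscreened critical value, which is exactly the dimension-reduction argument y*_{1−α}(S1) ≤ y*_{1−α} made above.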