Statistical inference for diagnostic test accuracy studies with multiple comparisons

Diagnostic accuracy studies assess the sensitivity and specificity of a new index test in relation to an established comparator or the reference standard. The development and selection of the index test are usually assumed to be conducted prior to the accuracy study. In practice, this is often violated, for instance, if the choice of the (apparently) best biomarker, model or cutpoint is based on the same data that is used later for validation purposes. In this work, we investigate several multiple comparison procedures which provide family-wise error rate control for the emerging multiple testing problem. Due to the nature of the co-primary hypothesis problem, conventional approaches for multiplicity adjustment are too conservative for the specific problem and thus need to be adapted. In an extensive simulation study, five multiple comparison procedures are compared with regard to statistical error rates in least-favourable and realistic scenarios. This covers parametric and non-parametric methods and one Bayesian approach. All methods have been implemented in the new open-source R package cases which allows us to reproduce all simulation results. Based on our numerical results, we conclude that the parametric approaches (maxT and Bonferroni) are easy to apply but can have inflated type I error rates for small sample sizes. The two investigated Bootstrap procedures, in particular the so-called pairs Bootstrap, allow for a family-wise error rate control in finite samples and in addition have a competitive statistical power.


Introduction
The aim of diagnostic and prognostic studies is generally to differentiate between two groups, e.g. the diseased from the non-diseased or those with good outcome from those with poor outcome.This differentiation can be based on a single clinical parameter or a combination of parameters.The methods and results presented here are independent of whether the goal is diagnosis or prognosis and whether a single parameter or a combination is evaluated.Furthermore, this work also covers the scenario that several investigated diagnostic procedures are defined by different (machine-learned) prediction models which are based on the same input data, e.g.medical images.For better readability, the term diagnosis, diagnostic study and diagnostic test is used in the following, but prognosis, prognosis study and prognostic marker are meant at the same time.
In order to assess the accuracy of a new diagnostic test, the so-called index test, the result is compared with the true condition, which is assessed using the gold standard or reference standard.The recommended co-primary endpoints are sensitivity as the proportion correctly diagnosed as diseased and specificity as the proportion correctly diagnosed as non-diseased.If a standard test exists, the index test should be compared with it, preferably in the within-subject design (all tests in all individuals).However, if there is no established comparator, the aim is to demonstrate a pre-specified minimum sensitivity and specificity, the so-called minimally acceptable criteria (MAC) (European Medicines Agency, CHMP, 2010; Korevaar, Gopalakrishna, Cohen, & Bossuyt, 2019).
Since the study is only considered a success if both hypotheses regarding sensitivity and specificity can be rejected, no correction is required for the multiplicity of the two endpoints (European Medicines Agency, CHMP, 2010).However, multiplicity problems may occur elsewhere in diagnostic studies, especially if the aim is to evaluate different markers in parallel and select the best diagnostic test or to determine the optimal cut-off value for a single marker.Usually, a test or cut-off value is selected in a first step data-driven and then validated in a new study.However, it is known that data-driven selection may lead to an overestimation of the diagnostic accuracy and that the results of the selection study often cannot be replicated in the validation study (Leeflang, 2008).The same phenomenon, sometimes referred to as optimization bias or selectioninduced bias, is well known to also exist in bioinformatics and predictive modelling (Boulesteix & Strobl, 2009;Jelizarow, Guillemot, Tenenhaus, Strimmer, & Boulesteix, 2010).
Therefore, Westphal and Brannath (2020) have proposed a framework for the evaluation of multiple prediction models in which not the (apparently) single-best model is chosen and subsequently validated, but rather a set of promising models from which the best is selected in the evaluation study.Westphal, Zapf, and Brannath (2022) have transferred the framework to diagnostic accuracy studies with co-primary endpoints sensitivity and specificity.In the same way, the framework can also be used to determine the optimal cut-off value for a single biomarker after choosing a range of promising cut points in a prior development study.Regardless of the context, this framework offers increased flexibility and and can increase statistical power, i.e. the probability to correctly demonstrate a sufficiently high diagnostic accuracy for one of the selected tests.
However, it is important to correct for multiplicity in the validation study to avoid inflation of the type-one error.The correction approaches can be divided into single-step and stepwise procedures.In contrast to stepwise procedures, single-step procedures have the advantage that associated simultaneous confidence intervals can be constructed (Strassburger & Bretz, 2008).This is relevant in diagnostic accuracy studies as confidence intervals are an important indicator of estimation uncertainty.
The simplest single-step correction is the Bonferroni adjustment, in which the adjusted significance level is equal to the global significance level, say α = 0.05, divided by the number of tests.This approach leads to conservative results when the diagnostic test results are positively correlated.This is for instance the case if the same biomarker or risk score is dichotomized at slightly different cut points.More elaborate multiple comparison procedures considered in this work aim to exploit the correlation structure and thereby increase statistical power.Existing approaches include the maxT-approach (Hothorn, Bretz, & Westfall, 2008) utilized by Westphal and Brannath (2020).This method is based on a multivariate normal approximation which leads to less conservative decisions in light of positive correlations.Nonparametric methods for diagnostic test evaluation were developed by Konietschke, Hothorn, and Brunner (2012) and Zapf, Brunner, and Konietschke (2015) transferred this approach to early diagnostic studies with the area under the receiver operating characteristic (ROC) curve, the so-called AUC as primary outcome measure.Rudolph (2017) in turn transferred the approaches from the AUC to sensitivity and specificity as co-primary endpoints.
The main contribution of this work is the adaptation of several multiple comparison procedures to diagnostic accuracy studies with multiple index tests with sensitivity and specificity as co-primary endpoints.The results of an extensive simulation study are presented to allow a systematic, neutral and reproducible comparison of relevant statistical properties.Finally, we introduce a new open-source R package cases3 which provides convenient implementations for all investigated methods.This work is structured as follows.Section 2 presents a motivational real-world data example.Different multiple comparison procedures and the design of the simulation study are introduced in Section 3. Section 4 then presents the results of the simulation study and the example data set.A discussion of the results and the conclusion can be found in Section 5.

Motivating example: Breast cancer diagnosis
For the motivating example, we utilize the "Breast Cancer Wisconsin (Diagnostic) Data Set"4 .In total, the dataset consists of 569 observations of digitized image of a fine needle aspirate of a breast masses.Hereby, n 1 = 212 observations are breast cancer cases and n 0 = 357 are controls (fibrocystic breast masses).Thirty real-valued features that were computed from the image data which describe characteristics of the cell nuclei present in the image.A detailed description of the dataset and feature extraction methodology is provided by Mangasarian, Street, and Wolberg (1995); Street, Wolberg, and Mangasarian (1993); Wolberg, Street, Heisey, and Mangasarian (1995); Wolberg, Street, and Mangasarian (1994).Two (independent) example scenarios are described below and analysed in Section 4.2.For both scenarios, we assume that a high sensitivity is deemed more important compared to a high specificity because false negatives have more serious consequences than false positives.

Scenario A: Biomarker assessment
In this first example scenario, our goal is the assessment of the diagnostic accuracy of multiple biomarker candidates in a single study.In this example, three of the thirty available features, have been selected before the study based on prior data and knowledge.The selected biomarker candidates are the most extreme area, compactness and concavity of any of the ceil nuclei.To define a diagnostic test based on a continuous biomarker, a threshold needs to be specified in addition to allow categorization into diseased and healthy subjects.However, the optimal threshold can usually only be specified with uncertainty before the study.In our example, we specify five threshold candidates for each biomarker, roughly corresponding to the 30%, 40%, 50%, 60% and 70% quantiles of each biomarker distribution.In this scenario, our framework will be used to investigate the diagnostic accuracy of the resulting 15 combinations of markers (3) and thresholds (5 per marker) simultaneously while adjusting for the resulting multiplicity.

Scenario B: Risk model evaluation
In modern medical applications, individual predictive features are commonly combined to risk prediction models.In this second example scenario, we evaluate several candidate models simultaneously.We split the dataset into 379 and 190 observations for model learning and evaluation, respectively, to emulate a prospective evaluation study after model development.In the development phase, five variants of the elastic net algorithm are employed, corresponding to five different values of the penalty mixing hyperparameter α ∈ {0, 0.25, 0.5, 0.75, 1} (Friedman, Hastie, & Tibshirani, 2010;Zou & Hastie, 2005).For each of the five variants, the optimal penalty strength λ is found via 10-fold-cross-validation.In the subsequent evaluation phase, the five optimized models ought to be assessed simultaneously.Because each model provides a predicted probability of cancer, the question arises which probability threshold should be used in practice to make classifications.As a high sensitivity is targeted, the thresholds 10%, 20%, 30%, 40%, 50% will be evaluated on the remaining 'prospective' evaluation data.In effect, in this scenario, our framework will be used to investigate the diagnostic accuracy of the resulting 25 combinations of risk models (5) and probability thresholds (5 per model) simultaneously while adjusting for the resulting multiplicity.

Methods
In the following, the statistical model and the hypotheses are presented (3.1).Various approaches which allow for a multiplicity adjusted analysis are then described (3.2 and 3.3).Finally, the design of the simulation study is outlined (3.4).

Statistical model and hypotheses
A sample of n individuals is assumed, divided into n 0 non-diseased and n 1 diseased (with n = n 0 + n 1 ).We consider the within-subject design, i.e. for each person i = 1, . . ., n, the reference standard and m index tests are applied.The comparison of each test T j , j = 1, . . ., m, and reference standard D results in a true sensitivity se j = P (T j = 1|D = 1) and specificity sp j = P (T j = 0|D = 0).Hereby, T j ∈ {0, 1} indicates test-positive (1) and test-negative (0) and D ∈ {0, 1} indicates diseased (1) and non-diseased (0) individuals, respectively.As stated in the introduction, the different index tests T j may be based on a single clinical parameter or biomarker which is dichotomized based on different cut points.On the other hand, each T j may also be the result of a different complex prediction model based on high-dimensional input data (e.g.medical images).A mixture of simple and complex models is also possible.We consider the two alternative hypotheses that true sensitivity and specificity are above a pre-specified minimum sensitivity and specificity, which are denoted by se 0 and sp 0 (Korevaar et al., 2019).This leads to the individual null hypotheses for each test T j .These two hypotheses are connected according to the intersection-union principle to the combined null hypothesis per index test T j , j = 1, . . ., m.The global null hypothesis for our multiple testing problem then reads as Usual (maximum-likelihood) estimates of sensitivity ( se j ) and specificity ( sp j ) are calculated as se j = n j,11 n 1 and sp j = n j,00 n 0 (4) with n j,11 as number of diseased individuals with a positive test result and n j,00 as number of non-diseased individuals with a negative test result, each for test j.In the cases package, the estimates (4) are actually slightly shrunk towards 0.5 by default to prevent singular (co)variance estimates Westphal et al. (2022).The decision to retain or reject any null hypothesis can be made by means of a multiple comparison procedure as discussed in the next two sections.

Multiple comparison procedures
As a starting point for the data analysis we define usual Wald test statistics for the j-th index test by A level α test for the combined hypothesis H j (without adjustment for multiplicity) can easily be obtained.For this, both individual test statistics need to surpass an appropriate critical value c α (European Medicines Agency, CHMP, 2010; Korevaar et al., 2019).In the simplest case, c α = z 1−α is the 1 − α quantile of the standard normal distribution.With this choice, we can define the rejection criterion This approach gives approximate control of the type I error for H 0,j under arbitrary parameter configurations (se j , sp j ).Due to asymptotic nature of the procedure, the type I error is often inflated for small sample sizes and proportions close to 0 or 1 (Newcombe, 1998).Most importantly, even in the asymptotic case, it does not provide control of the family-wise error rate (FWER) for H 0 which is defined as probability to obtain at least one false positive rejection ϕ j = 1 of one of the true null hypotheses H 0,j , j ∈ 1, . . ., m.
Various possibilities exist to allow for control of the FWER of which several are investigated in this work.
All methods are based on existing approaches which however need to be adapted to the special structure of the hypothesis problem (3).The required adaptation is similar for all methods and implies a higher power compared to a native application.This is illustrated in detail based on corresponding comparison regions in Section 3.3.In the following, we briefly describe all investigated methods.
• no adjustment: As outlined above, this naive approach is based on (6) and c α = z 1−α chosen to be the 1 − α quantile of the standard normal distribution.
• Bonferroni: This well-known multiplicity correction can also be described by ( 6) but c α = z 1−α/m now corresponds to the standard normal quantile for an adjusted local significance level α * = α/m • maxT: This asymptotic parametric approach is based on a multivariate normal approximation of the relevant vector of test statistics (Westphal et al., 2022).The critical value c α in ( 6) is calculated as an equicoordinate quantile of the multivariate normal distribution with expectation 0 and estimated correlation matrix R. Positively correlated diagnostic test decisions imply a smaller c α and thus less conservative test results.
• pairs Bootstrap: A nonparametric approach based on (6) whereby c α is chosen to be the empirical 1 − α quantile of a bootstrap sample of the maximum test statistic Z = max j (Z j ) = max j min(Z se j , Z sp j ).For that matter, bootstrap samples of the original data are drawn in a 'paired' fashion such that the original correlation structure of the problem is replicated.A general introduction to this and the next bootstrap approach in the context of regression models is given by Flachaire (2005).
• wild Bootstrap: This approach is similar to the pairs bootstrap but the mechanism to draw new bootstrap samples is different (Flachaire, 2005;Zapf et al., 2015).This approach is investigated although it is not designed for binary data as bootstrap redraws are not necessarily binary.
• mBeta: A Bayesian approach whereby two independent multivariate beta-binomial models are fitted for diseased and healthy subgroups (Westphal, 2019).A rather particular loss function needs to be employed to achieve (frequentist) control of the FWER relevant for hypothesis problem (3).
All methods are implemented in the new R package cases.Implementation details can be checked in the open source code and corresponding documentation.Besides the Bayesian approach, all methods are capable to provide adjusted p-values for decision making.
The above-mentioned hypotheses (2) can be adapted to the case that all index tests T j are compared to a common comparator T 0 , i.e. established diagnostic test with unknown diagnostic accuracy, instead of fixed performance thresholds se 0 and sp 0 .All statistical test procedures listed above are also capable to perform more general contrast tests and allow to specify non-inferiority or superiority margins.In the simulation study in Section 3.4, we however only consider the scenario described in Section 3.1.

Confidence and comparison regions
Besides multiplicity adjusted test decisions and p-values, a (two-dimensional) uncertainty quantification is also important in diagnostic accuracy studies.Confidence regions are the multivariate counterpart to confidence intervals.In our case of m index tests, each with unknown sensitivity and specificity, we are interested in 2m-dimensional confidence regions C = C 1−α with coverage probability 1 − α, i.e. we require We focus our attention on rectangular confidence regions C which are the Cartesian product of test specific confidence regions For simplicity, we only consider one-sided intervals/regions in the following, with all upper limits equal to one.
The duality of confidence intervals and statistical tests for a single parameter is well-known.Similarly, we might expect that defining a statistical test by rejecting H j when both se 0 / ∈ C se j and sp 0 / ∈ C sp j from (2) gives us valid, multiplicity adjusted test decisions.While this approach indeed allows (approximate) control of the FWER, we will illustrate in the following that this procedure is conservative and can easily be improved.Eckert and Vach (2020) introduce the concept of comparison regions to display the uncertainty in sensitivity and specificity of a diagnostic procedure.We define the region of interest R = {(se, sp) ∈ [0, 1] 2 : se > se 0 ∧ sp > sp 0 } as the complement of H 0,j in [0, 1] 2 .A comparison (or decision) region for a single index test is a region D j such that defining a statistical test ϕ j via allows control of the type I error rate.In other words, a statistical test decision based on the comparison region indicates a significant result if D j is completely contained in R. In contrast to Eckert and Vach (2020), we focus on rectangular regions of interest in this work and in addition introduce the concept of multiplicity adjusted comparison regions.
The simplest way to illustrate the difference between confidence and comparison regions is the Bonferroni adjustment.For a single index test (j = m = 1), consider the region with c α * = z 1−α * being the 1 − α * standard normal quantile.Then, C j,α = A j,α/2 (α * = α/2) defines an (asymptotic) confidence region for θ j = (se j , sp j ).An adjustment for multiplicity is needed here, as both, sensitivity and specificity shall be covered by C with high probability, compare (7).On the other hand, the comparison region, as the statistical test for the co-primary endpoint problem, does not require an adjustment for α and hence D j,α = A j,α (α * = α).
This directly extends to multiplicity adjusted confidence and comparison regions when m > 1 index tests are simultaneously evaluated.Here, the Bonferroni adjustment is α * = α/(2m) for the confidence region, as 2m parameters shall be covered.The adjustment for the corresponding comparison region however is only α * = α/m.All statistical methods introduced in Section 3 allow to calculate multiplicity adjusted confidence and comparison regions.For all procedures, one will find that D α ⊂ C α and thus decisions based on confidence regions will not be dual to test decisions but rather more conservative.
Figure 1 illustrates this difference for a synthetic data set with n 1 = 30 cases and n 0 = 90 controls and m = 4 index tests with varying accuracy.The region of interest R is displayed in the upper right-hand corner in green.In the left subfigure, multiplicity adjusted comparison regions by the maxT-approach are indicated by solid lines around each point estimate ( se j , sp j ).The study conclusion in this case would be a rejection of the null H 0,j only one of the four index tests (j = 1, solid/blue lines) as D 1 ⊂ R but D j ⊂ R for all other tests j = 2, 3, 4 (dashed/orange lines).The right subfigure reveals that a decision based on the corresponding (maxT-adjusted) confidence regions would have not resulted in any rejection of the null hypothesis as C j ⊂ R for all j = 1, 2, 3, 4.

Simulation study
We perform a simulation study to evaluate and compare the multiple comparison procedures described in Section 3.2.Our primary goal is a comparison with regards to family-wise error rate and statistical power.The (disjunctive) statistical power is the probability to obtain at least one rejection of a false null hypothesis H 0,j .For that matter, two different data-generating mechanisms ('LFC', 'Biomarker') are investigated and outlined in the following.The key features of the simulation study are described in Table 1 following the ADEMP framework (Boulesteix et al., 2020;Morris, White, & Crowther, 2019).
For both simulation settings (compare Section 3.4.1 and 3.4.2),many different scenarios are considered which are described by a number of design parameters (setting, sample size, prevalence, correlation structure, ...).For each scenario, n sim = 10, 000 synthetic datasets were generated.The simulation study is conducted such that all methods are applied to the exact same synthetic datasets.The standard (Monte-Carlo) error for estimated proportions p such as the FWER or statistical power is (p(1 − p)/n sim ) 1/2 which is bounded by 0.005 for n sim = 10, 000.In all simulation scenarios, we investigated sampling with a fixed sample sizes n 0 , n 1 which corresponds to a simple case-control within-subjects study design.Instruction for that matter are provided via an additional simulation repository which also contains the results of additional sensitivity analyses5 .As the source code for all statistical methods is available in the cases package, this also serves as a documentation of implementation-related details.To reduce the computational demand of the simulation study, we consider only the comparison of m candidate tests to fixed performance criteria se 0 = sp 0 .The cases package furthermore supports comparison of m candidate tests to the unknown sensitivity and specificity of a comparator and other relevant contrast tests.The simulations were conducted in R (version 4.2.1) with help of the batchtools6 package.

LFC setting
In the first setting, a single synthetic data set consists of two separate samples for diseased and healthy.For that matter, two different multivariate Binomial models are employed to generate binary data matrices Q 1 and Q 0 .These matrices have entries whereby Q g ij indicates if the j-th diagnostic test (data-based decision rule) T j under consideration correctly predicted the i-th disease status D i in subgroup g ∈ {0, 1}.The multivariate binomial distribution in general has 2 m parameters but reduced representations do exist which can be specified by the mean vector and correlation matrix.We utilize the R package bindata7 to generate the data Q se , Q sp with given first moments se, sp, respectively (Leisch, Weingessel, & Hornik, 1998).
In this scenario, we focus on least-favorable parameter configurations (LFC) which are defined as parameters which maximize the FWER.As discussed by Westphal and Brannath (2020), an LFC in this case is defined by mean vectors se, sp with entries whereby se 0 , sp 0 are the parameter thresholds which define the hypotheses (1) and b = (b 1 , . . ., b m ) ∈ {0, 1} m is a fixed binary vector.For each index j we thus have either se j or sp j exactly at the boundary of the null hypothesis and the other parameter equal to one.For each of the two endpoints, a correlation matrix is specified in addition.To limit computational resources, we restricted the attention to equicorrelation between non-degenerated parameters, i.e. ρ se jj = ρ se for all indices j = j with se j = 1 and se j = 1.

Biomarker setting
The least-favorable parameter (LFC) setting discussed in the last section allows to assess worst case error rates.However, LFCs are rarely representative of real-world situations (Westphal et al., 2022).The goal of this second simulation study is thus to evaluate the multiple comparison procedures under more realistic parameter configurations.
To generate a single synthetic data set, we start out by simulating l continuous biomarkers from a multivariate binormal ROC model (Pepe, 2003, Section 4.4).This means, that we consider multivariate markers Hereby, µ 1 k , is the mean of marker k in the diseased (D = 1) population.For simplicity, we assume a mean µ 0 k = 0 for all k = 1, . . ., l in the healthy population (D = 0) and a variance of one in both populations.If we aim for a certain AUC of marker k, this can be achieved by solving for µ 1 k which has the solution , 2003, Section 4.4.2).Hereby, Φ is the cumulative distribution function of the standard normal distribution.
These continuous scores V k are split at different cut points c j to obtain an overall list of diagnostic tests T j = 1(V kj > c j ), j = 1, . . ., m with m ≥ l.The resulting decision rules have true parameters depending on the cut point c j .Alternatively, this process can also be identified as thresholding risk model scores to obtain different prediction models.The correlation matrices R 0 , R 1 in ( 12) have a simple structure (e.g.equicorrelation) in our simulation study.Together with the chosen cut points, this induces a certain correlation structure for the actual binary decision rules.Naturally, two (close) nearby cut points, induce (highly) similar predictions and thus a (highly) positive correlation between estimates.

Results
4.1 Simulation study

LFC setting
The results from the LFC (least-favorable configuration) setting described in Section 3.4.1 are displayed in Figure 2. On the left, the FWER is shown depending on the total sample size n = n 1 + n 0 for a control-tocase ratio of 3 : 1 and m = 10 index tests.The parameter values are set to an LFC with se 0 = sp 0 = 0.8, compare Section 3.4.1.The red dashed vertical line shows the (one-sided) target significance level of α = 0.025.
First, applying no multiplicity adjustment at all, clearly leads to a vastly increased FWER of 20% and more compared to the target significance level of α = 0.025.Second, the asymptotic methods (maxT, Bonferroni) are capable of controlling the FWER for large n but fail to do so for small n.For n = 100 both methods show an increased FWER of up to 10%, declining to around 5% for n ≥ 800.Third, the Boostrap approaches (pairs, wild) and the Bayesian mBeta approach all control the FWER for all but the lowest sample size of n = 100.
The right part of Figure 2 shows the (disjunctive) power for the same data instances after dropping the minimal acceptance criteria se 0 , sp 0 both to 0.75.The ordering of methods stays the same as for the FWER analysis, as expected.The gap between mBeta and Bootstrap approaches is somewhat more pronounced than before.
Further analyses (not shown visually here) indicate, that these findings for FWER and power remain qualitatively similar when only m = 5 index tests are considered are when the correlation structure between  diagnostic test result is changed.When the control-to-cases ratio is changed to 10 : 1, all curves are essentially shifted.To obtain a similar power as before, a higher total sample size is required due to the dependence on min(n 1 , n 0 ).As expected, when se 0 is changed from 0.8 to 0.9, the FWER is increased for finite samples but the qualitative observations regarding the asymptotic behaviour still persist.

Biomarker setting
Figure 3 shows results from the simulation in the more realistic biomarker setting described in Section 3.4.2.
In contrast to the worst case assessment in the last section, the FWER is below 0.005 for all methods and nearly all sample sizes and as such far below the target significance level of α = 0.025.This behaviour is expected based on previous work (Westphal & Brannath, 2020).
Regarding statistical power the ordering of methods is mostly similar to the situation in the LFC setting.One important difference is the much better performance of both Bootstrap approaches which seem to adapt much better to the underlying generative distribution in this situation.

Analysis of motivating example
To conclude, we apply our methodology to real-world data and interpret the results.For that matter, the pairs bootstrap approach is applied in the two motivating example scenarios introduced in Section 2. To formalize our requirement that a high sensitivity is prioritized, we define the minimal acceptance criteria as se 0 = 0.9 and sp 0 = 0.7.The entire analysis can be reproduced by following the corresponding vignette of the new R package cases.

Scenario A: Biomarker assessment
The most promising of the 15 investigated classification rules in terms of diagnostic accuracy on the evaluation data (212 cases, 357 controls) is the maximal area of any present cell nucleus with a threshold of 700µm 2 applied for categorization into cases (> 700µm 2 ) and controls (≤ 700µm 2 ).This rule achieves an empirical sensitivity of 96.0% and an empirical specificity 80.6%.The multiplicity adjusted lower comparison bounds are 92.1% and 74.6%, respectively.As both lower bounds are larger than their respective accepetance criteria, the co-primary null hypothesis can be rejected for this test candidate.This is visualized in Figure 4 (left).None of the other comparison regions is completely contained in the region of interest corresponding to the alternative hypothesis.In effect, the null hypothesis cannot be rejected for any other test candidate even though individual sensitivity and specificity estimates of other candidates are larger than those reported above.

Scenario B: Risk model evaluation
In total, 25 risk prediction models were assessed on the evaluation data (71 cases, 119 controls).For four models, the null hypothesis se ≤ 0.9 ∨ sp ≤ 0.7 can be rejected as these models have lower comparison bounds for sensitivity and specificity that greater than the corresponding acceptance criteria.Empirical sensitivities of these models lie between 99.3% and 97.9% and specificities between 88.8% and 85.4%.This is visualized in Figure 4 (right).These four classifiers all result from thresholding different model risk scores at the lowest probability threshold of 0.10.Because all four models are deemed acceptable regarding their discriminatory performance, other criteria may influence the final model decision.For instance, the most parsimonious model can be chosen to increase interpretability.In this example, the lasso fit (elastic net mixing parameter α = 1) results in only 11 non-zero regression coefficients and is much sparser as the three competitors (with α < 1) which have 18, 19 and 23 non-zero coefficients, respectively.

Discussion
In this work, we investigated statistical inference methods for diagnostic accuracy studies with multiple index tests and co-primary endpoints sensitivity and specificity.Multiplicity corrections are relevant in this context as omitting a suitable correction can otherwise induce overoptimistic results and inflated error rates.While a control is easily possible using a traditional (FWER) correction for all 2m parameters (m index tests with unknown sensitivity and specificity), this approach is too conservative for the co-primary endpoint analysis.Several multiple comparison procedures can be adapted to the specific testing problem such that this conservative nature is resolved.This difference was illustrated by contrasting confidence regions and comparison regions in Section 3.3.Only the latter allow for a two-dimensional uncertainty quantification which is dual to the statistical test result.
We conducted an extensive simulation study to compare five multiple comparison procedures with regards to family-wise error rate (FWER) and statistical power.All five procedure are capable to control the FWER at least asymptotically.The Bonferroni adjustment and maxT-approach both need quite large sample sizes to reach the target significance level.This however depends crucially on the control-to-cases ratio.A more skewed class distribution results in higher sample size demand.The Bayesian mBeta approach and the two Bootstrap (pairs, wild) procedures managed to control the FWER much better under least-favorable parameter configurations.Based on the simulation study, we recommend the use of the pairs Bootstrap approach.This approach yielded good FWER control for small sample sizes and competitive power in realistic scenarios.The wild bootstrap approach has very similar performance, but the method is more complex, has more design choices, and was not originally designed to be used for binary data.For larger sample sizes, the maxT-approach is a competitive alternative and easier to apply compared to the bootstrap procedures.The Bonferroni method is by far the easiest method to apply but too conservative in some situations and as such suboptimal.
In this work, we focused on the family-wise error rate and the statistical power.In our framework, point estimates are slightly shrunk due to the implemented minor regularization to avoid singular (co)variance estimates.However, point estimates were not adjusted for multiplicity.Westphal and Brannath (2020); Westphal et al. (2022) illustrated how to apply the maxT-approach to obtain median-conservative point estimators.This generic approach can be also adapted to the other multiple comparison procedures.
All investigated methods can be adapted to the case that not only two (diseased and healthy) subpopulations need to be distinguished but more than two subpopulations, e.g.different disease severities.This capability is already implemented in the cases package.We expect that the performance will then crucially depend on the prevalence of the smallest subpopulation.This should however be confirmed in dedicated simulation studies.

Figure 1 :
Figure 1: Exemplary analysis of synthetic example with four index tests.Region of interest (se 0 = 0.8, sp 0 = 0.7) is displayed in green.Left: comparison regions.Right: confidence regions.Solid/blue lines imply a rejected null hypothesis; dashed/orange lines imply a non-rejected null hypothesis.

Figure 2 :
Figure 2: Simulation results for the 'LFC' setting.Left: FWER, Right: Power.The horizontal axis shows the total sample size (cases and controls n = n 1 + n 0 ) in both subplots.

Figure 3 :
Figure 3: Simulation results for the 'Biomarker' setting.Left: FWER, Right: Power.The horizontal axis shows the total sample size (cases and controls n = n 1 + n 0 ) in both subplots.

Figure 4 :
Figure 4: Results from the analysis of the two breast cancer diagnosis example scenarios.Left: biomarker assessment (scenario A).Right: risk model evaluation (scenario B).Solid/blue lines imply a rejected null hypothesis; dashed/orange lines imply a non-rejected null hypothesis.

Table 1 :
Key features of simulation studyAll numerical experiments presented in this work are reproducible by means of the new R package cases.