Criteria for evaluating risk prediction of multiple outcomes

Risk prediction models have been developed in many contexts to classify individuals according to a single outcome, such as risk of a disease. Emerging “-omic” biomarkers provide panels of features that can simultaneously predict multiple outcomes from a single biological sample, creating issues of multiplicity reminiscent of exploratory hypothesis testing. Here I propose definitions of some basic criteria for evaluating prediction models of multiple outcomes. I define calibration in the multivariate setting and then distinguish between outcome-wise and individual-wise prediction, and within the latter between joint and panel-wise prediction. I give examples such as screening and early detection in which different senses of prediction may be more appropriate. In each case I propose definitions of sensitivity, specificity, concordance, positive and negative predictive value and relative utility. I link the definitions through a multivariate probit model, showing that the accuracy of a multivariate prediction model can be summarised by its covariance with a liability vector. I illustrate the concepts on a biomarker panel for early detection of eight cancers, and on polygenic risk scores for six common diseases.

the entire genome, genetic risk can be efficiently measured with a generic micro-array, 10 and in principle could be calculated for multiple conditions at any point in life. Epigenetic variation may also provide useful risk stratification and has been advocated for the early detection of several cancers. 11 Furthermore, the emergence of large, broadly phenotyped cohorts such as UK Biobank 12 provides useful resources for developing and evaluating such models.
Apart from the practical efficiencies of conducting several assessments in parallel, simultaneous prediction has other potentially useful applications. Individuals may be more concerned about their risk across a range of conditions rather than of one in particular, a demand increasingly targeted by direct-to-consumer genetic testing companies. 13 Furthermore, some interventions may be effective for several conditions, and identification of individuals at increased risk of any of them may lead to greater impact of such interventions. As a simple example, body mass index is associated with several diseases with otherwise distinct causes, including coronary heart disease, type-2 diabetes, breast cancer and depression. 14 A weight loss intervention might be more effective when targeted to those at increased risk of any of those conditions. Similarly, evidence that aspirin usage could reduce the risk of various cancers 15 as well as of cardiovascular disease suggests that risk prediction for a set of diseases could be of benefit. More speculatively, forensic applications could utilise simultaneous prediction of phenotypes from anonymous DNA samples. 16,17 Prediction of this nature is already done informally using recurrent risk factors such as age, gender, smoking and blood pressure. For example, in the UK the NHS Health Check is offered to individuals aged between 40 and 74 on account of the strong association of age with risk of stroke, kidney disease, heart disease, type 2 diabetes and dementia. For such risk factors, their strength of association and ease of measurement obviate any need for formal evaluation over many outcomes. But for emerging risk factors it is less clear whether their utility is enhanced by their potential to predict multiple outcomes. There are problems of multiplicity reminiscent of those in exploratory hypothesis testing, but a framework is lacking for addressing these issues in the context of risk prediction.
Prediction of multiple outcomes can be distinguished from prediction of a single composite outcome. Composite outcomes have been used to group related conditions, such as cardiovascular disease, 18 and to define outcomes of specific interest such as frailty and all-cause mortality. 19 Prediction of such outcomes may be viewed as a crude form of multiple outcome prediction: here I consider the composite evaluation of multiple predictions, rather than the evaluation of a single composite prediction. Composite evaluation may offer improved accuracy over a composite outcome; pragmatically it can use predictors developed individually for each outcome without the need to develop a specific predictor for their composite.
Several authors have studied the statistical modelling of a multivariate response, using methods such as partial least squares 20 and multivariate linear regression. [21][22][23] While it is recognised that prediction can be improved by exploiting correlation among responses, the literature has emphasised methods to improve model fitting, with accuracy typically measured by squared error metrics for each response marginally 22,24 or in total across responses. 21 This may be adequate in applications such as chemometrics and genetic selection where the responses are quantitative, but is less satisfying for prediction of discrete outcomes. Here I am not concerned with model fitting per se but in evaluating models, however estimated, in the context of their joint risk predictions. There is some work on mutually exclusive events, such as polytomous outcomes 25 and competing risks, 26 but general vectors of dichotomous outcomes have not been studied.
Here I propose definitions of some basic criteria for evaluating risk prediction models of multiple outcomes. The evaluation of single outcome models, while not a settled question, has at least a standard set of core criteria that serve as a basis for more nuanced assessment. 27 The present aim is to propose a similar set of core criteria as a starting point for the development of more refined approaches. I do not aim to give a complete account of multiple outcome prediction, but to identify and open discourse around some basic issues in this emerging area.
In section 2, I identify four senses in which multiple predictions can be evaluated, termed outcome-wise, joint, and weak and strong panel-wise. Examples are given in which each sense of prediction may be appropriate. I define sensitivity, specificity, concordance, and relative utility in each of these senses. In section 3, I develop analytical expressions for each of these quantities from a multivariate probit model. These show that the accuracy of a multivariate prediction model can be summarised by its covariance with a liability vector, and from this covariance matrix all the proposed criteria can be derived. Section 4 applies the results to some examples of current applications, and uses the model of section 3 to project their future performance as improved predictors are developed. Section 5 provides some discussion.

Preliminaries
For individual $i = 1, \ldots, N$, let $D_i \in \{0,1\}^m$ be a vector of binary indicators for $m$ dichotomous outcomes. Say that outcome $j$ did occur when the $j$-th element of $D_i$ is 1, and that it did not occur when that element is 0. Similarly to Gail and Pfeiffer, [28] define the vector $p_i$ whose $j$-th component is the probability of outcome $j$ in individual $i$. Where necessary, components are identified by brackets: for example, $p_{i[j]}$ denotes the $j$-th component of $p_i$. Let $X_i$ be a vector of predictors and consider a marginal risk prediction model $r(x)$ as a mapping from the set $\mathcal{X}$ of possible values of $X_i$ to $[0,1]^m$. The model is understood as marginal in that, reflecting much current practice, $r(x)$ provides a risk prediction for each outcome but not for combinations of outcomes. In particular, correlations between outcomes may arise from comorbidity, competing risks or other sources, so that outcome-specific predictions may not be easily combined into predictions for groups of outcomes.
As for single outcome prediction, calibration is a desirable property of a risk predictor, and it will be generally useful for the predictor to be calibrated for all outcomes. Informally, calibration requires that predicted risks equal actual risks, but a distinction can be made between the risk among individuals with given predictors $x$, and risk among individuals with given predictions $r(x)$. These quantities may differ if $r(x)$ has the same value for many values of $x$, as in the case of a risk score formed as a linear combination of many predictors. [29]
Definition 1: The risk prediction model $r(x)$ is strongly calibrated if $E(D \mid x) = E(p \mid x) = r(x)$ for all $x \in \mathcal{X}$. The predictor is weakly calibrated if $E(D \mid r(x) = r^*) = E(p \mid r(x) = r^*) = r^*$ for all $x \in \mathcal{X}$ and $r^* \in [0,1]^m$.
Calibration is usually assessed by plots or goodness-of-fit tests. [29][30][31] While these approaches could generalise to a multivariate setting, the following component-wise definition is sufficient for application to marginal prediction models, and can be assessed by applying univariate methods to each component of $r(x)$.
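Definition 2: The risk prediction model $r(x)$ is strongly component-wise calibrated if $E(D_{[j]} \mid x) = E(p_{[j]} \mid x) = r_{[j]}(x)$ for all $x \in \mathcal{X}$ and $j = 1, \ldots, m$. The predictor is weakly component-wise calibrated if $E(D_{[j]} \mid r_{[j]}(x) = r^*) = E(p_{[j]} \mid r_{[j]}(x) = r^*) = r^*$ for all $x \in \mathcal{X}$, $r^* \in [0,1]$ and $j = 1, \ldots, m$.
Calibration implies component-wise calibration, but the converse need not hold. In the rest of the paper I assume that $r(x)$ is at least weakly component-wise calibrated.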
Let $t \in [0,1]^m$ be a vector of risk thresholds. Each individual $i$ is assigned to a high-risk category for each outcome $j$ where $r_{[j]}(X_i) \ge t_{[j]}$.

Outcome-wise criteria
A straightforward approach is to treat outcomes, rather than individuals, as the sampling units and then apply standard criteria to the vectorised outcomes. Such a view might be appropriate when the consequences of predicting or developing the outcomes are independent. This approach has been used in evaluating carrier screening panels for Mendelian disorders. [32] Another example might be in molecular screening for allergies. [33]
Definition 3: Outcome-wise sensitivity is the probability of a positive prediction for an outcome that did occur. Over the joint sample space of $D$ and $X$,
$$\mathrm{sens}_O(t) = \frac{E\big(D' I(r(X) \ge t)\big)}{E(D' \mathbf{1})}$$
where $I$ is a vector of component-wise indicators and $\mathbf{1}$ is the vector with all elements equal to one.
This is equivalent to the classical sensitivity when $m = 1$. However, whereas the classical sensitivity does not depend on the outcome probability $E(p)$, the outcome-wise sensitivity does depend on the relative outcome probabilities. To see this, write
$$\mathrm{sens}_O(t) = \sum_j \Pr(r_{[j]}(X) \ge t_{[j]} \mid D_{[j]} = 1) \, \frac{E(D_{[j]})}{\sum_k E(D_{[k]})}$$
The first term in the summand is the classical sensitivity for outcome $j$, so the outcome-wise sensitivity is the weighted sum of the individual outcome sensitivities, with the weights as the relative outcome probabilities. Therefore, a sample estimate of outcome-wise sensitivity may be subject to ascertainment bias, but a population estimate may be obtained by weighting the individual outcome sensitivities using external estimates of outcome probabilities.
Weights may be used to attach greater importance to the prediction of some outcomes. This may be done by generalising the outcome-wise sensitivity to
$$\mathrm{sens}_O(t) = \frac{E\big(D' W I(r(X) \ge t)\big)}{E(D' W \mathbf{1})}$$
where $W$ is a diagonal matrix with positive entries. Again this is equivalent to a weighted sum of individual outcome sensitivities, with the weights as the relative outcome probabilities scaled by the respective diagonal elements of $W$.
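The weighted-sum interpretation of outcome-wise sensitivity can be checked numerically. The following is a minimal sketch, assuming outcomes and predicted risks are held in N-by-m arrays; the function names, the toy data and the thresholds are hypothetical and are not taken from any of the applications discussed here.

```python
import numpy as np


def outcome_wise_sensitivity(D, R, t, w=None):
    """Proportion of occurred (individual, outcome) pairs with risk >= threshold.

    D: N x m array of 0/1 outcomes; R: N x m array of predicted risks;
    t: length-m threshold vector; w: optional positive weights (diagonal of W).
    """
    D = np.asarray(D, dtype=float)
    positive = (np.asarray(R, dtype=float) >= np.asarray(t, dtype=float)).astype(float)
    w = np.ones(D.shape[1]) if w is None else np.asarray(w, dtype=float)
    return float(((D * positive) @ w).sum() / (D @ w).sum())


def per_outcome_sensitivity(D, R, t):
    """Classical sensitivity computed separately for each outcome."""
    D = np.asarray(D, dtype=float)
    positive = (np.asarray(R, dtype=float) >= np.asarray(t, dtype=float)).astype(float)
    return (D * positive).sum(axis=0) / D.sum(axis=0)


# Hypothetical toy data: six individuals, two outcomes.
D = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 0], [1, 0]])
R = np.array([[0.30, 0.05], [0.10, 0.20], [0.25, 0.15],
              [0.05, 0.05], [0.10, 0.02], [0.40, 0.30]])
t = np.array([0.2, 0.1])

sens_O = outcome_wise_sensitivity(D, R, t)

# Weighted sum of single-outcome sensitivities, weights = relative outcome frequencies.
relative_freq = D.sum(axis=0) / D.sum()
weighted_sum = float(per_outcome_sensitivity(D, R, t) @ relative_freq)
```

In this toy sample the two quantities coincide by construction; in an ascertained sample the relative frequencies would be replaced by external estimates of the outcome probabilities.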

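Definition 4: Outcome-wise specificity is the probability of a negative prediction for an outcome that did not occur.
$$\mathrm{spec}_O(t) = \frac{E\big((\mathbf{1} - D)' I(r(X) < t)\big)}{E\big((\mathbf{1} - D)' \mathbf{1}\big)}$$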
Similarly to the sensitivity, the outcome-wise specificity is the weighted sum of the individual outcome specificities, with the weights as the relative probabilities of the complementary outcomes. General weights may be introduced as for the sensitivity.
A standard, if often criticised, [34][35][36] summary of sensitivity and specificity is the area under the receiver operating characteristic (ROC) curve, which for a single outcome is constructed by plotting sensitivity against 1 − specificity over the range of $t$. The idea of a ROC curve does not easily generalise to multiple outcomes because vectors $t$ need not parameterise a one-to-one mapping of specificity to sensitivity. However, the C-(concordance) index, [37] which for a single outcome is equivalent to the area under the entire ROC curve, can be extended more readily.
The C-index for a single outcome is the probability that, given one individual with the outcome and one without, the prediction is higher for the former, i.e. $\Pr(r(X_{i_1}) > r(X_{i_0}) \mid D_{i_1} = 1, D_{i_0} = 0)$. An outcome-wise extension might be to evaluate the same probability over outcomes rather than individuals. However, this would compare the predicted risk for an outcome that did occur to the predicted risk of a different outcome that did not occur, which is difficult to interpret when the elements of $t$ are unequal. Stated differently, if the aim is to quantify how well $r(x)$ discriminates outcomes that did occur from those that did not, it makes little sense to compare predictions for different outcomes when the risk thresholds for those outcomes may be different.
A more satisfactory approach is to compare a prediction for an outcome that did occur to a prediction for the same outcome when it did not occur. This just yields the C-index for that outcome, so the expected C-index for multiple outcomes is the weighted sum of individual outcome C-indices. For outcome $j$ the probability of observing a discordant pair of outcomes is $E(D_{[j]})(1 - E(D_{[j]}))$, giving
Definition 5: Outcome-wise C-index is the weighted sum of individual outcome C-indices.
$$C_O = \sum_j \Pr\big(r_{[j]}(X_{i_1}) > r_{[j]}(X_{i_0}) \mid D_{i_1[j]} = 1, D_{i_0[j]} = 0\big) \, \frac{E(D_{[j]})(1 - E(D_{[j]}))}{\sum_k E(D_{[k]})(1 - E(D_{[k]}))}$$
One criticism of the ROC curve is that it treats sensitivity and specificity equally when they may entail different benefits and costs. The relative utility curve has been proposed to address this issue, [38,39] and is especially useful for comparing different risk prediction models. Here I summarise its derivation for one outcome before developing an outcome-wise extension. Let $b$ be the benefit of correctly predicting an outcome that did occur, and $c$ the cost of incorrectly predicting an outcome that did not occur. Given a decision making risk threshold $t$, for an individual $i$ with risk prediction $r(X_i) = t$ the net benefit of a positive prediction is
$$b \Pr(D_i = 1 \mid r(X_i) = t) - c \Pr(D_i = 0 \mid r(X_i) = t)$$
and this is positive when
$$\frac{\Pr(D_i = 1 \mid r(X_i) = t)}{1 - \Pr(D_i = 1 \mid r(X_i) = t)} > \frac{c}{b}$$
It follows that if the risk predictor is weakly calibrated, the net benefit is positive if $r(X_i) > t$ where $t$ is such that
$$\frac{t}{1 - t} = \frac{c}{b}$$
Therefore, use of the threshold $t$ implies a cost-benefit ratio of $t/(1 - t)$. With this threshold, the expected net benefit over the population is
$$\Pr(D = 1)\,\mathrm{sens}(t)\,b - \Pr(D = 0)\,(1 - \mathrm{spec}(t))\,c$$
The relative utility is the ratio of this expectation to its theoretical maximum when sensitivity and specificity are both 1, thus
$$RU(t) = \mathrm{sens}(t) - \frac{\Pr(D = 0)}{\Pr(D = 1)} \, \frac{t}{1 - t} \, (1 - \mathrm{spec}(t))$$
The net benefit is understood as resulting from taking action on a prediction, and so is relative to the result of taking no action. If the default, in the absence of risk prediction, is to take no action, then that is equivalent to a risk predictor with sensitivity 0 and specificity 1 at all thresholds. Conversely, if the default were always to take action then the sensitivity is 1 and the specificity is 0. A default of no action is rational when its relative utility is greater than under the default of always taking action. The definition of $RU(t)$ shows that this occurs when $t \ge \Pr(D = 1)$, termed the relevant region for evaluating relative utility. [38] On the other hand, if the default is to take action, then the analogous definition for $t \le \Pr(D = 1)$ is
$$RU(t) = \mathrm{spec}(t) - \frac{\Pr(D = 1)}{\Pr(D = 0)} \, \frac{1 - t}{t} \, (1 - \mathrm{sens}(t))$$
These expressions assume negligible cost of evaluating $r(X)$; more general derivations are provided elsewhere. [38]
Turning now to multiple outcomes, let $b_O$ and $c_O$ represent common values of benefit and cost for all outcomes.
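For a single outcome, the relative utility under a default of no action can be computed directly from sensitivity, specificity, outcome probability and threshold. The sketch below assumes the standard form, sensitivity minus the product of the odds of no outcome, the implied cost-benefit ratio $t/(1-t)$ and the false positive rate; the function names and numerical inputs are hypothetical.

```python
def implied_cost_benefit_ratio(t):
    """Cost-benefit ratio c/b implied by acting at risk threshold t."""
    return t / (1.0 - t)


def relative_utility(sens, spec, prevalence, t):
    """Relative utility for a default of no action (relevant region t >= prevalence)."""
    false_positive_rate = 1.0 - spec
    odds_no_outcome = (1.0 - prevalence) / prevalence
    return sens - odds_no_outcome * implied_cost_benefit_ratio(t) * false_positive_rate


# Hypothetical inputs: 10% outcome probability, risk threshold 0.2.
ru_model = relative_utility(sens=0.6, spec=0.9, prevalence=0.1, t=0.2)
ru_perfect = relative_utility(sens=1.0, spec=1.0, prevalence=0.1, t=0.2)
ru_no_action = relative_utility(sens=0.0, spec=1.0, prevalence=0.1, t=0.2)
# Always acting has zero relative utility exactly at t = prevalence,
# which marks the lower boundary of the relevant region.
ru_always_act = relative_utility(sens=1.0, spec=0.0, prevalence=0.1, t=0.1)
```

The boundary case illustrates why thresholds below the outcome probability fall outside the relevant region when the default is no action.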
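(In practice these quantities may vary across outcomes, so they may be thought of here as average values.) Assume that benefits and costs are additive across outcomes within individuals. For an individual $i$ with risk prediction $r(X_i) = t$, the net benefit of a positive prediction is now
$$b_O E(D_i' \mathbf{1} \mid r(X_i) = t) - c_O E\big((\mathbf{1} - D_i)' \mathbf{1} \mid r(X_i) = t\big)$$
and is positive when
$$\frac{E(D_i' \mathbf{1} \mid r(X_i) = t)}{E\big((\mathbf{1} - D_i)' \mathbf{1} \mid r(X_i) = t\big)} > \frac{c_O}{b_O}$$
If the risk predictor is weakly component-wise calibrated, then
$$E(D_i' \mathbf{1} \mid r(X_i) = t) = t' \mathbf{1}$$
Therefore, the use of threshold vector $t$ implies the cost-benefit ratio
$$\frac{c_O}{b_O} = \frac{t' \mathbf{1}}{(\mathbf{1} - t)' \mathbf{1}}$$
Under additive benefits and costs, the expected net benefit over the population is
$$E(D' \mathbf{1})\,\mathrm{sens}_O(t)\,b_O - E\big((\mathbf{1} - D)' \mathbf{1}\big)\,(1 - \mathrm{spec}_O(t))\,c_O$$
Definition 6: Outcome-wise relative utility for threshold vector $t$ is
$$RU_O(t) = \mathrm{sens}_O(t) - \frac{E\big((\mathbf{1} - D)' \mathbf{1}\big)}{E(D' \mathbf{1})} \, \frac{t' \mathbf{1}}{(\mathbf{1} - t)' \mathbf{1}} \, (1 - \mathrm{spec}_O(t))$$
As before, a diagonal weight matrix $W$ can be used to allow some outcomes to contribute more to the relative utility, giving
$$RU_O(t) = \mathrm{sens}_O(t) - \frac{E\big((\mathbf{1} - D)' W \mathbf{1}\big)}{E(D' W \mathbf{1})} \, \frac{t' W \mathbf{1}}{(\mathbf{1} - t)' W \mathbf{1}} \, (1 - \mathrm{spec}_O(t))$$
where $\mathrm{sens}_O(t)$ and $\mathrm{spec}_O(t)$ are also used in their weighted versions.
For multiple outcomes, relative utility defines a surface over the space of threshold vectors, $RU : [0,1]^m \mapsto (-\infty, 1]$. The relevant region is $\{t : t' \mathbf{1} \ge E(D)' \mathbf{1}\}$ when the default, in the absence of risk prediction, is to take no action for any outcome. When there are outcomes for which the default is to take action, a pragmatic approach is to substitute the complementary outcomes, and thresholds, in the above definitions.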

Joint criteria
An issue with outcome-wise measures is that actions are applied to individuals rather than outcomes. In many contexts, it is more appropriate to summarise risk predictions for each individual before taking action. To this end I now define individual-wise measures, which vary according to the definition of a true positive prediction. For joint measures, the aim is to predict the joint occurrence of all outcomes in an individual. An example might be in forensic identification from an anonymous DNA sample, where a profile could be constructed from several traits such as hair colour, 40 height 16 and weight, 41 each discretised into broad categories.
Definition 7: Joint sensitivity is the probability of predicting all outcomes to occur, in an individual for which all outcomes did occur.
$$\mathrm{sens}_J(t) = \Pr\big(I(r(X) \ge t) = \mathbf{1} \mid D = \mathbf{1}\big)$$
If the elements of $r(X)$ are jointly independent and the elements of $D$ also are jointly independent, then
$$\mathrm{sens}_J(t) = \prod_j \Pr(r_{[j]}(X) \ge t_{[j]} \mid D_{[j]} = 1)$$
In this case, the joint sensitivity is the product of individual outcome sensitivities. However, in the general case of dependence between elements of $r(X)$ or $D$, the joint sensitivity is not readily expressed in terms of the individual outcome sensitivities.
Definition 8: Joint specificity is the probability of predicting at least one outcome not to occur, in an individual for which at least one outcome did not occur.
$$\mathrm{spec}_J(t) = \Pr\big(I(r(X) \ge t) \ne \mathbf{1} \mid D \ne \mathbf{1}\big)$$
Note that this may depend on the distribution of $D$ and therefore that an estimate of $\mathrm{spec}_J(t)$ may be subject to ascertainment bias. When information is available on the distribution of $D$, an unbiased estimate of $\mathrm{spec}_J(t)$ could be obtained by weighting each observation in which $D \ne \mathbf{1}$ by the inverse of its sampling probability.
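Definition 9: Joint C-index is the probability that, given one individual in which all outcomes did occur and one individual in which at least one outcome did not occur, the minimum risk prediction is higher in the former individual.
$$C_J = \Pr\big(\min_j r_{[j]}(X_{i_1}) > \min_j r_{[j]}(X_{i_0}) \mid D_{i_1} = \mathbf{1}, D_{i_0} \ne \mathbf{1}\big)$$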
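To define relative utility, let $b_J$ be the benefit of predicting all outcomes to occur when all outcomes did occur, and $c_J$ the cost of predicting all outcomes to occur when at least one outcome did not occur. For an individual $i$ with risk prediction $r(X_i) = t$, the net benefit of a positive prediction is
$$b_J \Pr(D_i = \mathbf{1} \mid r(X_i) = t) - c_J \Pr(D_i \ne \mathbf{1} \mid r(X_i) = t)$$
Therefore, use of the threshold vector $t$ implies a cost-benefit ratio of
$$\frac{c_J}{b_J} = \frac{\Pr(D = \mathbf{1} \mid r(X) = t)}{1 - \Pr(D = \mathbf{1} \mid r(X) = t)}$$
With this threshold, the expected net benefit in the population is
$$\Pr(D = \mathbf{1})\,\mathrm{sens}_J(t)\,b_J - \Pr(D \ne \mathbf{1})\,(1 - \mathrm{spec}_J(t))\,c_J$$
Definition 10: Joint relative utility for threshold $t$ is
$$RU_J(t) = \mathrm{sens}_J(t) - \frac{\Pr(D \ne \mathbf{1})}{\Pr(D = \mathbf{1})} \, \frac{\Pr(D = \mathbf{1} \mid r(X) = t)}{1 - \Pr(D = \mathbf{1} \mid r(X) = t)} \, (1 - \mathrm{spec}_J(t))$$
In general $\Pr(D = \mathbf{1} \mid r(X) = t)$ must be estimated. As this may be difficult in practice, the following working definition may be useful. If risk predictions and outcomes both are jointly independent, and the risk predictor is weakly component-wise calibrated, then $\Pr(D = \mathbf{1} \mid r(X) = t) = \prod_j t_{[j]}$ and
$$RU_J(t) = \mathrm{sens}_J(t) - \frac{\Pr(D \ne \mathbf{1})}{\Pr(D = \mathbf{1})} \, \frac{\prod_j t_{[j]}}{1 - \prod_j t_{[j]}} \, (1 - \mathrm{spec}_J(t))$$
The relevant region is $\{t : \Pr(D = \mathbf{1} \mid r(X) = t) \ge \Pr(D = \mathbf{1})\}$ when the default, in the absence of risk prediction, is to take no action for at least one outcome.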

Panel-wise criteria
For panel-wise criteria the aim is to predict the occurrence of at least one outcome in an individual. A correct prediction may, however, be defined in different ways according to whether the predicted outcomes are the same as those that did occur. Here I propose two senses of panel-wise prediction, called the weak and strong senses by analogy to family-wise errors in hypothesis testing.
Definition 11: Weak panel-wise sensitivity is the probability of predicting at least one outcome to occur, in an individual for which at least one outcome did occur.
$$\mathrm{sens}_S(t) = \Pr\big(I(r(X) \ge t) \ne \mathbf{0} \mid D \ne \mathbf{0}\big)$$
The subscript $S$ stands for screening as explained later. Note that this may depend on the distribution of $D$ and therefore that an estimate of $\mathrm{sens}_S(t)$ may be subject to ascertainment bias. When information is available on the distribution of $D$, an unbiased estimate of $\mathrm{sens}_S(t)$ could be obtained by weighting each observation in which $D \ne \mathbf{0}$ by the inverse of its sampling probability.
Definition 12: Weak panel-wise specificity is the probability of predicting no outcomes to occur, in an individual for which no outcomes did occur.
$$\mathrm{spec}_S(t) = \Pr\big(r(X) < t \mid D = \mathbf{0}\big)$$
Definitions 11 and 12 are complementary to the joint sensitivity and specificity, and similarly the weak panel-wise specificity is the product of the component-wise specificities in the case that risk predictions and outcomes both are jointly independent. The complement of weak panel-wise specificity is analogous to the weak sense of family-wise type-1 error rate in hypothesis testing. Similar arguments to the joint criteria give the following definitions of concordance and relative utility.
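Definition 13: Weak panel-wise C-index is the probability that, given one individual in which at least one outcome did occur and one individual in which no outcomes did occur, the maximum risk prediction is higher in the former individual.
$$C_S = \Pr\big(\max_j r_{[j]}(X_{i_1}) > \max_j r_{[j]}(X_{i_0}) \mid D_{i_1} \ne \mathbf{0}, D_{i_0} = \mathbf{0}\big)$$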
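Definition 14: Weak panel-wise relative utility for threshold vector $t$ is
$$RU_S(t) = \mathrm{sens}_S(t) - \frac{\Pr(D \ne \mathbf{0} \mid r(X) = t)}{1 - \Pr(D \ne \mathbf{0} \mid r(X) = t)} \, \frac{\Pr(D = \mathbf{0})}{\Pr(D \ne \mathbf{0})} \, (1 - \mathrm{spec}_S(t))$$
If risk predictions and outcomes both are jointly independent, and the risk predictor is weakly component-wise calibrated, then
$$\Pr(D \ne \mathbf{0} \mid r(X) = t) = 1 - \prod_j (1 - t_{[j]})$$
The relevant region is $\{t : \Pr(D \ne \mathbf{0} \mid r(X) = t) \ge \Pr(D \ne \mathbf{0})\}$ when the default, in the absence of risk prediction, is to take no action for any outcome.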
Turning to the strong sense definitions, the key difference is that the predicted and actual outcomes must coincide for at least one outcome that did occur.
Definition 15: Strong panel-wise sensitivity is the probability that at least one outcome is correctly predicted to occur in an individual for which at least one outcome did occur.
$$\mathrm{sens}_P(t) = \Pr\big(D' I(r(X) \ge t) \ne 0 \mid D \ne \mathbf{0}\big)$$
Estimates of $\mathrm{sens}_P(t)$ may be subject to ascertainment bias, which could be adjusted for by weighting each observation where $D \ne \mathbf{0}$ by the inverse of its sampling probability.
Definition 16: Strong panel-wise specificity is the probability that all outcomes that did not occur are predicted not to occur in an individual for which at least one outcome did not occur.
$$\mathrm{spec}_P(t) = \Pr\big((\mathbf{1} - D)' I(r(X) \ge t) = 0 \mid D \ne \mathbf{1}\big)$$
Definitions 15 and 16 complement each other in a different way to the weak sense Definitions 11 and 12. The complement of strong panel-wise specificity is analogous to the strong sense of family-wise type-1 error in hypothesis testing. Note that an individual may count towards both sensitivity and specificity, a property shared with the outcome-wise measures.
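The differences between the joint, weak panel-wise and strong panel-wise senses can be seen by computing each sensitivity and specificity on the same data. The following is a minimal sketch; the function name and the toy outcome and prediction matrices are hypothetical (they are not the data of Figure 1), and predictions are assumed to have already been dichotomised at the threshold vector.

```python
import numpy as np


def individual_wise_criteria(D, P):
    """Joint (J), weak panel-wise (S) and strong panel-wise (P) sensitivity and specificity.

    D: N x m boolean array, True where an outcome occurred.
    P: N x m boolean array, True where the predicted risk reached its threshold.
    """
    D = np.asarray(D, dtype=bool)
    P = np.asarray(P, dtype=bool)
    all_occurred, any_occurred = D.all(axis=1), D.any(axis=1)
    all_predicted, any_predicted = P.all(axis=1), P.any(axis=1)
    hit = (D & P).any(axis=1)            # some occurred outcome was predicted
    false_alarm = (~D & P).any(axis=1)   # some non-occurring outcome was predicted
    return {
        "sens_J": all_predicted[all_occurred].mean(),
        "spec_J": (~all_predicted)[~all_occurred].mean(),
        "sens_S": any_predicted[any_occurred].mean(),
        "spec_S": (~any_predicted)[~any_occurred].mean(),
        "sens_P": hit[any_occurred].mean(),
        "spec_P": (~false_alarm)[~all_occurred].mean(),
    }


# Hypothetical toy data: seven individuals, two outcomes.
D = np.array([[1, 1], [1, 0], [0, 1], [0, 0], [1, 0], [0, 0], [1, 0]])
P = np.array([[1, 1], [0, 1], [0, 1], [0, 0], [1, 0], [1, 0], [1, 1]])
criteria = individual_wise_criteria(D, P)
```

In this toy example the second individual counts towards the weak sensitivity but not the strong sensitivity, because the predicted outcome is not the one that occurred.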
Definition 17: Strong panel-wise C-index is the probability that, given one individual in which at least one outcome did occur and one individual in which at least one outcome did not occur, the maximum risk prediction is greater among the outcomes that did occur in the former individual than among the outcomes that did not occur in the latter.
$$C_P = \Pr\big(\max_{j : D_{i_1[j]} = 1} r_{[j]}(X_{i_1}) > \max_{j : D_{i_0[j]} = 0} r_{[j]}(X_{i_0}) \mid D_{i_1} \ne \mathbf{0}, D_{i_0} \ne \mathbf{1}\big)$$
Note that under this definition an individual may appear on both sides of the inequality (i.e. $i_1 = i_0$) and, unlike $C_J$ and $C_S$, $C_P$ does not have a natural interpretation as a measure of discrimination. Furthermore, it need not equal 0.5 under random predictions. Nevertheless it corresponds to definitions of sensitivity and specificity in the same way as those other measures of concordance, and could be used as a summary measure for comparing different predictors of a set of outcomes.
Relative utility cannot be developed in the same manner as $RU_J$ and $RU_S$, but the following working definition is analogous to that of the weak panel-wise sense.
Definition 18: Strong panel-wise relative utility for threshold vector $t$ is $RU_P(t) = \mathrm{sens}_P(t) - {}$[…] with the relevant region $\{t : {}$[…]$\}$. If risk predictions and outcomes both are jointly independent, and the risk predictor is weakly component-wise calibrated, then […]
Which of the weak or strong measures is more appropriate will depend on the application. For example, if the same action would be performed for all outcomes, it is less important to predict specific outcomes. That might be the case when screening for a range of conditions with a common intervention, as is done, say, when measuring blood pressure with a view to prescribing anti-hypertensives. For this reason I suggest screening, with subscript $S$, as a shorthand for weak panel-wise, and panel-wise itself, subscript $P$, as a shorthand for strong panel-wise, and will use those terms in the rest of the paper. (Strong) panel-wise measures may be appropriate in early detection settings where the action depends on the specific outcomes predicted. Figure 1 shows an example of four outcomes in eight individuals, showing which individuals count towards the different senses of sensitivity.

Multivariate probit model
For a single outcome, many measures of predictive accuracy can be expressed in terms of variance explained by the risk predictor, assuming a probit model for the outcome. [42] This allows any of the measures to be derived from reported values of any others, and argues for the use of variance explained as a fundamental measure of prediction accuracy without the caveats associated with, for example, ROC curves. Here this framework is extended to the prediction of multiple traits using a multivariate probit model for outcomes. [43]
Assume that individual $i$ has a latent liability vector $L_i$ distributed as multivariate normal with dimension $m$, mean vector $0$ and variance-covariance matrix $\Sigma_L$ with diagonal entries 1. Define the threshold vector $\tau$ such that outcome $j$ occurs whenever $L_{[j]} \ge \tau_{[j]}$, thus $\tau_{[j]} = \Phi^{-1}(1 - \Pr(D_{[j]} = 1))$.
Assume that each outcome has a single normally distributed predictor, so that the predictor vector $X_i$ is multivariate normal with dimension $m$, mean vector $0$ and variance-covariance matrix $\Sigma_X$. Let the joint distribution of liability and predictor be multivariate normal with mean vector $0$ and variance-covariance matrix
$$\Sigma = \begin{pmatrix} \Sigma_L & \Sigma_{LX} \\ \Sigma_{LX}' & \Sigma_X \end{pmatrix}$$
where component $[jk]$ of $\Sigma_{LX}$ is the covariance between liability for outcome $j$ and predictor of outcome $k$. A notable special case is $\Sigma_{LX} = \Sigma_X$. Then the diagonal elements of $\Sigma_X$ are the variances in each liability explained by the corresponding predictors, and for each outcome, conditional on its own predictor there is no additional information from any other predictors.
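The following expressions will be useful. If each element of $X$ estimates the corresponding element of $L$, the risk prediction for outcome $j$ is given by
$$r_{[j]}(x) = 1 - \Phi\left(\frac{\tau_{[j]} - x_{[j]}}{(1 - \Sigma_{X[jj]})^{1/2}}\right) \qquad (1)$$
and the risk threshold $t_{[j]}$ is equivalent to the predictor threshold
$$\tilde{t}_{[j]} = \tau_{[j]} - (1 - \Sigma_{X[jj]})^{1/2} \, \Phi^{-1}(1 - t_{[j]})$$
Given outcomes $D = d$, the liability follows a multivariate truncated normal distribution, with truncation at $\tau$ from below for the outcomes that did occur and from above for those that did not. Denote the conditional mean vector and variance-covariance matrix of the truncated liability by $\mu_{L|D=d}$ and $\Sigma_{L|D=d}$; these quantities may be computed numerically by the method of Tallis. [44,45] The Pearson-Aitken selection formulae [46] give the mean predictor in individuals with outcomes $d$ as
$$E(X \mid D = d) = \Sigma_{LX}' \Sigma_L^{-1} \mu_{L|D=d} \qquad (2)$$
and the variance-covariance matrix
$$\mathrm{var}(X \mid D = d) = \Sigma_X - \Sigma_{LX}' \Sigma_L^{-1} (\Sigma_L - \Sigma_{L|D=d}) \Sigma_L^{-1} \Sigma_{LX} \qquad (3)$$
Assume that conditional on $d$ the predictor follows the $m$-variate normal distribution with the above mean and variance-covariance. Furthermore since $L$ has mean $0$
$$\mu_{L|D \ne d} = -\frac{\Pr(D = d)}{1 - \Pr(D = d)} \, \mu_{L|D=d}$$
$$\Sigma_{L|D \ne d} = \frac{\Sigma_L - \Pr(D = d)\big(\Sigma_{L|D=d} + \mu_{L|D=d}\,\mu_{L|D=d}'\big)}{1 - \Pr(D = d)} - \mu_{L|D \ne d}\,\mu_{L|D \ne d}'$$
from which $E(X \mid D \ne d)$ and $\mathrm{var}(X \mid D \ne d)$ follow analogously to equations (2) and (3). Finally assume that conditional on a prediction $X = x$ the liability follows the $m$-variate normal distribution with the mean and variance-covariance given by the Pearson-Aitken selection formulae as
$$E(L \mid X = x) = \Sigma_{LX} \Sigma_X^{-1} x, \qquad \mathrm{var}(L \mid X = x) = \Sigma_L - \Sigma_{LX} \Sigma_X^{-1} \Sigma_{LX}'$$
The outcome-wise criteria can be expressed in terms of single outcome criteria, which are special cases of the joint criteria below and are therefore omitted for brevity.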
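The conversion between a risk threshold and the equivalent predictor threshold can be sketched as follows, assuming the special case in which each predictor estimates its own liability, so that the residual liability variance for outcome $j$ is one minus the $j$-th diagonal element of the predictor covariance. The function names and the numerical values are hypothetical.

```python
import numpy as np
from scipy.stats import norm


def liability_threshold(prevalence):
    """Liability threshold tau such that Pr(L >= tau) equals the outcome probability."""
    return norm.ppf(1.0 - np.asarray(prevalence, dtype=float))


def risk_from_predictor(x, tau, var_explained):
    """Risk Pr(L >= tau | X = x) when L = X + residual with variance 1 - var_explained."""
    resid_sd = np.sqrt(1.0 - np.asarray(var_explained, dtype=float))
    return norm.sf((tau - np.asarray(x, dtype=float)) / resid_sd)


def predictor_threshold(t, tau, var_explained):
    """Predictor value at which the predicted risk equals the risk threshold t."""
    resid_sd = np.sqrt(1.0 - np.asarray(var_explained, dtype=float))
    return tau - resid_sd * norm.ppf(1.0 - np.asarray(t, dtype=float))


# Hypothetical values for two outcomes.
prevalence = np.array([0.10, 0.05])
var_explained = np.array([0.20, 0.10])   # assumed diagonal of the predictor covariance
t = np.array([0.20, 0.10])

tau = liability_threshold(prevalence)
t_tilde = predictor_threshold(t, tau, var_explained)
risk_at_threshold = risk_from_predictor(t_tilde, tau, var_explained)
```

Evaluating the risk at the predictor threshold recovers the risk threshold, which is a quick consistency check on the two expressions.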

Joint criteria
From Definition 7
$$\mathrm{sens}_J(t) = \frac{\Phi\big((-\tau, -\tilde{t})'; 0, \Sigma\big)}{\Phi(-\tau; 0, \Sigma_L)}$$
where $\Phi(\cdot; \mu, \Sigma)$ denotes the multivariate normal cumulative distribution function with mean vector $\mu$ and variance-covariance matrix $\Sigma$. From Definition 8
$$\mathrm{spec}_J(t) = \Pr\big(I(r(X) \ge t) \ne \mathbf{1} \mid D \ne \mathbf{1}\big) = 1 - \frac{\Phi(-\tilde{t}; 0, \Sigma_X) - \Phi\big((-\tau, -\tilde{t})'; 0, \Sigma\big)}{1 - \Phi(-\tau; 0, \Sigma_L)}$$
Calculating joint concordance requires the distribution of the maximum element of the multivariate risk predictor. This has recently been derived analytically [47] but can be approximated by simulation. First simulate a predictor from the multivariate normal distribution conditional on $D = \mathbf{1}$, given by equations (2) and (3), and convert each component to a risk using equation (1). Simulate a second predictor in the same way but conditional on $D \ne \mathbf{1}$. Over a large number of simulations, the joint concordance is estimated as the proportion in which the minimum risk of the first predictor exceeds the minimum in the second.
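The joint sensitivity and specificity under the probit model can be evaluated with a multivariate normal distribution function and checked against a direct simulation. The sketch below assumes the special case in which the liability-predictor covariance equals the predictor covariance; the covariance matrices, outcome probabilities and thresholds are hypothetical, and `scipy.stats.multivariate_normal` is used for the distribution function.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Hypothetical two-outcome example with Sigma_LX = Sigma_X.
Sigma_L = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma_X = np.array([[0.3, 0.1], [0.1, 0.3]])
Sigma = np.block([[Sigma_L, Sigma_X], [Sigma_X.T, Sigma_X]])

prevalence = np.array([0.2, 0.1])
t = np.array([0.3, 0.2])
tau = norm.ppf(1.0 - prevalence)                                        # liability thresholds
t_tilde = tau - np.sqrt(1.0 - np.diag(Sigma_X)) * norm.ppf(1.0 - t)     # predictor thresholds


def mvn_cdf(upper, cov):
    """Zero-mean multivariate normal distribution function evaluated at `upper`."""
    return float(multivariate_normal(mean=np.zeros(len(upper)), cov=cov).cdf(upper))


# By symmetry of the zero-mean normal, Pr(L >= tau, X >= t_tilde) = CDF at (-tau, -t_tilde).
p_all = mvn_cdf(-tau, Sigma_L)
p_all_and_positive = mvn_cdf(np.concatenate([-tau, -t_tilde]), Sigma)
p_positive = mvn_cdf(-t_tilde, Sigma_X)
sens_J = p_all_and_positive / p_all
spec_J = 1.0 - (p_positive - p_all_and_positive) / (1.0 - p_all)

# Monte Carlo check from the joint distribution of liability and predictor.
rng = np.random.default_rng(0)
draws = rng.multivariate_normal(np.zeros(4), Sigma, size=400_000)
L, X = draws[:, :2], draws[:, 2:]
all_occurred = (L >= tau).all(axis=1)
all_predicted = (X >= t_tilde).all(axis=1)
sens_J_mc = all_predicted[all_occurred].mean()
spec_J_mc = (~all_predicted)[~all_occurred].mean()
```

The simulated values should agree with the analytic ones up to Monte Carlo error, which shrinks as the number of draws increases.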
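From Definition 10, the joint relative utility is
$$RU_J(t) = \mathrm{sens}_J(t) - \frac{\Pr(D = \mathbf{1} \mid X = \tilde{t})}{1 - \Pr(D = \mathbf{1} \mid X = \tilde{t})} \, \frac{1 - \Phi(-\tau; 0, \Sigma_L)}{\Phi(-\tau; 0, \Sigma_L)} \, (1 - \mathrm{spec}_J(t))$$
where $\Pr(D = \mathbf{1} \mid X = \tilde{t}) = \Phi\big(-\tau; -E(L \mid X = \tilde{t}), \mathrm{var}(L \mid X = \tilde{t})\big)$.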

Screening criteria
Following analogous steps to the joint measures, from Definition 11
$$\mathrm{sens}_S(t) = 1 - \frac{\Phi(\tilde{t}; 0, \Sigma_X) - \Phi\big((\tau, \tilde{t})'; 0, \Sigma\big)}{1 - \Phi(\tau; 0, \Sigma_L)}$$
From Definition 12
$$\mathrm{spec}_S(t) = \frac{\Phi\big((\tau, \tilde{t})'; 0, \Sigma\big)}{\Phi(\tau; 0, \Sigma_L)}$$
To estimate screening concordance, first simulate a predictor from the multivariate normal distribution conditional on $D = \mathbf{0}$, given by equations (2) and (3), and convert each component to a risk using equation (1). Simulate a second predictor in the same way but conditional on $D \ne \mathbf{0}$. Over a large number of simulations, the screening concordance is estimated as the proportion in which the maximum risk of the second predictor exceeds the maximum in the first.
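A simulation of screening concordance might look as follows. This sketch deliberately swaps in a simpler technique than the one described above: rather than drawing predictors from the normal approximation conditional on the outcomes, it draws liabilities and predictors jointly and splits the draws by observed outcome status. It assumes the special case in which the liability-predictor covariance equals the predictor covariance, and all names and numerical values are hypothetical.

```python
import numpy as np
from scipy.stats import norm


def screening_concordance(Sigma_L, Sigma_X, prevalence, n_draws=200_000, seed=0):
    """Monte Carlo estimate of Pr(max risk in a case > max risk in a control).

    Cases have at least one outcome; controls have none.
    Assumes Sigma_LX = Sigma_X, i.e. L = X + e with e ~ N(0, Sigma_L - Sigma_X).
    """
    rng = np.random.default_rng(seed)
    m = len(prevalence)
    tau = norm.ppf(1.0 - np.asarray(prevalence, dtype=float))
    X = rng.multivariate_normal(np.zeros(m), Sigma_X, size=n_draws)
    residual = rng.multivariate_normal(np.zeros(m), Sigma_L - Sigma_X, size=n_draws)
    L = X + residual
    # Convert each predictor component to a risk under the probit model.
    risk = norm.sf((tau - X) / np.sqrt(1.0 - np.diag(Sigma_X)))
    max_risk = risk.max(axis=1)
    any_occurred = (L >= tau).any(axis=1)
    cases = rng.permutation(max_risk[any_occurred])
    controls = rng.permutation(max_risk[~any_occurred])
    n_pairs = min(len(cases), len(controls))
    return float((cases[:n_pairs] > controls[:n_pairs]).mean())


# Hypothetical two-outcome example.
Sigma_L = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma_X = np.array([[0.3, 0.1], [0.1, 0.3]])
c_S = screening_concordance(Sigma_L, Sigma_X, prevalence=[0.2, 0.1])
```

Because the simulated individuals are independent, randomly pairing cases with controls gives an estimate of the concordance without enumerating all pairs.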
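From Definition 14, the screening relative utility is
$$RU_S(t) = \mathrm{sens}_S(t) - \frac{\Pr(D \ne \mathbf{0} \mid X = \tilde{t})}{1 - \Pr(D \ne \mathbf{0} \mid X = \tilde{t})} \, \frac{\Phi(\tau; 0, \Sigma_L)}{1 - \Phi(\tau; 0, \Sigma_L)} \, (1 - \mathrm{spec}_S(t))$$
where $\Pr(D \ne \mathbf{0} \mid X = \tilde{t}) = 1 - \Phi\big(\tau; E(L \mid X = \tilde{t}), \mathrm{var}(L \mid X = \tilde{t})\big)$.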

Panel-wise criteria
Panel-wise measures can be evaluated by summing over outcome vectors $d$. From Definition 15 the panel-wise sensitivity is
$$\mathrm{sens}_P(t) = 1 - \frac{\sum_{d \ne \mathbf{0}} \Pr\big(D = d, \, d' I(X \ge \tilde{t}) = 0\big)}{1 - \Phi(\tau; 0, \Sigma_L)}$$
The probability in the summand is an integral of the multivariate normal density with mean vector $0$ and variance-covariance matrix $\Sigma$. For components $j$ where $d_{[j]} = 1$, the limits of integration are $[\tau_{[j]}, \infty]$ for the liability components and $[-\infty, \tilde{t}_{[j]})$ for the predictor components. For components $j$ where $d_{[j]} = 0$, the limits are $[-\infty, \tau_{[j]})$ and $[-\infty, \infty]$, respectively.
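From Definition 16 the panel-wise specificity is
$$\mathrm{spec}_P(t) = \frac{\sum_{d \ne \mathbf{1}} \Pr\big(D = d, \, (\mathbf{1} - d)' I(X \ge \tilde{t}) = 0\big)}{1 - \Phi(-\tau; 0, \Sigma_L)}$$
For components $j$ where $d_{[j]} = 1$, the limits of integration are $[\tau_{[j]}, \infty]$ for the liability components and $[-\infty, \infty]$ for the predictor components. For components $j$ where $d_{[j]} = 0$, the limits are $[-\infty, \tau_{[j]})$ and $[-\infty, \tilde{t}_{[j]})$, respectively.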
To estimate panel-wise concordance, simulate liabilities $L$ and predictors $X$ from their joint multivariate normal distribution with mean vector $0$ and variance-covariance matrix $\Sigma$. Concordance is estimated according to Definition 17 using pairs of simulated $L$ and $X$ in which one has $D \ne \mathbf{1}$ and the other has $D \ne \mathbf{0}$.
The panel-wise relative utility can be calculated from Definition 18 using expressions given above.
All the criteria are now expressed in terms of the marginal outcome probabilities $\Pr(D_{[j]} = 1)$ and the joint variance-covariance matrix $\Sigma$ of liability and predictor. A summary measure of the prediction accuracy is suggested by the multivariate analysis of variance, via Wilks' $\Lambda$:
$$1 - \Lambda = 1 - \frac{|\Sigma_L - \Sigma_{LX} \Sigma_X^{-1} \Sigma_{LX}'|}{|\Sigma_L|}$$
This is the proportion of variance of $L$ explained by the predictor $X$. For a single outcome, $1 - \Lambda$ equals the coefficient of determination from the regression of $L$ on $X$. [42]
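The summary measure can be computed from the blocks of the joint covariance matrix. The sketch below assumes the usual multivariate analysis of variance form of Wilks' statistic, the determinant of the residual liability covariance given the predictor divided by the determinant of the total liability covariance; the function name and the covariance values are hypothetical.

```python
import numpy as np


def wilks_lambda(Sigma_L, Sigma_LX, Sigma_X):
    """Ratio of residual to total generalised variance of liability given the predictor."""
    Sigma_L = np.atleast_2d(Sigma_L)
    Sigma_LX = np.atleast_2d(Sigma_LX)
    Sigma_X = np.atleast_2d(Sigma_X)
    residual = Sigma_L - Sigma_LX @ np.linalg.solve(Sigma_X, Sigma_LX.T)
    return float(np.linalg.det(residual) / np.linalg.det(Sigma_L))


# Single outcome: a predictor explaining 20% of liability variance (Sigma_LX = Sigma_X = 0.2).
explained_single = 1.0 - wilks_lambda([[1.0]], [[0.2]], [[0.2]])

# Hypothetical two-outcome example in the special case Sigma_LX = Sigma_X.
Sigma_L = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma_X = np.array([[0.3, 0.1], [0.1, 0.3]])
lambda_two = wilks_lambda(Sigma_L, Sigma_X, Sigma_X)
explained_two = 1.0 - lambda_two
```

In the single-outcome case the result reduces to the squared correlation between liability and predictor, consistent with the coefficient of determination.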

CancerSEEK
CancerSEEK is a blood-based test of circulating proteins and tumour DNA mutations that are associated with the presence of cancer. 9 It has been proposed for early detection of cancers of the ovary, liver, stomach, pancreas, esophagus, colorectum, lung, or breast. A single test is applied, from which a positive result suggests the presence of one of these cancers. Given a positive test, a secondary algorithm identifies the likely site of the cancer.
CancerSEEK tests a composite outcome, and as such the standard univariate criteria correspond to screening criteria. However, the authors reported sensitivities for each cancer individually, at a risk threshold of 0.893, and reported their incidence-weighted average as 55%. This average corresponds to outcome-wise sensitivity (Definition 3), but it is also a screening sensitivity if at most one cancer is present in each subject. The screening specificity was reported as over 99%.
The in-sample screening sensitivity at this risk threshold was 62.2% and the area under the ROC curve (AUC) was 91% (Figure 2a in Cohen et al. 9). However, as noted in Definition 11, these estimates are subject to ascertainment bias, in particular the under-sampling of breast cancers relative to other cancer cases, which explains the discrepancy between the in-sample and incidence-weighted sensitivities. I randomly resampled cases from each cancer (their Table S4) in proportion to their incidence rates (L. Danilova, personal communication). The in-sample screening sensitivity was now 55%, equal to the outcome-wise sensitivity, and the screening concordance reduced to 89%. This is the concordance that would be expected in a population screening context.
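The effect of ascertainment on an averaged sensitivity can be illustrated with a short Python sketch. The per-cancer sensitivities, case counts and incidence rates below are hypothetical and are not the CancerSEEK estimates; the weighted average is the value that resampling cases in proportion to the weights would give in expectation, assuming at most one cancer per subject.

```python
def weighted_sensitivity(sensitivity, weights):
    """Average of per-outcome sensitivities, with weights normalised to sum to 1."""
    total = sum(weights.values())
    return sum(sensitivity[k] * weights[k] / total for k in sensitivity)

# Hypothetical values for illustration only.
sensitivity = {"cancer_A": 0.9, "cancer_B": 0.6, "cancer_C": 0.3}
cases_sampled = {"cancer_A": 100, "cancer_B": 100, "cancer_C": 50}  # study design
incidence = {"cancer_A": 10, "cancer_B": 30, "cancer_C": 60}        # population rates

in_sample = weighted_sensitivity(sensitivity, cases_sampled)  # weights cases as sampled
population = weighted_sensitivity(sensitivity, incidence)     # weights cases by incidence
```

Here the cancer with the lowest sensitivity is under-sampled relative to its incidence, so the in-sample average (0.66) exceeds the incidence-weighted average (0.45).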

Polygenic risk scores
A polygenic risk score (PRS) is an aggregation of genetic risk, $\hat{\beta}' G$, where $\hat{\beta}$ is a vector of estimated effects (e.g. log odds ratios) and $G$ is a vector of coded genotypes (e.g. numbers of risk alleles) across many DNA sites, typically single nucleotide polymorphisms (SNPs). 48 A PRS can be computed for many diseases at once in the same individual, by forming products of different effect vectors with the fixed genotype vector. PRS have been constructed for a number of diseases and have shown promise for risk prediction. 10 Table 1 shows six diseases for which PRS have been fitted using variants across the whole genome, as opposed to a limited number of associated SNPs. The reported AUCs were converted to liability variances explained using published formulae, 49 giving the diagonal elements of $\Sigma_{LX}$. Assume that the correlation between pairs of estimated PRS equals the total genetic correlation of the diseases, which was obtained from the LD-Hub database 50 (Table 2) to give the off-diagonal elements of $\Sigma_X$. This assumption is more tenable for these PRS, which include variants across the whole genome, than for PRS constructed from a limited number of associated SNPs. Assume further that the correlation between disease liabilities also equals the genetic correlation, giving $\Sigma_L$ (Table 3). Finally assume that the PRS for disease $j$ has no covariance with disease liability $k$ conditional on the PRS for disease $k$, where $j \neq k$. Under this assumption $\Sigma_X = \Sigma_{LX}$ (Table 2).
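The computation of several PRS from one genotype vector can be sketched in Python as follows. The disease names, effect sizes and genotypes are hypothetical and serve only to show the product of each effect vector with the fixed genotype vector.

```python
def polygenic_scores(effects, genotypes):
    """Return one score per disease: the inner product of its effect vector
    with the individual's genotype vector."""
    scores = {}
    for disease, beta in effects.items():
        if len(beta) != len(genotypes):
            raise ValueError(f"effect vector for {disease} has the wrong length")
        scores[disease] = sum(b * g for b, g in zip(beta, genotypes))
    return scores

# Hypothetical effects (log odds ratios) at four SNPs for two diseases.
effects = {
    "disease_1": [0.10, -0.05, 0.00, 0.20],
    "disease_2": [0.00, 0.15, 0.05, -0.10],
}
genotypes = [2, 1, 0, 1]  # risk-allele counts for one individual

scores = polygenic_scores(effects, genotypes)
```

The same genotype vector is reused for every disease, so adding a disease requires only a further effect vector.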
Under the model developed in section 3, the event-wise concordance is 0.653, the screening concordance is 0.607, which is lower than all individual AUCs, and the joint concordance is 0.749. The panel-wise concordance is 0.49, compared to a value of 0.37 obtained when the correlation matrices are the same but all individual AUCs are set to 0.5.
For illustration, consider a screening application to identify, early in life, those at elevated risk of at least one of these diseases. Suppose the risk threshold vector is set equal to the prevalence, so that the predictor identifies individuals with above-average predicted risk for at least one disease. The screening sensitivity is 0.955, which is considerably higher than the individual sensitivities (Table 1). However, the screening specificity is much lower at 0.074. As in multiple hypothesis testing, the prediction of multiple outcomes increases both the true-positive and false-positive rates at a given threshold vector, but the thresholds that reflect the cost-benefit ratio differ between the multiple prediction context and the single predictions. The screening concordance of 0.607 suggests that, across all thresholds regarded equally, the sensitivity-specificity trade-off is not as good as for any disease individually. The screening relative utility is −0.004, suggesting that these PRS provide no benefit in a multiple screening application. The liability variance explained is $1 - \Lambda = 0.332$, which of itself is higher than the individual $R^2$ (Table 1) but, as just seen, leads to lower values of several criteria of accuracy.
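The multiplicity effect described here can be illustrated by a small Monte Carlo sketch in Python. It is not the calculation used for the PRS example: the outcomes are assumed independent, the variances explained and prevalences are hypothetical, and screening sensitivity and specificity are taken to mean the probability of at least one positive prediction among individuals with at least one outcome, and of no positive prediction among individuals with no outcome. Each predictor is scaled so that its variance equals its covariance with the liability, and a prediction is positive when the predicted risk exceeds the prevalence.

```python
import random
from statistics import NormalDist

def screening_accuracy(r2, prevalence, n=20000, seed=1):
    """Monte Carlo screening sensitivity and specificity for independent outcomes."""
    rng = random.Random(seed)
    phi = NormalDist()
    s = [phi.inv_cdf(1 - p) for p in prevalence]  # liability thresholds
    true_pos = cases = true_neg = controls = 0
    for _ in range(n):
        any_event = any_positive = False
        for j, (v, p) in enumerate(zip(r2, prevalence)):
            x = rng.gauss(0, v ** 0.5)                # predictor with variance v
            liab = x + rng.gauss(0, (1 - v) ** 0.5)   # liability with unit variance
            # Predicted risk Pr(liability >= s_j | x) under the probit model.
            risk = 1 - phi.cdf((s[j] - x) / (1 - v) ** 0.5)
            any_event |= liab >= s[j]
            any_positive |= risk > p
        if any_event:
            cases += 1
            true_pos += any_positive
        else:
            controls += 1
            true_neg += not any_positive
    return true_pos / cases, true_neg / controls

# Hypothetical setting: each outcome has prevalence 0.1 and variance explained 0.1.
sens1, spec1 = screening_accuracy([0.1], [0.1])
sens6, spec6 = screening_accuracy([0.1] * 6, [0.1] * 6)
```

Moving from one outcome to six at the same thresholds raises the screening sensitivity and lowers the screening specificity, which is the qualitative pattern reported above.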
In principle, PRS could be developed that explain greater proportions of liability 48 up to the so-called SNP heritability (Table 1). Under this scenario the liability variance explained increases to $1 - \Lambda = 0.765$, giving a screening concordance of 0.664 and relative utility of 0.275. This suggests that further progress in genetic prediction may lead to more useful applications in multiple screening contexts, especially if further combined with non-genetic risk factors.

Discussion
Standard concepts of sensitivity and specificity generalise naturally to the multivariate setting. Positive and negative predictive values generalise similarly, and for completeness their definitions are provided in the supplementary text. Although the ROC curve does not extend so easily, the related concept of concordance does so. However, in contrast to the single outcome setting, concordance is sensitive to the outcome probabilities, negating one perceived advantage of that criterion. In the strong panel-wise sense the concordance is unsatisfying because an individual can be regarded as being discordant with itself, and there is no natural interpretation in terms of discrimination. The range of panel-wise concordance depends upon the number of outcomes and the covariance of predictors and outcomes, and may fall below 0.5. In practice its minimum value can be estimated by simulation or theory, as in section 4.2, by setting the predictors to be independent of the outcomes while maintaining the correlation among predictors and among outcomes. Strong panel-wise measures have an intermediate position between outcome-wise and screening measures, in that prediction is evaluated at the individual level but the predictions of specific outcomes are taken into account. The proposed definitions are motivated by possible applications in early detection of disease, and have convenient analogies with family-wise error in hypothesis testing, but other approaches may be possible.
Relative utility, which is a useful summary of sensitivity and specificity when predicting a single outcome, presents some difficulties when predicting multiple outcomes. I propose definitions assuming common benefits and costs for all outcomes, which allow analogous development to that for a single outcome, but may lead to suboptimal assessment of utility when the benefits and costs vary across outcomes. When outcomes are correlated, accurate calculation of relative utility may be difficult, so approximations are provided assuming independent predictors and outcomes. It remains to be seen how useful these definitions prove in practice, given their assumptions of common additive benefits and costs, and independent predictors and outcomes.
Some examples of screening have been discussed, but examples of outcome-wise or joint accuracy can also be envisaged. CancerSEEK is a recent example of molecular technology applied to early detection of multiple cancers. Its performance was reported in the screening sense, but the proposed definitions clarify that all quantities can be affected by ascertainment bias. The present criteria are more sensitive to incidence and sampling rates than the corresponding univariate measures.
I have only considered the accuracy of a given predictor, and have not considered how such predictors are constructed. Multivariate predictors could be constructed simply by concatenating univariate predictors. The example of PRS shows that this is feasible and pragmatic given that such scores are currently constructed from case/control studies of individual diseases. In future, given the increasing availability of extensive phenotyping in large cohorts, it will be possible to build prediction models with the optimisation of multiple outcome prediction as the direct objective. Methodology for such model building is a fertile area for future work.
Prediction models are often evaluated for their improvement over existing models. Evaluation of incremental performance remains a controversial subject when predicting a single trait. Among several proposed measures the net reclassification index has attained a default status among practitioners yet has received strong criticism. 51,52 Such issues are likely to be magnified when predicting multiple traits.
Given predictors for a set of outcomes, a natural question is whether there is some subset of outcomes for which risk prediction is most effective. Naïve comparison of, say, relative utilities for different groups of outcomes would be inappropriate without consideration of the relative benefits of predicting each group. Thus, the finding that the screening concordance of PRS is lower over six diseases than for each disease individually should not in itself argue against a screening application, because the benefits and costs of screening six diseases are different from those of screening one disease. Many authors have argued for decision-theoretic treatments of risk prediction. 28,53 Such approaches can also be developed for the multiple outcome setting and would put the comparison of predictors for different groups of outcomes on a more coherent footing.
Competing risks present a problem for mutually exclusive outcomes, such as diseases of later life. There is a distinction between accounting for competing risks in model building, and in model evaluation. The emphasis here is on evaluation, for which the proposed criteria could be adapted to account for competing risks. However, the explicit consideration of multiple outcomes may encourage more careful consideration of competing risks at the model building stage and lead to improved prediction in general.
An R library to calculate these criteria from empirical data, and to evaluate the multivariate probit formulae of section 3, is available from https://github.com/DudbridgeLab/multipred.

Acknowledgement
I am grateful to Paul Newcombe for reading the manuscript, and to Richard Morris, Alex Sutton and Angelica Ronald for their helpful suggestions.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

Supplemental material
Supplemental material for this article is available online.