Use of the concordance index for predictors of censored survival data

The concordance index is often used to measure how well a biomarker predicts the time to an event. Estimators of the concordance index for predictors of right-censored data are reviewed, including those based on censored pairs, inverse probability weighting and a proportional-hazards model. Predictive and prognostic biomarkers often lose strength with time, and in this case the aforementioned statistics depend on the length of follow up. A semi-parametric estimator of the concordance index is developed that accommodates converging hazards through a single parameter in a Pareto model. Concordance index estimators are assessed through simulations, which demonstrate substantial bias of classical censored-pairs and proportional-hazards model estimators. Prognostic biomarkers in a cohort of women diagnosed with breast cancer are evaluated using new and classical estimators of the concordance index.


Introduction
After determining if predictors of censored survival data are significant, a common objective is to measure their predictive strength on a scale that is not sample dependent. A plethora of statistics have been suggested. Some have attempted to transfer the concept of R 2 from linear regression to censored data. 1,2 In this article we consider use of the concordance index for censored data.
The first part of the paper reviews the concordance index for predictors of censored survival data. The second part develops concordance index estimators that are valid when the strength of the predictor becomes diminished with follow up. Our proposals are compared with classical methods using computer simulations and a breast cancer prognostic biomarker example.

Concordance index
The concordance index was initially developed to estimate the degree to which a randomly chosen observation from one distribution was larger than one chosen independently from another distribution. 3 When T 1 and T 2 are continuous independent random variables with cumulative distribution functions F 1 and F 2 the concordance index is If T 1 and T 2 place positive mass at the same point then we count half for ties and define C as P(T 1 > T 2 ) þ P(T 1 ¼ T 2 )/2 so that and C ¼ 0.5 when the two distributions are the same, even with ties. The concordance index can be estimated from the normalized Wilcoxon ranksum (Mann-Whitney) statistic, bŷ where T 1i (i ¼ 1, . . ., n) and T 2j (j ¼ 1, . . ., m) are independent samples from F 1 and F 2 respectively, and I(.) denotes the indicator function. If R i denotes the rank of the T 1i (i ¼ 1, . . ., n) in the combined sample (T 11 , . . ., T 1n , T 21 , . . ., T 2m ) with the ranks of tied observations averaged, then the Wilcoxon ranksum test statistic is given by W ¼ P n i¼1 R i , which can be related toĈ through W ¼ nmĈ þ nðn þ 1Þ=2. When the samples (T 1 and T 2 ) come from cases and controls respectively, the concordance index is the area under the receiver operating characteristic curve for (F 1 , F 2 ). 4 When the samples are from two arms of a randomised control trial, C is a measure of the treatment effect. Some variations of C have also been studied. These include the odds of concordance C(1ÀC) À1 , [5][6][7] and a modification to account for matched case-control designs, 8 but they are not considered further in this article.
For a one-parameter family {T Z } of random variables indexed by real number Z from distribution {F Z }, a concordance index that quantifies the degree of association between T Z and Z is defined as where the last term essentially derives from allowing ties in Z to be broken at random. 9 The definition has the advantage of being continuous in the distribution of F Z and is equivalent to Kendall's rank correlation coefficient because C Z ¼ 0:5 þ =2. C Z and C are not the same when Z is a two-point distribution, but they are linearly related. Consider where Z ¼ 1, 2 (e.g. respectively cases and controls, or treated and untreated) and Thus for the balanced two-sample situation the range of C Z is only (1/4, 3/4) and not (0, 1) as for C. This important aspect is due to ties in Z, and interpretation of C Z is affected whenever ties in Z are possible. For example, the upper bound of C Z may decrease if a continuous Z is rounded. Although obvious from (2), this might seem surprising because in practice it is often implicitly assumed that the range of the concordance index C Z is always (0, 1). Some bounds on the range of C Z are as follows. Suppose there are n discrete values of Z. Then the smallest possible P(Z 1 ¼ Z 2 ) occurs when they are distributed uniformly so that PðZ 1 ¼ Z 2 Þ ¼ 1=n; the smallest minimum value of C Z with n points is (2n) À1 and the maximum is 1 À ð2nÞ À1 . Therefore, with discrete data one might normalize C Z so that it can theoretically attain 0 and 1 via fC Z À ð2nÞ À1 gð1 À 1=nÞ À1 . For large n the range of C Z is less of an issue, and for continuous distributions of Z the range of C Z is (0, 1), as can be seen by letting T Z ¼ {ÀZ} and T Z ¼ {Z} respectively be a set of degenerate one-point distributions for continuous Z.
In the rest of the paper we focus on estimators of C and C Z for right-censored data.

Estimator review 3.1 Censored-pairs estimators
The concordance indices (1) and (2) have been extended to censored data by ignoring pairs when the smaller survival time is censored and using a normalising constant to account for these uninformative pairs. 10,11 While such statistics can be useful for comparing different models on the same data set, Efron 12 noted that Gehan's approach 10 was dependent on the censoring distribution, and so was not not a universal measure of P(T 1 > T 2 ). Others have noted that Harrell's approach 11 likewise depends on the censoring distribution. 13 If the censoring random variable H Z is conditionally independent of T Z given Z, so that the observed survival function is (1ÀF T Z )(1ÀF H Z ), then from equation (2), the censored-pairs concordance index is given by The PðH z 1 4 T z 2 ÞPðH z 2 4 T z 2 Þ terms in the numerator and denominator arise because contributions to the statistic only occur for pairs of observations when the smaller survival time is not censored. The following methods were developed to be independent of the censoring distribution.

Efron's estimator of C
For the two-sample situation, Efron 12 suggested a solution using the Kaplan-Meier estimates for the survival distribution given by S 1 (t) ¼ 1ÀF 1 (t) and S 2 (t) ¼ 1ÀF 2 (t), and computing P(T 1 > T 2 ) based on these estimates throughĈ whereŜ 1 (u) andŜ 2 (u) are the Kaplan-Meier estimates of the survival functions S 1 and S 2 respectively. 14 That iŝ where the observed data are in pairs of event times and indicators (t 1i , y 1i ) in group 1 and (t 2j , y 2j ) in group 2, where y 1i ¼ 0 if t 1i is censored, one otherwise, and similarly for y 2j , and Qðt 1i , t 2j , y 1i , y 2j Þ ¼ PðT 1 4 T 2 j t 1i , t 2j , y 1i , y 2j Þ is estimated by substituting Kaplan-Meier estimates of survival functions into the relevant terms in Table 1.
Examples to show the difference between EðĈ E Þ and the censored-pairs approach have been reported. 15 C E overcomes limitations of the censored-pairs approach for the two-group problem but requires that the estimated survival functions decrease to zero, so that one treats the last event time in each group as not censored in the Kaplan-Meier estimator. When there is censoring due to incomplete follow up, with everyone censored by t max and where S 1 (t max ) > 0 and S 2 (t max ) > 0, then Efron's estimator may be very unstable. An important example of this situation is when individuals are enrolled sequentially in a clinical trial and events are recorded until (say) 10-years after the first entry (t max ¼ 10). In such situations taking the last time in each group to be an event will substantially bias the concordance index in the direction of the group with the longest surviving member beyond that time. For example, if 90% are at risk in both groups after the last event has Table 1. Values of Efron's Q(t i , t j , y i , y j ) for the concordance statistic. Note that for the two-sample estimator of C the 1 and 2 subscripts have been dropped, so that for example t i represents t 1i and t j is t 2j , similarly S i is S 1 etc. This notation is used so that the table generalises to estimators of C Z .
occurred, then 81% of the terms in the double summation (4) will favour the group with the longest surviving (censored) member, andĈ E is guaranteed to be greater than 0.81À0.19 ¼ 0.62.

Uno's estimator of C Z
Uno and colleagues 13 developed a censored-pairs estimator of the concordance index (2) based on inverse probability weighting. Their solution uses a Kaplan-Meier estimate of the censoring distribution S H , treating it as independent of Z and T Z , and re-weights the censored-pairs contribution when t i > t j to beŜ H ðt j Þ À2 , rather than one. The approach is justified by inspection of (3); the weighting cancels out the PðH z 1 4 T z 2 ÞPðH z 2 4 T z 2 Þ terms, so that it is (asymptotically) independent of the censoring distribution and converges to C Z . However, the resulting estimator is only completely independent of the censoring distribution if, as above for the Efron estimator, the maximal follow up for all patients is to a time such that the marginal survival distribution S() ¼ P(T > ) ¼ 0. If not, then the censored-pairs approach will converge to a quantity greater than C Z . Informally, this is because the individuals with high Z have the event first whether or not hazards also converge with time. More formally, this may be seen by re-expressing C Z as where SðtÞ ¼ R PðT 4 t j zÞdF Z ðzÞ and from Bayes' rule. As t increases, the distribution of Z in those still at risk becomes weighted towards those with longer survival, and C t decreases. When follow up is until t ¼ , the censored-pairs concordance index converges to Z 0 C t SðtÞ R 0 SðuÞdu dt and because C t is decreasing this limit is greater than C Z (anti-conservatively biased) unless S() ¼ 0. One can also see that the limit of Uno's concordance index for close to the longest follow up will be less than Harrell's version, since it gives relatively more weight to those C t that are closer to t ¼ .

Proportional-hazards model
A common approach is to estimate linear predictors of outcomes with censored event times using a proportionalhazards model. Here an estimator of the concordance index that does not depend on the censoring distribution or follow up was achieved by Go¨nen and Heller. 16 If T Z has hazard of form ðT j ZÞ ¼ 0 ðTÞ gðZÞ, then, because we have from (2) that where Z 1 and Z 2 are independent samples from distribution function F Z . When z ¼ 1 x 1 þ Á Á Á þ k x k for some linear combination of covariates x ¼ (x 1 , . . ., x k ) and coefficients b ¼ ð 1 , . . . , k Þ, gð:Þ ¼ expð:Þ and both T Z and Z are continuous, the concordance index depends on the distribution of z and equals which is linked to T Z only through the distribution of the coefficients b and covariates x. Equation (9) may be estimated by replacing F Z with its empirical distribution so that whereẑ i uses the proportional-hazards estimates 1 , . . . , k , and similarly for the more general (8). Its variance is estimable from re-sampling methods or from asymptotic formulae 16 which depend on the covariance matrix of b that is routinely available from the partial-likelihood methods of the proportional-hazards model.

New estimators 4.1 Motivation
The methods reviewed above are not universal when the predictor loses strength with time, and may depend on the length of follow up. In particular, formulas (8) and (9) depend implicitly on the validity of the proportional-hazard assumption. Further developments would be useful because hazards are often observed to converge, so that the effect of a predictive factor diminishes as follow-up time increases. This issue is pervasive in applications 5 . For example, in breast cancer epidemiology, many prognostic factors are based on characteristics of the tumour that lose relevance once an individual has survived a period of time 17 . We next propose modifications to the Efron and the proportional-hazard estimators, before introducing a more parsimonious approach.

Modified two-sample estimator
Recall that when there is censoring due to incomplete follow up, Efron's estimator may be very unstable. The following modification of Table 1 solves this problem by accounting for when the last time in each group is censored.
being respectively defined to be zero when S 1 ðt max Þ ¼ 0 or S 2 ðt max Þ ¼ 0. Now when y 1i ¼ y 2j ¼ 0, P(T 1 > T 2 ) may be partitioned as The terms are estimated by using Kaplan-Meier estimates of S 2 (t) for w 2j ; for example S 1 (t max ) is the Kaplan-Meier estimate at the last non-censored time in the first group.
As the original Efron estimator, the modified estimator is not a universal measure when censoring is due to incomplete follow up because it depends on t max , but it is more stable than the Efron estimator because it does not depend on which group has the longest surviving censored member. It is not consistent for the concordance index if S 1 ðt max Þ 4 0 and S 2 ðt max Þ 4 0 but, in this case, clearly it is not possible to obtain a consistent estimator of the concordance index with making assumptions. However, one may obtain an estimate of the concordance index for different follow-up periods by varying t max , where the modified estimator consistently estimates Thus, one approach to facilitate comparisons between studies is to present the estimate of this for different values of t max . This idea has been used in a similar context elsewhere, 6,13 and is considered further in later simulations ( Figure 2) and an example ( Figure 5).

Modified proportional-hazards model estimator
A problem with the estimator of Go¨nen and Heller 16 is that if there is no censoring but proportional hazards do not hold, then the estimator will not agree with the classical approach. A partial solution to this is to modify the approach of Efron and write where Qðt i , t j , y i , y j , z i , z j Þ ¼ PfT i 4 T j j ðt i , y i Þ, ðt j , y j Þ, z i , z j g. Under a proportional-hazards model, C EZ may be estimated via the terms in Table 1, but the proportional-hazard assumption is only needed to calculate the non-trivial terms and so the estimator agrees with the classical formula when there is no censoring. A further difference to the above is that it requires an estimate of the baseline survivor function S 0 (t). This approach will be anti-conservatively biased when the data are censored and proportional hazards hold. It is intended for use when censoring is light and one would like robustness against large departures from proportional hazards.
One might consider allowing ðT j ZÞ ¼ 0 ðTÞ g T ðZÞ for time-varying hazards g T . In this case A concordance index based on this involves O(N 2 ) evaluations of this double integral, which would need to be evaluated numerically. One also cannot use the model beyond the maximal follow-up time.

Pareto model
A parsimonious approach is to use a simple one-parameter model to account for varying degrees of convergence by introducing an unobserved additive covariate (frailty) to the proportional-hazards model, independent from other covariates, with a log-gamma distribution with mean one and variance . 18 This leads to a transformation model based on the Pareto distribution, so that if the baseline hazard and cumulative hazard are given by 0 (t) and Ã 0 (t) respectively, then an individual with covariate z ¼ expðbx 0 Þ has survival function and hazard function This very flexible model has some attractive features. The hazard ratio is given by so that a consequence of the frailty ( > 0) is that the hazard ratio approaches one as t gets large. When ¼ 0 there is no frailty and it becomes the proportional-hazards model; when ¼ 1 it becomes the proportional-odds model.
Technical aspects of estimation and inference are considered in the appendix.

Concordance index
Computation of the Pareto concordance index involves a formula with , the {Z} and the baseline cumulative hazard function Ã 0 ðtÞ where v ¼ z 2 Ã 0 ðsÞ, and analysis of concordance index (2) can proceed as the two previous approaches for proportional hazards. That is, the Pareto model can be used with f1 þ expðZ 1 À Z 2 Þg À1 in (9) replaced by (15) with s ¼ 0 or via the hybrid approach replacing the non-trivial terms in Table 1 with the Pareto terms. The integral in (15) is needed for both approaches. Although it does not appear to be analytically tractable it may be estimated numerically, and it requires much less computation than (12).

Goodness-of-fit
We lastly consider model goodness-of-fit, partly because the Pareto concordance index is not needed when a proportional-hazards assumption is appropriate. One method is an asymptotic score test for when a Pareto model is taken as the alternative hypothesis to proportional hazards. 19 Another approach in this line is to apply a likelihood-ratio test for ¼ 0, 20 with adjustment for model-boundary testing. 21 Schoenfeld residuals 22 are sometimes used, and in the general setting are defined for all i ¼ 1, . . . , N when a non-censored event occurred andðt i j x j Þ are model estimates. These residuals show the difference between the observed and expected covariate at each event time, and have expectation zero if the model is correct. Plots of sˆi against t i and fitted trends may help to identify departures from the model, and a chi-squared test based on scaled residuals is commonly used to test a proportional-hazards assumption, 23 without taking a Pareto model as the alternative. Because Schoenfeld residuals were designed to check the proportional-hazard assumption, a direct comparison with the Pareto model will help assess whether it satisfactorily addressed lack of fit. A related goodness-of-fit test is to use partial residualsPðx ! x i Þ defined as 22r Under the model these should be distributed uniformly between zero and one, independently of t i . Empirical distribution function goodness-of-fit tests 24 could be used to assess the distribution of r i in early and late periods.

Simulations 5.1 Bias
A simulation was used to demonstrate issues with existing methodology when there are converging hazards. Twenty-thousand individuals were simulated with survival times from a Pareto distribution; the rate for an individual was the exponent of a random normal covariate with unit mean and variance multiplied by a frailty sampled from a gamma distribution with mean one and variance . Type I censoring was considered, so that events occurred before a maximal follow-up time based on the expected proportion censored. For exposition we show 90%, 50% and 20% censoring. For $10-year follow up, heavy censoring might correspond to survival such as for distant recurrence in women diagnosed with estrogen-receptor positive breast cancer; 25 mid-range censoring ($50%) might be seen for survival following disease such as an acute myocardial infarction event; 7 light censoring occurs when survival rates are low, for example, for survival following complete resection of non-small-cell lung cancer. 5 In all simulation scenarios there is no difference between the censored-pairs estimators of Harrell or Uno because everyone is censored at the same time. Concordance indices using a proportional-hazards model and the censored-pairs statistic were calculated and compared with the true index, obtained using a simulation without censoring.
The results in Figure 1 show that for this model the proportional-hazard estimate was conservative when there was no censoring, but had positive bias when censoring was more than about 50%. The classical estimator substantially overestimated the concordance index when censoring was 50% or more; this bias was more pronounced for heavy censoring as the frailty variance increased.
A second simulation was used to demonstrate the dependence of the two-sample estimator on follow up. Tenthousand individuals were simulated in two groups, with survival time from an exponential distribution with rate one or two, compounded with a gamma frailty with variance , which was chosen to show the effect of a change from constant hazards ( ¼ 0) to when they converge very quickly ( ¼ 20). Censoring was generated by allowing individuals to be enrolled into a study at different times according to a uniform distribution between [0.00, 0.05], and then they were censored at a maximum follow-up time. The results in Figure 2 show that the two-sample statistic was conservatively biased when there was heavy censoring. Considering the chart from right (heavy censoring due to censoring) to left (no censoring), one can see that the concordance index estimate increased with more follow up (later censoring) until the covariate had ceased to influence survival due to converging hazards. The plot shows that the statistic is actually better when there are converging hazards, since it will converge to the true value with less follow up.

Comparison of estimators
A final simulation was used to compare estimators of C Z . Survival times were from a Pareto distribution that was the exponent of a standard random normal covariate (x) multiplied by 0.7 (i.e. z ¼ exp(x) with ¼ 0.7) and compounded by a frailty sampled from a gamma distribution with mean one and variance . Two choices of were considered (1.0 and 6.6) and three levels of censoring (follow up to time with expected censoring percentage 87%, 50% and 20%). The sample size was 1125 and 500 replications were used. The Pareto model was fitted by maximizing the profile likelihood (see Appendix). The reason for choosing ¼ 0.7, ¼ 6.6, 87% censoring and n ¼ 1125 is that these correspond to an example in the next section (Table 3(b), Ki67). We also considered ¼ 1 in order to assess a scenario where the proportionalhazards assumption is violated more slowly, and partly for theoretical interest because it corresponds to a proportional-odds model. The censoring levels were varied to help assess the estimators as more follow up is accrued.
The distribution of estimated concordance indices is shown in Figure 3. The concordance-index estimates from a Pareto model were substantially less biased than the other methods with heavy censoring ( Table 2). The Pareto estimator was biased for heavy censoring at this sample size because it fits a proportional-hazards model where there is insufficient power to detect non-proportional hazards. Harrell's statistic and the modified proportionalhazards statistic became less biased as the level of censoring decreased. The Pareto estimator had a lower mean squared error than the other estimators (Table 2). Some differences were seen between a proportional-hazards concordance index based solely on model fit and the hybrid approach using Table 1. As expected the hybrid approach worked best for light censoring. It was worse under 50% censoring for the proportional-hazards model because it shifted the estimate towards the Harrell estimate, and the censored-pairs estimators are expected to be anti-conservative unless follow up is to a point where survival is zero (c.f. Figure 1). Thus, we do not recommend the hybrid approach unless censoring is light.

Example
The example uses a sample of 1125 women with oestrogen-receptor positive breast cancer, of whom 145 had a distant recurrence after a median 8.5-years follow up in a clinical trial (ATAC trial, ISRCTN registration numer ISRCTN18233230). This sample from the transATAC study (approved by the South-East London Research Ethics Committee (REC ref no. 971037)) were previously used to show that some immunohistochemical (IHC) biomarkers added useful information to classical clinical prognostic factors. 25 For demonstration and insight we focus next on some of the individual biomarkers used in the IHC risk score. We do not present results from the hybrid estimators because censoring is heavy, but there was little difference because model assumptions dominate the calculations (87% of women were censored). Table 3 shows some univariate concordance index estimates. The following points are of note. First, the twosample estimates were different than the other form of concordance index. Second, Harrell's and Uno's statistics were closer to each other than the proportional-hazards and Pareto model statistics. This is likely due to the bias from follow up, as discussed earlier. Third, Pareto estimates were substantially lower than the proportional-hazards model when 6 ¼ 0, reflecting an assumption of converging hazards. Finally, the concordance indices of binarised predictors were less than continuous counterparts due to the information loss from dichotomising.
To explore further we focus on Ki67, whose Pareto concordance index estimate was 0.   Table 1.  Figure 4(a) show that allowing for converging hazards via a Pareto model improved the residuals at the start and end. Figure 4(b) helps to show why; the expected value of Ki67 for events decreased more rapidly than a proportional-hazards assumption. Figure 4(c) shows the fitted hazard ratio from the Pareto model, which approximately halved over the period. Figure 4(d) demonstrates that a Pareto model for a binary Ki67 predictor better matched the Kaplan-Meier estimates than a proportional-hazards model.
A goodness-of-fit test of the Pareto model is suggested by Figure 4(a), where most of the change in partial residuals between the proportional-hazards and Pareto model were in the first and last three years. Applying a two-sample Kolmogorov-Smirnov test of equality in distribution between the residuals in years 3 vs > 6 for the proportional-hazards model was rejected (D ¼ 0.28, two-sided P ¼ 0.03). The trend line shows that the Pareto model fitted somewhat better, and the same test did not reject a fit of the Pareto model (D ¼ 0.22, P ¼ 0.17). Thus the data showed some evidence to support the Pareto model fit, which was certainly better than proportional hazards, and the lower concordance index estimate than from a proportional-hazards model or the other approaches.      Figure 5. Plot of two-sample concordance index against type I censoring time (t max ) for binarized Ki67 and HER2 from the example. Point-wise 95% confidence intervals (empirical bootstrap) are also shown. Figure 5 plots the two-sample concordance index for binarised Ki67 by censoring time. The concordance index increased, and then appeared to plateau after six years. Thus one might surmise that the two-sample estimate from 10-year follow up is unlikely to increase for this variable with further follow up due to converging hazards (c.f. Figure 2). HER2 positivity is included for comparison, where the estimated concordance index increased with follow up, in better agreement with a proportional-hazards assumption.

Conclusion
The concordance index is routinely used to measure how well a variable predicts the time to a censored event. However, current estimators depend on the extent of follow up and many predictors using survival data lose their discriminatory power with follow up time. To account for this phenomenon we developed a concordance index based on a Pareto model. This semi-parametric model accounts for converging hazards, but leaves a baseline hazard function unspecified. In simulations under the model it was substantially less biased than other estimators. In a breast-cancer application the ordering of prognostic biomarker concordance index estimates changed when converging hazards were modelled, reflecting that some predictors are more useful for longer-term predictions than others. Our semi-parametric concordance index estimator is recommended for predictors of censored survival data when there is evidence of converging hazards.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by Cancer Research UK (grant number C569/A16891).