Predicting Outcomes in Patients With Diffuse Large B-Cell Lymphoma Treated With Standard of Care

In diffuse large B-cell lymphoma (DLBCL), predictive modeling may contribute to targeted drug development by enrichment of the study populations enrolled in clinical trials of DLBCL investigational drugs to include patients with lower likelihood of responding to standard of care. In clinical practice, predictive modeling has the potential to optimize therapy choices in DLBCL. The objectives of this study were to create a model for predicting health outcomes in patients with DLBCL treated with standard of care and determine informative predictors of health outcomes for patients with DLBCL. This was a retrospective observational study using data extracted from the IMS Health Database between September 2007 and April 2015. Patients were ⩾18 years of age with a DLBCL diagnosis. The index date was the date of the first DLBCL diagnosis. Patients were followed until outcome occurrence, defined as progression to a later line of therapy after ⩾60 days from the end of a previous therapy or stem cell transplantation. Patients were categorized into three cohorts depending on the post-index observation period: ⩽1 year, ⩽3 years, or ⩽5 years. Lasso logistic regression (LASSO), Naive Bayes, gradient-boosting machine (GBM), random forest (RF), and neural network models were performed for each cohort. The best-performing algorithms were predictive models based on GBM and observation periods ⩽1 and ⩽3 years after index date. Informative predictors included myocardial imaging, DLBCL stage IV, bronchiolar and renal disease, a chemotherapy regimen, and exposure to diphenhydramine and vasoprotectives on or before the first DLBCL diagnosis. These predictive models may be applied to targeted drug development and have the potential to optimize therapy choices in DLBCL. They were generated efficiently using a large number of independent variables readily available in standard insurance claims or electronic health record data systems.


Cancer Informatics ment decisions in clinical trials and has the potential to optimize therapy choices in clinical practice.
In the current treatment environment, clinical trials of investigational drugs in DLBCL must focus on patients with lower likelihood of responding to standard of care. As such, the design of clinical trials in DLBCL may be improved by enrichment of the study population, defined as selecting a study population in which detection of a drug effect (if one exists) is more likely than it would be in an unselected population. 10 Enrichment of a DLBCL study population may be achieved using a predictive model for response rate to standard of care, whereby a population of non-responders is identified and randomized to either the new drug or the original one.
In clinical practice, a predictive model can be used to identify patients with DLBCL who have an increased probability of response to a specific treatment. 9,10 Patient stratification based on a combination of selected variables can facilitate optimal therapy choices in DLBCL and improve the success rate of treatments. Furthermore, this approach could decrease the burden of DLBCL disease and reduce DLBCL health care costs by allowing comprehensive risk assessments and improved efficiencies in the delivery of care to DLBCL patients.
Although DLBCL has prognostic indicators, such as the International Prognostic Index (IPI) 11 and known biomarkers associated with disease responsiveness, to our knowledge, there are no predictive models of treatment response rates in DLBCL. Furthermore, outside of clinical trial or registry settings, these prognostic indicators and biomarkers are usually not readily available in secondary data sources, such as insurance claims or electronic health records. The objectives of this study were to (1) create a model for predicting health outcomes in patients with DLBCL treated with standard-of-care therapy and (2) base the model on variables readily available in standard insurance claims or electronic health record data systems.

Data sources
This retrospective observational study used data extracted from the IQVIA Real-World Data Adjudicated Claims (PharMetrics Plus) database between September 2007 and April 2015. 12,13

Study design
Patients with DLBCL were eligible for this study. Inclusion criteria were as follows: (1) ⩾18 years of age; (2) ⩾one claim with a DLBCL diagnosis code in any position on an inpatient or outpatient record (Table 1); and (3) ⩾6 months of enrollment before the index date and ⩽1 year, ⩽3 years, or ⩽5 years of enrollment after the index date, depending on the length of the prediction window. The ⩾6 months pre-index enrollment requirement was to provide adequate characterization of baseline characteristics and identify potential oncology treatments before the index date (ie, to reduce misclassification of incident newly diagnosed patients).
Exclusion criteria were as follows: (1) diagnosis of DLBCL during the 6 months before the index date; (2) ⩾one claim with a diagnosis code for other primary cancer in any position on an inpatient or outpatient record (Table 1), except that a nodular lymphoma code (ICD 202.0) first occurring within 30 days of a large cell lymphoma code did not lead to exclusion, to allow for early misdiagnosis; or (3) ⩾one claim with a diagnosis code for secondary cancer (metastatic disease) in any position on an inpatient or outpatient record (Table 1).
The index date was the date of the first DLBCL diagnosis. Patients were followed until outcome occurrence and categorized into three cohorts depending on the post-index observation period: ⩽1 year, ⩽3 years, or ⩽5 years.
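The eligibility rules above can be sketched as a simple filter. This is an illustration only, not the study's extraction code: the `Patient` record and its field names are hypothetical, 6 months is approximated as 182 days, age is evaluated at the index date, and the post-index enrollment rule is left out because it depends on the prediction window.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: 6 months of pre-index enrollment approximated as 182 days.
MIN_PRE_INDEX = timedelta(days=182)


@dataclass
class Patient:
    """Hypothetical per-patient summary derived from claims."""
    birth_date: date
    index_date: date              # date of first DLBCL diagnosis claim
    enroll_start: date            # start of continuous enrollment
    has_other_primary_cancer: bool
    has_secondary_cancer: bool


def is_eligible(p: Patient) -> bool:
    """Apply the age, pre-index enrollment, and cancer exclusion rules."""
    age_years = (p.index_date - p.birth_date).days / 365.25
    enough_history = (p.index_date - p.enroll_start) >= MIN_PRE_INDEX
    return (
        age_years >= 18
        and enough_history
        and not p.has_other_primary_cancer
        and not p.has_secondary_cancer
    )
```

In practice the two cancer flags would themselves be derived from the diagnosis code lists in Table 1, which are not reproduced here.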

Data collection
Outcomes assessment was binary, with patients being categorized as either disease progression or non-progression after first-line treatment. Due to a lack of granular treatment response data in insurance claims data, a proxy was used: initiation of a later line of therapy after ⩾60 days from the end of a previous therapy or stem cell transplantation, as identified by ICD-9 procedure, Healthcare Common Procedure Coding System (HCPCS), or Current Procedural Terminology (CPT) codes (Table 2).
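The progression proxy can be illustrated with a short sketch. It assumes that treatment episodes have already been derived from the procedure and drug codes as (start, end) date pairs, that all episodes fall after the index date, and that a year is approximated as 365.25 days. The function name and the episode representation are hypothetical, and the handling of stem cell transplantation is not modeled.

```python
from datetime import date, timedelta

# The 60-day gap comes from the outcome definition; everything else is assumed.
MIN_GAP = timedelta(days=60)


def progressed(episodes, index_date, window_years):
    """Return True if a later line of therapy starts >= 60 days after the
    end of the previous episode, within the post-index observation window.

    episodes: iterable of (start, end) date pairs, in any order.
    """
    window_end = index_date + timedelta(days=round(365.25 * window_years))
    ordered = sorted(episodes)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - prev_end >= MIN_GAP and next_start <= window_end:
            return True
    return False
```

A patient whose second episode begins 109 days after the first one ended would be labeled as progressed, whereas a 17-day gap would be treated as a continuation of the same line of therapy.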
Mortality data are not available in the IQVIA PharMetrics Plus database. To avoid confounding, potentially deceased patients (defined as patients with an enrollment period that ended without an outcome before the end of the post-index observation period) were excluded from data analysis.
Abbreviations: DLBCL, diffuse large B-cell lymphoma; ICD-9, International Classification of Diseases, Ninth Revision.

Galaznik et al
Select descriptive characteristics were assessed for each cohort based on availability of data; continuous measures were summarized as means and standard deviations, whereas categorical measures were summarized as counts and percentages (Tables 3 to 5). Supporting medications included erythropoiesis agents, granulocyte colony-stimulating factor (G-CSF) or granulocyte-macrophage colony-stimulating factor (GM-CSF), and blood transfusions. Pain medications and antifungals were not considered as predictors because of their potential use for other conditions.
Each cohort was randomly separated into training data and testing data at a ratio of 3:1. Lasso logistic regression (LASSO), Naive Bayes, gradient-boosting machine (GBM), random forest (RF), and neural network models (Supplemental material Table S1) were fit for each cohort. All prediction models were built using out-of-the-box solutions provided by OHDSI packages. All available clinical and demographic data were included as potential predictors, with no pre-modeling winnowing of potential variables.
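A 3:1 random split of a cohort can be sketched as follows. This is a minimal illustration using only the Python standard library, not the OHDSI tooling used in the study; the function name and the fixed seed are assumptions made for reproducibility of the example.

```python
import random


def split_3_to_1(patient_ids, seed=0):
    """Shuffle patient identifiers and return (training, testing) lists
    holding roughly 75% and 25% of the cohort, respectively."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)   # seeded so the split is repeatable
    cut = (len(ids) * 3) // 4
    return ids[:cut], ids[cut:]
```

Splitting at the patient level, rather than at the claim level, keeps all records of one patient on the same side of the split.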
To obtain an objective estimate of the algorithms' performance, baseline prediction models were generated. The first baseline model used a random number generator in the range of 0 to 1 and a threshold. The second and third baseline models always predicted the same outcome (only positive or only negative). All three baseline models provided a reference point against which to compare results and to weigh the benefits of machine-learning algorithms as prediction models in terms of effort versus outcome.
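The three baseline models are simple enough to sketch directly. The threshold of 0.5, the seed, and the function names below are assumptions; the source does not state the threshold that was used.

```python
import random


def random_baseline(n, threshold=0.5, seed=0):
    """Draw a uniform number in [0, 1) per patient and predict 1
    (progression) when it reaches the threshold."""
    rng = random.Random(seed)
    return [1 if rng.random() >= threshold else 0 for _ in range(n)]


def constant_baseline(n, label):
    """Always predict the same outcome: 1 (positive) or 0 (negative)."""
    return [label] * n


def accuracy(y_true, y_pred):
    """Share of predictions that match the observed outcome."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

For an illustrative set of 100 patients of whom 85 progressed, the always-positive baseline would score 85% accuracy without using any predictor, which is why accuracy alone can be misleading in imbalanced cohorts.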
Performance metrics included accuracy, Matthews correlation coefficient, and area under the receiver operating characteristic (ROC) curve (area under the curve [AUC]). Accuracy is the ratio of correct predictions to all predictions made (ie, the complement of the error rate). Matthews correlation coefficient is a measure of the quality of binary classifications, where 100% represents a perfect prediction and 0% represents a prediction no better than chance. The ROC curve depicts the true-positive rate (sensitivity) versus the false-positive rate (100% minus specificity) at various thresholds; an AUC of 100% represents a perfect test, and an AUC of 50% indicates noninformative (random) predictions.
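For readers who want to see how these three metrics are computed, the following is a minimal standard-library sketch. It is not the implementation used by the OHDSI packages; the function names are illustrative, outcomes are coded 1 (progression) and 0 (non-progression), and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, counting ties as one half).

```python
from math import sqrt


def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; defined as 0 when any margin is empty."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that a constant prediction yields a Matthews correlation coefficient of 0 regardless of its accuracy, which makes the coefficient a useful complement to accuracy when one outcome dominates.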

Descriptive summary
After application of inclusion and exclusion criteria, there were 4501 patients available for Cohort 1 (⩽1 year), 3115 available for Cohort 2 (⩽3 years), and 2525 available for Cohort 3 (⩽5 years). Within these cohorts, there were 1646, 1384, and 2146 patients, respectively, with evidence of progression to a new line of therapy after initial treatment. Although no formal statistical comparison was conducted, descriptive characteristics were similar across all three cohorts (Tables 3 to 5).
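The cohort counts above imply quite different outcome prevalences, which matters when reading accuracy figures. The sketch below derives the share of patients with progression in each full cohort and the accuracy that the better of the two constant baselines would reach on that cohort. The helper names are illustrative, and the shares in the 25% test samples may differ slightly from these whole-cohort values.

```python
# Counts quoted in the descriptive summary:
# (patients with progression, cohort size).
COHORTS = {
    "<=1 year": (1646, 4501),
    "<=3 years": (1384, 3115),
    "<=5 years": (2146, 2525),
}


def progression_share(progressed, total):
    """Fraction of the cohort with evidence of progression."""
    return progressed / total


def best_constant_accuracy(progressed, total):
    """Accuracy of always predicting the more common outcome."""
    share = progression_share(progressed, total)
    return max(share, 1 - share)


for label, (progressed, total) in COHORTS.items():
    share = 100 * progression_share(progressed, total)
    floor = 100 * best_constant_accuracy(progressed, total)
    print(f"{label}: progression {share:.1f}%, constant-baseline accuracy {floor:.1f}%")
```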

Model comparison
A summary of performance metrics for each predictive model by cohort is shown in Tables 6 to 8. Based on these data, GBM is recommended for predicting progression to a later line of therapy after ⩾60 days from the end of a previous therapy or stem cell transplantation in this population of DLBCL patients. When the observation period was ⩽1 year after index date, GBM performed with 67.6% accuracy, a Matthews correlation coefficient of 24.0%, and an AUC of 69.2%. When the observation period was ⩽3 years after index date, GBM performed with 68.0% accuracy, a Matthews correlation coefficient of 21.1%, and an AUC of 72.7%. When the observation period was ⩽5 years after index date, the Matthews correlation coefficient decreased to 5.3%, although GBM performed with 84.2% accuracy and an AUC of 80.7%; this accuracy is close to the share of patients with progression in the cohort (2146 of 2525, 85.0%), which is approximately what a model that always predicts progression would be expected to achieve.
Detailed model outputs and performance metrics are included as supplementary data (Supplemental material Figures S1 and S2).

Discussion
This study created a model that considers a large number of independent variables to predict health outcomes after treatment or autologous stem cell transplantation in patients with DLBCL. Predictive models based on GBM and observation periods ⩽1 and ⩽3 years after index date were the best-performing algorithms. The predictive model was generated efficiently using a large number of independent variables readily available in standard insurance claims or electronic health record data systems. Within this study, outcomes assessment was simplified as binary (progression to new treatment vs non-progression) within fixed time windows, but future enhancements could also include prediction of variation in time-to-event outcomes. Validation in a 25% test hold-out sample was performed to reduce risk of overfitting and to calculate ROC curves and Matthews correlation coefficients. As a next step, further validation could be conducted in independent data sets, thereby further ensuring robustness of model accuracy. Replication in clinically richer data sources, such as oncology-specific electronic health record databases or clinical trial data sets, could further provide opportunity to enhance model accuracy.
Established uses of prognostic modeling include point-of-care treatment decision-making and the identification of patients who warrant closer follow-up. For instance, a provider may select an alternative treatment for a patient identified as having a low likelihood of treatment response for a given therapy, 23,24 as described by Porcher et al 25 for additional radiotherapy in soft-tissue sarcoma. Predictive models such as the one developed here may also facilitate more efficient clinical development of investigational drugs in DLBCL. Such a model could be used to enrich the patient population recruited into clinical trials in DLBCL where the goal is to focus on patients with a lower likelihood of response to standard of care. In a hypothetical clinical trial of an investigational drug versus standard of care in DLBCL, the estimated sample size necessary to demonstrate a therapeutic effect within 1 year of treatment, assuming a treatment arm response rate of 40% and a standard of care arm response rate of 20%, is 109 patients per arm (standard two-sample test for proportions; assuming a power of 0.9 and an alpha of 0.05). Applying GBM to recruit patients with a low likelihood of treatment response to standard of care at a sensitivity of 0.60 and specificity of 0.68 reduces the response rate to 12% in the standard of care arm. Assuming that the treatment arm response rate is unchanged, the expected magnitude of effect between arms increases by 8 percentage points (from 20 to 28), reducing the required sample size to 50 patients per arm. Realistically, the treatment arm response rate would also be expected to decrease. To model this decrease, all patients who respond to standard of care are assumed to respond to the new treatment as well. In addition, a fixed fraction of patients who do not respond to standard of care is assumed to respond to the new treatment, independent of patients' baseline covariates.
Even assuming a treatment arm response rate of 34%, there is a net decrease in sample size to 75 patients per arm. When considering all scenarios, applying a predictive model for response rate to standard of care could reduce the sample size of this hypothetical clinical trial in DLBCL by 34 to 59 patients per arm (from 109 to between 75 and 50), which would readily translate into reduced costs and time needed to accrue trial patients. This is particularly impactful for oncology trials, where recruitment has become increasingly difficult and costs per patient have ranged from US$68 500 to US$125 000 and continue to increase. 26-28
The predictive model also provides the opportunity to implement a more systematic approach to the treatment of DLBCL patients. The model may inform clinical decision-making, allowing the identification of patients most likely to respond to a specific drug or drug combination, 9 support more accurate diagnoses, and avoid unnecessary treatments and associated [...] 29 Taken together, these data suggest that a predictive model of relapse or the presence of refractory disease in patients with DLBCL has the potential to increase the efficiency of DLBCL health care delivery, lessen the impact of DLBCL on health care systems by lowering the overall cost of DLBCL health care, and reduce DLBCL patient burden by decreasing the need for health agency and hospice care.
An additional application of such modeling approaches can be to identify new variables or factors for predicting outcomes. The exploration of variables or patterns of variables identified as top predictors across multiple modeling approaches could be considered as a way to generate hypotheses for new predictive factors for a given outcome. Any assertions of causality, however, would require employing causal inference methodologies, 30 which are outside the scope of this study.
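The sample-size arithmetic for the hypothetical enrichment trial discussed above can be reproduced with a short sketch. The source states only that a standard two-sample test for proportions was used; the pooled-variance normal-approximation formula below is an assumption, chosen because it reproduces the quoted per-arm figures when the enriched standard of care response rate is rounded to 12%. Function names are illustrative, "positive" means predicted non-responder to standard of care, and the value of 0.9 is read as statistical power.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_arm(p1, p2, alpha=0.05, power=0.9):
    """Per-arm sample size for a two-sided two-sample test of proportions
    (normal approximation, pooled variance under the null)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_a * sqrt(2 * p_bar * (1 - p_bar))
        + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


def enriched_response_rate(base_rate, sensitivity, specificity):
    """Response rate among patients flagged as likely non-responders."""
    true_pos = (1 - base_rate) * sensitivity      # non-responders flagged
    false_pos = base_rate * (1 - specificity)     # responders flagged
    return false_pos / (true_pos + false_pos)


soc, new = 0.20, 0.40
soc_enriched = round(enriched_response_rate(soc, 0.60, 0.68), 2)
# Fraction of standard-of-care non-responders who respond to the new drug.
rescue = (new - soc) / (1 - soc)
new_enriched = soc_enriched + (1 - soc_enriched) * rescue

print(n_per_arm(new, soc))                    # unselected population
print(n_per_arm(new, soc_enriched))           # new-drug rate unchanged
print(n_per_arm(new_enriched, soc_enriched))  # new-drug rate also lower
```

Under these assumptions the rescue fraction is 0.25, so the treatment arm response rate in the enriched population is 0.12 + 0.88 × 0.25 = 0.34, which matches the 34% scenario in the text.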
The framework used to develop the predictive model described in this study can overcome data sparseness, may help to generate new hypotheses for predictors of outcomes, and can be readily implemented to efficiently develop a predictive model for measurable outcomes; however, the framework is associated with several limitations. First, censored patients cannot be included, so any individual who is not observed for the complete follow-up period and does not experience an outcome during follow-up is excluded, which may introduce bias in the study population. Second, not all medical events are recorded in observational data sets and some information can be recorded incorrectly, resulting in a noisy data set with potential outcome misclassification. Third, the resultant predictive model is only applicable to the population of patients represented by the data used to train the model; therefore, generalization may be limited. Finally, a limitation of any model used for clinical trial enrollment is the need to have access to all variables at the time of screening.

Conclusions
This study developed a model that considers a large number of independent variables to predict health outcomes in patients with DLBCL. The model has potential applications in enriching the patient population recruited into clinical trials in DLBCL, where the goal is to focus on patients with lower likelihood of response to standard of care, in improving efficiencies in the delivery of health care to patients with DLBCL, and in reducing health care costs.