Using different data sets to test how well clinical prediction models work to predict patients' risk of heart disease



Bibliographic Details
Main Author: Kent, David M.
Corporate Author: Patient-Centered Outcomes Research Institute (U.S.)
Format: eBook
Language: English
Published: [Washington, D.C.]: Patient-Centered Outcomes Research Institute (PCORI), [2021]
Series: Final research report
Collection: National Center for Biotechnology Information
Description
Summary: BACKGROUND: There are many clinical prediction models (CPMs) available to inform treatment decisions, but little is known about the broad trustworthiness of these models and whether their performance is likely to improve clinical care. OBJECTIVES: To (1) describe the current state of the literature on validations of cardiovascular CPMs; (2) understand how well CPMs validate on independent data sets generally, with particular attention to calibration; (3) understand when models have the potential to worsen decision-making and cause patient harm; and (4) understand the effectiveness of various updating procedures across a broad array of models. METHODS: A citation search was run on March 22, 2017, to identify external validations of the 1382 cardiovascular CPMs in the Tufts Predictive Analytics and Comparative Effectiveness Center (PACE) CPM Registry.
We assessed the extent of external validation and the variation in performance across data sets, focusing on the percentage change in the C statistic on external validation compared with derivation. To evaluate factors that influence CPM performance, we assessed the "relatedness" of the data sets used for validation to those on which the model was developed, using a set of detailed rubrics we developed for each clinical condition.
To assess whether adherence to methodological standards influenced performance, we developed an abbreviated version of the Prediction model Risk Of Bias ASsessment Tool (PROBAST), here called the "short form." Finally, we performed independent validations of a set of CPMs across 3 index conditions (acute coronary syndrome, heart failure, and incident cardiovascular disease) using publicly available clinical trial data and an evaluation framework with both novel and conventional performance measures: a model-based C statistic to estimate the loss in discrimination due to case mix rather than model invalidity, the Harrell E statistic (standardized EAVG) to quantify the magnitude of calibration error, and net benefit based on decision curve analysis (DCA) at 3 thresholds (the outcome rate, half the outcome rate, and twice the outcome rate).
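As a rough illustration of the calibration measure named above, the sketch below computes a standardized Harrell E statistic from binary outcomes y and predicted risks p. The lowess smoother, its span, and the choice to standardize by the observed average risk are assumptions for illustration, not the report's exact implementation.

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def standardized_e_avg(y, p, frac=0.5):
    """Average absolute calibration error (Harrell E statistic),
    standardized by the average risk. y: 0/1 outcomes; p: predicted risks.

    A lowess smooth of outcomes on predicted risk estimates the observed
    risk at each prediction; E_avg is the mean absolute gap between the
    smoothed observed risk and the prediction."""
    observed = lowess(y, p, frac=frac, return_sorted=False)
    e_avg = np.mean(np.abs(observed - p))
    return e_avg / np.mean(y)  # e.g., 0.5 = typical error is half the average risk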
We then examined the effectiveness of model updating procedures for reducing the risk of harmful predictions, defined as predictions yielding a negative net benefit on DCA at any of the prespecified thresholds. RESULTS: A total of 2030 external validations of 1382 CPMs were identified: 1.5 validations per CPM (range, 0-94). Of the CPMs, 807 (58%) have never been externally validated. The median external validation area under the curve was 0.73 (interquartile range [IQR], 0.66-0.79), representing a median percentage change in discrimination of −11.1% (IQR, −32.4% to 2.7%) compared with performance on derivation data. For individual CPMs evaluated more than once, performance typically varied widely from one data set to another.
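The summary does not spell out how the percentage change in discrimination is computed; a convention used in related PACE Registry publications (assumed here) scales the change by the C statistic's informative range above 0.5, so a fall to chance-level discrimination counts as −100%.

def percent_change_c(c_derivation, c_validation):
    """Percentage change in discrimination, scaled to the informative
    range of the C statistic (0.5 = no discrimination). Assumed convention."""
    return 100.0 * (c_validation - c_derivation) / (c_derivation - 0.5)

# Illustrative values only: a CPM derived at C = 0.77 that validates at
# C = 0.74 gives percent_change_c(0.77, 0.74) == -11.1 (to one decimal).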
The percentage change in discrimination from derivation set to validation set was −3.7% (IQR, −13.2% to 3.1%) for "closely related" validation sets, −9.0% (IQR, −27.6% to 3.9%) for "related" validation sets, and −17.2% (IQR, −42.3% to 0%) for "distantly related" validation sets (P < .001). Agreement between the short form (ie, the abbreviated PROBAST) and the full PROBAST for low vs high risk of bias (ROB) was 98% (κ = 0.79), with perfect specificity for high ROB. Using the short form, 529 of 556 CPMs (95%) were considered high ROB, 20 (3.6%) low ROB, and 7 (1.3%) unclear ROB. The median change in discrimination was smaller in low ROB models (−0.9%; IQR, −6.2% to 4.2%) than in high ROB models (−12%; IQR, −33% to 2.6%). We performed a total of 158 unique validations of 108 unique CPMs across 36 data sets.
Among the independent validations, the median percentage change in the C statistic was −30% (IQR, −45% to −16%) in 57 related validation data sets and −55% (IQR, −68% to −40%) in 101 distantly related data sets (P < .001). The decrease was due both to a narrower case mix and to model invalidity: relative to the model-based C statistic, the median change in the validation C statistic was −11% (IQR, −25% to 0%) in related data sets and −33% (IQR, −50% to −13%) in those distantly related. The median standardized EAVG was 0.5 (IQR, 0.4-0.7), indicating an average error that is half the average risk, and did not differ by relatedness. On DCA, only 22 of 132 (17%) of the unique evaluations we performed were beneficial or neutral at all 3 decision thresholds examined. The risk of harm was most salient when decision thresholds departed substantially from the average risk in the target population.
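For concreteness, a minimal sketch of the net benefit calculation behind the DCA results above, assuming binary outcomes y and predicted risks p as NumPy arrays; variable names are illustrative.

import numpy as np

def net_benefit(y, p, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold:
    NB = TP/n - FP/n * t/(1 - t). A negative value indicates net harm
    relative to treating no one."""
    n = len(y)
    treat = p >= threshold
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    return tp / n - fp / n * threshold / (1 - threshold)

# The report's 3 thresholds, relative to the observed outcome rate:
# rate = y.mean()
# for t in (rate / 2, rate, 2 * rate):
#     print(t, net_benefit(y, p, t))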
Updating the intercept alone prevented harm for decision thresholds at the outcome rate across all models, but the risk of harm remained substantial at the more extreme thresholds unless both the slope and intercept were updated. Findings were consistent across the 3 index conditions we tested.
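A minimal sketch of the two updating procedures compared above, assuming a logistic CPM whose linear predictor lp is available on the validation data; the statsmodels-based implementation and function names are assumptions for illustration.

import numpy as np
import statsmodels.api as sm

def update_intercept(y, lp):
    """Intercept-only update: refit the intercept while holding the
    original linear predictor fixed as an offset."""
    ones = np.ones((len(y), 1))
    fit = sm.GLM(y, ones, family=sm.families.Binomial(), offset=lp).fit()
    a = fit.params[0]
    return lambda lp_new: 1.0 / (1.0 + np.exp(-(a + lp_new)))

def update_slope_intercept(y, lp):
    """Logistic recalibration: refit both intercept and slope on lp."""
    fit = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()
    a, b = fit.params
    return lambda lp_new: 1.0 / (1.0 + np.exp(-(a + b * lp_new)))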
CONCLUSIONS: Many published cardiovascular CPMs have never been externally validated. Discrimination and calibration often decrease significantly when CPMs for primary prevention of cardiovascular disease are tested in external populations, leading to substantial risk of harm. Model updating can reduce this risk substantially and will likely be needed to realize the full potential of risk-based decision-making. LIMITATIONS: Our analysis of published validation studies (aim 1) was limited to changes in the C statistic, because this was the only performance measure routinely reported. The sample of data sets used in aims 2 and 3 was a convenience sample, which in turn determined the CPMs we could select.
Physical Description: 1 PDF file (120 pages): illustrations