External Validation of a Prediction Model for Falls in Older People Based on Electronic Health Records in Primary Care

Objectives: Many prediction models for falls in community-dwelling older people exist, but external validation is lacking. We aimed to externally validate our previously developed prediction model for falls using primary care EHR data. Methods: Model performance was assessed in terms of discrimination, using the area under the receiver operating characteristic curve (ROC-AUC), and in terms of calibration, using calibration-in-the-large, the calibration slope, and calibration plots. Results: Among 39,342 older people, 5124 (13.4%) fell in the 1-year follow-up. The characteristics of the validation and the development cohorts were similar. ROC-AUCs of the validation and development cohort were 0.690 and 0.705, respectively. Calibration-in-the-large and the calibration slope were 0.012 and 0.878, respectively. Calibration plots revealed overprediction for high-risk groups in a small number of individuals. Conclusions and Implications: Our previously developed prediction model for falls demonstrated good external validity by reproducing its predictive performance in the validation cohort. The implementation of this model in the primary care setting could be considered after impact assessment.

Falls and fall-related injuries are frequent in community-dwelling older people: about 30% of older people fall at least once per year. 1 Falls are a growing public health concern worldwide as they impair the quality of life and functional independence of older people. In addition, falls are a major cause of morbidity and mortality, 2,3 increase health care costs, and thereby place a significant burden on our health care system. 4 Therefore, individualized prediction of the risk of falls helps clinicians to identify older persons at highest risk. Targeting fall-preventive interventions based on risk screening is an essential part of fall-preventive care pathways. Efficient and accurate risk prediction is important to improve outcomes and reduce health care costs. 5,6 A recent systematic review found that although there is an abundance of risk prediction models for falls in community-dwelling older people, external validation is lacking. 7 External validation is defined as the process of testing a model in individuals from sources independent of the development cohort. External validation is imperative 8-10 to assess the ability of a particular prediction model to produce accurate predictions for individuals from a similar source population (reproducibility) or from different clinical settings (transportability). 11 In general, prediction models tend to perform better in the populations in which they were derived than in different populations, for several reasons, including differences in case mix and varying outcome rates. 11,12 This lack of validation means it is unclear to what extent models can be used for new individuals. As a consequence, most of the existing prediction models for falls cannot be recommended for clinical use before external validity is established. 8-10 Clinical decisions should be based on prediction models that have been externally validated, in order to properly identify older people at higher fall risk and to avoid unnecessary interventions.
Routinely collected electronic health records (EHRs) are increasingly being used for research. For clinical prediction models, these data make it possible to analyze large numbers of patients at relatively low cost. However, these data pose particular challenges for modeling: incomplete, missing, and duplicate data, and variability of data capture systems. 13 External validation of prediction models developed with EHR data is therefore crucial, but unfortunately is often lacking. 14 We previously developed a prediction model to predict 1-year fall risk using data extracted from primary care EHRs in the Netherlands. 15 In order to justify its clinical use in other populations, external validation of this model is needed to investigate its performance using data from other general practices. The aim of this study was to conduct an external validation of our recently developed prediction model using a large population-based cohort drawn from EHR data in a primary care setting in the Netherlands.

Original Fall Risk Prediction Model
The original prediction model for falls was derived from pseudonymized primary care EHR data of 50 general practices located in the province of North Holland in the Netherlands, participating in the Academic General Practitioner's Network at Academic Medical Center (AHA AMC). The development cohort contained data of 36,470 older people (aged ≥65 years) covering the period 2018 and 2019. Fall probability can be calculated using the formula 1 / (1 + e^(−LP)), where the linear predictor (LP) is equal to −6.92 + 0.06 × age + 0.26 × female sex + 0.72 × history of falls + 0.29 × use of proton pump inhibitors + 0.24 × use of opioids + 0.35 × previous injury + 0.54 × depression + 0.20 × osteoarthritis + 0.36 × urinary incontinence + 0.41 × memory and concentration problems. The model had a median area under the receiver operating characteristic curve (ROC-AUC) of 0.705 and reasonable calibration. More details on the development were published previously. 15
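As an illustration, the published formula can be applied directly. The following is a minimal sketch (the coefficients are taken from the formula above; the function and variable names are our own, not the study's code):

```python
import math

# Coefficients from the published formula: intercept plus predictor weights.
# All predictors except age are coded as binary 0/1.
COEFS = {
    "intercept": -6.92,
    "age": 0.06,
    "female_sex": 0.26,
    "history_of_falls": 0.72,
    "proton_pump_inhibitors": 0.29,
    "opioids": 0.24,
    "previous_injury": 0.35,
    "depression": 0.54,
    "osteoarthritis": 0.20,
    "urinary_incontinence": 0.36,
    "memory_concentration_problems": 0.41,
}

def fall_probability(age, **binary_predictors):
    """Predicted 1-year fall probability: 1 / (1 + exp(-LP))."""
    lp = COEFS["intercept"] + COEFS["age"] * age
    for name, value in binary_predictors.items():
        lp += COEFS[name] * value
    return 1.0 / (1.0 + math.exp(-lp))

# Example: an 80-year-old woman with a history of falls and depression.
p = fall_probability(80, female_sex=1, history_of_falls=1, depression=1)
```

Predictors not passed to the function default to 0 (absent), matching the additive structure of the linear predictor.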

Data Source, Setting, and Study Population of the Validation Cohort
We conducted a retrospective analysis using depersonalized primary care EHR data, extracted from the pseudonymized database of the Academic Network of General Practice at VU medical center in Amsterdam (ANH VUmc). The data covered 59 general practices from 2 cities (Amsterdam and Haarlem) in the province of North Holland in the Netherlands. The development and validation cohorts had no overlap regarding general practices, and expectedly very little overlap regarding patients (overlap could only occur when patients move practice). The database contained both structured data (demographics, diagnoses, and medication prescriptions) and depersonalized free text (in Dutch) for enlisted patients aged ≥65 years between 2015 and 2020.
Our validation cohort included all patients registered with any general practitioner (GP) at any time in the period from 2018 to 2019. Baseline predictors were obtained from the year 2018 (observation period), and the occurrence of falls was determined in the year 2019 (follow-up period). Patients were included in the validation cohort if they were aged 65 years or older at the beginning of the observation period. We excluded individuals who died during the observation period and those who first registered during the follow-up period.
The ANH VUmc database is run according to Dutch privacy legislation and contains pseudonymized general practice care data from all patients of the participating general practices, except for those patients who object to this. Observational studies based on anonymized data from the ANH VUmc database are exempted from informed consent of patients. Permissions were sought and granted to access the ANH VUmc database. This study conformed to the Declaration of Helsinki principles.

Outcome
The outcome was any fall during the 1-year follow-up period (year 2019), obtained from free text written in the clinical notes during follow-up (for detailed information, see Dormosh et al 15 ).

Predictors
Ten predictors were defined and collected using exactly the same methods as in the development study. 15 The predictors were collected in the observation period in the year 2018. Demographic predictors included age in years and sex. The 2 medication groups were proton pump inhibitors and opioids. These groups were formulated by aggregating the codes of the Anatomical Therapeutic Chemical (ATC) classification system of each medication associated with each individual. 16 The 5 chronic medical conditions were previous injury, depression, osteoarthritis, urinary incontinence, and memory and concentration problems. These medical conditions were created by grouping the codes of the International Classification of Primary Care 17 assigned to each individual. History of falls (as a binary yes/no predictor) was defined as any fall that occurred in the observation period (year 2018) and was obtained by applying the same automated search strategy for falls in the free text as described in the development study. We validated the algorithm's performance to determine history of falls in terms of positive predictive value, sensitivity, and specificity by manual inspection of 100 randomly selected individuals. Supplementary Table 1 provides detailed information on predictor definitions and measurements used in this study.
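As a simplified sketch of how binary medication predictors can be derived by aggregating ATC codes (the prefixes shown are the standard WHO ATC classes for proton pump inhibitors and opioids; the exact grouping used in the development study may differ):

```python
# WHO ATC prefixes: A02BC = proton pump inhibitors, N02A = opioids.
MEDICATION_GROUPS = {
    "proton_pump_inhibitors": ("A02BC",),
    "opioids": ("N02A",),
}

def medication_predictors(atc_codes):
    """Derive binary (0/1) medication predictors by checking whether any
    of an individual's prescribed ATC codes falls under a group prefix."""
    return {
        group: int(any(code.startswith(p) for code in atc_codes for p in prefixes))
        for group, prefixes in MEDICATION_GROUPS.items()
    }

# Example: pantoprazole (A02BC02) plus a beta blocker (C07AB03).
flags = medication_predictors(["A02BC02", "C07AB03"])
```

The chronic-condition predictors can be derived analogously by matching International Classification of Primary Care codes against per-condition code sets.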

Missing Data and Sensitivity Analysis
Not all enlisted individuals consulted a GP during the follow-up period (year 2019). That means information on falls was missing as there was no registration by GPs for these individuals. We performed a sensitivity analysis to assess the impact of excluding these individuals on the performance of the model in the validation cohort. Because our validation cohort comprised a population from 2 different cities, we conducted a sensitivity analysis to evaluate the performance in individuals from each city separately.

Statistical Analysis
We investigated the extent of the relatedness between the development cohort and validation cohort using 2 approaches. First, individual baseline characteristics, including predictors, and the 1-year fall follow-up (outcome) were compared between the development and validation cohorts using the mean with standard deviation (SD) or median with interquartile range as appropriate, and the standardized mean difference (SMD). Second, we adapted 2 procedures described by Debray et al 11 to quantify the degree of relatedness in case mix between the development and validation cohorts. Specifically, the degree of relatedness was quantified by creating a binary logistic regression model (membership model) to predict the probability that an individual belongs to either the development cohort or the validation cohort. The ability to discriminate between individuals of both cohorts was measured using the ROC-AUC. We subsequently compared the distribution of the predicted risks of the development and validation cohorts by estimating the mean and SD of the linear predictor (LP) for each cohort. An increased SD of the LP indicates more heterogeneity in patient characteristics, whereas the difference in the mean of the LP indicates differences in the outcome incidence.
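The SMD used in the first approach is the difference in means divided by the pooled standard deviation; a minimal sketch (our own helper, not the study's code):

```python
import math

def standardized_mean_difference(x, y):
    """SMD between two samples: (mean(x) - mean(y)) / pooled SD,
    where the pooled SD is sqrt((var(x) + var(y)) / 2)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)  # sample variance
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    pooled_sd = math.sqrt((vx + vy) / 2)
    return (mx - my) / pooled_sd
```

By convention, absolute SMD values below roughly 0.1 are often taken to indicate a negligible difference between cohorts.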
Model performance was assessed in terms of discrimination (the ability to differentiate individuals at higher risk of falling from those at lower risk) and calibration (the agreement between predicted probabilities and actual rates of falls). Discrimination was measured using the ROC-AUC. Calibration was evaluated using calibration-in-the-large, the calibration slope, and visual inspection of a calibration plot, as recommended by Debray et al 11 and Steyerberg et al. 18 Calibration-in-the-large compares the mean predicted risk in the validation cohort with the mean actual risk, whereas the calibration slope quantifies whether predicted risks are systematically too extreme or too moderate across individuals. The ideal values of calibration-in-the-large and the calibration slope are 0 and 1, respectively. A calibration plot illustrates the agreement between predicted risks and actual risks across the range of predicted probabilities. We plotted the mean predicted risks against the mean actual risks for each decile, as indicated in the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement. 19 In addition, we applied Loess smoothing to illustrate the agreement across the range of predicted risks. 20 All analyses were done using the R statistical software environment, version 4.0 (R Foundation for Statistical Computing, Vienna, Austria). We reported the results according to the TRIPOD statement. 19

Results

Figure 1 shows the flow of individuals through this validation study. Overall, 39,342 older people were eligible according to our inclusion criteria. Individuals who died during the observation period (n = 820; 2.1%) and those who registered during the follow-up period (n = 389; 1%) were excluded from the analysis. The remaining 38,133 individuals were included in the primary analysis. Of those, 5124 (13.4%) fell in the 1-year follow-up period.
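For illustration, the decile grouping behind a TRIPOD-style calibration plot can be sketched as follows (a simplified helper of our own, not the study's R code):

```python
def decile_calibration(pred, actual):
    """Sort individuals by predicted risk, split them into deciles, and
    return (mean predicted risk, observed fall rate) per decile, i.e.
    the points of a decile-based calibration plot."""
    pairs = sorted(zip(pred, actual))
    n = len(pairs)
    points = []
    for d in range(10):
        chunk = pairs[d * n // 10:(d + 1) * n // 10]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(a for _, a in chunk) / len(chunk)
        points.append((mean_pred, obs_rate))
    return points
```

Points lying on the diagonal indicate good calibration; points below the diagonal at high predicted risks correspond to the overprediction pattern described in the Results.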
The number of individuals who did not contact the GP in the follow-up period was 3018 (7.9%). History of falls was observed in 5072 (13.3%) of the older people. Our search algorithm to capture falls (history of falls in this context) in free text had a sensitivity, specificity, and positive predictive value of 95%, 88%, and 97%, respectively.
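The metrics reported for the free-text search algorithm follow directly from a 2 × 2 confusion matrix; a minimal sketch (the counts below are illustrative only, not the study's confusion matrix):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and positive predictive value (PPV)
    from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),  # detected fallers among true fallers
        "specificity": tn / (tn + fp),  # correct negatives among non-fallers
        "ppv": tp / (tp + fp),          # true fallers among flagged records
    }

# Illustrative counts only.
m = diagnostic_metrics(tp=95, fp=3, fn=5, tn=22)
```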

Study Population and Baseline Characteristics
The characteristics of both the development and the validation cohorts are compared in Table 1. The prevalence of the outcome and the proportions of most of the predictors were comparable, although some marginal differences existed. For instance, in comparison with the development cohort, the validation cohort was older (median age 71.74 vs 73.00 years), had a higher proportion of women (53.2% vs 56.0%), was prescribed fewer medications, and had fewer comorbidities. Nonetheless, the SMDs of most of the characteristics were small. A full comparison of all the predictors, including the outcome, of both cohorts is given in Supplementary Table 2.
The extent of relatedness between the development and validation cohort as measured by the ROC-AUC in the membership model was 0.580 (95% CI 0.576-0.583). The ratio of the SD of the mean LP between the validation and development cohort was 1.0145 (95% CI 1.0143-1.0147). The difference in mean of LP between the validation and development cohort was 0.0146 (95% CI 0.0145-0.0147).

Model Performance
The ability of the model to discriminate between fallers and nonfallers in the validation cohort, as indicated by the ROC-AUC, was 0.690 (95% CI 0.686-0.698). Calibration-in-the-large and the calibration slope were 0.012 (95% CI −0.018 to 0.042) and 0.878 (95% CI 0.864-0.915), respectively. The calibration plot in Figure 2 shows that the predicted risks agreed with the actual risks across a wide range, but overprediction exists, specifically when the risk of falls is high. Table 2 presents the results of the sensitivity analysis after validation of the model in multiple subcohorts. The discriminative ability of the model was similar in cohorts A and B but slightly decreased in cohort C (ROC-AUC 0.694, 0.693, and 0.668, respectively). Calibration-in-the-large for cohorts A, B, and C was 0.121, −0.002, and 0.107, respectively, and the calibration slope was 0.912, 0.898, and 0.735, respectively. With respect to calibration plots, a similar trend was observed in the 3 validation subcohorts, as depicted in Figure 3. However, there was about 5% underestimation of fall risk in cohort C for predicted probabilities between 0.15 and 0.25.

Discussion
In the present study, we conducted an external validation of our previously developed prediction model for falls in community-dwelling older people, using a large pseudonymized EHR data set collected in a primary care setting in the Netherlands. Our findings showed that the model was stable in terms of discrimination and calibration and reproducible in a different but related population.
The discriminative ability of the model as measured by the ROC-AUC was fair and very close to that reported in the development study. The small decrease in the ROC-AUC was expected as we applied an existing prediction model in a new population. 11 Predicted risks were systematically adequate (calibration-in-the-large = 0.012) and remained proportionally accurate (calibration slope = 0.878). The calibration plot had a similar trend to that of the development study, which revealed reasonable calibration reflected by the overall agreement between predicted risks and actual risks in most of the risk groups, although overprediction for higher predicted risk groups was noticed in a relatively small group of patients in both the development and the validation cohorts. Together, these results support the hypothesis that the model is usable in external, though related, populations at least in the Netherlands.
Few studies have previously attempted to externally validate prediction models for falls in community-dwelling older people. A recent validation study found that simply asking the patient about a history of falls was inadequate to predict future falls. 21 This was confirmed in our study, where we found poor discriminative performance (ROC-AUC = 0.589) when we fitted a model with only history of falls as a predictor. Moreover, although single measures are simple and easy to administer, a prior systematic review and meta-analysis showed that no single measure was enough to predict falls. 22 By contrast, the predictors in our model, including history of falls, are more likely to capture the multifaceted nature of falls. In a study by Palumbo et al, 23 the fall risk assessment tool (FRAT-up) was externally validated on 4 European cohorts. 24 The authors reported an ROC-AUC of 0.646 (95% CI 0.584-0.708) and found heterogeneous discriminative ability of the FRAT-up across the cohorts. In another study that externally validated the modified Johns Hopkins Fall Risk Assessment Tool (mJH-FRAT) in older people receiving home health visits, an ROC-AUC of 0.66 was achieved. 25 Although it is difficult to directly compare these tools with our model, our model showed at least comparable discriminative ability.

Understanding the relatedness between the development and validation cohorts allowed us to interpret the model performance in terms of clinical reproducibility. Our results showed similar baseline characteristics and fall incidence between the development and the validation population, reflected by the small differences and the low SMD values. This was further confirmed by the poor ROC-AUC of the membership model and the similar distribution of the LP in both cohorts, which are indicative of case-mix similarity between the development and validation cohorts. This similarity may explain why the model performed well.
Of note, our sensitivity analysis showed that the performance of the model remained stable in terms of discrimination when validated in multiple subcohorts. However, calibration is also important when model output will be used by physicians to make a clinical decision. 12,26 In other words, an individualized predicted fall probability of 0.3 for a certain older individual should ideally correspond to a 30% absolute risk of falling. Our results indicated that predicted risks for subpopulations such as individuals from general practices in the city of Haarlem were systematically low, reflected in the calibration-in-the-large of 0.107. One possible explanation is the relatively higher frequency of fallers in this particular cohort compared with the development cohort. Nevertheless, this issue is well recognized when externally validating prediction models on another population with a different baseline outcome rate, and the model may benefit from an intercept update to make it suitable for other populations. 11,18

This study has some limitations. Although the reported rate of falls in this study is in line with our development study, it is much lower than that observed in the community. 1 Fallers in this study were plausibly those patients who presented to the GPs for medical attention due to falls, for example, injurious falls. That stated, our prediction model is more likely to be appropriate for implementation in the primary care setting, and an intercept update might be required to adjust for the average rate of falls in the community. Another limitation of this study is the fact that both the development and the external validation were performed in the Dutch population and in Dutch general practice care, and that our validation cohort comprised populations from only 2 cities in an urban area close to each other.
However, our validation cohort was large, and the study certainly adds to our understanding of at least the reproducibility of our prediction model in related populations.
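The intercept update discussed above can be illustrated with a small sketch: keeping all coefficients fixed, an intercept correction is chosen so that the mean predicted risk matches the observed fall rate in the new population. This is a simple moment-matching variant of recalibration-in-the-large, not the authors' procedure; all names are our own:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def intercept_update(linear_predictors, observed_rate, lo=-5.0, hi=5.0):
    """Find, by bisection, the intercept correction delta such that the
    mean of sigmoid(LP + delta) over the cohort equals the observed
    outcome rate. The mean predicted risk is monotone in delta."""
    for _ in range(100):
        mid = (lo + hi) / 2
        mean_pred = sum(sigmoid(lp + mid) for lp in linear_predictors) / len(linear_predictors)
        if mean_pred < observed_rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

In practice this recalibration would be estimated by refitting only the intercept (with the linear predictor as an offset) on data from the target population, leaving discrimination unchanged.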
External validation of clinical prediction models is ideally followed by research to assess their clinical impact. 27,28 Further studies need to be performed to assess the clinical and organizational consequences of implementing this model in clinical practice. Furthermore, future work is required to establish the viability of our prediction model in different and unrelated populations, that is, to assess its transportability, for example, to secondary, postacute, or long-term care settings. In addition, our method to identify fallers in free text was time-consuming. Future studies could consider using a machine learning algorithm to allow for more rapid detection of fall events in the free text.

Conclusions and Implications
Our validated model can potentially be used as a tool by clinicians in the primary care setting to identify older people at higher risk of falling. An advantage of our validated prediction model is that its predictors are easily retrievable from any EHR. The integration of this model with a clinical decision support system can provide valuable insights to clinicians to estimate fall risks and to react accordingly by providing targeted interventions.