A comparative study of model-centric and data-centric approaches in the development of cardiovascular disease risk prediction models in the UK Biobank.
Mamouei M., Fisher T., Rao S., Li Y., Salimi-Khorshidi G., Rahimi K.
AIMS: A diverse set of factors influences cardiovascular diseases (CVDs), but a systematic investigation of the interplay between these determinants and the contribution of each to CVD incidence prediction is largely missing from the literature. In this study, we leverage one of the most comprehensive biobanks worldwide, the UK Biobank, to investigate how much each risk factor category contributes to the accuracy of incidence prediction in the overall population and by sex, age group, and ethnicity. METHODS AND RESULTS: The investigated categories include the history of medical events, behavioural factors, socioeconomic factors, environmental factors, and measurements. We included data from a cohort of 405 257 participants aged 37-73 years and trained various machine learning and deep learning models on different subsets of risk factors to predict CVD incidence. Each model was trained on the complete set of predictors and on subsets in which each category was excluded in turn. The results were benchmarked against QRISK3. The findings highlight three points. (i) Leveraging a more comprehensive medical history substantially improves model performance: relative to QRISK3, the best-performing models improved discrimination by 3.78% and precision by 1.80%. (ii) Both model- and data-centric approaches are necessary to improve predictive performance: the benefits of using a comprehensive history of diseases were far more pronounced when a neural sequence model, BEHRT, was used. This highlights the importance of the temporality of medical events, which existing clinical risk models fail to capture. (iii) Besides the history of diseases, socioeconomic factors and measurements had small but significant independent contributions to predictive performance. CONCLUSION: These findings emphasize the need to consider broad determinants and novel modelling approaches to enhance CVD incidence prediction.
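The category-exclusion design described above (train on all predictors, then retrain with each category removed) can be illustrated with a minimal sketch. This is not the authors' pipeline: the feature names, the toy rows, and the `toy_evaluate` stand-in (an unweighted sum of features instead of a trained model) are hypothetical, and discrimination is measured here with a plain pairwise AUROC.

```python
from typing import Callable, Dict, List, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def leave_one_category_out(
    categories: Dict[str, List[str]],
    evaluate: Callable[[List[str]], float],
) -> Dict[str, float]:
    """Return, per category, the score lost when that category is excluded."""
    all_features = [f for feats in categories.values() for f in feats]
    full_score = evaluate(all_features)
    drops = {}
    for name, feats in categories.items():
        excluded = set(feats)
        kept = [f for f in all_features if f not in excluded]
        drops[name] = full_score - evaluate(kept)
    return drops


# Hypothetical categories and toy participants (illustration only).
CATEGORIES = {
    "medical_history": ["prior_diagnoses"],
    "behavioural": ["smoker"],
    "measurements": ["sbp_z"],
}
ROWS = [
    {"prior_diagnoses": 3, "smoker": 1, "sbp_z": 0.5},
    {"prior_diagnoses": 0, "smoker": 0, "sbp_z": -0.2},
    {"prior_diagnoses": 2, "smoker": 0, "sbp_z": 0.1},
    {"prior_diagnoses": 1, "smoker": 1, "sbp_z": 0.0},
    {"prior_diagnoses": 0, "smoker": 1, "sbp_z": 1.0},
    {"prior_diagnoses": 0, "smoker": 0, "sbp_z": 0.3},
]
LABELS = [1, 0, 1, 0, 1, 0]  # 1 = incident CVD in this toy example


def toy_evaluate(features: List[str]) -> float:
    # Stand-in for "train a model on these features and score it":
    # the risk score is simply the unweighted sum of the selected features.
    scores = [sum(row[f] for f in features) for row in ROWS]
    return auroc(LABELS, scores)


drops = leave_one_category_out(CATEGORIES, toy_evaluate)
for name, drop in drops.items():
    print(f"{name}: AUROC change when excluded = {drop:+.3f}")
```

In a real analysis, `evaluate` would fit a model on a training split and score it on held-out participants. A positive value means that excluding the category reduced discrimination.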