Deep learning for causal inference on electronic health records
Cardiovascular diseases (CVD) are the leading cause of mortality around the world, and disentangling cause and effect is central to better understanding and treating these diseases. While randomised clinical trials are the “gold standard” for assessing the effect of an intervention, some hypotheses cannot feasibly be tested in the randomised setting. In these cases, observational studies with appropriate methods of confounding adjustment can deliver reliable evidence concerning the association between an exposure and an outcome. Indeed, trusted conventional statistical models, with confounder selection guided by subject area experts, have been used to estimate associations in many observational studies; however, in observational studies where the confounding structure is unknown and/or the population suffers from complex illness, the conventional approaches yield insufficiently adjusted estimates. In parallel, there has recently been unprecedented access to nationally representative multimodal electronic health record (EHR) datasets, alongside advances in statistical learning including “deep” machine learning, a form of machine learning that relies on automatic feature capture and removes the need for expert-driven feature engineering. In this doctoral research, the aim was to develop a deep learning approach for causal inference on EHR. To do so in a structured way, the research was split into three investigations: (1) the development of a deep learning model for EHR data and assessment of its risk prediction performance; (2) given the “black box” nature of deep learning modelling, the development of methods to explain the proposed model; and (3) the derivation of a model for causal inference, and application of the models for association estimation in elderly/at-risk patient subgroups. In the first work, the model, Bidirectional EHR Transformer (BEHRT), was created for EHR representation learning and risk prediction.
The model outperformed several benchmarks for risk prediction on a variety of tasks, including incident heart failure prediction. Furthermore, in the second work, explainability investigations showed that the model captured validated risk factors (e.g., hypertension, diabetes, and other diseases) and identified several further factors that could potentially be protective against incident heart failure. Lastly, a derivative of BEHRT, Targeted-BEHRT, was developed for association estimation, fusing advances in deep learning and semi-parametric statistics. The model demonstrated superior estimation abilities in several simulated data experiments, and was applied to better understand the effects of antihypertensives, blood pressure, and paracetamol on cardiovascular endpoints, mortality, and other outcomes in at-risk patients. Overall, the doctoral research has made advances in both methodological and clinical cardiovascular research. While the research focuses on developing methods for the study of cardiovascular diseases, the methods developed and tested have several important implications for epidemiological research in the observational setting at large. Especially in patient groups with pre-existing health issues, the causal models developed can be a more appropriate approach for association analysis than conventional statistical ones. In terms of clinical impact, the research has advanced our understanding of risk and protection in the context of CVD.