Synthetic data offer a number of advantages over ground truth data when working with private and personal information about individuals. Firstly, the risk of identifying individuals is reduced considerably, which enables data to be shared for analysis among more organisations. Secondly, fine-tuning synthetic datapoints to suit particular modelling tasks and analyses could help to build more suitable models that avoid biases found in the original ground truth data. In this paper we explore how a probabilistic synthetic data generator can be used to model data with high enough fidelity that they can be used to develop and validate state-of-the-art machine learning models. In particular, we use a Bayesian network model trained on gestational diabetes data, generated by a mobile health app and collected from a number of health trusts in the UK. The resulting synthetic data are used to train and test an established machine learning model developed by Sensyne Health using real-world data, and the resulting performance is compared with performance on ground truth data. In addition, a clinical validation is undertaken to explore whether human experts can differentiate real patients from synthetic ones. We demonstrate that the Bayesian network synthetic data generator is able to mimic the ground truth closely enough to make it difficult for a human expert to distinguish between the two. We show that the data generator captures the interactions between features and the multivariate distributions closely enough to enable classifiers to be inferred that imitate the key performance characteristics of models inferred from ground truth data. What is more, we demonstrate that the misclassifications discovered when testing on the synthetic data are as informative as those discovered when testing on ground truth data.
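The general workflow described in the abstract (fit a Bayesian network to ground truth data, sample synthetic records from it, then compare a classifier trained on each) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three binary variables `a`, `b`, `y`, the chain structure `a -> b -> y`, the probabilities, and the trivial rule-based classifier are hypothetical stand-ins, not the paper's network, features, or the Sensyne Health model.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_ground_truth(n):
    """Toy stand-in for real data: three binary columns with a chain dependency."""
    a = rng.random(n) < 0.4
    b = rng.random(n) < np.where(a, 0.7, 0.2)
    y = rng.random(n) < np.where(b, 0.6, 0.1)
    return np.column_stack([a, b, y]).astype(int)

def fit_cpt(child, parent=None, alpha=1.0):
    """Estimate P(child=1 | parent) by counting, with Laplace smoothing."""
    if parent is None:
        return (child.sum() + alpha) / (len(child) + 2 * alpha)
    return np.array([
        (child[parent == v].sum() + alpha) / ((parent == v).sum() + 2 * alpha)
        for v in (0, 1)
    ])

def fit_network(data):
    """Fit the conditional probability tables of the assumed chain a -> b -> y."""
    a, b, y = data.T
    return {"a": fit_cpt(a), "b|a": fit_cpt(b, a), "y|b": fit_cpt(y, b)}

def sample_network(cpts, n):
    """Ancestral sampling: draw each node after its parent has been drawn."""
    a = (rng.random(n) < cpts["a"]).astype(int)
    b = (rng.random(n) < cpts["b|a"][a]).astype(int)
    y = (rng.random(n) < cpts["y|b"][b]).astype(int)
    return np.column_stack([a, b, y])

def fit_rule(train):
    """Tiny classifier: for each value of b, predict the majority label y."""
    b, y = train[:, 1], train[:, 2]
    return np.array([y[b == v].mean() > 0.5 for v in (0, 1)]).astype(int)

def accuracy(rule, test):
    """Fraction of test rows whose label y matches the rule's prediction from b."""
    return (rule[test[:, 1]] == test[:, 2]).mean()

real = make_ground_truth(20000)
synthetic = sample_network(fit_network(real), 20000)
held_out_real = make_ground_truth(5000)

# Fidelity checks: per-column means and pairwise correlations.
mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).max()
corr_gap = np.abs(
    np.corrcoef(real, rowvar=False) - np.corrcoef(synthetic, rowvar=False)
).max()

# Utility check: train on real vs. train on synthetic, test on held-out real data.
acc_real = accuracy(fit_rule(real), held_out_real)
acc_synthetic = accuracy(fit_rule(synthetic), held_out_real)

print(f"max mean gap: {mean_gap:.4f}, max correlation gap: {corr_gap:.4f}")
print(f"accuracy (trained on real): {acc_real:.4f}")
print(f"accuracy (trained on synthetic): {acc_synthetic:.4f}")
```

In this toy setting the network structure is known in advance; with real data the structure would have to be specified by experts or learned, and a clinical validation of the kind described above would still be needed to judge realism.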

Original publication

Conference paper

Publication Date

259 - 264