Ensembling Synthetic Data and Digital Twin Technologies for Predictive Modeling in Life Sciences
Techniques to counter bias and elevate privacy for clinical trials
Recent history, like the COVID-19 pandemic, has shown that the speed and accuracy of clinical trials in support of new vaccines and medicines can have a major impact on public health and morbidity. The key to more efficient and timely clinical trials is adequate data from historical patient records, test subjects who take the vaccine and control subjects who receive a placebo. This data can be difficult to obtain and takes precious time to generate. Synthetic data (data that is created by an algorithm) has seen early success, attracting the attention of clinical trial practitioners. The FDA has already approved some limited uses for synthetic data as a source of external control data. What if synthetic data could “fill the data gap” on patient history or act as a substitute for a control group?
The life sciences and healthcare industries are undergoing a rapid evolution of machine learning (ML)-enabled services and products. Radiology, for example, has started to experiment with ML-based diagnosis of malignancy in breast cancer. Training the underlying models requires substantial data. Synthetic data (or images) can be a credible source and, therefore, a major accelerator for producing more ML-based solutions.
The Motivation for Using Synthetic Data
The basic use case for synthetic data arises when there is inadequate training data for modeling any kind of problem. Researchers and statisticians have been creating simple variations of synthetic data (think random number generators and Monte Carlo simulations) for many years to help bolster confidence in their models. Randomness, however, does not meet the standard for use in clinical trials and life sciences in general. Real patient data has always been the foundation of such modeling. Only when we can show that a synthetic data set is a faithful representation of real patient data does it meet the threshold for inclusion.
Recent advances in digital twin and deep neural network capabilities have demonstrated a capacity to create data that is indistinguishable from real data and is often more substantial, more accurate and of higher utility. In addition, synthetic data can guarantee a level of privacy protection that the real data, even after undergoing various masking processes, cannot.
Overall, a high-quality synthetic data set has the following attributes:
- It accurately mimics statistical properties and correlations of the original data.
- It is often a generalization of the original data, as it can ignore outliers and noise (cleaning original data).
- It can be generated in any quantity, independent of the size of the original data.
- It has high utility. For example, it can replace real data in training accurate predictive models.
- The original data does not have to be big for the model to learn its properties and generate high-quality synthetic data.
- Synthetic data can be made very private without the need to mask the original data.
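The first and third attributes above can be illustrated with a minimal sketch. It is not the deep generative model discussed in this article: a simple multivariate Gaussian stands in for the generator, and the "real" data, feature meanings and sample sizes are all simulated assumptions. The sketch fits the stand-in model to a small data set, draws a much larger synthetic sample, and compares the correlation matrices of the two.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for a small real data set: 200 patients, 3 numeric
# features (hypothetically age, systolic blood pressure, cholesterol).
true_mean = np.array([55.0, 130.0, 200.0])
true_cov = np.array([
    [100.0,  60.0,  40.0],
    [ 60.0, 225.0,  90.0],
    [ 40.0,  90.0, 900.0],
])
real = rng.multivariate_normal(true_mean, true_cov, size=200)


def fit_gaussian(data):
    """Estimate the mean vector and covariance matrix of the data."""
    return data.mean(axis=0), np.cov(data, rowvar=False)


def sample_synthetic(mean, cov, n, rng):
    """Draw n synthetic records from the fitted Gaussian."""
    return rng.multivariate_normal(mean, cov, size=n)


# Fit on 200 real records, then generate 5000 synthetic ones:
# the synthetic sample size is independent of the original size.
mean_hat, cov_hat = fit_gaussian(real)
synthetic = sample_synthetic(mean_hat, cov_hat, 5000, rng)

# Compare pairwise feature correlations of real and synthetic data.
corr_real = np.corrcoef(real, rowvar=False)
corr_synth = np.corrcoef(synthetic, rowvar=False)
corr_gap = float(np.abs(corr_real - corr_synth).max())
print(f"largest absolute gap between correlation matrices: {corr_gap:.3f}")
```

A small gap indicates that the synthetic sample reproduces the pairwise correlations of the source data. A Gaussian captures only linear relationships between numeric features, which is one reason richer models are used in practice.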
The “SynTwin” combination of digital twin-enriched synthetic data provides a means to counteract bias in trial data and reach near-perfect anonymity.
Using a digital twin approach to data allows the training data set to be further enriched by data points that may not typically be captured. It is the combination of digital twin feature engineering and advanced ML science that is key to realizing the “art of the possible” for synthetic data.
The main driving force for using synthetic data, then, remains the need for accurate predictive models when the available real training data sets are prohibitively small. Clinical trials add a further impetus: the urgency of executing a trial, and the human cost of waiting while a large enough volume of real data is generated. This is where synthetic data can be extremely useful and altruistic.
Key Considerations: Utility & Privacy
The term synthetic data has multiple meanings in different contexts. Here it means artificial data generated by a sophisticated ML model that learns from enhanced real data (or a digital twin). The resulting synthetic data shares the statistical properties of the original data (feature distributions and value space), as well as the correlations between features (for instance, if features A and B are strongly correlated in the original data, they will be similarly correlated in the synthetic data). This happens because, during training, the ML model learns connections and dependencies within the data, making the synthetic data increasingly like the original. Two types of ML architectures are usually employed for this task: autoencoders and generative adversarial networks. The trained model can be used to generate an unlimited number of data records without using real data as an input, as all information about the real data is held in the model itself. Each output is a virtual twin of a nonexistent person, encoded as a vector that can easily be converted back into its original, non-digital form. For instance, a vector [1 0, 1 0…0] will become a non-smoker from Alaska and [0 1, 0 1…0] will be a smoker from Arizona.
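The conversion from a generated vector back to a readable record can be sketched as follows. This is a minimal illustration assuming a one-hot encoding in which each feature owns a contiguous block of the vector. The schema, the feature names and the three-state list are hypothetical; the article does not specify the full layout of the vector.

```python
# Hypothetical schema: (feature name, ordered category labels).
# The state list is truncated to three entries for illustration only.
SCHEMA = [
    ("smoker", ["non-smoker", "smoker"]),
    ("state", ["Alaska", "Arizona", "Arkansas"]),
]


def decode_record(vector, schema=SCHEMA):
    """Map a one-hot (or soft) vector back to categorical labels."""
    record = {}
    start = 0
    for name, categories in schema:
        block = vector[start:start + len(categories)]
        # Pick the position with the largest value, so that soft
        # generator outputs such as 0.9 / 0.1 are handled as well.
        index = max(range(len(block)), key=block.__getitem__)
        record[name] = categories[index]
        start += len(categories)
    return record


print(decode_record([1, 0, 1, 0, 0]))
# {'smoker': 'non-smoker', 'state': 'Alaska'}
print(decode_record([0, 1, 0, 1, 0]))
# {'smoker': 'smoker', 'state': 'Arizona'}
```

Taking the largest value in each block, rather than requiring exact ones and zeros, is a design choice that tolerates the continuous outputs a neural generator typically produces.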
An increasingly challenging and potentially expensive issue in medical science is data privacy, namely protecting an individual’s information from leaking to bad actors. One form of privacy loss relevant to our example is the disclosure of whether a person took part in the trial, because that alone would reveal the person’s condition or diagnosis. Standard techniques, such as masking or obfuscation, can be insufficient (the remaining unmasked information can be used for re-identification) or can cause a loss of data utility (too many important features become unavailable for prediction). They are also time- and resource-consuming, requiring many rules and much coordination, and still often fall short against increasingly sophisticated bad actors with vast resources. Moreover, none of them offers a definitive guarantee of how well privacy is maintained.
The digital twin-enhanced patient profile creates a richer, more inclusive and less biased data set that serves as an input into the synthetic data generator.
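The re-identification risk that remains after masking can be made concrete with a small check, sketched here under stated assumptions. The records and field names are hypothetical, and the measure used (the size of the smallest group of records sharing the same unmasked attributes, often called k-anonymity) is not named in this article. A group of size one means that the unmasked attributes alone single out one person.

```python
from collections import Counter

# Hypothetical masked records: names are removed, but some
# attributes (quasi-identifiers) remain available for prediction.
records = [
    {"zip3": "995", "birth_year": 1950, "sex": "F"},
    {"zip3": "995", "birth_year": 1950, "sex": "F"},
    {"zip3": "850", "birth_year": 1972, "sex": "M"},
    {"zip3": "850", "birth_year": 1972, "sex": "M"},
    {"zip3": "850", "birth_year": 1931, "sex": "M"},
]


def group_sizes(records, keys):
    """Count how many records share each combination of the given keys."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def smallest_group(records, keys):
    """Return the smallest group size and the combinations that are unique."""
    counts = group_sizes(records, keys)
    unique = [combo for combo, n in counts.items() if n == 1]
    return min(counts.values()), unique


k, unique = smallest_group(records, ["zip3", "birth_year", "sex"])
print(f"smallest group size: {k}")
print(f"combinations that identify a single record: {unique}")
```

Removing more attributes enlarges the groups but also removes features a predictive model may need, which is the utility loss described above.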
Differential privacy is the only tool that gives a mathematical guarantee on the worst-case privacy loss, and that guarantee can be made very strict. The synthetic data generation process offers a natural way to incorporate differential privacy: controlled random noise is added to the model’s coefficients while the generative model is trained. This prevents the model from accurately “memorizing” the real data, in particular unusual features (or outliers) that are often the easiest target for re-identification. The resulting differentially private synthetic data makes it practically impossible to learn whether any individual was part of the real data set. How strong this protection is depends on the privacy budget parameter epsilon: the smaller the epsilon, the stricter the privacy. It is important to note that while choosing a small privacy budget is extremely beneficial for privacy protection, it has a drawback: diminished data utility.
During training, deep learning synthetic data generator models learn the distributions, patterns and correlations of the original data, something that humans cannot do given the amount and complexity of the information. With the help of digital twin engineering, the synthetic data that the trained model generates provides an accurate and diverse representation of the source patient data while protecting privacy. Synthetic data can alleviate some limitations of the source data, such as small size, unbalanced representation and privacy concerns, and it can be used as an effective tool for more efficient and timely clinical trials.
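The trade-off governed by epsilon can be seen in a much simpler setting than generative model training. The sketch below does not add noise to model coefficients. As a stand-in, it applies the classic Laplace mechanism to a single count query, where the noise scale is the query sensitivity divided by epsilon. The count, the epsilon values and the function names are illustrative assumptions.

```python
import numpy as np


def laplace_count(true_count, epsilon, rng, size=None):
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    Adding or removing one person changes a count by at most 1,
    so the sensitivity of a count query is 1.
    """
    sensitivity = 1.0
    return true_count + rng.laplace(0.0, sensitivity / epsilon, size=size)


def mean_abs_error(true_count, epsilon, trials, rng):
    """Average distance between the noisy and the true count."""
    noisy = laplace_count(true_count, epsilon, rng, size=trials)
    return float(np.mean(np.abs(noisy - true_count)))


rng = np.random.default_rng(0)
true_count = 120  # hypothetical number of trial participants with a condition

errors = {eps: mean_abs_error(true_count, eps, 10_000, rng)
          for eps in (0.1, 1.0, 10.0)}
for eps, err in errors.items():
    print(f"epsilon={eps:>4}: mean absolute error of released count = {err:.2f}")
```

The average error grows roughly as 1 / epsilon: a tenfold stricter privacy budget costs about a tenfold less accurate answer. The same tension applies, in a less transparent form, when noise is injected during the training of a generator.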