UBS Hackathon: Synthetic Data

Founded in Switzerland, UBS is a multinational investment bank and the largest private bank in the world. As part of its strategic emphasis on digitisation, it held a Hackathon to investigate how synthetic data can be used to train machine learning models that are then applied to the original data, generating business insights while balancing data security considerations.

The synthetic data analysis was named champion of the Hackathon.

Company

UBS

Category

Data Science

Year

2022

Context

UBS created an Artificial Intelligence team in 2021 in an effort to digitise its banking services and to consolidate and grow its high-net-worth client base. To leverage the power of AI, data is the indispensable element. Yet any digital transformation effort must reckon with the regional and domestic regulatory friction arising from data protection. To overcome these hurdles, UBS is now experimenting with creating synthetic data, which is both useful and shareable, for machine learning.

Data sharing in finance

Many new banking technologies are powered by AI. Consider robo-advisory (automated financial services that require minimal human supervision), anomaly detection (for detecting instances of fraud, identity theft, and other attacks and errors) and algorithmic trading (computer programs that make predictions and execute market strategies). These are just a few of many examples.

All of these require two conditions: (i) training AI with data, and (ii) compliance with data protection regulations, including but not limited to the General Data Protection Regulation (GDPR), which has far-reaching implications globally. A common data protection practice is pseudonymisation: replacing any information that could be used to identify an individual with a value that does not allow the individual to be directly identified. Yes, you are not supposed to know that the data in front of you belongs to Elon Musk.

However, because publicly available data can be used to re-identify pseudonymised data, the technique does not exempt controllers from the ambit of the GDPR. To make data generally shareable without violating these regulations, we need anonymisation: making the data completely unidentifiable. This is the objective of synthetic data.
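To make the distinction concrete, here is a minimal sketch of pseudonymisation via salted hashing. The salt, the record, and the function name are all hypothetical illustrations, not anything from the Hackathon codebase.

```python
import hashlib

# Hypothetical secret salt held only by the data controller.
SALT = "change-me-and-keep-secret"

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a salted hash token.

    The mapping is repeatable, so records can still be joined on the
    token -- but whoever holds the salt can rebuild the mapping, which
    is why pseudonymised data still counts as personal data under GDPR.
    """
    return hashlib.sha256((SALT + identifier).encode()).hexdigest()[:12]

record = {"client": "Elon Musk", "salary": 250_000}
safe_record = {**record, "client": pseudonymise(record["client"])}
```

The non-identifying attributes (here, the salary) survive untouched, which is exactly why re-identification from auxiliary data remains possible and why full anonymisation requires a different approach.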

Utilising synthetic data

Synthetic data is artificial data that retains the characteristics and complexities of a real data set without any personally identifying information. For synthetic data to be useful, it should closely mirror the real data, particularly the relationships between attributes in the set.
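One simple way to check that relationships between attributes are preserved is to compare the Pearson correlation of a pair of columns in the real and synthetic tables. The experience and salary columns below are made-up toy values, purely to illustrate the check.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical columns: years of experience vs salary (in $k).
real_exp, real_sal = [1, 3, 5, 8, 12], [90, 120, 150, 200, 260]
syn_exp, syn_sal = [2, 4, 6, 7, 11], [100, 130, 140, 190, 250]

# Faithful synthetic data keeps this gap small.
gap = abs(pearson(real_exp, real_sal) - pearson(syn_exp, syn_sal))
```

A small gap on every column pair suggests the synthetic table has kept the structure a downstream model would learn from.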

Synthetic data not only caters for country-specific data restrictions; it also reduces the risk of data leakage, enables data sharing within the company and improves the time-to-market of these tech services.

Creating & evaluating synthetic data

Original dataset

More than 62,000 STEM salary records scraped from levels.fyi were used for synthetic data generation and prediction. (Kaggle Link)

Synthetic data generation

We used the Synthetic Data Vault (SDV) from MIT for synthetic data generation. SDV is an ecosystem of libraries that lets users easily model single-table, multi-table and time-series datasets and generate new synthetic data with the same format and statistical properties as the original dataset (Patki, Wedge and Veeramachaneni, 2016).

For comparison and quality control, five different models were deployed: Tabular Preset, GaussianCopula, CopulaGAN, CTGAN and TVAE.
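The Hackathon used SDV's off-the-shelf implementations of these models, but the idea behind the GaussianCopula model can be sketched in a few lines: fit each column's marginal distribution (simplified here to a plain Gaussian), estimate the correlation between columns, then sample correlated latent normals and map them back to the data scale. Everything below, including the toy table and function names, is illustrative and is not SDV's API.

```python
import math
import random

def fit(xs, ys):
    """Estimate means, standard deviations and the correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    rho = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)
    return mx, my, sx, sy, rho

def sample(params, n, seed=42):
    """Draw n synthetic (x, y) pairs preserving the fitted correlation."""
    mx, my, sx, sy, rho = params
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        # Build a second latent normal correlated with the first.
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        rows.append((mx + sx * z1, my + sy * z2))
    return rows

# Toy "real" table: (years of experience, salary in $k).
real = [(1, 95), (3, 120), (5, 160), (8, 210), (12, 240)]
params = fit([r[0] for r in real], [r[1] for r in real])
synthetic = sample(params, 1000)
```

SDV's real GaussianCopula model generalises this to arbitrary marginals and many columns; the GAN- and VAE-based models (CopulaGAN, CTGAN, TVAE) learn the joint distribution with neural networks instead.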

Evaluating synthetic data

Three metrics were used to evaluate the synthetic data. First, statistical metrics, which check the distributions of the numerical and categorical attributes. Second, detection metrics, which train a machine learning classifier to try to distinguish the real data from the synthetic data. Third, privacy metrics, which test whether the original key attributes can be re-identified. Based on the aggregate result, the CTGAN model was found to be the best performer.
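As a concrete illustration of the first family, the statistical similarity of a numeric column can be scored with a two-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical CDFs of the real and synthetic values. This is a sketch of the idea, not SDV's evaluation API, and the salary columns are hypothetical.

```python
import bisect

def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest vertical gap between the
    empirical CDFs (0 = identical distributions, 1 = fully disjoint)."""
    r, s = sorted(real), sorted(synthetic)

    def ecdf(sample, x):
        # Fraction of observations less than or equal to x.
        return bisect.bisect_right(sample, x) / len(sample)

    points = sorted(set(real) | set(synthetic))
    return max(abs(ecdf(r, x) - ecdf(s, x)) for x in points)

# Hypothetical salary columns (in $k); a score of 1 means identical.
real_sal = [90, 100, 110, 120, 130, 140]
syn_sal = [95, 105, 115, 125, 135, 145]
score = 1 - ks_statistic(real_sal, syn_sal)
```

Aggregating such per-column scores, together with detection and privacy scores, gives the overall ranking described above.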

Evaluation result

The results showed that the CTGAN model outperformed all the other synthetic data generation models. It is worth noting that the model also scored highest on the privacy metrics.

Data preprocessing & predictive modeling

There were two requirements when designing the predictive modelling setup for comparison:
1. It must statistically resemble the original data to ensure authenticity.
2. It must structurally resemble the original data so that it can be deployed by any software.

The experimental setup evaluated synthetic data against a held-out test set of original data for control purposes. This allowed us to compare how useful each type of data is for machine learning. After pre-processing, a LightGBM model was trained on each dataset.

Prediction result

The real data and the synthetic data were fed into different models and scenarios for a holistic view. As expected, LightGBM outperformed the neural network on the real data. Although we only used the LightGBM model on the synthetic data, we observed that the R² scores improved when we applied feature selection and increased the number of rows.
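For reference, the R² score used to compare these models measures the fraction of variance in the target that a model explains, and is simple to compute directly. The salary figures and predictions below are hypothetical.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical salaries (in $k) against a model's predictions.
actual = [100, 150, 200, 250]
predicted = [110, 140, 210, 240]
score = r2_score(actual, predicted)  # 1.0 would be a perfect fit
```

Because R² is relative to the variance of the test targets, scores on synthetic-trained and real-trained models can be compared fairly as long as both are evaluated on the same held-out real test set.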

For further comparison, one more synthetic data generator was used: Gretel. It produced the best results of all; its scores closely tracked those of the real data when trained with LightGBM.

Regression model

Upon plotting the results for visualisation, it could be seen that the outputs of the synthetic data models closely resembled those generated from real data. In particular, the Gretel result was less scattered, and its distribution looked highly similar to that of the real data.

Feature importance offers another perspective. The Gretel result remained highly similar to the real data result in terms of the top features and their corresponding importance.

Result

Champion
“You have done a tremendously great job within only 4 days, from understanding a very new and complex subject to constructing a robust prediction model for comparison. The depth of research and results are truly impressive. There are many insightful findings which we can continue to explore.”

- Adjudication panel comprising members of the UBS AI Team
