- Flat Leaderboard: 17th
- Sequential Leaderboard: 11th
Timeline: May 14, 2025 - Jul 3, 2025
The goal of this challenge is to generate the best synthetic data.
There are two independent challenges:
The task is to generate a novel synthetic dataset that matches the structure of a provided dataset and preserves its statistical patterns, while its records are NOT significantly closer to the (released) original samples than to the (unreleased) holdout samples.
NOTE: To succeed, you'll typically train a generative model on the training data, ensuring it generalizes well without overfitting.
FLAT DATA challenge: 100,000 records with 80 data columns (60 numeric, 20 categorical).
NOTE: Any submission to the FLAT DATA challenge must be in CSV format, contain 100,000 records, and consist of the same columns to be valid.
SEQUENTIAL DATA challenge: 20,000 groups of 5-10 records each, with 10 data columns (7 numeric, 3 categorical).
NOTE: A submission to the SEQUENTIAL DATA challenge must be in CSV format, contain 20,000 groups, and consist of the same columns to be valid. You can pick any format for the group ID column; it is only used to group the corresponding events together.
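Since an invalid file is rejected outright, it's worth automating a pre-flight check against both notes above. Below is a minimal sketch with pandas; the file names and the `group_id` column name are my assumptions, not part of the challenge kit.

```python
from typing import Optional

import pandas as pd

def validate_submission(submission_path: str, reference_path: str,
                        expected: int, group_col: Optional[str] = None) -> None:
    """Pre-flight check: CSV parses, has the expected columns and size."""
    sub = pd.read_csv(submission_path)
    ref = pd.read_csv(reference_path, nrows=1)  # header only, for the column list

    # Same columns as the provided training data (order ignored here).
    missing = set(ref.columns) - set(sub.columns)
    assert not missing, f"missing columns: {sorted(missing)}"

    if group_col is None:
        # FLAT challenge: exactly 100,000 records.
        assert len(sub) == expected, f"expected {expected} rows, got {len(sub)}"
    else:
        # SEQUENTIAL challenge: exactly 20,000 groups; the ID format is free.
        n_groups = sub[group_col].nunique()
        assert n_groups == expected, f"expected {expected} groups, got {n_groups}"

validate_submission("flat-submission.csv", "flat-training.csv", 100_000)
validate_submission("sequential-submission.csv", "sequential-training.csv",
                    20_000, group_col="group_id")  # assumed ID column name
```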
To qualify for the Leaderboard, a submission must meet two privacy thresholds:
- DCR Share below 52%
- NNDR Ratio above 0.5
These metrics ensure that the generated samples are sufficiently distinct — just as far from the training data as from the holdout set — while still capturing the core statistical structure.
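The official evaluation code belongs to the organizers, but both thresholds can be approximated locally with nearest-neighbor queries. The sketch below assumes one common reading of the metrics: DCR Share as the fraction of synthetic rows whose closest training record is nearer than their closest holdout record (ties split evenly), and NNDR Ratio as the mean nearest-to-second-nearest distance ratio against training divided by the same quantity against holdout. Encoding and scaling of the mixed-type columns are elided.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def privacy_metrics(synthetic: np.ndarray, train: np.ndarray,
                    holdout: np.ndarray) -> tuple:
    """Approximate DCR Share and NNDR Ratio from nearest-neighbor distances.

    Assumes all inputs are already numerically encoded and scaled.
    """
    def nn_dists(reference: np.ndarray):
        nn = NearestNeighbors(n_neighbors=2).fit(reference)
        d, _ = nn.kneighbors(synthetic)   # shape: (n_synthetic, 2)
        return d[:, 0], d[:, 1]           # closest and second-closest distances

    d1_train, d2_train = nn_dists(train)
    d1_hold, d2_hold = nn_dists(holdout)

    # DCR Share: how often a synthetic row sits closer to training than holdout.
    # A well-generalizing model should land near 50%; below 52% qualifies.
    closer = d1_train < d1_hold
    ties = d1_train == d1_hold
    dcr_share = closer.mean() + 0.5 * ties.mean()

    # NNDR Ratio (assumed definition): mean d1/d2 vs. training over vs. holdout.
    eps = 1e-12
    nndr_train = (d1_train / (d2_train + eps)).mean()
    nndr_hold = (d1_hold / (d2_hold + eps)).mean()
    return dcr_share, nndr_train / (nndr_hold + eps)
```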
The final evaluations will take place on AWS EC2 instances (c5d.12xlarge for CPU runs and g5.2xlarge for GPU runs), with a strict time cap of six hours per execution.
Final evaluations will take into account five factors:
- accuracy
- privacy
- ease of use
- compute efficiency
- generalizability
I tried different batch sizes for the MostlyAI models on both datasets. The best-performing models for synthetic data generation are listed below; a training sketch with the SDK follows the table.
| Batch Size | Model | Accuracy | DCR Share | NNDR Ratio | Training Checkpoint Epoch | Validation Loss | Samples | Report |
|---|---|---|---|---|---|---|---|---|
| 1024 | MOSTLY_AI/Large | 98.1 | 51.0 | 1.024 | 100 | 110.672 | 7.89M | Model Report |
| 512 | MOSTLY_AI/Medium | 96.0 | 51.8 | 1.244 | 41 | 14.397 | 633.04K | Model Report |
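For reference, this is roughly what one of these runs looks like with the MostlyAI SDK in local mode. The configuration keys follow my reading of the v4 SDK docs, and the file and generator names are placeholders, so treat it as a sketch rather than the exact submitted pipeline.

```python
import pandas as pd
from mostlyai.sdk import MostlyAI

mostly = MostlyAI(local=True)  # run training and generation on this machine

df = pd.read_csv("flat-training.csv")  # placeholder path

# Train a tabular generator; model and batch size mirror the best flat run above.
generator = mostly.train(config={
    "name": "flat-generator",
    "tables": [{
        "name": "flat",
        "data": df,
        "tabular_model_configuration": {
            "model": "MOSTLY_AI/Large",
            "batch_size": 1024,
            "max_epochs": 100,
        },
    }],
})

# Sample a full-size submission and export it as CSV.
synthetic = mostly.generate(generator, size=100_000)
synthetic.data().to_csv("flat-submission.csv", index=False)
```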
- Datasets: flat-training.csv (100,000 rows, 80 columns), sequential-training.csv (20,000 groups, 5-10 records each, 10 columns)
- Hardware: NVIDIA A10/A10G GPU, 673-906 GB RAM, 17 CPUs on Modal
- SDK: MostlyAI v4.7.9 (LOCAL mode)
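A hardware setup like this is easiest to reproduce declaratively. Below is a hedged sketch of pinning the training job to an A10G on Modal; the app name, image contents, and the empty `train_flat` body are illustrative assumptions.

```python
import modal

# Image with the MostlyAI SDK in local mode; pinned to the version used above.
image = modal.Image.debian_slim().pip_install("mostlyai[local]==4.7.9", "pandas")
app = modal.App("synthetic-data-challenge")

@app.function(image=image, gpu="A10G", cpu=17, timeout=6 * 60 * 60)
def train_flat() -> None:
    # Placeholder body: load flat-training.csv, train MOSTLY_AI/Large with
    # batch size 1024, and write the synthetic CSV (see the SDK sketch above).
    ...

@app.local_entrypoint()
def main() -> None:
    train_flat.remote()  # run the training job on the remote A10G worker
```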