THE MOSTLY AI PRIZE

Leaderboard

  • Flat Leaderboard: 17th
  • Sequential Leaderboard: 11th

Timeline: May 14, 2025 - Jul 3, 2025


Competition Overview

The goal of this competition is to generate the best possible synthetic data.

There are two independent challenges. In each, the task is to generate a novel synthetic dataset that matches the structure of a provided dataset and preserves its statistical patterns, while its records are NOT significantly closer to the (released) original samples than to the (unreleased) holdout samples.

NOTE: To succeed, you'll typically train a generative model on the training data, ensuring it generalizes well without overfitting.

THE FLAT DATA Challenge

100,000 records, with 80 data columns: 60 numeric, 20 categorical.

NOTE: To be valid, a submission to the FLAT DATA challenge must be in CSV format, contain 100,000 records, and consist of the same columns.

THE SEQUENTIAL DATA Challenge

20,000 groups of 5-10 records each, with 10 data columns: 7 numeric, 3 categorical.

NOTE: To be valid, a submission to the SEQUENTIAL DATA challenge must be in CSV format, contain 20,000 groups, and consist of the same columns. You can pick any format for the group ID column; it is only used to group the corresponding events together. A quick pandas sanity check for both submission formats is sketched below.
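A minimal sketch of such a pre-submission check, assuming pandas and hypothetical file names; the group ID column is assumed to be named group_id here, though any name and format is allowed:

```python
import pandas as pd

# Flat submission: 100,000 records with the same columns as the training data.
train = pd.read_csv("flat-training.csv")
flat_sub = pd.read_csv("flat-submission.csv")  # hypothetical file name
assert len(flat_sub) == 100_000, "flat submission must contain 100,000 records"
assert list(flat_sub.columns) == list(train.columns), "columns must match the training data"

# Sequential submission: 20,000 groups with the same columns.
seq_sub = pd.read_csv("sequential-submission.csv")  # hypothetical file name
groups = seq_sub.groupby("group_id")  # "group_id" is an assumed column name
assert groups.ngroups == 20_000, "sequential submission must contain 20,000 groups"
```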

Evaluation Criteria

To qualify for the Leaderboard, a submission must meet two privacy thresholds:

  • DCR Share below 52%
  • NNDR Ratio above 0.5

These metrics ensure that the generated samples are sufficiently distinct — just as far from the training data as from the holdout set — while still capturing the core statistical structure.
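The authoritative definitions live in the competition's evaluation code; as rough intuition only, a numpy/scikit-learn sketch of one common formulation (DCR Share as the fraction of synthetic records closer to training than to holdout, NNDR as the 1st-to-2nd nearest-neighbor distance ratio) might look like this:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_distances(queries, reference, k=1):
    """Distances from each query row to its k nearest reference rows.
    Assumes numeric feature matrices; encode categorical columns first."""
    nn = NearestNeighbors(n_neighbors=k).fit(reference)
    dist, _ = nn.kneighbors(queries)
    return dist

def dcr_share(synthetic, train, holdout):
    """Fraction of synthetic records whose closest training record is
    nearer than their closest holdout record (ties counted half)."""
    d_train = nn_distances(synthetic, train)[:, 0]
    d_hold = nn_distances(synthetic, holdout)[:, 0]
    return (d_train < d_hold).mean() + 0.5 * (d_train == d_hold).mean()

def nndr(synthetic, reference):
    """Mean ratio of 1st- to 2nd-nearest-neighbor distance."""
    d = nn_distances(synthetic, reference, k=2)
    return (d[:, 0] / np.maximum(d[:, 1], 1e-12)).mean()
```

Under this reading, an ideal generator lands near a DCR Share of 50%: its samples are no closer to the records it was trained on than to unseen holdout records.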

The final evaluations will take place on AWS EC2 instances (c5d.12xlarge for CPU runs and g5.2xlarge for GPU runs), with a strict time cap of six hours per execution.

Final evaluations will take into account five factors:

  • accuracy
  • privacy
  • ease of use
  • compute efficiency
  • generalizability

Experiments

I tried different batch sizes for the MostlyAI models on both datasets. The best-performing configurations for synthetic data generation are listed below, each followed by a sketch of the corresponding training run:

Flat Data Challenge

  • Batch Size: 1024
  • Model: MOSTLY_AI/Large
  • Accuracy: 98.1
  • DCR Share: 51.0
  • NNDR Ratio: 1.024
  • Training Checkpoint Epoch: 100
  • Validation loss: 110.672
  • Samples: 7.89M
  • Model Report
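A minimal sketch of how this flat run might be reproduced with the MostlyAI SDK in LOCAL mode; the config keys (tabular_model_configuration, batch_size, max_epochs) follow my understanding of the SDK's generator config and should be checked against the SDK version you use:

```python
import pandas as pd
from mostlyai.sdk import MostlyAI

mostly = MostlyAI(local=True)  # LOCAL mode, as in the Training Environment below

df = pd.read_csv("flat-training.csv")

generator = mostly.train(config={
    "name": "flat-challenge",
    "tables": [{
        "name": "flat",
        "data": df,
        "tabular_model_configuration": {
            "model": "MOSTLY_AI/Large",
            "batch_size": 1024,
            "max_epochs": 100,
        },
    }],
})

# Sample a full-size submission and write it out as CSV.
synthetic = mostly.generate(generator=generator, size=100_000)
synthetic.data().to_csv("flat-submission.csv", index=False)
```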

Sequential Data Challenge

  • Batch Size: 512
  • Model: MOSTLY_AI/Medium
  • Accuracy: 96
  • DCR Share: 51.8
  • NNDR Ratio: 1.244
  • Training Checkpoint Epoch: 41
  • Validation loss: 14.397
  • Samples: 633.04K
  • Model Report
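Sequential data is modeled as a two-table setup in the SDK: a subject table with one row per group, plus a linked event table referencing it as context. A hedged sketch, assuming the group column is named group_id:

```python
import pandas as pd
from mostlyai.sdk import MostlyAI

mostly = MostlyAI(local=True)

events = pd.read_csv("sequential-training.csv")
subjects = events[["group_id"]].drop_duplicates()  # "group_id" is an assumed column name

generator = mostly.train(config={
    "name": "sequential-challenge",
    "tables": [
        {"name": "subjects", "data": subjects, "primary_key": "group_id"},
        {
            "name": "events",
            "data": events,
            "foreign_keys": [
                {"column": "group_id", "referenced_table": "subjects", "is_context": True},
            ],
            "tabular_model_configuration": {
                "model": "MOSTLY_AI/Medium",
                "batch_size": 512,
            },
        },
    ],
})

# 20,000 generated subjects yield 20,000 groups of linked events.
synthetic = mostly.generate(generator=generator, size=20_000)
```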

Training Environment

  • Datasets: flat-training.csv (100,000 rows, 80 columns), sequential-training.csv (20,000 groups, 5-10 records each, 10 columns)
  • Hardware: NVIDIA A10/A10G GPU, 673-906 GB RAM, 17 CPUs on Modal
  • SDK: MostlyAI v4.7.9 (LOCAL mode)
