THE MOSTLY AI PRIZE

Leaderboard

  • Flat Leaderboard: 17th
  • Sequential Leaderboard: 11th

Timeline: May 14, 2025 - Jul 3, 2025


Competition Overview

The goal of this competition is to generate the best possible synthetic data.

There are two independent challenges. In each, the task is to generate a novel synthetic dataset that matches the structure of a provided dataset and preserves its statistical patterns, while its records are NOT significantly closer to the (released) original samples than to the (unreleased) holdout samples.

NOTE: To succeed, you'll typically train a generative model on the training data, ensuring it generalizes well without overfitting.

THE FLAT DATA Challenge

100,000 records, with 80 data columns: 60 numeric, 20 categorical.

NOTE: To be valid, a submission to the FLAT DATA challenge must be in CSV format, contain 100,000 records, and consist of the same columns.

THE SEQUENTIAL DATA Challenge

20,000 groups of 5-10 records each, with 10 data columns: 7 numeric, 3 categorical.

NOTE: To be valid, a submission to the SEQUENTIAL DATA challenge must be in CSV format, contain 20,000 groups, and consist of the same columns. You can pick any format for the group ID column; it is only used to group the corresponding events together. A quick pandas sanity check for both submission formats is sketched below.
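A minimal sketch of such a pre-submission check, assuming pandas and hypothetical file names; the group ID column is assumed to be named group_id here, though any name and format is allowed:

```python
import pandas as pd

# Flat submission: 100,000 records with the same columns as the training data.
train = pd.read_csv("flat-training.csv")
flat_sub = pd.read_csv("flat-submission.csv")  # hypothetical file name
assert len(flat_sub) == 100_000, "flat submission must contain 100,000 records"
assert list(flat_sub.columns) == list(train.columns), "columns must match the training data"

# Sequential submission: 20,000 groups with the same columns.
seq_sub = pd.read_csv("sequential-submission.csv")  # hypothetical file name
groups = seq_sub.groupby("group_id")  # "group_id" is an assumed column name
assert groups.ngroups == 20_000, "sequential submission must contain 20,000 groups"
```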

Evaluation Criteria

To qualify for the Leaderboard, a submission must meet two privacy thresholds:

  • DCR Share below 52%
  • NNDR Ratio above 0.5

These metrics ensure that the generated samples are sufficiently distinct — just as far from the training data as from the holdout set — while still capturing the core statistical structure.
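The authoritative definitions live in the competition's evaluation code; as rough intuition only, a numpy/scikit-learn sketch of one common formulation (DCR Share as the fraction of synthetic records closer to training than to holdout, NNDR as the 1st-to-2nd nearest-neighbor distance ratio) might look like this:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_distances(queries, reference, k=1):
    """Distances from each query row to its k nearest reference rows.
    Assumes numeric feature matrices; encode categorical columns first."""
    nn = NearestNeighbors(n_neighbors=k).fit(reference)
    dist, _ = nn.kneighbors(queries)
    return dist

def dcr_share(synthetic, train, holdout):
    """Fraction of synthetic records whose closest training record is
    nearer than their closest holdout record (ties counted half)."""
    d_train = nn_distances(synthetic, train)[:, 0]
    d_hold = nn_distances(synthetic, holdout)[:, 0]
    return (d_train < d_hold).mean() + 0.5 * (d_train == d_hold).mean()

def nndr(synthetic, reference):
    """Mean ratio of 1st- to 2nd-nearest-neighbor distance."""
    d = nn_distances(synthetic, reference, k=2)
    return (d[:, 0] / np.maximum(d[:, 1], 1e-12)).mean()
```

Under this reading, an ideal generator lands near a DCR Share of 50%: its samples are no closer to the records it was trained on than to unseen holdout records.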

The final evaluations will take place on AWS EC2 instances (c5d.12xlarge for CPU runs and g5.2xlarge for GPU runs), with a strict time cap of six hours per execution.

Final evaluations will take into account five factors:

  • accuracy
  • privacy
  • ease of use
  • compute efficiency
  • generalizability

Experiments

I tried different batch sizes for the MostlyAI models on both datasets. The best-performing configurations for synthetic data generation are listed below, each followed by a sketch of the corresponding training run:

Flat Data Challenge

  • Batch Size: 1024
  • Model: MOSTLY_AI/Large
  • Accuracy: 98.1
  • DCR Share: 51.0
  • NNDR Ratio: 1.024
  • Training Checkpoint Epoch: 100
  • Validation loss: 110.672
  • Samples: 7.89M
  • Model Report
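A minimal sketch of how this flat run might be reproduced with the MostlyAI SDK in LOCAL mode; the config keys (tabular_model_configuration, batch_size, max_epochs) follow my understanding of the SDK's generator config and should be checked against the SDK version you use:

```python
import pandas as pd
from mostlyai.sdk import MostlyAI

mostly = MostlyAI(local=True)  # LOCAL mode, as in the Training Environment below

df = pd.read_csv("flat-training.csv")

generator = mostly.train(config={
    "name": "flat-challenge",
    "tables": [{
        "name": "flat",
        "data": df,
        "tabular_model_configuration": {
            "model": "MOSTLY_AI/Large",
            "batch_size": 1024,
            "max_epochs": 100,
        },
    }],
})

# Sample a full-size submission and write it out as CSV.
synthetic = mostly.generate(generator=generator, size=100_000)
synthetic.data().to_csv("flat-submission.csv", index=False)
```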

Sequential Data Challenge

  • Batch Size: 512
  • Model: MOSTLY_AI/Medium
  • Accuracy: 96
  • DCR Share: 51.8
  • NNDR Ratio: 1.244
  • Training Checkpoint Epoch: 41
  • Validation loss: 14.397
  • Samples: 633.04K
  • Model Report
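Sequential data is modeled as a two-table setup in the SDK: a subject table with one row per group, plus a linked event table referencing it as context. A hedged sketch, assuming the group column is named group_id:

```python
import pandas as pd
from mostlyai.sdk import MostlyAI

mostly = MostlyAI(local=True)

events = pd.read_csv("sequential-training.csv")
subjects = events[["group_id"]].drop_duplicates()  # "group_id" is an assumed column name

generator = mostly.train(config={
    "name": "sequential-challenge",
    "tables": [
        {"name": "subjects", "data": subjects, "primary_key": "group_id"},
        {
            "name": "events",
            "data": events,
            "foreign_keys": [
                {"column": "group_id", "referenced_table": "subjects", "is_context": True},
            ],
            "tabular_model_configuration": {
                "model": "MOSTLY_AI/Medium",
                "batch_size": 512,
            },
        },
    ],
})

# 20,000 generated subjects yield 20,000 groups of linked events.
synthetic = mostly.generate(generator=generator, size=20_000)
```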

Training Environment

  • Datasets: flat-training.csv (100,000 rows, 80 columns), sequential-training.csv (20,000 groups, 5-10 records each, 10 columns)
  • Hardware: NVIDIA A10/A10G GPU, 673-906 GB RAM, 17 CPUs on Modal
  • SDK: MostlyAI v4.7.9 (LOCAL mode)
