AI Dataset Generator

Generate realistic datasets for demos, learning, and dashboards. Instantly preview data, export as CSV or SQL, and explore with Metabase.

Features:

Conversational prompt builder: choose business type, schema, row count, and more
Real-time data preview in the browser
Export as CSV (single file or multi-table ZIP) or as SQL inserts
One-click Metabase launch for data exploration (see Using Metabase for details)

Usage Flow

Select your business type, schema, and other parameters.
Click "Preview Data" to generate a 10-row sample (incurs a small LLM cost, depending on provider).
Download CSV/SQL with as many rows as you want at no extra cost. Downloads always use the same schema and columns as the preview.

Prerequisites

Docker with Docker Compose (bundled with Docker Desktop)
Node.js and npm (to run the Next.js app)
At least one API key for a supported LLM provider (OpenAI, Anthropic, or Google GenAI)

Getting Started

Clone the repo:

git clone <your-repo-url>
cd dataset-generator

Create your .env file:

Copy the example file and fill in your LLM provider API keys (OpenAI, Anthropic, Google, etc.):

cp .env.example .env

Then edit .env and add your API keys as needed:

# For local development, you can use any value for these keys:
LITELLM_MASTER_KEY=sk-1234
LITELLM_SALT_KEY=sk-1234

# Add at least one provider key below:
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=...
GOOGLE_GENAI_API_KEY=...

# Set LLM_MODEL to match your provider:
LLM_MODEL=gpt-4o

# Example values:
# For OpenAI:      LLM_MODEL=gpt-4o
# For Anthropic:   LLM_MODEL=claude-4-sonnet
# For Google:      LLM_MODEL=gemini-2.5-flash

Start the Next.js app:
```
npm install
npm run dev
```
- The app runs at http://localhost:3000

Start LiteLLM (required for LLM features):

This app uses LiteLLM as a gateway for all LLM requests (OpenAI, Anthropic, Google, etc.).

You must start LiteLLM for dataset generation and preview features to work.

From your project root, run:
```
docker compose up litellm db_litellm
```
- This starts the LiteLLM gateway and its dedicated Postgres database.
- LiteLLM will listen on http://localhost:4000 by default.

Generate a dataset:
- Use the prompt builder to define your dataset.
- Click "Preview Data" to see a sample.

Export or Explore:
- Download your dataset as CSV or SQL Inserts.
- Click "Start Metabase" to spin up Metabase in Docker.
- Once Metabase is ready, click "Open Metabase" to explore your data.
  - In Metabase, use the "Upload Data" feature to analyze your CSV files
  - Or connect to your own database where you've loaded the data
- When done, click "Stop Metabase" to shut down and clean up Docker containers.
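
Before starting the services, it can help to sanity-check the values in .env. The following is a minimal sketch, not part of the app: `checkEnv` and `isPlaceholder` are hypothetical helpers, and treating values ending in "..." as unset is an assumption based on the placeholders shown in the example above.

```typescript
// Hypothetical helper (not part of the repo): validates a parsed .env record.
type Env = Record<string, string | undefined>;

const PROVIDER_KEYS = [
  "OPENAI_API_KEY",
  "ANTHROPIC_API_KEY",
  "GOOGLE_GENAI_API_KEY",
];

// Assumption: empty values and "..." placeholders count as unset.
function isPlaceholder(value: string | undefined): boolean {
  return value === undefined || value.trim() === "" || value.trim().endsWith("...");
}

function checkEnv(env: Env): string[] {
  const problems: string[] = [];
  // These three variables appear in the example .env above.
  for (const key of ["LITELLM_MASTER_KEY", "LITELLM_SALT_KEY", "LLM_MODEL"]) {
    if (isPlaceholder(env[key])) problems.push(`${key} is not set`);
  }
  // At least one provider key must be filled in.
  if (PROVIDER_KEYS.every((key) => isPlaceholder(env[key]))) {
    problems.push("no provider API key is set");
  }
  return problems;
}
```

An empty array means the file looks usable; otherwise each entry names one thing to fix.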

How It Works

The dataset generator uses a two-stage process to create realistic business data. First, it leverages large language models to generate detailed data specifications based on your business type and parameters. Then, it uses these specifications to create unlimited amounts of realistic data locally.
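
To make the first stage more concrete, here is a rough sketch of what a spec request to the LiteLLM gateway could look like. It relies on LiteLLM's OpenAI-compatible chat completions endpoint. The function name `buildSpecRequest`, the prompt wording, and the use of the master key as the bearer token are assumptions for illustration, not the app's actual code.

```typescript
// Hypothetical request builder; the real prompt and auth setup may differ.
interface SpecRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildSpecRequest(
  businessType: string,
  model: string, // assumed to match a model configured in litellm-config.yaml
  apiKey: string, // assumption: LITELLM_MASTER_KEY is used for local development
  baseUrl = "http://localhost:4000",
): SpecRequest {
  return {
    url: `${baseUrl}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        messages: [
          // Illustrative prompt only.
          { role: "system", content: "Return a JSON data spec: schema, business rules, event logic." },
          { role: "user", content: `Business type: ${businessType}` },
        ],
      }),
    },
  };
}
```

The result can be passed to `fetch(url, init)`. Keeping the builder separate from the network call makes it easy to inspect or test the payload without spending tokens.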

When you preview a dataset, the app uses LiteLLM (which can route to OpenAI, Anthropic, Google, etc.) to generate a detailed data spec (schema, business rules, event logic) for your chosen business type and parameters.
All actual data rows are generated locally using Faker, based on the LLM-generated spec.
Downloading or exporting data never calls an LLM again—it's instant and free.
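
The second stage can be pictured as a pure function from a spec to rows. The sketch below is illustrative only: the `DataSpec` shape is hypothetical, and a small seeded random number generator stands in for Faker so the example has no dependencies.

```typescript
// Hypothetical spec shape; the app's real spec is richer (business rules, event logic).
type ColumnKind = "id" | "name" | "amount";
interface ColumnSpec { name: string; kind: ColumnKind }
interface DataSpec { table: string; columns: ColumnSpec[] }
type GeneratedRow = Record<string, string | number>;

// Small seeded PRNG (mulberry32), used here as a stand-in for Faker.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const SAMPLE_NAMES = ["Ada", "Grace", "Linus", "Margaret"];

// Generates rows locally from the spec; no LLM call is involved.
function generateRows(spec: DataSpec, rowCount: number, seed = 1): GeneratedRow[] {
  const rand = mulberry32(seed);
  const rows: GeneratedRow[] = [];
  for (let i = 0; i < rowCount; i++) {
    const row: GeneratedRow = {};
    for (const col of spec.columns) {
      if (col.kind === "id") row[col.name] = i + 1;
      else if (col.kind === "name") row[col.name] = SAMPLE_NAMES[Math.floor(rand() * SAMPLE_NAMES.length)];
      else row[col.name] = Math.round(rand() * 10000) / 100;
    }
    rows.push(row);
  }
  return rows;
}
```

Because the spec is fixed after the preview, calling `generateRows(spec, 10)` and `generateRows(spec, 100000)` yields the same columns; only the row count changes.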

Cost & Data Generation Summary

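| Action       | Calls LLM? | Cost?  | Uses LLM? | Uses Faker? | Row Count |
|--------------|------------|--------|-----------|-------------|-----------|
| Preview      | Yes        | ~$0.05 | Yes       | Yes         | 10        |
| Download CSV | No         | $0     | No        | Yes         | 100+      |
| Download SQL | No         | $0     | No        | Yes         | 100+      |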

The above costs and behavior are based on testing with the OpenAI GPT-4o model. Costs and token usage may vary with other providers/models.

You only pay for preview/spec generation (e.g., ~$0.05 per preview with OpenAI GPT-4o).
All downloads use the same columns/spec with more rows, and are free.

Caching: After your first preview, the app remembers your data structure. If you preview the same business type and settings again, it reuses that structure at no cost instead of generating a new one, which saves money and time. You'll see "Using cached spec" in the terminal when this happens. To check cache stats, run curl http://localhost:3000/api/cache/stats. To clear the cache, run curl -X DELETE http://localhost:3000/api/cache/clear.
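
One way to picture this caching behavior is a map keyed by the preview settings. This is a minimal sketch under assumptions: `SpecCache` and `cacheKey` are hypothetical names, and the app's real cache may key, store, or expire entries differently.

```typescript
// Hypothetical in-memory spec cache keyed by business type and settings.
type SpecParams = Record<string, string | number | boolean>;

// Sort keys so the same settings always map to the same cache key.
function cacheKey(params: SpecParams): string {
  return JSON.stringify(Object.keys(params).sort().map((k) => [k, params[k]]));
}

class SpecCache<Spec> {
  private store = new Map<string, Spec>();
  private hits = 0;
  private misses = 0;

  // Returns the cached spec, or calls `create` once and remembers the result.
  getOrCreate(params: SpecParams, create: () => Spec): Spec {
    const key = cacheKey(params);
    const cached = this.store.get(key);
    if (cached !== undefined) {
      this.hits++;
      return cached;
    }
    this.misses++;
    const spec = create();
    this.store.set(key, spec);
    return spec;
  }

  stats(): { size: number; hits: number; misses: number } {
    return { size: this.store.size, hits: this.hits, misses: this.misses };
  }

  clear(): void {
    this.store.clear();
  }
}
```

If `Spec` is a promise (for example, a pending LLM call), the same structure also deduplicates concurrent previews with identical settings, since the second caller receives the in-flight promise.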

Project Structure

/app/page.tsx – Main UI and prompt builder
/app/api/generate/route.ts – Synthetic data generator (via LiteLLM: OpenAI, Anthropic, Google, etc.)
/app/api/metabase/start|stop|status/route.ts – Docker orchestration for Metabase
/lib/export/ – CSV/SQL export logic
/docker-compose.yml – Used for Metabase and LiteLLM services
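
For orientation, the kind of conversion the export logic performs can be sketched as follows. This is not the code in /lib/export/; `toCsv` and `toSqlInserts` are hypothetical helpers, and the quoting choices (RFC 4180 style CSV quoting, double-quoted SQL identifiers) are assumptions.

```typescript
// Hypothetical export helpers, shown only to illustrate CSV and SQL escaping.
type Cell = string | number | boolean | null;
type ExportRow = Record<string, Cell>;

// Quote a CSV field only when it contains a comma, quote, or line break.
function csvCell(value: Cell): string {
  if (value === null) return "";
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function toCsv(rows: ExportRow[]): string {
  const first = rows[0];
  if (first === undefined) return "";
  const columns = Object.keys(first);
  const lines = [columns.map(csvCell).join(",")];
  for (const row of rows) {
    lines.push(columns.map((c) => csvCell(row[c] ?? null)).join(","));
  }
  return lines.join("\n");
}

// Render a value as a SQL literal, doubling single quotes in strings.
function sqlLiteral(value: Cell): string {
  if (value === null) return "NULL";
  if (typeof value === "number") return Number.isFinite(value) ? String(value) : "NULL";
  if (typeof value === "boolean") return value ? "TRUE" : "FALSE";
  return `'${value.replace(/'/g, "''")}'`;
}

function toSqlInserts(table: string, rows: ExportRow[]): string {
  return rows
    .map((row) => {
      const columns = Object.keys(row);
      const names = columns.map((c) => `"${c.replace(/"/g, '""')}"`).join(", ");
      const values = columns.map((c) => sqlLiteral(row[c] ?? null)).join(", ");
      return `INSERT INTO "${table}" (${names}) VALUES (${values});`;
    })
    .join("\n");
}
```

Escaping matters here because generated names and addresses often contain commas or apostrophes, which would otherwise break the CSV layout or the SQL statement.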

Stack

Next.js (App Router, TypeScript)
Tailwind CSS + ShadCN UI (modern, dark-themed UI)
LiteLLM (multi-provider LLM gateway: OpenAI, Anthropic, Google, etc.)
Metabase (Dockerized, launched on demand)

Extending/Contributing

To add new business types, edit lib/spec-prompts.ts and add entries to the businessTypeInstructions object.
