llmfsd: LLM Fake Structured Data

llmfsd is a Python package designed to generate fake structured data using any Large Language Model (LLM). With this package, you can execute SQL-like queries to simulate structured data in formats such as JSON or CSV. The tool is highly customizable and supports integration with multiple AI providers (thanks to aisuite).

Features

  • Generate fake structured data via SQL queries.
  • Supports JSON and CSV output formats.
  • Language selection for descriptive attributes.
  • Define custom data models to control schema and descriptions.
  • Integrates with various AI providers (e.g., OpenAI, Mistral, Google, Anthropic).

Installation

Install llmfsd using pip:

pip install llmfsd

Install a Provider’s Package Along with aisuite

llmfsd supports all AI providers supported by aisuite. If you have not already installed the provider’s package, you can do so along with llmfsd. For example:

pip install "llmfsd[mistral]"

Alternatively, you can install the provider’s package directly with aisuite:

pip install "aisuite[mistral]"

For more details, visit the aisuite repository.

Usage

Basic Example

Here’s a simple example to get started:

from llmfsd import Faker

# Initialize Faker with your LLM model ID (AISuite ID format)
faker = Faker(model_id="mistral:mistral-large-latest")

# Or specify a language for descriptive attributes. Defaults to English.
faker = Faker(model_id="mistral:mistral-large-latest", lang="french")

# Generate JSON data
print(faker.json("SELECT uuid, name FROM phone_brands LIMIT 4"))

"""
Output:
[
 {'uuid': 'f47ac10b-58cc-4372-a567-0e02b2c3d479', 'name': 'Nokia'},
 {'uuid': 'f7bac13b-58cc-4372-a567-0e02b2c3d479', 'name': 'Samsung'}, 
 {'uuid': 'f98ac12b-58cc-4372-a567-0e02b2c3d479', 'name': 'Apple'},
 {'uuid': 'f47ac10b-58cc-4972-a567-0e02b2c3d479', 'name': 'Sony'}
]
"""

# Generate CSV data
print(faker.csv("SELECT id, color FROM colors LIMIT 2"))

"""
Output:
id,color
1,red
2,blue
"""

More Advanced Example with Data Models

You can define custom data models to control the structure of your fake data.

from llmfsd import Faker, DataModel

# Define data models

model = DataModel("dogs", 
    {"id": "Number in range(5,20)", "name": None, "breed": "Breed of the dog"}
)

# Initialize Faker with data models
faker = Faker(model_id="mistral:mistral-large-latest", data_models=[model])

# Generate JSON data for a specific model
print(faker.json("SELECT * FROM dogs LIMIT 3"))

"""
Output:
[
  {
    "id": 7,
    "name": "Buddy",
    "breed": "Labrador"
  },
  {
    "id": 12,
    "name": "Charlie",
    "breed": "Golden Retriever"
  },
  {
    "id": 15,
    "name": "Max",
    "breed": "German Shepherd"
  }
]
"""

AI Providers

To use a different provider, set the model_id parameter during Faker initialization using the aisuite model ID format (provider:model).

Examples

faker1 = Faker(model_id="groq:llama-3.2-3b-preview")

faker2 = Faker(model_id="openai:gpt-3.5-turbo")

faker3 = Faker(model_id="huggingface:mistralai/Mistral-7B-Instruct-v0.3")

Each provider requires its own API key. Use environment variables or configuration files to store your API keys securely. For example, to use Mistral you need to set MISTRAL_API_KEY:

export MISTRAL_API_KEY="your-mistral-api-key"
export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
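You can also set the key from Python before creating the Faker. This is a minimal sketch; it assumes aisuite reads the provider key from the environment when the client is created, and the key value is a placeholder:

import os

from llmfsd import Faker

# For quick local tests only; prefer a real secrets manager or shell configuration in practice.
os.environ["MISTRAL_API_KEY"] = "your-mistral-api-key"

faker = Faker(model_id="mistral:mistral-large-latest")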

Methods

json(query: str, output: Optional[str] = None) -> list[dict] | None

Generate fake structured data in JSON format.

  • query: The SQL query to execute.
  • output: File path to save the JSON output. If None, returns the data directly.

csv(query: str, output: Optional[str] = None) -> str | None

Generate fake structured data in CSV format.

  • query: The SQL query to execute.
  • output: File path to save the CSV output. If None, returns the data directly.
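A short sketch based on the signatures above (the table name, query, and file names are illustrative):

from llmfsd import Faker

faker = Faker(model_id="openai:gpt-3.5-turbo")

# With output=None (the default), the generated data is returned directly.
rows = faker.json("SELECT id, title FROM books LIMIT 3")   # list[dict]
text = faker.csv("SELECT id, title FROM books LIMIT 3")    # str

# With an output path, the result is written to that file instead.
faker.json("SELECT id, title FROM books LIMIT 3", output="books.json")
faker.csv("SELECT id, title FROM books LIMIT 3", output="books.csv")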

Custom Data Models

You can create custom schemas using DataModel, defining either a list of attributes or a dictionary with descriptions.

DataModel allows you to use * as a wildcard in queries or provide minimal descriptions for your attributes to the LLM.

Avoid providing unnecessary descriptions, as they can increase token consumption. It is recommended to use a list of attributes if the attributes are self-explanatory for the LLM. When using a dictionary-based schema, you can leave None for some attributes and provide descriptions only for those you wish to clarify.

Example:

from llmfsd import DataModel

# Schema as a list

model1 = DataModel("cars", ["brand", "model", "year"])

# Schema as a dictionary

model2 = DataModel("pets", {
    "id" : "uuid string",
    "name": None,
    "age":  None,
    "species": "Type of pet (e.g., dog, cat)"
})

Pass these models to Faker during initialization:

faker = Faker(model_id="openai:gpt-4o", data_models=[model1, model2])

Saving Output to a File

Both json and csv methods support saving results directly to a file.

# Save JSON data to a file

faker.json("SELECT * FROM artists LIMIT 20", output="artists.json")

# Save CSV data to a file

faker.csv("SELECT name, age FROM pets LIMIT 20", output="pets.csv")

GitHub

https://github.com/dinyad-prog00/llmfsd
