llmfsd: LLM Fake Structured Data

llmfsd is a Python package designed to generate fake structured data using any Large Language Model (LLM). With this package, you can execute SQL-like queries to simulate structured data in formats such as JSON or CSV. The tool is highly customizable and supports integration with multiple AI providers (thanks to aisuite).

Features

  • Generate fake structured data via SQL queries.
  • Supports JSON and CSV output formats.
  • Language selection for descriptive attributes.
  • Define custom data models to control schema and descriptions.
  • Integrates with various AI providers (e.g., OpenAI, Mistral, Google, Anthropic).

Installation

Install llmfsd using pip:

pip install llmfsd

Install a Provider’s Package Along with aisuite

llmfsd supports all AI providers supported by aisuite. If you have not already installed the provider’s package, you can do so along with llmfsd. For example:

pip install "llmfsd[mistral]"

Alternatively, you can install the provider’s package directly with aisuite:

pip install "aisuite[mistral]"

For more details, visit the aisuite repository.

Usage

Basic Example

Here’s a simple example to get started:

from llmfsd import Faker

# Initialize Faker with your LLM model ID (AISuite ID format)
faker = Faker(model_id="mistral:mistral-large-latest")

# Or specify a language for descriptive attributes. Defaults to English.
faker = Faker(model_id="mistral:mistral-large-latest", lang="french")

# Generate JSON data
print(faker.json("SELECT uuid, name FROM phone_brands LIMIT 4"))

"""
Output:
[
 {'uuid': 'f47ac10b-58cc-4372-a567-0e02b2c3d479', 'name': 'Nokia'},
 {'uuid': 'f7bac13b-58cc-4372-a567-0e02b2c3d479', 'name': 'Samsung'}, 
 {'uuid': 'f98ac12b-58cc-4372-a567-0e02b2c3d479', 'name': 'Apple'},
 {'uuid': 'f47ac10b-58cc-4972-a567-0e02b2c3d479', 'name': 'Sony'}
]
"""

# Generate CSV data
print(faker.csv("SELECT id, color FROM colors LIMIT 2"))

"""
Output:
id,color
1,red
2,blue
"""

More Advanced Example with Data Models

You can define custom data models to control the structure of your fake data.

from llmfsd import Faker, DataModel

# Define data models

model = DataModel("dogs", 
    {"id": "Number in range(5,20)", "name": None, "breed": "Breed of the dog"}
)

# Initialize Faker with data models
faker = Faker(model_id="mistral:mistral-large-latest", data_models=[model])

# Generate JSON data for a specific model
print(faker.json("SELECT * FROM dogs LIMIT 3"))

"""
Output:
[
  {
    "id": 7,
    "name": "Buddy",
    "breed": "Labrador"
  },
  {
    "id": 12,
    "name": "Charlie",
    "breed": "Golden Retriever"
  },
  {
    "id": 15,
    "name": "Max",
    "breed": "German Shepherd"
  }
]
"""

AI Providers

To use a different provider, set the model_id parameter during Faker initialization using the aisuite model ID format (provider:model).

Examples

faker1 = Faker(model_id="groq:llama-3.2-3b-preview")

faker2 = Faker(model_id="openai:gpt-3.5-turbo")

faker3 = Faker(model_id="huggingface:mistralai/Mistral-7B-Instruct-v0.3")

Each provider requires its own API key. Use environment variables or configuration files to store your API keys securely. For example, to use Mistral you need to set MISTRAL_API_KEY:

export MISTRAL_API_KEY="your-mistral-api-key"
export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
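You can also set the key from Python before creating the Faker. This is a minimal sketch; it assumes aisuite reads the provider key from the environment when the client is created, and the key value is a placeholder:

import os

from llmfsd import Faker

# For quick local tests only; prefer a real secrets manager or shell configuration in practice.
os.environ["MISTRAL_API_KEY"] = "your-mistral-api-key"

faker = Faker(model_id="mistral:mistral-large-latest")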

Methods

json(query: str, output: Optional[str] = None) -> list[dict] | None

Generate fake structured data in JSON format.

  • query: The SQL query to execute.
  • output: File path to save the JSON output. If None, returns the data directly.

csv(query: str, output: Optional[str] = None) -> str | None

Generate fake structured data in CSV format.

  • query: The SQL query to execute.
  • output: File path to save the CSV output. If None, returns the data directly.
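A short sketch based on the signatures above (the table name, query, and file names are illustrative):

from llmfsd import Faker

faker = Faker(model_id="openai:gpt-3.5-turbo")

# With output=None (the default), the generated data is returned directly.
rows = faker.json("SELECT id, title FROM books LIMIT 3")   # list[dict]
text = faker.csv("SELECT id, title FROM books LIMIT 3")    # str

# With an output path, the result is written to that file instead.
faker.json("SELECT id, title FROM books LIMIT 3", output="books.json")
faker.csv("SELECT id, title FROM books LIMIT 3", output="books.csv")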

Custom Data Models

You can create custom schemas using DataModel, defining either a list of attributes or a dictionary with descriptions.

DataModel allows you to use * as a wildcard in queries or provide minimal descriptions for your attributes to the LLM.

Avoid providing unnecessary descriptions, as they can increase token consumption. It is recommended to use a list of attributes if the attributes are self-explanatory for the LLM. When using a dictionary-based schema, you can leave None for some attributes and provide descriptions only for those you wish to clarify.

Example:

from llmfsd import DataModel

# Schema as a list

model1 = DataModel("cars", ["brand", "model", "year"])

# Schema as a dictionary

model2 = DataModel("pets", {
    "id" : "uuid string",
    "name": None,
    "age":  None,
    "species": "Type of pet (e.g., dog, cat)"
})

Pass these models to Faker during initialization:

faker = Faker(model_id="openai:gpt-4o", data_models=[model1, model2])

Saving Output to a File

Both json and csv methods support saving results directly to a file.

# Save JSON data to a file

faker.json("SELECT * FROM artists LIMIT 20", output="artists.json")

# Save CSV data to a file

faker.csv("SELECT name, age FROM pets LIMIT 20", output="pets.csv")

GitHub

https://github.com/dinyad-prog00/llmfsd
