Create high-fidelity privacy-safe synthetic data:
- train a generative model once:
  - train on flat or sequential data
  - control training time & params
  - monitor training progress
  - optionally enable differential privacy
  - optionally provide context data
- generate synthetic data samples to your needs:
  - up-sample / down-sample
  - conditionally generate
  - rebalance categories
  - impute missing values
  - incorporate fairness
  - adjust sampling temperature
  - predict / classify / regress
  - detect outliers / anomalies
  - and more
...all within your own compute environment, all with a few lines of Python code 🔥.
Note: Models only need to be trained once and can then be flexibly reused for various downstream tasks (such as regression, classification, imputation, or sampling) without the need for retraining.
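The "adjust sampling temperature" option listed above can be understood through the standard temperature-scaling technique for categorical distributions. The sketch below is a generic illustration in plain Python, not the engine's internal implementation; the function name `apply_temperature` and the example probabilities are hypothetical.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a categorical distribution by a sampling temperature.

    Generic technique: divide log-probabilities by the temperature and
    renormalize. Hypothetical helper, not part of mostlyai-engine.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not any(p > 0 for p in probs):
        raise ValueError("at least one probability must be positive")
    # work in log space; zero-probability categories stay at zero
    logits = [math.log(p) / temperature if p > 0 else float("-inf") for p in probs]
    max_logit = max(logits)
    weights = [math.exp(logit - max_logit) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


probs = [0.7, 0.2, 0.1]  # illustrative category probabilities
sharper = apply_temperature(probs, 0.5)    # below 1: mass concentrates on likely values
flatter = apply_temperature(probs, 2.0)    # above 1: mass spreads towards rare values
unchanged = apply_temperature(probs, 1.0)  # exactly 1: distribution is left as is
```

Under this reading, a temperature below 1 yields more conservative samples, while a temperature above 1 yields more diverse ones.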
Two model classes with these methods are available:
- TabularARGN(): For structured, flat or sequential tabular data.
  - argn.fit(data): Train a TabularARGN model
  - argn.sample(n_samples): Generate samples
  - argn.predict(target, n_draws, agg_fn): Predict a feature
  - argn.predict_proba(target, n_draws): Estimate probabilities
  - argn.impute(data): Fill missing values
- LanguageModel(): For semi-structured, flat textual tabular data.
  - lm.fit(data): Train a language model
  - lm.sample(n_samples): Generate samples
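The `n_draws` and `agg_fn` arguments suggest that a prediction is formed by drawing several candidate values per row and aggregating them. The sketch below illustrates that idea in plain Python under this assumption; `aggregate_draws` and `draw_frequencies` are hypothetical helpers, the sample values are made up, and nothing here describes how the engine implements `predict` or `predict_proba` internally.

```python
from collections import Counter
from statistics import mean


def aggregate_draws(draws, agg_fn="mode"):
    """Collapse several sampled values for one row into a single prediction."""
    if agg_fn == "mode":
        # most frequent value; ties resolve to the value seen first
        return Counter(draws).most_common(1)[0][0]
    if agg_fn == "mean":
        return mean(draws)
    raise ValueError(f"unsupported agg_fn: {agg_fn!r}")


def draw_frequencies(draws):
    """Estimate class probabilities as relative frequencies of the draws."""
    counts = Counter(draws)
    return {label: count / len(draws) for label, count in counts.items()}


# illustrative draws for a single row (values are made up)
income_draws = [">50K", "<=50K", ">50K", ">50K", "<=50K"]
age_draws = [31, 35, 29, 33, 32]

label = aggregate_draws(income_draws, agg_fn="mode")  # categorical target
value = aggregate_draws(age_draws, agg_fn="mean")     # numerical target
proba = draw_frequencies(income_draws)                # per-class frequencies
```

More draws reduce the sampling noise of the aggregate, at the cost of proportionally more generation time.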
This library serves as the core model engine for the Synthetic Data SDK. For an easy-to-use, higher-level toolkit, please refer to the SDK.
It is highly recommended to install the package within a dedicated virtual environment using uv.
The latest release of mostlyai-engine can be installed via uv:
uv pip install -U mostlyai-engine
Or, alternatively, for a GPU setup (needed for LLM fine-tuning and inference):
uv pip install -U 'mostlyai-engine[gpu]'
On Linux, one can explicitly install the CPU-only variant of torch together with mostlyai-engine:
uv pip install -U torch==2.8.0+cpu torchvision==0.23.0+cpu mostlyai-engine --extra-index-url https://download.pytorch.org/whl/cpu
The TabularARGN class provides a scikit-learn-compatible interface for working with structured tabular data. It can be used for synthetic data generation, classification, regression, and imputation.
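Before running the examples that follow, it can help to confirm that the package from the installation step is visible to the current interpreter. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution name `mostlyai-engine` is taken from the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="mostlyai-engine"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


engine_version = installed_version()
if engine_version is None:
    print("mostlyai-engine is not installed in this environment")
else:
    print(f"mostlyai-engine {engine_version} is installed")
```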
Load your data and train the model:
import pandas as pd
from sklearn.model_selection import train_test_split
from mostlyai.engine import TabularARGN
# prepare data
data = pd.read_csv("https://github.com/user-attachments/files/23480587/census10k.csv.gz")
data_train, data_test = train_test_split(data, test_size=0.2)
# fit TabularARGN
argn = TabularARGN()
argn.fit(data_train)
Generate new synthetic samples:
# unconditional sampling
argn.sample(n_samples=1000)
Generate new synthetic samples conditionally:
# prepare seed
seed_data = pd.DataFrame({
    "age": [25, 50],
    "education": ["Bachelors", "HS-grad"],
})
# conditional sampling
argn.sample(seed_data=seed_data)
Fill in missing values:
# prepare demo data with missings
data_with_missings = data_test.head(300).reset_index(drop=True)
data_with_missings.loc[0:299, "age"] = pd.NA
data_with_missings.loc[0:199, "race"] = pd.NA
data_with_missings.loc[100:299, "income"] = pd.NA
# impute missing values
argn.impute(data_with_missings)
Predict any categorical target column:
from sklearn.metrics import accuracy_score, roc_auc_score
# predict class labels for a categorical
predictions = argn.predict(data_test, target="income", n_draws=10, agg_fn="mode")
# predict class probabilities for a categorical
probabilities = argn.predict_proba(data_test, target="income", n_draws=10)
# evaluate performance
accuracy = accuracy_score(data_test["income"], predictions)
auc = roc_auc_score(data_test["income"], probabilities[:, 1])
print(f"Accuracy: {accuracy:.3f}, AUC: {auc:.3f}")
Predict any numerical target column:
from sklearn.metrics import mean_absolute_error
# predict target values
predictions = argn.predict(data_test, target="age", n_draws=10, agg_fn="mean")
# evaluate performance
mae = mean_absolute_error(data_test["age"], predictions)
print(f"MAE: {mae:.1f} years")
For sequential data (e.g., time series or event logs), specify the context key:
import pandas as pd
from mostlyai.engine import TabularARGN
# load sequential data
tgt_data = pd.read_csv("https://github.com/user-attachments/files/23480787/batting.csv.gz")
ctx_data = pd.read_csv("https://github.com/user-attachments/files/23480786/players.csv.gz")
# fit TabularARGN with a context key column
argn = TabularARGN(
    tgt_context_key="players_id",
    ctx_primary_key="id",
    ctx_data=ctx_data,
    max_training_time=2,  # 2 minutes
    verbose=0,
)
argn.fit(tgt_data)
Generate new synthetic samples (using existing context):
argn.sample(n_samples=5)
Generate new synthetic samples conditionally (using custom context and seed):
ctx_data = pd.DataFrame({
    "id": ["Player1", "Player2"],
    "weight": [170, 160],
    "height": [70, 68],
    "bats": ["R", "L"],
    "throws": ["R", "L"],
})
argn.sample(ctx_data=ctx_data)
The LanguageModel class provides a scikit-learn-compatible interface for working with semi-structured textual data. It leverages pre-trained language models or trains lightweight LSTM models from scratch to generate synthetic text data.
Note: The default model is MOSTLY_AI/LSTMFromScratch-3m, a lightweight LSTM model trained from scratch (GPU strongly recommended). You can also use pre-trained HuggingFace models by setting model to e.g. microsoft/phi-1.5 (GPU required).
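The hardware guidance in the note can be turned into a small selection step before constructing the model. The sketch below is one possible approach, not part of the library: `gpu_available` and `pick_model` are hypothetical helpers, the check relies on `torch.cuda.is_available()` and therefore only detects CUDA devices, and the two model names are the ones mentioned in the note.

```python
def gpu_available():
    """Report whether a CUDA device is visible; False if torch is missing."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def pick_model(has_gpu, prefer_pretrained=True):
    """Choose a model name following the note above (hypothetical helper)."""
    if has_gpu and prefer_pretrained:
        return "microsoft/phi-1.5"  # pre-trained model, GPU required
    return "MOSTLY_AI/LSTMFromScratch-3m"  # default lightweight LSTM


model_name = pick_model(gpu_available())
print(f"selected model: {model_name}")
```

The resulting `model_name` could then be passed as the `model` argument when constructing `LanguageModel`.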
Load your data and train the model:
import pandas as pd
from mostlyai.engine import LanguageModel
# load data
data = pd.read_csv("https://github.com/user-attachments/files/23486562/airbnb20k.csv.gz")
# fit LanguageModel
lm = LanguageModel(
    model="MOSTLY_AI/LSTMFromScratch-3m",
    tgt_encoding_types={
        "neighbourhood": "LANGUAGE_CATEGORICAL",
        "title": "LANGUAGE_TEXT",
    },
    max_training_time=10,  # 10 minutes
    verbose=1,
)
lm.fit(data)
Generate new synthetic samples using the trained language model:
# unconditional sampling
lm.sample(
    n_samples=100,
    sampling_temperature=0.8,
)
# prepare seed
seed_data = pd.DataFrame({
    "neighbourhood": ["Westminster", "Hackney"],
})
# conditional sampling with seed values
lm.sample(
    seed_data=seed_data,
    sampling_temperature=0.8,
)
Example notebooks demonstrating various use cases are available in the examples directory: