Python implementation of the R package [synthpop](https://cran.r-project.org/web/packages/synthpop/index.html).
5
+
```python-synthpop``` is an open-source library for synthetic data generation (SDG). The library includes robust implementations of Classification and Regression Trees (CART) and Gaussian Copula (GC) synthesizers, equipping users with an open-source Python library to generate high-quality, privacy-preserving synthetic data. This library is a Python implementation of the CART method used in the R package [synthpop](https://cran.r-project.org/web/packages/synthpop/index.html).
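To make the Gaussian Copula (GC) idea concrete, here is a minimal sketch using NumPy and the standard library. It is an illustration under assumptions, not the `python-synthpop` implementation: the function names `fit_gaussian_copula` and `sample_gaussian_copula` are hypothetical, the toy data is simulated, and only numeric columns are handled.

```python
import numpy as np
from statistics import NormalDist


def fit_gaussian_copula(data):
    """Estimate the copula correlation matrix from a 2-D array of numeric columns."""
    n, d = data.shape
    std_normal = NormalDist()
    z = np.empty((n, d), dtype=float)
    for j in range(d):
        # Rank-transform each column to (0, 1), then map to standard-normal scores.
        ranks = data[:, j].argsort().argsort() + 1  # ranks 1..n, ties broken by order
        u = ranks / (n + 1)
        z[:, j] = [std_normal.inv_cdf(p) for p in u]
    return np.corrcoef(z, rowvar=False)


def sample_gaussian_copula(data, corr, n_samples, rng):
    """Draw correlated normal scores and map them back through empirical quantiles."""
    std_normal = NormalDist()
    d = data.shape[1]
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u = np.vectorize(std_normal.cdf)(z)
    synth = np.empty_like(z)
    for j in range(d):
        # The empirical quantile function keeps each marginal within the observed range.
        synth[:, j] = np.quantile(data[:, j], u[:, j])
    return synth


# Simulated toy data (an assumption for illustration): two correlated columns.
rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=500)
y = 2 * x + rng.normal(0, 5, size=500)
original = np.column_stack([x, y])

corr = fit_gaussian_copula(original)
synthetic = sample_gaussian_copula(original, corr, n_samples=500, rng=rng)
```

The sketch separates the dependence structure (the correlation matrix of the normal scores) from the marginals (the empirical quantiles), which is the core idea behind a copula-based synthesizer.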
6
6
7
-
With this library synthetic tabular data can be produced. Synthetic data refers to artificially generated data that mimics real-world data in structure and statistical properties but does not directly originate from actual events or individuals. It supports processing numerical and categorical data using sequential modelling techniques. Artificial data are generated by drawing from conditional distributions fitted to the original data using parametric (e.g., Gaussian copula) or classification and regression trees (CART) models.
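The sequential modelling idea described above can be sketched as follows. This is a minimal illustration under assumptions, not the library's code: it uses scikit-learn's `DecisionTreeRegressor`, the function name `synthesize_sequential_cart` and the `min_samples_leaf` default are hypothetical, the data is simulated, and all columns are assumed to be numeric already.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def synthesize_sequential_cart(df, n_samples, seed=0, min_samples_leaf=5):
    """Synthesize columns one at a time, each conditioned on the columns before it."""
    rng = np.random.default_rng(seed)
    columns = list(df.columns)
    synth = pd.DataFrame(index=range(n_samples))

    # The first column has no predictors, so it is resampled from the observed values.
    first = columns[0]
    synth[first] = rng.choice(df[first].to_numpy(), size=n_samples, replace=True)

    for i, col in enumerate(columns[1:], start=1):
        predictors = columns[:i]
        tree = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf, random_state=seed)
        tree.fit(df[predictors], df[col])

        # Find the leaf of every original row and every synthetic row.
        real_leaves = tree.apply(df[predictors])
        synth_leaves = tree.apply(synth[predictors])

        # Each synthetic value is drawn from the observed values in the same leaf.
        observed = df[col].to_numpy()
        donors = {leaf: observed[real_leaves == leaf] for leaf in np.unique(real_leaves)}
        synth[col] = [rng.choice(donors[leaf]) for leaf in synth_leaves]

    return synth


# Simulated toy data (an assumption for illustration).
rng = np.random.default_rng(1)
age = rng.integers(18, 80, size=300)
income = 500 + 20 * age + rng.normal(0, 100, size=300)
original = pd.DataFrame({"age": age, "income": income})

synthetic = synthesize_sequential_cart(original, n_samples=300)
```

Drawing from leaf donors rather than predicting the leaf mean keeps the within-leaf spread of the original data, at the cost of reusing observed values; a larger `min_samples_leaf` gives each leaf more donors to draw from.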
7
+
Synthetic data is generated in six steps:
8
8
9
-
This Python library is a reimplementation of the R package `synthpop`. Synthetic data can be generated using the `.generate()` method after fitting a synthesizer to the original data with the `.fit()` method. The process can be largely automated using default settings or customized through user-defined settings. Optional parameters can be used to influence the disclosure risk and the analytical quality of the synthetic data.
9
+
1. **Detect data types**: detect numerical, categorical and/or datetime data;
10
+
2. **Process missing data**: remove or impute missing values;
11
+
3. **Preprocessing**: transform data into numerical format;
12
+
4. **Synthesizer**: fit CART or GC;
13
+
5. **Postprocessing**: map synthetic data back to its original structure and domain;
14
+
6. **Evaluation metrics**: determine quality of synthetic data, e.g., similarity, utility and privacy metrics.
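The data-handling steps in this list (1, 2, 3 and 5) can be sketched with plain pandas. This is a hedged illustration, not the `python-synthpop` API: the helper names `detect_types`, `impute_missing`, `encode` and `decode` are hypothetical, median/mode imputation is just one possible choice, and the synthesizer (step 4) and evaluation (step 6) are left out. The rows are taken from the Social Diagnosis 2011 preview shown in the Examples section.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sex": ["FEMALE", "MALE", "FEMALE", "FEMALE", "FEMALE"],
    "age": [57, 20, 18, 78, 54],
    "income": [800.0, 350.0, np.nan, 900.0, 1500.0],
    "smoke": ["NO", "NO", "NO", "NO", "YES"],
})


def detect_types(frame):
    """Step 1: label each column as numerical or categorical."""
    return {
        col: "numerical" if pd.api.types.is_numeric_dtype(frame[col]) else "categorical"
        for col in frame.columns
    }


def impute_missing(frame, types):
    """Step 2: fill gaps with the median (numerical) or the mode (categorical)."""
    out = frame.copy()
    for col, kind in types.items():
        if kind == "numerical":
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


def encode(frame, types):
    """Step 3: replace categories with integer codes and remember the mapping."""
    out = frame.copy()
    categories = {}
    for col, kind in types.items():
        if kind == "categorical":
            cat = out[col].astype("category")
            categories[col] = cat.cat.categories.to_numpy()
            out[col] = cat.cat.codes
    return out, categories


def decode(frame, categories):
    """Step 5: map integer codes back to the original category labels."""
    out = frame.copy()
    for col, labels in categories.items():
        out[col] = labels[out[col].to_numpy()]
    return out


types = detect_types(df)
imputed = impute_missing(df, types)
encoded, categories = encode(imputed, types)
decoded = decode(encoded, categories)  # round trip back to labels
```

In a full pipeline, a fitted synthesizer would generate new rows in the encoded numerical space, and `decode` would then be applied to those synthetic rows rather than to the encoded original.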
10
15
11
-
☁️ [Web app](https://local-first-bias-detection.s3.eu-central-1.amazonaws.com/synthetic-data.html) – a demo of synthetic data generation using `python-synthpop` through [WebAssembly](https://github.com/NGO-Algorithm-Audit/local-first-web-tool)
16
+
☁️ [Web app](https://algorithmaudit.eu/technical-tools/sdg/#web-app) – a demo of synthetic data generation with `python-synthpop` in the browser using [WebAssembly](https://github.com/NGO-Algorithm-Audit/local-first-web-tool).
12
17
13
18
# Installation
14
19
@@ -29,200 +34,143 @@ python setup.py install
29
34
30
35
# Examples
31
36
32
-
#### Adult dataset
33
-
We will use the US Adult census dataset, a freely available open dataset extracted from the US Census Bureau database. The dataset was originally designed for a binary classification problem: predicting whether a person earns over $50,000 a year. It contains a mixture of discrete and continuous features, including age, working status (workclass), education, marital status, race, sex, relationship and hours worked per week.
37
+
#### Social Diagnosis 2011 dataset
38
+
As an example, we will use the Social Diagnosis 2011 dataset, a comprehensive survey conducted in Poland. This dataset includes a wide range of variables related to the social and economic conditions of Polish households and individuals. It covers aspects such as income, employment, education, health, and overall quality of life.
34
39
35
40
```
36
-
In [1]: from datasets.adult import df
41
+
In [1]: import pandas as pd
37
42
38
-
In [2]: df.head()
43
+
In [2]: df = pd.read_csv('../datasets/socialdiagnosis/data/SocialDiagnosis2011.csv', sep=';')
44
+
df.head()
39
45
Out[2]:
40
-
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country income
41
-
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
42
-
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
43
-
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
44
-
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
45
-
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
46
+
      sex  age  marital  income                ls  smoke
47
+
0  FEMALE   57  MARRIED   800.0           PLEASED     NO
48
+
1    MALE   20   SINGLE   350.0  MOSTLY SATISFIED     NO
49
+
2  FEMALE   18   SINGLE     NaN           PLEASED     NO
50
+
3  FEMALE   78  WIDOWED   900.0             MIXED     NO
51
+
4  FEMALE   54  MARRIED  1500.0  MOSTLY SATISFIED    YES
52
+
46
53
```
47
54
48
55
### python-synthpop
49
56
50
-
Use default parameters for the Adult dataset:
57
+
Using the default parameters, the six steps are applied to the Social Diagnosis example to generate synthetic data. See also the [example notebook](./example_notebooks/00_readme.ipynb).
51
58
52
59
```
53
-
In [1]: from python-synthpop import Synthpop
54
-
55
-
In [2]: from datasets.adult import df, dtypes
56
-
57
-
In [3]: spop = Synthpop()
58
-
59
-
In [4]: spop.fit(df, dtypes)
60
-
train_age
61
-
train_workclass
62
-
train_fnlwgt
63
-
train_education
64
-
train_educational-num
65
-
train_marital-status
66
-
train_occupation
67
-
train_relationship
68
-
train_race
69
-
train_gender
70
-
train_capital-gain
71
-
train_capital-loss
72
-
train_hours-per-week
73
-
train_native-country
74
-
train_income
75
-
76
-
In [5]: synth_df = spop.generate(len(df))
77
-
generate_age
78
-
generate_workclass
79
-
generate_fnlwgt
80
-
generate_education
81
-
generate_educational-num
82
-
generate_marital-status
83
-
generate_occupation
84
-
generate_relationship
85
-
generate_race
86
-
generate_gender
87
-
generate_capital-gain
88
-
generate_capital-loss
89
-
generate_hours-per-week
90
-
generate_native-country
91
-
generate_income
92
-
93
-
In [6]: synth_df.head()
94
-
Out[6]:
95
-
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country income