Commit 7094ecb

Merge pull request #8 from NGO-Algorithm-Audit/main
update merge
2 parents 144b93e + b8214cf commit 7094ecb

5 files changed: +1376 -210 lines changed
README.md

Lines changed: 138 additions & 190 deletions
@@ -1,14 +1,19 @@
1-
![image](https://github.com/NGO-Algorithm-Audit/python-synthpop/blob/main/images/Header.png)
1+
![image](https://raw.githubusercontent.com/NGO-Algorithm-Audit/python-synthpop/b09d3fe93ac21406199810e39e2a844dc1faefd0/images/Header.png)
22

33
# python-synthpop
44

5-
Python implementation of the R package [synthpop](https://cran.r-project.org/web/packages/synthpop/index.html).
5+
```python-synthpop``` is an open-source Python library for synthetic data generation (SDG). It includes implementations of Classification and Regression Trees (CART) and Gaussian Copula (GC) synthesizers for generating privacy-preserving synthetic data. The CART synthesizer is a Python implementation of the CART method used in the R package [synthpop](https://cran.r-project.org/web/packages/synthpop/index.html).
66

7-
With this library synthetic tabular data can be produced. Synthetic data refers to artificially generated data that mimics real-world data in structure and statistical properties but does not directly originate from actual events or individuals. It supports processing numerical and categorical data using sequential modelling techniques. Artificial data are generated by drawing from conditional distributions fitted to the original data using parametric (e.g., Gaussian copula) or classification and regression trees (CART) models.
7+
Synthetic data is generated in six steps:
88

9-
This Python library is a reimplementation of the R package `synthpop`. Synthetic data can be generated using the `.generate()` method after fitting the a synntesizer to the original data with the `.fit()` method. The process can be largely automated using default settings or customized through user-defined settings. Optional parameters can be used to influence the disclosure risk and the analytical quality of the synthetic data.
9+
1. **Detect data types**: detect numerical, categorical and/or datetime data;
10+
2. **Process missing data**: remove or impute missing values;
11+
3. **Preprocessing**: transform data into a numerical format;
12+
4. **Synthesizer**: fit CART or GC;
13+
5. **Postprocessing**: map synthetic data back to its original structure and domain;
14+
6. **Evaluation metrics**: determine quality of synthetic data, e.g., similarity, utility and privacy metrics.
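
Steps 1, 2, 3 and 5 above can be illustrated with a minimal, standard-library-only sketch. It is not the `python-synthpop` implementation: the helper names (`detect_type`, `impute`, `preprocess`, `postprocess`), the mean/mode imputation rule and the toy table are assumptions made for illustration, and step 4 (fitting a synthesizer) is skipped, so the encoded data is simply mapped straight back.

```python
import statistics


def detect_type(values):
    """Step 1: label a column 'numerical' if every non-missing value is a number."""
    present = [v for v in values if v is not None]
    is_num = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return "numerical" if is_num else "categorical"


def impute(values, kind):
    """Step 2: fill gaps with the mean (numerical) or the mode (categorical)."""
    present = [v for v in values if v is not None]
    if kind == "numerical":
        fill = statistics.mean(present)
    else:
        fill = statistics.mode(present)
    return [fill if v is None else v for v in values]


def preprocess(values, kind):
    """Step 3: return (encoded values, parameters needed to invert the encoding)."""
    if kind == "numerical":
        mean = statistics.mean(values)
        std = statistics.pstdev(values) or 1.0  # guard against constant columns
        return [(v - mean) / std for v in values], (mean, std)
    categories = sorted(set(values))
    return [categories.index(v) for v in values], categories


def postprocess(encoded, kind, params):
    """Step 5: map encoded values back to the original domain."""
    if kind == "numerical":
        mean, std = params
        return [v * std + mean for v in encoded]
    return [params[int(code)] for code in encoded]


# Hypothetical toy table; None marks a missing value.
table = {
    "sex": ["FEMALE", "MALE", "FEMALE", "FEMALE"],
    "income": [800.0, 350.0, None, 900.0],
}

kinds, restored = {}, {}
for name, column in table.items():
    kinds[name] = detect_type(column)
    filled = impute(column, kinds[name])
    encoded, params = preprocess(filled, kinds[name])
    # Step 4 would fit a synthesizer on `encoded` and sample new rows from it.
    restored[name] = postprocess(encoded, kinds[name], params)

print(kinds)
print(restored)
```

Keeping the encoding parameters next to the encoded column is what makes the postprocessing step possible: the same mean, standard deviation and category order must be reused when mapping sampled values back.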
1015

11-
☁️ [Web app](https://local-first-bias-detection.s3.eu-central-1.amazonaws.com/synthetic-data.html) – a demo of synthetic data generation using `python-synthpop` through [WebAssembly](https://github.com/NGO-Algorithm-Audit/local-first-web-tool)
16+
☁️ [Web app](https://algorithmaudit.eu/technical-tools/sdg/#web-app) – a demo of synthetic data generation with `python-synthpop` that runs in the browser via [WebAssembly](https://github.com/NGO-Algorithm-Audit/local-first-web-tool).
1217

1318
# Installation
1419

@@ -29,200 +34,143 @@ python setup.py install
2934

3035
# Examples
3136

32-
#### Adult dataset
33-
We will use the US adult census dataset, which is a freely available open dataset extracted from the US census bureau database. The dataset is initially designed for a binary classification problem and the task is to predict whether a person earns over $50,000 a year. The dataset is a mixture of discrete and continuous features, including age, working status (workclass), education, marital status, race, sex, relationship and hours worked per week.
37+
#### Social Diagnosis 2011 dataset
38+
As an example, we use the Social Diagnosis 2011 dataset, which comes from a comprehensive survey conducted in Poland. The dataset includes a wide range of variables on the social and economic conditions of Polish households and individuals, covering income, employment, education, health, and overall quality of life.
3439

3540
```
36-
In [1]: from datasets.adult import df
41+
In [1]: import pandas as pd
3742
38-
In [2]: df.head()
43+
In [2]: df = pd.read_csv('../datasets/socialdiagnosis/data/SocialDiagnosis2011.csv', sep=';')
44+
df.head()
3945
Out[2]:
40-
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country income
41-
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
42-
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
43-
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
44-
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
45-
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
46+
sex age marital income ls smoke
47+
0 FEMALE 57 MARRIED 800.0 PLEASED NO
48+
1 MALE 20 SINGLE 350.0 MOSTLY SATISFIED NO
49+
2 FEMALE 18 SINGLE NaN PLEASED NO
50+
3 FEMALE 78 WIDOWED 900.0 MIXED NO
51+
4 FEMALE 54 MARRIED 1500.0 MOSTLY SATISFIED YES
52+
4653
```
4754

4855
### python-synthpop
4956

50-
Use default parameters for the Adult dataset:
57+
With default parameters, the six steps are applied to the Social Diagnosis example to generate synthetic data. See also the [example notebook](./example_notebooks/00_readme.ipynb).
5158

5259
```
53-
In [1]: from python-synthpop import Synthpop
54-
55-
In [2]: from datasets.adult import df, dtypes
56-
57-
In [3]: spop = Synthpop()
58-
59-
In [4]: spop.fit(df, dtypes)
60-
train_age
61-
train_workclass
62-
train_fnlwgt
63-
train_education
64-
train_educational-num
65-
train_marital-status
66-
train_occupation
67-
train_relationship
68-
train_race
69-
train_gender
70-
train_capital-gain
71-
train_capital-loss
72-
train_hours-per-week
73-
train_native-country
74-
train_income
75-
76-
In [5]: synth_df = spop.generate(len(df))
77-
generate_age
78-
generate_workclass
79-
generate_fnlwgt
80-
generate_education
81-
generate_educational-num
82-
generate_marital-status
83-
generate_occupation
84-
generate_relationship
85-
generate_race
86-
generate_gender
87-
generate_capital-gain
88-
generate_capital-loss
89-
generate_hours-per-week
90-
generate_native-country
91-
generate_income
92-
93-
In [6]: synth_df.head()
94-
Out[6]:
95-
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country income
96-
0 21 ? 213055 11th 7 Never-married ? Not-in-family Other Female 0 0 30 United-States <=50K
97-
1 23 Private 150683 HS-grad 9 Never-married Adm-clerical Not-in-family White Female 0 0 40 United-States <=50K
98-
2 61 Private 191417 10th 6 Widowed Sales Not-in-family Black Female 0 0 32 United-States <=50K
99-
3 50 Private 190762 HS-grad 9 Divorced Sales Not-in-family White Male 0 0 60 United-States <=50K
100-
4 42 Local-gov 255675 HS-grad 9 Married-civ-spouse Other-service Husband Black Male 0 0 40 United-States <=50K
101-
102-
In [7]: spop.method
103-
Out[7]:
104-
age sample
105-
workclass cart
106-
fnlwgt cart
107-
education cart
108-
educational-num cart
109-
marital-status cart
110-
occupation cart
111-
relationship cart
112-
race cart
113-
gender cart
114-
capital-gain cart
115-
capital-loss cart
116-
hours-per-week cart
117-
native-country cart
118-
income cart
119-
dtype: object
120-
121-
In [8]: spop.visit_sequence
122-
Out[8]:
123-
age 0
124-
workclass 1
125-
fnlwgt 2
126-
education 3
127-
educational-num 4
128-
marital-status 5
129-
occupation 6
130-
relationship 7
131-
race 8
132-
gender 9
133-
capital-gain 10
134-
capital-loss 11
135-
hours-per-week 12
136-
native-country 13
137-
income 14
138-
dtype: int64
139-
140-
In [9]: spop.predictor_matrix
141-
Out[9]:
142-
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country income
143-
age 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
144-
workclass 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
145-
fnlwgt 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
146-
education 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
147-
educational-num 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
148-
marital-status 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
149-
occupation 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
150-
relationship 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
151-
race 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
152-
gender 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
153-
capital-gain 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
154-
capital-loss 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
155-
hours-per-week 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
156-
native-country 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
157-
income 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
158-
```
60+
In [1]: from synthpop import MissingDataHandler, DataProcessor, CARTMethod
61+
62+
In [2]: # 1. Initiate metadata
63+
md_handler = MissingDataHandler()
64+
65+
# 1.1 Get data types
66+
metadata = md_handler.get_column_dtypes(df)
67+
print("Column Data Types:", metadata)
68+
69+
Column Data Types: {'sex': 'categorical', 'age': 'numerical', 'marital': 'categorical', 'income': 'numerical', 'ls': 'categorical', 'smoke': 'categorical'}
70+
71+
In [3]: # 2. Process missing data
72+
print("Missing data:")
73+
print(df.isnull().sum())
74+
75+
Missing data:
76+
sex 0
77+
age 0
78+
marital 9
79+
income 683
80+
ls 8
81+
smoke 10
82+
dtype: int64
83+
84+
In [4]: # 2.1 Detect type of missingness
85+
missingness_dict = md_handler.detect_missingness(df)
86+
print("Detected missingness type:", missingness_dict)
87+
88+
Detected missingness type: {'marital': 'MAR', 'income': 'MAR', 'ls': 'MAR', 'smoke': 'MAR'}
89+
90+
91+
In [5]: # 2.2 Impute missing values
92+
real_df = md_handler.apply_imputation(df, missingness_dict)
93+
94+
print("Missing data:")
95+
print(real_df.isnull().sum())
96+
97+
Missing data:
98+
sex 0
99+
age 0
100+
marital 0
101+
income 0
102+
ls 0
103+
smoke 0
104+
dtype: int64
105+
106+
107+
In [6]: # 3. Preprocessing: Instantiate the DataProcessor with column_dtypes
108+
processor = DataProcessor(metadata)
109+
110+
# 3.1 Preprocess the data: transforms raw data into a numerical format
111+
processed_data = processor.preprocess(real_df)
112+
print("Processed data:")
113+
display(processed_data.head())
114+
115+
Processed data:
116+
sex age marital income ls smoke
117+
0 0 0.503625 3 -0.517232 4 0
118+
1 1 -1.495187 4 -0.898113 3 0
119+
2 0 -1.603231 4 0.000000 4 0
120+
3 0 1.638086 5 -0.432591 1 0
121+
4 0 0.341559 3 0.075251 3 1
122+
123+
124+
In [7]: # 4. Fit the CART method
125+
cart = CARTMethod(metadata, smoothing=True, proper=True, minibucket=5, random_state=42)
126+
cart.fit(processed_data)
127+
128+
In [8]: # 4.1 Preview generated synthetic data
129+
synthetic_processed = cart.sample(100)
130+
print("Synthetic processed data:")
131+
display(synthetic_processed.head())
132+
133+
Synthetic processed data:
134+
sex age marital income ls smoke
135+
0 1 -1.087360 3 -1.201126 4 0
136+
1 1 -0.882289 3 1.182255 4 0
137+
2 0 1.449201 5 -0.255936 2 0
138+
3 0 0.890598 3 0.220739 4 1
139+
4 0 0.313502 3 1.395039 4 0
140+
141+
In [9]: # 5. Postprocessing: back to the original format and preview of data
142+
synthetic_df = processor.postprocess(synthetic_processed)
143+
print("Synthetic data in original format:")
144+
display(synthetic_df.head())
145+
146+
Synthetic data in original format:
147+
sex age marital income ls smoke
148+
0 FEMALE 30.377064 SINGLE -8.000000 MOSTLY DISSATISFIED NO
149+
1 MALE 54.823585 MARRIED 1861.809802 PLEASED YES
150+
2 FEMALE 78.641244 MARRIED 771.239134 MOSTLY DISSATISFIED NO
151+
3 MALE 53.458122 MARRIED 1758.942347 PLEASED NO
152+
4 FEMALE 60.354551 SINGLE 1024.351794 PLEASED NO
153+
154+
In [10]: from synthpop.metrics import (
155+
MetricsReport,
156+
EfficacyMetrics,
157+
DisclosureProtection
158+
)
159+
160+
In [11]: # 6. Evaluate the synthetic data
161+
162+
# 6.1 Diagnostic report
163+
report = MetricsReport(real_df, synthetic_df, metadata)
164+
report_df = report.generate_report()
165+
print("=== Diagnostic Report ===")
166+
display(report_df)
167+
168+
column type missing_value_similarity range_coverage boundary_adherence ks_complement tv_complement statistic_similarity category_coverage category_adherence
169+
0 sex categorical 1.0 N/A N/A N/A 0.9764 N/A 1.0 1.0
170+
1 age numerical 1.0 0.94757 1.0 0.9142 N/A 0.962239 N/A N/A
171+
2 marital categorical 1.0 N/A N/A N/A 0.967 N/A 0.666667 1.0
172+
3 income numerical 1.0 0.408926 1.0 0.9056 N/A 0.948719 N/A N/A
173+
4 ls categorical 1.0 N/A N/A N/A 0.9224 N/A 0.857143 1.0
174+
5 smoke categorical 1.0 N/A N/A N/A 0.9754 N/A 1.0 1.0
159175
160-
### Define the visit sequence for the Adult dataset:
161-
162-
```
163-
In [1]: from python-synthpop import Synthpop
164-
165-
In [2]: from datasets.adult import df, dtypes
166-
167-
In [3]: spop = Synthpop(visit_sequence=[0, 1, 5, 3, 2])
168-
169-
In [4]: spop.fit(df, dtypes)
170-
train_age
171-
train_workclass
172-
train_marital-status
173-
train_education
174-
train_fnlwgt
175-
176-
In [5]: synth_df = spop.generate(len(df))
177-
generate_age
178-
generate_workclass
179-
generate_marital-status
180-
generate_education
181-
generate_fnlwgt
182-
183-
In [6]: synth_df.head()
184-
Out[6]:
185-
age workclass fnlwgt education marital-status
186-
0 57 Self-emp-not-inc 327901 Prof-school Married-civ-spouse
187-
1 24 Private 34568 Assoc-voc Never-married
188-
2 50 Private 256861 HS-grad Married-civ-spouse
189-
3 28 Private 186239 Some-college Never-married
190-
4 38 Private 216129 Bachelors Divorced
191-
192-
In [7]: spop.method
193-
Out[7]:
194-
age sample
195-
workclass cart
196-
fnlwgt cart
197-
education cart
198-
educational-num cart
199-
marital-status cart
200-
occupation cart
201-
relationship cart
202-
race cart
203-
gender cart
204-
capital-gain cart
205-
capital-loss cart
206-
hours-per-week cart
207-
native-country cart
208-
income cart
209-
dtype: object
210-
211-
In [8]: spop.visit_sequence
212-
Out[8]:
213-
age 0
214-
workclass 1
215-
fnlwgt 4
216-
education 3
217-
marital-status 2
218-
dtype: int64
219-
220-
In [9]: spop.predictor_matrix
221-
Out[9]:
222-
age workclass fnlwgt education marital-status
223-
age 0 0 0 0 0
224-
workclass 1 0 0 0 0
225-
fnlwgt 1 1 0 1 1
226-
education 1 1 0 0 1
227-
marital-status 1 1 0 0 0
228176
```
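
The `ks_complement` and `tv_complement` columns in the diagnostic report can be read with the help of a small sketch. It assumes the common definitions (one minus the two-sample Kolmogorov-Smirnov statistic for numerical columns, one minus the total variation distance for categorical columns); whether `python-synthpop` computes them in exactly this way is not confirmed by the README. The function bodies and the toy samples are illustrative only.

```python
from bisect import bisect_right
from collections import Counter


def tv_complement(real, synthetic):
    """1 minus the total variation distance between category frequencies."""
    p, q = Counter(real), Counter(synthetic)
    categories = set(p) | set(q)
    tvd = 0.5 * sum(
        abs(p[c] / len(real) - q[c] / len(synthetic)) for c in categories
    )
    return 1.0 - tvd


def ks_complement(real, synthetic):
    """1 minus the largest gap between the two empirical CDFs."""
    r, s = sorted(real), sorted(synthetic)
    gap = max(
        abs(bisect_right(r, x) / len(r) - bisect_right(s, x) / len(s))
        for x in r + s
    )
    return 1.0 - gap


# Hypothetical toy samples, not taken from the dataset.
real_smoke = ["NO", "NO", "NO", "YES"]
synth_smoke = ["NO", "NO", "YES", "YES"]
print(tv_complement(real_smoke, synth_smoke))

real_age = [57, 20, 18, 78, 54]
synth_age = [30, 55, 79, 53, 60]
print(ks_complement(real_age, synth_age))
```

Under these definitions both scores lie between 0 and 1, and a value of 1 means the real and synthetic marginal distributions coincide on the compared samples.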
