NGO-Algorithm-Audit
diff --git a/README.md b/README.md
Lines changed: 138 additions & 190 deletions
@@ -1,14 +1,19 @@
-![image](https://github.com/NGO-Algorithm-Audit/python-synthpop/blob/main/images/Header.png)
+![image](https://raw.githubusercontent.com/NGO-Algorithm-Audit/python-synthpop/b09d3fe93ac21406199810e39e2a844dc1faefd0/images/Header.png)
 
 # python-synthpop
 
-Python implementation of the R package [synthpop](https://cran.r-project.org/web/packages/synthpop/index.html).
+`python-synthpop` is an open-source Python library for synthetic data generation (SDG). It includes implementations of Classification and Regression Trees (CART) and Gaussian Copula (GC) synthesizers for generating high-quality, privacy-preserving synthetic data. The CART synthesizer is a Python implementation of the CART method used in the R package [synthpop](https://cran.r-project.org/web/packages/synthpop/index.html).
 
-With this library synthetic tabular data can be produced. Synthetic data refers to artificially generated data that mimics real-world data in structure and statistical properties but does not directly originate from actual events or individuals. It supports processing numerical and categorical data using sequential modelling techniques. Artificial data are generated by drawing from conditional distributions fitted to the original data using parametric (e.g., Gaussian copula) or classification and regression trees (CART) models.
+Synthetic data is generated in six steps:
 
-This Python library is a reimplementation of the R package `synthpop`. Synthetic data can be generated using the `.generate()` method after fitting the a synntesizer to the original data with the `.fit()` method. The process can be largely automated using default settings or customized through user-defined settings. Optional parameters can be used to influence the disclosure risk and the analytical quality of the synthetic data.
+1. **Detect data types**: detect numerical, categorical and/or datetime data;
+2. **Process missing data**: remove or impute missing values;
+3. **Preprocessing**: transform data into numerical format;
+4. **Synthesizer**: fit CART or GC;
+5. **Postprocessing**: map synthetic data back to its original structure and domain;
+6. **Evaluation metrics**: determine the quality of the synthetic data, e.g., with similarity, utility and privacy metrics.
 
-☁️ [Web app](https://local-first-bias-detection.s3.eu-central-1.amazonaws.com/synthetic-data.html) – a demo of synthetic data generation using `python-synthpop` through [WebAssembly](https://github.com/NGO-Algorithm-Audit/local-first-web-tool)
+☁️ [Web app](https://algorithmaudit.eu/technical-tools/sdg/#web-app) – a demo of synthetic data generation with `python-synthpop`, running in the browser via [WebAssembly](https://github.com/NGO-Algorithm-Audit/local-first-web-tool).
 
 # Installation
 
@@ -29,200 +34,143 @@ python setup.py install
 
 # Examples
 
-#### Adult dataset
-We will use the US adult census dataset, which is a freely available open dataset extracted from the US census bureau database. The dataset is initially designed for a binary classification problem and the task is to predict whether a person earns over $50,000 a year. The dataset is a mixture of discrete and continuous features, including age, working status (workclass), education, marital status, race, sex, relationship and hours worked per week.
+#### Social Diagnosis 2011 dataset
+We use the Social Diagnosis 2011 dataset as an example. It comes from a comprehensive survey conducted in Poland and includes a wide range of variables related to the social and economic conditions of Polish households and individuals, covering income, employment, education, health, and overall quality of life.
 
 ```
-In [1]: from datasets.adult import df
+In [1]:  import pandas as pd
 
-In [2]: df.head()
+In [2]:  df = pd.read_csv('../datasets/socialdiagnosis/data/SocialDiagnosis2011.csv', sep=';')
+         df.head()
 Out[2]:
-   age          workclass  fnlwgt   education  educational-num       marital-status          occupation    relationship    race   gender  capital-gain  capital-loss  hours-per-week  native-country  income
-0   39          State-gov   77516   Bachelors               13        Never-married        Adm-clerical   Not-in-family   White     Male          2174             0              40   United-States   <=50K
-1   50   Self-emp-not-inc   83311   Bachelors               13   Married-civ-spouse     Exec-managerial         Husband   White     Male             0             0              13   United-States   <=50K
-2   38            Private  215646     HS-grad                9             Divorced   Handlers-cleaners   Not-in-family   White     Male             0             0              40   United-States   <=50K
-3   53            Private  234721        11th                7   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male             0             0              40   United-States   <=50K
-4   28            Private  338409   Bachelors               13   Married-civ-spouse      Prof-specialty            Wife   Black   Female             0             0              40            Cuba   <=50K
+	sex	age	marital	income	ls	smoke
+0	FEMALE	57	MARRIED	800.0	PLEASED	NO
+1	MALE	20	SINGLE	350.0	MOSTLY SATISFIED	NO
+2	FEMALE	18	SINGLE	NaN	PLEASED	NO
+3	FEMALE	78	WIDOWED	900.0	MIXED	NO
+4	FEMALE	54	MARRIED	1500.0	MOSTLY SATISFIED	YES
+
 ```
 
 ### python-synthpop
 
-Use default parameters for the Adult dataset:
+Using default parameters, the six steps are applied to the Social Diagnosis example to generate synthetic data. See also the [example notebook](./example_notebooks/00_readme.ipynb).
 
 ```
-In [1]: from python-synthpop import Synthpop
-
-In [2]: from datasets.adult import df, dtypes
-
-In [3]: spop = Synthpop()
-
-In [4]: spop.fit(df, dtypes)
-train_age
-train_workclass
-train_fnlwgt
-train_education
-train_educational-num
-train_marital-status
-train_occupation
-train_relationship
-train_race
-train_gender
-train_capital-gain
-train_capital-loss
-train_hours-per-week
-train_native-country
-train_income
-
-In [5]: synth_df = spop.generate(len(df))
-generate_age
-generate_workclass
-generate_fnlwgt
-generate_education
-generate_educational-num
-generate_marital-status
-generate_occupation
-generate_relationship
-generate_race
-generate_gender
-generate_capital-gain
-generate_capital-loss
-generate_hours-per-week
-generate_native-country
-generate_income
-
-In [6]: synth_df.head()
-Out[6]:
-   age   workclass  fnlwgt education  educational-num       marital-status      occupation    relationship    race   gender  capital-gain  capital-loss  hours-per-week  native-country  income
-0   21           ?  213055      11th                7        Never-married               ?   Not-in-family   Other   Female             0             0              30   United-States   <=50K
-1   23     Private  150683   HS-grad                9        Never-married    Adm-clerical   Not-in-family   White   Female             0             0              40   United-States   <=50K
-2   61     Private  191417      10th                6              Widowed           Sales   Not-in-family   Black   Female             0             0              32   United-States   <=50K
-3   50     Private  190762   HS-grad                9             Divorced           Sales   Not-in-family   White     Male             0             0              60   United-States   <=50K
-4   42   Local-gov  255675   HS-grad                9   Married-civ-spouse   Other-service         Husband   Black     Male             0             0              40   United-States   <=50K
-
-In [7]: spop.method
-Out[7]:
-age                sample
-workclass            cart
-fnlwgt               cart
-education            cart
-educational-num      cart
-marital-status       cart
-occupation           cart
-relationship         cart
-race                 cart
-gender               cart
-capital-gain         cart
-capital-loss         cart
-hours-per-week       cart
-native-country       cart
-income               cart
-dtype: object
-
-In [8]: spop.visit_sequence
-Out[8]:
-age                 0
-workclass           1
-fnlwgt              2
-education           3
-educational-num     4
-marital-status      5
-occupation          6
-relationship        7
-race                8
-gender              9
-capital-gain       10
-capital-loss       11
-hours-per-week     12
-native-country     13
-income             14
-dtype: int64
-
-In [9]: spop.predictor_matrix
-Out[9]:
-                 age  workclass  fnlwgt  education  educational-num  marital-status  occupation  relationship  race  gender  capital-gain  capital-loss  hours-per-week  native-country  income
-age                0          0       0          0                0               0           0             0     0       0             0             0               0               0       0
-workclass          1          0       0          0                0               0           0             0     0       0             0             0               0               0       0
-fnlwgt             1          1       0          0                0               0           0             0     0       0             0             0               0               0       0
-education          1          1       1          0                0               0           0             0     0       0             0             0               0               0       0
-educational-num    1          1       1          1                0               0           0             0     0       0             0             0               0               0       0
-marital-status     1          1       1          1                1               0           0             0     0       0             0             0               0               0       0
-occupation         1          1       1          1                1               1           0             0     0       0             0             0               0               0       0
-relationship       1          1       1          1                1               1           1             0     0       0             0             0               0               0       0
-race               1          1       1          1                1               1           1             1     0       0             0             0               0               0       0
-gender             1          1       1          1                1               1           1             1     1       0             0             0               0               0       0
-capital-gain       1          1       1          1                1               1           1             1     1       1             0             0               0               0       0
-capital-loss       1          1       1          1                1               1           1             1     1       1             1             0               0               0       0
-hours-per-week     1          1       1          1                1               1           1             1     1       1             1             1               0               0       0
-native-country     1          1       1          1                1               1           1             1     1       1             1             1               1               0       0
-income             1          1       1          1                1               1           1             1     1       1             1             1               1               1       0
-```
+In [1]:     from synthpop import MissingDataHandler, DataProcessor, CARTMethod
+
+In [2]:     # 1. Initiate metadata
+            md_handler = MissingDataHandler()
+
+            # 1.1 Get data types
+            metadata = md_handler.get_column_dtypes(df)
+            print("Column Data Types:", metadata)
+
+            Column Data Types: {'sex': 'categorical', 'age': 'numerical', 'marital': 'categorical', 'income': 'numerical', 'ls': 'categorical', 'smoke': 'categorical'}
+
+In [3]:     # 2. Process missing data
+            print("Missing data:")
+            print(df.isnull().sum())
+
+            Missing data:
+            sex          0
+            age          0
+            marital      9
+            income     683
+            ls           8
+            smoke       10
+            dtype: int64
+
+In [4]:     # 2.1 Detect type of missingness
+            missingness_dict = md_handler.detect_missingness(df)
+            print("Detected missingness type:", missingness_dict)
+
+            Detected missingness type: {'marital': 'MAR', 'income': 'MAR', 'ls': 'MAR', 'smoke': 'MAR'}
+
+
+In [5]:     # 2.2 Impute missing values
+            real_df = md_handler.apply_imputation(df, missingness_dict)
+
+            print("Missing data:")
+            print(real_df.isnull().sum())
+
+            Missing data:
+            sex        0
+            age        0
+            marital    0
+            income     0
+            ls         0
+            smoke      0
+            dtype: int64
+
+
+In [6]:     # 3. Preprocessing: Instantiate the DataProcessor with column_dtypes
+            processor = DataProcessor(metadata)
+
+            # 3.1 Preprocess the data: transforms raw data into a numerical format
+            processed_data = processor.preprocess(real_df)
+            print("Processed data:")
+            display(processed_data.head())
+
+            Processed data:
+            sex	age	marital	income	ls	smoke
+            0	0	0.503625	3	-0.517232	4	0
+            1	1	-1.495187	4	-0.898113	3	0
+            2	0	-1.603231	4	0.000000	4	0
+            3	0	1.638086	5	-0.432591	1	0
+            4	0	0.341559	3	0.075251	3	1
+
+
+In [7]:     # 4. Fit the CART method
+            cart = CARTMethod(metadata, smoothing=True, proper=True, minibucket=5, random_state=42)
+            cart.fit(processed_data)
+
+In [8]:     # 4.1 Preview generated synthetic data
+            synthetic_processed = cart.sample(100)
+            print("Synthetic processed data:")
+            display(synthetic_processed.head())
+
+            Synthetic processed data:
+            sex	age	marital	income	ls	smoke
+            0	1	-1.087360	3	-1.201126	4	0
+            1	1	-0.882289	3	1.182255	4	0
+            2	0	1.449201	5	-0.255936	2	0
+            3	0	0.890598	3	0.220739	4	1
+            4	0	0.313502	3	1.395039	4	0
+
+In [9]:     # 5. Postprocessing: back to the original format and preview of data
+            synthetic_df = processor.postprocess(synthetic_processed)
+            print("Synthetic data in original format:")
+            display(synthetic_df.head())
+
+            Synthetic data in original format:
+            sex	age	marital	income	ls	smoke
+            0	FEMALE	30.377064	SINGLE	-8.000000	MOSTLY DISSATISFIED	NO
+            1	MALE	54.823585	MARRIED	1861.809802	PLEASED	YES
+            2	FEMALE	78.641244	MARRIED	771.239134	MOSTLY DISSATISFIED	NO
+            3	MALE	53.458122	MARRIED	1758.942347	PLEASED	NO
+            4	FEMALE	60.354551	SINGLE	1024.351794	PLEASED	NO
+
+In [10]:    from synthpop.metrics import (
+                MetricsReport,
+                EfficacyMetrics,
+                DisclosureProtection
+            )
+
+In [11]:    # 6. Evaluate the synthetic data
+
+            # 6.1 Diagnostic report
+            report = MetricsReport(real_df, synthetic_df, metadata)
+            report_df = report.generate_report()
+            print("=== Diagnostic Report ===")
+            display(report_df)
+
+            	column	type	missing_value_similarity	range_coverage	boundary_adherence	ks_complement	tv_complement	statistic_similarity	category_coverage	category_adherence
+                0	sex	categorical	1.0	N/A	N/A	N/A	0.9764	N/A	1.0	1.0
+                1	age	numerical	1.0	0.94757	1.0	0.9142	N/A	0.962239	N/A	N/A
+                2	marital	categorical	1.0	N/A	N/A	N/A	0.967	N/A	0.666667	1.0
+                3	income	numerical	1.0	0.408926	1.0	0.9056	N/A	0.948719	N/A	N/A
+                4	ls	categorical	1.0	N/A	N/A	N/A	0.9224	N/A	0.857143	1.0
+                5	smoke	categorical	1.0	N/A	N/A	N/A	0.9754	N/A	1.0	1.0
 
-### Define the visit sequence for the Adult dataset:
-
-```
-In [1]: from python-synthpop import Synthpop
-
-In [2]: from datasets.adult import df, dtypes
-
-In [3]: spop = Synthpop(visit_sequence=[0, 1, 5, 3, 2])
-
-In [4]: spop.fit(df, dtypes)
-train_age
-train_workclass
-train_marital-status
-train_education
-train_fnlwgt
-
-In [5]: synth_df = spop.generate(len(df))
-generate_age
-generate_workclass
-generate_marital-status
-generate_education
-generate_fnlwgt
-
-In [6]: synth_df.head()
-Out[6]:
-   age          workclass  fnlwgt      education       marital-status
-0   57   Self-emp-not-inc  327901    Prof-school   Married-civ-spouse
-1   24            Private   34568      Assoc-voc        Never-married
-2   50            Private  256861        HS-grad   Married-civ-spouse
-3   28            Private  186239   Some-college        Never-married
-4   38            Private  216129      Bachelors             Divorced
-
-In [7]: spop.method
-Out[7]:
-age                sample
-workclass            cart
-fnlwgt               cart
-education            cart
-educational-num      cart
-marital-status       cart
-occupation           cart
-relationship         cart
-race                 cart
-gender               cart
-capital-gain         cart
-capital-loss         cart
-hours-per-week       cart
-native-country       cart
-income               cart
-dtype: object
-
-In [8]: spop.visit_sequence
-Out[8]:
-age               0
-workclass         1
-fnlwgt            4
-education         3
-marital-status    2
-dtype: int64
-
-In [9]: spop.predictor_matrix
-Out[9]:
-                age  workclass  fnlwgt  education  marital-status
-age               0          0       0          0               0
-workclass         1          0       0          0               0
-fnlwgt            1          1       0          1               1
-education         1          1       0          0               1
-marital-status    1          1       0          0               0
 ```
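The diagnostic report above includes `tv_complement` for categorical columns and `ks_complement` for numerical columns. The following is a minimal sketch of how such scores are commonly defined, assuming they follow the usual conventions: 1 minus the total variation distance between category frequencies, and 1 minus the two-sample Kolmogorov-Smirnov statistic. The function names and the toy values are hypothetical and are not part of the `synthpop` API; the library's own implementation may differ.

```python
from bisect import bisect_right
from collections import Counter


def tv_complement(real, synthetic):
    """Return 1 minus the total variation distance between two categorical samples."""
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    # Total variation distance: half the sum of absolute frequency differences.
    distance = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - distance


def ks_complement(real, synthetic):
    """Return 1 minus the two-sample Kolmogorov-Smirnov statistic."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The largest gap between the two empirical CDFs occurs at an observed value.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


if __name__ == "__main__":
    # Toy values for illustration only, not taken from the real dataset.
    print(tv_complement(["NO", "NO", "YES", "NO"], ["NO", "YES", "YES", "NO"]))  # 0.75
    print(ks_complement([57, 20, 18, 78, 54], [30, 54, 78, 53, 60]))
```

Under these definitions, both scores lie between 0 and 1, and a value close to 1 means the synthetic column's distribution closely matches the real one.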