This script connects to a Snowflake database, retrieves data from specified tables, and sends it to the Protecto Tokenization API for masking. It tracks the processing status asynchronously and saves the final masked table results into an Excel file.
For more details, refer to the Protecto Tokenization API documentation.
- Python 3.10+
- Required Python libraries:
- requests
- sqlalchemy
- openpyxl
- snowflake.sqlalchemy
- python-dotenv
- Credentials stored in a JSON file
- Protecto API key for data tokenization
- Install the required dependencies:
pip install -r requirements.txt
Store your Snowflake credentials and Protecto API key in a JSON file (e.g., config/credentials.json) with the following format:
{
"account": "your_snowflake_account",
"user": "your_snowflake_username",
"password": "your_snowflake_password",
"warehouse": "your_snowflake_warehouse",
"role": "your_snowflake_role",
"protecto_api_key": "your_protecto_api_key"
}Note: Register at https://portal.protecto.ai/ to obtain a Protecto API key
Please enter the list of tables to be processed, with one table per line in tables.txt:
database_name.schema_name.table_name_1
database_name.schema_name.table_name_2
Create a config/mask_configuration.json file to define custom masking rules for specific table columns:
{
"DATABASE_NAME.SCHEMA_NAME.TABLE_NAME_1": {
"0": {
"token_name": "Numeric Token" # Column 0: Masking with numeric token only
},
"1": {
"token_name": "Text Token" # Column 1: Masking with text token only
},
"2": {
"format": "Phone Number", # Column 2: Masking phone numbers with numeric token
"token_name": "Numeric Token"
},
"3": {
"format": "URL", # Column 3: Masking URLs with text token
"token_name": "Text Token"
},
"4": {}, # Column 4: Auto-detection of sensitive data (no format and no token specified)
"5": {
"format": null, # Column 5: Auto-detection of sensitive data (null values)
"token_name": null
}
},
"DATABASE_NAME.SCHEMA_NAME.TABLE_NAME_2": {
"0": {
"token_name": "Numeric Token"
},
"5": {
"token_name": "Text Token"
},
"7": {
"format": "Phone Number",
"token_name": "Numeric Token"
},
"8": {
"format": "URL",
"token_name": "Text Token"
}
}
}The configuration follows this structure:
- Column indices start from 0
- Each column can have:
token_name: The type of token to use for maskingformat: The data format to consider when masking- Both parameters can be omitted or set to
nullfor automatic detection
For available options, refer to:
Create a .env file in the root directory with these parameters:
NUM_ROWS=100 # Maximum number of rows to be fetched per table
TABLE_CHUNK_SIZE=5 # Number of rows processed in each API call batch
Execute the masking script:
python mask.py
- The script reads the table list from
input_list.txt - For each table:
- Connects to Snowflake and retrieves data in configurable chunks
- Applies masking configuration from
mask_configuration.json - Sends data to Protecto's masking API
- Receives tracking IDs for asynchronous processing
- Saves tracking IDs to
tracking_ids.txtfor reference - Polls for completion status
- Saves masked data to
output/[table name as given in input_list].xlsx
NUM_ROWS: Controls the maximum number of rows fetched per tableTABLE_CHUNK_SIZE: Sets the batch size for API processingMAX_COLUMNS_PER_API_CALL: Limits the number of columns in each API request
The tool generates:
- Excel files in the
output/directory (one per processed table, named according to the table name in the input table) - A tracking log (
tracking_ids.txt) for all processing requests
For assistance with this tool or the Protecto API:
- Email: [email protected]
- Documentation: Protecto Documentation