A lightweight, highly configurable tool for scanning files for personally identifiable information (PII) and other sensitive data. It uses AWS Comprehend and custom regex patterns, supports fast multi-threaded execution, and can run as a standalone tool or as a GitHub Action.
- AWS Comprehend Integration: Leverages AWS's powerful NLP capabilities to detect PII entities
- Custom Regex Patterns: Add your own regex patterns to detect organization-specific sensitive data
- Git Integration: Scan only files that have changed in a git repository
- Multiple Output Formats: Output findings in text, JSON, or CSV format
- Configurable: Adjust confidence thresholds, excluded directories, and more
- Multithreaded: Process files in parallel for faster scanning
- GitHub Action Support: Use as a GitHub Action to scan code changes in PRs
pip install boto3
# Scan only git changes (default)
pii_scan
# Scan all files recursively
pii_scan --all
# Show help
pii_scan --help
Options:
  --all                    Scan all files recursively (default: only scan git changes)
  --help, -h               Show this help message and exit
  --config FILE            Path to configuration file
  --min-confidence FLOAT   Minimum confidence score (0.0-1.0) for PII detection
  --output FORMAT          Output format: text, json, or csv (default: text)
  --custom-regex FILE      Path to file containing custom regex patterns
  --exclude-dirs DIRS      Comma-separated list of directories to exclude
  --exclude-exts EXTS      Comma-separated list of file extensions to exclude
  --workers INT            Number of worker threads (default: 8)
  --verbose, -v            Enable verbose output
  --quiet, -q              Suppress all output except findings and errors
  --region REGION          AWS region to use for Comprehend API
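As a rough illustration of what the --workers option controls, here is a minimal sketch of fanning file scans out over a thread pool. It is not the tool's actual code: `scan_in_parallel` and the stand-in `count_at_signs` scanner are hypothetical names, and the "files" are in-memory strings so the example runs without AWS access. Threads are a reasonable fit here on the assumption that most of the time is spent waiting on network calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable

# Hypothetical in-memory "files" so the sketch runs anywhere.
FAKE_FILES = {
    "a.txt": "contact: alice@example.com",
    "b.txt": "nothing sensitive here",
    "c.txt": "bob@example.com, carol@example.com",
}


def count_at_signs(path: str) -> int:
    """Stand-in scanner: a real one would call Comprehend and the regex checks."""
    return FAKE_FILES[path].count("@")


def scan_in_parallel(paths: Iterable[str],
                     scan_file: Callable[[str], int],
                     workers: int = 8) -> Dict[str, int]:
    """Run scan_file over every path using a pool of worker threads."""
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its result.
        return dict(zip(path_list, pool.map(scan_file, path_list)))


results = scan_in_parallel(FAKE_FILES, count_at_signs, workers=4)
print(results)
```

Raising the worker count only helps until an external limit (for example, API rate limits) becomes the bottleneck.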
# Scan all files with custom regex patterns
pii_scan --all --custom-regex custom_regex_patterns.json
# Scan git changes with higher confidence threshold
pii_scan --min-confidence 0.9
# Output findings in JSON format
pii_scan --output json
# Exclude additional directories
pii_scan --exclude-dirs "build,dist,node_modules"
# Use a specific AWS region
pii_scan --region us-west-2
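To make the effect of --min-confidence concrete, the sketch below filters a Comprehend-style response by score. The helper names `detect_pii` and `filter_by_confidence` are hypothetical, the inclusive comparison (`>=`) is an assumption about how the threshold is applied, and the sample response is hand-written in the shape that boto3's `detect_pii_entities` returns rather than real output.

```python
from typing import Any, Dict, List


def detect_pii(text: str, region: str) -> Dict[str, Any]:
    """Call Comprehend's DetectPiiEntities on one chunk of text."""
    import boto3  # imported lazily so the filter below works without AWS access

    client = boto3.client("comprehend", region_name=region)
    return client.detect_pii_entities(Text=text, LanguageCode="en")


def filter_by_confidence(response: Dict[str, Any],
                         min_confidence: float) -> List[Dict[str, Any]]:
    """Keep only entities whose Score is at least min_confidence."""
    return [e for e in response.get("Entities", [])
            if e["Score"] >= min_confidence]


# Hand-written sample in the response shape (not real Comprehend output).
sample = {
    "Entities": [
        {"Type": "EMAIL", "Score": 0.99, "BeginOffset": 9, "EndOffset": 26},
        {"Type": "NAME", "Score": 0.62, "BeginOffset": 0, "EndOffset": 4},
    ]
}

kept = filter_by_confidence(sample, 0.9)
print([e["Type"] for e in kept])
```

A higher threshold trades recall for fewer false positives, so low-scoring entities such as the NAME above are dropped at 0.9 but kept at 0.5.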
You can define custom regex patterns in a JSON file and provide it using the --custom-regex option. The file should contain a JSON object where keys are pattern names and values are regex patterns.
Example custom_regex_patterns.json:
{
  "Social Security Number": "\\b(?!000|666|9\\d{2})([0-8]\\d{2}|7([0-6]\\d|7[012]))([-]?)(?!00)\\d\\d\\3(?!0000)\\d{4}\\b",
  "Credit Card Number": "\\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|6(?:011|5[0-9]{2})[0-9]{12}|(?:2131|1800|35\\d{3})\\d{11})\\b",
  "Harvard ID": "\\b\\d{8}\\b",
  "API Key Pattern": "(?i)(api[_-]?key|apikey)\\s*[:=]\\s*['\"]([^'\"]{10,})['\"]",
  "Database Connection String": "(?i)(jdbc|mongodb|postgresql|mysql|sqlserver):[^\\s]+"
}
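For readers who want to test a patterns file before running a scan, here is a minimal sketch of loading the name-to-regex mapping and applying it line by line. The function names `load_patterns` and `scan_text` are hypothetical and the line-by-line matching is an assumption; the scanner's own matching logic may differ. Two patterns from the example above are reused.

```python
import json
import re
from typing import Dict, List, Tuple

# Two entries copied from the example file; the raw string keeps JSON's
# doubled backslashes intact until json.loads decodes them.
RAW_PATTERNS = r'''
{
  "Harvard ID": "\\b\\d{8}\\b",
  "Database Connection String": "(?i)(jdbc|mongodb|postgresql|mysql|sqlserver):[^\\s]+"
}
'''


def load_patterns(raw_json: str) -> Dict[str, re.Pattern]:
    """Compile every pattern up front so a bad regex fails early."""
    return {name: re.compile(rx) for name, rx in json.loads(raw_json).items()}


def scan_text(text: str, patterns: Dict[str, re.Pattern]) -> List[Tuple[int, str, str]]:
    """Return (line number, pattern name, matched text) for every match."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in patterns.items():
            for match in pattern.finditer(line):
                findings.append((line_no, name, match.group(0)))
    return findings


patterns = load_patterns(RAW_PATTERNS)
findings = scan_text("id = 12345678\nurl = mysql://db.example.internal/app\n", patterns)
for finding in findings:
    print(finding)
```

Broad patterns such as the eight-digit ID will also match unrelated numbers of the same length, so it is worth trying new patterns against representative files first.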
You can provide a configuration file using the --config option. The file should contain a JSON object with configuration parameters.
Example config.json:
{
  "max_workers": 8,
  "min_confidence_score": 0.9,
  "excluded_dirs": [".git", ".venv", "node_modules", "__pycache__", "dist", "build"],
  "excluded_extensions": [".min.js", ".map", ".svg", ".woff", ".ttf", ".png", ".jpg"],
  "critical_entity_types": ["AWS_ACCESS_KEY", "AWS_SECRET_KEY", "PASSWORD", "CREDIT_CARD"],
  "security_relevant_entity_types": [
    "AWS_ACCESS_KEY",
    "AWS_SECRET_KEY",
    "PASSWORD",
    "USERNAME",
    "IP_ADDRESS",
    "EMAIL",
    "CREDIT_CARD",
    "PHONE_NUMBER"
  ]
}
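The excluded_dirs and excluded_extensions settings could be applied roughly as follows. This is a sketch under assumptions, not the tool's code: `is_excluded` is a hypothetical helper, directory names are assumed to match any path component, and extensions are assumed to match as filename suffixes (which is what lets a compound entry like ".min.js" work).

```python
import json
from pathlib import PurePosixPath
from typing import Any, Dict

# A trimmed-down config using the same keys as the example above.
config: Dict[str, Any] = json.loads("""
{
  "excluded_dirs": [".git", "node_modules"],
  "excluded_extensions": [".min.js", ".png"]
}
""")


def is_excluded(path: str, cfg: Dict[str, Any]) -> bool:
    """Return True if the path sits under an excluded directory or
    ends with an excluded extension."""
    p = PurePosixPath(path)
    excluded_dirs = set(cfg.get("excluded_dirs", []))
    # Check every directory component, not just the top-level one.
    if any(part in excluded_dirs for part in p.parts[:-1]):
        return True
    # str.endswith accepts a tuple; an empty tuple never matches.
    return p.name.endswith(tuple(cfg.get("excluded_extensions", [])))


for candidate in ["src/app.py", "node_modules/x/index.js", "static/app.min.js"]:
    print(candidate, is_excluded(candidate, config))
```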
This tool can be used as a GitHub Action to scan code changes in pull requests. See the action.yml file for details.
Example workflow:
name: Scan for PII
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  pii-scan:
    name: Scan for PII
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: Run PII Scanner
        uses: harvard-ea/action-pii-scanning-using-aws@main
        with:
          custom-regex-file: pii-engine/custom_regex_patterns.json
          min-confidence: '0.85'
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
The tool requires AWS credentials with permissions to use the Comprehend service. You can provide these credentials using environment variables or AWS configuration files.
Required permissions:
comprehend:DetectPiiEntities
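One way to grant only that permission is a dedicated IAM policy. The sketch below builds a minimal policy document and prints it as JSON ready to paste into IAM. The Sid value is an arbitrary label chosen for illustration, and the wildcard resource reflects the assumption that this action does not support resource-level restrictions; check the current AWS documentation before relying on it.

```python
import json

# Minimal least-privilege policy for the scanner (illustrative, not official).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPiiDetection",  # arbitrary label
            "Effect": "Allow",
            "Action": ["comprehend:DetectPiiEntities"],
            "Resource": "*",  # assumed: no resource-level scoping for this action
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```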
The aim was to build a free, open-source, drop-in alternative to Nightfall, which is both overpriced and limited in functionality.
Please open an Issue — contributions via PRs are also welcome!
This project is licensed under the MIT License.
© Ventz Petkov