Demo - AI safety validators (lexical slurs and PII removal) #463

rkritika1508 · 2025-12-03T12:47:54Z

Summary

Target issue is #PLEASE_TYPE_ISSUE_NUMBER
Explain the motivation for making this change. What existing problem does the pull request solve?

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

Ran `fastapi run --reload app/main.py` or `docker compose up` in the repository root and tested.
If you've fixed a bug or added code, it is tested and has test cases.

Notes

Please add here if any other information is required for the reviewer.

Summary by CodeRabbit

New Features
- Added safety guardrails to validate and protect user inputs and model outputs
- Implemented automatic detection and redaction of offensive language
- Added automatic detection and anonymization of sensitive personal information
- Introduced multi-language content validation support (English and Hindi)
- Added configurable banned word list filtering
Tests
- Added comprehensive test coverage for safety validation features

coderabbitai · 2025-12-03T12:48:08Z

Caution

Review failed

The pull request is closed.

Walkthrough

Introduces a comprehensive guardrails safety system for content validation. Adds configuration models, a GuardrailsEngine orchestrator that builds and executes validators from config, multiple validator implementations (lexical slur detection, PII anonymization, ban lists), language detection utilities, and corresponding test coverage. Includes hub validator loader infrastructure and new project dependencies.
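
The config-to-validator flow described in the walkthrough can be illustrated with a minimal, standard-library sketch. It is not the PR's code: the class names `BanListConfig` and `StripConfig`, the `CONFIG_REGISTRY` mapping, and the `build_validators`/`run_validators` helpers are hypothetical, and a plain dict lookup on a `type` key stands in for the Pydantic discriminated union and Guardrails guard composition that the summary mentions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical validator shape: takes text, returns (possibly rewritten) text.
Validator = Callable[[str], str]


@dataclass
class BanListConfig:
    """Illustrative config: replaces each banned word with a fixed token."""
    banned_words: List[str] = field(default_factory=list)

    def build(self) -> Validator:
        def validate(text: str) -> str:
            # Case-sensitive substring replacement, kept simple on purpose.
            for word in self.banned_words:
                text = text.replace(word, "[BANNED]")
            return text
        return validate


@dataclass
class StripConfig:
    """Illustrative config: trims surrounding whitespace."""

    def build(self) -> Validator:
        return str.strip


# The "type" key plays the role of the discriminator field.
CONFIG_REGISTRY: Dict[str, type] = {
    "ban_list": BanListConfig,
    "strip": StripConfig,
}


def build_validators(items: List[dict]) -> List[Validator]:
    """Turn raw config items into validator callables, in order."""
    validators = []
    for item in items:
        params = dict(item)          # copy so the caller's dict is untouched
        kind = params.pop("type")    # discriminator selects the config class
        validators.append(CONFIG_REGISTRY[kind](**params).build())
    return validators


def run_validators(validators: List[Validator], text: str) -> str:
    """Apply each validator to the output of the previous one."""
    for validate in validators:
        text = validate(text)
    return text
```

An unknown `type` raises `KeyError` here; a schema-validated union would instead reject the config at parse time, which is presumably part of the motivation for the discriminated-union design.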

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration Types `backend/app/safety/guardrail_config.py` | Defines ValidatorConfigItem as a discriminated union over three validator config classes; introduces GuardrailConfig and GuardrailConfigRoot models for input/output validation pipelines. |
| Engine & Orchestration `backend/app/safety/guardrails_engine.py` | Implements GuardrailsEngine class that initializes from GuardrailConfigRoot, builds input/output guards via validator instantiation, and exposes run_input_validators and run_output_validators methods. |
| Validator Base & Constants `backend/app/safety/validators/base_validator_config.py`, `backend/app/safety/validators/constants.py` | Establishes BaseValidatorConfig with on_fail action and class variable for validator references; defines module constants for slur filenames and language/label/score keys. |
| Hub Validator Loader `backend/app/safety/validators/hub_loader.py` | Provides hub validator integration: HUB_VALIDATORS mapping, functions to check importability, install validators via Guardrails CLI, and dynamically load validator classes post-installation. |
| Lexical Slur Validator `backend/app/safety/validators/lexical_slur.py` | Implements LexicalSlur validator with SlurSeverity enum and text normalization (emoji removal, punctuation stripping, lowercasing); detects toxic words and redacts with [REDACTED_SLUR]; includes LexicalSlurSafetyValidatorConfig for configuration. |
| PII Remover Validator `backend/app/safety/validators/pii_remover.py` | Implements PIIRemover validator integrating Presidio for English PII anonymization with language detection branching (English/Hinglish paths); includes PIIRemoverSafetyValidatorConfig with entity types and threshold tuning. |
| Ban List Validator `backend/app/safety/validators/ban_list_safety_validator_config.py` | Defines BanListSafetyValidatorConfig extending BaseValidatorConfig with banned_words list and post_init hook to lazily load hub validator class via ensure_hub_validator_installed and load_hub_validator_class. |
| Language Detector `backend/app/safety/utils/language_detector.py` | Wraps XLM-RoBERTa language classification model via Hugging Face pipeline; provides cached predict method with normalization for Hindi labels and convenience methods is_hindi and is_english. |
| Guardrails Engine Tests `backend/app/tests/safety/test_guardrails_engine.py` | Tests GuardrailsEngine initialization and validation runs with uli_slur_match, ban_list, and pii_remover validators; verifies slur redaction and PII anonymization in validated outputs. |
| Lexical Slur Validator Tests `backend/app/tests/safety/validators/test_lexical_slurs.py` | Tests LexicalSlur validator with temporary slur CSV fixtures; covers severity filtering, text normalization, slur detection, and redaction across multiple language and severity configurations. |
| PII Remover Validator Tests `backend/app/tests/safety/validators/test_pii_remover.py` | Tests PIIRemover initialization, validation paths (English and Hinglish), language detection integration, entity_types configuration, and Presidio mocking for anonymization outcomes. |
| Dependencies `backend/pyproject.toml` | Adds runtime dependencies: guardrails-ai (≥0.7.0), emoji, ftfy, presidio_analyzer, presidio_anonymizer, transformers, and torch. |
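
As a rough illustration of the lexical slur row above (normalize, match, redact with `[REDACTED_SLUR]`), here is a small sketch. It is an assumption-laden stand-in, not the validator from the PR: it uses only the standard library rather than the `emoji` and `ftfy` packages listed under Dependencies, the function names `normalize` and `redact` are hypothetical, and it ignores severity levels and the CSV slur list.

```python
import re
import unicodedata
from typing import Set

REDACTION_TOKEN = "[REDACTED_SLUR]"

# Unicode categories dropped before matching: "So" (other symbols, which
# covers most emoji) and "Sk" (modifier symbols such as skin-tone modifiers).
_SYMBOL_CATEGORIES = {"So", "Sk"}


def normalize(text: str) -> str:
    """Remove emoji-like symbols, strip punctuation, collapse spaces, lowercase."""
    text = "".join(
        ch for ch in text if unicodedata.category(ch) not in _SYMBOL_CATEGORIES
    )
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation becomes a space
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text.lower()


def redact(text: str, slurs: Set[str]) -> str:
    """Replace every normalized word found in `slurs` with the redaction token.

    Matching is whole-word and exact, so obfuscated spellings are not caught.
    The returned text is the normalized form, not the original casing.
    """
    words = normalize(text).split()
    return " ".join(REDACTION_TOKEN if word in slurs else word for word in words)
```

Returning the normalized text is a simplification. A production validator would more likely map matches back to offsets in the original string so that untouched text keeps its casing and punctuation.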

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant GuardrailsEngine
    participant HubLoader
    participant LanguageDetector
    participant Validators as Validators<br/>(Lexical Slur,<br/>Ban List, PII)
    participant ExternalServices as External<br/>(Guardrails Hub,<br/>Presidio, HF Model)

    User->>GuardrailsEngine: init(GuardrailConfigRoot)
    activate GuardrailsEngine
    GuardrailsEngine->>HubLoader: ensure_hub_validator_installed(type)
    activate HubLoader
    HubLoader->>HubLoader: is_importable(module_path)?
    alt Not Installed
        HubLoader->>ExternalServices: guardrails hub install
        ExternalServices-->>HubLoader: (installed)
    end
    HubLoader-->>GuardrailsEngine: (ready)
    deactivate HubLoader
    GuardrailsEngine->>HubLoader: load_hub_validator_class(type)
    HubLoader-->>GuardrailsEngine: validator_class
    GuardrailsEngine->>Validators: instantiate validators
    Validators-->>GuardrailsEngine: validator instances
    GuardrailsEngine-->>User: GuardrailsEngine ready
    deactivate GuardrailsEngine

    User->>GuardrailsEngine: run_input_validators(text)
    activate GuardrailsEngine
    GuardrailsEngine->>LanguageDetector: predict(text)
    activate LanguageDetector
    LanguageDetector->>ExternalServices: XLM-RoBERTa inference
    ExternalServices-->>LanguageDetector: lang_label
    LanguageDetector-->>GuardrailsEngine: {language, score}
    deactivate LanguageDetector
    
    rect rgb(200, 220, 255)
    note right of GuardrailsEngine: Run validators based on language
    end
    GuardrailsEngine->>Validators: run(text)
    alt Language == Hindi
        Validators->>Validators: Hinglish path (lexical slur)
    else Language == English
        Validators->>ExternalServices: Presidio anonymize (PII)
        ExternalServices-->>Validators: anonymized_text
    end
    Validators-->>GuardrailsEngine: validated_output
    GuardrailsEngine-->>User: validated_result
    deactivate GuardrailsEngine
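
The first half of the diagram (check importability, install from the hub if missing, then load the class) might look roughly like the sketch below. The helper names echo those in the summary, but their signatures, the `hub_uri` parameter, and the error handling are assumptions. `guardrails hub install <uri>` is the Guardrails CLI install command; it needs the CLI on `PATH` and, typically, a configured hub token.

```python
import importlib
import importlib.util
import subprocess


def is_importable(module_path: str) -> bool:
    """Return True if `module_path` can be resolved without importing it fully."""
    try:
        return importlib.util.find_spec(module_path) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or has no usable spec.
        return False


def ensure_hub_validator_installed(module_path: str, hub_uri: str) -> None:
    """Install a hub validator only when its module is not already importable.

    `hub_uri` is assumed to look like "hub://guardrails/ban_list".
    """
    if is_importable(module_path):
        return
    # check=True turns a failed install into CalledProcessError.
    subprocess.run(["guardrails", "hub", "install", hub_uri], check=True)
    # Let the import system notice packages installed after startup.
    importlib.invalidate_caches()


def load_hub_validator_class(module_path: str, class_name: str) -> type:
    """Import the module lazily and return the named validator class."""
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```

Installing packages at import or config-validation time has side effects (network access, a mutated environment), so reviewers may prefer that the install step run at build or deploy time, with the runtime path limited to `load_hub_validator_class`.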

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

- Text processing logic: Extensive normalization in lexical_slur.py (emoji removal, punctuation handling, lowercasing) and string redaction patterns require careful validation
- External service integration: Presidio AnalyzerEngine and AnonymizerEngine integration paths in pii_remover.py; language-branching logic (English vs. Hinglish) needs verification
- Hub validator mechanics: Lazy loading and installation via Guardrails CLI in hub_loader.py and BanListSafetyValidatorConfig post_init; import resolution logic warrants attention
- Language detection: ML model caching with lru_cache and label normalization in language_detector.py; ensure model initialization and scoring are correct
- GuardrailsEngine initialization: Validator instantiation pattern via model_dump and get_validator flow; Guard composition with use_many requires scrutiny
- Configuration and discriminated unions: ValidatorConfigItem with discriminator field usage across multiple config classes; serialization/deserialization paths
- Test coverage comprehensiveness: Multiple test files with mocking strategies (Presidio mocks, temp CSV fixtures) should align with implementation behavior
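
For the language detection item, the caching and label normalization concern can be made concrete with a sketch that injects the classifier instead of loading a model. Everything here is illustrative: the `LABEL_ALIASES` table, the constructor signature, and the cache size are assumptions, and the stub-friendly `classifier` is only assumed to return a list of `{"label", "score"}` dicts for a single string, as Hugging Face text-classification pipelines do.

```python
from functools import lru_cache
from typing import Callable, Dict, List, Tuple, Union

# Assumed classifier shape: text -> [{"label": str, "score": float}]
Classifier = Callable[[str], List[Dict[str, Union[str, float]]]]

# Hypothetical alias table; the labels the real model emits may differ.
LABEL_ALIASES = {"hindi": "hi", "hin": "hi", "english": "en", "eng": "en"}


class LanguageDetector:
    def __init__(self, classifier: Classifier, cache_size: int = 1024) -> None:
        self._classifier = classifier
        # Per-instance cache, so two detectors never share predictions.
        self._cached = lru_cache(maxsize=cache_size)(self._predict_uncached)

    def _predict_uncached(self, text: str) -> Tuple[str, float]:
        top = self._classifier(text)[0]
        label = str(top["label"]).strip().lower()
        # Return an immutable tuple so callers cannot corrupt cached entries.
        return LABEL_ALIASES.get(label, label), float(top["score"])

    def predict(self, text: str) -> Dict[str, Union[str, float]]:
        language, score = self._cached(text)
        return {"language": language, "score": score}

    def is_hindi(self, text: str) -> bool:
        return self.predict(text)["language"] == "hi"

    def is_english(self, text: str) -> bool:
        return self.predict(text)["language"] == "en"
```

Caching a tuple and building a fresh dict per call avoids a subtle bug where a caller mutates the returned dict and every later cache hit sees the change. Injecting the classifier also lets tests run without downloading model weights.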

Suggested labels

enhancement, ready-for-review

Suggested reviewers

avirajsingh7
AkhileshNegi
nishika26

Poem

🐰 A guardrail here, a slur blocked there,
Presidio cleanses with utmost care!
PII removed, toxins redacted fast,
Language-aware safety holds steadfast!
Config flows true through our engines' core,
Now users' content's protected more! 🛡️

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 17f427e and 6639589.

⛔ Files ignored due to path filters (2)

backend/app/safety/validators/lexical_slur/curated_slurlist_hi_en.csv is excluded by !**/*.csv
backend/uv.lock is excluded by !**/*.lock

📒 Files selected for processing (13)

backend/app/safety/guardrail_config.py (1 hunks)
backend/app/safety/guardrails_engine.py (1 hunks)
backend/app/safety/utils/language_detector.py (1 hunks)
backend/app/safety/validators/ban_list_safety_validator_config.py (1 hunks)
backend/app/safety/validators/base_validator_config.py (1 hunks)
backend/app/safety/validators/constants.py (1 hunks)
backend/app/safety/validators/hub_loader.py (1 hunks)
backend/app/safety/validators/lexical_slur.py (1 hunks)
backend/app/safety/validators/pii_remover.py (1 hunks)
backend/app/tests/safety/test_guardrails_engine.py (1 hunks)
backend/app/tests/safety/validators/test_lexical_slurs.py (1 hunks)
backend/app/tests/safety/validators/test_pii_remover.py (1 hunks)
backend/pyproject.toml (1 hunks)

rkritika1508 and others added 27 commits November 25, 2025 13:25

Added code for lexical slur and PII remover validators

0af12f5

refactor: move directory

3401b19

Added PII remover and lexical slur detection validators

3efe7e8

Resolved comments

171d3e1

Updated code

80977e2

Removed redundant code

6efef3d

Renamed files acc to python convention

30faf31

fixed UTs

e5ba7d1

chore: code reorganization

017e2a4

Updated code and fixed test cases

0b05a99

updated unit test

34817d5

Removed few files

a36b823

reversed uv.lock

c490ef1

updated uv.lock

5bf8633

cleanup up code

b62f47c

Cleaned up code

bac48ce

chore: refactor

7caea29

chore: cleanup

58b0fbc

chore: cleanup

f6b2139

chore: cleanup

924b230

Added the banlist validator

2e2454d

Renamed file

180aa82

Merge branch 'main' into feature/ai-safety-tattle

ea079f2

Add PII remover code

778ff58

Fixed guardrail config

a3b013b

Merge branch 'feature/ai-safety-tattle' into feature/ai-safety-demo-1.1

37af902

Added PII remover code

6639589

rkritika1508 closed this Dec 3, 2025
