An intelligent certificate management tool that automatically categorizes and organizes PDF certificates by company names using Google's Gemini AI. Perfect for professionals, students, and organizations managing large collections of digital certificates.
Certificate Segregator leverages cutting-edge AI technology to solve the common problem of certificate organization. Instead of manually sorting through dozens or hundreds of certificates, this tool automatically reads, analyzes, and categorizes your certificates by company name in seconds.
- AI-Powered Analysis: Uses Google Gemini 1.5 Flash to extract company names from certificate PDFs with high accuracy
- Automatic Organization: Creates folders and sorts certificates by company automatically, with no manual work required
- Batch Processing: Upload and process multiple certificates simultaneously for maximum efficiency
- User-Friendly Interface: Clean, intuitive Streamlit web interface accessible from any browser
- PDF Support: Converts PDF certificates to images for optimal AI processing
- Error Handling: Robust error handling with informative messages and graceful failure recovery
- Safe Storage: Preserves original certificate quality while organizing files systematically
- Smart Recognition: Handles various certificate formats and layouts intelligently
Upload your certificates, and watch as they get automatically organized by company in real-time!
Problem | Solution |
---|---|
Hundreds of unsorted certificates | Instant AI-powered organization |
Hours of manual sorting | Process multiple files in seconds |
Difficult to find specific certificates | Clear company-based folder structure |
Inconsistent naming conventions | AI extracts accurate company names |
Professional portfolio management | Clean, systematic organization |
# 1. Clone and navigate
git clone https://github.com/harshajustin/Certificate-Clustering.git
cd Certificate-Clustering
# 2. Install dependencies
pip install -r requirements.txt
# 3. Install Poppler (macOS)
brew install poppler
# 4. Create .env file with your Gemini API key
echo "key=YOUR_GEMINI_API_KEY" > .env
# 5. Run the app
streamlit run main.py
Requirement | Version | Purpose |
---|---|---|
Python | 3.7+ | Core runtime environment |
Google AI API Key | Latest | Gemini AI access |
Poppler | Latest | PDF processing library |
Web Browser | Modern | Streamlit interface |
- Clone the repository
  git clone https://github.com/harshajustin/Certificate-Clustering.git
  cd Certificate-Clustering
- Install required packages
  pip install -r requirements.txt
- Install Poppler (required for pdf2image)
  On macOS:
  brew install poppler
  On Ubuntu/Debian:
  sudo apt-get install poppler-utils
  On Windows:
  - Download from poppler for Windows
  - Add to PATH
- Set up environment variables
  Create a .env file in the project root:
  key=your_google_gemini_api_key_here
  To get a Google AI API key:
  - Visit Google AI Studio
  - Create a new API key
  - Copy and paste it into your .env file
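The project loads this variable with python-dotenv. As a rough illustration of what that lookup involves, here is a standard-library-only sketch; the helper name `read_env_value` is hypothetical and not part of the project.

```python
from pathlib import Path


def read_env_value(env_path, name="key"):
    """Return the value for `name` from a KEY=VALUE file, or None if absent."""
    path = Path(env_path)
    if not path.is_file():
        return None
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        found_name, _, value = line.partition("=")
        if found_name.strip() == name:
            # Tolerate stray spaces or quotes around the value.
            return value.strip().strip("'\"") or None
    return None
```

Calling `read_env_value(".env")` from the project root should return the API key string, or `None` when the file or the `key` entry is missing.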
streamlit run main.py
The application will open in your default web browser at http://localhost:8501
For easy deployment using Docker:
# 1. Clone and navigate
git clone https://github.com/harshajustin/Certificate-Clustering.git
cd Certificate-Clustering
# 2. Set up environment
cp .env.example .env
# Edit .env and add your Gemini API key
# 3. Run with Docker Compose
docker-compose up -d
# 4. Access at http://localhost:8501
# Build the image
docker build -t certificate-segregator .
# Run the container
docker run -d -p 8501:8501 --env-file .env certificate-segregator
For detailed Docker instructions, see DOCKER.md
- Launch the Application: Run the Streamlit app using the command above
- Upload Certificates: Click "Browse files" and select one or more PDF certificates
- Process: Click the "Submit" button to start processing
- View Results: The app will:
- Extract company names from each certificate
- Create folders named after each company
- Save certificates in their respective company folders
- Display success/error messages for each file
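The per-file success and error reporting described above could be structured roughly as follows. This is a sketch, not the code in main.py: `process_batch` is a hypothetical name, and `extract_company` is a stand-in callable for the Gemini request.

```python
def process_batch(files, extract_company):
    """Run `extract_company` on each (filename, data) pair and collect outcomes.

    `extract_company` is a hypothetical callable standing in for the Gemini
    request; it takes the PDF bytes and returns a company name.
    """
    results = []
    for filename, data in files:
        try:
            company = extract_company(data)
            if not company:
                raise ValueError("no company name returned")
            results.append((filename, True, company))
        except Exception as exc:
            # Record the failure and keep going, so one bad file
            # does not stop the rest of the batch.
            results.append((filename, False, str(exc)))
    return results
```

Each result tuple holds the file name, a success flag, and either the company name or the error text, which maps directly onto one message per uploaded file.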
Certificate-Clustering/
├── main.py # Main application file
├── requirements.txt # Python dependencies
├── .env # Environment variables (create this)
├── .gitignore # Git ignore file
├── README.md # This file
└── certificates/ # Auto-created folder for organized certificates
    ├── Company1/
    │   └── Company1_certificate.pdf
    ├── Company2/
    │   └── Company2_certificate.pdf
    └── ...
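A layout like the one above can be produced with a few lines of pathlib. The sketch below is an illustration under assumptions: both function names are hypothetical, and the character sanitizing and the numeric suffix for repeated companies are choices made here, not confirmed behavior of the project.

```python
import re
from pathlib import Path


def safe_folder_name(company):
    """Turn a company name into a name that is safe to use as a folder."""
    # Replace characters that are invalid on common filesystems.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", company).strip(" .")
    return cleaned or "Unknown"


def save_to_company_folder(pdf_bytes, company, base_dir="certificates"):
    """Write the PDF under base_dir/<company>/ and return the path written."""
    folder_name = safe_folder_name(company)
    folder = Path(base_dir) / folder_name
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{folder_name}_certificate.pdf"
    counter = 1
    # Assumption: avoid overwriting an earlier certificate from the same company.
    while target.exists():
        counter += 1
        target = folder / f"{folder_name}_certificate_{counter}.pdf"
    target.write_bytes(pdf_bytes)
    return target
```

Writing the original bytes unchanged keeps the saved certificate identical to the uploaded file.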
graph TD
A[PDF Upload] --> B[PDF to Image Conversion]
B --> C[Base64 Encoding]
C --> D[Gemini AI Analysis]
D --> E[Company Name Extraction]
E --> F[Folder Creation]
F --> G[Certificate Organization]
G --> H[Success Notification]
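The first stages of this flow can be sketched with the rendering and model steps passed in as functions. `render_pages` and `ask_model` are hypothetical stand-ins for pdf2image and the Gemini request, so the example runs without either dependency; only the Base64 step uses a real library call.

```python
import base64


def encode_pages(page_images):
    """Base64-encode each rendered page (raw image bytes) as ASCII text."""
    return [base64.b64encode(page).decode("ascii") for page in page_images]


def run_pipeline(pdf_bytes, render_pages, ask_model):
    """Chain the stages: render pages, encode them, ask the model for a name.

    `render_pages` and `ask_model` are hypothetical callables standing in
    for pdf2image and the Gemini request.
    """
    pages = render_pages(pdf_bytes)
    if not pages:
        raise ValueError("PDF produced no pages")
    encoded = encode_pages(pages)
    company = ask_model(encoded).strip()
    if not company:
        raise ValueError("model returned an empty company name")
    return company
```

Passing the external steps in as parameters also makes the flow easy to exercise in tests with fake functions.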
Function | Purpose | Key Features |
---|---|---|
process_uploaded_pdf() | PDF Processing | Converts PDF to base64-encoded images, handles multiple pages |
get_company_name_from_pdf() | AI Analysis | Uses Gemini AI to extract company names with context awareness |
save_certificate_to_company_folder() | File Organization | Creates company folders and saves certificates systematically |
create_streamlit_ui() | User Interface | Provides intuitive web interface with progress indicators |
Package | Version | Purpose | Key Features |
---|---|---|---|
streamlit | Latest | Web interface framework | Interactive UI, file uploads, real-time feedback |
google-generativeai | Latest | Google Gemini AI integration | Text extraction, company name recognition |
pdf2image | Latest | PDF to image conversion | High-quality rendering, multi-page support |
python-dotenv | Latest | Environment variable management | Secure API key handling |
pillow | Latest | Image processing support | Format conversion, optimization |
- HR Departments: Organize employee training certificates
- Consultants: Manage client project certificates
- Freelancers: Maintain professional certification portfolio
- Course Completion: Sort online learning certificates
- Academic Records: Organize educational achievements
- Skill Development: Track certification progress
- Compliance Teams: Manage regulatory certificates
- Training Departments: Track employee certifications
- Quality Assurance: Organize vendor certificates
Metric | Performance |
---|---|
Processing Speed | ~2-3 seconds per certificate |
Accuracy Rate | 95%+ company name extraction |
Supported Formats | PDF (all versions) |
Batch Size | Unlimited (memory dependent) |
File Size Limit | Up to 200MB per file |
Variable | Description | Required | Example |
---|---|---|---|
key | Google Gemini API key | Yes | AIzaSyD... |
Format | Extension | Max Size | Notes |
---|---|---|---|
PDF | .pdf | 200MB | All PDF versions supported |
Create a config.yaml file for advanced settings:
# Advanced Configuration (Optional)
processing:
max_file_size: 200MB
timeout: 30s
retry_attempts: 3
ai_settings:
model: "gemini-1.5-flash"
temperature: 0.1
max_tokens: 1000
folders:
base_path: "./certificates"
naming_convention: "{company_name}_certificate"
create_subfolders: true
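Values such as `200MB` and `30s` are strings once the YAML is loaded, so they need converting before use. The helpers below are a hypothetical sketch of that conversion, assuming binary units (1 MB = 1024 * 1024 bytes); the README does not say how the application interprets these settings.

```python
import re

# Assumption: sizes use binary multiples.
_SIZE_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}
_TIME_UNITS = {"ms": 0.001, "s": 1, "m": 60}


def parse_size(text):
    """Convert a string such as '200MB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?B)\s*", text, flags=re.IGNORECASE)
    if not match:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _SIZE_UNITS[match.group(2).upper()]


def parse_duration(text):
    """Convert a string such as '30s' into seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*(ms|s|m)\s*", text)
    if not match:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(match.group(1)) * _TIME_UNITS[match.group(2)]
```

Rejecting unrecognised strings with a `ValueError` surfaces typos in the config file early instead of silently falling back to a default.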
sequenceDiagram
participant User
participant App
participant Gemini
participant FileSystem
User->>App: Upload PDF certificates
App->>App: Convert PDF to images
App->>Gemini: Send image for analysis
Gemini->>App: Return company name
App->>FileSystem: Create company folder
App->>FileSystem: Save certificate
App->>User: Display success message
API Key Issues
Problem: "Google API Key not found" Error
Solutions:
- Ensure your .env file exists in the project root
- Verify the key is named exactly key in the .env file
- Check for extra spaces or quotes around the API key
- Verify your API key is active at Google AI Studio
# Correct .env format
key=AIzaSyD1234567890abcdef
PDF Processing Errors
Problem: PDF files won't process
Solutions:
- Ensure Poppler is installed correctly
- Check that uploaded files are valid PDF documents
- Verify file size is under 200MB
- Try with a different PDF to isolate the issue
# Test Poppler installation
pdftoppm -h
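The "valid PDF" and "under 200MB" checks can also be scripted before uploading. This is a minimal sketch, assuming 1 MB = 1024 * 1024 bytes; `check_pdf` is a hypothetical helper, and the header test only looks for the `%PDF-` marker rather than validating the full file.

```python
MAX_BYTES = 200 * 1024 * 1024  # the 200MB limit, assuming binary megabytes


def check_pdf(data, max_bytes=MAX_BYTES):
    """Return a list of problems found with an uploaded file; empty means OK."""
    problems = []
    if len(data) == 0:
        problems.append("file is empty")
    elif b"%PDF-" not in data[:1024]:
        # PDF files carry a "%PDF-" header near the start of the file.
        problems.append("missing %PDF- header; not a PDF?")
    if len(data) > max_bytes:
        problems.append(f"file is larger than {max_bytes} bytes")
    return problems
```

For example, `check_pdf(open("certificate.pdf", "rb").read())` returns an empty list for a file that passes both checks.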
File Organization Issues
Problem: Certificates not saving properly
Solutions:
- Ensure write permissions in the project directory
- Check available disk space
- Verify the certificates/ folder can be created
- Close any open certificate files
AI Extraction Issues
Problem: Company names not extracted correctly
Solutions:
- Some certificates may have unclear text or unusual formatting
- Try preprocessing the PDF (ensure text is selectable)
- Check if the certificate contains readable text
- Verify your API quota hasn't been exceeded
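Temporary API failures, such as a briefly exhausted quota, are often handled by retrying with a growing delay. The wrapper below is a generic sketch of that pattern, not something the project is known to include; `call_with_retries` and its parameters are hypothetical.

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `func` until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Wait 1x, 2x, 4x ... the base delay between tries.
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use the default `time.sleep` applies.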
# Check Python version
python --version
# Verify package installation
pip list | grep -E "(streamlit|google-generativeai|pdf2image)"
# Test Poppler
pdftoppm -v
# Check file permissions
ls -la certificates/
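The same checks can be gathered from Python, which is convenient on Windows where `grep` and `ls` are not available. This sketch uses only the standard library; `environment_report` is a hypothetical helper, and the default module names are the import names of streamlit, pdf2image, python-dotenv, and pillow.

```python
import importlib.util
import shutil
import sys


def environment_report(modules=("streamlit", "pdf2image", "dotenv", "PIL")):
    """Collect basic facts about the local setup as a dictionary."""
    report = {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        # pdftoppm ships with Poppler; None means it is not on PATH.
        "pdftoppm": shutil.which("pdftoppm"),
    }
    for name in modules:
        # True when the top-level module can be found without importing it.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

The printed report is also a convenient thing to paste into a bug report.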
We welcome contributions from the community! Here's how you can help make Certificate Segregator even better:
- Fork the repository
- Create a feature branch
git checkout -b feature/amazing-feature
- Make your changes
- Test thoroughly
- Commit with descriptive messages
git commit -m 'Add: Enhanced AI accuracy for handwritten certificates'
- Push to your branch
git push origin feature/amazing-feature
- Open a Pull Request
Area | Description | Difficulty |
---|---|---|
AI Improvements | Enhance company name extraction accuracy | Advanced |
UI/UX | Improve interface design and user experience | Intermediate |
Analytics | Add processing statistics and insights | Intermediate |
Performance | Optimize processing speed and memory usage | Advanced |
Documentation | Improve docs, add tutorials, create videos | Beginner |
Testing | Add unit tests, integration tests | Intermediate |
Localization | Add multi-language support | Intermediate |
# 1. Clone your fork
git clone https://github.com/YOUR_USERNAME/Certificate-Clustering.git
cd Certificate-Clustering
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# 3. Install development dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt # If available
# 4. Set up pre-commit hooks
pre-commit install
# 5. Run tests
python -m pytest
Found a bug? Please create an issue with:
- Clear description of the problem
- Steps to reproduce
- System information (OS, Python version)
- Sample files (if applicable)
- Screenshots (if relevant)
Have an idea? We'd love to hear it! Include:
- Clear description of the feature
- Why it would be valuable
- Possible implementation approach
- Expected impact
Thanks to all the amazing people who have contributed to this project!
Feature | Status | ETA | Priority |
---|---|---|---|
Advanced Search | In Progress | Q3 2025 | High |
Analytics Dashboard | Planned | Q4 2025 | Medium |
Multi-language Support | Concept | 2026 | Low |
Mobile App | Concept | TBD | Medium |
Cloud Integration | Concept | TBD | High |
- v1.0.0 (Current) - Initial release with core functionality
- v0.9.0 - Beta testing phase
- v0.1.0 - Alpha prototype
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2025 Harsha Justin
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software...
Special thanks to the amazing technologies and communities that made this project possible:
- Google AI - For the powerful Gemini API
- Streamlit - For the incredible web framework
- pdf2image - For seamless PDF processing
- Python Community - For the amazing ecosystem
- Open Source Community - For inspiration and collaboration
- Contributors - For making this project better every day
Need help or want to connect with other users?
Channel | Purpose | Response Time |
---|---|---|
GitHub Issues | Bug reports, feature requests | 24-48 hours |
Discussions | Questions, ideas, showcase | 24-48 hours |
Email | Private inquiries | 2-3 business days |
- Check the FAQ section
- Search existing issues
- Read the documentation thoroughly
- Try the troubleshooting steps
When reporting bugs, please include:
**Environment:**
- OS: [e.g., macOS 12.0, Windows 11, Ubuntu 20.04]
- Python version: [e.g., 3.9.7]
- Package versions: [run `pip list`]
**Steps to reproduce:**
1. Go to '...'
2. Click on '....'
3. Upload file '....'
4. See error
**Expected behavior:**
A clear description of what you expected to happen.
**Actual behavior:**
A clear description of what actually happened.
**Additional context:**
Add any other context about the problem here.
Made with ❤️ by Harsha Justin
If this project helped you, please consider giving it a ⭐ on GitHub!