CognicAI/Certificate-Clustering

🏅 Certificate Segregator

Python Streamlit License: MIT Gemini AI

An intelligent certificate management tool that automatically categorizes and organizes PDF certificates by company names using Google's Gemini AI. Perfect for professionals, students, and organizations managing large collections of digital certificates.

🎯 Overview

Certificate Segregator leverages cutting-edge AI technology to solve the common problem of certificate organization. Instead of manually sorting through dozens or hundreds of certificates, this tool automatically reads, analyzes, and categorizes your certificates by company name in seconds.

✨ Features

  • 🤖 AI-Powered Analysis: Uses Google Gemini 1.5 Flash to extract company names from certificate PDFs with high accuracy
  • 📁 Automatic Organization: Creates folders and sorts certificates by company automatically - no manual work required
  • ⚡ Batch Processing: Upload and process multiple certificates simultaneously for maximum efficiency
  • 🎨 User-Friendly Interface: Clean, intuitive Streamlit web interface accessible from any browser
  • 📄 PDF Support: Converts PDF certificates to images for optimal AI processing
  • 🛡️ Error Handling: Robust error handling with informative messages and graceful failure recovery
  • 💾 Safe Storage: Preserves original certificate quality while organizing files systematically
  • 🔍 Smart Recognition: Handles various certificate formats and layouts intelligently

🎬 Demo

Certificate Segregator Demo

Upload your certificates, and watch as they get automatically organized by company in real-time!

🔥 Why Choose Certificate Segregator?

| Problem | Solution |
| --- | --- |
| 📚 Hundreds of unsorted certificates | ⚡ Instant AI-powered organization |
| ⏰ Hours of manual sorting | 🚀 Process multiple files in seconds |
| 😵 Difficult to find specific certificates | 🎯 Clear company-based folder structure |
| 🤔 Inconsistent naming conventions | 🤖 AI extracts accurate company names |
| 💼 Professional portfolio management | 📊 Clean, systematic organization |

🚀 Quick Start

⚡ TL;DR - Get Started in 5 Minutes

# 1. Clone and navigate
git clone https://github.com/harshajustin/Certificate-Clustering.git
cd Certificate-Clustering

# 2. Install dependencies
pip install -r requirements.txt

# 3. Install Poppler (macOS)
brew install poppler

# 4. Create .env file with your Gemini API key
echo "key=YOUR_GEMINI_API_KEY" > .env

# 5. Run the app
streamlit run main.py

📋 Prerequisites

| Requirement | Version | Purpose |
| --- | --- | --- |
| Python | 3.7+ | Core runtime environment |
| Google AI API Key | Latest | Gemini AI access |
| Poppler | Latest | PDF processing library |
| Web Browser | Modern | Streamlit interface |

Installation

  1. Clone the repository

    git clone https://github.com/harshajustin/Certificate-Clustering.git
    cd Certificate-Clustering
  2. Install required packages

    pip install -r requirements.txt
  3. Install Poppler (required for pdf2image)

    On macOS:

    brew install poppler

    On Ubuntu/Debian:

    sudo apt-get install poppler-utils

    On Windows:

    Download a Poppler build for Windows, extract it, and add its bin folder to your PATH.

  4. Set up environment variables

    Create a .env file in the project root:

    key=your_google_gemini_api_key_here

    To get a Google AI API key:

    • Visit Google AI Studio
    • Create a new API key
    • Copy and paste it into your .env file
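
A minimal sketch of how the key could be read at startup, assuming the app loads .env through python-dotenv (listed in the dependencies). `read_api_key` is a hypothetical helper for illustration, not a function taken from main.py.

```python
import os

try:
    # python-dotenv is listed in the project's dependencies.
    from dotenv import load_dotenv
except ImportError:  # keep the sketch usable when the package is missing
    load_dotenv = None


def read_api_key(env=None):
    """Return the Gemini API key stored under the variable name `key`.

    Hypothetical helper: pass a dict for testing, or nothing to use the
    process environment (after loading .env when python-dotenv is present).
    """
    if env is None:
        if load_dotenv is not None:
            load_dotenv()  # adds entries from .env to os.environ
        env = os.environ
    # Tolerate stray spaces or quotes around the value.
    value = env.get("key", "").strip().strip("'\"")
    if not value:
        raise RuntimeError("Google API Key not found: add key=... to your .env file")
    return value
```

Failing early with a clear message keeps a missing or empty key from surfacing later as an opaque API error.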

Running the Application

streamlit run main.py

The application will open in your default web browser at http://localhost:8501

🐳 Docker Deployment

For easy deployment using Docker:

Quick Docker Setup

# 1. Clone and navigate
git clone https://github.com/harshajustin/Certificate-Clustering.git
cd Certificate-Clustering

# 2. Set up environment
cp .env.example .env
# Edit .env and add your Gemini API key

# 3. Run with Docker Compose
docker-compose up -d

# 4. Access at http://localhost:8501

Alternative Docker Commands

# Build the image
docker build -t certificate-segregator .

# Run the container
docker run -d -p 8501:8501 --env-file .env certificate-segregator

📖 For detailed Docker instructions, see DOCKER.md

📖 How to Use

  1. Launch the Application: Run the Streamlit app using the command above
  2. Upload Certificates: Click "Browse files" and select one or more PDF certificates
  3. Process: Click the "Submit" button to start processing
  4. View Results: The app will:
    • Extract company names from each certificate
    • Create folders named after each company
    • Save certificates in their respective company folders
    • Display success/error messages for each file
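
The per-file behaviour described in step 4 can be sketched as a loop that records one result per certificate and keeps going when a file fails. `process_batch` and its two callable parameters are hypothetical names; the real app may structure this differently.

```python
def process_batch(files, extract_company, save):
    """Process (filename, pdf_bytes) pairs; one failure never stops the batch.

    extract_company(pdf_bytes) -> str and save(pdf_bytes, company) -> path are
    passed in, so this sketch does not depend on Gemini or the file system.
    """
    results = []
    for filename, pdf_bytes in files:
        try:
            company = extract_company(pdf_bytes)
            if not company:
                raise ValueError("no company name found")
            path = save(pdf_bytes, company)
            results.append((filename, True, f"Saved to {path}"))
        except Exception as error:  # report the failure and continue
            results.append((filename, False, f"Error: {error}"))
    return results
```

Each `(filename, ok, message)` tuple maps naturally onto a success or error message in the UI.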

📁 Project Structure

Certificate-Clustering/
├── main.py                 # Main application file
├── requirements.txt        # Python dependencies
├── .env                    # Environment variables (create this)
├── .gitignore              # Git ignore file
├── README.md               # This file
└── certificates/           # Auto-created folder for organized certificates
    ├── Company1/
    │   └── Company1_certificate.pdf
    ├── Company2/
    │   └── Company2_certificate.pdf
    └── ...
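
A rough sketch of how the certificates/ layout above could be produced, using only the standard library. The helper names, the character filter, and the numeric suffix for repeated companies are assumptions for illustration; the README does not say how main.py handles name collisions.

```python
import re
from pathlib import Path


def safe_folder_name(company_name):
    """Turn a company name into a folder-safe string."""
    # Replace characters that are invalid in Windows or POSIX file names.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", company_name).strip(" .")
    return cleaned or "Unknown"


def save_certificate(pdf_bytes, company_name, base_path="certificates"):
    """Write pdf_bytes to <base_path>/<Company>/<Company>_certificate.pdf."""
    company = safe_folder_name(company_name)
    folder = Path(base_path) / company
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{company}_certificate.pdf"
    counter = 2
    # Assumption: add a numeric suffix rather than overwrite an existing file.
    while target.exists():
        target = folder / f"{company}_certificate_{counter}.pdf"
        counter += 1
    target.write_bytes(pdf_bytes)
    return target
```

Writing the original bytes unchanged keeps the stored certificate identical to the uploaded file.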

🔧 Technical Details

🏗️ Architecture

graph TD
    A[PDF Upload] --> B[PDF to Image Conversion]
    B --> C[Base64 Encoding]
    C --> D[Gemini AI Analysis]
    D --> E[Company Name Extraction]
    E --> F[Folder Creation]
    F --> G[Certificate Organization]
    G --> H[Success Notification]

🧠 Core Functions

| Function | Purpose | Key Features |
| --- | --- | --- |
| process_uploaded_pdf() | PDF Processing | Converts PDF to base64-encoded images, handles multiple pages |
| get_company_name_from_pdf() | AI Analysis | Uses Gemini AI to extract company names with context awareness |
| save_certificate_to_company_folder() | File Organization | Creates company folders and saves certificates systematically |
| create_streamlit_ui() | User Interface | Provides intuitive web interface with progress indicators |
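
As an illustration of the PDF processing step, here is a hedged sketch of converting a PDF into base64-encoded page images with pdf2image. The function names and the PNG/200 dpi choices are assumptions, not the actual body of process_uploaded_pdf().

```python
import base64
import io


def image_to_base64(image, fmt="PNG"):
    """Encode one page image (any object with a PIL-style save()) as base64 text."""
    buffer = io.BytesIO()
    image.save(buffer, format=fmt)
    return base64.b64encode(buffer.getvalue()).decode("ascii")


def pdf_to_base64_pages(pdf_bytes, dpi=200):
    """Render each PDF page with pdf2image and return one base64 string per page."""
    # Imported here because pdf2image needs Poppler installed on the system.
    from pdf2image import convert_from_bytes

    pages = convert_from_bytes(pdf_bytes, dpi=dpi)
    return [image_to_base64(page) for page in pages]
```

For a Streamlit upload, `pdf_to_base64_pages(uploaded_file.getvalue())` would yield one string per page; sending only the first page to the model is often enough for a single-page certificate.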

📦 Dependencies Deep Dive

| Package | Version | Purpose | Key Features |
| --- | --- | --- | --- |
| streamlit | Latest | Web interface framework | Interactive UI, file uploads, real-time feedback |
| google-generativeai | Latest | Google Gemini AI integration | Text extraction, company name recognition |
| pdf2image | Latest | PDF to image conversion | High-quality rendering, multi-page support |
| python-dotenv | Latest | Environment variable management | Secure API key handling |
| pillow | Latest | Image processing support | Format conversion, optimization |

🎯 Use Cases

👨‍💼 Professionals

  • HR Departments: Organize employee training certificates
  • Consultants: Manage client project certificates
  • Freelancers: Maintain professional certification portfolio

🎓 Students

  • Course Completion: Sort online learning certificates
  • Academic Records: Organize educational achievements
  • Skill Development: Track certification progress

🏢 Organizations

  • Compliance Teams: Manage regulatory certificates
  • Training Departments: Track employee certifications
  • Quality Assurance: Organize vendor certificates

📊 Performance Metrics

| Metric | Performance |
| --- | --- |
| Processing Speed | ~2-3 seconds per certificate |
| Accuracy Rate | 95%+ company name extraction |
| Supported Formats | PDF (all versions) |
| Batch Size | Unlimited (memory dependent) |
| File Size Limit | Up to 200MB per file |

🛠️ Configuration

🔐 Environment Variables

| Variable | Description | Required | Example |
| --- | --- | --- | --- |
| key | Google Gemini API key | ✅ Yes | AIzaSyD... |

📄 Supported File Types

| Format | Extension | Max Size | Notes |
| --- | --- | --- | --- |
| PDF | .pdf | 200MB | All PDF versions supported |
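
The limits in this table could be enforced before any AI call with a small check like the one below. This is a sketch: `validate_upload` is a hypothetical helper, and treating 200MB as 200 × 1024 × 1024 bytes is an assumption.

```python
MAX_UPLOAD_BYTES = 200 * 1024 * 1024  # the 200MB limit from the table above


def validate_upload(filename, data, max_bytes=MAX_UPLOAD_BYTES):
    """Return None if the upload looks like an acceptable PDF, else an error message."""
    if not filename.lower().endswith(".pdf"):
        return f"{filename}: only .pdf files are supported"
    if len(data) > max_bytes:
        return f"{filename}: file exceeds the {max_bytes // (1024 * 1024)}MB limit"
    # Well-formed PDF files begin with the "%PDF-" signature.
    if not data.startswith(b"%PDF-"):
        return f"{filename}: file does not look like a valid PDF"
    return None
```

Checking the signature catches renamed non-PDF files early, before Poppler or the API is involved.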

⚙️ Advanced Configuration

Create a config.yaml file for advanced settings:

# Advanced Configuration (Optional)
processing:
  max_file_size: 200MB
  timeout: 30s
  retry_attempts: 3

ai_settings:
  model: "gemini-1.5-flash"
  temperature: 0.1
  max_tokens: 1000

folders:
  base_path: "./certificates"
  naming_convention: "{company_name}_certificate"
  create_subfolders: true

🔄 Workflow

sequenceDiagram
    participant User
    participant App
    participant Gemini
    participant FileSystem

    User->>App: Upload PDF certificates
    App->>App: Convert PDF to images
    App->>Gemini: Send image for analysis
    Gemini->>App: Return company name
    App->>FileSystem: Create company folder
    App->>FileSystem: Save certificate
    App->>User: Display success message
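
The App-to-Gemini exchange in this diagram can be sketched as follows. The prompt text and helper names are hypothetical, since the project's real prompt is not shown here; `model` is expected to behave like a `GenerativeModel` from the google-generativeai package.

```python
import base64

# Hypothetical prompt; the project's actual prompt is not shown in this README.
PROMPT = (
    "This image is a certificate. Reply with only the name of the company "
    "or organization that issued it."
)


def clean_company_name(raw_text):
    """Reduce a model reply to its first non-empty line, without quotes or markup."""
    lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
    return lines[0].strip("\"'*` ") if lines else ""


def get_company_name(model, image_b64, mime_type="image/png"):
    """Send one base64-encoded page to the model and return the company name."""
    # The SDK accepts inline image data as a dict of MIME type and raw bytes.
    image_part = {"mime_type": mime_type, "data": base64.b64decode(image_b64)}
    response = model.generate_content([PROMPT, image_part])
    return clean_company_name(response.text)
```

With the real SDK, `model` would be created by calling `genai.configure(api_key=...)` and then `genai.GenerativeModel("gemini-1.5-flash")`. Passing the model in as a parameter keeps the function easy to test with a stand-in object.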

🐛 Troubleshooting

🚨 Common Issues & Solutions

🔑 API Key Issues

Problem: "Google API Key not found" Error

Solutions:

  • ✅ Ensure your .env file exists in the project root
  • ✅ Verify the key is named exactly key in the .env file
  • ✅ Check for extra spaces or quotes around the API key
  • ✅ Verify your API key is active at Google AI Studio

# Correct .env format
key=AIzaSyD1234567890abcdef

📄 PDF Processing Errors

Problem: PDF files won't process

Solutions:

  • ✅ Ensure Poppler is installed correctly
  • ✅ Check that uploaded files are valid PDF documents
  • ✅ Verify file size is under 200MB
  • ✅ Try with a different PDF to isolate the issue

# Test Poppler installation
pdftoppm -h

📁 File Organization Issues

Problem: Certificates not saving properly

Solutions:

  • ✅ Ensure write permissions in the project directory
  • ✅ Check available disk space
  • ✅ Verify the certificates/ folder can be created
  • ✅ Close any open certificate files

🤖 AI Extraction Issues

Problem: Company names not extracted correctly

Solutions:

  • ✅ Some certificates may have unclear text or unusual formatting
  • ✅ Try preprocessing the PDF (ensure text is selectable)
  • ✅ Check if the certificate contains readable text
  • ✅ Verify your API quota hasn't been exceeded
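
When failures come from temporary quota or rate limits, retrying with an increasing delay often helps. The sketch below is a generic retry wrapper, not code from the project; `call_with_retries` is a hypothetical name, and in real use the broad `except Exception` should be narrowed to the errors the API client actually raises.

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after each failure wait base_delay, then twice that, and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

For example, `call_with_retries(lambda: extract(page))` would try a hypothetical `extract` call up to three times, pausing 1 and then 2 seconds between attempts.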

📊 Diagnostic Commands

# Check Python version
python --version

# Verify package installation
pip list | grep -E "(streamlit|google-generativeai|pdf2image)"

# Test Poppler
pdftoppm -v

# Check file permissions
ls -la certificates/
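
The same checks can be run from Python, for instance at app startup. This is a sketch with hypothetical names; it looks for the `pdftoppm` binary that Poppler provides and for the `key` variable described above.

```python
import os
import shutil
import sys


def preflight(env=None, min_python=(3, 7)):
    """Return a dict of check name -> bool for the most common setup problems."""
    env = os.environ if env is None else env
    return {
        "python_version": sys.version_info >= min_python,
        # pdf2image shells out to Poppler's pdftoppm, so it must be on PATH.
        "poppler_on_path": shutil.which("pdftoppm") is not None,
        "api_key_set": bool(env.get("key", "").strip()),
        # The certificates/ folder is created inside the current directory.
        "project_dir_writable": os.access(".", os.W_OK),
    }
```

Any entry that comes back `False` points to the matching troubleshooting section above.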

🤝 Contributing

We welcome contributions from the community! Here's how you can help make Certificate Segregator even better:

🚀 Quick Contribution Guide

  1. 🍴 Fork the repository
  2. 🌿 Create a feature branch
    git checkout -b feature/amazing-feature
  3. 💻 Make your changes
  4. ✅ Test thoroughly
  5. 📝 Commit with descriptive messages
    git commit -m 'Add: Enhanced AI accuracy for handwritten certificates'
  6. 📤 Push to your branch
    git push origin feature/amazing-feature
  7. 🔀 Open a Pull Request

🎯 Areas for Contribution

| Area | Description | Difficulty |
| --- | --- | --- |
| 🤖 AI Improvements | Enhance company name extraction accuracy | Advanced |
| 🎨 UI/UX | Improve interface design and user experience | Intermediate |
| 📊 Analytics | Add processing statistics and insights | Intermediate |
| 🔧 Performance | Optimize processing speed and memory usage | Advanced |
| 📚 Documentation | Improve docs, add tutorials, create videos | Beginner |
| 🧪 Testing | Add unit tests, integration tests | Intermediate |
| 🌍 Localization | Add multi-language support | Intermediate |

📋 Development Setup

# 1. Clone your fork
git clone https://github.com/YOUR_USERNAME/Certificate-Clustering.git
cd Certificate-Clustering

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install development dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt  # If available

# 4. Set up pre-commit hooks
pre-commit install

# 5. Run tests
python -m pytest

🐛 Bug Reports

Found a bug? Please create an issue with:

  • 📝 Clear description of the problem
  • 🔄 Steps to reproduce
  • 💻 System information (OS, Python version)
  • 📎 Sample files (if applicable)
  • 📷 Screenshots (if relevant)

💡 Feature Requests

Have an idea? We'd love to hear it! Include:

  • 🎯 Clear description of the feature
  • 🤔 Why it would be valuable
  • 💭 Possible implementation approach
  • 📊 Expected impact

🏆 Contributors

Thanks to all the amazing people who have contributed to this project!

🗺️ Roadmap

🎯 Upcoming Features

| Feature | Status | ETA | Priority |
| --- | --- | --- | --- |
| 🔍 Advanced Search | 🔄 In Progress | Q3 2025 | High |
| 📊 Analytics Dashboard | 📋 Planned | Q4 2025 | Medium |
| 🌍 Multi-language Support | 💭 Concept | 2026 | Low |
| 📱 Mobile App | 💭 Concept | TBD | Medium |
| ☁️ Cloud Integration | 💭 Concept | TBD | High |

🎁 Version History

  • v1.0.0 (Current) - Initial release with core functionality
  • v0.9.0 - Beta testing phase
  • v0.1.0 - Alpha prototype

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2025 Harsha Justin

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software...

🙏 Acknowledgments

Special thanks to the amazing technologies and communities that made this project possible:

  • 🤖 Google AI - For the powerful Gemini API
  • 🎨 Streamlit - For the incredible web framework
  • 📄 pdf2image - For seamless PDF processing
  • 🐍 Python Community - For the amazing ecosystem
  • 🌟 Open Source Community - For inspiration and collaboration
  • 💡 Contributors - For making this project better every day

🏢 Powered By

Google AI Streamlit Python

📞 Support & Community

Need help or want to connect with other users?

🆘 Get Help

| Channel | Purpose | Response Time |
| --- | --- | --- |
| 🐛 GitHub Issues | Bug reports, feature requests | 24-48 hours |
| 💬 Discussions | Questions, ideas, showcase | 24-48 hours |
| 📧 Email | Private inquiries | 2-3 business days |

📝 Before Asking for Help

  1. ✅ Check the FAQ section
  2. ✅ Search existing issues
  3. ✅ Read the documentation thoroughly
  4. ✅ Try the troubleshooting steps

🐛 Reporting Issues

When reporting bugs, please include:

**Environment:**
- OS: [e.g., macOS 12.0, Windows 11, Ubuntu 20.04]
- Python version: [e.g., 3.9.7]
- Package versions: [run `pip list`]

**Steps to reproduce:**
1. Go to '...'
2. Click on '....'
3. Upload file '....'
4. See error

**Expected behavior:**
A clear description of what you expected to happen.

**Actual behavior:**
A clear description of what actually happened.

**Additional context:**
Add any other context about the problem here.

🎉 Happy Certificate Organizing! 🎉

Made with ❤️ by Harsha Justin

If this project helped you, please consider giving it a ⭐ on GitHub!
