
A Streamlit web application for detecting and redacting Personally Identifiable Information (PII) in documents including PDFs, images, and text files. Supports Aadhaar, PAN, Driving License, and Voter ID detection with automated redaction capabilities.

deeksha006/PII-Detection-


🔒 PII Detection & Redaction App


A Python web application built with Streamlit that detects and redacts Personally Identifiable Information (PII) in documents. It accepts PDFs, images, and text files, and uses regular-expression patterns tuned to Indian identity documents.

🌟 Features

  • 🔍 Smart PII Detection: Detects Aadhaar numbers, PAN numbers, Driving License numbers, and Voter IDs using regular expressions
  • 🎭 Intelligent Masking: Replaces detected PII with 'X' characters while preserving document structure
  • 📄 PDF Redaction: Produces redacted PDFs with the detected PII blacked out
  • 📁 Multi-format Support: Processes PDF, PNG, JPG, JPEG, and TXT files
  • 🖥️ User-friendly Interface: Clean, intuitive Streamlit web interface
  • ⚡ Real-time Processing: Instant PII detection and masking results

🚀 Quick Start

Prerequisites

  • Python 3.7 or higher
  • Git (for cloning the repository)

Installation

  1. Clone the repository

    git clone https://github.com/deeksha006/PII-Detection-.git
    cd PII-Detection-
  2. Install Python dependencies

    pip install -r requirements.txt
  3. Install Tesseract OCR (Optional - for image processing)

    • Windows: Download from Tesseract OCR
    • macOS: brew install tesseract
    • Linux: sudo apt-get install tesseract-ocr
  4. Run the application

    streamlit run main.py
  5. Open your browser and navigate to http://localhost:8501

Note: The app works with PDF and TXT files even without Tesseract. Image processing requires Tesseract OCR installation.
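To check ahead of time which file types will work on your machine, you can test whether a Tesseract executable is on your PATH. This is a minimal sketch using only the standard library. The function names `tesseract_available` and `supported_extensions` are hypothetical and are not taken from `main.py`.

```python
import shutil
from typing import Optional, Set


def tesseract_available() -> bool:
    """Return True if a `tesseract` executable is found on PATH."""
    return shutil.which("tesseract") is not None


def supported_extensions(ocr_available: Optional[bool] = None) -> Set[str]:
    """Return the file extensions that can be processed.

    PDF and TXT need no OCR. Image formats are included only when
    Tesseract is available (auto-detected unless passed explicitly).
    """
    if ocr_available is None:
        ocr_available = tesseract_available()
    extensions = {".pdf", ".txt"}
    if ocr_available:
        extensions |= {".png", ".jpg", ".jpeg"}
    return extensions


if __name__ == "__main__":
    print("Tesseract found:", tesseract_available())
    print("Usable file types:", sorted(supported_extensions()))
```

If Tesseract is installed but not on PATH (common on Windows), this check reports it as missing even though OCR could still work once the path is configured.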

📖 How It Works

1. Upload Document

  • Drag and drop or browse to select your document
  • Supports PDF, PNG, JPG, JPEG, and TXT formats

2. Automatic PII Detection

  • Regular-expression patterns scan for:
    • Aadhaar Numbers: 12-digit unique identification numbers
    • PAN Numbers: 10-character alphanumeric tax identification
    • Driving License: State-specific license number patterns
    • Voter ID: Election Commission identification numbers
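The detection step can be sketched with Python's `re` module. The patterns below are assumptions inferred from the descriptions in this README, not the exact expressions used in `main.py`. In particular, the Driving License pattern (two-letter state code followed by 10 to 13 digits) is a simplification, and `detect_pii` is a hypothetical name.

```python
import re
from typing import Dict, List

# Illustrative patterns only; the app's real expressions may differ.
PII_PATTERNS = {
    # 12 digits, optionally grouped as 4-4-4 with single spaces
    "Aadhaar": re.compile(r"\b\d{4} ?\d{4} ?\d{4}\b"),
    # 5 letters + 4 digits + 1 letter
    "PAN": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),
    # 2-letter state code + 10 to 13 digits (simplified assumption)
    "Driving License": re.compile(r"\b[A-Z]{2}\d{10,13}\b"),
    # 3 letters + 7 digits
    "Voter ID": re.compile(r"\b[A-Z]{3}\d{7}\b"),
}


def detect_pii(text: str) -> Dict[str, List[str]]:
    """Return {pii_type: [matches]} for every type found in `text`."""
    found = {}
    for pii_type, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[pii_type] = matches
    return found


if __name__ == "__main__":
    sample = "Aadhaar 1234 5678 9012, PAN ABCDE1234F, Voter ABC1234567"
    for pii_type, values in detect_pii(sample).items():
        print(pii_type, values)
```

The `\b` word boundaries keep a pattern from matching inside a longer token, for example the digits of a license number being reported as part of an Aadhaar number.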

3. Smart Processing

  • PDFs: Direct text extraction and redaction
  • Images: OCR-based text recognition (requires Tesseract)
  • Text Files: Direct content analysis
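The per-format handling above amounts to a dispatch on file extension. The sketch below is one way to write it, assuming PyMuPDF (`fitz`) for PDFs and `pytesseract` with Pillow for images. This README does not say which libraries `main.py` actually uses, so treat those choices, and the name `extract_text`, as assumptions.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def extract_text(path: str) -> str:
    """Return the text content of a TXT, PDF, or image file."""
    suffix = Path(path).suffix.lower()

    if suffix == ".txt":
        # Plain text: read directly, replacing undecodable bytes.
        return Path(path).read_text(encoding="utf-8", errors="replace")

    if suffix == ".pdf":
        # Assumed library: PyMuPDF. Imported lazily so TXT-only use
        # does not require it to be installed.
        import fitz
        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)

    if suffix in IMAGE_SUFFIXES:
        # Assumed libraries: Pillow + pytesseract (needs Tesseract OCR).
        from PIL import Image
        import pytesseract
        return pytesseract.image_to_string(Image.open(path))

    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
```

Importing the heavier dependencies inside each branch mirrors the note that PDF and TXT files work without Tesseract: only the image branch touches OCR.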

4. Secure Output

  • View detected PII in organized format
  • Download redacted PDFs with PII blacked out
  • See masked versions with 'X' replacements
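Masking with 'X' while keeping the layout intact can be illustrated as follows. This is a sketch, not the app's code: it replaces every letter and digit of a match with 'X' and leaves spaces untouched, so the masked text keeps its original length. The two patterns and the names `mask_match` and `mask_pii` are assumptions for the example.

```python
import re
from typing import Iterable, Pattern

# Example patterns (assumed, based on the formats described in this README).
AADHAAR = re.compile(r"\b\d{4} ?\d{4} ?\d{4}\b")
PAN = re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b")


def mask_match(match: "re.Match") -> str:
    """Replace each letter or digit in the match with 'X'; keep spacing."""
    return re.sub(r"[A-Za-z0-9]", "X", match.group(0))


def mask_pii(text: str, patterns: Iterable[Pattern]) -> str:
    """Apply every pattern in turn and mask whatever it matches."""
    for pattern in patterns:
        text = pattern.sub(mask_match, text)
    return text


if __name__ == "__main__":
    print(mask_pii("Aadhaar: 1234 5678 9012, PAN: ABCDE1234F", [AADHAAR, PAN]))
```

Because separators survive, a masked Aadhaar number still reads as three groups of four characters, which makes it easy to see what kind of value was removed.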

🎯 Usage Example

  1. Launch the app: streamlit run main.py
  2. Open browser: Navigate to http://localhost:8501
  3. Upload file: Choose your document
  4. Review results: See detected PII and masked versions
  5. Download: Get redacted PDF if applicable

🛡️ Supported PII Types

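| PII Type        | Pattern                         | Example        |
|-----------------|---------------------------------|----------------|
| Aadhaar         | 12 digits (with/without spaces) | 1234 5678 9012 |
| PAN             | 5 letters + 4 digits + 1 letter | ABCDE1234F     |
| Driving License | State code + digits             | MH1234567890   |
| Voter ID        | 3 letters + 7 digits            | ABC1234567     |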
⚠️ Important Notes

  • Privacy First: All processing happens locally on your machine
  • No Data Storage: Files are temporarily processed and automatically cleaned up
  • OCR Dependency: Image processing requires Tesseract OCR installation
  • Accuracy: Detection accuracy depends on document quality and text clarity
  • Indian Focus: Current patterns optimized for Indian identity documents
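The "temporarily processed and automatically cleaned up" behavior can be pictured as writing the upload to a temporary file and deleting it once processing finishes, even if processing fails. The README does not describe how `main.py` does this, so the following is only a sketch under that assumption, with a hypothetical helper `process_upload`.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def process_upload(data: bytes, suffix: str, handler: Callable[[str], T]) -> T:
    """Write `data` to a temp file, run `handler(path)`, then delete the file.

    The `finally` clause removes the file whether the handler returns
    normally or raises, so no copy of the upload is left on disk.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        return handler(path)
    finally:
        os.remove(path)
```

A caller would pass the uploaded bytes plus a function that takes a file path, for example a text-extraction routine, and receive that function's result after the temporary file is gone.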

🤝 Contributing

We welcome contributions! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments


Star this repository if you find it helpful!
