A Python web application built with Streamlit that detects and redacts Personally Identifiable Information (PII) in documents. It accepts PDFs, images (PNG, JPG, JPEG), and text files, and uses regular-expression patterns tuned to Indian identity documents.
- 🔍 Smart PII Detection: Automatically detects Aadhaar numbers, PAN numbers, Driving Licenses, and Voter IDs using advanced regular expressions
- 🎭 Intelligent Masking: Replaces detected PII with 'X' characters while preserving document structure
- 📄 PDF Redaction: Creates professionally redacted PDFs with PII information blacked out
- 📁 Multi-format Support: Processes PDF, PNG, JPG, JPEG, and TXT files seamlessly
- 🖥️ User-friendly Interface: Clean, intuitive Streamlit web interface
- ⚡ Real-time Processing: Instant PII detection and masking results
- Python 3.7 or higher
- Git (for cloning the repository)
- Clone the repository, then change into the project directory: `git clone https://github.com/yourusername/pii-detection-app.git` followed by `cd pii-detection-app`
- Install Python dependencies: `pip install -r requirements.txt`
- Install Tesseract OCR (Optional - for image processing)
  - Windows: Download from Tesseract OCR
  - macOS: `brew install tesseract`
  - Linux: `sudo apt-get install tesseract-ocr`
- Run the application: `streamlit run main.py`
- Open your browser and navigate to http://localhost:8501
Note: The app works with PDF and TXT files even without Tesseract. Image processing requires Tesseract OCR installation.
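The note above can be sketched as a small availability check. This is a minimal illustration and not the app's actual code: the names `tesseract_available` and `supported_extensions` are hypothetical, and the check assumes the `tesseract` executable is on the system `PATH`, which may not hold for every Windows installation.

```python
import shutil

# Formats the README says work without Tesseract.
ALWAYS_SUPPORTED = {".pdf", ".txt"}
# Formats the README says need OCR.
OCR_ONLY = {".png", ".jpg", ".jpeg"}


def tesseract_available():
    """Return True if a `tesseract` executable can be found on PATH."""
    return shutil.which("tesseract") is not None


def supported_extensions(ocr_available=None):
    """Return the file extensions that can be processed on this machine.

    Pass ocr_available explicitly to override the PATH lookup.
    """
    if ocr_available is None:
        ocr_available = tesseract_available()
    if ocr_available:
        return ALWAYS_SUPPORTED | OCR_ONLY
    return set(ALWAYS_SUPPORTED)


if __name__ == "__main__":
    print(sorted(supported_extensions()))
```

A check like this could be used to warn the user before they upload an image that cannot be processed.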
- Drag and drop or browse to select your document
- Supports PDF, PNG, JPG, JPEG, and TXT formats
- Advanced regex patterns scan for:
  - Aadhaar Numbers: 12-digit unique identification numbers
  - PAN Numbers: 10-character alphanumeric tax identification
  - Driving License: State-specific license number patterns
  - Voter ID: Election Commission identification numbers
- PDFs: Direct text extraction and redaction
- Images: OCR-based text recognition (requires Tesseract)
- Text Files: Direct content analysis
- View detected PII in organized format
- Download redacted PDFs with PII blacked out
- See masked versions with 'X' replacements
- Launch the app: `streamlit run main.py`
- Open browser: Navigate to http://localhost:8501
- Upload file: Choose your document
- Review results: See detected PII and masked versions
- Download: Get redacted PDF if applicable
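The masked versions mentioned in the steps above replace PII with 'X' characters while keeping the surrounding layout. A minimal sketch of one way to do that follows. It is an assumption about the approach, not the app's implementation: `mask_pii` and `PAN_PATTERN` are hypothetical names, and the pattern is only an illustrative PAN regex.

```python
import re

# Hypothetical pattern for illustration: 5 letters + 4 digits + 1 letter.
PAN_PATTERN = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")


def mask_pii(text, pattern):
    """Replace each letter or digit inside every match with 'X'.

    Spaces and punctuation inside a match are left alone, so the masked
    text has the same length and layout as the original.
    """
    def _mask(match):
        return re.sub(r"[A-Za-z0-9]", "X", match.group(0))

    return pattern.sub(_mask, text)


print(mask_pii("PAN: ABCDE1234F issued 2020", PAN_PATTERN))
# PAN: XXXXXXXXXX issued 2020
```

Masking character by character, instead of substituting a fixed placeholder, is what preserves the document structure: a spaced number such as `1234 5678 9012` would become `XXXX XXXX XXXX`.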
| PII Type | Pattern | Example |
|---|---|---|
| Aadhaar | 12 digits (with/without spaces) | 1234 5678 9012 |
| PAN | 5 letters + 4 digits + 1 letter | ABCDE1234F |
| Driving License | State code + digits | MH1234567890 |
| Voter ID | 3 letters + 7 digits | ABC1234567 |
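The table above can be translated into regular expressions roughly as follows. This is a sketch derived from the table, not the patterns shipped in the app. `PII_PATTERNS` and `detect_pii` are hypothetical names, and the driving license pattern in particular is an assumption (a two-letter state code followed by 10 to 13 digits), since the table only says "state code + digits".

```python
import re

# Illustrative patterns derived from the table; the app's real regexes may differ.
PII_PATTERNS = {
    # 12 digits, optionally written in groups of four separated by whitespace.
    "Aadhaar": re.compile(r"\b[0-9]{4}\s?[0-9]{4}\s?[0-9]{4}\b"),
    # 5 letters + 4 digits + 1 letter.
    "PAN": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    # Assumption: 2-letter state code followed by 10 to 13 digits.
    "Driving License": re.compile(r"\b[A-Z]{2}[0-9]{10,13}\b"),
    # 3 letters + 7 digits.
    "Voter ID": re.compile(r"\b[A-Z]{3}[0-9]{7}\b"),
}


def detect_pii(text):
    """Return a dict mapping each PII type to the list of matches found."""
    return {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}


if __name__ == "__main__":
    sample = "Aadhaar 1234 5678 9012, PAN ABCDE1234F, DL MH1234567890, Voter ABC1234567"
    for name, matches in detect_pii(sample).items():
        print(name, matches)
```

The `\b` word boundaries keep the patterns from matching inside longer tokens; for example, the Aadhaar pattern does not fire on the digits embedded in `MH1234567890`.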
- Privacy First: All processing happens locally on your machine
- No Data Storage: Files are held only temporarily during processing and are cleaned up automatically
- OCR Dependency: Image processing requires Tesseract OCR installation
- Accuracy: Detection accuracy depends on document quality and text clarity
- Indian Focus: Current patterns optimized for Indian identity documents
We welcome contributions! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with Streamlit for the web interface
- Uses Tesseract OCR for image text extraction
- PDF processing powered by PyMuPDF and PyPDF2
⭐ Star this repository if you find it helpful!