🔒 DocShield

A Streamlit application that automatically detects and redacts sensitive information in PDF documents using OCR and regex pattern matching.

Made by Enamul Hasan Shagato ❤️

🌟 Features

- OCR Text Extraction: Extracts text from PDF documents with EasyOCR
- Intelligent Redaction: Automatically detects and redacts:
  - Social Security Numbers (SSN)
  - Credit Card Numbers
  - Street Addresses
  - ZIP Codes
- Dual Processing Modes:
  - Conservative (labeled data only)
  - Aggressive (pattern matching)
- Multiple Export Formats: PDF and Word document outputs
- Modern UI: Responsive design with dark mode support
- Real-time Statistics: Processing metrics and redaction breakdown
- Privacy-First: All processing is done locally

📁 Project Structure

```
document_redaction/
├── main.py                 # Main Streamlit application
├── requirements.txt        # Python dependencies
├── README.md               # This file
├── config/
│   ├── __init__.py
│   └── settings.py         # Configuration and patterns
├── core/
│   ├── __init__.py
│   ├── ocr_processor.py    # OCR functionality
│   ├── redactor.py         # Redaction logic
│   └── file_handler.py     # File operations
├── ui/
│   ├── __init__.py
│   ├── components.py       # UI components
│   └── styles.py           # CSS styling
└── utils/
    ├── __init__.py
    └── helpers.py          # Utility functions
```

🚀 Installation

Clone the repository:

```
git clone https://github.com/shagatomte19/docshield.git
cd docshield
```

Install dependencies:
```
pip install -r requirements.txt
```
Run the application:
```
streamlit run main.py
```

📖 Usage

Upload a PDF: Use the file uploader to select your PDF document
Choose Redaction Mode:
- Conservative: Only redacts explicitly labeled data
- Aggressive: Uses pattern matching for broader detection
Process Document: Click the "Process Document" button
Review Results: View redacted text and statistics
Download Files: Export as PDF or Word document
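
The difference between the two redaction modes can be sketched for SSNs as follows. This is a minimal illustration, not the code in core/redactor.py: the regexes, the `[REDACTED SSN]` label, and the `redact_ssn` function are assumptions made for the example.

```python
import re

# Hypothetical patterns; the real ones live in config/settings.py.
# Conservative: match an SSN only when an explicit label precedes it.
LABELED_SSN = re.compile(
    r"\b((?:SSN|Social Security Number)\s*[:#]?\s*)\d{3}-\d{2}-\d{4}\b",
    re.IGNORECASE,
)
# Aggressive: match anything shaped like an SSN, labeled or not.
BARE_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssn(text: str, mode: str = "conservative") -> str:
    """Redact SSNs according to the selected mode."""
    if mode == "conservative":
        # Keep the label (group 1) and replace only the number.
        return LABELED_SSN.sub(lambda m: m.group(1) + "[REDACTED SSN]", text)
    if mode == "aggressive":
        return BARE_SSN.sub("[REDACTED SSN]", text)
    raise ValueError(f"unknown mode: {mode!r}")


sample = "SSN: 123-45-6789. Ref 987-65-4321."
print(redact_ssn(sample, "conservative"))
print(redact_ssn(sample, "aggressive"))
```

Conservative mode leaves the unlabeled number untouched, so it misses more but produces fewer false positives. Aggressive mode redacts both numbers.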

🔧 Configuration

Modify config/settings.py to customize:

Regex patterns for detection
Redaction labels
File size limits
OCR settings
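
As a rough picture of what these four groups of settings might look like, here is a sketch of a settings module. Every name and value below (`PATTERNS`, `REDACTION_LABELS`, `MAX_FILE_SIZE_MB`, `OCR_LANGUAGES`, `OCR_USE_GPU`, `is_within_size_limit`) is an assumption for illustration; check config/settings.py for the actual names.

```python
import re

# Regex patterns for detection (illustrative, keyed by data type).
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

# Redaction labels written in place of each detected value.
REDACTION_LABELS = {
    "ssn": "[REDACTED SSN]",
    "zip": "[REDACTED ZIP]",
}

# File size limit for uploads (assumed value).
MAX_FILE_SIZE_MB = 10

# OCR settings (assumed values).
OCR_LANGUAGES = ["en"]
OCR_USE_GPU = False


def is_within_size_limit(num_bytes: int, limit_mb: int = MAX_FILE_SIZE_MB) -> bool:
    """Return True if an upload of num_bytes fits under the configured limit."""
    return num_bytes <= limit_mb * 1024 * 1024
```

Keeping patterns and labels under the same keys makes it easy to verify that every detected type has a label.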

🎨 Customization

Adding New Patterns

Add new regex patterns in config/settings.py:

```
NEW_PATTERNS = [
    r"new-regex-pattern-here"
]
```
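
For instance, a US phone number pattern could be added and checked against sample text before it goes into the settings file. The pattern and the `find_matches` helper are hypothetical and exist only for this example.

```python
import re

# Hypothetical new pattern: US phone numbers such as 555-123-4567 or 555.123.4567.
NEW_PATTERNS = [
    r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b",
]

# Compile once so a malformed regex fails immediately with re.error.
COMPILED = [re.compile(p) for p in NEW_PATTERNS]


def find_matches(text: str) -> list[str]:
    """Return every substring matched by any of the new patterns."""
    return [m.group(0) for rx in COMPILED for m in rx.finditer(text)]


print(find_matches("Call 555-123-4567 or 555.987.6543"))
print(find_matches("SSN 123-45-6789"))  # an SSN should not match the phone pattern
```

Testing a new pattern against text that should not match is as useful as testing text that should, since overlapping patterns can mislabel a redaction.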

UI Modifications

Update styles in ui/styles.py or add new components in ui/components.py.

Processing Logic

Extend redaction logic in core/redactor.py for new sensitive data types.
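
One possible extension, sketched here for credit card numbers, is to filter regex candidates with the Luhn checksum so that arbitrary long digit strings are not redacted. Whether core/redactor.py already does this is not stated in this README; `CARD_CANDIDATE`, `luhn_valid`, and `redact_cards` are hypothetical names.

```python
import re

# Candidate: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Check a digit string (separators ignored) against the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def redact_cards(text: str, label: str = "[REDACTED CARD]") -> tuple[str, int]:
    """Replace Luhn-valid card candidates; return the new text and a count."""
    count = 0

    def replace(match: re.Match) -> str:
        nonlocal count
        if luhn_valid(match.group(0)):
            count += 1
            return label
        return match.group(0)  # leave non-card digit runs untouched

    return CARD_CANDIDATE.sub(replace, text), count


print(redact_cards("Card 4111 1111 1111 1111, order 1234 5678 9012 3456"))
```

The returned count can feed the redaction breakdown shown in the statistics panel.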

📊 Architecture

Core Components

OCRProcessor: Handles PDF to text conversion using EasyOCR
TextRedactor: Implements redaction logic with multiple strategies
FileHandler: Manages PDF and Word export functionality
UIComponents: Modular UI components for Streamlit interface

Processing Pipeline

PDF Upload → OCR Processing → Text Extraction
Pattern Detection → Redaction → Statistics Generation
File Export → Download → User Interface Updates
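
The first two stages of the pipeline can be sketched as below, assuming pages are rendered to images with PyMuPDF and read with EasyOCR. The function names, the two patterns, and the 200 dpi default are illustrative assumptions, not the project's actual code. The OCR imports are deferred so the redaction step can be run without the OCR stack installed.

```python
import re
from collections import Counter
from typing import Iterable

# Illustrative patterns; the real ones are configured in config/settings.py.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ZIP": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def ocr_pdf_pages(pdf_path: str, dpi: int = 200) -> list[str]:
    """PDF Upload -> OCR Processing -> Text Extraction (one string per page)."""
    import fitz  # PyMuPDF
    import easyocr

    reader = easyocr.Reader(["en"], gpu=False)
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Render the page to PNG bytes, then let EasyOCR read the image.
            png = page.get_pixmap(dpi=dpi).tobytes("png")
            pages.append(" ".join(reader.readtext(png, detail=0)))
    return pages


def redact_pages(pages: Iterable[str]) -> tuple[str, Counter]:
    """Pattern Detection -> Redaction -> Statistics Generation."""
    stats = Counter()
    redacted = []
    for text in pages:
        for kind, rx in PATTERNS.items():
            # subn returns the new text and the number of replacements made.
            text, n = rx.subn(f"[REDACTED {kind}]", text)
            stats[kind] += n
        redacted.append(text)
    return "\n\n".join(redacted), stats


text, stats = redact_pages(["SSN 123-45-6789", "Dhaka 12345"])
print(text)
print(dict(stats))
```

With a real file, the two functions would be chained as `redact_pages(ocr_pdf_pages("input.pdf"))`.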

🛡️ Security & Privacy

Local Processing: All data processing happens locally
No Data Storage: Documents are not stored permanently
Temporary Files: Output files are created in temporary directories
Privacy-First: No external API calls for sensitive data processing
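
The temporary-file behavior described above can be sketched with the standard library. This is an assumption about how such handling might look, not the code in core/file_handler.py; `export_via_temp` and its `write_output` callback are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable


def export_via_temp(write_output: Callable[[Path], None], filename: str) -> bytes:
    """Write an export inside a temporary directory and return its bytes.

    The directory and the file are deleted when the with-block exits,
    so nothing stays on disk after the bytes are handed to the UI.
    """
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / filename
        write_output(out_path)  # e.g. a PDF or Word exporter writing to this path
        return out_path.read_bytes()


def write_text(path: Path) -> None:
    # Stand-in exporter for the example; a real one would write PDF or DOCX.
    path.write_text("SSN [REDACTED SSN]", encoding="utf-8")


data = export_via_temp(write_text, "redacted.txt")
print(len(data), "bytes ready for download")
```

In a Streamlit app the returned bytes could be passed to `st.download_button`, so the download does not depend on the file still existing.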

📝 Dependencies

Streamlit: Web application framework
EasyOCR: OCR text extraction
PyMuPDF: PDF processing
FPDF2: PDF generation
python-docx: Word document creation
PyTorch: ML backend for OCR

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Make your changes
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📋 Roadmap

Add support for more file formats (DOCX, images)
Implement custom redaction patterns via UI
Add batch processing capabilities
Enhance OCR accuracy with preprocessing
Add redaction confidence scores
Implement audit logging

🐛 Known Issues

OCR quality depends on PDF text clarity
Complex layouts may affect accuracy
Large files may take a long time to process
Handwritten text detection is limited

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👨‍💻 Author

Enamul Hasan Shagato

AI/ML Engineer
GitHub | LinkedIn

⭐ Show your support

Give a ⭐️ if this project helped you!

Made with ❤️ using Python and Streamlit
