A GenAI-powered tool that scans contracts and regulatory filings for missing clauses and suggests remediation using Anthropic's Claude API.
The Automated Document Compliance Auditor is a Flask-based web application that helps organizations ensure their documents comply with various regulations such as GDPR and HIPAA. It analyzes documents to identify missing clauses and provides AI-powered suggestions for remediation using Anthropic's Claude API.
- Document Processing: Extract text from PDF, DOCX, and TXT files
- Rule-based Compliance Checking: Detect missing clauses using regex and keyword patterns
- AI-Powered Suggestions: Generate remediation text using Anthropic's Claude API
- Interactive UI: Real-time highlighting and inline editing with dark mode support
- Domain-specific Compliance: Support for GDPR, HIPAA, and other standards
- Error Handling: Centralized error handling system with user-friendly feedback
- Performance Optimization: Caching, pagination, and background task processing
- Security: Input validation, CSRF protection, and rate limiting
- API Access: RESTful API for programmatic access to all features
- PDF Export: Generate PDF reports for compliance results
The main landing page showing the application overview and navigation options
Browse uploaded documents with filtering and sorting options
Upload new documents for compliance checking
View document content with compliance issues highlighted
View detailed compliance issues and get AI-powered suggestions
┌─────────────────────────────────────────────────────────────────┐
│ Client Browser │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Flask Web Server │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Routes │───▶│ Services │───▶│ Document Parser │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Templates │ │ Rule Engine │ │ PDF Export Service │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
│ │ │
└───────────────────────────┼─────────────────────────────────────┘
│
┌────────────────┼────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌───────────────────┐
│ MongoDB │ │ Anthropic API │ │ Cache System │
│ (Document DB) │ │ (Claude LLM) │ │ (Flask-Caching) │
└─────────────────┘ └─────────────────┘ └───────────────────┘
- Backend: Python with Flask
- Frontend: HTML, CSS, JavaScript with HTMX for interactivity
- Database: MongoDB for document storage
- Text Processing: PyPDF2, python-docx for document parsing
- AI Integration: Anthropic Claude API for generating suggestions
- Styling: Bootstrap 5 for responsive design with dark mode support
- Caching: In-memory caching with Flask-Caching
- Security: Flask-WTF for CSRF protection, input sanitization with Bleach
- API: RESTful API with rate limiting via Flask-Limiter
- PDF Generation: ReportLab for PDF report generation
- Background Processing: APScheduler for handling long-running tasks
- Python 3.9+
- MongoDB
- Anthropic API key (for AI suggestions)
- Clone the repository
git clone https://github.com/sylvester-francis/Automated-Document-Compliance-Auditor.git
cd Automated-Document-Compliance-Auditor
- Create and activate a virtual environment
python -m venv venv
# On macOS/Linux
source venv/bin/activate
# On Windows
# venv\Scripts\activate
- Install the dependencies
pip install -r requirements.txt
- Set up MongoDB
Make sure MongoDB is running on your system. You can install it following the official MongoDB installation guide.
- Create the instance directory and .env file
mkdir -p instance
touch instance/.env
Edit the .env
file and add the following configuration:
SECRET_KEY=your-secret-key
MONGO_URI=mongodb://localhost:27017/compliance_auditor
ANTHROPIC_API_KEY=your-anthropic-api-key
USE_MOCK_LLM=False # Set to True to use mock LLM service instead of Claude API
API_KEY=your-api-key # For accessing the API endpoints
MAX_CONTENT_LENGTH=10485760 # Maximum file size (10MB)
ALLOWED_EXTENSIONS=pdf,docx,txt # Allowed file extensions
Note: You'll need to obtain an Anthropic API key from Anthropic's website. If you don't have one, you can set
USE_MOCK_LLM=True
to use the mock LLM service for testing.
- Run the application
python app.py
- Access the application
Open your browser and navigate to http://localhost:5006
The application can also be deployed using Docker for easier setup and consistent environments.
- Clone the repository
git clone https://github.com/sylvester-francis/Automated-Document-Compliance-Auditor.git
cd Automated-Document-Compliance-Auditor
- Set your Anthropic API key as an environment variable
export ANTHROPIC_API_KEY=your_anthropic_api_key
# Alternatively, to use the mock LLM service (no API key required)
export USE_MOCK_LLM=True
- Start the application with Docker Compose
docker-compose up -d
- Access the application
Open your browser and navigate to http://localhost:5006
- Build the Docker image
docker build -t document-compliance-auditor .
- Run the container
docker run -p 5006:5006 \
-e MONGO_URI=your_mongo_uri \
-e ANTHROPIC_API_KEY=your_api_key \
-e SECRET_KEY=your_secret_key \
document-compliance-auditor
Note: When using Docker without Compose, you'll need to set up MongoDB separately and provide the correct connection URI.
-
Upload Documents
- Click the "Upload New Document" button on the documents list page
- Select a file (PDF, DOCX, or TXT) from your computer
- The system will process the document and extract text and metadata
-
Browse Documents
- Use the search bar to find documents by filename or content
- Filter documents by type (PDF, DOCX, TXT) using the dropdown menu
- Sort documents by date, name, or compliance score
- Toggle between ascending and descending order
-
View Document Details
- Click on a document card to view its details
- Navigate between document content, compliance issues, and metadata using the tabs
- Toggle between light and dark mode using the theme switch in the navigation bar
-
Run Compliance Check
- Click the "Run Compliance Check" button on the document view page
- The system will analyze the document against selected compliance standards
- View the compliance score and issues found
-
Review Compliance Issues
- Issues are highlighted in the document content
- Click on an issue to see details and suggestions
- Generate AI-powered suggestions using the "Generate Suggestion (Claude)" button
-
Export Compliance Report
- Click the "Export PDF Report" button to generate a PDF report
- The report includes document details, compliance score, issues, and suggestions
All functionality is also available through the API. See the API Documentation section for details.
Automated-Document-Compliance-Auditor/
├── app/ # Flask application
│ ├── __init__.py # App initialization
│ ├── config.py # Configuration settings
│ ├── extensions.py # Flask extensions
│ ├── models/ # Data models
│ │ ├── __init__.py
│ │ ├── compliance.py # Compliance models
│ │ └── document.py # Document models
│ ├── routes/ # View functions
│ │ ├── __init__.py
│ │ ├── api.py # API endpoints
│ │ ├── compliance.py # Compliance checking routes
│ │ ├── documents.py # Document management routes
│ │ └── main.py # Main routes
│ ├── services/ # Business logic
│ │ ├── __init__.py
│ │ ├── bulk_processor.py # Batch document processing
│ │ ├── document_classifier.py # Document type classification
│ │ ├── document_service.py # Document handling
│ │ ├── extraction_service.py # Text extraction
│ │ ├── llm_service.py # LLM integration with mock support
│ │ ├── pdf_exporter.py # PDF export generation
│ │ ├── rule_engine.py # Compliance rules
│ │ └── seed_service.py # Data seeding
│ ├── static/ # Static assets
│ │ ├── css/ # Stylesheets
│ │ ├── js/ # JavaScript files
│ │ └── img/ # Images
│ ├── templates/ # Jinja2 templates
│ │ ├── base.html # Base template
│ │ ├── index.html # Homepage
│ │ ├── about.html # About page
│ │ ├── compliance/ # Compliance templates
│ │ │ ├── debug.html # Debug page
│ │ │ ├── results.html # Results page
│ │ │ ├── results_partial.html # HTMX partial for results
│ │ │ └── suggestions_partial.html # HTMX partial for suggestions
│ │ ├── components/ # Reusable UI components
│ │ │ └── pagination.html # Pagination component
│ │ ├── documents/ # Document templates
│ │ │ ├── bulk_upload.html # Bulk upload form
│ │ │ ├── list.html # Document list
│ │ │ ├── list_partial.html # HTMX partial for document list
│ │ │ ├── upload.html # Upload form
│ │ │ └── view.html # Document viewer
│ │ └── reports/ # Report templates
│ │ ├── compliance_pdf.html # Compliance report template
│ │ └── document_pdf.html # Document report template
│ └── utils/ # Utility functions
│ ├── __init__.py
│ ├── background_tasks.py # Background task processing
│ ├── cache.py # Caching utilities
│ ├── document_extractor.py # Document extraction utilities
│ ├── error_handler.py # Centralized error handling
│ ├── form_validation.py # Input validation
│ ├── pagination.py # Pagination utilities
│ ├── pdf_export.py # PDF export utilities
│ ├── pdf_utils.py # PDF utility functions
│ ├── rate_limiter.py # API rate limiting
│ ├── security.py # Security utilities
│ └── text_processing.py # Text processing utilities
├── instance/ # Instance-specific files
│ ├── uploads/ # Uploaded documents
│ └── temp/ # Temporary files
├── screenshots/ # Application screenshots
├── static/ # Global static files
│ └── images/ # Image assets
│ └── screenshots/ # Screenshot images for documentation
├── testdocuments/ # Test document files
├── tests/ # Test suite
│ ├── __init__.py
│ ├── conftest.py # Test configuration
│ ├── test_api.py # API tests
│ ├── test_document_service.py # Document service tests
│ ├── test_extraction_service.py # Extraction service tests
│ ├── test_routes.py # Route tests
│ ├── test_rule_engine.py # Rule engine tests
│ └── test_utils.py # Utility tests
├── app.py # Application entry point
├── app.log # Application logs
├── Dockerfile # Docker configuration
├── docker-compose.yml # Docker Compose configuration
├── requirements.txt # Python dependencies
└── README.md # Project documentation
This project demonstrates:
- Full-stack development with Python (Flask) and modern frontend techniques (HTMX)
- Integration of NLP techniques and AI technologies
- Document processing and text analysis
- Database design and integration
- User interface design for complex data visualization
The system extracts text from various document formats (PDF, DOCX, TXT) and splits it into paragraphs for analysis. It uses PyPDF2 for PDF extraction and python-docx for DOCX files, with specialized utilities in the utils module.
The rules engine (rule_engine.py) checks documents against predefined compliance rules using:
- Regular expression matching for specific clause patterns
- Keyword detection for important compliance terms
- Severity classification (High, Medium, Low)
When a compliance issue is detected, the system generates remediation suggestions using Anthropic's Claude API (llm_service.py), providing context-appropriate clause examples that would satisfy compliance requirements. A fallback mock service is integrated directly into the LLM service and can be enabled by setting USE_MOCK_LLM=True in your environment variables or .env file.
The interface provides:
- Document uploading and management
- Real-time compliance checking
- Highlighted issues in the document view
- Detailed compliance reports
- Interactive suggestion generation with Claude
- Debug tools for testing API integration
-
Error Handling:
- Implemented a centralized error handling system with custom
AppError
class - Added decorators for route error handling with user-friendly feedback
- Implemented a centralized error handling system with custom
-
User Experience:
- Added toast notification system for improved user feedback
- Implemented dark mode support for better accessibility
- Enhanced mobile responsiveness for all device sizes
-
Performance Optimization:
- Added document caching to improve retrieval speed
- Implemented pagination for document lists to handle large datasets
- Added background task processing for long-running operations
-
Security Enhancements:
- Implemented input validation and sanitization to prevent XSS attacks
- Added CSRF protection for all forms
- Implemented rate limiting to prevent abuse
- Added API key authentication for API endpoints
-
Feature Additions:
- Created a RESTful API for programmatic access to all features
- Added PDF export functionality for compliance reports
- Implemented advanced search and filtering for documents
- Added health check endpoints for monitoring
-
Code Quality:
- Fixed metadata loading and compliance score display issues
- Consolidated LLM services by integrating mock functionality
- Added configuration options for toggling features
- Improved error handling and debugging information
This project uses ruff and flake8 for code quality checks. To run these checks locally:
- Run ruff:
# Navigate to your project directory
cd /Users/sylvester/Desktop/Automated-Document-Compliance-Auditor
# Activate virtual environment
source venv/bin/activate
# Run ruff on the entire codebase
ruff check .
# To automatically fix some issues
ruff check --fix .
- Run flake8:
# Run flake8 on the entire codebase
flake8 .
- PDF export occasionally fails with large documents
- Some HIPAA rules need refinement for better accuracy
- Mobile view has alignment issues on small screens
- MongoDB connection pooling needs optimization
This project includes a GitHub Actions workflow for continuous integration and deployment. The workflow is defined in .github/workflows/ci-cd.yml
and includes the following stages:
- Lint: Runs ruff and flake8 to check code quality
- Test: Runs pytest with coverage reporting
- Build: Builds and pushes a Docker image to DockerHub (on main/master branch)
- Deploy: Deploys the application to production (on main/master branch)
The CI/CD pipeline uses GitHub Container Registry (GHCR) to store Docker images, which is free for public repositories. The pipeline automatically handles authentication using GitHub Actions' built-in secrets.
If you're using the deployment step, you'll need to set up the following GitHub secrets:
DEPLOY_USER
: SSH username for deployment (if using SSH deployment)DEPLOY_HOST
: SSH host for deployment (if using SSH deployment)
- Support for additional document formats (HTML, XML, etc.)
- More compliance standards (SOX, CCPA, etc.)
- Machine learning model for document classification
- Custom compliance rules with a rule builder interface
- Analytics dashboard with compliance trends
- Integration with document management systems
- Multi-language support
- Collaborative review features
- Automated scheduled compliance checks
- Advanced prompt engineering for more precise suggestions
The application provides a RESTful API for programmatic access to all features. API endpoints are secured with API key authentication and rate limiting.
All API requests require an API key to be included in the request headers:
X-API-Key: your-api-key
To generate and configure an API key for the application:
- Create a secure random API key:
python -c "import secrets; print(secrets.token_hex(32))"
- Add the API key to your
.env
file in theinstance
directory:
# Create the instance directory if it doesn't exist
mkdir -p instance
# Add the API key to your .env file
echo "API_KEY=your_generated_key_here" >> instance/.env
- Restart the application to load the new API key from the environment.
For security best practices:
- Generate a unique API key for each client or service
- Rotate API keys periodically
- Never share API keys in public repositories or insecure channels
GET /api/documents
- List all documents with pagination and filteringGET /api/documents/{document_id}
- Get a specific document by IDGET /api/documents/{document_id}/compliance
- Get compliance information for a documentPOST /api/documents/{document_id}/check
- Check compliance for a documentGET /api/documents/{document_id}/export/pdf
- Export a document as PDFGET /api/documents/{document_id}/compliance/export/pdf
- Export compliance report as PDFGET /api/rules
- List all compliance rulesGET /api/stats
- Get application statistics