Automated Document Compliance Auditor

A GenAI-powered tool that scans contracts and regulatory filings for missing clauses and suggests remediation using Anthropic's Claude API.

Overview

The Automated Document Compliance Auditor is a Flask-based web application that helps organizations ensure their documents comply with various regulations such as GDPR and HIPAA. It analyzes documents to identify missing clauses and provides AI-powered suggestions for remediation using Anthropic's Claude API.

Key Features

Document Processing: Extract text from PDF, DOCX, and TXT files
Rule-based Compliance Checking: Detect missing clauses using regex and keyword patterns
AI-Powered Suggestions: Generate remediation text using Anthropic's Claude API
Interactive UI: Real-time highlighting and inline editing with dark mode support
Domain-specific Compliance: Support for GDPR, HIPAA, and other standards
Error Handling: Centralized error handling system with user-friendly feedback
Performance Optimization: Caching, pagination, and background task processing
Security: Input validation, CSRF protection, and rate limiting
API Access: RESTful API for programmatic access to all features
PDF Export: Generate PDF reports for compliance results

Application Screenshots

Homepage/Dashboard

The main landing page showing the application overview and navigation options

Document List View

Browse uploaded documents with filtering and sorting options

Document Upload Interface

Upload new documents for compliance checking

Document Detail View

View document content with compliance issues highlighted

Compliance Check Results

View detailed compliance issues and get AI-powered suggestions

System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Client Browser                           │
└───────────────────────────────┬─────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                         Flask Web Server                        │
│                                                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │   Routes    │───▶│  Services   │───▶│  Document Parser    │  │
│  └─────────────┘    └─────────────┘    └─────────────────────┘  │
│         │                  │                      │             │
│         ▼                  ▼                      ▼             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │ Templates   │    │ Rule Engine │    │ PDF Export Service  │  │
│  └─────────────┘    └─────────────┘    └─────────────────────┘  │
│                           │                                     │
└───────────────────────────┼─────────────────────────────────────┘
                            │
           ┌────────────────┼────────────────┐
           │                │                │
           ▼                ▼                ▼
┌─────────────────┐ ┌─────────────────┐ ┌───────────────────┐
│    MongoDB      │ │  Anthropic API  │ │  Cache System     │
│  (Document DB)  │ │  (Claude LLM)   │ │  (Flask-Caching)  │
└─────────────────┘ └─────────────────┘ └───────────────────┘

Technology Stack

Backend: Python with Flask
Frontend: HTML, CSS, JavaScript with HTMX for interactivity
Database: MongoDB for document storage
Text Processing: PyPDF2, python-docx for document parsing
AI Integration: Anthropic Claude API for generating suggestions
Styling: Bootstrap 5 for responsive design with dark mode support
Caching: In-memory caching with Flask-Caching
Security: Flask-WTF for CSRF protection, input sanitization with Bleach
API: RESTful API with rate limiting via Flask-Limiter
PDF Generation: ReportLab for PDF report generation
Background Processing: APScheduler for handling long-running tasks

Getting Started

Prerequisites

Python 3.9+
MongoDB
Anthropic API key (for AI suggestions)

Installation

Clone the repository

git clone https://github.com/sylvester-francis/Automated-Document-Compliance-Auditor.git
cd Automated-Document-Compliance-Auditor

Create and activate a virtual environment

python -m venv venv

# On macOS/Linux
source venv/bin/activate

# On Windows
# venv\Scripts\activate

Install the dependencies

pip install -r requirements.txt

Set up MongoDB

Make sure MongoDB is running on your system. You can install it following the official MongoDB installation guide.

Create the instance directory and .env file

mkdir -p instance
touch instance/.env

Edit the .env file and add the following configuration:

SECRET_KEY=your-secret-key
MONGO_URI=mongodb://localhost:27017/compliance_auditor
ANTHROPIC_API_KEY=your-anthropic-api-key
USE_MOCK_LLM=False  # Set to True to use mock LLM service instead of Claude API
API_KEY=your-api-key  # For accessing the API endpoints
MAX_CONTENT_LENGTH=10485760  # Maximum file size (10MB)
ALLOWED_EXTENSIONS=pdf,docx,txt  # Allowed file extensions

Note: You'll need to obtain an Anthropic API key from Anthropic's website. If you don't have one, you can set USE_MOCK_LLM=True to use the mock LLM service for testing.

Run the application

python app.py

Access the application

Open your browser and navigate to http://localhost:5006

Docker Deployment

The application can also be deployed using Docker for easier setup and consistent environments.

Using Docker Compose (Recommended)

Clone the repository

git clone https://github.com/sylvester-francis/Automated-Document-Compliance-Auditor.git
cd Automated-Document-Compliance-Auditor

Set your Anthropic API key as an environment variable

export ANTHROPIC_API_KEY=your_anthropic_api_key

# Alternatively, to use the mock LLM service (no API key required)
export USE_MOCK_LLM=True

Start the application with Docker Compose

docker-compose up -d

Access the application

Open your browser and navigate to http://localhost:5006

Using Docker without Compose

Build the Docker image

docker build -t document-compliance-auditor .

Run the container

docker run -p 5006:5006 \
  -e MONGO_URI=your_mongo_uri \
  -e ANTHROPIC_API_KEY=your_api_key \
  -e SECRET_KEY=your_secret_key \
  document-compliance-auditor

Note: When using Docker without Compose, you'll need to set up MongoDB separately and provide the correct connection URI.

Usage

Document Management

Upload Documents
- Click the "Upload New Document" button on the documents list page
- Select a file (PDF, DOCX, or TXT) from your computer
- The system will process the document and extract text and metadata
Browse Documents
- Use the search bar to find documents by filename or content
- Filter documents by type (PDF, DOCX, TXT) using the dropdown menu
- Sort documents by date, name, or compliance score
- Toggle between ascending and descending order
View Document Details
- Click on a document card to view its details
- Navigate between document content, compliance issues, and metadata using the tabs
- Toggle between light and dark mode using the theme switch in the navigation bar

Compliance Checking

Run Compliance Check
- Click the "Run Compliance Check" button on the document view page
- The system will analyze the document against selected compliance standards
- View the compliance score and issues found
Review Compliance Issues
- Issues are highlighted in the document content
- Click on an issue to see details and suggestions
- Generate AI-powered suggestions using the "Generate Suggestion (Claude)" button
Export Compliance Report
- Click the "Export PDF Report" button to generate a PDF report
- The report includes document details, compliance score, issues, and suggestions

API Access

All functionality is also available through the API. See the API Documentation section for details.

Project Structure

Automated-Document-Compliance-Auditor/
├── app/                    # Flask application
│   ├── __init__.py         # App initialization
│   ├── config.py           # Configuration settings
│   ├── extensions.py       # Flask extensions
│   ├── models/             # Data models
│   │   ├── __init__.py
│   │   ├── compliance.py   # Compliance models
│   │   └── document.py     # Document models
│   ├── routes/             # View functions
│   │   ├── __init__.py
│   │   ├── api.py          # API endpoints
│   │   ├── compliance.py   # Compliance checking routes
│   │   ├── documents.py    # Document management routes
│   │   └── main.py         # Main routes
│   ├── services/           # Business logic
│   │   ├── __init__.py
│   │   ├── bulk_processor.py      # Batch document processing
│   │   ├── document_classifier.py # Document type classification
│   │   ├── document_service.py    # Document handling
│   │   ├── extraction_service.py  # Text extraction
│   │   ├── llm_service.py         # LLM integration with mock support
│   │   ├── pdf_exporter.py        # PDF export generation
│   │   ├── rule_engine.py         # Compliance rules
│   │   └── seed_service.py        # Data seeding
│   ├── static/             # Static assets
│   │   ├── css/            # Stylesheets
│   │   ├── js/             # JavaScript files
│   │   └── img/            # Images
│   ├── templates/          # Jinja2 templates
│   │   ├── base.html       # Base template
│   │   ├── index.html      # Homepage
│   │   ├── about.html      # About page
│   │   ├── compliance/     # Compliance templates
│   │   │   ├── debug.html          # Debug page
│   │   │   ├── results.html        # Results page
│   │   │   ├── results_partial.html # HTMX partial for results
│   │   │   └── suggestions_partial.html # HTMX partial for suggestions
│   │   ├── components/     # Reusable UI components
│   │   │   └── pagination.html     # Pagination component
│   │   ├── documents/      # Document templates
│   │   │   ├── bulk_upload.html    # Bulk upload form
│   │   │   ├── list.html          # Document list
│   │   │   ├── list_partial.html   # HTMX partial for document list
│   │   │   ├── upload.html        # Upload form
│   │   │   └── view.html          # Document viewer
│   │   └── reports/        # Report templates
│   │       ├── compliance_pdf.html # Compliance report template
│   │       └── document_pdf.html   # Document report template
│   └── utils/              # Utility functions
│       ├── __init__.py
│       ├── background_tasks.py # Background task processing
│       ├── cache.py           # Caching utilities
│       ├── document_extractor.py # Document extraction utilities
│       ├── error_handler.py   # Centralized error handling
│       ├── form_validation.py # Input validation
│       ├── pagination.py      # Pagination utilities
│       ├── pdf_export.py      # PDF export utilities
│       ├── pdf_utils.py       # PDF utility functions
│       ├── rate_limiter.py    # API rate limiting
│       ├── security.py        # Security utilities
│       └── text_processing.py # Text processing utilities
├── instance/              # Instance-specific files
│   ├── uploads/           # Uploaded documents
│   └── temp/              # Temporary files
├── screenshots/           # Application screenshots
├── static/                # Global static files
│   └── images/           # Image assets
│       └── screenshots/    # Screenshot images for documentation
├── testdocuments/         # Test document files
├── tests/                 # Test suite
│   ├── __init__.py
│   ├── conftest.py        # Test configuration
│   ├── test_api.py        # API tests
│   ├── test_document_service.py # Document service tests
│   ├── test_extraction_service.py # Extraction service tests
│   ├── test_routes.py     # Route tests
│   ├── test_rule_engine.py # Rule engine tests
│   └── test_utils.py      # Utility tests
├── app.py                 # Application entry point
├── app.log                # Application logs
├── Dockerfile             # Docker configuration
├── docker-compose.yml     # Docker Compose configuration
├── requirements.txt       # Python dependencies
└── README.md              # Project documentation

Portfolio Project Notes

This project demonstrates:

Full-stack development with Python (Flask) and modern frontend techniques (HTMX)
Integration of NLP techniques and AI technologies
Document processing and text analysis
Database design and integration
User interface design for complex data visualization

Features in Detail

Document Processing

The system extracts text from various document formats (PDF, DOCX, TXT) and splits it into paragraphs for analysis. It uses PyPDF2 for PDF extraction and python-docx for DOCX files, with specialized utilities in the utils module.

Compliance Rules Engine

The rules engine (rule_engine.py) checks documents against predefined compliance rules using:

Regular expression matching for specific clause patterns
Keyword detection for important compliance terms
Severity classification (High, Medium, Low)

AI-Powered Suggestions

When a compliance issue is detected, the system generates remediation suggestions using Anthropic's Claude API (llm_service.py), providing context-appropriate clause examples that would satisfy compliance requirements. A fallback mock service is integrated directly into the LLM service and can be enabled by setting USE_MOCK_LLM=True in your environment variables or .env file.

Interactive User Interface

The interface provides:

Document uploading and management
Real-time compliance checking
Highlighted issues in the document view
Detailed compliance reports
Interactive suggestion generation with Claude
Debug tools for testing API integration

Recent Improvements

Error Handling:
- Implemented a centralized error handling system with custom AppError class
- Added decorators for route error handling with user-friendly feedback
User Experience:
- Added toast notification system for improved user feedback
- Implemented dark mode support for better accessibility
- Enhanced mobile responsiveness for all device sizes
Performance Optimization:
- Added document caching to improve retrieval speed
- Implemented pagination for document lists to handle large datasets
- Added background task processing for long-running operations
Security Enhancements:
- Implemented input validation and sanitization to prevent XSS attacks
- Added CSRF protection for all forms
- Implemented rate limiting to prevent abuse
- Added API key authentication for API endpoints
Feature Additions:
- Created a RESTful API for programmatic access to all features
- Added PDF export functionality for compliance reports
- Implemented advanced search and filtering for documents
- Added health check endpoints for monitoring
Code Quality:
- Fixed metadata loading and compliance score display issues
- Consolidated LLM services by integrating mock functionality
- Added configuration options for toggling features
- Improved error handling and debugging information

Development

Code Quality

This project uses ruff and flake8 for code quality checks. To run these checks locally:

Run ruff:

# Navigate to your project directory
cd /Users/sylvester/Desktop/Automated-Document-Compliance-Auditor

# Activate virtual environment
source venv/bin/activate

# Run ruff on the entire codebase
ruff check .

# To automatically fix some issues
ruff check --fix .

Run flake8:

# Run flake8 on the entire codebase
flake8 .

Known Issues

PDF export occasionally fails with large documents
Some HIPAA rules need refinement for better accuracy
Mobile view has alignment issues on small screens
MongoDB connection pooling needs optimization

CI/CD Pipeline

This project includes a GitHub Actions workflow for continuous integration and deployment. The workflow is defined in .github/workflows/ci-cd.yml and includes the following stages:

Lint: Runs ruff and flake8 to check code quality
Test: Runs pytest with coverage reporting
Build: Builds and pushes a Docker image to DockerHub (on main/master branch)
Deploy: Deploys the application to production (on main/master branch)

The CI/CD pipeline uses GitHub Container Registry (GHCR) to store Docker images, which is free for public repositories. The pipeline automatically handles authentication using GitHub Actions' built-in secrets.

If you're using the deployment step, you'll need to set up the following GitHub secrets:

DEPLOY_USER: SSH username for deployment (if using SSH deployment)
DEPLOY_HOST: SSH host for deployment (if using SSH deployment)

Future Enhancements

Support for additional document formats (HTML, XML, etc.)
More compliance standards (SOX, CCPA, etc.)
Machine learning model for document classification
Custom compliance rules with a rule builder interface
Analytics dashboard with compliance trends
Integration with document management systems
Multi-language support
Collaborative review features
Automated scheduled compliance checks
Advanced prompt engineering for more precise suggestions

API Documentation

The application provides a RESTful API for programmatic access to all features. API endpoints are secured with API key authentication and rate limiting.

Authentication

All API requests require an API key to be included in the request headers:

X-API-Key: your-api-key

Generating an API Key

To generate and configure an API key for the application:

Create a secure random API key:

python -c "import secrets; print(secrets.token_hex(32))"

Add the API key to your .env file in the instance directory:

# Create the instance directory if it doesn't exist
mkdir -p instance

# Add the API key to your .env file
echo "API_KEY=your_generated_key_here" >> instance/.env

Restart the application to load the new API key from the environment.

For security best practices:

Generate a unique API key for each client or service
Rotate API keys periodically
Never share API keys in public repositories or insecure channels

Endpoints

GET /api/documents - List all documents with pagination and filtering
GET /api/documents/{document_id} - Get a specific document by ID
GET /api/documents/{document_id}/compliance - Get compliance information for a document
POST /api/documents/{document_id}/check - Check compliance for a document
GET /api/documents/{document_id}/export/pdf - Export a document as PDF
GET /api/documents/{document_id}/compliance/export/pdf - Export compliance report as PDF
GET /api/rules - List all compliance rules
GET /api/stats - Get application statistics

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
.github/workflows		.github/workflows
app		app
static/images/screenshots		static/images/screenshots
testdocuments		testdocuments
tests		tests
.DS_Store		.DS_Store
.flake8		.flake8
.gitignore		.gitignore
.ruff.toml		.ruff.toml
Dockerfile		Dockerfile
README.md		README.md
app.log		app.log
app.py		app.py
directory.sh		directory.sh
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.cfg		setup.cfg

sylvester-francis/Automated-Document-Compliance-Auditor

Folders and files

Latest commit

History

Repository files navigation

Automated Document Compliance Auditor

Overview

Key Features

Application Screenshots

Homepage/Dashboard

Document List View

Document Upload Interface

Document Detail View

Compliance Check Results

System Architecture

Technology Stack

Getting Started

Prerequisites

Installation

Docker Deployment

Using Docker Compose (Recommended)

Using Docker without Compose

Usage

Document Management

Compliance Checking

API Access

Project Structure

Portfolio Project Notes

Features in Detail

Document Processing

Compliance Rules Engine

AI-Powered Suggestions

Interactive User Interface

Recent Improvements

Development

Code Quality

Known Issues

CI/CD Pipeline

Future Enhancements

API Documentation

Authentication

Generating an API Key

Endpoints

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Packages 0

Uh oh!

Uh oh!

Languages

Packages