This project is a Python-based document processor that uses OpenAI or Azure OpenAI to extract and analyze content from PDF, Word (DOCX, DOC), and Excel (XLSX, XLS) documents.

backyMacky/contracts-processor-with-OpenAI

Enhanced Document Processor with SQL & Knowledge Graph

🚀 Latest Updates - SQL-Focused Enhancement

Key Improvements:

  1. File Hash-Based Version Control

    • Every file is hashed (SHA-256) BEFORE processing
    • Automatic detection of file changes
    • Version tracking for all document updates
    • Skip processing of unchanged files
  2. Enhanced Progress Tracking

    • Dual progress bars (overall + batch)
    • Real-time statistics (processed, skipped, failed)
    • Elapsed time tracking
    • Memory usage monitoring per file
  3. Contract Ontology Management

    • Hierarchical categorization system
    • Visual ontology tree editor
    • AI-powered auto-categorization
    • Confidence scoring for assignments
  4. Comprehensive SQL Schema

    • 10 core tables for complete data management
    • Full audit trail with processing logs
    • Daily statistics aggregation
    • Optimized indexes for 2600+ documents
  5. Knowledge Graph Enhancements

    • Filtered visualization (documents, companies, ontologies)
    • Color-coded by ontology categories
    • Relationship type visualization
    • Export capabilities
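
The hash-based version control in item 1 above can be illustrated with a minimal sketch. It is not taken from the project's source: the function names and the `known_hashes` dictionary (path to last stored hash) are assumptions made for illustration, and the real application presumably keeps hashes in the `file_hashes` table instead.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for very large documents.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify_file(path, known_hashes):
    """Classify a file as 'new', 'changed', or 'unchanged' before processing.

    known_hashes is a hypothetical dict mapping str(path) -> last stored hash.
    Returns (status, current_hash).
    """
    current = sha256_of_file(path)
    previous = known_hashes.get(str(path))
    if previous is None:
        status = "new"
    elif previous == current:
        status = "unchanged"  # safe to skip processing
    else:
        status = "changed"    # a new version should be recorded
    return status, current
```

A caller would skip files classified as `unchanged` and store a new version entry for `changed` ones.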

SQL Tables Overview:

  • documents - Core document storage with version tracking
  • file_hashes - Track all file versions
  • document_versions - Complete version history
  • contract_ontology - Hierarchical categorization
  • document_ontology_mapping - Document categorization
  • companies - Company management
  • document_companies - Document-company relationships
  • document_relationships - Inter-document relationships
  • processing_logs - Complete audit trail
  • processing_statistics - Performance metrics

See SQL_SCHEMA_DOCUMENTATION.md for complete schema details.
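
To give a feel for how three of these tables relate, here is a small sketch using SQLite in memory. Only the table names come from the list above; every column name is an assumption chosen for illustration, and SQL_SCHEMA_DOCUMENTATION.md remains the authoritative reference.

```python
import sqlite3

# Table names follow the README; all columns are hypothetical.
SCHEMA = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    file_path TEXT NOT NULL,
    file_hash TEXT NOT NULL,
    version INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE companies (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE document_companies (
    document_id INTEGER NOT NULL REFERENCES documents(id),
    company_id INTEGER NOT NULL REFERENCES companies(id),
    PRIMARY KEY (document_id, company_id)
);
"""


def build_demo_db():
    """Create an in-memory database with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO documents (id, file_path, file_hash) VALUES (?, ?, ?)",
        [(1, "msa.pdf", "abc"), (2, "sow.docx", "def")],
    )
    conn.executemany(
        "INSERT INTO companies (id, name) VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex")],
    )
    conn.executemany(
        "INSERT INTO document_companies (document_id, company_id) VALUES (?, ?)",
        [(1, 1), (2, 1), (2, 2)],
    )
    return conn


def documents_for_company(conn, company_name):
    """Return the file paths of all documents linked to a company."""
    rows = conn.execute(
        """
        SELECT d.file_path
        FROM documents d
        JOIN document_companies dc ON dc.document_id = d.id
        JOIN companies c ON c.id = dc.company_id
        WHERE c.name = ?
        ORDER BY d.file_path
        """,
        (company_name,),
    )
    return [row[0] for row in rows]
```

The junction table `document_companies` is what allows one document to belong to several companies and vice versa.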


Overview

A document processing system for large-scale contract analysis, with SQL database integration, memory management, and a knowledge graph.

Features

Core Capabilities

  • Large-Scale Processing: Process 2600+ contracts efficiently with memory management
  • SQL Database Integration: Store and query processed documents in PostgreSQL/MySQL/SQLite
  • Knowledge Graph: Visualize document relationships and company connections
  • Memory Management: Automatic memory optimization and garbage collection
  • Settings Persistence: Save and load configurations via JSON
  • Document Tracking: Skip already processed files, track processing status

Document Processing

  • Support for PDF, DOCX, DOC, XLSX, XLS files
  • AI-powered content analysis using OpenAI or Azure OpenAI
  • Extract contract details, vendor assessments, and technical specifications
  • Automatic relationship detection between documents

Data Governance

  • Company ID and Company Group tracking
  • Contract Work (CW) number assignment
  • Document ontology and categorization
  • Comprehensive audit trail and processing logs

Installation

  1. Install required dependencies:

    pip install -r requirements.txt

  2. Set up environment variables:

    # For OpenAI
    export OPENAI_API_KEY="your-api-key"

    # For Azure OpenAI
    export AZURE_ENDPOINT="your-endpoint"
    export AZURE_DEPLOYMENT="your-deployment"
    export AZURE_API_KEY="your-api-key"

  3. Set up database (PostgreSQL example):

    createdb contract_processor
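
As a rough illustration of how the environment variables above might be read at startup, consider the sketch below. The function name, the returned dictionary shape, and the choice to prefer Azure when both sets of variables are present are all assumptions; the application may resolve its configuration differently (for example through the Settings menu).

```python
import os


def load_api_config(env=None):
    """Pick an API provider from environment variables.

    env defaults to os.environ; passing a plain dict makes testing easier.
    Preferring Azure when both are configured is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    azure_keys = ("AZURE_ENDPOINT", "AZURE_DEPLOYMENT", "AZURE_API_KEY")
    if all(env.get(key) for key in azure_keys):
        return {
            "provider": "azure",
            "endpoint": env["AZURE_ENDPOINT"],
            "deployment": env["AZURE_DEPLOYMENT"],
            "api_key": env["AZURE_API_KEY"],
        }
    if env.get("OPENAI_API_KEY"):
        return {"provider": "openai", "api_key": env["OPENAI_API_KEY"]}
    raise RuntimeError(
        "Set OPENAI_API_KEY, or AZURE_ENDPOINT, AZURE_DEPLOYMENT and AZURE_API_KEY"
    )
```

Failing fast with a clear message here avoids discovering a missing key only after a long batch has started.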

Usage

  1. Run the application:

    python contract-processor.py

  2. Configure settings via the Settings menu:

    • Database connection (PostgreSQL/MySQL/SQLite)
    • API configuration (OpenAI/Azure)
    • Processing parameters (batch size, memory limits)
    • Company metadata defaults
  3. Process documents:

    • Select directory containing contracts
    • Enter CW number, Company ID, and Company Group
    • Choose file types to process
    • Click "Start Processing"
  4. View results:

    • Excel output in the configured output directory
    • SQL database with full processing history
    • Knowledge graph visualization in the application

Database Schema

Tables

  • documents: Stores document metadata and processing status
  • companies: Company information and groupings
  • document_relationships: Links between related documents
  • processing_logs: Audit trail of all processing activities
  • contract_ontology: Document categorization hierarchy (mapped to documents through document_ontology_mapping)

Key Features

  • File hash-based duplicate detection
  • Processing status tracking (pending/processing/completed/failed)
  • Contract date extraction and storage
  • Company-document relationship mapping
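
The four processing statuses listed above can be modeled as a small state machine. The status names come from this README; the set of allowed transitions (including retrying a failed document by moving it back to pending) is an assumption of this sketch, not a description of the actual implementation.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Hypothetical transition rules: failed documents may be queued again.
ALLOWED = {
    Status.PENDING: {Status.PROCESSING},
    Status.PROCESSING: {Status.COMPLETED, Status.FAILED},
    Status.FAILED: {Status.PENDING},
    Status.COMPLETED: set(),
}


def advance(current, target):
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Storing `Status.value` in the `documents` table keeps the database column human-readable.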

Memory Management

The system automatically manages memory for large-scale processing:

  • Dynamic batch size adjustment based on available memory
  • Process pool isolation for memory-intensive operations
  • Automatic garbage collection between batches
  • Configurable memory limits
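
One way to picture dynamic batch sizing with garbage collection between batches is the sketch below. The heuristic (fit each batch into half of the available memory), the per-file estimate, and all names are assumptions for illustration. `get_available_mb` is a caller-supplied function; in practice it could wrap something like `psutil.virtual_memory().available`, which is not used here so that the example stays standard-library only.

```python
import gc


def next_batch_size(available_mb, per_file_mb, min_size=1, max_size=50):
    """Choose a batch size whose estimated footprint fits in half of available memory."""
    if per_file_mb <= 0:
        return max_size
    fitting = int((available_mb / 2) // per_file_mb)
    # Clamp so that at least one file is always processed.
    return max(min_size, min(max_size, fitting))


def process_in_batches(items, handle, get_available_mb, per_file_mb=50):
    """Process items batch by batch, re-sizing each batch and collecting garbage after it."""
    results = []
    index = 0
    while index < len(items):
        size = next_batch_size(get_available_mb(), per_file_mb)
        batch = items[index:index + size]
        results.extend(handle(item) for item in batch)
        index += len(batch)
        gc.collect()  # release per-batch objects before the next batch starts
    return results
```

Re-evaluating the batch size before every batch lets processing slow down under memory pressure rather than fail.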

Knowledge Graph

The integrated knowledge graph provides:

  • Visual representation of document relationships
  • Company-document connections
  • Interactive graph exploration
  • Export to PNG/PDF formats
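
The data behind such a graph can be sketched with plain dictionaries. This is an illustrative model only: the node representation as `(kind, name)` tuples and both function names are assumptions, and the application itself may use a dedicated graph library (or Neo4j, which is listed as optional under Performance).

```python
from collections import defaultdict


def build_graph(document_companies, document_relationships):
    """Build an undirected adjacency map.

    document_companies: iterable of (document, company) pairs.
    document_relationships: iterable of (source_doc, target_doc, relation_type).
    Nodes are (kind, name) tuples, e.g. ("document", "msa.pdf").
    """
    graph = defaultdict(set)
    for doc, company in document_companies:
        graph[("document", doc)].add(("company", company))
        graph[("company", company)].add(("document", doc))
    for source, target, _relation in document_relationships:
        graph[("document", source)].add(("document", target))
        graph[("document", target)].add(("document", source))
    return graph


def neighbors(graph, node, kind=None):
    """Return the sorted neighbors of a node, optionally filtered by node kind."""
    return sorted(n for n in graph.get(node, ()) if kind is None or n[0] == kind)
```

Filtering by `kind` mirrors the filtered visualization described earlier, where documents and companies can be shown separately.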

Configuration

Settings are persisted in contract_processor_settings.json:

{
  "settings": {
    "max_workers": 4,
    "batch_size": 10,
    "memory_limit_mb": 4096,
    "db_type": "postgresql",
    "db_host": "localhost",
    "db_port": 5432,
    "skip_processed_files": true,
    "auto_detect_relationships": true
  }
}
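
Loading and saving a file of this shape could look roughly like the following. The keys and default values mirror the example above; the function names and the behavior of falling back to defaults for missing keys are assumptions of this sketch.

```python
import json
from pathlib import Path

# Defaults mirror the example settings file shown in this README.
DEFAULTS = {
    "max_workers": 4,
    "batch_size": 10,
    "memory_limit_mb": 4096,
    "db_type": "postgresql",
    "db_host": "localhost",
    "db_port": 5432,
    "skip_processed_files": True,
    "auto_detect_relationships": True,
}


def load_settings(path):
    """Load settings, filling in defaults for missing keys or a missing file."""
    settings = dict(DEFAULTS)
    file = Path(path)
    if file.exists():
        stored = json.loads(file.read_text(encoding="utf-8"))
        settings.update(stored.get("settings", {}))
    return settings


def save_settings(path, settings):
    """Write settings under the top-level "settings" key."""
    payload = json.dumps({"settings": settings}, indent=2)
    Path(path).write_text(payload, encoding="utf-8")
```

Merging stored values over defaults means a settings file written by an older version still loads after new keys are introduced.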

API Support

  • OpenAI: Direct integration with OpenAI API
  • Azure OpenAI: Support for Azure-hosted OpenAI services
  • Configurable model parameters and endpoints

Error Handling

  • Comprehensive error logging
  • Graceful handling of processing failures
  • Resume capability for interrupted processing
  • Permission error handling
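
The combination of graceful failure handling and resume capability might be structured along these lines. Everything here is a sketch under assumptions: `process_one` stands in for the real per-document work, and `already_done` is a hypothetical in-memory set that a real run would load from the database.

```python
import logging

logger = logging.getLogger("contract_processor_sketch")


def process_all(paths, process_one, already_done):
    """Process each path once, logging failures instead of aborting the run.

    already_done is a set of paths finished in an earlier run; it is updated
    in place so that an interrupted run can be resumed with the same set.
    """
    summary = {"processed": [], "skipped": [], "failed": []}
    for path in paths:
        if path in already_done:
            summary["skipped"].append(path)
            continue
        try:
            process_one(path)
        except PermissionError as exc:
            # Locked or unreadable files are reported without a full traceback.
            logger.warning("Permission denied for %s: %s", path, exc)
            summary["failed"].append(path)
        except Exception:
            logger.exception("Failed to process %s", path)
            summary["failed"].append(path)
        else:
            already_done.add(path)
            summary["processed"].append(path)
    return summary
```

Because failed paths are never added to `already_done`, a later run retries them automatically.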

Performance

Optimized for processing large document sets:

  • Concurrent processing with configurable workers
  • Streaming file processing for large documents
  • Redis caching support (optional)
  • Neo4j integration for advanced graph operations (optional)
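
Concurrent processing with a configurable worker count can be sketched with the standard library. Note the assumption: this example uses a thread pool, which suits I/O-bound work such as API calls, whereas the Memory Management section mentions process pool isolation, so the application itself may rely on processes instead. The function name and return shape are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_concurrently(paths, process_one, max_workers=4):
    """Run process_one over paths with a bounded number of worker threads.

    Returns (results, errors): dicts keyed by path, so one failing document
    does not discard the results of the others.
    """
    results = {}
    errors = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                errors[path] = exc
    return results, errors
```

The `max_workers` argument corresponds to the `max_workers` setting shown under Configuration.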

License

MIT License - See LICENSE file for details

Author

Martin Bacigal, 01/2025 @ https://procureai.tech
