- File Hash-Based Version Control
  - Every file is hashed (SHA-256) before processing
  - Automatic detection of file changes
  - Version tracking for all document updates
  - Skip processing of unchanged files
- Enhanced Progress Tracking
  - Dual progress bars (overall + batch)
  - Real-time statistics (processed, skipped, failed)
  - Elapsed time tracking
  - Memory usage monitoring per file
- Contract Ontology Management
  - Hierarchical categorization system
  - Visual ontology tree editor
  - AI-powered auto-categorization
  - Confidence scoring for assignments
- Comprehensive SQL Schema
  - 10 core tables for complete data management
  - Full audit trail with processing logs
  - Daily statistics aggregation
  - Optimized indexes for 2600+ documents
- Knowledge Graph Enhancements
  - Filtered visualization (documents, companies, ontologies)
  - Color-coded by ontology categories
  - Relationship type visualization
  - Export capabilities
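
The hash-based change detection listed under File Hash-Based Version Control can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: `sha256_of_file`, `classify_file`, and the in-memory `known_hashes` mapping are hypothetical names, and the real system presumably keeps stored hashes in its database rather than in a dict.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large documents are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify_file(path: Path, known_hashes: dict) -> str:
    """Return 'new', 'changed', or 'unchanged' and record the current hash.

    known_hashes is a hypothetical mapping of file path -> last stored hash.
    """
    current = sha256_of_file(path)
    previous = known_hashes.get(str(path))
    if previous is None:
        status = "new"
    elif previous == current:
        status = "unchanged"  # candidate for skipping
    else:
        status = "changed"  # candidate for a new document version
    known_hashes[str(path)] = current
    return status
```

A caller would process files classified as `new` or `changed` and skip those classified as `unchanged`.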
- documents: Core document storage with version tracking
- file_hashes: Track all file versions
- document_versions: Complete version history
- contract_ontology: Hierarchical categorization
- document_ontology_mapping: Document categorization
- companies: Company management
- document_companies: Document-company relationships
- document_relationships: Inter-document relationships
- processing_logs: Complete audit trail
- processing_statistics: Performance metrics
See SQL_SCHEMA_DOCUMENTATION.md for complete schema details.
A document processing system for large-scale contract analysis, with SQL database integration, memory management, and knowledge graph capabilities.
- Large-Scale Processing: Process 2600+ contracts efficiently with memory management
- SQL Database Integration: Store and query processed documents in PostgreSQL/MySQL/SQLite
- Knowledge Graph: Visualize document relationships and company connections
- Memory Management: Automatic memory optimization and garbage collection
- Settings Persistence: Save and load configurations via JSON
- Document Tracking: Skip already processed files, track processing status
- Support for PDF, DOCX, DOC, XLSX, XLS files
- AI-powered content analysis using OpenAI or Azure OpenAI
- Extract contract details, vendor assessments, and technical specifications
- Automatic relationship detection between documents
- Company ID and Company Group tracking
- Contract Work (CW) number assignment
- Document ontology and categorization
- Comprehensive audit trail and processing logs
- Install required dependencies:
    pip install -r requirements.txt
- Set up environment variables:
    # For OpenAI
    export OPENAI_API_KEY="your-api-key"
    # For Azure OpenAI
    export AZURE_ENDPOINT="your-endpoint"
    export AZURE_DEPLOYMENT="your-deployment"
    export AZURE_API_KEY="your-api-key"
- Set up database (PostgreSQL example):
    createdb contract_processor
- Run the application:
    python contract-processor.py
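
Before launching, it can help to confirm that the environment variables listed above are visible to Python. The sketch below is an assumption about how provider selection might work, not the application's actual logic: `resolve_ai_provider` is a hypothetical helper, and preferring Azure whenever all three `AZURE_*` variables are set is an illustrative choice.

```python
import os


def resolve_ai_provider(env=None) -> dict:
    """Build a provider config from environment variables.

    Assumption: Azure OpenAI is used only when all three AZURE_* variables
    are set; otherwise fall back to OPENAI_API_KEY.
    """
    env = os.environ if env is None else env
    azure_keys = ("AZURE_ENDPOINT", "AZURE_DEPLOYMENT", "AZURE_API_KEY")
    if all(env.get(key) for key in azure_keys):
        return {
            "provider": "azure",
            "endpoint": env["AZURE_ENDPOINT"],
            "deployment": env["AZURE_DEPLOYMENT"],
            "api_key": env["AZURE_API_KEY"],
        }
    if env.get("OPENAI_API_KEY"):
        return {"provider": "openai", "api_key": env["OPENAI_API_KEY"]}
    raise RuntimeError("Set OPENAI_API_KEY or all three AZURE_* variables")
```

Calling `resolve_ai_provider()` with no arguments reads the real environment; passing a dict makes the function easy to check in isolation.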
- Configure settings via the Settings menu:
  - Database connection (PostgreSQL/MySQL/SQLite)
  - API configuration (OpenAI/Azure)
  - Processing parameters (batch size, memory limits)
  - Company metadata defaults
- Process documents:
  - Select directory containing contracts
  - Enter CW number, Company ID, and Company Group
  - Choose file types to process
  - Click "Start Processing"
- View results:
  - Excel output in the configured output directory
  - SQL database with full processing history
  - Knowledge graph visualization in the application
- documents: Stores document metadata and processing status
- companies: Company information and groupings
- document_relationships: Links between related documents
- processing_logs: Audit trail of all processing activities
- document_ontology: Document categorization hierarchy
- File hash-based duplicate detection
- Processing status tracking (pending/processing/completed/failed)
- Contract date extraction and storage
- Company-document relationship mapping
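
To illustrate how hash-based duplicate detection and status tracking could be enforced at the database level, here is a small SQLite sketch. The column names and constraints are hypothetical and deliberately simplified; the authoritative schema is the one described in SQL_SCHEMA_DOCUMENTATION.md.

```python
import sqlite3

# Hypothetical, simplified table; the real schema has more columns and tables.
SCHEMA = """
CREATE TABLE documents (
    id            INTEGER PRIMARY KEY,
    file_path     TEXT NOT NULL,
    file_hash     TEXT NOT NULL UNIQUE,
    status        TEXT NOT NULL DEFAULT 'pending'
                  CHECK (status IN ('pending', 'processing', 'completed', 'failed')),
    contract_date TEXT
);
"""


def register_document(conn: sqlite3.Connection, file_path: str, file_hash: str) -> bool:
    """Insert a document row; return False when the hash is already known."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO documents (file_path, file_hash) VALUES (?, ?)",
        (file_path, file_hash),
    )
    conn.commit()
    return cursor.rowcount == 1


def set_status(conn: sqlite3.Connection, file_hash: str, status: str) -> None:
    """Update the processing status; invalid values violate the CHECK constraint."""
    conn.execute(
        "UPDATE documents SET status = ? WHERE file_hash = ?", (status, file_hash)
    )
    conn.commit()
```

The `UNIQUE` constraint on `file_hash` makes a second file with identical content a no-op insert, and the `CHECK` constraint rejects any status outside the four listed values.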
The system automatically manages memory for large-scale processing:
- Dynamic batch size adjustment based on available memory
- Process pool isolation for memory-intensive operations
- Automatic garbage collection between batches
- Configurable memory limits
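
One way to picture dynamic batch sizing with garbage collection between batches is the sketch below. It is an assumption-laden illustration: `adjust_batch_size`, `process_in_batches`, the 25%/75% thresholds, and the size bounds are all hypothetical, and the available-memory reading is injected as a callable so the example stays dependency-free.

```python
import gc


def adjust_batch_size(current, available_mb, memory_limit_mb, min_size=1, max_size=50):
    """Halve the batch when memory headroom is low, double it when plentiful.

    The 25% / 75% thresholds are illustrative assumptions.
    """
    headroom = min(available_mb, memory_limit_mb) / memory_limit_mb
    if headroom < 0.25:
        return max(min_size, current // 2)
    if headroom > 0.75:
        return min(max_size, current * 2)
    return current


def process_in_batches(items, handle, get_available_mb, memory_limit_mb=4096, batch_size=10):
    """Process items batch by batch, resizing the batch and collecting garbage in between."""
    results = []
    index = 0
    while index < len(items):
        batch_size = adjust_batch_size(batch_size, get_available_mb(), memory_limit_mb)
        batch = items[index:index + batch_size]
        results.extend(handle(item) for item in batch)
        index += len(batch)
        gc.collect()  # release objects from the finished batch before the next one
    return results
```

In practice `get_available_mb` could wrap a system call such as `psutil.virtual_memory().available` (psutil is a third-party package and its use here is an assumption).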
The integrated knowledge graph provides:
- Visual representation of document relationships
- Company-document connections
- Interactive graph exploration
- Export to PNG/PDF formats
Settings are persisted in contract_processor_settings.json:

    {
      "settings": {
        "max_workers": 4,
        "batch_size": 10,
        "memory_limit_mb": 4096,
        "db_type": "postgresql",
        "db_host": "localhost",
        "db_port": 5432,
        "skip_processed_files": true,
        "auto_detect_relationships": true
      }
    }
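
Loading and saving a file of this shape could look like the following sketch. The helper names `load_settings` and `save_settings` are hypothetical, and merging stored values over defaults is an assumed design choice so that a missing or partial file still yields a complete configuration.

```python
import json
from pathlib import Path

# Defaults mirror the example file; the application's real defaults may differ.
DEFAULTS = {
    "max_workers": 4,
    "batch_size": 10,
    "memory_limit_mb": 4096,
    "db_type": "postgresql",
    "db_host": "localhost",
    "db_port": 5432,
    "skip_processed_files": True,
    "auto_detect_relationships": True,
}


def load_settings(path) -> dict:
    """Return defaults overlaid with whatever the settings file contains."""
    settings = dict(DEFAULTS)
    file_path = Path(path)
    if file_path.exists():
        stored = json.loads(file_path.read_text(encoding="utf-8"))
        settings.update(stored.get("settings", {}))
    return settings


def save_settings(path, settings: dict) -> None:
    """Write settings under the top-level "settings" key, as in the example file."""
    payload = json.dumps({"settings": settings}, indent=2)
    Path(path).write_text(payload, encoding="utf-8")
```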
- OpenAI: Direct integration with OpenAI API
- Azure OpenAI: Support for Azure-hosted OpenAI services
- Configurable model parameters and endpoints
- Comprehensive error logging
- Graceful handling of processing failures
- Resume capability for interrupted processing
- Permission error handling
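
The error-handling behavior above can be approximated with a loop like this one. It is a sketch under stated assumptions: `process_all`, the `completed` set used as resume state, and the logger name are hypothetical, and the real application tracks status in its database rather than in memory.

```python
import logging

logger = logging.getLogger("contract_processor")  # hypothetical logger name


def process_all(paths, process_one, completed):
    """Process each path once, logging failures instead of aborting the run.

    completed is a set of paths finished in earlier runs; passing the same set
    again lets an interrupted run resume where it stopped.
    """
    stats = {"processed": 0, "skipped": 0, "failed": 0}
    for path in paths:
        if path in completed:
            stats["skipped"] += 1
            continue
        try:
            process_one(path)
        except PermissionError:
            logger.warning("Permission denied, skipping: %s", path)
            stats["failed"] += 1
        except Exception:
            logger.exception("Processing failed: %s", path)
            stats["failed"] += 1
        else:
            completed.add(path)
            stats["processed"] += 1
    return stats
```

Failed files are not added to `completed`, so a later run retries them while skipping everything that already succeeded.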
Optimized for processing large document sets:
- Concurrent processing with configurable workers
- Streaming file processing for large documents
- Redis caching support (optional)
- Neo4j integration for advanced graph operations (optional)
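
Concurrent processing with a configurable worker count might be structured as in the sketch below. Note the substitution: it uses a thread pool for simplicity, which suits I/O-bound work such as API calls, whereas the process pool isolation mentioned earlier would use `ProcessPoolExecutor` and require picklable functions. `process_concurrently` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_concurrently(paths, process_one, max_workers=4):
    """Run process_one over paths in parallel; collect results and errors separately."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going; one bad file should not stop the rest
                errors[path] = exc
    return results, errors
```

The `max_workers` argument corresponds to the `max_workers` setting in the configuration file.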
MIT License - See LICENSE file for details
Martin Bacigal, 01/2025 @ https://procureai.tech