A powerful Python application for intelligent metadata extraction from Excel files. This tool provides comprehensive schema analysis, business context extraction, and data quality insights.
- π Comprehensive Metadata Extraction: Schema information, data types, constraints
- π·οΈ Intelligent Business Context: Auto-generated descriptions and tags based on column patterns
- π Data Quality Metrics: Null counts, uniqueness ratios, quality indicators
- π― Smart Type Detection: Automatic detection of integers, floats, dates, booleans, and more
- π Multi-Sheet Support: Process multiple sheets within a single Excel file
- π Sample Value Extraction: Get sample data for better understanding
- π JSON Export: Export metadata in structured JSON format
- π¨ Rich Terminal Interface: Beautiful command-line interface with tables and colors
- π³ Docker Support: Containerized deployment ready
- Python 3.11+
- pip (Python package manager)
-
Clone the repository
git clone <repository-url> cd data-source-app
-
Create and activate virtual environment
python -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate
-
Install dependencies
pip install -r requirements.txt
-
Set up demo data
python -m src.cli setup-demo
-
Run the demo
python -m src.cli demo
-
Run tests
pytest tests/ -v pytest tests/ --cov=src --cov-report=html # With coverage
-
Build and run with Docker Compose
cd docker docker-compose up --build -
Run extraction job
docker-compose run extractor-job
# Extract metadata from an Excel file
python -m src.cli extract path/to/your/file.xlsx
# Extract with custom options
python -m src.cli extract data.xlsx --output metadata.json --sample-size 10 --max-rows 5000
# Run the demo
python -m src.cli demo
# Set up demo data
python -m src.cli setup-demo
# Show application information
python -m src.cli info--output, -o: Export metadata to JSON file--sample-size: Number of sample values to extract (default: 5)--max-rows: Maximum rows to process per sheet (default: 10000)--skip-empty/--include-empty: Skip or include empty sheets--include-samples/--no-samples: Include or exclude sample values--verbose, -v: Enable verbose logging
src/
βββ cli.py # Command-line interface
βββ data_extractor/ # Extraction logic
β βββ excel_extractor.py # Excel-specific extractor
βββ models/ # Data models
β βββ metadata.py # Pydantic models for metadata
βββ utils/ # Utilities
βββ formatters.py # Display and export formatting
- ExcelMetadata: Complete file metadata
- SheetMetadata: Individual sheet information
- ColumnMetadata: Column-level details
- ExtractionConfig: Configuration options
- Intelligent Type Detection: Analyzes data patterns to determine optimal data types
- Business Context Generation: Creates meaningful descriptions and tags based on column names and characteristics
- Quality Metrics: Calculates null percentages, uniqueness ratios, and other quality indicators
- Flexible Configuration: Customizable extraction parameters
The application provides rich terminal output including:
- File Summary: Size, sheet count, total rows
- Sheets Overview: Table with row/column counts and tags
- Columns Analysis: Detailed column information with data types and quality metrics
- Data Quality Report: Overall quality insights and statistics
- JSON Export: Structured metadata for programmatic use
config = ExtractionConfig(
include_sample_values=True, # Include sample data
sample_size=5, # Number of samples
include_quality_metrics=True, # Calculate quality metrics
max_sheet_rows=10000, # Performance limit
skip_empty_sheets=True, # Skip empty sheets
custom_tags=["custom", "tag"] # Additional tags
)The demo includes sample data with:
- Employee Sheet: HR data with various data types
- Transactions Sheet: Financial data with dates and categories
- Mixed Data Types: Integers, strings, dates, booleans, floats
- Quality Issues: Some null values and patterns for demonstration
This application includes comprehensive test coverage (97%) with:
- Unit Tests: Individual component testing
- Integration Tests: End-to-end workflow testing
- CLI Tests: Command-line interface testing
- Edge Case Tests: Error handling and boundary conditions
- Performance Tests: Large file processing validation
# Run all tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=src --cov-report=html
# Run specific test categories
pytest tests/test_models.py # Unit tests
pytest tests/test_integration.py # Integration tests
pytest tests/test_cli.py # CLI tests
# Run fast tests only
pytest tests/ -m "not slow"tests/test_models.py- Data model validationtests/test_excel_extractor.py- Core extraction logictests/test_formatters.py- Display and export functionalitytests/test_cli.py- Command-line interfacetests/test_integration.py- End-to-end workflowstests/conftest.py- Test fixtures and configuration
- File Validation: Checks file existence and format
- Graceful Degradation: Continues processing even if individual sheets fail
- Comprehensive Logging: Detailed logs for debugging
- User-Friendly Messages: Clear error messages with suggestions
- File path validation
- Safe Excel parsing
- No external network calls
- Configurable row limits
- Efficient memory usage
- Streaming for large files
- Modular design for different data sources
- Extensible configuration system
- Plugin architecture ready
- Structured logging with loguru
- Performance metrics collection
- Error tracking and reporting
- Create a new extractor class inheriting from base extractor
- Implement the
extract_metadatamethod - Add source-specific models if needed
- Update CLI to support new source type
- Modify
_generate_column_descriptionfor custom descriptions - Update
_generate_column_tagsfor custom tagging logic - Extend
ColumnMetadatamodel for additional fields
- pandas: Data manipulation and analysis
- openpyxl: Excel file reading
- pydantic: Data validation and models
- typer: CLI framework
- rich: Terminal formatting
- loguru: Advanced logging
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or support, please open an issue in the repository or contact the development team.