Data Source Application 🚀

A powerful Python application for intelligent metadata extraction from Excel files. This tool provides comprehensive schema analysis, business context extraction, and data quality insights.

Features ✨

📊 Comprehensive Metadata Extraction: Schema information, data types, constraints
🏷️ Intelligent Business Context: Auto-generated descriptions and tags based on column patterns
📈 Data Quality Metrics: Null counts, uniqueness ratios, quality indicators
🎯 Smart Type Detection: Automatic detection of integers, floats, dates, booleans, and more
📋 Multi-Sheet Support: Process multiple sheets within a single Excel file
🔍 Sample Value Extraction: Get sample data for better understanding
📄 JSON Export: Export metadata in structured JSON format
🎨 Rich Terminal Interface: Beautiful command-line interface with tables and colors
🐳 Docker Support: Containerized deployment ready

Quick Start 🚀

Prerequisites

Python 3.11+
pip (Python package manager)

Local Setup

Clone the repository

```
git clone <repository-url>
cd data-source-app
```

Create and activate a virtual environment

```
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install dependencies
```
pip install -r requirements.txt
```
Set up demo data
```
python -m src.cli setup-demo
```
Run the demo
```
python -m src.cli demo
```

Run tests

```
pytest tests/ -v
pytest tests/ --cov=src --cov-report=html  # With coverage
```

Docker Setup

Build and run with Docker Compose
```
cd docker
docker-compose up --build
```
Run extraction job
```
docker-compose run extractor-job
```

Usage 📖

Command Line Interface

```
# Extract metadata from an Excel file
python -m src.cli extract path/to/your/file.xlsx

# Extract with custom options
python -m src.cli extract data.xlsx --output metadata.json --sample-size 10 --max-rows 5000

# Run the demo
python -m src.cli demo

# Set up demo data
python -m src.cli setup-demo

# Show application information
python -m src.cli info
```

Command Options

--output, -o: Export metadata to JSON file
--sample-size: Number of sample values to extract (default: 5)
--max-rows: Maximum rows to process per sheet (default: 10000)
--skip-empty/--include-empty: Skip or include empty sheets
--include-samples/--no-samples: Include or exclude sample values
--verbose, -v: Enable verbose logging

Architecture 🏗️

Core Components

```
src/
├── cli.py                 # Command-line interface
├── data_extractor/        # Extraction logic
│   └── excel_extractor.py # Excel-specific extractor
├── models/                # Data models
│   └── metadata.py        # Pydantic models for metadata
└── utils/                 # Utilities
    └── formatters.py      # Display and export formatting
```

Data Models

ExcelMetadata: Complete file metadata
SheetMetadata: Individual sheet information
ColumnMetadata: Column-level details
ExtractionConfig: Configuration options
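The nesting of these models (file, then sheets, then columns) can be sketched as follows. This is only an illustration: the project uses Pydantic, while the sketch uses standard-library dataclasses to stay dependency-free, and every field name here (`file_name`, `row_count`, `data_type`, `null_count`, `tags`) is an assumption rather than the actual schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ColumnMetadata:
    # Hypothetical fields; the real model may differ.
    name: str
    data_type: str
    null_count: int = 0
    tags: list[str] = field(default_factory=list)


@dataclass
class SheetMetadata:
    name: str
    row_count: int
    columns: list[ColumnMetadata] = field(default_factory=list)


@dataclass
class ExcelMetadata:
    file_name: str
    sheets: list[SheetMetadata] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), indent=2)


metadata = ExcelMetadata(
    file_name="demo.xlsx",
    sheets=[
        SheetMetadata(
            name="Employees",
            row_count=3,
            columns=[
                ColumnMetadata(
                    name="employee_id",
                    data_type="integer",
                    tags=["identifier"],
                )
            ],
        )
    ],
)
print(metadata.to_json())
```

Keeping the column list inside the sheet and the sheet list inside the file means a single JSON export carries the whole hierarchy.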

Key Features

Intelligent Type Detection: Analyzes data patterns to determine optimal data types
Business Context Generation: Creates meaningful descriptions and tags based on column names and characteristics
Quality Metrics: Calculates null percentages, uniqueness ratios, and other quality indicators
Flexible Configuration: Customizable extraction parameters
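As a rough illustration of pattern-based type detection, the sketch below tries progressively looser interpretations of a column's non-empty values. It is not the project's implementation: the function name `infer_column_type`, the accepted boolean spellings, and the date formats are all assumptions, and it uses only the standard library rather than pandas.

```python
from datetime import datetime

# Assumed spellings and formats; adjust to the data at hand.
BOOLEAN_STRINGS = {"true", "false", "yes", "no"}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%d %H:%M:%S")


def _parses(converter, text):
    """Return True if converter(text) succeeds without a ValueError."""
    try:
        converter(text)
        return True
    except ValueError:
        return False


def _is_date(text):
    return any(
        _parses(lambda t, fmt=fmt: datetime.strptime(t, fmt), text)
        for fmt in DATE_FORMATS
    )


def infer_column_type(values):
    """Pick the narrowest type that fits every non-empty value."""
    cleaned = [str(v).strip() for v in values if v is not None]
    cleaned = [v for v in cleaned if v]
    if not cleaned:
        return "empty"
    if all(v.lower() in BOOLEAN_STRINGS for v in cleaned):
        return "boolean"
    if all(_parses(int, v) for v in cleaned):
        return "integer"
    if all(_parses(float, v) for v in cleaned):
        return "float"
    if all(_is_date(v) for v in cleaned):
        return "date"
    return "string"


print(infer_column_type(["1", "2", None, "3"]))
print(infer_column_type(["2024-01-31", "2023-12-01"]))
```

Checking integers before floats matters because every integer string also parses as a float; the order of the checks is what makes the result the narrowest fitting type.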

Sample Output 📊

The application provides rich terminal output including:

File Summary: Size, sheet count, total rows
Sheets Overview: Table with row/column counts and tags
Columns Analysis: Detailed column information with data types and quality metrics
Data Quality Report: Overall quality insights and statistics
JSON Export: Structured metadata for programmatic use
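The per-column figures behind a data quality report could be computed roughly as in the sketch below. The function name `column_quality` and the output keys are hypothetical, and the definitions are assumptions: empty strings count as nulls, and the uniqueness ratio is taken to be distinct non-null values divided by all non-null values.

```python
def column_quality(values):
    """Compute simple quality indicators for one column of hashable values."""
    total = len(values)
    non_null = [v for v in values if v is not None and v != ""]
    null_count = total - len(non_null)
    unique_count = len(set(non_null))
    return {
        "null_count": null_count,
        # Guard against empty columns to avoid dividing by zero.
        "null_percentage": round(100 * null_count / total, 2) if total else 0.0,
        "uniqueness_ratio": (
            round(unique_count / len(non_null), 4) if non_null else 0.0
        ),
    }


print(column_quality(["a", "b", "a", None]))
```

A uniqueness ratio of 1.0 suggests a candidate identifier column, while a very low ratio suggests a categorical one.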

Configuration ⚙️

ExtractionConfig Options

```
# Assumed import path, based on the layout shown under Core Components
from src.models.metadata import ExtractionConfig

config = ExtractionConfig(
    include_sample_values=True,    # Include sample data
    sample_size=5,                 # Number of samples
    include_quality_metrics=True,  # Calculate quality metrics
    max_sheet_rows=10000,          # Performance limit
    skip_empty_sheets=True,        # Skip empty sheets
    custom_tags=["custom", "tag"]  # Additional tags
)
```

Demo Data 🎯

The demo includes sample data with:

Employee Sheet: HR data with various data types
Transactions Sheet: Financial data with dates and categories
Mixed Data Types: Integers, strings, dates, booleans, floats
Quality Issues: Some null values and patterns for demonstration

Testing 🧪

The test suite covers 97% of the code and includes:

Unit Tests: Individual component testing
Integration Tests: End-to-end workflow testing
CLI Tests: Command-line interface testing
Edge Case Tests: Error handling and boundary conditions
Performance Tests: Large file processing validation

Running Tests

```
# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=src --cov-report=html

# Run specific test categories
pytest tests/test_models.py          # Unit tests
pytest tests/test_integration.py     # Integration tests
pytest tests/test_cli.py             # CLI tests

# Run fast tests only
pytest tests/ -m "not slow"
```

Test Structure

tests/test_models.py - Data model validation
tests/test_excel_extractor.py - Core extraction logic
tests/test_formatters.py - Display and export functionality
tests/test_cli.py - Command-line interface
tests/test_integration.py - End-to-end workflows
tests/conftest.py - Test fixtures and configuration

Error Handling 🛡️

File Validation: Checks file existence and format
Graceful Degradation: Continues processing even if individual sheets fail
Comprehensive Logging: Detailed logs for debugging
User-Friendly Messages: Clear error messages with suggestions
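Graceful degradation, as described above, usually comes down to catching failures per sheet instead of per file. The sketch below shows one way to do that under stated assumptions: `extract_all_sheets` and `count_rows` are hypothetical names, sheets are represented as plain lists of rows, and the standard `logging` module stands in for loguru.

```python
import logging

logger = logging.getLogger("extractor_sketch")


def extract_all_sheets(sheets, extract_sheet):
    """Run extract_sheet on each sheet, recording failures instead of aborting."""
    results, failures = {}, {}
    for name, rows in sheets.items():
        try:
            results[name] = extract_sheet(name, rows)
        except Exception as exc:  # Deliberately broad: one bad sheet must not stop the rest.
            logger.warning("Skipping sheet %r: %s", name, exc)
            failures[name] = str(exc)
    return results, failures


def count_rows(name, rows):
    # Stand-in for real per-sheet extraction.
    if rows is None:
        raise ValueError("sheet could not be read")
    return {"row_count": len(rows)}


results, failures = extract_all_sheets(
    {"Employees": [[1, "Ann"], [2, "Bob"]], "Broken": None},
    count_rows,
)
print(results)
print(failures)
```

Returning the failures alongside the results lets the caller report which sheets were skipped and why.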

Production Considerations 🌍

Security

File path validation
Safe Excel parsing
No external network calls

Performance

Configurable row limits
Efficient memory usage
Streaming for large files

Scalability

Modular design for different data sources
Extensible configuration system
Plugin architecture ready

Monitoring

Structured logging with loguru
Performance metrics collection
Error tracking and reporting

Extending the Application 🔧

Adding New Data Sources

Create a new extractor class inheriting from the base extractor
Implement the extract_metadata method
Add source-specific models if needed
Update CLI to support new source type

Custom Business Logic

Modify _generate_column_description for custom descriptions
Update _generate_column_tags for custom tagging logic
Extend ColumnMetadata model for additional fields
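Custom tagging logic of the kind `_generate_column_tags` is described as providing might look like the keyword lookup below. This is a standalone sketch, not the project's method: the function name `generate_column_tags`, its signature, and the keyword table are all hypothetical.

```python
import re

# Hypothetical keyword table: tag -> name tokens that trigger it.
TAG_KEYWORDS = {
    "identifier": ("id", "key", "code"),
    "temporal": ("date", "time", "created", "updated"),
    "financial": ("amount", "price", "salary", "cost"),
    "contact": ("email", "phone"),
}


def generate_column_tags(column_name, extra_tags=()):
    """Derive tags from the words in a column name, then append extras."""
    # Split on anything that is not a letter or digit, e.g. "_" or spaces.
    tokens = set(re.split(r"[^a-z0-9]+", column_name.lower()))
    tags = [tag for tag, words in TAG_KEYWORDS.items() if tokens & set(words)]
    return tags + [tag for tag in extra_tags if tag not in tags]


print(generate_column_tags("employee_id"))
print(generate_column_tags("Created Date", extra_tags=["custom"]))
```

Matching whole tokens rather than substrings keeps a column such as `valid` from being tagged as an identifier; camelCase names such as `employeeId` would need an extra splitting step.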

Dependencies 📦

pandas: Data manipulation and analysis
openpyxl: Excel file reading
pydantic: Data validation and models
typer: CLI framework
rich: Terminal formatting
loguru: Advanced logging

Contributing 🤝

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

License 📄

This project is licensed under the MIT License - see the LICENSE file for details.

Support 💬

For questions or support, please open an issue in the repository or contact the development team.
