Baldev-P/Data-Source-Application

Data Source Application 🚀

A Python application that extracts metadata from Excel files. It provides schema analysis, business context extraction, and data quality insights.

Features ✨

  • 📊 Comprehensive Metadata Extraction: Schema information, data types, constraints
  • 🏷️ Intelligent Business Context: Auto-generated descriptions and tags based on column patterns
  • 📈 Data Quality Metrics: Null counts, uniqueness ratios, quality indicators
  • 🎯 Smart Type Detection: Automatic detection of integers, floats, dates, booleans, and more
  • 📋 Multi-Sheet Support: Process multiple sheets within a single Excel file
  • 🔍 Sample Value Extraction: Get sample data for better understanding
  • 📄 JSON Export: Export metadata in structured JSON format
  • 🎨 Rich Terminal Interface: Command-line interface with tables and colors
  • 🐳 Docker Support: Containerized deployment ready

Quick Start 🚀

Prerequisites

  • Python 3.11+
  • pip (Python package manager)

Local Setup

  1. Clone the repository

    git clone <repository-url>
    cd data-source-app

  2. Create and activate virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate

  3. Install dependencies

    pip install -r requirements.txt

  4. Set up demo data

    python -m src.cli setup-demo

  5. Run the demo

    python -m src.cli demo

  6. Run tests

    pytest tests/ -v
    pytest tests/ --cov=src --cov-report=html  # With coverage

Docker Setup

  1. Build and run with Docker Compose

    cd docker
    docker-compose up --build

  2. Run extraction job

    docker-compose run extractor-job

Usage 📖

Command Line Interface

# Extract metadata from an Excel file
python -m src.cli extract path/to/your/file.xlsx

# Extract with custom options
python -m src.cli extract data.xlsx --output metadata.json --sample-size 10 --max-rows 5000

# Run the demo
python -m src.cli demo

# Set up demo data
python -m src.cli setup-demo

# Show application information
python -m src.cli info

Command Options

  • --output, -o: Export metadata to JSON file
  • --sample-size: Number of sample values to extract (default: 5)
  • --max-rows: Maximum rows to process per sheet (default: 10000)
  • --skip-empty/--include-empty: Skip or include empty sheets
  • --include-samples/--no-samples: Include or exclude sample values
  • --verbose, -v: Enable verbose logging
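
For scripted runs, the flags listed above can be assembled into an argument list before handing it to a subprocess. The following is a minimal sketch, not part of the project: `build_extract_command` is a hypothetical helper, and it always passes the boolean flags explicitly so that it does not depend on defaults the options list does not state.

```python
import sys
from typing import List, Optional


def build_extract_command(
    excel_path: str,
    output: Optional[str] = None,
    sample_size: int = 5,      # documented default
    max_rows: int = 10000,     # documented default
    skip_empty: bool = True,   # assumption: passed explicitly either way
    include_samples: bool = True,
    verbose: bool = False,
) -> List[str]:
    """Assemble the argv list for `python -m src.cli extract` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "src.cli", "extract", excel_path]
    if output is not None:
        cmd += ["--output", output]
    cmd += ["--sample-size", str(sample_size), "--max-rows", str(max_rows)]
    # Each boolean option is a flag pair, so exactly one side is emitted.
    cmd.append("--skip-empty" if skip_empty else "--include-empty")
    cmd.append("--include-samples" if include_samples else "--no-samples")
    if verbose:
        cmd.append("--verbose")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues with file paths that contain spaces.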

Architecture 🏗️

Core Components

src/
├── cli.py                 # Command-line interface
├── data_extractor/        # Extraction logic
│   └── excel_extractor.py # Excel-specific extractor
├── models/                # Data models
│   └── metadata.py        # Pydantic models for metadata
└── utils/                 # Utilities
    └── formatters.py      # Display and export formatting

Data Models

  • ExcelMetadata: Complete file metadata
  • SheetMetadata: Individual sheet information
  • ColumnMetadata: Column-level details
  • ExtractionConfig: Configuration options

Key Features

  1. Intelligent Type Detection: Analyzes data patterns to determine optimal data types
  2. Business Context Generation: Creates meaningful descriptions and tags based on column names and characteristics
  3. Quality Metrics: Calculates null percentages, uniqueness ratios, and other quality indicators
  4. Flexible Configuration: Customizable extraction parameters
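
To make the type detection idea concrete, here is a minimal standard-library sketch of one way to infer a column type from its values. It is an illustration only: the function names, the type labels, the accepted boolean literals, and the date formats are all assumptions, and the project's actual logic (which the Dependencies section suggests is built on pandas) may differ.

```python
from datetime import datetime
from typing import Iterable

# Assumed literals and formats; extend these for real data.
BOOLEAN_LITERALS = {"true", "false", "yes", "no"}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%d %H:%M:%S")


def _parses_as(converter, text: str) -> bool:
    """Return True if converter(text) succeeds without a ValueError."""
    try:
        converter(text)
        return True
    except ValueError:
        return False


def _is_date(text: str) -> bool:
    """Return True if the text matches any of the assumed date formats."""
    return any(
        _parses_as(lambda t, f=fmt: datetime.strptime(t, f), text)
        for fmt in DATE_FORMATS
    )


def detect_column_type(values: Iterable[object]) -> str:
    """Return the narrowest type label that fits every non-empty value."""
    texts = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not texts:
        return "empty"
    if all(t.lower() in BOOLEAN_LITERALS for t in texts):
        return "boolean"
    if all(_parses_as(int, t) for t in texts):
        return "integer"
    if all(_parses_as(float, t) for t in texts):
        return "float"
    if all(_is_date(t) for t in texts):
        return "date"
    return "string"
```

The checks run from narrowest to widest, so a column of whole numbers is reported as `integer` even though every value would also parse as a float. Null and blank cells are ignored rather than forcing the column to `string`.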

Sample Output 📊

The application provides rich terminal output including:

  • File Summary: Size, sheet count, total rows
  • Sheets Overview: Table with row/column counts and tags
  • Columns Analysis: Detailed column information with data types and quality metrics
  • Data Quality Report: Overall quality insights and statistics
  • JSON Export: Structured metadata for programmatic use
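
The per-column quality figures in the report (null counts, null percentages, uniqueness ratios) can be computed along the following lines. This is a sketch under stated assumptions: the function name, the result keys, and the rounding are hypothetical and do not reflect the project's actual output schema.

```python
from typing import Dict, Sequence


def column_quality(values: Sequence[object]) -> Dict[str, float]:
    """Compute simple quality metrics for one column (illustrative only)."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    null_count = total - len(non_null)
    unique_count = len(set(non_null))
    return {
        "null_count": null_count,
        # Share of rows with no value, as a percentage of all rows.
        "null_percentage": round(100.0 * null_count / total, 2) if total else 0.0,
        # Distinct values divided by non-null values; 1.0 means all unique.
        "uniqueness_ratio": round(unique_count / len(non_null), 4) if non_null else 0.0,
    }
```

A uniqueness ratio of 1.0 on a fully populated column is a common hint that the column is an identifier, while a very low ratio suggests a categorical field.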

Configuration ⚙️

ExtractionConfig Options

config = ExtractionConfig(
    include_sample_values=True,    # Include sample data
    sample_size=5,                # Number of samples
    include_quality_metrics=True, # Calculate quality metrics
    max_sheet_rows=10000,         # Performance limit
    skip_empty_sheets=True,       # Skip empty sheets
    custom_tags=["custom", "tag"] # Additional tags
)
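
To illustrate how `sample_size` and `max_sheet_rows` might interact, here is a standard-library stand-in for the configuration object. It is a sketch, not the project's model: the Architecture section says the real models use Pydantic, and `ExtractionConfigSketch`, `sample_column`, the validation rules, and every default other than the two documented in Command Options are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class ExtractionConfigSketch:
    """Dataclass stand-in mirroring the option names shown above."""
    include_sample_values: bool = True
    sample_size: int = 5
    include_quality_metrics: bool = True
    max_sheet_rows: int = 10000
    skip_empty_sheets: bool = True
    custom_tags: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumed validation rules; the real model may enforce others.
        if self.sample_size < 0:
            raise ValueError("sample_size must be non-negative")
        if self.max_sheet_rows < 1:
            raise ValueError("max_sheet_rows must be at least 1")


def sample_column(values: Sequence[object], config: ExtractionConfigSketch) -> List[object]:
    """Collect up to sample_size distinct non-null values from the row window."""
    if not config.include_sample_values:
        return []
    samples: List[object] = []
    for value in values[: config.max_sheet_rows]:
        if len(samples) >= config.sample_size:
            break
        if value is not None and value not in samples:
            samples.append(value)
    return samples
```

In this sketch the row limit is applied first, so samples only ever come from rows that the extractor would actually process.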

Demo Data 🎯

The demo includes sample data with:

  • Employee Sheet: HR data with various data types
  • Transactions Sheet: Financial data with dates and categories
  • Mixed Data Types: Integers, strings, dates, booleans, floats
  • Quality Issues: Some null values and patterns for demonstration

Testing 🧪

This application includes comprehensive test coverage (97%) with:

  • Unit Tests: Individual component testing
  • Integration Tests: End-to-end workflow testing
  • CLI Tests: Command-line interface testing
  • Edge Case Tests: Error handling and boundary conditions
  • Performance Tests: Large file processing validation

Running Tests

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=src --cov-report=html

# Run specific test categories
pytest tests/test_models.py          # Unit tests
pytest tests/test_integration.py     # Integration tests
pytest tests/test_cli.py            # CLI tests

# Run fast tests only
pytest tests/ -m "not slow"

Test Structure

  • tests/test_models.py - Data model validation
  • tests/test_excel_extractor.py - Core extraction logic
  • tests/test_formatters.py - Display and export functionality
  • tests/test_cli.py - Command-line interface
  • tests/test_integration.py - End-to-end workflows
  • tests/conftest.py - Test fixtures and configuration

Error Handling 🛡️

  • File Validation: Checks file existence and format
  • Graceful Degradation: Continues processing even if individual sheets fail
  • Comprehensive Logging: Detailed logs for debugging
  • User-Friendly Messages: Clear error messages with suggestions
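
The graceful degradation behaviour described above can be sketched as a per-sheet try/except loop that records failures and moves on. This is an illustration under assumptions: `extract_all_sheets` and the `extract_sheet` callback are hypothetical names, and the sketch uses the standard `logging` module rather than loguru, which the project lists as its logging dependency.

```python
import logging
from typing import Any, Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger("extractor_sketch")


def extract_all_sheets(
    sheet_names: Iterable[str],
    extract_sheet: Callable[[str], Any],
) -> Tuple[Dict[str, Any], List[Tuple[str, str]]]:
    """Extract every sheet, collecting failures instead of aborting."""
    results: Dict[str, Any] = {}
    failures: List[Tuple[str, str]] = []
    for name in sheet_names:
        try:
            results[name] = extract_sheet(name)
        except Exception as exc:  # deliberately broad: one bad sheet must not stop the run
            logger.warning("Skipping sheet %r: %s", name, exc)
            failures.append((name, str(exc)))
    return results, failures
```

Returning the failures alongside the results lets the caller print a clear summary at the end instead of burying problems in the log.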

Production Considerations 🌍

Security

  • File path validation
  • Safe Excel parsing
  • No external network calls

Performance

  • Configurable row limits
  • Efficient memory usage
  • Streaming for large files

Scalability

  • Modular design for different data sources
  • Extensible configuration system
  • Plugin architecture ready

Monitoring

  • Structured logging with loguru
  • Performance metrics collection
  • Error tracking and reporting

Extending the Application 🔧

Adding New Data Sources

  1. Create a new extractor class that inherits from the base extractor
  2. Implement the extract_metadata method
  3. Add source-specific models if needed
  4. Update the CLI to support the new source type
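
The steps above can be sketched as follows for a CSV source. Everything here is an assumption made for illustration: the README does not show the base class, so `BaseExtractor`, `CsvExtractor`, the `extract_metadata` signature, and the returned dictionary keys are hypothetical, and a real extractor would presumably return the project's Pydantic models instead of a plain dict.

```python
import csv
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, Union


class BaseExtractor(ABC):
    """Hypothetical base class; the project's real one may look different."""

    @abstractmethod
    def extract_metadata(self, path: Union[str, Path]) -> Dict[str, Any]:
        """Return metadata describing the data source at `path`."""


class CsvExtractor(BaseExtractor):
    """Example extractor for a new source type (CSV files)."""

    def extract_metadata(self, path: Union[str, Path]) -> Dict[str, Any]:
        path = Path(path)
        if not path.is_file():
            raise FileNotFoundError(f"No such file: {path}")
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])          # first row holds column names
            row_count = sum(1 for _ in reader)  # remaining rows are data
        return {
            "file_name": path.name,
            "columns": header,
            "row_count": row_count,
        }
```

Keeping the abstract method as the only required interface means the CLI can select an extractor by file type without knowing how each source is parsed.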

Custom Business Logic

  • Modify _generate_column_description for custom descriptions
  • Update _generate_column_tags for custom tagging logic
  • Extend ColumnMetadata model for additional fields
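
As a rough idea of what custom tagging logic could look like, the sketch below derives tags from column-name patterns. It is not the project's `_generate_column_tags` implementation: `suggest_tags`, the tag names, and the regular expressions are all hypothetical and would need tuning for real column naming conventions.

```python
import re
from typing import Iterable, List

# Assumed tag rules, matched against a normalized snake_case column name.
TAG_PATTERNS = {
    "identifier": re.compile(r"(^|_)id$"),
    "date": re.compile(r"(^|_)(date|timestamp)(_|$)|_at$"),
    "monetary": re.compile(r"(^|_)(amount|price|salary|cost)(_|$)"),
    "contact": re.compile(r"(^|_)(email|phone)(_|$)"),
}


def normalize_name(column_name: str) -> str:
    """Lower-case the name and collapse non-alphanumerics to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", column_name.strip().lower()).strip("_")


def suggest_tags(column_name: str, custom_tags: Iterable[str] = ()) -> List[str]:
    """Return pattern-based tags followed by any caller-supplied tags."""
    normalized = normalize_name(column_name)
    tags = [tag for tag, pattern in TAG_PATTERNS.items() if pattern.search(normalized)]
    for tag in custom_tags:
        if tag not in tags:
            tags.append(tag)
    return tags
```

The patterns match whole name segments rather than substrings, so a column such as `candidate` is not tagged as a date and `paid` is not tagged as an identifier.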

Dependencies 📦

  • pandas: Data manipulation and analysis
  • openpyxl: Excel file reading
  • pydantic: Data validation and models
  • typer: CLI framework
  • rich: Terminal formatting
  • loguru: Advanced logging

Contributing 🤝

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License 📄

This project is licensed under the MIT License - see the LICENSE file for details.

Support 💬

For questions or support, please open an issue in the repository or contact the development team.
