This project aims to enhance data quality, automate workflows, and improve accessibility for multiple projects, including Suplari Implementation, P Card, and Travel Analytics. The primary objective is to develop a robust, automated matching framework for AIC files that ensures consistent data across platforms and enables accurate reporting and analysis.
- Registering and Getting Access to Datasets in EDP - LinkedIn Corporate Wiki
- Registering and Getting Access with EDP CLI
- EDP Metadata on Visual Studio
- Requesting Access to EDP Tables
- Connecting Power BI to EDP
- Oracle to Hadoop Filer based Data Push
- Near Real-Time Data for Operational Analytics/Reporting
- Finance Data Platform - LinkedIn Corporate Wiki
- Master Data Exports
- Travel BCD - Finance Data Platform
- Revenue Analytics - Ada Ma
- Requesting SGP-ENG Group Memberships
- Access to Oracle EBS Database - LinkedIn JIRA
- EBS Instances - Information Services & Technology
- `docs/` - Documentation files including data dictionary, architecture, and analysis notes.
- `config/` - Configuration files for database connections, environment settings, and mapping schemas.
- `sql/` - SQL scripts categorized into raw data extraction, transformations, aggregations, and matching logic.
- `etl/` - ETL pipeline code with modular functions for data normalization, matching, enrichment, and validation.
- `scripts/` - Utility scripts for validating matches, checking data quality, and generating reports.
- `data_samples/` - Sample data files (sanitized) for testing ETL processes, including raw and processed data.
The following tables and columns are key to the Matching - AIC File project:
| Table Name | Key Columns |
|---|---|
| GL_DAILY_RATES | Conversion Date, From Currency, To Currency, Conversion Rate |
| HR_OPERATING_UNITS | Organization ID, Name, Location |
| HZ_CUST_ACCOUNTS | Account Number, Customer ID, Customer Name |
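To illustrate how GL_DAILY_RATES is typically used downstream, here is a minimal sketch of a daily-rate currency conversion. The in-memory dictionary and sample rates are stand-ins for illustration only; the real values would come from the Oracle EBS extract, and the actual lookup logic lives in the repository's ETL code.

```python
from datetime import date

# Hypothetical stand-in for GL_DAILY_RATES rows:
# (conversion_date, from_currency, to_currency) -> conversion_rate.
daily_rates = {
    (date(2024, 1, 2), "EUR", "USD"): 1.10,
    (date(2024, 1, 2), "GBP", "USD"): 1.27,
}

def convert(amount, from_ccy, to_ccy, on_date):
    """Convert an amount using the daily rate for the given date.

    Raises KeyError if no rate is loaded for that date and pair.
    """
    if from_ccy == to_ccy:
        return amount
    rate = daily_rates[(on_date, from_ccy, to_ccy)]
    return amount * rate

print(round(convert(100.0, "EUR", "USD", date(2024, 1, 2)), 2))  # 110.0
```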
- Clone the repository.
- Install required packages by running `pip install -r requirements.txt`.
- Set up database connections by configuring `db_connections.yaml` in the `config/` folder.
- Prepare sample data in the `data_samples/` folder for initial testing.
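As a rough guide to the connection-setup step, a `db_connections.yaml` might look like the sketch below. Every key name and value here is hypothetical; check the repository's actual config schema before using it, and keep credentials in environment variables rather than in the file.

```yaml
# Hypothetical layout for config/db_connections.yaml -- the keys the
# ETL code actually reads may differ.
oracle_ebs:
  host: ebs-db.example.com
  port: 1521
  service_name: EBSPROD
  user: ${EBS_USER}          # resolved from the environment, never committed
  password: ${EBS_PASSWORD}  # resolved from the environment, never committed
```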
The main ETL pipeline is located in `etl/main_etl_pipeline.py`. You can run the entire ETL process or execute individual modules as needed:

```
python etl/main_etl_pipeline.py
```
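The pipeline's modular stages (normalize, match, enrich, validate) can be sketched as a simple chain of functions. The function names and record shapes below are illustrative, not the repository's actual API; they only show how exact account-number matching and a match-rate check might compose.

```python
# Illustrative chain of the ETL stages; not the real main_etl_pipeline.py.

def normalize(records):
    # Trim whitespace and upper-case account numbers for stable joins.
    return [{**r, "account_number": r["account_number"].strip().upper()}
            for r in records]

def match(records, accounts):
    # Attach customer IDs by exact account-number lookup (HZ_CUST_ACCOUNTS-style).
    by_account = {a["account_number"]: a["customer_id"] for a in accounts}
    for r in records:
        r["customer_id"] = by_account.get(r["account_number"])
    return records

def validate(records):
    # A record counts as "matched" once it carries a customer ID.
    matched = sum(1 for r in records if r["customer_id"] is not None)
    return matched, len(records)

raw = [{"account_number": " acc-001 "}, {"account_number": "ACC-999"}]
accounts = [{"account_number": "ACC-001", "customer_id": 42}]
matched, total = validate(match(normalize(raw), accounts))
print(f"matched {matched} of {total}")  # matched 1 of 2
```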
- `validate_matches.py` - Checks matching accuracy and provides metrics on match rates.
- `check_data_quality.py` - Runs data quality checks to identify inconsistencies.
- `generate_reports.py` - Generates reports on the matching process and progress.
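To give a feel for the kind of checks `check_data_quality.py` performs, here is a minimal sketch that flags nulls in key columns and duplicate account numbers. The rule set and report format are assumptions; the script's actual checks are defined in the repository.

```python
from collections import Counter

def quality_report(rows, key_columns):
    # Count nulls/empties per key column and collect duplicated account numbers.
    nulls = {c: sum(1 for r in rows if r.get(c) in (None, ""))
             for c in key_columns}
    counts = Counter(r.get("account_number") for r in rows)
    duplicates = [k for k, n in counts.items() if k and n > 1]
    return {"null_counts": nulls, "duplicate_accounts": duplicates}

rows = [
    {"account_number": "ACC-001", "customer_name": "Acme"},
    {"account_number": "ACC-001", "customer_name": ""},
    {"account_number": None, "customer_name": "Globex"},
]
report = quality_report(rows, ["account_number", "customer_name"])
print(report["null_counts"])         # {'account_number': 1, 'customer_name': 1}
print(report["duplicate_accounts"])  # ['ACC-001']
```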
- `data_dictionary.md` - Detailed documentation of tables, fields, and relationships.
- `architecture.md` - High-level overview of data architecture and system integration.
- `troubleshooting.md` - Common issues and solutions for data and ETL challenges.
For contributions, please adhere to the following guidelines:
- Use branches for specific features or modules, and provide descriptive names (e.g., `feature/fuzzy_matching`).
- Commit regularly with clear messages describing changes made.
- Document any new functions or scripts added to the repository.
For any questions or further assistance, please reach out to the project lead, Scott Morgan.