Skip to content

Python ETL preprocessor to flatten excel files into CSV for reporting, data pipeline ingestion, and auditing

License

Notifications You must be signed in to change notification settings

pgaljan/excel-dumper

Repository files navigation

Dumper

Cross-platform Excel ETL pre-processor that flattens Excel files into a predictable CSV format, preserving key source metadata for easy ingestion into data pipelines. Exports values or formulas, simplifying audit processes. Available in both Python and PowerShell versions.

Features

  • Auto-discovery: Automatically finds and processes the newest Excel file in the specified directory
  • Multi-sheet support: Extracts data from all worksheets in an Excel file
  • Hidden sheet control: Option to include or skip hidden worksheets
  • Smart filtering: Only exports rows containing non-null data
  • Formula extraction: Option to show formulas instead of calculated values (.xlsx/.xlsm only)
  • Worksheet identification: Prepends worksheet name as the first column in output
  • Row number tracking: Option to include Excel row numbers for data traceability
  • Collision-safe naming: Automatically adds incremental numbers if output files already exist
  • Custom output directory: Specify where to save the exported CSV files
  • Flexible input sources: Search for files in any directory or process specific files
  • Multiple Excel formats: Supports .xlsx, .xls, .xlsm, and .xlsb files
  • Timestamped output: Uses file modification time for consistent naming (ISO 8601 format)

Quick Start

Python Version

# Install dependencies
pip install pandas openpyxl xlrd

# Basic usage
python dumper.py
python dumper.py -file "data.xlsx" -rownumbers

PowerShell Version

# Install dependencies
Install-Module ImportExcel -Force

# Basic usage
.\dumper.ps1
.\dumper.ps1 -File "data.xlsx" -RowNumbers

Installation

Python Version

Prerequisites: Python 3.6 or higher

Install Dependencies:

pip install pandas openpyxl xlrd

PowerShell Version

Prerequisites: PowerShell 5.1 or PowerShell Core 6+

Install Dependencies:

with admin rights

Install-Module ImportExcel -Force

without admin rights

Install-Module ImportExcel -Force -Scope CurrentUser

Command Line Options

Feature Python Version PowerShell Version Description
Specify file -file FILE -File FILE Process specific Excel file
Input directory -input DIR -InputDir DIR Directory to search for Excel files
Output directory -output DIR -OutputDir DIR Directory for CSV output
Skip hidden sheets -no-hide -NoHide Exclude hidden worksheets
Include row numbers -rownumbers -RowNumbers Add Excel row numbers to output
Show formulas -formulas -Formulas Show formulas instead of values (.xlsx/.xlsm only)
Help -help -Help Display detailed help

Usage Examples

Basic Usage

Python:

# Process newest Excel file from current directory
python dumper.py

# Process specific file
python dumper.py -file "data.xlsx"

# Include row numbers and formulas
python dumper.py -file "data.xlsx" -rownumbers -formulas

PowerShell:

# Process newest Excel file from current directory
.\dumper.ps1

# Process specific file
.\dumper.ps1 -File "data.xlsx"

# Include row numbers and formulas
.\dumper.ps1 -File "data.xlsx" -RowNumbers -Formulas

Advanced Usage

Python:

# Comprehensive processing with all options
python dumper.py -input "./source" -output "./exports" -file "report.xlsx" -no-hide -rownumbers -formulas

# Process from specific directory
python dumper.py -input "C:\Reports" -output "C:\Processed"

PowerShell:

# Comprehensive processing with all options
.\dumper.ps1 -InputDir "./source" -OutputDir "./exports" -File "report.xlsx" -NoHide -RowNumbers -Formulas

# Process from specific directory
.\dumper.ps1 -InputDir "C:\Reports" -OutputDir "C:\Processed"

Get Help

Python:

python dumper.py -help

PowerShell:

.\dumper.ps1 -Help

Output Format

File Naming Convention

Both versions create files with the pattern:

dumper[py|ps]_[original_filename]_[timestamp].csv
  • Python version: dumperpy_Sales_Report_2025-07-21T14-30-52-0500.csv
  • PowerShell version: dumperps_Sales_Report_2025-07-21T14-30-52-0500.csv
  • Timestamp format: ISO 8601 with colons replaced by hyphens
  • Collision handling: Adds (1), (2), etc. if file exists

CSV Structure

The output CSV contains:

  1. First column: Worksheet name
  2. Second column: Excel row number (if row numbers option used)
  3. Remaining columns: Original data from Excel worksheets (Column_1, Column_2, etc.)

Without row numbers:

Worksheet,Column_1,Column_2,Column_3
Sheet1,John Doe,Sales Manager,50000
Sheet1,Jane Smith,Developer,65000
Summary,Total Employees,,2

With row numbers:

Worksheet,Row_Number,Column_1,Column_2,Column_3
Sheet1,2,John Doe,Sales Manager,50000
Sheet1,3,Jane Smith,Developer,65000
Summary,5,Total Employees,,2

With formulas (when -formulas option used):

Worksheet,Column_1,Column_2,Column_3
Sheet1,John Doe,Sales Manager,50000
Summary,Total Employees,"FORMULA: =COUNTA(A:A)-1",2
Summary,Average Salary,"FORMULA: =AVERAGE(C2:C3)",57500

Supported Excel Formats

Both versions support:

  • .xlsx - Excel 2007+ (Open XML Format)
  • .xls - Excel 97-2003 (Binary Format)
  • .xlsm - Excel Macro-Enabled Workbook
  • .xlsb - Excel Binary Workbook

Note: Formula extraction (-formulas/-Formulas) only works with .xlsx and .xlsm files.

Dependencies

Python Version

  • pandas - Data manipulation and Excel reading
  • openpyxl - Excel 2007+ file support and formula extraction
  • xlrd - Legacy Excel (.xls) file support
  • Standard library modules: argparse, csv, glob, os, sys, datetime, pathlib

PowerShell Version

  • ImportExcel module - Excel file processing
  • PowerShell 5.1 or PowerShell Core 6+

Platform Compatibility

Python Version

  • Windows, macOS, Linux - Full cross-platform support
  • Uses pathlib for platform-independent path handling

PowerShell Version

  • Windows - Native PowerShell 5.1+
  • macOS, Linux - PowerShell Core 6+
  • Cross-platform path handling with PowerShell cmdlets

Error Handling

Both versions provide clear error messages for:

  • File not found: When specified file doesn't exist
  • No Excel files: When no Excel files found in directory
  • Missing dependencies: When required libraries/modules aren't installed
  • Read errors: When Excel file is corrupted or inaccessible
  • Write errors: When output location is not writable
  • Directory creation: Automatically creates output directories

Example Workflows

Daily Report Processing

Python:

# Process daily reports
python dumper.py -input "/data/daily_reports" -output "./exports/$(date +%Y-%m-%d)"

PowerShell:

# Process daily reports
.\dumper.ps1 -InputDir "C:\Data\DailyReports" -OutputDir "C:\Exports\$(Get-Date -Format 'yyyy-MM-dd')"

Batch Processing Multiple Directories

Python:

# Process multiple directories
for dir in data/*/; do 
    python dumper.py -input "$dir" -output "./processed/$(basename $dir)"
done

PowerShell:

# Process multiple directories
Get-ChildItem "C:\Data" -Directory | ForEach-Object {
    .\dumper.ps1 -InputDir $_.FullName -OutputDir "C:\Processed\$($_.Name)"
}

Troubleshooting

Common Issues

Python - "Error: Required libraries not found"

pip install pandas openpyxl xlrd
python -c "import pandas, openpyxl; print('Dependencies OK')"

PowerShell - "ImportExcel module not found"

Install-Module ImportExcel -Force -Scope CurrentUser
Import-Module ImportExcel

"No Excel files found in directory"

  • Verify directory path with -input/-InputDir
  • Check file extensions (.xlsx, .xls, .xlsm, .xlsb)
  • Use -file/-File to specify exact filename

"Permission denied" errors

  • Check write permissions for output directory
  • Close Excel files that might be locking files
  • Run with appropriate privileges

"Error reading Excel file"

  • Verify file isn't corrupted by opening in Excel
  • Check file isn't password-protected
  • Ensure file isn't in use by another application

Choosing Between Versions

Use Python Version When:

  • You need maximum cross-platform compatibility
  • You're integrating with existing Python data pipelines
  • You prefer pip-based dependency management
  • You need the most mature Excel processing capabilities

Use PowerShell Version When:

  • You're working primarily in Windows environments
  • You prefer PowerShell-based automation workflows
  • You want native integration with Windows administration tasks
  • You're already using PowerShell for other data processing

Advanced Configuration

Large File Processing

Both versions handle large files efficiently by:

  • Processing sheets one at a time to manage memory
  • Showing progress for each sheet
  • Using streaming approaches where possible

Integration with CI/CD

Both scripts can be integrated into automated workflows:

Python in GitHub Actions:

- name: Process Excel files
  run: |
    pip install pandas openpyxl xlrd
    python dumper.py -input "./data" -output "./processed"

PowerShell in Azure DevOps:

- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: |
      Install-Module ImportExcel -Force
      .\dumper.ps1 -InputDir ".\data" -OutputDir ".\processed"

Performance Considerations

  • Hidden sheets: Use -no-hide/-NoHide to skip unnecessary processing
  • Formula extraction: Only enable when needed as it requires additional processing
  • Large workbooks: Both versions stream data to minimize memory usage
  • Network drives: Local processing is faster than network-mounted drives

About

Python ETL preprocessor to flatten excel files into CSV for reporting, data pipeline ingestion, and auditing

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages