This project implements a secure and automated data ingestion pipeline for retrieving CSV-based financial reports from an SFTP server operated by GPP Group (gpp.group / Titan IS). It extracts, transforms, and loads the data into a Microsoft SQL Server database for subsequent analysis and operational use.
The application performs the following automated tasks:
- Establishes a secure SFTP connection to GPP’s data repository.
- Detects and downloads new CSV reports.
- Transforms and consolidates raw CSVs into structured dataframes.
- Inserts the processed data into SQL Server tables.
- Avoids redundant ingestion using timestamp-based filters.
This solution is suitable for financial data teams integrating third-party position, transaction, or custody reports into internal systems.
All data is retrieved from an SFTP server hosted by GPP Group or its data platform Titan IS:
- Reports are stored under `/Reports` on the SFTP host.
- Filenames follow a structured timestamp format (e.g., `20240501_positions.csv`).
- Each file corresponds to a specific financial snapshot (positions, holdings, etc.).
Access requires a valid username/password and server configuration.
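Filenames can therefore be decomposed programmatically. Below is a minimal sketch, assuming the `YYYYMMDD_<report type>.csv` convention shown above; the helper name is illustrative, not taken from the codebase:

```python
from datetime import datetime

def parse_report_filename(filename: str) -> tuple[datetime, str]:
    """Split a report filename like '20240501_positions.csv' into
    its snapshot date and report type. Illustrative helper only."""
    stem = filename.rsplit(".", 1)[0]            # drop the .csv extension
    date_part, report_type = stem.split("_", 1)  # '20240501', 'positions'
    return datetime.strptime(date_part, "%Y%m%d"), report_type

snapshot_date, report_type = parse_report_filename("20240501_positions.csv")
# snapshot_date -> datetime(2024, 5, 1), report_type -> "positions"
```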
The entry point `main.py` orchestrates the workflow:
1. **Connect to SFTP**: Authenticates and connects via `paramiko` to fetch new CSV reports.
2. **Filter Reports**: Uses the last inserted timestamp (from the DB) to avoid reprocessing old files.
3. **Parse Files**: CSVs are read using `pandas` and filtered to discard invalid rows. Each file is mapped to a derived table name based on naming conventions.
4. **Transform & Insert**: Cleaned data is passed through a transformer and inserted into the configured SQL Server tables.
5. **Cleanup**: The SFTP session and transport are closed cleanly after execution.
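For orientation, the steps above might fit together roughly as follows. This is a minimal sketch, not the actual `main.py`: the helper functions (`get_last_ingested_timestamp`, `derive_table_name`, `insert_dataframe`) are illustrative stand-ins for the `database/` and `transformer/` modules, and the transform step is collapsed into the insert stub. Only the `paramiko` and `pandas` calls are the libraries' real APIs.

```python
from datetime import datetime

import paramiko
import pandas as pd

# Hypothetical helpers standing in for the project's database/ and
# transformer/ modules; names and behavior are illustrative only.
def get_last_ingested_timestamp() -> datetime:
    return datetime(2024, 4, 30)  # would be read from SQL Server in practice

def derive_table_name(filename: str) -> str:
    # e.g. "20240501_positions.csv" -> "positions"
    return filename.rsplit(".", 1)[0].split("_", 1)[1]

def insert_dataframe(df: pd.DataFrame, table: str) -> None:
    print(f"would insert {len(df)} rows into {table}")  # stub for the DB layer

def run_pipeline(host: str, port: int, user: str, password: str) -> None:
    # 1. Connect: Transport/SFTPClient are paramiko's real API.
    transport = paramiko.Transport((host, port))
    transport.connect(username=user, password=password)
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        # 2. Filter: keep files whose date prefix is newer than the last run.
        last_ts = get_last_ingested_timestamp()
        new_files = [
            name for name in sftp.listdir("/Reports")
            if datetime.strptime(name.split("_", 1)[0], "%Y%m%d") > last_ts
        ]
        # 3./4. Parse each CSV, drop invalid rows, and hand off for insert.
        for name in new_files:
            with sftp.open(f"/Reports/{name}") as fh:
                df = pd.read_csv(fh)
            df = df.dropna(how="all")  # discard fully empty rows
            insert_dataframe(df, derive_table_name(name))
    finally:
        # 5. Cleanup: close the SFTP session and transport.
        sftp.close()
        transport.close()
```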
The repository is organized as follows:

```
gpp-client-main/
├── client/          # SFTP and file engine
│   ├── engine.py    # File filtering, reading, and transformation
│   └── sftp.py      # Secure connection logic
├── config/          # Logging and environment settings
├── database/        # SQL Server DB helpers
├── transformer/     # Custom transformation logic
├── main.py          # Entry point for orchestration
├── .env.sample      # Sample configuration for environment variables
└── Dockerfile       # Container setup
```
You must configure a `.env` file based on `.env.sample` with the following variables:
| Variable | Description |
|---|---|
| `SFTP_HOST`, `SFTP_PORT` | SFTP server connection details |
| `SFTP_USER`, `SFTP_PASSWORD` | SFTP credentials |
| `MSSQL_*` | Database connection configuration |
| `OUTPUT_TABLE` | Target table name for inserts |
| `INSERTER_MAX_RETRIES` | Retry limit for DB operations |
| `LOG_LEVEL` | Logging verbosity (e.g., `INFO`, `DEBUG`) |
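As an illustration, the variables above might be loaded at startup along these lines. The `Settings` dataclass, the default values, and the omission of the `MSSQL_*` variables (whose exact names are not listed here) are assumptions, not the project's actual config module:

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    """Illustrative container for the environment variables listed above."""
    sftp_host: str
    sftp_port: int
    sftp_user: str
    sftp_password: str
    output_table: str
    inserter_max_retries: int
    log_level: str

def load_settings() -> Settings:
    # os.environ[...] raises KeyError for missing required variables,
    # surfacing misconfiguration early; defaults here are illustrative.
    return Settings(
        sftp_host=os.environ["SFTP_HOST"],
        sftp_port=int(os.environ.get("SFTP_PORT", "22")),
        sftp_user=os.environ["SFTP_USER"],
        sftp_password=os.environ["SFTP_PASSWORD"],
        output_table=os.environ["OUTPUT_TABLE"],
        inserter_max_retries=int(os.environ.get("INSERTER_MAX_RETRIES", "3")),
        log_level=os.environ.get("LOG_LEVEL", "INFO"),
    )
```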
The application is container-ready for deployment.
```bash
docker build -t gpp-client .
docker run --env-file .env gpp-client
```
Install dependencies via pip:
```bash
pip install -r requirements.txt
```
Key libraries:
- `paramiko`: SFTP communication
- `pandas`: Data loading and shaping
- `SQLAlchemy`, `pyodbc`: MSSQL interfacing
- `fast-to-sql`: Efficient bulk data operations
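As a sketch of how `fast-to-sql` and `INSERTER_MAX_RETRIES` could combine in the insert step; the connection handling, retry policy, and backoff are illustrative assumptions, not the project's actual logic:

```python
import time

import pandas as pd
import pyodbc
from fast_to_sql import fast_to_sql

def insert_with_retries(df: pd.DataFrame, table: str, conn_str: str,
                        max_retries: int = 3) -> None:
    """Insert a DataFrame into SQL Server, retrying on transient failures."""
    for attempt in range(1, max_retries + 1):
        try:
            conn = pyodbc.connect(conn_str)
            try:
                # fast_to_sql performs an efficient bulk insert over pyodbc.
                fast_to_sql(df, table, conn, if_exists="append")
                conn.commit()
                return
            finally:
                conn.close()
        except pyodbc.Error:
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff
```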
Ensure `.env` is configured properly, then run:

```bash
python main.py
```
Application logs will confirm:
- Number of new files detected
- Number of rows inserted per table
- Any errors or warnings during transformation
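For reference, `LOG_LEVEL` could be wired into Python's standard `logging` module roughly as follows; the actual setup lives under `config/` and may differ:

```python
import logging
import os

def configure_logging() -> None:
    # Map the LOG_LEVEL environment variable onto a stdlib logging level,
    # falling back to INFO if the value is missing or unrecognized.
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    logging.basicConfig(
        level=getattr(logging, level_name, logging.INFO),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )

configure_logging()
logging.getLogger("gpp-client").info("Pipeline starting")
```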
This project is MIT licensed. Usage of the SFTP server and its data is subject to agreements with GPP Group and Titan IS. Ensure proper authorization before accessing external data.
- Add support for PGP-encrypted reports.
- Implement webhook or schedule triggers.
- Extend transformation rules via configuration or plugins.
- Track ingestion metadata in a separate audit table.