pyspark-dev
A comprehensive PySpark data processing framework designed for Amazon EMR Serverless with Apache Airflow integration. This project provides scalable ETL pipelines for processing and transforming digital land data collections.
This repository contains PySpark jobs that process various digital land datasets including:
- Transport access nodes
- Title boundaries
- Entity data transformations
- Fact and fact resource processing
- Issue tracking and validation
- ✅ EMR Serverless Ready: Optimized for AWS EMR Serverless execution
- ✅ Airflow Integration: DAGs for orchestrating data workflows
- ✅ Modular Design: Reusable transformation components
- ✅ Comprehensive Testing: Unit, integration, and acceptance tests with pytest
- ✅ Configuration Management: JSON-based dataset and schema configuration
- ✅ AWS Secrets Integration: Secure credential management
- ✅ Multiple Output Formats: Support for Parquet, CSV, and database outputs
pyspark-jobs/
├── src/                                  # Source code
│   ├── jobs/                             # Core PySpark job modules
│   │   ├── main_collection_data.py       # Main ETL pipeline
│   │   ├── transform_collection_data.py  # Data transformation logic
│   │   ├── run_main.py                   # EMR entry point script
│   │   ├── config/                       # Configuration files
│   │   │   ├── datasets.json             # Dataset definitions
│   │   │   └── transformed_source.json   # Schema configurations
│   │   └── dbaccess/                     # Database connectivity modules
│   ├── utils/                            # Utility modules
│   │   ├── aws_secrets_manager.py        # AWS Secrets Manager integration
│   │   └── path_utils.py                 # Path resolution utilities
│   ├── airflow/                          # Airflow DAGs and configuration
│   │   └── dags/                         # Airflow DAG definitions
│   └── infra/                            # Infrastructure scripts
│       └── emr/                          # EMR deployment scripts
├── tests/                                # Comprehensive test suite
│   ├── unit/                             # Unit tests (fast, isolated)
│   ├── integration/                      # Integration tests (databases, files)
│   ├── acceptance/                       # End-to-end workflow tests
│   └── conftest.py                       # Shared test configuration
├── examples/                             # Usage examples
├── requirements.txt                      # Production dependencies
├── requirements-test.txt                 # Testing dependencies
├── pytest.ini                            # Pytest configuration
├── setup.py                              # Package configuration
└── README.md                             # This file
- Python 3.8+
- Java 11+ (for PySpark)
- Apache Spark 3.3+
- AWS CLI configured (for deployment)
- Clone the repository:
git clone <repository-url>
cd pyspark-jobs
- Install dependencies:
# Production dependencies
pip install -r requirements.txt
# Development and testing dependencies
pip install -r requirements-test.txt
- Install the package in development mode:
pip install -e .
- Run a specific transformation:
python src/jobs/run_main.py \
  --load_type full \
  --data_set transport-access-node \
  --path s3://your-bucket/data/
- Execute the main ETL pipeline:
python src/jobs/main_collection_data.py
This project includes a comprehensive test suite with three levels of testing:
# Run all tests
pytest
# Run specific test categories
pytest -m unit # Fast unit tests
pytest -m integration # Integration tests
pytest -m acceptance # End-to-end tests
# Run with coverage
pytest --cov=src --cov-report=html
# Run in parallel
pytest -n auto
- Unit Tests (tests/unit/): Fast, isolated component tests
- Integration Tests (tests/integration/): Database and external service tests
- Acceptance Tests (tests/acceptance/): Complete workflow validation
For detailed testing information, see tests/README.md.
The core ETL pipeline (main_collection_data.py) processes data through three stages (sketched after this list):
- Data Extraction: Load from S3 CSV files
- Data Transformation: Apply business logic transformations
- Data Loading: Output to partitioned Parquet files
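A minimal PySpark sketch of these three stages is shown below; the S3 paths, column names, and partition key are placeholders rather than the project's actual schema.
from pyspark.sql import SparkSession, functions as F
# Illustrative only: paths, columns, and the partition key are assumptions.
spark = SparkSession.builder.appName("collection-data-sketch").getOrCreate()
# 1. Extraction: load CSV files from S3
raw_df = spark.read.option("header", True).csv("s3://your-bucket/data/transport-access-node/")
# 2. Transformation: apply business logic
transformed_df = (
    raw_df
    .withColumn("entity", F.trim(F.col("entity")))
    .withColumn("load_date", F.current_date())
)
# 3. Loading: write partitioned Parquet output
(
    transformed_df.write
    .mode("overwrite")
    .partitionBy("load_date")
    .parquet("s3://your-bucket/output/transport-access-node/")
)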
Configure datasets in src/jobs/config/datasets.json:
{
  "transport-access-node": {
    "path": "s3://bucket/transport-access-node-collection/",
    "enabled": true
  },
  "title-boundaries": {
    "path": "s3://bucket/title-boundary-collection/",
    "enabled": false
  }
}
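A job can read this file and skip datasets that are switched off. The snippet below is a sketch of that pattern; it reads the file relative to the repository root, whereas the packaged jobs presumably resolve the path through utils/path_utils.py.
import json
from pathlib import Path
# Sketch: load the dataset configuration and keep only enabled datasets.
config_path = Path("src/jobs/config/datasets.json")
datasets = json.loads(config_path.read_text())
enabled = {name: cfg for name, cfg in datasets.items() if cfg.get("enabled")}
for name, cfg in enabled.items():
    print(f"Processing {name} from {cfg['path']}")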
- Fact Processing: Deduplicate and prioritize fact records (see the sketch after this list)
- Fact Resource Processing: Extract resource relationships
- Entity Processing: Pivot fields into structured entity records
- Issue Processing: Track and validate data quality issues
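As an illustration of the fact processing step, the sketch below keeps one record per fact using a window over an assumed priority column; the column names and sample rows are placeholders, not the project's actual schema.
from pyspark.sql import SparkSession, Window, functions as F
spark = SparkSession.builder.getOrCreate()
# Illustrative fact records; the real pipeline loads these from the collection data.
fact_df = spark.createDataFrame(
    [("fact-1", "entry-a", 2), ("fact-1", "entry-b", 1), ("fact-2", "entry-c", 1)],
    ["fact", "entry", "priority"],
)
# Keep one row per fact, preferring the lowest priority value.
fact_window = Window.partitionBy("fact").orderBy(F.col("priority").asc())
deduplicated_facts = (
    fact_df
    .withColumn("row_num", F.row_number().over(fact_window))
    .filter(F.col("row_num") == 1)
    .drop("row_num")
)
deduplicated_facts.show()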
# AWS Configuration
export AWS_REGION=eu-west-2
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
# Database Configuration (optional)
export POSTGRES_SECRET_NAME=your-secret-name
export USE_DATABASE=true
# Spark Configuration
export PYSPARK_PYTHON=python3
export SPARK_HOME=/path/to/spark
Use AWS Secrets Manager for secure credential storage:
from utils.aws_secrets_manager import get_database_credentials
# Retrieve database credentials
db_creds = get_database_credentials("myapp/database/postgres")
See examples/secrets_usage_example.py for detailed usage.
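Helpers like this typically wrap AWS Secrets Manager calls made through boto3. The sketch below shows that generic pattern for illustration only; it is not the project's aws_secrets_manager implementation.
import json
import boto3
def fetch_secret(secret_name: str, region: str = "eu-west-2") -> dict:
    """Illustrative helper: fetch and parse a JSON secret with boto3."""
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])
db_creds = fetch_secret("myapp/database/postgres")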
- Package the application:
python setup.py bdist_wheel
- Upload to S3:
aws s3 cp dist/pyspark_jobs-*.whl s3://your-bucket/packages/
aws s3 cp src/jobs/run_main.py s3://your-bucket/scripts/
- Submit EMR Serverless job:
aws emr-serverless start-job-run \
  --application-id your-app-id \
  --execution-role-arn your-role-arn \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://your-bucket/scripts/run_main.py",
      "sparkSubmitParameters": "--py-files s3://your-bucket/packages/pyspark_jobs-*.whl"
    }
  }'
Deploy DAGs to Amazon MWAA:
aws s3 sync src/airflow/dags/ s3://your-airflow-bucket/dags/
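The DAGs in src/airflow/dags/ orchestrate the EMR Serverless runs. The example below is a hypothetical minimal DAG using the Amazon provider's EmrServerlessStartJobOperator with placeholder IDs and paths; it is not one of the project's shipped DAGs.
from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrServerlessStartJobOperator
# Hypothetical DAG: application ID, role ARN, and S3 paths are placeholders.
with DAG(
    dag_id="collection_data_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    run_collection_job = EmrServerlessStartJobOperator(
        task_id="run_collection_job",
        application_id="your-app-id",
        execution_role_arn="your-role-arn",
        job_driver={
            "sparkSubmit": {
                "entryPoint": "s3://your-bucket/scripts/run_main.py",
                "sparkSubmitParameters": "--py-files s3://your-bucket/packages/pyspark_jobs.whl",
            }
        },
    )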
Access the Spark UI at http://localhost:4040 during local execution.
EMR Serverless jobs automatically log to CloudWatch under:
/aws/emr-serverless/applications/{application-id}/jobs/{job-run-id}
Structured logging with configurable levels:
import logging
logger = logging.getLogger(__name__)
logger.info("Processing started for dataset: %s", dataset_name)
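Log levels are typically made configurable through an environment variable; the snippet below shows one way to wire that up (the LOG_LEVEL variable name is an assumption, not something the project defines).
import logging
import os
# Assumed convention: a LOG_LEVEL environment variable controls verbosity.
logging.basicConfig(
    level=os.environ.get("LOG_LEVEL", "INFO"),
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)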
- Automatic schema inference and validation
- Support for required and optional fields
- Data type enforcement
- Comprehensive data quality checks
- Issue categorization and reporting
- Integration with fact/entity processing
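For example, a required-field check can route failing rows into an issue record set while clean rows continue through the pipeline. The sketch below uses placeholder column names and an invented issue_type label, not the project's actual taxonomy.
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
# Illustrative input; the real pipeline reads collection data from S3.
df = spark.createDataFrame(
    [("entity-1", "2024-01-01"), ("entity-2", None)],
    ["entity", "entry_date"],
)
# Rows missing a required field become categorised issue records.
issues_df = (
    df.filter(F.col("entry_date").isNull())
      .withColumn("issue_type", F.lit("missing-value"))
)
valid_df = df.filter(F.col("entry_date").isNotNull())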
- Jobs: Main processing logic in src/jobs/
- Utils: Shared utilities in src/utils/
- Configuration: JSON-based config in src/jobs/config/
- Tests: Comprehensive test suite in tests/
- Create a transformation function in transform_collection_data.py (a skeleton is sketched below)
- Add schema configuration to the config/ directory
- Write comprehensive tests in the appropriate test directory
- Update dataset configuration if needed
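A new transformation would follow the same DataFrame-in, DataFrame-out shape as the existing ones. The skeleton below is hypothetical; the function name and columns are placeholders, not part of transform_collection_data.py.
from pyspark.sql import DataFrame, functions as F
def transform_new_dataset(df: DataFrame) -> DataFrame:
    """Hypothetical skeleton for a new dataset transformation."""
    return (
        df.withColumn("entity", F.trim(F.col("entity")))
          .withColumn("dataset", F.lit("new-dataset"))
    )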
# Run linters
black src/ tests/
flake8 src/ tests/
isort src/ tests/
# Type checking
mypy src/
See the examples/ directory for:
- AWS Secrets Manager usage
- Custom transformation examples
- Configuration templates
- Deployment scripts
- Fork the repository
- Create a feature branch
- Make changes with tests
- Run the test suite
- Submit a pull request
# Install development dependencies
pip install -r requirements-test.txt
# Install pre-commit hooks
pre-commit install
# Run tests before committing
pytest
This project is licensed under the MIT License - see the LICENSE file for details.
For issues and questions:
- Check the tests/README.md for testing guidance
- Review examples in the examples/ directory
- Check documentation in the docs/ directory
- Open an issue on GitHub
The project includes configuration for:
- Automated testing with pytest
- Code quality checks
- AWS deployment pipelines
- Docker containerization support
# Example workflow
name: CI/CD Pipeline
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
      - run: pip install -r requirements-test.txt
      - run: pytest --cov=src
Built with ❤️ for Digital Land data processing.