This repository provides a template for creating new processing pipelines within the Impresso project ecosystem. It demonstrates best practices for building scalable, distributed newspaper processing workflows using Make, Python, and S3 storage.
- Overview
- Template Structure
- Quick Start
- Configuration
- Running the Template
- Adapting to Your Processing Pipeline
- Build System
- Contributing
- About Impresso
This template provides a complete framework for building newspaper processing pipelines that:
- Scale Horizontally: Process data across multiple machines without conflicts
- Handle Large Datasets: Efficiently process large collections using S3 and local stamp files
- Maintain Consistency: Ensure reproducible results with proper dependency management
- Support Parallel Processing: Utilize multi-core systems and distributed computing
- Integrate with S3: Seamlessly work with both local files and S3 storage
├── README.md # This file
├── Makefile # Main build configuration
├── .env # Environment variables (create manually from dotenv.sample)
├── dotenv.sample # Sample environment configuration
├── Pipfile # Python dependencies
├── lib/
│ └── cli_TEMPLATE.py # Template CLI script
├── cookbook/ # Build system components
│ ├── README.md # Detailed cookbook documentation
│ ├── setup_TEMPLATE.mk # Template-specific setup
│ ├── paths_TEMPLATE.mk # Path definitions
│ ├── sync_TEMPLATE.mk # Data synchronization
│ ├── processing_TEMPLATE.mk # Processing targets
│ └── ... # Other cookbook components
└── build.d/ # Local build directory (auto-created)
Follow these steps to get started with the template:
Ensure you have the required system dependencies installed:
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y make git git-lfs parallel coreutils python3 python3-pip
macOS:
# Install Homebrew if not already installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install dependencies
brew install make git git-lfs parallel coreutils python3
System Requirements:
- Python 3.11+
- Make (GNU Make recommended)
- Git with git-lfs
- AWS CLI (optional, for direct S3 access)
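You can verify that the core tools are available before continuing:
# Check that the required tools are on your PATH
python3 --version   # should report 3.11 or newer
make --version
git lfs version
parallel --version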
- Clone the repository:

  git clone --recursive <your-template-repo>
  cd impresso-cookbook-template

- Configure environment:

  cp dotenv.sample .env
  # Edit .env with your S3 credentials (see Configuration section below)

- Install Python dependencies:

  # Using pipenv (recommended)
  pipenv install
  # Or using pip directly
  python3 -m pip install -r requirements.txt

- Initialize the environment:

  make setup
Test your setup with a quick help command:
make help
You should see available targets and configuration options.
Before running any processing, configure your environment:
Edit your `.env` file with these required settings:
# S3 Configuration (required)
SE_ACCESS_KEY=your_s3_access_key
SE_SECRET_KEY=your_s3_secret_key
SE_HOST_URL=https://os.zhdk.cloud.switch.ch/
# Logging Configuration (optional)
LOGGING_LEVEL=INFO
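If you have the optional AWS CLI installed, one way to sanity-check the credentials is to list the input bucket directly. This is only a sketch; the bucket name is the default documented below:
# Export the variables from .env, then list the input bucket
set -a; source .env; set +a
AWS_ACCESS_KEY_ID="$SE_ACCESS_KEY" \
AWS_SECRET_ACCESS_KEY="$SE_SECRET_KEY" \
aws s3 ls s3://22-rebuilt-final/ --endpoint-url "$SE_HOST_URL"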
The following variables can be set in `.env`, exported as shell variables (they propagate to make), or passed as command-line arguments to make:
- `NEWSPAPER`: Target newspaper to process
- `BUILD_DIR`: Local build directory (default: `build.d`)
- `NEWSPAPER_YEAR_SORTING`: Processing order of the years within a newspaper (`shuf` for random, `cat` for chronological)
- `NPROC`: Number of CPU cores (auto-detected if not set)
- `NEWSPAPER_JOBS`: Number of parallel jobs per newspaper processing (derived: NPROC ÷ COLLECTION_JOBS)
- `COLLECTION_JOBS`: Number of newspapers to process in parallel within a collection (default: 2)
- `MAX_LOAD`: Maximum system load (default: NPROC)
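For example, to process one newspaper with a fixed core count and chronological year order:
# All of these variables can be combined on a single make invocation
make newspaper NEWSPAPER=actionfem NPROC=8 NEWSPAPER_YEAR_SORTING=cat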
Configure S3 buckets in your paths file:
- `S3_BUCKET_REBUILT`: Input data bucket (default: `22-rebuilt-final`)
- `S3_BUCKET_TEMPLATE`: Output data bucket (default: `140-processed-data-sandbox`)
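If these are defined as overridable defaults like the variables above, they can also be set per invocation; the bucket name below is purely illustrative:
# Write results to a different sandbox bucket (illustrative name)
make newspaper NEWSPAPER=actionfem S3_BUCKET_TEMPLATE=my-test-sandbox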
Process a small newspaper to verify everything works:
# Test with a smaller newspaper first
make newspaper NEWSPAPER=actionfem
Process a single newspaper (all years):
make newspaper NEWSPAPER=actionfem
Step-by-step processing:
- Sync data:

  make sync NEWSPAPER=actionfem

- Run processing:

  make processing-target NEWSPAPER=actionfem
Process multiple newspapers:
make collection COLLECTION_JOBS=4
Explore the build system:
# Show all available targets
make help
# Show current configuration
make config
Once you've verified the template works, adapt it to your specific processing needs:
Decide on a short acronym for your new pipeline (e.g., `myimpressopipeline`):
export PROCESSING_ACRONYM=myimpressopipeline
make -f cookbook/template-starter.mk
This will create adapted files with your acronym:
├── README.md # This file
├── Makefile.myimpressopipeline # Main build configuration adapted for myimpressopipeline
├── .env # Environment variables (create manually from dotenv.sample)
├── dotenv.sample # Sample environment configuration
├── Pipfile # Python dependencies
├── lib/
│ └── cli_myimpressopipeline.py # Template CLI script adapted for myimpressopipeline
├── cookbook/ # Build system components
│ ├── README.md # Detailed cookbook documentation
│ ├── setup_myimpressopipeline.mk # myimpressopipeline-specific setup
│ ├── paths_myimpressopipeline.mk # Path definitions
│ ├── sync_myimpressopipeline.mk # Data synchronization
│ ├── processing_myimpressopipeline.mk # Processing targets
│ └── ... # Other cookbook components
└── build.d/ # Local build directory (auto-created)
After adaptation, customize these key files:
- `lib/cli_myimpressopipeline.py`: Implement your processing logic (a quick smoke test is sketched below)
- `cookbook/processing_myimpressopipeline.mk`: Define your processing targets
- `cookbook/paths_myimpressopipeline.mk`: Configure input/output paths and S3 buckets
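As a quick smoke test of the adapted CLI (the `--help` flag is an assumption here; check the script for its actual options):
# Run the adapted CLI inside the pipenv environment
pipenv run python3 lib/cli_myimpressopipeline.py --help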
# Use your new Makefile
make -f Makefile.myimpressopipeline newspaper NEWSPAPER=actionfem
- `make help`: Show available targets and current configuration
- `make setup`: Initialize environment (run once after installation)
- `make newspaper`: Process single newspaper
- `make collection`: Process multiple newspapers in parallel
- `make all`: Complete processing pipeline with data sync
- `make sync`: Sync input and output data
- `make sync-input`: Download input data from S3
- `make sync-output`: Upload results to S3 (will never overwrite existing data)
- `make clean-build`: Remove build directory
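Put together, a typical round trip for one newspaper looks like this:
# Download inputs, process, then upload results
make sync-input NEWSPAPER=actionfem
make processing-target NEWSPAPER=actionfem
make sync-output NEWSPAPER=actionfem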
The system automatically detects CPU cores and configures parallel processing:
# Process collection with custom parallelization
make collection COLLECTION_JOBS=4 MAX_LOAD=8
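For example, on a 16-core machine, COLLECTION_JOBS=4 leaves each newspaper with NEWSPAPER_JOBS = 16 ÷ 4 = 4 parallel jobs, so the total load stays near NPROC.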
The build system uses:
- Stamp Files: Track processing state without downloading full datasets (see the sketch below)
- S3 Integration: Direct processing from/to S3 storage
- Distributed Processing: Multiple machines can work independently
- Dependency Management: Automatic dependency resolution via Make
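To illustrate the stamp-file idea, here is a minimal sketch: a local marker records that a remote object exists, so Make can track state without fetching the data. The paths and file names are hypothetical, not the template's actual layout:
# Create a local stamp for a remote S3 object (illustrative paths)
mkdir -p build.d/actionfem
aws s3 ls "s3://22-rebuilt-final/actionfem/actionfem-1927.jsonl.bz2" \
    --endpoint-url "$SE_HOST_URL" \
  && touch "build.d/actionfem/actionfem-1927.jsonl.bz2.stamp"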
For detailed build system documentation, see cookbook/README.md.
- Fork the repository
- Create a feature branch
- Make your changes
- Test with `make newspaper NEWSPAPER=actionfem`
- Submit a pull request
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders.
The project is funded by:
- Swiss National Science Foundation (grants CRSII5_173719 and CRSII5_213585)
- Luxembourg National Research Fund (grant 17498891)
Copyright (C) 2024 The Impresso team.
This program is provided as open source under the GNU Affero General Public License v3 or later.