This repository provides a template for creating new processing pipelines within the Impresso project ecosystem. It demonstrates best practices for building scalable, distributed newspaper processing workflows using Make, Python, and S3 storage.
- Overview
- Template Structure
- Quick Start
- Configuration
- Running the Template
- Adapting to Your Processing Pipeline
- Build System
- Contributing
- About Impresso
This template provides a complete framework for building newspaper processing pipelines that:
- Scale Horizontally: Process data across multiple machines without conflicts
- Handle Large Datasets: Efficiently process large collections using S3 and local stamp files
- Maintain Consistency: Ensure reproducible results with proper dependency management
- Support Parallel Processing: Utilize multi-core systems and distributed computing
- Integrate with S3: Seamlessly work with both local files and S3 storage
├── README.md # This file
├── Makefile # Main build configuration
├── .env # Environment variables (create manually from dotenv.sample)
├── dotenv.sample # Sample environment configuration
├── Pipfile # Python dependencies
├── lib/
│ └── cli_TEMPLATE.py # Template CLI script
├── cookbook/ # Build system components
│ ├── README.md # Detailed cookbook documentation
│ ├── setup_TEMPLATE.mk # Template-specific setup
│ ├── paths_TEMPLATE.mk # Path definitions
│ ├── sync_TEMPLATE.mk # Data synchronization
│ ├── processing_TEMPLATE.mk # Processing targets
│ └── ... # Other cookbook components
└── build.d/ # Local build directory (auto-created)
Follow these steps to get started with the template:
Ensure you have the required system dependencies installed:
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y make git git-lfs parallel coreutils python3 python3-pip
macOS:
# Install Homebrew if not already installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install dependencies
brew install make git git-lfs parallel coreutils python3
System Requirements:
- Python 3.11+
- Make (GNU Make recommended)
- Git with git-lfs
- AWS CLI (optional, for direct S3 access)
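You can verify that the core tools are available before continuing:
# Check that the required tools are on your PATH
python3 --version   # should report 3.11 or newer
make --version
git lfs version
parallel --version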
- Clone the repository:

  git clone --recursive <your-template-repo>
  cd impresso-cookbook-template

- Configure environment:

  cp dotenv.sample .env
  # Edit .env with your S3 credentials (see Configuration section below)

- Install Python dependencies:

  # Using pipenv (recommended)
  pipenv install
  # Or using pip directly
  python3 -m pip install -r requirements.txt

- Initialize the environment:

  make setup
Test your setup with a quick help command:
make help
You should see available targets and configuration options.
Before running any processing, configure your environment:
Edit your `.env` file with these required settings:
# S3 Configuration (required)
SE_ACCESS_KEY=your_s3_access_key
SE_SECRET_KEY=your_s3_secret_key
SE_HOST_URL=https://os.zhdk.cloud.switch.ch/
# Logging Configuration (optional)
LOGGING_LEVEL=INFO
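If you have the optional AWS CLI installed, one way to sanity-check the credentials is to list the input bucket directly. This is only a sketch; the bucket name is the default documented below:
# Export the variables from .env, then list the input bucket
set -a; source .env; set +a
AWS_ACCESS_KEY_ID="$SE_ACCESS_KEY" \
AWS_SECRET_ACCESS_KEY="$SE_SECRET_KEY" \
aws s3 ls s3://22-rebuilt-final/ --endpoint-url "$SE_HOST_URL"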
The following variables can be set in `.env`, exported as shell variables (they propagate to make), or passed as command-line arguments to make:
- `NEWSPAPER`: Target newspaper to process
- `BUILD_DIR`: Local build directory (default: `build.d`)
- `NEWSPAPER_YEAR_SORTING`: Processing order of the years within a newspaper (`shuf` for random, `cat` for chronological)
- `NPROC`: Number of CPU cores (auto-detected if not set)
- `NEWSPAPER_JOBS`: Number of parallel jobs per newspaper processing (derived: NPROC ÷ COLLECTION_JOBS)
- `COLLECTION_JOBS`: Number of newspapers to process in parallel within a collection (default: 2)
- `MAX_LOAD`: Maximum system load (default: NPROC)
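For example, to process one newspaper with a fixed core count and chronological year order:
# All of these variables can be combined on a single make invocation
make newspaper NEWSPAPER=actionfem NPROC=8 NEWSPAPER_YEAR_SORTING=cat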
Configure S3 buckets in your paths file:
- `S3_BUCKET_REBUILT`: Input data bucket (default: `22-rebuilt-final`)
- `S3_BUCKET_TEMPLATE`: Output data bucket (default: `140-processed-data-sandbox`)
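If these are defined as overridable defaults like the variables above, they can also be set per invocation; the bucket name below is purely illustrative:
# Write results to a different sandbox bucket (illustrative name)
make newspaper NEWSPAPER=actionfem S3_BUCKET_TEMPLATE=my-test-sandbox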
Process a small newspaper to verify everything works:
# Test with a smaller newspaper first
make newspaper NEWSPAPER=actionfem
Process a single newspaper (all years):
make newspaper NEWSPAPER=actionfem
Step-by-step processing:
- Sync data:

  make sync NEWSPAPER=actionfem

- Run processing:

  make processing-target NEWSPAPER=actionfem
Process multiple newspapers:
make collection COLLECTION_JOBS=4
Explore the build system:
# Show all available targets
make help
# Show current configuration
make config
Once you've verified the template works, adapt it to your specific processing needs:
Decide on a short acronym for your new pipeline (e.g., `myimpressopipeline`):
export PROCESSING_ACRONYM=myimpressopipeline
make -f cookbook/template-starter.mk
This will create adapted files with your acronym:
├── README.md # This file
├── Makefile.myimpressopipeline # Main build configuration adapted for myimpressopipeline
├── .env # Environment variables (create manually from dotenv.sample)
├── dotenv.sample # Sample environment configuration
├── Pipfile # Python dependencies
├── lib/
│ └── cli_myimpressopipeline.py # Template CLI script adapted for myimpressopipeline
├── cookbook/ # Build system components
│ ├── README.md # Detailed cookbook documentation
│ ├── setup_myimpressopipeline.mk # myimpressopipeline-specific setup
│ ├── paths_myimpressopipeline.mk # Path definitions
│ ├── sync_myimpressopipeline.mk # Data synchronization
│ ├── processing_myimpressopipeline.mk # Processing targets
│ └── ... # Other cookbook components
└── build.d/ # Local build directory (auto-created)
After adaptation, customize these key files:
- `lib/cli_myimpressopipeline.py`: Implement your processing logic (a quick smoke test is sketched below)
- `cookbook/processing_myimpressopipeline.mk`: Define your processing targets
- `cookbook/paths_myimpressopipeline.mk`: Configure input/output paths and S3 buckets
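As a quick smoke test of the adapted CLI (the `--help` flag is an assumption here; check the script for its actual options):
# Run the adapted CLI inside the pipenv environment
pipenv run python3 lib/cli_myimpressopipeline.py --help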
# Use your new Makefile
make -f Makefile.myimpressopipeline newspaper NEWSPAPER=actionfem
- `make help`: Show available targets and current configuration
- `make setup`: Initialize environment (run once after installation)
- `make newspaper`: Process single newspaper
- `make collection`: Process multiple newspapers in parallel
- `make all`: Complete processing pipeline with data sync
- `make sync`: Sync input and output data
- `make sync-input`: Download input data from S3
- `make sync-output`: Upload results to S3 (will never overwrite existing data)
- `make clean-build`: Remove build directory
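Put together, a typical round trip for one newspaper looks like this:
# Download inputs, process, then upload results
make sync-input NEWSPAPER=actionfem
make processing-target NEWSPAPER=actionfem
make sync-output NEWSPAPER=actionfem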
The system automatically detects CPU cores and configures parallel processing:
# Process collection with custom parallelization
make collection COLLECTION_JOBS=4 MAX_LOAD=8
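For example, on a 16-core machine, COLLECTION_JOBS=4 leaves each newspaper with NEWSPAPER_JOBS = 16 ÷ 4 = 4 parallel jobs, so the total load stays near NPROC.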
The build system uses:
- Stamp Files: Track processing state without downloading full datasets (see the sketch below)
- S3 Integration: Direct processing from/to S3 storage
- Distributed Processing: Multiple machines can work independently
- Dependency Management: Automatic dependency resolution via Make
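To illustrate the stamp-file idea, here is a minimal sketch: a local marker records that a remote object exists, so Make can track state without fetching the data. The paths and file names are hypothetical, not the template's actual layout:
# Create a local stamp for a remote S3 object (illustrative paths)
mkdir -p build.d/actionfem
aws s3 ls "s3://22-rebuilt-final/actionfem/actionfem-1927.jsonl.bz2" \
    --endpoint-url "$SE_HOST_URL" \
  && touch "build.d/actionfem/actionfem-1927.jsonl.bz2.stamp"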
For detailed build system documentation, see cookbook/README.md.
- Fork the repository
- Create a feature branch
- Make your changes
- Test with `make newspaper NEWSPAPER=actionfem`
- Submit a pull request
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders.
The project is funded by:
- Swiss National Science Foundation (grants CRSII5_173719 and CRSII5_213585)
- Luxembourg National Research Fund (grant 17498891)
Copyright (C) 2024 The Impresso team.
This program is provided as open source under the GNU Affero General Public License v3 or later.