CodeForAfrica/promisetracker_pipeline

Document Extraction Pipeline

A simple document extraction pipeline built with Deno. It extracts text from a range of document formats and saves the result as plain-text files.

Features

  • 📁 Batch Processing: Process all files in the input directory automatically
  • Duplicate Detection: Skip files that have already been extracted
  • 🚀 Deno Native: Built using Deno runtime and standard library features
  • 🔧 Apache Tika Integration: Supports multiple document formats (PDF, DOC, DOCX, etc.)

Project Structure

├── extractor.ts        # Document extraction logic
├── main.ts             # Main application entry point
├── input/              # Input directory for documents (auto-created)
├── output/             # Output directory for extracted text (auto-created)
└── README.md           # This file

Usage

  1. Start Apache Tika server using Docker:

    docker compose up -d
  2. Place documents in the input directory:

    mkdir -p input
    cp your-documents/* ./input/
  3. Run the extraction pipeline:

    deno task dev

    Or run directly:

    deno run -A main.ts
  4. Check the extracted text files in the output directory:

    ls output/

How it Works

  1. The application reads all files from the ./input directory
  2. For each file, it checks if a corresponding .txt file already exists in ./output
  3. If the file hasn't been processed, it extracts the text using Apache Tika
  4. The extracted text is cleaned (normalized whitespace) and saved as a .txt file
  5. Files that have already been processed are automatically skipped
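
The skip check in steps 2 and 5 can be sketched as a pair of pure functions. This is a minimal illustration, not the code in main.ts: `outputNameFor` and `selectPending` are hypothetical names, and the sketch assumes the output name is formed by replacing the last extension with .txt.

```typescript
// Hypothetical helper: map an input file name to its expected output name
// by swapping the last extension for ".txt" (document.pdf -> document.txt).
function outputNameFor(inputName: string): string {
  const dot = inputName.lastIndexOf(".");
  // A leading dot (e.g. ".env") is part of the name, not an extension.
  const stem = dot > 0 ? inputName.slice(0, dot) : inputName;
  return `${stem}.txt`;
}

// Hypothetical helper: keep only the inputs whose output file is missing.
function selectPending(
  inputNames: string[],
  existingOutputs: Iterable<string>,
): string[] {
  const done = new Set(existingOutputs);
  return inputNames.filter((name) => !done.has(outputNameFor(name)));
}
```

In a Deno program the two lists would typically be gathered by iterating `Deno.readDir("./input")` and `Deno.readDir("./output")`. Note that under this naming assumption, report.pdf and report.docx map to the same report.txt, so whichever is processed second would be treated as already extracted.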

Supported File Formats

The pipeline uses Apache Tika and supports:

  • PDF documents
  • Microsoft Word (DOC, DOCX)
  • Microsoft Excel (XLS, XLSX)
  • Microsoft PowerPoint (PPT, PPTX)
  • OpenDocument formats (ODT, ODS, ODP)
  • Rich Text Format (RTF)
  • Plain text files
  • And many more formats supported by Apache Tika
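
For orientation, here is a sketch of what a text-extraction request to a Tika server looks like over plain HTTP. The project itself uses @ax-llm/ax for this, so the code below is an alternative illustration rather than the project's implementation. `tikaTextCall` and `extractText` are hypothetical names, and the base URL assumes Tika's default port 9998, which the project's Docker Compose file may map differently.

```typescript
interface TikaCall {
  url: string;
  method: "PUT";
  headers: Record<string, string>;
}

// Describe a request to Tika's /tika endpoint that asks for plain text.
// Assumption: the server listens on Tika's default port, 9998.
function tikaTextCall(baseUrl = "http://localhost:9998"): TikaCall {
  return {
    url: new URL("/tika", baseUrl).toString(),
    method: "PUT",
    headers: { Accept: "text/plain" },
  };
}

// Send one document to Tika and return the extracted text.
// Tika detects the format from the uploaded bytes.
async function extractText(doc: Blob, baseUrl?: string): Promise<string> {
  const { url, method, headers } = tikaTextCall(baseUrl);
  const res = await fetch(url, { method, headers, body: doc });
  if (!res.ok) {
    throw new Error(`Tika returned ${res.status} ${res.statusText}`);
  }
  return await res.text();
}
```

A caller would wrap the file contents in a `Blob` and pass it to `extractText`. Because the server handles format detection, the same call serves every format in the list above.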

Output Format

Extracted text files are saved with the following naming convention:

  • Input: document.pdf → Output: document.txt
  • Input: report.docx → Output: report.txt

The extracted text is cleaned by:

  • Normalizing whitespace
  • Removing excessive line breaks
  • Trimming leading/trailing spaces
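
The three cleaning steps can be sketched as a chain of regular-expression replacements. This is one plausible reading, not the code in extractor.ts: `cleanText` is a hypothetical name, and the choice to keep at most one blank line between paragraphs is an assumption about what "excessive" means.

```typescript
// Hypothetical cleaner covering the three steps listed above.
function cleanText(raw: string): string {
  return raw
    .replace(/\r\n?/g, "\n") // unify Windows and old Mac line endings
    .replace(/[ \t]+/g, " ") // collapse runs of spaces and tabs
    .replace(/ *\n */g, "\n") // drop spaces around line breaks
    .replace(/\n{3,}/g, "\n\n") // assumption: keep at most one blank line
    .trim(); // remove leading and trailing whitespace
}
```

Keeping a single blank line preserves paragraph boundaries, which tends to matter if the text is later split into chunks.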

Example

# Place some documents
mkdir -p input
cp ~/Documents/*.pdf ./input/

# Run extraction
deno task dev

# Output:
# 📄 Document Extraction Pipeline Starting...
#    Input: ./input
#    Output: ./output
# 📄 Extracting document1.pdf...
# ✅ Extracted document1.pdf → document1.txt
# 📄 Extracting document2.pdf...
# ✅ Extracted document2.pdf → document2.txt
# 🎉 Document extraction pipeline completed!

# Check results
ls output/
# document1.txt  document2.txt

Dependencies

The project uses:

  • @ax-llm/ax for Apache Tika integration
  • Deno standard library for file system operations and path utilities
