Document Extraction Pipeline

A simple document extraction pipeline built with Deno. It extracts text from a range of document formats and saves the result as plain-text files.

Features

📁 Batch Processing: Process all files in the input directory automatically
⚡ Duplicate Detection: Skip files that have already been extracted
🚀 Deno Native: Built on the Deno runtime and its standard library
🔧 Apache Tika Integration: Supports multiple document formats (PDF, DOC, DOCX, etc.)

Project Structure

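├── deno.json           # Deno configuration and task definitions
├── docker-compose.yml  # Apache Tika server for docker compose
├── extractor.ts        # Document extraction logic
├── main.ts             # Main application entry point
├── input/              # Input directory for documents (auto-created)
├── output/             # Output directory for extracted text (auto-created)
└── README.md           # This file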

Usage

Start the Apache Tika server using Docker:
```
docker compose up -d
```

Place documents in the input directory:

```
mkdir -p input
cp your-documents/* ./input/
```

Run the extraction pipeline:
```
deno task dev
```
Or run directly:
```
deno run -A main.ts
```
Check the extracted text files in the output directory:
```
ls output/
```

How it Works

1. The application reads all files from the ./input directory.
2. For each file, it checks whether a corresponding .txt file already exists in ./output.
3. If one exists, the file is skipped as already processed.
4. Otherwise, the text is extracted using Apache Tika.
5. The extracted text is cleaned (whitespace normalized) and saved as a .txt file.
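
The skip check described above can be sketched as a pure function that takes the two directory listings and decides what to do with each input. This is only an illustration: `toOutputName`, `planExtraction`, and `Plan` are hypothetical names, and the actual logic in main.ts and extractor.ts may be organized differently.

```typescript
// Hypothetical helpers for illustration; not taken from main.ts or extractor.ts.

/** Map an input file name to its output name, e.g. "report.docx" -> "report.txt". */
function toOutputName(inputName: string): string {
  const dot = inputName.lastIndexOf(".");
  // dot <= 0 covers names without an extension and dotfiles such as ".env".
  const stem = dot > 0 ? inputName.slice(0, dot) : inputName;
  return `${stem}.txt`;
}

interface Plan {
  toExtract: string[]; // inputs with no matching .txt in the output listing
  skipped: string[];   // inputs whose .txt already exists
}

/** Split the input listing into files to extract and files to skip. */
function planExtraction(inputNames: string[], existingOutputs: string[]): Plan {
  const done = new Set(existingOutputs);
  const plan: Plan = { toExtract: [], skipped: [] };
  for (const name of inputNames) {
    if (done.has(toOutputName(name))) {
      plan.skipped.push(name);
    } else {
      plan.toExtract.push(name);
    }
  }
  return plan;
}
```

Keeping the decision separate from file system access makes it easy to test without touching ./input or ./output; the caller would pass in the names it read from both directories.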

Supported File Formats

The pipeline uses Apache Tika and supports:

PDF documents
Microsoft Word (DOC, DOCX)
Microsoft Excel (XLS, XLSX)
Microsoft PowerPoint (PPT, PPTX)
OpenDocument formats (ODT, ODS, ODP)
Rich Text Format (RTF)
Plain text files
And many more formats supported by Apache Tika
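
For orientation, every format above goes through the same call to the Tika server. The project delegates that call to @ax-llm/ax, so the sketch below is not the project's code; it only describes the raw Tika Server REST request (a `PUT` to `/tika` with `Accept: text/plain`). `buildTikaRequest` and `TikaRequest` are hypothetical names, and the base URL is an assumption: 9998 is Tika Server's default port, but docker-compose.yml may map a different one.

```typescript
// Hypothetical helper: describes the HTTP request Tika Server accepts for
// text extraction. It performs no network I/O itself.
interface TikaRequest {
  url: string;
  method: "PUT";
  headers: Record<string, string>;
}

function buildTikaRequest(baseUrl: string = "http://localhost:9998"): TikaRequest {
  // Strip trailing slashes so the joined URL never contains "//tika".
  const root = baseUrl.replace(/\/+$/, "");
  return {
    url: `${root}/tika`,
    method: "PUT",
    // Ask Tika to return the extracted content as plain text.
    headers: { Accept: "text/plain" },
  };
}

// Usage sketch (needs a running Tika server, so it is left commented out):
// const req = buildTikaRequest();
// const bytes = await Deno.readFile("./input/document.pdf");
// const res = await fetch(req.url, { method: req.method, headers: req.headers, body: bytes });
// const text = await res.text();
```

The same request shape applies to every file type, because Tika detects the format from the uploaded bytes.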

Output Format

Extracted text files are saved with the following naming convention:

Input: document.pdf → Output: document.txt
Input: report.docx → Output: report.txt
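
One consequence of this convention, if it is applied exactly as shown, is that inputs sharing a base name (for example report.pdf and report.docx) map to the same output file, so the second one would look already processed. The sketch below checks an input listing for such cases before a run. `findOutputCollisions` is a hypothetical helper, not part of the project.

```typescript
// Hypothetical helper: groups input names by the output name they would
// produce and returns only the groups with more than one member.
function findOutputCollisions(inputNames: string[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const name of inputNames) {
    const dot = name.lastIndexOf(".");
    // Replace only the final extension, mirroring document.pdf -> document.txt.
    const out = `${dot > 0 ? name.slice(0, dot) : name}.txt`;
    const group = groups.get(out) || [];
    group.push(name);
    groups.set(out, group);
  }
  const collisions = new Map<string, string[]>();
  groups.forEach((names, out) => {
    if (names.length > 1) collisions.set(out, names);
  });
  return collisions;
}
```

An empty result means every input has a distinct output name.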

The extracted text is cleaned by:

Normalizing whitespace
Removing excessive line breaks
Trimming leading/trailing spaces
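
A minimal sketch of these three cleaning steps follows. The exact rules in extractor.ts are not shown in this README, so the thresholds here are assumptions: runs of spaces and tabs collapse to a single space, and three or more consecutive line breaks collapse to one blank line. `cleanText` is a hypothetical name.

```typescript
// Hypothetical implementation; the rules in extractor.ts may differ.
function cleanText(raw: string): string {
  return raw
    .replace(/\r\n?/g, "\n")      // unify Windows and old Mac line endings
    .replace(/[ \t\f\v]+/g, " ")  // collapse runs of spaces and tabs
    .replace(/ *\n */g, "\n")     // drop spaces around line breaks
    .replace(/\n{3,}/g, "\n\n")   // keep at most one blank line in a row
    .trim();                      // remove leading and trailing whitespace
}
```

For example, `cleanText("  Hello   world \r\n\r\n\r\n\r\nNext\tline  ")` returns `"Hello world\n\nNext line"`.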

Example

```
# Place some documents
mkdir -p input
cp ~/Documents/*.pdf ./input/

# Run extraction
deno task dev

# Output:
# 📄 Document Extraction Pipeline Starting...
#    Input: ./input
#    Output: ./output
# 📄 Extracting document1.pdf...
# ✅ Extracted document1.pdf → document1.txt
# 📄 Extracting document2.pdf...
# ✅ Extracted document2.pdf → document2.txt
# 🎉 Document extraction pipeline completed!

# Check results
ls output/
# document1.txt  document2.txt
```

Dependencies

The project uses:

@ax-llm/ax for Apache Tika integration
Deno standard library for file system operations and path utilities

Repository: CodeForAfrica/promisetracker_pipeline