A simple document extraction pipeline built with Deno. It extracts the text content of documents in various formats and saves each result as a plain-text file.
- 📁 Batch Processing: Process all files in the input directory automatically
- ⚡ Duplicate Detection: Skip files that have already been extracted
- 🚀 Deno Native: Built using Deno runtime and standard library features
- 🔧 Apache Tika Integration: Supports multiple document formats (PDF, DOC, DOCX, etc.)
├── extractor.ts # Document extraction logic
├── main.ts # Main application entry point
├── input/ # Input directory for documents (auto-created)
├── output/ # Output directory for extracted text (auto-created)
└── README.md # This file
- Start the Apache Tika server using Docker:

      docker compose up -d

- Place documents in the input directory:

      mkdir input
      cp your-documents/* ./input/

- Run the extraction pipeline:

      deno task dev

  Or run directly:

      deno run -A main.ts

- Check the extracted text files in the output directory:

      ls output/
- The application reads all files from the `./input` directory
- For each file, it checks whether a corresponding `.txt` file already exists in `./output`
- If the file hasn't been processed, it extracts the text using Apache Tika
- The extracted text is cleaned (normalized whitespace) and saved as a `.txt` file
- Files that have already been processed are automatically skipped
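
The skip decision described above can be sketched as a pure function. This is only an illustration, not the code in `main.ts`: the names `txtNameFor` and `planExtraction` are hypothetical, and the sketch assumes the caller has already listed the file names found in `./input` and `./output`.

```typescript
// Hypothetical helper: map an input file name to its expected output name.
function txtNameFor(inputName: string): string {
  const dot = inputName.lastIndexOf(".");
  return (dot > 0 ? inputName.slice(0, dot) : inputName) + ".txt";
}

interface ExtractionPlan {
  toExtract: string[]; // inputs with no matching .txt in the output directory
  skipped: string[]; // inputs whose .txt already exists
}

// Decide which inputs still need extraction, given the names already in ./output.
function planExtraction(
  inputNames: string[],
  existingOutputs: string[],
): ExtractionPlan {
  const done = new Set<string>(existingOutputs);
  const plan: ExtractionPlan = { toExtract: [], skipped: [] };
  for (const name of inputNames) {
    if (done.has(txtNameFor(name))) {
      plan.skipped.push(name);
    } else {
      plan.toExtract.push(name);
    }
  }
  return plan;
}

const plan = planExtraction(["a.pdf", "b.docx"], ["a.txt"]);
console.log(plan.toExtract, plan.skipped);
```

Keeping the decision separate from file system access makes it easy to check without a running Tika server.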
The pipeline uses Apache Tika and supports:
- PDF documents
- Microsoft Word (DOC, DOCX)
- Microsoft Excel (XLS, XLSX)
- Microsoft PowerPoint (PPT, PPTX)
- OpenDocument formats (ODT, ODS, ODP)
- Rich Text Format (RTF)
- Plain text files
- And many more formats supported by Apache Tika
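
For orientation, here is a rough sketch of what a Tika text-extraction request looks like. The project talks to Tika through `@ax-llm/ax`; this sketch does not use that library and instead calls Tika Server's REST endpoint (`PUT /tika` with `Accept: text/plain`) directly. The names `extractWithTika` and `Sender` are hypothetical, and the base URL assumes the Docker setup publishes Tika's default port 9998 on localhost.

```typescript
// Minimal response shape needed here; a fetch Response provides these members.
interface TextResponse {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}

// Hypothetical transport type so the HTTP call can be swapped out or stubbed.
type Sender = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: Uint8Array },
) => Promise<TextResponse>;

// Send raw document bytes to Tika and return the extracted plain text.
async function extractWithTika(
  bytes: Uint8Array,
  send: Sender,
  baseUrl: string = "http://localhost:9998", // assumption: Tika's default port
): Promise<string> {
  const url = baseUrl.replace(/\/+$/, "") + "/tika";
  const res = await send(url, {
    method: "PUT",
    headers: { Accept: "text/plain" }, // ask Tika for plain text
    body: bytes,
  });
  if (!res.ok) {
    throw new Error(`Tika returned HTTP ${res.status}`);
  }
  return await res.text();
}
```

In Deno, `send` could be a thin wrapper around the global `fetch`. Tika detects the document type from the bytes it receives, which is why the same request shape covers every format in the list above.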
Extracted text files are saved with the following naming convention:
- Input: `document.pdf` → Output: `document.txt`
- Input: `report.docx` → Output: `report.txt`
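
A minimal sketch of this naming rule, assuming the mapping simply replaces the last extension with `.txt` (the project's actual rule may differ). `outputNameFor` and `findCollisions` are hypothetical names. The second function highlights one consequence of such a rule: inputs that share a base name, for example `report.pdf` and `report.docx`, would map to the same output file.

```typescript
// Replace the final extension with .txt; names without one just get .txt appended.
function outputNameFor(inputName: string): string {
  const dot = inputName.lastIndexOf(".");
  const stem = dot > 0 ? inputName.slice(0, dot) : inputName;
  return stem + ".txt";
}

// Group inputs by output name and keep only the outputs claimed by several inputs.
function findCollisions(inputNames: string[]): Map<string, string[]> {
  const byOutput = new Map<string, string[]>();
  for (const name of inputNames) {
    const out = outputNameFor(name);
    const group = byOutput.get(out) ?? [];
    group.push(name);
    byOutput.set(out, group);
  }
  const collisions = new Map<string, string[]>();
  byOutput.forEach((group, out) => {
    if (group.length > 1) collisions.set(out, group);
  });
  return collisions;
}

console.log(outputNameFor("document.pdf")); // prints "document.txt"
console.log(outputNameFor("report.docx")); // prints "report.txt"
```

Under this assumption, and combined with duplicate detection, the second of two colliding inputs would be skipped, so a check like `findCollisions` before a batch run can be worthwhile.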
The extracted text is cleaned by:
- Normalizing whitespace
- Removing excessive line breaks
- Trimming leading/trailing spaces
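
The three cleaning steps might look roughly like the following. This is a sketch under assumptions, not the project's code: `cleanText` is a hypothetical name, and the choice to keep at most one blank line between paragraphs is one reading of "removing excessive line breaks".

```typescript
// Normalize whitespace in text returned by the extractor.
function cleanText(raw: string): string {
  return raw
    .replace(/\r\n?/g, "\n") // unify Windows and old Mac line endings
    .replace(/[ \t]+/g, " ") // collapse runs of spaces and tabs
    .replace(/ *\n */g, "\n") // drop spaces around line breaks
    .replace(/\n{3,}/g, "\n\n") // keep at most one blank line in a row
    .trim(); // remove leading and trailing whitespace
}

console.log(JSON.stringify(cleanText("  Hello   world \r\n\r\n\r\n\r\nNext\tline  ")));
// prints "Hello world\n\nNext line"
```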
# Place some documents
mkdir input
cp ~/Documents/*.pdf ./input/
# Run extraction
deno task dev
# Output:
# 📄 Document Extraction Pipeline Starting...
# Input: ./input
# Output: ./output
# 📄 Extracting document1.pdf...
# ✅ Extracted document1.pdf → document1.txt
# 📄 Extracting document2.pdf...
# ✅ Extracted document2.pdf → document2.txt
# 🎉 Document extraction pipeline completed!
# Check results
ls output/
# document1.txt document2.txt
The project uses:
- `@ax-llm/ax` for Apache Tika integration
- Deno standard library for file system operations and path utilities