This project provides tools to automatically extract frames from YouTube videos, preparing them for use in training Optical Character Recognition (OCR) models. The eventual goal is to use these frames with a multi-modal LLM to generate rich training data based on the visual content and (optionally) subtitles.
- 🎞️ Frame Extraction: Extracts frames from video files at a configurable FPS.
- ✂️ Frame Sampling: Optionally samples a maximum number of frames per video.
- 🖼️ Image Resizing: Optionally resizes frames to a maximum dimension while maintaining aspect ratio.
- ⚙️ Parallel Processing: Efficiently processes multiple video directories in parallel using
concurrent.futures.ProcessPoolExecutor
. - 📝 Metadata Handling: Copies associated metadata files (e.g.,
.info.json
) to the output directory. - 📊 Progress Tracking: Uses
tqdm
for progress bars during video processing. - 🖼️ LLM Multimodal Analysis: Processes sequences of video frames with an LLM (e.g., Gemini) for detailed, multi-task visual and OCR analysis (Tasks 1-5 from images).
- ✍️ LLM Text Refinement: Takes Tesseract OCR text output for frame sequences and uses an LLM (e.g., Gemini 2.5 Pro) to clean the text, convert it to markdown, and generate a contextual summary (Tasks 3-5 from text).
- ✅ Milestone 1: Setup & Configuration: Environment setup (Conda, Poetry), API key handling (
.env
), and Vertex AI client verification are complete. - ✅ Milestone 2: Frame Extraction Pipeline:
ocr_dataset_builder/video/processing.py
(extract_frames
function) is implemented and functional.ocr_dataset_builder/video/frame_pipeline.py
is implemented and functional, enabling parallel processing, metadata handling, slicing, CLI, and checkpointing.
- ✅ Milestone 3: LLM Prompt Adaptation & Initial LLM Processing Code:
- The core
ocr_dataset_builder/prompts/ocr_image_multi_task_prompt.md
has been adapted for video frame sequences. ocr_dataset_builder/llm/processing.py
for image-based LLM interaction is implemented and functional.
- The core
- ✅ Milestone 4: LLM Multimodal Pipeline Development:
ocr_dataset_builder/llm/pipeline.py
for image-based analysis is implemented and functional, with checkpointing, parallel processing, and standardized logging.
- ✅ Milestone 5: Tesseract OCR Integration (Optional/Baseline):
ocr_dataset_builder/tesseract/processing.py
andocr_dataset_builder/tesseract/pipeline.py
are implemented and functional, including checkpointing and logging.
- 🚧 Milestone 6: Full Pipeline Integration & Output Generation: Integrating
video/frame_pipeline.py
-> (optionaltesseract/pipeline.py
) ->llm/pipeline.py
(image) andllm/text_pipeline.py
(text) and generating final structured outputs. Development of the text LLM pipeline is in progress. - ✅ Documentation Foundational Work:
docs/PIPELINE_GUIDE.md
,docs/DATA_FORMATS.md
,docs/DESIGN.md
, anddocs/MILESTONES.md
have been updated to reflect current progress and the new text refinement pipeline.docs/TEXT_LLM_PIPELINE_GUIDE.md
created. - ✅ Codebase Refactoring: Project structure updated with
video
,tesseract
, andllm
submodules. Imports and documentation reflect these changes. - ⏳ New Text LLM Refinement Pipeline (Phase 2 - Implementation):
ocr_dataset_builder/llm/text_processing.py
(handles LLM interaction for text).ocr_dataset_builder/llm/text_pipeline.py
(orchestrates the text refinement process).
-
ocr_dataset_builder/video/processing.py
:- Core logic for extracting, resizing, and sampling frames from individual video files.
extract_frames()
: Main function for per-video frame processing.get_human_readable_size()
: Utility for file sizes.
-
ocr_dataset_builder/video/frame_pipeline.py
:- Orchestrates the frame extraction process across an entire dataset of videos using
ocr_dataset_builder/video/processing.py
. - Handles directory traversal, metadata copying, parallel execution for video directories.
- Provides a CLI (
fire
) for dataset processing.
- Orchestrates the frame extraction process across an entire dataset of videos using
-
ocr_dataset_builder/prompts/ocr_image_multi_task_prompt.md
:- Detailed multi-task prompt for the Gemini LLM, adapted for video frame sequence analysis (per-frame Tasks 1-4, per-sequence Task 5).
-
ocr_dataset_builder/llm/processing.py
:- Handles the direct interaction with the multi-modal LLM (e.g., Gemini via Vertex AI).
- Takes a sequence of frames and the prompt to get the structured textual analysis.
- Manages API calls, responses, and error handling for a single LLM request.
-
ocr_dataset_builder/llm/pipeline.py
:- Orchestrates the process of sending frame sequences (from
video/frame_pipeline.py
output) to the LLM viallm/processing.py
. - Manages batching of frame sequences, collecting LLM results, and preparing them for final output generation.
- Orchestrates the process of sending frame sequences (from
-
ocr_dataset_builder/tesseract/processing.py
(Optional/Baseline):- Performs OCR on individual image frames using the Tesseract OCR engine.
- Can be used to generate a baseline OCR text for comparison or as an input to other processes.
-
ocr_dataset_builder/tesseract/pipeline.py
(Optional/Baseline):- Orchestrates the application of Tesseract OCR (via
tesseract/processing.py
) over a dataset of extracted frames. - Expected to output individual
.txt
files per frame for consumption by the Text LLM Refinement Pipeline.
- Orchestrates the application of Tesseract OCR (via
-
ocr_dataset_builder/prompts/ocr_text_refinement_prompt.md
(New):- Specialized prompt for guiding an LLM to refine sequences of Tesseract OCR text, focusing on text cleaning (Task 3), markdown conversion (Task 4), and summarization (Task 5).
-
ocr_dataset_builder/llm/llm_text_processing.py
(New - Planned):- Handles direct interaction with an LLM for text-based refinement tasks using the
ocr_dataset_builder/prompts/ocr_text_refinement_prompt.md
.
- Handles direct interaction with an LLM for text-based refinement tasks using the
-
ocr_dataset_builder/llm/llm_text_pipeline.py
(New - Planned):- Orchestrates the process of sending Tesseract OCR text sequences (e.g., from
tesseract/pipeline.py
output) to an LLM viallm/llm_text_processing.py
for refinement.
- Orchestrates the process of sending Tesseract OCR text sequences (e.g., from
git clone <repository-url>
cd ocr-dataset-builder
The install-conda-env.sh
script automates the creation of the Conda environment using the pyproject.toml
file for dependencies via Poetry.
bash install-conda-env.sh
This will create a Conda environment named ocr-dataset-builder
. Activate it:
conda activate ocr-dataset-builder
For the planned LLM interaction part of the project, you will need an API key.
Copy the .env.example
to .env
and fill in your GOOGLE_API_KEY
:
cp .env.example .env
# Now edit .env and add your GOOGLE_API_KEY
# GOOGLE_API_KEY="your_api_key_here"
This key will be used by scripts/example-vertex.py
and the future LLM processing module.
This project involves multiple pipelines. For detailed CLI arguments and usage, please refer to docs/PIPELINE_GUIDE.md
, docs/TEXT_LLM_PIPELINE_GUIDE.md
.
Use ocr_dataset_builder/video/frame_pipeline.py
to process a dataset of videos and extract frames.
Example:
python -m ocr_dataset_builder.video.frame_pipeline process_videos \
--dataset_path "/path/to/your/video_dataset" \
--output_path "./output/extracted_frames" \
--target_fps 1
If baseline OCR is needed, ocr_dataset_builder/tesseract/pipeline.py
can be used on the output of the frame extraction.
Example:
python -m ocr_dataset_builder.tesseract.pipeline run \
--input_dir "./output/extracted_frames" \
--output_dir "./output/tesseract_ocr_results"
Once frames are extracted, ocr_dataset_builder/llm/pipeline.py
can be used to process sequences of these frames with the Gemini LLM.
Example (Multimodal Image Analysis):
python -m ocr_dataset_builder.llm.pipeline run \
--input_dir "./output/extracted_frames" \
--output_dir "./output/llm_analysis_results" \
--batch_size 30
After running the Tesseract OCR pipeline (which should output .txt
files per frame), ocr_dataset_builder/llm/llm_text_pipeline.py
can be used to refine these texts.
Example (Text Refinement):
python -m ocr_dataset_builder.llm.llm_text_pipeline run \
--input_dir "./output/tesseract_text_output" \
--output_dir "./output/llm_text_refined_output" \
--frames_per_batch 60
This project uses ruff
for linting and black