Automated extraction of structured data from Tatar-language PDF documents using Google Gemini and Yandex Disk public links. Supports chunked processing, prompt engineering, and JSON output validation.

neurotatarlar/gemini-pdf-extractor

🐎 gemini-pdf-extractor

Automated extraction of structured data from Tatar-language PDF documents using Google Gemini and public Yandex Disk links. Supports chunked processing, prompt engineering, and validation of model output into structured JSON format.

✨ Features

  • 📄 PDF download via Yandex Disk public links
  • 🔪 Smart chunking of large documents
  • 🤖 Extraction using Google Gemini (supports multiple models)
  • 🧠 Prompt engineering for Tatar documents
  • 📦 Outputs validated as structured JSON (via Pydantic)
  • 🗂️ Resulting chunks zipped for easy sharing and storage

📅 Installation

Clone the repository and install dependencies:

git clone https://github.com/neurotatarlar/gemini-pdf-extractor.git
cd gemini-pdf-extractor
pip install -r requirements.txt

🔧 Setup

  1. Set your Google Gemini API key in the Params class (the assignment is currently commented out in the code):
self.gemini_api_key = "<YOUR_GEMINI_API_KEY>"
  2. Adjust models, chunk size, or directory structure as needed in the Params class.
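
A key written directly into source is easy to commit by accident. The following is a minimal sketch of one alternative, not the repository's actual Params class: the `GEMINI_API_KEY` environment variable, the `model` and `chunk_size` fields, and their default values are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Params:
    """Illustrative stand-in for the project's Params class; field names are assumptions."""

    # Read the key from the environment instead of hardcoding it in source.
    gemini_api_key: str = field(
        default_factory=lambda: os.environ.get("GEMINI_API_KEY", "")
    )
    model: str = "gemini-2.0-flash"  # hypothetical default model name
    chunk_size: int = 20             # hypothetical number of pages per chunk

    def require_key(self) -> str:
        """Return the API key, or fail early with a clear message if it is missing."""
        if not self.gemini_api_key:
            raise RuntimeError("Set the GEMINI_API_KEY environment variable")
        return self.gemini_api_key
```

With this shape, `Params().require_key()` fails before any download or upload starts if the key is missing.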

🚀 Usage

Call the main extract() function with one or more public Yandex Disk links:

if __name__ == "__main__":
    extract([
        "https://yadi.sk/i/9oIYgGcPvfSs4w"
    ])

The system will:

  • Download the PDF
  • Slice it into logical page chunks
  • Upload each chunk to Gemini for content extraction
  • Validate and save JSON outputs
  • Zip all extracted chunks for easy distribution
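
The slicing step can be sketched as a function that turns a page count into page ranges. This sketch uses plain fixed-size chunks, which may be simpler than the project's own chunking logic; the `page_chunks` name and its parameters are hypothetical.

```python
from typing import Iterator, Tuple


def page_chunks(total_pages: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, zero-based (first_page, last_page) ranges covering the document."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_pages, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size, total_pages) - 1


if __name__ == "__main__":
    print(list(page_chunks(45, 20)))  # [(0, 19), (20, 39), (40, 44)]
```

Inclusive, zero-based ranges match the `from_page` and `to_page` arguments of pymupdf's `Document.insert_pdf`, so each pair could be passed through when writing a slice.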

🗃️ Output Structure

Outputs are saved under the ./.artifacts/ directory:

.artifacts/
🗂️ downloads/       # Original PDFs
🗂️ chunk_results/   # Extracted JSON chunks
🗂️ slices/          # Sliced PDFs per chunk
🗂️ prompts/         # Prompt files used in extraction
🗂️ zips/            # Final zipped output per document
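
To illustrate the layout above, here is a small sketch that creates the same directory tree. The `ensure_artifact_dirs` helper is hypothetical and only mirrors the folder names listed here; the project may create them differently.

```python
from pathlib import Path
from typing import Dict

# Subdirectory names taken from the layout listed above.
SUBDIRS = ("downloads", "chunk_results", "slices", "prompts", "zips")


def ensure_artifact_dirs(root: Path = Path(".artifacts")) -> Dict[str, Path]:
    """Create the artifact subdirectories under root and return them by name."""
    paths = {name: root / name for name in SUBDIRS}
    for path in paths.values():
        # exist_ok makes repeated runs safe.
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Returning the paths by name lets later steps write to `paths["slices"]` or `paths["zips"]` without rebuilding the strings.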

📚 Example Prompt

Prompt logic is defined in prompt.py and tailored for Tatar-language document structures (e.g. headings, footnotes). You can customize this file to better match your document types.


🛠️ Development Notes

  • Uses pymupdf for PDF manipulation
  • Google Gemini client from google.genai
  • Schema validation via pydantic
  • Retry and error handling for API rate limits (429)
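
Retrying on rate limits is commonly done with exponential backoff. The sketch below shows that general pattern and is not the repository's code: `RateLimitError` is a hypothetical stand-in for whatever exception the Gemini client raises on HTTP 429, and `with_retries` and its defaults are illustrative.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the error a client raises on HTTP 429."""


def with_retries(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on RateLimitError with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out retries from parallel workers.
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists so the wait can be replaced in tests; in normal use it defaults to `time.sleep`.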

🏷️ License

MIT License. Feel free to use, modify, and contribute.
