🐎 gemini-pdf-extractor

Automated extraction of structured data from Tatar-language PDF documents using Google Gemini. PDFs are downloaded from public Yandex Disk links and processed in chunks with prompts tailored to Tatar documents; the model output is validated and saved as structured JSON.

✨ Features

📄 PDF download via Yandex Disk public links
🔪 Smart chunking of large documents
🤖 Extraction using Google Gemini (supports multiple models)
🧠 Prompt engineering for Tatar documents
📦 Outputs validated as structured JSON (via Pydantic)
🗂️ Resulting chunks zipped for easy sharing and storage

📅 Installation

Clone the repository and install dependencies:

git clone https://github.com/YOUR_USERNAME/tatar-gemini-extractor.git
cd tatar-gemini-extractor
pip install -r requirements.txt

🔧 Setup

Set your Google Gemini API key in the Params class (the assignment is currently commented out in the code):

self.gemini_api_key = "<YOUR_GEMINI_API_KEY>"

Adjust models, chunk size, or directory structure as needed in the Params class.
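
If you would rather not keep the key in source control, one option is to read it from the environment. The sketch below is only an illustration: this Params class is a hypothetical stand-in for the project's own, and GEMINI_API_KEY and require_api_key() are assumed names that the project does not define.

```python
import os


class Params:
    """Hypothetical stand-in for the project's Params class."""

    def __init__(self) -> None:
        # Read the key from the environment instead of hardcoding it.
        # GEMINI_API_KEY is an assumed variable name, not one the project defines.
        self.gemini_api_key = os.environ.get("GEMINI_API_KEY", "")

    def require_api_key(self) -> str:
        """Return the key, or fail early with a clear message if it is missing."""
        if not self.gemini_api_key:
            raise RuntimeError("Set the GEMINI_API_KEY environment variable")
        return self.gemini_api_key
```

Failing early this way surfaces a missing key before any PDF is downloaded or sliced.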

🚀 Usage

Call the main extract() function with one or more public Yandex Disk links:

if __name__ == "__main__":
    extract([
        "https://yadi.sk/i/9oIYgGcPvfSs4w"
    ])

The system will:

1. Download the PDF
2. Slice it into logical page chunks
3. Upload each chunk to Gemini for content extraction
4. Validate and save JSON outputs
5. Zip all extracted chunks for easy distribution
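
To make the slicing step concrete, here is a minimal sketch of how page ranges could be computed. It assumes fixed-size chunks, which may be simpler than the project's actual chunking logic; page_chunks is a hypothetical helper, not a function from main.py.

```python
def page_chunks(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split pages 0..total_pages-1 into consecutive (first, last) ranges.

    Both ends of each range are inclusive and 0-based.
    """
    if total_pages < 0 or chunk_size <= 0:
        raise ValueError("total_pages must be >= 0 and chunk_size must be > 0")
    return [
        # The last chunk is clamped so it never runs past the final page.
        (start, min(start + chunk_size, total_pages) - 1)
        for start in range(0, total_pages, chunk_size)
    ]


print(page_chunks(25, 10))  # [(0, 9), (10, 19), (20, 24)]
```

Inclusive, 0-based ranges were chosen here because they map directly onto the from_page and to_page arguments of pymupdf's insert_pdf, should the ranges be used to write per-chunk PDFs.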

🗃️ Output Structure

Outputs are saved under the ./.artifacts/ directory:

.artifacts/
🗂️ downloads/       # Original PDFs
🗂️ chunk_results/   # Extracted JSON chunks
🗂️ slices/          # Sliced PDFs per chunk
🗂️ prompts/         # Prompt files used in extraction
🗂️ zips/            # Final zipped output per document
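
As an illustration, the layout above could be created up front with pathlib. This is a sketch under assumptions: ensure_artifact_dirs and ARTIFACT_SUBDIRS are hypothetical names, and the project may create these directories differently.

```python
from pathlib import Path

# Subdirectory names taken from the layout listed above.
ARTIFACT_SUBDIRS = ("downloads", "chunk_results", "slices", "prompts", "zips")


def ensure_artifact_dirs(root: Path) -> dict[str, Path]:
    """Create each artifact subdirectory under root and return their paths."""
    paths = {name: root / name for name in ARTIFACT_SUBDIRS}
    for path in paths.values():
        # exist_ok makes the call safe to repeat on later runs.
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling ensure_artifact_dirs(Path(".artifacts")) before a run would guarantee that every stage has somewhere to write.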

📚 Example Prompt

Prompt logic is defined in prompt.py and tailored for Tatar-language document structures (e.g. headings, footnotes). You can customize this file to better match your document types.

🛠️ Development Notes

PDF manipulation via pymupdf
Google Gemini client from google.genai
Schema validation via pydantic
Retry and error handling for API rate limits (HTTP 429)
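
For readers unfamiliar with rate-limit handling, the sketch below shows one common approach: exponential backoff with jitter. It is not the project's actual retry code. RateLimitError is a hypothetical exception standing in for whatever the Gemini client raises on a 429, and call_with_retry is an assumed helper name.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical marker for an HTTP 429 response."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise ValueError("max_attempts must be at least 1")
```

The sleep parameter is injectable so the helper can be exercised in tests without actually waiting.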

🏷️ License

MIT License. Feel free to use, modify, and contribute.


neurotatarlar/gemini-pdf-extractor
