Automated extraction of structured data from Tatar-language PDF documents using Google Gemini and public Yandex Disk links. Supports chunked processing, prompt engineering, and validation of model output into structured JSON format.
- 📄 PDF download via Yandex Disk public links
- 🔪 Smart chunking of large documents
- 🤖 Extraction using Google Gemini (supports multiple models)
- 🧠 Prompt engineering for Tatar documents
- 📦 Outputs validated as structured JSON (via Pydantic)
- 🗂️ Resulting chunks zipped for easy sharing and storage
Clone the repository and install dependencies:
git clone https://github.com/YOUR_USERNAME/tatar-gemini-extractor.git
cd tatar-gemini-extractor
pip install -r requirements.txt
- Set your Google Gemini API key in the `Params` class (currently commented out in code):
self.gemini_api_key = "<YOUR_GEMINI_API_KEY>"
- Adjust models, chunk size, or directory structure as needed in the `Params` class.
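Because a key written directly into source code is easy to commit by accident, one alternative is to read it from the environment. The sketch below is a hypothetical stand-in for `Params`, not the repository's actual class: only `gemini_api_key` comes from this README, while the `GEMINI_API_KEY` variable name, the other fields, their defaults, and `require_key()` are assumptions made for illustration.

```python
import os
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Params:
    """Hypothetical settings container; only gemini_api_key appears in the README."""

    # Assumed environment variable name; falls back to an empty string if unset.
    gemini_api_key: str = field(
        default_factory=lambda: os.environ.get("GEMINI_API_KEY", "")
    )
    # Assumed fields for illustration: model name, pages per chunk, output root.
    model: str = "<GEMINI_MODEL_NAME>"
    chunk_size: int = 20
    artifacts_dir: Path = Path(".artifacts")

    def require_key(self) -> str:
        """Return the API key or fail early with a clear message."""
        if not self.gemini_api_key:
            raise RuntimeError("Set GEMINI_API_KEY before running the extractor")
        return self.gemini_api_key
```

Failing early in `require_key()` keeps a missing key from surfacing later as an opaque API error in the middle of a long run.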
Call the main `extract()` function with one or more public Yandex Disk links:
if __name__ == "__main__":
    extract([
        "https://yadi.sk/i/9oIYgGcPvfSs4w",
    ])
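For context, a public share link like the one above can be resolved to a direct download URL through the Yandex Disk REST API. The following is a minimal sketch using only the standard library. Whether this project resolves links the same way is an assumption, and the helper names `build_resolve_url` and `download_public_file` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Public endpoint of the Yandex Disk REST API that resolves a share link
# to a temporary direct-download URL.
API_URL = "https://cloud-api.yandex.net/v1/disk/public/resources/download"


def build_resolve_url(public_link: str) -> str:
    """Build the API request URL for a public Yandex Disk share link."""
    return API_URL + "?" + urllib.parse.urlencode({"public_key": public_link})


def download_public_file(public_link: str, dest_path: str) -> str:
    """Resolve the share link and save the file to dest_path (needs network)."""
    with urllib.request.urlopen(build_resolve_url(public_link)) as resp:
        href = json.load(resp)["href"]  # temporary direct-download URL
    urllib.request.urlretrieve(href, dest_path)
    return dest_path
```

The share link must be URL-encoded when passed as `public_key`, which `urlencode` handles here.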
The system will:
- Download the PDF
- Slice it into logical page chunks
- Upload each chunk to Gemini for content extraction
- Validate and save JSON outputs
- Zip all extracted chunks for easy distribution
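The slicing step can be illustrated with a small helper that computes page ranges. This is a simplified sketch that assumes fixed-size chunks; the project's actual chunking may use different boundaries, and `page_ranges` is a hypothetical name.

```python
def page_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split pages 0..total_pages-1 into consecutive (start, end) ranges, end inclusive."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_pages) - 1)
        for start in range(0, total_pages, chunk_size)
    ]


# A 45-page document with 20 pages per chunk yields three ranges.
print(page_ranges(45, 20))  # [(0, 19), (20, 39), (40, 44)]
```

With PyMuPDF, each range could then be copied into its own file with `Document.insert_pdf(src, from_page=start, to_page=end)`, which also takes zero-based, inclusive page numbers.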
Outputs are saved under the `./.artifacts/` directory:
.artifacts/
    🗂️ downloads/       # Original PDFs
    🗂️ chunk_results/   # Extracted JSON chunks
    🗂️ slices/          # Sliced PDFs per chunk
    🗂️ prompts/         # Prompt files used in extraction
    🗂️ zips/            # Final zipped output per document
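As an illustration of the final step, the JSON files in `chunk_results/` could be bundled into `zips/` roughly as follows. This is a sketch built on the standard library; the function name `zip_chunk_results`, the flat archive layout, and the `*.json` file pattern are assumptions rather than the project's actual behavior.

```python
import zipfile
from pathlib import Path


def zip_chunk_results(chunk_dir: Path, zip_path: Path) -> int:
    """Zip every .json file in chunk_dir into zip_path; return the file count."""
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    json_files = sorted(chunk_dir.glob("*.json"))
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for json_file in json_files:
            # Store only the file name so the archive has a flat layout.
            archive.write(json_file, arcname=json_file.name)
    return len(json_files)
```

A call such as `zip_chunk_results(Path(".artifacts/chunk_results"), Path(".artifacts/zips/document.zip"))` would write one archive, where `document.zip` is a placeholder name.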
Prompt logic is defined in `prompt.py` and tailored for Tatar-language document structures (e.g. headings, footnotes). You can customize this file to better match your document types.
- Uses `pymupdf` for PDF manipulation
- Google Gemini client from `google.genai`
- Schema validation via `pydantic`
- Retry and error handling for API rate limits (`429`)
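The retry behavior for rate limits can be sketched as exponential backoff with jitter. This is an illustrative pattern, not the project's actual implementation: `RateLimitError` is a hypothetical stand-in for whatever exception the Gemini client raises on HTTP 429, and `call_with_retry` with its defaults is assumed.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the client's HTTP 429 error."""


def call_with_retry(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt: base, 2*base, 4*base, ... plus jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays. The random jitter spreads out retries so that parallel chunk uploads do not all hit the API again at the same moment.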
MIT License. Feel free to use, modify, and contribute.