
OCR & Summarization Micro-service

by Aleksei Vdonin
Faculty of Computer Science, HSE University, Moscow, 2025

Coursework 2025

A self-hosted micro-service that recognises and summarises HSE corporate documents (PDF, DOC/DOCX, TXT) and is ready to be plugged into the mobile application HSE App X.
The pipeline can run end-to-end on CPU-only servers inside the HSE perimeter, keeping sensitive documents private.


Table of Contents

  1. Overview
  2. Features
  3. Dataset
  4. Training & Distillation
  5. Results
  6. Web Demo
  7. Future Work
  8. Links
  9. License

Overview

The pipeline consists of five stages:

  1. Crawling & OCR: parses >2 000 public documents from hse.ru and runs Tesseract OCR where needed.
  2. Dataset Builder: splits texts into 1 000-token chunks with 100-token overlap (see the sketch below), then queries OpenAI o3-mini to create reference summaries.
  3. Teacher: fine-tunes d0rj/ru-mbart-large-summ to reach state-of-the-art quality on legal prose.
  4. Students: distils the teacher into compact BiLSTM and BiLSTM + MHA Seq2Seq models for CPU inference.
  5. Streamlit UI: lets users upload a file or paste text and instantly compare the outputs of all four models.

All components are container-friendly and require only Python 3.11 + PyPI packages.
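
For illustration, a minimal sketch of the sliding-window chunking from stage 2, assuming plain whitespace tokenisation (the real pipeline may use a subword tokenizer):

    def chunk_tokens(tokens, size=1000, overlap=100):
        """Yield overlapping windows over a token list."""
        step = size - overlap
        for start in range(0, max(len(tokens) - overlap, 1), step):
            yield tokens[start:start + size]

    text = "..."  # a document after OCR & cleaning
    chunks = [" ".join(win) for win in chunk_tokens(text.split())]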


Features

  • Document-scale abstractive summarization (arbitrary length, sliding window).
  • Soft-label knowledge distillation for ~5× faster CPU inference (see the loss sketch after this list).
  • End-to-end automation: crawl → OCR → chunk → target → summary.
  • Lightweight deployment (no GPU, < 200 MB student weights).
  • Interactive web demo with timing & progress bars.
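
As a rough illustration of the soft-label distillation objective, here is a sketch assuming the standard temperature-scaled KL formulation; the exact loss weighting used in the project lives in the distill/ notebooks:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          T=2.0, alpha=0.5):
        # Soft targets: KL between temperature-softened distributions.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * T * T
        # Hard targets: ordinary cross-entropy against gold tokens.
        hard = F.cross_entropy(
            student_logits.view(-1, student_logits.size(-1)),
            labels.view(-1),
            ignore_index=-100,  # assumes HF-style label padding
        )
        return alpha * soft + (1 - alpha) * hard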

Dataset

Stage            Output              Size
Crawl portal     data/data_raw/      2 059 docs
OCR & cleaning   data/rproc_data/    2 059 UTF-8 texts
Chunking         8 000+ fragments    ≈ 860 tokens each
Final JSONL      train_smart.jsonl   ~67 MB of (text, summary) pairs
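
A hypothetical sketch of the final step, pairing each chunk with an o3-mini reference summary and appending it to train_smart.jsonl; the prompt wording and client setup are assumptions, not the project's actual code:

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def build_pair(chunk: str) -> dict:
        resp = client.chat.completions.create(
            model="o3-mini",
            messages=[{"role": "user",
                       "content": "Summarize this document fragment:\n\n" + chunk}],
        )
        return {"text": chunk, "summary": resp.choices[0].message.content}

    with open("train_smart.jsonl", "a", encoding="utf-8") as f:
        for chunk in chunks:  # chunks from the sliding-window step
            f.write(json.dumps(build_pair(chunk), ensure_ascii=False) + "\n")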

Training & Distillation

Stage                Entry-point                       Epochs   GPU-hours
Teacher fine-tune    train/mBART.ipynb                 7        27
LSTM student         distill/distill_train.ipynb       50       30
LSTM + MHA student   distill/distill_train_mha.ipynb   16       20

Experiments are logged with Weights & Biases; reproducible configs are provided in the respective folders.
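
A condensed sketch of what the teacher fine-tune boils down to, assuming a standard Hugging Face Seq2SeqTrainer setup (hyperparameters other than the epoch count are placeholders; the full configuration is in train/mBART.ipynb):

    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              Seq2SeqTrainer, Seq2SeqTrainingArguments)

    tok = AutoTokenizer.from_pretrained("d0rj/ru-mbart-large-summ")
    model = AutoModelForSeq2SeqLM.from_pretrained("d0rj/ru-mbart-large-summ")

    train_ds = ...  # tokenized (text, summary) pairs from train_smart.jsonl

    args = Seq2SeqTrainingArguments(
        output_dir="teacher_ckpt",
        num_train_epochs=7,             # matches the table above
        per_device_train_batch_size=4,  # placeholder
        learning_rate=3e-5,             # placeholder
        report_to="wandb",              # runs are logged with W&B
    )
    Seq2SeqTrainer(model=model, args=args, train_dataset=train_ds).train()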


Results

Model                   Params   BERTScore F1   CPU latency (500 words)   Speed-up
Teacher (mBART-large)   380 M    0.76           11.2 s                    1.0× (baseline)
Student (LSTM + MHA)    47 M     0.69           4.66 s                    2.4×
Student (LSTM)          46 M     0.68           2.23 s                    5.0×
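
The metrics above can be reproduced roughly as follows, using the bert-score package for F1 and wall-clock timing for latency; candidates, references, and summarize are placeholder names:

    import time
    from bert_score import score

    def cpu_latency(summarize, text, runs=5):
        """Average wall-clock seconds per summarization call."""
        start = time.perf_counter()
        for _ in range(runs):
            summarize(text)
        return (time.perf_counter() - start) / runs

    # candidates: model outputs; references: o3-mini summaries
    P, R, F1 = score(candidates, references, lang="ru")
    print(f"BERTScore F1: {F1.mean():.2f}")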

Web Demo

The Streamlit interface supports:

  1. File upload (TXT, DOC/DOCX, PDF with on-the-fly OCR).
  2. Real-time comparison of all models with elapsed-time badges.
  3. Progress bars for OCR & long-text chunking.

A live instance runs at http://158.160.61.64:8501/ (internal HSE network).
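
A stripped-down sketch of such a page; the helper names (ocr_extract, MODELS) are hypothetical and stand in for the project's actual modules:

    import time
    import streamlit as st

    st.title("OCR & Summarization Demo")

    uploaded = st.file_uploader("Upload a document",
                                type=["txt", "doc", "docx", "pdf"])
    pasted = st.text_area("...or paste text")

    if uploaded or pasted:
        text = ocr_extract(uploaded) if uploaded else pasted  # OCR for scans
        for name, model in MODELS.items():  # all models, side by side
            start = time.perf_counter()
            summary = model.summarize(text)
            st.subheader(f"{name} ({time.perf_counter() - start:.2f} s)")
            st.write(summary)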


Future Work

  • Hybrid extractive + abstractive summarization for better factual coverage.
  • Dynamic model selector (fast vs. accurate) based on document length / SLA.
  • ONNX export of student models for mobile devices.
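
A possible starting point for the ONNX item, assuming the LSTM student accepts a batch of token IDs (the variable, shapes, and vocabulary size are placeholders):

    import torch

    student.eval()  # trained LSTM student (placeholder variable)
    dummy = torch.randint(0, 32000, (1, 512))  # (batch, seq_len) token IDs
    torch.onnx.export(
        student, dummy, "student_lstm.onnx",
        input_names=["input_ids"], output_names=["logits"],
        dynamic_axes={"input_ids": {1: "seq_len"}},
        opset_version=17,
    )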

Links


License

Released under the MIT License – see LICENSE for full text.
