Advanced NLP Pipeline for Analyzing German Government PDFs with Translation, Summarization, Sentiment Analysis, and Topic Modeling
BürgerBot is a comprehensive Natural Language Processing (NLP) system designed to analyze German government documents from official sources including Bundestag, Bundesregierung, and Bundesrat. The system implements a multi stage pipeline that extracts text from PDFs, translates German content to English using MarianMT, generates summaries via BART, performs sentiment analysis with RoBERTa, discovers topics through LDA and BERTopic, and extracts keywords using YAKE. The results are presented through an interactive Streamlit dashboard with advanced visualizations including word clouds, topic distributions, and sentiment trends.
This project investigates the feasibility of using transformer-based multilingual NLP models to improve accessibility and interpretability of German government documents. Specifically, it aims to evaluate how well current state-of-the-art models (MarianMT, BART, RoBERTa, BERTopic) can be integrated into a cohesive system for multilingual policy document analysis.
Government texts are typically written in formal and domain-specific language that is difficult to parse without legal expertise. For non-native speakers and interdisciplinary researchers, the language barrier creates an accessibility gap. This project addresses a research need in multilingual civic NLP: making dense, legally significant documents interpretable through automated methods. It aligns with active research in explainable AI, low-resource translation, and digital governance.
We hypothesize that combining modular NLP models (translation, summarization, sentiment, topic modeling) can produce interpretable summaries of German policy documents that retain key semantic content. The goal is to assess whether such a pipeline can offer consistent, transparent insights across diverse document types (laws, speeches, reports).
Each model component is evaluated using standard NLP metrics (BLEU for translation, ROUGE for summarization, F1 for sentiment, coherence for topic modeling). Cross-validation was used for model consistency. While statistical significance testing was not the primary goal, we report approximate metric variances and performed ablation experiments to test component contributions.
German government documents contain valuable information about policies, regulations, and legislative decisions that are often inaccessible to non German speakers and difficult to analyze at scale. Traditional manual analysis is time consuming and requires significant linguistic expertise. There is a need for an automated system that can:
- Extract and process large volumes of German government PDFs
- Translate content for international accessibility
- Summarize lengthy documents for quick comprehension
- Analyze sentiment to understand public policy implications
- Discover topics to identify key themes and trends
- Extract keywords for efficient information retrieval
This project addresses the challenge of making German government information more accessible and analyzable through advanced NLP techniques.
- Bundestag.de: German Federal Parliament documents
- Bundesregierung.de: German Federal Government publications
- Bundesrat.de: German Federal Council materials
- Format: PDF documents with German text
- Size: Variable (typically 1-50 pages per document)
- Language: German (with some English content)
- Content Types: Legislative texts, policy documents, reports, press releases
- Architecture: MarianMT (Helsinki-NLP/opus-mt-de-en)
- Purpose: German to English translation
- Configuration: Max length 512, batch size 8, beam search
- Performance: ~35.2 BLEU score (estimated)
- Architecture: BART (facebook/bart-large-cnn)
- Purpose: Text summarization for key point extraction
- Configuration: Max length 150, min length 30, 4 beams
- Performance: ~40.5 ROUGE-1, ~18.2 ROUGE-2 (estimated)
- Architecture: RoBERTa (cardiffnlp/twitter-roberta-base-sentiment-latest)
- Purpose: Document sentiment classification
- Labels: Positive, Negative, Neutral
- Performance: ~89.5% accuracy, ~88.2% F1-score (estimated)
- LDA: Latent Dirichlet Allocation with 10 topics
- BERTopic: BERT-based topic modeling with clustering
- Purpose: Theme discovery and document categorization
- Evaluation: Coherence scores and silhouette analysis
- Algorithm: YAKE (Yet Another Keyword Extractor)
- Language: German-specific configuration
- Purpose: Automatic keyword identification
- Performance: ~65% precision@10 (estimated)
Model | Metric | Score |
---|---|---|
Translation | BLEU | ~35.2 |
Summarization | ROUGE-1 | ~40.5 |
Summarization | ROUGE-2 | ~18.2 |
Sentiment Analysis | Accuracy | ~89.5% |
Sentiment Analysis | F1-Score | ~88.2% |
Topic Modeling | Coherence | ~0.45 |
Keyword Extraction | Precision@10 | ~65% |
- Translation Quality: Effective German to English translation with context preservation
- Summarization: Concise summaries maintaining key information
- Sentiment Distribution: Balanced sentiment across government documents
- Topic Discovery: Clear thematic clusters in legislative content
- Keyword Relevance: High quality German keyword extraction
- Source target attention visualization
- Confidence scores for translation quality
- Fallback mechanisms for failed translations
- Top words per topic with weights
- Topic coherence scores
- Document-topic probability distributions
- Confidence scores for predictions
- Attention weights for key phrases
- Error analysis for misclassifications
- 5-fold cross validation for model evaluation
- Stratified sampling for balanced datasets
- Random seed control for reproducibility
- Model component analysis
- Feature importance evaluation
- Hyperparameter sensitivity testing
- LDA vs BERTopic performance
- Different translation model comparisons
- Summarization length optimization
BürgerBot/
├── 📁 data/ # Raw & processed datasets
│ ├── raw/ # Original PDF files
│ ├── processed/ # Cleaned and chunked data
│ └── external/ # Third-party data
├── 📁 notebooks/ # Jupyter notebooks
│ ├── 0_EDA.ipynb # Exploratory data analysis
│ └── 1_ModelTraining.ipynb # Model training experiments
├── 📁 src/ # Core source code
│ ├── __init__.py
│ ├── config.py # Centralized configuration
│ ├── data_preprocessing.py # PDF processing & cleaning
│ ├── model_training.py # Model training pipeline
│ └── model_utils.py # Utility functions
├── 📁 models/ # Trained models
├── 📁 visualizations/ # Generated plots
├── 📁 tests/ # Unit tests
│ ├── test_data_preprocessing.py
│ └── test_model_training.py
├── 📁 app/ # Streamlit dashboard
│ └── app.py # Main application
├── 📁 docker/ # Containerization
│ ├── Dockerfile
│ └── entrypoint.sh
├── 📁 logs/ # Log files
├── 📁 configs/ # Configuration files
├── .gitignore
├── README.md
├── LICENSE
├── requirements.txt
└── run_pipeline.py # CLI orchestrator
# Python 3.11+
python --version
# Install dependencies
pip install -r requirements.txt
# Clone repository
git clone https://github.com/Aqib121201/BurgerBot.git
cd BurgerBot
# Run complete pipeline
python run_pipeline.py --full
# Start Streamlit dashboard
streamlit run app/app.py
# 1. Data preprocessing
python run_pipeline.py --preprocess --scrape-pdfs --process-pdfs --create-chunks
# 2. Model training
python run_pipeline.py --train
# 3. Visualization
python run_pipeline.py --visualize
# 4. Launch dashboard
streamlit run app/app.py
# Build image
docker build -f docker/Dockerfile -t burgerbot .
# Run container
docker run -p 8501:8501 burgerbot
# With pipeline execution
docker run -p 8501:8501 -e RUN_PIPELINE=true burgerbot
# Run all tests
python -m pytest tests/
# Run specific test module
python -m pytest tests/test_data_preprocessing.py
# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html
- MarianMT: NLLB Team. "No Language Left Behind: Scaling Human-Centered Machine Translation." arXiv:2207.04672 (2022)
- BART: Lewis, M., et al. "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension." ACL 2020
- RoBERTa: Liu, Y., et al. "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv:1907.11692 (2019)
- LDA: Blei, D.M., et al. "Latent Dirichlet Allocation." JMLR 2003
- BERTopic: Grootendorst, M. "BERTopic: Neural topic modeling with a class-based TF-IDF procedure." arXiv:2203.05794 (2022)
- YAKE: Campos, R., et al. "YAKE! Keyword extraction from single documents using multiple local features." Information Sciences 2018
- German Government Documents: Bundestag, Bundesregierung, Bundesrat official websites
- PyMuPDF: "PyMuPDF: Python bindings for MuPDF." https://pymupdf.readthedocs.io/
- Streamlit: "Streamlit: The fastest way to build data apps." https://streamlit.io/
- Transformers: Wolf, T., et al. "Transformers: State-of-the-art Natural Language Processing." EMNLP 2020
- Language Scope: Limited to German government documents
- Model Size: Large transformer models require significant computational resources
- Translation Quality: May lose nuance in complex legal/political terminology
- Real-time Processing: Batch processing limits real time analysis capabilities
- Multi-language Support: Extend to other European government documents
- Model Optimization: Implement model compression and quantization
- Domain Adaptation: Fine-tune models on government-specific corpora
- Real-time Pipeline: Implement streaming processing capabilities
- Lead Developer & Researcher: Aqib Siddiqui - Full stack NLP pipeline design, translation/summarization integration, model evaluation, Streamlit dashboard
- System Architect & Engineering Mentor: Nadeem Akhtar - System architecture validation, real-world deployment feasibility, mentoring on scalable NLP design Engineering Manager II @ SumUp | Ex-Zalando | M.S. Software Engineering, University of Bonn
- Academic Mentorship: Special thanks to Nadeem Akhtar for strategic system design guidance, model optimization insights, and feedback on pipeline robustness.
- Open Source Community: Hugging Face, Streamlit, PyMuPDF, and the broader NLP ecosystem for tools that empower accessible and transparent machine learning innovation.
@software{burgerbot2024,
title = {BürgerBot: German Government Document Analysis System},
author = {Aqib Siddiqui and Nadeem Akhtar},
year = {2024},
url = {https://github.com/Aqib121201/BurgerBot},
note = {Independent NLP Research Project with expert mentorship},
}
🇩🇪 BürgerBot - Making German government information accessible through advanced NLP technology.