A large-scale corpus of political speeches for analyzing temporal figures and political enforcement patterns in times of change.
- Table of Contents
- Overview
- Research Context
- Corpus Description
- Data Structure
- File Structure
- Download
- License & Usage
- Contributors
- Citation
- References
- Contact
- Acknowledgments
This corpus was created to study Chronorhetorics - the relationship between temporal language patterns and political enforcement in political discourse. The dataset enables computational linguistic analysis of how political actors use temporal references to advance their political projects and legitimize power.
⚠️ Work in Progress: This corpus is currently under active development. While we are making it available for research purposes, please be aware that additional content may be added over time, data structure changes may occur as we refine our approach, and field modifications might happen based on research needs and technical considerations. We recognize that some current technical design decisions may not be optimal and are subject to revision. Researchers using this corpus should expect potential updates and changes. We recommend checking back regularly for updates and versioning information.
Our work explores the concept of Chronopolitics, specifically focusing on "politicized time" - time employed as a weapon of politics and a means of legitimizing political programs (Esposito & Becker, 2023). We investigate:
- How patterns and linkages between political and temporal language can be made visible using computational linguistics
- How these patterns manifest, especially in contexts of looming conflict and crises
The corpus contains various forms of political speeches from multiple official and archival sources:
Government Sources:
-
US White House - Presidential remarks and speeches
- Current: whitehouse.gov/remarks/
- Biden Archive: bidenwhitehouse.archives.gov
- Trump Archive: trumpwhitehouse.archives.gov/remarks/
- Obama Archive: obamawhitehouse.archives.gov
- Bush Archive: georgewbush-whitehouse.archives.gov
-
Russian Presidential Speeches - Kremlin transcripts and statements (English translations)
- tadadit.xyz/datasets/kremlin.ru_ru/ (base dataset)
- en.kremlin.ru/events/president/transcripts (updated through March 2025)
-
German Bundestag - Parliamentary debates and proceedings
- Historical: GermaParl corpus (Blätte & Blessing, 2018)
- Recent: bundestag.de/services/opendata
-
German Chancellor Speeches - Olaf Scholz addresses
-
Ukrainian Presidential Speeches - Volodymyr Zelenskyy addresses (English translations)
-
UK House of Commons - Parliamentary speeches and debates
- parliament.uk (Hansard archives)
- parser.theyworkforyou.com/hansard.html (XML dataset)
Specialized Collections:
-
Munich Security Conference - Annual conference speeches (1963-2024)
-
Hitler Speeches - Adolf Hitler speeches from Max Domarus collections
- Domarus, Max. Hitler: Speeches and Proclamations 1932-1945 (Volumes 1 & 2)
- Public addresses and announcements
- Parliamentary debates and proceedings
- Public interviews and hearings
- Holiday announcements and ceremonial speeches
- Time span: 1919 - March 2025
- Languages: German (de), English (en)
- Total size: Over 90% of tokens come from Bundestag and House of Commons speeches
Each speech includes standardized metadata:
- Date of speech
- Speaker information (individuals and institutions)
- Language code
- Source institution
- Speech type/context
Each speech in the corpus is stored as a JSON object with standardized metadata and structured content. Here's an example:
{
"id": "DE-Bundeskanzler-Speech-Scholz-0001",
"url": "https://www.bundeskanzler.de/bk-de/aktuelles/...",
"title": "Rede von Bundeskanzler Scholz anlässlich der State of the World Sessions...",
"source_language": "de",
"source_type": "Person",
"source_name": "Bundeskanzler Scholz",
"date": "Mittwoch, 19. Januar 2022",
"date_iso": "2022-01-19T00:00:00",
"location": "Davos",
"structured_content": [
{
"type": "speech",
"sequence": 1,
"speaker": "Bundeskanzler Scholz",
"text": "Sehr geehrter Herr Professor Schwab..."
}
]
}
- id: Unique identifier mostly following the pattern
{COUNTRY}-{INSTITUTION}-{TYPE}-{SPEAKER}-{NUMBER}
- url: Original source URL of the speech
- title: Full title of the speech or address
- source_language: Language code (de/en)
- source_type: Type of speaker (Person/Institution)
- source_name: Name of the speaker or institution
- date: Human-readable date in original language
- date_iso: ISO 8601 formatted date for computational processing
- location: Location where speech was delivered (if available)
The structured_content
field contains an array of speech segments, each representing a distinct part of the political discourse. The structure and available fields vary depending on the source type and institutional context.
Common fields across all sources:
- sequence: Numerical order of the segment within the document
- text: Full text content of the segment
Content type identifiers:
"speech"
: Primary speech content"question"
: Parliamentary questions or audience questions"answer"
: Responses to questions"interjection"
: Parliamentary interruptions, applause, or audience reactions"interruption"
: Formal parliamentary interruptions"procedural"
: Procedural statements or announcements"procedural_note"
: Administrative notes and formal procedures"section_header"
: Agenda items or major section dividers"subsection_header"
: Minor headings within larger sections
Speaker information (varies by source):
Simple speeches (White House, Kremlin):
- speaker: String with speaker name
Parliamentary sources (Bundestag, House of Commons):
- speaker: Object containing detailed information:
name
: Full name of the speakerperson_id
: Unique identifier for the person (may be null)party
: Political party affiliation (may be null)parliamentary_group
: Parliamentary group membership (Bundestag only)role
: Role type ("mp"
,"presidency"
,"minister"
, etc.)position
: Specific position ("Präsident"
,"Alterspräsident"
, etc.)
Source-specific fields:
Bundestag documents include:
- bundestag_id: Internal Bundestag segment identifier
- content_type: More specific content categorization
- legislative_period: Parliamentary term number
- session_no: Session number within the legislative period
House of Commons documents include:
- hansard_id: Unique Hansard identifier for each segment
- xml_tag: Original XML structure tag from Hansard archives
- files_combined: List of source XML files used to construct the document
The corpus is organized into two main categories based on the source type:
chronorhetorics-corpus/
└── data/
├── institution/
│ ├── msc/
│ ├── parliament/
│ │ ├── de-bundestag/
│ │ └── uk-house-of-commons-protocols/
│ ├── ru-putin-speeches/
│ └── us-white-house-remarks/
│ ├── biden/
│ ├── bush/
│ ├── obama/
│ ├── trump17/
│ └── trump25/
└── person/
├── de-bundeskanzler-speeches/
├── de-hitler-speeches/
└── ua-zelenskyy-speeches/
Standard Format:
Most files follow the pattern: {COUNTRY}-{INSTITUTION}-{TYPE}-{SPEAKER}-{NUMBER}.json
Examples:
DE-Bundestag-Protocol-0001.json
US-WhiteHouse-Remarks-Biden-0042.json
DE-Bundeskanzler-Speech-Scholz-0007.json
Munich Security Conference (MSC) Format:
MSC files use a specialized naming format: MSC-{YEAR}-{SPEAKER-NAME}-Speech.json
Examples:
MSC-1966-Karl-Theodor-Freiherr-von-und-zu-Guttenberg-Speech.json
MSC-2015-Angela-Merkel-Speech.json
MSC-2022-Volodymyr-Zelenskyy-Speech.json
The Chronorhetorics Corpus is available for download through Zenodo:
Zenodo Repository (Recommended)
- Full Corpus:
- Direct download: zenodo.org/records/15548394/files/chronorhetorics-corpus-v1.0-all.zip
For researchers interested in specific political systems or time periods:
Subcorpus | Time Coverage | Description | Size | Download |
---|---|---|---|---|
German Political Speeches | 1949 - March 2025 | Bundestag debates + Chancellor speeches | ~3.44GB | Download |
US Presidential Speeches | 2003 - March 2025 | White House remarks (Bush-Trump) | ~207MB | Download |
UK House of Commons Protocols | 1919 - March 2025 | House of Commons Hansard | ~2.1GB | Download |
Ukrainian Presidential Speeches | 2019 - March 2025 | Ukrainian President Zelenskyy speeches | ~8.3MB | Download |
Russian Presidential Speeches | 1999 - March 2025 | Kremlin transcripts (English) | ~122.3MB | Download |
Hitler Speeches | 1933 - 1945 | Hitler speeches from Domarus collections | ~2.7MB | Download |
International Conferences | 1966 - March 2025 | Munich Security Conference | ~744KB | Download |
- JSON: Structured data with full metadata
- Disk Space: ~5.9GB for complete corpus
# Download and extract
wget https://zenodo.org/records/15548394/files/chronorhetorics-corpus-v1.0-all.zip
unzip chronorhetorics-corpus-v1.0-all.zip
# Basic usage (Python)
import json
with open('data/speeches/bundestag/DE-Bundestag-Protocol-0001.json', 'r') as f:
speech = json.load(f)
print(speech['title'])
This corpus is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
By downloading and using the corpus, you agree to:
- Cite the corpus appropriately in any publications using the provided citation format
- Respect the terms of use for derivative datasets (GermaParl, etc.)
- Attribute the work when redistributing or creating derivative works
- Use the data for research, educational, and commercial purposes as permitted under CC BY 4.0
To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
- Dr. Florian Nieser - Heidelberg Center for Digital Humanities (HCDH), Heidelberg University
- Jonathan Gaede - Heidelberg Center for Digital Humanities (HCDH), Heidelberg University
- David Schatz - Heidelberg Center for Digital Humanities (HCDH), Heidelberg University
- Dr. Stefan Pietrusky - Heidelberg Center for Digital Humanities (HCDH), Heidelberg University
If you use this corpus in your research, please cite:
@dataset{nieser2025chronorhetorics,
title={Chronorhetorics Corpus: Political Speeches for Temporal-Political Analysis},
author={Nieser, Florian and Gaede, Jonathan and Schatz, David and Pietrusky, Stefan},
year={2025},
institution={Heidelberg Center for Digital Humanities, Heidelberg University},
url={https://zenodo.org/records/15548394/files/chronorhetorics-corpus-v1.0-all.zip}
}
Blaette, Andreas (2017): GermaParl. Corpus of Plenary Protocols of the German Bundestag. TEI files, available at: https://github.com/PolMine/GermaParlTEI.
Divita, David. "Radical-right populism in spain and the strategy of chronopolitics." History and Theory, 62/4:3–23, 2023.
Domarus, Max. Hitler: Speeches and Proclamations 1932-1945. Volumes 1 & 2. Wauconda, IL: Bolchazy-Carducci Publishers.
Esposito, Fernando and Tobias Becker. "The time of politics, the politics of time, and politiced time: An introduction to chronopolitics." Language in Society, 52:757–781, 2023.
If you have any questions about the corpus or research, feel free to reach out to the members of the HCDH!
This research was conducted at the Heidelberg Center for Digital Humanities (HCDH), Heidelberg University. We thank the institutions that make their political speeches publicly available, enabling this research.
Special acknowledgments:
- The GermaParl project (Blätte & Blessing, 2018) for providing comprehensive German parliamentary data
- The German Bundestag's open data portal for transparency and open government data access
- Tadadit.xyz / Giorgio Comai for providing the foundational Kremlin speeches dataset
- Munich Security Conference for making historical speeches available
- TheyWorkForYou.com for providing the comprehensive Hansard XML dataset that enabled UK parliamentary data integration
- Various government archives and institutions for maintaining accessible speech collections