SustainableQA: A Comprehensive Question Answering Dataset for Corporate Sustainability and EU Taxonomy Reporting
SustainableQA is a large-scale question-answering dataset designed for corporate sustainability reporting and EU Taxonomy compliance. The dataset provides comprehensive QA pairs extracted from corporate sustainability reports and annual reports, enabling the development of AI systems for sustainability compliance and ESG analysis.
Our dataset is derived from 61 publicly available corporate reports from German and Austrian companies (2020-2024), including annual reports, sustainability reports, and EU Taxonomy disclosures downloaded from official corporate websites.
- 195,287 total QA pairs from corporate sustainability reports
- 88,792 factoid questions + 102,539 non-factoid questions + 3,956 table-based QA
- Three domains: EU Taxonomy, ESG, and Sustainability
- Multi-span complexity: 16.7% of questions require multiple text spans
- German & Austrian focus: 61 reports from DACH region companies
Component | Count | Details |
---|---|---|
Total QA Pairs | 195,287 | Factoid + Non-factoid + Tables |
Factoid Questions | 88,792 | Short, precise answers |
Non-factoid Questions | 102,539 | Descriptive, explanatory answers |
Table-based QA | 3,956 | From 218 complex tables |
Text Passages | 8,067 | Semantically coherent segments |
Source Reports | 61 | German & Austrian companies |
Category | Passages | Factoid QA | Non-factoid QA | Total QA |
---|---|---|---|---|
ESG | 4,320 | 48,260 | 55,139 | 103,399 |
EU Taxonomy | 747 | 8,260 | 8,906 | 17,166 |
Sustainability | 3,000 | 32,272 | 38,494 | 70,746 |
Category | Mean Spans | Single-Span | Multi-Span | Complexity Notes |
---|---|---|---|---|
Overall | 1.36 | 83.3% | 16.7% | EU Taxonomy most complex |
ESG | 1.37 | 83.1% | 16.9% | Moderate complexity |
EU Taxonomy | 1.45 | 78.8% | 21.2% | Highest complexity |
Sustainability | 1.32 | 84.6% | 15.4% | Lowest complexity |
Q: What SDGs are mentioned in the context?
A: SDG 13: Climate action, SDG 16: Peace and justice...
Q: What is the company's total CapEx for taxonomy-eligible activities?
A: €15.2 million
Q: Which environmental objectives does activity 3.10 contribute to?
A: Climate change mitigation
Q: Why does activity 3.10 fail to meet the substantial contribution criterion for the manufacture of hydrogen?
A: Because the quantified life-cycle GHG emission savings are not verified, which is necessary to fulfill the criterion for substantial contribution to climate change mitigation.
Q: How does the company assess the "Do No Significant Harm" criteria?
A: The company conducts a comprehensive evaluation across all six environmental objectives, ensuring that while contributing to one objective, the activity does not significantly harm the other five through detailed impact assessments and third-party verification.
This dataset is created for academic research at University of Innsbruck using publicly available corporate reports. All source materials were published for stakeholder transparency and regulatory compliance. The dataset is released under CC BY-NC 4.0 license for non-commercial research and educational use.
This dataset is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). This means you can use, share, and adapt the material for non-commercial purposes with proper attribution.
- Primary Contact: mohammed.ali@uibk.ac.at
- Co-Authors: abdelrahman.abdallah@uibk.ac.at, adam.jatowt@uibk.ac.at
- Institution: University of Innsbruck, Department of Computer Science
- 💾 Dataset: GitHub Repository
- 🏛️ Institution: University of Innsbruck
- 📧 Contact: mohammed.ali@uibk.ac.at
- 📄 Academic Paper: Research Publication