Extracts QA pairs from documents with a human review workflow via Label Studio. Includes source tracking and quality filtering for creating ground-truth evaluation datasets.
- Extract QA pairs with source references (line numbers, chunks)
- Human review interface using Label Studio
- Quality filtering based on review scores
- Multiple export formats
- Python 3.10+
- An LLM provider API key (OpenAI, Anthropic, etc.) or local LLM (vLLM, Ollama)
- macOS/Linux (Windows users may need WSL)
# Clone the repository
git clone git@github.com:eggai-tech/qa-extraction-with-human-review.git
cd qa-extraction-with-human-review
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
make setup
Copy the example config and add your API key:
cp configs/config.example.yaml configs/config.yaml
Then edit `configs/config.yaml` and configure your API key:
api-endpoint:
  api_key: "your-api-key-here"
Place your text documents in the `data/txt/` directory. Alternatively, if you have PDFs, place them in `data/pdf/` and run the conversion to markdown:
make convert-pdfs
We use the docling library to convert PDFs to text files. The converted files will be saved in `data/txt/`.
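Under the hood, the conversion is roughly equivalent to the following sketch using docling (simplified; the repo's actual script may name outputs and handle errors differently):

```python
from pathlib import Path

from docling.document_converter import DocumentConverter

# Convert every PDF in data/pdf/ to a markdown .txt file in data/txt/.
converter = DocumentConverter()
out_dir = Path("data/txt")
out_dir.mkdir(parents=True, exist_ok=True)

for pdf_path in Path("data/pdf").glob("*.pdf"):
    result = converter.convert(pdf_path)
    markdown = result.document.export_to_markdown()
    (out_dir / f"{pdf_path.stem}.txt").write_text(markdown, encoding="utf-8")
```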
make qa-pairs
This will process all the documents in `data/txt` and save the extracted QA pairs as JSON files in `data/extracted`.
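Conceptually, this step splits each document into chunks of `chunk_size` characters and asks the LLM for `num_pairs` QA pairs per chunk. A minimal sketch of that loop, assuming an OpenAI-compatible endpoint and a shortened prompt (the real pipeline also records source references and handles malformed responses):

```python
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI(base_url="https://api.openai.com/v1", api_key="your-key-here")

PROMPT = (
    "Create {num_pairs} question-answer pairs from this text for LLM training. "
    "Return a JSON array of objects with 'question' and 'answer' keys only.\n\n"
    "Text:\n{chunk_text}"
)

def chunk_text(text: str, chunk_size: int = 2000) -> list[str]:
    # Naive fixed-size character chunking; the real pipeline may split more carefully.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def extract_qa_pairs(doc_path: Path, num_pairs: int = 5) -> list[dict]:
    pairs = []
    for chunk in chunk_text(doc_path.read_text(encoding="utf-8")):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0.7,
            messages=[{"role": "user", "content": PROMPT.format(num_pairs=num_pairs, chunk_text=chunk)}],
        )
        pairs.extend(json.loads(response.choices[0].message.content))
    return pairs
```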
make filter-qa-pairs
This will filter the extracted QA pairs based on the quality metrics (faithfulness, answer relevancy, and context precision) and save the filtered pairs in `data/filtered`.
As a final step, we can verify the QA pairs using human review via Label Studio. This allows us to ensure the quality of the extracted QA pairs.
# 1. Export QA pairs for review
make export-labelstudio
# 2. Start Label Studio
make start-labelstudio
# 3. In Label Studio (http://localhost:8080):
# - Create project "QA Pairs Review"
# - Import label_config.xml and qa_review_tasks.json
# - Review QA pairs: for each question/answer/context, answer two simple yes/no questions: "Is the answer accurate based on the context?", "Is the question relevant and well-formed?"
# - Export the results after review: Export -> Choose 'JSON-MIN'
# 4. Process review results
make process-reviews EXPORT_FILE=path/to/export.json
After the last step of processing the reviews, the final QA pairs will be saved in `data/labelstudio/verified_results`.
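For reference, `qa_review_tasks.json` follows Label Studio's task import format: a list of tasks whose `data` keys match the variables used in `label_config.xml`. A rough sketch of the export step (the field names `question`, `answer`, and `context` are assumptions; check `label_config.xml` for the actual names):

```python
import json
from pathlib import Path

def export_tasks(qa_pairs: list[dict], out_file: str = "qa_review_tasks.json") -> None:
    # Label Studio imports a list of tasks; each task's "data" keys must match
    # the variables referenced in label_config.xml.
    tasks = [
        {"data": {"question": p["question"], "answer": p["answer"], "context": p.get("context", "")}}
        for p in qa_pairs
    ]
    Path(out_file).write_text(json.dumps(tasks, ensure_ascii=False, indent=2), encoding="utf-8")
```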
make setup # Initial setup
make qa-pairs # Extract QA pairs
make filter-qa-pairs # Filter QA pairs based on quality
make export-labelstudio # Export for review
make start-labelstudio # Start Label Studio
make process-reviews EXPORT_FILE=<file> # Process results
Edit `configs/config.yaml` to customize:
llm:
  provider: "api-endpoint"  # or "vllm", "ollama", etc.

api-endpoint:
  api_base: "https://api.openai.com/v1"
  api_key: "your-key-here"
  model: "gpt-4o-mini"  # Model depends on provider

extraction:
  temperature: 0.7
  chunk_size: 2000
  num_pairs: 5  # QA pairs per chunk
To add a custom prompt for the QA generation step (`make qa-pairs`), edit the `prompts` section in the config, e.g.:
prompts:
  my_qa_generation: |
    Create {num_pairs} question-answer pairs from this text for LLM training.
    Rules:
    1. Questions must be about important facts in the text
    2. Answers must be directly supported by the text
    3. Return JSON format only:
    [
      {{
        "question": "Question 1?",
        "answer": "Answer 1."
      }},
      {{
        "question": "Question 2?",
        "answer": "Answer 2."
      }}
    ]
    Text:
    {chunk_text}
and then run:
python generate_qa.py --prompt my_qa_generation
- Label Studio won't start: try `venv/bin/label-studio start --port 8081`
- LLM errors: check the API key in `configs/config.yaml`
- Large documents: reduce `chunk_size` in the config
Example of an extracted QA pair with its source reference:
{
  "question": "What is the total nominal amount of the tranche of securities issued by Erste Group Bank AG?",
  "answer": "The tranche of securities is issued in a total nominal amount of up to EUR 50.000.000.",
  "reference": {
    "chunk_id": 1,
    "char_start": 3800,
    "char_end": 7800,
    "line_start": 72,
    "line_end": 141,
    "chunk_preview": "Erste Group Bank AG (die \"Emittentin\") vom 28. Oktober 2020, und etwaigen Nachträgen, bzw. einem ...",
    "source_document": "AT0000A2VDG3.txt"
  }
}
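Such a reference can be derived directly from the chunk boundaries. A small sketch, assuming fixed-size character chunks (the helper below is illustrative, not the repo's actual function):

```python
def build_reference(doc_text: str, doc_name: str, chunk_id: int, char_start: int, char_end: int) -> dict:
    chunk = doc_text[char_start:char_end]
    return {
        "chunk_id": chunk_id,
        "char_start": char_start,
        "char_end": char_end,
        # Line numbers are 1-based: count newlines before each chunk boundary.
        "line_start": doc_text.count("\n", 0, char_start) + 1,
        "line_end": doc_text.count("\n", 0, char_end) + 1,
        "chunk_preview": chunk[:100] + ("..." if len(chunk) > 100 else ""),
        "source_document": doc_name,
    }
```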
Let's create a sample question-answering benchmark to evaluate the RAG-based system performance on a given corpus.
We'll use a small subset of 10 documents from the FinCorpus-DE10k dataset.
The documents can be found in the `data/pdf` directory.
We'll split the tasks into the following steps:
- Convert the PDF documents to text files using docling.
- Generate question-answer pairs from the text files.
- Score the generated QA pairs based on quality metrics.
- Verify the QA pairs using human review via Label Studio.
Some PDFs are scans of the original documents, so we need to use OCR capabilities to convert them to text.
Fortunately, the `docling` library provides a convenient way to do this. We convert the sample PDFs to markdown files using the `make convert-pdfs` command.
Although it does a good job, it is not error-free, so for some files we need to manually review the generated text files in the `data/txt/` directory and fix any issues.
Below is a sample of the errors found in the generated markdown files: the table in the top right is parsed incorrectly, resulting in a single column instead of two.
To generate question-answer pairs from the text files, we can use the `make qa-pairs` command.
To configure the chunking strategy and the number of QA pairs to generate per chunk, edit the `configs/config.yaml` file.
One can also configure the QA generation prompt in `configs/config.yaml` under the `prompts` section.
If we use the generic prompt, i.e.:
Create {num_pairs} question-answer pairs from this text for LLM training.
Rules:
1. Questions must be about important facts in the text
2. Answers must be directly supported by the text
3. Return JSON format only:
[
{{
"question": "Question 1?",
"answer": "Answer 1."
}},
{{
"question": "Question 2?",
"answer": "Answer 2."
}}
]
Text:
{chunk_text}
many QA pairs will be incomplete, meaning the questions are very specific to the information in the chunk but lack the context needed to be useful for a RAG-based system. Sample questions of this type are:
- What is the submission period for the securities as specified in the text?
- What is the ISIN code for the securities mentioned in the document?
- What risks are associated with variable interest rate bonds according to the text?
We can see that the above questions do make sense, but they are too specific to the chunk and do not provide enough context.
To improve the QA generation, we can use a custom prompt that, in addition to the chunk text, includes the document summary to provide the necessary context for the questions. Here is the improved prompt:
Generate high-quality question-answer pairs for LLM training.
Document Summary:
{summary}
Use this summary to understand the document's overall context. However, generate all questions and answers strictly from the main text provided below.
Focus Areas:
- Financial instruments, especially bonds, equities, derivatives, etc.
- Issuer details, ISINs, maturity dates, terms & conditions
- Financial metrics, issuance volumes, and key legal or regulatory elements
Avoid vague or overly general questions. Ensure each question is specific, fact-based, and clearly tied to the source text.
Instructions:
1. Create exactly {num_pairs} question-answer pairs.
2. Questions must cover important facts from the text.
3. Answers must be verifiable and explicitly supported by the text.
4. Add context to each question based on the summary, such as company names, financial instruments, ISINs, etc.
5. Questions and answers must be in the same language as the source text.
6. Return JSON format only:
[
{{
"question": "Question 1?",
"answer": "Answer 1."
}},
{{
"question": "Question 2?",
"answer": "Answer 2."
}}
]
Text:
{chunk_text}
On top of QA generation, we make one additional LLM call per document to generate its summary. Here are some examples of the generated question-answer pairs:
- Where can I find the prospectus for the bond with ISIN AT0000A268B3?
- What is the total amount of the Zero Coupon Notes issued by Deutsche Pfandbriefbank AG?
- Welchen Zinssatz haben die Schuldverschreibungen der Erste Group Bank? (What is the interest rate of the bonds issued by Erste Group Bank?)
The above questions are more complete, in the sense that they provide enough context to be useful for a RAG-based system.
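The extra summary call is straightforward. A sketch, assuming an OpenAI-compatible client and an ad-hoc summary prompt (the actual prompt lives in the repo's config/code):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_document(doc_text: str, model: str = "gpt-4o-mini") -> str:
    # One extra LLM call per document; the summary is then reused for every chunk.
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "Summarize this financial document in a few sentences, mentioning the issuer, "
                "the type of instrument, and key identifiers such as ISINs:\n\n" + doc_text[:8000]
            ),
        }],
    )
    return response.choices[0].message.content

# The summary is injected into the custom prompt together with each chunk:
# prompt = CUSTOM_PROMPT.format(num_pairs=5, summary=summary, chunk_text=chunk)
```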
To score and filter the generated QA pairs based on quality metrics, we can use the `make filter-qa-pairs` command.
To configure the scoring/filtering step, edit the `filtering` section in the `configs/config.yaml` file:
# QA pair filtering settings
filtering:
  deduplicate_threshold: 0.7        # Cosine similarity threshold for question deduplication
  faithfulness_threshold: 0.6       # Minimum faithfulness threshold for question-answer pairs
  answer_relevancy_threshold: 0.7   # Minimum relevancy threshold for answers
  context_precision_threshold: 0.6  # Minimum precision threshold for context relevance
First of all, we use semantic deduplication to remove similar questions. To this end, we use an embedding model (Sentence Transformers) to compute embeddings of the questions and then apply cosine similarity to detect near-duplicates. If the cosine similarity between two questions is above the `deduplicate_threshold` (default 0.7), we keep only one of them.
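A minimal sketch of this deduplication step with sentence-transformers (the model name below is an assumption; the project may use a different embedding model):

```python
from sentence_transformers import SentenceTransformer, util

def deduplicate(qa_pairs: list[dict], threshold: float = 0.7) -> list[dict]:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    embeddings = model.encode([p["question"] for p in qa_pairs], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings, embeddings)

    kept: list[int] = []
    for i in range(len(qa_pairs)):
        # Keep a question only if it is not too similar to any question we already kept.
        if all(float(similarity[i, j]) < threshold for j in kept):
            kept.append(i)
    return [qa_pairs[i] for i in kept]
```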
After the semantic deduplication step, we score the question-answer pairs based on the following quality metrics:
- Faithfulness: Measures whether the answer is supported by the text. The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. An LLM is used to extract the claims from the answer and then check whether they can be found in the context. Important: if the document summary was included in the QA generation prompt, the context used for the faithfulness check also includes the summary. This ensures that the answer is faithful to the overall document context, not just the chunk text.
- Answer Relevancy: Measures whether the answer is relevant to the question. Lower scores are assigned to answers that are incomplete or contain redundant information; higher scores indicate better relevancy. Answer Relevancy is defined as the mean cosine similarity between the original question and a number of questions that were generated (reverse engineered) by the LLM from the answer.
- Context Precision: Evaluates, for a given question, whether the relevant item (i.e. our answer) is ranked higher than other answers (randomly selected from the extracted QA pairs). A high score indicates that the question is specific and its answer can easily be distinguished from the other answers.
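As an illustration of the Answer Relevancy definition above, a simplified sketch (the prompt, model names, and number of reverse-engineered questions are assumptions; the repo's implementation may differ):

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

client = OpenAI()  # reads OPENAI_API_KEY from the environment
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def answer_relevancy(question: str, answer: str, n_questions: int = 3) -> float:
    # Reverse-engineer questions that the given answer would answer.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Generate {n_questions} questions that the following answer would answer, one per line:\n\n{answer}",
        }],
    )
    generated = [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]

    # Score = mean cosine similarity between the original question and the generated ones.
    original_emb = embedder.encode(question, convert_to_tensor=True)
    generated_emb = embedder.encode(generated, convert_to_tensor=True)
    return float(util.cos_sim(original_emb, generated_emb).mean())
```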
We set the thresholds for the quality metrics in the `configs/config.yaml` file. The QA pairs that meet the thresholds are saved in the `data/generated/filtered` directory, and the QA pairs that fall below the thresholds are saved in the `data/generated/rejected` directory.
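The routing itself is just a per-metric threshold check, roughly (score and config keys assumed to match the `filtering` section above):

```python
def passes_thresholds(scores: dict, cfg: dict) -> bool:
    # A QA pair goes to data/generated/filtered only if it clears every threshold;
    # otherwise it ends up in data/generated/rejected.
    return (
        scores["faithfulness"] >= cfg["faithfulness_threshold"]
        and scores["answer_relevancy"] >= cfg["answer_relevancy_threshold"]
        and scores["context_precision"] >= cfg["context_precision_threshold"]
    )
```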
Below we show the distribution of Faithfulness and Answer Relevancy for the QA pairs extracted from our sample documents:
We can see that most of the QA pairs have almost perfect Faithfulness, i.e. the answers are supported by the context. Similarly, most of the QA pairs have an Answer Relevancy score above 0.8, meaning that the answers are relevant to the questions.
Let's examine some examples of the QA pairs generated from the sample documents:
"question": "What is the issue amount for the bonds issued by Erste Group Bank AG under its Debt Issuance Programme?",
"answer": "The issue amount for the bonds is up to EUR 50,000,000.",
"question": "What is the interest rate for the bonds issued by Erste Group Bank AG?",
"answer": "The interest rate is 1.40% per annum.",
However, we can also see many QA pairs that are very similar to the second example and were not caught by our semantic deduplication step, e.g.:
"question": "What is the interest rate for the bonds issued by Erste Group Bank AG under its Debt Issuance Programme?",
"answer": "The interest rate is 1.40% per annum."
"question": "What is the interest rate of the bonds issued by Erste Group Bank AG under its EUR 30 billion Debt Issuance Programme?",
"answer": "The interest rate is 1.40% per annum."
We can either lower the `deduplicate_threshold` in the `configs/config.yaml` file so that more of these near-duplicates are filtered out, or catch them in the final verification step performed by a human annotator in Label Studio.
To conclude, the quality metrics are good proxies for QA pair quality. For the best results, it is recommended to manually review the QA pairs in the last step of the workflow.
To ensure the high quality of the extracted QA pairs, we use human review via Label Studio. To do this, we run the `make export-labelstudio` and `make start-labelstudio` commands to export the QA pairs for review and start the Label Studio server.
Here is an example of the Label Studio interface for reviewing the QA pairs:
Here we simplify the task and ask the reviewer only two questions:
- Is the answer accurate based on the question and context? YES/NO
- Is the question relevant and well-formed? YES/NO
After the review is done, we can export the results and save the final QA pairs using the `make process-reviews` command. This will save only the QA pairs that have been marked "yes" for both questions in the `data/labelstudio/verified_results` directory.
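Label Studio's JSON-MIN export is a flat list of records, one per reviewed task, with the review choices stored under the control names defined in `label_config.xml`. A sketch of the processing step (the keys `answer_accurate` and `question_relevant` and the output filename are assumptions; the repo's script defines the actual names):

```python
import json
from pathlib import Path

def process_reviews(export_file: str, out_dir: str = "data/labelstudio/verified_results") -> None:
    # JSON-MIN: one flat record per task; review choices appear under the
    # control names from label_config.xml (names assumed here).
    records = json.loads(Path(export_file).read_text(encoding="utf-8"))
    verified = [
        r for r in records
        if str(r.get("answer_accurate", "")).lower() == "yes"
        and str(r.get("question_relevant", "")).lower() == "yes"
    ]
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    (Path(out_dir) / "verified_qa_pairs.json").write_text(
        json.dumps(verified, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```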
MIT
- Uses Label Studio for human review interface
- Supports multiple LLM providers (OpenAI, Anthropic, local models)