Agent-K is a novel framework for extracting mathematically complex properties from unstructured documents using large language models (LLMs). Unlike standard batch-extraction approaches based on structured-output prompting, which often fail on multi-step numerical reasoning, Agent-K decomposes the task into three stages: extracting intermediate facts, generating and executing Python code via a ReAct agent with self-reflection, and validating outputs against inter-property constraints. On a benchmark built from real-world NI 43-101 mineral reports, Agent-K significantly reduces error (sMAPE -22.1%) and improves accuracy (pass@1 +15.8%) over baselines, and it further generalizes to the financial domain (FinQA), improving pass@1 accuracy by up to 11.1% in a zero-shot setting. Our empirical results show that Agent-K is a robust framework for structured data extraction that does not rely on the availability of structured-output APIs.
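For intuition, here is a toy sketch of the three-stage flow. All function names, numbers, and the example property are illustrative placeholders, not the repository's actual API:

```python
# Toy illustration of Agent-K's three stages; placeholder logic only.

def extract_intermediate_facts(document: str) -> dict:
    # Stage 1: an LLM extracts the raw facts the target property depends on.
    return {"ore_tonnage_t": 12_500_000, "gold_grade_g_per_t": 1.8}

def react_agent_compute(facts: dict) -> dict:
    # Stage 2: a ReAct agent generates and executes Python code over the
    # facts, self-reflecting on execution errors before retrying.
    grams = facts["ore_tonnage_t"] * facts["gold_grade_g_per_t"]
    return {"contained_gold_oz": grams / 31.1035}  # grams -> troy ounces

def validate(outputs: dict) -> dict:
    # Stage 3: enforce inter-property constraints before accepting values.
    assert outputs["contained_gold_oz"] >= 0
    return outputs

facts = extract_intermediate_facts("...NI 43-101 report text...")
print(validate(react_agent_compute(facts)))  # ~723,391 oz
```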
- 2025-08: The first version of Agent-K is released!
- Python 3.12 or higher
- Docker (for running the code interpreter tool)
- `uv` package manager (recommended for Python dependency management)
- Environment variables:
  - `OPENAI_API_KEY`: OpenAI API token
  - `HF_TOKEN`: HuggingFace token
- Clone the repository
- Install the `uv` package manager
- Create and activate a virtual environment
- Install dependencies using either `uv sync` or `pip install -r requirements.txt`
- Build the Docker image for the code interpreter using `make build`
- Run the code interpreter Docker container using `make run`
- Configure API tokens by renaming the `.env.example` file to `.env` and adding your API tokens:
  - `OPENAI_API_KEY`: needed for running OpenAI models (e.g. `gpt-4o-mini-2024-07-18`)
  - `HF_TOKEN`: needed for running open-source models (e.g. `Qwen/Qwen3-30B-A3B`)
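For reference, the resulting `.env` file should look like the following (the token values shown are placeholders):

```
OPENAI_API_KEY=sk-...
HF_TOKEN=hf_...
```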
The FinQA experiments test Agent-K on the FinQA financial question answering dataset.
- Run Agent-K predictions on the FinQA test set:

  ```bash
  uv run python src/experiments/fin_qa/fin_qa_pred.py
  ```

- Evaluate the results:

  ```bash
  uv run python src/experiments/fin_qa/fin_qa_eval.py
  ```
Run the batch extraction experiments (the long-context and RAG-based variants are run separately):

```bash
uv run python src/experiments/batch_extraction.py
```

Configure batch extraction settings in `src/config/experiment_config.py`:

- `BATCH_EXTRACTION_MODEL`: model to use
- `MAX_NUM_RETRIEVED_DOCS`: number of documents to retrieve in RAG-based batch extraction
- `BATCH_METHOD`: choose between `LONG_CONTEXT` and `RAG_BASED`
- `BATCH_EXTRACTION_SAMPLE_SIZE`: set to `None` for the full dataset, or specify a number of samples
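For example, a minimal configuration might look like the sketch below. The values are illustrative; check `experiment_config.py` for how `BATCH_METHOD` is actually represented (e.g. as a string or an enum):

```python
# src/config/experiment_config.py -- illustrative values only
BATCH_EXTRACTION_MODEL = "gpt-4o-mini-2024-07-18"
MAX_NUM_RETRIEVED_DOCS = 10          # used only for RAG-based extraction
BATCH_METHOD = "LONG_CONTEXT"        # or "RAG_BASED"
BATCH_EXTRACTION_SAMPLE_SIZE = None  # None = full dataset
```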
To run TAT-LLM experiments:

- Configure the extraction method in `src/config/experiment_config.py`:

  ```python
  PDF_EXTRACTION_METHOD = ExtractionMethod.TAT_LLM
  ```

- Configure the sample size and evaluation set:

  ```python
  PDF_EXTRACTION_SAMPLE_SIZE = None  # None for all 50 PDFs, or specify number
  PDF_EXTRACTION_EVAL_TYPE = "FULL"
  TAT_LLM_MODEL = "gpt-4o-mini-2024-07-18"
  TAT_LLM_TEMPERATURE = 0.2
  ```

- Run the TAT-LLM extraction:

  ```bash
  uv run python src/experiments/multi_method_extraction.py
  ```
To run Self-RAG experiments:

- Configure the extraction method in `src/config/experiment_config.py`:

  ```python
  PDF_EXTRACTION_METHOD = ExtractionMethod.SELF_RAG
  ```

- Configure the sample size and evaluation set:

  ```python
  PDF_EXTRACTION_SAMPLE_SIZE = None  # None for all 50 PDFs, or specify number
  PDF_EXTRACTION_EVAL_TYPE = "FULL"
  SELF_RAG_MODEL = "gpt-4o-mini-2024-07-18"
  SELF_RAG_TEMPERATURE = 0.2
  ```

- Run the Self-RAG extraction:

  ```bash
  uv run python src/experiments/multi_method_extraction.py
  ```
To run Agent-K experiments:

- Configure the extraction method in `src/config/experiment_config.py`:

  ```python
  PDF_EXTRACTION_METHOD = ExtractionMethod.AGENT_K
  ```

- Configure the sample size and evaluation set:

  ```python
  PDF_EXTRACTION_SAMPLE_SIZE = None  # None for all 50 PDFs, or specify number
  PDF_EXTRACTION_EVAL_TYPE = "FULL"
  AGENT_K_MODEL = "gpt-4o-mini-2024-07-18"
  AGENT_K_TEMPERATURE = 0.2
  MAX_REFLECTION_ITERATIONS = 5
  ```

- Run the Agent-K extraction:

  ```bash
  uv run python src/experiments/multi_method_extraction.py
  ```
After running experiments, results will be stored in `data/experiments/` under the corresponding method directory. To evaluate the results:

- Copy the experiment result path into `src/eval.py`:
  - Find the result files under `data/experiments/` (e.g. `data/experiments/agent_k/agent-k_2025-08-29_11-55-40.csv`).
  - Add the result path to the `agent_extractions` list in `src/eval.py`. You can also add multiple result paths to the list to calculate pass@k scores (see the sketch after this list).
- Run the evaluation:

  ```bash
  uv run python src/eval.py
  ```

- The evaluation will output two files:
  - `pdf_extraction_metrics_<timestamp>.csv`: aggregated metrics (absolute mean error, R-squared, sMAPE, pass@1) for each complex numerical property, plus the average of all metrics.
  - `df_merged_<timestamp>.csv`: mineral-report-level metrics.
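A sketch of the `agent_extractions` list (the path shown is the example above; your result filenames will differ):

```python
# src/eval.py -- example paths; your result filenames will differ
agent_extractions = [
    "data/experiments/agent_k/agent-k_2025-08-29_11-55-40.csv",
    # add further runs of the same method here to compute pass@k
]
```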
To understand the contribution of different components, run the ablation tests:

```bash
uv run python src/experiments/ablation_tests.py
```

This will test various configurations with components disabled to measure their impact. The output will be saved in `data/experiments/ablation_tests/` under the corresponding variant directory.
The parameter search script tests different combinations of key parameters to find the configuration that yields the best extraction performance. It evaluates three main parameters:

- Max reflection iterations: the maximum number of self-reflection iterations before falling back to self-consistency.
- Temperature: the LLM sampling temperature.
- Number of retrieved documents: the number of documents retrieved as context for each complex numerical property.
To find optimal hyperparameters for your specific use case:
- Configure the search parameters in `src/experiments/parameter_search/parameter_search_config.yaml`:
  - Model: which model to use (default: `gpt-4o-mini`)
  - Sample size: number of PDFs to process per experiment (default: 5)
  - Evaluation set: `DEV`, `TEST`, or `FULL` dataset
  - Parameter values: specific values to test for each parameter
- Run the parameter search:

  ```bash
  uv run python src/experiments/parameter_search/parameter_search.py
  ```

- Visualize the results. You can also specify the weights for the composite metric, which is calculated as $\alpha \times (1 - \text{sMAPE}) + \beta \times \text{pass@1}$, where $\alpha$ and $\beta$ are configurable weights (default: 0.5 each). A worked example follows this list.

  ```bash
  # Prioritize sMAPE over pass@1 (alpha=0.8, beta=0.2)
  uv run python src/experiments/parameter_search/visualize_parameter_search.py --alpha 0.8 --beta 0.2
  ```
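For instance, a run with sMAPE = 0.20 and pass@1 = 0.70 scores $0.8 \times 0.80 + 0.2 \times 0.70 = 0.78$ under $\alpha = 0.8$, $\beta = 0.2$. A minimal sketch of the computation:

```python
def composite_metric(smape: float, pass_at_1: float,
                     alpha: float = 0.5, beta: float = 0.5) -> float:
    """Composite score used to rank parameter-search runs (higher is better)."""
    return alpha * (1 - smape) + beta * pass_at_1

print(composite_metric(smape=0.20, pass_at_1=0.70, alpha=0.8, beta=0.2))  # ~0.78
```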
This project is licensed under the MIT License - see the LICENSE file for details.