
Interpretable LLM-based Table Question Answering

by Giang Nguyen1, Ivan Brugere2, Shubham Sharma2, Sanjay Kariyappa3, Anh Totti Nguyen1*, Freddy Lecue2*

*Equal advising
1Auburn University, 2J.P.Morgan AI Research, 3NVIDIA

Transactions on Machine Learning Research

arXiv · Interface · License: MIT


📜 Abstract

Interpretability in Table Question Answering (Table QA) is critical, especially in high-stakes domains like finance and healthcare. While recent Table QA approaches based on Large Language Models (LLMs) achieve high accuracy, they often produce ambiguous explanations of how answers are derived. We propose Plan-of-SQLs (POS), a new Table QA method that makes the model's decision-making process interpretable. POS decomposes a question into a sequence of atomic steps, each directly translated into an executable SQL command on the table, thereby ensuring that every intermediate result is transparent. Through extensive experiments, we show that: First, POS generates the highest-quality explanations among compared methods, which markedly improves users' ability to simulate and verify the model's decisions. Second, when evaluated on standard Table QA benchmarks (TabFact, WikiTQ, and FeTaQA), POS achieves QA accuracy that is competitive with existing methods, while also offering greater efficiency—requiring significantly fewer LLM calls and table database queries (up to 25x fewer)—and more robust performance on large tables. Finally, we observe high agreement (up to 90.59% in forward simulation) between LLMs and human users when making decisions based on the same explanations, suggesting that LLMs could serve as an effective proxy for humans in evaluating Table QA explanations.
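
For intuition, here is a purely illustrative sketch (not taken from the paper or the code) of how a TabFact-style claim could decompose into atomic steps, each paired with an executable SQL command; the claim, step wording, and SQL are hypothetical.

# Hypothetical illustration of a POS-style plan: each atomic step maps to one SQL command.
# Claim: "Germany won more gold medals than France in 2012."
plan = [
    ("Keep only the rows for Germany and France in 2012",
     "SELECT * FROM t WHERE nation IN ('germany', 'france') AND year = 2012;"),
    ("Compare the gold counts of the two remaining rows",
     "SELECT (SELECT gold FROM t WHERE nation = 'germany') > "
     "(SELECT gold FROM t WHERE nation = 'france') AS label;"),
]

Each intermediate table produced by a step can be shown to the user, which is what makes the final TRUE/FALSE label verifiable.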


🗺️ Table of Contents

  1. Environment Setup
  2. Datasets & Benchmarks
  3. Improved Planning Algorithm
  4. Performance vs. Table Size
  5. Visualization & Evaluation
  6. Interactive Demo
  7. Citation

Environment Setup

# 1️⃣  Create and activate Conda env
conda create -n tabular-llms-openai python=3.10.13 -y
conda activate tabular-llms-openai

# 2️⃣  Install core dependencies
pip install openai==0.27.4 azure-identity==1.12.0 azure-cli==2.41.0 azure-mgmt-cognitiveservices

# 3️⃣  Authenticate with Azure (browser sign-in)
az login

# 4️⃣  Project-specific dependencies
python install.py
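
If the commands above succeed, a quick sanity check along these lines (a sketch assuming the Azure AD token flow; the endpoint and deployment name are placeholders, not values from this repo) confirms that the openai==0.27.4 client can reach your Azure OpenAI resource.

# Sanity-check sketch for the Azure OpenAI setup (openai==0.27.4 with Azure AD auth).
# The endpoint and deployment name are placeholders; substitute your own resource values.
import openai
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()        # reuses the `az login` session
token = credential.get_token("https://cognitiveservices.azure.com/.default")

openai.api_type = "azure_ad"
openai.api_key = token.token
openai.api_base = "https://<your-resource>.openai.azure.com/"    # placeholder endpoint
openai.api_version = "2023-05-15"

response = openai.ChatCompletion.create(
    engine="<your-deployment>",              # placeholder deployment name
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(response["choices"][0]["message"]["content"])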

Datasets & Benchmarks

TabFact

# 🔍 Step 1: Extract data
unzip data.zip

# ▶️ Step 2: Run evaluation
python run_tabfact_pos.py --use_subset True --load_dataset True

WikiTQ

# 🔍 Step 1: Clone Dater repo (official evaluation code)
git clone https://github.com/AlibabaResearch/DAMO-ConvAI.git
cd DAMO-ConvAI/dater

# 🔍 Step 2: Download the pre-processed data (see the Dater repo instructions) and extract dater_data.tar.gz

# ▶️ Step 3: Run POS on WikiTQ
mv saved/ code/
python run_wikitq_pos.py --load_dataset True --use_subset True

Improved Planning Algorithm

Why & how (click to expand)

We observed that many POS errors stem from the planning stage.
The original planner generates all steps up‑front, often missing crucial conditions.
Our dynamic planner instead generates one atomic step at a time, conditioned on the current intermediate table—leading to consistent accuracy gains across LLM backbones.

Try it by toggling self.planning_algorithm in helper.py between static and dynamic.
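
For reference, a minimal sketch of the dynamic-planning loop described above, assuming hypothetical helpers generate_next_step, step_to_sql, and execute_sql (the actual logic lives in helper.py):

# Sketch of dynamic planning: generate one atomic step at a time,
# conditioned on the current intermediate table. Helper names are hypothetical.
def dynamic_plan(question, table, llm, max_steps=10):
    for _ in range(max_steps):
        step = generate_next_step(llm, question, table)   # next atomic step, given the current table
        if step is None:                                  # planner signals completion
            break
        sql = step_to_sql(llm, step, table)               # translate the step into SQL
        table = execute_sql(sql, table)                   # every intermediate table stays inspectable
    return table                                          # final table encodes the answer

The static planner, by contrast, produces the full list of steps before any SQL is executed.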


Performance vs. Table Size

Generate the table-size analysis plots:

python run_table_size_analysis.py           # TabFact
python run_table_size_analysis_wikitq.py    # WikiTQ

Input: a JSON results file (see example below).

{
  "test-1385": {
    "input": { ... },
    "answer": "TRUE",
    "answer_plans": { "dynamic": 1 },
    "groundtruth": "TRUE",
    "table_token_count": 101
  },
  ...
}
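
As a rough guide, a results file in this format can be binned by table size with a short script like the sketch below (the file path and bucket width are arbitrary choices, not repo defaults):

# Sketch: accuracy per table-size bucket from a results file in the format shown above.
import json
from collections import defaultdict

with open("results.json") as f:                 # placeholder path
    results = json.load(f)

bins = defaultdict(lambda: [0, 0])              # token bucket -> [correct, total]
for example in results.values():
    bucket = example["table_token_count"] // 100 * 100
    bins[bucket][1] += 1
    if example["answer"] == example["groundtruth"]:
        bins[bucket][0] += 1

for bucket in sorted(bins):
    correct, total = bins[bucket]
    print(f"{bucket}-{bucket + 99} tokens: {correct / total:.2%} accuracy ({total} examples)")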

Visualization & Evaluation

Visualizing Table QA Explanations

cd visualization/script
sh vis_POS.sh

LLM-as-a-Judge

Prepare & run the three automatic XAI studies
# 1. Extract similar examples across XAI methods
cd xai_study/llm-judge/scripts
sh prepare_samples.sh

# 2. Experiments
# 2a. Preference (clearer reasoning?)
sh run_preference.sh     

# 2b. Forward Simulation (given explanation, predict model answer)
sh run_forward_sim.sh   

# 2c. Model Prediction Debugging (is prediction correct?)
sh run_debugging.sh      
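
Once the LLM-judge and human responses are collected, forward-simulation agreement can be scored along these lines (the file names and JSON layout are hypothetical placeholders, not the scripts' actual output format):

# Sketch: agreement between LLM-judge and human answers on the forward-simulation study.
import json

with open("llm_forward_sim.json") as f:
    llm_answers = json.load(f)        # {example_id: simulated model answer}
with open("human_forward_sim.json") as f:
    human_answers = json.load(f)      # {example_id: simulated model answer}

shared = set(llm_answers) & set(human_answers)
agreement = sum(llm_answers[i] == human_answers[i] for i in shared) / len(shared)
print(f"Forward-simulation agreement: {agreement:.2%} over {len(shared)} examples")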

Human Evaluation Interfaces


Interactive Demo

Explore POS explanations live: Interactive Tabular XAI.


Citation

@article{nguyen2024interpretable,
  title={Interpretable {LLM}-based Table Question Answering},
  author={Nguyen, Giang and Brugere, Ivan and Sharma, Shubham and Kariyappa, Sanjay and Nguyen, Anh Totti and Lecue, Freddy},
  journal={arXiv preprint arXiv:2412.12386},
  year={2024}
}

♥  Please reach out to nguyengiangbkhn@gmail.com for any inquiries  ♥
