This repository contains code for evaluating and benchmarking Vision Language Models (VLMs) on tabular data. Specifically, it focuses on the task of answering questions based on information presented in one or more tables.
The repository is organized as follows:
- `results/`: Directory that stores the output JSONL files containing model responses and metrics.
- `evals.ipynb`: Notebook used to experiment with the evaluation code and explore results.
- `gemini_eval.py`: Python script for evaluating the Gemini model.
- `qwen_eval.py`: Python script for evaluating the Qwen model.
- `llava_eval.py`: Python script for evaluating the LLaVA model.
- `phi3_eval.py`: Python script for evaluating the Phi-3 model.
- `evaluate_metrics.py`: Python script for calculating evaluation metrics (Exact Match and F1 Score).
- `requirements.txt`: Lists the required Python packages for running the code.
- `utils.py`: Utility functions used by the evaluation scripts.
The repository currently includes evaluation scripts for the following VLMs:
- Gemini: `gemini_eval.py` is used to evaluate the Gemini model (specifically `gemini-exp-1206`).
- Qwen: `qwen_eval.py` is used to evaluate the Qwen model (specifically `Qwen/Qwen2-VL-2B-Instruct`).
- LLaVA: `llava_eval.py` is used to evaluate the LLaVA model (e.g., `llava-hf/llava-onevision-qwen2-7b-ov-hf`).
- Phi-3: `phi3_eval.py` is used to evaluate the Phi-3 model (e.g., `microsoft/Phi-3.5-vision-instruct`).
- Install required packages:

  ```
  pip install -r requirements.txt
  ```

- Set up API keys:
  - For Gemini, you need to configure your Google API key. Instructions can be found in the `gemini_eval.py` file; a minimal sketch follows below.
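For reference, here is a minimal sketch of one way the key might be supplied, assuming the script reads it from an environment variable; the variable name `GOOGLE_API_KEY` is an assumption, so check `gemini_eval.py` for the mechanism it actually expects.

```python
# Hypothetical sketch: configuring the google-generativeai client from an
# environment variable. The GOOGLE_API_KEY name is an assumption;
# gemini_eval.py may read the key differently.
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
```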
To evaluate the Gemini model, run the following command:
```
python gemini_eval.py --input-dir <path_to_data_directory> --output-file <path_to_output_file> --model-name gemini-exp-1206
```

- `--input-dir`: Path to the directory containing the JSONL data file and the `table_images` subdirectory.
- `--output-file`: Path to the output JSONL file where the results will be saved (e.g., `results/gemini_two_table_eval.jsonl`).
- `--model-name`: The name of the Gemini model to use (default: `gemini-exp-1206`).
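As a rough illustration of the kind of call the script makes, the sketch below sends a question together with its table images to a Gemini model via the `google-generativeai` package. The prompt construction and paths are assumptions, not the exact logic of `gemini_eval.py`.

```python
# Minimal sketch (not the exact script logic): querying Gemini with a question
# plus table images. Assumes genai.configure(api_key=...) was called as in the
# setup sketch above; the image directory path is a placeholder.
from pathlib import Path

import google.generativeai as genai
from PIL import Image

model = genai.GenerativeModel("gemini-exp-1206")

question = "Which department has more than 1 head at a time?"
image_dir = Path("data/table_images")  # hypothetical --input-dir layout
images = [Image.open(image_dir / name)
          for name in ["TableImg_11gu6_15.png", "TableImg_Y3m5c_5.png"]]

response = model.generate_content([question, *images])
print(response.text)
```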
To evaluate the Qwen model, run the following command:
```
python qwen_eval.py --input-dir <path_to_data_directory> --output-file <path_to_output_file> --model-path <path_to_model> --device <device> --quantization False
```

- `--input-dir`: Path to the directory containing the JSONL data file and the `table_images` subdirectory.
- `--output-file`: Path to the output JSONL file where the results will be saved (e.g., `results/qwen_two_table_eval.jsonl`).
- `--model-path`: Path to the Qwen model (default: `Qwen/Qwen2-VL-2B-Instruct`).
- `--device`: The device to run the model on (e.g., `cuda:0`, `cpu`). Default is `cuda:3`.
- `--quantization`: Whether to load the model with quantization.
- `--quant_bit`: Quantization precision, either 4-bit or 8-bit.
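For reference, a minimal sketch of how 4-bit or 8-bit loading can be wired up with `transformers` and `bitsandbytes`; the flag handling and defaults shown here are assumptions and may differ from `qwen_eval.py`.

```python
# Minimal sketch (assumptions, not the script's exact code): loading Qwen2-VL
# with optional 4-/8-bit quantization via bitsandbytes.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2VLForConditionalGeneration

model_path = "Qwen/Qwen2-VL-2B-Instruct"
quantization, quant_bit = True, 4  # mirrors --quantization / --quant_bit

bnb_config = None
if quantization:
    bnb_config = BitsAndBytesConfig(load_in_4bit=(quant_bit == 4),
                                    load_in_8bit=(quant_bit == 8))

processor = AutoProcessor.from_pretrained(model_path)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
    device_map="cuda:3",  # matches the documented default device
)
```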
To evaluate the LLaVA model, run the following command:
```
python llava_eval.py --input-dir <path_to_data_directory> --output-file <path_to_output_file> --model-path <path_to_llava_model> --device <device>
```

- `--input-dir`: Path to the directory containing the JSONL data file and the `table_images` subdirectory.
- `--output-file`: Path to the output JSONL file where the results will be saved (e.g., `results/llava_two_table_eval.jsonl`).
- `--model-path`: Path to the LLaVA model (e.g., `llava-hf/llava-onevision-qwen2-7b-ov-hf`).
- `--device`: The device to run the model on (e.g., `cuda:0`, `cpu`). Default is `cuda:0`.
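A minimal sketch of loading a LLaVA-OneVision checkpoint with Hugging Face `transformers`; this is an assumed setup rather than the exact structure of `llava_eval.py`.

```python
# Minimal sketch (assumed setup): loading a LLaVA-OneVision model and its
# processor from the Hugging Face Hub.
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_path = "llava-hf/llava-onevision-qwen2-7b-ov-hf"
device = "cuda:0"  # matches the documented default device

processor = AutoProcessor.from_pretrained(model_path)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.float16
).to(device)
```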
To evaluate the Phi-3 model, run the following command:
```
python phi3_eval.py --input-dir <path_to_data_directory> --output-file <path_to_output_file> --model-path <path_to_phi3_model> --device <device>
```

- `--input-dir`: Path to the directory containing the JSONL data file and the `table_images` subdirectory.
- `--output-file`: Path to the output JSONL file where the results will be saved (e.g., `results/phi3_two_table_eval.jsonl`).
- `--model-path`: Path to the Phi-3 model (e.g., `microsoft/Phi-3.5-vision-instruct`).
- `--device`: The device to run the model on (e.g., `cuda:0`, `cpu`). Default is `cuda:3`.
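Similarly, a rough sketch of loading Phi-3.5-vision: the checkpoint ships custom modeling code, so it needs `trust_remote_code=True`; the other arguments shown are assumptions and may differ from `phi3_eval.py`.

```python
# Minimal sketch (assumed setup): loading Phi-3.5-vision-instruct, which uses
# remote modeling code and therefore requires trust_remote_code=True.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "microsoft/Phi-3.5-vision-instruct"
device = "cuda:3"  # matches the documented default device

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    _attn_implementation="eager",  # assumption: avoids requiring flash-attn
).to(device)
```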
After generating model responses, you can calculate evaluation metrics using `evaluate_metrics.py`.
```
python evaluate_metrics.py --input-file <path_to_model_output_file> --output-file <path_to_metrics_output_file> --wandb-project <project_name> --wandb-entity <entity> --wandb-run-name <run_name>
```

- `--input-file`: Path to the JSONL file containing model responses (output from the `*_eval.py` scripts).
- `--output-file`: Path to save the JSON metrics file (e.g., `results/metrics/qwen_two_table_metrics.json`).
- `--wandb-project`: (Optional) Your Weights & Biases project name for logging metrics.
- `--wandb-entity`: (Optional) Your Weights & Biases entity name.
- `--wandb-run-name`: (Optional) A name for your Weights & Biases run.
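For reference, one common way to compute Exact Match and a token-level F1 Score over normalized answer strings is sketched below; `evaluate_metrics.py` may normalize answers differently (for example, by comparing against the structured `golden_answer`).

```python
# Minimal sketch of Exact Match and token-level F1 between an answer string and
# a prediction string. Illustrates the metrics conceptually; the repo's
# evaluate_metrics.py may define normalization differently.
from collections import Counter


def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Treasury", "treasury"))          # 1.0
print(f1_score("Enrique from USA", "Enrique USA"))  # 0.8
```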
The input data should be in a directory containing:
- A JSONL file where each line represents a data sample.
- A subdirectory named `table_images` containing the images of the tables referenced in the JSONL file.
Each data sample in the JSONL file should have the following format:
```json
{
  "question": "Which department has more than 1 head at a time? List the id, name and the number of heads.",
  "answer": {"columns": ["Department_ID", "Name", "count(*)"], "index": [0], "data": [[2, "Treasury", 2]]},
  "table_names": ["department", "management"],
  "table_image_ids": ["TableImg_11gu6_15.png", "TableImg_Y3m5c_5.png"],
  "original_data_index": 3
}
```
- `question`: The question to be answered based on the tables.
- `answer`: The ground truth answer to the question.
- `table_image_ids`: A list of image file names (from the `table_images` directory) corresponding to the tables needed to answer the question.
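A minimal sketch of iterating over a dataset in this layout; the file and directory names used here are placeholders, since the scripts may discover the JSONL file differently.

```python
# Minimal sketch: reading the JSONL samples and opening the table images they
# reference. The names two_table_data/ and data.jsonl are placeholder assumptions.
import json
from pathlib import Path

from PIL import Image

input_dir = Path("two_table_data")  # hypothetical --input-dir
with open(input_dir / "data.jsonl") as f:
    for line in f:
        sample = json.loads(line)
        images = [Image.open(input_dir / "table_images" / img_id)
                  for img_id in sample["table_image_ids"]]
        print(sample["question"], len(images))
```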
The output file will be a JSONL file where each line represents the evaluation result for a single data sample. The format is as follows:
```json
{
  "question": "What is the name and country of origin of the artist who released a song that has \"love\" in its title?",
  "golden_answer": {"columns": ["artist_name", "country"], "index": [0], "data": [["Enrique", "USA"]]},
  "table_image_ids": ["TableImg_U9vum_6.png", "TableImg_Hn0vz_6.png"],
  "response": "[{\"artist_name\": \"Enrique\", \"country\": \"USA\", \"country_of_origin\": \"USA\"}]"
}
```
- `question`: The original question.
- `golden_answer`: The ground truth answer.
- `table_image_ids`: The IDs of the table images used.
- `response`: The model's generated response.
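Because `response` is itself a JSON-encoded string, downstream code typically decodes it before comparing against `golden_answer`. A minimal sketch, not necessarily how `evaluate_metrics.py` handles it:

```python
# Minimal sketch: reading an output JSONL file and decoding the JSON string in
# the "response" field. The result file path is an example from above.
import json

with open("results/gemini_two_table_eval.jsonl") as f:
    for line in f:
        record = json.loads(line)
        try:
            predicted_rows = json.loads(record["response"])  # list of dicts
        except json.JSONDecodeError:
            predicted_rows = []  # model did not return valid JSON
        print(record["question"], predicted_rows)
```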
The `utils.py` file contains the `SYSTEMS_INSTRUCTIONS` variable, which defines the instructions given to the models during evaluation. These instructions are designed to guide the models to:
- Understand and reason about tabular data.
- Carefully examine the tables and identify relevant information.
- Formulate a clear and concise answer in natural language.
- Avoid including SQL queries in the answer.
- Be accurate and avoid hallucinations.
- Provide answers in a specific JSON format.
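Purely for illustration, a hypothetical instruction string covering those points is shown below; this is not the actual contents of `SYSTEMS_INSTRUCTIONS` in `utils.py`.

```python
# Hypothetical example only; see utils.py for the real SYSTEMS_INSTRUCTIONS.
SYSTEMS_INSTRUCTIONS_EXAMPLE = (
    "You are given one or more table images and a question. "
    "Carefully examine the tables, identify the relevant rows and columns, "
    "and answer the question in clear, concise natural language. "
    "Do not include SQL queries. Do not invent values that are not in the tables. "
    "Return the answer in the required JSON format."
)
```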
- The Gemini evaluation script (`gemini_eval.py`) includes rate limiting to avoid exceeding API usage limits.
- The Qwen, LLaVA, and Phi-3 evaluation scripts handle CUDA out-of-memory errors by skipping the problematic data sample and freeing up memory (see the sketch after these notes).
- The `evals.ipynb` notebook contains experiments with the evaluation code and explores the results.
- Make sure to adjust the `--device` argument in the evaluation scripts based on your available hardware resources.
- The `evaluate_metrics.py` script calculates Exact Match and F1 Score and can optionally log results to Weights & Biases.
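As referenced in the notes above, a minimal sketch of the skip-and-free pattern for CUDA out-of-memory errors; the scripts' actual handling may differ in detail.

```python
# Minimal sketch: skip a sample on CUDA OOM and release cached memory so the
# evaluation loop can continue. Not the scripts' exact code.
import torch


def generate_safely(model, inputs):
    try:
        with torch.no_grad():
            return model.generate(**inputs, max_new_tokens=512)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # free cached blocks before moving on
        return None  # caller skips this sample
```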