Xiaohan Zhang*, Tavis Shore*, Chen Chen, Oscar Mendez, Simon Hadfield, Safwan Wshah

Vermont Artificial Intelligence Laboratory (VaiL) · Centre for Vision, Speech, and Signal Processing (CVSSP) · University of Central Florida · Locus Robotics
Backbone | Params (M) | FLOPs (G) | Dims | R@1 | R@5 | R@10 |
---|---|---|---|---|---|---|
ConvNeXt-T | 28 | 4.5 | 768 | 1.36 | 4.34 | 7.95 |
ConvNeXt-B | 89 | 15.4 | 1024 | 3.14 | 8.14 | 13.22 |
ViT-B | 86 | 17.6 | 768 | 3.30 | 8.92 | 13.96 |
ViT-L | 307 | 60.6 | 1024 | 9.62 | 23.42 | 32.73 |
DINOv2-B | 86 | 152 | 768 | 17.37 | 36.14 | 46.96 |
DINOv2-L | 304 | 507 | 1024 | 27.49 | 51.96 | 63.13 |
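As a reference for how R@k numbers like those above are obtained, here is a minimal sketch of embedding-based retrieval and Recall@k evaluation with a DINOv2-L feature extractor. The `torch.hub` entry point, the preprocessing assumptions, and the `embed`/`recall_at_k` helpers are illustrative only and are not the repository's actual pipeline.

```python
# Minimal sketch: embedding-based retrieval and Recall@k with DINOv2-L.
# The torch.hub model name and preprocessing assumptions are illustrative,
# not necessarily what this repository uses.
import torch
import torch.nn.functional as F

# Load a DINOv2-L backbone (1024-dim global features).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
model.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, 224, 224), ImageNet-normalized. Returns L2-normalized (N, 1024) features."""
    feats = model(images)              # DINOv2 forward returns the global (CLS) feature
    return F.normalize(feats, dim=-1)

@torch.no_grad()
def recall_at_k(query_feats, ref_feats, gt_idx, ks=(1, 5, 10)):
    """gt_idx[i] is the index of the correct reference for query i."""
    sims = query_feats @ ref_feats.T                 # cosine similarity (features are normalized)
    ranks = sims.argsort(dim=1, descending=True)     # (num_queries, num_refs), best first
    hits = ranks.eq(gt_idx.unsqueeze(1))             # True where the ground-truth reference appears
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```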
VLM | R@1 | R@5 | R@10 |
---|---|---|---|
Without Re-ranking | 27.49 | 51.96 | 63.13 |
Gemini 2.5 Flash Lite | 23.54 | 48.39 | 63.13 |
Gemini 2.5 Flash | 30.21 | 53.04 | 63.13 |
| | R@1 | R@5 | R@10 |
|---|---|---|---|
0 | 24.47 | 48.16 | 60.99 |
0.1 | 26.98 | 51.34 | 61.92 |
0.3 | 27.49 | 51.96 | 63.13 |
0.5 | 24.89 | 52.03 | 62.66 |
Model | R@1 | R@5 | R@10 |
---|---|---|---|
U1652 [zheng2020university] | 1.20 | - | - |
LPN w/o drone [wang2021each] | 0.74 | - | - |
LPN w/ drone [wang2021each] | 0.81 | - | - |
DINOv2-L | 24.66 | 48.00 | 59.02 |
+ Drone Data | 27.49 | 51.96 | 63.13 |
+ VLM Re-rank (Ours) | 30.21 | 53.04 | 63.13 |
Create and activate the conda environment:

```bash
conda env create -n ENV -f requirements.yaml && conda activate ENV
```
Before running Stage 1, configure your dataset paths:

- Navigate to the `/config/` directory.
- Open the `default.yaml` file (or copy it to a new file).
- Replace the placeholder values (e.g., `DATA_ROOT`) with the actual paths to your dataset and related files.
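After editing the config, something like the following can be used to load the YAML (via PyYAML) and confirm the paths resolve. Only `DATA_ROOT` and `default.yaml` are named above; the rest of the schema and the `load_config` helper are hypothetical.

```python
# Sketch: load a Stage 1 config and check that the dataset root exists.
# Only DATA_ROOT is named in the instructions above; any other keys are repository-specific.
from pathlib import Path
import yaml

def load_config(path: str = "config/default.yaml") -> dict:
    with open(path, "r") as f:
        cfg = yaml.safe_load(f)
    data_root = Path(cfg["DATA_ROOT"])
    if not data_root.is_dir():
        raise FileNotFoundError(f"DATA_ROOT does not exist: {data_root}")
    return cfg

if __name__ == "__main__":
    cfg = load_config()
    print("Config OK, DATA_ROOT =", cfg["DATA_ROOT"])
```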
Once your configuration file is ready, you can train Stage 1 using:

```bash
python stage_1.py --config YOUR_CONFIG_FILE_NAME
```
You can also download our pre-trained weights here.
To run Stage 2, you need to:

- Open the `stage_2.py` file.
- Replace the relevant placeholders (e.g., the path to the answer file from Stage 1 and your Gemini API key).
- Ensure any other required directories or options are correctly set.

Then, simply run:

```bash
python stage_2.py
```
This performs re-ranking with a Vision-Language Model (VLM) on top of the initial retrieval results. It writes an `LLM_re_ranked_answer.txt` to the answer directory, along with a `reasons.json` containing the VLM's reasoning for each re-ranking decision.
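For a rough picture of what the Stage 2 re-ranking step involves, the sketch below reorders the top-k candidates for a query using the `google-generativeai` client and a Gemini 2.5 Flash model. The prompt wording, the image-path interface, and the response parsing are assumptions for illustration; `stage_2.py` defines the actual prompts and output format.

```python
# Rough sketch of VLM-based re-ranking of top-k retrieval candidates with Gemini.
# Prompt wording, candidate format, and response parsing are illustrative only.
import re

import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_GEMINI_API_KEY")        # replace with your key
model = genai.GenerativeModel("gemini-2.5-flash")

def rerank(query_path: str, candidate_paths: list[str]) -> list[int]:
    """Ask the VLM to reorder candidate indices from most to least likely match."""
    prompt = (
        "The first image is a query. The remaining images are candidate matches "
        f"numbered 1 to {len(candidate_paths)}. Return the candidate numbers, "
        "comma-separated, ordered from most to least likely to show the same place."
    )
    images = [Image.open(query_path)] + [Image.open(p) for p in candidate_paths]
    response = model.generate_content([prompt, *images])

    # Naive parsing: extract integers from the reply, keep valid unique indices,
    # then append any candidates the model did not mention in their original order.
    order = [int(n) - 1 for n in re.findall(r"\d+", response.text)]
    reranked = []
    for i in order:
        if 0 <= i < len(candidate_paths) and i not in reranked:
            reranked.append(i)
    return reranked + [i for i in range(len(candidate_paths)) if i not in reranked]
```

Because only the retrieved top-k list is reordered, R@10 is unchanged by this step, which is consistent with the re-ranking table above.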