Swiss Army Knife: Synergizing Biases in Knowledge from Vision Foundation Models for Multi-Task Learning
Yuxiang Lu* Shengcao Cao* Yu-Xiong Wang
Different foundation models excel at different tasks — what if we could combine their strengths?
Introducing SAK: a "Swiss Army Knife" approach that preserves and exploits the unique representation biases of each model during distillation, harnessing their combined power for multiple downstream tasks.
- [2025.03.03] Pre-trained checkpoints released on Hugging Face 🤗.
- [2025.02.26] Code released.
Vision Foundation Models (VFMs) have demonstrated outstanding performance on numerous downstream tasks. However, due to their inherent representation biases originating from different training paradigms, VFMs exhibit advantages and disadvantages across distinct vision tasks. Although amalgamating the strengths of multiple VFMs for downstream tasks is an intuitive strategy, effectively exploiting these biases remains a significant challenge. In this paper, we propose a novel and versatile "Swiss Army Knife" (SAK) solution, which adaptively distills knowledge from a committee of VFMs to enhance multi-task learning. Unlike existing methods that use a single backbone for knowledge transfer, our approach preserves the unique representation bias of each teacher by pairing lightweight Teacher-Specific Adapter Path modules with a shared Teacher-Agnostic Stem. Through dynamic selection and combination of representations with Mixture-of-Representations Routers, our SAK is capable of synergizing the complementary strengths of multiple VFMs. Extensive experiments show that our SAK remarkably outperforms the prior state of the art in multi-task learning by 10% on the NYUD-v2 benchmark, while also providing a flexible and robust framework that can readily accommodate more advanced model designs.
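To make the architecture concrete, here is a minimal PyTorch-style sketch of the idea; the class and module names are illustrative only and do not match the repository's actual implementation. A shared Teacher-Agnostic Stem produces a base representation, lightweight Teacher-Specific Adapter Paths specialize it toward each teacher, and a per-task Mixture-of-Representations Router weights and combines the adapted representations.

```python
import torch
import torch.nn as nn

class SAKSketch(nn.Module):
    """Illustrative sketch only -- not the repository's implementation."""

    def __init__(self, dim, num_teachers, num_tasks):
        super().__init__()
        # Shared Teacher-Agnostic Stem (a real model would use a ViT-like backbone)
        self.stem = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        # One lightweight Teacher-Specific Adapter Path per teacher
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, dim))
            for _ in range(num_teachers)
        )
        # One Mixture-of-Representations Router per downstream task
        self.routers = nn.ModuleList(nn.Linear(dim, num_teachers) for _ in range(num_tasks))

    def forward(self, x):
        shared = self.stem(x)
        # Teacher-specific representations preserve each teacher's bias
        reps = torch.stack([shared + adapter(shared) for adapter in self.adapters], dim=1)
        outputs = []
        for router in self.routers:
            # The router predicts per-teacher mixing weights from the shared features
            weights = router(shared).softmax(dim=-1).unsqueeze(-1)
            outputs.append((weights * reps).sum(dim=1))
        return outputs  # one combined representation per task

# Toy usage: 2 teachers, 3 tasks, 768-dim features
model = SAKSketch(dim=768, num_teachers=2, num_tasks=3)
task_features = model(torch.randn(4, 768))
```

Keeping the adapters lightweight is what lets each teacher's representation bias be preserved without duplicating a full backbone per teacher.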
The following environment has been tested and is recommended; you can install the requirements with pip:
conda create -n sak python=3.10
pip install -r requirements.txt
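After installation (and before launching distributed training), you may want to run a quick sanity check that PyTorch can see your GPUs; this is a suggestion, not part of the original setup:

```python
import torch

# Report the PyTorch version, CUDA availability, and number of visible GPUs
print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())
```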
In stage 1, we use the ImageNet-1k (ILSVRC-2012) dataset for distillation, which can be downloaded from the official website. We use a modified dataloader, which you can change in `datasets/imagenet.py`.
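For reference, a standard ImageNet-1k training loader (assuming the usual `train/<class>/<image>.JPEG` folder layout) looks like the sketch below; the loader in `datasets/imagenet.py` may use different transforms or a different directory convention:

```python
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# Typical ImageNet-1k training pipeline; adjust resolution and normalization
# to match the teachers used for distillation.
transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

dataset = ImageFolder("/path/to/imagenet/train", transform=transform)
loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=8, pin_memory=True)
```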
In stage 2, we use the PASCAL-Context and NYUD-v2 datasets for learning multiple downstream tasks. The two datasets can be downloaded from Google Drive: PASCAL-Context, NYUD-v2.
You should place the three datasets in the same directory and specify the path to that directory as the `db_root` variable in `datasets/utils/mypath.py`.
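For illustration, the path setting amounts to something like the snippet below; the dataset folder names and the exact contents of `datasets/utils/mypath.py` depend on the downloads and may differ:

```python
# datasets/utils/mypath.py (illustrative sketch, not the actual file)
# Point db_root at the directory that contains the dataset folders, e.g.:
#   /data/datasets/
#   ├── imagenet/
#   ├── PASCALContext/
#   └── NYUDv2/
db_root = "/data/datasets/"
```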
We provide the pre-trained SAK models (Stage 1 & Stage 2) in our Hugging Face model hub.
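If you prefer to fetch the checkpoints programmatically, `huggingface_hub` can download an entire model repo; the `repo_id` below is a placeholder, so substitute the ID listed on our model hub page:

```python
from huggingface_hub import snapshot_download

# Download all files of a model repo into a local directory.
# Replace the placeholder repo_id with the actual SAK repository on Hugging Face.
local_dir = snapshot_download(repo_id="<org>/<sak-checkpoint>", local_dir="checkpoints/sak")
print("Checkpoints downloaded to", local_dir)
```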
The config files of our models are defined in `configs/`, with `s1` and `s2` for stage 1 and stage 2, respectively. We provide some examples, and you can modify the teachers, architectures, hyperparameters, and output directory (`results_dir`).
Stage 1
torchrun --nproc_per_node=8 train_s1.py --config_path $PATH_TO_CONFIG_FILE --exp $EXP_NAME
Stage 2
torchrun --nproc_per_node=2 train_s2.py --config_path $PATH_TO_CONFIG_FILE --exp $EXP_NAME --checkpoint $PATH_TO_CHECKPOINT --task_out
`$PATH_TO_CONFIG_FILE` is the path to the config file, and `$EXP_NAME` is the name of the experiment. The config file and checkpoints will be saved in `$results_dir/$EXP_NAME`.
There are several arguments you can specify on the command line: `--seed $SEED` to set a random seed, `--wandb_name $WANDB_NAME` to log with wandb, `--checkpoint $PATH_TO_CHECKPOINT` to load a checkpoint (typically for stage 2), `--resume` to resume training with the saved optimizer and scheduler, and `--fp16` to use mixed-precision training.
Two arguments apply only to stage 2: `--task_out` to train the model with task-specific output heads, and `--alpha $ALPHA` to balance the distillation loss against the task losses.
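For intuition, such a balance typically amounts to a weighted sum of the two loss terms, as in the sketch below; this only illustrates the role of `$ALPHA` and is not the exact formulation used in our training code.

```python
import torch

def combined_loss(distill_loss, task_losses, alpha):
    # Illustrative only: one common convention scales the distillation term
    # by alpha and adds the summed task losses. The actual weighting in the
    # training code may differ.
    return alpha * distill_loss + torch.stack(list(task_losses)).sum()

# Toy usage with scalar loss values
loss = combined_loss(torch.tensor(0.8), [torch.tensor(1.2), torch.tensor(0.5)], alpha=0.5)
```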
python test.py --exp $EXP_NAME --results_dir $RESULTS_DIR --evaluate
`$EXP_NAME` is the name of the experiment specified when training stage 2, and `$RESULTS_DIR` is the output directory specified in the config file. When `--evaluate` is used, the model is evaluated on all tasks, and the boundary predictions are saved. When `--save` is used, the predictions for all tasks are saved. The predictions are saved in `$RESULTS_DIR/$EXP_NAME/predictions`. You can specify the GPU to use with `--gpu_id $GPU`.
To evaluate the object boundary/edge detection results, an evaluation tool is needed to calculate the optimal-dataset-scale F-measure (odsF); ours is modified from the SEISM project. Specifically, we use maxDist=0.0075 for PASCAL-Context and maxDist=0.011 for NYUD-v2, following previous works.
You can follow the steps below:
- The prediction images should be saved in the directory `$RESULTS_DIR/$EXP_NAME/predictions/edge/img/` after running `test.py`.
- The SEISM project is based on MATLAB; make sure you have MATLAB installed.
- Clone our modified version of SEISM into the `evaluation/` folder:
cd evaluation
git clone https://github.com/innovator-zero/seism.git
- Modify `seism/src/gt_wrappers/db_root_dir.m` to specify the path to the dataset.
- Run the following command to perform pre-processing and evaluation:
cd evaluation
python edge_evaluation.py --exp $EXP_NAME --results_dir $RESULTS_DIR --dataset $DATASET --nms
Multiple `$EXP_NAME` values can be specified and evaluated sequentially (they should share the same dataset). `$DATASET` is either `PASCALContext` or `NYUD`. `--nms` first applies non-maximum suppression (NMS) to the predictions; the processed images will be saved in `$RESULTS_DIR/$EXP_NAME/predictions/edge/nms/`.
- Get the evaluation results by running the following command:
python edge_evaluation.py --exp $EXP_NAME --results_dir $RESULTS_DIR --dataset $DATASET --done
You can also find detailed results in `$RESULTS_DIR/$EXP_NAME/predictions/edge_test.txt`.
Our implementation is based on our MTDP_Lib, a simple codebase for multi-task dense prediction methods.
We also thank the following code repositories for providing the backbones and baselines: timm, RADIO, Theia.
@inproceedings{lu2025swiss,
title={Swiss Army Knife: Synergizing Biases in Knowledge from Vision Foundation Models for Multi-Task Learning},
author={Yuxiang Lu and Shengcao Cao and Yu-Xiong Wang},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025}
}