Liang Yao (姚亮)*, Fan Liu (刘凡)* ✉, Delong Chen (陈德龙)*, Chuanyi Zhang (张传一), Yijun Wang (王翌骏), Ziyun Chen (陈子赟), Wei Xu (许玮), Shimin Di (邸世民), Yuhui Zheng (郑钰辉)

* Equal Contribution ✉ Corresponding Author
Model: 🤗RemoteSAM
Dataset: 🤗RemoteSAM-270K
- 2025/5/7: We have released the model and dataset! You can download RemoteSAM-270K from 🤗RemoteSAM-270K and the checkpoint from 🤗RemoteSAM.
- 2025/5/3: Welcome to RemoteSAM! The preprint of our paper is available. Dataset and model are open-sourced at this repository.
Welcome to the official repository of our paper "RemoteSAM: Towards Segment Anything for Earth Observation"!
Recent advances in AI have revolutionized Earth observation, yet most remote sensing tasks still rely on specialized models with fragmented interfaces. To address this, we present RemoteSAM, a vision foundation model that unifies pixel-, region-, and image-level tasks through a novel architecture centered on Referring Expression Segmentation (RES). Unlike existing paradigms—task-specific heads with limited knowledge sharing or text-based models struggling with dense outputs—RemoteSAM leverages pixel-level predictions as atomic units, enabling upward compatibility to higher-level tasks while eliminating computationally heavy language model backbones. This design achieves an order-of-magnitude parameter reduction (billions to millions), enabling efficient high-resolution data processing.
We also build the RemoteSAM-270K dataset, a large-scale collection of 270K Image-Text-Mask triplets generated via an automated pipeline powered by vision-language models (VLMs). This dataset surpasses existing resources in semantic diversity, covering 1,000+ object categories and rich attributes (e.g., color, spatial relations) through linguistically varied prompts. We further introduce RSVocab-1K, a hierarchical semantic vocabulary used to quantify dataset coverage and adaptability.
The code has been verified to work with PyTorch v1.13.0 and Python 3.8.
- Clone this repository.
- Change directory to root of this repository.
- Create a new Conda environment with Python 3.8 then activate it:
conda create -n RemoteSAM python==3.8
conda activate RemoteSAM
- Install PyTorch v1.13.0 with a CUDA version that works on your cluster/machine (CUDA 11.6 is used in this example):
pip install torch==1.13.0+cu116 torchvision==0.14.0+cu116 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu116
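- Optionally, verify the installation and GPU visibility from a Python shell:

```python
import torch

print(torch.__version__)          # should print 1.13.0+cu116
print(torch.cuda.is_available())  # should print True if the CUDA setup is working
```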
- Install mmcv from openmmlab:
pip install mmcv-full==1.7.1 -f https://download.openmmlab.com/mmcv/dist/cu116/torch1.13.0/index.html
- Install the packages in `requirements.txt` via pip:
pip install -r requirements.txt
- Create the `./pretrained_weights` directory where we will be storing the weights.
mkdir ./pretrained_weights
- Download the pre-trained classification weights of the Swin Transformer and put the `.pth` file in `./pretrained_weights`. These weights are needed to initialize the model for training.
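- Optionally, check that the downloaded weights load before training. This is only a minimal sketch; the file name below is an example and may differ from the checkpoint you actually downloaded:

```python
import torch

# NOTE: example file name -- replace it with the Swin checkpoint you downloaded.
state = torch.load("./pretrained_weights/swin_base_patch4_window12_384_22k.pth",
                   map_location="cpu")
# Official Swin release checkpoints typically store the parameters under a 'model' key.
print(list(state.keys()))
```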
We perform all experiments on our proposed RemoteSAM-270K dataset.
- Download our dataset from HuggingFace.
- Copy all the downloaded files to `./refer/data/`. The dataset folder should look like this:
$DATA_PATH
├── RemoteSAM-270K
│   ├── JPEGImages
│   ├── Annotations
│   ├── refs(unc).p
│   └── instances.json
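To verify the download, the annotation files can be inspected directly. The sketch below assumes the paths from the layout above and a standard RefCOCO-style `refer` format (a pickled list of referring-expression records plus a COCO-style `instances.json`); exact field names may differ:

```python
import json
import pickle

data_root = "./refer/data/RemoteSAM-270K"

# refs(unc).p: pickled referring-expression records (RefCOCO-style convention assumed).
with open(f"{data_root}/refs(unc).p", "rb") as f:
    refs = pickle.load(f)
print(len(refs), "referring-expression records")
print(refs[0].keys())

# instances.json: COCO-style annotation file (images / annotations / categories expected).
with open(f"{data_root}/instances.json") as f:
    instances = json.load(f)
print(instances.keys())
```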
We use PyTorch's DistributedDataParallel for training; more training settings can be changed in `args.py`. To run on 8 GPUs on a single node:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch \
--nproc_per_node 8 --master_port 12345 train.py \
--epochs 40 --img_size 896 2>&1 | tee ./output
To get started with RemoteSAM, please first initialize a model and load the RemoteSAM checkpoint with a few lines of code:
from tasks.code.model import RemoteSAM, init_demo_model
import cv2
import numpy as np
device = 'cuda:0'
checkpoint = "./pretrained_weights/checkpoint.pth"
model = init_demo_model(checkpoint, device)
model = RemoteSAM(model, device, use_EPOC=True)
Then, you can explore the different tasks supported by RemoteSAM with the examples below; a short sketch for visualizing the returned masks follows the list:
- Referring Expression Segmentation
image = cv2.imread("./assets/demo.jpg")
mask = model.referring_seg(image=cv2.cvtColor(image, cv2.COLOR_BGR2RGB), sentence="the airplane on the right")
- Semantic Segmentation
image = cv2.imread("./assets/demo.jpg")
result = model.semantic_seg(image=cv2.cvtColor(image, cv2.COLOR_BGR2RGB), classnames=['airplane', 'vehicle'])
for classname in ["airplane", "vehicle"]:
    mask = result[classname]
- Object Detection
image = cv2.imread("./assets/demo.jpg")
result = model.detection(image=cv2.cvtColor(image, cv2.COLOR_BGR2RGB), classnames=['airplane', 'vehicle'])
for classname in ["airplane", "vehicle"]:
    boxes = result[classname]
- Visual Grounding
image = cv2.imread("./assets/demo.jpg")
box = model.visual_grounding(image=cv2.cvtColor(image, cv2.COLOR_BGR2RGB), sentence="the airplane on the right")
- Multi-Label Classification
image = cv2.imread("./assets/demo.jpg")
result = model.multi_label_cls(image=cv2.cvtColor(image, cv2.COLOR_BGR2RGB), classnames=['airplane', 'vehicle'])
print(result)
- Image Classification
image = cv2.imread("./assets/demo.jpg")
result = model.multi_class_cls(image=cv2.cvtColor(image, cv2.COLOR_BGR2RGB), classnames=['airplane', 'vehicle'])
print(result)
- Image Captioning
image = cv2.imread("./assets/demo.jpg")
result = model.captioning(image=cv2.cvtColor(image, cv2.COLOR_BGR2RGB), classnames=['airplane', 'vehicle'], region_split=9)
print(result)
- Object Counting
image = cv2.imread("./assets/demo.jpg")
result = model.counting(image=cv2.cvtColor(image, cv2.COLOR_BGR2RGB), classnames=['airplane', 'vehicle'])
for classname in ["airplane", "vehicle"]:
    print("{}: {}".format(classname, result[classname]))
- Evaluation of Referring Expression Segmentation
bash tasks/REF.sh
- Evaluation of Semantic Segmentation
bash tasks/SEG.sh
- Evaluation of Object Detection
bash tasks/DET.sh
- Evaluation of Visual Grounding
bash tasks/VG.sh
- Evaluation of Multi-Label Classification
bash tasks/MLC.sh
- Evaluation of Image Classification
bash tasks/MCC.sh
- Evaluation of Image Captioning
bash tasks/CAP.sh
- Evaluation of Object Counting
bash tasks/CNT.sh
- We thank Lu Wang (王璐) for his efforts on the RemoteSAM-270K dataset.
- The code in this repository is built on RMSIN. We thank the authors for open-sourcing their project.
For questions, please contact yaoliang@hhu.edu.cn.
If you find this work useful, please cite our paper as:
@misc{yao2025RemoteSAM,
title={RemoteSAM: Towards Segment Anything for Earth Observation},
author={Liang Yao and Fan Liu and Delong Chen and Chuanyi Zhang and Yijun Wang and Ziyun Chen and Wei Xu and Shimin Di and Yuhui Zheng},
year={2025},
eprint={2505.18022},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.18022},
}