
CityLens

This repository accompanies CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing.

Introduction

In this work, we introduce CityLens, a comprehensive benchmark designed to evaluate the capabilities of large vision-language models (LVLMs) in predicting socioeconomic indicators from satellite and street view imagery. We construct a multi-modal dataset covering a total of 17 globally distributed cities, spanning 6 key domains: economy, education, crime, transport, health, and environment, reflecting the multifaceted nature of urban life. Based on this dataset, we define 11 prediction tasks and utilize three evaluation paradigms: Direct Metric Prediction, Normalized Metric Estimation, and Feature-Based Regression. We benchmark 17 state-of-the-art LVLMs across these tasks. Our results reveal that while LVLMs demonstrate promising perceptual and reasoning capabilities, they still exhibit limitations in predicting urban socioeconomic indicators. CityLens provides a unified framework for diagnosing these limitations and guiding future efforts in using LVLMs to understand and predict urban socioeconomic patterns.

🌍 Framework

CityLens spans 11 real-world indicators across 6 socioeconomic domains, covering 17 globally distributed cities with diverse urban forms and development levels. To systematically assess model performance, we evaluate 17 different LVLMs using 3 distinct evaluation paradigms.

πŸŒ† Task

This section defines the tasks used to evaluate large vision-language models in predicting socioeconomic indicators at the regional level from satellite and street view imagery. As Figure a illustrates, ground-truth data for 28 indicators across six domains are collected first, from which a final set of 11 indicators is selected to define the prediction tasks. As Figure b illustrates, tasks are defined at different spatial resolutions: census tract level for US-only indicators, MSOA level for UK-only indicators, and satellite image coverage level for global indicators. Each region is represented by satellite and street view imagery paired with the corresponding socioeconomic indicator values.
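For concreteness, a single evaluation sample can be thought of as a region record like the Python sketch below; the field names are illustrative and do not reflect the dataset's actual schema.

from dataclasses import dataclass

# Illustrative sketch of one evaluation region; field names are
# hypothetical, not the dataset's actual schema.
@dataclass
class RegionSample:
    region_id: str                 # census tract, MSOA, or satellite tile ID
    satellite_image: str           # path to the region's satellite image
    street_view_images: list[str]  # paths to up to 10 street view images
    indicators: dict[str, float]   # indicator name -> ground-truth value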

⌨️ Code Structure

  • data_process # scripts for processing raw data
  • evaluate # evaluation code
  • download_script # scripts for downloading street view images
  • config.py # global variables for the project

πŸ”§ Installation

Install the Python dependencies:

conda create -n citylens python=3.10
conda activate citylens
pip install -r requirements.txt

πŸ€– LVLM Support

To call LVLMs through their APIs, set the corresponding API keys as follows:

export OpenAI_API_KEY=""          # For OpenAI GPT-3.5, GPT-4, GPT-4o
export DASHSCOPE_API_KEY=""       # For Qwen-VL
export DeepInfra_API_KEY=""       # For Llama 3, Gemma, Mistral
export SiliconFlow_API_KEY=""     # For InternLM or Qwen
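As a minimal sketch of how an exported key is consumed (the actual client wiring lives in the evaluate/ modules and may differ), the OpenAI-backed models can be reached like this:

import os
from openai import OpenAI

# Minimal sketch: read the exported key and issue one chat request.
# The repo's actual request logic may differ.
client = OpenAI(api_key=os.environ["OpenAI_API_KEY"])
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Describe the built environment in this region."}],
)
print(response.choices[0].message.content)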

πŸ› οΈ CityLens Evaluation System User Guide

CityLens provides two complete evaluation workflows: one for direct prediction tasks (Part 2.1) and one for feature-based regression tasks (Part 2.2).

Part 1: Evaluation Data Preparation

The CityLens dataset is constructed to support the evaluation of large vision-language models (LVLMs) on region-level socioeconomic prediction tasks. The data preparation process consists of two main stages:

1. Ground Truth Collection

We collect ground-truth values for 28 candidate indicators across six domains: economy, education, crime, transport, health, and environment. Based on two criteria, perceptual relevance and redundancy reduction, 11 indicators are ultimately selected:

  • Perceptual relevance: whether a human (or a vision-language model) can reasonably infer the indicator from visual inputs such as satellite or street view imagery. Indicators without clear spatial cues (e.g., "personal miles traveled on a working weekday") are excluded.
  • Redundancy reduction: Pearson correlation analysis is performed among semantically similar indicators, and highly correlated variables (e.g., obesity and mental health, r = 0.75) are filtered out to avoid task duplication; a minimal sketch of this filtering follows below.
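The sketch below shows one way to implement such a correlation-based filter, assuming the candidate indicators are columns of a CSV; the 0.7 threshold and file name are illustrative, not the paper's exact procedure.

import pandas as pd

# Sketch of the redundancy filter: among semantically similar indicators,
# drop one of each highly correlated pair. Threshold and file name are
# illustrative only.
df = pd.read_csv("candidate_indicators.csv")  # rows: regions, columns: indicators
corr = df.corr(method="pearson").abs()

keep = list(df.columns)
for i, a in enumerate(df.columns):
    for b in df.columns[i + 1:]:
        if a in keep and b in keep and corr.loc[a, b] > 0.7:
            keep.remove(b)  # e.g. drops mental health when r = 0.75 with obesity

print("retained indicators:", keep)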

2. Urban Image Collection

Each region is represented by a combination of satellite imagery and street view images.

To balance computational efficiency and model compatibility, we limit the number of street view images to 10 per region. This ensures that the input constraints of current LVLMs (e.g., Gemini) are met while still preserving adequate visual context.

πŸ“‚ Data Access

The full dataset is publicly available at:

πŸ‘‰ https://huggingface.co/datasets/Tianhui-Liu/CityLens

βš™οΈ Download Script Descriptions

The download_script/ folder contains scripts for downloading Google and Baidu Street View imagery.

  • Google Street View

    • utils_scp.py — Preprocessing, parameter setting, and utility tools.
    • main_multi_scp_with_lanlon.py — Image downloading script; supply your own CSV of latitude/longitude coordinates to download images from Google.
    • cut_save.py — Processes and saves image patches.
  • Baidu Street View

    • utils.py — Preprocessing, parameter setting, and utility tools.
    • baidu_crawler.py — Image downloading script; supply your own CSV of latitude/longitude coordinates to download images from Baidu.
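For orientation, the CSV-driven download loop these scripts implement looks roughly like the sketch below, which calls the public Street View Static API directly rather than the repo's utilities; GOOGLE_KEY and points.csv are placeholders you must supply.

import csv
import requests

GOOGLE_KEY = "YOUR_API_KEY"  # placeholder; use your own Google Maps API key

# Hedged sketch of a CSV-driven download loop; the repo's scripts add
# preprocessing, panorama cutting, and multiprocessing on top of this.
with open("points.csv") as f:                 # expects 'lat' and 'lon' columns
    for row in csv.DictReader(f):
        params = {
            "location": f"{row['lat']},{row['lon']}",
            "size": "640x640",
            "key": GOOGLE_KEY,
        }
        r = requests.get("https://maps.googleapis.com/maps/api/streetview", params=params)
        if r.ok:
            with open(f"sv_{row['lat']}_{row['lon']}.jpg", "wb") as out:
                out.write(r.content)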

Part 2.1: Direct and Normalized Estimation

This section supports two prompting strategies:
• Simple prompt: directly asks the model to output the metric value
• Normalized prompt: asks the model to output a relative score on a 0.0–9.9 scale
Set --prompt_type to either "simple" or "normalized"; illustrative prompt texts are sketched below.
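Illustrative (not verbatim) versions of the two prompt styles, using GDP as the example task; the actual templates live in the evaluate/ modules:

# Hypothetical prompt texts; the repo's real templates may be worded differently.
simple_prompt = (
    "Based on the satellite and street view images of this region, "
    "estimate its GDP. Answer with a single number."
)
normalized_prompt = (
    "Based on the satellite and street view images of this region, "
    "rate its GDP relative to other urban regions on a 0.0-9.9 scale. "
    "Answer with a single number."
)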

Economy

# GDP
python -m evaluate.global.global_indicator --city_name="all" --mode="eval" --model_name="gpt-4o" --prompt_type="simple" --num_process=10 --task_name="gdp"
python -m evaluate.global.metrics --city_name="all" --model_name="gpt-4o" --prompt_type="simple" --task_name="gdp"

# house price
python -m evaluate.house_price.house_price_us --city_name="all" --mode="eval" --model_name="gpt-4o" --prompt_type="simple" --num_process=10
python -m evaluate.house_price.metrics --city_name="all" --model_name="gpt-4o" --prompt_type="simple" 

# population
python -m evaluate.global.global_indicator --city_name="all" --mode="eval" --model_name="gpt-4o" --prompt_type="simple" --num_process=10 --task_name="pop"
python -m evaluate.global.metrics --city_name="all" --model_name="gpt-4o" --prompt_type="simple" --task_name="pop"

Education

# bachelor ratio
python -m evaluate.education.bachelor_ratio --city_name="US" --mode="eval" --model_name="google/gemma-3-12b-it" --prompt_type="simple" 
python -m evaluate.education.metrics --city_name="US" --model_name="google/gemma-3-12b-it" --prompt_type="simple"  

Crime

# violent crime
python -m evaluate.crime.crime_us --city_name="US" --mode="eval" --model_name="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8" --prompt_type="simple" --task_name="violent"
python -m evaluate.crime.metrics --city_name="US" --model_name="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8" --prompt_type="simple" --task_name="violent"

Transport

# drive ratio
python -m evaluate.transport.transport_us --city_name="US" --mode="eval" --model_name="gpt-4o" --prompt_type="simple" --task_name="drive"
python -m evaluate.transport.metrics --city_name="US" --model_name="gpt-4o" --prompt_type="simple"  --task_name="drive"

# public transport
python -m evaluate.transport.transport_us --city_name="US" --mode="eval" --model_name="gpt-4o" --prompt_type="simple" --task_name="public"
python -m evaluate.transport.metrics --city_name="US" --model_name="gpt-4o" --prompt_type="simple"  --task_name="public"

Health

# mental health
python -m evaluate.health.health_us --city_name="US" --mode="eval" --model_name="google/gemma-3-4b-it" --prompt_type="simple" --task_name="mental"
python -m evaluate.health.metrics --city_name="US" --model_name="google/gemma-3-4b-it" --prompt_type="simple" --task_name="mental"

# life expectancy
python -m evaluate.life_exp.life_exp_uk --city_name="UK" --mode="eval" --model_name="google/gemma-3-4b-it" --prompt_type="simple"
python -m evaluate.life_exp.metrics --city_name="UK" --model_name="google/gemma-3-4b-it" --prompt_type="simple"

# accessibility to health
python -m evaluate.global.global_indicator --city_name="all" --mode="eval" --model_name="gpt-4o" --prompt_type="simple" --num_process=10 --task_name="acc2health"
python -m evaluate.global.metrics --city_name="all" --model_name="gpt-4o" --prompt_type="simple" --task_name="acc2health"

Environment

# building height
python -m evaluate.global.global_indicator --city_name="all" --mode="eval" --model_name="gpt-4o" --prompt_type="simple" --num_process=10 --task_name="build_height"
python -m evaluate.global.metrics --city_name="all" --model_name="gpt-4o" --prompt_type="simple" --task_name="build_height"

Part 2.2: Feature-Based Regression

This method follows a three-stage process:
• Feature extraction: extract 13 visual features from street view images
• Feature alignment: construct region-level feature vectors
• LASSO regression: predict socioeconomic indicators (a minimal sketch follows the commands below)

# extract features with the LVLM
python -m evaluate.feature.extract_feature
# parse the answers returned by the LVLM
python extract_api.py
# align features with reference indicators
python feature.py
# LASSO regression
python regression.py
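For the regression stage, a minimal sketch with scikit-learn is shown below; the .npy file names and the alpha value are placeholders standing in for the aligned outputs of feature.py.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Sketch of the LASSO stage: region-level vectors of the 13 visual
# features predicting one indicator. File names and alpha are placeholders.
X = np.load("region_features.npy")  # shape (n_regions, 13), hypothetical file
y = np.load("region_gdp.npy")       # shape (n_regions,), hypothetical file

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = Lasso(alpha=0.1)            # illustrative; tune via cross-validation
model.fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))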

This evaluation system was used to benchmark all 17 cities and 11 indicators in our paper; complete results are reported in Section 6 of the paper.

🌟 Citation

If you find this work helpful, please cite our paper.

@article{liu2025citylens,
  title={CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing},
  author={Liu, Tianhui and Pang, Hetian and Zhang, Xin and Ouyang, Tianjian and Zhang, Zhiyuan and Feng, Jie and Li, Yong and Pan, Hui},
  journal={arXiv preprint arXiv:2506.00530},
  year={2025}
}

πŸ‘ Acknowledgement

We are grateful to the following GitHub repositories for their valuable code and efforts.

πŸ“© Contact

If you have any questions or want to use the code, feel free to contact Tianhui Liu (tianhuiliu06@gmail.com).
