
Texture or Semantics? Vision-Language Models Get Lost in Font Recognition

This is the repository for our COLM 2025 paper "Texture or Semantics? Vision-Language Models Get Lost in Font Recognition".

When reading a research paper or doing design work, have you ever wanted to know which font is used in a figure or a design? We have, and we find it frustrating that current vision-language models (VLMs) struggle even to name a similar font as their final answer.

In this paper, we propose a simple yet meaningful font recognition benchmark designed specifically for VLMs, covering 15 popular fonts. We set two levels of difficulty: the easy version consists of regular sentences, while the hard version embeds a font name in the image to introduce a Stroop effect, a trap that is challenging for models but easy for humans to recognize.

[Figure: dataset]
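The benchmark images are produced by generate.py (step 3 below). As a rough illustration of the two difficulty levels, the following sketch renders one easy and one hard (Stroop-style) sample with Pillow; the font path, sentence, and output file names are placeholders, not the repository's actual settings.

# Sketch only: renders one "easy" and one "hard" (Stroop-style) sample.
# Assumes Pillow is installed and a .ttf file exists at the given path;
# the actual dataset is produced by generate.py in this repository.
from PIL import Image, ImageDraw, ImageFont

def render_sample(text: str, font_path: str, out_path: str, size: int = 48) -> None:
    font = ImageFont.truetype(font_path, size)
    # Measure the text so the canvas fits it with a small margin.
    left, top, right, bottom = font.getbbox(text)
    img = Image.new("RGB", (right + 40, bottom + 40), "white")
    draw = ImageDraw.Draw(img)
    draw.text((20, 20), text, font=font, fill="black")
    img.save(out_path)

# Easy version: a regular sentence rendered in the target font.
render_sample("The quick brown fox jumps over the lazy dog.",
              "./Fonts/Arial.ttf", "easy_arial.png")

# Hard version: the text names a *different* font, creating a Stroop-style trap.
render_sample("Times New Roman", "./Fonts/Arial.ttf", "hard_arial.png")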


Run the Code

1. Install Packages
pip install -r requirements.txt
2. OpenAI & Hugging Face & OpenRouter
  • Use the config.yaml file to set your API keys; a sketch of how such a file can be loaded follows the example below.
openai:
  api_key: "YOUR_API_KEY"
  organization: "YOUR_ORGANIZATION"
huggingface:
  api_key: "YOUR_API_KEY"
openrouter:
  api_key: "YOUR_API_KEY"
3. Dataset
  • Download the .ttf files for the 15 fonts into the ./Fonts directory, then run the following command to generate the dataset.
python generate.py
4. Experiments
  • Example commands for running the experiments are provided in the command.py file; a sketch of a single zero-shot query is shown after this list.
  • To run the few-shot experiments, execute python search.py first to construct the few-shot demonstrations.
  • You can add your own commands to the command.py file and run them with python command.py.
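As a rough illustration, the sketch below sends a single zero-shot font-recognition query to an API-served VLM using the OpenAI client. The model name, prompt wording, and image path are placeholders; the actual prompts and evaluation loops live in the repository's scripts.

# Sketch only: a single zero-shot font-recognition query to an API-served VLM.
# Model name, prompt wording, and image path are placeholders.
import base64
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

with open("easy_arial.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Which font is used in this image? Answer with the font name only."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)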

Experimental Results

Overall, current VLMs struggle with this task across different inference settings, even on the 15 popular fonts we consider, and are highly vulnerable to the Stroop effect.

[Figure: results]

We also conduct few-shot tests on six models with different architectures and still obtain disappointing results.

[Figure: few-shot results]


Attention Analysis

We conduct an attention analysis on two different models, and the results shed light on the inherent limitations of VLMs in capturing semantic features: the models struggle to focus on the crucial edge regions, which makes it difficult for them to produce correct answers.

[Figure: attention analysis]
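For readers who want to reproduce a similar analysis, the sketch below shows one way to extract per-layer attention weights from an open-weight VLM with Hugging Face Transformers. The model choice, prompt, and image path are illustrative and may differ from the setup used in the paper.

# Sketch only: pulling per-layer attention maps from an open-weight VLM with
# Hugging Face Transformers. The model name and prompt are illustrative.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"          # illustrative choice
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("easy_arial.png")
prompt = "USER: <image>\nWhich font is used in this image?\nASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per language-model layer,
# each of shape (batch, heads, seq_len, seq_len); the columns corresponding
# to image tokens show how much the model attends to visual regions.
print(len(outputs.attentions), outputs.attentions[0].shape)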


Questions?

If you have any questions about the repository or the paper, or if you encounter any problems using the code, feel free to email Zhecheng Li (zhl186@ucsd.edu).
