Hi and thanks for the great work on WebCode2M! I’m currently working on my master’s thesis on UI code generation and have been exploring your papers and codebase in detail. I have a few questions regarding the training setup and scripts.
1. Are the training scripts in the repo related only to WebSight?
In the WebCode2M paper, you mention training two models based on Pix2Struct:
"Compared to the two fine-tuned baselines, Design2Code-18B and WebSight VLM-8B, our model, fine-tuned from the smaller Pix2Struct-1.3B, outperforms both..."
Also:
"WebCoder*-1.3B is another Pix2Struct model trained on the WebSight dataset for comparative experiments."
This suggests that:
- One Pix2Struct model was fine-tuned on WebCode2M (your main proposed model),
- Another Pix2Struct model was fine-tuned on WebSight, used for comparison only.
However, in the GitHub repository, all the training scripts (stage0.sh, stage1.sh, stage2.sh) seem to use WebSight as the dataset (WebSight-format-parquet, arrows_8-14_processed, etc.).
Can you confirm:
- Are these scripts only for the WebSight-trained model (WebCoder*)?
- Are you planning to release the scripts for training the WebCode2M-fine-tuned model?
2. The paper describes two training phases, but the codebase defines three stages
In the WebCode2M paper (Section 3.4), the training procedure is presented as consisting of two main phases:
"Initially, we fine-tune the model [...] with a sequence length of 2,048 tokens for three epochs (90,000 iterations)... Then we refine the model [...] reducing the sequence length to 1,024 tokens over 10,000 iterations."
However, in the codebase, specifically in the my_dataset.py file, there are three distinct training stages implemented (stage 0, stage 1, and stage 2), each with its own input encoding and processing logic.
- What is the mapping between the three stages implemented in my_dataset.py and the two training phases described in the paper?
- In what order are the three stages executed?
- Which of these stages correspond to the first 2,048-token training phase, and which to the 1,024-token refinement?
- Is Stage 2 part of the WebCode2M training pipeline?
3. Why is the sequence length 2,048 tokens? A concern about truncation
Some real-world webpages exceed 2,048 tokens once their HTML and CSS are merged (a rough way to check this is sketched after the list below). In the codebase, longer sequences are truncated or dropped. I wanted to ask:
- Does the 2,048-token truncation limit the model’s ability to learn full-page structures, given that it only sees truncated inputs during training?
- Is the 2,048-token limit a design choice driven by GPU/memory constraints?
- Or were there other reasons (e.g., training efficiency, convergence)?
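For context, here is a rough way to estimate how many pages exceed that budget. This is only my own sketch under assumptions of mine (the google/pix2struct-base tokenizer checkpoint, the dataset identifier, and a text column holding the merged HTML+CSS); it is not taken from your repo and may not match your actual preprocessing:

```python
# Rough estimate of how many merged HTML+CSS documents exceed a 2,048-token budget.
# NOTE: this is my own sketch, not code from the repo. The tokenizer checkpoint,
# the dataset identifier, and the "text" field name are all assumptions.
from itertools import islice

from datasets import load_dataset
from transformers import AutoTokenizer

MAX_LEN = 2048
tokenizer = AutoTokenizer.from_pretrained("google/pix2struct-base")  # assumed checkpoint

# Assumed dataset id and split; replace with the actual WebCode2M parquet/arrow files.
stream = load_dataset("xcodemind/webcode2m", split="train", streaming=True)

total = over_limit = 0
for sample in islice(stream, 1000):  # small sample for a quick estimate
    merged_html = sample["text"]     # assumed column holding the merged HTML+CSS
    n_tokens = len(tokenizer(merged_html, truncation=False)["input_ids"])
    total += 1
    over_limit += int(n_tokens > MAX_LEN)

print(f"{over_limit}/{total} sampled pages exceed {MAX_LEN} tokens")
```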
Thanks again for the amazing research — and I look forward to your clarification!
Best regards,
Mattia