Hi and thanks for the great work on WebCode2M! I’m currently working on my master’s thesis on UI code generation and have been exploring your papers and codebase in detail. I have a few questions regarding the training setup and scripts.
1. Are the training scripts in the repo related only to WebSight?
In the WebCode2M paper, you mention training two models based on Pix2Struct:
"Compared to the two fine-tuned baselines, Design2Code-18B and WebSight VLM-8B, our model, fine-tuned from the smaller Pix2Struct-1.3B, outperforms both..."
Also:
"WebCoder*-1.3B is another Pix2Struct model trained on the WebSight dataset for comparative experiments."
This suggests that:
- One Pix2Struct model was fine-tuned on WebCode2M (your main proposed model),
- Another Pix2Struct model was fine-tuned on WebSight, used for comparison only.
However, in the GitHub repository, all the training scripts (stage0.sh, stage1.sh, stage2.sh) seem to use WebSight as the dataset (WebSight-format-parquet, arrows_8-14_processed, etc.).
Can you confirm:
- Are these scripts only for the WebSight-trained model (WebCoder*)?
- Are you planning to release the scripts for training the WebCode2M-fine-tuned model?
2. The paper describes two training phases, but the codebase defines three stages
In the WebCode2M paper (Section 3.4), the training procedure is presented as consisting of two main phases:
"Initially, we fine-tune the model [...] with a sequence length of 2,048 tokens for three epochs (90,000 iterations)... Then we refine the model [...] reducing the sequence length to 1,024 tokens over 10,000 iterations."
However, in the codebase, specifically in the my_dataset.py file, there are three distinct training stages implemented (stage 0, stage 1, and stage 2), each with its own input encoding and processing logic.
- What is the mapping between the three stages implemented in my_dataset.py and the two training phases described in the paper?
- In what order are the three stages executed?
- Which of these stages correspond to the first 2,048-token training phase, and which to the 1,024-token refinement?
- Is Stage 2 part of the WebCode2M training pipeline?
3. Why is the sequence length 2,048 tokens? A concern about truncation
Some real-world webpages exceed 2,048 tokens once their HTML and CSS are merged (a rough way to check this is sketched after the list below). In the codebase, longer sequences are truncated or dropped. I wanted to ask:
- Does the 2,048-token truncation limit the model’s ability to learn full-page structures, given that it only sees truncated inputs during training?
- Is the 2,048-token limit a design choice driven by GPU/memory constraints?
- Or were there other reasons (e.g., training efficiency, convergence)?
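For context, here is a rough way to estimate how many pages exceed that budget. This is only my own sketch under assumptions of mine (the google/pix2struct-base tokenizer checkpoint, the dataset identifier, and a text column holding the merged HTML+CSS); it is not taken from your repo and may not match your actual preprocessing:

```python
# Rough estimate of how many merged HTML+CSS documents exceed a 2,048-token budget.
# NOTE: this is my own sketch, not code from the repo. The tokenizer checkpoint,
# the dataset identifier, and the "text" field name are all assumptions.
from itertools import islice

from datasets import load_dataset
from transformers import AutoTokenizer

MAX_LEN = 2048
tokenizer = AutoTokenizer.from_pretrained("google/pix2struct-base")  # assumed checkpoint

# Assumed dataset id and split; replace with the actual WebCode2M parquet/arrow files.
stream = load_dataset("xcodemind/webcode2m", split="train", streaming=True)

total = over_limit = 0
for sample in islice(stream, 1000):  # small sample for a quick estimate
    merged_html = sample["text"]     # assumed column holding the merged HTML+CSS
    n_tokens = len(tokenizer(merged_html, truncation=False)["input_ids"])
    total += 1
    over_limit += int(n_tokens > MAX_LEN)

print(f"{over_limit}/{total} sampled pages exceed {MAX_LEN} tokens")
```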
Thanks again for the amazing research — and I look forward to your clarification!
Best regards,
Mattia