Replies: 3 comments 2 replies
-
Thanks for the detailed analysis. This is super interesting. I am moving this to the "Discussions" because, if I understand your write-up and analysis correctly, this is not a bug but a consequence of the implementation.

Regarding the padding-length dependence you describe: this is generally true and is a consequence of how the training works. That is, we fine-tune a particular token position (here the last token position, because of the causal mask; and that position depends on the longest example in the training set) to be the token used for the classification task. That's why we call `classify_review` with `max_length=train_dataset.max_length`:

```python
print(classify_review(
    text_1, model, tokenizer, device, max_length=train_dataset.max_length
))
```
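For context, `classify_review` pads (or truncates) the input to `max_length` and reads the prediction off the last token position. A simplified sketch of that behavior (assuming the chapter's `model`, `tokenizer`, and `device`; the function name here is illustrative, not the exact chapter code):

```python
import torch

def classify_review_sketch(text, model, tokenizer, device, max_length, pad_token_id=50256):
    input_ids = tokenizer.encode(text)[:max_length]               # truncate if too long
    input_ids += [pad_token_id] * (max_length - len(input_ids))   # pad with <|endoftext|>
    input_tensor = torch.tensor(input_ids, device=device).unsqueeze(0)

    with torch.no_grad():
        logits = model(input_tensor)[:, -1, :]  # logits of the last token position only
    return "spam" if torch.argmax(logits, dim=-1).item() == 1 else "not spam"
```

Because only that last position's logits are used, the token that ends up there (real text or padding) is what drives the prediction, which is why the padding length has to match the one used during training.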
Based on your description, it sounds like you changed the padding length from the one used during training. Note that both the loss and the prediction are computed only from the logits at the last output token:

```python
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)[:, -1, :]  # Logits of last output token
    loss = torch.nn.functional.cross_entropy(logits, target_batch)
    return loss
```

And in our implementation in Chapter 6, the last token was always token position 120.

To make this a bit more flexible, you could train with a batch size of 1, where you don't need padding. But even in this case, you might observe similar behavior. For instance, I ran a quick check, and the "spam" messages do seem about 2x as long as the "ham" messages on average:

```python
df["Length"] = df["Text"].apply(lambda x: len(tokenizer.encode(x)))

print(df[df["Label"] == "ham"]["Length"].values.mean())
# prints 19.85

print(df[df["Label"] == "spam"]["Length"].values.mean())
# prints 41.67
```

So because this is a very simple dataset, the model will likely learn to exploit the length (i.e., the token position) to make a good prediction. To experiment with this further, you could train the model similarly to rows 14, 15, or 16 in https://github.com/rasbt/LLMs-from-scratch/tree/main/ch06/02_bonus_additional-experiments, which don't use/need padding tokens.

But given the spam dataset statistics mentioned above, you might still get a length skew. So, in this case, it would additionally be interesting to look at a different dataset: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch06/03_bonus_imdb-classification

I did a quick check, and there seems to be less length bias there:

```python
train_df["Length"] = train_df["text"].apply(lambda x: len(tokenizer.encode(x)))

print(train_df[train_df["label"] == 0]["Length"].values.mean())
# prints 294.48

print(train_df[train_df["label"] == 1]["Length"].values.mean())
# prints 296.77
```

Btw, I do like your idea of training all tokens by creating a label array for each position. That's another really nice workaround. Thanks a lot for sharing this idea!
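In case it's useful, here is a minimal sketch of that per-position labeling idea: a hypothetical variant of `calc_loss_batch` (not code from the chapter) that replicates each example's label across all non-padding token positions:

```python
import torch
import torch.nn.functional as F

def calc_loss_batch_all_positions(input_batch, target_batch, model, device,
                                  pad_token_id=50256):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)                      # (batch, seq_len, num_classes)
    b, t, num_classes = logits.shape

    # Replicate each example's label across all of its token positions ...
    targets = target_batch.unsqueeze(1).expand(b, t).clone()
    # ... but exclude padding positions from the loss
    targets[input_batch == pad_token_id] = -100      # cross_entropy's default ignore_index

    loss = F.cross_entropy(logits.reshape(-1, num_classes), targets.reshape(-1))
    return loss
```

A model trained this way lets you read the spam probability at the last real token (or any earlier position), so the prediction no longer hinges on the padding length.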
-
Thank you for the response and for moving the topic to Discussions.
Indeed. I wrote a simple function to test the model's performance on custom inputs that aren't part of the original dataset. I implicitly assumed that the model would generalize to arbitrary input sequences and perform equally well on non-padding tokens too.
That's exactly what I wanted to try too, but I didn't have time to write the code for it. I'll try it and see how much it changes the model's behavior.
-
Hello,
I was trying to play around with the classifier trained in Chapter 6 and found that it was always predicting "spam" for any message I gave it. When I looked into it, I found that the predicted class actually depends on the number of padding tokens added to the sequence. If such a model is going to be used in production, the same padding size must always be applied to the input. Two graphs (below) show the probability of a spam prediction next to the token up to which the sequence was considered.
You can see that the final token in the message has a probability close to 1, but it changes with each extra padding token added to the sequence. In the code, all sequences have the same length, so in order to get the desired probability we need the correct padding size; otherwise the predicted probability can lead to very poor accuracy.
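To make the effect concrete, here is a rough sketch of this kind of check (assuming the trained `model`, `tokenizer`, and `device` from the chapter; the example text and padding counts are arbitrary): pad the same message to different lengths and watch the predicted spam probability move.

```python
import torch

text = "Congratulations! You have won a free ticket."  # arbitrary example message
token_ids = tokenizer.encode(text)
pad_token_id = 50256

for n_pad in range(0, 101, 20):
    ids = torch.tensor([token_ids + [pad_token_id] * n_pad], device=device)
    with torch.no_grad():
        logits = model(ids)[:, -1, :]               # logits at the last position
    spam_prob = torch.softmax(logits, dim=-1)[0, 1].item()
    print(f"padding tokens: {n_pad:3d}  P(spam) = {spam_prob:.3f}")
```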
I'm not sure if this qualifies as a bug, since, in a way, the model is doing what it was asked to do, but at the same time the model's behavior is unintuitive. I made small adjustments to the model in order to get the desired behavior: the probability evolves with every token added, within a single inference step, and doesn't depend on the number of padding tokens (a rough sketch of this per-position view follows the links below).
Main changes: see the modified code.
Modified code (not everything from the old code works, but it runs fine): https://gist.github.com/itdxer/b30ceedb4ac0f3fd2b3e37fb54f71398
Original code: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/gpt_class_finetune.py
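A rough sketch of that per-position view (again assuming the chapter's `model`, `tokenizer`, and `device`; this is illustrative and not the code from the gist): one forward pass, then the spam probability after each token.

```python
import torch

text = "Congratulations! You have won a free ticket."  # arbitrary example message
token_ids = torch.tensor([tokenizer.encode(text)], device=device)

with torch.no_grad():
    logits = model(token_ids)                          # (1, seq_len, num_classes)

spam_probs = torch.softmax(logits, dim=-1)[0, :, 1]    # P(spam) after each token
for pos, p in enumerate(spam_probs.tolist()):
    print(f"after token {pos:3d}: P(spam) = {p:.3f}")
```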
Is there a better way to address this issue?