Replies: 3 comments 2 replies
-
Thanks for the detailed analysis. This is super interesting. I am moving this to the "Discussions" because, if I understand your write-up and analysis correctly, this is not a bug but a consequence of the implementation.

Regarding the padding-length dependence you describe: this is generally true and is a consequence of how the training works. That is, we fine-tune a particular token position (here the last token position, because of the causal mask; and that position depends on the longest example in the training set) to be the token used for the classification task. That's why we call `classify_review` with `max_length=train_dataset.max_length`:

```python
print(classify_review(
    text_1, model, tokenizer, device, max_length=train_dataset.max_length
))
```
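For context, `classify_review` pads (or truncates) the input to `max_length` and reads the prediction off the last token position. A simplified sketch of that behavior (assuming the chapter's `model`, `tokenizer`, and `device`; the function name here is illustrative, not the exact chapter code):

```python
import torch

def classify_review_sketch(text, model, tokenizer, device, max_length, pad_token_id=50256):
    input_ids = tokenizer.encode(text)[:max_length]               # truncate if too long
    input_ids += [pad_token_id] * (max_length - len(input_ids))   # pad with <|endoftext|>
    input_tensor = torch.tensor(input_ids, device=device).unsqueeze(0)

    with torch.no_grad():
        logits = model(input_tensor)[:, -1, :]  # logits of the last token position only
    return "spam" if torch.argmax(logits, dim=-1).item() == 1 else "not spam"
```

Because only that last position's logits are used, the token that ends up there (real text or padding) is what drives the prediction, which is why the padding length has to match the one used during training.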
Based on your description, it sounds like you changed the padding length from the one used during training. Note that both the loss and the prediction are computed only from the logits at the last output token:

```python
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)[:, -1, :]  # Logits of last output token
    loss = torch.nn.functional.cross_entropy(logits, target_batch)
    return loss
```

And in our implementation in Chapter 6, the last token was always token position 120.

To make this a bit more flexible, you could train with a batch size of 1, where you don't need padding. But even in this case, you might observe similar behavior. For instance, I ran a quick check, and the "spam" messages do seem about 2x as long as the "ham" messages on average:

```python
df["Length"] = df["Text"].apply(lambda x: len(tokenizer.encode(x)))

print(df[df["Label"] == "ham"]["Length"].values.mean())
# prints 19.85

print(df[df["Label"] == "spam"]["Length"].values.mean())
# prints 41.67
```

So because this is a very simple dataset, the model will likely learn to exploit the length (i.e., the token position) to make a good prediction. To experiment with this further, you could train the model similarly to rows 14, 15, or 16 in https://github.com/rasbt/LLMs-from-scratch/tree/main/ch06/02_bonus_additional-experiments, which don't use/need padding tokens.

But given the spam dataset statistics mentioned above, you might still get a length skew. So, in this case, it would additionally be interesting to look at a different dataset: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch06/03_bonus_imdb-classification

I did a quick check, and there seems to be less length bias there:

```python
train_df["Length"] = train_df["text"].apply(lambda x: len(tokenizer.encode(x)))

print(train_df[train_df["label"] == 0]["Length"].values.mean())
# prints 294.48

print(train_df[train_df["label"] == 1]["Length"].values.mean())
# prints 296.77
```

Btw, I do like your idea of training all tokens by creating a label array for each position. That's another really nice workaround. Thanks a lot for sharing this idea!
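In case it's useful, here is a minimal sketch of that per-position labeling idea: a hypothetical variant of `calc_loss_batch` (not code from the chapter) that replicates each example's label across all non-padding token positions:

```python
import torch
import torch.nn.functional as F

def calc_loss_batch_all_positions(input_batch, target_batch, model, device,
                                  pad_token_id=50256):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)                      # (batch, seq_len, num_classes)
    b, t, num_classes = logits.shape

    # Replicate each example's label across all of its token positions ...
    targets = target_batch.unsqueeze(1).expand(b, t).clone()
    # ... but exclude padding positions from the loss
    targets[input_batch == pad_token_id] = -100      # cross_entropy's default ignore_index

    loss = F.cross_entropy(logits.reshape(-1, num_classes), targets.reshape(-1))
    return loss
```

A model trained this way lets you read the spam probability at the last real token (or any earlier position), so the prediction no longer hinges on the padding length.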
-
Thank you for the response and for moving the topic to Discussions.
Indeed. I wrote a simple function to test the model's performance on custom inputs that aren't part of the original dataset. I implicitly assumed that the model would generalize to arbitrary input sequences and perform equally well on non-padding tokens too.
That's exactly what I wanted to try too, but I didn't have time to write the code for it. I'll try it and see how much it changes the model's behavior.
-
Hello,
I was trying to play around with the classifier trained in Chapter 6 and found that it was always predicting "spam" for any message I gave it. When I looked into it, I found that the predicted class actually depends on the number of padding tokens added to the sequence. If such a model is going to be used in production, the same padding size must always be applied to the input. Two graphs (below) show the probability of a spam prediction next to the token up to which the sequence was considered.
You can see that the final token in the message has a probability close to 1, but it changes with each extra padding token added to the sequence. In the code, all sequences have the same length, so in order to get the desired probability we need the correct padding size; otherwise the predicted probability can lead to very poor accuracy.
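To make the effect concrete, here is a rough sketch of this kind of check (assuming the trained `model`, `tokenizer`, and `device` from the chapter; the example text and padding counts are arbitrary): pad the same message to different lengths and watch the predicted spam probability move.

```python
import torch

text = "Congratulations! You have won a free ticket."  # arbitrary example message
token_ids = tokenizer.encode(text)
pad_token_id = 50256

for n_pad in range(0, 101, 20):
    ids = torch.tensor([token_ids + [pad_token_id] * n_pad], device=device)
    with torch.no_grad():
        logits = model(ids)[:, -1, :]               # logits at the last position
    spam_prob = torch.softmax(logits, dim=-1)[0, 1].item()
    print(f"padding tokens: {n_pad:3d}  P(spam) = {spam_prob:.3f}")
```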
I'm not sure if this qualifies as a bug, since, in a way, the model is doing what it was asked to do, but at the same time the model's behavior is unintuitive. I made small adjustments to the model in order to get the desired behavior: the probability evolves with every token added, within a single inference step, and doesn't depend on the number of padding tokens (a rough sketch of this per-position view follows the links below).
Main changes: see the modified code.
Modified code (not everything from the old code works, but it runs fine): https://gist.github.com/itdxer/b30ceedb4ac0f3fd2b3e37fb54f71398
Original code: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/gpt_class_finetune.py
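A rough sketch of that per-position view (again assuming the chapter's `model`, `tokenizer`, and `device`; this is illustrative and not the code from the gist): one forward pass, then the spam probability after each token.

```python
import torch

text = "Congratulations! You have won a free ticket."  # arbitrary example message
token_ids = torch.tensor([tokenizer.encode(text)], device=device)

with torch.no_grad():
    logits = model(token_ids)                          # (1, seq_len, num_classes)

spam_probs = torch.softmax(logits, dim=-1)[0, :, 1]    # P(spam) after each token
for pos, p in enumerate(spam_probs.tolist()):
    print(f"after token {pos:3d}: P(spam) = {p:.3f}")
```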
Is there a better way to address this issue?