
ch06 - fine-tuning an LLM for a binary classification task - add vs. update output layer #192

Answered by rasbt
kevalshah90 asked this question in Q&A

Hi there,

Yes, you are right: there shouldn't be a second lm_head. My guess is that this happens because the base model has an unusual implementation in which two models are nested.

Instead of

```python
peft_model.base_model.lm_head = torch.nn.Linear(
```

you probably need to do

```python
peft_model.base_model.model.lm_head = torch.nn.Linear(...)
```
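To illustrate why the extra `.model` is needed, here is a minimal sketch using stub modules (`TinyLM`, `BaseModelWrapper`, `PeftModelStub` are hypothetical stand-ins, not real PEFT classes) that mimic the nested structure a PEFT wrapper produces, and then swap the language-modeling head for a 2-class output layer:

```python
import torch

class TinyLM(torch.nn.Module):
    # Stand-in for the underlying transformer that owns the lm_head.
    def __init__(self, hidden=8, vocab=100):
        super().__init__()
        self.lm_head = torch.nn.Linear(hidden, vocab)

class BaseModelWrapper(torch.nn.Module):
    # Stand-in for the PEFT adapter wrapper; it nests the real model
    # under a `.model` attribute, which causes the extra level of nesting.
    def __init__(self):
        super().__init__()
        self.model = TinyLM()

class PeftModelStub(torch.nn.Module):
    # Stand-in for the top-level peft_model object.
    def __init__(self):
        super().__init__()
        self.base_model = BaseModelWrapper()

peft_model = PeftModelStub()

# Read the hidden size from the existing head, then replace it with a
# binary-classification head (2 output units).
hidden = peft_model.base_model.model.lm_head.in_features
peft_model.base_model.model.lm_head = torch.nn.Linear(hidden, 2)

print(peft_model.base_model.model.lm_head.out_features)  # prints 2
```

The key point is that `peft_model.base_model.lm_head` does not exist on the wrapper itself; the head lives one level deeper, on `peft_model.base_model.model`.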

Answer selected by rasbt

This discussion was converted from issue #191 on June 02, 2024 22:27.