Drop in Dev NDCG@10 After Training on CodeSearchNet-CCR (Python only): Possible Issue with In-Batch Negatives #22

@gnatesan

I’m running into a weird issue when training on the CodeSearchNet-CCR dataset. I’m training on a single language (Python), so every batch contains only Python examples. The model is granite-embedding-125m-english, and I’m training it with MultipleNegativesRankingLoss.

Here’s what’s happening: my training and validation loss both decrease smoothly, but dev set NDCG@10 after training is worse than for the baseline (pretrained) model, and the drop is pretty big. Sometimes a single training step improves the score, but after a few more steps the dev score starts falling and keeps falling with further training, even as the loss keeps improving.
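To make the metric concrete, here is a minimal sketch of NDCG@10, assuming binary relevance with a single positive per query (which matches the one-to-one structure of this dataset). The function names are hypothetical and not taken from any evaluation library. Note that the score depends only on where the positive lands in the full ranking, which is not what the training loss measures.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_doc_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance (assumption: a doc is either relevant or not)."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_doc_ids]
    # The ideal ranking puts every relevant doc at the top.
    ideal = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: the positive "a" ranked first vs. ranked second.
top_ranked = ndcg_at_k(["a", "b", "c"], {"a"})
second_ranked = ndcg_at_k(["b", "a", "c"], {"a"})
```

With one positive per query, the score is 1 when the positive is ranked first and decays logarithmically with its rank, reaching 0 once it falls outside the top 10.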

Some extra info:

My batches are all Python, with no other languages mixed in, so the "negatives" in a batch are just other Python function pairs.

The dataset seems to be structured as one query and one positive document per pair, basically the two halves of a Python function. As far as I can tell, there are no hard negatives.

My guess right now is that, because the mapping is strictly one-to-one and there are no hard negatives, the in-batch negatives are just too easy. The model might be learning to tell unrelated code snippets apart without actually learning much about semantic matching, so dev set performance gets worse even though the loss looks good.
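The "too easy" hypothesis can be illustrated with a minimal pure-Python sketch of an in-batch-negatives loss: cross-entropy over scaled cosine similarities, where document i is the positive for query i and every other document in the batch is a negative. This is a simplified stand-in for what MultipleNegativesRankingLoss computes, not the library's code; the scale of 20 and the toy 2-D embeddings are assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_negatives_loss(query_embs, doc_embs, scale=20.0):
    """Mean cross-entropy where doc i is the positive for query i."""
    losses = []
    for i, query in enumerate(query_embs):
        logits = [scale * cosine(query, doc) for doc in doc_embs]
        # Numerically stable log-sum-exp over all docs in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Easy batch: the two pairs point in unrelated directions (hypothetical embeddings).
easy_loss = in_batch_negatives_loss(
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
)

# Hard batch: the two pairs are near-duplicates of each other (hypothetical embeddings).
hard_loss = in_batch_negatives_loss(
    [[1.0, 0.0], [0.9, 0.1]],
    [[1.0, 0.05], [0.9, 0.15]],
)
```

In the easy batch the loss is already close to zero, so there is little ranking signal left to learn from, while the hard batch still yields a substantial loss. If most batches look like the easy case, the loss can keep shrinking without improving the fine-grained ranking that NDCG@10 measures.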

Has anyone else run into this, or found a good way to improve training in this kind of setup? I’m open to advice on better loss functions or negative sampling, and would love to hear any best practices for CodeSearchNet-CCR.
