
Learning to Rank: Weights and Label difference normalization in pairwise full query ranking. #11424


Open
jaguerrerod opened this issue Apr 22, 2025 · 2 comments


@jaguerrerod

There are two distinct categories of use cases in Learning to Rank (LTR):

1. Ranking Relevant Items Within a Query

This is the standard scenario in information retrieval, such as search engine result ranking or some recommendation systems. Its main characteristics include:

  • Use of relevance-based metrics focused on top-ranked items, such as MAP or NDCG.
  • Position bias correction mechanisms.
  • Truncation of candidate pairs based on the most relevant items (according to the labels or predictions).
  • Other types of normalizations specific to this context.

2. Full Ranking of a Dataset

Another important and often overlooked use case is the complete ranking of all elements in a dataset. This can be framed as LTR with a single query (or several queries representing different periods in time series datasets). It is applicable to problems where the evaluation metric is, for instance, Spearman correlation, and even to binary classification problems with AUC as the metric.

The nature of this use case makes many LTR implementations unsuitable (for example, LightGBM does not support it well).

XGBoost, however, does support LTR through the rank:pairwise objective. Still, there are some impactful aspects that could be improved:

Weights

In LTR, weights are always considered at the query level.
But what happens in pairwise use cases where there is only one query, or when multiple queries exist but we want to assign instance-level weights?

Since the same weight parameter in the DMatrix constructor serves both cases, this behavior should be generalized. It should be possible to:

  • Provide weights of length equal to the number of queries (to be applied per group), or
  • Provide weights of length equal to the number of observations (to be applied per instance).

A consistent internal approach (aligned with other objectives) would be:

  • Always interpret weight as per-instance, and
  • If a per-query weight array is passed (with length equal to the number of queries), internally expand it into a vector matching the number of instances by repeating the group weight according to the group size.
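A minimal sketch of this expansion (a hypothetical helper, not an existing XGBoost internal), assuming group sizes are available from the query group information:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: expand per-query weights into per-instance weights.
// group_sizes[g] is the number of instances in query g; each group's weight
// is repeated group-size times so the result has one entry per instance.
std::vector<float> ExpandGroupWeights(std::vector<float> const& group_weights,
                                      std::vector<std::size_t> const& group_sizes) {
  std::vector<float> instance_weights;
  for (std::size_t g = 0; g < group_weights.size(); ++g) {
    instance_weights.insert(instance_weights.end(), group_sizes[g], group_weights[g]);
  }
  return instance_weights;
}
```

With this convention, a weight vector whose length equals the number of queries is expanded once up front, and the rest of the objective only ever sees per-instance weights.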

Label Difference Normalization

In full-dataset ranking scenarios, labels are often quasi-continuous or have high granularity.
(It is up to the user to discretize or bin the labels if needed.)

Pairs with similar labels are generally less informative than those with very different labels. Therefore, introducing a normalization based on label difference is a natural and useful idea.

Assuming labels are preprocessed to lie within the [0, 1] percentile scale, the following logic (from XGBoost source code):

https://github.com/dmlc/xgboost/blob/4e24639d7de3d8e0aae0ae0ab061c14f704c0c35/src/objective/lambdarank_obj.h#L123C3-L125C4

can be generalized as follows:

```cpp
if (norm_by_diff && best_score != worst_score) {
  if (param_.IsMean()) {
    // Proposed: scale the pair's contribution by its label difference.
    // Labels are assumed to be percentiles in [0, 1], so the factor
    // pow(|y_high - y_low|, label_diff_normalization) is also in [0, 1].
    delta_metric *= std::pow(std::abs(y_high - y_low), label_diff_normalization);
  } else {
    // Current behavior from the linked source: normalize by score difference.
    delta_metric /= (delta_score + 0.01);
  }
}
```

Here label_diff_normalization is a user-defined parameter with a default value of 0.

Since y_high and y_low are percentiles, their absolute difference is bounded in [0, 1].

  • When label_diff_normalization == 0, delta_metric remains unchanged.
  • As label_diff_normalization increases, delta_metric decreases, effectively penalizing pairs with similar labels.
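To make the effect concrete, here is a small standalone illustration (not XGBoost code) tabulating the proposed factor pow(|y_high - y_low|, p) for percentile labels and a few exponents p:

```cpp
#include <cmath>
#include <cstdio>

// Illustration of the proposed scaling factor pow(|y_high - y_low|, p).
// With p == 0 the factor is always 1 (current behavior is preserved);
// larger p shrinks the contribution of pairs with similar labels faster.
int main() {
  double const diffs[] = {0.05, 0.25, 0.50, 0.95};  // |y_high - y_low|
  double const exponents[] = {0.0, 0.5, 1.0, 2.0};  // label_diff_normalization
  for (double p : exponents) {
    for (double d : diffs) {
      std::printf("p = %.1f  |y_high - y_low| = %.2f  factor = %.4f\n",
                  p, d, std::pow(d, p));
    }
  }
  return 0;
}
```

For example, with p = 1 a pair whose labels differ by 0.05 keeps only 5% of its unscaled delta_metric, while a pair differing by 0.95 keeps 95% of it.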
@trivialfis
Member

Thank you for raising the issue.

> In full-dataset ranking scenarios, labels are often quasi-continuous or have high granularity.

May I ask what kind of problem you are trying to solve or what type of data you are trying to model? Just asking out of personal curiosity.

@jaguerrerod
Author

Thanks for your answer.

The data I work with is a time series where it is important to sort all records within certain temporal segments. The response is continuous with some degree of discretization, and the metric used to evaluate the model's fit is Spearman correlation.
In this case, since the response is continuous, it is useful to weight the pairs, giving more importance when the response difference within a pair is large and less when it is small.
To use the proposed formulation, I transform the response into percentiles within each query so that it falls within the [0, 1] range.
In my use case, some elements are more relevant than others, so I apply weights and use the weighted version of Spearman correlation.
Currently, since XGBoost only supports assigning weights at the query level, not at the observation level, I have to simulate weights by repeating certain observations 2, 3, or even 10 times so that the model accounts for an approximation of the intended weight. This is inefficient in both training time and GPU memory usage.
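For reference, a minimal sketch of the two preprocessing steps described above (hypothetical helpers, not XGBoost APIs): the per-query percentile transform, and the row-repetition workaround that approximates per-instance weights:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Rank-transform one query's responses into (0, 1] percentiles
// (ties are broken arbitrarily in this sketch).
std::vector<double> ToPercentiles(std::vector<double> const& y) {
  std::vector<std::size_t> order(y.size());
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(),
            [&](std::size_t a, std::size_t b) { return y[a] < y[b]; });
  std::vector<double> pct(y.size());
  for (std::size_t rank = 0; rank < order.size(); ++rank) {
    pct[order[rank]] = static_cast<double>(rank + 1) / order.size();
  }
  return pct;
}

// Approximate per-instance weights by repeating each row round(w) times;
// the returned indices materialize the duplicated training set. This is
// the inefficiency described above: memory and time grow with the weights.
std::vector<std::size_t> RepeatIndices(std::vector<double> const& weights) {
  std::vector<std::size_t> idx;
  for (std::size_t i = 0; i < weights.size(); ++i) {
    auto copies = static_cast<std::size_t>(std::lround(weights[i]));
    for (std::size_t c = 0; c < copies; ++c) idx.push_back(i);
  }
  return idx;
}
```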
