Description
There are two distinct categories of use cases in Learning to Rank (LTR):
1. Ranking Relevant Items Within a Query
This is the standard scenario in information retrieval, such as search engine result ranking or some recommendation systems. Its main characteristics include:
- Use of relevance-based metrics focused on top-ranked items, such as MAP or NDCG.
- Position bias correction mechanisms.
- Truncation of candidate pairs based on the most relevant items (according to the labels or predictions).
- Other types of normalizations specific to this context.
2. Full Ranking of a Dataset
Another important and often overlooked use case is the complete ranking of all elements in a dataset. This can be framed as LTR with a single query (or with several queries representing different periods in time-series datasets), and it applies to problems whose evaluation metric is, for instance, Spearman correlation, and even to binary classification problems evaluated with AUC.
The nature of this use case makes many LTR implementations unsuitable (for example, LightGBM does not support it well).
XGBoost, however, does support LTR through the `rank:pairwise` objective.
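For illustration, here is a minimal sketch of this framing using the Python API and synthetic data (not from the original report; assumes `numpy`, `scipy`, and `xgboost` are installed, and exact behavior may vary by XGBoost version):

```python
import numpy as np
import xgboost as xgb
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
# Quasi-continuous labels, e.g. already mapped to the [0, 1] percentile scale.
y = rng.uniform(size=1000)

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group([X.shape[0]])  # a single query spanning the whole dataset

booster = xgb.train({"objective": "rank:pairwise"}, dtrain, num_boost_round=50)

# Evaluate the quality of the full ranking with Spearman correlation.
preds = booster.predict(dtrain)
rho, _ = spearmanr(y, preds)
print(f"Spearman correlation: {rho:.3f}")
```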
Still, there are some impactful aspects that could be improved:
Weights
In LTR, weights are always considered at the query level.
But what happens in pairwise use cases where there is only one query, or when multiple queries exist but we want to assign instance-level weights?
Since the `weight` parameter in the `DMatrix` constructor is the same one used by other objectives, this behavior should be generalized. It should be possible to:
- Provide weights of length equal to the number of queries (to be applied per group), or
- Provide weights of length equal to the number of observations (to be applied per instance).
A consistent internal approach (aligned with other objectives) would be:
- Always interpret `weight` as per-instance, and
- If a per-query weight array is passed (with length equal to the number of queries), internally expand it into a vector matching the number of instances by repeating each group's weight according to its group size.
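A minimal sketch of that expansion in Python (a hypothetical helper for illustration, not XGBoost's actual internals):

```python
import numpy as np

def expand_weights(weight, group_sizes, n_instances):
    """Interpret `weight` as per-instance; if it is per-query,
    repeat each group's weight according to its group size."""
    weight = np.asarray(weight, dtype=float)
    group_sizes = np.asarray(group_sizes, dtype=int)
    if weight.size == n_instances:       # already per-instance
        return weight
    if weight.size == group_sizes.size:  # per-query -> per-instance
        return np.repeat(weight, group_sizes)
    raise ValueError("weight must match either the number of queries or the number of instances")

# e.g. two queries of sizes 3 and 2 with per-query weights [0.5, 2.0]
print(expand_weights([0.5, 2.0], [3, 2], n_instances=5))
# -> [0.5 0.5 0.5 2.  2. ]
```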
Label Difference Normalization
In full-dataset ranking scenarios, labels are often quasi-continuous or have high granularity.
(It is up to the user to discretize or bin the labels if needed.)
Pairs with similar labels are generally less informative than those with very different labels. Therefore, introducing a normalization based on label difference is a natural and useful idea.
Assuming labels are preprocessed to lie within the [0, 1] percentile scale, the existing pairwise normalization logic in the XGBoost source code can be generalized as:
```cpp
if (norm_by_diff && best_score != worst_score) {
  if (param_.IsMean()) {
    // Scale the pair's contribution by how far apart its labels are
    // (labels assumed to be on the [0, 1] percentile scale).
    delta_metric *= std::pow(std::abs(y_high - y_low), label_diff_normalization);
  } else {
    // Existing behavior: normalize by the predicted score difference.
    delta_metric /= (delta_score + 0.01);
  }
}
```
Where `label_diff_normalization` is a user-defined parameter with a default value of 0.
Since `y_high` and `y_low` are percentiles, their absolute difference is bounded in [0, 1].
- When `label_diff_normalization == 0`, `delta_metric` remains unchanged.
- As `label_diff_normalization` increases, `delta_metric` decreases, effectively penalizing pairs with similar labels.
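A small numeric sketch of the proposed scaling factor `|y_high - y_low| ** label_diff_normalization` (percentile-scaled labels assumed; the exponent values are arbitrary examples):

```python
# Effect of the proposed factor for pairs with similar vs. distant labels.
for diff in (0.1, 0.5, 0.9):   # |y_high - y_low| on the [0, 1] percentile scale
    for p in (0, 1, 2):        # candidate values of label_diff_normalization
        print(f"|y_high - y_low| = {diff:.1f}, exponent = {p}: factor = {diff ** p:.2f}")
# p = 0 leaves delta_metric unchanged (factor 1.00); larger exponents shrink it
# far more for similar labels (0.1 -> 0.01) than for distant ones (0.9 -> 0.81).
```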