Questions on Key Derivations in "Rethinking Reward Modeling in Preference-Based Large Language Model Alignment" #3

Description

@luodi-7

While studying your paper "Rethinking Reward Modeling in Preference-Based Large Language Model Alignment", I ran into a few points of confusion about the theoretical derivations and hope you can help clarify them.

1. Inequality Direction in the Expectation of Order Consistency

We have the bounds:
$p_{\text{correct}} \geq (1-\epsilon) \cdot \xi(\Delta r)$
$p_{\text{incorrect}} \leq \epsilon \cdot (1-\xi(\Delta r))$

The order consistency of the learned model with the oracle utility is given by:
$\mathbb{E}_{x, y_1, y_2 \sim \ell(x)}\left[\mathbb{1}\left(\hat{H}\left(r(y_1,x) - r(y_2,x)\right) \geq 0\right) \mid \Delta r\right] = p_{\text{correct}} \cdot p_{\text{annotator}} + p_{\text{incorrect}} \cdot (1-p_{\text{annotator}})$

Finally, we arrive at the conclusion:
$\mathbb{E}_{x, y_1, y_2 \sim \ell(x)}\left[\mathbb{1}\left(\hat{H}\left(r(y_1,x) - r(y_2,x)\right) \geq 0\right) \mid \Delta r\right] \geq (1-\epsilon) \cdot \xi^2(\Delta r) + \epsilon \cdot (1-\xi(\Delta r))^2$

I am confused about how this lower bound is obtained. Since $p_{\text{correct}}$ only has a lower bound (≥) while $p_{\text{incorrect}}$ only has an upper bound (≤), could you explain in detail why combining them in the expectation formula still yields a lower bound (≥) on the final order consistency? The substitution I am attempting is written out below.
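
For concreteness, here is how far I can get, under my (possibly wrong) assumption that $p_{\text{annotator}} = \xi(\Delta r)$:

$p_{\text{correct}} \cdot p_{\text{annotator}} + p_{\text{incorrect}} \cdot (1-p_{\text{annotator}}) \geq (1-\epsilon) \cdot \xi^2(\Delta r) + p_{\text{incorrect}} \cdot (1-\xi(\Delta r))$

The first term matches the stated result, but for the second term I only know $p_{\text{incorrect}} \leq \epsilon \cdot (1-\xi(\Delta r))$, which bounds it from above by $\epsilon \cdot (1-\xi(\Delta r))^2$ rather than from below. This is the step where I lose the argument.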

2. Interpretation of $\epsilon$ and the "Incorrect Case" Description

In the "Incorrect Case", the statement "When the annotator is incorrect, the learned model agrees with the annotator with probability at most $\epsilon$" is perplexing. I initially thought $\epsilon$ represents the maximum probability that the learned model $\hat{H}$ disagrees with human annotations, but this description suggests a different meaning. Could you clarify the precise definition of $\epsilon$? Is it the probability of model-oracle disagreement, or does it pertain to model-annotator agreement in the context of incorrect annotations?

3. Redundancy Concern in the Formula for $p_{incorrect}$ and Subsequent Multiplications

For the formula $p_{\text{incorrect}} \leq \epsilon \cdot (1 - \xi(\Delta r))$:

  • First, the case where the annotator is incorrect already carries probability $1 - \xi(\Delta r)$, and, given the ambiguity around $\epsilon$ (whether it refers to model-oracle or model-annotator disagreement), the product $\epsilon \cdot (1 - \xi(\Delta r))$ is said to represent "human annotation error + model-human inconsistency = consistency with the true answer". However, in the final expectation formula $p_{\text{incorrect}}$ is multiplied by $(1 - p_{\text{annotator}})$ again. Since $p_{\text{annotator}}$ encodes annotator correctness ($\xi(\Delta r)$ for correct, $1 - \xi(\Delta r)$ for incorrect), it looks to me as if the incorrect-annotation case is being counted twice.
  • Additionally, the subsequent derivation produces the squared term $(1 - \xi(\Delta r))^2$ in $(1 - \epsilon) \cdot \xi^2(\Delta r) + \epsilon \cdot (1 - \xi(\Delta r))^2$. Given the confusion above, is this squared term actually necessary, or am I misreading how these probabilities interact? (A small symbolic check of my reading follows after this list.)
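
Here is the symbolic check mentioned above, again under my assumptions that $p_{\text{annotator}} = \xi(\Delta r)$ and that both bounds are attained with equality; it simply shows where the squared terms come from in my reading, which is exactly the repetition I am asking about:

```python
# Symbolic sketch of my reading (my assumptions, not necessarily the paper's):
# p_annotator = xi(Delta r), and both bounds are taken with equality.
import sympy as sp

eps, xi = sp.symbols("epsilon xi", positive=True)

p_correct = (1 - eps) * xi      # assumed: lower bound attained with equality
p_incorrect = eps * (1 - xi)    # assumed: upper bound attained with equality
p_annotator = xi                # assumed: annotator correct w.p. xi(Delta r)

expectation = p_correct * p_annotator + p_incorrect * (1 - p_annotator)
claimed = (1 - eps) * xi**2 + eps * (1 - xi)**2

print(sp.simplify(expectation - claimed))  # prints 0
# The (1 - xi)^2 term appears only because (1 - xi) enters once inside
# p_incorrect and once more via (1 - p_annotator) -- the repetition that
# looks like double counting to me.
```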

Thank you for your time and attention to these questions. Your explanations would be a great help in understanding this work more deeply.
