Problems with the new judge_score implementation

- Judge score uses ``StreamedSyntheticPartialDataset``, which cuts off the messages randomly, so the last 'role' of the conversation to generate from can be both ``user`` and ``assistant``:
```
allowed_len = min(len(messages), self.max_messages)
if random.random() < self._cut_message_chain_early:
    # Choose a random cutoff between at least half of allowed_len and allowed_len
    min_cut = max(1, allowed_len // 2)
    cutoff = random.randint(min_cut, allowed_len)
else:
    cutoff = allowed_len
truncated_messages = messages[:cutoff]
```
- The ``original_conversation`` is created by appending ``truncated_messages`` with the LAST assistant response, not the next assistant response, so the response may be irrelevant to the user message. This doesn't make any sense. For example, if the current user message is "How are you today?", what's the point of generating a better answer than "Alright, see you later!" (the last assistant message)?

Have you guys considered these problems?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Problems with the new judge_score implementation #159

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Problems with the new judge_score implementation #159

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions