Based on our curation efforts, we spotted a bug in the overall_score of the UltraFeedback AI critique scores. TL;DR: responses getting the lowest score (1 or less) end up with a high score (10, or 8.0, or 7.5, who knows!). Our initial work with Notus shows that by using something other than the overall score, we can train a better model.
In this task, we want to thoroughly clean up the original dataset to make sure others build on an error-free dataset. I have myself curated a few hundred examples (sorting by chosen score = 10), and most of the responses getting a 10 are totally useless according to the rationale (the natural-language explanation).
The objective is as follows:
- Using this dataset, take the `best_overall_score_response` column, get the `critique` text, and run it through a very simple sentiment analysis (I suggest starting with TextBlob's because it's really fast, and the rationales are very expressive when the response is really bad).
- Add this sentiment score to the dataset as a new column, `best_overall_score_response_critique_sentiment`.
- Based on this new dataset, let's try to find the examples that get a high overall_score but a bad sentiment.
- Iterate as much as we can to really narrow down those problematic cases. I'd strongly suggest using the Argilla UI with sorting and filters to quickly adjust.
- Once we know the problematic cases, we have several choices; the best I can think of is to reduce their overall_score (dividing by 10 :-) ) in the completions object.
- Now that we have a clean dataset, we can use it to experiment further (compare rating vs. critique, etc.) and, most importantly, share it with the community so people build on a clean version!
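As a rough sketch of the steps above (the column names and the divide-by-10 fix come from this issue; the toy rows, the thresholds, and the `stand_in_sentiment` lexicon are illustrative assumptions — in practice you'd load the real dataset and pass `textblob_sentiment`):

```python
def textblob_sentiment(text):
    """Polarity in [-1, 1] via TextBlob, as suggested above (needs `pip install textblob`)."""
    from textblob import TextBlob  # lazy import so the rest of the sketch runs without it
    return TextBlob(text).sentiment.polarity

def stand_in_sentiment(text):
    """Tiny stand-in lexicon scorer, used here only so the sketch runs as-is."""
    negative = {"useless", "irrelevant", "incorrect", "bad"}
    positive = {"helpful", "accurate", "clear", "good"}
    words = [w.strip(".,") for w in text.lower().split()]
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return max(-1.0, min(1.0, score / 3))

def add_sentiment_column(rows, sentiment_fn):
    # Step 1-2: score each critique and store it in the new column
    for row in rows:
        row["best_overall_score_response_critique_sentiment"] = sentiment_fn(row["critique"])
    return rows

def flag_suspicious(rows, score_floor=8.0, sentiment_ceiling=0.0):
    # Step 3: high overall_score but negative critique sentiment -> likely mis-scored
    return [r for r in rows
            if r["overall_score"] >= score_floor
            and r["best_overall_score_response_critique_sentiment"] < sentiment_ceiling]

def downscale(rows):
    # Proposed fix: divide the bogus overall_score by 10
    for r in rows:
        r["overall_score"] /= 10
    return rows

# Toy rows standing in for the real dataset
rows = [
    {"critique": "The answer is helpful, accurate and clear.", "overall_score": 9.0},
    {"critique": "The answer is useless and irrelevant to the instruction.", "overall_score": 10.0},
]
add_sentiment_column(rows, stand_in_sentiment)
bad = flag_suspicious(rows)
downscale(bad)
```

The thresholds here are guesses; the Argilla sort/filter pass described above is exactly how you'd tune `score_floor` and `sentiment_ceiling` before committing to a fix.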
More details about the initial analysis are in the dataset README.
Please keep us posted as you start and iterate!