Based on our curation efforts, we spotted a bug in the overall_score of the UltraFeedback AI critique scores. TL;DR: responses getting the lowest score (1 or less) end up with a high score (10, or 8.0, or 7.5, who knows!). Our initial work with Notus shows that by using something other than the overall score, we can train a better model.
In this task, we want to thoroughly clean up the original dataset to make sure others build on an error-free dataset. I have myself curated a few hundred examples (sorting by chosen score = 10), and most of the responses getting a 10 are totally useless according to the rationale (the natural-language explanation).
The objective is as follows:
- Using this dataset, take the `best_overall_score_response` column, get the `critique` text, and run it through a very simple sentiment analysis (I suggest starting with TextBlob's because it's really fast, and the rationales are very expressive when the response is really bad).
- Add this sentiment score to the dataset as a new column, `best_overall_score_response_critique_sentiment`.
- Based on this new dataset, let's try to find the examples that get a high overall_score but a bad sentiment.
- Iterate as much as we can to really narrow down those problematic cases. I'd strongly suggest using the Argilla UI with sorting and filters to quickly adjust.
- Once we know the problematic cases, we have several choices; the best I can think of is to reduce their overall_score (dividing by 10 :-) ) in the completions object.
- Now that we have a clean dataset, we can use it to experiment further (compare rating vs. critique, etc.) and, most importantly, share it with the community so people build on a clean version!
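As a rough sketch of the steps above (the column names and the divide-by-10 fix come from this issue; the toy rows, the thresholds, and the `stand_in_sentiment` lexicon are illustrative assumptions — in practice you'd load the real dataset and pass `textblob_sentiment`):

```python
def textblob_sentiment(text):
    """Polarity in [-1, 1] via TextBlob, as suggested above (needs `pip install textblob`)."""
    from textblob import TextBlob  # lazy import so the rest of the sketch runs without it
    return TextBlob(text).sentiment.polarity

def stand_in_sentiment(text):
    """Tiny stand-in lexicon scorer, used here only so the sketch runs as-is."""
    negative = {"useless", "irrelevant", "incorrect", "bad"}
    positive = {"helpful", "accurate", "clear", "good"}
    words = [w.strip(".,") for w in text.lower().split()]
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return max(-1.0, min(1.0, score / 3))

def add_sentiment_column(rows, sentiment_fn):
    # Step 1-2: score each critique and store it in the new column
    for row in rows:
        row["best_overall_score_response_critique_sentiment"] = sentiment_fn(row["critique"])
    return rows

def flag_suspicious(rows, score_floor=8.0, sentiment_ceiling=0.0):
    # Step 3: high overall_score but negative critique sentiment -> likely mis-scored
    return [r for r in rows
            if r["overall_score"] >= score_floor
            and r["best_overall_score_response_critique_sentiment"] < sentiment_ceiling]

def downscale(rows):
    # Proposed fix: divide the bogus overall_score by 10
    for r in rows:
        r["overall_score"] /= 10
    return rows

# Toy rows standing in for the real dataset
rows = [
    {"critique": "The answer is helpful, accurate and clear.", "overall_score": 9.0},
    {"critique": "The answer is useless and irrelevant to the instruction.", "overall_score": 10.0},
]
add_sentiment_column(rows, stand_in_sentiment)
bad = flag_suspicious(rows)
downscale(bad)
```

The thresholds here are guesses; the Argilla sort/filter pass described above is exactly how you'd tune `score_floor` and `sentiment_ceiling` before committing to a fix.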
More details about the initial analysis are in the dataset README.
Please keep us posted as you start and iterate!