Skip to content

Add dataset creation from traces #663

@rachwalk

Description

@rachwalk

Is your feature request related to a problem? Please describe.
The data gathered through the use of rai_benchmark presents itself as a great opportunity to create datasets, which can be used for fine-tuning of smaller L/VLMs. Ability to create datasets automatically after a benchmark is run would be very helpful.

Describe the solution you'd like
A module could be created allowing for automatic creation of a dataset from a langfuse traceback after a run of rai_benchmark -- (however the module should not be tightly integrated with rai_benchmark itself -- in the future it might be beneficial to extend this functionality to non-benchmark deployments, allowing for online and/or imitation learning).

Describe alternatives you've considered
TBD

Additional context
Initial research has pointed to https://docs.unsloth.ai/ as a possible target framework to conduct fine tuning. Additional information on dataset requirements may be found there.

Note: Different models might require/benefit from different prompt structure for fine tuning. The dataset shouldn't necessarily take care of that, it might be better to leave it to pre-processing pipeline during fine tuning (once that's developed). However the dataset should possibly include all relevant metadata, which will allow for finetuning possibly largest range of models.

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requestgood first issueGood for newcomershelp wantedExtra attention is neededpriority/minorLower-priority tasks that can be picked up when time allows or planned for later.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions