RougeSimilarityFilter Should Consider Input Field for Filtering #1625
Replies: 1 comment
I ultimately adjusted the self_instruct.py and templates.py code locally to meet my needs.

First, I converted seed_tasks.jsonl to a non-standard format, which let me generate a batch of new instructions without modifying any code. Each record changed from

```json
{
  "instruction": "Extract time information from the following text.",
  "input": "Did you go to the party last Friday? It was March 7th.",
  "output": "[\"last Friday\", \"March 7th\"]"
}
```

to

```json
{
  "system": "Extract time information from the following text.",
  "instruction": "Did you go to the party last Friday? It was March 7th.",
  "input": "",
  "output": "[\"last Friday\", \"March 7th\"]"
}
```

Then, to make sure the generated outputs followed the expected format, I added content to the input_first_template_for_gen template so that a portion of the sampled data can be used as a prompt: https://github.com/camel-ai/camel/blob/master/camel/datagen/self_instruct/templates.py#L271
I then passed a sampled_human_tasks string, composed of system + instruction, into the input_first_template_for_gen template: https://github.com/camel-ai/camel/blob/master/camel/datagen/self_instruct/self_instruct.py#L226

```python
# Sample human-written seed tasks and render them as few-shot examples.
sampled_human_tasks = self.sample_human_tasks(
    self.human_to_machine_ratio[0]
)
sampled_human_tasks_str = "\n".join(
    f"Task: {task['instruction']}\n Output: {task['output']}\n\n "
    for task in sampled_human_tasks
)
if classification:
    prompt = SelfInstructTemplates.output_first_template_for_clf.format(
        instruction=instruction
    )
else:
    # sampled_human_tasks fills the placeholder I added to the template locally.
    prompt = SelfInstructTemplates.input_first_template_for_gen.format(
        instruction=instruction,
        sampled_human_tasks=sampled_human_tasks_str,
    )
```

These adjustments solved my problem. Thanks so much! And the data generation function is great!
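For anyone who wants to reproduce the seed-task conversion described in this reply, it can be sketched as a small script. This is only an illustration under stated assumptions: the helper names `to_system_format` and `convert_jsonl` are hypothetical and are not part of CAMEL, and the field mapping simply mirrors the before/after records shown in the reply.

```python
import json


def to_system_format(task: dict) -> dict:
    """Move the fixed instruction into "system" and the old input into "instruction".

    Hypothetical helper: the mapping mirrors the before/after records above.
    """
    return {
        "system": task["instruction"],
        "instruction": task.get("input", ""),
        "input": "",
        "output": task.get("output", ""),
    }


def convert_jsonl(lines):
    """Convert an iterable of JSONL lines, skipping blank lines."""
    return [
        json.dumps(to_system_format(json.loads(line)), ensure_ascii=False)
        for line in lines
        if line.strip()
    ]


# Example with the seed task from this thread.
seed_line = json.dumps(
    {
        "instruction": "Extract time information from the following text.",
        "input": "Did you go to the party last Friday? It was March 7th.",
        "output": '["last Friday", "March 7th"]',
    }
)
converted = convert_jsonl([seed_line])
print(converted[0])
```

To apply it to a file, the lines of seed_tasks.jsonl can be passed to `convert_jsonl` and the returned strings written back out one per line.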
I'm encountering an issue while using the SelfInstructPipeline for data generation. My goal is to keep the instruction field constant while generating diverse input and output pairs. However, it appears that the RougeSimilarityFilter is only being applied to the values within the instruction field. This prevents me from generating new data and results in repetitive outputs.
Here's an example of my data samples:
I understand that disabling the RougeSimilarityFilter would bypass this issue. However, I would prefer for it to also consider the input field during the filtering process. This would allow for more varied data generation while still maintaining control over similarity.
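To make the request concrete, here is a rough, standalone sketch of a similarity check that scores the concatenation of instruction and input instead of the instruction alone. It is not CAMEL's RougeSimilarityFilter and does not use its API: the function names are hypothetical, the score is a simple whitespace-token ROUGE-L F1 computed from the longest common subsequence, and the 0.7 threshold is an assumed value.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Token-level ROUGE-L F1 using lowercase whitespace tokenization."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def task_text(task: dict) -> str:
    """Text compared by the filter: instruction plus input (the assumed change)."""
    return f"{task.get('instruction', '')} {task.get('input', '')}".strip()


def is_novel(candidate: dict, existing: list, threshold: float = 0.7) -> bool:
    """Keep a candidate only if it stays below the threshold against every existing task."""
    text = task_text(candidate)
    return all(rouge_l_f1(text, task_text(t)) < threshold for t in existing)


instruction = "Extract time information from the following text."
existing = [
    {
        "instruction": instruction,
        "input": "Did you go to the party last Friday? It was March 7th.",
    }
]
candidate = {
    "instruction": instruction,
    "input": "The meeting is scheduled for 3 pm tomorrow at the office.",
}
print(is_novel(candidate, existing))
```

With instruction-only comparison, the candidate above would match the existing task exactly and be rejected; comparing instruction plus input lets a different input lower the score while exact duplicates are still filtered out.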
Furthermore, I'd like to suggest the addition of a system field to define roles or contexts. For example:
I noticed that Alpaca does not define a system field (https://github.com/tatsu-lab/stanford_alpaca). Could you provide any insights or suggestions on how to address my issue with the RougeSimilarityFilter, and on whether adding a system field is a feasible or recommended approach?