RougeSimilarityFilter Should Consider Input Field for Filtering #1625
Replies: 1 comment
I ultimately adjusted the self_instruct.py and templates.py code locally to meet my needs.

First, I converted seed_tasks.jsonl to a non-standard format, which let me generate a batch of new instructions without modifying any code. Each record changed from

    {
        "instruction": "Extract time information from the following text.",
        "input": "Did you go to the party last Friday? It was March 7th.",
        "output": "[\"last Friday\", \"March 7th\"]"
    }

to

    {
        "system": "Extract time information from the following text.",
        "instruction": "Did you go to the party last Friday? It was March 7th.",
        "input": "",
        "output": "[\"last Friday\", \"March 7th\"]"
    }

Then, to ensure the expected output format was generated, I added content to the input_first_template_for_gen template so that a portion of the sampled data can be used as a prompt:

https://github.com/camel-ai/camel/blob/master/camel/datagen/self_instruct/templates.py#L271

I then passed a sampled_human_tasks string composed of system + instruction into the input_first_template_for_gen template:

https://github.com/camel-ai/camel/blob/master/camel/datagen/self_instruct/self_instruct.py#L226

    sampled_human_tasks = self.sample_human_tasks(
        self.human_to_machine_ratio[0])
    sampled_human_tasks_str = "\n".join(
        f"Task: {task['instruction']}\n Output: {task['output']}\n\n "
        for task in sampled_human_tasks
    )
    if classification:
        prompt = (
            SelfInstructTemplates.output_first_template_for_clf.format(
                instruction=instruction
            )
        )
    else:
        prompt = SelfInstructTemplates.input_first_template_for_gen.format(
            instruction=instruction,
            sampled_human_tasks=sampled_human_tasks_str,
        )

These adjustments solved my problem. Thanks so much! And the data generation function is great!
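For anyone who wants to reproduce the seed-file conversion described above, here is a minimal sketch. It assumes each line of the seed file is one JSON object with instruction, input, and output keys. The function names to_system_format and convert_jsonl are hypothetical helpers written for this post and are not part of CAMEL.

```python
import json


def to_system_format(record: dict) -> dict:
    """Move the fixed instruction into `system` and the input text into `instruction`."""
    return {
        "system": record["instruction"],
        "instruction": record["input"],
        "input": "",
        "output": record["output"],
    }


def convert_jsonl(src_path: str, dst_path: str) -> int:
    """Convert every non-empty line of a JSONL file; return the number of records written."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            converted = to_system_format(json.loads(line))
            dst.write(json.dumps(converted, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling convert_jsonl("seed_tasks.jsonl", "seed_tasks_converted.jsonl") would write the converted records to a new file and leave the original untouched; both file names are placeholders.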
I'm encountering an issue while using the SelfInstructPipeline for data generation. My goal is to keep the instruction field constant while generating diverse input and output pairs. However, it appears that the RougeSimilarityFilter is only being applied to the values within the instruction field. This prevents me from generating new data and results in repetitive outputs.
Here's an example of my data samples:
    { "instruction": "Extract time information from the following text", "input": "Did you go to the party last Friday? It was March 7th.", "output": "[\"last Friday\", \"March 7th\"]" }
    { "instruction": "Extract time information from the following text", "input": "I remember last winter, when we went skiing. It was December 25th.", "output": "[\"last winter\", \"December 25th\"]" }
    { "instruction": "Extract time information from the following text", "input": "We plan to hold a meeting on the 15th of next month.", "output": "[\"next month\", \"15th\"]" }
    { "instruction": "Extract time information from the following text", "input": "Yesterday was my birthday, which was August 1st.", "output": "[\"yesterday\", \"August 1st\"]" }

I understand that disabling the RougeSimilarityFilter would bypass this issue. However, I would prefer for it to also consider the input field during the filtering process. This would allow for more varied data generation while still maintaining control over similarity.
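To make the request concrete, here is a rough sketch of the behavior I have in mind. It does not use CAMEL's RougeSimilarityFilter or any of its internals: rouge_l_f1 is a small self-contained ROUGE-L (longest common subsequence) implementation over whitespace tokens, and combined_text, is_novel, and the 0.7 threshold are all assumptions made up for illustration.

```python
from typing import Dict, List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercase whitespace tokens (a simplified tokenizer)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def combined_text(task: Dict[str, str]) -> str:
    """Compare on instruction + input instead of instruction alone."""
    return f"{task.get('instruction', '')} {task.get('input', '')}".strip()


def is_novel(candidate: Dict[str, str], existing: List[Dict[str, str]],
             threshold: float = 0.7) -> bool:
    """Keep a candidate only if it is not too similar to any existing task."""
    cand_text = combined_text(candidate)
    return all(
        rouge_l_f1(cand_text, combined_text(task)) <= threshold
        for task in existing
    )
```

With instruction-only comparison, every sample above scores 1.0 against every other sample because the instruction text is identical, so all new candidates are rejected. Comparing the combined text lets samples with different input values through while still rejecting exact duplicates.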
Furthermore, I'd like to suggest the addition of a system field to define roles or contexts. For example:
    { "system": "Extract time information from the following text", "instruction": "Did you go to the party last Friday? It was March 7th.", "input": "", "output": "[\"last Friday\", \"March 7th\"]" }

I noticed that Alpaca does not define a system field (https://github.com/tatsu-lab/stanford_alpaca). Could you provide any insights or suggestions on how to address my issue with the RougeSimilarityFilter, and on whether adding a system field is a feasible or recommended approach?