RougeSimilarityFilter Should Consider Input Field for Filtering #1625

coolbeevip · 2025-02-19T09:31:55Z

coolbeevip
Feb 19, 2025
Collaborator

I'm encountering an issue while using the SelfInstructPipeline for data generation. My goal is to keep the instruction field constant while generating diverse input and output pairs. However, it appears that the RougeSimilarityFilter is applied only to the value of the instruction field. Because that value is the same in every sample, newly generated candidates are treated as too similar to the existing ones, which prevents me from generating new data and results in repetitive outputs.

Here's an example of my data samples:

{
    "instruction": "Extract time information from the following text",
    "input": "Did you go to the party last Friday? It was March 7th.",
    "output": "[\"last Friday\", \"March 7th\"]"
}
{
    "instruction": "Extract time information from the following text",
    "input": "I remember last winter, when we went skiing. It was December 25th.",
    "output": "[\"last winter\", \"December 25th\"]"
}
{
    "instruction": "Extract time information from the following text",
    "input": "We plan to hold a meeting on the 15th of next month.",
    "output": "[\"next month\", \"15th\"]"
}
{
    "instruction": "Extract time information from the following text",
    "input": "Yesterday was my birthday, which was August 1st.",
    "output": "[\"yesterday\", \"August 1st\"]"
}

I understand that disabling the RougeSimilarityFilter would bypass this issue. However, I would prefer for it to also consider the input field during the filtering process. This would allow for more varied data generation while still maintaining control over similarity.
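To make the request concrete, here is a minimal sketch of a similarity check that compares the instruction and input together. It is not CAMEL's RougeSimilarityFilter: the function names (`rouge_l_f1`, `comparison_text`, `is_novel`), the whitespace tokenization, and the 0.7 threshold are all assumptions made for illustration, and ROUGE-L is computed here with a plain longest-common-subsequence routine rather than an external library.

```python
# Hypothetical sketch, not CAMEL's RougeSimilarityFilter.
# All names and the threshold value below are illustrative assumptions.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over lowercased, whitespace-split tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def comparison_text(task):
    """Join instruction and input so both fields contribute to the score."""
    return f"{task['instruction']} {task.get('input', '')}".strip()


def is_novel(new_task, existing_tasks, threshold=0.7):
    """Keep a task only if it is not too similar to any existing task."""
    new_text = comparison_text(new_task)
    return all(
        rouge_l_f1(new_text, comparison_text(task)) <= threshold
        for task in existing_tasks
    )
```

With the samples above, comparing instructions alone gives a score of 1.0 for every pair, so everything is rejected. Comparing instruction plus input lets samples with different inputs pass while exact duplicates are still filtered.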

Furthermore, I'd like to suggest the addition of a system field to define roles or contexts. For example:

{
    "system": "Extract time information from the following text",
    "instruction": "Did you go to the party last Friday? It was March 7th.",
    "input": "",
    "output": "[\"last Friday\", \"March 7th\"]"
}

I noticed that Alpaca does not define a system field (https://github.com/tatsu-lab/stanford_alpaca). Could you provide any insights or suggestions on how to address my issue with the RougeSimilarityFilter and whether the addition of a system field is a feasible or recommended approach?

coolbeevip · 2025-03-04T09:14:30Z

coolbeevip
Mar 4, 2025
Collaborator Author

I ultimately adjusted the self_instruct.py and templates.py code locally to meet my needs.

First, I converted my seed_tasks.jsonl, which used a non-standard format, so that I could generate a batch of new instructions without modifying any code.

from

{
    "instruction": "Extract time information from the following text.",
    "input": "Did you go to the party last Friday? It was March 7th.",
    "output": "[\"last Friday\", \"March 7th\"]"
}

to

{
    "system": "Extract time information from the following text.",
    "instruction": "Did you go to the party last Friday? It was March 7th.",
    "input": "",
    "output": "[\"last Friday\", \"March 7th\"]"
}
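The conversion above can be scripted. The following is a minimal sketch, assuming each line of seed_tasks.jsonl is one JSON object with instruction, input, and output keys; the helper names `to_system_format` and `convert_jsonl` are hypothetical and not part of CAMEL.

```python
import json


def to_system_format(record):
    """Move the fixed instruction into `system` and the input text into `instruction`."""
    return {
        "system": record["instruction"],
        "instruction": record.get("input", ""),
        "input": "",
        "output": record["output"],
    }


def convert_jsonl(lines):
    """Convert an iterable of JSONL lines, skipping blank lines."""
    return [
        json.dumps(to_system_format(json.loads(line)), ensure_ascii=False)
        for line in lines
        if line.strip()
    ]
```

Reading the lines from the original file and writing the returned strings to a new file, one per line, yields the converted seed file.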

Then, to make sure the output is generated in the expected format, I added a {sampled_human_tasks} placeholder to the input_first_template_for_gen template so that a portion of the sampled seed data can be used as examples in the prompt.

https://github.com/camel-ai/camel/blob/master/camel/datagen/self_instruct/templates.py#L271

    {sampled_human_tasks}

    Task: {instruction}

I then passed a sampled_human_tasks string, composed of system + instruction, into the input_first_template_for_gen template.

https://github.com/camel-ai/camel/blob/master/camel/datagen/self_instruct/self_instruct.py#L226

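        # Sample a few human-written seed tasks to use as in-prompt examples.
        sampled_human_tasks = self.sample_human_tasks(
            self.human_to_machine_ratio[0]
        )
        # Render each example as "Task: ... / Output: ..."; the embedded
        # spaces keep the continuation lines indented by four spaces.
        sampled_human_tasks_str = "\n".join(
            f"Task: {task['instruction']}\n    Output: {task['output']}\n\n    "
            for task in sampled_human_tasks
        )

        if classification:
            # Classification prompts are left unchanged.
            prompt = (
                SelfInstructTemplates.output_first_template_for_clf.format(
                    instruction=instruction
                )
            )
        else:
            # Generation prompts now receive the sampled examples as well.
            prompt = SelfInstructTemplates.input_first_template_for_gen.format(
                instruction=instruction,
                sampled_human_tasks=sampled_human_tasks_str,
            )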
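For readers who want to try the idea outside of CAMEL, here is a standalone sketch of the same prompt construction. `TEMPLATE` is a short stand-in and not the real input_first_template_for_gen, and `format_examples` and `build_prompt` are hypothetical helper names.

```python
# Stand-in template (an assumption); the real CAMEL template is much longer.
TEMPLATE = "{sampled_human_tasks}\n\nTask: {instruction}"


def format_examples(tasks):
    """Render sampled seed tasks as Task/Output pairs separated by blank lines."""
    return "\n\n".join(
        f"Task: {task['instruction']}\nOutput: {task['output']}"
        for task in tasks
    )


def build_prompt(instruction, sampled_tasks, template=TEMPLATE):
    """Fill the template with the examples and the new instruction."""
    return template.format(
        instruction=instruction,
        sampled_human_tasks=format_examples(sampled_tasks),
    )
```

Because the examples are passed to `str.format` as values, any braces or brackets inside the outputs are inserted as-is and are not interpreted as placeholders.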

These adjustments solved my problem. Thanks so much! And the data generation function is great!
