DatasetGenerator for q/a pairs settings #8648

snassimr · 2023-11-02T17:47:25Z

snassimr
Nov 2, 2023

Question Validation

I have searched both the documentation and discord for an answer.

Question

I want to ask about this piece of code:

```python
for file in files:
    nodes = node_parser.get_nodes_from_documents(documents[file])
    dataset_generator = DatasetGenerator(
        nodes[:3],
        service_context=service_context,
        show_progress=True,
        num_questions_per_chunk=10,
    )

    qa_dataset = await dataset_generator.agenerate_dataset_from_nodes(num=20)
```

Each chunk (I have 77 of them) is mapped to a node.
The code should generate 30 q/a pairs: number of nodes (3) × num_questions_per_chunk (10).
What is the meaning of num=20 in this case?

Thanks

@snassimr · 2023-11-02T17:47:30Z

dosubot[bot]
bot Nov 2, 2023

🤖

Hello @snassimr!

It's good to see you again. I hope you're doing well! Let's see how we can assist you with this issue today.

Give me a moment to look at your issue. I'll post a detailed answer shortly.

0 replies

@dosu-bot · 2023-11-02T17:57:21Z

dosubot[bot]
bot Nov 2, 2023

🤖

The parameter 'num=20' in the 'agenerate_dataset_from_nodes' function signifies the maximum number of question-answer pairs to be generated from the nodes. If the 'num' parameter is set to a specific integer, the function will stop generating new question-answer pairs once it reaches that number. If 'num' is set to None, the function will generate question-answer pairs for all nodes.

This is evident from the following code snippet in the LlamaIndex codebase:

for node in nodes:
    if num is not None and len(queries) >= num:
        break

Here, the function iterates over each node to generate question-answer pairs. If 'num' is not None and the length of the 'queries' dictionary (which stores the generated question-answer pairs) is greater than or equal to 'num', the function breaks out of the loop, effectively limiting the number of generated question-answer pairs to 'num'.

In the context of your code, even though you have set 'num_questions_per_chunk' to 10 and you are generating questions for 3 nodes, the 'num=20' parameter in 'agenerate_dataset_from_nodes' function will limit the total number of question-answer pairs to 20.

You can find more details about this in the llama_index/evaluation/dataset_generation.py file in the LlamaIndex repository.
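A toy simulation of the loop quoted above may make the cap more concrete. This is only a sketch: it assumes the check runs once per node, before that node's questions are added, and that every node yields exactly `num_questions_per_chunk` questions. `simulate_cap` is a hypothetical helper written for illustration, not part of LlamaIndex.

```python
from typing import Optional


def simulate_cap(num_nodes: int, questions_per_node: int, num: Optional[int]) -> int:
    """Count questions collected when the cap is checked once per node.

    Hypothetical helper: it mirrors only the
    `if num is not None and len(queries) >= num: break` guard
    from the quoted snippet, nothing else from the library.
    """
    queries = []
    for node_idx in range(num_nodes):
        # Same guard as in the quoted snippet: stop before starting a new node.
        if num is not None and len(queries) >= num:
            break
        # Assumption: every node contributes exactly `questions_per_node` items.
        queries.extend(f"node{node_idx}-q{i}" for i in range(questions_per_node))
    return len(queries)


print(simulate_cap(3, 10, 20))    # 20: the third node is skipped
print(simulate_cap(3, 10, None))  # 30: no cap applied
print(simulate_cap(3, 7, 20))     # 21: a per-node check can overshoot num
```

Under these assumptions, `num` acts as an exact cap only when it is a multiple of the per-node question count. Otherwise the total can exceed `num` by up to one node's worth of questions.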

Sources

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

0 replies

@dosu-bot · 2023-11-02T17:57:51Z

dosubot[bot]
bot Nov 2, 2023

🤖

In the code snippet you provided, the num=20 parameter in the agenerate_dataset_from_nodes function is used to specify the maximum number of question-answer pairs that should be generated from the nodes.

The function will iterate over each node and generate question-answer pairs. If the 'num' parameter is set to a specific integer, the function will stop generating new question-answer pairs once it reaches that number. If 'num' is set to None, the function will generate question-answer pairs for all nodes. This is evident from the following code snippet:

for node in nodes:
    if num is not None and len(queries) >= num:
        break

Here, the function breaks out of the loop if the length of the 'queries' dictionary (which stores the generated question-answer pairs) is greater than or equal to 'num', effectively limiting the number of generated question-answer pairs to 'num'.

You can find more details in the source code of the DatasetGenerator class in the LlamaIndex codebase, specifically in the agenerate_dataset_from_nodes method. Here is the link to the source code: llama_index/evaluation/dataset_generation.py

I hope this clarifies your question. If you have any more questions, feel free to ask.


0 replies

nerdai · 2023-11-02T19:00:53Z

nerdai
Nov 2, 2023

Hey @snassimr! So what's happening is that, for each node, an LLM is prompted to generate num_questions_per_chunk (in this case, 10) questions. We then parse the LLM's response to create a list of questions. What num represents is the maximum number of questions that will be returned from the extracted list of questions.
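The parse-then-truncate flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: `parse_questions` and `collect_questions` are hypothetical names, and the numbered-list format of the LLM reply is assumed.

```python
import re
from typing import List, Optional


def parse_questions(llm_response: str) -> List[str]:
    """Split a raw LLM reply into questions, dropping numbering and blank lines."""
    questions = []
    for line in llm_response.splitlines():
        # Remove a leading "1." or "2)" style prefix if present.
        cleaned = re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        if cleaned:
            questions.append(cleaned)
    return questions


def collect_questions(responses: List[str], num: Optional[int] = None) -> List[str]:
    """Flatten the per-node questions, then keep at most `num` of them."""
    all_questions = [q for r in responses for q in parse_questions(r)]
    return all_questions if num is None else all_questions[:num]


# One raw reply per node (made-up text, for illustration only).
responses = [
    "1. What is X?\n2. Why does X matter?",
    "1) How is Y computed?\n\n2) When is Y undefined?\n3) Who introduced Y?",
]
print(len(collect_questions(responses)))    # 5
print(collect_questions(responses, num=3))  # first three questions only
```

Note that in this sketch the number of questions per node depends on what the reply text actually contains, not on the number that was requested, so the total before truncation is not guaranteed to equal nodes × num_questions_per_chunk.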

1 reply

snassimr Nov 2, 2023
Author

Hi,

I've got the idea, but I still see an inconsistency. Could you kindly address it?

This code gives me 20 pairs:

```python
for file in files:
    nodes = node_parser.get_nodes_from_documents(documents[file])
    dataset_generator = DatasetGenerator(
        nodes[:3],
        service_context=service_context,
        show_progress=True,
        num_questions_per_chunk=1,
    )

    qa_dataset = await dataset_generator.agenerate_dataset_from_nodes(num=20)
```

And this gives me 6 pairs:

```python
for file in files:
    nodes = node_parser.get_nodes_from_documents(documents[file])
    dataset_generator = DatasetGenerator(
        nodes[:3],
        service_context=service_context,
        show_progress=True,
        num_questions_per_chunk=2,
    )

    qa_dataset = await dataset_generator.agenerate_dataset_from_nodes(num=20)
```
