Remove text unit grouping #2052
Pull Request Overview
This PR removes the `group_by_columns` configuration option from text chunking, simplifying the system by eliminating the ability to group documents before chunking. The relationship between documents and text units changes from one-to-many or many-to-many to strictly one-to-many: each text unit now belongs to exactly one document.
- Removes the `group_by_columns` parameter from the chunking configuration and related workflows
- Changes the text unit data model from `document_ids` (list) to `document_id` (single string)
- Updates test files to reflect new chunking behavior and token counts
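
To illustrate the data model change, here is a minimal sketch of what moving from `document_ids` to `document_id` could look like for downstream code. The `TextUnitBefore` and `TextUnitAfter` dataclasses and the `migrate` helper are hypothetical names used only for illustration; they are not the actual `TextUnit` model in `graphrag/data_model/text_unit.py`.

```python
from dataclasses import dataclass


@dataclass
class TextUnitBefore:
    """Hypothetical pre-change shape: a text unit may reference several documents."""
    id: str
    text: str
    document_ids: list[str]


@dataclass
class TextUnitAfter:
    """Hypothetical post-change shape: a text unit references exactly one document."""
    id: str
    text: str
    document_id: str


def migrate(unit: TextUnitBefore) -> TextUnitAfter:
    """Convert an old-style unit, rejecting units that do not map to one document."""
    if len(unit.document_ids) != 1:
        raise ValueError(
            f"text unit {unit.id!r} references {len(unit.document_ids)} documents; "
            "expected exactly 1"
        )
    return TextUnitAfter(id=unit.id, text=unit.text, document_id=unit.document_ids[0])
```

Under these assumptions, any existing text unit that spans multiple documents has no direct equivalent in the new model, which is why the sketch raises instead of silently picking one ID.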
Reviewed Changes
Copilot reviewed 23 out of 32 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
graphrag/config/models/chunking_config.py | Removes group_by_columns field from ChunkingConfig |
graphrag/data_model/text_unit.py | Changes document_ids to document_id in TextUnit model |
graphrag/index/workflows/create_base_text_units.py | Simplifies chunking logic by removing document grouping |
graphrag/index/workflows/create_final_*.py | Updates workflows to use document_id instead of document_ids |
tests/verbs/test_*.py | Updates test assertions for new chunking behavior |
docs/*.md | Updates documentation to reflect removal of grouping feature |
Removes the `group_by_columns` config that would group documents before chunking. In practice this option is never used, but it adds a lot of complexity to maintain.
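
As a rough sketch of chunking without grouping, the following chunks each document independently so that no chunk crosses a document boundary. The function names, the whitespace tokenization, and the `size` parameter are assumptions made for illustration; the real logic lives in `graphrag/index/workflows/create_base_text_units.py` and uses the project's own tokenizer and settings.

```python
def chunk_document(document_id: str, text: str, size: int = 4) -> list[dict]:
    """Split one document into fixed-size chunks of whitespace tokens (illustrative only)."""
    tokens = text.split()
    return [
        {"document_id": document_id, "text": " ".join(tokens[i : i + size])}
        for i in range(0, len(tokens), size)
    ]


def chunk_corpus(documents: dict[str, str], size: int = 4) -> list[dict]:
    """Chunk every document on its own; each resulting unit carries a single document_id."""
    units: list[dict] = []
    for document_id, text in documents.items():
        units.extend(chunk_document(document_id, text, size))
    return units
```

Because each document is processed on its own, every unit can carry a single `document_id`, and there is no grouping step that would merge text from several documents into one chunk.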