This pull request introduces significant improvements to how text nodes are handled in the knowledge graph, focusing on tracking and deduplicating context by line numbers rather than generic metadata. The changes include adding explicit start and end line numbers to text nodes, updating all related code and database interactions, and implementing robust deduplication logic for extracted contexts. These updates improve the accuracy and usefulness of context extraction, storage, and retrieval throughout the system.
Knowledge Graph Enhancements:
- Replaced the generic `metadata` field in `TextNode` and `Neo4jTextNode` with explicit `start_line` and `end_line` fields, updating all related code, database queries, and serialization/deserialization logic to use these new fields. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]
- During text file graph construction, line positions are now calculated and stored as metadata in each document chunk, allowing for precise mapping of text to original file lines. [1] [2]
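As a rough illustration of the line-aware node described above, the following sketch shows one way a chunk's line span could be derived from its position in the source file. Only `TextNode`, `start_line`, and `end_line` come from this PR; the other fields, the `line_span` helper, and the 1-indexed inclusive convention are assumptions made for the example, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TextNode:
    # node_id and text are assumed fields; start_line/end_line are the
    # fields named in the PR. Lines are assumed 1-indexed and inclusive.
    node_id: int
    text: str
    start_line: int
    end_line: int


def line_span(full_text: str, chunk: str, search_from: int = 0) -> Tuple[int, int]:
    """Hypothetical helper: locate `chunk` in `full_text` and return its line span."""
    offset = full_text.find(chunk, search_from)
    if offset < 0:
        raise ValueError("chunk not found in source text")
    # Every newline before the chunk pushes its first line down by one.
    start_line = full_text.count("\n", 0, offset) + 1
    # A trailing newline ends the last line; it does not start a new one.
    end_line = start_line + chunk.rstrip("\n").count("\n")
    return start_line, end_line


source = "alpha\nbeta\ngamma\ndelta\n"
start, end = line_span(source, "beta\ngamma")
node = TextNode(node_id=0, text="beta\ngamma", start_line=start, end_line=end)
```

Storing the span as two integer fields rather than inside a free-form metadata blob keeps it directly queryable and comparable, which is what the deduplication step relies on.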
Context Extraction and Deduplication:
- Added a new `deduplicate_contexts` utility that removes duplicate or contained contexts based on file, content, and line numbers, and applied it to all context extraction flows. This ensures only unique and most relevant contexts are returned. [1] [2] [3] [4]
- Updated context extraction logic to skip empty content and deduplicate before returning results, improving both efficiency and relevance of context data. [1] [2]
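The deduplication rule described above can be sketched roughly as follows. This is a minimal approximation, not the PR's code: the `Context` type and its field names are hypothetical, and containment is decided here by file and line span only, whereas the PR's utility is also described as comparing content.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Context:
    # Hypothetical context record; field names are assumptions.
    relative_path: str
    content: str
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive


def deduplicate_contexts(contexts: List[Context]) -> List[Context]:
    """Drop empty contexts and any context whose span lies inside a kept one."""
    candidates = [c for c in contexts if c.content.strip()]
    # Within a file, visit earlier and wider spans first so that a
    # container is always kept before the spans it contains.
    ordered = sorted(
        candidates,
        key=lambda c: (c.relative_path, c.start_line, -c.end_line),
    )
    kept: List[Context] = []
    for ctx in ordered:
        covered = any(
            k.relative_path == ctx.relative_path
            and k.start_line <= ctx.start_line
            and ctx.end_line <= k.end_line
            for k in kept
        )
        if not covered:
            kept.append(ctx)
    return kept


whole = Context("a.py", "def f():\n    return 1", 10, 11)
inner = Context("a.py", "    return 1", 11, 11)
other = Context("b.py", "x = 1", 11, 11)
blank = Context("a.py", "   ", 1, 1)
unique = deduplicate_contexts([inner, whole, whole, other, blank])
```

Note that this sketch returns contexts sorted by file and start line rather than in their input order; whether the real utility preserves order is not stated in the description.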
Other Improvements:
- Improved artifact aggregation in `transform_tool_messages_to_str` to handle all tool message artifacts collectively, ensuring comprehensive context stringification.
- Fixed an off-by-one error in line selection for code reading, ensuring correct lines are included.
- Minor: Updated tool initialization to specify a new response format for file reading tools.
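For readers unfamiliar with this class of bug, the sketch below shows line selection under one common convention: 1-indexed line numbers with an inclusive end. The description does not say which convention the fix adopted or where the error was, so both the convention and the `select_lines` name are assumptions.

```python
def select_lines(text: str, start_line: int, end_line: int) -> str:
    """Return lines start_line..end_line of text (1-indexed, inclusive)."""
    if start_line < 1 or end_line < start_line:
        raise ValueError("expected 1 <= start_line <= end_line")
    lines = text.splitlines()
    # Shift only the start to 0-indexed; the slice end is exclusive,
    # so using end_line unchanged keeps the last requested line.
    return "\n".join(lines[start_line - 1 : end_line])


snippet = select_lines("a\nb\nc\nd", 2, 3)
```

A typical off-by-one mistake here is slicing with `lines[start_line:end_line]` or `lines[start_line - 1 : end_line - 1]`, which silently drops the first or last requested line.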
These changes collectively make context extraction, storage, and retrieval more accurate, deduplicated, and line-aware, which is crucial for downstream tasks such as code analysis and bug localization.