Commit 39a4b73

Improve Document/Chunk ID management (#222)
* Make sure the created chunk ID are unique
* Remove unused id_prefix
* Rm unused imports
* Changelog + deprecate field
* Fix mypy and UR
* Do not change chunk UID after embeddings
* Address comments
* Update lock file
* Regenerate lock file after merge
* Changelog + deprecate field
* Recreate lock file
* WIP: e2e tests
* Fix CI
* Ruff (why on so many files?)
* Fix doc
* Undo change to conftest.py
* E2E tests
1 parent 39fd4f7 commit 39a4b73

35 files changed (+1127 -1034 lines changed)

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -9,6 +9,11 @@

 ### Changed
 - Updated LLM implementations to handle message history consistently across providers.
+- The `id_prefix` parameter in the `LexicalGraphConfig` is deprecated.
+
+### Fixed
+- IDs for the Document and Chunk nodes in the lexical graph are now randomly generated and unique across multiple runs, fixing issues in the lexical graph where relationships were created between chunks that were created by different pipeline runs.
+

 ## 1.3.0
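
The "Fixed" entry above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the earlier IDs were built from the `id_prefix` and the chunk index; that scheme and the helper names `index_based_ids` and `random_ids` are hypothetical and are not taken from `neo4j_graphrag`. The point is only that index-derived IDs repeat across runs, while random UUIDs do not.

```python
import uuid


def index_based_ids(num_chunks: int, id_prefix: str = "") -> list[str]:
    # Assumed legacy-style scheme (illustration only): prefix + chunk index.
    return [f"{id_prefix}:{index}" for index in range(num_chunks)]


def random_ids(num_chunks: int) -> list[str]:
    # Scheme in the spirit of the fix: one random UUID per chunk.
    return [str(uuid.uuid4()) for _ in range(num_chunks)]


# Two independent "pipeline runs" over a three-chunk document.
deterministic_overlap = set(index_based_ids(3)) & set(index_based_ids(3))
random_overlap = set(random_ids(3)) & set(random_ids(3))

print(len(deterministic_overlap))  # every ID is shared between the two runs
print(len(random_overlap))         # no ID is shared (barring a UUID collision)
```

With shared IDs, a write keyed on the ID can attach a relationship to a chunk from an earlier run, which matches the symptom the changelog describes.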

docs/source/types.rst

Lines changed: 6 additions & 0 deletions
@@ -39,6 +39,12 @@ RagResultModel

 .. autoclass:: neo4j_graphrag.generation.types.RagResultModel

+DocumentInfo
+============
+
+.. autoclass:: neo4j_graphrag.experimental.components.types.DocumentInfo
+
+
 TextChunk
 =========

docs/source/user_guide_kg_builder.rst

Lines changed: 27 additions & 13 deletions
@@ -672,7 +672,7 @@ Example usage:
 from neo4j_graphrag.experimental.pipeline.components.lexical_graph_builder import LexicalGraphBuilder
 from neo4j_graphrag.experimental.pipeline.components.types import LexicalGraphConfig

-lexical_graph_builder = LexicalGraphBuilder(config=LexicalGraphConfig(id_prefix="example"))
+lexical_graph_builder = LexicalGraphBuilder(config=LexicalGraphConfig())
 graph = await lexical_graph_builder.run(
 text_chunks=TextChunks(chunks=[
 TextChunk(text="some text", index=0),
@@ -713,7 +713,6 @@ Optionally, the document and chunk node labels can be configured using a `Lexica
 # optionally, define a LexicalGraphConfig object
 # shown below with the default values
 config = LexicalGraphConfig(
-id_prefix="", # used to prefix the chunk and document IDs
 chunk_node_label="Chunk",
 document_node_label="Document",
 chunk_to_document_relationship_type="PART_OF_DOCUMENT",
@@ -998,7 +997,7 @@ without making assumptions about entity similarity. The Entity Resolver
 is responsible for refining the created knowledge graph by merging entity
 nodes that represent the same real-world object.

-In practice, this package implements a single resolver that merges nodes
+In practice, this package implements a simple resolver that merges nodes
 with the same label and identical "name" property.

 .. warning::
@@ -1018,15 +1017,30 @@ It can be used like this:

 .. warning::

-By default, all nodes with the __Entity__ label will be resolved.
-To exclude specific nodes, a filter_query can be added to the query.
-For example, if a `:Resolved` label has been applied to already resolved entities
-in the graph, these entities can be excluded with the following approach:
+By default, all nodes with the `__Entity__` label will be resolved.
+This behavior can be controled using the `filter_query` parameter described below.

-.. code:: python
+Filter Query Parameter
+----------------------

-from neo4j_graphrag.experimental.components.resolver import (
-SinglePropertyExactMatchResolver,
-)
-resolver = SinglePropertyExactMatchResolver(driver, filter_query="WHERE not entity:Resolved")
-res = await resolver.run()
+To exclude specific nodes from the resolution, a `filter_query` can be added to the query.
+For example, if a `:Resolved` label has been applied to already resolved entities
+in the graph, these entities can be excluded with the following approach:
+
+.. code:: python
+
+from neo4j_graphrag.experimental.components.resolver import (
+SinglePropertyExactMatchResolver,
+)
+filter_query = "WHERE NOT entity:Resolved"
+resolver = SinglePropertyExactMatchResolver(driver, filter_query=filter_query)
+res = await resolver.run()
+
+
+Similar approach can be used to exclude entities created from a previous pipeline
+run on the same document, assuming a label `OldDocument` has been assigned to the
+previously created document node:
+
+.. code:: python
+
+filter_query = "WHERE NOT EXISTS((entity)-[:FROM_DOCUMENT]->(:OldDocument))"
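
Both filters in the documentation change above refer to a variable named `entity`, which suggests the fragment is appended to a `MATCH` over the entity nodes. The sketch below shows that idea under stated assumptions: `build_match_clause` is a hypothetical helper written for this illustration, and the query the resolver actually runs is not shown in this commit view and may differ.

```python
ENTITY_LABEL = "__Entity__"


def build_match_clause(filter_query: str = "") -> str:
    # Hypothetical helper (not the library's code): start from a MATCH that
    # binds every entity node to the variable `entity`, then append the
    # user-supplied WHERE fragment, if any.
    clause = f"MATCH (entity:{ENTITY_LABEL})"
    if filter_query:
        clause += f" {filter_query.strip()}"
    return clause


print(build_match_clause())
print(build_match_clause("WHERE NOT entity:Resolved"))
print(build_match_clause("WHERE NOT EXISTS((entity)-[:FROM_DOCUMENT]->(:OldDocument))"))
```

Under this reading, a filter only works if it is written against the `entity` variable, as both documented examples are.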

examples/build_graph/simple_kg_builder_from_pdf.py

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@
 DATABASE = "neo4j"


-root_dir = Path(__file__).parents[4]
+root_dir = Path(__file__).parents[1]
 file_path = root_dir / "data" / "Harry Potter and the Chamber of Secrets Summary.pdf"


examples/customize/build_graph/components/extractors/custom_extractor.py

Lines changed: 1 addition & 1 deletion
@@ -4,8 +4,8 @@
     EntityRelationExtractor,
     OnError,
 )
-from neo4j_graphrag.experimental.components.pdf_loader import DocumentInfo
 from neo4j_graphrag.experimental.components.types import (
+    DocumentInfo,
     LexicalGraphConfig,
     Neo4jGraph,
     TextChunks,

examples/customize/build_graph/components/lexical_graph_builder/lexical_graph_builder.py

Lines changed: 0 additions & 1 deletion
@@ -13,7 +13,6 @@ async def main() -> GraphResult:
     # optionally, define a LexicalGraphConfig object
     # shown below with default values
     config = LexicalGraphConfig(
-        id_prefix="",  # used to prefix the chunk and document IDs
         chunk_node_label="Chunk",
         document_node_label="Document",
         chunk_to_document_relationship_type="PART_OF_DOCUMENT",

examples/customize/build_graph/components/loaders/custom_loader.py

Lines changed: 2 additions & 5 deletions
@@ -3,11 +3,8 @@
 from pathlib import Path
 from typing import Dict, Optional

-from neo4j_graphrag.experimental.components.pdf_loader import (
-    DataLoader,
-    DocumentInfo,
-    PdfDocument,
-)
+from neo4j_graphrag.experimental.components.pdf_loader import DataLoader
+from neo4j_graphrag.experimental.components.types import DocumentInfo, PdfDocument


 class MyLoader(DataLoader):

examples/customize/build_graph/pipeline/lexical_graph_builder_from_text.py

Lines changed: 0 additions & 1 deletion
@@ -33,7 +33,6 @@ async def main(neo4j_driver: neo4j.Driver) -> PipelineResult:
     pipe.add_component(TextChunkEmbedder(embedder=OpenAIEmbeddings()), "chunk_embedder")
     # optional: define some custom node labels for the lexical graph:
     lexical_graph_config = LexicalGraphConfig(
-        id_prefix="example",
         chunk_node_label="TextPart",
     )
     pipe.add_component(

examples/customize/build_graph/pipeline/text_to_lexical_graph_to_entity_graph_single_pipeline.py

Lines changed: 0 additions & 1 deletion
@@ -164,7 +164,6 @@ async def define_and_run_pipeline(
 async def main(driver: neo4j.Driver) -> PipelineResult:
     # optional: define some custom node labels for the lexical graph:
     lexical_graph_config = LexicalGraphConfig(
-        id_prefix="example",
         chunk_node_label="TextPart",
         document_node_label="Text",
     )

examples/customize/build_graph/pipeline/text_to_lexical_graph_to_entity_graph_two_pipelines.py

Lines changed: 0 additions & 1 deletion
@@ -184,7 +184,6 @@ async def read_chunk_and_perform_entity_extraction(
 async def main(driver: neo4j.Driver) -> PipelineResult:
     # optional: define some custom node labels for the lexical graph:
     lexical_graph_config = LexicalGraphConfig(
-        id_prefix="example",
         document_node_label="Book",  # default: "Document"
         chunk_node_label="Chapter",  # default "Chunk"
         chunk_text_property="content",  # default: "text"
