Commit 7d97932

Add Simple Entity Resolver (#160)
* Simple Entity Resolution component: adds a component that cleans up the graph after the Writer by merging nodes with the same label and name
* Mypy
* WIP CHANGELOG and doc
* Let the user control which nodes to exclude from resolution, if any
* End-to-end test with 2 documents
* Add example to run multiple documents with the same pipeline
* Remove debug logger, more annoying than anything in the example - Python devs know how to configure a logger
* ruff
* Update docstring + ChatGPT-improved user guide
* Fix after merge + improved docstrings
* Fix code block in doc
* Update CHANGELOG
* Add resolver to the KG pipeline builder
* Remove annotation - not needed by mypy, only PyCharm complains
* Fix component name
1 parent be56247 commit 7d97932

19 files changed, +882 -109 lines changed

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@

 ## Next

+- Added `SinglePropertyExactMatchResolver` component to merge entities with the exact same property (e.g. name)
+
 ## 0.7.0

 ### Added

docs/source/api.rst

Lines changed: 8 additions & 0 deletions
@@ -71,6 +71,14 @@ LLMEntityRelationExtractor
     :members: run


+SinglePropertyExactMatchResolver
+================================
+
+.. autoclass:: neo4j_graphrag.experimental.components.resolver.SinglePropertyExactMatchResolver
+    :members: run
+
+
+
 .. _pipeline-section:

 ********

docs/source/user_guide_kg_builder.rst

Lines changed: 42 additions & 2 deletions
@@ -11,8 +11,6 @@ unstructured data.

     This feature is still experimental. API changes and bug fixes are expected.

-    It is not recommended to use it in production yet.
-

 ******************
 Pipeline Structure
@@ -26,6 +24,7 @@ A Knowledge Graph (KG) construction pipeline requires a few components:
 - **Schema builder**: provide a schema to ground the LLM extracted entities and relations and obtain an easily navigable KG.
 - **Entity and relation extractor**: extract relevant entities and relations from the text.
 - **Knowledge Graph writer**: save the identified entities and relations.
+- **Entity resolver**: merge similar entities into a single node.

 .. image:: images/kg_builder_pipeline.png
     :alt: KG Builder pipeline
@@ -426,3 +425,44 @@ It is possible to create a custom writer using the `KGWriter` interface:


 See :ref:`kgwritermodel` and :ref:`kgwriter` in API reference.
+
+
+Entity Resolver
+===============
+
+The KG Writer component creates new nodes for each identified entity
+without making assumptions about entity similarity. The Entity Resolver
+is responsible for refining the created knowledge graph by merging entity
+nodes that represent the same real-world object.
+
+In practice, this package implements a single resolver that merges nodes
+with the same label and identical "name" property.
+
+.. warning::
+
+    The `SinglePropertyExactMatchResolver` **replaces** the nodes created by the KG writer.
+
+
+It can be used like this:
+
+.. code:: python
+    from neo4j_graphrag.experimental.components.resolver import (
+        SinglePropertyExactMatchResolver,
+    )
+    resolver = SinglePropertyExactMatchResolver(driver)
+    res = await resolver.run()
+
+.. warning::
+
+    By default, all nodes with the `__Entity__` label will be resolved.
+    To exclude specific nodes, a `filter_query` can be passed to the resolver.
+    For example, if a `:Resolved` label has been applied to already resolved entities
+    in the graph, these entities can be excluded with the following approach:
+
+.. code:: python
+
+    from neo4j_graphrag.experimental.components.resolver import (
+        SinglePropertyExactMatchResolver,
+    )
+    resolver = SinglePropertyExactMatchResolver(driver, filter_query="WHERE not entity:Resolved")
+    res = await resolver.run()
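
Editor's sketch (not part of the commit): the user guide text above describes the resolver as merging nodes that share a label and an identical "name" property, with an optional filter to leave some nodes alone. The in-memory illustration below mimics that behaviour on plain dictionaries. It is not the library's implementation, which operates on the database. The function name `resolve_exact_match`, its `resolve_property` and `include` parameters, and the "first value wins" rule for conflicting properties are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Callable, Optional


def resolve_exact_match(
    nodes: list[dict],
    resolve_property: str = "name",
    include: Optional[Callable[[dict], bool]] = None,
) -> list[dict]:
    """Merge nodes that share the same label and the same property value.

    Hypothetical helper for illustration only. Nodes rejected by `include`
    are returned unchanged, loosely mirroring the role of a filter query.
    """
    groups: dict[tuple, list[dict]] = defaultdict(list)
    skipped: list[dict] = []
    for node in nodes:
        if include is not None and not include(node):
            skipped.append(node)
            continue
        # Group candidates by (label, property value), e.g. ("Person", "Harry").
        groups[(node["label"], node.get(resolve_property))].append(node)

    resolved: list[dict] = []
    for members in groups.values():
        merged: dict = {}
        for member in members:
            for key, value in member.items():
                # Assumption: on conflict, keep the first value seen.
                merged.setdefault(key, value)
        merged["merged_ids"] = [member["id"] for member in members]
        resolved.append(merged)
    return resolved + skipped


nodes = [
    {"id": 1, "label": "Person", "name": "Harry", "house": "Gryffindor"},
    {"id": 2, "label": "Person", "name": "Harry", "age": 12},
    {"id": 3, "label": "Organization", "name": "Harry"},
    {"id": 4, "label": "Person", "name": "Harry", "resolved": True},
]
# Skip nodes already flagged as resolved, similar in spirit to the
# "WHERE not entity:Resolved" filter shown in the user guide.
result = resolve_exact_match(nodes, include=lambda n: not n.get("resolved"))
```

Here nodes 1 and 2 collapse into one `Person` entry, node 3 stays separate because its label differs, and node 4 is passed through untouched because the filter excludes it.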
Binary file not shown.

examples/pipeline/kg_builder_from_pdf.py

Lines changed: 24 additions & 13 deletions
@@ -33,12 +33,14 @@
     FixedSizeSplitter,
 )
 from neo4j_graphrag.experimental.pipeline.pipeline import PipelineResult
-from neo4j_graphrag.llm import OpenAILLM
+from neo4j_graphrag.llm import LLMInterface, OpenAILLM

 logging.basicConfig(level=logging.INFO)


-async def main(neo4j_driver: neo4j.Driver) -> PipelineResult:
+async def define_and_run_pipeline(
+    neo4j_driver: neo4j.AsyncDriver, llm: LLMInterface
+) -> PipelineResult:
     from neo4j_graphrag.experimental.pipeline import Pipeline

     # Instantiate Entity and Relation objects
@@ -86,13 +88,7 @@ async def main(neo4j_driver: neo4j.Driver) -> PipelineResult:
     pipe.add_component(SchemaBuilder(), "schema")
     pipe.add_component(
         LLMEntityRelationExtractor(
-            llm=OpenAILLM(
-                model_name="gpt-4o",
-                model_params={
-                    "max_tokens": 2000,
-                    "response_format": {"type": "json_object"},
-                },
-            ),
+            llm=llm,
             on_error=OnError.RAISE,
         ),
         "extractor",
@@ -127,8 +123,23 @@ async def main(neo4j_driver: neo4j.Driver) -> PipelineResult:
     return await pipe.run(pipe_inputs)


-if __name__ == "__main__":
-    with neo4j.GraphDatabase.driver(
+async def main() -> PipelineResult:
+    llm = OpenAILLM(
+        model_name="gpt-4o",
+        model_params={
+            "max_tokens": 2000,
+            "response_format": {"type": "json_object"},
+        },
+    )
+    driver = neo4j.AsyncGraphDatabase.driver(
         "bolt://localhost:7687", auth=("neo4j", "password")
-    ) as driver:
-        print(asyncio.run(main(driver)))
+    )
+    res = await define_and_run_pipeline(driver, llm)
+    await driver.close()
+    await llm.async_client.close()
+    return res
+
+
+if __name__ == "__main__":
+    res = asyncio.run(main())
+    print(res)
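
Editor's sketch (not part of the commit): in the new `main()` above, the driver and the LLM client are closed only if the pipeline run returns normally. One common way to make the cleanup unconditional is `try`/`finally`. The snippet below shows the pattern with a stand-in resource so it can run without a database. `FakeAsyncResource` and `run_with_cleanup` are hypothetical names introduced for this illustration, not library APIs.

```python
import asyncio
from typing import Any, Awaitable, Callable


class FakeAsyncResource:
    """Hypothetical stand-in for an async driver or client exposing close()."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


async def run_with_cleanup(
    resource: FakeAsyncResource,
    work: Callable[[FakeAsyncResource], Awaitable[Any]],
) -> Any:
    """Await `work(resource)` and always close the resource afterwards."""
    try:
        return await work(resource)
    finally:
        # Runs whether `work` returned or raised.
        await resource.close()


async def failing_work(resource: FakeAsyncResource) -> None:
    raise RuntimeError("pipeline failed")


resource = FakeAsyncResource()
try:
    asyncio.run(run_with_cleanup(resource, failing_work))
except RuntimeError:
    pass  # the error still propagates, but the resource has been closed
```

The same shape would apply to the example's `driver` and `llm.async_client`, each closed in its own `finally` clause.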

examples/pipeline/kg_builder_from_text.py

Lines changed: 24 additions & 34 deletions
@@ -15,7 +15,6 @@
 from __future__ import annotations

 import asyncio
-import logging.config

 import neo4j
 from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings
@@ -36,30 +35,12 @@
 )
 from neo4j_graphrag.experimental.pipeline import Pipeline
 from neo4j_graphrag.experimental.pipeline.pipeline import PipelineResult
-from neo4j_graphrag.llm import OpenAILLM
-
-# set log level to DEBUG for all neo4j_graphrag.* loggers
-logging.config.dictConfig(
-    {
-        "version": 1,
-        "handlers": {
-            "console": {
-                "class": "logging.StreamHandler",
-            }
-        },
-        "loggers": {
-            "root": {
-                "handlers": ["console"],
-            },
-            "neo4j_graphrag": {
-                "level": "DEBUG",
-            },
-        },
-    }
-)
+from neo4j_graphrag.llm import LLMInterface, OpenAILLM


-async def main(neo4j_driver: neo4j.Driver) -> PipelineResult:
+async def define_and_run_pipeline(
+    neo4j_driver: neo4j.AsyncDriver, llm: LLMInterface
+) -> PipelineResult:
     """This is where we define and run the KG builder pipeline, instantiating a few
     components:
     - Text Splitter: in this example we use the fixed size text splitter
@@ -83,13 +64,7 @@ async def main(neo4j_driver: neo4j.Driver) -> PipelineResult:
     pipe.add_component(SchemaBuilder(), "schema")
     pipe.add_component(
         LLMEntityRelationExtractor(
-            llm=OpenAILLM(
-                model_name="gpt-4o",
-                model_params={
-                    "max_tokens": 1000,
-                    "response_format": {"type": "json_object"},
-                },
-            ),
+            llm=llm,
             on_error=OnError.RAISE,
         ),
         "extractor",
@@ -164,8 +139,23 @@ async def main(neo4j_driver: neo4j.Driver) -> PipelineResult:
     return await pipe.run(pipe_inputs)


-if __name__ == "__main__":
-    with neo4j.GraphDatabase.driver(
+async def main() -> PipelineResult:
+    llm = OpenAILLM(
+        model_name="gpt-4o",
+        model_params={
+            "max_tokens": 1000,
+            "response_format": {"type": "json_object"},
+        },
+    )
+    driver = neo4j.AsyncGraphDatabase.driver(
         "bolt://localhost:7687", auth=("neo4j", "password")
-    ) as driver:
-        print(asyncio.run(main(driver)))
+    )
+    res = await define_and_run_pipeline(driver, llm)
+    await driver.close()
+    await llm.async_client.close()
+    return res
+
+
+if __name__ == "__main__":
+    res = asyncio.run(main())
+    print(res)
Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
+# Copyright (c) "Neo4j"
+# Neo4j Sweden AB [https://neo4j.com]
+# #
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# #
+# https://www.apache.org/licenses/LICENSE-2.0
+# #
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import annotations
+
+import asyncio
+
+import neo4j
+from neo4j_graphrag.experimental.components.entity_relation_extractor import (
+    LLMEntityRelationExtractor,
+    OnError,
+)
+from neo4j_graphrag.experimental.components.kg_writer import Neo4jWriter
+from neo4j_graphrag.experimental.components.pdf_loader import PdfLoader
+from neo4j_graphrag.experimental.components.resolver import (
+    SinglePropertyExactMatchResolver,
+)
+from neo4j_graphrag.experimental.components.schema import (
+    SchemaBuilder,
+    SchemaEntity,
+    SchemaProperty,
+    SchemaRelation,
+)
+from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import (
+    FixedSizeSplitter,
+)
+from neo4j_graphrag.experimental.pipeline import Pipeline
+from neo4j_graphrag.llm import LLMInterface, OpenAILLM
+
+
+async def define_and_run_pipeline(
+    neo4j_driver: neo4j.AsyncDriver, llm: LLMInterface
+) -> None:
+    """This is where we define and run the KG builder pipeline, instantiating a few
+    components:
+    - Text Splitter: in this example we use the fixed size text splitter
+    - Schema Builder: this component takes a list of entities, relationships and
+        possible triplets as inputs, validates them and returns a schema ready to use
+        for the rest of the pipeline
+    - LLM Entity Relation Extractor is an LLM-based entity and relation extractor:
+        based on the provided schema, the LLM will do its best to identify these
+        entities and their relations within the provided text
+    - KG writer: once entities and relations are extracted, they can be written
+        to a Neo4j database
+    """
+    pipe = Pipeline()
+    # define the components
+    pipe.add_component(PdfLoader(), "loader")
+    pipe.add_component(
+        FixedSizeSplitter(),
+        "splitter",
+    )
+    pipe.add_component(SchemaBuilder(), "schema")
+    pipe.add_component(
+        LLMEntityRelationExtractor(
+            llm=llm,
+            on_error=OnError.IGNORE,
+        ),
+        "extractor",
+    )
+    pipe.add_component(Neo4jWriter(neo4j_driver), "writer")
+    pipe.add_component(SinglePropertyExactMatchResolver(neo4j_driver), "resolver")
+    # define the execution order of components
+    # and how the output of previous components must be used
+    pipe.connect("loader", "splitter", {"text": "loader.text"})
+    pipe.connect("splitter", "extractor", input_config={"chunks": "splitter"})
+    pipe.connect(
+        "schema",
+        "extractor",
+        input_config={"schema": "schema", "document_info": "loader.document_info"},
+    )
+    pipe.connect(
+        "extractor",
+        "writer",
+        input_config={"graph": "extractor"},
+    )
+    pipe.connect("writer", "resolver", {})
+    # user input:
+    # the initial text
+    # and the list of entities and relations we are looking for
+    pipe_inputs = {
+        "loader": {},
+        "schema": {
+            "entities": [
+                SchemaEntity(
+                    label="Person",
+                    properties=[
+                        SchemaProperty(name="name", type="STRING"),
+                        SchemaProperty(name="place_of_birth", type="STRING"),
+                        SchemaProperty(name="date_of_birth", type="DATE"),
+                    ],
+                ),
+                SchemaEntity(
+                    label="Organization",
+                    properties=[
+                        SchemaProperty(name="name", type="STRING"),
+                        SchemaProperty(name="country", type="STRING"),
+                    ],
+                ),
+            ],
+            "relations": [
+                SchemaRelation(
+                    label="WORKED_FOR",
+                ),
+                SchemaRelation(
+                    label="FRIEND",
+                ),
+                SchemaRelation(
+                    label="ENEMY",
+                ),
+            ],
+            "potential_schema": [
+                ("Person", "WORKED_FOR", "Organization"),
+                ("Person", "FRIEND", "Person"),
+                ("Person", "ENEMY", "Person"),
+            ],
+        },
+    }
+    # run the pipeline for each document
+    for document in [
+        "examples/pipeline/Harry Potter and the Chamber of Secrets Summary.pdf",
+        "examples/pipeline/Harry Potter and the Death Hallows Summary.pdf",
+    ]:
+        pipe_inputs["loader"]["filepath"] = document
+        await pipe.run(pipe_inputs)
+
+
+async def main() -> None:
+    llm = OpenAILLM(
+        model_name="gpt-4o",
+        model_params={
+            "max_tokens": 1000,
+            "response_format": {"type": "json_object"},
+        },
+    )
+    driver = neo4j.AsyncGraphDatabase.driver(
+        "bolt://localhost:7687", auth=("neo4j", "password")
+    )
+    await define_and_run_pipeline(driver, llm)
+    await driver.close()
+    await llm.async_client.close()
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
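
Editor's sketch (not part of the commit): the loop in the example above reuses one `pipe_inputs` dictionary and overwrites its `filepath` entry for each document. A possible alternative, shown here only as an illustration, is to build a fresh copy of the inputs for every run so the shared template is never mutated. `run_for_each_document` and `fake_run` are hypothetical names. `fake_run` stands in for `pipe.run` so the snippet works without Neo4j or an LLM.

```python
import asyncio
import copy
from typing import Any, Awaitable, Callable


async def run_for_each_document(
    run: Callable[[dict[str, Any]], Awaitable[Any]],
    base_inputs: dict[str, Any],
    filepaths: list[str],
) -> list[Any]:
    """Call `run` once per file, sequentially, with a per-document copy of the inputs."""
    results = []
    for filepath in filepaths:
        inputs = copy.deepcopy(base_inputs)  # leave the shared template untouched
        inputs.setdefault("loader", {})["filepath"] = filepath
        results.append(await run(inputs))
    return results


async def fake_run(inputs: dict[str, Any]) -> str:
    # Hypothetical stand-in for `pipe.run`: echoes the file it was asked to load.
    return inputs["loader"]["filepath"]


base_inputs = {"loader": {}, "schema": {"entities": ["Person", "Organization"]}}
outputs = asyncio.run(
    run_for_each_document(fake_run, base_inputs, ["first.pdf", "second.pdf"])
)
```

Running the documents one after another, as the commit's example does, keeps the writer and resolver steps of each run from interleaving. Whether concurrent runs would be safe is not stated in the source.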
