Add SpaCy Semantic match resolver for KG Builder #310

Closed
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,9 @@

## Next

### Added
- Added a new semantic match resolver to the KG Builder for entity resolution based on spaCy embeddings and cosine similarities so that nodes with similar textual properties get merged.

## 1.6.0

### Added
5 changes: 5 additions & 0 deletions docs/source/api.rst
@@ -104,6 +104,11 @@ SinglePropertyExactMatchResolver
.. autoclass:: neo4j_graphrag.experimental.components.resolver.SinglePropertyExactMatchResolver
:members: run

SpaCySemanticMatchResolver
================================

.. autoclass:: neo4j_graphrag.experimental.components.resolver.SpaCySemanticMatchResolver
:members: run

.. _pipeline-section:

3 changes: 2 additions & 1 deletion docs/source/index.rst
@@ -99,7 +99,8 @@ List of extra dependencies:
- **qdrant**: store vectors in Qdrant
- **experimental**: experimental features mainly from the Knowledge Graph creation pipelines.
- Warning: this requires `pygraphviz`. Installation instructions can be found `here <https://pygraphviz.github.io/documentation/stable/install.html>`_.

- nlp:
- **spaCy**: loads spaCy trained models for NLP pipelines; used by the `SpaCySemanticMatchResolver` component of the Knowledge Graph creation pipelines.
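
Before constructing the semantic resolver, it can help to confirm that spaCy and a trained model are importable. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper, and it assumes the model was installed as a regular Python package (as pip-installed spaCy models are).

```python
import importlib.util
from typing import List


def missing_packages(names: List[str]) -> List[str]:
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "spacy" is the library; "en_core_web_lg" is the default model named in this PR.
missing = missing_packages(["spacy", "en_core_web_lg"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("spaCy and the default model are importable")
```

If either name is reported as missing, install the extra and the model first, then retry.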

********
Examples
16 changes: 11 additions & 5 deletions docs/source/user_guide_kg_builder.rst
@@ -1028,22 +1028,28 @@ without making assumptions about entity similarity. The Entity Resolver
is responsible for refining the created knowledge graph by merging entity
nodes that represent the same real-world object.

In practice, this package implements a simple resolver that merges nodes
with the same label and identical "name" property.
In practice, this package implements two resolvers:

- a simple resolver that merges nodes with the same label and identical "name" property;
- a semantic match resolver that merges nodes with the same label and a similar set of textual properties (by default, only the "name" property).
Currently, semantic matching is based on spaCy embeddings and the cosine similarity between embedding vectors.
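
To make the matching criterion concrete, here is a minimal sketch of cosine similarity between two embedding vectors, written in plain Python so it runs without spaCy. The `cosine_similarity` helper and the toy vectors are illustrative assumptions, not the resolver's code; in the resolver the vectors would come from a spaCy model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (made up for illustration).
acme_corp = [0.9, 0.1, 0.3]
acme_inc = [0.85, 0.15, 0.35]
banana = [0.1, 0.9, 0.2]

print(cosine_similarity(acme_corp, acme_inc) >= 0.8)  # True: candidates for merging
print(cosine_similarity(acme_corp, banana) >= 0.8)    # False: kept apart
```

With a threshold of 0.8 (the default mentioned in the example files of this PR), only the first pair would be considered similar.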

.. warning::

The `SinglePropertyExactMatchResolver` **replaces** the nodes created by the KG writer.
- The `SinglePropertyExactMatchResolver` and `SpaCySemanticMatchResolver` **replace** the nodes created by the KG writer.

- Check the :ref:`installation` section to make sure you have the required dependencies installed when using `SpaCySemanticMatchResolver`.


It can be used like this:
The resolvers can be used like this:

.. code:: python

from neo4j_graphrag.experimental.components.resolver import (
SinglePropertyExactMatchResolver,
# SpaCySemanticMatchResolver,  # uncomment to use the semantic resolver below
)
resolver = SinglePropertyExactMatchResolver(driver)
resolver = SinglePropertyExactMatchResolver(driver) # exact match resolver
# resolver = SpaCySemanticMatchResolver(driver) # semantic match with spaCy
res = await resolver.run()

.. warning::
2 changes: 2 additions & 0 deletions examples/README.md
@@ -128,6 +128,8 @@ are listed in [the last section of this file](#customize).
- Entity Resolver:
- [SinglePropertyExactMatchResolver](./customize/build_graph/components/resolvers/simple_entity_resolver.py)
- [SinglePropertyExactMatchResolver with pre-filter](./customize/build_graph/components/resolvers/simple_entity_resolver_pre_filter.py)
- [SpaCySemanticMatchResolver](./customize/build_graph/components/resolvers/spacy_entity_resolver.py)
- [SpaCySemanticMatchResolver with pre-filter](./customize/build_graph/components/resolvers/spacy_entity_resolver_pre_filter.py)
- [Custom resolver](./customize/build_graph/components/resolvers/custom_resolver.py)
- [Custom component](./customize/build_graph/components/custom_component.py)

@@ -0,0 +1,30 @@
"""The SpaCySemanticMatchResolver merges nodes with the same label
and similar textual properties (by default using the "name" property) based on spaCy
embeddings and cosine similarities of embedding vectors.

WARNING: this process is destructive, initial nodes are deleted and replaced
by the resolved ones, but all relationships are kept.
See apoc.refactor.mergeNodes documentation for more details.
"""

import neo4j
from neo4j_graphrag.experimental.components.resolver import (
    SpaCySemanticMatchResolver,
)
from neo4j_graphrag.experimental.components.types import ResolutionStats


async def main(driver: neo4j.Driver) -> None:
    resolver = SpaCySemanticMatchResolver(
        driver,
        # optionally, change the properties used for resolution (default is "name")
        # resolve_properties=["name", "ssn"],
        # the similarity threshold (default is 0.8)
        # similarity_threshold=0.9,
        # the spaCy trained model (default is "en_core_web_lg")
        # spacy_model="en_core_web_sm",
        # and the neo4j database where data is updated
        # neo4j_database="neo4j",
    )
    res: ResolutionStats = await resolver.run()
    print(res)
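
The `similarity_threshold` option suggests that pairs scoring at or above the threshold are treated as the same entity. One plausible way to turn pairwise scores into merge groups is a union-find pass over all pairs. The sketch below works under that assumption and is not the resolver's implementation; `group_similar` and `word_overlap` are hypothetical names, and the word-overlap score is only a stand-in for spaCy embedding similarity.

```python
from itertools import combinations
from typing import Callable, Dict, List


def group_similar(
    names: List[str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.8,
) -> List[List[str]]:
    """Group names whose pairwise similarity reaches the threshold.

    Groups are transitive: if A~B and B~C, all three land together.
    """
    parent: Dict[str, str] = {name: name for name in names}

    def find(name: str) -> str:
        # Walk up to the representative, compressing the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


def word_overlap(a: str, b: str) -> float:
    """Stand-in score: Jaccard overlap of lowercase words (NOT spaCy embeddings)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


print(group_similar(["Acme Corp", "acme corp", "Globex"], word_overlap))
# [['Acme Corp', 'acme corp'], ['Globex']]
```

Raising the threshold makes groups smaller and merges more conservative, which matters here because merging is destructive.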
@@ -0,0 +1,36 @@
"""The SpaCySemanticMatchResolver merges nodes with the same label
and similar textual properties (by default using the "name" property).

If the resolution is intended to be applied only on some nodes, for instance nodes that
belong to a specific document, a "WHERE" query can be added. The only variable in the
query scope is "entity".

WARNING: this process is destructive, initial nodes are deleted and replaced
by the resolved ones, but all relationships are kept.
See apoc.refactor.mergeNodes documentation for more details.
"""

import neo4j
from neo4j_graphrag.experimental.components.resolver import (
    SpaCySemanticMatchResolver,
)
from neo4j_graphrag.experimental.components.types import ResolutionStats


async def main(driver: neo4j.Driver) -> None:
    resolver = SpaCySemanticMatchResolver(
        driver,
        # restrict resolution to the entities that belong to a given document
        filter_query="WHERE (entity)-[:FROM_CHUNK]->(:Chunk)-[:FROM_DOCUMENT]->"
        "(:Document {id: 'docId'})",
        # optionally, change the properties used for resolution (default is "name")
        # resolve_properties=["name", "ssn"],
        # the similarity threshold (default is 0.8)
        # similarity_threshold=0.9,
        # the spaCy trained model (default is "en_core_web_lg")
        # spacy_model="en_core_web_sm",
        # and the neo4j database where data is updated
        # neo4j_database="neo4j",
    )
    res: ResolutionStats = await resolver.run()
    print(res)
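
The docstring above says the only variable in scope for the filter is `entity`. A rough sketch of why: if the filter is appended verbatim after a `MATCH` that binds `entity`, it has to be a complete `WHERE` clause over that variable. The helper `build_match_query`, the `__Entity__` label and the query shape are assumptions made for illustration, not the resolver's actual query.

```python
from typing import Optional


def build_match_query(filter_query: Optional[str] = None) -> str:
    """Compose a Cypher MATCH that binds each candidate node to `entity`.

    Hypothetical helper: the `__Entity__` label and the query shape are
    assumptions made for illustration.
    """
    query = "MATCH (entity:__Entity__)"
    if filter_query:
        # The filter is appended as-is, so it must be a full WHERE clause
        # that refers only to `entity`.
        query += " " + filter_query.strip()
    return query + " RETURN entity"


doc_filter = (
    "WHERE (entity)-[:FROM_CHUNK]->(:Chunk)"
    "-[:FROM_DOCUMENT]->(:Document {id: 'docId'})"
)
print(build_match_query(doc_filter))
```

Printing the composed query before running a destructive merge is a cheap way to catch malformed filters, such as a missing closing parenthesis or `=` used in place of `:` in a property map.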