Add Fuzzy match resolver for KG builder #319

Merged
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,8 @@

- Added support for multi-vector collection in Qdrant driver.
- Added a `Pipeline.stream` method to stream pipeline progress.
- Added a new semantic match resolver to the KG Builder for entity resolution based on spaCy embeddings and cosine similarities so that nodes with similar textual properties get merged.
- Added a new fuzzy match resolver to the KG Builder for entity resolution based on RapidFuzz fuzzy string matching.

### Changed

11 changes: 11 additions & 0 deletions docs/source/api.rst
Expand Up @@ -108,6 +108,17 @@ SinglePropertyExactMatchResolver
.. autoclass:: neo4j_graphrag.experimental.components.resolver.SinglePropertyExactMatchResolver
:members: run

SpaCySemanticMatchResolver
==========================

.. autoclass:: neo4j_graphrag.experimental.components.resolver.SpaCySemanticMatchResolver
:members: run

FuzzyMatchResolver
==================

.. autoclass:: neo4j_graphrag.experimental.components.resolver.FuzzyMatchResolver
:members: run

.. _pipeline-section:

5 changes: 4 additions & 1 deletion docs/source/index.rst
Expand Up @@ -99,7 +99,10 @@ List of extra dependencies:
- **qdrant**: store vectors in Qdrant
- **experimental**: experimental features mainly from the Knowledge Graph creation pipelines.
- Warning: this requires `pygraphviz`. Installation instructions can be found `here <https://pygraphviz.github.io/documentation/stable/install.html>`_.

- nlp:
- **spaCy**: load spaCy trained models for NLP pipelines, used by the `SpaCySemanticMatchResolver` component from the Knowledge Graph creation pipelines.
- fuzzy-matching:
- **rapidfuzz**: apply fuzzy matching using string similarity, used by the `FuzzyMatchResolver` component from the Knowledge Graph creation pipelines.

********
Examples
21 changes: 16 additions & 5 deletions docs/source/user_guide_kg_builder.rst
Expand Up @@ -1028,22 +1028,33 @@ without making assumptions about entity similarity. The Entity Resolver
is responsible for refining the created knowledge graph by merging entity
nodes that represent the same real-world object.

In practice, this package implements a simple resolver that merges nodes
with the same label and identical "name" property.
In practice, this package implements three resolvers:

- a simple resolver that merges nodes with the same label and identical "name" property;
- two similarity-based resolvers that merge nodes with the same label and a similar set of textual properties (by default they use the "name" property):

- a semantic match resolver, which is based on spaCy embeddings and the cosine similarity of embedding vectors. This resolver favors resolution quality and relies on static embeddings.
- a fuzzy match resolver, which uses RapidFuzz for fast fuzzy string matching based on the Levenshtein distance. This resolver offers faster ingestion by comparing strings directly, at the potential cost of resolution precision.
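
To make the idea of string-similarity matching concrete, here is a minimal sketch in plain Python. It is not the code of `FuzzyMatchResolver` and it does not call RapidFuzz: the helper names (`levenshtein`, `similarity`, `similar_pairs`) are hypothetical, and the score is assumed to be one minus the edit distance divided by the length of the longer string. The 0.8 threshold mirrors the default mentioned in the example files of this change.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def similar_pairs(
    names: Sequence[str], threshold: float = 0.8
) -> List[Tuple[str, str]]:
    """Return pairs of names whose case-insensitive similarity reaches the threshold."""
    return [
        (x, y)
        for x, y in combinations(names, 2)
        if similarity(x.lower(), y.lower()) >= threshold
    ]
```

With this sketch, `similar_pairs(["Neo4j Inc", "Neo4j Inc.", "Acme"])` returns `[("Neo4j Inc", "Neo4j Inc.")]`: the two spellings differ by one character out of ten (similarity 0.9), while "Acme" stays well below the threshold. A real resolver would additionally restrict comparisons to nodes sharing a label, as described above.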

.. warning::

The `SinglePropertyExactMatchResolver` **replaces** the nodes created by the KG writer.
- The `SinglePropertyExactMatchResolver`, `SpaCySemanticMatchResolver`, and `FuzzyMatchResolver` **replace** the nodes created by the KG writer.

- Check the :ref:`installation` section to make sure you have the required dependencies installed when using `SpaCySemanticMatchResolver` or `FuzzyMatchResolver`.


It can be used like this:
The resolvers can be used like this:

.. code:: python

from neo4j_graphrag.experimental.components.resolver import (
SinglePropertyExactMatchResolver,
# SpaCySemanticMatchResolver,
# FuzzyMatchResolver,
)
resolver = SinglePropertyExactMatchResolver(driver)
resolver = SinglePropertyExactMatchResolver(driver) # exact match resolver
# resolver = SpaCySemanticMatchResolver(driver) # semantic match with spaCy
# resolver = FuzzyMatchResolver(driver) # fuzzy match with RapidFuzz
res = await resolver.run()

.. warning::
3 changes: 2 additions & 1 deletion examples/README.md
Expand Up @@ -127,8 +127,9 @@ are listed in [the last section of this file](#customize).
- [Neo4j writer](./customize/build_graph/components/writers/neo4j_writer.py)
- [Custom](./customize/build_graph/components/writers/custom_writer.py)
- Entity Resolver:
- [SinglePropertyExactMatchResolver](./customize/build_graph/components/resolvers/simple_entity_resolver.py)
- [FuzzyMatchResolver](./customize/build_graph/components/resolvers/fuzzy_match_entity_resolver_pre_filter.py)
- [SinglePropertyExactMatchResolver with pre-filter](./customize/build_graph/components/resolvers/simple_entity_resolver_pre_filter.py)
- [SpaCySemanticMatchResolver with pre-filter](./customize/build_graph/components/resolvers/spacy_entity_resolver_pre_filter.py)
- [Custom resolver](./customize/build_graph/components/resolvers/custom_resolver.py)
- [Custom component](./customize/build_graph/components/custom_component.py)

@@ -0,0 +1,36 @@
"""The FuzzyMatchResolver merges nodes with the same label
and similar textual properties (by default using the "name" property) based on RapidFuzz
string matching.

If the resolution is intended to be applied only on some nodes, for instance nodes that
belong to a specific document, a "WHERE" query can be added. The only variable in the
query scope is "entity".

WARNING: this process is destructive. The initial nodes are deleted and replaced
by the resolved ones, but all relationships are kept.
See the apoc.refactor.mergeNodes documentation for more details.
"""

from neo4j_graphrag.experimental.components.resolver import (
FuzzyMatchResolver,
)
from neo4j_graphrag.experimental.components.types import ResolutionStats

import neo4j


async def main(driver: neo4j.Driver) -> None:
    resolver = FuzzyMatchResolver(
        driver,
        # let's filter all entities that belong to a certain docId
        filter_query="WHERE (entity)-[:FROM_CHUNK]->(:Chunk)-[:FROM_DOCUMENT]->"
        "(:Document {id: 'docId'})",
        # optionally, change the properties used for resolution (default is "name")
        # resolve_properties=["name", "ssn"],
        # the similarity threshold (default is 0.8)
        # similarity_threshold=0.9,
        # and the neo4j database where data is updated
        # neo4j_database="neo4j",
    )
    res: ResolutionStats = await resolver.run()
    print(res)

This file was deleted.

@@ -0,0 +1,37 @@
"""The SpaCySemanticMatchResolver merges nodes with the same label
and similar textual properties (by default using the "name" property) based on spaCy
embeddings and the cosine similarity of embedding vectors.

If the resolution is intended to be applied only on some nodes, for instance nodes that
belong to a specific document, a "WHERE" query can be added. The only variable in the
query scope is "entity".

WARNING: this process is destructive. The initial nodes are deleted and replaced
by the resolved ones, but all relationships are kept.
See the apoc.refactor.mergeNodes documentation for more details.
"""

import neo4j
from neo4j_graphrag.experimental.components.resolver import (
SpaCySemanticMatchResolver,
)
from neo4j_graphrag.experimental.components.types import ResolutionStats


async def main(driver: neo4j.Driver) -> None:
    resolver = SpaCySemanticMatchResolver(
        driver,
        # let's filter all entities that belong to a certain docId
        filter_query="WHERE (entity)-[:FROM_CHUNK]->(:Chunk)-[:FROM_DOCUMENT]->"
        "(:Document {id: 'docId'})",
        # optionally, change the properties used for resolution (default is "name")
        # resolve_properties=["name", "ssn"],
        # the similarity threshold (default is 0.8)
        # similarity_threshold=0.9,
        # the spaCy trained model (default is "en_core_web_lg")
        # spacy_model="en_core_web_sm",
        # and the neo4j database where data is updated
        # neo4j_database="neo4j",
    )
    res: ResolutionStats = await resolver.run()
    print(res)
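
The semantic resolver's docstring refers to cosine similarities of embedding vectors. As a rough illustration of that comparison only, the sketch below computes cosine similarity in plain Python and applies the 0.8 default threshold mentioned in the comments above. The function names (`cosine_similarity`, `is_match`) and the toy vectors are hypothetical; the actual resolver obtains its vectors from a spaCy model, which is not reproduced here.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def is_match(u: Sequence[float], v: Sequence[float], threshold: float = 0.8) -> bool:
    """Treat two embeddings as the same entity when similarity reaches the threshold."""
    return cosine_similarity(u, v) >= threshold
```

Vectors pointing in the same direction score 1.0 regardless of magnitude, orthogonal vectors score 0.0, and raising the threshold makes merging more conservative.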