
Commit 9391662

stellasia and willtai authored
Ability to create lexical graph only (#127)
* Lexical graph component - code copied
* Use new LexicalGraphBuilder component in entity_relation_extractor.py
* Add tests for LexicalGraphBuilder
* Update documentation and CHANGELOG.md
* Create LexicalGraphConfig model
* Deprecate 'create_lexical_graph' parameter in ERExtractor - add e2e tests
* Fix imports in example - remove constants file (imported only from one location)
* Fix example
* Ruffify
* Ruff
* Reorder constant definition to match the config below
* There is no need to deprecate things at this stage - to be discussed
* Fix e2e test
* Fix links in doc
* Renaming
* Add optional lexical graph config parameter for KG writer
* Fix examples
* Update doc
* Update changelog
* Update docs/source/user_guide_kg_builder.rst
  Co-authored-by: willtai <wtaisen@gmail.com>
* Copyright header was missing
* Improve doc
* Typo
* Improve description of lexical graph
* ChatGPT-fy the doc + remove duplicates by adding links to the user guide when appropriate

---------

Co-authored-by: willtai <wtaisen@gmail.com>
1 parent 4580d5f commit 9391662

File tree

16 files changed: +814 −213 lines changed


CHANGELOG.md

Lines changed: 3 additions & 1 deletion
@@ -5,6 +5,7 @@
 ### Added
 - Made `relations` and `potential_schema` optional in `SchemaBuilder`.
 - Added a check to prevent the use of deprecated Cypher syntax for Neo4j versions 5.23.0 and above.
+- Added a `LexicalGraphBuilder` component to enable the import of the lexical graph (document, chunks) without performing entity and relation extraction.

 ### Changed
 - Vector and Hybrid retrievers used with `return_properties` now also return the node labels (`nodeLabels`) and the node's element ID (`id`).
@@ -100,7 +101,8 @@
 ### IMPORTANT NOTICE
 - The `neo4j-genai` package is now deprecated. Users are advised to switch to the new package `neo4j-graphrag`.
 ### Added
-- Ability to visualise pipeline with `my_pipeline.draw("pipeline.png")`
+- Ability to visualise pipeline with `my_pipeline.draw("pipeline.png")`.
+- `LexicalGraphBuilder` component to create the lexical graph without entity-relation extraction.

 ### Fixed
 - Pipelines now return correct results when the same pipeline is run in parallel.

docs/source/api.rst

Lines changed: 8 additions & 1 deletion
@@ -51,6 +51,13 @@ TextChunkEmbedder
 .. autoclass:: neo4j_graphrag.experimental.components.embedder.TextChunkEmbedder
     :members: run

+LexicalGraphBuilder
+===================
+
+.. autoclass:: neo4j_graphrag.experimental.components.lexical_graph.LexicalGraphBuilder
+    :members:
+    :exclude-members: component_inputs, component_outputs
+
 SchemaBuilder
 =============

@@ -62,7 +69,7 @@ EntityRelationExtractor

 .. autoclass:: neo4j_graphrag.experimental.components.entity_relation_extractor.EntityRelationExtractor
     :members:
-    :undoc-members: component_inputs, component_outputs
+    :exclude-members: component_inputs, component_outputs

 LLMEntityRelationExtractor
 ==========================

docs/source/user_guide_kg_builder.rst

Lines changed: 54 additions & 9 deletions
@@ -22,6 +22,7 @@ A Knowledge Graph (KG) construction pipeline requires a few components:
 - **Document chunker**: split the text into smaller pieces of text, manageable by the LLM context window (token limit).
 - **Chunk embedder** (optional): compute the chunk embeddings.
 - **Schema builder**: provide a schema to ground the LLM extracted entities and relations and obtain an easily navigable KG.
+- **LexicalGraphBuilder**: build the lexical graph (Document, Chunk and their relationships) (optional).
 - **Entity and relation extractor**: extract relevant entities and relations from the text.
 - **Knowledge Graph writer**: save the identified entities and relations.
 - **Entity resolver**: merge similar entities into a single node.
@@ -166,11 +167,43 @@ Example usage:
     os.environ["OPENAI_API_KEY"] = "sk-..."


-If OpenAI is not an option, see :ref:`embedders` to learn how to use sentence-transformers or create your own embedder.
+If OpenAI is not an option, see :ref:`embedders` to learn how to use other supported embedders.

 The embeddings are added to each chunk metadata, and will be saved as a Chunk node property in the graph if
 `create_lexical_graph` is enabled in the `EntityRelationExtractor` (keep reading).

+.. _lexical-graph-builder:
+
+Lexical Graph Builder
+=====================
+
+Once the chunks are extracted and embedded (if required), a graph can be created.
+
+The **lexical graph** contains:
+
+- `Document` node: represent the processed document and have a `path` property.
+- `Chunk` nodes: represent the text chunks. They have a `text` property and, if computed, an `embedding` property.
+- `NEXT_CHUNK` relationships between one chunk node and the next one in the document. It can be used to enhance the context in a RAG application.
+- `FROM_DOCUMENT` relationship between each chunk and the document it was built from.
+
+Example usage:
+
+.. code:: python
+
+    from neo4j_graphrag.experimental.pipeline.components.lexical_graph_builder import LexicalGraphBuilder
+    from neo4j_graphrag.experimental.pipeline.components.types import LexicalGraphConfig
+
+    lexical_graph_builder = LexicalGraphBuilder(config=LexicalGraphConfig(id_prefix="example"))
+    graph = await lexical_graph_builder.run(
+        text_chunks=TextChunks(chunks=[
+            TextChunk(text="some text", index=0),
+            TextChunk(text="some text", index=1),
+        ]),
+        document_info=DocumentInfo(path="my_document.pdf"),
+    )
+
+See :ref:`kg-writer-section` to learn how to write the resulting nodes and relationships to Neo4j.
+

 Schema Builder
 ==============
@@ -292,17 +325,12 @@ This behaviour can be changed by using the `on_error` flag in the `LLMEntityRela
 In this scenario, any failing chunk will make the whole pipeline fail (for all chunks), and no data
 will be saved to Neo4j.

+.. _lexical-graph-in-er-extraction:

 Lexical Graph
 -------------

-By default, the `LLMEntityRelationExtractor` adds some extra nodes and relationships to the extracted graph:
-
-- `Document` node: represent the processed document and have a `path` property.
-- `Chunk` nodes: represent the text chunks. They have a `text` property and, if computed, an `embedding` property.
-- `NEXT_CHUNK` relationships between one chunk node and the next one in the document. It can be used to enhance the context in a RAG application.
-- `FROM_CHUNK` relationship between any extracted entity and the chunk it has been identified into.
-- `FROM_DOCUMENT` relationship between each chunk and the document it was built from.
+By default, the `LLMEntityRelationExtractor` also creates the :ref:`lexical graph<lexical-graph-builder>`.

 If this 'lexical graph' is not desired, set the `created_lexical_graph` to `False` in the extractor constructor:

@@ -314,6 +342,21 @@ If this 'lexical graph' is not desired, set the `created_lexical_graph` to `Fals
     )


+.. note::
+
+    - If `self.create_lexical_graph` is set to `True`, the complete lexical graph
+      will be created, including the document and chunk nodes, along with the relationships
+      between entities and the chunk they were extracted from.
+    - If `self.create_lexical_graph` is set to `False` but `lexical_graph_config`
+      is provided, the document and chunk nodes won't be created. However, relationships
+      between chunks and the entities extracted from them will still be added to the graph.
+
+.. warning::
+
+    If omitting `self.create_lexical_graph` and the chunk does not exist,
+    this will result in no relationship being created in the database by the writer.
+
+
 Customizing the Prompt
 ----------------------

@@ -368,6 +411,8 @@ If more customization is needed, it is possible to subclass the `EntityRelationE
 See :ref:`entityrelationextractor`.


+.. _kg-writer-section:
+
 Knowledge Graph Writer
 ======================

@@ -421,7 +466,7 @@ It is possible to create a custom writer using the `KGWriter` interface:

 .. note::

-    The `validate_call` decorator is required when the input parameter contain a `pydantic` model.
+    The `validate_call` decorator is required when the input parameter contain a `Pydantic` model.


 See :ref:`kgwritermodel` and :ref:`kgwriter` in API reference.
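The two new guide sections above (lexical graph builder and KG writer) can be exercised together. Below is a minimal, hedged sketch of that flow, using the import paths shown in this commit's example files rather than the guide snippet. It assumes that the builder's result exposes a `graph` attribute and that `Neo4jWriter.run` accepts the optional `lexical_graph_config` parameter introduced here (both suggested by the pipeline example further down, not verified against a released API):

import asyncio

import neo4j
from neo4j_graphrag.experimental.components.kg_writer import Neo4jWriter
from neo4j_graphrag.experimental.components.lexical_graph import LexicalGraphBuilder
from neo4j_graphrag.experimental.components.pdf_loader import DocumentInfo
from neo4j_graphrag.experimental.components.types import (
    LexicalGraphConfig,
    TextChunk,
    TextChunks,
)


async def build_and_write_lexical_graph(driver: neo4j.Driver) -> None:
    # build the lexical graph for two small chunks of a hypothetical document
    config = LexicalGraphConfig(id_prefix="example")
    builder = LexicalGraphBuilder(config=config)
    result = await builder.run(
        text_chunks=TextChunks(
            chunks=[
                TextChunk(text="some text", index=0),
                TextChunk(text="some text", index=1),
            ]
        ),
        document_info=DocumentInfo(path="my_document.pdf"),
    )
    # hand the same config to the writer so it knows which labels and
    # relationship types the lexical graph uses (assumed signature)
    writer = Neo4jWriter(driver)
    await writer.run(graph=result.graph, lexical_graph_config=config)


if __name__ == "__main__":
    with neo4j.GraphDatabase.driver(
        "bolt://localhost:7687", auth=("neo4j", "password")
    ) as driver:
        asyncio.run(build_and_write_lexical_graph(driver))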

examples/customize/build_graph/components/extractors/custom_extractor.py

Lines changed: 6 additions & 1 deletion
@@ -5,7 +5,11 @@
     OnError,
 )
 from neo4j_graphrag.experimental.components.pdf_loader import DocumentInfo
-from neo4j_graphrag.experimental.components.types import Neo4jGraph, TextChunks
+from neo4j_graphrag.experimental.components.types import (
+    LexicalGraphConfig,
+    Neo4jGraph,
+    TextChunks,
+)


 class MyExtractor(EntityRelationExtractor):
@@ -27,6 +31,7 @@ async def run(
         self,
         chunks: TextChunks,
         document_info: Optional[DocumentInfo] = None,
+        lexical_graph_config: Optional[LexicalGraphConfig] = None,
         **kwargs: Any,
     ) -> Neo4jGraph:
         # Implement your logic here
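Put together, a custom extractor that accepts the new parameter might look like the sketch below. Assumptions: the import paths shown in this diff, and that `Neo4jGraph` accepts empty `nodes` and `relationships` lists; the body is a placeholder, not this repository's implementation:

from typing import Any, Optional

from neo4j_graphrag.experimental.components.entity_relation_extractor import (
    EntityRelationExtractor,
)
from neo4j_graphrag.experimental.components.pdf_loader import DocumentInfo
from neo4j_graphrag.experimental.components.types import (
    LexicalGraphConfig,
    Neo4jGraph,
    TextChunks,
)


class MyExtractor(EntityRelationExtractor):
    async def run(
        self,
        chunks: TextChunks,
        document_info: Optional[DocumentInfo] = None,
        lexical_graph_config: Optional[LexicalGraphConfig] = None,
        **kwargs: Any,
    ) -> Neo4jGraph:
        # a custom extractor may ignore lexical_graph_config, but honouring it
        # keeps entity-to-chunk links consistent with the lexical graph that
        # LexicalGraphBuilder (or the LLM extractor) has already created
        return Neo4jGraph(nodes=[], relationships=[])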
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+from neo4j_graphrag.experimental.components.lexical_graph import (
+    LexicalGraphBuilder,
+)
+from neo4j_graphrag.experimental.components.types import (
+    GraphResult,
+    LexicalGraphConfig,
+    TextChunk,
+    TextChunks,
+)
+
+
+async def main() -> GraphResult:
+    """ """
+    # optionally, define a LexicalGraphConfig object
+    # shown below with default values
+    config = LexicalGraphConfig(
+        id_prefix="",  # used to prefix the chunk and document IDs
+        chunk_node_label="Chunk",
+        document_node_label="Document",
+        chunk_to_document_relationship_type="PART_OF_DOCUMENT",
+        next_chunk_relationship_type="NEXT_CHUNK",
+        node_to_chunk_relationship_type="PART_OF_CHUNK",
+        chunk_embedding_property="embeddings",
+    )
+    builder = LexicalGraphBuilder(
+        config=config,  # optional
+    )
+    graph_result = await builder.run(
+        text_chunks=TextChunks(chunks=[TextChunk(text="....", index=0)]),
+        # document_info={"path": "example"},  # uncomment to create a "Document" node
+    )
+    return graph_result
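As a hedged usage note: the returned `GraphResult` appears to expose both the built graph and the config, since the pipeline example further down maps `lexical_graph_builder.graph` and `lexical_graph_builder.config` by name. If the following driver code were appended to the example above, it might look like this (attribute names are assumptions, not confirmed by this diff):

import asyncio

# hypothetical driver code for the example above; "graph" and "config" are
# assumed from the outputs referenced in the pipeline example below
graph_result = asyncio.run(main())
print(graph_result.graph)   # Neo4jGraph with the Chunk nodes and NEXT_CHUNK relationships
print(graph_result.config)  # the LexicalGraphConfig, handy to forward to a KG writer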

examples/customize/build_graph/components/writers/custom_writer.py

Lines changed: 6 additions & 2 deletions
@@ -4,7 +4,7 @@

 import neo4j
 from neo4j_graphrag.experimental.components.kg_writer import KGWriter, KGWriterModel
-from neo4j_graphrag.experimental.components.types import Neo4jGraph
+from neo4j_graphrag.experimental.components.types import LexicalGraphConfig, Neo4jGraph
 from pydantic import validate_call


@@ -13,7 +13,11 @@ def __init__(self, driver: neo4j.Driver) -> None:
         self.driver = driver

     @validate_call
-    async def run(self, graph: Neo4jGraph) -> KGWriterModel:
+    async def run(
+        self,
+        graph: Neo4jGraph,
+        lexical_graph_config: LexicalGraphConfig = LexicalGraphConfig(),
+    ) -> KGWriterModel:
         try:
             self.driver.execute_query("my query")
             return KGWriterModel(status="SUCCESS")
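For illustration, here is a hedged sketch of a custom writer that actually reads the new parameter. The `chunk_node_label` field comes from the `LexicalGraphConfig` defaults shown earlier in this diff; the Cypher query is a made-up placeholder and the "FAILURE" status value is an assumption:

import neo4j
from neo4j_graphrag.experimental.components.kg_writer import KGWriter, KGWriterModel
from neo4j_graphrag.experimental.components.types import LexicalGraphConfig, Neo4jGraph
from pydantic import validate_call


class ChunkCountingWriter(KGWriter):
    """Hypothetical writer that uses the lexical graph config in its query."""

    def __init__(self, driver: neo4j.Driver) -> None:
        self.driver = driver

    @validate_call
    async def run(
        self,
        graph: Neo4jGraph,
        lexical_graph_config: LexicalGraphConfig = LexicalGraphConfig(),
    ) -> KGWriterModel:
        try:
            # the config tells the writer which label the chunk nodes carry
            # ("Chunk" by default, "TextPart" in the pipeline example below)
            chunk_label = lexical_graph_config.chunk_node_label
            self.driver.execute_query(f"MATCH (c:`{chunk_label}`) RETURN count(c)")
            return KGWriterModel(status="SUCCESS")
        except Exception:
            return KGWriterModel(status="FAILURE")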

examples/customize/build_graph/pipeline/kg_builder_from_text.py

Lines changed: 1 addition & 0 deletions
@@ -44,6 +44,7 @@ async def define_and_run_pipeline(
     """This is where we define and run the KG builder pipeline, instantiating a few
     components:
     - Text Splitter: in this example we use the fixed size text splitter
+    - Chunk Embedder: to embed the chunks' text
     - Schema Builder: this component takes a list of entities, relationships and
       possible triplets as inputs, validate them and return a schema ready to use
       for the rest of the pipeline
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
+from __future__ import annotations
+
+import asyncio
+
+import neo4j
+from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings
+from neo4j_graphrag.experimental.components.embedder import TextChunkEmbedder
+from neo4j_graphrag.experimental.components.kg_writer import Neo4jWriter
+from neo4j_graphrag.experimental.components.lexical_graph import LexicalGraphBuilder
+from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import (
+    FixedSizeSplitter,
+)
+from neo4j_graphrag.experimental.components.types import LexicalGraphConfig
+from neo4j_graphrag.experimental.pipeline import Pipeline
+from neo4j_graphrag.experimental.pipeline.pipeline import PipelineResult
+
+
+async def main(neo4j_driver: neo4j.Driver) -> PipelineResult:
+    """This is where we define and run the Lexical Graph builder pipeline, instantiating
+    a few components:
+
+    - Text Splitter: to split the text into manageable chunks of fixed size
+    - Chunk Embedder: to embed the chunks' text
+    - Lexical Graph Builder: to build the lexical graph, ie creating the chunk nodes and relationships between them
+    - KG writer: save the lexical graph to Neo4j
+    """
+    pipe = Pipeline()
+    # define the components
+    pipe.add_component(
+        FixedSizeSplitter(chunk_size=20, chunk_overlap=1),
+        "splitter",
+    )
+    pipe.add_component(TextChunkEmbedder(embedder=OpenAIEmbeddings()), "chunk_embedder")
+    # optional: define some custom node labels for the lexical graph:
+    lexical_graph_config = LexicalGraphConfig(
+        id_prefix="example",
+        chunk_node_label="TextPart",
+    )
+    pipe.add_component(
+        LexicalGraphBuilder(lexical_graph_config),
+        "lexical_graph_builder",
+    )
+    pipe.add_component(Neo4jWriter(neo4j_driver), "writer")
+    # define the execution order of component
+    # and how the output of previous components must be used
+    pipe.connect("splitter", "chunk_embedder", input_config={"text_chunks": "splitter"})
+    pipe.connect(
+        "chunk_embedder",
+        "lexical_graph_builder",
+        input_config={"text_chunks": "chunk_embedder"},
+    )
+    pipe.connect(
+        "lexical_graph_builder",
+        "writer",
+        input_config={
+            "graph": "lexical_graph_builder.graph",
+            "lexical_graph_config": "lexical_graph_builder.config",
+        },
+    )
+    # user input:
+    # the initial text
+    # and the list of entities and relations we are looking for
+    pipe_inputs = {
+        "splitter": {
+            "text": """Albert Einstein was a German physicist born in 1879 who
+            wrote many groundbreaking papers especially about general relativity
+            and quantum mechanics. He worked for many different institutions, including
+            the University of Bern in Switzerland and the University of Oxford."""
+        },
+        "lexical_graph_builder": {
+            "document_info": {
+                # 'path' can be anything
+                "path": "example/lexical_graph_from_text.py"
+            },
+        },
+    }
+    # run the pipeline
+    return await pipe.run(pipe_inputs)
+
+
+if __name__ == "__main__":
+    with neo4j.GraphDatabase.driver(
+        "bolt://localhost:7687", auth=("neo4j", "password")
+    ) as driver:
+        print(asyncio.run(main(driver)))
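Finally, a hedged sketch of the second half of the "lexical graph only" workflow this commit enables: once the pipeline above has written the chunks, an entity-relation extraction run can reuse the same `LexicalGraphConfig` with `create_lexical_graph=False`, so extracted entities are linked to the existing chunk nodes (per the note added to the user guide). The `OpenAILLM` import and the extractor's constructor arguments are assumptions based on the guide excerpts, not code from this commit:

from neo4j_graphrag.experimental.components.entity_relation_extractor import (
    LLMEntityRelationExtractor,
)
from neo4j_graphrag.experimental.components.types import (
    LexicalGraphConfig,
    Neo4jGraph,
    TextChunks,
)
from neo4j_graphrag.llm import OpenAILLM


async def extract_on_existing_lexical_graph(
    chunks: TextChunks,
    lexical_graph_config: LexicalGraphConfig,
) -> Neo4jGraph:
    extractor = LLMEntityRelationExtractor(
        llm=OpenAILLM(model_name="gpt-4o"),
        create_lexical_graph=False,  # the chunk nodes already exist in Neo4j
    )
    # passing the same config lets the extracted entities be linked to the
    # chunk nodes created by the lexical graph pipeline above
    return await extractor.run(
        chunks=chunks,
        lexical_graph_config=lexical_graph_config,
    )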
