Commit e1941ed

New schema and pruning (neo4j#347)
* Update schema definition
* Add graph pruning component and tests (WIP)
* Cleaning
* Add pruner to SimpleKGPipeline
* Add test for relationship enforcement
* Change return model to have some stats about pruned objects
* We need to filter out relationships if start/end node is not valid in all cases (additional_relationship_types or not)
* Do not filter based on patterns if relationship type not in schema and additional_relationship_types is allowed
* Raise proper error type
* Ruff/mypy
* Add e2e test for graph pruning component
* Mypy
* Update changelog and doc
* Mypy
* ChatGPT was wrong
* Change edge case behaviour
* Fix doc
* Update doc
* Fix condition
* Remove incomplete comments
* More pruning stats
* Typo
* Remove default value for consistency
* Add a section to the doc
* Mypy checks
1 parent 987abf6 commit e1941ed

16 files changed: +1637 -712 lines changed

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
 
 #### Strict mode
 
-- Strict mode in `SimpleKGPipeline`: now properties and relationships are pruned only if they are defined in the input schema.
+- Strict mode in `SimpleKGPipeline`: the `enforce_schema` option is removed and replaced by a schema-driven pruning.
 
 #### Schema definition
 
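
With `enforce_schema` removed, the old strict behaviour would presumably be expressed through the schema itself. The sketch below shows what such a schema dictionary might look like. It uses only option names that appear elsewhere in this commit (`required`, `additional_properties`, `additional_node_types`, `additional_relationship_types`, `additional_patterns`). Whether the dictionary form accepts every one of these keys, as the `GraphSchema` model does, is an assumption, and `STRICT_SCHEMA` is a hypothetical name.

```python
# Hypothetical "strict" schema: every additional_* option is switched off,
# so anything the LLM extracts outside the schema would be pruned.
STRICT_SCHEMA = {
    "node_types": [
        {
            "label": "Person",
            "properties": [
                # A node missing a required property is pruned.
                {"name": "name", "type": "STRING", "required": True},
            ],
            # Assumption: the dict form accepts this node-level key.
            "additional_properties": False,
        },
    ],
    "relationship_types": ["KNOWS"],
    "patterns": [("Person", "KNOWS", "Person")],
    "additional_node_types": False,
    "additional_relationship_types": False,
    "additional_patterns": False,
}
```

All three schema-level options default to `True` according to the user guide changes in this commit, so leaving them out keeps every extracted element.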

docs/source/user_guide_kg_builder.rst

Lines changed: 65 additions & 39 deletions
@@ -73,10 +73,10 @@ Customizing the SimpleKGPipeline
 Graph Schema
 ------------
 
-It is possible to guide the LLM by supplying a list of node and relationship types,
-and instructions on how to connect them (patterns). However, note that the extracted graph
-may not fully adhere to these guidelines unless schema enforcement is enabled
-(see :ref:`Schema Enforcement Behaviour`). Node and relationship types can be represented
+It is possible to guide the LLM by supplying a list of node and relationship types (
+with, optionally, a list of their expected properties)
+and instructions on how to connect them (patterns).
+Node and relationship types can be represented
 as either simple strings (for their labels) or dictionaries. If using a dictionary,
 it must include a label key and can optionally include description and properties keys,
 as shown below:
@@ -90,7 +90,7 @@ as shown below:
         # such as a description:
         {"label": "House", "description": "Family the person belongs to"},
         # or a list of properties the LLM will try to attach to the entity:
-        {"label": "Planet", "properties": [{"name": "weather", "type": "STRING"}]},
+        {"label": "Planet", "properties": [{"name": "name", "type": "STRING", "required": True}, {"name": "weather", "type": "STRING"}]},
     ]
     # same thing for relationships:
     RELATIONSHIP_TYPES = [
@@ -124,7 +124,8 @@ This schema information can be provided to the `SimpleKGBuilder` as demonstrated
         schema={
             "node_types": NODE_TYPES,
             "relationship_types": RELATIONSHIP_TYPES,
-            "patterns": PATTERNS
+            "patterns": PATTERNS,
+            "additional_node_types": False,
         },
         # ...
     )
@@ -145,7 +146,6 @@ They are also accessible via the `SimpleKGPipeline` interface.
         # ...
         prompt_template="",
         lexical_graph_config=my_config,
-        enforce_schema="STRICT"
         on_error="RAISE",
         # ...
     )
@@ -878,38 +878,6 @@ It can be used in this way:
 
 The LLM to use can be customized, the only constraint is that it obeys the :ref:`LLMInterface <llminterface>`.
 
-Schema Enforcement Behaviour
-----------------------------
-.. _schema-enforcement-behaviour:
-
-By default, even if a schema is provided to guide the LLM in the entity and relation extraction, the LLM response is not validated against that schema.
-This behaviour can be changed by using the `enforce_schema` flag in the `LLMEntityRelationExtractor` constructor:
-
-.. code:: python
-
-    from neo4j_graphrag.experimental.components.entity_relation_extractor import LLMEntityRelationExtractor
-    from neo4j_graphrag.experimental.components.types import SchemaEnforcementMode
-
-    extractor = LLMEntityRelationExtractor(
-        # ...
-        enforce_schema=SchemaEnforcementMode.STRICT,
-    )
-
-In this scenario, any extracted node/relation/property that is not part of the provided schema will be pruned.
-Any relation whose start node or end node does not conform to the provided tuple in `potential_schema` will be pruned.
-If a relation start/end nodes are valid but the direction is incorrect, the latter will be inverted.
-If a node is left with no properties, it will be also pruned.
-
-.. note::
-
-    If the input schema lacks a certain type of information, pruning is skipped.
-    For example, if an entity is defined only by a label and has no properties,
-    property pruning is not performed and all properties returned by the LLM are kept.
-
-
-.. warning::
-
-    Note that if the schema enforcement mode is on but the schema is not provided, no schema enforcement will be applied.
 
 Error Behaviour
 ---------------
@@ -1017,6 +985,64 @@ If more customization is needed, it is possible to subclass the `EntityRelationE
 See :ref:`entityrelationextractor`.
 
 
+Schema Guidance and Graph Filtering
+===================================
+
+The provided schema serves as a guiding structure for the language model during graph construction. However, it does not impose strict constraints on the model's output. As a result, the model may generate additional node labels, relationship types, or properties that are not explicitly defined in the schema.
+
+By default, all extracted elements — including nodes, relationships, and properties — are retained in the constructed graph. This behavior can be configured using the following schema options:
+(see :ref:`graphschema`)
+
+
+Configuration Options
+---------------------
+
+- **Required Properties**
+  Required properties may be specified at the node or relationship type level. Any extracted node or relationship missing one or more of its required properties will be pruned from the graph.
+
+- **Additional Properties** *(default: True)*
+  This node- or relationship-level option determines whether extra properties not listed in the schema should be retained.
+
+  - If set to ``True`` (default), all extracted properties are retained.
+  - If set to ``False``, only the properties defined in the schema are preserved; all others are removed.
+
+
+.. note:: Node pruning
+
+    If, after property pruning using the above rule, a node is left without any property, it is removed from the graph.
+
+
+- **Additional Node Types** *(default: True)*
+  This schema-level option specifies whether node types not defined in the schema are included in the graph.
+
+  - If set to ``True`` (default), such node types are retained.
+  - If set to ``False``, nodes with undefined types are removed.
+
+- **Additional Relationship Types** *(default: True)*
+  This schema-level option specifies whether relationship types not defined in the schema are included in the graph.
+
+  - If set to ``True`` (default), such relationships are retained.
+  - If set to ``False``, relationships with undefined types are removed.
+
+- **Additional Patterns** *(default: True)*
+  This schema-level option determines whether relationship patterns not explicitly listed in the schema are allowed.
+
+  - If set to ``True`` (default), all patterns are retained.
+  - If set to ``False``, only patterns defined in the schema are kept. **Note** `additional_relationship_types` must also be `False`.
+
+
+
+Enforcement rules
+_________________
+
+In addition to the user-defined configuration options described above,
+the `GraphPruning` component performs the following cleanup operations:
+
+- Nodes with missing required properties are pruned.
+- Nodes with no remaining properties are pruned.
+- Relationships with invalid source or target nodes (i.e., nodes no longer present in the graph) are pruned.
+- Relationships with incorrect direction have their direction corrected.
+
 .. _kg-writer-section:
 
 Knowledge Graph Writer
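
The node-level options and enforcement rules added to the user guide in this hunk can be read as a per-node decision procedure. What follows is a minimal sketch of that reading in plain Python, not the `GraphPruning` implementation. `NodeRule` and `prune_node` are hypothetical names, and the order in which the checks run is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRule:
    """Hypothetical stand-in for one node type of the schema."""

    required: set = field(default_factory=set)
    optional: set = field(default_factory=set)
    additional_properties: bool = True


def prune_node(label, properties, rules, additional_node_types=True):
    """Return the properties to keep, or None if the node is pruned."""
    rule = rules.get(label)
    if rule is None:
        # Label not defined in the schema.
        if not additional_node_types:
            return None
        kept = dict(properties)
    else:
        if rule.additional_properties:
            kept = dict(properties)
        else:
            # Keep only the properties listed in the schema.
            allowed = rule.required | rule.optional
            kept = {k: v for k, v in properties.items() if k in allowed}
        if rule.required - set(kept):
            return None  # at least one required property is missing
    # A node left without any property is removed.
    return kept or None


rules = {
    "Person": NodeRule(
        required={"firstName", "lastName"},
        optional={"age"},
        additional_properties=False,
    )
}
john = prune_node(
    "Person",
    {"firstName": "John", "lastName": "Doe", "occupation": "employee"},
    rules,
)
jane = prune_node("Person", {"firstName": "Jane"}, rules)
planet = prune_node("Planet", {"name": "Mars"}, rules, additional_node_types=False)
print(john)    # {'firstName': 'John', 'lastName': 'Doe'}
print(jane)    # None
print(planet)  # None
```

In this sketch the unlisted `occupation` property is dropped, the node missing `lastName` is pruned, and the undefined `Planet` label is pruned because additional node types are switched off.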

examples/README.md

Lines changed: 1 addition & 0 deletions
@@ -128,6 +128,7 @@ are listed in [the last section of this file](#customize).
     - [LLM-based](./customize/build_graph/components/extractors/llm_entity_relation_extractor.py)
     - [LLM-based with custom prompt](./customize/build_graph/components/extractors/llm_entity_relation_extractor_with_custom_prompt.py)
     - [Custom](./customize/build_graph/components/extractors/custom_extractor.py)
+    - [Graph Pruner](./customize/build_graph/components/pruners/graph_pruner.py)
 - Knowledge Graph Writer:
     - [Neo4j writer](./customize/build_graph/components/writers/neo4j_writer.py)
     - [Custom](./customize/build_graph/components/writers/custom_writer.py)

examples/customize/build_graph/components/pruners/graph_pruner.py

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
+"""This example demonstrates how to use the GraphPruner component."""
+
+import asyncio
+
+from neo4j_graphrag.experimental.components.graph_pruning import GraphPruning
+from neo4j_graphrag.experimental.components.schema import (
+    GraphSchema,
+    NodeType,
+    PropertyType,
+    RelationshipType,
+)
+from neo4j_graphrag.experimental.components.types import (
+    Neo4jGraph,
+    Neo4jNode,
+    Neo4jRelationship,
+)
+
+graph = Neo4jGraph(
+    nodes=[
+        Neo4jNode(
+            id="Person/John",
+            label="Person",
+            properties={
+                "firstName": "John",
+                "lastName": "Doe",
+                "occupation": "employee",
+            },
+        ),
+        Neo4jNode(
+            id="Person/Jane",
+            label="Person",
+            properties={
+                "firstName": "Jane",
+            },
+        ),
+        Neo4jNode(
+            id="Person/Jack",
+            label="Person",
+            properties={"firstName": "Jack", "lastName": "Dae"},
+        ),
+        Neo4jNode(
+            id="Organization/Corp1",
+            label="Organization",
+            properties={"name": "CorpA"},
+        ),
+    ],
+    relationships=[
+        Neo4jRelationship(
+            start_node_id="Person/John",
+            end_node_id="Person/Jack",
+            type="KNOWS",
+        ),
+        Neo4jRelationship(
+            start_node_id="Organization/CorpA",
+            end_node_id="Person/Jack",
+            type="WORKS_FOR",
+        ),
+        Neo4jRelationship(
+            start_node_id="Person/John",
+            end_node_id="Person/Jack",
+            type="PARENT_OF",
+        ),
+    ],
+)
+
+schema = GraphSchema(
+    node_types=(
+        NodeType(
+            label="Person",
+            properties=[
+                PropertyType(name="firstName", type="STRING", required=True),
+                PropertyType(name="lastName", type="STRING", required=True),
+                PropertyType(name="age", type="INTEGER"),
+            ],
+            additional_properties=False,
+        ),
+        NodeType(
+            label="Organization",
+            properties=[
+                PropertyType(name="name", type="STRING", required=True),
+                PropertyType(name="address", type="STRING"),
+            ],
+        ),
+    ),
+    relationship_types=(
+        RelationshipType(
+            label="WORKS_FOR",
+            properties=[PropertyType(name="since", type="LOCAL_DATETIME")],
+        ),
+        RelationshipType(
+            label="KNOWS",
+        ),
+    ),
+    patterns=(
+        ("Person", "KNOWS", "Person"),
+        ("Person", "WORKS_FOR", "Organization"),
+    ),
+    additional_node_types=False,
+    additional_relationship_types=False,
+    additional_patterns=False,
+)
+
+
+async def main() -> None:
+    pruner = GraphPruning()
+    res = await pruner.run(graph, schema)
+    print("=" * 20, "FINAL CLEANED GRAPH:", "=" * 20)
+    print(res.graph)
+    print("=" * 20, "PRUNED ITEM:", "=" * 20)
+    print(res.pruning_stats)
+    print("-" * 10, "PRUNED NODES:")
+    for node in res.pruning_stats.pruned_nodes:
+        print(
+            node.item.label,
+            "with properties",
+            node.item.properties,
+            "pruned because",
+            node.pruned_reason,
+            node.metadata,
+        )
+    print("-" * 10, "PRUNED RELATIONSHIPS:")
+    for rel in res.pruning_stats.pruned_relationships:
+        print(rel.item.type, "pruned because", rel.pruned_reason)
+    print("-" * 10, "PRUNED PROPERTIES:")
+    for prop in res.pruning_stats.pruned_properties:
+        print(
+            prop.item,
+            "from node label",
+            prop.label,
+            "pruned because",
+            prop.pruned_reason,
+        )
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
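
To see which of the example's three relationships would be expected to survive, the relationship rules from the user guide and the commit message can be sketched in plain Python. This is an illustration under stated assumptions, not the `GraphPruning` code. `prune_relationship` is a hypothetical helper, relationships are simplified to `(start_id, type, end_id)` tuples, and trying the reversed pattern before applying `additional_patterns` is an assumption about ordering.

```python
def prune_relationship(
    rel,
    node_labels,
    relationship_types,
    patterns,
    additional_relationship_types=True,
    additional_patterns=True,
):
    """Return the relationship to keep (possibly reversed), or None if pruned."""
    start, rel_type, end = rel
    # Both endpoints must still exist in the graph, whatever the options say.
    if start not in node_labels or end not in node_labels:
        return None
    if rel_type not in relationship_types:
        # Type not in the schema: patterns are not checked for it.
        return rel if additional_relationship_types else None
    start_label, end_label = node_labels[start], node_labels[end]
    if (start_label, rel_type, end_label) in patterns:
        return rel
    if (end_label, rel_type, start_label) in patterns:
        return (end, rel_type, start)  # direction corrected
    return rel if additional_patterns else None


# Nodes assumed to survive node pruning in the example above
# (Person/Jane lacks the required lastName property).
node_labels = {
    "Person/John": "Person",
    "Person/Jack": "Person",
    "Organization/Corp1": "Organization",
}
relationship_types = {"WORKS_FOR", "KNOWS"}
patterns = {("Person", "KNOWS", "Person"), ("Person", "WORKS_FOR", "Organization")}
strict = {"additional_relationship_types": False, "additional_patterns": False}

relationships = [
    ("Person/John", "KNOWS", "Person/Jack"),
    ("Organization/CorpA", "WORKS_FOR", "Person/Jack"),  # start id is not a node
    ("Person/John", "PARENT_OF", "Person/Jack"),  # type not in the schema
]
results = [
    prune_relationship(r, node_labels, relationship_types, patterns, **strict)
    for r in relationships
]
for rel, kept in zip(relationships, results):
    print(rel[1], "->", kept)
```

Under these assumptions only `KNOWS` is kept. The `WORKS_FOR` relationship starts at `Organization/CorpA`, while the organization node's id is `Organization/Corp1`, so it fails the endpoint check before any direction correction is considered.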
