Automatic schema extraction from text #331

Merged
Commits
37 commits
7c831de
Add schema extraction prompt template
NathalieCharbel Apr 23, 2025
baf9302
Add schema from text using an LLM
NathalieCharbel Apr 23, 2025
2b14541
Update SimpleKGPipeline for automatic schema extraction
NathalieCharbel Apr 24, 2025
49452d4
Save/Read inferred schema
NathalieCharbel Apr 24, 2025
fa8a6af
Bug fixes
NathalieCharbel Apr 25, 2025
b52bed4
Add unit tests
NathalieCharbel Apr 25, 2025
41d359d
Allow schema parameter in SimpleKGBuilderConfig and refactor code
NathalieCharbel Apr 28, 2025
511bc3e
Update changelog and api rst
NathalieCharbel Apr 28, 2025
212ae0b
Update documentation
NathalieCharbel Apr 28, 2025
30c273d
Fix Changelog after rebase
NathalieCharbel Apr 29, 2025
52a2686
Ruff
NathalieCharbel Apr 29, 2025
b19e57c
Fix mypy issues
NathalieCharbel Apr 29, 2025
4eebee5
Ignore remaining mypy issues (temp)
NathalieCharbel Apr 29, 2025
7088286
Remove unused imports
NathalieCharbel Apr 29, 2025
9d05c76
Fix unit tests
NathalieCharbel Apr 29, 2025
f9a7c8c
Fix component connections
NathalieCharbel Apr 29, 2025
8458b75
Improve default schema extraction prompt and add examples
NathalieCharbel Apr 29, 2025
7558b56
Rename schema from text component
NathalieCharbel Apr 30, 2025
8885e2c
Fix remaining mypy errors
NathalieCharbel May 5, 2025
78633c6
Improve schema from text example
NathalieCharbel May 5, 2025
fef2e49
Ruff
NathalieCharbel May 5, 2025
b412a05
Remove flag for automatic schema extraction
NathalieCharbel May 5, 2025
5183439
Fix unit tests
NathalieCharbel May 6, 2025
d6b3491
Handle cases where LLM outputs a valid JSON array
NathalieCharbel May 6, 2025
3edf0d0
Fix e2e tests
NathalieCharbel May 6, 2025
49c399c
Address PR comments
NathalieCharbel May 6, 2025
bf2fb96
Add examples running SimpleKGPipeline
NathalieCharbel May 7, 2025
ffea761
Add inferred schema json and yaml files example
NathalieCharbel May 12, 2025
2ce0ff9
Improve handling LLM response
NathalieCharbel May 12, 2025
f69eace
Improve handling errors for extracted schema
NathalieCharbel May 12, 2025
89b3d1b
Replace warning logs with real deprecation warnings
NathalieCharbel May 12, 2025
83d90fb
Fix schema unit tests
NathalieCharbel May 12, 2025
29aec54
Ensure proper handling of schema when provided as dict
NathalieCharbel May 13, 2025
4e6d53a
Move example files to the right directories
NathalieCharbel May 13, 2025
48ec9b7
Add custom schema extraction error
NathalieCharbel May 13, 2025
44e76de
Handle invalid format for extracted schema
NathalieCharbel May 13, 2025
6bc46e1
Merge branch 'main' into automatic-schema-extraction-from-text
NathalieCharbel May 13, 2025
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,9 @@

## Next

### Added

- Added support for automatic schema extraction from text using LLMs. In the `SimpleKGPipeline`, when the user provides no schema, the automatic schema extraction is enabled by default.
## 1.7.0

### Added
13 changes: 13 additions & 0 deletions docs/source/api.rst
@@ -77,6 +77,12 @@ SchemaBuilder
.. autoclass:: neo4j_graphrag.experimental.components.schema.SchemaBuilder
    :members: run

SchemaFromTextExtractor
-----------------------

.. autoclass:: neo4j_graphrag.experimental.components.schema.SchemaFromTextExtractor
    :members: run

EntityRelationExtractor
=======================

@@ -362,6 +368,13 @@ ERExtractionTemplate
    :members:
    :exclude-members: format

SchemaExtractionTemplate
------------------------

.. autoclass:: neo4j_graphrag.generation.prompts.SchemaExtractionTemplate
    :members:
    :exclude-members: format

Text2CypherTemplate
--------------------

189 changes: 124 additions & 65 deletions docs/source/user_guide_kg_builder.rst
@@ -21,7 +21,7 @@ A Knowledge Graph (KG) construction pipeline requires a few components (some of
 - **Data loader**: extract text from files (PDFs, ...).
 - **Text splitter**: split the text into smaller pieces of text (chunks), manageable by the LLM context window (token limit).
 - **Chunk embedder** (optional): compute the chunk embeddings.
-- **Schema builder**: provide a schema to ground the LLM extracted entities and relations and obtain an easily navigable KG.
+- **Schema builder**: provide a schema to ground the LLM extracted entities and relations and obtain an easily navigable KG. Schema can be provided manually or extracted automatically using LLMs.
 - **Lexical graph builder**: build the lexical graph (Document, Chunk and their relationships) (optional).
 - **Entity and relation extractor**: extract relevant entities and relations from the text.
 - **Knowledge Graph writer**: save the identified entities and relations.
@@ -75,10 +75,11 @@ Graph Schema

 It is possible to guide the LLM by supplying a list of entities, relationships,
 and instructions on how to connect them. However, note that the extracted graph
-may not fully adhere to these guidelines. Entities and relationships can be
-represented as either simple strings (for their labels) or dictionaries. If using
-a dictionary, it must include a label key and can optionally include description
-and properties keys, as shown below:
+may not fully adhere to these guidelines unless schema enforcement is enabled
+(see :ref:`Schema Enforcement Behaviour`). Entities and relationships can be represented
+as either simple strings (for their labels) or dictionaries. If using a dictionary,
+it must include a label key and can optionally include description and properties keys,
+as shown below:

.. code:: python

@@ -117,6 +118,18 @@ This schema information can be provided to the `SimpleKGBuilder` as demonstrated

.. code:: python

    # Using the schema parameter (recommended approach)
    kg_builder = SimpleKGPipeline(
        # ...
        schema={
            "entities": ENTITIES,
            "relations": RELATIONS,
            "potential_schema": POTENTIAL_SCHEMA
        },
        # ...
    )

    # Using individual schema parameters (deprecated approach)
    kg_builder = SimpleKGPipeline(
        # ...
        entities=ENTITIES,
@@ -125,6 +138,9 @@ This schema information can be provided to the `SimpleKGBuilder` as demonstrated
        # ...
    )

.. note::
    By default, if no schema is provided to the ``SimpleKGPipeline``, the schema is extracted automatically from the text using the LLM (see :ref:`Automatic Schema Extraction`).
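
The three ways of supplying a schema described above (a single `schema` dict, the deprecated individual parameters, or nothing at all) can be sketched in plain Python. This is only an illustration under stated assumptions: `resolve_schema` is a hypothetical helper, not the pipeline's actual code, and the rule that an explicit `schema` takes precedence over the deprecated parameters is assumed rather than taken from the diff.

```python
import warnings
from typing import Any, Optional


def resolve_schema(
    schema: Optional[dict[str, Any]] = None,
    entities: Optional[list[Any]] = None,
    relations: Optional[list[Any]] = None,
    potential_schema: Optional[list[Any]] = None,
) -> Optional[dict[str, Any]]:
    """Hypothetical helper: decide which schema source a pipeline should use.

    Returns a schema dict, or None to signal that the schema should be
    extracted automatically from the text.
    """
    legacy_given = any(
        value is not None for value in (entities, relations, potential_schema)
    )
    if legacy_given:
        # A real DeprecationWarning rather than a log message.
        warnings.warn(
            "'entities', 'relations' and 'potential_schema' are deprecated; "
            "pass a single 'schema' dict instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    if schema is not None:
        # Assumption: an explicit schema wins over the deprecated parameters.
        return schema
    if legacy_given:
        return {
            "entities": entities or [],
            "relations": relations or [],
            "potential_schema": potential_schema or [],
        }
    # Nothing was provided: the caller falls back to automatic extraction.
    return None
```

Returning `None` for the "no schema" case keeps the fallback decision in one place, so the caller only has to check a single value before wiring in an extractor.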

Extra configurations
--------------------

@@ -412,41 +428,44 @@ within the configuration file.
         "neo4j_database": "myDb",
         "on_error": "IGNORE",
         "prompt_template": "...",
-        "entities": [
-            "Person",
-            {
-                "label": "House",
-                "description": "Family the person belongs to",
-                "properties": [
-                    {"name": "name", "type": "STRING"}
-                ]
-            },
-            {
-                "label": "Planet",
-                "properties": [
-                    {"name": "name", "type": "STRING"},
-                    {"name": "weather", "type": "STRING"}
-                ]
-            }
-        ],
-        "relations": [
-            "PARENT_OF",
-            {
-                "label": "HEIR_OF",
-                "description": "Used for inheritor relationship between father and sons"
-            },
-            {
-                "label": "RULES",
-                "properties": [
-                    {"name": "fromYear", "type": "INTEGER"}
-                ]
-            }
-        ],
-        "potential_schema": [
-            ["Person", "PARENT_OF", "Person"],
-            ["Person", "HEIR_OF", "House"],
-            ["House", "RULES", "Planet"]
-        ],
+
+        "schema": {
+            "entities": [
+                "Person",
+                {
+                    "label": "House",
+                    "description": "Family the person belongs to",
+                    "properties": [
+                        {"name": "name", "type": "STRING"}
+                    ]
+                },
+                {
+                    "label": "Planet",
+                    "properties": [
+                        {"name": "name", "type": "STRING"},
+                        {"name": "weather", "type": "STRING"}
+                    ]
+                }
+            ],
+            "relations": [
+                "PARENT_OF",
+                {
+                    "label": "HEIR_OF",
+                    "description": "Used for inheritor relationship between father and sons"
+                },
+                {
+                    "label": "RULES",
+                    "properties": [
+                        {"name": "fromYear", "type": "INTEGER"}
+                    ]
+                }
+            ],
+            "potential_schema": [
+                ["Person", "PARENT_OF", "Person"],
+                ["Person", "HEIR_OF", "House"],
+                ["House", "RULES", "Planet"]
+            ]
+        },
         "lexical_graph_config": {
             "chunk_node_label": "TextPart"
         }
@@ -462,31 +481,34 @@ or in YAML:
     neo4j_database: myDb
     on_error: IGNORE
     prompt_template: ...
-    entities:
-      - label: Person
-      - label: House
-        description: Family the person belongs to
-        properties:
-          - name: name
-            type: STRING
-      - label: Planet
-        properties:
-          - name: name
-            type: STRING
-          - name: weather
-            type: STRING
-    relations:
-      - label: PARENT_OF
-      - label: HEIR_OF
-        description: Used for inheritor relationship between father and sons
-      - label: RULES
-        properties:
-          - name: fromYear
-            type: INTEGER
-    potential_schema:
-      - ["Person", "PARENT_OF", "Person"]
-      - ["Person", "HEIR_OF", "House"]
-      - ["House", "RULES", "Planet"]
+
+    # Using the schema parameter (recommended approach)
+    schema:
+      entities:
+        - Person
+        - label: House
+          description: Family the person belongs to
+          properties:
+            - name: name
+              type: STRING
+        - label: Planet
+          properties:
+            - name: name
+              type: STRING
+            - name: weather
+              type: STRING
+      relations:
+        - PARENT_OF
+        - label: HEIR_OF
+          description: Used for inheritor relationship between father and sons
+        - label: RULES
+          properties:
+            - name: fromYear
+              type: INTEGER
+      potential_schema:
+        - ["Person", "PARENT_OF", "Person"]
+        - ["Person", "HEIR_OF", "House"]
+        - ["House", "RULES", "Planet"]
     lexical_graph_config:
       chunk_node_label: TextPart

@@ -791,6 +813,41 @@ Here is a code block illustrating these concepts:
After validation, this schema is saved in a `SchemaConfig` object, whose dict representation is passed
to the LLM.

Automatic Schema Extraction
---------------------------

Instead of manually defining the schema, you can use the `SchemaFromTextExtractor` component to automatically extract a schema from your text using an LLM:

.. code:: python

    from neo4j_graphrag.experimental.components.schema import SchemaFromTextExtractor
    from neo4j_graphrag.llm import OpenAILLM

    # Create the automatic schema extractor
    schema_extractor = SchemaFromTextExtractor(
        llm=OpenAILLM(
            model_name="gpt-4o",
            model_params={
                "max_tokens": 2000,
                "response_format": {"type": "json_object"},
            },
        )
    )

The `SchemaFromTextExtractor` component analyzes the text and identifies entity types, relationship types, and their property types. It creates a complete `SchemaConfig` object that can be used in the same way as a manually defined schema.
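
Several commits in this PR deal with the shape of the LLM's answer (a valid JSON array instead of an object, an invalid format, a custom extraction error). A minimal sketch of that kind of defensive parsing is shown below. It uses only the standard library; `SchemaParsingError` and `parse_schema_response` are hypothetical names, and unwrapping the first element of an array is an assumed policy, not necessarily what the component does.

```python
import json
from typing import Any


class SchemaParsingError(Exception):
    """Hypothetical error raised when the LLM response cannot be used as a schema."""


def parse_schema_response(raw: str) -> dict[str, Any]:
    """Turn a raw LLM response into a dict with the three schema keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaParsingError(f"LLM response is not valid JSON: {exc}") from exc

    # Assumed policy: if the model returned an array, use its first object.
    if isinstance(data, list):
        if data and isinstance(data[0], dict):
            data = data[0]
        else:
            raise SchemaParsingError(
                "expected a JSON object or a non-empty array of objects"
            )

    if not isinstance(data, dict):
        raise SchemaParsingError(
            f"expected a JSON object, got {type(data).__name__}"
        )

    # Missing keys default to empty lists so downstream code sees a stable shape.
    return {
        "entities": data.get("entities", []),
        "relations": data.get("relations", []),
        "potential_schema": data.get("potential_schema", []),
    }
```

Raising one dedicated exception type lets a caller distinguish "the model answered badly" from unrelated failures and decide whether to retry or stop.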

You can also save and reload the extracted schema:

.. code:: python

    # `schema_config` is the `SchemaConfig` object produced by the extractor.
    # Save the schema to JSON or YAML files
    schema_config.store_as_json("my_schema.json")
    schema_config.store_as_yaml("my_schema.yaml")

    # Later, reload the schema from file
    from neo4j_graphrag.experimental.components.schema import SchemaConfig

    restored_schema = SchemaConfig.from_file("my_schema.json")  # or my_schema.yaml


Entity and Relation Extractor
=============================
@@ -832,6 +889,8 @@ The LLM to use can be customized, the only constraint is that it obeys the :ref:

.. _schema-enforcement-behaviour:

Schema Enforcement Behaviour
----------------------------

By default, even if a schema is provided to guide the LLM in the entity and relation extraction, the LLM response is not validated against that schema.
This behaviour can be changed by using the `enforce_schema` flag in the `LLMEntityRelationExtractor` constructor:
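
Separately from the constructor usage, the idea of validating an extraction result against a schema can be illustrated without the library. The sketch below is an assumption about what enforcement could involve, not the behaviour of `enforce_schema` itself: `prune_to_schema` and the `nodes`/`relationships` dict layout are hypothetical, chosen only to mirror the `entities`, `relations` and `potential_schema` keys used in this guide.

```python
from typing import Any


def _label_of(item: Any) -> str:
    # Schema items are either plain strings or dicts with a "label" key.
    return item if isinstance(item, str) else item["label"]


def prune_to_schema(
    graph: dict[str, list[dict[str, Any]]], schema: dict[str, Any]
) -> dict[str, list[dict[str, Any]]]:
    """Hypothetical filter: drop nodes and relationships the schema does not allow."""
    allowed_nodes = {_label_of(e) for e in schema.get("entities", [])}
    allowed_rels = {_label_of(r) for r in schema.get("relations", [])}
    allowed_triples = {tuple(t) for t in schema.get("potential_schema", [])}

    nodes = [n for n in graph["nodes"] if n["label"] in allowed_nodes]
    label_by_id = {n["id"]: n["label"] for n in nodes}

    relationships = []
    for rel in graph["relationships"]:
        start = label_by_id.get(rel["start"])
        end = label_by_id.get(rel["end"])
        if start is None or end is None:
            continue  # one endpoint was removed or never existed
        if rel["type"] not in allowed_rels:
            continue  # relationship type is not in the schema
        if allowed_triples and (start, rel["type"], end) not in allowed_triples:
            continue  # direction or endpoint labels do not match any pattern
        relationships.append(rel)

    return {"nodes": nodes, "relationships": relationships}
```

In this sketch an empty `potential_schema` means that only labels and relationship types are checked, which is one possible design choice among several.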

2 changes: 2 additions & 0 deletions examples/README.md
@@ -3,6 +3,7 @@
This folder contains example usage for the different features
supported by the `neo4j-graphrag` package:

- [Automatic Schema Extraction](#schema-extraction) from PDF or text
- [Build Knowledge Graph](#build-knowledge-graph) from PDF or text
- [Retrieve](#retrieve) information from the graph
- [Question Answering](#answer-graphrag) (Q&A)
@@ -122,6 +123,7 @@ are listed in [the last section of this file](#customize).
- [Chunk embedder]()
- Schema Builder:
    - [User-defined](./customize/build_graph/components/schema_builders/schema.py)
    - [Automatic schema extraction](./automatic_schema_extraction/schema_from_text.py)
- Entity Relation Extractor:
    - [LLM-based](./customize/build_graph/components/extractors/llm_entity_relation_extractor.py)
    - [LLM-based with custom prompt](./customize/build_graph/components/extractors/llm_entity_relation_extractor_with_custom_prompt.py)