Skip to content

Explicit configuration of schema extraction #374

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 13 additions & 2 deletions docs/source/user_guide_kg_builder.rst
Original file line number Diff line number Diff line change
Expand Up @@ -131,8 +131,19 @@ This schema information can be provided to the `SimpleKGBuilder` as demonstrated
# ...
)

.. note::
By default, if no schema is provided to the SimpleKGPipeline, automatic schema extraction will be performed using the LLM (See the :ref:`Automatic Schema Extraction`).

Schema Parameter Behavior
-------------------------

The `schema` parameter controls how entity and relation extraction is performed:

* **AUTO_EXTRACTION**: ``schema="AUTO_EXTRACTION"`` or (``schema=None``)
The schema is automatically extracted from the input text once. This guiding schema is then used to structure entity and relation extraction for all chunks. This guarantees all chunks have the same guiding schema.
(See :ref:`Automatic Schema Extraction`)

* **NO_EXTRACTION**: ``schema="NO_EXTRACTION"`` or empty schema (``{"node_types": ()}``)
No schema extraction is performed. Entity and relation extraction proceed without a predefined or derived schema, resulting in unguided extraction.


Extra configurations
--------------------
Expand Down
4 changes: 4 additions & 0 deletions src/neo4j_graphrag/experimental/components/schema.py
Original file line number Diff line number Diff line change
Expand Up @@ -226,6 +226,10 @@ def node_type_from_label(self, label: str) -> Optional[NodeType]:
def relationship_type_from_label(self, label: str) -> Optional[RelationshipType]:
return self._relationship_type_index.get(label)

@classmethod
def create_empty(cls) -> Self:
return cls(node_types=tuple())

def save(
self,
file_path: Union[str, Path],
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -22,10 +22,9 @@
Sequence,
Union,
)
import logging
import warnings

from pydantic import ConfigDict, Field, model_validator
from pydantic import ConfigDict, Field, model_validator, field_validator
from typing_extensions import Self

from neo4j_graphrag.experimental.components.embedder import TextChunkEmbedder
Expand Down Expand Up @@ -66,8 +65,6 @@
)
from neo4j_graphrag.generation.prompts import ERExtractionTemplate

logger = logging.getLogger(__name__)


class SimpleKGPipelineConfig(TemplatePipelineConfig):
COMPONENTS: ClassVar[list[str]] = [
Expand Down Expand Up @@ -102,6 +99,15 @@ class SimpleKGPipelineConfig(TemplatePipelineConfig):

model_config = ConfigDict(arbitrary_types_allowed=True)

@field_validator("schema_", mode="before")
@classmethod
def validate_schema_literal(cls, v: Any) -> Any:
if v == "NO_EXTRACTION": # same as "empty" schema
return GraphSchema.create_empty()
if v == "AUTO_EXTRACTION": # same as no schema
return None
return v

@model_validator(mode="after")
def handle_schema_precedence(self) -> Self:
"""Handle schema precedence and warnings"""
Expand Down
10 changes: 8 additions & 2 deletions src/neo4j_graphrag/experimental/pipeline/kg_builder.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@

from __future__ import annotations

from typing import List, Optional, Sequence, Union, Any
from typing import List, Optional, Sequence, Union, Any, Literal
import logging

import neo4j
Expand Down Expand Up @@ -99,7 +99,13 @@ def __init__(
entities: Optional[Sequence[EntityInputType]] = None,
relations: Optional[Sequence[RelationInputType]] = None,
potential_schema: Optional[List[tuple[str, str, str]]] = None,
schema: Optional[Union[GraphSchema, dict[str, list[Any]]]] = None,
schema: Optional[
Union[
GraphSchema,
dict[str, list[Any]],
Literal["NO_EXTRACTION", "AUTO_EXTRACTION"],
],
] = None,
from_pdf: bool = True,
text_splitter: Optional[TextSplitter] = None,
pdf_loader: Optional[DataLoader] = None,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -138,6 +138,14 @@ def test_simple_kg_pipeline_config_manual_schema() -> None:
assert isinstance(config._get_schema(), SchemaBuilder)


def test_simple_kg_pipeline_config_literal_schema_validation() -> None:
config = SimpleKGPipelineConfig(schema="NO_EXTRACTION") # type: ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at it here, I am wondering if "NO_EXTRACTION" is the best term to use? when putting it next to "AUTO_EXTRACTION" it makes sense but here it is a bit ambiguous? would "NO_SCHEMA" or "NO_GUIDING_SCHEMA" be better? as it does not only cover not being extracted but not provided at all? I am not strongly opinionated about it though as it is well-documented...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated after private discussion

assert config.schema_ == GraphSchema.create_empty()

config = SimpleKGPipelineConfig(schema="AUTO_EXTRACTION") # type: ignore
assert config.schema_ is None


def test_simple_kg_pipeline_config_schema_run_params() -> None:
config = SimpleKGPipelineConfig(
entities=["Person"],
Expand Down