Automatic schema extraction from text #331
Merged: NathalieCharbel merged 37 commits into neo4j:main from NathalieCharbel:automatic-schema-extraction-from-text on May 14, 2025
Commits
7c831de Add schema extraction prompt template (NathalieCharbel)
baf9302 Add schema from text using an LLM (NathalieCharbel)
2b14541 Update SimpleKGPipeline for automatic schema extraction (NathalieCharbel)
49452d4 Save/Read inferred schema (NathalieCharbel)
fa8a6af Bug fixes (NathalieCharbel)
b52bed4 Add unit tests (NathalieCharbel)
41d359d Allow schema parameter in SimpleKGBuilderConfig and refactor code (NathalieCharbel)
511bc3e Update changelog and api rst (NathalieCharbel)
212ae0b Update documentation (NathalieCharbel)
30c273d Fix Changelog after rebase (NathalieCharbel)
52a2686 Ruff (NathalieCharbel)
b19e57c Fix mypy issues (NathalieCharbel)
4eebee5 Ignore remaining mypy issues (temp) (NathalieCharbel)
7088286 Remove unused imports (NathalieCharbel)
9d05c76 Fix unit tests (NathalieCharbel)
f9a7c8c Fix component connections (NathalieCharbel)
8458b75 Improve default schema extraction prompt and add examples (NathalieCharbel)
7558b56 Rename schema from text component (NathalieCharbel)
8885e2c Fix remaining mypy errors (NathalieCharbel)
78633c6 Improve schema from text example (NathalieCharbel)
fef2e49 Ruff (NathalieCharbel)
b412a05 Remove flag for automatic schema extraction (NathalieCharbel)
5183439 Fix unit tests (NathalieCharbel)
d6b3491 Handle cases where LLM outputs a valid JSON array (NathalieCharbel)
3edf0d0 Fix e2e tests (NathalieCharbel)
49c399c Address PR comments (NathalieCharbel)
bf2fb96 Add examples running SimpleKGPipeline (NathalieCharbel)
ffea761 Add inferred schema json and yaml files example (NathalieCharbel)
2ce0ff9 Improve handling LLM response (NathalieCharbel)
f69eace Improve handling errors for extracted schema (NathalieCharbel)
89b3d1b Replace warning logs with real deprecation warnings (NathalieCharbel)
83d90fb Fix schema unit tests (NathalieCharbel)
29aec54 Ensure proper handling of schema when provided as dict (NathalieCharbel)
4e6d53a Move example files to the right directories (NathalieCharbel)
48ec9b7 Add custom schema extraction error (NathalieCharbel)
44e76de Handle invalid format for extracted schema (NathalieCharbel)
6bc46e1 Merge branch 'main' into automatic-schema-extraction-from-text (NathalieCharbel)
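Several commits in this list concern the shape of the LLM's reply: d6b3491 ("Handle cases where LLM outputs a valid JSON array"), 44e76de ("Handle invalid format for extracted schema") and 48ec9b7 ("Add custom schema extraction error"). The code for those commits is not shown on this page, so the following is only a minimal sketch of the general idea, not the library's implementation. The names `SchemaParseError` and `parse_schema_response` are hypothetical, and taking the first element of an array is an assumption made for illustration.

```python
import json
from typing import Any


class SchemaParseError(ValueError):
    """Hypothetical error raised when the LLM reply is not a usable schema."""


def parse_schema_response(raw: str) -> dict[str, Any]:
    """Return a schema dict from an LLM reply that may be a JSON object or array."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaParseError(f"LLM reply is not valid JSON: {exc}") from exc

    # Assumption: when the model wraps the schema in a list, use the first item.
    if isinstance(parsed, list):
        if not parsed:
            raise SchemaParseError("LLM reply is an empty JSON array")
        parsed = parsed[0]

    # Anything other than a JSON object (string, number, nested list) is rejected.
    if not isinstance(parsed, dict):
        raise SchemaParseError(
            f"Expected a JSON object, got {type(parsed).__name__}"
        )
    return parsed
```

Raising a dedicated exception type, rather than letting `json.JSONDecodeError` or a `TypeError` escape, lets a caller distinguish a bad LLM reply from other pipeline failures.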
97 changes: 97 additions & 0 deletions
examples/build_graph/automatic_schema_extraction/simple_kg_builder_schema_from_pdf.py
@@ -0,0 +1,97 @@
"""This example demonstrates how to use SimpleKGPipeline with automatic schema extraction
from a PDF file. When no schema is provided to SimpleKGPipeline, automatic schema extraction
is performed using the LLM.

Note: This example requires an OpenAI API key to be set in the .env file.
"""

import asyncio
import logging
import os
from pathlib import Path
from dotenv import load_dotenv
import neo4j

from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings

# Load environment variables from .env file
load_dotenv()

# Configure logging
logging.basicConfig()
logging.getLogger("neo4j_graphrag").setLevel(logging.INFO)

# PDF file path
root_dir = Path(__file__).parents[2]
PDF_FILE = str(
    root_dir / "data" / "Harry Potter and the Chamber of Secrets Summary.pdf"
)


async def run_kg_pipeline_with_auto_schema() -> None:
    """Run the SimpleKGPipeline with automatic schema extraction from a PDF file."""

    # Define Neo4j connection
    uri = os.getenv("NEO4J_URI", "neo4j://localhost:7687")
    user = os.getenv("NEO4J_USER", "neo4j")
    password = os.getenv("NEO4J_PASSWORD", "password")

    # Define LLM parameters
    llm_model_params = {
        "max_tokens": 2000,
        "response_format": {"type": "json_object"},
        "temperature": 0,  # Lower temperature for more consistent output
    }

    # Initialize the Neo4j driver
    driver = neo4j.GraphDatabase.driver(uri, auth=(user, password))

    # Create the LLM instance
    llm = OpenAILLM(
        model_name="gpt-4o",
        model_params=llm_model_params,
    )

    # Create the embedder instance
    embedder = OpenAIEmbeddings()

    try:
        # Create a SimpleKGPipeline instance without providing a schema
        # This will trigger automatic schema extraction
        kg_builder = SimpleKGPipeline(
            llm=llm,
            driver=driver,
            embedder=embedder,
            from_pdf=True,
        )

        print(f"Processing PDF file: {PDF_FILE}")
        # Run the pipeline on the PDF file
        await kg_builder.run_async(file_path=PDF_FILE)

    finally:
        # Close connections
        await llm.async_client.close()
        driver.close()


async def main() -> None:
    """Run the example."""
    # Create data directory if it doesn't exist
    data_dir = root_dir / "data"
    data_dir.mkdir(exist_ok=True)

    # Check if the PDF file exists
    if not Path(PDF_FILE).exists():
        print(f"Warning: PDF file not found at {PDF_FILE}")
        print("Please replace with a valid PDF file path.")
        return

    # Run the pipeline
    await run_kg_pipeline_with_auto_schema()


if __name__ == "__main__":
    asyncio.run(main())
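The example locates its PDF through `Path(__file__).parents[2]`, which is easy to misread. The sketch below, assuming the file sits at the path listed in this PR, uses `PurePosixPath` (so it runs without touching the filesystem) to show which directory that expression resolves to and where the PDF is therefore expected.

```python
from pathlib import PurePosixPath

# Path of the example file as listed in this PR, relative to the repository root.
example_file = PurePosixPath(
    "examples/build_graph/automatic_schema_extraction/"
    "simple_kg_builder_schema_from_pdf.py"
)

# parents[0] is the containing directory; each further index climbs one level.
root_dir = example_file.parents[2]
pdf_file = root_dir / "data" / "Harry Potter and the Chamber of Secrets Summary.pdf"

print(root_dir)
print(pdf_file)
```

Under that assumption, `root_dir` is the `examples` directory, so the script looks for the PDF in `examples/data/` rather than next to the script itself.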