Commit 8458b75

Improve default schema extraction prompt and add examples
1 parent f9a7c8c commit 8458b75

3 files changed: +173 -17 lines changed

examples/README.md

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,7 @@
 This folder contains usage examples for the different features
 supported by the `neo4j-graphrag` package:
 
+- [Automatic Schema Extraction](#schema-extraction) from PDF or text
 - [Build Knowledge Graph](#build-knowledge-graph) from PDF or text
 - [Retrieve](#retrieve) information from the graph
 - [Question Answering](#answer-graphrag) (Q&A)
@@ -122,6 +123,7 @@ are listed in [the last section of this file](#customize).
 - [Chunk embedder]()
 - Schema Builder:
     - [User-defined](./customize/build_graph/components/schema_builders/schema.py)
+    - [Automatic schema extraction](./automatic_schema_extraction/schema_from_text.py)
 - Entity Relation Extractor:
     - [LLM-based](./customize/build_graph/components/extractors/llm_entity_relation_extractor.py)
     - [LLM-based with custom prompt](./customize/build_graph/components/extractors/llm_entity_relation_extractor_with_custom_prompt.py)
Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
+"""This example demonstrates how to use the SchemaFromText component
+to automatically extract a schema from text and save it to JSON and YAML files.
+
+The SchemaFromText component uses an LLM to analyze the text and identify entities,
+relations, and their properties.
+
+Note: This example requires an OpenAI API key to be set in the .env file.
+"""
+
+import asyncio
+import logging
+import os
+from dotenv import load_dotenv
+
+from neo4j_graphrag.experimental.components.schema import SchemaFromText, SchemaConfig
+from neo4j_graphrag.llm import OpenAILLM
+
+# Load environment variables from .env file
+load_dotenv()
+
+# Configure logging
+logging.basicConfig()
+logging.getLogger("neo4j_graphrag").setLevel(logging.INFO)
+
+# Verify OpenAI API key is available
+if not os.getenv("OPENAI_API_KEY"):
+    raise ValueError(
+        "OPENAI_API_KEY environment variable not found. "
+        "Please set it in the .env file in the root directory."
+    )
+
+# Sample text to extract schema from - it's about a company and its employees
+TEXT = """
+Acme Corporation was founded in 1985 by John Smith in New York City.
+The company specializes in manufacturing high-quality widgets and gadgets
+for the consumer electronics industry.
+
+Sarah Johnson joined Acme in 2010 as a Senior Engineer and was promoted to
+Engineering Director in 2015. She oversees a team of 12 engineers working on
+next-generation products. Sarah holds a PhD in Electrical Engineering from MIT
+and has filed 5 patents during her time at Acme.
+
+The company expanded to international markets in 2012, opening offices in London,
+Tokyo, and Berlin. Each office is managed by a regional director who reports
+directly to the CEO, Michael Brown, who took over leadership in 2008.
+
+Acme's most successful product, the SuperWidget X1, was launched in 2018 and
+has sold over 2 million units worldwide. The product was developed by a team led
+by Robert Chen, who joined the company in 2016 after working at TechGiant for 8 years.
+
+The company currently employs 250 people across its 4 locations and had a revenue
+of $75 million in the last fiscal year. Acme is planning to go public in 2024
+with an estimated valuation of $500 million.
+"""
+
+# Define the file paths for saving the schema
+OUTPUT_DIR = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "data")
+JSON_FILE_PATH = os.path.join(OUTPUT_DIR, "extracted_schema.json")
+YAML_FILE_PATH = os.path.join(OUTPUT_DIR, "extracted_schema.yaml")
+
+
+async def extract_and_save_schema() -> SchemaConfig:
+    """Extract schema from text and save it to JSON and YAML files."""
+
+    # Define LLM parameters
+    llm_model_params = {
+        "max_tokens": 2000,
+        "response_format": {"type": "json_object"},
+        "temperature": 0,  # Lower temperature for more consistent output
+    }
+
+    # Create the LLM instance
+    llm = OpenAILLM(
+        model_name="gpt-4o",
+        model_params=llm_model_params,
+    )
+
+    try:
+        # Create a SchemaFromText component with the default template
+        schema_extractor = SchemaFromText(llm=llm)
+
+        print("Extracting schema from text...")
+        # Extract schema from text
+        inferred_schema = await schema_extractor.run(text=TEXT)
+
+        # Ensure the output directory exists
+        os.makedirs(OUTPUT_DIR, exist_ok=True)
+
+        print(f"Saving schema to JSON file: {JSON_FILE_PATH}")
+        # Save the schema to JSON file
+        inferred_schema.store_as_json(JSON_FILE_PATH)
+
+        print(f"Saving schema to YAML file: {YAML_FILE_PATH}")
+        # Save the schema to YAML file
+        inferred_schema.store_as_yaml(YAML_FILE_PATH)
+
+        print("\nExtracted Schema Summary:")
+        print(f"Entities: {list(inferred_schema.entities.keys())}")
+        print(f"Relations: {list(inferred_schema.relations.keys() if inferred_schema.relations else [])}")
+
+        if inferred_schema.potential_schema:
+            print("\nPotential Schema:")
+            for entity1, relation, entity2 in inferred_schema.potential_schema:
+                print(f"  {entity1} --[{relation}]--> {entity2}")
+
+        return inferred_schema
+
+    finally:
+        # Close the LLM client
+        await llm.async_client.close()
+
+
+async def main() -> None:
+    """Run the example."""
+
+    # Extract schema and save to files
+    schema_config = await extract_and_save_schema()
+
+    print("\nSchema files have been saved to:")
+    print(f"  - JSON: {JSON_FILE_PATH}")
+    print(f"  - YAML: {YAML_FILE_PATH}")
+
+    print("\nExample of how to load the schema from files:")
+    print("  from neo4j_graphrag.experimental.components.schema import SchemaConfig")
+    print(f"  schema_from_json = SchemaConfig.from_file('{JSON_FILE_PATH}')")
+    print(f"  schema_from_yaml = SchemaConfig.from_file('{YAML_FILE_PATH}')")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
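
The example prints each `potential_schema` triple without checking that its labels were actually extracted as entities or relations. A minimal sketch of such a consistency check follows. It uses plain lists and tuples as stand-ins for the `entities`, `relations`, and `potential_schema` fields read in the example, and `find_dangling_triples` is a hypothetical helper written for illustration, not part of `neo4j-graphrag`.

```python
from typing import Iterable

Triple = tuple[str, str, str]


def find_dangling_triples(
    entity_labels: Iterable[str],
    relation_labels: Iterable[str],
    potential_schema: Iterable[Triple],
) -> list[Triple]:
    """Return the triples that mention a label missing from the schema."""
    entities = set(entity_labels)
    relations = set(relation_labels)
    return [
        (source, relation, target)
        for source, relation, target in potential_schema
        if source not in entities
        or relation not in relations
        or target not in entities
    ]


# Hypothetical stand-ins for the labels and triples of an extracted schema.
entity_labels = ["Person", "Company"]
relation_labels = ["WORKS_FOR"]
triples = [
    ("Person", "WORKS_FOR", "Company"),
    ("Person", "LIVES_IN", "City"),  # neither LIVES_IN nor City was extracted
]

dangling = find_dangling_triples(entity_labels, relation_labels, triples)
print(dangling)  # [('Person', 'LIVES_IN', 'City')]
```

With a real extracted schema, one could pass `inferred_schema.entities.keys()` and `inferred_schema.relations.keys()` as the label arguments, assuming those fields behave like the dictionaries the example treats them as.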

src/neo4j_graphrag/generation/prompts.py

Lines changed: 41 additions & 17 deletions
@@ -207,24 +207,48 @@ class SchemaExtractionTemplate(PromptTemplate):
 You are a top-tier algorithm designed for extracting a labeled property graph schema in
 structured formats.
 
-Generate the generalized graph schema based on input text. Identify key entity types,
-their relationship types, and property types whenever it is possible. Return only
-abstract schema information, no concrete instances. Use singular PascalCase labels for
-entity types and UPPER_SNAKE_CASE for relationship types. Include property definitions
-only when the type can be confidently inferred, otherwise omit the properties.
+Generate a generalized graph schema based on the input text. Identify key entity types,
+their relationship types, and property types.
+
+IMPORTANT RULES:
+1. Return only abstract schema information, not concrete instances.
+2. Use singular PascalCase labels for entity types (e.g., Person, Company, Product).
+3. Use UPPER_SNAKE_CASE for relationship types (e.g., WORKS_FOR, MANAGES).
+4. Include property definitions only when the type can be confidently inferred, otherwise omit them.
+5. When defining potential_schema, ensure that every entity and relation mentioned exists in your entities and relations lists.
+6. Do not create entity types that aren't clearly mentioned in the text.
+7. Keep your schema minimal and focused on clearly identifiable patterns in the text.
+
 Accepted property types are: BOOLEAN, DATE, DURATION, FLOAT, INTEGER, LIST,
-LOCAL DATETIME, LOCAL TIME, POINT, STRING, ZONED DATETIME, ZONED TIME.
-Do not add extra keys or explanatory text. Return a valid JSON object without
-back‑ticks, markdown, or comments.
-
-For example, if the text says "Alice lives in London", the output JSON object should
-adhere to the following format:
-{{"entities": [{{"label": "Person", "properties": [{{"name": "name", "type": "STRING"}}]}},
-{{"label": "City", "properties":[{{"name": "name", "type": "STRING"}}]}}],
-"relations": [{{"label": "LIVES_IN"}}],
-"potential_schema":[[ "Person", "LIVES_IN", "City"]]}}
-
-More examples:
+LOCAL_DATETIME, LOCAL_TIME, POINT, STRING, ZONED_DATETIME, ZONED_TIME.
+
+Return a valid JSON object that follows this precise structure:
+{{
+  "entities": [
+    {{
+      "label": "Person",
+      "properties": [
+        {{
+          "name": "name",
+          "type": "STRING"
+        }}
+      ]
+    }},
+    ...
+  ],
+  "relations": [
+    {{
+      "label": "WORKS_FOR"
+    }},
+    ...
+  ],
+  "potential_schema": [
+    ["Person", "WORKS_FOR", "Company"],
+    ...
+  ]
+}}
+
+Examples:
 {examples}
 
 Input text:
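
The doubled braces in the template are worth a note. Assuming the template is rendered with Python's `str.format` (which the `{examples}` placeholder suggests, but the diff does not show), `{{` and `}}` come out as literal JSON braces while single-brace names are substituted. The sketch below uses a shortened, hypothetical template rather than the library's prompt, and the `{text}` placeholder is an assumption since the hunk ends at "Input text:".

```python
import json

# Hypothetical, shortened stand-in for the prompt template in the diff.
TEMPLATE = """Return a valid JSON object that follows this structure:
{{"entities": [{{"label": "Person"}}], "relations": [{{"label": "WORKS_FOR"}}]}}

Examples:
{examples}

Input text:
{text}
"""

# str.format collapses each doubled brace into a single literal brace
# and fills in the named placeholders.
rendered = TEMPLATE.format(examples="(none)", text="Alice works for Acme.")

# The second line of the rendered prompt is now plain JSON.
json_line = rendered.splitlines()[1]
parsed = json.loads(json_line)
print(parsed["entities"][0]["label"])  # Person
```

Forgetting to double a brace in such a template typically surfaces as a `KeyError` or `ValueError` at format time, which is why the JSON skeleton in the prompt escapes every brace.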
