This repository holds the code for the WS24/25 master practical on knowledge graph fusion. The main aim of the practical was to explore methods and approaches for merging two knowledge graphs: https://www.apotheken-umschau.de/ and https://www.wirkstoffprofile.de/.
While both knowledge sources come from the same domain, their style and intended users are different.
There are three directories that contain data and the related tooling:

- `data_processing`: holds the script that was used for scraping the data for $KG_{2}$ from https://www.wirkstoffprofile.de/.
- `data`: contains the scraped JSON files from the Wirkstoffprofile portal in a structure of nested directories which corresponds to the categories and subcategories used by the web page. The data is complete, with the only exception being the chemical structure schematics (available as images on the web page).
- `kg`: contains everything related to running the Neo4j container: a `compose.yml` file, a `Dockerfile` and a helper script `custom-entrypoint.sh`. There are also two subdirectories:
  - `kg/dumps`: `.dump` files for different versions of the original knowledge graph $KG_{1}$ and different versions of the merged graph.
  - `kg/plugins`: a directory that holds Neo4j plugin files for libraries such as APOC and Graph Data Science, which are needed for certain operations on graphs.
The core of the project, i.e. the code related to the fusion approaches and their evaluation, can be found in the following two directories:
- `fusion`:
  - `brute_force_utils.py` and `multiprocess_bruteforce.py` are the actual scripts used for executing the LLM-based fusion strategy.
  - `apply_fusion_brute_force_single_doc.ipynb` demonstrates the approach on a single document.
  - `fusion/brute_force_logs` includes textual logs that were created when running the LLM-based fusion with different models.
  - `fusion/small_scripts` holds short scripts that demonstrate certain aspects relevant to this project, such as how to access the Vertex AI text similarity models, or scripts related to the Langchain `LLMGraphTransformer` class.
  - `fusion/metagraph` contains parts of the implementation related to the metagraph-based approach.
- `eval`:
  - `base_stats-*`: scripts that determine basic graph statistics, such as the number and types of nodes and relationships, for the original and merged KGs.
  - `generate_eval_dataset.py`: script for generating the QA dataset for the RAGAS-based evaluation, with a corresponding log file and example responses.
  - `ragas_eval.py`: includes the actual evaluation, i.e. graph-based question answering and evaluation using metrics from the RAGAS framework.
  - `eval/eval_sets`: contains the actual question-answer pairs generated by the `generate_eval_dataset.py` script.
  - `combined_similarity_metric.py`: demonstrates the computation of a node similarity metric that is based on the combination of Jaccard and cosine similarity (on text embeddings).
  - `eval/combined_similarity_alpha`: contains results for the combined metric based on different alpha values.
  - `eval/small_scripts`: contains two demonstration scripts for using the Weights & Biases Weave platform for storing traces of LLM calls and visualizing evaluations. These were not used in the final evaluation because of time constraints, but could be very helpful.
In this approach, a large language model (LLM) is leveraged to assist in the extraction and merging of knowledge graphs. The process involves treating each document of the corpus individually. For each document:

- Node Extraction: The LLM is prompted with $KG_1$ as context, and the model extracts new nodes from the document. These nodes represent concepts, entities, or medical terms that are relevant to the medical domain.
- Relationship Extraction: The LLM also extracts relationships between the nodes it identifies. These relationships can represent associations, interactions, or dependencies between medical concepts.
- Subgraph Merging: The extracted nodes and relationships are combined with $KG_1$ to form an enriched subgraph $KG_d$. This subgraph is iteratively added to the existing knowledge graph, expanding its coverage.
A unique feature of this brute-force method is the repetitive use of the same prompt to extract more nodes and relationships. Over time, as more documents are processed, the graph continues to grow. Furthermore, the node identifiers and names are adjusted based on the entities already existing in $KG_1$.
This approach has the advantage of flexibility, as it can scale to new domains by simply changing the document corpus. However, it requires careful post-processing to ensure that the extracted entities are meaningful and consistent with the existing graph. It is also important to note that recursively running the same relation or node extraction prompts can greatly increase the number of extracted entities and edges. While the number of extracted values begins to plateau after 3 iterations, the overall number of extracted values can be doubled, tripled, or even quadrupled through multiple iterations.
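The per-document loop described above can be sketched as follows. Note that `extract` merely stands in for the actual LLM prompt used in `multiprocess_bruteforce.py` (it is stubbed here), and the merge is reduced to a simple set union; the early-exit condition models the plateau effect mentioned above:

```python
def merge_document(kg, document, extract, max_iterations=3):
    """Brute-force sketch: kg is {'nodes': set, 'edges': set} and is
    enriched in place. `extract` is a stand-in for the LLM call that
    receives the current graph as context and returns new nodes/edges."""
    for _ in range(max_iterations):
        new_nodes, new_edges = extract(document, kg)
        before = (len(kg["nodes"]), len(kg["edges"]))
        kg["nodes"] |= set(new_nodes)
        kg["edges"] |= set(new_edges)
        # stop early once another pass adds nothing (the plateau)
        if (len(kg["nodes"]), len(kg["edges"])) == before:
            break
    return kg
```

In the real scripts the extraction step returns LLM output that still has to be normalized against existing identifiers; this sketch only shows the control flow.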
While this approach was not fully implemented, many of its components are either used as part of the other approach, or are implemented separately. The following is an outline of how the existing elements would have to be combined and extended to conduct experiments on metagraph-based fusion:
`fusion/metagraph/metagraph.py` showcases how a metagraph projection can be created using the GDS library in Neo4j, and how previously created node label (or relationship label) embeddings can be added to the metagraph to perform KNN similarity. `fusion/metagraph/string_embedding.py` includes code to retrieve and store string embeddings from the Vertex AI API.
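The KNN step on label embeddings can be illustrated without Neo4j. The following is a plain-Python stand-in (illustrative names, not the actual `metagraph.py` code) for matching each node label of one graph to its most similar label in the other graph:

```python
import math

def nearest_labels(labels_a, labels_b, k=1):
    """For each node label of graph A, return the k most cosine-similar
    labels of graph B. labels_*: dict mapping label -> embedding vector.
    Plain-Python stand-in for the gds.knn step on the metagraph."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(y * y for y in v))
        return dot / (nu * nv) if nu and nv else 0.0

    matches = {}
    for label_a, emb_a in labels_a.items():
        ranked = sorted(labels_b, key=lambda lb: cos(emb_a, labels_b[lb]),
                        reverse=True)
        matches[label_a] = ranked[:k]
    return matches
```

In the actual setup the embeddings would come from the Vertex AI string embeddings stored by `string_embedding.py`, and the KNN is run inside Neo4j via GDS.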
The metagraph-based fusion approach involves creating a higher-level abstraction of the knowledge graphs by merging their metagraphs first before combining the individual nodes and relationships. A metagraph captures the structure and semantics of the graph at a more abstract level, providing a framework for aligning the two knowledge graphs.
The process starts by analyzing and comparing the metagraphs of $KG_1$ and $KG_2$.
Key steps in the metagraph-based fusion include:
- Semantic Alignment: Semantic similarity between the node labels, relationship labels, and graph structures is identified. This could involve string matching techniques, embedding models, or ontology-based methods.
- Node Merging: Nodes from $KG_1$ and $KG_2$ that represent the same or similar entities are identified and merged. This step might include heuristic approaches or similarity measures like Jaccard similarity or cosine similarity on embeddings.
- Relationship Merging: Relationships that connect nodes in both graphs are also merged. This requires ensuring that the semantics of the relationships align, possibly using embedding models for relationship alignment.
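A minimal version of such a combined node similarity metric, mirroring what `combined_similarity_metric.py` computes (the exact weighting and inputs there may differ), could look like this:

```python
import math

def jaccard(a, b):
    """Structural overlap, e.g. between the neighbour-id sets of two nodes."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(u, v):
    """Cosine similarity between two text-embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def combined_similarity(nbrs_a, nbrs_b, emb_a, emb_b, alpha=0.5):
    """Alpha-weighted blend of structural (Jaccard) and textual (cosine)
    similarity; eval/combined_similarity_alpha holds results for
    several alpha values."""
    return alpha * jaccard(nbrs_a, nbrs_b) + (1 - alpha) * cosine(emb_a, emb_b)
```

Sweeping `alpha` between 0 and 1 trades off graph structure against textual similarity of the node descriptions.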
A challenge with this approach is the lack of consistent ontologies or schemas in real-world graphs, which often makes it difficult to directly apply this technique without significant customization. But this method has significant advantages over the pure LLM-based approach, because aligning the types of nodes and relationships used by both graphs first constrains the subsequent merging of individual nodes and relationships.
To assess the quality of the fused knowledge graph, a question-answering (QA) evaluation inspired by the RAGAS framework was performed. The goal of this evaluation is to verify how well the merged graph supports answering medical queries based on the extracted knowledge.
- Question Generation: A set of question-answer pairs is generated from a subset of the documents $D$ (the corpus). These questions are designed to cover various medical topics, ensuring they require knowledge from both $KG_1$ and $KG_2$.
- Entity Extraction: For each question, medical entities (e.g., diseases, drugs, treatments) are identified. These entities are queried in the merged graph to retrieve relevant neighbors and associated knowledge.
- Answer Generation: Using the retrieved subgraph, a natural language answer is generated. This could involve extracting facts directly from the graph or synthesizing information from multiple nodes.
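The entity extraction and retrieval steps can be illustrated on a toy adjacency representation of the merged graph (the entity names and relation types below are made up for illustration; the real pipeline queries Neo4j):

```python
def retrieve_context(entities, graph):
    """For each extracted entity, collect its 1-hop neighbourhood as
    textual facts that an LLM can then synthesize into an answer.
    graph maps node -> list of (relation, neighbour)."""
    facts = []
    for entity in entities:
        for relation, neighbour in graph.get(entity, []):
            facts.append(f"{entity} -[{relation}]-> {neighbour}")
    return facts

# Toy excerpt of a merged medical graph
toy_graph = {
    "Ibuprofen": [("TREATS", "Headache"), ("INTERACTS_WITH", "Warfarin")],
}
context = retrieve_context(["Ibuprofen"], toy_graph)
```

The resulting fact list is what gets passed to the answer-generation prompt; if the graph lacks the relevant neighbourhood, the generated answer cannot be faithful, which matters for the results discussed below.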
The evaluation measures three primary metrics:
- Factual correctness: measures the factual accuracy of the generated response compared to the reference. This metric evaluates how well the generated response aligns with the reference by breaking down both into claims and using natural language inference to determine factual overlap.
- BLEU score: evaluates the quality of the generated response by comparing it to reference answers. It measures the similarity between the response and the reference based on n-gram precision and brevity penalty.
- Faithfulness: measures how factually consistent a response is with the retrieved context. A response is considered faithful if all its claims can be supported by the retrieved context.
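To make the BLEU component concrete, here is a tiny single-reference variant with modified n-gram precision and brevity penalty. RAGAS uses its own implementation; this sketch only illustrates the idea:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Tiny single-reference BLEU: geometric mean of modified n-gram
    precisions, scaled by a brevity penalty for short candidates."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_counts = Counter(ngrams(cand, n))
        r_counts = Counter(ngrams(ref, n))
        # clip candidate n-gram counts by their count in the reference
        overlap = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(1, len(cand) - n + 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(1, len(cand)))
    return bp * math.exp(log_avg)
```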
In the brute-force LLM-based fusion approach, the merged graph demonstrated a clear improvement over the baseline graph ($KG_1$), although the absolute values of the evaluation metrics remained low.
These low values have a simple explanation: for most of these questions the relevant data was simply not included in the merged knowledge graph. However, this does not mean that the merging algorithm does not work as intended, because we cannot be sure that the QA evaluation dataset is complete. As mentioned before, this can be remedied by:
- increasing the number of iterations when running node and relationship extraction in the brute-force approach
- increasing the number of generated question-answer pairs in the QA dataset generation. Ideally, if there are e.g. 100 statements (facts) in a document, then 100 question-answer pairs should be generated based on these facts. In such a case the RAGAS metrics would reflect the actual quality of the merging approach and could be used to compare different merging approaches.
Below we can see examples of questions, the gold (correct) response, and the response provided by the KG retrieval. Most of the results in the QA dataset evaluation look like the two bottom answers (i.e. the necessary context information is simply not included in the graph).
The process of knowledge graph fusion is complex and non-trivial, especially when merging disparate sources like $KG_1$ and $KG_2$.
The provided setup for using Neo4j can be found in the `kg` directory and can be used by running `docker compose up`. An additional `custom-entrypoint.sh` script is used in the startup sequence for managing the database dump which should be used inside the Neo4j container. There are two main scenarios in which this setup is used:
- Load an existing Neo4j dump: In the `Dockerfile`, set the `DUMP_FILE` variable to the name of the dump file that should be loaded. The corresponding file should be located in the `kg/dumps` subdirectory. The `SKIP_DUMP` variable in the `custom-entrypoint.sh` script should be set to `false`.
- Empty (default) Neo4j database: Delete all existing `kg_neo4j_data` volumes and set the `SKIP_DUMP` variable to `true`.
Additionally, if you would like to export an existing Neo4j database by creating a `.dump` file, execute the following steps:
- stop Neo4j inside the container: `neo4j-admin server stop`
- inside the container: `neo4j-admin database dump neo4j --to-path="."`
- copy the `.dump` file to the host, either via `docker cp` or the Docker Desktop UI
It is important to note that the Neo4j container itself has to be running at the time of executing the first command. Normally, stopping the Neo4j server automatically stops the container as well. This behaviour is overridden by an additional command in the `custom-entrypoint.sh` script: `tail -f /dev/null`.
Note: Before running the merging scripts, one has to make sure that the original graph uses the same id attribute names as the scripts. Consequently, one has to execute the following Cypher query beforehand:
```cypher
MATCH (n)
WHERE n.id IS NOT NULL
SET n.custom_node_id = n.id
REMOVE n.id
```
When using the free trial version of Google Cloud, strict quotas are in place for using the different models provided by the Vertex AI API. These quotas can be examined in the Google Cloud Console under API -> Quotas.

The use of Gemini models, for example, is limited to 1 request per minute, per region, per base model. There are also additional limits on input and output token lengths, but they are less important for the experiments in this repository. Thus, when executing code which uses a particular base model, the only way to increase throughput is to use multiple regions at the same time. For Gemini 1.5 Flash 001, for example, one could use the following West European regions: `regions = ["europe-west1", "europe-west2", "europe-west3", "europe-west4", "europe-west6", "europe-west8", "europe-west9"]`. While models from the Gemini family are widely available across many regions, other models provided by the Vertex AI API (such as models by Mistral, Meta and Anthropic) are accessible in just one (US) region and are consequently not suitable for multiprocessing in which each process uses an individual endpoint with a different region.
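Under these quota assumptions, spreading requests over regions can be as simple as a round-robin scheduler. The helper below is a sketch; the actual endpoint creation via the Vertex AI SDK is omitted:

```python
from itertools import cycle

# West European regions listed above for gemini-1.5-flash-001
REGIONS = ["europe-west1", "europe-west2", "europe-west3",
           "europe-west4", "europe-west6", "europe-west8", "europe-west9"]

def region_scheduler(regions=tuple(REGIONS)):
    """Round-robin over regions: with a 1-request/minute/region quota,
    n regions give roughly n requests per minute overall. Each worker
    process would build its own model endpoint for the region it is handed."""
    return cycle(regions)

sched = region_scheduler()
assignments = [next(sched) for _ in range(9)]  # e.g. one region per pending request
```

This is how `multiprocess_bruteforce.py`-style parallelism can be kept within the per-region quota: each process sticks to its assigned region and throttles itself to one request per minute.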



