Exploring knowledge graph fusion approaches

Intro

This repository holds the code for the WS24/25 master practical on knowledge graph fusion. The main aim of the practical was to explore methods and approaches for merging two knowledge graphs, $KG_{1}$ and $KG_{2}$, from the medical domain and to evaluate the results. $KG_{1}$ is an existing knowledge graph extracted from the German-language medical magazine Apotheken Umschau (https://www.apotheken-umschau.de/), and $KG_{2}$ is a graph created from a database of medical substances (https://www.wirkstoffprofile.de/).

While both knowledge sources come from the same domain, their style and intended users differ. $KG_{1}$ holds rather informal common knowledge accessible to laypersons, whereas $KG_{2}$ is a structured, in-depth description of medical chemicals and their properties intended for use by medical professionals. Thus, creating a useful merged graph that combines information from the two in a meaningful way is a non-trivial challenge.

Repository Structure

Three directories contain the data for $KG_{2}$, the related scraping scripts, and the setup for running and using Neo4j.

  • data_processing: This directory holds the script that was used for scraping the data for $KG_{2}$ from https://www.wirkstoffprofile.de/.
  • data: Contains the scraped JSON files from the Wirkstoffprofile portal in a structure of nested directories that corresponds to the categories and subcategories used by the web page. The data is complete, with the sole exception of the chemical structure schematics (available as images on the web page).
  • kg: Contains everything related to running the Neo4j container: a compose.yml file, a Dockerfile, and a helper script custom-entrypoint.sh. There are also two subdirectories:
    • kg/dumps - .dump files for different versions of the original knowledge graph $KG_{1}$ and different versions of the merged graph.
    • kg/plugins - Neo4j plugin files for libraries such as APOC and Graph Data Science, which are needed for certain operations on graphs.

The core of the project, i.e. the code related to the fusion approaches and their evaluation, can be found in the following two directories:

  • fusion:

    • brute_force_utils.py and multiprocess_bruteforce.py are the scripts that were used to execute the LLM-based fusion strategy; apply_fusion_brute_force_single_doc.ipynb illustrates the approach on a single document for demonstration purposes.
    • fusion/brute_force_logs includes the textual logs that were created when running the LLM-based fusion with different models.
    • fusion/small_scripts holds short scripts that demonstrate aspects relevant to this project, such as how to access the Vertex AI text similarity models, or scripts related to the LangChain LLMGraphTransformer class.
    • fusion/metagraph contains the parts of the implementation related to the metagraph-based approach.
  • eval:

    • base_stats-*: scripts that determine basic graph statistics, such as the number and types of nodes and relationships, for the original and the merged KG.
    • generate_eval_dataset.py: script for generating the QA dataset for the RAGAS-based evaluation, with a corresponding log file and example responses.
    • ragas_eval.py: contains the actual evaluation, i.e. graph-based question answering scored with metrics from the RAGAS framework.
    • eval/eval_sets contains the question-answer pairs generated by the generate_eval_dataset.py script.
    • combined_similarity_metric.py demonstrates the computation of a node similarity metric that combines Jaccard similarity with cosine similarity on text embeddings. eval/combined_similarity_alpha contains results for the combined metric with different alpha values.
    • eval/small_scripts contains two demonstration scripts for using the Weights & Biases Weave platform to store traces of LLM calls and visualize evaluations. These were not used in the final evaluation because of time constraints, but they could be very helpful.

Fusion Approaches

Brute Force LLM-based Fusion

In this approach, a large language model (LLM) is leveraged to assist in the extraction and merging of knowledge graphs. The process involves treating $KG_1$ as the base graph and using the model to extract additional information from documents to augment it.

For each document $d \in D$, the following steps are performed:

  1. Node Extraction: The LLM is prompted with $KG_1$ as context, and the model extracts new nodes from the document. These nodes represent concepts, entities, or medical terms that are relevant to the medical domain.
  2. Relationship Extraction: The LLM also extracts relationships between the nodes it identifies. These relationships can represent associations, interactions, or dependencies between medical concepts.
  3. Subgraph Merging: The extracted nodes and relationships are combined with $KG_1$ to form an enriched subgraph, $KG_d$. This subgraph is iteratively added to the existing knowledge graph, expanding its coverage.

A unique feature of this brute-force method is the repeated use of the same prompt to extract more nodes and relationships. Over time, as more documents are processed, the graph continues to grow. Furthermore, the node identifiers and names are adjusted based on existing entities in $KG_1$ to maintain consistency and prevent duplication.

This approach has the advantage of flexibility, as it can scale to new domains by simply changing the document corpus. However, it requires careful post-processing to ensure that the extracted entities are meaningful and consistent with the existing graph. It is also important to note that recursively running the same relation or node extraction prompts can greatly increase the number of extracted entities and edges: while the number of newly extracted values begins to plateau after about three iterations, the overall number of extracted values can be doubled, tripled, or even quadrupled through multiple iterations.
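
The following is a minimal sketch of such an iterative extraction loop. It is not the exact code from brute_force_utils.py or multiprocess_bruteforce.py; it assumes LangChain's LLMGraphTransformer (explored in fusion/small_scripts) and a Vertex AI chat model, and the model name and iteration count are illustrative placeholders.

from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_google_vertexai import ChatVertexAI

# Assumptions: model name and number of iterations are placeholders.
llm = ChatVertexAI(model_name="gemini-1.5-flash-001", temperature=0.0)
transformer = LLMGraphTransformer(llm=llm)

def extract_with_iterations(text: str, n_iterations: int = 3):
    """Run the same extraction repeatedly and union the results."""
    nodes, relationships = {}, set()
    for _ in range(n_iterations):
        graph_docs = transformer.convert_to_graph_documents([Document(page_content=text)])
        for gdoc in graph_docs:
            for node in gdoc.nodes:
                # Deduplicate by (id, type); repeated runs tend to surface new entities.
                nodes[(node.id, node.type)] = node
            for rel in gdoc.relationships:
                relationships.add((rel.source.id, rel.type, rel.target.id))
    # The resulting nodes and relationships would then be merged into KG_1 in Neo4j.
    return list(nodes.values()), relationships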

Metagraph-based Fusion

While this approach was not fully implemented, many of its components are either used as part of the other approach or are implemented separately. The following is an outline of how the existing elements would have to be combined and extended to conduct experiments on metagraph-based fusion:

  • fusion/metagraph/metagraph.py showcases how a metagraph projection can be created using the GDS library in Neo4j, and how previously created node label (or relationship label) embeddings can be added to the metagraph in order to perform KNN similarity (see the sketch after this list).
  • fusion/metagraph/string_embedding.py includes code to retrieve and store string embeddings from the Vertex AI API.
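
As a rough illustration of the KNN step, the sketch below runs GDS KNN over label embeddings via the Python driver. It assumes an in-memory projection named "metagraph" whose nodes carry an embedding property and a name property; these names, the connection details, and topK are placeholders rather than the exact setup in fusion/metagraph/metagraph.py.

from neo4j import GraphDatabase

URI, AUTH = "bolt://localhost:7687", ("neo4j", "password")  # placeholders

KNN_QUERY = """
CALL gds.knn.stream('metagraph', {
  nodeProperties: ['embedding'],  // label embeddings stored on the metagraph nodes
  topK: 3                         // 3 nearest candidate labels per label
})
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).name AS label1,
       gds.util.asNode(node2).name AS label2,
       similarity
ORDER BY similarity DESC
"""

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    records, _, _ = driver.execute_query(KNN_QUERY)
    for rec in records:
        # High-similarity pairs are candidates for aligning node/relationship labels.
        print(rec["label1"], rec["label2"], rec["similarity"])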

The metagraph-based fusion approach involves creating a higher-level abstraction of the knowledge graphs by merging their metagraphs first before combining the individual nodes and relationships. A metagraph captures the structure and semantics of the graph at a more abstract level, providing a framework for aligning the two knowledge graphs.

The process starts by analyzing and comparing the metagraphs of $KG_1$ and $KG_2$. The metagraph serves as a schema or blueprint for the entire graph, capturing the relationships between different kinds of entities and their properties. Once the metagraphs are aligned, the individual graphs are merged based on the metagraph's structure.

Key steps in the metagraph-based fusion include:

  1. Semantic Alignment: Semantic similarity between the node labels, relationship labels, and graph structures is identified. This could involve string matching techniques, embedding models, or ontology-based methods.
  2. Node Merging: Nodes from $KG_1$ and $KG_2$ that represent the same or similar entities are identified and merged. This step might include heuristic approaches or similarity measures such as Jaccard similarity or cosine similarity on embeddings (a combined variant is sketched after this list).
  3. Relationship Merging: Relationships that connect nodes in both graphs are also merged. This requires ensuring that the semantics of the relationships align, possibly using embedding models for relationship alignment.
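
A minimal sketch of such a combined score, in the spirit of eval/combined_similarity_metric.py, is shown below: alpha weights the structural (Jaccard) overlap of two nodes' neighbour sets against the semantic (cosine) similarity of their label embeddings. The function names and inputs are illustrative, and the embeddings are assumed to come from an external model such as the Vertex AI text embedding API.

import numpy as np

def jaccard_similarity(neighbors_a: set, neighbors_b: set) -> float:
    """Structural overlap of the two nodes' neighbour sets."""
    if not neighbors_a and not neighbors_b:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(neighbors_a | neighbors_b)

def cosine_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Semantic similarity of the two nodes' label/description embeddings."""
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def combined_similarity(neighbors_a, neighbors_b, emb_a, emb_b, alpha: float = 0.5) -> float:
    """alpha * Jaccard + (1 - alpha) * cosine; different alpha values shift the balance."""
    return alpha * jaccard_similarity(neighbors_a, neighbors_b) \
        + (1 - alpha) * cosine_similarity(emb_a, emb_b)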

A challenge with this approach is the lack of consistent ontologies or schemas in real-world graphs, which often makes it difficult to apply the technique directly without significant customization. However, the method has significant advantages over the purely LLM-based approach: aligning the types of nodes and relationships used by $KG_{1}$ and $KG_{2}$ through their metagraphs reduces the overall number of types and thus allows for easier merging of nodes and relationships.

Evaluation

To assess the quality of the fused knowledge graph, a question-answering (QA) evaluation inspired by the RAGAS framework was performed. The goal of this evaluation is to verify how well the merged graph supports answering medical queries based on the extracted knowledge.

  1. Question Generation: A set of question-answer pairs is generated from a subset of documents $D$ (the corpus). These questions are designed to cover various medical topics, ensuring they require knowledge from both $KG_1$ and $KG_2$.
  2. Entity Extraction: For each question, medical entities (e.g., diseases, drugs, treatments) are identified. These entities are queried in the merged graph to retrieve relevant neighbors and associated knowledge.
  3. Answer Generation: Using the retrieved subgraph, a natural language answer is generated. This could involve extracting facts directly from the graph or synthesizing information from multiple nodes (steps 2 and 3 are sketched below).
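
The sketch below illustrates steps 2 and 3 under some assumptions: the retrieval query, the use of the custom_node_id property (see the Neo4j notes further down), and the prompt are placeholders rather than the exact logic in eval/ragas_eval.py.

from neo4j import GraphDatabase
from langchain_google_vertexai import ChatVertexAI

NEIGHBOR_QUERY = """
MATCH (e)-[r]-(nbr)
WHERE toLower(e.custom_node_id) CONTAINS toLower($entity)
RETURN e.custom_node_id AS source, type(r) AS rel, nbr.custom_node_id AS target
LIMIT 25
"""

def answer_question(driver, llm, question: str, entities: list[str]) -> str:
    # Step 2: retrieve the neighbourhoods of the extracted entities as textual triples.
    triples = []
    for entity in entities:
        records, _, _ = driver.execute_query(NEIGHBOR_QUERY, entity=entity)
        triples += [f"({r['source']})-[{r['rel']}]->({r['target']})" for r in records]
    # Step 3: generate a natural-language answer grounded only in the retrieved subgraph.
    context = "\n".join(triples)
    prompt = ("Answer the question using only the following graph context.\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return llm.invoke(prompt).content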

The evaluation measures three primary metrics (a scoring sketch follows the list):

  • Factual correctness: measures the factual accuracy of the generated response compared to the reference. This metric evaluates how well the generated response aligns with the reference by breaking down both into claims and using natural language inference to determine factual overlap.
  • BLEU score: evaluates the quality of the generated response by comparing it to reference answers. It measures the similarity between the response and the reference based on n-gram precision and brevity penalty.
  • Faithfulness: measures how factually consistent a response is with the retrieved context. A response is considered faithful if all its claims can be supported by the retrieved context.
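
As a rough illustration, the following sketch scores evaluation samples with these three metrics. It assumes the ragas 0.2 metric classes and a Vertex AI judge model; the sample content is made up, and the actual setup in ragas_eval.py may differ.

from ragas import evaluate, EvaluationDataset
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import BleuScore, FactualCorrectness, Faithfulness
from langchain_google_vertexai import ChatVertexAI

# Illustrative sample: question, retrieved graph context, generated answer, gold answer.
samples = [{
    "user_input": "Which side effects can ibuprofen cause?",
    "retrieved_contexts": ["(Ibuprofen)-[CAUSES]->(Stomach pain)"],
    "response": "Ibuprofen can cause stomach pain.",
    "reference": "Ibuprofen can cause, among other things, stomach pain.",
}]

judge = LangchainLLMWrapper(ChatVertexAI(model_name="gemini-1.5-flash-001"))
result = evaluate(
    dataset=EvaluationDataset.from_list(samples),
    metrics=[FactualCorrectness(), BleuScore(), Faithfulness()],
    llm=judge,  # factual correctness and faithfulness need an LLM judge
)
print(result)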

In the brute-force LLM-based fusion approach, the merged graph demonstrated a clear improvement over the baseline graph ($KG_1$ only). This improvement is reflected in higher factual correctness and faithfulness scores, although the BLEU scores remained relatively low. These results indicate that the fused graph, especially when augmented with LLM-generated nodes and relationships, leads to improvements in answering questions based on medical knowledge.

These low values have a straightforward explanation: for most of these questions, the relevant data was simply not included in the merged knowledge graph. However, this does not mean that the merging algorithm does not work as intended, because we cannot be sure that the QA evaluation dataset is complete. As mentioned before, this can be remedied by:

  1. increasing the number of iterations when running node and relationship extraction in the brute-force approach, and
  2. increasing the number of generated question-answer pairs in the QA dataset generation. Ideally, if there are, for example, 100 statements (facts) in a document, then 100 question-answer pairs should be generated based on these facts. In that case, the RAGAS metrics would reflect the actual quality of the merging approach and could be used to compare different merging approaches.

Below are examples of questions, the gold (correct) response, and the response provided by the KG retrieval. Most of the results in the QA dataset evaluation look like the two bottom answers, i.e. the necessary context information is simply not included in the graph.

Conclusion

The process of knowledge graph fusion is complex and non-trivial, especially when merging disparate sources like $KG_1$ and $KG_2$, which vary significantly in structure, terminology, and intended audience. Through the explored fusion approaches, the brute-force LLM-based method and the metagraph-based fusion, we have demonstrated that large language models can significantly enhance the graph by extracting new entities and relationships from text data. However, this process also requires careful consideration of semantic alignment, consistency, and post-processing to ensure the quality of the merged graph. While LLM-based methods offer flexibility, they come with challenges such as node duplication and semantic mismatches. The evaluation results suggest that the fused knowledge graph shows some improvement over the individual knowledge graphs.

Moving forward, future work can focus on refining the fusion methods by leveraging metagraph merging strategies and exploring further integration of LLMs for better scalability and accuracy. Future projects should definitely explore the use of QA-on-graph evaluation with RAGAS metrics, because it is an evaluation approach that can be truly universal and domain-agnostic: by creating different types of QA datasets on subsets of the graphs or of the provided text corpus, one could analyze very specific aspects of the merging process. Despite the challenges, knowledge graph fusion remains a promising avenue for integrating diverse data sources and creating richer, more comprehensive knowledge representations, and high-quality knowledge graphs are extremely useful for many different applications.

Additional:

Setting up and using Neo4j

The provided setup for using Neo4j can be found in the kg directory and can be started by running docker compose up. An additional custom-entrypoint.sh script is used in the setup sequence to manage the database dump that should be used inside the Neo4j container. There are two main scenarios for how this setup is used:

  • Load an existing Neo4j dump: In the Dockerfile set the DUMP_FILE variable to the name of the dump file that should be loaded. The corresponding file should be located in the kg/dumps subdirectory. The SKIP_DUMP variable in the custom-entrypoint.sh script should be set to false.

  • Empty (default) Neo4j database: Delete all existing kg_neo4j_data volumes and set the SKIP_DUMP variable to true.

Additionally, if you would like to export an existing Neo4j database by creating a .dump file, execute the following steps:

  • stop Neo4j inside the container: neo4j-admin server stop
  • inside the container: neo4j-admin database dump neo4j --to-path="."
  • copy the .dump file to the host, either via docker cp or the Docker Desktop UI

It is important to note that the Neo4j container itself has to be running when the first command is executed. Normally, stopping the Neo4j server automatically stops the container as well. This behaviour is overridden by an additional command in the custom-entrypoint.sh script: tail -f /dev/null.

Note: Before running the merging scripts, one has to make sure that the original graph uses the same id attribute names as the scripts. Consequently, one has to execute the following Cypher query beforehand:

MATCH (n)
WHERE n.id IS NOT NULL
SET n.custom_node_id = n.id
REMOVE n.id

On multiprocessing and using the Vertex AI api

When using the free trial version of Google Cloud, strict quotas are in place for using the different models provided by the Vertex AI API. These quotas can be examined in the Google Cloud Console under API->Quotas.

The use of Gemini models, for example, is limited to 1 request per minute, per region, per base model. There are also additional limits on input and output token lengths, but these are less important for the experiments in this repository. Thus, when executing code that uses a particular base model, the only way to increase throughput is to use multiple regions at the same time. For gemini 1.5 flash 001, for example, one could use the following West European regions: regions = ["europe-west1", "europe-west2", "europe-west3", "europe-west4", "europe-west6", "europe-west8", "europe-west9"]. While models from the Gemini family are widely available across many regions, other models provided by the Vertex AI API (such as models by Mistral, Meta, and Anthropic) are only accessible in a single (US) region and are consequently not suitable for multiprocessing in which each process uses an individual endpoint with a different region.
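
A minimal sketch of such a multi-region setup is shown below, using the vertexai Python SDK and one worker process per region. The project ID, model name, and chunking strategy are placeholders, and the actual logic in multiprocess_bruteforce.py may differ.

import multiprocessing as mp
import vertexai
from vertexai.generative_models import GenerativeModel

REGIONS = ["europe-west1", "europe-west2", "europe-west3", "europe-west4",
           "europe-west6", "europe-west8", "europe-west9"]
PROJECT_ID = "your-gcp-project"  # placeholder

def worker(args):
    region, prompts = args
    # One endpoint per process: quotas apply per region and per base model.
    vertexai.init(project=PROJECT_ID, location=region)
    model = GenerativeModel("gemini-1.5-flash-001")
    return [model.generate_content(p).text for p in prompts]

def run_in_parallel(prompts):
    # Split the prompts across regions, one worker process per region.
    chunks = [prompts[i::len(REGIONS)] for i in range(len(REGIONS))]
    with mp.Pool(processes=len(REGIONS)) as pool:
        results = pool.map(worker, list(zip(REGIONS, chunks)))
    return [answer for chunk in results for answer in chunk]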

About

A project on knowledge graph fusion in the medical domain implemented for the master practical module at Heidelberg University.
