feat: add vqa pipeline #69

ChenZiHong-Gavin · 2025-10-20T09:23:51Z

This PR introduces significant improvements to the handling of multi-modal (text, image, table, equation) documents and their integration into the knowledge graph generation pipeline. The changes include a refactor of the document insertion workflow to process text and multi-modal documents separately, the addition of filtering and validation for input data, and updates to key data structures and configuration files to better support multi-modal processing. Logging has also been improved for easier debugging and traceability.

Copilot

Pull Request Overview

This PR adds a VQA (Visual Question Answering) pipeline by introducing support for multimodal document processing. The key changes include:

- Adding a centralized `filter` method in `BaseReader` to validate and filter text, image, table, and equation entries
- Updating all reader implementations to use this common filtering logic
- Providing a new VQA demo JSON file with scientific paper content including text, images, and tables
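
As a rough illustration of what such a centralized filter might look like, here is a minimal sketch. It is not the PR's actual `BaseReader.filter`: the function name `filter_items` and the injected `image_exists` callable are hypothetical. The `type` and `content` keys appear in the `pdf_reader.py` diff quoted in this PR, while the `img_path` key is an assumption based on the commit message about `vqa_generator`.

```python
import os
from typing import Callable, Dict, List


def filter_items(
    items: List[Dict],
    image_exists: Callable[[str], bool] = os.path.exists,
) -> List[Dict]:
    """Drop entries that carry no usable content (hypothetical helper)."""
    kept = []
    for item in items:
        kind = item.get("type")
        if kind == "text":
            # Skip text entries whose content is missing or whitespace-only.
            content = item.get("content") or ""
            if not str(content).strip():
                continue
        elif kind == "image":
            # Keep image entries only if the referenced image can be found.
            # "img_path" is an assumed key name.
            path = item.get("img_path")
            if not path or not image_exists(path):
                continue
        # Other types (e.g. table, equation) pass through unchanged here.
        kept.append(item)
    return kept
```

Passing the existence check in as a parameter keeps the sketch independent of any particular local-file or HTTP lookup.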

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| resources/input_examples/vqa_demo.json | New demo file containing scientific paper data with text, images, tables, and equations |
| graphgen/bases/base_reader.py | Added `filter` method to validate content and check image existence |
| graphgen/models/reader/txt_reader.py | Updated to call `filter` method on results |
| graphgen/models/reader/pdf_reader.py | Updated to call `filter` method and removed duplicate filtering logic |
| graphgen/models/reader/jsonl_reader.py | Updated validation logic and added `filter` call |
| graphgen/models/reader/json_reader.py | Updated validation logic and added `filter` call |
| graphgen/models/reader/csv_reader.py | Updated validation logic and added `filter` call |
| graphgen/configs/vqa_config.yaml | Changed input file from PDF demo to VQA JSON demo |
| graphgen/graphgen.py | Added debug print statement |

graphgen/graphgen.py

graphgen/bases/base_reader.py

Copilot

Pull Request Overview

Copilot reviewed 21 out of 27 changed files in this pull request and generated 9 comments.

graphgen/operators/split/split_chunks.py

graphgen/models/reader/jsonl_reader.py

graphgen/models/reader/json_reader.py

graphgen/graphgen.py

Copilot · 2025-10-21T09:37:22Z

graphgen/bases/base_reader.py

+        :return: Filtered list of dictionaries.
+        """
+
+        def _image_exists(path_or_url: str, timeout: int = 3) -> bool:


The _image_exists function makes a network request for every URL with a 3-second timeout. For documents with many images, this could significantly slow down the filtering process. Consider implementing caching or batch validation to improve performance.
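
One way to act on the caching suggestion is sketched below. It is an assumption-laden illustration rather than code from this PR: `cached_checker` is a hypothetical helper that wraps any existence check with `functools.lru_cache`, and `slow_check` merely stands in for the real network probe.

```python
from functools import lru_cache
from typing import Callable, List


def cached_checker(
    check: Callable[[str], bool], maxsize: int = 1024
) -> Callable[[str], bool]:
    """Wrap an existence check so each distinct path or URL is probed once."""

    @lru_cache(maxsize=maxsize)
    def _cached(path_or_url: str) -> bool:
        return check(path_or_url)

    return _cached


calls: List[str] = []


def slow_check(url: str) -> bool:
    # Stand-in for the real probe; records every call that actually runs.
    calls.append(url)
    return url.endswith(".png")


image_exists = cached_checker(slow_check)
results = [image_exists(u) for u in ["a.png", "a.png", "b.txt", "a.png"]]
# Only two probes ran, one per distinct URL.
```

Note that this also caches negative results, so a transient network failure would be remembered for the lifetime of the cache; whether that is acceptable depends on how the reader is used.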

graphgen/bases/base_reader.py

graphgen/graphgen.py

graphgen/bases/datatypes.py

Copilot

Pull Request Overview

Copilot reviewed 37 out of 43 changed files in this pull request and generated 4 comments.

Copilot · 2025-10-22T08:59:25Z

graphgen/models/reader/pdf_reader.py

            for key in ("page_idx", "bbox", "text_level"):
                if item.get(key) is not None:
                    del item[key]
-            if item["type"] == "text" and not item["content"].strip():
-                continue
            results.append(item)


Direct modification of dictionary during iteration is error-prone. The deletion of keys from item should happen before appending to results, or a filtered copy should be created instead. Consider using dictionary comprehension: item = {k: v for k, v in item.items() if k not in ('page_idx', 'bbox', 'text_level')}.
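
The filtered-copy approach from the comment above could look like the following sketch. The helper name `strip_layout_keys` and the constant `DROP_KEYS` are hypothetical; the key names come from the quoted diff.

```python
DROP_KEYS = ("page_idx", "bbox", "text_level")


def strip_layout_keys(item: dict) -> dict:
    """Return a shallow copy of item without the layout-only keys."""
    return {k: v for k, v in item.items() if k not in DROP_KEYS}


original = {"type": "text", "content": "x", "page_idx": 0, "bbox": [0, 0, 1, 1]}
cleaned = strip_layout_keys(original)
# `original` is left untouched; `cleaned` holds only "type" and "content".
```

One behavioral difference from the quoted loop: the loop deletes a key only when its value is not `None`, whereas this copy drops the listed keys regardless of their value.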

Copilot · 2025-10-22T08:59:25Z

graphgen/models/reader/csv_reader.py

+        for _, row in df.iterrows():
+            if row.get("type") == "text" and self.text_column not in row:
+                raise ValueError(
+                    f"Missing '{self.text_column}' in document: {row.to_dict()}"
+                )


Using df.iterrows() is inefficient for pandas DataFrames. This validation should be performed using vectorized operations or at least df.itertuples() for better performance. Consider: text_rows = df[df['type'] == 'text']; if text_rows[self.text_column].isna().any(): raise ValueError(...).

Suggested change

-        for _, row in df.iterrows():
-            if row.get("type") == "text" and self.text_column not in row:
-                raise ValueError(
-                    f"Missing '{self.text_column}' in document: {row.to_dict()}"
-                )
+        text_rows = df[df["type"] == "text"]
+        if self.text_column not in df.columns or text_rows[self.text_column].isna().any():
+            # Find the first offending row for error message
+            missing_row = text_rows[text_rows[self.text_column].isna()].iloc[0] if self.text_column in df.columns and not text_rows.empty else None
+            raise ValueError(
+                f"Missing '{self.text_column}' in document: {missing_row.to_dict() if missing_row is not None else 'Column not found'}"
+            )

Copilot · 2025-10-22T08:59:25Z

graphgen/bases/base_reader.py

+            try:
+                resp = requests.head(path_or_url, allow_redirects=True, timeout=timeout)
+                return resp.status_code == 200
+            except requests.RequestException:
+                return False


The _image_exists function makes a blocking HTTP request for each image URL, which will severely impact performance when processing documents with many images. Consider adding async support or implementing batch validation with connection pooling.
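
A minimal sketch of batch validation follows. It uses a thread pool rather than the async support the comment mentions, and it is not the PR's code: `check_many` and its parameters are hypothetical names. The `check` callable is injected so the sketch stays self-contained; in practice it might wrap a `HEAD` request issued through a shared `requests.Session`, which would provide connection pooling, but that wiring is an assumption and is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def check_many(
    urls: Iterable[str],
    check: Callable[[str], bool],
    max_workers: int = 8,
) -> Dict[str, bool]:
    """Probe distinct URLs concurrently and map each one to its result."""
    # De-duplicate while keeping first-seen order, so each URL is probed once.
    unique = list(dict.fromkeys(urls))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `unique`.
        results = list(pool.map(check, unique))
    return dict(zip(unique, results))
```

With this shape, a filter could collect all image URLs first, call `check_many` once, and then look up each entry in the returned mapping instead of blocking on one request per image.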

graphgen/models/partitioner/anchor_bfs_partitioner.py

Copilot

Pull Request Overview

Copilot reviewed 40 out of 46 changed files in this pull request and generated 3 comments.

graphgen/models/reader/csv_reader.py

graphgen/models/kg_builder/mm_kg_builder.py

graphgen/graphgen.py

Copilot

Pull Request Overview

Copilot reviewed 40 out of 46 changed files in this pull request and generated 6 comments.

graphgen/models/partitioner/anchor_bfs_partitioner.py

graphgen/bases/base_reader.py

graphgen/models/kg_builder/mm_kg_builder.py

graphgen/operators/partition/partition_kg.py

graphgen/models/generator/vqa_generator.py

graphgen/templates/kg/kg_extraction.py

ChenZiHong-Gavin added 3 commits October 20, 2025 17:18

docs: replace vqa_demo.json

ee2e59a

fix: support content type for input data

90f3a72

feat: filter non-exist content

12ee557

ChenZiHong-Gavin requested a review from Copilot October 21, 2025 03:35

Copilot AI reviewed Oct 21, 2025

graphgen/graphgen.py (outdated, resolved)

graphgen/bases/base_reader.py (resolved)

ChenZiHong-Gavin added 5 commits October 21, 2025 11:45

Merge branch 'main' of https://github.com/open-sciencelab/GraphGen in…

5cee7f2

…to feature/vqa-pipeline

docs: add test data

341231a

refactor: turn log level to DEBUG when extracting KG

cbbd2ae

refactor: turn log level to DEBUG when extracting KG

b854079

feat: add support for multi-modal chunk

7c66cd7

ChenZiHong-Gavin requested a review from Copilot October 21, 2025 09:35

Copilot AI reviewed Oct 21, 2025

ChenZiHong-Gavin and others added 8 commits October 21, 2025 18:49

Update graphgen/models/reader/jsonl_reader.py

30c43db

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix: DEBUG log level for FileHandler & INFO log level for RichHandler

7c71be7

Merge branch 'feature/vqa-pipeline' of https://github.com/open-scienc…

5a624ac

…elab/GraphGen into feature/vqa-pipeline

fix: fix language check

8042f03

Update graphgen/models/reader/json_reader.py

6b0c8a3

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

feat: add mm_kg_builder

16c0d85

feat: add anchor_bfs_partitioner

8b05bb3

fix: fix language check

4df2948

ChenZiHong-Gavin requested a review from Copilot October 22, 2025 08:57

Copilot AI reviewed Oct 22, 2025

feat: add vqa_generator

6fa1537

ChenZiHong-Gavin requested a review from Copilot October 23, 2025 03:54

ChenZiHong-Gavin marked this pull request as ready for review October 23, 2025 03:54

Copilot AI reviewed Oct 23, 2025

graphgen/models/reader/csv_reader.py (outdated, resolved)

graphgen/models/kg_builder/mm_kg_builder.py (resolved)

graphgen/graphgen.py (resolved)

ChenZiHong-Gavin and others added 4 commits October 23, 2025 11:59

Update graphgen/models/reader/csv_reader.py

c8c6979

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update graphgen/models/partitioner/anchor_bfs_partitioner.py

3ee98a9

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

feat: add vqa_generator

22aae9a

Merge branch 'feature/vqa-pipeline' of https://github.com/open-scienc…

122cd4c

…elab/GraphGen into feature/vqa-pipeline

ChenZiHong-Gavin requested a review from Copilot October 23, 2025 10:36

Copilot AI reviewed Oct 23, 2025

ChenZiHong-Gavin and others added 4 commits October 23, 2025 18:44

fix: fix aggregated template

aa87906

Update graphgen/operators/partition/partition_kg.py

d5bbdcb

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix: fix fetching img_path in vqa_generator

ef2e109

Merge branch 'feature/vqa-pipeline' of https://github.com/open-scienc…

b2db994

…elab/GraphGen into feature/vqa-pipeline

ChenZiHong-Gavin merged commit 56362ac into main Oct 23, 2025
3 checks passed

ChenZiHong-Gavin deleted the feature/vqa-pipeline branch October 23, 2025 11:07

ChenZiHong-Gavin mentioned this pull request Oct 23, 2025

[Summary] GraphGen Roadmap #49

Open

30 tasks

2 participants

ChenZiHong-Gavin commented Oct 20, 2025 •

edited

Loading