
Collaborator

@ChenZiHong-Gavin ChenZiHong-Gavin commented Oct 20, 2025

This PR introduces significant improvements to the handling of multi-modal (text, image, table, equation) documents and their integration into the knowledge graph generation pipeline. The changes include a refactor of the document insertion workflow to separately process text and multi-modal documents, the addition of filtering and validation for input data, and updates to key data structures and configuration files to better support multi-modal processing. Logging has also been improved for better debugging and traceability.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds a VQA (Visual Question Answering) pipeline by introducing support for multimodal document processing. The key changes include:

  • Adding a centralized filter method in BaseReader to validate and filter text, image, table, and equation entries
  • Updating all reader implementations to use this common filtering logic
  • Providing a new VQA demo JSON file with scientific paper content including text, images, and tables

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| resources/input_examples/vqa_demo.json | New demo file containing scientific paper data with text, images, tables, and equations |
| graphgen/bases/base_reader.py | Added filter method to validate content and check image existence |
| graphgen/models/reader/txt_reader.py | Updated to call filter method on results |
| graphgen/models/reader/pdf_reader.py | Updated to call filter method and removed duplicate filtering logic |
| graphgen/models/reader/jsonl_reader.py | Updated validation logic and added filter call |
| graphgen/models/reader/json_reader.py | Updated validation logic and added filter call |
| graphgen/models/reader/csv_reader.py | Updated validation logic and added filter call |
| graphgen/configs/vqa_config.yaml | Changed input file from PDF demo to VQA JSON demo |
| graphgen/graphgen.py | Added debug print statement |

Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 21 out of 27 changed files in this pull request and generated 9 comments.


:return: Filtered list of dictionaries.
"""

def _image_exists(path_or_url: str, timeout: int = 3) -> bool:

Copilot AI Oct 21, 2025

The _image_exists function makes a network request for every URL with a 3-second timeout. For documents with many images, this could significantly slow down the filtering process. Consider implementing caching or batch validation to improve performance.
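One way to act on the caching suggestion is to memoize the existence check so that each distinct path or URL is probed at most once per run. The following is a minimal sketch, not the project's implementation: `make_cached_checker` and `fake_image_exists` are hypothetical names, and the fake check stands in for the real network probe so the example runs offline.

```python
from functools import lru_cache
from typing import Callable


def make_cached_checker(
    check: Callable[[str], bool], maxsize: int = 1024
) -> Callable[[str], bool]:
    """Wrap an existence check so each distinct path or URL is probed at most once."""

    @lru_cache(maxsize=maxsize)
    def cached(path_or_url: str) -> bool:
        return check(path_or_url)

    return cached


# Hypothetical stand-in for the real probe; it records every call it receives.
calls = []


def fake_image_exists(path_or_url: str) -> bool:
    calls.append(path_or_url)
    return path_or_url.endswith(".png")


cached_exists = make_cached_checker(fake_image_exists)
results = [cached_exists(u) for u in ["a.png", "a.png", "b.txt", "a.png"]]
# The underlying check ran only for the two distinct inputs.
```

Note that this also caches negative results, so a transient network failure would be remembered for the lifetime of the cache. Whether that trade-off is acceptable depends on how the filter is used.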

Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 37 out of 43 changed files in this pull request and generated 4 comments.


Comment on lines 172 to 175
for key in ("page_idx", "bbox", "text_level"):
    if item.get(key) is not None:
        del item[key]
if item["type"] == "text" and not item["content"].strip():
    continue
results.append(item)

Copilot AI Oct 22, 2025

Direct modification of dictionary during iteration is error-prone. The deletion of keys from item should happen before appending to results, or a filtered copy should be created instead. Consider using dictionary comprehension: item = {k: v for k, v in item.items() if k not in ('page_idx', 'bbox', 'text_level')}.
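The filtered-copy approach from this comment can be sketched as follows. As far as the quoted snippet shows, the loop iterates over a tuple of key names rather than over the dictionary itself, so the deletion would not raise; the practical effect of copying is that the caller's dictionaries are left unmodified. The helper names (`strip_layout_keys`, `clean_items`), the use of `.get`, and the `img_path` field in the sample data are assumptions for illustration and are not taken from the PR.

```python
LAYOUT_KEYS = ("page_idx", "bbox", "text_level")


def strip_layout_keys(item: dict) -> dict:
    """Return a shallow copy of item without layout-only keys; the input is not mutated."""
    return {k: v for k, v in item.items() if k not in LAYOUT_KEYS}


def clean_items(items: list) -> list:
    """Drop layout keys from every item and skip text items with blank content."""
    results = []
    for item in items:
        cleaned = strip_layout_keys(item)
        # Skip text entries whose content is empty or whitespace-only.
        if cleaned.get("type") == "text" and not cleaned.get("content", "").strip():
            continue
        results.append(cleaned)
    return results


raw = [
    {"type": "text", "content": "Hello", "page_idx": 0, "bbox": [0, 0, 1, 1]},
    {"type": "text", "content": "   ", "page_idx": 1},
    {"type": "image", "img_path": "fig1.png", "text_level": None},
]
cleaned = clean_items(raw)
```

One behavioral difference is worth noting: the comprehension removes a listed key even when its value is `None`, whereas the original `item.get(key) is not None` test leaves such keys in place.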

Comment on lines 12 to 16
for _, row in df.iterrows():
    if row.get("type") == "text" and self.text_column not in row:
        raise ValueError(
            f"Missing '{self.text_column}' in document: {row.to_dict()}"
        )

Copilot AI Oct 22, 2025

Using df.iterrows() is inefficient for pandas DataFrames. This validation should be performed using vectorized operations or at least df.itertuples() for better performance. Consider: text_rows = df[df['type'] == 'text']; if text_rows[self.text_column].isna().any(): raise ValueError(...).

Suggested change
- for _, row in df.iterrows():
-     if row.get("type") == "text" and self.text_column not in row:
-         raise ValueError(
-             f"Missing '{self.text_column}' in document: {row.to_dict()}"
-         )
+ text_rows = df[df["type"] == "text"]
+ if self.text_column not in df.columns or text_rows[self.text_column].isna().any():
+     # Find the first offending row for error message
+     missing_row = text_rows[text_rows[self.text_column].isna()].iloc[0] if self.text_column in df.columns and not text_rows.empty else None
+     raise ValueError(
+         f"Missing '{self.text_column}' in document: {missing_row.to_dict() if missing_row is not None else 'Column not found'}"
+     )

Comment on lines +47 to +51
try:
    resp = requests.head(path_or_url, allow_redirects=True, timeout=timeout)
    return resp.status_code == 200
except requests.RequestException:
    return False

Copilot AI Oct 22, 2025

The _image_exists function makes a blocking HTTP request for each image URL, which will severely impact performance when processing documents with many images. Consider adding async support or implementing batch validation with connection pooling.
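A lighter alternative to full async support is to run the blocking checks in a thread pool, so that the waits overlap instead of adding up. The sketch below assumes the check is passed in as a callable; `check_many` and `fake_check` are hypothetical names, and the fake check replaces the real HTTP request so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def check_many(
    urls: Iterable[str], check: Callable[[str], bool], max_workers: int = 8
) -> Dict[str, bool]:
    """Probe each distinct URL once, running the blocking checks in worker threads."""
    unique = list(dict.fromkeys(urls))  # de-duplicate while keeping first-seen order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs each URL with its result.
        return dict(zip(unique, pool.map(check, unique)))


# Hypothetical stand-in for the real HTTP HEAD probe.
def fake_check(url: str) -> bool:
    return url.startswith("https://")


status = check_many(
    [
        "https://example.com/a.png",
        "http://example.com/b.png",
        "https://example.com/a.png",
    ],
    fake_check,
)
```

In the reader itself, `check` could be `_image_exists` or a closure over a shared `requests.Session`, which reuses connections across requests. That wiring is an assumption and is not shown here.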

@ChenZiHong-Gavin ChenZiHong-Gavin marked this pull request as ready for review October 23, 2025 03:54
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 40 out of 46 changed files in this pull request and generated 3 comments.


ChenZiHong-Gavin and others added 4 commits October 23, 2025 11:59
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 40 out of 46 changed files in this pull request and generated 6 comments.


@ChenZiHong-Gavin ChenZiHong-Gavin merged commit 56362ac into main Oct 23, 2025
3 checks passed
@ChenZiHong-Gavin ChenZiHong-Gavin deleted the feature/vqa-pipeline branch October 23, 2025 11:07
@ChenZiHong-Gavin ChenZiHong-Gavin mentioned this pull request Oct 23, 2025
30 tasks

2 participants