feat: add vqa pipeline #69
Pull Request Overview
This PR adds a VQA (Visual Question Answering) pipeline by introducing support for multimodal document processing. The key changes include:
- Adding a centralized `filter` method in `BaseReader` to validate and filter text, image, table, and equation entries
- Updating all reader implementations to use this common filtering logic
- Providing a new VQA demo JSON file with scientific paper content including text, images, and tables
 
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description | 
|---|---|
| resources/input_examples/vqa_demo.json | New demo file containing scientific paper data with text, images, tables, and equations | 
| graphgen/bases/base_reader.py | Added filter method to validate content and check image existence | 
| graphgen/models/reader/txt_reader.py | Updated to call filter method on results | 
| graphgen/models/reader/pdf_reader.py | Updated to call filter method and removed duplicate filtering logic | 
| graphgen/models/reader/jsonl_reader.py | Updated validation logic and added filter call | 
| graphgen/models/reader/json_reader.py | Updated validation logic and added filter call | 
| graphgen/models/reader/csv_reader.py | Updated validation logic and added filter call | 
| graphgen/configs/vqa_config.yaml | Changed input file from PDF demo to VQA JSON demo | 
| graphgen/graphgen.py | Added debug print statement | 
…to feature/vqa-pipeline
Pull Request Overview
Copilot reviewed 21 out of 27 changed files in this pull request and generated 9 comments.
```python
:return: Filtered list of dictionaries.
"""

def _image_exists(path_or_url: str, timeout: int = 3) -> bool:
```
Copilot AI, Oct 21, 2025
The _image_exists function makes a network request for every URL with a 3-second timeout. For documents with many images, this could significantly slow down the filtering process. Consider implementing caching or batch validation to improve performance.
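One way to act on the caching suggestion is to memoize the existence check so that each distinct path or URL is probed at most once per process. The following is only a sketch under stated assumptions: `make_cached_checker` is a name invented here, and `fake_probe` is a hypothetical stand-in for the real network check so the example runs without network access.

```python
from functools import lru_cache
from typing import Callable, List


def make_cached_checker(
    probe: Callable[[str], bool], maxsize: int = 1024
) -> Callable[[str], bool]:
    """Wrap an existence probe so repeated inputs reuse the first result."""

    @lru_cache(maxsize=maxsize)
    def cached(path_or_url: str) -> bool:
        return probe(path_or_url)

    return cached


# Hypothetical probe: records each call instead of touching the network.
calls: List[str] = []


def fake_probe(url: str) -> bool:
    calls.append(url)
    return url.endswith(".png")


image_exists = make_cached_checker(fake_probe)
results = [image_exists(u) for u in ["a.png", "a.png", "b.txt", "a.png"]]
# The probe ran only for the two distinct inputs.
```

Note that `lru_cache` also remembers negative results, so a URL that failed once because of a transient network error would keep being reported as missing until the cache is cleared.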
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…elab/GraphGen into feature/vqa-pipeline
Pull Request Overview
Copilot reviewed 37 out of 43 changed files in this pull request and generated 4 comments.
```python
for key in ("page_idx", "bbox", "text_level"):
    if item.get(key) is not None:
        del item[key]
if item["type"] == "text" and not item["content"].strip():
    continue
results.append(item)
```
Copilot AI, Oct 22, 2025
Direct modification of dictionary during iteration is error-prone. The deletion of keys from `item` should happen before appending to `results`, or a filtered copy should be created instead. Consider using dictionary comprehension: `item = {k: v for k, v in item.items() if k not in ('page_idx', 'bbox', 'text_level')}`.
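The filtered-copy approach from this comment can be sketched as follows. The helper name `strip_layout_keys` and the sample record are made up for illustration. One behavioral difference is worth noting: the reviewed loop deletes a key only when its value is not `None`, whereas the comprehension drops the key regardless of its value.

```python
from typing import Any, Dict

DROP_KEYS = ("page_idx", "bbox", "text_level")


def strip_layout_keys(item: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of item without layout-only keys; the input is not mutated."""
    return {k: v for k, v in item.items() if k not in DROP_KEYS}


# Illustrative record; the field values are invented for this example.
original = {"type": "text", "content": "hello", "page_idx": 0, "bbox": [0, 0, 1, 1]}
cleaned = strip_layout_keys(original)
```

Because `cleaned` is a new dictionary, any other code still holding a reference to `original` continues to see the layout keys.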
```python
for _, row in df.iterrows():
    if row.get("type") == "text" and self.text_column not in row:
        raise ValueError(
            f"Missing '{self.text_column}' in document: {row.to_dict()}"
        )
```
Copilot AI, Oct 22, 2025
Using `df.iterrows()` is inefficient for pandas DataFrames. This validation should be performed using vectorized operations or at least `df.itertuples()` for better performance. Consider: `text_rows = df[df['type'] == 'text']; if text_rows[self.text_column].isna().any(): raise ValueError(...)`.
```diff
- for _, row in df.iterrows():
-     if row.get("type") == "text" and self.text_column not in row:
-         raise ValueError(
-             f"Missing '{self.text_column}' in document: {row.to_dict()}"
-         )
+ text_rows = df[df["type"] == "text"]
+ if self.text_column not in df.columns or text_rows[self.text_column].isna().any():
+     # Find the first offending row for error message
+     missing_row = text_rows[text_rows[self.text_column].isna()].iloc[0] if self.text_column in df.columns and not text_rows.empty else None
+     raise ValueError(
+         f"Missing '{self.text_column}' in document: {missing_row.to_dict() if missing_row is not None else 'Column not found'}"
+     )
```
```python
try:
    resp = requests.head(path_or_url, allow_redirects=True, timeout=timeout)
    return resp.status_code == 200
except requests.RequestException:
    return False
```
Copilot AI, Oct 22, 2025
The _image_exists function makes a blocking HTTP request for each image URL, which will severely impact performance when processing documents with many images. Consider adding async support or implementing batch validation with connection pooling.
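A rough illustration of batch validation, offered as one possible reading of this comment rather than the project's implementation: deduplicate the inputs and run the checks concurrently in a thread pool, so the total wait is closer to the slowest single request than to the sum of all of them. The function name `check_images`, the `probe` parameter, and the default of 8 workers are assumptions made for this sketch; `probe` stands in for whatever performs the real HTTP check.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def check_images(
    urls: Iterable[str],
    probe: Callable[[str], bool],
    max_workers: int = 8,
) -> Dict[str, bool]:
    """Run probe over the distinct inputs concurrently and map each to its result."""
    unique = list(dict.fromkeys(urls))  # drop duplicates, keep first-seen order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(unique, pool.map(probe, unique)))


# Hypothetical probe used here so the example runs without network access.
status = check_images(["a.png", "b.txt", "a.png"], lambda u: u.endswith(".png"))
```

In real use, `probe` could issue its `HEAD` requests through a shared `requests.Session` so that connections to the same host are reused; whether sharing one session across threads is acceptable would need to be verified for the deployment in question.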
Pull Request Overview
Copilot reviewed 40 out of 46 changed files in this pull request and generated 3 comments.
Pull Request Overview
Copilot reviewed 40 out of 46 changed files in this pull request and generated 6 comments.
This PR introduces significant improvements to the handling of multi-modal (text, image, table, equation) documents and their integration into the knowledge graph generation pipeline. The changes include a refactor of the document insertion workflow to process text and multi-modal documents separately, the addition of filtering and validation for input data, and updates to key data structures and configuration files to better support multi-modal processing. Logging has also been improved for easier debugging and traceability.