@ChenZiHong-Gavin (Collaborator) commented Oct 17, 2025

This pull request adds support for PDF input files and prepares for a new VQA (Visual Question Answering) generation mode across the graphgen pipeline. It introduces a new PDFReader class for reading and parsing PDF files using MinerU, updates configuration files to recognize PDFs and VQA mode, and refactors reader class names for consistency. The VQA generator is also stubbed for future implementation.

Copilot AI (Contributor) left a comment

Pull Request Overview

Adds PDF ingestion via MinerU and introduces a stubbed Visual Question Answering (VQA) generation mode while refactoring reader class names to uppercase acronyms. Key changes:

  • New PDFReader and MinerUParser for parsing PDFs; input examples updated to include a "type" field.
  • Added VQAGenerator stub and generation mode wiring; configs and script updated to recognize vqa mode.
  • Refactored reader class names (CsvReader -> CSVReader etc.) and adjusted data filtering logic in graph insertion.
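
To make the reader changes concrete, here is a minimal sketch of suffix-based reader selection in which only the PDF reader takes an output directory. The class names echo the ones in this summary, but the bodies, the `build_reader` helper, and its `working_dir` parameter are hypothetical stand-ins, not the code in this PR. The PDF reader is left unimplemented, since the real one delegates to MinerU.

```python
import csv
from pathlib import Path
from typing import Any, Dict, List, Optional


class CSVReader:
    """Hypothetical stand-in: one record per CSV row, defaulting type to "text"."""

    def read(self, path: str) -> List[Dict[str, Any]]:
        with open(path, newline="", encoding="utf-8") as f:
            rows = [dict(row) for row in csv.DictReader(f)]
        for row in rows:
            # Assumption: rows without an explicit type are treated as text.
            row.setdefault("type", "text")
        return rows


class PDFReader:
    """Hypothetical stand-in: a real reader would invoke a PDF parser here."""

    def __init__(self, output_dir: str) -> None:
        # Directory where parser artifacts would be written.
        self.output_dir = output_dir

    def read(self, path: str) -> List[Dict[str, Any]]:
        raise NotImplementedError("PDF parsing is out of scope for this sketch")


def build_reader(path: str, working_dir: Optional[str] = None):
    """Pick a reader from the file suffix; only the PDF reader needs a directory."""
    suffix = Path(path).suffix.lower().lstrip(".")
    if suffix == "pdf":
        return PDFReader(output_dir=working_dir or ".")
    if suffix == "csv":
        return CSVReader()
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

Keeping the constructor difference inside one factory function means callers do not need to know which readers require extra arguments.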

Reviewed Changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| tests/integration_tests/models/reader/test_mineru_parser.py | Adds integration tests for MinerUParser parsing and empty PDF handling. |
| scripts/generate/generate_vqa.sh | New script to run VQA configuration. |
| resources/input_examples/*.json / *.jsonl / *.csv | Normalizes examples to include type="text" and adds VQA demo data. |
| graphgen/operators/read/read_files.py | Adds PDF support and renames reader classes; adjusts instantiation for PDF with output_dir. |
| graphgen/operators/generate/generate_qas.py | Wires in vqa mode to select VQAGenerator. |
| graphgen/models/reader/*.py | Renames reader classes and adds PDFReader/MinerUParser implementation. |
| graphgen/models/generator/vqa_generator.py | Introduces stubbed VQAGenerator class. |
| graphgen/models/__init__.py | Exports renamed readers and VQAGenerator. |
| graphgen/graphgen.py | Adjusts read_files call (adds working_dir) and filters docs by type; adds debug print. |
| graphgen/generate.py | Simplifies generation flow; removes conditional quiz/judge enable check. |
| graphgen/configs/*_config.yaml | Adds pdf to supported input types and vqa to mode comments plus new vqa_config.yaml. |
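
For readers skimming the table, the mode wiring and type filtering it describes could look roughly like the sketch below. This is an illustration under assumptions, not the merged code: `TextQAGenerator`, the `"text"` mode key, `select_generator`, and `filter_docs_by_type` are hypothetical names, and the VQA generator is a stub because the PR describes it as stubbed.

```python
from typing import Any, Dict, List


class TextQAGenerator:
    """Hypothetical placeholder for an existing text-based generator."""

    def generate(self, docs: List[Dict[str, Any]]) -> List[str]:
        return [f"Question about: {doc['content']}" for doc in docs]


class VQAGenerator:
    """Stub: selectable by mode, but generation is not implemented yet."""

    def generate(self, docs: List[Dict[str, Any]]) -> List[str]:
        raise NotImplementedError("VQA generation is stubbed")


# Assumption: modes map to generator classes through a plain dictionary.
GENERATORS = {"text": TextQAGenerator, "vqa": VQAGenerator}


def select_generator(mode: str):
    """Return a generator instance for the configured mode."""
    try:
        return GENERATORS[mode]()
    except KeyError:
        raise ValueError(f"Unknown generation mode: {mode!r}") from None


def filter_docs_by_type(
    docs: List[Dict[str, Any]], doc_type: str = "text"
) -> List[Dict[str, Any]]:
    """Keep only documents whose "type" field matches doc_type."""
    return [doc for doc in docs if doc.get("type") == doc_type]
```

Documents without a "type" field are dropped by this filter, which is one reason the input examples being normalized to include type="text" matters.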

ChenZiHong-Gavin and others added 6 commits October 20, 2025 15:50
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ChenZiHong-Gavin ChenZiHong-Gavin marked this pull request as ready for review October 20, 2025 07:54
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 30 out of 32 changed files in this pull request and generated 6 comments.


ChenZiHong-Gavin and others added 4 commits October 20, 2025 16:04
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ChenZiHong-Gavin ChenZiHong-Gavin merged commit 9217306 into main Oct 20, 2025
3 checks passed
@ChenZiHong-Gavin ChenZiHong-Gavin deleted the feat/pdf-reader branch October 20, 2025 08:15
@ChenZiHong-Gavin ChenZiHong-Gavin mentioned this pull request Oct 23, 2025
30 tasks