feat: add pdf_reader & tests for MinerUParser #65

ChenZiHong-Gavin · 2025-10-17T06:54:10Z

This pull request adds support for PDF input files and prepares for a new VQA (Visual Question Answering) generation mode across the graphgen pipeline. It introduces a new PDFReader class for reading and parsing PDF files using MinerU, updates configuration files to recognize PDFs and VQA mode, and refactors reader class names for consistency. The VQA generator is also stubbed for future implementation.

Copilot

Pull Request Overview

Adds PDF ingestion via MinerU and introduces a stubbed Visual Question Answering (VQA) generation mode while refactoring reader class names to uppercase acronyms. Key changes:

- New PDFReader and MinerUParser for parsing PDFs; input examples updated to include a "type" field.
- Added VQAGenerator stub and generation mode wiring; configs and script updated to recognize vqa mode.
- Refactored reader class names (CsvReader -> CSVReader etc.) and adjusted data filtering logic in graph insertion.
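
To make the "type" field and the adjusted filtering concrete, here is a minimal sketch of how typed input records might be filtered before graph insertion. Only the "type" key and the "text" value come from this PR; the `content` key, the `image` value, and the helper names `load_records` and `filter_docs_by_type` are assumptions for illustration, not code from the repository.

```python
import json

# Hypothetical JSONL input: each record carries a "type" field,
# mirroring the updated input examples. The "content" key is assumed.
RAW_JSONL = """\
{"type": "text", "content": "GraphGen builds a knowledge graph."}
{"type": "image", "content": "figure_1.png"}
{"type": "text", "content": "MinerU parses PDF files."}
"""


def load_records(jsonl_text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def filter_docs_by_type(records, wanted="text"):
    """Keep records whose "type" equals `wanted`; untyped records are dropped."""
    return [r for r in records if r.get("type") == wanted]


records = load_records(RAW_JSONL)
text_docs = filter_docs_by_type(records)
```

Dropping untyped records is one possible policy; treating a missing type as "text" would be an equally plausible choice for backward compatibility with older input files.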

Reviewed Changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 7 comments.


| File | Description |
| --- | --- |
| `tests/integration_tests/models/reader/test_mineru_parser.py` | Adds integration tests for MinerUParser parsing and empty PDF handling. |
| `scripts/generate/generate_vqa.sh` | New script to run VQA configuration. |
| `resources/input_examples/*.json` / `*.jsonl` / `*.csv` | Normalizes examples to include type="text" and adds VQA demo data. |
| `graphgen/operators/read/read_files.py` | Adds PDF support and renames reader classes; adjusts instantiation for PDF with output_dir. |
| `graphgen/operators/generate/generate_qas.py` | Wires in vqa mode to select VQAGenerator. |
| `graphgen/models/reader/*.py` | Renames reader classes and adds PDFReader/MinerUParser implementation. |
| `graphgen/models/generator/vqa_generator.py` | Introduces stubbed VQAGenerator class. |
| `graphgen/models/__init__.py` | Exports renamed readers and VQAGenerator. |
| `graphgen/graphgen.py` | Adjusts read_files call (adds working_dir) and filters docs by type; adds debug print. |
| `graphgen/generate.py` | Simplifies generation flow; removes conditional quiz/judge enable check. |
| `graphgen/configs/*_config.yaml` | Adds pdf to supported input types and vqa to mode comments, plus new vqa_config.yaml. |
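
The `read_files.py` row describes choosing a reader per file and passing an output directory only to the PDF reader. A rough sketch of that dispatch pattern follows. The class bodies are placeholders, and `JSONReader`, the `_READERS` mapping, `build_reader`, and the `mineru_output` subfolder are all hypothetical names, so this should be read as an illustration of the idea rather than the PR's actual implementation.

```python
from pathlib import Path


class CSVReader:
    """Placeholder for the renamed CSV reader (real parsing omitted)."""


class JSONReader:
    """Placeholder for a JSON reader (name assumed, real parsing omitted)."""


class PDFReader:
    """Placeholder for the PDF reader; it needs a directory for parser output."""

    def __init__(self, output_dir):
        self.output_dir = output_dir


# Extension -> reader class. This mapping is an assumption for illustration.
_READERS = {".csv": CSVReader, ".json": JSONReader, ".pdf": PDFReader}


def build_reader(file_path, working_dir):
    """Pick a reader from the file extension (case-insensitive)."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in _READERS:
        raise ValueError(f"Unsupported file type: {suffix}")
    reader_cls = _READERS[suffix]
    if reader_cls is PDFReader:
        # Only the PDF reader receives an output directory;
        # the subfolder name here is hypothetical.
        return PDFReader(output_dir=str(Path(working_dir) / "mineru_output"))
    return reader_cls()


pdf_reader = build_reader("paper.PDF", "cache")
csv_reader = build_reader("data.csv", "cache")
```

Keeping the special case next to the mapping makes it obvious which readers need extra constructor arguments; a factory function per reader would be an alternative if more readers gain their own settings.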


graphgen/graphgen.py

graphgen/generate.py

graphgen/models/reader/pdf_reader.py

tests/integration_tests/models/reader/test_mineru_parser.py


Copilot

Pull Request Overview

Copilot reviewed 30 out of 32 changed files in this pull request and generated 6 comments.


graphgen/models/generator/vqa_generator.py

graphgen/generate.py

graphgen/operators/read/read_files.py

graphgen/models/reader/pdf_reader.py

tests/integration_tests/models/reader/test_mineru_parser.py

graphgen/configs/atomic_config.yaml


ChenZiHong-Gavin added 4 commits October 17, 2025 14:51

feat: add pdf_reader & tests for MinerUParser

35b7b8f

fix: delete useless code

010b9ae

feat(graphgen): add vqa configs

7a1457f

wip: vqa config

110390d

ChenZiHong-Gavin requested a review from Copilot October 20, 2025 07:45

Copilot AI reviewed Oct 20, 2025


ChenZiHong-Gavin and others added 6 commits October 20, 2025 15:50

Update graphgen/graphgen.py

2b904e7

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update graphgen/generate.py

685f5d2

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update tests/integration_tests/models/reader/test_mineru_parser.py

d00c663

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix: fix docstring

1d0bcd3

docs: update webui input files

e6e02eb

Merge branch 'feat/pdf-reader' of https://github.com/open-sciencelab/…

d88c479

…GraphGen into feat/pdf-reader

ChenZiHong-Gavin marked this pull request as ready for review October 20, 2025 07:54

feat: auto pick device for mineru

ce6ce4b

ChenZiHong-Gavin requested a review from Copilot October 20, 2025 08:01

Copilot AI reviewed Oct 20, 2025


ChenZiHong-Gavin and others added 4 commits October 20, 2025 16:04

Update graphgen/models/generator/vqa_generator.py

f487673

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update graphgen/operators/read/read_files.py

2434f27

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update graphgen/configs/atomic_config.yaml

285689b

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix: fix lint problems

6d6f160

ChenZiHong-Gavin merged commit 9217306 into main Oct 20, 2025
3 checks passed

ChenZiHong-Gavin deleted the feat/pdf-reader branch October 20, 2025 08:15

ChenZiHong-Gavin mentioned this pull request Oct 23, 2025

[Summary] GraphGen Roadmap #49

Open

30 tasks
