feat: add pdf_reader & tests for MinerUParser #65
Pull Request Overview
Adds PDF ingestion via MinerU and introduces a stubbed Visual Question Answering (VQA) generation mode while refactoring reader class names to uppercase acronyms. Key changes:
- New PDFReader and MinerUParser for parsing PDFs; input examples updated to include a "type" field.
- Added VQAGenerator stub and generation mode wiring; configs and script updated to recognize vqa mode.
- Refactored reader class names (CsvReader -> CSVReader, etc.) and adjusted data filtering logic in graph insertion.
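
To illustrate the "type" field and the type-based filtering mentioned above, here is a minimal sketch. It is not the code from this PR: the helper names `normalize_docs` and `filter_by_type`, the `content` key, and the `"image"` type value are assumptions made for the example; only the `type="text"` convention comes from the summary.

```python
from typing import Any

Doc = dict[str, Any]


def normalize_docs(docs: list[Doc], default_type: str = "text") -> list[Doc]:
    """Return copies of the docs, adding a default "type" where it is missing."""
    return [{**doc, "type": doc.get("type", default_type)} for doc in docs]


def filter_by_type(docs: list[Doc], wanted: str) -> list[Doc]:
    """Keep only the docs whose "type" equals the wanted value."""
    return [doc for doc in docs if doc.get("type") == wanted]


# Hypothetical input records; the "content" key is an assumption.
docs = [
    {"content": "Plain paragraph without a type."},
    {"type": "text", "content": "Paragraph with an explicit type."},
    {"type": "image", "content": "figure_1.png"},
]

text_docs = filter_by_type(normalize_docs(docs), "text")
```

Normalizing before filtering means older, untyped examples are treated as text instead of being dropped silently.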
 
Reviewed Changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 7 comments.
| File | Description | 
|---|---|
| tests/integration_tests/models/reader/test_mineru_parser.py | Adds integration tests for MinerUParser parsing and empty PDF handling. | 
| scripts/generate/generate_vqa.sh | New script to run VQA configuration. | 
| resources/input_examples/*.json / *.jsonl / *.csv | Normalizes examples to include type="text" and adds VQA demo data. | 
| graphgen/operators/read/read_files.py | Adds PDF support and renames reader classes; adjusts instantiation for PDF with output_dir. | 
| graphgen/operators/generate/generate_qas.py | Wires in vqa mode to select VQAGenerator. | 
| graphgen/models/reader/*.py | Renames reader classes and adds PDFReader/MinerUParser implementation. | 
| graphgen/models/generator/vqa_generator.py | Introduces stubbed VQAGenerator class. | 
| graphgen/models/__init__.py | Exports renamed readers and VQAGenerator. | 
| graphgen/graphgen.py | Adjusts read_files call (adds working_dir) and filters docs by type; adds debug print. | 
| graphgen/generate.py | Simplifies generation flow; removes conditional quiz/judge enable check. | 
| graphgen/configs/*_config.yaml | Adds pdf to supported input types and vqa to mode comments plus new vqa_config.yaml. | 
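
The mode wiring described for generate_qas.py can be pictured as a lookup from mode name to generator class. The following is a hedged sketch, not the repository's implementation: `BaseGenerator`, `TextQAGenerator`, the `"text"` mode key, and `select_generator` are hypothetical names, and `VQAGenerator` is shown only as the stub the review describes.

```python
class BaseGenerator:
    """Hypothetical common interface for generators."""

    def generate(self, batch: list[str]) -> list[dict[str, str]]:
        raise NotImplementedError


class TextQAGenerator(BaseGenerator):
    """Hypothetical stand-in for an existing text-based mode."""

    def generate(self, batch: list[str]) -> list[dict[str, str]]:
        return [{"question": f"What is stated in: {item}?", "answer": item} for item in batch]


class VQAGenerator(BaseGenerator):
    """Stub only: the review describes VQA generation as not yet implemented."""

    def generate(self, batch: list[str]) -> list[dict[str, str]]:
        raise NotImplementedError("VQA generation is stubbed for future work")


# Registry mapping a configured mode string to a generator class (assumed layout).
GENERATORS: dict[str, type[BaseGenerator]] = {
    "text": TextQAGenerator,
    "vqa": VQAGenerator,
}


def select_generator(mode: str) -> BaseGenerator:
    """Instantiate the generator for a mode, failing loudly on unknown modes."""
    try:
        return GENERATORS[mode]()
    except KeyError:
        raise ValueError(f"Unsupported generation mode: {mode!r}") from None
```

A registry keeps the stub selectable from configuration while its `generate` method still raises, so a misconfigured run fails early instead of producing empty output.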
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…GraphGen into feat/pdf-reader
Pull Request Overview
Copilot reviewed 30 out of 32 changed files in this pull request and generated 6 comments.
This pull request adds support for PDF input files and prepares for a new VQA (Visual Question Answering) generation mode across the graphgen pipeline. It introduces a new PDFReader class that reads and parses PDF files using MinerU, updates the configuration files to recognize PDF input and the vqa mode, and renames the reader classes for consistency. The VQA generator is stubbed for future implementation.
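
As a rough picture of how a PDF reader that needs an output directory might sit alongside simpler readers, consider the sketch below. Everything in it is an assumption for illustration: `build_reader`, `TXTReader`, the `mineru` subdirectory, and the constructor signature are hypothetical, and the `PDFReader` here is a placeholder that makes no MinerU calls.

```python
from pathlib import Path


class TXTReader:
    """Hypothetical plain-text reader used only for contrast."""

    def read(self, path: str) -> list[dict[str, str]]:
        return [{"type": "text", "content": Path(path).read_text(encoding="utf-8")}]


class PDFReader:
    """Placeholder: a real reader would hand the file to MinerU for parsing."""

    def __init__(self, output_dir: Path) -> None:
        # Assumed: parsed artifacts are written under this directory.
        self.output_dir = Path(output_dir)

    def read(self, path: str) -> list[dict[str, str]]:
        raise NotImplementedError("PDF parsing is outside the scope of this sketch")


def build_reader(path: str, working_dir: str):
    """Pick a reader from the file suffix; only the PDF reader needs an output_dir."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return PDFReader(output_dir=Path(working_dir) / "mineru")
    if suffix == ".txt":
        return TXTReader()
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

Passing the working directory only to the reader that needs it keeps the other readers' constructors unchanged.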