feat: add PDF processing support and enhance document handling #29
base: main
- Updated allowed document MIME types to include 'application/pdf'.
- Implemented PDF content extraction in a new module (pdf-parser.ts).
- Integrated PDF processing into the document extraction workflow.
- Enhanced error handling for PDF processing, including password protection.
- Added functions for normalizing and cleaning text extracted from PDFs.
- Implemented chunking of text for better handling of large documents.
- Introduced image extraction markers and descriptions for images in PDFs.
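The normalization and chunking steps listed above could look roughly like the following. This is a minimal sketch, not the code in pdf-parser.ts: the function names `normalizeText` and `chunkText`, the 1000-character default, and the paragraph-based splitting strategy are all assumptions made for illustration.

```typescript
// Hypothetical helpers; names and defaults are illustrative and are not
// taken from pdf-parser.ts.

/** Clean up whitespace noise that is typical of text extracted from PDFs. */
function normalizeText(raw: string): string {
  return raw
    .replace(/\r\n?/g, "\n") // unify line endings
    .replace(/([A-Za-z])-\n([A-Za-z])/g, "$1$2") // rejoin words hyphenated across lines
    .replace(/[ \t]+/g, " ") // collapse runs of spaces and tabs
    .replace(/ ?\n ?/g, "\n") // drop spaces around newlines
    .replace(/\n{3,}/g, "\n\n") // keep at most one blank line
    .trim();
}

/** Split text into chunks of at most maxChars, preferring paragraph boundaries. */
function chunkText(text: string, maxChars = 1000): string[] {
  if (maxChars <= 0) throw new RangeError("maxChars must be positive");
  const chunks: string[] = [];
  let current = "";
  for (const paragraph of text.split(/\n{2,}/)) {
    // Hard-split any paragraph that is longer than the limit on its own.
    const pieces: string[] = [];
    for (let i = 0; i < paragraph.length; i += maxChars) {
      pieces.push(paragraph.slice(i, i + maxChars));
    }
    for (const piece of pieces) {
      const candidate = current ? `${current}\n\n${piece}` : piece;
      if (candidate.length <= maxChars) {
        current = candidate; // still fits, keep accumulating
      } else {
        if (current) chunks.push(current);
        current = piece; // start a new chunk
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Splitting on blank lines first keeps paragraphs intact where possible; a real implementation might instead split on sentences or add overlap between chunks.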
Pull Request Overview
This PR introduces PDF processing support to the document handling system, expanding the supported file formats from text, Office documents, CSV, JSON, and ZIP files to include PDFs.
- Adds comprehensive PDF text extraction and image processing capabilities using PDF.js
- Integrates PDF support into existing document processor and attachment validation
- Updates examples to demonstrate PDF processing functionality
Reviewed Changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
src/utils/pdf-parser.ts | Complete PDF processing implementation with text extraction, image processing, and chunking logic |
src/utils/document-processor.ts | Integrates PDF extraction into main document processor and updates supported types |
src/utils/attachments.ts | Adds PDF MIME type to allowed document types for attachment validation |
package.json | Adds required dependencies for PDF processing (canvas, pdfjs-dist) |
examples/attachment-demo-server.ts | Updates examples and documentation to showcase PDF processing capabilities |
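As an illustration of the attachment validation change, an allow-list check might be sketched as below. Only the 'application/pdf' entry is confirmed by this PR; the other list entries, the names `ALLOWED_DOCUMENT_MIME_TYPES`, `isAllowedDocumentType`, and `looksLikePdf`, and the extra magic-byte check are assumptions and may not match src/utils/attachments.ts.

```typescript
// Assumed allow-list: only "application/pdf" is confirmed by the PR description.
const ALLOWED_DOCUMENT_MIME_TYPES: readonly string[] = [
  "text/plain",
  "text/csv",
  "application/json",
  "application/zip",
  "application/pdf",
];

/** Check a declared MIME type against the allow-list (hypothetical helper). */
function isAllowedDocumentType(mimeType: string): boolean {
  // Drop parameters such as "; charset=utf-8" and compare case-insensitively.
  const essence = (mimeType.split(";")[0] ?? "").trim().toLowerCase();
  return ALLOWED_DOCUMENT_MIME_TYPES.indexOf(essence) !== -1;
}

/** PDF files normally begin with the ASCII bytes "%PDF-". */
function looksLikePdf(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}
```

Checking the leading bytes in addition to the declared MIME type is an optional hardening step, since a client can label any payload as 'application/pdf'.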
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
- Add proper documentation for WASM path configuration
- Fix pages metadata to return actual page count from PDF document
- Improve placeholder image description function documentation
- Address Copilot code review suggestions for better clarity
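The page-count fix can be pictured with a small sketch. It assumes the loaded document exposes a `numPages` property, as pdfjs-dist's `PDFDocumentProxy` does; the `PdfMetadata` shape, the `extractedPages` field, and the `buildMetadata` name are hypothetical, and the actual cause of the original bug is not stated in this conversation.

```typescript
// Minimal structural stand-in for a loaded PDF. pdfjs-dist's PDFDocumentProxy
// exposes numPages, which is the only property this sketch relies on.
interface PdfDocumentLike {
  numPages: number;
}

// Hypothetical metadata shape, for illustration only.
interface PdfMetadata {
  pages: number; // total pages in the document
  extractedPages: number; // pages that yielded non-empty text
}

function buildMetadata(doc: PdfDocumentLike, pageTexts: string[]): PdfMetadata {
  return {
    // Report the document's own page count rather than a derived value such
    // as pageTexts.length, which can differ when pages are skipped.
    pages: doc.numPages,
    extractedPages: pageTexts.filter((t) => t.trim().length > 0).length,
  };
}
```

Keeping the two numbers separate lets a caller tell a short document apart from a long one where most pages had no extractable text.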
Pull Request Overview
Copilot reviewed 7 out of 9 changed files in this pull request and generated 3 comments.
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
This pull request adds support for PDF document attachments in the attachment demo server and updates the documentation and example commands accordingly. It also introduces new dependencies required for PDF processing.
PDF Support Enhancements:
- Updated the documentation in examples/attachment-demo-server.ts to cover PDF attachments.
- Added curl commands demonstrating how to send PDF documents (both via URL and base64 data) to the server for analysis in examples/attachment-demo-server.ts.
Dependency Updates for PDF Handling:
- Added pdfjs-dist and canvas as new dependencies in package.json to enable PDF parsing and rendering.
- Updated pnpm-lock.yaml to include the new dependencies and their transitive packages, such as pdfjs-dist, canvas, and related native modules.