-
Notifications
You must be signed in to change notification settings - Fork 56
Feat/pdf ingest fix #835
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/pdf ingest fix #835
Conversation
Note Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported. |
Caution Review failedThe pull request is closed. WalkthroughRefactors PDF text extraction to use per-page textContent paragraphs via buildParagraphsFromPage, removing operator-args parsing and fallback paths. Updates per-page processing, overlap handling, and logging; adds a Logger.debug call for all text chunks. Image handling and public API signatures remain unchanged. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant PDFProcessor as PDF Chunks Processor
participant PDFJS as PDF.js
participant Logger as Logs/Metrics
Client->>PDFProcessor: generateChunksFromPdf(pdf)
loop For each page
PDFProcessor->>PDFJS: getPage(pageNum)
PDFProcessor->>PDFJS: page.getTextContent()
Note over PDFProcessor: buildParagraphsFromPage(textContent)
PDFProcessor->>PDFProcessor: prepend pageOverlap to first paragraph
PDFProcessor->>PDFProcessor: processTextParagraphs(paragraphs, prevOverlap)
alt Page has images
PDFProcessor->>PDFJS: extractImages(page)
PDFProcessor->>PDFProcessor: integrate image chunks (image markers insertion)
end
PDFProcessor->>Logger: log page info (paragraphs.length, images count)
end
PDFProcessor->>Logger: Logger.debug("All text chunks", { text_chunks })
PDFProcessor-->>Client: chunks
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Please see the documentation for more information. Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal). Please share your feedback with us on this Discord post. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/pdfChunks.ts (1)
333-389
: Paragraph construction misses intra-line spacing and causes readability regressions; also compute itemCount here to avoid a second getTextContent callCurrent logic concatenates item.str without considering X gaps, which can collapse words in many PDFs. Add a simple X-gap heuristic and return the item count so the caller doesn’t call getTextContent twice.
- const buildParagraphsFromPage = async ( - page: pdfjsLib.PDFPageProxy, - ): Promise<string[]> => { + const buildParagraphsFromPage = async ( + page: pdfjsLib.PDFPageProxy, + ): Promise<{ paragraphs: string[]; itemCount: number }> => { const textContent = await page.getTextContent({ includeMarkedContent: false, disableNormalization: false, }) // Build lines using hasEOL and Y-position changes (handles PPT/DOC exports) const lines: string[] = [] let current = "" let prevY: number | null = null let prevH: number | null = null + let prevX: number | null = null for (const item of textContent.items as any[]) { const str: string = item && typeof item.str === "string" ? item.str : "" if (!str) continue const tr = Array.isArray(item.transform) ? item.transform : [] + const x = typeof tr[4] === "number" ? tr[4] : null const y = typeof tr[5] === "number" ? tr[5] : null const h = typeof item.height === "number" ? item.height : null let newLine = false if (prevY != null && y != null) { const tol = Math.max(prevH || 0, h || 0, 10) * 0.4 // dynamic tolerance if (Math.abs(y - prevY) > tol) newLine = true } if (newLine || (item as any).hasEOL) { if (current.length > 0) lines.push(current) current = str + prevX = x } else { - current += str + // Insert a space when there's a notable horizontal gap between chunks. + const gapTol = Math.max(prevH || 0, h || 0, 10) * 0.6 + if ( + prevX != null && + x != null && + x > prevX && + x - prevX > gapTol && + current.length > 0 && + !/\s$/.test(current) && + !/^\s/.test(str) + ) { + current += " " + } + current += str + prevX = x } prevY = y prevH = h } if (current.trim().length > 0) lines.push(current) // Group lines into paragraphs separated by blank lines const paragraphs: string[] = [] let buf: string[] = [] const pushPara = () => { if (buf.length === 0) return paragraphs.push(buf.join("\n")) buf = [] } for (const ln of lines) { if (ln.trim().length === 0) pushPara() else buf.push(ln) } pushPara() // Filter raw paragraphs - check trimmed length but don't apply full cleaning yet - return paragraphs.filter((p) => p.trim().length > 0) + return { + paragraphs: paragraphs.filter((p) => p.trim().length > 0), + itemCount: (textContent.items as any[]).length, + } }
🧹 Nitpick comments (5)
server/pdfChunks.ts (5)
400-406
: Avoid duplicate page.getTextContent call; use the count returned by buildParagraphsFromPageThis saves one heavy API call per page.
- let paragraphs: string[] = await buildParagraphsFromPage(page) - - - - let textOperatorCount = (await page.getTextContent()).items.length + const { paragraphs, itemCount: textOperatorCount } = + await buildParagraphsFromPage(page)
415-420
: Tighten log keys and drop redundant field"primaryParagraphs" vs "finalParagraphs" are identical post-refactor. Use a single "paragraphs" key.
Logger.debug("Text extraction summary for page", { pageNum, - primaryParagraphs: paragraphs.length, - - finalParagraphs: paragraphs.length, + paragraphs: paragraphs.length, textOperatorCount, initialPageOverlap: pageOverlap, })
1300-1304
: Also avoid emitting full payloads at debug levelConsider logging counts and hashes, not full arrays, to reduce noise and accidental PII leakage.
17-21
: LGTM on removing the old line-height constant; keep env validation for MIN_IMAGE_DIM_PXThe remaining env-driven default is fine. Consider clamping to a sane range (e.g., 64–2048) and logging the effective value once at startup.
703-705
: Fix image-type usage and drop ts-ignoreimage-type is synchronous; the await + ts-ignore combo weakens type safety. Either:
- keep image-type and call it synchronously, or
- switch to file-type (async) if you need broader detection.
- // @ts-ignore - let type = await imageType(buffer) + let type = imageType(buffer) if (!type) type = { mime: "image/png", ext: "png" }Repeat similarly at the other call sites.
Also applies to: 812-814, 1078-1097
server/pdfChunks.ts
Outdated
describeImages, | ||
}) | ||
|
||
console.log("All text chunks", { text_chunks }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove console.log dumping all text chunks (PII/log volume risk)
This prints full extracted content to stdout in production. Use structured debug logs with sampling or summaries instead.
- console.log("All text chunks", { text_chunks })
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
console.log("All text chunks", { text_chunks }) |
🤖 Prompt for AI Agents
In server/pdfChunks.ts around line 1299, remove the plain console.log that dumps
all text_chunks to stdout; instead use the application's structured logger at
debug/trace level (or guard behind a verbose env flag) and log either a safe
summary (e.g., count of chunks, lengths, or hashes) or a sampled/truncated
subset to avoid exposing PII and high-volume output. Ensure the log call does
not include full chunk contents, and add a comment noting the change for
auditing.
Removed older Text processing
Summary by CodeRabbit