pdfChunks file updated for better text processing #819
Conversation
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
Walkthrough
Replaces op-list-centric PDF parsing with a per-page textContent paragraph pipeline, adds normalizeText/cleanText and CTM matrix utilities, expands image extraction/validation/PNG conversion/MD5 caching and optional LLM descriptions, preserves page overlap across chunks, and adds a direct test harness script.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
autonumber
actor Dev
participant TestScript as testPdfDirect.ts
participant FPS as FileProcessorService
participant Extractor as extractTextAndImagesWithChunksFromPDF
participant PDFJS as pdf.js
participant LLM as describeImageWithllm
participant Disk as downloads/.../{docid}
Dev->>TestScript: run testPdfDirect()
TestScript->>FPS: processFile(pdfBuffer, "application/pdf", ...)
FPS-->>TestScript: KB-flow results
TestScript->>Extractor: extractTextAndImagesWithChunksFromPDF(uint8array, docId, extractImages=true, describeImages=true)
Extractor->>PDFJS: loadDocument -> getPage N
loop per page
PDFJS-->>Extractor: page.getTextContent()
Extractor->>Extractor: buildParagraphsFromPage -> normalizeText/cleanText -> create text chunks (handle pageOverlap)
PDFJS-->>Extractor: page.objs / inline images
alt image meets checks
Extractor->>Extractor: convert to PNG -> detect MIME -> md5(hash)
opt describeImages
Extractor->>LLM: describeImageWithllm(imageBuffer, hash)
LLM-->>Extractor: description
end
Extractor->>Disk: save image + metadata
Extractor->>Extractor: append image placeholder to text chunks (if enabled)
else
Extractor-->>Extractor: skip image & log
end
end
Extractor-->>TestScript: {text_chunks, image_chunks, positions}
TestScript-->>Dev: print comparison & samples
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning) · ✅ Passed checks (2 passed)
Summary of Changes
Hello @Aayushjshah, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces significant improvements to the PDF processing pipeline, focusing on enhancing the accuracy and robustness of both text and image extraction. The changes aim to provide a more reliable foundation for chunking PDF content, ensuring better data quality for downstream applications. The refactored text cleaning and extraction methods, coupled with a more sophisticated image handling mechanism, contribute to a more comprehensive and debuggable PDF parsing solution.
Highlights
- Improved Text Cleaning: The cleanText function has been completely refactored, introducing normalizeText for initial Unicode and whitespace normalization, and smartDespaceLine for intelligent handling of letter spacing within lines, leading to more accurate text representation.
- Enhanced PDF Text Extraction: The PDF text extraction logic now leverages PDF.js's getTextContent API to robustly build paragraphs, accounting for line breaks, Y-position changes, and blank lines, which significantly improves text chunking, especially for documents exported from tools like PowerPoint or Word (a rough sketch of this approach follows this list).
- Refined Image Extraction: Image extraction has been made more comprehensive, including the introduction of a MIN_IMAGE_DIM_PX environment variable to filter out small images, improved image object resolution, handling of inline images, and more robust processing of various image data formats. It also includes CTM tracking for better image bounds calculation.
- Persistent Cross-Image Overlap: The logic for crossImageOverlap has been updated to append image placeholders ([[IMG#<id>]]) rather than replacing previous overlap text, ensuring better continuity of context across images.
- Extensive Debugging and Testing: Numerous debug logs have been added throughout the pdfChunks.ts file to provide better visibility into the text and image processing flow. A new script, server/scripts/testPdfDirect.ts, has also been added to facilitate direct testing and comparison of PDF processing with and without image extraction.
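The PR's exact implementation is not shown here, but a minimal sketch of the getTextContent-based paragraph building described above could look like the following. The function name, the 0.4 tolerance factor, and the item handling are assumptions for illustration, not the PR's code:

```typescript
import type { PDFPageProxy } from "pdfjs-dist"

// Hypothetical sketch: group text items into lines by their Y position,
// then split the lines into paragraphs on blank lines.
async function buildParagraphsFromPage(page: PDFPageProxy): Promise<string[]> {
  const textContent = await page.getTextContent()
  const lines: string[] = []
  let current = ""
  let prevY: number | null = null

  for (const item of textContent.items) {
    if (!("str" in item) || !item.str) continue
    const y = item.transform[5] // vertical placement of this text run
    const tol = Math.max(item.height || 0, 10) * 0.4 // assumed line-break tolerance
    const isNewLine = prevY !== null && Math.abs(y - prevY) > tol
    if (isNewLine || item.hasEOL) {
      if (current.trim()) lines.push(current)
      current = item.str
    } else {
      current += item.str
    }
    prevY = y
  }
  if (current.trim()) lines.push(current)

  // Blank lines separate paragraphs
  const paragraphs: string[] = []
  let buffer: string[] = []
  for (const line of lines) {
    if (line.trim()) buffer.push(line)
    else if (buffer.length) {
      paragraphs.push(buffer.join("\n"))
      buffer = []
    }
  }
  if (buffer.length) paragraphs.push(buffer.join("\n"))
  return paragraphs
}
```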
Code Review
This pull request introduces significant improvements to PDF text processing by implementing a more sophisticated text cleaning pipeline and switching to getTextContent for more reliable text extraction. The image extraction logic has also been made more robust. While these are great enhancements, the current changes include a substantial amount of debugging code (console.log) that must be removed before merging. There are also some critical issues, such as the use of hardcoded magic numbers for PDF operators, and a new test script with hardcoded absolute paths that will break for other developers. I've provided comments on these points and other opportunities to improve code quality and maintainability.
Actionable comments posted: 8
🧹 Nitpick comments (10)
server/pdfChunks.ts (8)
347-351: Avoid duplicate getTextContent() calls per page.
You fetch textContent inside buildParagraphsFromPage and again for textOperatorCount. Cache once.
Minimal change:
```diff
- let paragraphs: string[] = await buildParagraphsFromPage(page)
- let currentParagraph = ""
- let textOperatorCount = (await page.getTextContent()).items.length
+ const textContent = await page.getTextContent({
+   includeMarkedContent: false,
+   disableNormalization: false,
+ })
+ const paragraphs = await (async () => {
+   // Inline the builder using textContent to avoid extra fetch
+   // (copy of buildParagraphsFromPage body but using textContent)
+   // Or refactor builder to accept textContent.
+   const items = textContent.items as any[]
+   const lines: string[] = []
+   let current = "", prevY: number | null = null, prevH: number | null = null
+   for (const item of items) {
+     const str: string = item && typeof item.str === "string" ? item.str : ""
+     if (!str) continue
+     const tr = Array.isArray(item.transform) ? item.transform : []
+     const y = typeof tr[5] === "number" ? tr[5] : null
+     const h = typeof item.height === "number" ? item.height : null
+     const tol = Math.max(prevH || 0, h || 0, 10) * 0.4
+     const newLine = prevY != null && y != null && Math.abs(y - prevY) > tol
+     if (newLine || (item as any).hasEOL) { if (current) lines.push(current); current = str } else { current += str }
+     prevY = y; prevH = h
+   }
+   if (current.trim()) lines.push(current)
+   const paras: string[] = []
+   let buf: string[] = []
+   for (const ln of lines) { if (ln.trim()) buf.push(ln); else if (buf.length) { paras.push(buf.join("\n")); buf = [] } }
+   if (buf.length) paras.push(buf.join("\n"))
+   return paras.map(cleanText).filter((p) => p.length > 0)
+ })()
+ let currentParagraph = ""
+ const textOperatorCount = textContent.items.length
```
Or refactor buildParagraphsFromPage to accept a pre-fetched textContent.
Also applies to: 409-413
435-469: Remove unused CTM tracking or wire it to real positioning logic.
currentCTM/ctmStack are computed but never used. Dead code adds complexity.
```diff
- let currentCTM: [number, number, number, number, number, number] = [1, 0, 0, 1, 0, 0]
- const ctmStack: [number, number, number, number, number, number][] = []
- const mul = (m1: number[], m2: number[]): [number, number, number, number, number, number] => { ... }
- const applyToPoint = (m: number[], x: number, y: number): { x: number; y: number } => { ... }
+ // CTM tracking removed: re-add only when bounding boxes are needed
@@
- case PDFJS.OPS.save: {
-   ctmStack.push([...currentCTM])
-   break
- }
- case PDFJS.OPS.restore: {
-   if (ctmStack.length) currentCTM = ctmStack.pop()!
-   break
- }
+ // save/restore CTM handling not needed currently
```
Also applies to: 556-563
1191-1193: Use debug level for cache hits instead of warn.
Cache reuse is expected, not a warning.
```diff
-Logger.warn(
-  `Reusing description for repeated image ${imageName} on page ${pageNum}`,
-)
+Logger.debug(
+  `Reusing description for repeated image ${imageName} on page ${pageNum}`,
+)
```
497-499: Remove unused variables.
imagesOnPage and vectorOpsDetected are set but never used.
```diff
-let imagesOnPage = 0
-let vectorOpsDetected = false
+// removed unused counters
```
1281-1284: crossImageOverlap is appended but never consumed; drop or cap it.
Avoid unbounded growth of a never-used string; either remove the feature or cap to a small window.
```diff
- crossImageOverlap += ` [[IMG#${globalSeq.value}]] `
+ // crossImageOverlap removed (unused)
@@
- if (overlapText.trim()) {
-   const previousCrossImageOverlap = crossImageOverlap
-   crossImageOverlap = crossImageOverlap
-     ? `${crossImageOverlap} ${overlapText}`
-     : overlapText
- } else {
-   // no update
- }
+ // overlapText not accumulated; handled within chunking only
```
Also applies to: 1329-1334
1375-1376: Replace stray console.log with Logger or remove.
The “Calling destroy” console log is noisy in production.
```diff
- console.log("Calling destroy")
+ Logger.debug("Destroying PDF document")
```
165-214: Remove dead text-operator extraction helpers.
validateTextItem and extractTextFromArgs are unused after switching to textContent; delete to reduce surface area and logging.
```diff
-/**
- * Validate text item
- */
-function validateTextItem(item: any): boolean { ... }
-
-/**
- * Extract text from various PDF.js text operators with enhanced validation
- */
-function extractTextFromArgs(args: any[]): string { ... }
```
17-18: Bound the hash cache to avoid unbounded memory use.
seenHashDescriptions can grow indefinitely across documents. Consider LRU with a max size.
I can add a tiny LRU (e.g., QuickLRU) or a capped Map eviction policy if you want.
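For illustration, a capped, insertion-ordered Map along the lines suggested above could look like this. The name and size limit are assumptions, not code from this PR:

```typescript
// Hypothetical bounded cache: evicts the oldest entry once maxEntries is reached.
// Map preserves insertion order, so the first key is the oldest.
const MAX_DESCRIPTION_CACHE_ENTRIES = 1000 // assumed limit
const seenHashDescriptions = new Map<string, string>()

function rememberDescription(hash: string, description: string): void {
  if (seenHashDescriptions.size >= MAX_DESCRIPTION_CACHE_ENTRIES) {
    const oldestKey = seenHashDescriptions.keys().next().value
    if (oldestKey !== undefined) seenHashDescriptions.delete(oldestKey)
  }
  seenHashDescriptions.set(hash, description)
}
```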
server/scripts/testPdfDirect.ts (2)
2-2: Drop unused import.
resolve is unused.
```diff
-import { resolve } from "path"
```
35-40: Print counts only; avoid dumping large arrays.
Keep this harness light—counts are already sufficient.
No diff needed if you keep counts and remove any future array dumps.
Also applies to: 52-57
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- server/pdfChunks.ts (12 hunks)
- server/scripts/testPdfDirect.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
server/scripts/testPdfDirect.ts (2)
- server/services/fileProcessor.ts (1): FileProcessorService (22-158)
- server/pdfChunks.ts (1): extractTextAndImagesWithChunksFromPDF (282-1378)
server/pdfChunks.ts (3)
- server/chunks.ts (1): chunkTextByParagraph (9-112)
- server/integrations/dataSource/config.ts (1): DATASOURCE_CONFIG (3-71)
- server/lib/describeImageWithllm.ts (1): describeImageWithllm (56-112)
🔇 Additional comments (3)
server/pdfChunks.ts (2)
694-696: Confirmed — image-type default export is async; await usage is correct.
image-type(input: ArrayBuffer | Uint8Array | Buffer) -> Promise<{ext: string, mime: string} | undefined>, so the await usage at lines 694–696 (and 787–789, 1115–1132) is valid.
13-16: Verify WASM asset paths in all deployment targets.
File: server/pdfChunks.ts (lines 13–16)
```typescript
const openjpegWasmPath = path.join(__dirname, "../node_modules/pdfjs-dist/wasm/") + "/"
const qcmsWasmPath = path.join(__dirname, "../node_modules/pdfjs-dist/wasm/") + "/"
```
Ensure pdfjs-dist/wasm is present at runtime for all deployment targets (bundled/serverless). If not, make these paths configurable (env var) or copy the wasm assets into a stable runtime location during your build/deploy pipeline.
server/scripts/testPdfDirect.ts (1)
3-5
: Ensure the script's runner resolves tsconfig path aliases (@/*).
- server/tsconfig.json defines "@/" → ["./"] and server/scripts/testPdfDirect.ts imports "@/services/..." and "@/pdfChunks".
- Bun's runtime respects tsconfig "paths", so running the script with Bun will resolve these aliases. (bun.com)
- I found no tsconfig-paths/module-alias/ts-node helpers in the repo; if you run this with Node/ts-node, add a runtime mapper (e.g. tsconfig-paths) or switch to relative imports.
Actionable comments posted: 1
♻️ Duplicate comments (3)
server/pdfChunks.ts (3)
1111-1117: Respect Uint8Array view offsets when creating Buffer.
Using the underlying ArrayBuffer can include unrelated bytes. Create Buffer from the view.
```diff
-const imgBuffer = Buffer.from(uint8Data.buffer)
+const imgBuffer = Buffer.from(uint8Data.buffer, uint8Data.byteOffset, uint8Data.byteLength)
```
236-239: Don’t log raw extracted text (PII/log bloat).
Replace with length-only, routed via Logger at debug level.
```diff
- const result = typeof text === "string" ? text : ""
- console.log("EXTRACT TEXT DEBUG: Final extracted text:", result)
- return result
+ const result = typeof text === "string" ? text : ""
+ Logger.debug(`extractTextFromArgs: len=${result.length}`)
+ return result
```
1433-1447: Remove plaintext dumps of all chunks and positions.
These can leak sensitive content and balloon logs. Keep summaries only.
```diff
- console.log("FINAL DEBUG: PDF processing completed for", docid)
- console.log("FINAL DEBUG: Processing summary:", {
-   totalTextChunks: text_chunks.length,
-   totalImageChunks: image_chunks.length,
-   textChunkPositions: text_chunk_pos.length,
-   imageChunkPositions: image_chunk_pos.length,
-   extractImages,
-   describeImages,
- })
-
- console.log("FINAL DEBUG: All text chunks:", text_chunks)
- console.log("FINAL DEBUG: All text chunk positions:", text_chunk_pos)
- console.log("FINAL DEBUG: All image chunks:", image_chunks)
- console.log("FINAL DEBUG: All image chunk positions:", image_chunk_pos)
+ Logger.debug({
+   msg: "PDF processing summary",
+   docid,
+   totalTextChunks: text_chunks.length,
+   totalImageChunks: image_chunks.length,
+   textChunkPositions: text_chunk_pos.length,
+   imageChunkPositions: image_chunk_pos.length,
+   extractImages,
+   describeImages,
+ })
```
🧹 Nitpick comments (7)
server/pdfChunks.ts (7)
394-396: Use the declared constant instead of the magic number.
You introduced MIN_LINE_HEIGHT_FOR_TOLERANCE (Line 20) but still use 10 here. Replace to keep intent clear and single-sourced.
```diff
- const tol = Math.max(prevH || 0, h || 0, 10) * 0.4 // dynamic tolerance
+ const tol = Math.max(prevH || 0, h || 0, MIN_LINE_HEIGHT_FOR_TOLERANCE) * 0.4 // dynamic tolerance
```
Also applies to: 20-20
119-124: Text normalization: broaden zero-width character removal.
Handle additional zero-width characters and BOM to reduce artifacts.
```diff
- normalized = normalized.replace(/\u200B/g, "") // zero-width space
+ normalized = normalized.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "") // zero-widths & BOM
```
Also applies to: 30-44
506-512: Avoid second getTextContent() per page just for a count.
buildParagraphsFromPage already fetched textContent; compute textOperatorCount there and return it, or derive from opList if feasible. Saves a heavy call per page.
Would you like me to propose a small refactor to return { paragraphs, textItemCount } from buildParagraphsFromPage and wire it through?
519-526: Remove @ts-ignore around page.objs access; add safe typing.
Use a local const objs: any = (page as any).objs with guards.
```diff
-// @ts-ignore
-if (
-  typeof (page.objs as any).has === "function" &&
-  (page.objs as any).has(name)
-) {
-  // @ts-ignore
-  return (page.objs as any).get(name)
-}
-const obj = (page.objs as any).get(name)
+const objs: any = (page as any).objs
+if (objs && typeof objs.has === "function" && objs.has(name)) {
+  return objs.get(name)
+}
+const obj = objs?.get ? objs.get(name) : undefined
 return obj || null
```
552-578: vectorOpsDetected is set but never used.
Drop the flag and related checks to reduce noise.
617-621: Switch readability: avoid case extractImages ? ... : null.
Wrap image cases in an if (!extractImages) break; guard and keep plain OPS cases; simplifies control flow.
1454-1455: Replace stray console.log with Logger.
```diff
- console.log("Calling destroy")
+ Logger.debug("Destroying PDF document")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- server/pdfChunks.ts (10 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-11T08:30:51.408Z
Learnt from: Aman-Asrani-Juspay
PR: xynehq/xyne#777
File: server/api/agent/workflowAgentUtils.ts:95-103
Timestamp: 2025-09-11T08:30:51.408Z
Learning: The development team prefers to keep detailed logging (including user emails, attachment IDs, and content previews) during integration phases for debugging purposes, even if it includes potentially sensitive information. They prioritize debugging capability over log privacy during development.
Applied to files:
server/pdfChunks.ts
📚 Learning: 2025-08-11T08:44:08.412Z
Learnt from: MayankBansal2004
PR: xynehq/xyne#719
File: server/api/chat/chat.ts:4023-4029
Timestamp: 2025-08-11T08:44:08.412Z
Learning: In server/api/chat/chat.ts and similar files, MayankBansal2004 prefers not to add explicit type definitions for metadata.usage structures when the receiving array is already typed and default values ensure type safety through the || 0 pattern.
Applied to files:
server/pdfChunks.ts
🧬 Code graph analysis (1)
server/pdfChunks.ts (4)
- server/logger/index.ts (1): getLogger (33-90)
- server/chunks.ts (1): chunkTextByParagraph (9-112)
- server/integrations/dataSource/config.ts (1): DATASOURCE_CONFIG (3-71)
- server/lib/describeImageWithllm.ts (1): describeImageWithllm (56-112)
🔇 Additional comments (3)
server/pdfChunks.ts (3)
318-321: Good: docid is sanitized before filesystem use.
Using safeDocId eliminates path traversal risk. Nice hardening.
107-145: Paragraph placeholder approach is robust; good tradeoff.
Using PUA sentinels + runtime-unique token reduces collision risk while preserving double-newline paragraphs. Nice.
741-744: Don't remove await — image-type(buffer) is async; suggested change is incorrect.
imageType(buffer) returns a Promise and must be awaited; retain the existing await in server/pdfChunks.ts (lines 741–744, 835–837, 1193–1211). Fix typings (or cast) if @ts-ignore is only masking a type issue — do not remove await.
Likely an incorrect or invalid review comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
server/pdfChunks.ts (2)
1390-1393: Stop logging full chunks (PII risk + log bloat).
Dumping entire text/image chunks can leak sensitive data and explode logs. Log counts/metrics instead.
```diff
- Logger.debug("All text chunks", { text_chunks })
- Logger.debug("All text chunk positions", { text_chunk_pos })
- Logger.debug("All image chunks", { image_chunks })
- Logger.debug("All image chunk positions", { image_chunk_pos })
+ Logger.debug("Chunk summaries", {
+   text_chunks_len: text_chunks.length,
+   image_chunks_len: image_chunks.length,
+   text_pos_len: text_chunk_pos.length,
+   image_pos_len: image_chunk_pos.length,
+ })
```
1077-1085: Respect Uint8Array view offsets when creating Buffer.
Buffer.from(uint8Data.buffer) may include unrelated bytes. Use offset/length (or the view directly).
```diff
- const imgBuffer = Buffer.from(uint8Data.buffer)
+ const imgBuffer = Buffer.from(
+   uint8Data.buffer,
+   uint8Data.byteOffset,
+   uint8Data.byteLength,
+ )
```
🧹 Nitpick comments (12)
server/pdfChunks.ts (12)
19-21: Replace magic numbers with named constants for line-break tolerance.
Use the defined MIN_LINE_HEIGHT_FOR_TOLERANCE and a named factor; make the factor configurable.
```diff
-const MIN_LINE_HEIGHT_FOR_TOLERANCE = 10
+const MIN_LINE_HEIGHT_FOR_TOLERANCE = 10
+const LINE_BREAK_TOLERANCE_FACTOR = parseFloat(
+  process.env.LINE_BREAK_TOLERANCE_FACTOR || "0.4",
+)
@@
- const tol = Math.max(prevH || 0, h || 0, 10) * 0.4 // dynamic tolerance
+ const tol =
+   Math.max(prevH || 0, h || 0, MIN_LINE_HEIGHT_FOR_TOLERANCE) *
+   LINE_BREAK_TOLERANCE_FACTOR
```
Also applies to: 392-394
270-272: Use central config for chunk size (avoid hardcoding 512).
Keeps behavior consistent with server/integrations/dataSource/config.ts.
```diff
- const chunks = chunkTextByParagraph(cleanedText, 512, 128)
+ const chunks = chunkTextByParagraph(
+   cleanedText,
+   DATASOURCE_CONFIG.MAX_CHUNK_SIZE,
+   128,
+ )
```
28-42: Normalize CRLF to LF in normalizeText.
Prevents stray CR from PDFs/Office exports.
```diff
 let normalized = input.normalize("NFC")
 // Strip control chars except newline/tab
 normalized = normalized.replace(/[^\P{C}\n\t]/gu, "")
 // Normalize whitespace
+normalized = normalized.replace(/\r\n?/g, "\n") // CRLF/CR -> LF
 normalized = normalized.replace(/\u00A0/g, " ") // nbsp → space
 normalized = normalized.replace(/\u200B/g, "") // zero-width space
 normalized = normalized.replace(/\t+/g, " ") // tabs → single space
```
65-73: Handle any whitespace between spaced letters (not just a single space).
parts[j] === " " misses double spaces; use a whitespace check.
```diff
- parts[j] === " " &&
+ /^\s+$/.test(parts[j]) &&
```
577-586: Simplify switch: avoid conditional case labels.
Use plain OPS cases and guard with extractImages inside the block for readability and type-safety.
```diff
- case extractImages ? PDFJS.OPS.paintImageXObject : null:
- case extractImages ? PDFJS.OPS.paintImageXObjectRepeat : null:
- case extractImages ? PDFJS.OPS.paintInlineImageXObject : null:
- case extractImages ? PDFJS.OPS.paintImageMaskXObject : null: {
+ case PDFJS.OPS.paintImageXObject:
+ case PDFJS.OPS.paintImageXObjectRepeat:
+ case PDFJS.OPS.paintInlineImageXObject:
+ case PDFJS.OPS.paintImageMaskXObject: {
+   if (!extractImages) break
```
Also applies to: 615-618
1226-1229: Demote noisy warning to debug when reusing cached description.
Cache hits are normal; don’t warn.
```diff
- Logger.warn(
-   `Reusing description for repeated image ${imageName} on page ${pageNum}`,
- )
+ Logger.debug(
+   `Reusing description for repeated image ${imageName} on page ${pageNum}`,
+ )
```
520-527: Replace @ts-ignore with explicit reasoning or safer typing.
Where unavoidable (page.objs.has), add a one-line comment explaining the pdf.js typing gap; elsewhere prefer type assertions.
Examples:
```diff
- // @ts-ignore
- if (
+ // pdf.js page.objs is untyped; some builds expose .has(name)
+ if (
    typeof (page.objs as any).has === "function" &&
    (page.objs as any).has(name)
  ) {
-   // @ts-ignore
-   return (page.objs as any).get(name)
+   return (page.objs as any).get(name)
  }
@@
- // @ts-ignore draw directly
- cctx.drawImage(imgLike, 0, 0)
+ cctx.drawImage(imgLike as any, 0, 0)
@@
- // @ts-ignore
- let type = await imageType(buffer)
+ let type = imageType(buffer)
```
Also applies to: 822-823, 1156-1156
551-552: Remove unused flags (imagesOnPage, vectorOpsDetected) or use them.
They’re set but never read; simplify to reduce noise.
```diff
- let imagesOnPage = 0
- let vectorOpsDetected = false
+ // (removed unused variables)
```
And drop their increments/assignments.
Also applies to: 575-576
503-504: Avoid double getTextContent per page.
buildParagraphsFromPage calls getTextContent; you call it again for textOperatorCount. Consider passing textContent down or counting from the first call.
Also applies to: 372-376
1212-1217: Hashing: update directly with Buffer (micro-optimization).
```diff
- const imageHash = crypto
-   .createHash("md5")
-   .update(new Uint8Array(buffer))
+ const imageHash = crypto
+   .createHash("md5")
+   .update(buffer)
    .digest("hex")
```
17-18: Consider bounding seenHashDescriptions (memory).
Unbounded Map may grow across process lifetime; use LRU/TTL.
540-549: Dead code: currentParagraph/flushParagraph no-ops.
You never append to currentParagraph. Remove to reduce confusion, or wire it up.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- server/integrations/microsoft/index.ts (1 hunks)
- server/pdfChunks.ts (10 hunks)
✅ Files skipped from review due to trivial changes (1)
- server/integrations/microsoft/index.ts
🧰 Additional context used
🔇 Additional comments (4)
server/pdfChunks.ts (4)
317-324: Doc ID sanitization looks good.
Safe FS usage with safeDocId; nice.
989-1076: Correct 1bpp handling and ImageData(width, height) fix look solid.
Bit-packed grayscale expansion and explicit height are correct.
103-147: Paragraph placeholder approach is robust.
Unique PUA-wrapped token avoids collisions; cleanup logic is correct.
557-575: Using PDFJS.OPS constants fixes prior magic-number issue.
Readability and forward-compat improved.
Also applies to: 577-586
Actionable comments posted: 2
♻️ Duplicate comments (4)
server/pdfChunks.ts (4)
19-20
: Replace magic numbers in line-gap tolerance with named constants.You already have
MIN_LINE_HEIGHT_FOR_TOLERANCE
; also name the tolerance factor.-// Minimum line height used for calculating line break detection tolerance (in PDF units) -const MIN_LINE_HEIGHT_FOR_TOLERANCE = 10 +// Minimum line height used for calculating line break detection tolerance (in PDF units) +const MIN_LINE_HEIGHT_FOR_TOLERANCE = 10 +// Fraction of line height used as Y-delta tolerance for line breaks +const LINE_GAP_TOLERANCE_FACTOR = parseFloat( + process.env.LINE_GAP_TOLERANCE_FACTOR || "0.4", +)- const tol = Math.max(prevH || 0, h || 0, 10) * 0.4 // dynamic tolerance + const tol = + Math.max(prevH || 0, h || 0, MIN_LINE_HEIGHT_FOR_TOLERANCE) * + LINE_GAP_TOLERANCE_FACTORAlso applies to: 392-394
1383-1386
: Avoid dumping full chunk contents to logs; gate behind an env flag.Large plaintext logs risk noise/PII. Keep your dev-time verbosity but guard it.
- Logger.debug("All text chunks", { text_chunks }) - Logger.debug("All text chunk positions", { text_chunk_pos }) - Logger.debug("All image chunks", { image_chunks }) - Logger.debug("All image chunk positions", { image_chunk_pos }) + if (process.env.DEBUG_PDF_CONTENT === "true") { + Logger.debug("All text chunks", { text_chunks }) + Logger.debug("All text chunk positions", { text_chunk_pos }) + Logger.debug("All image chunks", { image_chunks }) + Logger.debug("All image chunk positions", { image_chunk_pos }) + } else { + Logger.debug("Chunk counts", { + textChunks: text_chunks.length, + imageChunks: image_chunks.length, + textPos: text_chunk_pos.length, + imagePos: image_chunk_pos.length, + }) + }
727-731
: Drop unnecessaryawait
and@ts-ignore
onimageType
; type it and keep default.
image-type
is sync; theawait
and suppressions hide type issues.- // @ts-ignore - let type = await imageType(buffer) - if (!type) type = { mime: "image/png", ext: "png" } + let type = imageType(buffer) as { mime: string; ext?: string } | null + if (!type) type = { mime: "image/png", ext: "png" }- // @ts-ignore - let type = await imageType(buffer) - if (!type) type = { mime: "image/png", ext: "png" } + let type = imageType(buffer) as { mime: string; ext?: string } | null + if (!type) type = { mime: "image/png", ext: "png" }- // @ts-ignore - let type = await imageType(buffer) + let type = imageType(buffer) as { mime: string; ext?: string } | null + if (!type) { + Logger.debug("Could not determine MIME type, defaulting to image/png", { imageName }) + type = { mime: "image/png", ext: "png" } + }Also applies to: 826-829, 1164-1183
1087-1093
: RespectUint8Array
view offsets when creatingBuffer
.Using the underlying
ArrayBuffer
can include extra bytes.- const imgBuffer = Buffer.from(uint8Data.buffer) + const imgBuffer = Buffer.from( + uint8Data.buffer, + uint8Data.byteOffset, + uint8Data.byteLength, + )
🧹 Nitpick comments (5)
server/pdfChunks.ts (5)
236-238
: Inline the trivial return.Minor readability nit.
- const result = typeof text === "string" ? text : "" - return result + return typeof text === "string" ? text : ""
270-270
: Use configured chunk size instead of hardcoded 512/128.Aligns with
DATASOURCE_CONFIG.MAX_CHUNK_SIZE
and makes behavior tunable.- const chunks = chunkTextByParagraph(cleanedText, 512, 128) + const chunks = chunkTextByParagraph( + cleanedText, + DATASOURCE_CONFIG.MAX_CHUNK_SIZE, + Math.min(128, Math.floor(DATASOURCE_CONFIG.MAX_CHUNK_SIZE / 4)), + )
544-548
: CTM tracking is computed but not consumed.
currentCTM/ctmStack
are updated but never used for placement decisions or cropping. Either wire this into downstream logic or remove to reduce complexity.Also applies to: 589-604
1234-1236
: Adjust log level for cache hits.Reusing an image description is expected;
warn
is noisy. Suggestdebug
orinfo
.- Logger.warn( + Logger.debug( `Reusing description for repeated image ${imageName} on page ${pageNum}`, )
28-42
: Broaden zero-width character removal (optional).Add other common ZW chars to avoid stray artifacts in output.
- normalized = normalized.replace(/\u200B/g, "") // zero-width space + normalized = normalized.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "") // zero-width chars
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/pdfChunks.ts
(8 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-11T08:30:51.408Z
Learnt from: Aman-Asrani-Juspay
PR: xynehq/xyne#777
File: server/api/agent/workflowAgentUtils.ts:95-103
Timestamp: 2025-09-11T08:30:51.408Z
Learning: The development team prefers to keep detailed logging (including user emails, attachment IDs, and content previews) during integration phases for debugging purposes, even if it includes potentially sensitive information. They prioritize debugging capability over log privacy during development.
Applied to files:
server/pdfChunks.ts
📚 Learning: 2025-08-11T08:44:08.412Z
Learnt from: MayankBansal2004
PR: xynehq/xyne#719
File: server/api/chat/chat.ts:4023-4029
Timestamp: 2025-08-11T08:44:08.412Z
Learning: In server/api/chat/chat.ts and similar files, MayankBansal2004 prefers not to add explicit type definitions for metadata.usage structures when the receiving array is already typed and default values ensure type safety through the || 0 pattern.
Applied to files:
server/pdfChunks.ts
🧬 Code graph analysis (1)
server/pdfChunks.ts (4)
server/logger/index.ts (1)
getLogger
(33-90)server/chunks.ts (1)
chunkTextByParagraph
(9-112)server/integrations/dataSource/config.ts (1)
DATASOURCE_CONFIG
(3-71)server/lib/describeImageWithllm.ts (1)
describeImageWithllm
(56-112)
🔇 Additional comments (1)
server/pdfChunks.ts (1)
513-521: Ordering contract: verify consumer merges by *_pos.
Images are emitted during operator pass; text chunks are appended after page processing. Ensure downstream merges by pos to preserve reading order; otherwise, order will skew.
Also applies to: 1350-1356, 1358-1359
Actionable comments posted: 2
♻️ Duplicate comments (5)
server/pdfChunks.ts (5)
613-616
: Remove?: null
in switch cases; guard inside the case.The
case extractImages ? OPS : null
creates deadcase null
branches.- case extractImages ? PDFJS.OPS.paintImageXObject : null: - case extractImages ? PDFJS.OPS.paintImageXObjectRepeat : null: - case extractImages ? PDFJS.OPS.paintInlineImageXObject : null: - case extractImages ? PDFJS.OPS.paintImageMaskXObject : null: { + case PDFJS.OPS.paintImageXObject: + case PDFJS.OPS.paintImageXObjectRepeat: + case PDFJS.OPS.paintInlineImageXObject: + case PDFJS.OPS.paintImageMaskXObject: { + if (!extractImages) break
1397-1400
: Don’t log full chunk contents (PII risk, heavy logs).Even at debug, avoid dumping arrays of content. Log counts/lengths only or behind an explicit env flag.
- Logger.debug("All text chunks", { text_chunks }) - Logger.debug("All text chunk positions", { text_chunk_pos }) - Logger.debug("All image chunks", { image_chunks }) - Logger.debug("All image chunk positions", { image_chunk_pos }) + // Avoid dumping raw content; counts are already logged above.
522-541
: Explain or eliminate @ts-ignore usages around page.objs.Either narrow types with assertions or add a brief why-it’s-safe comment per occurrence to prevent type regressions.
883-889
: Also exit early for small image-like objects.Same fall-through risk here; add a continue.
- if (width < MIN_IMAGE_DIM_PX || height < MIN_IMAGE_DIM_PX) { + if (width < MIN_IMAGE_DIM_PX || height < MIN_IMAGE_DIM_PX) { Logger.debug("Skipped small image-like object", { imageName, width, height, minRequired: MIN_IMAGE_DIM_PX, }) - } else { + continue + } else {
783-799
: Ensure “skip” actually skips further processing (canvas-like images).On small or too-large canvas images you log but don’t exit; control falls through to other fallbacks and may still process the image.
- if (width < MIN_IMAGE_DIM_PX || height < MIN_IMAGE_DIM_PX) { + if (width < MIN_IMAGE_DIM_PX || height < MIN_IMAGE_DIM_PX) { Logger.debug("Skipped small canvas image", { imageName, width, height, minRequired: MIN_IMAGE_DIM_PX, }) - } else { + continue + } else { const buffer = c.toBuffer("image/png") // Run all filters BEFORE attempting LLM description if ( buffer.length > DATASOURCE_CONFIG.MAX_IMAGE_FILE_SIZE_MB * 1024 * 1024 ) { Logger.warn( `Skipping objs/canvas image due to size ${(buffer.length / (1024 * 1024)).toFixed(2)} MB: ${imageName}`, ) - } else { + continue + } else {
🧹 Nitpick comments (6)
server/pdfChunks.ts (6)
18-20
: Use the defined constant in tolerance calc (avoid magic number).You introduced MIN_LINE_HEIGHT_FOR_TOLERANCE but still hardcode 10 in the tolerance. Use the constant for readability and single-point tuning.
-const MIN_LINE_HEIGHT_FOR_TOLERANCE = 10 +const MIN_LINE_HEIGHT_FOR_TOLERANCE = 10 @@ - const tol = Math.max(prevH || 0, h || 0, 10) * 0.4 // dynamic tolerance + const tol = Math.max(prevH || 0, h || 0, MIN_LINE_HEIGHT_FOR_TOLERANCE) * 0.4 // dynamic toleranceAlso applies to: 392-394
369-424
: Avoid duplicate getTextContent call per page.You fetch textContent inside buildParagraphsFromPage and again for textOperatorCount. Cache once and reuse (pass item count out or return alongside paragraphs).
Also applies to: 502-503
549-574
: Remove or use vectorOpsDetected.Flag is set but never read; drop it or hook it into heuristics.
800-803
: Drop unnecessaryawait
on imageType; it’s synchronous.Removes
@ts-ignore
, improves perf and types.- // @ts-ignore - let type = await imageType(buffer) + let type = imageType(buffer)- // @ts-ignore - let type = await imageType(buffer) + let type = imageType(buffer)- // @ts-ignore - let type = await imageType(buffer) + let type = imageType(buffer)Also applies to: 909-912, 1176-1182
1099-1105
: Respect Uint8Array view offsets when creating Buffer.Using the underlying ArrayBuffer can include unrelated bytes.
- const imgBuffer = Buffer.from(uint8Data.buffer) + const imgBuffer = Buffer.from( + uint8Data.buffer, + uint8Data.byteOffset, + uint8Data.byteLength, + )
1247-1249
: Use debug (not warn) when reusing cached image descriptions.Re-use is expected; warn is too noisy.
- Logger.warn( - `Reusing description for repeated image ${imageName} on page ${pageNum}`, - ) + Logger.debug( + `Reusing description for repeated image ${imageName} on page ${pageNum}`, + )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/pdfChunks.ts
(8 hunks)
🧰 Additional context used
🔇 Additional comments (6)
server/pdfChunks.ts (6)
28-42: normalizeText: solid Unicode and whitespace normalization.
NFC + control-char stripping (except NL/TAB), nbsp/ZWSP handling, and trim look good.
112-125: cleanText pipeline is well-structured and safer placeholder.
Private-use sentinel for paragraph breaks + hyphenation and punctuation fixes are sensible. Good step-up from magic strings like [[PARA]].
Also applies to: 133-147
46-99: Validate despacing behavior for mixed letter-digit runs.
Current rule merges any >=3 single tokens, including digits. “B 2 B” -> “B2B” and “3 D” -> “3D”. If that’s intended, fine; otherwise restrict digits.
306-324: Ordering guarantee: confirm consumer uses positions to reconstruct interleaving.
Images are pushed during operator iteration; text is flushed after the page. Ensure downstream merges by sequence (pos arrays) and not by array order. If not, we’ll need intra-page flushing of text around image ops.
Also applies to: 1332-1345
158-186: Matrix math utility at module scope: good move.
Reusability + testability improved.
1009-1096: Pixel-format handling incl. 1bpp unpack: looks correct.
Expected length computation and MSB-first bit expansion are handled correctly; ImageData(width, height) usage fixed.
```typescript
case extractImages ? PDFJS.OPS.paintImageXObject : null:
case extractImages ? PDFJS.OPS.paintImageXObjectRepeat : null:
case extractImages ? PDFJS.OPS.paintInlineImageXObject : null:
case extractImages ? PDFJS.OPS.paintImageMaskXObject : null: {
```
💡 Verification agent
🧩 Analysis chain
One-shot verification script for the above changes.
Run to spot remaining patterns across the repo.
Also applies to: 835-847, 948-956, 1308-1318, 783-799, 883-889, 800-803, 909-912, 1099-1105, 1397-1400
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Searching for switch cases with '?: null'..."
rg -n 'case\s+[^:\n]+\?\s*[^:]+:\s*null' -C1 --type=ts
echo -e "\nFinding Buffer.from(ArrayBuffer) without offsets..."
rg -nP 'Buffer\.from\(\s*[^,()]+\.buffer\s*\)' --type=ts -C2
echo -e "\nFinding 'await imageType(' occurrences..."
rg -n 'await\s+imageType\(' --type=ts -C2
echo -e "\nFinding debug logs dumping chunk arrays..."
rg -nP 'Logger\.debug\([^)]*(text_chunks|image_chunks|text_chunk_pos|image_chunk_pos)' --type=ts -C2
echo -e "\nChecking for path.join(baseDir, docid) writes..."
rg -nP 'path\.(join|resolve)\([^)]*docid[^)]*\).*writeFile' --type=ts -C3
Length of output: 3184
Fix critical Buffer.from(ArrayBuffer) bug; remove ternary case labels; verify imageType usage; stop dumping large arrays to logs
- Critical — server/pdfChunks.ts:1099: Buffer.from(uint8Data.buffer) uses the entire backing ArrayBuffer (ignores byteOffset/length). Replace with Buffer.from(uint8Data) or Buffer.from(uint8Data.buffer, uint8Data.byteOffset, uint8Data.byteLength).
- Logic — server/pdfChunks.ts:613–616: lines like case extractImages ? PDFJS.OPS.paintImageXObject : null: produce case null when extractImages is false; remove the ternary and either guard extractImages outside the switch or check inside the case block.
- Verify — server/pdfChunks.ts:801–803, 909–912, 1177–1179 and server/integrations/dataSource/index.ts:547: multiple await imageType(...) uses. Confirm whether your image-type dependency is async; if it’s synchronous, drop await and the @ts-ignore; if async, consider reducing per-image overhead.
- Logging — server/pdfChunks.ts:1397–1400: Logger.debug dumps full chunk arrays (text_chunks, image_chunks, …). Avoid logging large arrays (truncate or gate behind debug-level sampling).
🤖 Prompt for AI Agents
In server/pdfChunks.ts around lines 613–616, 801–803, 909–912, 1099, 1177–1179
and around 1397–1400: remove the ternary expressions from the switch case labels
(e.g., `case extractImages ? PDFJS.OPS.paintImageXObject : null:`) and instead
either guard the switch with `if (extractImages) { ... }` or keep the case
labels as the OPS constants and check `if (!extractImages) break` inside the
case body; fix the Buffer.from(ArrayBuffer) bug at line ~1099 by using the
Uint8Array directly (Buffer.from(uint8Data)) or pass the buffer with offsets
(Buffer.from(uint8Data.buffer, uint8Data.byteOffset, uint8Data.byteLength)) so
you don't include the entire backing ArrayBuffer; review uses of `await
imageType(...)` at the listed lines and either remove `await` (and the
@ts-ignore) if imageType is synchronous or keep it but batch/reduce per-image
calls if it is async; and stop dumping full large arrays in Logger.debug around
lines 1397–1400—truncate, sample, or gate those logs behind a strict debug flag
to avoid huge outputs.
```typescript
const baseDir = path.resolve(
  process.env.IMAGE_DIR ||
    "downloads/xyne_images_db",
)
const outputDir = path.join(baseDir, docid)
await fsPromises.mkdir(outputDir, {
  recursive: true,
})
const imageFilename = `${globalSeq.value}.${type.ext || "png"}`
const imagePath = path.join(
  outputDir,
  imageFilename,
)
```
Block path traversal via docid when writing images.
Current path.join(baseDir, docid) can escape baseDir. Keep raw docid (per product decision) but validate resolved path.
- const baseDir = path.resolve(
+ const baseDir = path.resolve(
process.env.IMAGE_DIR ||
"downloads/xyne_images_db",
)
- const outputDir = path.join(baseDir, docid)
+ const resolvedOutputDir = path.resolve(baseDir, docid)
+ const rel = path.relative(baseDir, resolvedOutputDir)
+ if (rel.startsWith("..") || path.isAbsolute(rel)) {
+ Logger.warn("Refusing to write image outside base dir", { baseDir, docid, resolvedOutputDir })
+ break
+ }
+ const outputDir = resolvedOutputDir
- const baseDir = path.resolve(
+ const baseDir = path.resolve(
process.env.IMAGE_DIR ||
"downloads/xyne_images_db",
)
- const outputDir = path.join(baseDir, docid)
+ const resolvedOutputDir = path.resolve(baseDir, docid)
+ const rel = path.relative(baseDir, resolvedOutputDir)
+ if (rel.startsWith("..") || path.isAbsolute(rel)) {
+ Logger.warn("Refusing to write image outside base dir", { baseDir, docid, resolvedOutputDir })
+ break
+ }
+ const outputDir = resolvedOutputDir
- const baseDir = path.resolve(
+ const baseDir = path.resolve(
process.env.IMAGE_DIR || "downloads/xyne_images_db",
)
- const outputDir = path.join(baseDir, docid)
+ const resolvedOutputDir = path.resolve(baseDir, docid)
+ const rel = path.relative(baseDir, resolvedOutputDir)
+ if (rel.startsWith("..") || path.isAbsolute(rel)) {
+ Logger.warn("Refusing to write image outside base dir", { baseDir, docid, resolvedOutputDir })
+ continue
+ }
+ const outputDir = resolvedOutputDir
Also applies to: 948-956, 1308-1318
🤖 Prompt for AI Agents
In server/pdfChunks.ts around lines 835-847, the code uses path.join(baseDir,
docid) which allows docid to escape baseDir; keep using the raw docid per
product decision but replace the join with path.resolve(baseDir, docid) and
validate that the resulting resolved path is inside baseDir (e.g., compute
path.relative(baseDir, resolvedPath) and ensure it does not start with '..' and
that resolvedPath starts with baseDir + path.sep or equals baseDir); if
validation fails, log/error and abort the write. Apply the same
resolve+validation logic to the other occurrences at lines 948-956 and
1308-1318, and only call fsPromises.mkdir/write after the path has been
validated.
Description
| Aspect | Old cleanText (commented) | New implementation |
|--------|----------------------------|--------------------|
| Scope | Simple control character removal | Comprehensive text normalization |
| Unicode | No Unicode normalization | Uses NFC normalization |
| Whitespace | Basic line ending normalization | Advanced whitespace handling (nbsp, zero-width, tabs) |
| Letter spacing | No handling | Smart letter-spacing correction via smartDespaceLine |
| Structure | Single function | Modular approach (normalizeText + smartDespaceLine + cleanText) |
| PDF-specific issues | Limited handling | Addresses common PDF extraction problems |

Current cleanText function: the new cleanText (the active one, not the commented version) combines multiple steps — normalizeText for basic cleaning and smartDespaceLine to fix spacing issues.
The evolution shows a progression from basic text cleaning to sophisticated PDF text normalization that handles real-world extraction challenges like letter-spacing artifacts, inconsistent Unicode representations, and complex whitespace patterns.
I've analyzed the text processing functions in server/pdfChunks.ts. Here's a comprehensive explanation of each.
input.normalize("NFC")
This applies Unicode Normalization Form Canonical (NFC) to ensure characters are in their canonical composed form. For example, if "é" is represented as separate characters (base "e" + combining accent "´"), NFC combines them into a single composed character. This is crucial for PDFs that may contain inconsistent Unicode representations.
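For instance (illustrative only, not code from the PR):

```typescript
const decomposed = "e\u0301" // "e" + combining acute accent (two code points)
const composed = decomposed.normalize("NFC") // single precomposed "é"
console.log(decomposed.length, composed.length) // 2 1
```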
normalizeText function
A foundational text cleaning function that performs:
- Unicode normalization – uses NFC as explained above
- Control character removal – strips control characters except newlines/tabs using /[^\P{C}\n\t]/gu
- Whitespace normalization – converts non-breaking spaces (\u00A0) to regular spaces and removes zero-width spaces (\u200B)
- Trim – removes leading/trailing whitespace
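A minimal sketch of a normalizer with these steps (an assumption based on the description above, not necessarily the PR's exact code):

```typescript
function normalizeText(input: string): string {
  let normalized = input.normalize("NFC") // canonical composition
  // Strip control characters except newline and tab
  normalized = normalized.replace(/[^\P{C}\n\t]/gu, "")
  normalized = normalized.replace(/\u00A0/g, " ") // non-breaking space -> space
  normalized = normalized.replace(/\u200B/g, "") // drop zero-width spaces
  return normalized.trim()
}
```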
smartDespaceLine function
Fixes letter-spacing artifacts common in PDF text extraction. PDFs often extract text with artificial spacing like "N A S A" instead of "NASA"; this function detects and collapses such runs.
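A rough sketch of the idea — an assumption of how such a pass could work, not the PR's implementation — is to collapse runs of single characters separated by single spaces:

```typescript
// Hypothetical: merge runs of 3+ single-character tokens ("N A S A" -> "NASA"),
// while leaving normally spaced words alone.
function smartDespaceLine(line: string): string {
  return line.replace(/\b(?:\S ){2,}\S\b/g, (run) => {
    const tokens = run.split(" ")
    const allSingle = tokens.every((t) => t.length === 1)
    return allSingle ? tokens.join("") : run
  })
}

console.log(smartDespaceLine("N A S A launched a rocket")) // "NASA launched a rocket"
```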
Difference with the commented cleanText
The commented version was simpler and only handled basic control character removal and line ending normalization. The current implementation is much more sophisticated:

| Aspect | Old (commented) | New implementation |
|--------|-----------------|--------------------|
| Scope | Simple control char removal | Comprehensive normalization |
| Unicode | None | NFC normalization |
| Whitespace | Basic line endings | Advanced (nbsp, zero-width, tabs) |
| Letter spacing | No handling | Smart correction via smartDespaceLine |
| Structure | Single function | Modular approach |
| PDF-specific | Limited | Addresses real PDF extraction issues |

The current cleanText function combines all these steps plus handles hyphenation across line breaks, paragraph management, punctuation spacing, and excessive space normalization, making it much better suited for processing real-world PDF text extraction challenges.
Image based changes overview
CTM (Current Transformation Matrix) and Enhanced Image Capturing
CTM stands for Current Transformation Matrix – a 2D transformation matrix used in PDF rendering to track coordinate system transformations. It's represented as [a, b, c, d, e, f], where a, b, c, d encode scale/rotation/skew and e, f encode translation.
CTM Tracking Enhancement:
The CTM tracking enhances image capturing by:
Code Block Analysis:
1. Transform and Save/Restore Operations:
What this does:
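A hedged sketch of what tracking the CTM across transform/save/restore operators can look like — the operator handling and helper names here are assumptions, not the PR's code:

```typescript
type Matrix = [number, number, number, number, number, number]

// 2D affine multiply in PDF's row-vector convention: m1 is applied first.
function multiply(m1: Matrix, m2: Matrix): Matrix {
  return [
    m1[0] * m2[0] + m1[1] * m2[2],
    m1[0] * m2[1] + m1[1] * m2[3],
    m1[2] * m2[0] + m1[3] * m2[2],
    m1[2] * m2[1] + m1[3] * m2[3],
    m1[4] * m2[0] + m1[5] * m2[2] + m2[4],
    m1[4] * m2[1] + m1[5] * m2[3] + m2[5],
  ]
}

let currentCTM: Matrix = [1, 0, 0, 1, 0, 0] // identity
const ctmStack: Matrix[] = []

// Inside the operator loop (operator names follow pdf.js OPS):
function onOperator(op: number, args: unknown[], OPS: Record<string, number>): void {
  switch (op) {
    case OPS.transform: // "cm": pre-concatenate onto the current CTM
      currentCTM = multiply(args as unknown as Matrix, currentCTM)
      break
    case OPS.save: // "q": push a copy of the current CTM
      ctmStack.push([...currentCTM] as Matrix)
      break
    case OPS.restore: // "Q": pop the last saved CTM
      if (ctmStack.length) currentCTM = ctmStack.pop()!
      break
  }
}
```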
2. resolveImageByName function
Enhancement: This provides a robust way to retrieve image objects by name from PDF.js's internal object store, handling different PDF.js builds and API variations.
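For example, a defensive lookup against the page's object stores could be sketched as follows (names and store handling are assumptions; the real helper may differ):

```typescript
// Hypothetical: resolve an image object by name from pdf.js object stores,
// tolerating API differences between builds.
function resolveImageByName(page: any, name: string): any | null {
  const stores = [page.objs, page.commonObjs].filter(Boolean)
  for (const store of stores) {
    try {
      if (typeof store.has === "function" && !store.has(name)) continue
      const obj = store.get(name)
      if (obj) return obj
    } catch {
      // store.get can throw if the object has not been resolved yet
    }
  }
  return null
}
```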
3. isCanvasLike and isImageLike logic
What these do:
- isCanvasLike: detects HTML5 Canvas objects that can be directly converted to PNG via toBuffer()
- isImageLike: detects image-like objects (have dimensions but aren't canvas) that need to be drawn onto a canvas first
4. GRAYSCALE_1BPP Processing:
What this does:
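As an illustration of unpacking 1 bit-per-pixel data into RGBA before PNG conversion — a sketch with assumed row padding and polarity, not necessarily the PR's exact code:

```typescript
// Hypothetical: expand MSB-first, 1 bit-per-pixel rows (padded to whole bytes)
// into an RGBA buffer suitable for canvas ImageData(width, height).
function expand1bppToRgba(data: Uint8Array, width: number, height: number): Uint8ClampedArray {
  const bytesPerRow = Math.ceil(width / 8)
  const out = new Uint8ClampedArray(width * height * 4)
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const byte = data[y * bytesPerRow + (x >> 3)]
      const bit = (byte >> (7 - (x & 7))) & 1 // MSB first
      const value = bit ? 255 : 0 // assumed polarity: 1 = white, 0 = black
      const i = (y * width + x) * 4
      out[i] = out[i + 1] = out[i + 2] = value
      out[i + 3] = 255 // opaque alpha
    }
  }
  return out
}
```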
5. Image Hash and Description Generation:
What this does:
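A sketch of the hash-then-describe caching idea (function names are assumptions; the LLM helper is passed in rather than asserting the real describeImageWithllm signature):

```typescript
import crypto from "crypto"

const seenHashDescriptions = new Map<string, string>()

// Hypothetical: hash the PNG bytes, reuse a cached description for repeated
// images, and only call the LLM for images we have not seen before.
async function describeOnce(
  buffer: Buffer,
  describe: (buf: Buffer, hash: string) => Promise<string>,
): Promise<{ hash: string; description: string }> {
  const hash = crypto.createHash("md5").update(buffer).digest("hex")
  const cached = seenHashDescriptions.get(hash)
  if (cached) return { hash, description: cached }
  const description = await describe(buffer, hash)
  seenHashDescriptions.set(hash, description)
  return { hash, description }
}
```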
Overall Enhancement Impact:
These changes transform basic image extraction into a sophisticated system that:
The result is much more reliable and efficient image extraction from complex PDF documents.
Summary by CodeRabbit
New Features
Improvements
Tests
Style