feat: KB - PDF with password #4106
Merged
…oving chunking. (#3774)

- **Refactored PDF Readers:** Unified PDF reading logic in `BasePDFReader`, reducing duplicate code and improving maintainability. All derived PDF reader classes (`PDFReader`, `PDFUrlReader`, `PDFImageReader`, `PDFUrlImageReader`) now delegate the main reading and document creation logic to shared methods.
- **Page Number Handling:** Introduced `_clean_page_numbers` to detect, remove, or reformat page numbers from PDF page texts. This addresses issues where page numbers merged with content, e.g., "this is chapter 1" + page number "2" became "this is chapter 12". Now, page numbers are reliably separated or removed.
- **Configurable Page Number Formats:** Added options to customize how (or if) page numbers appear in document contents via `page_start_numbering_format` and `page_end_numbering_format`.
- **Improved OCR Processing:** Centralized and streamlined image-based text extraction using `_ocr_reader` and `_async_ocr_reader`.
- **Asynchronous Support:** Asynchronous PDF reading is now consistent with synchronous logic, including correct page number and OCR handling.
- **Chunking Logic Unchanged:** The chunking of documents (splitting large documents into smaller chunks) remains functionally unchanged, but is now called from a shared method.
- **Page Number Merging Bug Fixed:** Previously, when a page contained text like "this is chapter 1" and the page number was "2", the resulting content was "this is chapter 12" (merging the page number with the last digit of the content). With the new logic, page numbers are detected and removed or formatted correctly, so this issue is resolved.
- The changes are internal refactors and improvements; the public APIs and main reading workflow remain the same.
- Default behaviors for chunking and document splitting are preserved.
- If no page numbers exist or numbering is inconsistent, content is processed as before.
- When page numbering is recognized, it is made clear that it concerns page numbering by formatting it as `<start page {page_nr}>` and `<end page {page_nr}>`. These can be removed completely, formatted differently or set to `{page_nr}` to preserve the old behavior.

- [x] Bug fix
- [x] New feature
- [ ] Breaking change -> No, though I would recommend to change the default for the new flag "split_on_pages" to False. Now it's true, which is backwards compatible.
- [x] Improvement
- [ ] Model update
- [ ] Other:

---

- [x] Code complies with style guidelines
- [ ] Ran format/validation scripts (`./scripts/format.sh` and `./scripts/validate.sh`) <- Doesn't for me? Will try again... [See here](#3602)
- [x] Self-review completed
- [x] Documentation updated (comments, docstrings)
- [ ] Examples and guides: Relevant cookbook examples have been included or updated (if applicable)
- [ ] Tested in clean environment
- [x] Tests added/updated (if applicable)

---------

Co-authored-by: Siete Frouws <siete.frouws@bincy.nl>
Co-authored-by: Mustafa Esoofally <coolmusta@gmail.com>
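To illustrate the page number merging bug described in #3774, here is a minimal sketch of the idea. It is not the PR's actual `_clean_page_numbers` implementation. The function `clean_page_numbers` and its detection rule are hypothetical: it assumes each page's raw text ends with its 1-based page number on a separate line. The parameter names are borrowed from the options mentioned in the description, and the default marker strings follow the `<start page {page_nr}>` / `<end page {page_nr}>` formats quoted there.

```python
from typing import List, Optional


def clean_page_numbers(
    pages: List[str],
    page_start_numbering_format: Optional[str] = "<start page {page_nr}>",
    page_end_numbering_format: Optional[str] = "<end page {page_nr}>",
) -> List[str]:
    """Hypothetical helper: drop a trailing page number and add explicit markers.

    Assumption: a page number, if present, is the last line of the page text
    and equals the page's 1-based position in the document.
    """
    cleaned = []
    for page_nr, text in enumerate(pages, start=1):
        lines = text.strip().splitlines()
        # Remove the last line only when it is exactly the expected page number,
        # so content such as "this is chapter 1" is left untouched.
        if lines and lines[-1].strip() == str(page_nr):
            lines = lines[:-1]
        body = "\n".join(lines).strip()

        parts = []
        if page_start_numbering_format:
            parts.append(page_start_numbering_format.format(page_nr=page_nr))
        parts.append(body)
        if page_end_numbering_format:
            parts.append(page_end_numbering_format.format(page_nr=page_nr))
        cleaned.append("\n".join(parts))
    return cleaned


pages = ["intro\n1", "this is chapter 1\n2"]
marked = clean_page_numbers(pages)
plain = clean_page_numbers(pages, None, None)
```

In this sketch, passing `None` for both formats removes the page numbers entirely, so the second page becomes "this is chapter 1" rather than "this is chapter 12". The real reader may detect page numbers differently, for example at the start of a page or with a consistency check across pages.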
## Summary

Reorder messaging to preserve conversation history. Solves: #3849

## Type of change

- [x] Bug fix
- [x] New feature
- [x] Breaking change
- [ ] Improvement
- [ ] Model update
- [ ] Other:

---

## Checklist

- [x] Code complies with style guidelines
- [x] Ran format/validation scripts (`./scripts/format.sh` and `./scripts/validate.sh`)
- [ ] Self-review completed
- [ ] Documentation updated (comments, docstrings)
- [ ] Examples and guides: Relevant cookbook examples have been included or updated (if applicable)
- [x] Tested in clean environment
- [x] Tests added/updated (if applicable)

---------

Co-authored-by: Dirk Brand <dirkbrnd@gmail.com>
## Summary

Describe key changes, mention related issues or motivation for the changes.

(If applicable, issue number: #\_\_\_\_)

## Type of change

- [x] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Improvement
- [ ] Model update
- [ ] Other:

---

## Checklist

- [ ] Code complies with style guidelines
- [ ] Ran format/validation scripts (`./scripts/format.sh` and `./scripts/validate.sh`)
- [ ] Self-review completed
- [ ] Documentation updated (comments, docstrings)
- [ ] Examples and guides: Relevant cookbook examples have been included or updated (if applicable)
- [ ] Tested in clean environment
- [ ] Tests added/updated (if applicable)

---

## Additional Notes

Add any important context (deployment instructions, screenshots, security considerations, etc.)

---------

Co-authored-by: Dirk Brand <dirkbrnd@gmail.com>
## Summary

Add missing reranker in pgvector hybrid search.

## Type of change

- [x] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Improvement
- [ ] Model update
- [ ] Other:

---

## Checklist

- [ ] Code complies with style guidelines
- [ ] Ran format/validation scripts (`./scripts/format.sh` and `./scripts/validate.sh`)
- [ ] Self-review completed
- [ ] Documentation updated (comments, docstrings)
- [ ] Examples and guides: Relevant cookbook examples have been included or updated (if applicable)
- [ ] Tested in clean environment
- [ ] Tests added/updated (if applicable)

---

## Additional Notes

Add any important context (deployment instructions, screenshots, security considerations, etc.)

---------

Co-authored-by: Dirk Brand <dirkbrnd@gmail.com>
## Summary

Adds new streaming events when using `output_model`.

(If applicable, issue number: #\_\_\_\_)

## Type of change

- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Improvement
- [ ] Model update
- [ ] Other:

---

## Checklist

- [ ] Code complies with style guidelines
- [ ] Ran format/validation scripts (`./scripts/format.sh` and `./scripts/validate.sh`)
- [ ] Self-review completed
- [ ] Documentation updated (comments, docstrings)
- [ ] Examples and guides: Relevant cookbook examples have been included or updated (if applicable)
- [ ] Tested in clean environment
- [ ] Tests added/updated (if applicable)

---

## Additional Notes

Add any important context (deployment instructions, screenshots, security considerations, etc.)