@Mustafa-Esoofally Mustafa-Esoofally commented Aug 5, 2025

Summary

Type of change

  • Bug fix
  • New feature
  • Breaking change
  • Improvement
  • Model update
  • Other:

Checklist

  • Code complies with style guidelines
  • Ran format/validation scripts (./scripts/format.sh and ./scripts/validate.sh)
  • Self-review completed
  • Documentation updated (comments, docstrings)
  • Examples and guides: Relevant cookbook examples have been included or updated (if applicable)
  • Tested in clean environment
  • Tests added/updated (if applicable)

@Mustafa-Esoofally Mustafa-Esoofally requested a review from a team as a code owner August 5, 2025 15:15
Mustafa-Esoofally and others added 18 commits August 5, 2025 20:47
…oving chunking. (#3774)

- **Refactored PDF Readers:** Unified PDF reading logic in
`BasePDFReader`, reducing duplicate code and improving maintainability.
All derived PDF reader classes (`PDFReader`, `PDFUrlReader`,
`PDFImageReader`, `PDFUrlImageReader`) now delegate the main reading and
document creation logic to shared methods.
- **Page Number Handling:** Introduced `_clean_page_numbers` to detect,
remove, or reformat page numbers from PDF page texts. This addresses
issues where page numbers merged with content, e.g., "this is chapter 1"
+ page number "2" became "this is chapter 12". Now, page numbers are
reliably separated or removed.
- **Configurable Page Number Formats:** Added options to customize how
(or if) page numbers appear in document contents via
`page_start_numbering_format` and `page_end_numbering_format`.
- **Improved OCR Processing:** Centralized and streamlined image-based
text extraction using `_ocr_reader` and `_async_ocr_reader`.
- **Asynchronous Support:** Asynchronous PDF reading is now consistent
with synchronous logic, including correct page number and OCR handling.
- **Chunking Logic Unchanged:** The chunking of documents (splitting
large documents into smaller chunks) remains functionally unchanged, but
is now called from a shared method.
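
The delegation pattern described above can be sketched roughly as follows. This is a minimal illustration, not the library's code: `BaseReaderSketch`, `FormFeedReaderSketch`, `Doc`, `_extract_pages`, and `_build_documents` are hypothetical names, and splitting on form feeds stands in for real PDF extraction. The point is only that the sync and async entry points share one document-building method, so they cannot drift apart.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a document object."""
    page: int
    content: str


class BaseReaderSketch:
    """Hypothetical base class holding the shared reading logic."""

    def _extract_pages(self, source: str) -> list[str]:
        # Subclasses decide how raw page texts are obtained.
        raise NotImplementedError

    def _build_documents(self, page_texts: list[str]) -> list[Doc]:
        # Shared document creation, used by both the sync and async paths.
        return [
            Doc(page=i, content=text.strip())
            for i, text in enumerate(page_texts, start=1)
        ]

    def read(self, source: str) -> list[Doc]:
        return self._build_documents(self._extract_pages(source))

    async def async_read(self, source: str) -> list[Doc]:
        # Run the blocking extraction in a worker thread, then reuse the
        # same document-building step as the synchronous path.
        page_texts = await asyncio.to_thread(self._extract_pages, source)
        return self._build_documents(page_texts)


class FormFeedReaderSketch(BaseReaderSketch):
    """Toy subclass: treats form feeds as page breaks (assumption)."""

    def _extract_pages(self, source: str) -> list[str]:
        return source.split("\f")


reader = FormFeedReaderSketch()
sync_docs = reader.read("first page\fsecond page")
async_docs = asyncio.run(reader.async_read("first page\fsecond page"))
```

Because both entry points end in `_build_documents`, any later change to page handling applies to the sync and async paths at once.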

- **Page Number Merging Bug Fixed:** Previously, when a page contained
text like "this is chapter 1" and the page number was "2", the resulting
content was "this is chapter 12" (merging the page number with the last
digit of the content). With the new logic, page numbers are detected and
removed or formatted correctly, so this issue is resolved.
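
To make the bug concrete, here is a small sketch of one way such a cleanup could work. It is an assumption-laden illustration, not the actual `_clean_page_numbers` implementation: the helper name `strip_trailing_page_numbers` and the rule "strip only when every page ends with its expected number" are hypothetical.

```python
def strip_trailing_page_numbers(pages: list[str], first_page_nr: int = 1) -> list[str]:
    """Remove a trailing page number from each page text.

    Hypothetical rule: numbers are stripped only if *every* page ends with
    its expected consecutive page number; otherwise the input is returned
    unchanged.
    """
    expected = [str(first_page_nr + i) for i in range(len(pages))]
    stripped = [text.rstrip() for text in pages]
    if not all(text.endswith(nr) for text, nr in zip(stripped, expected)):
        # Numbering missing or inconsistent: leave the content untouched.
        return pages
    return [text[: -len(nr)].rstrip() for text, nr in zip(stripped, expected)]


# Page 2 ends in "1" (content) immediately followed by "2" (page number).
pages = ["intro text1", "this is chapter 12", "more text3"]
cleaned = strip_trailing_page_numbers(pages)

# Pages without consistent numbering pass through as-is.
unnumbered = strip_trailing_page_numbers(["no numbers here", "none here either"])
```

Checking all pages together is what lets the sketch tell "chapter 1" plus page number "2" apart from a genuine "chapter 12": a single page in isolation is ambiguous.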

- The changes are internal refactors and improvements; the public APIs
and main reading workflow remain the same.
- Default behaviors for chunking and document splitting are preserved.
- If no page numbers exist or numbering is inconsistent, content is
processed as before.
- When page numbering is recognized, the page numbers are marked
explicitly by formatting them as `<start page {page_nr}>` and `<end
page {page_nr}>`. These markers can be removed completely, formatted
differently, or set to `{page_nr}` to preserve the old behavior.
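
The marker formats above behave like ordinary Python format strings with a `page_nr` placeholder. The following sketch shows the idea under stated assumptions: `wrap_page` is a hypothetical helper, its parameters only mirror the `page_start_numbering_format` and `page_end_numbering_format` options named earlier, and joining the parts with newlines is a guess, since the source does not say how markers and content are combined.

```python
from typing import Optional


def wrap_page(
    content: str,
    page_nr: int,
    start_format: Optional[str] = "<start page {page_nr}>",
    end_format: Optional[str] = "<end page {page_nr}>",
) -> str:
    """Wrap page content in optional start/end page markers (hypothetical helper)."""
    parts = []
    if start_format:
        parts.append(start_format.format(page_nr=page_nr))
    parts.append(content)
    if end_format:
        parts.append(end_format.format(page_nr=page_nr))
    # Newline joining is an assumption made for this sketch.
    return "\n".join(parts)


marked = wrap_page("this is chapter 1", 2)
bare = wrap_page("this is chapter 1", 2, start_format=None, end_format=None)
number_only = wrap_page("this is chapter 1", 2, start_format=None, end_format="{page_nr}")
```

Passing `None` for both formats drops the markers entirely, while `"{page_nr}"` emits just the number, which corresponds to the options listed in the bullet above.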

- [x] Bug fix
- [x] New feature
- [ ] Breaking change -> No, though I would recommend changing the
default for the new flag "split_on_pages" to False. It currently
defaults to True, which is backward compatible.
- [x] Improvement
- [ ] Model update
- [ ] Other:

---

- [x] Code complies with style guidelines
- [ ] Ran format/validation scripts (`./scripts/format.sh` and
`./scripts/validate.sh`) <- Doesn't work for me? Will try again... [See
here](#3602)
- [x] Self-review completed
- [x] Documentation updated (comments, docstrings)
- [ ] Examples and guides: Relevant cookbook examples have been included
or updated (if applicable)
- [ ] Tested in clean environment
- [x] Tests added/updated (if applicable)

---------

Co-authored-by: Siete Frouws <siete.frouws@bincy.nl>
Co-authored-by: Mustafa Esoofally <coolmusta@gmail.com>
## Summary

Reorder messaging to preserve conversation history. Solves: #3849

## Type of change

- [x] Bug fix
- [x] New feature
- [x] Breaking change
- [ ] Improvement
- [ ] Model update
- [ ] Other:

---

## Checklist

- [x] Code complies with style guidelines
- [x] Ran format/validation scripts (`./scripts/format.sh` and
`./scripts/validate.sh`)
- [ ] Self-review completed
- [ ] Documentation updated (comments, docstrings)
- [ ] Examples and guides: Relevant cookbook examples have been included
or updated (if applicable)
- [x] Tested in clean environment
- [x] Tests added/updated (if applicable)

---------

Co-authored-by: Dirk Brand <dirkbrnd@gmail.com>
## Summary

Describe key changes, mention related issues or motivation for the
changes.

(If applicable, issue number: #\_\_\_\_)

## Type of change

- [x] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Improvement
- [ ] Model update
- [ ] Other:

---

## Checklist

- [ ] Code complies with style guidelines
- [ ] Ran format/validation scripts (`./scripts/format.sh` and
`./scripts/validate.sh`)
- [ ] Self-review completed
- [ ] Documentation updated (comments, docstrings)
- [ ] Examples and guides: Relevant cookbook examples have been included
or updated (if applicable)
- [ ] Tested in clean environment
- [ ] Tests added/updated (if applicable)

---

## Additional Notes

Add any important context (deployment instructions, screenshots,
security considerations, etc.)

---------

Co-authored-by: Dirk Brand <dirkbrnd@gmail.com>
## Summary

Add the missing reranker to pgvector hybrid search.

## Type of change

- [x] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Improvement
- [ ] Model update
- [ ] Other:

---

## Checklist

- [ ] Code complies with style guidelines
- [ ] Ran format/validation scripts (`./scripts/format.sh` and
`./scripts/validate.sh`)
- [ ] Self-review completed
- [ ] Documentation updated (comments, docstrings)
- [ ] Examples and guides: Relevant cookbook examples have been included
or updated (if applicable)
- [ ] Tested in clean environment
- [ ] Tests added/updated (if applicable)


---------

Co-authored-by: Dirk Brand <dirkbrnd@gmail.com>
## Summary

Adds new streaming events when using `output_model`

(If applicable, issue number: #\_\_\_\_)

## Type of change

- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Improvement
- [ ] Model update
- [ ] Other:

---

## Checklist

- [ ] Code complies with style guidelines
- [ ] Ran format/validation scripts (`./scripts/format.sh` and
`./scripts/validate.sh`)
- [ ] Self-review completed
- [ ] Documentation updated (comments, docstrings)
- [ ] Examples and guides: Relevant cookbook examples have been included
or updated (if applicable)
- [ ] Tested in clean environment
- [ ] Tests added/updated (if applicable)

@dirkbrnd dirkbrnd merged commit e98929f into main Aug 8, 2025
3 checks passed
@dirkbrnd dirkbrnd deleted the knowledge/pdf-pw-protected branch August 8, 2025 08:51

7 participants