Collaborator

@aksg87 aksg87 commented Aug 7, 2025

Description

Fix chunking bug that created empty intervals when newlines fall at chunk boundaries.

The ChunkIterator created empty TokenInterval objects (start_index == end_index) when a newline coincided with a chunk boundary, which raised a ValueError. This change adds a check that ensures every interval is non-empty.
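
To illustrate the failure mode, here is a minimal sketch of a newline-aware chunker. It is not the langextract implementation: `Interval`, `chunk_intervals`, `starts_line`, and `max_chars` are hypothetical names made up for this example, and the real ChunkIterator logic may differ. The sketch assumes the chunker prefers to cut at the most recent newline once the character budget overflows, and shows why a newline on the first token of a chunk must not count as a cut point.

```python
from dataclasses import dataclass
from typing import Iterator, Sequence


@dataclass(frozen=True)
class Interval:
    """Half-open token range [start_index, end_index). Hypothetical stand-in."""

    start_index: int
    end_index: int

    def __post_init__(self) -> None:
        # Mirrors the reported symptom: an empty interval is rejected.
        if self.start_index >= self.end_index:
            raise ValueError(
                f"empty interval: {self.start_index}..{self.end_index}"
            )


def chunk_intervals(
    tokens: Sequence[str],
    starts_line: Sequence[bool],
    max_chars: int,
) -> Iterator[Interval]:
    """Yield non-empty token intervals, preferring to cut at newlines.

    starts_line[i] is True when token i is the first token after a newline.
    """
    start = 0
    while start < len(tokens):
        end = start
        used = 0
        newline_cut = None
        while end < len(tokens):
            # The guard: a newline at the chunk's own first token is not a
            # valid cut point, because cutting there gives start == end.
            if starts_line[end] and end > start:
                newline_cut = end
            cost = len(tokens[end])
            # Always accept the first token so every chunk makes progress.
            if end > start and used + cost > max_chars:
                break
            used += cost
            end += 1
        else:
            # Reached the end of input without overflow: take everything.
            newline_cut = None
        cut = newline_cut if newline_cut is not None else end
        yield Interval(start, cut)
        start = cut
```

With `tokens = ["aaaa", "bbbb", "cccc"]`, `starts_line = [False, True, False]`, and `max_chars = 4`, the second chunk starts exactly on the newline. Without the `end > start` condition, that newline would be recorded as the cut point for the chunk it begins, and the sketch would try to build `Interval(1, 1)` and raise the ValueError described above.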

Fixes #71

Bug fix

How Has This Been Tested?

$ python -m pytest tests/chunking_test.py -v
# All 18 tests pass including new regression test
$ python -m pytest tests/
# All 161 tests pass

Checklist:

  • I have read and acknowledged Google's Open Source Code of conduct.
  • I have read the Contributing page, and I either signed the Google Individual CLA or am covered by my company's Corporate CLA.
  • I have discussed my proposed solution with code owners in the linked issue(s) and we have agreed upon the general approach.
  • I have made any needed documentation changes, or noted in the linked issue(s) that documentation elsewhere needs updating.
  • I have added tests, or I have ensured existing tests cover the changes
  • I have followed Google's Python Style Guide and ran pylint over the affected code.

- Fix empty interval bug when newline falls at chunk boundary (issue #71)
- Add concise comment explaining the fix logic
- Remove excessive/obvious comments from chunking tests
- Improve test docstring to be more descriptive and professional
@github-actions github-actions bot added the size/XS Pull request with less than 50 lines changed label Aug 7, 2025
@aksg87 aksg87 self-assigned this Aug 7, 2025
github-actions bot commented Aug 7, 2025

⚠️ Branch Update Required

Your branch is 17 commits behind main. Please update your branch to ensure CI checks run with the latest code:

git fetch origin main
git merge origin/main
git push

Note: Enable "Allow edits by maintainers" to allow automatic updates.

@aksg87 aksg87 merged commit ea71754 into main Aug 7, 2025
13 checks passed
@aksg87 aksg87 deleted the fix-chunking-empty-intervals-issue-71 branch August 7, 2025 07:56
aksg87 added a commit that referenced this pull request Aug 21, 2025
sinnaj pushed a commit to sinnaj/langextract that referenced this pull request Sep 3, 2025
aksg87 added a commit that referenced this pull request Sep 12, 2025
Development

Successfully merging this pull request may close these issues.

Can langextract be used to extract desired text from html?
