Add fast sentence detection #6

maziyarpanahi · 2025-10-29T11:59:00Z

Pull Request

Description

Brief description of what this PR does.

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Code refactoring
Performance improvement
Test addition/improvement

Changes Made

Change 1
Change 2
Change 3

Testing

I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have tested this change with different models/inputs

Documentation

I have updated the documentation accordingly
I have added docstrings to new functions/classes
I have updated the CHANGELOG.md

Code Quality

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
My changes generate no new warnings

Dependencies

I have not added any new dependencies
OR I have added new dependencies and they are justified because: ____

Checklist

I have read the contributing guidelines
My commits have clear, descriptive messages
I have squashed/organized my commits appropriately

Related Issues

Closes #(issue number)
Related to #(issue number)

Screenshots/Examples

If applicable, add screenshots or example outputs to help explain your changes.

…pendencies: Enhance the clarity of the requirements section, specify the use of `uv` for installation, and detail the installation process for Hugging Face support and PyTorch, ensuring users have a better understanding of the setup process.

…ntroduce optional parameters for sentence detection, including language and cleaning heuristics, and refactor the function to handle segmented input more effectively. Update CLI to lazily load analyze_text and related functions for improved performance.

… entity predictions by introducing a metadata field, allowing for additional contextual information. Update grouping logic to consider sentence indices and improve JSON output formatting to include metadata attributes, ensuring richer data representation.
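A minimal sketch of the idea in this commit message, assuming an entity record with a free-form metadata dictionary. The names `EntityPrediction`, `group_entities`, `to_json`, the `sentence_index` key, and the `max_gap` parameter are hypothetical and are not taken from the project; the sketch only illustrates grouping that refuses to merge across sentence boundaries and JSON output that carries the metadata along.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class EntityPrediction:
    """Hypothetical entity record with an open-ended metadata field."""
    text: str
    label: str
    start: int
    end: int
    metadata: Dict[str, Any] = field(default_factory=dict)


def group_entities(entities: List[EntityPrediction], max_gap: int = 1) -> List[EntityPrediction]:
    """Merge neighbouring entities that share a label and a sentence index."""
    grouped: List[EntityPrediction] = []
    for ent in sorted(entities, key=lambda e: e.start):
        prev = grouped[-1] if grouped else None
        if (
            prev is not None
            and prev.label == ent.label
            # Assumption: the sentence index is stored under this metadata key.
            and prev.metadata.get("sentence_index") == ent.metadata.get("sentence_index")
            and 0 <= ent.start - prev.end <= max_gap
        ):
            gap = " " * (ent.start - prev.end)
            grouped[-1] = EntityPrediction(
                text=prev.text + gap + ent.text,
                label=prev.label,
                start=prev.start,
                end=ent.end,
                metadata=dict(prev.metadata),
            )
        else:
            grouped.append(ent)
    return grouped


def to_json(entities: List[EntityPrediction]) -> str:
    """Serialize entities, including their metadata, as indented JSON."""
    return json.dumps([asdict(e) for e in entities], indent=2)
```

Keeping the sentence index inside `metadata` rather than as a dedicated field is only one possible design; it keeps the record stable if more contextual attributes are added later.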

…nce segmentation using pySBD, including the SentenceSpan class for representing sentences and their character boundaries. Implement caching for segmenter instances and fallback logic for span generation when character offsets are unavailable.
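The approach described here can be sketched roughly as follows. Only the `SentenceSpan` name, the use of pySBD, the segmenter cache, and the existence of a fallback come from the commit message; the field names, `get_segmenter`, `spans_from_sentences`, `segment`, and the cache size are assumptions made for illustration. pySBD rejects `char_span=True` together with `clean=True`, so the sketch requests offsets only when cleaning is off and otherwise recovers spans by searching the original text.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import List


@dataclass(frozen=True)
class SentenceSpan:
    """A sentence plus its character boundaries in the source text."""
    text: str
    start: int
    end: int


@lru_cache(maxsize=8)
def get_segmenter(language: str = "en", clean: bool = False):
    """Build one pySBD segmenter per (language, clean) pair and reuse it."""
    import pysbd  # imported lazily so the module loads without pySBD installed

    # pySBD raises ValueError if char_span and clean are both True.
    return pysbd.Segmenter(language=language, clean=clean, char_span=not clean)


def spans_from_sentences(text: str, sentences: List[str]) -> List[SentenceSpan]:
    """Fallback: locate each sentence in the text, scanning left to right."""
    spans: List[SentenceSpan] = []
    cursor = 0
    for sentence in sentences:
        stripped = sentence.strip()
        if not stripped:
            continue
        index = text.find(stripped, cursor)
        if index == -1:
            # The sentence was altered (for example by cleaning); skip it here.
            continue
        end = index + len(stripped)
        spans.append(SentenceSpan(stripped, index, end))
        cursor = end
    return spans


def segment(text: str, language: str = "en", clean: bool = False) -> List[SentenceSpan]:
    """Segment text, using pySBD offsets when present and the fallback otherwise."""
    pieces = get_segmenter(language, clean).segment(text)
    if pieces and hasattr(pieces[0], "start"):
        return [SentenceSpan(p.sent, p.start, p.end) for p in pieces]
    return spans_from_sentences(text, list(pieces))
```

Skipping sentences that cannot be located is a simplification; a real implementation might instead approximate their position.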

maziyarpanahi added 11 commits October 20, 2025 22:45

Add pysbd dependency to pyproject.toml: Include pysbd version constraint to ensure compatibility and enhance text processing capabilities in the project.

fc6eb2e

Enhance Hugging Face availability check in ModelLoader: Update exception handling to catch both ImportError and OSError, improving robustness in model loading. Additionally, include sentences module in processing exports for better accessibility.

58680ba
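As an illustration of the availability check this commit describes, here is a minimal sketch. The function name `is_backend_available` and its default argument are assumptions, not the project's actual code. Catching `OSError` alongside `ImportError` covers the case where a package is installed but one of its native shared libraries fails to load.

```python
import importlib


def is_backend_available(module_name: str = "transformers") -> bool:
    """Return True if the optional backend can be imported in this environment."""
    try:
        importlib.import_module(module_name)
    except (ImportError, OSError):
        # ImportError: package missing. OSError: e.g. a broken native library.
        return False
    return True
```

Callers can then degrade gracefully, for example by skipping model loading when the check returns `False`.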

Enhance Hugging Face availability check in tokenization module: Update exception handling to catch both ImportError and OSError, improving robustness in tokenizer initialization.

2709b95

Add clinical note fixtures for testing: Introduce two new clinical note text files, 'clinical_note.txt' and 'long_clinical_note.txt', to enhance test coverage for clinical documentation scenarios, ensuring comprehensive validation of processing functionalities.

828a699

Add integration tests for sentence detection: Introduce slow integration tests in `test_sentence_detection_real.py` to validate sentence detection functionality with real models, ensuring consistent behavior and proper handling of placeholder-only segments.

b22a305

Update __about__.py

92ff9e6

Implement lazy loading for CLI functions: Update openmed.cli to ensure `analyze_text`, `get_model_max_length`, and `list_models` are lazily imported, improving performance and allowing for easier testing by exposing these functions for patching without eager imports.

3f1759b
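One way to get lazily imported yet patchable module attributes is a module-level `__getattr__` (PEP 562). The sketch below is an assumption about how such a mechanism could look, not the code in `openmed.cli`: the helper `make_lazy_module`, the module name `demo_cli`, and the mapping are hypothetical, and standard-library functions stand in for the heavy imports so the example runs on its own.

```python
import importlib
import types
from typing import Dict, Tuple

# Hypothetical mapping: exposed name -> (module to import, attribute to fetch).
# Standard-library targets stand in for the project's heavy imports.
LAZY_TARGETS: Dict[str, Tuple[str, str]] = {
    "analyze_text": ("json", "dumps"),
    "list_models": ("glob", "glob"),
}


def make_lazy_module(name: str, lazy_map: Dict[str, Tuple[str, str]]) -> types.ModuleType:
    """Create a module whose listed attributes are imported on first access."""
    module = types.ModuleType(name)

    def __getattr__(attr: str):
        try:
            target_module, target_attr = lazy_map[attr]
        except KeyError:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}") from None
        value = getattr(importlib.import_module(target_module), target_attr)
        # Cache on the module so later lookups (and test patches) see a real attribute.
        setattr(module, attr, value)
        return value

    # PEP 562: a __getattr__ stored in the module namespace handles missing names.
    module.__getattr__ = __getattr__
    return module


cli = make_lazy_module("demo_cli", LAZY_TARGETS)
```

Because the name resolves through normal attribute access, `unittest.mock.patch.object(cli, "analyze_text", ...)` can replace it in tests, and nothing is imported until the attribute is first touched.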

maziyarpanahi merged commit fcc8149 into master Oct 29, 2025
9 checks passed
