A modular library for segmenting text into sentences and paragraphs based on character-level features.
- Character-level text segmentation
- Support for sentence and paragraph boundaries
- Character span extraction for precise text positions
- Customizable window sizes for context
- Support for abbreviations
- Highly optimized performance (up to 280,000 characters/second)
- Secure model serialization with skops
- Optional ONNX model conversion and inference
pip install charboundary
Install with additional dependencies:
# With NumPy support for faster processing
pip install charboundary[numpy]
# With ONNX support for model conversion and faster inference
pip install charboundary[onnx]
# With all optional dependencies
pip install charboundary[numpy,onnx]
CharBoundary comes with pre-trained models of different sizes:
- Small - Small footprint (5 token context window, 32 trees) - Processes ~85,000 characters/second
- Medium - Default, best performance (7 token context window, 64 trees) - Processes ~280,000 characters/second
- Large - Most accurate (9 token context window, 256 trees) - Processes ~175,000 characters/second
Package Distribution: To keep the package size reasonable:
- Included in the package: Only the small model (~1MB)
- Downloaded on demand: Medium (~2MB) and large (~6MB) models are automatically downloaded from GitHub when first used
The download happens automatically and transparently the first time you use functions like get_default_segmenter() or get_large_segmenter(), if the models aren't already available locally.
For maximum performance, install the ONNX dependencies and use the optimized models:
# Install ONNX support
pip install charboundary[onnx]
# Use ONNX-accelerated model with recommended optimization level
from charboundary import get_large_onnx_segmenter
# Get optimized model (downloads automatically if needed)
segmenter = get_large_onnx_segmenter()
# Process text at maximum speed
sentences = segmenter.segment_to_sentences(your_text)
The ONNX acceleration provides up to 2.1x performance improvement with zero impact on result quality. ONNX models are now XZ-compressed by default (60% smaller files). See examples/onnx_optimization_simple.py for a complete benchmark example.
from charboundary import get_default_segmenter
# Get the pre-trained medium-sized segmenter (default)
segmenter = get_default_segmenter()
# Segment text into sentences and paragraphs
text = """
Employee also specifically and forever releases the Acme Inc. (Company) and the Company Parties (except where and
to the extent that such a release is expressly prohibited or made void by law) from any claims based on unlawful
employment discrimination or harassment, including, but not limited to, the Federal Age Discrimination in
Employment Act (29 U.S.C. § 621 et. seq.). This release does not include Employee’s right to indemnification,
and related insurance coverage, under Sec. 7.1.4 or Ex. 1-1 of the Employment Agreement, his right to equity awards,
or continued exercise, pursuant to the terms of any specific equity award (or similar) agreement between
Employee and the Company nor to Employee’s right to benefits under any Company plan or program in which
Employee participated and is due a benefit in accordance with the terms of the plan or program as of the Effective
Date and ending at 11:59 p.m. Eastern Time on Sep. 15, 2013.
"""
# Get list of sentences (with default threshold)
sentences = segmenter.segment_to_sentences(text)
print("\n-\n".join(sentences))
Output
Employee also specifically and forever releases the Acme Inc. (Company) and the Company Parties (except where and
to the extent that such a release is expressly prohibited or made void by law) from any claims based on unlawful
employment discrimination or harassment, including, but not limited to, the Federal Age Discrimination in
Employment Act (29 U.S.C. § 621 et. seq.).
-
This release does not include Employee’s right to indemnification,
and related insurance coverage, under Sec. 7.1.4 or Ex. 1-1 of the Employment Agreement, his right to equity awards,
or continued exercise, pursuant to the terms of any specific equity award (or similar) agreement between
Employee and the Company nor to Employee’s right to benefits under any Company plan or program in which
Employee participated and is due a benefit in accordance with the terms of the plan or program as of the Effective
Date and ending at 11:59 p.m. Eastern Time on Sep. 15, 2013.
# Control segmentation sensitivity with threshold parameter
# Lower threshold = more aggressive segmentation (higher recall)
high_recall_sentences = segmenter.segment_to_sentences(text, threshold=0.1)
# Higher threshold = conservative segmentation (higher precision)
high_precision_sentences = segmenter.segment_to_sentences(text, threshold=0.9)
print(len(high_recall_sentences), len(high_precision_sentences))
# Output: 2 1
You can also choose a specific model size based on your needs:
from charboundary import get_small_segmenter, get_large_segmenter
# For faster processing with smaller memory footprint
small_segmenter = get_small_segmenter()
# For highest accuracy (but larger memory footprint)
large_segmenter = get_large_segmenter()
The models are optimized for handling:
- Quotation marks in the middle or at the end of sentences
- Common abbreviations (including legal abbreviations)
- Legal citations (e.g., "Brown v. Board of Education, 347 U.S. 483 (1954)")
- Multi-line quotes
- Enumerated lists (partial support)
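For instance, a quick check along these lines (an illustrative snippet; the exact splits depend on the model and threshold you choose) shows a citation with internal periods staying in one sentence:
from charboundary import get_default_segmenter

segmenter = get_default_segmenter()
text = "The court cited Brown v. Board of Education, 347 U.S. 483 (1954). The ruling was unanimous."
for sentence in segmenter.segment_to_sentences(text):
    print(sentence)
# Expected: the internal periods in the citation do not split the first sentence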
CharBoundary can provide exact character spans (start and end positions) for each segment:
from charboundary import get_default_segmenter
segmenter = get_default_segmenter()
text = "This is a sample text. It has multiple sentences. Here's the third one."
# Get sentences with their character spans
sentences_with_spans = segmenter.segment_to_sentences_with_spans(text)
for sentence, (start, end) in sentences_with_spans:
    print(f"Span ({start}-{end}): {sentence}")
    # Verify the span points to the correct text
    assert text[start:end].strip() == sentence.strip()
# Get only the spans without the text
spans = segmenter.get_sentence_spans(text)
print(f"Spans: {spans}") # [(0, 22), (22, 46), (46, 67)]
# The spans cover EVERY character in the input
assert sum(end - start for start, end in spans) == len(text)
# Do the same for paragraphs
paragraph_spans = segmenter.get_paragraph_spans(text)
See the complete example in examples/character_spans_example.py.
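Continuing the snippet above, the paragraph spans can be sliced back into the original string the same way as sentence spans:
for start, end in paragraph_spans:
    print(f"Paragraph ({start}-{end}): {text[start:end]!r}")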
from charboundary import TextSegmenter
# Create a segmenter (will be initialized with default parameters)
segmenter = TextSegmenter()
# Train the model on sample data
training_data = [
    "This is a sentence.<|sentence|> This is another sentence.<|sentence|><|paragraph|>",
    "This is a new paragraph.<|sentence|> It has multiple sentences.<|sentence|><|paragraph|>"
]
segmenter.train(data=training_data)
# Segment text into sentences and paragraphs
text = "Hello, world! This is a test. This is another sentence."
segmented_text = segmenter.segment_text(text)
print(segmented_text)
# Output: "Hello, world!<|sentence|> This is a test.<|sentence|> This is another sentence.<|sentence|>"
# Get list of sentences
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# Output: ["Hello, world!", "This is a test.", "This is another sentence."]
CharBoundary uses skops for secure model serialization. This provides better security than pickle for sharing and loading models.
# Train a model
segmenter = TextSegmenter()
segmenter.train(data=training_data)
# Save the model with skops
segmenter.save("model.skops", format="skops")
# Load a model with security checks (default)
# This will reject loading custom types for security
segmenter = TextSegmenter.load("model.skops", use_skops=True)
# Load a model with trusted types enabled
# Only use this with models from trusted sources
segmenter = TextSegmenter.load("model.skops", use_skops=True, trust_model=True)
- When loading models from untrusted sources, avoid setting trust_model=True
- When loading fails with untrusted types, skops will list the untrusted types that need to be approved
- The library will fall back to pickle if skops loading fails, but this is less secure
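A defensive loading pattern might look like the following sketch; the broad exception handling and the decision to retry with trust_model=True are illustrative choices, not library requirements:
from charboundary import TextSegmenter

try:
    # Default: secure load that rejects untrusted custom types
    segmenter = TextSegmenter.load("model.skops", use_skops=True)
except Exception as err:
    # skops reports the untrusted types it refused to load
    print(f"Secure load failed: {err}")
    # Only retry with trust_model=True if the model comes from a source you trust
    segmenter = TextSegmenter.load("model.skops", use_skops=True, trust_model=True)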
When the onnx optional dependency is installed, you can convert and use models in ONNX format for faster inference:
from charboundary import TextSegmenter, get_default_segmenter, get_medium_onnx_segmenter
from charboundary.models import create_model
from charboundary.segmenters import SegmenterConfig
# Option 1: Get a segmenter with ONNX already enabled (downloads model if needed)
segmenter = get_medium_onnx_segmenter()
sentences = segmenter.segment_to_sentences(text)
# Option 2: Create a model with ONNX support enabled and optimization level
model = create_model(
    model_type="random_forest",
    use_onnx=True,
    onnx_optimization_level=2  # Use extended optimizations (level 0-3)
)
# Option 3: Create a segmenter with ONNX configuration
segmenter = TextSegmenter(
    config=SegmenterConfig(
        use_onnx=True,
        onnx_optimization_level=2  # Extended optimizations
    )
)
# Option 4: Convert an existing model to ONNX
segmenter = get_default_segmenter()
segmenter.model.to_onnx() # Converts the model to ONNX format
# Save the ONNX model to a file (now with XZ compression by default)
segmenter.model.save_onnx("model.onnx") # Creates model.onnx.xz by default
segmenter.model.save_onnx("model.onnx", compress=False) # No compression
# Load an ONNX model with specified optimization level (handles compressed files automatically)
new_segmenter = get_default_segmenter()
new_segmenter.model.load_onnx("model.onnx") # Works with both model.onnx or model.onnx.xz
new_segmenter.model.enable_onnx(True, optimization_level=2) # Enable ONNX with extended optimizations
# Run inference with the ONNX model
sentences = new_segmenter.segment_to_sentences(text)
All models (including ONNX versions) can be automatically downloaded from the GitHub repository if they're not available locally:
# These functions download models from GitHub if not found locally
from charboundary import (
    get_small_onnx_segmenter,   # Small model with ONNX (~5MB, included in package)
    get_medium_onnx_segmenter,  # Medium model with ONNX (~33MB, downloaded on demand)
    get_large_onnx_segmenter    # Large model with ONNX (~188MB, downloaded on demand)
)
# Explicitly download an ONNX model
from charboundary import download_onnx_model
download_onnx_model("large", force=True) # Force redownload even if exists
ONNX Package Distribution: To keep the package size reasonable:
- Included in the package: Only the small ONNX model (~2MB, XZ compressed)
- Downloaded on demand: Medium (~13MB) and large (~75MB) XZ compressed ONNX models
See the examples/remote_onnx_example.py file for a complete example of remote model usage.
ONNX models can provide significant performance improvements, especially for inference:
- Faster prediction speeds (up to 2-5x speedup)
- Better hardware acceleration support
- Easier deployment in production environments
- Smaller memory footprint
See the examples/onnx_model_example.py file for a complete example of ONNX usage.
The package includes a comprehensive utility script for working with ONNX models:
scripts/onnx_utils.py: A unified tool for converting, benchmarking, and testing ONNX models
# Convert all built-in models to ONNX with optimal optimization levels
python scripts/onnx_utils.py convert --all-models
# Convert a specific built-in model to ONNX with a custom optimization level
python scripts/onnx_utils.py convert --model-name small --optimization-level 2
# Convert a custom model file to ONNX
python scripts/onnx_utils.py convert --input-file path/to/model.skops.xz --output-file path/to/model.onnx
# Benchmark all built-in models with optimal optimization levels
python scripts/onnx_utils.py benchmark --all-models
# Benchmark a specific built-in model with all optimization levels
python scripts/onnx_utils.py benchmark --model-name medium
# Test ONNX model functionality and accuracy
python scripts/onnx_utils.py test --all-models
See scripts/SCRIPT_USAGE.md for more detailed examples and output logs of each command.
These scripts are particularly useful for converting pre-trained models to ONNX format and evaluating the performance benefits.
CharBoundary supports ONNX optimization levels to further improve performance without any impact on prediction quality:
Level | Description | Speed Boost | Recommended For |
---|---|---|---|
0 | No optimization | Baseline | Debugging only |
1 | Basic optimizations | 5-10% | Memory-constrained environments |
2 | Extended optimizations | 10-20% | Most use cases (recommended) |
3 | All optimizations | 15-25% | Large models, high-performance computing |
Key Benefits:
- Same prediction results as scikit-learn across all levels
- Easy to configure with a single parameter
- Safe to use in production (Level 2 recommended)
How to use optimization levels:
# Method 1: When creating a model
from charboundary.models import create_model
model = create_model(
    model_type="random_forest",
    use_onnx=True,
    onnx_optimization_level=2  # Extended optimizations recommended
)
# Method 2: When creating a segmenter
from charboundary import TextSegmenter
from charboundary.segmenters import SegmenterConfig
segmenter = TextSegmenter(
    config=SegmenterConfig(
        use_onnx=True,
        onnx_optimization_level=2  # Extended optimizations recommended
    )
)
# Method 3: When getting a pre-built model
from charboundary import get_medium_onnx_segmenter
segmenter = get_medium_onnx_segmenter()
segmenter.model.enable_onnx(True, optimization_level=2)
# Method 4: Setting it on an existing model
segmenter.model.enable_onnx(True, optimization_level=2)
The large model benefits the most from optimization level 3, while small and medium models perform best with level 2.
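For instance, applying that guidance with the loaders and enable_onnx() call shown above (a minimal sketch):
from charboundary import get_large_onnx_segmenter, get_medium_onnx_segmenter

# Large model: level 3 tends to pay off
large_segmenter = get_large_onnx_segmenter()
large_segmenter.model.enable_onnx(True, optimization_level=3)

# Small/medium models: level 2 is usually the sweet spot
medium_segmenter = get_medium_onnx_segmenter()
medium_segmenter.model.enable_onnx(True, optimization_level=2)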
For a complete runnable example that demonstrates all optimization levels, see examples/onnx_optimization_simple.py.
The ONNX conversion with optimization levels provides significant performance improvements:
Model | scikit-learn (chars/sec) | ONNX Level 1 (chars/sec) | ONNX Level 2/3 (chars/sec) | Total Speedup |
---|---|---|---|---|
Small | 544,678 | 572,176 | 600,853 (Level 2) | 1.10x |
Medium | 293,897 | 345,157 | 379,672 (Level 2) | 1.29x |
Large | 179,998 | 301,539 | 376,923 (Level 3) | 2.09x |
Key observations:
- Larger models benefit more from optimization (up to 2.09x faster)
- Level 2 is optimal for small/medium models
- Level 3 provides best performance for large models
- All optimization levels preserve exact prediction quality
These performance improvements come with zero impact on accuracy or prediction quality - the results remain identical to the original scikit-learn models.
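If you want to confirm the speedup and the identical output on your own text, a minimal comparison along these lines can help; the timing helper and sample text are illustrative, while segment_to_sentences(), to_onnx(), and enable_onnx() are the calls shown earlier:
import time
from charboundary import get_default_segmenter

segmenter = get_default_segmenter()
sample = "This is a test. " * 5000  # roughly 80,000 characters of toy input

def chars_per_second(seg, text, runs=5):
    start = time.perf_counter()
    for _ in range(runs):
        sentences = seg.segment_to_sentences(text)
    elapsed = time.perf_counter() - start
    return len(text) * runs / elapsed, sentences

# Baseline: plain scikit-learn inference
sk_speed, sk_sentences = chars_per_second(segmenter, sample)

# Convert to ONNX and enable it with extended optimizations (see Option 4 above)
segmenter.model.to_onnx()
segmenter.model.enable_onnx(True, optimization_level=2)
onnx_speed, onnx_sentences = chars_per_second(segmenter, sample)

print(f"scikit-learn: {sk_speed:,.0f} chars/sec")
print(f"ONNX level 2: {onnx_speed:,.0f} chars/sec")
assert sk_sentences == onnx_sentences  # identical predictions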
You can customize the segmenter with various parameters:
from charboundary.segmenters import TextSegmenter, SegmenterConfig
config = SegmenterConfig(
    # Model configuration
    model_type="random_forest",   # Type of model to use
    model_params={                # Parameters for the model
        "n_estimators": 100,
        "max_depth": 16,
        "class_weight": "balanced"
    },
    threshold=0.5,                # Classification threshold (0.0-1.0)
                                  # Lower=more sentences (recall), Higher=fewer sentences (precision)
    # Window configuration
    left_window=3,                # Size of left context window
    right_window=3,               # Size of right context window
    # Domain knowledge
    abbreviations=["Dr.", "Mr.", "Mrs.", "Ms."],  # Custom abbreviations
    # Performance settings
    use_numpy=True,               # NumPy for faster processing
    cache_size=1024,              # Cache size for character encoding
    num_workers=4,                # Number of worker processes
    # ONNX acceleration (2-5x faster inference)
    use_onnx=True,                # Enable ONNX inference if available
    onnx_optimization_level=2     # Optimization level (recommended=2):
                                  # 0=None/debug, 1=Basic, 2=Extended, 3=Maximum
)
segmenter = TextSegmenter(config=config)
The CharBoundary library includes sophisticated feature engineering tailored for text segmentation. These features help the model distinguish between actual sentence boundaries and other characters that may appear similar (like periods in abbreviations or quotes in the middle of sentences).
Key features include:
- Quotation Handling:
  - Quote balance tracking (detecting matched pairs of quotes)
  - Word completion detection for quotes
  - Multi-line quote recognition
- List and Enumeration Detection:
  - Recognition of enumerated list items ((1), (2), (a), (b), etc.)
  - Detection of list introductions (colons, phrases like "as follows:")
  - Special handling for semicolons in list structures
- Abbreviation Detection:
  - Comprehensive lists of common and domain-specific abbreviations
  - Legal abbreviations and citations
- Contextual Analysis:
  - Distinction between primary terminators (., !, ?) and secondary terminators (", ', :, ;)
  - Detection of lowercase letters following potential terminators
  - Analysis of surrounding context for sentence boundaries
These features enable the model to make intelligent decisions about text segmentation, particularly for complex cases like legal documents, technical texts, and documents with complex structure.
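To make the idea concrete, here is a purely illustrative, self-contained sketch of a few such character-level features (quote balance, terminator type, abbreviation lookup); it is a toy version of the concepts above, not the library's actual feature code:
PRIMARY_TERMINATORS = frozenset(".!?")
SECONDARY_TERMINATORS = frozenset("\"':;")
ABBREVIATIONS = {"Mr.", "Dr.", "U.S.C.", "Sec.", "Ex."}

def toy_features(text: str, i: int) -> dict:
    """Toy character-level features around position i (illustration only)."""
    ch = text[i]
    words_so_far = text[:i + 1].split()
    prev_word = words_so_far[-1] if words_so_far else ""
    next_char = text[i + 1] if i + 1 < len(text) else ""
    return {
        "is_primary_terminator": ch in PRIMARY_TERMINATORS,
        "is_secondary_terminator": ch in SECONDARY_TERMINATORS,
        # Unbalanced quotes suggest we are still inside a quotation
        "quote_balance_even": text[:i + 1].count('"') % 2 == 0,
        # A known abbreviation ending here is usually not a sentence boundary
        "ends_abbreviation": prev_word in ABBREVIATIONS,
        # A lowercase letter after the terminator argues against a boundary
        "next_is_lower": next_char.islower(),
    }

print(toy_features("He cited 29 U.S.C. today.", 17))  # the '.' closing "U.S.C."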
# Get current abbreviations
abbrevs = segmenter.get_abbreviations()
# Add new abbreviations
segmenter.add_abbreviation("Ph.D")
# Remove abbreviations
segmenter.remove_abbreviation("Dr.")
# Set a new list of abbreviations
segmenter.set_abbreviations(["Dr.", "Mr.", "Prof.", "Ph.D."])
CharBoundary provides a command-line interface for common operations:
# Get help for all commands
charboundary --help
# Get help for a specific command
charboundary analyze --help
charboundary train --help
charboundary best-model --help
Process text using a trained model:
# Analyze with default annotated output
charboundary analyze --model charboundary/resources/small_model.skops.xz --input input.txt
# Output sentences (one per line)
charboundary analyze --model charboundary/resources/small_model.skops.xz --input input.txt --format sentences
# Output paragraphs
charboundary analyze --model charboundary/resources/small_model.skops.xz --input input.txt --format paragraphs
# Save output to a file and generate metrics
charboundary analyze --model charboundary/resources/small_model.skops.xz --input input.txt --output segmented.txt --metrics metrics.json
# Adjust segmentation sensitivity using threshold
charboundary analyze --model charboundary/resources/small_model.skops.xz --input input.txt --threshold 0.3 # More sensitive (higher recall)
charboundary analyze --model charboundary/resources/small_model.skops.xz --input input.txt --threshold 0.5 # Default balance
charboundary analyze --model charboundary/resources/small_model.skops.xz --input input.txt --threshold 0.8 # More conservative (higher precision)
The threshold parameter lets you control the trade-off between Type I errors (false positives) and Type II errors (false negatives):
# Create a test file
echo "The plaintiff, Mr. Brown vs. Johnson Corp., argued that patent no. 12345 was infringed. Dr. Smith provided expert testimony on Feb. 2nd." > legal_text.txt
# Low threshold (0.2) - High recall, more boundaries detected
charboundary analyze --model charboundary/resources/small_model.skops.xz --input legal_text.txt --format sentences --threshold 0.2
Output with low threshold (0.2):
The plaintiff, Mr.
Brown vs.
Johnson Corp.
, argued that patent no.
12345 was infringed.
Dr.
Smith provided expert testimony on Feb.
2nd.
# Default threshold (0.5) - Balanced approach
charboundary analyze --model charboundary/resources/small_model.skops.xz --input legal_text.txt --format sentences --threshold 0.5
Output with default threshold (0.5):
The plaintiff, Mr. Brown vs. Johnson Corp., argued that patent no. 12345 was infringed.
Dr. Smith provided expert testimony on Feb.
2nd.
# High threshold (0.8) - High precision, only confident boundaries
charboundary analyze --model charboundary/resources/small_model.skops.xz --input legal_text.txt --format sentences --threshold 0.8
Output with high threshold (0.8):
The plaintiff, Mr. Brown vs. Johnson Corp., argued that patent no. 12345 was infringed.
Dr. Smith provided expert testimony on Feb. 2nd.
Train a custom model on annotated data:
# Train with default parameters
charboundary train --data training_data.txt --output model.skops
# Train with custom parameters
charboundary train --data training_data.txt --output model.skops \
--left-window 4 --right-window 6 --n-estimators 100 --max-depth 16 \
--sample-rate 0.1 --max-samples 10000 --threshold 0.5 --metrics-file train_metrics.json
# Train with feature selection to improve performance
charboundary train --data training_data.txt --output model.skops \
--use-feature-selection --feature-selection-threshold 0.01 --max-features 50
Training data should contain annotated text with <|sentence|> and <|paragraph|> markers.
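As a sketch of what producing that annotated format in Python can look like (the annotate() helper below is hypothetical; only TextSegmenter.train(), save(), and the marker strings come from the library):
from charboundary import TextSegmenter

def annotate(paragraph_sentences):
    """Join sentences with <|sentence|> markers and close the paragraph."""
    return "<|sentence|> ".join(paragraph_sentences) + "<|sentence|><|paragraph|>"

training_data = [
    annotate(["First sentence.", "Second sentence."]),
    annotate(["A new paragraph starts here.", "It also has two sentences."]),
]

segmenter = TextSegmenter()
segmenter.train(data=training_data)
segmenter.save("custom_model.skops", format="skops")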
The library supports automatic feature selection during training, which can improve both accuracy and inference speed:
- Basic Feature Selection: Use --use-feature-selection to enable automatic feature selection
- Threshold Selection: Set the importance threshold with --feature-selection-threshold (default: 0.01)
- Maximum Features: Limit the number of features with --max-features
Feature selection works in two stages:
- First, it trains an initial model to identify feature importance
- Then, it filters out less important features and retrains using only the selected features
This can significantly reduce model complexity while maintaining or even improving accuracy, especially for deployment on resource-constrained environments.
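Conceptually, the two-stage procedure resembles this scikit-learn sketch (an illustration of the idea, not the library's internal code):
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def two_stage_train(X, y, importance_threshold=0.01, max_features=50):
    # Stage 1: train an initial model to estimate feature importance
    initial = RandomForestClassifier(n_estimators=100).fit(X, y)
    ranked = np.argsort(initial.feature_importances_)[::-1]
    keep = [i for i in ranked if initial.feature_importances_[i] >= importance_threshold]
    keep = keep[:max_features]

    # Stage 2: retrain using only the selected features
    final = RandomForestClassifier(n_estimators=100).fit(X[:, keep], y)
    return final, keep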
Find the best model parameters by training multiple models:
# Find best model with default parameter ranges
charboundary best-model --data training_data.txt --output best_model.skops
# Customize parameter search space
charboundary best-model --data training_data.txt --output best_model.skops \
--left-window-values 3 5 7 --right-window-values 3 5 7 \
--n-estimators-values 50 100 200 --max-depth-values 8 16 24 \
--threshold-values 0.3 0.5 0.7 --sample-rate 0.1 --max-samples 10000
# Use validation data for model selection
charboundary best-model --data training_data.txt --output best_model.skops \
--validation validation_data.txt --metrics-file best_metrics.json
The CLI can be installed using either pip or pipx:
# Install globally as an application
pipx install charboundary
# Or use within a project
pip install charboundary
The library includes a profiling script to identify performance bottlenecks:
# Profile all operations (training, inference, model loading)
python scripts/benchmark/profile_model.py --mode all
# Profile just the training process
python scripts/benchmark/profile_model.py --mode train --samples 500
# Profile just the inference process
python scripts/benchmark/profile_model.py --mode inference --iterations 200
# Profile model loading
python scripts/benchmark/profile_model.py --mode load --model charboundary/resources/medium_model.skops.xz
# Save profiling results to a file
python scripts/benchmark/profile_model.py --output profile_results.txt
The library is highly optimized for performance while maintaining accuracy. The medium model offers the best balance of speed and accuracy:
Performance Results:
Documents processed: 1
Total characters: 991,000
Total sentences found: 2,000
Processing time: 3.53 seconds
Processing speed: 280,000 characters/second
Average sentence length: 495.5 characters
Key optimizations:
- Direct file path handling for resource loading
- Frozensets for character testing
- Pre-allocated arrays with proper types
- Character n-gram caching
- Pattern hash for common text patterns
- Prediction caching for repeated segments
You can run the benchmark tests yourself with the included scripts:
python scripts/test/test_small_model.py
python scripts/test/test_medium_model.py
python scripts/test/test_large_model.py
See CHANGELOG.md for the project history and version details.
MIT