The Normalized Semantic Chunker processes text documents and splits them into semantically coherent segments while keeping each chunk within an optimal token range for downstream NLP tasks. It builds on concepts from YouTube's Advanced Text Splitting for RAG and on implementation patterns from LangChain's semantic chunker documentation.

Conventional semantic chunkers prioritize content coherence but often produce chunks with highly variable token counts, leading to context window overflow and inconsistent retrieval quality in token-sensitive applications such as retrieval-augmented generation (RAG). The Normalized Semantic Chunker addresses these issues by combining semantic cohesion with statistical guarantees on chunk size: chunks are not only semantically meaningful but also fall within an optimal token-count range. This enables more precise and efficient text preparation for embeddings, RAG pipelines, and other NLP applications. Whether you are working with long documents, varied content structures, or token-sensitive architectures, the Normalized Semantic Chunker provides a robust, adaptable solution for text segmentation.
- Adaptive Semantic Chunking: Intelligently splits text based on semantic similarity between consecutive sentences.
- Precise Chunk Size Control: Advanced algorithm statistically ensures compliance with maximum token limits.
- Parallel Multi-Percentile Optimization: Efficiently searches for the optimal similarity percentile using parallel processing.
- Intelligent Small Chunk Management: Automatically merges undersized chunks with their most semantically similar neighbors.
- Smart Oversized Chunk Handling: Intelligently splits chunks that exceed token threshold limits while preserving semantic integrity.
- GPU Acceleration: CUDA-enabled for fast embedding generation using PyTorch.
- Comprehensive Processing Pipeline: From raw text to optimized chunks in a single workflow.
- Universal REST API with FastAPI: Modern, high-performance API interface with automatic documentation, data validation, and seamless integration capabilities for any system or language.
- Docker Integration: Easy deployment with Docker and docker-compose.
- Adaptive Processing: Adjusts processing parameters based on document size for optimal resource usage.
- Model Caching: Caches embedding models with timeout for improved performance.
- Format Support: Handles text (.txt), markdown (.md), and structured JSON (.json) files.
- Resource Management: Intelligently manages system resources based on available RAM and CPU cores.
- How the Text Chunking Algorithm Works
- Advantages of the Solution
- Installation and Deployment
- Using the API
- Contributing
The core innovation of Normalized Semantic Chunker lies in its multi-step pipeline that combines NLP techniques with statistical optimization to ensure both semantic coherence and size consistency:
- The application exposes a simple REST API endpoint where users can upload a text document together with parameters such as the maximum token limit and the embedding model.
- The text is initially split into sentences using sophisticated regex pattern matching.
- Each sentence is transformed into a vector embedding using state-of-the-art transformer models (default: `sentence-transformers/all-MiniLM-L6-v2`).
- The angular similarity between consecutive sentence vectors is calculated (a simplified sketch of the splitting, embedding, and similarity steps follows this list).
- A parallel search algorithm identifies the optimal percentile of the similarity distribution that respects the specified size constraints.
- Chunks are formed by grouping sentences across boundaries identified by the chosen percentile.
- A post-processing step identifies undersized chunks and merges each with its most semantically similar neighbor, while still ensuring the size constraints are met.
- A final step splits any remaining chunks that exceed the maximum token limit, prioritizing sentence boundaries.
- The application returns a well-structured JSON response containing the chunks, metadata, and performance statistics, ready for immediate integration into production environments.
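To make the splitting, embedding, and similarity steps above more concrete, here is a minimal, self-contained sketch. It is not the service's actual implementation: the regex, the angular-similarity formula (1 - arccos(cosine)/pi), and the fixed percentile threshold are simplified assumptions used purely for illustration.

```python
import re

import numpy as np
from sentence_transformers import SentenceTransformer

text = (
    "Semantic chunking groups related sentences together. "
    "It relies on sentence embeddings to measure topical similarity. "
    "Meanwhile, an unrelated topic should start a new chunk. "
    "For example, the weather today is sunny and warm."
)

# Naive sentence splitting (the service uses a more sophisticated regex)
sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

# Embed each sentence with the service's default model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(sentences, normalize_embeddings=True)

# Angular similarity between consecutive sentence vectors
cos_sim = np.sum(embeddings[:-1] * embeddings[1:], axis=1).clip(-1.0, 1.0)
ang_sim = 1.0 - np.arccos(cos_sim) / np.pi

# Split where similarity falls below a chosen percentile of its distribution
threshold = np.percentile(ang_sim, 25)  # illustrative value, not the optimized percentile
chunks, current = [], [sentences[0]]
for sentence, sim in zip(sentences[1:], ang_sim):
    if sim < threshold:
        chunks.append(" ".join(current))
        current = [sentence]
    else:
        current.append(sentence)
chunks.append(" ".join(current))

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk}")
```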
Unlike traditional approaches, the Normalized Semantic Chunker uses a statistical criterion to ensure that the vast majority of chunks stay below the maximum token limit.
During the percentile search, potential chunkings are evaluated based on an estimate of their 95th percentile token count:
```python
# Calculate the estimated 95th percentile using z-score of 1.645
estimated_95th_percentile = average_tokens + (1.645 * std_dev)
if estimated_95th_percentile <= max_tokens:
    # This percentile is considered valid
    return chunks_with_tokens, percentile, average_tokens
```
This approach ensures that approximately 95% of the generated chunks respect the specified token limit while automatically handling the few edge cases through a subsequent splitting step.
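As a quick numeric illustration of this criterion (the token counts below are made up for the example), the same check can be reproduced on a list of chunk sizes:

```python
import statistics

# Hypothetical token counts for the chunks produced at one candidate percentile
chunk_token_counts = [410, 380, 455, 300, 470, 395, 425, 360]
max_tokens = 512

average_tokens = statistics.mean(chunk_token_counts)
std_dev = statistics.stdev(chunk_token_counts)

# z = 1.645 is the one-sided 95th-percentile point of a normal distribution
estimated_95th_percentile = average_tokens + (1.645 * std_dev)

print(f"avg={average_tokens:.1f}, std={std_dev:.1f}, "
      f"estimated 95th percentile={estimated_95th_percentile:.1f}")
print("valid percentile" if estimated_95th_percentile <= max_tokens else "limit exceeded")
```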
The algorithm leverages parallel processing to simultaneously test multiple percentiles, significantly speeding up the search for the optimal splitting point:
```python
with ProcessPoolExecutor(max_workers=max_workers) as executor:
    futures = [
        executor.submit(_process_percentile_range, args)
        for args in process_args
    ]
```
This parallel implementation allows for quickly finding the best balance between semantic cohesion and adherence to size constraints.
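The snippet below is a self-contained sketch of that idea, not the project's actual code: the worker function, the candidate chunkings, and the way valid percentiles are reported are all assumptions made purely for illustration.

```python
from concurrent.futures import ProcessPoolExecutor
import statistics


def evaluate_percentile(args):
    """Hypothetical worker: check one candidate percentile against the token limit."""
    percentile, token_counts, max_tokens = args
    estimate = statistics.mean(token_counts) + 1.645 * statistics.stdev(token_counts)
    return percentile, estimate <= max_tokens


if __name__ == "__main__":
    max_tokens = 512
    # Hypothetical mapping from candidate percentile to the chunk token counts it yields
    candidates = {
        80: [510, 560, 540, 525, 550],
        85: [420, 470, 455, 430, 445],
        90: [350, 400, 390, 370, 410],
    }
    process_args = [(p, counts, max_tokens) for p, counts in candidates.items()]

    with ProcessPoolExecutor(max_workers=3) as executor:
        results = list(executor.map(evaluate_percentile, process_args))

    valid = [p for p, ok in results if ok]
    # The real implementation applies its own selection criterion; here we just report candidates
    print("valid percentiles:", valid)
```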
| Feature | Traditional Chunking | Normalized Semantic Chunker |
|---|---|---|
| Boundary Determination | Fixed rules or token counts | Statistical analysis of semantic similarity distribution |
| Size Control | Often approximate or not guaranteed | Statistical guarantee (e.g., ~95%) + explicit splitting/merging |
| Semantic Cohesion | Can split related concepts | Preserves semantic cohesion via similarity analysis |
| Outlier Handling | Limited or absent | Intelligent merging of small chunks & splitting of large ones |
| Parallelization | Rarely implemented | Built-in parallel multi-core optimization |
| Adaptability | Requires manual parameter tuning | Automatically finds optimal parameters for each document type and size |
Chunks generated by Normalized Semantic Chunker are ideal for Retrieval-Augmented Generation systems:
- Semantic Coherence: Each chunk contains semantically related information.
- Balanced Sizes: Chunks adhere to maximum size limits while avoiding excessively small fragments through merging.
- Representativeness: Each chunk aims to contain a complete and coherent unit of information.
The parallel implementation and statistical approach offer:
- Processing Speed: Parallel optimization on multi-core systems.
- GPU Acceleration: Fast embedding generation using CUDA-enabled PyTorch.
- Scalability: Efficient handling of large documents with adaptive processing based on document size.
- Consistent Quality: Predictable and reliable results regardless of text type.
- Resource Management: Intelligent allocation of CPU cores and memory based on document size and system resources.
The algorithm adapts automatically to different types of content:
- Adaptive Parameters: Automatic identification of the best chunking parameters for each document.
- Configurability: Ability to specify custom maximum token limits (max_tokens) and control small chunk merging.
- Extensibility: Modular architecture easily extendable with new features.
- Embedding Model Selection: Switch between different transformer models based on your needs.
- Docker and Docker Compose (for Docker deployment)
- NVIDIA GPU with CUDA support (recommended)
- NVIDIA Container Toolkit (for GPU passthrough in Docker)
- Python 3.10-3.12 (Python 3.11 specifically recommended, Python 3.13+ not supported due to issues with PyTorch and sentence-transformers dependencies)
Before proceeding with any installation method, clone the repository:
```bash
git clone https://github.com/smart-models/Normalized-Semantic-Chunker.git
cd Normalized-Semantic-Chunker
```
- Create a virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Linux/Mac
  ```

  For Windows users:

  - Using Command Prompt:

    ```cmd
    .venv\Scripts\activate.bat
    ```

  - Using PowerShell:

    ```powershell
    # If you encounter execution policy restrictions, run this once per session:
    Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope Process
    # Then activate the virtual environment:
    .venv\Scripts\Activate.ps1
    ```

  Note: PowerShell's default security settings may prevent script execution. The command above temporarily allows scripts for the current session only, which is safer than changing system-wide settings.
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  Note: For GPU support, ensure you install the correct PyTorch version:

  ```bash
  pip install --extra-index-url https://download.pytorch.org/whl/cu121 torch==2.1.1+cu121
  ```
- Run the FastAPI server:

  ```bash
  uvicorn normalized_semantic_chunker:app --reload
  ```
- The API will be available at `http://localhost:8000`. Access the API documentation and interactive testing interface at `http://localhost:8000/docs`.
- Create required directories for persistent storage:

  ```bash
  # Linux/macOS
  mkdir -p models logs

  # Windows CMD
  mkdir models
  mkdir logs

  # Windows PowerShell
  New-Item -ItemType Directory -Path models -Force
  New-Item -ItemType Directory -Path logs -Force
  # Or with the PowerShell alias
  mkdir -Force models, logs
  ```
- Deploy with Docker Compose:

  CPU-only deployment (default, works on all machines):

  ```bash
  cd docker
  docker compose --profile cpu up -d
  ```

  GPU-accelerated deployment (requires NVIDIA GPU and drivers):

  ```bash
  cd docker
  docker compose --profile gpu up -d
  ```

  Stopping the service:

  Important: Always match the `--profile` flag between your `up` and `down` commands:

  ```bash
  # To stop CPU deployment
  docker compose --profile cpu down

  # To stop GPU deployment
  docker compose --profile gpu down
  ```

  This ensures Docker Compose correctly identifies and manages the specific set of containers you intended to control.

  Note: The GPU-accelerated deployment requires an NVIDIA GPU with appropriate drivers installed. If you don't have an NVIDIA GPU, use the CPU-only deployment.
- The API will be available at `http://localhost:8000`. Access the API documentation and interactive testing interface at `http://localhost:8000/docs`.
- POST `/normalized_semantic_chunker/`

  Chunks a text document into semantically coherent segments while controlling token size.

  Parameters:

  - `file`: The text file to be chunked (supports .txt, .md, and .json formats)
  - `max_tokens`: Maximum token count per chunk (integer, required)
  - `model`: Embedding model to use for semantic analysis (string, default: `sentence-transformers/all-MiniLM-L6-v2`)
  - `merge_small_chunks`: Whether to merge undersized chunks (boolean, default: `true`)
  - `verbosity`: Show detailed logs (boolean, default: `false`)

  Response: Returns a JSON object containing:

  - `chunks`: Array of text segments with their token counts and IDs
  - `metadata`: Processing statistics including chunk count, token statistics, percentile used, model name, and processing time

  JSON Input Format: When using JSON files as input, the expected structure is:

  ```json
  {
    "chunks": [
      {
        "text": "First chunk of text content...",
        "metadata_field": "Additional metadata is allowed..."
      },
      {
        "text": "Second chunk of text content...",
        "id": 12345
      },
      ...
    ]
  }
  ```

  The service processes each text chunk individually, maintaining the chunk boundaries provided in your JSON file, then applies semantic chunking within those boundaries as needed. Fields other than `text` are allowed and are ignored during processing, so you can include any extra information you need while the JSON is still processed correctly. (A short sketch of building such an input file follows this endpoint list.)

- GET `/`

  Health check endpoint that returns service status, GPU availability, and API version.
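To illustrate the JSON input format, a pre-chunked input file could be produced as follows; the file name `prechunked.json` and the chunk texts are hypothetical, used purely for illustration:

```python
import json

# Hypothetical pre-chunked content; fields other than "text" are ignored by the service
payload = {
    "chunks": [
        {"text": "Introduction section of the document...", "source": "intro"},
        {"text": "Methods section of the document...", "id": 2},
    ]
}

with open("prechunked.json", "w", encoding="utf-8") as f:
    json.dump(payload, f, ensure_ascii=False, indent=2)

# prechunked.json can then be uploaded to the POST endpoint exactly like
# a .txt or .md file (see the curl and Python examples below).
```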
```bash
# Basic usage with required parameters
curl -X POST "http://localhost:8000/normalized_semantic_chunker/?max_tokens=512" \
  -F "file=@document.txt"

# With all parameters specified
curl -X POST "http://localhost:8000/normalized_semantic_chunker/?max_tokens=512&model=sentence-transformers/all-MiniLM-L6-v2&merge_small_chunks=true&verbosity=false" \
  -F "file=@document.txt" \
  -H "accept: application/json"

# Health check endpoint
curl http://localhost:8000/
```
```python
import requests
import json

# Replace with your actual API endpoint if hosted elsewhere
api_url = 'http://localhost:8000/normalized_semantic_chunker/'
file_path = 'document.txt'  # Your input text file
max_tokens_per_chunk = 512
# model_name = "sentence-transformers/all-MiniLM-L6-v2"  # Optional: specify a different model
merge_small_chunks = True  # Whether to merge undersized chunks with semantically similar neighbors
verbosity = False  # Whether to show detailed logs

try:
    with open(file_path, 'rb') as f:
        files = {'file': (file_path, f, 'text/plain')}
        params = {
            'max_tokens': max_tokens_per_chunk,
            'merge_small_chunks': merge_small_chunks,
            'verbosity': verbosity
        }
        # if model_name:  # Uncomment to specify a model
        #     params['model'] = model_name
        response = requests.post(api_url, files=files, params=params)
        response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

        result = response.json()
        print(f"Successfully chunked document into {result['metadata']['n_chunks']} chunks.")

        # Save the response to a file
        output_file = 'response.json'
        # print("Metadata:", result['metadata'])
        # print("First chunk:", result['chunks'][0])
        with open(output_file, 'w', encoding='utf-8') as outfile:
            json.dump(result, outfile, indent=4, ensure_ascii=False)
        print(f"Response saved to {output_file}")

except FileNotFoundError:
    print(f"Error: File not found at {file_path}")
except requests.exceptions.RequestException as e:
    print(f"API Request failed: {e}")
    if e.response is not None:
        print("Error details:", e.response.text)
except Exception as e:
    print(f"An unexpected error occurred: {e}")
```
A successful chunking operation returns a `ChunkingResult` object:
```json
{
  "chunks": [
    {
      "text": "This is the first chunk of text...",
      "token_count": 480,
      "id": 1
    },
    {
      "text": "This is the second chunk...",
      "token_count": 505,
      "id": 2
    },
    {
      "text": "Additional chunks would appear here...",
      "token_count": 490,
      "id": 3
    }
  ],
  "metadata": {
    "n_chunks": 42,
    "avg_tokens": 495,
    "max_tokens": 510,
    "min_tokens": 150,
    "percentile": 85,
    "embedder_model": "sentence-transformers/all-MiniLM-L6-v2",
    "processing_time": 15.78
  }
}
```
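Building on the Python client example above, the saved `response.json` can be consumed directly; the snippet below simply iterates over the fields documented in this response format:

```python
import json

# Load the response saved by the Python client example above
with open('response.json', encoding='utf-8') as f:
    result = json.load(f)

meta = result['metadata']
print(f"{meta['n_chunks']} chunks, avg {meta['avg_tokens']} tokens, "
      f"percentile {meta['percentile']}, model {meta['embedder_model']}")

# Each chunk carries its text, token count, and id
for chunk in result['chunks'][:3]:
    print(chunk['id'], chunk['token_count'], chunk['text'][:60])
```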
The Normalized Semantic Chunker is an open-source project that thrives on community contributions. Your involvement is not just welcome, it's fundamental to the project's growth, innovation, and long-term success.
Whether you're fixing bugs, improving documentation, adding new features, or sharing ideas, every contribution helps build a better tool for everyone. We believe in the power of collaborative development and welcome participants of all skill levels.
If you're interested in contributing:
- Fork the repository
- Create a development environment with all dependencies
- Make your changes
- Add tests if applicable
- Ensure all tests pass
- Submit a pull request
Happy Semantic Chunking!