Normalized Semantic Chunker

The Normalized Semantic Chunker processes text documents and splits them into semantically coherent segments while keeping each chunk within an optimal token-size range for downstream NLP tasks. It builds on concepts from YouTube's Advanced Text Splitting for RAG and implementation patterns from LangChain's semantic chunker documentation.

Conventional semantic chunkers prioritize content coherence but often produce chunks with highly variable token counts, which leads to context window overflow and inconsistent retrieval quality in token-sensitive applications such as retrieval-augmented generation (RAG). The Normalized Semantic Chunker addresses these challenges by combining semantic cohesion with statistical guarantees on chunk size: chunks are not only semantically meaningful but also fall within a target token range. The result is more precise and efficient text preparation for embeddings, RAG pipelines, and other NLP applications. Whether you are working with long documents, varied content structures, or token-sensitive architectures, the Normalized Semantic Chunker provides a robust, adaptable solution for text segmentation.

Key Features

  • Adaptive Semantic Chunking: Intelligently splits text based on semantic similarity between consecutive sentences.
  • Precise Chunk Size Control: Advanced algorithm statistically ensures compliance with maximum token limits.
  • Parallel Multi-Percentile Optimization: Efficiently searches for the optimal similarity percentile using parallel processing.
  • Intelligent Small Chunk Management: Automatically merges undersized chunks with their most semantically similar neighbors.
  • Smart Oversized Chunk Handling: Intelligently splits chunks that exceed the maximum token limit while preserving semantic integrity.
  • GPU Acceleration: CUDA-enabled for fast embedding generation using PyTorch.
  • Comprehensive Processing Pipeline: From raw text to optimized chunks in a single workflow.
  • Universal REST API with FastAPI: Modern, high-performance API interface with automatic documentation, data validation, and seamless integration capabilities for any system or language.
  • Docker Integration: Easy deployment with Docker and docker-compose.
  • Adaptive Processing: Adjusts processing parameters based on document size for optimal resource usage.
  • Model Caching: Caches embedding models with timeout for improved performance.
  • Format Support: Handles text (.txt), markdown (.md), and structured JSON (.json) files.
  • Resource Management: Intelligently manages system resources based on available RAM and CPU cores.

Table of Contents

  • How the Text Chunking Algorithm Works
  • Comparison with Traditional Chunking
  • Advantages of the Solution
  • Installation and Deployment
  • Using the API
  • Contributing

How the Text Chunking Algorithm Works

The Pipeline

The core innovation of Normalized Semantic Chunker lies in its multi-step pipeline that combines NLP techniques with statistical optimization to ensure both semantic coherence and size consistency:

  1. The application exposes a simple REST API endpoint where users upload a text document along with parameters for the maximum token limit and embedding model selection.
  2. The text is initially split into sentences using sophisticated regex pattern matching.
  3. Each sentence is transformed into a vector embedding using state-of-the-art transformer models (default: sentence-transformers/all-MiniLM-L6-v2).
  4. The angular similarity between consecutive sentence vectors is calculated (see the sketch after this list).
  5. A parallel search algorithm identifies the optimal percentile of the similarity distribution that respects the specified size constraints.
  6. Chunks are formed by grouping sentences across boundaries identified by the chosen percentile.
  7. A post-processing step identifies chunks that are too small and merges them with their most semantically similar neighbors, ensuring minimum size constraints are met.
  8. A final step splits any remaining chunks that exceed the maximum token limit, prioritizing sentence boundaries.
  9. The application returns a well-structured JSON response containing the chunks, metadata, and performance statistics, ready for immediate integration into production environments.
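
A minimal sketch of steps 2-4, assuming the sentence-transformers package is installed; the splitter regex and function names here are illustrative, not the project's actual implementation. It uses one common definition of angular similarity, 1 - arccos(cosine)/pi:

import re
import numpy as np
from sentence_transformers import SentenceTransformer

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., !, or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def consecutive_angular_similarities(sentences: list[str]) -> np.ndarray:
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    # Unit-normalized embeddings make the dot product equal to cosine similarity.
    emb = model.encode(sentences, normalize_embeddings=True)
    cos = np.sum(emb[:-1] * emb[1:], axis=1).clip(-1.0, 1.0)
    # Map cosine to angular similarity in [0, 1]; dips mark candidate boundaries.
    return 1.0 - np.arccos(cos) / np.pi

text = "Cats purr when content. They also sleep most of the day. GDP rose 2% last quarter."
print(consecutive_angular_similarities(split_sentences(text)))

Sentence pairs on the same topic score high; the drop between the second and third sentences above is the kind of boundary the pipeline splits on.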

Statistical Control of Maximum Tokens Chunk Size

Unlike traditional approaches, Normalized Semantic Chunker uses a sophisticated statistical method to ensure that chunks generally stay below a maximum token limit.

During the percentile search, potential chunkings are evaluated based on an estimate of their 95th percentile token count:

# Calculate the estimated 95th percentile using z-score of 1.645
estimated_95th_percentile = average_tokens + (1.645 * std_dev)
if estimated_95th_percentile <= max_tokens:
    # This percentile is considered valid
    return chunks_with_tokens, percentile, average_tokens

This approach ensures that approximately 95% of the generated chunks respect the specified token limit while automatically handling the few edge cases through a subsequent splitting step.
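
As a worked example with made-up per-chunk token counts, the acceptance check behaves like this:

import statistics

# Hypothetical token counts for one candidate chunking.
token_counts = [410, 380, 450, 395, 430, 470, 360, 440]
average_tokens = statistics.mean(token_counts)   # 416.9
std_dev = statistics.stdev(token_counts)         # ~37.3
# z = 1.645 is the one-sided z-score for the 95th percentile of a normal.
estimated_95th_percentile = average_tokens + 1.645 * std_dev
print(round(estimated_95th_percentile, 1))       # ~478.3 -> valid for max_tokens=512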

Parallel Multi-Core Percentile Search Optimization

The algorithm leverages parallel processing to simultaneously test multiple percentiles, significantly speeding up the search for the optimal splitting point:

with ProcessPoolExecutor(max_workers=max_workers) as executor:
    futures = [
        executor.submit(_process_percentile_range, args)
        for args in process_args
    ]

This parallel implementation allows for quickly finding the best balance between semantic cohesion and adherence to size constraints.
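
A self-contained sketch of the idea, with synthetic data and illustrative names (the real implementation differs in detail): each worker evaluates one candidate percentile by splitting at low-similarity boundaries and applying the size check from the previous section.

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def evaluate_percentile(args):
    # Split where similarity falls below the percentile threshold, then
    # apply the 95th-percentile acceptance check.
    sims, sent_tokens, percentile, max_tokens = args
    threshold = np.percentile(sims, percentile)
    chunk_tokens, current = [], [sent_tokens[0]]
    for sim, toks in zip(sims, sent_tokens[1:]):
        if sim < threshold:
            chunk_tokens.append(sum(current))
            current = []
        current.append(toks)
    chunk_tokens.append(sum(current))
    estimate = np.mean(chunk_tokens) + 1.645 * np.std(chunk_tokens)
    return percentile, estimate <= max_tokens, len(chunk_tokens)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sims = rng.uniform(0.3, 0.9, size=199)        # similarities between 200 sentences
    sent_tokens = rng.integers(10, 40, size=200)  # tokens per sentence
    candidates = [(sims, sent_tokens, p, 512) for p in range(5, 96, 5)]
    with ProcessPoolExecutor() as executor:
        for p, valid, n in executor.map(evaluate_percentile, candidates):
            print(f"percentile {p}: {n} chunks, {'valid' if valid else 'over limit'}")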

Comparison with Traditional Chunking

| Feature | Traditional Chunking | Normalized Semantic Chunker |
|---|---|---|
| Boundary Determination | Fixed rules or token counts | Statistical analysis of semantic similarity distribution |
| Size Control | Often approximate or not guaranteed | Statistical guarantee (e.g., ~95%) + explicit splitting/merging |
| Semantic Cohesion | Can split related concepts | Preserves semantic cohesion via similarity analysis |
| Outlier Handling | Limited or absent | Intelligent merging of small chunks & splitting of large ones |
| Parallelization | Rarely implemented | Built-in parallel multi-core optimization |
| Adaptability | Requires manual parameter tuning | Automatically finds optimal parameters for each document type and size |

Advantages of the Solution

Optimal Preparation for RAG and Semantic Retrieval

Chunks generated by Normalized Semantic Chunker are ideal for Retrieval-Augmented Generation systems:

  • Semantic Coherence: Each chunk contains semantically related information.
  • Balanced Sizes: Chunks adhere to maximum size limits while avoiding excessively small fragments through merging.
  • Representativeness: Each chunk aims to contain a complete and coherent unit of information.

Superior Performance

The parallel implementation and statistical approach offer:

  • Processing Speed: Parallel optimization on multi-core systems.
  • GPU Acceleration: Fast embedding generation using CUDA-enabled PyTorch.
  • Scalability: Efficient handling of large documents with adaptive processing based on document size.
  • Consistent Quality: Predictable and reliable results regardless of text type.
  • Resource Management: Intelligent allocation of CPU cores and memory based on document size and system resources.

Flexibility and Customization

The algorithm adapts automatically to different types of content:

  • Adaptive Parameters: Automatic identification of the best chunking parameters for each document.
  • Configurability: Ability to specify custom maximum token limits (max_tokens) and control small chunk merging.
  • Extensibility: Modular architecture easily extendable with new features.
  • Embedding Model Selection: Switch between different transformer models based on your needs.

Installation and Deployment

Prerequisites

  • Docker and Docker Compose (for Docker deployment)
  • NVIDIA GPU with CUDA support (recommended)
  • NVIDIA Container Toolkit (for GPU passthrough in Docker)
  • Python 3.10-3.12 (3.11 specifically recommended; Python 3.13+ is not supported due to issues with PyTorch and sentence-transformers dependencies)

Getting the Code

Before proceeding with any installation method, clone the repository:

git clone https://github.com/smart-models/Normalized-Semantic-Chunker.git
cd Normalized-Semantic-Chunker

Local Installation with Uvicorn

  1. Create a virtual environment:

    python -m venv .venv
    source .venv/bin/activate  # On Linux/Mac

    For Windows users:

    • Using Command Prompt:
    .venv\Scripts\activate.bat
    • Using PowerShell:
    # If you encounter execution policy restrictions, run this once per session:
    Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope Process
    
    # Then activate the virtual environment:
    .venv\Scripts\Activate.ps1

    Note: PowerShell's default security settings may prevent script execution. The above command temporarily allows scripts for the current session only, which is safer than changing system-wide settings.

  2. Install dependencies:

    pip install -r requirements.txt

    Note: For GPU support, ensure you install the correct PyTorch version:

    pip install --extra-index-url https://download.pytorch.org/whl/cu121 torch==2.1.1+cu121
  3. Run the FastAPI server:

    uvicorn normalized_semantic_chunker:app --reload
  4. The API will be available at http://localhost:8000.

    Access the API documentation and interactive testing interface at http://localhost:8000/docs.

Docker Deployment (Recommended)

  1. Create required directories for persistent storage:

    # Linux/macOS
    mkdir -p models logs
    
    # Windows CMD
    mkdir models
    mkdir logs
    
    # Windows PowerShell
    New-Item -ItemType Directory -Path models -Force
    New-Item -ItemType Directory -Path logs -Force
    # Or with PowerShell alias
    mkdir -Force models, logs
  2. Deploy with Docker Compose:

    CPU-only deployment (default, works on all machines):

    cd docker
    docker compose --profile cpu up -d

    GPU-accelerated deployment (requires NVIDIA GPU and drivers):

    cd docker
    docker compose --profile gpu up -d

    Stopping the service:

    Important: Always match the --profile flag between your up and down commands:

    # To stop CPU deployment
    docker compose --profile cpu down
    
    # To stop GPU deployment
    docker compose --profile gpu down

    This ensures Docker Compose correctly identifies and manages the specific set of containers you intended to control.

    Note: The GPU-accelerated deployment requires an NVIDIA GPU with appropriate drivers installed. If you don't have an NVIDIA GPU, use the CPU-only deployment.

  3. The API will be available at http://localhost:8000.

    Access the API documentation and interactive testing interface at http://localhost:8000/docs.

Using the API

API Endpoints

  • POST /normalized_semantic_chunker/
    Chunks a text document into semantically coherent segments while controlling token size.

    Parameters:

    • file: The text file to be chunked (supports .txt, .md, and .json formats)
    • max_tokens: Maximum token count per chunk (integer, required)
    • model: Embedding model to use for semantic analysis (string, default: sentence-transformers/all-MiniLM-L6-v2)
    • merge_small_chunks: Whether to merge undersized chunks (boolean, default: true)
    • verbosity: Show detailed logs (boolean, default: false)

    Response: Returns a JSON object containing:

    • chunks: Array of text segments with their token counts and IDs
    • metadata: Processing statistics including chunk count, token statistics, percentile used, model name, and processing time

    JSON Input Format: When using JSON files as input, the expected structure is:

    {
      "chunks": [
        {
          "text": "First chunk of text content...",
          "metadata_field": "Additional metadata is allowed..."
        },
        {
          "text": "Second chunk of text content...",
          "id": 12345
        },
        ...
      ]
    }

    The service processes each text chunk individually: it maintains the chunk boundaries provided in your JSON file and applies semantic chunking within those boundaries as needed. Additional metadata fields beyond text are allowed; they are ignored during processing, so you can include any extra information you need and the file will still be processed correctly (see the sketch after the endpoint list).

  • GET /
    Health check endpoint that returns service status, GPU availability, and API version.
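
Referring back to the JSON input format above, here is a minimal sketch that builds a file in the expected structure and submits it to the endpoint; the file name and extra metadata fields are arbitrary:

import json
import requests

# Build a minimal input file in the expected format; extra fields are ignored.
payload = {
    "chunks": [
        {"text": "First chunk of text content...", "source": "intro.md"},
        {"text": "Second chunk of text content...", "id": 12345},
    ]
}
with open("input.json", "w", encoding="utf-8") as f:
    json.dump(payload, f, ensure_ascii=False)

# Submit it just like a .txt file; max_tokens is the only required parameter.
with open("input.json", "rb") as f:
    response = requests.post(
        "http://localhost:8000/normalized_semantic_chunker/",
        params={"max_tokens": 512},
        files={"file": ("input.json", f, "application/json")},
    )
print(response.json()["metadata"])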

Example API Call using cURL

# Basic usage with required parameters
curl -X POST "http://localhost:8000/normalized_semantic_chunker/?max_tokens=512" \
  -F "file=@document.txt" 

# With all parameters specified
curl -X POST "http://localhost:8000/normalized_semantic_chunker/?max_tokens=512&model=sentence-transformers/all-MiniLM-L6-v2&merge_small_chunks=true&verbosity=false" \
  -F "file=@document.txt" \
  -H "accept: application/json"

# Health check endpoint
curl http://localhost:8000/

Example API Call using Python

import requests
import json

# Replace with your actual API endpoint if hosted elsewhere
api_url = 'http://localhost:8000/normalized_semantic_chunker/'
file_path = 'document.txt' # Your input text file
max_tokens_per_chunk = 512
# model_name = "sentence-transformers/all-MiniLM-L6-v2" # Optional: specify a different model
merge_small_chunks = True  # Whether to merge undersized chunks with semantically similar neighbors
verbosity = False  # Whether to show detailed logs

try:
    with open(file_path, 'rb') as f:
        files = {'file': (file_path, f, 'text/plain')}
        params = {
            'max_tokens': max_tokens_per_chunk,
            'merge_small_chunks': merge_small_chunks,
            'verbosity': verbosity
        }
        # if model_name: # Uncomment to specify a model
        #     params['model'] = model_name

        response = requests.post(api_url, files=files, params=params)
        response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

        result = response.json()

        print(f"Successfully chunked document into {result['metadata']['n_chunks']} chunks.")
        # Save the response to a file
        output_file = 'response.json'
        # print("Metadata:", result['metadata'])
        # print("First chunk:", result['chunks'][0])
        with open(output_file, 'w', encoding='utf-8') as outfile:
            json.dump(result, outfile, indent=4, ensure_ascii=False)
        print(f"Response saved to {output_file}")

except FileNotFoundError:
    print(f"Error: File not found at {file_path}")
except requests.exceptions.RequestException as e:
    print(f"API Request failed: {e}")
    if e.response is not None:
        print("Error details:", e.response.text)
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Response Format

A successful chunking operation returns a ChunkingResult object:

{
  "chunks": [
    {
      "text": "This is the first chunk of text...",
      "token_count": 480,
      "id": 1
    },
    {
      "text": "This is the second chunk...",
      "token_count": 505,
      "id": 2
    },
    {
      "text": "Additional chunks would appear here...",
      "token_count": 490,
      "id": 3
    }
  ],
  "metadata": {
    "n_chunks": 42,
    "avg_tokens": 495,
    "max_tokens": 510,
    "min_tokens": 150,
    "percentile": 85,
    "embedder_model": "sentence-transformers/all-MiniLM-L6-v2",
    "processing_time": 15.78
  }
}
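
For example, the response.json file saved by the Python example above can be inspected like this:

import json

with open("response.json", encoding="utf-8") as f:
    result = json.load(f)

meta = result["metadata"]
print(f"{meta['n_chunks']} chunks, avg {meta['avg_tokens']} tokens, "
      f"percentile {meta['percentile']}, {meta['processing_time']}s")

# Preview the first few chunks with their ids and token counts.
for chunk in result["chunks"][:3]:
    print(f"[{chunk['id']}] ({chunk['token_count']} tokens) {chunk['text'][:60]}...")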

Contributing

The Normalized Semantic Chunker is an open-source project that thrives on community contributions. Your involvement is not just welcome; it's fundamental to the project's growth, innovation, and long-term success.

Whether you're fixing bugs, improving documentation, adding new features, or sharing ideas, every contribution helps build a better tool for everyone. We believe in the power of collaborative development and welcome participants of all skill levels.

If you're interested in contributing:

  1. Fork the repository
  2. Create a development environment with all dependencies
  3. Make your changes
  4. Add tests if applicable
  5. Ensure all tests pass
  6. Submit a pull request

Happy Semantic Chunking!