word-tally

Tallies the number of times each word appears in one or more Unicode input sources, using ICU4X for word boundary detection. Use word-tally as a command-line tool, or use WordTally through the Rust library interface.

Unless an I/O mode is specified, a reasonable strategy is automatically selected based on input type:

Files: Parallel memory-mapped I/O for seekable files
Pipes, sockets, stdin, other: Parallel streaming I/O for data streams
Character/block devices: Sequential streaming I/O for device compatibility
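
The selection rules above can be sketched with the standard library alone. This is only an illustration of the decision table, not the crate's actual code: the `Strategy` enum and `select_strategy` function are hypothetical names, and the device checks rely on the Unix-only `FileTypeExt` trait.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::FileTypeExt; // Unix-only: adds is_char_device / is_block_device

/// Hypothetical mirror of the I/O modes listed above (not the crate's own type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strategy {
    ParallelMmap,
    ParallelStream,
    Stream,
}

/// Pick a strategy from the kind of input that `path` refers to.
fn select_strategy(path: &str) -> io::Result<Strategy> {
    // "-" names stdin, which this sketch always treats as a data stream.
    if path == "-" {
        return Ok(Strategy::ParallelStream);
    }
    let file_type = fs::metadata(path)?.file_type();
    let strategy = if file_type.is_file() {
        // Regular files are seekable, so they can be memory-mapped.
        Strategy::ParallelMmap
    } else if file_type.is_char_device() || file_type.is_block_device() {
        // Devices get sequential streaming for compatibility.
        Strategy::Stream
    } else {
        // FIFOs, sockets and anything else are streamed in parallel.
        Strategy::ParallelStream
    };
    Ok(strategy)
}
```

Calling `select_strategy("/dev/null")` on a Unix system would take the device branch, while a path to a regular file would take the memory-mapped branch.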

All parallel modes use SIMD-accelerated chunk boundary detection. Memory mapping requires seekable file descriptors and won't work with stdin, pipes or devices.
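
To illustrate what chunk boundary detection has to achieve, here is a minimal scalar sketch that splits a byte buffer into chunks of roughly a target size while never cutting a line in half. It assumes a newline is an acceptable boundary, which the text above does not state, and it uses a plain byte scan where a real implementation would use a vectorized search. The `chunk_ranges` name is hypothetical.

```rust
use std::ops::Range;

/// Split `data` into byte ranges of at least `target` bytes, extending each
/// range to just past the next newline so that no line spans two chunks.
fn chunk_ranges(data: &[u8], target: usize) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < data.len() {
        // Tentative cut point, clamped to the end of the buffer.
        let tentative = (start + target).min(data.len());
        // Scalar stand-in for a SIMD search: find the next newline.
        let end = match data[tentative..].iter().position(|&byte| byte == b'\n') {
            Some(offset) => tentative + offset + 1,
            None => data.len(),
        };
        ranges.push(start..end);
        start = end;
    }
    ranges
}
```

Each returned range can then be handed to a separate worker, since every chunk starts at the beginning of a line.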

Usage

Usage: word-tally [OPTIONS] [PATHS]...

Arguments:
  [PATHS]...  File paths to use as input (use "-" for stdin) [default: -]

Options:
  -I, --io <STRATEGY>            I/O strategy [default: auto] [possible values: auto, parallel-mmap, parallel-in-memory, parallel-stream, stream]
  -c, --case <FORMAT>            Case normalization [default: original] [possible values: original, upper, lower]
  -s, --sort <ORDER>             Sort order [default: desc] [possible values: desc, asc, unsorted]
  -m, --min-chars <COUNT>        Exclude words containing fewer than min chars
  -n, --min-count <COUNT>        Exclude words appearing fewer than min times
  -w, --exclude-words <WORDS>    Exclude words from a comma-delimited list
  -i, --include <PATTERNS>       Include only words matching a regex pattern
  -x, --exclude <PATTERNS>       Exclude words matching a regex pattern
  -f, --format <FORMAT>          Output format [default: text] [possible values: text, json, csv]
  -d, --field-delimiter <VALUE>  Delimiter between field and value [default: " "] (text format only)
  -D, --entry-delimiter <VALUE>  Delimiter between entries [default: "\n"] (text format only)
  -o, --output <PATH>            Write output to file rather than stdout
  -v, --verbose                  Print verbose details
  -h, --help                     Print help (see more with '--help')
  -V, --version                  Print version

Installation

cargo install word-tally

Examples

I/O strategies

Choose an I/O strategy based on your performance and memory requirements:

# Default: Auto-selection of a reasonable I/O strategy depending on input type
echo "tally me" | word-tally # Parallel streamed I/O for stdin
word-tally file.txt          # Parallel memory-mapped I/O for regular files

# Sequential streamed I/O with minimal memory usage
word-tally --io=stream file.txt

# Parallel streamed I/O
word-tally --io=parallel-stream file.txt

# Parallel memory-mapped I/O
word-tally --io=parallel-mmap file.txt

# Parallel processing with the input fully loaded into memory
word-tally --io=parallel-in-memory file.txt

Additional features:

# Process multiple files
word-tally file1.txt file2.txt file3.txt

# Mix stdin and files
cat header.txt | word-tally - body.txt footer.txt

Note: Memory mapping (parallel-mmap) requires seekable files and cannot be used with stdin or pipes.
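
As a rough picture of why sequential streaming keeps memory usage low, the sketch below tallies words from any buffered reader one line at a time, so only the current line and the tally itself are held in memory. It is not the crate's implementation: `tally_stream` is a hypothetical name, and splitting on whitespace stands in for ICU4X word segmentation.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// Tally words from a reader line by line, without loading the whole input.
fn tally_stream<R: BufRead>(reader: R) -> io::Result<HashMap<String, usize>> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for line in reader.lines() {
        let line = line?;
        // Whitespace splitting is a simplification of real word segmentation.
        for word in line.split_whitespace() {
            *counts.entry(word.to_string()).or_insert(0) += 1;
        }
    }
    Ok(counts)
}
```

The same function accepts a locked stdin handle, a `BufReader` around a file, or an in-memory `Cursor`, which is what makes a streaming mode usable with pipes and devices.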

Output formats

Text (default)

# Write to file instead of stdout
word-tally README.md --output=tally.txt

# Custom delimiter between word and count
word-tally README.md --field-delimiter=": " --output=tally.txt

# Custom delimiter between entries (e.g., comma-separated)
word-tally README.md --field-delimiter=": " --entry-delimiter=", " --output=tally.txt

# Pipe to other tools
word-tally README.md | head -n10

Custom delimiters

# Tab-separated values without escaping
word-tally --field-delimiter="\t" README.md > tally.tsv

# Custom delimiters
word-tally --field-delimiter="|" --entry-delimiter=";" README.md
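
The effect of the two delimiters on text output can be sketched as a simple join. This is an assumption-laden illustration: `render_text` is a hypothetical helper, and whether the real tool emits a trailing entry delimiter is not stated here, so the sketch emits none.

```rust
/// Join (word, count) entries using a field delimiter between word and count
/// and an entry delimiter between entries.
fn render_text(entries: &[(&str, usize)], field_delimiter: &str, entry_delimiter: &str) -> String {
    entries
        .iter()
        .map(|(word, count)| format!("{word}{field_delimiter}{count}"))
        .collect::<Vec<_>>()
        .join(entry_delimiter)
}
```

With the defaults of `" "` and `"\n"` this yields one `word count` pair per line, and with `": "` and `", "` it yields a single comma-separated line.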

CSV

# CSV with proper escaping and headers
word-tally --format=csv README.md > tally.csv

JSON

word-tally --format=json --output="tally.json" README.md

Visualization

Convert JSON output for visualization with d3-cloud:

word-tally --format=json README.md | jq 'map({text: .[0], value: .[1]})' > d3-cloud.json

Format and pipe the JSON output to the wordcloud_cli to produce an image:

word-tally --format=json README.md | jq -r 'map(.[0] + " ") | join(" ")' | wordcloud_cli --imagefile wordcloud.png

Case normalization

# Convert to lowercase
word-tally --case=lower file.txt

# Preserve original case
word-tally --case=original file.txt

# Convert all to uppercase
word-tally --case=upper file.txt
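
Case normalization changes which words are treated as the same entry. The sketch below shows the idea with a local `Case` enum and a `tally_with_case` helper, both hypothetical and unrelated to the crate's own types; whitespace splitting again stands in for real word segmentation.

```rust
use std::collections::HashMap;

/// Local stand-in for the three case options listed above.
#[derive(Debug, Clone, Copy)]
enum Case {
    Original,
    Upper,
    Lower,
}

/// Normalize a single word according to the chosen case option.
fn normalize(word: &str, case: Case) -> String {
    match case {
        Case::Original => word.to_string(),
        Case::Upper => word.to_uppercase(),
        Case::Lower => word.to_lowercase(),
    }
}

/// Tally words after normalization, so differently cased forms can merge.
fn tally_with_case(text: &str, case: Case) -> HashMap<String, usize> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for word in text.split_whitespace() {
        *counts.entry(normalize(word, case)).or_insert(0) += 1;
    }
    counts
}
```

For the input `The the THE`, the original case keeps three separate entries, while lower or upper case merges them into a single entry with a count of 3.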

Sorting options

# Sort by frequency (descending, default)
word-tally --sort=desc file.txt

# Sort by frequency (ascending)
word-tally --sort=asc file.txt

# No sorting (entries keep the order in which words were seen)
word-tally --sort=unsorted file.txt

Filtering words

# Only include words that appear at least 10 times
word-tally --min-count=10 file.txt

# Exclude words with fewer than 5 characters
word-tally --min-chars=5 file.txt

# Exclude words by pattern
word-tally --exclude="^a.*" --exclude="^the$" file.txt

# Combining include and exclude patterns
word-tally --include="^w.*" --include=".*o$" --exclude="^who$" file.txt

# Exclude specific words
word-tally --exclude-words="the,a,an,and,or,but" file.txt
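
The non-regex filters can be pictured as predicates applied to a finished tally. The following is a sketch under assumptions: `apply_filters` is a hypothetical helper, word length is measured in Unicode scalar values (the tool may count differently), and pattern filters are left out because the standard library has no regex support.

```rust
use std::collections::{HashMap, HashSet};

/// Keep only entries that meet the minimum length, the minimum count,
/// and are not in the excluded word list.
fn apply_filters(
    counts: &HashMap<String, usize>,
    min_chars: usize,
    min_count: usize,
    exclude_words: &[&str],
) -> HashMap<String, usize> {
    let excluded: HashSet<&str> = exclude_words.iter().copied().collect();
    counts
        .iter()
        .filter(|(word, count)| {
            // Length in Unicode scalar values, not bytes.
            word.chars().count() >= min_chars
                && **count >= min_count
                && !excluded.contains(word.as_str())
        })
        .map(|(word, count)| (word.clone(), *count))
        .collect()
}
```

Passing `0` for either minimum disables that filter, and an empty slice disables the exclusion list.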

Verbose output

echo "fe fi fi fo fo fo" | word-tally --verbose
#>> source -
#>> total-words 6
#>> unique-words 3
#>> delimiter " "
#>> entry-delimiter "\n"
#>> case original
#>> order desc
#>> io parallel-stream
#>> min-chars none
#>> min-count none
#>> exclude-words none
#>> exclude-patterns none
#>> include-patterns none
#>>
#>> fo 3
#>> fi 2
#>> fe 1
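
The `total-words` and `unique-words` figures in the verbose output relate to the tally in a simple way, sketched below with a hypothetical `summarize` helper: the total is the sum of all counts and the unique figure is the number of entries.

```rust
use std::collections::HashMap;

/// Return (total words, unique words) for a tally.
fn summarize(counts: &HashMap<String, usize>) -> (usize, usize) {
    let total: usize = counts.values().sum();
    (total, counts.len())
}
```

For the tally of `fe fi fi fo fo fo` shown above, this gives a total of 6 and 3 unique words.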

Environment variables

The following environment variables tune the I/O strategy, memory allocation and parallel processing:

I/O and processing strategy configuration:

WORD_TALLY_IO - I/O strategy (default: parallel-stream, options: stream, parallel-stream, parallel-in-memory, parallel-mmap)

Memory allocation and performance:

WORD_TALLY_UNIQUENESS_RATIO - Ratio of total words to unique words, used to estimate initial capacity. Higher values allocate less initial memory. Books tend to have a ratio near 10:1; the default of 32:1 is a balance chosen for better performance (default: 32)
WORD_TALLY_WORDS_PER_KB - Estimated words per KB of text for capacity calculation (default: 128, max: 512)
WORD_TALLY_STDIN_BUFFER_SIZE - Buffer size for stdin when size cannot be determined (default: 262144)

Parallel processing configuration:

WORD_TALLY_THREADS - Number of threads for parallel processing (default: all available cores)
WORD_TALLY_CHUNK_SIZE - Size of chunks for parallel processing in bytes (default: 65536)
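
One plausible way these settings fit together is sketched below. It is a guess at the arithmetic rather than the crate's formula: the helper names are hypothetical, invalid or missing values are assumed to fall back to the default, and capacity is assumed to be estimated words divided by the uniqueness ratio.

```rust
/// Parse an optional raw value, falling back to `default` when it is
/// missing or not a valid unsigned integer (an assumed policy).
fn parse_or_default(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|value| value.trim().parse::<usize>().ok())
        .unwrap_or(default)
}

/// Read an environment variable such as WORD_TALLY_CHUNK_SIZE.
fn env_usize(name: &str, default: usize) -> usize {
    parse_or_default(std::env::var(name).ok().as_deref(), default)
}

/// Estimate the initial number of unique-word slots for an input size.
fn estimate_capacity(input_bytes: usize, words_per_kb: usize, uniqueness_ratio: usize) -> usize {
    let estimated_words = input_bytes / 1024 * words_per_kb;
    // Guard against a zero ratio to avoid dividing by zero.
    estimated_words / uniqueness_ratio.max(1)
}
```

Under these assumptions a 1 MiB input with the defaults of 128 words per KB and a 32:1 ratio would reserve 4096 slots, while a 10:1 ratio would reserve 13107, which matches the note that higher ratios allocate less initial memory.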

Exit codes

word-tally uses standard Unix exit codes, following the sysexits.h conventions for codes 64 and above, to indicate success or the type of failure:

0: Success
1: General failure
64: Command line usage error
65: Data format error
66: Input not found
70: Internal software error
73: Output creation failed
74: I/O error
77: Permission denied
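
For callers that wrap the tool, or for code that wants to follow the same convention, the table above can be sketched as a mapping from I/O error kinds to exit codes. The mapping itself is an assumption for illustration and may differ from how word-tally classifies its errors; only the numeric codes come from the list above.

```rust
use std::io::ErrorKind;

// Codes taken from the list above (sysexits.h names in comments).
const EXIT_DATA_FORMAT: i32 = 65; // EX_DATAERR
const EXIT_NO_INPUT: i32 = 66; // EX_NOINPUT
const EXIT_IO: i32 = 74; // EX_IOERR
const EXIT_NO_PERMISSION: i32 = 77; // EX_NOPERM

/// Map an I/O error kind to an exit code (hypothetical classification).
fn exit_code_for(kind: ErrorKind) -> i32 {
    match kind {
        ErrorKind::NotFound => EXIT_NO_INPUT,
        ErrorKind::PermissionDenied => EXIT_NO_PERMISSION,
        ErrorKind::InvalidData => EXIT_DATA_FORMAT,
        _ => EXIT_IO,
    }
}
```

A binary would typically pass the result to `std::process::exit` after printing the error to stderr.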

Library usage

[dependencies]
word-tally = "0.28.0"

use std::collections::HashMap;
use word_tally::{Case, Filters, Io, Options, Serialization, TallyMap, WordTally};
use anyhow::Result;

fn main() -> Result<()> {
    // Basic usage with file path
    let options = Options::default();
    let tally_map = TallyMap::from_path("document.txt", &options)?;
    let word_tally = WordTally::from_tally_map(tally_map, &options);
    println!("Total words: {}", word_tally.count());

    // Memory-mapped I/O for files
    let options = Options::default()
        .with_case(Case::Lower)
        .with_filters(Filters::default().with_min_chars(3))
        .with_serialization(Serialization::Json)
        .with_io(Io::ParallelMmap);

    let tally_map = TallyMap::from_path("large-file.txt", &options)?;
    let tally = WordTally::from_tally_map(tally_map, &options);

    // Convert to `HashMap` for fast word lookups
    let lookup: HashMap<_, _> = tally.into();
    println!("Count of 'the': {}", *lookup.get("the").unwrap_or(&0));
    println!("Count of 'word': {}", *lookup.get("word").unwrap_or(&0));

    Ok(())
}

The library provides full control over case normalization, sorting, filtering, I/O strategies, and output formats.

Stability notice

Pre-release level stability: expect breaking interface changes at MINOR version (0.x.0) bumps until a stable release.

Tests & benchmarks

Tests

Clone the repository.

git clone https://github.com/havenwood/word-tally
cd word-tally

Run all tests.

cargo test

Run specific test modules.

cargo test --test api_tests
cargo test --test filters_tests
cargo test --test io_tests

Run individual tests.

cargo test --test filters_tests -- test_min_chars
cargo test --test io_tests -- test_memory_mapped

Benchmarks

Run all benchmarks.

cargo bench

Run specific benchmark groups.

cargo bench --bench core
cargo bench --bench io
cargo bench --bench features

Run individual benchmarks.

cargo bench --bench features -- case_sensitivity
cargo bench --bench core -- parallel_vs_sequential

Documentation

https://docs.rs/word-tally
