Tallies the number of times each word appears in one or more Unicode input sources, using ICU4X for word-boundary detection. Use word-tally as a command-line tool or WordTally via the Rust library interface.
Unless an I/O mode is specified, a reasonable strategy is automatically selected based on input type:
- Files: Parallel memory-mapped I/O for seekable files
- Pipes, sockets, stdin, other: Parallel streaming I/O for data streams
- Character/block devices: Sequential streaming I/O for device compatibility
All parallel modes use SIMD-accelerated chunk boundary detection. Memory mapping requires seekable file descriptors and won't work with stdin, pipes, or devices.
Usage: word-tally [OPTIONS] [PATHS]...
Arguments:
[PATHS]... File paths to use as input (use "-" for stdin) [default: -]
Options:
-I, --io <STRATEGY> I/O strategy [default: auto] [possible values: auto, parallel-mmap, parallel-in-memory, parallel-stream, stream]
-c, --case <FORMAT> Case normalization [default: original] [possible values: original, upper, lower]
-s, --sort <ORDER> Sort order [default: desc] [possible values: desc, asc, unsorted]
-m, --min-chars <COUNT> Exclude words containing fewer than min chars
-n, --min-count <COUNT> Exclude words appearing fewer than min times
-w, --exclude-words <WORDS> Exclude words from a comma-delimited list
-i, --include <PATTERNS> Include only words matching a regex pattern
-x, --exclude <PATTERNS> Exclude words matching a regex pattern
-f, --format <FORMAT> Output format [default: text] [possible values: text, json, csv]
-d, --field-delimiter <VALUE> Delimiter between field and value [default: " "] (text format only)
-D, --entry-delimiter <VALUE> Delimiter between entries [default: "\n"] (text format only)
-o, --output <PATH> Write output to file rather than stdout
-v, --verbose Print verbose details
-h, --help Print help (see more with '--help')
-V, --version Print version
cargo install word-tally
Choose an I/O strategy based on your performance and memory requirements:
# Default: Auto-selection of a reasonable I/O strategy depending on input type
echo "tally me" | word-tally # Parallel streamed I/O for stdin
word-tally file.txt # Parallel memory-mapped I/O for regular files
# Sequential streamed I/O with minimal memory usage
word-tally --io=stream file.txt
# Parallel streamed I/O
word-tally --io=parallel-stream file.txt
# Parallel memory-mapped I/O
word-tally --io=parallel-mmap file.txt
# Parallel fully loaded into memory
word-tally --io=parallel-in-memory file.txt
Additional features:
# Process multiple files
word-tally file1.txt file2.txt file3.txt
# Mix stdin and files
cat header.txt | word-tally - body.txt footer.txt
Note: Memory mapping (parallel-mmap) requires seekable files and cannot be used with stdin or pipes.
# Write to file instead of stdout
word-tally README.md --output=tally.txt
# Custom delimiter between word and count
word-tally README.md --field-delimiter=": " --output=tally.txt
# Custom delimiter between entries (e.g., comma-separated)
word-tally README.md --field-delimiter=": " --entry-delimiter=", " --output=tally.txt
# Pipe to other tools
word-tally README.md | head -n10
# Tab-separated values without escaping
word-tally --field-delimiter="\t" README.md > tally.tsv
# Custom delimiters
word-tally --field-delimiter="|" --entry-delimiter=";" README.md
# CSV with proper escaping and headers
word-tally --format=csv README.md > tally.csv
word-tally --format=json --output="tally.json" README.md
Convert JSON output for visualization with d3-cloud:
word-tally --format=json README.md | jq 'map({text: .[0], value: .[1]})' > d3-cloud.json
Format and pipe the JSON output to wordcloud_cli to produce an image:
word-tally --format=json README.md | jq -r 'map(.[0] + " ") | join(" ")' | wordcloud_cli --imagefile wordcloud.png
# Convert to lowercase
word-tally --case=lower file.txt
# Preserve original case
word-tally --case=original file.txt
# Convert all to uppercase
word-tally --case=upper file.txt
# Sort by frequency (descending, default)
word-tally --sort=desc file.txt
# Sort alphabetically (ascending)
word-tally --sort=asc file.txt
# No sorting (words appear in the order first encountered)
word-tally --sort=unsorted file.txt
# Only include words that appear at least 10 times
word-tally --min-count=10 file.txt
# Exclude words with fewer than 5 characters
word-tally --min-chars=5 file.txt
# Exclude words by pattern
word-tally --exclude="^a.*" --exclude="^the$" file.txt
# Combining include and exclude patterns
word-tally --include="^w.*" --include=".*o$" --exclude="^who$" file.txt
# Exclude specific words
word-tally --exclude-words="the,a,an,and,or,but" file.txt
echo "fe fi fi fo fo fo" | word-tally --verbose
#>> source -
#>> total-words 6
#>> unique-words 3
#>> delimiter " "
#>> entry-delimiter "\n"
#>> case original
#>> order desc
#>> io parallel-stream
#>> min-chars none
#>> min-count none
#>> exclude-words none
#>> exclude-patterns none
#>> include-patterns none
#>>
#>> fo 3
#>> fi 2
#>> fe 1
The following environment variables configure various aspects of the library:
I/O and processing strategy configuration:
WORD_TALLY_IO - I/O strategy (default: parallel-stream; options: stream, parallel-stream, parallel-in-memory, parallel-mmap)
Memory allocation and performance:
WORD_TALLY_UNIQUENESS_RATIO - Ratio of total words to unique words for capacity estimation. Higher values allocate less initial memory. Books tend toward a 10:1 ratio; the 32:1 default balances memory use against performance (default: 32)
WORD_TALLY_WORDS_PER_KB - Estimated words per KB of text for capacity calculation (default: 128, max: 512)
WORD_TALLY_STDIN_BUFFER_SIZE - Buffer size for stdin when the input size cannot be determined (default: 262144)
Parallel processing configuration:
WORD_TALLY_THREADS - Number of threads for parallel processing (default: all available cores)
WORD_TALLY_CHUNK_SIZE - Chunk size for parallel processing, in bytes (default: 65536)
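As a sketch, several of these variables can be combined on a single invocation. The values below are the documented defaults (with the thread count pinned to 4 for illustration; the default uses all available cores):

```shell
# Tune parallelism and allocation hints for one run of word-tally.
WORD_TALLY_THREADS=4 \
WORD_TALLY_CHUNK_SIZE=65536 \
WORD_TALLY_UNIQUENESS_RATIO=32 \
word-tally --io=parallel-stream large-file.txt
```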
word-tally uses standard Unix exit codes to indicate success or the type of failure:
0: Success
1: General failure
64: Command line usage error
65: Data format error
66: Input not found
70: Internal software error
73: Output creation failed
74: I/O error
77: Permission denied
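In shell scripts, these codes can drive error handling. A minimal sketch (the describe_exit helper is hypothetical, not part of word-tally):

```shell
# Hypothetical helper: map word-tally's exit codes to messages.
# Usage in a script: word-tally file.txt; describe_exit $?
describe_exit() {
  case "$1" in
    0)  echo "success" ;;
    1)  echo "general failure" ;;
    64) echo "command line usage error" ;;
    65) echo "data format error" ;;
    66) echo "input not found" ;;
    70) echo "internal software error" ;;
    73) echo "output creation failed" ;;
    74) echo "I/O error" ;;
    77) echo "permission denied" ;;
    *)  echo "unknown exit code: $1" ;;
  esac
}
describe_exit 66   # prints "input not found"
```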
[dependencies]
word-tally = "0.28.0"
use std::collections::HashMap;
use anyhow::Result;
use word_tally::{Case, Filters, Io, Options, Serialization, TallyMap, WordTally};
fn main() -> Result<()> {
// Basic usage with file path
let options = Options::default();
let tally_map = TallyMap::from_path("document.txt", &options)?;
let word_tally = WordTally::from_tally_map(tally_map, &options);
println!("Total words: {}", word_tally.count());
// Memory-mapped I/O for files
let options = Options::default()
.with_case(Case::Lower)
.with_filters(Filters::default().with_min_chars(3))
.with_serialization(Serialization::Json)
.with_io(Io::ParallelMmap);
let tally_map = TallyMap::from_path("large-file.txt", &options)?;
let tally = WordTally::from_tally_map(tally_map, &options);
// Convert to `HashMap` for fast word lookups
let lookup: HashMap<_, _> = tally.into();
println!("Count of 'the': {}", *lookup.get("the").unwrap_or(&0));
println!("Count of 'word': {}", *lookup.get("word").unwrap_or(&0));
Ok(())
}
The library provides full control over case normalization, sorting, filtering, I/O strategies, and output formats.
Pre-release stability: this is pre-release software. Expect breaking interface changes at MINOR version (0.x.0) bumps until a stable release.
Clone the repository.
git clone https://github.com/havenwood/word-tally
cd word-tally
Run all tests.
cargo test
Run specific test modules.
cargo test --test api_tests
cargo test --test filters_tests
cargo test --test io_tests
Run individual tests.
cargo test --test filters_tests -- test_min_chars
cargo test --test io_tests -- test_memory_mapped
Run all benchmarks.
cargo bench
Run specific benchmark groups.
cargo bench --bench core
cargo bench --bench io
cargo bench --bench features
Run specific individual benchmarks.
cargo bench --bench features -- case_sensitivity
cargo bench --bench core -- parallel_vs_sequential