TextAssociations.jl is a Julia package for word association analysis and corpus-based research in linguistics, the social sciences and the digital humanities. It provides a unified framework to quantify lexical relationships within texts and corpora using 51 association measures, spanning statistical, information-theoretic, epidemiological and lexical-gravity approaches, for transparent, data-driven analysis of how words co-occur and connect across discourse.
The package is fully functional but still evolving: documentation, tutorials, and examples are actively being expanded.
Even at this stage, it already offers functionality comparable to established corpus analysis tools:
- AntConc (but more programmable)
- SketchEngine (but open source)
- WordSmith Tools (but with more metrics)
With the added advantages of:
- Being fully programmable and extensible
- Integration with Julia's ecosystem
- Support for custom metrics
- Ability to process streaming data
- Parallel computing capabilities
This makes TextAssociations.jl a powerful tool for computational linguistics, digital humanities and any field requiring sophisticated text analysis!
Check out our documentation for a detailed overview of all available features and functionalities.
Even in the era of transformer models and word embeddings, association metrics remain valuable because they:
- Are interpretable: Provide transparent, statistical insights into word relationships
- Complement neural models: Can be used alongside embeddings to improve performance and enrich RAG pipelines
- Serve as benchmarks: Provide baselines for evaluating complex models
- Work with limited data: Perform well even with small corpora
A comprehensive suite including PMI, log-likelihood, Dice, Jaccard, Lexical Gravity and many more specialized measures from corpus linguistics and information theory, along with association metrics inspired by epidemiology.
Process entire document collections with built-in support for:
- Large-scale corpus processing
- Temporal analysis (track changes over time)
- Subcorpus comparison with statistical tests
- Keyword extraction (TF-IDF and other methods soon to come)
- Lazy evaluation for memory efficiency
- Parallel processing support
- Streaming for massive corpora
- Caching system for repeated analyses
- Multiple input formats (text files, CSV, JSON, DataFrames)
- Easy to add custom metrics
- Comprehensive API for programmatic access
You can install TextAssociations.jl directly from its GitHub repository using Julia's package manager. In the Julia REPL, run:
```julia
using Pkg
Pkg.add(url="https://github.com/atantos/TextAssociations.jl")
```

A minimal example with a single text:

```julia
using TextAssociations

# Simple analysis with a single text
text = "The cat sat on the mat. The cat played with the ball."
ct = ContingencyTable(text, "cat", windowsize=3, minfreq=1)

# Calculate PMI scores
pmi_scores = assoc_score(PMI, ct)

# Multiple metrics at once
results = assoc_score([PMI, LogDice, LLR], ct)
```
Corpus-level analysis works the same way:

```julia
# Load a corpus from a directory
corpus = read_corpus("path/to/texts/", preprocess=true)

# Analyze word associations across the entire corpus
results = analyze_node(corpus, "innovation", PMI, windowsize=5, minfreq=10)

# Analyze multiple words with multiple metrics
nodes = ["technology", "innovation", "research"]
metrics = [PMI, LogDice, LLR, ChiSquare]
analysis = analyze_nodes(corpus, nodes, metrics, top_n=100)
```

TextAssociations.jl supports 51 metrics organized by category:
- PMI (Pointwise Mutual Information): $\log \frac{P(x,y)}{P(x)P(y)}$ (a worked example of PMI and Dice follows this list)
- PMI², PMI³: Squared and cubed variants
- PPMI: Positive PMI (negative values set to 0)
- LLR: Log-likelihood ratio
- LexicalGravity: Asymmetric association measure
- ChiSquare: Pearson's χ² test
- Tscore, Zscore: Statistical significance tests
- PhiCoef: Phi coefficient (φ)
- CramersV: Cramér's V
- YuleQ, YuleOmega: Yule's measures
- Dice: $\frac{2a}{2a + b + c}$
- LogDice: Logarithmic Dice (more stable)
- JaccardIdx: Jaccard similarity
- CosineSim: Cosine similarity
- OverlapCoef: Overlap coefficient
- RelRisk, LogRelRisk: Relative risk measures
- OddsRatio, LogOddsRatio: Odds ratios
- RiskDiff: Risk difference
- AttrRisk: Attributable risk
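To make the formulas concrete, here is a small hand computation of PMI, Dice and LogDice from a toy 2×2 contingency table. The counts are invented for illustration, and the LogDice scaling used (14 + log₂ Dice) is the standard one from the corpus-linguistics literature; the exact values reported by `assoc_score` may differ slightly depending on the package's counting conventions and logarithm base.

```julia
# Toy 2×2 contingency table for a node–collocate pair (illustrative counts)
#              collocate   other words
#   node         a = 20       b = 80
#   ¬node        c = 30       d = 9870
a, b, c, d = 20, 80, 30, 9_870
N = a + b + c + d

# Joint and marginal probabilities
p_xy = a / N
p_x  = (a + b) / N
p_y  = (a + c) / N

pmi     = log(p_xy / (p_x * p_y))   # log P(x,y) / (P(x)P(y)), natural log here
dice    = 2a / (2a + b + c)         # 2a / (2a + b + c)
logdice = 14 + log2(dice)           # standard LogDice scaling

println((PMI = round(pmi, digits=3), Dice = round(dice, digits=3),
         LogDice = round(logdice, digits=3)))
```

With these counts the script prints PMI ≈ 3.689, Dice ≈ 0.267 and LogDice ≈ 12.093.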
Click to see all 51 metrics

| Metric | Type |
|---|---|
| PMI | `PMI` |
| PMI² | `PMI2` |
| PMI³ | `PMI3` |
| PPMI | `PPMI` |
| LLR | `LLR` |
| LLR² | `LLR2` |
| Dice | `Dice` |
| LogDice | `LogDice` |
| Jaccard | `JaccardIdx` |
| Cosine | `CosineSim` |
| Overlap | `OverlapCoef` |
| Relative Risk | `RelRisk` |
| Odds Ratio | `OddsRatio` |
| Chi-square | `ChiSquare` |
| Phi | `PhiCoef` |
| Cramér's V | `CramersV` |
| ...and 35+ more | |
Track how word associations change over time:
```julia
temporal_analysis = analyze_temporal(
    corpus, ["pandemic", "vaccine"], :year, PMI, time_bins=5
)
```

Compare associations across document groups with statistical tests:

```julia
comparison = compare_subcorpora(
    corpus, :category, "innovation", PMI
)

# Access statistical tests and effect sizes
tests = comparison.statistical_tests
```

Build and export word association networks with richer metadata:
```julia
network = colloc_graph(
    corpus, ["climate", "change"];  # seed terms
    metric=PMI,
    depth=2,
    min_score=2.5,
    direction=:undirected,
    include_frequency=true,
    weight_normalization=:minmax,
    compute_centrality=true,
    centrality_metrics=[:pagerank, :betweenness]
)

first(network.edges, 5)         # includes Frequency / DocFrequency / NormalizedWeight
first(network.node_metrics, 5)  # includes degrees, strengths & centrality scores

# Export for Gephi
gephi_graph(network, "nodes.csv", "edges.csv")
```

Extract key terms and inspect them in context:

```julia
# Keyword extraction (TF-IDF)
keywords = keyterms(corpus, method=:tfidf, num_keywords=50)

# Keyword-in-context (KWIC) concordance
concordance = kwic(corpus, "innovation", context_size=50)
for line in concordance.lines
    println("...$(line.LeftContext) [$(line.Node)] $(line.RightContext)...")
end
```

Scale up with parallel, streaming and batch processing:

```julia
# Use multiple cores
using Distributed
addprocs(4)

analysis = analyze_nodes(
    corpus, nodes, metrics, parallel=true
)

# Process files without loading everything into memory
results = stream_corpus_analysis(
    "texts/*.txt", "word", PMI, chunk_size=1000
)

# Process hundreds of node words efficiently
batch_process_corpus(
    corpus, "nodelist.txt", "output/",
    batch_size=100
)
```

TextAssociations.jl is ideal for:
- Corpus Linguistics: Collocation analysis, lexical patterns, semantic prosody
- Digital Humanities: Literary analysis, historical text mining, stylometry
- NLP Research: Feature extraction, baseline models, evaluation metrics
- Social Media Analysis: Trend detection, sentiment associations, hashtag networks
- Information Retrieval: Query expansion, document similarity, term weighting (see the sketch after this list)
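To illustrate the Information Retrieval use case, the sketch below treats the top-scoring collocates of a query term as expansion candidates. It only reuses calls shown elsewhere in this README (`read_corpus`, `analyze_node`); the corpus path and the node word are placeholders, and inspecting the result with `first` assumes it is a DataFrame-like table.

```julia
using TextAssociations

# Hypothetical query-expansion workflow (path and node word are placeholders)
corpus = read_corpus("docs/", preprocess=true)

# Score collocates of the query term; strong associations are expansion candidates
expansion_candidates = analyze_node(corpus, "network", PMI, windowsize=5, minfreq=5)

# Inspect the strongest associations (assumes a DataFrame-like result)
first(expansion_candidates, 10)
```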
Analyzing a collection of academic abstracts:

```julia
# Load abstracts from CSV
corpus = read_corpus("papers.csv",
    text_column=:abstract,
    metadata_columns=[:year, :journal])

# Extract domain-specific keywords
keywords = keyterms(corpus, method=:tfidf, num_keywords=100)

# Analyze key terms over time
temporal = analyze_temporal(
    corpus, keywords[1:10], :year, PMI
)

# Compare across journals
comparison = compare_subcorpora(corpus, :journal, "methodology", LogDice)
```

Exploring character co-occurrence in fiction:

```julia
# Load novels
corpus = read_corpus("novels/", preprocess=true)

# Character co-occurrence network
characters = ["Elizabeth", "Darcy", "Jane", "Bingley"]
network = colloc_graph(
    corpus, characters, windowsize=20
)

# Export for visualization
gephi_graph(network, "characters.csv", "relations.csv")
```

We welcome contributions! See our Contributing Guide for details on:
- Adding new metrics
- Improving performance
- Extending functionality
- Reporting issues
```bash
# Clone repository
git clone https://github.com/atantos/TextAssociations.jl
cd TextAssociations.jl

# Activate environment
julia --project=.
```

```julia
# Run tests
using Pkg; Pkg.test()
```

TextAssociations.jl builds on established methods from computational linguistics and is inspired by:
- AntConc (Anthony, 2022)
- SketchEngine (Kilgarriff et al., 2014)
- WordSmith Tools (Scott, 2020)
It does so while offering the performance and flexibility of the Julia ecosystem.
MIT License - see LICENSE file for details.
- GPU acceleration for large-scale processing
- Additional keyword extraction methods (TextRank, RAKE)
- Integration with word embeddings
- Indexing & Search Engine (à la Corpus Workbench)
- Support for more file formats (XML, CoNLL)
How does this compare to other tools?
| Feature | TextAssociations.jl | AntConc | SketchEngine | WordSmith |
|---|---|---|---|---|
| Open Source | Yes | No | No | No |
| Metrics | 51 | ~10*2 | ~20*2 | ~15*2 |
| Corpus Size | Unlimited*1 | Limited | Large | Medium |
| Parallel Processing | Yes | No | No | No |
| API Access | Yes | No | Yes | No |
| Programmable | Yes | No | Limited | No |
*1 With streaming and memory-mapped files
*2 This is a rough estimate including both association measures and keyness tests. A more precise count from users of these tools is welcome.
What file formats are supported?
- Plain text files (.txt)
- CSV files with text columns
- JSON files
- Julia DataFrames
- Directory of text files
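For instance, the directory and CSV forms used elsewhere in this README look like this (paths and column names are placeholders):

```julia
using TextAssociations

# A directory of plain-text files, with preprocessing enabled
corpus_dir = read_corpus("path/to/texts/", preprocess=true)

# A CSV file with a text column and optional metadata columns
corpus_csv = read_corpus("papers.csv",
    text_column=:abstract,
    metadata_columns=[:year, :journal])
```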
Can it handle non-English text?
Yes! TextAssociations.jl works with any Unicode text. The preprocessing steps (lowercasing, punctuation removal) are Unicode-aware.
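For example, the quick-start workflow runs unchanged on Greek text (the sentence below is only illustrative):

```julia
using TextAssociations

# Unicode-aware analysis of a Greek node word (illustrative sentence)
text = "Η γλώσσα αλλάζει συνεχώς. Η γλώσσα εξελίσσεται μαζί με τους ομιλητές της."
ct = ContingencyTable(text, "γλώσσα", windowsize=3, minfreq=1)

# PMI scores for collocates of "γλώσσα" ("language")
assoc_score(PMI, ct)
```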
Contact: For questions and support, please open an issue on GitHub.
Star us on GitHub: If you find this package useful, please consider giving it a star!

