
🎯 Introduction

TextAssociations.jl is a Julia package for word association analysis and corpus-based research in linguistics, social sciences and the digital humanities. It provides a unified framework to quantify lexical relationships within texts and corpora using 51 association measures—spanning statistical, information-theoretic, epidemiological and lexical-gravity approaches—for transparent, data-driven analysis of how words co-occur and connect across discourse.

⚠️ Early Release Notice This is an early, pre-registration release of TextAssociations.jl. The package is fully functional but still evolving — documentation, tutorials, and examples are actively being expanded.

Even at this stage, it already offers functionality comparable to established corpus analysis tools:

AntConc (but more programmable)
SketchEngine (but open source)
WordSmith Tools (but with more metrics)

With added advantages of:

Being fully programmable and extensible
Integration with Julia's ecosystem
Support for custom metrics
Ability to process streaming data
Parallel computing capabilities

This makes TextAssociations.jl a powerful tool for computational linguistics, digital humanities and any field requiring sophisticated text analysis!

Check out our documentation for a detailed overview of all available features and functionalities.

Why Word Association Metrics Still Matter

Even in the era of transformer models and word embeddings, association metrics remain valuable because they:

📊 Are interpretable: Provide transparent, statistical insights into word relationships
🔄 Complement neural models: Can be combined with embeddings to improve performance, including in retrieval-augmented generation (RAG) pipelines
📏 Serve as benchmarks: Provide baselines for evaluating complex models
💾 Work with limited data: Perform well even with small corpora

✨ Core Features

📈 51 Association Metrics

Comprehensive suite including PMI, log-likelihood, Dice, Jaccard, Lexical Gravity and many more specialized measures from corpus linguistics and information theory, plus several association metrics inspired by epidemiology.

📚 Corpus-Level Analysis

Process entire document collections with built-in support for:

Large-scale corpus processing
Temporal analysis (track changes over time)
Subcorpus comparison with statistical tests
Keyword extraction (TF-IDF and other methods soon to come)

🚀 Performance Optimized

Lazy evaluation for memory efficiency
Parallel processing support
Streaming for massive corpora
Caching system for repeated analyses

🔧 Flexible and Extensible

Multiple input formats (text files, CSV, JSON, DataFrames)
Easy to add custom metrics
Comprehensive API for programmatic access

📦 Installation

You can install TextAssociations.jl directly from its GitHub repository using Julia's package manager. Run the following in the Julia REPL (alternatively, press ] to enter Pkg mode and run add https://github.com/atantos/TextAssociations.jl):

using Pkg
Pkg.add(url="https://github.com/atantos/TextAssociations.jl")

🚀 Quick Start

Basic Usage

using TextAssociations

# Simple analysis with a single text
text = "The cat sat on the mat. The cat played with the ball."
ct = ContingencyTable(text, "cat", windowsize=3, minfreq=1)

# Calculate PMI scores
pmi_scores = assoc_score(PMI, ct)

# Multiple metrics at once
results = assoc_score([PMI, LogDice, LLR], ct)
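To make the windowsize parameter concrete, here is a minimal sketch of window-based co-occurrence counting in plain Julia. It does not use or reproduce the package's internals: the `tokenize` and `window_counts` helpers are hypothetical names, and the tokenization (lowercasing, dropping punctuation, splitting on whitespace) is an assumption made for illustration. The package may treat sentence boundaries and preprocessing differently.

```julia
# Hypothetical helper: lowercase, drop punctuation, split on whitespace.
tokenize(text) = split(lowercase(filter(c -> !ispunct(c), text)))

# Count every token that appears within `windowsize` positions
# to the left or right of each occurrence of `node`.
function window_counts(tokens, node; windowsize=3)
    counts = Dict{String,Int}()
    for (i, tok) in enumerate(tokens)
        tok == node || continue
        lo = max(1, i - windowsize)
        hi = min(length(tokens), i + windowsize)
        for j in lo:hi
            j == i && continue          # skip the node itself
            w = String(tokens[j])
            counts[w] = get(counts, w, 0) + 1
        end
    end
    return counts
end

text = "The cat sat on the mat. The cat played with the ball."
counts = window_counts(tokenize(text), "cat"; windowsize=3)

# Print collocates from most to least frequent.
for (word, n) in sort(collect(counts); by = p -> (-p[2], p[1]))
    println(rpad(word, 8), n)
end
```

Counts like these are the raw material for a contingency table; a larger window captures looser, more topical associations, while a smaller one favors tight collocations.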

Corpus Analysis

# Load a corpus from a directory
corpus = read_corpus("path/to/texts/", preprocess=true)

# Analyze word associations across the entire corpus
results = analyze_node(corpus, "innovation", PMI, windowsize=5, minfreq=10)

# Analyze multiple words with multiple metrics
nodes = ["technology", "innovation", "research"]
metrics = [PMI, LogDice, LLR, ChiSquare]
analysis = analyze_nodes(corpus, nodes, metrics, top_n=100)

📊 Supported Metrics

TextAssociations.jl supports 51 metrics organized by category:

Information-Theoretic Metrics

PMI (Pointwise Mutual Information): $\log \frac{P(x,y)}{P(x)P(y)}$
PMI², PMI³: Squared and cubed variants
PPMI: Positive PMI (negative values set to 0)
LLR: Log-likelihood ratio
LexicalGravity: Asymmetric association measure
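As a worked illustration of the PMI and PPMI definitions above, the following plain-Julia sketch computes both from raw counts. The counts are hypothetical, the base-2 logarithm is an assumption (the formula above does not fix a base), and the `pmi` and `ppmi` functions are illustrative helpers rather than the package's API.

```julia
# Hypothetical counts, chosen only for illustration.
n_xy = 40       # co-occurrences of x and y
n_x  = 200      # occurrences of x
n_y  = 250      # occurrences of y
N    = 10_000   # total number of observations

# PMI = log( P(x,y) / (P(x) * P(y)) ), here with a base-2 logarithm.
pmi(n_xy, n_x, n_y, N) = log2((n_xy / N) / ((n_x / N) * (n_y / N)))

# PPMI replaces negative PMI values with zero.
ppmi(n_xy, n_x, n_y, N) = max(0.0, pmi(n_xy, n_x, n_y, N))

println("PMI  = ", round(pmi(n_xy, n_x, n_y, N), digits=3))
println("PPMI = ", round(ppmi(n_xy, n_x, n_y, N), digits=3))

# A pair that co-occurs less often than chance has negative PMI, so PPMI is 0.
println("PPMI (rare pair) = ", ppmi(1, n_x, n_y, N))
```

Here the pair co-occurs eight times more often than independence would predict (0.004 against 0.0005), which gives a PMI of about 3 bits.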

Statistical Metrics

ChiSquare: Pearson's χ² test
Tscore, Zscore: Statistical significance tests
PhiCoef: Phi coefficient (φ)
CramersV: Cramér's V
YuleQ, YuleOmega: Yule's measures
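The chi-square statistic and the phi coefficient can both be computed from a 2×2 contingency table. The sketch below is plain Julia with hypothetical counts and helper names (`chisq_2x2`, `phi`); it assumes the common layout in which a counts windows containing both node and collocate, b the node without the collocate, c the collocate without the node, and d neither. It is not the package's implementation.

```julia
# Pearson's chi-square for a 2x2 table: sum over cells of (O - E)^2 / E.
function chisq_2x2(a, b, c, d)
    obs = [a b; c d]
    N = sum(obs)
    # Expected counts under independence: (row total * column total) / N.
    expected = sum(obs, dims=2) * sum(obs, dims=1) / N
    return sum((obs .- expected) .^ 2 ./ expected)
end

# Phi coefficient for the same table.
phi(a, b, c, d) = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

a, b, c, d = 40, 160, 210, 9_590   # hypothetical counts
N = a + b + c + d

println("chi-square = ", round(chisq_2x2(a, b, c, d), digits=2))
println("phi        = ", round(phi(a, b, c, d), digits=4))

# For a 2x2 table, chi-square equals N * phi^2.
println("N * phi^2  = ", round(N * phi(a, b, c, d)^2, digits=2))
```

For a 2×2 table Cramér's V reduces to the absolute value of phi, which is why the two measures rank word pairs identically in this setting.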

Similarity Coefficients

Dice: $\frac{2a}{2a + b + c}$
LogDice: Logarithmic Dice (more stable)
JaccardIdx: Jaccard similarity
CosineSim: Cosine similarity
OverlapCoef: Overlap coefficient
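The similarity coefficients in this group depend only on the cells a, b and c of the contingency table. The following plain-Julia sketch evaluates them on hypothetical counts, assuming a is the joint frequency, b the node-only frequency and c the collocate-only frequency. The function names are illustrative and are not the package's API.

```julia
dice(a, b, c)    = 2a / (2a + b + c)
logdice(a, b, c) = 14 + log2(dice(a, b, c))   # at most 14, reached when b = c = 0
jaccard(a, b, c) = a / (a + b + c)
cosine(a, b, c)  = a / sqrt((a + b) * (a + c))
overlap(a, b, c) = a / min(a + b, a + c)

a, b, c = 40, 160, 210   # hypothetical counts

for (name, f) in [("Dice", dice), ("LogDice", logdice), ("Jaccard", jaccard),
                  ("Cosine", cosine), ("Overlap", overlap)]
    println(rpad(name, 8), round(f(a, b, c), digits=4))
end
```

None of these measures uses the d cell, so they are insensitive to overall corpus size, which is one reason LogDice is often preferred when comparing scores across corpora.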

Epidemiological Metrics

RelRisk, LogRelRisk: Relative risk measures
OddsRatio, LogOddsRatio: Odds ratios
RiskDiff: Risk difference
AttrRisk: Attributable risk
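Read epidemiologically, the node word plays the role of the "exposure" and the collocate the "outcome". The sketch below computes relative risk, odds ratio and risk difference from a 2×2 table in plain Julia. The counts and function names are hypothetical, and the cell layout (a both, b node only, c collocate only, d neither) is an assumption consistent with the formulas in this README.

```julia
# Probability of the collocate given the node vs. given its absence.
relrisk(a, b, c, d)   = (a / (a + b)) / (c / (c + d))
oddsratio(a, b, c, d) = (a * d) / (b * c)
riskdiff(a, b, c, d)  = a / (a + b) - c / (c + d)

a, b, c, d = 40, 160, 210, 9_590   # hypothetical counts

println("Relative risk   = ", round(relrisk(a, b, c, d), digits=3))
println("Odds ratio      = ", round(oddsratio(a, b, c, d), digits=3))
println("Risk difference = ", round(riskdiff(a, b, c, d), digits=4))

# The log variants are the natural logarithms of the ratios.
println("Log relative risk = ", round(log(relrisk(a, b, c, d)), digits=3))
println("Log odds ratio    = ", round(log(oddsratio(a, b, c, d)), digits=3))
```

With zero cells these ratios become infinite or undefined, so in practice some form of smoothing is usually applied; how the package handles that case is not shown here.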

Complete Metric List

Formulas for a selection of the 51 metrics:

Metric	Type	Formula
PMI	`PMI`	$\log \frac{P(x,y)}{P(x)P(y)}$
PMI²	`PMI²`	$(\log \frac{P(x,y)}{P(x)P(y)})^2$
PMI³	`PMI³`	$(\log \frac{P(x,y)}{P(x)P(y)})^3$
PPMI	`PPMI`	$\max(0, \log \frac{P(x,y)}{P(x)P(y)})$
LLR	`LLR`	$2 \sum_{i,j} O_{ij} \ln \frac{O_{ij}}{E_{ij}}$
LLR²	`LLR²`	$\sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
Dice	`Dice`	$\frac{2a}{2a + b + c}$
LogDice	`LogDice`	$14 + \log_2(\frac{2a}{2a + b + c})$
Jaccard	`JaccardIdx`	$\frac{a}{a + b + c}$
Cosine	`CosineSim`	$\frac{a}{\sqrt{(a + b)(a + c)}}$
Overlap	`OverlapCoef`	$\frac{a}{\min(a + b, a + c)}$
Relative Risk	`RelRisk`	$\frac{a/(a+b)}{c/(c+d)}$
Odds Ratio	`OddsRatio`	$\frac{ad}{bc}$
Chi-square	`ChiSquare`	$\sum_{i,j}\frac{(f_{ij}-\hat{f}_{ij})^2}{\hat{f}_{ij}}$
Phi	`PhiCoef`	$\frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$
Cramér's V	`CramersV`	$\sqrt{\frac{\chi^2}{n \cdot \min(r-1, c-1)}}$
...and 35 more
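As an example of how the observed/expected formulas in the table translate into code, here is a plain-Julia sketch of the log-likelihood ratio, $2 \sum_{i,j} O_{ij} \ln \frac{O_{ij}}{E_{ij}}$, for a 2×2 table. The `llr` function and the counts are hypothetical; cells with a zero observed count are skipped, following the usual convention that $0 \ln 0 = 0$. This is an illustration of the formula, not the package's code.

```julia
function llr(a, b, c, d)
    obs = Float64[a b; c d]
    N = sum(obs)
    # Expected counts under independence: (row total * column total) / N.
    expected = sum(obs, dims=2) * sum(obs, dims=1) / N
    total = 0.0
    for (o, e) in zip(obs, expected)
        # Skip empty cells: the limit of o * log(o / e) as o -> 0 is 0.
        o > 0 && (total += o * log(o / e))
    end
    return 2 * total
end

println("LLR (associated pair)  = ", round(llr(40, 160, 210, 9_590), digits=2))
println("LLR (independent pair) = ", round(llr(10, 20, 30, 60), digits=2))
```

The second table has observed counts equal to the expected counts, so its score is zero; larger values indicate stronger evidence against independence.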

🎯 Advanced Features

Temporal Analysis

Track how word associations change over time:

temporal_analysis = analyze_temporal(
    corpus, ["pandemic", "vaccine"], :year, PMI, time_bins=5
)

Subcorpus Comparison

Compare associations across document groups with statistical tests:

comparison = compare_subcorpora(
    corpus, :category, "innovation", PMI
)
# Access statistical tests and effect sizes
tests = comparison.statistical_tests

Collocation Networks

Build and export word association networks with richer metadata:

network = colloc_graph(
    corpus, ["climate", "change"];  # seed terms
    metric=PMI,
    depth=2,
    min_score=2.5,
    direction=:undirected,
    include_frequency=true,
    weight_normalization=:minmax,
    compute_centrality=true,
    centrality_metrics=[:pagerank, :betweenness]
)

first(network.edges, 5)          # includes Frequency / DocFrequency / NormalizedWeight
first(network.node_metrics, 5)    # includes degrees, strengths & centrality scores

gephi_graph(network, "nodes.csv", "edges.csv")

Keyword Extraction

keywords = keyterms(corpus, method=:tfidf, num_keywords=50)
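For readers unfamiliar with TF-IDF, the sketch below shows one common variant in plain Julia: term frequency as a within-document proportion, multiplied by the natural log of the inverse document frequency. The `tfidf` function and the toy documents are hypothetical, and the weighting scheme is an assumption; the variant used by keyterms may differ.

```julia
# docs is a vector of tokenized documents (each a vector of strings).
function tfidf(docs)
    N = length(docs)
    # Document frequency: number of documents containing each term.
    df = Dict{String,Int}()
    for doc in docs, term in unique(doc)
        df[term] = get(df, term, 0) + 1
    end
    # One score dictionary per document.
    scores = [Dict{String,Float64}() for _ in docs]
    for (i, doc) in enumerate(docs), term in unique(doc)
        tf = count(==(term), doc) / length(doc)
        scores[i][term] = tf * log(N / df[term])
    end
    return scores
end

docs = [
    ["corpus", "analysis", "julia"],
    ["julia", "language", "speed"],
    ["corpus", "linguistics", "collocation", "corpus"],
]

scores = tfidf(docs)
for (term, s) in sort(collect(scores[3]); by = p -> -p[2])
    println(rpad(term, 12), round(s, digits=4))
end
```

Terms that are frequent in one document but rare across the collection score highest, which is what makes the measure useful for keyword extraction.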

Concordance (KWIC)

concordance = kwic(corpus, "innovation", context_size=50)
for line in concordance.lines
    println("...$(line.LeftContext) [$(line.Node)] $(line.RightContext)...")
end

⚡ Performance Features

Parallel Processing

# Use multiple cores
using Distributed
addprocs(4)
@everywhere using TextAssociations  # load the package on every worker process

analysis = analyze_nodes(
    corpus, nodes, metrics, parallel=true
)

Streaming for Large Corpora

# Process files without loading everything into memory
results = stream_corpus_analysis(
    "texts/*.txt", "word", PMI, chunk_size=1000
)

Batch Processing

# Process hundreds of node words efficiently
batch_process_corpus(
    corpus, "nodelist.txt", "output/",
    batch_size=100
)

🔬 Use Cases

TextAssociations.jl is ideal for:

Corpus Linguistics: Collocation analysis, lexical patterns, semantic prosody
Digital Humanities: Literary analysis, historical text mining, stylometry
NLP Research: Feature extraction, baseline models, evaluation metrics
Social Media Analysis: Trend detection, sentiment associations, hashtag networks
Information Retrieval: Query expansion, document similarity, term weighting

📖 Documentation

Examples

💻 Example Workflows

Research Paper Analysis

# Load abstracts from CSV
corpus = read_corpus("papers.csv",
    text_column=:abstract,
    metadata_columns=[:year, :journal])

# Extract domain-specific keywords
keywords = keyterms(corpus, method=:tfidf, num_keywords=100)

# Analyze key terms over time
temporal = analyze_temporal(
    corpus, keywords[1:10], :year, PMI
)

# Compare across journals
comparison = compare_subcorpora(corpus, :journal, "methodology", LogDice)

Literary Text Analysis

# Load novels
corpus = read_corpus("novels/", preprocess=true)

# Character co-occurrence network
characters = ["Elizabeth", "Darcy", "Jane", "Bingley"]
network = colloc_graph(
    corpus, characters, windowsize=20
)

# Export for visualization
gephi_graph(network, "characters.csv", "relations.csv")

🤝 Contributing

We welcome contributions! See our Contributing Guide for details on:

Adding new metrics
Improving performance
Extending functionality
Reporting issues

Development Setup

# Clone repository
git clone https://github.com/atantos/TextAssociations.jl
cd TextAssociations.jl

# Install dependencies and run the tests in the package environment
julia --project=. -e 'using Pkg; Pkg.instantiate(); Pkg.test()'

🙏 Acknowledgments

TextAssociations.jl builds on established methods from computational linguistics and is inspired by:

AntConc (Anthony, 2022)
SketchEngine (Kilgarriff et al., 2014)
WordSmith Tools (Scott, 2020)

It combines ideas from these tools with the performance and flexibility of the Julia ecosystem.

📄 License

MIT License - see LICENSE file for details.

🗺️ Roadmap

GPU acceleration for large-scale processing
Additional keyword extraction methods (TextRank, RAKE)
Integration with word embeddings
Indexing & Search Engine (à la Corpus Workbench)
Support for more file formats (XML, CONLL)

❓ FAQ

How does this compare to other tools?

Feature	TextAssociations.jl	AntConc	SketchEngine	WordSmith
Open Source	✅	✅	❌	❌
Metrics	51	~10*²	~20*²	~15*²
Corpus Size	Unlimited*¹	Limited	Large	Medium
Parallel Processing	✅	❌	✅	❌
API Access	✅	❌	✅	❌
Programmable	✅	❌	Limited	❌

*¹ With streaming and memory-mapped files

*² This is a rough estimate including both association measures and keyness tests. A more precise count from users of these tools is welcome.

What file formats are supported?

Plain text files (.txt)
CSV files with text columns
JSON files
Julia DataFrames
Directory of text files

Can it handle non-English text?

Yes! TextAssociations.jl works with any Unicode text. The preprocessing steps (lowercasing, punctuation removal) are Unicode-aware.
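The following sketch illustrates what Unicode-aware lowercasing and punctuation removal look like using only functions from Julia's Base library. The `normalize_text` helper is a hypothetical name and is not the package's preprocessing pipeline, which may apply additional steps.

```julia
# lowercase and ispunct in Julia Base operate on Unicode characters,
# so the same code handles Greek, accented Latin and Cyrillic text.
normalize_text(s) = filter(!ispunct, lowercase(s))

for s in ["Γειά σου, Κόσμε!", "ÉCOLE: Été", "Привет, МИР!"]
    println(split(normalize_text(s)))
end
```

Because Julia strings are UTF-8 and iteration is by character rather than by byte, no special handling is needed for multi-byte scripts.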

📬 Contact: For questions and support, please open an issue on GitHub.

🌟 Star us on GitHub: If you find this package useful, please consider giving it a star!
