TextAssociations.jl

🎯 Introduction

TextAssociations.jl is a Julia package for word association analysis and corpus-based research in linguistics, social sciences and the digital humanities. It provides a unified framework to quantify lexical relationships within texts and corpora using 51 association measures (spanning statistical, information-theoretic, epidemiological and lexical-gravity approaches) for transparent, data-driven analysis of how words co-occur and connect across discourse.

⚠️ Early Release Notice: This is an early, pre-registration release of TextAssociations.jl. The package is fully functional but still evolving: documentation, tutorials, and examples are actively being expanded.

Even at this stage, it already offers functionality comparable to established corpus analysis tools:

  • AntConc (but more programmable)
  • SketchEngine (but open source)
  • WordSmith Tools (but with more metrics)

It also adds the following advantages:

  • Being fully programmable and extensible
  • Integration with Julia's ecosystem
  • Support for custom metrics
  • Ability to process streaming data
  • Parallel computing capabilities

This makes TextAssociations.jl a powerful tool for computational linguistics, digital humanities and any field requiring sophisticated text analysis!

Check out our documentation for a detailed overview of all available features and functionalities.

Why Word Association Metrics Still Matter

Even in the era of transformer models and word embeddings, association metrics remain valuable because they:

  • 📊 Are interpretable: Provide transparent, statistical insights into word relationships
  • 🔄 Complement neural models: Can be used alongside embeddings and in RAG pipelines to improve performance
  • 📏 Serve as benchmarks: Provide baselines for evaluating complex models
  • 💾 Work with limited data: Perform well even with small corpora

✨ Core Features

📈 51 Association Metrics

Comprehensive suite including PMI, Log-likelihood, Dice, Jaccard, Lexical Gravity and many more specialized measures from corpus linguistics and information theory, plus association metrics adapted from epidemiology.

📚 Corpus-Level Analysis

Process entire document collections with built-in support for:

  • Large-scale corpus processing
  • Temporal analysis (track changes over time)
  • Subcorpus comparison with statistical tests
  • Keyword extraction (TF-IDF and other methods soon to come)

🚀 Performance Optimized

  • Lazy evaluation for memory efficiency
  • Parallel processing support
  • Streaming for massive corpora
  • Caching system for repeated analyses

🔧 Flexible and Extensible

  • Multiple input formats (text files, CSV, JSON, DataFrames)
  • Easy to add custom metrics
  • Comprehensive API for programmatic access

📦 Installation

You can install TextAssociations.jl directly from its GitHub repository using Julia's package manager. At the regular `julia>` prompt of the Julia REPL, run:

using Pkg
Pkg.add(url="https://github.com/atantos/TextAssociations.jl")

🚀 Quick Start

Basic Usage

using TextAssociations

# Simple analysis with a single text
text = "The cat sat on the mat. The cat played with the ball."
ct = ContingencyTable(text, "cat", windowsize=3, minfreq=1)

# Calculate PMI scores
pmi_scores = assoc_score(PMI, ct)

# Multiple metrics at once
results = assoc_score([PMI, LogDice, LLR], ct)
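
To make `windowsize` concrete, here is a minimal sketch in plain Julia of window-based co-occurrence counting for the same two sentences. It is not the package's implementation: `window_counts` is a hypothetical helper, the tokenization (lowercasing, splitting on non-alphanumeric characters) is an assumption, and whether TextAssociations.jl treats overlapping windows the same way is not confirmed here.

```julia
# Illustrative only: NOT the TextAssociations.jl implementation.
# Counts tokens found within `windowsize` positions of each node occurrence.
function window_counts(text::AbstractString, node::AbstractString; windowsize::Int=3)
    # Assumed tokenization: lowercase, then split on non-letter/non-digit runs.
    tokens = split(lowercase(text), r"[^\p{L}\p{N}]+"; keepempty=false)
    counts = Dict{String,Int}()
    for (i, tok) in enumerate(tokens)
        tok == node || continue                 # only look around node occurrences
        lo = max(1, i - windowsize)
        hi = min(length(tokens), i + windowsize)
        for j in lo:hi
            j == i && continue                  # skip the node itself
            w = String(tokens[j])
            counts[w] = get(counts, w, 0) + 1
        end
    end
    return counts
end

text = "The cat sat on the mat. The cat played with the ball."
counts = window_counts(text, "cat"; windowsize=3)
println(counts)
```

With `windowsize=3`, `ball` is four tokens away from the nearest `cat` and so receives no count, while `the` is counted five times across the two windows.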

Corpus Analysis

# Load a corpus from a directory
corpus = read_corpus("path/to/texts/", preprocess=true)

# Analyze word associations across the entire corpus
results = analyze_node(corpus, "innovation", PMI, windowsize=5, minfreq=10)

# Analyze multiple words with multiple metrics
nodes = ["technology", "innovation", "research"]
metrics = [PMI, LogDice, LLR, ChiSquare]
analysis = analyze_nodes(corpus, nodes, metrics, top_n=100)

📊 Supported Metrics

TextAssociations.jl supports 51 metrics organized by category:

Information-Theoretic Metrics

  • PMI (Pointwise Mutual Information): $\log \frac{P(x,y)}{P(x)P(y)}$
  • PMI², PMI³: Squared and cubed variants
  • PPMI: Positive PMI (negative values set to 0)
  • LLR: Log-likelihood ratio
  • LexicalGravity: Asymmetric association measure
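
As a worked illustration, PMI can be rewritten in terms of 2×2 contingency counts. This assumes the conventional cell labels (not defined in this README): $a$ = contexts containing both words, $b$ = node without collocate, $c$ = collocate without node, $d$ = neither, and $n = a + b + c + d$.

$$
P(x,y) = \frac{a}{n}, \quad P(x) = \frac{a+b}{n}, \quad P(y) = \frac{a+c}{n}
\quad\Longrightarrow\quad
\mathrm{PMI}(x,y) = \log \frac{a \, n}{(a+b)(a+c)}
$$

For example, with $a = 10$, $b = 20$, $c = 30$, $d = 940$ and base-2 logarithms (the base is an assumption here), $\mathrm{PMI} = \log_2 \frac{10 \cdot 1000}{30 \cdot 40} = \log_2 8.33 \approx 3.06$.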

Statistical Metrics

  • ChiSquare: Pearson's χ² test
  • Tscore, Zscore: Statistical significance tests
  • PhiCoef: Phi coefficient (φ)
  • CramersV: Cramér's V
  • YuleQ, YuleOmega: Yule's measures
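
For a 2×2 table these statistics are closely related. Assuming the conventional cell labels $a, b, c, d$ with $n = a + b + c + d$ (labels not defined in this README), the standard identities are:

$$
\chi^2 = \frac{n\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}, \qquad
\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}, \qquad
\chi^2 = n\,\phi^2
$$

Because $\min(r-1, c-1) = 1$ for a 2×2 table, Cramér's V reduces to $V = \sqrt{\chi^2 / n} = |\phi|$ in this case.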

Similarity Coefficients

  • Dice: $\frac{2a}{2a + b + c}$
  • LogDice: Logarithmic Dice (more stable)
  • JaccardIdx: Jaccard similarity
  • CosineSim: Cosine similarity
  • OverlapCoef: Overlap coefficient
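
The similarity coefficients can be checked by hand on a small table. The sketch below uses plain Julia rather than the package API; the function names are hypothetical, and the cell labels (`a` = both words, `b` = node only, `c` = collocate only) are the conventional ones, assumed here because the README does not define them.

```julia
# Hypothetical helpers, not part of the TextAssociations.jl API.
# a = contexts with both words, b = node only, c = collocate only (assumed labels).
dice(a, b, c)    = 2a / (2a + b + c)
logdice(a, b, c) = 14 + log2(dice(a, b, c))
jaccard(a, b, c) = a / (a + b + c)

a, b, c = 10, 20, 30
println("Dice    = ", dice(a, b, c))
println("LogDice = ", logdice(a, b, c))
println("Jaccard = ", jaccard(a, b, c))
```

Dice and Jaccard are monotonically related ($J = D / (2 - D)$), so they rank collocates identically; LogDice only rescales Dice so that a perfect association scores 14.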

Epidemiological Metrics

  • RelRisk, LogRelRisk: Relative risk measures
  • OddsRatio, LogOddsRatio: Odds ratios
  • RiskDiff: Risk difference
  • AttrRisk: Attributable risk

Complete Metric List

Metrics with formulas (excerpt):

| Metric | Type | Formula |
|---|---|---|
| PMI | PMI | $\log \frac{P(x,y)}{P(x)P(y)}$ |
| PMI² | PMI² | $(\log \frac{P(x,y)}{P(x)P(y)})^2$ |
| PMI³ | PMI³ | $(\log \frac{P(x,y)}{P(x)P(y)})^3$ |
| PPMI | PPMI | $\max(0, \log \frac{P(x,y)}{P(x)P(y)})$ |
| LLR | LLR | $2 \sum_{i,j} O_{ij} \ln \frac{O_{ij}}{E_{ij}}$ |
| LLR² | LLR² | $\sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ |
| Dice | Dice | $\frac{2a}{2a + b + c}$ |
| LogDice | LogDice | $14 + \log_2(\frac{2a}{2a + b + c})$ |
| Jaccard | JaccardIdx | $\frac{a}{a + b + c}$ |
| Cosine | CosineSim | $\frac{a}{\sqrt{(a + b)(a + c)}}$ |
| Overlap | OverlapCoef | $\frac{a}{\min(a + b, a + c)}$ |
| Relative Risk | RelRisk | $\frac{a/(a+b)}{c/(c+d)}$ |
| Odds Ratio | OddsRatio | $\frac{ad}{bc}$ |
| Chi-square | ChiSquare | $\sum_{i,j}\frac{(f_{ij}-\hat{f}_{ij})^2}{\hat{f}_{ij}}$ |
| Phi | PhiCoef | $\frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$ |
| Cramér's V | CramersV | $\sqrt{\frac{\chi^2}{n \cdot \min(r-1, c-1)}}$ |

...and 35+ more

🎯 Advanced Features

Temporal Analysis

Track how word associations change over time:

temporal_analysis = analyze_temporal(
    corpus, ["pandemic", "vaccine"], :year, PMI, time_bins=5
)

Subcorpus Comparison

Compare associations across document groups with statistical tests:

comparison = compare_subcorpora(
    corpus, :category, "innovation", PMI
)
# Access statistical tests and effect sizes
tests = comparison.statistical_tests

Collocation Networks

Build and export word association networks with richer metadata:

network = colloc_graph(
    corpus, ["climate", "change"];  # seed terms
    metric=PMI,
    depth=2,
    min_score=2.5,
    direction=:undirected,
    include_frequency=true,
    weight_normalization=:minmax,
    compute_centrality=true,
    centrality_metrics=[:pagerank, :betweenness]
)

first(network.edges, 5)          # includes Frequency / DocFrequency / NormalizedWeight
first(network.node_metrics, 5)    # includes degrees, strengths & centrality scores

gephi_graph(network, "nodes.csv", "edges.csv")
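
The `weight_normalization=:minmax` option presumably refers to ordinary min-max scaling of edge weights. The following is a minimal sketch of that standard technique in plain Julia; `minmax_scale` is a hypothetical helper, and how the package handles ties or the all-equal case is an assumption.

```julia
# Standard min-max scaling to [0, 1]; hypothetical helper, not the package API.
function minmax_scale(weights::AbstractVector{<:Real})
    lo, hi = extrema(weights)
    # Assumed convention: if all weights are equal, map them all to 1.0.
    hi == lo && return fill(1.0, length(weights))
    return [(w - lo) / (hi - lo) for w in weights]
end

scaled = minmax_scale([2.5, 4.0, 5.5])
println(scaled)
```

Scaling to a fixed range makes edge weights comparable across metrics whose raw scores live on very different scales (for example PMI versus log-likelihood).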

Keyword Extraction

keywords = keyterms(corpus, method=:tfidf, num_keywords=50)
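
TF-IDF has several variants, and which one `keyterms` uses is not stated here. As an illustration of the general idea, this sketch computes the textbook form (raw term frequency times `log(N / document frequency)`) in plain Julia on a toy, pre-tokenized corpus; `tfidf` and the sample documents are hypothetical.

```julia
# Textbook TF-IDF on pre-tokenized documents; not the package's implementation.
docs = [["climate", "change", "policy"],
        ["climate", "model"],
        ["policy", "debate", "policy"]]

function tfidf(docs::Vector{Vector{String}})
    n = length(docs)
    # Document frequency: number of documents containing each term.
    df = Dict{String,Int}()
    for d in docs, t in unique(d)
        df[t] = get(df, t, 0) + 1
    end
    # Per-document scores: each occurrence contributes one idf unit,
    # so the total is tf * log(n / df).
    scores = Vector{Dict{String,Float64}}()
    for d in docs
        s = Dict{String,Float64}()
        for t in d
            s[t] = get(s, t, 0.0) + log(n / df[t])
        end
        push!(scores, s)
    end
    return scores
end

scores = tfidf(docs)
println(scores[3])
```

Terms that appear in every document would score 0 under this variant, which is why TF-IDF favors words that are frequent locally but rare across the corpus.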

Concordance (KWIC)

concordance = kwic(corpus, "innovation", context_size=50)
for line in concordance.lines
    println("...$(line.LeftContext) [$(line.Node)] $(line.RightContext)...")
end

⚑ Performance Features

Parallel Processing

# Use multiple cores
using Distributed
addprocs(4)

analysis = analyze_nodes(
    corpus, nodes, metrics, parallel=true
)

Streaming for Large Corpora

# Process files without loading everything into memory
results = stream_corpus_analysis(
    "texts/*.txt", "word", PMI, chunk_size=1000
)

Batch Processing

# Process hundreds of node words efficiently
batch_process_corpus(
    corpus, "nodelist.txt", "output/",
    batch_size=100
)

🔬 Use Cases

TextAssociations.jl is ideal for:

  • Corpus Linguistics: Collocation analysis, lexical patterns, semantic prosody
  • Digital Humanities: Literary analysis, historical text mining, stylometry
  • NLP Research: Feature extraction, baseline models, evaluation metrics
  • Social Media Analysis: Trend detection, sentiment associations, hashtag networks
  • Information Retrieval: Query expansion, document similarity, term weighting

📖 Documentation

💻 Example Workflows

Research Paper Analysis

# Load abstracts from CSV
corpus = read_corpus("papers.csv",
    text_column=:abstract,
    metadata_columns=[:year, :journal])

# Extract domain-specific keywords
keywords = keyterms(corpus, method=:tfidf, num_keywords=100)

# Analyze key terms over time
temporal = analyze_temporal(
    corpus, keywords[1:10], :year, PMI
)

# Compare across journals
comparison = compare_subcorpora(corpus, :journal, "methodology", LogDice)

Literary Text Analysis

# Load novels
corpus = read_corpus("novels/", preprocess=true)

# Character co-occurrence network
characters = ["Elizabeth", "Darcy", "Jane", "Bingley"]
network = colloc_graph(
    corpus, characters, windowsize=20
)

# Export for visualization
gephi_graph(network, "characters.csv", "relations.csv")

🀝 Contributing

We welcome contributions! See our Contributing Guide for details on:

  • Adding new metrics
  • Improving performance
  • Extending functionality
  • Reporting issues

Development Setup

# Clone repository
git clone https://github.com/atantos/TextAssociations.jl
cd TextAssociations.jl

# Activate the project environment (this starts a Julia REPL)
julia --project=.

# Run tests (at the julia> prompt)
using Pkg; Pkg.test()

🙏 Acknowledgments

TextAssociations.jl builds on established methods from computational linguistics and is inspired by:

  • AntConc (Anthony, 2022)
  • SketchEngine (Kilgarriff et al., 2014)
  • WordSmith Tools (Scott, 2020)

It combines these established approaches with the performance and flexibility of the Julia ecosystem.

📄 License

MIT License - see LICENSE file for details.


🗺️ Roadmap

  • GPU acceleration for large-scale processing
  • Additional keyword extraction methods (TextRank, RAKE)
  • Integration with word embeddings
  • Indexing & Search Engine (à la Corpus Workbench)
  • Support for more file formats (XML, CONLL)

❓ FAQ

How does this compare to other tools?
| Feature | TextAssociations.jl | AntConc | SketchEngine | WordSmith |
|---|---|---|---|---|
| Open Source | ✅ | ✅ | ❌ | ❌ |
| Metrics | 51 | ~10 (*2) | ~20 (*2) | ~15 (*2) |
| Corpus Size | Unlimited (*1) | Limited | Large | Medium |
| Parallel Processing | ✅ | ❌ | ✅ | ❌ |
| API Access | ✅ | ❌ | ✅ | ❌ |
| Programmable | ✅ | ❌ | Limited | ❌ |

*1 With streaming and memory-mapped files

*2 This is a rough estimate including both association measures and keyness tests. A more precise count from users of these tools is welcome.

What file formats are supported?
  • Plain text files (.txt)
  • CSV files with text columns
  • JSON files
  • Julia DataFrames
  • Directory of text files
Can it handle non-English text?

Yes! TextAssociations.jl works with any Unicode text. The preprocessing steps (lowercasing, punctuation removal) are Unicode-aware.
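
As a small illustration of what Unicode-aware lowercasing and punctuation removal look like, the sketch below uses only Base Julia on a Greek phrase. It does not call the package's preprocessing, whose exact steps and options may differ.

```julia
# Base Julia only; the package's own preprocessing pipeline may differ.
greek = "Καλημέρα, Κόσμε!"
lowered = lowercase(greek)                   # Unicode-aware case mapping
cleaned = replace(lowered, r"\p{P}" => "")   # \p{P} matches any Unicode punctuation
println(cleaned)
```

The `\p{P}` character class covers punctuation in every script, so the same pattern works for text beyond the ASCII range.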


📬 Contact: For questions and support, please open an issue on GitHub.

🌟 Star us on GitHub: If you find this package useful, please consider giving it a star!