Biberplus


Biberplus is a pure Python implementation of the linguistic tagging system introduced in Biber (1988). Built upon the spaCy library, it delivers fast part-of-speech tagging along with supplemental features such as a function word tagger, PCA, and factor analysis. These features, inspired by the work of Grieve, Clarke, and colleagues, make Biberplus a powerful tool for analyzing large text corpora.


Table of Contents

  • Features
  • Installation
  • Quickstart Guide
  • Configuration
  • Usage Tips
  • Troubleshooting
  • References
  • License
  • Funding

Features

  • Linguistic Tagging: Implements Biber's tags alongside supplementary features.
  • Function Word Analysis: Option to use built-in or custom lists for function word tagging.
  • Text Embeddings: Flatten tagging frequencies into a vector representation.
  • Dimension Reduction: Perform PCA and factor analysis on the resulting data.
  • Performance: Support for multi-processing and GPU acceleration.

Installation

From PyPI (Stable Release)

Install the latest version (0.3.0) from PyPI:

pip install biberplus

For more details and package history, visit the Biberplus project page on PyPI.

Important:
Biberplus depends on spaCy for text processing. After installing biberplus, you must manually download the spaCy English model by running:

python -m spacy download en_core_web_sm

Quickstart Guide

1. Biber Tagger

Tag a string using the default configuration:

from biberplus.tagger import calculate_tag_frequencies

frequencies_df = calculate_tag_frequencies("Your sample text goes here")
print(frequencies_df)

Tag a large corpus with GPU and multi-processing:

from biberplus.tagger import load_config, load_pipeline, calculate_tag_frequencies

config = load_config()
config.update({'use_gpu': True, 'n_processes': 4, 'function_words': False})
pipeline = load_pipeline(config)
frequencies_df = calculate_tag_frequencies("Your sample text goes here", pipeline, config)
print(frequencies_df)
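
The frequencies_df returned above behaves like a pandas DataFrame (as the name suggests), so standard pandas operations apply. A minimal follow-up sketch under that assumption; the CSV file name is illustrative:

from biberplus.tagger import calculate_tag_frequencies

frequencies_df = calculate_tag_frequencies("Your sample text goes here")

# Save the tag frequencies for later analysis (assumes a pandas DataFrame)
frequencies_df.to_csv("biber_frequencies.csv", index=False)

# Quick summary statistics over the frequency columns
print(frequencies_df.describe())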

2. Function Words Tagger

Using the default list:

from biberplus.tagger import load_config, load_pipeline, calculate_tag_frequencies

config = load_config()
config.update({'use_gpu': True, 'biber': False, 'function_words': True})
pipeline = load_pipeline(config)
frequencies_df = calculate_tag_frequencies("Your sample text goes here", pipeline, config)
print(frequencies_df)

Using a custom list:

from biberplus.tagger import load_config, load_pipeline, calculate_tag_frequencies

custom_fw = ["the", "of", "to", "and", "a", "in", "that"]
config = load_config()
config.update({
    'function_words': True,
    'biber': False,
    'grieve_clarke': False,
    'function_words_list': custom_fw
})
pipeline = load_pipeline(config)
frequencies_df = calculate_tag_frequencies("Your sample text goes here", pipeline, config)
print(frequencies_df)

3. Word-Level Tagging

See exactly which tags are applied to each word:

from biberplus.tagger import tag_text, load_config, load_pipeline

# Load configuration and pipeline
config = load_config()
pipeline = load_pipeline(config)

# Your test sentence
text = "It doesn't seem likely that this will work."

# Get tagged words
tagged_words = tag_text(text, pipeline=pipeline)

# Print each word and its tags
for word in tagged_words:
    print(f"Word: {word['text']:<15} Tags: {', '.join(word['tags'])}")

Example output:

Word: It              Tags: it, PIT, CAP, PRP, SBJP
Word: does            Tags: VPRT, SPAU
Word: n't             Tags: XX0, CONT, RB
Word: seem            Tags: SMP, INF
Word: likely          Tags: JJ

4. Text Embeddings

Generate an embedding vector from the textual data:

from biberplus.tagger import load_config
from biberplus.reducer import encode_text

config = load_config()
embedding = encode_text(config, "Your sample text goes here")
print(embedding)
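
One natural use of these embeddings is comparing texts. A minimal sketch using cosine similarity with NumPy, assuming encode_text returns a flat numeric vector (flattened here just in case); the sample texts are placeholders:

import numpy as np
from biberplus.tagger import load_config
from biberplus.reducer import encode_text

config = load_config()

# Encode two documents into tag-based embedding vectors
emb_a = np.asarray(encode_text(config, "First sample text goes here"), dtype=float).ravel()
emb_b = np.asarray(encode_text(config, "Second sample text goes here"), dtype=float).ravel()

# Cosine similarity between the two embeddings
similarity = emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
print(similarity)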

5. Dimension Reduction

Using PCA:

from biberplus.tagger import load_config, load_pipeline, calculate_tag_frequencies
from biberplus.reducer import tags_pca

config = load_config()
config.update({'use_gpu': True, 'biber': True, 'function_words': True})
pipeline = load_pipeline(config)
frequencies_df = calculate_tag_frequencies("Your sample text goes here", pipeline, config)

pca_df, explained_variance = tags_pca(frequencies_df, components=2)
print(pca_df)
print(explained_variance)
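
Factor analysis (listed under Features) can also be run directly on stacked tag frequencies. The sketch below uses scikit-learn's FactorAnalysis rather than a biberplus helper, and assumes several texts are tagged so the numeric frequency matrix has enough rows:

import pandas as pd
from sklearn.decomposition import FactorAnalysis
from biberplus.tagger import load_config, load_pipeline, calculate_tag_frequencies

config = load_config()
pipeline = load_pipeline(config)

texts = [
    "First sample text goes here",
    "Second sample text goes here",
    "Third sample text goes here",
]

# Stack the tag frequencies from each text into one matrix (columns = tags)
frequencies = pd.concat(
    [calculate_tag_frequencies(text, pipeline, config) for text in texts],
    ignore_index=True,
)

# Two-factor solution over the stacked tag-frequency matrix
fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(frequencies)
print(scores)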

Configuration

The library uses a YAML configuration file located at biberplus/tagger/config.yaml. Common options include:

  • biber: Enable Biber tag analysis.
  • function_words: Enable function word tagging.
  • binary_tags: Use binary features for tag counts.
  • token_normalization: Number of tokens per batch for frequency calculation.
  • use_gpu: Enable GPU acceleration.
  • n_processes: Number of processes for multi-processing.
  • drop_last_batch_pct: Percentage threshold below which the final, smaller batch is dropped.

You can modify these options in the file or update them dynamically in your script after loading the configuration with load_config().
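
For example, an in-script override of several of the options above could look like this (the values shown are illustrative, not defaults):

from biberplus.tagger import load_config

config = load_config()
config.update({
    'biber': True,               # run the Biber tag analysis
    'function_words': True,      # also tag function words
    'binary_tags': False,        # raw counts rather than binary features
    'token_normalization': 100,  # tokens per batch for frequency calculation
    'use_gpu': False,            # CPU only
    'n_processes': 2,            # number of worker processes
})
print(config)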


Usage Tips

  • Reuse the Pipeline: For tagging many texts, load the spaCy pipeline once and reuse it across calls (see the sketch after this list).
  • Adjust Batch Settings: For shorter texts (e.g., tweets), consider reducing token_normalization and enabling binary tags.
  • Leverage GPU & Multi-processing: Enable use_gpu and adjust n_processes to boost performance on large corpora.
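
A minimal sketch of that reuse pattern (the texts are placeholders):

from biberplus.tagger import load_config, load_pipeline, calculate_tag_frequencies

config = load_config()
pipeline = load_pipeline(config)  # load the spaCy pipeline once

texts = ["First document goes here", "Second document goes here"]

# Reuse the same pipeline and config for every call
for text in texts:
    frequencies_df = calculate_tag_frequencies(text, pipeline, config)
    print(frequencies_df)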

Troubleshooting

  • Performance: If processing is slow, check your GPU and multi-processing settings.
  • spaCy Model: Ensure that en_core_web_sm (or your chosen model) is installed and correctly configured.
  • Configuration Issues: Validate your configuration by printing the config object to ensure it reflects your intended settings.

References

  • Biber, D. (1988). Variation across Speech and Writing. Cambridge: Cambridge University Press.
  • Grieve, J. (2023). Register variation explains stylometric authorship analysis.
  • Additional research and references are detailed in the project documentation.

License

MIT License


Funding

This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200006. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
