Biberplus is a pure Python implementation of the linguistic tagging system introduced in Biber (1988). Built upon the spaCy library, it delivers fast part-of-speech tagging along with supplemental features such as a function word tagger, PCA, and factor analysis. These features, inspired by the work of Grieve, Clarke, and colleagues, make Biberplus a powerful tool for analyzing large text corpora.
- Linguistic Tagging: Implements Biber's tags alongside supplementary features.
- Function Word Analysis: Option to use built-in or custom lists for function word tagging.
- Text Embeddings: Flatten tagging frequencies into a vector representation.
- Dimension Reduction: Perform PCA and factor analysis on the resulting data.
- Performance: Support for multi-processing and GPU acceleration.
Install the latest version (0.3.0) from PyPI:

```shell
pip install biberplus
```

For more details and package history, visit the Biberplus project page on PyPI.
Important:
Biberplus depends on spaCy for text processing. After installing biberplus, you must manually download the spaCy English model by running:

```shell
python -m spacy download en_core_web_sm
```

Tag a string using the default configuration:
```python
from biberplus.tagger import calculate_tag_frequencies

frequencies_df = calculate_tag_frequencies("Your sample text goes here")
print(frequencies_df)
```

Tag a large corpus with GPU and multi-processing:
```python
from biberplus.tagger import load_config, load_pipeline, calculate_tag_frequencies

config = load_config()
config.update({'use_gpu': True, 'n_processes': 4, 'function_words': False})
pipeline = load_pipeline(config)

frequencies_df = calculate_tag_frequencies("Your sample text goes here", pipeline, config)
print(frequencies_df)
```

Using the default function word list:
```python
from biberplus.tagger import load_config, load_pipeline, calculate_tag_frequencies

config = load_config()
config.update({'use_gpu': True, 'biber': False, 'function_words': True})
pipeline = load_pipeline(config)

# Pass the pipeline and updated config so the settings take effect
frequencies_df = calculate_tag_frequencies("Your sample text goes here", pipeline, config)
print(frequencies_df)
```

Using a custom list:
```python
from biberplus.tagger import load_config, load_pipeline, calculate_tag_frequencies

custom_fw = ["the", "of", "to", "and", "a", "in", "that"]

config = load_config()
config.update({
    'function_words': True,
    'biber': False,
    'grieve_clarke': False,
    'function_words_list': custom_fw
})
pipeline = load_pipeline(config)

# The custom list is read from the config; pass the pipeline and config as usual
frequencies_df = calculate_tag_frequencies("Your sample text goes here", pipeline, config)
print(frequencies_df)
```

See exactly which tags are applied to each word:
```python
from biberplus.tagger import tag_text, load_config, load_pipeline

# Load configuration and pipeline
config = load_config()
pipeline = load_pipeline(config)

# Your test sentence
text = "It doesn't seem likely that this will work."

# Get tagged words
tagged_words = tag_text(text, pipeline=pipeline)

# Print each word and its tags
for word in tagged_words:
    print(f"Word: {word['text']:<15} Tags: {', '.join(word['tags'])}")
```

Example output:
```
Word: It              Tags: it, PIT, CAP, PRP, SBJP
Word: does            Tags: VPRT, SPAU
Word: n't             Tags: XX0, CONT, RB
Word: seem            Tags: SMP, INF
Word: likely          Tags: JJ
```
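The per-word output above is straightforward to aggregate into corpus-level tag counts using only the standard library. The snippet below is a sketch: the hard-coded list simply mirrors the example output shown above rather than being produced by calling biberplus.

```python
from collections import Counter

# Tagged words in the list-of-dicts shape shown in the example output
# above (each word has a 'text' and a 'tags' key).
tagged_words = [
    {"text": "It", "tags": ["it", "PIT", "CAP", "PRP", "SBJP"]},
    {"text": "does", "tags": ["VPRT", "SPAU"]},
    {"text": "n't", "tags": ["XX0", "CONT", "RB"]},
    {"text": "seem", "tags": ["SMP", "INF"]},
    {"text": "likely", "tags": ["JJ"]},
]

# Flatten the per-word tags into overall tag counts
tag_counts = Counter(tag for word in tagged_words for tag in word["tags"])
print(tag_counts["PRP"])  # 1
```

This is, conceptually, the aggregation step behind the frequency tables returned by calculate_tag_frequencies.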
Generate an embedding vector from the textual data:
```python
from biberplus.tagger import load_config
from biberplus.reducer import encode_text

config = load_config()
embedding = encode_text(config, "Your sample text goes here")
print(embedding)
```

Using PCA:
```python
from biberplus.tagger import load_config, load_pipeline, calculate_tag_frequencies
from biberplus.reducer import tags_pca

config = load_config()
config.update({'use_gpu': True, 'biber': True, 'function_words': True})
pipeline = load_pipeline(config)

frequencies_df = calculate_tag_frequencies("Your sample text goes here", pipeline, config)
pca_df, explained_variance = tags_pca(frequencies_df, components=2)
print(pca_df)
print(explained_variance)
```

The library uses a YAML configuration file located at biberplus/tagger/config.yaml. Common options include:
- biber: Enable Biber tag analysis.
- function_words: Enable function word tagging.
- binary_tags: Use binary features for tag counts.
- token_normalization: Number of tokens per batch for frequency calculation.
- use_gpu: Enable GPU acceleration.
- n_processes: Number of processes for multi-processing.
- drop_last_batch_pct: Drop the last batch when it falls below this percentage of the normal batch size.
You can modify these options in the file or update them dynamically in your script after loading the configuration with load_config().
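As an illustration, a config.yaml covering the options above might look like the following. The values here are hypothetical placeholders, not the shipped defaults; consult the actual file at biberplus/tagger/config.yaml.

```yaml
# Illustrative sketch only -- check biberplus/tagger/config.yaml for real defaults
biber: true
function_words: true
binary_tags: false
token_normalization: 1000
use_gpu: false
n_processes: 1
drop_last_batch_pct: 20
```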
- Reuse the Pipeline: For tagging many texts, load the spaCy pipeline once and reuse it across calls.
- Adjust Batch Settings: For shorter texts (e.g., tweets), consider reducing token_normalization and enabling binary_tags.
- Leverage GPU & Multi-processing: Enable use_gpu and adjust n_processes to boost performance on large corpora.
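The first tip matters because building the pipeline is the expensive step. The pattern is sketched below in plain Python; load_pipeline and calculate_tag_frequencies are stand-in functions here so the sketch runs without biberplus installed, but the structure is the same when using the real API.

```python
# Sketch of the "load once, reuse everywhere" pattern from the tips above.
load_count = 0

def load_pipeline():
    """Stand-in for the expensive spaCy pipeline load."""
    global load_count
    load_count += 1
    return object()

def calculate_tag_frequencies(text, pipeline):
    """Stand-in tagger: returns a trivial per-text statistic."""
    return {"n_tokens": len(text.split())}

# Load the pipeline ONCE, outside the loop...
pipeline = load_pipeline()

texts = ["first document", "second document", "third one"]
# ...then reuse it for every text in the corpus
results = [calculate_tag_frequencies(t, pipeline) for t in texts]

print(load_count)  # 1 -- the pipeline was built a single time
```

With the real library, the same shape applies: call load_config and load_pipeline once, then pass the same pipeline object to every calculate_tag_frequencies call.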
- Performance: If processing is slow, check your GPU and multi-processing settings.
- spaCy Model: Ensure that en_core_web_sm (or your chosen model) is installed and correctly configured.
- Configuration Issues: Validate your configuration by printing the config object to ensure it reflects your intended settings.
- Biber, D. (1988). Variation across Speech and Writing. Cambridge: Cambridge University Press.
- Grieve, J. (2023). Register variation explains stylometric authorship analysis.
- Additional research and references are detailed in the project documentation.
MIT License
This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200006. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.