scott062/notorious_b_i_g

Notorious BIGram Parser

A small tool for counting bigrams from text inputs or file(s). Includes:

  • CLI (Typer) with an ASCII histogram
  • Django UI with a Chart.js bar chart
  • tests for the core logic

Requirements

  • Python 3.11+
  • Virtualenv (uv or pip)

Setup

# from project root
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

Project layout

Important Directories
cli/     # CLI app (Typer)
parsing/ # core logic
tests/   # pytest for parsing
web/     # Django UI

CLI

Run

Using Files:

uv run python -m cli.bigram_cli --hist file/path/here

or (interactive):

uv run python -m cli.bigram_cli -i

or (piped):

echo "YOUR TEXT HERE" | PYTHONPATH=. uv run python -m cli.bigram_cli --hist

Example (use any or all of the three bundled samples):

uv run python -m cli.bigram_cli --hist ./samples/crime_and_punish.txt ./samples/pride_and_prej.txt ./samples/moby_dick.txt

Inputs

  • FILES = one or more text files; counts are aggregated
  • No files -> reads from stdin
  • No files and no stdin -> exits
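
The input rules above can be sketched as a single line source. This is a minimal illustration, not the project's code: `iter_input_lines` is a hypothetical helper, and treating an interactive terminal as "no stdin" is an assumption about how the exit case is detected.

```python
import sys
from pathlib import Path
from typing import Iterable, Iterator


def iter_input_lines(paths: Iterable[str]) -> Iterator[str]:
    """Yield lines from each file in turn, or from stdin when no files are given."""
    paths = list(paths)
    if paths:
        # Several files feed one stream, so a single counter aggregates them.
        for path in paths:
            with Path(path).open(encoding="utf-8") as handle:
                yield from handle
    elif not sys.stdin.isatty():
        # Piped input: read whatever was sent on stdin.
        yield from sys.stdin
    else:
        # No files and nothing piped: there is nothing to count.
        raise SystemExit("no input: pass FILES or pipe text on stdin")
```

Because every file ends up in the same stream, whatever consumes these lines sees one combined input, which is one simple way to get aggregated counts.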

Core options

  • --hist -> show a simple bar chart
  • --top INT (default: 50) -> limit to top N
  • -i, --interactive -> prompt to toggle parsing

Parsing flags

  • -l (default: true) keep only A–Z letters
  • -p (default: true) strip all punctuation
  • -a (default: false) keep apostrophes inside words: don't
  • -y (default: false) keep hyphens inside words: mother-in-law
  • -s (default: false) reset bigram sequence at sentence end
  • -s (default: false) reset bigram sequence at each newline
  • -v (default: false) only count words that are probably valid English
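
To make the letter, apostrophe, and hyphen flags concrete, here is a rough sketch of per-token cleanup. `normalize_token` is a hypothetical function written for this README, not the one in `parsing/`; its keyword names mirror the `CountBigramOptions` fields, and it only handles ASCII apostrophes and hyphens.

```python
import re


def normalize_token(raw: str, *, include_apostrophes: bool = False,
                    include_hyphens: bool = False,
                    case_sensitive: bool = False) -> str:
    """Reduce a whitespace-delimited token to letters plus any kept marks."""
    keep = "A-Za-z"
    if include_apostrophes:
        keep += "'"
    if include_hyphens:
        keep += "\\-"
    # Drop every character that is not a letter or an explicitly kept mark.
    cleaned = re.sub(f"[^{keep}]", "", raw)
    # Kept marks only count inside a word, so trim them from the edges.
    cleaned = cleaned.strip("'-")
    return cleaned if case_sensitive else cleaned.lower()


print(normalize_token("Don't!"))                                # dont
print(normalize_token("Don't!", include_apostrophes=True))      # don't
print(normalize_token("mother-in-law,", include_hyphens=True))  # mother-in-law
```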

What the histogram does

  • Fits to your terminal width.
  • Long labels are condensed.
  • The max count fills the bar; others are scaled proportionally.
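
The scaling described above can be sketched as follows. This is an illustrative version under stated assumptions, not the CLI's renderer: `render_histogram`, the `#` bar character, the `...` truncation marker, and the one-third cap on label width are all choices made for this example.

```python
import shutil


def render_histogram(rows, width=None):
    """Return one text line per (label, count) row, scaled to the largest count."""
    if not rows:
        return []
    if width is None:
        # Fit to the current terminal, falling back to 80 columns.
        width = shutil.get_terminal_size(fallback=(80, 24)).columns
    max_count = max(count for _, count in rows)
    count_width = len(str(max_count))
    # Labels get at most about a third of the line (never less than 8 columns).
    label_width = min(max(len(label) for label, _ in rows), max(8, width // 3))
    # Space left for bars after the label, the count, and two separators.
    bar_width = max(1, width - label_width - count_width - 2)
    lines = []
    for label, count in rows:
        if len(label) > label_width:
            label = label[: label_width - 3] + "..."
        # The largest count fills the bar; the rest scale proportionally.
        bar = "#" * max(1, round(count / max_count * bar_width)) if count else ""
        lines.append(f"{label:<{label_width}} {bar:<{bar_width}} {count:>{count_width}}")
    return lines


for line in render_histogram([("the cat", 4), ("cat sat", 2)], width=40):
    print(line)
```

With `width=40` the first row gets a full 30-character bar and the second gets 15, half as long.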

Django UI

Run

cd web
python manage.py migrate
python manage.py runserver
# Access at http://127.0.0.1:8000/

Features

  • Paste text, set parser options, and choose how many top results (Top N) to show.
  • Chart.js bar chart + simple table.
  • Valid-words filtering is a work in progress (WIP).

Parser

parsing/bigram_parser.py

  • count_bigrams(lines, options): iterates per line (and per sentence, if enabled) and forms (prev, word) bigrams.
  • CountBigramOptions fields:
    • ignore_all_punctuation
    • letters_only
    • case_sensitive
    • include_apostrophes
    • include_hyphens
    • sentence_sensitive
    • line_separated
    • valid_words (WIP)

count_bigrams returns a Counter; the UI calls .most_common() on it and renders the result.
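
The (prev, word) idea and the Counter return shape can be sketched like this. `count_bigrams_sketch` is a simplified stand-in written for illustration, not the real `count_bigrams`: it takes a single `line_separated` flag instead of a `CountBigramOptions` object, splits on whitespace only, and assumes that bigrams carry across line breaks unless that flag is set.

```python
from collections import Counter
from typing import Iterable


def count_bigrams_sketch(lines: Iterable[str], line_separated: bool = False) -> Counter:
    """Count adjacent (prev, word) pairs over whitespace-split, lowercased words."""
    counts: Counter = Counter()
    prev = None
    for line in lines:
        if line_separated:
            # Do not pair the last word of one line with the first of the next.
            prev = None
        for word in line.lower().split():
            if prev is not None:
                counts[(prev, word)] += 1
            prev = word
    return counts


counts = count_bigrams_sketch(["the cat sat", "the cat ran"])
print(counts.most_common(1))  # [(('the', 'cat'), 2)]
```

Returning a Counter keeps the caller simple: both the CLI and the UI can ask for `.most_common(n)` without re-sorting.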

Testing

pytest

Limitations

  • The CLI histogram visually truncates the count column if the terminal is very narrow.
  • UI is textarea-only (no file upload yet). Use CLI for file parsing.
  • No DB models or caches; results aren’t saved between requests.
