COSC 524 - Natural Language Processing - Project 1
Authors: Andrei Cozma, Manan Patel, Tulsi Tailor, Zac Perry
Objective: Develop a REGEX-based chatbot for the statistical text analysis of crime novels.
Source of Novels: Project Gutenberg
- Python 3.10.X
usage: main.py [-h] -i INPUT [-v] [-t]
ChatRegex
options:
-h, --help show this help message and exit
-i INPUT, --input INPUT
path to input text file
-v, --verbose increase console output verbosity
-t, --test disables the interactive chat mode and runs a series of example prompt test cases
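The usage text above is consistent with a standard `argparse` setup. The following is a minimal sketch that reproduces the same flags and help strings; the helper name `build_parser` is hypothetical, and the actual main.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the usage text (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="main.py", description="ChatRegex")
    parser.add_argument("-i", "--input", required=True,
                        help="path to input text file")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="increase console output verbosity")
    parser.add_argument("-t", "--test", action="store_true",
                        help="disables the interactive chat mode and runs "
                             "a series of example prompt test cases")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-i", "./dataset/the_sign_of_the_four.txt"])
print(args.input, args.verbose, args.test)
```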
Example Usage:
python3 main.py -i ./dataset/the_sign_of_the_four.txt
Corresponding Output:
================================================================================
[ChatRegex ASCII-art banner]
INFO: Reading data from file: ./dataset/the_sign_of_the_four.txt
INFO: Preprocessing data...
INFO: Extracting body of text...
INFO: Normalizing chapter headings...
INFO: Normalizing character set...
INFO: Starting interactive chat session...
================================================================================
AI : Hello! What can I do for you?
--------------------------------------------------------------------------------
You: hi
AI : Hello! How can I help you?
--------------------------------------------------------------------------------
You:
================================================================================
AI : Hello! What can I do for you?
--------------------------------------------------------------------------------
You: help
AI : Special commands you can use:
help, h - Print this help message
example, ex - Print some example prompts (e.g. `example` or `example 5` to print 5 examples)
exit, quit, q - Exit the program.
--------------------------------------------------------------------------------
You: ex
AI : Example queries:
- "Words around perpetrator"
--------------------------------------------------------------------------------
You: ex 3
AI : Example questions you can ask:
- "Identify the chapter and sentence where the detective first appears."
- "When is the killer first mentioned"
- "Tell me when the criminal is first mentioned in the book."
--------------------------------------------------------------------------------
You: quit
AI : Farewell!
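The special commands in the session above (help, example with an optional count, and exit) can be recognized with a few anchored regular expressions. This is a sketch under assumptions: the function name `parse_command` is hypothetical, and the default of one example for a bare `ex` is inferred from the transcript rather than taken from the project's code.

```python
import re

# Patterns for the special commands listed in the help message.
HELP_RE = re.compile(r"^\s*(?:help|h)\s*$", re.IGNORECASE)
EXAMPLE_RE = re.compile(r"^\s*(?:example|ex)(?:\s+(\d+))?\s*$", re.IGNORECASE)
EXIT_RE = re.compile(r"^\s*(?:exit|quit|q)\s*$", re.IGNORECASE)


def parse_command(text: str):
    """Return (command, argument) for a special command, else None."""
    if HELP_RE.match(text):
        return ("help", None)
    if m := EXAMPLE_RE.match(text):
        # Assumption: a bare `ex` prints one example, as in the transcript.
        count = int(m.group(1)) if m.group(1) else 1
        return ("example", count)
    if EXIT_RE.match(text):
        return ("exit", None)
    # Anything else is treated as a regular question for the chatbot.
    return None
```

Anchoring each pattern with `^` and `$` keeps ordinary input such as "hi" from being mistaken for the `h` shortcut.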
- Source Code
- Report (max 2 pages, excluding references)
- Presentation and live analysis
- Choice between pre-recorded and a live, in-class delivery
- The Sign of the Four by Arthur Conan Doyle
- The Murder on the Links by Agatha Christie
- The Man in the Brown Suit by Agatha Christie
- Use regex from Python and the packages available in Python 3.10.
- Prompt parsing should allow flexibility in how the request/question is formulated.
- The chatbot analyzes how often the protagonists and the perpetrator(s) occur across the novel (per chapter and per sentence within a chapter), along with mentions of the crime and other circumstances surrounding the antagonists.
- The ultimate objective is to use basic NLP tools to observe any patterns in plot structures across the works of one or all authors.
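One way to give prompt parsing the flexibility described above is to build intent patterns from groups of synonyms, so that differently worded questions map to the same request. The sketch below is illustrative only: the synonym lists, the intent names, and the `classify` function are assumptions, not the project's actual vocabulary or API.

```python
import re

# Synonym groups (illustrative; the real project may use different terms).
ROLE = r"(?P<role>detective|investigator|killer|criminal|perpetrator|murderer)"
FIRST = r"(?:first|1st)\s+(?:appears?|mentioned|introduced|occurs?)"

# Each intent is a single pattern assembled from the groups above.
INTENTS = {
    "first_mention": re.compile(rf"\b{ROLE}\b.*\b{FIRST}\b", re.IGNORECASE),
    "context_words": re.compile(
        rf"\bwords\s+(?:around|surrounding|near)\s+(?:the\s+)?{ROLE}\b",
        re.IGNORECASE,
    ),
}


def classify(prompt: str):
    """Return (intent, role) for the first matching intent, else None."""
    for intent, pattern in INTENTS.items():
        if m := pattern.search(prompt):
            return intent, m.group("role").lower()
    return None


print(classify("When is the killer first mentioned"))
print(classify("Words around perpetrator"))
```

Using `search` rather than `match` lets the pattern ignore lead-in phrases such as "Tell me when".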
To analyze and report on:
- When the investigator (or investigating pair) first appears.
- When the crime is first mentioned, along with the type of crime and its details.
- When the perpetrator is first mentioned.
- The three words that occur around the perpetrator on each mention
- (i.e., the three words preceding and the three words following each mention of the perpetrator).
- When and how the detective(s) and the perpetrator(s) co-occur.
- When other suspects are first introduced.
The above should include the chapter number and the sentence number(s) within that chapter.
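Two of the analyses listed above, locating a first mention by chapter and sentence and collecting the three words on either side of a name, can be sketched with the `re` module alone. This sketch makes several assumptions: chapter headings have already been normalized to lines of the form `CHAPTER <number>` (the project's actual normalized format is not shown), sentences end at `.`, `!` or `?` followed by whitespace (so abbreviations such as "Mr." will split incorrectly), and names are single tokens matched case-sensitively. All function names are hypothetical.

```python
import re

# Assumed normalized heading format: a line containing only "CHAPTER <n>".
CHAPTER_RE = re.compile(r"^CHAPTER[ \t]+(\d+)[ \t]*$", re.MULTILINE)
# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def split_chapters(text):
    """Yield (chapter_number, chapter_body) pairs."""
    # With one capture group, split() returns
    # [preamble, num1, body1, num2, body2, ...].
    parts = CHAPTER_RE.split(text)
    for num, body in zip(parts[1::2], parts[2::2]):
        yield int(num), body.strip()


def first_mention(text, name):
    """Return (chapter, sentence_number, sentence) of the first mention."""
    name_re = re.compile(rf"\b{re.escape(name)}\b")
    for chapter, body in split_chapters(text):
        for idx, sentence in enumerate(SENTENCE_RE.split(body), start=1):
            if name_re.search(sentence):
                return chapter, idx, sentence
    return None


def words_around(text, name, n=3):
    """Return (before, after) word lists for each mention of `name`."""
    words = re.findall(r"\w+", text)
    return [
        (words[max(0, i - n):i], words[i + 1:i + 1 + n])
        for i, word in enumerate(words)
        if word == name
    ]
```

Co-occurrence of two names could reuse the same chapter and sentence loop by checking whether both patterns match within one sentence.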
- Generate responses for the prompted questions.
- Responses should be precise and written in natural, well-structured English, as if interacting with a human investigator.
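A simple way to turn an analysis result into a well-formed English sentence is a small template with ordinal formatting. The function names and the sentence template below are assumptions for illustration, and the values passed in the example are placeholders rather than results from the novels.

```python
def ordinal(n: int) -> str:
    """Format an integer as an English ordinal, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def describe_first_mention(role: str, name: str, chapter: int, sentence_no: int) -> str:
    """Render a first-mention result as one English sentence."""
    return (
        f"The {role}, {name}, is first mentioned in chapter {chapter}, "
        f"in the {ordinal(sentence_no)} sentence."
    )


# Placeholder values for illustration only.
print(describe_first_mention("detective", "Holmes", 1, 1))
```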