ngram_analysis

Ngram-based analytical idiolectal profiling

Setting

Create a contestant_source/ folder and put your participants' data in it.
Create a contestant_output/ folder. The n-gram extraction results for the participants' data are written here.
Create a corpus_source/ folder and put your preferred comparison corpus in it.
Create a corpus_output/ folder. The n-gram extraction results for the corpus are written here.
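The four folders can also be created from Python. This is only a convenience sketch: `create_folders` is a hypothetical helper that is not part of the repository, and it assumes the folders should live side by side under one base directory.

```python
from pathlib import Path

# Folder names taken from the list above.
FOLDERS = ("contestant_source", "contestant_output", "corpus_source", "corpus_output")


def create_folders(base="."):
    """Create the four working folders under `base` (hypothetical helper)."""
    created = []
    for name in FOLDERS:
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
        created.append(path)
    return created
```

Calling `create_folders()` from the repository root creates any folder that is missing and leaves existing ones untouched.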

Description

Language samples from 7 different genres were collected from n participants. The genres include both spoken and written English. The spoken genres were transcribed and saved as text files. The text file for one genre is called a container. The primary purpose is to identify the bigrams and trigrams that are shared across all genres for a participant. Punctuation is ignored.
Several corpora of your choice are then used as a reference, so that the n-gram distributions of the participants can be compared with those of the corpora.
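The idea of shared n-grams can be illustrated with a minimal sketch. It is not the code in participant.py: the function names, the lowercasing, and the tokenisation rule (word characters with internal apostrophes) are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only word tokens, so punctuation is ignored."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())


def extract_ngrams(text, n):
    """Count n-grams (tuples of n consecutive tokens) in one container."""
    tokens = tokenize(text)
    return Counter(zip(*(tokens[i:] for i in range(n))))


def shared_ngrams(containers, n):
    """Return the n-grams that occur in every container."""
    sets = [set(extract_ngrams(text, n)) for text in containers.values()]
    return set.intersection(*sets) if sets else set()


# Two made-up containers standing in for two genres of one participant.
containers = {
    "interview": "I think that, you know, it is fine.",
    "email": "You know I think that it is late!",
}
print(sorted(shared_ngrams(containers, 2)))
print(sorted(shared_ngrams(containers, 3)))
```

With n = 2 the function returns the shared bigrams and with n = 3 the shared trigrams. An n-gram that is missing from even one container is dropped.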

First time set up

Create an environment with Python 3.9 (Stable Version)

conda create -n <environment-name> python=3.9
conda activate <environment-name>
pip3 install -r requirements.txt

Process data

The processing pipeline starts with n-gram extraction and z-score calculation on the participants' data. The same procedure is then applied to the reference corpora. Finally, the z-scores of the corpora and of the participants' data are compared.
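This README does not state how the z-scores are defined. As one possible reading, the sketch below standardises each n-gram's count against the mean and population standard deviation of all n-gram counts in the same data set. Both the definition and the `zscores` function name are assumptions, not the repository's implementation.

```python
from collections import Counter
from statistics import mean, pstdev


def zscores(counts):
    """Map each n-gram to (count - mean) / std, computed over all counts."""
    values = list(counts.values())
    if not values:
        return {}
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # All counts are identical, so no n-gram deviates from the mean.
        return {ngram: 0.0 for ngram in counts}
    return {ngram: (count - mu) / sigma for ngram, count in counts.items()}


# Made-up bigram counts for illustration.
counts = Counter({("you", "know"): 3, ("i", "think"): 1})
print(zscores(counts))
```

Under this definition, a positive z-score marks an n-gram that is more frequent than the average n-gram in the same data, and a negative one marks an n-gram that is less frequent.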

for participants

python3 participant.py

After you paste in the paths of contestant_source/ and contestant_output/, the n-gram extraction and z-score results are written to the output directory.

for corpus

python3 corpus.py

After you paste in the paths of input_folder/ and output_processed_folder/, the extracted .txt data is saved in output_processed_folder/. Then enter ngram_output_file and ngram_type. The n-gram extraction and z-score results are written to that output file.

Run

python3 main.py

Paste in the directory of your participants' data, input_folder (the data that was previously processed).
Paste in your output folder, output_folder.
Paste in your input file, input_file (the reference corpus file with n-grams and z-scores).

input_folder = input("Enter the path to contestants' data: ")  # Update with your folder path
output_folder = input("Enter the path to your output folder: ")
input_file = input("Enter the path to reference corpus' data: ")  # The provided file with ngrams and zscores
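The comparison step can be pictured with the following sketch. It assumes, hypothetically, that both the participant output and the reference corpus file are CSV files with `ngram` and `zscore` columns. The real column names and the logic of main.py may differ.

```python
import csv
import io


def load_zscores(handle):
    """Read rows with hypothetical 'ngram' and 'zscore' columns into a dict."""
    return {row["ngram"]: float(row["zscore"]) for row in csv.DictReader(handle)}


def compare_zscores(participant, corpus):
    """Return (ngram, participant z, corpus z, difference) for shared n-grams."""
    rows = []
    for ngram in sorted(participant.keys() & corpus.keys()):
        p, c = participant[ngram], corpus[ngram]
        rows.append((ngram, p, c, p - c))
    return rows


# In-memory stand-ins for the two files; the values are made up.
participant_csv = io.StringIO("ngram,zscore\nyou know,2.5\ni think,1.0\n")
corpus_csv = io.StringIO("ngram,zscore\nyou know,0.5\nof the,3.0\n")
for row in compare_zscores(load_zscores(participant_csv), load_zscores(corpus_csv)):
    print(row)
```

Only n-grams present in both files are compared. A large positive difference suggests an n-gram the participant uses more distinctively than the reference corpus does.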

Tips

To make CSV files easier to read, consider installing the Rainbow CSV extension.
In VS Code, press Ctrl + P to launch Quick Open and paste:
ext install mechatroner.rainbow-csv

himynameiszim/ngram_analysis

Folders and files

Latest commit

History

Repository files navigation

ngram_analysis

Setting

Description

First time set up

Process data

for participants

for corpus

Run

Tips

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages