MizDaWiz/Text-Extraction-And-Readibility-Analysis

Text-Extraction-And-Readibility-Analysis

The text of the article at each link in the Input sheet was extracted and stored in a separate text file. Each file was then filtered for stop words.
Each article was then tokenized into three kinds of sequences (words, sentences, and syllables) using word_tokenize, sent_tokenize, and SyllableTokenizer from nltk's nltk.tokenize module (the punkt resource had to be downloaded first). Using these tokens and a few lists of positive and negative words, the following metrics were calculated for each article and stored in the output Excel sheet:

For sentiment analysis

  1. Positive Score
  2. Negative Score
  3. Polarity Score
  4. Subjectivity Score
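
The README does not spell out the formulas behind these four scores, so the following is only a minimal sketch under common conventions: the positive and negative scores are counts of matching tokens, polarity is their normalized difference, and subjectivity is their share of all tokens. The function name `sentiment_scores`, the small epsilon, and the toy word lists are assumptions for illustration, not the repository's actual code.

```python
def sentiment_scores(tokens, positive_words, negative_words):
    """Return the four sentiment scores for a list of word tokens.

    Assumed formulas (not confirmed by the README):
      polarity     = (pos - neg) / (pos + neg + eps)
      subjectivity = (pos + neg) / (total tokens + eps)
    """
    lowered = [t.lower() for t in tokens]
    positive = sum(1 for t in lowered if t in positive_words)
    negative = sum(1 for t in lowered if t in negative_words)
    eps = 1e-6  # keeps the divisions defined when a count is zero
    return {
        "positive": positive,
        "negative": negative,
        "polarity": (positive - negative) / (positive + negative + eps),
        "subjectivity": (positive + negative) / (len(lowered) + eps),
    }


# Toy input: in practice the tokens would come from word_tokenize after
# stop-word filtering, and the word sets from the positive/negative lists.
tokens = ["Good", "results", "beat", "bad", "forecasts", "great"]
scores = sentiment_scores(tokens, {"good", "great"}, {"bad"})
```

With these assumed formulas, polarity falls roughly between -1 and 1, and subjectivity between 0 and 1.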

For readability analysis

  1. Percentage of Complex Words (more than 2 syllables)
  2. Fog Index
  3. Average Words per Sentence
  4. Word Count
  5. Personal Pronouns
  6. Average Word Length
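
As a rough illustration of how the readability metrics above could be derived from tokenized text, here is a minimal sketch. It assumes the standard Gunning Fog formula, 0.4 × (average words per sentence + percentage of complex words). To stay dependency-free it uses a naive vowel-run counter as a stand-in for nltk's SyllableTokenizer. The names `count_syllables`, `readability_metrics`, and the `PRONOUNS` set are hypothetical and not taken from the repository.

```python
import re

# Hypothetical pronoun list; the README does not say which pronouns are counted.
PRONOUNS = {"i", "we", "my", "ours", "us"}


def count_syllables(word):
    """Naive stand-in for SyllableTokenizer: count runs of consecutive vowels."""
    return len(re.findall(r"[aeiouy]+", word.lower()))


def readability_metrics(sentences):
    """Compute the six metrics from a non-empty list of sentences.

    Each sentence is a list of word tokens with punctuation already removed.
    """
    words = [w for sentence in sentences for w in sentence]
    word_count = len(words)
    # A word is "complex" when it has more than 2 syllables.
    complex_count = sum(1 for w in words if count_syllables(w) > 2)
    percent_complex = 100 * complex_count / word_count
    avg_words = word_count / len(sentences)
    return {
        "percent_complex": percent_complex,
        "fog_index": 0.4 * (avg_words + percent_complex),
        "avg_words_per_sentence": avg_words,
        "word_count": word_count,
        # Lowercasing would also match the country name "US"; a real
        # implementation may want to exclude it.
        "personal_pronouns": sum(1 for w in words if w.lower() in PRONOUNS),
        "avg_word_length": sum(len(w) for w in words) / word_count,
    }


# Toy input: two already-tokenized sentences.
sentences = [["We", "analyze", "readability"], ["It", "works"]]
metrics = readability_metrics(sentences)
```

The vowel-run counter over-counts some words (silent "e", for instance), so a syllable tokenizer such as the one the README mentions should give more reliable complex-word percentages.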

About

Basic usage of some nltk tokenizers
