The text of the article at each link in the Input sheet was extracted and stored in a separate text file. Each file was then filtered for stop words.
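A minimal sketch of the filtering step, assuming nltk's built-in English stop-word list; the actual stop-word lists used may differ:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")

def filter_stop_words(text: str) -> list[str]:
    """Tokenize an article and drop English stop words."""
    stop_words = set(stopwords.words("english"))
    return [tok for tok in word_tokenize(text) if tok.lower() not in stop_words]
```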
Following this, each article was tokenized into three kinds of sequences using word_tokenize, sent_tokenize, and SyllableTokenizer from nltk's nltk.tokenize module (the punkt model had to be downloaded first).
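A sketch of the tokenization step; `article_text` stands in for the contents of one saved text file:

```python
from nltk.tokenize import word_tokenize, sent_tokenize, SyllableTokenizer

words = word_tokenize(article_text)        # word-level tokens
sentences = sent_tokenize(article_text)    # sentence-level tokens
syllable_tokenizer = SyllableTokenizer()
# Syllables per word, used later to flag complex (>2 syllable) words.
syllables_per_word = [syllable_tokenizer.tokenize(w) for w in words]
```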
Using these tokens and a few lists of positive and negative words, the following metrics were calculated for each article and stored in the output Excel sheet (a sketch of the calculations follows the list):
- Positive Score
- Negative Score
- Polarity Score
- Subjectivity Score
- Percentage of Complex Words (more than 2 syllables)
- Fog Index
- Average Words per Sentence
- Word Count
- Personal Pronouns
- Average Word Length
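A minimal sketch of the metric calculations, assuming common definitions: polarity and subjectivity normalized with a small epsilon, the Gunning Fog formula for the Fog Index, and a pronoun list of I, we, my, ours, us (excluding the country name "US"). The exact word lists and pronoun set are assumptions here, as are the helper names:

```python
import re

def compute_metrics(words, sentences, syllables_per_word,
                    positive_words, negative_words):
    """Compute the article metrics from the tokenized text.

    positive_words / negative_words: sets of lower-cased words,
    assumed to be loaded from the word lists mentioned above.
    """
    pos_score = sum(1 for w in words if w.lower() in positive_words)
    neg_score = sum(1 for w in words if w.lower() in negative_words)

    # Small epsilon keeps the ratios defined for empty articles.
    polarity = (pos_score - neg_score) / (pos_score + neg_score + 1e-6)
    subjectivity = (pos_score + neg_score) / (len(words) + 1e-6)

    complex_count = sum(1 for syls in syllables_per_word if len(syls) > 2)
    pct_complex = 100 * complex_count / len(words)

    avg_words_per_sentence = len(words) / len(sentences)

    # Gunning Fog Index: 0.4 * (avg sentence length + % complex words)
    fog_index = 0.4 * (avg_words_per_sentence + pct_complex)

    pronouns = re.compile(r"\b(I|we|my|ours|us)\b", re.IGNORECASE)
    personal_pronouns = sum(
        1 for w in words if pronouns.fullmatch(w) and w != "US"
    )

    avg_word_length = sum(len(w) for w in words) / len(words)

    return {
        "Positive Score": pos_score,
        "Negative Score": neg_score,
        "Polarity Score": polarity,
        "Subjectivity Score": subjectivity,
        "Percentage of Complex Words": pct_complex,
        "Fog Index": fog_index,
        "Average Words per Sentence": avg_words_per_sentence,
        "Word Count": len(words),
        "Personal Pronouns": personal_pronouns,
        "Average Word Length": avg_word_length,
    }
```

The returned dictionary maps one row of the output sheet, so the per-article results can be collected into a DataFrame and written to Excel in one step.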