NLP for MS Developers: Matt Barker, Colin Taylor, Taole Chen, Kaixuan Khoo, Ronan Patrick, Gus Levinson, Jack Cheng Supervisors: Chiraag Lala, Rod Middleton, Richard Nicholas
MSc Computing 21/22 Group Project Repository
The project requires Python version 3.8 or later. The required libraries can be found in the requirements.txt file. The Bag of Words model has two additional requirements: cython and scikit-learn==0.22.2.post1 (this model is not in the final release but has been left in to showcase work mentioned in our presentation).
Once downloaded, navigate to the project's directory in a terminal. Then, run AppClass.py to launch the application.
Since the software is designed to analyse CSV files, several example CSV files have been provided in the "test_csv_files" folder to showcase specific tools of the application. In addition, there are two very large example CSVs, "trump_tweets" and "imdb_reviews". The appropriate licenses and references for these files can be found below in the Licensing section. These two large files are designed to showcase how the search and frequency tools of the application can handle millions of words. Please note that these tools may take some time to load when first opened, due to the number and size of the entries in these CSV files. Also note that the IMDB dataset contains leftover HTML line-break markers ("br"), which slightly clutter the frequency counts.
This software contains several tools that are designed to analyse textual data within a CSV file. Whilst the software was originally designed to analyse unprompted free-text entries of medical patients with Multiple Sclerosis (MS), many of the analytical tools will work with any CSV file that contains a column with text entries.
From the initial screen, the first step is to click the “load CSV” button, which will prompt you to select the CSV file to be analysed. After selecting the file, you will be taken to the “ChooseCSVHeaders” page. On this page, you are required to select the headers of the loaded CSV file that contain the requested information. If no such column exists, simply choose “NONE”. The only required column is one that contains free text. Once this has been selected, the “Done” button will become unlocked, allowing progression to the main menu of the application. However, most features require more than just a free-text column. The search tool only requires a free-text column. The frequency tool requires a free-text column and a user ID column. The user and trend analysis tools require a free-text column, a user ID column, and a completed date column. Note that these requirements are a minimum; additional analysis can be done if more columns are provided.
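As a rough illustration of what the header selection amounts to, the sketch below loads a CSV with pandas and picks out the relevant columns. The file name and column names ("FreeText", "UserID", "CompletedDate") are hypothetical, not the application's actual identifiers.

```python
import pandas as pd

# Hypothetical file and column names, chosen for illustration only.
df = pd.read_csv("my_entries.csv")

free_text_col = "FreeText"            # required by every tool
user_id_col = "UserID"                # needed by the frequency, user, and trend tools
completed_date_col = "CompletedDate"  # needed by the user and trend analysis tools

entries = df[free_text_col].dropna().astype(str)
print(f"Loaded {len(entries)} free-text entries")
```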
This tool allows you to search for a given phrase in all the free-text entries. Upon first entering this tool’s page, some pre-processing is done (such as removing punctuation and converting all words to lower case), which may take a while if the CSV file is sufficiently large. Search results are displayed in the text box at the top of the window. In addition, you can specify how many words either side of the phrase you wish to display for each entry (by entry, we mean a ‘row’ in the CSV file) in which the phrase appears. If ‘all’ is selected, the entirety of every matching entry is displayed. Along with the free text, additional information can be displayed for each entry, such as the date of birth of the user who made the entry (assuming you have specified a date of birth header on the “ChooseCSVHeaders” page). Each matching entry is separated by dashes, and the display box can be scrolled if the number of matching entries is large. Query results can be downloaded as a .txt file by pressing the download button. To search for another phrase, the clear button must be pressed to reset the output display.
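To make the context-window behaviour concrete, here is a minimal sketch of that kind of search, assuming entries have already been lower-cased and stripped of punctuation. The function name and data are illustrative, not the application's own code.

```python
def search_with_context(entries, phrase, context=3):
    """Return each match of `phrase` with `context` words either side."""
    phrase_tokens = phrase.lower().split()
    n = len(phrase_tokens)
    results = []
    for entry in entries:
        tokens = entry.lower().split()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == phrase_tokens:
                start = max(0, i - context)      # clamp at the start of the entry
                results.append(" ".join(tokens[start:i + n + context]))
    return results

# Example: two words of context either side of the phrase.
print(search_with_context(["My blood pressure was high today"], "blood pressure", context=2))
# ['my blood pressure was high']
```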
This tool allows you to query the frequency of a specified n-gram in the free text, as well as plot a graph of the most frequent n-grams in the text. Upon first entering this tool’s page, a lot of pre-processing is done, which may take a while if the CSV file contains many rows and a large amount of text. An n-gram is a sequence of n consecutive words. For example, “blood pressure” is a bi-gram, so if the frequency of “blood pressure” is 10, there are 10 occurrences of “blood pressure” across all text entries. A maximum of four sequential words (i.e., quad-grams) can be queried. This is to save on pre-processing time and, in our experience, an n larger than four does not lead to many interesting results, as the frequency counts get very small.
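The sketch below shows one way such counts can be produced, using NLTK's ngrams helper and a Counter; it illustrates the idea rather than reproducing the application's exact pre-processing.

```python
from collections import Counter
from nltk.util import ngrams

entries = ["my blood pressure was high", "blood pressure checked again"]
n = 2  # bi-grams; the tool supports n up to 4 (quad-grams)

counts = Counter()
for entry in entries:
    counts.update(ngrams(entry.lower().split(), n))

print(counts[("blood", "pressure")])  # 2
print(counts.most_common(3))          # most frequent bi-grams overall
```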
For plotting the most frequent n-grams, there is a series of options you can specify, located in the “settings” section. The “remove stopwords” option will consider the frequency of n-grams in the free text with stopwords removed. Stopwords are commonly used words, such as “the”, “is”, “this”, and “and”, that do not add much meaning. The list of stopwords we use comes from NLTK. The “medical terms only” option will consider the frequency of n-grams in the free text with all words removed other than those appearing in a medical lexicon. The medical lexicon we use is from Aristotelis P., R. Robinson, and Rajasekharan N., published under the GPL-3.0 license, and can be found on GitHub at: https://github.com/glutanimate/wordlist-medicalterms-en.
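As a hedged example of how the two filters behave, the snippet below applies the NLTK stopword list and a medical lexicon to a tokenised entry. It assumes the NLTK stopwords corpus has been downloaded, and uses a tiny stand-in set instead of the full wordlist-medicalterms-en lexicon.

```python
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
medical_terms = {"blood", "pressure", "fatigue"}  # stand-in for the full medical lexicon

tokens = "the blood pressure is high and the fatigue is bad".split()

no_stopwords = [t for t in tokens if t not in stop_words]  # "remove stopwords" option
medical_only = [t for t in tokens if t in medical_terms]   # "medical terms only" option

print(no_stopwords)  # ['blood', 'pressure', 'high', 'fatigue', 'bad']
print(medical_only)  # ['blood', 'pressure', 'fatigue']
```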
The “only count a word once per user” option does what it says on the tin: if the same user has multiple entries in the free text (identified by their user ID), and all of these entries contain the bi-gram “blood pressure”, it is only counted once towards the frequency score. Similarly, the “only count a word once per entry” option scores a single count if an n-gram is mentioned multiple times in the same entry; if two entries by the same user mention the same n-gram, it is counted twice under this option. If both of these options are ticked, the “only count a word once per user” option takes precedence.
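The following sketch illustrates the difference between the two counting modes, using sets to de-duplicate bi-grams per entry and per user; the data and variable names are illustrative only.

```python
from collections import Counter

# (user_id, entry_text) pairs, purely illustrative.
entries = [
    ("user1", "blood pressure high blood pressure"),
    ("user1", "blood pressure normal"),
    ("user2", "blood pressure fine"),
]

once_per_entry = Counter()
seen_by_user = {}

for user_id, text in entries:
    tokens = text.split()
    grams = set(zip(tokens, tokens[1:]))   # unique bi-grams within this entry
    once_per_entry.update(grams)           # counted at most once per entry
    seen_by_user.setdefault(user_id, set()).update(grams)

once_per_user = Counter()
for grams in seen_by_user.values():
    once_per_user.update(grams)            # counted at most once per user

print(once_per_entry[("blood", "pressure")])  # 3: once in each of the three entries
print(once_per_user[("blood", "pressure")])   # 2: user1 and user2
```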
The final two options require an “MS type” column to be selected, and are only relevant for CSV files directly related to MS. The “MS type” drop-down affects the frequency that is displayed on the graph; the frequency will simply be the frequency scores for entries where the user has the specified type of MS. If “plot by MS type” is selected, four graphs will be displayed once the “plot most frequent n-grams” button is pressed, one for each type of MS.
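Conceptually, the “plot by MS type” option amounts to grouping the frequency counts by MS type before plotting, roughly as sketched below. The MS type labels and data here are placeholders, not values taken from any particular CSV.

```python
from collections import Counter, defaultdict

# (MS type, bi-gram) pairs; the MS type labels are placeholders.
rows = [
    ("RRMS", ("blood", "pressure")),
    ("RRMS", ("very", "tired")),
    ("SPMS", ("blood", "pressure")),
    ("PPMS", ("very", "tired")),
]

counts_by_type = defaultdict(Counter)
for ms_type, ngram in rows:
    counts_by_type[ms_type][ngram] += 1

# In the application, each MS type would get its own bar chart.
for ms_type, counts in counts_by_type.items():
    print(ms_type, counts.most_common(3))
```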
This tool allows analysis of a specific user, identified by their user ID. The tool can display the free-text entries of that user, the sentiment score for those entries, and the user’s disability score at the time of each free-text entry (if provided). If the user has more than two entries, graphs for sentiment and disability scores can be displayed to show how they have changed over time. The “combine” option allows sentiment and disability scores to be overlaid onto a single graph, to potentially highlight any correlation over time.
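The sentiment scores are produced with NLTK's VADER model (see the Licensing section). A minimal sketch of scoring a user's entries, using made-up example text, might look like this; it requires nltk.download("vader_lexicon") beforehand.

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Made-up entries for a single user.
user_entries = [
    "Feeling much better this week, walking further than before.",
    "Very tired and struggling with balance today.",
]

for text in user_entries:
    score = analyzer.polarity_scores(text)["compound"]  # -1 (negative) to +1 (positive)
    print(f"{score:+.3f}  {text}")
```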
This tool plots the trend of multiple users’ sentiment and disability scores over time. A minimum of 20 users is required to use this tool. The trend plots themselves get fairly ‘busy’ with many more than 20 users. What is perhaps more useful are the distribution plots. These plots are bar charts, where the y-axis is the number of entries and the x-axis is the value of the sentiment/disability score. This is a useful way to gauge the overall sentiment/disability scores of entries in the free text.
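As a rough sketch of what a distribution plot boils down to, the snippet below bins sentiment scores into a histogram with matplotlib; the scores are made-up example values and the exact binning may differ from the application's.

```python
import matplotlib.pyplot as plt

# Made-up compound sentiment scores, one per entry.
sentiment_scores = [0.2, -0.4, 0.8, 0.1, -0.1, 0.5, 0.3, -0.6, 0.0, 0.7]

plt.hist(sentiment_scores, bins=10, range=(-1, 1), edgecolor="black")
plt.xlabel("Sentiment score")
plt.ylabel("Number of entries")
plt.title("Distribution of sentiment scores across entries")
plt.show()
```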
We aim to publish the project, at a later date, on a public GitHub repo, under a "free for non-commercial use" license.
Our GitLab repo and project contain a few resources that others have made. As already mentioned and referenced, we have used an open-source medical dictionary to filter the free text for medical terms. In addition, we have used a tkinter theme by rdbende, which is available under the MIT license and can be accessed at: https://github.com/rdbende/Azure-ttk-theme
All images used in our application (the four icons on the main menu) have been sourced from "free for non-commercial use" sites. Some icons have been sourced from https://iconarchive.com/, whilst others have been sourced from https://www.pexels.com/.
In addition, we have made use of NLTK's stopwords list, n-gram tool, and VADER sentiment model, which can be found on the NLTK data page at: https://www.nltk.org/nltk_data/
Finally, we have included a couple of publicly available CSV test files. The Trump Tweets CSV was scraped by Austin Reese, published under the CC0: Public Domain license, and is available at: https://www.kaggle.com/datasets/austinreese/trump-tweets. The IMDB Reviews dataset was created by Andrew L. Maas et al. Their original paper using the dataset is: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011). The paper can be accessed at: http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf
Our other test CSVs, which we created ourselves, contain Lorem Ipsum text, text from Shakespeare plays, and scraped Wikipedia text.