Language Model for Patronising and Condescending Language recognition. Coursework for the NLP course @Imperial College London. The full research report can be found here.
The full data description can be found in the original paper.
Labeled Training Dataset: contains paragraphs with a label from 0 (not containing PCL) to 4 (highly patronising or condescending) towards vulnerable communities, one instance per line, with the following columns:
- "par_id": unique id for each one of the paragraphs in the corpus.
- "art_id": the document id in the original NOW (News on Web) corpus.
- "keyword": the search term used to retrieve texts about a target community.
- "country_code": ISO Alpha-2 country code for the source media outlet.
- "text": the paragraph containing the keyword.
- "label": an integer between 0 and 4, which reflects the level of PCL as a result of combined annotations of two annotators, with 0 (No PCL), 1 (borderline PCL) and 2 (contains PCL). The combined annotations have been used in the following graded scale:
- 0 -> Annotator 1 = 0 AND Annotator 2 = 0
- 1 -> (Annotator 1 = 0 AND Annotator 2 = 1) OR (Annotator 1 = 1 AND Annotator 2 = 0)
- 2 -> Annotator 1 = 1 AND Annotator 2 = 1
- 3 -> (Annotator 1 = 1 AND Annotator 2 = 2) OR (Annotator 1 = 2 AND Annotator 2 = 1)
- 4 -> Annotator 1 = 2 AND Annotator 2 = 2
Following the experiments reported in the paper, we grouped the five numeric labels into a binary classification (see the sketch after this list):
- {0,1} = No PCL (0)
- {2,3,4} = PCL (1)
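A minimal sketch of this binarisation with pandas; the file name and tab separator are assumptions about how the dataset is stored locally, not something specified above:

```python
import pandas as pd

# Column names follow the data description above; the file name and
# separator are assumptions.
cols = ["par_id", "art_id", "keyword", "country_code", "text", "label"]
df = pd.read_csv("dontpatronizeme_pcl.tsv", sep="\t", names=cols)

# {0, 1} -> 0 (no PCL), {2, 3, 4} -> 1 (PCL)
df["binary_label"] = (df["label"] >= 2).astype(int)
```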
The project is written in Python. The following third-party packages are required to ensure full project functionality:
- numpy: Python package for mathematical and scientific computing. Actively maintained (last updated: December 2024), open-source and commercially supported.
- pandas: Python package for data analysis, particularly suitable for handling relational and labelled data. Actively maintained (last updated: September 2024) and open-source.
- scikit-learn: Python package for data modeling and machine learning algorithms. Actively maintained (last updated: January 2025), open-source and commercially supported.
All requirements (along with the versions used) can be found in the requirements.txt file.
'Language Model': RoBERTa-based Transformer architecture. Training data is augmented with 10% synonym replacement via WordNet and with backtranslation via MarianMT (English→German→English, applied to 20% of paragraphs). The inputs were pre-processed to include the community label at the start of each sentence (see the sketch below).
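A minimal sketch of the two augmentation steps, assuming NLTK's WordNet and the Helsinki-NLP MarianMT checkpoints, and interpreting the 10% rate as a per-token replacement probability; the actual pipeline may differ:

```python
import random

from nltk.corpus import wordnet  # requires: nltk.download("wordnet")
from transformers import MarianMTModel, MarianTokenizer


def replace_synonyms(text: str, p: float = 0.1) -> str:
    """Swap roughly a fraction p of tokens for a WordNet synonym."""
    out = []
    for tok in text.split():
        synsets = wordnet.synsets(tok)
        if synsets and random.random() < p:
            lemmas = {l.name().replace("_", " ") for s in synsets for l in s.lemmas()}
            lemmas.discard(tok)
            out.append(random.choice(sorted(lemmas)) if lemmas else tok)
        else:
            out.append(tok)
    return " ".join(out)


def backtranslate(texts: list[str]) -> list[str]:
    """English -> German -> English round trip with MarianMT."""
    def translate(batch: list[str], checkpoint: str) -> list[str]:
        tok = MarianTokenizer.from_pretrained(checkpoint)
        model = MarianMTModel.from_pretrained(checkpoint)
        ids = model.generate(**tok(batch, return_tensors="pt", padding=True))
        return tok.batch_decode(ids, skip_special_tokens=True)

    german = translate(texts, "Helsinki-NLP/opus-mt-en-de")
    return translate(german, "Helsinki-NLP/opus-mt-de-en")
```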
'SVM': Support Vector Machine (SVM) with regularization parameter C=10, 'scale' kernel coefficient (gamma) and an 'rbf' kernel. It uses a TF-IDF weighted Bag-of-Words representation of the paragraphs as input (see the sketch below).
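This baseline maps directly onto scikit-learn; C, gamma and the kernel come from the description above, while leaving the vectoriser at its defaults is an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

svm_baseline = make_pipeline(
    TfidfVectorizer(),                       # TF-IDF weighted bag-of-words
    SVC(C=10, gamma="scale", kernel="rbf"),  # RBF-kernel SVM
)

# Usage (train_texts/train_labels/test_texts are placeholders):
# svm_baseline.fit(train_texts, train_labels)
# predictions = svm_baseline.predict(test_texts)
```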
'BiLSTM': 2-layer bidirectional Long Short-Term Memory (LSTM) model with 20 units and a dropout rate of 0.25 per layer. It uses 300-dimensional Word2Vec skip-gram word embeddings (trained on the Google News corpus) as input (see the sketch below).
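A minimal PyTorch sketch of this architecture; the framework choice, the frozen embedding layer, and the classifier head are assumptions not stated above:

```python
import torch
import torch.nn as nn


class BiLSTMClassifier(nn.Module):
    """2-layer BiLSTM with 20 units over 300-d Word2Vec embeddings (sketch)."""

    def __init__(self, embeddings: torch.Tensor):
        super().__init__()
        # `embeddings` would hold the pre-trained Google News Word2Vec
        # vectors, one 300-d row per vocabulary entry (frozen here).
        self.embed = nn.Embedding.from_pretrained(embeddings, freeze=True)
        self.lstm = nn.LSTM(
            input_size=300, hidden_size=20, num_layers=2,
            bidirectional=True, dropout=0.25, batch_first=True,
        )
        self.head = nn.Linear(2 * 20, 1)  # forward + backward final states

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        # Concatenate the last layer's forward and backward hidden states.
        h = torch.cat([h_n[-2], h_n[-1]], dim=-1)
        return self.head(h).squeeze(-1)  # logit for binary PCL prediction
```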
- Asia Belfiore (ab6124)
- Ginevra Cepparulo (gc1424)
- Aslihan Gülseren (ag724)