A text article dataset is classified into 5 categories, namely Sport, Tech, Business, Entertainment and Politics, using a Natural Language Processing approach.
Given text documents labelled with these 5 categories, can we categorize unseen articles into the same 5 categories?
Dataset Credit: Credit dataset
import pandas as pd

df = pd.read_csv(CSV_URL)  # CSV_URL holds the dataset location
There are 2225 rows in the dataset.
Create a bar plot of the category counts
import seaborn as sns

sns.countplot(x='category', data=df)
Get the unique target values and inspect for anomalies, duplicates and missing values
df['category'].unique()
# There are 5 categories: ['tech', 'business', 'sport', 'entertainment', 'politics']
# Check for NaN or missing values
df.isna().sum()  # no missing values found
# Check for duplicated values
df.duplicated().sum()  # there are 99 duplicated rows
df[df.duplicated()]  # extract the duplicated rows for inspection
- Remove all the duplicated rows
df = df.drop_duplicates()  # removes the 99 duplicates, leaving 2126 rows
df.duplicated().sum()  # no more duplicate values
# Extract the text feature and the 5-category target
category = df['category'].values
text = df['text'].values
- Remove numerics inside the text using regex
import re

for index, t in enumerate(text):
    # ^ inside [] means NOT: anything that is not a-z or A-Z is substituted with a space,
    # so all numerics are removed; the result is lower-cased and split into a list of words
    text[index] = re.sub('[^a-zA-Z]', ' ', t).lower().split()

text[10]  # check that every word has been lower-cased and split into a list of words
Everything in the text column is used as the feature, since this is NLP data.
- Use tokenization so that each word is mapped to an integer index
from tensorflow.keras.preprocessing.text import Tokenizer

vocab_size = 400
OOV_token = 'OOV'
tokenizer = Tokenizer(num_words=vocab_size, oov_token=OOV_token)
tokenizer.fit_on_texts(text)  # learn the vocabulary from all the words
word_index = tokenizer.word_index
# Encode the text into numbers using the fitted tokenizer
train_sequences = tokenizer.texts_to_sequences(text)  # convert the words into numbers
Get the average number of words per row to decide the padding length.
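A minimal sketch for checking this, using the train_sequences produced above (numpy is imported here for the statistics):

import numpy as np

lengths = [len(seq) for seq in train_sequences]
print('mean length:', np.mean(lengths))      # average number of words per article
print('median length:', np.median(lengths))  # more robust against very long articles

The max_len used below is chosen based on these statistics.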
- Padding and truncating
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = 380
padded_text = pad_sequences(train_sequences,
                            maxlen=max_len,
                            padding='post',
                            truncating='post')  # now all rows have equal length
- One-hot encode the target (category)
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse=False)  # the argument is named sparse_output on scikit-learn >= 1.2
category = ohe.fit_transform(np.expand_dims(category, axis=-1))
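The fitted encoder can later map one-hot rows (or model predictions) back to category names; a small usage sketch:

print(ohe.categories_)                      # the 5 learned category labels
print(ohe.inverse_transform(category[:1]))  # recover the original label of the first row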
- Train-test split, so the classifier can be evaluated on held-out data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(padded_text,
                                                    category,
                                                    test_size=0.3,
                                                    random_state=123)
X_train = np.expand_dims(X_train, axis=-1)
X_test = np.expand_dims(X_test, axis=-1)
Use LSTM, Dropout, Dense, Input, Embedding and Bidirectional layers, defined in module_for_article_analysis()
nb_features = 380
output_node = len(y_train[1])  # number of target classes (5)
model = Model_Creation().NLP_model(nb_features, output_node, vocab_size,
                                   embedding_dim=128, drop_rate=0.2,
                                   num_node=128)
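The model itself lives in module_for_article_analysis; below is only a plausible sketch of what NLP_model could look like, given the layers and arguments named above, and not the module's exact code (note that with an Embedding layer the 2-D padded sequences could also be fed directly, without the expand_dims step):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dropout, Dense

class Model_Creation:
    def NLP_model(self, nb_features, output_node, vocab_size,
                  embedding_dim=128, drop_rate=0.2, num_node=128):
        model = Sequential()
        model.add(Input(shape=(nb_features,)))                           # 380 padded tokens per article
        model.add(Embedding(vocab_size, embedding_dim, mask_zero=True))  # masking, credited with the accuracy jump below
        model.add(Bidirectional(LSTM(num_node, return_sequences=True)))
        model.add(Dropout(drop_rate))
        model.add(LSTM(num_node))
        model.add(Dropout(drop_rate))
        model.add(Dense(output_node, activation='softmax'))              # 5-way classifier output
        model.summary()
        return model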
Callbacks
from tensorflow.keras.callbacks import TensorBoard

tensorboard_callback = TensorBoard(log_dir=LOG_FOLDER_PATH)
Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])  # categorical crossentropy suits a multi-class classification problem
Visualising the Natural Language Processing model
from tensorflow.keras.utils import plot_model

plot_model(model, show_shapes=True, show_layer_names=True)
Total Parameters: 660,229
Model fitting and testing
hist = model.fit(X_train, y_train, batch_size=20, epochs=80,
                 validation_data=(X_test, y_test),
                 callbacks=[tensorboard_callback])
# ~27% accuracy was achieved on the first training run using only 2 LSTM layers;
# after adding Bidirectional it still did not improve (~29%)
Plot the fitted model's training history using module_for_article_analysis()
Model_Analysis().plot_analysis(hist)
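A hypothetical sketch of plot_analysis, assuming it draws the loss and accuracy curves stored in hist.history (the key names follow the metrics=['acc'] setting above):

import matplotlib.pyplot as plt

class Model_Analysis:
    def plot_analysis(self, hist):
        plt.figure()  # training vs validation loss
        plt.plot(hist.history['loss'], label='train loss')
        plt.plot(hist.history['val_loss'], label='val loss')
        plt.legend()
        plt.show()
        plt.figure()  # training vs validation accuracy
        plt.plot(hist.history['acc'], label='train acc')
        plt.plot(hist.history['val_acc'], label='val acc')
        plt.legend()
        plt.show()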
Plot hist using the TensorBoard log directory
# The model starts overfitting from around epoch 50 onwards
Evaluate the model by plotting a ConfusionMatrixDisplay graph and generating a classification report in module_for_article_analysis()
Model_Analysis().Model_Evaluation(model,X_test,y_test)
- Confusion matrix score
- Confusion matrix plot
- Classification report
- Accuracy score
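A hypothetical sketch of Model_Evaluation (a method on the same Model_Analysis class sketched earlier, shown standalone here), assuming y_test stays one-hot encoded and the model outputs softmax probabilities:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (confusion_matrix, ConfusionMatrixDisplay,
                             classification_report, accuracy_score)

class Model_Analysis:
    def Model_Evaluation(self, model, X_test, y_test):
        y_pred = np.argmax(model.predict(X_test), axis=1)   # predicted class indices
        y_true = np.argmax(y_test, axis=1)                  # decode the one-hot targets
        cm = confusion_matrix(y_true, y_pred)
        print(cm)                                           # confusion matrix score
        ConfusionMatrixDisplay(cm).plot()                   # confusion matrix plot
        plt.show()
        print(classification_report(y_true, y_pred))        # per-class precision/recall/F1
        print('accuracy:', accuracy_score(y_true, y_pred))  # overall accuracy score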
# Save the NLP model
# Save the tokenizer using JSON
token_json = tokenizer.to_json()
# Save the One Hot Encoding model
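A minimal sketch of the saving step; the file paths here are placeholders, not the project's actual paths:

import json
import pickle

model.save('saved_model.h5')            # persist the trained NLP model
with open('tokenizer.json', 'w') as f:  # persist the tokenizer as JSON
    json.dump(token_json, f)
with open('ohe.pkl', 'wb') as f:        # persist the fitted One Hot Encoder
    pickle.dump(ohe, f)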
- Using only 2 LSTM layers, the model achieves just 29% accuracy
- Adding 1 Embedding layer improves performance only minimally
- Adding masking and training for more epochs increases the accuracy up to 82%
- The model achieved an average F1-score of 83% and an accuracy score of 83%
- Evaluated on the test data, the model reaches 83% accuracy
- Adding EarlyStopping reduces the accuracy to only 29%
- To further increase the performance of the NLP model: 1) increase the number of epochs, 2) increase the number of samples, 3) tune the dropout rate, 4) add word2vec embeddings and remove stop words from the dataset
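To answer the opening question, the saved artifacts can classify unseen articles; a hedged end-to-end sketch (the file paths and sample text are placeholders, and the cleaning mirrors the training steps above):

import json
import pickle
import re
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from tensorflow.keras.preprocessing.sequence import pad_sequences

model = load_model('saved_model.h5')
with open('tokenizer.json') as f:
    tokenizer = tokenizer_from_json(json.load(f))
with open('ohe.pkl', 'rb') as f:
    ohe = pickle.load(f)

new_article = 'The match ended with a dramatic late goal ...'  # placeholder text
cleaned = re.sub('[^a-zA-Z]', ' ', new_article).lower()        # same regex cleaning as training
seq = tokenizer.texts_to_sequences([cleaned])
padded = pad_sequences(seq, maxlen=380, padding='post', truncating='post')
pred = model.predict(np.expand_dims(padded, axis=-1))          # mirror the expand_dims applied to X_train/X_test
print(ohe.categories_[0][np.argmax(pred, axis=1)])             # predicted category name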