Crime Data Topic Modeling using LDA 🚔

This project performs topic modeling on a dataset of crime-related news articles about New York City using Latent Dirichlet Allocation (LDA), with both Gensim's implementation and Mallet's LDA model. The articles were collected over a two-month period by repeatedly querying the News API. The goal is to extract the key crime-related topics for further analysis.

Table of Contents 📜

Data Preprocessing 🧹
Text-Cleansing Process ✂️
Methodology 🧠
- Text Mining 🔍
- Topic Modeling 🏷️
- LDA Model on Crime Data ⚖️
Results 📊
- Coherence Model 📈
- Visualizing Our LDA Model 🖼️
Conclusion 📌

Data Preprocessing 🧹

The dataset consists of articles about crime in NYC, obtained from a wide range of US online news sources. The articles arrive as raw JSON, from which the following fields were extracted: content, description, URL, publishedAt, title, and source. These fields supply the text and metadata for topic modeling, and the description field is the key feature for analysis.
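
As a minimal sketch of this extraction step, the snippet below pulls the listed fields out of a News API style response. The sample payload, its placeholder values, and the `extract_fields` helper are illustrative assumptions, not the project's actual code or data; the layout (a top-level `articles` list with a nested `source` object) follows the News API response format.

```python
import json

# Hypothetical sample payload in the News API response layout (not real data).
RAW = """
{
  "status": "ok",
  "totalResults": 1,
  "articles": [
    {
      "source": {"id": null, "name": "Example News"},
      "title": "Example headline",
      "description": "Example description of a reported incident.",
      "url": "https://example.com/article",
      "publishedAt": "2020-01-01T00:00:00Z",
      "content": "Example body text..."
    }
  ]
}
"""

FIELDS = ("title", "description", "content", "url", "publishedAt")


def extract_fields(raw_json):
    """Return one flat dict per article with the fields used for modeling."""
    payload = json.loads(raw_json)
    rows = []
    for article in payload.get("articles", []):
        # Missing or null fields become empty strings.
        row = {field: article.get(field) or "" for field in FIELDS}
        # "source" is a nested object; keep only its display name.
        row["source"] = (article.get("source") or {}).get("name", "")
        rows.append(row)
    return rows


rows = extract_fields(RAW)
print(rows[0]["description"])
```

Flattening each article into a plain dict keeps the later cleaning steps independent of the API's nesting.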

Text-Cleansing Process ✂️

To prepare the data for machine learning, the text was cleaned to normalize the language and make it machine-readable. The following steps were applied:

Tokenization 🏷️: The text is split into a list of tokens (words) after removing punctuation.
Removing Stop Words 🚫: Words such as "the", "if", "and", etc., that do not add significant meaning, are removed.
Stemming 🔄: Different forms of a word (e.g., "run", "running") are reduced to a common base form.

These steps ensure the text is processed and ready for topic modeling.
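
The three steps above can be sketched with the standard library alone. This is a toy illustration, not the project's pipeline: the stop-word list is deliberately tiny, and `toy_stem` is a rough, hypothetical stand-in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "if", "and", "a", "an", "in", "of", "to", "is", "was", "were"}


def tokenize(text):
    """Lowercase the text and keep only alphabetic runs, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed"):
        stem = token[: -len(suffix)]
        if token.endswith(suffix) and len(stem) >= 3:
            # Undo consonant doubling: "runn" -> "run" (but keep "ll", "ss", "zz").
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    # Strip a plural "s", but leave words such as "pass" alone.
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def clean(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]


print(clean("The suspect was running, and officers arrested him."))
```

With this toy stemmer, "run" and "running" both map to "run", which is the behavior the stemming step describes.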

Methodology 🧠

Text Mining 🔍

Text mining is the process of extracting meaningful information from text. In this project, we employed both statistical and machine learning techniques to structure the data, find patterns, and extract insights.

Topic Modeling 🏷️

Topic modeling is a method for discovering hidden thematic structures in a large collection of documents. This project uses LDA (Latent Dirichlet Allocation), a probabilistic model that helps to extract the main topics within a collection of news articles.

LDA assumes that each document is a mixture of several topics, and that each topic is a probability distribution over words.
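
To make this assumption concrete, here is a small sketch of the generative story LDA assumes: draw a topic mixture for the document, then for each word position pick a topic and then a word from that topic. The two topics and their word probabilities are invented for illustration, and the topics are held fixed here, whereas full LDA also treats each topic's word distribution as drawn from a Dirichlet prior.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document following LDA's generative story."""
    # 1. Draw this document's topic mixture (theta).
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to theta.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        topic = topics[k]
        word = rng.choices(list(topic), weights=list(topic.values()))[0]
        words.append(word)
    return theta, words


# Made-up topics: each maps words to probabilities that sum to 1.
topics = [
    {"robbery": 0.5, "store": 0.3, "cash": 0.2},
    {"court": 0.5, "trial": 0.3, "judge": 0.2},
]

rng = random.Random(0)
theta, doc = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(doc)
```

Training an LDA model runs this story in reverse: given only the documents, it infers the topic mixtures and word distributions that most plausibly produced them.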

LDA Model on Crime Data ⚖️

We used Gensim's LDA implementation and Mallet's LDA model to analyze the crime data. The key steps were:

Creating a Dictionary and Corpus 📚: Gensim assigns a unique integer ID to each word (the dictionary) and represents each document as a list of (word ID, frequency) pairs (the corpus).
Training the LDA Model 💻: We trained the LDA model with different numbers of topics, adjusting the alpha (document-topic prior) and beta (topic-word prior, named eta in Gensim) parameters.
Extracting Topics 📝: The resulting topics, each a ranked list of words, summarize the key themes in the crime data.
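
The dictionary and corpus step can be sketched in plain Python to show the shape of the data the model is trained on. The helpers `build_dictionary` and `doc_to_bow` and the sample documents are hypothetical; in the project this role is filled by Gensim, whose ID assignment may differ from the first-appearance order used here.

```python
from collections import Counter


def build_dictionary(docs):
    """Map each distinct token to an integer ID, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Turn a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Invented, already-cleaned documents for illustration.
docs = [
    ["police", "arrest", "suspect", "robbery"],
    ["robbery", "suspect", "flee", "robbery"],
]

token2id = build_dictionary(docs)
corpus = [doc_to_bow(doc, token2id) for doc in docs]
print(token2id)
print(corpus)
```

In Gensim, the corresponding calls are `corpora.Dictionary(docs)` and `dictionary.doc2bow(doc)`, and the resulting corpus is what `LdaModel` is trained on.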

Results 📊

Coherence Model 📈

The coherence score measures how semantically consistent the top words of each discovered topic are. A higher score indicates more interpretable topics. Our best coherence score was approximately 0.53 (53%), which suggests reasonably interpretable topics.
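
To show the idea behind coherence scoring, the sketch below computes a UMass-style score from document co-occurrence counts. This is an assumption for illustration only: the project does not state which coherence measure produced the 53% figure, and UMass-style scores are on a different scale, so the numbers are not comparable. The function name and sample documents are made up.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """Average log((D(wi, wj) + 1) / D(wj)) over word pairs, wj ranked above wi."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for j, i in combinations(range(len(top_words)), 2):
        w_j, w_i = top_words[j], top_words[i]  # w_j is the higher-ranked word
        d_j = doc_freq(w_j)
        if d_j == 0:
            continue  # skip words that never occur in the reference documents
        scores.append(math.log((doc_freq(w_i, w_j) + 1) / d_j))
    return sum(scores) / len(scores) if scores else 0.0


# Invented reference documents for illustration.
docs = [
    ["robbery", "suspect", "police"],
    ["robbery", "suspect"],
    ["court", "judge"],
    ["court", "trial", "judge"],
]

coherent = umass_coherence(["robbery", "suspect"], docs)
mixed = umass_coherence(["robbery", "judge"], docs)
print(coherent, mixed)
```

Words that tend to appear in the same documents score higher than words that never co-occur, which is the intuition behind preferring the model with the best coherence.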

Visualizing Our LDA Model 🖼️

We used the pyLDAvis package to visualize the topics generated by the LDA model. This interactive visualization helps in understanding the relationships between topics and the most significant words in each topic.

Conclusion 📌

Topic modeling, and LDA in particular, is a useful tool for analyzing crime data. By surfacing hidden patterns and themes, LDA can give crime analysts and law enforcement better insight into criminal activity, which may support faster investigations and help resolve unsolved cases.

The techniques explored in this project demonstrate the value of LDA for understanding large text datasets in the crime domain.

🔧 Tools Used:

Python (v3.4)
Gensim Library
Mallet's LDA Model
pyLDAvis for Visualization

Feel free to clone or fork this repository and start analyzing crime-related data with LDA! 🚀

📸 Some Screenshots of the Project 🖼️✨

Repository: Zaibten/Crime-News-AI-NLP-Machine-Learning
