This project aims to perform topic modeling on a dataset of crime-related news articles in New York City using Latent Dirichlet Allocation (LDA) and Mallet's LDA Model. The articles were collected over a two-month period through continuous querying using the News API. The goal is to extract key topics related to crime for better insights and analysis.
The dataset consists of NYC crime articles obtained from a wide range of U.S. online news sources. The articles are stored as raw JSON, from which the properties content, description, URLs, publishedAt, title, and source were extracted. These attributes are essential for topic modeling, with description providing a key feature for analysis.
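As a rough illustration of the extraction step, the sketch below pulls those properties out of a JSON payload. The sample payload is invented, and its layout (an `articles` array, a nested `source` object, a `url` field) is an assumption based on the general shape of News API responses, not the project's actual files. `FIELDS` and `extract_records` are hypothetical names.

```python
import json

# Hypothetical sample mimicking the assumed layout of a News API response;
# every value here is invented for illustration only.
raw = """
{
  "status": "ok",
  "articles": [
    {
      "source": {"id": null, "name": "Example News"},
      "title": "Example headline",
      "description": "Short example summary of the article.",
      "url": "https://example.com/article",
      "publishedAt": "2020-01-01T00:00:00Z",
      "content": "Example body text..."
    }
  ]
}
"""

FIELDS = ("title", "description", "content", "url", "publishedAt")

def extract_records(payload):
    """Return one flat dict per article, keeping only the fields of interest."""
    records = []
    for article in payload.get("articles", []):
        record = {field: article.get(field) for field in FIELDS}
        # "source" is a nested object; keep just its display name.
        record["source"] = (article.get("source") or {}).get("name")
        records.append(record)
    return records

records = extract_records(json.loads(raw))
# The description field is the text that would feed the later cleaning steps.
descriptions = [r["description"] for r in records if r["description"]]
```

Using `.get()` throughout means an article with a missing field yields `None` instead of raising, which is convenient when the sources are heterogeneous.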
In order to prepare the data for machine learning, a text-cleaning process was performed to simplify the language and make it machine-readable. The following steps were applied:
- Tokenization 🏷️: The text is split into a list of tokens (words) after removing punctuation.
- Removing Stop Words 🚫: Words such as "the", "if", "and", etc., that do not add significant meaning, are removed.
- Stemming 🔄: Different forms of a word (e.g., "run", "running") are reduced to a common base form.
These steps ensure the text is processed and ready for topic modeling.
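The three steps above can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual pipeline: the stop-word list is deliberately tiny, and `crude_stem` is a toy suffix stripper standing in for a real stemmer such as Porter's. All function names are hypothetical.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "if", "and", "a", "an", "in", "on", "of", "to", "was", "were", "is"}

def tokenize(text):
    """Lowercase the text and keep only alphabetic runs, which drops punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Toy suffix stripper standing in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Undo consonant doubling: "running" -> "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return token

def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]

tokens = preprocess("Police were running after the suspect, and the suspect ran.")
# -> ['police', 'run', 'after', 'suspect', 'suspect', 'ran']
```

Note that "ran" is left untouched: a stemmer only strips suffixes, so irregular forms are not merged with "run".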
Text mining is the process of extracting meaningful information from text. In this project, we employed both statistical and machine learning techniques to structure the data, find patterns, and extract insights.
Topic modeling is a method for discovering hidden thematic structures in a large collection of documents. This project uses LDA (Latent Dirichlet Allocation), a probabilistic model that helps to extract the main topics within a collection of news articles.
LDA assumes that each document is a mixture of several topics, and that each topic is a probability distribution over the words in the vocabulary.
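To make that assumption concrete, here is a small sketch of the generative story LDA posits, written with only the standard library. The vocabulary, the number of topics, and the prior values are made up for illustration. Training runs this story in reverse: it infers the mixtures and topics from observed words.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, vocab, alpha, length, rng):
    """Generate one document following LDA's generative story."""
    # 1. Draw this document's topic mixture.
    theta = sample_dirichlet(alpha, len(topics), rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to the mixture.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        words.append(rng.choices(vocab, weights=topics[k])[0])
    return theta, words

rng = random.Random(0)
# Illustrative vocabulary; not taken from the project's data.
vocab = ["robbery", "arrest", "court", "trial", "subway", "police"]
# Each topic is a distribution over the whole vocabulary, drawn with prior beta.
beta = 0.5
topics = [sample_dirichlet(beta, len(vocab), rng) for _ in range(2)]
theta, words = generate_document(topics, vocab, alpha=0.5, length=8, rng=rng)
```

Small values of `alpha` and `beta` push the draws toward sparse distributions, so documents tend to focus on few topics and topics on few words.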
We used Gensim's LDA implementation and Mallet's LDA model to analyze the crime data. Key steps include:
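- Creating a Dictionary and Corpus 📚: Gensim assigns a unique ID to each word (the dictionary) and represents each document as a list of (word ID, frequency) pairs (the corpus).
- Training the LDA Model 💻: We trained the LDA model with different numbers of topics, adjusting the alpha and beta hyperparameters, which are the Dirichlet priors on the document-topic and topic-word distributions (Gensim names beta `eta`).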
- Extracting Topics 📝: The resulting topics provide insights into key themes in crime data.
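The dictionary and corpus step can be sketched in plain Python to show the data structure involved. This is not Gensim's code: `build_dictionary` and `doc_to_bow` are hypothetical helpers that mimic the kind of output Gensim's `corpora.Dictionary` and its `doc2bow` method produce, and the exact ID assignment order in Gensim may differ.

```python
from collections import Counter

def build_dictionary(docs):
    """Map each distinct token to an integer ID, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc_to_bow(doc, token2id):
    """Represent a document as sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Two already-preprocessed toy documents (invented for illustration).
docs = [
    ["police", "arrest", "suspect", "robbery"],
    ["court", "trial", "suspect", "suspect"],
]
token2id = build_dictionary(docs)
corpus = [doc_to_bow(doc, token2id) for doc in docs]
# corpus[1] -> [(2, 2), (4, 1), (5, 1)]  i.e. "suspect" twice, "court" and "trial" once
```

A bag-of-words corpus in this form discards word order, which matches LDA's assumption that only word counts per document matter.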
The coherence score measures how semantically consistent the top words of each topic are, which serves as a proxy for how interpretable the topics are to a human reader. A higher coherence score indicates more meaningful topics. Our best coherence score was approximately 53% (0.53), which suggests reasonably coherent topics.
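To give a feel for how a coherence score is computed, the sketch below implements UMass coherence, one of several common measures. This is an assumption for illustration only: the project does not state which measure produced the 53% figure, and UMass values sit on a log scale, so they are not comparable with a percentage. The toy documents are invented.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass coherence for one topic; top_words are ordered most probable first."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain every one of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    # combinations() yields (earlier, later) pairs, so `earlier` ranks higher.
    for earlier, later in combinations(top_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # word never occurs in the corpus; skip the pair
        score += math.log((doc_freq(earlier, later) + 1) / denom)
    return score

docs = [
    ["police", "arrest", "suspect"],
    ["police", "arrest", "robbery"],
    ["court", "trial", "judge"],
    ["court", "trial", "suspect"],
]
# Words that co-occur in the same documents score higher than words that never do.
coherent = umass_coherence(["police", "arrest"], docs)
mixed = umass_coherence(["police", "trial"], docs)
```

Whatever the measure, the usual workflow is the same: train models with different numbers of topics, score each one, and keep the setting with the highest coherence.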
We used the pyLDAvis package to visualize the topics generated by the LDA model. This interactive visualization helps in understanding the relationships between topics and the most significant words in each topic.
Topic modeling, and LDA in particular, is well suited to analyzing crime data. By surfacing hidden patterns and themes, LDA can help crime analysts and law enforcement gain better insight into criminal activity, which may support faster investigations and help resolve unsolved crimes.
The techniques explored in this project demonstrate the value of LDA for understanding large datasets in the crime domain.
🔧 Tools Used:
- Python (v3.4)
- Gensim Library
- Mallet's LDA Model
- pyLDAvis for Visualization
Feel free to clone or fork this repository and start analyzing crime-related data with LDA! 🚀