This project is about predicting the topic of news articles using the AG News dataset. The prediction model is built using Logistic Regression and is deployed as a Flask web application.
This structure is modular and organized, making it easy to maintain and scale the project. Each folder and file has a clear purpose, ensuring a clean and efficient workflow.
project/
│
├── logs/ # Logs for tracking execution (e.g., training logs)
│
├── model/ # Main folder for model-related files
│ ├── data/ # Data folder
│ │ ├── processed/ # Processed data (e.g., after preprocessing)
│ │ └── raw/ # Raw data (e.g., original datasets)
│ │
│ ├── notebooks/ # Jupyter Notebooks for analysis and experiments
│ │ └── toping_of_news.ipynb
│ │
│ ├── pipelines/ # Folder for data processing and prediction pipelines
│ │ ├── predict_pipeline.py # Pipeline for making predictions
│ │ └── train_pipeline.py # Pipeline for training the model
│ │
│ ├── saved_models/ # Folder for saved model files
│ │
│ ├── scripts/ # Scripts for data processing and model training
│ │ ├── __init__.py # Package initialization file
│ │ ├── data_preprocessing.py # Script for data preprocessing
│ │ ├── model_evaluation.py # Script for model evaluation
│ │ └── model_training.py # Script for model training
│ │
│ ├── utils/ # Utility functions and helpers
│ │ ├── __init__.py # Package initialization file
│ │ ├── logger.py # Utility for logging
│ │ └── pickler.py # Utility for serialization/deserialization (e.g., pickle)
│ │
│ ├── templates/ # HTML templates for the web interface
│ │ └── index.html # Main HTML template for the web interface
│ │
│ ├── .gitignore # File to exclude specific files/folders from Git
│ ├── app.py # Main Flask application file
│ ├── config.py # Project configuration (paths, parameters)
│ ├── README.md # Project documentation
│ └── requirements.txt # Project dependencies
- Clone the repository:
git clone https://github.com/nadirg2/nlp_topings.git
- Navigate to the project directory:
cd nlp_topings
- Install the required packages:
pip install -r requirements.txt
- Run the Flask application:
python run.py
- Open your web browser and go to
http://127.0.0.1:5000
.
The AG News dataset is used for training and testing the model. It contains news articles categorized into four classes: World, Sports, Business, and Sci/Tech.
The Logistic Regression model and TF-IDF vectorizer is used for predicting the topic of news articles. The model is trained on the AG News dataset and saved as model.pkl
. Also implemented text preprocessing with using lemmatization, stop-word removal, and tokenization.
This project is licensed under the MIT License.