Skip to content

Language Detector Loads and cleans text data, trains a language classification model using TF-IDF and Logistic Regression, evaluates it, and enables interactive language prediction with saved model reuse.

License

Notifications You must be signed in to change notification settings

AUX-441/Language-Detector-Model

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🌐 Language Detector Model

This project is a Language Detection Model built in Python.
It loads and cleans a multilingual text dataset, trains a machine learning model, and allows users to test language predictions on custom input.

The model uses TF-IDF Vectorization and Logistic Regression to classify text into its respective language.
Additionally, the project includes data cleaning, model evaluation, and visualization of results.


✨ Features

🚀 Fast Language Detection
Detects the language of any given text within milliseconds using a trained ML model.

🧹 Automated Data Cleaning
Removes duplicate or over-represented text entries to improve model accuracy.

📊 Model Evaluation & Reporting
Generates accuracy score, classification report, and detailed confusion matrix.

🎨 Beautiful Visualizations
Creates heatmaps, histograms, and line plots to visualize model performance.

💾 Model Persistence
Saves the trained model as a .pkl file for easy re-use without retraining.

🖥 Interactive Testing
Accepts user input directly from the terminal to test predictions instantly.

🛠 Customizable Training Pipeline
Easily adjust vectorization method, features, and classifier for experimentation.

📁 Multi-Language Dataset Support
Handles large multilingual datasets with millions of entries efficiently.


⚙️ Features

  • Dataset Analysis & Cleaning

    • Loads CSV dataset
    • Removes repeated text entries (>3 times)
    • Filters languages with very low frequency
  • Model Training

    • TF-IDF character-level features
    • Logistic Regression classifier
    • Accuracy score and classification report
    • Saves trained model as .pkl
  • Visualization

    • Confusion matrix heatmap
    • True positives line plot
    • True positives histogram
  • Model Testing

    • Interactive user input
    • Predicts language instantly

🧠 Model Details

Vectorizer: TfidfVectorizer(analyzer='char', max_features=2500)

Classifier: LogisticRegression(max_iter=2500, solver='saga')

Split Ratio: 70% train / 30% test

Evaluation Metrics: Accuracy, classification report, confusion matrix


📦 Installation

Clone the repository

git clone https://github.com/AUX-441/Language-Detector-Model.git
cd Language-Detector-Model

python Test_Model.py

Enter Your Language here (type 'exist' to quit): Bonjour
Predicted Language : French


---

About

Language Detector Loads and cleans text data, trains a language classification model using TF-IDF and Logistic Regression, evaluates it, and enables interactive language prediction with saved model reuse.

Topics

Resources

License

Stars

Watchers

Forks

Languages