This Python code implements a Logistic Regression model for handwritten digit classification using the scikit-learn library. The code performs the following steps:
- Data Loading:
  - Loads the `digits` dataset from scikit-learn, containing images of handwritten digits (0-9) and their corresponding labels.
  - Prints the shapes of the data and target arrays, indicating the number of images and labels.
  - Displays the first five images and their labels using Matplotlib.
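A minimal sketch of what this loading step might look like (the variable names and plot layout here are assumptions, not necessarily the author's exact code):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

# Load the digits dataset: 1797 8x8 grayscale images of the digits 0-9.
digits = load_digits()
print(digits.data.shape)    # (1797, 64) -- each image flattened to 64 pixel features
print(digits.target.shape)  # (1797,)    -- integer labels 0-9

# Display the first five images with their labels.
fig, axes = plt.subplots(1, 5, figsize=(10, 3))
for ax, image, label in zip(axes, digits.images[:5], digits.target[:5]):
    ax.imshow(image, cmap="gray")
    ax.set_title(f"Label: {label}")
    ax.axis("off")
plt.show()
```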
- Data Splitting:
  - Splits the data into training and testing sets using `train_test_split`.
  - The training set is used to train the model, and the testing set is used to evaluate its performance.
  - The test size is set to 23% with `test_size=0.23`, and `random_state=2` is used for reproducibility.
  - Prints the shapes of the training and testing sets for clarity.
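Continuing the sketch above, the split could look like this (variable names are assumed):

```python
from sklearn.model_selection import train_test_split

# Hold out 23% of the samples for testing; random_state=2 makes the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.23, random_state=2
)
print(x_train.shape, y_train.shape)  # training features and labels
print(x_test.shape, y_test.shape)    # testing features and labels
```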
- Model Training:
  - Creates an instance of the `LogisticRegression` classifier from scikit-learn.
  - Sets the maximum number of iterations (`max_iter`) to 1000 for the optimization algorithm.
  - Trains the model on the training data using the `fit` method.
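Under the same assumptions, the training step is a short call into scikit-learn:

```python
from sklearn.linear_model import LogisticRegression

# max_iter=1000 gives the solver enough iterations to converge on this dataset.
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)
```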
- Model Evaluation:
  - Predicts the class label for the first element of the test set using `predict`.
  - Compares the predicted label with the actual label to assess initial performance.
  - Predicts the class labels for the first 10 elements of the test set.
  - Calculates the model's accuracy using the `score` method.
  - Generates a confusion matrix using `metrics.confusion_matrix` to visualize the model's performance across all classes.
  - Creates a heatmap of the confusion matrix using Seaborn for better interpretation.
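A hedged sketch of these evaluation steps, continuing with the `model`, `x_test`, and `y_test` names assumed above:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics

# Single-sample check: predicted label vs. actual label for the first test image.
print(model.predict(x_test[[0]]), y_test[0])

# Predictions for the first 10 test images.
print(model.predict(x_test[:10]))

# Overall accuracy on the held-out test set.
print("Accuracy:", model.score(x_test, y_test))

# Confusion matrix, visualized as a Seaborn heatmap.
cm = metrics.confusion_matrix(y_test, model.predict(x_test))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted label")
plt.ylabel("Actual label")
plt.show()
```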
- Visualization of Misclassified Digits:
  - Identifies the indices of misclassified digits in the testing set.
  - Plots the first four misclassified digits using Matplotlib, along with their predicted and actual labels.
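The misclassification plot might be implemented along these lines (again a sketch using the assumed variable names):

```python
import numpy as np
import matplotlib.pyplot as plt

# Indices of test images where the prediction disagrees with the true label.
predictions = model.predict(x_test)
misclassified = np.where(predictions != y_test)[0]

# Plot up to four misclassified digits with their predicted and actual labels.
fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for ax, idx in zip(axes, misclassified[:4]):
    ax.imshow(x_test[idx].reshape(8, 8), cmap="gray")
    ax.set_title(f"Pred: {predictions[idx]}  Actual: {y_test[idx]}")
    ax.axis("off")
plt.show()
```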
Logistic regression is a statistical method for classification problems with a binary dependent variable (0 or 1). It estimates the probability of an event occurring from a set of independent variables. Digit recognition has ten classes rather than two, so scikit-learn extends the binary model, either by fitting one classifier per class (one-vs-rest) or by using a multinomial (softmax) formulation; the model here therefore estimates, for each image, the probability that it represents each digit (0-9).
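To illustrate the idea (a sketch, not part of the original code): in the binary case the model passes a linear score through the sigmoid function to obtain a probability, and scikit-learn exposes the per-class probabilities of the fitted multi-class model through `predict_proba`:

```python
import numpy as np

def sigmoid(z):
    # Maps a linear score w·x + b to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # 0.5 -- right on the decision boundary
print(sigmoid(3.0))  # ~0.95 -- confidently "class 1"

# For the multi-class model trained above, predict_proba returns one
# probability per digit class (10 values) for each test image.
print(model.predict_proba(x_test[[0]]).round(3))
```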
Logistic Regression Applications:
- Medical diagnosis (e.g., predicting cancer risk based on medical history)
- Spam email filtering
- Customer churn prediction (identifying customers likely to cancel service)
- Fraud detection in financial transactions
- Sentiment analysis of text data (classifying text as positive, negative, or neutral)
Logistic Regression Advantages:
- Easy to interpret: Coefficients in the model provide insights into the relationship between features and the target variable.
- Computationally efficient: Training and prediction are relatively fast compared to some other models.
- Works well with high-dimensional data: Can handle a large number of features without significant performance degradation.
- Good baseline model: Often used as a baseline for evaluating the performance of more complex models.
Logistic Regression Disadvantages:
- Limited to binary classification in its basic form: multi-class problems require extensions such as one-vs-rest or multinomial (softmax) logistic regression (a brief sketch follows this list).
- Assumes a linear relationship between the features and the log-odds of the target: if the true relationship is strongly non-linear, logistic regression may not perform well.
- Sensitive to outliers: Can be affected by outliers in the data.
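As a rough illustration of the multi-class workarounds mentioned above (a sketch; the parameter choices are assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.23, random_state=2)

# Explicit one-vs-rest: fits one binary logistic regression per digit class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(x_train, y_train)

# Built-in handling: LogisticRegression fits a single multinomial (softmax)
# model over all 10 classes when given multi-class labels.
multinomial = LogisticRegression(max_iter=1000).fit(x_train, y_train)

print("One-vs-rest accuracy:", ovr.score(x_test, y_test))
print("Multinomial accuracy:", multinomial.score(x_test, y_test))
```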