
Image Classification: CNN (MobileNetV3), LSTM, and Vision Transformer (ViT) Comparison

This project compares three neural architectures, CNN (MobileNetV3), LSTM, and Vision Transformer (ViT), on the same preprocessed dataset with identical normalization, input shape, and output layers, changing only the backbone. Training, evaluation, and metrics (precision, recall, accuracy) are reported for all models.


Dataset

/kaggle/input/catsdogs/CatsDogs/
├── cats/
│   ├── cat.1.jpg
│   ├── cat.2.jpg
│   └── ...
└── dogs/
    ├── dog.1.jpg
    ├── dog.2.jpg
    └── ...

Project Structure

.
├── notebooks/
├── data/
│   └── catsdogs/
├── models/
├── visuals/
└── README.md

Data Preparation

  • Split: 70% train / 15% val / 15% test (stratified)
  • Label encoding: cat → 0, dog → 1
  • Image size: (224, 224, 3)
  • Normalization:
    • CNN/ViT: built-in preprocessing (handled by the backbone/preprocessor)
    • LSTM: explicit normalization (image / 255.)
Example: Dataset splitting and loading
from sklearn.model_selection import train_test_split
from pathlib import Path
import pandas as pd

DATASET_DIR = Path('/kaggle/input/catsdogs/CatsDogs')
CATS_DIR = DATASET_DIR.joinpath('cats')
DOGS_DIR = DATASET_DIR.joinpath('dogs')

# Collect (path, label) pairs; store paths as strings so tf.data can consume them
files = [(str(f), f.parent.name) for f in DATASET_DIR.glob("**/*.jpg")]
df_dataset = pd.DataFrame(files, columns=['File', 'Label'])
df_dataset['Label_i'] = df_dataset['Label'].astype('category').cat.codes  # cat -> 0, dog -> 1

# 70% train; the remaining 30% is split evenly into val/test (15% / 15%), stratified by label
train, test_temp = train_test_split(df_dataset, test_size=0.3, stratify=df_dataset['Label_i'], random_state=89)
val, test = train_test_split(test_temp, test_size=0.5, stratify=test_temp['Label_i'], random_state=89)
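A quick sanity check (illustrative, not part of the original pipeline) confirms the 70/15/15 proportions and that stratification preserved the class balance in each split:

for name, part in [('train', train), ('val', val), ('test', test)]:
    # Print split size and per-class proportions
    print(name, len(part), part['Label'].value_counts(normalize=True).round(3).to_dict())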

Data Pipelines

import tensorflow as tf

def load_image(image_path, label):
    # Read, decode, and resize a single image to the shared 224x224x3 input shape
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, size=(224, 224))
    return image, label

BATCH_SIZE = 32

Example for the train dataset:
train_ds = tf.data.Dataset.from_tensor_slices((train['File'].values, train['Label_i'].values))
train_ds = train_ds.shuffle(buffer_size=len(train_ds)).map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
train_ds = train_ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
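The validation and test pipelines mirror the train pipeline without shuffling (a sketch, assuming the val/test DataFrames from the split above); these are the val_ds/test_ds referenced in the training snippets below:

# Validation/test pipelines: same as train, but without shuffling
val_ds = tf.data.Dataset.from_tensor_slices((val['File'].values, val['Label_i'].values))
val_ds = val_ds.map(load_image, num_parallel_calls=tf.data.AUTOTUNE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

test_ds = tf.data.Dataset.from_tensor_slices((test['File'].values, test['Label_i'].values))
test_ds = test_ds.map(load_image, num_parallel_calls=tf.data.AUTOTUNE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)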

Model 1: MobileNetV3 (Convolutional Neural Network)

  • Backbone: MobileNetV3 (pretrained or trained from scratch)
  • Loss: Binary crossentropy (sigmoid output)
  • Metrics: Precision, Recall, Binary accuracy
from tensorflow.keras.models import load_model

mobilenet_model = load_model("/kaggle/input/mobilenetv3.keras/keras/default/1/MobileNetV3.keras")
mobilenet_model.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
    metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), tf.keras.metrics.BinaryAccuracy()]
)
history = mobilenet_model.fit(train_ds, epochs=20, validation_data=val_ds, callbacks=callbacks)
test_loss, test_precision, test_recall, test_acc = model_lstm.evaluate(test_ds)  # returned in compile() metrics order
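If the saved checkpoint is not available, a comparable model can be assembled from the Keras MobileNetV3 application (a minimal sketch, assuming an ImageNet-pretrained MobileNetV3Small backbone; the exact variant and head inside the checkpoint above may differ):

# Sketch: MobileNetV3Small backbone with a binary sigmoid head
base = tf.keras.applications.MobileNetV3Small(
    input_shape=(224, 224, 3),
    include_top=False,
    weights='imagenet',
    include_preprocessing=True,  # input normalization is built into the backbone
)
base.trainable = False  # freeze the backbone for transfer learning
mobilenet_model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # binary cat/dog output
])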

Model 2: Vision Transformer (ViT)

  • Backbone: Keras Vision Transformer
  • Reference: Keras ViT Example
  • Install with:
    pip install -U keras keras-hub
import keras_hub

# Pretrained ViT-B/16 backbone and its matching preprocessor from KerasHub
backbone = keras_hub.models.Backbone.from_preset("vit_base_patch16_224_imagenet")
preprocessor = keras_hub.models.ViTImageClassifierPreprocessor.from_preset("vit_base_patch16_224_imagenet")
vit = keras_hub.models.ViTImageClassifier(
    backbone=backbone, num_classes=2, preprocessor=preprocessor, activation='softmax'
)
vit.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",  # integer labels (0/1) with a two-way softmax
    metrics=["accuracy"]
)
history = vit.fit(train_ds, epochs=20, validation_data=val_ds, callbacks=callbacks)

Model 3: LSTM (Recurrent Neural Network)

  • Each image is reshaped into a sequence of shape [224, 672] (224 rows, each flattened to 224 px × 3 channels = 672 values)
  • Input: (224, 672)
  • Architecture: LSTM(128) → Dense(64, relu) → Dense(1, sigmoid)
def img_to_seq(image, label):
    # Each image becomes a sequence of 224 rows, each row flattened to 224 * 3 = 672 features
    image = tf.reshape(image / 255., [224, 224 * 3])
    return image, label

# train_ds is already batched, so unbatch first to apply the per-image reshape, then re-batch
# (lstm_val_ds and lstm_test_ds are built the same way from val_ds / test_ds)
lstm_train_ds = train_ds.unbatch().map(img_to_seq, num_parallel_calls=tf.data.AUTOTUNE)
lstm_train_ds = lstm_train_ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

model_lstm = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 672)),
    tf.keras.layers.LSTM(128, return_sequences=False),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model_lstm.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
    metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), tf.keras.metrics.BinaryAccuracy()]
)
history = model_lstm.fit(lstm_train_ds, epochs=20, validation_data=lstm_val_ds, callbacks=callbacks)
test_loss, test_precision, test_recall, test_acc = model_lstm.evaluate(lstm_test_ds)  # returned in compile() metrics order

Model Training & Metrics

  • Callbacks: EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard, CSV logger (a sketch of the callbacks list follows below)
  • Training for all models: same number of epochs, batch size, steps per epoch, data normalization, and input/output structure.
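A minimal sketch of the callbacks list passed to each fit() call above; the patience values and file names here are illustrative assumptions, not settings taken from the original runs:

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint('best_model.keras', save_best_only=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2),
    tf.keras.callbacks.TensorBoard(log_dir='logs'),
    tf.keras.callbacks.CSVLogger('training_log.csv'),
]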

Example: Measuring Precision and Recall

import numpy as np
from sklearn.metrics import precision_score, recall_score

# For ViT (as an example): softmax probabilities -> predicted class indices
y_pred_proba = vit.predict(test_ds)
y_pred = np.argmax(y_pred_proba, axis=1)

# Collect the true labels batch by batch
y_true = []
for x, y in test_ds:
    y_true.extend(y.numpy())
y_true = np.array(y_true)

prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
acc = (y_true == y_pred).mean()

Inference Time

import time

def measure_inference_time(model, test_ds):
    # Time one forward pass over a single batch and return seconds per image
    for images, _ in test_ds.take(1):
        start_time = time.time()
        model.predict(images)
        end_time = time.time()
        return (end_time - start_time) / len(images)
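Example usage for the three models (assuming the datasets defined above; note the LSTM uses its own reshaped pipeline):

# Seconds per image, measured on one test batch
vit_time = measure_inference_time(vit, test_ds)
mobilenet_time = measure_inference_time(mobilenet_model, test_ds)
lstm_time = measure_inference_time(model_lstm, lstm_test_ds)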

Results table:

Model        Precision  Recall    Accuracy  Training time (s)  Inference time (s/image)
ViT          0.996503   0.989583  0.993056  1558.669495        0.114361
MobileNetV3  0.993333   0.995000  0.996656  109.515425         0.004638
LSTM         0.593333   0.598333  0.599327  61.178951          0.007577

Conclusion

  • Convolutional models (MobileNetV3) and the Vision Transformer (ViT) show excellent results on images:
    • Accuracy, precision, and recall are extremely high (0.99–1.0). Both effectively capture spatial patterns in images.
  • LSTM performs significantly worse on images:
    • Flattening an image into a sequence of rows discards its 2-D spatial structure; LSTMs are designed for temporal sequences, not images.
  • Training/inference speed: MobileNetV3 is fast and highly accurate. ViT is noticeably slower to train but competitive in quality. LSTM trains quickly but is inaccurate on this kind of data.
  • Recommendation:
    • For image classification tasks, use CNN/ViT architectures rather than recurrent networks like LSTM.

How to Run

  1. Clone the repository or download the code.
  2. Prepare the dataset according to the directory structure.
  3. Install requirements: pip install tensorflow keras keras-hub scikit-learn pandas
  4. Run notebooks or scripts for each model.

Contacts

Author: Daria

If you have questions or suggestions, feel free to contact me!