
Image Classification: CNN (MobileNetV3), LSTM, and Vision Transformer (ViT) Comparison

This project compares three neural architectures, CNN (MobileNetV3), LSTM, and Vision Transformer (ViT), on the same preprocessed dataset with identical normalization, input shape, and output layers, changing only the backbone. Training, evaluation, and metrics (precision, recall, accuracy) are reported for all models.


Dataset

/kaggle/input/catsdogs/CatsDogs/
├── cats/
│   ├── cat.1.jpg
│   ├── cat.2.jpg
│   └── ...
└── dogs/
    ├── dog.1.jpg
    ├── dog.2.jpg
    └── ...

Project Structure

.
├── notebooks/
├── data/
│   └── catsdogs/
├── models/
├── visuals/
└── README.md

Data Preparation

  • Split: 70% train / 15% val / 15% test (stratified)
  • Label encoding: cat → 0, dog → 1
  • Image size: (224, 224, 3)
  • Normalization:
    • CNN/ViT: built-in preprocessing (handled by the backbone/preprocessor)
    • LSTM: explicit normalization (image / 255.)
Example: Dataset splitting and loading
from sklearn.model_selection import train_test_split
from pathlib import Path
import pandas as pd

DATASET_DIR = Path('/kaggle/input/catsdogs/CatsDogs')
CATS_DIR = DATASET_DIR.joinpath('cats')
DOGS_DIR = DATASET_DIR.joinpath('dogs')

# Collect (path, label) pairs; store paths as strings so tf.data can consume them
files = [(str(f), f.parent.name) for f in DATASET_DIR.glob("**/*.jpg")]
df_dataset = pd.DataFrame(files, columns=['File', 'Label'])
df_dataset['Label_i'] = df_dataset['Label'].astype('category').cat.codes  # cat -> 0, dog -> 1

# 70% train; the remaining 30% is split evenly into val/test (15% / 15%), stratified by label
train, test_temp = train_test_split(df_dataset, test_size=0.3, stratify=df_dataset['Label_i'], random_state=89)
val, test = train_test_split(test_temp, test_size=0.5, stratify=test_temp['Label_i'], random_state=89)
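A quick sanity check (illustrative, not part of the original pipeline) confirms the 70/15/15 proportions and that stratification preserved the class balance in each split:

for name, part in [('train', train), ('val', val), ('test', test)]:
    # Print split size and per-class proportions
    print(name, len(part), part['Label'].value_counts(normalize=True).round(3).to_dict())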

Data Pipelines

import tensorflow as tf

def load_image(image_path, label):
    # Read, decode, and resize a single image to the shared 224x224x3 input shape
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, size=(224, 224))
    return image, label

BATCH_SIZE = 32

Example for the train dataset:
train_ds = tf.data.Dataset.from_tensor_slices((train['File'].values, train['Label_i'].values))
train_ds = train_ds.shuffle(buffer_size=len(train_ds)).map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
train_ds = train_ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
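The validation and test pipelines mirror the train pipeline without shuffling (a sketch, assuming the val/test DataFrames from the split above); these are the val_ds/test_ds referenced in the training snippets below:

# Validation/test pipelines: same as train, but without shuffling
val_ds = tf.data.Dataset.from_tensor_slices((val['File'].values, val['Label_i'].values))
val_ds = val_ds.map(load_image, num_parallel_calls=tf.data.AUTOTUNE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

test_ds = tf.data.Dataset.from_tensor_slices((test['File'].values, test['Label_i'].values))
test_ds = test_ds.map(load_image, num_parallel_calls=tf.data.AUTOTUNE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)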

Model 1: MobileNetV3 (Convolutional Neural Network)

  • Backbone: MobileNetV3 (pretrained or trained from scratch)
  • Loss: Binary crossentropy (sigmoid output)
  • Metrics: Precision, Recall, Binary accuracy
from tensorflow.keras.models import load_model

mobilenet_model = load_model("/kaggle/input/mobilenetv3.keras/keras/default/1/MobileNetV3.keras")
mobilenet_model.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
    metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), tf.keras.metrics.BinaryAccuracy()]
)
history = mobilenet_model.fit(train_ds, epochs=20, validation_data=val_ds, callbacks=callbacks)
test_loss, test_precision, test_recall, test_acc = model_lstm.evaluate(test_ds)  # returned in compile() metrics order
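If the saved checkpoint is not available, a comparable model can be assembled from the Keras MobileNetV3 application (a minimal sketch, assuming an ImageNet-pretrained MobileNetV3Small backbone; the exact variant and head inside the checkpoint above may differ):

# Sketch: MobileNetV3Small backbone with a binary sigmoid head
base = tf.keras.applications.MobileNetV3Small(
    input_shape=(224, 224, 3),
    include_top=False,
    weights='imagenet',
    include_preprocessing=True,  # input normalization is built into the backbone
)
base.trainable = False  # freeze the backbone for transfer learning
mobilenet_model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # binary cat/dog output
])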

Model 2: Vision Transformer (ViT)

  • Backbone: Keras Vision Transformer
  • Reference: Keras ViT Example
  • Install with:
    pip install -U keras keras-hub
import keras_hub

# Pretrained ViT-B/16 backbone and its matching preprocessor from KerasHub
backbone = keras_hub.models.Backbone.from_preset("vit_base_patch16_224_imagenet")
preprocessor = keras_hub.models.ViTImageClassifierPreprocessor.from_preset("vit_base_patch16_224_imagenet")
vit = keras_hub.models.ViTImageClassifier(
    backbone=backbone, num_classes=2, preprocessor=preprocessor, activation='softmax'
)
vit.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",  # integer labels (0/1) with a two-way softmax
    metrics=["accuracy"]
)
history = vit.fit(train_ds, epochs=20, validation_data=val_ds, callbacks=callbacks)

Model 3: LSTM (Recurrent Neural Network)

  • Each image is reshaped into a sequence of shape [224, 672] (224 rows, each flattened to 224 px × 3 channels = 672 values)
  • Input: (224, 672)
  • Architecture: LSTM(128) → Dense(64, relu) → Dense(1, sigmoid)
def img_to_seq(image, label):
    # Each image becomes a sequence of 224 rows, each row flattened to 224 * 3 = 672 features
    image = tf.reshape(image / 255., [224, 224 * 3])
    return image, label

# train_ds is already batched, so unbatch first to apply the per-image reshape, then re-batch
# (lstm_val_ds and lstm_test_ds are built the same way from val_ds / test_ds)
lstm_train_ds = train_ds.unbatch().map(img_to_seq, num_parallel_calls=tf.data.AUTOTUNE)
lstm_train_ds = lstm_train_ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

model_lstm = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 672)),
    tf.keras.layers.LSTM(128, return_sequences=False),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model_lstm.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
    metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), tf.keras.metrics.BinaryAccuracy()]
)
history = model_lstm.fit(lstm_train_ds, epochs=20, validation_data=lstm_val_ds, callbacks=callbacks)
test_loss, test_precision, test_recall, test_acc = model_lstm.evaluate(lstm_test_ds)  # returned in compile() metrics order

Model Training & Metrics

  • Callbacks: EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard, CSV logger (a sketch of the callbacks list follows below)
  • Training for all models: same number of epochs, batch size, steps per epoch, data normalization, and input/output structure.
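A minimal sketch of the callbacks list passed to each fit() call above; the patience values and file names here are illustrative assumptions, not settings taken from the original runs:

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint('best_model.keras', save_best_only=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2),
    tf.keras.callbacks.TensorBoard(log_dir='logs'),
    tf.keras.callbacks.CSVLogger('training_log.csv'),
]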

Example: Measuring Precision and Recall

import numpy as np
from sklearn.metrics import precision_score, recall_score

# For ViT (as an example): softmax probabilities -> predicted class indices
y_pred_proba = vit.predict(test_ds)
y_pred = np.argmax(y_pred_proba, axis=1)

# Collect the true labels batch by batch
y_true = []
for x, y in test_ds:
    y_true.extend(y.numpy())
y_true = np.array(y_true)

prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
acc = (y_true == y_pred).mean()

Inference Time

import time

def measure_inference_time(model, test_ds):
    # Time one forward pass over a single batch and return seconds per image
    for images, _ in test_ds.take(1):
        start_time = time.time()
        model.predict(images)
        end_time = time.time()
        return (end_time - start_time) / len(images)
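Example usage for the three models (assuming the datasets defined above; note the LSTM uses its own reshaped pipeline):

# Seconds per image, measured on one test batch
vit_time = measure_inference_time(vit, test_ds)
mobilenet_time = measure_inference_time(mobilenet_model, test_ds)
lstm_time = measure_inference_time(model_lstm, lstm_test_ds)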

Results table:

Model        Precision  Recall    Accuracy  Training time (s)  Inference time (s/image)
ViT          0.996503   0.989583  0.993056  1558.669495        0.114361
MobileNetV3  0.993333   0.995000  0.996656  109.515425         0.004638
LSTM         0.593333   0.598333  0.599327  61.178951          0.007577

Conclusion

  • Convolutional models (MobileNetV3) and the Vision Transformer (ViT) show excellent results on images:
    • Accuracy, precision, and recall are extremely high (0.99–1.0). Both effectively capture spatial patterns in images.
  • LSTM performs significantly worse on images:
    • Flattening an image into a sequence of rows discards its 2-D spatial structure; LSTMs are designed for temporal sequences, not images.
  • Training/inference speed: MobileNetV3 is fast and highly accurate. ViT is noticeably slower to train but competitive in quality. LSTM trains quickly but is inaccurate on this kind of data.
  • Recommendation:
    • For image classification tasks, use CNN/ViT architectures rather than recurrent networks like LSTM.

How to Run

  1. Clone the repository or download the code.
  2. Prepare the dataset according to the directory structure.
  3. Install requirements: pip install tensorflow keras keras-hub scikit-learn pandas
  4. Run notebooks or scripts for each model.

Contacts

Author: Daria

If you have questions or suggestions, feel free to contact me!