This project compares three neural architectures—CNN (MobileNetV3), LSTM, and Vision Transformer (ViT)—using the same preprocessed dataset, normalization, input shape, and output layers, changing only the backbone. Training, evaluation, and metrics (Precision/Recall/Accuracy) are presented for all models.
- Dataset: Cats vs Dogs
- Source: Kaggle Cats & Dogs Images
- Directory structure example:
```text
/kaggle/input/catsdogs/CatsDogs/
├── cats/
│   ├── cat.1.jpg
│   ├── cat.2.jpg
│   └── ...
└── dogs/
    ├── dog.1.jpg
    ├── dog.2.jpg
    └── ...
```
- Repository layout:

```text
.
├── notebooks/
├── data/
│   └── catsdogs/
├── models/
├── visuals/
└── README.md
```
- Split: 70% train / 15% val / 15% test (stratified)
- Label encoding: cat → 0, dog → 1
- Image size: (224, 224, 3)
- Normalization:
  - CNN/ViT: built-in preprocessing
  - LSTM: explicit normalization (`image / 255.`)
Example: Dataset splitting and loading

```python
from pathlib import Path

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

DATASET_DIR = Path('/kaggle/input/catsdogs/CatsDogs')
CATS_DIR = DATASET_DIR.joinpath('cats')
DOGS_DIR = DATASET_DIR.joinpath('dogs')

# Collect (path, label) pairs; the parent directory name is the class label
files = [(str(f), f.parent.name) for f in DATASET_DIR.glob("**/*.jpg")]
df_dataset = pd.DataFrame(files, columns=['File', 'Label'])
df_dataset['Label_i'] = df_dataset['Label'].astype('category').cat.codes  # cat -> 0, dog -> 1

# Stratified 70/15/15 split
train, test_temp = train_test_split(df_dataset, test_size=0.3, stratify=df_dataset['Label_i'], random_state=89)
val, test = train_test_split(test_temp, test_size=0.5, stratify=test_temp['Label_i'], random_state=89)

def load_image(image_path, label):
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, size=(224, 224))
    return image, label
```
```python
BATCH_SIZE = 32

# Example for the train dataset
train_ds = tf.data.Dataset.from_tensor_slices((train['File'].values, train['Label_i'].values))
train_ds = train_ds.shuffle(buffer_size=len(train_ds)).map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
train_ds = train_ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
```
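The `val_ds` and `test_ds` pipelines used in the training calls below are not shown above; a minimal sketch, assuming they mirror `train_ds` without shuffling:

```python
# Assumed construction of the validation/test pipelines (no shuffling,
# so evaluation order stays deterministic)
val_ds = tf.data.Dataset.from_tensor_slices((val['File'].values, val['Label_i'].values))
val_ds = val_ds.map(load_image, num_parallel_calls=tf.data.AUTOTUNE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

test_ds = tf.data.Dataset.from_tensor_slices((test['File'].values, test['Label_i'].values))
test_ds = test_ds.map(load_image, num_parallel_calls=tf.data.AUTOTUNE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
```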
- Backbone: MobileNetV3 (pretrained or trained from scratch)
- Loss: Binary crossentropy (sigmoid output)
- Metrics: Precision, Recall, Binary accuracy
```python
from tensorflow.keras.models import load_model

mobilenet_model = load_model("/kaggle/input/mobilenetv3.keras/keras/default/1/MobileNetV3.keras")
mobilenet_model.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
    metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), tf.keras.metrics.BinaryAccuracy()]
)
history = mobilenet_model.fit(train_ds, epochs=20, validation_data=val_ds, callbacks=callbacks)

# evaluate() returns values in compile order: loss, precision, recall, binary accuracy
test_loss, test_precision, test_recall, test_acc = mobilenet_model.evaluate(test_ds)
```
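The model above is loaded from a saved `.keras` file; for the "trained from scratch" variant, a minimal sketch of how such a classifier could be assembled with Keras Applications (the head layers here are an assumption, not necessarily the saved architecture):

```python
# Hypothetical MobileNetV3 classifier; MobileNetV3Small bundles its own
# preprocessing, so raw (224, 224, 3) images can be fed directly.
base = tf.keras.applications.MobileNetV3Small(
    input_shape=(224, 224, 3), include_top=False, weights='imagenet', pooling='avg'
)
base.trainable = False  # freeze for feature extraction; set True to fine-tune

mobilenet_scratch = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1, activation='sigmoid')  # binary cat/dog output
])
```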
- Backbone: Keras Vision Transformer
- Reference: Keras ViT Example
- Install with:

```bash
pip install -U keras keras-hub
```
```python
import keras_hub

backbone = keras_hub.models.Backbone.from_preset("vit_base_patch16_224_imagenet")
preprocessor = keras_hub.models.ViTImageClassifierPreprocessor.from_preset("vit_base_patch16_224_imagenet")
vit = keras_hub.models.ViTImageClassifier(
    backbone=backbone, num_classes=2, preprocessor=preprocessor, activation='softmax'
)

vit.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)
history = vit.fit(train_ds, epochs=20, validation_data=val_ds, callbacks=callbacks)
```
- Each image is reshaped into a sequence of 224 timesteps (one per row), each with 224 × 3 = 672 features
- Input shape: (224, 672)
- Architecture: LSTM(128) → Dense(64, relu) → Dense(1, sigmoid)
```python
def img_to_seq(image, label):
    # train_ds is already batched, so keep the batch dimension and flatten
    # each image row into one 672-feature timestep: (batch, 224, 224*3)
    image = tf.reshape(image / 255., [-1, 224, 224 * 3])
    return image, label

# The base pipelines are already batched and prefetched, so only map is needed
lstm_train_ds = train_ds.map(img_to_seq, num_parallel_calls=tf.data.AUTOTUNE)
lstm_val_ds = val_ds.map(img_to_seq, num_parallel_calls=tf.data.AUTOTUNE)
lstm_test_ds = test_ds.map(img_to_seq, num_parallel_calls=tf.data.AUTOTUNE)

model_lstm = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 672)),
    tf.keras.layers.LSTM(128, return_sequences=False),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model_lstm.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
    metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), tf.keras.metrics.BinaryAccuracy()]
)
history = model_lstm.fit(lstm_train_ds, epochs=20, validation_data=lstm_val_ds, callbacks=callbacks)

# Same metric order as above: loss, precision, recall, binary accuracy
test_loss, test_precision, test_recall, test_acc = model_lstm.evaluate(lstm_test_ds)
```
- Callbacks: EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard, CSVLogger (a possible setup is sketched after this list)
- Training for all models: same number of epochs, batch size, steps per epoch, data normalization, and input/output structure.
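The `callbacks` list passed to each `fit()` call is not defined above; a minimal sketch of one possible setup using the listed callbacks (monitored quantities, patience values, and file paths are assumptions):

```python
# Hypothetical callback configuration; adjust paths and patience as needed
callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint('best_model.keras', monitor='val_loss', save_best_only=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2),
    tf.keras.callbacks.TensorBoard(log_dir='logs'),
    tf.keras.callbacks.CSVLogger('training_log.csv'),
]
```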
Example: computing the metrics on the test set, shown for ViT:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# ViT outputs a softmax over 2 classes -> take the argmax
y_pred_proba = vit.predict(test_ds)
y_pred = np.argmax(y_pred_proba, axis=1)

# Collect ground-truth labels from the batched test dataset
y_true = []
for x, y in test_ds:
    y_true.extend(y.numpy())
y_true = np.array(y_true)

prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
acc = (y_true == y_pred).mean()
```
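For the sigmoid-output models (MobileNetV3, LSTM), `predict` returns one probability per image, so the predicted class comes from thresholding rather than `argmax`; a sketch under that assumption:

```python
# Binary (sigmoid) models return probabilities of shape (N, 1); threshold at 0.5
y_pred_proba = model_lstm.predict(lstm_test_ds)
y_pred = (y_pred_proba.ravel() >= 0.5).astype(int)
```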
Inference time is measured as the average per-image latency over a single batch:

```python
import time

def measure_inference_time(model, test_ds):
    # Time one batch and average over the images it contains
    for images, _ in test_ds.take(1):
        start_time = time.time()
        model.predict(images)
        end_time = time.time()
        return (end_time - start_time) / len(images)
```
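Usage (variable names are illustrative):

```python
vit_latency = measure_inference_time(vit, test_ds)             # seconds per image
cnn_latency = measure_inference_time(mobilenet_model, test_ds)
lstm_latency = measure_inference_time(model_lstm, lstm_test_ds)
```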
| Model | Precision | Recall | Accuracy | Training time (s) | Inference time (s/image) |
|---|---|---|---|---|---|
| ViT | 0.996503 | 0.989583 | 0.993056 | 1558.669495 | 0.114361 |
| MobileNetV3 | 0.993333 | 0.995000 | 0.996656 | 109.515425 | 0.004638 |
| LSTM | 0.593333 | 0.598333 | 0.599327 | 61.178951 | 0.007577 |
- Convolutional (MobileNetV3) and Vision Transformer (ViT) models show excellent results on images: precision, recall, and accuracy are all near 0.99–1.0, since both architectures handle spatial patterns effectively.
- LSTM performs significantly worse on images: flattening an image into a sequence discards its 2-D spatial structure, and LSTMs are designed for temporal sequences, not images.
- Training/inference speed: MobileNetV3 is fast and highly accurate; ViT is slower but competitive in quality; LSTM is fast but inaccurate on this kind of data.
- Recommendation: for image classification tasks, use CNN/ViT architectures rather than recurrent networks like LSTM.
- Kaggle Dataset: Dogs vs Cats
- Keras Applications: MobileNetV3
- Keras Vision Transformer (ViT)
- Kaggle Pretrained ViT Model
- TensorFlow Keras Callbacks
- tf.data API
- Clone the repository or download the code.
- Prepare the dataset according to the directory structure.
- Install requirements:

```bash
pip install tensorflow keras keras-hub
```
- Run notebooks or scripts for each model.
Author: Daria
- Email: perinadaria19@gmail.com
If you have questions or suggestions, feel free to contact me!