sidneytma/bouba_kiki_neural

Recognizing "Bouba" and "Kiki" Using Neural Machine Learning Methods

Introduction

Figure: the classic stimulus pair. When prompted to determine which is Bouba and which is Kiki, participants typically answer that the left image is Kiki and the right is Bouba.

Background

The “Bouba/Kiki effect” describes an interesting cross-modal association where people tend to match certain visual shapes with certain speech sounds. Specifically, given the nonsense words “Bouba” and “Kiki,” most people associate “Bouba” with a smooth, rounded shape, and “Kiki” with a sharp, angular shape. While these words have no inherent meaning, they are associated consistently across participants of various cultures, ages, and languages (with some exceptions).

Machine Learning

The Bouba/Kiki effect is an interesting case of sensory modalities naturally interacting and informing each other within human perception. Could this phenomenon be replicated via machine learning? Exploring this phenomenon through ML provides an opportunity to determine whether basic models can capture and mimic these subjective perceptual experiences. This is a multimodal task: it involves both image and audio classification, requiring different processing strategies for each modality.

Task

In this project, I attempt to create models which can differentiate Bouba and Kiki stimuli. This entails 1) a model that can distinguish Bouba and Kiki images, 2) one that can distinguish Bouba and Kiki audios, and lastly, 3) one that can correctly match Bouba/Kiki images to audios, without the intermediate step of recognizing the images and audios individually.

Siamese Neural Network

In the classic Bouba/Kiki test, participants are shown an image of Bouba and an image of Kiki, then verbally asked, “Which one is Bouba, and which one is Kiki?” With the two unimodal models (one to identify images, one to identify audios), this task is easy: identify the images, identify the audios, and match them. However, what makes the Bouba/Kiki effect interesting is that participants have no prior knowledge of what Bouba and Kiki could refer to; they are somehow able to match images to audios without any semantic information. In a Siamese neural network, the model does not know the labels for the images and audios, and it is never trained on which is “Bouba” and which is “Kiki.” During training, it only learns what counts as a “correct match” and an “incorrect match.” As such, the Siamese model most closely resembles the conditions of the classic Bouba/Kiki experiment, and (in my opinion) provides the best insight into how this multimodal effect can be replicated algorithmically.

Project Overview

The approach taken in this project consists of several key steps:

  1. Data preparation: Algorithmically generating distinct "Bouba" and "Kiki" images and collecting/augmenting audio samples of the words "Bouba" and "Kiki."
  2. Image Classification: Developing and training a Convolutional Neural Network (CNN) to distinguish Bouba and Kiki images.
  3. Audio Classification: Developing another CNN for classifying audio samples.
  4. Siamese NN: Using a Siamese Neural Network to test the model's ability to match images and audio samples without explicit labels, resembling the conditions of a human psychological experiment.

1. Data Preparation

Images

Figure: Bouba and Kiki images generated from random polar plots (Cartesian plots for reference). Bouba plots come from a sum of sinusoids, and Kiki plots are simply 30 uniformly random points connected together.

Both Bouba and Kiki shapes were generated from random polar plots with the center filled, effectively drawing a closed-loop, irregular shape. Bouba images were based on a sinusoidal equation:

$r = \sin(A\theta) + \cos(B\theta) + C$

Each curve was defined by 100 points distributed between 0 and 2π. The last 10 points were linearly “tapered” so that the final point equaled the first; this kept the shape continuous and avoided sharp edges (not characteristic of Bouba!).

In contrast, Kiki images were much simpler: 30 random radius values between 0.5 and 1.5 were assigned to equally spaced angles between 0 and 2π, resulting in an irregular, pointy, jagged shape.
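The two generators described above can be sketched in NumPy roughly as follows. The coefficient ranges for A, B, and C are illustrative assumptions (the README does not state them); the 100-point curve, the 10-point taper, and the 30 random radii in [0.5, 1.5] come from the text.

```python
import numpy as np

def bouba_points(rng, n=100, taper=10):
    """Rounded shape from r = sin(A*theta) + cos(B*theta) + C on a polar grid."""
    # A, B, C ranges are assumptions; C offsets the radius to keep it positive.
    A, B = rng.integers(2, 6, size=2)
    C = rng.uniform(3.0, 4.0)
    theta = np.linspace(0, 2 * np.pi, n)
    r = np.sin(A * theta) + np.cos(B * theta) + C
    # Linearly taper the last `taper` radii so the final point equals the first.
    w = np.linspace(1, 0, taper)
    r[-taper:] = w * r[-taper:] + (1 - w) * r[0]
    return r * np.cos(theta), r * np.sin(theta)

def kiki_points(rng, n=30):
    """Jagged shape: n random radii at equally spaced angles."""
    theta = np.linspace(0, 2 * np.pi, n, endpoint=False)
    r = rng.uniform(0.5, 1.5, size=n)
    return r * np.cos(theta), r * np.sin(theta)
```

Filling the closed curve (e.g. with matplotlib's `fill`) then yields the final image.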

1000 images of each type were generated. Additionally, each image was rotated at 90-degree intervals and included as a new image, effectively quadrupling the dataset size.

Audio

The audio data consisted of 60 original recordings of both "Bouba" and "Kiki," where I varied my intonation, voice quality, and speaking speed. These recordings were augmented from 120 samples to a total of 3000 samples using techniques such as time shifting, pitch shifting, time stretching, volume adjustments, and adding white noise. This made the audio dataset more robust and (literally) added some noise to an otherwise clean dataset.
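A minimal sketch of three of the augmentations listed above (time shifting, volume adjustment, white noise), assuming mono waveforms at a 16 kHz sample rate; pitch shifting and time stretching would typically be done with a library such as librosa and are omitted here. The shift range, gain range, and noise level are illustrative assumptions.

```python
import numpy as np

def augment(wave, rng, sr=16000):
    """Return one randomly augmented copy of a mono waveform."""
    out = np.roll(wave, rng.integers(-sr // 10, sr // 10))  # time shift up to ±100 ms
    out = out * rng.uniform(0.7, 1.3)                       # volume adjustment
    out = out + rng.normal(0, 0.005, size=out.shape)        # additive white noise
    return out
```

Applying several such random transforms to each of the 120 recordings expands the dataset to the thousands of samples mentioned above.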

2. Image Classification

I used a Convolutional Neural Network (CNN) model to classify images into Bouba and Kiki categories. The model architecture included three convolutional layers with 32, 64, and 128 filters respectively, each followed by max-pooling layers. After flattening the convolutional features, I included a dense layer with 128 neurons and dropout regularization to prevent overfitting. The output layer was a single neuron with sigmoid activation for binary classification.
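The architecture described above can be sketched in Keras roughly as follows. The input size, kernel size, dropout rate, optimizer, and loss are assumptions (the README does not state them); the filter counts (32/64/128), max-pooling after each convolution, the 128-neuron dense layer with dropout, and the sigmoid output come from the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_image_cnn(input_shape=(128, 128, 1)):
    """Binary Bouba/Kiki image classifier (sketch; hyperparameters assumed)."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                      # regularization against overfitting
        layers.Dense(1, activation="sigmoid"),    # binary Bouba-vs-Kiki output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```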

The model achieved a high accuracy very quickly (99.69%), demonstrating that distinguishing between Bouba and Kiki images was a straightforward task for the CNN.

Figure: demonstration of the CNN model correctly recognizing Bouba and Kiki images.

3. Audio Classification

Each audio sample was preprocessed into 17 distinct features: 13 Mel-frequency cepstral coefficients (MFCCs), plus spectral centroid, bandwidth, rolloff, and zero-crossing rate. Because this feature vector is 1-dimensional, the audio classification CNN had a simpler architecture: two convolutional layers (64 and 32 filters) with batch normalization and max-pooling. It used the same dense layer, dropout regularization, and binary classification output layer as the image model.
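Two of the features above can be computed directly in NumPy, as sketched below; MFCCs, bandwidth, and rolloff would typically come from a library such as librosa. The 16 kHz sample rate is an assumption.

```python
import numpy as np

def zero_crossing_rate(wave):
    """Fraction of consecutive samples whose signs differ."""
    signs = np.signbit(wave)
    return np.mean(signs[1:] != signs[:-1])

def spectral_centroid(wave, sr=16000):
    """Magnitude-weighted mean frequency of the spectrum, in Hz."""
    mag = np.abs(np.fft.rfft(wave))
    freqs = np.fft.rfftfreq(len(wave), d=1 / sr)
    return np.sum(freqs * mag) / np.sum(mag)
```

Intuitively, the “sharp” consonants of “Kiki” push both features higher than the “round” vowels of “Bouba,” which is what gives the classifier its signal.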

This CNN achieved 100% accuracy relatively quickly. This is not especially surprising, given that none of the Bouba and Kiki samples sounded similar enough to be hard to discern.

4. Siamese Neural Network

While the above two steps would be sufficient to pass the classic Bouba/Kiki test, they would not resemble the multimodal way that humans respond to the stimuli. To investigate the multimodal associations, I implemented a Siamese Neural Network to test if the model could correctly match audio to images without explicit labeling. I extracted the audio embeddings (64-dimensional) and image embeddings (128-dimensional) from intermediate layers of their respective CNNs.

Figure: PCA projection of the embeddings into 2D space, visualizing their separation.

The Siamese network projected both embeddings into a shared embedding space, calculated the L1 distance between paired embeddings, and predicted whether the audio-image pair matched. Surprisingly, the Siamese network achieved 100% accuracy, meaning it matched audios and images correctly every time. This is an interesting result, given that the image classifier didn’t achieve 100% on its own. My theory is that the classic Bouba/Kiki test setup (“Which one is Bouba and which one is Kiki?”) is a relative comparison, which is an easier task than deciding whether something is objectively Bouba or Kiki without any comparison point.
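The matching logic can be sketched as follows. The projection weights here are random stand-ins for the learned dense layers (their shapes follow the 64- and 128-dimensional embeddings from the text; the 32-dimensional shared space is an assumption), shown only to illustrate the L1-distance comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for learned projections into a shared embedding space.
W_audio = rng.normal(size=(64, 32))   # 64-dim audio embedding -> shared space
W_image = rng.normal(size=(128, 32))  # 128-dim image embedding -> shared space

def match_score(audio_emb, image_emb):
    """L1 distance between the two projections; smaller = better match."""
    return np.abs(audio_emb @ W_audio - image_emb @ W_image).sum()

def pick_match(audio_emb, image_a, image_b):
    """Relative comparison, as in the classic test: which image fits the audio?"""
    return "A" if match_score(audio_emb, image_a) < match_score(audio_emb, image_b) else "B"
```

This relative form (pick the closer of two candidates) is exactly the easier comparison task described above, as opposed to judging a single pair in isolation.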

Demonstration

[see figures/demo_vid.mov]

The demonstration simulates realistic experimental conditions: an audio sample is presented alongside two images (one Bouba and one Kiki).

Conclusion

In all three parts of this project, the models achieved perfect or near-perfect accuracy in recognizing Bouba and Kiki stimuli. This suggests that, while the Bouba/Kiki effect is certainly an interesting human phenomenon, it is not difficult for simple neural models to reproduce, even when the conditions resemble those of the original experiment.
