Evaluating the Role of Multimodal Clinical Data in a Breast Cancer Diagnostic Classifier
📄 Research Paper
Figure: The sequential workflow of the methodology, breaking down each stage from data preparation to experimental setup and analysis.
Figure: Fusion architectures that simultaneously incorporate breast imaging and non-image features. Feature Fusion (A) integrates acquired image features with non-image features, while Learned Feature Fusion (B) combines learned image features with learned non-image features. The baseline models, Image-Only (a) and Non-Image-Only (b), are also shown. Feature vectors are depicted without borders, with the number inside denoting the vector's size. "FC-n" designates a fully-connected layer with "n" hidden units, and "ŷ" denotes the predicted probability.
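The figure above can be read directly as code. Below is a minimal PyTorch sketch of the Learned Feature Fusion idea (B): an image branch and a non-image branch each learn a feature vector, the two vectors are concatenated, and fully-connected layers produce the predicted probability ŷ. The backbone, layer sizes, and class name here are illustrative assumptions, not the exact configuration trained in this project.

```python
import torch
import torch.nn as nn

class LearnedFeatureFusion(nn.Module):
    """Sketch of Learned Feature Fusion (B): learned image features are
    concatenated with learned non-image features before the classification
    head. All sizes here are assumptions, not the project's exact settings."""

    def __init__(self, n_meta_features: int, img_feat_dim: int = 512, meta_feat_dim: int = 32):
        super().__init__()
        # Image branch: any CNN producing a flat learned feature vector.
        self.image_branch = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, img_feat_dim),
            nn.ReLU(),
        )
        # Non-image branch: FC layers over the one-hot encoded clinical features.
        self.meta_branch = nn.Sequential(
            nn.Linear(n_meta_features, meta_feat_dim),
            nn.ReLU(),
        )
        # Fusion head: concatenated learned features -> FC layers -> predicted probability.
        self.head = nn.Sequential(
            nn.Linear(img_feat_dim + meta_feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, image: torch.Tensor, meta: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.image_branch(image), self.meta_branch(meta)], dim=1)
        return torch.sigmoid(self.head(fused)).squeeze(1)  # y_hat, the predicted probability
```

Per the caption, Feature Fusion (A) instead combines the acquired image features with the non-image features directly, without a separate learned non-image branch.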
The process of diagnosing breast cancer is fundamentally multimodal in nature. In real clinical scenarios, physicians use mammography scan images together with clinical information to assess cancer. With advances in artificial intelligence, technology is increasingly integrated into healthcare to support human decision-making. Despite this, several seminal deep learning approaches for breast cancer classification rely on either image data or non-image textual clinical data alone, without effectively integrating both modalities. Using an established study as a reference, this project compared the diagnostic classification performance of unimodal and multimodal approaches, with the aim of understanding how a classifier's performance changes when image data is suitably combined with non-image clinical data.
This project used the CBIS-DDSM dataset, which comprises 6,671 breast images from 1,566 participants. Key sampling and preprocessing steps, such as binary encoding and image preprocessing (including pixel value extraction and normalisation), were implemented. The breast images were then linked with relevant non-image features, including breast density, mass shape, and calcification type. Unimodal baseline models based on image and non-image data were trained first, after which two models with distinct modality fusion strategies were trained. Performance was assessed using the area under the receiver operating characteristic curve (AUC) and specificity at 95% sensitivity. The best iteration of the image-only model achieved an AUC of 0.53 (95% CI: 0.47, 0.60) and a specificity at 95% sensitivity of 10%. In contrast, the top-performing non-image-only model achieved an AUC of 0.74 (95% CI: 0.67, 0.79) and a specificity at 95% sensitivity of 29%. The best feature fusion model attained an AUC of 0.67 (95% CI: 0.60, 0.73) with a specificity at 95% sensitivity of 17%. Notably, the learned feature fusion model surpassed the feature fusion model, achieving an AUC of 0.74 (95% CI: 0.68, 0.80) and a specificity at 95% sensitivity of 26%.
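For reference, both metrics reported above can be computed from predicted probabilities as in the hedged sketch below (scikit-learn assumed; `y_true` and `y_prob` are placeholder arrays, and the exact operating-point selection used in the project may differ).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def specificity_at_sensitivity(y_true, y_prob, target_sensitivity=0.95):
    """Specificity at the first ROC operating point whose sensitivity (TPR) reaches the target."""
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    idx = int(np.argmax(tpr >= target_sensitivity))  # tpr is non-decreasing along the curve
    return 1.0 - fpr[idx]

# Toy placeholder data, not results from this project.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])
print("AUC:", roc_auc_score(y_true, y_prob))
print("Specificity @ 95% sensitivity:", specificity_at_sensitivity(y_true, y_prob))
```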
`data_preparation.ipynb` was used to prepare the non-image features (including feature selection and one-hot encoding), merge the non-image and image data sources, perform stratified sampling and train/test splits, and finally export the data, including ID references and meta feature names, to the `dataset/` directory in the following file structure:
```
├── dataset
│   ├── train
│   │   ├── 1_x_cropped.npy
│   │   ├── 1_y.npy
│   │   ├── 1_meta.npy
│   │   ├── ...
│   ├── val
│   │   ├── 1_x_cropped.npy
│   │   ├── 1_y.npy
│   │   ├── 1_meta.npy
│   │   ├── ...
│   ├── test
│   │   ├── 1_x_cropped.npy
│   │   ├── 1_y.npy
│   │   ├── 1_meta.npy
│   │   ├── ...
│   ├── id_reference_test.txt
│   ├── id_reference_train.txt
│   ├── id_reference_val.txt
│   ├── meta_header.txt
```
where:
- `*_x_cropped.npy` contains a raw cropped image of shape `(h, w)`,
- `*_y.npy` contains the associated target of shape `(1,)`,
- `*_meta.npy` contains the associated non-image data of shape `(n_features,)`,
- `id_reference_*.txt` contains the ID references for the corresponding split, and
- `meta_header.txt` contains the column names of the final one-hot encoded non-image features.
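As a quick sanity check, one exported example can be loaded back with plain NumPy. This sketch simply mirrors the layout above and assumes sample ID 1 exists in `dataset/train` and that `meta_header.txt` stores one column name per line.

```python
import numpy as np

# Load one exported training example (ID 1) following the structure above.
x = np.load("dataset/train/1_x_cropped.npy")   # raw cropped image, shape (h, w)
y = np.load("dataset/train/1_y.npy")           # target, shape (1,)
meta = np.load("dataset/train/1_meta.npy")     # one-hot non-image features, shape (n_features,)

# Column names of the one-hot encoded non-image features (assumed one per line).
with open("dataset/meta_header.txt") as f:
    meta_names = [line.strip() for line in f if line.strip()]

print(x.shape, y.shape, meta.shape, len(meta_names))
```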
Assuming you have Anaconda installed, run `conda env create -f mri_fusion.yaml` to install all prerequisites. The pipeline used for this work is straightforward: normalise and resize the image data (`pre_process.py`) and train the models (`train.py`). The specific commands used to conduct all experiments are in `run_experiments.bat`, which runs five experiments per model with different random seeds. After training, a feature importance analysis (`feature_imp.py`) was conducted to understand the contribution of each non-image feature to the breast cancer prediction. Lastly, `plotROC.ipynb` was used to produce the combined ROC curves for the best run and the five-run ensemble. All experimental results for the two unimodal baselines, Feature Fusion, and Learned Feature Fusion can be found in the `results/` directory.
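The normalise-and-resize step performed by `pre_process.py` can be approximated as in the sketch below; the 224×224 target size, min-max normalisation, and use of OpenCV are assumptions and may differ from the script's actual settings.

```python
import numpy as np
import cv2  # OpenCV assumed here; pre_process.py may use a different resizing library

def preprocess_image(x: np.ndarray, size: int = 224) -> np.ndarray:
    """Min-max normalise a raw cropped image to [0, 1] and resize it to (size, size)."""
    x = x.astype(np.float32)
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)  # guard against flat (constant) images
    return cv2.resize(x, (size, size), interpolation=cv2.INTER_AREA)

# Example: preprocess one exported training image.
img = np.load("dataset/train/1_x_cropped.npy")
print(preprocess_image(img).shape)  # (224, 224)
```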
To execute the experiments, simply run `train.py` with the arguments of your choice. Below is an example of training the Learned Feature Fusion model with the settings used in this project, relying on default values for most arguments (refer to `train.py` for the defaults):
```
python train.py --data_dir <path_to_your_dataset> \
    --out_dir <path_to_results> \
    --model learned-feature-fusion \
    --seed 0
```
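`run_experiments.bat` is a Windows batch file; a hedged Python equivalent of its seed loop is sketched below. It mirrors the example command above with placeholder paths and is not the batch file's exact contents.

```python
import subprocess

# Train the Learned Feature Fusion model five times with different random seeds,
# mirroring the example command above (paths are placeholders).
for seed in range(5):
    subprocess.run(
        [
            "python", "train.py",
            "--data_dir", "<path_to_your_dataset>",
            "--out_dir", "<path_to_results>",
            "--model", "learned-feature-fusion",
            "--seed", str(seed),
        ],
        check=True,
    )
```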
Example of running the feature importance analysis on the model trained above:

```
python feature_imp.py --model learned-feature-fusion \
    --model_name learned-feature-fusion-concat_no-CW_aug_seed0
```
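`feature_imp.py` quantifies how much each non-image feature contributes to the prediction. One common way to do this is permutation importance, sketched below; `predict_fn` is a hypothetical wrapper around the trained model (with image inputs held fixed), and this is not necessarily the exact method implemented in `feature_imp.py`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_importance(predict_fn, X_meta, y_true, n_repeats=5, seed=0):
    """Mean drop in AUC when each non-image feature column is shuffled.

    `predict_fn(X_meta)` is assumed to return predicted probabilities for the
    positive class; larger AUC drops indicate more important features."""
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y_true, predict_fn(X_meta))
    importances = np.zeros(X_meta.shape[1])
    for j in range(X_meta.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_meta.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link for feature j only
            drops.append(baseline - roc_auc_score(y_true, predict_fn(X_perm)))
        importances[j] = np.mean(drops)
    return importances
```

The per-column importances can then be matched against the names in `meta_header.txt` to rank features such as breast density, mass shape, and calcification type.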
Contributors: Si Man Kou, Paul Yomer Ruiz Pinto, Pranshu Datta
Last Modified: 5/11/2023
Code adapted from https://github.com/gholste/breast_mri_fusion/tree/main, licensed under the MIT License, which permits modification and distribution.