
An Evaluation Framework for Medical Image Distribution Similarity Metrics

arXiv paper: https://arxiv.org/abs/2412.01496

We provide an easy-to-use framework, accompanying our paper, for evaluating distance/similarity metrics between unpaired sets of medical images. For example, it can be used to evaluate the performance of image generative models in the medical imaging domain. The codebase includes implementations of several distance metrics for comparing image sets, as well as tools for evaluating the performance of generative models on various downstream tasks.

Included metrics:

  1. FRD (Fréchet Radiomic Distance)
  2. FID (Fréchet Inception Distance)
  3. Radiology FID/RadFID
  4. KID (Kernel Inception Distance)
  5. CMMD (CLIP Maximum Mean Discrepancy)

Credits

Thanks to the following repositories, which this framework builds upon:

  1. frd-score
  2. pyradiomics
  3. gan-metrics-pytorch, which we modified to allow for computing RadFID.
  4. cmmd-pytorch

Citation

Please cite our paper if you use this framework in your work:

@article{konzosuala_frd2025,
      title={Fr\'echet Radiomic Distance (FRD): A Versatile Metric for Comparing Medical Imaging Datasets},
      author={Konz, Nicholas and Osuala, Richard and Verma, Preeti and Chen, Yuwen and Gu, Hanxue and Dong, Haoyu and Chen, Yaqian and Marshall, Andrew and Garrucho, Lidia and Kushibar, Kaisar and Lang, Daniel M. and Kim, Gene S. and Grimm, Lars J. and Lewin, John M. and Duncan, James S. and Schnabel, Julia A. and Diaz, Oliver and Lekadir, Karim and Mazurowski, Maciej A.},
      year={2025},
      eprint={2412.01496},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.01496},
}

0. Installation/Setup

  1. Run pip3 install -r requirements.txt to install the required packages.
  2. RadFID additionally requires the RadImageNet weights for the InceptionV3 model, which can be downloaded from RadImageNet's official source. Once downloaded, place the InceptionV3.pt checkpoint file into src/gan-metrics-pytorch/models and rename it to RadImageNet_InceptionV3.pt; our code will take care of the rest.
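
As an optional sanity check (assuming PyTorch is installed via the requirements), you can verify that the renamed checkpoint is in place and loadable:

# Optional sanity check: confirm the RadImageNet InceptionV3 checkpoint
# sits at the path the framework expects (see step 2 above).
from pathlib import Path

import torch

ckpt = Path("src/gan-metrics-pytorch/models/RadImageNet_InceptionV3.pt")
assert ckpt.exists(), f"checkpoint not found at {ckpt}"
state = torch.load(ckpt, map_location="cpu")  # should load without errors
print(f"Loaded checkpoint object of type {type(state).__name__}")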

1. Basic Metric Computation

You can compute various distance metrics between two sets of images using the following command:

bash compute_allmetrics.sh $IMAGE_FOLDER1 $IMAGE_FOLDER2 $METRICS

where $IMAGE_FOLDER1 and $IMAGE_FOLDER2 are the paths to the two folders containing the images you want to compare, and $METRICS is the comma-separated list of metrics to compute, chosen from FRD, FID, RadFID, KID, and CMMD. For example, to compute only FRD and CMMD, you would run:

bash compute_allmetrics.sh $IMAGE_FOLDER1 $IMAGE_FOLDER2 FRD,CMMD

This will print the computed distances to the terminal. This can be used, for example, to evaluate the performance of a generative model by comparing its generated images to a set of real reference images.
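
For reference, the Fréchet-family metrics here (FID, RadFID, and FRD) all reduce to the Fréchet distance between two Gaussians fitted to feature embeddings of the image sets; they differ in which feature extractor produces the embeddings. A minimal NumPy/SciPy sketch of that final computation (illustrative only, not the repository's implementation):

# Fréchet distance between Gaussians fitted to two sets of feature
# embeddings, each of shape (num_images, feature_dim). Illustrative sketch.
import numpy as np
from scipy import linalg

def frechet_distance(feats1: np.ndarray, feats2: np.ndarray) -> float:
    mu1, sigma1 = feats1.mean(axis=0), np.cov(feats1, rowvar=False)
    mu2, sigma2 = feats2.mean(axis=0), np.cov(feats2, rowvar=False)
    covmean = linalg.sqrtm(sigma1 @ sigma2).real  # drop tiny imaginary parts
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))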

2. Further Evaluations: Intrinsic

2.1 Sample Efficiency and Computation Speed Analysis

As in our paper (Secs. 5.2 and 5.3), you can also evaluate how the distance estimates and computation times change with the number of images used to compute them. To do so, run the run_sample_efficiency.sh script with the same two image-folder arguments as compute_allmetrics.sh (see Basic Metric Computation), but with the third argument now specifying the sample sizes you want to use, provided as a single space-separated string. For example, to compute the distances for sample sizes of 10, 100, 500, and 1000 images, you can run:

bash run_sample_efficiency.sh $IMAGE_FOLDER1 $IMAGE_FOLDER2 "10 100 500 1000"

The distance values and computation times will be printed to the terminal.
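
Conceptually, this analysis amounts to subsampling each folder at every sample size and timing the metric computation on the subsets. A simplified sketch of such a loop (not the script's exact implementation; folder1 and folder2 are hypothetical placeholders for your two image folders):

# Simplified sample-efficiency loop: subsample each image folder to a
# given size, then time the metric computation on the subsampled copies.
import random, shutil, subprocess, tempfile, time
from pathlib import Path

folder1, folder2 = "path/to/images1", "path/to/images2"  # hypothetical paths

def subsample(src: str, n: int, dest: str) -> None:
    paths = sorted(Path(src).iterdir())
    for p in random.sample(paths, n):
        shutil.copy(p, Path(dest) / p.name)

for n in [10, 100, 500, 1000]:
    with tempfile.TemporaryDirectory() as d1, tempfile.TemporaryDirectory() as d2:
        subsample(folder1, n, d1)
        subsample(folder2, n, d2)
        start = time.time()
        subprocess.run(["bash", "compute_allmetrics.sh", d1, d2, "FRD"], check=True)
        print(f"n={n}: {time.time() - start:.1f} s")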

2.2 Sensitivity to Image Transformations

To evaluate the sensitivity of the distance metrics to image transformations (as in Sec. 5.4 of our paper), you can use the transform_images.py script. This script applies a set of transformations to a folder of images $IMAGE_FOLDER and saves the transformed images in separate folders. The transformations are Gaussian blur (kernel sizes 5 and 9) and sharpness adjustment (sharpness factors 0, 0.5, and 2). The script can be run with the following command:

python3 transform_images.py $IMAGE_FOLDER

where $IMAGE_FOLDER is the path to the folder containing the images you want to transform. For each transformation, the script saves the transformed images in a new folder in the same directory as the input folder, named after the input folder plus the transformation type and parameter (e.g., {$IMAGE_FOLDER}_gaussian_blur5).
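
For intuition, the transformations described above correspond roughly to the following torchvision operations (a sketch under the listed parameters; the script's actual implementation may differ):

# Sketch of the transformations above using torchvision's functional API;
# kernel sizes and sharpness factors follow the parameters listed in the text.
from PIL import Image
import torchvision.transforms.functional as TF

img = Image.open("example.png")  # hypothetical input image

blur5 = TF.gaussian_blur(img, kernel_size=5)              # Gaussian blur, kernel 5
blur9 = TF.gaussian_blur(img, kernel_size=9)              # Gaussian blur, kernel 9
sharp0 = TF.adjust_sharpness(img, sharpness_factor=0)     # fully blurred
sharp05 = TF.adjust_sharpness(img, sharpness_factor=0.5)  # mildly blurred
sharp2 = TF.adjust_sharpness(img, sharpness_factor=2)     # sharpened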

From here, the sensitivity of the distance metrics to a given transformation can be evaluated simply by computing the metrics between the non-transformed image folder and the corresponding transformed folder (see Basic Metric Computation). For example, for Gaussian blur with kernel size 5, you can run:

bash compute_allmetrics.sh $IMAGE_FOLDER $IMAGE_FOLDER_TRANSFORMED $METRICS

where $IMAGE_FOLDER_TRANSFORMED is the path to the folder containing the transformed images: {$IMAGE_FOLDER}_gaussian_blur5 in this case.

3. Further Evaluations: Extrinsic

3.1 Correlation with Downstream Task Performance

As in Sec. 4.2 of our paper, you can evaluate the correlation between a distance metric and the performance of a downstream task (e.g., classification or segmentation) using the correlate_downstream_task.py script. For example, as in our paper, this can be used to evaluate image-to-image translation models: given a test set $D_{s\rightarrow t}$ of source-domain images translated to the target domain, as well as an additional set of reference target-domain images $D_t$, a distance/similarity metric $d$ (e.g., FRD) can be evaluated by testing whether $d(D_t, D_{s\rightarrow t})$ can serve as a proxy for (i.e., correlates with) the performance of some downstream task model on $D_{s\rightarrow t}$ (for example, Dice coefficient if the task is segmentation). Note: for this to be valid, the reference set $D_t$ must be fixed for all evaluations of $d$.

To use this script, create a simple CSV file with the following columns:

  • distance: the distance metric value (e.g., FRD) between the test images (e.g., generated/translated images) and the reference images
  • task_performance: the performance of the downstream task model on the test images (e.g., Dice coefficient)
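
For example, a minimal CSV might look like this (values are purely illustrative; each row pairs one evaluated model or test set with its distance and measured task performance):

distance,task_performance
12.3,0.81
8.7,0.86
25.1,0.64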

From here, you can run the script with the following command:

python3 correlate_downstream_task.py $CSV_FILE

where $CSV_FILE is the path to the CSV file you created. The script computes the correlation between the distance metric and the downstream task performance using the Pearson linear correlation coefficient as well as the Spearman and Kendall nonlinear/rank correlation coefficients, together with the p-value of each correlation test (indicating its statistical significance), and prints the results to the terminal. It also plots the distance values against the downstream task performance and saves the scatter plot as correlation_plot.png in the same directory as the input CSV file.
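
The reported coefficients correspond to standard SciPy tests; a minimal sketch of the computation (illustrative, not the script's exact implementation; results.csv is a hypothetical filename):

# Minimal sketch of the correlation analysis: Pearson (linear) plus
# Spearman and Kendall (rank) coefficients, each with its p-value.
import pandas as pd
from scipy import stats

df = pd.read_csv("results.csv")  # CSV with the two columns described above
x, y = df["distance"], df["task_performance"]

for name, test in [("Pearson", stats.pearsonr),
                   ("Spearman", stats.spearmanr),
                   ("Kendall", stats.kendalltau)]:
    r, p = test(x, y)
    print(f"{name}: r={r:.3f}, p={p:.4g}")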

3.2 Out-of-Domain/Distribution Detection

The script ood_detection.py allows you to evaluate the ability of different feature representations to detect out-of-distribution (OOD) images in both threshold-free and threshold-based settings, as in Section 4.1 of our paper. The evaluation requires:

  1. A reference in-distribution image set (used to compute a reference feature distribution), for example, a model's training set.
  2. A test set of both in-distribution (ID) and out-of-distribution (OOD) images.

This script extracts feature embeddings (e.g., standardized radiomic features as used in FRD, or InceptionV3 features with ImageNet or RadImageNet weights, as evaluated in our paper) and evaluates:

  1. Threshold-independent performance: using AUC based on distance from the ID mean.
  2. Threshold-based detection: using a 95th percentile threshold on ID validation distances, to compute accuracy, TPR, TNR, and AUC.

To run this file, you can use the following command:

python3 ood_detection.py \
  --img_folder_ref_id ${IMAGE_FOLDER_REF_ID} \
  --img_folder_test_id ${IMAGE_FOLDER_TEST_ID} \
  --img_folder_test_ood ${IMAGE_FOLDER_TEST_OOD}

where:

  • ${IMAGE_FOLDER_REF_ID} is the path to the folder containing the reference in-distribution images.
  • ${IMAGE_FOLDER_TEST_ID} is the path to the folder containing the test in-distribution images.
  • ${IMAGE_FOLDER_TEST_OOD} is the path to the folder containing the test out-of-distribution images.

The various results will be printed to the terminal.
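
For intuition, the detection protocol above can be sketched as follows (a simplified sketch assuming NumPy and scikit-learn; feature extraction is omitted, the feats_* arrays are hypothetical placeholders, and the script's details may differ):

# Simplified OOD-detection sketch: score each test image by its distance to
# the in-distribution (ID) feature mean, threshold at the 95th percentile of
# the ID reference distances, and report AUC, accuracy, TPR, and TNR.
import numpy as np
from sklearn.metrics import roc_auc_score

def dist_to_mean(feats: np.ndarray, mean: np.ndarray) -> np.ndarray:
    return np.linalg.norm(feats - mean, axis=1)

# feats_ref, feats_id, feats_ood: (num_images, feature_dim) arrays for the
# reference ID set, test ID set, and test OOD set (extraction omitted).
mean = feats_ref.mean(axis=0)
d_ref = dist_to_mean(feats_ref, mean)
scores = np.concatenate([dist_to_mean(feats_id, mean),
                         dist_to_mean(feats_ood, mean)])
labels = np.concatenate([np.zeros(len(feats_id)), np.ones(len(feats_ood))])

# Threshold-independent: AUC of the distance used as an OOD score.
print("AUC:", roc_auc_score(labels, scores))

# Threshold-based: flag images beyond the 95th percentile of ID distances.
preds = scores > np.percentile(d_ref, 95)
print("Accuracy:", (preds == labels).mean())
print("TPR:", preds[labels == 1].mean(), "TNR:", 1 - preds[labels == 0].mean())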
