Reproducibility

Vanessasaurus edited this page Jun 1, 2019 · 6 revisions

Singularity Python has commands that generate content hashes for all files in a container and use an algorithm to derive similarity scores between containers at different levels of reproducibility.

Levels

You can see the levels by loading them programmatically:

from singularity.analysis.reproduce import get_levels
levels = get_levels()

and print their description:

for name,level in levels.items():
    print("%s: %s" %(name,level['description']))

I've put them into a list for easier reading here:

  • RECIPE: recipe looks at everything on the level of the Singularity image, meaning the runscript, and environment for version 2.2
  • LABELS: only look at the container labels, if they exist (singularity version 2.3)
  • RUNSCRIPT: runscript is a level that assesses only the executable runscript in the image. This is a fast approach to sniff if the container is broadly doing the same thing
  • BASE: base ignores the core Singularity files, and focuses on the base operating system, and omits files in variable locations (eg, /tmp and /var)
  • IDENTICAL: The image is exactly the same, meaning the file itself. This is what should be achieved if you download the same image multiple times. The entire contents of the image are used to generate the hash.
  • ENVIRONMENT: only look at the container's environment. This level will only look at the environment files when assessing similarity.
  • REPLICATE: replicate assumes equivalence in the core Singularity files, plus the base operating system, but not including files in variable locations (eg, /tmp and /var)
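To make the BASE description concrete, here is a minimal sketch of that kind of path filtering. The function name and the skip prefixes are hypothetical, chosen only to mirror the description above; the library's own level definitions may use different rules.

```python
# Hypothetical prefixes taken from the BASE description above:
# the core Singularity folder plus variable locations.
SKIP_PREFIXES = ("/.singularity.d", "/tmp", "/var")


def select_base_files(paths):
    """Keep only paths outside the skipped locations (illustration only)."""
    kept = []
    for path in paths:
        # Match the prefix itself or anything beneath it, so that
        # "/variable" is not mistaken for something under "/var".
        skipped = any(path == p or path.startswith(p + "/") for p in SKIP_PREFIXES)
        if not skipped:
            kept.append(path)
    return kept


files = [
    "/bin/sh",
    "/tmp/cache",
    "/var/log/syslog",
    "/.singularity.d/runscript",
    "/etc/os-release",
]
base_files = select_base_files(files)
print(base_files)
```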

Compare Containers

We can use these levels to assess two containers. Let's first pull two images to compare. Note that we are using the Singularity Client from spython:

from spython.main import Client

image1 = Client.pull('docker://ubuntu')
image2 = Client.pull('docker://busybox')

Note that you don't have to create images in this fashion; you can use images that you already have on your machine. Now let's run a function to compare the images on one of our reproducibility levels.

from singularity.analysis.reproduce import assess_differences, get_level

levels = {'RECIPE': get_level("RECIPE")}
diffs = assess_differences(image1,image2,levels=levels)

The result, diffs, gives you a score for each level you asked for (here only RECIPE):

{'RECIPE': {'difference': [],
  'intersect_different': [],
  'same': 7,
  'union': 14},
 'scores': {'RECIPE': 1.0}}

Since we have all equivalent files in the Singularity root folder (/.singularity.d) the score here is 1.0. Deviations in run scripts, environments, or other files would change this score.
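The numbers in this output are consistent with a score of 2 * same / union, where union counts the files from both containers together. That formula is inferred from the example values, not taken from the library's source, so treat the sketch below as an assumption. The function name is hypothetical.

```python
def similarity_score(same, union):
    """Score implied by the example output: 2 * same / union.

    Assumption: 'union' is the combined file count of both containers,
    so two identical sets of 7 files give union = 14.
    """
    if union == 0:
        # Nothing to compare; returning 0.0 is an arbitrary choice
        # made for this sketch.
        return 0.0
    return 2.0 * same / union


score = similarity_score(same=7, union=14)
print(score)
```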

Hashes

You may want to generate image hashes for your own purposes. Here we are still using image1 and image2 from above, but you can use the path to any image you obtained another way.

from singularity.analysis.reproduce import (
    get_content_hashes,
    get_image_hashes,
    get_image_hash
)

hashes = get_content_hashes(image1)

This first command gives you a dictionary of hashes, sizes, and root owned. Each of these is itself a dictionary with the file name as key, and the hash (md5 sum), size (MB), or root_owned (True/False) as the value. The next function, get_image_hashes, generates static values that summarize all files for each level.

# levels should hold every level you want hashed (e.g., levels = get_levels())
hashes = get_image_hashes(image1,levels)
hashes
{'BASE': 'd41d8cd98f00b204e9800998ecf8427e',
 'ENVIRONMENT': '3ec5cadab8ed209ae0c5b87ab6b52352',
 'IDENTICAL': 'aaae49889e12564e4c00cd2a821b3446',
 'LABELS': 'd41d8cd98f00b204e9800998ecf8427e',
 'RECIPE': '3ec5cadab8ed209ae0c5b87ab6b52352',
 'REPLICATE': 'd41d8cd98f00b204e9800998ecf8427e',
 'RUNSCRIPT': 'd41d8cd98f00b204e9800998ecf8427e'}
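One thing worth checking in output like this: d41d8cd98f00b204e9800998ecf8427e is the MD5 digest of empty input, so a level showing it probably had nothing to hash for this image. That reading is an assumption about the library's behavior; the digest itself is easy to verify with the standard library:

```python
import hashlib

# MD5 of zero bytes of input.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()

# Values copied from the example output above.
hashes = {
    "BASE": "d41d8cd98f00b204e9800998ecf8427e",
    "ENVIRONMENT": "3ec5cadab8ed209ae0c5b87ab6b52352",
    "IDENTICAL": "aaae49889e12564e4c00cd2a821b3446",
    "LABELS": "d41d8cd98f00b204e9800998ecf8427e",
    "RECIPE": "3ec5cadab8ed209ae0c5b87ab6b52352",
    "REPLICATE": "d41d8cd98f00b204e9800998ecf8427e",
    "RUNSCRIPT": "d41d8cd98f00b204e9800998ecf8427e",
}

# Levels whose digest equals the empty-input digest.
empty_levels = sorted(name for name, digest in hashes.items() if digest == EMPTY_MD5)
print(EMPTY_MD5)
print(empty_levels)
```

If several levels come back with this value, it may be worth confirming that the image actually contains the files those levels look for before relying on the hashes for comparison.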

We advise you to save only the IDENTICAL, ENVIRONMENT, RUNSCRIPT, and RECIPE levels for later comparison with other images. The other levels rely on reading the byte content of files and comparing it against the other image. To make the algorithm faster, this is only done when the original files aren't in agreement.

Visualization

Note: this functionality is better handled by container-tree.

You can generate a data frame with difference scores between two containers:

from singularity.analysis.compare import compare_singularity_images
image_files = [image1,image2]
diffs = compare_singularity_images(image_paths1=image_files)
diffs
             ubuntu.sif  busybox.sif
ubuntu.sif            1          0.0
busybox.sif           0          1.0

If you have two data frames like this (called diffs1 and diffs2 below), you can do RSA (representational similarity analysis) to compare them:

from singularity.analysis.compare import RSA

pearsonr_sim = RSA(diffs1,diffs2)
# 0.74136458648818637
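For intuition, representational similarity analysis is commonly computed as the Pearson correlation between the off-diagonal entries of two similarity matrices. The sketch below follows that common definition using plain lists and the standard library. Whether the library's RSA function does exactly this is an assumption, and the helper names and example matrices are made up for illustration.

```python
import math


def lower_triangle(matrix):
    """Entries strictly below the diagonal, row by row."""
    return [matrix[i][j] for i in range(len(matrix)) for j in range(i)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rsa_sketch(m1, m2):
    """Correlate the off-diagonal structure of two similarity matrices."""
    return pearson(lower_triangle(m1), lower_triangle(m2))


# Two made-up 3x3 similarity matrices; at least three containers are
# needed, since a 2x2 matrix has only one off-diagonal pair and the
# correlation would be undefined.
m1 = [[1.0, 0.1, 0.2], [0.1, 1.0, 0.3], [0.2, 0.3, 1.0]]
m2 = [[1.0, 0.2, 0.4], [0.2, 1.0, 0.6], [0.4, 0.6, 1.0]]

similarity = rsa_sketch(m1, m2)
print(similarity)
```

The diagonal is left out because every container is trivially identical to itself, which would inflate the correlation without saying anything about how the two comparisons agree.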