Reproducibility
Singularity Python has commands that generate content hashes for all files in a container, and use an algorithm to derive similarity scores between containers at different levels of reproducibility.
You can see the levels by loading them programmatically:
from singularity.analysis.reproduce import get_levels
levels = get_levels()
and print their description:
for name, level in levels.items():
    print("%s: %s" % (name, level['description']))
I've put them into a list for easier reading here:
- RECIPE: recipe looks at everything on the level of the Singularity image, meaning the runscript, and environment for version 2.2
- LABELS: only look at the container labels, if they exist (singularity version 2.3)
- RUNSCRIPT: runscript is a level that assesses only the executable runscript in the image. This is a fast approach to sniff if the container is broadly doing the same thing
- BASE: base ignores the core Singularity files, and focuses on the base operating system, and omits files in variable locations (e.g., /tmp and /var)
- IDENTICAL: The image is exactly the same, meaning the file itself. This is what should be achieved if you download the same image multiple times. The entire contents of the image are used to generate the hash.
- ENVIRONMENT: only look at the container's environment. This level will only look at the environment files when assessing similarity.
- REPLICATE: replicate assumes equivalence in the core Singularity files, plus the base operating system, but not including files in variable locations (e.g., /tmp and /var)
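To make the idea of a level more concrete, here is a minimal sketch that treats a level as a set of include and skip patterns applied to file paths. The dictionary keys (include, skip), the patterns, and the select_files helper are all hypothetical and chosen only to mirror the descriptions above. They are not the schema that get_levels() returns.

```python
import re

# Hypothetical level definitions, loosely modelled on the descriptions above.
# The keys "include" and "skip" are illustrative, not the library's schema.
EXAMPLE_LEVELS = {
    "RUNSCRIPT": {"include": [r"^\.singularity\.d/runscript$"], "skip": []},
    "ENVIRONMENT": {"include": [r"^\.singularity\.d/env/"], "skip": []},
    "BASE": {
        "include": [r".*"],
        "skip": [r"^\.singularity\.d/", r"^tmp/", r"^var/"],
    },
}


def select_files(paths, level):
    """Return the sorted paths that a (hypothetical) level would compare."""
    selected = []
    for path in paths:
        included = any(re.search(p, path) for p in level["include"])
        skipped = any(re.search(p, path) for p in level["skip"])
        if included and not skipped:
            selected.append(path)
    return sorted(selected)


# Example file listing (paths relative to the container root)
paths = [
    ".singularity.d/runscript",
    ".singularity.d/env/90-environment.sh",
    "bin/sh",
    "tmp/scratch.txt",
    "var/log/dpkg.log",
]
for name, level in EXAMPLE_LEVELS.items():
    print(name, select_files(paths, level))
```

In this sketch BASE keeps only bin/sh, because the Singularity metadata folder and the variable locations are skipped.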
We can use these levels to assess two containers. Let's first pull two images to compare. Note that we are using the Singularity Client from spython:
from spython.main import Client
image1 = Client.pull('docker://ubuntu')
image2 = Client.pull('docker://busybox')
Note that you don't have to create images in this fashion; you can use images that you already have on your machine. Now let's run a function to compare the images on one of our reproducibility levels, RECIPE.
from singularity.analysis.reproduce import assess_differences, get_level
levels = {'RECIPE': get_level("RECIPE")}
diffs = assess_differences(image1,image2,levels=levels)
diffs will give you, for each level (here only RECIPE), a report of the comparison and a score:
{'RECIPE': {'difference': [],
            'intersect_different': [],
            'same': 7,
            'union': 14},
 'scores': {'RECIPE': 1.0}}
Since we have all equivalent files in the Singularity root folder (/.singularity.d) the score here is 1.0. Deviations in run scripts, environments, or other files would change this score.
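As an illustration of how such a report could be computed, the sketch below compares two dictionaries that map file paths to hashes. The function name compare_file_hashes and the formula 2 * same / union are assumptions: they are chosen because they reproduce the numbers shown above (same = 7, union = 14, score = 1.0), not because they are confirmed to be what assess_differences does internally.

```python
def compare_file_hashes(hashes1, hashes2):
    """Compare two {path: digest} dictionaries and return a small report.

    "union" is taken here as the total number of files across both
    containers, which is an assumption that matches the example output.
    """
    shared = set(hashes1) & set(hashes2)
    same = sum(1 for path in shared if hashes1[path] == hashes2[path])
    # Files present in both containers but with different content
    intersect_different = sorted(
        path for path in shared if hashes1[path] != hashes2[path]
    )
    # Files present in only one of the two containers
    difference = sorted(set(hashes1) ^ set(hashes2))
    union = len(hashes1) + len(hashes2)
    score = 2 * same / union if union else 1.0
    return {
        "difference": difference,
        "intersect_different": intersect_different,
        "same": same,
        "union": union,
        "score": score,
    }


# Seven identical files in each container, as in the RECIPE example
files = {"file%d" % i: "hash%d" % i for i in range(7)}
report = compare_file_hashes(files, dict(files))
print(report["same"], report["union"], report["score"])  # 7 14 1.0

# Changing the content of one file lowers the score
changed = dict(files, file0="other")
print(compare_file_hashes(files, changed)["score"])
```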
You may want to generate image hashes for your own purposes. Here we are still using image1 and image2 from above, but you can use the path to any image you obtained another way.
from singularity.analysis.reproduce import (
    get_content_hashes,
    get_image_hashes,
    get_image_hash
)
hashes = get_content_hashes(image1)
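This first command gives you a dictionary with the keys hashes, sizes, and root_owned. Each of these is itself a dictionary with the file name as key, and the hash (md5 sum), size (MB), or root ownership (True/False) as value. The next function generates a single static hash per level to summarize all files. Here levels should be the full dictionary returned by get_levels(), not the single-level dictionary used for assess_differences.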
hashes = get_image_hashes(image1,levels)
hashes
{'BASE': 'd41d8cd98f00b204e9800998ecf8427e',
'ENVIRONMENT': '3ec5cadab8ed209ae0c5b87ab6b52352',
'IDENTICAL': 'aaae49889e12564e4c00cd2a821b3446',
'LABELS': 'd41d8cd98f00b204e9800998ecf8427e',
'RECIPE': '3ec5cadab8ed209ae0c5b87ab6b52352',
'REPLICATE': 'd41d8cd98f00b204e9800998ecf8427e',
'RUNSCRIPT': 'd41d8cd98f00b204e9800998ecf8427e'}
We advise you to only save the levels IDENTICAL, ENVIRONMENT, RUNSCRIPT, and RECIPE for later comparison with other images. The other levels, when doing a comparison, rely on reading the byte content and comparing it to the other image. To make the algorithm faster, this is only done when the original files aren't in agreement.
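For intuition about what per-file content hashes and a per-level summary hash look like, here is a self-contained sketch using only the standard library. It builds a small tar archive in memory, computes an md5 digest and size for each file, and folds the selected digests into one summary hash. The helper names (build_example_tar, content_hashes, summary_hash) and the way digests are combined are assumptions for illustration, not the library's implementation.

```python
import hashlib
import io
import tarfile


def build_example_tar(files):
    """Build an in-memory tar archive from a {path: bytes} dictionary."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for path, content in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    buffer.seek(0)
    return buffer


def content_hashes(fileobj):
    """Return md5 digests and sizes (in bytes) for regular files in a tar."""
    hashes, sizes = {}, {}
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            hashes[member.name] = hashlib.md5(data).hexdigest()
            sizes[member.name] = member.size
    return hashes, sizes


def summary_hash(hashes, selected):
    """Fold the digests of the selected paths, in sorted order, into one md5."""
    digest = hashlib.md5()
    for path in sorted(selected):
        digest.update(hashes[path].encode("utf-8"))
    return digest.hexdigest()


archive = build_example_tar({"bin/sh": b"#!/bin/sh\n", "etc/motd": b"hello\n"})
hashes, sizes = content_hashes(archive)
print(summary_hash(hashes, hashes.keys()))
# With no files selected, the result is the md5 of empty input
print(summary_hash(hashes, []))  # d41d8cd98f00b204e9800998ecf8427e
```

The value d41d8cd98f00b204e9800998ecf8427e is the md5 of empty input. Seeing it for several levels in the output above suggests that no content was hashed for those levels, although that reading is an inference rather than something the output confirms.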
Note: this function is better done by container-tree.
You can generate a data frame with difference scores between two containers:
from singularity.analysis.compare import compare_singularity_images
image_files = [image1,image2]
diffs = compare_singularity_images(image_paths1=image_files)
             ubuntu.sif  busybox.sif
ubuntu.sif            1          0.0
busybox.sif           0          1.0
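The shape of such a table can be sketched without the library: compute a similarity score for every pair of containers and store it in a nested dictionary keyed by row and column name. The file hashes below are made up, and the similarity function (2 * same / total files) is an assumption rather than the scoring used by compare_singularity_images.

```python
def similarity(files1, files2):
    """Score two {path: digest} dictionaries between 0.0 and 1.0."""
    shared = set(files1) & set(files2)
    same = sum(1 for path in shared if files1[path] == files2[path])
    total = len(files1) + len(files2)
    return 2 * same / total if total else 1.0


def pairwise_scores(containers):
    """Return a nested dict: scores[row][column] for every container pair."""
    names = list(containers)
    return {
        row: {col: similarity(containers[row], containers[col]) for col in names}
        for row in names
    }


# Made-up file hashes for two containers that share no identical files
containers = {
    "ubuntu.sif": {"bin/bash": "aaa", "etc/os-release": "bbb"},
    "busybox.sif": {"bin/busybox": "ccc", "etc/passwd": "ddd"},
}
scores = pairwise_scores(containers)
for row, columns in scores.items():
    print(row, [columns[col] for col in containers])
```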
If you have two dataframes like this (diffs1 and diffs2 below), you can do RSA (representational similarity analysis) to compare them:
from singularity.analysis.compare import RSA
pearsonr_sim = RSA(diffs1,diffs2)
# 0.74136458648818637
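A common way to do representational similarity analysis is to take the off-diagonal entries of each similarity matrix and correlate them. The sketch below does this with a Pearson correlation over the lower triangle, using plain nested dictionaries instead of data frames. The helper names are hypothetical, and the RSA function in singularity.analysis.compare may select or order entries differently. Note that at least three containers are needed, since two containers yield only one off-diagonal value.

```python
import math


def lower_triangle(matrix, names):
    """Collect the entries below the diagonal, in a fixed order."""
    return [
        matrix[names[i]][names[j]]
        for i in range(len(names))
        for j in range(i)
    ]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def rsa(matrix1, matrix2, names):
    """Correlate the lower triangles of two similarity matrices."""
    return pearson(lower_triangle(matrix1, names), lower_triangle(matrix2, names))


names = ["a", "b", "c"]
m1 = {
    "a": {"a": 1.0, "b": 0.8, "c": 0.2},
    "b": {"a": 0.8, "b": 1.0, "c": 0.4},
    "c": {"a": 0.2, "b": 0.4, "c": 1.0},
}
# m2 reverses the off-diagonal similarity structure of m1 (1 - value)
m2 = {
    "a": {"a": 1.0, "b": 0.2, "c": 0.8},
    "b": {"a": 0.2, "b": 1.0, "c": 0.6},
    "c": {"a": 0.8, "b": 0.6, "c": 1.0},
}
print(rsa(m1, m1, names))
print(rsa(m1, m2, names))
```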
Need help? submit an issue and let us know!