Description
We have noticed that files created with {rhdf5} can have inconsistent hashes when writing a `data.frame`, even though the content is identical. Here is a minimal example:
library(rhdf5)
h5_path <- tempfile(fileext = ".h5")
hashes <- purrr::map_chr(1:5, \(i) {
  h5createFile(h5_path)
  h5write(data.frame(a = 1L:5L), h5_path, "df")
  hash <- rlang::hash_file(h5_path)
  h5closeAll()
  fs::file_delete(h5_path)
  Sys.sleep(1)
  hash
})
unique(hashes)
#> [1] "fe872321c9d8b174799c3d77d6ded667" "fcfb171578516d910ecd32051427b636"
#> [3] "8bf7d5edcffb0ca1e7bd89d1152241d5" "78069aa4a27f1322ae239d336fec60e5"
#> [5] "9a22f19c41919eeb43880d7f58139b6e"
Created on 2025-05-28 with reprex v2.1.1
The `Sys.sleep()` isn't necessary for bigger examples, but it makes this one reproduce more consistently. That suggests a timestamp may be saved somewhere in the file, but I've also considered that it might be related to the order in which things are written to the file. I have tried the example in #51 and its hashes are consistent, so the problem seems to be related to `data.frame`s, though it may be more general than that.
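To narrow down where the files differ, one option might be to compare the raw bytes of two of them directly. This is only a minimal sketch in base R: `diff_offsets()` is a hypothetical helper (not part of {rhdf5}), and the demo uses two small stand-in files rather than real HDF5 files.

```r
# Hypothetical helper: return the 1-based byte offsets at which two files differ.
# Only the overlapping prefix is compared if the files have different sizes.
diff_offsets <- function(path_a, path_b) {
  bytes_a <- readBin(path_a, what = "raw", n = file.size(path_a))
  bytes_b <- readBin(path_b, what = "raw", n = file.size(path_b))
  n <- min(length(bytes_a), length(bytes_b))
  which(bytes_a[seq_len(n)] != bytes_b[seq_len(n)])
}

# Demo on two small stand-in files that differ at bytes 3 and 5
file_a <- tempfile()
file_b <- tempfile()
writeBin(as.raw(c(1, 2, 3, 4, 5)), file_a)
writeBin(as.raw(c(1, 2, 9, 4, 7)), file_b)
diff_offsets(file_a, file_b)
#> [1] 3 5
```

If the differing offsets in two HDF5 files written a second apart form a few short runs, that would be consistent with the timestamp idea, whereas large differing regions would point more towards write order.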
Do you have any suggestions as to what it might be and/or how to reliably write a file with a consistent hash?
Also, if you have any hints on how to check for timestamps in an HDF5 file, that would be super helpful.
Thanks!