Inconsistent file hashes when writing data.frames #160

@lazappi

We have noticed that files created with {rhdf5} can have inconsistent hashes even when the content is identical, at least when writing a data.frame. Here is a minimal example:

library(rhdf5)

h5_path <- tempfile(fileext = ".h5")

hashes <- purrr::map_chr(1:5, \(i) {
  h5createFile(h5_path)
  h5write(data.frame(a = 1L:5L), h5_path, "df")
  hash <- rlang::hash_file(h5_path)
  h5closeAll()
  fs::file_delete(h5_path)
  
  Sys.sleep(1)
  
  hash
})

unique(hashes)
#> [1] "fe872321c9d8b174799c3d77d6ded667" "fcfb171578516d910ecd32051427b636"
#> [3] "8bf7d5edcffb0ca1e7bd89d1152241d5" "78069aa4a27f1322ae239d336fec60e5"
#> [5] "9a22f19c41919eeb43880d7f58139b6e"

Created on 2025-05-28 with reprex v2.1.1

The Sys.sleep() isn't necessary for bigger examples, but it makes this one reproduce more reliably. That suggests a timestamp may be saved somewhere in the file, though I've also considered that the order in which things are written to the file might vary. I tried the example in #51 and its hashes are consistent, so the problem seems to be related to data.frames, but it may be more general than that.
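One way to narrow this down might be to compare two of the written files byte by byte and see where they differ. The following is only a rough sketch in base R: `diff_offsets()` is a hypothetical helper, not part of {rhdf5}, and the two temporary files below are stand-ins for two copies of the HDF5 file kept from separate runs.

```r
# Hypothetical helper (not part of rhdf5): return the zero-based byte
# offsets at which two files differ, over their common length.
diff_offsets <- function(path_a, path_b) {
  bytes_a <- readBin(path_a, what = "raw", n = file.size(path_a))
  bytes_b <- readBin(path_b, what = "raw", n = file.size(path_b))
  n <- min(length(bytes_a), length(bytes_b))
  which(bytes_a[seq_len(n)] != bytes_b[seq_len(n)]) - 1L
}

# Stand-in files for illustration only; in practice these would be two
# HDF5 files written by separate runs of the example above.
path_a <- tempfile()
path_b <- tempfile()
writeBin(as.raw(c(1, 2, 3, 4)), path_a)
writeBin(as.raw(c(1, 9, 3, 7)), path_b)

offsets <- diff_offsets(path_a, path_b)
print(offsets)  # the second and fourth bytes differ: offsets 1 and 3
```

If the differing offsets in the real files fall into a few short runs, that would be consistent with a small stored field such as a timestamp, whereas large differing regions would point more towards a change in write order. This is a guess about how to interpret the result, not something we have confirmed.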

Do you have any suggestions for what it might be and/or how to reliably write a file with a consistent hash?

Also, if you have any hints on how to check for timestamps in an HDF5 file, that would be super helpful.

Thanks!
