Should there be a chunk iterator for writing datasets using 'create_dataset'? #88

@jbhatch

Description

When writing an HDF5 file to HSDS with H5PYD, it appears that although chunks are created in the final output file, the initial write of the data operates in a contiguous manner. This would sometimes produce interrupts (HTTP request errors) when writing large (~GB-size) HDF5 files with H5PYD to HSDS, despite there being more than enough memory on each of the HSDS data nodes. Writing smaller, ~MB-size files was hit and miss, and ~KB-size files had no issues. The 3D datasets in the HDF5 files used in these tests (~GB, ~MB, and ~KB-size) were filled with random 3D numpy arrays.
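
For reference, a minimal sketch of the kind of write that triggered the errors (the endpoint, domain path, and array shape below are made up for illustration):

    import numpy as np
    import h5pyd

    # Hypothetical HSDS endpoint and domain path
    f = h5pyd.File("/home/jbhatch/test_large.h5", "w",
                   endpoint="http://hsds.example.org:5101")

    # ~1 GB of float64 random data (512 x 1024 x 256)
    data = np.random.rand(512, 1024, 256)

    # create_dataset sends the full array in the initial write; this is the
    # step where the HTTP request errors were observed for ~GB-size arrays
    dset = f.create_dataset("dset3d", data=data)
    f.close()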

In order to use the H5PYD ChunkIterator in create_dataset, the following fix is suggested:

Add the line below to the import statements in the group.py file in h5pyd/_hl:

    from h5pyd._apps.chunkiter import ChunkIterator

In the group.py file under h5pyd/_hl, change lines 334-336 from this:

    if data is not None:
        self.log.info("initialize data")
        dset[...] = data

to this:

    if data is not None:
        self.log.info("initialize data")
        # dset[...] = data
        it = ChunkIterator(dset)
        for chunk in it:
            dset[chunk] = data[chunk]
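
Until a change like this is merged, the same chunk-by-chunk write can also be done from user code without patching group.py. A minimal sketch, assuming the dataset is created empty first and then filled (the domain and dataset names are illustrative, and the endpoint/credentials come from the local .hscfg):

    import numpy as np
    import h5pyd
    from h5pyd._apps.chunkiter import ChunkIterator

    f = h5pyd.File("/home/jbhatch/test_large.h5", "w")
    data = np.random.rand(512, 1024, 256)

    # Create the dataset without passing data=, so nothing is written yet
    dset = f.create_dataset("dset3d", shape=data.shape, dtype=data.dtype)

    # ChunkIterator yields one selection per chunk; each assignment becomes
    # a chunk-sized request to HSDS instead of one large contiguous write
    for chunk in ChunkIterator(dset):
        dset[chunk] = data[chunk]

    f.close()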
