
[BUG] Loading latent_gt_path in the __init__ method of FFHQBlindDataset may cause a memory allocation error #435

@Golden-Pigeon

Description

When using FFHQBlindDataset with a multi-process DataLoader (num_workers > 0), a RuntimeError related to mmap occurs. The error message is:

RuntimeError: unable to mmap XXX bytes from file </torch_XXXX>: Cannot allocate memory (12)

This issue was traced to the following line in FFHQBlindDataset.__init__:

self.latent_gt_dict = torch.load(self.latent_gt_path)

In our case, latent_gt_path is a .pth file that is approximately 1.6GB in size.

When the DataLoader uses multiple workers (num_workers > 0), PyTorch has to hand the Dataset instance to the child processes. Since self.latent_gt_dict is a large object, PyTorch attempts to use shared memory (via mmap) to avoid copying the data between processes. If the object is too large, or this happens repeatedly, it can easily lead to shared-memory exhaustion, file-descriptor exhaustion, or mmap-related ENOMEM errors, even when system RAM and /dev/shm space are sufficient (see the sketch below).
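
For illustration, a minimal sketch of the failing pattern (the class name, file name, and key handling below are hypothetical stand-ins; only the eager torch.load() in __init__ and the num_workers > 0 DataLoader mirror the report):

import torch
from torch.utils.data import Dataset, DataLoader

class EagerLatentDataset(Dataset):
    """Hypothetical reduction: the full latent dict is loaded in __init__,
    so the DataLoader has to hand the ~1.6GB object to every worker."""
    def __init__(self, latent_gt_path):
        self.latent_gt_dict = torch.load(latent_gt_path)
        self.keys = list(self.latent_gt_dict.keys())

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):
        return self.latent_gt_dict[self.keys[index]]

# num_workers > 0 is what forces the large Dataset state into the worker
# processes and, per the report above, ends in the mmap ENOMEM error.
loader = DataLoader(EagerLatentDataset('latent_gt.pth'), batch_size=4, num_workers=4)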

Proposed solution: move the torch.load() call from the __init__ method into the __getitem__ method, so the latent dictionary is loaded lazily, on first access, in the process that actually uses it. For example:

def __getitem__(self, index):
    # Load the latent dict lazily, on first access in the current process.
    # This requires __init__ to set self.latent_gt_dict = None instead of
    # calling torch.load() there.
    if self.latent_gt_path is not None:
        self.load_latent_gt = True
        if self.latent_gt_dict is None:
            self.latent_gt_dict = torch.load(self.latent_gt_path)
    else:
        self.load_latent_gt = False
...
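
For context, a slightly fuller, self-contained sketch of this lazy-loading arrangement (only latent_gt_path, latent_gt_dict, and load_latent_gt come from the issue; the class name and the elided sample assembly are assumptions):

import torch
from torch.utils.data import Dataset

class LazyLatentDataset(Dataset):
    """Hypothetical sketch of the proposed fix: __init__ keeps only the path."""
    def __init__(self, latent_gt_path=None):
        self.latent_gt_path = latent_gt_path
        self.latent_gt_dict = None   # nothing large is attached to the Dataset yet
        self.load_latent_gt = False

    def __getitem__(self, index):
        if self.latent_gt_path is not None:
            self.load_latent_gt = True
            if self.latent_gt_dict is None:
                # First access in this process (main or worker) loads the file;
                # later accesses reuse the cached dict.
                self.latent_gt_dict = torch.load(self.latent_gt_path)
        else:
            self.load_latent_gt = False
        # ... look up the latent for this sample and build the rest of the
        # data dict here, as FFHQBlindDataset normally does ...

With this arrangement nothing large is attached to the Dataset when the workers are created, so the dict never has to pass through shared memory; the trade-off is that each worker loads its own copy of the ~1.6GB file into ordinary RAM on first use.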
