Description
When using FFHQBlindDataset with a multi-process DataLoader (num_workers > 0), a RuntimeError related to mmap occurs. The error message is:
RuntimeError: unable to mmap XXX bytes from file </torch_XXXX>: Cannot allocate memory (12)
This issue was traced to the following line in FFHQBlindDataset.__init__:
self.latent_gt_dict = torch.load(self.latent_gt_path)
In our case, latent_gt_path is a .pth file that is approximately 1.6 GB in size.
When the DataLoader uses multiple workers (num_workers > 0), PyTorch shares the Dataset instance with the child worker processes. Since self.latent_gt_dict is a large object, PyTorch attempts to use shared memory (via mmap) to avoid copying the data between processes. If the object is too large or this happens repeatedly, it can easily lead to shared-memory exhaustion, file-descriptor exhaustion, or mmap-related ENOMEM errors, even when system RAM and /dev/shm space are sufficient.
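A minimal sketch of the failing pattern (the class name, path, and dict layout below are illustrative, not the repository's actual code): loading the full dict in __init__ makes it part of the Dataset state that every worker must inherit or share.

    import torch
    from torch.utils.data import Dataset, DataLoader

    class EagerLatentDataset(Dataset):
        """Toy dataset that eagerly loads a large latent dict in __init__."""
        def __init__(self, latent_gt_path):
            # The entire ~1.6 GB dict becomes Dataset state that each
            # DataLoader worker process has to share or copy.
            self.latent_gt_dict = torch.load(latent_gt_path)
            self.keys = list(self.latent_gt_dict.keys())

        def __len__(self):
            return len(self.keys)

        def __getitem__(self, index):
            return self.latent_gt_dict[self.keys[index]]

    # With num_workers > 0, the large attribute is what can trigger the
    # mmap / shared-memory errors described above.
    # loader = DataLoader(EagerLatentDataset('latent_gt.pth'), num_workers=4)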
Solution: move the torch.load() call from the __init__ method to the __getitem__ method, so that each worker loads the latent data lazily on first access. For example:
    def __getitem__(self, index):
        if self.latent_gt_path is not None:
            self.load_latent_gt = True
            # Lazily load the latent dict once per worker, on first access.
            if self.latent_gt_dict is None:
                self.latent_gt_dict = torch.load(self.latent_gt_path)
        else:
            self.load_latent_gt = False
        ...
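For this lazy-loading fix to work, __init__ should only record the path and initialize the dict to None. A sketch of the corresponding change (assuming the dataset's options dict exposes latent_gt_path; the rest of __init__ is unchanged):

    def __init__(self, opt):
        ...
        # Only remember the path here; the dict itself is loaded on demand
        # in __getitem__, so the Dataset object stays small when it is
        # shared with DataLoader worker processes.
        self.latent_gt_path = opt.get('latent_gt_path', None)
        self.latent_gt_dict = None

Each worker then holds its own copy of the dict after its first __getitem__ call, instead of relying on a shared-memory mapping inherited from the parent process.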