DeepRec hangs in distributed mode. #125

@silingtong123

Description

Current behavior

In distributed mode, DeepRec trains normally on one hour of data, but hangs when training on one day of data or more. Log:
6ca9fe77321c27383b3b3de9bb8fc5d5
Nvidia-smi:
a3ee237e24abfd35d1c087126b6331f8
cpu:
071c9938c994a484295fdc3ef25b483d

Expected behavior

DeepRec works fine in distributed mode. Log:
315532d0f8197d279e990d49332c85b3

System information

  • GPU model and memory: Two GPU devices: Tesla T4. Memory: 15109MiB
  • OS Platform: x86_64 x86_64 x86_64 GNU/Linux
  • Docker version: Docker version 20.10.8, build 3967b7d
  • GCC/CUDA/cuDNN version: CUDA 11.4 / cuDNN 8
  • Python/conda version: Python 3.6
  • TensorFlow/PyTorch version: DeepRec deeprec2302, HybridBackend a832b4e

Code to reproduce

    import tensorflow as tf

    sess_config = tf.ConfigProto(
        # If the device you specify doesn't exist, allow TF to assign the device automatically
        allow_soft_placement=True,
        log_device_placement=False,  # Whether to print the device assignment log
    )
    sess_config.gpu_options.force_gpu_compatible = True
    sess_config.gpu_options.allow_growth = True  # Allocate GPU memory on demand

    with tf.train.MonitoredTrainingSession(master="", checkpoint_dir=self.__ckpt_dir, config=sess_config):
        ...  # training loop not shown in this report
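
To help narrow down where a worker stalls, a watchdog along the following lines could be armed before entering the session. This is only a minimal sketch using Python's standard `faulthandler` module; `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical helper names, not part of DeepRec or HybridBackend. It dumps Python-level stacks only, so a stall inside a native op would most likely show up as a thread sitting in the session's run call.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s, log_file=None):
    """Dump the Python stack of every thread each time timeout_s seconds pass.

    log_file must stay open until the watchdog is disarmed and must expose
    a real file descriptor (fileno()); it defaults to sys.stderr.
    """
    if log_file is None:
        log_file = sys.stderr
    # repeat=True keeps dumping every timeout_s seconds until cancelled,
    # exit=False leaves the process running so the hang can still be inspected.
    faulthandler.dump_traceback_later(
        timeout_s, repeat=True, file=log_file, exit=False
    )


def disarm_hang_watchdog():
    """Cancel the pending dump, e.g. once training finishes normally."""
    faulthandler.cancel_dump_traceback_later()
```

In a training script this would be called as `arm_hang_watchdog(600)` before the `with tf.train.MonitoredTrainingSession(...)` block and `disarm_hang_watchdog()` after it. The 600-second interval is an arbitrary assumption and should exceed the longest expected step time.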

Willing to contribute

Yes
