Description
Current behavior
In distributed mode, DeepRec trains normally on one hour of data but hangs when training on one day of data or more. Log:
nvidia-smi:
CPU:
Expected behavior
DeepRec should train without hanging in distributed mode, including on one day of data or more. Log:
System information
- GPU model and memory: two Tesla T4 devices, 15109 MiB memory
- OS Platform: x86_64 x86_64 x86_64 GNU/Linux
- Docker version: Docker version 20.10.8, build 3967b7d
- GCC/CUDA/cuDNN version: CUDA 11.4 / cuDNN 8
- Python/conda version: Python 3.6
- TensorFlow/PyTorch version: DeepRec deeprec2302, HybridBackend a832b4e
Code to reproduce
import tensorflow as tf

# Snippet from a method of the training class; self.__ckpt_dir is the checkpoint directory.
sess_config = tf.ConfigProto(
    # If the specified device does not exist, let TF assign a device automatically.
    allow_soft_placement=True,
    # Whether to print the device assignment log.
    log_device_placement=False,
)
sess_config.gpu_options.force_gpu_compatible = True
sess_config.gpu_options.allow_growth = True
with tf.train.MonitoredTrainingSession(
        master="", checkpoint_dir=self.__ckpt_dir, config=sess_config) as sess:
    ...  # training loop omitted in the original report
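To help narrow down where each worker stalls, one option is to have every process periodically dump its Python stacks while training runs. The sketch below uses only the standard-library `faulthandler` module; the helper names `enable_hang_dumps` and `disable_hang_dumps` and the chosen interval are assumptions for illustration and are not part of DeepRec or HybridBackend. It only shows Python-level frames, so a stall inside native code would appear as the Python call that entered it.

```python
import faulthandler
import sys


def enable_hang_dumps(timeout_s, log_file=sys.stderr):
    """Write all Python thread stacks to log_file every timeout_s seconds.

    Hypothetical helper: log_file must be a real file object with a file
    descriptor and must stay open until disable_hang_dumps() is called.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)


def disable_hang_dumps():
    """Stop the periodic stack dumps started by enable_hang_dumps()."""
    faulthandler.cancel_dump_traceback_later()
```

A possible usage is to call `enable_hang_dumps(600)` just before entering the `MonitoredTrainingSession` block and `disable_hang_dumps()` after it exits, then compare the dumped stacks across workers once the job stops making progress.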
Willing to contribute
Yes