xla_ops fails when using multiple GPUs #1000

Log:
```
2023-04-26 17:41:51.342225: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:417 : INVALID_ARGUMENT: Trying to access resource Resource-0-at-0x267a43f0 located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
2023-04-26 17:41:51.342242: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:417 : INVALID_ARGUMENT: Trying to access resource Resource-0-at-0x267a43f0 located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
2023-04-26 17:41:51.342632: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x7feef4013580 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-04-26 17:41:51.386968: I tensorflow/compiler/xla/service/service.cc:181]   StreamExecutor device (0): Tesla V100-PCIE-16GB, Compute Capability 7.0
2023-04-26 17:41:51.386982: I tensorflow/compiler/xla/service/service.cc:181]   StreamExecutor device (1): Tesla V100-PCIE-16GB, Compute Capability 7.0
2023-04-26 17:41:51.386987: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:417 : INVALID_ARGUMENT: Trying to access resource Resource-0-at-0x267a43f0 located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
2023-04-26 17:41:51.387011: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:417 : INVALID_ARGUMENT: Trying to access resource Resource-0-at-0x267a43f0 located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
```
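To summarize the repeated warnings in the log above, here is a minimal sketch that pulls out which resource lives on which device and which device tried to read it. The helper name `find_cross_device_accesses` and the regular expression are illustrative assumptions written against the message text shown in this log; they are not part of TensorFlow or OpenNMT-tf.

```python
import re

# Pattern for the XLA cross-device error text shown in the log above.
# The message may be wrapped across lines, so whitespace is matched loosely.
CROSS_DEVICE_RE = re.compile(
    r"Trying to access resource (?P<resource>\S+) located\s+in device "
    r"(?P<owner>\S+) from device (?P<caller>\S+)"
)


def find_cross_device_accesses(log_text):
    """Return the unique (resource, owner_device, caller_device) triples in log_text."""
    seen = []
    for match in CROSS_DEVICE_RE.finditer(log_text):
        triple = (match["resource"], match["owner"], match["caller"])
        if triple not in seen:
            seen.append(triple)
    return seen


# One warning copied from the log, with a line wrap after "located".
sample = (
    "INVALID_ARGUMENT: Trying to access resource Resource-0-at-0x267a43f0 located\n"
    "in device /job:localhost/replica:0/task:0/device:GPU:0 "
    "from device /job:localhost/replica:0/task:0/device:GPU:1\n"
)

for resource, owner, caller in find_cross_device_accesses(sample):
    print(f"{resource}: owned by {owner}, accessed from {caller}")
```

Run against the full log, this collapses the four warnings into a single entry: the same resource on GPU:0 is accessed from GPU:1 each time.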
Command:
```
CUDA_VISIBLE_DEVICES=1,2  onmt-main --model ../config/models/tiny_multi_source_transformer.py --config data_tiny_0425.yml --auto_config train --with_eval --num_gpus 2
```
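Note that the device names in the log are logical indices, not physical ones: with `CUDA_VISIBLE_DEVICES=1,2`, CUDA renumbers the visible devices from 0 in the order listed. The sketch below only illustrates that renumbering; `logical_to_physical` is a hypothetical helper, and it assumes the variable holds a plain comma-separated list of valid numeric indices (not UUIDs).

```python
def logical_to_physical(cuda_visible_devices):
    """Map logical GPU names to the physical indices listed in CUDA_VISIBLE_DEVICES.

    Simplification: every entry is assumed to be a valid device index.
    """
    physical = [
        entry.strip() for entry in cuda_visible_devices.split(",") if entry.strip()
    ]
    return {f"GPU:{logical}": index for logical, index in enumerate(physical)}


mapping = logical_to_physical("1,2")
print(mapping)  # {'GPU:0': '1', 'GPU:1': '2'}
```

So `GPU:0` and `GPU:1` in the warnings correspond to physical GPUs 1 and 2 on this machine.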
Model:
```
import opennmt as onmt

from opennmt.utils import misc


class TinyDualSourceTransformer(onmt.models.Transformer):
    """Small dual-source Transformer whose two encoders share parameters."""

    def __init__(self):
        super().__init__(
            # One word embedder per source stream (input and label files).
            source_inputter=onmt.inputters.ParallelInputter([
                onmt.inputters.WordEmbedder(embedding_size=256),
                onmt.inputters.WordEmbedder(embedding_size=256)]),
            target_inputter=onmt.inputters.WordEmbedder(embedding_size=256),
            num_layers=4,
            num_units=128,
            num_heads=4,
            ffn_inner_dim=512,
            dropout=0.1,
            attention_dropout=0.1,
            ffn_dropout=0.1,
            share_encoders=True)

    def auto_config(self, num_replicas=1):
        config = super().auto_config(num_replicas=num_replicas)
        # Repeat the default maximum length once per source stream.
        max_length = config["train"]["maximum_features_length"]
        return misc.merge_dict(config, {
            "train": {
                "maximum_features_length": [max_length, max_length]
            }
        })


# onmt-main loads the model through a module-level `model` callable
# (assumed to be present in the original file; it was not in the pasted snippet).
model = TinyDualSourceTransformer
```
YAML config:
```
model_dir: run_/


data:
  train_features_file:
    - input.subword.train
    - label.subword.train
  train_labels_file: output.subword.train
  eval_features_file:
    - input.subword.val
    - label.subword.val
  eval_labels_file: output.subword.val
  source_1_vocabulary: input.vocab.txt
  source_2_vocabulary: label.vocab.txt
  target_vocabulary: output.vocab.txt

train:
  batch_size: 256
  batch_type: examples
  save_checkpoints_steps: 1000
  max_step: 30000
  maximum_features_length: [30, 30]
  maximum_labels_length: 30
  sample_buffer_size: 0


eval:
  steps: 1000
```
Thanks!
