XLA bug #292

Open
@itsmekhoathekid

Description

I got the following errors while running this command:

python /data/npl/Speech2Text/TensorFlowASR-main/examples/train.py --mxp=auto --jit-compile --config-path=/data/npl/Speech2Text/TensorFlowASR-main/examples/models/transducer/rnnt/small.yml.j2 --dataset-type=tfrecord --modeldir=/data/npl/Speech2Text/TensorFlowASR-main/tensorflow_asr/checkpoint --datadir=/data/npl/Speech2Text/TensorFlowASR-main/scripts/data


Epoch 1/300
INFO:tensorflow:Collective all_reduce tensors: 39 all_reduces, num_devices = 8, group_size = 8, implementation = CommunicationImplementation.AUTO, num_packs = 1
INFO:tensorflow:Collective all_reduce tensors: 1 all_reduces, num_devices = 8, group_size = 8, implementation = CommunicationImplementation.AUTO, num_packs = 1
INFO:tensorflow:Collective all_reduce tensors: 1 all_reduces, num_devices = 8, group_size = 8, implementation = CommunicationImplementation.AUTO, num_packs = 1
INFO:tensorflow:Error reported to Coordinator: We failed to lift variable creations out of this tf.function, so this tf.function cannot be run on XLA. A possible workaround is to move variable creation outside of the XLA compiled function.
Traceback (most recent call last):
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/training/coordinator.py", line 293, in stop_on_exception
yield
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/distribute/mirrored_run.py", line 387, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 946, in _call
raise errors.UnimplementedError(
tensorflow.python.framework.errors_impl.UnimplementedError: We failed to lift variable creations out of this tf.function, so this tf.function cannot be run on XLA. A possible workaround is to move variable creation outside of the XLA compiled function.
Traceback (most recent call last):
File "/data/npl/Speech2Text/TensorFlowASR-main/examples/train.py", line 110, in <module>
cli_util.run(main)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow_asr/utils/cli_util.py", line 19, in run
fire.Fire(component, command=command, name=name)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/data/npl/Speech2Text/TensorFlowASR-main/examples/train.py", line 98, in main
model.fit(
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow_asr/models/base_model.py", line 544, in fit
tmp_logs, caching = self.train_function(iterator, caching=caching)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 52, in autograph_handler
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.framework.errors_impl.UnimplementedError: in user code:

File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow_asr/models/base_model.py", line 317, in train_function  *
    return step_function(self, iterator, caching)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow_asr/models/base_model.py", line 304, in step_function  *
    outputs, caching = model.distribute_strategy.run(run_step, args=(data, caching))

UnimplementedError: We failed to lift variable creations out of this tf.function, so this tf.function cannot be run on XLA. A possible workaround is to move variable creation outside of the XLA compiled function.

I also tried training on a single GPU (an A100), but it is extremely slow. Can you please help?
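
For reference, the workaround named in the error message ("move variable creation outside of the XLA compiled function") can be sketched roughly as below. This is a minimal standalone illustration, not TensorFlowASR code: `TinyModel`, `train_step`, and `LR` are hypothetical names, and it assumes a TensorFlow 2.x build with XLA available on the host. The idea is that every `tf.Variable` is created eagerly before the first call, so the function compiled with `jit_compile=True` only reads and updates variables that already exist.

```python
import tensorflow as tf

LR = 0.1  # hypothetical learning rate for this sketch


class TinyModel(tf.Module):
    """Hypothetical stand-in for a real model."""

    def __init__(self):
        super().__init__()
        # Variables are created eagerly here, before any XLA compilation.
        self.w = tf.Variable(tf.ones([3, 1]), name="w")
        self.b = tf.Variable(tf.zeros([1]), name="b")

    def __call__(self, x):
        return tf.matmul(x, self.w) + self.b


model = TinyModel()


@tf.function(jit_compile=True)
def train_step(x, y):
    # Only reads and updates existing variables; it creates none.
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    grad_w, grad_b = tape.gradient(loss, [model.w, model.b])
    model.w.assign_sub(LR * grad_w)
    model.b.assign_sub(LR * grad_b)
    return loss


x = tf.ones([8, 3])
y = tf.zeros([8, 1])
first = train_step(x, y)   # loss before the first update
second = train_step(x, y)  # loss after one update
print(float(first), float(second))
```

In a Keras-based setup, the same idea would presumably mean making sure the model and optimizer state are built before the first compiled step runs. Whether that is what triggers the error here under the multi-GPU strategy is an assumption, not something the log confirms.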
