XLA bug #292

Open
itsmekhoathekid opened this issue Mar 1, 2025 · 3 comments

@itsmekhoathekid

I got these errors while running with this config:

python /data/npl/Speech2Text/TensorFlowASR-main/examples/train.py --mxp=auto --jit-compile --config-path=/data/npl/Speech2Text/TensorFlowASR-main/examples/models/transducer/rnnt/small.yml.j2 --dataset-type=tfrecord --modeldir=/data/npl/Speech2Text/TensorFlowASR-main/tensorflow_asr/checkpoint --datadir=/data/npl/Speech2Text/TensorFlowASR-main/scripts/data


Epoch 1/300
INFO:tensorflow:Collective all_reduce tensors: 39 all_reduces, num_devices = 8, group_size = 8, implementation = CommunicationImplementation.AUTO, num_packs = 1
INFO:tensorflow:Collective all_reduce tensors: 1 all_reduces, num_devices = 8, group_size = 8, implementation = CommunicationImplementation.AUTO, num_packs = 1
INFO:tensorflow:Collective all_reduce tensors: 1 all_reduces, num_devices = 8, group_size = 8, implementation = CommunicationImplementation.AUTO, num_packs = 1
INFO:tensorflow:Error reported to Coordinator: We failed to lift variable creations out of this tf.function, so this tf.function cannot be run on XLA. A possible workaround is to move variable creation outside of the XLA compiled function.
Traceback (most recent call last):
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/training/coordinator.py", line 293, in stop_on_exception
yield
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/distribute/mirrored_run.py", line 387, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 946, in _call
raise errors.UnimplementedError(
tensorflow.python.framework.errors_impl.UnimplementedError: We failed to lift variable creations out of this tf.function, so this tf.function cannot be run on XLA. A possible workaround is to move variable creation outside of the XLA compiled function.
Traceback (most recent call last):
File "/data/npl/Speech2Text/TensorFlowASR-main/examples/train.py", line 110, in <module>
cli_util.run(main)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow_asr/utils/cli_util.py", line 19, in run
fire.Fire(component, command=command, name=name)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/data/npl/Speech2Text/TensorFlowASR-main/examples/train.py", line 98, in main
model.fit(
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow_asr/models/base_model.py", line 544, in fit
tmp_logs, caching = self.train_function(iterator, caching=caching)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 52, in autograph_handler
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.framework.errors_impl.UnimplementedError: in user code:

File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow_asr/models/base_model.py", line 317, in train_function  *
    return step_function(self, iterator, caching)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow_asr/models/base_model.py", line 304, in step_function  *
    outputs, caching = model.distribute_strategy.run(run_step, args=(data, caching))

UnimplementedError: We failed to lift variable creations out of this tf.function, so this tf.function cannot be run on XLA. A possible workaround is to move variable creation outside of the XLA compiled function.

I tried training on a single GPU (A100), but it's extremely slow. Can you please help?
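
For reference, the workaround named in the error message (moving variable creation outside the XLA-compiled function) can be sketched roughly as below. This is a minimal standalone illustration, not TensorFlowASR code: the `Accumulator` class and its `add` method are hypothetical names, and it assumes a TensorFlow build with XLA JIT support. The only point is that every `tf.Variable` already exists before the `jit_compile=True` function is traced for the first time.

```python
import tensorflow as tf


class Accumulator(tf.Module):
    """Hypothetical module used only to illustrate the pattern."""

    def __init__(self):
        super().__init__()
        # The variable is created eagerly here, before any compiled
        # function is traced, so XLA never sees a variable creation.
        self.total = tf.Variable(0.0)

    @tf.function(jit_compile=True)
    def add(self, x):
        # Only reads and updates of the existing variable happen
        # inside the XLA-compiled function.
        self.total.assign_add(tf.reduce_sum(x))
        return self.total.read_value()


acc = Accumulator()
result = acc.add(tf.constant([1.0, 2.0, 3.0]))
print(float(result))
```

In a Keras training setup, the analogous step would presumably be to make sure the model and optimizer variables are built before the first compiled train step runs. Whether that alone resolves the error in this repository under `MirroredStrategy` is not confirmed here.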

@Aegon007

Aegon007 commented Mar 1, 2025 via email

@itsmekhoathekid
Author

My CUDA and TensorFlow versions:

(/data/npl/Speech2Text/TensorFlowASR-main/venv) npl@uit-dgx01:/data/npl$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
(/data/npl/Speech2Text/TensorFlowASR-main/venv) npl@uit-dgx01:/data/npl$ pip show tensorflow
Name: tensorflow
Version: 2.15.0.post1
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages
Requires: absl-py, astunparse, flatbuffers, gast, google-pasta, grpcio, h5py, keras, libclang, ml-dtypes, numpy, opt-einsum, packaging, protobuf, setuptools, six, tensorboard, tensorflow-estimator, tensorflow-io-gcs-filesystem, termcolor, typing-extensions, wrapt
Required-by: tensorflow-text, tf_kera

@nglehuy
Collaborator

nglehuy commented Mar 13, 2025

@itsmekhoathekid There's a newer version using TF v2.18 and Keras v3 on the feat-streaming branch.
Could you try testing on that?
