
Fix edge case for tokenize (#36277) #36555


Open: wants to merge 4 commits into main

Conversation

wangzhen0518

What does this PR do?

Fixes #36277

The dtype of the tensor returned by the tokenizer is inconsistent depending on whether the input text is empty, e.g.,

def show_return_dtype(x: str):
    return_type = ["tf", "pt", "jax", "mlx", "np"]
    for t in return_type:
        output = tokenizer(x, return_tensors=t)
        print(f"{t}: {output['input_ids'].dtype}")

x = ""
show_return_dtype(x)
# tf: <dtype: 'float32'>
# pt: torch.float32
# jax: float32
# mlx: mlx.core.float32
# np: float64

x = "abc"
show_return_dtype(x)
# tf: <dtype: 'int32'>
# pt: torch.int64
# jax: int32
# mlx: mlx.core.int32
# np: int64

After this PR is applied (using the same show_return_dtype as above):

x = ""
show_return_dtype(x)
# tf: <dtype: 'int32'>
# pt: torch.int64
# jax: int32
# mlx: mlx.core.int32
# np: int64

x = "abc"
show_return_dtype(x)
# tf: <dtype: 'int32'>
# pt: torch.int64
# jax: int32
# mlx: mlx.core.int32
# np: int64

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1 and @ArthurZucker

@github-actions github-actions bot marked this pull request as draft March 5, 2025 08:08
Contributor

github-actions bot commented Mar 5, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@wangzhen0518 wangzhen0518 marked this pull request as ready for review March 5, 2025 08:12
@Rocketknight1
Member

Rocketknight1 commented Mar 6, 2025

cc @ArthurZucker @itazap for review. The full explanation is at #36277, but the tl;dr is that WITHOUT this PR:

  • Tokenizer tensor outputs are generated with torch.tensor()
  • torch.tensor() infers the dtype according to the input data and the following rule:
    • If every element is an int, dtype=torch.int64
    • If at least one element is a float, dtype=torch.float32
    • If the input list is empty, dtype=torch.float32

This creates an edge case when our tokenizers return empty arrays - the dtype switches from torch.int64 to torch.float32, and this breaks ops like torch.concat. This PR always casts the dtype to torch.int64, but is there any case where tokenizers return float32 arrays? I can't think of one, but this will definitely break those cases if they exist!
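The inference rules above can be reproduced framework-neutrally. Here is a minimal sketch using NumPy, whose np.array follows the same pattern (all-integer input infers an integer dtype, empty input falls back to the float default):

```python
import numpy as np

# np.array infers dtype the same way torch.tensor does:
# all-int input -> platform default int, any float -> float,
# and an empty list falls back to the float default.
print(np.array([1, 2, 3]).dtype)  # platform default int (e.g. int64)
print(np.array([1.0, 2]).dtype)   # float64
print(np.array([]).dtype)         # float64: the empty-input edge case
```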

@Rocketknight1
Member

Rebasing to run tests, gentle ping @ArthurZucker @itazap

@wangzhen0518
Author

It seems that some test cases for speech models tokenize float inputs. I modified the code to handle this case and fixed a bug in the previous commit. Now, the dtype is set to int only when the value is empty and the user has not set the dtype. In all other cases, the behavior is unchanged from before this PR.
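A minimal sketch of that rule (illustrative only, not the actual transformers code; NumPy stands in for the framework-specific tensor conversion):

```python
import numpy as np

def convert_to_tensor(value, dtype=None):
    # Hypothetical sketch of the PR's rule: force an integer dtype only
    # when the input is empty AND the caller did not request a dtype.
    # Non-empty inputs and explicit dtypes behave exactly as before.
    if len(value) == 0 and dtype is None:
        dtype = np.int64
    return np.asarray(value, dtype=dtype)

print(convert_to_tensor([]).dtype)                    # int64 (forced)
print(convert_to_tensor([1, 2]).dtype)                # platform default int, as before
print(convert_to_tensor([0.5]).dtype)                 # float64 (floats untouched)
print(convert_to_tensor([], dtype=np.float32).dtype)  # float32 (user dtype respected)
```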

@wangzhen0518
Author

I don't understand the two failed tests. They appear to be an issue with the model files on Hugging Face, unrelated to tokenization.

FAILED tests/models/wav2vec2/test_modeling_wav2vec2.py::Wav2Vec2ModelTest::test_pipeline_audio_classification - FileNotFoundError: Couldn't find any data file at /root/project/hf-internal-testing/librispeech_asr_dummy. Couldn't find 'hf-internal-testing/librispeech_asr_dummy' on the Hugging Face Hub either: LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

FAILED tests/models/wav2vec2/test_modeling_wav2vec2.py::Wav2Vec2ModelTest::test_pipeline_audio_classification_fp16 - FileNotFoundError: Couldn't find any data file at /root/project/hf-internal-testing/librispeech_asr_dummy. Couldn't find 'hf-internal-testing/librispeech_asr_dummy' on the Hugging Face Hub either: LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

@Rocketknight1
Member

Yes, these are CI issues. You can ignore them, we'll try to get them fixed!

@itazap
Collaborator

itazap commented Apr 2, 2025

LGTM (although I'm not familiar with where this typing can have a deeper effect).

@wangzhen0518
Author

Hi, I just merged the latest main branch into this PR and all tests have passed. Are there any other issues that need to be addressed?

@itazap
Collaborator

itazap commented Apr 14, 2025

@wangzhen0518 Thanks for addressing this! 😊 Do you mind please sharing a code snippet / description of the edge case where the inconsistent typing was an issue, so that we can add a test for it? That way your change won't be undone in the future! 😉

@wangzhen0518
Author

Inconsistent cases

When the input text is empty, the output tensor is an empty tensor of type float, such as torch.float32. When the input text is not empty, the output tensor is of type int, such as torch.int64. The types of the output tensors are inconsistent in these two cases.

t = tokenizer('', return_tensors='pt')
print(t['input_ids'].dtype) # torch.float32

t = tokenizer('abc', return_tensors='pt')
print(t['input_ids'].dtype) # torch.int64
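As a NumPy illustration of why this mismatch matters downstream (the torch case from the issue behaves analogously), concatenating the float-typed empty output with integer token ids changes the dtype of the whole sequence:

```python
import numpy as np

ids = np.array([101, 2023, 102], dtype=np.int64)  # ids from a non-empty input
empty = np.array([], dtype=np.float64)            # what the empty input produced
merged = np.hstack((ids, empty))
print(merged.dtype)  # float64: the integer ids were silently upcast
```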

The case of an empty input string is used when...

I use an LLM as an agent to interact with an environment, and the feedback from the environment is sometimes empty.
Here is an example: next_app_actor_prompt is sometimes an empty string.

actor_prompt = self.env.reset(query_config, args.seed, args.max_rollout_step, args.prompt_max_len)
inputs = self.tokenizer(
    [actor_prompt],
    return_tensors="pt",
    add_special_tokens=False,
    max_length=max_length,
    padding=True,
    truncation=True,
)
while not (failed or truncated or done):
    input_len: int = inputs["input_ids"].size(1)
    sequences, attention_mask, action_mask = self.actor.generate(**inputs, **generate_kwargs)
    action = self.get_action(sequences, input_len)
    # `next_app_actor_prompt` sometimes is an empty string.
    next_app_actor_prompt, reward, failed, truncated, done, _ = self.env.step(action)
    next_app_input_token = self.tokenize_fn(next_app_actor_prompt, self.prompt_max_len, device="cuda")
    next_inputs = {
        "input_ids": torch.hstack((sequences, next_app_input_token["input_ids"])),
        "attention_mask": torch.hstack((attention_mask, next_app_input_token["attention_mask"])),
    }
    inputs = next_inputs
    ...

What I do in this PR

I changed the output type to int when the user does not specify the output type and the input is empty.

Test case

Here is a test case demo.

import numpy as np
import tensorflow as tf
import torch
import jax.numpy as jnp
import mlx.core as mx

def test_empty_string():
    tokenizer = ...
    x = ""
    tokenizer_return_type = ["tf", "pt", "jax", "mlx", "np"]
    output_tensor_type = [tf.int32, torch.int64, jnp.int32, mx.int32, np.int64]
    for t1, t2 in zip(tokenizer_return_type, output_tensor_type):
        output = tokenizer(x, return_tensors=t1)
        # Compare with `==` rather than `is`: numpy and jax dtypes are not singletons.
        assert output["input_ids"].dtype == t2

Potential Issues

I noticed that the tokenizer processes audio data in some test cases. I'm not sure whether there could be cases where the input audio data is empty, and if so, whether the return type should be set to int when the audio input is empty (by default, it returns float when the audio is non-empty).

@Rocketknight1
Member

Hi @wangzhen0518 @itazap, this issue has cropped up again in #38417. Can we merge this fix, or is there anything else delaying it?

@wangzhen0518
Author

The current changes do not affect the existing text tokenization logic. The only modification is setting the return type to int when the input text is empty, which, as I understand it, is the expected behavior.

However, as previously discussed, since audio tokenization outputs a float type, any case where the audio input is empty should also return a float. The current changes would break this consistency. Thus, the key question is: does audio tokenization ever encounter empty input, i.e., len(value) == 0? I have no expertise in audio tokenization, but I suspect this might occur when an audio file has a duration of 0 seconds.

Is this scenario significant enough to warrant a compatibility solution?

Collaborator

@ArthurZucker ArthurZucker left a comment


LGTM let's just add a test please!

@wangzhen0518 wangzhen0518 force-pushed the tokenize-dtype branch 8 times, most recently from 3d111b9 to 4b48bd1 Compare May 29, 2025 09:15
@wangzhen0518
Author

I added a test case, but I found that some tokenizers do not support tokenizing str. They require other input data structures. The unsupported tokenizers are as follows:

  • LayoutLMv2Tokenizer, LayoutLMv3Tokenizer, LayoutXLMTokenizer, and UdopTokenizer require input as list[str] along with boxes.
  • MarkupLMTokenizer requires input as list[str] along with xpaths.
  • TapasTokenizer requires input as pd.DataFrame.

I'm not familiar with these types of tokenizers. Currently, I added the test case in tests/test_tokenization_common.py. Do I need to add separate test cases for these tokenizers?
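One hypothetical way a shared test could handle those tokenizers (a pytest-style sketch; the class list and helper name are assumptions, not existing transformers code) is to skip classes whose call signature does not accept a bare string:

```python
import pytest

# Hypothetical list of tokenizer classes that cannot tokenize a bare string
# (per the comment above, they need boxes, xpaths, or a DataFrame instead).
TOKENIZERS_REQUIRING_EXTRA_INPUTS = {
    "LayoutLMv2Tokenizer", "LayoutLMv3Tokenizer", "LayoutXLMTokenizer",
    "UdopTokenizer", "MarkupLMTokenizer", "TapasTokenizer",
}

def maybe_skip(tokenizer_class_name: str) -> None:
    # Skip the empty-string dtype test for tokenizers that do not accept str.
    if tokenizer_class_name in TOKENIZERS_REQUIRING_EXTRA_INPUTS:
        pytest.skip(f"{tokenizer_class_name} does not accept plain string input")
```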

Development

Successfully merging this pull request may close these issues.

The output tensor's data type is not torch.long when the input text is empty.
4 participants