
Fix edge case for tokenize (#36277) #36555


Open: wants to merge 4 commits into main

Conversation

wangzhen0518

What does this PR do?

Fixes #36277

The dtype of the tensor returned by the tokenizer is inconsistent depending on whether the input text is empty, e.g.,

def show_return_dtype(x: str):
    return_type = ["tf", "pt", "jax", "mlx", "np"]
    for t in return_type:
        output = tokenizer(x, return_tensors=t)
        print(f"{t}: {output['input_ids'].dtype}")

x = ""
show_return_dtype(x)
# tf: <dtype: 'float32'>
# pt: torch.float32
# jax: float32
# mlx: mlx.core.float32
# np: float64

x = "abc"
show_return_dtype(x)
# tf: <dtype: 'int32'>
# pt: torch.int64
# jax: int32
# mlx: mlx.core.int32
# np: int64

After this PR is applied (using the same show_return_dtype as above):

x = ""
show_return_dtype(x)
# tf: <dtype: 'int32'>
# pt: torch.int64
# jax: int32
# mlx: mlx.core.int32
# np: int64

x = "abc"
show_return_dtype(x)
# tf: <dtype: 'int32'>
# pt: torch.int64
# jax: int32
# mlx: mlx.core.int32
# np: int64

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1 and @ArthurZucker

@github-actions github-actions bot marked this pull request as draft March 5, 2025 08:08
Contributor

github-actions bot commented Mar 5, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@wangzhen0518 wangzhen0518 marked this pull request as ready for review March 5, 2025 08:12
@Rocketknight1
Member

Rocketknight1 commented Mar 6, 2025

cc @ArthurZucker @itazap for review. The full explanation is at #36277, but the tl;dr is that WITHOUT this PR:

  • Tokenizer tensor outputs are generated with torch.tensor()
  • torch.tensor() infers the dtype according to the input data and the following rule:
    • If every element is an int, dtype=torch.int64
    • If at least one element is a float, dtype=torch.float32
    • If the input list is empty, dtype=torch.float32

This creates an edge case when our tokenizers return empty arrays - the dtype switches from torch.int64 to torch.float32, and this breaks ops like torch.concat. This PR always casts the dtype to torch.int64, but is there any case where tokenizers return float32 arrays? I can't think of one, but this will definitely break those cases if they exist!
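The inference rules above can be reproduced framework-neutrally. Here is a minimal sketch using NumPy, whose np.array follows the same pattern (all-integer input infers an integer dtype, empty input falls back to the float default):

```python
import numpy as np

# np.array infers dtype the same way torch.tensor does:
# all-int input -> platform default int, any float -> float,
# and an empty list falls back to the float default.
print(np.array([1, 2, 3]).dtype)  # platform default int (e.g. int64)
print(np.array([1.0, 2]).dtype)   # float64
print(np.array([]).dtype)         # float64: the empty-input edge case
```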

@Rocketknight1
Member

Rebasing to run tests, gentle ping @ArthurZucker @itazap

@wangzhen0518
Author

It seems that some test cases for speech models tokenize float inputs. I modified the code to handle this case and fixed a bug in the previous commit. Now, the dtype is set to int only when the value is empty and the user has not set the dtype. In all other cases, the behavior is unchanged from before this PR.
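A minimal sketch of that rule (illustrative only, not the actual transformers code; NumPy stands in for the framework-specific tensor conversion):

```python
import numpy as np

def convert_to_tensor(value, dtype=None):
    # Hypothetical sketch of the PR's rule: force an integer dtype only
    # when the input is empty AND the caller did not request a dtype.
    # Non-empty inputs and explicit dtypes behave exactly as before.
    if len(value) == 0 and dtype is None:
        dtype = np.int64
    return np.asarray(value, dtype=dtype)

print(convert_to_tensor([]).dtype)                    # int64 (forced)
print(convert_to_tensor([1, 2]).dtype)                # platform default int, as before
print(convert_to_tensor([0.5]).dtype)                 # float64 (floats untouched)
print(convert_to_tensor([], dtype=np.float32).dtype)  # float32 (user dtype respected)
```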

@wangzhen0518
Author

I don't understand the two failed tests. They appear to be an issue with the model files on Hugging Face, unrelated to tokenization.

FAILED tests/models/wav2vec2/test_modeling_wav2vec2.py::Wav2Vec2ModelTest::test_pipeline_audio_classification - FileNotFoundError: Couldn't find any data file at /root/project/hf-internal-testing/librispeech_asr_dummy. Couldn't find 'hf-internal-testing/librispeech_asr_dummy' on the Hugging Face Hub either: LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

FAILED tests/models/wav2vec2/test_modeling_wav2vec2.py::Wav2Vec2ModelTest::test_pipeline_audio_classification_fp16 - FileNotFoundError: Couldn't find any data file at /root/project/hf-internal-testing/librispeech_asr_dummy. Couldn't find 'hf-internal-testing/librispeech_asr_dummy' on the Hugging Face Hub either: LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

@Rocketknight1
Member

Yes, these are CI issues. You can ignore them, we'll try to get them fixed!

@itazap
Collaborator

itazap commented Apr 2, 2025

LGTM (although I'm not familiar with where this typing can have a deeper effect).

@wangzhen0518
Author

Hi, I just merged the latest main branch into this PR and all tests have passed. Are there any other issues that need to be addressed?

@itazap
Collaborator

itazap commented Apr 14, 2025

@wangzhen0518 Thanks for addressing this! 😊 Do you mind please sharing a code snippet / description of the edge case where the inconsistent typing was an issue, so that we can add a test for it? That way your change won't be undone in the future! 😉

@wangzhen0518
Author

Inconsistent cases

When the input text is empty, the output tensor is an empty tensor of type float, such as torch.float32. When the input text is not empty, the output tensor is of type int, such as torch.int64. The types of the output tensors are inconsistent in these two cases.

t = tokenizer('', return_tensors='pt')
print(t['input_ids'].dtype) # torch.float32

t = tokenizer('abc', return_tensors='pt')
print(t['input_ids'].dtype) # torch.int64
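As a NumPy illustration of why this mismatch matters downstream (the torch case from the issue behaves analogously), concatenating the float-typed empty output with integer token ids changes the dtype of the whole sequence:

```python
import numpy as np

ids = np.array([101, 2023, 102], dtype=np.int64)  # ids from a non-empty input
empty = np.array([], dtype=np.float64)            # what the empty input produced
merged = np.hstack((ids, empty))
print(merged.dtype)  # float64: the integer ids were silently upcast
```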

The case of an empty input string is used when...

I use an LLM as an agent to interact with an environment, and the feedback from the environment is sometimes empty.
Here is an example: next_app_actor_prompt is sometimes an empty string.

actor_prompt = self.env.reset(query_config, args.seed, args.max_rollout_step, args.prompt_max_len)
inputs = self.tokenizer(
    [actor_prompt],
    return_tensors="pt",
    add_special_tokens=False,
    max_length=max_length,
    padding=True,
    truncation=True,
)
while not (failed or truncated or done):
    input_len: int = inputs["input_ids"].size(1)
    sequences, attention_mask, action_mask = self.actor.generate(**inputs, **generate_kwargs)
    action = self.get_action(sequences, input_len)
    # `next_app_actor_prompt` sometimes is an empty string.
    next_app_actor_prompt, reward, failed, truncated, done, _ = self.env.step(action)
    next_app_input_token = self.tokenize_fn(next_app_actor_prompt, self.prompt_max_len, device="cuda")
    next_inputs = {
        "input_ids": torch.hstack((sequences, next_app_input_token["input_ids"])),
        "attention_mask": torch.hstack((attention_mask, next_app_input_token["attention_mask"])),
    }
    inputs = next_inputs
    ...

What I do in this PR

I changed the output type to int when the user does not specify the output type and the input is empty.

Test case

Here is a test case demo.

import numpy as np
import tensorflow as tf
import torch
import jax.numpy as jnp
import mlx.core as mx

def test_empty_string():
    tokenizer = ...
    x = ""
    tokenizer_return_type = ["tf", "pt", "jax", "mlx", "np"]
    output_tensor_type = [tf.int32, torch.int64, jnp.int32, mx.int32, np.int64]
    for t1, t2 in zip(tokenizer_return_type, output_tensor_type):
        output = tokenizer(x, return_tensors=t1)
        # Compare with `==` rather than `is`: numpy and jax dtypes are not singletons.
        assert output["input_ids"].dtype == t2

Potential Issues

I noticed that the tokenizer processes audio data in some test cases. I'm not sure whether there could be cases where the input audio data is empty, and if so, whether the return type should be set to int when the audio input is empty (by default, it returns float when the audio is non-empty).

@Rocketknight1
Member

Hi @wangzhen0518 @itazap, this issue has cropped up again in #38417. Can we merge this fix, or is there anything else delaying it?

@wangzhen0518
Author

The current changes do not affect the existing text tokenization logic. The only modification is setting the return type to int when the input text is empty, which, as I understand it, is the expected behavior.

However, as previously discussed, since audio tokenization outputs a float type, any case where the audio input is empty should also return a float. The current changes would break this consistency. Thus, the key question is: does audio tokenization ever encounter empty input, i.e., len(value) == 0? I have no expertise in audio tokenization, but I suspect this might occur when an audio file has a duration of 0 seconds.

Is this scenario significant enough to warrant a compatibility solution?

Collaborator

@ArthurZucker ArthurZucker left a comment


LGTM let's just add a test please!

@wangzhen0518 wangzhen0518 force-pushed the tokenize-dtype branch 8 times, most recently from 3d111b9 to 4b48bd1 Compare May 29, 2025 09:15
@wangzhen0518
Author

I added a test case, but I found that some tokenizers do not support tokenizing str. They require other input data structures. The unsupported tokenizers are as follows:

  • LayoutLMv2Tokenizer, LayoutLMv3Tokenizer, LayoutXLMTokenizer, and UdopTokenizer require input as list[str] along with boxes.
  • MarkupLMTokenizer requires input as list[str] along with xpaths.
  • TapasTokenizer requires input as pd.DataFrame.

I'm not familiar with these types of tokenizers. Currently, I added the test case in tests/test_tokenization_common.py. Do I need to add separate test cases for these tokenizers?
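One hypothetical way a shared test could handle those tokenizers (a pytest-style sketch; the class list and helper name are assumptions, not existing transformers code) is to skip classes whose call signature does not accept a bare string:

```python
import pytest

# Hypothetical list of tokenizer classes that cannot tokenize a bare string
# (per the comment above, they need boxes, xpaths, or a DataFrame instead).
TOKENIZERS_REQUIRING_EXTRA_INPUTS = {
    "LayoutLMv2Tokenizer", "LayoutLMv3Tokenizer", "LayoutXLMTokenizer",
    "UdopTokenizer", "MarkupLMTokenizer", "TapasTokenizer",
}

def maybe_skip(tokenizer_class_name: str) -> None:
    # Skip the empty-string dtype test for tokenizers that do not accept str.
    if tokenizer_class_name in TOKENIZERS_REQUIRING_EXTRA_INPUTS:
        pytest.skip(f"{tokenizer_class_name} does not accept plain string input")
```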

Development

Successfully merging this pull request may close these issues.

The output tensor's data type is not torch.long when the input text is empty.
4 participants