Fix edge case for tokenize (#36277) #36555
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the "Ready for review" button.
cc @ArthurZucker @itazap for review. The full explanation is at #36277, but the tl;dr is that WITHOUT this PR: there is an edge case when our tokenizers return empty arrays — the dtype switches from `torch.int64` to `torch.float32`.
Force-pushed from 3a1a52e to f61631a.
Rebasing to run tests, gentle ping @ArthurZucker @itazap
It seems that some test cases in speech models pass float inputs to the tokenizer. I modified the code to handle this case and fixed a bug in the previous commit. Now the dtype is set to int only when the value is empty and the user has not set the dtype; in all other cases it remains unchanged from before this PR.
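A minimal sketch of the guard described above — `as_tensor` here is a hypothetical stand-in for the library's internal tensor-conversion step, not the actual transformers code:

```python
import numpy as np

def as_tensor(value, dtype=None):
    # Hypothetical sketch of the rule described above: force an integer
    # dtype only when the value is empty AND the caller did not request
    # a dtype. Float inputs (e.g. from speech test cases) pass through
    # with their dtype untouched.
    if dtype is None and len(value) == 0:
        dtype = np.int64
    return np.asarray(value, dtype=dtype)

print(as_tensor([]).dtype)          # int64 (was float64 without the guard)
print(as_tensor([1, 2, 3]).dtype)   # int64, unchanged behaviour
print(as_tensor([0.1, 0.2]).dtype)  # float64, unchanged behaviour
```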
I don't understand the two failed tests. They appear to be an issue with the model files on Hugging Face, unrelated to tokenization.
Yes, these are CI issues. You can ignore them, we'll try to get them fixed!
LGTM (although I'm not familiar with where this typing can have a deeper effect).
Hi, I just merged the latest main branch into this PR and all tests have passed. Are there any other issues that need to be addressed?
@wangzhen0518 Thanks for addressing this! 😊 Do you mind please sharing a code snippet / description of the edge case where the inconsistent typing was an issue, so that we can add a test for it? That way your change won't be undone in the future! 😉
**Inconsistent cases**

When the input text is empty, the output tensor is an empty tensor of type float:

```python
t = tokenizer('', return_tensors='pt')
print(t['input_ids'].dtype)  # torch.float32

t = tokenizer('abc', return_tensors='pt')
print(t['input_ids'].dtype)  # torch.int64
```

**The case of an empty input string is used when...**

I use an LLM as an agent to interact with an environment, and the feedback from the environment is sometimes empty.

```python
actor_prompt = self.env.reset(query_config, args.seed, args.max_rollout_step, args.prompt_max_len)
inputs = self.tokenizer(
    [actor_prompt],
    return_tensors="pt",
    add_special_tokens=False,
    max_length=max_length,
    padding=True,
    truncation=True,
)
while not (failed or truncated or done):
    input_len: int = inputs["input_ids"].size(1)
    sequences, attention_mask, action_mask = self.actor.generate(**inputs, **generate_kwargs)
    action = self.get_action(sequences, input_len)
    # `next_app_actor_prompt` is sometimes an empty string.
    next_app_actor_prompt, reward, failed, truncated, done, _ = self.env.step(action)
    next_app_input_token = self.tokenize_fn(next_app_actor_prompt, self.prompt_max_len, device="cuda")
    next_inputs = {
        "input_ids": torch.hstack((sequences, next_app_input_token["input_ids"])),
        "attention_mask": torch.hstack((attention_mask, next_app_input_token["attention_mask"])),
    }
    inputs = next_inputs
    ...
```

**What I do in this PR**

I changed the output type to int when the user does not specify the output type and the input is empty.

**Testcase**

Here is a testcase demo.

```python
def test_empty_string():
    tokenizer = ...
    x = ""
    tokenizer_return_type = ["tf", "pt", "jax", "mlx", "np"]
    output_tensor_type = [tf.int32, torch.int64, jnp.int32, mx.int32, np.int64]
    for t1, t2 in zip(tokenizer_return_type, output_tensor_type):
        output = tokenizer(x, return_tensors=t1)
        assert output["input_ids"].dtype is t2
```

**Potential issues**

I noticed that the tokenizer processes audio data in some test cases. I'm not sure whether there could be cases where the input audio data is empty, and if so, whether the return type should still be set to `int` in that situation.
Hi @wangzhen0518 @itazap, this issue has cropped up again in #38417. Can we merge this fix, or is there anything else delaying it?
The current changes do not affect the existing text tokenization logic. The only modification is setting the return type to int when the input text is empty, which, as I understand it, is the expected behavior. However, as previously discussed, audio tokenization outputs a float type, so any case where the audio input is empty should arguably also return a float, and the current changes would break that consistency. The key question is therefore: does audio tokenization ever encounter empty input, and is that scenario significant enough to warrant a compatibility solution?
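To make the concern concrete, under the rule sketched earlier (reusing the hypothetical `as_tensor` helper, not the actual transformers code), an empty audio input would flip from float to int:

```python
import numpy as np

def as_tensor(value, dtype=None):
    # Hypothetical helper from the earlier sketch.
    if dtype is None and len(value) == 0:
        dtype = np.int64
    return np.asarray(value, dtype=dtype)

print(as_tensor([0.1, 0.2]).dtype)  # float64 -- non-empty audio features stay float
print(as_tensor([]).dtype)          # int64  -- an empty audio input would now be int,
                                    # which is the consistency break discussed here
```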
LGTM, let's just add a test please!
Force-pushed from 3d111b9 to 4b48bd1, then from 4b48bd1 to d1f6110.
I added a test case, but I found that some tokenizers do not support tokenizing empty strings. I'm not familiar with these types of tokenizers. Currently, I added the test case in …
What does this PR do?
Fixes #36277
The dtype of the tensor returned by the tokenizer is inconsistent depending on whether the input text is empty or not, e.g.:
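A repro, mirroring the snippet from the discussion above:

```python
t = tokenizer('', return_tensors='pt')
print(t['input_ids'].dtype)  # torch.float32

t = tokenizer('abc', return_tensors='pt')
print(t['input_ids'].dtype)  # torch.int64
```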
After this PR, the behavior is fixed:
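Expected behavior after the fix, per the test case above (assuming the user does not set a dtype):

```python
t = tokenizer('', return_tensors='pt')
print(t['input_ids'].dtype)  # torch.int64

t = tokenizer('abc', return_tensors='pt')
print(t['input_ids'].dtype)  # torch.int64
```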
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case: "The output tensor's data type is not torch.long when the input text is empty." #36277
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@Rocketknight1 and @ArthurZucker