Skip to content

Conversation

@shavit
Copy link
Contributor

@shavit shavit commented Feb 25, 2025

  • Add match cases for token value
  • Add BartTokenizer to known tokenizers

/Close #182

  * Add match cases for token value
  * Add BartTokenizer to known tokenizers
@greenrazer
Copy link

Thank you for the contribution @shavit! I tested it and facebook/bart-large-mnli (#182) is now working with the AutoTokenizer.

However this is a curious issue, because when pulling from a URL AutoTokenzier.from calls:

  • LanguageModelConfigurationFromHub.init
  • Then LanguageModelConfigurationFromHub.loadConfig
  • Then calls this code from HubAPI.swift:
func configuration(fileURL: URL) throws -> Config {
    let data = try Data(contentsOf: fileURL)
    let parsed = try JSONSerialization.jsonObject(with: data, options: [])
    guard let dictionary = parsed as? [NSString: Any] else { throw Hub.HubClientError.parse }
    return Config(dictionary)
}

I could be missing something, but since JSON doesn't support tuples, wouldn't we expect the value entries in the config to be arrays instead of tuples? I'm a bit confused why this worked previously. @pcuenca @FL33TW00D

@shavit shavit marked this pull request as draft February 26, 2025 16:00
@shavit
Copy link
Contributor Author

shavit commented Feb 26, 2025

Right, I agree. The config likely decodes to a dictionary with any byte values, and the token value casts in a dynamic lookup.

I'll do a few more checks.

@pcuenca
Copy link
Member

pcuenca commented Feb 26, 2025

I think you are right @greenrazer! I don't think this could ever work with actual tokenizer files. sep and cls would be decoded as arrays, and I don't think we can cast them to tuples. The correct way to do it is the current one proposed in this PR. Actually, I don't think this would work either.

@shavit
Copy link
Contributor Author

shavit commented Feb 27, 2025

True, but the JSON array was decoded into Any?, not a tuple. The value is then read from a new Config and matched with the correct type.
https://github.com/huggingface/swift-transformers/blob/main/Sources/Hub/Hub.swift#L83

The problem comes from the token order, which is not index-first.

{
  "type": "RobertaProcessing",
  "sep": [
    "</s>",
    2
  ],
  "cls": [
    "<s>",
    0
  ],
}

Comment on lines +105 to +106
case let (i, t) as (UInt, String): return (i, t)
case let (t, i) as (String, UInt): return (i, t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know if both versions exist in the wild?

XCTAssertNotEqual(vocab_dict.count, 2)
}

func testConfigTokenValueDifferentOrder() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test! Could we please add another one that loads a tokenizer such as facebook/bart-large-mnli? We don't need anything fancy: just show that the tokenizer loads and that it can encode a string.

@jkrukowski
Copy link
Contributor

jkrukowski commented Mar 6, 2025

Added the discussed parsing in this PR #188
If the parsing to tuple is not needed I could remove it, lmk what you think @pcuenca and @greenrazer

@shavit shavit mentioned this pull request Mar 22, 2025
@FL33TW00D
Copy link
Contributor

@jkrukowski added this to his PR #188 so I will close this @shavit but thank you for your contribution!

@FL33TW00D FL33TW00D closed this Mar 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fatalError("Missing sep processor configuration" when using "facebook/bart-large-mnli"

5 participants