Skip to content

Match token value with array and tuple #183

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

shavit
Copy link
Contributor

@shavit shavit commented Feb 25, 2025

  • Add match cases for token value
  • Add BartTokenizer to known tokenizers

/Close #182

  * Add match cases for token value
  * Add BartTokenizer to known tokenizers
@greenrazer
Copy link
Collaborator

Thank you for the contribution @shavit! I tested it and facebook/bart-large-mnli (#182) is now working with the AutoTokenizer.

However this is a curious issue, because when pulling from a URL AutoTokenzier.from calls:

  • LanguageModelConfigurationFromHub.init
  • Then LanguageModelConfigurationFromHub.loadConfig
  • Then calls this code from HubAPI.swift:
func configuration(fileURL: URL) throws -> Config {
    let data = try Data(contentsOf: fileURL)
    let parsed = try JSONSerialization.jsonObject(with: data, options: [])
    guard let dictionary = parsed as? [NSString: Any] else { throw Hub.HubClientError.parse }
    return Config(dictionary)
}

I could be missing something, but since JSON doesn't support tuples, wouldn't we expect the value entries in the config to be arrays instead of tuples? I'm a bit confused why this worked previously. @pcuenca @FL33TW00D

@shavit shavit marked this pull request as draft February 26, 2025 16:00
@shavit
Copy link
Contributor Author

shavit commented Feb 26, 2025

Right, I agree. The config likely decodes to a dictionary with any byte values, and the token value casts in a dynamic lookup.

I'll do a few more checks.

@pcuenca
Copy link
Member

pcuenca commented Feb 26, 2025

I think you are right @greenrazer! I don't think this could ever work with actual tokenizer files. sep and cls would be decoded as arrays, and I don't think we can cast them to tuples. The correct way to do it is the current one proposed in this PR. Actually, I don't think this would work either.

@shavit
Copy link
Contributor Author

shavit commented Feb 27, 2025

True, but the JSON array was decoded into Any?, not a tuple. The value is then read from a new Config and matched with the correct type.
https://github.com/huggingface/swift-transformers/blob/main/Sources/Hub/Hub.swift#L83

The problem comes from the token order, which is not index-first.

{
  "type": "RobertaProcessing",
  "sep": [
    "</s>",
    2
  ],
  "cls": [
    "<s>",
    0
  ],
}

Comment on lines +105 to +106
case let (i, t) as (UInt, String): return (i, t)
case let (t, i) as (String, UInt): return (i, t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know if both versions exist in the wild?

@@ -118,4 +118,14 @@ class HubTests: XCTestCase {
let vocab_dict = config.dictionary["vocab"] as! [String: Int]
XCTAssertNotEqual(vocab_dict.count, 2)
}

func testConfigTokenValueDifferentOrder() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test! Could we please add another one that loads a tokenizer such as facebook/bart-large-mnli? We don't need anything fancy: just show that the tokenizer loads and that it can encode a string.

@jkrukowski
Copy link
Contributor

jkrukowski commented Mar 6, 2025

Added the discussed parsing in this PR #188
If the parsing to tuple is not needed I could remove it, lmk what you think @pcuenca and @greenrazer

@shavit shavit mentioned this pull request Mar 22, 2025
@FL33TW00D
Copy link
Collaborator

@jkrukowski added this to his PR #188 so I will close this @shavit but thank you for your contribution!

@FL33TW00D FL33TW00D closed this Mar 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fatalError("Missing sep processor configuration" when using "facebook/bart-large-mnli"
5 participants