Commit 03d86ac

Use fallback config if class not defined (#53)
Fixes distilgpt2 tokenization. Previously, we only used the fallback configuration if there was no `tokenizer_config.json` in the model repo. These files are now being added to some repos as part of removing dependencies on transformers' internals, as in this PR: huggingface/transformers#29112. However, only the keys removed from the hardcoded rules are being added, to minimize potential breaking changes. We now use the fallback config when `tokenizer_config.json` exists, no tokenizer class is specified in it, and we have a fallback config for this architecture.
1 parent bbbd7bf commit 03d86ac
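The merge performed by this commit can be sketched with plain dictionaries. The `uniquingKeysWith: { current, _ in current }` closure keeps the value from the receiver (the fallback config) on key conflicts, while keys present only in the hub's `tokenizer_config.json` are carried over. The key/value pairs below are hypothetical illustrations, not the actual contents of any repo's config:

```swift
// Plain [String: String] dictionaries stand in for the library's Config type.
// Hypothetical fallback config for a GPT-2-style architecture:
let fallback: [String: String] = [
    "tokenizer_class": "GPT2Tokenizer",
    "unk_token": "<|endoftext|>",
]
// Hypothetical hub tokenizer_config.json with no tokenizer_class:
let hub: [String: String] = [
    "unk_token": "<unk>",
    "model_max_length": "1024",
]
// On duplicate keys, `current` is the value from `fallback`, so the fallback wins.
let merged = fallback.merging(hub, uniquingKeysWith: { current, _ in current })
// merged["tokenizer_class"] == "GPT2Tokenizer"   (supplied by the fallback)
// merged["unk_token"] == "<|endoftext|>"         (fallback wins the conflict)
// merged["model_max_length"] == "1024"           (new key taken from the hub config)
```

Note the direction of the merge: the fallback is the receiver, so its values take precedence on conflicts, and the hub config only contributes keys the fallback lacks.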

File tree

1 file changed: +8 −0 lines


Sources/Hub/Hub.swift

Lines changed: 8 additions & 0 deletions
```diff
@@ -130,6 +130,14 @@ public class LanguageModelConfigurationFromHub {
         // Try to guess the class if it's not present and the modelType is
         if let _ = hubConfig.tokenizerClass?.stringValue { return hubConfig }
         guard let modelType = try await modelType else { return hubConfig }
+
+        // If the config exists but doesn't contain a tokenizerClass, use a fallback config if we have it
+        if let fallbackConfig = Self.fallbackTokenizerConfig(for: modelType) {
+            let configuration = fallbackConfig.dictionary.merging(hubConfig.dictionary, uniquingKeysWith: { current, _ in current })
+            return Config(configuration)
+        }
+
+
         // Guess by capitalizing
         var configuration = hubConfig.dictionary
         configuration["tokenizer_class"] = "\(modelType.capitalized)Tokenizer"
         return Config(configuration)
```
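When no fallback config exists for the architecture, the code above falls through to its last resort: guessing the class name by capitalizing the model type. A minimal sketch of that guess, with a hypothetical model type:

```swift
// Last-resort guess: capitalize the model type and append "Tokenizer",
// mirroring the "Guess by capitalizing" branch in the diff above.
let modelType = "llama"  // hypothetical model type from config.json
let guessedClass = "\(modelType.capitalized)Tokenizer"
// guessedClass == "LlamaTokenizer"
```

This heuristic only works when the capitalized model type matches the actual tokenizer class name, which is exactly why architectures like distilgpt2 need the fallback-config path added by this commit.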
