Commit 30cdeb2

TensorFlow Datasets Team authored and copybara-github committed
Fix bug involving temporary filepaths when loading a Tokenizer of a TokenTextEncoder.
Because the data directory is marked as "incomplete" while the dataset is being built, and later renamed, absolute filepaths written while the dataset is being built are wrong. This change eliminates the absolute filepaths and instead assumes a constant relative structure between the metadata files.

PiperOrigin-RevId: 258279826
1 parent a5d0580 commit 30cdeb2

1 file changed, +4 -4 lines changed

tensorflow_datasets/core/features/text/text_encoder.py

Lines changed: 4 additions & 4 deletions
@@ -330,16 +330,16 @@ def save_to_file(self, filename_prefix):
     }
     if self._user_defined_tokenizer is not None:
       self._tokenizer.save_to_file(filename)
-      kwargs["tokenizer_file_prefix"] = filename
+      kwargs["has_tokenizer"] = True
     self._write_lines_to_file(filename, self._vocab_list, kwargs)
 
   @classmethod
   def load_from_file(cls, filename_prefix):
     filename = cls._filename(filename_prefix)
     vocab_lines, kwargs = cls._read_lines_from_file(filename)
-    tokenizer_file = kwargs.pop("tokenizer_file_prefix", None)
-    if tokenizer_file:
-      kwargs["tokenizer"] = Tokenizer.load_from_file(tokenizer_file)
+    has_tokenizer = kwargs.pop("has_tokenizer", False)
+    if has_tokenizer:
+      kwargs["tokenizer"] = Tokenizer.load_from_file(filename)
     return cls(vocab_list=vocab_lines, **kwargs)
 
 
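The failure mode described in the commit message can be reproduced outside the library. The following is a minimal standalone sketch, not the TFDS implementation: the `save`, `load`, and `build_then_rename` helpers, the JSON metadata format, and the `.meta` / `.tokenizer` suffixes are all hypothetical and chosen only for illustration. It writes metadata inside a directory named `data.incomplete`, renames that directory to `data`, and then tries to load the tokenizer file, once with a stored absolute path and once with a boolean flag plus a path derived from the prefix given at load time.

```python
import json
import os
import tempfile


def save(prefix, use_absolute_path):
  """Writes a toy tokenizer file and a metadata file side by side.

  The ".tokenizer" and ".meta" suffixes are hypothetical, for illustration.
  """
  tokenizer_path = prefix + ".tokenizer"
  with open(tokenizer_path, "w") as f:
    f.write("toy tokenizer state")
  if use_absolute_path:
    # Old approach: record where the tokenizer file lives right now.
    metadata = {"tokenizer_file": os.path.abspath(tokenizer_path)}
  else:
    # New approach: record only that a tokenizer file exists.
    metadata = {"has_tokenizer": True}
  with open(prefix + ".meta", "w") as f:
    json.dump(metadata, f)


def load(prefix):
  """Reads the metadata, then the tokenizer file it points to."""
  with open(prefix + ".meta") as f:
    metadata = json.load(f)
  if "tokenizer_file" in metadata:
    tokenizer_path = metadata["tokenizer_file"]  # Path from build time.
  else:
    tokenizer_path = prefix + ".tokenizer"  # Derived from the load-time prefix.
  with open(tokenizer_path) as f:
    return f.read()


def build_then_rename(use_absolute_path):
  """Saves into an "incomplete" directory, renames it, then loads.

  Returns the tokenizer contents, or None if the stored path is stale.
  """
  root = tempfile.mkdtemp()
  incomplete = os.path.join(root, "data.incomplete")
  final = os.path.join(root, "data")
  os.mkdir(incomplete)
  save(os.path.join(incomplete, "text"), use_absolute_path)
  os.rename(incomplete, final)
  try:
    return load(os.path.join(final, "text"))
  except FileNotFoundError:
    return None


if __name__ == "__main__":
  print("absolute path:", build_then_rename(use_absolute_path=True))
  print("relative flag:", build_then_rename(use_absolute_path=False))
```

In this sketch the absolute-path variant fails after the rename because the recorded path still points into `data.incomplete`, while the flag variant keeps working because the tokenizer path is recomputed from whatever prefix the caller passes at load time.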