From 7e950174e24fe8968e6a2ba3bde7de8e591663c6 Mon Sep 17 00:00:00 2001
From: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Date: Fri, 6 Jun 2025 14:53:45 -0700
Subject: [PATCH] sentencepiece
---
chapters/en/chapter6/7.mdx | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/chapters/en/chapter6/7.mdx b/chapters/en/chapter6/7.mdx
index b0067ecc5..b0f0320f4 100644
--- a/chapters/en/chapter6/7.mdx
+++ b/chapters/en/chapter6/7.mdx
@@ -7,7 +7,9 @@
{label: "Aws Studio", value: "https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter6/section7.ipynb"},
]} />
-The Unigram algorithm is often used in SentencePiece, which is the tokenization algorithm used by models like AlBERT, T5, mBART, Big Bird, and XLNet.
+The Unigram algorithm is used in combination with [SentencePiece](https://huggingface.co/papers/1808.06226), which is the tokenization algorithm used by models like AlBERT, T5, mBART, Big Bird, and XLNet.
+
+SentencePiece addresses the fact that not all languages use spaces to separate words. Instead, SentencePiece treats the input as a raw input stream which includes the space in the set of characters to use. Then it can use the Unigram algorithm to construct the appropriate vocabulary.
@@ -378,4 +380,10 @@ tokenize("This is the Hugging Face course.", model)
['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁', 'c', 'ou', 'r', 's', 'e', '.']
```
+
+
+The XLNetTokenizer uses SentencePiece which is why the `"_"` character is included. To decode with SentencePiece, concatenate all the tokens and replace `"_"` with a space.
+
+
+
That's it for Unigram! Hopefully by now you're feeling like an expert in all things tokenizer. In the next section, we will delve into the building blocks of the 🤗 Tokenizers library, and show you how you can use them to build your own tokenizer.