Commit 0ff89a4

Merge pull request #439 from huggingface/v1.0.0-pre
Prepare v1.0.0 release - `Trainer`, `TrainingArguments`, SetFitABSA, logging, evaluation during training, callbacks, docs
2 parents 061f456 + 3152e49 commit 0ff89a4

74 files changed: +9265 −1711 lines

.github/workflows/build_documentation.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -13,6 +13,7 @@ jobs:
     with:
       commit_sha: ${{ github.sha }}
       package: setfit
+      notebook_folder: setfit_doc
       languages: en
     secrets:
       token: ${{ secrets.HUGGINGFACE_PUSH }}
```

.github/workflows/quality.yml

Lines changed: 3 additions & 0 deletions
```diff
@@ -5,9 +5,12 @@ on:
     branches:
       - main
       - v*-release
+      - v*-pre
   pull_request:
     branches:
       - main
+      - v*-pre
+
   workflow_dispatch:
 
 jobs:
```

.github/workflows/tests.yml

Lines changed: 6 additions & 0 deletions
```diff
@@ -5,9 +5,12 @@ on:
     branches:
       - main
       - v*-release
+      - v*-pre
   pull_request:
     branches:
       - main
+      - v*-pre
+
   workflow_dispatch:
 
 jobs:
@@ -40,6 +43,9 @@ jobs:
         run: |
           python -m pip install --no-cache-dir --upgrade pip
           python -m pip install --no-cache-dir ${{ matrix.requirements }}
+          python -m pip install '.[codecarbon]'
+          python -m spacy download en_core_web_lg
+          python -m spacy download en_core_web_sm
         if: steps.restore-cache.outputs.cache-hit != 'true'
 
       - name: Install the checked-out setfit
```

.gitignore

Lines changed: 4 additions & 0 deletions
```diff
@@ -149,3 +149,7 @@ scripts/tfew/run_tmux.sh
 # macOS
 .DS_Store
 .vscode/settings.json
+
+# Common SetFit Trainer logging folders
+wandb
+runs/
```

MANIFEST.in

Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+include src/setfit/model_card_template.md
```

README.md

Lines changed: 43 additions & 294 deletions
Large diffs are not rendered by default.

docs/README.md

Lines changed: 5 additions & 29 deletions
````diff
@@ -5,7 +5,7 @@ Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at
 
-    http://www.apache.org/licenses/LICENSE-2.0
+    https://www.apache.org/licenses/LICENSE-2.0
 
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
@@ -78,7 +78,7 @@ The `preview` command only works with existing doc files. When you add a complet
 Accepted files are Markdown (.md or .mdx).
 
 Create a file with its extension and put it in the source directory. You can then link it to the toc-tree by putting
-the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/setfit/blob/main/docs/source/_toctree.yml) file.
+the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/setfit/blob/main/docs/source/en/_toctree.yml) file.
 
 ## Renaming section headers and moving sections
 
@@ -103,7 +103,7 @@ Sections that were moved:
 
 Use the relative style to link to the new file so that the versioned docs continue to work.
 
-For an example of a rich moved section set please see the very end of [the Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.mdx).
+For an example of a rich moved section set please see the very end of [the Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.md).
 
 
 ## Writing Documentation - Specification
@@ -123,34 +123,10 @@ Make sure to put your new file under the proper section. It's unlikely to go in
 depending on the intended targets (beginners, more advanced users, or researchers) it should go in sections two, three, or
 four.
 
-### Translating
 
-When translating, refer to the guide at [./TRANSLATING.md](https://github.com/huggingface/setfit/blob/main/docs/TRANSLATING.md).
+### Autodoc
 
-
-### Adding a new model
-
-When adding a new model:
-
-- Create a file `xxx.mdx` or under `./source/model_doc` (don't hesitate to copy an existing file as template).
-- Link that file in `./source/_toctree.yml`.
-- Write a short overview of the model:
-  - Overview with paper & authors
-  - Paper abstract
-  - Tips and tricks and how to use it best
-- Add the classes that should be linked in the model. This generally includes the configuration, the tokenizer, and
-  every model of that class (the base model, alongside models with additional heads), both in PyTorch and TensorFlow.
-  The order is generally:
-  - Configuration,
-  - Tokenizer
-  - PyTorch base model
-  - PyTorch head models
-  - TensorFlow base model
-  - TensorFlow head models
-  - Flax base model
-  - Flax head models
-
-These classes should be added using our Markdown syntax. Usually as follows:
+The following are some examples of `[[autodoc]]` for documentation building.
 
 ```
 ## XXXConfig
````

docs/source/_config.py

Lines changed: 9 additions & 0 deletions
```diff
@@ -0,0 +1,9 @@
+# docstyle-ignore
+INSTALL_CONTENT = """
+# SetFit installation
+! pip install setfit
+# To install from source instead of the last release, comment the command above and uncomment the following one.
+# ! pip install git+https://github.com/huggingface/setfit.git
+"""
+
+notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]
```

docs/source/en/_toctree.yml

Lines changed: 41 additions & 9 deletions
```diff
@@ -6,21 +6,53 @@
   - local: installation
     title: Installation
   title: Get started
+
 - sections:
-  - local: tutorials/placeholder
-    title: Placeholder
+  - local: tutorials/overview
+    title: Overview
+  - local: tutorials/zero_shot
+    title: Zero-shot Text Classification
+  - local: tutorials/onnx
+    title: Efficiently run SetFit with ONNX
   title: Tutorials
+
 - sections:
-  - local: how_to/placeholder
-    title: Placeholder
+  - local: how_to/overview
+    title: Overview
+  - local: how_to/callbacks
+    title: Callbacks
+  - local: how_to/model_cards
+    title: Model Cards
+  - local: how_to/classification_heads
+    title: Classification Heads
+  - local: how_to/multilabel
+    title: Multilabel Text Classification
+  - local: how_to/zero_shot
+    title: Zero-shot Text Classification
+  - local: how_to/hyperparameter_optimization
+    title: Hyperparameter Optimization
+  - local: how_to/knowledge_distillation
+    title: Knowledge Distillation
+  - local: how_to/batch_sizes
+    title: Batch Sizes for Inference
+  - local: how_to/absa
+    title: Aspect Based Sentiment Analysis
+  - local: how_to/v1.0.0_migration_guide
+    title: v1.0.0 Migration Guide
   title: How-to Guides
+
 - sections:
-  - local: conceptual_guides/placeholder
-    title: Placeholder
+  - local: conceptual_guides/setfit
+    title: SetFit
+  - local: conceptual_guides/sampling_strategies
+    title: Sampling Strategies
   title: Conceptual Guides
+
 - sections:
-  - local: api/main
+  - local: reference/main
     title: Main classes
-  - local: api/trainer
+  - local: reference/trainer
     title: Trainer classes
-  title: API
+  - local: reference/utility
+    title: Utility
+  title: Reference
```

docs/source/en/api/main.mdx

Lines changed: 0 additions & 8 deletions
This file was deleted.

docs/source/en/api/trainer.mdx

Lines changed: 0 additions & 8 deletions
This file was deleted.

docs/source/en/conceptual_guides/placeholder.mdx

Lines changed: 0 additions & 3 deletions
This file was deleted.
docs/source/en/conceptual_guides/sampling_strategies.mdx

Lines changed: 87 additions & 0 deletions

# SetFit Sampling Strategies

SetFit supports various contrastive pair sampling strategies in [`TrainingArguments`]. In this conceptual guide, we will learn about the following four sampling strategies:

1. `"oversampling"` (the default)
2. `"undersampling"`
3. `"unique"`
4. `"num_iterations"`

Consider first reading the [SetFit conceptual guide](../setfit) for a background on contrastive learning and positive & negative pairs.

## Running example

Throughout this conceptual guide, we will use the following example scenario:

* 3 classes: "happy", "content", and "sad".
* 20 total samples: 8 "happy", 4 "content", and 8 "sad" samples.

Considering that the sentence pairs `(X, Y)` and `(Y, X)` result in the same embedding distance/loss, we only want to consider one of those two cases. Furthermore, we don't want pairs where both sentences are the same, e.g. no `(X, X)`.

The resulting positive and negative pairs can be visualized in a table like the one below. The `+` and `-` represent positive and negative pairs, respectively. Furthermore, `h-n` represents the n-th "happy" sentence, `c-n` the n-th "content" sentence, and `s-n` the n-th "sad" sentence. Note that the area below the diagonal is not used, as `(X, Y)` and `(Y, X)` result in the same embedding distances, and that the diagonal itself is not used, as we are not interested in pairs where both sentences are identical.
|       |h-1|h-2|h-3|h-4|h-5|h-6|h-7|h-8|c-1|c-2|c-3|c-4|s-1|s-2|s-3|s-4|s-5|s-6|s-7|s-8|
|-------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|**h-1**|   | + | + | + | + | + | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-2**|   |   | + | + | + | + | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-3**|   |   |   | + | + | + | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-4**|   |   |   |   | + | + | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-5**|   |   |   |   |   | + | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-6**|   |   |   |   |   |   | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-7**|   |   |   |   |   |   |   | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-8**|   |   |   |   |   |   |   |   | - | - | - | - | - | - | - | - | - | - | - | - |
|**c-1**|   |   |   |   |   |   |   |   |   | + | + | + | - | - | - | - | - | - | - | - |
|**c-2**|   |   |   |   |   |   |   |   |   |   | + | + | - | - | - | - | - | - | - | - |
|**c-3**|   |   |   |   |   |   |   |   |   |   |   | + | - | - | - | - | - | - | - | - |
|**c-4**|   |   |   |   |   |   |   |   |   |   |   |   | - | - | - | - | - | - | - | - |
|**s-1**|   |   |   |   |   |   |   |   |   |   |   |   |   | + | + | + | + | + | + | + |
|**s-2**|   |   |   |   |   |   |   |   |   |   |   |   |   |   | + | + | + | + | + | + |
|**s-3**|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   | + | + | + | + | + |
|**s-4**|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   | + | + | + | + |
|**s-5**|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   | + | + | + |
|**s-6**|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   | + | + |
|**s-7**|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   | + |
|**s-8**|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |

As shown in the prior table, we have 28 positive pairs for "happy", 6 positive pairs for "content", and another 28 positive pairs for "sad". In total, this is 62 positive pairs. Also, we have 32 negative pairs between "happy" and "content", 64 negative pairs between "happy" and "sad", and 32 negative pairs between "content" and "sad". In total, this is 128 negative pairs.
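These counts can be verified with a few lines of plain Python, counting same-class combinations as positive pairs and cross-class products as negative pairs:

```python
from math import comb

# Class sizes from the running example: 8 "happy", 4 "content", 8 "sad".
sizes = {"happy": 8, "content": 4, "sad": 8}

# Positive pairs: unordered pairs of distinct sentences sharing a label.
positive_pairs = sum(comb(n, 2) for n in sizes.values())

# Negative pairs: one sentence from each of two different labels.
labels = list(sizes)
negative_pairs = sum(
    sizes[a] * sizes[b] for i, a in enumerate(labels) for b in labels[i + 1 :]
)

print(positive_pairs, negative_pairs)  # 62 128
```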
## Oversampling

By default, SetFit applies the oversampling strategy for its contrastive pairs. This strategy samples an equal amount of positive and negative training pairs, oversampling the minority pair type to match that of the majority pair type. As the number of negative pairs is generally larger than the number of positive pairs, this usually involves oversampling the positive pairs.

In our running example, this would involve oversampling the 62 positive pairs up to 128, resulting in one epoch of 128 + 128 = 256 pairs. In summary:

* ✅ An equal amount of positive and negative pairs are sampled.
* ✅ Every possible pair is used.
* ❌ There is some data duplication.
## Undersampling

Like oversampling, this strategy samples an equal amount of positive and negative training pairs. However, it undersamples the majority pair type to match that of the minority pair type. This usually involves undersampling the negative pairs to match the positive pairs.

In our running example, this would involve undersampling the 128 negative pairs down to 62, resulting in one epoch of 62 + 62 = 124 pairs. In summary:

* ✅ An equal amount of positive and negative pairs are sampled.
* ❌ **Not** every possible pair is used.
* ✅ There is **no** data duplication.
## Unique

Thirdly, the unique strategy does not sample an equal amount of positive and negative training pairs. Instead, it simply samples all possible pairs exactly once. No form of oversampling or undersampling is used here.

In our running example, this would involve sampling all negative and positive pairs, resulting in one epoch of 62 + 128 = 190 pairs. In summary:

* ❌ **Not** an equal amount of positive and negative pairs are sampled.
* ✅ Every possible pair is used.
* ✅ There is **no** data duplication.
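The epoch sizes produced by these three strategies follow directly from the pair counts of the running example:

```python
# Pairs per epoch for each strategy, given the 62 positive and 128 negative
# pairs of the running example.
pos, neg = 62, 128

oversampling = 2 * max(pos, neg)   # minority pair type is oversampled to match
undersampling = 2 * min(pos, neg)  # majority pair type is undersampled to match
unique = pos + neg                 # every possible pair, exactly once

print(oversampling, undersampling, unique)  # 256 124 190
```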
## `num_iterations`

Lastly, SetFit can still be used with a deprecated sampling strategy involving the `num_iterations` training argument. Unlike the other sampling strategies, this strategy does not depend on the number of possible pairs. Instead, it samples `num_iterations` positive pairs and `num_iterations` negative pairs for each training sample.

In our running example, if we assume `num_iterations=20`, then we would sample 20 positive pairs and 20 negative pairs per training sample. Because there are 20 samples, this involves (20 + 20) * 20 = 800 pairs. Because there are only 190 unique pairs, this certainly involves some data duplication. However, it does not guarantee that every possible pair is used. In summary:

* ❌ **Not** an equal amount of positive and negative pairs are sampled.
* ❌ Not necessarily every possible pair is used.
* ❌ There is some data duplication.
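The arithmetic above can be sketched as follows; note that, unlike the other strategies, the total scales with the number of training samples rather than the number of possible pairs:

```python
# Pairs per epoch under the deprecated num_iterations strategy.
num_iterations = 20
num_samples = 20  # 8 "happy" + 4 "content" + 8 "sad"

# num_iterations positive pairs plus num_iterations negative pairs per sample.
pairs = (num_iterations + num_iterations) * num_samples
print(pairs)  # 800
```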
docs/source/en/conceptual_guides/setfit.mdx

Lines changed: 28 additions & 0 deletions

# Sentence Transformers Finetuning (SetFit)

SetFit is a model framework to efficiently train text classification models with surprisingly little training data. For example, with only 8 labeled examples per class on the Customer Reviews (CR) sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples. Furthermore, SetFit is fast to train and run inference with, and can easily support multilingual tasks.

Every SetFit model consists of two parts: a **sentence transformer** embedding model (the body) and a **classifier** (the head). These two parts are trained in two separate phases: the **embedding finetuning phase** and the **classifier training phase**. This conceptual guide will elaborate on the intuition behind these phases, and why SetFit works so well.
## Embedding finetuning phase

The first phase has one primary goal: finetune a sentence transformer embedding model to produce useful embeddings for *our* classification task. The [Hugging Face Hub](https://huggingface.co/models?library=sentence-transformers) already has thousands of sentence transformer models available, many of which have been trained to very accurately group the embeddings of texts with similar semantic meaning.

However, models that are good at Semantic Textual Similarity (STS) are not necessarily immediately good at *our* classification task. For example, according to an embedding model, sentence 1) `"He biked to work."` will be much more similar to 2) `"He drove his car to work."` than to 3) `"Peter decided to take the bicycle to the beach party!"`. But if our classification task involves classifying texts into transportation modes, then we want our embedding model to place sentences 1 and 3 closely together, and sentence 2 further away.

To do so, we can finetune the chosen sentence transformer embedding model. The goal here is to nudge the model to use its pretrained knowledge in a way that better aligns with our classification task, rather than making it completely forget what it has learned.
For finetuning, SetFit uses **contrastive learning**. This training approach involves creating **positive and negative pairs** of sentences. A sentence pair is positive if both of the sentences are of the same class, and negative otherwise. For example, in the case of binary "positive"-"negative" sentiment analysis, `("The movie was awesome", "I loved it")` is a positive pair, and `("The movie was awesome", "It was quite disappointing")` is a negative pair.

During training, the embedding model receives these pairs and converts the sentences to embeddings. If the pair is positive, then training pulls the model weights such that the two text embeddings become more similar, and vice versa for a negative pair. Through this approach, sentences with the same label will be embedded more similarly, and sentences with different labels less similarly.

Conveniently, this contrastive learning works with pairs rather than individual samples, and we can create plenty of unique pairs from just a few samples. For example, given 8 positive sentences and 8 negative sentences, we can create 28 positive pairs among the positive sentences, another 28 among the negative sentences, and 64 negative pairs, for 120 unique training pairs. The number of pairs grows quadratically with the number of sentences, and that is why SetFit can train with just a few examples and still correctly finetune the sentence transformer embedding model. However, we should still be wary of overfitting.
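The pair creation can be sketched with the standard library, using hypothetical stand-in sentences and counting any same-label pair as positive:

```python
from itertools import combinations, product

# Hypothetical stand-ins for 8 positive and 8 negative training sentences.
positive = [f"positive sentence {i}" for i in range(8)]
negative = [f"negative sentence {i}" for i in range(8)]

# Positive pairs: two distinct sentences sharing a label.
positive_pairs = list(combinations(positive, 2)) + list(combinations(negative, 2))
# Negative pairs: one sentence from each label.
negative_pairs = list(product(positive, negative))

print(len(positive_pairs), len(negative_pairs))  # 56 64
```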

## Classifier training phase

Once the sentence transformer embedding model has been finetuned for our task at hand, we can start training the classifier. This phase has one primary goal: create a good mapping from the sentence transformer embeddings to the classes.

Unlike the first phase, training the classifier is done from scratch and uses the labeled samples directly, rather than pairs. By default, the classifier is a simple **logistic regression** classifier from scikit-learn. First, all training sentences are fed through the now-finetuned sentence transformer embedding model, and then the sentence embeddings and labels are used to fit the logistic regression classifier. The result is a strong and efficient classifier.

Using these two parts, SetFit models are efficient, performant, and easy to train, even on CPU-only devices.
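A minimal sketch of the classifier training phase, where a toy character-count function stands in for the finetuned sentence transformer body (a real SetFit model would produce the embeddings with its finetuned body instead):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for the finetuned embedding model: any function that maps texts to
# fixed-size vectors. This toy character-count embedding is only illustrative.
def embed(texts):
    return np.array([[text.count(c) for c in "abcdefghijklmnop"] for text in texts])

train_texts = ["i loved it", "fantastic movie", "dreadful film", "i hated it"]
train_labels = [1, 1, 0, 0]

# Fit a logistic regression head on the (embedding, label) pairs.
head = LogisticRegression()
head.fit(embed(train_texts), train_labels)

predictions = head.predict(embed(["an awful film", "a fantastic film"]))
```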
