docs/sparsity.md
25 additions & 12 deletions
@@ -4,10 +4,16 @@ Sparsity is one of promising model compression techniques that can be used to ac
The document describes the sparsity definition, the sparsity training flow, the validated models, and the performance benefit of software sparsity. Note that the document discusses sparse weights (with dense activations) for inference acceleration; sparse activations or sparse embeddings, whether for inference or training acceleration, are out of scope.
- > **Note**: training for sparsity with 2:4 or similar structured pattern is under development
+ > **Note**: training for sparsity with a 2:4 or similar structured pattern is supported; please refer to our new [API](../neural_compressor/experimental/pytorch_pruner/), the [question-answering examples](../examples/pytorch/nlp/huggingface_models/question-answering/pruning/pytorch_pruner/eager), and the [text-classification examples](../examples/pytorch/nlp/huggingface_models/text-classification/pruning/pytorch_pruner/eager)
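As a rough illustration of what training toward a 2:4 pattern involves, the sketch below derives a 2:4 mask by keeping the two largest-magnitude weights in every group of four contiguous elements. This is plain PyTorch written for this document, not the Pytorch Pruner API from the links above, and the helper name `two_in_four_mask` is made up for illustration.

```python
import torch

def two_in_four_mask(weight: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch only: build a 2:4 mask that keeps the two
    largest-magnitude values in every group of 4 contiguous elements
    along the last dimension."""
    assert weight.shape[-1] % 4 == 0, "last dimension must be divisible by 4"
    groups = weight.abs().reshape(*weight.shape[:-1], -1, 4)  # (..., groups, 4)
    top2 = groups.topk(2, dim=-1).indices                     # keep 2 of every 4
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, top2, 1.0)
    return mask.reshape(weight.shape)

# In a sparse training flow the mask would be (re)applied on a pruning schedule
# while fine-tuning continues, so the surviving weights can recover accuracy.
w = torch.randn(8, 16)
w_sparse = w * two_in_four_mask(w)
assert torch.count_nonzero(w_sparse) == w.numel() // 2        # exactly 50% zeros
```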
## Sparsity Definition
- Different from structured sparsity pattern [2:4](https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/) what NVidia proposed in Ampere architecture, we propose the block-wise structured sparsity patterns that we are able to demonstrate the performance benefits on existing Intel hardwares even without the support of hardware sparsity. A block-wise sparsity pattern with block size ```S``` means the contiguous ```S``` elements in this block are all zero values.
+ NVidia proposed [2:4 sparsity](https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/) (also known as "2in4 sparsity") in the Ampere architecture: for every 4 contiguous elements in a matrix, two are zero and the other two are non-zero.
+ Different from the 2:4 sparsity above, we propose block-wise structured sparsity patterns for which we can demonstrate performance benefits on existing Intel hardware even without hardware sparsity support. A block-wise sparsity pattern with block size ```S``` means that the contiguous ```S``` elements in this block are all zero values.
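A minimal sketch (values chosen arbitrarily) contrasting the two patterns on an 8-element vector: both are 50% sparse, but the zeros are laid out differently.

```python
import torch

# 2:4 pattern: every group of 4 contiguous elements contains exactly 2 zeros.
v_2in4  = torch.tensor([0.0, 1.3, 0.0, -0.7,  2.1, 0.0, 0.0, 0.5])

# Block-wise pattern with block size S = 4: the zeros form one contiguous block.
v_block = torch.tensor([0.0, 0.0, 0.0, 0.0,  2.1, -0.9, 1.4, 0.5])

def satisfies_2in4(v: torch.Tensor) -> bool:
    """True if every group of 4 contiguous elements has exactly 2 zeros."""
    return bool(((v.reshape(-1, 4) == 0).sum(dim=1) == 2).all())

print(satisfies_2in4(v_2in4))   # True
print(satisfies_2in4(v_block))  # False -- same sparsity ratio, different layout
```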
For a typical GEMM, the weight dimension is ```IC``` x ```OC```, where ```IC``` is the number of input channels and ```OC``` is the number of output channels. Note that sometimes ```IC``` is also called dimension ```K```, and ```OC``` is called dimension ```N```. The sparsity dimension is on ```OC``` (or ```N```).
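To make the dimension convention concrete, here is a small assumed example (shapes, block size ```S```, and target ratio are all arbitrary) that zeroes contiguous blocks of ```S``` elements along the ```OC``` dimension of an ```IC``` x ```OC``` weight and reports the resulting sparsity ratio.

```python
import torch

IC, OC, S = 64, 256, 16           # arbitrary sizes; S is the block size along OC
weight = torch.randn(IC, OC)

# Build a block-wise mask on the OC dimension: each (input-channel, block) pair
# is either kept entirely or zeroed entirely (S contiguous zeros along OC).
target_sparsity = 0.75
block_scores = torch.rand(IC, OC // S)                 # one random score per block
keep = (block_scores >= target_sparsity).float()       # ~25% of blocks survive
mask = keep.repeat_interleave(S, dim=1)                # expand back to (IC, OC)

sparse_weight = weight * mask
ratio = 1.0 - mask.mean().item()
print(f"achieved sparsity ratio: {ratio:.2%}")         # close to 75%
```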
@@ -45,16 +51,23 @@ def train():
We validate sparsity on typical models across different domains (including CV, NLP, and recommendation systems). The table below shows the sparsity pattern, sparsity ratio, and the accuracy of the sparse and dense (reference) model for each model. We also provide a simplified [BERT example](../examples/pytorch/nlp/huggingface_models/question-answering/pruning/group_lasso/eager) with only one sparse layer.
- | Model | Sparsity Pattern | Sparsity Ratio | Accuracy (Sparse Model) | Accuracy (Dense Model) |
**bold** means the sparsity dimension (```OC```).
+ * The Bert-Mini related examples are built on our [Pytorch Pruner API](../neural_compressor/experimental/pytorch_pruner/); see the [question answering](../examples/pytorch/nlp/huggingface_models/question-answering/pruning/pytorch_pruner/eager) and [text classification](../examples/pytorch/nlp/huggingface_models/text-classification/pruning/pytorch_pruner/eager) examples.