that have to deal with any possible code being compiled for them. For more
information on this, the reader is referred to [Computer
Architecture](https://books.google.it/books/about/Computer_Architecture.html?id=v3-1hVwHnHwC&redir_esc=y),
by Hennessy, Patterson and Asanovic. If the code being run does not contain
any of these conditional statements, there is no need to support them in
hardware, which leaves a lot of room for optimization.

Moreover, NorthPole is an inference-only accelerator, _i.e._, you cannot train a

It seems easy, doesn't it? Well, it is not. The `if` in `dummy_sparse_dot_prod`
is a mess. Why so? Well, the problem is that when this code runs in hardware,
the line `a_i, b_i = a[i], b[i]` is much more costly (_i.e._, it takes more
_energy_ to execute) than `a_i * b_i`
[[Horowitz](https://ieeexplore.ieee.org/document/6757323)]! This is due to how
we design our digital circuits to perform these operations. Hence, what you
really want to avoid is _reading_ the inputs, even more than multiplying them!
And if you need to read them just to check that they are not zero, well, you
have lost the

processing units (GPUs) support 2:4 sparsity, which means that out of every 4
consecutive elements in a matrix, 2 are zeros (more or less, I am not being
extremely precise on this).
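
To make this concrete, here is a minimal sketch (my own illustration, not how
the GPU or NorthPole actually implements it, and the helper name `prune_2_to_4`
is made up) of enforcing 2:4 structured sparsity on a weight tensor in PyTorch:
in every group of four consecutive weights, only the two largest-magnitude ones
survive.

```python
import torch

def prune_2_to_4(w: torch.Tensor) -> torch.Tensor:
    """Zero out the 2 smallest-magnitude values in every group of 4 weights."""
    groups = w.reshape(-1, 4)                     # view the weights in groups of 4
    keep = groups.abs().topk(k=2, dim=1).indices  # the 2 largest-magnitude entries
    mask = torch.zeros_like(groups).scatter(1, keep, 1.0)
    return (groups * mask).reshape(w.shape)

w = torch.randn(64, 64)     # the number of elements must be a multiple of 4
w_sparse = prune_2_to_4(w)  # at least half of the entries are now exactly zero
```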

### Axiom 2 - Getting inspired by biological neurons

> Inspired by biological precision, NorthPole is optimized for 8, 4, and 2-bit
low-precision. This is sufficient to achieve state-of-the-art inference accuracy
on many neural networks while dispensing with the high-precision required for
training.

Neurons in biology communicate by means of voltage spikes. These can be
interpreted as binary signals: if I have a spike, I have a logic one; if there
is no spike, I have a logic zero. This information encoding, if implemented in
hardware, requires a single bit. This is the analogy the authors are referring
to.

Why should I care about the precision of the data in my neural

By _precision_ we mean the number of bits with which your data is encoded. The
larger this number is, the larger the numbers you can describe with your bit
word, but also the smaller, since some of those bits are used to encode the
decimals. However, you cannot use a large number of bits for each datum: first,
because it would require much more memory to host these data; second, because
it would require much more energy to process them!

| Operation | Floating point energy [pJ] | Integer energy [pJ] | Energy ratio FP/INT |
|:--------------:|:--------------------------:|:-----------------------:|:----------------------------------------------:|
| Addition | 0.4 (16 b), 0.9 (32 b) | 0.03 (8 b), 0.1 (32 b) | **~13.3x** (16 b / 8 b), **9x** (32 b / 32 b) |
| Multiplication | 1.1 (16 b), 3.7 (32 b) | 0.2 (8 b), 3.1 (32 b) | **5.5x** (16 b / 8 b), **~1.2x** (32 b / 32 b) |

In the table above [[Horowitz](https://ieeexplore.ieee.org/document/6757323)],
the energy required to perform addition and multiplication on `int`s and
`float`s is provided. You will notice that it is much more convenient to work
with `int`s! This is because the physical hardware required to perform floating
point arithmetic is _much_ more complex than the corresponding integer one. That
is why we want to represent our DNN weights and activations with integers!

However, you cannot simply convert a floating point value to an integer. For
instance, the number 1.2345 would become 1 as an integer. This means that you
would degrade the information encoded in the data, and the network would not
perform as expected, showing a large loss in accuracy. For this reason,
researchers have come up with _quantization_ algorithms: these allow converting
floating point values to integer ones while losing as little information as
possible in the process. Since some information is lost in any case, the DNN
usually needs to be retrained a bit using its integer version in order to
recover the loss in performance.
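
As a rough picture of what the simplest of these algorithms does, here is a
symmetric, uniform quantization sketch (a textbook scheme, not IBM's algorithm;
the function names are mine) that maps a tensor to INT8 and back:

```python
import torch

def quantize_int8(x: torch.Tensor) -> tuple[torch.Tensor, float]:
    """Symmetric uniform quantization: map the largest magnitude to 127."""
    scale = x.abs().max().item() / 127
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: float) -> torch.Tensor:
    return q.to(torch.float32) * scale

x = torch.randn(4, 4)
q, scale = quantize_int8(x)
error = (x - dequantize(q, scale)).abs().max()  # small, but not zero: this is
                                                # the information loss above
```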

We have been running DNNs using INT8 since ~2017 without claiming
biological inspiration. Recently, however, progress has been made and we can use
INT4 quantization (only 4 bits to represent a number) with marginal loss in
performance compared to the 32-bit floating point (FP32) baseline that you
trained on your GPU
[[Keller et al.](https://ieeexplore.ieee.org/abstract/document/10019275?casa_token=fmLtbZfys2cAAAAA:UQvvJ3LWrATwWYtBQZ7HSAZigZdRe-k06Z9rOcKVc4c1LrrqXCe49E5IFgKRyC952n0Fmp_9UQ)].

GPU Architecture](https://resources.nvidia.com/en-us-tensor-core)].

> NorthPole has a distributed, modular core array (16-by-16), with each core
capable of massive parallelism (8192 2-bit operations per cycle) (Fig. 2F).

The claim of 8192 operations per clock cycle with INT2 operands means that
NorthPole has 2048 multiply-and-accumulate (MAC) units that work on INT8
precision operands (8192 / 4 = 2048, since four INT2 operands can be packed into
one INT8 word, as we will see in a moment). Why do we care about MACs?

DNNs are basically matrix multipliers: a neural network receives an input vector
and multiplies it by a matrix of weights, producing another vector as output.
Let us consider a _naive_ matrix-vector multiplication.

```python
import torch

def naive_mat_vec_mul(v: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    m_rows, m_cols = m.shape
    assert m_cols == len(v)
    res = torch.zeros((m_rows,))
    for r in range(m_rows):
        for c in range(m_cols):
            res[r] += v[c] * m[r, c]  # Hey, this is a multiplication and an
                                      # accumulation! Here's our MAC!
    return res
```

product. That's why we care about it.

These MACs can be configured in single-instruction-multiple-data (SIMD)
mode, _i.e._, you can "glue" together 4 INT2 operands to form an INT8 word and
work on these in parallel.
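
Here is a toy illustration of the packing idea (plain Python, unsigned 2-bit
values for simplicity; the real datapath does not unpack anything, it operates
on the packed word directly):

```python
def pack_int2(values: list[int]) -> int:
    """Pack four 2-bit values (0..3) into a single 8-bit word."""
    assert len(values) == 4 and all(0 <= v <= 3 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (2 * i)  # each value gets its own 2-bit slot
    return word

def unpack_int2(word: int) -> list[int]:
    """Recover the four 2-bit values from an 8-bit word."""
    return [(word >> (2 * i)) & 0b11 for i in range(4)]

assert unpack_int2(pack_int2([1, 0, 3, 2])) == [1, 0, 3, 2]
```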

{{<
figure

In layer-fuse architectures (also called _dataflow_, which also means another
thing in hardware accelerators just to mess with you), instead, the PEs work on
all the operands at once. The secret is that, when the operands are too big to
fit on the available PEs, you perform part of the computations in one iteration,
and the remaining part in another, as shown in the figure above for the
matrix multiplication example. It has to be remarked that the intermediate
results are kept among the PEs, which exchange them to finish the computation.
Ideally, there are no off-chip memory accesses, or they are strongly reduced
compared to an overlay architecture.
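
A software caricature of this idea (the tiling and the names are mine, and a
Python loop is of course nothing like a PE array) is a matrix-vector product
computed tile by tile, with the partial sums never leaving a local accumulator:

```python
import torch

def tiled_mat_vec_mul(m: torch.Tensor, v: torch.Tensor, tile: int = 4) -> torch.Tensor:
    rows, cols = m.shape
    acc = torch.zeros(rows)             # partial results stay "on chip"
    for start in range(0, cols, tile):  # one iteration per tile of operands
        end = min(start + tile, cols)
        acc += m[:, start:end] @ v[start:end]
    return acc

m, v = torch.randn(8, 10), torch.randn(10)
assert torch.allclose(tiled_mat_vec_mul(m, v), m @ v, atol=1e-5)
```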

{{<
figure src="memory-energy.png" caption="Stolen from my thesis." width=600px
>}}

The graph above explains why it is a good idea to keep data among the PEs: to
retrieve activations from off-chip memory, one has to spend **200x** the energy
needed to execute a MAC! Instead, if the MAC unit accesses the data in the PE
itself (the PE register file bar) or from another PE (the NoC bar), the energy
overhead is bearable.

### Axiom 4 - Efficiency in distribution

each layer enables optimal use of on-chip resources without compromising
inference accuracy (supplementary texts S9 and S10).

In short: IBM will provide a quantization aware training (QAT) toolchain with
the NorthPole system. QAT starts, usually, from a full-precision FP32 model and
converts all the weights and activations to integers, in order to reduce their
precision. This leads to information loss that worsens the accuracy of the
network: to recover this, the DNN is trained for a few more epochs, using
backprop to tune the network while taking into account the approximations
brought by quantization.
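
The core trick behind most QAT flows (sketched below with a straight-through
estimator; this is a generic textbook recipe, not IBM's actual toolchain) is to
_fake-quantize_ the weights in the forward pass while letting gradients flow to
the underlying FP32 values:

```python
import torch

class FakeQuantInt8(torch.autograd.Function):
    """Round to INT8 levels in the forward pass, pass gradients through unchanged."""

    @staticmethod
    def forward(ctx, x: torch.Tensor, scale: float) -> torch.Tensor:
        return torch.clamp(torch.round(x / scale), -128, 127) * scale

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        return grad_output, None  # straight-through estimator: ignore the rounding

# During QAT the loss "sees" the quantized weights, while backprop still updates
# the full-precision values underneath.
w = torch.randn(16, 16, requires_grad=True)
w_q = FakeQuantInt8.apply(w, w.abs().max().item() / 127)
```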

substantially higher instantaneous parallelism (through high utilization of many
highly parallel compute units specialized for neural inference) and
substantially lower transistor count owing to low precision (Fig. 4B).

NorthPole runs at 400 MHz while an A100 GPU can run up to 1.4 GHz. Of course,
GPUs have much more "redundant" hardware, in order to be programmable for a
wider range of tasks. Hence, is this comparison with general-purpose hardware
fair? Sure. However, shall we call in a
[fairer competitor](https://ieeexplore.ieee.org/abstract/document/10019275)? :)

| Accelerator | Power [W] | Throughput (FPS) | Efficiency [inferences / J] | Data format | Only DNNs? | Training? |

accelerator like NorthPole. This explains why the throughput is so low and the
efficiency so high.

The Nvidia accelerator is meant for inference only, just like NorthPole, and it
uses a very fancy quantization technique
[[Dai et al.](https://proceedings.mlsys.org/paper_files/paper/2021/file/48a6431f04545e11919887748ec5cb52-Paper.pdf)]
to exploit INT4 precision without compromising accuracy. Moreover, the
accelerator is designed to run large Transformers, but I have used their
ResNet50 data

In conclusion, the following statement

> Inspired by the brain, which has no off-chip memory, NorthPole is optimized
for on-chip networks [...]

will make me sleep better tonight, knowing that I do not have a DDR4 stick
on top of my head. Sigh.

NorthPole is an interesting experiment: it is an extremely large accelerator,

biological inspiration. In my opinion, NorthPole is "just" an excellent piece of
engineering work that takes into account a few key factors:

* reduced-precision operations are much more efficient than high-precision ones.
An FP32 multiplication costs _much_ more than an INT8 one.
* DNNs are extremely robust to quantization, and with INT8 precision there is
basically no accuracy degradation.
* memory accesses are much more costly than computations when going up the
memory hierarchy (_i.e._, external DRAMs and so on).

However, this brain-inspired approach seems to prove more useful than other ones
at the moment, such as spiking networks: the distributed memory hierarchy leads
to a great improvement in processing efficiency, without compromising network
performance.

I can see clusters of NorthPole being stacked in servers to improve inference

chose an IEEE journal instead of Science, where hardware is not really common.

## Acknowledgements

I would like to thank [Jascha Achterberg](https://www.jachterberg.com) for
reviewing this blog post and for the super-useful discussion about the
brain-inspired traits of NorthPole: he convinced me that the way in which the
authors claim biological inspiration actually proves useful (_e.g._, the
distributed memory hierarchy), unlike other approaches that severely compromise
performance (_e.g._, accuracy) with negligible efficiency improvements.

## Bibliography