
Commit 824c5a2

1 parent 25db01e commit 824c5a2

File tree

14 files changed: +63 / -63 lines
  • content/english
    • blog
      • digital-neuromorphic-hardware-read-list
      • northpole-ibm-neuromorphic-ai-hardware
      • spiking-neural-network-framework-benchmarking
      • spiking-neurons-digital-hardware-implementation
      • truenorth-deep-dive-ibm-neuromorphic-chip-design
    • neuromorphic-computing
    • workshops/whats-catching-your-eye-visual-attention-mechanism-giulia-dangelo


content/english/blog/digital-neuromorphic-hardware-read-list/index.md

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ The Loihi chip employs **128 neuromorphic cores**, each of which consisting of *
 
 In this paper, a digital neuromorphic processor is presented. The Verilog is also [open source](https://github.com/ChFrenkel/ODIN)!
 
-The neurons states and the synapses weights are stored in two foundry SRAMs on chip. In order to emulate a crossbar, **time-multiplexing** is adopted: the synapses weights and neurons states are updated in a sequential manner instead of in parallel. On the core, **256 neurons (4kB SRAM)** and **256x256 synapses (64kB SRAM)** are embedded. This allows to get a very high synapses and neuron densities: **741k synapses per squared millimiters** and **3k neurons per squared millimeters**, using a **28nm CMOS FDSOI** process.
+The neurons states and the synapses weights are stored in two foundry SRAMs on chip. In order to emulate a crossbar, **time-multiplexing** is adopted: the synapses weights and neurons states are updated in a sequential manner instead of in parallel. On the core, **256 neurons (4kB SRAM)** and **256x256 synapses (64kB SRAM)** are embedded. This allows to get a very high synapses and neuron densities: **741k synapses per squared millimeters** and **3k neurons per squared millimeters**, using a **28nm CMOS FDSOI** process.
 
 The neuron model is programmable through an SPI interface: the user can choose among a **LIF** model (**8 bits** for the state of each neuron) and **Izhikevic** one (**55 bits** for the state of each neuron). Online-learning capabilities are allowed with an hardware-efficient implementation of the **Spike-Driven Synaptic Plasticity (SDSP)** rule.
 

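As an aside on the **time-multiplexing** scheme described in the changed paragraph above: instead of a physical 256x256 crossbar, a single update unit walks through the neuron-state and synapse-weight memories one entry at a time. The snippet below is not ODIN's RTL, only a hypothetical Python sketch of the idea; the sizes match the 256-neuron, 256x256-synapse core described in the post.

```python
import numpy as np

NUM_NEURONS = 256

# Emulated on-chip memories: neuron states and synapse weights stored as
# plain arrays instead of a physical crossbar.
neuron_state = np.zeros(NUM_NEURONS, dtype=np.int16)
weights = np.random.randint(-8, 8, size=(NUM_NEURONS, NUM_NEURONS), dtype=np.int16)

def process_spike(pre_idx: int, threshold: int = 64) -> list[int]:
    """Time-multiplexed update: when neuron `pre_idx` spikes, visit every
    post-synaptic neuron sequentially instead of updating all of them in parallel."""
    out_spikes = []
    for post_idx in range(NUM_NEURONS):          # one neuron per time slot
        neuron_state[post_idx] += weights[pre_idx, post_idx]
        if neuron_state[post_idx] >= threshold:  # fire and reset
            neuron_state[post_idx] = 0
            out_spikes.append(post_idx)
    return out_spikes

print(len(process_spike(0)))
```
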
content/english/blog/northpole-ibm-neuromorphic-ai-hardware/index.md

Lines changed: 4 additions & 4 deletions
@@ -144,7 +144,7 @@ instance, the number 1.2345 would become 1 as an integer. This means that you
 would degrade the information encoded in the data, and the network would not
 perform as expected, showing a large loss in accuracy. For this reason,
 researchers have come up with _quantization_ algorithms: these allow to convert
-floating point values to integer ones while loosing as few information as
+floating point values to integer ones while losing as few information as
 possible in the process. Since some information is lost in any case, the DNN
 usually needs to be retrained a bit using its integer version in order to
 recover the loss in performance.
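
For context on the hunk above: a minimal, hypothetical sketch of what a quantization routine does (symmetric linear quantization to INT8) is shown below. It only illustrates the rounding loss the post refers to; it is not the toolchain of any specific chip.

```python
import torch

def quantize_int8(x: torch.Tensor) -> tuple[torch.Tensor, float]:
    """Map float values to INT8 with a single scale factor (symmetric scheme)."""
    scale = x.abs().max().item() / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: float) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)
q, s = quantize_int8(w)
print((w - dequantize_int8(q, s)).abs().max())  # the rounding error, i.e. the "lost" information
```
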
@@ -285,7 +285,7 @@ organic biochemical substrate is suitable for supporting many slow analog
 neurons, where each neuron is hardwired to a fixed set of synaptic weights.
 Directly following this architectural construct leads to an inefficient use of
 inorganic silicon, which is suitable for fewer and faster digital neurons.
-Reconfigurability resolves this key dilem- ma by storing weights and programs
+Reconfigurability resolves this key dilemma by storing weights and programs
 just once in the distributed memory and reconfiguring the weights during the
 execution of each layer using one NoC and reconfiguring the programs before the
 start of the layer using another NoC. Stated differently, these two NoCs serve
@@ -318,7 +318,7 @@ NoCs are highlighted: the partial sums NoC distributes the partial results among
 cores (refer to the layer-fuse architecture); the activation NoC carries inputs
 and layers outputs; the instruction NoC is used to tell the sequence of
 instructions to be carried out by the core; the model NoC is the one that
-trasnfers the layers weights to the computational cores.
+transfers the layers weights to the computational cores.
 
 In my opinion, the instruction core plays an important role. Having a
 specialized instruction set architecture (ISA) has a large impact on
@@ -582,7 +582,7 @@ I would like to thank [Jascha Achterberg](https://www.jachterberg.com) for
 reviewing this blog post and the super-useful discussion about the
 brain-inspired traits of NorthPole: he convinced me that the way in which the
 authors claim biology inspiration actually proves useful (_e.g._, distributed
-memory hierarchy), differently from other approaches that severly compromise
+memory hierarchy), differently from other approaches that severely compromise
 performance (_e.g._, accuracy), with negligible efficiency improvements.
 
 I would also like to thank [Siddharth Joshi](https://siddharth-joshi.com) for pointing out that Keller et al.

content/english/blog/northpole-ibm-neuromorphic-ai-hardware/index.md.bak

Lines changed: 45 additions & 45 deletions
@@ -44,7 +44,7 @@ that have to deal with any possible code being compiled for them. For more
 information on this, the reader is referred to [Computer
 architecture](https://books.google.it/books/about/Computer_Architecture.html?id=v3-1hVwHnHwC&redir_esc=y),
 by Hennessy, Patterson and Asanovic. If the code being run does not contain
-any of these conditional statements, there is no need for supporting them in
+any of these conditional statements, there is no need for supporting them in
 hardware, which gives a lot of space to optimizations.
 
 Moreover, NorthPole is an inference-only accelerator, _i.e._, you cannot train a
@@ -81,8 +81,8 @@ def dummy_sparse_dot_prod(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
 It seems easy, isn't it? Well, it is not. The `if` in `dummy_sparse_dot_prod` is
 a mess. Why so? Well, the problem is that when this code is running in
 hardware, the line `a_i, b_i = a[i], b[i]` is much more costly (_i.e._, it takes
-more _energy_ to execute it) than `a_i * b_i`
-[[Horowitz](https://ieeexplore.ieee.org/document/6757323)]! This is due to how
+more _energy_ to execute it) than `a_i * b_i`
+[[Horowitz](https://ieeexplore.ieee.org/document/6757323)]! This is due to how
 we design our digital circuits to perform these operations. Hence, what you
 would like to avoid is to _read_ the inputs, more than to multiply them! And if
 to check that these are not zero you need to read them, well, you have lost the
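
The body of `dummy_sparse_dot_prod` is not visible in this hunk, so here is a plausible reconstruction of the snippet being discussed (an assumption, kept as close as possible to the lines quoted above): a dot product that skips the multiply whenever either operand is zero. The reads and zero-checks still happen, which is exactly the cost the post argues you cannot avoid this way.

```python
import torch

def dummy_sparse_dot_prod(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    assert a.shape == b.shape
    res = torch.tensor(0.0)
    for i in range(len(a)):
        a_i, b_i = a[i], b[i]      # the costly part: reading the operands
        if a_i != 0 and b_i != 0:  # the `if` discussed above
            res += a_i * b_i       # the cheap part: the actual multiply-accumulate
    return res

print(dummy_sparse_dot_prod(torch.tensor([1.0, 0.0, 2.0]),
                            torch.tensor([3.0, 4.0, 0.0])))
```
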
@@ -97,16 +97,16 @@ processing units (GPUs) support 2:4 sparsity, which means that every 4 elements
 in a matrix, 2 are zeros (more or less, I am not being extremely precise on
 this).
 
-### Axiom 2 - Getting inspired by biological neurons
+### Axiom 2 - Getting inspired by biological neurons
 
 > Inspired by biological precision, NorthPole is optimized for 8, 4, and 2-bit
 low-precision. This is sufficient to achieve state-of-the-art inference accuracy
 on many neural networks while dispensing with the high-precision required for
 training.
 
 Neurons in biology communicate by means of voltage spikes. This can interpreted
-as binary signals: if I have a spike, I have a logic one; if there is no spike,
-I have a logic zero. This information encoding, if implemented in hardware,
+as binary signals: if I have a spike, I have a logic one; if there is no spike,
+I have a logic zero. This information encoding, if implemented in hardware,
 requires a single bit. This is the analogy the authors are referring to.
 
 Why should I care about the precision of the data in my neural
@@ -116,36 +116,36 @@ By _precision_ it is meant the number of bits to which your data is encoded. The
 larger this number is, the larger numbers you can describe with your bit word,
 but also smaller since some of those bits are used to encode the decimals.
 However, you cannot use a large number of bits for each datum: first, because
-it would require much more memory to host these data; second, it requires much
-more energy to process them!
+it would require much more memory to host these data; second, it requires much
+more energy to process them!
 
 | Operation | Floating point energy [pJ] | Integer energy [pJ] | Energy ratio FP/INT |
 |:--------------:|:--------------------------:|:-----------------------:|:----------------------------------------------:|
 | Addition | 0.4 (16 b), 0.9 (32 b) | 0.03 (8 b), 0.1 (32 b) | **~13.3x** (16 b / 8 b), **9x** (32 b / 32 b) |
 | Multiplication | 1.1 (16 b), 3.7 (32 b) | 0.2 (8 b), 3.1 (32 b) | **5.5x** (16 b / 8 b), **~1.2x** (32 b / 32 b) |
 
-In the table above [[Horowitz](https://ieeexplore.ieee.org/document/6757323)],
+In the table above [[Horowitz](https://ieeexplore.ieee.org/document/6757323)],
 the energy required to perform addition on `int`s and `float`s is provided. You
 could notice that it is much more convenient to work with `int`s! This is due to
-the fact that the physical hardware required to perform floating point
-arithmetic is _much_ more complex that the corresponding integer one. That is
+the fact that the physical hardware required to perform floating point
+arithmetic is _much_ more complex that the corresponding integer one. That is
 why we want to represent our DNNs weights and activations with integers!
 
-However, you cannot simply convert an integer to a floating point value. For
+However, you cannot simply convert an integer to a floating point value. For
 instance, the number 1.2345 would become 1 as an integer. This means that you
-would degrade the information encoded in the data, and the network would not
+would degrade the information encoded in the data, and the network would not
 perform as expected, showing a large loss in accuracy. For this reason,
 researchers have come up with _quantization_ algorithms: these allow to convert
-floating point values to integer ones while loosing as few information as
-possible in the process. Since some information is lost in any case, the DNN
-usually needs to be retrained a bit using its integer version in order to
+floating point values to integer ones while losing as few information as
+possible in the process. Since some information is lost in any case, the DNN
+usually needs to be retrained a bit using its integer version in order to
 recover the loss in performance.
 
 We have been running DNNs using INT8 since ~2017 without claiming
 biological inspiration. Recently, however, progress has been made and we can use
 INT4 quantization (only 4 bits to represent a number) with marginal loss in
 performance compared to the 32-bit
-floating point (FP32) baseline that you trained on your GPU
+floating point (FP32) baseline that you trained on your GPU
 [[Keller et
 al.](https://ieeexplore.ieee.org/abstract/document/10019275?casa_token=fmLtbZfys2cAAAAA:UQvvJ3LWrATwWYtBQZ7HSAZigZdRe-k06Z9rOcKVc4c1LrrqXCe49E5IFgKRyC952n0Fmp_9UQ)].
 
@@ -158,12 +158,12 @@ GPU Architecture](https://resources.nvidia.com/en-us-tensor-core)].
 > NorthPole has a distributed, modular core array (16-by-16), with each core
 capable of massive parallelism (8192 2-bit operations per cycle) (Fig. 2F).
 
-When claiming 8192 operations per clock cycle, using INT2 operands, it
+When claiming 8192 operations per clock cycle, using INT2 operands, it
 means that NorthPole has 2048 multiply-and-accumulate (MAC) units that work on
 INT8 precision operands. Why do we care about MACs?
 
-DNNs are basically matrix multipliers: a neural network receives a vector in
-input and multiplies it by a matrix of weights, producing in output another
+DNNs are basically matrix multipliers: a neural network receives a vector in
+input and multiplies it by a matrix of weights, producing in output another
 vector. Let us consider a _naive_ matrix-vector multiplication.
 
 ```python
@@ -172,8 +172,8 @@ def naive_mat_vec_mul(v: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
     assert m_cols == len(v)
     res = torch.zeros((m_rows,))
     for r in range(m_rows):
-        for c in range(m_cols):
-            res[r] += v[c] * m[r, c] # Hey, this is multiplication and
+        for c in range(m_cols):
+            res[r] += v[c] * m[r, c] # Hey, this is multiplication and
                                      # accumulation! Here's our MAC!
     return r
 ```
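
Incidentally, the snippet quoted above ends with `return r`, which returns the last row index rather than the accumulated result; this looks like a typo in the underlying post, left untouched by this commit. A corrected sketch of the same naive loop (the first line of the body is not visible in the hunk, so `m_rows, m_cols = m.shape` is an assumption) would be:

```python
import torch

def naive_mat_vec_mul(v: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    m_rows, m_cols = m.shape
    assert m_cols == len(v)
    res = torch.zeros((m_rows,))
    for r in range(m_rows):
        for c in range(m_cols):
            res[r] += v[c] * m[r, c]  # one multiply-and-accumulate (MAC) per step
    return res                        # return the result vector, not the loop index

print(naive_mat_vec_mul(torch.tensor([1.0, 2.0]), torch.eye(2)))
```
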
@@ -183,7 +183,7 @@ product. That's why we care about it.
 
 These MACs can be configured in single-instruction-multiple-data (SIMD)
 mode, _i.e._, you can "glue" together 4 INT2 operands to form an INT8 word and
-work on these in parallel.
+work on these in parallel.
 
 {{<
 figure
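
To make the SIMD remark above concrete, here is a small, hypothetical bit-twiddling sketch of how four INT2 operands (unsigned here, 0..3) can be packed into one 8-bit word and unpacked again; the actual NorthPole datapath is of course not plain Python.

```python
def pack_int2(values: list[int]) -> int:
    """Pack four 2-bit operands (values 0..3) into a single 8-bit word."""
    assert len(values) == 4 and all(0 <= v <= 3 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (2 * i)  # each operand occupies its own 2-bit lane
    return word

def unpack_int2(word: int) -> list[int]:
    return [(word >> (2 * i)) & 0b11 for i in range(4)]

w = pack_int2([1, 3, 0, 2])
print(bin(w), unpack_int2(w))  # 0b10001101 [1, 3, 0, 2]
```
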
@@ -248,21 +248,21 @@ In layer-fuse architectures (also called _dataflow_, which also means another
 thing in hardware accelerators just to mess with you), instead, the PEs work on
 all the operands at once. The secret is that, when the operands are too big to
 fit on the available PEs, you perform part of the computations in an iteration,
-and the remaning part in another, as it is shown in the figure above for the
-matrix multiplication example. It has to be remarked that the intermediate
+and the remaining part in another, as it is shown in the figure above for the
+matrix multiplication example. It has to be remarked that the intermediate
 results are kept among the PEs, that exchange them to finish the computation.
-Ideally, there are no off-chip memory accesses, or it is strongly reduced when
+Ideally, there are no off-chip memory accesses, or it is strongly reduced when
 compared to an overlay architecture
 
-{{<
+{{<
 figure src="memory-energy.png" caption="Stolen from my thesis." width=600px
 >}}
 
-The graph above explains why it is a good idea to keep data among the PEs: to
+The graph above explains why it is a good idea to keep data among the PEs: to
 retrieve activations from off-chip memory, one has to employ **200x** the energy
 needed to execute a MAC! Instead, if the MAC unit accesses the data in the PE
 itself (the PE register file bar) or from another PE (the NoC bar), the energy
-drawback is bearable.
+drawback is bearable.
 
 ### Axiom 4 - Efficiency in distribution
 
@@ -361,8 +361,8 @@ each layer enables optimal use of on-chip resources without compromising
 inference accuracy (supplementary texts S9 and S10).
 
 In short: IBM will provide a quantization aware training (QAT) toolchain with
-the NorthPole system. QAT starts, usually, from a full precision FP32
-model and converts all the weights and activations to integers, in order to
+the NorthPole system. QAT starts, usually, from a full precision FP32
+model and converts all the weights and activations to integers, in order to
 reduce their precision. This leads to information loss that worsens the accuracy
 of the network: to recover this, the DNN is trained for few more epochs to use
 backprop to tune the network taking into account the approximations brought by
@@ -491,10 +491,10 @@ substantially higher instantaneous parallelism (through high utilization of many
 highly parallel compute units specialized for neural inference) and
 substantially lower transistor count owing to low precision (Fig. 4B).
 
-NorthPole runs at 400 MHz while an A100 GPU can run up to 1.4 GHz. Of
-course, GPUs have much more "redundant" hardware to be programmable for more
+NorthPole runs at 400 MHz while an A100 GPU can run up to 1.4 GHz. Of
+course, GPUs have much more "redundant" hardware to be programmable for more
 tasks. Hence, is it fair this comparison with general purpose hardware? Sure.
-However, shall we call in a
+However, shall we call in a
 [fairer competitor](https://ieeexplore.ieee.org/abstract/document/10019275)? :)
 
 | Accelerator | Power [W] | Throughput (FPS) | Efficiency [inferences / J] | Data format | Only DNNs? | Training? |
@@ -513,7 +513,7 @@ accelerator like NorthPole. This explains why the throughput is so low and the
 efficiency so high.
 
 The Nvidia accelerator is meant for inference only, just like NorthPole, and it
-uses a very fancy quantization technique
+uses a very fancy quantization technique
 [[Dai et al.](https://proceedings.mlsys.org/paper_files/paper/2021/file/48a6431f04545e11919887748ec5cb52-Paper.pdf)]
 to use INT4 precision without compromising accuracy. Moreover, the accelerator
 is designed to run large Transformers on it, but I have used their ResNet50 data
@@ -530,7 +530,7 @@ In conclusion, the following statement
 > Inspired by the brain, which has no off-chip memory, NorthPole is optimized
 for on-chip networks [...]
 
-will make me sleep better tonight, knowing that I do not have a DDR4 stick
+will make me sleep better tonight, knowing that I do not have a DDR4 stick
 on top of my head. Sigh.
 
 NorthPole is an interesting experiment: it is an extremely large accelerator,
@@ -541,14 +541,14 @@ biological inspiration. In my opinion, NorthPole is "just" an excellent
 engineering work, that takes into account key factors:
 * reduced precision operations are much more efficient that high-precision ones.
 An FP32 multiplication costs _much_ more than an INT8 one.
-* DNNs are extremely robust to quantization, and with INT8 precision there is
+* DNNs are extremely robust to quantization, and with INT8 precision there is
 basically no accuracy degradation.
-* memory accesses are much more costly than computations when going up the
+* memory accesses are much more costly than computations when going up the
 memory hierarchy (_i.e._, external DRAMs and so on).
 
 However, this brain-inspired approach seems to prove more useful than other ones
-at the moment, such as spiking network: the distributed memory hierarchy leads to
-great improvement in processing efficiency, without compromising network
+at the moment, such as spiking network: the distributed memory hierarchy leads to
+great improvement in processing efficiency, without compromising network
 performance.
 
 I can see clusters of NorthPole being stacked in servers to improve inference
@@ -559,11 +559,11 @@ chose an IEEE journal instead of Science, where hardware is not really common.
 
 ## Acknowledgements
 
-I would like to thank [Jascha Achterberg](https://www.jachterberg.com) for
-reviewing this blog post and the super-useful discussion about the
-brain-inspired traits of NorthPole: he convinced me that the way in which the
-authors claim biology inspiration actually proves useful (_e.g._, distributed
-memory hierarchy), differently from other approaches that severly compromise
+I would like to thank [Jascha Achterberg](https://www.jachterberg.com) for
+reviewing this blog post and the super-useful discussion about the
+brain-inspired traits of NorthPole: he convinced me that the way in which the
+authors claim biology inspiration actually proves useful (_e.g._, distributed
+memory hierarchy), differently from other approaches that severely compromise
 performance (_e.g._, accuracy), with negligible efficiency improvements.
 
 ## Bibliography
