Commit 754b438

Author: Robert Muchsel

Upgrade to PyTorch 1.8.1; fix streaming buffer overwrite detection (#120)

* Upgrade to PyTorch 1.8.1
* README updates
* Make 'arch' / 'extras' optional in checkpoint file
* Fix streaming buffer overlap detection
* Handle checkpoint files with all-zero weights and print warning

1 parent dd90fd8 commit 754b438

File tree

9 files changed (+66, -37 lines)

.python-version

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-3.8.6
+3.8.9

README.md

Lines changed: 21 additions & 16 deletions
@@ -1,6 +1,6 @@
 # MAX78000 Model Training and Synthesis

-_March 31, 2021_
+_April 8, 2021_

 The Maxim Integrated AI project is comprised of four repositories:


@@ -90,9 +90,9 @@ The following software is optional, and can be replaced with other similar softw

 ### Project Installation

-*The software in this project uses Python 3.8.6 or a later 3.8.x version.*
+*The software in this project uses Python 3.8.9 or a later 3.8.x version.*

-It is not necessary to install Python 3.8.6 system-wide, or to rely on the system-provided Python. To manage Python versions, use `pyenv` (https://github.com/pyenv/pyenv).
+It is not necessary to install Python 3.8.9 system-wide, or to rely on the system-provided Python. To manage Python versions, use `pyenv` (https://github.com/pyenv/pyenv).

 On macOS (no CUDA support available):

@@ -107,7 +107,7 @@ $ sudo apt-get install -y make build-essential libssl-dev zlib1g-dev \
   libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
   libncurses5-dev libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev \
   libsndfile-dev portaudio19-dev
-$ curl -L https://github.com/pyenv/pyenv-installer/raw/master/bin/pyenv-installer | bash
+$ curl -L https://github.com/pyenv/pyenv-installer/raw/master/bin/pyenv-installer | bash # NOTE: Verify contents of the script before running it!!
 ```

 Then, add to either `~/.bash_profile`, `~/.bashrc`, or `~/.profile` (as shown by the terminal output of the previous step):
@@ -119,7 +119,7 @@ eval "$(pyenv virtualenv-init -)"

 If you use zsh as the shell (default on macOS), add these same commands to `~/.zprofile` or `~/.zshrc` in addition to adding them to the bash startup scripts.

-Next, close the Terminal, open a new Terminal and install Python 3.8.6.
+Next, close the Terminal, open a new Terminal and install Python 3.8.9.

 On macOS:

@@ -131,13 +131,13 @@ $ env \
   PKG_CONFIG_PATH="$(brew --prefix tcl-tk)/lib/pkgconfig" \
   CFLAGS="-I$(brew --prefix tcl-tk)/include" \
   PYTHON_CONFIGURE_OPTS="--with-tcltk-includes='-I$(brew --prefix tcl-tk)/include' --with-tcltk-libs='-L$(brew --prefix tcl-tk)/lib -ltcl8.6 -ltk8.6'" \
-  pyenv install 3.8.6
+  pyenv install 3.8.9
 ```

 On Linux:

 ```shell
-$ pyenv install 3.8.6
+$ pyenv install 3.8.9
 ```

 #### git Environment
@@ -229,7 +229,7 @@ Then continue with the following:

 ```shell
 $ git submodule update --init
-$ pyenv local 3.8.6
+$ pyenv local 3.8.9
 $ python3 -m venv .
 $ source bin/activate
 (ai8x-training) $ pip3 install -U pip wheel setuptools
@@ -240,7 +240,7 @@ The next step differs depending on whether the system uses Linux with CUDA 11.x,
 For CUDA 11.x on Linux:

 ```shell
-(ai8x-training) $ pip3 install -r requirements-cu111.txt
+(ai8x-training) $ pip3 install -r requirements-cu11.txt
 ```

 For all other systems, including CUDA 10.2 on Linux:
@@ -275,7 +275,7 @@ For minor updates, pull the latest code and install the updated wheels:
 (ai8x-training) $ git pull
 (ai8x-training) $ git submodule update --init
 (ai8x-training) $ pip3 install -U pip setuptools
-(ai8x-training) $ pip3 install -U -r requirements.txt # or requirements-cu111.txt with CUDA 11.x
+(ai8x-training) $ pip3 install -U -r requirements.txt # or requirements-cu11.txt with CUDA 11.x
 ```

 Updating Python frequently requires updating `pyenv` first. Should `pyenv install x.y.z`
@@ -307,7 +307,7 @@ Then continue:

 ```shell
 $ git submodule update --init
-$ pyenv local 3.8.6
+$ pyenv local 3.8.9
 $ python3 -m venv .
 $ source bin/activate
 (ai8x-synthesis) $ pip3 install -U pip setuptools
@@ -646,9 +646,10 @@ Because of the fact that a processor has its own dedicated weight memory, this w

 For each layer, a set of active processors must be specified. The number of input channels for the layer must be equal to or a multiple of the active processors, and the input data for that layer must be located in data memory instances accessible to the selected processors.

-It is possible to specify a relative offset into the data memory instance that applies to all processors. _Example:_ Assuming HWC data format, specifying the offset as 8192 bytes will cause processors 0-3 to read their input from the second half of data memory 0, processors 4-7 will read from the second half of data memory instance 1, etc.
+It is possible to specify a relative offset into the data memory instance that applies to all processors.
+_Example:_ Assuming HWC data format, specifying the offset as 16384 bytes (or 0x4000) will cause processors 0-3 to read their input from the second half of data memory 0, processors 4-7 will read from the second half of data memory instance 1, etc.

-For most simple networks with limited data sizes, it is easiest to ping-pong between the first and second halves of the data memories: specify the data offset as 0 for the first layer, 0x2000 for the second layer, 0 for the third layer, etc. This strategy avoids overlapping inputs and outputs when a given processor is used in two consecutive layers.
+For most simple networks with limited data sizes, it is easiest to ping-pong between the first and second halves of the data memories: specify the data offset as 0 for the first layer, 0x4000 for the second layer, 0 for the third layer, etc. This strategy avoids overlapping inputs and outputs when a given processor is used in two consecutive layers.

 Even though it is supported by the accelerator, the Network Generator will not be able to check for inadvertent overwriting of unprocessed input data by newly generated output data when overlapping data or streaming data. Use the `--overlap-data` command line switch to disable these checks, and to allow overlapped data.

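To make the ping-pong scheme in this hunk concrete, here is a minimal Python sketch. It is illustrative only (the function name is made up); it assumes the data memory instance size implied by the corrected offset, i.e. 0x8000 bytes per instance with the second half starting at 0x4000 (16384):

```python
# Illustrative sketch: alternate each layer's data offset between the first
# and second half of a data memory instance (assumed 0x8000 bytes total,
# so the second half starts at offset 0x4000 = 16384).
SECOND_HALF = 0x4000

def pingpong_offsets(num_layers):
    """Offset 0 for even layers, 0x4000 for odd layers."""
    return [0 if layer % 2 == 0 else SECOND_HALF for layer in range(num_layers)]

print([hex(offs) for offs in pingpong_offsets(4)])
# ['0x0', '0x4000', '0x0', '0x4000']
```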
@@ -823,11 +824,15 @@ The following table describes the most important command line arguments for `tra
 | `--8-bit-mode`, `-8` | Simulate quantized operation for hardware device (8-bit data) | |
 | `--exp-load-weights-from` | Load weights from file | |
 | *Export* | | |
-| `--summary onnx` | Export trained model to ONNX (default name: model.onnx) | |
+| `--summary onnx` | Export trained model to ONNX (default name: model.onnx) *see description below* | |
 | `--summary onnx_simplified` | Export trained model to simplified ONNX file (default name: model.onnx) | |
 | `--summary-filename` | Change the file name for the exported model | `--summary-filename mnist.onnx` |
 | `--save-sample` | Save data[index] from the test set to a NumPy pickle for use as sample data | `--save-sample 10` |

+#### ONNX Model Export
+
+The ONNX model export (via `--summary onnx` or `--summary onnx_simplified`) is primarily intended for visualization of the model. ONNX does not support all of the operators that `ai8x.py` uses, and these operators are therefore removed from the export (see function `onnx_export_prep()` in `ai8x.py`). The ONNX file does contain the trained weights and *may* therefore be usable for inference under certain circumstances. However, it is important to note that the ONNX file **will not** be usable for training (for example, the ONNX `floor` operator has a gradient of zero, which is incompatible with quantization-aware training as implemented in `ai8x.py`).
+
 ### Observing GPU Resources

 `nvidia-smi` can be used in a different terminal during training to examine the GPU resource usage of the training process. In the following example, the GPU is using 100% of its compute capabilities, but not all of the available memory. In this particular case, the batch size could be increased to use more memory.
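As a complement to the new ONNX export note above, an exported file can be inspected with the `onnx` package (already pinned in requirements.txt). A minimal sketch; `model.onnx` is the documented default name, everything else is illustrative:

```python
# Load and inspect an exported model (for visualization/debugging, not training).
import onnx

model = onnx.load('model.onnx')          # default export name, see table above
onnx.checker.check_model(model)          # structural sanity check
print(onnx.helper.printable_graph(model.graph))  # human-readable graph dump
```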
@@ -1910,7 +1915,7 @@ Perform minimum accelerator initialization so it can be configured or restarted.
 Configure the accelerator for the given network.

 `int cnn_load_weights(void);`
-Load the accelerator weights.
+Load the accelerator weights. Note that `cnn_init()` must be called before loading weights after reset or wake from sleep.

 `int cnn_verify_weights(void);`
 Verify the accelerator weights (used for debug only).
@@ -2172,4 +2177,4 @@ https://github.com/MaximIntegratedAI/MaximAI_Documentation/blob/master/CONTRIBUT

 ---

-o
+o

README.pdf

Binary file (5.35 KB) not shown.

distiller (submodule; diff not shown)

izer/checkpoint.py

Lines changed: 24 additions & 10 deletions
@@ -14,7 +14,7 @@

 from . import op as opn
 from . import tornadocnn as tc
-from .eprint import eprint
+from .eprint import eprint, wprint
 from .utils import fls


@@ -58,12 +58,16 @@ def load(
     checkpoint = torch.load(checkpoint_file, map_location='cpu')
     print(f'Reading {checkpoint_file} to configure network weights...')

-    if 'state_dict' not in checkpoint or 'arch' not in checkpoint:
-        raise RuntimeError("\nNo `state_dict` or `arch` in checkpoint file.")
-
-    if arch and checkpoint['arch'].lower() != arch.lower():
-        eprint(f"Network architecture of configuration file ({arch}) does not match "
-               f"network architecture of checkpoint file ({checkpoint['arch']}).")
+    if 'state_dict' not in checkpoint:
+        eprint("No `state_dict` in checkpoint file.")
+    if 'arch' not in checkpoint:
+        wprint("No `arch` in checkpoint file.")
+        checkpoint_arch = ''
+    else:
+        checkpoint_arch = checkpoint['arch']
+        if arch and checkpoint_arch.lower() != arch.lower():
+            eprint(f"Network architecture of configuration file ({arch}) does not match "
+                   f"network architecture of checkpoint file ({checkpoint_arch}).")

     checkpoint_state = checkpoint['state_dict']
     layers = 0
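The practical effect of this hunk (together with the `extras` default added in izer/quantize.py below) is that minimal checkpoints now load with warnings instead of a hard `RuntimeError`. A hypothetical example of such a file, for illustration only:

```python
# Hypothetical minimal checkpoint: contains 'state_dict' but neither 'arch'
# nor 'extras'. After this commit, loading it warns instead of failing.
import torch
from torch import nn

model = nn.Conv2d(3, 8, kernel_size=3)  # stand-in model, purely illustrative
torch.save(
    {'epoch': 0, 'state_dict': model.state_dict()},
    'minimal_checkpoint.pth.tar',
)
```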
@@ -90,6 +94,9 @@ def load(
                 w = checkpoint_state[k].numpy().astype(np.int64)
                 w_min, w_max, w_abs = w.min(), w.max(), np.abs(w)

+                if np.all(w == 0):
+                    wprint(f'All weights for `{k}` are zero.')
+
                 # Determine quantization or make sure that what was given fits
                 if quantization[seq] is not None:
                     if quantization[seq] == -1:
@@ -98,7 +105,8 @@
                         assert w_min >= -(2**(quantization[seq]-1))
                         assert w_max < 2**(quantization[seq]-1)
                 else:
-                    if tc.dev.SUPPORT_BINARY_WEIGHTS and w_abs.min() == w_abs.max() == 1:
+                    if tc.dev.SUPPORT_BINARY_WEIGHTS and w_abs.min() == w_abs.max() == 1 \
+                            and not np.any(w_abs == 0):
                         quantization[seq] = -1
                     else:
                         if w_max > 0:
@@ -109,7 +117,10 @@
                             w_min_m = int(w_min)
                         else:
                             w_min_m = int(abs(w_min)) - 1
-                        quantization[seq] = 1 << (fls(max(fls(w_max_m), fls(w_min_m)) + 1) + 1)
+                        if w_max_m > 0 or w_min_m > 0:
+                            quantization[seq] = 1 << (fls(max(fls(w_max_m), fls(w_min_m)) + 1) + 1)
+                        else:
+                            quantization[seq] = 1  # all weights zero
                 assert quantization[seq] <= 8
                 quant.append(quantization[seq])

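To see what the `fls()` expression above computes: it selects the smallest power-of-two bit width (2, 4, or 8) whose signed range covers the weight magnitudes, and the new branch falls back to 1 when every weight is zero. A standalone sketch, assuming `fls(x)` returns the index of the highest set bit (as it appears to in izer/utils.py):

```python
# Standalone sketch of the width selection above. Assumption: fls(x) returns
# the index of the highest set bit, e.g. fls(1) == 0, fls(127) == 6.
def fls(x):
    return x.bit_length() - 1

def weight_bits(w_max_m, w_min_m):
    """Smallest power-of-two weight width covering the given magnitudes."""
    if w_max_m > 0 or w_min_m > 0:
        return 1 << (fls(max(fls(w_max_m), fls(w_min_m)) + 1) + 1)
    return 1  # new fallback: all weights zero

print(weight_bits(127, 127))  # 8 -> full 8-bit weights (range -128..127)
print(weight_bits(1, 1))      # 2 -> 2-bit weights (range -2..1)
print(weight_bits(0, 0))      # 1 -> the all-zero fallback added by this commit
```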
@@ -166,6 +177,9 @@ def load(
                     w = checkpoint_state[bias_name].numpy(). \
                         astype(np.int64) // tc.dev.BIAS_DIV

+                    if np.all(w == 0):
+                        wprint(f'All bias values for `{bias_name}` are zero.')
+
                     w_min, w_max = w.min(), w.max()
                     assert w_min >= -(2**(bias_quantization[seq]-1))
                     assert w_max < 2**(bias_quantization[seq]-1)
@@ -210,7 +224,7 @@ def load(
         seq += 1

     if verbose:
-        print(f'Checkpoint for epoch {checkpoint["epoch"]}, model {checkpoint["arch"]} - '
+        print(f'Checkpoint for epoch {checkpoint["epoch"]}, model {checkpoint_arch} - '
               'weight and bias data:')
         print(' InCh OutCh Weights Quant Shift Min Max Size '
               'Key Bias Quant Min Max Size Key')

izer/max7800x.py

Lines changed: 1 addition & 1 deletion
@@ -2741,7 +2741,7 @@ def run_eltwise(
            memfile.close()

        data_buf.append(out_buf.reshape(out_size))
-        if streaming[ll]:
+        if next_sequence[ll] != -1 and streaming[next_sequence[ll]]:
            # When streaming, the output should not overwrite the input of prior layers since
            # these layers are still needed.
            in_map = [a if a is not None else b for a, b, in zip(in_map, out_map)]
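The one-line fix changes which layer's streaming flag guards the overwrite check: the input map of earlier layers only needs protection while the *next* layer in the processing sequence still streams. A toy illustration (names mirror the diff above; the data is made up):

```python
# Toy illustration of the corrected condition. streaming[ll] asks whether the
# *current* layer streams; the buffer actually needs protection only when the
# *next* layer in the processing sequence (-1 = none) is a streaming layer.
streaming = {0: True, 1: True, 2: False}  # per-layer streaming flags (example)
next_sequence = {0: 1, 1: 2, 2: -1}       # successor of each layer, -1 = last

for ll in (0, 1, 2):
    protect = next_sequence[ll] != -1 and streaming[next_sequence[ll]]
    print(f'layer {ll}: preserve input map -> {protect}')
# layer 0 -> True, layer 1 -> False, layer 2 -> False
```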

izer/quantize.py

Lines changed: 5 additions & 2 deletions
@@ -17,7 +17,7 @@
 from . import tornadocnn as tc
 from . import yamlcfg
 from .devices import device
-from .eprint import wprint
+from .eprint import eprint, wprint

 CONV_SCALE_BITS = 8
 CONV_DEFAULT_WEIGHT_BITS = 8
@@ -50,7 +50,7 @@ def convert_checkpoint(input_file, output_file, arguments):
     print(get_contents_table(checkpoint))

     if 'state_dict' not in checkpoint:
-        raise RuntimeError("\nNo state_dict in checkpoint file.")
+        eprint("No `state_dict` in checkpoint file.")

     checkpoint_state = checkpoint['state_dict']
     compression_sched = checkpoint['compression_sched'] \
@@ -96,6 +96,9 @@ def get_max_bit_shift(t, return_bit_shift=False):
     # If not using quantization-aware training (QAT),
     # scale to our fixed point representation using any of four methods
     # The 'magic constant' seems to work best for SCALE
+    if 'extras' not in checkpoint:
+        wprint("No `extras` in checkpoint file.")
+        checkpoint['extras'] = {}
     if arguments.clip_mode is not None:
         if arguments.clip_mode == 'STDDEV':
             sat_fn = partial(mean_n_stds_max_abs, n_stds=arguments.stddev)
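Defaulting `extras` to an empty dict means later reads can use plain lookups without per-key existence checks. A sketch of the pattern; `clamp_bits` is a hypothetical key, purely for illustration:

```python
# Hypothetical checkpoint that never went through QAT: no 'extras' key.
checkpoint = {'state_dict': {}}

if 'extras' not in checkpoint:
    print('WARNING: No `extras` in checkpoint file.')  # stands in for wprint()
    checkpoint['extras'] = {}

# Downstream code can now index 'extras' without a KeyError:
clamp_bits = checkpoint['extras'].get('clamp_bits')  # hypothetical key -> None
print(clamp_bits)
```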

izer/rtlsim.py

Lines changed: 10 additions & 3 deletions
@@ -91,8 +91,13 @@ def create_runtest_sv(
         )
         if tc.dev.MODERN_SIM:
             runfile.write(
-                '\n`define CNN_ENA `DIGITAL_TOP.xuut1.x16proc[0].xproc.xuut.cnnena\n'
-                '`define CNN_CLK `DIGITAL_TOP.xuut1.x16proc[0].xproc.xuut.clk\n\n'
+                '\n`ifdef gate_sims\n'
+                ' `define CNN_ENA `DIGITAL_TOP.xuut1.x16proc_0__xproc_xuut.xcnn_fsm2.cnnena'
+                '\n `define CNN_CLK `DIGITAL_TOP.xuut1.x16proc_0__xproc_xuut.clk\n'
+                '`else\n'
+                ' `define CNN_ENA `DIGITAL_TOP.xuut1.x16proc[0].xproc.xuut.cnnena\n'
+                ' `define CNN_CLK `DIGITAL_TOP.xuut1.x16proc[0].xproc.xuut.clk\n'
+                '`endif\n\n'
             )
         else:
             runfile.write(
@@ -110,6 +115,7 @@ def create_runtest_sv(
         if result_output:
             runfile.write('int chk_stat;\n')
         runfile.write(
+            'logic chk_clk;\n'
             '\ninitial begin\n'
         )
         if result_output:
@@ -142,7 +148,8 @@ def create_runtest_sv(
             ' $display("CNN enabled");\n'
             ' end\n'
             'end\n\n'
-            'always @(negedge `CNN_ENA) begin\n'
+            'assign #10 chk_clk = `CNN_ENA;\n\n'
+            'always @(negedge chk_clk) begin\n'
             ' if (start_ena) begin\n'
             ' end_time = $realtime;\n'
             ' clkena1 = 1;\n'

requirements.txt

Lines changed: 3 additions & 3 deletions
@@ -1,12 +1,12 @@
-numpy>=1.19,<1.20
+numpy>=1.20.2,<1.21
 PyYAML>=5.1.1
 tabulate==0.8.3
 future>=0.17.1
 six>=1.12.0
 scipy>=1.3.0
-torch==1.7.1
+torch==1.8.1
 pytest~=4.6.4
 onnx>=1.7.0
-tensorboard==2.4.0
+tensorboard==2.4.1
 colorama>=0.4.4
 -e file:distiller
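After reinstalling from the updated requirements, a quick sanity check can confirm the new pins took effect. An illustrative snippet, run inside the activated virtual environment:

```python
# Verify that the upgraded pins from requirements.txt are actually in place.
import numpy
import torch

assert torch.__version__.startswith('1.8.1'), torch.__version__
assert numpy.__version__.startswith('1.20'), numpy.__version__
print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())
```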
