README: Debugging techniques, preventing Flash overflow; fix --mlator; default optimization -O2; checkpoint reader bug fixes (#131)

Robert Muchsel · web-flow · commit 5bfc060efd62 · 2021-05-10T11:12:15.000-05:00
* Improve --synthesize-input, add --synthesize-words
* Fix --mlator code generation; ensure verify_output does not use mlator for 32-bit output
* Change default optimization level to -O2
* README: Debugging techniques; handling memory overflows
* Adjust stream_start when using pooling in the first layer for MAX78000
diff --git a/.github/workflows/linter.yml b/.github/workflows/linter.yml
@@ -28,4 +28,5 @@ jobs:
           VALIDATE_MARKDOWN: false
           VALIDATE_PYTHON_BLACK: false
           VALIDATE_JSCPD: false
+          VALIDATE_CPP: false
           FILTER_REGEX_EXCLUDE: attic/.*
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 # MAX78000 Model Training and Synthesis
 
-_May 4, 2021_
+_May 7, 2021_
 
 The Maxim Integrated AI project is comprised of four repositories:
 
@@ -818,6 +818,7 @@ The MAX78000 hardware does not support arbitrary network parameters. Specificall
   * When using data greater than 90×91, `streaming` mode must be used.
   * When using `streaming` mode, the product of any layer’s input width, input height, and input channels divided by 64 rounded up must not exceed 2^21: $width * height * ⌈\frac{channels}{64}⌉ < 2^{21}$. _width_ and _height_ must not exceed 1023.
   * Streaming is limited to 8 layers or less, and is limited to four FIFOs (up to 4 input channels in CHW and up to 16 channels in HWC format), see [FIFOs](#FIFOs).
+  * For streaming layers, bias values may not be added correctly in all cases.
   
 * The weight memory supports up to 768 * 64 3×3 Q7 kernels (see [Number Format](#Number-Format)).
   When using 1-, 2- or 4 bit weights, the capacity increases accordingly.
@@ -1325,6 +1326,10 @@ The following table describes the most important command line arguments for `ai8
 | `--ready-sel`            | Specify memory waitstates                                    |                                 |
 | `--ready-sel-fifo`       | Specify FIFO waitstates                                      |                                 |
 | `--ready-sel-aon`        | Specify AON waitstates                                       |                                 |
+| Various                  |                                                              |                                 |
+| `--synthesize-input`     | Instead of using large sample input data, use only the first `--synthesize-words` words of the sample input, and add N to each subsequent set of `--synthesize-words` 32-bit words | `--synthesize-input 0x112233` |
+| `--synthesize-words`     | When using `--synthesize-input`, specifies how many words to use from the input. The default is 8. This number must be a divisor of the total number of pixels per channel. | `--synthesize-words 64`         |
+| `--max-checklines`       | Instead of checking all of the expected output data, verify only the first N words | `--max-checklines 1024`         |
 
 ### YAML Network Description
 
@@ -2042,7 +2047,57 @@ The generator also adds all files from the `assets/eclipse`, `assets/device-all`
 * For MAX78000/MAX78002, the software Softmax is implemented in `softmax.c`.
 * A template for the `cnn.h` header file in `templatecnn.h`. The template is customized during code generation using model statistics and timer, but uses common function signatures for all projects.
 
+#### Determining the Compiled Flash Image Size
 
+The generated `.elf` file (either `max78000.elf` or `max78000-combined.elf`) contains debug and other meta information. To determine the true Flash image size, either examine the `.map` file, or convert the `.elf` to a binary image and examine the resulting image.
+
+```shell
+% arm-none-eabi-objcopy -I elf32-littlearm build/max78000.elf -O binary temp.bin                     
+% ls -la temp.bin
+-rwxr-xr-x  1 user  staff  321968 Jan  1 11:11 temp.bin
+```
+
+#### Handling Linker Flash Section Overflows
+
+When linking the generated C code, the code space might overflow:
+
+```shell
+$ make
+  CC    main.c
+  CC    cnn.c
+  ...
+  LD    build/max78000.elf 
+arm-none-eabi/bin/ld: build/max78000.elf section `.text' will not fit in region `FLASH'
+arm-none-eabi/bin/ld: region `FLASH' overflowed by 600176 bytes
+collect2: error: ld returned 1 exit status
+```
+
+The most likely reason is that the sample input (from `sampledata.h`) or the expected output is too large. Note that this only affects the generated code with the built-in known-answer test (KAT). The KAT will not be part of the user application, since normal input and output data are not predefined in Flash memory.
+
+To deal with this issue, there are several options:
+
+* The sample input data can be stored in external memory. This requires modifications to the generated code. Please see the SDK examples to learn how to access external memory.
+* The sample input data can be programmatically generated. Typically, this requires manual modification of the generated code, and a corresponding modification of the sample input file.
+  The tool also contains a built-in input generator (supported *only* when using `--fifo`, and only for HWC inputs); the command line option `--synthesize-input` uses only the first few words of the sample input data, and then adds the specified value N (for example, 0x112233 if three input channels are used) to each subsequent set of M 32-bit words. M can be specified using `--synthesize-words` and defaults to 8. Note that M must be a divisor of the number of pixels per channel.
+* The output check can be truncated. The command line option `--max-checklines` checks only the first N words of output data (for example, 1024).
+* For 8-bit output values, `--mlator` typically generates more compact code.
+* Change the compiler optimization level in `Makefile`. To change the default optimization levels, modify `MXC_OPTIMIZE_CFLAGS` in `assets/embedded-ai85/templateMakefile` for Arm code and `assets/embedded-riscv-ai85/templateMakefile.RISCV` for RISC-V code. Both `-O1` and `-Os` may result in smaller code compared to `-O2`.
+* If the last layer has large-dimension, large-channel output, the `cnn_unload()` code in `cnn.c` may cause memory segment overflows not only in Flash, but also in the target buffer in SRAM (`ml_data32[]` or `ml_data[]` in `main.c`). In this case, manual code edits are required to perform multiple partial unloads in sequence.
+
+#### Debugging Techniques
+
+There can be many reasons why the known-answer test (KAT) fails for a given network. The following techniques may help in narrowing down where in the network or the YAML description of the network the error occurs:
+
+* The default compiler optimization level is `-O2`, and incorrect code may be generated under rare circumstances. Lower the optimization level in the generated `Makefile` to `-O1`, clean (`make distclean && make clean`), and rebuild the project (`make`). If this solves the problem, one possible reason is that the code is missing the `volatile` keyword for certain variables.
+  To permanently adjust the default compiler optimization level, modify `MXC_OPTIMIZE_CFLAGS` in `assets/embedded-ai85/templateMakefile` for Arm code and `assets/embedded-riscv-ai85/templateMakefile.RISCV` for RISC-V code.
+
+* `--stop-after N`, where `N` is a layer number, may help find the problematic layer by terminating the network early without having to retrain and without having to change the weight input file. Note that this may also require `--max-checklines` as [described above](#Handling-Linker-Flash-Section-Overflows) since intermediate outputs tend to be large.
+
+* `--no-bias LIST` where `LIST` is a comma-separated list of layers (e.g., `0,1,2,3`) can rule out problems due to the bias. This option zeros out the bias for the given layers without having to remove bias values from the weight input file. 
+
+* `--ignore-streaming` ignores all `streaming` statements in the YAML file. Note that this typically only works when the sample input is replaced with a different, lower-dimension sample input (for example, use 3×32×32 instead of 3×128×128).
+
+  
 
 #### Energy Measurement
 
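Note on `--synthesize-input`: the README text above describes the option only in prose. The following is a minimal sketch of one reading of that description, not the code in `izer/sampledata.py` (whose diff is not shown here). The helper name `synthesize_words`, the flat list of 32-bit words, and the wrap-around at 32 bits are all assumptions made for illustration.

```python
def synthesize_words(first_words, total_words, increment):
    """Hypothetical helper: expand the first M words to `total_words` words.

    Set k (k = 0, 1, 2, ...) is the first set plus k * increment.
    Masking to 32 bits is an assumption, not taken from the source.
    """
    m = len(first_words)
    if total_words % m != 0:
        # Mirrors the documented rule that M must divide the input size
        raise ValueError("set size must divide the total word count")
    out = []
    for i in range(total_words):
        set_index = i // m  # which set of M words this word belongs to
        out.append((first_words[i % m] + set_index * increment) & 0xFFFFFFFF)
    return out


# Two words per set (M = 2), three sets, N = 0x112233
synthesized = synthesize_words([0x00010203, 0x00040506], 6, 0x112233)
print([f"0x{w:08x}" for w in synthesized])
```

With N = 0x112233, each byte lane of a packed three-channel word is advanced by a different amount, which appears to be why the README suggests that value for three input channels.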
diff --git a/README.pdf b/README.pdf
diff --git a/assets/embedded-ai85/templateMakefile b/assets/embedded-ai85/templateMakefile
@@ -100,7 +100,7 @@ PROJ_CFLAGS+=-Wall -Wcast-align
 #STARTUPFILE=start.S
 
 # Override the default optimization level using this variable
-MXC_OPTIMIZE_CFLAGS=-O1
+MXC_OPTIMIZE_CFLAGS=-O2
 
 ################################################################################
 # Include external library makefiles here
diff --git a/assets/embedded-ai87/templateMakefile b/assets/embedded-ai87/templateMakefile
@@ -100,7 +100,7 @@ PROJ_CFLAGS+=-Wall -Wcast-align
 #STARTUPFILE=start.S
 
 # Override the default optimization level using this variable
-MXC_OPTIMIZE_CFLAGS=-O1
+MXC_OPTIMIZE_CFLAGS=-O2
 
 ################################################################################
 # Include external library makefiles here
diff --git a/assets/embedded-riscv-ai85/templateMakefile.ARM b/assets/embedded-riscv-ai85/templateMakefile.ARM
@@ -94,7 +94,7 @@ PROJ_CFLAGS+=-Wall -Wcast-align
 #STARTUPFILE=startup_max78000.S
 
 # Override the default optimization level using this variable
-MXC_OPTIMIZE_CFLAGS=-O1
+MXC_OPTIMIZE_CFLAGS=-O2
 
 # Point this variable to a linker file to override the default file
 LINKERFILE=$(CMSIS_ROOT)/Device/Maxim/$(TARGET_UC)/Source/GCC/$(TARGET_LC)_arm.ld
diff --git a/assets/embedded-riscv-ai85/templateMakefile.RISCV b/assets/embedded-riscv-ai85/templateMakefile.RISCV
@@ -107,7 +107,7 @@ PROJ_CFLAGS+=-Wall -Wcast-align
 #STARTUPFILE=startup_riscv_max78000.S
 
 # Override the default optimization level using this variable
-MXC_OPTIMIZE_CFLAGS=-O0
+MXC_OPTIMIZE_CFLAGS=-O2
 
 # Point this variable to a linker file to override the default file
 LINKERFILE=$(CMSIS_ROOT)/Device/Maxim/$(TARGET_UC)/Source/GCC/$(TARGET_LC)_riscv.ld
diff --git a/assets/embedded-riscv-ai87/templateMakefile.ARM b/assets/embedded-riscv-ai87/templateMakefile.ARM
@@ -50,9 +50,7 @@ COMPILER=GCC
 
 # Specify the board used
 ifeq "$(BOARD)" ""
-#BOARD=BCB
-BOARD=EvKit_V1
-#BOARD=Emulator
+BOARD=##__BOARD__##
 endif
 
 # This is the path to the CMSIS root directory
@@ -96,7 +94,7 @@ PROJ_CFLAGS+=-Wall -Wcast-align
 #STARTUPFILE=startup_max78002.S
 
 # Override the default optimization level using this variable
-MXC_OPTIMIZE_CFLAGS=-O1
+MXC_OPTIMIZE_CFLAGS=-O2
 
 # Point this variable to a linker file to override the default file
 LINKERFILE=$(CMSIS_ROOT)/Device/Maxim/$(TARGET_UC)/Source/GCC/$(TARGET_LC)_arm.ld
diff --git a/assets/embedded-riscv-ai87/templateMakefile.RISCV b/assets/embedded-riscv-ai87/templateMakefile.RISCV
@@ -50,9 +50,7 @@ COMPILER=GCC
 
 # Specify the board used
 ifeq "$(BOARD)" ""
-#BOARD=BCB
-BOARD=EvKit_V1
-#BOARD=Emulator
+BOARD=##__BOARD__##
 endif
 
 RISCV_CORE=RV32
diff --git a/izer/apbaccess.py b/izer/apbaccess.py
@@ -728,6 +728,7 @@ def verify_unload(
             max_count=max_count,
             write_gap=write_gap,
             final_layer=final_layer,
+            embedded=self.embedded_code,
         )
 
     def output_define(  # pylint: disable=no-self-use
@@ -1245,9 +1246,19 @@ def unload(
         Write the unload function. The layer to unload has the shape `input_shape`,
         and the optional `output_offset` argument can shift the output.
         """
-        unload.unload(self.apifile or self.memfile, self.apb_base, processor_map, input_shape,
-                      output_offset, out_expand, out_expand_thresh, output_width,
-                      mlator=mlator, blocklevel=self.blocklevel)
+        unload.unload(
+            self.apifile or self.memfile,
+            self.apb_base,
+            processor_map,
+            input_shape,
+            output_offset,
+            out_expand,
+            out_expand_thresh,
+            output_width,
+            mlator=mlator,
+            blocklevel=self.blocklevel,
+            embedded=self.embedded_code,
+        )
 
     def output_define(
             self,
diff --git a/izer/checkpoint.py b/izer/checkpoint.py
@@ -12,7 +12,7 @@
 import numpy as np
 import torch
 
-from . import op as opn
+from . import op
 from . import tornadocnn as tc
 from .eprint import eprint, wprint
 from .utils import fls
@@ -81,14 +81,14 @@ def load(
 
     for _, k in enumerate(checkpoint_state.keys()):
         # Skip over non-weight layers
-        while seq < len(operator) and (operator[seq] == opn.NONE or bypass[seq]):
+        while seq < len(operator) and (operator[seq] == op.NONE or bypass[seq]):
             seq += 1
 
         param_levels = k.rsplit(sep='.', maxsplit=2)
         if len(param_levels) == 3:
-            layer, op, parameter = param_levels[0], param_levels[1], param_levels[2]
+            layer, this_op, parameter = param_levels[0], param_levels[1], param_levels[2]
         elif len(param_levels) == 2:
-            layer, op, parameter = param_levels[0], None, param_levels[1]
+            layer, this_op, parameter = param_levels[0], None, param_levels[1]
         else:
             continue
 
@@ -132,13 +132,13 @@ def load(
             weight_min.append(w_min)
             weight_max.append(w_max)
 
-            if operator[seq] == opn.CONVTRANSPOSE2D:
+            if operator[seq] == op.CONVTRANSPOSE2D:
                 # For ConvTranspose2d, flip the weights as follows:
                 w = np.flip(w, axis=(2, 3)).swapaxes(0, 1)
 
-            mult = conv_groups[seq] if operator[seq] != opn.CONVTRANSPOSE2D else 1
+            mult = conv_groups[seq] if operator[seq] != op.CONVTRANSPOSE2D else 1
             input_channels.append(w.shape[1] * mult)  # Input channels
-            mult = conv_groups[seq] if operator[seq] == opn.CONVTRANSPOSE2D else 1
+            mult = conv_groups[seq] if operator[seq] == op.CONVTRANSPOSE2D else 1
             output_channels.append(w.shape[0] * mult)  # Output channels
 
             if len(w.shape) == 2:  # MLP
@@ -176,12 +176,15 @@ def load(
             weight_keys.append(k)
 
             # Is there a bias for this layer?
-            bias_name = '.'.join([layer, op, 'bias'])
+            bias_name = '.'.join([layer, 'bias']) if this_op is None \
+                else '.'.join([layer, this_op, 'bias'])
             wb_name = '.'.join([layer, 'weight_bits'])
 
             if bias_name in checkpoint_state and seq not in no_bias:
+                wb = checkpoint_state[wb_name].numpy().astype(np.int64) \
+                     if wb_name in checkpoint_state else 8
                 w = checkpoint_state[bias_name].numpy(). \
-                    astype(np.int64) // 2**(checkpoint_state[wb_name].numpy().astype(np.int64) - 1)
+                    astype(np.int64) // 2**(wb - 1)
 
                 if np.all(w == 0):
                     wprint(f'All bias values for `{bias_name}` are zero.')
@@ -253,7 +256,8 @@ def load(
                       f'{bias_shape:10} '
                       f'{bias_quant[ll]:5} {bias_min[ll]:4} {bias_max[ll]:4} {bias_size[ll]:4} '
                       f'{bias_keys[ll]:25}')
-        print(f'TOTAL: {layers} layers, {param_count:,} parameters, {param_size:,} bytes')
+        print(f'TOTAL: {layers} parameter layers, {param_count:,} parameters, '
+              f'{param_size:,} bytes')
 
     if error_exit:
         sys.exit(1)
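Note on the bias lookup change: with the old code, a two-level key such as `fc.weight` left the operator component as `None`, and `'.'.join()` raises `TypeError` when the list contains `None`. The sketch below isolates the new logic for illustration. It uses a plain dictionary of NumPy arrays as a stand-in for the PyTorch checkpoint state, and the function name `quantized_bias` is hypothetical.

```python
import numpy as np


def quantized_bias(state, layer, this_op=None):
    """Hypothetical stand-in for the bias handling in `load()`."""
    # Build 'layer.bias' or 'layer.op.bias' depending on the key depth
    parts = [layer, 'bias'] if this_op is None else [layer, this_op, 'bias']
    bias_name = '.'.join(parts)
    if bias_name not in state:
        return None
    # Fall back to 8-bit weights when no 'weight_bits' entry exists
    wb_name = '.'.join([layer, 'weight_bits'])
    wb = int(state[wb_name]) if wb_name in state else 8
    return np.asarray(state[bias_name]).astype(np.int64) // 2 ** (wb - 1)


state = {
    'fc.bias': np.array([256, -128, 0]),   # two-level key, no weight_bits
    'conv1.op.bias': np.array([64, 8]),    # three-level key
    'conv1.weight_bits': np.array(4),
}
fc = quantized_bias(state, 'fc')
conv = quantized_bias(state, 'conv1', 'op')
print(fc, conv)
```

The `fc` case exercises both fixes at once: the key is built without an operator component, and the divisor defaults to `2**7` because no `weight_bits` entry is present.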
diff --git a/izer/commandline.py b/izer/commandline.py
@@ -187,8 +187,6 @@ def get_parser():
                        help="use fixed 0xaa/0x55 alternating input (default: false)")
     group.add_argument('--reshape-inputs', action='store_true', default=False,
                        help="drop data channel dimensions to match weights (default: false)")
-    group.add_argument('--max-checklines', type=int, metavar='N', default=None, dest='max_count',
-                       help="output only N output check lines (default: all)")
     group.add_argument('--forever', action='store_true', default=False,
                        help="after initial run, repeat CNN forever (default: false)")
     group.add_argument('--link-layer', action='store_true', default=False,
@@ -260,17 +258,17 @@ def get_parser():
                        help="allow output to overwrite input (default: warn/stop)")
     group.add_argument('--override-start', type=lambda x: int(x, 0), metavar='N',
                        help="override auto-computed streaming start value (x8 hex)")
-    group.add_argument('--increase-start', type=int, default=2, metavar='N',
+    group.add_argument('--increase-start', type=lambda x: int(x, 0), default=2, metavar='N',
                        help="add integer to streaming start value (default: 2)")
     group.add_argument('--override-rollover', type=lambda x: int(x, 0), metavar='N',
                        help="override auto-computed streaming rollover value (x8 hex)")
     group.add_argument('--override-delta1', type=lambda x: int(x, 0), metavar='N',
                        help="override auto-computed streaming delta1 value (x8 hex)")
-    group.add_argument('--increase-delta1', type=int, default=0, metavar='N',
+    group.add_argument('--increase-delta1', type=lambda x: int(x, 0), default=0, metavar='N',
                        help="add integer to streaming delta1 value (default: 0)")
     group.add_argument('--override-delta2', type=lambda x: int(x, 0), metavar='N',
                        help="override auto-computed streaming delta2 value (x8 hex)")
-    group.add_argument('--increase-delta2', type=int, default=0, metavar='N',
+    group.add_argument('--increase-delta2', type=lambda x: int(x, 0), default=0, metavar='N',
                        help="add integer to streaming delta2 value (default: 0)")
     group.add_argument('--ignore-streaming', action='store_true', default=False,
                        help="ignore all 'streaming' layer directives (default: false)")
@@ -325,8 +323,14 @@ def get_parser():
     group = parser.add_argument_group('Various')
     group.add_argument('--input-split', type=int, default=1, metavar='N', choices=range(1, 1025),
                        help="split input into N portions (default: don't split)")
-    group.add_argument('--synthesize-input', type=int, metavar='N',
-                       help="synthesize input data from first 8 lines (default: false)")
+    group.add_argument('--synthesize-input', type=lambda x: int(x, 0), metavar='N',
+                       help="synthesize input data from first `--synthesize-words` words and add "
+                            "N to each subsequent set of `--synthesize-words` 32-bit words "
+                            "(default: false)")
+    group.add_argument('--synthesize-words', type=int, metavar='N', default=8,
+                       help="number of input words to use (default all or 8)")
+    group.add_argument('--max-checklines', type=int, metavar='N', default=None, dest='max_count',
+                       help="output only N output check lines (default: all)")
 
     args = parser.parse_args()
 
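Note on `type=lambda x: int(x, 0)`: passing base 0 to `int()` lets Python infer the base from the literal's prefix, so these options accept both decimal and `0x` hexadecimal values. The following self-contained sketch shows the effect on a stand-alone parser; the named function `auto_int` is an illustrative substitute for the lambda and is not part of `izer/commandline.py`.

```python
import argparse


def auto_int(text):
    """Parse an integer with the base inferred from its prefix (0x, 0o, 0b)."""
    return int(text, 0)


parser = argparse.ArgumentParser()
parser.add_argument('--increase-start', type=auto_int, default=2, metavar='N')
parser.add_argument('--synthesize-input', type=auto_int, metavar='N')

# Hexadecimal values are now accepted on the command line
args = parser.parse_args(['--increase-start', '0x10',
                          '--synthesize-input', '0x112233'])
print(args.increase_start, hex(args.synthesize_input))
```

One side effect of base 0 is that decimal values with leading zeros, such as `010`, are rejected rather than silently read as decimal or octal.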
diff --git a/izer/izer.py b/izer/izer.py
@@ -136,7 +136,11 @@ def main():
         sampledata_file = os.path.join('tests', f'sample_{cfg["dataset"].lower()}.npy')
     else:
         sampledata_file = args.sample_input
-    data = sampledata.get(sampledata_file)
+    data = sampledata.get(
+        sampledata_file,
+        synthesize_input=args.synthesize_input,
+        synthesize_words=args.synthesize_words,
+    )
     if np.max(data) > 127 or np.min(data) < -128:
         eprint(f'Input data {sampledata_file} contains values that are outside the limits of '
                f'signed 8-bit (data min={np.min(data)}, max={np.max(data)})!')
@@ -567,6 +571,7 @@ def main():
             increase_delta2=args.increase_delta2,
             slow_load=args.slow_load,
             synthesize_input=args.synthesize_input,
+            synthesize_words=args.synthesize_words,
             mlator_noverify=args.mlator_noverify,
             input_csv=args.input_csv,
             input_csv_period=args.input_csv_period,
diff --git a/izer/kbias.py b/izer/kbias.py
@@ -214,7 +214,7 @@ def bias_sort(e):
 
         if streaming[ll] and not tc.dev.SUPPORT_STREAM_BIAS:
             wprint(f'Layer {ll} uses streaming and a bias. '
-                   'THIS COMBINATION MIGHT NOT BE FUNCTIONING CORRECTLY!!!')
+                   'THIS COMBINATION MIGHT NOT FUNCTION CORRECTLY!!!')
 
         group = bias_group[ll]
 
diff --git a/izer/load.py b/izer/load.py
diff --git a/izer/max7800x.py b/izer/max7800x.py
diff --git a/izer/onnxcp.py b/izer/onnxcp.py
diff --git a/izer/sampledata.py b/izer/sampledata.py
diff --git a/izer/unload.py b/izer/unload.py