Hi,
I tried to train the network with the default settings, changing only the batch size and the GPUs. I get the following error, which occurs right after the first progress line (Batch [0-20]) is logged.
[09:20:05] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[09:20:05] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
[09:20:05] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[09:20:05] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
INFO:root:start with arguments Namespace(batch_size=2, benchmark=0, data_nthreads=128, disp_batches=20, dtype='float32', gpus='0', image_shape='3,512,512', kv_store='device', load_epoch=None, lr=0.1, lr_factor=0.1, lr_step_epochs='100,200', max_random_aspect_ratio=0, max_random_h=0, max_random_l=0, max_random_rotate_angle=0, max_random_s=0, max_random_scale=1, max_random_shear_ratio=0, min_random_scale=1, model_prefix='./model/tasn', mom=0, monitor=0, network=None, num_classes=200, num_epochs=300, num_examples=5994, num_layers=None, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=5, wd=0)
[09:20:05] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: ./data/cub/train.rec, use 4 threads for decoding..
[09:20:08] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: ./data/cub/val.rec, use 4 threads for decoding..
learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
INFO:root:Epoch[0] Batch [0-20] Speed: 33.71 samples/sec att_net_accuracy=0.000000 part_net_accuracy=0.023810 master_net_accuracy=0.023810 part_net_aux_accuracy=0.023810 master_net_aux_accuracy=0.023810 distillation_loss=5.296982
Traceback (most recent call last):
  File "train.py", line 57, in <module>
    eval_metric = evaluate.Multi_Accuracy(num=6))
  File "/home/ysu/project/attention_net/tasn/tasn-mxnet/example/tasn/common/fit.py", line 195, in fit
    monitor = monitor
  File "/home/ysu/mxnet_attention/lib/python3.5/site-packages/mxnet/module/base_module.py", line 533, in fit
    self.update_metric(eval_metric, data_batch.label)
  File "/home/ysu/mxnet_attention/lib/python3.5/site-packages/mxnet/module/module.py", line 775, in update_metric
    self._exec_group.update_metric(eval_metric, labels, pre_sliced)
  File "/home/ysu/mxnet_attention/lib/python3.5/site-packages/mxnet/module/executor_group.py", line 640, in update_metric
    eval_metric.update_dict(labels_, preds)
  File "/home/ysu/mxnet_attention/lib/python3.5/site-packages/mxnet/metric.py", line 133, in update_dict
    self.update(label, pred)
  File "/home/ysu/project/attention_net/tasn/tasn-mxnet/example/tasn/common/evaluate.py", line 32, in update
    self.sum_metric[i] += (pred_label.flat == label.flat).sum()
TypeError: 'float' object is not subscriptable
The cause is that in MXNet 1.6.0 the `EvalMetric` class tracks not only `num_inst` and `sum_metric` but also `global_num_inst` and `global_sum_metric`.

The `batch_end_callback` (here, `Speedometer`) now calls `reset_local()` to reset `num_inst` and `sum_metric`, rather than `reset()` as in older versions of MXNet.

However, your `Multi_Accuracy` class does not implement `reset_local()`, so the inherited `EvalMetric.reset_local()` runs and resets `sum_metric` to the scalar `0.0`. The next `update()` call then fails on `self.sum_metric[i]`.
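A more permanent fix would probably be to override `reset_local()` in `Multi_Accuracy`. The sketch below is not the repository's code and does not import MXNet: `EvalMetricStandIn` only mimics the reset behaviour described above, and `MultiAccuracySketch` with its `update(correct_counts, batch_size)` signature is a hypothetical simplification of the real metric. It only illustrates why the override keeps `sum_metric` indexable.

```python
class EvalMetricStandIn:
    """Simplified stand-in for the reset logic of mxnet.metric.EvalMetric.

    This is NOT the real class; it only reproduces the scalar resets.
    """

    def __init__(self, num):
        self.num = num
        self.reset()

    def reset(self):
        # Full reset: local and global counters.
        self.num_inst = 0
        self.sum_metric = 0.0
        self.global_num_inst = 0
        self.global_sum_metric = 0.0

    def reset_local(self):
        # Local reset: always falls back to scalars.
        self.num_inst = 0
        self.sum_metric = 0.0


class MultiAccuracySketch(EvalMetricStandIn):
    """Hypothetical multi-output metric with one counter per output."""

    def reset(self):
        self.num_inst = [0] * self.num
        self.sum_metric = [0.0] * self.num

    def reset_local(self):
        # Override so the per-output lists survive the callback's reset.
        # The global_* counters are left out of this sketch.
        self.num_inst = [0] * self.num
        self.sum_metric = [0.0] * self.num

    def update(self, correct_counts, batch_size):
        # correct_counts[i] is the number of correct predictions of output i.
        for i, correct in enumerate(correct_counts):
            self.sum_metric[i] += correct
            self.num_inst[i] += batch_size


metric = MultiAccuracySketch(num=6)
metric.update([1] * 6, batch_size=2)
metric.reset_local()                  # what Speedometer triggers by default
metric.update([1] * 6, batch_size=2)  # still works: sum_metric is a list
```

Without the `reset_local()` override, the second `update()` call would index into a float and raise the same `TypeError` as in the traceback.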
A quick workaround is to set the `auto_reset` argument of `Speedometer` to `False`.
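As a rough sketch of that workaround, assuming `common/fit.py` builds its batch-end callbacks the way the stock MXNet example scripts do: `make_batch_end_callbacks` is a hypothetical helper name, and the exact place where the callback is created in this repository may differ. MXNet is imported inside the function so the snippet can be loaded on a machine without it.

```python
def make_batch_end_callbacks(batch_size, disp_batches):
    """Build the Speedometer callback with automatic metric resets disabled.

    Hypothetical helper; in the training script the two arguments would
    likely come from args.batch_size and args.disp_batches.
    """
    import mxnet as mx  # imported lazily; requires MXNet to be installed

    # auto_reset=False keeps Speedometer from calling reset_local() on the
    # metric after each log line, so sum_metric is never replaced by 0.0.
    return [mx.callback.Speedometer(batch_size, disp_batches, auto_reset=False)]
```

With `auto_reset=False` the logged accuracies are no longer reset after each log line, so they should read as running values accumulated since the last full `reset()` rather than values for the most recent `disp_batches` batches.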