Even if you can't use a newer version in production, can you check locally whether the regression is also present in newer versions? Specifically in the …
Hi all,
Training is much slower after upgrading an old container from MXNet 1.1 to 1.4.1. Because of how my org handles our builds, I can't update MXNet past 1.4, so I'd just like to ask whether there are better approaches to debugging this issue.
I have two containers:

- a py2 container with MXNet 1.1
- a py3 container with MXNet 1.4.1, built with MKL-DNN enabled

I have observed significant performance regressions in the py3 / MXNet 1.4.1 container.
I am using the code in this repo as a ‘minimal reproducible example’: https://github.com/opringle/multivariate_time_series_forecasting
I used the profiler in each version to capture the second training batch of the second epoch in both containers, like this:
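Roughly, the setup looks like the sketch below, with a toy model and random data standing in for the real ones (on 1.1, the equivalent calls are `mx.profiler.profiler_set_config` / `profiler_set_state` rather than `set_config` / `set_state`):

```python
import mxnet as mx

# Configure the profiler up front; set_state('run'/'stop') below controls
# exactly which batch gets recorded.
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='profile_output.json')

# Toy stand-in for the real network and data.
data = mx.sym.Variable('data')
net = mx.sym.Convolution(data, kernel=(3, 3), num_filter=16)
net = mx.sym.FullyConnected(mx.sym.flatten(net), num_hidden=10)
net = mx.sym.SoftmaxOutput(net, name='softmax')

train_iter = mx.io.NDArrayIter(mx.nd.random.uniform(shape=(64, 1, 28, 28)),
                               mx.nd.zeros(64), batch_size=8)

mod = mx.mod.Module(net)
mod.bind(data_shapes=train_iter.provide_data,
         label_shapes=train_iter.provide_label)
mod.init_params()
mod.init_optimizer()

for epoch in range(2):
    train_iter.reset()
    for i, batch in enumerate(train_iter):
        if epoch == 1 and i == 1:          # second batch of the second epoch
            mx.profiler.set_state('run')
        mod.forward_backward(batch)
        mod.update()
        mx.nd.waitall()                    # force async work to finish each batch
        if epoch == 1 and i == 1:
            mx.profiler.set_state('stop')
            print(mx.profiler.dumps())     # aggregated per-op stats
```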
I capture this batch because the docs recommend not profiling the first batch, which includes one-time initialization costs.
This is the profiler output for the py2-1.1 container, sorted by total op time:

And this is the corresponding output for the py3-1.4.1 container:
Some ops, such as backward_Convolution, are significantly slower in the 1.4.1 container. My machine's CPU is a 6-core Intel i7.
Does anyone know how to determine the root cause of this? Could the issue be related to MKL-DNN somehow?
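For anyone digging into this, one quick way to test whether MKL-DNN is implicated might be an A/B run with it toggled off, or with its verbose log enabled. A sketch only; the `MXNET_MKLDNN_ENABLED` and `MKLDNN_VERBOSE` switches are from this era of MXNet/MKL-DNN and should be confirmed against the exact 1.4.1 build:

```python
import os

# These must be set before mxnet is imported (or exported in the shell /
# passed with `docker run -e ...`).

# A/B test: disable MKL-DNN at runtime. If backward_Convolution recovers,
# the regression is in the MKL-DNN code path.
os.environ['MXNET_MKLDNN_ENABLED'] = '0'   # assumption: honored by this 1.4.x build

# Alternatively, keep MKL-DNN on and have it log every primitive it
# executes, with per-call timings, to see which convolutions hit a slow kernel:
# os.environ['MKLDNN_VERBOSE'] = '1'

import mxnet as mx
```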
Also, when I run the same example with the same containers on a machine with an Intel Xeon CPU (a c5 instance on AWS), the opposite happens: the py3-1.4.1 container is much faster per batch (about 1 s faster) than the py2-1.1 container.
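Separately, a common cause of MKL-DNN slowdowns on desktop CPUs (as opposed to server Xeons) is OpenMP oversubscribing hyperthreads, so one sanity check is pinning to the physical core count. Again a sketch, assuming the build uses Intel OpenMP so that `KMP_AFFINITY` applies:

```python
import os

# Limit OpenMP to the six physical cores and pin threads before mxnet
# loads; hyperthread oversubscription is a common cause of MKL-DNN
# slowdowns on desktop CPUs.
os.environ['OMP_NUM_THREADS'] = '6'
os.environ['KMP_AFFINITY'] = 'granularity=fine,compact,1,0'

import mxnet as mx
```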