Even if you can't use a newer version in production, can you check locally whether the regression is also present in newer versions? Specifically in the …
Hi all,
Training is much slower after upgrading an old container from MXNet 1.1 to 1.4.1. Because of how my org handles our builds, I can't update MXNet past 1.4, so I'd just like to ask whether there are better approaches to debugging this issue.
I have two containers:

- a py2 container with MXNet 1.1
- a py3 container with MXNet 1.4.1, built with MKL-DNN enabled

I have observed significant performance regressions in the py3 / MXNet 1.4.1 container.
I am using the code in this repo as a ‘minimal reproducible example’: https://github.com/opringle/multivariate_time_series_forecasting
I used the profiler in each version to capture the second training batch of the second epoch in both containers, like this:
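Roughly, the setup looks like the sketch below, with a toy model and random data standing in for the real ones (on 1.1, the equivalent calls are `mx.profiler.profiler_set_config` / `profiler_set_state` rather than `set_config` / `set_state`):

```python
import mxnet as mx

# Configure the profiler up front; set_state('run'/'stop') below controls
# exactly which batch gets recorded.
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='profile_output.json')

# Toy stand-in for the real network and data.
data = mx.sym.Variable('data')
net = mx.sym.Convolution(data, kernel=(3, 3), num_filter=16)
net = mx.sym.FullyConnected(mx.sym.flatten(net), num_hidden=10)
net = mx.sym.SoftmaxOutput(net, name='softmax')

train_iter = mx.io.NDArrayIter(mx.nd.random.uniform(shape=(64, 1, 28, 28)),
                               mx.nd.zeros(64), batch_size=8)

mod = mx.mod.Module(net)
mod.bind(data_shapes=train_iter.provide_data,
         label_shapes=train_iter.provide_label)
mod.init_params()
mod.init_optimizer()

for epoch in range(2):
    train_iter.reset()
    for i, batch in enumerate(train_iter):
        if epoch == 1 and i == 1:          # second batch of the second epoch
            mx.profiler.set_state('run')
        mod.forward_backward(batch)
        mod.update()
        mx.nd.waitall()                    # force async work to finish each batch
        if epoch == 1 and i == 1:
            mx.profiler.set_state('stop')
            print(mx.profiler.dumps())     # aggregated per-op stats
```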
I capture this batch because the docs recommend not profiling the first batch, which includes one-time initialization costs.
This is the profiler output for the py2-1.1 container, sorted by total op time:

And this is the corresponding output for the py3-1.4.1 container:
Some ops, such as backward_Convolution, are significantly slower in the 1.4.1 container. My machine's CPU is a 6-core Intel i7.
Does anyone know how to determine the root cause of this? Could the issue be related to MKL-DNN somehow?
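For anyone digging into this, one quick way to test whether MKL-DNN is implicated might be an A/B run with it toggled off, or with its verbose log enabled. A sketch only; the `MXNET_MKLDNN_ENABLED` and `MKLDNN_VERBOSE` switches are from this era of MXNet/MKL-DNN and should be confirmed against the exact 1.4.1 build:

```python
import os

# These must be set before mxnet is imported (or exported in the shell /
# passed with `docker run -e ...`).

# A/B test: disable MKL-DNN at runtime. If backward_Convolution recovers,
# the regression is in the MKL-DNN code path.
os.environ['MXNET_MKLDNN_ENABLED'] = '0'   # assumption: honored by this 1.4.x build

# Alternatively, keep MKL-DNN on and have it log every primitive it
# executes, with per-call timings, to see which convolutions hit a slow kernel:
# os.environ['MKLDNN_VERBOSE'] = '1'

import mxnet as mx
```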
Also, when I run the same example with the same containers on a machine with an Intel Xeon CPU (a c5 instance on AWS), the opposite happens: the py3-1.4.1 container is much faster per batch (about 1 s faster) than the py2-1.1 container.
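Separately, a common cause of MKL-DNN slowdowns on desktop CPUs (as opposed to server Xeons) is OpenMP oversubscribing hyperthreads, so one sanity check is pinning to the physical core count. Again a sketch, assuming the build uses Intel OpenMP so that `KMP_AFFINITY` applies:

```python
import os

# Limit OpenMP to the six physical cores and pin threads before mxnet
# loads; hyperthread oversubscription is a common cause of MKL-DNN
# slowdowns on desktop CPUs.
os.environ['OMP_NUM_THREADS'] = '6'
os.environ['KMP_AFFINITY'] = 'granularity=fine,compact,1,0'

import mxnet as mx
```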