
Commit c0fb4a1

Force device synchronization before CUDA module loading
This commit forces device synchronization before loading the CUDA module. Synchronizing at this point helps prevent a subtle issue that can occur with multi-threading and autotuning. The issue appeared in the PyTorch integration during the backward phase only, but may appear with other frameworks too.

The issue comes from the way PyTorch switches CPU threads to compute the backward pass without immediately initializing the CUDA context. In such situations the tuner may kick in, and cuModuleLoadDataEx gets called on a CPU thread on which the CUDA context was never initialized, resulting in a hard, unrecoverable error.

Forcing synchronization calls a CUDA runtime API function (cudaDeviceSynchronize()), which has the side effect of initializing the CUDA context. Granted, the implicit nature of this is not ideal, but it is a CUDA-ism; in the same way, the PyTorch-ism of switching threads without initializing the CUDA context requires lazy, on-demand initialization. Putting this initialization inside cuda_rtc.cc is future-proof and will not require further changes if the problem reappears with other frameworks.
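For context, here is a minimal standalone sketch of the failure mode and the workaround. It is illustrative only and not part of this commit; the helper name ensureContextOnThisThread and the use of cudaFree(0) to initialize the runtime on the main thread are assumptions made for the example. The CUDA driver API keeps a per-thread current context, so a freshly spawned CPU thread has no context bound until some runtime API call lazily creates and binds one; driver API calls such as cuModuleLoadDataEx made from that thread before that point fail.

// Illustrative sketch (not from the commit): a newly spawned CPU thread has no
// current CUDA driver context until a runtime API call lazily binds one.
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>

static void ensureContextOnThisThread() {
  CUcontext ctx = nullptr;
  cuCtxGetCurrent(&ctx);  // leaves ctx as NULL if no context is current on this thread
  if (ctx == nullptr) {
    // Any CUDA runtime API call initializes the primary context and binds it
    // to the calling thread as a side effect.
    cudaDeviceSynchronize();
  }
}

int main() {
  cudaFree(0);  // initialize the CUDA runtime / primary context on the main thread

  std::thread backwardLikeThread([] {
    CUcontext before = nullptr;
    cuCtxGetCurrent(&before);  // typically NULL: this thread never touched CUDA
    ensureContextOnThisThread();
    CUcontext after = nullptr;
    cuCtxGetCurrent(&after);   // now non-NULL; driver calls such as cuModuleLoadDataEx are safe
    std::printf("context before: %p, after: %p\n", (void*)before, (void*)after);
  });
  backwardLikeThread.join();
  return 0;
}

In the patch below, the same check is wrapped in checkOrCreateContext(), which caches the result in a thread_local flag so the check runs at most once per CPU thread.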
1 parent 35e08bb commit c0fb4a1

tc/core/cuda/cuda_rtc.cc

Lines changed: 19 additions & 0 deletions
@@ -48,6 +48,18 @@ void CudaRTCFunction::clear() {
   }
 }
 
+void checkOrCreateContext() {
+  static thread_local bool created = false;
+  if (!created) {
+    created = true;
+    CUcontext ctx;
+    TC_CUDA_DRIVERAPI_ENFORCE(cuCtxGetCurrent(&ctx));
+    if (!ctx) {
+      TC_CUDA_RUNTIMEAPI_ENFORCE(cudaDeviceSynchronize());
+    }
+  }
+}
+
 std::unique_ptr<CudaRTCFunction> CudaRTCFunction::Compile(
     const std::string& name,
     const std::string& source) {
@@ -143,6 +155,13 @@ Duration CudaRTCFunction::Launch(
   if (perGpuModule_.count(dev) == 0) {
     CUmodule module;
     CUfunction function;
+    // Checking that a CUDA context exists for the current thread is necessary
+    // when benchmarking the backward of a PyTorch gradient operator:
+    // the backward is called on a different thread whose context may not have
+    // been initialized explicitly.
+    // This call to cudaDeviceSynchronize implicitly creates a new context if
+    // one is not bound to the current CPU.
+    checkOrCreateContext();
     TC_CUDA_DRIVERAPI_ENFORCE(
         cuModuleLoadDataEx(&module, nvrtc_ptx.data(), 0, 0, 0));
     perGpuModule_.emplace(dev, module);
