Description
I'm working on serac and have noticed that a lot of our simple tests take a surprisingly long time to compile.
We noticed a similar problem a while back: Inlet in particular had a bizarre combinatorial explosion of template instantiations that took a long time to compile. This issue, however, relates to axom/core.hpp, not Inlet.
When I started to write a new small CUDA test, I noticed that the trivial executable,
int main() { return 0; }
takes 0.5s to compile with nvcc on my machine, while
#include <axom/core.hpp>
int main() { return 0; }
surprisingly takes 5.5s.
So, just the act of #include-ing a header from axom added a whopping 5 seconds (?!) to my compile time.
I profiled the trivial example above with #include <axom/core.hpp> using clang's -ftime-trace flag (which only works for C++, not CUDA),
and the data is here (it can be opened in chrome://tracing). It revealed a number of things:
- ~25% of that time is spent on the #include <immintrin.h> from BitUtilities.hpp. The declarations in BitUtilities.hpp are not function templates, and do not depend on the intrinsics defined in immintrin.h, so the implementation could be moved into a separate file that #includes immintrin.h and is only compiled once. (Since these particular functions have __host__ __device__ annotations, this requires separable compilation, but I believe axom is already using this feature.)
- ~30% of that time is spent on Determinants.hpp and LU.hpp, which seem like unusual "core" features. Most of this time is spent #include-ing Umpire headers for memory allocation. As above, it seems these allocation/deallocation calls can be abstracted in a way that moves the implementation (and heavy includes) out of the header file. For example,
instead of
//--------------------------------
// src/axom/core/memory_management.hpp
template <typename T>
inline T* allocate(std::size_t n, int allocID) noexcept
{
  const std::size_t numbytes = n * sizeof(T);
#ifdef AXOM_USE_UMPIRE
  // now I have to #include all the umpire stuff in this header
  umpire::ResourceManager& rm = umpire::ResourceManager::getInstance();
  umpire::Allocator allocator = rm.getAllocator(allocID);
  return static_cast<T*>(allocator.allocate(numbytes));
#else
  AXOM_UNUSED_VAR(allocID);
  return static_cast<T*>(std::malloc(numbytes));
#endif
}
//--------------------------------
do
//--------------------------------
// src/axom/core/memory_management.hpp
void* allocate(std::size_t num_bytes, int allocID);
template <typename T>
inline T* allocate(std::size_t n, int allocID) noexcept
{
  return static_cast<T*>(allocate(n * sizeof(T), allocID));
}
//--------------------------------
// src/axom/core/memory_management.cpp
#include <cstdlib>                 // std::malloc
#include <big/umpire/headers.hpp>  // only included in the .cpp file, compiled once
void* allocate(std::size_t num_bytes, int allocID)
{
#ifdef AXOM_USE_UMPIRE
  umpire::ResourceManager& rm = umpire::ResourceManager::getInstance();
  return rm.getAllocator(allocID).allocate(num_bytes);
#else
  AXOM_UNUSED_VAR(allocID);
  return std::malloc(num_bytes);
#endif
}
//--------------------------------
- 15% of that time is spent on ArrayBase.hpp, 97% of which goes toward for_all.hpp. I only see two uses of for_all in that header, and they are for filling an array with a single value. I understand that it's convenient to reuse for_all here, but a simple kernel definition like
template <typename T>
__global__ void fill(T* ptr, T value, size_t n)
{
  // unsigned index avoids a signed/unsigned comparison and overflow for large n
  size_t tid = threadIdx.x + blockIdx.x * static_cast<size_t>(blockDim.x);
  if (tid < n) { ptr[tid] = value; }
}
accomplishes the same outcome, doesn't impact the compilation time at all (still 0.5s after adding this to the trivial example), and is only a few lines of code.
- ~8% of that time is spent on Utilities.hpp, which includes heavy headers like <random>, but the related functions (e.g. random_real) don't actually need to be defined in the header.
- ~3% of that time is spent on Timer.hpp, which includes <chrono>. I don't see any part of Timer's interface that needs to know about <chrono>, so a PImpl version of this class can move the #include <chrono> and the implementation out of the header.
The common theme is to avoid putting big #includes and implementations in headers, unless necessary.
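To make the Timer suggestion above concrete, here is a minimal PImpl sketch. It is not axom's actual Timer: the class name SketchTimer, its three member functions, and the timer.hpp / timer.cpp split are all hypothetical, chosen only to show that the header part needs <memory> but not <chrono>. Both halves are shown in one listing so it compiles as a single file.

```cpp
// ---- timer.hpp (hypothetical header): note there is no #include <chrono> ----
#include <memory>

class SketchTimer  // hypothetical name, not axom's real Timer class
{
public:
  SketchTimer();
  ~SketchTimer();  // defined in the .cpp, where Impl is a complete type
  SketchTimer(const SketchTimer&) = delete;
  SketchTimer& operator=(const SketchTimer&) = delete;

  void start();
  void stop();
  double elapsedSeconds() const;

private:
  struct Impl;                   // forward declaration only
  std::unique_ptr<Impl> m_impl;  // all <chrono> state lives behind this pointer
};

// ---- timer.cpp (hypothetical source): the heavy include is compiled once ----
#include <chrono>

struct SketchTimer::Impl
{
  using Clock = std::chrono::steady_clock;
  Clock::time_point begin {};
  Clock::time_point end {};
};

SketchTimer::SketchTimer() : m_impl(new Impl) { }
SketchTimer::~SketchTimer() = default;

void SketchTimer::start() { m_impl->begin = Impl::Clock::now(); }
void SketchTimer::stop() { m_impl->end = Impl::Clock::now(); }

double SketchTimer::elapsedSeconds() const
{
  // duration<double> with the default period is measured in seconds
  return std::chrono::duration<double>(m_impl->end - m_impl->begin).count();
}
```

The trade-off is one heap allocation per timer and a pointer indirection on each call, which should be negligible next to whatever is being timed, but that is an assumption worth checking for very fine-grained timing.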