long #include times for axom headers #872

@samuelpmishLLNL

I'm working on serac and noticing that a lot of our simple tests take a surprisingly long time to compile.

We noticed a similar problem a while back: Inlet in particular had a bizarre combinatorial explosion of template instantiations that took a long time to compile. This issue, however, relates to axom/core.hpp, not Inlet.

When I started to write a new small cuda test, I noticed that the trivial executable,

int main() { return 0; }

takes 0.5s to compile with nvcc on my machine, while

#include <axom/core.hpp>
int main() { return 0; }

surprisingly takes 5.5s.

So, just the act of #includeing a header from axom added a whopping 5 seconds (?!) to my compile time.


I profiled the trivial example above w/ #include <axom/core.hpp> via clang's -ftime-trace flag (only works for C++, not CUDA)
and the data is here (can be opened in chrome://tracing). It revealed a number of things:

  • ~25% of that time is spent on the #include <immintrin.h> from BitUtilities.hpp. The declarations in BitUtilities.hpp are not function templates, and do not depend on the intrinsics defined in immintrin.h, so the implementation could be moved into a separate file that #includes immintrin.h and is only compiled once. (Since these particular functions have __host__ __device__ annotations, this requires separable compilation, but I believe axom is already using this feature.)

  • ~30% of that time is spent on Determinants.hpp and LU.hpp, which seem like unusual "core" features. Most of this time is spent #including umpire stuff for memory allocation. Like above, it seems these allocation/deallocation calls can be abstracted in a way that moves the implementation (and heavy includes) out of the header file, e.g.

instead of

//--------------------------------
// src/axom/core/memory_management.hpp
template <typename T>
inline T* allocate(std::size_t n, int allocID) noexcept
{
  const std::size_t numbytes = n * sizeof(T);
#ifdef AXOM_USE_UMPIRE
  umpire::ResourceManager& rm = umpire::ResourceManager::getInstance(); // now I have to #include all the umpire stuff in this header
  umpire::Allocator allocator = rm.getAllocator(allocID);
  return static_cast<T*>(allocator.allocate(numbytes));
#else
  AXOM_UNUSED_VAR(allocID);
  return static_cast<T*>(std::malloc(numbytes));
#endif
}
//--------------------------------

do

//--------------------------------
// src/axom/core/memory_management.hpp 

void * allocate(std::size_t num_bytes, int allocID);

template <typename T>
inline T* allocate(std::size_t n, int allocID) noexcept
{
  return static_cast<T *>(allocate(n * sizeof(T), allocID));
}

//--------------------------------
// src/axom/core/memory_management.cpp

#include <cstdlib>                // std::malloc
#include <big/umpire/headers.hpp> // only included in the .cpp file, compiled once
void * allocate(std::size_t num_bytes, int allocID) {
#ifdef AXOM_USE_UMPIRE
  umpire::ResourceManager& rm = umpire::ResourceManager::getInstance();
  return rm.getAllocator(allocID).allocate(num_bytes);
#else
  AXOM_UNUSED_VAR(allocID);
  return std::malloc(num_bytes);
#endif
}
//--------------------------------

  • 15% of that time is spent on ArrayBase.hpp, 97% of which goes toward for_all.hpp. I only see two uses of for_all in that header, and both just fill an array with a single value. I understand that it's convenient to reuse for_all here, but a simple kernel definition like

template < typename T >
__global__ void fill(T * ptr, T value, size_t n) {
  size_t tid = threadIdx.x + blockIdx.x * blockDim.x; // unsigned, to match n
  if (tid < n) { ptr[tid] = value; }
}

accomplishes the same outcome, doesn't impact the compilation time at all (still 0.5s after adding this to the trivial example), and is only a few lines of code.

  • ~8% of that time is spent on Utilities.hpp, which includes heavy headers like <random>, but the related functions (e.g. random_real) don't actually need to be defined in the header.

  • ~3% of that time is spent on Timer.hpp, and including chrono. I don't see any part of Timer's interface that needs to know about chrono, so a PImpl version of this class can move the #include <chrono> and implementation out of the header.
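
As a rough illustration of the PImpl suggestion for Timer, here is a minimal sketch. It is not axom's actual Timer: the class layout, the member names (m_impl, startTime, stopTime) and the method names are assumptions made for the example, and the header and source parts are shown in one listing so it can be compiled as a single file. The point is only that <chrono> appears in the .cpp part and never in the header part.

```cpp
//--------------------------------
// Timer.hpp (hypothetical sketch): no #include <chrono> here
#include <memory>

class Timer
{
public:
  Timer();
  ~Timer();  // defined in the .cpp part, where Impl is a complete type
  Timer(const Timer&) = delete;
  Timer& operator=(const Timer&) = delete;

  void start();
  void stop();
  double elapsedTimeInSec() const;

private:
  struct Impl;                   // only declared here
  std::unique_ptr<Impl> m_impl;  // the header sees just a pointer
};

//--------------------------------
// Timer.cpp (hypothetical sketch): the heavy include is compiled once
#include <chrono>

struct Timer::Impl
{
  using Clock = std::chrono::steady_clock;
  Clock::time_point startTime {};
  Clock::time_point stopTime {};
};

Timer::Timer() : m_impl(new Impl) { }
Timer::~Timer() = default;

void Timer::start() { m_impl->startTime = Impl::Clock::now(); }
void Timer::stop() { m_impl->stopTime = Impl::Clock::now(); }

double Timer::elapsedTimeInSec() const
{
  // duration<double> defaults to seconds
  return std::chrono::duration<double>(m_impl->stopTime - m_impl->startTime).count();
}
//--------------------------------
```

The tradeoff is one heap allocation per Timer and an extra pointer indirection on each call, which is likely negligible for a timer but worth measuring if timers are created in hot loops.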


The common theme is to avoid putting big #includes and implementations in headers, unless necessary.
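
The same pattern applied to the random_real case might look like the sketch below. It assumes a plain double overload is enough for callers; if the real function is a template, it would need explicit overloads or explicit instantiations instead. The file names and the choice of std::mt19937 are illustrative assumptions, not axom's implementation.

```cpp
//--------------------------------
// utilities.hpp (hypothetical sketch): declaration only, no #include <random>
double random_real(double a, double b);

//--------------------------------
// utilities.cpp (hypothetical sketch): <random> is parsed once, here
#include <random>

double random_real(double a, double b)
{
  // One generator per process, seeded on first use.
  // Note: a function-local static generator is not safe to call
  // concurrently from multiple threads without extra synchronization.
  static std::mt19937 gen(std::random_device {}());
  std::uniform_real_distribution<double> dist(a, b);
  return dist(gen);
}
//--------------------------------
```

Translation units that only call random_real then pay for a one-line declaration rather than the whole <random> header.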
