long #include times for axom headers

I'm working on serac and noticing that a lot of our simple tests take a surprisingly long time to compile.

We noticed a similar problem a while back: Inlet in particular had a bizarre combinatorial explosion of template instantiations that took a long time to compile. This issue, however, relates to `axom/core.hpp`, not Inlet.

When I started to write a new small cuda test, I noticed that the trivial executable,
```cpp
int main() { return 0; }
```
takes 0.5s to compile with `nvcc` on my machine, while 
```cpp
#include <axom/core.hpp>
int main() { return 0; }
```
surprisingly takes 5.5s.

So, just the act of `#include`ing a header from axom added a whopping 5 seconds (?!) to my compile time. 

---------

I profiled the trivial example above w/ `#include <axom/core.hpp>` via clang's `-ftime-trace` flag (only works for C++, not CUDA) 
and the data is [here](https://github.com/LLNL/axom/files/9143954/axom_include_time.json.zip) (can be opened in `chrome://tracing`). It revealed a number of things:

- ~25% of that time is spent on the `#include <immintrin.h>` from `BitUtilities.hpp`. The declarations in `BitUtilities.hpp` are not function templates and do not depend on the intrinsics defined in `immintrin.h`, so the implementation could be moved into a separate file that #includes `immintrin.h` and is only compiled once. (Since these particular functions have `__host__ __device__` annotations, this requires separable compilation, but I believe axom is already using this feature.)

- ~30% of that time is spent on `Determinants.hpp` and `LU.hpp`, which seem like unusual "core" features. Most of this time is spent #including umpire stuff for memory allocation. As above, it seems these allocation/deallocation calls could be abstracted in a way that moves the implementation (and heavy includes) out of the header file, e.g.

instead of
```cpp
//--------------------------------
// src/axom/core/memory_management.hpp
template <typename T>
inline T* allocate(std::size_t n, int allocID) noexcept
{
  const std::size_t numbytes = n * sizeof(T);
#ifdef AXOM_USE_UMPIRE
  umpire::ResourceManager& rm = umpire::ResourceManager::getInstance(); // now I have to #include all the umpire stuff in this header
  umpire::Allocator allocator = rm.getAllocator(allocID);
  return static_cast<T*>(allocator.allocate(numbytes));
#else
  AXOM_UNUSED_VAR(allocID);
  return static_cast<T*>(std::malloc(numbytes));
#endif
}
//--------------------------------
```
do
```cpp
//--------------------------------
// src/axom/core/memory_management.hpp 

void * allocate(std::size_t num_bytes, int allocID);

template <typename T>
inline T* allocate(std::size_t n, int allocID) noexcept
{
  return static_cast<T *>(allocate(n * sizeof(T), allocID));
}

//--------------------------------
// src/axom/core/memory_management.cpp

#include <cstdlib> // std::malloc
#include <big/umpire/headers.hpp> // only included in the .cpp file, compiled once

void * allocate(std::size_t num_bytes, int allocID) {
#ifdef AXOM_USE_UMPIRE
  umpire::ResourceManager& rm = umpire::ResourceManager::getInstance();
  return rm.getAllocator(allocID).allocate(num_bytes);
#else
  AXOM_UNUSED_VAR(allocID);
  return std::malloc(num_bytes);
#endif
}
//--------------------------------
```

- 15% of that time is spent on `ArrayBase.hpp`, 97% of which goes toward `for_all.hpp`. I only see two uses of `for_all` in that header, and they are for filling an array with a single value. I understand that it's convenient to reuse `for_all` here, but a simple kernel definition like
```cu
template < typename T >
__global__ void fill(T * ptr, T value, size_t n) {
  // use size_t for the index to avoid a signed/unsigned comparison with n
  size_t tid = threadIdx.x + size_t(blockIdx.x) * blockDim.x;
  if (tid < n) { ptr[tid] = value; }
}
```
accomplishes the same outcome, doesn't impact the compilation time at all (still 0.5s after adding this to the trivial example), and is only a few lines of code.

- ~8% of that time is spent on `Utilities.hpp`, which includes heavy headers like `random`, but the related functions don't actually need to be in the header (e.g. `random_real`).

- ~3% of that time is spent on `Timer.hpp`, and including `chrono`. I don't see any part of `Timer`'s interface that needs to know about `chrono`, so a PImpl version of this class can move the `#include <chrono>` and implementation out of the header.
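
To make the PImpl suggestion concrete, here is a minimal sketch. The `sketch::Timer` class and its member names are hypothetical and are not axom's actual `Timer` interface. The "header part" and "source part" are shown in one block for brevity, but would live in separate `.hpp` and `.cpp` files.

```cpp
#include <memory>

// ---- header part (hypothetical Timer.hpp): note there is no <chrono> here ----
namespace sketch {
class Timer {
public:
  Timer();
  ~Timer();  // must be defined where Impl is a complete type
  Timer(const Timer&) = delete;
  Timer& operator=(const Timer&) = delete;

  void start();
  void stop();
  double elapsedSeconds() const;

private:
  struct Impl;                   // opaque to anyone including the header
  std::unique_ptr<Impl> m_impl;  // only needs the forward declaration
};
}  // namespace sketch

// ---- source part (hypothetical Timer.cpp): <chrono> is compiled once ----
#include <chrono>

namespace sketch {
struct Timer::Impl {
  using clock = std::chrono::steady_clock;
  clock::time_point begin{};
  clock::time_point end{};
};

Timer::Timer() : m_impl(new Impl()) {}
Timer::~Timer() = default;

void Timer::start() { m_impl->begin = Impl::clock::now(); }
void Timer::stop() { m_impl->end = Impl::clock::now(); }

double Timer::elapsedSeconds() const {
  // duration<double> has a period of one second
  return std::chrono::duration<double>(m_impl->end - m_impl->begin).count();
}
}  // namespace sketch
```

The tradeoff is one heap allocation per `Timer` and an extra pointer indirection on each call, which is likely negligible for a timer but worth keeping in mind.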

------------

The common theme is to avoid putting big #includes and implementations in headers, unless necessary.
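
The same idea can still work when the function is a template, by declaring it in the header and explicitly instantiating it in a source file. The sketch below uses a hypothetical `sketch::random_real` whose signature is an assumption, not axom's actual one. The header and source parts are again shown together but would be separate files.

```cpp
// ---- header part (hypothetical): declaration only, no <random> needed ----
namespace sketch {
template <typename T>
T random_real(T a, T b);
}  // namespace sketch

// ---- source part (hypothetical .cpp): heavy include compiled once ----
#include <random>

namespace sketch {
template <typename T>
T random_real(T a, T b) {
  // one generator per instantiation, seeded once (not thread-safe)
  static std::mt19937 gen{std::random_device{}()};
  std::uniform_real_distribution<T> dist(a, b);
  return dist(gen);
}

// explicit instantiations: the only types callers can link against
template float random_real<float>(float, float);
template double random_real<double>(double, double);
}  // namespace sketch
```

The cost of this approach is that the set of supported types is fixed in the source file. Calling it with any other type would compile but fail at link time.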

