Fix segfault race condition at process exit #5477
I discovered a couple of race conditions leading to segfaults in a separate repository. After debugging, I narrowed them down to the lifetimes of our static global state instances. Because these instances are all destructed as the main thread exits, code running in other threads can end up using them after they have been destructed.

The test included in this PR doesn't actually use threads, due to flakiness concerns. Instead, we can observe the same phenomenon by storing a `static std::optional<Context>` instance, whose destructor ends up running after the destructors for `GlobalState` and the root `Logger`.

This change updates the `GlobalState` to be a `static std::shared_ptr<GlobalState>` and the `Logger` to be a `static Logger*`. For `GlobalState`, we rely on `shared_ptr` reference counting to keep the instance alive for as long as a `Context` needs it. The `Logger*`, on the other hand, is never deleted, so log messages generated after the main thread exits are still emitted rather than silenced.
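The two lifetime strategies described above can be sketched in isolation. This is a minimal illustration, not the code in this PR: `DemoGlobalState`, `DemoLogger`, `DemoContext`, and the accessor functions are hypothetical stand-ins for `GlobalState`, `Logger`, and `Context`.

```cpp
#include <cassert>
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the library's global state (assumption).
struct DemoGlobalState {
  std::vector<std::string> events;
};

// Strategy 1: shared ownership. The function-local static holds one
// reference; every context that copies the pointer holds another, so the
// object outlives the static if a context is destroyed later.
std::shared_ptr<DemoGlobalState> demo_global_state() {
  static std::shared_ptr<DemoGlobalState> state =
      std::make_shared<DemoGlobalState>();
  return state;
}

// Hypothetical stand-in for the root logger (assumption).
struct DemoLogger {
  int count = 0;
  void log(const std::string& msg) {
    std::fprintf(stderr, "%s\n", msg.c_str());
    ++count;
  }
};

// Strategy 2: intentional leak. The pointer is never deleted, so no
// destructor runs at process exit and late log calls stay valid.
DemoLogger& demo_logger() {
  static DemoLogger* logger = new DemoLogger();
  return *logger;
}

// Hypothetical stand-in for a context that needs both globals.
class DemoContext {
 public:
  DemoContext() : state_(demo_global_state()) {
    assert(state_ != nullptr);
    state_->events.push_back("context created");
  }

  ~DemoContext() {
    // Safe even during static destruction: state_ keeps the global state
    // alive, and the logger is never destroyed.
    state_->events.push_back("context destroyed");
    demo_logger().log("context destroyed");
  }

  long state_use_count() const { return state_.use_count(); }

 private:
  std::shared_ptr<DemoGlobalState> state_;
};
```

In this sketch, a `static std::optional<DemoContext>` destroyed after the static `shared_ptr` would still see a live `DemoGlobalState`, because the context's own copy of the pointer holds the last reference. The trade-off of the leaked logger is that its destructor never runs, so any cleanup it would normally do is skipped.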
// Note: We are *not* deallocating this root `Logger` instance so that
// during process exit, threads other than main will not trigger segfault's
// if they attempt to log after the main thread has exited.
static Logger* l = tiledb_new<Logger>(
Is the heap profiler ever enabled by the time you get here? It looks like it is enabled only via the C API, not even via an environment option on startup.
I suppose if that changes then you will pick it up here, which will be nice, unless it expects memory usage to be zero by the time it exits for some reason.
To my knowledge the heap profiling is a compile time option so “yes” I guess? Assuming that code still works.
I would have expected it to be something that has to last for the life of the process, but there is a C API, `tiledb_heap_profiler_enable`, which looks like it connects to the same thing used here.
Quick skim suggests that the reporter logic is always available, but if it's not enabled at compile time it'll just not have anything to report.
See:
#if defined(TILEDB_MEMTRACE)
What I'm looking at is:

In `tiledb/common/heap_memory.h`:

    template <typename T, typename... Args>
    T* tiledb_new(const std::string& label, Args&&... args) {
      if (!heap_profiler.enabled()) {
        return new T(std::forward<Args>(args)...);
      }
      ...
    }

In `tiledb/common/heap_profiler.h`:

    extern HeapProfiler heap_profiler;
    class HeapProfiler {
      inline bool enabled() const {
        // We know that this instance has been initialized
        // if `reserved_memory_` has been set.
        return reserved_memory_ != nullptr;
      }
    };

And in `tiledb/common/heap_profiler.cc` we have `HeapProfiler::HeapProfiler` initializing its `reserved_memory_` to null. The enable function I mentioned above initializes this.
Ah. Looks like that's a missed optimization issue. The `make_shared` thing was updated specifically to be a compile-time switch. You can see the constexpr part here:
if constexpr (detail::global_tracing<void>::enabled::value) {
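The difference between the runtime `enabled()` check and a compile-time switch discussed in this thread can be sketched as follows. This is an illustrative sketch only: `DEMO_MEMTRACE`, `demo_tracing_enabled`, `DemoProfiler`, and both `demo_new_*` helpers are hypothetical names, not TileDB's actual macro, traits, or allocation functions. It assumes C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical compile-time switch (assumption): defined at build time to
// turn tracing on, similar in spirit to a memtrace-style macro.
#if defined(DEMO_MEMTRACE)
using demo_tracing_enabled = std::true_type;
#else
using demo_tracing_enabled = std::false_type;
#endif

// Hypothetical profiler with a runtime on/off flag (assumption).
struct DemoProfiler {
  bool runtime_enabled = false;
  std::size_t recorded = 0;

  bool enabled() const { return runtime_enabled; }

  void record(std::size_t bytes) {
    assert(bytes > 0);
    recorded += bytes;
  }
};

inline DemoProfiler demo_profiler;  // C++17 inline variable

// Runtime check: the branch is evaluated on every allocation, even in
// builds where profiling can never be turned on.
template <typename T, typename... Args>
T* demo_new_runtime(Args&&... args) {
  if (demo_profiler.enabled()) {
    demo_profiler.record(sizeof(T));
  }
  return new T(std::forward<Args>(args)...);
}

// Compile-time switch: when the switch is off, the recording branch is
// discarded at compile time and no check remains in the generated code.
template <typename T, typename... Args>
T* demo_new_constexpr(Args&&... args) {
  if constexpr (demo_tracing_enabled::value) {
    demo_profiler.record(sizeof(T));
  }
  return new T(std::forward<Args>(args)...);
}
```

With `DEMO_MEMTRACE` undefined, `demo_new_constexpr` never records anything regardless of the runtime flag, while `demo_new_runtime` records whenever `runtime_enabled` is set. That is the kind of gap the "missed optimization" remark points at.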
Haven't given the code a full look-over yet, but on first pass I'm hesitant on the implementation. I'd still like to see
@bekadavis9 I’m right there with you. This was the least worst thing I could think of on first pass while also being much more concerned about just debugging the CI issues discussed in the story. The GlobalState stuff for instance is mostly for cloud to ask query processing to stop which I’m not entirely sure even works, but if the new cloud stuff doesn’t need it any more (double crossed fingers on that one) maybe the fix there is to just delete the global state. For the logger side, I still think that should be application provided. It’s still weird to me that we even attempt to write logs to disk instead of just exposing an API but it is what it is for the moment. However, using the library as is, within our guarantees can segfault and I put “don’t segfault” at a higher priority than “eww this touches code I want to delete”.
We should revisit this. We're seeing it reasonably often in
TYPE: BUG
DESC: Fix segfault race condition at process exit