Description:
We've written a test to verify the behavior of Gloo's UnboundBuffer when it is destroyed shortly after a send() operation. The test uses the TCP_LAZY transport and a context size of 2. Despite the simplicity of the setup, the test fails intermittently with a connection timeout (gloo::IoException). This suggests a possible timing or synchronization issue during peer connection setup in the TCP backend.
Test code:
TEST_F(BaseTest, DestroyingBuffer) {
  const auto transport = TCP_LAZY;
  const auto contextSize = 2;
  spawn(transport, contextSize, [&](const std::shared_ptr<Context> &context) {
    const auto rank = static_cast<size_t>(context->rank);
    if (rank == 0LU)
      return;
    using BufferPtr = std::unique_ptr<::gloo::transport::UnboundBuffer>;
    std::vector<int> storage = { context->rank };
    BufferPtr buffer =
        context->createUnboundBuffer(storage.data(), sizeof(int));
    ASSERT_NO_THROW({
      for (auto i = 0LU; i < contextSize; i++) {
        if (i == rank)
          continue;
        buffer->send(static_cast<int>(i), rank);
      }
      buffer.reset();
    });
  });
}
Behavior:
When it passes, the test completes in ~20ms, with both ranks attempting to connect and send as expected.
When it fails, it hangs for a long time (150s) before throwing:
gloo::IoException: [/path/to/gloo/transport/tcp/pair.h:303] Connect timeout [none]
This happens despite running the test multiple times consecutively with no code changes.
Environment:
Gloo commit: fe67c4b
Transport: TCP_LAZY
Context size: 2
Platform: Linux Ubuntu 24.04