Connection timeout in custom test using UnboundBuffer with TCP_LAZY transport #444

@piotrchmiel

Description:

We wrote a test to verify the behavior of Gloo's UnboundBuffer when it is destroyed shortly after a send() operation. The test uses the TCP_LAZY transport with a context size of 2. Despite the simple setup, it fails intermittently with a connection timeout (gloo::IoException), which suggests a possible timing or synchronization issue during peer connection setup in the TCP backend.

Test code:

TEST_F(BaseTest, DestroyingBuffer) {
  const auto transport = TCP_LAZY;
  const auto contextSize = 2;

  spawn(transport, contextSize, [&](const std::shared_ptr<Context> &context) {
    const auto rank = static_cast<size_t>(context->rank);
    if (rank == 0LU)
      return;

    using BufferPtr = std::unique_ptr<::gloo::transport::UnboundBuffer>;
    std::vector<int> storage = { context->rank };
    BufferPtr buffer =
        context->createUnboundBuffer(storage.data(), sizeof(int));

    ASSERT_NO_THROW({
      for (auto i = 0LU; i < contextSize; i++) {
        if (i == rank)
          continue;
        buffer->send(static_cast<int>(i), rank);
      }
      buffer.reset();
    });
  });
}

Behavior:
When it passes, the test completes in ~20ms, with both ranks attempting to connect and send as expected.

When it fails, it hangs for about 150 s before throwing:
gloo::IoException: [/path/to/gloo/transport/tcp/pair.h:303] Connect timeout [none]
The failure is intermittent: it shows up across consecutive runs of the test with no code changes in between.

Environment:
Gloo commit: fe67c4b

Transport: TCP_LAZY

Context size: 2

Platform: Linux Ubuntu 24.04
