Connection state mismatch edge-case #128

@mzealey

I haven't been able to find the root cause of this edge case (stompman 2.1.0, Python 3.13), but the client occasionally gets into a broken connection state in which most, if not all, attempts to start a transaction fail with the following traceback:

      async with self.client.begin() as txn:
                 ~~~~~~~~~~~~~~~~~^^
    File "/usr/local/lib/python3.13/contextlib.py", line 214, in __aenter__
      return await anext(self.gen)
             ^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.13/site-packages/stompman/client.py", line 139, in begin
      async with Transaction(
                 ~~~~~~~~~~~^
          _connection_manager=self._connection_manager, _active_transactions=self._active_transactions
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      ) as transaction:
      ^
    File "/usr/local/lib/python3.13/site-packages/stompman/transaction.py", line 21, in __aenter__
      await self._connection_manager.write_frame_reconnecting(BeginFrame(headers={"transaction": self.id}))
    File "/usr/local/lib/python3.13/site-packages/stompman/connection_manager.py", line 192, in write_frame_reconnecting
      connection_state = await self._get_active_connection_state()
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.13/site-packages/stompman/connection_manager.py", line 166, in _get_active_connection_state
      connection_result = await self._connect_to_any_server()
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.13/site-packages/stompman/connection_manager.py", line 143, in _connect_to_any_server
      connection_result = await lifespan.enter()
                          ^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.13/site-packages/stompman/connection_lifespan.py", line 86, in enter
      connection_result = await self._establish_connection()
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.13/site-packages/stompman/connection_lifespan.py", line 82, in _establish_connection
      self.set_heartbeat_interval(server_heartbeat)
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.13/site-packages/stompman/connection_manager.py", line 86, in _restart_heartbeat_tasks
      self._send_heartbeat_task = self._task_group.create_task(
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
          self._send_heartbeats_forever(server_heartbeat.want_to_receive_interval_ms)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      )
      ^
    File "/usr/local/lib/python3.13/asyncio/taskgroups.py", line 191, in create_task
      raise RuntimeError(f"TaskGroup {self!r} is finished")
  RuntimeError: TaskGroup <TaskGroup entered> is finished

The ActiveMQ Artemis broker reports connections that keep timing out, with log lines such as:

2025-07-24 11:52:37,818 WARN  [org.apache.activemq.artemis.core.server] AMQ222067: Connection failure has been detected:  [code=REMOTE_DISCONNECT]
2025-07-24 11:52:38,439 WARN  [org.apache.activemq.artemis.core.protocol.stomp] AMQ332069: Sent ERROR frame to STOMP client 10.141.93.122:42308:
2025-07-24 11:52:38,439 WARN  [org.apache.activemq.artemis.core.server] AMQ222067: Connection failure has been detected:  [code=REMOTE_DISCONNECT]
2025-07-24 11:52:42,949 WARN  [org.apache.activemq.artemis.core.protocol.stomp] AMQ332069: Sent ERROR frame to STOMP client 10.141.93.122:42332: AMQ229014: Did not receive data from 10.141.93.122:42332 within the 2000ms connection TTL. The connection will now be closed.
2025-07-24 11:52:42,949 WARN  [org.apache.activemq.artemis.core.server] AMQ222067: Connection failure has been detected: AMQ229014: Did not receive data from 10.141.93.122:42332 within the 2000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]

This happened on a service running 2-3 identical containers. Only one of them got into this state, and only restarting it fixed the issue.
