-
Sounds like a good idea, thanks @seetadev.
-
Hi @acul71, Thank you for the super thorough review of the Yamux implementation. Your spec analysis was a huge help in spotting the gaps. I’ve made the changes needed to get everything up to snuff. Here’s what I did:
I’ve committed the changes with a message linking back to your review. Could you take a look and let me know if I missed anything or if there’s more to polish?
-
Hi @paschal533 👋 I made a small update: with this, the identify protocol tests succeed. Let me know if you need anything from my side! Thanks 🙏

Yamux None to Int Fix

Issue Description

The Yamux stream multiplexer implementation in py-libp2p had an issue with handling `None` as the read length. This affected the identify protocol, which sometimes passes `None` when reading from a stream.

Root Cause

In Python, passing `None` where an integer is expected breaks the read logic further down the call chain. The issue specifically occurred in two key methods: `YamuxStream.read` and `Yamux.read_stream`.
Fix Implementation

The fix was implemented in both methods by adding explicit handling for `None`.

Changes in `YamuxStream.read`:

```python
async def read(self, n: int = -1) -> bytes:
    if self.recv_closed and not self.conn.stream_buffers.get(self.stream_id):
        return b""
    # Handle None value for n by converting it to -1
    if n is None:
        n = -1
    return await self.conn.read_stream(self.stream_id, n)
```

Changes in `Yamux.read_stream`:

```python
async def read_stream(self, stream_id: int, n: int = -1) -> bytes:
    logging.debug(f"Reading from stream {stream_id}, n={n}")
    # Handle None value for n by converting it to -1
    if n is None:
        n = -1
    # ... rest of the method implementation ...
```

Testing and Verification

The fix allows the identify protocol tests to pass successfully. The identify protocol is a critical part of libp2p that allows peers to exchange metadata about each other, such as their supported protocols and listen addresses.
Before the fix, the identify protocol tests would fail when attempting to read responses from streams where `None` was passed as the read length.

Impact

This fix ensures compatibility with other libp2p components that may pass `None` when reading from a stream.
By handling the `None` case explicitly, the read path stays robust regardless of how callers specify the read length.
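For completeness, a regression check along these lines could pin down the conversion. This is a standalone sketch, not the project's actual test suite; `normalize_read_length` is a hypothetical helper that mirrors the `if n is None: n = -1` logic above:

```python
import pytest


def normalize_read_length(n):
    """Mirror of the fix: treat None as "read everything" (-1)."""
    return -1 if n is None else n


@pytest.mark.parametrize("requested,expected", [(None, -1), (-1, -1), (10, 10)])
def test_normalize_read_length(requested, expected):
    # None must behave exactly like -1, so callers such as the identify
    # protocol can omit the length without breaking Yamux reads.
    assert normalize_read_length(requested) == expected
```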
-
@paschal533 The following tests are failing:

FAILED tests/core/host/test_connected_peers.py::test_connected_peers[connect_and_disconnect] - ExceptionGroup: Exceptions from Trio nursery (1 sub-exception)
FAILED tests/core/host/test_live_peers.py::test_live_peers_disconnect - ExceptionGroup: Exceptions from Trio nursery (1 sub-exception)
FAILED tests/core/host/test_live_peers.py::test_live_peers_unexpected_drop - ExceptionGroup: Exceptions from Trio nursery (1 sub-exception)
FAILED tests/core/network/test_net_stream.py::test_net_stream_read_until_eof - ExceptionGroup: Exceptions from Trio nursery (1 sub-exception)
FAILED tests/core/network/test_net_stream.py::test_net_stream_read_after_remote_closed - BaseExceptionGroup: Exceptions from Trio nursery (1 sub-exception)
FAILED tests/core/network/test_net_stream.py::test_net_stream_read_after_local_reset - BaseExceptionGroup: Exceptions from Trio nursery (1 sub-exception)
FAILED tests/core/network/test_net_stream.py::test_net_stream_read_after_remote_reset - BaseExceptionGroup: Exceptions from Trio nursery (1 sub-exception)
FAILED tests/core/network/test_net_stream.py::test_net_stream_write_after_local_closed - ExceptionGroup: Exceptions from Trio nursery (1 sub-exception)
FAILED tests/core/network/test_net_stream.py::test_net_stream_write_after_local_reset - ExceptionGroup: Exceptions from Trio nursery (1 sub-exception)
FAILED tests/core/network/test_net_stream.py::test_net_stream_write_after_remote_reset - ExceptionGroup: Exceptions from Trio nursery (1 sub-exception)
FAILED tests/core/network/test_notify.py::test_notify - ExceptionGroup: Exceptions from Trio nursery (1 sub-exception)
FAILED tests/core/network/test_swarm.py::test_swarm_close_peer - ExceptionGroup: Exceptions from Trio nursery (1 sub-exception)
FAILED tests/core/network/test_swarm_conn.py::test_swarm_conn_close - ExceptionGroup: Exceptions from Trio nursery (1 sub-exception)
-
Yamux Stream Test Failures

Test Failures Overview

After running the tests in

Failure Details and Suggested Fixes
-
Test Failure Analysis:
-
Swarm Connection Test Failure Analysis

Test Name

tests/core/network/test_swarm_conn.py::test_swarm_conn_close

Issue Description

The test is failing at the assertion

This occurs in the following test scenario:

Root Cause

The root cause appears to be that the connection closure is not properly propagated across the connection. When one side closes the connection, the other side is never notified and keeps the connection in its list.

This issue is related to the same fundamental problem observed in the other connection-management test failures.

Problem in the Implementation

Looking at the test and the error, there are several potential issues in the implementation:

Suggested Fix

The fix should ensure proper bidirectional closure notification for connections. Here are specific changes that could address the issue:
Implementation Approach

The most direct approach would be to modify the `close` method:

```python
async def close(self):
    if not self.is_closed:
        # Send explicit closure notification to remote peer
        try:
            # Send a GO_AWAY message through the Yamux connection
            await self.muxed_conn.close()
            # Small delay to allow the message to be sent
            await trio.sleep(0.05)
        except Exception as e:
            # Log the error but continue with closure
            logging.error(f"Error sending close notification: {str(e)}")
        # Set local state to closed
        self.is_closed = True
        # Remove from swarm
        if self.peer_id in self.swarm.connections:
            self.swarm.remove_conn(self)
```

Additionally, the Yamux implementation should be updated to properly handle connection closure events:
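A minimal sketch of what that could look like on the Yamux side follows. The `on_close` callback and the `_handle_goaway_or_error` method are hypothetical names used for illustration, not the existing py-libp2p API:

```python
import logging
from typing import Awaitable, Callable, Optional


class YamuxClosureNotifier:
    """Hypothetical fragment of a Yamux connection that reports closure upward."""

    def __init__(self) -> None:
        # Callback registered by the owning connection wrapper (assumed hook).
        self.on_close: Optional[Callable[[], Awaitable[None]]] = None
        self._closure_handled = False

    async def _handle_goaway_or_error(self, reason: str) -> None:
        # Called when a GO_AWAY frame arrives or the raw connection errors out.
        if self._closure_handled:
            return
        self._closure_handled = True
        logging.debug("Yamux connection closed: %s", reason)
        # Propagate the closure so the layer above can drop the connection
        # from the swarm's connection list.
        if self.on_close is not None:
            await self.on_close()
```

The owning connection wrapper would register `on_close` when the connection is set up, so that a remote GO_AWAY or a RawConnError ends up removing the connection from the swarm on the receiving side too.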
Related Issues

This issue is directly related to the other connection management problems observed in:
All these issues point to fundamental weaknesses in how the libp2p implementation handles connection lifecycle management, particularly around closure operations. The fix should be comprehensive and address all aspects of connection establishment, maintenance, and termination.
-
Swarm Test Failure Analysis

Test Name

tests/core/network/test_swarm.py::test_swarm_close_peer

Issue Description

The test is failing at the assertion

This occurs in the following test scenario:

Root Cause

The root cause appears to be that closing a peer connection is not properly propagated to both sides of the connection. When one side closes the connection, only the closing side's swarm removes the peer; the remote side keeps a stale connection entry.

This one-sided connection cleanup is problematic because:

Suggested Fix

The fix should ensure that when a swarm closes a connection to a peer, both sides of the connection are properly notified and cleaned up. Here are specific changes that could address the issue:

Implementation Details

The most direct approach would be to modify the swarm's close path so that the remote peer is notified before the local side tears the connection down.
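As a rough illustration of that sequencing only; `close_peer`, the `connections` mapping, and the `notifees`/`disconnected` hook are assumed names, not necessarily the current py-libp2p API:

```python
import logging

import trio


class SwarmSketch:
    """Hypothetical swarm fragment showing a close flow that notifies the remote side first."""

    def __init__(self) -> None:
        self.connections = {}   # peer_id -> connection object (assumed shape)
        self.notifees = []      # objects exposing an async `disconnected(swarm, conn)` hook

    async def close_peer(self, peer_id) -> None:
        conn = self.connections.pop(peer_id, None)
        if conn is None:
            return
        try:
            # 1. Tell the remote side first (e.g. a Yamux GO_AWAY), then give the
            #    frame a moment to go out before the transport is torn down.
            await conn.muxed_conn.close()
            await trio.sleep(0.05)
        except Exception as exc:
            logging.error("Error while closing connection to %s: %s", peer_id, exc)
        finally:
            # 2. Only then fire local notifications, so listeners observe a
            #    connection that is actually gone on both ends.
            for notifee in self.notifees:
                await notifee.disconnected(self, conn)
```

Sending the closure frame before dropping the transport, and only then firing local notifications, mirrors the ordering the failing tests appear to expect.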
The key challenge is ensuring that the remote peer receives and processes the close notification before the connection is fully terminated, which may require careful sequencing of the shutdown operations.

Related Issues

This failure is related to the other connection management issues observed in the other failing tests. Specifically, the issues observed in the Yamux implementation regarding stream closure and reset are likely contributing to the connection issues as well, since connections rely on the underlying stream multiplexer for communication.
-
Notify Test Failure Analysis

Test Name

tests/core/network/test_notify.py::test_notify

Issue Description

The test is failing with an assertion error. The failure occurs at line 106 of test_notify.py:

```python
# Connected again, but different direction.
await connect_swarm(swarms[1], swarms[0])
```

Root Cause

The test is failing because the connection between the swarms is not being properly established after the previous disconnect operation. This could be due to several factors:

Suggested Fix

To address this issue, the following fixes should be considered:

Related Issues

This failure is likely related to the other test failures in the swarm and connection tests discussed above.
Fixing the underlying Yamux implementation issues would likely resolve this test failure as well, as they appear to be manifestations of the same core problems with connection and stream state management.
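Independently of the core fix, the test itself could be made less racy by waiting for the teardown to settle before reconnecting. A minimal sketch, assuming each swarm exposes a `connections` mapping keyed by peer ID (an assumption about py-libp2p internals, not a documented API):

```python
import trio


async def wait_until_disconnected(swarm_a, swarm_b, peer_a_id, peer_b_id, timeout=2.0):
    """Poll both swarms until neither still lists the other peer."""
    with trio.move_on_after(timeout):
        while peer_b_id in swarm_a.connections or peer_a_id in swarm_b.connections:
            await trio.sleep(0.01)
    # After this returns, reconnecting in the opposite direction no longer races
    # the teardown of the previous connection.
```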
-
Fix Proposal: Connection Closure in Yamux

Issue Summary

The following standalone script reproduces the problem:
```python
import trio
import logging

from libp2p.peer.peerinfo import info_from_p2p_addr
from tests.utils.factories import HostFactory

logging.basicConfig(level=logging.DEBUG)


async def main():
    print('Creating hosts')
    async with HostFactory.create_batch_and_listen(2) as hosts:
        host_a, host_b = hosts

        # Get peer info
        addr = host_b.get_addrs()[0]
        info = info_from_p2p_addr(addr)

        # Connect them
        print('Connecting hosts')
        await host_a.connect(info)

        # Check connection
        print(f'Host A connected peers: {host_a.get_connected_peers()}')
        print(f'Host B connected peers: {host_b.get_connected_peers()}')

        # Disconnect
        print('Disconnecting hosts')
        await host_b.disconnect(host_a.get_id())
        await trio.sleep(0.5)  # Give time for disconnect to propagate

        # Check connection again
        print(f'Host A connected peers after disconnect: {host_a.get_connected_peers()}')
        print(f'Host B connected peers after disconnect: {host_b.get_connected_peers()}')


if __name__ == "__main__":
    trio.run(main)  # the original snippet was truncated here; trio.run is the usual entry point
```

Running `python test_disconnect.py` produced the following output:

```
Creating hosts
DEBUG:factory.generate:LazyAttribute: Evaluating <function SwarmFactory.<lambda> at 0x7ef6b06ba700> on <Resolver for <factory.builder.BuildStep object at 0x7ef6b06c4440>>
DEBUG:factory.generate:LazyFunction: Evaluating <function default_key_pair_factory at 0x7ef6b0c85620> on <factory.builder.BuildStep object at 0x7ef6b06c4440>
DEBUG:factory.generate:LazyAttribute: Evaluating <function SwarmFactory.<lambda> at 0x7ef6b06ba7a0> on <Resolver for <factory.builder.BuildStep object at 0x7ef6b06c4440>>
DEBUG:factory.generate:LazyAttribute: Evaluating <function SwarmFactory.<lambda> at 0x7ef6b06ba840> on <Resolver for <factory.builder.BuildStep object at 0x7ef6b06c4440>>
DEBUG:factory.generate:LazyFunction: Evaluating <function default_muxer_transport_factory at 0x7ef6b06ba340> on <factory.builder.BuildStep object at 0x7ef6b06c4440>
DEBUG:factory.generate:LazyFunction: Evaluating <class 'libp2p.transport.tcp.tcp.TCP'> on <factory.builder.BuildStep object at 0x7ef6b06c4440>
DEBUG:async_service.Manager:<Manager[Swarm] flags=SRcfe>: _handle_cancelled waiting for cancellation
DEBUG:libp2p.transport.tcp:serve_tcp 127.0.0.1 0
DEBUG:factory.generate:LazyAttribute: Evaluating <function SwarmFactory.<lambda> at 0x7ef6b06ba700> on <Resolver for <factory.builder.BuildStep object at 0x7ef6b0686ad0>>
DEBUG:factory.generate:LazyFunction: Evaluating <function default_key_pair_factory at 0x7ef6b0c85620> on <factory.builder.BuildStep object at 0x7ef6b0686ad0>
DEBUG:factory.generate:LazyAttribute: Evaluating <function SwarmFactory.<lambda> at 0x7ef6b06ba7a0> on <Resolver for <factory.builder.BuildStep object at 0x7ef6b0686ad0>>
DEBUG:factory.generate:LazyAttribute: Evaluating <function SwarmFactory.<lambda> at 0x7ef6b06ba840> on <Resolver for <factory.builder.BuildStep object at 0x7ef6b0686ad0>>
DEBUG:factory.generate:LazyFunction: Evaluating <function default_muxer_transport_factory at 0x7ef6b06ba340> on <factory.builder.BuildStep object at 0x7ef6b0686ad0>
DEBUG:factory.generate:LazyFunction: Evaluating <class 'libp2p.transport.tcp.tcp.TCP'> on <factory.builder.BuildStep object at 0x7ef6b0686ad0>
DEBUG:async_service.Manager:<Manager[Swarm] flags=SRcfe>: _handle_cancelled waiting for cancellation
DEBUG:libp2p.transport.tcp:serve_tcp 127.0.0.1 0
Connecting hosts
DEBUG:libp2p.network.swarm:attempting to dial peer QmV4csbu4dmiwGQQysaDUyW2mcAd1a1kq8Ynw9xVdiGvTq
DEBUG:libp2p.network.swarm:dialed peer QmV4csbu4dmiwGQQysaDUyW2mcAd1a1kq8Ynw9xVdiGvTq over base transport
DEBUG:libp2p.network.swarm:upgraded security for peer QmV4csbu4dmiwGQQysaDUyW2mcAd1a1kq8Ynw9xVdiGvTq
DEBUG:libp2p.network.swarm:upgraded mux for peer QmZyjU7JBbjao8Jb3Y6gkrSXrtxydbWMQqLW26jzfCbbu2
DEBUG:root:Starting Yamux for QmZyjU7JBbjao8Jb3Y6gkrSXrtxydbWMQqLW26jzfCbbu2
DEBUG:libp2p.network.swarm:upgraded mux for peer QmV4csbu4dmiwGQQysaDUyW2mcAd1a1kq8Ynw9xVdiGvTq
DEBUG:root:Waiting for new stream
DEBUG:root:Starting Yamux for QmV4csbu4dmiwGQQysaDUyW2mcAd1a1kq8Ynw9xVdiGvTq
DEBUG:root:Waiting for new stream
DEBUG:libp2p.network.swarm:successfully opened connection to peer QmZyjU7JBbjao8Jb3Y6gkrSXrtxydbWMQqLW26jzfCbbu2
DEBUG:libp2p.network.swarm:successfully dialed peer QmV4csbu4dmiwGQQysaDUyW2mcAd1a1kq8Ynw9xVdiGvTq
Host A connected peers: [<libp2p.peer.id.ID (QmV4csbu4dmiwGQQysaDUyW2mcAd1a1kq8Ynw9xVdiGvTq)>]
Host B connected peers: [<libp2p.peer.id.ID (QmZyjU7JBbjao8Jb3Y6gkrSXrtxydbWMQqLW26jzfCbbu2)>]
Disconnecting hosts
DEBUG:root:Closing Yamux connection with code 0
DEBUG:root:Received header: type=3, flags=0, stream_id=0, length=0
DEBUG:root:Received GO_AWAY: Normal termination
ERROR:root:Error in handle_incoming: RawConnError:
DEBUG:libp2p.network.swarm:successfully close the connection to peer QmZyjU7JBbjao8Jb3Y6gkrSXrtxydbWMQqLW26jzfCbbu2
Host A connected peers after disconnect: [<libp2p.peer.id.ID (QmV4csbu4dmiwGQQysaDUyW2mcAd1a1kq8Ynw9xVdiGvTq)>]
Host B connected peers after disconnect: []
DEBUG:async_service.Manager:<Manager[Swarm] flags=SRCfe>: _handle_cancelled triggering task cancellation
DEBUG:async_service.Manager:<Manager[Swarm] flags=SrCFe>: finished
DEBUG:async_service.Manager:<Manager[Swarm] flags=SRCfe>: _handle_cancelled triggering task cancellation
DEBUG:async_service.Manager:<Manager[Swarm] flags=SrCFe>: finished
```

Our test demonstrated the issue clearly: after the disconnect, Host B's connected-peer list is empty, while Host A still reports Host B as a connected peer.
The problem occurs in the Yamux stream multiplexer, where the connection closure event triggered by one peer is not properly propagated to the other peer's connection management system.

Root Cause

When a peer calls `disconnect()`, its Yamux connection sends a GO_AWAY frame and its own swarm removes the connection, but the remote peer never removes the corresponding entry from its connection list.
We can see this in the logs above: a `Received GO_AWAY: Normal termination` message and an `Error in handle_incoming: RawConnError` appear, yet Host A's connected-peer list is unchanged after the disconnect.
The key issue is that when the Yamux connection detects closure, it doesn't properly propagate this information to the SwarmConn layer, which is responsible for maintaining the connection list in the Swarm.

Proposed Fix

The fix should ensure that when a Yamux connection detects a closure (either via GO_AWAY or a RawConnError), it:
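A complementary sketch of the receiving-side wiring, with `on_close`, `connections`, and `notifee.disconnected` as assumed placeholder names rather than the current py-libp2p interfaces:

```python
import logging


class SwarmConnSketch:
    """Hypothetical connection wrapper wiring Yamux closure back into the swarm."""

    def __init__(self, muxed_conn, swarm, peer_id) -> None:
        self.muxed_conn = muxed_conn
        self.swarm = swarm
        self.peer_id = peer_id
        self.is_closed = False
        # Assumed hook: the Yamux connection calls this when it sees a GO_AWAY
        # frame or a RawConnError on the underlying raw connection.
        muxed_conn.on_close = self._on_muxed_conn_closed

    async def _on_muxed_conn_closed(self) -> None:
        if self.is_closed:
            return
        self.is_closed = True
        logging.debug("Muxed connection to %s closed remotely", self.peer_id)
        # Keep the swarm's connection table in sync with reality ...
        self.swarm.connections.pop(self.peer_id, None)
        # ... and tell listeners (e.g. the notify machinery used by the tests).
        for notifee in getattr(self.swarm, "notifees", []):
            await notifee.disconnected(self.swarm, self)
```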
-
@acul71, @paschal533 — let’s move our coordination to this discussion thread. As you work through the PR, please post your test results, compliance findings, and any blockers or observations here. This will give us a single place to track progress and context.
@pacrob, @dhuseby — we’re transitioning to using the Discussions page to manage in-depth PR conversations. The goal is to keep the PR page itself clean and focused on the code diffs, review comments, and final approvals. This improves long-term readability and makes it easier to understand the history of changes during future audits or debugging. It also helps avoid losing important context in long PR comment threads. Please let me know your feedback and thoughts.