Testnet Validator Intermittent WebSocket Handshake Timeouts on Kubernetes #209

@Luka-Loncar

Description

The testnet validator intermittently times out while connecting to the testnet WebSocket endpoint (wss://test.finney.opentensor.ai:443) when running on Kubernetes. The same connection succeeds locally and when individual commands are run inside the pod, but during validator initialization it fails with a handshake timeout.

Error Details

TimeoutError: timed out while waiting for handshake response

Full stack trace:

Traceback (most recent call last):
  File "/app/scripts/run_validator.py", line 25, in <module>
    asyncio.run(main())
  File "/usr/local/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/app/scripts/run_validator.py", line 13, in main
    validator = Validator()
  File "/app/neurons/validator.py", line 78, in __init__
    self.metagraph.sync_nodes()
  File "/usr/local/lib/python3.10/site-packages/fiber/chain/metagraph.py", line 66, in sync_nodes
    nodes = fetch_nodes.get_nodes_for_netuid(self.substrate, self.netuid)
  File "/usr/local/lib/python3.10/site-packages/fiber/chain/fetch_nodes.py", line 66, in get_nodes_for_netuid
    substrate = get_substrate(subtensor_address=substrate.url)
  File "/usr/local/lib/python3.10/site-packages/fiber/chain/interface.py", line 31, in get_substrate
    substrate = SubstrateInterface(
  File "/usr/local/lib/python3.10/site-packages/async_substrate_interface/sync_substrate.py", line 538, in __init__
    self.ws = self.connect(init=True)
  File "/usr/local/lib/python3.10/site-packages/async_substrate_interface/sync_substrate.py", line 624, in connect
    return connect(self.chain_endpoint, max_size=self.ws_max_size)
  File "/usr/local/lib/python3.10/site-packages/websockets/sync/client.py", line 378, in connect
    connection.handshake(
  File "/usr/local/lib/python3.10/site-packages/websockets/sync/client.py", line 94, in handshake
    raise TimeoutError("timed out while waiting for handshake response")
TimeoutError: timed out while waiting for handshake response

Current Behavior

  • Connection fails intermittently (succeeded on 5th retry in the reported case)
  • Works fine when running locally
  • Works fine when executing Python scripts directly in the pod
  • Only affects testnet validator on Kubernetes
  • Mainnet validator appears unaffected
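Since the connection did eventually succeed after several attempts, one possible stopgap while the root cause is investigated is to wrap the failing call in a retry with exponential backoff. The sketch below is only an illustration: `retry_with_backoff` and its parameters are hypothetical names, not part of fiber or async_substrate_interface, and it assumes that `Validator()` (the constructor from the traceback) is safe to call again after a failed attempt.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn() until it succeeds or `attempts` calls have failed.

    Hypothetical helper: the name and defaults are assumptions chosen
    for illustration, not taken from the validator code.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: surface the last error unchanged.
                raise
            # Wait 1s, 2s, 4s, ... (capped), plus jitter so that several
            # pods restarting together do not reconnect in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")


# Possible use in run_validator.py (assumes Validator() can be re-run):
#     validator = retry_with_backoff(lambda: Validator())
```

This only hides the symptom. It does not explain why the handshake is slow from the validator process on Kubernetes in the first place.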

Expected Behavior

The validator should establish a stable WebSocket connection to the testnet on first attempt without requiring multiple retries.

Environment

  • Platform: Kubernetes
  • Python: 3.10
  • Endpoint: wss://test.finney.opentensor.ai:443
  • Component: Testnet Validator

Reproduction Steps

  • Deploy testnet validator to Kubernetes
  • Observe logs during startup
  • Connection may fail with timeout error (intermittent)
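Because the failure is intermittent, a single manual test inside the pod may not catch it. A small probe that repeats the handshake and records how long each attempt takes could help quantify the failure rate. The sketch below uses the same `websockets.sync.client.connect` call that appears in the traceback and passes its `open_timeout` argument explicitly (to my knowledge the default is 10 seconds). The helper names `open_and_close`, `probe_handshake`, and `count_failures` are hypothetical.

```python
import time
from typing import Callable, List, Optional, Tuple

ENDPOINT = "wss://test.finney.opentensor.ai:443"


def open_and_close(url: str, open_timeout: float) -> None:
    """Perform one WebSocket handshake, then close the connection."""
    # Imported lazily so probe_handshake can be tested with a fake
    # connect function when the websockets package is not installed.
    from websockets.sync.client import connect

    with connect(url, open_timeout=open_timeout):
        pass


def probe_handshake(
    url: str = ENDPOINT,
    attempts: int = 10,
    open_timeout: float = 10.0,
    pause: float = 1.0,
    connect_fn: Callable[[str, float], None] = open_and_close,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[Optional[str], float]]:
    """Return one (error_name_or_None, seconds_elapsed) pair per attempt."""
    results: List[Tuple[Optional[str], float]] = []
    for i in range(attempts):
        start = time.monotonic()
        error: Optional[str] = None
        try:
            connect_fn(url, open_timeout)
        except Exception as exc:  # record any failure type, not only timeouts
            error = type(exc).__name__
        results.append((error, time.monotonic() - start))
        if i < attempts - 1:
            sleep(pause)
    return results


def count_failures(results: List[Tuple[Optional[str], float]]) -> int:
    """Number of attempts that ended in an error."""
    return sum(1 for error, _ in results if error is not None)


# Example, run from inside the validator pod:
#     for error, seconds in probe_handshake(attempts=20):
#         print(f"{error or 'ok':<20} {seconds:.2f}s")
```

Running the probe both from a shell in the pod and at the start of the validator entrypoint may show whether the timeouts are tied to startup timing or occur at any point in the pod's lifetime.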
