Node version v0.8.13 as leader gets stuck every 2-3 hours after restart  #1887

@ghost

Description

Describe the bug
My node does not run stably for longer than 2-3 hours. I have tried several different node-config settings to get a better result, but I don't think the node-config is the cause.

Mandatory Information

  1. jcli --full-version: jcli 0.8.13 (HEAD-241b3a59, release, linux [x86_64]) - [rustc 1.41.0 (5e1a79984 2020-01-27)];
  2. jormungandr --full-version: jormungandr 0.8.13 (HEAD-241b3a59, release, linux [x86_64]) - [rustc 1.41.0 (5e1a79984 2020-01-27)];

To Reproduce
Steps to reproduce the behavior:

  1. Whenever the jormungandr process has stopped, I start it again with start_leader and the following configuration:
log:
- output: stderr
  format: plain
  level: info
http_fetch_block0_service:
  - "https://github.com/input-output-hk/jormungandr-block0/raw/master/data/"
skip_bootstrap: false
bootstrap_from_trusted_peers: false
p2p:
  listen_address: "/ip4/0.0.0.0/tcp/17900"
  public_address: "/ip4/x.x.x.x/tcp/17900"
  topics_of_interest:
    blocks: high
    messages: high
  max_connections: 600
  max_bootstrap_attempts: 3
  max_unreachable_nodes_to_connect_per_event: 18
  gossip_interval: 5s
  policy:
    quarantine_duration: 10m
  trusted_peers:
    - address: "/ip4/3.231.168.222/tcp/3000"
      id: ff2aaaac6cab77d3fb72bf3cb9079246eca323c60b2fd68a
    - address: "/ip4/13.56.0.226/tcp/3000"
      id: 7ddf203c86a012e8863ef19d96aabba23d2445c492d86267
    - address: "/ip4/52.28.91.178/tcp/3000"
      id: 23b3ca09c644fe8098f64c24d75d9f79c8e058642e63a28c
    - address: "/ip4/3.125.75.156/tcp/3000"
      id: 22fb117f9f72f38b21bca5c0f069766c0d4327925d967791
    - address: "/ip4/13.112.181.42/tcp/3000"
      id: 52762c49a84699d43c96fdfe6de18079fb2512077d6aa5bc
    - address: "/ip4/13.114.196.228/tcp/3000"
      id: 7e1020c2e2107a849a8353876d047085f475c9bc646e42e9
    - address: "/ip4/52.8.15.52/tcp/3000"
      id: 18bf81a75e5b15a49b843a66f61602e14d4261fb5595b5f5
    - address: "/ip4/52.9.132.248/tcp/3000"
      id: 671a9e7a5c739532668511bea823f0f5c5557c99b813456c
    - address: "/ip4/3.125.183.71/tcp/3000"
      id: 9d15a9e2f1336c7acda8ced34e929f697dc24ea0910c3e67
    - address: "/ip4/18.184.35.137/tcp/3000"
      id: 06aa98b0ab6589f464d08911717115ef354161f0dc727858
    - address: "/ip4/18.182.115.51/tcp/3000"
      id: 8529e334a39a5b6033b698be2040b1089d8f67e0102e2575
    - address: "/ip4/3.115.154.161/tcp/3000"
      id: 35bead7d45b3b8bda5e74aa12126d871069e7617b7f4fe62
    - address: "/ip4/18.177.78.96/tcp/3000"
      id: fc89bff08ec4e054b4f03106f5312834abdf2fcb444610e9
    - address: "/ip4/52.9.77.197/tcp/3000"
      id: fcdf302895236d012635052725a0cdfc2e8ee394a1935b63
    - address: "/ip4/54.183.149.167/tcp/3000"
      id: df02383863ae5e14fea5d51a092585da34e689a73f704613
    - address: "/ip4/3.124.116.145/tcp/3000"
      id: 99cb10f53185fbef110472d45a36082905ee12df8a049b74
rest:
  listen: "127.0.0.1:3100"
storage: /home/ada/storage
explorer:
  enabled: false
mempool:
    pool_max_entries: 10000
    log_max_entries: 100000
leadership:
    logs_capacity: 4096
  1. The node runs for some time, then gets stuck, and the number of connections quickly climbs to max_connections. That is usually a good indication that it is no longer registering any blocks.
  2. In the logs I see the following stuck-notifier messages:
Mar 05 16:02:43.091 WARN blockchain is not moving up, system-date=82.37473, the last tip c19b40e7-000426db-82.24939 was 25068 seconds ago, task: stuck_notifier
Mar 05 16:03:43.093 WARN blockchain is not moving up, system-date=82.37503, the last tip c19b40e7-000426db-82.24939 was 25128 seconds ago, task: stuck_notifier
Mar 05 16:04:43.093 WARN blockchain is not moving up, system-date=82.37533, the last tip c19b40e7-000426db-82.24939 was 25188 seconds ago, task: stuck_notifier
Mar 05 16:05:43.091 WARN blockchain is not moving up, system-date=82.37563, the last tip c19b40e7-000426db-82.24939 was 25248 seconds ago, task: stuck_notifier
Mar 05 16:06:43.092 WARN blockchain is not moving up, system-date=82.37593, the last tip c19b40e7-000426db-82.24939 was 25308 seconds ago, task: stuck_notifier
Mar 05 16:07:43.092 WARN blockchain is not moving up, system-date=82.37623, the last tip c19b40e7-000426db-82.24939 was 25368 seconds ago, task: stuck_notifier
Mar 05 16:08:43.093 WARN blockchain is not moving up, system-date=82.37653, the last tip c19b40e7-000426db-82.24939 was 25428 seconds ago, task: stuck_notifier
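
For anyone triaging warnings like these, here is a minimal Python sketch that extracts the numbers from a stuck_notifier line. The regular expression is written against the excerpt above only, and the helper names (`parse_stuck_line`, `seconds_per_slot`, `is_stuck`) and the 600-second threshold are hypothetical, not part of jormungandr. The slot duration is not stated in this report; the sketch derives it from the line itself.

```python
import re

# Pattern for the stuck_notifier warning, based only on the log excerpt above.
STUCK_RE = re.compile(
    r"system-date=(?P<now_epoch>\d+)\.(?P<now_slot>\d+), "
    r"the last tip \S+-(?P<tip_epoch>\d+)\.(?P<tip_slot>\d+) "
    r"was (?P<age>\d+) seconds ago"
)


def parse_stuck_line(line):
    """Return the epoch/slot numbers and tip age as ints, or None if no match."""
    match = STUCK_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def seconds_per_slot(info):
    """Infer the slot duration from one warning (same-epoch lines only)."""
    if info["now_epoch"] != info["tip_epoch"]:
        return None
    slots = info["now_slot"] - info["tip_slot"]
    return info["age"] / slots if slots > 0 else None


def is_stuck(info, max_age_seconds=600):
    """Hypothetical threshold check that a restart script might apply."""
    return info["age"] > max_age_seconds


line = (
    "Mar 05 16:02:43.091 WARN blockchain is not moving up, "
    "system-date=82.37473, the last tip c19b40e7-000426db-82.24939 "
    "was 25068 seconds ago, task: stuck_notifier"
)
info = parse_stuck_line(line)
print(info["age"], seconds_per_slot(info), is_stuck(info))
```

For the first warning, the slot difference is 37473 - 24939 = 12534 and the reported age is 25068 seconds, so the line is internally consistent with 2 seconds per slot and the tip has not advanced for roughly seven hours.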
  1. I use a restart script that checks whether the node is stuck, but that process also gets clogged, because the node sometimes stops during the bootstrapping process.
  2. As a result:
Has this node been scheduled to be leader?
---
- created_at_time: "2020-03-05T07:33:47.290214282+00:00"
  enclave_leader_id: 1
  finished_at_time: "2020-03-05T12:50:41.000850225+00:00"
  scheduled_at_date: "82.31711"
  scheduled_at_time: "2020-03-05T12:50:39+00:00"
  status:
    Rejected:
      reason: Failed to compute the schedule within time boundaries
  wake_at_time: "2020-03-05T12:50:39.001462736+00:00"
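
To make the timing of this rejected entry easier to read, here is a small Python sketch that computes how long after the scheduled time the node woke up and finished. The timestamps are copied from the entry above; the helper names (`parse_ts`, `timings`) are hypothetical, and the fractional seconds are trimmed to microseconds because older Python versions do not parse nanosecond precision.

```python
import re
from datetime import datetime


def parse_ts(text):
    """Parse an RFC 3339 timestamp, trimming nanoseconds to microseconds."""
    trimmed = re.sub(r"(\.\d{6})\d+", r"\1", text)
    return datetime.fromisoformat(trimmed)


def timings(entry):
    """Return delays in seconds relative to the scheduled slot time."""
    t = {key: parse_ts(value) for key, value in entry.items()}
    scheduled = t["scheduled_at_time"]
    return {
        "wake_delay": (t["wake_at_time"] - scheduled).total_seconds(),
        "finish_delay": (t["finished_at_time"] - scheduled).total_seconds(),
        "lead_time": (scheduled - t["created_at_time"]).total_seconds(),
    }


# Values copied from the leader log entry above.
entry = {
    "created_at_time": "2020-03-05T07:33:47.290214282+00:00",
    "scheduled_at_time": "2020-03-05T12:50:39+00:00",
    "wake_at_time": "2020-03-05T12:50:39.001462736+00:00",
    "finished_at_time": "2020-03-05T12:50:41.000850225+00:00",
}
result = timings(entry)
print(result)
```

For this entry the node woke about 1.5 ms after the scheduled time but only finished about 2.0 s after it. If a slot lasts 2 seconds (an inference from the stuck_notifier numbers, not something stated in this report), that would be consistent with the "within time boundaries" rejection reason.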

Expected behavior
A stable leader node. It runs on a dedicated machine with 4 CPUs and 16 GB of RAM under Ubuntu Server, and the machine does nothing else.
