-
Hello, I have encountered the following common errors in etcd, primarily in the leader node:
The problem seems to be very broad, so I'll describe my environment using the troubleshooting article as a guideline:
Let me know if I need to provide anything else regarding troubleshooting.
Replies: 2 comments 1 reply
-
Seems like slow disks?
On Thu, Apr 24, 2025 at 7:29 AM Alexander Manov wrote:
Hello,
I have encountered the following common errors in etcd, primarily in the
leader node:
- "waiting for ReadIndex response took too long, retrying"
- "apply request took too long"
- "Failed to check current member's leadership"
I still cannot figure out why etcd is crashing.
The problem seems to be very broad, so I'll describe my environment using
the troubleshooting article as a guideline:
- The nodes are deployed as virtual machines on Proxmox via the metal
amd64 ISO, all on a common subnet. The machines get their network
configuration via DHCP, and the assigned IP addresses are static.
Currently there is 1 control plane node; in a setup with 3 control plane
nodes, the leader yields the same errors. The workers have been reset and
are not in use while troubleshooting.
- All firewall rules between the nodes inside the cluster subnet have
been removed, so ports 50000 and 50001 are accessible.
- Multiple fresh configurations were deployed. Before each deploy the
machine is reset (the EPHEMERAL, META, and STATE partitions are wiped and
the machine is rebooted). With each redeploy the same errors listed above
resurface.
- The cluster's common subnet is 10.240.6.0/24, which isn't
conflicting with the default 10.244.0.0/16 and 10.96.0.0/12.
- The health check reports that etcd is not healthy.
- Here are the full logs
<https://github.com/user-attachments/files/19893563/etcd.log> of the
node's etcd service. Most of it is a repeating pattern of the errors
mentioned above.
- The worker and control plane configurations point to an on-prem
discovery service, which is reset whenever the nodes are reset.
Let me know if I need to provide anything else regarding troubleshooting.
Thanks in advance! :)
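
The subnet claim in the list above can be double-checked mechanically. A minimal sketch using Python's standard `ipaddress` module; the labels `pod` and `service` are an assumption about what the two default ranges are used for, and `find_overlaps` is a hypothetical helper, not part of any Talos tooling:

```python
import ipaddress

# Node subnet from the post.
NODE_SUBNET = ipaddress.ip_network("10.240.6.0/24")

# Default ranges from the post; the "pod"/"service" labels are assumed.
RESERVED = {
    "pod": ipaddress.ip_network("10.244.0.0/16"),
    "service": ipaddress.ip_network("10.96.0.0/12"),
}


def find_overlaps(subnet, reserved):
    """Return the names of reserved ranges that overlap the given subnet."""
    return [name for name, net in reserved.items() if subnet.overlaps(net)]


if __name__ == "__main__":
    conflicts = find_overlaps(NODE_SUBNET, RESERVED)
    print("conflicts:", conflicts if conflicts else "none")
```

An empty result means the node subnet is disjoint from both ranges, which matches what the post states.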
-
This is either slow disks (really slow, if it happens without much load) or a networking issue, e.g. an MTU mismatch. It's not a Talos issue on its own, so you'll need to troubleshoot your setup further to find the root cause. For example, try a single controlplane node cluster and see if the problem persists. A single etcd member has no peer traffic, so if the errors persist there, it's a disk issue.
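
To put a number on "slow disks", here is a rough sketch that times small synchronous writes, which is the pattern etcd's write-ahead log depends on. It is not an etcd or Talos tool: the function names, the sample count, and the 2300-byte block size are illustrative assumptions. It would have to run somewhere Python is available on the same storage (for example the Proxmox host or a scratch VM on the same datastore), since Talos nodes have no shell.

```python
import math
import os
import sys
import tempfile
import time

# Fall back to fsync on platforms without fdatasync.
_sync = getattr(os, "fdatasync", os.fsync)


def measure_sync_latency(directory, samples=200, block_size=2300):
    """Write small blocks, sync after each one, and return the sorted
    per-sync latencies in milliseconds."""
    payload = os.urandom(block_size)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            _sync(fd)
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)
    return sorted(latencies)


def percentile(sorted_values, fraction):
    """Nearest-rank percentile of an already sorted list."""
    index = max(0, math.ceil(fraction * len(sorted_values)) - 1)
    return sorted_values[index]


if __name__ == "__main__":
    # Directory to test; defaults to the current directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    results = measure_sync_latency(target)
    print(f"p50 = {percentile(results, 0.50):.2f} ms")
    print(f"p99 = {percentile(results, 0.99):.2f} ms")
```

As a point of comparison, etcd's own guidance is that the 99th percentile of WAL fsync duration should stay below roughly 10 ms. A p99 far above that from a test like this would point at storage rather than networking.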