Etcd is crashing by timing out #10791

AEManov20 · 2025-04-24T14:29:10Z

Hello,

I keep hitting the following errors in etcd, mostly on the leader node:

- "waiting for ReadIndex response took too long, retrying"
- "apply request took too long"
- "Failed to check current member's leadership"

I still cannot figure out why etcd is crashing.
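
To see which of these messages dominates a log dump, a small tally can help. This is only a sketch: it does plain substring matching on the three messages quoted above and assumes nothing about the log format; the file name in the usage comment is a placeholder.

```python
from collections import Counter

# The three messages quoted in the post above.
MESSAGES = (
    "waiting for ReadIndex response took too long, retrying",
    "apply request took too long",
    "Failed to check current member's leadership",
)


def tally_messages(lines):
    """Count how many lines contain each known message (substring match)."""
    counts = Counter({message: 0 for message in MESSAGES})
    for line in lines:
        for message in MESSAGES:
            if message in line:
                counts[message] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (placeholder file name): python tally.py etcd.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for message, n in tally_messages(fh).most_common():
            print(f"{n:6d}  {message}")
```

A tally dominated by "apply request took too long" would be consistent with slow storage, but the counts alone do not prove a cause.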

The problem seems to be very broad, so I'll describe my environment using the troubleshooting article as a guideline:

- The nodes are virtual machines on Proxmox, deployed from the metal amd64 ISO on a common subnet. There is currently 1 control plane; with 3 control planes the leader shows the same errors. Workers have been reset and are not used while troubleshooting. The machines get their network configuration via DHCP, and the addresses it hands out are static.
- All firewall rules between the nodes inside the cluster subnet have been removed, so ports 50000 and 50001 are accessible.
- Multiple fresh configurations were deployed. Before each deploy the machine is reset (the EPHEMERAL, META, and STATE partitions are wiped and the machine is rebooted). After every redeploy the same errors listed above resurface.
- The cluster's common subnet is 10.240.6.0/24, which doesn't conflict with the default 10.244.0.0/16 and 10.96.0.0/12.
- The health check reports that etcd is not healthy.
- Here are the full logs of the node's etcd service: https://github.com/user-attachments/files/19893563/etcd.log. Most of it is a repeating pattern of the errors mentioned above.
- The worker and control plane configurations point to an on-prem discovery service, which is reset every time the nodes are reset.
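
The subnet claim in the list above can be double-checked with Python's standard `ipaddress` module. A minimal sketch, with the helper name `overlapping` made up for illustration and the ranges taken from the post:

```python
import ipaddress


def overlapping(candidate, reserved):
    """Return the reserved CIDRs that overlap the candidate CIDR."""
    net = ipaddress.ip_network(candidate)
    return [r for r in reserved if net.overlaps(ipaddress.ip_network(r))]


if __name__ == "__main__":
    # Default ranges named in the post.
    defaults = ["10.244.0.0/16", "10.96.0.0/12"]
    clashes = overlapping("10.240.6.0/24", defaults)
    print("conflicts:", clashes or "none")
```

An empty result means the node subnet sits outside both default ranges.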

Let me know if I need to provide anything else regarding troubleshooting.
Thanks in advance! :)

Answered by smira

Apr 24, 2025


steverfrancis · 2025-04-24T14:31:37Z

Maintainer

Seems like slow disks?


smira · 2025-04-24T14:39:29Z

Maintainer

This is either slow disks (like really slow if it happens without too much load), or some networking issue, including e.g. MTU mismatch.

It's not a Talos issue on its own, so you would need to troubleshoot your setup further to understand what might be the root cause.

E.g. you can try a single controlplane node cluster, and see if the problem persists. If it does, it's a disk issue, since a single-member etcd has no peer traffic for the network to slow down.
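
One rough way to test the slow-disk theory is to time small synced writes on the storage in question. The sketch below is not an etcd or Talos tool: the function names, the 2300-byte block size, and the write count are assumptions made for illustration, and it has to run somewhere that shares the same backing storage (for example on the hypervisor host or inside a pod). etcd's documentation suggests that the 99th percentile of WAL fsync latency should stay under roughly 10 ms, so results far above that point at the disks.

```python
import math
import os
import tempfile
import time

# os.fdatasync is missing on some platforms; fall back to os.fsync there.
_sync = getattr(os, "fdatasync", os.fsync)


def fdatasync_latencies(directory, writes=200, block_size=2300):
    """Append small blocks to a temp file, syncing after each one.

    Returns the duration of every sync call in seconds.
    """
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        block = b"\0" * block_size
        for _ in range(writes):
            os.write(fd, block)
            start = time.perf_counter()
            _sync(fd)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    import sys

    # Usage: python synclat.py /path/on/the/disk/under/test
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    samples = fdatasync_latencies(target)
    print(f"p50 = {percentile(samples, 50) * 1000:.2f} ms")
    print(f"p99 = {percentile(samples, 99) * 1000:.2f} ms")
```

Running it once while the host is idle and once while other VMs are busy can show whether a noisy neighbour is the problem.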

1 reply

AEManov20 Apr 29, 2025
Author

Sorry for the slow response. It turned out it really was very slow disk I/O. One of the VMs (not related to the Talos cluster) was starving the others of resources. Still trying to figure out what's going on in that VM. Thanks for the swift response! I'll now mark this as closed.
