Description
We are experiencing reproducible application crashes in DragonflyDB, a Redis-compatible data store that makes heavy use of io_uring. The crash is a SIGABRT raised deep inside DragonflyDB's io_uring event loop.
The issue appears to be environment-specific: it has only been reproduced on Bottlerocket OS v1.41.0 running Linux kernel 6.1.140 on arm64. We have not been able to reproduce it on other kernel versions or architectures, which leads us to suspect an issue in the kernel's io_uring implementation or its integration in this specific Bottlerocket release.
The crash can be triggered reliably under at least two high-I/O scenarios:
- A high frequency of CL.THROTTLE commands.
- Replication and sync events between DragonflyDB nodes.
Both scenarios produce an identical stack trace, pointing to a systemic issue rather than a bug in a single application code path.
Environment
Bottlerocket Version: 1.41.0 (from aws-k8s-1.32 variant)
Kernel Version: 6.1.140
Architecture: arm64
Cloud Provider: AWS
Orchestrator: EKS (Kubernetes v1.32)
Node Provisioner: Karpenter
Symptoms
The DragonflyDB pod terminates unexpectedly. The container logs show a SIGABRT followed by a SIGTRAP. The key frame in the stack trace is fu2::...::empty_invoker<>::invoke() inside util::fb2::UringProactor::ProcessCqeBatch(). In the function2 (fu2) library, empty_invoker is the handler installed for an empty function object, so invoking it means the proactor dequeued a completion whose callback had already been cleared or destroyed; in other words, the io_uring event loop is attempting to execute an invalid or deallocated callback.
Crash Log & Stack Trace
*** SIGABRT received at time=1751305699 on cpu 1 ***
PC: @ 0xffff9e6ff1f0 (unknown) (unknown)
@ 0xaaaae65f3214 480 absl::lts_20240722::AbslFailureSignalHandler()
@ 0xffff9e979820 4960 (unknown)
@ 0xffff9e6ba67c 208 gsignal
@ 0xffff9e6a7130 32 abort
@ 0xaaaae5b5c3fc 336 fu2::abi_400::detail::type_erasure::invocation_table::function_trait<>::empty_invoker<>::invoke()
@ 0xaaaae63e8b40 16 util::fb2::UringProactor::ProcessCqeBatch()
@ 0xaaaae63ea448 304 util::fb2::UringProactor::ReapCompletions()
@ 0xaaaae63eabd8 80 util::fb2::UringProactor::MainLoop()
@ 0xaaaae63aa398 1568 boost::context::detail::fiber_entry<>()
[failure_signal_handler.cc : 345] RAW: Signal 5 raised at PC=0xffff9e6a71ec while already in AbslFailureSignalHandler()
*** SIGTRAP received at time=1751305699 on cpu 1 ***
PC: @ 0xffff9e6a71ec (unknown) abort
@ 0xaaaae65f3214 480 absl::lts_20240722::AbslFailureSignalHandler()
@ 0xffff9e979820 4960 (unknown)
@ 0xaaaae5b5c3fc 336 fu2::abi_400::detail::type_erasure::invocation_table::function_trait<>::empty_invoker<>::invoke()
@ 0xaaaae63e8b40 16 util::fb2::UringProactor::ProcessCqeBatch()
@ 0xaaaae63ea448 304 util::fb2::UringProactor::ReapCompletions()
@ 0xaaaae63eabd8 80 util::fb2::UringProactor::MainLoop()
@ 0xaaaae63aa398 1568 boost::context::detail::fiber_entry<>()
Steps to Reproduce
Note: this reproduction includes the Karpenter manifests we use for our nodes, but a standard managed node group may also be used.
- Set up the EKS Cluster and Node Pool
Provision an EKS cluster (v1.32) with Karpenter, then use the following NodePool and EC2NodeClass to provision arm64 worker nodes running Bottlerocket v1.41.0 (a verification step follows the manifests).
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: test-dragonfly-instance-store
spec:
disruption:
consolidateAfter: 1m0s
consolidationPolicy: WhenEmptyOrUnderutilized
budgets:
- nodes: "5%"
- nodes: "3"
limits:
cpu: 1k
memory: 1Ti
template:
metadata:
labels:
role: test-dragonfly
spec:
expireAfter: 336h0m0s
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: test-dragonfly-bottlerocket-instance-store
taints:
- key: test-dragonfly
effect: NoSchedule
requirements:
- key: karpenter.sh/capacity-type
operator: In
values:
- spot
- on-demand
- reserved
- key: kubernetes.io/arch
operator: In
values:
- arm64
# - amd64
- key: topology.kubernetes.io/zone
operator: In
values:
- us-west-1a
- key: karpenter.k8s.aws/instance-local-nvme
operator: Gt
values: ["100"]
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["c", "m", "r"]
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values: ["3"]
- key: karpenter.k8s.aws/instance-size
operator: NotIn
values:
- nano
- micro
- small
- medium
- large
- xlarge
- 2xlarge
- 12xlarge
- 16xlarge
- 18xlarge
- 24xlarge
- 32xlarge
- 48xlarge
- metal
- metal-16xl
- metal-24xl
- metal-48xl
- key: kubernetes.io/os
operator: In
values:
- linux
weight: 10
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: test-dragonfly-bottlerocket-instance-store
spec:
amiSelectorTerms:
- alias: bottlerocket@1.41.0
instanceStorePolicy: RAID0
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
deleteOnTermination: true
encrypted: true
kmsKeyID: ${aws_ebs_kms_key_arn}
throughput: 125
volumeSize: 4Gi
volumeType: gp3
- deviceName: /dev/xvdb
ebs:
deleteOnTermination: true
encrypted: true
kmsKeyID: ${aws_ebs_kms_key_arn}
throughput: 125
volumeSize: 15Gi
volumeType: gp3
kubelet:
systemReserved:
cpu: 100m
memory: 100Mi
metadataOptions:
httpEndpoint: enabled
httpPutResponseHopLimit: 2
httpTokens: required
role: ${karpenter_node_iam_role}
securityGroupSelectorTerms:
- id: ${cluster_security_group_id}
- id: ${worker_security_group_id}
subnetSelectorTerms:
- tags:
Tier: private
subnet_role: eks
tags:
karpenter.sh/discovery: ${cluster_name}
userData: |
[settings.metrics]
send-metrics = false
[settings.kubernetes]
event-qps = 50
event-burst = 100
registry-qps = 100
registry-burst = 200
kube-api-qps = 50
kube-api-burst = 100
allowed-unsafe-sysctls = ["net.core.somaxconn", "net.ipv4.tcp_*", "net.ipv4.ip_local_reserved_ports"]
shutdown-grace-period = "30s"
shutdown-grace-period-for-critical-pods = "10s"
[settings.kernel.sysctl]
"kernel.yama.ptrace_scope" = "0"
- Install the Dragonfly Operator
kubectl apply -f https://raw.githubusercontent.com/dragonflydb/dragonfly-operator/main/manifests/dragonfly-operator.yaml
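Confirm the operator is running before continuing; the upstream manifest installs it into the dragonfly-operator-system namespace (namespace name assumed from the manifest, verify against the version you applied):
kubectl get pods -n dragonfly-operator-system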
- Create a DragonflyDB cluster. Create the test-dragonfly namespace first if it does not already exist, then apply the manifest below. Ensure the pods run on arm64 Bottlerocket OS 1.41.0 (aws-k8s-1.32) nodes with kernel version 6.1.140 (a status check follows the manifest).
---
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
labels:
app.kubernetes.io/name: test-dragonfly-ratelimit
app.kubernetes.io/instance: test-dragonfly-ratelimit
app.kubernetes.io/part-of: dragonfly-operator
name: test-dragonfly-ratelimit
namespace: test-dragonfly
spec:
# https://www.dragonflydb.io/docs/managing-dragonfly/operator/dragonfly-configuration
image: ghcr.io/dragonflydb/dragonfly:v1.31.0
replicas: 2
# https://www.dragonflydb.io/docs/managing-dragonfly/flags
args:
- "--dbnum=1"
- "--hz=50"
- "--cache_mode=true"
- "--version_check=false"
resources:
requests:
cpu: 2
memory: 4Gi
limits:
cpu: 2 # required: the CPU limit sets the io-thread count, and requests == limits gives the pod "Guaranteed" QoS
memory: 4Gi
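# NOTE (added): the NodePool above taints its nodes with test-dragonfly:NoSchedule,
# so the pods need a matching toleration (and a node selector) to land on them.
# Field support is assumed; verify tolerations/nodeSelector against your
# dragonfly-operator CRD version.
tolerations:
- key: test-dragonfly
operator: Exists
effect: NoSchedule
nodeSelector:
role: test-dragonfly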
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: test-dragonfly-ratelimit
app.kubernetes.io/name: dragonfly
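After applying the resource in the test-dragonfly namespace, check that the pods are Running and scheduled on the Bottlerocket nodes:
kubectl -n test-dragonfly get dragonfly test-dragonfly-ratelimit
# -o wide shows which node each pod landed on
kubectl -n test-dragonfly get pods -o wide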
- Port-forward the Dragonfly service
kubectl port-forward -n test-dragonfly svc/test-dragonfly-ratelimit 6379:6379
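With the port-forward active, a quick sanity check from another terminal (assumes redis-cli is installed locally):
redis-cli -p 6379 PING
# expect: PONG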
- Trigger the bug by opening multiple terminals and running the following command in each (a single-shell alternative follows):
while true; do redis-cli CL.THROTTLE user123 20 120 60 1; done
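Opening separate terminals works, but the same load can be driven from one shell; a sketch assuming the port-forward above and redis-cli on PATH (the client count of 8 is arbitrary):
# spawn 8 background clients that hammer CL.THROTTLE through the port-forward
for i in $(seq 1 8); do
  ( while true; do redis-cli -p 6379 CL.THROTTLE "user$i" 20 120 60 1 >/dev/null; done ) &
done
wait  # interrupt with Ctrl-C to stop all clients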
You will see the crash in the DragonflyDB logs.
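To follow the logs, tail the StatefulSet the operator creates (the operator names it after the Dragonfly resource; verify with kubectl -n test-dragonfly get sts):
kubectl -n test-dragonfly logs -f statefulset/test-dragonfly-ratelimit
A representative crash excerpt: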
I20250630 17:46:46.903510 11 replica.cc:713] Transitioned into stable sync
*** SIGABRT received at time=1751305699 on cpu 1 ***
PC: @ 0xffff9e6ff1f0 (unknown) (unknown)
@ 0xaaaae65f3214 480 absl::lts_20240722::AbslFailureSignalHandler()
@ 0xffff9e979820 4960 (unknown)
@ 0xffff9e6ba67c 208 gsignal
@ 0xffff9e6a7130 32 abort
@ 0xaaaae5b5c3fc 336 fu2::abi_400::detail::type_erasure::invocation_table::function_trait<>::empty_invoker<>::invoke()
@ 0xaaaae63e8b40 16 util::fb2::UringProactor::ProcessCqeBatch()
@ 0xaaaae63ea448 304 util::fb2::UringProactor::ReapCompletions()
@ 0xaaaae63eabd8 80 util::fb2::UringProactor::MainLoop()
@ 0xaaaae63aa398 1568 boost::context::detail::fiber_entry<>()
[failure_signal_handler.cc : 345] RAW: Signal 5 raised at PC=0xffff9e6a71ec while already in AbslFailureSignalHandler()
*** SIGTRAP received at time=1751305699 on cpu 1 ***
PC: @ 0xffff9e6a71ec (unknown) abort
@ 0xaaaae65f3214 480 absl::lts_20240722::AbslFailureSignalHandler()
@ 0xffff9e979820 4960 (unknown)
@ 0xaaaae5b5c3fc 336 fu2::abi_400::detail::type_erasure::invocation_table::function_trait<>::empty_invoker<>::invoke()
@ 0xaaaae63e8b40 16 util::fb2::UringProactor::ProcessCqeBatch()
@ 0xaaaae63ea448 304 util::fb2::UringProactor::ReapCompletions()
@ 0xaaaae63eabd8 80 util::fb2::UringProactor::MainLoop()
@ 0xaaaae63aa398 1568 boost::context::detail::fiber_entry<>()
W20250630 17:48:19.531806 11 common.cc:356] ReportError: Software caused connection abort