DragonflyDB (io_uring app) crashes with SIGABRT on arm64 running kernel 6.1.140 #206

@dschaaff

Description

We are experiencing reproducible application crashes in DragonflyDB, a Redis-compatible data store that heavily utilizes io_uring. The crash is a SIGABRT that originates deep within DragonflyDB's io_uring event loop.

This issue appears to be specific to the environment and has only been reproduced on Bottlerocket OS v1.41.0 running the Linux kernel 6.1.140 on the arm64 architecture. We have not been able to reproduce this on other kernel versions or architectures, which leads us to suspect a potential issue in the kernel's io_uring implementation or its integration within this specific Bottlerocket version.

The crash can be triggered reliably under at least two different high-I/O scenarios:

Sending a high frequency of CL.THROTTLE commands.

During replication and sync events between DragonflyDB nodes.

Both scenarios produce an identical stack trace, pointing to a systemic issue rather than a bug in a single application code path.

Environment

Bottlerocket Version: 1.41.0 (from aws-k8s-1.32 variant)

Kernel Version: 6.1.140

Architecture: arm64

Cloud Provider: AWS

Orchestrator: EKS (Kubernetes v1.32)

Node Provisioner: Karpenter

Symptoms

The DragonflyDB pod terminates unexpectedly. The container logs show a SIGABRT signal followed by a SIGTRAP. The key function call in the stack trace is fu2::...::empty_invoker<>::invoke() occurring within the util::fb2::UringProactor::ProcessCqeBatch() function. This suggests the io_uring event loop is attempting to execute an invalid or deallocated callback function.
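
For illustration only, the sketch below (plain C++, not DragonflyDB or helio code, and using std::function in place of the fu2::function type that appears in the trace) shows the general failure mode this points at: a type-erased completion callback whose target has been cleared is still invoked by the event loop, and the "empty" invocation path ends in abort(), i.e. SIGABRT.

#include <cstdio>
#include <functional>

int main() {
  // Stand-in for one entry in a proactor's table of completion callbacks.
  std::function<void(int /*cqe result*/)> on_completion;

  on_completion = [](int res) { std::printf("completion, res=%d\n", res); };
  on_completion(0);         // normal path: the registered callback runs

  on_completion = nullptr;  // slot cleared, e.g. its owner already freed or reused it

  // Invoking the now-empty callback throws std::bad_function_call; with no
  // handler on the stack this reaches std::terminate() -> abort(), producing
  // a SIGABRT. fu2's empty_invoker::invoke() in the trace below plays the
  // analogous role for an empty fu2 function object.
  on_completion(1);
}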

Crash Log & Stack Trace

*** SIGABRT received at time=1751305699 on cpu 1 ***
PC: @      0xffff9e6ff1f0  (unknown)  (unknown)
    @      0xaaaae65f3214         480  absl::lts_20240722::AbslFailureSignalHandler()
    @      0xffff9e979820        4960  (unknown)
    @      0xffff9e6ba67c         208  gsignal
    @      0xffff9e6a7130          32  abort
    @      0xaaaae5b5c3fc         336  fu2::abi_400::detail::type_erasure::invocation_table::function_trait<>::empty_invoker<>::invoke()
    @      0xaaaae63e8b40          16  util::fb2::UringProactor::ProcessCqeBatch()
    @      0xaaaae63ea448         304  util::fb2::UringProactor::ReapCompletions()
    @      0xaaaae63eabd8          80  util::fb2::UringProactor::MainLoop()
    @      0xaaaae63aa398        1568  boost::context::detail::fiber_entry<>()
[failure_signal_handler.cc : 345] RAW: Signal 5 raised at PC=0xffff9e6a71ec while already in AbslFailureSignalHandler()
*** SIGTRAP received at time=1751305699 on cpu 1 ***
PC: @      0xffff9e6a71ec  (unknown)  abort
    @      0xaaaae65f3214         480  absl::lts_20240722::AbslFailureSignalHandler()
    @      0xffff9e979820        4960  (unknown)
    @      0xaaaae5b5c3fc         336  fu2::abi_400::detail::type_erasure::invocation_table::function_trait<>::empty_invoker<>::invoke()
    @      0xaaaae63e8b40          16  util::fb2::UringProactor::ProcessCqeBatch()
    @      0xaaaae63ea448         304  util::fb2::UringProactor::ReapCompletions()
    @      0xaaaae63eabd8          80  util::fb2::UringProactor::MainLoop()
    @      0xaaaae63aa398        1568  boost::context::detail::fiber_entry<>()

Steps to Reproduce

Note: this reproduction includes the Karpenter manifests we use for our nodes, but a standard managed node group may also be used.

  1. Set up the EKS Cluster and Node Pool

Provision an EKS cluster (v1.32) with Karpenter. Use the following NodePool and EC2NodeClass to provision arm64 worker nodes running Bottlerocket v1.41.0.

---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: test-dragonfly-instance-store
spec:
  disruption:
    consolidateAfter: 1m0s
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      - nodes: "5%"
      - nodes: "3"
  limits:
    cpu: 1k
    memory: 1Ti
  template:
    metadata:
      labels:
        role: test-dragonfly
    spec:
      expireAfter: 336h0m0s
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: test-dragonfly-bottlerocket-instance-store
      taints:
        - key: test-dragonfly
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
            - on-demand
            - reserved
        - key: kubernetes.io/arch
          operator: In
          values:
            - arm64
            # - amd64
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - us-west-1a
        - key: karpenter.k8s.aws/instance-local-nvme
          operator: Gt
          values: ["100"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["3"]
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values:
            - nano
            - micro
            - small
            - medium
            - large
            - xlarge
            - 2xlarge
            - 12xlarge
            - 16xlarge
            - 18xlarge
            - 24xlarge
            - 32xlarge
            - 48xlarge
            - metal
            - metal-16xl
            - metal-24xl
            - metal-48xl
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
  weight: 10
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: test-dragonfly-bottlerocket-instance-store
spec:
  amiSelectorTerms:
    - alias: bottlerocket@1.41.0
  instanceStorePolicy: RAID0
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        kmsKeyID: ${aws_ebs_kms_key_arn}
        throughput: 125
        volumeSize: 4Gi
        volumeType: gp3
    - deviceName: /dev/xvdb
      ebs:
        deleteOnTermination: true
        encrypted: true
        kmsKeyID: ${aws_ebs_kms_key_arn}
        throughput: 125
        volumeSize: 15Gi
        volumeType: gp3
  kubelet:
    systemReserved:
      cpu: 100m
      memory: 100Mi
  metadataOptions:
    httpEndpoint: enabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: ${karpenter_node_iam_role}
  securityGroupSelectorTerms:
    - id: ${cluster_security_group_id}
    - id: ${worker_security_group_id}
  subnetSelectorTerms:
    - tags:
        Tier: private
        subnet_role: eks
  tags:
    karpenter.sh/discovery: ${cluster_name}
  userData: |
    [settings.metrics]
    send-metrics = false
    [settings.kubernetes]
    event-qps = 50
    event-burst = 100
    registry-qps = 100
    registry-burst = 200
    kube-api-qps = 50
    kube-api-burst = 100
    allowed-unsafe-sysctls = ["net.core.somaxconn", "net.ipv4.tcp_*", "net.ipv4.ip_local_reserved_ports"]
    shutdown-grace-period = "30s"
    shutdown-grace-period-for-critical-pods = "10s"
    [settings.kernel.sysctl]
    "kernel.yama.ptrace_scope" = "0"
  2. Install the Dragonfly Operator

kubectl apply -f https://raw.githubusercontent.com/dragonflydb/dragonfly-operator/main/manifests/dragonfly-operator.yaml

  3. Create a DragonflyDB cluster. Ensure it runs on arm64 Bottlerocket OS 1.41.0 (aws-k8s-1.32) with kernel version 6.1.140.
---
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  labels:
    app.kubernetes.io/name: test-dragonfly-ratelimit
    app.kubernetes.io/instance: test-dragonfly-ratelimit
    app.kubernetes.io/part-of: dragonfly-operator
  name: test-dragonfly-ratelimit
  namespace: test-dragonfly
spec:
  # https://www.dragonflydb.io/docs/managing-dragonfly/operator/dragonfly-configuration
  image: ghcr.io/dragonflydb/dragonfly:v1.31.0
  replicas: 2
  # https://www.dragonflydb.io/docs/managing-dragonfly/flags
  args:
    - "--dbnum=1"
    - "--hz=50"
    - "--cache_mode=true"
    - "--version_check=false"
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 2 # this is required as the cpu limit == io thread count and sets QoS to "Guaranteed"
      memory: 4Gi
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: test-dragonfly-ratelimit
          app.kubernetes.io/name: dragonfly
  4. Port-forward the Dragonfly service: kubectl port-forward -n test-dragonfly svc/test-dragonfly-ratelimit 6379:6379
  5. Trigger the bug by opening multiple terminals and running the following in each: while true; do redis-cli CL.THROTTLE user123 20 120 60 1; done (a scripted alternative is sketched below)
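
If a scripted load generator is more convenient than several redis-cli terminals, a minimal multi-threaded sketch using hiredis follows. This is an illustrative alternative, not part of the original reproduction; it assumes hiredis is installed and the service is port-forwarded to 127.0.0.1:6379 as in the previous step (build with, e.g., g++ -O2 -pthread cl_throttle_load.cc -lhiredis).

#include <hiredis/hiredis.h>
#include <cstdio>
#include <thread>
#include <vector>

// Each worker opens its own connection and hammers CL.THROTTLE in a tight loop.
static void Worker(int id) {
  redisContext* ctx = redisConnect("127.0.0.1", 6379);
  if (ctx == nullptr || ctx->err) {
    std::fprintf(stderr, "worker %d: connect failed\n", id);
    if (ctx != nullptr) redisFree(ctx);
    return;
  }
  for (;;) {  // run until the pod crashes or the program is interrupted
    void* reply = redisCommand(ctx, "CL.THROTTLE user123 20 120 60 1");
    if (reply == nullptr) {  // connection dropped -- typically the server-side crash
      std::fprintf(stderr, "worker %d: connection lost: %s\n", id, ctx->errstr);
      break;
    }
    freeReplyObject(reply);
  }
  redisFree(ctx);
}

int main() {
  std::vector<std::thread> workers;
  for (int i = 0; i < 8; ++i) workers.emplace_back(Worker, i);
  for (auto& t : workers) t.join();
}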

You will see the crash in the DragonflyDB logs:

I20250630 17:46:46.903510    11 replica.cc:713] Transitioned into stable sync
*** SIGABRT received at time=1751305699 on cpu 1 ***
PC: @     0xffff9e6ff1f0  (unknown)  (unknown)
    @     0xaaaae65f3214        480  absl::lts_20240722::AbslFailureSignalHandler()
    @     0xffff9e979820       4960  (unknown)
    @     0xffff9e6ba67c        208  gsignal
    @     0xffff9e6a7130         32  abort
    @     0xaaaae5b5c3fc        336  fu2::abi_400::detail::type_erasure::invocation_table::function_trait<>::empty_invoker<>::invoke()
    @     0xaaaae63e8b40         16  util::fb2::UringProactor::ProcessCqeBatch()
    @     0xaaaae63ea448        304  util::fb2::UringProactor::ReapCompletions()
    @     0xaaaae63eabd8         80  util::fb2::UringProactor::MainLoop()
    @     0xaaaae63aa398       1568  boost::context::detail::fiber_entry<>()
[failure_signal_handler.cc : 345] RAW: Signal 5 raised at PC=0xffff9e6a71ec while already in AbslFailureSignalHandler()
*** SIGTRAP received at time=1751305699 on cpu 1 ***
PC: @     0xffff9e6a71ec  (unknown)  abort
    @     0xaaaae65f3214        480  absl::lts_20240722::AbslFailureSignalHandler()
    @     0xffff9e979820       4960  (unknown)
    @     0xaaaae5b5c3fc        336  fu2::abi_400::detail::type_erasure::invocation_table::function_trait<>::empty_invoker<>::invoke()
    @     0xaaaae63e8b40         16  util::fb2::UringProactor::ProcessCqeBatch()
    @     0xaaaae63ea448        304  util::fb2::UringProactor::ReapCompletions()
    @     0xaaaae63eabd8         80  util::fb2::UringProactor::MainLoop()
    @     0xaaaae63aa398       1568  boost::context::detail::fiber_entry<>()
W20250630 17:48:19.531806    11 common.cc:356] ReportError: Software caused connection abort
