Node memory leak #130

@masterOcean

Description

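A node leaked memory for about 30 minutes until it went down; the retained objects are mainly PubMessage objects.

Our production cluster consists of 4 nodes (16c 32g). On one node, memory climbed steadily for 30 minutes, GC could not bring it back down, and the node eventually became unavailable. This is the first time we have seen this. At the time the cluster had about 120,000 connections in total, roughly 30,000 per node. Each connection publishes a message about every 5 s or 10 s, and the messages are forwarded to Kafka over shared connections.
GC log: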

[2025-05-20T09:40:43.490+0800][30530][gc] GC(168430) Garbage Collection (Allocation Rate) 6626M(54%)->1966M(16%)
[2025-05-20T09:40:58.462+0800][30530][gc] GC(168431) Garbage Collection (Allocation Rate) 7112M(58%)->2040M(17%)
.....
[2025-05-20T09:44:37.140+0800][30530][gc] GC(168446) Garbage Collection (Allocation Rate) 5252M(43%)->2862M(23%)
[2025-05-20T09:44:45.639+0800][30530][gc] GC(168447) Garbage Collection (Allocation Rate) 5222M(42%)->2422M(20%)
[2025-05-20T09:44:55.678+0800][30530][gc] GC(168448) Garbage Collection (Allocation Rate) 4706M(38%)->3060M(25%)
.....
[2025-05-20T09:48:01.016+0800][30530][gc] GC(168460) Garbage Collection (Allocation Rate) 7482M(61%)->3376M(27%)
[2025-05-20T09:48:15.690+0800][30530][gc] GC(168461) Garbage Collection (Allocation Rate) 7518M(61%)->3448M(28%)
[2025-05-20T09:48:30.689+0800][30530][gc] GC(168462) Garbage Collection (Allocation Rate) 7720M(63%)->3502M(28%)
.....
[2025-05-20T09:52:06.422+0800][30530][gc] GC(168483) Garbage Collection (Allocation Rate) 6834M(56%)->4232M(34%)
[2025-05-20T09:52:16.279+0800][30530][gc] GC(168484) Garbage Collection (Allocation Rate) 6762M(55%)->4284M(35%)
[2025-05-20T09:52:26.332+0800][30530][gc] GC(168485) Garbage Collection (Allocation Rate) 6868M(56%)->4376M(36%)
.....
[2025-05-20T10:03:21.311+0800][30530][gc] GC(168626) Garbage Collection (Allocation Rate) 7308M(59%)->7280M(59%)
[2025-05-20T10:03:25.278+0800][30530][gc] GC(168627) Garbage Collection (Allocation Rate) 7302M(59%)->7328M(60%)
.....
[2025-05-20T10:10:03.710+0800][30530][gc] GC(168717) Garbage Collection (Allocation Rate) 9672M(79%)->9866M(80%)
[2025-05-20T10:10:09.355+0800][30843][gc] Allocation Stall (crdt-service-scheduler) 176.979ms
[2025-05-20T10:10:09.355+0800][30755][gc] Allocation Stall (io-rpc-worker-elg-31) 107.565ms
[2025-05-20T10:10:09.355+0800][32197][gc] Allocation Stall (basekv-range-mutator) 560.162ms
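
The trend in the log above is easiest to follow by looking only at the heap occupancy after each cycle. Here is a minimal sketch in Java, assuming log lines of exactly the shape shown above. `GcTrend` and `parse` are hypothetical names for illustration, not part of BifroMQ or the JDK.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of BifroMQ or the JDK.
class GcTrend {
    // Matches the cycle summaries shown above, e.g.
    // "GC(168430) Garbage Collection (Allocation Rate) 6626M(54%)->1966M(16%)"
    private static final Pattern CYCLE = Pattern.compile(
        "GC\\((\\d+)\\) Garbage Collection \\([^)]*\\) (\\d+)M\\(\\d+%\\)->(\\d+)M\\(\\d+%\\)");

    /** Returns {usedBeforeMb, usedAfterMb}, or null if the line is not a cycle summary. */
    static long[] parse(String line) {
        Matcher m = CYCLE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new long[] {Long.parseLong(m.group(2)), Long.parseLong(m.group(3))};
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "[2025-05-20T09:40:43.490+0800][30530][gc] GC(168430) Garbage Collection (Allocation Rate) 6626M(54%)->1966M(16%)",
            "[2025-05-20T09:52:06.422+0800][30530][gc] GC(168483) Garbage Collection (Allocation Rate) 6834M(56%)->4232M(34%)",
            "[2025-05-20T10:03:21.311+0800][30530][gc] GC(168626) Garbage Collection (Allocation Rate) 7308M(59%)->7280M(59%)",
            "[2025-05-20T10:10:09.355+0800][30843][gc] Allocation Stall (crdt-service-scheduler) 176.979ms");
        for (String line : log) {
            long[] mb = parse(line);
            if (mb == null) {
                continue; // not a cycle summary, e.g. an Allocation Stall line
            }
            System.out.println("after GC: " + mb[1] + "M, reclaimed: " + (mb[0] - mb[1]) + "M");
        }
    }
}
```

On the sampled lines, the amount reclaimed per cycle drops from 4660M at 09:40 to 28M at 10:03, which matches the description of memory that GC can no longer bring down.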

Heap histogram of the faulty node

 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:       4387991      789541360  [Ljava.lang.Object; (java.base@17.0.10)
   2:       6318061      692352160  [B (java.base@17.0.10)
   3:       2070386      560754472  [J (java.base@17.0.10)
   4:      14831152      474596864  java.util.concurrent.CompletableFuture (java.base@17.0.10)
   5:       2057200      298501136  [I (java.base@17.0.10)
   6:       4197419      268634816  java.util.concurrent.CompletableFuture$UniWhenComplete (java.base@17.0.10)
   7:       2076515      215957560  com.baidu.bifromq.type.Message
   8:       3136953      175669368  java.util.concurrent.CompletableFuture$UniRelay (java.base@17.0.10)
   9:       2048731      163898352  [S (java.base@17.0.10)
  10:       2120880      135736320  java.util.concurrent.CompletableFuture$UniApply (java.base@17.0.10)
  11:       2076511      132896704  java.util.concurrent.CompletableFuture$UniExceptionally (java.base@17.0.10)
  12:       4110609      131539488  java.lang.String (java.base@17.0.10)
  13:       1078152      120753024  io.netty.buffer.PooledUnsafeDirectByteBuf
  14:       2076517       99672816  com.baidu.bifromq.basescheduler.CallTask
  15:       2076510       83060400  com.baidu.bifromq.dist.client.scheduler.DistServerCall
  16:       2161485       69167520  java.util.concurrent.ConcurrentLinkedQueue$Node (java.base@17.0.10)
  17:       1061848       67958272  io.netty.buffer.PooledSlicedByteBuf
  18:       1060441       67868224  java.util.concurrent.CompletableFuture$UniAccept (java.base@17.0.10)
  19:       2076518       66448576  com.baidu.bifromq.dist.client.scheduler.BatcherKey
  20:       1060435       59384360  com.baidu.bifromq.mqtt.handler.MQTTSessionHandler$$Lambda$1774/0x00007fdddea97150
  21:       1060405       59382680  com.baidu.bifromq.mqtt.handler.MQTTSessionHandler$$Lambda$1777/0x00007fdddea97808
  22:       1038502       58156112  java.util.LinkedHashMap$Entry (java.base@17.0.10)
  23:       1016076       56900256  java.util.concurrent.CancellationException (java.base@17.0.10)
  24:        279987       51446320  [Ljava.util.HashMap$Node; (java.base@17.0.10)
  25:       1060436       50900928  com.baidu.bifromq.plugin.authprovider.type.CheckResult
  26:       1060435       50900880  io.netty.handler.codec.mqtt.MqttPublishMessage
  27:       2088592       50126208  com.google.protobuf.ByteString$LiteralByteString
  28:       2076517       49836408  com.baidu.bifromq.basescheduler.BatchCallScheduler$$Lambda$1429/0x00007fddde92f5f0
  29:       1127869       45114760  java.util.HashMap$Node (java.base@17.0.10)
  30:       1060439       42417560  io.netty.handler.codec.mqtt.MqttFixedHeader
  31:        212802       37065952  [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base@17.0.10)
  32:        410806       36150928  io.netty.channel.DefaultChannelHandlerContext
  33:       1060437       33933984  com.baidu.bifromq.mqtt.handler.MQTTSessionHandler$$Lambda$1693/0x00007fdddea5e250
  34:       1060435       33933920  io.netty.handler.codec.mqtt.MqttPublishVariableHeader
  35:        268707       25795872  java.util.concurrent.ConcurrentHashMap (java.base@17.0.10)
  36:        321774       25741920  java.util.LinkedHashMap (java.base@17.0.10)
  37:       1060435       25450440  com.baidu.bifromq.mqtt.handler.MQTTSessionHandler$$Lambda$1778/0x00007fdddea97c78
  38:        776047       24833504  io.netty.util.Recycler$DefaultHandle
  39:       1016077       24385848  java.util.concurrent.CompletableFuture$AltResult (java.base@17.0.10)
  40:        526340       21053600  java.util.concurrent.ConcurrentHashMap$Node (java.base@17.0.10)
  41:         45411       13804944  com.baidu.bifromq.mqtt.handler.v3.MQTT3TransientSessionHandler
  42:        137358       10988640  java.util.TreeMap (java.base@17.0.10)
  43:         45864       10273536  io.netty.channel.epoll.EpollSocketChannel
  44:        393943        9454632  java.util.concurrent.atomic.AtomicLong (java.base@17.0.10)
  45:        194612        9341376  com.google.protobuf.MapField
  46:         45609        8756928  io.netty.handler.traffic.TrafficCounter
  47:         91987        8094856  io.netty.util.concurrent.ScheduledFutureTask
  48:        139228        7796768  com.baidu.bifromq.type.ClientInfo
  49:        182432        7297280  io.netty.util.DefaultAttributeMap$DefaultAttribute
  50:        220074        7042368  java.util.concurrent.ConcurrentHashMap$KeySetView (java.base@17.0.10)
  51:        138238        6635424  com.baidu.bifromq.inbox.storage.proto.TopicFilterOption
  52:        194612        6227584  com.google.protobuf.MapField$MutabilityAwareMap
  53:         96349        6166336  java.util.HashMap (java.base@17.0.10)
  54:         44444        6044384  com.baidu.bifromq.inbox.storage.proto.InboxMetadata
  55:          7600        5168000  io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue
  56:         45605        4742920  com.baidu.bifromq.mqtt.handler.TenantSettings
  57:        195219        4685256  java.util.concurrent.atomic.AtomicReference (java.base@17.0.10)
  58:        194612        4670688  com.google.protobuf.MapField$ImmutableMessageConverter
  59:         45871        4403616  io.netty.channel.DefaultChannelPipeline$HeadContext

Heap histogram of a healthy node

 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:        311863      113943464  [Ljava.lang.Object; (java.base@17.0.10)
   2:       2077519      107863216  [B (java.base@17.0.10)
   3:       1940738       62103616  java.lang.String (java.base@17.0.10)
   4:        906314       50753584  java.util.LinkedHashMap$Entry (java.base@17.0.10)
   5:        216048       37429824  [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base@17.0.10)
   6:        420222       36979536  io.netty.channel.DefaultChannelHandlerContext
   7:        263687       34769200  [Ljava.util.HashMap$Node; (java.base@17.0.10)
   8:        273474       26253504  java.util.concurrent.ConcurrentHashMap (java.base@17.0.10)
   9:        306730       24538400  java.util.LinkedHashMap (java.base@17.0.10)
  10:        525528       21021120  java.util.concurrent.ConcurrentHashMap$Node (java.base@17.0.10)
  11:         46471       14127184  com.baidu.bifromq.mqtt.handler.v3.MQTT3TransientSessionHandler
  12:        140444       11235520  java.util.TreeMap (java.base@17.0.10)
  13:         46915       10508960  io.netty.channel.epoll.EpollSocketChannel
  14:        421213       10109112  java.util.concurrent.atomic.AtomicLong (java.base@17.0.10)
  15:         46653        8957376  io.netty.handler.traffic.TrafficCounter
  16:        176317        8463216  com.google.protobuf.MapField
  17:         94029        8274552  io.netty.util.concurrent.ScheduledFutureTask
  18:        186600        7464000  io.netty.util.DefaultAttributeMap$DefaultAttribute
  19:        223556        7153792  java.util.concurrent.ConcurrentHashMap$KeySetView (java.base@17.0.10)
  20:        141252        6780096  com.baidu.bifromq.inbox.storage.proto.TopicFilterOption
  21:        120211        6731816  com.baidu.bifromq.type.ClientInfo
  22:         98398        6297472  java.util.HashMap (java.base@17.0.10)
  23:         45324        6164064  com.baidu.bifromq.inbox.storage.proto.InboxMetadata
  24:        176317        5642144  com.google.protobuf.MapField$MutabilityAwareMap
  25:        230407        5529768  java.util.concurrent.atomic.AtomicReference (java.base@17.0.10)
  26:        164961        5278752  java.util.concurrent.CompletableFuture (java.base@17.0.10)
  27:          7440        5059200  io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue
  28:         46640        4850560  com.baidu.bifromq.mqtt.handler.TenantSettings
  29:         46922        4504512  io.netty.channel.DefaultChannelPipeline$HeadContext
  30:         46653        4478688  io.netty.handler.codec.mqtt.MqttDecoder
  31:         46653        4478688  io.netty.handler.traffic.ChannelTrafficShapingHandler
  32:        176317        4231608  com.google.protobuf.MapField$ImmutableMessageConverter
  33:         46922        4129136  io.netty.channel.DefaultChannelPipeline$TailContext
  34:         46891        4126408  io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl
  35:         46640        4104320  com.baidu.bifromq.mqtt.handler.v3.MQTT3ConnectHandler
  36:        118016        3776512  com.baidu.bifromq.mqtt.service.LocalDistService$TopicFilter
  37:        118014        3776448  com.baidu.bifromq.mqtt.service.LocalDistService$LocalRoutes
  38:         93994        3759760  java.net.InetAddress$InetAddressHolder (java.base@17.0.10)
  39:         46922        3753760  io.netty.channel.DefaultChannelPipeline
  40:         46915        3753200  io.netty.channel.epoll.EpollSocketChannel$EpollSocketChannelUnsafe
  41:         46915        3753200  io.netty.channel.epoll.EpollSocketChannelConfig
  42:         93280        3731200  com.baidu.bifromq.mqtt.session.MQTTSessionAuthProvider
  43:        152437        3658488  java.util.LinkedHashMap$LinkedEntrySet (java.base@17.0.10)
  44:         62679        3510024  java.util.TreeMap$Entry (java.base@17.0.10)
  45:          8682        3504408  [I (java.base@17.0.10)
  46:         18591        3456320  java.lang.Class (java.base@17.0.10)
  47:         93293        3358496  [Lcom.baidu.bifromq.mqtt.handler.condition.Condition;
  48:         46640        3358080  com.baidu.bifromq.mqtt.handler.ConditionalSlowDownHandler
  2774:             2             96  io.netty.handler.codec.mqtt.MqttPublishMessage
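
Comparing the two histograms class by class shows which types account for the growth. The sketch below is one way to do that, assuming rows of the shape shown above (rank, instance count, byte count, class name). `HistoDiff`, `parse` and `topGrowth` are hypothetical names, and a class missing from the healthy excerpt is treated as zero only for the purpose of this illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for diffing two class histograms by instance count.
class HistoDiff {
    /** Parses rows like "  26:  1060435  50900880  io.netty...MqttPublishMessage" into class -> instances. */
    static Map<String, Long> parse(List<String> rows) {
        Map<String, Long> counts = new HashMap<>();
        for (String row : rows) {
            String[] f = row.trim().split("\\s+");
            if (f.length < 4 || !f[0].endsWith(":")) {
                continue; // header, separator, or blank line
            }
            try {
                counts.put(f[3], Long.parseLong(f[1]));
            } catch (NumberFormatException e) {
                // malformed instance count: skip the row
            }
        }
        return counts;
    }

    /** Returns the n classes with the largest (faulty - healthy) instance delta, largest first. */
    static List<Map.Entry<String, Long>> topGrowth(
            Map<String, Long> faulty, Map<String, Long> healthy, int n) {
        List<Map.Entry<String, Long>> deltas = new ArrayList<>();
        for (Map.Entry<String, Long> e : faulty.entrySet()) {
            long delta = e.getValue() - healthy.getOrDefault(e.getKey(), 0L);
            deltas.add(new AbstractMap.SimpleEntry<>(e.getKey(), delta));
        }
        deltas.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return new ArrayList<>(deltas.subList(0, Math.min(n, deltas.size())));
    }

    public static void main(String[] args) {
        // A few rows copied from the two histograms above.
        Map<String, Long> faulty = parse(Arrays.asList(
            "   4:      14831152      474596864  java.util.concurrent.CompletableFuture (java.base@17.0.10)",
            "  26:       1060435       50900880  io.netty.handler.codec.mqtt.MqttPublishMessage",
            "  43:         45864       10273536  io.netty.channel.epoll.EpollSocketChannel"));
        Map<String, Long> healthy = parse(Arrays.asList(
            "  26:        164961        5278752  java.util.concurrent.CompletableFuture (java.base@17.0.10)",
            "  13:         46915       10508960  io.netty.channel.epoll.EpollSocketChannel",
            "  2774:             2             96  io.netty.handler.codec.mqtt.MqttPublishMessage"));
        for (Map.Entry<String, Long> e : topGrowth(faulty, healthy, 2)) {
            System.out.println(e.getKey() + " +" + e.getValue());
        }
    }
}
```

In the sampled rows the connection count (`EpollSocketChannel`) is about the same on both nodes, while `CompletableFuture` and `MqttPublishMessage` differ by millions of instances. The difference is therefore in in-flight publishes, not in the number of clients.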

We suspect the cause is in DistServerCallScheduler: when a gRPC call in the batcher times out and blocks, every MqttPublishMessage is added to Batcher.callTaskBuffers, which is a ConcurrentLinkedQueue and therefore unbounded.
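
To illustrate the suspected mechanism (this is not BifroMQ's actual code), the sketch below floods a buffer while nothing drains it. It runs once with an unbounded `ConcurrentLinkedQueue` and once with a bounded `ArrayBlockingQueue` that fails the caller's future when full. `BufferSketch`, `Task`, `submit` and `floodWhileStalled` are hypothetical names, and the capacity of 100 is an arbitrary value chosen for the demonstration.

```java
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical illustration of unbounded vs. bounded buffering; not BifroMQ code.
class BufferSketch {
    // Stand-in for a queued call: a payload plus the future its caller waits on.
    static final class Task {
        final byte[] payload;
        final CompletableFuture<Void> done = new CompletableFuture<>();

        Task(byte[] payload) {
            this.payload = payload;
        }
    }

    /** Enqueues the task; if the buffer is full, fails the task's future instead. Returns true if accepted. */
    static boolean submit(Queue<Task> buffer, Task task) {
        if (buffer.offer(task)) {
            return true; // ConcurrentLinkedQueue.offer always returns true
        }
        task.done.completeExceptionally(new IllegalStateException("buffer full"));
        return false;
    }

    /** Simulates n publishes arriving while nothing drains the buffer; returns how many were rejected. */
    static int floodWhileStalled(Queue<Task> buffer, int n) {
        int rejected = 0;
        for (int i = 0; i < n; i++) {
            if (!submit(buffer, new Task(new byte[64]))) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        Queue<Task> unbounded = new ConcurrentLinkedQueue<>();
        Queue<Task> bounded = new ArrayBlockingQueue<>(100);
        System.out.println("unbounded: rejected=" + floodWhileStalled(unbounded, 1000)
            + " retained=" + unbounded.size());
        System.out.println("bounded:   rejected=" + floodWhileStalled(bounded, 1000)
            + " retained=" + bounded.size());
    }
}
```

With the unbounded queue every task stays reachable until the consumer recovers, so heap usage grows with the publish rate. With the bounded queue the excess is rejected immediately and the caller can apply backpressure. Whether rejecting, blocking, or dropping is the right policy for a broker is a separate design decision that this sketch does not settle.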

Environment

  • Version: [3.2.1]
  • JVM Version: [OpenJDK 17, startup flags -Xms12g -Xmx12g -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=512m -XX:MaxDirectMemorySize=12g]
  • Hardware Spec: [15c32g, 4 nodes]
  • OS: [Tencent Cloud OS]
