
[BUG]: Updating to 2.14.0 led to high memory and termination of karafka worker #4626


Open
wahlg opened this issue May 6, 2025 · 14 comments
Labels
bug Involves a bug community Was opened by a community member

Comments


wahlg commented May 6, 2025

Tracer Version(s)

2.14.0

Ruby Version(s)

ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]

Relevant Library and Version(s)

No response

Bug Report

We have a Rails app that uses the karafka gem for Kafka message consumption.

We have been using the built-in karafka tracing described here:

# Initialize the listener 
dd_logger_listener = Karafka::Instrumentation::Vendors::Datadog::LoggerListener.new do |config| 
  config.client = Datadog::Tracing 
end
Karafka.monitor.subscribe(dd_logger_listener)

We recently upgraded our datadog tracing version to 2.14.0. We did NOT enable the karafka tracing that this provides (i.e. we did NOT add c.tracing.instrument :karafka to our datadog initializer).

However, after deploying the app we started to see very high memory usage in the container running karafka, which ultimately led to the container being killed repeatedly due to out-of-memory issues. We also saw traces recorded by the worker lasting upwards of 90 minutes.

Once we disabled the karafka tracing instrumentation shown above, memory usage returned to normal. However, we now have no traces, and we were not expecting this datadog upgrade to be a breaking change for our application.

Is there a way we can disable any instrumentation that the datadog gem is providing, and just use the tracing packaged with the karafka gem?

Thanks

Reproduction Code

Add this to karafka.rb, without adding any karafka tracing in Datadog.configure:

# Initialize the listener 
dd_logger_listener = Karafka::Instrumentation::Vendors::Datadog::LoggerListener.new do |config| 
  config.client = Datadog::Tracing 
end
Karafka.monitor.subscribe(dd_logger_listener)

Configuration Block

Datadog.configure do |c|
  c.profiling.enabled = true
  c.service = 'service'
  c.version = ENV.fetch('GIT_SHA', 'none')
  c.env = Rails.env
  c.diagnostics.debug = false
  c.diagnostics.startup_logs.enabled = false
  c.telemetry.enabled = true
  c.tracing.log_injection = true
  c.tracing.enabled = true
  c.tracing.instrument :rails
  c.tracing.instrument :concurrent_ruby
  c.tracing.instrument :faraday, split_by_domain: true
  c.tracing.instrument :graphql, service_name: 'ruby-graphql'
  c.tracing.instrument :grpc, { distributed_tracing: true }
  c.tracing.instrument :pg, {
    comment_propagation: 'full',
  }
  c.tracing.instrument :rack
  c.tracing.instrument :ethon, split_by_domain: true
  c.logger.instance = Rails.logger
end

Error Logs

No response

Operating System

No response

How does Datadog help you?

No response


mensfeld commented May 6, 2025

Putting aside the bloat issue itself, based on what @wahlg told me it seems that the tracing patches are applied even when datadog tracing is not in use (and only the internal karafka dd tracing is):

def inspect_class(klass)
  puts "Class: #{klass}"
  
  # Get included modules
  puts "\nIncluded modules:"
  klass.included_modules.each { |m| puts "  #{m}" }
  
  # Get ancestors (includes superclasses and included modules)
  puts "\nAncestors (superclasses and included modules):"
  klass.ancestors.each { |a| puts "  #{a}" }
  
  # Get the singleton class (eigenclass)
  singleton = klass.singleton_class
  
  # Get modules extended to the class (included in singleton class)
  puts "\nExtended modules (included in singleton class):"
  singleton.included_modules.each { |m| puts "  #{m}" unless m == Class }
  
  # For prepended modules, we need to check the ancestors
  puts "\nPrepended modules (appear before the class in ancestors):"
  prepended = []
  klass.ancestors.each_with_index do |ancestor, i|
    if ancestor == klass && i > 0
      # All ancestors before this index are prepended
      prepended = klass.ancestors[0...i]
      break
    end
  end
  prepended.each { |m| puts "  #{m}" if m.is_a?(Module) && !m.is_a?(Class) }
end

inspect_class(Karafka::Messages::Messages)
inspect_class(Karafka::Instrumentation::Monitor)

results in:

Class: Karafka::Messages::Messages

Included modules:
  Karafka::Pro::Cleaner::Messages::Messages
  Datadog::Tracing::Contrib::Karafka::MessagesPatch
  Enumerable
  ActiveSupport::Dependencies::RequireDependency
  PP::ObjectMixin
  ActiveSupport::ToJsonWithActiveSupportEncoder
  ActiveSupport::Tryable
  JSON::Ext::Generator::GeneratorMethods::Object
  Kernel

Ancestors (superclasses and included modules):
  Karafka::Pro::Cleaner::Messages::Messages
  Datadog::Tracing::Contrib::Karafka::MessagesPatch
  Karafka::Messages::Messages
  Enumerable
  ActiveSupport::Dependencies::RequireDependency
  Object
  PP::ObjectMixin
  ActiveSupport::ToJsonWithActiveSupportEncoder
  ActiveSupport::Tryable
  JSON::Ext::Generator::GeneratorMethods::Object
  Kernel
  BasicObject

Extended modules (included in singleton class):
  ActiveSupport::DescendantsTracker::ReloadedClassesFiltering
  Zeitwerk::ConstAdded
  Module::Concerning
  ActiveSupport::Dependencies::RequireDependency
  PP::ObjectMixin
  ActiveSupport::ToJsonWithActiveSupportEncoder
  ActiveSupport::Tryable
  JSON::Ext::Generator::GeneratorMethods::Object
  Kernel

Prepended modules (appear before the class in ancestors):
  Karafka::Pro::Cleaner::Messages::Messages
  Datadog::Tracing::Contrib::Karafka::MessagesPatch
Class: Karafka::Instrumentation::Monitor

Included modules:
  Datadog::Tracing::Contrib::Karafka::Monitor
  ActiveSupport::Dependencies::RequireDependency
  PP::ObjectMixin
  ActiveSupport::ToJsonWithActiveSupportEncoder
  ActiveSupport::Tryable
  JSON::Ext::Generator::GeneratorMethods::Object
  Kernel

Ancestors (superclasses and included modules):
  Datadog::Tracing::Contrib::Karafka::Monitor
  Karafka::Instrumentation::Monitor
  Karafka::Core::Monitoring::Monitor
  ActiveSupport::Dependencies::RequireDependency
  Object
  PP::ObjectMixin
  ActiveSupport::ToJsonWithActiveSupportEncoder
  ActiveSupport::Tryable
  JSON::Ext::Generator::GeneratorMethods::Object
  Kernel
  BasicObject

Extended modules (included in singleton class):
  ActiveSupport::DescendantsTracker::ReloadedClassesFiltering
  Zeitwerk::ConstAdded
  Module::Concerning
  ActiveSupport::Dependencies::RequireDependency
  PP::ObjectMixin
  ActiveSupport::ToJsonWithActiveSupportEncoder
  ActiveSupport::Tryable
  JSON::Ext::Generator::GeneratorMethods::Object
  Kernel

Prepended modules (appear before the class in ancestors):
  Datadog::Tracing::Contrib::Karafka::Monitor

with:

Datadog::Tracing::Contrib::Karafka::MessagesPatch
Datadog::Tracing::Contrib::Karafka::Monitor

being present.
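
For a quicker check than the full ancestor dump above, plain Ruby reflection gives the same answer (a minimal sketch; nothing here is Datadog-specific API):

# Select any Datadog modules patched into the class; an empty array means unpatched
Karafka::Messages::Messages.ancestors.select { |m| m.name.to_s.include?('Datadog') }
# => [Datadog::Tracing::Contrib::Karafka::MessagesPatch] when patched, [] otherwise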

Strech (Member) commented May 6, 2025

Hey @wahlg, thanks for reporting this.

I would like to ask you about the auto-instrumentation: do you have something like this in your Gemfile:

gem 'datadog', require: 'datadog/auto_instrument'

or just this in your code:

require 'datadog/auto_instrument'

wahlg (Author) commented May 6, 2025

Hi @Strech, this is the current entry in the Gemfile:

gem 'datadog', '~> 2.0', require: 'datadog/auto_instrument'

p-datadog (Member) commented:

I added the configuration described in this issue to the test app in https://github.com/p-datadog/karafka-test, and I do see what appears to be constant growth in the memory consumed by the server process. However:

Datadog.configure do |c|
  c.profiling.enabled = true
  c.service = 'service'
  c.version = ENV.fetch('GIT_SHA', 'none')
  c.env = Rails.env
  c.diagnostics.debug = false
  c.diagnostics.startup_logs.enabled = false
  c.telemetry.enabled = true
  c.tracing.log_injection = true
  c.tracing.enabled = true
  c.tracing.instrument :rails
  c.tracing.instrument :concurrent_ruby
  c.tracing.instrument :faraday, split_by_domain: true
  c.tracing.instrument :graphql, service_name: 'ruby-graphql'
  c.tracing.instrument :grpc, { distributed_tracing: true }
  c.tracing.instrument :pg, {
    comment_propagation: 'full',
  }
  c.tracing.instrument :rack
  c.tracing.instrument :ethon, split_by_domain: true
  c.logger.instance = Rails.logger
end

That turns on a lot of functionality, and if I, for example, remove profiling from this list, the growth in process memory size appears to slow down.

With the upstream instrumentation removed I think there is still continuous growth in process memory size but it is very slow, possibly just due to heap fragmentation?


mensfeld commented May 7, 2025

With the upstream instrumentation removed I think there is still continuous growth in process memory size but it is very slow, possibly just due to heap fragmentation?

Karafka has no known memory leaks, and I've been running 20-30k msg/s processing to debug other things lately. Heap fragmentation may occur, but it is very slow and stabilizes after a while (depending on the nature of the processing, payloads, etc.).

Though as far as I know, things are much worse with the updated (patched?) dd gem.


mensfeld commented May 7, 2025

My recent run at 10-20k msg/s increases memory usage by 9.5 MB over a period of 5 hours.

P.S. By no means am I saying it is not Karafka-related, of course, and if that is the case I will fix it ASAP.

p-datadog (Member) commented:

I do not work on the tracing part of the library and could be wrong, but perhaps the issue is this: https://github.com/DataDog/dd-trace-rb/blob/master/lib/datadog/tracing/contrib/karafka/monitor.rb#L21 uses the block form of trace, which I assume is guaranteed to clean up (finish) every span that is started. However, https://github.com/karafka/karafka/blob/master/lib/karafka/instrumentation/vendors/datadog/logger_listener.rb#L51 uses the non-block form, and https://github.com/karafka/karafka/blob/master/lib/karafka/instrumentation/vendors/datadog/logger_listener.rb#L79 may or may not finish the trace started on line 51. If two instrumentations are active (which I guess is expected when auto_instrument is required), perhaps the non-block span gets orphaned, which then causes a memory leak.
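
To illustrate the difference (a minimal sketch against the public Datadog::Tracing API; process_batch stands in for real work and is not from either codebase):

# Block form: the span is always finished, even if the block raises.
Datadog::Tracing.trace('karafka.consumer') do |span|
  process_batch
end

# Non-block form: the caller owns the span and must call #finish itself.
span = Datadog::Tracing.trace('karafka.consumer')
begin
  process_batch
ensure
  # If this finish is ever skipped, the span stays open and its trace
  # cannot be flushed, which accumulates memory over time.
  span.finish
end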

wahlg (Author) commented May 7, 2025

For some more context: the task where we run karafka has 2 GB of available memory. Prior to upgrading the datadog gem, when running karafka in production with all the other datadog instrumentation enabled (profiling, tracing of other dependencies, etc.), memory usage hovered around 40%.

On datadog gem version 2.14.0, if I enable karafka tracing through karafka.rb, memory grows linearly from the moment the task starts until it reaches 100% and the task is ultimately killed due to out-of-memory.

p-datadog (Member) commented:

gem 'datadog', '~> 2.0', require: 'datadog/auto_instrument'

This requests that datadog instrument everything it knows how to instrument, which includes karafka.

If you do not wish to instrument everything by default, you can remove the auto_instrument require and configure integrations manually via Datadog.configure.
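
A sketch of the manual approach (the integration list just mirrors the configuration block from this issue):

# Gemfile: plain require, no auto-instrumentation
gem 'datadog', '~> 2.0'

# Initializer: only the integrations listed here get patched;
# :karafka stays unpatched because it is never mentioned.
Datadog.configure do |c|
  c.tracing.enabled = true
  c.tracing.instrument :rails
  c.tracing.instrument :faraday, split_by_domain: true
  # ...and so on for the rest of the block shown earlier
end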

I ran the module-inspection code above, and it does not show any Datadog::* modules with

 ruby -Ilib -rbundler -rkarafka -rdatadog test.rb 

but it does show the output provided above, with the Datadog modules added to Karafka, with

 ruby -Ilib -rbundler -rkarafka -rdatadog/auto_instrument test.rb 

The bundler require is needed due to karafka/karafka#2590. Since it looks like karafka is bringing in Rails, I also had to manually bypass the Rails-delayed activation with this patch:

diff --git a/lib/datadog/tracing/contrib/auto_instrument.rb b/lib/datadog/tracing/contrib/auto_instrument.rb
index 23988c2e2f..76a4e74412 100644
--- a/lib/datadog/tracing/contrib/auto_instrument.rb
+++ b/lib/datadog/tracing/contrib/auto_instrument.rb
@@ -12,7 +12,7 @@ module Datadog
         require_relative '../../core/contrib/rails/utils'
 
         # Defer to Rails if this is a Rails application
-        if Datadog::Core::Contrib::Rails::Utils.railtie_supported?
+        if Datadog::Core::Contrib::Rails::Utils.railtie_supported? && false
           require_relative 'rails/auto_instrument_railtie'
         else
           AutoInstrument.patch_all!


wahlg (Author) commented May 7, 2025

@p-datadog that makes sense. Is there a way to enable auto-instrumentation but exclude specific items? Or is there a way to list everything that is enabled via auto-instrumentation that might not be explicitly configured in the datadog initializer? I just want to make sure I don't inadvertently disable something that was previously enabled.
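
To my knowledge there is no built-in "auto-instrument everything except X" switch. For listing candidates, one debugging-only option (a sketch against dd-trace-rb internals; Datadog::Tracing::Contrib::REGISTRY is an internal constant whose entries I'm assuming respond to #name, so treat this as an assumption, not stable API):

# Names of every integration that auto_instrument could patch
Datadog::Tracing::Contrib::REGISTRY.map(&:name).sort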

wahlg (Author) commented May 8, 2025

Currently, with the datadog auto-instrumentation enabled and the karafka.rb datadog trace configuration disabled, we are seeing normal memory usage and traces being recorded for the karafka consumer processes. So I think we are in good shape now.

Thanks all for the help on this issue


mensfeld commented May 9, 2025

@p-datadog in that case, WDYT about helping me figure out how to detect that the DD karafka instrumentation is in use, so I can issue a warning to users when they use both?

Strech (Member) commented May 12, 2025

@mensfeld You can use the Karafka patcher interface to check whether it was patched or not:

Datadog::Tracing::Contrib::Karafka::Patcher.patched? # => true/false
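
A sketch of how a warning could be wired up on the Karafka side (the defined? guard and the message text are illustrative, not existing karafka code):

# Warn when both the dd-trace-rb karafka integration and karafka's own
# Datadog LoggerListener are active at the same time.
if defined?(::Datadog::Tracing::Contrib::Karafka::Patcher) &&
   ::Datadog::Tracing::Contrib::Karafka::Patcher.patched?
  Karafka.logger.warn(
    'dd-trace-rb karafka instrumentation is active alongside the ' \
    'Datadog LoggerListener; running both may orphan spans.'
  )
end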

@wahlg We already have some discussion around auto-instrument and configuration compatibility. But technically, you should not experience high memory usage in the first place, no matter how you instrument. I think we will launch an investigation into this contrib (but I can't guarantee its priority).

Glad that the issue is resolved, but I would not close the ticket, so we can keep tracking it.

p-datadog (Member) commented:

@mensfeld What about storing the span created in https://github.com/karafka/karafka/blob/master/lib/karafka/instrumentation/vendors/datadog/logger_listener.rb#L51 in thread/fiber-local storage so that it can be finished later? Would that work?
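
A rough sketch of that idea (the handler names and the storage key are illustrative, not the actual LoggerListener code; client is the listener's configured tracing client, Datadog::Tracing in the setup above, and note that Thread.current[] is fiber-local in Ruby):

def on_worker_process(_event)
  # Start the span without a block and stash it for the matching event
  Thread.current[:dd_karafka_span] = client.trace('karafka.consumer')
end

def on_worker_processed(_event)
  # Finish whatever span this worker started, then clear the slot
  Thread.current[:dd_karafka_span]&.finish
  Thread.current[:dd_karafka_span] = nil
end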
