Replies: 4 comments 5 replies
-
@masseyke, thanks so much for the report and the reproduction. As you very well know, concurrency is hard, and troubleshooting concurrency issues is even harder. @ppkarwasz and I have glanced over the issue. We will try to see if we can spare some time for this.
-
@masseyke, for the record, we have a dedicated page for ELK in the Log4j documentation and a …
-
Do you have some platform information for this? I'm not seeing a deadlock myself, but I do see a lot of CPU time spent in …
-
I have investigated this deadlock issue and would like to share a summary of my analysis and results.

Summary

The issue is resolved in version 2.25.1.

1. Root Cause Analysis (on v2.19.0)

The deadlock occurs from a conflict between two locks: a monitor lock and a ReentrantLock. It happens when two threads attempt to acquire these locks in the exact opposite order.

Path A: Monitor Lock → ReentrantLock

This path is triggered when a rollover occurs and throws an exception.
Click to see relevant stack trace for Path A
Path B: ReentrantLock → Monitor Lock

This path is triggered by a standard …
Click to see relevant stack trace for Path B
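To make the ordering conflict concrete, here is a minimal, self-contained sketch of the same pattern in plain Java. This is not Log4j code; the names `monitor` and `rolloverLock` are placeholders I chose, standing in for the ByteBufferDestination monitor and the appender's ReentrantLock.

```java
import java.util.concurrent.locks.ReentrantLock;

/**
 * Minimal illustration of the opposite-order lock acquisition described above.
 * All names are placeholders, not Log4j internals.
 */
public class LockOrderDeadlockSketch {

    private static final Object monitor = new Object();                    // stands in for the destination monitor
    private static final ReentrantLock rolloverLock = new ReentrantLock(); // stands in for the appender's ReentrantLock

    public static void main(String[] args) {
        // Path A: monitor lock first, then the ReentrantLock.
        Thread pathA = new Thread(() -> {
            synchronized (monitor) {
                sleep(100); // widen the race window
                rolloverLock.lock(); // blocks forever once Path B holds the ReentrantLock
                try {
                    System.out.println("Path A acquired both locks");
                } finally {
                    rolloverLock.unlock();
                }
            }
        }, "path-A");

        // Path B: ReentrantLock first, then the monitor lock.
        Thread pathB = new Thread(() -> {
            rolloverLock.lock();
            try {
                sleep(100);
                synchronized (monitor) { // blocks forever once Path A holds the monitor
                    System.out.println("Path B acquired both locks");
                }
            } finally {
                rolloverLock.unlock();
            }
        }, "path-B");

        pathA.start();
        pathB.start();
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With the sleeps widening the race window, the two threads almost always hang, and a kill -3 thread dump on this toy program reports the same kind of mixed monitor/ReentrantLock deadlock.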
In TextEncoderHelper.java:
```java
private static void writeEncodedText(..., final ByteBufferDestination destination, ...) {
    ...
    result = charsetEncoder.flush(byteBuf);
    if (!result.isUnderflow()) {
        synchronized (destination) { // <-- DEADLOCK POINT: tries to acquire the Monitor Lock held by Path A
            flushRemainingBytes(charsetEncoder, destination, byteBuf);
        }
    }
    ...
}
```

When Path A and Path B execute concurrently, a deadlock is guaranteed.

2. Verification on v2.25.1

I re-ran the exact same test code against version 2.25.1 and can confirm that the deadlock no longer occurs. The application runs to completion successfully. It seems that the removal of …

I'd appreciate any feedback on this analysis. Thanks!
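P.S. One generic mitigation for this class of lock-ordering conflict, independent of whatever 2.25.1 actually changed (I have not traced the fix itself), is to take the second lock with a bounded tryLock and back off instead of blocking forever. A rough sketch, with names of my own choosing:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Generic back-off pattern for lock-ordering conflicts.
 * This is NOT the change made in Log4j 2.25.1; it only illustrates
 * refusing to block indefinitely on the second lock.
 */
public final class BackoffLocking {

    private final Object monitor = new Object();
    private final ReentrantLock lock = new ReentrantLock();

    /** Runs the action while holding both locks, backing off on contention. */
    public boolean runWithBothLocks(Runnable action) throws InterruptedException {
        synchronized (monitor) {
            // Bound the wait on the second lock instead of blocking forever.
            if (lock.tryLock(500, TimeUnit.MILLISECONDS)) {
                try {
                    action.run();
                    return true;
                } finally {
                    lock.unlock();
                }
            }
        }
        return false; // caller decides whether to retry or surface an error
    }
}
```

Backing off does not remove the ordering conflict itself, but it turns a hard deadlock into a recoverable failure that the caller can retry or report.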
-
An Elasticsearch user reported a deadlock in Elasticsearch that happens occasionally (elastic/elasticsearch#131404). When this happens, they have to restart their Elasticsearch node. It looks like log4j is the source of the deadlock, and I have managed to reproduce it (or something very close to it) using only log4j. The conditions required are a little strange, but they all happen to occur inside Elasticsearch. They are:
Below is a test that does all of the above and causes the deadlock every time. If you run `kill -3` on the process, you'll see that it reports `Found 1 deadlock.`
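If sending `kill -3` by hand is inconvenient, the same check can be made from test or monitoring code through the standard java.lang.management API; a small sketch (the class and method names are mine):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/**
 * Prints any deadlocked threads in the current JVM, giving the same
 * information as the "Found 1 deadlock." section of a kill -3 thread dump.
 */
public final class DeadlockProbe {

    public static void dumpDeadlocks() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] ids = bean.findDeadlockedThreads(); // null when no deadlock exists
        if (ids == null) {
            System.out.println("No deadlock detected.");
            return;
        }
        System.out.println("Found " + ids.length + " deadlocked thread(s):");
        for (ThreadInfo info : bean.getThreadInfo(ids, true, true)) {
            System.out.println(info);
        }
    }
}
```

`findDeadlockedThreads()` covers both object monitors and ownable synchronizers such as ReentrantLock, so it catches the mixed deadlock described above.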