Native Memory Leak with DJL (0.32.0) + PyTorch (2.5.1) in Long-running JVM Application #3708
Replies: 5 comments
-
If you have a minimal reproducible case, I can take a look. Native memory leaks are usually caused by resources that are not closed properly. Even if you close your own NDManager, it's possible that some resources leak into the parent NDManager. Are you using your own Translator, or are you using NDArray directly? We recommend following the Predictor/Translator pattern; it makes the resources easier to track.
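For reference, a minimal sketch of that pattern could look like the following (the FloatTranslator name and the float[] input/output types are illustrative assumptions, not taken from the reporter's code):

```java
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDList;
import ai.djl.translate.Batchifier;
import ai.djl.translate.Translator;
import ai.djl.translate.TranslatorContext;

// Hypothetical translator for a model that takes a float[] and returns a float[].
public class FloatTranslator implements Translator<float[], float[]> {

    @Override
    public NDList processInput(TranslatorContext ctx, float[] input) {
        // NDArrays created from ctx.getNDManager() belong to a per-inference
        // manager that the Predictor closes automatically after predict().
        NDArray array = ctx.getNDManager().create(input);
        return new NDList(array);
    }

    @Override
    public float[] processOutput(TranslatorContext ctx, NDList list) {
        // Copy the result onto the Java heap so no NDArray escapes the manager.
        return list.singletonOrThrow().toFloatArray();
    }

    @Override
    public Batchifier getBatchifier() {
        // No batching in this sketch.
        return null;
    }
}
```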
-
Memory not released until process exit despite closing all DJL objects.
I'm working with DJL on a small example, and I've encountered an issue where memory is not released immediately after closing all resources such as Predictor, ZooModel, and NDManager. Instead, the memory is released only after the entire process (JVM) exits.
Even though the Predictor and Model are properly closed, the GPU memory usage (checked via nvidia-smi) stays allocated until the process exits. This can cause issues in long-running or reusable components. Is there a recommended way to force a full native memory release, or is this expected behavior due to CUDA context management? Thanks!
-
The code looks fine, although it's not efficient. You can load the model, create the predictor, and run the loop, then close the model when the program exits. I don't think there is a native memory leak. A few things you may need to be aware of:
Your application's OOM may be caused by another leak that is unrelated to the code you showed here.
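As a rough sketch of that structure (the model location is a placeholder and FloatTranslator is the hypothetical translator from the sketch above):

```java
import ai.djl.inference.Predictor;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

public final class LongRunningInference {
    public static void main(String[] args) throws Exception {
        Criteria<float[], float[]> criteria = Criteria.builder()
                .setTypes(float[].class, float[].class)
                .optModelUrls("file:///path/to/model")  // placeholder location
                .optTranslator(new FloatTranslator())   // hypothetical translator from above
                .build();

        // Load the model and create the predictor once, reuse them for every
        // request, and close them only when the program shuts down.
        try (ZooModel<float[], float[]> model = criteria.loadModel();
             Predictor<float[], float[]> predictor = model.newPredictor()) {
            for (int i = 0; i < 1000; i++) {
                float[] result = predictor.predict(new float[] {1f, 2f, 3f});
                // consume result ...
            }
        }
    }
}
```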
-
Thank you for the information! We're building a long-running application using DJL (Deep Java Library), and our goal is to load and unload different models dynamically, without restarting the process. Based on your explanation, we understand that PyTorch (via DJL) and CUDA manage memory through internal pools and caching, so memory might not be released back to the system even after the model and NDArray instances are closed. However, we are seeing memory build-up over time, and we're looking for ways to release GPU memory between model runs to reduce peak usage and avoid fragmentation.
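For concreteness, the load/unload cycle we have in mind looks roughly like the sketch below (placeholder model URLs and the hypothetical FloatTranslator from the earlier sketch, not our actual code):

```java
import java.util.List;

import ai.djl.inference.Predictor;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

public final class DynamicModelCycler {

    // Loads each model in turn, runs one prediction, and closes it again.
    // Even with every handle closed, the PyTorch caching allocator may keep
    // the freed GPU blocks inside the process rather than returning them to
    // the driver, which shows up as memory build-up in nvidia-smi.
    public static void runAll(List<String> modelUrls) throws Exception {
        for (String url : modelUrls) {              // placeholder model locations
            Criteria<float[], float[]> criteria = Criteria.builder()
                    .setTypes(float[].class, float[].class)
                    .optModelUrls(url)
                    .optTranslator(new FloatTranslator()) // hypothetical translator from above
                    .build();
            try (ZooModel<float[], float[]> model = criteria.loadModel();
                 Predictor<float[], float[]> predictor = model.newPredictor()) {
                float[] out = predictor.predict(new float[] {1f, 2f, 3f});
                System.out.println(url + " -> " + out.length + " outputs");
            }
            // The ZooModel and Predictor are closed here; any remaining GPU
            // usage belongs to the CUDA context and the caching allocator.
        }
    }
}
```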
-
If you keep loading and unloading models, you may experience OOM; this is caused by fragmentation of the PyTorch memory pool. Say you can load 20 models and run predictions without any issue: once you unload a model, you may not be able to load it back. So the only workaround is to load at most 15 models or so (you need to figure this number out for your setup). I know some customers are doing this.
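A rough sketch of that workaround might look like the following (the BoundedModelLoader class and the cap value are illustrative assumptions, not something DJL provides):

```java
import java.util.concurrent.Semaphore;

import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

// Hypothetical guard that caps how many models may be loaded at once; the cap
// has to be determined empirically for your models and GPU.
public final class BoundedModelLoader {

    private final Semaphore slots;

    public BoundedModelLoader(int maxLoadedModels) {
        this.slots = new Semaphore(maxLoadedModels);
    }

    public <I, O> ZooModel<I, O> load(Criteria<I, O> criteria) throws Exception {
        slots.acquire();                    // block if the cap is reached
        try {
            return criteria.loadModel();
        } catch (Exception e) {
            slots.release();                // loading failed, free the slot
            throw e;
        }
    }

    public void release(ZooModel<?, ?> model) {
        model.close();                      // frees DJL handles; GPU blocks stay pooled
        slots.release();
    }
}
```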
-
Hello,
We're using DJL 0.32.0 with the PyTorch 2.5.1 engine in a component integrated into a long-running JVM application. The component performs repeated inference operations over the application's lifetime.
We're experiencing a serious issue with native memory growth: after each inference, off-heap/native memory usage increases gradually until it reaches system limits and the application crashes due to out-of-native-memory conditions (OOM), even though the JVM heap remains stable.
Environment
DJL: 0.32.0
Engine: PyTorch 2.5.1
Runtime: Long-running server-side JVM application
Hardware: Issue occurs on both CPU and GPU (CUDA) configurations
What we've tried
We're using NDManager properly with try-with-resources or manual close() (see the sketch after this list).
We avoid keeping NDArray instances beyond the lifetime of their NDManager.
Still, native memory continues to grow over time with each inference.
No Java heap leaks are observed — the problem is strictly off-heap/native memory.
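To make the scoping concrete, it is roughly the pattern in this simplified sketch (an illustration rather than our actual production code):

```java
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;

public final class ScopedInferenceExample {

    // Simplified sketch of the scoping described above: a child manager is
    // opened per inference and closed right away, so every NDArray attached
    // to it is released before the next call. Only plain Java arrays escape.
    static float[] runOnce(NDManager rootManager, float[] data) {
        try (NDManager scope = rootManager.newSubManager()) {
            NDArray input = scope.create(data);
            NDArray doubled = input.mul(2); // attached to scope, freed on close
            return doubled.toFloatArray();
        }
    }

    public static void main(String[] args) {
        try (NDManager root = NDManager.newBaseManager()) {
            System.out.println(runOnce(root, new float[] {1f, 2f, 3f})[0]);
        }
    }
}
```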
Questions
Is it expected that the DJL PyTorch engine holds persistent native allocations that are not cleared by emptyCache()?
Thanks
Kamil