
Docker inference: Significant Performance Degradation on First Run #3017

Open
@jugal-sheth

Description

Hello everyone,

I hope this message finds you well. I'm encountering a significant performance degradation on the first run of the whisper-cli application compared to subsequent runs.

Issue Description
When running the application for the first time after starting or rebuilding the container, there is a noticeable slowdown. Subsequent runs are significantly faster, which suggests that caching mechanisms or background processes are affecting the first run. The question is: how do I make sure I get consistent performance?

Note: an AI assistant suggested committing the container itself, but I am not a fan of that solution; I would like to handle everything at build time.

Platform: aarch64 Linux

Steps to Reproduce

  1. Build the Docker container.
  2. Build the library with CMake 3.30.2:
     cmake -B build -DGGML_CUDA=1
     cmake --build build -j --config Release
  3. Download the Whisper models.
  4. Execute the custom script twice: ./run.sh && ./run.sh

Expected Behavior
Consistent execution time for both runs; the first run should not take significantly longer than subsequent runs.

Actual Behavior
The first run takes much longer to execute; subsequent runs are much faster. This repeats every time I restart the Docker container.
Below is my run.sh:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/whisper.cpp/build/ggml/src:/whisper.cpp/build/ggml/src/ggml-cuda/:/whisper.cpp/build/src
# Define the command
COMMAND="./build/bin/whisper-cli -m models/ggml-medium.en.bin -f samples/jfk.wav -fa -pc -np"

# Record the start time
START_TIME=$(date +%s.%N)

# Execute the command
$COMMAND

# Record the end time
END_TIME=$(date +%s.%N)

# Calculate the execution time
EXECUTION_TIME=$(echo "$END_TIME - $START_TIME" | bc)

# Print the execution time
echo "Execution Time: ${EXECUTION_TIME}s"

And below are the results from inside the container:


[00:00:00.000 --> 00:00:03.000]   And so, my fellow Americans,
[00:00:03.000 --> 00:00:08.000]   ask not what your country can do for you,
[00:00:08.000 --> 00:00:11.000]   ask what you can do for your country.

Execution Time: 478.130996608s

[00:00:00.000 --> 00:00:03.000]   And so, my fellow Americans,
[00:00:03.000 --> 00:00:08.000]   ask not what your country can do for you,
[00:00:08.000 --> 00:00:11.000]   ask what you can do for your country.

Execution Time: 5.049899231s
root@jugal:/whisper.cpp# ./run.sh && ./run.sh 

[00:00:00.000 --> 00:00:03.000]   And so, my fellow Americans,
[00:00:03.000 --> 00:00:08.000]   ask not what your country can do for you,
[00:00:08.000 --> 00:00:11.000]   ask what you can do for your country.

Execution Time: 4.822853356s

[00:00:00.000 --> 00:00:03.000]   And so, my fellow Americans,
[00:00:03.000 --> 00:00:08.000]   ask not what your country can do for you,
[00:00:08.000 --> 00:00:11.000]   ask what you can do for your country.

Execution Time: 4.710287755s

Request for Help
I would appreciate any insights, suggestions, or best practices from the community to further optimize the performance of my application during the first run. Any recommendations for profiling tools, caching strategies, or Docker configurations would be greatly appreciated!
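One caching angle that may be relevant here (a sketch only, not verified on my setup): CUDA JIT-compiles PTX kernels on first use and stores the result in a compute cache, which by default lives under ~/.nv/ComputeCache and is lost when the container is recreated. CUDA's documented environment variables can relocate and enlarge that cache; the path below is a placeholder and would need to be a volume mount to survive container restarts:

```shell
# Relocate the CUDA JIT compute cache to a path that can be volume-mounted.
# The directory below is a placeholder, not a path from my setup.
export CUDA_CACHE_PATH="$HOME/.cuda_cache"
# Raise the cache size limit (in bytes) so larger kernels are not evicted.
export CUDA_CACHE_MAXSIZE=2147483648
mkdir -p "$CUDA_CACHE_PATH"
```

If the first-run cost really is PTX JIT compilation, sourcing something like this before whisper-cli (with the cache directory persisted) should make only the very first run ever slow, rather than the first run after every restart.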

Humble Note
I used FROM nvcr.io/nvidia/l4t-pytorch:r35.1.0-pth1.13-py3 to build the image; if I should use another base, please recommend one. I tried nvidia/cuda:11.4.2-devel-ubuntu20.04, but it seems to have an issue with nvcc. I am quite new to this and might be missing something obvious. I'm looking for guidance on how to improve performance and ensure that the initial run is not significantly slower than subsequent runs.
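For reference, a minimal sketch of the kind of build-time-only setup I have in mind (the architecture value and the warm-up step are assumptions, not tested: 72 targets Jetson Xavier, 87 would target Orin; paths follow the steps in this post). Compiling for the exact GPU architecture should avoid runtime PTX JIT entirely:

```dockerfile
# Sketch only: base image taken from this post; warm-up RUN is an untested idea.
FROM nvcr.io/nvidia/l4t-pytorch:r35.1.0-pth1.13-py3

WORKDIR /whisper.cpp
COPY . .

# Build cubins for the exact GPU so no PTX JIT compilation happens at runtime.
# 72 = Xavier (assumption for this board); use 87 for Orin.
RUN cmake -B build -DGGML_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=72 && \
    cmake --build build -j --config Release

# Optional warm-up during the build (requires GPU access at build time,
# e.g. nvidia set as the default container runtime on Jetson):
# RUN ./build/bin/whisper-cli -m models/ggml-medium.en.bin -f samples/jfk.wav -fa
```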

Thank you in advance for your help and support.

Best regards,
