
Issue with ghcr.io/huggingface/text-embeddings-inference:turing-1.7 Version (Turing Version for T4 GPU) for BAAI/bge-m3 #670

@NavKumarGit

Description


System Info

When loading the BAAI/bge-m3 model with the 1.7 version of the image, I always get a CUDA out-of-memory error, even though there is enough free memory (10 GB) on the GPU. This happens when I already have other models loaded on the same GPU, despite there being enough memory available for this model. If I instead load other, bigger models, they work fine.
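For reference, this is how I check that roughly 10 GB are still free on device 0 before starting the container (a plain nvidia-smi query; the device index matches the --gpus flag used below):

nvidia-smi --id=0 --query-gpu=memory.used,memory.free,memory.total --format=csv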

Attaching the error message:

docker run --gpus '"device=0"' -p 9050:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:turing-1.7 --model-id $model
turing-1.7: Pulling from huggingface/text-embeddings-inference
Digest: sha256:f0a865b76d7b2229cbb68d5f6a7881c225d65539a9aace3fbd5e7c1577ed987d
Status: Image is up to date for ghcr.io/huggingface/text-embeddings-inference:turing-1.7
2025-07-03T05:26:45.269542Z INFO text_embeddings_router: router/src/main.rs:189: Args { model_id: "BAA*/**e-m3", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "64f48fad3ca4", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-07-03T05:26:45.369607Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:20: Starting download
2025-07-03T05:26:45.369629Z INFO download_artifacts:download_pool_config: text_embeddings_core::download: core/src/download.rs:53: Downloading 1_Pooling/config.json
2025-07-03T05:26:45.369722Z INFO download_artifacts:download_new_st_config: text_embeddings_core::download: core/src/download.rs:77: Downloading config_sentence_transformers.json
2025-07-03T05:26:45.369744Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:40: Downloading config.json
2025-07-03T05:26:45.369763Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:43: Downloading tokenizer.json
2025-07-03T05:26:45.369816Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:47: Model artifacts downloaded in 212.14µs
2025-07-03T05:26:46.055755Z INFO text_embeddings_router: router/src/lib.rs:193: Maximum number of tokens per request: 8192
2025-07-03T05:26:46.055961Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 8 tokenization workers
2025-07-03T05:26:48.592163Z INFO text_embeddings_router: router/src/lib.rs:235: Starting model backend
2025-07-03T05:26:48.592887Z INFO text_embeddings_backend: backends/src/lib.rs:510: Downloading model.safetensors
2025-07-03T05:26:53.149473Z WARN text_embeddings_backend: backends/src/lib.rs:513: Could not download model.safetensors: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/BAAI/bge-m3/resolve/main/model.safetensors)
2025-07-03T05:26:53.149494Z INFO text_embeddings_backend: backends/src/lib.rs:518: Downloading model.safetensors.index.json
2025-07-03T05:26:53.284296Z WARN text_embeddings_backend: backends/src/lib.rs:386: safetensors weights not found. Using pytorch_model.bin instead. Model loading will be significantly slower.
2025-07-03T05:26:53.284314Z INFO text_embeddings_backend: backends/src/lib.rs:387: Downloading pytorch_model.bin
2025-07-03T05:26:53.284382Z INFO text_embeddings_backend: backends/src/lib.rs:394: Model weights downloaded in 4.691498035s
2025-07-03T05:26:53.910843Z INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:353: Starting Bert model on Cuda(CudaDevice(DeviceId(1)))
2025-07-03T05:27:06.592942Z INFO text_embeddings_router: router/src/lib.rs:252: Warming up model
Error: Model backend is not healthy

Caused by:
DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")
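The failure happens at the "Warming up model" step, so my guess is that the warmup batch (sized by max_batch_tokens: 16384 in the Args line above) needs a large allocation that the partially used GPU can no longer provide. As an untested workaround sketch, lowering --max-batch-tokens should shrink that allocation (4096 here is just an example value):

docker run --gpus '"device=0"' -p 9050:80 -v $volume:/data --pull always \
    ghcr.io/huggingface/text-embeddings-inference:turing-1.7 \
    --model-id $model --max-batch-tokens 4096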

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

1. Use a T4 GPU.
2. Load a few other models on the same GPU.
3. Load the bge-m3 model with the turing-1.7 version of the Docker image (see the sketch below).
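A minimal end-to-end sketch of the steps above (the first model is just an example; any model that still leaves ~10 GB free should do):

# Step 2: occupy part of the GPU with another model first (example model id)
docker run -d --gpus '"device=0"' -p 9040:80 -v $volume:/data \
    ghcr.io/huggingface/text-embeddings-inference:turing-1.7 \
    --model-id BAAI/bge-large-en-v1.5

# Step 3: this container then fails with CUDA_ERROR_OUT_OF_MEMORY during warmup
docker run --gpus '"device=0"' -p 9050:80 -v $volume:/data \
    ghcr.io/huggingface/text-embeddings-inference:turing-1.7 \
    --model-id BAAI/bge-m3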

Expected behavior

The model should load without errors.
