Description
System Info
While loading the BAAI/bge-m3 model with version 1.7 of the image, I always get a CUDA out-of-memory error, even though there is enough free memory (10 GB) left on the GPU. This happens when I already have other models loaded on the same GPU, despite there being enough memory available for this model. Other, larger models load fine under the same conditions.
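For context, the 10 GB free-memory figure can be confirmed directly with a standard nvidia-smi query (a minimal check, nothing TEI-specific; device 0 is the one passed to Docker below):

```
# Report free/used memory per GPU in CSV form.
nvidia-smi --query-gpu=index,memory.free,memory.used --format=csv
```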
Attaching the full command and error output:
```
docker run --gpus '"device=0"' -p 9050:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:turing-1.7 --model-id $model
turing-1.7: Pulling from huggingface/text-embeddings-inference
Digest: sha256:f0a865b76d7b2229cbb68d5f6a7881c225d65539a9aace3fbd5e7c1577ed987d
Status: Image is up to date for ghcr.io/huggingface/text-embeddings-inference:turing-1.7
2025-07-03T05:26:45.269542Z INFO text_embeddings_router: router/src/main.rs:189: Args { model_id: "BAA*/**e-m3", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "64f48fad3ca4", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-07-03T05:26:45.369607Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:20: Starting download
2025-07-03T05:26:45.369629Z INFO download_artifacts:download_pool_config: text_embeddings_core::download: core/src/download.rs:53: Downloading 1_Pooling/config.json
2025-07-03T05:26:45.369722Z INFO download_artifacts:download_new_st_config: text_embeddings_core::download: core/src/download.rs:77: Downloading config_sentence_transformers.json
2025-07-03T05:26:45.369744Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:40: Downloading config.json
2025-07-03T05:26:45.369763Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:43: Downloading tokenizer.json
2025-07-03T05:26:45.369816Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:47: Model artifacts downloaded in 212.14µs
2025-07-03T05:26:46.055755Z INFO text_embeddings_router: router/src/lib.rs:193: Maximum number of tokens per request: 8192
2025-07-03T05:26:46.055961Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 8 tokenization workers
2025-07-03T05:26:48.592163Z INFO text_embeddings_router: router/src/lib.rs:235: Starting model backend
2025-07-03T05:26:48.592887Z INFO text_embeddings_backend: backends/src/lib.rs:510: Downloading model.safetensors
2025-07-03T05:26:53.149473Z WARN text_embeddings_backend: backends/src/lib.rs:513: Could not download model.safetensors
: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/BAAI/bge-m3/resolve/main/model.safetensors)
2025-07-03T05:26:53.149494Z INFO text_embeddings_backend: backends/src/lib.rs:518: Downloading model.safetensors.index.json
2025-07-03T05:26:53.284296Z WARN text_embeddings_backend: backends/src/lib.rs:386: safetensors weights not found. Using pytorch_model.bin instead. Model loading will be significantly slower.
2025-07-03T05:26:53.284314Z INFO text_embeddings_backend: backends/src/lib.rs:387: Downloading pytorch_model.bin
2025-07-03T05:26:53.284382Z INFO text_embeddings_backend: backends/src/lib.rs:394: Model weights downloaded in 4.691498035s
2025-07-03T05:26:53.910843Z INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:353: Starting Bert model on Cuda(CudaDevice(DeviceId(1)))
2025-07-03T05:27:06.592942Z INFO text_embeddings_router: router/src/lib.rs:252: Warming up model
Error: Model backend is not healthy
Caused by:
DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")
```
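The failure happens immediately after "Warming up model", and the Args dump above shows max_batch_tokens: 16384, so my guess (unverified) is that the warmup pass tries to allocate activation buffers for a full 16384-token batch rather than failing on the weights themselves. A sketch of one thing to try, reusing the same command with a smaller, purely illustrative `--max-batch-tokens` value (the flag is listed in the Args output above):

```
# Same launch command, but cap the batch size the warmup has to allocate for.
# 4096 is an illustrative value, not a recommendation.
docker run --gpus '"device=0"' -p 9050:80 -v $volume:/data --pull always \
  ghcr.io/huggingface/text-embeddings-inference:turing-1.7 \
  --model-id $model \
  --max-batch-tokens 4096
```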
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
1. Use a T4 GPU.
2. Load a few other models on the same GPU.
3. Load the BAAI/bge-m3 model with the turing-1.7 version of the Docker image (a minimal script sketching these steps follows below).
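A minimal repro sketch, assuming the other models are also served from TEI containers; the `<other-model-1>`/`<other-model-2>` IDs, host ports, and `$volume` are placeholders, and any process that occupies GPU memory on device 0 should set up the same condition:

```
# Step 2: occupy part of device 0 with other model servers first.
# <other-model-1> and <other-model-2> are hypothetical placeholders.
docker run -d --gpus '"device=0"' -p 9051:80 -v $volume:/data \
  ghcr.io/huggingface/text-embeddings-inference:turing-1.7 --model-id <other-model-1>
docker run -d --gpus '"device=0"' -p 9052:80 -v $volume:/data \
  ghcr.io/huggingface/text-embeddings-inference:turing-1.7 --model-id <other-model-2>

# Step 3: with ~10 GB still free, try to load bge-m3 on the same device;
# this is the run that fails with CUDA_ERROR_OUT_OF_MEMORY.
docker run --gpus '"device=0"' -p 9050:80 -v $volume:/data --pull always \
  ghcr.io/huggingface/text-embeddings-inference:turing-1.7 --model-id BAAI/bge-m3
```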
Expected behavior
The model should load without errors.