Description
System Info
While loading the BAAI/bge-m3 model with version 1.7 of the image, I always get a CUDA out-of-memory error, even though there is enough free memory (10 GB) left on the GPU. This happens when I already have other models loaded on the same GPU, despite there being enough memory available for this model. Other, larger models load fine under the same conditions.
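For context, the 10 GB free-memory figure can be confirmed directly with a standard nvidia-smi query (a minimal check, nothing TEI-specific; device 0 is the one passed to Docker below):

```
# Report free/used memory per GPU in CSV form.
nvidia-smi --query-gpu=index,memory.free,memory.used --format=csv
```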
Attaching the full command and error output:
```
docker run --gpus '"device=0"' -p 9050:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:turing-1.7 --model-id $model
turing-1.7: Pulling from huggingface/text-embeddings-inference
Digest: sha256:f0a865b76d7b2229cbb68d5f6a7881c225d65539a9aace3fbd5e7c1577ed987d
Status: Image is up to date for ghcr.io/huggingface/text-embeddings-inference:turing-1.7
2025-07-03T05:26:45.269542Z INFO text_embeddings_router: router/src/main.rs:189: Args { model_id: "BAA*/**e-m3", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "64f48fad3ca4", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-07-03T05:26:45.369607Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:20: Starting download
2025-07-03T05:26:45.369629Z INFO download_artifacts:download_pool_config: text_embeddings_core::download: core/src/download.rs:53: Downloading 1_Pooling/config.json
2025-07-03T05:26:45.369722Z INFO download_artifacts:download_new_st_config: text_embeddings_core::download: core/src/download.rs:77: Downloading config_sentence_transformers.json
2025-07-03T05:26:45.369744Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:40: Downloading config.json
2025-07-03T05:26:45.369763Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:43: Downloading tokenizer.json
2025-07-03T05:26:45.369816Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:47: Model artifacts downloaded in 212.14µs
2025-07-03T05:26:46.055755Z INFO text_embeddings_router: router/src/lib.rs:193: Maximum number of tokens per request: 8192
2025-07-03T05:26:46.055961Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 8 tokenization workers
2025-07-03T05:26:48.592163Z INFO text_embeddings_router: router/src/lib.rs:235: Starting model backend
2025-07-03T05:26:48.592887Z INFO text_embeddings_backend: backends/src/lib.rs:510: Downloading model.safetensors
2025-07-03T05:26:53.149473Z WARN text_embeddings_backend: backends/src/lib.rs:513: Could not download model.safetensors
: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/BAAI/bge-m3/resolve/main/model.safetensors)
2025-07-03T05:26:53.149494Z INFO text_embeddings_backend: backends/src/lib.rs:518: Downloading model.safetensors.index.json
2025-07-03T05:26:53.284296Z WARN text_embeddings_backend: backends/src/lib.rs:386: safetensors weights not found. Using pytorch_model.bin instead. Model loading will be significantly slower.
2025-07-03T05:26:53.284314Z INFO text_embeddings_backend: backends/src/lib.rs:387: Downloading pytorch_model.bin
2025-07-03T05:26:53.284382Z INFO text_embeddings_backend: backends/src/lib.rs:394: Model weights downloaded in 4.691498035s
2025-07-03T05:26:53.910843Z INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:353: Starting Bert model on Cuda(CudaDevice(DeviceId(1)))
2025-07-03T05:27:06.592942Z INFO text_embeddings_router: router/src/lib.rs:252: Warming up model
Error: Model backend is not healthy
Caused by:
DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")
```
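The failure happens immediately after "Warming up model", and the Args dump above shows max_batch_tokens: 16384, so my guess (unverified) is that the warmup pass tries to allocate activation buffers for a full 16384-token batch rather than failing on the weights themselves. A sketch of one thing to try, reusing the same command with a smaller, purely illustrative `--max-batch-tokens` value (the flag is listed in the Args output above):

```
# Same launch command, but cap the batch size the warmup has to allocate for.
# 4096 is an illustrative value, not a recommendation.
docker run --gpus '"device=0"' -p 9050:80 -v $volume:/data --pull always \
  ghcr.io/huggingface/text-embeddings-inference:turing-1.7 \
  --model-id $model \
  --max-batch-tokens 4096
```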
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
1. Use a T4 GPU.
2. Load a few other models on the same GPU.
3. Load the BAAI/bge-m3 model with the turing-1.7 version of the Docker image (a minimal script sketching these steps follows below).
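A minimal repro sketch, assuming the other models are also served from TEI containers; the `<other-model-1>`/`<other-model-2>` IDs, host ports, and `$volume` are placeholders, and any process that occupies GPU memory on device 0 should set up the same condition:

```
# Step 2: occupy part of device 0 with other model servers first.
# <other-model-1> and <other-model-2> are hypothetical placeholders.
docker run -d --gpus '"device=0"' -p 9051:80 -v $volume:/data \
  ghcr.io/huggingface/text-embeddings-inference:turing-1.7 --model-id <other-model-1>
docker run -d --gpus '"device=0"' -p 9052:80 -v $volume:/data \
  ghcr.io/huggingface/text-embeddings-inference:turing-1.7 --model-id <other-model-2>

# Step 3: with ~10 GB still free, try to load bge-m3 on the same device;
# this is the run that fails with CUDA_ERROR_OUT_OF_MEMORY.
docker run --gpus '"device=0"' -p 9050:80 -v $volume:/data --pull always \
  ghcr.io/huggingface/text-embeddings-inference:turing-1.7 --model-id BAAI/bge-m3
```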
Expected behavior
The model should load without errors.