
Commit 297257b

fix docs
1 parent 2defbcc commit 297257b

2 files changed: +12, -19 lines changed


README.md

Lines changed: 8 additions & 10 deletions
```diff
@@ -105,7 +105,7 @@ Options:
           [default: thenlper/gte-base]
 
       --revision <REVISION>
-          The actual revision of the model if you are referring to a model on the hub. You can use a specific commit id or a branch like `refs/pr/2`
+          The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id or a branch like `refs/pr/2`
 
           [env: REVISION=]
 
@@ -131,24 +131,22 @@ Options:
       --max-batch-tokens <MAX_BATCH_TOKENS>
          **IMPORTANT** This is one critical control to allow maximum usage of the available hardware.
 
-          This represents the total amount of potential tokens within a batch. When using padding (not recommended) this would be equivalent of `batch_size` * `max_total_tokens`.
+          This represents the total amount of potential tokens within a batch.
 
-          However in the non-padded (flash attention) version this can be much finer.
+          For `max_batch_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.
 
-          For `max_batch_total_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.
-
-          Overall this number should be the largest possible until the model is compute bound. Since the actual memory overhead depends on other parameters like if you are flash attention or the model implementation, text-embeddings cannot infer this number automatically.
+          Overall this number should be the largest possible until the model is compute bound. Since the actual memory overhead depends on the model implementation, text-embeddings-inference cannot infer this number automatically.
 
           [env: MAX_BATCH_TOKENS=]
-          [default: 8192]
+          [default: 16384]
 
       --max-batch-requests <MAX_BATCH_REQUESTS>
           Optionally control the maximum number of individual requests in a batch
 
           [env: MAX_BATCH_REQUESTS=]
 
       --max-client-batch-size <MAX_CLIENT_BATCH_SIZE>
-          Control the maximum number of inputs that a client can send
+          Control the maximum number of inputs that a client can send in a single request
 
           [env: MAX_CLIENT_BATCH_SIZE=]
           [default: 32]
@@ -171,10 +169,10 @@ Options:
           [default: 3000]
 
       --uds-path <UDS_PATH>
-          The name of the unix socket some text-embeddings backends will use as they communicate internally with gRPC
+          The name of the unix socket some text-embeddings-inference backends will use as they communicate internally with gRPC
 
           [env: UDS_PATH=]
-          [default: /tmp/text-embeddings-server]
+          [default: /tmp/text-embeddings-inference-server]
 
       --huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>
           The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance
```
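The `--max-batch-tokens` wording above boils down to a single token budget per batch. Here is a minimal sketch of that rule, with illustrative names only (this is not the router's actual scheduler):

```rust
/// Greedily admit pending requests into a batch while their combined token
/// count stays within `max_batch_tokens`. Purely illustrative.
fn fill_batch(pending: &[usize], max_batch_tokens: usize) -> Vec<usize> {
    let mut batch = Vec::new();
    let mut budget = max_batch_tokens;
    for &total_tokens in pending {
        if total_tokens > budget {
            // The next request no longer fits; it waits for the next batch.
            break;
        }
        budget -= total_tokens;
        batch.push(total_tokens);
    }
    batch
}

fn main() {
    // With a 1000-token budget: ten queries of 100 tokens fit...
    assert_eq!(fill_batch(&[100; 12], 1000).len(), 10);
    // ...or a single query of 1000 tokens uses the whole budget.
    assert_eq!(fill_batch(&[1000, 50], 1000), vec![1000]);
}
```

This matches both cases called out in the docs: `10` queries of `total_tokens=100`, or a single query of `1000` tokens, for `max_batch_tokens=1000`.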

router/src/main.rs

Lines changed: 4 additions & 9 deletions
```diff
@@ -60,26 +60,21 @@ struct Args {
     /// of the available hardware.
     ///
     /// This represents the total amount of potential tokens within a batch.
-    /// When using padding (not recommended) this would be equivalent of
-    /// `batch_size` * `max_total_tokens`.
     ///
-    /// However in the non-padded (flash attention) version this can be much finer.
-    ///
-    /// For `max_batch_total_tokens=1000`, you could fit `10` queries of `total_tokens=100`
+    /// For `max_batch_tokens=1000`, you could fit `10` queries of `total_tokens=100`
     /// or a single query of `1000` tokens.
     ///
     /// Overall this number should be the largest possible until the model is compute bound.
-    /// Since the actual memory overhead depends on other parameters like if you're flash attention
-    /// or the model implementation, text-embeddings-inference cannot infer this number
-    /// automatically.
+    /// Since the actual memory overhead depends on the model implementation,
+    /// text-embeddings-inference cannot infer this number automatically.
     #[clap(default_value = "16384", long, env)]
     max_batch_tokens: usize,
 
     /// Optionally control the maximum number of individual requests in a batch
     #[clap(long, env)]
     max_batch_requests: Option<usize>,
 
-    /// Control the maximum number of inputs that a client can send
+    /// Control the maximum number of inputs that a client can send in a single request
     #[clap(default_value = "32", long, env)]
     max_client_batch_size: usize,
 
```
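For readers unfamiliar with how the `#[clap(..., long, env)]` attributes in this hunk resolve values, here is a standalone sketch; it assumes the clap `derive` and `env` features and copies only the three fields touched by this commit, so it is not the full `Args` struct from `router/src/main.rs`:

```rust
use clap::Parser;

/// Minimal stand-in for the router's Args, limited to the fields shown above.
#[derive(Parser, Debug)]
struct Args {
    /// Total token budget per batch: --max-batch-tokens, else the
    /// MAX_BATCH_TOKENS env var, else the default of 16384.
    #[clap(default_value = "16384", long, env)]
    max_batch_tokens: usize,

    /// Optional cap on requests per batch; stays `None` unless
    /// --max-batch-requests or MAX_BATCH_REQUESTS is set.
    #[clap(long, env)]
    max_batch_requests: Option<usize>,

    /// Maximum number of inputs a client can send in a single request.
    #[clap(default_value = "32", long, env)]
    max_client_batch_size: usize,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}
```

Precedence follows the usual clap order: an explicit flag wins over the environment variable, which wins over the default, which is why the README lists both `[env: MAX_BATCH_TOKENS=]` and `[default: 16384]` for the same option.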
