README.md
+8 -10 lines changed: 8 additions & 10 deletions
@@ -105,7 +105,7 @@ Options:
      [default: thenlper/gte-base]

  --revision <REVISION>
-         The actual revision of the model if you are referring to a model on the hub. You can use a specific commit id or a branch like `refs/pr/2`
+         The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id or a branch like `refs/pr/2`

          [env: REVISION=]
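As a quick illustration of the reworded `--revision` help above, the launch below pins the model to a hub branch. It is a sketch only: the binary name (`text-embeddings-router`) and the `--model-id` flag are assumed from the surrounding project rather than shown in this diff, while `thenlper/gte-base` and `refs/pr/2` come from the help text itself.

```shell
# Sketch: pin the model to a hub branch or commit via --revision.
# Binary name (text-embeddings-router) and --model-id are assumptions, not shown in this diff.
text-embeddings-router \
    --model-id thenlper/gte-base \
    --revision refs/pr/2
```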
@@ -131,24 +131,22 @@ Options:
  --max-batch-tokens <MAX_BATCH_TOKENS>
          **IMPORTANT** This is one critical control to allow maximum usage of the available hardware.

-         This represents the total amount of potential tokens within a batch. When using padding (not recommended) this would be equivalent of `batch_size` * `max_total_tokens`.
+         This represents the total amount of potential tokens within a batch.

-         However in the non-padded (flash attention) version this can be much finer.
+         For `max_batch_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.

-         For `max_batch_total_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.
-
-         Overall this number should be the largest possible until the model is compute bound. Since the actual memory overhead depends on other parameters like if you are flash attention or the model implementation, text-embeddings cannot infer this number automatically.
+         Overall this number should be the largest possible until the model is compute bound. Since the actual memory overhead depends on the model implementation, text-embeddings-inference cannot infer this number automatically.

          [env: MAX_BATCH_TOKENS=]
-         [default: 8192]
+         [default: 16384]

  --max-batch-requests <MAX_BATCH_REQUESTS>
          Optionally control the maximum number of individual requests in a batch

          [env: MAX_BATCH_REQUESTS=]

  --max-client-batch-size <MAX_CLIENT_BATCH_SIZE>
-         Control the maximum number of inputs that a client can send
+         Control the maximum number of inputs that a client can send in a single request

          [env: MAX_CLIENT_BATCH_SIZE=]
          [default: 32]
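To make the new `--max-batch-tokens` wording concrete: with the updated default of `16384`, one batch could hold roughly sixteen 1024-token requests, or a single very long one. The sketch below is hypothetical; the binary name and the specific values are illustrative assumptions, while the flags themselves are the ones documented in this hunk.

```shell
# Illustrative only: binary name and values are assumptions, flags are from the help above.
# With --max-batch-tokens 16384, the scheduler can pack ~16 requests of 1024 tokens each,
# or a single 16384-token request, into one batch.
text-embeddings-router \
    --max-batch-tokens 16384 \
    --max-batch-requests 64 \
    --max-client-batch-size 32
```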
@@ -171,10 +169,10 @@ Options:
          [default: 3000]

  --uds-path <UDS_PATH>
-         The name of the unix socket some text-embeddings backends will use as they communicate internally with gRPC
+         The name of the unix socket some text-embeddings-inference backends will use as they communicate internally with gRPC

          [env: UDS_PATH=]
-         [default: /tmp/text-embeddings-server]
+         [default: /tmp/text-embeddings-inference-server]

  --huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>
          The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance
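The renamed socket default and the hub cache override shown in this hunk would typically be set together when running on a host with a dedicated data disk. The sketch below assumes the `text-embeddings-router` binary name and the `/data` mount point, neither of which appears in this diff.

```shell
# Sketch: custom internal gRPC socket path plus a mounted hub cache.
# Binary name and the /data mount point are assumptions for illustration.
text-embeddings-router \
    --uds-path /tmp/text-embeddings-inference-server \
    --huggingface-hub-cache /data/huggingface-hub-cache
```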