A blazing fast inference solution for text embeddings models.
- Benchmark for [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) on a Nvidia A10 with a sequence length of 512 tokens:
+ Benchmark for [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) on an Nvidia A10 with a sequence length of 512 tokens:
<p>
<img src="assets/bs1-lat.png" width="400" />
@@ -36,14 +36,18 @@ Benchmark for [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1
- [Local Install](#local-install)
- [Docker Build](#docker-build)
- - No compilation step
- - Dynamic shapes
- - Small docker images and fast boot times. Get ready for true serverless!
- - Token based dynamic batching
- - Optimized transformers code for inference using [Flash Attention](https://github.com/HazyResearch/flash-attention),
+ Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings models. TEI enables
+ high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. TEI implements many features
+ such as:
+
+ * No model graph compilation step
+ * Small docker images and fast boot times. Get ready for true serverless!
+ * Token based dynamic batching (see the sketch after this list)
+ * Optimized transformers code for inference using [Flash Attention](https://github.com/HazyResearch/flash-attention),
[Candle](https://github.com/huggingface/candle) and [cuBLASLt](https://docs.nvidia.com/cuda/cublas/#using-the-cublaslt-api)
- - [Safetensors](https://github.com/huggingface/safetensors) weight loading
- - Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
+ * [Safetensors](https://github.com/huggingface/safetensors) weight loading
+ * Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
+
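To make the token based dynamic batching item concrete, here is a minimal sketch of the general idea only, not TEI's actual (Rust) implementation: queued requests are grouped until their summed tokenized length would exceed a token budget, so a batch of many short texts and a batch of a few long ones cost roughly the same compute. The function name and `max_batch_tokens` parameter are illustrative.

```python
# Token-based dynamic batching, sketched for illustration only;
# this shows the general idea, not TEI's actual implementation.
from typing import List


def batch_by_tokens(seq_lens: List[int], max_batch_tokens: int) -> List[List[int]]:
    """Group request indices into batches whose total token count fits the budget."""
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for i, n_tokens in enumerate(seq_lens):
        # Start a new batch if adding this request would exceed the token budget.
        if current and current_tokens + n_tokens > max_batch_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches


# Four queued requests and a 1024-token budget: the two 512-token requests
# share a batch, while the 900-token request gets its own.
print(batch_by_tokens([512, 512, 300, 900], max_batch_tokens=1024))
# [[0, 1], [2], [3]]
```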
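Before the Get Started section below, here is a minimal client sketch against TEI's `/embed` route. It assumes a TEI server is already running and listening on `127.0.0.1:8080` with an embedding model loaded; the address and the use of the third-party `requests` package are illustrative choices.

```python
# Minimal client sketch: request embeddings from a running TEI instance.
# Assumes a server is already listening on 127.0.0.1:8080 (illustrative
# address) with a model such as BAAI/bge-base-en-v1.5 loaded.
import requests

response = requests.post(
    "http://127.0.0.1:8080/embed",
    json={"inputs": "What is Deep Learning?"},
)
response.raise_for_status()

# The response body is a JSON array with one embedding vector per input.
embeddings = response.json()
print(len(embeddings), "embedding of dimension", len(embeddings[0]))
```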
## Get Started