
Commit c0b2925

feat: add qwen3

* merge
* merge
* add Mistral-Small-3.1-24B-Instruct-2503
* modify qwq-32b deploy
* add txgemma model
* modify model list command
* fix typo
* add some ecs parameters
* add glm4-z1 models
* modify vllm backend
* add qwen3

1 parent 70fe05d commit c0b2925
File tree

10 files changed: +364 −9 lines changed

README.md

Lines changed: 2 additions & 1 deletion

```diff
@@ -15,7 +15,8 @@
 </p>

 ## 🔥 Latest News
+- 2025-04-29: Deploy Qwen 3 series models with [one command line](https://github.com/aws-samples/easy-model-deployer/blob/main/docs/en/best_deployment_practices.md##famous-models###Qwen-3-Series).
+- 2025-04-21: Deploy GLM Z1/0414 series models with [one command line](https://github.com/aws-samples/easy-model-deployer/blob/main/docs/en/best_deployment_practices.md##famous-models###GLM-Z1/0414-Series).
 - 2025-03-17: Deploy Gemma 3 series models with [one command line](https://github.com/aws-samples/easy-model-deployer/blob/main/docs/en/best_deployment_practices.md##famous-models###gemma-3-series).
 - 2025-03-06: Deploy QwQ-32B with [one command line](docs/en/best_deployment_practices.md##famous-models###qwen-series###qwq-32b).
```

docs/en/best_deployment_practices.md

Lines changed: 17 additions & 0 deletions

````diff
@@ -3,6 +3,23 @@
 This document provides examples of best practices for deploying models using EMD for various use cases.

 ## Famous Models
+### Qwen 3 Series
+```
+emd deploy --model-id Qwen3-30B-A3B --instance-type g5.12xlarge --engine-type vllm --service-type sagemaker_realtime
+
+emd deploy --model-id Qwen3-32B --instance-type g5.12xlarge --engine-type vllm --service-type sagemaker_realtime
+
+emd deploy --model-id Qwen3-8B --instance-type g5.12xlarge --engine-type vllm --service-type sagemaker_realtime
+```
+
+
+### GLM Z1/0414 Series
+```
+emd deploy --model-id GLM-Z1-32B-0414 --instance-type g5.12xlarge --engine-type vllm --service-type sagemaker_realtime
+
+emd deploy --model-id GLM-4-32B-0414 --instance-type g5.12xlarge --engine-type vllm --service-type sagemaker_realtime
+```
+
 ### Mistral Small Series
 ```
````

src/emd/cfn/sagemaker_realtime/template.yaml

Lines changed: 5 additions & 1 deletion

```diff
@@ -26,6 +26,10 @@ Parameters:
   Region:
     Type: String
    Description: The region to be used for the SageMaker Endpoint
+  MinCapacity:
+    Type: Number
+    Description: The minimum capacity of the endpoint
+    Default: 1
   MaxCapacity:
     Type: Number
     Description: The maximum capacity of the endpoint
@@ -120,7 +124,7 @@ Resources:
     Type: AWS::ApplicationAutoScaling::ScalableTarget
     Properties:
       MaxCapacity: !Ref MaxCapacity
-      MinCapacity: 1
+      MinCapacity: !Ref MinCapacity
       RoleARN: !GetAtt ExecutionRole.Arn
       ResourceId: !Sub "endpoint/${SageMakerEndpoint.EndpointName}/variant/AllTraffic"
       ScalableDimension: "sagemaker:variant:DesiredInstanceCount"
```
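The change above turns the previously hard-coded autoscaling floor into a template parameter. A minimal Python sketch of the effect (the `resolve_scalable_target` helper is hypothetical, written only to mimic CloudFormation's `!Ref` resolution and the declared `Default: 1`):

```python
def resolve_scalable_target(parameters):
    """Mimic how the template resolves the capacity settings for the
    ApplicationAutoScaling::ScalableTarget resource."""
    defaults = {"MinCapacity": 1}  # Default: 1, as declared in the template
    params = {**defaults, **parameters}
    if params["MinCapacity"] > params["MaxCapacity"]:
        raise ValueError("MinCapacity must not exceed MaxCapacity")
    return {
        # Before this commit, MinCapacity was hard-coded to 1 here.
        "MinCapacity": params["MinCapacity"],
        "MaxCapacity": params["MaxCapacity"],
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }

# A deployment that overrides the new parameter keeps two instances warm:
target = resolve_scalable_target({"MinCapacity": 2, "MaxCapacity": 4})

# Omitting MinCapacity falls back to the template default of 1:
default_target = resolve_scalable_target({"MaxCapacity": 3})
```

The practical upshot is that callers can now scale an endpoint's floor above (or conceptually down to) one instance without editing the template itself.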

src/emd/models/engines.py

Lines changed: 27 additions & 0 deletions

```diff
@@ -127,6 +127,25 @@ class KtransformersEngine(OpenAICompitableEngine):
     "environment_variables": "export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True",
     "default_cli_args": " --chat-template emd/models/chat_templates/qwen2vl_add_prefill_chat_template.jinja --max_model_len 16000 --disable-log-stats --limit-mm-per-prompt image=2,video=1 --max_num_seq 1 --gpu_memory_utilization 0.9"
 })
+
+
+vllm_ui_tars_1_5_engin084 = VllmEngine(**{
+    **vllm_engine064.model_dump(),
+    "engine_dockerfile_config": {"VERSION":"v0.8.4"},
+    "environment_variables": "export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True",
+    "default_cli_args": " --max_model_len 16000 --disable-log-stats --limit-mm-per-prompt image=1,video=0 --max_num_seq 2 --gpu_memory_utilization 0.9 --enable-prefix-caching"
+})
+
+
+vllm_qwen3_engin084 = VllmEngine(**{
+    **vllm_engine064.model_dump(),
+    "engine_dockerfile_config": {"VERSION":"v0.8.4"},
+    "environment_variables": "export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True",
+    "default_cli_args": " --max_model_len 16000 --disable-log-stats --enable-reasoning --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes --enable-prefix-caching"
+})
+
+
 vllm_qwen2vl72b_engine064 = VllmEngine(**{
     **vllm_engine064.model_dump(),
     "environment_variables": "export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True",
@@ -141,6 +160,14 @@ class KtransformersEngine(OpenAICompitableEngine):
     "default_cli_args": " --max_model_len 25000 --disable-log-stats --limit-mm-per-prompt image=20,video=1 --max_num_seq 1 --gpu_memory_utilization 0.9"
 })

+vllm_qwen25vl72b_engine084 = VllmEngine(**{
+    **vllm_engine064.model_dump(),
+    "engine_dockerfile_config": {"VERSION":"v0.8.4"},
+    "dockerfile_name":"Dockerfile_qwen25_vl",
+    "environment_variables": "export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True",
+    "default_cli_args": " --max_model_len 32000 --disable-log-stats --limit-mm-per-prompt image=1,video=1 --max_num_seq 1 --gpu_memory_utilization 0.9"
+})
+
 vllm_qwq_engine073 = VllmEngine(**{
     **vllm_qwen25vl72b_engine073.model_dump(),
     "environment_variables": "export VLLM_ATTENTION_BACKEND=FLASHINFER && export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True",
```

src/emd/models/llms/qwen.py

Lines changed: 239 additions & 2 deletions

```diff
@@ -8,7 +8,8 @@
     tgi_qwen2d5_72b_on_inf2,
     vllm_qwen2d5_72b_engine064,
     vllm_qwq_engine073,
-    vllm_qwq_engine082
+    vllm_qwq_engine082,
+    vllm_qwen3_engin084
 )
 from ..services import (
     sagemaker_service,
@@ -34,7 +35,7 @@
 from emd.models.utils.constants import ModelType
 from emd.models.utils.constants import ModelType
 from emd.models import ModelSeries
-from ..model_series import QWEN2D5_SERIES,QWEN_REASONING_MODEL
+from ..model_series import QWEN2D5_SERIES,QWEN_REASONING_MODEL,QWEN3_SERIES

 Model.register(
     dict(
@@ -498,3 +499,239 @@
         model_series=QWEN_REASONING_MODEL
     )
 )
+
+
+Model.register(
+    dict(
+        model_id = "Qwen3-8B",
+        supported_engines=[vllm_qwen3_engin084],
+        supported_instances=[
+            g5d2xlarge_instance,
+            g5d4xlarge_instance,
+            g5d8xlarge_instance,
+            g5d16xlarge_instance,
+            g4dn2xlarge_instance,
+            # g5d24xlarge_instance,
+            # g5d48xlarge_instance,
+            local_instance
+        ],
+        supported_services=[
+            sagemaker_service,
+            sagemaker_async_service,
+            ecs_service,
+            local_service
+        ],
+        supported_frameworks=[
+            fastapi_framework
+        ],
+        allow_china_region=True,
+        huggingface_model_id="Qwen/Qwen3-8B",
+        modelscope_model_id="Qwen/Qwen3-8B",
+        require_huggingface_token=False,
+        application_scenario="Agent, tool use, translation, summary",
+        description="The latest series of Qwen LLMs, offers base and tuned models from 0.5B to 72B\n parameters, featuring enhanced knowledge, improved coding and math skills, better instruction\n following, long-text generation, structured data handling, 128K token context support, and\n multilingual capabilities for 29+ languages.",
+        model_type=ModelType.LLM,
+        model_series=QWEN3_SERIES
+    )
+)
+
+Model.register(
+    dict(
+        model_id = "Qwen3-0.6B",
+        supported_engines=[vllm_qwen3_engin084],
+        supported_instances=[
+            g5d2xlarge_instance,
+            g5d4xlarge_instance,
+            g5d8xlarge_instance,
+            g5d16xlarge_instance,
+            g4dn2xlarge_instance,
+            # g5d24xlarge_instance,
+            # g5d48xlarge_instance,
+            local_instance
+        ],
+        supported_services=[
+            sagemaker_service,
+            sagemaker_async_service,
+            ecs_service,
+            local_service
+        ],
+        supported_frameworks=[
+            fastapi_framework
+        ],
+        allow_china_region=True,
+        huggingface_model_id="Qwen/Qwen3-0.6B",
+        modelscope_model_id="Qwen/Qwen3-0.6B",
+        require_huggingface_token=False,
+        application_scenario="Agent, tool use, translation, summary",
+        description="The latest series of Qwen LLMs, offers base and tuned models from 0.5B to 72B\n parameters, featuring enhanced knowledge, improved coding and math skills, better instruction\n following, long-text generation, structured data handling, 128K token context support, and\n multilingual capabilities for 29+ languages.",
+        model_type=ModelType.LLM,
+        model_series=QWEN3_SERIES
+    )
+)
+
+Model.register(
+    dict(
+        model_id = "Qwen3-1.7B",
+        supported_engines=[vllm_qwen3_engin084],
+        supported_instances=[
+            g5d2xlarge_instance,
+            g5d4xlarge_instance,
+            g5d8xlarge_instance,
+            g5d16xlarge_instance,
+            g4dn2xlarge_instance,
+            # g5d24xlarge_instance,
+            # g5d48xlarge_instance,
+            local_instance
+        ],
+        supported_services=[
+            sagemaker_service,
+            sagemaker_async_service,
+            ecs_service,
+            local_service
+        ],
+        supported_frameworks=[
+            fastapi_framework
+        ],
+        allow_china_region=True,
+        huggingface_model_id="Qwen/Qwen3-1.7B",
+        modelscope_model_id="Qwen/Qwen3-1.7B",
+        require_huggingface_token=False,
+        application_scenario="Agent, tool use, translation, summary",
+        description="The latest series of Qwen LLMs, offers base and tuned models from 0.5B to 72B\n parameters, featuring enhanced knowledge, improved coding and math skills, better instruction\n following, long-text generation, structured data handling, 128K token context support, and\n multilingual capabilities for 29+ languages.",
+        model_type=ModelType.LLM,
+        model_series=QWEN3_SERIES
+    )
+)
+
+
+Model.register(
+    dict(
+        model_id = "Qwen3-4B",
+        supported_engines=[vllm_qwen3_engin084],
+        supported_instances=[
+            g5d2xlarge_instance,
+            g5d4xlarge_instance,
+            g5d8xlarge_instance,
+            g5d16xlarge_instance,
+            g4dn2xlarge_instance,
+            # g5d24xlarge_instance,
+            # g5d48xlarge_instance,
+            local_instance
+        ],
+        supported_services=[
+            sagemaker_service,
+            sagemaker_async_service,
+            ecs_service,
+            local_service
+        ],
+        supported_frameworks=[
+            fastapi_framework
+        ],
+        allow_china_region=True,
+        huggingface_model_id="Qwen/Qwen3-4B",
+        modelscope_model_id="Qwen/Qwen3-4B",
+        require_huggingface_token=False,
+        application_scenario="Agent, tool use, translation, summary",
+        description="The latest series of Qwen LLMs, offers base and tuned models from 0.5B to 72B\n parameters, featuring enhanced knowledge, improved coding and math skills, better instruction\n following, long-text generation, structured data handling, 128K token context support, and\n multilingual capabilities for 29+ languages.",
+        model_type=ModelType.LLM,
+        model_series=QWEN3_SERIES
+    )
+)
+
+
+Model.register(
+    dict(
+        model_id = "Qwen3-14B",
+        supported_engines=[vllm_qwen3_engin084],
+        supported_instances=[
+            g5d12xlarge_instance,
+            g5d24xlarge_instance,
+            g5d48xlarge_instance,
+            # g5d24xlarge_instance,
+            # g5d48xlarge_instance,
+            local_instance
+        ],
+        supported_services=[
+            sagemaker_service,
+            sagemaker_async_service,
+            ecs_service,
+            local_service
+        ],
+        supported_frameworks=[
+            fastapi_framework
+        ],
+        allow_china_region=True,
+        huggingface_model_id="Qwen/Qwen3-14B",
+        modelscope_model_id="Qwen/Qwen3-14B",
+        require_huggingface_token=False,
+        application_scenario="Agent, tool use, translation, summary",
+        description="The latest series of Qwen LLMs, offers base and tuned models from 0.5B to 72B\n parameters, featuring enhanced knowledge, improved coding and math skills, better instruction\n following, long-text generation, structured data handling, 128K token context support, and\n multilingual capabilities for 29+ languages.",
+        model_type=ModelType.LLM,
+        model_series=QWEN3_SERIES
+    )
+)
+
+Model.register(
+    dict(
+        model_id = "Qwen3-32B",
+        supported_engines=[vllm_qwen3_engin084],
+        supported_instances=[
+            g5d12xlarge_instance,
+            g5d24xlarge_instance,
+            g5d48xlarge_instance,
+            # g5d24xlarge_instance,
+            # g5d48xlarge_instance,
+            local_instance
+        ],
+        supported_services=[
+            sagemaker_service,
+            sagemaker_async_service,
+            ecs_service,
+            local_service
+        ],
+        supported_frameworks=[
+            fastapi_framework
+        ],
+        allow_china_region=True,
+        huggingface_model_id="Qwen/Qwen3-32B",
+        modelscope_model_id="Qwen/Qwen3-32B",
+        require_huggingface_token=False,
+        application_scenario="Agent, tool use, translation, summary",
+        description="The latest series of Qwen LLMs, offers base and tuned models from 0.5B to 72B\n parameters, featuring enhanced knowledge, improved coding and math skills, better instruction\n following, long-text generation, structured data handling, 128K token context support, and\n multilingual capabilities for 29+ languages.",
+        model_type=ModelType.LLM,
+        model_series=QWEN3_SERIES
+    )
+)
+
+
+Model.register(
+    dict(
+        model_id = "Qwen3-30B-A3B",
+        supported_engines=[vllm_qwen3_engin084],
+        supported_instances=[
+            g5d12xlarge_instance,
+            g5d24xlarge_instance,
+            g5d48xlarge_instance,
+            # g5d24xlarge_instance,
+            # g5d48xlarge_instance,
+            local_instance
+        ],
+        supported_services=[
+            sagemaker_service,
+            sagemaker_async_service,
+            ecs_service,
+            local_service
+        ],
+        supported_frameworks=[
+            fastapi_framework
+        ],
+        allow_china_region=True,
+        huggingface_model_id="Qwen/Qwen3-30B-A3B",
+        modelscope_model_id="Qwen/Qwen3-30B-A3B",
+        require_huggingface_token=False,
+        application_scenario="Agent, tool use, translation, summary",
+        description="The latest series of Qwen LLMs, offers base and tuned models from 0.5B to 72B\n parameters, featuring enhanced knowledge, improved coding and math skills, better instruction\n following, long-text generation, structured data handling, 128K token context support, and\n multilingual capabilities for 29+ languages.",
+        model_type=ModelType.LLM,
+        model_series=QWEN3_SERIES
+    )
+)
```
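The seven `Model.register` calls above are identical except for `model_id` and the instance list (the sub-8B models target single-GPU g5/g4dn sizes, the 14B+ models target multi-GPU g5 sizes). A hypothetical consolidation, not part of this commit and using illustrative plain dicts rather than the real `Model` class, could generate them from a table:

```python
# Instance-list stand-ins: small models fit single-GPU instances,
# larger models need multi-GPU g5 sizes (names mirror the diff above).
SMALL = ["g5d2xlarge", "g5d4xlarge", "g5d8xlarge", "g5d16xlarge", "g4dn2xlarge", "local"]
LARGE = ["g5d12xlarge", "g5d24xlarge", "g5d48xlarge", "local"]

QWEN3_MODELS = {
    "Qwen3-0.6B": SMALL, "Qwen3-1.7B": SMALL, "Qwen3-4B": SMALL, "Qwen3-8B": SMALL,
    "Qwen3-14B": LARGE, "Qwen3-32B": LARGE, "Qwen3-30B-A3B": LARGE,
}

registry = []
for model_id, instances in QWEN3_MODELS.items():
    # Every other field in the seven registrations is constant,
    # so only the varying parts need to be filled in per model.
    registry.append({
        "model_id": model_id,
        "supported_instances": instances,
        "huggingface_model_id": f"Qwen/{model_id}",
        "modelscope_model_id": f"Qwen/{model_id}",
    })
```

Whether such a loop is preferable to the explicit registrations is a style trade-off; the repeated literal form in the commit keeps each model's configuration independently editable.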

src/emd/models/model_series.py

Lines changed: 14 additions & 0 deletions

```diff
@@ -7,6 +7,13 @@
     reference_link="https://github.com/QwenLM/Qwen2.5"
 )

+QWEN3_SERIES = ModelSeries(
+    model_series_name = ModelSeriesType.QWEN3,
+    description="the latest addition to the Qwen family of large language models. These models represent our most advanced and intelligent systems to date, improving from our experience in building QwQ and Qwen2.5. We are making the weights of Qwen3 available to the public, including both dense and Mixture-of-Expert (MoE) models.",
+    reference_link="https://github.com/QwenLM/Qwen3"
+)
+
 GLM4_SERIES = ModelSeries(
     model_series_name = ModelSeriesType.GLM4,
     description="The GLM-4 series includes the latest generation of pre-trained models launched by Zhipu AI.",
@@ -62,6 +69,13 @@
     reference_link="https://github.com/QwenLM/Qwen2-VL"
 )

+AGENT_SERIES = ModelSeries(
+    model_series_name=ModelSeriesType.AGENT,
+    description="""LLM or VLM models for agentic tasks, e.g. computer-use, browser-use""",
+    reference_link=""
+)
+
 INTERNVL25_SERIES = ModelSeries(
     model_series_name=ModelSeriesType.INTERNVL25,
     description="""InternVL2.5 is an advanced multimodal large language model (MLLM) series with parameter coverage ranging from 1B to 78B. InternVL2_5-78B is the first open-source MLLMs to achieve over 70% on the MMMU benchmark, matching the performance of leading closed-source commercial models like GPT-4o.""",
```
