
Draft model displayed in UI differs from draft model loaded via API #661


Open
sealad886 opened this issue May 18, 2025 · 0 comments

@sealad886

[BUG] Display draft model currently loaded in UI instead of draft model configured in default model settings

Which version of LM Studio?

LM Studio 0.3.16 (build 3)

Which operating system?

macOS 15.5

What is the bug?

A GGUF model is described below, but the same issue affects MLX models.

A model loaded just-in-time (let's call this Model A) will load the draft model specified in the API call (Model Draft); this is correct behavior. Separately, a Draft Model for Model A can also be specified in:

  1. Model Settings > Speculative Decoding > Draft Model (My Models view, see Image 1)
  2. Model Settings > Inference > Speculative Decoding > Draft Model (Developer view, see Image 2).
    NB: a draft model cannot be specified or otherwise configured directly from the Chat window.

Conceptually, the first method makes sense for defining a default draft model, since those settings are not expected to relate to any specific instantiation of the model on the CPU/GPU.

However, the second method does directly apply to (and is only visible for) a specific model instantiation.
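
For context, the per-request path looks roughly like the following. This is a minimal sketch, assuming LM Studio's OpenAI-compatible server on its default local port and a "draft_model" field in the request body for speculative decoding; the model identifiers are illustrative and the exact field name should be checked against the API docs for your LM Studio version.

# Minimal sketch: chat completion request that names its own draft model
# (assumptions: server at localhost:1234, "draft_model" request field; model names illustrative).
import json
import urllib.request

payload = {
    "model": "qwen2.5-coder-7b-instruct",          # Model A, loaded just-in-time
    "draft_model": "qwen2.5-coder-0.5b-instruct",  # Model Draft, specified per request
    "messages": [{"role": "user", "content": "Write a hello-world program in C."}],
    "temperature": 0.25,
}

req = urllib.request.Request(
    "http://localhost:1234/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])

With a request like this, the draft model used for the session comes from the request body rather than from the default configured in Model Settings, which is exactly the association the Developer view then fails to reflect.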

Proposed solution(s)

I contend that the Developer tab's view should adopt one (or both) of the following solutions:

  1. Add a nested row in the table of loaded models indicating the draft model loaded for the given main model.
  2. Override the displayed value of > Inference > Speculative Decoding > Draft Model so that it reflects the draft model actually in use.

Thus, the user is informed via the GUI which draft model is currently being used by the loaded main model.

Screenshots

Image 1. The My Models screen interface to configure a default draft model.

Image 2. The Developer screen interface to configure the same default draft model.

Image 3. The developer logs enumerate the draft model being loaded, alongside the default draft model configured in Model Settings.

Logs

Add any relevant logs.

Log of ggml loading
2025-05-18 19:21:46 [DEBUG] 
[LM Studio] GPU Configuration:
  Strategy: evenly
  Priority: []
  Disabled GPUs: []
  Limit weight offload to dedicated GPU Memory: OFF
  Offload KV Cache to GPU: OFF
2025-05-18 19:21:46 [DEBUG] 
[LM Studio] Live GPU memory info:
No live GPU info available
2025-05-18 19:21:46 [DEBUG] 
[LM Studio] Model load size estimate with raw num offload layers 'max' and context length '131072':
  Model: 277.88 MB
  Context: 3.84 GB
  Total: 4.11 GB
2025-05-18 19:21:46 [DEBUG] 
[LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '131072'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
  Num Offload Layers: max
  Main GPU: 0
  Tensor Split: [0]
  Disabled GPUs: []
2025-05-18 19:21:46 [DEBUG] 
llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 15757 MiB free
2025-05-18 19:21:46 [DEBUG] 
llama_model_loader: loaded meta data with 32 key-value pairs and 290 tensors from /Users/andrew/.lmstudio/models/sealad886/qwen2.5-coder-0.5b-instruct-q8/qwen2.5-coder-0.5b-instruct-q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Ea3F2471Cf1B1F0Db85067F1Ef93848E38E88C25
llama_model_loader: - kv   3:                         general.size_label str              = 494M
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv   6:                   general.base_model.count u32              = 1
llama_model_loader: - kv   7:                  general.base_model.0.name str              = Qwen2.5 Coder 0.5B
llama_model_loader: - kv   8:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv   9:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv  10:                               general.tags arr[str,6]       = ["code", "codeqwen", "chat", "qwen", ...
2025-05-18 19:21:46 [DEBUG] 
llama_model_loader: - kv  11:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  12:                          qwen2.block_count u32              = 24
llama_model_loader: - kv  13:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  14:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv  15:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv  16:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  17:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  18:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  19:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = qwen2
2025-05-18 19:21:46 [DEBUG] 
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
2025-05-18 19:21:46 [DEBUG] 
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2025-05-18 19:21:46 [DEBUG] 
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  28:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                          general.file_type u32              = 7
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q8_0:  169 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 500.79 MiB (8.50 BPW)
2025-05-18 19:21:46 [DEBUG] 
load: special tokens cache size = 22
2025-05-18 19:21:46 [DEBUG] 
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 896
print_info: n_layer          = 24
print_info: n_head           = 14
print_info: n_head_kv        = 2
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
2025-05-18 19:21:46 [DEBUG] 
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 128
print_info: n_embd_v_gqa     = 128
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 4864
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1B
print_info: model params     = 494.03 M
print_info: general.name     = Ea3F2471Cf1B1F0Db85067F1Ef93848E38E88C25
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
2025-05-18 19:21:46 [DEBUG] 
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors: Metal_Mapped model buffer size =   500.79 MiB
load_tensors:   CPU_Mapped model buffer size =   137.94 MiB
...........
2025-05-18 19:21:46 [DEBUG] 
................................................
2025-05-18 19:21:46 [DEBUG] 
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 131072
llama_context: n_ctx_per_seq = 131072
llama_context: n_batch       = 131072
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (131072) > n_ctx_train (32768) -- possible training context overflow
ggml_metal_init: allocating
2025-05-18 19:21:46 [DEBUG] 
ggml_metal_init: found device: Apple M3 Pro
ggml_metal_init: picking default device: Apple M3 Pro
2025-05-18 19:21:46 [DEBUG] 
ggml_metal_init: GPU name:   Apple M3 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = false
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 28991.03 MB
2025-05-18 19:21:46 [DEBUG] 
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
2025-05-18 19:21:46 [DEBUG] 
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
2025-05-18 19:21:46 [DEBUG] 
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
2025-05-18 19:21:46 [DEBUG] 
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
2025-05-18 19:21:46 [DEBUG] 
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16 (not supported)
2025-05-18 19:21:46 [DEBUG] 
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512   (not supported)
2025-05-18 19:21:46 [DEBUG] 
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96 (not supported)
2025-05-18 19:21:46 [DEBUG] 
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
2025-05-18 19:21:46 [DEBUG] 
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192 (not supported)
2025-05-18 19:21:46 [DEBUG] 
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
2025-05-18 19:21:46 [DEBUG] 
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
2025-05-18 19:21:46 [DEBUG] 
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache_unified: kv_size = 131072, type_k = 'f16', type_v = 'f16', n_layer = 24, can_shift = 1, padding = 256
2025-05-18 19:21:47 [DEBUG] 
llama_kv_cache_unified:        CPU KV buffer size =  1536.00 MiB
llama_kv_cache_unified: KV self size  = 1536.00 MiB, K (f16):  768.00 MiB, V (f16):  768.00 MiB
2025-05-18 19:21:47 [DEBUG] 
llama_context:      Metal compute buffer size =   455.00 MiB
llama_context:        CPU compute buffer size =   257.76 MiB
llama_context: graph nodes  = 799
llama_context: graph splits = 50
common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
2025-05-18 19:21:47 [DEBUG] 
Sampling params:	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
	top_k = 60, top_p = 0.940, min_p = 0.190, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.250
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
2025-05-18 19:21:47 [DEBUG] 
sampling: 
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 131072, n_batch = 512, n_predict = -1, n_keep = 52548
2025-05-18 19:21:47 [DEBUG] 
BeginProcessingPrompt