Description
Running the white-box KD pipeline aborts at the inference stage even though the teacher model loads and the vLLM engine initializes cleanly:

easydistill --config /easydistill/configs/kd_white_box.json
2025-06-10 20:03:26,249 - INFO - Running command: python /easydistill/easydistill/kd/infer.py --config /easydistill/configs/kd_white_box.json
2025-06-10 20:03:36,784 - INFO - 2025-06-10 20:03:36.784495: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2025-06-10 20:03:36,784 - ERROR - Detected error in output: 2025-06-10 20:03:36.784495: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2025-06-10 20:03:36,806 - INFO - 2025-06-10 20:03:36.806417: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-06-10 20:03:36,833 - INFO - WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
2025-06-10 20:03:36,833 - INFO - E0000 00:00:1749557016.833063 814 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-06-10 20:03:36,841 - INFO - E0000 00:00:1749557016.841735 814 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-06-10 20:03:36,869 - INFO - 2025-06-10 20:03:36.869151: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
2025-06-10 20:03:36,869 - INFO - To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-06-10 20:03:41,562 - INFO - INFO 06-10 20:03:41 [__init__.py:239] Automatically detected platform cuda.
2025-06-10 20:03:44,964 - INFO - 2025-06-10 20:03:44,964 - INFO - Generating distillation data from the teacher model!
2025-06-10 20:03:44,968 - INFO - 2025-06-10 20:03:44,968 - INFO - Loading ckpt and tokenizer: /mnt/bn/summary/model_hub/qwen/Qwen2-5-3B-Instruct/
2025-06-10 20:03:45,447 - INFO - 2025-06-10 20:03:45,447 - INFO - Initial eos_token_id 151645 from tokenizer
2025-06-10 20:03:45,448 - INFO - 2025-06-10 20:03:45,448 - INFO - tokenizer's eos_token: <|im_end|>, pad_token: <|im_end|>
2025-06-10 20:03:45,448 - INFO - 2025-06-10 20:03:45,448 - INFO - tokenizer's eos_token_id: 151645, pad_token_id: 151645
2025-06-10 20:04:06,308 - INFO - INFO 06-10 20:04:06 [config.py:689] This model supports multiple tasks: {'embed', 'score', 'generate', 'classify', 'reward'}. Defaulting to 'generate'.
2025-06-10 20:04:06,310 - INFO - INFO 06-10 20:04:06 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=8192.
2025-06-10 20:04:08,608 - INFO - INFO 06-10 20:04:08 [core.py:61] Initializing a V1 LLM engine (v0.8.4) with config: model='/mnt/bn/summary/model_hub/qwen/Qwen2-5-3B-Instruct/', speculative_config=None, tokenizer='/mnt/bn/summary/model_hub/qwen/Qwen2-5-3B-Instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/mnt/bn/summary/model_hub/qwen/Qwen2-5-3B-Instruct/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
2025-06-10 20:04:09,602 - INFO - WARNING 06-10 20:04:09 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7ff9479c48b0>
2025-06-10 20:04:11,176 - INFO - INFO 06-10 20:04:11 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
2025-06-10 20:04:11,177 - INFO - INFO 06-10 20:04:11 [cuda.py:221] Using Flash Attention backend on V1 engine.
2025-06-10 20:04:11,220 - INFO - INFO 06-10 20:04:11 [gpu_model_runner.py:1276] Starting to load model /mnt/bn/summary/model_hub/qwen/Qwen2-5-3B-Instruct/...
2025-06-10 20:04:11,674 - INFO - WARNING 06-10 20:04:11 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
2025-06-10 20:04:11,683 - INFO -
2025-06-10 20:04:11,683 - INFO - Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
2025-06-10 20:04:12,598 - INFO -
2025-06-10 20:04:12,598 - INFO - Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.09it/s]
2025-06-10 20:04:13,325 - INFO -
2025-06-10 20:04:13,325 - INFO - Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.24it/s]
2025-06-10 20:04:13,326 - INFO -
2025-06-10 20:04:13,326 - INFO - Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.22it/s]
2025-06-10 20:04:13,326 - INFO -
2025-06-10 20:04:13,426 - INFO - INFO 06-10 20:04:13 [loader.py:458] Loading weights took 1.74 seconds
2025-06-10 20:04:14,005 - INFO - INFO 06-10 20:04:14 [gpu_model_runner.py:1291] Model loading took 5.7916 GiB and 2.211553 seconds
2025-06-10 20:04:27,097 - INFO - INFO 06-10 20:04:27 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/20e2d2865e/rank_0_0 for vLLM's torch.compile
2025-06-10 20:04:27,100 - INFO - INFO 06-10 20:04:27 [backends.py:426] Dynamo bytecode transform time: 13.09 s
2025-06-10 20:04:28,228 - INFO - INFO 06-10 20:04:28 [backends.py:115] Directly load the compiled graph for shape None from the cache
2025-06-10 20:04:45,039 - INFO - INFO 06-10 20:04:45 [monitor.py:33] torch.compile takes 13.09 s in total
2025-06-10 20:04:46,273 - INFO - INFO 06-10 20:04:46 [kv_cache_utils.py:634] GPU KV cache size: 1,847,824 tokens
2025-06-10 20:04:46,274 - INFO - INFO 06-10 20:04:46 [kv_cache_utils.py:637] Maximum concurrency for 4,096 tokens per request: 451.13x
2025-06-10 20:05:29,238 - INFO - INFO 06-10 20:05:29 [gpu_model_runner.py:1626] Graph capturing finished in 43 secs, took 1.81 GiB
2025-06-10 20:05:29,251 - INFO - INFO 06-10 20:05:29 [core.py:163] init engine (profile, create kv cache, warmup model) took 75.25 seconds
2025-06-10 20:05:29,417 - INFO - INFO 06-10 20:05:29 [core_client.py:435] Core engine process 0 ready.
2025-06-10 20:05:29,418 - INFO - 2025-06-10 20:05:29,418 - INFO - vLLM model loaded successfully
2025-06-10 20:05:29,424 - INFO -
2025-06-10 20:05:29,425 - INFO - Generating responses: 0it [00:00, ?it/s]
2025-06-10 20:05:29,425 - INFO - Generating responses: 0it [00:00, ?it/s]
2025-06-10 20:05:32,717 - ERROR - Command failed (returncode=0, errors detected)
2025-06-10 20:05:32,718 - ERROR - Infer failed, skipping training
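The last two lines show the actual bug: the infer.py subprocess exits with returncode=0, yet the runner marks the step as failed and skips training. The apparent cause is that easydistill scans the subprocess output for error markers, and the oneDNN startup banner happens to contain the word "errors" ("floating-point round-off errors"), which trips the detector, as the "Detected error in output" line above confirms. Below is a minimal sketch of what such a check might look like, purely as a guess from this log (run_and_check and ERROR_MARKERS are hypothetical names, not easydistill's actual code):

import subprocess

ERROR_MARKERS = ("error",)  # assumption: a plain substring match like this

def run_and_check(cmd):
    # Hypothetical reconstruction of the runner's behavior, inferred from the log.
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    errors_detected = False
    for line in proc.stdout:
        if any(marker in line.lower() for marker in ERROR_MARKERS):
            # TensorFlow's oneDNN banner contains "round-off errors",
            # so this benign INFO line gets flagged as an error.
            errors_detected = True
    proc.wait()
    # Bug: the step is failed on matched text even when returncode == 0.
    return proc.returncode == 0 and not errors_detected

Trusting the child's return code, or at least whitelisting known-benign TensorFlow/absl startup banners, would avoid this false positive.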
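As a workaround until the detection logic is fixed, suppressing TensorFlow's startup noise before it ever reaches stderr should let the run proceed, assuming the messages only come from infer.py importing TensorFlow. TF_ENABLE_ONEDNN_OPTS=0 disables the oneDNN banner (as the message itself suggests), and TF_CPP_MIN_LOG_LEVEL=3 hides the remaining C++-level INFO/WARNING/ERROR lines, though it may not silence every absl message:

TF_ENABLE_ONEDNN_OPTS=0 TF_CPP_MIN_LOG_LEVEL=3 easydistill --config /easydistill/configs/kd_white_box.json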