Hi,
I get much faster inference speed (almost twice as fast) when I load a model like this:
from huggingface_hub import hf_hub_download
from langchain_community.llms import LlamaCpp  # on older LangChain versions: from langchain.llms import LlamaCpp

# Download the GGUF file (or reuse it from the local Hugging Face cache) and get its local path;
# repo_id and model_file_name are defined elsewhere
model_path = hf_hub_download(repo_id=repo_id, filename=model_file_name, repo_type="model")

# initialize LlamaCpp LLM model
self.llm = LlamaCpp(
model_path=model_path,
seed=2,
temperature=0.5, # Default 0.8
max_tokens=256, # The maximum number of tokens to generate. Default 256
top_k=20, # Default 40
top_p=0.85, # Default 0.95
n_ctx=1024, # Text context window, 0 = from model
n_gpu_layers=64,
n_batch=1024, # Prompt processing maximum batch size Default 512
stop=['</s>'],
#n_threads=8, # Number of threads to use for generation (8 was the fastest in my tests)
#callback_manager=callbacks,
verbose=True, # Verbose is required to pass to the callback manager
streaming=False # Whether to stream the results, token by token.
)
instead of this:
model_path "Path/to/Model"
# initialize LlamaCpp LLM model
self.llm = LlamaCpp(
model_path=model_path,
seed=2,
temperature=0.5, # Default 0.8
max_tokens=256, # The maximum number of tokens to generate. Default 256
top_k=20, # Default 40
top_p=0.85, # Default 0.95
n_ctx=1024, # Text context window, 0 = from model
n_gpu_layers=64,
n_batch=1024, # Prompt processing maximum batch size Default 512
stop=['</s>'],
#n_threads=8, # Number of threads to use for generation (8 was the fastest in my tests)
#callback_manager=callbacks,
verbose=True, # Verbose is required to pass to the callback manager
streaming=False # Whether to stream the results, token by token.
)
What is the reason for this, and how can I load a local model without the HF Hub while getting the same speed?
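For context, hf_hub_download only resolves the GGUF file to a path inside the local Hugging Face cache (and skips the download once it is cached), so both snippets ultimately hand LlamaCpp a local file path. Below is a minimal diagnostic sketch, not an explanation of the speed gap, for checking which file each variant actually opens; it assumes the repo_id and model_file_name values from the snippet above and that the file has already been downloaded once:

import os
from huggingface_hub import hf_hub_download

# Resolve the cached GGUF file without re-downloading
# (local_files_only=True raises if it is not in the cache yet).
cached_path = hf_hub_download(
    repo_id=repo_id,            # assumed defined as in the snippet above
    filename=model_file_name,   # assumed defined as in the snippet above
    repo_type="model",
    local_files_only=True,
)

local_path = "Path/to/Model"    # the manually downloaded copy

# Compare what LlamaCpp would actually load in each case.
for p in (cached_path, local_path):
    print(p, os.path.getsize(p) if os.path.exists(p) else "missing")

If the two paths report different sizes, the two runs are loading different files (e.g. different quantizations of the same model), which would be worth ruling out before comparing speeds.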