Hi,
I get much faster inference speed (almost twice as fast) when I load a model like this:
from huggingface_hub import hf_hub_download
from langchain_community.llms import LlamaCpp  # on older LangChain versions: from langchain.llms import LlamaCpp

# Download the GGUF file (or reuse it from the local Hugging Face cache) and get its local path;
# repo_id and model_file_name are defined elsewhere
model_path = hf_hub_download(repo_id=repo_id, filename=model_file_name, repo_type="model")

# initialize LlamaCpp LLM model
self.llm = LlamaCpp(
model_path=model_path,
seed=2,
temperature=0.5, # Default 0.8
max_tokens=256, # The maximum number of tokens to generate. Default 256
top_k=20, # Default 40
top_p=0.85, # Default 0.95
n_ctx=1024, # Text context window, 0 = from model
n_gpu_layers=64,
n_batch=1024, # Prompt processing maximum batch size Default 512
stop=['</s>'],
#n_threads=8, # Number of threads to use for generation (8 was the fastest in my tests)
#callback_manager=callbacks,
verbose=True, # Verbose is required to pass to the callback manager
streaming=False # Whether to stream the results, token by token.
)
instead of this:
model_path "Path/to/Model"
# initialize LlamaCpp LLM model
self.llm = LlamaCpp(
model_path=model_path,
seed=2,
temperature=0.5, # Default 0.8
max_tokens=256, # The maximum number of tokens to generate. Default 256
top_k=20, # Default 40
top_p=0.85, # Default 0.95
n_ctx=1024, # Text context window, 0 = from model
n_gpu_layers=64,
n_batch=1024, # Prompt processing maximum batch size Default 512
stop=['</s>'],
#n_threads=8, # Number of threads to use for generation (8 was the fastest in my tests)
#callback_manager=callbacks,
verbose=True, # Verbose is required to pass to the callback manager
streaming=False # Whether to stream the results, token by token.
)
What is the reason for this, and how can I load a local model without the HF Hub while getting the same speed?
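For context, hf_hub_download only resolves the GGUF file to a path inside the local Hugging Face cache (and skips the download once it is cached), so both snippets ultimately hand LlamaCpp a local file path. Below is a minimal diagnostic sketch, not an explanation of the speed gap, for checking which file each variant actually opens; it assumes the repo_id and model_file_name values from the snippet above and that the file has already been downloaded once:

import os
from huggingface_hub import hf_hub_download

# Resolve the cached GGUF file without re-downloading
# (local_files_only=True raises if it is not in the cache yet).
cached_path = hf_hub_download(
    repo_id=repo_id,            # assumed defined as in the snippet above
    filename=model_file_name,   # assumed defined as in the snippet above
    repo_type="model",
    local_files_only=True,
)

local_path = "Path/to/Model"    # the manually downloaded copy

# Compare what LlamaCpp would actually load in each case.
for p in (cached_path, local_path):
    print(p, os.path.getsize(p) if os.path.exists(p) else "missing")

If the two paths report different sizes, the two runs are loading different files (e.g. different quantizations of the same model), which would be worth ruling out before comparing speeds.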