Hey there, @rmansoul! I'm here to help you with any bugs, questions, or contributions you might have. Let's get to work!

Yes, the `max_concurrency` parameter can be used with models other than OpenAI's, including GGUF models loaded via llama-cpp-python, because it is handled in the shared `BaseLLM` code rather than in any specific model integration. Here is a relevant code snippet demonstrating how `BaseLLM.batch` handles it:

```python
def batch(
    self,
    inputs: List[LanguageModelInput],
    config: Optional[Union[RunnableConfig, List[RunnableConfig]]] = None,
    *,
    return_exceptions: bool = False,
    **kwargs: Any,
) -> List[str]:
    if not inputs:
        return []
    config = get_config_list(config, len(inputs))
    max_concurrency = config[0].get("max_concurrency")

    if max_concurrency is None:
        # No limit set: generate over all inputs in a single call.
        try:
            llm_result = self.generate_prompt(
                [self._convert_input(input) for input in inputs],
                callbacks=[c.get("callbacks") for c in config],
                tags=[c.get("tags") for c in config],
                metadata=[c.get("metadata") for c in config],
                run_name=[c.get("run_name") for c in config],
                **kwargs,
            )
            return [g[0].text for g in llm_result.generations]
        except Exception as e:
            if return_exceptions:
                return cast(List[str], [e for _ in inputs])
            else:
                raise e
    else:
        # Limit set: split the inputs into chunks of size max_concurrency
        # and process the chunks one after another.
        batches = [
            inputs[i : i + max_concurrency]
            for i in range(0, len(inputs), max_concurrency)
        ]
        config = [{**c, "max_concurrency": None} for c in config]  # type: ignore[misc]
        return [
            output
            for i, batch in enumerate(batches)
            for output in self.batch(
                batch,
                config=config[i * max_concurrency : (i + 1) * max_concurrency],
                return_exceptions=return_exceptions,
                **kwargs,
            )
        ]
```

This code shows how `max_concurrency` is applied: when it is set, the inputs are split into chunks of that size and each chunk is completed before the next one starts, regardless of which LLM integration is underneath.
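As a minimal usage sketch (the `LlamaCpp` model path and the prompts below are placeholders, not taken from this thread), the same mechanism can be exercised directly:

```python
from langchain_community.llms import LlamaCpp

# Placeholder path: point this at the local GGUF file.
llm = LlamaCpp(model_path="./mistral-7b-openorca.Q4_0.gguf")

# max_concurrency is read from the RunnableConfig, so LlamaCpp honors it the
# same way as any other BaseLLM: the four prompts are split into chunks of 2,
# and each chunk is completed before the next one is sent to the model.
results = llm.batch(
    ["Prompt 1", "Prompt 2", "Prompt 3", "Prompt 4"],
    config={"max_concurrency": 2},
)
print(results)
```

Note that this limits parallelism within a single `batch()` (or `abatch()`) call; it does not by itself queue independent `invoke()` calls coming from separate HTTP requests.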
Description
Hello,
I'm seeking a better understanding of how the 'max_concurrency' parameter works in the 'invoke' or 'batch' endpoints to manage simultaneous requests. Specifically, I'm wondering whether this parameter is applicable to models other than those from OpenAI, particularly GGUF models loaded via llama-cpp-python?
Details:
I'm setting up a web service using FastAPI, with LangServe hosting a Large Language Model (LLM) from Hugging Face. The model is in GGUF format, as shown in the code, and is loaded using llama-cpp-python. My goal is to control incoming requests to the web service by queuing them to avoid saturation problems (out of memory), and I wonder whether LangChain's max_concurrency parameter is useful in my case.
Specific Questions:
Can the 'max_concurrency' parameter be used with GGUF models loaded via Llama-CPP-Python?
How does the 'max_concurrency' parameter function within the context of the 'invoke' endpoint (in my case)?
Are there any best practices or implementation examples of this parameter in an environment similar to the one I'm deploying (FastAPI + LangServe + LangChain)? (A sketch is included below.)
Relevant LangChain documentation for the max_concurrency parameter: https://python.langchain.com/v0.1/docs/expression_language/primitives/functions/#accepting-a-runnable-config
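For the last question, here is a hedged sketch of what binding `max_concurrency` could look like in the FastAPI + LangServe setup described above. The model path, `n_gpu_layers` value, and route path are illustrative assumptions, not taken from this thread:

```python
from fastapi import FastAPI
from langchain_community.llms import LlamaCpp
from langserve import add_routes

# Assumed local path to the GGUF file.
llm = LlamaCpp(
    model_path="./mistral-7b-openorca.Q4_0.gguf",
    n_gpu_layers=-1,  # assumption: offload all layers to the GPU
)

app = FastAPI()

# Binding max_concurrency=1 means a single /llm/batch request with several
# inputs is processed one input at a time. It does not serialize separate
# HTTP requests; limiting those still requires a server-side mechanism
# (for example a semaphore in a FastAPI dependency, or a single worker).
add_routes(app, llm.with_config({"max_concurrency": 1}), path="/llm")
```

Since llama-cpp-python keeps the whole model in memory, running a single uvicorn worker process is usually preferable so the weights are loaded only once.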
System Info
Problem Details:
Environment: llama-cpp-python 0.2.76, langchain 0.2.1
Model: mistral-7b-openorca.Q4_0.gguf (Hugging Face)
Current Configuration:
AWS Instance: g5.4xlarge (24 vCPU, 64 GB RAM)
llama-cpp-python Version: 0.2.76
Web Server Framework: FastAPI
Operating System: Linux