Hey there, @rmansoul! I'm here to help you with any bugs, questions, or contributions you might have. Let's get to work!

Yes, the `max_concurrency` parameter can be used with models other than OpenAI's, including GGUF models loaded via llama-cpp-python, because it is handled in the shared `BaseLLM` code rather than in any specific model integration. Here is a relevant code snippet demonstrating how `BaseLLM.batch` handles it:

```python
def batch(
    self,
    inputs: List[LanguageModelInput],
    config: Optional[Union[RunnableConfig, List[RunnableConfig]]] = None,
    *,
    return_exceptions: bool = False,
    **kwargs: Any,
) -> List[str]:
    if not inputs:
        return []
    config = get_config_list(config, len(inputs))
    max_concurrency = config[0].get("max_concurrency")

    if max_concurrency is None:
        # No limit set: generate over all inputs in a single call.
        try:
            llm_result = self.generate_prompt(
                [self._convert_input(input) for input in inputs],
                callbacks=[c.get("callbacks") for c in config],
                tags=[c.get("tags") for c in config],
                metadata=[c.get("metadata") for c in config],
                run_name=[c.get("run_name") for c in config],
                **kwargs,
            )
            return [g[0].text for g in llm_result.generations]
        except Exception as e:
            if return_exceptions:
                return cast(List[str], [e for _ in inputs])
            else:
                raise e
    else:
        # Limit set: split the inputs into chunks of size max_concurrency
        # and process the chunks one after another.
        batches = [
            inputs[i : i + max_concurrency]
            for i in range(0, len(inputs), max_concurrency)
        ]
        config = [{**c, "max_concurrency": None} for c in config]  # type: ignore[misc]
        return [
            output
            for i, batch in enumerate(batches)
            for output in self.batch(
                batch,
                config=config[i * max_concurrency : (i + 1) * max_concurrency],
                return_exceptions=return_exceptions,
                **kwargs,
            )
        ]
```

This code shows how `max_concurrency` is applied: when it is set, the inputs are split into chunks of that size and each chunk is completed before the next one starts, regardless of which LLM integration is underneath.
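As a minimal usage sketch (the `LlamaCpp` model path and the prompts below are placeholders, not taken from this thread), the same mechanism can be exercised directly:

```python
from langchain_community.llms import LlamaCpp

# Placeholder path: point this at the local GGUF file.
llm = LlamaCpp(model_path="./mistral-7b-openorca.Q4_0.gguf")

# max_concurrency is read from the RunnableConfig, so LlamaCpp honors it the
# same way as any other BaseLLM: the four prompts are split into chunks of 2,
# and each chunk is completed before the next one is sent to the model.
results = llm.batch(
    ["Prompt 1", "Prompt 2", "Prompt 3", "Prompt 4"],
    config={"max_concurrency": 2},
)
print(results)
```

Note that this limits parallelism within a single `batch()` (or `abatch()`) call; it does not by itself queue independent `invoke()` calls coming from separate HTTP requests.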
Description
Hello,
I'm seeking a better understanding of how the 'max_concurrency' parameter works in the 'invoke' or 'batch' endpoints to manage simultaneous requests. Specifically, I'm wondering whether this parameter is applicable to models other than those from OpenAI, particularly GGUF models loaded via llama-cpp-python?
Details:
I'm setting up a web service using FastAPI, with LangServe hosting a Large Language Model (LLM) from Hugging Face. The model is in GGUF format, as shown in the code, and is loaded using llama-cpp-python. My goal is to control incoming requests to the web service by queuing them to avoid saturation problems (out of memory), and I wonder whether LangChain's max_concurrency parameter is useful in my case.
Specific Questions:
Can the 'max_concurrency' parameter be used with GGUF models loaded via Llama-CPP-Python?
How does the 'max_concurrency' parameter function within the context of the 'invoke' endpoint (in my case)?
Are there any best practices or implementation examples of this parameter in an environment similar to the one I'm deploying (FastAPI + LangServe + LangChain)? (A sketch is included below.)
Relevant LangChain documentation for the max_concurrency parameter: https://python.langchain.com/v0.1/docs/expression_language/primitives/functions/#accepting-a-runnable-config
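For the last question, here is a hedged sketch of what binding `max_concurrency` could look like in the FastAPI + LangServe setup described above. The model path, `n_gpu_layers` value, and route path are illustrative assumptions, not taken from this thread:

```python
from fastapi import FastAPI
from langchain_community.llms import LlamaCpp
from langserve import add_routes

# Assumed local path to the GGUF file.
llm = LlamaCpp(
    model_path="./mistral-7b-openorca.Q4_0.gguf",
    n_gpu_layers=-1,  # assumption: offload all layers to the GPU
)

app = FastAPI()

# Binding max_concurrency=1 means a single /llm/batch request with several
# inputs is processed one input at a time. It does not serialize separate
# HTTP requests; limiting those still requires a server-side mechanism
# (for example a semaphore in a FastAPI dependency, or a single worker).
add_routes(app, llm.with_config({"max_concurrency": 1}), path="/llm")
```

Since llama-cpp-python keeps the whole model in memory, running a single uvicorn worker process is usually preferable so the weights are loaded only once.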
System Info
Problem Details:
Environment: llama-cpp-python 0.2.76, langchain 0.2.1
Model: mistral-7b-openorca.Q4_0.gguf (Hugging Face)
Current Configuration:
AWS Instance: g5.4xlarge (24 vCPU, 64 GB RAM)
llama-cpp-python Version: 0.2.76
Web Server Framework: FastAPI
Operating System: Linux