
Can llama.cpp run the same model in parallel? #8840

Answered by Zuo-Peng
Zuo-Peng asked this question in Q&A

Thank you for your reply. I think I have solved my problem.
I gave up using the LangChain framework.

I start the server with ./llama-server -m model/path -np 4 to enable 4 parallel request slots. Then I use the following code to run inference:

import requests

# example prompt; replace with your own
prompt = "What is the capital of France?"

# send a single completion request to the llama.cpp server
url = "http://localhost:8080/completion"
headers = {"Content-Type": "application/json"}
data = {"prompt": prompt}
response = requests.post(url, headers=headers, json=data)
content = response.json()['content']

This way, I can run inference on the model in parallel, which greatly improves throughput.
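As a minimal sketch of how the requests can actually be issued concurrently (so that all 4 server slots are used), something like the following should work; the prompts and the thread-pool size here are just placeholders, not part of the original setup:

import requests
from concurrent.futures import ThreadPoolExecutor

url = "http://localhost:8080/completion"

def complete(prompt):
    # each in-flight request occupies one of the server's -np slots
    response = requests.post(url, json={"prompt": prompt})
    return response.json()['content']

# example prompts; replace with the real workload
prompts = ["Prompt A", "Prompt B", "Prompt C", "Prompt D"]

# send up to 4 requests at a time, matching -np 4
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(complete, prompts))

for p, content in zip(prompts, results):
    print(p, "->", content)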
