Replies: 4 comments
-
I have the same problem.
-
Hi @rivamarco, what is your test script for repro?
-
I have received your email and will handle it as soon as possible. Thank you!
-
Hi @krrishdholakia, thanks for the answer. So if I set up this config:

```yaml
model_list:
  - model_name: openai/gpt-4o-mini
    litellm_params:
      api_key: os.environ/OPENAI_API_KEY
      model: openai/gpt-4o-mini
      rpm: 5
      tpm: 100

router_settings:
  routing_strategy: usage-based-routing-v2 # 👈 KEY CHANGE
  redis_host: redis
  redis_password: myredissecret
  redis_port: 6379
  enable_pre_call_check: true

general_settings:
  master_key: sk-1234
```

with a docker compose like this:
```yaml
services:
  litellm:
    image: litellm/litellm:latest
    ports:
      - "4000:4000"
    environment:
      - OPENAI_API_KEY=mykey
    volumes:
      - ./litellm.yaml:/app/config.yaml
    command: [ "--config", "/app/config.yaml", "--port", "4000" ]

  redis:
    image: redis:7
    restart: always
    command: >
      --requirepass ${REDIS_AUTH:-myredissecret}
    ports:
      - 6379:6379
    healthcheck:
      test: [ "CMD", "redis-cli", "ping" ]
      interval: 3s
      timeout: 10s
      retries: 10

  redis-insight:
    image: redis/redisinsight:latest
    restart: always
    ports:
      - "5540:5540"
```

I assume that if I perform more than 5 requests per minute and/or use more than 100 tokens per minute, I get an error. Is this assumption correct? For example, with the test below (I've tried both parallel and sequential requests) I am not able to obtain an error; I always get answers (also if I modify the prompt by putting …):

```python
from openai import OpenAI
import random
import concurrent.futures
import os
client = OpenAI(
    api_key="fake",
    base_url="http://localhost:4000")
# Sample 50 easy questions
questions = [
"What is 2 + 2?",
"What color is the sky?",
"How many legs does a dog have?",
"What is the capital of France?",
"Is the Earth round?",
"What sound does a cat make?",
"What is the opposite of hot?",
"What is the first letter of the alphabet?",
"What do you use to write on paper?",
"Is fire hot or cold?",
"What is 5 minus 3?",
"How many days are in a week?",
"What do bees make?",
"What shape has 4 equal sides?",
"What is water made of?",
"What is the color of grass?",
"What is the main language spoken in the USA?",
"How many hours in a day?",
"What do cows drink?",
"What is the freezing point of water in Celsius?",
"What fruit is yellow and curved?",
"What is the capital of Italy?",
"What comes after Monday?",
"How many wheels does a car have?",
"Which planet do we live on?",
"What do you do with your eyes?",
"How many fingers on one hand?",
"Is snow hot or cold?",
"What do you wear on your feet?",
"What color are bananas?",
"Which animal barks?",
"How many letters in the word 'cat'?",
"What do you use to eat soup?",
"What is 10 divided by 2?",
"What is the opposite of fast?",
"How many months are in a year?",
"Where does the sun rise?",
"What is H2O?",
"What do you wear on your head in winter?",
"What color are strawberries?",
"What time is it after 11 AM?",
"What rhymes with 'hat'?",
"Which animal has a trunk?",
"What is 7 + 1?",
"How do you spell 'dog'?",
"What is the main ingredient in a salad?",
"What do you drink in the morning?",
"What day comes before Friday?",
"What do you sleep on?",
"How do you greet someone?"
]
def ask_question(question):
    try:
        response = client.chat.completions.create(
            model="openai/gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Answer in a short and simple way."},
                {"role": "user", "content": question}
            ]
        )
        answer = response.choices[0].message.content
        return (question, answer)
    except Exception as e:
        return (question, f"Error: {str(e)}")
if __name__ == "__main__":
    random.shuffle(questions)
    selected_questions = questions[:50]
    results = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=10) as executor:
        futures = [executor.submit(ask_question, q) for q in selected_questions]
        for future in concurrent.futures.as_completed(futures):
            for a in future.result():
                print(f"{a}")
```

I believe there is something I do not understand about the way RPM and TPM work. Thank you very much for the help!
-
Hi, I'm playing around with LiteLLM Proxy, both for cloud models (OpenAI) and self-served models (vLLM), and it's great.
I was trying to implement the rate and token limits, but I don't understand whether what I want to do is achievable, because it doesn't seem to work; probably I'm doing something wrong.
What I would like to do is limit the invocations of the model. For example, if I have a proxy with one OpenAI model, I would like to set a maximum number of tokens (e.g. 1000) so that once that limit is reached, the proxy returns an error. The same for RPM.
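To make that concrete, the behaviour I'm after looks roughly like this. It's just a sketch, assuming the proxy answers with HTTP 429 once a limit is exceeded (which the OpenAI client raises as `openai.RateLimitError`); the key and model name are only placeholders for my setup.

```python
import openai
from openai import OpenAI

client = OpenAI(api_key="sk-1234", base_url="http://localhost:4000")

for i in range(20):
    try:
        client.chat.completions.create(
            model="openai/gpt-4o-mini",
            messages=[{"role": "user", "content": "ping"}],
        )
        print(f"request {i + 1}: accepted")
    except openai.RateLimitError as exc:
        # this is what I would expect to see once the RPM/TPM budget is used up
        print(f"request {i + 1}: blocked by the proxy -> {exc}")
        break
```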
The problem is that, with my configuration, it doesn't work as I expect: requests are never blocked, even with a higher number of tokens or a higher number of requests.
Did I miss something?
Thanks