ignaciofls/LoadTest-AOAI

page_type: sample
languages: jmx
products: Azure Load Testing, Azure OpenAI, Others
name: Load Testing Azure OpenAI with JMeter
description: Config samples to perform load testing scenarios against Azure OpenAI, frontend, backend and Vector Search/DB

Azure Load Testing to benchmark Azure OpenAI and other E2E services

This repo helps benchmark latency and concurrency levels for the services involved in GPT completion scenarios. A typical end-to-end (e2e) Q&A / semantic search project for an enterprise GPT includes Azure OpenAI (AOAI), App Service (AAS) and Cognitive Search (ACS), each of which can become a bottleneck and add unnecessary latency. Prompt complexity, prompt length, turn history and geographical distribution also affect the user experience, as does the choice of model: GPT-4 responds more slowly than faster models such as GPT-3.5 Turbo (ChatGPT).

Internally, this repo uses Apache JMeter as the open source load testing tool and Azure Load Testing (ALT) as a PaaS test orchestrator. ALT simplifies the configuration, deployment and hosting of load test scenarios while supporting sophisticated deployment scenarios such as private networks, secured secrets (e.g. AOAI keys) and dynamic request bodies (sending diversified prompts to avoid possible cached replies).

Start your tests with a "user-like" approach: hit the final endpoint and analyze the results. If you find high latency or low concurrency, move on to isolated tests for each component (e.g. Embeddings API, Vector Search, App Service frontend and backend).

Step by step guide

Follow this tutorial to spin up an Azure Load Testing resource and use the configuration file and prompts CSV in your own subscription.

Tailor your tests

Simple config.jmx

The sample configuration file is a basic test that sends 300 requests over one minute. You can adjust it to your real scenarios with some tuning in Apache JMeter. A simple way to build it is with their UI tutorial.
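
When tuning the thread count for a target like "300 requests over one minute", a rough starting point is Little's law (concurrency = arrival rate x latency). The sketch below is only an estimate: the 2-second average latency is an assumed figure, not a measurement from this repo, and `required_threads` is a hypothetical helper.

```python
import math


def required_threads(total_requests: int, duration_s: float, avg_latency_s: float) -> int:
    """Estimate how many concurrent threads are needed to sustain a request rate.

    Uses Little's law: concurrency = arrival rate * average latency.
    Assumes each thread sends a request, waits for the reply, then sends the next.
    """
    rate_per_s = total_requests / duration_s
    return max(1, math.ceil(rate_per_s * avg_latency_s))


# 300 requests over 60 seconds is 5 requests per second.
rate = 300 / 60
# Assumed average completion latency of 2 seconds (replace with your own measurement).
threads = required_threads(300, 60, avg_latency_s=2.0)
print(f"target rate: {rate:.1f} req/s, estimated threads: {threads}")
```

If the measured latency turns out higher than assumed, the same number of threads will deliver fewer requests per second, so it is worth re-running the estimate after a first test.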

More advanced scenarios

More complex scenarios might include a sustained load over several minutes or even bursts of unauthorized calls (Placeholder)

Generate your own test datasets

Because of the nature of ChatGPT and GPT models in general, the e2e system should expect varied and long conversations. A fixed prompt does not simulate reality well, so we suggest using a prepopulated CSV of sample prompts that are randomized across the test threads. Note that most of the sample rows were generated with GPT-4. Once you move into production, you can use the logs from real usage to tailor the test dataset to your real needs in terms of length, complexity, etc.
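
A minimal sketch of building such a CSV from a list of conversations could look like the following. The one-column layout, the file name `prompts.csv` and the helper `write_prompt_csv` are assumptions for illustration and may not match the CSV shipped in this repo. Because each JSON body contains commas and quotes, the rows are written as quoted CSV fields, which assumes the JMeter CSV Data Set Config is set to allow quoted data.

```python
import csv
import json
import os
import random
import tempfile


def write_prompt_csv(path, conversations, seed=None):
    """Write one JSON request body per row, in shuffled order.

    Each conversation is a list of {"role": ..., "content": ...} dicts.
    """
    rows = [json.dumps({"messages": messages}, ensure_ascii=False) for messages in conversations]
    random.Random(seed).shuffle(rows)  # vary the order so prompts are not replayed identically
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow([row])  # single column; csv.writer quotes the JSON as needed
    return len(rows)


# Example with two hypothetical conversations.
conversations = [
    [{"role": "system", "content": "You are an AI assistant that helps people find information."},
     {"role": "user", "content": "When did Penny Hardaway join Orlando?"}],
    [{"role": "system", "content": "You are an AI assistant that helps people with their internet connection issues."},
     {"role": "user", "content": "My internet connection is really slow lately. What could be the problem?"}],
]
out_path = os.path.join(tempfile.mkdtemp(), "prompts.csv")
written = write_prompt_csv(out_path, conversations, seed=42)

# Read the file back to confirm every row is still valid JSON.
with open(out_path, newline="", encoding="utf-8") as f:
    loaded = [json.loads(row[0]) for row in csv.reader(f)]
print(f"wrote {written} rows, read back {len(loaded)}")
```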

For example, this system message helps to generate other chat history samples:

"You are an AI assistant that helps to create test datasets with a format compatible with GPT Chat session history. For example: "{"messages":[{"role":"system","content":"You are an AI assistant that helps people find information."},{"role":"user","content":"when did Michael Jordan win his first ring in the NBA?"},{"role":"assistant","content":"Michael Jordan won his first NBA championship ring on June 12, 1991, when the Chicago Bulls defeated the Los Angeles Lakers in the NBA Finals."},{"role":"user","content":"cool, for which team was Rodman playing for?"},{"role":"assistant","content":"During the 1990-1991 NBA season, when Michael Jordan won his first championship, Dennis Rodman was playing for the Detroit Pistons."},{"role":"user","content":"ah, they were good. Tons of friction between those two teams. When did Isiah Tomas retire?"},{"role":"assistant","content":"Isiah Thomas retired from professional basketball after the 1993-1994 NBA season."},{"role":"user","content":"thanks for correcting my typo, noticed you gently fixed it"},{"role":"assistant","content":"You're welcome! I'm here to help and provide accurate information. If you have any more questions or need assistance, feel free to ask."},{"role":"user","content":"when did Penny Hardaway join Orlando?"}]}"

along with a user message like: "Generate 10 examples of GPT Chat session history about customers complaining for slow internet connection". In this case the first completion was: "Example 1: {"messages":[{"role":"system","content":"You are an AI assistant that helps people with their internet connection issues."}, {"role":"user","content":"My internet connection is really slow lately. What could be the problem?"},{"role":"assistant","content":"There could be several reasons for a slow internet connection, such as network congestion, router issues, or interference from nearby devices. First, try restarting your modem and router to see if that resolves the issue."},{"role":"user","content":"I already tried restarting my modem and router, but it didn't help."}, {"role":"assistant","content":"In that case, you might want to check for any updates to your router's firmware, or try changing the Wi-Fi channel to reduce interference. If the issue persists, please contact your internet service provider for further assistance."}]}"
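
Generated samples are worth checking before they go into the test dataset. The sketch below is one possible sanity check, not part of this repo: `validate_chat_sample` is a hypothetical helper, and the rule that a sample should end with a user turn is an assumption (a history that ends with an assistant reply, like the completion above, leaves nothing for the model to answer).

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_sample(raw: str) -> list:
    """Return a list of problems found in one chat history sample (empty if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    messages = data.get("messages") if isinstance(data, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty content")

    # Assumed rule: the last turn should come from the user so there is something to complete.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "user":
        problems.append("last message is not from the user")
    return problems


good = '{"messages":[{"role":"system","content":"You help people."},{"role":"user","content":"Hi"}]}'
ends_with_assistant = '{"messages":[{"role":"user","content":"Hi"},{"role":"assistant","content":"Hello!"}]}'
print(validate_chat_sample(good))
print(validate_chat_sample(ends_with_assistant))
```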

Secure your tests

Key vault secrets

Following security best practices, you should not store your key in clear text in the .jmx config file. Instead, use a variable and keep the actual key in Azure Key Vault. See more guidance in the following link
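
As a pre-commit check, a small script can flag sensitive header values that are written literally instead of as a variable reference. This is only a sketch under assumptions: the embedded JMX fragment is a simplified Header Manager, the variable name `udv_aoai_key` is hypothetical, and `find_hardcoded_headers` is an illustrative helper rather than anything provided by JMeter or Azure Load Testing.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, assumed fragment of a JMeter Header Manager; a real .jmx has more wrapping elements.
SAMPLE_JMX = """<HeaderManager>
  <collectionProp name="HeaderManager.headers">
    <elementProp name="" elementType="Header">
      <stringProp name="Header.name">api-key</stringProp>
      <stringProp name="Header.value">${udv_aoai_key}</stringProp>
    </elementProp>
  </collectionProp>
</HeaderManager>"""

# A value counts as a variable reference if it is entirely of the form ${...}.
VARIABLE_REF = re.compile(r"^\$\{[^}]+\}$")


def find_hardcoded_headers(jmx_xml, sensitive=("api-key", "authorization")):
    """Return names of sensitive headers whose value is not a ${...} reference."""
    root = ET.fromstring(jmx_xml)
    findings = []
    for header in root.iter("elementProp"):
        if header.get("elementType") != "Header":
            continue
        props = {p.get("name"): (p.text or "") for p in header.findall("stringProp")}
        name = props.get("Header.name", "")
        value = props.get("Header.value", "").strip()
        if name.lower() in sensitive and not VARIABLE_REF.match(value):
            findings.append(name)
    return findings


print(find_hardcoded_headers(SAMPLE_JMX))
```

To scan a real file, read it with `open(path, encoding="utf-8").read()` and pass the text to the same function.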

VNET

Your final deployment will most likely be hardened and placed behind private networks. The following article shows the different options for injecting traffic from Azure Load Testing into the network boundary of your GPT deployment.

Results and reports

Out of the box results

Results are charted out of the box and include valuable information (latency, HTTP code, etc.) for each thread run. See the documentation for more details and the .jmx file itself for the list of data to include.

Compare different results

Different test variables lead to different patterns in latency and maximum concurrency. Compare your results as seen here
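
If you download the results files, a comparison can also be scripted. The sketch below assumes a CSV with `elapsed` (milliseconds) and `success` columns, as in JMeter's default CSV output; the exported file from your run may contain different columns, and the two inline runs are made-up data for illustration only.

```python
import csv
import io
import math
import statistics


def summarize(csv_text):
    """Summarize one results file: sample count, median, p95 (nearest rank) and error rate."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    elapsed = sorted(int(r["elapsed"]) for r in rows)
    errors = sum(1 for r in rows if r["success"].lower() != "true")
    p95_index = math.ceil(0.95 * len(elapsed)) - 1  # nearest-rank percentile
    return {
        "samples": len(rows),
        "median_ms": statistics.median(elapsed),
        "p95_ms": elapsed[p95_index],
        "error_rate": errors / len(rows),
    }


# Made-up sample data for two runs (not real measurements).
RUN_A = """timeStamp,elapsed,label,responseCode,success
1700000000000,800,chat,200,true
1700000000200,1200,chat,200,true
1700000000400,1000,chat,200,true
1700000000600,3000,chat,429,false
"""
RUN_B = """timeStamp,elapsed,label,responseCode,success
1700000100000,500,chat,200,true
1700000100200,600,chat,200,true
1700000100400,700,chat,200,true
1700000100600,900,chat,200,true
"""

summary_a = summarize(RUN_A)
summary_b = summarize(RUN_B)
for name, summary in (("run A", summary_a), ("run B", summary_b)):
    print(name, summary)
```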

Customize the output csv

Quite often you need not only the HTTP code, latency and timestamp but also the returned completion or other headers from the server. In that case, look at the Listener and other configuration options in JMeter, as seen here
