Azure Load Testing to benchmark Azure OpenAI and other E2E services

page_type: sample
languages: jmx
products: Azure Load Testing, Azure OpenAI, Others
name: Load Testing Azure OpenAI with JMeter
description: Config samples to perform load testing scenarios against Azure OpenAI, frontend, backend and Vector Search/DB

This repo helps benchmark latency and concurrency levels for the services involved in GPT completion scenarios. A typical end-to-end (e2e) QnA / semantic search project for an enterprise GPT deployment includes Azure OpenAI (AOAI), App Service (AAS) and Cognitive Search (ACS), all moving pieces that can create bottlenecks and add unnecessary latency. Prompt complexity, prompt length, turn history and geographical distribution also affect the user experience, as does the choice of model: using GPT-4 instead of a faster model such as GPT-3.5 Turbo (ChatGPT) changes the final performance.

Internally, this repo uses Apache JMeter as an open source load testing tool and Azure Load Testing (ALT) as a PaaS test orchestrator. ALT simplifies the configuration, deployment and hosting of load test scenarios while supporting sophisticated deployment scenarios such as private networks, secured secrets (e.g. AOAI keys) and dynamic request bodies (sending diversified prompts to avoid possible cached replies).

Start your tests with a "user-like" approach: hit the final endpoint and analyze the results. If you find high latency or low concurrency, you can then run isolated tests against each individual piece (e.g. Embeddings API, Vector Search, App Service frontend and backend).

Step by step guide

Follow this tutorial to spin up an Azure Load Testing resource and use the configuration file and prompts CSV in your own subscription.

Tailor your tests

Simple config.jmx

The sample configuration file is a basic test that sends 300 requests over one minute. You can adjust it to your real scenarios with some tuning in Apache JMeter. A simple way to build it is to follow their UI tutorial.
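To reason about the numbers before editing the .jmx, the arithmetic behind "300 requests over one minute" can be sketched as below. This is only an illustration: the split into 30 threads with 10 loops each is an assumption, since the actual thread and loop counts in the sample config.jmx are not stated here, and the `LoadPlan` class is a hypothetical helper, not part of JMeter or Azure Load Testing.

```python
from dataclasses import dataclass


@dataclass
class LoadPlan:
    """Hypothetical helper describing a JMeter-style thread group."""

    threads: int       # concurrent virtual users ("Number of Threads" in JMeter)
    loops: int         # requests sent by each thread ("Loop Count" in JMeter)
    duration_s: float  # window over which all requests are spread

    @property
    def total_requests(self) -> int:
        return self.threads * self.loops

    @property
    def requests_per_second(self) -> float:
        return self.total_requests / self.duration_s

    @property
    def seconds_between_requests_per_thread(self) -> float:
        # Average pacing each thread needs to fill the whole window.
        return self.duration_s / self.loops


# Assumed split of the sample's 300 requests per minute: 30 threads x 10 loops.
plan = LoadPlan(threads=30, loops=10, duration_s=60)
print(plan.total_requests)                       # 300
print(plan.requests_per_second)                  # 5.0
print(plan.seconds_between_requests_per_thread)  # 6.0
```

Changing `threads` models a different concurrency level, while changing `loops` or `duration_s` models a longer or denser run at the same concurrency.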

More advanced scenarios

Other, more complex scenarios might include a sustained load over several minutes or even an avalanche of unauthorized calls. (Placeholder)

Generate your own test datasets

Because of the nature of ChatGPT and GPT models in general, the e2e system should expect varied and long conversations. A single fixed prompt does not simulate real usage well, so we suggest using a prepopulated CSV of sample prompts that are randomized across the test threads. Note that most of the sample rows were generated with GPT-4. Once you move into production, you can use logs from real usage to tailor the test dataset to your actual needs in terms of length, complexity, etc.

For example, this system message helps to generate other chat history samples:

"You are an AI assistant that helps to create test datasets with a format compatible with GPT Chat session history. For example: "{"messages":[{"role":"system","content":"You are an AI assistant that helps people find information."},{"role":"user","content":"when did Michael Jordan win his first ring in the NBA?"},{"role":"assistant","content":"Michael Jordan won his first NBA championship ring on June 12, 1991, when the Chicago Bulls defeated the Los Angeles Lakers in the NBA Finals."},{"role":"user","content":"cool, for which team was Rodman playing for?"},{"role":"assistant","content":"During the 1990-1991 NBA season, when Michael Jordan won his first championship, Dennis Rodman was playing for the Detroit Pistons."},{"role":"user","content":"ah, they were good. Tons of friction between those two teams. When did Isiah Tomas retire?"},{"role":"assistant","content":"Isiah Thomas retired from professional basketball after the 1993-1994 NBA season."},{"role":"user","content":"thanks for correcting my typo, noticed you gently fixed it"},{"role":"assistant","content":"You're welcome! I'm here to help and provide accurate information. If you have any more questions or need assistance, feel free to ask."},{"role":"user","content":"when did Penny Hardaway join Orlando?"}]}"

It is used along with a user message like: "Generate 10 examples of GPT Chat session history about customers complaining for slow internet connection". In this case the first completion was: "Example 1: {"messages":[{"role":"system","content":"You are an AI assistant that helps people with their internet connection issues."}, {"role":"user","content":"My internet connection is really slow lately. What could be the problem?"},{"role":"assistant","content":"There could be several reasons for a slow internet connection, such as network congestion, router issues, or interference from nearby devices. First, try restarting your modem and router to see if that resolves the issue."},{"role":"user","content":"I already tried restarting my modem and router, but it didn't help."}, {"role":"assistant","content":"In that case, you might want to check for any updates to your router's firmware, or try changing the Wi-Fi channel to reduce interference. If the issue persists, please contact your internet service provider for further assistance."}]}"
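Completions in this "Example N: {...}" shape need to be split into individual request bodies before they can feed a prompts CSV. The following is a minimal sketch of one way to do that with the Python standard library. The one-column, fully quoted CSV layout is an assumption: the layout of the repo's own prompts CSV and how the .jmx reads it are not shown here, and the function names are hypothetical.

```python
import csv
import io
import json


def extract_chat_payloads(completion: str) -> list:
    """Return every JSON object in `completion` that has a "messages" list."""
    decoder = json.JSONDecoder()
    payloads = []
    pos = completion.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at `pos` and
            # returns it together with the index where it ended.
            obj, end = decoder.raw_decode(completion, pos)
        except json.JSONDecodeError:
            # Not valid JSON at this brace; try the next one.
            pos = completion.find("{", pos + 1)
            continue
        if isinstance(obj, dict) and isinstance(obj.get("messages"), list):
            payloads.append(obj)
        pos = completion.find("{", end)
    return payloads


def to_csv(payloads: list) -> str:
    """Write one compact JSON request body per CSV row (assumed layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
    for payload in payloads:
        writer.writerow([json.dumps(payload, separators=(",", ":"))])
    return buffer.getvalue()


# Shortened, made-up completion text in the same shape as the example above.
completion = (
    'Example 1: {"messages":[{"role":"user","content":"My internet is slow."}]}\n'
    'Example 2: {"messages":[{"role":"user","content":"Pages time out, why?"}]}'
)
rows = extract_chat_payloads(completion)
print(len(rows))
print(to_csv(rows))
```

Filtering on the "messages" key drops any stray JSON the model may emit around the examples, so only well-formed chat histories reach the dataset.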

Secure your tests

Key vault secrets

Following security best practices, do not store your key in clear text in the .jmx config file. Instead, reference a variable and keep the actual key in Azure Key Vault. See more guidance in the following link.
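A quick pre-upload check can catch a key that was left in clear text. The sketch below scans a JMeter test plan for an `api-key` header (the header Azure OpenAI uses for key authentication) whose value is not a `${...}` variable reference. The sample XML is a hand-written fragment in the shape JMeter uses for a Header Manager, not the repo's actual config.jmx, and `${aoai_key}` is a hypothetical variable name.

```python
import xml.etree.ElementTree as ET

SENSITIVE_HEADERS = {"api-key"}


def find_clear_text_keys(jmx_text: str) -> list:
    """Return names of sensitive headers whose value is a literal, not ${...}."""
    root = ET.fromstring(jmx_text.strip())
    offenders = []
    for header in root.iter("elementProp"):
        if header.get("elementType") != "Header":
            continue
        props = {p.get("name"): (p.text or "") for p in header.findall("stringProp")}
        name = props.get("Header.name", "")
        value = props.get("Header.value", "")
        if name.lower() in SENSITIVE_HEADERS and "${" not in value:
            offenders.append(name)
    return offenders


# Hand-written fragment with a made-up literal key, for illustration only.
SAMPLE = """
<HeaderManager>
  <collectionProp name="HeaderManager.headers">
    <elementProp name="" elementType="Header">
      <stringProp name="Header.name">api-key</stringProp>
      <stringProp name="Header.value">0123456789abcdef</stringProp>
    </elementProp>
    <elementProp name="" elementType="Header">
      <stringProp name="Header.name">Content-Type</stringProp>
      <stringProp name="Header.value">application/json</stringProp>
    </elementProp>
  </collectionProp>
</HeaderManager>
"""

print(find_clear_text_keys(SAMPLE))
print(find_clear_text_keys(SAMPLE.replace("0123456789abcdef", "${aoai_key}")))
```

The check only looks at header values, so a key embedded elsewhere (for example in a URL query string) would need its own rule.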

VNET

Your final deployment will most likely be hardened and placed behind private networks. The following article shows the different options for injecting traffic from Azure Load Testing into your network-isolated GPT deployment.

Results and reports

Out of the box results

Results are charted out of the box and contain valuable information (latency, HTTP code, etc.) over the course of the test run. See the documentation for more details, and the .jmx file itself for the list of data fields to include.
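If you download the raw results file, a few lines of Python are enough to get a first summary. This sketch assumes the CSV uses JMeter's default column names `elapsed` (milliseconds) and `responseCode`; the columns actually written depend on the .jmx settings. The inline rows are made-up values, not measurements from this repo.

```python
import csv
import io
import math
import statistics
from collections import Counter


def summarize(results_csv: str) -> dict:
    """Summarize latency and HTTP codes from a JMeter-style results CSV."""
    elapsed = []
    codes = Counter()
    for row in csv.DictReader(io.StringIO(results_csv)):
        elapsed.append(int(row["elapsed"]))
        codes[row["responseCode"]] += 1
    if not elapsed:
        raise ValueError("results file contains no samples")
    elapsed.sort()
    # Nearest-rank 95th percentile over the sorted latencies.
    p95_index = math.ceil(0.95 * len(elapsed)) - 1
    return {
        "requests": len(elapsed),
        "median_ms": statistics.median(elapsed),
        "p95_ms": elapsed[p95_index],
        "codes": dict(codes),
    }


# Made-up sample rows for illustration only.
SAMPLE_RESULTS = """timeStamp,elapsed,label,responseCode,success
1700000000000,820,chat,200,true
1700000000200,1130,chat,200,true
1700000000400,95,chat,429,false
1700000000600,990,chat,200,true
"""

print(summarize(SAMPLE_RESULTS))
```

Counting response codes alongside latency matters because throttled calls (HTTP 429) often return quickly and can make the latency figures look better than what users would experience.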

Compare different results

Different test variables lead to different patterns in latency and maximum concurrency. Compare your results as seen here.

Customize the output csv

Quite often you need not only the HTTP code, latency and timestamp but also the returned completion or other headers from the server. In that case, check the Listener and other configuration options in JMeter, as seen here.

ignaciofls/LoadTest-AOAI
