# Periscope K6 LLM Performance Testing Framework

Explore the docs » · View Scripts · Report Bug · Request Feature
A comprehensive framework for load testing and benchmarking OpenAI API endpoints using K6, with a focus on measuring performance metrics for completions and embeddings.
## Table of Contents

- Overview
- Features
- Prerequisites
- Installation
- Architecture
- Usage
- Makefile Reference
- Configuration Options
- Example Workflows
- Scripts Explained
- Custom Tests
- Troubleshooting
- Advanced Usage
- License
## Overview

This framework provides a Docker-based environment for performance testing of OpenAI API endpoints. It includes preconfigured K6, InfluxDB, and Grafana services, along with scripts designed specifically for testing various aspects of the OpenAI API. The framework lets you measure key metrics such as response times, token usage efficiency, throughput, and error rates under different load scenarios.
## Features

- Containerized Environment: Docker and Docker Compose based deployment
- Metrics Visualization: Pre-configured Grafana dashboards for test results
- Core API Testing:
  - Chat completions testing
  - Embeddings generation testing (single and batch)
  - Code completion with prefix caching
- Performance Testing Patterns:
  - Smoke tests for basic functionality validation
  - Stress tests for identifying breaking points
  - Spike tests for sudden load surges
  - Soak tests for long-duration stability
  - Recovery tests for measuring system stabilization
  - Prefill-heavy tests for context processing efficiency
  - Decode-heavy tests for output generation throughput
- Extensible Framework: Modular design for custom test script creation
- Comprehensive Metrics: Token usage, latency, throughput, and processing rates
- Automated Workflows: Makefile-based command system for test management
## Prerequisites

- Docker and Docker Compose
- OpenAI API Key
- Bash or compatible shell environment
## Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/wizenheimer/periscope.git
   cd periscope
   ```

2. Initialize the environment:

   ```bash
   make setup
   ```

3. Start the infrastructure services:

   ```bash
   make start
   ```

4. Verify services are running:

   ```bash
   make status
   ```
## Architecture

The framework consists of three main components:
- K6: Open-source load testing tool that executes the test scripts
- InfluxDB: Time-series database that stores test metrics
- Grafana: Visualization platform that displays real-time and historical test results
These components are orchestrated using Docker Compose, with configuration files for seamless integration.
```
k6-openai-testing/
├── docker-compose.yaml # Container orchestration
├── Makefile # Simplified command interface
├── grafana/ # Grafana configuration
│ ├── grafana-dashboard.yaml
│ ├── grafana-datasource.yaml
├── dashboards/ # Dashboard templates
│ ├── k6-load-testing-results_rev3.json
│ └── k6-openai-tokens_rev1.json
├── scripts/ # Test scripts
│ ├── config.js # Shared configuration
│ ├── openai-completions.js
│ ├── openai-embeddings.js
│ ├── openai-benchmark.js
│ ├── openai-prefix-caching.js
│ ├── helpers/ # Utilities
│ │ ├── openaiGeneric.js
│ │ ├── utils.js
│ │ └── http.js
│ └── payloads/ # Request templates
│ ├── completions.js
│ └── embeddings.js
└── README.md
```
## Usage

1. Set your OpenAI API key:

   ```bash
   export OPENAI_API_KEY=your_api_key
   ```

2. Run a test:

   ```bash
   make test-completions
   ```

3. View results:

   ```bash
   make grafana-dashboard
   ```
The framework includes several specialized test scripts:
```bash
# Test chat completions
make test-completions

# Test embedding generation
make test-embeddings

# Test code completion with prefix caching
make test-prefix-caching

# Run comprehensive benchmark
make test-benchmark

# Run all tests sequentially
make test-all
```
## Makefile Reference

```bash
# Initial setup
make setup

# Start services
make start

# Check status
make status

# Open Grafana dashboard
make grafana-dashboard
```
```bash
# Set your OpenAI API key (replace with your actual key)
export OPENAI_API_KEY=sk-your-api-key

# Run completions test
make test-completions

# Run embeddings test
make test-embeddings

# Run benchmark test
make test-benchmark

# Run prefix caching test
make test-prefix-caching

# Run all tests sequentially
make test-all

# Run a specific script
make test script=custom-script.js
```
You can override any configuration option either through environment variables or by passing them as arguments:
```bash
# Override with environment variables
export OPENAI_COMPLETION_MODEL=gpt-4
make test-completions

# Or pass directly as arguments
make test-completions OPENAI_COMPLETION_MODEL=gpt-4 MAX_TOKENS=128
```
```bash
# View logs
make logs

# Restart services
make restart

# Stop services
make stop

# Clean up (stop and remove containers)
make clean

# Full purge (remove containers, volumes, and data)
make purge

# Show all available commands and configuration
make help
```
## Configuration Options

| Option | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | Your OpenAI API key | `"your-api-key-here"` |
| `OPENAI_BASE_URL` | Base URL for OpenAI API | `"https://api.openai.com"` |
| `OPENAI_COMPLETION_MODEL` | Model for completion requests | `"gpt-3.5-turbo"` |
| `OPENAI_EMBEDDING_MODEL` | Model for embedding requests | `"text-embedding-3-small"` |
| `OPENAI_CODING_MODEL` | Model for code completion requests | `"gpt-3.5-turbo"` |
| `MAX_TOKENS` | Maximum tokens to generate | `64` |
| `VUS` | Number of virtual users | `1` |
| `ENABLE_BATCH_MODE` | Enable batch embedding requests | `"false"` |
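K6 exposes environment variables through the global `__ENV` object, which is how options like these typically reach the test scripts. Below is a minimal sketch of what a configuration module along those lines could look like; only `openai.url` and `openai.key` are known from the custom-test example later in this README, and the rest of the field names, as well as the structure, are assumptions rather than the actual contents of `scripts/config.js`:

```javascript
// Illustrative sketch only -- the real scripts/config.js may be organized differently.
// K6 exposes environment variables through the global __ENV object.
export default {
  openai: {
    url: __ENV.OPENAI_BASE_URL || "https://api.openai.com",
    key: __ENV.OPENAI_API_KEY || "your-api-key-here",
    completionModel: __ENV.OPENAI_COMPLETION_MODEL || "gpt-3.5-turbo",
    embeddingModel: __ENV.OPENAI_EMBEDDING_MODEL || "text-embedding-3-small",
    codingModel: __ENV.OPENAI_CODING_MODEL || "gpt-3.5-turbo",
  },
  maxTokens: parseInt(__ENV.MAX_TOKENS || "64", 10),
  vus: parseInt(__ENV.VUS || "1", 10),
  batchMode: (__ENV.ENABLE_BATCH_MODE || "false") === "true",
};
```

With defaults mirroring the table above, `make test-completions` works as soon as `OPENAI_API_KEY` is set.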
## Example Workflows

Compare performance metrics between different models:
```bash
# Test with GPT-3.5 Turbo
make test-completions OPENAI_COMPLETION_MODEL=gpt-3.5-turbo

# Test with GPT-4
make test-completions OPENAI_COMPLETION_MODEL=gpt-4
```
Test how the API performs under increased load:
```bash
# Test with 1 virtual user
make test-completions VUS=1

# Test with 5 virtual users
make test-completions VUS=5
```
Test how input size affects performance:
```bash
# Test with a small token limit
make test-prefix-caching MAX_TOKENS=16

# Test with a larger token limit
make test-prefix-caching MAX_TOKENS=128
```
Test against an OpenAI-compatible alternative API:

```bash
make test-completions OPENAI_BASE_URL=https://alternative-api.example.com
```
## Scripts Explained

The chat completions test exercises the chat completions endpoint with various prompts, measuring response time, token usage, and generation throughput.
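As a rough illustration of the kind of measurement involved, the sketch below issues a chat completion with plain `k6/http` and records latency and token usage in custom Trend metrics. It is not the bundled `openai-completions.js`; that script uses the shared helpers, and the prompt and metric names here are placeholders:

```javascript
import http from "k6/http";
import { check } from "k6";
import { Trend } from "k6/metrics";

// Hypothetical metric names, for illustration only.
const completionLatency = new Trend("completion_latency_ms");
const completionTokens = new Trend("completion_tokens");

export default function () {
  const res = http.post(
    `${__ENV.OPENAI_BASE_URL || "https://api.openai.com"}/v1/chat/completions`,
    JSON.stringify({
      model: __ENV.OPENAI_COMPLETION_MODEL || "gpt-3.5-turbo",
      messages: [{ role: "user", content: "Summarize the benefits of load testing." }],
      max_tokens: parseInt(__ENV.MAX_TOKENS || "64", 10),
    }),
    {
      headers: {
        Authorization: `Bearer ${__ENV.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
    }
  );

  check(res, { "status is 200": (r) => r.status === 200 });
  completionLatency.add(res.timings.duration);

  if (res.status === 200) {
    const usage = res.json().usage;
    if (usage) {
      completionTokens.add(usage.completion_tokens);
    }
  }
}
```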
The embeddings test covers single-text embedding requests, measuring embedding generation latency and tracking vector dimensions for individual text items.
The batch embeddings test targets the embeddings endpoint specifically for batch processing (multiple texts in a single request), measuring batch processing efficiency and per-text latency and comparing performance across different batch sizes.
The prefix-caching test simulates an IDE-like code completion scenario in which each completion is appended to the prefix for the next request, testing continuous usage patterns and measuring token efficiency.
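The heart of the prefix-caching pattern is a loop that appends each completion to the prompt for the next request. Here is a minimal sketch of the idea, assuming a hypothetical seed snippet and loop count; the bundled `openai-prefix-caching.js` uses the shared helpers and its own payloads:

```javascript
import http from "k6/http";

const BASE = __ENV.OPENAI_BASE_URL || "https://api.openai.com";
const HEADERS = {
  Authorization: `Bearer ${__ENV.OPENAI_API_KEY}`,
  "Content-Type": "application/json",
};

export default function () {
  // Start from a seed "file" and grow it with each completion, as an IDE would.
  let prefix = "// Implement a simple queue in JavaScript\nclass Queue {";

  for (let i = 0; i < 3; i++) {
    const res = http.post(
      `${BASE}/v1/chat/completions`,
      JSON.stringify({
        model: __ENV.OPENAI_CODING_MODEL || "gpt-3.5-turbo",
        messages: [{ role: "user", content: `Continue this code:\n${prefix}` }],
        max_tokens: parseInt(__ENV.MAX_TOKENS || "64", 10),
      }),
      { headers: HEADERS }
    );
    if (res.status !== 200) break;

    // Append the continuation so the next request reuses the same growing prefix.
    prefix += res.json().choices[0].message.content;
  }
}
```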
The benchmark test is a comprehensive run that exercises both the completions and embeddings endpoints with increasing numbers of virtual users and provides comparative performance metrics.
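Ramping virtual users across both endpoints can be expressed with K6 scenarios. The layout below is a sketch only; the scenario names, stage durations, targets, and placeholder bodies are assumptions rather than the actual structure of `openai-benchmark.js`:

```javascript
import { sleep } from "k6";

// Illustrative scenario layout; the bundled benchmark script may differ.
export const options = {
  scenarios: {
    completions: {
      executor: "ramping-vus",
      exec: "completions",
      startVUs: 0,
      stages: [
        { duration: "1m", target: 2 },
        { duration: "1m", target: 5 },
        { duration: "30s", target: 0 },
      ],
    },
    embeddings: {
      executor: "ramping-vus",
      exec: "embeddings",
      startVUs: 0,
      stages: [
        { duration: "1m", target: 2 },
        { duration: "1m", target: 5 },
        { duration: "30s", target: 0 },
      ],
    },
  },
};

export function completions() {
  // A real benchmark would issue a /v1/chat/completions request here.
  sleep(1);
}

export function embeddings() {
  // A real benchmark would issue a /v1/embeddings request here.
  sleep(1);
}
```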
## Custom Tests

You can easily create custom test scripts by using the provided helpers and utilities:
1. Create a new JavaScript file in the `scripts` directory:

   ```javascript
   // scripts/my-custom-test.js
   import * as oai from "./helpers/openaiGeneric.js";
   import config from "./config.js";

   export const options = {
     vus: 1,
     duration: "30s",
   };

   const client = oai.createClient({
     url: config.openai.url,
     options: {
       model: "gpt-3.5-turbo",
     },
     headers: {
       Authorization: `Bearer ${config.openai.key}`,
     },
   });

   export default function () {
     const response = client.chatComplete({
       messages: [
         {
           role: "user",
           content: "Generate a random number between 1 and 100",
         },
       ],
     });
     console.log(oai.getContent(response));
   }
   ```
2. Run your custom test:

   ```bash
   make test script=my-custom-test.js
   ```
The framework includes a comprehensive set of specialized test scripts for different performance testing scenarios:
- Smoke Tests - Basic functionality verification with minimal load

  ```bash
  make test-completions-smoke
  make test-embeddings-smoke
  make test-smoke-all
  ```

- Stress Tests - Testing system behavior under high load to find breaking points

  ```bash
  make test-completions-stress
  make test-embeddings-stress
  make test-stress-all
  ```

- Spike Tests - Testing system reaction to sudden, dramatic increases in load

  ```bash
  make test-completions-spike
  make test-embeddings-spike
  make test-spike-all
  ```

- Soak Tests - Long-duration testing to identify issues that appear over time

  ```bash
  make test-completions-soak
  make test-embeddings-soak
  make test-soak-all
  ```

- Recovery Tests - Testing how the system recovers after failure or high load

  ```bash
  make test-completions-recovery
  make test-embeddings-recovery
  make test-recovery-all
  ```
- Smoke Tests: Minimal load (1 VU, few iterations) to verify basic functionality is working correctly before running more intensive tests.
- Stress Tests: Gradually increasing load until performance degradation or failures occur, to identify maximum operational capacity.
- Spike Tests: Sudden jumps to high user counts, then returning to baseline, to evaluate how the API handles unexpected traffic surges (see the sketch after this list).
- Soak Tests: Moderate but consistent load maintained for extended periods, to catch issues that only appear over time (memory leaks, gradual degradation).
- Recovery Tests: High load followed by a return to normal levels, to measure how quickly the system stabilizes after stress.
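In K6, these patterns mostly come down to different `stages` profiles. As an example, a spike profile might be shaped like the sketch below; the bundled spike scripts define their own durations and targets, so these numbers are placeholders:

```javascript
import { sleep } from "k6";

// Illustrative spike profile; the bundled spike scripts use their own numbers.
export const options = {
  stages: [
    { duration: "30s", target: 1 },  // baseline load
    { duration: "10s", target: 20 }, // sudden spike
    { duration: "1m", target: 20 },  // hold the spike
    { duration: "10s", target: 1 },  // drop back to baseline
    { duration: "1m", target: 1 },   // observe behavior after the spike
  ],
};

export default function () {
  // Each VU would issue a completions or embeddings request here.
  sleep(1);
}
```

Soak, stress, and recovery profiles differ mainly in how long each stage lasts and how aggressively the target climbs.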
For a comprehensive evaluation of the API's performance characteristics:
- Start with smoke tests to verify basic functionality
- Run stress tests to identify performance limits
- Run spike tests to assess resilience to sudden load
- Run recovery tests to measure stabilization capabilities
- Run soak tests to verify long-term stability
For soak tests and other long-running tests, you may want to modify the duration:
```bash
# Edit the test file to change the duration settings,
# or pass duration parameters as environment variables:
SOAK_DURATION=60m make test-completions-soak
```
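How a script picks the duration up is a one-liner in its `options`. The sketch below shows the general `__ENV` pattern; the fallback duration and the empty iteration body are assumptions for illustration, not the bundled soak scripts:

```javascript
import { sleep } from "k6";

// The duration can be taken from the environment, e.g.:
//   SOAK_DURATION=60m make test-completions-soak
// The fallback value here is an assumption for illustration.
export const options = {
  vus: parseInt(__ENV.VUS || "1", 10),
  duration: __ENV.SOAK_DURATION || "30m",
};

export default function () {
  // Issue a completions or embeddings request here, then pace the iterations.
  sleep(1);
}
```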
Each specialized test outputs different metrics relevant to its test pattern:
- Smoke Tests: Basic response validation and error rates
- Stress Tests: Identifies breaking points and maximum throughput
- Spike Tests: Measures failure rates during load spikes and recovery times
- Soak Tests: Tracks performance stability over time and error accumulation
- Recovery Tests: Measures stabilization time after stress periods
All these metrics are visualized in the Grafana dashboard for easy analysis.
The framework includes specialized tests designed to evaluate performance under different workload patterns:
Prefill-heavy tests focus on scenarios with large input contexts but relatively shorter outputs. These tests evaluate how effectively the model processes and understands extensive context.
```bash
make test-completions-prefill-heavy
```
This test simulates:
- Long document analysis
- Multi-turn conversations with extensive history
- Complex questions requiring deep context understanding
- Legal documents, research papers, or literature analysis
Key metrics:
- Prefill processing time
- Token processing rate (tokens/second)
- Performance with increasing context size
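To make this concrete, a prefill-heavy request pairs a large input with a small `max_tokens`, and a useful derived number is request time per prompt token. The sketch below is illustrative only; the prompt, metric name, and numbers are placeholders rather than what the bundled prefill-heavy script actually sends:

```javascript
import http from "k6/http";
import { Trend } from "k6/metrics";

// Hypothetical metric: milliseconds of request time per prompt token.
const msPerPromptToken = new Trend("ms_per_prompt_token");

export default function () {
  // Large synthetic context, deliberately small requested output.
  const longContext = "The quick brown fox jumps over the lazy dog. ".repeat(300);

  const res = http.post(
    `${__ENV.OPENAI_BASE_URL || "https://api.openai.com"}/v1/chat/completions`,
    JSON.stringify({
      model: __ENV.OPENAI_COMPLETION_MODEL || "gpt-3.5-turbo",
      messages: [
        { role: "user", content: `Summarize the following text in one sentence:\n${longContext}` },
      ],
      max_tokens: 32,
    }),
    {
      headers: {
        Authorization: `Bearer ${__ENV.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
    }
  );

  if (res.status === 200) {
    const usage = res.json().usage;
    if (usage && usage.prompt_tokens) {
      msPerPromptToken.add(res.timings.duration / usage.prompt_tokens);
    }
  }
}
```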
Decode-heavy tests focus on scenarios that require generating lengthy, detailed outputs from relatively concise prompts. These tests evaluate the model's token generation speed and throughput.
```bash
make test-completions-decode-heavy
```
This test simulates:
- Detailed explanations and tutorials
- Creative writing tasks
- Comprehensive guides and analyses
- Step-by-step instructions
Key metrics:
- Output generation time
- Token generation rate (tokens/second)
- Performance with varying output lengths
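Decode-heavy is the mirror image: a short prompt with a large `max_tokens`, with throughput approximated as completion tokens per second. Again a hedged sketch; the prompt and metric name are placeholders, and `MAX_OUTPUT_TOKENS` is read here only to mirror the customization documented below:

```javascript
import http from "k6/http";
import { Trend } from "k6/metrics";

// Hypothetical metric: generated tokens per second for a single request.
const tokensPerSecond = new Trend("completion_tokens_per_second");

export default function () {
  const res = http.post(
    `${__ENV.OPENAI_BASE_URL || "https://api.openai.com"}/v1/chat/completions`,
    JSON.stringify({
      model: __ENV.OPENAI_COMPLETION_MODEL || "gpt-3.5-turbo",
      messages: [
        { role: "user", content: "Write a detailed, step-by-step guide to brewing coffee." },
      ],
      max_tokens: parseInt(__ENV.MAX_OUTPUT_TOKENS || "2000", 10),
    }),
    {
      headers: {
        Authorization: `Bearer ${__ENV.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
    }
  );

  if (res.status === 200) {
    const usage = res.json().usage;
    if (usage && usage.completion_tokens) {
      // timings.duration is in milliseconds.
      tokensPerSecond.add(usage.completion_tokens / (res.timings.duration / 1000));
    }
  }
}
```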
To run both types of heavy workload tests:
```bash
make test-completions-heavy-all
```
You can customize these tests with environment variables:
```bash
# Set maximum output tokens for decode-heavy test
MAX_OUTPUT_TOKENS=2000 make test-completions-decode-heavy

# Use a specific model for heavy tests
OPENAI_COMPLETION_MODEL=gpt-4 make test-completions-heavy-all
```
These specialized tests are particularly valuable for:
- Model Comparison: Compare how different models handle prefill vs. decode tasks
- Pricing Optimization: Understand performance tradeoffs between models to optimize cost
- Application Design: Make informed decisions about prompt design based on performance characteristics
- Resource Planning: Plan infrastructure based on expected workload patterns
## Troubleshooting

- API Connection Errors
  - Verify your API key is correct
  - Check whether you're hitting rate limits
  - Ensure your network allows connections to the OpenAI API
- Container Issues
  - Try restarting the services: `make restart`
  - Check logs for errors: `make logs`
  - Verify Docker is running properly
- Grafana Dashboard Not Showing Data
  - Ensure InfluxDB is running: `make status`
  - Verify tests are outputting data to InfluxDB
  - Try restarting Grafana: `docker-compose restart k6-grafana`
## Advanced Usage

K6 supports various load patterns that can be defined in your test scripts:
```javascript
export const options = {
  // Ramping pattern
  stages: [
    { duration: "1m", target: 5 }, // Ramp up to 5 VUs
    { duration: "3m", target: 5 }, // Stay at 5 VUs
    { duration: "1m", target: 0 }, // Ramp down to 0 VUs
  ],

  // Or use fixed VUs:
  // vus: 10,
  // duration: "5m",
};
```
You can define custom metrics in your test scripts:
```javascript
import { Trend } from "k6/metrics";

// Define custom metrics
const promptLength = new Trend("prompt_length");
const responseLength = new Trend("response_length");

export default function () {
  // Your test logic: build a prompt, send the request, read the response text.
  const prompt = "Explain load testing in one sentence.";
  const response = ""; // e.g. the completion text returned by your request

  // Record metrics
  promptLength.add(prompt.length);
  responseLength.add(response.length);
}
```
Define pass/fail criteria for your tests:
```javascript
export const options = {
  thresholds: {
    http_req_duration: ["p(95)<500"], // 95% of requests must complete below 500ms
    http_req_failed: ["rate<0.01"], // Error rate must be less than 1%
  },
};
```
## License

This project is licensed under the MIT License and is provided "as is", with absolutely no guarantees. If it breaks your system, well, that's kind of the point, isn't it? Congratulations, you're now doing perf testing!
Use at your own risk. Side effects may include improved system resilience, fewer 3 AM panic attacks, and an irresistible urge to push big red buttons.
Consider this as my small act of rebellion against the "just eyeball the performance" approach to perf testing. Feel free to star the repo - each star will be printed and taped to my manager's door.
- A Developer With Too Much Time and Not Enough Approval