A Python tool for comparing responses from multiple Large Language Model APIs side by side.
- Compares responses from three AI APIs:
  - Inception Labs API (mercury, gpt-3.5-turbo, gpt-4, etc.)
  - OpenAI API (gpt-3.5-turbo, gpt-4, gpt-4o, etc.)
  - Google Gemini AI (gemini-pro, gemini-1.5-pro, etc.)
- Two comparison modes:
  - Single Prompt: Compare responses to one prompt across all APIs
  - Prompt Chains: Execute multi-step chains where each step's output feeds into the next
- Comprehensive Performance Benchmarking:
  - TTFT (Time to First Token): Measures response initiation speed
  - TPOT (Time Per Output Token): Average token generation time
  - E2EL (End-to-End Latency): Total request completion time
  - Throughput Metrics: Tokens/second and requests/second analysis
  - Statistical Analysis: Mean, median, and percentiles for all timing metrics
- Configurable model selection - Choose specific models for each provider
- Measures and records response times for each API and chain step
- Saves results in CSV format for easy analysis
- Reads prompts from external text files or JSON chain definitions
- Handles API errors gracefully
- Command-line options to list available models and enable performance analysis
- Python 3.7 or higher
- pip (Python package installer)
- Clone or download this repository:

  ```bash
  git clone <repository-url>
  cd LLM-comparison-toolkit
  ```

- Create a virtual environment (recommended):

  ```bash
  python -m venv venv

  # Activate the virtual environment
  # On Windows:
  venv\Scripts\activate
  # On macOS/Linux:
  source venv/bin/activate
  ```

- Install required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables (a key-loading sketch follows the API key steps below):

  ```bash
  # Copy the example environment file
  cp .env.example .env

  # Edit the .env file with your actual API keys
  nano .env  # or use any text editor
  ```
- Get your API Keys:

  - OpenAI:
    - Visit OpenAI Platform
    - Create an account and generate an API key
    - Add to `.env` as: `OPENAI_API_KEY=sk-your-key-here`

  - Google Gemini:
    - Visit Google AI Studio
    - Create a Google account and generate an API key
    - Add to `.env` as: `GEMINI_API_KEY=your-key-here`

  - Inception Labs:
    - Contact Inception Labs for API access
    - Add to `.env` as: `INCEPTION_API_KEY=your-key-here`
    - Note: You may need to adjust the API endpoint in the code
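
For reference, here is a minimal sketch of how the keys in `.env` might be loaded at runtime with `python-dotenv`; the variable names match the entries above, but the exact loading code in `llm_comparison.py` may differ.

```python
# Minimal sketch, assuming python-dotenv is installed (pip install python-dotenv).
import os

from dotenv import load_dotenv

load_dotenv()  # read key/value pairs from .env into the process environment

# Variable names match the .env entries described above.
keys = {
    "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY"),
    "GEMINI_API_KEY": os.getenv("GEMINI_API_KEY"),
    "INCEPTION_API_KEY": os.getenv("INCEPTION_API_KEY"),
}

# Fail fast if any key is missing so later API calls don't fail cryptically.
missing = [name for name, value in keys.items() if not value]
if missing:
    raise RuntimeError(f"Missing API keys in .env: {', '.join(missing)}")
```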
- Prepare your prompt:

  ```bash
  # Edit the prompt file with your question
  echo "What are the benefits of renewable energy?" > prompt.txt
  ```

- Run the comparison:

  ```bash
  python llm_comparison.py
  ```

- Prepare your chain:

  ```bash
  # Use the provided example or create your own
  cp simple_chain_example.json input_prompt_chain.json

  # Or edit the chain file
  nano input_prompt_chain.json
  ```

- Run the chain comparison:

  ```bash
  python llm_comparison.py --chain
  ```
```bash
# View available models
python llm_comparison.py --models

# Edit models_config.json to select different models
nano models_config.json

# Run with detailed performance metrics
python llm_comparison.py --performance

# Performance analysis for chains
python llm_comparison.py --chain --performance
```
The `--performance` flag enables comprehensive benchmarking that measures:
- Time to First Token (TTFT): How quickly each API starts responding
- Time Per Output Token (TPOT): Average time to generate each token
- End-to-End Latency (E2EL): Total time from request to completion
- Throughput: Tokens per second and requests per second
- Statistical Analysis: Mean, median, standard deviation, and percentiles
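
As an illustration of what these metrics mean, here is a minimal sketch of how TTFT, TPOT, E2EL, and throughput could be derived from a streamed response. The `stream_tokens` helper is a hypothetical stand-in for a streaming API call, not the toolkit's actual implementation.

```python
import time
from statistics import mean, median

def measure_streaming_call(stream_tokens, prompt):
    """Measure TTFT, TPOT, E2EL, and throughput for one streamed request.

    `stream_tokens` is a hypothetical callable that yields response tokens.
    """
    start = time.perf_counter()
    first_token_time = None
    token_count = 0

    for _ in stream_tokens(prompt):  # consume the streamed tokens
        if first_token_time is None:
            first_token_time = time.perf_counter()  # first token has arrived
        token_count += 1

    end = time.perf_counter()
    ttft = (first_token_time or end) - start              # Time to First Token
    e2el = end - start                                     # End-to-End Latency
    # Time Per Output Token: generation time after the first token, averaged
    tpot = ((end - first_token_time) / max(token_count - 1, 1)) if first_token_time else 0.0
    throughput = token_count / e2el if e2el > 0 else 0.0   # tokens per second
    return {"ttft": ttft, "tpot": tpot, "e2el": e2el, "tokens_per_second": throughput}

def summarize(values):
    """Aggregate per-request timings across runs (mean / median / rough p95)."""
    ordered = sorted(values)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return {"mean": mean(values), "median": median(values), "p95": p95}
```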
- Console output shows a summary
- Single prompt: Results saved to `llm_comparison_results_TIMESTAMP.csv`
- Chain: Results saved to `llm_chain_results_TIMESTAMP.csv`
- Performance metrics: Detailed benchmarking report displayed in the console when using `--performance`
```text
$ python llm_comparison.py
Starting LLM Comparison...
============================================================
CURRENT MODEL CONFIGURATION
============================================================
Inception Labs: mercury
OpenAI: gpt-3.5-turbo
Gemini: gemini-pro
Max Tokens: 1000
Temperature: 0.7
============================================================
Prompt: What are the benefits of renewable energy?
Calling APIs...
- Calling Inception Labs API (mercury)...
- Calling OpenAI API (gpt-3.5-turbo)...
- Calling Gemini AI (gemini-pro)...
Results saved to: llm_comparison_results.csv
================================================================================
COMPARISON SUMMARY
================================================================================
Prompt: What are the benefits of renewable energy?
Inception Labs - mercury (2.341s):
Success: True
Response: Renewable energy offers numerous benefits including reduced carbon emissions, energy independence, job creation...
OpenAI - gpt-3.5-turbo (1.823s):
Success: True
Response: Renewable energy sources like solar, wind, and hydroelectric power provide several key advantages...
Gemini AI - gemini-pro (1.456s):
Success: True
Response: The transition to renewable energy brings multiple benefits for both the environment and economy...
```
```bash
# Run single prompt comparison
python llm_comparison.py
# Run prompt chain comparison
python llm_comparison.py --chain
# List all available models
python llm_comparison.py --models
# Enable detailed performance benchmarking
python llm_comparison.py --performance
# Combine flags for performance analysis with chains
python llm_comparison.py --chain --performance
# Show help
python llm_comparison.py --help
```
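
For readers curious how such a command-line interface might be wired up, here is a minimal `argparse` sketch; the flag names match the commands above, but the actual option parsing in `llm_comparison.py` may be implemented differently.

```python
# Minimal sketch of the CLI surface described above (an assumption, not the actual source).
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Compare responses from multiple LLM APIs side by side."
    )
    parser.add_argument("--chain", action="store_true",
                        help="run a multi-step prompt chain instead of a single prompt")
    parser.add_argument("--models", action="store_true",
                        help="list the available models for each provider and exit")
    parser.add_argument("--performance", action="store_true",
                        help="collect detailed TTFT/TPOT/E2EL benchmarking metrics")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args.chain, args.models, args.performance)
```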
The `models_config.json` file controls which models are used for comparison:
```json
{
"inception_labs": {
"available_models": ["mercury", "gpt-3.5-turbo", "gpt-4", "gpt-4-turbo"],
"default_model": "mercury"
},
"openai": {
"available_models": ["gpt-3.5-turbo", "gpt-4", "gpt-4o", "gpt-4o-mini"],
"default_model": "gpt-3.5-turbo"
},
"gemini": {
"available_models": ["gemini-pro", "gemini-1.5-pro", "gemini-1.5-flash"],
"default_model": "gemini-pro"
},
"selected_models": {
"inception_labs": "mercury",
"openai": "gpt-3.5-turbo",
"gemini": "gemini-pro"
},
"api_parameters": {
"max_tokens": 1000,
"temperature": 0.7,
"timeout_seconds": 60
}
}
```
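
As an illustration, a script could read this configuration roughly as in the sketch below; the file name and keys match the JSON above, but the toolkit's actual loading logic may differ.

```python
# Minimal sketch, assuming models_config.json sits next to the script.
import json
from pathlib import Path

def load_model_config(path="models_config.json"):
    """Return the selected model per provider plus the shared API parameters."""
    config = json.loads(Path(path).read_text())
    selected = config["selected_models"]   # e.g. {"openai": "gpt-3.5-turbo", ...}
    params = config["api_parameters"]      # max_tokens, temperature, timeout_seconds
    return selected, params

selected_models, api_params = load_model_config()
print(f"OpenAI model: {selected_models['openai']}, temperature: {api_params['temperature']}")
```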
- View available models:

  ```bash
  python llm_comparison.py --models
  ```

- Edit the configuration:

  ```bash
  nano models_config.json
  ```

- Update the `selected_models` section:

  ```json
  "selected_models": {
    "inception_labs": "gpt-4",
    "openai": "gpt-4o",
    "gemini": "gemini-1.5-pro"
  }
  ```
- For speed: Use `gpt-3.5-turbo`, `gemini-1.5-flash`, `mercury`
- For quality: Use `gpt-4o`, `gemini-1.5-pro`, `gpt-4`
- For cost-effectiveness: Use `gpt-3.5-turbo`, `gemini-pro`, `mercury`
Prompt chains allow you to execute a sequence of LLM calls where the output of one step becomes the input for the next step. This enables complex workflows like:
- Research → Analysis → Summary → Action Items
- Draft → Review → Improve → Finalize
- Question → Research → Synthesize → Recommend
Create a JSON file with the following structure:
```json
{
"chain_name": "Your Chain Name",
"description": "Description of what this chain does",
"steps": [
{
"step_id": 1,
"name": "Step Name",
"prompt": "Your prompt here",
"use_previous_output": false,
"output_variable": "variable_name"
},
{
"step_id": 2,
"name": "Next Step",
"prompt": "Process this data: {previous_output}",
"use_previous_output": true,
"output_variable": "next_variable"
}
],
"chain_parameters": {
"max_tokens_per_step": 1500,
"temperature": 0.7,
"timeout_seconds": 60,
"continue_on_error": false,
"save_intermediate_results": true
}
}
```
- `use_previous_output`: If `true`, replaces `{previous_output}` in the prompt with the previous step's response
- `continue_on_error`: If `true`, continues executing remaining steps even if one fails
- `max_tokens_per_step`: Token limit for each step (can override the global setting)
- `save_intermediate_results`: Whether to save each step's output
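
To make the substitution behaviour concrete, here is a minimal sketch of how a chain runner might apply these settings; `run_chain` and `call_llm` are illustrative names, and the real execution loop in the toolkit may differ.

```python
# Minimal sketch of chain execution; `call_llm` is a hypothetical single-prompt helper.
import json

def run_chain(chain_path, call_llm):
    """Execute the steps of a chain definition, feeding each output forward."""
    with open(chain_path) as f:
        chain = json.load(f)

    params = chain.get("chain_parameters", {})
    previous_output = ""
    results = []

    for step in chain["steps"]:
        prompt = step["prompt"]
        if step.get("use_previous_output"):
            # Substitute the previous step's response for the {previous_output} placeholder
            prompt = prompt.replace("{previous_output}", previous_output)
        try:
            response = call_llm(prompt)
        except Exception as exc:
            results.append({"step": step["name"], "success": False, "error": str(exc)})
            if not params.get("continue_on_error", False):
                break
            continue
        previous_output = response
        results.append({"step": step["name"], "success": True, "response": response})

    return results
```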
The toolkit includes example chains:
- `simple_chain_example.json`: Basic 2-step writing improvement chain
- `input_prompt_chain.json`: Complex 4-step research and analysis chain
Chain results include:
- Step-by-step execution details
- Timing for each step and total time
- Success/failure status for each step
- Final output from the last successful step
- Detailed CSV with all intermediate results
The script provides real-time feedback and a summary including:
- Response Time: How long each API took to respond (in seconds)
- Success Status: Whether the API call was successful
- Response Preview: First 200 characters of each response
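
As a rough illustration of how such a summary entry might be produced, the sketch below times one call and truncates the response for preview; `call_api` is a hypothetical per-provider helper, not the toolkit's actual function.

```python
import time

def summarize_call(provider, call_api, prompt):
    """Time one API call and build the summary fields described above."""
    start = time.perf_counter()
    try:
        response = call_api(prompt)
        success = True
    except Exception as exc:
        response = f"Error: {exc}"
        success = False
    elapsed = time.perf_counter() - start
    return {
        "provider": provider,
        "response_time_seconds": round(elapsed, 3),  # response time in seconds
        "success": success,                          # success status
        "preview": response[:200],                   # first 200 characters of the response
    }
```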
The `llm_comparison_results.csv` file contains two rows:
Row 1 - Main Results:
| Column | Description |
| --- | --- |
| Timestamp | When the comparison was run |
| Prompt | The exact prompt sent to all APIs |
| Inception_Response | Full response from Inception Labs |
| Inception_Time_Seconds | Response time in seconds |
| Inception_Success | True/False for API success |
| OpenAI_Response | Full response from OpenAI |
| OpenAI_Time_Seconds | Response time in seconds |
| OpenAI_Success | True/False for API success |
| Gemini_Response | Full response from Gemini |
| Gemini_Time_Seconds | Response time in seconds |
| Gemini_Success | True/False for API success |
Row 2 - Timing Summary: A quick reference row showing just the response times for easy comparison.
- Response Quality: Compare the actual responses to see which API provides the most helpful, accurate, or detailed answer for your use case.

- Speed Comparison: Check the timing columns to see which API responds fastest. Typical ranges:
  - Fast: < 2 seconds
  - Moderate: 2-5 seconds
  - Slow: > 5 seconds

- Reliability: Check the Success columns. Failed calls might indicate:
  - API rate limits
  - Network issues
  - Invalid API keys
  - Service outages

- Cost Analysis: Different APIs have different pricing models. Faster responses might cost more per request.
```csv
Timestamp,Prompt,Inception_Response,Inception_Time_Seconds,Inception_Success,OpenAI_Response,OpenAI_Time_Seconds,OpenAI_Success,Gemini_Response,Gemini_Time_Seconds,Gemini_Success
2024-01-15 14:30:22,"What are the benefits of renewable energy?","Renewable energy offers numerous benefits...",2.341,True,"Renewable energy sources like solar...",1.823,True,"The transition to renewable energy...",1.456,True
2024-01-15 14:30:22 - Timing Summary,Response Times (seconds),"Inception: 2.341s",,,"OpenAI: 1.823s",,,"Gemini: 1.456s",,
```
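
One convenient way to work with these files is pandas; the sketch below loads the main results row and compares response times, assuming pandas is installed and the column layout documented above.

```python
# Minimal sketch, assuming `pip install pandas` and the column names documented above.
import pandas as pd

df = pd.read_csv("llm_comparison_results.csv", nrows=1)  # row 1 holds the main results

timing_columns = {
    "Inception Labs": "Inception_Time_Seconds",
    "OpenAI": "OpenAI_Time_Seconds",
    "Gemini": "Gemini_Time_Seconds",
}

for provider, column in timing_columns.items():
    print(f"{provider}: {df.loc[0, column]:.3f}s")

# Find the fastest provider for this prompt
fastest = min(timing_columns, key=lambda p: df.loc[0, timing_columns[p]])
print(f"Fastest: {fastest}")
```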
- Models: Edit the model names in the script (e.g., change from `gpt-3.5-turbo` to `gpt-4`)
- Parameters: Adjust temperature, max_tokens, etc. in each API call function
- Inception Labs Endpoint: Update the `inception_endpoint` URL if using a different endpoint
The tool handles various error scenarios:
- Missing API keys
- Network timeouts
- API rate limits
- Invalid responses
All errors are logged in the CSV output for analysis.
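
A hedged sketch of how these failure modes might be caught and turned into loggable messages; it assumes a `requests`-based HTTP call (e.g. for the Inception Labs endpoint) and is illustrative rather than the toolkit's exact code.

```python
# Illustrative error handling around a requests-based API call (an assumption).
import requests

def safe_post(url, payload, headers, timeout=60):
    """Call an HTTP API and convert common failures into a (success, message) pair."""
    if not headers.get("Authorization"):
        return False, "Error: missing API key"
    try:
        resp = requests.post(url, json=payload, headers=headers, timeout=timeout)
        resp.raise_for_status()
        return True, resp.json()
    except requests.exceptions.Timeout:
        return False, "Error: request timed out"
    except requests.exceptions.HTTPError as exc:
        if exc.response is not None and exc.response.status_code == 429:
            return False, "Error: rate limit exceeded"
        return False, f"Error: HTTP {exc.response.status_code}"
    except requests.exceptions.RequestException as exc:  # network issues, invalid responses, etc.
        return False, f"Error: {exc}"

# A False result would feed the *_Success column, with the message stored alongside it,
# so failed calls remain visible in the CSV output.
```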