A comprehensive evaluation framework for testing Large Language Model (LLM) agents in mobile environments. This repository implements the QualGent Research Coding Challenge and benchmarks how well LLM agents (GPT-4, Claude-3) can navigate Android applications by interpreting UI observations and generating valid actions.
Repository: https://github.com/dishant2009/android-llm-agent-eval
- Multi-Provider Support: Unified interface for OpenAI GPT-4 and Anthropic Claude-3
- Advanced Prompting Strategies: Base, few-shot, and self-reflection prompting approaches
- Comprehensive Evaluation Metrics: Step accuracy, episode success rates, and failure analysis
- Memory Buffer Integration: Context-aware action selection with configurable history tracking (see the agent sketch after this list)
- Interactive Visualization: Real-time Streamlit dashboard for monitoring and analysis
- Production-Ready Architecture: Robust error handling, retry mechanisms, and comprehensive logging
- Extensible Design: Modular structure for easy addition of new models and strategies
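
To make the design concrete, here is a minimal sketch of how the agent, memory buffer, and unified client described above might fit together. The class and method names (`Agent`, `complete`, `render`) are illustrative assumptions, not the repository's actual API.

```python
# Illustrative sketch only: these class and method names are assumptions,
# not the repository's actual API.
from collections import deque
from dataclasses import dataclass


@dataclass
class AgentConfig:
    provider: str = "openai"        # "openai" or "anthropic"
    strategy: str = "few_shot"      # "base", "few_shot", or "reflection"
    max_history_steps: int = 5      # size of the memory buffer


class Agent:
    """Selects the next UI action from the goal, current observation, and recent history."""

    def __init__(self, client, prompt_builder, config: AgentConfig):
        self.client = client           # unified LLM client (same interface for both providers)
        self.prompts = prompt_builder  # renders the template for the chosen strategy
        self.config = config
        self.memory = deque(maxlen=config.max_history_steps)  # bounded action history

    def act(self, goal: str, observation: dict) -> str:
        prompt = self.prompts.render(
            strategy=self.config.strategy,
            goal=goal,
            observation=observation,
            history=list(self.memory),
        )
        action = self.client.complete(prompt)  # e.g. 'CLICK("Play Store")'
        self.memory.append({"observation": observation, "action": action})
        return action
```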
| Provider | Strategy | Success Rate | Step Accuracy |
|---|---|---|---|
| OpenAI | Few-shot | 40.0% | 77.1% |
| Anthropic | Few-shot | 60.0% | 88.6% |
| OpenAI | Base | 20.0% | 62.9% |
- Python 3.11 or higher
- OpenAI API key (for GPT-4 access)
- Anthropic API key (for Claude-3 access)
```bash
# Clone the repository
git clone https://github.com/dishant2009/android-llm-agent-eval.git
cd android-llm-agent-eval

# Set up virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies and download dataset
bash scripts/setup.sh

# Configure API keys
cp dot_env_example .env
# Edit .env file with your API keys

# Run evaluation on 10 episodes with GPT-4 using few-shot prompting
python scripts/run_evaluation.py --episodes 10 --models openai --strategies few_shot

# Test single episode with detailed output
python scripts/run_single.py --episode install_app_001 --provider openai --strategy few_shot

# Launch interactive visualization dashboard
python -m streamlit run scripts/visualize_results.py
```
```
android-llm-agent-eval/
├── src/                          # Core framework implementation
│   ├── agent.py                  # LLM agent with memory buffer integration
│   ├── llm_client.py             # Unified LLM provider interface
│   ├── prompts.py                # Jinja2 templating system
│   ├── evaluate.py               # Comprehensive evaluation engine
│   └── utils.py                  # Dataset loading and validation utilities
├── prompts/                      # Prompt templates and examples
│   ├── base_template.md          # Simple goal-to-action prompting
│   ├── few_shot_examples.json    # Curated training examples
│   └── reflection_template.md    # Self-reflection prompting
├── scripts/                      # Command-line utilities
│   ├── setup.sh                  # Environment setup automation
│   ├── run_single.py             # Single episode testing
│   ├── run_evaluation.py         # Batch evaluation across strategies
│   └── visualize_results.py      # Streamlit dashboard
├── results/                      # Auto-generated evaluation results
├── tests/                        # Comprehensive test suite
├── android_world/                # Dataset (auto-downloaded)
├── config.yaml                   # Experiment configuration
├── requirements.txt              # Python dependencies
└── report.md                     # Detailed research findings
```
```bash
# Required API keys
OPENAI_API_KEY=your_openai_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here

# Optional configuration
ANDROID_WORLD_DATA=./android_world/data
DEFAULT_LLM_PROVIDER=openai
DEFAULT_MODEL=gpt-4-turbo
```
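
A minimal sketch of how these variables could be read at startup using python-dotenv; the variable names match the example above, while the fallback values are assumptions.

```python
# Minimal sketch, assuming python-dotenv: read the keys above at startup.
import os

from dotenv import load_dotenv

load_dotenv()  # loads .env from the current working directory

openai_key = os.getenv("OPENAI_API_KEY")
anthropic_key = os.getenv("ANTHROPIC_API_KEY")
provider = os.getenv("DEFAULT_LLM_PROVIDER", "openai")
data_dir = os.getenv("ANDROID_WORLD_DATA", "./android_world/data")

if provider == "openai" and not openai_key:
    raise RuntimeError("OPENAI_API_KEY is required when DEFAULT_LLM_PROVIDER=openai")
```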
```yaml
evaluation:
  num_episodes: 15
  max_steps_per_episode: 20
  timeout_seconds: 30

models:
  openai:
    model: "gpt-4-turbo"
    temperature: 0.1
    max_tokens: 150
  anthropic:
    model: "claude-3-sonnet-20240229"
    temperature: 0.1
    max_tokens: 150

prompting:
  max_history_steps: 5
  include_reasoning: true
  retry_on_invalid: true
  max_retries: 3
```
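
As a sketch of how the `retry_on_invalid` and `max_retries` settings might be applied, the snippet below loads the config and retries until the model emits a syntactically valid action. The action regex and the `call_llm` callable are illustrative assumptions, not code from this repository.

```python
# Sketch of how retry_on_invalid / max_retries could gate the model's output.
# The action regex and the call_llm callable are illustrative assumptions.
import re

import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

MAX_RETRIES = cfg["prompting"]["max_retries"]        # 3 in the example above
RETRY_ON_INVALID = cfg["prompting"]["retry_on_invalid"]

ACTION_PATTERN = re.compile(r'^[A-Z_]+\(".*"\)$')    # e.g. CLICK("Play Store")


def is_valid(action: str) -> bool:
    return bool(ACTION_PATTERN.match(action.strip()))


def get_valid_action(call_llm, prompt: str) -> str:
    action = call_llm(prompt)
    retries = 0
    while RETRY_ON_INVALID and not is_valid(action) and retries < MAX_RETRIES:
        action = call_llm(prompt)
        retries += 1
    return action
```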
```bash
# Compare all strategies across multiple providers
python scripts/run_evaluation.py \
  --episodes 10 \
  --models openai,anthropic \
  --strategies base,few_shot,reflection

# Test specific episodes with detailed logging
python scripts/run_single.py \
  --episode audio_recorder_001 \
  --provider anthropic \
  --strategy few_shot

# Launch comprehensive results dashboard
python -m streamlit run scripts/visualize_results.py
```
The Streamlit dashboard provides:
- Real-time evaluation monitoring
- Interactive performance metrics visualization
- Detailed failure pattern analysis
- Provider comparison charts
- Episode-level drill-down capabilities
- Configuration management interface
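
A heavily simplified sketch of such a dashboard page is shown below; the real `scripts/visualize_results.py` is richer, and the summary-file schema used here is an assumption.

```python
# Simplified sketch of a Streamlit results page; the summary-file schema is assumed.
import json
from pathlib import Path

import pandas as pd
import streamlit as st

st.title("Android LLM Agent Evaluation")

summaries = sorted(Path("results").glob("summary_*.json"))
if not summaries:
    st.warning("No summary files found in results/ yet.")
else:
    latest = json.loads(summaries[-1].read_text())
    df = pd.DataFrame(latest["runs"])  # assumed: one record per provider/strategy pair
    st.dataframe(df)
    st.bar_chart(df.set_index("provider")["step_accuracy"])
```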
- Step Accuracy: Percentage of individual actions matching ground truth
- Episode Success Rate: Percentage of complete task sequences executed correctly
- Fuzzy Match Scores: Semantic similarity analysis for near-miss evaluation (see the scoring sketch below)
- Action type distribution and error patterns
- Hallucination detection and frequency analysis
- UI reasoning capability assessment
- Memory buffer effectiveness measurement
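
One plausible way to score a single step against the metrics above is an exact string comparison plus a 0-100 similarity ratio; the repository's actual fuzzy matcher may differ.

```python
# Illustrative scoring for a single step: exact match plus a 0-100 fuzzy score
# based on difflib; the repository's actual matcher may differ.
from difflib import SequenceMatcher


def score_step(predicted: str, ground_truth: str) -> dict:
    exact = predicted.strip() == ground_truth.strip()
    fuzzy = round(100 * SequenceMatcher(None, predicted, ground_truth).ratio())
    return {"exact_match": exact, "fuzzy_score": fuzzy}


# A near miss caused only by quote style still scores highly:
print(score_step("CLICK('Play Store')", 'CLICK("Play Store")'))
# {'exact_match': False, 'fuzzy_score': 89}
```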
The evaluation system generates structured results:
```
results/
├── episode_<id>.json               # Detailed per-episode logs
├── summary_<timestamp>.json        # Aggregated performance metrics
├── comparison_<timestamp>.json     # Cross-model comparison data
└── detailed_failure_analysis.json  # Systematic failure patterns
```
```json
{
  "episode_id": "install_app_001",
  "goal": "Install the Twitter app from the Play Store",
  "success": false,
  "steps": [
    {
      "observation": {
        "app_name": "Home Screen",
        "ui_elements": ["Play Store", "Settings", "Chrome"],
        "screen_text": "Welcome to Android"
      },
      "predicted": "CLICK(\"Play Store\")",
      "ground_truth": "CLICK(\"Play Store\")",
      "exact_match": true,
      "fuzzy_score": 100
    }
  ]
}
```
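
Given files in this schema, the headline metrics can be aggregated in a few lines; this is an illustrative sketch, not code from `src/evaluate.py`.

```python
# Sketch: aggregate the per-episode files above into the two headline metrics.
# Assumes every episode JSON follows the schema in the example.
import json
from pathlib import Path

episodes = [json.loads(p.read_text()) for p in Path("results").glob("episode_*.json")]

total_steps = sum(len(ep["steps"]) for ep in episodes)
if episodes and total_steps:
    matched = sum(step["exact_match"] for ep in episodes for step in ep["steps"])
    successes = sum(ep["success"] for ep in episodes)
    print(f"Step accuracy:        {matched / total_steps:.1%}")
    print(f"Episode success rate: {successes / len(episodes):.1%}")
```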
```bash
# Execute full test suite
pytest tests/ -v

# Run specific test categories
pytest tests/test_agent.py -v     # Agent functionality
pytest tests/test_utils.py -v     # Utility functions
pytest tests/test_evaluate.py -v  # Evaluation metrics

# Linting and formatting (if configured)
flake8 src/
black src/
```
- Few-shot prompting significantly outperforms base prompting across all providers
- Claude demonstrates superior performance when format issues are resolved (89% vs 77% step accuracy)
- Quote formatting inconsistencies represent a critical deployment challenge
- Memory buffer integration improves multi-step task performance
- Action type confusion (TYPE vs CLICK) is a primary failure mode
- Hallucinated Actions (15% of failures): References to non-existent UI elements
- Goal Misinterpretation (25% of first-step failures): Incorrect initial navigation
- Action Type Confusion (40% of mid-sequence failures): TYPE vs CLICK distinction
- Format Incompatibility (Critical): Quote style variations cause validation failures
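
A sketch of how these categories could be flagged automatically from a logged step is shown below; the action format and observation fields follow the episode example above, while the classification rules themselves are assumptions.

```python
# Sketch of flagging the failure categories above from a logged step.
# The action format and observation fields follow the episode example;
# the classification rules themselves are assumptions.
import re


def classify_failure(predicted: str, ground_truth: str, observation: dict) -> str:
    match = re.search(r'\(["\'](.+?)["\']\)', predicted)
    target = match.group(1) if match else None

    if target and target not in observation.get("ui_elements", []):
        return "hallucinated_action"      # refers to a UI element not on screen
    if predicted.split("(")[0] != ground_truth.split("(")[0]:
        return "action_type_confusion"    # e.g. TYPE where CLICK was expected
    if predicted.replace("'", '"') == ground_truth.replace("'", '"'):
        return "format_incompatibility"   # only the quote style differs
    return "other"
```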
```bash
# Clone and set up development environment
git clone https://github.com/dishant2009/android-llm-agent-eval.git
cd android-llm-agent-eval
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt -r requirements-dev.txt
```
- Extend `src/llm_client.py` with the new provider integration
- Add provider-specific configuration to `config.yaml`
- Update the evaluation scripts to include the new provider
- Add comprehensive tests for the new functionality
- Create a new template in the `prompts/` directory
- Register the strategy in `src/prompts.py`
- Update the evaluation framework to support the new strategy
- Document the strategy rationale and expected performance
- Import Errors: Ensure the virtual environment is activated and PYTHONPATH is set: `export PYTHONPATH="${PYTHONPATH}:$(pwd)"`
- API Rate Limiting: The framework includes automatic retry mechanisms with exponential backoff
- Missing Dataset: Run the setup script to download the android_world data: `bash scripts/setup.sh`
- Quote Format Issues: Update the validation logic to handle both single and double quotes (see the sketch below)
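
A minimal quote-tolerant check, assuming actions are plain strings like `CLICK("Play Store")`:

```python
# Minimal quote-tolerant comparison, assuming actions are plain strings
# like CLICK("Play Store") or CLICK('Play Store').
def normalize_quotes(action: str) -> str:
    return action.strip().replace("'", '"')


assert normalize_quotes("CLICK('Play Store')") == normalize_quotes('CLICK("Play Store")')
```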
If you use this framework in your research, please cite:
```bibtex
@misc{android-llm-agent-eval,
  author = {Dishant Digdarshi},
  title  = {Android LLM Agent Evaluation Framework},
  year   = {2025},
  url    = {https://github.com/dishant2009/android-llm-agent-eval}
}
```
MIT License - see LICENSE file for details.
- QualGent Research for the original coding challenge
- Google Research for the AndroidWorld environment
- OpenAI and Anthropic for LLM API access
For detailed research findings, methodology, and performance analysis, see report.md.