
LLM Inference GPU Recommender

A modern, responsive single-page application that helps users select the optimal GPU configuration for running large language model inference workloads. The application analyzes user requirements including model specifications, performance targets, and optimization preferences to recommend the top 3 GPU options with detailed cost, latency, and throughput metrics.


Features

  • Intelligent GPU Recommendations: Get top 3 GPU recommendations based on your specific requirements using scientific memory calculations and weighted scoring algorithms
  • Scientific Memory Calculations: Uses precise formulas to calculate inference memory requirements and recommend optimal serving methodologies
  • Hugging Face Integration: Autocomplete model search and automatic model specification retrieval from Hugging Face Model Hub
  • AWS Service Recommendations: Suggests relevant cloud services (EC2, SageMaker, Bedrock, Inferentia) for your workload
  • Dark Mode Support: Built-in theme switching with localStorage persistence
  • Responsive Design: Mobile-first design that works on all devices with touch-friendly interactions
  • Accessibility: Full keyboard navigation, screen reader support, and proper ARIA labels
  • Form Persistence: Automatically saves and restores form inputs using localStorage
  • Comprehensive Error Handling: Graceful error handling with user-friendly messages and fallback options

Tech Stack

  • Frontend: React 19 with TypeScript and functional components
  • Framework: Next.js 15.4.6 with App Router
  • Styling: Tailwind CSS v4 with dark mode support
  • Testing: Jest with React Testing Library
  • API Integration: Hugging Face Model Hub API
  • State Management: React hooks (useState, useEffect, useContext)
  • Build Tool: Next.js with Turbopack for fast development

Getting Started

Prerequisites

  • Node.js 18+ (recommended: Node.js 20+)
  • npm, yarn, or pnpm package manager

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd llm-gpu-recommender
  2. Install dependencies:

    npm install
    # or
    yarn install
    # or
    pnpm install
  3. Run the development server:

    npm run dev
    # or
    yarn dev
    # or
    pnpm dev
  4. Open http://localhost:3000 in your browser

Available Scripts

  • npm run dev - Start development server with Turbopack for fast hot reloading
  • npm run build - Build the application for production
  • npm run start - Start the production server (requires npm run build first)
  • npm run lint - Run ESLint to check code quality and style
  • npm run test - Run all tests once
  • npm run test:watch - Run tests in watch mode for development

Development Commands

# Start development with hot reloading
npm run dev

# Run tests during development
npm run test:watch

# Check code quality
npm run lint

# Build and test production build locally
npm run build && npm run start

API Endpoints

POST /api/recommend-inference

Main recommendation endpoint that analyzes user requirements and returns GPU recommendations.

Request Body:

{
  "modelId": "meta-llama/Llama-2-7b-hf",
  "paramCount": 7000000000,
  "seqLen": 2048,
  "batchSize": 1,
  "latencyMs": 100,
  "throughputTps": 50,
  "techniques": ["quantization", "vllm"]
}

Response:

{
  "recommendations": [
    {
      "id": "nvidia-a100-80gb",
      "name": "NVIDIA A100 80GB",
      "vendor": "NVIDIA",
      "estimatedCost": 3.06,
      "latency": 45,
      "throughput": 120,
      "memory": 80,
      "memoryBandwidth": 2039,
      "fp16Tflops": 312,
      "int8Tflops": 624,
      "rationale": "Best performance for large models, excellent memory headroom",
      "compositeScore": 0.85,
      "awsServices": [
        {
          "service": "EC2 (p4d.24xlarge)",
          "reasoning": "Direct GPU access with NVIDIA A100, ideal for multi-GPU setups",
          "costEffectiveness": 0.8
        }
      ]
    }
  ],
  "memoryCalculation": {
    "modelMemory": 14000,
    "activationMemory": 512,
    "totalMemory": 15912,
    "recommendedMethod": "standard-single-gpu",
    "bufferMemory": 1400
  },
  "metadata": {
    "totalGPUsEvaluated": 15,
    "inferenceMethod": "standard-single-gpu",
    "timestamp": "2024-01-15T10:30:00.000Z"
  }
}
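
Example client call (a minimal sketch using the request body above; it targets the app's own route and assumes an async or module context):

// Sketch: calling the recommendation endpoint with the example payload above.
const res = await fetch("/api/recommend-inference", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    modelId: "meta-llama/Llama-2-7b-hf",
    paramCount: 7_000_000_000,
    seqLen: 2048,
    batchSize: 1,
    latencyMs: 100,
    throughputTps: 50,
    techniques: ["quantization", "vllm"],
  }),
});

if (!res.ok) throw new Error(`Recommendation request failed: ${res.status}`);
const { recommendations, memoryCalculation } = await res.json();
console.log(recommendations[0]?.name); // e.g. "NVIDIA A100 80GB"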

GET /api/models/search?q={query}

Search for Hugging Face models with autocomplete functionality.

Parameters:

  • q (required): Search query (minimum 2 characters)

Response:

{
  "models": [
    {
      "id": "meta-llama/Llama-2-7b-hf",
      "name": "Llama 2 7B",
      "downloads": 1000000,
      "likes": 5000,
      "tags": ["llama", "7b"],
      "pipeline_tag": "text-generation"
    }
  ],
  "cached": false,
  "timestamp": "2024-01-15T10:30:00.000Z"
}

GET /api/models/{modelId}

Get detailed metadata for a specific Hugging Face model.

Response:

{
  "metadata": {
    "id": "meta-llama/Llama-2-7b-hf",
    "parameterCount": 7000000000,
    "hiddenSize": 4096,
    "vocabularySize": 32000,
    "maxSequenceLength": 4096,
    "architecture": "LlamaForCausalLM",
    "quantizationSupport": true
  },
  "cached": false,
  "timestamp": "2024-01-15T10:30:00.000Z"
}
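
The two model endpoints compose naturally: search first, then fetch metadata for a selected result. A minimal sketch (field names follow the response shapes above):

// Sketch: autocomplete search followed by a metadata lookup for the first hit.
const searchRes = await fetch(`/api/models/search?q=${encodeURIComponent("llama")}`);
const { models } = await searchRes.json();

if (models.length > 0) {
  // Model IDs can contain a slash (owner/name); how that maps onto the
  // [modelId] route depends on the route's configuration.
  const metaRes = await fetch(`/api/models/${models[0].id}`);
  const { metadata } = await metaRes.json();
  console.log(metadata.parameterCount, metadata.maxSequenceLength);
}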

Project Structure

src/
├── app/                    # Next.js App Router
│   ├── api/               # API routes
│   │   ├── models/        # Model-related endpoints
│   │   │   ├── search/    # Model search endpoint
│   │   │   └── [modelId]/ # Model metadata endpoint
│   │   └── recommend-inference/ # Main recommendation endpoint
│   ├── globals.css        # Global styles and Tailwind imports
│   ├── layout.tsx         # Root layout with theme provider
│   ├── page.tsx          # Home page component
│   └── favicon.ico       # Application favicon
├── components/            # React components
│   ├── forms/            # Form-related components
│   │   ├── InputForm.tsx         # Main input form
│   │   ├── ModelSelector.tsx     # Model selection with autocomplete
│   │   ├── NumericInput.tsx      # Reusable numeric input
│   │   ├── OptimizationCheckboxes.tsx # Technique selection
│   │   └── PerformanceInputs.tsx # Latency/throughput inputs
│   ├── layout/           # Layout components
│   │   ├── HeroSection.tsx       # Title and subtitle
│   │   └── ResultsSection.tsx    # GPU recommendations display
│   └── ui/               # UI components
│       ├── ErrorBoundary.tsx     # Error boundary wrapper
│       ├── GPUCard.tsx          # Individual GPU recommendation card
│       ├── LoadingSpinner.tsx   # Loading state component
│       ├── NetworkStatus.tsx    # Network connectivity indicator
│       └── ThemeToggle.tsx      # Dark/light mode toggle
├── contexts/             # React contexts
│   └── ThemeContext.tsx  # Theme management context
├── data/                 # Static data
│   └── gpuDatabase.ts    # GPU specifications database
├── hooks/                # Custom React hooks
│   └── useFormPersistence.ts # Form state persistence hook
├── lib/                  # Library utilities
│   └── utils.ts          # Utility functions (clsx, etc.)
├── types/                # TypeScript type definitions
│   └── index.ts          # All application types
├── utils/                # Utility functions
│   ├── errorHandling.ts  # Error handling utilities
│   ├── gpuScoring.ts     # GPU scoring algorithms
│   ├── localStorage.ts   # localStorage utilities
│   └── memoryCalculator.ts # Memory calculation functions
└── test-setup.ts         # Jest test configuration

__tests__/                # Test files (mirrors src structure)
├── api/                  # API endpoint tests
├── components/           # Component tests
├── data/                 # Data layer tests
├── hooks/                # Custom hook tests
├── integration/          # End-to-end integration tests
├── utils/                # Utility function tests
└── setup.test.ts         # Test environment setup

Core Algorithms

Memory Calculation

The application uses the following formulas to calculate inference memory requirements; a TypeScript sketch follows the list:

  1. Model Weights Memory: M_model = P × b

    • P: Parameter count
    • b: Bytes per parameter (2 for FP16, 1 for INT8)
  2. Activation Memory: M_act = α × B × L × H × b

    • α: Activation multiplier (≈1 for inference)
    • B: Batch size
    • L: Sequence length
    • H: Hidden size
  3. Total Memory: M_total = M_model + M_act + M_buffer

    • Buffer: 10% overhead for system operations
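
A minimal TypeScript sketch of these formulas (names, units, and the MB conversion are illustrative; the project's implementation lives in src/utils/memoryCalculator.ts and may differ in detail):

// Illustrative sketch of the formulas above, not the exact implementation.
interface MemoryInput {
  paramCount: number;            // P: parameter count
  bytesPerParam: number;         // b: 2 for FP16, 1 for INT8
  batchSize: number;             // B
  seqLen: number;                // L
  hiddenSize: number;            // H
  activationMultiplier?: number; // α, ≈1 for inference
}

function estimateInferenceMemoryMB(input: MemoryInput) {
  const alpha = input.activationMultiplier ?? 1;
  const BYTES_PER_MB = 1_000_000; // decimal MB assumed for readability

  // M_model = P × b
  const modelMemory = (input.paramCount * input.bytesPerParam) / BYTES_PER_MB;

  // M_act = α × B × L × H × b
  const activationMemory =
    (alpha * input.batchSize * input.seqLen * input.hiddenSize * input.bytesPerParam) / BYTES_PER_MB;

  // M_buffer: 10% overhead for system operations
  const bufferMemory = 0.1 * modelMemory;

  // M_total = M_model + M_act + M_buffer
  const totalMemory = modelMemory + activationMemory + bufferMemory;
  return { modelMemory, activationMemory, bufferMemory, totalMemory };
}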

GPU Scoring

GPUs are ranked with a weighted composite scoring algorithm; a sketch follows the list:

  • Cost (40%): Lower cost per hour = higher score
  • Latency (30%): Meeting latency requirements = higher score
  • Throughput (20%): Meeting throughput targets = higher score
  • Memory Fit (5%): Better memory utilization = higher score
  • Technique Support (5%): Supporting user's optimization techniques = higher score
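
A hedged TypeScript sketch of the composite score using those weights (the per-criterion sub-scores are assumed to be normalized to [0, 1]; the real normalization and ranking logic lives in src/utils/gpuScoring.ts):

// Illustrative composite scoring with the weights listed above.
interface SubScores {
  cost: number;       // lower $/hr → closer to 1
  latency: number;    // meets latency requirement → closer to 1
  throughput: number; // meets throughput target → closer to 1
  memoryFit: number;  // better memory utilization → closer to 1
  techniques: number; // supports requested optimizations → closer to 1
}

const WEIGHTS = { cost: 0.4, latency: 0.3, throughput: 0.2, memoryFit: 0.05, techniques: 0.05 };

function compositeScore(s: SubScores): number {
  return (
    WEIGHTS.cost * s.cost +
    WEIGHTS.latency * s.latency +
    WEIGHTS.throughput * s.throughput +
    WEIGHTS.memoryFit * s.memoryFit +
    WEIGHTS.techniques * s.techniques
  );
}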

Usage Examples

Basic Usage

  1. Select a Model: Choose from common sizes (7B, 13B, 70B) or search for specific Hugging Face models
  2. Set Requirements: Configure sequence length, batch size, latency, and throughput requirements
  3. Choose Optimizations: Select techniques like quantization, vLLM, or tensor parallelism
  4. Get Recommendations: Click "Get Top 3 GPUs" to receive personalized recommendations

Advanced Usage

  • Custom Models: Enter any Hugging Face model ID for automatic parameter detection
  • Performance Tuning: Adjust latency and throughput requirements based on your use case
  • Cost Optimization: Compare recommendations to find the most cost-effective option
  • AWS Integration: Review AWS service recommendations for cloud deployment

Development

Spec-Driven Development

This project follows a spec-driven development approach. See the .kiro/specs/llm-gpu-recommender/ directory for:

  • requirements.md - Detailed feature requirements in EARS format
  • design.md - Technical design document with architecture and algorithms
  • tasks.md - Implementation task list with progress tracking

Testing Strategy

  • Unit Tests: Test individual functions and components
  • Integration Tests: Test API endpoints and data flow
  • Component Tests: Test React component behavior and interactions
  • End-to-End Tests: Test complete user workflows

Run tests with:

# Run all tests
npm run test

# Run tests in watch mode during development
npm run test:watch

# Run specific test files
npm run test -- --testPathPattern=memoryCalculator

Code Quality

The project uses ESLint for code quality and consistency:

# Check code quality
npm run lint

# Auto-fix issues where possible
npm run lint -- --fix

Troubleshooting

Common Issues

1. Development server won't start

  • Ensure Node.js 18+ is installed: node --version
  • Clear node_modules and reinstall: rm -rf node_modules package-lock.json && npm install
  • Check if port 3000 is available or use a different port: npm run dev -- -p 3001

2. Tests failing

  • Ensure all dependencies are installed: npm install
  • Clear Jest cache: npm run test -- --clearCache
  • Check test setup file: src/test-setup.ts

3. Build errors

  • Check TypeScript errors: npx tsc --noEmit
  • Ensure all imports are correct and files exist
  • Clear Next.js cache: rm -rf .next

4. API endpoints not working

  • Check network connectivity for Hugging Face API calls
  • Verify API route files are in correct locations under src/app/api/
  • Check browser developer tools for detailed error messages

5. Styling issues

  • Ensure Tailwind CSS is properly configured in tailwind.config.ts
  • Check if PostCSS is configured correctly in postcss.config.mjs
  • Verify global styles are imported in src/app/globals.css

Performance Issues

Slow autocomplete search:

  • Results are cached for 5 minutes to improve performance
  • Rate limiting prevents excessive API calls
  • Fallback models are provided when Hugging Face API is unavailable

Memory calculation taking too long:

  • Calculations are optimized for common model sizes
  • Results are memoized to avoid recalculation
  • Consider reducing batch size or sequence length for very large models

Browser Compatibility

  • Supported Browsers: Chrome 90+, Firefox 88+, Safari 14+, Edge 90+
  • Mobile Support: iOS Safari 14+, Chrome Mobile 90+
  • Accessibility: Tested with screen readers and keyboard navigation

Environment Variables

No environment variables are required for basic functionality. The application works entirely with client-side code and public APIs.

For production deployment, consider setting:

  • NODE_ENV=production for optimized builds
  • Custom API endpoints if using proxied Hugging Face access (see the sketch below)
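
For example, a proxied Hugging Face base URL could be read from the environment like this (HF_API_BASE_URL is a hypothetical variable name, not one the codebase defines today):

// Hypothetical example: HF_API_BASE_URL is not currently read by the project;
// it only illustrates how a proxied endpoint could be wired in.
const HF_API_BASE = process.env.HF_API_BASE_URL ?? "https://huggingface.co";

export function modelSearchUrl(query: string): string {
  return `${HF_API_BASE}/api/models?search=${encodeURIComponent(query)}`;
}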

Getting Help

  1. Check the Issues: Look for similar problems in the project issues
  2. Review Logs: Check browser developer tools console for detailed error messages
  3. Test API Endpoints: Use tools like curl or Postman to test API endpoints directly
  4. Verify Dependencies: Ensure all package versions match package.json

Development Tips

  • Use npm run test:watch during development for immediate feedback
  • Enable React Developer Tools browser extension for component debugging
  • Use the Network tab in browser dev tools to monitor API calls
  • Check the Application tab in dev tools to verify localStorage persistence

Contributing

  1. Code Style: Follow the existing TypeScript and React conventions
  2. Testing: Write tests for new functionality using Jest and React Testing Library
  3. Documentation: Update README and inline comments for new features
  4. Type Safety: Ensure all TypeScript types are properly defined
  5. Accessibility: Test with keyboard navigation and screen readers
  6. Performance: Consider performance implications of new features

Pull Request Process

  1. Fork the repository and create a feature branch
  2. Make your changes with appropriate tests
  3. Ensure all tests pass: npm run test
  4. Check code quality: npm run lint
  5. Build successfully: npm run build
  6. Update documentation as needed
  7. Submit a pull request with a clear description

License

This project is licensed under the MIT License. See the LICENSE file for details.

About

It's hard to decide which GPU to use for LLM inference; this tool was built to make that job easier.
