
LLM Inference GPU Recommender

A modern, responsive single-page application that helps users select the optimal GPU configuration for running large language model inference workloads. The application analyzes user requirements including model specifications, performance targets, and optimization preferences to recommend the top 3 GPU options with detailed cost, latency, and throughput metrics.


Features

  • Intelligent GPU Recommendations: Get top 3 GPU recommendations based on your specific requirements using scientific memory calculations and weighted scoring algorithms
  • Scientific Memory Calculations: Uses precise formulas to calculate inference memory requirements and recommend optimal serving methodologies
  • Hugging Face Integration: Autocomplete model search and automatic model specification retrieval from Hugging Face Model Hub
  • AWS Service Recommendations: Suggests relevant cloud services (EC2, SageMaker, Bedrock, Inferentia) for your workload
  • Dark Mode Support: Built-in theme switching with localStorage persistence
  • Responsive Design: Mobile-first design that works on all devices with touch-friendly interactions
  • Accessibility: Full keyboard navigation, screen reader support, and proper ARIA labels
  • Form Persistence: Automatically saves and restores form inputs using localStorage
  • Comprehensive Error Handling: Graceful error handling with user-friendly messages and fallback options

Tech Stack

  • Frontend: React 19 with TypeScript and functional components
  • Framework: Next.js 15.4.6 with App Router
  • Styling: Tailwind CSS v4 with dark mode support
  • Testing: Jest with React Testing Library
  • API Integration: Hugging Face Model Hub API
  • State Management: React hooks (useState, useEffect, useContext)
  • Build Tool: Next.js with Turbopack for fast development

Getting Started

Prerequisites

  • Node.js 18+ (recommended: Node.js 20+)
  • npm, yarn, or pnpm package manager

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd llm-gpu-recommender
  2. Install dependencies:

    npm install
    # or
    yarn install
    # or
    pnpm install
  3. Run the development server:

    npm run dev
    # or
    yarn dev
    # or
    pnpm dev
  4. Open http://localhost:3000 in your browser

Available Scripts

  • npm run dev - Start development server with Turbopack for fast hot reloading
  • npm run build - Build the application for production
  • npm run start - Start the production server (requires npm run build first)
  • npm run lint - Run ESLint to check code quality and style
  • npm run test - Run all tests once
  • npm run test:watch - Run tests in watch mode for development

Development Commands

# Start development with hot reloading
npm run dev

# Run tests during development
npm run test:watch

# Check code quality
npm run lint

# Build and test production build locally
npm run build && npm run start

API Endpoints

POST /api/recommend-inference

Main recommendation endpoint that analyzes user requirements and returns GPU recommendations.

Request Body:

{
  "modelId": "meta-llama/Llama-2-7b-hf",
  "paramCount": 7000000000,
  "seqLen": 2048,
  "batchSize": 1,
  "latencyMs": 100,
  "throughputTps": 50,
  "techniques": ["quantization", "vllm"]
}

Response:

{
  "recommendations": [
    {
      "id": "nvidia-a100-80gb",
      "name": "NVIDIA A100 80GB",
      "vendor": "NVIDIA",
      "estimatedCost": 3.06,
      "latency": 45,
      "throughput": 120,
      "memory": 80,
      "memoryBandwidth": 2039,
      "fp16Tflops": 312,
      "int8Tflops": 624,
      "rationale": "Best performance for large models, excellent memory headroom",
      "compositeScore": 0.85,
      "awsServices": [
        {
          "service": "EC2 (p4d.24xlarge)",
          "reasoning": "Direct GPU access with NVIDIA A100, ideal for multi-GPU setups",
          "costEffectiveness": 0.8
        }
      ]
    }
  ],
  "memoryCalculation": {
    "modelMemory": 14000,
    "activationMemory": 512,
    "totalMemory": 15912,
    "recommendedMethod": "standard-single-gpu",
    "bufferMemory": 1400
  },
  "metadata": {
    "totalGPUsEvaluated": 15,
    "inferenceMethod": "standard-single-gpu",
    "timestamp": "2024-01-15T10:30:00.000Z"
  }
}
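
Example client call (a minimal sketch using the request body above; it targets the app's own route and assumes an async or module context):

// Sketch: calling the recommendation endpoint with the example payload above.
const res = await fetch("/api/recommend-inference", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    modelId: "meta-llama/Llama-2-7b-hf",
    paramCount: 7_000_000_000,
    seqLen: 2048,
    batchSize: 1,
    latencyMs: 100,
    throughputTps: 50,
    techniques: ["quantization", "vllm"],
  }),
});

if (!res.ok) throw new Error(`Recommendation request failed: ${res.status}`);
const { recommendations, memoryCalculation } = await res.json();
console.log(recommendations[0]?.name); // e.g. "NVIDIA A100 80GB"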

GET /api/models/search?q={query}

Search for Hugging Face models with autocomplete functionality.

Parameters:

  • q (required): Search query (minimum 2 characters)

Response:

{
  "models": [
    {
      "id": "meta-llama/Llama-2-7b-hf",
      "name": "Llama 2 7B",
      "downloads": 1000000,
      "likes": 5000,
      "tags": ["llama", "7b"],
      "pipeline_tag": "text-generation"
    }
  ],
  "cached": false,
  "timestamp": "2024-01-15T10:30:00.000Z"
}

GET /api/models/{modelId}

Get detailed metadata for a specific Hugging Face model.

Response:

{
  "metadata": {
    "id": "meta-llama/Llama-2-7b-hf",
    "parameterCount": 7000000000,
    "hiddenSize": 4096,
    "vocabularySize": 32000,
    "maxSequenceLength": 4096,
    "architecture": "LlamaForCausalLM",
    "quantizationSupport": true
  },
  "cached": false,
  "timestamp": "2024-01-15T10:30:00.000Z"
}
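
The two model endpoints compose naturally: search first, then fetch metadata for a selected result. A minimal sketch (field names follow the response shapes above):

// Sketch: autocomplete search followed by a metadata lookup for the first hit.
const searchRes = await fetch(`/api/models/search?q=${encodeURIComponent("llama")}`);
const { models } = await searchRes.json();

if (models.length > 0) {
  // Model IDs can contain a slash (owner/name); how that maps onto the
  // [modelId] route depends on the route's configuration.
  const metaRes = await fetch(`/api/models/${models[0].id}`);
  const { metadata } = await metaRes.json();
  console.log(metadata.parameterCount, metadata.maxSequenceLength);
}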

Project Structure

src/
├── app/                    # Next.js App Router
│   ├── api/               # API routes
│   │   ├── models/        # Model-related endpoints
│   │   │   ├── search/    # Model search endpoint
│   │   │   └── [modelId]/ # Model metadata endpoint
│   │   └── recommend-inference/ # Main recommendation endpoint
│   ├── globals.css        # Global styles and Tailwind imports
│   ├── layout.tsx         # Root layout with theme provider
│   ├── page.tsx          # Home page component
│   └── favicon.ico       # Application favicon
├── components/            # React components
│   ├── forms/            # Form-related components
│   │   ├── InputForm.tsx         # Main input form
│   │   ├── ModelSelector.tsx     # Model selection with autocomplete
│   │   ├── NumericInput.tsx      # Reusable numeric input
│   │   ├── OptimizationCheckboxes.tsx # Technique selection
│   │   └── PerformanceInputs.tsx # Latency/throughput inputs
│   ├── layout/           # Layout components
│   │   ├── HeroSection.tsx       # Title and subtitle
│   │   └── ResultsSection.tsx    # GPU recommendations display
│   └── ui/               # UI components
│       ├── ErrorBoundary.tsx     # Error boundary wrapper
│       ├── GPUCard.tsx          # Individual GPU recommendation card
│       ├── LoadingSpinner.tsx   # Loading state component
│       ├── NetworkStatus.tsx    # Network connectivity indicator
│       └── ThemeToggle.tsx      # Dark/light mode toggle
├── contexts/             # React contexts
│   └── ThemeContext.tsx  # Theme management context
├── data/                 # Static data
│   └── gpuDatabase.ts    # GPU specifications database
├── hooks/                # Custom React hooks
│   └── useFormPersistence.ts # Form state persistence hook
├── lib/                  # Library utilities
│   └── utils.ts          # Utility functions (clsx, etc.)
├── types/                # TypeScript type definitions
│   └── index.ts          # All application types
├── utils/                # Utility functions
│   ├── errorHandling.ts  # Error handling utilities
│   ├── gpuScoring.ts     # GPU scoring algorithms
│   ├── localStorage.ts   # localStorage utilities
│   └── memoryCalculator.ts # Memory calculation functions
└── test-setup.ts         # Jest test configuration

__tests__/                # Test files (mirrors src structure)
├── api/                  # API endpoint tests
├── components/           # Component tests
├── data/                 # Data layer tests
├── hooks/                # Custom hook tests
├── integration/          # End-to-end integration tests
├── utils/                # Utility function tests
└── setup.test.ts         # Test environment setup

Core Algorithms

Memory Calculation

The application uses the following formulas to calculate inference memory requirements; a TypeScript sketch follows the list:

  1. Model Weights Memory: M_model = P × b

    • P: Parameter count
    • b: Bytes per parameter (2 for FP16, 1 for INT8)
  2. Activation Memory: M_act = α × B × L × H × b

    • α: Activation multiplier (≈1 for inference)
    • B: Batch size
    • L: Sequence length
    • H: Hidden size
  3. Total Memory: M_total = M_model + M_act + M_buffer

    • Buffer: 10% overhead for system operations
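
A minimal TypeScript sketch of these formulas (names, units, and the MB conversion are illustrative; the project's implementation lives in src/utils/memoryCalculator.ts and may differ in detail):

// Illustrative sketch of the formulas above, not the exact implementation.
interface MemoryInput {
  paramCount: number;            // P: parameter count
  bytesPerParam: number;         // b: 2 for FP16, 1 for INT8
  batchSize: number;             // B
  seqLen: number;                // L
  hiddenSize: number;            // H
  activationMultiplier?: number; // α, ≈1 for inference
}

function estimateInferenceMemoryMB(input: MemoryInput) {
  const alpha = input.activationMultiplier ?? 1;
  const BYTES_PER_MB = 1_000_000; // decimal MB assumed for readability

  // M_model = P × b
  const modelMemory = (input.paramCount * input.bytesPerParam) / BYTES_PER_MB;

  // M_act = α × B × L × H × b
  const activationMemory =
    (alpha * input.batchSize * input.seqLen * input.hiddenSize * input.bytesPerParam) / BYTES_PER_MB;

  // M_buffer: 10% overhead for system operations
  const bufferMemory = 0.1 * modelMemory;

  // M_total = M_model + M_act + M_buffer
  const totalMemory = modelMemory + activationMemory + bufferMemory;
  return { modelMemory, activationMemory, bufferMemory, totalMemory };
}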

GPU Scoring

GPUs are ranked with a weighted composite scoring algorithm; a sketch follows the list:

  • Cost (40%): Lower cost per hour = higher score
  • Latency (30%): Meeting latency requirements = higher score
  • Throughput (20%): Meeting throughput targets = higher score
  • Memory Fit (5%): Better memory utilization = higher score
  • Technique Support (5%): Supporting user's optimization techniques = higher score
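
A hedged TypeScript sketch of the composite score using those weights (the per-criterion sub-scores are assumed to be normalized to [0, 1]; the real normalization and ranking logic lives in src/utils/gpuScoring.ts):

// Illustrative composite scoring with the weights listed above.
interface SubScores {
  cost: number;       // lower $/hr → closer to 1
  latency: number;    // meets latency requirement → closer to 1
  throughput: number; // meets throughput target → closer to 1
  memoryFit: number;  // better memory utilization → closer to 1
  techniques: number; // supports requested optimizations → closer to 1
}

const WEIGHTS = { cost: 0.4, latency: 0.3, throughput: 0.2, memoryFit: 0.05, techniques: 0.05 };

function compositeScore(s: SubScores): number {
  return (
    WEIGHTS.cost * s.cost +
    WEIGHTS.latency * s.latency +
    WEIGHTS.throughput * s.throughput +
    WEIGHTS.memoryFit * s.memoryFit +
    WEIGHTS.techniques * s.techniques
  );
}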

Usage Examples

Basic Usage

  1. Select a Model: Choose from common sizes (7B, 13B, 70B) or search for specific Hugging Face models
  2. Set Requirements: Configure sequence length, batch size, latency, and throughput requirements
  3. Choose Optimizations: Select techniques like quantization, vLLM, or tensor parallelism
  4. Get Recommendations: Click "Get Top 3 GPUs" to receive personalized recommendations

Advanced Usage

  • Custom Models: Enter any Hugging Face model ID for automatic parameter detection
  • Performance Tuning: Adjust latency and throughput requirements based on your use case
  • Cost Optimization: Compare recommendations to find the most cost-effective option
  • AWS Integration: Review AWS service recommendations for cloud deployment

Development

Spec-Driven Development

This project follows a spec-driven development approach. See the .kiro/specs/llm-gpu-recommender/ directory for:

  • requirements.md - Detailed feature requirements in EARS format
  • design.md - Technical design document with architecture and algorithms
  • tasks.md - Implementation task list with progress tracking

Testing Strategy

  • Unit Tests: Test individual functions and components
  • Integration Tests: Test API endpoints and data flow
  • Component Tests: Test React component behavior and interactions
  • End-to-End Tests: Test complete user workflows

Run tests with:

# Run all tests
npm run test

# Run tests in watch mode during development
npm run test:watch

# Run specific test files
npm run test -- --testPathPattern=memoryCalculator

Code Quality

The project uses ESLint for code quality and consistency:

# Check code quality
npm run lint

# Auto-fix issues where possible
npm run lint -- --fix

Troubleshooting

Common Issues

1. Development server won't start

  • Ensure Node.js 18+ is installed: node --version
  • Clear node_modules and reinstall: rm -rf node_modules package-lock.json && npm install
  • Check if port 3000 is available or use a different port: npm run dev -- -p 3001

2. Tests failing

  • Ensure all dependencies are installed: npm install
  • Clear Jest cache: npm run test -- --clearCache
  • Check test setup file: src/test-setup.ts

3. Build errors

  • Check TypeScript errors: npx tsc --noEmit
  • Ensure all imports are correct and files exist
  • Clear Next.js cache: rm -rf .next

4. API endpoints not working

  • Check network connectivity for Hugging Face API calls
  • Verify API route files are in correct locations under src/app/api/
  • Check browser developer tools for detailed error messages

5. Styling issues

  • Ensure Tailwind CSS is properly configured in tailwind.config.ts
  • Check if PostCSS is configured correctly in postcss.config.mjs
  • Verify global styles are imported in src/app/globals.css

Performance Issues

Slow autocomplete search:

  • Results are cached for 5 minutes to improve performance
  • Rate limiting prevents excessive API calls
  • Fallback models are provided when Hugging Face API is unavailable

Memory calculation taking too long:

  • Calculations are optimized for common model sizes
  • Results are memoized to avoid recalculation
  • Consider reducing batch size or sequence length for very large models

Browser Compatibility

  • Supported Browsers: Chrome 90+, Firefox 88+, Safari 14+, Edge 90+
  • Mobile Support: iOS Safari 14+, Chrome Mobile 90+
  • Accessibility: Tested with screen readers and keyboard navigation

Environment Variables

No environment variables are required for basic functionality. The application works entirely with client-side code and public APIs.

For production deployment, consider setting:

  • NODE_ENV=production for optimized builds
  • Custom API endpoints if using proxied Hugging Face access (see the sketch below)
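
For example, a proxied Hugging Face base URL could be read from the environment like this (HF_API_BASE_URL is a hypothetical variable name, not one the codebase defines today):

// Hypothetical example: HF_API_BASE_URL is not currently read by the project;
// it only illustrates how a proxied endpoint could be wired in.
const HF_API_BASE = process.env.HF_API_BASE_URL ?? "https://huggingface.co";

export function modelSearchUrl(query: string): string {
  return `${HF_API_BASE}/api/models?search=${encodeURIComponent(query)}`;
}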

Getting Help

  1. Check the Issues: Look for similar problems in the project issues
  2. Review Logs: Check browser developer tools console for detailed error messages
  3. Test API Endpoints: Use tools like curl or Postman to test API endpoints directly
  4. Verify Dependencies: Ensure all package versions match package.json

Development Tips

  • Use npm run test:watch during development for immediate feedback
  • Enable React Developer Tools browser extension for component debugging
  • Use the Network tab in browser dev tools to monitor API calls
  • Check the Application tab in dev tools to verify localStorage persistence

Contributing

  1. Code Style: Follow the existing TypeScript and React conventions
  2. Testing: Write tests for new functionality using Jest and React Testing Library
  3. Documentation: Update README and inline comments for new features
  4. Type Safety: Ensure all TypeScript types are properly defined
  5. Accessibility: Test with keyboard navigation and screen readers
  6. Performance: Consider performance implications of new features

Pull Request Process

  1. Fork the repository and create a feature branch
  2. Make your changes with appropriate tests
  3. Ensure all tests pass: npm run test
  4. Check code quality: npm run lint
  5. Build successfully: npm run build
  6. Update documentation as needed
  7. Submit a pull request with a clear description

License

This project is licensed under the MIT License. See the LICENSE file for details.

About

It's hard to decide which GPU to use for LLM inference; this tool was built to make that job easier.
