A simple FastAPI project with a chat interface and API endpoints, featuring a local LLM optimized for macOS.
- Chat Interface: Threaded chat UI with multiple conversations, similar to popular AI chat applications
- Dark/Light Theme: Toggle between dark and light modes
- Local LLM Integration: Run AI models directly on your machine
- Model Switching: Switch between models on the fly
- API Endpoints: Access LLM functionality programmatically
This project uses uv for dependency management. If you don't have uv installed:
# Install uv (using pip)
pip install uv
# Install dependencies using uv
uv pip install -e .
This project uses strict typing with mypy. To run type checks:
uv run mypy main.py
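To illustrate the style that strict mypy enforces, here is a minimal, hypothetical example (the names below are illustrative, not the project's actual code): every parameter and return value is annotated, and TypedDict access is key-checked.

```python
from typing import TypedDict


class ChatRequest(TypedDict):
    """Shape of a chat request body (hypothetical example)."""

    message: str
    thread_id: int


def build_prompt(req: ChatRequest) -> str:
    """Return the prompt string sent to the model.

    Under mypy --strict, omitting any annotation here, or
    accessing a key not declared on ChatRequest, is an error.
    """
    return f"[thread {req['thread_id']}] {req['message']}"
```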
# Run in development mode with auto-reload
python main.py
For macOS users (especially with Apple Silicon), use the optimized script:
./run_macos.sh
The script automatically detects your available RAM and selects the best model.
The application will be available at:
- Chat Interface: http://localhost:8000/
- API Documentation: http://localhost:8000/docs
For production deployment, use the provided production script:
./run_production.sh
- Access the Chat Interface: Open your browser and go to http://localhost:8000/
- Create New Conversations: Click "New Chat" to start a new thread
- Switch Between Threads: Click on any thread in the sidebar to switch contexts
- Change Models: Use the dropdown menu in the top-right to switch between models
- Toggle Dark/Light Mode: Click the moon/sun icon to change the theme
- GET /: Chat interface
- POST /api/chat: Generate a chat response
- POST /api/set-model: Change the active model
- GET /welcome: Returns a welcome message
- GET /items: Returns all items in the collection
- POST /items: Add a new item to the collection
- POST /llm/generate: Generate a response from the LLM model
- GET /llm/info: Get information about available models
This project has been optimized to work on macOS with Apple Silicon (M1/M2/M3). It uses:
- MPS (Metal Performance Shaders) when available for GPU acceleration
- Models that are compatible with 8GB RAM on macOS
- Memory optimizations for efficient inference
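The device-selection logic described above can be sketched as follows. This is a minimal sketch assuming PyTorch is the inference backend; it falls back to CPU when torch or MPS is unavailable.

```python
def pick_device() -> str:
    """Choose the best available torch device string.

    Prefers MPS (Apple Silicon GPU via Metal), then CUDA,
    then CPU. Returns "cpu" if torch is not installed.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    # getattr guards against older torch builds without an MPS backend
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"
```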
You can select which LLM model to use by setting the LLM_MODEL
environment variable:
# Use the tiny model (default, suitable for systems with limited RAM)
LLM_MODEL=tiny python main.py
# Use the small model (better capabilities but requires more RAM)
LLM_MODEL=small python main.py
# Use the medium model (best capabilities on 8GB RAM)
LLM_MODEL=medium python main.py
Available models:
- tiny: TinyLlama-1.1B-Chat-v1.0 (works on 4-8GB RAM)
- small: bigscience/bloom-560m (works on 4-8GB RAM)
- medium: microsoft/phi-2 (works on 8GB+ RAM)
See TRAINING.md for information on fine-tuning models with custom datasets.
You can test the API endpoints with the interactive Swagger UI at http://localhost:8000/docs