## Maintenance policy
This list is reviewed quarterly, in January, April, July, and October.
Projects must be (1) actively maintained within the last 6 months, (2) released under a permissive open-source license, and (3) either have ≥ 300 GitHub stars or demonstrable industry adoption.
Items that drop below these thresholds move to a watch-list and may be removed in the next cycle.
LLMOps (Large Language Model Operations) is a specialized discipline of MLOps tailored to the unique challenges of managing the entire lifecycle of LLM-powered applications. As organizations move from experimenting with LLMs to deploying them in production, they face distinct hurdles that traditional MLOps practices do not fully address. These challenges include complex prompt engineering, continuous fine-tuning, managing Retrieval-Augmented Generation (RAG) pipelines, handling high computational costs for inference, and monitoring for specific failure modes like hallucinations, toxicity, and data-privacy leakage.
LLMOps provides the principles, practices, and tools necessary to build, deploy, and maintain these applications in a reliable, scalable, and efficient manner. This guide organizes a curated list of high-relevance, open-source tools according to the core stages of the LLMOps lifecycle, providing a top-down workflow from initial concept to production monitoring.
- Phase 1 – Development & Experimentation
  - 1.1 Data Versioning & Governance
  - 1.2 Vector Stores & RAG Tooling
  - 1.3 Document Processing & Data Cleaning
  - 1.4 Prompt Engineering & Optimization
  - 1.5 Experiment Tracking
  - 1.6 LLM Evaluation
  - 1.7 Agent / App Frameworks
  - 1.8 Pipeline Orchestration
  - 1.9 Text-to-SQL & Database Agents
  - 1.10 LLM Web Clients & Chat UIs
- Phase 2 – Model Adaptation
- Phase 3 – Deployment & Serving
- Phase 4 – Operations
- Phase 5 – Privacy / Governance / Compliance
## Phase 1 – Development & Experimentation

Goal: Rapidly iterate on ideas, data, and prompts to prove technical feasibility.
Description: These tools help collect, clean, version, and explore data; craft and test prompts; prototype agents; and keep experiments reproducible.
### 1.1 Data Versioning & Governance

Goal: Make datasets reproducible and auditable across the project’s lifetime.
Description: Git-style version control and labeling frameworks ensure data integrity and provenance.
| Project | Details | Repository |
|---|---|---|
| DVC | Data Version Control – Git for Data & Models – ML Experiments Management. | |
| deeplake | Data Lake for Deep Learning. Build, manage, query, version, & visualize datasets. Stream data in real-time to PyTorch/TensorFlow. | |
| LakeFS | Git-like capabilities for your object storage. | |
| Cleanlab | The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels. | |
| Label Studio | A multi-type data labeling and annotation tool with a standardized output format. Essential for creating high-quality datasets. | |
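The core idea behind these Git-for-data tools can be sketched in a few lines: a dataset version is a content address, so any change to the records yields a new version id, while identical data always hashes to the same id. A minimal illustration (not any tool's actual API; `dataset_version` is an invented name):

```python
import hashlib
import json

def dataset_version(records: list) -> str:
    """Compute a deterministic content hash for a dataset.

    DVC and LakeFS track data by content address rather than by
    filename, so any edit produces a new version id and unchanged
    data is deduplicated for free.
    """
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]

v1 = dataset_version([{"text": "hello", "label": 1}])
v2 = dataset_version([{"text": "hello!", "label": 1}])  # one char changed
```

Because hashing is deterministic, re-running a pipeline on the same data reproduces the same version id, which is what makes audits possible.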
### 1.2 Vector Stores & RAG Tooling

Goal: Store and retrieve embeddings efficiently for Retrieval-Augmented Generation.
Description: RAG platforms and vector databases manage unstructured knowledge and power hybrid search.
| Project | Details | Repository |
|---|---|---|
| RagFlow | An open-source RAG application that provides a streamlined workflow based on deep document understanding. | |
| FastGPT | An LLM-based platform for building your own knowledge-base QA applications with out-of-the-box capabilities. | |
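At its core, the retrieval step these platforms automate is: embed the query, rank stored documents by similarity, and pass the top hits to the LLM as context. The sketch below uses a toy bag-of-words "embedding" and cosine similarity; real vector stores use dense neural embeddings and approximate-nearest-neighbor indexes:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; production systems use dense embeddings.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list, k: int = 2) -> list:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "LakeFS adds git-like branches to object storage",
    "vLLM is a high-throughput inference engine",
    "RAG retrieves documents to ground LLM answers",
]
top = retrieve("how does RAG ground answers with documents", docs, k=1)
```

The retrieved chunks are then interpolated into the prompt, which is the "augmented generation" half of RAG.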
### 1.3 Document Processing & Data Cleaning

Goal: Convert raw files and web sources into high-quality, LLM-ready text.
Description: ETL, parsing, and adversarial augmentation frameworks enhance data variety and robustness.
| Project | Details | Repository |
|---|---|---|
| Data-Juicer | A one-stop data processing system for LLMs. Used to build diverse, high-quality data recipes for pre-training and fine-tuning. | |
| Firecrawl | An API service that crawls any URL and converts it into clean, LLM-ready Markdown or structured data. | |
| OneFileLLM | A CLI tool to aggregate and preprocess data from multiple sources (files, GitHub, web) into a single text file for LLM use. | |
| Apache Tika | A content detection and analysis framework that extracts text and metadata from a huge variety of file formats. | |
| Unstructured | Open-source libraries and APIs to build custom data transformation pipelines for ETL, LLMs, and data analysis. | |
| DeepKE | A deep learning based knowledge extraction toolkit, supporting named entity, relation, and attribute extraction. | |
| Lilac | An open-source tool that helps you see and understand your unstructured text data. Explore, cluster, clean, and enrich datasets for LLMs. | |
| TextAttack | A Python framework for adversarial attacks, data augmentation, and hard-negative generation to improve robustness. | |
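A common final step after extraction and cleaning is chunking: splitting long documents into overlapping windows so each piece fits an embedding or context budget while preserving continuity at the boundaries. A minimal word-window sketch (the function name and defaults are illustrative, not from any listed tool):

```python
def chunk_text(text: str, max_words: int = 50, overlap: int = 10) -> list:
    """Split cleaned text into overlapping word-window chunks.

    The overlap keeps sentences that straddle a boundary visible
    in both neighboring chunks.
    """
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)]
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Production pipelines usually chunk on semantic boundaries (headings, paragraphs) rather than raw word counts, but the sliding-window idea is the same.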
### 1.4 Prompt Engineering & Optimization

Goal: Design, test, and version prompts for consistent, high-quality outputs.
Description: These tools provide A/B testing, genetic search, and interactive sandboxes for rapid iteration.
| Project | Details | Repository |
|---|---|---|
| promptfoo | Open-source tool for testing & evaluating prompt quality. | |
| Agenta | An open-source LLMOps platform with tools for prompt management, evaluation, and deployment. | |
| DSPy | A framework for programming—not just prompting—language models. It allows you to optimize prompts and weights. | |
| Chainlit | Build and share conversational UIs in seconds; perfect for interactive prompt sandboxing and demos. | |
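The essence of prompt testing is treating a prompt template like code under test: run it against a fixed set of labeled cases and score the outputs, so two template variants can be compared on the same benchmark. A minimal harness in the spirit of these tools (the stub model and function names are invented for illustration):

```python
def run_eval(prompt_template: str, cases: list, model) -> float:
    """Score a prompt template over labeled test cases.

    Each case supplies template variables and an expected substring;
    the score is the fraction of cases the model gets right.
    """
    hits = 0
    for case in cases:
        output = model(prompt_template.format(**case["vars"]))
        if case["expected"].lower() in output.lower():
            hits += 1
    return hits / len(cases)

# Deterministic stub standing in for a real LLM call.
def stub_model(prompt: str) -> str:
    return "positive" if "great" in prompt else "negative"

cases = [
    {"vars": {"review": "great product"}, "expected": "positive"},
    {"vars": {"review": "broke in a day"}, "expected": "negative"},
]
score = run_eval("Classify the sentiment: {review}", cases, stub_model)
```

Running the same cases against two templates (A/B testing) turns prompt tweaking from guesswork into a measurable comparison.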
### 1.5 Experiment Tracking

Goal: Record, compare, and reproduce experiments across data, prompts, and models.
Description: Track metrics, parameters, and artifacts; integrate with CI to enable data-driven decisions.
| Project | Details | Repository |
|---|---|---|
| MLflow | An open-source framework for the end-to-end machine learning lifecycle, helping developers track experiments, evaluate models/prompts, and more. | |
| Weights & Biases | A developer-first MLOps platform for experiment tracking, dataset versioning, and model management. Featuring W&B Prompts for LLM execution flow visualization. | |
| Aim | An easy-to-use and performant open-source experiment tracker. | |
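What these trackers record reduces to a simple schema: each run is a bundle of parameters (prompt version, temperature, dataset hash) and resulting metrics, queryable later. A minimal in-memory sketch of that idea (not MLflow's or W&B's actual API; all names here are invented):

```python
import json

class ExperimentTracker:
    """Minimal run tracker: log params + metrics, query the best run."""

    def __init__(self):
        self.runs = []

    def log_run(self, params: dict, metrics: dict) -> None:
        self.runs.append({"params": params, "metrics": metrics})

    def best_run(self, metric: str) -> dict:
        return max(self.runs, key=lambda r: r["metrics"][metric])

    def export(self) -> str:
        # Real trackers persist to a server or file store; JSON suffices here.
        return json.dumps(self.runs, sort_keys=True)

tracker = ExperimentTracker()
tracker.log_run({"prompt": "v1", "temperature": 0.2}, {"accuracy": 0.71})
tracker.log_run({"prompt": "v2", "temperature": 0.0}, {"accuracy": 0.78})
best = tracker.best_run("accuracy")
```

The payoff comes from discipline: if every prompt or data change is logged as a run, "which change helped?" becomes a query instead of an argument.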
### 1.6 LLM Evaluation

Goal: Quantify performance, robustness, and safety of prompts and models.
Description: Local and cloud frameworks automate scoring for RAG, summarization, Q&A, and more.
| Project | Details | Repository |
|---|---|---|
| LangWatch | Visualize LLM evaluation experiments and DSPy pipeline optimizations. | |
| Arize-Phoenix | ML observability for LLMs, vision, language, and tabular models. Also offers powerful local evaluation capabilities. | |
| Evidently | An open-source framework to evaluate, test and monitor ML and LLM-powered systems. | |
| Ragas | RAG evaluation metrics and pipelines for faithfulness and answer relevancy. | |
| OpenAI Evals | Reference harness for benchmarking GPT-style models across tasks. | |
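To make a metric like faithfulness concrete: it asks how much of an answer is actually supported by the retrieved context. Frameworks such as Ragas judge this with an LLM; the crude token-overlap stand-in below conveys the shape of the computation, not the real implementation:

```python
def faithfulness(answer: str, context: str) -> float:
    """Fraction of answer tokens that appear in the retrieved context.

    A toy proxy for LLM-judged faithfulness: 1.0 means every answer
    token is grounded, 0.0 means none are.
    """
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

score = faithfulness(
    "the cache stores key value pairs",
    "the kv cache stores key value pairs for attention",
)
```

Scores like this run over a whole evaluation set turn "the bot hallucinates sometimes" into a number you can track across prompt and model versions.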
### 1.7 Agent / App Frameworks

Goal: Compose prompts, tools, and workflows into full-stack LLM applications.
Description: High-level SDKs and low-code builders accelerate agent development and experimentation.
| Project | Details | Repository |
|---|---|---|
| LangChain | Building applications with LLMs through composability. | |
| LlamaIndex | Provides a central interface to connect your LLMs with external data. | |
| Dify | An open-source LLM app development platform for building and operating generative AI-native applications. | |
| Flowise | Drag & drop UI to build your customized LLM flow using LangchainJS. | |
| OpenChat | Open-source ChatGPT alternative. A robust and extensible open platform for conversational AI, agentic workflows, and custom plugins. | |
| MaxKB | An extensible, self-hosted, open-source knowledge base and conversational agent platform for RAG, workflow automation, and personal/private GPTs. | |
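Underneath these frameworks, an agent is a loop: the model emits either a tool call or a final answer, tool results are appended to the transcript, and the model is called again. A self-contained ReAct-style sketch with a scripted stub in place of a real LLM (tool and function names are invented for illustration):

```python
import json

def calculator(expression: str) -> str:
    # Example tool; real frameworks register many, with typed schemas.
    allowed = set("0123456789+-*/. ()")
    if not set(expression) <= allowed:
        raise ValueError("unsafe expression")
    return str(eval(expression))

TOOLS = {"calculator": calculator}

def run_agent(model, question: str, max_steps: int = 3) -> str:
    """ReAct-style loop: the model either calls a tool or answers."""
    transcript = question
    for _ in range(max_steps):
        step = json.loads(model(transcript))  # model returns a JSON action
        if step["type"] == "final":
            return step["answer"]
        result = TOOLS[step["tool"]](step["input"])
        transcript += f"\nObservation: {result}"
    return "no answer"

# Scripted stub standing in for a real LLM.
def stub_model(transcript: str) -> str:
    if "Observation" not in transcript:
        return json.dumps({"type": "tool", "tool": "calculator", "input": "6*7"})
    return json.dumps({"type": "final", "answer": "42"})

answer = run_agent(stub_model, "What is 6 times 7?")
```

The frameworks above add what this sketch omits: structured tool schemas, memory, retries, and streaming, but the control flow is the same loop.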
### 1.8 Pipeline Orchestration

Goal: Automate batch and streaming workflows for data ingestion, fine-tuning, and evaluation.
Description: DAG-based schedulers and function-graph frameworks ensure reproducible, modular pipelines.
| Project | Details | Repository |
|---|---|---|
| Apache Airflow | A platform to programmatically author, schedule, and monitor workflows. Ideal for orchestrating batch jobs like fine-tuning or RAG indexing. | |
| Apache NiFi | An easy-to-use, powerful, and reliable system to process and distribute data. Well-suited for real-time, streaming data pipelines for RAG. | |
| ZenML | MLOps framework to create reproducible pipelines for ML and LLM workflows. | |
| Hamilton | A lightweight framework to represent ML/language model pipelines as a series of Python functions. | |
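The function-graph idea Hamilton popularized can be sketched compactly: each pipeline node is a plain function whose parameter names refer to other nodes or to pipeline inputs, and a tiny resolver walks the implied DAG. This is an illustrative toy resolver, not Hamilton's actual driver API:

```python
import inspect

def run_pipeline(funcs: dict, inputs: dict) -> dict:
    """Resolve a function DAG: parameter names are node/input names."""
    results = dict(inputs)

    def resolve(name: str):
        if name in results:
            return results[name]
        fn = funcs[name]
        args = {p: resolve(p) for p in inspect.signature(fn).parameters}
        results[name] = fn(**args)
        return results[name]

    for name in funcs:
        resolve(name)
    return results

# Pipeline nodes: each function's parameters name its dependencies.
def cleaned(raw_text: str) -> str:
    return raw_text.strip().lower()

def n_tokens(cleaned: str) -> int:
    return len(cleaned.split())

out = run_pipeline({"cleaned": cleaned, "n_tokens": n_tokens},
                   {"raw_text": "  Hello LLMOps World  "})
```

Because dependencies are declared by naming rather than wiring, every intermediate result is addressable, which is what makes these pipelines easy to test and cache.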
### 1.9 Text-to-SQL & Database Agents

Goal: Translate natural-language queries to SQL and unlock structured data for business users.
Description: These tools combine LLMs with schema discovery and query execution to generate accurate, safe SQL across diverse databases.
| Project | Details | Repository |
|---|---|---|
| Chat2DB | AI-augmented SQL client: natural-language to SQL, visualization, and reporting. | |
| Vanna.ai | Python-based framework for schema-aware text-to-SQL and RAG-enhanced analytics. | |
| DB-GPT | Private, self-hosted text-to-SQL agent framework with RAG support. | |
### 1.10 LLM Web Clients & Chat UIs

Goal: Provide user-friendly, open-source frontends for ChatGPT-compatible and self-hosted LLMs, with multi-backend support, plugin systems, knowledge base, and teamwork features.
Description: These projects make it easy to interact with LLMs from web browsers and mobile devices, enabling team or personal usage, plugin integration, and knowledge management.
| Project | Details | Repository |
|---|---|---|
| ChatGPT-Next-Web | Open-source ChatGPT web UI, supports multiple LLM backends, fast deployment, personal/private use, and advanced features. | |
| Open WebUI | Modern, extensible, and self-hosted UI for local or remote LLMs. Supports Ollama, OpenAI, and more. Teamwork and plugin support. | |
| Chatbot UI | ChatGPT-style open-source web UI for connecting to OpenAI and compatible APIs, extensible and customizable for personal use. | |
| LobeChat | An open-source, extensible ChatGPT web UI. Team workspace, plugin ecosystem, multi-LLM support (OpenAI, Azure, Google, Anthropic, Ollama, etc). | |
| NeatChat | Minimal, clean, and privacy-friendly ChatGPT web UI, supports OpenAI, Azure, local LLMs, and markdown knowledge base. | |
## Phase 2 – Model Adaptation

Goal: Specialize general-purpose LLMs to domain-specific tasks while controlling compute and data cost.
Description: Parameter-efficient fine-tuning and editing techniques inject new knowledge and correct errors without full retraining.
| Project | Details | Repository |
|---|---|---|
| LlamaFactory | A unified, efficient fine-tuning framework for over 100 LLMs and VLMs. | |
| Swift (modelscope) | A framework for fine-tuning and deploying 500+ LLMs and 200+ MLLMs, with extensive support for PEFT techniques. | |
| peft | State-of-the-art Parameter-Efficient Fine-Tuning. | |
| QLoRA | Finetune a 65B-parameter model on a single 48 GB GPU while preserving full 16-bit finetuning task performance. | |
| axolotl | A tool designed to streamline the fine-tuning of various AI models. | |
| LoRA-Hub | Community marketplace and registry for sharing and discovering LoRA weight adapters. | |
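The math behind LoRA-style adapters (the technique underlying peft and QLoRA) is compact: the frozen weight W is augmented with a trainable low-rank product A·B, so the forward pass computes y = x(W + αAB) while only A (d×r) and B (r×d) are trained, which is far fewer parameters than W when r ≪ d. A pure-Python illustration of that arithmetic, not any library's API:

```python
def lora_forward(x, W, A, B, alpha=1.0):
    """y = x @ (W + alpha * A @ B); only A and B would be trained."""
    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
                 for j in range(len(Y[0]))] for i in range(len(X))]

    def add_scaled(X, Y):
        return [[X[i][j] + alpha * Y[i][j] for j in range(len(X[0]))]
                for i in range(len(X))]

    return matmul(x, add_scaled(W, matmul(A, B)))

# d=2, rank r=1: the adapter holds d*r + r*d numbers instead of d*d.
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (identity here)
A = [[1.0], [0.0]]            # d x r, trainable
B = [[0.0, 2.0]]              # r x d, trainable
y = lora_forward([[3.0, 4.0]], W, A, B)
```

Because the update is additive, adapters can be merged into W for zero-overhead inference or swapped per task, which is why registries of shareable LoRA weights exist at all.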
| Project | Details | Repository |
|---|---|---|
| FastEdit | FastEdit aims to assist developers with injecting fresh and customized knowledge into large language models efficiently. | |
## Phase 3 – Deployment & Serving

Goal: Deliver low-latency, scalable inference to end users across cloud and edge environments.
Description: Engines, packaging frameworks, and local runtimes optimize throughput, cost, and portability.
| Project | Details | Repository |
|---|---|---|
| vllm | A high-throughput and memory-efficient inference and serving engine for LLMs. | |
| SGLang | A fast serving framework for LLMs and VLMs, designed for high throughput and controllable, structured generation. | |
| TensorRT-LLM | An open-source library for optimizing LLM inference on NVIDIA GPUs with TensorRT. | |
| Ollama | Serve LLMs locally. A user-friendly application often powered by llama.cpp underneath. | |
| llama.cpp | A foundational library for LLM inference in pure C/C++, enabling efficient performance on CPUs and consumer hardware. | |
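Why do engines like vLLM and SGLang achieve such high throughput? A large part is continuous batching: instead of padding a whole batch to its longest request, a finished request's slot is immediately refilled from the queue. The toy utilization model below (assuming an unlimited request queue; all names invented) shows the gap static padding leaves:

```python
def utilization(lengths: list, continuous: bool) -> float:
    """Fraction of batch slots doing useful work per decode step.

    Static batching pads every request to the longest sequence in the
    batch; continuous batching refills a slot the moment its request
    finishes, so (with a full queue) every slot stays busy.
    """
    useful_tokens = sum(lengths)
    if continuous:
        return 1.0  # every slot always holds an active request
    slot_steps = max(lengths) * len(lengths)
    return useful_tokens / slot_steps

# One 100-token request batched with three 10-token requests:
u_static = utilization([100, 10, 10, 10], continuous=False)
```

With these lengths, static batching keeps the GPU under a third busy, which is why request-level scheduling (together with paged KV-cache memory) matters so much for serving cost.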
| Project | Details | Repository |
|---|---|---|
| Xinference | A versatile platform to serve language, speech, and multimodal models with a unified, OpenAI-compatible API. | |
| BentoML | The Unified Model Serving Framework. | |
| OpenLLM | An open platform for operating large language models (LLMs) in production. | |
| Kserve | Standardized Serverless ML Inference Platform on Kubernetes. | |
| Triton Server | The Triton Inference Server provides an optimized cloud and edge inferencing solution. | |
| Kubeflow | Machine Learning Toolkit for Kubernetes, often used for orchestrating deployment pipelines. | |
## Phase 4 – Operations

Goal: Maintain reliability, cost efficiency, and user safety for live systems.
Description: Observability, guardrails, and policy frameworks provide continuous feedback and protection.
| Project | Details | Repository |
|---|---|---|
| Helicone | Open source LLM observability platform for logging, monitoring, and debugging. | |
| Portkey-SDK | Control Panel with an observability suite & an AI gateway — to ship fast, reliable, and cost-efficient apps. | |
| Langfuse | Open Source LLM Engineering Platform: Traces, evals, prompt management and metrics to debug and improve your LLM application. | |
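The mechanics of LLM observability are simple to sketch: wrap every model call so that the prompt, output, latency, and token counts are recorded as a trace event. The decorator below is an illustrative toy, not any platform's SDK (these tools add sessions, nested spans, cost attribution, and dashboards):

```python
import functools
import time

TRACES = []  # real platforms ship these events to a collector

def traced(fn):
    """Log every call to an LLM function: prompt, output, latency, tokens."""
    @functools.wraps(fn)
    def wrapper(prompt, **kwargs):
        start = time.perf_counter()
        output = fn(prompt, **kwargs)
        TRACES.append({
            "name": fn.__name__,
            "prompt": prompt,
            "output": output,
            "latency_ms": (time.perf_counter() - start) * 1000,
            "prompt_tokens": len(prompt.split()),  # crude token proxy
        })
        return output
    return wrapper

@traced
def fake_llm(prompt: str) -> str:
    # Stub standing in for a real model call.
    return "stub completion"

fake_llm("Summarize the release notes")
```

Once every call emits an event like this, cost spikes, latency regressions, and bad outputs can be traced back to the exact prompt that produced them.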
| Project | Details | Repository |
|---|---|---|
| Guardrails-AI | Declarative, schema-driven validation and content moderation for LLM outputs. | |
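Schema-driven validation boils down to: parse the model's structured reply, check it against a declared schema, and reject (or re-ask) on failure. A hand-rolled miniature of the idea; Guardrails-AI expresses the schema declaratively and automates the re-asking loop:

```python
import json

def validate_output(raw: str, required: dict) -> dict:
    """Validate an LLM's JSON reply against a tiny field:type schema.

    Raises ValueError when a field is missing or mistyped, which a
    caller can turn into a retry with the error fed back to the model.
    """
    data = json.loads(raw)
    for field, ftype in required.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], ftype):
            raise ValueError(f"bad type for {field}")
    return data

reply = '{"sentiment": "positive", "confidence": 0.9}'
parsed = validate_output(reply, {"sentiment": str, "confidence": float})
```

Treating model output as untrusted input, exactly like user input, is the design principle here: nothing downstream should see a reply that failed validation.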
## Phase 5 – Privacy / Governance / Compliance

Goal: Ensure AI systems meet legal, ethical, and organizational standards.
Description: Policy-as-code, bias detection, and continuous validation frameworks enable trustworthy deployment.
| Project | Details | Repository |
|---|---|---|
| Giskard | Testing framework dedicated to ML models, from tabular to LLMs. Detect risks of biases, performance issues and errors. | |
| Deepchecks | Tests for Continuous Validation of ML Models & Data. | |
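One of the simplest checks in this family makes the category concrete: compare a model's positive-prediction rate across demographic groups and flag a large gap. This is a generic fairness statistic, not either tool's implementation (the function name is invented for illustration):

```python
def selection_rate_gap(predictions: list, groups: list) -> float:
    """Max difference in positive-prediction rate across groups.

    A gap near 0 means the model selects all groups at similar rates;
    a large gap is a signal worth investigating, not proof of bias.
    """
    rates = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        rates[g] = sum(predictions[i] for i in idx) / len(idx)
    return max(rates.values()) - min(rates.values())

# Group "a" selected 2/3 of the time, group "b" only 1/3:
gap = selection_rate_gap([1, 1, 0, 0, 1, 0], ["a", "a", "a", "b", "b", "b"])
```

Frameworks like Giskard and Deepchecks run batteries of such checks continuously, in CI and in production, so regressions surface before they become compliance incidents.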