This project is currently under active development and is not ready for production use.
This project provides a local Kubernetes development environment to experiment with GitOps using ArgoCD and workflow automation via Argo Workflows. It is designed for testing machine learning workflows and deploying applications using lightweight Docker images, minimal compute, and a simplified infrastructure footprint.
- `argo-workflow/`: Argo Workflow templates defining machine learning pipelines and utility jobs.
- `argocd-templates/`: ArgoCD Application templates that manage deployments of services.
- `cluster-conf/`: Kubernetes manifests to bootstrap the cluster (e.g., namespaces, service accounts).
- `docker/`: Dockerfiles and scripts to build the container images used throughout the environment.
- `inference-ui/`: Static frontend UI for performing inference through a web interface.
- `private-registry/`: Setup files for a private Docker registry that stores locally built images.
This environment was tested exclusively on a local Kubernetes cluster composed of three Ubuntu nodes with limited compute and disk resources. For this reason:
- All Docker images are minimal and purpose-built.
- Machine Learning training is extremely simplified and uses CPU-only execution.
- Node `ubuntu2` had the highest compute capacity, which is why some templates in this repository specify a `nodeSelector` to schedule heavier tasks on that node.
Please adjust the `nodeSelector` values if your cluster layout is different, or remove them entirely if your cluster has more compute resources available.
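For reference, pinning a workflow step to a specific node uses the standard `kubernetes.io/hostname` label. The snippet below is an illustrative sketch, not one of the repository's actual templates; the image name and resource values are assumptions:

```yaml
# Illustrative Argo Workflows template step pinned to the ubuntu2 node.
templates:
  - name: train-model
    nodeSelector:
      kubernetes.io/hostname: ubuntu2   # standard node hostname label
    container:
      image: localhost:5000/trainer:latest   # hypothetical locally built image
      command: ["python", "train.py"]
      resources:
        requests:
          cpu: "2"       # illustrative values for the "heavier" node
          memory: 2Gi
```

If your nodes use custom labels instead of hostnames, any `key: value` pair present on the target node works the same way.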
THIS DOCUMENTATION HAS BEEN CREATED WITH THE HELP OF AI.
I used two types of storage for the experiment:

- **Local storage attached to Ubuntu02**, configured with a PV and PVC. Ubuntu02 was the VM used to train the models because it held the majority of the available compute power. Docker images were also saved on Ubuntu02.
- **NFS volumes shared from Ubuntu03**, used so that DVC-repository tasks and the Inference UI work from any node.
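As a sketch of the first option, a local PersistentVolume bound to Ubuntu02 and its claim might look like the following. Names, paths, sizes, and the storage class are assumptions for illustration, not the repository's actual manifests:

```yaml
# Hypothetical local PV/PVC pair for training storage on Ubuntu02.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data-pv          # assumed name
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-storage # assumed class shared by PV and PVC
  local:
    path: /mnt/training-data      # assumed path on the node
  nodeAffinity:                   # required for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["ubuntu2"]
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-storage
  resources:
    requests:
      storage: 10Gi
```

An NFS-backed PV for the second option is analogous: replace the `local:` block with an `nfs:` block (`server`, `path`) and drop the `nodeAffinity`, since NFS volumes are reachable from any node.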
To use this setup, you must have the following installed and configured:
- A running Kubernetes cluster
- ArgoCD
- Argo Workflows
- Kustomize
This section describes each Argo Workflow in the `argo-workflow/` directory and its purpose in the local MLOps pipeline.
This workflow handles the training of a new machine learning model version. It performs the following steps:
- Pulls a subset of training data from IMDB_Dataset.
- Launches a training job using a lightweight container.
- Trains the model using CPU resources only.
- Saves the trained model to a shared persistent volume.
- Tags and saves the model version with DVC.
This pipeline is CPU-optimized and assumes limited compute and memory availability.
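The steps above could be expressed as an Argo Workflow along these lines. This is a sketch under stated assumptions: the image name, script names, DVC invocation, and PVC name are all illustrative, not the repository's actual template:

```yaml
# Hypothetical shape of a CPU-only training workflow.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: train-new-model-
spec:
  entrypoint: train
  templates:
    - name: train
      container:
        image: localhost:5000/trainer:latest  # hypothetical image from the private registry
        command: ["sh", "-c"]
        args:
          - |
            python prepare_data.py &&        # pull a subset of IMDB_Dataset (assumed script)
            python train.py --device cpu &&  # CPU-only training (assumed script)
            dvc add models/model.pkl &&      # track the trained artifact with DVC
            git tag "model-v1"               # tag the model version (example tag)
        resources:
          requests:
            cpu: "1"        # illustrative, matching the limited-compute assumption
            memory: 1Gi
        volumeMounts:
          - name: model-store
            mountPath: /models   # shared persistent volume for the trained model
  volumes:
    - name: model-store
      persistentVolumeClaim:
        claimName: training-data-pvc   # assumed PVC name
```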
This workflow extends the training process by including an inference validation phase. It:
- Trains a new model using the same procedure as `train-new-model-version.yaml`.
- After training, deploys the model for local inference testing.
- Sends test inputs to the inference service.
- Logs results and verifies outputs for basic validation.
- No models are saved or tagged with DVC.
This is useful for quickly checking model quality after training, all within the same automated flow.
This workflow promotes a validated model version to be used in production or as the default. It:
- Pulls the correct tag of the model from DVC.
- Saves that model version into the current-version folder used by the UI.
This flow represents the final approval step in a lightweight CI/CD-style pipeline.
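A promotion step of this kind can be sketched as a single-container workflow that checks out the tagged model with DVC and copies it into the folder the UI serves from. The parameter name, tag format, paths, image, and PVC name below are assumptions for illustration:

```yaml
# Hypothetical promotion workflow: fetch a validated model tag, publish it for the UI.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: promote-model-
spec:
  entrypoint: promote
  arguments:
    parameters:
      - name: model-tag       # git/DVC tag of the validated model (assumed parameter)
        value: model-v1       # example value
  templates:
    - name: promote
      inputs:
        parameters:
          - name: model-tag
      container:
        image: localhost:5000/dvc-tools:latest   # hypothetical image with git + dvc installed
        command: ["sh", "-c"]
        args:
          - |
            git checkout "{{inputs.parameters.model-tag}}" &&
            dvc pull models/model.pkl &&
            cp models/model.pkl /shared/current-version/model.pkl
        volumeMounts:
          - name: shared-nfs
            mountPath: /shared   # NFS volume also mounted by the Inference UI
  volumes:
    - name: shared-nfs
      persistentVolumeClaim:
        claimName: nfs-shared-pvc   # assumed PVC name
```

Because the destination sits on the shared NFS volume, the UI picks up the promoted model from any node without a redeploy.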
The `inference-ui` component provides a simple user interface for inference operations. The rollout process for `inference-ui` is managed through ArgoCD and follows these steps:
- **Manifest Definition**: The Kubernetes manifests for `inference-ui` are located in the `inference-ui/` directory. These include the Deployment, Service, and Ingress resources the application needs.
- **Application Configuration**: An ArgoCD Application resource is defined to manage the `inference-ui` deployment. This configuration specifies the source repository, target revision, and the path to the manifests.
- **Deployment Process**:
  - ArgoCD monitors the specified Git repository for changes in the `inference-ui` manifests.
  - Upon detecting changes, ArgoCD synchronizes the live cluster state with the desired state defined in the repository.
  - The synchronization process creates or updates Kubernetes resources to match the manifests.
- **Monitoring and Health Checks**:
  - ArgoCD provides real-time monitoring of the `inference-ui` application.
  - Health checks are configured to ensure the application is running as expected.
  - Any discrepancies or issues are reported in the ArgoCD dashboard for prompt resolution.
This automated rollout process ensures that updates to the `inference-ui` component are deployed consistently and reliably across environments.
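The Application Configuration step above corresponds to an ArgoCD Application manifest roughly like the following. The repository URL, path, and target namespace are placeholders, and the automated sync policy is one reasonable choice, not necessarily the repository's:

```yaml
# Sketch of an ArgoCD Application managing the inference-ui manifests.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: inference-ui
  namespace: argocd            # namespace where ArgoCD itself runs
spec:
  project: default
  source:
    repoURL: https://example.com/your/repo.git  # placeholder repository URL
    targetRevision: main                        # placeholder target revision
    path: inference-ui                          # path to the manifests in the repo
  destination:
    server: https://kubernetes.default.svc      # the local cluster
    namespace: inference-ui                     # assumed target namespace
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift to the Git-defined state
```

With `automated` sync enabled, ArgoCD applies changes as soon as they land in Git; omitting `syncPolicy.automated` would instead require a manual sync from the dashboard or CLI.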
To enhance the functionality and usability of this project, the following improvements are planned:
- **CI/CD Integration**: Implement continuous integration and deployment pipelines to automate testing and deployment processes, improving efficiency and reliability.
- **Parameterization**: Introduce parameterization in the Kubernetes manifests and Docker images to allow for dynamic configuration.
- **User Interface Enhancements**: Improve the `inference-ui` component with additional features and a more intuitive user interface to enhance user experience.
- **LLM Monitoring and Experimenting**: Include a proper monitoring stack to make use of DVC's experiment-tracking features.