This repository leverages Oracle Cloud's Always Free tier to provision a Kubernetes cluster. In its current setup there are no monthly costs anymore, as I've now moved the last part (DNS) from OCI to Cloudflare.
Oracle Kubernetes Engine (OKE) is free to use, and you only pay for worker
nodes if you exceed the Always Free tier — which we don’t.
The free tier provides 4 OCPUs and 24GB of memory, which are split between two worker nodes (`VM.Standard.A1.Flex`), allowing for efficient resource utilization. Each node has a 100GB boot volume, with around 60GB available for in-cluster storage via Longhorn. For ingress, we use `k8s.io/nginx` with Oracle's Flexible Load Balancer (10Mbps, layer 7); for teleport we use the network LB (layer 4), as both are free as well.
Getting an Always Free account can sometimes be tricky, but there are several guides on Reddit that explain how to speed up the creation process.
The initial infra setup is inspired by this great tutorial: https://arnoldgalovics.com/free-kubernetes-oracle-cloud/
Warning
This project uses ARM instances, not x86, due to limitations of the Always Free tier.
And please mind: this setup is loosely documented and opinionated. It's working and in use by myself. It's public to showcase how this setup can be recreated, but you need to know what you're doing and where to make modifications for yourself.
This repo hosts my personal stuff and is a playground for my Kubernetes tooling.
Tip
In case you want to reproduce another `oke` setup, you might find this guide also helpful.
- K8s control plane
- Worker Nodes
- Ingress
- nginx-ingress controller on a layer 7 lb
- teleport svc on a layer 4 lb
- Certmanager
- with letsencrypt for dns & http challenge
- External DNS
- with sync to the cloudflare dns management
- CR to provide `A` records for my home-network (see the sketch below this list)
- Dex as OIDC Provider
- with GitHub as IdP
- FluxCD for Gitops
- deployed with the new fluxcd operator
- github → flux webhook receiver for instant reconciliation
- flux → github commit annotation about reconciliation status
- Teleport for k8s cluster access
- Storage
- with longhorn (rook/ceph & piraeus didn't work out)
- Grafana with Dex Login
- Dashboards for Flux
- Loki for log aggregation
- Metrics Server for cpu/mem usage overview
- Kyverno and Image Signing
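To illustrate the External DNS item above, this is roughly what a `DNSEndpoint` CR for a home-network `A` record can look like (hostname and IP are placeholders, not the values used here; the CRD source has to be enabled in external-dns):

```yaml
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: home-network
  namespace: external-dns
spec:
  endpoints:
    - dnsName: home.example.com   # placeholder hostname
      recordType: A
      recordTTL: 300
      targets:
        - 203.0.113.10            # placeholder home IP, synced to cloudflare by external-dns
```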
Note
I've recently updated the tf-backend config to utilize the OCI native backend. This requires terraform >= v1.12
This setup uses terraform to manage the OCI resources and a bit of the kubernetes part.
- terraform
- oci-binary
- `oci setup config` successfully run
The terraform state is pushed to Oracle Object Storage (free as well). For that, we have to create a bucket initially:
```sh
❯ oci os bucket create --name terraform-states --versioning Enabled --compartment-id xxx
```
With the bucket created we can configure the `~/.oci/config`:
```ini
[DEFAULT]
user=ocid1.user.xxx
fingerprint=ee:f4:xxx
tenancy=ocid1.tenancy.oc1.xxx
region=eu-frankfurt-1
key_file=/Users/xxxx.pem

[default]
aws_access_key_id = xxx <- this needs to be created via UI: User -> customer secret key
aws_secret_access_key = xxx <- this needs to be created via UI: User -> customer secret key
```
Refer to my backend config for the terraform s3 configuration.
- The infrastructure (everything to a usable k8s-api endpoint) is managed by terraform in infra
- The k8s-modules (OCI specific config for secrets etc.) are managed by terraform in config
- The k8s apps/config is done with flux; see below
These components are independent from each other, but obviously the infra should be created first.
For the config part, you need to add a private `*.tfvars` file:
```hcl
compartment_id = "ocid1.tenancy.zzz"
... # this list is currently not complete; there's more to add
```
To run the `config` section you need more variables, which either get output by the `infra` run or have to be extracted from the webui.
Tip
During the initial provisioning, the terraform run of `config` might fail: it tries to create a `ClusterSecretStore`, which only exists after the initial deployment of external secrets with flux. This is expected.
Just rerun terraform after external secrets is successfully deployed.
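For context, such a `ClusterSecretStore` backed by the OCI Vault looks roughly like the sketch below (shown as plain YAML here; in this repo it is created by terraform). The OCIDs, region and secret names are placeholders, and the exact field layout should be checked against the external-secrets Oracle provider docs:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: oci-vault
spec:
  provider:
    oracle:
      vault: ocid1.vault.oc1.eu-frankfurt-1.xxx   # placeholder vault OCID
      region: eu-frankfurt-1
      principalType: UserPrincipal
      auth:
        user: ocid1.user.oc1.xxx                  # placeholder user OCID
        tenancy: ocid1.tenancy.oc1.xxx            # placeholder tenancy OCID
        secretRef:
          privatekey:
            name: oci-credentials                 # placeholder k8s secret holding the API key
            namespace: external-secrets
            key: privateKey
          fingerprint:
            name: oci-credentials
            namespace: external-secrets
            key: fingerprint
```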
After running terraform in the infra folder, a kubeconfig file called `.kube.config` should be created in the terraform folder.
This can be used to access the cluster.
For a more regulated access, see the Teleport section below.
The terraform resources in the config folder will rely on the kubeconfig.
Most resources and core components of the k8s cluster are provisioned with fluxcd.
Therefore we need a GitHub Personal Access Token (`pat`, fine-grained) for your repo.
```txt
# github permission scope for the token:
contents - read, write
commit statuses - read, write
webhooks - read, write
```
- Place this token in a private tfvars. This is used to generate the fluxcd webhook url, which triggers fluxcd reconciliation after each commit
- Place this token in the oci vault (`github-fluxcd-token`). This allows fluxcd to annotate the github commit status, depending on the state of the `Kustomization`.
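As a rough illustration of those two pieces (placeholder names, not copied from this repo): a notification `Receiver` exposes the webhook URL that GitHub calls, and a `Provider`/`Alert` pair pushes the `Kustomization` state back to GitHub as a commit status.

```yaml
# Webhook receiver: a push event triggers an immediate reconciliation of the GitRepository.
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: github-receiver
  namespace: flux-system
spec:
  type: github
  events: ["ping", "push"]
  secretRef:
    name: webhook-token        # placeholder; secret used to generate/validate the webhook URL
  resources:
    - apiVersion: source.toolkit.fluxcd.io/v1
      kind: GitRepository
      name: flux-system
---
# Commit status feedback: reports Kustomization health back to GitHub.
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: github-status
  namespace: flux-system
spec:
  type: github
  address: https://github.com/<owner>/<repo>   # placeholder repo URL
  secretRef:
    name: github-fluxcd-token                  # token synced from the oci vault
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: github-status
  namespace: flux-system
spec:
  providerRef:
    name: github-status
  eventSources:
    - kind: Kustomization
      name: flux-system
```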
Migrating from the flux `bootstrap` method to the flux-operator might be tricky.
I lost most installed apps during my upgrade, because I misconfigured the `FluxInstance.path` (this could've been mitigated by setting `prune: false` on the KS).
Destroying the old bootstrap resource during the TF apply led to the removal of the fluxcd crds like `GitRepo`, `HelmRelease` etc. (I had to remove the finalizers of the crds to allow removal). This didn't impact my already deployed CRs though.
The Flux Operator takes care of reinstalling everything.
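For orientation, a trimmed `FluxInstance` sketch (placeholder values) showing where `sync.path` lives, which is the field I got wrong:

```yaml
apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux
  namespace: flux-system
spec:
  distribution:
    version: "2.x"
    registry: ghcr.io/fluxcd
  sync:
    kind: GitRepository
    url: https://github.com/<owner>/<repo>.git   # placeholder repo
    ref: refs/heads/main
    path: clusters/oci                           # placeholder; must point at the cluster's kustomize path
```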
I've set up a GitHub App and mostly followed the official guide; this was pretty straightforward.
Teleport is my preferred way to access the kubernetes api.
In its current state, teleport wants to set up a wildcard domain like `*.teleport.example.com` (could be disabled).
With OracleCloud managing the dns, this is not possible, as cert-manager is not able to do a `dns01` challenge against oracle dns.
I've now switched to Cloudflare (also to mitigate costs of a few cents).
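With Cloudflare managing the zone, the wildcard certificate can be solved via `dns01`; a minimal cert-manager sketch (issuer name, email and secret name are placeholders):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns           # placeholder issuer name
spec:
  acme:
    email: me@example.com         # placeholder
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-dns-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token   # placeholder secret with a scoped Cloudflare API token
              key: api-token
```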
The Teleport <-> K8s Role (`k8s/system:masters`) is created by the teleport operator (see the `fluxcd-addons/Kustomization`). The SSO setup is created with fluxcd.
I've removed local users in teleport and am using SSO with GitHub as IdP.
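The connector itself is a CR handled by the teleport-operator; a rough sketch (client id/secret, org and team are placeholders; the connector name is what `tsh login --auth=...` refers to, and the docs linked below cover sourcing the client secret externally):

```yaml
apiVersion: resources.teleport.dev/v3
kind: TeleportGithubConnector
metadata:
  name: github              # placeholder connector name
  namespace: teleport
spec:
  client_id: <github-oauth-app-id>          # placeholder
  client_secret: <github-oauth-app-secret>  # placeholder; see the external client secret docs below
  redirect_url: https://teleport.example.com/v1/webapi/github/callback   # placeholder proxy host
  teams_to_roles:
    - organization: <github-org>            # placeholder
      team: <github-team>
      roles: ["access"]
```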
This might still be useful for local setups not using SSO:
The login process must be reset for each user, so that password and 2FA can be configured by each user in the WebUI.
The user can be created via the teleport-operator by creating a `TeleportUser` in kubernetes.
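A minimal `TeleportUser` sketch (the role list is a placeholder):

```yaml
apiVersion: resources.teleport.dev/v2
kind: TeleportUser
metadata:
  name: nce
  namespace: teleport
spec:
  roles: ["access"]   # placeholder roles
```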
```sh
# reset the user once
❯ k --kubeconfig ~/.kube/oci.kubeconfig exec -n teleport -ti deployment/teleport-cluster-auth -- tctl users reset nce

# login to teleport
❯ tsh login --proxy teleport.nce.wtf:443 --auth=local --user nce teleport.nce.wtf
```
With SSO there's no user management in teleport, so no reset or 2FA setup is needed:
```sh
❯ tsh login --proxy teleport.nce.wtf:443 --auth=github-acme --user nce teleport.nce.wtf

# login to the k8s cluster
❯ tsh kube login oci

# test
❯ k get po -n teleport
```
Warning
Todo: write about the svc/ingress annotations of the security groups
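Until that's written up, a rough sketch of the kind of Service annotations involved (values and the NSG OCID are placeholders; see the LB annotation docs linked below):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
  annotations:
    oci.oraclecloud.com/load-balancer-type: "lb"                       # layer 7 flexible LB ("nlb" for the teleport svc)
    service.beta.kubernetes.io/oci-load-balancer-shape: "flexible"
    service.beta.kubernetes.io/oci-load-balancer-shape-flex-min: "10"  # Mbps, stays within the free shape
    service.beta.kubernetes.io/oci-load-balancer-shape-flex-max: "10"
    oci.oraclecloud.com/oci-network-security-groups: "ocid1.networksecuritygroup.oc1.xxx"  # placeholder NSG OCID
spec:
  type: LoadBalancer
  ports:
    - name: https
      port: 443
      targetPort: https
```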
A collection of relevant upstream documentation for reference
- LB Annotation for oracle cloud
- Providing OCI-IDs to Helm Releases on nginx
- teleport-operator
- Teleport User/Roles RBAC
- Mapping to teleport role
- SSO with GithubConnector and External Client Secret
- Helm Chart Deploy Infos & Helm Chart ref
I recommend only upgrading to the version the first command (`available-kubernetes-upgrades`) shows.
Other upgrades, or jumps to the latest version not being shown, might break the process.
The K8s skew policy allows the worker nodes (`kubelets`) to be three minor versions behind, so you might be alright if you incrementally update the controlplane before updating the nodepool.
The commands should be executed inside `terraform/infra/`:
```sh
# get new cluster versions
❯ oci ce cluster get --cluster-id $(terraform output --raw k8s_cluster_id) | jq -r '.data."available-kubernetes-upgrades"'

# update the cluster version with the information from above
❯ sed -i '' 's/default = "'$(terraform output --raw kubernetes_version)'"/default = "v1.31.1"/' _variables.tf

# upgrade the controlplane and the nodepool & images
# this shouldn't roll the nodes and might take around 10mins
❯ terraform apply
```
To roll the nodes, I cordon & drain the k8s node:
```sh
❯ k drain <k8s-node-name> --force --ignore-daemonsets --delete-emptydir-data
❯ k cordon <k8s-node-name>
```
A node deletion in k8s doesn't trigger a change in the `nodepool`. For that, we need to terminate the correct instance. But I haven't figured out how to delete the (currently cordoned) node using only `oci`.
So, log in to the webui -> OKE Cluster -> Node pool, check for the right instance by looking at the private_ip, and copy the id.
Now terminate that instance:
```sh
❯ oci compute instance terminate --force --instance-id <ocid.id>
```
This triggers a node recreation. Now wait till the node is Ready, and then wait for longhorn to sync the volumes.
```sh
# wait until all volumes are healthy again
❯ k get -w volumes.longhorn.io -A
```
Repeat the cordon/drain/terminate for the second node.
For the current update, I've written the upgrade instructions above. It worked flawlessly, though still with a bit of manual interaction in the webui...
I mostly skipped `1.27.2` & `1.28.2` (on the workers) and went for the `1.29` release. As the UI didn't prompt for a direct upgrade path of the control-plane, I upgraded the k8s-tf version to the prompted next release, ran the upgrade, and continued with the next version.
The worker nodes remained at `1.26.7` during the oke upgrade, which worked because with `1.28` the new skew policy allows for worker nodes to be three versions behind.
`PSP`s first
- Upgrade the nodepool & cluster version by setting the k8s variable; Run terraform (takes ~10min)
- Drain/Cordon worker01
- Go to the UI; delete the worker01 from the nodepool
- Scale the Nodepool back to 2 (takes ~10min)
- Wait for longhorn to sync (no volume in state `degraded`); repeat for the second node (steps 2-5)
The 1.23.4 -> 1.24.1 Kubernetes upgrade went pretty smoothly, but was done by hand.
I followed the official guide:
- https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengupgradingk8smasternode.htm
- https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengupgradingk8sworkernode.htm
Longhorn synced all volumes after the new node got ready. No downtime experienced.