Scrapyard Cluster

Documentation - targeted level of detail

Category Level ↗️
Normal use 🚅🚅🚅 ⬛⬛
Lifespan 🧓🧓🧓 ⬛⬛
Current status 🔬🔬⬛⬛⬛
Maintenance 🛠️🛠️🛠️🛠️🛠️
Repair 🚧🚧🚧🚧🚧
Troubleshooting 🤔🤔🤔🤔🤔

Scrapyard is a Kubernetes cluster that runs on a set of machines on the "scrapyard" network. It is made up of a single control plane and ~6 additional nodes. The control plane is accessible via SSH through Tailscale at jukebox. The hardware router is also accessible via SSH through Tailscale.

Basic Usage

Physical Machines

Jukebox - Control Plane

The control plane, jukebox, runs nfs shares through zfs. These are used extensively by the nodes for persistent storage. Local storage on the nodes is used whenever the data being stored isn't important to keep long-term, either using the rancher local-path provisioner or Longhorn.

Nodes

The cluster is made up of ~6 additional nodes:

  • node1nuc
  • node2mbp - retired
  • node3satellite
  • node4probook
  • node5framework
  • node6acer

Cluster Components

The cluster's main components are istio, cilium, and cert-manager. The kubectl manifests that build the system are available in the scrapmetal manifests github repo.

Istio

Istio manages access to the pods by routing traffic through the ingress gateway. The gateway terminates TLS, so all traffic beyond the gateway is unencrypted. mTLS is disabled.

Cilium

Cilium handles the low level network traffic. It's the "CNI" for the cluster. Due to running in L2 mode, the network cannot expose ipv6 addresses.

Cert-manager

Cert-manager handles TLS certificates for the "origin server". It requests certs from letsencrypt for the domains it serves, and it provides them to traffic that is inbound from cloudflare.

Networking

The Scrapyard cluster can serve subdomains of multiple domains by way of Cloudflare. Currently it is only serving coffee-anon.com subdomains. It serves HTTPS by using a combination of cert-manager for k8s, Let's Encrypt for trusted certs, and a connection to the Cloudflare API (with a token) to complete the DNS challenge.
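
As a rough illustration of the DNS challenge setup described above, the sketch below prints a cert-manager ClusterIssuer that uses a Cloudflare API token. The issuer name, contact email, and Secret name and key are placeholders invented for this example; the real values live in the manifests repo and may differ.

```shell
# Sketch: print a cert-manager ClusterIssuer for Let's Encrypt that solves
# the DNS-01 challenge through Cloudflare. All arguments are hypothetical
# placeholders, not values taken from the actual cluster manifests.
print_cluster_issuer() {
  issuer_name="$1"   # assumed issuer name, e.g. letsencrypt-cloudflare
  email="$2"         # assumed ACME account contact email
  secret_name="$3"   # assumed Secret that holds the Cloudflare API token
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: ${issuer_name}
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ${email}
    privateKeySecretRef:
      name: ${issuer_name}-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: ${secret_name}
              key: api-token
EOF
}
```

The output can be piped into kubectl, for example: print_cluster_issuer letsencrypt-cloudflare admin@example.com cloudflare-api-token | kubectl apply -f -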

Docker containers

Media Server Containers

Media server containers are still extensively used in the Scrapyard cluster. The most notable of these is Plex, which, as of 2023, must run in a Docker container due to its reliance on Nvidia hardware for transcoding.

Other components of the media server such as Sonarr, Radarr, Prowlarr, and Transmission could run on the cluster, but they are kept on Docker for stability as we continue to learn Kubernetes.

Performance

For reasons not yet understood, performance is much better in the Docker containers than on the other nodes. This is an area of ongoing investigation.

TLS Management

TLS for the Docker containers is still managed by Istio. Istio will terminate TLS and then route the traffic to an external (to the cluster) service on the LAN (the Docker container).

Lifespan

Current lifespan of the cluster (the point at which there's roughly a 50% chance something has gone wrong) is about 6 months.

The most likely causes of failure are:

  • hardware issues
  • network failures
  • recent changes to the cluster

The lifespan could be improved by:

  1. Making the cluster high availability by way of two additional control planes
  2. Switching to exclusively use gigabit wired connections
  3. Reducing resource utilization during regular use

Current status

Note

Last updated 2024-3-11

Currently the cluster is doing a few things:

  1. Managing access to the media server services
  2. Serving the Python Flask boilerplate application
  3. Managing dev environments with Coder
  4. Hosting LLM tools using ollama and open webui
  5. Hosting an under-construction personal blog
  6. Hosting the Cartographer server

Maintenance

This section lists the items to maintain and a prioritized list of maintenance tasks: how often to do them, how to tell if they need doing, and what will happen if they're not done.

Things to maintain

  1. The hardware of node machines themselves, including the control plane
  2. The non-cluster software running on the nodes and control-plane
  3. The router firmware
  4. Software components of the cluster (istio, cilium, metallb, cert-manager, kubeadm, kubectl)
  5. SSH access keys
  6. Github repos
  7. system secrets

Maintenance tasks

every 1 month
  1. update software packages on ubuntu server OS for the nodes and control plane (one at a time)
  2. check node logs (via k9s) for suspicious failures on any of the nodes themselves
  3. check zfs for the status of the disks being used for nfs
every 3 months
  1. check for new patch versions of cluster components (istio, cilium, cert-manager, kubeadm, kubectl)
  2. run kube-bench and kube-hunter to check for vulnerabilities
  3. ~~changes to the kubelet config can be done via the /var/lib/kubelet/config.yaml file. Current docs for this are here. e.g. to disable debugging handlers to fix a kube-hunter issue, add enableDebuggingHandlers: false. Then restart kubelet~~
    1. (this didn't work and I don't know why. I think it's 90% correct though)
every 12 months
  1. generate new ssh access keys, system secrets
  2. update router firmware
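
The monthly zfs check above can be partly scripted. The sketch below relies on the fact that zpool status -x prints "all pools are healthy" when no pool needs attention; the function names are made up for this example.

```shell
# Sketch: interpret the output of `zpool status -x` for the monthly disk check.
# Function names are hypothetical helpers, not part of any existing tooling.

# Succeeds only when the report is the "nothing to see" message from zpool.
zpool_report_is_healthy() {
  [ "$1" = "all pools are healthy" ]
}

# Runs the check on the current machine (intended for jukebox).
monthly_zfs_check() {
  report="$(zpool status -x 2>&1)"
  if zpool_report_is_healthy "$report"; then
    echo "zfs: ok"
  else
    echo "zfs: needs attention"
    echo "$report"
    return 1
  fi
}
```

Running monthly_zfs_check on jukebox prints either a one-line OK or the full zpool report for follow-up.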

Repair

How to fix everything I know of.

Complete tear-down and rebuild almost always takes a couple of days, because the notes aren't perfect.

Cluster tear-down

To remove a single node:

  1. reset kubernetes
sudo kubeadm reset --cri-socket unix:///run/containerd/containerd.sock
  2. delete local config files
sudo rm -rf /etc/cni/net.d && \
sudo rm -rf $HOME/.kube/config
  3. remove the node from the control plane by deleting the node resource
kubectl delete node node-name

If tearing down the entire cluster, just reset kubeadm and remove the config files on all nodes.

Cluster setup / rebuild

Set up the host machines for the nodes

On both the control plane and the workers:

Install pre-requisites

For the control plane, install containerd and docker. For the worker nodes, install containerd and set up a basic config.

# control plane
sudo apt remove -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin && \
sudo sysctl --system && \
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# worker nodes
sudo apt remove -y containerd.io && \
sudo sysctl --system && \
sudo apt install -y containerd.io

Set up container config

For the control plane and worker nodes, set the cgroup flag on the runc options in /etc/containerd/config.toml to SystemdCgroup = true.

sudo su -
rm -rf /etc/containerd && \
mkdir -p /etc/containerd && \
containerd config default>/etc/containerd/config.toml
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/g' /etc/containerd/config.toml
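
To confirm that the sed edit above took effect before restarting containerd, a small check along these lines may help. The function name is hypothetical; the default path is the one used above.

```shell
# Sketch: verify that a containerd config has SystemdCgroup enabled.
# Returns 0 only if a "SystemdCgroup = true" line exists and no
# "SystemdCgroup = false" line remains. The function name is hypothetical.
systemd_cgroup_enabled() {
  config="${1:-/etc/containerd/config.toml}"
  grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$config" &&
    ! grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*false' "$config"
}
```

Example: systemd_cgroup_enabled && echo "cgroup driver ok"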
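
Set up control plane gfx hardware

Since the control plane also has gfx hardware, install the Nvidia drivers and set the default containerd runtime. The command recorded below only repeats the cgroup change from the previous step; the command that sets the default runtime is not captured in these notes.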

sed -i 's/SystemdCgroup = false/SystemdCgroup = true/g' /etc/containerd/config.toml

Reset containerd

Restart containerd so that it picks up the new config, and enable it so that it starts on boot.

sudo systemctl restart containerd && sudo systemctl enable containerd && systemctl status containerd

Hostfile and static pods

I believe it's still required to manually add a hosts file entry (/etc/hosts) for the control plane.
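
A minimal sketch of adding that entry idempotently is shown below. The function name is hypothetical, and pairing the control plane endpoint name with a particular IP address is an assumption to verify against the actual network.

```shell
# Sketch: add "IP NAME" to a hosts file only if NAME is not already mapped.
# ensure_hosts_entry is a hypothetical helper; the file defaults to /etc/hosts.
ensure_hosts_entry() {
  ip="$1"
  name="$2"
  hosts_file="${3:-/etc/hosts}"
  # Look for NAME as a whole hostname field on any non-comment line.
  if awk -v n="$name" '
      /^[[:space:]]*#/ { next }
      { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
      END { exit found ? 0 : 1 }' "$hosts_file"; then
    return 0
  fi
  printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
}
```

Run as root on each node, for example: ensure_hosts_entry 192.168.1.157 k8s.coffee-anon.com (assuming that address is the control plane). Running it twice leaves a single entry.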

Note

Static pods were previously used to attempt "High availability" for a multiple control plane setup. At the time, the additional traffic on the cluster for handling leader election caused issues with the network. There's a good chance that was related to a previous misconfiguration of the cilium device plugin. If the cluster needs to be completely torn down, it might be worth trying again. The static pod manifests should still be present and can be copied with sudo cp /etc/kubernetes/manifests_backup_2023-10-31/* /etc/kubernetes/manifests/

Initialize first control plane

sudo kubeadm init \
    --control-plane-endpoint=k8s.coffee-anon.com:6443 \
    --pod-network-cidr=10.244.0.0/16 \
    --apiserver-cert-extra-sans=k8s.coffee-anon.com \
    --upload-certs \
    --cri-socket unix:///run/containerd/containerd.sock \
    --skip-phases=addon/kube-proxy \
    --token=abcdef.0123456789abcdef

Add additional control planes

sudo cp /etc/kubernetes/manifests_backup_2023-10-31/* /etc/kubernetes/manifests/
sudo kubeadm join k8s.coffee-anon.com:6443 \
    --token abcdef.0123456789abcdef \
    --control-plane \
    --discovery-token-ca-cert-hash sha256:1234567890123456789012345678901234567890 \
    --certificate-key abcdef1234567890abcdef1234567890abcdef1234567890 \
    --cri-socket unix:///run/containerd/containerd.sock

Add workers

sudo kubeadm join k8s.coffee-anon.com:6443 \
    --token abcdef.0123456789abcdef \
    --discovery-token-ca-cert-hash sha256:fc2c4d5e6a97bcfff10deb801cc32b746362bb23f534aad3af7b5a89ff50260b \
    --cri-socket unix:///run/containerd/containerd.sock
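
Because the join commands are easy to mistype when copied from notes, a quick format check before joining can save a failed attempt. The sketch below checks only the shape of the values: kubeadm bootstrap tokens are six characters, a dot, and sixteen characters, and the CA cert hash is a SHA-256 digest (64 hex characters). The function names are hypothetical.

```shell
# Sketch: sanity-check kubeadm join arguments before running the join.
# These helpers validate format only; they cannot tell whether a token
# is still valid on the control plane.

# Bootstrap tokens have the form [a-z0-9]{6}.[a-z0-9]{16}.
valid_bootstrap_token() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}

# The CA cert hash is "sha256:" followed by 64 lowercase hex characters.
valid_ca_cert_hash() {
  printf '%s\n' "$1" | grep -Eq '^sha256:[0-9a-f]{64}$'
}
```

If a token has expired, a fresh join command can be printed on the control plane with kubeadm token create --print-join-command; the --cri-socket flag still needs to be appended by hand.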

Note

Add all nodes before installing cilium.

As part of the install, be sure to set up the kube config file on the control plane, and copy it to any workstation that needs to interact with the k8s cluster directly.

Install Cilium

Cilium is installed using the CLI tool. I believe there's a dependency on having the helm client installed as well.

Hubble is not a requirement for the cluster, but it can be useful and doesn't use many resources.

Note

If using HA, use the kube-vip ip address. Otherwise, use the control plane ip address.

API_SERVER_IP=192.168.1.157
API_SERVER_PORT=6443
QPS=50
BURST=75
LEASE_DURATION="20s"
RENEW_DEADLINE="7s"
RETRY_PERIOD="3s"
cilium install \
    -n kube-system \
    --helm-set kubeProxyReplacement=true \
    --helm-set ipv4NativeRoutingCIDR="10.244.0.0/16" \
    --helm-set k8sServiceHost=${API_SERVER_IP} \
    --helm-set k8sServicePort=${API_SERVER_PORT} \
    --helm-set k8s.requireIPv4PodCIDR=true \
    --helm-set hostServices.enabled=false \
    --helm-set externalIPs.enabled=true \
    --helm-set nodePort.enabled=true \
    --helm-set hostPort.enabled=true \
    --helm-set image.pullPolicy=IfNotPresent \
    --helm-set ipam.mode=kubernetes \
    --helm-set ipv4.enabled=true \
    --helm-set ipv6.enabled=false \
    --helm-set l2announcements.enabled=true \
    --helm-set l2NeighDiscovery.enabled=true \
    --helm-set k8sClientRateLimit.burst=${BURST} \
    --helm-set k8sClientRateLimit.qps=${QPS} \
    --helm-set l2announcements.leaseRenewDeadline=${RENEW_DEADLINE} \
    --helm-set l2announcements.leaseRetryPeriod=${RETRY_PERIOD} \
    --helm-set l2announcements.leaseDuration=${LEASE_DURATION} \
    --helm-set socketLB.hostNamespaceOnly=true \
    --helm-set cni.exclusive=false \
    --helm-set devices="en+ wl+" \
    --helm-set envoy.enabled=false \
    --set prometheus.enabled=true \
    --set operator.prometheus.enabled=true

Install Istio

Istio is similarly installed using the CLI tools. Currently no additional configuration is required.

kubectl create namespace istio-system
istioctl install

Next steps

Deploy contents of 01_scrapyard_cluster_essentials, install cert-manager, set up private CA, and proceed with the rest of the setup.

Complete re-deploy of bytebase

Steps for tear-down

  1. Delete the app resources
  2. Delete the postgres CR called postgres-bytebase

Steps to re-create

To re-deploy with the same database passwords, the secrets holding the login info for the created users must be generated manually, in the same format as the secrets that the generator would add automatically.

The operator both deletes and accesses these secrets by name. There's no way to persist them automatically, so unless they are deployed manually when re-deploying bytebase, the operator will create a new set of passwords and the system WILL get an access denied error.

Misc other repairs

  1. One of the nodes keeps getting marked as non-responsive due to lack of resources
    1. Use a kubelet-config.yaml to reserve more CPU or RAM for the host machine. SSH into the node and run a kubeadm join command, passing it the new kubelet configuration.
  2. One of the pods isn't starting
    1. Go to the details of the pod and check what it's missing. It's usually either missing resources, a configmap, or a secret, or it cannot find a match for its persistent volume claim.
  3. Set up a new machine as an additional node
    1. Get the required packages
      1. Add the kubernetes and docker repos to the apt-get list with their gpg key(?)
      2. Install kubelet, kubectl, kubeadm, vim, git, wget, curl
      3. Disable swap space (comment out the swap line in /etc/fstab)
      4. Install containerd runtime
      5. Install nfs-common on the machine so that it can access nfs shared PVCs
    2. Connect to the cluster with the cluster join command (see "Add workers")
    3. These steps are explained in the guide saved in the file Install Kubernetes Cluster on Ubuntu 22.04 using kubeadm.pdf in dropbox
  4. Set up a new local persistent volume
    1. Make sure that the local-storage StorageClass is loaded
    2. An example of how to add a local storage PV is in minio.
    3. Be sure to either use an existing node label for selecting a node, or create one (or maybe don't care, depending on what the data is)
  5. Adjust the amount of memory available for k8s on a node
    1. Edit the kubeadm flags on the node
      1. Open the command file with sudo nano /var/lib/kubelet/kubeadm-flags.env
      2. Add the arg --system-reserved=cpu=300m,memory=200Mi (or whatever you want to reserve for other processes)
      3. Restart the service with sudo systemctl restart kubelet
  6. node5framework is getting evicted
    1. Check that test-bucket is set with a lifecycle policy to evict things. It may have filled up the SSD.
  7. Create a new postgres database using a persistent volume and connect it to a deployment in a different namespace. (There's an example of this with bytebase)
    1. Create a persistent volume using the nfs storage class. Add it to the folder currently marked 02_scrapyard_cluster_enhancements/06_volume_manifests/persistent_volumes_nfs
    2. Create a postgres "kind" yaml with the volume storageclass, and use an app label for a selector. (Others could be used, but this is a nice convention)
    3. Apply the yaml. The user and db sections seem to take a while to be applied, which is probably operator related. Always use the credentials in the secrets created by the operator.
    4. To connect to the postgres instance, either using something like sqlalchemy or a simple psql command, the path will need to look something like this: postgresql://username:password@HOST:5432/bytebase?sslmode=require where HOST includes the namespace of the postgres instance (in this case, and all current cases, postgres; change that to a different namespace if needed). The sslmode=require is not optional.
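
The steps above for adjusting the memory available to k8s (edit kubeadm-flags.env, add --system-reserved, restart kubelet) can be sketched as a small helper. It assumes the file holds a single KUBELET_KUBEADM_ARGS="..." line, which is how kubeadm normally writes it; the function name is hypothetical.

```shell
# Sketch: append --system-reserved to KUBELET_KUBEADM_ARGS in a kubeadm
# flags file. Does nothing if the flag is already present.
# add_system_reserved is a hypothetical helper name.
add_system_reserved() {
  flags_file="$1"   # normally /var/lib/kubelet/kubeadm-flags.env
  reserved="$2"     # e.g. cpu=300m,memory=200Mi
  if grep -q -- '--system-reserved=' "$flags_file"; then
    return 0
  fi
  # Insert the flag just before the closing quote of the args string.
  sed "s|^\(KUBELET_KUBEADM_ARGS=\".*\)\"\$|\1 --system-reserved=${reserved}\"|" \
    "$flags_file" > "${flags_file}.tmp" && mv "${flags_file}.tmp" "$flags_file"
}
```

On a node this would be run as root against /var/lib/kubelet/kubeadm-flags.env, followed by sudo systemctl restart kubelet.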

Troubleshooting

  • router tools
    • access via ssh once sshed into jukebox
    • web interface
    • ssh tools for viewing traffic etc.
  • k8s tools
    • cilium hubble
    • kiali
  • cloudflare
    • only lets https traffic through with proxy

Jukebox gets really slow

Run a speed test with speedtest-cli on jukebox, and compare it to the results from the OpenWrt router.

(project specific) the pipeline starts failing

minio isn't working

Check k9s: was the minio system reset recently? Is the PV/PVC mounted properly? Is the SSD node running? If it was just restarted but everything is fine, it might just need to pull a fresh version of the object data from postgres (this is a bit silly and should be made better).