Dynamic Cloud Worker Nodes for On-Premises Kubernetes

One of the first things I wanted to do with my Kubernetes cluster at home was start using it for Jenkins jobs. With the Kubernetes plugin, Jenkins can create ephemeral Kubernetes pods to use as agents to execute builds. Migrating all of my jobs to use this mechanism would allow me to get rid of the static agents running on VMs and Raspberry Pis.

Getting the plugin installed and configured was relatively straightforward, and defining pod templates for CI pipelines was simple enough. It did not take long to migrate the majority of the jobs that can run on x86_64 machines. The aarch64 jobs, though, needed some more attention.

It's no secret that Raspberry Pis are slow. They are fine for very light use, or for dedicated single-application purposes, but trying to compile code, especially Rust, on one is a nightmare. So, while I was redoing my Jenkins jobs, I took the opportunity to try to find a better, faster solution.

Jenkins has an Amazon EC2 plugin, which dynamically launches EC2 instances to execute builds and terminates them when they are no longer needed. We use this plugin at work, and it is a decent solution. I could configure Jenkins to launch Graviton instances to build aarch64 code. Unfortunately, I would either need to pre-create AMIs with all of the necessary build dependencies and run the jobs directly on the worker nodes, or use the Docker Pipeline plugin to run them in Docker containers. What I really wanted, though, was to be able to use Kubernetes for all of the jobs, so I set out to find a way to dynamically add cloud machines to my local Kubernetes cluster.

The Cluster Autoscaler is a component for Kubernetes that integrates with cloud providers to automatically launch and terminate instances in response to demand in the Kubernetes cluster. That is all it does, though; it does not integrate with the Kubernetes API to perform TLS bootstrapping or register the node in the cluster. In the Autoscaler FAQ, it hints at how to handle this limitation, though:

Example: If you use kubeadm to provision your cluster, it is up to you to automatically execute kubeadm join at boot time via some script.

With that in mind, I set out to build a solution that uses the Cluster Autoscaler, WireGuard, and kubeadm to automatically provision nodes in the cloud to run Jenkins jobs on pods created by the Jenkins Kubernetes plugin.

Process

Sequence Diagram

  1. When Jenkins starts running a job that is configured to run in a Kubernetes Pod, it uses the job's pod template to create the Pod resource. It also creates an agent entry in Jenkins and waits for the JNLP agent in the pod to attach itself to that entry.
  2. Kubernetes attempts to schedule the pod Jenkins created. If there is not a node available, the scheduling fails.
  3. The Cluster Autoscaler detects that scheduling the pod failed. It checks the requirements for the pod, matches them to an EC2 Autoscaling Group, and determines that scheduling would succeed if it increased the capacity of the group.
  4. The Cluster Autoscaler increases the desired capacity of the EC2 Autoscaling Group, launching a new EC2 instance.
  5. Amazon EventBridge sends a notification, via Amazon Simple Notification Service, to the provisioning service, indicating that a new EC2 instance has started.
  6. The provisioning service generates a kubeadm bootstrap token for the new instance and stores it as a Secret resource in Kubernetes.
  7. The provisioning service looks for an available Secret resource in Kubernetes containing WireGuard configuration and marks it as assigned to the new EC2 instance.
  8. The EC2 instance, via a script executed by cloud-init, fetches the WireGuard configuration assigned to it from the provisioning service.
  9. The provisioning service searches for the Secret resource in Kubernetes containing the WireGuard configuration assigned to the EC2 instance and returns it in the HTTP response.
  10. The cloud-init script on the EC2 instance uses the returned WireGuard configuration to configure a WireGuard interface and connect to the VPN.
  11. The cloud-init script on the EC2 instance generates a JoinConfiguration document with cluster discovery configuration pointing to the provisioning service and passes it to kubeadm join.
  12. The provisioning service looks up the Secret resource in Kubernetes containing the bootstrap token assigned to the EC2 instance and generates a kubeconfig file containing the cluster configuration information and that token. The kubeconfig file is returned in the HTTP response.
  13. kubeadm join, running on the EC2 instance, communicates with the Kubernetes API server, over the WireGuard tunnel, to perform TLS bootstrapping and configure the Kubelet as a worker node in the cluster.
  14. When the Kubelet on the new EC2 instance is ready, Kubernetes detects that the pod created by Jenkins can now be scheduled to run on it and instructs the Kubelet to start the containers in the pod.
  15. The Kubelet on the new EC2 instance starts the pod's containers. The JNLP agent, running as one of the containers in the pod, connects to the Jenkins controller.
  16. Jenkins assigns the job run to the new agent, which executes the job.
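As a sketch of step 11, the JoinConfiguration generated by the cloud-init script might look something like the following. The file paths and node label are illustrative choices, not part of the actual implementation:

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
discovery:
  file:
    # Kubeconfig fetched from the provisioning service (step 12);
    # the path is an assumption made for this example.
    kubeConfigPath: /etc/kubeadm/discovery.kubeconfig
nodeRegistration:
  kubeletExtraArgs:
    # Hypothetical label so that only Jenkins build pods target these nodes
    node-labels: "jenkins-worker=true"
```

Passing this file to `kubeadm join --config` lets the node discover the cluster through the provisioning service instead of needing direct access to the API server's discovery endpoints before the VPN is configured.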

Components

Jenkins Kubernetes Plugin

The Kubernetes plugin for Jenkins is responsible for dynamically creating Kubernetes pods from templates associated with pipeline jobs. Jobs provide a pod template that describes the containers and configuration they require in order to run. Jenkins creates the corresponding resources using the Kubernetes API.
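For example, a declarative pipeline can embed its pod template inline. The image, node selector, and build command here are placeholders; the `nodeSelector` is what steers aarch64 builds onto the dynamically provisioned Graviton nodes:

```groovy
pipeline {
    agent {
        kubernetes {
            yaml '''
                apiVersion: v1
                kind: Pod
                spec:
                  nodeSelector:
                    kubernetes.io/arch: arm64
                  containers:
                  - name: rust
                    image: rust:latest
                    command: ["sleep"]
                    args: ["infinity"]
            '''
        }
    }
    stages {
        stage('Build') {
            steps {
                container('rust') {
                    sh 'cargo build --release'
                }
            }
        }
    }
}
```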

Autoscaler

The Cluster Autoscaler is an optional Kubernetes component that integrates with cloud provider APIs to create or destroy worker nodes. It does not handle any configuration on the machines themselves (i.e. running kubeadm join), but it does watch the cluster state and determine when to create or destroy new nodes based on pod requests.
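With the AWS provider, the Autoscaler is pointed at an EC2 Auto Scaling Group with minimum and maximum node counts. A fragment of its container arguments might look like this (the group name and bounds are illustrative):

```yaml
# Fragment of the cluster-autoscaler Deployment's container spec
command:
- ./cluster-autoscaler
- --cloud-provider=aws
# min:max:asg-name — scale the group between 0 and 3 instances
- --nodes=0:3:jenkins-workers-asg
# Scale down reasonably quickly once build pods finish
- --scale-down-unneeded-time=5m
```

Setting the minimum to 0 is what makes this cost-effective: no cloud instances run at all when no aarch64 jobs are queued.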

cloud-init

cloud-init is a tool that comes pre-installed on most cloud machine images (including the official Fedora AMIs) that can be used to automatically provision machines when they are first launched. It can install packages, create configuration files, run commands, etc.
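A rough sketch of the user-data for the worker instances, assuming a hypothetical provisioning-service hostname and endpoint paths:

```yaml
#cloud-config
packages:
  - wireguard-tools
write_files:
  - path: /usr/local/bin/join-cluster.sh
    permissions: "0755"
    content: |
      #!/bin/bash
      set -euo pipefail
      # Fetch the WireGuard configuration assigned to this instance
      curl -fsSL https://provisioner.example.com/wireguard \
          > /etc/wireguard/wg0.conf
      systemctl enable --now wg-quick@wg0
      # Fetch the discovery kubeconfig and join the cluster
      curl -fsSL https://provisioner.example.com/discovery \
          > /etc/kubeadm/discovery.kubeconfig
      kubeadm join --config /etc/kubeadm/join.yaml
runcmd:
  - [/usr/local/bin/join-cluster.sh]
```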

WireGuard

WireGuard is a simple and high-performance VPN protocol. It will provide the cloud instances with connectivity back to the private network, and therefore access to internal resources including the Kubernetes API.

Unfortunately, WireGuard is not particularly amenable to "dynamic" clients (i.e. peers that come and go). This means either building custom tooling to configure WireGuard peers on the fly, or pre-generating configuration for a fixed number of peers and ensuring that no more than that number of instances are ever online simultaneously.
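Each pre-generated peer configuration handed out by the provisioning service would follow the standard wg-quick format; the keys, addresses, and endpoint below are placeholders:

```ini
[Interface]
# Private key pre-generated for this peer slot
PrivateKey = <peer-private-key>
# Fixed VPN address assigned to this slot
Address = 10.100.0.10/32

[Peer]
PublicKey = <server-public-key>
Endpoint = vpn.example.com:51820
# Route the home network's private range through the tunnel
AllowedIPs = 10.0.0.0/16
# Keep the tunnel open through NAT
PersistentKeepalive = 25
```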

Provisioning Service

This is a custom piece of software that is responsible for provisioning secrets, tokens, and WireGuard configuration for the dynamic nodes. Since new instances must contact it before the VPN tunnel is up, it will have to be accessible directly over the Internet. It will therefore have to authenticate requests somehow, to ensure that they come from authorized clients (i.e. EC2 nodes created by the Autoscaler) before handing out any keys or tokens.
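One small piece of this service is generating bootstrap tokens in the format kubeadm expects: a 6-character ID and a 16-character secret, both lowercase alphanumeric, joined by a dot. A minimal Python sketch (the function name is my own; this is not taken from any existing implementation):

```python
import secrets
import string

# kubeadm bootstrap tokens use only lowercase letters and digits
ALPHABET = string.ascii_lowercase + string.digits


def generate_bootstrap_token() -> str:
    """Generate a token of the form [a-z0-9]{6}.[a-z0-9]{16}.

    The ID portion is public (it names the Secret in the kube-system
    namespace); the secret portion must be kept confidential.
    """
    token_id = "".join(secrets.choice(ALPHABET) for _ in range(6))
    token_secret = "".join(secrets.choice(ALPHABET) for _ in range(16))
    return f"{token_id}.{token_secret}"
```

The service would store the generated token in a `bootstrap.kubernetes.io/token` Secret so the API server accepts it during TLS bootstrapping, then embed the same token in the kubeconfig returned to the instance.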