Speed Up Jenkins Startup Time in Kubernetes
I recently migrated my Jenkins server at home to run inside my Kubernetes
cluster. I am very happy with it overall; upgrades are a lot simpler, and
Longhorn volume snapshots make rolling back bad plugin updates a breeze. One
issue that troubled me for a while, though, was that it took a really long
time for the Jenkins server container to start. Kubernetes would list the pod
in `ContainerCreating` state for several minutes, and then in
`CreateContainerError` for a while, before finally starting the process. It
turns out this was because of the huge number of files in the Jenkins home
directory. When the container starts up, the container runtime has to go
through every file in the persistent volume and fix its permissions. My
Jenkins instance has over 1.5 million files, so scanning and modifying them all
takes a very long time.
I was finally able to fix this issue today, after messing with it for a week or so. There are two changes the container runtime has to make to every file in the persistent volume:
- The group ownership/GID
- The SELinux label
Fixing the first problem is straightforward: set
`securityContext.fsGroupChangePolicy` on the pod to `OnRootMismatch`.
Kubernetes will then check the GID of the root directory of the persistent
volume and, if it is correct, skip checking the rest of the files and
directories.
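For reference, this is a minimal sketch of the pod-level `securityContext`; the
`fsGroup` of 1000 is only a placeholder (use whatever group should own the
files in your volume), and the policy only comes into play when `fsGroup` is
set:

```yaml
spec:
  securityContext:
    fsGroup: 1000                        # placeholder; group that should own the volume's files
    fsGroupChangePolicy: OnRootMismatch  # skip the recursive chown if the root dir already matches
```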
The second problem was quite a bit trickier, but still fixable. It took me a
bit longer to get the solution right, but with the help of a CRI-O GitHub
issue, I finally managed. The key is to configure the container to have a
static SELinux context; by default, the container runtime assigns a random set
of MCS categories when the container starts. Naturally, this means the context
labels of all the files in the persistent volume have to be changed every time
to match the new categories. Fortunately, there is a
`securityContext.seLinuxOptions.level` setting on the pod/container for exactly
this. I looked at the categories of the running Jenkins process and set
`level` to that:
```
ps Z -p $(pgrep -f 'jenkins\.war')
LABEL                                          PID TTY  STAT TIME COMMAND
system_u:system_r:container_t:s0:c525,c600  196790 ?    Sl   0:50 java -Duser.home=/var/jenkins_home -Djenkins.model.Jenkins.slaveAgentPort=50000 -Dhudson.lifecycle=hudson.lifecycle.ExitLifecycle -jar /usr/share/jenkins/jenkins.war
```
The `level` field is the final part of the process's label: the sensitivity (`s0`) plus the MCS categories (`c525,c600`).
```yaml
spec:
  containers:
  - securityContext:
      seLinuxOptions:
        level: s0:c525,c600
```
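After redeploying with this in place, the same `ps Z` check should show the
Jenkins process running with the fixed `c525,c600` categories on every restart,
rather than a freshly generated pair.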
With this setting in place, the container will start with the same SELinux
context every time, so if the files are already labelled correctly, they do not
have to be changed. Unfortunately, by default, CRI-O still walks the whole
directory tree to make sure. It can be configured to skip that step, though,
similar to `fsGroupChangePolicy`. The pod needs a special annotation:
```yaml
metadata:
  annotations:
    io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: 'true'
```
CRI-O itself also has to be configured to respect that annotation. CRI-O's
configuration is not well documented, but I was able to determine that these
two lines need to be added to `/etc/crio/crio.conf`:
```toml
[crio.runtime.runtimes.runc]
allowed_annotations = ["io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel"]
```
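Two caveats worth noting: the `[crio.runtime.runtimes.runc]` table assumes runc
is the runtime handler in use, so if your nodes use a different handler (crun,
for example) the `allowed_annotations` line belongs under that runtime's table
instead; and CRI-O reads its configuration at startup, so the service has to be
restarted before the annotation takes effect.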
In summary, there were four steps to configure the container runtime not to scan and touch every file in the persistent volume when starting the Jenkins container (the sketch after the list pulls the Kubernetes-side settings together):

- Set `securityContext.fsGroupChangePolicy` to `OnRootMismatch`
- Set `securityContext.seLinuxOptions.level` to a static value
- Add the `io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel` annotation
- Configure CRI-O to respect said annotation
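Pulling the Kubernetes-side pieces into one place, the pod ends up looking
roughly like this; it is a sketch rather than my exact manifest, and the names,
image, `fsGroup`, and SELinux level are placeholders to substitute with your
own values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jenkins                              # placeholder name
  annotations:
    io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: 'true'
spec:
  securityContext:
    fsGroup: 1000                            # group that owns the volume's files
    fsGroupChangePolicy: OnRootMismatch      # only chown when the root dir is wrong
  containers:
  - name: jenkins
    image: jenkins/jenkins:lts               # placeholder image
    securityContext:
      seLinuxOptions:
        level: s0:c525,c600                  # static MCS level; use your own categories
    volumeMounts:
    - name: jenkins-home
      mountPath: /var/jenkins_home
  volumes:
  - name: jenkins-home
    persistentVolumeClaim:
      claimName: jenkins-home                # hypothetical PVC name
```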
After completing all four steps, the Jenkins container starts up in seconds instead of minutes.