Speed Up Jenkins Startup Time in Kubernetes
I recently migrated my Jenkins server at home to run inside my Kubernetes
cluster. I am very happy with it overall; upgrades are a lot simpler, and
Longhorn volume snapshots make rolling back bad plugin updates a breeze. One
issue that troubled me for a while, though, was that it took a really long
time for the Jenkins server container to start. Kubernetes would list the pod
in `ContainerCreating` state for several minutes, and then in
`CreateContainerError` for a while, before finally starting the process. It
turns out this was because of the huge number of files in the Jenkins home
directory. When the container starts up, the container runtime has to go
through every file in the persistent volume and fix its permissions. My
Jenkins instance has over 1.5 million files, so scanning and modifying them all
takes a very long time.
I was finally able to fix this issue today, after messing with it for a week or so. There are two changes the container runtime has to make to every file in the persistent volume:
- The group ownership/GID
- The SELinux label
Fixing the first problem is straightforward: set
`securityContext.fsGroupChangePolicy` on the pod to `OnRootMismatch`.
Kubernetes will then check the GID of the root directory of the persistent
volume and, if it is correct, skip checking the rest of the files and
directories.
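For reference, this is a minimal sketch of the pod-level `securityContext`; the
`fsGroup` of 1000 is only a placeholder (use whatever group should own the
files in your volume), and the policy only comes into play when `fsGroup` is
set:

```yaml
spec:
  securityContext:
    fsGroup: 1000                        # placeholder; group that should own the volume's files
    fsGroupChangePolicy: OnRootMismatch  # skip the recursive chown if the root dir already matches
```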
The second problem was quite a bit trickier, but still fixable. It took me a
bit longer to get the solution right, but with the help of a CRI-O GitHub
issue, I finally managed. The key is to configure the container to have a
static SELinux context; by default, the container runtime assigns a random set
of MCS categories when the container starts. Naturally, this means the context
labels of all the files in the persistent volume have to be changed every time
to match the new categories. Fortunately, there is a
`securityContext.seLinuxOptions.level` setting on the pod/container for exactly
this. I looked at the categories of the running Jenkins process and set
`level` to that:
```
ps Z -p $(pgrep -f 'jenkins\.war')
LABEL                                          PID TTY  STAT TIME COMMAND
system_u:system_r:container_t:s0:c525,c600  196790 ?    Sl   0:50 java -Duser.home=/var/jenkins_home -Djenkins.model.Jenkins.slaveAgentPort=50000 -Dhudson.lifecycle=hudson.lifecycle.ExitLifecycle -jar /usr/share/jenkins/jenkins.war
```
The `level` field is the final part of the process's label: the sensitivity (`s0`) plus the MCS categories (`c525,c600`).
```yaml
spec:
  containers:
  - securityContext:
      seLinuxOptions:
        level: s0:c525,c600
```
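After redeploying with this in place, the same `ps Z` check should show the
Jenkins process running with the fixed `c525,c600` categories on every restart,
rather than a freshly generated pair.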
With this setting in place, the container will start with the same SELinux
context every time, so if the files are already labelled correctly, they do not
have to be changed. Unfortunately, by default, CRI-O still walks the whole
directory tree to make sure. It can be configured to skip that step, though,
similar to `fsGroupChangePolicy`. The pod needs a special annotation:
```yaml
metadata:
  annotations:
    io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: 'true'
```
CRI-O itself also has to be configured to respect that annotation. CRI-O's
configuration is not well documented, but I was able to determine that these
two lines need to be added to `/etc/crio/crio.conf`:
```toml
[crio.runtime.runtimes.runc]
allowed_annotations = ["io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel"]
```
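Two caveats worth noting: the `[crio.runtime.runtimes.runc]` table assumes runc
is the runtime handler in use, so if your nodes use a different handler (crun,
for example) the `allowed_annotations` line belongs under that runtime's table
instead; and CRI-O reads its configuration at startup, so the service has to be
restarted before the annotation takes effect.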
In summary, there were four steps to configure the container runtime not to scan and touch every file in the persistent volume when starting the Jenkins container (the sketch after the list pulls the Kubernetes-side settings together):

- Set `securityContext.fsGroupChangePolicy` to `OnRootMismatch`
- Set `securityContext.seLinuxOptions.level` to a static value
- Add the `io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel` annotation
- Configure CRI-O to respect said annotation
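Pulling the Kubernetes-side pieces into one place, the pod ends up looking
roughly like this; it is a sketch rather than my exact manifest, and the names,
image, `fsGroup`, and SELinux level are placeholders to substitute with your
own values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jenkins                              # placeholder name
  annotations:
    io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: 'true'
spec:
  securityContext:
    fsGroup: 1000                            # group that owns the volume's files
    fsGroupChangePolicy: OnRootMismatch      # only chown when the root dir is wrong
  containers:
  - name: jenkins
    image: jenkins/jenkins:lts               # placeholder image
    securityContext:
      seLinuxOptions:
        level: s0:c525,c600                  # static MCS level; use your own categories
    volumeMounts:
    - name: jenkins-home
      mountPath: /var/jenkins_home
  volumes:
  - name: jenkins-home
    persistentVolumeClaim:
      claimName: jenkins-home                # hypothetical PVC name
```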
After completing all four steps, the Jenkins container starts up in seconds instead of minutes.