Let's Encrypt Certificates: DNS Blocked
The certs Jenkins job has been failing for a while, ever since I blocked
outbound DNS traffic to the Internet. The problem is that lego queries DNS
for each domain in the certificate request repeatedly until it sees the
_acme-challenge TXT record it created. With DNS traffic blocked, it can never
reach the configured DNS servers (formerly Cloudflare, now Quad9), so it just
waits until its timeout expires.
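The waiting behaviour can be approximated with a small shell loop. This is
only a sketch of the idea, not lego's actual implementation: the function
names, the LOOKUP hook, and the default timeout and interval are made up for
illustration, and it assumes dig is installed.

```shell
#!/bin/sh
# Sketch: poll for a TXT record until it appears or a timeout expires.
# LOOKUP is a hypothetical hook naming the command that fetches TXT
# records; by default it asks Quad9 (9.9.9.9) using dig.

lookup_txt() {
    # +time/+tries keep each attempt short when DNS is blocked.
    dig +short +time=2 +tries=1 TXT "$1" @9.9.9.9
}

wait_for_txt() {
    name=$1 expected=$2 timeout=${3:-60} interval=${4:-2}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # dig prints TXT data wrapped in double quotes.
        if "${LOOKUP:-lookup_txt}" "$name" 2>/dev/null \
                | grep -qxF "\"$expected\""; then
            echo "found after ${elapsed}s"
            return 0
        fi
        sleep "$interval"
        # Only the sleep time is counted, so this is approximate.
        elapsed=$((elapsed + interval))
    done
    echo "timed out after ${timeout}s" >&2
    return 1
}

# Example (needs working outbound DNS):
#   wait_for_txt _acme-challenge.pyrocufflink.blue "$token" 120 5
```

With outbound DNS blocked, every lookup comes back empty, so the loop simply
runs until the timeout, which matches what the Jenkins job was doing.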
Attempt 1: _acme-challenge CNAME
At first, I thought the problem was simply that lego needed a DNS server. I
couldn't remember why I had configured it to use a third-party server, so I
just disabled that. By default, it uses the same name servers as the
operating system. Unfortunately, I quickly remembered the reason I needed to
use an external DNS server: the internal name servers have different records
for pyrocufflink.blue.
I remembered reading about using CNAME records to "redirect" ACME challenges to another domain, so I thought I would try that for pyrocufflink.blue:
_acme-challenge CNAME 5 _acme-challenge.o-ak4p9kqlmt5uuc.com
This should tell Let's Encrypt to look for its TXT record in the
o-ak4p9kqlmt5uuc.com domain instead of the pyrocufflink.blue domain.
Unfortunately, it seems that lego does not support this for Namecheap, even
with LEGO_EXPERIMENTAL_CNAME_SUPPORT=true.
In any case, I later discovered that this would not have helped.
Attempt 2: DNS-over-HTTPS Proxy
Since I couldn't get lego to work with the CNAME trick, I decided to try
using a DNS-over-HTTPS (DoH) proxy to tunnel DNS queries to an external name
server. I looked at dnscrypt-proxy and cloudflared, as these were the only
two implementations of DNS-to-DoH proxies I could find. cloudflared is simple
and requires no configuration, but it's a 40 megabyte binary. dnscrypt-proxy,
on the other hand, is a bit smaller (10 MB) but more complicated to run. It
requires a configuration file and at least one reference to a list of public
resolvers, which it must fetch and load when it starts up.
I made some modifications to the CI pipeline to support starting and stopping
the DoH proxy, and configured lego to send its queries there instead.
Unfortunately, this didn't work, either. It turns out lego only uses the
configured name server to find the NS records for the domain in question.
Once it gets the names of the authoritative name servers, it sends queries to
them directly, NOT through the configured server.
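For reference, starting and stopping the proxy around the lego run can be
done with a small wrapper along these lines. This is a sketch rather than
the actual pipeline code: with_doh_proxy, PROXY_CMD and PROXY_STARTUP_DELAY
are names invented for this example, and the config file path is a
placeholder.

```shell
#!/bin/sh
# Sketch: run a command with a local DoH proxy running alongside it.
# PROXY_CMD and PROXY_STARTUP_DELAY are hypothetical knobs for this example.

with_doh_proxy() {
    # Start the proxy in the background and remember its PID.
    ${PROXY_CMD:-dnscrypt-proxy -config dnscrypt-proxy.toml} &
    proxy_pid=$!
    # Crude wait for the proxy to load its resolver list and start listening.
    sleep "${PROXY_STARTUP_DELAY:-5}"

    # Run the wrapped command, keeping its exit status.
    status=0
    "$@" || status=$?

    # Stop the proxy whether or not the command succeeded.
    kill "$proxy_pid" 2>/dev/null || true
    wait "$proxy_pid" 2>/dev/null || true
    return "$status"
}
```

The job would then call something like "with_doh_proxy lego ..." with
lego's --dns.resolvers option pointed at 127.0.0.1:5053. As described above,
that only redirects the initial lookups, not the queries to the
authoritative servers.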
I was able to determine this by watching the network traffic with tshark for
both "normal" DNS and DoH-proxied DNS:
tshark -i any port domain
tshark -i lo -d tcp.port==5053,dns -d udp.port==5053,dns port 5053
(Port 5053 is where dnscrypt-proxy is listening.)
I could see lego making TXT and NS record requests to dnscrypt-proxy, and
then switching to making TXT record requests to external servers. I am not
sure why it bothers making the initial TXT request, since it does not seem to
care about the result, whether it is correct or not.
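The sequence seen in the capture can be reproduced by hand with dig. The
function below is a rough imitation of the observed traffic, not lego's real
logic; trace_challenge_lookup and the DIG hook are hypothetical names, and
the zone and record names are passed in by the caller.

```shell
#!/bin/sh
# Sketch: mimic the observed lookup order using dig.
# DIG is a hypothetical hook so the dig command can be swapped out.

trace_challenge_lookup() {
    zone=$1 name=$2 resolver=$3 port=${4:-53}

    # Queries that go to the configured resolver (the DoH proxy here).
    echo "== TXT via $resolver:$port"
    "${DIG:-dig}" +short TXT "$name" "@$resolver" -p "$port"
    nameservers=$("${DIG:-dig}" +short NS "$zone" "@$resolver" -p "$port")

    # Queries sent straight to each authoritative server on port 53,
    # bypassing the configured resolver. These are the ones a strict
    # firewall drops.
    for ns in $nameservers; do
        echo "== TXT direct to $ns"
        # Keep going if one server is unreachable.
        "${DIG:-dig}" +short +norecurse TXT "$name" "@$ns" || true
    done
}

# Example:
#   trace_challenge_lookup pyrocufflink.blue \
#       _acme-challenge.pyrocufflink.blue 127.0.0.1 5053
```

Running this alongside the tshark commands above should show the first two
queries on port 5053 and the rest on port 53.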
Temporary Solution
I am not sure exactly where to go from here. It seems lego is simply
incompatible with strict DNS. I will most likely need to find an alternate
ACME client that:
- Supports Namecheap API
- Works without access to the authoritative name servers
- Is simple enough to install that it can be run from a Jenkins job
Alternatively, I may investigate acme-dns. I may be able to get lego to work
correctly by adding CNAME records in the target domains that point to a
(sub-)domain hosted by acme-dns. I would just have to make sure that the
server is accessible both internally and externally.
In the meantime, I have added firewall rules to allow outbound DNS to Namecheap servers only.
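As an illustration of that kind of exception, the helper below prints
iptables rules for a list of server addresses. It is a sketch only: the
firewall actually in use may not be iptables, the OUTPUT chain is a guess
(a gateway would more likely need FORWARD), dns_allow_rules is a made-up
name, and the addresses in the example are documentation placeholders, not
Namecheap's real servers.

```shell
#!/bin/sh
# Sketch: print (not apply) rules allowing outbound DNS to given servers.
# The chain and the use of iptables are assumptions for this example.

dns_allow_rules() {
    for ip in "$@"; do
        # DNS uses UDP normally and TCP for large responses.
        for proto in udp tcp; do
            echo "iptables -A OUTPUT -p $proto -d $ip --dport 53 -j ACCEPT"
        done
    done
}

# Example with placeholder addresses:
#   dns_allow_rules 192.0.2.10 192.0.2.11
```

Printing the rules instead of applying them makes it easy to review the
list before changing the firewall.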