Let's Encrypt Certificates: DNS Blocked
The certs Jenkins job has been failing for a while, ever since I blocked
outbound DNS traffic to the Internet. The problem is that lego queries DNS for
each domain in the certificate request repeatedly until it sees the
_acme-challenge TXT record it created. With DNS traffic blocked, it is never
able to contact the configured DNS servers (was Cloudflare, now Quad9), so it
just waits until its timeout expires.
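The waiting behavior described above can be pictured as a simple polling loop. This is a minimal sketch, not lego's actual code: `wait_for_txt`, `blocked_lookup`, and the timeout and interval values are hypothetical, and the resolver is passed in as a function so the sketch runs without network access.

```python
import time
from typing import Callable, List


def wait_for_txt(
    lookup: Callable[[str], List[str]],
    name: str,
    expected: str,
    timeout: float = 60.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll lookup(name) until `expected` appears or `timeout` elapses.

    `lookup` stands in for a real DNS TXT query (hypothetical). It returns
    the TXT values it found, or raises OSError when the name server cannot
    be reached, as happens when outbound DNS is blocked.
    """
    deadline = clock() + timeout
    while True:
        try:
            if expected in lookup(name):
                return True
        except OSError:
            # Unreachable server: treat it like "not there yet" and retry.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)


def blocked_lookup(name: str) -> List[str]:
    """Simulate a firewall dropping every outbound DNS query."""
    raise OSError("network unreachable")
```

With `blocked_lookup`, every attempt fails the same way, so the loop can only end by running out the clock, which matches what the job was doing.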
At first, I thought the problem was simply that lego needed a reachable DNS
server. I couldn't remember why I had configured it to use a third-party
server, so I just disabled that setting. By default, it uses the same name
servers as the operating system. Unfortunately, I quickly remembered the
reason I needed to use an external DNS server: the internal name servers have
different records than the public ones.
I remembered reading about using CNAME records to "redirect" ACME challenges to another domain, so I thought I would try that for pyrocufflink.blue:
_acme-challenge CNAME 5 _acme-challenge.o-ak4p9kqlmt5uuc.com
This should tell Let's Encrypt to look for its TXT record in the
o-ak4p9kqlmt5uuc.com domain instead of the pyrocufflink.blue domain.
Unfortunately, it seems that lego does not support this for Namecheap, even
with LEGO_EXPERIMENTAL_CNAME_SUPPORT=true.
In any case, I later discovered that this would not have helped.
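To illustrate what the CNAME redirect is meant to achieve, here is a minimal sketch of a TXT lookup that follows CNAME records, run against a toy in-memory table rather than real DNS. The `resolve_txt` helper, the `ZONE` table, and the token value are hypothetical; only the two domain names come from the record above.

```python
from typing import Dict, List, Tuple

# Toy zone data: owner name -> list of (record type, value).
Records = Dict[str, List[Tuple[str, str]]]

ZONE: Records = {
    "_acme-challenge.pyrocufflink.blue": [
        ("CNAME", "_acme-challenge.o-ak4p9kqlmt5uuc.com."),
    ],
    # Placeholder token; a real ACME client would publish its own value here.
    "_acme-challenge.o-ak4p9kqlmt5uuc.com": [("TXT", "placeholder-token")],
}


def resolve_txt(records: Records, name: str, max_hops: int = 8) -> List[str]:
    """Return the TXT values for `name`, following CNAME records on the way."""
    for _ in range(max_hops):
        rrset = records.get(name.lower().rstrip("."), [])
        cnames = [value for rtype, value in rrset if rtype == "CNAME"]
        if not cnames:
            # No further redirect: whatever TXT records exist here are the answer.
            return [value for rtype, value in rrset if rtype == "TXT"]
        name = cnames[0]
    raise RuntimeError("too many CNAME hops")
```

Looking up `_acme-challenge.pyrocufflink.blue` in this table yields the TXT value stored under the other domain, which is the effect the redirect relies on.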
Attempt 2: DNS-over-HTTPS Proxy
Since I couldn't get lego to work with the CNAME trick, I decided to try
using a DNS-over-HTTPS (DoH) proxy to tunnel DNS queries to an external name
server. I looked at dnscrypt-proxy and cloudflared, as these were the only
two implementations of DNS-to-DoH proxies I could find. cloudflared is
simple and requires no configuration, but it's a 40 megabyte binary.
dnscrypt-proxy, on the other hand, is a bit smaller (10 MB), but more
complicated to run. It requires a configuration file and at least one
reference to a list of public resolvers, which it must fetch and load when it
starts.
I made some modifications to the CI pipeline to support starting and stopping
the DoH proxy, and configured lego to send its queries there instead.
Unfortunately, this didn't work, either. It turns out lego only uses the
configured name server to find the NS records for the domain in question.
Once it gets the names of the authoritative name servers, it sends queries to
them directly, NOT through the configured server.
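The two-phase behavior can be modeled with a toy network in which a firewall only permits queries to certain destinations. This is a sketch of the pattern described above, not lego's implementation: `check_challenge`, `firewalled_network`, and the server labels are hypothetical, and resolving name server names to addresses is left out for brevity.

```python
from typing import Callable, Dict, List, Set, Tuple

# send(server, qname, qtype) -> answers; raises OSError when the query is blocked.
Send = Callable[[str, str, str], List[str]]
Query = Tuple[str, str, str]


def check_challenge(send: Send, resolver: str, zone: str, fqdn: str) -> List[str]:
    """Ask `resolver` for the zone's NS records, then query each
    authoritative server directly for the challenge TXT record."""
    answers: List[str] = []
    for ns in send(resolver, zone, "NS"):
        # This query bypasses `resolver` and goes straight to the name server.
        answers.extend(send(ns, fqdn, "TXT"))
    return answers


def firewalled_network(
    data: Dict[Query, List[str]], allowed: Set[str], log: List[str]
) -> Send:
    """Build a fake `send` that records each destination and blocks any
    server not listed in `allowed`."""

    def send(server: str, qname: str, qtype: str) -> List[str]:
        log.append(server)
        if server not in allowed:
            raise OSError(f"query to {server} blocked")
        return data.get((server, qname, qtype), [])

    return send
```

If `allowed` contains only the local proxy, the NS lookup succeeds but the follow-up TXT query to the authoritative server raises `OSError`, so routing the configured resolver through a proxy does not help on its own.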
I was able to determine this by watching the network traffic with tshark, for
both "normal" DNS and DoH-proxied DNS:
tshark -i any port domain
tshark -i lo -d tcp.port==5053,dns -d udp.port==5053,dns port 5053
(port 5053 is where dnscrypt-proxy is listening)
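To generate a known query while a capture like this is running, a TXT question can be encoded by hand. The following is a minimal sketch using only the standard library; `build_query` and the fixed query ID are hypothetical choices for illustration, and it only builds the packet rather than sending it.

```python
import struct

QTYPE_TXT = 16  # TXT record type
QCLASS_IN = 1   # Internet class


def build_query(name: str, qtype: int = QTYPE_TXT, query_id: int = 0x1234) -> bytes:
    """Encode a single-question DNS query with the recursion-desired flag set."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    labels = b"".join(
        bytes([len(label)]) + label
        for label in name.rstrip(".").encode("ascii").split(b".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, QCLASS_IN)
```

Sending the resulting bytes over UDP to port 5053 on the loopback address (assuming the proxy listens there, as the `-i lo` capture suggests) should make the query show up in the second capture.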
I could see lego making TXT and NS record requests to the configured server,
then switching to making TXT record requests to external servers. I am not
sure why it bothers making the initial TXT request, since it does not seem to
care about the result, whether it is correct or not.
I am not sure exactly where to go from here. It seems lego is simply
incompatible with strict outbound DNS filtering. I will most likely need to
find an alternate ACME client that:
- Supports the Namecheap API
- Works without access to the authoritative name servers
- Is simple enough to install that it can be run from a Jenkins job
Alternatively, I may investigate acme-dns. I may be able to combine CNAME
records in the target domains pointing to a (sub-)domain hosted by acme-dns
to get lego to work correctly. I would just have to make sure that the
server is accessible both internally and externally.
In the meantime, I have added firewall rules to allow outbound DNS to Namecheap servers only.