Let's Encrypt Certificates: DNS Blocked

The certs Jenkins job has been failing ever since I blocked outbound DNS traffic to the Internet. The problem is that lego repeatedly queries DNS for each domain in the certificate request until it sees the _acme-challenge TXT record it created. With DNS traffic blocked, it can never reach the configured DNS servers (formerly Cloudflare, now Quad9), so it just waits until its timeout expires.
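
To make the failure mode concrete, here is a minimal sketch of that kind of poll-until-timeout loop. It is not lego's actual code: wait_for_txt and its lookup parameter are hypothetical names, and the lookup function is injected so the example runs without any network access.

```python
import time
from typing import Callable, Iterable


def wait_for_txt(
    lookup: Callable[[str], Iterable[str]],
    fqdn: str,
    expected: str,
    timeout: float = 60.0,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll lookup(fqdn) until `expected` appears or `timeout` elapses.

    `lookup` is a hypothetical stand-in for a real TXT query; it returns
    the TXT values found, or raises OSError if the server is unreachable.
    """
    deadline = clock() + timeout
    while True:
        try:
            if expected in lookup(fqdn):
                return True
        except OSError:
            # A blocked or unreachable server is indistinguishable from
            # "not propagated yet": all the loop can do is try again.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

When every query is dropped by the firewall, the only way out of this loop is the deadline, which matches the job hanging until the timeout and then failing.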

Attempt 1: _acme-challenge CNAME

At first, I thought the problem was simply that lego needed a DNS server it could reach. I couldn't remember why I had configured it to use a third-party server, so I just disabled that setting. By default, it uses the same name servers as the operating system. Unfortunately, I quickly remembered the reason I needed to use an external DNS server: the internal name servers have different records for pyrocufflink.blue.

I remembered reading about using CNAME records to "redirect" ACME challenges to another domain, so I thought I would try that for pyrocufflink.blue:

_acme-challenge CNAME 5 _acme-challenge.o-ak4p9kqlmt5uuc.com

This should tell Let's Encrypt to look for its TXT record in the o-ak4p9kqlmt5uuc.com domain instead of the pyrocufflink.blue domain. Unfortunately, it seems that lego does not support this with the Namecheap provider, even with LEGO_EXPERIMENTAL_CNAME_SUPPORT=true.
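
As an illustration of what the CNAME is meant to achieve, the sketch below follows the alias from the challenge name to the name where the TXT record would actually be looked up. The functions challenge_name and follow_cnames are hypothetical helpers, and the cnames dictionary stands in for real DNS data; no queries are sent.

```python
from typing import Dict


def challenge_name(domain: str) -> str:
    """Return the DNS-01 challenge name for a domain."""
    return "_acme-challenge." + domain.rstrip(".")


def follow_cnames(name: str, cnames: Dict[str, str], max_hops: int = 8) -> str:
    """Follow CNAME aliases in `cnames` and return the final name.

    `cnames` maps an alias to its target, like the CNAME records in a zone.
    """
    seen = set()
    name = name.rstrip(".").lower()
    while name in cnames:
        if name in seen or len(seen) >= max_hops:
            raise ValueError("CNAME loop or chain too long")
        seen.add(name)
        name = cnames[name].rstrip(".").lower()
    return name


# The record from this post, written with fully qualified names.
records = {
    "_acme-challenge.pyrocufflink.blue": "_acme-challenge.o-ak4p9kqlmt5uuc.com",
}
target = follow_cnames(challenge_name("pyrocufflink.blue"), records)
```

With the record above, target is the o-ak4p9kqlmt5uuc.com challenge name, so the validator would read the TXT record from that zone. For this to work, the ACME client has to create the TXT record at the target name as well, which is the part that did not seem to happen here.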

In any case, I later discovered that this would not have helped.

Attempt 2: DNS-over-HTTPS Proxy

Since I couldn't get lego to work with the CNAME trick, I decided to try using a DNS-over-HTTPS (DoH) proxy to tunnel DNS queries to an external name server. I looked at dnscrypt-proxy and cloudflared, as these were the only two implementations of DNS-to-DoH proxies I could find. cloudflared is simple and requires no configuration, but it's a 40 MB binary. dnscrypt-proxy, on the other hand, is smaller (10 MB) but more complicated to run: it requires a configuration file and at least one reference to a list of public resolvers, which it must fetch and load when it starts up.

I made some modifications to the CI pipeline to support starting and stopping the DoH proxy, and configured lego to send its queries there instead. Unfortunately, this didn't work, either. It turns out lego only uses the configured name server to find the NS records for the domain in question. Once it gets the names of the authoritative name servers, it sends queries to them directly, NOT through the configured server.
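
The two-phase behavior described above can be sketched roughly as follows. This is a simplified model based on what I observed, not lego's implementation: check_propagation, query_recursive, and query_direct are hypothetical names, and requiring the record on every authoritative server is an assumption.

```python
from typing import Callable, List


def check_propagation(
    fqdn: str,
    expected: str,
    zone: str,
    query_recursive: Callable[[str, str], List[str]],
    query_direct: Callable[[str, str, str], List[str]],
) -> bool:
    """Return True if every authoritative server has the expected TXT value.

    query_recursive(name, rtype) goes through the configured resolver
    (here, the DoH proxy). query_direct(server, name, rtype) contacts
    `server` itself, bypassing the configured resolver entirely.
    """
    # Phase 1: the configured resolver is only asked who is authoritative.
    nameservers = query_recursive(zone, "NS")
    if not nameservers:
        return False
    # Phase 2: each authoritative server is queried directly. These are
    # the packets a firewall blocking outbound DNS will drop.
    for server in nameservers:
        if expected not in query_direct(server, fqdn, "TXT"):
            return False
    return True
```

The point of the sketch is that tunnelling the first phase through a proxy does nothing for the second phase, since those queries never touch the configured server.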

I was able to determine this by watching the network traffic with tshark for both "normal" DNS and DoH-proxied DNS:

tshark -i any port domain
tshark -i lo -d tcp.port==5053,dns -d udp.port==5053,dns port 5053

(port 5053 is where dnscrypt-proxy is listening)

I could see lego making TXT and NS record requests to dnscrypt-proxy, and then switching to making TXT record requests to external servers. I am not sure why it bothers making the initial TXT request, since it does not seem to care whether the result is correct or not.

Temporary Solution

I am not sure exactly where to go from here. It seems lego is simply incompatible with a network that blocks direct outbound DNS. I will most likely need to find an alternative ACME client that:

  1. Supports the Namecheap API
  2. Works without access to the authoritative name servers
  3. Is simple enough to install that it can be run from a Jenkins job

Alternatively, I may investigate acme-dns. I may be able to get lego to work correctly by adding CNAME records in the target domains that point to a (sub-)domain hosted by acme-dns. I would just have to make sure that the server is accessible both internally and externally.
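
For reference, this is roughly what a TXT update against acme-dns looks like, going by its documented HTTP API (a POST to /update with X-Api-User and X-Api-Key headers). It is a sketch only: build_acme_dns_update is a hypothetical helper, the host name and credentials are placeholders, and the request is built but never sent. Real credentials and the subdomain would come from the acme-dns /register endpoint.

```python
import json
import urllib.request


def build_acme_dns_update(
    api_base: str,
    username: str,
    api_key: str,
    subdomain: str,
    txt_value: str,
) -> urllib.request.Request:
    """Build (but do not send) an acme-dns TXT update request."""
    body = json.dumps({"subdomain": subdomain, "txt": txt_value}).encode("utf-8")
    return urllib.request.Request(
        url=api_base.rstrip("/") + "/update",
        data=body,
        method="POST",
        headers={
            "X-Api-User": username,
            "X-Api-Key": api_key,
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration only.
request = build_acme_dns_update(
    "https://acme-dns.example.com",
    "placeholder-user",
    "placeholder-key",
    "placeholder-subdomain",
    "x" * 43,  # acme-dns expects a 43-character challenge token
)
```

Sending it would be a matter of passing the request to urllib.request.urlopen. The _acme-challenge CNAME in each target domain would then point at the full domain name that acme-dns assigned during registration.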

In the meantime, I have added firewall rules to allow outbound DNS to Namecheap servers only.