Recently my server went down! Well… not exactly… It did, however, stop serving pages with a trusted certificate. Looking into it took my NGINX container down and required some updates. So, without further ado, here’s what happened…
First Things First!
I try to be as transparent as possible. I am hosting a few other websites on my provider using containers, so I loaded up MailChimp and fired off an email to my users. It went something like this:
Yep, simple and sweet. Nothing too foreboding, but just enough to let everyone know their site may be down or otherwise inaccessible. Once that was fired off I started digging in!
Confident that a restart might fix it, I used c2technology.net as the guinea pig. And, &lt;sad trombone sound&gt;, it still had the same issues. It seemed like an NGINX issue, so I restarted that too. Still no luck. Looking at the NGINX logs, I saw the restart just complained about an unknown “virtual host” environment variable. That’s weird: this container only routes to virtual hosts; it isn’t one of them, nor does it know of any of them via an environment variable… Interesting…
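For reference, that first restart-and-check pass looked roughly like the sketch below. The container names are assumptions from my setup (check `docker ps` for yours), and I’ve added a `DRY_RUN` guard so the sketch only echoes the commands unless you opt in:

```shell
# Sketch of the first-pass restart. DRY_RUN=1 (the default) just prints
# each command instead of running it, so this is safe anywhere.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

run docker restart c2technology   # the guinea-pig site container (name assumed)
run docker restart nginx          # the reverse proxy itself (name assumed)
run docker logs --tail 50 nginx   # look for the "virtual host" complaint
```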
Let’s Get Sleuthy
Digging into the NGINX Generator container logs didn’t show anything out of the ordinary, and the Let’s Encrypt companion container didn’t turn up any weirdness either. So I started with the NGINX container configurations to see what was up. I went through /etc/nginx/conf.d/default.conf and found the environment variable there, so it was somehow passed down to the NGINX Generator, which then wrote it into the NGINX config. Thankfully (SO THANKFUL) the NGINX Generator also commented which container each configuration block was written for. If you recall, I was previously working on my Alexa Bot; a deploy of that project had been triggered with no value for the VIRTUAL_HOST variable. NGINX Generator decided that was the literal value and passed it on to the NGINX configs. Fixing this required going outside the automated deploy pipeline: I ran a shell on the NGINX container, opened up the default.conf file again, and just removed the block. Restarting NGINX still had the same environment variable issue.
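To give a flavor of the cleanup, here’s a hedged sketch that simulates what I did by hand inside the container. The conf layout below mimics what the generator writes (a comment naming the container above each server block); the exact format may differ in your setup, and the awk one-liner is my illustration, not part of the pipeline:

```shell
# Build a sample config with one bad block (literal, unexpanded variable)
# and one good block, then strip the bad one.
cat > /tmp/default.conf <<'EOF'
# alexa-bot
server {
    server_name ${VIRTUAL_HOST};
    listen 80;
}
# c2technology.net
server {
    server_name c2technology.net;
    listen 80;
}
EOF

# Drop any server block whose server_name is the literal ${VIRTUAL_HOST}.
# (The container-name comment above it is left behind -- harmless.)
awk '
  index($0, "server {") == 1 { inblock = 1; buf = ""; bad = 0 }
  inblock {
    buf = buf $0 "\n"
    if (index($0, "server_name ${VIRTUAL_HOST}")) bad = 1
    if ($0 == "}") { if (!bad) printf "%s", buf; inblock = 0 }
    next
  }
  { print }
' /tmp/default.conf > /tmp/default.clean.conf
```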
Taking a look at the running containers, the Alexa Bot was still running (presumably with the wrong VIRTUAL_HOST variable). So I killed it and restarted NGINX… same error. Again I opened a shell on the NGINX container, opened the default.conf file, and the VIRTUAL_HOST variable was back! NGINX Generator must have picked up a change and rewritten the config with the Alexa Bot container values. Oops! I removed the block again, and this time NGINX restarted just fine without the environment variable issue. Success! Let’s reboot the whole NGINX stack (NGINX, NGINX Generator, and the NGINX Let’s Encrypt Companion containers). Everything restarted just fine! Perfect!
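The order turned out to matter: while the offending container is alive, the generator just writes the bad block right back. A sketch of the sequence that actually worked (container names are assumptions from my setup, and the `DRY_RUN` guard keeps the sketch runnable anywhere):

```shell
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# 1. Kill the source of the bad VIRTUAL_HOST first, or the generator
#    regenerates the broken config as soon as you clean it up.
run docker stop alexa-bot
# 2. (Then remove the stale block from /etc/nginx/conf.d/default.conf by hand.)
# 3. Reboot the whole proxy stack; docker restart accepts several names at once.
run docker restart nginx nginx-gen nginx-letsencrypt
```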
But Wait, There’s More!
Going to c2technology.net still had a certificate issue. But alllll the other sites worked fine. This was super weird. So, the easy thing to do was restart the container for the site! Nope. Still had a bad cert. But this time it was a self-signed certificate from the Let’s Encrypt Companion. Different results are good, right? I took a peek in the Let’s Encrypt Companion container and there it was! I had added the IP address of the c2technology.net server as a Virtual Host in the NGINX Generator configurations, which were then written to the NGINX configs. This works great in NGINX land, but SSL certificates are only ever issued to host names, not raw IP addresses. I removed the IP address from the c2technology.net build parameters and voilà! It’s back up and running! Following up with my users, I sent an email that looked something like this:
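A quick way to spot the companion’s fallback certificate, by the way, is that a self-signed cert lists the same entity as both subject and issuer, whereas a real one names Let’s Encrypt as the issuer. A sketch, generating a throwaway self-signed cert locally rather than poking the live site (the CN is just an example, not the companion’s actual value):

```shell
# Generate a throwaway self-signed cert, like the companion's fallback:
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/selfsigned.key -out /tmp/selfsigned.crt \
  -subj "/CN=fallback.example" 2>/dev/null

# On a self-signed cert, subject and issuer are identical:
openssl x509 -in /tmp/selfsigned.crt -noout -subject -issuer
```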
The root cause of this issue was a poorly configured, unrelated build. This is not a shock given it was a Hacktoberfest project. Fortunately, the bad configuration was isolated to the c2technology.net hosting. Unfortunately, having to restart the NGINX container brought down all the other hosted sites. This did highlight a flaw in our build pipeline for Let’s Encrypt certificates: the Virtual Host and the Let’s Encrypt Host values were shared. Isolating each to its own variable would have prevented this issue while still retaining the NGINX handling for the raw IP address. By the time this is published, this will already be resolved, but it does serve as a reminder that configuration shortcuts can definitely cause some obscure problems. This particular problem lasted 2 hours, 59 minutes, and 49 seconds.
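The variable split is easier to see in a compose-style sketch. The service name, image, and IP are placeholders, but VIRTUAL_HOST and LETSENCRYPT_HOST are the real variables the proxy and the companion read, respectively:

```yaml
services:
  c2technology:
    image: c2technology/site:latest        # placeholder image name
    environment:
      # NGINX can happily route a raw IP, so it may stay here...
      VIRTUAL_HOST: "c2technology.net,203.0.113.10"
      # ...but only real host names may go into the certificate request:
      LETSENCRYPT_HOST: "c2technology.net"
      LETSENCRYPT_EMAIL: "admin@c2technology.net"   # placeholder address
```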