Metrics

If you’re reading this, chances are you’ve developed some type of toy app. Maybe you’ve even deployed a few. Perhaps you even have users happily plugging away at your app! But do you know how it is performing? Do you know what your users use the most? Do you know when things are failing, maybe even silently? In this article, I’m going to start building a metrics stack and hooking my Discord Bot up to it so I can see just how much use it’s getting.

The Idea

So, I’ve used Grafana at work. I believe it’s even backed by InfluxDB. We gather time-series metrics for successful executions of a service, for expected failure conditions, and even for unexpected ones. This is great because it’s plugged into PagerDuty, which runs a bunch of alerts and checks against the metrics in InfluxDB. So, when something really important fails at 2 a.m., I get a call from PagerDuty and I get to go to work a bit early.

Alright, so that’s an enterprise solution there. You probably don’t want your own app waking you up when a user hits an error. But how would you even know it happened without your user notifying you? My metrics gathering goal is to track servers the bot joins, bot departures, queries per server, queries per user, vendor cache resets, queries that return no results, and the total number of results returned across all servers.
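To make that list a bit more concrete, here’s a rough sketch of how those counts could be tallied in memory before any real metrics backend is involved. The class and method names (BotMetrics, record_event, record_query) and the event names are made up for illustration; this is not the bot’s actual code.

```python
from collections import Counter


class BotMetrics:
    """Hypothetical in-memory tally of the events listed above."""

    def __init__(self):
        # One-off events: "server_joined", "server_left", "cache_reset", "empty_query"
        self.events = Counter()
        self.queries_by_server = Counter()
        self.queries_by_user = Counter()
        self.total_results = 0

    def record_event(self, name):
        """Count a single occurrence of a named event."""
        self.events[name] += 1

    def record_query(self, server_id, user_id, result_count):
        """Count one query and the number of results it returned."""
        self.queries_by_server[server_id] += 1
        self.queries_by_user[user_id] += 1
        self.total_results += result_count
        if result_count == 0:
            self.events["empty_query"] += 1
```

Something like this would lose everything on restart, which is exactly why a real time-series store is worth looking into.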

The Setup

Now, I could go about this and write some sort of database that’s fronted with some services and then invoke those services with my app. But, that’s a lot of work. Other people that are smarter than me have already figured this out. I’m researching a few solutions to implement. So far, I’ve found:

  1. Grafana – a metrics visualization web application
  2. InfluxDB – a time-series database
  3. Graphite – a time-series metrics service, commonly used as a storage backend for StatsD
  4. Prometheus – another time-series metrics service that’s supposedly better than Graphite; it scrapes metrics itself rather than speaking StatsD, though a StatsD exporter exists
  5. StatsD – a protocol for metrics gathering, also a tool that Etsy made
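For a feel of what the StatsD protocol looks like on the wire, a counter is just a short text line in the form name:value|c, usually sent over UDP (port 8125 is the conventional default). Here’s a minimal sketch, assuming a StatsD-compatible daemon is listening locally; the metric name bot.queries and the function names are placeholders I made up.

```python
import socket


def format_counter(name, value=1):
    """Build a StatsD counter line: <name>:<value>|c"""
    return f"{name}:{value}|c"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget a single metric line over UDP.

    host and port are assumptions; point them at wherever the daemon runs.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))
```

Calling send_metric(format_counter("bot.queries")) would bump the counter by one. Because it’s UDP, the app doesn’t wait on (or even notice) whether the metrics service is up, which is a nice property for a bot that shouldn’t slow down just to report on itself.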

I’ll be researching these more to determine what I might actually use. If I find better alternatives, I’ll be sure to write something on them as well. For now, I think I have some sort of solution with these options.

I’m hoping to also use Docker in setting this all up. Maybe there will even be a handy Dockerfile I can use that has all the things needed to get this up and running.

If you have any other suggestions, let me know in the comments! If you have ideas on what else to track, leave a comment about that too. I’ll write something about the metrics system and post it soon!

Downtime

Recently my server went down! Well… not exactly… It did however stop serving pages with a trusted certificate. Looking into this took my NGINX container down and required some updates. So, without further ado, here’s what happened…

First Things First!

I try to be as transparent as possible. I am hosting a few other websites on my provider using containers, so I loaded up MailChimp and fired off an email to the people whose sites I host. It went something like this:

Ruh roh!

Yep, simple and sweet. Nothing too foreboding, but just enough to let everyone know their site may be down or otherwise inaccessible. Once that was fired off I started digging in!

Confident that restarting might fix it, I used c2technology.net as the guinea pig. And, <sad trombone sound> it still had the same issues. Seemed like an NGINX issue, so I restarted that too. Still no luck. Looking at the NGINX logs, I saw the restart just mentioned an unknown “virtual host” environment variable. That’s weird: this container only routes to virtual hosts. It isn’t one of them, nor does it know of any of them via an environment variable… Interesting…

Let’s Get Sleuthy

Digging into the NGINX Generator container logs didn’t show anything out of the ordinary, and the Let’s Encrypt companion container didn’t turn up any weirdness either. So I started with the NGINX container configuration to see what was up. I went through /etc/nginx/conf.d/default.conf and found the environment variable there, so it had somehow been passed down to the NGINX Generator, which then wrote it into the NGINX config. Thankfully (SO THANKFUL) the NGINX Generator also adds a comment saying which container each configuration block was written for. If you recall, I was previously working on my Alexa Bot, and a deploy of that project was triggered with no value for the VIRTUAL_HOST variable. NGINX Generator decided that was the literal value and passed it on to the NGINX config. Fixing this required going outside the automated deploy pipeline. I ran a shell on the NGINX container, opened up the default.conf file again, and just removed the block. Restarting NGINX still had the same environment variable issue.

Taking a look at the running containers, I saw the Alexa Bot was still running (presumably with the wrong VIRTUAL_HOST variable). So I killed it and restarted NGINX… same error. Again I opened a shell on the NGINX container and opened the default.conf file, and the VIRTUAL_HOST variable was back! NGINX Generator must have picked up a change and rewritten the config with the Alexa Bot container values. Oops! I removed the block again and NGINX restarted just fine without the environment variable issue. Success! Let’s reboot the whole NGINX stack (NGINX, NGINX Generator, and the NGINX Let’s Encrypt Companion containers). Everything restarted just fine! Perfect!
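A cheap guard against this kind of thing is to sanity-check VIRTUAL_HOST in the deploy job before the container ever starts. Here’s a minimal sketch of that idea. It assumes VIRTUAL_HOST is a comma-separated list of plain host names and that a bad deploy shows up as either an empty value or an unexpanded placeholder like ${VIRTUAL_HOST}; the function name check_virtual_host is made up, and I can’t say for sure this matches exactly what my broken deploy passed along.

```python
def check_virtual_host(raw_value):
    """Return a list of problems with a VIRTUAL_HOST value.

    An empty list means the value looks usable. This is a loose check:
    it assumes plain host names, not wildcard or regex entries.
    """
    if raw_value is None or not raw_value.strip():
        return ["VIRTUAL_HOST is empty or unset"]

    problems = []
    for entry in raw_value.split(","):
        entry = entry.strip()
        if not entry:
            problems.append("empty entry in comma-separated list")
        elif "$" in entry or "{" in entry or "}" in entry:
            # Looks like a build variable that was never substituted.
            problems.append(f"unexpanded placeholder: {entry!r}")
    return problems
```

If the deploy script bails out whenever this returns anything, a misconfigured build fails on its own instead of taking the shared NGINX config down with it.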

But Wait, There’s More!

c2technology.net still had a certificate issue. But alllll the other sites worked fine. This was super weird. So, the easy thing to do was to restart the container for the site! Nope. Still had a bad cert. But this time, it was a self-signed certificate from the Let’s Encrypt Companion. Different results are good, right? I took a peek in the Let’s Encrypt Companion container and there it was! I had added the IP address of the c2technology.net server as a Virtual Host in the NGINX Generator configuration, which was then written to the NGINX config. This works great in NGINX land. But Let’s Encrypt would only issue a certificate for a host name, not a bare IP address. I removed the IP address from the c2technology.net build parameters and voilà! It’s back up and running! Following up with my users, I sent an email to them that looked something like this:

Everything is awesome!

Post Mortem

The root cause of this issue was a poorly configured build for an unrelated project. This is not a shock given it was a Hacktoberfest project. Fortunately, the certificate problem was isolated to the c2technology.net hosting. Unfortunately, having to restart the NGINX container brought down all the other hosted sites. This did highlight a flaw in our build pipeline for Let’s Encrypt certificates: the Virtual Host and the Let’s Encrypt Host values were shared. Isolating each in its own variable would have prevented this issue while still retaining the NGINX handling for the raw IP address. By the time this is published, this will already be resolved, but it does serve as a reminder that configuration shortcuts can definitely cause some obscure problems. This particular problem lasted 2 hours, 59 minutes, and 49 seconds.
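Here’s roughly what that separation could look like if the build generated the two values from a single list of hosts. This is a sketch, not the actual pipeline: the function name container_env is made up, and it assumes the containers read VIRTUAL_HOST and LETSENCRYPT_HOST as comma-separated lists.

```python
import ipaddress


def _is_ip(host):
    """True if host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def container_env(hosts):
    """Build separate routing and certificate host values.

    Every host is routed by NGINX, but only real host names are
    handed to the certificate side, since an IP can't get a cert here.
    """
    routed = [h.strip() for h in hosts if h.strip()]
    cert_hosts = [h for h in routed if not _is_ip(h)]
    return {
        "VIRTUAL_HOST": ",".join(routed),
        "LETSENCRYPT_HOST": ",".join(cert_hosts),
    }
```

With a host name plus a raw IP as input, the IP still ends up in VIRTUAL_HOST for routing but is left out of LETSENCRYPT_HOST, so the certificate request never sees it.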