Wednesday, January 6, 2016

Using Tufte's discrete sparklines to monitor Nginx status codes

Since I manage several of my own sites end-to-end, including configuring and running their servers, it is important for me to know what the sites are up to. This includes knowing when they may be down, as well as checking on any error codes that may be generated. I went ahead and created this simple dashboard vignette:



to keep track of my `Nginx` response status codes. I can see if there are any error codes that have been generated (those in the 400s or the 500s), and act accordingly.

Why did I go ahead and make my own dashboard? First off, reading through Nginx logs in their raw form is a pain. For those who haven't worked with them before, this is what they look like:


180.76.15.141 - - [04/Jan/2016:07:42:26 -0500] "GET / HTTP/1.1" 200 36005 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
180.76.15.147 - - [04/Jan/2016:07:43:33 -0500] "GET / HTTP/1.1" 200 36005 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
91.65.255.241 - - [04/Jan/2016:07:50:31 -0500] "GET /portfolio/resume HTTP/1.1" 301 5 "http://rowanv.com/portfolio/about/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"
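Those lines follow Nginx's default "combined" log format, which means they can be picked apart with a regex built mechanically from the `log_format` string itself: every `$variable` becomes a named capture group, and everything else is escaped as a literal. A rough sketch, using one of the sample lines above (the user-agent shortened for readability):

```python
import re

# Nginx "combined" log_format, with each $variable turned into a named group
conf = ('$remote_addr - $remote_user [$time_local] "$request" '
        '$status $body_bytes_sent "$http_referer" "$http_user_agent"')
regex = ''.join(
    '(?P<' + g + '>.*?)' if g else re.escape(c)
    for g, c in re.findall(r'\$(\w+)|(.)', conf))

line = ('180.76.15.141 - - [04/Jan/2016:07:42:26 -0500] "GET / HTTP/1.1" '
        '200 36005 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0)"')
fields = re.match(regex, line).groupdict()
print(fields['status'])   # 200
print(fields['request'])  # GET / HTTP/1.1
```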


Bleh. So I decided to go out and hunt for some existing visualizations, none of which seemed too great. They seemed far too cluttered with chart junk and lacked customizability. Here is an example of an existing dashboard element that aims to visualize something similar to my final visualization:



I found this graph fairly distracting, and for some reason the non-2xx status codes didn't show up on it at all. Any spike in traffic would also make it very hard to read -- if hundreds of requests returned a 200 response but only a handful returned a 400 or 500, the 400s and 500s would be nearly invisible because everything is graphed on the same axis. I didn't just want to see how many 200 responses occurred -- I wanted visual confirmation that 400s and 500s were not occurring. In fact, I didn't really care how many responses of each type were occurring; the presence or absence of error responses was what mattered to me, so a simple discrete chart would abstract away a lot of the distracting noise. It could be useful to have error rates / total requests available as well, but I think that would be best visualized in another chart.
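That presence/absence reduction is trivial to compute. As a sketch, with made-up hourly counts, all the discrete sparkline needs is a boolean per hour per status class:

```python
import pandas as pd

# Hypothetical hourly counts of responses by status-code class
counts = pd.DataFrame({
    'is_200': [312, 280, 495],
    'is_400': [0, 2, 0],
    'is_500': [0, 0, 1],
})

# Only presence or absence matters for the discrete sparkline
presence = counts > 0
print(presence['is_400'].tolist())  # [False, True, False]
```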

Along the way, I also found a few completely text-based interfaces for monitoring one's servers, but I found their output difficult to parse at a glance. In the end, creating a custom dashboard made the most sense for my needs.

The process was fairly straightforward. First, I located my Nginx access logs and wrote a Python script to read them into a pandas DataFrame, which is far more manageable than their native format. Then, I aggregated and reshaped the DataFrame to get a dataset identifying which status codes had occurred over the course of each hour. This left me with a DataFrame that looked like this:



Each row consisted of an hour; each column identified how many times each class of status code (one in the 200s, one in the 300s, etc.) had occurred over the course of that hour.
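As a sketch of that reshaping step (with a few made-up requests), the per-request statuses become boolean columns and a `groupby` on the hour counts them:

```python
import pandas as pd

# Hypothetical parsed requests: one row per request
df = pd.DataFrame({
    'date_hour': pd.to_datetime(['2016-01-04 07:00', '2016-01-04 07:00',
                                 '2016-01-04 08:00']),
    'status': ['200', '301', '404'],
})

# One boolean column per status-code class
for prefix in ['2', '3', '4', '5']:
    df['is_%s00' % prefix] = df['status'].str.startswith(prefix)

# Summing booleans per hour yields hourly counts per class
hourly = df.drop(columns='status').groupby('date_hour').sum()
print(hourly['is_400'].tolist())  # [0, 1]
```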

Then, I used Flask to build a small web application that serves the reshaped dataset as JSON over HTTP. This would enable my front-end view to locate the data and read it in. Here's the basic view code that serves the HTTP response:



import re
import datetime

import pandas as pd
from flask import Flask, Response

app = Flask(__name__)


@app.route('/nginx_dash/data/status_code_hourly/')
def status_code_hourly_data():
    log_path = 'access.log'

    # Build a regex from the Nginx log_format string: each $variable
    # becomes a named capture group; everything else is escaped literally.
    conf = '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"'
    regex = ''.join(
        '(?P<' + g + '>.*?)' if g else re.escape(c)
        for g, c in re.findall(r'\$(\w+)|(.)', conf))

    conf_list = []
    with open(log_path, 'r') as f:
        for line in f:
            parsed = re.match(regex, line)
            if parsed:  # skip malformed lines rather than crashing
                conf_list.append(parsed.groupdict())
    df = pd.DataFrame(conf_list)

    df['date_time'] = df['time_local'].apply(
        lambda x: datetime.datetime.strptime(x[:20], '%d/%b/%Y:%H:%M:%S'))
    df['status'] = df['status'].apply(str)
    # One boolean column per status-code class (1xx through 5xx)
    df['is_100'] = df['status'].apply(lambda x: x[0] == '1')
    df['is_200'] = df['status'].apply(lambda x: x[0] == '2')
    df['is_300'] = df['status'].apply(lambda x: x[0] == '3')
    df['is_400'] = df['status'].apply(lambda x: x[0] == '4')
    df['is_500'] = df['status'].apply(lambda x: x[0] == '5')
    # Truncate timestamps to the hour, then count occurrences per hour
    df['date_hour'] = df['date_time'].apply(lambda dt: dt.replace(minute=0, second=0))
    df_status_codes = df[['date_hour', 'is_100', 'is_200', 'is_300', 'is_400', 'is_500']]
    df_status_codes_grouped = df_status_codes.groupby('date_hour').sum()
    df_status_codes_grouped['date_hour'] = df_status_codes_grouped.index
    json_response = df_status_codes_grouped.to_json(orient='records')
    return Response(response=json_response,
                    status=200,
                    mimetype='application/json')
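For reference, `to_json(orient='records')` emits one JSON object per row, which makes the response easy for the front end to iterate over. A tiny sketch with made-up counts:

```python
import pandas as pd

# A tiny frame shaped like the hourly status-code counts
df = pd.DataFrame({'is_200': [3, 5], 'is_400': [0, 1]})
print(df.to_json(orient='records'))
# [{"is_200":3,"is_400":0},{"is_200":5,"is_400":1}]
```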


Finally, I wrote the front end in `d3.js`, which actually renders the discrete sparklines. You can check out my previous tutorial on making discrete sparklines to learn how to build this type of visualization.

And that's it! I now have an awesome, live-updating, minimalist visualization that lets me know if any concerning error codes have occurred over the last couple of days.

If you want to learn more about sparklines, check out Tufte's Beautiful Evidence.
