Effective Logs and Metrics

Application metrics and logs are a cornerstone of effective operations.  It is almost certain that bad things will happen during the operation of an application.  It is also a near certainty that when bad things to happen, you will want to know why, and you will want to prevent it in the future.  If you don’t, rest assured the person that signs your paycheck every other week will.  In this article I will present some strategies to effectively gather metrics and logs to monitor an application.

Log Wisely

Commercial log aggregation services will tell you to log anything and everything.  Of course, the same companies also charge by the gigabyte.  They make more when you log more.  In my experience logging too much is not helpful.  This is especially true when the data logged is something mundane and unhelpful like starting request for user foo… or finished request for user foo in 1812ms.  There may be valuable information in those statements but they will be difficult to aggregate in the future because they are not machine readable.  When someone taps you on the shoulder and says user foo is complaining about slow loading times it will be difficult to answer questions like how long did user foo have to wait?

Log Consistently

To log consistently, use a serializable data format like JSON.  Resist the temptation to pretty print your serialized data for readability.  Log aggregation tools have an annoying tendency to treat new lines as discrete events.  As a general guideline, if it is easy for the developer to read, it is hard for the machine to read.  If it is hard for the machine to read, a human will need to spend time massaging the data to make it easier for the machine to read.  Stop that kind of waste before it starts by resigning your log statements to a single line.  Use tools like jq to expand the data later.

On the subject of making your logs easier for machines to read, pick a schema for your logs and stick to it.  Prevent the schema from becoming a point of disagreement among your team by using data envelopes.  Determine a schema for the outer payload that everyone can agree upon and allow inner payloads to have their own schema that may differ from others.  Avoid making a single data envelope responsible for too much.

Use Logs for Logs and Metrics for Metrics

Going back to the example finished request for user foo in 1812ms.  What if you need to calculate the average response time for an application on an ongoing basis and alert on it when you reach a predetermined threshold?  If the logs are machine readable they can be aggregated and converted into a metric.  However, it is easier to skip this step and record the metrics directly.  On the other hand, metrics aggregation systems will suffer from poor performance (Prometheus) or are costly (CloudWatch) when recording arbitrary data.  Metrics are about the big picture rather than discrete events.

Minimize the Cardinality of Metrics

Metrics should be limited to wholistic views of a system.  For example, the response time for an individual route on a given instance is a valuable metric with low cardinality because the scope of the metric is broad.  The same metric with a trace_id for a given request will considerably increase the number of dimensions recorded for this same metric.  Narrow scope means high cardinality, which will have performance or financial implications, if not both.

Prefer Middleware and Wrappers

Do not allow the implementation details of your metrics and logs to clutter a codebase.  Hardcoding log and metric statements inside of functions is an anti-pattern.  This practice will quickly clutter your codebase.  Instead, wrap functions after importing/requiring them and use middleware to wrap entire routes.

Wrappers

Below is an example wrapper for Node.js.  The example uses function currying (functions that return other functions) to implement wrappers that time and log execution time.  This is the  functional programming approach.  However, this task can be accomplished using other programming paradigms.

First a function is needed to time the execution in milliseconds.

To use this function, assign the results of calling it to a variable, then call the returned function.

The timer function can be used in a wrapper function that will record and log the execution time of the function it wraps.

To use the wrapper pass the function you wish to wrap as the argument and assign the result to a variable.  The variable holds a function which calls the wrapped function and logs the execution time.

It is also possible to wrap functions which expect a callback function.

This is useful in situations where a callback based API is used such as the asynchronous I/O modules in the Node.js standard library.

Middleware

Middleware are another place to abstract logs and metrics.  The example Express middleware below observes a Prometheus histogram of response times along with labels for the request path and the response status code.  A timer is started and an event listener is registered when the middleware is called.  When the finish event is dispatched for the response, the time is observed.

This minimal example allows for a considerable insights.  Not only will it tell you the response time for a given route, it can also tell you the number of requests count(response_time_ms).  With labels, you can alert when too many bad responses happen in a given timeframe for a given route count(response_time_ms[1m]{"status" = 500, "path" = "foo"}) > 50.

Make Data Visible and Accessible

Use off the shelf products to aggregate and visualize the data collected by logs and metrics.  Tools like Prometheus and Grafana allow dashboards and alerts to be created quickly.  Log aggregators like LogDNA and Splunk can help search logs for events and even generate metrics from logs that can also roll up to dashboards and alerts.  This data is important to people at all levels of an organization.  Use role based access and dedicated instances to provide appropriate information to each group.

Do not hoard, conceal, or hide data without good reason.  There may be valid reasons to limit access data to remain compliant with legal regulations.  Do not resort to arbitrary and excessive trainings or tests to safeguard data.  Anonymize data and provide appropriate training to remain compliant.

Start Now and Do Not Stop

If you are not collecting logs and metrics today, start.  Continue to add logs, metrics, dashboards, and alerts in response to the behavior of the application.  This is an ongoing and evolutionary process.  Addressing every bug and new feature should involve adjusting or creating logs, metrics, dashboards, and alerts.  Treat it like writing tests.  Make it part of the operational mantra.