Monitoring Your App in the Cloud

Improving an application's value proposition for users and its return on investment (ROI) for owners depends on deep knowledge of that application, whether it is minimally viable or full-fledged and mature. Good monitoring provides much of that knowledge.

Folks launching new betas in particular can use concrete data to better interpret the other half of the user experience story: the narrative we base our feature increments, and even business pivots, around. And everyone with a product at any level of maturity can benefit from the understanding monitoring provides of the economics of their application's production environment.

Clouds are Changing Monitoring

Deploying applications in the cloud, with an infrastructure- or platform-as-a-service provider such as Rackspace or Heroku, fundamentally changes two factors that determine how application owners monitor their environments.

  • Round-the-clock operations teams are an integral part of the services we're buying, so monitoring and responding to low-level environmental details happens outside our purview: we pay for availability, not machines.

  • In the cloud, we can adjust the capacity and costs of applications hourly and in fine increments; compare that to the near impossibility of decommissioning surplus servers purchased for a seasonal spike.

To leverage these differences for the greatest ROI for application owners and users, we want to monitor applications from two perspectives: user experience and production efficiency.

Monitoring the User Experience

User experience monitoring provides an understanding of the production environment's impact on users' decisions; e.g., a beta getting little traction can't be called a positioning failure if its responsiveness or availability is an ongoing issue.

Usage

Monitoring how people use our applications often seems like the most valuable measure of all, so it's a bit surprising how difficult it can be to make useful decisions just by looking at usage statistics. The key is to combine broad, historical usage data with the economics of your own and your customers' ROI to create small, time-limited monitoring projects with specific goals.

The big name in usage monitoring is Google Analytics. Because it focuses more on websites, its utility for web applications is limited in the vanilla integration. But careful use of its tracking features, essentially turning it into a click-stream logger, can show us a lot about how users interact with our applications, in ways we can actually act on.
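
To make that concrete, here is a minimal sketch of the click-stream idea in TypeScript, assuming the standard gtag.js snippet is already installed on the page; the data-track attribute and the event names are our own conventions for illustration, not anything Google Analytics requires.

```typescript
// A minimal sketch of delegated click tracking, assuming gtag.js is loaded.
// The "data-track" attribute and "clickstream" category are our own naming.
declare function gtag(command: string, action: string, params?: Record<string, unknown>): void;

document.addEventListener("click", (event) => {
  // Only report clicks on elements we have deliberately marked for tracking.
  if (!(event.target instanceof HTMLElement)) return;
  const el = event.target.closest<HTMLElement>("[data-track]");
  if (!el) return;

  gtag("event", "ui_click", {
    event_category: "clickstream",
    event_label: el.dataset.track, // e.g. "export-invoices-button"
  });
});
```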

Responsiveness

It used to be that problems with responsiveness were rare unless a system was overloaded. With the proliferation of third-party APIs, this is changing: a web request to one application by an end user can generate several web requests to other applications (e.g., Facebook, Salesforce), all needing to complete within the narrow time allowed for a positive user experience. And because of those dependencies, fluctuations in responsiveness can occur in completely unloaded, unchanged applications.

There aren't many solid solutions for monitoring responsiveness, in part because application "pages" are now usually many pieces loaded asynchronously by client-side software, making it highly subjective when the goal of a user's action has been achieved. We've had some success working this into Google Analytics, and more recently New Relic added a very promising component to their service called Real User Monitoring.
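
As an illustration of what we mean by working responsiveness measurement into our own code, here is a rough TypeScript sketch that times a user action from start to "goal achieved", third-party calls included, and reports the result; the /metrics/responsiveness endpoint and renderDashboard function are hypothetical placeholders.

```typescript
// A sketch of timing a user action end to end and reporting the duration.
// The reporting endpoint and renderDashboard are hypothetical placeholders.
declare function renderDashboard(data: unknown): void;

async function timeUserAction<T>(name: string, action: () => Promise<T>): Promise<T> {
  const started = performance.now();
  try {
    return await action();
  } finally {
    const elapsedMs = Math.round(performance.now() - started);
    // sendBeacon keeps the metric report from blocking, or being lost to, navigation.
    navigator.sendBeacon("/metrics/responsiveness", JSON.stringify({ name, elapsedMs }));
  }
}

// Usage: wrap everything behind "open the dashboard", async pieces included.
void timeUserAction("open-dashboard", async () => {
  const response = await fetch("/api/dashboard");
  renderDashboard(await response.json());
});
```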

Failures

With the complex architecture and extensive external dependencies of current applications comes a new flavor of failures. Previously, failures were usually consistent presentations of application defects or, less commonly, temporary environment issues, e.g., frozen server processes.

Now, failures can come and go as external API teams introduce and correct their own defects, partial cloud outages manifest in limited areas of production environments, and so on. So rather than the old black-and-white measure of a current defect count, we need to monitor failures from a time-based perspective, with an understanding that many failures are not rectifiable other than by helping a third party.

While we love Airbrake for individual error gathering and notification, and build it into nearly everything we ship, we're still looking for the best fully automated solution for keeping tabs on current and historical statistics of our applications' failures to satisfy users' actions. Right now we use a mixture of New Relic and custom monitoring built into the applications themselves.
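
To show the kind of time-based perspective we mean, here is a small TypeScript sketch of the sort of custom monitoring we build in: a rolling window of outcomes per external dependency, so transient third-party failures show up as a rate over time rather than a single defect count. The window length and naming are arbitrary choices for illustration.

```typescript
// A rolling failure-rate window: keep recent outcomes per dependency and
// expose the failure rate over that window instead of a raw defect count.
type Outcome = { at: number; ok: boolean };

class FailureWindow {
  private outcomes: Outcome[] = [];
  constructor(private windowMs = 15 * 60 * 1000) {} // 15-minute window (arbitrary)

  record(ok: boolean): void {
    const now = Date.now();
    this.outcomes.push({ at: now, ok });
    // Drop anything older than the window.
    this.outcomes = this.outcomes.filter((o) => now - o.at <= this.windowMs);
  }

  failureRate(): number {
    if (this.outcomes.length === 0) return 0;
    const failures = this.outcomes.filter((o) => !o.ok).length;
    return failures / this.outcomes.length;
  }
}

// Usage: one window per dependency, e.g. a third-party API.
const salesforce = new FailureWindow();
salesforce.record(true);
salesforce.record(false);
console.log(salesforce.failureRate()); // 0.5
```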

Availability

Well-designed availability monitoring gives us a measure of how often an application is unable to meet end users' demands and how that correlates to other variables, e.g., recent application updates or changing usage patterns.

The keys to availability monitoring are to define availability for the particular product and to connect availability to other data such as usage statistics and failure data.

The definition of availability is important to establish at launch and should be reviewed with each significant functional or infrastructure change. It's a subjective description: e.g., when users can sign in and see, but not alter, data, an application may be considered available but degraded, while we might consider it unavailable if sign-in is not functioning even when every other area is. It's important to construct definitions with the value proposition in mind; if an application's primary value to users is in modifying data, then read-only availability is no help.
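
One way to keep such a definition honest is to encode it. The TypeScript sketch below is an illustrative version of the example above; the individual checks are placeholders for app-specific probes, and the mapping is where the value proposition lives.

```typescript
// Encoding an availability definition explicitly. The checks are placeholders
// for real probes (sign-in, read, write); the mapping reflects what matters
// to this particular product.
type Availability = "available" | "degraded" | "unavailable";

interface Checks {
  signIn: boolean;
  readData: boolean;
  writeData: boolean;
}

function classify(checks: Checks): Availability {
  if (!checks.signIn) return "unavailable";                    // sign-in is core here
  if (checks.readData && checks.writeData) return "available";
  if (checks.readData) return "degraded";                      // read-only: usable but impaired
  return "unavailable";
}
```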

Connecting availability to statistical, root cause data is more difficult and evolves over time with the sophistication of the product. The goal is know what impact changes are having on the application’s availability in order to make those statistics actionable rather than merely interesting.

Good availability-oriented monitoring services are plentiful; our favorite is Pingdom because it’s very tightly focused on one problem and it has a good API. When you’re choosing, look for:

  • Multiple monitoring locations, unrelated by infrastructure or geography to the application.

  • Ability to monitor comprehensive application health. Most services, including Pingdom, do very little to help out here, so a large part of this may come from what's built into the application itself, e.g., comprehensive self-health checks (see the sketch after this list).
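
Here is a rough TypeScript sketch of such a self-health check, exposed as an HTTP endpoint an external monitor like Pingdom could poll. It assumes a Node.js runtime, and runChecks is a hypothetical stand-in for real probes (a database round trip, the sign-in flow, third-party reachability, and so on).

```typescript
// A minimal self-health endpoint for an external monitor to poll (Node.js).
// runChecks is a placeholder for real, cheap, time-boxed probes.
import { createServer } from "node:http";

async function runChecks(): Promise<Record<string, boolean>> {
  // Replace with real probes of the things that define availability for this product.
  return { signIn: true, readData: true, writeData: true };
}

createServer(async (req, res) => {
  if (req.url !== "/health") {
    res.statusCode = 404;
    res.end();
    return;
  }
  const checks = await runChecks();
  const healthy = Object.values(checks).every(Boolean);
  res.statusCode = healthy ? 200 : 503; // monitors key off the status code
  res.setHeader("Content-Type", "application/json");
  res.end(JSON.stringify(checks));
}).listen(3000);
```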

Monitoring Production Efficiency

Production efficiency is the cost of the current throughput; e.g., an environment responding to a hundred requests per second at a cost of a hundred dollars an hour is running at roughly 0.28 USD per thousand requests. It's important to compare costs to actual throughput, rather than to peak load or theoretical limits, because with a cloud deployment we can adjust on the fly to maximize cost efficiency through peaks and troughs.
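
Spelled out as code, the arithmetic looks like this (a trivial TypeScript helper, not any particular service's API):

```typescript
// Cost per thousand requests from hourly cost and measured throughput.
function costPerThousandRequests(hourlyCostUsd: number, requestsPerSecond: number): number {
  const requestsPerHour = requestsPerSecond * 60 * 60; // 360,000 at 100 RPS
  return (hourlyCostUsd / requestsPerHour) * 1000;
}

console.log(costPerThousandRequests(100, 100).toFixed(2)); // "0.28"
```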

Throughput

Throughput is the measure of how many requests the application is servicing per unit of time, usually per second or per minute, abbreviated as RPS or RPM. It doesn't relate directly to what end users see, because a given "page" they perceive is made up of dozens or even hundreds of smaller pieces, many coming from content distribution networks, third-party environments, or the same application via secondary requests; not to mention applications whose primary user interfaces are text messages or interactive voice response.

Throughput is an important measurement to take at a frequent interval and to compare over long periods of time because it marries user trends to capacity and responsiveness. New Relic is a great service for monitoring throughput, and we build it into almost every product we create or manage.
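
For applications where we want our own numbers alongside a service's, a very small in-process sampler is enough to start. The TypeScript sketch below counts requests and logs RPM on a fixed interval; countRequest is meant to be called from whatever per-request hook the web framework provides (hypothetical wiring).

```typescript
// A minimal in-process throughput sampler: count requests, log RPM each minute.
let requestCount = 0;

// Call once per request from the framework's request-handling hook.
export function countRequest(): void {
  requestCount += 1;
}

setInterval(() => {
  console.log(`throughput: ${requestCount} RPM`);
  requestCount = 0; // start the next one-minute window
}, 60 * 1000);
```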

Bottlenecks

Bottlenecks are the application's constraints on throughput, given a load exceeding its capacity. The most common bottleneck is the relational database; so much so that optimizing or replacing relational databases with non-relational systems, without first analyzing the whole system for constraints, is becoming almost as common a source of problems.

Because of its detailed breakdown of throughput, New Relic is a great service for finding bottlenecks in applications in their production environments. And finding bottlenecks there, rather than in artificial load-test environments, is crucial even if both are used.
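
When a full profiling service isn't in place, even crude instrumentation in production can point at the constraint. The TypeScript sketch below wraps calls to a suspect dependency and accumulates time per label; the "db:list-invoices" label and runQuery call in the usage comment are hypothetical.

```typescript
// Accumulate time spent per labelled dependency call to locate bottlenecks.
const timings = new Map<string, { totalMs: number; calls: number }>();

async function timed<T>(label: string, work: () => Promise<T>): Promise<T> {
  const started = Date.now();
  try {
    return await work();
  } finally {
    const entry = timings.get(label) ?? { totalMs: 0, calls: 0 };
    entry.totalMs += Date.now() - started;
    entry.calls += 1;
    timings.set(label, entry);
  }
}

// Usage: const rows = await timed("db:list-invoices", () => runQuery("SELECT ..."));
```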

That’s a lot to think about; keep it simple

The ability to constantly fine-tune production capacity requires fine-grained monitoring of efficiency as a function of holistic cost, production throughput, and customer satisfaction.

The best advice we can give is to forget all of this, gather a few napkins and stakeholders, and sketch the goals you have that monitoring supports and how it could do so. Afterwards, worry about the what, when, and how.