
API Monitoring: What should you measure?

When it comes to monitoring third-party APIs and web services, what you monitor is as important as how you monitor. Data is useful, but actionable data is where the true value is. Below we've listed the most common and valuable metrics to monitor when relying on third-party API integrations and web services. Accurate monitoring and alerting can provide your business with the data it needs to make decisions about which APIs to use, how to build resilient applications, and where to focus your engineering efforts.

Here are the metrics that we recommend when you start to monitor an API or web services:

Latency

Latency is the time the message spends “on the wire”. Here, the shorter the number, the better. Latency is caused by the network connection between your server and the API server, and by delays introduced along that path. These delays may be the result of network traffic or resource overload, in which case throttling your requests might accommodate the heavy load.

To monitor latency, track timestamps for each outgoing request and its incoming response, and compare the round-trip times against past measurements over a given period. This can still be tricky, as the measured round trip also includes the server's response time. If available, pinging an endpoint or calling a lightweight health-check endpoint, where server processing is negligible, is the best way to get an accurate latency estimate.
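
As a minimal sketch of the timestamp-comparison approach (the `LatencyTracker` name and the window size are illustrative, not from any particular library):

```python
from collections import deque

class LatencyTracker:
    """Track round-trip times and report a rolling average."""

    def __init__(self, window=100):
        # Keep only the most recent `window` samples so the average
        # reflects current network conditions, not all-time history.
        self.samples = deque(maxlen=window)

    def record(self, sent_at, received_at):
        """Record one round trip from its outgoing/incoming timestamps."""
        self.samples.append(received_at - sent_at)

    def average(self):
        """Rolling average round-trip time, or None with no samples."""
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)
```

In practice you would read `time.monotonic()` just before sending a request to the health-check endpoint and again when the response arrives, then pass both timestamps to `record`.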

This evaluation can be useful when positioning servers geographically. By determining the lowest latency, your business can decide which provider to select. You can also select specific regional provider services if latency proves to be the true cause of delayed responses, or different providers if the response time of their resources is the problem. In actual practice, latency and response time will often be combined as a single value.

Response Time

Where latency takes into account the delays in the network itself, response time is the time it takes a service to respond to a request. This can be harder to track with third-party APIs and web services, as the latency to send and receive data is folded into the response time. You can estimate it by comparing response times across multiple resources on a given API. From this, you can estimate the shared latency between the API's servers and your servers, and subtract it to approximate the true server-side response time.
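
One way to sketch that estimate, assuming you can also time a cheap endpoint (such as a health check) on the same API, is to treat its fastest round trip as the shared latency and subtract it:

```python
def estimate_processing_time(response_times, baseline_times):
    """Estimate server-side processing time for each API response.

    `baseline_times` are round-trip times against a lightweight endpoint
    on the same API; their minimum approximates the shared network
    latency. Subtracting it from each measured response time leaves a
    rough estimate of the time the server spent doing real work.
    """
    latency_estimate = min(baseline_times)
    # Clamp at zero: network jitter can make a response appear faster
    # than the latency baseline.
    return [max(rt - latency_estimate, 0.0) for rt in response_times]
```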

The response time has a direct effect on your application's performance. Delays in the response of an API will result in slower interactions for your users. You can avoid this by ensuring your chosen API providers have response time guarantees, or by implementing a solution that uses a fallback API or cached resources when spikes are detected.

Availability

The availability of an API can be described as either downtime or uptime. Both are based on the same data, but can tell a different story depending on the context.

Availability is perhaps the easiest metric to keep track of. Downtime errors are recognizable and sometimes expected, as an API provider will announce scheduled outages. However, even the most reliable APIs experience unforeseen downtime. Downtime can be presented as individual events, or as an overall average across a given period. While downtime quotas and assurances such as "99.999% uptime" can be valuable when assessing an API provider, even the smallest downtime can have a large impact on your application.
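
Downtime events and uptime percentages come from the same data; here is a sketch of the arithmetic, assuming each outage is recorded as a (start, end) pair in seconds:

```python
def uptime_percentage(period_seconds, downtime_events):
    """Uptime over a period, given (start, end) downtime events in seconds."""
    down = sum(end - start for start, end in downtime_events)
    return 100.0 * (period_seconds - down) / period_seconds
```

For example, a single 864-second outage in a 24-hour day works out to 99.0% uptime, which makes concrete how little room a "99.999% uptime" assurance actually allows.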

Many APIs rely on external providers such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Services. As a result, the downtime of an individual web service provider is also now dependent on a third party that your application does not directly do business with. Even if the API provider's services are running as expected, the third party may not be. As a result, when a large outage occurs, you will want to have a fallback in place that does not rely on the same underlying provider as the original API.

While similar to downtime in the way it is measured, the uptime of an API can provide insight into business decisions. If you know that one API has better uptime during key business hours for your customers, you can use this metric to move between providers.

Where some stakeholders may look at downtime when deciding which API provider to drop, others may look at uptime when deciding which to select. These values are linked; however, they can tell different data stories.

Consumption

It can be easy to forget usage, or consumption, when monitoring APIs. Internal APIs may not require a usage metric, but telemetry into third-party API consumption can aid in making business decisions. Estimating costs when consuming a web service can be difficult without the appropriate data. Consumption can be evaluated as a whole, or in bursts. Some API providers bill on a monthly scale, while others enforce rate limits on their pricing tiers that track usage over smaller time windows.

By keeping track of consumption and setting alerts for high usage, you can avoid unnecessary costs. Additionally, recognizing when APIs are not being used can also be beneficial. A lack of consumption is a sign that an API is still part of your codebase, but may not be vital to your application. In this case, you can adjust feature priority and gain insight into the usage of your application.
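
A sketch of a burst-aware usage alert, with the window and limit as hypothetical values you would take from your provider's pricing tier:

```python
def over_consumption_limit(call_timestamps, window_seconds, limit, now):
    """Return True when calls inside the trailing window exceed the limit.

    `call_timestamps` are the times (in seconds) of past API calls.
    Checking a trailing window catches bursts that a monthly total
    would hide.
    """
    recent = [t for t in call_timestamps if now - t <= window_seconds]
    return len(recent) > limit
```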

Consumption is best viewed as a running value, filterable by time window. This allows dashboards to provide an overview, as well as granular detail about when an API is being used.

Failure rate

There are a variety of reasons a request can fail. When a request to a third-party API or web service fails, it may be due to user error, API downtime, rate limiting, or a variety of network-related issues. While API failures can sometimes be caused by your application, when tracking third-party APIs you want to focus primarily on failures outside your control.

Tracking failures and determining failure rates can aid in:

  1. Reporting problems to the API provider
  2. Deciding between multiple API providers
  3. Making informed decisions related to fallback scenarios
  4. Building resiliency around certain resources
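
A sketch of the rate itself, counting only status codes that typically point at the provider rather than at your own request (the exact code set is an assumption to adjust for your APIs):

```python
def provider_failure_rate(status_codes,
                          provider_errors=frozenset({429, 500, 502, 503, 504})):
    """Fraction of responses indicating a provider-side failure.

    429 (rate limiting) and the common 5xx codes are counted; 4xx codes
    caused by invalid requests are left out, since those are failures
    within your control.
    """
    if not status_codes:
        return 0.0
    failures = sum(1 for code in status_codes if code in provider_errors)
    return failures / len(status_codes)
```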

Some errors come from invalid requests; status codes in the 400 range are typically a sign that your application needs to adjust its internal validation before making a request. Errors in the 500 range come from server-related issues, and indicate that the problem is likely with the API or web service provider.

Status Codes

Tracking HTTP responses can give you granular details about an individual API, but tracking specific status codes can give you better insight into the type of problems. For example, some API providers will respond with a 200 OK status, even when an error occurred. This false metric may lead you to believe that everything is working as expected, but users may experience problems and your internal logging may tell a different story.

Comparing status code metrics from API providers with internal error logs can provide additional insight into the true error rates of the third-party web services your application relies on.
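
One hedged sketch of that comparison, assuming you keep a tally of status codes per provider and a separate count of internally logged failures:

```python
def reconcile_error_rates(status_counts, internal_error_count, total_requests):
    """Compare provider-reported HTTP errors with internally logged failures.

    `status_counts` maps status code -> occurrences. A positive
    "hidden_error_rate" suggests the provider returned 200 OK on
    requests that still failed from your application's point of view.
    """
    http_errors = sum(n for code, n in status_counts.items() if code >= 400)
    return {
        "http_error_rate": http_errors / total_requests,
        "internal_error_rate": internal_error_count / total_requests,
        "hidden_error_rate": max(internal_error_count - http_errors, 0) / total_requests,
    }
```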
