Forecasting visitor interactions and anomaly detection

Anomaly detection can be used as a data-centered approach to check your website for flaws. This approach compares the data flow generated by user interactions with your website to similar historical data. Using Harvest Store, we built an anomaly detection system for a number of our customers that fires alerts whenever part of a website, or its tracking, suddenly performs significantly worse.

If one of your websites or apps stops working, you probably want to know as fast as possible so that you can resolve the issue. If your server is down, you will hopefully receive an alert quite quickly. However, if only a part of your website or app, or its tracking, suddenly performs worse, it might take you quite a while to notice. If such a problem goes unnoticed for a while, it can have a severe impact on the validity of your data analyses and even on your revenue. Hence, rapid discovery of such problems is essential.

One way to test your website for faults is by using or building tools that periodically test every part of your website. For this procedure, you have to think of all the things that can break on your website, write tests for all of these cases, and let this tool crawl your website regularly. In theory, this can be a very effective solution, but in most cases it is practically infeasible: building such crawlers, and anticipating all the ways things can go wrong, takes too much time and too many resources.

Another way of checking your website for errors, which is more reactive in nature, is by using anomaly detection. This approach is based on the data flow generated by the users interacting with your website. The idea is to predict the expected number of events based on historical data, compare this to the number of events you are seeing at the moment, and fire an alert "when something doesn't look right". This approach can be applied to all sorts of events, like pageviews, transactions, or user interactions with funnels on your website. In addition, these observations can be split based on dimensions like browser or device, making the monitoring sensitive to issues that might otherwise go unnoticed. This way you can detect, for example, that your website breaks when Safari deploys an update that is incompatible with your front-end code.

To notice potential irregularities in the incoming user interaction data, you need some historical data to compare the current situation to. A tool that can help you collect data about how users interact with your website is Harvest Store. With Harvest Store, which is built around Splunk, it is possible to search your raw data. You can create custom searches that satisfy your specific needs and interests, and split the data based on the dimensions you are interested in, like browser or device. For a number of our customers, we used Harvest Store to set up an anomaly detection system consisting of a set of such searches. These searches are scheduled to run periodically and fire an alert whenever a particular part of a website, or its tracking, does not seem to work properly. Below, we'll explain how this works.

The simplest form of anomaly detection is simply counting the occurrences of a specific kind of event over a given period, and firing an alert if this count is zero. However, sometimes only a part of a website, or its tracking, has a problem, or a problem only occurs for users with a certain browser or device. In that case, the event count will not be zero, but just lower (or higher!) than "usual". In anomaly detection, we try to predict this "usual" number of events based on comparable periods in the historical data.
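This simplest form of detection can be sketched in a few lines. The function and event names below are hypothetical placeholders; in practice the count would come from a scheduled search over your event store, and the alert would go to email or Slack rather than standard output.

```python
def check_for_silence(events: list, event_type: str) -> bool:
    """Return True (an alert condition) if no events of the given
    type arrived in the batch covering the last period."""
    count = sum(1 for e in events if e.get("type") == event_type)
    if count == 0:
        print(f"ALERT: no '{event_type}' events in the last period")
        return True
    return False

# Example: a batch that contains pageviews but no transactions.
recent = [{"type": "pageview"}, {"type": "pageview"}]
check_for_silence(recent, "pageview")      # count is 2, no alert
check_for_silence(recent, "transaction")   # count is 0, fires an alert
```

As the paragraph above notes, this catches total outages but not partial ones, which is what the standard-deviation approach below addresses.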

The question now is: how much should the current data be allowed to differ from the historical data before we fire an alert? One idea is to fire an alert whenever the current event count is a certain fixed percentage lower than the prediction based on the historical data. However, choosing such a percentage might not be very practical, as the right value differs a lot between data sets and applications.

A more sophisticated method of deciding how much the current data may differ from the historical data is to use the standard deviation of the historical data. If we assume that the data coming from your website is approximately normally distributed for a given time slot of the day, then about 68% of the data values lie within one standard deviation of the mean. Within two standard deviations of the mean, this is even 95%, and about 99.7% of the data should lie within three standard deviations of the mean. This means that if you measure a data point that is more than three standard deviations away from the mean, then this data point is probably anomalous. Depending on how strict you want your anomaly detection system to be, you can choose different thresholds for sending an alert.
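The rule above can be expressed as a small, self-contained check using Python's standard library. The number of standard deviations (`n_sigma`) is the strictness knob mentioned above; the history values are made up for illustration.

```python
import statistics

def is_anomalous(current: float, history: list, n_sigma: float = 3.0) -> bool:
    """Flag a value that lies more than n_sigma standard deviations
    from the mean of the historical observations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(current - mean) > n_sigma * stdev

# Hypothetical hourly event counts for comparable past time slots.
history = [100, 105, 98, 102, 99, 101, 103, 97]

is_anomalous(104, history)  # close to the mean -> False
is_anomalous(60, history)   # far below the mean -> True
```

Note that the check is two-sided: a count that is suspiciously high is flagged just like one that is suspiciously low, which matches the "lower (or higher!)" remark earlier.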

Let’s say we are setting up anomaly detection for the number of pageviews of the different pages of a certain website. Most likely the number of pageviews varies during the day, and perhaps even during the week. Using data that we have already collected for a while in Harvest Store, we record the average number of pageviews per page per hour per day of the week, as well as the standard deviation for every time slot and page. This record thus represents the periodic change in pageviews that we consider “normal” for this particular website. Then, we set up a search in Harvest Store that checks every hour how many pageviews each page had in the last hour, and compares these values to the averages that we saved in our table. We decide that if the current number of pageviews for a certain page is lower than its historical average minus two times the standard deviation for that time slot, an alert is sent by email.
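The hourly comparison described above boils down to a lookup and a threshold. The sketch below uses a plain dictionary keyed by (page, weekday, hour) in place of the table stored in Harvest Store; the page name and the mean and standard deviation values are illustrative, not real customer data.

```python
# Baseline table: (page, weekday, hour) -> (mean pageviews, standard deviation),
# computed from historical data for each time slot.
baseline = {
    ("/checkout", "mon", 14): (500.0, 40.0),
}

def check_pageviews(page, weekday, hour, current_count, baseline, n_sigma=2.0):
    """Return an alert message if the current count falls below
    mean - n_sigma * stdev for this time slot, else None."""
    mean, stdev = baseline[(page, weekday, hour)]
    threshold = mean - n_sigma * stdev
    if current_count < threshold:
        return (f"ALERT: {page} had {current_count} pageviews, "
                f"expected at least {threshold:.0f}")
    return None

check_pageviews("/checkout", "mon", 14, 310, baseline)  # below 420 -> alert
check_pageviews("/checkout", "mon", 14, 480, baseline)  # within range -> None
```

In a real setup, the search engine runs this comparison on a schedule and the returned message is routed to email or Slack instead of being returned to the caller.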

We have successfully implemented setups similar to the one described above for a number of our customers. They receive notices through email, Slack, or other channels that warn them when part of their website or app, or the corresponding tracking, does not operate correctly. This turns out to be a very valuable addition to their infrastructure, since for many of them a lot of people work on the online platforms, making errors and conflicting changes inevitable. New browser and OS compatibility issues might also otherwise go unnoticed for a long time. An alerting system based on the flow of raw data allows for the early detection of such issues and makes it possible to fix them before they do real harm.