In general, an Uptime value reflects the percentage of time, within a specified period, during which Dotcom-Monitor receives successful responses from monitoring agents (“Agents” hereinafter). The Downtime value reflects the percentage of time, within a specified period, during which Dotcom-Monitor receives negative responses. Dotcom-Monitor’s worldwide network of Agents interprets these responses using a specific method. However, this basic description of Uptime vs. Downtime values doesn’t account for the Uptime calculation realities that many organizations face. When organizations consider their processes and goals, it becomes clear that choices need to be made, because the definitions of the Uptime value and the Downtime value are interdependent. The two values influence each other.
With that in mind, below are several examples that raise the question “How do you define Downtime?”:
- If you have scheduled maintenance on your web server every Sunday evening, is your website down?
- If your Chicago-based web server cannot be reached from Orlando, FL (but is available from the rest of the USA) because the backbone provider Time Warner is having an issue in Orlando, is your website down?
- If a third-party hosted element (say, a chat widget) is experiencing a server error, but the rest of your website is available, is your website down?
- If your website is not available from anywhere in the world due to a server hiccup that lasts 5 seconds, is your website down?
- If your retail website’s shopping cart is working, but your About Us page is not, is your website down?
- If one DNS server is down, but three others are working, so 25% of clients cannot get access to the website after the cached time-to-live (TTL) expires, is it a down condition?
- If one of three web servers in a web farm goes down and the page response time increases by 10%, or 25%, or 50% (slower page load time), at what point is this downtime?
Two Approaches to Calculating Uptime
Because different organizations have different uptime and downtime definitions, and these definitions can change over time, Dotcom-Monitor has two approaches to calculating Uptime.
The first calculation (“Regular approach” hereinafter) fits the needs of most organizations, as it provides precise filtering of Uptime vs. Downtime definitions. The second calculation (“Weighted approach” hereinafter) does not use Filters.
Regular Uptime/Downtime calculation approach
The Regular approach, the most widely used, provides the ability to carefully define how Dotcom-Monitor interprets responses as either “up” or “down” responses. This is accomplished using filters. (Incidentally, a filter can be applied both to a device, which cuts false triggering, and to any type of Reporting.)
Filtering defines the UP/DOWN state using the following adjustable criteria:
- Error is reported for a specified number of Minutes
- Error is confirmed by a specified number of Agents
- Error is detected in a specified number of Tasks.
All filters and their settings are available at Configure > Filters:
After a Filter is applied to a Device, all of the Device’s notifications are based on the Filter’s criteria.
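As a rough illustration of how the three filter criteria combine, here is a minimal Python sketch. The `Filter` class, its field names, and the `is_down` function are hypothetical and are not part of any Dotcom-Monitor API. The sketch also assumes that a threshold counts as met once the observed value reaches or exceeds it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    # Hypothetical field names that mirror the three criteria listed above.
    min_error_minutes: int  # error must be reported for this many minutes
    min_error_agents: int   # error must be confirmed by this many Agents
    min_failed_tasks: int   # error must be detected in this many Tasks


def is_down(f: Filter, error_minutes: float, error_agents: int, failed_tasks: int) -> bool:
    """Return True only when all three filter conditions are met together.

    Assumption: "met" means the observed value reaches or exceeds the threshold.
    """
    return (
        error_minutes >= f.min_error_minutes
        and error_agents >= f.min_error_agents
        and failed_tasks >= f.min_failed_tasks
    )


example = Filter(min_error_minutes=2, min_error_agents=3, min_failed_tasks=1)
# Only two Agents report an error, so the filter keeps the device "up".
print(is_down(example, error_minutes=5, error_agents=2, failed_tasks=1))
# A third Agent confirms the error, so every condition is now met.
print(is_down(example, error_minutes=5, error_agents=3, failed_tasks=1))
```

With the example filter, the first call prints `False` and the second prints `True`.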
The Formula for the Downtime calculation is as follows:
- Downtime Duration is tied directly to the configurations within the Filter:
- Downtime starts when a filter’s conditions are met. For example, when the number of agents reporting a failure equals the number of agents specified in the filter, and the conditions specified for the number of minutes and tasks are also met, a downtime alert is sent.
- Uptime starts when the filter’s conditions are no longer met. Specifically, uptime starts when the number of agents, minutes, or tasks reporting errors no longer meets the conditions required for the filtered “down” state. For example, an “up” state is indicated when the number of agents reporting error (“down”) responses becomes less than the number of agents that the filter requires in order to indicate a “down” condition.
- Dotcom-Monitor calculates uptime/downtime based on responses gathered over 24 hours. Because the calculation begins with the first monitored response of each 24-hour day, there is often a gap between the start of the day and the first instance of monitoring (due to the monitoring frequency). That gap is considered an “undefined” state and is excluded from the calculation of Uptime. On rare occasions, an undefined state can also be caused by delays in the transit of results. This happens only when every agent involved in monitoring reports “undefined”. In turn, an agent reports “undefined” only when it does NOT receive any response within a certain amount of time.
This duration is calculated depending on the state of the last response (error or success):
- Error response Duration = total number of agents × (monitoring frequency + 15 min)
- Success response Duration = monitoring frequency + 15 min
- Postponing a device at any moment will stop any monitoring activity until it is re-enabled.
- Schedules can also significantly affect Downtime/Uptime calculations. A schedule is an option for managing your monitoring during routine maintenance. Monitoring can be postponed for specific days of the week as well as for specific hours and minutes during a day. To set up a schedule, follow the instructions. A new schedule will appear in the dropdown menu on the Device edit page under Monitoring Scheduler.
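The Error response Duration and Success response Duration formulas given earlier can be worked through with a short Python sketch. The function name `undefined_timeout` and its parameters are hypothetical. Reading the duration as the waiting time before an “undefined” state is reported is an interpretation of the text, not a documented behavior.

```python
from datetime import timedelta

# Fixed 15-minute allowance taken from the two duration formulas.
GRACE = timedelta(minutes=15)


def undefined_timeout(last_response_ok: bool, frequency: timedelta, agent_count: int) -> timedelta:
    """Duration derived from the state of the last response.

    Success: monitoring frequency + 15 min
    Error:   number of agents x (monitoring frequency + 15 min)
    """
    base = frequency + GRACE
    return base if last_response_ok else agent_count * base


freq = timedelta(minutes=5)
print(undefined_timeout(True, freq, agent_count=7))   # 0:20:00
print(undefined_timeout(False, freq, agent_count=7))  # 2:20:00
```

For a 5-minute monitoring frequency and 7 agents, the success case gives 20 minutes and the error case gives 7 × 20 = 140 minutes.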
Let’s say we have a device monitored from 7 locations and a filter that requires 3 locations to report an error for a DOWNTIME condition. The first monitoring node (agent 1) detects an error while the rest are still reporting success, then the second (agent 2) does, and finally the third one (agent 4) detects an error at ‘T4’, which triggers the filter to set the beginning of “downtime” at that moment. The DOWN state remains until you set a hypothetical “Postpone” at ‘T5’, because the number of Agents reporting errors is not lower than the configured 3 throughout this time. The gap between T6 and T7 illustrates that the first response arrives with a delay (monitoring session processing time includes network transfer delays and the execution itself), so “Postponed” time is calculated as ∆ (T7–T5) (Postponed 2nd). Again, we fall into DOWNTIME only on the 3rd error, from Agent 3, and return to the UP state only on the T9 response, when the number of failing Agents becomes less than the number configured in the Filter. The final downtime % calculation formula for this case is as follows:
Weighted Uptime/Downtime calculation approach
The Weighted approach is used only when no filter has been applied.
Downtime Duration, in this case, is calculated based on the following rules:
- Downtime is declared upon a negative response from an agent.
- Uptime starts only after a success confirmation is received from all involved agents.
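The two rules above can be sketched as a small state tracker in Python. This covers only the up/down transitions, not the weighted formula itself. The function name `weighted_states`, the tuple layout, and the initial “undefined” state before every agent has reported are assumptions made for illustration.

```python
def weighted_states(agents, responses):
    """Return a list of (timestamp, state) pairs, one per response.

    agents:    names of all agents involved in monitoring
    responses: (timestamp, agent, ok) tuples in time order
    """
    latest = {}          # most recent result seen from each agent
    state = "undefined"  # assumption: no verdict until the rules decide one
    timeline = []
    for ts, agent, ok in responses:
        latest[agent] = ok
        if not ok:
            # Rule 1: any negative response declares downtime.
            state = "down"
        elif all(latest.get(a) is True for a in agents):
            # Rule 2: uptime starts only once every agent has confirmed success.
            state = "up"
        timeline.append((ts, state))
    return timeline


events = [
    (1, "A", True), (2, "B", True), (3, "C", True),  # all confirm: up at 3
    (4, "B", False),                                 # one error: down
    (5, "A", True),                                  # B still failing: stays down
    (6, "B", True),                                  # all confirm again: up
]
for ts, state in weighted_states(["A", "B", "C"], events):
    print(ts, state)
```

In this run the state is “undefined” at timestamps 1 and 2, “up” at 3, “down” at 4 and 5, and “up” again at 6.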
The general formula is:
Example that illustrates this approach
The timeline, from 1 to 100 percent, is divided into periods by each new response. In our example, the duration of the first period equals ∆ (T2–T1). For each period we calculate Weighted Uptime and Weighted Downtime. The final Weighted Downtime and Weighted Uptime values are the sums of the values calculated for each period within the timeline.
Based on the general formula, here are some calculations for each period: