Dotcom-Monitor lets you define precisely how responses are interpreted as either “up” or “down”. This is done with filters.
Filtering defines the UP/DOWN state using the following adjustable criteria:
- Error is reported for a specified number of Minutes
- Error is confirmed by a specified number of Agents
- Error is detected in a specified number of Tasks.
All filters and their settings are available under Configure > Filters.
After a Filter is applied to a Device, all of the Device’s notifications are based on the Filter’s criteria.
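The three criteria can be pictured as a simple threshold check. The following is a minimal sketch, not Dotcom-Monitor's implementation: the `Filter` class, its field names, and the `is_down` function are hypothetical, and the sketch assumes that all three criteria must hold at the same time for a “down” state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    """Hypothetical model of a filter; the field names are illustrative only."""
    min_minutes: int  # error must be reported for at least this many minutes
    min_agents: int   # error must be confirmed by at least this many agents
    min_tasks: int    # error must be detected in at least this many tasks


def is_down(f: Filter, error_minutes: float, failing_agents: int, failing_tasks: int) -> bool:
    """Return True only when all three filter criteria are met (assumed AND logic)."""
    return (
        error_minutes >= f.min_minutes
        and failing_agents >= f.min_agents
        and failing_tasks >= f.min_tasks
    )


example = Filter(min_minutes=5, min_agents=3, min_tasks=1)
print(is_down(example, error_minutes=6, failing_agents=2, failing_tasks=1))  # False
print(is_down(example, error_minutes=6, failing_agents=3, failing_tasks=1))  # True
```

With two failing agents the device stays “up” because the agent threshold of 3 is not reached; a third failing agent flips the result to “down”.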
The Formula for the Downtime calculation is as follows:
- Downtime Duration is tied directly to the configurations within the Filter:
- Downtime starts when the filter’s conditions are met. For example, when the number of agents reporting a failure reaches the number of agents specified in the filter, and the filter’s conditions for the number of minutes and tasks are also met, a downtime alert is sent.
- Uptime starts when the filter’s conditions are no longer met. Specifically, uptime starts as soon as the agent, minute, or task counts no longer satisfy the filter’s “down” conditions. For example, an “up” state is indicated when the number of agents reporting error (“down”) responses drops below the number that the filter requires to indicate a “down” condition.
- Dotcom-Monitor calculates uptime/downtime based on responses gathered over 24 hours. Because the calculation begins with the first monitored response of each 24-hour day, there is often a gap between the start of the day and the first instance of monitoring (due to the monitoring frequency). That gap is considered an “undefined” state and is excluded from the calculation of uptime. Rarely, an undefined state can also be caused by delays in the transit of results. This happens only when every agent involved in monitoring reports undefined. In turn, an agent reports undefined only if it does NOT receive any response within a certain amount of time.
This amount of time depends on the last response state (error or success):
- Error response Duration = total number of agents × (monitoring frequency + 15 min)
- Success response Duration = monitoring frequency + 15 min
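As a worked illustration of the two duration formulas, here is a minimal sketch. The function name `undefined_timeout_minutes` and its parameters are hypothetical; only the arithmetic comes from the formulas above.

```python
def undefined_timeout_minutes(last_state: str, frequency_min: float, total_agents: int) -> float:
    """Minutes without a response before an undefined state is reported.

    Hypothetical helper: mirrors the two formulas in the text, nothing more.
    """
    grace_min = 15  # fixed 15-minute allowance taken from the formulas
    if last_state == "error":
        return total_agents * (frequency_min + grace_min)
    if last_state == "success":
        return frequency_min + grace_min
    raise ValueError("last_state must be 'error' or 'success'")


# Example: 7 agents, 5-minute monitoring frequency
print(undefined_timeout_minutes("error", 5, 7))    # 140
print(undefined_timeout_minutes("success", 5, 7))  # 20
```

With these example values, the wait after an error response is seven times longer than after a success response, because the error formula scales with the number of agents.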
- Postponing a device at any moment will stop any monitoring activity until it is re-enabled.
- Schedules can also significantly affect DOWNTIME/UPTIME calculations. A schedule is an option for managing your monitoring during routine maintenance. Monitoring can be postponed for specific days of the week as well as for specific hours and minutes during a day. To set up a schedule, follow the schedule setup instructions. A new schedule will appear in the dropdown menu on the Device edit page under Monitoring Scheduler.
Suppose a device is monitored from 7 locations and the filter requires 3 locations to report an error for a DOWNTIME condition. The first monitoring node (agent 1) detects an error while the rest are still reporting success, then the second (agent 2) does, and finally a third (agent 4) detects an error at ‘T4’, which triggers the filter and marks the beginning of downtime at that moment. The DOWN state remains until a hypothetical “Postpone” is set at ‘T5’, because the number of agents reporting errors does not drop below the configured 3 during that time. The gap between T6 and T7 illustrates that the first response arrives with a delay (monitoring session processing time includes network transfer delays and the execution itself), so the “Postponed” time is calculated as ∆ (T7–T5) (Postponed 2nd). Again, the device enters DOWNTIME only on the 3rd error, from Agent 3, and returns to the UP state only on the T9 response, when the number of failing agents becomes lower than the number set in the filter. The final downtime % calculation formula for this case is: