Understanding Incidents

This guide explains UptimeIO’s intelligent incident detection system and how to interpret incident data.

What is an Incident?

An incident represents a period when your monitored service is unavailable or not meeting expectations. UptimeIO creates incidents only after confirming the issue across multiple regions to avoid false positives.

How Incidents Are Created

UptimeIO uses a multi-region consensus system to ensure incidents are real, not transient network issues.

Initial Check Fails

Your monitor performs a check from Region A (e.g., US East).

❌ Check failed: Connection timeout
Region: US East
Time: 10:30:00

No incident created yet - could be a temporary issue.

First Retry (1 second later)

UptimeIO automatically retries from Region B (e.g., Europe).

❌ Check failed: Connection timeout
Region: Europe
Time: 10:30:01

Still no incident - waiting for final confirmation.

Second Retry (1 second later)

Final retry from Region C (e.g., Asia).

❌ Check failed: Connection timeout
Region: Asia
Time: 10:30:02

Incident created! 3 failures from 2+ different regions confirm the issue is real.

Notifications Sent

All integrations in your notification profiles receive alerts:

📧 Email notifications
📱 SMS messages
💬 Slack/Discord messages
🔗 Webhook calls

Why 3 failures from 2+ regions? This consensus mechanism virtually eliminates false positives caused by:

Temporary network glitches
Single region outages
ISP routing issues
Transient server hiccups
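
The consensus rule above can be sketched in a few lines. This is an illustration only, not UptimeIO's actual implementation: `CheckResult` and `should_open_incident` are hypothetical names, and the sketch assumes the three failures must be consecutive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    region: str  # e.g. "US East"
    ok: bool     # True if the check succeeded


def should_open_incident(results, required_failures=3, min_regions=2):
    """Return True if the most recent checks satisfy the consensus rule.

    Assumption: failures must be consecutive, so any success among the
    last `required_failures` results means no incident is opened.
    """
    recent = results[-required_failures:]
    if len(recent) < required_failures:
        return False
    if any(r.ok for r in recent):
        return False
    # Count distinct regions among the failing checks.
    regions = {r.region for r in recent}
    return len(regions) >= min_regions
```

Three failures from a single region do not open an incident under this rule, which is how a single-region outage gets filtered out.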

Incident Timeline

Each incident has a detailed timeline showing exactly what happened:

Incident #1234 - API Server Down
Duration: 5 minutes 23 seconds

Timeline:
├─ 10:30:00 - Check failed (US East)
│  Error: Connection timeout after 10000ms
│  Status: Retrying...
│
├─ 10:30:01 - Check failed (Europe)
│  Error: Connection timeout after 10000ms
│  Status: Retrying...
│
├─ 10:30:02 - Check failed (Asia)
│  Error: Connection timeout after 10000ms
│  Status: Incident created
│  Notifications: Sent to 3 integrations
│
├─ 10:35:23 - Check succeeded (US East)
│  Status: 200 OK
│  Response time: 145ms
│  Status: Recovery attempt 1/3
│
├─ 10:35:24 - Check succeeded (Europe)
│  Status: 200 OK
│  Response time: 152ms
│  Status: Recovery attempt 2/3
│
└─ 10:35:25 - Check succeeded (Asia)
   Status: 200 OK
   Response time: 138ms
   Status: Incident resolved ✅
   Notifications: Sent to 3 integrations

Incident Recovery

Recovery works the same way as incident creation: 3 successful checks from 2+ regions are required to resolve an incident.

This prevents “flapping” where a service goes up and down rapidly, creating alert fatigue.

First Success

After the incident is created, the next scheduled check succeeds.

✅ Check succeeded
Region: US East
Status: Recovery attempt 1/3

Second Success

One second later, a check from another region succeeds.

✅ Check succeeded
Region: Europe
Status: Recovery attempt 2/3

Third Success

Final confirmation from a third region.

✅ Check succeeded
Region: Asia
Status: Incident resolved ✅

Recovery notifications sent to all integrations.
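
The symmetric create/resolve behaviour can be pictured as a small state machine. The sketch below is a hypothetical model, not UptimeIO's code: the `IncidentTracker` class and its method names are invented for illustration, and it assumes that any check agreeing with the current state resets the confirmation streak.

```python
class IncidentTracker:
    """Tracks open/resolved state, requiring confirmation on both edges."""

    def __init__(self, required=3, min_regions=2):
        self.required = required
        self.min_regions = min_regions
        self.open = False
        # Regions of consecutive checks that contradict the current state.
        self._streak = []

    def record(self, region, ok):
        """Feed one check result; return "created", "resolved", or None."""
        # A failure contradicts "no incident"; a success contradicts "open".
        contradicts = ok == self.open
        if not contradicts:
            # Assumption: a single agreeing check resets the streak.
            self._streak.clear()
            return None
        self._streak.append(region)
        enough_checks = len(self._streak) >= self.required
        enough_regions = len(set(self._streak)) >= self.min_regions
        if enough_checks and enough_regions:
            self.open = not self.open
            self._streak.clear()
            return "created" if self.open else "resolved"
        return None
```

Because one success followed by a failure resets the streak, a flapping service stays in a single open incident instead of generating a resolve and re-create pair.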

Incident Types

Monitor Down

The primary incident type when health checks fail.

Type: Monitor Down
Cause: Service unreachable
Trigger: 3 failed checks from 2+ regions
Resolution: 3 successful checks from 2+ regions

Common causes:

Server down or restarting
Network connectivity issues
DNS resolution failures
Firewall blocking requests
Application crashes

Slow Response

Created when response times exceed your configured threshold.

Type: Slow Response
Threshold: 2000ms
Actual: 3500ms
Trigger: Response time > threshold
Resolution: Response time < 80% of threshold for 3 checks

Example:

Threshold: 2000ms
Incident created: Response time 2500ms
Incident resolved: Response time drops below 1600ms (80% of 2000ms) for 3 consecutive checks

Slow response incidents are separate from downtime incidents. Your monitor can be “UP” but have an active slow response incident.
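
The gap between the trigger threshold and the 80% resolution level is a form of hysteresis. A minimal sketch, assuming one response time per check and hypothetical names (`slow_response_events` and its parameters are not part of UptimeIO's API):

```python
def slow_response_events(response_times_ms, threshold_ms=2000,
                         recovery_ratio=0.8, recovery_checks=3):
    """Return a list of (check_index, event) pairs.

    event is "created" when a response exceeds the threshold, and
    "resolved" after `recovery_checks` consecutive responses below
    recovery_ratio * threshold_ms.
    """
    events = []
    active = False
    good_streak = 0
    recovery_level = threshold_ms * recovery_ratio
    for i, rt in enumerate(response_times_ms):
        if not active:
            if rt > threshold_ms:
                active = True
                good_streak = 0
                events.append((i, "created"))
        elif rt < recovery_level:
            good_streak += 1
            if good_streak >= recovery_checks:
                active = False
                events.append((i, "resolved"))
        else:
            # Still too slow to count toward recovery; restart the streak.
            good_streak = 0
    return events
```

With a 2000ms threshold, a response of 1800ms is under the trigger level but above the 1600ms recovery level, so it neither opens a new incident nor counts toward resolving an active one.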

SSL Certificate Expiry

Alerts before SSL certificates expire.

Type: SSL Certificate Expiry
Certificate: example.com
Expires: 2024-02-15
Days remaining: 7
Severity: Warning

Warning periods (configurable):

30 days before expiry
7 days before expiry
1 day before expiry
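
As a rough illustration of how the warning periods relate to the "Days remaining" field, the sketch below computes the days left and checks them against the configured periods. The function name `ssl_warning` is hypothetical, and the sketch assumes a warning fires on the exact day a period is reached; the real scheduling may differ.

```python
from datetime import date


def ssl_warning(expires, today, warning_days=(30, 7, 1)):
    """Return (days_remaining, matched_period).

    matched_period is the warning period reached today, or None.
    Assumption: a warning fires only on the exact matching day.
    """
    remaining = (expires - today).days
    period = remaining if remaining in warning_days else None
    return remaining, period


# A certificate expiring 2024-02-15, checked on 2024-02-08.
remaining, period = ssl_warning(date(2024, 2, 15), date(2024, 2, 8))
```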

DNS Error

Triggered when DNS resolution fails or returns unexpected values.

Type: DNS Error
Domain: example.com
Expected: 93.184.216.34
Actual: 10.0.0.1
Cause: DNS record changed
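
The two trigger conditions (resolution failure, or an unexpected value) can be sketched as a comparison of expected and resolved addresses. This is a hypothetical helper, not UptimeIO's implementation; resolution itself is left to the caller so the example makes no network calls, and the dictionary keys simply mirror the fields shown above.

```python
def dns_incident(domain, expected_ips, resolved_ips):
    """Return an incident dict, or None if the DNS answer looks as expected.

    resolved_ips should be None or empty when resolution failed.
    """
    if not resolved_ips:
        return {"type": "DNS Error", "domain": domain,
                "cause": "DNS resolution failed"}
    # Any resolved address outside the expected set counts as a mismatch.
    unexpected = set(resolved_ips) - set(expected_ips)
    if unexpected:
        return {
            "type": "DNS Error",
            "domain": domain,
            "expected": sorted(expected_ips),
            "actual": sorted(resolved_ips),
            "cause": "DNS record changed",
        }
    return None
```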

Reading Incident Details

Incident Status

Status	Meaning
Open	Incident is active; the underlying issue is still ongoing
Resolved	Service recovered, incident closed

Incident Metadata

Each incident includes:

Start Time

When the incident was created (after 3 failures were confirmed).

Started: 2024-01-15 10:30:02 UTC

Duration

How long the incident lasted.

Duration: 5 minutes 23 seconds

For open incidents, this shows the time elapsed so far.
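
If you need to reproduce the duration string from raw timestamps (for example, in a report), the following sketch shows one way to do it. `format_duration` is a hypothetical helper; for an open incident, pass the current time as `ended`.

```python
from datetime import datetime, timezone


def _unit(value, name):
    # Pluralize simple English unit names: "1 minute", "5 minutes".
    return f"{value} {name}" + ("" if value == 1 else "s")


def format_duration(started, ended):
    """Format the time between two datetimes, e.g. "5 minutes 23 seconds"."""
    total = int((ended - started).total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    parts = []
    if hours:
        parts.append(_unit(hours, "hour"))
    if minutes:
        parts.append(_unit(minutes, "minute"))
    if seconds or not parts:
        parts.append(_unit(seconds, "second"))
    return " ".join(parts)


started = datetime(2024, 1, 15, 10, 30, 2, tzinfo=timezone.utc)
resolved = datetime(2024, 1, 15, 10, 35, 25, tzinfo=timezone.utc)
print(format_duration(started, resolved))  # 5 minutes 23 seconds
```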

Verification Details

Consensus information:

Failed Checks: 3
Regions: US East, Europe, Asia
Providers: Vultr, Scaleway, DigitalOcean

Error Details

The specific error from the first failed check:

Error: Connection timeout after 10000ms
Status Code: N/A
Region: US East
Response Time: N/A

Affected Monitor

Which monitor triggered the incident:

Monitor: API Server
URL: https://api.example.com/health
Type: HTTP/HTTPS

Incident Notifications

When an incident is created or resolved, notifications are sent through all integrations in your assigned notification profiles.

Incident Created

Subject: [UptimeIO] Monitor Down: API Server

Your monitor "API Server" is down.

URL: https://api.example.com/health
Started: 2024-01-15 10:30:02 UTC

Error: Connection timeout after 10000ms
Region: US East

Verification:
✓ 3 failed checks from 3 different regions
✓ Regions: US East, Europe, Asia

View incident: https://app.uptimeio.com/incidents/inc_123

Incident Resolved

Subject: [UptimeIO] Monitor Recovered: API Server

Your monitor "API Server" has recovered.

URL: https://api.example.com/health
Downtime: 5 minutes 23 seconds
Resolved: 2024-01-15 10:35:25 UTC

View incident: https://app.uptimeio.com/incidents/inc_123

Preventing False Positives

UptimeIO’s consensus system prevents false positives, but you can further reduce them:

Use multiple regions

Always monitor from at least 2 regions (required). For critical services, use 3-4 regions.

Set appropriate timeouts

Fast APIs: 10-15 seconds
Standard sites: 30 seconds
Slow services: 45-60 seconds

Timeouts that are too short cause false failures.

Configure expected status codes correctly

Ensure your expected status codes match what your service actually returns.

# Default (most sites)
Expected: 200-299, 300-399

# API that returns 201 for creation
Expected: 200, 201

# Maintenance mode (temporarily)
Expected: 503
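
To sanity-check a status code specification before saving it, you can model the matching logic locally. The sketch below assumes the comma-separated "code or low-high range" syntax shown above; `parse_expected` and `status_matches` are hypothetical names, not part of UptimeIO.

```python
def parse_expected(spec):
    """Parse "200-299, 300-399" or "200, 201" into inclusive (low, high) ranges."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        low, sep, high = part.partition("-")
        # A bare code such as "201" becomes the range (201, 201).
        ranges.append((int(low), int(high) if sep else int(low)))
    return ranges


def status_matches(status, spec):
    """Return True if the HTTP status falls inside any expected range."""
    return any(low <= status <= high for low, high in parse_expected(spec))
```

For instance, `status_matches(204, "200, 201")` is False, so an endpoint that returns 204 No Content would be reported as failing under that configuration.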

Whitelist UptimeIO

If you use a firewall or rate limiting, whitelist UptimeIO’s user agent:

User-Agent: UptimeIO-Monitor/1.0
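
In application-level rate limiting, the exemption might look like the sketch below. The helper name is hypothetical, and matching on the `UptimeIO-Monitor/` prefix rather than the full string is an assumption made so that a future version bump would still match.

```python
UPTIMEIO_UA_PREFIX = "UptimeIO-Monitor/"


def is_uptimeio_request(headers):
    """Return True if the User-Agent header identifies the UptimeIO monitor.

    `headers` is a plain dict; header names are compared case-insensitively.
    """
    for name, value in headers.items():
        if name.lower() == "user-agent":
            return value.strip().startswith(UPTIMEIO_UA_PREFIX)
    return False
```

Keep in mind that any client can send this header, so a User-Agent match is suitable for skipping rate limits on a health endpoint but not for granting access to anything sensitive.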

Incident History

View all past incidents for a monitor:

Go to Monitor Details

Click on any monitor from your monitors list.

Scroll to Recent Incidents

The “Recent Incidents” section shows the last 10 incidents.

View All Incidents

Click “View All Incidents” to see complete history.

Filter and Search

Filter by:

Date range
Status (open/resolved)
Incident type
Duration

Incident Metrics

UptimeIO calculates key metrics from your incident history:

Metric	Description	Calculation
Uptime %	Percentage of time service was available	`(Total Time - Downtime) / Total Time × 100`
Downtime	Total time in incidents	Sum of all incident durations
MTBF	Mean Time Between Failures	Average time between incidents
MTTR	Mean Time To Recovery	Average incident duration
Incident Count	Total number of incidents	Count of all incidents
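
The calculations in the table can be reproduced from a list of incident start and end times. The sketch below is illustrative only: `incident_metrics` is a hypothetical function, it assumes incidents are sorted, non-overlapping, and inside the reporting window, and it takes "time between incidents" to mean the gap from the end of one incident to the start of the next, which may not match how UptimeIO defines MTBF.

```python
from datetime import datetime


def incident_metrics(incidents, window_start, window_end):
    """Compute the metrics from a list of (start, end) datetime pairs."""
    total = (window_end - window_start).total_seconds()
    downtime = sum((end - start).total_seconds() for start, end in incidents)
    count = len(incidents)
    # Gap between the end of each incident and the start of the next one.
    gaps = [
        (incidents[i + 1][0] - incidents[i][1]).total_seconds()
        for i in range(count - 1)
    ]
    return {
        "uptime_pct": (total - downtime) / total * 100,
        "downtime_s": downtime,
        "mtbf_s": sum(gaps) / len(gaps) if gaps else None,
        "mttr_s": downtime / count if count else None,
        "incident_count": count,
    }


# One day containing two incidents (323 s and 600 s long).
metrics = incident_metrics(
    [
        (datetime(2024, 1, 15, 10, 30, 2), datetime(2024, 1, 15, 10, 35, 25)),
        (datetime(2024, 1, 15, 20, 0, 0), datetime(2024, 1, 15, 20, 10, 0)),
    ],
    datetime(2024, 1, 15),
    datetime(2024, 1, 16),
)
```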

Next Steps

Setting Up Alerts

Configure notifications for incidents

Reading Metrics

Understand uptime calculations

Creating Monitors

Create your first monitor

Notification Profiles

Manage notification routing
