This guide explains UptimeIO’s intelligent incident detection system and how to interpret incident data.

What is an Incident?

An incident represents a period when your monitored service is unavailable or not meeting expectations. UptimeIO creates incidents only after confirming the issue across multiple regions to avoid false positives.

How Incidents Are Created

UptimeIO uses a multi-region consensus system to ensure incidents are real, not transient network issues.
1. Initial Check Fails

Your monitor performs a check from Region A (e.g., US East).
❌ Check failed: Connection timeout
Region: US East
Time: 10:30:00
No incident created yet - could be a temporary issue.
2. First Retry (1 second later)

UptimeIO automatically retries from Region B (e.g., Europe).
❌ Check failed: Connection timeout
Region: Europe
Time: 10:30:01
Still no incident - waiting for final confirmation.
3. Second Retry (1 second later)

Final retry from Region C (e.g., Asia).
❌ Check failed: Connection timeout
Region: Asia
Time: 10:30:02
Incident created! 3 failures from 2+ different regions confirm that the issue is real.
4. Notifications Sent

All integrations in your notification profiles receive alerts:
  • 📧 Email notifications
  • 📱 SMS messages
  • 💬 Slack/Discord messages
  • 🔗 Webhook calls
Why 3 failures from 2+ regions? This consensus mechanism virtually eliminates false positives caused by:
  • Temporary network glitches
  • Single region outages
  • ISP routing issues
  • Transient server hiccups
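
The consensus rule described above can be sketched in a few lines of Python. This is only an illustration of the "3 failures from 2+ regions" rule as stated in this guide, not UptimeIO's actual implementation; the names `CheckResult` and `should_open_incident` are hypothetical.

```python
from dataclasses import dataclass

# Values taken from this guide: 3 failed checks from at least 2 regions.
REQUIRED_FAILURES = 3
REQUIRED_REGIONS = 2


@dataclass(frozen=True)
class CheckResult:
    """One check outcome (hypothetical structure, for illustration only)."""
    region: str
    ok: bool


def should_open_incident(results: list[CheckResult]) -> bool:
    """Return True when the most recent checks form a failure consensus."""
    recent = results[-REQUIRED_FAILURES:]
    if len(recent) < REQUIRED_FAILURES:
        return False  # not enough evidence yet
    if any(r.ok for r in recent):
        return False  # a success breaks the failure streak
    # The failures must come from at least two distinct regions.
    return len({r.region for r in recent}) >= REQUIRED_REGIONS


checks = [
    CheckResult("US East", False),
    CheckResult("Europe", False),
    CheckResult("Asia", False),
]
print(should_open_incident(checks[:2]))  # False: only two failures so far
print(should_open_incident(checks))      # True: three failures, three regions
```

Three failures reported by a single region would not open an incident under this sketch, which mirrors the "single region outages" case in the list above.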

Incident Timeline

Each incident has a detailed timeline showing exactly what happened:
Incident #1234 - API Server Down
Duration: 5 minutes 23 seconds

Timeline:
├─ 10:30:00 - Check failed (US East)
│  Error: Connection timeout after 10000ms
│  Status: Retrying...

├─ 10:30:01 - Check failed (Europe)
│  Error: Connection timeout after 10000ms
│  Status: Retrying...

├─ 10:30:02 - Check failed (Asia)
│  Error: Connection timeout after 10000ms
│  Status: Incident created
│  Notifications: Sent to 3 integrations

├─ 10:35:23 - Check succeeded (US East)
│  Status: 200 OK
│  Response time: 145ms
│  Status: Recovery attempt 1/3

├─ 10:35:24 - Check succeeded (Europe)
│  Status: 200 OK
│  Response time: 152ms
│  Status: Recovery attempt 2/3

└─ 10:35:25 - Check succeeded (Asia)
   Status: 200 OK
   Response time: 138ms
   Status: Incident resolved ✅
   Notifications: Sent to 3 integrations

Incident Recovery

Recovery works the same way as incident creation - 3 successful checks from 2+ regions are required to resolve an incident.
This prevents “flapping” where a service goes up and down rapidly, creating alert fatigue.
1. First Success

After the incident is created, the next scheduled check succeeds.
✅ Check succeeded
Region: US East
Status: Recovery attempt 1/3
2. Second Success

One second later, a check from another region succeeds.
✅ Check succeeded
Region: Europe
Status: Recovery attempt 2/3
3. Third Success

Final confirmation from a third region.
✅ Check succeeded
Region: Asia
Status: Incident resolved ✅
Recovery notifications sent to all integrations.
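
To show why symmetric thresholds prevent flapping, here is a minimal state-machine sketch. It assumes that a result agreeing with the current state resets the streak, which this guide does not state explicitly; the `IncidentTracker` class and its return strings are hypothetical and only loosely mirror the status lines above.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentTracker:
    """Tracks one monitor; opens and resolves incidents by consensus."""
    required_checks: int = 3
    required_regions: int = 2
    is_open: bool = False
    # Regions of the consecutive results that contradict the current state.
    streak: list[str] = field(default_factory=list, init=False)

    def record(self, region: str, ok: bool) -> str:
        # While up, failures contradict the state; while open, successes do.
        contradicts = ok == self.is_open
        if not contradicts:
            self.streak.clear()  # assumption: any agreeing result resets the streak
            return "open" if self.is_open else "up"
        self.streak.append(region)
        if (len(self.streak) >= self.required_checks
                and len(set(self.streak)) >= self.required_regions):
            self.is_open = not self.is_open
            self.streak.clear()
            return "incident created" if self.is_open else "incident resolved"
        return f"attempt {len(self.streak)}/{self.required_checks}"


tracker = IncidentTracker()
for region in ("US East", "Europe", "Asia"):
    print(tracker.record(region, ok=False))
for region in ("US East", "Europe", "Asia"):
    print(tracker.record(region, ok=True))
```

In this sketch a single successful check during an outage only moves the monitor to "attempt 1/3", so a service that briefly answers and then fails again never triggers a recovery notification.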

Incident Types

Monitor Down

The primary incident type when health checks fail.
Type: Monitor Down
Cause: Service unreachable
Trigger: 3 failed checks from 2+ regions
Resolution: 3 successful checks from 2+ regions
Common causes:
  • Server down or restarting
  • Network connectivity issues
  • DNS resolution failures
  • Firewall blocking requests
  • Application crashes

Slow Response

Created when response times exceed your configured threshold.
Type: Slow Response
Threshold: 2000ms
Actual: 3500ms
Trigger: Response time > threshold
Resolution: Response time < 80% of threshold for 3 checks
Example:
  • Threshold: 2000ms
  • Incident created: Response time 2500ms
  • Incident resolved: Response time drops below 1600ms (80% of 2000ms) for 3 consecutive checks
Slow response incidents are separate from downtime incidents. Your monitor can be “UP” but have an active slow response incident.
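
The trigger and resolution rules form a hysteresis band: the incident opens above the threshold but only closes well below it. The sketch below illustrates that logic under the numbers given in this section; `SlowResponseDetector` is a hypothetical name, and resetting the streak on any slower reading is an assumption drawn from the word "consecutive".

```python
class SlowResponseDetector:
    """Opens above the threshold, resolves after 3 fast checks in a row."""

    def __init__(self, threshold_ms: float, recovery_ratio: float = 0.8,
                 recovery_checks: int = 3) -> None:
        self.threshold_ms = threshold_ms
        self.recovery_ms = threshold_ms * recovery_ratio  # 1600 for 2000ms
        self.recovery_checks = recovery_checks
        self.active = False
        self._fast_streak = 0

    def observe(self, response_ms: float) -> bool:
        """Feed one response time; return whether an incident is active."""
        if not self.active:
            if response_ms > self.threshold_ms:
                self.active = True
                self._fast_streak = 0
        elif response_ms < self.recovery_ms:
            self._fast_streak += 1
            if self._fast_streak >= self.recovery_checks:
                self.active = False
        else:
            # Between 80% and 100% of the threshold: not yet recovered.
            self._fast_streak = 0
        return self.active


detector = SlowResponseDetector(threshold_ms=2000)
for ms in (2500, 1700, 1500, 1500, 1500):
    print(ms, detector.observe(ms))
```

A reading of 1700ms is under the threshold but above the 1600ms recovery line, so it keeps the incident open in this sketch.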

SSL Certificate Expiry

Alerts before SSL certificates expire.
Type: SSL Certificate Expiry
Certificate: example.com
Expires: 2024-02-15
Days remaining: 7
Severity: Warning
Warning periods (configurable):
  • 30 days before expiry
  • 7 days before expiry
  • 1 day before expiry
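
As a rough illustration of how the "Days remaining" value and the warning periods relate, the following sketch computes both from an expiry date. The function names are hypothetical, and firing a warning only on the exact day count is an assumption; the real scheduler may behave differently.

```python
from datetime import date
from typing import Optional

# Default warning periods listed above (configurable in the product).
WARNING_DAYS = (30, 7, 1)


def days_remaining(expires: date, today: date) -> int:
    """Whole days between today and the certificate's expiry date."""
    return (expires - today).days


def warning_due(expires: date, today: date,
                warning_days: tuple = WARNING_DAYS) -> Optional[int]:
    """Return the warning period reached today, or None if none applies."""
    remaining = days_remaining(expires, today)
    return remaining if remaining in warning_days else None


expiry = date(2024, 2, 15)
print(warning_due(expiry, date(2024, 2, 8)))  # 7
print(warning_due(expiry, date(2024, 2, 1)))  # None
```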

DNS Error

Triggered when DNS resolution fails or returns unexpected values.
Type: DNS Error
Domain: example.com
Expected: 93.184.216.34
Actual: 10.0.0.1
Cause: DNS record changed
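
A DNS check of this kind boils down to resolving the domain and comparing the answer with the expected records. The sketch below uses Python's standard `socket.getaddrinfo` for the lookup; the helper names and the rule "any unexpected address counts as a mismatch" are assumptions for illustration.

```python
import socket


def resolve_ipv4(domain: str) -> set[str]:
    """Resolve a domain to its IPv4 addresses (raises socket.gaierror on failure)."""
    infos = socket.getaddrinfo(domain, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def dns_mismatch(expected: set[str], actual: set[str]) -> bool:
    """True when nothing resolved or an unexpected address was returned."""
    return not actual or not actual <= expected


def check_dns(domain: str, expected: set[str]) -> str:
    """Return 'OK' or a short error description for one DNS check."""
    try:
        actual = resolve_ipv4(domain)
    except socket.gaierror as exc:
        return f"DNS Error: resolution failed ({exc})"
    if dns_mismatch(expected, actual):
        return f"DNS Error: expected {sorted(expected)}, got {sorted(actual)}"
    return "OK"


print(dns_mismatch({"93.184.216.34"}, {"10.0.0.1"}))  # True
```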

Reading Incident Details

Incident Status

Status | Meaning
Open | Incident is active, service is down
Resolved | Service recovered, incident closed

Incident Metadata

Each incident includes:
When the incident was created (after 3 failures confirmed).
Started: 2024-01-15 10:30:02 UTC
How long the incident lasted.
Duration: 5 minutes 23 seconds
For open incidents, shows elapsed time.
Consensus information:
Failed Checks: 3
Regions: US East, Europe, Asia
Providers: Vultr, Scaleway, DigitalOcean
Specific error from first failure:
Error: Connection timeout after 10000ms
Status Code: N/A
Region: US East
Response Time: N/A
Which monitor triggered the incident:
Monitor: API Server
URL: https://api.example.com/health
Type: HTTP/HTTPS

Incident Notifications

When an incident is created or resolved, notifications are sent through all integrations in your assigned notification profiles.

Incident Created

Subject: [UptimeIO] Monitor Down: API Server

Your monitor "API Server" is down.

URL: https://api.example.com/health
Started: 2024-01-15 10:30:02 UTC

Error: Connection timeout after 10000ms
Region: US East

Verification:
✓ 3 failed checks from 3 different regions
✓ Regions: US East, Europe, Asia

View incident: https://app.uptimeio.com/incidents/inc_123

Incident Resolved

Subject: [UptimeIO] Monitor Recovered: API Server

Your monitor "API Server" has recovered.

URL: https://api.example.com/health
Downtime: 5 minutes 23 seconds
Resolved: 2024-01-15 10:35:25 UTC

View incident: https://app.uptimeio.com/incidents/inc_123

Preventing False Positives

UptimeIO’s consensus system prevents false positives, but you can further reduce them:
Always monitor from at least 2 regions (required). For critical services, use 3-4 regions.
Set a timeout that suits the service:
  • Fast APIs: 10-15 seconds
  • Standard sites: 30 seconds
  • Slow services: 45-60 seconds
Timeouts that are too short cause false failures.
Ensure your expected status codes match what your service actually returns.
# Default (most sites)
Expected: 200-299, 300-399

# API that returns 201 for creation
Expected: 200, 201

# Maintenance mode (temporarily)
Expected: 503
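
To make the matching rule concrete, here is a small sketch that parses an expected-status string in the format shown above and tests a response code against it. The parser is hypothetical and only assumes the comma and dash syntax visible in these examples.

```python
def parse_expected(spec: str) -> list[range]:
    """Turn '200-299, 300-399' into a list of inclusive status-code ranges."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        low, _, high = part.partition("-")
        start = int(low)
        end = int(high) if high else start  # a single code is a one-item range
        ranges.append(range(start, end + 1))
    return ranges


def is_expected(status: int, spec: str) -> bool:
    """True when the status code falls inside any configured range."""
    return any(status in r for r in parse_expected(spec))


print(is_expected(301, "200-299, 300-399"))  # True
print(is_expected(201, "200, 201"))          # True
print(is_expected(200, "503"))               # False
```

The last call shows why a temporary maintenance setting of 503 should be reverted afterwards: a healthy 200 response would then count as a failure.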
If you use a firewall or rate limiting, whitelist UptimeIO’s user agent:
User-Agent: UptimeIO-Monitor/1.0

Incident History

View all past incidents for a monitor:
1. Go to Monitor Details

Click on any monitor from your monitors list.
2. Scroll to Recent Incidents

The “Recent Incidents” section shows the last 10 incidents.
3. View All Incidents

Click “View All Incidents” to see complete history.
4. Filter and Search

Filter by:
  • Date range
  • Status (open/resolved)
  • Incident type
  • Duration

Incident Metrics

UptimeIO calculates key metrics from your incident history:
Metric | Description | Calculation
Uptime % | Percentage of time service was available | (Total Time - Downtime) / Total Time × 100
Downtime | Total time in incidents | Sum of all incident durations
MTBF | Mean Time Between Failures | Average time between incidents
MTTR | Mean Time To Recovery | Average incident duration
Incident Count | Total number of incidents | Count of all incidents
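
The calculations in the table can be reproduced from a list of incident start and end times. The sketch below is one possible reading: it assumes resolved, non-overlapping incidents inside the reporting window, and it measures "time between incidents" from one incident's resolution to the next one's start. The `Incident` type and `incident_metrics` function are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Incident:
    started: datetime
    resolved: datetime


def incident_metrics(incidents: list[Incident],
                     window_start: datetime, window_end: datetime) -> dict:
    """Compute the table's metrics (durations in seconds) for one window."""
    total = (window_end - window_start).total_seconds()
    durations = [(i.resolved - i.started).total_seconds() for i in incidents]
    downtime = sum(durations)
    ordered = sorted(incidents, key=lambda i: i.started)
    # Assumption: gap = next incident's start minus previous incident's end.
    gaps = [(b.started - a.resolved).total_seconds()
            for a, b in zip(ordered, ordered[1:])]
    return {
        "uptime_pct": (total - downtime) / total * 100,
        "downtime_s": downtime,
        "mtbf_s": sum(gaps) / len(gaps) if gaps else None,
        "mttr_s": downtime / len(incidents) if incidents else None,
        "incident_count": len(incidents),
    }


history = [
    Incident(datetime(2024, 1, 15, 10, 30, 2), datetime(2024, 1, 15, 10, 35, 25)),
    Incident(datetime(2024, 1, 15, 12, 35, 25), datetime(2024, 1, 15, 12, 37, 25)),
]
metrics = incident_metrics(history, datetime(2024, 1, 15), datetime(2024, 1, 16))
print(metrics)
```

Here the first incident is the 5 minute 23 second example used throughout this guide; the second is made up to give MTBF a gap to average.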
