Alerts & Incidents

Alert Rules

Define when and how alerts are triggered based on monitor results.

Alert rules define the conditions that trigger notifications. Create rules to match your operational requirements and avoid alert fatigue.

Creating Alert Rules

  1. Go to Settings > Alert Rules
  2. Click New Rule
  3. Configure conditions and channels
  4. Apply to monitors

Rule Types

Consecutive Failures

Alert after N failures in a row. Most common rule type.

Name: 3 Consecutive Failures
Type: consecutive_failures
Threshold: 3
Channels:
  - Slack (#ops-alerts)
  - Email ([email protected])

Use when: You want to avoid alerting on transient issues.

Threshold    Use Case
1            Critical systems, zero tolerance
2-3          Standard production monitoring
5+           Flaky checks, high-noise monitors
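
To make the threshold semantics concrete, here is a minimal sketch of consecutive-failure evaluation. It assumes run results are available newest-first as booleans; the function name and data shape are illustrative, not the platform's actual data model.

# Minimal sketch of consecutive-failure evaluation (illustrative only).
def should_alert(recent_results: list[bool], threshold: int = 3) -> bool:
    """recent_results is newest-first; True means the run passed."""
    streak = 0
    for passed in recent_results:
        if passed:
            break
        streak += 1
    return streak >= threshold

# Example: the last three runs failed, so a threshold of 3 fires.
print(should_alert([False, False, False, True], threshold=3))  # True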

Error Rate

Alert when failure percentage exceeds threshold over a time window.

Name: High Error Rate
Type: error_rate
Threshold: 10%  # percentage
Window: 1 hour
Min Samples: 10  # minimum runs to evaluate
Channels:
  - Slack (#ops-alerts)

Use when: Occasional failures are acceptable, but sustained issues aren't.
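
The sketch below shows how an error-rate rule can be evaluated over a rolling window with a minimum-sample guard. The (timestamp, passed) run shape is an assumption for illustration; timestamps are expected to be timezone-aware.

# Illustrative error-rate evaluation over a time window.
from datetime import datetime, timedelta, timezone

def error_rate_alert(runs, threshold_pct=10.0, window=timedelta(hours=1),
                     min_samples=10, now=None):
    """runs: iterable of (timestamp, passed) tuples, timestamps tz-aware."""
    now = now or datetime.now(timezone.utc)
    recent = [passed for ts, passed in runs if now - ts <= window]
    if len(recent) < min_samples:
        return False  # not enough data to evaluate yet
    failure_pct = 100.0 * recent.count(False) / len(recent)
    return failure_pct > threshold_pct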

Latency Threshold

Alert when response time exceeds limit.

Name: Slow Response Alert
Type: latency
Threshold: 2000  # milliseconds
Percentile: p95  # or p50, p99, avg
Window: 15 minutes
Channels:
  - Slack (#performance)

Use when: Performance degradation impacts users.

Percentile   Description
avg          Average response time
p50          Median (50th percentile)
p95          95th percentile (recommended)
p99          99th percentile
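
A simple nearest-rank percentile check is sketched below. The platform's evaluator may use a different percentile method; this only illustrates how a p95 threshold comparison works.

# Nearest-rank percentile check (illustrative).
import math

def latency_alert(latencies_ms: list[float], threshold_ms: float = 2000,
                  percentile: float = 95) -> bool:
    if not latencies_ms:
        return False
    ranked = sorted(latencies_ms)
    idx = math.ceil(percentile / 100 * len(ranked)) - 1
    return ranked[idx] > threshold_ms

# Example: the p95 of these samples exceeds 2000 ms, so the rule fires.
print(latency_alert([120, 150, 180, 200, 2500] * 4))  # True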

Visual Diff Threshold

Alert when visual changes exceed threshold (journeys only).

Name: Visual Regression Alert
Type: visual_diff
Threshold: 5%  # pixel difference percentage
Channels:
  - Slack (#design-review)
  - Email ([email protected])

Use when: UI consistency is critical.
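
For intuition, here is one way to compute a pixel-difference percentage with Pillow. It is not necessarily the algorithm the platform uses for visual_diff, and it assumes both screenshots have identical dimensions.

# Illustrative pixel-diff calculation using Pillow.
from PIL import Image, ImageChops

def diff_percentage(baseline_path: str, current_path: str) -> float:
    baseline = Image.open(baseline_path).convert("RGB")
    current = Image.open(current_path).convert("RGB")
    diff = ImageChops.difference(baseline, current)  # requires equal sizes
    # A pixel counts as changed if any channel differs.
    changed = sum(1 for px in diff.getdata() if any(px))
    return 100.0 * changed / (diff.width * diff.height)

# Fires the rule when more than 5% of pixels changed:
# alert = diff_percentage("baseline.png", "current.png") > 5.0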

SSL Certificate Expiry

Alert before TLS certificate expires.

Name: SSL Expiry Warning
Type: ssl_expiry
Days Before: 30
Channels:
  - Email ([email protected])

Recommended thresholds:

  • 30 days: Warning
  • 14 days: Urgent
  • 7 days: Critical
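
If you want to verify a certificate's remaining lifetime yourself, the standard library is enough. This is a standalone sketch; the ssl_expiry rule performs an equivalent check for you.

# Days remaining on a server certificate, standard library only.
import socket, ssl
from datetime import datetime, timezone

def days_until_expiry(host: str, port: int = 443) -> float:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)
    return (expires - datetime.now(timezone.utc)).total_seconds() / 86400

# Example: warn when fewer than 30 days remain.
# if days_until_expiry("example.com") < 30: ...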

Rule Configuration

Applying Rules to Monitors

Rules can be applied:

Per-monitor:

Monitor: Production API
Alert Rules:
  - 3 Consecutive Failures
  - Slow Response Alert

By tag:

Rule: Critical Alerts
Apply To: tag:production

Globally:

Rule: Basic Monitoring
Apply To: all monitors
Exclude: tag:experimental

Severity Levels

Assign severity to rules:

Severity: critical | high | medium | low

# Different channels per severity
Channels:
  critical:
    - PagerDuty
    - Slack (#critical-alerts)
  high:
    - Slack (#ops-alerts)
    - Email
  medium:
    - Slack (#monitoring)
  low:
    - Email (daily digest)

Time-Based Rules

Adjust alerting based on time:

Name: Business Hours Alert
Active Hours:
  Start: "09:00"
  End: "18:00"
  Timezone: America/New_York
  Days: [Monday, Tuesday, Wednesday, Thursday, Friday]
Channels:
  - Slack (#ops-alerts)

# Off-hours: different channel
Off Hours Channel:
  - PagerDuty (on-call)
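
The routing logic behind Active Hours can be pictured as follows. This sketch uses the standard library only; the channel names mirror the example above and are illustrative.

# Business-hours channel routing (illustrative).
from datetime import datetime, time
from zoneinfo import ZoneInfo

def pick_channels(now=None) -> list[str]:
    now = now or datetime.now(ZoneInfo("America/New_York"))
    business_day = now.weekday() < 5                       # Monday-Friday
    business_hours = time(9, 0) <= now.time() < time(18, 0)
    if business_day and business_hours:
        return ["slack:#ops-alerts"]
    return ["pagerduty:on-call"]                           # off-hours channel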

Cooldown Period

Prevent repeated alerts:

Cooldown: 30 minutes  # Don't re-alert for 30 min

The cooldown suppresses repeat notifications while an incident is open and resets on recovery, so if the same issue recurs after the monitor has recovered, a new alert is sent.
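
A minimal in-memory version of the suppression logic looks like this; it is a sketch, not the platform's implementation.

# Suppress repeat notifications for the same monitor within the cooldown.
from datetime import datetime, timedelta, timezone

_last_alert: dict[str, datetime] = {}

def notify(monitor_id: str, cooldown: timedelta = timedelta(minutes=30)) -> bool:
    now = datetime.now(timezone.utc)
    last = _last_alert.get(monitor_id)
    if last is not None and now - last < cooldown:
        return False  # still cooling down, skip the notification
    _last_alert[monitor_id] = now
    return True       # send the alert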

Recovery Notifications

Configure recovery alerts:

Recovery:
  Enabled: true
  Channels: same  # or specify different channels
  Message: "{{monitor.name}} has recovered after {{incident.duration}}"

Recovery notification includes:

  • Time down
  • Number of failed runs
  • Link to incident
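
The {{...}} placeholders in the message are substituted from incident context. A simple rendering sketch is shown below; the available variable names are assumptions based on the example above.

# Render {{monitor.name}}-style placeholders (illustrative).
import re

def render(template: str, context: dict[str, str]) -> str:
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}",
                  lambda m: context.get(m.group(1), m.group(0)),
                  template)

print(render(
    "{{monitor.name}} has recovered after {{incident.duration}}",
    {"monitor.name": "Production API", "incident.duration": "12m"},
))
# -> Production API has recovered after 12m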

Alert Grouping

Group related alerts to reduce noise:

Grouping:
  By: monitor | location | tag
  Window: 5 minutes

If multiple monitors fail within the window, one grouped alert is sent.
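
One way to picture the grouping behavior: failures whose timestamps fall within the window of the first failure in a group are bundled into a single notification. The data shape below is an assumption for illustration.

# Group failures that occur within the same window (illustrative).
from datetime import timedelta

def group_alerts(failures, window=timedelta(minutes=5)):
    """failures: list of (timestamp, monitor_name) tuples."""
    groups = []
    for ts, monitor in sorted(failures):
        if groups and ts - groups[-1]["start"] <= window:
            groups[-1]["monitors"].append(monitor)
        else:
            groups.append({"start": ts, "monitors": [monitor]})
    return groups  # one notification per group instead of one per failure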

Muting Alerts

Temporarily silence alerts:

Manual Mute

Mute:
  Duration: 2 hours
  Reason: "Deploying new version"

Scheduled Maintenance

Maintenance Window:
  Name: Weekly Deployment
  Schedule: "0 2 * * 0"  # Sunday 2 AM
  Duration: 1 hour
  Mute Alerts: true
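
The mute check for a recurring window amounts to "are we within Duration of the most recent cron occurrence". The sketch below uses the croniter package to evaluate the expression; the platform evaluates Maintenance Window itself, so this is purely illustrative.

# Is "now" inside the weekly maintenance window? (illustrative)
from datetime import datetime, timedelta
from croniter import croniter  # pip install croniter

def in_maintenance(now: datetime, schedule: str = "0 2 * * 0",
                   duration: timedelta = timedelta(hours=1)) -> bool:
    last_start = croniter(schedule, now).get_prev(datetime)
    return last_start <= now < last_start + duration

# Alerts are muted whenever in_maintenance(datetime.now()) is True.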

Example Configurations

Standard Production Setup

Rules:
  - name: Critical Failures
    type: consecutive_failures
    threshold: 2
    apply_to: tag:critical
    channels:
      - PagerDuty
      - Slack (#critical)

  - name: Standard Failures
    type: consecutive_failures
    threshold: 3
    apply_to: tag:production
    channels:
      - Slack (#ops-alerts)
      - Email

  - name: Performance Degradation
    type: latency
    threshold: 3000
    percentile: p95
    window: 15m
    apply_to: all
    channels:
      - Slack (#performance)

E-commerce Site

Rules:
  - name: Checkout Flow Down
    type: consecutive_failures
    threshold: 1  # Zero tolerance
    apply_to: journey:checkout
    channels:
      - PagerDuty
      - Slack (#critical)
      - SMS (on-call)

  - name: Product Pages Slow
    type: latency
    threshold: 2000
    apply_to: monitor:product-pages
    channels:
      - Slack (#performance)

  - name: Visual Changes
    type: visual_diff
    threshold: 3%
    apply_to: journey:homepage
    channels:
      - Slack (#design)

API Platform

Rules:
  - name: API Down
    type: consecutive_failures
    threshold: 2
    apply_to: tag:api
    channels:
      - PagerDuty
      - Slack (#api-alerts)

  - name: High Error Rate
    type: error_rate
    threshold: 5%
    window: 30m
    apply_to: tag:api
    channels:
      - Slack (#api-alerts)

  - name: SLA Breach Risk
    type: latency
    threshold: 500
    percentile: p99
    apply_to: tag:sla-monitored
    channels:
      - Email ([email protected])

Best Practices

  1. Start conservatively - Begin with higher thresholds, then tune down
  2. Avoid single-failure alerts - Too noisy for most cases
  3. Use severity levels - Not every alert is critical
  4. Set up escalation - Slack → Email → PagerDuty
  5. Include recovery - Know when issues resolve
  6. Review regularly - Tune based on false positive rate
  7. Document rules - Why does this rule exist?

Alert fatigue is real. If your team ignores alerts because there are too many, adjust your thresholds. Better to miss a minor issue than ignore a critical one.

Troubleshooting

Too Many Alerts

  • Increase consecutive failure threshold
  • Add cooldown period
  • Use error rate instead of per-failure
  • Review monitor reliability

Missing Alerts

  • Check rule is applied to monitor
  • Verify channels are configured
  • Check for mute/maintenance windows
  • Test with Trigger Test Alert

Alert Delays

  • Check channel delivery (Slack/email)
  • Verify webhook endpoints respond
  • Review alert processing logs