Alerts & Incidents

Alert Rules

Define when and how alerts are triggered based on monitor results.

Alert rules define the conditions that trigger notifications. Create rules to match your operational requirements and avoid alert fatigue.

Creating Alert Rules

  1. Go to Settings > Alert Rules
  2. Click New Rule
  3. Configure conditions and channels
  4. Apply to monitors

Rule Types

Consecutive Failures

Alert after N failures in a row. Most common rule type.

Name: 3 Consecutive Failures
Type: consecutive_failures
Threshold: 3
Channels:
  - Slack (#ops-alerts)
  - Email ([email protected])

Use when: You want to avoid alerting on transient issues.

Threshold    Use Case
1            Critical systems, zero tolerance
2-3          Standard production monitoring
5+           Flaky checks, high-noise monitors
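
To make the threshold semantics concrete, here is a minimal sketch of consecutive-failure evaluation. It assumes run results are available newest-first as booleans; the function name and data shape are illustrative, not the platform's actual data model.

# Minimal sketch of consecutive-failure evaluation (illustrative only).
def should_alert(recent_results: list[bool], threshold: int = 3) -> bool:
    """recent_results is newest-first; True means the run passed."""
    streak = 0
    for passed in recent_results:
        if passed:
            break
        streak += 1
    return streak >= threshold

# Example: the last three runs failed, so a threshold of 3 fires.
print(should_alert([False, False, False, True], threshold=3))  # True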

Error Rate

Alert when failure percentage exceeds threshold over a time window.

Name: High Error Rate
Type: error_rate
Threshold: 10%  # percentage
Window: 1 hour
Min Samples: 10  # minimum runs to evaluate
Channels:
  - Slack (#ops-alerts)

Use when: Occasional failures are acceptable, but sustained issues aren't.
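
The sketch below shows how an error-rate rule can be evaluated over a rolling window with a minimum-sample guard. The (timestamp, passed) run shape is an assumption for illustration; timestamps are expected to be timezone-aware.

# Illustrative error-rate evaluation over a time window.
from datetime import datetime, timedelta, timezone

def error_rate_alert(runs, threshold_pct=10.0, window=timedelta(hours=1),
                     min_samples=10, now=None):
    """runs: iterable of (timestamp, passed) tuples, timestamps tz-aware."""
    now = now or datetime.now(timezone.utc)
    recent = [passed for ts, passed in runs if now - ts <= window]
    if len(recent) < min_samples:
        return False  # not enough data to evaluate yet
    failure_pct = 100.0 * recent.count(False) / len(recent)
    return failure_pct > threshold_pct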

Latency Threshold

Alert when response time exceeds limit.

Name: Slow Response Alert
Type: latency
Threshold: 2000  # milliseconds
Percentile: p95  # or p50, p99, avg
Window: 15 minutes
Channels:
  - Slack (#performance)

Use when: Performance degradation impacts users.

Percentile   Description
avg          Average response time
p50          Median (50th percentile)
p95          95th percentile (recommended)
p99          99th percentile
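
A simple nearest-rank percentile check is sketched below. The platform's evaluator may use a different percentile method; this only illustrates how a p95 threshold comparison works.

# Nearest-rank percentile check (illustrative).
import math

def latency_alert(latencies_ms: list[float], threshold_ms: float = 2000,
                  percentile: float = 95) -> bool:
    if not latencies_ms:
        return False
    ranked = sorted(latencies_ms)
    idx = math.ceil(percentile / 100 * len(ranked)) - 1
    return ranked[idx] > threshold_ms

# Example: the p95 of these samples exceeds 2000 ms, so the rule fires.
print(latency_alert([120, 150, 180, 200, 2500] * 4))  # True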

Visual Diff Threshold

Alert when visual changes exceed threshold (journeys only).

Name: Visual Regression Alert
Type: visual_diff
Threshold: 5%  # pixel difference percentage
Channels:
  - Slack (#design-review)
  - Email ([email protected])

Use when: UI consistency is critical.
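
For intuition, here is one way to compute a pixel-difference percentage with Pillow. It is not necessarily the algorithm the platform uses for visual_diff, and it assumes both screenshots have identical dimensions.

# Illustrative pixel-diff calculation using Pillow.
from PIL import Image, ImageChops

def diff_percentage(baseline_path: str, current_path: str) -> float:
    baseline = Image.open(baseline_path).convert("RGB")
    current = Image.open(current_path).convert("RGB")
    diff = ImageChops.difference(baseline, current)  # requires equal sizes
    # A pixel counts as changed if any channel differs.
    changed = sum(1 for px in diff.getdata() if any(px))
    return 100.0 * changed / (diff.width * diff.height)

# Fires the rule when more than 5% of pixels changed:
# alert = diff_percentage("baseline.png", "current.png") > 5.0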

SSL Certificate Expiry

Alert before TLS certificate expires.

Name: SSL Expiry Warning
Type: ssl_expiry
Days Before: 30
Channels:
  - Email ([email protected])

Recommended thresholds:

  • 30 days: Warning
  • 14 days: Urgent
  • 7 days: Critical
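
If you want to verify a certificate's remaining lifetime yourself, the standard library is enough. This is a standalone sketch; the ssl_expiry rule performs an equivalent check for you.

# Days remaining on a server certificate, standard library only.
import socket, ssl
from datetime import datetime, timezone

def days_until_expiry(host: str, port: int = 443) -> float:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)
    return (expires - datetime.now(timezone.utc)).total_seconds() / 86400

# Example: warn when fewer than 30 days remain.
# if days_until_expiry("example.com") < 30: ...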

Rule Configuration

Applying Rules to Monitors

Rules can be applied:

Per-monitor:

Monitor: Production API
Alert Rules:
  - 3 Consecutive Failures
  - Slow Response Alert

By tag:

Rule: Critical Alerts
Apply To: tag:production

Globally:

Rule: Basic Monitoring
Apply To: all monitors
Exclude: tag:experimental

Severity Levels

Assign severity to rules:

Severity: critical | high | medium | low

# Different channels per severity
Channels:
  critical:
    - PagerDuty
    - Slack (#critical-alerts)
  high:
    - Slack (#ops-alerts)
    - Email
  medium:
    - Slack (#monitoring)
  low:
    - Email (daily digest)

Time-Based Rules

Adjust alerting based on time:

Name: Business Hours Alert
Active Hours:
  Start: "09:00"
  End: "18:00"
  Timezone: America/New_York
  Days: [Monday, Tuesday, Wednesday, Thursday, Friday]
Channels:
  - Slack (#ops-alerts)

# Off-hours: different channel
Off Hours Channel:
  - PagerDuty (on-call)
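
The routing logic behind Active Hours can be pictured as follows. This sketch uses the standard library only; the channel names mirror the example above and are illustrative.

# Business-hours channel routing (illustrative).
from datetime import datetime, time
from zoneinfo import ZoneInfo

def pick_channels(now=None) -> list[str]:
    now = now or datetime.now(ZoneInfo("America/New_York"))
    business_day = now.weekday() < 5                       # Monday-Friday
    business_hours = time(9, 0) <= now.time() < time(18, 0)
    if business_day and business_hours:
        return ["slack:#ops-alerts"]
    return ["pagerduty:on-call"]                           # off-hours channel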

Cooldown Period

Prevent repeated alerts:

Cooldown: 30 minutes  # Don't re-alert for 30 min

The cooldown suppresses repeat notifications while an incident is open and resets on recovery, so if the same issue recurs after the monitor has recovered, a new alert is sent.
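
A minimal in-memory version of the suppression logic looks like this; it is a sketch, not the platform's implementation.

# Suppress repeat notifications for the same monitor within the cooldown.
from datetime import datetime, timedelta, timezone

_last_alert: dict[str, datetime] = {}

def notify(monitor_id: str, cooldown: timedelta = timedelta(minutes=30)) -> bool:
    now = datetime.now(timezone.utc)
    last = _last_alert.get(monitor_id)
    if last is not None and now - last < cooldown:
        return False  # still cooling down, skip the notification
    _last_alert[monitor_id] = now
    return True       # send the alert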

Recovery Notifications

Configure recovery alerts:

Recovery:
  Enabled: true
  Channels: same  # or specify different channels
  Message: "{{monitor.name}} has recovered after {{incident.duration}}"

Recovery notification includes:

  • Time down
  • Number of failed runs
  • Link to incident
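
The {{...}} placeholders in the message are substituted from incident context. A simple rendering sketch is shown below; the available variable names are assumptions based on the example above.

# Render {{monitor.name}}-style placeholders (illustrative).
import re

def render(template: str, context: dict[str, str]) -> str:
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}",
                  lambda m: context.get(m.group(1), m.group(0)),
                  template)

print(render(
    "{{monitor.name}} has recovered after {{incident.duration}}",
    {"monitor.name": "Production API", "incident.duration": "12m"},
))
# -> Production API has recovered after 12m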

Alert Grouping

Group related alerts to reduce noise:

Grouping:
  By: monitor | location | tag
  Window: 5 minutes

If multiple monitors fail within the window, one grouped alert is sent.
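
One way to picture the grouping behavior: failures whose timestamps fall within the window of the first failure in a group are bundled into a single notification. The data shape below is an assumption for illustration.

# Group failures that occur within the same window (illustrative).
from datetime import timedelta

def group_alerts(failures, window=timedelta(minutes=5)):
    """failures: list of (timestamp, monitor_name) tuples."""
    groups = []
    for ts, monitor in sorted(failures):
        if groups and ts - groups[-1]["start"] <= window:
            groups[-1]["monitors"].append(monitor)
        else:
            groups.append({"start": ts, "monitors": [monitor]})
    return groups  # one notification per group instead of one per failure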

Muting Alerts

Temporarily silence alerts:

Manual Mute

Mute:
  Duration: 2 hours
  Reason: "Deploying new version"

Scheduled Maintenance

Maintenance Window:
  Name: Weekly Deployment
  Schedule: "0 2 * * 0"  # Sunday 2 AM
  Duration: 1 hour
  Mute Alerts: true
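
The mute check for a recurring window amounts to "are we within Duration of the most recent cron occurrence". The sketch below uses the croniter package to evaluate the expression; the platform evaluates Maintenance Window itself, so this is purely illustrative.

# Is "now" inside the weekly maintenance window? (illustrative)
from datetime import datetime, timedelta
from croniter import croniter  # pip install croniter

def in_maintenance(now: datetime, schedule: str = "0 2 * * 0",
                   duration: timedelta = timedelta(hours=1)) -> bool:
    last_start = croniter(schedule, now).get_prev(datetime)
    return last_start <= now < last_start + duration

# Alerts are muted whenever in_maintenance(datetime.now()) is True.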

Example Configurations

Standard Production Setup

Rules:
  - name: Critical Failures
    type: consecutive_failures
    threshold: 2
    apply_to: tag:critical
    channels:
      - PagerDuty
      - Slack (#critical)

  - name: Standard Failures
    type: consecutive_failures
    threshold: 3
    apply_to: tag:production
    channels:
      - Slack (#ops-alerts)
      - Email

  - name: Performance Degradation
    type: latency
    threshold: 3000
    percentile: p95
    window: 15m
    apply_to: all
    channels:
      - Slack (#performance)

E-commerce Site

Rules:
  - name: Checkout Flow Down
    type: consecutive_failures
    threshold: 1  # Zero tolerance
    apply_to: journey:checkout
    channels:
      - PagerDuty
      - Slack (#critical)
      - SMS (on-call)

  - name: Product Pages Slow
    type: latency
    threshold: 2000
    apply_to: monitor:product-pages
    channels:
      - Slack (#performance)

  - name: Visual Changes
    type: visual_diff
    threshold: 3%
    apply_to: journey:homepage
    channels:
      - Slack (#design)

API Platform

Rules:
  - name: API Down
    type: consecutive_failures
    threshold: 2
    apply_to: tag:api
    channels:
      - PagerDuty
      - Slack (#api-alerts)

  - name: High Error Rate
    type: error_rate
    threshold: 5%
    window: 30m
    apply_to: tag:api
    channels:
      - Slack (#api-alerts)

  - name: SLA Breach Risk
    type: latency
    threshold: 500
    percentile: p99
    apply_to: tag:sla-monitored
    channels:
      - Email ([email protected])

Best Practices

  1. Start conservatively - Begin with higher thresholds, then tune down
  2. Avoid single-failure alerts - Too noisy for most cases
  3. Use severity levels - Not every alert is critical
  4. Set up escalation - Slack → Email → PagerDuty
  5. Include recovery - Know when issues resolve
  6. Review regularly - Tune based on false positive rate
  7. Document rules - Why does this rule exist?

Alert fatigue is real. If your team ignores alerts because there are too many, adjust your thresholds. Better to miss a minor issue than ignore a critical one.

Troubleshooting

Too Many Alerts

  • Increase consecutive failure threshold
  • Add cooldown period
  • Use error rate instead of per-failure
  • Review monitor reliability

Missing Alerts

  • Check rule is applied to monitor
  • Verify channels are configured
  • Check for mute/maintenance windows
  • Test with Trigger Test Alert

Alert Delays

  • Check channel delivery (Slack/email)
  • Verify webhook endpoints respond
  • Review alert processing logs