Alert Rules
Define when and how alerts are triggered based on monitor results.
Alert rules define the conditions that trigger notifications. Create rules to match your operational requirements and avoid alert fatigue.
Creating Alert Rules
- Go to Settings > Alert Rules
- Click New Rule
- Configure conditions and channels
- Apply to monitors (a complete example rule is sketched below)
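Putting those steps together, a rule pairs a condition (a type plus a threshold) with one or more notification channels and the monitors it applies to. A minimal sketch, using only fields documented on this page; the values are illustrative:

```yaml
Name: API Health Alert          # illustrative name
Type: consecutive_failures      # condition configured in step 3
Threshold: 3
Channels:
  - Slack (#ops-alerts)
  - Email
Apply To: tag:production        # step 4: attach the rule to monitors
```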
Rule Types
Consecutive Failures
Alert after N failures in a row. Most common rule type.
```yaml
Name: 3 Consecutive Failures
Type: consecutive_failures
Threshold: 3
Channels:
  - Slack (#ops-alerts)
  - Email ([email protected])
```

Use when: You want to avoid alerting on transient issues.
| Threshold | Use Case |
|---|---|
| 1 | Critical systems, zero tolerance |
| 2-3 | Standard production monitoring |
| 5+ | Flaky checks, high-noise monitors |
Error Rate
Alert when failure percentage exceeds threshold over a time window.
```yaml
Name: High Error Rate
Type: error_rate
Threshold: 10%    # percentage
Window: 1 hour
Min Samples: 10   # minimum runs to evaluate
Channels:
  - Slack (#ops-alerts)
```

Use when: Occasional failures are acceptable, but sustained issues aren't.
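As a rough worked example (assuming the rule simply compares the percentage of failed runs inside the window once Min Samples is reached): a monitor running every 5 minutes produces about 12 runs per hour, so 2 failures in that window is roughly 17% and would trigger this rule, while a single failure (about 8%) would not.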
Latency Threshold
Alert when response time exceeds limit.
```yaml
Name: Slow Response Alert
Type: latency
Threshold: 2000   # milliseconds
Percentile: p95   # or p50, p99, avg
Window: 15 minutes
Channels:
  - Slack (#performance)
```

Use when: Performance degradation impacts users.
| Percentile | Description |
|---|---|
| avg | Average response time |
| p50 | Median (50th percentile) |
| p95 | 95th percentile (recommended) |
| p99 | 99th percentile |
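When choosing a percentile, keep sample counts in mind (assuming the percentile is computed over the runs inside the window): with one check per minute, a 15-minute window holds only about 15 samples, so p95 and p99 are effectively decided by the slowest run and a single outlier can fire the alert. More frequent checks or a longer window make the high percentiles more stable.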
Visual Diff Threshold
Alert when visual changes exceed threshold (journeys only).
```yaml
Name: Visual Regression Alert
Type: visual_diff
Threshold: 5%   # pixel difference percentage
Channels:
  - Slack (#design-review)
  - Email ([email protected])
```

Use when: UI consistency is critical.
SSL Certificate Expiry
Alert before a TLS certificate expires.
```yaml
Name: SSL Expiry Warning
Type: ssl_expiry
Days Before: 30
Channels:
  - Email ([email protected])
```

Recommended thresholds (a tiered example follows the list):

- 30 days: Warning
- 14 days: Urgent
- 7 days: Critical
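One way to implement those tiers is three ssl_expiry rules with escalating severity. This is a sketch in the style of the example configurations further down the page; the rule names, channels, and lowercase field spellings such as days_before are illustrative:

```yaml
Rules:
  - name: SSL Expiry Warning
    type: ssl_expiry
    days_before: 30      # assumption: lowercase spelling of "Days Before"
    severity: medium
    channels:
      - Slack (#monitoring)

  - name: SSL Expiry Urgent
    type: ssl_expiry
    days_before: 14
    severity: high
    channels:
      - Slack (#ops-alerts)
      - Email

  - name: SSL Expiry Critical
    type: ssl_expiry
    days_before: 7
    severity: critical
    channels:
      - PagerDuty
      - Slack (#critical-alerts)
```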
Rule Configuration
Applying Rules to Monitors
Rules can be applied in three ways:
Per-monitor:
```yaml
Monitor: Production API
Alert Rules:
  - 3 Consecutive Failures
  - Slow Response Alert
```

By tag:

```yaml
Rule: Critical Alerts
Apply To: tag:production
```

Globally:

```yaml
Rule: Basic Monitoring
Apply To: all monitors
Exclude: tag:experimental
```

Severity Levels
Assign severity to rules:
```yaml
Severity: critical | high | medium | low

# Different channels per severity
Channels:
  critical:
    - PagerDuty
    - Slack (#critical-alerts)
  high:
    - Slack (#ops-alerts)
    - Email
  medium:
    - Slack (#monitoring)
  low:
    - Email (daily digest)
```

Time-Based Rules
Adjust alerting based on time:
```yaml
Name: Business Hours Alert
Active Hours:
  Start: "09:00"
  End: "18:00"
  Timezone: America/New_York
  Days: [Monday, Tuesday, Wednesday, Thursday, Friday]
Channels:
  - Slack (#ops-alerts)

# Off-hours: different channel
Off Hours Channel:
  - PagerDuty (on-call)
```

Cooldown Period
Prevent repeated alerts:
```yaml
Cooldown: 30 minutes   # Don't re-alert for 30 min
```

Cooldown starts after recovery. If the same issue recurs, a new alert is sent.
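In a full rule definition, the cooldown sits alongside the other fields. A minimal sketch reusing the consecutive-failures fields from above:

```yaml
Name: 3 Consecutive Failures
Type: consecutive_failures
Threshold: 3
Cooldown: 30 minutes   # don't re-alert for 30 minutes
Channels:
  - Slack (#ops-alerts)
```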
Recovery Notifications
Configure recovery alerts:
```yaml
Recovery:
  Enabled: true
  Channels: same   # or specify different channels
  Message: "{{monitor.name}} has recovered after {{incident.duration}}"
```

Recovery notification includes:
- Time down
- Number of failed runs
- Link to incident
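With the message template above, a recovery notification for a monitor named Production API that was down for 12 minutes would read roughly as follows (the exact rendering of {{incident.duration}} is an assumption):

```
Production API has recovered after 12 minutes
```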
Alert Grouping
Group related alerts to reduce noise:
```yaml
Grouping:
  By: monitor | location | tag
  Window: 5 minutes
```

If multiple monitors fail within the window, one grouped alert is sent.
Muting Alerts
Temporarily silence alerts:
Manual Mute
```yaml
Mute:
  Duration: 2 hours
  Reason: "Deploying new version"
```

Scheduled Maintenance

```yaml
Maintenance Window:
  Name: Weekly Deployment
  Schedule: "0 2 * * 0"   # Sunday 2 AM
  Duration: 1 hour
  Mute Alerts: true
```

Example Configurations
Standard Production Setup
```yaml
Rules:
  - name: Critical Failures
    type: consecutive_failures
    threshold: 2
    apply_to: tag:critical
    channels:
      - PagerDuty
      - Slack (#critical)

  - name: Standard Failures
    type: consecutive_failures
    threshold: 3
    apply_to: tag:production
    channels:
      - Slack (#ops-alerts)
      - Email

  - name: Performance Degradation
    type: latency
    threshold: 3000
    percentile: p95
    window: 15m
    apply_to: all
    channels:
      - Slack (#performance)
```

E-commerce Site
```yaml
Rules:
  - name: Checkout Flow Down
    type: consecutive_failures
    threshold: 1   # Zero tolerance
    apply_to: journey:checkout
    channels:
      - PagerDuty
      - Slack (#critical)
      - SMS (on-call)

  - name: Product Pages Slow
    type: latency
    threshold: 2000
    apply_to: monitor:product-pages
    channels:
      - Slack (#performance)

  - name: Visual Changes
    type: visual_diff
    threshold: 3%
    apply_to: journey:homepage
    channels:
      - Slack (#design)
```

API Platform
```yaml
Rules:
  - name: API Down
    type: consecutive_failures
    threshold: 2
    apply_to: tag:api
    channels:
      - PagerDuty
      - Slack (#api-alerts)

  - name: High Error Rate
    type: error_rate
    threshold: 5%
    window: 30m
    apply_to: tag:api
    channels:
      - Slack (#api-alerts)

  - name: SLA Breach Risk
    type: latency
    threshold: 500
    percentile: p99
    apply_to: tag:sla-monitored
    channels:
      - Email ([email protected])
```

Best Practices
- Start conservative - Begin with higher thresholds, tune down
- Avoid single-failure alerts - Too noisy for most cases
- Use severity levels - Not every alert is critical
- Set up escalation - Slack → Email → PagerDuty
- Include recovery - Know when issues resolve
- Review regularly - Tune based on false positive rate
- Document rules - Why does this rule exist?
Alert fatigue is real. If your team ignores alerts because there are too many, adjust your thresholds. Better to miss a minor issue than ignore a critical one.
Troubleshooting
Too Many Alerts
- Increase consecutive failure threshold
- Add cooldown period
- Use error rate instead of per-failure
- Review monitor reliability
Missing Alerts
- Check rule is applied to monitor
- Verify channels are configured
- Check for mute/maintenance windows
- Test with Trigger Test Alert
Alert Delays
- Check channel delivery (Slack/email)
- Verify webhook endpoints respond
- Review alert processing logs