Alerting

The alerting system provides real-time monitoring and notification capabilities for your API infrastructure, enabling proactive response to performance issues and outages.

System Overview

The alerting system consists of four main components:

Rules - Define monitoring conditions and thresholds
Filters - Narrow down alert scope to specific conditions
Integrations - Configure where alerts are sent
Events - View alert history and patterns

1. Rules Configuration

Rules are the foundation of your alerting system: each rule continuously evaluates a metric and triggers a notification when its threshold is crossed.

Pre-Built Rule Templates

Performance Monitoring:

Critical Latency Alert - Triggers when maximum response time exceeds critical threshold
Slow Response Time - Detects when API response times degrade beyond acceptable thresholds
High Traffic Alert - Monitors for unusually high request volumes that might impact performance
Bandwidth Consumption - Alerts when data transfer exceeds expected thresholds

Reliability Monitoring:

High Error Rate - Alerts when error rate exceeds threshold, enabling quick API failure detection
Low Availability Warning - Early warning when availability starts to drop
Vendor Availability Drop - Monitors when vendor service availability falls below acceptable levels
Client-specific Error Rate - Monitors errors for specific client integrations

Infrastructure Monitoring:

Connection Spike - Alerts on unusual spikes in connection count that might indicate issues

Custom Rule Creation

Step 1: Choose Alert Type

Monitor a Metric Against a Threshold (currently available)
Check for New Vendors, Endpoints, or Clients (coming soon)
Look for Percentage Changes in a Metric (coming soon)

Step 2: Configure Check Interval

How frequently the rule evaluates your metrics:

1 Minute (most responsive)
5 Minutes
15 Minutes
30 Minutes
1 Hour
2, 4, 8, 12, 24 Hours (for less volatile metrics)

Step 3: Select Metric

Availability Metrics:

Availability - Overall service availability percentage
Average Availability - Mean availability across time period
P99/P95/P90 Availability - Percentile-based availability metrics

Performance Metrics:

Average Duration - Mean response time
P99/P95/P90/P50 Duration - Percentile response times
Maximum Duration - Worst-case response time

Traffic Metrics:

Total Connections - Active connection count
Connections per Second - Connection rate
Total Requests - Request volume
Requests per Second - Request rate

Error Metrics:

Errors - Total error count
Errors per Second - Error rate

Data Transfer Metrics:

Total Bytes In/Out - Combined data transfer
Total Bytes Sent/Received - Directional data transfer
Bytes Sent/Received per Second - Transfer rates
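
To make the metric definitions above concrete, here is a minimal Python sketch of how a percentile duration and a per-second rate could be derived from raw samples. The function names, the sample data, and the nearest-rank percentile method are assumptions for illustration only; this page does not state which percentile method the alerting system uses.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def per_second(total, window_seconds):
    """Convert a total counted over a time window into a per-second rate."""
    return total / window_seconds


# Hypothetical response times (milliseconds) observed in a 60-second window.
durations_ms = [120, 95, 110, 480, 105, 99, 101, 130, 97, 102]

p50 = percentile(durations_ms, 50)                 # P50 Duration
p90 = percentile(durations_ms, 90)                 # P90 Duration
maximum = max(durations_ms)                        # Maximum Duration
average = sum(durations_ms) / len(durations_ms)    # Average Duration
requests_per_second = per_second(len(durations_ms), 60)
```

In this sample the single 480 ms outlier dominates Maximum Duration and pulls Average Duration up, while P50 Duration stays close to typical behaviour, which is one reason to alert on percentiles rather than on the mean.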

Step 4: Set Operator

Greater Than (>)
Less Than (<)
Equal (=)
Greater Than or Equal (≥)
Less Than or Equal (≤)

Step 5: Define Threshold

Enter the numeric value that triggers the alert.
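
Steps 4 and 5 together form the rule condition: a metric value compared against a threshold with one of the five operators. A minimal sketch of that comparison follows; the `condition_met` helper and the ASCII operator spellings (`>=`, `<=`) are illustrative assumptions, not part of the product.

```python
import operator

# Map each operator from Step 4 to the matching Python comparison.
OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
}


def condition_met(value, op, threshold):
    """Return True when `value <op> threshold` holds."""
    if op not in OPERATORS:
        raise ValueError(f"unknown operator: {op!r}")
    return OPERATORS[op](value, threshold)


# Availability of 99.2% against an "Availability < 99.5" rule.
availability_breached = condition_met(99.2, "<", 99.5)

# A value exactly at the threshold only breaches with the inclusive operator.
strict = condition_met(500, ">", 500)
inclusive = condition_met(500, ">=", 500)
```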

Step 6: Group By (Optional)

Segment alerts by:

Vendor - Monitor each vendor separately
Endpoint - Track individual endpoints
Client - Monitor per-client metrics
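
Grouping means the condition is checked once per vendor, endpoint, or client instead of once over all traffic. The sketch below illustrates the idea for an average-duration rule grouped by vendor; the sample records, field names, and helper functions are hypothetical.

```python
from collections import defaultdict


def average_by_group(samples, group_key):
    """Average duration per distinct value of `group_key`."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample[group_key]].append(sample["duration_ms"])
    return {group: sum(values) / len(values) for group, values in grouped.items()}


def breaching_groups(samples, group_key, threshold_ms):
    """Groups whose average duration is greater than the threshold."""
    averages = average_by_group(samples, group_key)
    return {group: avg for group, avg in averages.items() if avg > threshold_ms}


# Hypothetical samples: one fast vendor, one slow vendor.
samples = [
    {"vendor": "vendor-a.example", "duration_ms": 120},
    {"vendor": "vendor-a.example", "duration_ms": 180},
    {"vendor": "vendor-b.example", "duration_ms": 650},
    {"vendor": "vendor-b.example", "duration_ms": 550},
]

breaches = breaching_groups(samples, "vendor", 500)
```

Without grouping, the overall average here is 375 ms and a 500 ms threshold would never be crossed; grouped by vendor, the slow vendor is reported on its own.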

Step 7: Set Report Frequency

Don't Report - Evaluate only, no notifications
On State Change - Notify when alert state changes (recommended)
Always - Notify every time condition is met
Custom Schedule - Define specific intervals

2. Filters (Advanced Configuration)

Filters allow you to create highly specific alert conditions by narrowing the scope of monitored data.

Available Filter Dimensions:

Network & Infrastructure:

Source IP / Destination IP
Source Port / Destination Port
Protocol (HTTP/HTTPS, etc.)
Direction (inbound/outbound)
Host / Hostname

Geographic:

Country
Continent
Region
City

Application Layer:

Endpoint
HTTP Method
HTTP Status
Content Type
Error
TLS Version

Container/Kubernetes:

Container Image
Container Name
Pod Name
Pod Namespace

Business Logic:

Vendor
Environment
Strategy
Data Type

Security & Access:

Agent
System User
Bin
Executable
IP

How Filters Work:

Add a filter by clicking the "+" button
Select the dimension to filter on
Choose the matching criteria

Well-scoped filters reduce false positives by restricting a rule to the specific traffic it is meant to watch.

Filter operators (AND/OR) are coming soon.
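
Conceptually, a filter restricts which records a rule evaluates. The following sketch assumes exact-match criteria and assumes that a record must satisfy every configured filter; how multiple filters actually combine is not specified on this page, and the dimension keys and records are hypothetical.

```python
def matches(record, filters):
    """True when the record has the expected value for every filtered dimension."""
    return all(record.get(dimension) == expected for dimension, expected in filters.items())


def apply_filters(records, filters):
    """Keep only the records a rule with these filters would evaluate."""
    return [record for record in records if matches(record, filters)]


# Hypothetical traffic records keyed by filter dimension.
records = [
    {"environment": "production", "http_method": "GET", "endpoint": "/v1/orders"},
    {"environment": "staging", "http_method": "GET", "endpoint": "/v1/orders"},
    {"environment": "production", "http_method": "POST", "endpoint": "/v1/orders"},
]

filters = {"environment": "production", "http_method": "GET"}
selected = apply_filters(records, filters)
```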

3. Integrations

Integrations define where and how alerts are delivered. The system supports any webhook-based integration.

Creating an Integration

Name: Descriptive identifier for your integration
Description: Optional details about purpose/destination
Status: Enable/Disable toggle
Configuration:
- Method: HTTP method (typically POST)
- URL: Webhook endpoint URL
- Auth Token: Optional authentication header
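
When testing a webhook endpoint before pointing an integration at it, it can help to build a request shaped like the configuration above. This sketch only constructs the request and does not send it. The URL is a placeholder, and sending the token as an `Authorization: Bearer` header is an assumption, since the exact header format used for the Auth Token is not documented here.

```python
import json
import urllib.request


def build_webhook_request(url, payload, method="POST", auth_token=None):
    """Build (but do not send) a JSON webhook request."""
    headers = {"Content-Type": "application/json"}
    if auth_token:
        # Assumption: the token travels as a bearer token.
        headers["Authorization"] = f"Bearer {auth_token}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method=method)


# Placeholder endpoint and token for illustration only.
request = build_webhook_request(
    "https://example.com/hooks/alerts",
    {"name": "Low Availability Warning", "locationType": "vendor"},
    auth_token="s3cret",
)

# To actually deliver the request: urllib.request.urlopen(request, timeout=10)
```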

Alert Payload Format

When an alert fires, the following JSON payload is sent to your configured webhook:

{
  "orgId": "X0ysTipXUDzr9JPkY4W6",
  "contextId": "",
  "alertId": "cv4umbq9io6g00lgdpog",
  "name": "Low Availability Warning",
  "description": "Early warning when availability starts to drop",
  "message": "P99 Availability of 0.00% for cloudflare.com is below threshold of 99.50%",
  "locationId": "cloudflare.com",
  "locationType": "vendor",
  "timestamp": "2025-07-29T19:24:52.726139471Z"
}

Payload Fields:

orgId: Your organization identifier
contextId: Additional context identifier (if applicable)
alertId: Unique identifier for this alert instance
name: The rule name that triggered
description: The rule description
message: Human-readable alert message with current value, threshold, and location
locationId: The specific entity that triggered the alert (vendor, endpoint, or client)
locationType: Type of entity ("vendor", "endpoint", or "client")
timestamp: ISO 8601 timestamp when the alert was triggered
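
On the receiving side, a handler might parse the payload into a typed object before routing it. The sketch below uses only the fields listed above; the `Alert` class and function names are illustrative. The example timestamp carries nanosecond precision, which Python's `datetime` cannot represent, so the fraction is truncated to microseconds.

```python
import json
import re
from dataclasses import dataclass
from datetime import datetime, timezone

_TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z")


def parse_timestamp(value):
    """Parse a UTC timestamp ending in 'Z', truncating the fraction to microseconds."""
    match = _TIMESTAMP.fullmatch(value)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {value!r}")
    base, fraction = match.groups()
    micros = int((fraction or "")[:6].ljust(6, "0"))
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


@dataclass
class Alert:
    org_id: str
    context_id: str
    alert_id: str
    name: str
    description: str
    message: str
    location_id: str
    location_type: str
    timestamp: datetime


def parse_alert(body):
    """Turn a raw webhook body (JSON text) into an Alert."""
    data = json.loads(body)
    if data["locationType"] not in {"vendor", "endpoint", "client"}:
        raise ValueError(f"unknown locationType: {data['locationType']!r}")
    return Alert(
        org_id=data["orgId"],
        context_id=data["contextId"],
        alert_id=data["alertId"],
        name=data["name"],
        description=data["description"],
        message=data["message"],
        location_id=data["locationId"],
        location_type=data["locationType"],
        timestamp=parse_timestamp(data["timestamp"]),
    )


# The example payload from this page.
example_body = """{
  "orgId": "X0ysTipXUDzr9JPkY4W6",
  "contextId": "",
  "alertId": "cv4umbq9io6g00lgdpog",
  "name": "Low Availability Warning",
  "description": "Early warning when availability starts to drop",
  "message": "P99 Availability of 0.00% for cloudflare.com is below threshold of 99.50%",
  "locationId": "cloudflare.com",
  "locationType": "vendor",
  "timestamp": "2025-07-29T19:24:52.726139471Z"
}"""

alert = parse_alert(example_body)
```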

4. Events Dashboard

The Events page provides comprehensive visibility into your alert history.

Overview Features:

Time Range Selector: Past 15 minutes to 30 days
Summary Table: Total occurrences per rule
Activity Timeline: Visual representation of alert patterns

Per-Rule Analytics:

Occurrence count
Time-series graph showing when alerts fired
Pattern identification for recurring issues
Peak activity periods

Using Events Data:

Identify false positive patterns
Adjust thresholds based on historical data
Correlate alerts with known incidents
Track improvement over time
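
The same kind of analysis can be reproduced offline if alert payloads are stored by your webhook receiver. The sketch below assumes a list of received payloads is available as dictionaries (this page does not describe an export from the Events page) and tallies occurrences per rule and the busiest hour for one rule. The events shown are invented for the example.

```python
from collections import Counter


def occurrences_per_rule(events):
    """Count how many times each rule fired."""
    return Counter(event["name"] for event in events)


def peak_hour(events, rule_name):
    """Return (hour_bucket, count) for the rule's busiest hour, or None if it never fired."""
    # The first 13 characters of the timestamp are the date plus the hour.
    hours = Counter(
        event["timestamp"][:13] for event in events if event["name"] == rule_name
    )
    return hours.most_common(1)[0] if hours else None


# Hypothetical stored payloads, reduced to the two fields used here.
events = [
    {"name": "Low Availability Warning", "timestamp": "2025-07-29T19:24:52Z"},
    {"name": "Low Availability Warning", "timestamp": "2025-07-29T19:41:10Z"},
    {"name": "High Error Rate", "timestamp": "2025-07-29T19:42:03Z"},
    {"name": "Low Availability Warning", "timestamp": "2025-07-29T21:05:37Z"},
]

summary = occurrences_per_rule(events)
busiest = peak_hour(events, "Low Availability Warning")
```

A rule that fires far more often than the others, or that clusters in the same hour every day, is a candidate for a threshold adjustment or an additional filter.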

Common Alerting Scenarios

API Health Monitoring:

Rule: Availability < 99.5%
Filter: Environment = Production
Group By: Endpoint
Check Interval: 1 minute

Vendor SLA Tracking:

Rule: P99 Duration > 500ms
Filter: Vendor exists
Group By: Vendor
Check Interval: 5 minutes

Error Spike Detection:

Rule: Errors per Second > 10
Filter: HTTP Status = 5xx
Group By: Client
Check Interval: 1 minute

Geographic Performance:

Rule: Average Duration > 200ms
Filter: Country = "United States"
Group By: Region
Check Interval: 15 minutes

