Alerting
The alerting system provides real-time monitoring and notification capabilities for your API infrastructure, enabling proactive response to performance issues and outages.
System Overview
The alerting system consists of four main components:
Rules - Define monitoring conditions and thresholds
Filters - Narrow down alert scope to specific conditions
Integrations - Configure where alerts are sent
Events - View alert history and patterns
1. Rules Configuration
Rules are the foundation of your alerting system, continuously monitoring metrics and triggering notifications when thresholds are crossed.
Pre-Built Rule Templates
Performance Monitoring:
Critical Latency Alert - Triggers when maximum response time exceeds critical threshold
Slow Response Time - Detects when API response times degrade beyond acceptable thresholds
High Traffic Alert - Monitors for unusually high request volumes that might impact performance
Bandwidth Consumption - Alerts when data transfer exceeds expected thresholds
Reliability Monitoring:
High Error Rate - Alerts when error rate exceeds threshold, enabling quick API failure detection
Low Availability Warning - Early warning when availability starts to drop
Vendor Availability Drop - Monitors when vendor service availability falls below acceptable levels
Client-specific Error Rate - Monitors errors for specific client integrations
Infrastructure Monitoring:
Connection Spike - Alerts on unusual spikes in connection count that might indicate issues
Custom Rule Creation
Step 1: Choose Alert Type
Monitor a Metric Against a Threshold (currently available)
Check for New Vendors, Endpoints, or Clients (coming soon)
Look for Percentage Changes in a Metric (coming soon)
Step 2: Configure Check Interval
How frequently the rule evaluates your metrics:
1 Minute (most responsive)
5 Minutes
15 Minutes
30 Minutes
1 Hour
2, 4, 8, 12, or 24 Hours (for less volatile metrics)
Step 3: Select Metric
Availability Metrics:
Availability - Overall service availability percentage
Average Availability - Mean availability across time period
P99/P95/P90 Availability - Percentile-based availability metrics
Performance Metrics:
Average Duration - Mean response time
P99/P95/P90/P50 Duration - Percentile response times
Maximum Duration - Worst-case response time
Traffic Metrics:
Total Connections - Active connection count
Connections per Second - Connection rate
Total Requests - Request volume
Requests per Second - Request rate
Error Metrics:
Errors - Total error count
Errors per Second - Error rate
Data Transfer Metrics:
Total Bytes In/Out - Combined data transfer
Total Bytes Sent/Received - Directional data transfer
Bytes Sent/Received per Second - Transfer rates
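The percentile metrics above (P99, P95, P90, P50) can be illustrated with a small sketch. This page does not say how percentiles are computed, so the nearest-rank method and the function name percentile_nearest_rank are assumptions chosen for illustration only.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Return the nearest-rank percentile of `values`.

    ASSUMPTION: nearest-rank is used here purely as an illustration; the
    alerting system may compute percentiles differently.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    # Smallest rank such that at least pct percent of samples are <= the result.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical response times in milliseconds.
durations_ms = [120.0, 80.0, 300.0, 95.0, 110.0]
p50 = percentile_nearest_rank(durations_ms, 50)
p99 = percentile_nearest_rank(durations_ms, 99)
```

With only five samples, P99 equals the maximum, which is one reason percentile rules tend to be more informative on higher-traffic endpoints.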
Step 4: Set Operator
Greater Than (>)
Less Than (<)
Equal (=)
Greater Than or Equal (≥)
Less Than or Equal (≤)
Step 5: Define Threshold
Enter the numeric value that triggers the alert.
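As a rough mental model, Steps 4 and 5 together define a comparison of the form "metric value, operator, threshold". The sketch below shows that comparison in Python; the OPERATORS table and the condition_met helper are hypothetical names, not part of the product.

```python
import operator
from typing import Callable, Dict

# Operator symbols from Step 4 mapped to Python comparison functions.
# ASCII ">=" and "<=" stand in for the ≥ and ≤ options.
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
}


def condition_met(value: float, op: str, threshold: float) -> bool:
    """Return True when `value <op> threshold` holds (hypothetical helper)."""
    return OPERATORS[op](value, threshold)


# Example: an availability of 99.0% checked against "Availability < 99.5".
availability_breached = condition_met(99.0, "<", 99.5)
```

Note that strict and inclusive operators differ exactly at the threshold: a value of 99.5 does not satisfy "< 99.5" but does satisfy "<= 99.5".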
Step 6: Group By (Optional)
Segment alerts by:
Vendor - Monitor each vendor separately
Endpoint - Track individual endpoints
Client - Monitor per-client metrics
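To illustrate what grouping means in practice, the sketch below averages a metric separately per vendor, so that one slow vendor is not hidden by fast ones. The sample records, the field names vendor and duration_ms, and the helper average_by_group are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping


def average_by_group(
    samples: Iterable[Mapping[str, Any]], group_by: str, metric: str
) -> Dict[str, float]:
    """Average `metric` separately for each distinct value of `group_by`."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for sample in samples:
        buckets[str(sample[group_by])].append(float(sample[metric]))
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# Hypothetical request samples.
samples = [
    {"vendor": "a.example", "duration_ms": 120.0},
    {"vendor": "a.example", "duration_ms": 180.0},
    {"vendor": "b.example", "duration_ms": 700.0},
]

averages = average_by_group(samples, "vendor", "duration_ms")
# Each group is compared against the threshold on its own.
breaching = sorted(name for name, avg in averages.items() if avg > 500)
```

Without grouping, the overall average here would be about 333 ms and a 500 ms threshold would not be crossed; with grouping, b.example is flagged on its own.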
Step 7: Set Report Frequency
Don't Report - Evaluate only, no notifications
On State Change - Notify when alert state changes (recommended)
Always - Notify every time condition is met
Custom Schedule - Define specific intervals
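The difference between the report frequency options can be sketched as a small decision function. This is an illustration under assumptions, not the product's implementation: the enum and function names are hypothetical, "On State Change" is assumed to notify on both firing and recovery, and Custom Schedule is left out because its semantics are not described here.

```python
from enum import Enum
from typing import List


class ReportFrequency(Enum):
    DONT_REPORT = "dont_report"
    ON_STATE_CHANGE = "on_state_change"
    ALWAYS = "always"


def should_notify(
    frequency: ReportFrequency, was_firing: bool, is_firing: bool
) -> bool:
    """Decide whether one evaluation should produce a notification."""
    if frequency is ReportFrequency.DONT_REPORT:
        return False
    if frequency is ReportFrequency.ALWAYS:
        return is_firing
    # ON_STATE_CHANGE (assumed): notify on any transition, including recovery.
    return was_firing != is_firing


def notification_indexes(
    frequency: ReportFrequency, states: List[bool]
) -> List[int]:
    """Return the evaluation indexes that would notify, starting from 'not firing'."""
    indexes = []
    previous = False
    for index, current in enumerate(states):
        if should_notify(frequency, previous, current):
            indexes.append(index)
        previous = current
    return indexes


# Four consecutive evaluations: ok, firing, firing, ok.
evaluations = [False, True, True, False]
on_change = notification_indexes(ReportFrequency.ON_STATE_CHANGE, evaluations)
always = notification_indexes(ReportFrequency.ALWAYS, evaluations)
```

Under these assumptions, "On State Change" sends one message when the problem starts and one when it clears, while "Always" repeats for as long as the condition holds.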
2. Filters (Advanced Configuration)
Filters allow you to create highly specific alert conditions by narrowing the scope of monitored data.
Available Filter Dimensions:
Network & Infrastructure:
Source IP / Destination IP
Source Port / Destination Port
Protocol (HTTP/HTTPS, etc.)
Direction (inbound/outbound)
Host / Hostname
Geographic:
Country
Continent
Region
City
Application Layer:
Endpoint
HTTP Method
HTTP Status
Content Type
Error
TLS Version
Container/Kubernetes:
Container Image
Container Name
Pod Name
Pod Namespace
Business Logic:
Vendor
Environment
Strategy
Data Type
Security & Access:
Agent
System User
Bin
Executable
IP
How Filters Work:
Add a filter by clicking the "+" button
Select the dimension to filter on
Choose the matching criteria
Filters reduce false positives by limiting a rule to the specific traffic it is meant to watch
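Conceptually, a filter keeps only the records whose dimension matches the chosen criteria before the rule is evaluated. The sketch below assumes exact-match criteria and that multiple filters are combined with AND; both are assumptions, and the record fields and the matches_all helper are hypothetical.

```python
from typing import Dict, List, Mapping, Sequence, Tuple

# A filter is modeled as (dimension, expected value).
Filter = Tuple[str, str]


def matches_all(record: Mapping[str, str], filters: Sequence[Filter]) -> bool:
    """Return True when the record satisfies every filter (AND is assumed)."""
    return all(record.get(dim) == expected for dim, expected in filters)


# Hypothetical traffic records.
records: List[Dict[str, str]] = [
    {"environment": "Production", "http_method": "GET", "endpoint": "/orders"},
    {"environment": "Staging", "http_method": "GET", "endpoint": "/orders"},
    {"environment": "Production", "http_method": "POST", "endpoint": "/orders"},
]

filters: List[Filter] = [("environment", "Production"), ("http_method", "GET")]
in_scope = [r for r in records if matches_all(r, filters)]
```

Only the first record remains in scope, so staging traffic and POST requests would not contribute to the rule's metric.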
3. Integrations
Integrations define where and how alerts are delivered. The system supports any webhook-based integration.
Creating an Integration
Name: Descriptive identifier for your integration
Description: Optional details about purpose/destination
Status: Enable/Disable toggle
Configuration:
Method: HTTP method (typically POST)
URL: Webhook endpoint URL
Auth Token: Optional authentication header
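On the receiving side, a webhook handler would typically verify the token before trusting a request. The sketch below assumes the token arrives as "Authorization: Bearer <token>"; this page does not specify the header name or scheme, so treat both as assumptions and adjust to what your endpoint actually receives. The is_authorized helper is hypothetical.

```python
import hmac
from typing import Mapping, Optional


def is_authorized(headers: Mapping[str, str], expected_token: Optional[str]) -> bool:
    """Check an incoming webhook request against the configured Auth Token."""
    if not expected_token:
        # No token configured on the integration, so there is nothing to verify.
        return True
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get("authorization", "")
    # ASSUMPTION: the token is sent as "Bearer <token>".
    expected = f"Bearer {expected_token}"
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(supplied.encode(), expected.encode())


accepted = is_authorized({"Authorization": "Bearer s3cret"}, "s3cret")
rejected = is_authorized({"Authorization": "Bearer wrong"}, "s3cret")
```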
Alert Payload Format
When an alert fires, the following JSON payload is sent to your configured webhook:
{
  "orgId": "X0ysTipXUDzr9JPkY4W6",
  "contextId": "",
  "alertId": "cv4umbq9io6g00lgdpog",
  "name": "Low Availability Warning",
  "description": "Early warning when availability starts to drop",
  "message": "P99 Availability of 0.00% for cloudflare.com is below threshold of 99.50%",
  "locationId": "cloudflare.com",
  "locationType": "vendor",
  "timestamp": "2025-07-29T19:24:52.726139471Z"
}
Payload Fields:
orgId: Your organization identifier
contextId: Additional context identifier (if applicable)
alertId: Unique identifier for this alert instance
name: The rule name that triggered
description: The rule description
message: Human-readable alert message with current value, threshold, and location
locationId: The specific entity that triggered the alert (vendor, endpoint, or client)
locationType: Type of entity ("vendor", "endpoint", or "client")
timestamp: ISO 8601 timestamp when the alert was triggered
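A webhook consumer might parse this payload into a typed structure. The sketch below is one way to do it; the AlertPayload class and the helper names are hypothetical. It assumes the timestamp always ends in "Z" as in the example, and it trims the nanosecond fraction to microseconds because Python's datetime does not store finer precision.

```python
import json
import re
from dataclasses import dataclass
from datetime import datetime, timezone

_TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z")


@dataclass(frozen=True)
class AlertPayload:
    org_id: str
    context_id: str
    alert_id: str
    name: str
    description: str
    message: str
    location_id: str
    location_type: str
    timestamp: datetime


def parse_timestamp(raw: str) -> datetime:
    """Parse a UTC timestamp such as 2025-07-29T19:24:52.726139471Z."""
    match = _TIMESTAMP.fullmatch(raw)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {raw!r}")
    base, fraction = match.groups()
    # Keep at most six fractional digits (microseconds), padding short ones.
    micros = int((fraction or "0")[:6].ljust(6, "0"))
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


def parse_alert(body: str) -> AlertPayload:
    """Convert the webhook's JSON body into an AlertPayload."""
    data = json.loads(body)
    return AlertPayload(
        org_id=data["orgId"],
        context_id=data.get("contextId", ""),
        alert_id=data["alertId"],
        name=data["name"],
        description=data["description"],
        message=data["message"],
        location_id=data["locationId"],
        location_type=data["locationType"],
        timestamp=parse_timestamp(data["timestamp"]),
    )
```

From there, a handler could route on location_type ("vendor", "endpoint", or "client") and use location_id to identify the affected entity.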
4. Events Dashboard
The Events page provides comprehensive visibility into your alert history.
Overview Features:
Time Range Selector: Past 15 minutes to 30 days
Summary Table: Total occurrences per rule
Activity Timeline: Visual representation of alert patterns
Per-Rule Analytics:
Occurrence count
Time-series graph showing when alerts fired
Pattern identification for recurring issues
Peak activity periods
Using Events Data:
Identify false positive patterns
Adjust thresholds based on historical data
Correlate alerts with known incidents
Track improvement over time
Common Alerting Scenarios
API Health Monitoring:
Rule: Availability < 99.5%
Filter: Environment = Production
Group By: Endpoint
Check Interval: 1 minute
Vendor SLA Tracking:
Rule: P99 Duration > 500ms
Filter: Vendor exists
Group By: Vendor
Check Interval: 5 minutes
Error Spike Detection:
Rule: Errors per Second > 10
Filter: HTTP Status = 5xx
Group By: Client
Check Interval: 1 minute
Geographic Performance:
Rule: Average Duration > 200ms
Filter: Country = "United States"
Group By: Region
Check Interval: 15 minutes