Alerting
The alerting system provides real-time monitoring and notification capabilities for your API infrastructure, enabling proactive response to performance issues and outages.
Quick start: See the "Set up error alerting" section of the POC Kick Off Guide for a simple working example of creating rules and configuring integrations.
System Overview
The alerting system consists of four main components:
Rules - Define monitoring conditions and thresholds
Filters - Narrow alert scope to specific conditions
Integrations - Configure where alerts are sent
Events - View alert history and patterns
Rules Configuration
Rules are the foundation of your alerting system, continuously monitoring metrics and triggering notifications when thresholds are crossed.
Pre-Built Rule Templates
Qpoint provides pre-built templates organized by category to help you get started quickly.
Discovery Templates
Monitor when new entities appear in your traffic:
New Vendor - Detects when a new vendor is used
New Endpoint - Detects when a new endpoint (domain or IP) is accessed
New Client + Vendor - Detects when a client uses a vendor for the first time
New Sensitive Data Type - Detects when a new sensitive data type is used
New Client + Sensitive Data Type + Vendor - Detects when a client sends a sensitive data type to a vendor for the first time
New Token - Detects when a new token is used
New Country - Detects when a new country is connected to
Risk Templates
Identify security and compliance risks:
Root User Connections - Detect when connections are established from processes running as root user
Unencrypted Data Transmission - Alert when data is transmitted over unencrypted connections
Direct IP Access - Detect when connections bypass DNS resolution and use direct IP addresses
Shell Access Attempts - Detect when shell access is attempted or established
Weak TLS Usage - Alert when deprecated or weak TLS versions are used
Authentication Failures - Detect when authentication errors occur
Token-Based Authentication - Track when authentication tokens are used
Unknown TLS Configuration - Alert when unknown or unsupported TLS configurations are detected
High-Risk Connection Pattern - Detect high-risk connections combining root access with unencrypted transmission
Authentication Token from Root - Detect when authentication tokens are used from root user processes
Reliability Templates
Monitor service health and availability:
High Error Rate - Alerts when error rate exceeds threshold
Sudden Error Rate Increase - Detects when error rate increases suddenly
Vendor Availability Unacceptable - Monitors when service availability falls below acceptable levels
Sudden Availability Drop - Detects when service availability drops suddenly
Performance Templates
Track response times and latency:
Slow Response Time - Detects when API response times degrade beyond acceptable thresholds
Sudden Response Time Increase - Detects when response time increases suddenly
Custom Rule Creation
When the pre-built templates don't meet your needs, create custom rules with complete control over monitoring conditions. There are four types of custom alerts, each designed for different monitoring scenarios.
Alert Types
1. Discover New Entities
Detect when new entities appear in your traffic for the first time.
Configuration:
Check Interval: How frequently to check for new entities (1 minute to 24 hours)
Entity Combinations: Select which entities to monitor (Vendor, Endpoint, Client, PII Data, Token, Country)
Report Frequency: How often to notify (Always, On State Change, Don't Report, Custom Schedule)
Example: Alert when your application connects to a new vendor you haven't seen before, which could indicate shadow IT or unauthorized integrations.
2. Risk Detection
Identify security and compliance risks in your connections based on predefined risk patterns.
Configuration:
Check Interval: How frequently to evaluate risk (1 minute to 24 hours)
Define Source By: What initiated the connection (Bin, Container, Pod, Process, etc.)
Define Destination By: Where the connection is going (Vendor, Endpoint, IP, etc.)
Risk Labels: Select which risk patterns to monitor from the available labels:
user-shell - Shell access attempts or established shell connections
direct-ip - Direct IP address connections bypassing DNS
is-root - Connections from processes running as root user
unencrypted - Unencrypted data transmission
deprecated-tls - Usage of deprecated or weak TLS versions
unknown-tls - Unknown or unsupported TLS configurations
auth-error - Authentication failures during connections
auth-token - Token-based authentication usage
Report Frequency: How often to notify
Example: Detect when root user processes establish connections to external vendors, or when unencrypted connections are made to sensitive endpoints.
3. Percentage Changes in a Metric
Monitor for significant increases or decreases in metrics compared to historical baselines.
Configuration:
Check Interval: How frequently to evaluate changes (1 minute to 24 hours)
Metric: Select the metric to monitor (same metrics as threshold alerts)
Increase Percentage: Alert when metric increases by this percentage (e.g., 50%)
Decrease Percentage: Alert when metric decreases by this percentage (e.g., 25%)
Group By: Segment alerts by Vendor, Endpoint, or Client (optional)
Report Frequency: How often to notify
Example: Alert when error rates suddenly spike by 50% or when traffic volume drops by 25%, indicating potential issues even if absolute thresholds aren't crossed.
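Conceptually, a percentage-change rule compares the current value of a metric against a historical baseline and fires when the relative change crosses either bound. The sketch below illustrates the arithmetic only; it is not Qpoint's implementation, and the baseline, current value, and thresholds shown are hypothetical.

```python
def percent_change_alert(baseline: float, current: float,
                         increase_pct: float = 50.0,
                         decrease_pct: float = 25.0) -> str | None:
    """Illustrative only: return a message if the change exceeds either bound."""
    if baseline == 0:
        return None  # nothing to compare against yet
    change = (current - baseline) / baseline * 100.0
    if change >= increase_pct:
        return f"metric increased {change:.1f}% (threshold {increase_pct}%)"
    if change <= -decrease_pct:
        return f"metric decreased {abs(change):.1f}% (threshold {decrease_pct}%)"
    return None

# Errors per second rising from 4.0 to 6.5 is a +62.5% change,
# which crosses a 50% increase threshold even though 6.5 errors/sec
# might not cross an absolute threshold.
print(percent_change_alert(baseline=4.0, current=6.5))
```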
4. Metric Surpasses a Threshold
Alert when a metric crosses a defined absolute threshold value.
Configuration:
Check Interval: How frequently to evaluate the metric (1 minute to 24 hours)
Metric: Select from 50+ available metrics (see metric categories below)
Operator: How to compare the metric (>, <, =, ≥, ≤)
Threshold: The numeric value that triggers the alert
Group By: Segment alerts by Vendor, Endpoint, or Client (optional)
Report Frequency: How often to notify
Example: Alert when P99 response time exceeds 500ms or when error count goes above 100 errors.
Check Intervals
How frequently the rule evaluates your metrics:
1 Minute (most responsive)
5 Minutes
15 Minutes
30 Minutes
1 Hour
2, 4, 8, 12, 24 Hours (for less volatile metrics)
Available Metrics
Choose from 50+ available metrics organized by category.
Availability Metrics:
Availability - Overall service availability percentage
Average Availability - Mean availability across time period
P99/P95/P90 Availability - Percentile-based availability metrics
Performance Metrics:
Average Duration - Mean response time
P99/P95/P90/P50 Duration - Percentile response times
Maximum Duration - Worst-case response time
Traffic Metrics:
Total Connections - Active connection count
Connections per Second - Connection rate
Total Requests - Request volume
Requests per Second - Request rate
Error Metrics:
Errors - Total error count
Errors per Second - Error rate
Data Transfer Metrics:
Total Bytes In/Out - Combined data transfer
Total Bytes Sent/Received - Directional data transfer
Bytes Sent/Received per Second - Transfer rates
Operators (for Threshold alerts)
Choose how to compare the metric against your threshold:
Greater Than (>)
Less Than (<)
Equal (=)
Greater Than or Equal (≥)
Less Than or Equal (≤)
Grouping (Optional)
Segment alerts by specific dimensions to get granular notifications:
Vendor - Monitor each vendor separately
Endpoint - Track individual endpoints
Client - Monitor per-client metrics
Report Frequency
Control how often you want to be notified:
Don't Report - Evaluate only, no notifications (useful for testing)
On State Change - Notify when alert state changes (recommended to avoid alert fatigue)
Always - Notify every time condition is met
Custom Schedule - Define specific intervals for notifications
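As a rough mental model, "Always" notifies on every check interval where the condition holds, while "On State Change" only notifies when the evaluated condition flips between OK and alerting. The sketch below is only an illustration of that difference using hypothetical per-interval evaluation results; it is not Qpoint's internal logic.

```python
# Hypothetical per-interval evaluation results: True = condition met.
evaluations = [False, True, True, True, False, True]

# "Always": notify on every interval where the condition is met.
always = [i for i, firing in enumerate(evaluations) if firing]

# "On State Change": notify only when the result differs from the previous interval.
state_change = [
    i for i in range(len(evaluations))
    if evaluations[i] != (evaluations[i - 1] if i > 0 else False)
]

print("Always notifies at intervals:", always)        # [1, 2, 3, 5]
print("On State Change notifies at:", state_change)   # [1, 4, 5] in this illustration
```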
Filters (Advanced Configuration)
Filters allow you to create highly specific alert conditions by narrowing the scope of monitored data. This significantly reduces false positives by targeting specific scenarios.
Available Filter Dimensions
Network & Infrastructure:
Source IP / Destination IP
Source Port / Destination Port
Protocol (HTTP/HTTPS, etc.)
Direction (inbound/outbound)
Host / Hostname
Geographic:
Country
Continent
Region
City
Application Layer:
Endpoint
HTTP Method
HTTP Status
Content Type
Error
TLS Version
Container/Kubernetes:
Container Image
Container Name
Pod Name
Pod Namespace
Business Logic:
Vendor
Environment
Strategy
Data Type
Security & Access:
Agent
System User
Bin
Executable
IP
How Filters Work
Add a filter by clicking the "+" button in the Filters section
Select the dimension to filter on
Choose the matching criteria (equals, contains, etc.)
Multiple filters can be combined to create precise conditions
{% hint style="info" %} Filter operators (AND/OR logic) are coming soon to enable even more sophisticated filtering. {% endhint %}
Integrations
Integrations define where and how alerts are delivered. The system supports any webhook-based integration, making it compatible with Slack, PagerDuty, Microsoft Teams, custom internal systems, and more.
Creating an Integration
Name: Descriptive identifier for your integration (e.g., "Production Alerts Slack")
Description: Optional details about purpose/destination
Status: Enable/Disable toggle for quickly turning integrations on/off
Configuration:
Method: HTTP method (typically POST)
URL: Webhook endpoint URL
Auth Token: Optional authentication header for secured webhooks
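Before wiring a rule to a new integration, it can help to confirm that your webhook endpoint accepts the kind of request Qpoint will send. A minimal sketch using Python's requests library, posting a sample of the payload documented in the next section; the URL, token, and header name here are hypothetical placeholders, not values the product dictates.

```python
import requests

# Hypothetical values; substitute your integration's webhook URL and token.
WEBHOOK_URL = "https://hooks.example.com/qpoint-alerts"
AUTH_TOKEN = "my-secret-token"

sample_payload = {
    "orgId": "X0ysTipXUDzr9JPkY4W6",
    "alertId": "test-alert-id",
    "name": "Low Availability Warning",
    "message": "P99 Availability of 0.00% for cloudflare.com is below threshold of 99.50%",
    "locationId": "cloudflare.com",
    "locationType": "vendor",
    "timestamp": "2025-07-29T19:24:52Z",
}

resp = requests.post(
    WEBHOOK_URL,
    json=sample_payload,
    # Only needed if your webhook requires auth; the exact header your
    # endpoint expects depends on how you secured it.
    headers={"Authorization": AUTH_TOKEN},
    timeout=5,
)
print(resp.status_code)
```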
Alert Payload Format
When an alert fires, the following JSON payload is sent to your configured webhook:
```json
{
  "orgId": "X0ysTipXUDzr9JPkY4W6",
  "contextId": "",
  "alertId": "cv4umbq9io6g00lgdpog",
  "name": "Low Availability Warning",
  "description": "Early warning when availability starts to drop",
  "message": "P99 Availability of 0.00% for cloudflare.com is below threshold of 99.50%",
  "locationId": "cloudflare.com",
  "locationType": "vendor",
  "timestamp": "2025-07-29T19:24:52.726139471Z"
}
```

Payload Fields:
orgId: Your organization identifier
contextId: Additional context identifier (if applicable)
alertId: Unique identifier for this alert instance
name: The rule name that triggered
description: The rule description
message: Human-readable alert message with current value, threshold, and location
locationId: The specific entity that triggered the alert (vendor, endpoint, or client)
locationType: Type of entity ("vendor", "endpoint", or "client")
timestamp: ISO 8601 timestamp when the alert was triggered
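On the receiving side, any HTTP service that accepts a JSON POST can consume these alerts. Below is a minimal sketch using Python's standard library that parses the documented fields and logs them; the port is arbitrary, and the handler is a starting point to adapt for whatever you forward alerts into (Slack, PagerDuty, a ticketing system, etc.).

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON alert payload.
        length = int(self.headers.get("Content-Length", 0))
        alert = json.loads(self.rfile.read(length))

        # Use the documented payload fields however your team needs.
        print(f"[{alert['timestamp']}] {alert['name']} "
              f"({alert['locationType']}={alert['locationId']}): {alert['message']}")

        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Hypothetical port; point the integration's webhook URL at this service.
    HTTPServer(("0.0.0.0", 8080), AlertHandler).serve_forever()
```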
Events Dashboard
The Events page provides comprehensive visibility into your alert history and helps you identify patterns, false positives, and optimization opportunities.
Overview Features
Time Range Selector: View events from the past 15 minutes to 30 days
Summary Table: Total occurrences per rule for quick pattern identification
Activity Timeline: Visual representation of alert patterns over time
Per-Rule Analytics
For each alerting rule, you can view:
Occurrence count: How many times the alert fired
Time-series graph: Visual timeline showing when alerts fired
Pattern identification: Spot recurring issues or correlation with known incidents
Peak activity periods: Identify when alerts are most frequent
Using Events Data
The Events dashboard helps you:
Identify false positive patterns (rules firing too frequently may need threshold adjustment)
Adjust thresholds based on historical data (use actual performance patterns to tune alerts)
Correlate alerts with known incidents (match alert timelines with deployment or infrastructure changes)
Track improvement over time (verify that fixes reduce alert frequency)
Common Alerting Scenarios
API Health Monitoring
Monitor overall API availability in production:
Rule: Availability < 99.5%
Filter: Environment = Production
Group By: Endpoint
Check Interval: 1 minute
Report Frequency: On State Change
Vendor SLA Tracking
Track response time SLAs for third-party vendors:
Rule: P99 Duration > 500ms
Filter: Vendor exists
Group By: Vendor
Check Interval: 5 minutes
Report Frequency: On State Change
Error Spike Detection
Detect sudden increases in client errors:
Rule: Errors per Second > 10
Filter: HTTP Status = 5xx
Group By: Client
Check Interval: 1 minute
Report Frequency: Always
Geographic Performance
Monitor performance for specific regions:
Rule: Average Duration > 200ms
Filter: Country = "United States"
Group By: Region
Check Interval: 15 minutes
Report Frequency: On State Change
Security Monitoring
Detect unencrypted data transmission in production:
Template: Unencrypted Data Transmission
Filter: Environment = Production
Check Interval: 1 minute
Report Frequency: Always
Discovery Monitoring
Get notified when new vendors are discovered:
Template: New Vendor
Check Interval: 5 minutes
Report Frequency: Always
Entity Combinations: Vendor
Best Practices
Start with templates: Use pre-built templates and customize them to your needs rather than building from scratch
Use "On State Change" reporting: Reduces alert fatigue while ensuring you're notified of important changes
Group by relevant dimensions: Creates actionable alerts by identifying exactly which vendor, endpoint, or client is affected
Apply filters liberally: Narrow alert scope to reduce noise and false positives
Use the Events dashboard: Regularly review alert patterns to tune thresholds and identify improvements
Test with "Don't Report": When creating new rules, start with "Don't Report" to validate the rule logic before enabling notifications
Longer intervals for stable metrics: Use 15+ minute intervals for metrics that don't change rapidly to reduce overhead
Combine Discovery and Risk alerts: Use Discovery templates to know what's connecting, and Risk templates to monitor how it's connecting
Important Notes
Alerting operates exclusively in Qplane and is not part of YAML configuration.
The Report Usage plugin must be included in your stacks for alerting to function.
Alerts are generated from anonymized event metadata collected by the Report Usage plugin.
All alert data is stored in Pulse (ClickHouse) and does not include payload data from object stores.