Alerting

The alerting system provides real-time monitoring and notification capabilities for your API infrastructure, enabling proactive response to performance issues and outages.

System Overview

The alerting system consists of four main components:

  1. Rules - Define monitoring conditions and thresholds

  2. Filters - Narrow alert scope to specific conditions

  3. Integrations - Configure where alerts are sent

  4. Events - View alert history and patterns

Rules Configuration

Rules are the foundation of your alerting system, continuously monitoring metrics and triggering notifications when thresholds are crossed.

Pre-Built Rule Templates

Qpoint provides pre-built templates organized by category to help you get started quickly.

Discovery Templates

Monitor when new entities appear in your traffic:

  • New Vendor - Detects when a new vendor is used

  • New Endpoint - Detects when a new endpoint (domain or IP) is accessed

  • New Client + Vendor - Detects when a client uses a vendor for the first time

  • New Sensitive Data Type - Detects when a new sensitive data type is used

  • New Client + Sensitive Data Type + Vendor - Detects when a client sends a sensitive data type to a vendor for the first time

  • New Token - Detects when a new token is used

  • New Country - Detects when a connection is made to a new country

Risk Templates

Identify security and compliance risks:

  • Root User Connections - Detect when connections are established from processes running as root user

  • Unencrypted Data Transmission - Alert when data is transmitted over unencrypted connections

  • Direct IP Access - Detect when connections bypass DNS resolution and use direct IP addresses

  • Shell Access Attempts - Detect when shell access is attempted or established

  • Weak TLS Usage - Alert when deprecated or weak TLS versions are used

  • Authentication Failures - Detect when authentication errors occur

  • Token-Based Authentication - Track when authentication tokens are used

  • Unknown TLS Configuration - Alert when unknown or unsupported TLS configurations are detected

  • High-Risk Connection Pattern - Detect high-risk connections combining root access with unencrypted transmission

  • Authentication Token from Root - Detect when authentication tokens are used from root user processes

Reliability Templates

Monitor service health and availability:

  • High Error Rate - Alerts when the error rate exceeds a defined threshold

  • Sudden Error Rate Increase - Detects when the error rate increases suddenly

  • Vendor Availability Unacceptable - Monitors when service availability falls below acceptable levels

  • Sudden Availability Drop - Detects when service availability drops suddenly

Performance Templates

Track response times and latency:

  • Slow Response Time - Detects when API response times degrade beyond acceptable thresholds

  • Sudden Response Time Increase - Detects when response time increases suddenly

Custom Rule Creation

When the pre-built templates don't meet your needs, create custom rules with complete control over monitoring conditions. There are four types of custom alerts, each designed for different monitoring scenarios.

Alert Types

1. Discover New Entities

Detect when new entities appear in your traffic for the first time.

Configuration:

  • Check Interval: How frequently to check for new entities (1 minute to 24 hours)

  • Entity Combinations: Select which entities to monitor (Vendor, Endpoint, Client, PII Data, Token, Country)

  • Report Frequency: How often to notify (Always, On State Change, Don't Report, Custom Schedule)

Example: Alert when your application connects to a new vendor you haven't seen before, which could indicate shadow IT or unauthorized integrations.
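
To make the first-seen idea concrete, the sketch below keeps a set of previously observed client/vendor combinations and flags any pair appearing for the first time. The field names and the in-memory set are illustrative assumptions, not Qpoint's implementation.

```python
# Illustrative only: a first-seen check over (client, vendor) pairs.
seen_pairs: set[tuple[str, str]] = set()

def check_new_client_vendor(events: list[dict]) -> list[tuple[str, str]]:
    """Return (client, vendor) combinations appearing for the first time."""
    new_pairs = []
    for event in events:
        pair = (event["client"], event["vendor"])
        if pair not in seen_pairs:
            seen_pairs.add(pair)
            new_pairs.append(pair)
    return new_pairs

# The second batch only reports the genuinely new combination.
print(check_new_client_vendor([{"client": "checkout", "vendor": "stripe.com"}]))
print(check_new_client_vendor([{"client": "checkout", "vendor": "stripe.com"},
                               {"client": "billing", "vendor": "stripe.com"}]))
```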

2. Risk Detection

Identify security and compliance risks in your connections based on predefined risk patterns.

Configuration:

  • Check Interval: How frequently to evaluate risk (1 minute to 24 hours)

  • Define Source By: What initiated the connection (Bin, Container, Pod, Process, etc.)

  • Define Destination By: Where the connection is going (Vendor, Endpoint, IP, etc.)

  • Risk Labels: Select which risk patterns to monitor from the available labels:

    • user-shell - Shell access attempts or established shell connections

    • direct-ip - Direct IP address connections bypassing DNS

    • is-root - Connections from processes running as root user

    • unencrypted - Unencrypted data transmission

    • deprecated-tls - Usage of deprecated or weak TLS versions

    • unknown-tls - Unknown or unsupported TLS configurations

    • auth-error - Authentication failures during connections

    • auth-token - Token-based authentication usage

  • Report Frequency: How often to notify

Example: Detect when root user processes establish connections to external vendors, or when unencrypted connections are made to sensitive endpoints.
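
As a rough illustration of label matching, the sketch below flags any connection carrying at least one of the selected risk labels. The event field names are hypothetical and do not describe Qpoint's event schema.

```python
# Hypothetical connection records tagged with risk labels (field names are illustrative).
MONITORED_LABELS = {"is-root", "unencrypted"}

def risky_connections(connections: list[dict]) -> list[dict]:
    """Return connections that carry at least one monitored risk label."""
    return [c for c in connections if MONITORED_LABELS & set(c.get("risk_labels", []))]

connections = [
    {"source": "pod/api-7f9", "destination": "api.stripe.com", "risk_labels": ["auth-token"]},
    {"source": "pod/batch-1", "destination": "203.0.113.10", "risk_labels": ["is-root", "direct-ip"]},
]
for match in risky_connections(connections):
    print(f"risk alert: {match['source']} -> {match['destination']} ({match['risk_labels']})")
```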

3. Percentage Changes in a Metric

Monitor for significant increases or decreases in metrics compared to historical baselines.

Configuration:

  • Check Interval: How frequently to evaluate changes (1 minute to 24 hours)

  • Metric: Select the metric to monitor (same metrics as threshold alerts)

  • Increase Percentage: Alert when metric increases by this percentage (e.g., 50%)

  • Decrease Percentage: Alert when metric decreases by this percentage (e.g., 25%)

  • Group By: Segment alerts by Vendor, Endpoint, or Client (optional)

  • Report Frequency: How often to notify

Example: Alert when error rates suddenly spike by 50% or when traffic volume drops by 25%, indicating potential issues even if absolute thresholds aren't crossed.
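
The underlying comparison can be pictured as a relative change against a historical baseline, as in this simplified sketch (using the previous interval's value as the baseline is an assumption for illustration):

```python
def percent_change_alert(current: float, baseline: float,
                         increase_pct: float = 50.0,
                         decrease_pct: float = 25.0) -> str | None:
    """Return an alert message if the relative change crosses either limit."""
    if baseline == 0:
        return None  # no baseline to compare against
    change = (current - baseline) / baseline * 100
    if change >= increase_pct:
        return f"increase of {change:.1f}% exceeds {increase_pct}% limit"
    if change <= -decrease_pct:
        return f"decrease of {abs(change):.1f}% exceeds {decrease_pct}% limit"
    return None

# Errors per second jumped from 4.0 to 6.5: a 62.5% increase, above the 50% limit.
print(percent_change_alert(current=6.5, baseline=4.0))
```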

4. Metric Surpasses a Threshold

Alert when a metric crosses a defined absolute threshold value.

Configuration:

  • Check Interval: How frequently to evaluate the metric (1 minute to 24 hours)

  • Metric: Select from 50+ available metrics (see metric categories below)

  • Operator: How to compare the metric (>, <, =, ≥, ≤)

  • Threshold: The numeric value that triggers the alert

  • Group By: Segment alerts by Vendor, Endpoint, or Client (optional)

  • Report Frequency: How often to notify

Example: Alert when P99 response time exceeds 500ms or when the error count exceeds 100.
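
The threshold check itself reduces to an operator lookup and a comparison, as in this minimal sketch; the operator symbols mirror the options above, while everything else is illustrative.

```python
import operator

# Map the configurable operators onto comparison functions.
OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
}

def threshold_breached(value: float, op: str, threshold: float) -> bool:
    """True when the metric value crosses the configured threshold."""
    return OPERATORS[op](value, threshold)

# P99 duration of 620 ms against a "> 500" rule fires the alert.
print(threshold_breached(620.0, ">", 500.0))  # True
```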

Check Intervals

How frequently the rule evaluates your metrics:

  • 1 Minute (most responsive)

  • 5 Minutes

  • 15 Minutes

  • 30 Minutes

  • 1 Hour

  • 2, 4, 8, 12, 24 Hours (for less volatile metrics)

Available Metrics

Choose from 50+ available metrics organized by category.

Availability Metrics:

  • Availability - Overall service availability percentage

  • Average Availability - Mean availability across time period

  • P99/P95/P90 Availability - Percentile-based availability metrics

Performance Metrics:

  • Average Duration - Mean response time

  • P99/P95/P90/P50 Duration - Percentile response times

  • Maximum Duration - Worst-case response time

Traffic Metrics:

  • Total Connections - Active connection count

  • Connections per Second - Connection rate

  • Total Requests - Request volume

  • Requests per Second - Request rate

Error Metrics:

  • Errors - Total error count

  • Errors per Second - Error rate

Data Transfer Metrics:

  • Total Bytes In/Out - Combined data transfer

  • Total Bytes Sent/Received - Directional data transfer

  • Bytes Sent/Received per Second - Transfer rates
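
If you are unfamiliar with percentile metrics, the sketch below shows a conventional nearest-rank P99 calculation over a window of response-time samples. It illustrates what a metric like P99 Duration represents, not how Pulse computes it.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value below which p% of samples fall."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

durations_ms = [42, 45, 51, 48, 47, 300, 55, 60, 49, 46]
print(percentile(durations_ms, 99))  # 300 - the single slow outlier dominates P99
```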

Operators (for Threshold alerts)

Choose how to compare the metric against your threshold:

  • Greater Than (>)

  • Less Than (<)

  • Equal (=)

  • Greater Than or Equal (≥)

  • Less Than or Equal (≤)

Grouping (Optional)

Segment alerts by specific dimensions to get granular notifications:

  • Vendor - Monitor each vendor separately

  • Endpoint - Track individual endpoints

  • Client - Monitor per-client metrics

Report Frequency

Control how often you want to be notified:

  • Don't Report - Evaluate only, no notifications (useful for testing)

  • On State Change - Notify when alert state changes (recommended to avoid alert fatigue)

  • Always - Notify every time condition is met

  • Custom Schedule - Define specific intervals for notifications
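
The On State Change option can be pictured as simple state tracking: a notification goes out only when the rule transitions between OK and alerting. A minimal sketch, assuming an in-memory state flag:

```python
# Illustrative state-change suppression: notify only on OK <-> alerting transitions.
class StateChangeReporter:
    def __init__(self) -> None:
        self.alerting = False

    def evaluate(self, condition_met: bool) -> bool:
        """Return True when a notification should be sent."""
        should_notify = condition_met != self.alerting
        self.alerting = condition_met
        return should_notify

reporter = StateChangeReporter()
for check in [False, True, True, True, False]:
    print(reporter.evaluate(check))  # False, True, False, False, True
```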

Filters (Advanced Configuration)

Filters allow you to create highly specific alert conditions by narrowing the scope of monitored data. This significantly reduces false positives by targeting specific scenarios.

Available Filter Dimensions

Network & Infrastructure:

  • Source IP / Destination IP

  • Source Port / Destination Port

  • Protocol (HTTP/HTTPS, etc.)

  • Direction (inbound/outbound)

  • Host / Hostname

Geographic:

  • Country

  • Continent

  • Region

  • City

Application Layer:

  • Endpoint

  • HTTP Method

  • HTTP Status

  • Content Type

  • Error

  • TLS Version

Container/Kubernetes:

  • Container Image

  • Container Name

  • Pod Name

  • Pod Namespace

Business Logic:

  • Vendor

  • Environment

  • Strategy

  • Data Type

Security & Access:

  • Agent

  • System User

  • Bin

  • Executable

  • IP

How Filters Work

  1. Add a filter by clicking the "+" button in the Filters section

  2. Select the dimension to filter on

  3. Choose the matching criteria (equals, contains, etc.)

  4. Multiple filters can be combined to create precise conditions

{% hint style="info" %} Filter operators (AND/OR logic) are coming soon to enable even more sophisticated filtering. {% endhint %}
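
Until filter operators arrive, multiple filters narrow the scope together, which can be pictured as AND semantics: an event is considered only if it matches every configured filter. The sketch below models that with simple predicates; the dimension names and match modes come from the list above, but the combination logic and code are illustrative assumptions.

```python
# Illustrative AND-combination of filters over an event's dimensions.
filters = [
    ("pod_namespace", "equals", "payments"),
    ("endpoint", "contains", "api.stripe.com"),
]

def matches(event: dict, dimension: str, mode: str, value: str) -> bool:
    field = str(event.get(dimension, ""))
    return field == value if mode == "equals" else value in field

def passes_all_filters(event: dict) -> bool:
    return all(matches(event, d, m, v) for d, m, v in filters)

event = {"pod_namespace": "payments", "endpoint": "https://api.stripe.com/v1/charges"}
print(passes_all_filters(event))  # True
```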

Integrations

Integrations define where and how alerts are delivered. The system supports any webhook-based integration, making it compatible with Slack, PagerDuty, Microsoft Teams, custom internal systems, and more.

Creating an Integration

  1. Name: Descriptive identifier for your integration (e.g., "Production Alerts Slack")

  2. Description: Optional details about purpose/destination

  3. Status: Enable/Disable toggle for quickly turning integrations on/off

  4. Configuration:

    • Method: HTTP method (typically POST)

    • URL: Webhook endpoint URL

    • Auth Token: Optional authentication header for secured webhooks
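
For illustration, a webhook delivery from the sending side amounts to an HTTP POST of a JSON body to the configured URL, with the auth token passed as a header. The header name, and the use of Python's standard library here, are assumptions for the sketch rather than a description of Qpoint's sender.

```python
import json
import urllib.request

def deliver_alert(url: str, payload: dict, auth_token: str | None = None) -> int:
    """POST an alert payload to a webhook endpoint; returns the HTTP status code."""
    headers = {"Content-Type": "application/json"}
    if auth_token:
        headers["Authorization"] = auth_token  # header name is an assumption
    request = urllib.request.Request(
        url, data=json.dumps(payload).encode(), headers=headers, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status

# Example (against a hypothetical endpoint):
# deliver_alert("https://hooks.example.com/qpoint", {"name": "Low Availability Warning"},
#               auth_token="Bearer <token>")
```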

Alert Payload Format

When an alert fires, the following JSON payload is sent to your configured webhook:

```json
{
  "orgId": "X0ysTipXUDzr9JPkY4W6",
  "contextId": "",
  "alertId": "cv4umbq9io6g00lgdpog",
  "name": "Low Availability Warning",
  "description": "Early warning when availability starts to drop",
  "message": "P99 Availability of 0.00% for cloudflare.com is below threshold of 99.50%",
  "locationId": "cloudflare.com",
  "locationType": "vendor",
  "timestamp": "2025-07-29T19:24:52.726139471Z"
}
```

Payload Fields:

  • orgId: Your organization identifier

  • contextId: Additional context identifier (if applicable)

  • alertId: Unique identifier for this alert instance

  • name: The rule name that triggered

  • description: The rule description

  • message: Human-readable alert message with current value, threshold, and location

  • locationId: The specific entity that triggered the alert (vendor, endpoint, or client)

  • locationType: Type of entity ("vendor", "endpoint", or "client")

  • timestamp: ISO 8601 timestamp when the alert was triggered
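
A minimal receiver for this payload could look like the sketch below, built on Python's standard-library HTTP server; the routing, port, and lack of auth validation are simplifications for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        alert = json.loads(self.rfile.read(length))
        # Route on the fields documented above; printing stands in for real handling.
        print(f"[{alert['timestamp']}] {alert['name']} "
              f"({alert['locationType']}={alert['locationId']}): {alert['message']}")
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertWebhookHandler).serve_forever()
```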

Events Dashboard

The Events page provides comprehensive visibility into your alert history and helps you identify patterns, false positives, and optimization opportunities.

Overview Features

  • Time Range Selector: View events from the past 15 minutes to 30 days

  • Summary Table: Total occurrences per rule for quick pattern identification

  • Activity Timeline: Visual representation of alert patterns over time

Per-Rule Analytics

For each alerting rule, you can view:

  • Occurrence count: How many times the alert fired

  • Time-series graph: Visual timeline showing when alerts fired

  • Pattern identification: Spot recurring issues or correlation with known incidents

  • Peak activity periods: Identify when alerts are most frequent

Using Events Data

The Events dashboard helps you:

  • Identify false positive patterns (rules firing too frequently may need threshold adjustment)

  • Adjust thresholds based on historical data (use actual performance patterns to tune alerts)

  • Correlate alerts with known incidents (match alert timelines with deployment or infrastructure changes)

  • Track improvement over time (verify that fixes reduce alert frequency)

Common Alerting Scenarios

API Health Monitoring

Monitor overall API availability in production:

```
Rule: Availability < 99.5%
Filter: Environment = Production
Group By: Endpoint
Check Interval: 1 minute
Report Frequency: On State Change
```

Vendor SLA Tracking

Track response time SLAs for third-party vendors:

```
Rule: P99 Duration > 500ms
Filter: Vendor exists
Group By: Vendor
Check Interval: 5 minutes
Report Frequency: On State Change
```

Error Spike Detection

Detect sudden increases in client errors:

```
Rule: Errors per Second > 10
Filter: HTTP Status = 5xx
Group By: Client
Check Interval: 1 minute
Report Frequency: Always
```

Geographic Performance

Monitor performance for specific regions:

```
Rule: Average Duration > 200ms
Filter: Country = "United States"
Group By: Region
Check Interval: 15 minutes
Report Frequency: On State Change
```

Security Monitoring

Detect unencrypted data transmission in production:

```
Template: Unencrypted Data Transmission
Filter: Environment = Production
Check Interval: 1 minute
Report Frequency: Always
```

Discovery Monitoring

Get notified when new vendors are discovered:

```
Template: New Vendor
Check Interval: 5 minutes
Report Frequency: Always
Entity Combinations: Vendor
```

Best Practices

  1. Start with templates: Use pre-built templates and customize them to your needs rather than building from scratch

  2. Use "On State Change" reporting: Reduces alert fatigue while ensuring you're notified of important changes

  3. Group by relevant dimensions: Creates actionable alerts by identifying exactly which vendor, endpoint, or client is affected

  4. Apply filters liberally: Narrow alert scope to reduce noise and false positives

  5. Use the Events dashboard: Regularly review alert patterns to tune thresholds and identify improvements

  6. Test with "Don't Report": When creating new rules, start with "Don't Report" to validate the rule logic before enabling notifications

  7. Longer intervals for stable metrics: Use 15+ minute intervals for metrics that don't change rapidly to reduce overhead

  8. Combine Discovery and Risk alerts: Use Discovery templates to know what's connecting, and Risk templates to monitor how it's connecting

Important Notes

  • Alerting operates exclusively in Qplane and is not part of YAML configuration.

  • The Report Usage plugin must be included in your stacks for alerting to function.

  • Alerts are generated from anonymized event metadata collected by the Report Usage plugin.

  • All alert data is stored in Pulse (ClickHouse) and does not include payload data from object stores.
