Alerting

The alerting system provides real-time monitoring and notification capabilities for your API infrastructure, enabling proactive response to performance issues and outages.

System Overview

The alerting system consists of four main components:

  1. Rules - Define monitoring conditions and thresholds

  2. Filters - Narrow alert scope to specific conditions

  3. Integrations - Configure where alerts are sent

  4. Events - View alert history and patterns

Rules Configuration

Rules are the foundation of your alerting system, continuously monitoring metrics and triggering notifications when thresholds are crossed.

Pre-Built Rule Templates

Qpoint provides pre-built templates organized by category to help you get started quickly.

Discovery Templates

Monitor when new entities appear in your traffic:

  • New Vendor - Detects when a new vendor is used

  • New Endpoint - Detects when a new endpoint (domain or IP) is accessed

  • New Client + Vendor - Detects when a client uses a vendor for the first time

  • New Sensitive Data Type - Detects when a new sensitive data type is used

  • New Client + Sensitive Data Type + Vendor - Detects when a client sends a sensitive data type to a vendor for the first time

  • New Token - Detects when a new token is used

  • New Country - Detects when a connection is made to a new country

Risk Templates

Identify security and compliance risks:

  • Root User Connections - Detect when connections are established from processes running as root user

  • Unencrypted Data Transmission - Alert when data is transmitted over unencrypted connections

  • Direct IP Access - Detect when connections bypass DNS resolution and use direct IP addresses

  • Shell Access Attempts - Detect when shell access is attempted or established

  • Weak TLS Usage - Alert when deprecated or weak TLS versions are used

  • Authentication Failures - Detect when authentication errors occur

  • Token-Based Authentication - Track when authentication tokens are used

  • Unknown TLS Configuration - Alert when unknown or unsupported TLS configurations are detected

  • High-Risk Connection Pattern - Detect high-risk connections combining root access with unencrypted transmission

  • Authentication Token from Root - Detect when authentication tokens are used from root user processes

Reliability Templates

Monitor service health and availability:

  • High Error Rate - Alerts when the error rate exceeds a defined threshold

  • Sudden Error Rate Increase - Detects when the error rate increases suddenly

  • Vendor Availability Unacceptable - Monitors when service availability falls below acceptable levels

  • Sudden Availability Drop - Detects when service availability drops suddenly

Performance Templates

Track response times and latency:

  • Slow Response Time - Detects when API response times degrade beyond acceptable thresholds

  • Sudden Response Time Increase - Detects when response time increases suddenly

Custom Rule Creation

When the pre-built templates don't meet your needs, create custom rules with complete control over monitoring conditions. There are four types of custom alerts, each designed for different monitoring scenarios.

Alert Types

1. Discover New Entities

Detect when new entities appear in your traffic for the first time.

Configuration:

  • Check Interval: How frequently to check for new entities (1 minute to 24 hours)

  • Entity Combinations: Select which entities to monitor (Vendor, Endpoint, Client, PII Data, Token, Country)

  • Report Frequency: How often to notify (Always, On State Change, Don't Report, Custom Schedule)

Example: Alert when your application connects to a new vendor you haven't seen before, which could indicate shadow IT or unauthorized integrations.
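
To make the first-seen idea concrete, the sketch below keeps a set of previously observed client/vendor combinations and flags any pair appearing for the first time. The field names and the in-memory set are illustrative assumptions, not Qpoint's implementation.

```python
# Illustrative only: a first-seen check over (client, vendor) pairs.
seen_pairs: set[tuple[str, str]] = set()

def check_new_client_vendor(events: list[dict]) -> list[tuple[str, str]]:
    """Return (client, vendor) combinations appearing for the first time."""
    new_pairs = []
    for event in events:
        pair = (event["client"], event["vendor"])
        if pair not in seen_pairs:
            seen_pairs.add(pair)
            new_pairs.append(pair)
    return new_pairs

# The second batch only reports the genuinely new combination.
print(check_new_client_vendor([{"client": "checkout", "vendor": "stripe.com"}]))
print(check_new_client_vendor([{"client": "checkout", "vendor": "stripe.com"},
                               {"client": "billing", "vendor": "stripe.com"}]))
```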

2. Risk Detection

Identify security and compliance risks in your connections based on predefined risk patterns.

Configuration:

  • Check Interval: How frequently to evaluate risk (1 minute to 24 hours)

  • Define Source By: What initiated the connection (Bin, Container, Pod, Process, etc.)

  • Define Destination By: Where the connection is going (Vendor, Endpoint, IP, etc.)

  • Risk Labels: Select which risk patterns to monitor from the available labels:

    • user-shell - Shell access attempts or established shell connections

    • direct-ip - Direct IP address connections bypassing DNS

    • is-root - Connections from processes running as root user

    • unencrypted - Unencrypted data transmission

    • deprecated-tls - Usage of deprecated or weak TLS versions

    • unknown-tls - Unknown or unsupported TLS configurations

    • auth-error - Authentication failures during connections

    • auth-token - Token-based authentication usage

  • Report Frequency: How often to notify

Example: Detect when root user processes establish connections to external vendors, or when unencrypted connections are made to sensitive endpoints.
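
As a rough illustration of label matching, the sketch below flags any connection carrying at least one of the selected risk labels. The event field names are hypothetical and do not describe Qpoint's event schema.

```python
# Hypothetical connection records tagged with risk labels (field names are illustrative).
MONITORED_LABELS = {"is-root", "unencrypted"}

def risky_connections(connections: list[dict]) -> list[dict]:
    """Return connections that carry at least one monitored risk label."""
    return [c for c in connections if MONITORED_LABELS & set(c.get("risk_labels", []))]

connections = [
    {"source": "pod/api-7f9", "destination": "api.stripe.com", "risk_labels": ["auth-token"]},
    {"source": "pod/batch-1", "destination": "203.0.113.10", "risk_labels": ["is-root", "direct-ip"]},
]
for match in risky_connections(connections):
    print(f"risk alert: {match['source']} -> {match['destination']} ({match['risk_labels']})")
```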

3. Percentage Changes in a Metric

Monitor for significant increases or decreases in metrics compared to historical baselines.

Configuration:

  • Check Interval: How frequently to evaluate changes (1 minute to 24 hours)

  • Metric: Select the metric to monitor (same metrics as threshold alerts)

  • Increase Percentage: Alert when metric increases by this percentage (e.g., 50%)

  • Decrease Percentage: Alert when metric decreases by this percentage (e.g., 25%)

  • Group By: Segment alerts by Vendor, Endpoint, or Client (optional)

  • Report Frequency: How often to notify

Example: Alert when error rates suddenly spike by 50% or when traffic volume drops by 25%, indicating potential issues even if absolute thresholds aren't crossed.
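
The underlying comparison can be pictured as a relative change against a historical baseline, as in this simplified sketch (using the previous interval's value as the baseline is an assumption for illustration):

```python
def percent_change_alert(current: float, baseline: float,
                         increase_pct: float = 50.0,
                         decrease_pct: float = 25.0) -> str | None:
    """Return an alert message if the relative change crosses either limit."""
    if baseline == 0:
        return None  # no baseline to compare against
    change = (current - baseline) / baseline * 100
    if change >= increase_pct:
        return f"increase of {change:.1f}% exceeds {increase_pct}% limit"
    if change <= -decrease_pct:
        return f"decrease of {abs(change):.1f}% exceeds {decrease_pct}% limit"
    return None

# Errors per second jumped from 4.0 to 6.5: a 62.5% increase, above the 50% limit.
print(percent_change_alert(current=6.5, baseline=4.0))
```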

4. Metric Surpasses a Threshold

Alert when a metric crosses a defined absolute threshold value.

Configuration:

  • Check Interval: How frequently to evaluate the metric (1 minute to 24 hours)

  • Metric: Select from 50+ available metrics (see metric categories below)

  • Operator: How to compare the metric (>, <, =, ≥, ≤)

  • Threshold: The numeric value that triggers the alert

  • Group By: Segment alerts by Vendor, Endpoint, or Client (optional)

  • Report Frequency: How often to notify

Example: Alert when P99 response time exceeds 500ms or when the error count exceeds 100.
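
The threshold check itself reduces to an operator lookup and a comparison, as in this minimal sketch; the operator symbols mirror the options above, while everything else is illustrative.

```python
import operator

# Map the configurable operators onto comparison functions.
OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
}

def threshold_breached(value: float, op: str, threshold: float) -> bool:
    """True when the metric value crosses the configured threshold."""
    return OPERATORS[op](value, threshold)

# P99 duration of 620 ms against a "> 500" rule fires the alert.
print(threshold_breached(620.0, ">", 500.0))  # True
```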

Check Intervals

How frequently the rule evaluates your metrics:

  • 1 Minute (most responsive)

  • 5 Minutes

  • 15 Minutes

  • 30 Minutes

  • 1 Hour

  • 2, 4, 8, 12, 24 Hours (for less volatile metrics)

Available Metrics

Choose from 50+ available metrics organized by category.

Availability Metrics:

  • Availability - Overall service availability percentage

  • Average Availability - Mean availability across time period

  • P99/P95/P90 Availability - Percentile-based availability metrics

Performance Metrics:

  • Average Duration - Mean response time

  • P99/P95/P90/P50 Duration - Percentile response times

  • Maximum Duration - Worst-case response time

Traffic Metrics:

  • Total Connections - Active connection count

  • Connections per Second - Connection rate

  • Total Requests - Request volume

  • Requests per Second - Request rate

Error Metrics:

  • Errors - Total error count

  • Errors per Second - Error rate

Data Transfer Metrics:

  • Total Bytes In/Out - Combined data transfer

  • Total Bytes Sent/Received - Directional data transfer

  • Bytes Sent/Received per Second - Transfer rates
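
If you are unfamiliar with percentile metrics, the sketch below shows a conventional nearest-rank P99 calculation over a window of response-time samples. It illustrates what a metric like P99 Duration represents, not how Pulse computes it.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value below which p% of samples fall."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

durations_ms = [42, 45, 51, 48, 47, 300, 55, 60, 49, 46]
print(percentile(durations_ms, 99))  # 300 - the single slow outlier dominates P99
```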

Operators (for Threshold alerts)

Choose how to compare the metric against your threshold:

  • Greater Than (>)

  • Less Than (<)

  • Equal (=)

  • Greater Than or Equal (≥)

  • Less Than or Equal (≤)

Grouping (Optional)

Segment alerts by specific dimensions to get granular notifications:

  • Vendor - Monitor each vendor separately

  • Endpoint - Track individual endpoints

  • Client - Monitor per-client metrics

Report Frequency

Control how often you want to be notified:

  • Don't Report - Evaluate only, no notifications (useful for testing)

  • On State Change - Notify when alert state changes (recommended to avoid alert fatigue)

  • Always - Notify every time condition is met

  • Custom Schedule - Define specific intervals for notifications
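
The On State Change option can be pictured as simple state tracking: a notification goes out only when the rule transitions between OK and alerting. A minimal sketch, assuming an in-memory state flag:

```python
# Illustrative state-change suppression: notify only on OK <-> alerting transitions.
class StateChangeReporter:
    def __init__(self) -> None:
        self.alerting = False

    def evaluate(self, condition_met: bool) -> bool:
        """Return True when a notification should be sent."""
        should_notify = condition_met != self.alerting
        self.alerting = condition_met
        return should_notify

reporter = StateChangeReporter()
for check in [False, True, True, True, False]:
    print(reporter.evaluate(check))  # False, True, False, False, True
```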

Filters (Advanced Configuration)

Filters allow you to create highly specific alert conditions by narrowing the scope of monitored data. This significantly reduces false positives by targeting specific scenarios.

Available Filter Dimensions

Network & Infrastructure:

  • Source IP / Destination IP

  • Source Port / Destination Port

  • Protocol (HTTP/HTTPS, etc.)

  • Direction (inbound/outbound)

  • Host / Hostname

Geographic:

  • Country

  • Continent

  • Region

  • City

Application Layer:

  • Endpoint

  • HTTP Method

  • HTTP Status

  • Content Type

  • Error

  • TLS Version

Container/Kubernetes:

  • Container Image

  • Container Name

  • Pod Name

  • Pod Namespace

Business Logic:

  • Vendor

  • Environment

  • Strategy

  • Data Type

Security & Access:

  • Agent

  • System User

  • Bin

  • Executable

  • IP

How Filters Work

  1. Add a filter by clicking the "+" button in the Filters section

  2. Select the dimension to filter on

  3. Choose the matching criteria (equals, contains, etc.)

  4. Multiple filters can be combined to create precise conditions

{% hint style="info" %} Filter operators (AND/OR logic) are coming soon to enable even more sophisticated filtering. {% endhint %}
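
Until filter operators arrive, multiple filters narrow the scope together, which can be pictured as AND semantics: an event is considered only if it matches every configured filter. The sketch below models that with simple predicates; the dimension names and match modes come from the list above, but the combination logic and code are illustrative assumptions.

```python
# Illustrative AND-combination of filters over an event's dimensions.
filters = [
    ("pod_namespace", "equals", "payments"),
    ("endpoint", "contains", "api.stripe.com"),
]

def matches(event: dict, dimension: str, mode: str, value: str) -> bool:
    field = str(event.get(dimension, ""))
    return field == value if mode == "equals" else value in field

def passes_all_filters(event: dict) -> bool:
    return all(matches(event, d, m, v) for d, m, v in filters)

event = {"pod_namespace": "payments", "endpoint": "https://api.stripe.com/v1/charges"}
print(passes_all_filters(event))  # True
```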

Integrations

Integrations define where and how alerts are delivered. The system supports any webhook-based integration, making it compatible with Slack, PagerDuty, Microsoft Teams, custom internal systems, and more.

Creating an Integration

  1. Name: Descriptive identifier for your integration (e.g., "Production Alerts Slack")

  2. Description: Optional details about purpose/destination

  3. Status: Enable/Disable toggle for quickly turning integrations on/off

  4. Configuration:

    • Method: HTTP method (typically POST)

    • URL: Webhook endpoint URL

    • Auth Token: Optional authentication header for secured webhooks
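
For illustration, a webhook delivery from the sending side amounts to an HTTP POST of a JSON body to the configured URL, with the auth token passed as a header. The header name, and the use of Python's standard library here, are assumptions for the sketch rather than a description of Qpoint's sender.

```python
import json
import urllib.request

def deliver_alert(url: str, payload: dict, auth_token: str | None = None) -> int:
    """POST an alert payload to a webhook endpoint; returns the HTTP status code."""
    headers = {"Content-Type": "application/json"}
    if auth_token:
        headers["Authorization"] = auth_token  # header name is an assumption
    request = urllib.request.Request(
        url, data=json.dumps(payload).encode(), headers=headers, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status

# Example (against a hypothetical endpoint):
# deliver_alert("https://hooks.example.com/qpoint", {"name": "Low Availability Warning"},
#               auth_token="Bearer <token>")
```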

Alert Payload Format

When an alert fires, the following JSON payload is sent to your configured webhook:

```json
{
  "orgId": "X0ysTipXUDzr9JPkY4W6",
  "contextId": "",
  "alertId": "cv4umbq9io6g00lgdpog",
  "name": "Low Availability Warning",
  "description": "Early warning when availability starts to drop",
  "message": "P99 Availability of 0.00% for cloudflare.com is below threshold of 99.50%",
  "locationId": "cloudflare.com",
  "locationType": "vendor",
  "timestamp": "2025-07-29T19:24:52.726139471Z"
}
```

Payload Fields:

  • orgId: Your organization identifier

  • contextId: Additional context identifier (if applicable)

  • alertId: Unique identifier for this alert instance

  • name: The rule name that triggered

  • description: The rule description

  • message: Human-readable alert message with current value, threshold, and location

  • locationId: The specific entity that triggered the alert (vendor, endpoint, or client)

  • locationType: Type of entity ("vendor", "endpoint", or "client")

  • timestamp: ISO 8601 timestamp when the alert was triggered
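
A minimal receiver for this payload could look like the sketch below, built on Python's standard-library HTTP server; the routing, port, and lack of auth validation are simplifications for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        alert = json.loads(self.rfile.read(length))
        # Route on the fields documented above; printing stands in for real handling.
        print(f"[{alert['timestamp']}] {alert['name']} "
              f"({alert['locationType']}={alert['locationId']}): {alert['message']}")
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertWebhookHandler).serve_forever()
```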

Events Dashboard

The Events page provides comprehensive visibility into your alert history and helps you identify patterns, false positives, and optimization opportunities.

Overview Features

  • Time Range Selector: View events from the past 15 minutes to 30 days

  • Summary Table: Total occurrences per rule for quick pattern identification

  • Activity Timeline: Visual representation of alert patterns over time

Per-Rule Analytics

For each alerting rule, you can view:

  • Occurrence count: How many times the alert fired

  • Time-series graph: Visual timeline showing when alerts fired

  • Pattern identification: Spot recurring issues or correlation with known incidents

  • Peak activity periods: Identify when alerts are most frequent

Using Events Data

The Events dashboard helps you:

  • Identify false positive patterns (rules firing too frequently may need threshold adjustment)

  • Adjust thresholds based on historical data (use actual performance patterns to tune alerts)

  • Correlate alerts with known incidents (match alert timelines with deployment or infrastructure changes)

  • Track improvement over time (verify that fixes reduce alert frequency)

Common Alerting Scenarios

API Health Monitoring

Monitor overall API availability in production:

```
Rule: Availability < 99.5%
Filter: Environment = Production
Group By: Endpoint
Check Interval: 1 minute
Report Frequency: On State Change
```

Vendor SLA Tracking

Track response time SLAs for third-party vendors:

```
Rule: P99 Duration > 500ms
Filter: Vendor exists
Group By: Vendor
Check Interval: 5 minutes
Report Frequency: On State Change
```

Error Spike Detection

Detect sudden increases in client errors:

```
Rule: Errors per Second > 10
Filter: HTTP Status = 5xx
Group By: Client
Check Interval: 1 minute
Report Frequency: Always
```

Geographic Performance

Monitor performance for specific regions:

```
Rule: Average Duration > 200ms
Filter: Country = "United States"
Group By: Region
Check Interval: 15 minutes
Report Frequency: On State Change
```

Security Monitoring

Detect unencrypted data transmission in production:

```
Template: Unencrypted Data Transmission
Filter: Environment = Production
Check Interval: 1 minute
Report Frequency: Always
```

Discovery Monitoring

Get notified when new vendors are discovered:

```
Template: New Vendor
Check Interval: 5 minutes
Report Frequency: Always
Entity Combinations: Vendor
```

Best Practices

  1. Start with templates: Use pre-built templates and customize them to your needs rather than building from scratch

  2. Use "On State Change" reporting: Reduces alert fatigue while ensuring you're notified of important changes

  3. Group by relevant dimensions: Creates actionable alerts by identifying exactly which vendor, endpoint, or client is affected

  4. Apply filters liberally: Narrow alert scope to reduce noise and false positives

  5. Use the Events dashboard: Regularly review alert patterns to tune thresholds and identify improvements

  6. Test with "Don't Report": When creating new rules, start with "Don't Report" to validate the rule logic before enabling notifications

  7. Longer intervals for stable metrics: Use 15+ minute intervals for metrics that don't change rapidly to reduce overhead

  8. Combine Discovery and Risk alerts: Use Discovery templates to know what's connecting, and Risk templates to monitor how it's connecting

Important Notes

  • Alerting operates exclusively in Qplane and is not part of YAML configuration.

  • The Report Usage plugin must be included in your stacks for alerting to function.

  • Alerts are generated from anonymized event metadata collected by the Report Usage plugin.

  • All alert data is stored in Pulse (ClickHouse) and does not include payload data from object stores.
