Prometheus + Grafana Monitoring

This guide walks you through setting up comprehensive monitoring for qtap using Prometheus for metrics collection and Grafana for visualization. By the end, you'll have dashboards showing request rates, error rates, latency percentiles, and traffic patterns.

What You'll Build

  • Prometheus scraping qtap metrics every 15 seconds

  • Grafana dashboard visualizing HTTP traffic patterns

  • Alerts for high error rates and latency spikes

  • RED metrics (Rate, Errors, Duration) for all observed traffic

Prerequisites

  • Qtap installed and running (see Getting Started)

  • Docker or Kubernetes environment

  • Basic familiarity with Prometheus and Grafana

Step 0: Enable HTTP Metrics Plugin

Update Your Qtap Configuration

Edit your qtap configuration file to include the http_metrics plugin:

version: 2

services:
  event_stores:
    - type: stdout
  object_stores:
    - type: stdout

stacks:
  monitoring_stack:
    plugins:
      # HTTP capture plugin (optional - for logging/storage)
      - type: http_capture
        config:
          level: summary
          format: json

      # HTTP metrics plugin (REQUIRED for Prometheus metrics)
      - type: http_metrics

tap:
  direction: egress
  http:
    stack: monitoring_stack

Restart Qtap

After adding the plugin, restart qtap:

# Docker
docker restart qtap

# Kubernetes
kubectl rollout restart daemonset/qtap

Step 1: Verify Metrics Are Available

First, confirm qtap is exposing HTTP metrics:

# Check qtap is running
docker ps | grep qtap

# Access metrics endpoint
curl http://localhost:10001/metrics | grep qtap_http_requests_total

You should see Prometheus-formatted metrics:

# HELP qtap_http_requests_total Total HTTP requests observed
# TYPE qtap_http_requests_total counter
qtap_http_requests_total{host="httpbin.org",method="GET",protocol="http2",status_code="200"} 5
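Each sample line is a metric name, a set of labels, and a value. As a rough sketch of how to read this format programmatically, the Python below parses the text exposition output and totals qtap_http_requests_total per host. The function names parse_samples and requests_by_host are illustrative and not part of qtap; the parser covers only the common case shown above.

```python
import re
from collections import defaultdict

# name, optional {labels}, then the sample value (a trailing timestamp is ignored)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# key="value" pairs, allowing backslash-escaped characters inside the value
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def requests_by_host(text, metric="qtap_http_requests_total"):
    """Sum the counter across all other labels, grouped by the host label."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == metric:
            totals[labels.get("host", "")] += value
    return dict(totals)
```

Pass it the body returned by the metrics endpoint (for example, the output of the curl command above saved to a file) to get a quick per-host request count without Prometheus in the loop.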

If metrics exist but show zero, generate test traffic:

# Generate test traffic
curl https://httpbin.org/get

# Check metrics again (wait 5 seconds for metrics to update)
sleep 5
curl http://localhost:10001/metrics | grep qtap_http_requests_total

Step 2: Deploy Prometheus

Already have Prometheus running? Skip the deployment sections below and jump to adding the qtap scrape configuration to your existing prometheus.yml.

Option A: Docker Compose

Create prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Application metrics
  - job_name: 'qtap'
    static_configs:
      - targets: ['host.docker.internal:10001']
    metrics_path: '/metrics'

  # Agent health metrics
  - job_name: 'qtap-system'
    static_configs:
      - targets: ['host.docker.internal:10001']
    metrics_path: '/system/metrics'
    scrape_interval: 30s

Create docker-compose.yml:

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    extra_hosts:
      - "host.docker.internal:host-gateway"  # On Linux you may need this to reach the host; see note below

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false

volumes:
  prometheus-data:
  grafana-data:

Start the stack:

docker compose up -d

Option B: Kubernetes with ServiceMonitor

If you use the Prometheus Operator, create a Service that exposes the metrics port and a ServiceMonitor that selects it:

apiVersion: v1
kind: Service
metadata:
  name: qtap-metrics
  labels:
    app: qtap
spec:
  selector:
    app: qtap
  ports:
    - name: metrics
      port: 10001
      targetPort: 10001
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: qtap
  labels:
    app: qtap
spec:
  selector:
    matchLabels:
      app: qtap
  endpoints:
    - port: metrics
      path: /metrics
      interval: 15s
    - port: metrics
      path: /system/metrics
      interval: 30s

Save the manifest as qtap-servicemonitor.yaml and apply it:

kubectl apply -f qtap-servicemonitor.yaml

Option C: Kubernetes with Pod Annotations

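For a standard Prometheus server, add annotations to your qtap DaemonSet. The annotations are a convention rather than a built-in Prometheus feature: they only take effect if your Prometheus scrape config uses Kubernetes pod discovery with relabeling rules that read them.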

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: qtap
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "10001"
        prometheus.io/path: "/metrics"
    spec:
      # ... qtap pod spec

Add Qtap Scrape Config to Existing Prometheus

If you already have Prometheus running, add these scrape configs to your existing prometheus.yml:

scrape_configs:
  # ... your existing jobs ...

  # Qtap application metrics
  - job_name: 'qtap'
    # Docker Desktop on macOS/Windows exposes host.docker.internal automatically.
    # On Linux, map it to the host gateway (docker run --add-host=host.docker.internal:host-gateway,
    # or extra_hosts in Compose) or replace it with your bridge IP (commonly 172.17.0.1).
    static_configs:
      - targets: ['<qtap-host>:10001']  # Replace with your qtap host
    metrics_path: '/metrics'
    scrape_interval: 15s

  # Qtap system/health metrics
  - job_name: 'qtap-system'
    static_configs:
      - targets: ['<qtap-host>:10001']
    metrics_path: '/system/metrics'
    scrape_interval: 30s

For Kubernetes with Prometheus Operator: If you already have Prometheus Operator installed, apply the ServiceMonitor from Option B above. There is no need to modify Prometheus config files.

Reload Prometheus configuration:

# Docker
docker exec prometheus kill -HUP 1

# Kubernetes (if using Prometheus Operator, reload is automatic)
kubectl rollout restart deployment/prometheus-server -n monitoring

Step 3: Verify Prometheus Is Scraping

Open the Prometheus UI at http://localhost:9090 and check:

  1. Status → Targets: Verify qtap and qtap-system jobs are "UP"

  2. Run a query: Try qtap_http_requests_total in the query box

If targets show as DOWN:

# Check network connectivity from Prometheus container
# macOS/Windows
docker exec prometheus wget -O- http://host.docker.internal:10001/metrics
# Linux (if host.docker.internal is unavailable)
# docker exec prometheus wget -O- http://172.17.0.1:10001/metrics

# For Kubernetes, check service and endpoints
kubectl get svc qtap-metrics
kubectl get endpoints qtap-metrics
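When a target is DOWN, the Prometheus targets API (GET /api/v1/targets) reports the last scrape error for each target, which often points straight at the cause. The sketch below filters that response for unhealthy qtap jobs. The function names are illustrative, and the base URL assumes the Docker Compose setup from Step 2.

```python
import json
import urllib.request


def unhealthy_targets(payload, job_prefix="qtap"):
    """Return (job, scrape_url, last_error) for active targets that are not up."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        job = target["labels"].get("job", "")
        if job.startswith(job_prefix) and target["health"] != "up":
            problems.append((job, target["scrapeUrl"], target.get("lastError", "")))
    return problems


def fetch_targets(base_url="http://localhost:9090", timeout=5):
    """Fetch the targets payload from a running Prometheus (network call)."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)
```

Calling unhealthy_targets(fetch_targets()) returns an empty list when both the qtap and qtap-system jobs are healthy.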

Step 4: Import Grafana Dashboard

Already have Grafana running? Great! Skip ahead to Add Prometheus Data Source and use your existing Grafana instance.

Access Grafana

Navigate to http://localhost:3000 and log in:

  • Username: admin

  • Password: admin (or the value of GF_SECURITY_ADMIN_PASSWORD)

Add Prometheus Data Source

Already have Prometheus configured as a data source? You can skip this section and go straight to Import Qtap Dashboard.

  1. Navigate to Configuration → Data Sources

  2. Click Add data source

  3. Select Prometheus

  4. Set URL:

    • Docker Compose: http://prometheus:9090

    • Kubernetes: http://prometheus-server.monitoring.svc.cluster.local

  5. Click Save & Test
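The same data source can be created without the UI through Grafana's HTTP API (POST /api/datasources). The following is a minimal sketch, assuming the admin/admin credentials and service URLs from the Docker Compose file above. The helper names are hypothetical, and the data source name "Prometheus" is only an example.

```python
import base64
import json
import urllib.request


def datasource_payload(name, prometheus_url, is_default=True):
    """Build the JSON body describing a Prometheus data source."""
    return {
        "name": name,
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend talks to Prometheus, not the browser
        "isDefault": is_default,
    }


def build_request(grafana_url, user, password, payload):
    """Prepare (but do not send) the authenticated POST request."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )
```

To send it, pass the result of build_request("http://localhost:3000", "admin", "admin", datasource_payload("Prometheus", "http://prometheus:9090")) to urllib.request.urlopen.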

Import Qtap Dashboard

  1. Download the dashboard: qtap-http-overview.json

  2. In Grafana, navigate to Dashboards → Import

  3. Click Upload JSON file and select the downloaded file

  4. Select your Prometheus data source

  5. Click Import

The official Grafana dashboard may need label adjustments. Qtap v0 uses host labels, not domain. If panels are empty, edit queries to replace domain with host.
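If many panels need the label swap, editing each query by hand gets tedious. One possible approach is to rewrite the dashboard JSON before importing it. The sketch below assumes queries live under "expr" keys and legends under "legendFormat" keys, which is the usual layout for Prometheus panels but is not verified against qtap-http-overview.json. The function rename_label is illustrative.

```python
import re


def rename_label(node, old, new, keys=("expr", "legendFormat")):
    """Return a copy of a dashboard structure with a label renamed in query fields."""
    pattern = re.compile(rf"\b{re.escape(old)}\b")  # whole-word matches only

    def walk(value, key=None):
        if isinstance(value, dict):
            return {k: walk(v, k) for k, v in value.items()}
        if isinstance(value, list):
            return [walk(v, key) for v in value]
        if isinstance(value, str) and key in keys:
            return pattern.sub(new, value)
        return value  # titles, numbers, and other fields are left untouched

    return walk(node)
```

Load the file with json.load, call rename_label(dashboard, "domain", "host"), and write the result back with json.dump. Review the output before importing: the replacement is textual, so a literal string "domain" inside a query would also be rewritten.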

Dashboard Panels

The dashboard includes:

  • Request Rate: Total requests per second over time

  • Error Rate: Percentage of 4xx/5xx responses

  • Average Response Time: Mean response duration

  • Latency Percentiles: p50, p95, p99 request duration

  • Request/Response Sizes: Average payload sizes

  • Top Hosts: Highest traffic hosts

  • Status Code Distribution: Breakdown by HTTP status

  • Protocol Distribution: HTTP/1.1 vs HTTP/2 traffic

Step 5: Set Up Alerts

Create alerts.yml for Prometheus:

groups:
  - name: qtap-alerts
    interval: 30s
    rules:
      # High error rate alert
      - alert: HighHTTPErrorRate
        expr: |
          (
            sum(rate(qtap_http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(qtap_http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: warning
          component: qtap
        annotations:
          summary: "High HTTP 5xx error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"

      # High latency alert
      - alert: HighResponseLatency
        expr: |
          histogram_quantile(0.95,
            rate(qtap_http_requests_duration_ms_bucket[5m])
          ) > 1000
        for: 10m
        labels:
          severity: warning
          component: qtap
        annotations:
          summary: "High response latency detected"
          description: "95th percentile latency is {{ $value }}ms"

      # Traffic spike alert
      - alert: TrafficSpike
        expr: |
          rate(qtap_http_requests_total[5m])
          > 2 * rate(qtap_http_requests_total[1h] offset 1h)
        for: 5m
        labels:
          severity: info
          component: qtap
        annotations:
          summary: "Unusual traffic spike detected"
          description: "Current request rate is {{ $value | humanize }}req/s"

      # Qtap agent health
      - alert: QtapAgentDown
        expr: up{job="qtap"} == 0
        for: 2m
        labels:
          severity: critical
          component: qtap
        annotations:
          summary: "Qtap agent is down"
          description: "Qtap metrics endpoint is not responding"
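To make the HighHTTPErrorRate expression concrete: because both rates use the same 5m window, the window length cancels out and the ratio equals the share of new requests that returned 5xx during that window. The sketch below reproduces that arithmetic from two counter snapshots. It is a simplification (no extrapolation, no counter resets, and it returns 0.0 where PromQL would yield NaN for an idle window), and error_ratio is an illustrative name.

```python
import re


def error_ratio(earlier, later):
    """Share of 5xx responses among requests seen between two snapshots.

    earlier and later map status_code -> cumulative counter value.
    """
    total = 0.0
    errors = 0.0
    for code, value in later.items():
        delta = value - earlier.get(code, 0.0)
        total += delta
        if re.fullmatch(r"5..", code):  # same pattern as status_code=~"5.."
            errors += delta
    return errors / total if total else 0.0
```

For example, going from {"200": 100, "500": 2} to {"200": 190, "500": 12} is 100 new requests, 10 of them errors. The ratio is 0.1, which exceeds the 0.05 threshold, so the alert would fire once the condition has held for 5 minutes.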

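Update prometheus.yml to load the rules. The rule_files path is resolved inside the Prometheus container, so with Docker Compose you also need to mount alerts.yml there (for example, add ./alerts.yml:/etc/prometheus/alerts.yml under the prometheus service's volumes):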

global:
  scrape_interval: 15s

rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']  # Optional: configure Alertmanager

scrape_configs:
  # ... existing scrape configs

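Restart Prometheus to load the rules:

# Docker Compose (run "docker compose up -d" instead if you changed volumes in docker-compose.yml)
docker compose restart prometheus

# Kubernetes
kubectl rollout restart deployment/prometheus-server -n monitoring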

Configure Grafana Alerts (Alternative)

You can also set up alerts directly in Grafana:

  1. Open a dashboard panel

  2. Click Edit → Alert tab

  3. Create alert conditions based on the queries

  4. Configure notification channels (Slack, PagerDuty, email, etc.)

Step 6: Test the Monitoring Setup

Generate various types of traffic to see metrics flow through:

Successful Requests

for i in {1..50}; do
  curl -s https://httpbin.org/get > /dev/null
  sleep 0.1
done

Error Responses

for i in {1..10}; do
  curl -s https://httpbin.org/status/500 > /dev/null
  sleep 0.2
done

High Latency

for i in {1..5}; do
  curl -s https://httpbin.org/delay/2 > /dev/null
done

View in Grafana

  1. Open the qtap dashboard

  2. Observe request rate increase

  3. See error rate spike from 500 responses

  4. Check latency percentiles from delayed requests

Step 7: Optimize for Production

Reduce Metric Cardinality

High cardinality (many unique label combinations) can impact Prometheus performance. To reduce it:

Filter noisy processes in qtap config:

filters:
  groups:
    - qpoint
  custom:
    - exe: /usr/bin/health-check
      strategy: exact
    - exe: /usr/local/bin/monitoring-
      strategy: prefix

Focus on important domains:

tap:
  endpoints:
    - domain: 'api.production.example.com'
      http:
        stack: monitoring_stack

Use Recording Rules

Pre-compute expensive queries with Prometheus recording rules:

groups:
  - name: qtap-recordings
    interval: 15s
    rules:
      # Pre-compute request rate by host
      - record: qtap:http_requests:rate5m
        expr: |
          sum by (host) (rate(qtap_http_requests_total[5m]))

      # Pre-compute error rate
      - record: qtap:http_errors:rate5m
        expr: |
          sum(rate(qtap_http_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(qtap_http_requests_total[5m]))

      # Pre-compute p95 latency
      - record: qtap:http_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(qtap_http_requests_duration_ms_bucket[5m])) by (le)
          )

Use these in your dashboards and alerts for better performance.

Retention Settings

Configure Prometheus retention based on your needs:

# docker-compose.yml
command:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--storage.tsdb.path=/prometheus'
  - '--storage.tsdb.retention.time=30d'    # Keep 30 days
  - '--storage.tsdb.retention.size=10GB'    # Or max 10GB
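To choose a retention size, a rough estimate is retention time in seconds × ingested samples per second × bytes per sample. The Prometheus documentation cites roughly 1 to 2 bytes per sample on average. The sketch below applies that formula. The series count and the default of 2 bytes are assumptions to replace with numbers from your own deployment.

```python
def estimated_disk_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Approximate TSDB disk usage for a given series count and retention."""
    retention_seconds = retention_days * 24 * 60 * 60
    # Each active series yields one sample per scrape interval.
    return retention_seconds * active_series * bytes_per_sample / scrape_interval_s
```

For example, 10,000 active series scraped every 15 seconds and kept for 30 days comes to about 3.5 GB, comfortably under the 10GB cap shown above.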

Common Queries for Troubleshooting

Find services with high error rates

topk(5,
  sum by (host) (rate(qtap_http_requests_total{status_code=~"5.."}[5m]))
  /
  sum by (host) (rate(qtap_http_requests_total[5m]))
)

Identify slow endpoints

topk(10,
  histogram_quantile(0.95,
    sum by (host, le) (rate(qtap_http_requests_duration_ms_bucket[5m]))
  )
)
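Percentiles from histogram_quantile are estimates. Prometheus only knows how many observations fell at or below each bucket boundary (the le label), so it interpolates linearly inside the bucket that contains the requested rank. The following is a simplified sketch of that interpolation, not the Prometheus source, and the bucket boundaries in the example are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    buckets must include the +Inf bucket, whose count is the total.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == math.inf:
                # Rank falls past the last finite boundary: report that boundary.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == lower_count:
                return upper_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan  # not reached for 0 <= q <= 1
```

With buckets [(100, 50), (500, 90), (1000, 98), (math.inf, 100)] in milliseconds, the p95 rank of 95 falls in the 500 to 1000 bucket and interpolates to 812.5 ms. The p99 falls in the +Inf bucket, so the result is capped at 1000 ms. This is why latencies beyond the largest finite bucket never show up in percentile panels.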

HTTP/2 vs HTTP/1 Traffic

sum by (protocol) (rate(qtap_http_requests_total[5m]))

Average Request/Response Sizes

# Average request size
rate(qtap_http_requests_size_bytes_sum[5m])
/
rate(qtap_http_requests_size_bytes_count[5m])

# Average response size
rate(qtap_http_responses_size_bytes_sum[5m])
/
rate(qtap_http_responses_size_bytes_count[5m])

Compare traffic patterns over time

sum(rate(qtap_http_requests_total[5m]))
/
sum(rate(qtap_http_requests_total[5m] offset 1d))

Monitor overall system latency

# Average end-to-end transaction time (combined request+response)
rate(qtap_http_duration_ms_sum[5m])
/
rate(qtap_http_duration_ms_count[5m])

# 95th percentile overall latency
histogram_quantile(0.95,
  rate(qtap_http_duration_ms_bucket[5m])
)

Check active connections

# Total active connections
sum(qtap_connection_active_total)

# Active connections by destination
topk(10, qtap_connection_active_total)

Track TLS/HTTPS usage

# TLS handshake rate (HTTPS connection establishment)
sum(rate(qtap_connection_tls_handshake_total[5m]))

# Connections by TLS version
sum by (version) (rate(qtap_connection_tls_handshake_total[5m]))

Troubleshooting

No HTTP metrics appearing in Prometheus

Most Common Issue: Missing http_metrics plugin.

# Check if HTTP metrics exist
curl http://localhost:10001/metrics | grep "qtap_http_requests_total"

If empty:

  1. Verify http_metrics plugin is in your qtap config:

stacks:
  my_stack:
    plugins:
      - type: http_metrics  # ← Must be present

  2. Restart qtap after adding it:

docker restart qtap

  3. Verify metrics now appear:

curl http://localhost:10001/metrics | grep "qtap_http_requests_total"

Only seeing connection metrics

If you see qtap_connection_* metrics but no qtap_http_* metrics, the http_metrics plugin is not configured. See Step 0.

Metrics show but dashboard is empty

  1. Check label names: Qtap v0 uses host, not domain. Update queries:

    # Wrong
    sum by (domain) (rate(qtap_http_requests_total[5m]))
    
    # Correct
    sum by (host) (rate(qtap_http_requests_total[5m]))

  2. Check data source: Ensure Grafana is connected to the right Prometheus

  3. Check time range: Extend the time range in Grafana

  4. Verify queries: Test dashboard queries directly in Prometheus
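To test a dashboard query outside Grafana, you can send it to the Prometheus instant-query API (GET /api/v1/query). An empty result there means the problem is in the query or the data, not in Grafana. This is a minimal sketch with illustrative helper names, and the base URL assumes the local setup from this guide.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def extract_vector(payload):
    """Turn an instant-vector API response into a list of (labels, value)."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each value is a [timestamp, "value-as-string"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def instant_query(base_url, promql, timeout=5):
    """Run the query against a live Prometheus (network call)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return extract_vector(json.load(resp))
```

For example, instant_query("http://localhost:9090", "sum by (host) (rate(qtap_http_requests_total[5m]))") should return one entry per host once traffic has been observed.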

Prometheus can't reach qtap

# Docker Compose (Mac/Windows)
docker exec prometheus wget -O- http://host.docker.internal:10001/metrics
# Docker Compose (Linux):
# docker exec prometheus wget -O- http://172.17.0.1:10001/metrics

# Kubernetes
kubectl run curl-test --image=curlimages/curl --rm -it -- \
  curl http://qtap-metrics.default.svc.cluster.local:10001/metrics

High cardinality warnings

level=warn msg="Metric has too many labels"

This means you have too many unique label combinations. Solutions:

  1. Filter processes in qtap configuration

  2. Use recording rules to aggregate

  3. Limit domain capture with endpoints configuration

  4. Consider using Prometheus remote write to long-term storage
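Before and after applying these changes, it helps to measure which metrics contribute the most series. Every sample line in the exposition output is one time series, so counting lines per metric name gives a quick estimate. The sketch below does that for text fetched from the qtap metrics endpoint. The function name series_per_metric is illustrative.

```python
from collections import Counter


def series_per_metric(text):
    """Count exposed time series per metric name in exposition-format text."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # The metric name ends at the first "{" (labels) or space (no labels).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts
```

Calling series_per_metric(text).most_common(5) lists the five largest metrics. Histogram metrics appear as separate _bucket, _sum, and _count names, and the _bucket entry is usually the largest because each bucket boundary is its own series.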

Missing metrics after qtap restart

Counters reset to zero when qtap restarts. This is normal. Use the rate() function, which handles counter resets automatically.
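The idea behind that reset handling is that a counter can only go up, so any drop between two samples is treated as a restart from zero. The sketch below is a simplified illustration of that logic, without the window extrapolation Prometheus also applies. The function name is illustrative.

```python
def increase_with_resets(samples):
    """Total growth of a counter across chronological samples, tolerating resets."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the process restarted; everything in cur is new growth.
        total += cur - prev if cur >= prev else cur
    return total
```

For the samples [100, 120, 5, 25], subtracting the first value from the last gives -75, while the reset-aware total is 20 + 5 + 20 = 45. This is why dashboards built on rate() or increase() stay correct across qtap restarts, whereas graphs of the raw counter show a sudden drop.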

Next Steps

  • Customize dashboards: Add panels for your specific use cases

  • Set up Alertmanager: Route alerts to Slack, PagerDuty, or email

  • Create SLOs: Define service level objectives based on qtap metrics

  • Integrate with logs: Correlate metrics with qtap captured payloads in your object store

  • Multi-cluster monitoring: Federate metrics from multiple qtap deployments
