Tracing

Qtap can export OpenTelemetry traces, configured through standard OpenTelemetry SDK environment variables, that provide network-level observability of service connections. These traces capture the complete TCP connection lifecycle, including TLS handshake timing, protocol detection, and data transfer events.

Tracing vs Event Logs - Complementary Data:

  • Traces (this page): Network-level timing - connection lifecycle, TLS handshake, data transfer events, total service response time

  • Event Logs (event stores): HTTP transaction metadata - method, URL, path, status code, duration, process context

Use both together for complete observability: traces show network timing, event logs show HTTP transaction metadata. For headers/bodies, use object stores.

When to Use Tracing

Enable tracing when you need to:

  • Monitor service response times at the network level - Track complete connection duration including TLS handshake and data transfer timing

  • Understand connection lifecycle - See when connections open, TLS handshake completes, protocol detection occurs, and data is transmitted

  • Debug network-level issues - Identify slow TLS handshakes, protocol negotiation delays, or data transfer timing problems

  • Correlate network timing with HTTP transactions - Use connection.id to link network-level traces with HTTP transaction logs

  • Monitor event store performance - Track how long qtap takes to write events to backends

Traces complement logs by showing the network-level view while logs provide the HTTP transaction details.

Enabling Tracing

Tracing is opt-in via standard OpenTelemetry environment variables. No qtap.yaml configuration changes are required.

Required Environment Variables

OTEL_TRACES_EXPORTER=otlp
OTEL_EXPORTER_OTLP_PROTOCOL=grpc  # or 'http'

Optional Environment Variables

# OTLP endpoint (defaults to localhost:4317 for gRPC, localhost:4318 for HTTP)
OTEL_EXPORTER_OTLP_ENDPOINT=http://your-collector:4317

# TLS configuration (if needed)
OTEL_EXPORTER_OTLP_INSECURE=false
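
If your collector or backend requires authentication, the standard OTEL_EXPORTER_OTLP_HEADERS variable carries request headers as comma-separated key=value pairs (the header name and value below are placeholders; see the Honeycomb example in the backend-specific notes at the end of this page for concrete usage):

# Request headers for authenticated backends (placeholder name and value)
OTEL_EXPORTER_OTLP_HEADERS=x-api-key=YOUR_API_KEY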

Deployment Examples

Docker

docker run -d --name qtap \
  --user 0:0 --privileged \
  --cap-add CAP_BPF --cap-add CAP_SYS_ADMIN \
  --pid=host --network=host \
  -v /sys:/sys \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v "$(pwd)/qtap.yaml:/app/config/qtap.yaml" \
  -e TINI_SUBREAPER=1 \
  -e OTEL_TRACES_EXPORTER=otlp \
  -e OTEL_EXPORTER_OTLP_PROTOCOL=grpc \
  --ulimit=memlock=-1 \
  us-docker.pkg.dev/qpoint-edge/public/qtap:v0 \
  --config="/app/config/qtap.yaml"

Kubernetes (Helm)

extraEnv:
  - name: OTEL_TRACES_EXPORTER
    value: "otlp"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.monitoring.svc.cluster.local:4317"

Kubernetes (DaemonSet)

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: qtap
spec:
  template:
    spec:
      containers:
      - name: qtap
        image: us-docker.pkg.dev/qpoint-edge/public/qtap:v0
        env:
        - name: OTEL_TRACES_EXPORTER
          value: "otlp"
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: "grpc"
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://otel-collector.monitoring.svc.cluster.local:4317"
        # ... other config
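
Apply the manifest and confirm the variables made it into the pod spec (the manifest file name is a placeholder):

kubectl apply -f qtap-daemonset.yaml
kubectl describe daemonset qtap | grep OTEL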

What Traces Contain

Qtap exports traces showing network-level connection behavior and service response timing. These spans capture the complete TCP connection lifecycle from the kernel's perspective.

Connection Lifecycle Spans

Tracks the complete TCP connection lifecycle including network timing:

Span Name: Connection
Attributes:
  - connection.id: Unique connection identifier (correlates with logs)
  - connection.cookie: Internal tracking ID
  - connection.handler: Handler type (e.g., "raw")
  - connection.strategy: "observe"

Events:
  - OpenEvent: Connection established (timestamp)
  - TLSClientHelloEvent: TLS handshake initiated (timestamp + duration from open)
  - ProtocolEvent: L7 protocol detected (e.g., http2) (timestamp)
  - DataEvent: Network data transmitted (timestamp, byte count)
  - CloseEvent: Connection closed (timestamp)

Duration: Total connection lifetime (matches service response time)

Key Insight: The span duration shows the complete service response time at the network level. For example, a request to a service with a 2-second delay shows a Connection span of roughly 2 seconds plus connection and TLS overhead, with DataEvents revealing exactly when the server responded.

Use case: Monitor service response times at the network level, identify slow TLS handshakes, track protocol negotiation, and correlate network timing with HTTP transaction details using connection.id.

Event Store Operation Spans

Instrumentation Scope: qtap.eventstore

Tracks qtap's internal performance when writing events to configured backends:

Span Name: (varies by operation)
Duration: Time qtap spent writing to event stores

Use case: Monitor event store write latency to identify slow backends or performance bottlenecks in qtap's event processing.

Sample Connection Trace

Real example from qtap v0.11.2 showing a request to https://httpbin.org/delay/2 (service with 2-second delay):

{
  "traceId": "2897161d955cf0cc50bb54778a15e99f",
  "spanId": "8388bab4ebafb20f",
  "name": "Connection",
  "kind": "Internal",
  "startTime": "2025-10-16T21:06:23.337719776Z",
  "endTime": "2025-10-16T21:06:26.363734867Z",
  "attributes": {
    "connection.id": "d3olsjo7p3qod2n6hdd0",
    "connection.cookie": 479863,
    "connection.handler": "raw",
    "connection.strategy": "observe"
  },
  "events": [
    {
      "name": "OpenEvent",
      "timestamp": "2025-10-16T21:06:23.337922855Z"
    },
    {
      "name": "TLSClientHelloEvent",
      "timestamp": "2025-10-16T21:06:23.4135878Z"
    },
    {
      "name": "ProtocolEvent",
      "timestamp": "2025-10-16T21:06:23.568508654Z",
      "attributes": { "protocol": "http2" }
    },
    {
      "name": "DataEvent",
      "timestamp": "2025-10-16T21:06:23.568554984Z",
      "attributes": { "bytes": 108 }
    },
    {
      "name": "DataEvent",
      "timestamp": "2025-10-16T21:06:26.362644285Z",
      "attributes": { "bytes": 436 }
    },
    {
      "name": "CloseEvent",
      "timestamp": "2025-10-16T21:06:26.363614848Z"
    }
  ],
  "resource": {
    "service.name": "tap",
    "service.namespace": "qpoint",
    "service.version": "v0.11.2",
    "service.instance.id": "qpoint"
  },
  "scope": {
    "name": "github.com/qpoint-io/qtap/pkg/connection"
  }
}

Understanding the Timing

Span Duration: 3.026 seconds (closely matches the 2-second service delay + network overhead)

Event Timeline:

  • 21:06:23.337: Connection opened (OpenEvent)

  • 21:06:23.413: TLS handshake initiated (+76ms)

  • 21:06:23.568: HTTP/2 protocol detected (+231ms total)

  • 21:06:23.568: Request data sent (108 bytes)

  • 21:06:26.362: Server response received (+2.794s after the request was sent - includes the 2-second delay!)

  • 21:06:26.363: Connection closed

This trace shows the complete network-level view of the service interaction. The 2.794-second gap between the request and response DataEvents reveals the actual server processing time, matching the expected 2-second delay plus network latency.
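
If you export spans as JSON (for example via the Collector's file exporter), this timeline can be reconstructed quickly. A minimal sketch with jq, assuming a file span.json shaped like the simplified sample above:

# Print each event's timestamp and name in order
jq -r '.events[] | "\(.timestamp)  \(.name)"' span.json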

Correlating Traces with Logs

Use connection.id to link network-level traces with HTTP transaction logs for complete observability:

In Traces (Network Timing):

{
  "name": "Connection",
  "attributes": {
    "connection.id": "d3olsjo7p3qod2n6hdd0"
  },
  "startTime": "2025-10-16T21:06:23.337719776Z",
  "endTime": "2025-10-16T21:06:26.363734867Z",
  "events": [
    { "name": "OpenEvent", "timestamp": "2025-10-16T21:06:23.337922855Z" },
    { "name": "TLSClientHelloEvent", "timestamp": "2025-10-16T21:06:23.4135878Z" },
    { "name": "DataEvent", "timestamp": "2025-10-16T21:06:26.362644285Z" }
  ]
}

In Event Logs (HTTP Transaction Details):

{
  "event.type": "artifact_record",
  "event": {
    "summary.connection_id": "d3olsjo7p3qod2n6hdd0",
    "summary.request_method": "GET",
    "summary.request_host": "httpbin.org",
    "summary.request_path": "/delay/2",
    "summary.response_status": 200,
    "summary.duration_ms": 2794
  }
}

Complete Picture:

  • Trace shows: Connection took 3.026s total, TLS handshake began 76ms after the connection opened, server responded 2.794s after the request was sent

  • Log shows: GET request to httpbin.org/delay/2 returned 200 OK in 2.794s

  • Together: Full observability of both network-level timing and HTTP transaction details

Querying Traces

Monitor Service Response Times

// Find slow service responses (> 1 second at network level)
service.name="tap" AND span.name="Connection" AND span.duration > 1000

Identify TLS Handshake Issues

// Find connections with slow TLS handshakes (> 100ms)
service.name="tap" AND span.name="Connection" AND span.events.name="TLSClientHelloEvent"
// Then examine event timestamps to calculate TLS handshake duration

Track Protocol Detection Timing

// Find connections where protocol detection took time
service.name="tap" AND span.name="Connection" AND span.events.name="ProtocolEvent"

Monitor Event Store Performance

// Track event store write latency
instrumentation.scope="qtap.eventstore"

Correlate Trace with Specific HTTP Transaction

// Find network-level trace for specific HTTP transaction
service.name="tap" AND span.name="Connection" AND attributes.connection.id="d3olsjo7p3qod2n6hdd0"
// Then correlate with logs using the same connection.id

OpenTelemetry Collector Configuration

Ensure your OTel Collector has a traces pipeline configured:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 10s

exporters:
  otlphttp:
    endpoint: "https://your-backend.com/v1/traces"

service:
  pipelines:
    traces:  # Required for qtap tracing
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
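
To confirm spans are arriving before wiring up a real backend, you can temporarily add the Collector's debug exporter to the same pipeline (a sketch; on older Collector releases this exporter is named logging):

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp, debug]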

Troubleshooting

No Traces Appearing

Verify environment variables are set:

# Docker
docker inspect qtap | grep -A 5 OTEL

# Kubernetes
kubectl describe pod qtap-xxxxx | grep -A 5 OTEL

Check OTel Collector has traces pipeline:

# Should see "traces" in pipelines section
kubectl logs -n monitoring otel-collector-xxxxx | grep pipeline
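
Also confirm that the qtap host or pod can reach the collector endpoint at all; a basic TCP check is usually enough (the hostname and port below match the earlier examples - substitute your own):

# Run from the node or a debug pod on the same network
nc -vz otel-collector.monitoring.svc.cluster.local 4317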

Traces But No Event Logs

Tracing and logging are independent: traces require the OpenTelemetry environment variables, while event logs require event_stores configuration in qtap.yaml. Ensure both are configured if you want both.

High Trace Volume

Qtap creates Connection spans for every TCP connection it processes. To reduce volume:

  1. Use sampling in your OTel Collector:

    processors:
      probabilistic_sampler:
        sampling_percentage: 10  # Sample 10% of traces

  2. Filter by instrumentation scope:

    processors:
      filter/traces:
        traces:
          exclude:
            match_type: strict
            instrumentation_libraries:
              - name: qtap.eventstore  # Exclude event store traces

Best Practices

  1. Use traces for network-level service monitoring - Monitor Connection span duration to track service response times at the network level, including TLS and protocol negotiation overhead

  2. Combine with logs for complete observability - Use connection.id to correlate network timing (traces) with HTTP transaction details (logs)

  3. Use sampling in production - Sample 10-20% of traces to reduce volume while maintaining visibility into network-level performance patterns

  4. Analyze DataEvent timing - Examine DataEvent timestamps to identify when servers actually respond, revealing server-side delays

  5. Monitor TLS and protocol timing - Track TLSClientHelloEvent and ProtocolEvent to identify connection establishment bottlenecks

  6. Investigate Connection span outliers - High-duration Connection spans indicate slow service responses; correlate with logs to see which endpoints/requests are affected

  7. Monitor event store latency - Track qtap.eventstore spans to identify slow event store writes

What Traces Show vs. What Logs Show

Traces, logs (events), and objects provide complementary views of the same traffic. Here's what each contains using a real httpbin.org example:

Traces Provide Network-Level Timing

Connection span showing network lifecycle (via OpenTelemetry Traces):

{
  "name": "Connection",
  "startTime": "2025-10-16T21:06:23.337719776Z",
  "endTime": "2025-10-16T21:06:26.363734867Z",
  "attributes": {
    "connection.id": "d3olsjo7p3qod2n6hdd0"
  },
  "events": [
    {"name": "OpenEvent", "timestamp": "2025-10-16T21:06:23.337922855Z"},
    {"name": "TLSClientHelloEvent", "timestamp": "2025-10-16T21:06:23.4135878Z"},
    {"name": "ProtocolEvent", "timestamp": "2025-10-16T21:06:23.568508654Z"},
    {"name": "DataEvent", "timestamp": "2025-10-16T21:06:26.362644285Z"}
  ]
}

Shows: Connection took 3.026s, TLS handshake initiated at +76ms, server responded ~2.8s after the request was sent

Logs Provide HTTP Transaction Metadata

Event showing HTTP transaction summary (via OpenTelemetry Logs to event_stores):

{
  "event.type": "artifact_record",
  "event": {
    "summary.connection_id": "d3olsjo7p3qod2n6hdd0",
    "summary.request_method": "GET",
    "summary.request_host": "httpbin.org",
    "summary.request_path": "/delay/2",
    "summary.response_status": 200,
    "summary.duration_ms": 2794,
    "process.exe": "/usr/bin/curl"
  }
}

Shows: What happened (GET to /delay/2, returned 200 in 2.794s, from the curl process)

Does NOT contain: Headers or bodies

Objects Provide Sensitive Data

HTTP artifact with headers and bodies (captured by http_capture plugin, sent to object_stores like S3):

{
  "connection_id": "d3olsjo7p3qod2n6hdd0",
  "request": {
    "method": "GET",
    "url": "https://httpbin.org/delay/2",
    "headers": {
      ":authority": "httpbin.org",
      ":method": "GET",
      ":path": "/delay/2",
      "User-Agent": "curl/8.14.1"
    }
  },
  "response": {
    "status": 200,
    "headers": {
      ":status": "200",
      "Content-Type": "application/json",
      "Content-Length": "305",
      "Server": "gunicorn/19.9.0"
    },
    "body": "{\"args\": {}, \"headers\": {...}, \"origin\": \"73.71.138.108\"}"
  }
}

Shows: Complete HTTP details including headers and response body

Stored in: S3-compatible storage for data sovereignty (keeps sensitive data in your network)

Using All Three Together

For the same request (connection.id: d3olsjo7p3qod2n6hdd0):

  • Trace (OpenTelemetry): Network timing (3.026s total, TLS handshake at +76ms, server response ~2.8s after the request)

  • Log (OpenTelemetry): What happened (GET /delay/2 returned 200 in 2.794s from curl)

  • Object (http_capture plugin): Complete details (all headers, response body with actual data)

Correlate using connection.id to get the complete picture.

Note: Traces and logs use OpenTelemetry. Objects are captured by the http_capture plugin and sent to object_stores.

Backend-Specific Notes

Jaeger

Jaeger works well with qtap traces:

# Deploy Jaeger with OTLP support
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/latest/download/jaeger-operator.yaml

Set endpoint to Jaeger's OTLP collector:

OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger-collector:4317
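
For quick local testing, the Jaeger all-in-one image includes an OTLP gRPC receiver on port 4317 (on recent releases the COLLECTOR_OTLP_ENABLED flag is on by default; it is shown here for older versions):

docker run --rm --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:latest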

Datadog APM

OTEL_EXPORTER_OTLP_ENDPOINT=https://trace.agent.datadoghq.com
# Note: Datadog also requires API key via DD-API-KEY header

Honeycomb

OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io
OTEL_EXPORTER_OTLP_HEADERS=x-honeycomb-team=YOUR_API_KEY,x-honeycomb-dataset=qtap-traces

Grafana Tempo

OTEL_EXPORTER_OTLP_ENDPOINT=https://tempo-prod-us-central-0.grafana.net:443
# Requires authentication via headers
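
Headers are passed the same way as in the Honeycomb example. For Grafana Cloud this is typically HTTP basic auth built from your stack's instance ID and an API token; the exact header name and encoding below are assumptions, so check your Tempo data source settings:

# Value is URL-encoded; %20 is the space between "Basic" and the base64-encoded credentials
OTEL_EXPORTER_OTLP_HEADERS=authorization=Basic%20<base64 of INSTANCE_ID:API_TOKEN>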
