Service Mesh Lite: Observability Without Sidecars

Service meshes like Istio and Linkerd provide powerful observability into microservice communication, but they come with significant operational overhead: complex deployments, sidecar proxies on every pod, added latency, and resource consumption.

What if you could get service mesh observability without the complexity?

This guide demonstrates how Qtap provides core service mesh observability features—service discovery, request tracing, error monitoring, and latency tracking—without sidecars, without control planes, and with zero latency impact.


The Service Mesh Complexity Problem

Service meshes solve real problems but introduce significant complexity:

Deployment Complexity:

  • Control plane components (istiod, which now bundles pilot and friends)

  • Sidecar injection on every pod

  • Webhook configuration and RBAC setup

  • Certificate management for mTLS

Resource Overhead:

  • Each sidecar consumes CPU/memory (multiply by number of pods)

  • Control plane resource requirements

  • Significant cluster resource increase

Latency Impact:

  • Every request goes through a proxy (adding milliseconds per hop)

  • Data plane overhead on critical path

  • Performance degradation under load

Operational Burden:

  • Complex troubleshooting (is it my app or the mesh?)

  • Version upgrades affect entire cluster

  • Learning curve for operators

Most teams want observability, not traffic control. If you don't need advanced features like circuit breaking, traffic splitting, or mTLS enforcement, you're paying the full complexity cost of a mesh for the small slice of features you actually use.


What Qtap Provides Instead

Qtap offers a "Service Mesh Lite" approach focused purely on observability:

✅ What You Get:

  • Service Discovery: Automatic mapping of service dependencies

  • Request Tracing: Follow requests across services

  • Error Monitoring: Track failures and error rates per service

  • Latency Metrics: Measure p50, p95, p99 response times

  • Traffic Volume: Connections per second, bytes transferred

  • Zero Instrumentation: No code changes or configuration needed

✅ How It's Different:

  • Out-of-band observation: Qtap watches traffic passively, doesn't proxy it

  • Zero latency: No proxies in the request path

  • No sidecars: One DaemonSet per node (not per pod)

  • No control plane: Just install qtap and start observing

❌ What Qtap DOESN'T Do:

  • Traffic routing or load balancing

  • Circuit breaking or retries

  • Traffic splitting (canary deployments)

  • mTLS between services

  • Rate limiting or quotas

Qtap is observation-only. If you need traffic control, use a service mesh. If you just need visibility, Qtap is simpler.


Demo Architecture

We'll demonstrate service mesh observability with a multi-service application:

                    ┌─────────────┐
         ┌─────────→│  Frontend   │
         │          │  (Client)   │
         │          └──────┬──────┘
         │                 │
         │                 ↓
    [External]      ┌─────────────┐
    [Request]       │     API     │ (Port 5000)
                    │   Gateway   │
                    └──────┬──────┘

                    ┌──────┴──────┐
                    ↓             ↓
             ┌─────────────┐  ┌─────────────┐
             │   User      │  │   Order     │
             │  Service    │  │  Service    │
             │ (Port 5001) │  │ (Port 5002) │
             └─────────────┘  └──────┬──────┘


                              ┌─────────────┐
                              │  httpbin.org│ (External API)
                              └─────────────┘

Traffic Flows:

  1. Client → API Gateway → User Service: Simple service-to-service call

  2. Client → API Gateway → Order Service → httpbin.org: Service calling external API

  3. Client → API Gateway → User Service + Order Service: Fan-out to multiple services

Qtap captures ALL of this:

  • Service A calling Service B (both egress from A and ingress to B)

  • External API calls from services

  • Error responses and status codes

  • Request/response timing


Prerequisites

  • Linux host with Docker

  • jq for JSON parsing

  • Basic understanding of microservices


Part 1: Deploy the Multi-Service Application

We'll create three microservices that communicate with each other.

API Gateway Service

#!/usr/bin/env python3
"""
API Gateway - Routes requests to backend services
"""
from flask import Flask, jsonify
import requests
import os

app = Flask(__name__)

USER_SERVICE_URL = os.environ.get('USER_SERVICE_URL', 'http://localhost:5001')
ORDER_SERVICE_URL = os.environ.get('ORDER_SERVICE_URL', 'http://localhost:5002')

@app.route('/health')
def health():
    return jsonify({'status': 'healthy', 'service': 'api-gateway'})

@app.route('/api/users/<user_id>')
def get_user(user_id):
    """Proxy request to user service"""
    response = requests.get(f'{USER_SERVICE_URL}/users/{user_id}', timeout=5)
    return jsonify(response.json()), response.status_code

@app.route('/api/orders/<user_id>')
def get_orders(user_id):
    """Proxy request to order service"""
    response = requests.get(f'{ORDER_SERVICE_URL}/orders/{user_id}', timeout=5)
    return jsonify(response.json()), response.status_code

@app.route('/api/user-with-orders/<user_id>')
def get_user_with_orders(user_id):
    """Fan-out request to both services"""
    user_response = requests.get(f'{USER_SERVICE_URL}/users/{user_id}', timeout=5)
    orders_response = requests.get(f'{ORDER_SERVICE_URL}/orders/{user_id}', timeout=5)

    return jsonify({
        'user': user_response.json(),
        'orders': orders_response.json()
    }), 200

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
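
Note that the fan-out handler above issues its two backend requests sequentially, so the endpoint's latency is the sum of both calls. If you want the fan-out to actually run in parallel, a small helper built on the standard library's ThreadPoolExecutor is enough (a sketch, not part of the demo code — `fan_out` is a name introduced here):

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(tasks):
    """Run independent zero-argument callables concurrently.

    Returns their results in the same order as `tasks`; any exception
    raised by a task is re-raised when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max(len(tasks), 1)) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]
```

Inside get_user_with_orders this would look like `user, orders = fan_out([lambda: requests.get(...).json(), lambda: requests.get(...).json()])`, cutting the endpoint's latency to roughly the slower of the two calls.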

User Service

#!/usr/bin/env python3
"""
User Service - Returns user data
"""
from flask import Flask, jsonify

app = Flask(__name__)

USERS = {
    '1': {'id': '1', 'name': 'Alice', 'email': '[email protected]'},
    '2': {'id': '2', 'name': 'Bob', 'email': '[email protected]'},
    '3': {'id': '3', 'name': 'Charlie', 'email': '[email protected]'},
}

@app.route('/health')
def health():
    return jsonify({'status': 'healthy', 'service': 'user-service'})

@app.route('/users/<user_id>')
def get_user(user_id):
    if user_id == '999':  # Simulate error
        return jsonify({'error': 'User not found', 'service': 'user-service'}), 404

    user = USERS.get(user_id)
    if user:
        return jsonify(user), 200
    else:
        return jsonify({'error': 'User not found', 'service': 'user-service'}), 404

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5001)

Order Service

#!/usr/bin/env python3
"""
Order Service - Returns orders and calls external payment API
"""
from flask import Flask, jsonify
import requests

app = Flask(__name__)

ORDERS = {
    '1': [
        {'id': 'order-101', 'user_id': '1', 'total': 99.99, 'status': 'completed'},
        {'id': 'order-102', 'user_id': '1', 'total': 49.99, 'status': 'pending'},
    ],
    '2': [
        {'id': 'order-201', 'user_id': '2', 'total': 199.99, 'status': 'completed'},
    ],
    '3': [],
}

@app.route('/health')
def health():
    return jsonify({'status': 'healthy', 'service': 'order-service'})

@app.route('/orders/<user_id>')
def get_orders(user_id):
    """Get orders and verify with external payment API"""
    try:
        # Call external API (httpbin for demo)
        payment_response = requests.get(
            'https://httpbin.org/json',
            timeout=5,
            headers={'X-User-ID': user_id}
        )
        payment_status = 'payment-api-ok' if payment_response.status_code == 200 else 'payment-api-error'
    except Exception as e:
        payment_status = f'payment-api-error: {str(e)}'

    orders = ORDERS.get(user_id, [])
    return jsonify({
        'user_id': user_id,
        'orders': orders,
        'payment_api_status': payment_status,
        'service': 'order-service'
    }), 200

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5002)

Part 2: Qtap Configuration for Service Mesh Observability

Create /tmp/qtap-mesh-lite.yaml:

version: 2

# Service Mesh Lite Configuration
# Captures all service-to-service traffic for observability

services:
  event_stores:
    - type: stdout
  object_stores:
    - type: stdout

rulekit:
  macros:
    - name: is_error
      expr: http.res.status >= 400 && http.res.status < 600
    - name: is_api_gateway
      expr: src.container.name == "api-gateway"
    - name: is_user_service
      expr: src.container.name == "user-service"
    - name: is_order_service
      expr: src.container.name == "order-service"

stacks:
  mesh_observability:
    plugins:
      - type: http_capture
        config:
          level: summary  # (none|summary|details|full) - Lightweight by default
          format: json    # (json|text) - JSON for easy parsing
          rules:
            # Capture full details for errors to debug failures
            - name: "Service errors"
              expr: is_error()
              level: full

            # Capture details for all API gateway traffic
            - name: "API Gateway traffic"
              expr: is_api_gateway()
              level: details

tap:
  direction: all          # (egress|egress-external|egress-internal|ingress|all)
                          # CRITICAL: Use 'all' to see BOTH sides of service calls
  ignore_loopback: false  # (true|false) - MUST be false to see localhost traffic!
  audit_include_dns: false # (true|false) - Skip DNS noise

  http:
    stack: mesh_observability

  # Filter out qtap's own traffic
  filters:
    groups:
      - qpoint

Key configuration points:

  1. direction: all: CRITICAL for service mesh observability. Captures both:

    • Egress: Requests leaving Service A

    • Ingress: Requests arriving at Service B

    • This gives you BOTH sides of every service-to-service call

  2. ignore_loopback: false: Required to capture localhost traffic in Docker

  3. level: summary by default: Lightweight capture (method, URL, status, timing)

  4. level: full for errors: Capture complete request/response for debugging

  5. Rulekit macros: Identify traffic by container name for filtering


Part 3: Deploy with Docker Compose

Create /tmp/docker-compose-mesh-lite.yaml:

version: '3.9'

services:
  # API Gateway - Entry point
  api-gateway:
    image: python:3.11-slim
    container_name: api-gateway
    working_dir: /app
    network_mode: host  # Required for qtap to see localhost traffic
    command: >
      bash -c "pip install -q flask requests &&
               python api-gateway.py"
    volumes:
      - /tmp/api-gateway.py:/app/api-gateway.py
    environment:
      - USER_SERVICE_URL=http://localhost:5001
      - ORDER_SERVICE_URL=http://localhost:5002

  # User Service
  user-service:
    image: python:3.11-slim
    container_name: user-service
    working_dir: /app
    network_mode: host
    command: >
      bash -c "pip install -q flask &&
               python user-service.py"
    volumes:
      - /tmp/user-service.py:/app/user-service.py

  # Order Service
  order-service:
    image: python:3.11-slim
    container_name: order-service
    working_dir: /app
    network_mode: host
    command: >
      bash -c "pip install -q flask requests &&
               python order-service.py"
    volumes:
      - /tmp/order-service.py:/app/order-service.py

  # Qtap - Traffic capture
  qtap:
    image: us-docker.pkg.dev/qpoint-edge/public/qtap:v0
    container_name: qtap-mesh-lite
    privileged: true
    user: "0:0"
    cap_add:
      - CAP_BPF
      - CAP_SYS_ADMIN
    pid: host
    network_mode: host
    volumes:
      - /sys:/sys
      - /var/run/docker.sock:/var/run/docker.sock
      - /tmp/qtap-mesh-lite.yaml:/app/config/qtap.yaml
    environment:
      - TINI_SUBREAPER=1
    ulimits:
      memlock: -1
    command:
      - --log-level=info
      - --log-encoding=console
      - --config=/app/config/qtap.yaml

Start the stack:

docker compose -f /tmp/docker-compose-mesh-lite.yaml up -d

Wait for services to initialize:

sleep 10
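
A fixed sleep works but is fragile on slow machines. A small readiness poll is more reliable — this is a sketch (the `wait_for_healthy` helper is introduced here, and it assumes the demo's /health endpoints on localhost ports 5000-5002):

```python
import time
import urllib.request

def wait_for_healthy(urls, timeout=30.0, interval=1.0, probe=None):
    """Poll each URL until every one returns HTTP 200, or time out.

    `probe` can be swapped out for testing; the default does a real
    HTTP GET and treats any error as "not ready yet".
    """
    def default_probe(url):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                return resp.status == 200
        except OSError:
            return False

    probe = probe or default_probe
    deadline = time.monotonic() + timeout
    pending = set(urls)
    while pending:
        pending = {u for u in pending if not probe(u)}
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

Call it with e.g. `wait_for_healthy([f"http://localhost:{p}/health" for p in (5000, 5001, 5002)])` before generating traffic.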

Part 4: Generate Traffic and Observe

Test 1: Simple Service-to-Service Call

curl -s http://localhost:5000/api/users/1 | jq .

Output:

{
  "email": "[email protected]",
  "id": "1",
  "name": "Alice"
}

What Qtap captures:

  • Client → API Gateway (egress from curl)

  • API Gateway → User Service (egress from gateway, ingress to user-service)

  • Complete request chain with timing


Test 2: Service Calling External API

curl -s http://localhost:5000/api/orders/1 | jq .

Output:

{
  "orders": [
    {
      "id": "order-101",
      "status": "completed",
      "total": 99.99,
      "user_id": "1"
    },
    {
      "id": "order-102",
      "status": "pending",
      "total": 49.99,
      "user_id": "1"
    }
  ],
  "payment_api_status": "payment-api-ok",
  "service": "order-service",
  "user_id": "1"
}

What Qtap captures:

  • Client → API Gateway

  • API Gateway → Order Service

  • Order Service → httpbin.org (external API call)

  • TLS inspection: Qtap sees inside HTTPS to httpbin.org before encryption


Test 3: Fan-Out Request

curl -s http://localhost:5000/api/user-with-orders/2 | jq .

Output:

{
  "orders": {
    "orders": [
      {
        "id": "order-201",
        "status": "completed",
        "total": 199.99,
        "user_id": "2"
      }
    ],
    "payment_api_status": "payment-api-ok",
    "service": "order-service",
    "user_id": "2"
  },
  "user": {
    "email": "[email protected]",
    "id": "2",
    "name": "Bob"
  }
}

What Qtap captures:

  • API Gateway making parallel calls to User Service AND Order Service

  • Both service responses with timing

  • Service dependency mapping (gateway depends on both services)


Test 4: Error Scenario

curl -s http://localhost:5000/api/users/999 | jq .

Output:

{
  "error": "User not found",
  "service": "user-service"
}

What Qtap captures:

  • 404 response captured with level: full (includes headers and body)

  • Error tracking per service

  • Useful for identifying which service is failing


Part 5: Analyze Service Mesh Observability

View Qtap Logs

docker logs qtap-mesh-lite 2>&1 | grep '"method"' | jq .

Example 1: Service-to-Service Call (Egress Side)

{
  "metadata": {
    "process_exe": "/usr/local/bin/python3.11",
    "connection_id": "abc123"
  },
  "request": {
    "method": "GET",
    "url": "http://localhost:5001/users/1",
    "authority": "localhost:5001",
    "protocol": "http1",
    "user_agent": "python-requests/2.32.5"
  },
  "response": {
    "status": 200,
    "content_type": "application/json"
  },
  "direction": "egress-external",
  "duration_ms": 2
}

Interpretation:

  • Python process (API Gateway) made request

  • Destination: User Service (port 5001)

  • Direction: egress (leaving API Gateway)

  • Response time: 2ms


Example 2: Same Call (Ingress Side)

{
  "metadata": {
    "process_exe": "/usr/local/bin/python3.11",
    "connection_id": "abc124"
  },
  "request": {
    "method": "GET",
    "url": "http://localhost:5001/users/1",
    "authority": "localhost:5001",
    "protocol": "http1"
  },
  "response": {
    "status": 200
  },
  "direction": "ingress",
  "duration_ms": 1
}

Interpretation:

  • Same request, different perspective

  • Direction: ingress (arriving at User Service)

  • Qtap captured BOTH sides of the call

  • This is the core of service mesh observability
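
Because each internal call shows up twice — once as egress, once as ingress — you can correlate the two sides by method and URL. A minimal pairing sketch over qtap's JSON lines (field names as shown in the examples above; the `pair_directions` helper is introduced here):

```python
import json
from collections import defaultdict

def pair_directions(lines):
    """Group qtap capture events by (method, url) and collect the
    direction of each occurrence, lining up the egress/ingress sides."""
    calls = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip qtap's non-JSON console output
        event = json.loads(line)
        req = event.get("request", {})
        calls[(req.get("method"), req.get("url"))].append(event.get("direction"))
    return dict(calls)

# Two sides of the same call, as in the examples above:
sample = [
    '{"request": {"method": "GET", "url": "http://localhost:5001/users/1"}, "direction": "egress-external"}',
    '{"request": {"method": "GET", "url": "http://localhost:5001/users/1"}, "direction": "ingress"}',
]
```

`pair_directions(sample)` returns one key for the call with both directions listed — any key with only an egress entry is a request that never arrived, which is itself a useful debugging signal.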


Example 3: External API Call

{
  "metadata": {
    "process_exe": "/usr/local/bin/python3.11",
    "endpoint_id": "httpbin.org",
    "is_tls": true
  },
  "request": {
    "method": "GET",
    "url": "https://httpbin.org/json",
    "authority": "httpbin.org",
    "protocol": "http1"
  },
  "response": {
    "status": 200,
    "content_type": "application/json"
  },
  "direction": "egress-external",
  "duration_ms": 1949
}

Interpretation:

  • Order Service called external API

  • Qtap saw inside HTTPS (before TLS encryption)

  • Much slower (1949ms) than internal calls (1-2ms)

  • Useful for identifying slow external dependencies


Part 6: Service Mesh Features Demonstrated

✅ Service Discovery

Qtap automatically discovered:

  • API Gateway depends on User Service (port 5001)

  • API Gateway depends on Order Service (port 5002)

  • Order Service depends on httpbin.org (external)

No configuration required. Just observe traffic and map dependencies.
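
The mapping can be scripted directly from the JSON output. A sketch (the `dependency_edges` helper and the `container` metadata field in the sample are assumptions — use whatever source-attribution metadata your deployment actually emits):

```python
import json

def dependency_edges(lines, source_of):
    """Build (source service, destination authority) edges from egress
    events in qtap's JSON output. `source_of` maps an event to a service
    name; how you derive it (container name, pod label, process path)
    depends on the metadata available."""
    edges = set()
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue
        event = json.loads(line)
        if str(event.get("direction", "")).startswith("egress"):
            src = source_of(event)
            dst = event.get("request", {}).get("authority")
            if src and dst:
                edges.add((src, dst))
    return edges

# Hypothetical events carrying a "container" field in metadata:
sample = [
    '{"metadata": {"container": "api-gateway"}, "request": {"authority": "localhost:5001"}, "direction": "egress-external"}',
    '{"metadata": {"container": "order-service"}, "request": {"authority": "httpbin.org"}, "direction": "egress-external"}',
    '{"metadata": {"container": "user-service"}, "request": {"authority": "localhost:5001"}, "direction": "ingress"}',
]
```

The resulting edge set is exactly the dependency graph above: gateway → user-service, order-service → httpbin.org.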


✅ Request Tracing

Follow a single request through the system:

1. Client → API Gateway [2ms]
   GET /api/user-with-orders/2

2. API Gateway → User Service [1ms]
   GET /users/2
   Response: {"id": "2", "name": "Bob"}

3. API Gateway → Order Service [5270ms]  ← Slow!
   GET /orders/2

4. Order Service → httpbin.org [1949ms]
   GET https://httpbin.org/json

Total: ~5.3 seconds (the 1949ms external call happens inside the Order Service's 5270ms, so it isn't added twice)

Insight: Order Service is slow because it waits for external API. Consider caching or async processing.


✅ Error Monitoring

Errors captured with full details:

docker logs qtap-mesh-lite 2>&1 | grep '"status":404' | jq .

Output shows:

  • Which service returned 404 (User Service)

  • Request that caused it (GET /users/999)

  • Full response body with error message

  • Captured at level: full due to rulekit rule
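
Per-service error rates fall out of the same JSON stream. A small sketch (the `error_rates` helper is a name introduced here), keyed by destination authority:

```python
import json
from collections import Counter

def error_rates(lines):
    """Per-authority (error count, total count) from qtap JSON lines,
    counting any 4xx/5xx response as an error."""
    totals, errors = Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue
        event = json.loads(line)
        authority = event.get("request", {}).get("authority", "unknown")
        status = event.get("response", {}).get("status", 0)
        totals[authority] += 1
        if 400 <= status < 600:
            errors[authority] += 1
    return {a: (errors[a], totals[a]) for a in totals}

sample = [
    '{"request": {"authority": "localhost:5001"}, "response": {"status": 200}}',
    '{"request": {"authority": "localhost:5001"}, "response": {"status": 404}}',
    '{"request": {"authority": "localhost:5002"}, "response": {"status": 200}}',
]
```

Here `error_rates(sample)` shows user-service (port 5001) erroring on half its requests while order-service (port 5002) is clean — enough to alert on.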


✅ Latency Tracking

Calculate latency percentiles from captured data:

docker logs qtap-mesh-lite 2>&1 | \
  grep '^{' | \
  jq -r 'select(has("duration_ms")) | .duration_ms' | \
  sort -n

The grep '^{' filter keeps only the JSON capture lines so jq doesn't choke on INFO prefixes.

Analyze:

  • p50 (median): ~2ms for internal calls

  • p95: ~10ms

  • p99: ~2000ms (external API calls)

Insight: Most internal calls are fast (<10ms), but external APIs add significant latency.
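
The percentile step can be scripted too, so the sort -n output doesn't need manual eyeballing. A nearest-rank sketch (the helper names are introduced here):

```python
import json

def percentile(values, pct):
    """Nearest-rank percentile (pct in 0-100) of a list of numbers."""
    ordered = sorted(values)
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

def latency_summary(lines):
    """p50/p95/p99 of duration_ms across qtap JSON capture lines."""
    durations = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # same filter as the grep '^{' above
        event = json.loads(line)
        if "duration_ms" in event:
            durations.append(event["duration_ms"])
    return {p: percentile(durations, p) for p in (50, 95, 99)} if durations else {}
```

Feed it the raw `docker logs` lines and it produces the p50/p95/p99 figures directly.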


Part 7: Comparison - Service Mesh vs Qtap

| Feature | Service Mesh (Istio) | Qtap (Service Mesh Lite) |
|---|---|---|
| **Observability** | | |
| Service discovery | ✅ Yes | ✅ Yes |
| Request tracing | ✅ Yes | ✅ Yes (both ingress/egress) |
| Error monitoring | ✅ Yes | ✅ Yes (with full capture) |
| Latency metrics | ✅ Yes | ✅ Yes (per request) |
| Traffic volume | ✅ Yes | ✅ Yes |
| TLS inspection | ✅ Yes (via mTLS) | ✅ Yes (pre-encryption) |
| **Traffic Control** | | |
| Load balancing | ✅ Yes | ❌ No |
| Circuit breaking | ✅ Yes | ❌ No |
| Retries/timeouts | ✅ Yes | ❌ No |
| Traffic splitting (canary) | ✅ Yes | ❌ No |
| **Security** | | |
| mTLS between services | ✅ Yes | ❌ No |
| Authorization policies | ✅ Yes | ❌ No |
| **Operational** | | |
| Deployment complexity | ⚠️ High | ✅ Low (DaemonSet) |
| Resource overhead | ⚠️ High (sidecars on every pod) | ✅ Minimal (one per node) |
| Latency impact | ⚠️ Yes (adds ~2-5ms per hop) | ✅ Zero (out-of-band) |
| Code changes required | ✅ None | ✅ None |
| Control plane | ⚠️ Required (istiod, pilot) | ✅ None |
| Configuration complexity | ⚠️ High (CRDs, policies) | ✅ Low (one YAML) |


When to Use Which

Use Qtap (Service Mesh Lite) When:

✅ You need observability only

  • Understand service dependencies

  • Track errors and latency

  • Debug microservice issues

✅ You want zero latency overhead

  • Performance-critical applications

  • High-throughput services

  • Latency-sensitive workloads

✅ You want operational simplicity

  • Small team without mesh expertise

  • Don't want to manage control plane

  • Avoid sidecar complexity

✅ You have resource constraints

  • Limited cluster resources

  • Cost-sensitive deployments

  • Want to minimize overhead


Use a Service Mesh When:

⚠️ You need traffic control

  • Canary deployments and traffic splitting

  • Circuit breaking for resilience

  • Retry and timeout policies

  • Advanced load balancing

⚠️ You require mTLS

  • Zero-trust network requirements

  • Compliance mandates encryption

  • Need service-to-service authentication

⚠️ You need authorization

  • Fine-grained access control between services

  • Policy-based routing

⚠️ You have dedicated platform team

  • Team can manage mesh complexity

  • Already invested in service mesh skills

  • Have operational capacity


Part 8: Production Considerations

Scaling to Many Services

This demo uses 3 services. In production with 50+ services:

Qtap scales linearly:

  • One qtap per node (DaemonSet in Kubernetes)

  • Resource usage independent of service count

  • No per-service configuration needed

Service Mesh complexity grows:

  • One sidecar per pod (100 pods = 100 sidecars)

  • Control plane resource requirements increase

  • Configuration complexity multiplies


Integration with Observability Tools

Qtap outputs JSON. Pipe to your observability stack:

Send to S3 for long-term storage:

services:
  object_stores:
    - type: s3
      bucket: my-mesh-observability
      region: us-east-1

Send events to ClickHouse/Elasticsearch:

services:
  event_stores:
    - type: axiom
      api_token: $AXIOM_TOKEN
      dataset: mesh-traffic

Build dashboards:

  • Parse JSON logs to extract metrics

  • Visualize service dependencies

  • Alert on error rate increases


Kubernetes Deployment

For production Kubernetes, deploy qtap as a DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: qtap
  namespace: qpoint
spec:
  selector:
    matchLabels:
      app: qtap
  template:
    metadata:
      labels:
        app: qtap
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: qtap
        image: us-docker.pkg.dev/qpoint-edge/public/qtap:v0
        securityContext:
          privileged: true
        volumeMounts:
        - name: sys
          mountPath: /sys
        - name: docker-sock
          mountPath: /var/run/docker.sock
        - name: config
          mountPath: /app/config
        env:
        - name: TINI_SUBREAPER
          value: "1"
        args:
        - --log-level=info
        - --log-encoding=console
        - --config=/app/config/qtap.yaml
      volumes:
      - name: sys
        hostPath:
          path: /sys
      - name: docker-sock
        hostPath:
          path: /var/run/docker.sock
      - name: config
        configMap:
          name: qtap-config

Use Pod Labels for Service Attribution

In Kubernetes, use pod labels to identify services:

rulekit:
  macros:
    - name: is_payment_service
      expr: src.pod.labels.app == "payment-service"
    - name: is_auth_service
      expr: src.pod.labels.app == "auth-service"

stacks:
  mesh_observability:
    plugins:
      - type: http_capture
        config:
          level: summary
          rules:
            - name: "Payment service traffic"
              expr: is_payment_service()
              level: details

            - name: "Auth service traffic"
              expr: is_auth_service()
              level: details

Part 9: Cleanup

docker compose -f /tmp/docker-compose-mesh-lite.yaml down

Key Takeaways

  1. Service mesh observability without service mesh complexity: Qtap provides service discovery, request tracing, error monitoring, and latency tracking without sidecars or control planes.

  2. Zero latency impact: Out-of-band observation means no proxies in the request path. Your services run at full speed.

  3. BOTH sides of every call: direction: all captures egress (leaving Service A) and ingress (arriving at Service B), giving complete visibility.

  4. TLS inspection without certificates: Qtap sees inside HTTPS at the kernel level before encryption happens.

  5. Operational simplicity: One DaemonSet per node vs sidecars on every pod. Simple YAML config vs complex CRDs and policies.

  6. Know when you need more: If you need traffic control (canary, circuit breaking) or mTLS, use a full service mesh. If you just need visibility, Qtap is simpler.

