Service Mesh Lite: Observability Without Sidecars
Service meshes like Istio and Linkerd provide powerful observability into microservice communication, but they come with significant operational overhead: complex deployments, sidecar proxies on every pod, added latency, and resource consumption.
What if you could get service mesh observability without the complexity?
This guide demonstrates how Qtap provides core service mesh observability features—service discovery, request tracing, error monitoring, and latency tracking—without sidecars, without control planes, and with zero latency impact.
The Service Mesh Complexity Problem
Service meshes solve real problems but introduce significant complexity:
Deployment Complexity:
Control plane components (istiod, pilot, mixer)
Sidecar injection on every pod
Webhook configuration and RBAC setup
Certificate management for mTLS
Resource Overhead:
Each sidecar consumes CPU/memory (multiply by number of pods)
Control plane resource requirements
Significant cluster resource increase
Latency Impact:
Every request goes through proxy (adds milliseconds)
Data plane overhead on critical path
Performance degradation under load
Operational Burden:
Complex troubleshooting (is it my app or the mesh?)
Version upgrades affect entire cluster
Learning curve for operators
Most teams want observability, not traffic control. If you don't need advanced features like circuit breaking, traffic splitting, or mTLS enforcement, you're paying a high price for the small slice of the mesh you actually use.
What Qtap Provides Instead
Qtap offers a "Service Mesh Lite" approach focused purely on observability:
✅ What You Get:
Service Discovery: Automatic mapping of service dependencies
Request Tracing: Follow requests across services
Error Monitoring: Track failures and error rates per service
Latency Metrics: Measure p50, p95, p99 response times
Traffic Volume: Connections per second, bytes transferred
Zero Instrumentation: No code changes or configuration needed
✅ How It's Different:
Out-of-band observation: Qtap watches traffic passively, doesn't proxy it
Zero latency: No proxies in the request path
No sidecars: One DaemonSet per node (not per pod)
No control plane: Just install qtap and start observing
❌ What Qtap DOESN'T Do:
Traffic routing or load balancing
Circuit breaking or retries
Traffic splitting (canary deployments)
mTLS between services
Rate limiting or quotas
Qtap is observation-only. If you need traffic control, use a service mesh. If you just need visibility, Qtap is simpler.
Demo Architecture
We'll demonstrate service mesh observability with a multi-service application:
┌─────────────┐
┌─────────→│ Frontend │
│ │ (Client) │
│ └──────┬──────┘
│ │
│ ↓
[External] ┌─────────────┐
[Request] │ API │ (Port 5000)
│ Gateway │
└──────┬──────┘
│
┌──────┴──────┐
↓ ↓
┌─────────────┐ ┌─────────────┐
│ User │ │ Order │
│ Service │ │ Service │
│ (Port 5001) │ │ (Port 5002) │
└─────────────┘ └──────┬──────┘
│
↓
┌─────────────┐
│ httpbin.org│ (External API)
└─────────────┘
Traffic Flows:
Client → API Gateway → User Service: Simple service-to-service call
Client → API Gateway → Order Service → httpbin.org: Service calling external API
Client → API Gateway → User Service + Order Service: Fan-out to multiple services
Qtap captures ALL of this:
Service A calling Service B (both egress from A and ingress to B)
External API calls from services
Error responses and status codes
Request/response timing
Prerequisites
Linux host with Docker
jq for JSON parsing
Basic understanding of microservices
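Before deploying anything, it can save time to confirm the required tools are present. This is a convenience sketch, not part of the demo itself:

```shell
# Verify the tools this guide relies on are installed.
for cmd in docker jq curl; do
  command -v "$cmd" >/dev/null 2>&1 || echo "missing: $cmd"
done
```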
Part 1: Deploy the Multi-Service Application
We'll create three microservices that communicate with each other.
API Gateway Service
#!/usr/bin/env python3
"""
API Gateway - Routes requests to backend services
"""
from flask import Flask, jsonify
import requests
import os
app = Flask(__name__)
USER_SERVICE_URL = os.environ.get('USER_SERVICE_URL', 'http://localhost:5001')
ORDER_SERVICE_URL = os.environ.get('ORDER_SERVICE_URL', 'http://localhost:5002')
@app.route('/health')
def health():
return jsonify({'status': 'healthy', 'service': 'api-gateway'})
@app.route('/api/users/<user_id>')
def get_user(user_id):
"""Proxy request to user service"""
response = requests.get(f'{USER_SERVICE_URL}/users/{user_id}', timeout=5)
return jsonify(response.json()), response.status_code
@app.route('/api/orders/<user_id>')
def get_orders(user_id):
"""Proxy request to order service"""
response = requests.get(f'{ORDER_SERVICE_URL}/orders/{user_id}', timeout=5)
return jsonify(response.json()), response.status_code
@app.route('/api/user-with-orders/<user_id>')
def get_user_with_orders(user_id):
"""Fan-out request to both services"""
user_response = requests.get(f'{USER_SERVICE_URL}/users/{user_id}', timeout=5)
orders_response = requests.get(f'{ORDER_SERVICE_URL}/orders/{user_id}', timeout=5)
return jsonify({
'user': user_response.json(),
'orders': orders_response.json()
}), 200
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000)
User Service
#!/usr/bin/env python3
"""
User Service - Returns user data
"""
from flask import Flask, jsonify
app = Flask(__name__)
USERS = {
'1': {'id': '1', 'name': 'Alice', 'email': '[email protected]'},
'2': {'id': '2', 'name': 'Bob', 'email': '[email protected]'},
'3': {'id': '3', 'name': 'Charlie', 'email': '[email protected]'},
}
@app.route('/health')
def health():
return jsonify({'status': 'healthy', 'service': 'user-service'})
@app.route('/users/<user_id>')
def get_user(user_id):
if user_id == '999': # Simulate error
return jsonify({'error': 'User not found', 'service': 'user-service'}), 404
user = USERS.get(user_id)
if user:
return jsonify(user), 200
else:
return jsonify({'error': 'User not found', 'service': 'user-service'}), 404
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5001)
Order Service
#!/usr/bin/env python3
"""
Order Service - Returns orders and calls external payment API
"""
from flask import Flask, jsonify
import requests
app = Flask(__name__)
ORDERS = {
'1': [
{'id': 'order-101', 'user_id': '1', 'total': 99.99, 'status': 'completed'},
{'id': 'order-102', 'user_id': '1', 'total': 49.99, 'status': 'pending'},
],
'2': [
{'id': 'order-201', 'user_id': '2', 'total': 199.99, 'status': 'completed'},
],
'3': [],
}
@app.route('/health')
def health():
return jsonify({'status': 'healthy', 'service': 'order-service'})
@app.route('/orders/<user_id>')
def get_orders(user_id):
"""Get orders and verify with external payment API"""
try:
# Call external API (httpbin for demo)
payment_response = requests.get(
'https://httpbin.org/json',
timeout=5,
headers={'X-User-ID': user_id}
)
payment_status = 'payment-api-ok' if payment_response.status_code == 200 else 'payment-api-error'
except Exception as e:
payment_status = f'payment-api-error: {str(e)}'
orders = ORDERS.get(user_id, [])
return jsonify({
'user_id': user_id,
'orders': orders,
'payment_api_status': payment_status,
'service': 'order-service'
}), 200
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5002)
Part 2: Qtap Configuration for Service Mesh Observability
Create /tmp/qtap-mesh-lite.yaml:
version: 2
# Service Mesh Lite Configuration
# Captures all service-to-service traffic for observability
services:
event_stores:
- type: stdout
object_stores:
- type: stdout
rulekit:
macros:
- name: is_error
expr: http.res.status >= 400 && http.res.status < 600
- name: is_api_gateway
expr: src.container.name == "api-gateway"
- name: is_user_service
expr: src.container.name == "user-service"
- name: is_order_service
expr: src.container.name == "order-service"
stacks:
mesh_observability:
plugins:
- type: http_capture
config:
level: summary # (none|summary|details|full) - Lightweight by default
format: json # (json|text) - JSON for easy parsing
rules:
# Capture full details for errors to debug failures
- name: "Service errors"
expr: is_error()
level: full
# Capture details for all API gateway traffic
- name: "API Gateway traffic"
expr: is_api_gateway()
level: details
tap:
direction: all # (egress|egress-external|egress-internal|ingress|all)
# CRITICAL: Use 'all' to see BOTH sides of service calls
ignore_loopback: false # (true|false) - MUST be false to see localhost traffic!
audit_include_dns: false # (true|false) - Skip DNS noise
http:
stack: mesh_observability
# Filter out qtap's own traffic
filters:
groups:
- qpoint
Key configuration points:
direction: all: CRITICAL for service mesh observability. Captures both:
Egress: requests leaving Service A
Ingress: requests arriving at Service B
This gives you BOTH sides of every service-to-service call.
ignore_loopback: false: Required to capture localhost traffic in Docker
level: summary by default: Lightweight capture (method, URL, status, timing)
level: full for errors: Captures the complete request/response for debugging
Rulekit macros: Identify traffic by container name for filtering
Part 3: Deploy with Docker Compose
Create /tmp/docker-compose-mesh-lite.yaml:
version: '3.9'
services:
# API Gateway - Entry point
api-gateway:
image: python:3.11-slim
container_name: api-gateway
working_dir: /app
network_mode: host # Required for qtap to see localhost traffic
command: >
bash -c "pip install -q flask requests &&
python api-gateway.py"
volumes:
- /tmp/api-gateway.py:/app/api-gateway.py
environment:
- USER_SERVICE_URL=http://localhost:5001
- ORDER_SERVICE_URL=http://localhost:5002
# User Service
user-service:
image: python:3.11-slim
container_name: user-service
working_dir: /app
network_mode: host
command: >
bash -c "pip install -q flask &&
python user-service.py"
volumes:
- /tmp/user-service.py:/app/user-service.py
# Order Service
order-service:
image: python:3.11-slim
container_name: order-service
working_dir: /app
network_mode: host
command: >
bash -c "pip install -q flask requests &&
python order-service.py"
volumes:
- /tmp/order-service.py:/app/order-service.py
# Qtap - Traffic capture
qtap:
image: us-docker.pkg.dev/qpoint-edge/public/qtap:v0
container_name: qtap-mesh-lite
privileged: true
user: "0:0"
cap_add:
- CAP_BPF
- CAP_SYS_ADMIN
pid: host
network_mode: host
volumes:
- /sys:/sys
- /var/run/docker.sock:/var/run/docker.sock
- /tmp/qtap-mesh-lite.yaml:/app/config/qtap.yaml
environment:
- TINI_SUBREAPER=1
ulimits:
memlock: -1
command:
- --log-level=info
- --log-encoding=console
- --config=/app/config/qtap.yaml
Start the stack:
docker compose -f /tmp/docker-compose-mesh-lite.yaml up -d
Wait for services to initialize:
sleep 10
Part 4: Generate Traffic and Observe
Test 1: Simple Service-to-Service Call
curl -s http://localhost:5000/api/users/1 | jq .
Output:
{
"email": "[email protected]",
"id": "1",
"name": "Alice"
}
What Qtap captures:
Client → API Gateway (egress from curl)
API Gateway → User Service (egress from gateway, ingress to user-service)
Complete request chain with timing
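To isolate just this hop in the qtap output, you can filter the JSON events by destination port. This is a sketch against the event shape shown in Part 5 (one compact JSON object per line, with a request.url field); adjust the field names if your qtap version emits a different format:

```shell
# Select capture events whose URL targets the User Service (port 5001).
docker logs qtap-mesh-lite 2>&1 \
  | grep '^{' \
  | jq -c 'select(.request.url | contains(":5001"))'
```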
Test 2: Service Calling External API
curl -s http://localhost:5000/api/orders/1 | jq .
Output:
{
"orders": [
{
"id": "order-101",
"status": "completed",
"total": 99.99,
"user_id": "1"
},
{
"id": "order-102",
"status": "pending",
"total": 49.99,
"user_id": "1"
}
],
"payment_api_status": "payment-api-ok",
"service": "order-service",
"user_id": "1"
}
What Qtap captures:
Client → API Gateway
API Gateway → Order Service
Order Service → httpbin.org (external API call)
TLS inspection: Qtap sees inside HTTPS to httpbin.org before encryption
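To list only the TLS-inspected calls, you can filter on the is_tls metadata flag that appears in the Part 5 example events (an assumption about the event shape; field names may vary by qtap version):

```shell
# Show only events captured from inside TLS sessions.
docker logs qtap-mesh-lite 2>&1 \
  | grep '^{' \
  | jq -c 'select(.metadata.is_tls == true)'
```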
Test 3: Fan-Out Request
curl -s http://localhost:5000/api/user-with-orders/2 | jq .
Output:
{
"orders": {
"orders": [
{
"id": "order-201",
"status": "completed",
"total": 199.99,
"user_id": "2"
}
],
"payment_api_status": "payment-api-ok",
"service": "order-service",
"user_id": "2"
},
"user": {
"email": "[email protected]",
"id": "2",
"name": "Bob"
}
}
What Qtap captures:
API Gateway making parallel calls to User Service AND Order Service
Both service responses with timing
Service dependency mapping (gateway depends on both services)
Test 4: Error Scenario
curl -s http://localhost:5000/api/users/999 | jq .
Output:
{
"error": "User not found",
"service": "user-service"
}
What Qtap captures:
404 response captured with level: full (includes headers and body)
Error tracking per service
Useful for identifying which service is failing
Part 5: Analyze Service Mesh Observability
View Qtap Logs
docker logs qtap-mesh-lite 2>&1 | grep '"method"' | jq .
Example 1: Service-to-Service Call (Egress Side)
{
"metadata": {
"process_exe": "/usr/local/bin/python3.11",
"connection_id": "abc123"
},
"request": {
"method": "GET",
"url": "http://localhost:5001/users/1",
"authority": "localhost:5001",
"protocol": "http1",
"user_agent": "python-requests/2.32.5"
},
"response": {
"status": 200,
"content_type": "application/json"
},
"direction": "egress-external",
"duration_ms": 2
}
Interpretation:
Python process (API Gateway) made request
Destination: User Service (port 5001)
Direction: egress (leaving API Gateway)
Response time: 2ms
Example 2: Same Call (Ingress Side)
{
"metadata": {
"process_exe": "/usr/local/bin/python3.11",
"connection_id": "abc124"
},
"request": {
"method": "GET",
"url": "http://localhost:5001/users/1",
"authority": "localhost:5001",
"protocol": "http1"
},
"response": {
"status": 200
},
"direction": "ingress",
"duration_ms": 1
}
Interpretation:
Same request, different perspective
Direction: ingress (arriving at User Service)
Qtap captured BOTH sides of the call
This is the core of service mesh observability
Example 3: External API Call
{
"metadata": {
"process_exe": "/usr/local/bin/python3.11",
"endpoint_id": "httpbin.org",
"is_tls": true
},
"request": {
"method": "GET",
"url": "https://httpbin.org/json",
"authority": "httpbin.org",
"protocol": "http1"
},
"response": {
"status": 200,
"content_type": "application/json"
},
"direction": "egress-external",
"duration_ms": 1949
}
Interpretation:
Order Service called external API
Qtap saw inside HTTPS (before TLS encryption)
Much slower (1949ms) than internal calls (1-2ms)
Useful for identifying slow external dependencies
Part 6: Service Mesh Features Demonstrated
✅ Service Discovery
Qtap automatically discovered:
API Gateway depends on User Service (port 5001)
API Gateway depends on Order Service (port 5002)
Order Service depends on httpbin.org (external)
No configuration required. Just observe traffic and map dependencies.
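One way to derive this dependency map from the logs is to count egress calls per destination authority. This is a sketch against the event shape shown in Part 5 (direction and request.authority fields); the counts give a rough "who calls whom, how often" view:

```shell
# Dependency map: how often each destination is called (egress side only).
docker logs qtap-mesh-lite 2>&1 \
  | grep '^{' \
  | jq -r 'select((.direction // "") | startswith("egress")) | .request.authority' \
  | sort | uniq -c | sort -rn
```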
✅ Request Tracing
Follow a single request through the system:
1. Client → API Gateway [2ms]
GET /api/user-with-orders/2
2. API Gateway → User Service [1ms]
GET /users/2
Response: {"id": "2", "name": "Bob"}
3. API Gateway → Order Service [5270ms] ← Slow!
GET /orders/2
4. Order Service → httpbin.org [1949ms]
GET https://httpbin.org/json
Total: ~7.2 seconds
Insight: Order Service is slow because it waits for the external API. Consider caching or async processing.
✅ Error Monitoring
Errors captured with full details:
docker logs qtap-mesh-lite 2>&1 | grep '"status":404' | jq .
Output shows:
Which service returned 404 (User Service)
Request that caused it (GET /users/999)
Full response body with error message
Captured at level: full due to the rulekit rule
✅ Latency Tracking
Calculate latency percentiles from captured data:
docker logs qtap-mesh-lite 2>&1 | \
grep '^{' | \
jq -r 'select(has("duration_ms")) | .duration_ms' | \
sort -n
The grep '^{' filter keeps only the JSON capture lines so jq doesn't choke on INFO prefixes.
Analyze:
p50 (median): ~2ms for internal calls
p95: ~10ms
p99: ~2000ms (external API calls)
Insight: Most internal calls are fast (<10ms), but external APIs add significant latency.
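The sorted list above can be turned into percentiles directly with jq's slurp mode. This sketch uses a simple nearest-rank approximation (floor of length × percentile), which is close enough for eyeballing latency; it assumes the duration_ms field shown in the Part 5 events:

```shell
# Approximate p50/p95/p99 over all captured durations (nearest-rank).
docker logs qtap-mesh-lite 2>&1 \
  | grep '^{' \
  | jq -s '[.[] | select(has("duration_ms")) | .duration_ms] | sort
           | {p50: .[(length*50/100)|floor],
              p95: .[(length*95/100)|floor],
              p99: .[(length*99/100)|floor]}'
```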
Part 7: Comparison - Service Mesh vs Qtap
Observability

Feature | Service Mesh | Qtap
--- | --- | ---
Service discovery | ✅ Yes | ✅ Yes
Request tracing | ✅ Yes | ✅ Yes (both ingress/egress)
Error monitoring | ✅ Yes | ✅ Yes (with full capture)
Latency metrics | ✅ Yes | ✅ Yes (per request)
Traffic volume | ✅ Yes | ✅ Yes
TLS inspection | ✅ Yes (via mTLS) | ✅ Yes (pre-encryption)

Traffic Control

Feature | Service Mesh | Qtap
--- | --- | ---
Load balancing | ✅ Yes | ❌ No
Circuit breaking | ✅ Yes | ❌ No
Retries/timeouts | ✅ Yes | ❌ No
Traffic splitting (canary) | ✅ Yes | ❌ No

Security

Feature | Service Mesh | Qtap
--- | --- | ---
mTLS between services | ✅ Yes | ❌ No
Authorization policies | ✅ Yes | ❌ No

Operational

Feature | Service Mesh | Qtap
--- | --- | ---
Deployment complexity | ⚠️ High | ✅ Low (DaemonSet)
Resource overhead | ⚠️ High (sidecars on every pod) | ✅ Minimal (one per node)
Latency impact | ⚠️ Yes (adds ~2-5ms per hop) | ✅ Zero (out-of-band)
Code changes required | ✅ None | ✅ None
Control plane | ⚠️ Required (istiod, pilot) | ✅ None
Configuration complexity | ⚠️ High (CRDs, policies) | ✅ Low (one YAML)
When to Use Which
Use Qtap (Service Mesh Lite) When:
✅ You need observability only
Understand service dependencies
Track errors and latency
Debug microservice issues
✅ You want zero latency overhead
Performance-critical applications
High-throughput services
Latency-sensitive workloads
✅ You want operational simplicity
Small team without mesh expertise
Don't want to manage control plane
Avoid sidecar complexity
✅ You have resource constraints
Limited cluster resources
Cost-sensitive deployments
Want to minimize overhead
Use a Service Mesh When:
⚠️ You need traffic control
Canary deployments and traffic splitting
Circuit breaking for resilience
Retry and timeout policies
Advanced load balancing
⚠️ You require mTLS
Zero-trust network requirements
Compliance mandates encryption
Need service-to-service authentication
⚠️ You need authorization
Fine-grained access control between services
Policy-based routing
⚠️ You have dedicated platform team
Team can manage mesh complexity
Already invested in service mesh skills
Have operational capacity
Part 8: Production Considerations
Scaling to Many Services
This demo uses 3 services. In production with 50+ services:
Qtap scales linearly:
One qtap per node (DaemonSet in Kubernetes)
Resource usage independent of service count
No per-service configuration needed
Service Mesh complexity grows:
One sidecar per pod (100 pods = 100 sidecars)
Control plane resource requirements increase
Configuration complexity multiplies
Integration with Observability Tools
Qtap outputs JSON. Pipe to your observability stack:
Send to S3 for long-term storage:
services:
object_stores:
- type: s3
bucket: my-mesh-observability
region: us-east-1
Send events to an event store such as Axiom:
services:
event_stores:
- type: axiom
api_token: $AXIOM_TOKEN
dataset: mesh-traffic
Build dashboards:
Parse JSON logs to extract metrics
Visualize service dependencies
Alert on error rate increases
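As a starting point for an error-rate dashboard or alert, the JSON logs can be aggregated per destination directly with jq. A sketch against the event shape shown in Part 5 (request.authority and response.status fields assumed):

```shell
# Per-destination request totals and 4xx/5xx error counts.
docker logs qtap-mesh-lite 2>&1 \
  | grep '^{' \
  | jq -s 'group_by(.request.authority)
           | map({service: .[0].request.authority,
                  total: length,
                  errors: ([.[] | select(.response.status >= 400)] | length)})'
```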
Kubernetes Deployment
For production Kubernetes, deploy qtap as a DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: qtap
namespace: qpoint
spec:
selector:
matchLabels:
app: qtap
template:
metadata:
labels:
app: qtap
spec:
hostNetwork: true
hostPID: true
containers:
- name: qtap
image: us-docker.pkg.dev/qpoint-edge/public/qtap:v0
securityContext:
privileged: true
volumeMounts:
- name: sys
mountPath: /sys
- name: docker-sock
mountPath: /var/run/docker.sock
- name: config
mountPath: /app/config
env:
- name: TINI_SUBREAPER
value: "1"
args:
- --log-level=info
- --log-encoding=console
- --config=/app/config/qtap.yaml
volumes:
- name: sys
hostPath:
path: /sys
- name: docker-sock
hostPath:
path: /var/run/docker.sock
- name: config
configMap:
name: qtap-config
Use Pod Labels for Service Attribution
In Kubernetes, use pod labels to identify services:
rulekit:
macros:
- name: is_payment_service
expr: src.pod.labels.app == "payment-service"
- name: is_auth_service
expr: src.pod.labels.app == "auth-service"
stacks:
mesh_observability:
plugins:
- type: http_capture
config:
level: summary
rules:
- name: "Payment service traffic"
expr: is_payment_service()
level: details
- name: "Auth service traffic"
expr: is_auth_service()
level: details
Part 9: Cleanup
docker compose -f /tmp/docker-compose-mesh-lite.yaml down
Key Takeaways
Service mesh observability without service mesh complexity: Qtap provides service discovery, request tracing, error monitoring, and latency tracking without sidecars or control planes.
Zero latency impact: Out-of-band observation means no proxies in the request path. Your services run at full speed.
BOTH sides of every call: direction: all captures egress (leaving Service A) and ingress (arriving at Service B), giving complete visibility.
TLS inspection without certificates: Qtap sees inside HTTPS at the kernel level, before encryption happens.
Operational simplicity: One DaemonSet per node vs sidecars on every pod. Simple YAML config vs complex CRDs and policies.
Know when you need more: If you need traffic control (canary, circuit breaking) or mTLS, use a full service mesh. If you just need visibility, Qtap is simpler.
Related Documentation
HTTPS Header Capture Without Proxies - Deep dive on TLS inspection
Traffic Capture Settings - Understanding direction options
Traffic Processing with Plugins - Rulekit and capture levels
CI Security Validation - Another advanced use case