AI Prompt: Forma3D.Connect — Traefik Mesh Observability Integration¶
Purpose: Integrate Traefik Mesh's Prometheus metrics into the existing OTel Collector → ClickHouse → Grafana pipeline, providing a unified view of application logs and mesh-level traffic patterns across all environments
Estimated Effort: 4–6 hours
Prerequisites: Scaling Preparations prompt completed (Traefik Mesh installed via Helm in local dev, staging, and production); ClickHouse + Grafana logging stack operational; OTel Collector running
Output: Prometheus scraping of Traefik Mesh metrics, ClickHouse metrics table, Grafana dashboards for service-to-service traffic, latency, error rates, and mTLS status
Status: 🚧 TODO
🎯 Mission¶
Extend the existing observability stack to include network-level metrics from Traefik Mesh, complementing the application-level logs already flowing through the OTel Collector → ClickHouse → Grafana pipeline.
What this delivers:
- Prometheus receiver in OTel Collector — scrapes Traefik Mesh proxy metrics from each node
- ClickHouse metrics table — stores time-series mesh metrics with configurable retention
- Grafana "Service Mesh" dashboard — visualizes service-to-service traffic patterns, request rates, latency percentiles, error rates, and mTLS status
- Alerting rules — high inter-service error rates, latency spikes, mesh proxy unhealthy
- Consistent across environments — same pipeline works in local dev (Rancher Desktop + Tilt), staging, and production (DOKS)
Why this matters:
The existing observability stack answers "what happened inside a service?" (application logs). Traefik Mesh metrics answer "what's happening between services?" — a blind spot today. Combined, they provide full-stack observability:
┌─────────────────────────────────────────────────────────────────┐
│ Grafana Dashboards │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ ┌─────────────┐ │
│ │ Application Logs │ │ Service Mesh │ │ System │ │
│ │ (existing) │ │ Traffic (NEW) │ │ Health │ │
│ └────────┬─────────┘ └────────┬─────────┘ └──────┬──────┘ │
│ │ │ │ │
│ └──────────┬───────────┴─────────────────────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ ClickHouse │ │
│ │ otel DB │ │
│ └──────┬──────┘ │
│ │ │
│ ┌──────────┴──────────┐ │
│ │ OTel Collector │ │
│ │ │ │
│ │ receivers: │ │
│ │ - otlp (logs) │ ← existing │
│ │ - prometheus (NEW)│ ← Traefik Mesh metrics │
│ │ │ │
│ │ exporters: │ │
│ │ - clickhouse │ │
│ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
📋 Step-by-Step Implementation¶
Phase 1: Understand Traefik Mesh Metrics (30 min)¶
Priority: P0 | Impact: Foundation | Dependencies: Traefik Mesh running
1. Identify the metrics endpoint¶
Traefik Mesh exposes Prometheus metrics from its per-node proxy pods. Identify the metrics port and path:
# Find Traefik Mesh proxy pods
kubectl get pods -n forma3d-dev -l app=traefik-mesh-proxy
# Check what ports the proxy exposes
kubectl get pods -n forma3d-dev -l app=traefik-mesh-proxy -o jsonpath='{.items[0].spec.containers[0].ports[*]}'
# Verify metrics are available
kubectl port-forward -n forma3d-dev svc/traefik-mesh-proxy-api 8080:8080
curl http://localhost:8080/metrics
2. Document available metrics¶
Key Traefik Mesh metrics to capture:
| Metric | Type | Description |
|---|---|---|
| `traefik_mesh_service_requests_total` | Counter | Total requests per source→destination service pair |
| `traefik_mesh_service_request_duration_seconds` | Histogram | Request latency distribution per service pair |
| `traefik_mesh_service_open_connections` | Gauge | Active connections between services |
| `traefik_mesh_tls_certs_not_after` | Gauge | mTLS certificate expiration timestamp |
| `traefik_mesh_config_reloads_total` | Counter | Mesh configuration reload count |
| `traefik_mesh_entrypoint_requests_total` | Counter | Requests per entrypoint with status code labels |
Verify exact metric names by inspecting the /metrics endpoint — names may vary by Traefik Mesh version.
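To make that verification scriptable, here is a minimal sketch that extracts the distinct `traefik_mesh_*` base metric names from a scraped `/metrics` payload. The sample payload below is hypothetical — substitute the real output of the `curl` command from step 1:

```python
import re

def mesh_metric_names(exposition_text):
    """Extract distinct traefik_mesh_* base metric names from Prometheus
    text exposition format, skipping HELP/TYPE comment lines."""
    names = set()
    for line in exposition_text.splitlines():
        if not line or line.startswith("#"):
            continue
        # the metric name is everything before the first '{' or whitespace
        name = re.split(r"[{\s]", line, maxsplit=1)[0]
        # collapse histogram series (_bucket/_sum/_count) to the base name
        for suffix in ("_bucket", "_sum", "_count"):
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                break
        if name.startswith("traefik_mesh_"):
            names.add(name)
    return names

# Hypothetical payload — replace with the real /metrics response
sample = """\
# TYPE traefik_mesh_service_requests_total counter
traefik_mesh_service_requests_total{source="api",destination="db"} 42
traefik_mesh_service_request_duration_seconds_bucket{le="0.1"} 5
traefik_mesh_service_request_duration_seconds_count 7
traefik_mesh_service_open_connections 4
go_goroutines 12
"""
names = mesh_metric_names(sample)
```

Diffing this set against the table above gives a quick check that the dashboard queries in Phase 4 will target metrics that actually exist in your Traefik Mesh version.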
Phase 2: Add Prometheus Receiver to OTel Collector (1 hour)¶
Priority: P1 | Impact: High | Dependencies: Phase 1
1. Update OTel Collector configuration¶
Extend deployment/staging/otel-collector-config.yaml to add a Prometheus receiver alongside the existing OTLP receiver:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
prometheus:
config:
scrape_configs:
- job_name: 'traefik-mesh'
scrape_interval: 15s
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- forma3d-dev # local dev
- forma3d # staging/production
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
regex: traefik-mesh-proxy
action: keep
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
processors:
batch:
timeout: 5s
send_batch_size: 10000
send_batch_max_size: 20000
resource:
attributes:
- key: deployment.environment
value: "${ENVIRONMENT}"
action: upsert
- key: host.name
value: "${HOSTNAME}"
action: upsert
filter/drop-debug:
logs:
log_record:
- 'severity_number < 9'
exporters:
clickhouse:
endpoint: tcp://clickhouse:9000
database: otel
logs_table_name: otel_logs
timeout: 10s
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
create_schema: true
service:
pipelines:
logs:
receivers: [otlp]
processors: [resource, filter/drop-debug, batch]
exporters: [clickhouse]
metrics:
receivers: [prometheus]
processors: [resource, batch]
exporters: [clickhouse]
telemetry:
logs:
level: warn
metrics:
address: 0.0.0.0:8888
extensions: [health_check]
The key addition is the metrics pipeline: prometheus receiver → resource processor → batch processor → clickhouse exporter.
2. For Docker Compose (staging/production without K8s SD)¶
On the current Docker Compose staging environment (where Traefik Mesh is not yet installed), use static targets instead of Kubernetes service discovery. This will be the initial configuration until the DOKS migration:
prometheus:
config:
scrape_configs:
- job_name: 'traefik-mesh'
scrape_interval: 15s
static_configs:
- targets: ['traefik-mesh-proxy:8080']
Switch to kubernetes_sd_configs when migrating to DOKS.
3. Verify OTel Collector RBAC (Kubernetes environments)¶
The OTel Collector needs RBAC permissions to discover Traefik Mesh pods via the Kubernetes API. Add to its ServiceAccount:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: otel-collector-prometheus
rules:
- apiGroups: [""]
resources: [pods, nodes, endpoints]
verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: otel-collector-prometheus-binding
subjects:
- kind: ServiceAccount
name: otel-collector
namespace: forma3d-dev
roleRef:
kind: ClusterRole
name: otel-collector-prometheus
apiGroup: rbac.authorization.k8s.io
Phase 3: ClickHouse Metrics Table (1 hour)¶
Priority: P1 | Impact: High | Dependencies: Phase 2
1. Verify auto-created metrics table¶
The clickhouse exporter in the OTel Collector (`create_schema: true`) auto-creates metrics tables in the `otel` database. Note that recent versions of the exporter create one table per metric type (e.g. `otel_metrics_gauge`, `otel_metrics_sum`, `otel_metrics_histogram`) rather than a single `otel_metrics` table — verify which layout your exporter version produces:
SHOW TABLES FROM otel;
-- Should include: otel_logs (existing) plus the new metrics table(s)
DESCRIBE TABLE otel.otel_metrics; -- or e.g. otel.otel_metrics_sum, depending on version
2. Add TTL retention for metrics¶
Metrics are time-series data and need retention policies. Apply TTLs similar to the log retention strategy (adjust the table name if your exporter version creates per-type tables):
ALTER TABLE otel.otel_metrics
MODIFY TTL toDateTime(TimeUnix) + INTERVAL 90 DAY;
90-day retention for metrics balances storage cost against the ability to observe trends over a full quarter.
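To sanity-check the retention choice, a back-of-the-envelope sketch of the raw sample volume the TTL keeps. The active-series count is a made-up assumption — check the real cardinality once metrics are flowing:

```python
def retained_samples(scrape_interval_s, retention_days, active_series):
    """Rough number of raw samples kept under the TTL, before
    ClickHouse's columnar compression is applied."""
    scrapes_per_series = retention_days * 24 * 3600 // scrape_interval_s
    return scrapes_per_series * active_series

# 15s scrape interval, 90-day TTL, and an assumed 500 active mesh series
# (the series count is hypothetical — measure it after Phase 3)
total = retained_samples(15, 90, 500)
```

At a 15s interval each series contributes 518,400 samples over 90 days, so total volume scales linearly with label cardinality — worth keeping an eye on if service-pair labels grow.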
3. Verify metrics are flowing¶
After restarting the OTel Collector with the new config:
SELECT
MetricName,
count() as samples,
min(TimeUnix) as first_seen,
max(TimeUnix) as last_seen
FROM otel.otel_metrics
WHERE MetricName LIKE 'traefik_mesh%'
GROUP BY MetricName
ORDER BY samples DESC;
Phase 4: Grafana "Service Mesh Traffic" Dashboard (1.5 hours)¶
Priority: P1 | Impact: High | Dependencies: Phase 3
1. Create the dashboard¶
Create a new Grafana dashboard "Service Mesh Traffic" with the following panels:
Row 1 — Overview
| Panel | Visualization | Query Description |
|---|---|---|
| Total Mesh Requests (24h) | Stat | Sum of traefik_mesh_service_requests_total over 24h |
| Mesh Error Rate | Gauge | Percentage of 5xx responses across all service pairs |
| Active Connections | Stat | Current sum of traefik_mesh_service_open_connections |
| mTLS Certificate Expiry | Stat (warning threshold) | Minimum traefik_mesh_tls_certs_not_after across all proxies |
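The certificate-expiry stat works because `traefik_mesh_tls_certs_not_after` is a Unix-timestamp gauge; the panel's warning threshold is plain date arithmetic, sketched here with illustrative timestamps:

```python
def days_until_expiry(not_after_unix, now_unix):
    """Convert the traefik_mesh_tls_certs_not_after gauge (Unix seconds)
    into days remaining, the unit the panel threshold is set in."""
    return (not_after_unix - now_unix) / 86400.0

# Illustrative values: a cert expiring exactly 14 days after "now"
remaining = days_until_expiry(1_700_000_000 + 14 * 86400, 1_700_000_000)
```

In the panel itself, the same conversion can be done in SQL by subtracting `now()` from the gauge value and dividing by 86400.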
Row 2 — Service-to-Service Traffic Map
| Panel | Visualization | Query Description |
|---|---|---|
| Request Rate by Service Pair | Time Series | Rate of traefik_mesh_service_requests_total grouped by source_service and destination_service |
| Error Rate by Service Pair | Time Series | Rate of 4xx/5xx traefik_mesh_entrypoint_requests_total grouped by service pair and status code |
| Latency p50/p95/p99 by Service Pair | Time Series | Histogram quantiles of traefik_mesh_service_request_duration_seconds |
Row 3 — Per-Service Deep Dive (with variable selector)
| Panel | Visualization | Query Description |
|---|---|---|
| Inbound Request Rate | Time Series | Requests TO the selected service |
| Outbound Request Rate | Time Series | Requests FROM the selected service |
| Latency Distribution | Histogram | Full latency histogram for the selected service pair |
| Status Code Breakdown | Pie Chart | 2xx/3xx/4xx/5xx distribution |
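The latency panels above rely on quantile estimation from Prometheus-style cumulative histogram buckets (`traefik_mesh_service_request_duration_seconds_bucket`). A minimal sketch of the linear-interpolation approach Prometheus uses — illustrative only, not the query Grafana will run:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets,
    sorted by ascending bound, interpolating within the target bucket."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if count == prev_count:
                return upper
            # linear interpolation of rank's position inside this bucket
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = upper, count
    return buckets[-1][0]

# Cumulative buckets: 60 requests ≤0.1s, 90 ≤0.5s, 100 ≤1.0s
p95 = histogram_quantile(0.95, [(0.1, 60), (0.5, 90), (1.0, 100)])
```

This is why the dashboard must group bucket series by their `le` label before computing quantiles; summing bucket counts across service pairs first would skew the estimate.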
2. Add template variables¶
- `$namespace` — dropdown: `forma3d-dev`, `forma3d` (filters all panels)
- `$service` — dropdown populated from distinct `destination_service` values in metrics
- `$interval` — auto interval for rate calculations
3. ClickHouse queries for Grafana¶
Example query for request rate by service pair:
SELECT
toStartOfInterval(TimeUnix, INTERVAL 1 MINUTE) AS time,
Attributes['source_service'] AS source,
Attributes['destination_service'] AS destination,
sum(Value) AS requests
FROM otel.otel_metrics
WHERE MetricName = 'traefik_mesh_service_requests_total'
AND TimeUnix >= $__fromTime
AND TimeUnix <= $__toTime
GROUP BY time, source, destination
ORDER BY time
Adapt column names to match the actual ClickHouse OTel metrics schema after verifying in Phase 3.
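One caveat with the query above: `traefik_mesh_service_requests_total` is a cumulative counter, so summing raw `Value`s per interval overstates traffic. A per-interval rate requires differencing consecutive samples, with a guard for counter resets on proxy restart. An illustrative sketch of that delta logic:

```python
def per_interval_increases(samples):
    """Convert cumulative counter samples [(ts, value), ...] (sorted by ts)
    into per-interval increases, treating any drop as a counter reset."""
    increases = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        delta = cur - prev
        if delta < 0:      # counter reset (proxy restarted): count from zero
            delta = cur
        increases.append((ts, delta))
    return increases

# Cumulative samples every 60s; the drop at t=180 is a proxy restart
deltas = per_interval_increases([(60, 100), (120, 150), (180, 10), (240, 40)])
```

In ClickHouse the same logic can be expressed with window functions (something like `Value - lagInFrame(Value) OVER (...)` per series) instead of a bare `sum(Value)`.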
Phase 5: Alerting Rules (30 min)¶
Priority: P2 | Impact: Medium | Dependencies: Phase 4
1. Add mesh-specific alert rules in Grafana¶
| Alert | Condition | Severity | Action |
|---|---|---|---|
| High inter-service error rate | >5% 5xx responses for any service pair over 5 min | Warning | Notification |
| Latency spike | p99 latency >2s for any service pair over 5 min | Warning | Notification |
| Mesh proxy unhealthy | Any `traefik-mesh-proxy` pod not ready for >2 min | Critical | Notification |
| mTLS cert expiring soon | Certificate expiry <7 days | Warning | Notification |
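The first alert's condition reduces to a simple computation over windowed status-code counts. A sketch of the rule (names are illustrative; the real rule is a Grafana alert query over ClickHouse):

```python
def error_rate_breach(window_counts, threshold=0.05):
    """Given [(status_code, request_count), ...] for one service pair over
    the 5-min evaluation window, return (rate, breached) for the 5xx rule."""
    total = sum(count for _, count in window_counts)
    if total == 0:
        return 0.0, False   # no traffic in window: nothing to alert on
    errors = sum(count for code, count in window_counts if 500 <= code <= 599)
    rate = errors / total
    return rate, rate > threshold

# 6 of 100 requests were 5xx over the window -> 6% exceeds the 5% threshold
rate, breached = error_rate_breach([(200, 90), (404, 4), (500, 4), (503, 2)])
```

Note the zero-traffic guard: low-volume service pairs are a common source of flapping percentage alerts, so the production rule may also want a minimum-request-count condition.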
2. Configure alert notifications¶
Use the same notification channel as existing log alerts (configured in the ClickHouse + Grafana prompt).
Phase 6: Local Dev Integration (30 min)¶
Priority: P2 | Impact: Medium | Dependencies: Phase 2
1. Update Tiltfile for metrics pipeline¶
Add the OTel Collector Prometheus scraping to the local dev setup. Since Tilt runs in Rancher Desktop's K3s cluster, Kubernetes service discovery works natively.
If the OTel Collector is not part of the local dev Tiltfile (it's currently in the "excluded services" list), add a lightweight local-dev variant:
# --- OTel Collector (metrics only, for mesh observability) ---
k8s_yaml('k8s/dev/otel-collector.yaml')
k8s_resource('otel-collector',
labels=['observability'])
This is optional for local dev — developers who don't need mesh metrics can skip it.
2. Port-forward Grafana for local dev (optional)¶
If a developer wants to view mesh dashboards locally, they can add Grafana to their Tilt setup or run it standalone:
kubectl port-forward -n forma3d-dev svc/grafana 3001:3000
📊 Data Flow Summary¶
Traefik Mesh Proxy Pods
(per-node, expose /metrics)
│
│ Prometheus scrape (15s interval)
▼
┌─────────────────┐
│ OTel Collector │
│ │
│ receivers: │
│ - otlp (logs) │ ← NestJS services (existing)
│ - prometheus │ ← Traefik Mesh metrics (NEW)
│ │
│ exporters: │
│ - clickhouse │
└────────┬─────────┘
│
▼
┌─────────────────┐
│ ClickHouse │
│ │
│ otel_logs │ ← existing (TTL: 7–180 days)
│ otel_metrics │ ← NEW (TTL: 90 days)
└────────┬─────────┘
│
▼
┌─────────────────┐
│ Grafana │
│ │
│ 📊 App Logs │ ← existing dashboards
│ 📊 Mesh Traffic │ ← NEW dashboard
│ 🔔 Alerts │ ← existing + NEW mesh alerts
└─────────────────┘
✅ Validation Checklist¶
Metrics Pipeline¶
- OTel Collector config includes `prometheus` receiver targeting Traefik Mesh proxy pods
- `metrics` pipeline defined in OTel Collector service config (receivers → processors → exporters)
- OTel Collector has RBAC to discover pods via Kubernetes API (in K8s environments)
- Metrics are flowing: `SELECT count() FROM otel.otel_metrics WHERE MetricName LIKE 'traefik_mesh%'` returns data
- TTL retention set on `otel_metrics` table (90 days)
Grafana Dashboard¶
- "Service Mesh Traffic" dashboard created with overview, service-pair, and deep-dive rows
- Template variables work ($namespace, $service, $interval)
- Request rate, error rate, and latency panels show real data
- mTLS certificate expiry panel shows certificate status
Alerting¶
- High inter-service error rate alert configured and tested
- Latency spike alert configured and tested
- Mesh proxy unhealthy alert configured and tested
- mTLS cert expiry alert configured and tested
- Alerts use the same notification channel as existing log alerts
Cross-Environment¶
- Metrics pipeline works in local dev (Rancher Desktop + Tilt) with Kubernetes SD
- Metrics pipeline config is compatible with staging Docker Compose (static targets)
- Same Grafana dashboard works across environments (only namespace filter changes)
🚫 Constraints and Rules¶
MUST DO¶
- Use the existing OTel Collector — add a receiver, don't deploy a separate metrics collector
- Store metrics in ClickHouse (same database as logs) — don't introduce Prometheus server or Thanos
- Use the existing Grafana instance — add dashboards, don't deploy a separate Grafana
- Set TTL retention on the metrics table
- Verify exact metric names from the running Traefik Mesh `/metrics` endpoint before building dashboards
MUST NOT¶
- Deploy Prometheus server, VictoriaMetrics, or any separate TSDB — ClickHouse handles both logs and metrics
- Modify the existing log pipeline — the `logs` pipeline in OTel Collector must remain unchanged
- Break existing Grafana dashboards or alert rules
- Add metrics collection for anything other than Traefik Mesh in this prompt (application-level Prometheus metrics are a separate concern)
- Use `any`, `ts-ignore`, or `eslint-disable`
SHOULD DO (Nice to Have)¶
- Add a Grafana "Service Map" visualization showing the request flow between services as a directed graph
- Include a "Top Slow Routes" panel showing the slowest service-to-service paths
- Create a combined "Correlate" panel that links a mesh latency spike to application logs from the same time window
- Document how to add Prometheus metrics from other sources (e.g., Redis, PostgreSQL exporters) to the same pipeline in the future
🔄 Rollback Plan¶
All changes are additive and non-destructive:
- OTel Collector config: Removing the `prometheus` receiver and `metrics` pipeline reverts to the original logs-only configuration
- ClickHouse metrics table: Can be dropped (`DROP TABLE otel.otel_metrics`) without affecting the `otel_logs` table
- Grafana dashboards: Deleting the "Service Mesh Traffic" dashboard has no effect on existing dashboards
- Alerting rules: Mesh-specific alerts can be disabled or removed independently
- RBAC: The additional ClusterRole/ClusterRoleBinding can be deleted without affecting other Kubernetes resources
📚 Key References¶
- Traefik Mesh metrics: https://doc.traefik.io/traefik-mesh/ — proxy metrics documentation
- OTel Collector Prometheus receiver: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver
- OTel Collector ClickHouse exporter (metrics): https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/clickhouseexporter
- Grafana ClickHouse plugin: https://grafana.com/grafana/plugins/grafana-clickhouse-datasource/
- Existing logging prompt: `done/prompt-clickhouse-grafana-logging.md`
- Scaling preparations prompt: `todo/prompt-scaling-preparations.md` (Traefik Mesh installation)
END OF PROMPT
This prompt extends the existing OTel Collector → ClickHouse → Grafana observability stack to include Traefik Mesh network-level metrics. It adds a Prometheus receiver to the OTel Collector, stores metrics in ClickHouse alongside logs, and creates Grafana dashboards for service-to-service traffic visualization. The result is full-stack observability: application logs show what happened inside a service, mesh metrics show what happened between services.