
AI Prompt: Forma3D.Connect — Traefik Mesh Observability Integration

Purpose: Integrate Traefik Mesh's Prometheus metrics into the existing OTel Collector → ClickHouse → Grafana pipeline, providing a unified view of application logs and mesh-level traffic patterns across all environments
Estimated Effort: 4–6 hours
Prerequisites: Scaling Preparations prompt completed (Traefik Mesh installed via Helm in local dev, staging, and production); ClickHouse + Grafana logging stack operational; OTel Collector running
Output: Prometheus scraping of Traefik Mesh metrics, ClickHouse metrics table, Grafana dashboards for service-to-service traffic, latency, error rates, and mTLS status
Status: 🚧 TODO


🎯 Mission

Extend the existing observability stack to include network-level metrics from Traefik Mesh, complementing the application-level logs already flowing through the OTel Collector → ClickHouse → Grafana pipeline.

What this delivers:

  1. Prometheus receiver in OTel Collector — scrapes Traefik Mesh proxy metrics from each node
  2. ClickHouse metrics table — stores time-series mesh metrics with configurable retention
  3. Grafana "Service Mesh" dashboard — visualizes service-to-service traffic patterns, request rates, latency percentiles, error rates, and mTLS status
  4. Alerting rules — high inter-service error rates, latency spikes, mesh proxy unhealthy
  5. Consistent across environments — same pipeline works in local dev (Rancher Desktop + Tilt), staging, and production (DOKS)

Why this matters:

The existing observability stack answers "what happened inside a service?" (application logs). Traefik Mesh metrics answer "what's happening between services?" — a blind spot today. Combined, they provide full-stack observability:

┌─────────────────────────────────────────────────────────────────┐
│                     Grafana Dashboards                          │
│                                                                 │
│  ┌──────────────────┐   ┌──────────────────┐   ┌─────────────┐  │
│  │ Application Logs │   │  Service Mesh    │   │   System    │  │
│  │ (existing)       │   │  Traffic (NEW)   │   │   Health    │  │
│  └────────┬─────────┘   └────────┬─────────┘   └──────┬──────┘  │
│           │                      │                    │         │
│           └──────────────────────┼────────────────────┘         │
│                                  │                              │
│                           ┌──────┴──────┐                       │
│                           │  ClickHouse │                       │
│                           │  otel DB    │                       │
│                           └──────┬──────┘                       │
│                                  │                              │
│                       ┌──────────┴──────────┐                   │
│                       │   OTel Collector    │                   │
│                       │                     │                   │
│                       │  receivers:         │                   │
│                       │   - otlp (logs)     │  ← existing       │
│                       │   - prometheus (NEW)│  ← Traefik Mesh   │
│                       │                     │                   │
│                       │  exporters:         │                   │
│                       │   - clickhouse      │                   │
│                       └─────────────────────┘                   │
└─────────────────────────────────────────────────────────────────┘

📋 Step-by-Step Implementation

Phase 1: Understand Traefik Mesh Metrics (30 min)

Priority: P0 | Impact: Foundation | Dependencies: Traefik Mesh running

1. Identify the metrics endpoint

Traefik Mesh exposes Prometheus metrics from its per-node proxy pods. Identify the metrics port and path:

# Find Traefik Mesh proxy pods (the label selector and namespace may differ
# by Helm chart version — check `kubectl get pods -A | grep -i mesh` if empty)
kubectl get pods -n forma3d-dev -l app=traefik-mesh-proxy

# Check what ports the proxy exposes
kubectl get pods -n forma3d-dev -l app=traefik-mesh-proxy -o jsonpath='{.items[0].spec.containers[0].ports[*]}'

# Verify metrics are available
kubectl port-forward -n forma3d-dev svc/traefik-mesh-proxy-api 8080:8080
curl http://localhost:8080/metrics

2. Document available metrics

Key Traefik Mesh metrics to capture:

| Metric | Type | Description |
|--------|------|-------------|
| traefik_mesh_service_requests_total | Counter | Total requests per source→destination service pair |
| traefik_mesh_service_request_duration_seconds | Histogram | Request latency distribution per service pair |
| traefik_mesh_service_open_connections | Gauge | Active connections between services |
| traefik_mesh_tls_certs_not_after | Gauge | mTLS certificate expiration timestamp |
| traefik_mesh_config_reloads_total | Counter | Mesh configuration reload count |
| traefik_mesh_entrypoint_requests_total | Counter | Requests per entrypoint with status code labels |

Verify exact metric names by inspecting the /metrics endpoint — names may vary by Traefik Mesh version.
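One quick way to build that checklist is to strip a /metrics dump down to unique metric names. The snippet below is a sketch that runs against an inline sample (values illustrative); in practice, pipe `curl -s http://localhost:8080/metrics` through the same filter.

```shell
# Inline sample of Prometheus exposition text (illustrative values only).
cat > /tmp/mesh-metrics-sample.txt <<'EOF'
# HELP traefik_mesh_service_requests_total Total requests
# TYPE traefik_mesh_service_requests_total counter
traefik_mesh_service_requests_total{source="api",destination="worker"} 42
traefik_mesh_service_request_duration_seconds_bucket{le="0.1"} 10
traefik_mesh_service_open_connections 3
EOF

# Drop comment lines, strip labels and values, and list unique metric names.
grep -v '^#' /tmp/mesh-metrics-sample.txt | sed 's/[{ ].*//' | sort -u
```

The sorted, de-duplicated output is exactly the list of names to verify against the table above.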


Phase 2: Add Prometheus Receiver to OTel Collector (1 hour)

Priority: P1 | Impact: High | Dependencies: Phase 1

1. Update OTel Collector configuration

Extend deployment/staging/otel-collector-config.yaml to add a Prometheus receiver alongside the existing OTLP receiver:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  prometheus:
    config:
      scrape_configs:
        - job_name: 'traefik-mesh'
          scrape_interval: 15s
          kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - forma3d-dev      # local dev
                  - forma3d          # staging/production
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              regex: traefik-mesh-proxy
              action: keep
            - source_labels: [__meta_kubernetes_pod_name]
              target_label: pod
            - source_labels: [__meta_kubernetes_namespace]
              target_label: namespace

processors:
  batch:
    timeout: 5s
    send_batch_size: 10000
    send_batch_max_size: 20000

  resource:
    attributes:
      - key: deployment.environment
        value: "${ENVIRONMENT}"
        action: upsert
      - key: host.name
        value: "${HOSTNAME}"
        action: upsert

  filter/drop-debug:
    logs:
      log_record:
        - 'severity_number < 9'

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000
    database: otel
    logs_table_name: otel_logs
    timeout: 10s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    create_schema: true

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [resource, filter/drop-debug, batch]
      exporters: [clickhouse]

    metrics:
      receivers: [prometheus]
      processors: [resource, batch]
      exporters: [clickhouse]

  telemetry:
    logs:
      level: warn
    metrics:
      address: 0.0.0.0:8888

  extensions: [health_check]

# Extensions referenced in service.extensions must also be defined at the
# top level, otherwise the collector refuses to start.
extensions:
  health_check:

The key addition is the metrics pipeline: prometheus receiver → resource processor → batch processor → clickhouse exporter.

2. For Docker Compose (staging/production without Kubernetes SD)

In the current Docker Compose staging environment there is no Kubernetes API for service discovery, so use static scrape targets instead. This is the interim configuration until the DOKS migration:

  prometheus:
    config:
      scrape_configs:
        - job_name: 'traefik-mesh'
          scrape_interval: 15s
          static_configs:
            - targets: ['traefik-mesh-proxy:8080']

Switch to kubernetes_sd_configs when migrating to DOKS.

3. Verify OTel Collector RBAC (Kubernetes environments)

The OTel Collector needs RBAC permissions to discover Traefik Mesh pods via the Kubernetes API. Add to its ServiceAccount:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-prometheus
rules:
  - apiGroups: [""]
    resources: [pods, nodes, endpoints]
    verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-prometheus-binding
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: forma3d-dev
roleRef:
  kind: ClusterRole
  name: otel-collector-prometheus
  apiGroup: rbac.authorization.k8s.io

Phase 3: ClickHouse Metrics Table (1 hour)

Priority: P1 | Impact: High | Dependencies: Phase 2

1. Verify auto-created metrics table

The clickhouse exporter in the OTel Collector (create_schema: true) will auto-create a metrics table in the otel database. Verify:

SHOW TABLES FROM otel;
-- Should include otel_logs (existing) plus the new metrics table(s).
-- Note: depending on the exporter version, metrics may land in a single
-- otel_metrics table or in per-type tables (otel_metrics_sum,
-- otel_metrics_gauge, otel_metrics_histogram, ...) — verify before querying.

DESCRIBE TABLE otel.otel_metrics;

2. Add TTL retention for metrics

Metrics are time-series data and need retention policies. Apply TTLs similar to the log retention strategy:

ALTER TABLE otel.otel_metrics
    MODIFY TTL toDateTime(TimeUnix) + INTERVAL 90 DAY;

-- If the exporter created per-type tables, apply the same TTL to each of
-- them (otel_metrics_sum, otel_metrics_gauge, otel_metrics_histogram, ...).

90-day retention for metrics balances storage cost with the ability to observe trends over a quarter.
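The storage claim can be sanity-checked with a back-of-envelope row count, assuming (illustratively) ~40 mesh series per node, 3 nodes, and one sample per series every 15 s:

```shell
# rows retained = series_per_node * nodes * samples_per_day * retention_days
echo $(( 40 * 3 * (86400 / 15) * 90 ))
```

That comes to about 62 million rows — trivial for ClickHouse, but worth re-estimating with real series counts once metrics are flowing.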

3. Verify metrics are flowing

After restarting the OTel Collector with the new config:

SELECT
    MetricName,
    count() as samples,
    min(TimeUnix) as first_seen,
    max(TimeUnix) as last_seen
FROM otel.otel_metrics
WHERE MetricName LIKE 'traefik_mesh%'
GROUP BY MetricName
ORDER BY samples DESC;

Phase 4: Grafana "Service Mesh Traffic" Dashboard (1.5 hours)

Priority: P1 | Impact: High | Dependencies: Phase 3

1. Create the dashboard

Create a new Grafana dashboard "Service Mesh Traffic" with the following panels:

Row 1 — Overview

| Panel | Visualization | Query Description |
|-------|---------------|-------------------|
| Total Mesh Requests (24h) | Stat | Sum of traefik_mesh_service_requests_total over 24h |
| Mesh Error Rate | Gauge | Percentage of 5xx responses across all service pairs |
| Active Connections | Stat | Current sum of traefik_mesh_service_open_connections |
| mTLS Certificate Expiry | Stat (warning threshold) | Minimum traefik_mesh_tls_certs_not_after across all proxies |

Row 2 — Service-to-Service Traffic Map

| Panel | Visualization | Query Description |
|-------|---------------|-------------------|
| Request Rate by Service Pair | Time Series | Rate of traefik_mesh_service_requests_total grouped by source_service and destination_service |
| Error Rate by Service Pair | Time Series | Rate of 4xx/5xx traefik_mesh_entrypoint_requests_total grouped by service pair and status code |
| Latency p50/p95/p99 by Service Pair | Time Series | Histogram quantiles of traefik_mesh_service_request_duration_seconds |

Row 3 — Per-Service Deep Dive (with variable selector)

| Panel | Visualization | Query Description |
|-------|---------------|-------------------|
| Inbound Request Rate | Time Series | Requests TO the selected service |
| Outbound Request Rate | Time Series | Requests FROM the selected service |
| Latency Distribution | Histogram | Full latency histogram for the selected service pair |
| Status Code Breakdown | Pie Chart | 2xx/3xx/4xx/5xx distribution |

2. Add template variables

  • $namespace — dropdown: forma3d-dev, forma3d (filters all panels)
  • $service — dropdown populated from distinct destination_service values in metrics
  • $interval — auto interval for rate calculations
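The $service variable can be populated with a query like the following sketch — the Attributes map column and label names are assumptions to verify against the Phase 3 schema:

```sql
-- Populate the $service dropdown from observed destination services.
-- Column and label names are assumptions; verify in Phase 3.
SELECT DISTINCT Attributes['destination_service'] AS service
FROM otel.otel_metrics
WHERE MetricName = 'traefik_mesh_service_requests_total'
ORDER BY service
```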

3. ClickHouse queries for Grafana

Example query for request rate by service pair. Because *_total metrics are cumulative counters, compute per-interval increases per proxy pod rather than summing raw samples:

SELECT
    time,
    source,
    destination,
    sum(delta) AS requests
FROM
(
    SELECT
        toStartOfInterval(TimeUnix, INTERVAL 1 MINUTE) AS time,
        Attributes['source_service'] AS source,
        Attributes['destination_service'] AS destination,
        Attributes['pod'] AS pod,
        max(Value) - min(Value) AS delta
    FROM otel.otel_metrics
    WHERE MetricName = 'traefik_mesh_service_requests_total'
      AND TimeUnix >= $__fromTime
      AND TimeUnix <= $__toTime
    GROUP BY time, source, destination, pod
)
GROUP BY time, source, destination
ORDER BY time

Adapt column names to match the actual ClickHouse OTel metrics schema after verifying in Phase 3.


Phase 5: Alerting Rules (30 min)

Priority: P2 | Impact: Medium | Dependencies: Phase 4

1. Add mesh-specific alert rules in Grafana

| Alert | Condition | Severity | Action |
|-------|-----------|----------|--------|
| High inter-service error rate | >5% 5xx responses for any service pair over 5 min | Warning | Notification |
| Latency spike | p99 latency >2s for any service pair over 5 min | Warning | Notification |
| Mesh proxy unhealthy | Any traefik-mesh-proxy pod not ready for >2 min | Critical | Notification |
| mTLS cert expiring soon | Certificate expiry <7 days | Warning | Notification |
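As a sketch, the first alert's condition could be evaluated with a ClickHouse query like this — the label names are assumptions to verify against the live schema:

```sql
-- Hypothetical alert query: share of 5xx responses per service pair over
-- the last 5 minutes. Label names ('code', 'source_service', ...) are
-- assumptions. Note that *_total counters are cumulative, so a production
-- rule should compare per-interval increases rather than raw sums.
SELECT
    Attributes['source_service'] AS source,
    Attributes['destination_service'] AS destination,
    sumIf(Value, Attributes['code'] LIKE '5%') / sum(Value) AS error_ratio
FROM otel.otel_metrics
WHERE MetricName = 'traefik_mesh_entrypoint_requests_total'
  AND TimeUnix >= now() - INTERVAL 5 MINUTE
GROUP BY source, destination
HAVING error_ratio > 0.05
```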

2. Configure alert notifications

Use the same notification channel as existing log alerts (configured in the ClickHouse + Grafana prompt).


Phase 6: Local Dev Integration (30 min)

Priority: P2 | Impact: Medium | Dependencies: Phase 2

1. Update Tiltfile for metrics pipeline

Add the OTel Collector Prometheus scraping to the local dev setup. Since Tilt runs in Rancher Desktop's K3s cluster, Kubernetes service discovery works natively.

If the OTel Collector is not part of the local dev Tiltfile (it's currently in the "excluded services" list), add a lightweight local-dev variant:

# --- OTel Collector (metrics only, for mesh observability) ---
k8s_yaml('k8s/dev/otel-collector.yaml')
k8s_resource('otel-collector',
    labels=['observability'])

This is optional for local dev — developers who don't need mesh metrics can skip it.

2. Port-forward Grafana for local dev (optional)

If a developer wants to view mesh dashboards locally, they can add Grafana to their Tilt setup or run it standalone:

kubectl port-forward -n forma3d-dev svc/grafana 3001:3000

📊 Data Flow Summary

                     Traefik Mesh Proxy Pods
                     (per-node, expose /metrics)
                              │
                              │ Prometheus scrape (15s interval)
                              ▼
                     ┌──────────────────┐
                     │  OTel Collector  │
                     │                  │
                     │  receivers:      │
                     │   - otlp (logs)  │ ← NestJS services (existing)
                     │   - prometheus   │ ← Traefik Mesh metrics (NEW)
                     │                  │
                     │  exporters:      │
                     │   - clickhouse   │
                     └────────┬─────────┘
                              │
                              ▼
                     ┌──────────────────┐
                     │   ClickHouse     │
                     │                  │
                     │  otel_logs       │ ← existing (TTL: 7–180 days)
                     │  otel_metrics    │ ← NEW (TTL: 90 days)
                     └────────┬─────────┘
                              │
                              ▼
                     ┌──────────────────┐
                     │    Grafana       │
                     │                  │
                     │  📊 App Logs     │ ← existing dashboards
                     │  📊 Mesh Traffic │ ← NEW dashboard
                     │  🔔 Alerts       │ ← existing + NEW mesh alerts
                     └──────────────────┘

✅ Validation Checklist

Metrics Pipeline

  • OTel Collector config includes prometheus receiver targeting Traefik Mesh proxy pods
  • metrics pipeline defined in OTel Collector service config (receivers → processors → exporters)
  • OTel Collector has RBAC to discover pods via Kubernetes API (in K8s environments)
  • Metrics are flowing: SELECT count() FROM otel.otel_metrics WHERE MetricName LIKE 'traefik_mesh%' returns data
  • TTL retention set on otel_metrics table (90 days)

Grafana Dashboard

  • "Service Mesh Traffic" dashboard created with overview, service-pair, and deep-dive rows
  • Template variables work ($namespace, $service, $interval)
  • Request rate, error rate, and latency panels show real data
  • mTLS certificate expiry panel shows certificate status

Alerting

  • High inter-service error rate alert configured and tested
  • Latency spike alert configured and tested
  • Mesh proxy unhealthy alert configured and tested
  • mTLS cert expiry alert configured and tested
  • Alerts use the same notification channel as existing log alerts

Cross-Environment

  • Metrics pipeline works in local dev (Rancher Desktop + Tilt) with Kubernetes SD
  • Metrics pipeline config is compatible with staging Docker Compose (static targets)
  • Same Grafana dashboard works across environments (only namespace filter changes)

🚫 Constraints and Rules

MUST DO

  • Use the existing OTel Collector — add a receiver, don't deploy a separate metrics collector
  • Store metrics in ClickHouse (same database as logs) — don't introduce Prometheus server or Thanos
  • Use the existing Grafana instance — add dashboards, don't deploy a separate Grafana
  • Set TTL retention on the metrics table
  • Verify exact metric names from the running Traefik Mesh /metrics endpoint before building dashboards

MUST NOT

  • Deploy Prometheus server, VictoriaMetrics, or any separate TSDB — ClickHouse handles both logs and metrics
  • Modify the existing log pipeline — the logs pipeline in OTel Collector must remain unchanged
  • Break existing Grafana dashboards or alert rules
  • Add metrics collection for anything other than Traefik Mesh in this prompt (application-level Prometheus metrics are a separate concern)
  • Use any, ts-ignore, or eslint-disable

SHOULD DO (Nice to Have)

  • Add a Grafana "Service Map" visualization showing the request flow between services as a directed graph
  • Include a "Top Slow Routes" panel showing the slowest service-to-service paths
  • Create a combined "Correlate" panel that links a mesh latency spike to application logs from the same time window
  • Document how to add Prometheus metrics from other sources (e.g., Redis, PostgreSQL exporters) to the same pipeline in the future
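For the last point, extending the pipeline to another source is just one more scrape job in the same prometheus receiver — a sketch with a hypothetical Redis exporter target (9121 is redis_exporter's conventional port):

```yaml
  prometheus:
    config:
      scrape_configs:
        # ... existing traefik-mesh job stays as-is ...
        - job_name: 'redis-exporter'          # hypothetical future source
          scrape_interval: 15s
          static_configs:
            - targets: ['redis-exporter:9121']
```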

🔄 Rollback Plan

All changes are additive and non-destructive:

  1. OTel Collector config: Removing the prometheus receiver and metrics pipeline reverts to the original logs-only configuration
  2. ClickHouse metrics table: Can be dropped (DROP TABLE otel.otel_metrics) without affecting the otel_logs table
  3. Grafana dashboards: Deleting the "Service Mesh Traffic" dashboard has no effect on existing dashboards
  4. Alerting rules: Mesh-specific alerts can be disabled or removed independently
  5. RBAC: The additional ClusterRole/ClusterRoleBinding can be deleted without affecting other Kubernetes resources

📚 Key References

  • Traefik Mesh metrics: https://doc.traefik.io/traefik-mesh/ — proxy metrics documentation
  • OTel Collector Prometheus receiver: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver
  • OTel Collector ClickHouse exporter (metrics): https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/clickhouseexporter
  • Grafana ClickHouse plugin: https://grafana.com/grafana/plugins/grafana-clickhouse-datasource/
  • Existing logging prompt: done/prompt-clickhouse-grafana-logging.md
  • Scaling preparations prompt: todo/prompt-scaling-preparations.md (Traefik Mesh installation)

END OF PROMPT

