
AI Prompt: Forma3D.Connect — Traefik Mesh Observability Integration

Purpose: Integrate Traefik Mesh's Prometheus metrics into the existing OTel Collector → ClickHouse → Grafana pipeline, providing a unified view of application logs and mesh-level traffic patterns across all environments
Estimated Effort: 4–6 hours
Prerequisites: Scaling Preparations prompt completed (Traefik Mesh installed via Helm in local dev, staging, and production); ClickHouse + Grafana logging stack operational; OTel Collector running
Output: Prometheus scraping of Traefik Mesh metrics, ClickHouse metrics table, Grafana dashboards for service-to-service traffic, latency, error rates, and mTLS status
Status: 🚧 TODO


🎯 Mission

Extend the existing observability stack to include network-level metrics from Traefik Mesh, complementing the application-level logs already flowing through the OTel Collector → ClickHouse → Grafana pipeline.

What this delivers:

  1. Prometheus receiver in OTel Collector — scrapes Traefik Mesh proxy metrics from each node
  2. ClickHouse metrics table — stores time-series mesh metrics with configurable retention
  3. Grafana "Service Mesh" dashboard — visualizes service-to-service traffic patterns, request rates, latency percentiles, error rates, and mTLS status
  4. Alerting rules — high inter-service error rates, latency spikes, mesh proxy unhealthy
  5. Consistent across environments — same pipeline works in local dev (Rancher Desktop + Tilt), staging, and production (DOKS)

Why this matters:

The existing observability stack answers "what happened inside a service?" (application logs). Traefik Mesh metrics answer "what's happening between services?" — a blind spot today. Combined, they provide full-stack observability:

┌─────────────────────────────────────────────────────────────────┐
│                     Grafana Dashboards                          │
│                                                                 │
│  ┌──────────────────┐   ┌──────────────────┐   ┌─────────────┐  │
│  │ Application Logs │   │  Service Mesh    │   │   System    │  │
│  │ (existing)       │   │  Traffic (NEW)   │   │   Health    │  │
│  └────────┬─────────┘   └────────┬─────────┘   └──────┬──────┘  │
│           │                      │                    │         │
│           └──────────────────────┼────────────────────┘         │
│                                  │                              │
│                           ┌──────┴──────┐                       │
│                           │  ClickHouse │                       │
│                           │  otel DB    │                       │
│                           └──────┬──────┘                       │
│                                  │                              │
│                       ┌──────────┴──────────┐                   │
│                       │   OTel Collector    │                   │
│                       │                     │                   │
│                       │  receivers:         │                   │
│                       │   - otlp (logs)     │  ← existing       │
│                       │   - prometheus (NEW)│  ← Traefik Mesh   │
│                       │                     │                   │
│                       │  exporters:         │                   │
│                       │   - clickhouse      │                   │
│                       └─────────────────────┘                   │
└─────────────────────────────────────────────────────────────────┘

📋 Step-by-Step Implementation

Phase 1: Understand Traefik Mesh Metrics (30 min)

Priority: P0 | Impact: Foundation | Dependencies: Traefik Mesh running

1. Identify the metrics endpoint

Traefik Mesh exposes Prometheus metrics from its per-node proxy pods. Identify the metrics port and path:

# Find Traefik Mesh proxy pods (the label selector and namespace may differ
# by Helm chart version — check `kubectl get pods -A | grep -i mesh` if empty)
kubectl get pods -n forma3d-dev -l app=traefik-mesh-proxy

# Check what ports the proxy exposes
kubectl get pods -n forma3d-dev -l app=traefik-mesh-proxy -o jsonpath='{.items[0].spec.containers[0].ports[*]}'

# Verify metrics are available
kubectl port-forward -n forma3d-dev svc/traefik-mesh-proxy-api 8080:8080
curl http://localhost:8080/metrics

2. Document available metrics

Key Traefik Mesh metrics to capture:

| Metric | Type | Description |
|--------|------|-------------|
| traefik_mesh_service_requests_total | Counter | Total requests per source→destination service pair |
| traefik_mesh_service_request_duration_seconds | Histogram | Request latency distribution per service pair |
| traefik_mesh_service_open_connections | Gauge | Active connections between services |
| traefik_mesh_tls_certs_not_after | Gauge | mTLS certificate expiration timestamp |
| traefik_mesh_config_reloads_total | Counter | Mesh configuration reload count |
| traefik_mesh_entrypoint_requests_total | Counter | Requests per entrypoint with status code labels |

Verify exact metric names by inspecting the /metrics endpoint — names may vary by Traefik Mesh version.
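One quick way to build that checklist is to strip a /metrics dump down to unique metric names. The snippet below is a sketch that runs against an inline sample (values illustrative); in practice, pipe `curl -s http://localhost:8080/metrics` through the same filter.

```shell
# Inline sample of Prometheus exposition text (illustrative values only).
cat > /tmp/mesh-metrics-sample.txt <<'EOF'
# HELP traefik_mesh_service_requests_total Total requests
# TYPE traefik_mesh_service_requests_total counter
traefik_mesh_service_requests_total{source="api",destination="worker"} 42
traefik_mesh_service_request_duration_seconds_bucket{le="0.1"} 10
traefik_mesh_service_open_connections 3
EOF

# Drop comment lines, strip labels and values, and list unique metric names.
grep -v '^#' /tmp/mesh-metrics-sample.txt | sed 's/[{ ].*//' | sort -u
```

The sorted, de-duplicated output is exactly the list of names to verify against the table above.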


Phase 2: Add Prometheus Receiver to OTel Collector (1 hour)

Priority: P1 | Impact: High | Dependencies: Phase 1

1. Update OTel Collector configuration

Extend deployment/staging/otel-collector-config.yaml to add a Prometheus receiver alongside the existing OTLP receiver:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  prometheus:
    config:
      scrape_configs:
        - job_name: 'traefik-mesh'
          scrape_interval: 15s
          kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - forma3d-dev      # local dev
                  - forma3d          # staging/production
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              regex: traefik-mesh-proxy
              action: keep
            - source_labels: [__meta_kubernetes_pod_name]
              target_label: pod
            - source_labels: [__meta_kubernetes_namespace]
              target_label: namespace

processors:
  batch:
    timeout: 5s
    send_batch_size: 10000
    send_batch_max_size: 20000

  resource:
    attributes:
      - key: deployment.environment
        value: "${ENVIRONMENT}"
        action: upsert
      - key: host.name
        value: "${HOSTNAME}"
        action: upsert

  filter/drop-debug:
    logs:
      log_record:
        - 'severity_number < 9'

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000
    database: otel
    logs_table_name: otel_logs
    timeout: 10s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    create_schema: true

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [resource, filter/drop-debug, batch]
      exporters: [clickhouse]

    metrics:
      receivers: [prometheus]
      processors: [resource, batch]
      exporters: [clickhouse]

  telemetry:
    logs:
      level: warn
    metrics:
      address: 0.0.0.0:8888

  extensions: [health_check]

# Extensions referenced in service.extensions must also be defined at the
# top level, otherwise the collector refuses to start.
extensions:
  health_check:

The key addition is the metrics pipeline: prometheus receiver → resource processor → batch processor → clickhouse exporter.

2. For Docker Compose (staging/production without Kubernetes SD)

In the current Docker Compose staging environment there is no Kubernetes API for service discovery, so use static scrape targets instead. This is the interim configuration until the DOKS migration:

  prometheus:
    config:
      scrape_configs:
        - job_name: 'traefik-mesh'
          scrape_interval: 15s
          static_configs:
            - targets: ['traefik-mesh-proxy:8080']

Switch to kubernetes_sd_configs when migrating to DOKS.

3. Verify OTel Collector RBAC (Kubernetes environments)

The OTel Collector needs RBAC permissions to discover Traefik Mesh pods via the Kubernetes API. Add to its ServiceAccount:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-prometheus
rules:
  - apiGroups: [""]
    resources: [pods, nodes, endpoints]
    verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-prometheus-binding
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: forma3d-dev
roleRef:
  kind: ClusterRole
  name: otel-collector-prometheus
  apiGroup: rbac.authorization.k8s.io

Phase 3: ClickHouse Metrics Table (1 hour)

Priority: P1 | Impact: High | Dependencies: Phase 2

1. Verify auto-created metrics table

The clickhouse exporter in the OTel Collector (create_schema: true) will auto-create a metrics table in the otel database. Verify:

SHOW TABLES FROM otel;
-- Should include otel_logs (existing) plus the new metrics table(s).
-- Note: depending on the exporter version, metrics may land in a single
-- otel_metrics table or in per-type tables (otel_metrics_sum,
-- otel_metrics_gauge, otel_metrics_histogram, ...) — verify before querying.

DESCRIBE TABLE otel.otel_metrics;

2. Add TTL retention for metrics

Metrics are time-series data and need retention policies. Apply TTLs similar to the log retention strategy:

ALTER TABLE otel.otel_metrics
    MODIFY TTL toDateTime(TimeUnix) + INTERVAL 90 DAY;

-- If the exporter created per-type tables, apply the same TTL to each of
-- them (otel_metrics_sum, otel_metrics_gauge, otel_metrics_histogram, ...).

90-day retention for metrics balances storage cost with the ability to observe trends over a quarter.
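The storage claim can be sanity-checked with a back-of-envelope row count, assuming (illustratively) ~40 mesh series per node, 3 nodes, and one sample per series every 15 s:

```shell
# rows retained = series_per_node * nodes * samples_per_day * retention_days
echo $(( 40 * 3 * (86400 / 15) * 90 ))
```

That comes to about 62 million rows — trivial for ClickHouse, but worth re-estimating with real series counts once metrics are flowing.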

3. Verify metrics are flowing

After restarting the OTel Collector with the new config:

SELECT
    MetricName,
    count() as samples,
    min(TimeUnix) as first_seen,
    max(TimeUnix) as last_seen
FROM otel.otel_metrics
WHERE MetricName LIKE 'traefik_mesh%'
GROUP BY MetricName
ORDER BY samples DESC;

Phase 4: Grafana "Service Mesh Traffic" Dashboard (1.5 hours)

Priority: P1 | Impact: High | Dependencies: Phase 3

1. Create the dashboard

Create a new Grafana dashboard "Service Mesh Traffic" with the following panels:

Row 1 — Overview

| Panel | Visualization | Query Description |
|-------|---------------|-------------------|
| Total Mesh Requests (24h) | Stat | Sum of traefik_mesh_service_requests_total over 24h |
| Mesh Error Rate | Gauge | Percentage of 5xx responses across all service pairs |
| Active Connections | Stat | Current sum of traefik_mesh_service_open_connections |
| mTLS Certificate Expiry | Stat (warning threshold) | Minimum traefik_mesh_tls_certs_not_after across all proxies |

Row 2 — Service-to-Service Traffic Map

| Panel | Visualization | Query Description |
|-------|---------------|-------------------|
| Request Rate by Service Pair | Time Series | Rate of traefik_mesh_service_requests_total grouped by source_service and destination_service |
| Error Rate by Service Pair | Time Series | Rate of 4xx/5xx traefik_mesh_entrypoint_requests_total grouped by service pair and status code |
| Latency p50/p95/p99 by Service Pair | Time Series | Histogram quantiles of traefik_mesh_service_request_duration_seconds |

Row 3 — Per-Service Deep Dive (with variable selector)

| Panel | Visualization | Query Description |
|-------|---------------|-------------------|
| Inbound Request Rate | Time Series | Requests TO the selected service |
| Outbound Request Rate | Time Series | Requests FROM the selected service |
| Latency Distribution | Histogram | Full latency histogram for the selected service pair |
| Status Code Breakdown | Pie Chart | 2xx/3xx/4xx/5xx distribution |

2. Add template variables

  • $namespace — dropdown: forma3d-dev, forma3d (filters all panels)
  • $service — dropdown populated from distinct destination_service values in metrics
  • $interval — auto interval for rate calculations
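The $service variable can be populated with a query like the following sketch — the Attributes map column and label names are assumptions to verify against the Phase 3 schema:

```sql
-- Populate the $service dropdown from observed destination services.
-- Column and label names are assumptions; verify in Phase 3.
SELECT DISTINCT Attributes['destination_service'] AS service
FROM otel.otel_metrics
WHERE MetricName = 'traefik_mesh_service_requests_total'
ORDER BY service
```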

3. ClickHouse queries for Grafana

Example query for request rate by service pair. Because *_total metrics are cumulative counters, compute per-interval increases per proxy pod rather than summing raw samples:

SELECT
    time,
    source,
    destination,
    sum(delta) AS requests
FROM
(
    SELECT
        toStartOfInterval(TimeUnix, INTERVAL 1 MINUTE) AS time,
        Attributes['source_service'] AS source,
        Attributes['destination_service'] AS destination,
        Attributes['pod'] AS pod,
        max(Value) - min(Value) AS delta
    FROM otel.otel_metrics
    WHERE MetricName = 'traefik_mesh_service_requests_total'
      AND TimeUnix >= $__fromTime
      AND TimeUnix <= $__toTime
    GROUP BY time, source, destination, pod
)
GROUP BY time, source, destination
ORDER BY time

Adapt column names to match the actual ClickHouse OTel metrics schema after verifying in Phase 3.


Phase 5: Alerting Rules (30 min)

Priority: P2 | Impact: Medium | Dependencies: Phase 4

1. Add mesh-specific alert rules in Grafana

| Alert | Condition | Severity | Action |
|-------|-----------|----------|--------|
| High inter-service error rate | >5% 5xx responses for any service pair over 5 min | Warning | Notification |
| Latency spike | p99 latency >2s for any service pair over 5 min | Warning | Notification |
| Mesh proxy unhealthy | Any traefik-mesh-proxy pod not ready for >2 min | Critical | Notification |
| mTLS cert expiring soon | Certificate expiry <7 days | Warning | Notification |
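As a sketch, the first alert's condition could be evaluated with a ClickHouse query like this — the label names are assumptions to verify against the live schema:

```sql
-- Hypothetical alert query: share of 5xx responses per service pair over
-- the last 5 minutes. Label names ('code', 'source_service', ...) are
-- assumptions. Note that *_total counters are cumulative, so a production
-- rule should compare per-interval increases rather than raw sums.
SELECT
    Attributes['source_service'] AS source,
    Attributes['destination_service'] AS destination,
    sumIf(Value, Attributes['code'] LIKE '5%') / sum(Value) AS error_ratio
FROM otel.otel_metrics
WHERE MetricName = 'traefik_mesh_entrypoint_requests_total'
  AND TimeUnix >= now() - INTERVAL 5 MINUTE
GROUP BY source, destination
HAVING error_ratio > 0.05
```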

2. Configure alert notifications

Use the same notification channel as existing log alerts (configured in the ClickHouse + Grafana prompt).


Phase 6: Local Dev Integration (30 min)

Priority: P2 | Impact: Medium | Dependencies: Phase 2

1. Update Tiltfile for metrics pipeline

Add the OTel Collector Prometheus scraping to the local dev setup. Since Tilt runs in Rancher Desktop's K3s cluster, Kubernetes service discovery works natively.

If the OTel Collector is not part of the local dev Tiltfile (it's currently in the "excluded services" list), add a lightweight local-dev variant:

# --- OTel Collector (metrics only, for mesh observability) ---
k8s_yaml('k8s/dev/otel-collector.yaml')
k8s_resource('otel-collector',
    labels=['observability'])

This is optional for local dev — developers who don't need mesh metrics can skip it.

2. Port-forward Grafana for local dev (optional)

If a developer wants to view mesh dashboards locally, they can add Grafana to their Tilt setup or run it standalone:

kubectl port-forward -n forma3d-dev svc/grafana 3001:3000

📊 Data Flow Summary

                     Traefik Mesh Proxy Pods
                     (per-node, expose /metrics)
                              │
                              │ Prometheus scrape (15s interval)
                              ▼
                     ┌──────────────────┐
                     │  OTel Collector  │
                     │                  │
                     │  receivers:      │
                     │   - otlp (logs)  │ ← NestJS services (existing)
                     │   - prometheus   │ ← Traefik Mesh metrics (NEW)
                     │                  │
                     │  exporters:      │
                     │   - clickhouse   │
                     └────────┬─────────┘
                              │
                              ▼
                     ┌──────────────────┐
                     │   ClickHouse     │
                     │                  │
                     │  otel_logs       │ ← existing (TTL: 7–180 days)
                     │  otel_metrics    │ ← NEW (TTL: 90 days)
                     └────────┬─────────┘
                              │
                              ▼
                     ┌──────────────────┐
                     │    Grafana       │
                     │                  │
                     │  📊 App Logs     │ ← existing dashboards
                     │  📊 Mesh Traffic │ ← NEW dashboard
                     │  🔔 Alerts       │ ← existing + NEW mesh alerts
                     └──────────────────┘

✅ Validation Checklist

Metrics Pipeline

  • OTel Collector config includes prometheus receiver targeting Traefik Mesh proxy pods
  • metrics pipeline defined in OTel Collector service config (receivers → processors → exporters)
  • OTel Collector has RBAC to discover pods via Kubernetes API (in K8s environments)
  • Metrics are flowing: SELECT count() FROM otel.otel_metrics WHERE MetricName LIKE 'traefik_mesh%' returns data
  • TTL retention set on otel_metrics table (90 days)

Grafana Dashboard

  • "Service Mesh Traffic" dashboard created with overview, service-pair, and deep-dive rows
  • Template variables work ($namespace, $service, $interval)
  • Request rate, error rate, and latency panels show real data
  • mTLS certificate expiry panel shows certificate status

Alerting

  • High inter-service error rate alert configured and tested
  • Latency spike alert configured and tested
  • Mesh proxy unhealthy alert configured and tested
  • mTLS cert expiry alert configured and tested
  • Alerts use the same notification channel as existing log alerts

Cross-Environment

  • Metrics pipeline works in local dev (Rancher Desktop + Tilt) with Kubernetes SD
  • Metrics pipeline config is compatible with staging Docker Compose (static targets)
  • Same Grafana dashboard works across environments (only namespace filter changes)

🚫 Constraints and Rules

MUST DO

  • Use the existing OTel Collector — add a receiver, don't deploy a separate metrics collector
  • Store metrics in ClickHouse (same database as logs) — don't introduce Prometheus server or Thanos
  • Use the existing Grafana instance — add dashboards, don't deploy a separate Grafana
  • Set TTL retention on the metrics table
  • Verify exact metric names from the running Traefik Mesh /metrics endpoint before building dashboards

MUST NOT

  • Deploy Prometheus server, VictoriaMetrics, or any separate TSDB — ClickHouse handles both logs and metrics
  • Modify the existing log pipeline — the logs pipeline in OTel Collector must remain unchanged
  • Break existing Grafana dashboards or alert rules
  • Add metrics collection for anything other than Traefik Mesh in this prompt (application-level Prometheus metrics are a separate concern)
  • Use any, ts-ignore, or eslint-disable

SHOULD DO (Nice to Have)

  • Add a Grafana "Service Map" visualization showing the request flow between services as a directed graph
  • Include a "Top Slow Routes" panel showing the slowest service-to-service paths
  • Create a combined "Correlate" panel that links a mesh latency spike to application logs from the same time window
  • Document how to add Prometheus metrics from other sources (e.g., Redis, PostgreSQL exporters) to the same pipeline in the future
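For the last point, extending the pipeline to another source is just one more scrape job in the same prometheus receiver — a sketch with a hypothetical Redis exporter target (9121 is redis_exporter's conventional port):

```yaml
  prometheus:
    config:
      scrape_configs:
        # ... existing traefik-mesh job stays as-is ...
        - job_name: 'redis-exporter'          # hypothetical future source
          scrape_interval: 15s
          static_configs:
            - targets: ['redis-exporter:9121']
```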

🔄 Rollback Plan

All changes are additive and non-destructive:

  1. OTel Collector config: Removing the prometheus receiver and metrics pipeline reverts to the original logs-only configuration
  2. ClickHouse metrics table: Can be dropped (DROP TABLE otel.otel_metrics) without affecting the otel_logs table
  3. Grafana dashboards: Deleting the "Service Mesh Traffic" dashboard has no effect on existing dashboards
  4. Alerting rules: Mesh-specific alerts can be disabled or removed independently
  5. RBAC: The additional ClusterRole/ClusterRoleBinding can be deleted without affecting other Kubernetes resources

📚 Key References

  • Traefik Mesh metrics: https://doc.traefik.io/traefik-mesh/ — proxy metrics documentation
  • OTel Collector Prometheus receiver: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver
  • OTel Collector ClickHouse exporter (metrics): https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/clickhouseexporter
  • Grafana ClickHouse plugin: https://grafana.com/grafana/plugins/grafana-clickhouse-datasource/
  • Existing logging prompt: done/prompt-clickhouse-grafana-logging.md
  • Scaling preparations prompt: todo/prompt-scaling-preparations.md (Traefik Mesh installation)

END OF PROMPT

