
AI Prompt: Forma3D.Connect — ClickHouse + Grafana Centralized Logging

Purpose: Migrate structured logging from Sentry Logs to a self-hosted ClickHouse + Grafana stack, collected via OpenTelemetry, while keeping Sentry for error tracking, performance monitoring, and profiling
Estimated Effort: 16–24 hours (infrastructure + application integration + dashboards + backups)
Prerequisites: All microservices running with current SentryLoggerService and instrument.ts files; staging Docker Compose deployment operational; DigitalOcean Droplet with at least 8 GB RAM / 4 vCPU (or plan to upsize from 4 GB)
Output: ClickHouse storing structured logs via OpenTelemetry, Grafana dashboards for log visualization and alerting, automated backups to DigitalOcean Spaces, all services logging via Pino bridged to OTel — with Sentry retained for errors/traces/profiling
Status: 🚧 TODO
Research: docs/03-architecture/research/clickhouse-grafana-logging-research.md


🎯 Mission

Implement a self-hosted ClickHouse + Grafana + OpenTelemetry centralized logging stack for the Forma3D.Connect microservice platform. This replaces only the logging concern currently handled by Sentry Logs (Sentry.logger.*). Sentry continues to handle error tracking, performance monitoring, tracing, and profiling — unchanged.

What this delivers:

  1. OpenTelemetry Collector — receives logs from all services via OTLP gRPC, batches and exports to ClickHouse
  2. ClickHouse — high-performance columnar storage for structured logs with TTL-based retention and S3 backup
  3. Grafana — log visualization dashboards with the ClickHouse plugin, alerting rules, and log exploration
  4. Pino + OTel bridge — replaces SentryLoggerService with Pino logger auto-bridged to OpenTelemetry via @opentelemetry/instrumentation-pino
  5. Tiered retention — ERROR/FATAL logs kept 180 days, WARN 90 days, INFO 30 days, DEBUG 7 days
  6. Automated backups — nightly ClickHouse backups to DigitalOcean Spaces (S3-compatible)
  7. Dozzle coexistence — Pino still writes to stdout, so Docker log viewers (Dozzle) continue working

Why move logging away from Sentry:

  • Cost: Sentry Logs is a metered feature; high-volume logging becomes expensive at scale
  • Retention: Sentry retains logs for limited periods (30 days on Team plan); ClickHouse allows arbitrary retention
  • Querying: ClickHouse offers sub-second analytical queries on billions of log rows; Sentry's log search is limited
  • Ownership: Logs contain sensitive business data — self-hosting provides full data sovereignty
  • Flexibility: Grafana dashboards are far more customizable than Sentry's log explorer
  • Vendor independence: OpenTelemetry is vendor-neutral — the backend can be swapped without app changes

What stays with Sentry (unchanged):

  • Exception capture (Sentry.captureException, SentryExceptionFilter)
  • Performance traces (Sentry's OTel integration)
  • Profiling (nodeProfilingIntegration)
  • Frontend logging (React/browser logs stay in Sentry for now)

📐 Architecture

Data Flow

┌───────────────────────────────────────────────────────────────────────────┐
│                        APPLICATION LAYER                                  │
│                                                                           │
│  ┌─────────┐  ┌───────────┐  ┌───────────┐  ┌──────────┐  ┌───────────┐ │
│  │ Gateway │  │ Order Svc │  │ Print Svc │  │Ship. Svc │  │GridFlock  │ │
│  │         │  │           │  │           │  │          │  │           │ │
│  │  Pino   │  │   Pino    │  │   Pino    │  │  Pino    │  │   Pino    │ │
│  │  + OTel │  │   + OTel  │  │   + OTel  │  │  + OTel  │  │   + OTel  │ │
│  └────┬────┘  └─────┬─────┘  └─────┬─────┘  └────┬─────┘  └─────┬─────┘ │
│       │             │              │             │              │        │
│       └─────────────┴──────────────┴─────────────┴──────────────┘        │
│                                    │                                      │
│                            OTLP (gRPC :4317)                             │
└────────────────────────────────────┼──────────────────────────────────────┘
                                     │
                                     ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                     COLLECTION LAYER                                       │
│                                                                            │
│  ┌──────────────────────────────────────┐                                  │
│  │    OpenTelemetry Collector (Contrib) │                                  │
│  │                                      │                                  │
│  │  Receivers:  otlp (gRPC + HTTP)      │                                  │
│  │  Processors: batch, resource,        │                                  │
│  │              attributes, filter      │                                  │
│  │  Exporters:  clickhouse              │                                  │
│  └──────────────────┬───────────────────┘                                  │
│                     │                                                      │
└─────────────────────┼──────────────────────────────────────────────────────┘
                      │
                      ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                     STORAGE + VISUALIZATION                                │
│                                                                            │
│  ┌──────────────────┐          ┌──────────────────────────┐                │
│  │    ClickHouse    │◀─────────│       Grafana            │                │
│  │                  │  query   │                          │                │
│  │  otel_logs DB    │          │  ClickHouse Data Source  │                │
│  │  TTL: tiered     │          │  Log dashboards          │                │
│  │                  │          │  Alert rules             │                │
│  └────────┬─────────┘          └──────────────────────────┘                │
│           │                                                                │
│           │  Nightly backup                                                │
│           ▼                                                                │
│  ┌──────────────────────────┐                                              │
│  │  DigitalOcean Spaces     │                                              │
│  │  (S3-compatible)         │                                              │
│  │  forma3d-log-backups/    │                                              │
│  └──────────────────────────┘                                              │
└────────────────────────────────────────────────────────────────────────────┘

Sentry Coexistence

Sentry continues to handle errors, traces, and profiling. The instrument.ts files in each service keep their current Sentry init but drop _experiments: { enableLogs: true }. The SentryLoggerService is replaced by an OtelLoggerService that writes to Pino (which is bridged to OTel).

Component Summary

| Component | Image | Purpose | Port(s) |
|---|---|---|---|
| OTel Collector | otel/opentelemetry-collector-contrib:0.120.0 | Receives OTLP logs, batches, exports to ClickHouse | 4317 (gRPC), 4318 (HTTP), 13133 (health), 8888 (metrics) |
| ClickHouse | clickhouse/clickhouse-server:24.12-alpine | Columnar log storage with TTL and S3 backup | 9000 (native TCP), 8123 (HTTP) |
| Grafana | grafana/grafana-oss:11.5.0 | Dashboard visualization, alerting | 3000 |

📋 Services Affected

| Service | Logging changes | Sentry changes |
|---|---|---|
| Gateway | Yes — add OTel SDK + Pino logger | Remove enableLogs experiment flag |
| Order Service | Yes — add OTel SDK + Pino logger | Remove enableLogs experiment flag |
| Print Service | Yes — add OTel SDK + Pino logger | Remove enableLogs experiment flag |
| Shipping Service | Yes — add OTel SDK + Pino logger | Remove enableLogs experiment flag |
| GridFlock Service | Yes — add OTel SDK + Pino logger | Remove enableLogs experiment flag |
| Web (React) | No (frontend logs stay in Sentry) | No change |

Files Modified per Service

| Action | Files |
|---|---|
| Modify | observability/instrument.ts — add OTel SDK init for logs |
| Replace | observability/services/sentry-logger.service.ts → otel-logger.service.ts |
| Modify | observability/observability.module.ts — swap provider |
| Keep | observability/filters/sentry-exception.filter.ts — unchanged |
| Modify | observability/interceptors/logging.interceptor.ts — swap logger injection |
| Modify | observability/services/business-observability.service.ts — swap logger injection |

📁 Files to Create/Modify

New Files — Infrastructure

deployment/staging/otel-collector-config.yaml              # OTel Collector pipeline configuration
deployment/staging/clickhouse-config.xml                   # ClickHouse server configuration
deployment/staging/clickhouse-users.xml                    # ClickHouse user access (reference)
deployment/staging/grafana/provisioning/datasources/clickhouse.yaml  # Grafana ClickHouse datasource
deployment/staging/scripts/backup-clickhouse-logs.sh       # Automated ClickHouse backup script

New Files — Application

libs/observability/src/lib/otel-logger.ts                  # Pino logger factory (createLogger)
libs/observability/src/lib/services/otel-logger.service.ts # NestJS injectable logger service (replaces SentryLoggerService)

Modified Files — Infrastructure

deployment/staging/docker-compose.yml                      # Add otel-collector, clickhouse, grafana services + volumes
deployment/staging/.env.example                            # Add CLICKHOUSE_PASSWORD, GRAFANA_ADMIN_PASSWORD, DO_SPACES_* vars

Modified Files — Application (per service)

apps/*/src/observability/instrument.ts                     # Add OTel SDK init, remove enableLogs experiment
apps/*/src/observability/observability.module.ts            # Swap SentryLoggerService → OtelLoggerService
apps/*/src/observability/interceptors/logging.interceptor.ts  # Swap logger injection
apps/*/src/observability/services/business-observability.service.ts  # Swap logger injection

Modified Files — Shared Library

libs/observability/src/index.ts                            # Export createLogger and OtelLoggerService

Deleted Files (Phase 5 only)

apps/*/src/observability/services/sentry-logger.service.ts  # Replaced by OtelLoggerService

🔧 Implementation Phases

Phase 1: Deploy Infrastructure (1 day)

Priority: P0 | Impact: Foundation | Dependencies: None

Deploy ClickHouse, OTel Collector, and Grafana containers on staging. Verify they start, connect, and are healthy.

1. Create OTel Collector configuration

Create deployment/staging/otel-collector-config.yaml:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 10000
    send_batch_max_size: 20000

  resource:
    attributes:
      - key: deployment.environment
        value: "${ENVIRONMENT}"
        action: upsert
      - key: host.name
        value: "${HOSTNAME}"
        action: upsert

  filter/drop-debug:
    logs:
      log_record:
        - 'severity_number < 9'

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000
    database: otel
    logs_table_name: otel_logs
    timeout: 10s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    create_schema: true

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [resource, filter/drop-debug, batch]
      exporters: [clickhouse]

  telemetry:
    logs:
      level: warn
    metrics:
      address: 0.0.0.0:8888

  extensions: [health_check]

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

Notes:

  • create_schema: true auto-creates the otel_logs table on first run
  • filter/drop-debug prevents debug-level logs from reaching ClickHouse in production (disable on staging by removing it from the pipeline's processors list)
  • The batch processor is critical for performance — ClickHouse ingests best in large batches
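The `severity_number < 9` condition in filter/drop-debug maps onto the OpenTelemetry severity scale, where each named level occupies four values (TRACE starts at 1, DEBUG at 5, INFO at 9, and so on). A minimal TypeScript sketch of the boundary the filter enforces:

```typescript
// OpenTelemetry log severity numbers (per the OTel Logs Data Model):
// each named level spans four values; only the first of each is listed here.
const SEVERITY_NUMBER: Record<string, number> = {
  TRACE: 1,
  DEBUG: 5,
  INFO: 9,
  WARN: 13,
  ERROR: 17,
  FATAL: 21,
};

// Mirrors the collector's 'severity_number < 9' OTTL condition:
// TRACE and DEBUG records are dropped, INFO and above pass through.
function isDroppedByFilter(severityNumber: number): boolean {
  return severityNumber < SEVERITY_NUMBER['INFO'];
}

console.log(isDroppedByFilter(SEVERITY_NUMBER['DEBUG'])); // true
console.log(isDroppedByFilter(SEVERITY_NUMBER['WARN'])); // false
```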

2. Create ClickHouse server configuration

Create deployment/staging/clickhouse-config.xml:

<?xml version="1.0"?>
<clickhouse>
    <listen_host>0.0.0.0</listen_host>

    <logger>
        <level>warning</level>
        <log>/var/log/clickhouse-server/clickhouse-server.log</log>
        <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
        <size>100M</size>
        <count>3</count>
    </logger>

    <max_server_memory_usage_to_ram_ratio>0.8</max_server_memory_usage_to_ram_ratio>

    <backups>
        <allowed_path>/backups/</allowed_path>
        <allowed_disk>s3_backups</allowed_disk>
    </backups>

    <storage_configuration>
        <disks>
            <s3_backups>
                <type>s3</type>
                <endpoint>https://${DO_SPACES_REGION}.digitaloceanspaces.com/${DO_SPACES_BUCKET}/clickhouse-backups/</endpoint>
                <access_key_id>${DO_SPACES_KEY}</access_key_id>
                <secret_access_key>${DO_SPACES_SECRET}</secret_access_key>
            </s3_backups>
        </disks>
    </storage_configuration>
</clickhouse>

3. Create ClickHouse users configuration

Create deployment/staging/clickhouse-users.xml (reference — may use CLICKHOUSE_PASSWORD env var approach instead):

<?xml version="1.0"?>
<clickhouse>
    <users>
        <otel>
            <password_sha256_hex replace="true"><!-- generated hash --></password_sha256_hex>
            <networks>
                <ip>::/0</ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
            <access_management>0</access_management>
        </otel>
    </users>
</clickhouse>
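The password_sha256_hex value can be generated offline before filling in the placeholder. A small Node sketch (the helper name is illustrative; the equivalent shell one-liner is `echo -n "$PASSWORD" | sha256sum`):

```typescript
import { createHash } from 'node:crypto';

// Computes the hex-encoded SHA-256 digest expected by the
// <password_sha256_hex> element in clickhouse-users.xml.
function passwordSha256Hex(password: string): string {
  return createHash('sha256').update(password, 'utf8').digest('hex');
}

// Placeholder input — substitute the real CLICKHOUSE_PASSWORD.
console.log(passwordSha256Hex('example-password'));
```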

4. Create Grafana ClickHouse datasource provisioning

Create deployment/staging/grafana/provisioning/datasources/clickhouse.yaml:

apiVersion: 1

datasources:
  - name: ClickHouse
    type: grafana-clickhouse-datasource
    access: proxy
    isDefault: true
    jsonData:
      host: clickhouse
      port: 9000
      protocol: native
      username: otel
      defaultDatabase: otel
      logs:
        defaultDatabase: otel
        defaultTable: otel_logs
        otelEnabled: true
        otelVersion: latest
        timeColumn: Timestamp
        levelColumn: SeverityText
        messageColumn: Body
    secureJsonData:
      password: ${CLICKHOUSE_PASSWORD}

5. Add services to Docker Compose

Add the following to deployment/staging/docker-compose.yml:

Services to add:

  • otel-collector — OTel Collector Contrib with health check, depends on ClickHouse
  • clickhouse — ClickHouse server with named volumes, health check, ulimits
  • grafana — Grafana OSS with ClickHouse plugin, provisioning volumes, Traefik labels

Volumes to add:

  • clickhouse-data
  • clickhouse-logs
  • grafana-data

Environment variables for existing backend services:

Add OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 to each backend service (Gateway, Order Service, Print Service, Shipping Service, GridFlock Service). Do NOT add depends_on on the OTel Collector — the OTel SDK buffers and retries if the collector is down.
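The endpoint resolution this relies on can be sketched as follows (the helper name is illustrative; the OTLP/gRPC exporter's built-in default is http://localhost:4317 when the variable is unset):

```typescript
// Resolve the OTLP endpoint the way the exporter effectively does:
// the env var wins, otherwise fall back to the local-collector default.
function resolveOtlpEndpoint(env: Record<string, string | undefined>): string {
  return env['OTEL_EXPORTER_OTLP_ENDPOINT'] ?? 'http://localhost:4317';
}

// With the value set in docker-compose.yml:
console.log(resolveOtlpEndpoint({ OTEL_EXPORTER_OTLP_ENDPOINT: 'http://otel-collector:4317' }));
```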

See the research document (Section 6.1) for the complete Docker Compose service definitions including Traefik labels for Grafana at staging-connect-grafana.forma3d.be.

6. Add environment variables

Update deployment/staging/.env.example with:

# ClickHouse
CLICKHOUSE_PASSWORD=<generate-strong-password>

# Grafana
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<generate-strong-password>

# DigitalOcean Spaces (for ClickHouse backups)
DO_SPACES_KEY=<generated-key>
DO_SPACES_SECRET=<generated-secret>
DO_SPACES_REGION=ams3
DO_SPACES_BUCKET=forma3d-log-backups

7. Verify infrastructure

  • docker compose up -d clickhouse otel-collector grafana
  • Verify ClickHouse health: docker exec forma3d-clickhouse clickhouse-client --query "SELECT 1"
  • Verify OTel Collector health: curl http://localhost:13133/
  • Verify Grafana health: curl http://localhost:3000/api/health
  • Verify Grafana can query ClickHouse: open Grafana UI → Data Sources → ClickHouse → Test Connection

Phase 2: Application Integration — Dual-Write (1 week)

Priority: P1 | Impact: Core integration | Dependencies: Phase 1

Keep SentryLoggerService active. Add OTel SDK to each service so logs are sent to both Sentry and ClickHouse in parallel. This validates the pipeline without risk.

8. Install required packages

pnpm add -w pino @opentelemetry/sdk-node @opentelemetry/api \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-logs-otlp-grpc \
  @opentelemetry/sdk-logs \
  @opentelemetry/instrumentation-pino \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions

pnpm add -D -w pino-pretty

(Pino ships its own TypeScript type definitions, so @types/pino is not needed.)

9. Create Pino logger factory in shared library

Create libs/observability/src/lib/otel-logger.ts:

import pino from 'pino';
import { getServiceName } from './service-context';

export function createLogger(context?: string): pino.Logger {
  const logger = pino({
    level: process.env['LOG_LEVEL'] || 'info',
    transport:
      process.env['NODE_ENV'] === 'development'
        ? { target: 'pino-pretty', options: { colorize: true } }
        : undefined,
  });

  return context ? logger.child({ context }) : logger;
}

10. Create OtelLoggerService in shared library

Create libs/observability/src/lib/services/otel-logger.service.ts:

import { Injectable } from '@nestjs/common';
import { createLogger } from '../otel-logger';
import type pino from 'pino';

@Injectable()
export class OtelLoggerService {
  private readonly logger: pino.Logger;

  constructor() {
    this.logger = createLogger('business');
  }

  info(message: string, attributes?: Record<string, unknown>): void {
    this.logger.info(attributes, message);
  }

  warn(message: string, attributes?: Record<string, unknown>): void {
    this.logger.warn(attributes, message);
  }

  error(message: string, attributes?: Record<string, unknown>): void {
    this.logger.error(attributes, message);
  }

  debug(message: string, attributes?: Record<string, unknown>): void {
    this.logger.debug(attributes, message);
  }

  logEvent(eventType: string, message: string, attributes?: Record<string, unknown>): void {
    this.logger.info({ ...attributes, eventType }, message);
  }

  logAudit(action: string, success: boolean, attributes?: Record<string, unknown>): void {
    const level = success ? 'info' : 'warn';
    this.logger[level]({ ...attributes, action, success, category: 'audit' }, `Audit: ${action}`);
  }
}
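For unit tests of injection sites during the swap, an in-memory stand-in with the same surface can be handy. This is a hypothetical sketch (not part of the plan's file list) mirroring the logAudit level rule above:

```typescript
// Hypothetical in-memory double mirroring OtelLoggerService's surface,
// for asserting on log calls in unit tests without a real Pino logger.
interface LogEntry {
  level: 'info' | 'warn' | 'error' | 'debug';
  message: string;
  attributes?: Record<string, unknown>;
}

class InMemoryLoggerDouble {
  readonly entries: LogEntry[] = [];

  info(message: string, attributes?: Record<string, unknown>): void {
    this.entries.push({ level: 'info', message, attributes });
  }

  warn(message: string, attributes?: Record<string, unknown>): void {
    this.entries.push({ level: 'warn', message, attributes });
  }

  // Same rule as OtelLoggerService.logAudit: failed audits log at warn.
  logAudit(action: string, success: boolean, attributes?: Record<string, unknown>): void {
    const level = success ? 'info' : 'warn';
    this.entries.push({
      level,
      message: `Audit: ${action}`,
      attributes: { ...attributes, action, success, category: 'audit' },
    });
  }
}

const double = new InMemoryLoggerDouble();
double.logAudit('order.cancel', false, { orderId: 'ord_123' });
console.log(double.entries[0].level); // warn
```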

11. Export new modules from shared library

Update libs/observability/src/index.ts to export createLogger and OtelLoggerService.

12. Modify instrument.ts in each backend service

In each service's observability/instrument.ts, add OTel SDK initialization before Sentry init. The OTel SDK must be initialized first so that Sentry can link to OTel traces.

The key changes per instrument.ts:

  1. Import NodeSDK, OTLPLogExporter, BatchLogRecordProcessor, getNodeAutoInstrumentations, Resource, and semantic conventions
  2. Read getOtelConfig(SERVICE_NAME) from the shared observability library
  3. If otelConfig.exporterEndpoint is set, create and start a NodeSDK instance with:
     • logRecordProcessors using BatchLogRecordProcessor + OTLPLogExporter
     • instrumentations with getNodeAutoInstrumentations, enabling @opentelemetry/instrumentation-pino (disableLogSending: false) and disabling unused instrumentations (e.g., @opentelemetry/instrumentation-fs)
  4. Keep Sentry init unchanged during the dual-write phase (do NOT remove _experiments: { enableLogs: true } yet)

See the research document (Section 5.3) for the complete instrument.ts code.

13. Validate logs appear in Grafana

  • Deploy updated services to staging
  • Generate some log activity (e.g., process a test order)
  • Open Grafana → Explore → select ClickHouse datasource → query otel_logs table
  • Verify logs from all services appear with correct ServiceName, SeverityText, Body, and TraceId

Phase 3: Build Grafana Dashboards (3–5 days)

Priority: P1 | Impact: Visualization | Dependencies: Phase 2 (logs flowing)

14. Create "Service Logs Overview" dashboard

  • Log volume over time (by service) — time series panel
  • Error rate by service — stacked bar chart
  • Latest error logs — table panel
  • Log level distribution — pie chart

15. Create "Request Tracing" dashboard

  • Logs filtered by TraceId variable
  • Correlated with Sentry traces (link out to Sentry)
  • Request lifecycle visualization

16. Create "Business Events" dashboard

  • Order processing events (filtered by eventType log attribute)
  • Print job status changes
  • Shipment events
  • Webhook receipt/processing logs

17. Create "System Health" dashboard

  • OTel Collector throughput (records/sec) — from Prometheus metrics on :8888
  • ClickHouse disk usage — query system.disks
  • ClickHouse query performance
  • Log ingestion latency

18. Configure alerting rules

| Alert | Condition | Channel |
|---|---|---|
| High error rate | > 50 ERROR logs in 5 min for any service | Slack / Email |
| Service silent | Zero logs from a service for > 10 min | Slack |
| ClickHouse disk usage | > 80% (query system.disks) | Email |
| OTel Collector unhealthy | Health check fail | Slack |

19. Useful Grafana queries for reference

Error log count by service (last 24h):

SELECT
    ServiceName,
    count() AS error_count
FROM otel.otel_logs
WHERE SeverityText IN ('ERROR', 'FATAL')
  AND Timestamp >= now() - INTERVAL 24 HOUR
GROUP BY ServiceName
ORDER BY error_count DESC

Log volume by level (time series):

SELECT
    toStartOfFiveMinutes(Timestamp) AS time,
    SeverityText,
    count() AS count
FROM otel.otel_logs
WHERE Timestamp >= $__fromTime AND Timestamp <= $__toTime
GROUP BY time, SeverityText
ORDER BY time

Search logs by keyword:

SELECT
    Timestamp,
    ServiceName,
    SeverityText,
    Body,
    LogAttributes
FROM otel.otel_logs
WHERE Body LIKE '%order%'
  AND Timestamp >= $__fromTime AND Timestamp <= $__toTime
ORDER BY Timestamp DESC
LIMIT 100

Phase 4: Cut Over — Replace SentryLoggerService (1 day)

Priority: P2 | Impact: Migration | Dependencies: Phase 3 (dashboards validated)

20. Swap SentryLoggerService → OtelLoggerService in all services

In each service's observability/observability.module.ts:

  • Replace the SentryLoggerService provider with OtelLoggerService from @forma3d/observability
  • Update all injection sites (logging.interceptor.ts, business-observability.service.ts, etc.)

21. Remove Sentry Logs experiment flag

In each service's instrument.ts:

  • Remove _experiments: { enableLogs: true } from the Sentry.init() call
  • Remove any Sentry.logger.* imports/calls

22. Verify cut-over

  • Deploy updated services to staging
  • Verify logs no longer appear in Sentry's log explorer
  • Verify logs continue to appear in Grafana
  • Verify Sentry still captures exceptions and traces correctly
  • Verify Dozzle still shows container logs (Pino writes to stdout)

Phase 5: Cleanup and Backups (1 day)

Priority: P2 | Impact: Completion | Dependencies: Phase 4 validated for 2+ weeks

23. Delete SentryLoggerService files

Remove apps/*/src/observability/services/sentry-logger.service.ts from all services.

24. Apply tiered TTL policy

Connect to ClickHouse and run:

ALTER TABLE otel.otel_logs MODIFY TTL
    TimestampDate + INTERVAL 7 DAY WHERE SeverityText IN ('TRACE', 'DEBUG'),
    TimestampDate + INTERVAL 30 DAY WHERE SeverityText = 'INFO',
    TimestampDate + INTERVAL 90 DAY WHERE SeverityText = 'WARN',
    TimestampDate + INTERVAL 180 DAY;
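The tiers can be expressed as a lookup, useful for sanity-checking the ALTER statement (a sketch; the 180-day clause is the catch-all that covers ERROR and FATAL):

```typescript
// Retention in days per severity, mirroring the tiered TTL:
// TRACE/DEBUG 7d, INFO 30d, WARN 90d, everything else (ERROR/FATAL) 180d.
function retentionDays(severityText: string): number {
  if (severityText === 'TRACE' || severityText === 'DEBUG') return 7;
  if (severityText === 'INFO') return 30;
  if (severityText === 'WARN') return 90;
  return 180; // catch-all, matching the final TTL clause
}

console.log(retentionDays('DEBUG')); // 7
console.log(retentionDays('FATAL')); // 180
```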

25. Set up DigitalOcean Spaces bucket

  • Create forma3d-log-backups bucket in ams3 region
  • Disable CDN, disable versioning
  • Generate Spaces access key and secret
  • Add lifecycle rules:

| Path prefix | Expiration |
|---|---|
| clickhouse/full/ | 35 days |
| clickhouse/incremental/ | 14 days |
| clickhouse/archive/ | 365 days |

26. Create backup script

Create deployment/staging/scripts/backup-clickhouse-logs.sh with:

  • Full backup on Sundays
  • Incremental backup on weekdays (referencing the last full backup)
  • Logging to /var/log/clickhouse-backup.log

See the research document (Section 8.4) for the complete backup script.
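The schedule logic the script implements can be sketched as follows (illustrative names; the real shell script lives in the research document):

```typescript
// Full backups run on Sundays, incremental backups on all other days,
// matching the schedule described above. Uses UTC day-of-week (Sunday = 0).
function backupMode(date: Date): 'full' | 'incremental' {
  return date.getUTCDay() === 0 ? 'full' : 'incremental';
}

console.log(backupMode(new Date(Date.UTC(2025, 0, 5)))); // Sunday → full
console.log(backupMode(new Date(Date.UTC(2025, 0, 8)))); // Wednesday → incremental
```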

27. Configure backup cron

Add to the Droplet's crontab:

0 3 * * * /opt/forma3d/scripts/backup-clickhouse-logs.sh >> /var/log/clickhouse-backup.log 2>&1

28. Add DNS record

Create DNS A record for staging-connect-grafana.forma3d.be pointing to the staging Droplet.

29. Add Grafana to monitoring

Add Grafana to Uptime Kuma monitoring at https://staging-connect-grafana.forma3d.be/api/health.


Phase 6: Production Parity (4 hours)

Priority: P3 | Impact: Production readiness | Dependencies: Phase 5

30. Replicate setup for production docker-compose

Copy infrastructure configuration to deployment/production/ with these differences:

| Config | Staging | Production |
|---|---|---|
| LOG_LEVEL | debug | info |
| filter/drop-debug processor | Disabled (removed from pipeline) | Enabled |
| ClickHouse TTL (INFO) | 14 days | 30 days |
| ClickHouse TTL (ERROR) | 90 days | 180 days |
| Backup frequency | Daily (no incremental) | Full weekly + daily incremental |
| Grafana alerts | Email only | Slack + Email |

31. Remove Sentry Logs experiment flag cleanup

Remove any remaining Sentry Logs code or configuration across all services.

32. Document runbooks

Add ClickHouse operational runbooks to docs/05-deployment/:

  • How to query logs manually via clickhouse-client
  • How to restore from backup
  • How to check disk usage and TTL status
  • How to force TTL cleanup: OPTIMIZE TABLE otel.otel_logs FINAL


📊 Resource Requirements

Estimated Log Volume

| Service | Est. logs/min (staging) | Est. logs/min (production) |
|---|---|---|
| Gateway | 20 | 100 |
| Order Service | 30 | 150 |
| Print Service | 10 | 50 |
| Shipping Service | 10 | 50 |
| GridFlock Service | 5 | 20 |
| Total | ~75 | ~370 |

Storage Estimates (production, with ClickHouse ~10:1 compression)

| Timeframe | Raw size | Compressed |
|---|---|---|
| 1 day | ~265 MB | ~26 MB |
| 30 days | ~8 GB | ~800 MB |
| 90 days | ~24 GB | ~2.4 GB |
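The figures above follow from the volume table, assuming roughly 500 bytes per structured log record and a ~10:1 compression ratio (both assumptions, not measurements):

```typescript
// Back-of-envelope storage estimate: logs/min × bytes/log × days,
// with an assumed ~10:1 ClickHouse compression ratio.
function storageEstimateMB(
  logsPerMin: number,
  bytesPerLog: number,
  days: number
): { rawMB: number; compressedMB: number } {
  const rawMB = (logsPerMin * 60 * 24 * days * bytesPerLog) / 1_000_000;
  return { rawMB, compressedMB: rawMB / 10 };
}

const oneDay = storageEstimateMB(370, 500, 1); // production: ~370 logs/min
console.log(Math.round(oneDay.rawMB)); // 266 — matches the ~265 MB/day row
```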

Container Resource Allocation

| Container | CPU (cores) | RAM | Disk |
|---|---|---|---|
| ClickHouse | 0.5–1 | 1–2 GB | 10 GB initial |
| OTel Collector | 0.25 | 256 MB | Minimal |
| Grafana | 0.25 | 256 MB | 1 GB |
| Total new | 1–1.5 | 1.5–2.5 GB | ~12 GB |

Droplet Sizing

| Current | Recommended |
|---|---|
| 4 GB RAM / 2 vCPU | 8 GB RAM / 4 vCPU ($48/mo) |

✅ Validation Checklist

Infrastructure

  • ClickHouse container starts and is healthy (SELECT 1 succeeds)
  • OTel Collector container starts and is healthy (:13133/ returns OK)
  • Grafana container starts and is healthy (/api/health returns OK)
  • Grafana ClickHouse plugin installed (grafana-clickhouse-datasource)
  • Grafana data source connection to ClickHouse tested successfully
  • otel_logs table auto-created in ClickHouse otel database
  • Grafana accessible at staging-connect-grafana.forma3d.be via Traefik

Application Integration

  • pino and OTel packages installed at workspace root
  • createLogger factory exported from @forma3d/observability
  • OtelLoggerService exported from @forma3d/observability
  • OTel SDK initialized in instrument.ts for all 5 backend services
  • @opentelemetry/instrumentation-pino enabled (disableLogSending: false)
  • OTEL_EXPORTER_OTLP_ENDPOINT set in all backend service containers
  • Pino logs appear on stdout (Dozzle still works)
  • Logs appear in ClickHouse otel_logs table with correct fields
  • TraceId and SpanId populated in log records (trace correlation working)
  • ServiceName correctly identifies each service

Dashboards

  • "Service Logs Overview" dashboard created and functional
  • "Request Tracing" dashboard created with TraceId filtering
  • "Business Events" dashboard created with event type filtering
  • "System Health" dashboard created with ClickHouse metrics
  • Alerting rules configured (high error rate, service silent, disk usage)

Migration Cut-Over

  • SentryLoggerService replaced by OtelLoggerService in all services
  • _experiments: { enableLogs: true } removed from all Sentry init calls
  • Sentry still captures exceptions correctly (SentryExceptionFilter unchanged)
  • Sentry still captures performance traces correctly
  • No logs appearing in Sentry's log explorer after cut-over
  • All logs flowing to ClickHouse/Grafana

Retention and Backups

  • Tiered TTL applied: DEBUG 7d, INFO 30d, WARN 90d, ERROR/FATAL 180d
  • DigitalOcean Spaces bucket created (forma3d-log-backups in ams3)
  • Backup script executable and runs successfully
  • Cron job configured for nightly backups at 3 AM
  • Spaces lifecycle rules configured for backup cleanup
  • Backup restore tested: RESTORE TABLE otel.otel_logs FROM S3(...) succeeds

Rollback Readiness

  • SentryLoggerService files retained until Phase 5 validated for 2+ weeks
  • Re-enabling _experiments: { enableLogs: true } documented as rollback step
  • Dozzle continues to work as last-resort log viewer

Verification Commands

# ClickHouse health
docker exec forma3d-clickhouse clickhouse-client --query "SELECT 1"

# OTel Collector health
curl http://localhost:13133/

# Grafana health
curl http://localhost:3000/api/health

# Check log count in ClickHouse
docker exec forma3d-clickhouse clickhouse-client --query "SELECT count() FROM otel.otel_logs"

# Check logs by service
docker exec forma3d-clickhouse clickhouse-client --query \
  "SELECT ServiceName, count() FROM otel.otel_logs GROUP BY ServiceName"

# Check TTL status
docker exec forma3d-clickhouse clickhouse-client --query \
  "SELECT name, engine, total_rows, total_bytes FROM system.tables WHERE database = 'otel'"

# Check disk usage
docker exec forma3d-clickhouse clickhouse-client --query \
  "SELECT name, path, free_space, total_space FROM system.disks"

# Build passes
pnpm nx run-many -t build --all

# Tests pass
pnpm nx run-many -t test --all --exclude=api-e2e,acceptance-tests

# Lint passes
pnpm nx run-many -t lint --all

🚫 Constraints and Rules

MUST DO

  • Keep Sentry for error tracking, performance monitoring, tracing, and profiling — only move logging
  • Use OpenTelemetry as the transport layer (vendor-neutral, backend-swappable)
  • Use Pino as the application logger (structured JSON, stdout output, Dozzle compatibility)
  • Bridge Pino to OTel via @opentelemetry/instrumentation-pino (no direct OTel log API calls in application code)
  • Initialize OTel SDK before Sentry init in instrument.ts (so Sentry can link to OTel traces)
  • Use the OTel Collector Contrib distribution (otel/opentelemetry-collector-contrib) — required for the clickhouseexporter
  • Set create_schema: true on the ClickHouse exporter so the otel_logs table is auto-created
  • Use named Docker volumes for ClickHouse data and Grafana data
  • Add health checks to all three new containers
  • Apply tiered TTL retention policy (not a flat retention period)
  • Partition ClickHouse logs by TimestampDate for efficient TTL drops
  • Run dual-write (Phase 2) for at least 1 week before cutting over (Phase 4)
  • Keep SentryLoggerService files until Phase 5 is validated for 2+ weeks
  • Store CLICKHOUSE_PASSWORD and GRAFANA_ADMIN_PASSWORD as secrets — never commit to repository
  • Add OTEL_EXPORTER_OTLP_ENDPOINT to each backend service's environment (no hard dependency on collector)
  • Export createLogger and OtelLoggerService from the shared @forma3d/observability library
  • Add DNS record and Uptime Kuma monitoring for Grafana

MUST NOT

  • Remove or modify Sentry exception filters, performance monitoring, or profiling
  • Remove Sentry SDK or @sentry/nestjs from any service
  • Add depends_on from backend services to the OTel Collector (the OTel SDK handles retries gracefully)
  • Call OTel log APIs directly in application code — always use Pino, let the instrumentation bridge handle it
  • Use ClickHouse for anything other than logs at this stage (no traces, no metrics)
  • Expose ClickHouse ports externally (keep behind Docker network)
  • Hard-code secrets in configuration files or Docker Compose
  • Skip the dual-write validation phase (Phase 2)
  • Delete SentryLoggerService before the cut-over is validated for 2+ weeks
  • Use any, ts-ignore, or eslint-disable
  • Add console.log statements — all logging goes through Pino/OtelLoggerService

SHOULD DO (Nice to Have)

  • Provision Grafana dashboards via JSON files in grafana/provisioning/dashboards/ for reproducibility
  • Add Grafana SMTP configuration for email alerting
  • Set up Grafana Slack integration for critical alerts
  • Configure Grafana IP allowlisting or VPN access for the admin UI
  • Add pino-pretty transport for local development (colored, human-readable logs)
  • Add LOG_LEVEL environment variable per service for fine-grained control
  • Explore routing OTel traces to ClickHouse in a future iteration (currently stays in Sentry)
  • Explore adding browser/frontend logs to the OTel pipeline in a future iteration

🔄 Rollback Plan

If issues arise at any phase:

  1. Phase 1 (infrastructure): Remove the three new containers from docker-compose.yml — no application impact
  2. Phase 2 (dual-write): Revert instrument.ts changes — remove OTel SDK init. Sentry Logs continues working
  3. Phase 3 (dashboards): No rollback needed — dashboards are additive
  4. Phase 4 (cut-over): Re-enable _experiments: { enableLogs: true } in Sentry init, swap OtelLoggerService back to SentryLoggerService
  5. Phase 5 (cleanup): If SentryLoggerService was already deleted, restore from Git history

Dozzle remains available as a last-resort log viewer at all times (reads Docker stdout directly).


📚 Key References

Research:

  • Detailed research document: docs/03-architecture/research/clickhouse-grafana-logging-research.md

Technologies:

  • OpenTelemetry Collector Contrib: https://github.com/open-telemetry/opentelemetry-collector-contrib
  • ClickHouse OTel Exporter: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/clickhouseexporter
  • ClickHouse Docker: https://hub.docker.com/r/clickhouse/clickhouse-server
  • Grafana ClickHouse Plugin: https://grafana.com/grafana/plugins/grafana-clickhouse-datasource/
  • Pino Logger: https://github.com/pinojs/pino
  • OTel Pino Instrumentation: https://github.com/open-telemetry/opentelemetry-js-contrib/tree/main/plugins/node/opentelemetry-instrumentation-pino

Existing Codebase:

  • Current Sentry logger: apps/*/src/observability/services/sentry-logger.service.ts
  • Current instrument files: apps/*/src/observability/instrument.ts
  • Shared observability library: libs/observability/src/
  • OTel config helper: libs/observability/src/lib/otel-config.ts (already exists, currently unused for logging)
  • Docker Compose (staging): deployment/staging/docker-compose.yml
  • Deployment guide: docs/05-deployment/staging-deployment-guide.md


END OF PROMPT


This prompt implements a self-hosted ClickHouse + Grafana + OpenTelemetry centralized logging stack for the Forma3D.Connect microservice platform, as designed in docs/03-architecture/research/clickhouse-grafana-logging-research.md. The AI should deploy ClickHouse, OTel Collector, and Grafana containers via Docker Compose; install Pino and OTel packages; create a shared OtelLoggerService that replaces SentryLoggerService; bridge Pino to OpenTelemetry via @opentelemetry/instrumentation-pino; modify instrument.ts in all 5 backend services to initialize the OTel SDK; build 4 Grafana dashboards with alerting; apply tiered TTL retention; and set up automated backups to DigitalOcean Spaces. Sentry continues to handle error tracking, performance monitoring, tracing, and profiling — only the logging concern moves. A phased rollout with dual-write validation ensures safe migration with rollback at every stage.