AI Prompt: Forma3D.Connect — ClickHouse + Grafana Centralized Logging¶
Purpose: Migrate structured logging from Sentry Logs to a self-hosted ClickHouse + Grafana stack, collected via OpenTelemetry, while keeping Sentry for error tracking, performance monitoring, and profiling
Estimated Effort: 16–24 hours (infrastructure + application integration + dashboards + backups)
Prerequisites: All microservices running with the current SentryLoggerService and instrument.ts files; staging Docker Compose deployment operational; DigitalOcean Droplet with at least 8 GB RAM / 4 vCPU (or plan to upsize from 4 GB)
Output: ClickHouse storing structured logs via OpenTelemetry, Grafana dashboards for log visualization and alerting, automated backups to DigitalOcean Spaces, all services logging via Pino bridged to OTel — with Sentry retained for errors/traces/profiling
Status: 🚧 TODO
Research: docs/03-architecture/research/clickhouse-grafana-logging-research.md
🎯 Mission¶
Implement a self-hosted ClickHouse + Grafana + OpenTelemetry centralized logging stack for the Forma3D.Connect microservice platform. This replaces only the logging concern currently handled by Sentry Logs (Sentry.logger.*). Sentry continues to handle error tracking, performance monitoring, tracing, and profiling — unchanged.
What this delivers:
- OpenTelemetry Collector — receives logs from all services via OTLP gRPC, batches and exports to ClickHouse
- ClickHouse — high-performance columnar storage for structured logs with TTL-based retention and S3 backup
- Grafana — log visualization dashboards with the ClickHouse plugin, alerting rules, and log exploration
- Pino + OTel bridge — replaces SentryLoggerService with a Pino logger auto-bridged to OpenTelemetry via @opentelemetry/instrumentation-pino
- Tiered retention — ERROR/FATAL logs kept 180 days, WARN 90 days, INFO 30 days, DEBUG 7 days
- Automated backups — nightly ClickHouse backups to DigitalOcean Spaces (S3-compatible)
- Dozzle coexistence — Pino still writes to stdout, so Docker log viewers (Dozzle) continue working
Why move logging away from Sentry:
- Cost: Sentry Logs is a metered feature; high-volume logging becomes expensive at scale
- Retention: Sentry retains logs for limited periods (30 days on Team plan); ClickHouse allows arbitrary retention
- Querying: ClickHouse offers sub-second analytical queries on billions of log rows; Sentry's log search is limited
- Ownership: Logs contain sensitive business data — self-hosting provides full data sovereignty
- Flexibility: Grafana dashboards are far more customizable than Sentry's log explorer
- Vendor independence: OpenTelemetry is vendor-neutral — the backend can be swapped without app changes
What stays with Sentry (unchanged):
- Exception capture (Sentry.captureException, SentryExceptionFilter)
- Performance traces (Sentry's OTel integration)
- Profiling (nodeProfilingIntegration)
- Frontend logging (React/browser logs stay in Sentry for now)
📐 Architecture¶
Data Flow¶
┌───────────────────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
│ │
│ ┌─────────┐ ┌───────────┐ ┌───────────┐ ┌──────────┐ ┌───────────┐ │
│ │ Gateway │ │ Order Svc │ │ Print Svc │ │Ship. Svc │ │GridFlock │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ Pino │ │ Pino │ │ Pino │ │ Pino │ │ Pino │ │
│ │ + OTel │ │ + OTel │ │ + OTel │ │ + OTel │ │ + OTel │ │
│ └────┬────┘ └─────┬─────┘ └─────┬─────┘ └────┬─────┘ └─────┬─────┘ │
│ │ │ │ │ │ │
│ └─────────────┴──────────────┴─────────────┴──────────────┘ │
│ │ │
│ OTLP (gRPC :4317) │
└────────────────────────────────────┼──────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────────────┐
│ COLLECTION LAYER │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ OpenTelemetry Collector (Contrib) │ │
│ │ │ │
│ │ Receivers: otlp (gRPC + HTTP) │ │
│ │ Processors: batch, resource, │ │
│ │ attributes, filter │ │
│ │ Exporters: clickhouse │ │
│ └──────────────────┬───────────────────┘ │
│ │ │
└─────────────────────┼──────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────────────┐
│ STORAGE + VISUALIZATION │
│ │
│ ┌──────────────────┐ ┌──────────────────────────┐ │
│ │ ClickHouse │◀─────────│ Grafana │ │
│ │ │ query │ │ │
│ │ otel_logs DB │ │ ClickHouse Data Source │ │
│ │ TTL: tiered │ │ Log dashboards │ │
│ │ │ │ Alert rules │ │
│ └────────┬─────────┘ └──────────────────────────┘ │
│ │ │
│ │ Nightly backup │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ DigitalOcean Spaces │ │
│ │ (S3-compatible) │ │
│ │ forma3d-log-backups/ │ │
│ └──────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
Sentry Coexistence¶
Sentry continues to handle errors, traces, and profiling. The instrument.ts files in each service keep their current Sentry init but drop _experiments: { enableLogs: true }. The SentryLoggerService is replaced by an OtelLoggerService that writes to Pino (which is bridged to OTel).
Component Summary¶
| Component | Image | Purpose | Port(s) |
|---|---|---|---|
| OTel Collector | otel/opentelemetry-collector-contrib:0.120.0 | Receives OTLP logs, batches, exports to ClickHouse | 4317 (gRPC), 4318 (HTTP), 13133 (health), 8888 (metrics) |
| ClickHouse | clickhouse/clickhouse-server:24.12-alpine | Columnar log storage with TTL and S3 backup | 9000 (native TCP), 8123 (HTTP) |
| Grafana | grafana/grafana-oss:11.5.0 | Dashboard visualization, alerting | 3000 |
📋 Services Affected¶
| Service | Logging changes | Sentry changes |
|---|---|---|
| Gateway | Yes — add OTel SDK + Pino logger | Remove enableLogs experiment flag |
| Order Service | Yes — add OTel SDK + Pino logger | Remove enableLogs experiment flag |
| Print Service | Yes — add OTel SDK + Pino logger | Remove enableLogs experiment flag |
| Shipping Service | Yes — add OTel SDK + Pino logger | Remove enableLogs experiment flag |
| GridFlock Service | Yes — add OTel SDK + Pino logger | Remove enableLogs experiment flag |
| Web (React) | No (frontend logs stay in Sentry) | No change |
Files Modified per Service¶
| Action | Files |
|---|---|
| Modify | observability/instrument.ts — add OTel SDK init for logs |
| Replace | observability/services/sentry-logger.service.ts → otel-logger.service.ts |
| Modify | observability/observability.module.ts — swap provider |
| Keep | observability/filters/sentry-exception.filter.ts — unchanged |
| Modify | observability/interceptors/logging.interceptor.ts — swap logger injection |
| Modify | observability/services/business-observability.service.ts — swap logger injection |
📁 Files to Create/Modify¶
New Files — Infrastructure¶
deployment/staging/otel-collector-config.yaml # OTel Collector pipeline configuration
deployment/staging/clickhouse-config.xml # ClickHouse server configuration
deployment/staging/clickhouse-users.xml # ClickHouse user access (reference)
deployment/staging/grafana/provisioning/datasources/clickhouse.yaml # Grafana ClickHouse datasource
deployment/staging/scripts/backup-clickhouse-logs.sh # Automated ClickHouse backup script
New Files — Application¶
libs/observability/src/lib/otel-logger.ts # Pino logger factory (createLogger)
libs/observability/src/lib/services/otel-logger.service.ts # NestJS injectable logger service (replaces SentryLoggerService)
Modified Files — Infrastructure¶
deployment/staging/docker-compose.yml # Add otel-collector, clickhouse, grafana services + volumes
deployment/staging/.env.example # Add CLICKHOUSE_PASSWORD, GRAFANA_ADMIN_PASSWORD, DO_SPACES_* vars
Modified Files — Application (per service)¶
apps/*/src/observability/instrument.ts # Add OTel SDK init, remove enableLogs experiment
apps/*/src/observability/observability.module.ts # Swap SentryLoggerService → OtelLoggerService
apps/*/src/observability/interceptors/logging.interceptor.ts # Swap logger injection
apps/*/src/observability/services/business-observability.service.ts # Swap logger injection
Modified Files — Shared Library¶
libs/observability/src/index.ts # Export createLogger and OtelLoggerService
Deleted Files (Phase 5 only)¶
apps/*/src/observability/services/sentry-logger.service.ts # Replaced by OtelLoggerService
🔧 Implementation Phases¶
Phase 1: Deploy Infrastructure (1 day)¶
Priority: P0 | Impact: Foundation | Dependencies: None
Deploy ClickHouse, OTel Collector, and Grafana containers on staging. Verify they start, connect, and are healthy.
1. Create OTel Collector configuration¶
Create deployment/staging/otel-collector-config.yaml:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 5s
send_batch_size: 10000
send_batch_max_size: 20000
resource:
attributes:
- key: deployment.environment
value: "${ENVIRONMENT}"
action: upsert
- key: host.name
value: "${HOSTNAME}"
action: upsert
filter/drop-debug:
logs:
log_record:
- 'severity_number < 9'
exporters:
clickhouse:
endpoint: tcp://clickhouse:9000
database: otel
logs_table_name: otel_logs
timeout: 10s
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
create_schema: true
service:
pipelines:
logs:
receivers: [otlp]
processors: [resource, filter/drop-debug, batch]
exporters: [clickhouse]
telemetry:
logs:
level: warn
metrics:
address: 0.0.0.0:8888
extensions: [health_check]
extensions:
health_check:
endpoint: 0.0.0.0:13133
Notes:
- create_schema: true auto-creates the otel_logs table on first run
- filter/drop-debug prevents debug-level logs from reaching ClickHouse in production (disable on staging by removing from the pipeline processors list)
- batch processor is critical for performance — ClickHouse ingests best in large batches
2. Create ClickHouse server configuration¶
Create deployment/staging/clickhouse-config.xml. Note that ClickHouse does not expand ${VAR} placeholders in XML config — either render this file at deploy time (e.g. with envsubst) or use ClickHouse's from_env attribute for the secret values:
<?xml version="1.0"?>
<clickhouse>
<listen_host>0.0.0.0</listen_host>
<logger>
<level>warning</level>
<log>/var/log/clickhouse-server/clickhouse-server.log</log>
<errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
<size>100M</size>
<count>3</count>
</logger>
<max_server_memory_usage_to_ram_ratio>0.8</max_server_memory_usage_to_ram_ratio>
<backups>
<allowed_path>/backups/</allowed_path>
<allowed_disk>s3_backups</allowed_disk>
</backups>
<storage_configuration>
<disks>
<s3_backups>
<type>s3</type>
<endpoint>https://${DO_SPACES_REGION}.digitaloceanspaces.com/${DO_SPACES_BUCKET}/clickhouse-backups/</endpoint>
<access_key_id>${DO_SPACES_KEY}</access_key_id>
<secret_access_key>${DO_SPACES_SECRET}</secret_access_key>
</s3_backups>
</disks>
</storage_configuration>
</clickhouse>
3. Create ClickHouse users configuration¶
Create deployment/staging/clickhouse-users.xml (reference — may use CLICKHOUSE_PASSWORD env var approach instead):
<?xml version="1.0"?>
<clickhouse>
<users>
<otel>
<password_sha256_hex replace="true"><!-- generated hash --></password_sha256_hex>
<networks>
<ip>::/0</ip>
</networks>
<profile>default</profile>
<quota>default</quota>
<access_management>0</access_management>
</otel>
</users>
</clickhouse>
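The `<!-- generated hash -->` placeholder expects a SHA-256 hex digest of the plaintext password. One way to generate it (the `hash_password` helper below is illustrative, not part of the repo):

```shell
# Compute the password_sha256_hex value for clickhouse-users.xml.
# Feed it the real CLICKHOUSE_PASSWORD; printf '%s' avoids hashing a
# trailing newline, which would silently change the digest.
hash_password() {
  printf '%s' "$1" | sha256sum | awk '{print $1}'
}

hash_password "password"
# → 5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8
```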
4. Create Grafana ClickHouse datasource provisioning¶
Create deployment/staging/grafana/provisioning/datasources/clickhouse.yaml:
apiVersion: 1
datasources:
- name: ClickHouse
type: grafana-clickhouse-datasource
access: proxy
isDefault: true
jsonData:
host: clickhouse
port: 9000
protocol: native
username: otel
defaultDatabase: otel
logs:
defaultDatabase: otel
defaultTable: otel_logs
otelEnabled: true
otelVersion: latest
timeColumn: Timestamp
levelColumn: SeverityText
messageColumn: Body
secureJsonData:
password: ${CLICKHOUSE_PASSWORD}
5. Add services to Docker Compose¶
Add the following to deployment/staging/docker-compose.yml:
Services to add:
- otel-collector — OTel Collector Contrib with health check, depends on ClickHouse
- clickhouse — ClickHouse server with named volumes, health check, ulimits
- grafana — Grafana OSS with ClickHouse plugin, provisioning volumes, Traefik labels
Volumes to add:
- clickhouse-data
- clickhouse-logs
- grafana-data
Environment variables for existing backend services:
Add OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 to each backend service (Gateway, Order Service, Print Service, Shipping Service, GridFlock Service). Do NOT add a depends_on entry for the OTel Collector — the OTel SDK buffers and retries if the collector is down.
See the research document (Section 6.1) for the complete Docker Compose service definitions including Traefik labels for Grafana at staging-connect-grafana.forma3d.be.
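A minimal sketch of the three service entries, assuming the image tags from the component table (mount paths, ulimit values, and healthcheck settings here are illustrative; the research document's Section 6.1 definitions, including the Traefik labels, remain authoritative):

```yaml
services:
  clickhouse:
    image: clickhouse/clickhouse-server:24.12-alpine
    environment:
      CLICKHOUSE_DB: otel
      CLICKHOUSE_USER: otel
      CLICKHOUSE_PASSWORD: ${CLICKHOUSE_PASSWORD}
    volumes:
      - clickhouse-data:/var/lib/clickhouse
      - clickhouse-logs:/var/log/clickhouse-server
      - ./clickhouse-config.xml:/etc/clickhouse-server/config.d/custom.xml:ro
    ulimits:
      nofile: { soft: 262144, hard: 262144 }
    healthcheck:
      test: ["CMD", "clickhouse-client", "--query", "SELECT 1"]
      interval: 10s
      retries: 5

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.120.0
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol/config.yaml:ro
    depends_on:
      clickhouse:
        condition: service_healthy

  grafana:
    image: grafana/grafana-oss:11.5.0
    environment:
      GF_SECURITY_ADMIN_USER: ${GRAFANA_ADMIN_USER}
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_INSTALL_PLUGINS: grafana-clickhouse-datasource
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro

volumes:
  clickhouse-data:
  clickhouse-logs:
  grafana-data:
```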
6. Add environment variables¶
Update deployment/staging/.env.example with:
# ClickHouse
CLICKHOUSE_PASSWORD=<generate-strong-password>
# Grafana
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<generate-strong-password>
# DigitalOcean Spaces (for ClickHouse backups)
DO_SPACES_KEY=<generated-key>
DO_SPACES_SECRET=<generated-secret>
DO_SPACES_REGION=ams3
DO_SPACES_BUCKET=forma3d-log-backups
7. Verify infrastructure¶
- Start the stack: docker compose up -d clickhouse otel-collector grafana
- Verify ClickHouse health: docker exec forma3d-clickhouse clickhouse-client --query "SELECT 1"
- Verify OTel Collector health: curl http://localhost:13133/
- Verify Grafana health: curl http://localhost:3000/api/health
- Verify Grafana can query ClickHouse: open Grafana UI → Data Sources → ClickHouse → Test Connection
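Beyond health endpoints, the whole ingest path can be smoke-tested by hand-rolling a minimal OTLP/JSON log record and posting it to the collector's HTTP receiver (the service name "smoke-test" is arbitrary, chosen here so the row is easy to find):

```shell
# Build a minimal OTLP/JSON payload: one INFO record from a fake "smoke-test" service.
otlp_payload() {
  cat <<EOF
{"resourceLogs":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"smoke-test"}}]},"scopeLogs":[{"logRecords":[{"timeUnixNano":"$(date +%s)000000000","severityNumber":9,"severityText":"INFO","body":{"stringValue":"pipeline smoke test"}}]}]}]}
EOF
}

# Post it to the collector's OTLP/HTTP port, then look for the row in ClickHouse:
#   curl -sS -X POST -H 'Content-Type: application/json' \
#     -d "$(otlp_payload)" http://localhost:4318/v1/logs
#   docker exec forma3d-clickhouse clickhouse-client --query \
#     "SELECT Body FROM otel.otel_logs WHERE ServiceName = 'smoke-test'"
```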
Phase 2: Application Integration — Dual-Write (1 week)¶
Priority: P1 | Impact: Core integration | Dependencies: Phase 1
Keep SentryLoggerService active. Add OTel SDK to each service so logs are sent to both Sentry and ClickHouse in parallel. This validates the pipeline without risk.
8. Install required packages¶
pnpm add -w pino @opentelemetry/sdk-node @opentelemetry/api \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-logs-otlp-grpc \
@opentelemetry/sdk-logs \
@opentelemetry/instrumentation-pino \
@opentelemetry/resources \
@opentelemetry/semantic-conventions
pnpm add -D -w pino-pretty   # Pino ships its own TypeScript types; the deprecated @types/pino stub is not needed
9. Create Pino logger factory in shared library¶
Create libs/observability/src/lib/otel-logger.ts:
import pino from 'pino';
import { getServiceName } from './service-context';
export function createLogger(context?: string): pino.Logger {
const logger = pino({
level: process.env['LOG_LEVEL'] || 'info',
transport:
process.env['NODE_ENV'] === 'development'
? { target: 'pino-pretty', options: { colorize: true } }
: undefined,
});
return context ? logger.child({ context }) : logger;
}
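For reference, each call produces a JSON record in which Pino merges the child logger's bound fields with the per-call attributes and stores the message under msg. A dependency-free stand-in of that merge (not the real Pino implementation, just the resulting shape with level/time omitted):

```typescript
// Stand-in for the record shape produced by createLogger(context) plus a log call.
type Attrs = Record<string, unknown>;

function renderLogRecord(context: string | undefined, attributes: Attrs, msg: string): Attrs {
  // Child-bound fields first, then per-call attributes, then the message key.
  return { ...(context ? { context } : {}), ...attributes, msg };
}

const record = renderLogRecord('business', { orderId: 'ord_123' }, 'Order created');
console.log(JSON.stringify(record));
// → {"context":"business","orderId":"ord_123","msg":"Order created"}
```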
10. Create OtelLoggerService in shared library¶
Create libs/observability/src/lib/services/otel-logger.service.ts:
import { Injectable } from '@nestjs/common';
import { createLogger } from '../otel-logger';
import type pino from 'pino';
@Injectable()
export class OtelLoggerService {
private readonly logger: pino.Logger;
constructor() {
this.logger = createLogger('business');
}
info(message: string, attributes?: Record<string, unknown>): void {
this.logger.info(attributes, message);
}
warn(message: string, attributes?: Record<string, unknown>): void {
this.logger.warn(attributes, message);
}
error(message: string, attributes?: Record<string, unknown>): void {
this.logger.error(attributes, message);
}
debug(message: string, attributes?: Record<string, unknown>): void {
this.logger.debug(attributes, message);
}
logEvent(eventType: string, message: string, attributes?: Record<string, unknown>): void {
this.logger.info({ ...attributes, eventType }, message);
}
logAudit(action: string, success: boolean, attributes?: Record<string, unknown>): void {
const level = success ? 'info' : 'warn';
this.logger[level]({ ...attributes, action, success, category: 'audit' }, `Audit: ${action}`);
}
}
11. Export new modules from shared library¶
Update libs/observability/src/index.ts to export createLogger and OtelLoggerService.
12. Modify instrument.ts in each backend service¶
In each service's observability/instrument.ts, add OTel SDK initialization before Sentry init. The OTel SDK must be initialized first so that Sentry can link to OTel traces.
The key changes per instrument.ts:
- Import NodeSDK, OTLPLogExporter, BatchLogRecordProcessor, getNodeAutoInstrumentations, Resource, and semantic conventions
- Read getOtelConfig(SERVICE_NAME) from the shared observability library
- If otelConfig.exporterEndpoint is set, create and start a NodeSDK instance with:
  - logRecordProcessors using BatchLogRecordProcessor + OTLPLogExporter
  - instrumentations with getNodeAutoInstrumentations enabling @opentelemetry/instrumentation-pino (disableLogSending: false)
- Disable unused instrumentations (e.g., @opentelemetry/instrumentation-fs)
- Keep Sentry init unchanged during the dual-write phase (do NOT remove _experiments: { enableLogs: true } yet)
See the research document (Section 5.3) for the complete instrument.ts code.
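A condensed, non-runnable sketch of that bootstrap, assuming the package APIs named in the bullets above (SERVICE_NAME and getOtelConfig are the service's existing constants/helpers; the research document's Section 5.3 code is authoritative):

```typescript
// Sketch only: OTel SDK bootstrap at the top of instrument.ts, BEFORE Sentry.init().
import { NodeSDK } from '@opentelemetry/sdk-node';
import { BatchLogRecordProcessor } from '@opentelemetry/sdk-logs';
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { getOtelConfig } from '@forma3d/observability'; // existing shared helper

const otelConfig = getOtelConfig(SERVICE_NAME); // SERVICE_NAME: the service's existing constant

if (otelConfig.exporterEndpoint) {
  const sdk = new NodeSDK({
    logRecordProcessors: [
      new BatchLogRecordProcessor(new OTLPLogExporter({ url: otelConfig.exporterEndpoint })),
    ],
    instrumentations: [
      getNodeAutoInstrumentations({
        // Bridge Pino records into OTel log records.
        '@opentelemetry/instrumentation-pino': { disableLogSending: false },
        // Noisy and unused for this workload.
        '@opentelemetry/instrumentation-fs': { enabled: false },
      }),
    ],
  });
  sdk.start();
}

// ... Sentry.init({ ... }) stays below, unchanged during dual-write.
```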
13. Validate logs appear in Grafana¶
- Deploy updated services to staging
- Generate some log activity (e.g., process a test order)
- Open Grafana → Explore → select the ClickHouse datasource → query the otel_logs table
- Verify logs from all services appear with correct ServiceName, SeverityText, Body, and TraceId
Phase 3: Build Grafana Dashboards (3–5 days)¶
Priority: P1 | Impact: Visualization | Dependencies: Phase 2 (logs flowing)
14. Create "Service Logs Overview" dashboard¶
- Log volume over time (by service) — time series panel
- Error rate by service — stacked bar chart
- Latest error logs — table panel
- Log level distribution — pie chart
15. Create "Request Tracing" dashboard¶
- Logs filtered by a TraceId variable
- Correlated with Sentry traces (link out to Sentry)
- Request lifecycle visualization
16. Create "Business Events" dashboard¶
- Order processing events (filtered by the eventType log attribute)
- Print job status changes
- Shipment events
- Webhook receipt/processing logs
17. Create "System Health" dashboard¶
- OTel Collector throughput (records/sec) — from Prometheus metrics on :8888
- ClickHouse disk usage — query system.disks
- Log ingestion latency
18. Configure alerting rules¶
| Alert | Condition | Channel |
|---|---|---|
| High error rate | > 50 ERROR logs in 5 min for any service | Slack / Email |
| Service silent | Zero logs from a service for > 10 min | Slack |
| ClickHouse disk > 80% | Query system.disks | |
| OTel Collector unhealthy | Health check fail | Slack |
19. Useful Grafana queries for reference¶
Error log count by service (last 24h):
SELECT
ServiceName,
count() AS error_count
FROM otel.otel_logs
WHERE SeverityText IN ('ERROR', 'FATAL')
AND Timestamp >= now() - INTERVAL 24 HOUR
GROUP BY ServiceName
ORDER BY error_count DESC
Log volume by level (time series):
SELECT
toStartOfFiveMinutes(Timestamp) AS time,
SeverityText,
count() AS count
FROM otel.otel_logs
WHERE Timestamp >= $__fromTime AND Timestamp <= $__toTime
GROUP BY time, SeverityText
ORDER BY time
Search logs by keyword:
SELECT
Timestamp,
ServiceName,
SeverityText,
Body,
LogAttributes
FROM otel.otel_logs
WHERE Body LIKE '%order%'
AND Timestamp >= $__fromTime AND Timestamp <= $__toTime
ORDER BY Timestamp DESC
LIMIT 100
Phase 4: Cut Over — Replace SentryLoggerService (1 day)¶
Priority: P2 | Impact: Migration | Dependencies: Phase 3 (dashboards validated)
20. Swap SentryLoggerService → OtelLoggerService in all services¶
In each service's observability/observability.module.ts:
- Replace SentryLoggerService provider with OtelLoggerService from @forma3d/observability
- Update all injection sites (logging.interceptor.ts, business-observability.service.ts, etc.)
21. Remove Sentry Logs experiment flag¶
In each service's instrument.ts:
- Remove _experiments: { enableLogs: true } from the Sentry.init() call
- Remove any Sentry.logger.* imports/calls
22. Verify cut-over¶
- Deploy updated services to staging
- Verify logs no longer appear in Sentry's log explorer
- Verify logs continue to appear in Grafana
- Verify Sentry still captures exceptions and traces correctly
- Verify Dozzle still shows container logs (Pino writes to stdout)
Phase 5: Cleanup and Backups (1 day)¶
Priority: P2 | Impact: Completion | Dependencies: Phase 4 validated for 2+ weeks
23. Delete SentryLoggerService files¶
Remove apps/*/src/observability/services/sentry-logger.service.ts from all services.
24. Apply tiered TTL policy¶
Connect to ClickHouse and run:
ALTER TABLE otel.otel_logs MODIFY TTL
  TimestampDate + INTERVAL 7 DAY DELETE WHERE SeverityText IN ('TRACE', 'DEBUG'),
  TimestampDate + INTERVAL 30 DAY DELETE WHERE SeverityText = 'INFO',
  TimestampDate + INTERVAL 90 DAY DELETE WHERE SeverityText = 'WARN',
  TimestampDate + INTERVAL 180 DAY;
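After the ALTER, the applied TTL clauses can be confirmed from the system tables (a quick check, not part of the migration itself):

```sql
-- Show the full engine definition, including the TTL clauses just applied
SELECT engine_full
FROM system.tables
WHERE database = 'otel' AND name = 'otel_logs';
```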
25. Set up DigitalOcean Spaces bucket¶
- Create the forma3d-log-backups bucket in the ams3 region
- Disable CDN, disable versioning
- Generate Spaces access key and secret
- Add lifecycle rules:
| Path prefix | Expiration |
|---|---|
| clickhouse/full/ | 35 days |
| clickhouse/incremental/ | 14 days |
| clickhouse/archive/ | 365 days |
26. Create backup script¶
Create deployment/staging/scripts/backup-clickhouse-logs.sh with:
- Full backup on Sundays
- Incremental backup on weekdays (referencing last full)
- Logging to /var/log/clickhouse-backup.log
See the research document (Section 8.4) for the complete backup script.
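The script's full-vs-incremental decision can be sketched as follows (backup_kind and the BACKUP statements are illustrative; the real script lives in Section 8.4 of the research document):

```shell
# Decide backup kind from the day of week (1=Mon .. 7=Sun, as printed by `date +%u`):
# full on Sundays, incremental otherwise.
backup_kind() {
  if [ "$1" -eq 7 ]; then echo "full"; else echo "incremental"; fi
}

kind=$(backup_kind "$(date +%u)")
stamp=$(date +%Y-%m-%d)

# Illustrative BACKUP statements against the s3_backups disk configured earlier:
#   full:        BACKUP TABLE otel.otel_logs TO Disk('s3_backups', 'full/<stamp>/')
#   incremental: BACKUP TABLE otel.otel_logs TO Disk('s3_backups', 'incremental/<stamp>/')
#                SETTINGS base_backup = Disk('s3_backups', 'full/<last-full-stamp>/')
echo "would run a ${kind} backup dated ${stamp}"
```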
27. Configure backup cron¶
Add to the Droplet's crontab:
0 3 * * * /opt/forma3d/scripts/backup-clickhouse-logs.sh >> /var/log/clickhouse-backup.log 2>&1
28. Add DNS record¶
Create DNS A record for staging-connect-grafana.forma3d.be pointing to the staging Droplet.
29. Add Grafana to monitoring¶
Add Grafana to Uptime Kuma monitoring at https://staging-connect-grafana.forma3d.be/api/health.
Phase 6: Production Parity (4 hours)¶
Priority: P3 | Impact: Production readiness | Dependencies: Phase 5
30. Replicate setup for production docker-compose¶
Copy infrastructure configuration to deployment/production/ with these differences:
| Config | Staging | Production |
|---|---|---|
| LOG_LEVEL | debug | info |
| filter/drop-debug processor | Disabled (remove from pipeline) | Enabled |
| ClickHouse TTL (INFO) | 14 days | 30 days |
| ClickHouse TTL (ERROR) | 90 days | 180 days |
| Backup frequency | Daily (no incremental) | Full weekly + daily incremental |
| Grafana alerts | Email only | Slack + Email |
31. Remove Sentry Logs experiment flag cleanup¶
Remove any remaining Sentry Logs code or configuration across all services.
32. Document runbooks¶
Add ClickHouse operational runbooks to docs/05-deployment/:
- How to query logs manually via clickhouse-client
- How to restore from backup
- How to check disk usage and TTL status
- How to force TTL cleanup: OPTIMIZE TABLE otel.otel_logs FINAL
📊 Resource Requirements¶
Estimated Log Volume¶
| Service | Est. logs/min (staging) | Est. logs/min (production) |
|---|---|---|
| Gateway | 20 | 100 |
| Order Service | 30 | 150 |
| Print Service | 10 | 50 |
| Shipping Service | 10 | 50 |
| GridFlock Service | 5 | 20 |
| Total | ~75 | ~370 |
Storage Estimates (production, with ClickHouse ~10:1 compression)¶
| Timeframe | Raw size | Compressed |
|---|---|---|
| 1 day | ~265 MB | ~26 MB |
| 30 days | ~8 GB | ~800 MB |
| 90 days | ~24 GB | ~2.4 GB |
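As a sanity check on the table above, assuming roughly 500 bytes per structured log record (an assumption for illustration, not a measured value):

```typescript
// Back-of-envelope check for the production storage estimates.
const logsPerMinute = 370;    // total from the log-volume table
const bytesPerRecord = 500;   // assumed average size of one structured log row

const rawPerDay = logsPerMinute * 60 * 24 * bytesPerRecord; // bytes/day
const compressedPerDay = rawPerDay / 10;                    // ~10:1 ClickHouse compression

console.log((rawPerDay / 1e6).toFixed(0) + ' MB raw/day');               // ≈ 266 MB
console.log((compressedPerDay / 1e6).toFixed(1) + ' MB compressed/day'); // ≈ 26.6 MB
```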
Container Resource Allocation¶
| Container | CPU (cores) | RAM | Disk |
|---|---|---|---|
| ClickHouse | 0.5–1 | 1–2 GB | 10 GB initial |
| OTel Collector | 0.25 | 256 MB | Minimal |
| Grafana | 0.25 | 256 MB | 1 GB |
| Total new | 1–1.5 | 1.5–2.5 GB | ~12 GB |
Droplet Sizing¶
| Current | Recommended |
|---|---|
| 4 GB RAM / 2 vCPU | 8 GB RAM / 4 vCPU ($48/mo) |
✅ Validation Checklist¶
Infrastructure¶
- ClickHouse container starts and is healthy (SELECT 1 succeeds)
- OTel Collector container starts and is healthy (:13133/ returns OK)
- Grafana container starts and is healthy (/api/health returns OK)
- Grafana ClickHouse plugin installed (grafana-clickhouse-datasource)
- Grafana data source connection to ClickHouse tested successfully
- otel_logs table auto-created in the ClickHouse otel database
- Grafana accessible at staging-connect-grafana.forma3d.be via Traefik
Application Integration¶
- pino and OTel packages installed at workspace root
- createLogger factory exported from @forma3d/observability
- OtelLoggerService exported from @forma3d/observability
- OTel SDK initialized in instrument.ts for all 5 backend services
- @opentelemetry/instrumentation-pino enabled (disableLogSending: false)
- OTEL_EXPORTER_OTLP_ENDPOINT set in all backend service containers
- Pino logs appear on stdout (Dozzle still works)
- Logs appear in the ClickHouse otel_logs table with correct fields
- TraceId and SpanId populated in log records (trace correlation working)
- ServiceName correctly identifies each service
Dashboards¶
- "Service Logs Overview" dashboard created and functional
- "Request Tracing" dashboard created with TraceId filtering
- "Business Events" dashboard created with event type filtering
- "System Health" dashboard created with ClickHouse metrics
- Alerting rules configured (high error rate, service silent, disk usage)
Migration Cut-Over¶
- SentryLoggerService replaced by OtelLoggerService in all services
- _experiments: { enableLogs: true } removed from all Sentry init calls
- Sentry still captures exceptions correctly (SentryExceptionFilter unchanged)
- Sentry still captures performance traces correctly
- No logs appearing in Sentry's log explorer after cut-over
- All logs flowing to ClickHouse/Grafana
Retention and Backups¶
- Tiered TTL applied: DEBUG 7d, INFO 30d, WARN 90d, ERROR/FATAL 180d
- DigitalOcean Spaces bucket created (forma3d-log-backups in ams3)
- Backup script executable and runs successfully
- Cron job configured for nightly backups at 3 AM
- Spaces lifecycle rules configured for backup cleanup
- Backup restore tested: RESTORE TABLE otel.otel_logs FROM S3(...) succeeds
Rollback Readiness¶
- SentryLoggerService files retained until Phase 5 is validated for 2+ weeks
- Re-enabling _experiments: { enableLogs: true } documented as the rollback step
- Dozzle continues to work as a last-resort log viewer
Verification Commands¶
# ClickHouse health
docker exec forma3d-clickhouse clickhouse-client --query "SELECT 1"
# OTel Collector health
curl http://localhost:13133/
# Grafana health
curl http://localhost:3000/api/health
# Check log count in ClickHouse
docker exec forma3d-clickhouse clickhouse-client --query "SELECT count() FROM otel.otel_logs"
# Check logs by service
docker exec forma3d-clickhouse clickhouse-client --query \
"SELECT ServiceName, count() FROM otel.otel_logs GROUP BY ServiceName"
# Check TTL status
docker exec forma3d-clickhouse clickhouse-client --query \
"SELECT name, engine, total_rows, total_bytes FROM system.tables WHERE database = 'otel'"
# Check disk usage
docker exec forma3d-clickhouse clickhouse-client --query \
"SELECT name, path, free_space, total_space FROM system.disks"
# Build passes
pnpm nx run-many -t build --all
# Tests pass
pnpm nx run-many -t test --all --exclude=api-e2e,acceptance-tests
# Lint passes
pnpm nx run-many -t lint --all
🚫 Constraints and Rules¶
MUST DO¶
- Keep Sentry for error tracking, performance monitoring, tracing, and profiling — only move logging
- Use OpenTelemetry as the transport layer (vendor-neutral, backend-swappable)
- Use Pino as the application logger (structured JSON, stdout output, Dozzle compatibility)
- Bridge Pino to OTel via @opentelemetry/instrumentation-pino (no direct OTel log API calls in application code)
- Initialize the OTel SDK before Sentry init in instrument.ts (so Sentry can link to OTel traces)
- Use the OTel Collector Contrib distribution (otel/opentelemetry-collector-contrib) — required for the clickhouse exporter
- Set create_schema: true on the ClickHouse exporter so the otel_logs table is auto-created
- Use named Docker volumes for ClickHouse data and Grafana data
- Add health checks to all three new containers
- Apply tiered TTL retention policy (not a flat retention period)
- Partition ClickHouse logs by TimestampDate for efficient TTL drops
- Run dual-write (Phase 2) for at least 1 week before cutting over (Phase 4)
- Keep SentryLoggerService files until Phase 5 is validated for 2+ weeks
- Store CLICKHOUSE_PASSWORD and GRAFANA_ADMIN_PASSWORD as secrets — never commit them to the repository
- Add OTEL_EXPORTER_OTLP_ENDPOINT to each backend service's environment (no hard dependency on the collector)
- Export createLogger and OtelLoggerService from the shared @forma3d/observability library
- Add DNS record and Uptime Kuma monitoring for Grafana
MUST NOT¶
- Remove or modify Sentry exception filters, performance monitoring, or profiling
- Remove the Sentry SDK or @sentry/nestjs from any service
- Add depends_on from backend services to the OTel Collector (the OTel SDK handles retries gracefully)
- Call OTel log APIs directly in application code — always use Pino, let the instrumentation bridge handle it
- Use ClickHouse for anything other than logs at this stage (no traces, no metrics)
- Expose ClickHouse ports externally (keep behind Docker network)
- Hard-code secrets in configuration files or Docker Compose
- Skip the dual-write validation phase (Phase 2)
- Delete SentryLoggerService before the cut-over is validated for 2+ weeks
- Use any, ts-ignore, or eslint-disable
- Add console.log statements — all logging goes through Pino/OtelLoggerService
SHOULD DO (Nice to Have)¶
- Provision Grafana dashboards via JSON files in grafana/provisioning/dashboards/ for reproducibility
- Add Grafana SMTP configuration for email alerting
- Set up Grafana Slack integration for critical alerts
- Configure Grafana IP allowlisting or VPN access for the admin UI
- Add the pino-pretty transport for local development (colored, human-readable logs)
- Add a LOG_LEVEL environment variable per service for fine-grained control
- Explore routing OTel traces to ClickHouse in a future iteration (tracing currently stays in Sentry)
- Explore adding browser/frontend logs to the OTel pipeline in a future iteration
🔄 Rollback Plan¶
If issues arise at any phase:
- Phase 1 (infrastructure): Remove the three new containers from docker-compose.yml — no application impact
- Phase 2 (dual-write): Revert the instrument.ts changes — remove the OTel SDK init. Sentry Logs continues working
- Phase 3 (dashboards): No rollback needed — dashboards are additive
- Phase 4 (cut-over): Re-enable _experiments: { enableLogs: true } in Sentry init, swap OtelLoggerService back to SentryLoggerService
- Phase 5 (cleanup): If SentryLoggerService was already deleted, restore it from Git history
Dozzle remains available as a last-resort log viewer at all times (reads Docker stdout directly).
📚 Key References¶
Research:
- Detailed research document: docs/03-architecture/research/clickhouse-grafana-logging-research.md
Technologies:
- OpenTelemetry Collector Contrib: https://github.com/open-telemetry/opentelemetry-collector-contrib
- ClickHouse OTel Exporter: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/clickhouseexporter
- ClickHouse Docker: https://hub.docker.com/r/clickhouse/clickhouse-server
- Grafana ClickHouse Plugin: https://grafana.com/grafana/plugins/grafana-clickhouse-datasource/
- Pino Logger: https://github.com/pinojs/pino
- OTel Pino Instrumentation: https://github.com/open-telemetry/opentelemetry-js-contrib/tree/main/plugins/node/opentelemetry-instrumentation-pino
Existing Codebase:
- Current Sentry logger: apps/*/src/observability/services/sentry-logger.service.ts
- Current instrument files: apps/*/src/observability/instrument.ts
- Shared observability library: libs/observability/src/
- OTel config helper: libs/observability/src/lib/otel-config.ts (already exists, currently unused for logging)
- Docker Compose (staging): deployment/staging/docker-compose.yml
- Deployment guide: docs/05-deployment/staging-deployment-guide.md
END OF PROMPT
This prompt implements a self-hosted ClickHouse + Grafana + OpenTelemetry centralized logging stack for the Forma3D.Connect microservice platform, as designed in docs/03-architecture/research/clickhouse-grafana-logging-research.md. The AI should deploy ClickHouse, OTel Collector, and Grafana containers via Docker Compose; install Pino and OTel packages; create a shared OtelLoggerService that replaces SentryLoggerService; bridge Pino to OpenTelemetry via @opentelemetry/instrumentation-pino; modify instrument.ts in all 5 backend services to initialize the OTel SDK; build 4 Grafana dashboards with alerting; apply tiered TTL retention; and set up automated backups to DigitalOcean Spaces. Sentry continues to handle error tracking, performance monitoring, tracing, and profiling — only the logging concern moves. A phased rollout with dual-write validation ensures safe migration with rollback at every stage.