
ClickHouse + Grafana Logging Research

Status: Research Document
Created: February 2026
Scope: Forma 3D Connect — Centralized Logging Stack

Table of Contents

  1. Executive Summary
  2. Current State
  3. Proposed Architecture
  4. Technology Deep Dive
  5. Application-Side Integration
  6. Docker Compose Deployment
  7. Log Rotation and TTL
  8. Backup to DigitalOcean Spaces
  9. Grafana Dashboards
  10. Resource Requirements
  11. Migration Strategy
  12. Cost Analysis
  13. Risks and Mitigations
  14. Recommendations and Next Steps

1. Executive Summary

This document evaluates migrating structured logging from Sentry Logs to a self-hosted ClickHouse + Grafana stack, collected via OpenTelemetry. Sentry remains the primary tool for error tracking, performance monitoring, and profiling — only the logging concern is being moved.

Why Move Logging Away from Sentry?

  • Cost: Sentry Logs is a metered feature; high-volume logging becomes expensive at scale
  • Retention: Sentry retains logs for limited periods; self-hosted ClickHouse allows arbitrary retention
  • Querying: ClickHouse offers sub-second analytical queries on billions of log rows; Sentry's log search is limited
  • Ownership: Logs contain sensitive business data — self-hosting provides full data sovereignty
  • Flexibility: Grafana dashboards are far more customizable than Sentry's log explorer
  • Vendor independence: OpenTelemetry is vendor-neutral — the backend can be swapped without app changes

Key Decisions

| Concern | Current | Proposed |
| --- | --- | --- |
| Error tracking | Sentry | Sentry (unchanged) |
| Performance / Tracing | Sentry + OTel | Sentry + OTel (unchanged) |
| Profiling | Sentry | Sentry (unchanged) |
| Structured logging | Sentry Logs (Sentry.logger.*) | ClickHouse via OpenTelemetry |
| Log visualization | Sentry Explore > Logs + Dozzle | Grafana + ClickHouse plugin |
| Real-time log tailing | Dozzle (Docker log viewer) | Dozzle (keep) + Grafana Live |

2. Current State

2.1 Sentry Logging Architecture

Each microservice currently sends structured logs to Sentry via SentryLoggerService:

┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│   Gateway    │   │ Order Svc    │   │ Print Svc    │   │ Shipping Svc │
│              │   │              │   │              │   │              │
│ SentryLogger │   │ SentryLogger │   │ SentryLogger │   │ SentryLogger │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │                  │
       └──────────────────┴──────────────────┴──────────────────┘
                                    │
                                    ▼
                          ┌──────────────────┐
                          │   Sentry Cloud   │
                          │  (Logs + Errors  │
                          │   + Traces)      │
                          └──────────────────┘

Files involved per service:

| File | Purpose |
| --- | --- |
| observability/instrument.ts | Sentry SDK init with enableLogs: true |
| observability/services/sentry-logger.service.ts | Wrapper sending Sentry.logger.info() etc. |
| observability/services/business-observability.service.ts | Business event logging |
| observability/interceptors/logging.interceptor.ts | HTTP request/response logging |
| observability/filters/sentry-exception.filter.ts | Exception capture (stays with Sentry) |

Shared library: libs/observability exports getSentryConfig(), getOtelConfig(), and setServiceName().

2.2 What Works Well

  • Sentry exception filters — keep as-is
  • Sentry performance monitoring / tracing — keep as-is
  • Sentry profiling — keep as-is
  • getOtelConfig() in libs/observability already exists (currently unused for logging)

2.3 Pain Points with Sentry Logging

  • Sentry Logs is still marked as experimental (_experiments: { enableLogs: true })
  • Limited log retention (30 days on Team plan)
  • No way to build custom dashboards over logs
  • Cannot cross-reference logs with custom ClickHouse analytics
  • Costs scale linearly with log volume

3. Proposed Architecture

3.1 High-Level Overview

┌───────────────────────────────────────────────────────────────────────────┐
│                        APPLICATION LAYER                                  │
│                                                                           │
│  ┌─────────┐  ┌───────────┐  ┌───────────┐  ┌──────────┐  ┌───────────┐ │
│  │ Gateway │  │ Order Svc │  │ Print Svc │  │Ship. Svc │  │GridFlock  │ │
│  │         │  │           │  │           │  │          │  │           │ │
│  │  Pino   │  │   Pino    │  │   Pino    │  │  Pino    │  │   Pino    │ │
│  │  + OTel │  │   + OTel  │  │   + OTel  │  │  + OTel  │  │   + OTel  │ │
│  └────┬────┘  └─────┬─────┘  └─────┬─────┘  └────┬─────┘  └─────┬─────┘ │
│       │             │              │             │              │        │
│       └─────────────┴──────────────┴─────────────┴──────────────┘        │
│                                    │                                      │
│                            OTLP (gRPC :4317)                             │
└────────────────────────────────────┼──────────────────────────────────────┘
                                     │
                                     ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                     COLLECTION LAYER                                       │
│                                                                            │
│  ┌──────────────────────────────────────┐                                  │
│  │    OpenTelemetry Collector (Contrib) │                                  │
│  │                                      │                                  │
│  │  Receivers:  otlp (gRPC + HTTP)      │                                  │
│  │  Processors: batch, resource,        │                                  │
│  │              attributes, filter      │                                  │
│  │  Exporters:  clickhouse              │                                  │
│  └──────────────────┬───────────────────┘                                  │
│                     │                                                      │
└─────────────────────┼──────────────────────────────────────────────────────┘
                      │
                      ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                     STORAGE + VISUALIZATION                                │
│                                                                            │
│  ┌──────────────────┐          ┌──────────────────────────┐                │
│  │    ClickHouse    │◀─────────│       Grafana            │                │
│  │                  │  query   │                          │                │
│  │  otel_logs DB    │          │  ClickHouse Data Source  │                │
│  │  TTL: 90 days    │          │  Log dashboards          │                │
│  │                  │          │  Alert rules             │                │
│  └────────┬─────────┘          └──────────────────────────┘                │
│           │                                                                │
│           │  Nightly backup                                                │
│           ▼                                                                │
│  ┌──────────────────────────┐                                              │
│  │  DigitalOcean Spaces     │                                              │
│  │  (S3-compatible)         │                                              │
│  │  forma3d-log-backups/    │                                              │
│  └──────────────────────────┘                                              │
└────────────────────────────────────────────────────────────────────────────┘

3.2 Data Flow

  1. Application logs via Pino (structured JSON), auto-instrumented by @opentelemetry/instrumentation-pino
  2. OTel SDK in each service sends log records over OTLP gRPC to the OTel Collector
  3. OTel Collector batches, enriches (service name, environment, host), and exports to ClickHouse
  4. ClickHouse stores logs in an OTel-schema table with TTL-based retention
  5. Grafana queries ClickHouse via the official plugin for dashboards and alerting
  6. Nightly cron runs BACKUP TABLE ... TO S3(...) to DigitalOcean Spaces
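Step 2 of the flow can be made concrete. The sketch below builds the JSON shape a single log record takes on the wire when exported over OTLP/HTTP to the Collector's :4318 endpoint. Field names follow the OTLP logs data model; the service name is illustrative, and in practice the OTel SDK assembles and sends this payload for you.

```typescript
// Shape of one log record in an OTLP/HTTP JSON export, as received by the
// Collector's otlp receiver on :4318 at POST /v1/logs.
interface OtlpKeyValue {
  key: string;
  value: { stringValue: string };
}

function buildOtlpLogPayload(serviceName: string, severityText: string, body: string) {
  return {
    resourceLogs: [
      {
        resource: {
          attributes: [
            { key: 'service.name', value: { stringValue: serviceName } } as OtlpKeyValue,
          ],
        },
        scopeLogs: [
          {
            scope: { name: 'pino' },
            logRecords: [
              {
                timeUnixNano: String(Date.now() * 1_000_000),
                severityText,
                severityNumber: 9, // INFO on the OTel severity scale
                body: { stringValue: body },
              },
            ],
          },
        ],
      },
    ],
  };
}

// Sending it manually (the SDK normally does this; endpoint name per this doc):
// await fetch('http://otel-collector:4318/v1/logs', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(buildOtlpLogPayload('order-service', 'INFO', 'Order created')),
// });
```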

3.3 Sentry Coexistence

Sentry continues to handle:

  • Exception capture (Sentry.captureException)
  • Performance traces (Sentry's OTel integration)
  • Profiling (nodeProfilingIntegration)

The instrument.ts files in each service keep their current Sentry init but drop _experiments: { enableLogs: true }. The SentryLoggerService is replaced by an OtelLoggerService that writes to Pino (which is bridged to OTel).


4. Technology Deep Dive

4.1 OpenTelemetry Collector (Contrib)

The Contrib distribution is required because it includes the clickhouseexporter.

| Detail | Value |
| --- | --- |
| Image | otel/opentelemetry-collector-contrib:0.120.0 |
| Receivers | otlp (gRPC :4317, HTTP :4318) |
| Processors | batch, resource, attributes |
| Exporters | clickhouse |
| Health check | :13133/health |
| Metrics | :8888/metrics (Prometheus) |

Key configuration — otel-collector-config.yaml:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 10000
    send_batch_max_size: 20000

  resource:
    attributes:
      - key: deployment.environment
        value: "${ENVIRONMENT}"
        action: upsert
      - key: host.name
        value: "${HOSTNAME}"
        action: upsert

  filter/drop-debug:
    logs:
      log_record:
        - 'severity_number < 9'  # Drop TRACE and DEBUG in production

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000
    database: otel
    logs_table_name: otel_logs
    timeout: 10s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    create_schema: true

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [resource, filter/drop-debug, batch]
      exporters: [clickhouse]

  telemetry:
    logs:
      level: warn
    metrics:
      address: 0.0.0.0:8888

  extensions: [health_check]

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

Notes:

  • create_schema: true auto-creates the otel_logs table on first run
  • The filter/drop-debug processor prevents debug-level logs from reaching ClickHouse in production (toggle per environment)
  • The batch processor is critical for performance — ClickHouse ingests best in large batches
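The `severity_number < 9` threshold maps onto the OTel log severity scale, where each severity text owns a numeric range. A small lookup (TypeScript, illustrative only) makes the filter's cut-off explicit:

```typescript
// OTel log data model severity ranges: TRACE 1-4, DEBUG 5-8, INFO 9-12,
// WARN 13-16, ERROR 17-20, FATAL 21-24.
function severityRange(severityNumber: number): string {
  if (severityNumber >= 1 && severityNumber <= 4) return 'TRACE';
  if (severityNumber >= 5 && severityNumber <= 8) return 'DEBUG';
  if (severityNumber >= 9 && severityNumber <= 12) return 'INFO';
  if (severityNumber >= 13 && severityNumber <= 16) return 'WARN';
  if (severityNumber >= 17 && severityNumber <= 20) return 'ERROR';
  if (severityNumber >= 21 && severityNumber <= 24) return 'FATAL';
  return 'UNSPECIFIED';
}

// severity_number < 9 therefore drops exactly TRACE and DEBUG, keeping INFO
// (9) and everything above.
```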

4.2 ClickHouse

| Detail | Value |
| --- | --- |
| Image | clickhouse/clickhouse-server:24.12-alpine |
| Protocol ports | 9000 (native TCP), 8123 (HTTP) |
| Storage | Named Docker volume clickhouse-data |
| Min RAM recommended | 2 GB (for our log volume) |
| Compression | LZ4 by default (~10:1 on log data) |

Why ClickHouse over alternatives?

| Alternative | Pros | Cons |
| --- | --- | --- |
| Elasticsearch / OpenSearch | Mature, rich full-text search | RAM-hungry (4+ GB minimum), complex to operate |
| Loki + Grafana | Native Grafana integration, low resource usage | Limited querying (labels only, no full-text) |
| ClickHouse | Blazing fast analytics, SQL interface, low RAM, excellent compression | Less mature OTel ecosystem (improving rapidly) |
| PostgreSQL | Already in the stack | Not designed for high-volume log ingestion |

ClickHouse wins for this use case because:

  • Compression: Log data compresses 10-20x, keeping disk usage low
  • SQL: Query language is familiar (no learning curve like LogQL)
  • Speed: Sub-second queries on millions of rows
  • Lightweight: Runs well in 1-2 GB RAM for moderate log volumes
  • TTL: Built-in time-based data lifecycle management
  • S3 backup: Native BACKUP TABLE ... TO S3(...) command
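As an aside on the SQL point: any service in the stack can query ClickHouse's HTTP interface (:8123) directly from Node with nothing but `fetch`. The helper below is a sketch; the host, user, and database names mirror the deployment described later in this document but are assumptions about the final setup.

```typescript
// Build a ClickHouse HTTP-interface query URL. `query` and `default_format`
// are standard ClickHouse HTTP parameters; JSONEachRow returns one JSON
// object per result row.
function buildQueryUrl(baseUrl: string, sql: string): string {
  const url = new URL(baseUrl);
  url.searchParams.set('query', sql);
  url.searchParams.set('default_format', 'JSONEachRow');
  return url.toString();
}

// Usage (network call commented out — requires a running ClickHouse):
// const url = buildQueryUrl('http://clickhouse:8123/', 'SELECT count() FROM otel.otel_logs');
// const res = await fetch(url, {
//   headers: {
//     'X-ClickHouse-User': 'otel',
//     'X-ClickHouse-Key': process.env.CLICKHOUSE_PASSWORD ?? '',
//   },
// });
```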

4.3 Grafana with ClickHouse Plugin

| Detail | Value |
| --- | --- |
| Image | grafana/grafana-oss:11.5.0 |
| ClickHouse plugin | grafana-clickhouse-datasource v4.x |
| Protocol | Native TCP to ClickHouse on port 9000 (matching the provisioned data source in section 6.3) |
| Authentication | Grafana built-in auth (admin password from env) |

The ClickHouse plugin v4 has first-class support for the OpenTelemetry log schema:

  • Auto-detects OTel column names (Timestamp, SeverityText, Body, etc.)
  • Log panel renders with severity coloring
  • Logs link to traces via the TraceId field
  • Query builder minimizes the need for raw SQL


5. Application-Side Integration

5.1 Approach: Pino + OpenTelemetry Instrumentation

Rather than calling OTel log APIs directly, the recommended approach is:

  1. Use Pino as the application logger (fast, structured JSON)
  2. Enable @opentelemetry/instrumentation-pino to bridge Pino logs to OTel
  3. The OTel SDK's LogRecordExporter sends log records via OTLP to the Collector

This is preferable because:

  • Pino logs still go to stdout (Docker captures them, Dozzle still works)
  • Trace context (traceId, spanId) is automatically injected into every log record
  • No tight coupling to any backend — swapping ClickHouse for Loki is a config change, not a code change
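The automatic trace-context injection relies on async-local storage under the hood: the OTel SDK keeps the active span's ids in an async context, and the Pino instrumentation mixes them into every record logged within that context. The stand-alone mock below reproduces the mechanism with Node's own AsyncLocalStorage, without the real SDK; names like `traceStore` are invented for illustration.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

interface SpanContext {
  traceId: string;
  spanId: string;
}

// Stand-in for the OTel context: the SDK stores the active span here.
const traceStore = new AsyncLocalStorage<SpanContext>();

// Stand-in for the Pino OTel bridge: every log record emitted while a span
// is active picks up its trace_id/span_id automatically.
function logWithTraceContext(
  message: string,
  attrs: Record<string, unknown> = {}
): Record<string, unknown> {
  const ctx = traceStore.getStore();
  return {
    msg: message,
    ...attrs,
    ...(ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId } : {}),
  };
}

// Simulate a request handled inside an active span:
const record = traceStore.run({ traceId: 'abc123', spanId: 'def456' }, () =>
  logWithTraceContext('order created', { orderId: 42 })
);
// record now carries trace_id 'abc123' without the call site mentioning it.
```

This is why application code never has to pass trace ids around: the context travels with the async call chain, not with function arguments.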

5.2 Required Packages

pnpm add -w pino @opentelemetry/sdk-node @opentelemetry/api \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-logs-otlp-grpc \
  @opentelemetry/sdk-logs \
  @opentelemetry/instrumentation-pino \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions

Note: pino ships its own TypeScript type definitions (since v7), so a separate @types/pino dev dependency is not needed.

5.3 File Changes (per service)

New file: libs/observability/src/lib/otel-logger.ts

import pino from 'pino';
import { getServiceName } from './service-context';

export function createLogger(context?: string): pino.Logger {
  const logger = pino({
    level: process.env['LOG_LEVEL'] || 'info',
    transport:
      process.env['NODE_ENV'] === 'development'
        ? { target: 'pino-pretty', options: { colorize: true } }
        : undefined,
  });

  return context ? logger.child({ context }) : logger;
}

Modified: apps/*/src/observability/instrument.ts

// BEFORE: only Sentry init
import * as Sentry from '@sentry/nestjs';
// ...

// AFTER: Sentry init + OTel Logs SDK
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-grpc';
import { BatchLogRecordProcessor } from '@opentelemetry/sdk-logs';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';
import * as Sentry from '@sentry/nestjs';
import { nodeProfilingIntegration } from '@sentry/profiling-node';
import { getSentryConfig, SENTRY_IGNORED_ERRORS, setServiceName, getOtelConfig } from '@forma3d/observability';

const SERVICE_NAME = 'order-service'; // varies per service
setServiceName(SERVICE_NAME);

// 1. Initialize OTel SDK (MUST happen before Sentry and app imports)
const otelConfig = getOtelConfig(SERVICE_NAME);

if (otelConfig.exporterEndpoint) {
  const sdk = new NodeSDK({
    resource: new Resource({
      [ATTR_SERVICE_NAME]: otelConfig.serviceName,
      [ATTR_SERVICE_VERSION]: otelConfig.serviceVersion,
      'deployment.environment': otelConfig.environment,
    }),
    logRecordProcessors: [
      new BatchLogRecordProcessor(
        new OTLPLogExporter({
          url: otelConfig.exporterEndpoint,
        })
      ),
    ],
    instrumentations: [
      getNodeAutoInstrumentations({
        '@opentelemetry/instrumentation-pino': {
          disableLogSending: false, // Enable bridging Pino → OTel
        },
        // Disable instrumentations we don't need
        '@opentelemetry/instrumentation-fs': { enabled: false },
      }),
    ],
  });

  sdk.start();
}

// 2. Initialize Sentry (after OTel so Sentry can link to OTel traces)
const config = getSentryConfig();
if (config.dsn) {
  Sentry.init({
    dsn: config.dsn,
    environment: config.environment,
    release: config.release,
    debug: config.debug,
    tracesSampleRate: config.tracesSampleRate,
    profilesSampleRate: config.profilesSampleRate,
    integrations: [nodeProfilingIntegration()],
    // REMOVED: _experiments: { enableLogs: true }
    ignoreErrors: SENTRY_IGNORED_ERRORS,
    beforeSend(event) {
      if (event.request?.headers) {
        delete event.request.headers['authorization'];
        delete event.request.headers['cookie'];
        delete event.request.headers['x-shopify-access-token'];
      }
      return event;
    },
    initialScope: {
      tags: { service: SERVICE_NAME, component: 'backend' },
    },
  });
}

export { Sentry };

Replaced: SentryLoggerServiceOtelLoggerService

The new service uses Pino instead of Sentry.logger.*:

import { Injectable } from '@nestjs/common';
import { createLogger } from '@forma3d/observability';
import type pino from 'pino';

@Injectable()
export class OtelLoggerService {
  private readonly logger: pino.Logger;

  constructor() {
    this.logger = createLogger('business');
  }

  info(message: string, attributes?: Record<string, unknown>): void {
    this.logger.info(attributes, message);
  }

  warn(message: string, attributes?: Record<string, unknown>): void {
    this.logger.warn(attributes, message);
  }

  error(message: string, attributes?: Record<string, unknown>): void {
    this.logger.error(attributes, message);
  }

  debug(message: string, attributes?: Record<string, unknown>): void {
    this.logger.debug(attributes, message);
  }

  logEvent(eventType: string, message: string, attributes?: Record<string, unknown>): void {
    this.logger.info({ ...attributes, eventType }, message);
  }

  logAudit(action: string, success: boolean, attributes?: Record<string, unknown>): void {
    const level = success ? 'info' : 'warn';
    this.logger[level]({ ...attributes, action, success, category: 'audit' }, `Audit: ${action}`);
  }
}

5.4 Environment Variables to Add

Each service container needs one new environment variable:

# In docker-compose.yml, for each backend service:
- OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317

Optional tuning variables:

| Variable | Default | Description |
| --- | --- | --- |
| OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OTel Collector gRPC endpoint |
| LOG_LEVEL | info | Minimum log level (trace/debug/info/warn/error) |
| OTEL_LOG_LEVEL | warn | OTel SDK's own log level |
| OTEL_SERVICE_NAME | Set in code | Override service name |

6. Docker Compose Deployment

6.1 New Services to Add

Add these three services to deployment/staging/docker-compose.yml:

  # --------------------------------------------------------------------------
  # OpenTelemetry Collector - Log collection and forwarding
  # --------------------------------------------------------------------------
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.120.0
    container_name: forma3d-otel-collector
    restart: unless-stopped
    command: ['--config=/etc/otelcol-contrib/config.yaml']
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml:ro
    ports:
      - '4317:4317'   # OTLP gRPC (internal only — remove port mapping if not needed externally)
      - '4318:4318'   # OTLP HTTP
    environment:
      - ENVIRONMENT=staging
      - HOSTNAME=${HOSTNAME:-forma3d-staging}
    networks:
      - forma3d-network
    healthcheck:
      test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://localhost:13133/']
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    depends_on:
      clickhouse:
        condition: service_healthy

  # --------------------------------------------------------------------------
  # ClickHouse - Log Storage
  # --------------------------------------------------------------------------
  clickhouse:
    image: clickhouse/clickhouse-server:24.12-alpine
    container_name: forma3d-clickhouse
    restart: unless-stopped
    volumes:
      - clickhouse-data:/var/lib/clickhouse
      - clickhouse-logs:/var/log/clickhouse-server
      - ./clickhouse-config.xml:/etc/clickhouse-server/config.d/custom.xml:ro
      - ./clickhouse-users.xml:/etc/clickhouse-server/users.d/custom.xml:ro
    environment:
      - CLICKHOUSE_DB=otel
      - CLICKHOUSE_USER=otel
      - CLICKHOUSE_PASSWORD=${CLICKHOUSE_PASSWORD}
      - CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1
    networks:
      - forma3d-network
    healthcheck:
      test: ['CMD', 'clickhouse-client', '--query', 'SELECT 1']
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 30s
    ulimits:
      nofile:
        soft: 262144
        hard: 262144

  # --------------------------------------------------------------------------
  # Grafana - Log Visualization and Dashboards
  # --------------------------------------------------------------------------
  grafana:
    image: grafana/grafana-oss:11.5.0
    container_name: forma3d-grafana
    restart: unless-stopped
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
      - GF_INSTALL_PLUGINS=grafana-clickhouse-datasource
      - GF_SERVER_ROOT_URL=https://staging-connect-grafana.forma3d.be
      - GF_SERVER_SERVE_FROM_SUB_PATH=false
    networks:
      - forma3d-network
    labels:
      - 'traefik.enable=true'
      - 'traefik.http.routers.grafana.rule=Host(`staging-connect-grafana.forma3d.be`)'
      - 'traefik.http.routers.grafana.entrypoints=websecure'
      - 'traefik.http.routers.grafana.tls=true'
      - 'traefik.http.routers.grafana.tls.certresolver=letsencrypt'
      - 'traefik.http.services.grafana.loadbalancer.server.port=3000'
    healthcheck:
      test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://localhost:3000/api/health']
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    depends_on:
      clickhouse:
        condition: service_healthy

New volumes to add:

volumes:
  # ... existing volumes ...
  clickhouse-data:
  clickhouse-logs:
  grafana-data:

6.2 Updated Service Dependencies

Each backend service that sends logs should not hard-depend on the OTel Collector. If the collector is down, the OTel SDK buffers and retries — the application continues running. However, the OTel Collector itself depends on ClickHouse.

6.3 Configuration Files

deployment/staging/clickhouse-config.xml:

<?xml version="1.0"?>
<clickhouse>
    <!-- Listen on all interfaces within Docker network -->
    <listen_host>0.0.0.0</listen_host>

    <!-- Logging -->
    <logger>
        <level>warning</level>
        <log>/var/log/clickhouse-server/clickhouse-server.log</log>
        <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
        <size>100M</size>
        <count>3</count>
    </logger>

    <!-- Memory limits for single-server deployment -->
    <max_server_memory_usage_to_ram_ratio>0.8</max_server_memory_usage_to_ram_ratio>

    <!-- S3 backup configuration (DigitalOcean Spaces) -->
    <backups>
        <allowed_path>/backups/</allowed_path>
        <allowed_disk>s3_backups</allowed_disk>
    </backups>

    <storage_configuration>
        <disks>
            <s3_backups>
                <type>s3</type>
                <!-- ClickHouse does not expand ${VAR} placeholders in its XML
                     config. Either render this file at deploy time (e.g. with
                     envsubst), or pull credentials from the container
                     environment via from_env as shown here. Endpoint values
                     match the Spaces settings in section 8. -->
                <endpoint>https://ams3.digitaloceanspaces.com/forma3d-log-backups/clickhouse-backups/</endpoint>
                <access_key_id from_env="DO_SPACES_KEY"/>
                <secret_access_key from_env="DO_SPACES_SECRET"/>
            </s3_backups>
        </disks>
    </storage_configuration>
</clickhouse>

deployment/staging/clickhouse-users.xml:

<?xml version="1.0"?>
<clickhouse>
    <users>
        <otel>
            <password_sha256_hex replace="true"><!-- generated hash --></password_sha256_hex>
            <networks>
                <ip>::/0</ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
            <access_management>0</access_management>
        </otel>
    </users>
</clickhouse>

Note: For simplicity, use CLICKHOUSE_PASSWORD env var with the default user approach instead of the XML user file. The XML approach is shown for reference on how to lock down access further.

deployment/staging/grafana/provisioning/datasources/clickhouse.yaml:

apiVersion: 1

datasources:
  - name: ClickHouse
    type: grafana-clickhouse-datasource
    access: proxy
    isDefault: true
    jsonData:
      host: clickhouse
      port: 9000
      protocol: native
      username: otel
      defaultDatabase: otel
      logs:
        defaultDatabase: otel
        defaultTable: otel_logs
        otelEnabled: true
        otelVersion: latest
        timeColumn: Timestamp
        levelColumn: SeverityText
        messageColumn: Body
    secureJsonData:
      password: ${CLICKHOUSE_PASSWORD}

7. Log Rotation and TTL

7.1 ClickHouse TTL Strategy

ClickHouse's TTL feature automatically deletes data that exceeds a time threshold. This replaces traditional log rotation.

Schema with TTL (auto-created by OTel Collector with create_schema: true, but can be customized):

CREATE TABLE IF NOT EXISTS otel.otel_logs
(
    Timestamp          DateTime64(9),
    TimestampDate      Date DEFAULT toDate(Timestamp),
    TraceId            String,
    SpanId             String,
    TraceFlags         UInt32,
    SeverityText       LowCardinality(String),
    SeverityNumber     Int32,
    ServiceName        LowCardinality(String),
    Body               String,
    ResourceSchemaUrl  String,
    ResourceAttributes Map(LowCardinality(String), String),
    ScopeSchemaUrl     String,
    ScopeName          String,
    ScopeVersion       String,
    ScopeAttributes    Map(LowCardinality(String), String),
    LogAttributes      Map(LowCardinality(String), String),

    INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
    INDEX idx_body Body TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 1
)
ENGINE = MergeTree()
PARTITION BY TimestampDate
ORDER BY (ServiceName, SeverityText, toUnixTimestamp(Timestamp), TraceId)
TTL TimestampDate + INTERVAL 90 DAY
SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1;

7.2 Tiered Retention Policy

Log Level Retention Rationale
| Log Level | Retention | Rationale |
| --- | --- | --- |
| ERROR, FATAL | 180 days | Need long history for debugging recurring issues |
| WARN | 90 days | Important but less critical |
| INFO | 30 days | Operational visibility, high volume |
| DEBUG | 7 days | Only on staging; filtered out in production |

This can be achieved with conditional TTL. Note that row-level WHERE conditions require row deletes, so the ttl_only_drop_parts = 1 setting from the schema above must be disabled for this variant:

ALTER TABLE otel.otel_logs MODIFY TTL
    TimestampDate + INTERVAL 7 DAY DELETE WHERE SeverityText IN ('TRACE', 'DEBUG'),
    TimestampDate + INTERVAL 30 DAY DELETE WHERE SeverityText = 'INFO',
    TimestampDate + INTERVAL 90 DAY DELETE WHERE SeverityText = 'WARN',
    TimestampDate + INTERVAL 180 DAY DELETE;

7.3 Partition Management

Partitioning by TimestampDate means each day's data is a separate partition. Benefits:

  • TTL drops entire partitions (fast, no row-level deletes)
  • Backups can target specific date ranges
  • Old data is cleanly isolated from hot data

ClickHouse TTL cleanup runs every 4 hours by default (merge_with_ttl_timeout = 14400). This is adequate — there's no urgency to delete data the moment it expires.

7.4 ClickHouse Internal Log Rotation

ClickHouse's own server logs (not to be confused with application logs stored in ClickHouse) are managed via clickhouse-config.xml:

<logger>
    <size>100M</size>  <!-- Max file size before rotation -->
    <count>3</count>   <!-- Keep 3 rotated files -->
</logger>

8. Backup to DigitalOcean Spaces

8.1 DigitalOcean Spaces Setup

DigitalOcean Spaces is S3-compatible and works with ClickHouse's native BACKUP command.

Spaces configuration:

| Setting | Value |
| --- | --- |
| Bucket name | forma3d-log-backups |
| Region | ams3 (Amsterdam) — closest to EU infrastructure |
| CDN | Disabled (not needed for backups) |
| Versioning | Disabled (ClickHouse manages versions) |

Access credentials:

DO_SPACES_KEY=<generated-key>
DO_SPACES_SECRET=<generated-secret>
DO_SPACES_REGION=ams3
DO_SPACES_BUCKET=forma3d-log-backups

8.2 Backup Strategy

| Backup Type | Frequency | Retention | Contents |
| --- | --- | --- | --- |
| Full backup | Weekly (Sunday 3 AM) | 4 weeks | Entire otel_logs table |
| Incremental | Daily (3 AM) | 2 weeks | Changes since last full |
| Pre-TTL archive | Before TTL expiry | 1 year (cold) | ERROR/FATAL logs about to expire |

8.3 Backup Commands

Full backup:

BACKUP TABLE otel.otel_logs
TO S3(
  'https://ams3.digitaloceanspaces.com/forma3d-log-backups/clickhouse/full/{yyyy}-{mm}-{dd}/',
  '<DO_SPACES_KEY>',
  '<DO_SPACES_SECRET>'
);

Incremental backup (referencing last full):

BACKUP TABLE otel.otel_logs
TO S3(
  'https://ams3.digitaloceanspaces.com/forma3d-log-backups/clickhouse/incremental/{yyyy}-{mm}-{dd}/',
  '<DO_SPACES_KEY>',
  '<DO_SPACES_SECRET>'
)
SETTINGS base_backup = S3(
  'https://ams3.digitaloceanspaces.com/forma3d-log-backups/clickhouse/full/{last_full_date}/',
  '<DO_SPACES_KEY>',
  '<DO_SPACES_SECRET>'
);

Restore from backup:

RESTORE TABLE otel.otel_logs
FROM S3(
  'https://ams3.digitaloceanspaces.com/forma3d-log-backups/clickhouse/full/{yyyy}-{mm}-{dd}/',
  '<DO_SPACES_KEY>',
  '<DO_SPACES_SECRET>'
);

8.4 Automated Backup Script

Create deployment/staging/scripts/backup-clickhouse-logs.sh:

#!/bin/bash
set -euo pipefail

# Configuration
CLICKHOUSE_HOST="clickhouse"
CLICKHOUSE_PORT="9000"
CLICKHOUSE_USER="otel"
CLICKHOUSE_PASSWORD="${CLICKHOUSE_PASSWORD}"

DO_SPACES_ENDPOINT="https://${DO_SPACES_REGION}.digitaloceanspaces.com"
DO_SPACES_BUCKET="${DO_SPACES_BUCKET}"
DO_SPACES_KEY="${DO_SPACES_KEY}"
DO_SPACES_SECRET="${DO_SPACES_SECRET}"

DATE=$(date +%Y-%m-%d)
DAY_OF_WEEK=$(date +%u)  # 1=Monday, 7=Sunday

BACKUP_PATH="${DO_SPACES_ENDPOINT}/${DO_SPACES_BUCKET}/clickhouse"

if [ "$DAY_OF_WEEK" -eq 7 ]; then
  # Sunday: Full backup
  echo "[$(date)] Starting full backup..."
  docker exec forma3d-clickhouse clickhouse-client \
    --user "$CLICKHOUSE_USER" \
    --password "$CLICKHOUSE_PASSWORD" \
    --query "BACKUP TABLE otel.otel_logs TO S3('${BACKUP_PATH}/full/${DATE}/', '${DO_SPACES_KEY}', '${DO_SPACES_SECRET}')"
  echo "[$(date)] Full backup completed: ${BACKUP_PATH}/full/${DATE}/"
else
  # Weekday: Incremental backup
  LAST_SUNDAY=$(date -d "last sunday" +%Y-%m-%d 2>/dev/null || date -v-sunday +%Y-%m-%d)
  echo "[$(date)] Starting incremental backup (base: ${LAST_SUNDAY})..."
  docker exec forma3d-clickhouse clickhouse-client \
    --user "$CLICKHOUSE_USER" \
    --password "$CLICKHOUSE_PASSWORD" \
    --query "BACKUP TABLE otel.otel_logs TO S3('${BACKUP_PATH}/incremental/${DATE}/', '${DO_SPACES_KEY}', '${DO_SPACES_SECRET}') SETTINGS base_backup = S3('${BACKUP_PATH}/full/${LAST_SUNDAY}/', '${DO_SPACES_KEY}', '${DO_SPACES_SECRET}')"
  echo "[$(date)] Incremental backup completed: ${BACKUP_PATH}/incremental/${DATE}/"
fi

Cron entry (on the Droplet):

0 3 * * * /opt/forma3d/scripts/backup-clickhouse-logs.sh >> /var/log/clickhouse-backup.log 2>&1

8.5 DigitalOcean Spaces Lifecycle Policy

Use Spaces lifecycle rules to automatically clean up old backups:

| Path prefix | Expiration |
| --- | --- |
| clickhouse/full/ | 35 days (keep ~5 full backups) |
| clickhouse/incremental/ | 14 days |
| clickhouse/archive/ | 365 days |

9. Grafana Dashboards

9.1 Provisioned Dashboards

Create pre-built dashboards via Grafana provisioning:

Dashboard 1: Service Logs Overview

  • Log volume over time (by service)
  • Error rate by service (stacked bar chart)
  • Latest error logs (table)
  • Log level distribution (pie chart)

Dashboard 2: Request Tracing

  • Logs filtered by TraceId
  • Correlated with Sentry traces (link out)
  • Request lifecycle visualization

Dashboard 3: Business Events

  • Order processing events
  • Print job status changes
  • Shipment events
  • Webhook receipt/processing logs

Dashboard 4: System Health

  • OTel Collector throughput (records/sec)
  • ClickHouse disk usage
  • ClickHouse query performance
  • Log ingestion latency

9.2 Useful Grafana Queries

Error log count by service (last 24h):

SELECT
    ServiceName,
    count() AS error_count
FROM otel.otel_logs
WHERE SeverityText IN ('ERROR', 'FATAL')
  AND Timestamp >= now() - INTERVAL 24 HOUR
GROUP BY ServiceName
ORDER BY error_count DESC

Log volume by level (time series):

```sql
SELECT
    toStartOfFiveMinutes(Timestamp) AS time,
    SeverityText,
    count() AS count
FROM otel.otel_logs
WHERE Timestamp >= $__fromTime AND Timestamp <= $__toTime
GROUP BY time, SeverityText
ORDER BY time
```

Search logs by keyword:

```sql
SELECT
    Timestamp,
    ServiceName,
    SeverityText,
    Body,
    LogAttributes
FROM otel.otel_logs
WHERE Body LIKE '%order%'
  AND Timestamp >= $__fromTime AND Timestamp <= $__toTime
ORDER BY Timestamp DESC
LIMIT 100
```
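
Dashboard 2's trace correlation hinges on the `TraceId` column that the OTel exporter writes alongside each log record. A sketch of the underlying query — `${trace_id}` is a hypothetical Grafana dashboard variable, typically populated from a Sentry trace link or a text-box input:

```sql
SELECT
    Timestamp,
    ServiceName,
    SeverityText,
    Body
FROM otel.otel_logs
WHERE TraceId = '${trace_id}'   -- hypothetical dashboard variable
ORDER BY Timestamp ASC
```

Sorting ascending reconstructs the request lifecycle across services in the order events actually happened.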

9.3 Alerting Rules

Grafana can alert on ClickHouse queries:

| Alert | Condition | Channel |
|---|---|---|
| High error rate | > 50 ERROR logs in 5 min for any service | Slack / Email |
| Service silent | Zero logs from a service for > 10 min | Slack |
| ClickHouse disk usage | > 80% used (query `system.disks`) | Email |
| OTel Collector unhealthy | Health check fails | Slack |
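
The disk alert can be backed by a query against ClickHouse's built-in `system.disks` table; the > 80% threshold then lives in the Grafana alert rule rather than in the SQL:

```sql
SELECT
    name,
    path,
    round(100 * (1 - free_space / total_space), 1) AS used_pct
FROM system.disks
```

The "service silent" alert works the same way: group `otel.otel_logs` by `ServiceName`, take `max(Timestamp)`, and alert when the gap exceeds 10 minutes.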

10. Resource Requirements

10.1 Estimated Log Volume

| Service | Est. logs/min (staging) | Est. logs/min (production) |
|---|---|---|
| Gateway | 20 | 100 |
| Order Service | 30 | 150 |
| Print Service | 10 | 50 |
| Shipping Service | 10 | 50 |
| GridFlock Service | 5 | 20 |
| Total | ~75 | ~370 |

At production load: ~370 logs/min = ~530K logs/day = ~16M logs/month

10.2 Storage Estimates

Assuming average log size of 500 bytes (after ClickHouse compression ~50 bytes):

| Timeframe | Raw size | Compressed (est. 10:1) |
|---|---|---|
| 1 day | ~265 MB | ~26 MB |
| 30 days | ~8 GB | ~800 MB |
| 90 days | ~24 GB | ~2.4 GB |

ClickHouse compression is exceptionally efficient on repetitive log data.
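
The 10:1 ratio above is an estimate; once real data is flowing, the actual ratio can be read straight out of ClickHouse's `system.parts` metadata:

```sql
SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes))   AS on_disk,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 1) AS ratio
FROM system.parts
WHERE active AND database = 'otel'
GROUP BY table
```

If the measured ratio comes in well under 10:1, revisit the storage table above before sizing production disks.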

10.3 Container Resource Allocation

| Container | CPU (cores) | RAM | Disk |
|---|---|---|---|
| ClickHouse | 0.5-1 | 1-2 GB | 10 GB initial (grows) |
| OTel Collector | 0.25 | 256 MB | Minimal |
| Grafana | 0.25 | 256 MB | 1 GB (dashboards/cache) |
| Total new | 1-1.5 | 1.5-2.5 GB | ~12 GB |

10.4 Droplet Sizing Impact

The current staging Droplet will likely need to be upsized:

| Current | Recommended |
|---|---|
| 4 GB RAM / 2 vCPU | 8 GB RAM / 4 vCPU |

The 8 GB / 4 vCPU Droplet ($48/mo on DO) comfortably runs the entire stack including the new logging containers.


11. Migration Strategy

11.1 Phased Rollout

| Phase | Duration | Actions |
|---|---|---|
| Phase 1: Deploy infrastructure | 1 day | Add ClickHouse, OTel Collector, and Grafana to docker-compose. Verify containers start. |
| Phase 2: Dual-write | 1 week | Keep `SentryLoggerService` active. Add the OTel SDK to each service so logs are sent in parallel. Validate logs appear in Grafana. |
| Phase 3: Build dashboards | 3-5 days | Create Grafana dashboards. Validate query performance. Set up alerting. |
| Phase 4: Cut over | 1 day | Replace `SentryLoggerService` with `OtelLoggerService`. Remove `_experiments: { enableLogs: true }` from Sentry init. |
| Phase 5: Cleanup | 1 day | Remove `SentryLoggerService` files. Update documentation. Configure backups. |

11.2 Rollback Plan

  • If ClickHouse/OTel issues arise, re-enable _experiments: { enableLogs: true } in Sentry init
  • Keep SentryLoggerService files until Phase 5 is validated for 2+ weeks
  • Dozzle remains available as a last-resort log viewer (reads Docker stdout directly)

11.3 Files to Modify (per service)

| Action | Files |
|---|---|
| Modify | `observability/instrument.ts` — add OTel SDK init |
| Replace | `observability/services/sentry-logger.service.ts` → `otel-logger.service.ts` |
| Modify | `observability/observability.module.ts` — swap provider |
| Keep | `observability/filters/sentry-exception.filter.ts` — unchanged |
| Keep | `observability/interceptors/logging.interceptor.ts` — swap logger injection |
| Keep | `observability/services/business-observability.service.ts` — swap logger injection |

11.4 Services Affected

| Service | Logging changes | Sentry changes |
|---|---|---|
| Gateway | Yes - add OTel logger | Remove `enableLogs` |
| Order Service | Yes - add OTel logger | Remove `enableLogs` |
| Print Service | Yes - add OTel logger | Remove `enableLogs` |
| Shipping Service | Yes - add OTel logger | Remove `enableLogs` |
| GridFlock Service | Yes - add OTel logger | Remove `enableLogs` |
| Web (React) | No (frontend logs stay in Sentry for now) | No change |

12. Cost Analysis

12.1 Monthly Costs

| Component | Staging | Production | Notes |
|---|---|---|---|
| ClickHouse | $0 (self-hosted) | $0 (self-hosted) | Open source, runs on existing infra |
| OTel Collector | $0 (self-hosted) | $0 (self-hosted) | Open source |
| Grafana OSS | $0 (self-hosted) | $0 (self-hosted) | Open source |
| Droplet upsize (if needed) | +$24/mo | +$24/mo | 4 GB → 8 GB RAM |
| DO Spaces storage | ~$5/mo | ~$5/mo | $5/mo includes 250 GB; $0.02/GB overage |
| Total additional | ~$29/mo | ~$29/mo | |

12.2 Cost Savings

| Item | Current cost | After migration |
|---|---|---|
| Sentry Logs volume | Metered (plan-dependent) | $0 (self-hosted) |
| Sentry errors + tracing | Unchanged | Unchanged |
| Total Sentry bill reduction | Depends on log volume | Could be significant at scale |

12.3 TCO Comparison (12 months)

| Approach | Year 1 cost | Pros | Cons |
|---|---|---|---|
| Keep Sentry Logs | $600-2,400+ (volume dependent) | Zero operational overhead | Locked into Sentry, limited querying |
| ClickHouse + Grafana (this proposal) | ~$696 (~$58/mo) | Full control, unlimited retention, rich dashboards | Operational overhead, self-hosted |
| Grafana Cloud + Loki | ~$1,200+ (volume dependent) | Managed, easy setup | Vendor lock-in, limited log querying |

13. Risks and Mitigations

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| ClickHouse disk fills up | Medium | High — stops ingesting | TTL auto-deletes; disk monitoring alert in Grafana |
| OTel Collector crash | Low | Medium — logs buffered in SDK | Docker `restart: unless-stopped`; SDK buffers ~30s |
| ClickHouse crash | Low | High — log loss during downtime | Docker auto-restart; OTel Collector retries with backoff |
| Performance impact on app services | Low | Medium | OTel SDK is async; batch processor minimizes overhead |
| Grafana security (exposed dashboard) | Medium | Medium | Auth required; Traefik IP allowlisting for admin tools |
| Backup failure to Spaces | Low | Low — TTL still manages lifecycle | Backup script alerts on failure; manual backup option |
| Complexity for single operator | Medium | Medium | Good documentation; simple Docker Compose setup |
| Log loss during OTel Collector restart | Low | Low | Batch processor flushes on graceful shutdown |

14. Recommendations and Next Steps

14.1 Recommendation

Proceed with implementation. The ClickHouse + Grafana + OpenTelemetry stack is:

  • Cost-effective: Essentially free beyond a small Droplet upsize and Spaces storage
  • Operationally sound: ClickHouse is battle-tested at scale (Sentry itself runs on it)
  • Future-proof: OpenTelemetry is the industry standard; the backend can be swapped without app changes
  • Well-scoped: Only logging moves; Sentry keeps error tracking and tracing

14.2 Implementation Priority

| Priority | Task | Effort | Impact |
|---|---|---|---|
| P0 | Add ClickHouse + OTel Collector + Grafana to staging docker-compose | 1 day | Foundation |
| P0 | Create OTel Collector config file | 2 hours | Collection pipeline |
| P1 | Modify `instrument.ts` in all services to init the OTel SDK | 1 day | App integration |
| P1 | Create `OtelLoggerService` in `libs/observability` | 4 hours | Shared logger |
| P1 | Build initial Grafana dashboards | 1 day | Visualization |
| P2 | Swap `SentryLoggerService` → `OtelLoggerService` in all services | 1 day | Migration |
| P2 | Set up DO Spaces bucket and backup cron | 4 hours | Data safety |
| P2 | Apply tiered TTL policy | 1 hour | Retention |
| P3 | Remove Sentry Logs experiment flag | 30 min | Cleanup |
| P3 | Replicate setup in production docker-compose | 4 hours | Parity |
| P3 | Document runbooks for ClickHouse operations | 4 hours | Operations |

14.3 Open Questions

  1. Production Droplet sizing: Does the production server have enough headroom for 1.5-2.5 GB additional RAM?
  2. Grafana access: Should Grafana be publicly accessible (with auth) or restricted via IP allowlisting / VPN?
  3. Frontend logs: Should React/browser logs also be routed through OpenTelemetry, or stay in Sentry?
  4. Traces: Should we eventually route OTel traces to ClickHouse too, or keep them in Sentry?
  5. Dozzle: Keep Dozzle for quick container-level debugging, or replace entirely with Grafana?

14.4 Staging vs Production Differences

| Config | Staging | Production |
|---|---|---|
| `LOG_LEVEL` | debug | info |
| `filter/drop-debug` processor | Disabled | Enabled |
| ClickHouse TTL (INFO) | 14 days | 30 days |
| ClickHouse TTL (ERROR) | 90 days | 180 days |
| Backup frequency | Daily (no incremental) | Weekly full + daily incremental |
| Grafana alerts | Email only | Slack + Email |
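
The production TTL values above can be expressed as ClickHouse tiered `DELETE WHERE` TTLs — a sketch assuming the exporter's default `otel.otel_logs` schema (`Timestamp` as DateTime64, hence the `toDateTime` cast), applied once per environment:

```sql
-- Production retention sketch: 180 days for ERROR/FATAL, 30 days for everything else.
-- MODIFY TTL replaces the table's entire TTL clause, so both tiers go in one statement.
ALTER TABLE otel.otel_logs
MODIFY TTL
    toDateTime(Timestamp) + INTERVAL 30 DAY
        DELETE WHERE SeverityText NOT IN ('ERROR', 'FATAL'),
    toDateTime(Timestamp) + INTERVAL 180 DAY
        DELETE WHERE SeverityText IN ('ERROR', 'FATAL')
```

For staging, the same statement with 14 and 90 days matches the table above.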

Document Version: 1.0 Last Updated: February 2026 Next Review: After Phase 2 implementation