
ClickHouse + Grafana Logging Research

Status: Research Document
Created: February 2026
Scope: Forma 3D Connect — Centralized Logging Stack

Table of Contents

  1. Executive Summary
  2. Current State
  3. Proposed Architecture
  4. Technology Deep Dive
  5. Application-Side Integration
  6. Docker Compose Deployment
  7. Log Rotation and TTL
  8. Backup to DigitalOcean Spaces
  9. Grafana Dashboards
  10. Resource Requirements
  11. Migration Strategy
  12. Cost Analysis
  13. Risks and Mitigations
  14. Recommendations and Next Steps

1. Executive Summary

This document evaluates migrating structured logging from Sentry Logs to a self-hosted ClickHouse + Grafana stack, collected via OpenTelemetry. Sentry remains the primary tool for error tracking, performance monitoring, and profiling — only the logging concern is being moved.

Why Move Logging Away from Sentry?

  • Cost: Sentry Logs is a metered feature; high-volume logging becomes expensive at scale
  • Retention: Sentry retains logs for limited periods; self-hosted ClickHouse allows arbitrary retention
  • Querying: ClickHouse offers sub-second analytical queries on billions of log rows; Sentry's log search is limited
  • Ownership: Logs contain sensitive business data — self-hosting provides full data sovereignty
  • Flexibility: Grafana dashboards are far more customizable than Sentry's log explorer
  • Vendor independence: OpenTelemetry is vendor-neutral — the backend can be swapped without app changes

Key Decisions

| Concern | Current | Proposed |
| --- | --- | --- |
| Error tracking | Sentry | Sentry (unchanged) |
| Performance / Tracing | Sentry + OTel | Sentry + OTel (unchanged) |
| Profiling | Sentry | Sentry (unchanged) |
| Structured logging | Sentry Logs (Sentry.logger.*) | ClickHouse via OpenTelemetry |
| Log visualization | Sentry Explore > Logs + Dozzle | Grafana + ClickHouse plugin |
| Real-time log tailing | Dozzle (Docker log viewer) | Dozzle (keep) + Grafana Live |

2. Current State

2.1 Sentry Logging Architecture

Each microservice currently sends structured logs to Sentry via SentryLoggerService:

┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│   Gateway    │   │ Order Svc    │   │ Print Svc    │   │ Shipping Svc │
│              │   │              │   │              │   │              │
│ SentryLogger │   │ SentryLogger │   │ SentryLogger │   │ SentryLogger │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │                  │
       └──────────────────┴──────────────────┴──────────────────┘
                                    │
                                    ▼
                          ┌──────────────────┐
                          │   Sentry Cloud   │
                          │  (Logs + Errors  │
                          │   + Traces)      │
                          └──────────────────┘

Files involved per service:

| File | Purpose |
| --- | --- |
| observability/instrument.ts | Sentry SDK init with enableLogs: true |
| observability/services/sentry-logger.service.ts | Wrapper sending Sentry.logger.info() etc. |
| observability/services/business-observability.service.ts | Business event logging |
| observability/interceptors/logging.interceptor.ts | HTTP request/response logging |
| observability/filters/sentry-exception.filter.ts | Exception capture (stays with Sentry) |

Shared library: libs/observability exports getSentryConfig(), getOtelConfig(), and setServiceName().

2.2 What Works Well

  • Sentry exception filters — keep as-is
  • Sentry performance monitoring / tracing — keep as-is
  • Sentry profiling — keep as-is
  • getOtelConfig() in libs/observability already exists (currently unused for logging)

2.3 Pain Points with Sentry Logging

  • Sentry Logs is still marked as experimental (_experiments: { enableLogs: true })
  • Limited log retention (30 days on Team plan)
  • No way to build custom dashboards over logs
  • Cannot cross-reference logs with custom ClickHouse analytics
  • Costs scale linearly with log volume

3. Proposed Architecture

3.1 High-Level Overview

┌───────────────────────────────────────────────────────────────────────────┐
│                        APPLICATION LAYER                                  │
│                                                                           │
│  ┌─────────┐  ┌───────────┐  ┌───────────┐  ┌──────────┐  ┌───────────┐ │
│  │ Gateway │  │ Order Svc │  │ Print Svc │  │Ship. Svc │  │GridFlock  │ │
│  │         │  │           │  │           │  │          │  │           │ │
│  │  Pino   │  │   Pino    │  │   Pino    │  │  Pino    │  │   Pino    │ │
│  │  + OTel │  │   + OTel  │  │   + OTel  │  │  + OTel  │  │   + OTel  │ │
│  └────┬────┘  └─────┬─────┘  └─────┬─────┘  └────┬─────┘  └─────┬─────┘ │
│       │             │              │             │              │        │
│       └─────────────┴──────────────┴─────────────┴──────────────┘        │
│                                    │                                      │
│                            OTLP (gRPC :4317)                             │
└────────────────────────────────────┼──────────────────────────────────────┘
                                     │
                                     ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                     COLLECTION LAYER                                       │
│                                                                            │
│  ┌──────────────────────────────────────┐                                  │
│  │    OpenTelemetry Collector (Contrib) │                                  │
│  │                                      │                                  │
│  │  Receivers:  otlp (gRPC + HTTP)      │                                  │
│  │  Processors: batch, resource,        │                                  │
│  │              attributes, filter      │                                  │
│  │  Exporters:  clickhouse              │                                  │
│  └──────────────────┬───────────────────┘                                  │
│                     │                                                      │
└─────────────────────┼──────────────────────────────────────────────────────┘
                      │
                      ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                     STORAGE + VISUALIZATION                                │
│                                                                            │
│  ┌──────────────────┐          ┌──────────────────────────┐                │
│  │    ClickHouse    │◀─────────│       Grafana            │                │
│  │                  │  query   │                          │                │
│  │  otel_logs DB    │          │  ClickHouse Data Source  │                │
│  │  TTL: 90 days    │          │  Log dashboards          │                │
│  │                  │          │  Alert rules             │                │
│  └────────┬─────────┘          └──────────────────────────┘                │
│           │                                                                │
│           │  Nightly backup                                                │
│           ▼                                                                │
│  ┌──────────────────────────┐                                              │
│  │  DigitalOcean Spaces     │                                              │
│  │  (S3-compatible)         │                                              │
│  │  forma3d-log-backups/    │                                              │
│  └──────────────────────────┘                                              │
└────────────────────────────────────────────────────────────────────────────┘

3.2 Data Flow

  1. Application logs via Pino (structured JSON), auto-instrumented by @opentelemetry/instrumentation-pino
  2. OTel SDK in each service sends log records over OTLP gRPC to the OTel Collector
  3. OTel Collector batches, enriches (service name, environment, host), and exports to ClickHouse
  4. ClickHouse stores logs in an OTel-schema table with TTL-based retention
  5. Grafana queries ClickHouse via the official plugin for dashboards and alerting
  6. Nightly cron runs BACKUP TABLE ... TO S3(...) to DigitalOcean Spaces
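Step 2 of the flow can be made concrete. The sketch below builds the JSON shape a single log record takes on the wire when exported over OTLP/HTTP to the Collector's :4318 endpoint. Field names follow the OTLP logs data model; the service name is illustrative, and in practice the OTel SDK assembles and sends this payload for you.

```typescript
// Shape of one log record in an OTLP/HTTP JSON export, as received by the
// Collector's otlp receiver on :4318 at POST /v1/logs.
interface OtlpKeyValue {
  key: string;
  value: { stringValue: string };
}

function buildOtlpLogPayload(serviceName: string, severityText: string, body: string) {
  return {
    resourceLogs: [
      {
        resource: {
          attributes: [
            { key: 'service.name', value: { stringValue: serviceName } } as OtlpKeyValue,
          ],
        },
        scopeLogs: [
          {
            scope: { name: 'pino' },
            logRecords: [
              {
                timeUnixNano: String(Date.now() * 1_000_000),
                severityText,
                severityNumber: 9, // INFO on the OTel severity scale
                body: { stringValue: body },
              },
            ],
          },
        ],
      },
    ],
  };
}

// Sending it manually (the SDK normally does this; endpoint name per this doc):
// await fetch('http://otel-collector:4318/v1/logs', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(buildOtlpLogPayload('order-service', 'INFO', 'Order created')),
// });
```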

3.3 Sentry Coexistence

Sentry continues to handle:

  • Exception capture (Sentry.captureException)
  • Performance traces (Sentry's OTel integration)
  • Profiling (nodeProfilingIntegration)

The instrument.ts files in each service keep their current Sentry init but drop _experiments: { enableLogs: true }. The SentryLoggerService is replaced by an OtelLoggerService that writes to Pino (which is bridged to OTel).


4. Technology Deep Dive

4.1 OpenTelemetry Collector (Contrib)

The Contrib distribution is required because it includes the clickhouseexporter.

| Detail | Value |
| --- | --- |
| Image | otel/opentelemetry-collector-contrib:0.120.0 |
| Receivers | otlp (gRPC :4317, HTTP :4318) |
| Processors | batch, resource, attributes |
| Exporters | clickhouse |
| Health check | :13133/health |
| Metrics | :8888/metrics (Prometheus) |

Key configuration — otel-collector-config.yaml:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 10000
    send_batch_max_size: 20000

  resource:
    attributes:
      - key: deployment.environment
        value: "${ENVIRONMENT}"
        action: upsert
      - key: host.name
        value: "${HOSTNAME}"
        action: upsert

  filter/drop-debug:
    logs:
      log_record:
        - 'severity_number < 9'  # Drop TRACE and DEBUG in production

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000
    database: otel
    logs_table_name: otel_logs
    timeout: 10s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    create_schema: true

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [resource, filter/drop-debug, batch]
      exporters: [clickhouse]

  telemetry:
    logs:
      level: warn
    metrics:
      address: 0.0.0.0:8888

  extensions: [health_check]

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

Notes:

  • create_schema: true auto-creates the otel_logs table on first run
  • The filter/drop-debug processor prevents debug-level logs from reaching ClickHouse in production (toggle per environment)
  • The batch processor is critical for performance — ClickHouse ingests best in large batches
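The `severity_number < 9` threshold maps onto the OTel log severity scale, where each severity text owns a numeric range. A small lookup (TypeScript, illustrative only) makes the filter's cut-off explicit:

```typescript
// OTel log data model severity ranges: TRACE 1-4, DEBUG 5-8, INFO 9-12,
// WARN 13-16, ERROR 17-20, FATAL 21-24.
function severityRange(severityNumber: number): string {
  if (severityNumber >= 1 && severityNumber <= 4) return 'TRACE';
  if (severityNumber >= 5 && severityNumber <= 8) return 'DEBUG';
  if (severityNumber >= 9 && severityNumber <= 12) return 'INFO';
  if (severityNumber >= 13 && severityNumber <= 16) return 'WARN';
  if (severityNumber >= 17 && severityNumber <= 20) return 'ERROR';
  if (severityNumber >= 21 && severityNumber <= 24) return 'FATAL';
  return 'UNSPECIFIED';
}

// severity_number < 9 therefore drops exactly TRACE and DEBUG, keeping INFO
// (9) and everything above.
```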

4.2 ClickHouse

| Detail | Value |
| --- | --- |
| Image | clickhouse/clickhouse-server:24.12-alpine |
| Protocol ports | 9000 (native TCP), 8123 (HTTP) |
| Storage | Named Docker volume clickhouse-data |
| Min RAM recommended | 2 GB (for our log volume) |
| Compression | LZ4 by default (~10:1 on log data) |

Why ClickHouse over alternatives?

| Alternative | Pros | Cons |
| --- | --- | --- |
| Elasticsearch / OpenSearch | Mature, rich full-text search | RAM-hungry (4+ GB minimum), complex to operate |
| Loki + Grafana | Native Grafana integration, low resource usage | Limited querying (labels only, no full-text) |
| ClickHouse | Blazing fast analytics, SQL interface, low RAM, excellent compression | Less mature OTel ecosystem (improving rapidly) |
| PostgreSQL | Already in the stack | Not designed for high-volume log ingestion |

ClickHouse wins for this use case because:

  • Compression: Log data compresses 10-20x, keeping disk usage low
  • SQL: Query language is familiar (no learning curve like LogQL)
  • Speed: Sub-second queries on millions of rows
  • Lightweight: Runs well in 1-2 GB RAM for moderate log volumes
  • TTL: Built-in time-based data lifecycle management
  • S3 backup: Native BACKUP TABLE ... TO S3(...) command
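As an aside on the SQL point: any service in the stack can query ClickHouse's HTTP interface (:8123) directly from Node with nothing but `fetch`. The helper below is a sketch; the host, user, and database names mirror the deployment described later in this document but are assumptions about the final setup.

```typescript
// Build a ClickHouse HTTP-interface query URL. `query` and `default_format`
// are standard ClickHouse HTTP parameters; JSONEachRow returns one JSON
// object per result row.
function buildQueryUrl(baseUrl: string, sql: string): string {
  const url = new URL(baseUrl);
  url.searchParams.set('query', sql);
  url.searchParams.set('default_format', 'JSONEachRow');
  return url.toString();
}

// Usage (network call commented out — requires a running ClickHouse):
// const url = buildQueryUrl('http://clickhouse:8123/', 'SELECT count() FROM otel.otel_logs');
// const res = await fetch(url, {
//   headers: {
//     'X-ClickHouse-User': 'otel',
//     'X-ClickHouse-Key': process.env.CLICKHOUSE_PASSWORD ?? '',
//   },
// });
```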

4.3 Grafana with ClickHouse Plugin

| Detail | Value |
| --- | --- |
| Image | grafana/grafana-oss:11.5.0 |
| ClickHouse plugin | grafana-clickhouse-datasource v4.x |
| Protocol | Native TCP to ClickHouse on port 9000 (matching the provisioned data source in section 6.3) |
| Authentication | Grafana built-in auth (admin password from env) |

The ClickHouse plugin v4 has first-class support for the OpenTelemetry log schema:

  • Auto-detects OTel column names (Timestamp, SeverityText, Body, etc.)
  • Log panel renders with severity coloring
  • Logs link to traces via the TraceId field
  • Query builder minimizes the need for raw SQL


5. Application-Side Integration

5.1 Approach: Pino + OpenTelemetry Instrumentation

Rather than calling OTel log APIs directly, the recommended approach is:

  1. Use Pino as the application logger (fast, structured JSON)
  2. Enable @opentelemetry/instrumentation-pino to bridge Pino logs to OTel
  3. The OTel SDK's LogRecordExporter sends log records via OTLP to the Collector

This is preferable because:

  • Pino logs still go to stdout (Docker captures them, Dozzle still works)
  • Trace context (traceId, spanId) is automatically injected into every log record
  • No tight coupling to any backend — swapping ClickHouse for Loki is a config change, not a code change
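The automatic trace-context injection relies on async-local storage under the hood: the OTel SDK keeps the active span's ids in an async context, and the Pino instrumentation mixes them into every record logged within that context. The stand-alone mock below reproduces the mechanism with Node's own AsyncLocalStorage, without the real SDK; names like `traceStore` are invented for illustration.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

interface SpanContext {
  traceId: string;
  spanId: string;
}

// Stand-in for the OTel context: the SDK stores the active span here.
const traceStore = new AsyncLocalStorage<SpanContext>();

// Stand-in for the Pino OTel bridge: every log record emitted while a span
// is active picks up its trace_id/span_id automatically.
function logWithTraceContext(
  message: string,
  attrs: Record<string, unknown> = {}
): Record<string, unknown> {
  const ctx = traceStore.getStore();
  return {
    msg: message,
    ...attrs,
    ...(ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId } : {}),
  };
}

// Simulate a request handled inside an active span:
const record = traceStore.run({ traceId: 'abc123', spanId: 'def456' }, () =>
  logWithTraceContext('order created', { orderId: 42 })
);
// record now carries trace_id 'abc123' without the call site mentioning it.
```

This is why application code never has to pass trace ids around: the context travels with the async call chain, not with function arguments.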

5.2 Required Packages

pnpm add -w pino @opentelemetry/sdk-node @opentelemetry/api \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-logs-otlp-grpc \
  @opentelemetry/sdk-logs \
  @opentelemetry/instrumentation-pino \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions

Note: pino ships its own TypeScript type definitions (since v7), so a separate @types/pino dev dependency is not needed.

5.3 File Changes (per service)

New file: libs/observability/src/lib/otel-logger.ts

import pino from 'pino';
import { getServiceName } from './service-context';

export function createLogger(context?: string): pino.Logger {
  const logger = pino({
    level: process.env['LOG_LEVEL'] || 'info',
    transport:
      process.env['NODE_ENV'] === 'development'
        ? { target: 'pino-pretty', options: { colorize: true } }
        : undefined,
  });

  return context ? logger.child({ context }) : logger;
}

Modified: apps/*/src/observability/instrument.ts

// BEFORE: only Sentry init
import * as Sentry from '@sentry/nestjs';
// ...

// AFTER: Sentry init + OTel Logs SDK
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-grpc';
import { BatchLogRecordProcessor } from '@opentelemetry/sdk-logs';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';
import * as Sentry from '@sentry/nestjs';
import { nodeProfilingIntegration } from '@sentry/profiling-node';
import { getSentryConfig, SENTRY_IGNORED_ERRORS, setServiceName, getOtelConfig } from '@forma3d/observability';

const SERVICE_NAME = 'order-service'; // varies per service
setServiceName(SERVICE_NAME);

// 1. Initialize OTel SDK (MUST happen before Sentry and app imports)
const otelConfig = getOtelConfig(SERVICE_NAME);

if (otelConfig.exporterEndpoint) {
  const sdk = new NodeSDK({
    resource: new Resource({
      [ATTR_SERVICE_NAME]: otelConfig.serviceName,
      [ATTR_SERVICE_VERSION]: otelConfig.serviceVersion,
      'deployment.environment': otelConfig.environment,
    }),
    logRecordProcessors: [
      new BatchLogRecordProcessor(
        new OTLPLogExporter({
          url: otelConfig.exporterEndpoint,
        })
      ),
    ],
    instrumentations: [
      getNodeAutoInstrumentations({
        '@opentelemetry/instrumentation-pino': {
          disableLogSending: false, // Enable bridging Pino → OTel
        },
        // Disable instrumentations we don't need
        '@opentelemetry/instrumentation-fs': { enabled: false },
      }),
    ],
  });

  sdk.start();
}

// 2. Initialize Sentry (after OTel so Sentry can link to OTel traces)
const config = getSentryConfig();
if (config.dsn) {
  Sentry.init({
    dsn: config.dsn,
    environment: config.environment,
    release: config.release,
    debug: config.debug,
    tracesSampleRate: config.tracesSampleRate,
    profilesSampleRate: config.profilesSampleRate,
    integrations: [nodeProfilingIntegration()],
    // REMOVED: _experiments: { enableLogs: true }
    ignoreErrors: SENTRY_IGNORED_ERRORS,
    beforeSend(event) {
      if (event.request?.headers) {
        delete event.request.headers['authorization'];
        delete event.request.headers['cookie'];
        delete event.request.headers['x-shopify-access-token'];
      }
      return event;
    },
    initialScope: {
      tags: { service: SERVICE_NAME, component: 'backend' },
    },
  });
}

export { Sentry };

Replaced: SentryLoggerServiceOtelLoggerService

The new service uses Pino instead of Sentry.logger.*:

import { Injectable } from '@nestjs/common';
import { createLogger } from '@forma3d/observability';
import type pino from 'pino';

@Injectable()
export class OtelLoggerService {
  private readonly logger: pino.Logger;

  constructor() {
    this.logger = createLogger('business');
  }

  info(message: string, attributes?: Record<string, unknown>): void {
    this.logger.info(attributes, message);
  }

  warn(message: string, attributes?: Record<string, unknown>): void {
    this.logger.warn(attributes, message);
  }

  error(message: string, attributes?: Record<string, unknown>): void {
    this.logger.error(attributes, message);
  }

  debug(message: string, attributes?: Record<string, unknown>): void {
    this.logger.debug(attributes, message);
  }

  logEvent(eventType: string, message: string, attributes?: Record<string, unknown>): void {
    this.logger.info({ ...attributes, eventType }, message);
  }

  logAudit(action: string, success: boolean, attributes?: Record<string, unknown>): void {
    const level = success ? 'info' : 'warn';
    this.logger[level]({ ...attributes, action, success, category: 'audit' }, `Audit: ${action}`);
  }
}

5.4 Environment Variables to Add

Each service container needs one new environment variable:

# In docker-compose.yml, for each backend service:
- OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317

Optional tuning variables:

| Variable | Default | Description |
| --- | --- | --- |
| OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OTel Collector gRPC endpoint |
| LOG_LEVEL | info | Minimum log level (trace/debug/info/warn/error) |
| OTEL_LOG_LEVEL | warn | OTel SDK's own log level |
| OTEL_SERVICE_NAME | Set in code | Override service name |

6. Docker Compose Deployment

6.1 New Services to Add

Add these three services to deployment/staging/docker-compose.yml:

  # --------------------------------------------------------------------------
  # OpenTelemetry Collector - Log collection and forwarding
  # --------------------------------------------------------------------------
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.120.0
    container_name: forma3d-otel-collector
    restart: unless-stopped
    command: ['--config=/etc/otelcol-contrib/config.yaml']
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml:ro
    ports:
      - '4317:4317'   # OTLP gRPC (internal only — remove port mapping if not needed externally)
      - '4318:4318'   # OTLP HTTP
    environment:
      - ENVIRONMENT=staging
      - HOSTNAME=${HOSTNAME:-forma3d-staging}
    networks:
      - forma3d-network
    healthcheck:
      test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://localhost:13133/']
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    depends_on:
      clickhouse:
        condition: service_healthy

  # --------------------------------------------------------------------------
  # ClickHouse - Log Storage
  # --------------------------------------------------------------------------
  clickhouse:
    image: clickhouse/clickhouse-server:24.12-alpine
    container_name: forma3d-clickhouse
    restart: unless-stopped
    volumes:
      - clickhouse-data:/var/lib/clickhouse
      - clickhouse-logs:/var/log/clickhouse-server
      - ./clickhouse-config.xml:/etc/clickhouse-server/config.d/custom.xml:ro
      - ./clickhouse-users.xml:/etc/clickhouse-server/users.d/custom.xml:ro
    environment:
      - CLICKHOUSE_DB=otel
      - CLICKHOUSE_USER=otel
      - CLICKHOUSE_PASSWORD=${CLICKHOUSE_PASSWORD}
      - CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1
    networks:
      - forma3d-network
    healthcheck:
      test: ['CMD', 'clickhouse-client', '--query', 'SELECT 1']
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 30s
    ulimits:
      nofile:
        soft: 262144
        hard: 262144

  # --------------------------------------------------------------------------
  # Grafana - Log Visualization and Dashboards
  # --------------------------------------------------------------------------
  grafana:
    image: grafana/grafana-oss:11.5.0
    container_name: forma3d-grafana
    restart: unless-stopped
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
      - GF_INSTALL_PLUGINS=grafana-clickhouse-datasource
      - GF_SERVER_ROOT_URL=https://staging-connect-grafana.forma3d.be
      - GF_SERVER_SERVE_FROM_SUB_PATH=false
    networks:
      - forma3d-network
    labels:
      - 'traefik.enable=true'
      - 'traefik.http.routers.grafana.rule=Host(`staging-connect-grafana.forma3d.be`)'
      - 'traefik.http.routers.grafana.entrypoints=websecure'
      - 'traefik.http.routers.grafana.tls=true'
      - 'traefik.http.routers.grafana.tls.certresolver=letsencrypt'
      - 'traefik.http.services.grafana.loadbalancer.server.port=3000'
    healthcheck:
      test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://localhost:3000/api/health']
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    depends_on:
      clickhouse:
        condition: service_healthy

New volumes to add:

volumes:
  # ... existing volumes ...
  clickhouse-data:
  clickhouse-logs:
  grafana-data:

6.2 Updated Service Dependencies

Each backend service that sends logs should not hard-depend on the OTel Collector. If the collector is down, the OTel SDK buffers and retries — the application continues running. However, the OTel Collector itself depends on ClickHouse.

6.3 Configuration Files

deployment/staging/clickhouse-config.xml:

<?xml version="1.0"?>
<clickhouse>
    <!-- Listen on all interfaces within Docker network -->
    <listen_host>0.0.0.0</listen_host>

    <!-- Logging -->
    <logger>
        <level>warning</level>
        <log>/var/log/clickhouse-server/clickhouse-server.log</log>
        <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
        <size>100M</size>
        <count>3</count>
    </logger>

    <!-- Memory limits for single-server deployment -->
    <max_server_memory_usage_to_ram_ratio>0.8</max_server_memory_usage_to_ram_ratio>

    <!-- S3 backup configuration (DigitalOcean Spaces) -->
    <backups>
        <allowed_path>/backups/</allowed_path>
        <allowed_disk>s3_backups</allowed_disk>
    </backups>

    <storage_configuration>
        <disks>
            <s3_backups>
                <type>s3</type>
                <!-- ClickHouse does not expand ${VAR} placeholders in its XML
                     config. Either render this file at deploy time (e.g. with
                     envsubst), or pull credentials from the container
                     environment via from_env as shown here. Endpoint values
                     match the Spaces settings in section 8. -->
                <endpoint>https://ams3.digitaloceanspaces.com/forma3d-log-backups/clickhouse-backups/</endpoint>
                <access_key_id from_env="DO_SPACES_KEY"/>
                <secret_access_key from_env="DO_SPACES_SECRET"/>
            </s3_backups>
        </disks>
    </storage_configuration>
</clickhouse>

deployment/staging/clickhouse-users.xml:

<?xml version="1.0"?>
<clickhouse>
    <users>
        <otel>
            <password_sha256_hex replace="true"><!-- generated hash --></password_sha256_hex>
            <networks>
                <ip>::/0</ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
            <access_management>0</access_management>
        </otel>
    </users>
</clickhouse>

Note: For simplicity, use CLICKHOUSE_PASSWORD env var with the default user approach instead of the XML user file. The XML approach is shown for reference on how to lock down access further.

deployment/staging/grafana/provisioning/datasources/clickhouse.yaml:

apiVersion: 1

datasources:
  - name: ClickHouse
    type: grafana-clickhouse-datasource
    access: proxy
    isDefault: true
    jsonData:
      host: clickhouse
      port: 9000
      protocol: native
      username: otel
      defaultDatabase: otel
      logs:
        defaultDatabase: otel
        defaultTable: otel_logs
        otelEnabled: true
        otelVersion: latest
        timeColumn: Timestamp
        levelColumn: SeverityText
        messageColumn: Body
    secureJsonData:
      password: ${CLICKHOUSE_PASSWORD}

7. Log Rotation and TTL

7.1 ClickHouse TTL Strategy

ClickHouse's TTL feature automatically deletes data that exceeds a time threshold. This replaces traditional log rotation.

Schema with TTL (auto-created by OTel Collector with create_schema: true, but can be customized):

CREATE TABLE IF NOT EXISTS otel.otel_logs
(
    Timestamp          DateTime64(9),
    TimestampDate      Date DEFAULT toDate(Timestamp),
    TraceId            String,
    SpanId             String,
    TraceFlags         UInt32,
    SeverityText       LowCardinality(String),
    SeverityNumber     Int32,
    ServiceName        LowCardinality(String),
    Body               String,
    ResourceSchemaUrl  String,
    ResourceAttributes Map(LowCardinality(String), String),
    ScopeSchemaUrl     String,
    ScopeName          String,
    ScopeVersion       String,
    ScopeAttributes    Map(LowCardinality(String), String),
    LogAttributes      Map(LowCardinality(String), String),

    INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
    INDEX idx_body Body TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 1
)
ENGINE = MergeTree()
PARTITION BY TimestampDate
ORDER BY (ServiceName, SeverityText, toUnixTimestamp(Timestamp), TraceId)
TTL TimestampDate + INTERVAL 90 DAY
SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1;

7.2 Tiered Retention Policy

Log Level Retention Rationale
| Log Level | Retention | Rationale |
| --- | --- | --- |
| ERROR, FATAL | 180 days | Need long history for debugging recurring issues |
| WARN | 90 days | Important but less critical |
| INFO | 30 days | Operational visibility, high volume |
| DEBUG | 7 days | Only on staging; filtered out in production |

This can be achieved with conditional TTL. Note that row-level WHERE conditions require row deletes, so the ttl_only_drop_parts = 1 setting from the schema above must be disabled for this variant:

ALTER TABLE otel.otel_logs MODIFY TTL
    TimestampDate + INTERVAL 7 DAY DELETE WHERE SeverityText IN ('TRACE', 'DEBUG'),
    TimestampDate + INTERVAL 30 DAY DELETE WHERE SeverityText = 'INFO',
    TimestampDate + INTERVAL 90 DAY DELETE WHERE SeverityText = 'WARN',
    TimestampDate + INTERVAL 180 DAY DELETE;

7.3 Partition Management

Partitioning by TimestampDate means each day's data is a separate partition. Benefits:

  • TTL drops entire partitions (fast, no row-level deletes)
  • Backups can target specific date ranges
  • Old data is cleanly isolated from hot data

ClickHouse TTL cleanup runs every 4 hours by default (merge_with_ttl_timeout = 14400). This is adequate — there's no urgency to delete data the moment it expires.

7.4 ClickHouse Internal Log Rotation

ClickHouse's own server logs (not to be confused with application logs stored in ClickHouse) are managed via clickhouse-config.xml:

<logger>
    <size>100M</size>  <!-- Max file size before rotation -->
    <count>3</count>   <!-- Keep 3 rotated files -->
</logger>

8. Backup to DigitalOcean Spaces

8.1 DigitalOcean Spaces Setup

DigitalOcean Spaces is S3-compatible and works with ClickHouse's native BACKUP command.

Spaces configuration:

| Setting | Value |
| --- | --- |
| Bucket name | forma3d-log-backups |
| Region | ams3 (Amsterdam) — closest to EU infrastructure |
| CDN | Disabled (not needed for backups) |
| Versioning | Disabled (ClickHouse manages versions) |

Access credentials:

DO_SPACES_KEY=<generated-key>
DO_SPACES_SECRET=<generated-secret>
DO_SPACES_REGION=ams3
DO_SPACES_BUCKET=forma3d-log-backups

8.2 Backup Strategy

| Backup Type | Frequency | Retention | Contents |
| --- | --- | --- | --- |
| Full backup | Weekly (Sunday 3 AM) | 4 weeks | Entire otel_logs table |
| Incremental | Daily (3 AM) | 2 weeks | Changes since last full |
| Pre-TTL archive | Before TTL expiry | 1 year (cold) | ERROR/FATAL logs about to expire |

8.3 Backup Commands

Full backup:

BACKUP TABLE otel.otel_logs
TO S3(
  'https://ams3.digitaloceanspaces.com/forma3d-log-backups/clickhouse/full/{yyyy}-{mm}-{dd}/',
  '<DO_SPACES_KEY>',
  '<DO_SPACES_SECRET>'
);

Incremental backup (referencing last full):

BACKUP TABLE otel.otel_logs
TO S3(
  'https://ams3.digitaloceanspaces.com/forma3d-log-backups/clickhouse/incremental/{yyyy}-{mm}-{dd}/',
  '<DO_SPACES_KEY>',
  '<DO_SPACES_SECRET>'
)
SETTINGS base_backup = S3(
  'https://ams3.digitaloceanspaces.com/forma3d-log-backups/clickhouse/full/{last_full_date}/',
  '<DO_SPACES_KEY>',
  '<DO_SPACES_SECRET>'
);

Restore from backup:

RESTORE TABLE otel.otel_logs
FROM S3(
  'https://ams3.digitaloceanspaces.com/forma3d-log-backups/clickhouse/full/{yyyy}-{mm}-{dd}/',
  '<DO_SPACES_KEY>',
  '<DO_SPACES_SECRET>'
);

8.4 Automated Backup Script

Create deployment/staging/scripts/backup-clickhouse-logs.sh:

#!/bin/bash
set -euo pipefail

# Configuration
CLICKHOUSE_HOST="clickhouse"
CLICKHOUSE_PORT="9000"
CLICKHOUSE_USER="otel"
CLICKHOUSE_PASSWORD="${CLICKHOUSE_PASSWORD}"

DO_SPACES_ENDPOINT="https://${DO_SPACES_REGION}.digitaloceanspaces.com"
DO_SPACES_BUCKET="${DO_SPACES_BUCKET}"
DO_SPACES_KEY="${DO_SPACES_KEY}"
DO_SPACES_SECRET="${DO_SPACES_SECRET}"

DATE=$(date +%Y-%m-%d)
DAY_OF_WEEK=$(date +%u)  # 1=Monday, 7=Sunday

BACKUP_PATH="${DO_SPACES_ENDPOINT}/${DO_SPACES_BUCKET}/clickhouse"

if [ "$DAY_OF_WEEK" -eq 7 ]; then
  # Sunday: Full backup
  echo "[$(date)] Starting full backup..."
  docker exec forma3d-clickhouse clickhouse-client \
    --user "$CLICKHOUSE_USER" \
    --password "$CLICKHOUSE_PASSWORD" \
    --query "BACKUP TABLE otel.otel_logs TO S3('${BACKUP_PATH}/full/${DATE}/', '${DO_SPACES_KEY}', '${DO_SPACES_SECRET}')"
  echo "[$(date)] Full backup completed: ${BACKUP_PATH}/full/${DATE}/"
else
  # Weekday: Incremental backup
  LAST_SUNDAY=$(date -d "last sunday" +%Y-%m-%d 2>/dev/null || date -v-sunday +%Y-%m-%d)
  echo "[$(date)] Starting incremental backup (base: ${LAST_SUNDAY})..."
  docker exec forma3d-clickhouse clickhouse-client \
    --user "$CLICKHOUSE_USER" \
    --password "$CLICKHOUSE_PASSWORD" \
    --query "BACKUP TABLE otel.otel_logs TO S3('${BACKUP_PATH}/incremental/${DATE}/', '${DO_SPACES_KEY}', '${DO_SPACES_SECRET}') SETTINGS base_backup = S3('${BACKUP_PATH}/full/${LAST_SUNDAY}/', '${DO_SPACES_KEY}', '${DO_SPACES_SECRET}')"
  echo "[$(date)] Incremental backup completed: ${BACKUP_PATH}/incremental/${DATE}/"
fi

Cron entry (on the Droplet):

0 3 * * * /opt/forma3d/scripts/backup-clickhouse-logs.sh >> /var/log/clickhouse-backup.log 2>&1

8.5 DigitalOcean Spaces Lifecycle Policy

Use Spaces lifecycle rules to automatically clean up old backups:

| Path prefix | Expiration |
| --- | --- |
| clickhouse/full/ | 35 days (keep ~5 full backups) |
| clickhouse/incremental/ | 14 days |
| clickhouse/archive/ | 365 days |

9. Grafana Dashboards

9.1 Provisioned Dashboards

Create pre-built dashboards via Grafana provisioning:

Dashboard 1: Service Logs Overview

  • Log volume over time (by service)
  • Error rate by service (stacked bar chart)
  • Latest error logs (table)
  • Log level distribution (pie chart)

Dashboard 2: Request Tracing

  • Logs filtered by TraceId
  • Correlated with Sentry traces (link out)
  • Request lifecycle visualization

Dashboard 3: Business Events

  • Order processing events
  • Print job status changes
  • Shipment events
  • Webhook receipt/processing logs

Dashboard 4: System Health

  • OTel Collector throughput (records/sec)
  • ClickHouse disk usage
  • ClickHouse query performance
  • Log ingestion latency

9.2 Useful Grafana Queries

Error log count by service (last 24h):

SELECT
    ServiceName,
    count() AS error_count
FROM otel.otel_logs
WHERE SeverityText IN ('ERROR', 'FATAL')
  AND Timestamp >= now() - INTERVAL 24 HOUR
GROUP BY ServiceName
ORDER BY error_count DESC

Log volume by level (time series):

```sql
SELECT
    toStartOfFiveMinutes(Timestamp) AS time,
    SeverityText,
    count() AS count
FROM otel.otel_logs
WHERE Timestamp >= $__fromTime AND Timestamp <= $__toTime
GROUP BY time, SeverityText
ORDER BY time
```

Search logs by keyword:

```sql
SELECT
    Timestamp,
    ServiceName,
    SeverityText,
    Body,
    LogAttributes
FROM otel.otel_logs
WHERE Body LIKE '%order%'
  AND Timestamp >= $__fromTime AND Timestamp <= $__toTime
ORDER BY Timestamp DESC
LIMIT 100
```
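
Dashboard 2's trace correlation hinges on the `TraceId` column that the OTel exporter writes alongside each log record. A sketch of the underlying query — `${trace_id}` is a hypothetical Grafana dashboard variable, typically populated from a Sentry trace link or a text-box input:

```sql
SELECT
    Timestamp,
    ServiceName,
    SeverityText,
    Body
FROM otel.otel_logs
WHERE TraceId = '${trace_id}'   -- hypothetical dashboard variable
ORDER BY Timestamp ASC
```

Sorting ascending reconstructs the request lifecycle across services in the order events actually happened.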

9.3 Alerting Rules

Grafana can alert on ClickHouse queries:

| Alert | Condition | Channel |
|---|---|---|
| High error rate | > 50 ERROR logs in 5 min for any service | Slack / Email |
| Service silent | Zero logs from a service for > 10 min | Slack |
| ClickHouse disk usage | > 80% used (query `system.disks`) | Email |
| OTel Collector unhealthy | Health check fails | Slack |
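
The disk alert can be backed by a query against ClickHouse's built-in `system.disks` table; the > 80% threshold then lives in the Grafana alert rule rather than in the SQL:

```sql
SELECT
    name,
    path,
    round(100 * (1 - free_space / total_space), 1) AS used_pct
FROM system.disks
```

The "service silent" alert works the same way: group `otel.otel_logs` by `ServiceName`, take `max(Timestamp)`, and alert when the gap exceeds 10 minutes.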

10. Resource Requirements

10.1 Estimated Log Volume

| Service | Est. logs/min (staging) | Est. logs/min (production) |
|---|---|---|
| Gateway | 20 | 100 |
| Order Service | 30 | 150 |
| Print Service | 10 | 50 |
| Shipping Service | 10 | 50 |
| GridFlock Service | 5 | 20 |
| Total | ~75 | ~370 |

At production load: ~370 logs/min = ~530K logs/day = ~16M logs/month

10.2 Storage Estimates

Assuming average log size of 500 bytes (after ClickHouse compression ~50 bytes):

| Timeframe | Raw size | Compressed (est. 10:1) |
|---|---|---|
| 1 day | ~265 MB | ~26 MB |
| 30 days | ~8 GB | ~800 MB |
| 90 days | ~24 GB | ~2.4 GB |

ClickHouse compression is exceptionally efficient on repetitive log data.
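
The 10:1 ratio above is an estimate; once real data is flowing, the actual ratio can be read straight out of ClickHouse's `system.parts` metadata:

```sql
SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes))   AS on_disk,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 1) AS ratio
FROM system.parts
WHERE active AND database = 'otel'
GROUP BY table
```

If the measured ratio comes in well under 10:1, revisit the storage table above before sizing production disks.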

10.3 Container Resource Allocation

| Container | CPU (cores) | RAM | Disk |
|---|---|---|---|
| ClickHouse | 0.5-1 | 1-2 GB | 10 GB initial (grows) |
| OTel Collector | 0.25 | 256 MB | Minimal |
| Grafana | 0.25 | 256 MB | 1 GB (dashboards/cache) |
| Total new | 1-1.5 | 1.5-2.5 GB | ~12 GB |

10.4 Droplet Sizing Impact

The current staging Droplet will likely need to be upsized:

| Current | Recommended |
|---|---|
| 4 GB RAM / 2 vCPU | 8 GB RAM / 4 vCPU |

The 8 GB / 4 vCPU Droplet ($48/mo on DO) comfortably runs the entire stack including the new logging containers.


11. Migration Strategy

11.1 Phased Rollout

| Phase | Duration | Actions |
|---|---|---|
| Phase 1: Deploy infrastructure | 1 day | Add ClickHouse, OTel Collector, and Grafana to docker-compose. Verify containers start. |
| Phase 2: Dual-write | 1 week | Keep `SentryLoggerService` active. Add the OTel SDK to each service so logs are sent in parallel. Validate logs appear in Grafana. |
| Phase 3: Build dashboards | 3-5 days | Create Grafana dashboards. Validate query performance. Set up alerting. |
| Phase 4: Cut over | 1 day | Replace `SentryLoggerService` with `OtelLoggerService`. Remove `_experiments: { enableLogs: true }` from Sentry init. |
| Phase 5: Cleanup | 1 day | Remove `SentryLoggerService` files. Update documentation. Configure backups. |

11.2 Rollback Plan

  • If ClickHouse/OTel issues arise, re-enable _experiments: { enableLogs: true } in Sentry init
  • Keep SentryLoggerService files until Phase 5 is validated for 2+ weeks
  • Dozzle remains available as a last-resort log viewer (reads Docker stdout directly)

11.3 Files to Modify (per service)

| Action | Files |
|---|---|
| Modify | `observability/instrument.ts` — add OTel SDK init |
| Replace | `observability/services/sentry-logger.service.ts` → `otel-logger.service.ts` |
| Modify | `observability/observability.module.ts` — swap provider |
| Keep | `observability/filters/sentry-exception.filter.ts` — unchanged |
| Keep | `observability/interceptors/logging.interceptor.ts` — swap logger injection |
| Keep | `observability/services/business-observability.service.ts` — swap logger injection |

11.4 Services Affected

| Service | Logging changes | Sentry changes |
|---|---|---|
| Gateway | Yes - add OTel logger | Remove `enableLogs` |
| Order Service | Yes - add OTel logger | Remove `enableLogs` |
| Print Service | Yes - add OTel logger | Remove `enableLogs` |
| Shipping Service | Yes - add OTel logger | Remove `enableLogs` |
| GridFlock Service | Yes - add OTel logger | Remove `enableLogs` |
| Web (React) | No (frontend logs stay in Sentry for now) | No change |

12. Cost Analysis

12.1 Monthly Costs

| Component | Staging | Production | Notes |
|---|---|---|---|
| ClickHouse | $0 (self-hosted) | $0 (self-hosted) | Open source, runs on existing infra |
| OTel Collector | $0 (self-hosted) | $0 (self-hosted) | Open source |
| Grafana OSS | $0 (self-hosted) | $0 (self-hosted) | Open source |
| Droplet upsize (if needed) | +$24/mo | +$24/mo | 4 GB → 8 GB RAM |
| DO Spaces storage | ~$5/mo | ~$5/mo | $5/mo includes 250 GB; $0.02/GB overage |
| Total additional | ~$29/mo | ~$29/mo | |

12.2 Cost Savings

| Item | Current cost | After migration |
|---|---|---|
| Sentry Logs volume | Metered (plan-dependent) | $0 (self-hosted) |
| Sentry errors + tracing | Unchanged | Unchanged |
| Total Sentry bill reduction | Depends on log volume | Could be significant at scale |

12.3 TCO Comparison (12 months)

| Approach | Year 1 cost | Pros | Cons |
|---|---|---|---|
| Keep Sentry Logs | $600-2,400+ (volume dependent) | Zero operational overhead | Locked into Sentry, limited querying |
| ClickHouse + Grafana (this proposal) | ~$696 (~$58/mo) | Full control, unlimited retention, rich dashboards | Operational overhead, self-hosted |
| Grafana Cloud + Loki | ~$1,200+ (volume dependent) | Managed, easy setup | Vendor lock-in, limited log querying |

13. Risks and Mitigations

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| ClickHouse disk fills up | Medium | High — stops ingesting | TTL auto-deletes; disk monitoring alert in Grafana |
| OTel Collector crash | Low | Medium — logs buffered in SDK | Docker `restart: unless-stopped`; SDK buffers ~30s |
| ClickHouse crash | Low | High — log loss during downtime | Docker auto-restart; OTel Collector retries with backoff |
| Performance impact on app services | Low | Medium | OTel SDK is async; batch processor minimizes overhead |
| Grafana security (exposed dashboard) | Medium | Medium | Auth required; Traefik IP allowlisting for admin tools |
| Backup failure to Spaces | Low | Low — TTL still manages lifecycle | Backup script alerts on failure; manual backup option |
| Complexity for single operator | Medium | Medium | Good documentation; simple Docker Compose setup |
| Log loss during OTel Collector restart | Low | Low | Batch processor flushes on graceful shutdown |

14. Recommendations and Next Steps

14.1 Recommendation

Proceed with implementation. The ClickHouse + Grafana + OpenTelemetry stack is:

  • Cost-effective: Essentially free beyond a small Droplet upsize and Spaces storage
  • Operationally sound: ClickHouse is battle-tested at scale (Sentry itself runs on it)
  • Future-proof: OpenTelemetry is the industry standard; the backend can be swapped without app changes
  • Well-scoped: Only logging moves; Sentry keeps error tracking and tracing

14.2 Implementation Priority

| Priority | Task | Effort | Impact |
|---|---|---|---|
| P0 | Add ClickHouse + OTel Collector + Grafana to staging docker-compose | 1 day | Foundation |
| P0 | Create OTel Collector config file | 2 hours | Collection pipeline |
| P1 | Modify `instrument.ts` in all services to init the OTel SDK | 1 day | App integration |
| P1 | Create `OtelLoggerService` in `libs/observability` | 4 hours | Shared logger |
| P1 | Build initial Grafana dashboards | 1 day | Visualization |
| P2 | Swap `SentryLoggerService` → `OtelLoggerService` in all services | 1 day | Migration |
| P2 | Set up DO Spaces bucket and backup cron | 4 hours | Data safety |
| P2 | Apply tiered TTL policy | 1 hour | Retention |
| P3 | Remove Sentry Logs experiment flag | 30 min | Cleanup |
| P3 | Replicate setup in production docker-compose | 4 hours | Parity |
| P3 | Document runbooks for ClickHouse operations | 4 hours | Operations |

14.3 Open Questions

  1. Production Droplet sizing: Does the production server have enough headroom for 1.5-2.5 GB additional RAM?
  2. Grafana access: Should Grafana be publicly accessible (with auth) or restricted via IP allowlisting / VPN?
  3. Frontend logs: Should React/browser logs also be routed through OpenTelemetry, or stay in Sentry?
  4. Traces: Should we eventually route OTel traces to ClickHouse too, or keep them in Sentry?
  5. Dozzle: Keep Dozzle for quick container-level debugging, or replace entirely with Grafana?

14.4 Staging vs Production Differences

| Config | Staging | Production |
|---|---|---|
| `LOG_LEVEL` | debug | info |
| `filter/drop-debug` processor | Disabled | Enabled |
| ClickHouse TTL (INFO) | 14 days | 30 days |
| ClickHouse TTL (ERROR) | 90 days | 180 days |
| Backup frequency | Daily (no incremental) | Weekly full + daily incremental |
| Grafana alerts | Email only | Slack + Email |
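
The production TTL values above can be expressed as ClickHouse tiered `DELETE WHERE` TTLs — a sketch assuming the exporter's default `otel.otel_logs` schema (`Timestamp` as DateTime64, hence the `toDateTime` cast), applied once per environment:

```sql
-- Production retention sketch: 180 days for ERROR/FATAL, 30 days for everything else.
-- MODIFY TTL replaces the table's entire TTL clause, so both tiers go in one statement.
ALTER TABLE otel.otel_logs
MODIFY TTL
    toDateTime(Timestamp) + INTERVAL 30 DAY
        DELETE WHERE SeverityText NOT IN ('ERROR', 'FATAL'),
    toDateTime(Timestamp) + INTERVAL 180 DAY
        DELETE WHERE SeverityText IN ('ERROR', 'FATAL')
```

For staging, the same statement with 14 and 90 days matches the table above.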

Document Version: 1.0 Last Updated: February 2026 Next Review: After Phase 2 implementation