ClickHouse + Grafana Logging Research¶
Status: Research Document · Created: February 2026 · Scope: Forma 3D Connect — Centralized Logging Stack
Table of Contents¶
- Executive Summary
- Current State
- Proposed Architecture
- Technology Deep Dive
- Application-Side Integration
- Docker Compose Deployment
- Log Rotation and TTL
- Backup to DigitalOcean Spaces
- Grafana Dashboards
- Resource Requirements
- Migration Strategy
- Cost Analysis
- Risks and Mitigations
- Recommendations and Next Steps
1. Executive Summary¶
This document evaluates migrating structured logging from Sentry Logs to a self-hosted ClickHouse + Grafana stack, collected via OpenTelemetry. Sentry remains the primary tool for error tracking, performance monitoring, and profiling — only the logging concern is being moved.
Why Move Logging Away from Sentry?¶
- Cost: Sentry Logs is a metered feature; high-volume logging becomes expensive at scale
- Retention: Sentry retains logs for limited periods; self-hosted ClickHouse allows arbitrary retention
- Querying: ClickHouse offers sub-second analytical queries on billions of log rows; Sentry's log search is limited
- Ownership: Logs contain sensitive business data — self-hosting provides full data sovereignty
- Flexibility: Grafana dashboards are far more customizable than Sentry's log explorer
- Vendor independence: OpenTelemetry is vendor-neutral — the backend can be swapped without app changes
Key Decisions¶
| Concern | Current | Proposed |
|---|---|---|
| Error tracking | Sentry | Sentry (unchanged) |
| Performance / Tracing | Sentry + OTel | Sentry + OTel (unchanged) |
| Profiling | Sentry | Sentry (unchanged) |
| Structured logging | Sentry Logs (Sentry.logger.*) | ClickHouse via OpenTelemetry |
| Log visualization | Sentry Explore > Logs + Dozzle | Grafana + ClickHouse plugin |
| Real-time log tailing | Dozzle (Docker log viewer) | Dozzle (keep) + Grafana Live |
2. Current State¶
2.1 Sentry Logging Architecture¶
Each microservice currently sends structured logs to Sentry via SentryLoggerService:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Gateway │ │ Order Svc │ │ Print Svc │ │ Shipping Svc │
│ │ │ │ │ │ │ │
│ SentryLogger │ │ SentryLogger │ │ SentryLogger │ │ SentryLogger │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │ │
└──────────────────┴──────────────────┴──────────────────┘
│
▼
┌──────────────────┐
│ Sentry Cloud │
│ (Logs + Errors │
│ + Traces) │
└──────────────────┘
Files involved per service:
| File | Purpose |
|---|---|
| observability/instrument.ts | Sentry SDK init with enableLogs: true |
| observability/services/sentry-logger.service.ts | Wrapper sending Sentry.logger.info() etc. |
| observability/services/business-observability.service.ts | Business event logging |
| observability/interceptors/logging.interceptor.ts | HTTP request/response logging |
| observability/filters/sentry-exception.filter.ts | Exception capture (stays with Sentry) |
Shared library: libs/observability exports getSentryConfig(), getOtelConfig(), and setServiceName().
2.2 What Works Well¶
- Sentry exception filters — keep as-is
- Sentry performance monitoring / tracing — keep as-is
- Sentry profiling — keep as-is
- getOtelConfig() in libs/observability already exists (currently unused for logging)
2.3 Pain Points with Sentry Logging¶
- Sentry Logs is still marked as experimental (_experiments: { enableLogs: true })
- Limited log retention (30 days on Team plan)
- No way to build custom dashboards over logs
- Cannot cross-reference logs with custom ClickHouse analytics
- Costs scale linearly with log volume
3. Proposed Architecture¶
3.1 High-Level Overview¶
┌───────────────────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
│ │
│ ┌─────────┐ ┌───────────┐ ┌───────────┐ ┌──────────┐ ┌───────────┐ │
│ │ Gateway │ │ Order Svc │ │ Print Svc │ │Ship. Svc │ │GridFlock │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ Pino │ │ Pino │ │ Pino │ │ Pino │ │ Pino │ │
│ │ + OTel │ │ + OTel │ │ + OTel │ │ + OTel │ │ + OTel │ │
│ └────┬────┘ └─────┬─────┘ └─────┬─────┘ └────┬─────┘ └─────┬─────┘ │
│ │ │ │ │ │ │
│ └─────────────┴──────────────┴─────────────┴──────────────┘ │
│ │ │
│ OTLP (gRPC :4317) │
└────────────────────────────────────┼──────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────────────┐
│ COLLECTION LAYER │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ OpenTelemetry Collector (Contrib) │ │
│ │ │ │
│ │ Receivers: otlp (gRPC + HTTP) │ │
│ │ Processors: batch, resource, │ │
│ │ attributes, filter │ │
│ │ Exporters: clickhouse │ │
│ └──────────────────┬───────────────────┘ │
│ │ │
└─────────────────────┼──────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────────────┐
│ STORAGE + VISUALIZATION │
│ │
│ ┌──────────────────┐ ┌──────────────────────────┐ │
│ │ ClickHouse │◀─────────│ Grafana │ │
│ │ │ query │ │ │
│ │ otel_logs DB │ │ ClickHouse Data Source │ │
│ │ TTL: 90 days │ │ Log dashboards │ │
│ │ │ │ Alert rules │ │
│ └────────┬─────────┘ └──────────────────────────┘ │
│ │ │
│ │ Nightly backup │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ DigitalOcean Spaces │ │
│ │ (S3-compatible) │ │
│ │ forma3d-log-backups/ │ │
│ └──────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
3.2 Data Flow¶
- Application logs via Pino (structured JSON), auto-instrumented by @opentelemetry/instrumentation-pino
- OTel SDK in each service sends log records over OTLP gRPC to the OTel Collector
- OTel Collector batches, enriches (service name, environment, host), and exports to ClickHouse
- ClickHouse stores logs in an OTel-schema table with TTL-based retention
- Grafana queries ClickHouse via the official plugin for dashboards and alerting
- Nightly cron runs BACKUP TABLE ... TO S3(...) to DigitalOcean Spaces
3.3 Sentry Coexistence¶
Sentry continues to handle:
- Exception capture (Sentry.captureException)
- Performance traces (Sentry's OTel integration)
- Profiling (nodeProfilingIntegration)
The instrument.ts files in each service keep their current Sentry init but drop _experiments: { enableLogs: true }. The SentryLoggerService is replaced by an OtelLoggerService that writes to Pino (which is bridged to OTel).
4. Technology Deep Dive¶
4.1 OpenTelemetry Collector (Contrib)¶
The Contrib distribution is required because it includes the clickhouse exporter.
| Detail | Value |
|---|---|
| Image | otel/opentelemetry-collector-contrib:0.120.0 |
| Receivers | otlp (gRPC :4317, HTTP :4318) |
| Processors | batch, resource, attributes |
| Exporters | clickhouse |
| Health check | :13133/health |
| Metrics | :8888/metrics (Prometheus) |
Key configuration — otel-collector-config.yaml:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 5s
send_batch_size: 10000
send_batch_max_size: 20000
resource:
attributes:
- key: deployment.environment
value: "${env:ENVIRONMENT}" # recent collector versions expect the env: scheme
action: upsert
- key: host.name
value: "${env:HOSTNAME}"
action: upsert
filter/drop-debug:
logs:
log_record:
- 'severity_number < 9' # Drop TRACE and DEBUG in production
exporters:
clickhouse:
endpoint: tcp://clickhouse:9000
database: otel
logs_table_name: otel_logs
timeout: 10s
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
create_schema: true
service:
pipelines:
logs:
receivers: [otlp]
processors: [resource, filter/drop-debug, batch]
exporters: [clickhouse]
telemetry:
logs:
level: warn
metrics:
address: 0.0.0.0:8888
extensions: [health_check]
extensions:
health_check:
endpoint: 0.0.0.0:13133
Notes:
- create_schema: true auto-creates the otel_logs table on first run
- The filter/drop-debug processor prevents debug-level logs from reaching ClickHouse in production (toggle per environment)
- batch processor is critical for performance — ClickHouse ingests best in large batches
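Once the collector is up, the pipeline can be smoke-tested end to end by posting a hand-built log record to the OTLP/HTTP endpoint on :4318. A sketch (the payload follows the OTLP JSON encoding; the service name "smoke-test" is illustrative):

```shell
# Minimal OTLP/HTTP JSON payload carrying a single INFO log record.
PAYLOAD=$(cat <<'EOF'
{
  "resourceLogs": [{
    "resource": {
      "attributes": [{
        "key": "service.name",
        "value": { "stringValue": "smoke-test" }
      }]
    },
    "scopeLogs": [{
      "logRecords": [{
        "severityNumber": 9,
        "severityText": "INFO",
        "body": { "stringValue": "hello from the logging pipeline" }
      }]
    }]
  }]
}
EOF
)

# Validate the payload locally before sending.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"

# Send to the collector (requires the stack to be running):
# curl -s -X POST http://localhost:4318/v1/logs \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```

If the pipeline is healthy, a matching row should appear in otel.otel_logs within the batch timeout (5 s).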
4.2 ClickHouse¶
| Detail | Value |
|---|---|
| Image | clickhouse/clickhouse-server:24.12-alpine |
| Protocol ports | 9000 (native TCP), 8123 (HTTP) |
| Storage | Named Docker volume clickhouse-data |
| Min RAM recommended | 2 GB (for our log volume) |
| Compression | LZ4 by default (~10:1 on log data) |
Why ClickHouse over alternatives?
| Alternative | Pros | Cons |
|---|---|---|
| Elasticsearch / OpenSearch | Mature, rich full-text search | RAM-hungry (4+ GB minimum), complex to operate |
| Loki + Grafana | Native Grafana integration, low resource | Limited querying (labels only, no full-text) |
| ClickHouse | Blazing fast analytics, SQL interface, low RAM, excellent compression | Less mature OTel ecosystem (improving rapidly) |
| PostgreSQL | Already in the stack | Not designed for high-volume log ingestion |
ClickHouse wins for this use case because:
- Compression: Log data compresses 10-20x, keeping disk usage low
- SQL: Query language is familiar (no learning curve like LogQL)
- Speed: Sub-second queries on millions of rows
- Lightweight: Runs well in 1-2 GB RAM for moderate log volumes
- TTL: Built-in time-based data lifecycle management
- S3 backup: Native BACKUP TABLE ... TO S3(...) command
4.3 Grafana with ClickHouse Plugin¶
| Detail | Value |
|---|---|
| Image | grafana/grafana-oss:11.5.0 |
| ClickHouse plugin | grafana-clickhouse-datasource v4.x |
| Protocol | HTTP to ClickHouse on port 8123 |
| Authentication | Grafana built-in auth (admin password from env) |
The ClickHouse plugin v4 has first-class support for OpenTelemetry log schema:
- Auto-detects OTel column names (Timestamp, SeverityText, Body, etc.)
- Log panel renders with severity coloring
- Logs link to traces via TraceId field
- Query builder minimizes need for raw SQL
5. Application-Side Integration¶
5.1 Approach: Pino + OpenTelemetry Instrumentation¶
Rather than calling OTel log APIs directly, the recommended approach is:
- Use Pino as the application logger (fast, structured JSON)
- Enable @opentelemetry/instrumentation-pino to bridge Pino logs to OTel
- The OTel SDK's LogRecordExporter sends log records via OTLP to the Collector
This is preferable because:
- Pino logs still go to stdout (Docker captures them, Dozzle still works)
- Trace context (traceId, spanId) is automatically injected into every log record
- No tight coupling to any backend — swapping ClickHouse for Loki is a config change, not code change
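For orientation, the Pino-to-OTel bridge maps Pino's numeric levels onto OTel severity numbers. An illustrative sketch of that mapping (not the instrumentation's actual source code):

```typescript
// Pino numeric levels → OpenTelemetry SeverityNumber (illustrative).
// OTel defines TRACE=1, DEBUG=5, INFO=9, WARN=13, ERROR=17, FATAL=21.
const pinoToOtelSeverity: Record<number, number> = {
  10: 1,  // trace
  20: 5,  // debug
  30: 9,  // info
  40: 13, // warn
  50: 17, // error
  60: 21, // fatal
};

function otelSeverity(pinoLevel: number): number {
  // Fall back to INFO for custom levels between the standard ones.
  return pinoToOtelSeverity[pinoLevel] ?? 9;
}
```

This is also why the collector's filter/drop-debug expression `severity_number < 9` drops exactly the trace and debug records while keeping info and above.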
5.2 Required Packages¶
pnpm add -w pino @opentelemetry/sdk-node @opentelemetry/api \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-logs-otlp-grpc \
@opentelemetry/sdk-logs \
@opentelemetry/instrumentation-pino \
@opentelemetry/resources \
@opentelemetry/semantic-conventions
(Pino ships its own TypeScript types, so no separate @types/pino dev dependency is needed.)
5.3 Tracing File Changes (per service)¶
New file: libs/observability/src/lib/otel-logger.ts
import pino from 'pino';
import { getServiceName } from './service-context';
export function createLogger(context?: string): pino.Logger {
const logger = pino({
level: process.env['LOG_LEVEL'] || 'info',
transport:
process.env['NODE_ENV'] === 'development'
? { target: 'pino-pretty', options: { colorize: true } }
: undefined,
});
return context ? logger.child({ context }) : logger;
}
Modified: apps/*/src/observability/instrument.ts
// BEFORE: only Sentry init
import * as Sentry from '@sentry/nestjs';
// ...
// AFTER: Sentry init + OTel Logs SDK
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-grpc';
import { BatchLogRecordProcessor } from '@opentelemetry/sdk-logs';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';
import * as Sentry from '@sentry/nestjs';
import { nodeProfilingIntegration } from '@sentry/profiling-node';
import { getSentryConfig, SENTRY_IGNORED_ERRORS, setServiceName, getOtelConfig } from '@forma3d/observability';
const SERVICE_NAME = 'order-service'; // varies per service
setServiceName(SERVICE_NAME);
// 1. Initialize OTel SDK (MUST happen before Sentry and app imports)
const otelConfig = getOtelConfig(SERVICE_NAME);
if (otelConfig.exporterEndpoint) {
const sdk = new NodeSDK({
resource: new Resource({
[ATTR_SERVICE_NAME]: otelConfig.serviceName,
[ATTR_SERVICE_VERSION]: otelConfig.serviceVersion,
'deployment.environment': otelConfig.environment,
}),
logRecordProcessors: [
new BatchLogRecordProcessor(
new OTLPLogExporter({
url: otelConfig.exporterEndpoint,
})
),
],
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-pino': {
disableLogSending: false, // Enable bridging Pino → OTel
},
// Disable instrumentations we don't need
'@opentelemetry/instrumentation-fs': { enabled: false },
}),
],
});
sdk.start();
}
// 2. Initialize Sentry (after OTel so Sentry can link to OTel traces)
const config = getSentryConfig();
if (config.dsn) {
Sentry.init({
dsn: config.dsn,
environment: config.environment,
release: config.release,
debug: config.debug,
tracesSampleRate: config.tracesSampleRate,
profilesSampleRate: config.profilesSampleRate,
integrations: [nodeProfilingIntegration()],
// REMOVED: _experiments: { enableLogs: true }
ignoreErrors: SENTRY_IGNORED_ERRORS,
beforeSend(event) {
if (event.request?.headers) {
delete event.request.headers['authorization'];
delete event.request.headers['cookie'];
delete event.request.headers['x-shopify-access-token'];
}
return event;
},
initialScope: {
tags: { service: SERVICE_NAME, component: 'backend' },
},
});
}
export { Sentry };
Replaced: SentryLoggerService → OtelLoggerService
The new service uses Pino instead of Sentry.logger.*:
import { Injectable } from '@nestjs/common';
import { createLogger } from '@forma3d/observability';
import type pino from 'pino';
@Injectable()
export class OtelLoggerService {
private readonly logger: pino.Logger;
constructor() {
this.logger = createLogger('business');
}
info(message: string, attributes?: Record<string, unknown>): void {
this.logger.info(attributes, message);
}
warn(message: string, attributes?: Record<string, unknown>): void {
this.logger.warn(attributes, message);
}
error(message: string, attributes?: Record<string, unknown>): void {
this.logger.error(attributes, message);
}
debug(message: string, attributes?: Record<string, unknown>): void {
this.logger.debug(attributes, message);
}
logEvent(eventType: string, message: string, attributes?: Record<string, unknown>): void {
this.logger.info({ ...attributes, eventType }, message);
}
logAudit(action: string, success: boolean, attributes?: Record<string, unknown>): void {
const level = success ? 'info' : 'warn';
this.logger[level]({ ...attributes, action, success, category: 'audit' }, `Audit: ${action}`);
}
}
5.4 Environment Variables to Add¶
Each service container needs one new environment variable:
# In docker-compose.yml, for each backend service:
- OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
Optional tuning variables:
| Variable | Default | Description |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OTel Collector gRPC endpoint |
| LOG_LEVEL | info | Minimum log level (trace/debug/info/warn/error) |
| OTEL_LOG_LEVEL | warn | OTel SDK's own log level |
| OTEL_SERVICE_NAME | Set in code | Override service name |
6. Docker Compose Deployment¶
6.1 New Services to Add¶
Add these three services to deployment/staging/docker-compose.yml:
# --------------------------------------------------------------------------
# OpenTelemetry Collector - Log collection and forwarding
# --------------------------------------------------------------------------
otel-collector:
image: otel/opentelemetry-collector-contrib:0.120.0
container_name: forma3d-otel-collector
restart: unless-stopped
command: ['--config=/etc/otelcol-contrib/config.yaml']
volumes:
- ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml:ro
ports:
- '4317:4317' # OTLP gRPC (internal only — remove port mapping if not needed externally)
- '4318:4318' # OTLP HTTP
environment:
- ENVIRONMENT=staging
- HOSTNAME=${HOSTNAME:-forma3d-staging}
networks:
- forma3d-network
healthcheck:
# Caveat: the contrib image is distroless (no shell, no wget), so this
# exec-style check may not work as written; verify on first deploy, or
# probe the :13133 health endpoint from outside the container instead.
test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://localhost:13133/']
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
depends_on:
clickhouse:
condition: service_healthy
# --------------------------------------------------------------------------
# ClickHouse - Log Storage
# --------------------------------------------------------------------------
clickhouse:
image: clickhouse/clickhouse-server:24.12-alpine
container_name: forma3d-clickhouse
restart: unless-stopped
volumes:
- clickhouse-data:/var/lib/clickhouse
- clickhouse-logs:/var/log/clickhouse-server
- ./clickhouse-config.xml:/etc/clickhouse-server/config.d/custom.xml:ro
- ./clickhouse-users.xml:/etc/clickhouse-server/users.d/custom.xml:ro
environment:
- CLICKHOUSE_DB=otel
- CLICKHOUSE_USER=otel
- CLICKHOUSE_PASSWORD=${CLICKHOUSE_PASSWORD}
- CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1
networks:
- forma3d-network
healthcheck:
test: ['CMD', 'clickhouse-client', '--query', 'SELECT 1']
interval: 30s
timeout: 5s
retries: 3
start_period: 30s
ulimits:
nofile:
soft: 262144
hard: 262144
# --------------------------------------------------------------------------
# Grafana - Log Visualization and Dashboards
# --------------------------------------------------------------------------
grafana:
image: grafana/grafana-oss:11.5.0
container_name: forma3d-grafana
restart: unless-stopped
volumes:
- grafana-data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning:ro
environment:
- GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
- GF_INSTALL_PLUGINS=grafana-clickhouse-datasource
- GF_SERVER_ROOT_URL=https://staging-connect-grafana.forma3d.be
- GF_SERVER_SERVE_FROM_SUB_PATH=false
networks:
- forma3d-network
labels:
- 'traefik.enable=true'
- 'traefik.http.routers.grafana.rule=Host(`staging-connect-grafana.forma3d.be`)'
- 'traefik.http.routers.grafana.entrypoints=websecure'
- 'traefik.http.routers.grafana.tls=true'
- 'traefik.http.routers.grafana.tls.certresolver=letsencrypt'
- 'traefik.http.services.grafana.loadbalancer.server.port=3000'
healthcheck:
test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://localhost:3000/api/health']
interval: 30s
timeout: 10s
retries: 3
start_period: 30s
depends_on:
clickhouse:
condition: service_healthy
New volumes to add:
volumes:
# ... existing volumes ...
clickhouse-data:
clickhouse-logs:
grafana-data:
6.2 Updated Service Dependencies¶
Each backend service that sends logs should not hard-depend on the OTel Collector. If the collector is down, the OTel SDK buffers and retries — the application continues running. However, the OTel Collector itself depends on ClickHouse.
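Concretely, a backend service gains only the endpoint variable; it does not list the collector under depends_on. An illustrative compose fragment (service name and existing keys elided):

```yaml
order-service:
  # ... existing image, env_file, depends_on, etc. ...
  environment:
    - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
  # Deliberately no depends_on entry for otel-collector: the OTel SDK
  # buffers and retries, so the app starts and runs even if the
  # collector is temporarily down.
```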
6.3 Configuration Files¶
deployment/staging/clickhouse-config.xml:
<?xml version="1.0"?>
<clickhouse>
<!-- Listen on all interfaces within Docker network -->
<listen_host>0.0.0.0</listen_host>
<!-- Logging -->
<logger>
<level>warning</level>
<log>/var/log/clickhouse-server/clickhouse-server.log</log>
<errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
<size>100M</size>
<count>3</count>
</logger>
<!-- Memory limits for single-server deployment -->
<max_server_memory_usage_to_ram_ratio>0.8</max_server_memory_usage_to_ram_ratio>
<!-- S3 backup configuration (DigitalOcean Spaces) -->
<backups>
<allowed_path>/backups/</allowed_path>
<allowed_disk>s3_backups</allowed_disk>
</backups>
<storage_configuration>
<disks>
<s3_backups>
<type>s3</type>
<!-- ClickHouse does not expand ${VAR} placeholders in config files.
     Render the endpoint at deploy time (values from section 8.1) and
     pull credentials from the container environment via from_env: -->
<endpoint>https://ams3.digitaloceanspaces.com/forma3d-log-backups/clickhouse-backups/</endpoint>
<access_key_id from_env="DO_SPACES_KEY"></access_key_id>
<secret_access_key from_env="DO_SPACES_SECRET"></secret_access_key>
</s3_backups>
</disks>
</storage_configuration>
</clickhouse>
deployment/staging/clickhouse-users.xml:
<?xml version="1.0"?>
<clickhouse>
<users>
<otel>
<password_sha256_hex replace="true"><!-- generated hash --></password_sha256_hex>
<networks>
<ip>::/0</ip>
</networks>
<profile>default</profile>
<quota>default</quota>
<access_management>0</access_management>
</otel>
</users>
</clickhouse>
Note: For simplicity, use the CLICKHOUSE_PASSWORD env var with the default user approach instead of the XML user file. The XML approach is shown for reference on how to lock down access further.
deployment/staging/grafana/provisioning/datasources/clickhouse.yaml:
apiVersion: 1
datasources:
- name: ClickHouse
type: grafana-clickhouse-datasource
access: proxy
isDefault: true
jsonData:
host: clickhouse
port: 9000
protocol: native
username: otel
defaultDatabase: otel
logs:
defaultDatabase: otel
defaultTable: otel_logs
otelEnabled: true
otelVersion: latest
timeColumn: Timestamp
levelColumn: SeverityText
messageColumn: Body
secureJsonData:
password: ${CLICKHOUSE_PASSWORD}
7. Log Rotation and TTL¶
7.1 ClickHouse TTL Strategy¶
ClickHouse's TTL feature automatically deletes data that exceeds a time threshold. This replaces traditional log rotation.
Schema with TTL (auto-created by OTel Collector with create_schema: true, but can be customized):
CREATE TABLE IF NOT EXISTS otel.otel_logs
(
Timestamp DateTime64(9),
TimestampDate Date DEFAULT toDate(Timestamp),
TraceId String,
SpanId String,
TraceFlags UInt32,
SeverityText LowCardinality(String),
SeverityNumber Int32,
ServiceName LowCardinality(String),
Body String,
ResourceSchemaUrl String,
ResourceAttributes Map(LowCardinality(String), String),
ScopeSchemaUrl String,
ScopeName String,
ScopeVersion String,
ScopeAttributes Map(LowCardinality(String), String),
LogAttributes Map(LowCardinality(String), String),
INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
INDEX idx_body Body TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 1
)
ENGINE = MergeTree()
PARTITION BY TimestampDate
ORDER BY (ServiceName, SeverityText, toUnixTimestamp(Timestamp), TraceId)
TTL TimestampDate + INTERVAL 90 DAY
SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1;
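Per-partition disk usage (and whether TTL drops are keeping up) can be inspected through ClickHouse's system.parts table. A sketch:

```sql
SELECT
    partition,
    sum(rows) AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS on_disk
FROM system.parts
WHERE database = 'otel' AND table = 'otel_logs' AND active
GROUP BY partition
ORDER BY partition DESC;
```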
7.2 Tiered Retention Policy¶
| Log Level | Retention | Rationale |
|---|---|---|
| ERROR, FATAL | 180 days | Need long history for debugging recurring issues |
| WARN | 90 days | Important but less critical |
| INFO | 30 days | Operational visibility, high volume |
| DEBUG | 7 days | Only on staging; filtered out in production |
This can be achieved with conditional TTL. Note that per-row TTL conditions force row-level deletes during merges, so the ttl_only_drop_parts = 1 setting above should be revisited if this policy is adopted:
ALTER TABLE otel.otel_logs MODIFY TTL
TimestampDate + INTERVAL 7 DAY WHERE SeverityText IN ('TRACE', 'DEBUG'),
TimestampDate + INTERVAL 30 DAY WHERE SeverityText = 'INFO',
TimestampDate + INTERVAL 90 DAY WHERE SeverityText = 'WARN',
TimestampDate + INTERVAL 180 DAY;
7.3 Partition Management¶
Partitioning by TimestampDate means each day's data is a separate partition. Benefits:
- TTL drops entire partitions (fast, no row-level deletes)
- Backups can target specific date ranges
- Old data is cleanly isolated from hot data
ClickHouse TTL cleanup runs every 4 hours by default (merge_with_ttl_timeout = 14400). This is adequate — there's no urgency to delete data the moment it expires.
7.4 ClickHouse Internal Log Rotation¶
ClickHouse's own server logs (not to be confused with application logs stored in ClickHouse) are managed via clickhouse-config.xml:
<logger>
<size>100M</size> <!-- Max file size before rotation -->
<count>3</count> <!-- Keep 3 rotated files -->
</logger>
8. Backup to DigitalOcean Spaces¶
8.1 DigitalOcean Spaces Setup¶
DigitalOcean Spaces is S3-compatible and works with ClickHouse's native BACKUP command.
Spaces configuration:
| Setting | Value |
|---|---|
| Bucket name | forma3d-log-backups |
| Region | ams3 (Amsterdam) — closest to EU infrastructure |
| CDN | Disabled (not needed for backups) |
| Versioning | Disabled (ClickHouse manages versions) |
Access credentials:
DO_SPACES_KEY=<generated-key>
DO_SPACES_SECRET=<generated-secret>
DO_SPACES_REGION=ams3
DO_SPACES_BUCKET=forma3d-log-backups
8.2 Backup Strategy¶
| Backup Type | Frequency | Retention | Contents |
|---|---|---|---|
| Full backup | Weekly (Sunday 3 AM) | 4 weeks | Entire otel_logs table |
| Incremental | Daily (3 AM) | 2 weeks | Changes since last full |
| Pre-TTL archive | Before TTL expiry | 1 year (cold) | ERROR/FATAL logs about to expire |
8.3 Backup Commands¶
Full backup:
BACKUP TABLE otel.otel_logs
TO S3(
'https://ams3.digitaloceanspaces.com/forma3d-log-backups/clickhouse/full/{yyyy}-{mm}-{dd}/',
'<DO_SPACES_KEY>',
'<DO_SPACES_SECRET>'
);
Incremental backup (referencing last full):
BACKUP TABLE otel.otel_logs
TO S3(
'https://ams3.digitaloceanspaces.com/forma3d-log-backups/clickhouse/incremental/{yyyy}-{mm}-{dd}/',
'<DO_SPACES_KEY>',
'<DO_SPACES_SECRET>'
)
SETTINGS base_backup = S3(
'https://ams3.digitaloceanspaces.com/forma3d-log-backups/clickhouse/full/{last_full_date}/',
'<DO_SPACES_KEY>',
'<DO_SPACES_SECRET>'
);
Restore from backup:
RESTORE TABLE otel.otel_logs
FROM S3(
'https://ams3.digitaloceanspaces.com/forma3d-log-backups/clickhouse/full/{yyyy}-{mm}-{dd}/',
'<DO_SPACES_KEY>',
'<DO_SPACES_SECRET>'
);
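After a restore completes, a quick row-count and time-range check confirms the data landed (a sketch against the schema above):

```sql
SELECT
    count() AS restored_rows,
    min(Timestamp) AS oldest,
    max(Timestamp) AS newest
FROM otel.otel_logs;
```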
8.4 Automated Backup Script¶
Create deployment/staging/scripts/backup-clickhouse-logs.sh:
#!/bin/bash
set -euo pipefail
# Configuration
CLICKHOUSE_HOST="clickhouse"
CLICKHOUSE_PORT="9000"
CLICKHOUSE_USER="otel"
CLICKHOUSE_PASSWORD="${CLICKHOUSE_PASSWORD}"
DO_SPACES_ENDPOINT="https://${DO_SPACES_REGION}.digitaloceanspaces.com"
DO_SPACES_BUCKET="${DO_SPACES_BUCKET}"
DO_SPACES_KEY="${DO_SPACES_KEY}"
DO_SPACES_SECRET="${DO_SPACES_SECRET}"
DATE=$(date +%Y-%m-%d)
DAY_OF_WEEK=$(date +%u) # 1=Monday, 7=Sunday
BACKUP_PATH="${DO_SPACES_ENDPOINT}/${DO_SPACES_BUCKET}/clickhouse"
if [ "$DAY_OF_WEEK" -eq 7 ]; then
# Sunday: Full backup
echo "[$(date)] Starting full backup..."
docker exec forma3d-clickhouse clickhouse-client \
--user "$CLICKHOUSE_USER" \
--password "$CLICKHOUSE_PASSWORD" \
--query "BACKUP TABLE otel.otel_logs TO S3('${BACKUP_PATH}/full/${DATE}/', '${DO_SPACES_KEY}', '${DO_SPACES_SECRET}')"
echo "[$(date)] Full backup completed: ${BACKUP_PATH}/full/${DATE}/"
else
# Weekday: Incremental backup
LAST_SUNDAY=$(date -d "last sunday" +%Y-%m-%d 2>/dev/null || date -v-sunday +%Y-%m-%d)
echo "[$(date)] Starting incremental backup (base: ${LAST_SUNDAY})..."
docker exec forma3d-clickhouse clickhouse-client \
--user "$CLICKHOUSE_USER" \
--password "$CLICKHOUSE_PASSWORD" \
--query "BACKUP TABLE otel.otel_logs TO S3('${BACKUP_PATH}/incremental/${DATE}/', '${DO_SPACES_KEY}', '${DO_SPACES_SECRET}') SETTINGS base_backup = S3('${BACKUP_PATH}/full/${LAST_SUNDAY}/', '${DO_SPACES_KEY}', '${DO_SPACES_SECRET}')"
echo "[$(date)] Incremental backup completed: ${BACKUP_PATH}/incremental/${DATE}/"
fi
Cron entry (on the Droplet):
0 3 * * * /opt/forma3d/scripts/backup-clickhouse-logs.sh >> /var/log/clickhouse-backup.log 2>&1
8.5 DigitalOcean Spaces Lifecycle Policy¶
Use Spaces lifecycle rules to automatically clean up old backups:
| Path prefix | Expiration |
|---|---|
| clickhouse/full/ | 35 days (keep ~5 full backups) |
| clickhouse/incremental/ | 14 days |
| clickhouse/archive/ | 365 days |
9. Grafana Dashboards¶
9.1 Provisioned Dashboards¶
Create pre-built dashboards via Grafana provisioning:
Dashboard 1: Service Logs Overview - Log volume over time (by service) - Error rate by service (stacked bar chart) - Latest error logs (table) - Log level distribution (pie chart)
Dashboard 2: Request Tracing
- Logs filtered by TraceId
- Correlated with Sentry traces (link out)
- Request lifecycle visualization
Dashboard 3: Business Events - Order processing events - Print job status changes - Shipment events - Webhook receipt/processing logs
Dashboard 4: System Health - OTel Collector throughput (records/sec) - ClickHouse disk usage - ClickHouse query performance - Log ingestion latency
9.2 Useful Grafana Queries¶
Error log count by service (last 24h):
SELECT
ServiceName,
count() AS error_count
FROM otel.otel_logs
WHERE SeverityText IN ('ERROR', 'FATAL')
AND Timestamp >= now() - INTERVAL 24 HOUR
GROUP BY ServiceName
ORDER BY error_count DESC
Log volume by level (time series):
SELECT
toStartOfFiveMinutes(Timestamp) AS time,
SeverityText,
count() AS count
FROM otel.otel_logs
WHERE Timestamp >= $__fromTime AND Timestamp <= $__toTime
GROUP BY time, SeverityText
ORDER BY time
Search logs by keyword:
SELECT
Timestamp,
ServiceName,
SeverityText,
Body,
LogAttributes
FROM otel.otel_logs
WHERE Body LIKE '%order%'
AND Timestamp >= $__fromTime AND Timestamp <= $__toTime
ORDER BY Timestamp DESC
LIMIT 100
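Error ratio per service is another useful panel query, using ClickHouse's countIf conditional aggregate (a sketch; column names follow the OTel schema from section 7):

```sql
SELECT
    ServiceName,
    countIf(SeverityText IN ('ERROR', 'FATAL')) AS errors,
    count() AS total,
    round(errors / total, 4) AS error_ratio
FROM otel.otel_logs
WHERE Timestamp >= now() - INTERVAL 24 HOUR
GROUP BY ServiceName
ORDER BY error_ratio DESC
```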
9.3 Alerting Rules¶
Grafana can alert on ClickHouse queries:
| Alert | Condition | Channel |
|---|---|---|
| High error rate | > 50 ERROR logs in 5 min for any service | Slack / Email |
| Service silent | Zero logs from a service for > 10 min | Slack |
| ClickHouse disk > 80% | Query system.disks | Slack |
| OTel Collector unhealthy | Health check fail | Slack |
10. Resource Requirements¶
10.1 Estimated Log Volume¶
| Service | Est. logs/min (staging) | Est. logs/min (production) |
|---|---|---|
| Gateway | 20 | 100 |
| Order Service | 30 | 150 |
| Print Service | 10 | 50 |
| Shipping Service | 10 | 50 |
| GridFlock Service | 5 | 20 |
| Total | ~75 | ~370 |
At production load: ~370 logs/min = ~530K logs/day = ~16M logs/month
10.2 Storage Estimates¶
Assuming average log size of 500 bytes (after ClickHouse compression ~50 bytes):
| Timeframe | Raw size | Compressed (est. 10:1) |
|---|---|---|
| 1 day | ~265 MB | ~26 MB |
| 30 days | ~8 GB | ~800 MB |
| 90 days | ~24 GB | ~2.4 GB |
ClickHouse compression is exceptionally efficient on repetitive log data.
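The figures above follow from simple arithmetic; a quick sanity check (inputs assumed from the tables: 370 logs/min, 500 bytes/log raw, a conservative 10:1 compression ratio):

```typescript
// Assumed inputs from sections 10.1-10.2.
const logsPerMin = 370;
const bytesPerLog = 500;       // raw, pre-compression
const compressionRatio = 10;   // conservative 10:1

const logsPerDay = logsPerMin * 60 * 24;                    // 532,800 ≈ 530K logs/day
const rawMBPerDay = (logsPerDay * bytesPerLog) / 1_000_000; // ≈ 266 MB raw/day
const compressedMBPerDay = rawMBPerDay / compressionRatio;  // ≈ 27 MB/day

console.log(`${Math.round(rawMBPerDay * 90) / 1000} GB raw over 90 days`);        // ~24 GB
console.log(`${Math.round(compressedMBPerDay * 90) / 1000} GB compressed (90d)`); // ~2.4 GB
```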
10.3 Container Resource Allocation¶
| Container | CPU (cores) | RAM | Disk |
|---|---|---|---|
| ClickHouse | 0.5-1 | 1-2 GB | 10 GB initial (grows) |
| OTel Collector | 0.25 | 256 MB | Minimal |
| Grafana | 0.25 | 256 MB | 1 GB (dashboards/cache) |
| Total new | 1-1.5 | 1.5-2.5 GB | ~12 GB |
10.4 Droplet Sizing Impact¶
Current staging Droplet likely needs to be upsized:
| Current | Recommended |
|---|---|
| 4 GB RAM / 2 vCPU | 8 GB RAM / 4 vCPU |
The 8 GB / 4 vCPU Droplet ($48/mo on DO) comfortably runs the entire stack including the new logging containers.
11. Migration Strategy¶
11.1 Phased Rollout¶
| Phase | Duration | Actions |
|---|---|---|
| Phase 1: Deploy infrastructure | 1 day | Add ClickHouse, OTel Collector, Grafana to docker-compose. Verify containers start. |
| Phase 2: Dual-write | 1 week | Keep SentryLoggerService active. Add OTel SDK to each service sending logs in parallel. Validate logs appear in Grafana. |
| Phase 3: Build dashboards | 3-5 days | Create Grafana dashboards. Validate query performance. Set up alerting. |
| Phase 4: Cut over | 1 day | Replace SentryLoggerService with OtelLoggerService. Remove _experiments: { enableLogs: true } from Sentry init. |
| Phase 5: Cleanup | 1 day | Remove SentryLoggerService files. Update documentation. Configure backups. |
11.2 Rollback Plan¶
- If ClickHouse/OTel issues arise, re-enable _experiments: { enableLogs: true } in Sentry init
- Keep SentryLoggerService files until Phase 5 is validated for 2+ weeks
- Dozzle remains available as a last-resort log viewer (reads Docker stdout directly)
11.3 Files to Modify (per service)¶
| Action | Files |
|---|---|
| Modify | observability/instrument.ts — add OTel SDK init |
| Replace | observability/services/sentry-logger.service.ts → otel-logger.service.ts |
| Modify | observability/observability.module.ts — swap provider |
| Keep | observability/filters/sentry-exception.filter.ts — unchanged |
| Keep | observability/interceptors/logging.interceptor.ts — swap logger injection |
| Keep | observability/services/business-observability.service.ts — swap logger injection |
11.4 Services Affected¶
| Service | Logging changes | Sentry changes |
|---|---|---|
| Gateway | Yes - add OTel logger | Remove enableLogs |
| Order Service | Yes - add OTel logger | Remove enableLogs |
| Print Service | Yes - add OTel logger | Remove enableLogs |
| Shipping Service | Yes - add OTel logger | Remove enableLogs |
| GridFlock Service | Yes - add OTel logger | Remove enableLogs |
| Web (React) | No (frontend logs stay in Sentry for now) | No change |
12. Cost Analysis¶
12.1 Monthly Costs¶
| Component | Staging | Production | Notes |
|---|---|---|---|
| ClickHouse | $0 (self-hosted) | $0 (self-hosted) | Open source, runs on existing infra |
| OTel Collector | $0 (self-hosted) | $0 (self-hosted) | Open source |
| Grafana OSS | $0 (self-hosted) | $0 (self-hosted) | Open source |
| Droplet upsize (if needed) | +$24/mo | +$24/mo | 4 GB → 8 GB RAM |
| DO Spaces storage | ~$5/mo | ~$5/mo | $5/mo for 250 GB + $0.02/GB |
| Total additional | ~$29/mo | ~$29/mo | Droplet upsize + Spaces |
12.2 Cost Savings¶
| Item | Current cost | After migration |
|---|---|---|
| Sentry Logs volume | Metered (plan-dependent) | $0 (self-hosted) |
| Sentry errors + tracing | Unchanged | Unchanged |
| Total Sentry bill reduction | Depends on log volume | Could be significant at scale |
12.3 TCO Comparison (12 months)¶
| Approach | Year 1 Cost | Pros | Cons |
|---|---|---|---|
| Keep Sentry Logs | $600-2400+ (volume dependent) | Zero operational overhead | Locked into Sentry, limited querying |
| ClickHouse + Grafana (this proposal) | ~$696 ($58/mo) | Full control, unlimited retention, rich dashboards | Operational overhead, self-hosted |
| Grafana Cloud + Loki | ~$1200+ (volume dependent) | Managed, easy setup | Vendor lock-in, limited log querying |
13. Risks and Mitigations¶
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| ClickHouse disk fills up | Medium | High — stops ingesting | TTL auto-deletes; disk monitoring alert in Grafana |
| OTel Collector crash | Low | Medium — logs buffered in SDK | Docker `restart: unless-stopped`; SDK buffers ~30 s |
| ClickHouse crash | Low | High — log loss during downtime | Docker auto-restart; OTel Collector retries with backoff |
| Performance impact on app services | Low | Medium | OTel SDK is async; batch processor minimizes overhead |
| Grafana security (exposed dashboard) | Medium | Medium | Auth required; Traefik IP allowlisting for admin tools |
| Backup failure to Spaces | Low | Low — TTL still manages lifecycle | Backup script alerts on failure; manual backup option |
| Complexity for single operator | Medium | Medium | Good documentation; simple Docker Compose setup |
| Log loss during OTel Collector restart | Low | Low | Batch processor flushes on graceful shutdown |
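The disk-fill mitigation in the first row relies on a Grafana alert. A sketch of the query such an alert could poll, using ClickHouse's built-in `system.disks` table (the 80% threshold would live in the Grafana alert rule, not the query):

```sql
-- Sketch: data source for a Grafana disk-usage alert.
-- system.disks is a built-in ClickHouse system table.
SELECT
  name,
  formatReadableSize(free_space)  AS free,
  formatReadableSize(total_space) AS total,
  round(100 * (1 - free_space / total_space), 1) AS used_pct
FROM system.disks;
```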
14. Recommendations and Next Steps¶
14.1 Recommendation¶
Proceed with implementation. The ClickHouse + Grafana + OpenTelemetry stack is:
- Cost-effective: Essentially free beyond a small Droplet upsize and Spaces storage
- Operationally sound: ClickHouse is battle-tested at scale (Sentry itself runs on it)
- Future-proof: OpenTelemetry is the industry standard; the backend can be swapped without app changes
- Well-scoped: Only logging moves; Sentry keeps error tracking and tracing
14.2 Implementation Priority¶
| Priority | Task | Effort | Impact |
|---|---|---|---|
| P0 | Add ClickHouse + OTel Collector + Grafana to staging docker-compose | 1 day | Foundation |
| P0 | Create OTel Collector config file | 2 hours | Collection pipeline |
| P1 | Modify `instrument.ts` in all services to init OTel SDK | 1 day | App integration |
| P1 | Create `OtelLoggerService` in `libs/observability` | 4 hours | Shared logger |
| P1 | Build initial Grafana dashboards | 1 day | Visualization |
| P2 | Swap `SentryLoggerService` → `OtelLoggerService` in all services | 1 day | Migration |
| P2 | Set up DO Spaces bucket and backup cron | 4 hours | Data safety |
| P2 | Apply tiered TTL policy | 1 hour | Retention |
| P3 | Remove Sentry Logs experiment flag | 30 min | Cleanup |
| P3 | Replicate setup for production docker-compose | 4 hours | Parity |
| P3 | Document runbooks for ClickHouse operations | 4 hours | Operations |
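For the P0 collector-config task, a minimal sketch of the logs pipeline is shown below. The exporter key names follow the `opentelemetry-collector-contrib` ClickHouse exporter and should be verified against the docs for the pinned collector version; the `clickhouse` hostname and `otel` database are assumptions about the compose setup:

```yaml
# Sketch of otel-collector-config.yaml (P0 task above).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:                 # minimizes per-log overhead on app services
    send_batch_size: 1024
    timeout: 5s

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000   # compose service name (assumed)
    database: otel
    logs_table_name: otel_logs
    retry_on_failure:                 # covers the ClickHouse-crash risk above
      enabled: true

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [clickhouse]
```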
14.3 Open Questions¶
- Production Droplet sizing: Does the production server have enough headroom for the additional 1.5-2.5 GB of RAM the logging stack needs?
- Grafana access: Should Grafana be publicly accessible (with auth) or restricted via IP allowlisting / VPN?
- Frontend logs: Should React/browser logs also be routed through OpenTelemetry, or stay in Sentry?
- Traces: Should we eventually route OTel traces to ClickHouse too, or keep them in Sentry?
- Dozzle: Keep Dozzle for quick container-level debugging, or replace entirely with Grafana?
14.4 Staging vs Production Differences¶
| Config | Staging | Production |
|---|---|---|
| `LOG_LEVEL` | `debug` | `info` |
| `filter/drop-debug` processor | Disabled | Enabled |
| ClickHouse TTL (INFO) | 14 days | 30 days |
| ClickHouse TTL (ERROR) | 90 days | 180 days |
| Backup frequency | Daily (no incremental) | Full weekly + daily incremental |
| Grafana alerts | Email only | Slack + Email |
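The tiered TTL rows above translate to a per-row delete policy on the logs table. A production sketch, assuming the `otel.otel_logs` table and `SeverityText`/`Timestamp` columns that the contrib ClickHouse exporter conventionally creates (verify against the actual schema before applying):

```sql
-- Sketch of the production tiered TTL (30 days INFO, 180 days ERROR).
-- Table and column names are assumptions; match them to the real schema.
ALTER TABLE otel.otel_logs
  MODIFY TTL
    toDateTime(Timestamp) + INTERVAL 30 DAY
      DELETE WHERE SeverityText IN ('DEBUG', 'INFO'),
    toDateTime(Timestamp) + INTERVAL 180 DAY
      DELETE WHERE SeverityText IN ('WARN', 'ERROR', 'FATAL');
```

For staging, the same statement with `14 DAY` / `90 DAY` intervals matches the table above.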
Document Version: 1.0 Last Updated: February 2026 Next Review: After Phase 2 implementation