AI Prompt: Forma3D.Connect — ClickHouse + Grafana Centralized Logging¶
Purpose: Migrate structured logging from Sentry Logs to a self-hosted ClickHouse + Grafana stack, collected via OpenTelemetry, while keeping Sentry for error tracking, performance monitoring, and profiling
Estimated Effort: 16–24 hours (infrastructure + application integration + dashboards + backups)
Prerequisites: All microservices running with the current SentryLoggerService and instrument.ts files; staging Docker Compose deployment operational; DigitalOcean Droplet with at least 8 GB RAM / 4 vCPU (or plan to upsize from 4 GB)
Output: ClickHouse storing structured logs via OpenTelemetry, Grafana dashboards for log visualization and alerting, automated backups to DigitalOcean Spaces, all services logging via Pino bridged to OTel — with Sentry retained for errors/traces/profiling
Status: 🚧 TODO
Research: docs/03-architecture/research/clickhouse-grafana-logging-research.md
🎯 Mission¶
Implement a self-hosted ClickHouse + Grafana + OpenTelemetry centralized logging stack for the Forma3D.Connect microservice platform. This replaces only the logging concern currently handled by Sentry Logs (Sentry.logger.*). Sentry continues to handle error tracking, performance monitoring, tracing, and profiling — unchanged.
What this delivers:
- OpenTelemetry Collector — receives logs from all services via OTLP gRPC, batches and exports to ClickHouse
- ClickHouse — high-performance columnar storage for structured logs with TTL-based retention and S3 backup
- Grafana — log visualization dashboards with the ClickHouse plugin, alerting rules, and log exploration
- Pino + OTel bridge — replaces SentryLoggerService with a Pino logger auto-bridged to OpenTelemetry via @opentelemetry/instrumentation-pino
- Tiered retention — ERROR/FATAL logs kept 180 days, WARN 90 days, INFO 30 days, DEBUG 7 days
- Automated backups — nightly ClickHouse backups to DigitalOcean Spaces (S3-compatible)
- Dozzle coexistence — Pino still writes to stdout, so Docker log viewers (Dozzle) continue working
Why move logging away from Sentry:
- Cost: Sentry Logs is a metered feature; high-volume logging becomes expensive at scale
- Retention: Sentry retains logs for limited periods (30 days on Team plan); ClickHouse allows arbitrary retention
- Querying: ClickHouse offers sub-second analytical queries on billions of log rows; Sentry's log search is limited
- Ownership: Logs contain sensitive business data — self-hosting provides full data sovereignty
- Flexibility: Grafana dashboards are far more customizable than Sentry's log explorer
- Vendor independence: OpenTelemetry is vendor-neutral — the backend can be swapped without app changes
What stays with Sentry (unchanged):
- Exception capture (Sentry.captureException, SentryExceptionFilter)
- Performance traces (Sentry's OTel integration)
- Profiling (nodeProfilingIntegration)
- Frontend logging (React/browser logs stay in Sentry for now)
📐 Architecture¶
Data Flow¶
┌───────────────────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
│ │
│ ┌─────────┐ ┌───────────┐ ┌───────────┐ ┌──────────┐ ┌───────────┐ │
│ │ Gateway │ │ Order Svc │ │ Print Svc │ │Ship. Svc │ │GridFlock │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ Pino │ │ Pino │ │ Pino │ │ Pino │ │ Pino │ │
│ │ + OTel │ │ + OTel │ │ + OTel │ │ + OTel │ │ + OTel │ │
│ └────┬────┘ └─────┬─────┘ └─────┬─────┘ └────┬─────┘ └─────┬─────┘ │
│ │ │ │ │ │ │
│ └─────────────┴──────────────┴─────────────┴──────────────┘ │
│ │ │
│ OTLP (gRPC :4317) │
└────────────────────────────────────┼──────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────────────┐
│ COLLECTION LAYER │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ OpenTelemetry Collector (Contrib) │ │
│ │ │ │
│ │ Receivers: otlp (gRPC + HTTP) │ │
│ │ Processors: batch, resource, │ │
│ │ attributes, filter │ │
│ │ Exporters: clickhouse │ │
│ └──────────────────┬───────────────────┘ │
│ │ │
└─────────────────────┼──────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────────────┐
│ STORAGE + VISUALIZATION │
│ │
│ ┌──────────────────┐ ┌──────────────────────────┐ │
│ │ ClickHouse │◀─────────│ Grafana │ │
│ │ │ query │ │ │
│ │ otel_logs DB │ │ ClickHouse Data Source │ │
│ │ TTL: tiered │ │ Log dashboards │ │
│ │ │ │ Alert rules │ │
│ └────────┬─────────┘ └──────────────────────────┘ │
│ │ │
│ │ Nightly backup │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ DigitalOcean Spaces │ │
│ │ (S3-compatible) │ │
│ │ forma3d-log-backups/ │ │
│ └──────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
Sentry Coexistence¶
Sentry continues to handle errors, traces, and profiling. The instrument.ts files in each service keep their current Sentry init but drop _experiments: { enableLogs: true }. The SentryLoggerService is replaced by an OtelLoggerService that writes to Pino (which is bridged to OTel).
Component Summary¶
| Component | Image | Purpose | Port(s) |
|---|---|---|---|
| OTel Collector | otel/opentelemetry-collector-contrib:0.120.0 | Receives OTLP logs, batches, exports to ClickHouse | 4317 (gRPC), 4318 (HTTP), 13133 (health), 8888 (metrics) |
| ClickHouse | clickhouse/clickhouse-server:24.12-alpine | Columnar log storage with TTL and S3 backup | 9000 (native TCP), 8123 (HTTP) |
| Grafana | grafana/grafana-oss:11.5.0 | Dashboard visualization, alerting | 3000 |
📋 Services Affected¶
| Service | Logging changes | Sentry changes |
|---|---|---|
| Gateway | Yes — add OTel SDK + Pino logger | Remove enableLogs experiment flag |
| Order Service | Yes — add OTel SDK + Pino logger | Remove enableLogs experiment flag |
| Print Service | Yes — add OTel SDK + Pino logger | Remove enableLogs experiment flag |
| Shipping Service | Yes — add OTel SDK + Pino logger | Remove enableLogs experiment flag |
| GridFlock Service | Yes — add OTel SDK + Pino logger | Remove enableLogs experiment flag |
| Web (React) | No (frontend logs stay in Sentry) | No change |
Files Modified per Service¶
| Action | Files |
|---|---|
| Modify | observability/instrument.ts — add OTel SDK init for logs |
| Replace | observability/services/sentry-logger.service.ts → otel-logger.service.ts |
| Modify | observability/observability.module.ts — swap provider |
| Keep | observability/filters/sentry-exception.filter.ts — unchanged |
| Modify | observability/interceptors/logging.interceptor.ts — swap logger injection |
| Modify | observability/services/business-observability.service.ts — swap logger injection |
📁 Files to Create/Modify¶
New Files — Infrastructure¶
deployment/staging/otel-collector-config.yaml # OTel Collector pipeline configuration
deployment/staging/clickhouse-config.xml # ClickHouse server configuration
deployment/staging/clickhouse-users.xml # ClickHouse user access (reference)
deployment/staging/grafana/provisioning/datasources/clickhouse.yaml # Grafana ClickHouse datasource
deployment/staging/scripts/backup-clickhouse-logs.sh # Automated ClickHouse backup script
New Files — Application¶
libs/observability/src/lib/otel-logger.ts # Pino logger factory (createLogger)
libs/observability/src/lib/services/otel-logger.service.ts # NestJS injectable logger service (replaces SentryLoggerService)
Modified Files — Infrastructure¶
deployment/staging/docker-compose.yml # Add otel-collector, clickhouse, grafana services + volumes
deployment/staging/.env.example # Add CLICKHOUSE_PASSWORD, GRAFANA_ADMIN_PASSWORD, DO_SPACES_* vars
Modified Files — Application (per service)¶
apps/*/src/observability/instrument.ts # Add OTel SDK init, remove enableLogs experiment
apps/*/src/observability/observability.module.ts # Swap SentryLoggerService → OtelLoggerService
apps/*/src/observability/interceptors/logging.interceptor.ts # Swap logger injection
apps/*/src/observability/services/business-observability.service.ts # Swap logger injection
Modified Files — Shared Library¶
libs/observability/src/index.ts # Export createLogger and OtelLoggerService
Deleted Files (Phase 5 only)¶
apps/*/src/observability/services/sentry-logger.service.ts # Replaced by OtelLoggerService
🔧 Implementation Phases¶
Phase 1: Deploy Infrastructure (1 day)¶
Priority: P0 | Impact: Foundation | Dependencies: None
Deploy ClickHouse, OTel Collector, and Grafana containers on staging. Verify they start, connect, and are healthy.
1. Create OTel Collector configuration¶
Create deployment/staging/otel-collector-config.yaml:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 5s
send_batch_size: 10000
send_batch_max_size: 20000
resource:
attributes:
- key: deployment.environment
value: "${ENVIRONMENT}"
action: upsert
- key: host.name
value: "${HOSTNAME}"
action: upsert
filter/drop-debug:
logs:
log_record:
- 'severity_number < 9'
exporters:
clickhouse:
endpoint: tcp://clickhouse:9000
database: otel
logs_table_name: otel_logs
timeout: 10s
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
create_schema: true
service:
pipelines:
logs:
receivers: [otlp]
processors: [resource, filter/drop-debug, batch]
exporters: [clickhouse]
telemetry:
logs:
level: warn
metrics:
address: 0.0.0.0:8888
extensions: [health_check]
extensions:
health_check:
endpoint: 0.0.0.0:13133
Notes:
- create_schema: true auto-creates the otel_logs table on first run
- filter/drop-debug prevents debug-level logs from reaching ClickHouse in production (disable on staging by removing from the pipeline processors list)
- batch processor is critical for performance — ClickHouse ingests best in large batches
2. Create ClickHouse server configuration¶
Create deployment/staging/clickhouse-config.xml. Note that ClickHouse does not expand ${VAR} placeholders in XML config — either render this file at deploy time (e.g. with envsubst) or use ClickHouse's from_env attribute for the secret values:
<?xml version="1.0"?>
<clickhouse>
<listen_host>0.0.0.0</listen_host>
<logger>
<level>warning</level>
<log>/var/log/clickhouse-server/clickhouse-server.log</log>
<errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
<size>100M</size>
<count>3</count>
</logger>
<max_server_memory_usage_to_ram_ratio>0.8</max_server_memory_usage_to_ram_ratio>
<backups>
<allowed_path>/backups/</allowed_path>
<allowed_disk>s3_backups</allowed_disk>
</backups>
<storage_configuration>
<disks>
<s3_backups>
<type>s3</type>
<endpoint>https://${DO_SPACES_REGION}.digitaloceanspaces.com/${DO_SPACES_BUCKET}/clickhouse-backups/</endpoint>
<access_key_id>${DO_SPACES_KEY}</access_key_id>
<secret_access_key>${DO_SPACES_SECRET}</secret_access_key>
</s3_backups>
</disks>
</storage_configuration>
</clickhouse>
3. Create ClickHouse users configuration¶
Create deployment/staging/clickhouse-users.xml (reference — may use CLICKHOUSE_PASSWORD env var approach instead):
<?xml version="1.0"?>
<clickhouse>
<users>
<otel>
<password_sha256_hex replace="true"><!-- generated hash --></password_sha256_hex>
<networks>
<ip>::/0</ip>
</networks>
<profile>default</profile>
<quota>default</quota>
<access_management>0</access_management>
</otel>
</users>
</clickhouse>
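The `<!-- generated hash -->` placeholder expects a SHA-256 hex digest of the plaintext password. One way to generate it (the `hash_password` helper below is illustrative, not part of the repo):

```shell
# Compute the password_sha256_hex value for clickhouse-users.xml.
# Feed it the real CLICKHOUSE_PASSWORD; printf '%s' avoids hashing a
# trailing newline, which would silently change the digest.
hash_password() {
  printf '%s' "$1" | sha256sum | awk '{print $1}'
}

hash_password "password"
# → 5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8
```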
4. Create Grafana ClickHouse datasource provisioning¶
Create deployment/staging/grafana/provisioning/datasources/clickhouse.yaml:
apiVersion: 1
datasources:
- name: ClickHouse
type: grafana-clickhouse-datasource
access: proxy
isDefault: true
jsonData:
host: clickhouse
port: 9000
protocol: native
username: otel
defaultDatabase: otel
logs:
defaultDatabase: otel
defaultTable: otel_logs
otelEnabled: true
otelVersion: latest
timeColumn: Timestamp
levelColumn: SeverityText
messageColumn: Body
secureJsonData:
password: ${CLICKHOUSE_PASSWORD}
5. Add services to Docker Compose¶
Add the following to deployment/staging/docker-compose.yml:
Services to add:
- otel-collector — OTel Collector Contrib with health check, depends on ClickHouse
- clickhouse — ClickHouse server with named volumes, health check, ulimits
- grafana — Grafana OSS with ClickHouse plugin, provisioning volumes, Traefik labels
Volumes to add:
- clickhouse-data
- clickhouse-logs
- grafana-data
Environment variables for existing backend services:
Add OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 to each backend service (Gateway, Order Service, Print Service, Shipping Service, GridFlock Service). Do NOT add a depends_on entry for the OTel Collector — the OTel SDK buffers and retries if the collector is down.
See the research document (Section 6.1) for the complete Docker Compose service definitions including Traefik labels for Grafana at staging-connect-grafana.forma3d.be.
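A minimal sketch of the three service entries, assuming the image tags from the component table (mount paths, ulimit values, and healthcheck settings here are illustrative; the research document's Section 6.1 definitions, including the Traefik labels, remain authoritative):

```yaml
services:
  clickhouse:
    image: clickhouse/clickhouse-server:24.12-alpine
    environment:
      CLICKHOUSE_DB: otel
      CLICKHOUSE_USER: otel
      CLICKHOUSE_PASSWORD: ${CLICKHOUSE_PASSWORD}
    volumes:
      - clickhouse-data:/var/lib/clickhouse
      - clickhouse-logs:/var/log/clickhouse-server
      - ./clickhouse-config.xml:/etc/clickhouse-server/config.d/custom.xml:ro
    ulimits:
      nofile: { soft: 262144, hard: 262144 }
    healthcheck:
      test: ["CMD", "clickhouse-client", "--query", "SELECT 1"]
      interval: 10s
      retries: 5

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.120.0
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol/config.yaml:ro
    depends_on:
      clickhouse:
        condition: service_healthy

  grafana:
    image: grafana/grafana-oss:11.5.0
    environment:
      GF_SECURITY_ADMIN_USER: ${GRAFANA_ADMIN_USER}
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_INSTALL_PLUGINS: grafana-clickhouse-datasource
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro

volumes:
  clickhouse-data:
  clickhouse-logs:
  grafana-data:
```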
6. Add environment variables¶
Update deployment/staging/.env.example with:
# ClickHouse
CLICKHOUSE_PASSWORD=<generate-strong-password>
# Grafana
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<generate-strong-password>
# DigitalOcean Spaces (for ClickHouse backups)
DO_SPACES_KEY=<generated-key>
DO_SPACES_SECRET=<generated-secret>
DO_SPACES_REGION=ams3
DO_SPACES_BUCKET=forma3d-log-backups
7. Verify infrastructure¶
- Start the stack: docker compose up -d clickhouse otel-collector grafana
- Verify ClickHouse health: docker exec forma3d-clickhouse clickhouse-client --query "SELECT 1"
- Verify OTel Collector health: curl http://localhost:13133/
- Verify Grafana health: curl http://localhost:3000/api/health
- Verify Grafana can query ClickHouse: open Grafana UI → Data Sources → ClickHouse → Test Connection
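Beyond health endpoints, the whole ingest path can be smoke-tested by hand-rolling a minimal OTLP/JSON log record and posting it to the collector's HTTP receiver (the service name "smoke-test" is arbitrary, chosen here so the row is easy to find):

```shell
# Build a minimal OTLP/JSON payload: one INFO record from a fake "smoke-test" service.
otlp_payload() {
  cat <<EOF
{"resourceLogs":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"smoke-test"}}]},"scopeLogs":[{"logRecords":[{"timeUnixNano":"$(date +%s)000000000","severityNumber":9,"severityText":"INFO","body":{"stringValue":"pipeline smoke test"}}]}]}]}
EOF
}

# Post it to the collector's OTLP/HTTP port, then look for the row in ClickHouse:
#   curl -sS -X POST -H 'Content-Type: application/json' \
#     -d "$(otlp_payload)" http://localhost:4318/v1/logs
#   docker exec forma3d-clickhouse clickhouse-client --query \
#     "SELECT Body FROM otel.otel_logs WHERE ServiceName = 'smoke-test'"
```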
Phase 2: Application Integration — Dual-Write (1 week)¶
Priority: P1 | Impact: Core integration | Dependencies: Phase 1
Keep SentryLoggerService active. Add OTel SDK to each service so logs are sent to both Sentry and ClickHouse in parallel. This validates the pipeline without risk.
8. Install required packages¶
pnpm add -w pino @opentelemetry/sdk-node @opentelemetry/api \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-logs-otlp-grpc \
@opentelemetry/sdk-logs \
@opentelemetry/instrumentation-pino \
@opentelemetry/resources \
@opentelemetry/semantic-conventions
pnpm add -D -w pino-pretty   # Pino ships its own TypeScript types; the deprecated @types/pino stub is not needed
9. Create Pino logger factory in shared library¶
Create libs/observability/src/lib/otel-logger.ts:
import pino from 'pino';
import { getServiceName } from './service-context';
export function createLogger(context?: string): pino.Logger {
const logger = pino({
level: process.env['LOG_LEVEL'] || 'info',
transport:
process.env['NODE_ENV'] === 'development'
? { target: 'pino-pretty', options: { colorize: true } }
: undefined,
});
return context ? logger.child({ context }) : logger;
}
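For reference, each call produces a JSON record in which Pino merges the child logger's bound fields with the per-call attributes and stores the message under msg. A dependency-free stand-in of that merge (not the real Pino implementation, just the resulting shape with level/time omitted):

```typescript
// Stand-in for the record shape produced by createLogger(context) plus a log call.
type Attrs = Record<string, unknown>;

function renderLogRecord(context: string | undefined, attributes: Attrs, msg: string): Attrs {
  // Child-bound fields first, then per-call attributes, then the message key.
  return { ...(context ? { context } : {}), ...attributes, msg };
}

const record = renderLogRecord('business', { orderId: 'ord_123' }, 'Order created');
console.log(JSON.stringify(record));
// → {"context":"business","orderId":"ord_123","msg":"Order created"}
```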
10. Create OtelLoggerService in shared library¶
Create libs/observability/src/lib/services/otel-logger.service.ts:
import { Injectable } from '@nestjs/common';
import { createLogger } from '../otel-logger';
import type pino from 'pino';
@Injectable()
export class OtelLoggerService {
private readonly logger: pino.Logger;
constructor() {
this.logger = createLogger('business');
}
info(message: string, attributes?: Record<string, unknown>): void {
this.logger.info(attributes, message);
}
warn(message: string, attributes?: Record<string, unknown>): void {
this.logger.warn(attributes, message);
}
error(message: string, attributes?: Record<string, unknown>): void {
this.logger.error(attributes, message);
}
debug(message: string, attributes?: Record<string, unknown>): void {
this.logger.debug(attributes, message);
}
logEvent(eventType: string, message: string, attributes?: Record<string, unknown>): void {
this.logger.info({ ...attributes, eventType }, message);
}
logAudit(action: string, success: boolean, attributes?: Record<string, unknown>): void {
const level = success ? 'info' : 'warn';
this.logger[level]({ ...attributes, action, success, category: 'audit' }, `Audit: ${action}`);
}
}
11. Export new modules from shared library¶
Update libs/observability/src/index.ts to export createLogger and OtelLoggerService.
12. Modify instrument.ts in each backend service¶
In each service's observability/instrument.ts, add OTel SDK initialization before Sentry init. The OTel SDK must be initialized first so that Sentry can link to OTel traces.
The key changes per instrument.ts:
- Import NodeSDK, OTLPLogExporter, BatchLogRecordProcessor, getNodeAutoInstrumentations, Resource, and semantic conventions
- Read getOtelConfig(SERVICE_NAME) from the shared observability library
- If otelConfig.exporterEndpoint is set, create and start a NodeSDK instance with:
  - logRecordProcessors using BatchLogRecordProcessor + OTLPLogExporter
  - instrumentations with getNodeAutoInstrumentations enabling @opentelemetry/instrumentation-pino (disableLogSending: false)
- Disable unused instrumentations (e.g., @opentelemetry/instrumentation-fs)
- Keep Sentry init unchanged during the dual-write phase (do NOT remove _experiments: { enableLogs: true } yet)
See the research document (Section 5.3) for the complete instrument.ts code.
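A condensed, non-runnable sketch of that bootstrap, assuming the package APIs named in the bullets above (SERVICE_NAME and getOtelConfig are the service's existing constants/helpers; the research document's Section 5.3 code is authoritative):

```typescript
// Sketch only: OTel SDK bootstrap at the top of instrument.ts, BEFORE Sentry.init().
import { NodeSDK } from '@opentelemetry/sdk-node';
import { BatchLogRecordProcessor } from '@opentelemetry/sdk-logs';
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { getOtelConfig } from '@forma3d/observability'; // existing shared helper

const otelConfig = getOtelConfig(SERVICE_NAME); // SERVICE_NAME: the service's existing constant

if (otelConfig.exporterEndpoint) {
  const sdk = new NodeSDK({
    logRecordProcessors: [
      new BatchLogRecordProcessor(new OTLPLogExporter({ url: otelConfig.exporterEndpoint })),
    ],
    instrumentations: [
      getNodeAutoInstrumentations({
        // Bridge Pino records into OTel log records.
        '@opentelemetry/instrumentation-pino': { disableLogSending: false },
        // Noisy and unused for this workload.
        '@opentelemetry/instrumentation-fs': { enabled: false },
      }),
    ],
  });
  sdk.start();
}

// ... Sentry.init({ ... }) stays below, unchanged during dual-write.
```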
13. Validate logs appear in Grafana¶
- Deploy updated services to staging
- Generate some log activity (e.g., process a test order)
- Open Grafana → Explore → select the ClickHouse datasource → query the otel_logs table
- Verify logs from all services appear with correct ServiceName, SeverityText, Body, and TraceId
Phase 3: Build Grafana Dashboards (3–5 days)¶
Priority: P1 | Impact: Visualization | Dependencies: Phase 2 (logs flowing)
14. Create "Service Logs Overview" dashboard¶
- Log volume over time (by service) — time series panel
- Error rate by service — stacked bar chart
- Latest error logs — table panel
- Log level distribution — pie chart
15. Create "Request Tracing" dashboard¶
- Logs filtered by a TraceId variable
- Correlated with Sentry traces (link out to Sentry)
- Request lifecycle visualization
16. Create "Business Events" dashboard¶
- Order processing events (filtered by the eventType log attribute)
- Print job status changes
- Shipment events
- Webhook receipt/processing logs
17. Create "System Health" dashboard¶
- OTel Collector throughput (records/sec) — from Prometheus metrics on :8888
- ClickHouse disk usage — query system.disks
- Log ingestion latency
18. Configure alerting rules¶
| Alert | Condition | Channel |
|---|---|---|
| High error rate | > 50 ERROR logs in 5 min for any service | Slack / Email |
| Service silent | Zero logs from a service for > 10 min | Slack |
| ClickHouse disk > 80% | Query system.disks | |
| OTel Collector unhealthy | Health check fail | Slack |
19. Useful Grafana queries for reference¶
Error log count by service (last 24h):
SELECT
ServiceName,
count() AS error_count
FROM otel.otel_logs
WHERE SeverityText IN ('ERROR', 'FATAL')
AND Timestamp >= now() - INTERVAL 24 HOUR
GROUP BY ServiceName
ORDER BY error_count DESC
Log volume by level (time series):
SELECT
toStartOfFiveMinutes(Timestamp) AS time,
SeverityText,
count() AS count
FROM otel.otel_logs
WHERE Timestamp >= $__fromTime AND Timestamp <= $__toTime
GROUP BY time, SeverityText
ORDER BY time
Search logs by keyword:
SELECT
Timestamp,
ServiceName,
SeverityText,
Body,
LogAttributes
FROM otel.otel_logs
WHERE Body LIKE '%order%'
AND Timestamp >= $__fromTime AND Timestamp <= $__toTime
ORDER BY Timestamp DESC
LIMIT 100
Phase 4: Cut Over — Replace SentryLoggerService (1 day)¶
Priority: P2 | Impact: Migration | Dependencies: Phase 3 (dashboards validated)
20. Swap SentryLoggerService → OtelLoggerService in all services¶
In each service's observability/observability.module.ts:
- Replace SentryLoggerService provider with OtelLoggerService from @forma3d/observability
- Update all injection sites (logging.interceptor.ts, business-observability.service.ts, etc.)
21. Remove Sentry Logs experiment flag¶
In each service's instrument.ts:
- Remove _experiments: { enableLogs: true } from the Sentry.init() call
- Remove any Sentry.logger.* imports/calls
22. Verify cut-over¶
- Deploy updated services to staging
- Verify logs no longer appear in Sentry's log explorer
- Verify logs continue to appear in Grafana
- Verify Sentry still captures exceptions and traces correctly
- Verify Dozzle still shows container logs (Pino writes to stdout)
Phase 5: Cleanup and Backups (1 day)¶
Priority: P2 | Impact: Completion | Dependencies: Phase 4 validated for 2+ weeks
23. Delete SentryLoggerService files¶
Remove apps/*/src/observability/services/sentry-logger.service.ts from all services.
24. Apply tiered TTL policy¶
Connect to ClickHouse and run:
ALTER TABLE otel.otel_logs MODIFY TTL
  TimestampDate + INTERVAL 7 DAY DELETE WHERE SeverityText IN ('TRACE', 'DEBUG'),
  TimestampDate + INTERVAL 30 DAY DELETE WHERE SeverityText = 'INFO',
  TimestampDate + INTERVAL 90 DAY DELETE WHERE SeverityText = 'WARN',
  TimestampDate + INTERVAL 180 DAY;
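After the ALTER, the applied TTL clauses can be confirmed from the system tables (a quick check, not part of the migration itself):

```sql
-- Show the full engine definition, including the TTL clauses just applied
SELECT engine_full
FROM system.tables
WHERE database = 'otel' AND name = 'otel_logs';
```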
25. Set up DigitalOcean Spaces bucket¶
- Create the forma3d-log-backups bucket in the ams3 region
- Disable CDN, disable versioning
- Generate Spaces access key and secret
- Add lifecycle rules:
| Path prefix | Expiration |
|---|---|
| clickhouse/full/ | 35 days |
| clickhouse/incremental/ | 14 days |
| clickhouse/archive/ | 365 days |
26. Create backup script¶
Create deployment/staging/scripts/backup-clickhouse-logs.sh with:
- Full backup on Sundays
- Incremental backup on weekdays (referencing last full)
- Logging to /var/log/clickhouse-backup.log
See the research document (Section 8.4) for the complete backup script.
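The script's full-vs-incremental decision can be sketched as follows (backup_kind and the BACKUP statements are illustrative; the real script lives in Section 8.4 of the research document):

```shell
# Decide backup kind from the day of week (1=Mon .. 7=Sun, as printed by `date +%u`):
# full on Sundays, incremental otherwise.
backup_kind() {
  if [ "$1" -eq 7 ]; then echo "full"; else echo "incremental"; fi
}

kind=$(backup_kind "$(date +%u)")
stamp=$(date +%Y-%m-%d)

# Illustrative BACKUP statements against the s3_backups disk configured earlier:
#   full:        BACKUP TABLE otel.otel_logs TO Disk('s3_backups', 'full/<stamp>/')
#   incremental: BACKUP TABLE otel.otel_logs TO Disk('s3_backups', 'incremental/<stamp>/')
#                SETTINGS base_backup = Disk('s3_backups', 'full/<last-full-stamp>/')
echo "would run a ${kind} backup dated ${stamp}"
```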
27. Configure backup cron¶
Add to the Droplet's crontab:
0 3 * * * /opt/forma3d/scripts/backup-clickhouse-logs.sh >> /var/log/clickhouse-backup.log 2>&1
28. Add DNS record¶
Create DNS A record for staging-connect-grafana.forma3d.be pointing to the staging Droplet.
29. Add Grafana to monitoring¶
Add Grafana to Uptime Kuma monitoring at https://staging-connect-grafana.forma3d.be/api/health.
Phase 6: Production Parity (4 hours)¶
Priority: P3 | Impact: Production readiness | Dependencies: Phase 5
30. Replicate setup for production docker-compose¶
Copy infrastructure configuration to deployment/production/ with these differences:
| Config | Staging | Production |
|---|---|---|
| LOG_LEVEL | debug | info |
| filter/drop-debug processor | Disabled (remove from pipeline) | Enabled |
| ClickHouse TTL (INFO) | 14 days | 30 days |
| ClickHouse TTL (ERROR) | 90 days | 180 days |
| Backup frequency | Daily (no incremental) | Full weekly + daily incremental |
| Grafana alerts | Email only | Slack + Email |
31. Remove Sentry Logs experiment flag cleanup¶
Remove any remaining Sentry Logs code or configuration across all services.
32. Document runbooks¶
Add ClickHouse operational runbooks to docs/05-deployment/:
- How to query logs manually via clickhouse-client
- How to restore from backup
- How to check disk usage and TTL status
- How to force TTL cleanup: OPTIMIZE TABLE otel.otel_logs FINAL
📊 Resource Requirements¶
Estimated Log Volume¶
| Service | Est. logs/min (staging) | Est. logs/min (production) |
|---|---|---|
| Gateway | 20 | 100 |
| Order Service | 30 | 150 |
| Print Service | 10 | 50 |
| Shipping Service | 10 | 50 |
| GridFlock Service | 5 | 20 |
| Total | ~75 | ~370 |
Storage Estimates (production, with ClickHouse ~10:1 compression)¶
| Timeframe | Raw size | Compressed |
|---|---|---|
| 1 day | ~265 MB | ~26 MB |
| 30 days | ~8 GB | ~800 MB |
| 90 days | ~24 GB | ~2.4 GB |
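As a sanity check on the table above, assuming roughly 500 bytes per structured log record (an assumption for illustration, not a measured value):

```typescript
// Back-of-envelope check for the production storage estimates.
const logsPerMinute = 370;    // total from the log-volume table
const bytesPerRecord = 500;   // assumed average size of one structured log row

const rawPerDay = logsPerMinute * 60 * 24 * bytesPerRecord; // bytes/day
const compressedPerDay = rawPerDay / 10;                    // ~10:1 ClickHouse compression

console.log((rawPerDay / 1e6).toFixed(0) + ' MB raw/day');               // ≈ 266 MB
console.log((compressedPerDay / 1e6).toFixed(1) + ' MB compressed/day'); // ≈ 26.6 MB
```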
Container Resource Allocation¶
| Container | CPU (cores) | RAM | Disk |
|---|---|---|---|
| ClickHouse | 0.5–1 | 1–2 GB | 10 GB initial |
| OTel Collector | 0.25 | 256 MB | Minimal |
| Grafana | 0.25 | 256 MB | 1 GB |
| Total new | 1–1.5 | 1.5–2.5 GB | ~12 GB |
Droplet Sizing¶
| Current | Recommended |
|---|---|
| 4 GB RAM / 2 vCPU | 8 GB RAM / 4 vCPU ($48/mo) |
✅ Validation Checklist¶
Infrastructure¶
- ClickHouse container starts and is healthy (SELECT 1 succeeds)
- OTel Collector container starts and is healthy (:13133/ returns OK)
- Grafana container starts and is healthy (/api/health returns OK)
- Grafana ClickHouse plugin installed (grafana-clickhouse-datasource)
- Grafana data source connection to ClickHouse tested successfully
- otel_logs table auto-created in the ClickHouse otel database
- Grafana accessible at staging-connect-grafana.forma3d.be via Traefik
Application Integration¶
- pino and OTel packages installed at workspace root
- createLogger factory exported from @forma3d/observability
- OtelLoggerService exported from @forma3d/observability
- OTel SDK initialized in instrument.ts for all 5 backend services
- @opentelemetry/instrumentation-pino enabled (disableLogSending: false)
- OTEL_EXPORTER_OTLP_ENDPOINT set in all backend service containers
- Pino logs appear on stdout (Dozzle still works)
- Logs appear in the ClickHouse otel_logs table with correct fields
- TraceId and SpanId populated in log records (trace correlation working)
- ServiceName correctly identifies each service
Dashboards¶
- "Service Logs Overview" dashboard created and functional
- "Request Tracing" dashboard created with TraceId filtering
- "Business Events" dashboard created with event type filtering
- "System Health" dashboard created with ClickHouse metrics
- Alerting rules configured (high error rate, service silent, disk usage)
Migration Cut-Over¶
- SentryLoggerService replaced by OtelLoggerService in all services
- _experiments: { enableLogs: true } removed from all Sentry init calls
- Sentry still captures exceptions correctly (SentryExceptionFilter unchanged)
- Sentry still captures performance traces correctly
- No logs appearing in Sentry's log explorer after cut-over
- All logs flowing to ClickHouse/Grafana
Retention and Backups¶
- Tiered TTL applied: DEBUG 7d, INFO 30d, WARN 90d, ERROR/FATAL 180d
- DigitalOcean Spaces bucket created (forma3d-log-backups in ams3)
- Backup script executable and runs successfully
- Cron job configured for nightly backups at 3 AM
- Spaces lifecycle rules configured for backup cleanup
- Backup restore tested: RESTORE TABLE otel.otel_logs FROM S3(...) succeeds
Rollback Readiness¶
- SentryLoggerService files retained until Phase 5 is validated for 2+ weeks
- Re-enabling _experiments: { enableLogs: true } documented as the rollback step
- Dozzle continues to work as a last-resort log viewer
Verification Commands¶
# ClickHouse health
docker exec forma3d-clickhouse clickhouse-client --query "SELECT 1"
# OTel Collector health
curl http://localhost:13133/
# Grafana health
curl http://localhost:3000/api/health
# Check log count in ClickHouse
docker exec forma3d-clickhouse clickhouse-client --query "SELECT count() FROM otel.otel_logs"
# Check logs by service
docker exec forma3d-clickhouse clickhouse-client --query \
"SELECT ServiceName, count() FROM otel.otel_logs GROUP BY ServiceName"
# Check TTL status
docker exec forma3d-clickhouse clickhouse-client --query \
"SELECT name, engine, total_rows, total_bytes FROM system.tables WHERE database = 'otel'"
# Check disk usage
docker exec forma3d-clickhouse clickhouse-client --query \
"SELECT name, path, free_space, total_space FROM system.disks"
# Build passes
pnpm nx run-many -t build --all
# Tests pass
pnpm nx run-many -t test --all --exclude=api-e2e,acceptance-tests
# Lint passes
pnpm nx run-many -t lint --all
🚫 Constraints and Rules¶
MUST DO¶
- Keep Sentry for error tracking, performance monitoring, tracing, and profiling — only move logging
- Use OpenTelemetry as the transport layer (vendor-neutral, backend-swappable)
- Use Pino as the application logger (structured JSON, stdout output, Dozzle compatibility)
- Bridge Pino to OTel via @opentelemetry/instrumentation-pino (no direct OTel log API calls in application code)
- Initialize the OTel SDK before Sentry init in instrument.ts (so Sentry can link to OTel traces)
- Use the OTel Collector Contrib distribution (otel/opentelemetry-collector-contrib) — required for the clickhouse exporter
- Set create_schema: true on the ClickHouse exporter so the otel_logs table is auto-created
- Use named Docker volumes for ClickHouse data and Grafana data
- Add health checks to all three new containers
- Apply tiered TTL retention policy (not a flat retention period)
- Partition ClickHouse logs by TimestampDate for efficient TTL drops
- Run dual-write (Phase 2) for at least 1 week before cutting over (Phase 4)
- Keep SentryLoggerService files until Phase 5 is validated for 2+ weeks
- Store CLICKHOUSE_PASSWORD and GRAFANA_ADMIN_PASSWORD as secrets — never commit them to the repository
- Add OTEL_EXPORTER_OTLP_ENDPOINT to each backend service's environment (no hard dependency on the collector)
- Export createLogger and OtelLoggerService from the shared @forma3d/observability library
- Add DNS record and Uptime Kuma monitoring for Grafana
MUST NOT¶
- Remove or modify Sentry exception filters, performance monitoring, or profiling
- Remove the Sentry SDK or @sentry/nestjs from any service
- Add depends_on from backend services to the OTel Collector (the OTel SDK handles retries gracefully)
- Call OTel log APIs directly in application code — always use Pino, let the instrumentation bridge handle it
- Use ClickHouse for anything other than logs at this stage (no traces, no metrics)
- Expose ClickHouse ports externally (keep behind Docker network)
- Hard-code secrets in configuration files or Docker Compose
- Skip the dual-write validation phase (Phase 2)
- Delete SentryLoggerService before the cut-over is validated for 2+ weeks
- Use any, ts-ignore, or eslint-disable
- Add console.log statements — all logging goes through Pino/OtelLoggerService
SHOULD DO (Nice to Have)¶
- Provision Grafana dashboards via JSON files in grafana/provisioning/dashboards/ for reproducibility
- Add Grafana SMTP configuration for email alerting
- Set up Grafana Slack integration for critical alerts
- Configure Grafana IP allowlisting or VPN access for the admin UI
- Add the pino-pretty transport for local development (colored, human-readable logs)
- Add a LOG_LEVEL environment variable per service for fine-grained control
- Explore routing OTel traces to ClickHouse in a future iteration (tracing currently stays in Sentry)
- Explore adding browser/frontend logs to the OTel pipeline in a future iteration
🔄 Rollback Plan¶
If issues arise at any phase:
- Phase 1 (infrastructure): Remove the three new containers from docker-compose.yml — no application impact
- Phase 2 (dual-write): Revert the instrument.ts changes — remove the OTel SDK init. Sentry Logs continues working
- Phase 3 (dashboards): No rollback needed — dashboards are additive
- Phase 4 (cut-over): Re-enable _experiments: { enableLogs: true } in Sentry init, swap OtelLoggerService back to SentryLoggerService
- Phase 5 (cleanup): If SentryLoggerService was already deleted, restore it from Git history
Dozzle remains available as a last-resort log viewer at all times (reads Docker stdout directly).
📚 Key References¶
Research:
- Detailed research document: docs/03-architecture/research/clickhouse-grafana-logging-research.md
Technologies:
- OpenTelemetry Collector Contrib: https://github.com/open-telemetry/opentelemetry-collector-contrib
- ClickHouse OTel Exporter: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/clickhouseexporter
- ClickHouse Docker: https://hub.docker.com/r/clickhouse/clickhouse-server
- Grafana ClickHouse Plugin: https://grafana.com/grafana/plugins/grafana-clickhouse-datasource/
- Pino Logger: https://github.com/pinojs/pino
- OTel Pino Instrumentation: https://github.com/open-telemetry/opentelemetry-js-contrib/tree/main/plugins/node/opentelemetry-instrumentation-pino
Existing Codebase:
- Current Sentry logger: apps/*/src/observability/services/sentry-logger.service.ts
- Current instrument files: apps/*/src/observability/instrument.ts
- Shared observability library: libs/observability/src/
- OTel config helper: libs/observability/src/lib/otel-config.ts (already exists, currently unused for logging)
- Docker Compose (staging): deployment/staging/docker-compose.yml
- Deployment guide: docs/05-deployment/staging-deployment-guide.md
END OF PROMPT
This prompt implements a self-hosted ClickHouse + Grafana + OpenTelemetry centralized logging stack for the Forma3D.Connect microservice platform, as designed in docs/03-architecture/research/clickhouse-grafana-logging-research.md. The AI should deploy ClickHouse, OTel Collector, and Grafana containers via Docker Compose; install Pino and OTel packages; create a shared OtelLoggerService that replaces SentryLoggerService; bridge Pino to OpenTelemetry via @opentelemetry/instrumentation-pino; modify instrument.ts in all 5 backend services to initialize the OTel SDK; build 4 Grafana dashboards with alerting; apply tiered TTL retention; and set up automated backups to DigitalOcean Spaces. Sentry continues to handle error tracking, performance monitoring, tracing, and profiling — only the logging concern moves. A phased rollout with dual-write validation ensures safe migration with rollback at every stage.