
Scheduled Maintenance Runbook

Version: 1.0
Date: February 21, 2026
Status: Active

This runbook defines the recurring maintenance cycles for Forma3D.Connect. It covers dependency updates, platform upgrades, infrastructure patching, security reviews, and operational housekeeping.


Maintenance Cadences

Not every component needs the same update frequency. The table below defines four cadences based on risk, velocity, and operational impact.

| Cadence | Frequency | Focus | Typical Duration |
| --- | --- | --- | --- |
| Continuous | Ongoing / weekly | Automated scanning, alert triage | 30 min / week |
| Monthly | 1st week of each month | Security patches, minor bumps, monitoring review | 2–4 hours |
| Quarterly | Jan, Apr, Jul, Oct | Dependency upgrades, infra updates, key rotation, full security review | 1–2 days |
| Semi-annual | Jan, Jul | Major platform upgrades, load testing, DR drills, architecture review | 2–3 days |

Choosing the Right Cadence

| Signal | Action |
| --- | --- |
| Critical CVE in a direct dependency | Patch immediately (out-of-band) |
| Grype flags a high-severity CVE | Triage within 48 hours |
| New major version of Node.js / NestJS / React | Evaluate at semi-annual, adopt at next quarterly |
| Minor / patch version of any dependency | Bundle into monthly cycle |
| Let's Encrypt certificate nearing expiry | Traefik auto-renews; verify monthly |
| Database CA certificate nearing expiry | Plan rotation at quarterly; execute in maintenance window |

1. Continuous — Automated Scanning & Alert Triage

These tasks run automatically or require only brief weekly attention.

1.1 Grype CVE Scan Results

  • Review Grype scan results in recent pipeline runs for new critical / high CVEs
  • Triage new container image vulnerabilities (update .grype.yaml exclusions for false positives, create tickets for real issues)
  • Verify SBOM freshness (Syft) — regenerated on last CI run
  • Check .grype.yaml exclusions are still warranted

1.2 Sentry Error Monitoring

  • Review new unresolved issues
  • Check for regressions in resolved issues
  • Verify no sensitive data leaking into breadcrumbs or contexts

1.3 Uptime & Health

  • Confirm Uptime Kuma shows no prolonged outages
  • Spot-check /health, /health/ready, /health/dependencies endpoints

1.4 CI/CD Pipeline

  • Review pipeline run history for flaky tests or timeouts
  • Confirm Cosign signing is still succeeding on image pushes

2. Monthly — Security Patches & Monitoring Review

When: First working week of each month.
Prep: Create a maintenance/YYYY-MM branch.
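
A small helper can keep branch names consistent between the monthly and quarterly cycles. The sketch below is illustrative only: the function name maintenance_branch and its two arguments are assumptions, not part of the existing tooling.

```shell
#!/bin/sh
# Print the maintenance branch name for a cadence and an ISO date (YYYY-MM-DD).
# Hypothetical helper: name and argument order are illustrative.
maintenance_branch() {
  cadence=$1
  iso_date=$2
  year=${iso_date%%-*}          # text before the first "-"
  rest=${iso_date#*-}           # "MM-DD"
  month=${rest%%-*}             # "MM"
  case $cadence in
    monthly)
      printf 'maintenance/%s-%s\n' "$year" "$month" ;;
    quarterly)
      m=${month#0}              # drop a leading zero so "08" is not read as octal
      quarter=$(( (m - 1) / 3 + 1 ))
      printf 'maintenance/%s-Q%s\n' "$year" "$quarter" ;;
    *)
      echo "unknown cadence: $cadence" >&2
      return 1 ;;
  esac
}

# Usage (creates the branch for today's date):
#   git switch -c "$(maintenance_branch monthly "$(date +%F)")"
```

The same helper covers the maintenance/YYYY-QN branch used in the quarterly pre-flight by passing quarterly as the first argument.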

2.1 Dependency Security Patches

# Check for known vulnerabilities
pnpm audit

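# List outdated packages (apply only patch-level bumps this cycle)
pnpm outdated

# Apply updates within the semver ranges declared in package.json.
# Do not pass --latest here: it ignores those ranges and can pull in new majors.
pnpm update --recursive  # review diff before committing

  • Run pnpm audit and resolve critical and high findings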
  • Apply patch-level dependency updates across the monorepo
  • Run full test suite: pnpm nx run-many -t lint,test,build
  • Deploy to staging and run acceptance tests
  • Merge maintenance branch after green pipeline
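
When reviewing the pnpm outdated listing, it can help to classify each proposed jump so that only patch-level changes go into the monthly branch. The helper below is a minimal sketch under that assumption; bump_type is a hypothetical name, and it only understands plain x.y.z versions.

```shell
#!/bin/sh
# Classify the jump between two x.y.z versions as none, patch, minor, or major.
# Pre-release suffixes (e.g. "-rc.1") are not handled in this sketch.
bump_type() {
  cur=$1
  new=$2
  cur_major=${cur%%.*}; cur_rest=${cur#*.}; cur_minor=${cur_rest%%.*}
  new_major=${new%%.*}; new_rest=${new#*.}; new_minor=${new_rest%%.*}
  if [ "$cur" = "$new" ]; then
    echo none
  elif [ "$cur_major" != "$new_major" ]; then
    echo major
  elif [ "$cur_minor" != "$new_minor" ]; then
    echo minor
  else
    echo patch
  fi
}

# Usage:
#   bump_type 5.22.0 5.22.3   # prints "patch"
#   bump_type 5.22.0 6.0.0    # prints "major"
```

Anything reported as minor or major would be deferred to the quarterly or semi-annual cycle described elsewhere in this runbook.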

2.2 Container Image Patches

Alpine base images receive security patches regularly. Rebuilding images pulls in the latest apk packages.

  • Rebuild all Docker images (forces apk upgrade in build stage)
  • Verify images still pass health checks on staging
  • Confirm Cosign signatures are applied to rebuilt images

2.3 Monitoring & Observability Review

  • Review Grafana dashboards for anomalies in the last 30 days
  • Check ClickHouse log retention — verify old logs are archiving to DigitalOcean Spaces
  • Review OpenTelemetry Collector health (otel-collector container logs)
  • Verify Sentry quota usage and rate limits

2.4 TLS Certificates

Let's Encrypt certificates are auto-renewed by Traefik, but verification is prudent.

# Verify certificate validity
echo | openssl s_client -connect connect-api.forma3d.be:443 2>/dev/null \
  | openssl x509 -noout -dates

echo | openssl s_client -connect connect.forma3d.be:443 2>/dev/null \
  | openssl x509 -noout -dates

  • Confirm certificates have > 30 days remaining
  • If renewal is failing, check Traefik ACME logs and DNS configuration
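
The 30-day check above can be turned into a pass/fail test instead of reading dates by eye. This is a sketch, not the project's tooling: the function names are hypothetical, it uses openssl x509 -noout -enddate (which prints only the notAfter line), and it relies on GNU date for parsing.

```shell
#!/bin/sh
# Days between "now" and a certificate's notAfter date.
# $1: the line printed by `openssl x509 -noout -enddate`,
#     e.g. "notAfter=Mar  2 00:00:00 2026 GMT"
# $2: current time as epoch seconds (defaults to the system clock)
# Relies on GNU date (-d); BSD/macOS date needs different flags.
cert_days_remaining() {
  end_date=${1#notAfter=}
  now=${2:-$(date +%s)}
  end=$(date -u -d "$end_date" +%s) || return 1
  echo $(( (end - now) / 86400 ))
}

# Succeeds only if more than $2 days (default 30) remain for host $1.
cert_ok() {
  line=$(echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null \
    | openssl x509 -noout -enddate) || return 1
  [ "$(cert_days_remaining "$line")" -gt "${2:-30}" ]
}

# Usage:
#   cert_ok connect-api.forma3d.be || echo "certificate expires within 30 days"
```

Passing a second argument changes the threshold, so the same function could serve the 90-day database CA check in the quarterly cycle if that certificate is reachable over TLS.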

2.5 BullMQ Queue Health

  • Check for stuck or stalled jobs across all queues
  • Review failed job counts — investigate recurring failures
  • Verify Redis memory usage is within acceptable bounds
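
Queue depth can be read straight from Redis for a quick look without a dashboard. The sketch below assumes BullMQ's default key prefix (bull) and its usual layout, where waiting and active jobs are lists and failed jobs are a sorted set; the queue name in the usage line is a placeholder, and queue_counts is a hypothetical helper.

```shell
#!/bin/sh
# Print waiting/active/failed counts for one BullMQ queue.
# Assumes the default "bull" key prefix. REDIS_CLI can be overridden,
# for example REDIS_CLI="redis-cli -h redis" to target another host.
REDIS_CLI=${REDIS_CLI:-redis-cli}

queue_counts() {
  queue=$1
  # $REDIS_CLI is intentionally unquoted so extra options are split into words.
  waiting=$($REDIS_CLI LLEN "bull:$queue:wait")
  active=$($REDIS_CLI LLEN "bull:$queue:active")
  failed=$($REDIS_CLI ZCARD "bull:$queue:failed")
  printf '%s waiting=%s active=%s failed=%s\n' "$queue" "$waiting" "$active" "$failed"
}

# Usage (queue name is a placeholder):
#   queue_counts orders
```

A persistently non-zero active count with no throughput is one hint of stalled jobs, though BullMQ's own APIs or a queue dashboard give a more reliable picture.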

3. Quarterly — Full Dependency & Infrastructure Update

When: First two weeks of January, April, July, October.
Prep: Schedule a maintenance window. Communicate to stakeholders if downtime is expected.

3.1 Pre-Flight

  • Review this runbook for any process changes since last quarter
  • Create a maintenance/YYYY-QN branch
  • Back up the production database (verify backup is restorable)
  • Export current dependency tree: pnpm list --depth=0 > deps-before.txt
  • Note current versions of all infrastructure components

3.2 Full Security Review (Grype + SonarCloud)

Run the full security audit checklist from security-audit-checklist.md.

  • Complete all sections of the security audit checklist
  • Review Grype CVE scan results across all container images
  • Review SonarCloud security hotspots
  • Address any critical or high findings before proceeding
  • Document exceptions with justification (update .grype.yaml if needed)
  • Update the Audit Log table in the checklist

3.3 npm / pnpm Dependency Updates

3.3.1 Frontend Dependencies

| Package | Current | Update Strategy | Risk |
| --- | --- | --- | --- |
| react, react-dom | 19.x | Patch/minor within 19.x | Low |
| react-router-dom | 6.x | Minor updates | Low |
| @tanstack/react-query | 5.x | Minor updates | Low |
| tailwindcss | 4.x | Minor updates, test visual regressions | Medium |
| vite | 7.x | Minor updates | Low |
| socket.io-client | 4.x | Match server version | Medium |

3.3.2 Backend Dependencies

| Package | Current | Update Strategy | Risk |
| --- | --- | --- | --- |
| @nestjs/* | 11.x | Minor updates within 11.x | Medium |
| express | 5.x | Patch only | Low |
| @prisma/client, prisma | 5.x | Minor updates, test migrations | Medium |
| bullmq | 5.x | Minor updates | Low |
| ioredis | 5.x | Minor updates | Low |
| socket.io | 4.x | Match client version | Medium |
| helmet | 8.x | Minor updates | Low |
| pino | 10.x | Minor updates | Low |

3.3.3 Observability Dependencies

| Package | Current | Update Strategy | Risk |
| --- | --- | --- | --- |
| @sentry/* | Latest | Follow Sentry release notes | Medium |
| @opentelemetry/* | Latest | Update as a group | Medium |

3.3.4 DevDependencies

| Package | Current | Update Strategy | Risk |
| --- | --- | --- | --- |
| nx | 22.x | Minor updates, check migration guide | Medium |
| typescript | 5.x | Minor updates, run full type check | Medium |
| eslint | 9.x | Minor updates with config review | Low |
| vitest | 4.x | Minor updates | Low |
| jest | 30.x | Minor updates | Low |
| playwright | 1.x | Minor updates, re-record if needed | Medium |

3.3.5 Update Procedure

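# Review what is outdated
pnpm outdated

# Update all packages to their newest versions.
# --latest ignores the ranges in package.json, so check any major bumps
# against the strategy tables in 3.3.1 to 3.3.4 before committing.
pnpm update --recursive --latest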

# Compare with pre-update snapshot
pnpm list --depth=0 > deps-after.txt
diff deps-before.txt deps-after.txt

# Full validation
pnpm install
pnpm nx run-many -t lint
pnpm nx run-many -t test
pnpm nx run-many -t build

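# If Nx needs updating, use its migration tooling rather than a plain version bump
pnpm nx migrate latest            # updates package.json and writes migrations.json
pnpm install                      # install the versions nx migrate selected
pnpm nx migrate --run-migrations  # apply the generated migrations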

3.4 Platform Runtime Updates

Node.js

The stack currently runs on the Node.js 20.x LTS release line. Patch updates within 20.x are low-risk.

  • Check current Node.js 20.x latest patch: https://nodejs.org/en/download/
  • Update engines.node in root package.json if pinning a minimum patch
  • Update FROM node:20-alpine in all Dockerfiles to pull latest patch
  • Update nodeVersion: '20.x' in Azure Pipelines if changing major/minor
  • Rebuild and test all services

Evaluating a Node.js major version upgrade (e.g., 20 → 22):

  • Only at semi-annual cadence
  • Verify all dependencies support the new version
  • Test on a feature branch first
  • Update Dockerfiles, CI/CD, and engines field together
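
Because Dockerfiles, CI/CD, and the engines field have to move together, a drift check can catch a Dockerfile left behind on an older major. The following is a sketch only; the function names are hypothetical and the Dockerfile paths in the usage line are placeholders for wherever the repository keeps them.

```shell
#!/bin/sh
# Print the distinct Node.js major versions referenced by "FROM node:<tag>"
# lines in the given Dockerfiles, one per line.
dockerfile_node_majors() {
  grep -hiE '^FROM[[:space:]]+node:[0-9]+' "$@" \
    | sed -E 's/^[Ff][Rr][Oo][Mm][[:space:]]+node:([0-9]+).*/\1/' \
    | sort -u
}

# Succeeds when every Dockerfile passed in uses exactly the expected major.
check_node_major() {
  expected=$1
  shift
  found=$(dockerfile_node_majors "$@")
  [ "$found" = "$expected" ]
}

# Usage (paths are placeholders):
#   check_node_major 20 apps/*/Dockerfile || echo "Node.js major version drift"
```

The check fails both when a different major appears and when no node base image is found at all, which keeps a mistyped path from passing silently.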

pnpm

  • Check latest pnpm 9.x release: https://pnpm.io/installation
  • Update packageManager field in root package.json
  • Regenerate lockfile: pnpm install
  • Verify CI/CD pnpm version matches

3.5 Container Base Images

| Image | Used By | Current | Update Strategy |
| --- | --- | --- | --- |
| node:20-alpine | All backend services | 20-alpine | Rebuild to pull latest Alpine patches |
| nginx:alpine | Web app | alpine | Rebuild to pull latest |
| redis:7-alpine | Redis | 7-alpine | Pin minor, update on quarterly |
| clickhouse/clickhouse-server | ClickHouse | 24.12-alpine | Evaluate new minor releases |
| traefik | Reverse proxy | v3.0 | Evaluate new minor releases |
| grafana/grafana | Grafana | 11.5.0 | Update to latest 11.x |
| otel/opentelemetry-collector | OTel Collector | 0.120.0 | Update to latest |

# Pull latest base images on build server
docker pull node:20-alpine
docker pull nginx:alpine
docker pull redis:7-alpine

# Rebuild all application images
# (handled by CI/CD pipeline on merge)

  • Update pinned versions in docker-compose.yml for infrastructure services
  • Test updated infrastructure services on staging before production
  • Verify inter-service compatibility after updates

3.6 Infrastructure Services

PostgreSQL (DigitalOcean Managed)

  • Check current PostgreSQL version in DigitalOcean dashboard
  • Review DigitalOcean maintenance window schedule
  • Apply pending managed database patches if available
  • Verify connection pool settings are still appropriate
  • Review slow query logs for optimization opportunities
  • Confirm database backups are running and restorable

Redis

  • Check Redis 7.x latest patch release
  • Update Redis image tag in docker-compose.yml if needed
  • Verify maxmemory and eviction policy settings
  • Review memory usage trends
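
To put a number on the memory check, the output of redis-cli INFO memory can be reduced to a single percentage of maxmemory. This is a minimal sketch; redis_mem_pct is a hypothetical helper, and it assumes the standard used_memory and maxmemory fields (redis-cli emits CRLF line endings, hence the tr).

```shell
#!/bin/sh
# Read `redis-cli INFO memory` output on stdin and print used memory as a
# percentage of maxmemory, or "unlimited" when maxmemory is 0 (no cap set).
redis_mem_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) print "unlimited"
      else printf "%.1f\n", used / max * 100
    }'
}

# Usage:
#   redis-cli INFO memory | redis_mem_pct
```

An "unlimited" result is worth a second look on its own, since it means no maxmemory cap is configured and the eviction policy never comes into play.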

ClickHouse

  • Check for new ClickHouse releases
  • Review log retention and archival policies
  • Verify disk usage is within bounds
  • Test any schema migrations on staging first

Traefik

  • Check for Traefik v3.x updates
  • Review Traefik access logs for anomalies
  • Verify middleware configurations are still appropriate
  • Test updated Traefik on staging before production

3.7 Key & Certificate Rotation

Cross-reference with keys-certificates-inventory.md.

  • Database password — rotate quarterly (update DATABASE_URL in Azure DevOps, all services)
  • Database CA certificate — check expiry, rotate if < 90 days remaining
  • SSH keys — review access, rotate annually
  • API keys (Shopify, SimplyPrint, Sendcloud) — verify still valid, rotate if compromised
  • Cosign signing key — verify key is accessible to CI/CD
  • Container registry token — verify still valid
  • Session secret — rotate if desired, will invalidate active sessions
  • Update "Last Renewed" dates in keys-certificates-inventory.md

3.8 CI/CD Pipeline Maintenance

  • Review Azure DevOps agent pool health (self-hosted DO-Build-Agents)
  • Update self-hosted agent OS packages: apt update && apt upgrade
  • Clean up old Docker images and build cache on build agent
  • Review pipeline YAML for deprecated tasks or actions
  • Verify variable groups and secrets are current
  • Review pipeline duration trends — investigate slowdowns

3.9 Documentation & Housekeeping

  • Update this runbook if procedures have changed
  • Update TIMELINE.md with maintenance activities
  • Update changelog.md with dependency version changes
  • Review and close stale Azure DevOps work items
  • Archive old ClickHouse logs if not handled by automated retention
  • Clean up unused Docker images in DigitalOcean Container Registry

3.10 Post-Quarterly Validation

  • Deploy updated stack to staging
  • Run full acceptance test suite (pnpm nx run acceptance-tests:e2e)
  • Run smoke tests against all service health endpoints
  • Verify Shopify webhook processing end-to-end
  • Verify SimplyPrint integration end-to-end
  • Verify Sendcloud integration end-to-end
  • Monitor staging for 24–48 hours before promoting to production
  • Deploy to production
  • Monitor production for anomalies for 48 hours post-deploy
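
The smoke-test step in the list above could be scripted roughly as follows. This is a sketch under assumptions: smoke_test is a hypothetical helper, it checks only the three health paths named in section 1.3, and it treats anything other than HTTP 200 as a failure.

```shell
#!/bin/sh
# Request each health endpoint and report the HTTP status.
# CURL can be overridden (for example to add proxy options).
CURL=${CURL:-curl}

smoke_test() {
  base_url=$1
  failures=0
  for path in /health /health/ready /health/dependencies; do
    status=$($CURL -s -o /dev/null -w '%{http_code}' --max-time 10 "$base_url$path")
    if [ "$status" = "200" ]; then
      echo "ok   $path"
    else
      echo "FAIL $path (HTTP ${status:-none})"
      failures=$((failures + 1))
    fi
  done
  # Exit status reflects whether every endpoint answered 200.
  [ "$failures" -eq 0 ]
}

# Usage:
#   smoke_test https://connect-api.forma3d.be || echo "smoke test failed"
```

Because the function returns non-zero on any failure, it can gate the promotion step in a pipeline as well as being run by hand.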

4. Semi-Annual — Major Upgrades & Resilience Testing

When: January and July (aligned with Q1 and Q3 quarterly cycles).
Prep: Allocate 2–3 days. Coordinate with stakeholders for potential downtime.

4.1 Major Platform Version Evaluation

Evaluate whether to adopt the next major version of core platforms. Do not upgrade — only evaluate and plan.

| Platform | Evaluate | Decision Criteria |
| --- | --- | --- |
| Node.js | Next LTS (e.g., 22.x) | Ecosystem readiness, dependency support, performance benchmarks |
| NestJS | Next major (e.g., 12.x) | Breaking changes, migration effort, feature value |
| React | Next major (e.g., 20.x) | Breaking changes, ecosystem readiness |
| Nx | Next major (e.g., 23.x) | Migration generators available, breaking changes |
| TypeScript | Next major (e.g., 6.x) | Breaking changes, new features needed |
| Prisma | Next major (e.g., 6.x) | Migration path, breaking schema changes |
| PostgreSQL | Next major (e.g., 17.x) | DigitalOcean availability, migration path |

  • For each platform, review release notes and migration guides
  • Document evaluation in an ADR if upgrading
  • Schedule major upgrades in the subsequent quarterly cycle

4.2 Load Testing & Performance Baseline

# Run k6 load tests against staging
pnpm nx run acceptance-tests:load-test

  • Run load tests and compare against previous baseline
  • Investigate any performance regressions > 10%
  • Update baseline metrics in load test configuration
  • Review connection pool sizing under load
  • Review Redis memory usage under load
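
The 10% regression threshold is easy to misjudge by eye, so a small calculation helps. The sketch below assumes a latency-style metric where higher is worse (for example a p95 response time in milliseconds); the function names are hypothetical, and for throughput-style metrics the sign would need to be flipped.

```shell
#!/bin/sh
# Percentage change of a latency-style metric against its baseline.
# Positive output means the current run is slower than the baseline.
regression_pct() {
  awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.1f\n", (cur - base) * 100 / base }'
}

# Succeeds when the change exceeds the threshold in percent (default 10).
is_regression() {
  awk -v base="$1" -v cur="$2" -v limit="${3:-10}" \
    'BEGIN { exit !((cur - base) * 100 / base > limit) }'
}

# Usage (baseline 200 ms, current 230 ms):
#   regression_pct 200 230                 # prints 15.0
#   is_regression 200 230 && echo "investigate"
```

Keeping the threshold as an optional third argument lets stricter limits be applied to individual endpoints without changing the default.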

4.3 Disaster Recovery Drill

Reference: disaster-recovery-research.md

  • Test database backup restoration to a fresh instance
  • Test service recovery from complete container failure
  • Verify Traefik re-obtains TLS certificates after volume loss
  • Test Redis data loss scenario — verify BullMQ queue recovery
  • Document recovery times and update DR procedures if needed

4.4 Infrastructure Right-Sizing

  • Review DigitalOcean Droplet CPU/memory utilization (last 6 months)
  • Review database connection count and query latency trends
  • Review Redis memory usage trends
  • Review ClickHouse disk usage and query performance
  • Evaluate if instance sizes need scaling up or down
  • Review DigitalOcean spend and optimize if possible

4.5 Full Architecture Review

  • Review technical debt register (docs/04-development/techdebt/technical-debt-register.md)
  • Evaluate new technical debt items discovered since last review
  • Assess whether architectural patterns are still appropriate
  • Review inter-service communication patterns for bottlenecks
  • Evaluate new tools or services that could improve the stack

4.6 SSH Key Rotation (Annual, Aligned with January)

  • Rotate staging Droplet SSH key
  • Rotate build agent Droplet SSH key
  • Update authorized_keys on all servers
  • Update SSH key references in Azure DevOps
  • Verify CI/CD pipeline can still deploy

5. Emergency / Out-of-Band Maintenance

Not everything fits a schedule. These situations warrant immediate action.

| Trigger | Response | SLA |
| --- | --- | --- |
| Critical CVE in direct dependency | Patch, test, deploy | 24 hours |
| Grype critical CVE finding | Triage, patch if exploitable | 48 hours |
| Compromised credential | Rotate immediately, audit access logs | Immediate |
| Provider-forced upgrade (e.g., DO PostgreSQL EOL) | Plan and execute migration | Per provider timeline |
| TLS certificate failure | Debug Traefik ACME, manual cert if needed | 1 hour |
| Node.js security release | Update Dockerfiles, rebuild, deploy | 72 hours |

6. Maintenance Calendar Template

Use this as a starting point. Adapt dates to your team's schedule.

Year at a Glance

| Month | Cadence | Key Activities |
| --- | --- | --- |
| January | Quarterly + Semi-annual | Full dependency update, security review, load test, DR drill, SSH key rotation |
| February | Monthly | Security patches, monitoring review |
| March | Monthly | Security patches, monitoring review |
| April | Quarterly | Full dependency update, security review, infra updates, key rotation |
| May | Monthly | Security patches, monitoring review |
| June | Monthly | Security patches, monitoring review |
| July | Quarterly + Semi-annual | Full dependency update, security review, load test, DR drill, platform evaluation |
| August | Monthly | Security patches, monitoring review |
| September | Monthly | Security patches, monitoring review |
| October | Quarterly | Full dependency update, security review, infra updates, key rotation |
| November | Monthly | Security patches, monitoring review |
| December | Monthly (light) | Security patches only; avoid major changes during holiday period |
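
If the calendar above is used to drive reminders or pipeline schedules, the month-to-cadence mapping can be encoded directly. This is an illustrative sketch; cadence_for_month is a hypothetical helper that simply mirrors the table.

```shell
#!/bin/sh
# Map a month number (1-12, with or without a leading zero) to the cadence
# listed in the "Year at a Glance" calendar.
cadence_for_month() {
  case ${1#0} in
    1|7)            echo "Quarterly + Semi-annual" ;;
    4|10)           echo "Quarterly" ;;
    12)             echo "Monthly (light)" ;;
    2|3|5|6|8|9|11) echo "Monthly" ;;
    *)              echo "invalid month: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   cadence_for_month "$(date +%m)"
```

If the team adapts the calendar dates, the case arms are the only part that needs to change.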

Suggested Maintenance Windows

| Environment | Window | Duration |
| --- | --- | --- |
| Staging | Anytime (non-customer-facing) | As needed |
| Production (non-breaking) | Tuesday or Wednesday, 06:00–08:00 CET | 2 hours |
| Production (breaking/migration) | Saturday, 06:00–10:00 CET | 4 hours |

7. Rollback Procedures

If a maintenance update causes issues in production:

Application Rollback

# SSH into the affected server (staging host shown; substitute the production host as needed)
ssh deploy@staging-server

# Roll back to previous image version
cd /opt/forma3d
docker compose pull  # if reverting image tags in docker-compose.yml
docker compose up -d --force-recreate <service-name>

# Or roll back via CI/CD by re-deploying a previous successful build

Database Rollback

Prisma migrations are forward-only. If a migration causes issues:

  1. Restore database from the pre-maintenance backup
  2. Redeploy the previous application version
  3. Investigate the migration issue on staging

Infrastructure Rollback

For infrastructure service updates (Redis, ClickHouse, Traefik, Grafana):

  1. Revert docker-compose.yml to previous version tags
  2. Run docker compose up -d --force-recreate <service-name>
  3. Verify service health

8. Maintenance Log

Record every maintenance cycle for audit trail and trend analysis.

| Date | Cadence | Operator | Activities | Issues Found | Notes |
| --- | --- | --- | --- | --- | --- |
| 2026-02-21 | | | Runbook created | | Initial version |

Quick Reference — Commands

# ─── Dependency Health ───
pnpm audit                              # Vulnerability scan
pnpm outdated                           # Outdated packages
pnpm list --depth=0                     # Current dependency tree
pnpm update --recursive --latest        # Update all packages to newest versions (ignores package.json ranges)

# ─── Nx Workspace ───
pnpm nx run-many -t lint                # Lint all projects
pnpm nx run-many -t test                # Test all projects
pnpm nx run-many -t build               # Build all projects
pnpm nx migrate latest                  # Update Nx in package.json and generate migrations.json

# ─── Docker / Infrastructure ───
docker compose pull                     # Pull latest images
docker compose up -d --force-recreate   # Recreate containers
docker system prune -af                 # Remove all unused images, stopped containers, networks, and build cache
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.CreatedAt}}" # List image ages

# ─── TLS / Certificates ───
echo | openssl s_client -connect connect-api.forma3d.be:443 2>/dev/null | openssl x509 -noout -dates
echo | openssl s_client -connect connect.forma3d.be:443 2>/dev/null | openssl x509 -noout -dates

# ─── Database ───
# Check PostgreSQL version (via psql or DigitalOcean dashboard)
# Verify backup status in DigitalOcean dashboard

# ─── Redis ───
redis-cli INFO server | grep redis_version
redis-cli INFO memory | grep used_memory_human

# ─── Health Checks ───
curl -s https://connect-api.forma3d.be/health | jq .
curl -s https://connect-api.forma3d.be/health/ready | jq .
curl -s https://connect-api.forma3d.be/health/dependencies | jq .

Revision History:

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2026-02-21 | Maintenance Planning | Initial runbook |