Scheduled Maintenance Runbook¶
Version: 1.0
Date: February 21, 2026
Status: Active
This runbook defines the recurring maintenance cycles for Forma3D.Connect. It covers dependency updates, platform upgrades, infrastructure patching, security reviews, and operational housekeeping.
Related documents:
- Security Audit Checklist — detailed security verification steps
- Keys & Certificates Inventory — credential lifespans and rotation procedures
- Operations Runbook — incident response and day-to-day operations
- Troubleshooting Guide — common issues and resolutions
Maintenance Cadences¶
Not every component needs the same update frequency. The table below defines four cadences based on risk, velocity, and operational impact.
| Cadence | Frequency | Focus | Typical Duration |
|---|---|---|---|
| Continuous | Ongoing / weekly | Automated scanning, alert triage | 30 min / week |
| Monthly | 1st week of each month | Security patches, minor bumps, monitoring review | 2–4 hours |
| Quarterly | Jan, Apr, Jul, Oct | Dependency upgrades, infra updates, key rotation, full security review | 1–2 days |
| Semi-annual | Jan, Jul | Major platform upgrades, load testing, DR drills, architecture review | 2–3 days |
Choosing the Right Cadence¶
| Signal | Action |
|---|---|
| Critical CVE in a direct dependency | Patch immediately (out-of-band) |
| Grype flags a high-severity CVE | Triage within 48 hours |
| New major version of Node.js / NestJS / React | Evaluate at semi-annual, adopt at next quarterly |
| Minor / patch version of any dependency | Bundle into monthly cycle |
| Let's Encrypt certificate nearing expiry | Traefik auto-renews; verify monthly |
| Database CA certificate nearing expiry | Plan rotation at quarterly; execute in maintenance window |
1. Continuous — Automated Scanning & Alert Triage¶
These tasks run automatically or require only brief weekly attention.
1.1 Grype CVE Scan Results¶
- Review Grype scan results in recent pipeline runs for new critical / high CVEs
- Triage new container image vulnerabilities (update `.grype.yaml` exclusions for false positives, create tickets for real issues)
- Verify SBOM freshness (Syft): regenerated on last CI run
- Check that `.grype.yaml` exclusions are still warranted
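The weekly triage pass can be scripted as a thin wrapper around Grype. This is a minimal sketch, assuming Grype is installed where it runs and `.grype.yaml` sits in the working directory. `scan_images` and the image names in the usage note are hypothetical and not taken from the project's pipeline.

```shell
# Scan each image passed as an argument and collect the ones that fail.
# GRYPE can be overridden, for example to point at a specific binary.
GRYPE="${GRYPE:-grype}"

scan_images() {
  failed=""
  for image in "$@"; do
    # --fail-on high makes Grype exit non-zero when a finding of severity
    # high or above remains after the .grype.yaml exclusions are applied.
    if ! "$GRYPE" "$image" --fail-on high --config .grype.yaml >/dev/null; then
      failed="$failed $image"
    fi
  done
  if [ -n "$failed" ]; then
    echo "Needs triage:$failed"
    return 1
  fi
  echo "No high or critical findings"
}
```

Usage might look like `scan_images registry.example/api:latest registry.example/web:latest`. A scan that errors out for another reason (for example, a missing config file) is also reported as needing triage, which errs on the side of a human look.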
1.2 Sentry Error Monitoring¶
- Review new unresolved issues
- Check for regressions in resolved issues
- Verify no sensitive data leaking into breadcrumbs or contexts
1.3 Uptime & Health¶
- Confirm Uptime Kuma shows no prolonged outages
- Spot-check the `/health`, `/health/ready`, and `/health/dependencies` endpoints
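The spot-check can be run as one command. This is a sketch that assumes a healthy endpoint answers with HTTP 200, which the runbook does not state. `check_health` is a hypothetical helper name.

```shell
# Print the HTTP status of each health endpoint and return non-zero if any
# endpoint answers with something other than 200.
check_health() {
  base_url="$1"
  rc=0
  for path in /health /health/ready /health/dependencies; do
    # -w '%{http_code}' prints only the status code; the body is discarded.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "${base_url}${path}")
    echo "${path} ${code}"
    [ "$code" = "200" ] || rc=1
  done
  return "$rc"
}
```

For example: `check_health https://connect-api.forma3d.be`. A connection failure shows up as status `000`.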
1.4 CI/CD Pipeline¶
- Review pipeline run history for flaky tests or timeouts
- Confirm Cosign signing is still succeeding on image pushes
2. Monthly — Security Patches & Monitoring Review¶
When: First working week of each month.
Prep: Create a maintenance/YYYY-MM branch.
2.1 Dependency Security Patches¶
# Check for known vulnerabilities
pnpm audit
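# List outdated packages across the workspace (apply only patch-level bumps this cycle)
pnpm outdated --recursive
# Update within the semver ranges declared in package.json (omit --latest here:
# it ignores those ranges and can pull in new major versions)
pnpm update --recursive # review diff before committing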
- Run `pnpm audit` and resolve critical and high findings
- Apply patch-level dependency updates across the monorepo
- Run full test suite: `pnpm nx run-many -t lint,test,build`
- Deploy to staging and run acceptance tests
- Merge maintenance branch after green pipeline
2.2 Container Image Patches¶
Alpine base images receive security patches regularly. Rebuilding images pulls in the latest apk packages.
- Rebuild all Docker images (re-runs `apk upgrade` in the build stage, provided the build does not reuse a cached layer)
- Verify images still pass health checks on staging
- Confirm Cosign signatures are applied to rebuilt images
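Signature checks on rebuilt images can be scripted. The sketch below assumes key-based signing with a public key file available locally. The `cosign.pub` default, the helper name `verify_signatures`, and the image references are placeholders.

```shell
# Path to the Cosign public key (assumed location; adjust to the real one).
COSIGN_PUBLIC_KEY="${COSIGN_PUBLIC_KEY:-cosign.pub}"

# Verify the Cosign signature of each image reference given as an argument.
verify_signatures() {
  rc=0
  for image in "$@"; do
    if cosign verify --key "$COSIGN_PUBLIC_KEY" "$image" >/dev/null 2>&1; then
      echo "signed   $image"
    else
      echo "UNSIGNED $image"
      rc=1
    fi
  done
  return "$rc"
}
```

The function returns non-zero as soon as one image fails verification, so it can gate a deployment step.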
2.3 Monitoring & Observability Review¶
- Review Grafana dashboards for anomalies in the last 30 days
- Check ClickHouse log retention — verify old logs are archiving to DigitalOcean Spaces
- Review OpenTelemetry Collector health (`otel-collector` container logs)
- Verify Sentry quota usage and rate limits
2.4 TLS Certificates¶
Let's Encrypt certificates are auto-renewed by Traefik, but verification is prudent.
# Verify certificate validity
echo | openssl s_client -connect connect-api.forma3d.be:443 2>/dev/null \
| openssl x509 -noout -dates
echo | openssl s_client -connect connect.forma3d.be:443 2>/dev/null \
| openssl x509 -noout -dates
- Confirm certificates have > 30 days remaining
- If renewal is failing, check Traefik ACME logs and DNS configuration
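The 30-day threshold can be checked without reading dates by eye. This sketch uses OpenSSL's `-checkend` option, which exits non-zero when a certificate expires within the given number of seconds. `cert_valid_for_days` is a hypothetical helper name.

```shell
# Succeed only if the certificate served by HOST stays valid for at least
# DAYS more days (default 30, matching the checklist threshold).
cert_valid_for_days() {
  host="$1"
  days="${2:-30}"
  seconds=$(( days * 86400 ))
  # s_client fetches the served certificate; -servername sends SNI so the
  # proxy returns the certificate for this host name.
  openssl s_client -connect "${host}:443" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -checkend "$seconds" >/dev/null
}
```

For example: `cert_valid_for_days connect.forma3d.be 30 || echo "renewal needs attention"`. A failed connection also yields a non-zero exit, because no certificate reaches the second command.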
2.5 BullMQ Queue Health¶
- Check for stuck or stalled jobs across all queues
- Review failed job counts — investigate recurring failures
- Verify Redis memory usage is within acceptable bounds
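Queue depth can be read straight from Redis. The sketch below assumes BullMQ's default `bull` key prefix and a key layout of lists for waiting and active jobs and sorted sets for delayed and failed jobs. Verify both against the installed BullMQ version before relying on the numbers. The queue name `orders` in the usage note is hypothetical.

```shell
# Command used to reach Redis; override to add host or auth options.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# Print job counts per state for one queue.
queue_counts() {
  queue="$1"
  prefix="${2:-bull}"          # assumed default BullMQ key prefix
  key="${prefix}:${queue}"
  echo "waiting $($REDIS_CLI LLEN "${key}:wait")"
  echo "active $($REDIS_CLI LLEN "${key}:active")"
  echo "delayed $($REDIS_CLI ZCARD "${key}:delayed")"
  echo "failed $($REDIS_CLI ZCARD "${key}:failed")"
}
```

Running `queue_counts orders` once per queue gives a quick view of backlog and failures. A steadily growing `failed` count is the signal to investigate recurring failures.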
3. Quarterly — Full Dependency & Infrastructure Update¶
When: First two weeks of January, April, July, October.
Prep: Schedule a maintenance window. Communicate to stakeholders if downtime is expected.
3.1 Pre-Flight¶
- Review this runbook for any process changes since last quarter
- Create a `maintenance/YYYY-QN` branch
- Back up the production database (verify backup is restorable)
- Export current dependency tree: `pnpm list --depth=0 > deps-before.txt`
- Note current versions of all infrastructure components
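The `maintenance/YYYY-QN` branch name can be derived from the date rather than typed by hand. This is a small sketch. `quarterly_branch` is a hypothetical helper name.

```shell
# Build the quarterly branch name (maintenance/YYYY-QN) from an ISO date.
# Without an argument, today's date is used.
quarterly_branch() {
  iso_date="${1:-$(date +%Y-%m-%d)}"
  year="${iso_date%%-*}"
  rest="${iso_date#*-}"
  month="${rest%%-*}"
  month="${month#0}"   # drop a leading zero so 08 and 09 are not read as octal
  quarter=$(( ($month - 1) / 3 + 1 ))
  echo "maintenance/${year}-Q${quarter}"
}
```

For example, `git switch -c "$(quarterly_branch)"` creates the branch for the current quarter, and `quarterly_branch 2026-02-21` prints `maintenance/2026-Q1`.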
3.2 Full Security Review (Grype + SonarCloud)¶
Run the full security audit checklist from security-audit-checklist.md.
- Complete all sections of the security audit checklist
- Review Grype CVE scan results across all container images
- Review SonarCloud security hotspots
- Address any critical or high findings before proceeding
- Document exceptions with justification (update `.grype.yaml` if needed)
- Update the Audit Log table in the checklist
3.3 npm / pnpm Dependency Updates¶
3.3.1 Frontend Dependencies¶
| Package | Current | Update Strategy | Risk |
|---|---|---|---|
| `react`, `react-dom` | 19.x | Patch/minor within 19.x | Low |
| `react-router-dom` | 6.x | Minor updates | Low |
| `@tanstack/react-query` | 5.x | Minor updates | Low |
| `tailwindcss` | 4.x | Minor updates, test visual regressions | Medium |
| `vite` | 7.x | Minor updates | Low |
| `socket.io-client` | 4.x | Match server version | Medium |
3.3.2 Backend Dependencies¶
| Package | Current | Update Strategy | Risk |
|---|---|---|---|
| `@nestjs/*` | 11.x | Minor updates within 11.x | Medium |
| `express` | 5.x | Patch only | Low |
| `@prisma/client`, `prisma` | 5.x | Minor updates, test migrations | Medium |
| `bullmq` | 5.x | Minor updates | Low |
| `ioredis` | 5.x | Minor updates | Low |
| `socket.io` | 4.x | Match client version | Medium |
| `helmet` | 8.x | Minor updates | Low |
| `pino` | 10.x | Minor updates | Low |
3.3.3 Observability Dependencies¶
| Package | Current | Update Strategy | Risk |
|---|---|---|---|
| `@sentry/*` | Latest | Follow Sentry release notes | Medium |
| `@opentelemetry/*` | Latest | Update as a group | Medium |
3.3.4 DevDependencies¶
| Package | Current | Update Strategy | Risk |
|---|---|---|---|
| `nx` | 22.x | Minor updates, check migration guide | Medium |
| `typescript` | 5.x | Minor updates, run full type check | Medium |
| `eslint` | 9.x | Minor updates with config review | Low |
| `vitest` | 4.x | Minor updates | Low |
| `jest` | 30.x | Minor updates | Low |
| `playwright` | 1.x | Minor updates, re-record if needed | Medium |
3.3.5 Update Procedure¶
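# List outdated packages across the workspace (read-only report)
pnpm outdated --recursive
# Update to the latest published versions. --latest ignores package.json ranges
# and can cross major versions, so revert any bump the strategy tables rule out
pnpm update --recursive --latest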
# Compare with pre-update snapshot
pnpm list --depth=0 > deps-after.txt
diff deps-before.txt deps-after.txt
# Full validation
pnpm install
pnpm nx run-many -t lint
pnpm nx run-many -t test
pnpm nx run-many -t build
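# If Nx needs updating, use its migration tooling
pnpm nx migrate latest # updates package.json and writes migrations.json
pnpm install # install the versions nx migrate selected
pnpm nx migrate --run-migrations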
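Because `pnpm update --recursive --latest` can move a package across a major version, it helps to flag such bumps when reviewing the diff of `deps-before.txt` and `deps-after.txt`. The helpers below are a hypothetical sketch that compares only the leading version component.

```shell
# Extract the major component from a version such as 19.1.0 or v3.0.
major_of() {
  version="${1#v}"        # tolerate a leading "v"
  echo "${version%%.*}"   # keep everything before the first dot
}

# Succeed when moving from $1 to $2 changes the major version.
is_major_bump() {
  [ "$(major_of "$1")" != "$(major_of "$2")" ]
}
```

For example, `is_major_bump 5.22.0 6.0.1 && echo "hold for semi-annual evaluation"` prints the message, while a move from `19.1.0` to `19.2.3` does not. The version numbers are illustrative only.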
3.4 Platform Runtime Updates¶
Node.js¶
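The stack runs on the Node.js 20.x LTS line. Patch updates within 20.x are low-risk. Confirm the line's support status against the official Node.js release schedule each quarter, because LTS lines move to maintenance and then end-of-life on fixed dates.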
- Check current Node.js 20.x latest patch: https://nodejs.org/en/download/
- Update `engines.node` in root `package.json` if pinning a minimum patch
- Update `FROM node:20-alpine` in all Dockerfiles to pull latest patch
- Update `nodeVersion: '20.x'` in Azure Pipelines if changing major/minor
- Rebuild and test all services

Evaluating a Node.js major version upgrade (e.g., 20 → 22):
- Only at semi-annual cadence
- Verify all dependencies support the new version
- Test on a feature branch first
- Update Dockerfiles, CI/CD, and engines field together
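Before changing the Node.js version, it helps to list every Dockerfile that pins a Node base image so none is missed. This is a sketch. `list_node_base_images` is a hypothetical helper, and it only matches plain `FROM node:` lines, not lines with extra flags such as `--platform`.

```shell
# List every Dockerfile line that pins a Node.js base image.
# Output format is path:line, as printed by grep -H.
list_node_base_images() {
  root="${1:-.}"
  find "$root" -type f -name 'Dockerfile*' ! -path '*/node_modules/*' \
    -exec grep -H -E '^FROM[[:space:]]+node:' {} +
}
```

Run it from the repository root with `list_node_base_images .` and compare the tags against `engines.node` and the pipeline's `nodeVersion`.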
pnpm¶
- Check latest pnpm 9.x release: https://pnpm.io/installation
- Update `packageManager` field in root `package.json`
- Regenerate lockfile: `pnpm install`
- Verify CI/CD pnpm version matches
3.5 Container Base Images¶
| Image | Used By | Current | Update Strategy |
|---|---|---|---|
| `node:20-alpine` | All backend services | 20-alpine | Rebuild to pull latest Alpine patches |
| `nginx:alpine` | Web app | alpine | Rebuild to pull latest |
| `redis:7-alpine` | Redis | 7-alpine | Pin minor, update on quarterly |
| `clickhouse/clickhouse-server` | ClickHouse | 24.12-alpine | Evaluate new minor releases |
| `traefik` | Reverse proxy | v3.0 | Evaluate new minor releases |
| `grafana/grafana` | Grafana | 11.5.0 | Update to latest 11.x |
| `otel/opentelemetry-collector` | OTel Collector | 0.120.0 | Update to latest |
# Pull latest base images on build server
docker pull node:20-alpine
docker pull nginx:alpine
docker pull redis:7-alpine
# Rebuild all application images
# (handled by CI/CD pipeline on merge)
- Update pinned versions in `docker-compose.yml` for infrastructure services
- Test updated infrastructure services on staging before production
- Verify inter-service compatibility after updates
3.6 Infrastructure Services¶
PostgreSQL (DigitalOcean Managed)¶
- Check current PostgreSQL version in DigitalOcean dashboard
- Review DigitalOcean maintenance window schedule
- Apply pending managed database patches if available
- Verify connection pool settings are still appropriate
- Review slow query logs for optimization opportunities
- Confirm database backups are running and restorable
Redis¶
- Check Redis 7.x latest patch release
- Update Redis image tag in `docker-compose.yml` if needed
- Verify `maxmemory` and eviction policy settings
- Review memory usage trends
ClickHouse¶
- Check for new ClickHouse releases
- Review log retention and archival policies
- Verify disk usage is within bounds
- Test any schema migrations on staging first
Traefik¶
- Check for Traefik v3.x updates
- Review Traefik access logs for anomalies
- Verify middleware configurations are still appropriate
- Test updated Traefik on staging before production
3.7 Key & Certificate Rotation¶
Cross-reference with keys-certificates-inventory.md.
- Database password: rotate quarterly (update `DATABASE_URL` in Azure DevOps and in all services)
- Database CA certificate: check expiry, rotate if < 90 days remaining
- SSH keys: review access, rotate annually
- API keys (Shopify, SimplyPrint, Sendcloud): verify still valid, rotate if compromised
- Cosign signing key: verify key is accessible to CI/CD
- Container registry token: verify still valid
- Session secret: rotate if desired (this invalidates active sessions)
- Update "Last Renewed" dates in keys-certificates-inventory.md
3.8 CI/CD Pipeline Maintenance¶
- Review Azure DevOps agent pool health (self-hosted `DO-Build-Agents`)
- Update self-hosted agent OS packages: `apt update && apt upgrade`
- Clean up old Docker images and build cache on build agent
- Review pipeline YAML for deprecated tasks or actions
- Verify variable groups and secrets are current
- Review pipeline duration trends — investigate slowdowns
3.9 Documentation & Housekeeping¶
- Update this runbook if procedures have changed
- Update `TIMELINE.md` with maintenance activities
- Update `changelog.md` with dependency version changes
- Review and close stale Azure DevOps work items
- Archive old ClickHouse logs if not handled by automated retention
- Clean up unused Docker images in DigitalOcean Container Registry
3.10 Post-Quarterly Validation¶
- Deploy updated stack to staging
- Run full acceptance test suite (`pnpm nx run acceptance-tests:e2e`)
- Run smoke tests against all service health endpoints
- Verify Shopify webhook processing end-to-end
- Verify SimplyPrint integration end-to-end
- Verify Sendcloud integration end-to-end
- Monitor staging for 24–48 hours before promoting to production
- Deploy to production
- Monitor production for anomalies for 48 hours post-deploy
4. Semi-Annual — Major Upgrades & Resilience Testing¶
When: January and July (aligned with Q1 and Q3 quarterly cycles).
Prep: Allocate 2–3 days. Coordinate with stakeholders for potential downtime.
4.1 Major Platform Version Evaluation¶
Evaluate whether to adopt the next major version of core platforms. Do not upgrade — only evaluate and plan.
| Platform | Evaluate | Decision Criteria |
|---|---|---|
| Node.js | Next LTS (e.g., 22.x) | Ecosystem readiness, dependency support, performance benchmarks |
| NestJS | Next major (e.g., 12.x) | Breaking changes, migration effort, feature value |
| React | Next major (e.g., 20.x) | Breaking changes, ecosystem readiness |
| Nx | Next major (e.g., 23.x) | Migration generators available, breaking changes |
| TypeScript | Next major (e.g., 6.x) | Breaking changes, new features needed |
| Prisma | Next major (e.g., 6.x) | Migration path, breaking schema changes |
| PostgreSQL | Next major (e.g., 17.x) | DigitalOcean availability, migration path |
- For each platform, review release notes and migration guides
- Document evaluation in an ADR if upgrading
- Schedule major upgrades in the subsequent quarterly cycle
4.2 Load Testing & Performance Baseline¶
# Run k6 load tests against staging
pnpm nx run acceptance-tests:load-test
- Run load tests and compare against previous baseline
- Investigate any performance regressions > 10%
- Update baseline metrics in load test configuration
- Review connection pool sizing under load
- Review Redis memory usage under load
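The 10% regression rule can be applied mechanically once a baseline and a current figure are known. The sketch below assumes a latency-style metric where higher is worse. The helper names and the sample numbers are hypothetical.

```shell
# Percentage change of a metric against its baseline (positive means slower).
regression_pct() {
  LC_ALL=C awk -v base="$1" -v cur="$2" \
    'BEGIN { printf "%.1f\n", (cur - base) / base * 100 }'
}

# Succeed when the regression exceeds the threshold (default 10 percent).
is_regression() {
  pct=$(regression_pct "$1" "$2")
  LC_ALL=C awk -v pct="$pct" -v limit="${3:-10}" 'BEGIN { exit !(pct > limit) }'
}
```

With a baseline of 200 ms and a current p95 of 230 ms, `regression_pct 200 230` prints `15.0` and `is_regression 200 230` succeeds, so the run would be flagged for investigation. A change of exactly 10% is not flagged, matching the "> 10%" wording.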
4.3 Disaster Recovery Drill¶
Reference: disaster-recovery-research.md
- Test database backup restoration to a fresh instance
- Test service recovery from complete container failure
- Verify Traefik re-obtains TLS certificates after volume loss
- Test Redis data loss scenario — verify BullMQ queue recovery
- Document recovery times and update DR procedures if needed
4.4 Infrastructure Right-Sizing¶
- Review DigitalOcean Droplet CPU/memory utilization (last 6 months)
- Review database connection count and query latency trends
- Review Redis memory usage trends
- Review ClickHouse disk usage and query performance
- Evaluate if instance sizes need scaling up or down
- Review DigitalOcean spend and optimize if possible
4.5 Full Architecture Review¶
- Review technical debt register (`docs/04-development/techdebt/technical-debt-register.md`)
- Evaluate new technical debt items discovered since last review
- Assess whether architectural patterns are still appropriate
- Review inter-service communication patterns for bottlenecks
- Evaluate new tools or services that could improve the stack
4.6 SSH Key Rotation (Annual, Aligned with January)¶
- Rotate staging Droplet SSH key
- Rotate build agent Droplet SSH key
- Update authorized_keys on all servers
- Update SSH key references in Azure DevOps
- Verify CI/CD pipeline can still deploy
5. Emergency / Out-of-Band Maintenance¶
Not everything fits a schedule. These situations warrant immediate action.
| Trigger | Response | SLA |
|---|---|---|
| Critical CVE in direct dependency | Patch, test, deploy | 24 hours |
| Grype critical CVE finding | Triage, patch if exploitable | 48 hours |
| Compromised credential | Rotate immediately, audit access logs | Immediate |
| Provider-forced upgrade (e.g., DO PostgreSQL EOL) | Plan and execute migration | Per provider timeline |
| TLS certificate failure | Debug Traefik ACME, manual cert if needed | 1 hour |
| Node.js security release | Update Dockerfiles, rebuild, deploy | 72 hours |
6. Maintenance Calendar Template¶
Use this as a starting point. Adapt dates to your team's schedule.
Year at a Glance¶
| Month | Cadence | Key Activities |
|---|---|---|
| January | Quarterly + Semi-annual | Full dependency update, security review, load test, DR drill, SSH key rotation |
| February | Monthly | Security patches, monitoring review |
| March | Monthly | Security patches, monitoring review |
| April | Quarterly | Full dependency update, security review, infra updates, key rotation |
| May | Monthly | Security patches, monitoring review |
| June | Monthly | Security patches, monitoring review |
| July | Quarterly + Semi-annual | Full dependency update, security review, load test, DR drill, platform evaluation |
| August | Monthly | Security patches, monitoring review |
| September | Monthly | Security patches, monitoring review |
| October | Quarterly | Full dependency update, security review, infra updates, key rotation |
| November | Monthly | Security patches, monitoring review |
| December | Monthly (light) | Security patches only — avoid major changes during holiday period |
Suggested Maintenance Windows¶
| Environment | Window | Duration |
|---|---|---|
| Staging | Anytime (non-customer-facing) | As needed |
| Production (non-breaking) | Tuesday or Wednesday, 06:00–08:00 CET | 2 hours |
| Production (breaking/migration) | Saturday, 06:00–10:00 CET | 4 hours |
7. Rollback Procedures¶
If a maintenance update causes issues in production:
Application Rollback¶
# SSH into staging/production server
ssh deploy@staging-server
# Roll back to previous image version
cd /opt/forma3d
docker compose pull # if reverting image tags in docker-compose.yml
docker compose up -d --force-recreate <service-name>
# Or roll back via CI/CD by re-deploying a previous successful build
Database Rollback¶
Prisma migrations are forward-only. If a migration causes issues:
- Restore database from the pre-maintenance backup
- Redeploy the previous application version
- Investigate the migration issue on staging
Infrastructure Rollback¶
For infrastructure service updates (Redis, ClickHouse, Traefik, Grafana):
- Revert `docker-compose.yml` to previous version tags
- Run `docker compose up -d --force-recreate <service-name>`
- Verify service health
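The infrastructure rollback steps can be wrapped in one helper. This is a sketch only: it assumes `docker-compose.yml` is tracked in Git in the deployment directory, which this runbook does not state. `rollback_service` is a hypothetical name.

```shell
# Restore docker-compose.yml from a Git ref, then recreate one service.
rollback_service() {
  service="$1"
  ref="${2:-HEAD~1}"   # assumption: the previous commit holds the old tags
  git checkout "$ref" -- docker-compose.yml || return 1
  docker compose pull "$service" || return 1
  docker compose up -d --force-recreate "$service" || return 1
  docker compose ps "$service"   # show container state for a quick check
}
```

For example, `rollback_service redis` restores the compose file from the previous commit and recreates only the Redis container. Health still needs to be confirmed afterwards through the usual endpoints and dashboards.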
8. Maintenance Log¶
Record every maintenance cycle for audit trail and trend analysis.
| Date | Cadence | Operator | Activities | Issues Found | Notes |
|---|---|---|---|---|---|
| 2026-02-21 | — | — | Runbook created | — | Initial version |
Quick Reference — Commands¶
# ─── Dependency Health ───
pnpm audit # Vulnerability scan
pnpm outdated # Outdated packages
pnpm list --depth=0 # Current dependency tree
pnpm update --recursive --latest # Update all packages to latest (ignores package.json ranges)
# ─── Nx Workspace ───
pnpm nx run-many -t lint # Lint all projects
pnpm nx run-many -t test # Test all projects
pnpm nx run-many -t build # Build all projects
pnpm nx migrate latest # Update Nx in package.json and generate migrations.json
# ─── Docker / Infrastructure ───
docker compose pull # Pull latest images
docker compose up -d --force-recreate # Recreate containers
docker system prune -af # Remove all unused images, stopped containers, networks, and build cache
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.CreatedAt}}" # List image ages
# ─── TLS / Certificates ───
echo | openssl s_client -connect connect-api.forma3d.be:443 2>/dev/null | openssl x509 -noout -dates
echo | openssl s_client -connect connect.forma3d.be:443 2>/dev/null | openssl x509 -noout -dates
# ─── Database ───
# Check PostgreSQL version (via psql or DigitalOcean dashboard)
# Verify backup status in DigitalOcean dashboard
# ─── Redis ───
redis-cli INFO server | grep redis_version
redis-cli INFO memory | grep used_memory_human
# ─── Health Checks ───
curl -s https://connect-api.forma3d.be/health | jq .
curl -s https://connect-api.forma3d.be/health/ready | jq .
curl -s https://connect-api.forma3d.be/health/dependencies | jq .
Revision History:
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-02-21 | Maintenance Planning | Initial runbook |