AI Prompt: Forma3D.Connect — Scaling Preparations (Docker Compose → Kubernetes)¶
Purpose: Prepare the current single-Droplet Docker Compose deployment for a seamless future migration to Kubernetes on DigitalOcean, without adding unnecessary complexity now
Estimated Effort: 14–19 hours
Prerequisites: Staging deployment operational on a single DigitalOcean Droplet with Docker Compose + Traefik
Output: Externalized configuration, container registry strategy, architectural guardrails that make a future Kubernetes migration a straightforward lift-and-shift, and a local Rancher Desktop + Tilt development environment for production-parity "clone and tilt up" workflow
Status: 🚧 TODO
🎯 Mission¶
Prepare the Forma3D.Connect infrastructure for a seamless future migration from Docker Compose on a single Droplet to DigitalOcean Managed Kubernetes (DOKS). The goal is to make changes now — while the system is simple — that will pay off when multi-tenancy drives the need for horizontal scaling.
This is NOT a Kubernetes migration. This prompt makes the Docker Compose deployment "Kubernetes-ready" by:
- DNS strategy with low TTL — Prepare DNS records for a fast cut-over to a future DO Load Balancer IP by lowering TTLs now
- Container registry strategy — Ensure all images are pulled from DigitalOcean Container Registry (DOCR) with proper tagging
- Configuration externalization — Move all configuration to environment variables and `.env` files so they map cleanly to Kubernetes ConfigMaps and Secrets
- Health check standardization — Ensure all services expose HTTP health endpoints that work identically as Kubernetes liveness/readiness probes
- Stateless service design — Verify all services are stateless (no local file storage, no in-memory sessions without Redis backing)
- Graceful shutdown — Ensure all services handle SIGTERM for zero-downtime rolling updates
- Resource awareness — Add resource constraints to Docker Compose that translate directly to Kubernetes resource requests/limits
- DNS and TLS strategy — Plan the DNS/TLS migration path from Traefik to DigitalOcean Load Balancer + cert-manager
- Multi-replica readiness — Validate that all containers can run as multiple replicas behind Traefik with proper load-balancing, sticky sessions, and worker deduplication
- Rolling update strategy — Define and test a zero-downtime rolling update procedure with `start-first` ordering, health gates, and backward-compatible database migrations
- Local development with Rancher Desktop + Tilt — Create a "clone and `tilt up`" developer experience using Rancher Desktop's built-in Kubernetes, Tilt live-update, port-forwarding, and VS Code debug attach
Important note on DigitalOcean Reserved IPs: DO Reserved IPs can only be assigned to Droplets, not to Load Balancers. This means we cannot use a Reserved IP as a stable entry point that gets reassigned from a Droplet to a Load Balancer. Instead, the migration strategy uses DNS-based cut-over: when moving to DOKS, update DNS A records from the Droplet IP to the Load Balancer's stable IP. Setting low TTLs (60s) on DNS records before migration minimizes propagation delay to under a minute.
Why now:
- Changes are cheap when the system is small (6 services + supporting containers)
- Retrofitting these patterns later is expensive and error-prone
- Multi-tenancy (the next major feature) will be the trigger for needing Kubernetes
- Lowering DNS TTLs now means the future DNS cut-over will propagate in under a minute
What stays unchanged:
- Docker Compose remains the deployment mechanism for now
- Traefik remains the reverse proxy for now
- Single Droplet remains the hosting model for now
- No Kubernetes manifests or Helm charts are created in this prompt
📐 Architecture¶
Current State¶
DNS A Records
│
┌────────────────┴────────────────────┐
│ staging-connect.forma3d.be │
│ staging-connect-api.forma3d.be │
│ staging-connect-docs.forma3d.be │
│ staging-connect-events.forma3d.be │
│ staging-connect-db.forma3d.be │
│ staging-connect-logs.forma3d.be │
│ staging-connect-uptime.forma3d.be │
└────────────────┬────────────────────┘
│
▼
┌──────────────────────┐
│ Droplet Public IP │
│ (e.g., 167.x.x.x) │
└──────────┬───────────┘
│
┌──────────┴───────────┐
│ Docker Compose │
│ + Traefik │
│ + All Services │
└──────────────────────┘
Target State (after this prompt)¶
DNS A Records (TTL: 60s)
│
┌────────────────┴────────────────────┐
│ staging-connect.forma3d.be │
│ staging-connect-api.forma3d.be │
│ (all subdomains) │
└────────────────┬────────────────────┘
│
▼
┌──────────────────────┐
│ Droplet Public IP │ ← Same IP, but DNS TTL lowered to 60s
│ (e.g., 167.x.x.x) │ so future cut-over propagates fast
└──────────┬───────────┘
│
┌──────────┴───────────┐
│ Docker Compose │ + Health checks standardized
│ + Traefik │ + Graceful shutdown enabled
│ + All Services │ + Resource constraints added
│ (K8s-ready) │ + Configuration externalized
└──────────────────────┘
Future State (Kubernetes — NOT this prompt)¶
DNS A Records (TTL: 60s)
│
▼
┌──────────────────────┐
│ DO Load Balancer │ ← DNS updated to LB's stable IP
│ (stable IP) │ Propagation: <1 min with 60s TTL
└──────────┬───────────┘
│
┌──────────┴───────────┐
│ DOKS Cluster │
│ ├── Ingress NGINX │
│ ├── Gateway Pod(s) │
│ ├── Order Svc Pod(s)│
│ ├── Print Svc Pod(s)│
│ ├── Ship Svc Pod(s) │
│ ├── GridFlock Pod(s)│
│ ├── Slicer Pod(s) │
│ └── Web Pod(s) │
└──────────────────────┘
Note: DigitalOcean Load Balancers have stable, persistent IP addresses that do not change throughout their lifetime. The migration requires a one-time DNS update from the Droplet IP to the LB IP. With a 60s TTL, this propagates in under a minute.
📋 Implementation Phases¶
Phase 1: DNS Preparation for Future Migration (1 hour)¶
Priority: P0 | Impact: Critical | Dependencies: None
Prepare DNS records so that a future migration to DOKS + Load Balancer can be done with minimal disruption. The key insight: DigitalOcean Reserved IPs can only be assigned to Droplets, not to Load Balancers. Therefore, we cannot use a Reserved IP as a stable entry point that moves between Droplet and LB. Instead, the migration strategy relies on low-TTL DNS cut-over.
Why NOT a Reserved IP for this use case:
- DO Reserved IPs are Droplet-only (cannot be assigned to Load Balancers)
- DO Load Balancers get their own stable, persistent IP addresses
- The migration requires updating DNS A records to the LB's new IP
- With low TTLs, this DNS update propagates globally in under a minute
1. Lower DNS TTL on all staging subdomains¶
Set TTL to 60 seconds on all A records:
| Record | TTL (current) | TTL (target) |
|---|---|---|
| `staging-connect.forma3d.be` | 3600s (typical default) | 60s |
| `staging-connect-api.forma3d.be` | 3600s | 60s |
| `staging-connect-docs.forma3d.be` | 3600s | 60s |
| `staging-connect-events.forma3d.be` | 3600s | 60s |
| `staging-connect-db.forma3d.be` | 3600s | 60s |
| `staging-connect-logs.forma3d.be` | 3600s | 60s |
| `staging-connect-uptime.forma3d.be` | 3600s | 60s |
A 60s TTL means that when we later update the A records to point to a Load Balancer IP, all DNS caches worldwide will pick up the new IP within 60 seconds.
2. Verify DNS resolution¶
dig staging-connect.forma3d.be +short
dig staging-connect-api.forma3d.be +short
# Verify TTL is showing 60s or less
dig staging-connect.forma3d.be | grep -i ttl
3. Document the Droplet IP and datacenter¶
Add to deployment documentation and .env.example:
# DigitalOcean Infrastructure
DO_DROPLET_IP=<current-droplet-ip>
DO_DATACENTER=ams3
# Note: DNS TTLs set to 60s for future migration agility
4. Document the migration cut-over procedure¶
Create a brief runbook entry for the future DNS cut-over:
- Deploy services to DOKS cluster
- Create DO Load Balancer → get its stable IP
- Verify services are healthy behind the LB
- Update all DNS A records: Droplet IP → LB IP
- Wait 60 seconds for propagation
- Verify all services resolve to new IP
- Decommission Droplet
Why low TTLs now: DNS TTL changes take effect only after the previous TTL expires. If TTLs are currently 1 hour (3600s), lowering them to 60s right before migration means you still need to wait up to 1 hour for the old TTL to expire from caches. By lowering TTLs now, the 60s TTL is already cached everywhere when migration day arrives.
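The step-2 verification can be wrapped in a small helper so it is repeatable across all subdomains (a sketch; `check_ttl` is a hypothetical helper and the live `dig` usage is shown commented):

```shell
# Hypothetical TTL gate: pass the observed TTL and the target; fails if the
# observed value is still above the target.
check_ttl() {
  observed="$1"; target="$2"
  [ "$observed" -le "$target" ]
}

# Live usage (requires network), e.g.:
#   ttl=$(dig +noall +answer staging-connect.forma3d.be | awk '{print $2; exit}')
#   check_ttl "$ttl" 60 || echo "TTL still too high"
check_ttl 60 60 && echo "TTL OK"
```

Run it against every subdomain in the table above before declaring Phase 1 done.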
Phase 2: Container Registry Hygiene (1 hour)¶
Priority: P0 | Impact: High | Dependencies: None
Ensure all container images are stored in DigitalOcean Container Registry (DOCR) with a consistent tagging strategy that works for both Docker Compose and Kubernetes.
1. Verify DOCR is the image source for all services¶
Current Docker Compose already uses ${REGISTRY_URL}/forma3d-connect-*:${*_IMAGE_TAG:-latest}. Verify all services follow this pattern.
2. Implement semantic image tagging¶
Instead of relying solely on latest, ensure the CI pipeline tags images with:
- `git-<short-sha>` — immutable reference to the exact commit
- `latest` — rolling tag for the most recent build
- `staging` / `production` — environment-specific rolling tags
docker tag forma3d-connect-gateway:latest ${REGISTRY_URL}/forma3d-connect-gateway:git-abc1234
docker tag forma3d-connect-gateway:latest ${REGISTRY_URL}/forma3d-connect-gateway:staging
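In CI, the immutable tag can be derived from the commit SHA before tagging (a sketch; the `GIT_COMMIT` variable and its fallback are assumptions about the CI environment):

```shell
# Derive the immutable git-<short-sha> tag from the full commit SHA.
FULL_SHA="${GIT_COMMIT:-$(git rev-parse HEAD 2>/dev/null || echo 0123456789abcdef0000)}"
SHORT_SHA=$(printf '%s' "$FULL_SHA" | cut -c1-7)
IMMUTABLE_TAG="git-${SHORT_SHA}"
echo "$IMMUTABLE_TAG"

# Then tag and push both the immutable and the environment tag:
# docker tag forma3d-connect-gateway:latest "${REGISTRY_URL}/forma3d-connect-gateway:${IMMUTABLE_TAG}"
# docker tag forma3d-connect-gateway:latest "${REGISTRY_URL}/forma3d-connect-gateway:staging"
```

Pinning deployments to the `git-` tag (via `*_IMAGE_TAG` in `.env`) is what makes the Phase 10 rollback procedure deterministic.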
3. Add image pull policy awareness¶
In Docker Compose, add explicit pull_policy to each service:
services:
gateway:
image: ${REGISTRY_URL}/forma3d-connect-gateway:${GATEWAY_IMAGE_TAG:-latest}
pull_policy: always
This mirrors Kubernetes' imagePullPolicy: Always behavior and ensures deployments always use the latest image for a given tag.
Phase 3: Health Check Standardization (2 hours)¶
Priority: P1 | Impact: High | Dependencies: None
Kubernetes uses three probe types: liveness (is the process alive?), readiness (can it serve traffic?), and startup (has it finished initializing?). Ensure all services expose HTTP endpoints that serve these purposes.
1. Verify health endpoints exist in all backend services¶
Each NestJS service should expose:
| Endpoint | Purpose | K8s Probe Type | Expected Response |
|---|---|---|---|
| `GET /health/live` | Process is alive | Liveness | 200 OK |
| `GET /health/ready` | Can serve traffic (DB connected, dependencies up) | Readiness | 200 OK or 503 |
2. Update Docker Compose health checks to use HTTP¶
Replace wget/curl health checks with consistent HTTP checks:
healthcheck:
test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://localhost:3000/health/live']
interval: 30s
timeout: 5s
retries: 3
start_period: 30s
3. Add readiness checks that verify dependencies¶
The /health/ready endpoint should verify:
- Database connection is active
- Redis connection is active (for services that use Redis)
- Downstream services are reachable (for the Gateway)
This is critical for Kubernetes: a pod that passes liveness but fails readiness is kept alive but removed from the Service's endpoint list (no traffic sent to it).
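The readiness aggregation can be sketched as a small helper that runs each dependency check and maps the result to 200/503 (illustrative; the helper and check names are not from the codebase):

```typescript
type Check = () => Promise<boolean>;

// Run every dependency check; any failure (or thrown error) makes the
// service not-ready, which maps to HTTP 503 on /health/ready.
async function readiness(
  checks: Record<string, Check>,
): Promise<{ status: number; detail: Record<string, boolean> }> {
  const detail: Record<string, boolean> = {};
  for (const [name, check] of Object.entries(checks)) {
    detail[name] = await check().catch(() => false);
  }
  const ok = Object.values(detail).every(Boolean);
  return { status: ok ? 200 : 503, detail };
}
```

In a NestJS service the same shape is typically provided by `@nestjs/terminus` health indicators; the point is that `/health/ready` reports 503 while any dependency is down, so Kubernetes withholds traffic without restarting the pod.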
4. Verify health checks for third-party / observability services¶
The following third-party services already have health checks in Docker Compose — verify they are consistent and functional:
| Service | Health Check | Notes |
|---|---|---|
| ClickHouse | `clickhouse-client --query 'SELECT 1'` | Confirms query engine is ready |
| Grafana | `wget --spider http://localhost:3000/api/health` | Built-in API health endpoint |
| Uptime Kuma | HTTP check on port 3001 | Verify `start_period` is sufficient for DB init |
| Dozzle | `/dozzle healthcheck` | Built-in healthcheck command |
| OTel Collector | Add `curl http://localhost:13133/` | Uses the `health_check` extension (port 13133) — verify this extension is enabled in `otel-collector-config.yaml` |
Phase 4: Graceful Shutdown (2 hours)¶
Priority: P1 | Impact: High | Dependencies: None
Kubernetes sends SIGTERM to pods during rolling updates, then waits terminationGracePeriodSeconds (default 30s) before sending SIGKILL. Services must handle SIGTERM to finish in-flight requests.
1. Verify NestJS graceful shutdown is enabled¶
In each service's main.ts:
app.enableShutdownHooks();
This ensures NestJS listens for SIGTERM and:
- Stops accepting new connections
- Waits for in-flight HTTP requests to complete
- Closes database connections cleanly
- Closes Redis connections cleanly
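What `enableShutdownHooks()` provides can be sketched in plain Node terms (illustrative only; Nest wires this up for you, and the DB/Redis cleanup shown as a comment is an assumption about the service's clients):

```typescript
import * as http from "node:http";

const server = http.createServer((_req, res) => res.end("ok"));

function gracefulShutdown(): Promise<void> {
  return new Promise((resolve) => {
    // close() stops accepting new connections; the callback fires only
    // after in-flight requests have completed.
    server.close(() => resolve());
    // Real services would also close DB/Redis clients here,
    // e.g. await prisma.$disconnect().
  });
}

process.on("SIGTERM", () => {
  gracefulShutdown().then(() => process.exit(0));
});

server.listen(0); // ephemeral port, for the sketch
```

The 30s window in `stop_grace_period` (and later `terminationGracePeriodSeconds`) is exactly the time this drain is allowed to take before SIGKILL.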
2. Add stop_grace_period to Docker Compose¶
For each service in docker-compose.yml:
services:
gateway:
stop_grace_period: 30s
This mirrors Kubernetes' terminationGracePeriodSeconds and ensures Docker Compose also waits before sending SIGKILL.
3. Verify BullMQ workers handle shutdown¶
For services with BullMQ workers (order processing, print job processing), ensure workers call worker.close() on SIGTERM to finish processing the current job before shutting down.
4. Add stop_grace_period to third-party services¶
ClickHouse, Grafana, OTel Collector, Uptime Kuma, and Dozzle should also have stop_grace_period set. ClickHouse is especially critical — it may need time to flush in-memory buffers to disk on shutdown:
services:
clickhouse:
stop_grace_period: 60s # needs time to flush write buffers
grafana:
stop_grace_period: 15s
otel-collector:
stop_grace_period: 30s # flush pending telemetry batches
uptime-kuma:
stop_grace_period: 15s
dozzle:
stop_grace_period: 10s
Phase 5: Configuration Externalization Audit (2 hours)¶
Priority: P1 | Impact: High | Dependencies: None
Kubernetes uses ConfigMaps for non-sensitive configuration and Secrets for sensitive values. The Docker Compose .env file maps directly to these concepts — but only if ALL configuration is externalized.
1. Audit all services for hardcoded values¶
Search for hardcoded URLs, ports, timeouts, feature flags, or connection strings in application code. All must come from environment variables.
Common patterns to look for:
// ❌ WRONG — hardcoded
const DB_URL = 'postgresql://localhost:5432/forma3d';
// ✅ CORRECT — from environment
const DB_URL = process.env['DATABASE_URL'];
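A fail-fast accessor turns missing configuration into an immediate startup error instead of a latent runtime bug (a sketch; real services may use `@nestjs/config` instead, and `requireEnv` is a hypothetical helper):

```typescript
// Read a required variable, throwing at startup if it is absent or empty.
function requireEnv(
  name: string,
  env: Record<string, string | undefined> = process.env,
): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage at bootstrap:
// const dbUrl = requireEnv("DATABASE_URL");
```

Failing at boot matters for Kubernetes: a pod with a missing Secret key should crash-loop visibly rather than serve traffic with a bad default.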
2. Categorize environment variables¶
Create a documented mapping of all environment variables into two categories:
Non-sensitive (→ ConfigMap):
| Variable | Description | Example |
|---|---|---|
| `NODE_ENV` | Environment name | `staging` |
| `APP_PORT` | Service port | `3000` |
| `LOG_LEVEL` | Log verbosity | `info` |
| `RATE_LIMIT_DEFAULT` | Default rate limit | `10000` |
Sensitive (→ Secret):
| Variable | Description |
|---|---|
| `DATABASE_URL` | PostgreSQL connection string |
| `REDIS_URL` | Redis connection string |
| `SESSION_SECRET` | Cookie signing secret |
| `INTERNAL_API_KEY` | Inter-service auth key |
| `SENTRY_DSN` | Sentry data source name |
| `SHOPIFY_*` | Shopify OAuth credentials |
| `SENDCLOUD_*` | Sendcloud API credentials |
| `SIMPLYPRINT_*` | SimplyPrint API credentials |
3. Create a configuration reference document¶
Create docs/05-deployment/configuration-reference.md listing every environment variable, its purpose, default value, and whether it's sensitive.
Phase 6: Resource Constraints (1 hour)¶
Priority: P2 | Impact: Medium | Dependencies: None
Add resource limits to Docker Compose services. These translate directly to Kubernetes resource requests and limits.
1. Add deploy.resources to each service¶
services:
gateway:
deploy:
resources:
limits:
cpus: '0.50'
memory: 512M
reservations:
cpus: '0.25'
memory: 256M
2. Recommended resource allocations¶
| Service | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| Gateway | 0.25 | 0.50 | 256M | 512M |
| Order Service | 0.25 | 0.50 | 256M | 512M |
| Print Service | 0.15 | 0.30 | 192M | 384M |
| Shipping Service | 0.15 | 0.30 | 192M | 384M |
| GridFlock Service | 0.25 | 0.50 | 256M | 512M |
| Slicer | 0.50 | 1.00 | 512M | 1024M |
| Web (static) | 0.10 | 0.25 | 64M | 128M |
| Redis | 0.15 | 0.30 | 128M | 256M |
| Traefik | 0.15 | 0.30 | 128M | 256M |
| ClickHouse | 0.50 | 1.00 | 512M | 1536M |
| Grafana | 0.15 | 0.30 | 128M | 256M |
| OTel Collector | 0.15 | 0.30 | 128M | 256M |
| Uptime Kuma | 0.10 | 0.25 | 128M | 256M |
| Dozzle | 0.05 | 0.15 | 64M | 128M |
Adjust based on observed usage via docker stats.
Phase 7: Stateless Service Verification (1 hour)¶
Priority: P1 | Impact: High | Dependencies: None
For Kubernetes horizontal scaling, all application services must be stateless. State must live in external stores (PostgreSQL, Redis, S3).
1. Verify no local file storage¶
Check that no service writes to the local filesystem for state that needs to persist. Temporary files (e.g., STL processing in GridFlock/Slicer) should use /tmp and be cleaned up after processing.
2. Verify session storage uses Redis¶
Sessions must be stored in Redis (not in-memory). The Gateway already uses Redis for sessions — verify this is consistently applied.
3. Verify no in-memory caches that require consistency¶
If any service maintains in-memory caches, they must tolerate cache inconsistency across replicas or be moved to Redis.
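If a local cache is kept, bounding entries with a short TTL keeps cross-replica staleness tolerable (a sketch; the class is illustrative, not existing code):

```typescript
// Per-replica cache whose entries expire after ttlMs, so two replicas can
// disagree for at most ttlMs before both re-read the source of truth.
class TtlCache<V> {
  private store = new Map<string, { value: V; expires: number }>();

  constructor(private ttlMs: number) {}

  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.store.get(key);
    if (!entry || entry.expires <= now) {
      this.store.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.store.set(key, { value, expires: now + this.ttlMs });
  }
}
```

Anything that must be read-your-writes consistent across replicas (sessions, rate-limit counters, queues) belongs in Redis instead.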
4. Document stateful dependencies¶
| Dependency | Type | Location | K8s Strategy |
|---|---|---|---|
| PostgreSQL | Database | DigitalOcean Managed DB | External (no migration needed) |
| Redis | Cache / Sessions / Queues | Docker container | DigitalOcean Managed Redis or StatefulSet |
| ClickHouse | Observability / Analytics DB | Docker container (volume) | StatefulSet with PVC or ClickHouse Cloud |
| Grafana | Dashboards / Datasource config | Docker container (volume) | StatefulSet with PVC or Grafana Cloud |
| Uptime Kuma | Monitor state / history | Docker container (volume) | StatefulSet with PVC |
| Let's Encrypt certs | TLS | Traefik volume | cert-manager in K8s |
| Uploaded files | STL files | Temporary local → S3 future | DigitalOcean Spaces |
Phase 8: Docker Compose Networking Alignment (1 hour)¶
Priority: P2 | Impact: Medium | Dependencies: None
Kubernetes uses Service objects for service discovery (DNS-based: <service-name>.<namespace>.svc.cluster.local). Docker Compose already uses DNS-based service discovery within the network. Ensure the naming is consistent.
1. Verify service names match container references¶
In the Gateway's environment variables, downstream services are referenced as:
ORDER_SERVICE_URL=http://order-service:3001
PRINT_SERVICE_URL=http://print-service:3002
SHIPPING_SERVICE_URL=http://shipping-service:3003
GRIDFLOCK_SERVICE_URL=http://gridflock-service:3004
These names must match the Docker Compose service names exactly. In Kubernetes, these will become Kubernetes Service names — the URL pattern stays identical.
The observability pipeline also uses DNS-based service discovery:
# OTel Collector → ClickHouse
CLICKHOUSE_ENDPOINT=http://clickhouse:8123
# Application services → OTel Collector
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
# Grafana → ClickHouse (via provisioned datasource)
# Configured in grafana/provisioning/datasources/
2. Use environment variables for all service URLs¶
Never hardcode inter-service URLs. Always use environment variables so the values can be changed for Kubernetes Service discovery:
# Docker Compose
ORDER_SERVICE_URL=http://order-service:3001
# Kubernetes (same pattern, different port if needed)
ORDER_SERVICE_URL=http://order-service.forma3d.svc.cluster.local:3001
Phase 9: Multi-Replica Readiness (2 hours)¶
Priority: P1 | Impact: High | Dependencies: Phase 4, Phase 7
Verify and configure all application services so they can run as multiple replicas behind Traefik (Docker Compose) and later behind a Kubernetes Ingress. This goes beyond statelessness verification — it validates actual concurrent execution.
1. Add deploy.replicas to Docker Compose¶
Add explicit replica counts (default 1) to all application services so they can be scaled up trivially:
services:
gateway:
deploy:
replicas: 1 # Scale with: docker compose up -d --scale gateway=3
resources:
# ... (existing resource constraints)
2. Configure Traefik load-balancing across replicas¶
Traefik auto-discovers Docker containers by label. Verify that scaling up a service (e.g., docker compose up -d --scale gateway=3) results in Traefik distributing traffic across all replicas. Key considerations:
- Do NOT expose `ports:` on application services. Use `expose:` instead so replicas don't fight over host ports. Only Traefik should map to host ports 80/443.
- Verify Traefik labels use the service name, not a container name, so all replicas are included in the backend pool.
- Traefik load-balances round-robin by default; adjust the strategy in Traefik's dynamic configuration only if needed.
services:
gateway:
# ❌ WRONG — blocks scaling
# ports:
# - "3000:3000"
# ✅ CORRECT — allows multiple replicas
expose:
- "3000"
labels:
- "traefik.http.services.gateway.loadbalancer.server.port=3000"
3. Validate BullMQ worker concurrency¶
When running multiple replicas of a service with BullMQ workers, jobs are naturally distributed across workers (BullMQ uses Redis-based locking). Verify:
- No duplicate processing: Two replicas must not process the same job. BullMQ handles this natively — verify no custom job-fetch logic bypasses it.
- Worker concurrency settings: Ensure
concurrencyis set per-worker (not globally) so each replica processes its fair share. - Job events / progress: If the Gateway subscribes to job events via
QueueEvents, ensure this works correctly with multiple producer replicas.
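The reason two replicas cannot grab the same job is Redis-side locking; the idea can be illustrated with a toy claim function (a deliberate simplification of what BullMQ does internally, not its API):

```typescript
// Toy job-claim table standing in for BullMQ's Redis lock: the first replica
// to claim a job wins; every other replica's claim is rejected.
const locks = new Map<string, string>();

function claimJob(jobId: string, workerId: string): boolean {
  if (locks.has(jobId)) return false; // another replica already owns it
  locks.set(jobId, workerId);
  return true;
}
```

BullMQ performs this claim atomically in Redis (with lock renewal and stalled-job recovery), which is why scaling worker replicas is safe as long as no custom code fetches jobs outside the library.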
4. Handle WebSocket sticky sessions¶
If any service uses WebSocket connections (Socket.IO for real-time events), multiple replicas require sticky sessions to ensure the WebSocket upgrade request reaches the same backend that holds the socket state.
Traefik supports sticky sessions via cookies:
labels:
- "traefik.http.services.events.loadbalancer.sticky.cookie=true"
- "traefik.http.services.events.loadbalancer.sticky.cookie.name=server_id"
- "traefik.http.services.events.loadbalancer.sticky.cookie.httponly=true"
Socket.IO must also be configured to use the Redis adapter so pub/sub events propagate across replicas:
import { createAdapter } from '@socket.io/redis-adapter';
io.adapter(createAdapter(pubClient, subClient));
5. Test multi-replica operation¶
Run a manual scaling test for each application service:
# Scale up
docker compose up -d --scale gateway=2 --scale order-service=2
# Verify all replicas are healthy
docker compose ps
# Verify Traefik routes to all replicas. Note: curl's remote IP is always the
# Droplet, so confirm distribution via Traefik's access logs or by having
# /health/live echo the container hostname.
for i in $(seq 1 10); do
  curl -s https://staging-connect-api.forma3d.be/health/live; echo
done
# Scale back down
docker compose up -d --scale gateway=1 --scale order-service=1
Document which services can and cannot be scaled (e.g., the Slicer may have constraints around GPU or temp file cleanup).
Phase 10: Rolling Update Strategy (1 hour)¶
Priority: P1 | Impact: High | Dependencies: Phase 3, Phase 4, Phase 9
Define and test a rolling update procedure for Docker Compose that achieves zero-downtime deployments. This same procedure translates directly to Kubernetes Deployment rolling update strategy.
1. Add deploy.update_config to Docker Compose¶
Configure rolling update behavior for each service:
services:
gateway:
deploy:
replicas: 1
update_config:
parallelism: 1 # Update one replica at a time
delay: 10s # Wait 10s between replica updates
order: start-first # Start new replica before stopping old one
failure_action: rollback
rollback_config:
parallelism: 1
order: start-first
order: start-first is critical — it ensures the new container is healthy before the old one is removed, maintaining service availability throughout the update. Note that plain docker compose does not enforce deploy.update_config (only Swarm does); declaring it now documents the intended behavior and maps one-to-one onto a Kubernetes Deployment's rolling update strategy.
2. Document the rolling update procedure¶
Create a step-by-step update runbook:
# 1. Pull latest images
docker compose -f deployment/staging/docker-compose.yml pull
# 2. Rolling update — application services one at a time (no full restart).
#    Images were pulled from DOCR in step 1, so no --build here.
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps gateway
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps order-service
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps print-service
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps shipping-service
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps gridflock-service
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps slicer
# 3. Verify health after each service update
curl -sf https://staging-connect-api.forma3d.be/health/ready || echo "UNHEALTHY"
# 4. Update the web frontend (stateless, fast restart)
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps web
# 5. Update observability & monitoring services
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps clickhouse
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps otel-collector
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps grafana
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps uptime-kuma
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps dozzle
The --no-deps flag ensures only the target service is recreated — not its dependencies. This prevents cascading restarts.
3. Ensure database migration compatibility¶
Rolling updates mean two versions of a service run simultaneously (old and new). Database migrations must be backward-compatible:
- Adding a column: Safe — old code ignores it.
- Removing a column: Unsafe — old code still queries it. Use a two-phase approach:
- Deploy code that stops using the column.
- Deploy migration that removes the column.
- Renaming a column: Unsafe — treat as add + deprecate + remove.
Document this in the deployment guide as the expand-and-contract migration pattern.
4. Add a pre-deployment health gate¶
Before starting a rolling update, verify the system is healthy:
#!/usr/bin/env bash
set -euo pipefail
SERVICES=("staging-connect-api.forma3d.be" "staging-connect.forma3d.be")
for svc in "${SERVICES[@]}"; do
status=$(curl -sf -o /dev/null -w "%{http_code}" "https://${svc}/health/ready" || echo "000")
if [ "$status" != "200" ]; then
echo "❌ Pre-deploy health check failed for ${svc} (HTTP ${status}). Aborting."
exit 1
fi
done
echo "✅ All services healthy. Proceeding with rolling update."
5. Test rollback procedure¶
Document and test the rollback procedure:
# Rollback a single service to a specific image tag
# (docker compose up has no -e flag; override the tag via the environment)
GATEWAY_IMAGE_TAG=git-abc1234 \
  docker compose -f deployment/staging/docker-compose.yml up -d --no-deps gateway
# Or rollback by reverting the .env file and recreating
git checkout HEAD~1 -- deployment/staging/.env
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps gateway
Phase 11: Local Development Experience with Rancher Desktop + Tilt (3–4 hours)¶
Priority: P2 | Impact: High | Dependencies: Phase 3, Phase 4, Phase 5
Give every developer a one-command local environment that mirrors production networking, service discovery, and container runtime. The goal: git clone, tilt up, start coding.
Why Rancher Desktop + Tilt¶
| Tool | Role |
|---|---|
| Rancher Desktop | Desktop application that provides a local Kubernetes cluster (K3s under the hood), container runtime (containerd or dockerd), and kubectl — one install replaces Docker Desktop + k3d + ctlptl |
| Tilt | Watches source files, live-syncs into running containers, manages builds, port-forwards, and provides a dashboard at localhost:10350 |
| Traefik Mesh | Lightweight service mesh — automatic mTLS, request metrics, no sidecars. Same Helm chart reused in staging and production on DOKS |
| KubeView | Real-time graphical visualization of cluster resources and their relationships at localhost:8000 |
The existing pnpm dev workflow (Nx parallel serve) remains available for quick, lightweight iteration. tilt up is the production-parity alternative.
| Aspect | `pnpm dev` | `tilt up` |
|---|---|---|
| Infra setup | Manual (install PostgreSQL, Redis) | Automatic (K8s provisions everything) |
| Service discovery | `localhost` + hardcoded ports | K8s DNS (matches production) |
| Service mesh | None | Traefik Mesh (mTLS, metrics — same as staging/prod) |
| Hot reload | Nx watch mode | Tilt `live_update` (file sync into containers) |
| Debugging | Direct (same process) | Remote attach via `--inspect` port |
| Cluster visibility | None | KubeView (real-time resource graph) |
| Production parity | Low (no containers) | High (same images, same networking, same mesh) |
1. Prerequisites¶
Developers need two tools installed (plus one optional tool for webhook testing):
- Rancher Desktop — Provides the local Kubernetes cluster, container runtime, and `kubectl`. Download from the website or install via Homebrew:
# macOS
brew install --cask rancher
# Linux / Windows: download from https://rancherdesktop.io
- Tilt — Orchestrates the development workflow:
# macOS
brew install tilt-dev/tap/tilt
# Linux
curl -fsSL https://raw.githubusercontent.com/tilt-dev/tilt/master/scripts/install.sh | bash
- cloudflared (optional) — Exposes `localhost:3000` to the internet so Shopify, SimplyPrint, and SendCloud can deliver real webhooks during local development. Not needed for most development — `curl` simulation is sufficient:
# macOS
brew install cloudflared
# Linux
curl -fsSL https://pkg.cloudflare.com/cloudflare-main.gpg | sudo tee /usr/share/keyrings/cloudflare-main.gpg >/dev/null
sudo apt update && sudo apt install cloudflared
Rancher Desktop configuration:
After installing, open Rancher Desktop and verify these settings:
- Kubernetes enabled (on by default)
- Container runtime: dockerd (moby) — required for Tilt's docker_build() to work
- Kubernetes version: 1.29+ recommended
Rancher Desktop bundles kubectl and manages the kubeconfig automatically. No other prerequisites — PostgreSQL and Redis run inside the cluster.
2. Cluster setup — Rancher Desktop¶
No cluster definition file is needed. Rancher Desktop provides the Kubernetes cluster out of the box. Verify the cluster is running:
kubectl cluster-info
# Should show: Kubernetes control plane running at https://127.0.0.1:6443
kubectl get nodes
# Should show: rancher-desktop Ready
Rancher Desktop's built-in K3s has Traefik installed by default. This doesn't conflict with local development — Tilt handles port-forwarding directly to pods, bypassing any in-cluster ingress.
Image builds: Tilt's docker_build() uses the local Docker daemon provided by Rancher Desktop (when configured with dockerd (moby) runtime). Built images are available to the cluster immediately — no separate registry needed.
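A minimal Tiltfile tying this together might look like the following (a sketch; every path, image name, and sync target is an assumption about the repo layout, and only the Gateway is shown):

```python
# Tiltfile (Starlark) -- minimal sketch, one service shown.
k8s_yaml([
    'k8s/dev/namespace.yaml',
    'k8s/dev/postgres.yaml',
    'k8s/dev/redis.yaml',
    'k8s/dev/configmap.yaml',
    'k8s/dev/secret.yaml',
    'k8s/dev/gateway.yaml',
])

docker_build(
    'forma3d-connect-gateway',
    context='.',
    dockerfile='apps/gateway/Dockerfile',
    live_update=[
        # Sync source changes straight into the running container
        sync('apps/gateway/src', '/app/apps/gateway/src'),
    ],
)

# Forward the app port and the Node --inspect debug port
k8s_resource('gateway', port_forwards=['3000:3000', '9229:9229'])
```

Repeat the `docker_build`/`k8s_resource` pair per service; `tilt up` then builds images against Rancher Desktop's dockerd and serves the dashboard at localhost:10350.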
3. Kubernetes manifests — k8s/dev/¶
Create lightweight manifests for local development only. These are NOT the production Kubernetes manifests (those come later during the actual DOKS migration).
k8s/dev/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: forma3d-dev
k8s/dev/postgres.yaml — Single-node PostgreSQL with a PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-data
namespace: forma3d-dev
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: postgres
namespace: forma3d-dev
spec:
replicas: 1
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
containers:
- name: postgres
image: postgres:16-alpine
ports:
- containerPort: 5432
env:
- name: POSTGRES_DB
value: forma3d
- name: POSTGRES_USER
value: forma3d
- name: POSTGRES_PASSWORD
value: forma3d_dev
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
readinessProbe:
exec:
command: [pg_isready, -U, forma3d]
periodSeconds: 5
volumes:
- name: data
persistentVolumeClaim:
claimName: postgres-data
---
apiVersion: v1
kind: Service
metadata:
name: postgres
namespace: forma3d-dev
spec:
selector:
app: postgres
ports:
- port: 5432
k8s/dev/redis.yaml — Single-node Redis:
apiVersion: apps/v1
kind: Deployment
metadata:
name: redis
namespace: forma3d-dev
spec:
replicas: 1
selector:
matchLabels:
app: redis
template:
metadata:
labels:
app: redis
spec:
containers:
- name: redis
image: redis:7-alpine
ports:
- containerPort: 6379
readinessProbe:
exec:
command: [redis-cli, ping]
periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
name: redis
namespace: forma3d-dev
spec:
selector:
app: redis
ports:
- port: 6379
k8s/dev/configmap.yaml — Shared configuration for all application services:
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
namespace: forma3d-dev
data:
NODE_ENV: development
LOG_LEVEL: debug
DATABASE_URL: postgresql://forma3d:forma3d_dev@postgres.forma3d-dev.svc.cluster.local:5432/forma3d?schema=public
REDIS_URL: redis://redis.forma3d-dev.svc.cluster.local:6379
ORDER_SERVICE_URL: http://order-service.forma3d-dev.svc.cluster.local:3001
PRINT_SERVICE_URL: http://print-service.forma3d-dev.svc.cluster.local:3002
SHIPPING_SERVICE_URL: http://shipping-service.forma3d-dev.svc.cluster.local:3003
GRIDFLOCK_SERVICE_URL: http://gridflock-service.forma3d-dev.svc.cluster.local:3004
GATEWAY_URL: http://gateway.forma3d-dev.svc.cluster.local:3000
API_URL: http://localhost:3000
WEB_URL: http://localhost:4200
k8s/dev/secret.yaml.example — Template for sensitive values (not committed to git):
apiVersion: v1
kind: Secret
metadata:
name: app-secrets
namespace: forma3d-dev
type: Opaque
stringData:
SESSION_SECRET: local-dev-session-secret
INTERNAL_API_KEY: local-dev-internal-api-key
# Add your external service credentials below:
# SHOPIFY_CLIENT_ID: ""
# SHOPIFY_CLIENT_SECRET: ""
# SIMPLYPRINT_API_KEY: ""
# SENDCLOUD_PUBLIC_KEY: ""
# SENDCLOUD_SECRET_KEY: ""
# SENTRY_DSN: ""
Per-service manifests — Create one file per application service following this pattern (example for Gateway, repeat for order-service, print-service, shipping-service, gridflock-service):
k8s/dev/gateway.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: gateway
namespace: forma3d-dev
spec:
replicas: 1
selector:
matchLabels:
app: gateway
template:
metadata:
labels:
app: gateway
spec:
containers:
- name: gateway
image: forma3d-connect-gateway
ports:
- containerPort: 3000
name: http
- containerPort: 9229
name: debug
envFrom:
- configMapRef:
name: app-config
- secretRef:
name: app-secrets
env:
- name: APP_PORT
value: "3000"
readinessProbe:
httpGet:
path: /health/live
port: 3000
periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
name: gateway
namespace: forma3d-dev
spec:
selector:
app: gateway
ports:
- name: http
port: 3000
- name: debug
port: 9229
Repeat the pattern for each backend service (order-service on 3001/9230, print-service on 3002/9231, shipping-service on 3003/9232, gridflock-service on 3004/9233). Each service gets its own debug port and APP_PORT env override (since the shared configmap cannot hold per-service port values). Adjust the readinessProbe path to /health for the downstream services (they use /health instead of /health/live).
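Since the four remaining manifests differ only in name, ports, and probe path, the repetition can be scripted. A sketch (hypothetical helper, not part of the repo) that stamps them out from the gateway pattern:

```shell
#!/bin/sh
# Hypothetical generator for the per-service dev manifests.
# Each entry is <service>:<app-port>:<debug-port>, matching the list above.
set -eu
mkdir -p k8s/dev
for entry in order-service:3001:9230 print-service:3002:9231 \
             shipping-service:3003:9232 gridflock-service:3004:9233; do
  name=${entry%%:*}
  rest=${entry#*:}
  port=${rest%%:*}
  debug=${rest#*:}
  cat > "k8s/dev/${name}.yaml" <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${name}
  namespace: forma3d-dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ${name}
  template:
    metadata:
      labels:
        app: ${name}
    spec:
      containers:
        - name: ${name}
          image: forma3d-connect-${name}
          ports:
            - containerPort: ${port}
              name: http
            - containerPort: ${debug}
              name: debug
          envFrom:
            - configMapRef:
                name: app-config
            - secretRef:
                name: app-secrets
          env:
            - name: APP_PORT
              value: "${port}"
          readinessProbe:
            httpGet:
              path: /health
              port: ${port}
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: ${name}
  namespace: forma3d-dev
spec:
  selector:
    app: ${name}
  ports:
    - name: http
      port: ${port}
    - name: debug
      port: ${debug}
EOF
done
```

Running it once and committing the generated files keeps them greppable; regenerating is only needed when the pattern itself changes.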
k8s/dev/web.yaml — The React app runs vite dev (not nginx) for HMR:
apiVersion: apps/v1
kind: Deployment
metadata:
name: web
namespace: forma3d-dev
spec:
replicas: 1
selector:
matchLabels:
app: web
template:
metadata:
labels:
app: web
spec:
containers:
- name: web
image: forma3d-connect-web-dev
ports:
- containerPort: 4200
env:
- name: VITE_API_URL
value: http://localhost:3000
---
apiVersion: v1
kind: Service
metadata:
name: web
namespace: forma3d-dev
spec:
selector:
app: web
ports:
- port: 4200
k8s/dev/kubeview.yaml — Cluster visualization tool (KubeView):
apiVersion: apps/v1
kind: Deployment
metadata:
name: kubeview
namespace: forma3d-dev
spec:
replicas: 1
selector:
matchLabels:
app: kubeview
template:
metadata:
labels:
app: kubeview
spec:
serviceAccountName: kubeview
containers:
- name: kubeview
image: ghcr.io/benc-uk/kubeview:latest
ports:
- containerPort: 8000
env:
- name: SINGLE_NAMESPACE
value: forma3d-dev
- name: NAMESPACE_FILTER
value: "^kube-"
---
apiVersion: v1
kind: Service
metadata:
name: kubeview
namespace: forma3d-dev
spec:
selector:
app: kubeview
ports:
- port: 8000
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: kubeview
namespace: forma3d-dev
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kubeview-reader
rules:
- apiGroups: ["", "apps", "batch", "networking.k8s.io", "discovery.k8s.io", "autoscaling"]
resources: ["*"]
verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kubeview-reader-binding
subjects:
- kind: ServiceAccount
name: kubeview
namespace: forma3d-dev
roleRef:
kind: ClusterRole
name: kubeview-reader
apiGroup: rbac.authorization.k8s.io
KubeView provides a real-time graphical view of pods, deployments, services, and their relationships at http://localhost:8000. Useful for understanding how Kubernetes resources map to the application architecture.
Traefik Mesh — Installed via Helm in the Tiltfile (not a static manifest). Traefik Mesh is a lightweight service mesh that provides:
- Automatic mTLS between services
- Request-level metrics and tracing
- SMI (Service Mesh Interface) support for traffic policies
To opt a service into the mesh, add the label mesh.traefik.io/enabled: "true" to its Service object. For example, in k8s/dev/gateway.yaml:
apiVersion: v1
kind: Service
metadata:
name: gateway
namespace: forma3d-dev
labels:
mesh.traefik.io/enabled: "true"
spec:
selector:
app: gateway
ports:
- name: http
port: 3000
- name: debug
port: 9229
The same Traefik Mesh Helm chart and service labels will be reused in staging and production on DigitalOcean Managed Kubernetes (DOKS), giving consistent service-to-service networking across all environments.
4. Development Dockerfiles¶
Create lightweight development Dockerfiles that support Tilt's live_update (file sync instead of full rebuild):
.dockerignore — Create a .dockerignore in the project root to keep the build context small (Tilt's docker_build() uses context='.'):
node_modules
dist
build
.nx
.git
.vscode
.cursor
coverage
*.log
.DS_Store
docs
load-tests
.specstory
apps/gateway/Dockerfile.dev (same pattern for all NestJS services — adjust the service name and --inspect port):
FROM node:20-alpine
WORKDIR /app
RUN apk add --no-cache openssl
RUN corepack enable && corepack prepare pnpm@9 --activate
COPY package.json pnpm-lock.yaml ./
COPY prisma ./prisma/
RUN pnpm install --frozen-lockfile
RUN pnpm prisma generate
COPY . .
CMD ["node_modules/.bin/tsx", "watch", "--inspect=0.0.0.0:9229", "apps/gateway/src/main.ts"]
Using tsx watch instead of the full Nx build pipeline — starts in seconds, restarts on file changes synced by Tilt. The --inspect flag enables remote debugging via VS Code (port per service: gateway=9229, order-service=9230, print-service=9231, shipping-service=9232, gridflock-service=9233). The openssl package is required by Prisma on Alpine. The prisma/ directory is copied separately before pnpm install so that prisma generate can run against the schema.
apps/web/Dockerfile.dev:
FROM node:20-alpine
WORKDIR /app
RUN corepack enable && corepack prepare pnpm@9 --activate
COPY package.json pnpm-lock.yaml ./
COPY prisma ./prisma/
RUN pnpm install --frozen-lockfile
COPY . .
EXPOSE 4200
CMD ["node_modules/.bin/vite", "--host", "0.0.0.0", "--port", "4200", "apps/web"]
5. Tiltfile¶
Create Tiltfile in the project root:
# ---------------------------------------------------------------------------
# Forma3D.Connect — Local Development with Rancher Desktop + Tilt
# Usage: tilt up
# Dashboard: http://localhost:10350
# ---------------------------------------------------------------------------
load('ext://namespace', 'namespace_create')
# --- Cluster bootstrap ---
namespace_create('forma3d-dev')
# --- Infrastructure (PostgreSQL + Redis) ---
k8s_yaml('k8s/dev/postgres.yaml')
k8s_yaml('k8s/dev/redis.yaml')
k8s_yaml('k8s/dev/configmap.yaml')
# Apply secrets (developer must copy secret.yaml.example → secret.yaml)
if os.path.exists('k8s/dev/secret.yaml'):
k8s_yaml('k8s/dev/secret.yaml')
else:
fail('k8s/dev/secret.yaml not found. Copy k8s/dev/secret.yaml.example and fill in your values.')
k8s_resource('postgres', port_forwards=['5432:5432'],
labels=['infra'])
k8s_resource('redis', port_forwards=['6379:6379'],
labels=['infra'])
# --- Prisma migrations (runs after postgres is ready) ---
local_resource('prisma-migrate',
cmd='pnpm prisma migrate deploy',
resource_deps=['postgres'],
labels=['setup'])
local_resource('prisma-seed',
cmd='pnpm prisma db seed',
resource_deps=['prisma-migrate'],
auto_init=False, # manual trigger via Tilt UI button
labels=['setup'])
# --- Backend services ---
BACKEND_SERVICES = {
'gateway': {'port': 3000, 'debug': 9229},
'order-service': {'port': 3001, 'debug': 9230},
'print-service': {'port': 3002, 'debug': 9231},
'shipping-service': {'port': 3003, 'debug': 9232},
'gridflock-service':{'port': 3004, 'debug': 9233},
}
for svc, cfg in BACKEND_SERVICES.items():
docker_build(
'forma3d-connect-' + svc,
context='.',
dockerfile='apps/' + svc + '/Dockerfile.dev',
live_update=[
sync('apps/' + svc + '/src', '/app/apps/' + svc + '/src'),
sync('libs/', '/app/libs/'),
sync('prisma/', '/app/prisma/'),
],
)
k8s_yaml('k8s/dev/' + svc + '.yaml')
k8s_resource(svc,
port_forwards=[
str(cfg['port']) + ':' + str(cfg['port']),
str(cfg['debug']) + ':' + str(cfg['debug']),
],
resource_deps=['prisma-migrate'],
labels=['backend'])
# --- Web (React + Vite HMR) ---
docker_build(
'forma3d-connect-web-dev',
context='.',
dockerfile='apps/web/Dockerfile.dev',
live_update=[
sync('apps/web/src', '/app/apps/web/src'),
sync('libs/', '/app/libs/'),
],
)
k8s_yaml('k8s/dev/web.yaml')
k8s_resource('web',
port_forwards=['4200:4200'],
resource_deps=['gateway'],
labels=['frontend'])
# --- Traefik Mesh (service mesh — same in dev, staging, production) ---
local_resource('traefik-mesh-install',
cmd='helm repo add traefik https://traefik.github.io/charts --force-update && '
'helm upgrade --install traefik-mesh traefik/traefik-mesh '
'--namespace forma3d-dev --wait',
resource_deps=['postgres'], # ensure namespace exists
labels=['mesh'])
# --- KubeView (cluster visualization) ---
k8s_yaml('k8s/dev/kubeview.yaml')
k8s_resource('kubeview',
port_forwards=['8000:8000'],
labels=['tools'])
# --- Cloudflare Tunnel (webhook testing — on-demand) ---
# Exposes localhost:3000 (Gateway) to the internet so external services
# (Shopify, SimplyPrint, SendCloud) can deliver webhooks during local dev.
# Start manually from the Tilt dashboard when testing webhook flows.
local_resource('tunnel',
serve_cmd='cloudflared tunnel --url http://localhost:3000',
auto_init=False,
labels=['tools'])
6. VS Code debugging¶
Add launch configurations for attaching to running services inside the cluster. Create or extend .vscode/launch.json:
{
"version": "0.2.0",
"configurations": [
{
"name": "Attach: Gateway (Tilt)",
"type": "node",
"request": "attach",
"port": 9229,
"restart": true,
"sourceMaps": true,
"localRoot": "${workspaceFolder}",
"remoteRoot": "/app"
},
{
"name": "Attach: Order Service (Tilt)",
"type": "node",
"request": "attach",
"port": 9230,
"restart": true,
"sourceMaps": true,
"localRoot": "${workspaceFolder}",
"remoteRoot": "/app"
},
{
"name": "Attach: Print Service (Tilt)",
"type": "node",
"request": "attach",
"port": 9231,
"restart": true,
"sourceMaps": true,
"localRoot": "${workspaceFolder}",
"remoteRoot": "/app"
},
{
"name": "Attach: Shipping Service (Tilt)",
"type": "node",
"request": "attach",
"port": 9232,
"restart": true,
"sourceMaps": true,
"localRoot": "${workspaceFolder}",
"remoteRoot": "/app"
},
{
"name": "Attach: GridFlock Service (Tilt)",
"type": "node",
"request": "attach",
"port": 9233,
"restart": true,
"sourceMaps": true,
"localRoot": "${workspaceFolder}",
"remoteRoot": "/app"
}
]
}
To enable debugging, the service's Dockerfile.dev CMD must include --inspect=0.0.0.0:9229 (adjusting the port per service), as the Dockerfiles above already do — or add an environment variable toggle so the inspector is opt-in:
CMD ["node_modules/.bin/tsx", "watch", "--inspect=0.0.0.0:9229", "apps/gateway/src/main.ts"]
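The environment-variable toggle can be sketched as a small entrypoint helper that assembles the tsx command line; `DEBUG_INSPECT`, `DEBUG_PORT`, and `build_cmd` are hypothetical names, not existing repo conventions:

```shell
#!/bin/sh
# Sketch: opt-in inspector. The entrypoint builds the tsx command line from
# DEBUG_INSPECT / DEBUG_PORT instead of hardcoding --inspect in the CMD.
build_cmd() {
  entry=$1
  if [ "${DEBUG_INSPECT:-false}" = "true" ]; then
    echo "node_modules/.bin/tsx watch --inspect=0.0.0.0:${DEBUG_PORT:-9229} ${entry}"
  else
    echo "node_modules/.bin/tsx watch ${entry}"
  fi
}

build_cmd apps/gateway/src/main.ts
# prints: node_modules/.bin/tsx watch apps/gateway/src/main.ts
DEBUG_INSPECT=true DEBUG_PORT=9230 build_cmd apps/order-service/src/main.ts
# prints: node_modules/.bin/tsx watch --inspect=0.0.0.0:9230 apps/order-service/src/main.ts
```

In a real entrypoint the last line would be `exec $(build_cmd apps/gateway/src/main.ts)`, with `DEBUG_INSPECT` set via the Deployment's `env:` block.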
7. Developer workflow¶
First time setup:
# 1. Install Rancher Desktop (see Prerequisites) and ensure Kubernetes is running
kubectl cluster-info # verify cluster is ready
# 2. Clone and install
git clone <repo-url> && cd forma-3d-connect
pnpm install # install dependencies
cp k8s/dev/secret.yaml.example k8s/dev/secret.yaml # add your API keys
# 3. Start
tilt up # start everything
Daily development:
tilt up # Rancher Desktop starts on login, cluster is always ready
# Edit files in apps/ or libs/ — changes sync into containers in <2 seconds
# Open http://localhost:4200 (web), http://localhost:3000 (API)
# Tilt dashboard at http://localhost:10350
# KubeView at http://localhost:8000 (cluster visualization)
tilt down # stop all services (cluster persists)
Resetting the environment:
tilt down
kubectl delete namespace forma3d-dev # remove all dev resources
tilt up # recreate everything fresh
8. Webhook testing during local development¶
The application receives inbound webhooks from three external services:
| Provider | Webhook Path | Routed To |
|---|---|---|
| Shopify | `/api/v1/webhooks/shopify` | `order-service` |
| SimplyPrint | `/webhooks/simplyprint` | `print-service` |
| SendCloud | `/webhooks/sendcloud` | `shipping-service` |
These external services cannot reach localhost. When testing flows that depend on real webhook delivery, a tunnel is needed to expose the Gateway to the internet.
Option A: Cloudflare Tunnel (recommended for real webhook testing)
Install cloudflared:
# macOS
brew install cloudflared
# Linux
curl -fsSL https://pkg.cloudflare.com/cloudflare-main.gpg | sudo tee /usr/share/keyrings/cloudflare-main.gpg >/dev/null
echo 'deb [signed-by=/usr/share/keyrings/cloudflare-main.gpg] https://pkg.cloudflare.com/cloudflared any main' | sudo tee /etc/apt/sources.list.d/cloudflared.list
sudo apt update && sudo apt install cloudflared
The Tiltfile includes a tunnel resource (disabled by default). Start it from the Tilt dashboard when needed — it prints a temporary public URL (e.g., https://<random>.trycloudflare.com). Configure the external service's webhook URL to point to this tunnel URL:
# Shopify: Update webhook URL in Shopify Partner Dashboard or via API
# SimplyPrint: Update webhook URL in SimplyPrint dashboard
# SendCloud: Update webhook URL in SendCloud panel
# Example: verify the tunnel forwards to the Gateway
curl https://<tunnel-url>/health/live
Option B: Simulate webhooks with curl (no tunnel needed)
For most development, simulating webhook payloads locally is faster and doesn't require internet access. Send signed requests directly to the Gateway:
# Simulate a Shopify order creation webhook
curl -X POST http://localhost:3000/api/v1/webhooks/shopify \
-H "Content-Type: application/json" \
-H "X-Shopify-Topic: orders/create" \
-H "X-Shopify-Hmac-Sha256: <computed-hmac>" \
-H "X-Shopify-Shop-Domain: your-shop.myshopify.com" \
-d @apps/order-service/src/shopify/__tests__/fixtures/order-created.json
# Simulate a SimplyPrint job status webhook
curl -X POST http://localhost:3000/webhooks/simplyprint \
-H "Content-Type: application/json" \
-d '{"event": "job.status_changed", "data": {"job_id": "123", "status": "completed"}}'
# Simulate a SendCloud parcel status webhook
curl -X POST http://localhost:3000/webhooks/sendcloud \
-H "Content-Type: application/json" \
-d '{"action": "parcel_status_changed", "parcel": {"id": 123, "status": {"id": 11}}}'
To bypass HMAC signature verification during local development, set `WEBHOOK_SKIP_VERIFICATION=true` in `k8s/dev/configmap.yaml` (this environment variable must only be respected when `NODE_ENV=development`).
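When exercising the real verification path instead, the Shopify header can be computed locally: Shopify signs webhooks with a base64-encoded HMAC-SHA256 of the raw request body, keyed with the app's client secret. A sketch with a placeholder secret and body:

```shell
#!/bin/sh
# Compute an X-Shopify-Hmac-Sha256 value for a simulated webhook body.
SECRET="local-dev-client-secret"                  # placeholder; use the real app secret
printf '{"id":123,"name":"#1001"}' > body.json    # stand-in for a fixture file
HMAC=$(openssl dgst -sha256 -hmac "$SECRET" -binary < body.json | base64)
echo "$HMAC"
```

Send the body with `--data-binary @body.json` rather than `-d @body.json`: `-d` strips newlines from the file, which changes the digest and breaks verification.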
When to use which:
| Scenario | Use |
|---|---|
| Testing webhook handler logic in isolation | Option B (curl) |
| Testing the full end-to-end flow with a real external service | Option A (Cloudflare Tunnel) |
| CI / automated tests | Option B (mock payloads in test fixtures) |
9. Excluded services¶
The following staging-only services are NOT included in the local development setup:
| Service | Reason |
|---|---|
| Slicer (BambuStudio) | Requires specific Linux binaries; optional for most development |
| ClickHouse | Observability infra, not needed for feature development |
| Grafana | Observability infra |
| Uptime Kuma | Monitoring infra |
| Dozzle | Log viewer — Tilt provides its own log aggregation |
| OTel Collector | Observability pipeline |
| Traefik Proxy (Ingress) | Tilt port-forwarding replaces the reverse proxy for local dev |
Included dev tools: KubeView (cluster visualization at localhost:8000) and Traefik Mesh (service mesh — same Helm chart reused in staging/production).
If a developer needs the Slicer locally, they can run it standalone via Docker alongside the Tilt-managed cluster:
docker run -p 3010:3010 -v $(pwd)/deployment/slicer/profiles:/profiles \
forma3d-connect-slicer:latest
📊 Migration Path Summary¶
| Step | When | What Happens | DNS Impact |
|---|---|---|---|
| Now (this prompt) | Today | Lower DNS TTLs to 60s, standardize health checks, externalize config | TTL changes only |
| Multi-tenancy | Next quarter | System grows, consider DOKS | None |
| DOKS setup | When needed | Create DOKS cluster, deploy services, create DO Load Balancer | None |
| Cut-over | Migration day | Update DNS A records from Droplet IP to LB IP | DNS update (propagates in <60s) |
| Cleanup | Post-migration | Decommission Droplet | None |
✅ Validation Checklist¶
DNS Preparation¶
- DNS TTL lowered to 60s on all staging A records
- DNS resolution verified (`dig` shows correct IP and low TTL)
- Droplet IP and datacenter documented in deployment docs and `.env.example`
- Migration cut-over runbook documented
- All services still accessible after TTL changes (TLS valid, health checks pass)
Container Registry¶
- All services pull from DOCR (`${REGISTRY_URL}/forma3d-connect-*`)
- CI pipeline tags images with `git-<sha>` and environment tags
- `pull_policy: always` set on all application services in Docker Compose
Health Checks¶
- All backend services expose `GET /health/live` (200 OK)
- All backend services expose `GET /health/ready` (200 OK when healthy, 503 when not)
- Docker Compose health checks use HTTP endpoints consistently
- Health check intervals and thresholds are consistent across services
- Third-party services (ClickHouse, Grafana, Uptime Kuma, Dozzle, OTel Collector) have functioning health checks
Graceful Shutdown¶
- `app.enableShutdownHooks()` called in all NestJS services
- BullMQ workers handle SIGTERM (close cleanly)
- `stop_grace_period` set on all services (30s for app services, 60s for ClickHouse, 30s for OTel Collector)
- Verified: `docker compose stop` completes without SIGKILL for all services
Configuration¶
- No hardcoded URLs, ports, or secrets in application code
- All environment variables documented with sensitivity classification
- Configuration reference document created
- `.env.example` is complete and up-to-date
Resource Constraints¶
- `deploy.resources` (limits + reservations) set for all services (including ClickHouse, Grafana, OTel Collector, Uptime Kuma, Dozzle)
- Resource values validated against `docker stats` observations
Statelessness¶
- No application service writes persistent state to local filesystem
- Sessions stored in Redis (not in-memory)
- No singleton in-memory caches that break with multiple replicas
- Stateful dependencies documented with Kubernetes migration strategy (including ClickHouse, Grafana, Uptime Kuma volumes)
Multi-Replica Readiness¶
- `expose:` used instead of `ports:` on all application services (only Traefik exposes host ports)
- `deploy.replicas: 1` set explicitly on all application services
- Traefik load-balances across replicas when scaling up (`docker compose up -d --scale gateway=2`)
- WebSocket sticky sessions configured in Traefik for Socket.IO services
- Socket.IO Redis adapter configured for cross-replica pub/sub
- Each application service tested at 2+ replicas (scaled up and back down)
- Services that cannot be scaled documented with reasoning
Rolling Updates¶
- `deploy.update_config` with `order: start-first` set on all application services
- `deploy.rollback_config` set on all application services
- Rolling update runbook documented (per-service `--no-deps` updates, including observability services)
- Rollback procedure documented and tested
- Database migration backward-compatibility rules documented (expand-and-contract pattern)
- CI pipeline updated to support targeted per-service deploys
- Zero-downtime verified: rolling update completes with no failed health checks
Local Development (Rancher Desktop + Tilt)¶
- Rancher Desktop with Kubernetes enabled and `dockerd (moby)` runtime documented as prerequisite
- `k8s/dev/` contains manifests for namespace, postgres, redis, configmap, secret.yaml.example, and all application services
- `Tiltfile` exists in the project root and loads all `k8s/dev/` manifests
- Dev Dockerfiles (`Dockerfile.dev`) exist for all application services (gateway, order-service, print-service, shipping-service, gridflock-service, web)
- Dev Dockerfiles include `openssl` for Prisma on Alpine and the `--inspect` flag for remote debugging
- `.dockerignore` exists in the project root to exclude `node_modules/`, `.git/`, `dist/`, etc. from the Docker build context
- Per-service `APP_PORT` env override is set in each K8s manifest (not in the shared configmap)
- `tilt up` from a clean clone (after `pnpm install` and secret.yaml setup) starts all services successfully
- PostgreSQL and Redis are provisioned automatically inside the cluster
- Prisma migrations run automatically on startup (after postgres is ready)
- Port-forwards work: `localhost:3000` (gateway), `localhost:4200` (web), `localhost:5432` (postgres), `localhost:6379` (redis), `localhost:8000` (KubeView)
- Live-update works: editing a file in `apps/<service>/src/` triggers a container restart within 2 seconds
- Web HMR works: editing a file in `apps/web/src/` reflects immediately in the browser
- Debug ports are accessible: `localhost:9229` (gateway), `9230`–`9233` (other services)
- VS Code can attach debugger to running services via launch configurations
- KubeView shows all pods, deployments, and services in the `forma3d-dev` namespace at `localhost:8000`
- Traefik Mesh is installed via Helm and running (`kubectl get pods -n forma3d-dev` shows mesh controller and proxies)
- Services with the `mesh.traefik.io/enabled: "true"` label are routed through the mesh
- `tilt down` cleanly stops all services
- `kubectl delete namespace forma3d-dev` cleanly removes all dev resources
- Existing `pnpm dev` workflow still works independently
- `k8s/dev/secret.yaml` is in `.gitignore`
- Tiltfile includes an on-demand `tunnel` resource (`cloudflared tunnel --url http://localhost:3000`) with `auto_init=False`
- `cloudflared` documented as an optional prerequisite (only needed for real webhook testing)
- Webhook simulation with `curl` documented with example payloads for Shopify, SimplyPrint, and SendCloud
- `WEBHOOK_SKIP_VERIFICATION` env variable supported in development mode for local `curl` testing
Verification Commands¶
# All services healthy and DNS TTLs lowered
curl -I https://staging-connect.forma3d.be
curl -I https://staging-connect-api.forma3d.be/health/live
curl -I https://staging-connect-api.forma3d.be/health/ready
dig staging-connect.forma3d.be | grep TTL
# Docker Compose validation
docker compose -f deployment/staging/docker-compose.yml config --quiet
# Graceful shutdown test
docker compose stop gateway # Should stop within 30s without SIGKILL
# Resource usage baseline
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
# Build passes
pnpm nx run-many -t build --all
# Tests pass
pnpm nx run-many -t test --all --exclude=api-e2e,acceptance-tests
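The "pre-deployment health gate script" from the rolling-update checklist can be sketched as a polling loop; `gate` and `GATE_INTERVAL` are hypothetical names, and in real use the probe is a `curl -fsS` against `/health/ready`:

```shell
#!/bin/sh
# Hypothetical pre-deployment health gate: run the probe command up to N times,
# succeed as soon as it passes, fail the deploy if it never does.
gate() {
  attempts=$1; shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "healthy"
      return 0
    fi
    i=$((i + 1))
    sleep "${GATE_INTERVAL:-2}"
  done
  echo "unhealthy after ${attempts} attempts" >&2
  return 1
}

# Real usage (gate a rolling update on staging):
#   gate 30 curl -fsS https://staging-connect-api.forma3d.be/health/ready || exit 1
gate 3 true   # demo probe that passes immediately; prints "healthy"
```

Wiring this in front of `docker compose up -d --no-deps <service>` in CI gives the zero-downtime guarantee the checklist asks for: the deploy aborts instead of cutting over to an unhealthy replica.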
🚫 Constraints and Rules¶
MUST DO¶
- Lower DNS TTLs to 60s on all staging A records (enables fast future cut-over)
- Document the migration cut-over procedure (DNS update from Droplet IP to LB IP)
- Verify all services expose HTTP health endpoints (`/health/live` and `/health/ready`)
- Enable graceful shutdown hooks in all NestJS services
- Add `stop_grace_period` to all Docker Compose application services
- Audit and document all environment variables with sensitivity classification
- Verify all application services are stateless
- Add resource constraints to Docker Compose services
- Use environment variables for all inter-service URLs (no hardcoding)
- Create `k8s/dev/` manifests (including KubeView), `Tiltfile`, and dev Dockerfiles for the local Rancher Desktop + Tilt workflow
- Install Traefik Mesh via Helm in the Tiltfile (same chart reused in staging/production on DOKS)
- Verify `tilt up` starts all services from a clean state (clone + `tilt up` = working environment)
- Preserve the existing `pnpm dev` workflow as a lightweight alternative
MUST NOT¶
- Create any production Kubernetes manifests, Helm charts, or Kustomize configs — not yet (local-dev K8s manifests in `k8s/dev/` are fine)
- Deploy K8s manifests to staging or production — the `k8s/dev/` manifests are for local development only
- Change the Docker Compose deployment workflow
- Remove or replace Traefik — it stays as the reverse proxy for now
- Over-engineer for Kubernetes patterns that aren't needed yet (sidecars, custom operators, etc.) — Traefik Mesh is allowed as it's lightweight and non-invasive
- Break any existing functionality or deployment process
- Use `any`, `ts-ignore`, or `eslint-disable`
SHOULD DO (Nice to Have)¶
- Document the Kubernetes migration path in `docs/05-deployment/kubernetes-migration-plan.md`
- Set up DigitalOcean monitoring alerts for DNS records
- Explore DigitalOcean's App Platform as an intermediate step before full Kubernetes
- Add a `tilt_config.json` for per-developer overrides (e.g., enable/disable Slicer, toggle debug ports)
- Create a `CONTRIBUTING.md` section documenting the `tilt up` workflow for new developers
🔄 Rollback Plan¶
All changes in this prompt are non-destructive:
- DNS TTL changes: Lowering TTLs is completely non-destructive. If needed, TTLs can be raised back to their original values.
- Health checks: Added endpoints don't affect existing functionality.
- Graceful shutdown: `enableShutdownHooks()` is additive — it doesn't change normal operation.
- Resource constraints: Docker Compose ignores `deploy.resources` unless using `docker compose up` with the `--compatibility` flag or Docker Swarm mode. In standalone Docker Compose, these serve as documentation.
- Configuration audit: Documentation-only changes.
- Rancher Desktop + Tilt: All local dev files (`k8s/dev/`, `Tiltfile`, `Dockerfile.dev`) are additive. They don't affect staging/production deployment, the CI pipeline, or the existing `pnpm dev` workflow. The Rancher Desktop cluster runs entirely on the developer's machine. KubeView is read-only. Traefik Mesh is opt-in (only affects services with the `mesh.traefik.io/enabled` label) and will be reused with the same Helm chart in staging/production.
📚 Key References¶
DigitalOcean: - DOCR: https://docs.digitalocean.com/products/container-registry/ - DOKS: https://docs.digitalocean.com/products/kubernetes/ - Load Balancers: https://docs.digitalocean.com/products/networking/load-balancers/ - Reserved IPs (Droplet-only): https://docs.digitalocean.com/products/networking/reserved-ips/ — Note: cannot be assigned to Load Balancers, only Droplets
Kubernetes Migration: - Docker Compose to Kubernetes: https://kubernetes.io/docs/tasks/configure-pod-container/translate-compose-kubernetes/ - Kompose (migration tool): https://kompose.io/
Local Development (Rancher Desktop + Tilt): - Rancher Desktop: https://rancherdesktop.io/ — local Kubernetes with built-in K3s, container runtime, and kubectl - Tilt: https://docs.tilt.dev/ — live development orchestration - Tilt live_update: https://docs.tilt.dev/live_update_reference — file sync into running containers - Tilt + Rancher Desktop: https://docs.tilt.dev/choosing_clusters#rancher-desktop — official Tilt integration guide - KubeView: https://github.com/benc-uk/kubeview — lightweight Kubernetes cluster visualization - Traefik Mesh: https://doc.traefik.io/traefik-mesh/ — lightweight service mesh (no sidecars, SMI-compatible) - Traefik Mesh install: https://doc.traefik.io/traefik-mesh/install/ — Helm-based installation - Cloudflare Tunnel (cloudflared): https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/ — free tunnel for exposing local services to the internet (webhook testing)
NestJS: - Graceful shutdown: https://docs.nestjs.com/fundamentals/lifecycle-events#application-shutdown - Health checks (Terminus): https://docs.nestjs.com/recipes/terminus
Existing Codebase:
- Docker Compose: deployment/staging/docker-compose.yml
- Traefik config: deployment/staging/traefik.yml
- Deployment guide: docs/05-deployment/staging-deployment-guide.md
- CI Pipeline: azure-pipelines.yml
END OF PROMPT
This prompt prepares the Forma3D.Connect infrastructure for a future Docker Compose to Kubernetes migration. The key networking deliverable is lowering DNS TTLs to 60s, enabling a fast DNS-based cut-over to a Load Balancer + DOKS cluster in the future (propagation under 1 minute). Note: DO Reserved IPs cannot be assigned to Load Balancers, so the migration uses a DNS update strategy instead.

Supporting changes include health check standardization, graceful shutdown, configuration externalization, resource constraints, statelessness verification, multi-replica readiness (load-balancing, sticky sessions, worker deduplication), and a rolling update strategy with health gates and backward-compatible migrations.

The system stays on Docker Compose for staging/production but becomes "Kubernetes-ready." A local Rancher Desktop + Tilt development environment provides production-parity K8s-based development with a one-command `tilt up` workflow.