AI Prompt: Forma3D.Connect — Scaling Preparations (Docker Compose → Kubernetes)

Purpose: Prepare the current single-Droplet Docker Compose deployment for a seamless future migration to Kubernetes on DigitalOcean, without adding unnecessary complexity now
Estimated Effort: 14–19 hours
Prerequisites: Staging deployment operational on a single DigitalOcean Droplet with Docker Compose + Traefik
Output: Externalized configuration, a container registry strategy, architectural guardrails that make a future Kubernetes migration a straightforward lift-and-shift, and a local Rancher Desktop + Tilt development environment for a production-parity "clone and tilt up" workflow
Status: 🚧 TODO


🎯 Mission

Prepare the Forma3D.Connect infrastructure for a seamless future migration from Docker Compose on a single Droplet to DigitalOcean Managed Kubernetes (DOKS). The goal is to make changes now — while the system is simple — that will pay off when multi-tenancy drives the need for horizontal scaling.

This is NOT a Kubernetes migration. This prompt makes the Docker Compose deployment "Kubernetes-ready" by:

  1. DNS strategy with low TTL — Prepare DNS records for a fast cut-over to a future DO Load Balancer IP by lowering TTLs now
  2. Container registry strategy — Ensure all images are pulled from DigitalOcean Container Registry (DOCR) with proper tagging
  3. Configuration externalization — Move all configuration to environment variables and .env files so they map cleanly to Kubernetes ConfigMaps and Secrets
  4. Health check standardization — Ensure all services expose HTTP health endpoints that can later serve unchanged as Kubernetes liveness/readiness probes
  5. Stateless service design — Verify all services are stateless (no local file storage, no in-memory sessions without Redis backing)
  6. Graceful shutdown — Ensure all services handle SIGTERM for zero-downtime rolling updates
  7. Resource awareness — Add resource constraints to Docker Compose that translate directly to Kubernetes resource requests/limits
  8. DNS and TLS strategy — Plan the DNS/TLS migration path from Traefik to DigitalOcean Load Balancer + cert-manager
  9. Multi-replica readiness — Validate that all containers can run as multiple replicas behind Traefik with proper load-balancing, sticky sessions, and worker deduplication
  10. Rolling update strategy — Define and test a zero-downtime rolling update procedure with start-first ordering, health gates, and backward-compatible database migrations
  11. Local development with Rancher Desktop + Tilt — Create a "clone and tilt up" developer experience using Rancher Desktop's built-in Kubernetes, Tilt live-update, port-forwarding, and VS Code debug attach

Important note on DigitalOcean Reserved IPs: DO Reserved IPs can only be assigned to Droplets, not to Load Balancers. This means we cannot use a Reserved IP as a stable entry point that gets reassigned from a Droplet to a Load Balancer. Instead, the migration strategy uses DNS-based cut-over: when moving to DOKS, update DNS A records from the Droplet IP to the Load Balancer's stable IP. Setting low TTLs (60s) on DNS records before migration minimizes propagation delay to under a minute.

Why now:

  • Changes are cheap when the system is small (6 services + supporting containers)
  • Retrofitting these patterns later is expensive and error-prone
  • Multi-tenancy (the next major feature) will be the trigger for needing Kubernetes
  • Lowering DNS TTLs now means the future DNS cut-over will propagate in under a minute

What stays unchanged:

  • Docker Compose remains the deployment mechanism for now
  • Traefik remains the reverse proxy for now
  • Single Droplet remains the hosting model for now
  • No Kubernetes manifests or Helm charts are created in this prompt

📐 Architecture

Current State

                    DNS A Records
                         │
        ┌────────────────┴────────────────────┐
        │  staging-connect.forma3d.be          │
        │  staging-connect-api.forma3d.be      │
        │  staging-connect-docs.forma3d.be     │
        │  staging-connect-events.forma3d.be   │
        │  staging-connect-db.forma3d.be       │
        │  staging-connect-logs.forma3d.be     │
        │  staging-connect-uptime.forma3d.be   │
        └────────────────┬────────────────────┘
                         │
                         ▼
              ┌──────────────────────┐
              │  Droplet Public IP   │
              │  (e.g., 167.x.x.x)  │
              └──────────┬───────────┘
                         │
              ┌──────────┴───────────┐
              │  Docker Compose      │
              │  + Traefik           │
              │  + All Services      │
              └──────────────────────┘

Target State (after this prompt)

                    DNS A Records (TTL: 60s)
                         │
        ┌────────────────┴────────────────────┐
        │  staging-connect.forma3d.be          │
        │  staging-connect-api.forma3d.be      │
        │  (all subdomains)                    │
        └────────────────┬────────────────────┘
                         │
                         ▼
              ┌──────────────────────┐
              │  Droplet Public IP   │   ← Same IP, but DNS TTL lowered to 60s
              │  (e.g., 167.x.x.x)  │     so future cut-over propagates fast
              └──────────┬───────────┘
                         │
              ┌──────────┴───────────┐
              │  Docker Compose      │   + Health checks standardized
              │  + Traefik           │   + Graceful shutdown enabled
              │  + All Services      │   + Resource constraints added
              │  (K8s-ready)         │   + Configuration externalized
              └──────────────────────┘

Future State (Kubernetes — NOT this prompt)

                    DNS A Records (TTL: 60s)
                         │
                         ▼
              ┌──────────────────────┐
              │  DO Load Balancer    │   ← DNS updated to LB's stable IP
              │  (stable IP)         │     Propagation: <1 min with 60s TTL
              └──────────┬───────────┘
                         │
              ┌──────────┴───────────┐
              │  DOKS Cluster        │
              │  ├── Ingress NGINX   │
              │  ├── Gateway Pod(s)  │
              │  ├── Order Svc Pod(s)│
              │  ├── Print Svc Pod(s)│
              │  ├── Ship Svc Pod(s) │
              │  ├── GridFlock Pod(s)│
              │  ├── Slicer Pod(s)   │
              │  └── Web Pod(s)      │
              └──────────────────────┘

Note: DigitalOcean Load Balancers have stable, persistent IP addresses that do not change throughout their lifetime. The migration requires a one-time DNS update from the Droplet IP to the LB IP. With a 60s TTL, this propagates in under a minute.


📋 Implementation Phases

Phase 1: DNS Preparation for Future Migration (1 hour)

Priority: P0 | Impact: Critical | Dependencies: None

Prepare DNS records so that a future migration to DOKS + Load Balancer can be done with minimal disruption. The key insight: DigitalOcean Reserved IPs can only be assigned to Droplets, not to Load Balancers. Therefore, we cannot use a Reserved IP as a stable entry point that moves between Droplet and LB. Instead, the migration strategy relies on low-TTL DNS cut-over.

Why NOT a Reserved IP for this use case:

  • DO Reserved IPs are Droplet-only (cannot be assigned to Load Balancers)
  • DO Load Balancers get their own stable, persistent IP addresses
  • The migration requires updating DNS A records to the LB's new IP
  • With low TTLs, this DNS update propagates globally in under a minute

1. Lower DNS TTL on all staging subdomains

Set TTL to 60 seconds on all A records:

| Record | TTL (current) | TTL (target) |
| --- | --- | --- |
| staging-connect.forma3d.be | 3600s (typical default) | 60s |
| staging-connect-api.forma3d.be | 3600s | 60s |
| staging-connect-docs.forma3d.be | 3600s | 60s |
| staging-connect-events.forma3d.be | 3600s | 60s |
| staging-connect-db.forma3d.be | 3600s | 60s |
| staging-connect-logs.forma3d.be | 3600s | 60s |
| staging-connect-uptime.forma3d.be | 3600s | 60s |

A 60s TTL means that when we later update the A records to point to a Load Balancer IP, all DNS caches worldwide will pick up the new IP within 60 seconds.
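
If DNS for forma3d.be is hosted on DigitalOcean, the TTL change can be scripted with doctl. A minimal sketch (assumes doctl is authenticated; <record-id> comes from the list output):

# List record IDs and current TTLs
doctl compute domain records list forma3d.be --format ID,Type,Name,TTL

# Lower the TTL on one record (repeat per staging A record)
doctl compute domain records update forma3d.be \
  --record-id <record-id> --record-ttl 60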

2. Verify DNS resolution

dig staging-connect.forma3d.be +short
dig staging-connect-api.forma3d.be +short
# Verify the TTL (second column of the answer) is 60s or less
dig +noall +answer staging-connect.forma3d.be

3. Document the Droplet IP and datacenter

Add to deployment documentation and .env.example:

# DigitalOcean Infrastructure
DO_DROPLET_IP=<current-droplet-ip>
DO_DATACENTER=ams3
# Note: DNS TTLs set to 60s for future migration agility

4. Document the migration cut-over procedure

Create a brief runbook entry for the future DNS cut-over:

  1. Deploy services to DOKS cluster
  2. Create DO Load Balancer → get its stable IP
  3. Verify services are healthy behind the LB
  4. Update all DNS A records: Droplet IP → LB IP
  5. Wait 60 seconds for propagation
  6. Verify all services resolve to new IP
  7. Decommission Droplet

Why low TTLs now: DNS TTL changes take effect only after the previous TTL expires. If TTLs are currently 1 hour (3600s), lowering them to 60s right before migration means you still need to wait up to 1 hour for the old TTL to expire from caches. By lowering TTLs now, the 60s TTL is already cached everywhere when migration day arrives.


Phase 2: Container Registry Hygiene (1 hour)

Priority: P0 | Impact: High | Dependencies: None

Ensure all container images are stored in DigitalOcean Container Registry (DOCR) with a consistent tagging strategy that works for both Docker Compose and Kubernetes.

1. Verify DOCR is the image source for all services

Current Docker Compose already uses ${REGISTRY_URL}/forma3d-connect-*:${*_IMAGE_TAG:-latest}. Verify all services follow this pattern.

2. Implement semantic image tagging

Instead of relying solely on latest, ensure the CI pipeline tags images with:

  • git-<short-sha> — immutable reference to exact commit
  • latest — rolling tag for the most recent build
  • staging / production — environment-specific rolling tags

docker tag forma3d-connect-gateway:latest ${REGISTRY_URL}/forma3d-connect-gateway:git-abc1234
docker tag forma3d-connect-gateway:latest ${REGISTRY_URL}/forma3d-connect-gateway:staging
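
In CI this might look like the following sketch (assumes the pipeline has already authenticated to DOCR, e.g., via doctl registry login):

GIT_SHA=$(git rev-parse --short HEAD)
for svc in gateway order-service print-service shipping-service gridflock-service web; do
  docker tag forma3d-connect-${svc}:latest ${REGISTRY_URL}/forma3d-connect-${svc}:git-${GIT_SHA}
  docker tag forma3d-connect-${svc}:latest ${REGISTRY_URL}/forma3d-connect-${svc}:staging
  docker push ${REGISTRY_URL}/forma3d-connect-${svc}:git-${GIT_SHA}
  docker push ${REGISTRY_URL}/forma3d-connect-${svc}:staging
done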

3. Add image pull policy awareness

In Docker Compose, add explicit pull_policy to each service:

services:
  gateway:
    image: ${REGISTRY_URL}/forma3d-connect-gateway:${GATEWAY_IMAGE_TAG:-latest}
    pull_policy: always

This mirrors Kubernetes' imagePullPolicy: Always behavior and ensures deployments always use the latest image for a given tag.


Phase 3: Health Check Standardization (2 hours)

Priority: P1 | Impact: High | Dependencies: None

Kubernetes uses three probe types: liveness (is the process alive?), readiness (can it serve traffic?), and startup (has it finished initializing?). Ensure all services expose HTTP endpoints that serve these purposes.

1. Verify health endpoints exist in all backend services

Each NestJS service should expose:

| Endpoint | Purpose | K8s Probe Type | Expected Response |
| --- | --- | --- | --- |
| GET /health/live | Process is alive | Liveness | 200 OK |
| GET /health/ready | Can serve traffic (DB connected, dependencies up) | Readiness | 200 OK or 503 |
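
A minimal controller sketch covering both endpoints, assuming a Prisma-based NestJS service (the PrismaService import path is an assumption; @nestjs/terminus is an equally valid approach):

import { Controller, Get, ServiceUnavailableException } from '@nestjs/common';
import { PrismaService } from '../prisma/prisma.service'; // assumed path

@Controller('health')
export class HealthController {
  constructor(private readonly prisma: PrismaService) {}

  // Liveness: the process is up; never checks dependencies.
  @Get('live')
  live() {
    return { status: 'ok' };
  }

  // Readiness: dependencies reachable; a 503 removes the instance from rotation.
  @Get('ready')
  async ready() {
    try {
      await this.prisma.$queryRaw`SELECT 1`; // add Redis/downstream checks here as needed
      return { status: 'ok' };
    } catch {
      throw new ServiceUnavailableException({ status: 'unavailable' });
    }
  }
}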

2. Update Docker Compose health checks to use HTTP

Replace wget/curl health checks with consistent HTTP checks:

healthcheck:
  test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://localhost:3000/health/live']
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 30s

3. Add readiness checks that verify dependencies

The /health/ready endpoint should verify:

  • Database connection is active
  • Redis connection is active (for services that use Redis)
  • Downstream services are reachable (for the Gateway)

This is critical for Kubernetes: a pod that passes liveness but fails readiness is kept alive but removed from the Service's endpoint list (no traffic sent to it).

4. Verify health checks for third-party / observability services

The following third-party services already have health checks in Docker Compose — verify they are consistent and functional:

| Service | Health Check | Notes |
| --- | --- | --- |
| ClickHouse | clickhouse-client --query 'SELECT 1' | Confirms query engine is ready |
| Grafana | wget --spider http://localhost:3000/api/health | Built-in API health endpoint |
| Uptime Kuma | HTTP check on port 3001 | Verify start_period is sufficient for DB init |
| Dozzle | /dozzle healthcheck | Built-in healthcheck command |
| OTel Collector | Add curl http://localhost:13133/ | Uses the health_check extension (port 13133) — verify this extension is enabled in otel-collector-config.yaml |

Phase 4: Graceful Shutdown (2 hours)

Priority: P1 | Impact: High | Dependencies: None

Kubernetes sends SIGTERM to pods during rolling updates, then waits terminationGracePeriodSeconds (default 30s) before sending SIGKILL. Services must handle SIGTERM to finish in-flight requests.

1. Verify NestJS graceful shutdown is enabled

In each service's main.ts:

app.enableShutdownHooks();

This ensures NestJS listens for SIGTERM and:

  • Stops accepting new connections
  • Waits for in-flight HTTP requests to complete
  • Closes database connections cleanly
  • Closes Redis connections cleanly

2. Add stop_grace_period to Docker Compose

For each service in docker-compose.yml:

services:
  gateway:
    stop_grace_period: 30s

This mirrors Kubernetes' terminationGracePeriodSeconds and ensures Docker Compose also waits before sending SIGKILL.

3. Verify BullMQ workers handle shutdown

For services with BullMQ workers (order processing, print job processing), ensure workers call worker.close() on SIGTERM to finish processing the current job before shutting down.
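
A standalone sketch of the shutdown path (queue name and processor are placeholders; in NestJS this typically lives in onModuleDestroy, which enableShutdownHooks() triggers on SIGTERM):

import { Worker } from 'bullmq';

const connection = { host: process.env['REDIS_HOST'] ?? 'redis', port: 6379 };
const worker = new Worker('print-jobs', processJob, { connection }); // placeholder names

process.on('SIGTERM', async () => {
  await worker.close(); // resolves after the active job finishes; close(true) would abort it
  process.exit(0);
});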

4. Add stop_grace_period to third-party services

ClickHouse, Grafana, OTel Collector, Uptime Kuma, and Dozzle should also have stop_grace_period set. ClickHouse is especially critical — it may need time to flush in-memory buffers to disk on shutdown:

services:
  clickhouse:
    stop_grace_period: 60s   # needs time to flush write buffers
  grafana:
    stop_grace_period: 15s
  otel-collector:
    stop_grace_period: 30s   # flush pending telemetry batches
  uptime-kuma:
    stop_grace_period: 15s
  dozzle:
    stop_grace_period: 10s

Phase 5: Configuration Externalization Audit (2 hours)

Priority: P1 | Impact: High | Dependencies: None

Kubernetes uses ConfigMaps for non-sensitive configuration and Secrets for sensitive values. The Docker Compose .env file maps directly to these concepts — but only if ALL configuration is externalized.

1. Audit all services for hardcoded values

Search for hardcoded URLs, ports, timeouts, feature flags, or connection strings in application code. All must come from environment variables.

Common patterns to look for:

// ❌ WRONG — hardcoded
const DB_URL = 'postgresql://localhost:5432/forma3d';

// ✅ CORRECT — from environment
const DB_URL = process.env['DATABASE_URL'];
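
A fail-fast startup guard keeps missing configuration from surfacing as a runtime error deep inside a request. A dependency-free sketch (the variable list is illustrative — use the categorized tables below):

const REQUIRED = ['DATABASE_URL', 'REDIS_URL', 'SESSION_SECRET'] as const;

for (const name of REQUIRED) {
  if (!process.env[name]) {
    // Crash at boot, not on first use
    throw new Error(`Missing required environment variable: ${name}`);
  }
}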

2. Categorize environment variables

Create a documented mapping of all environment variables into two categories:

Non-sensitive (→ ConfigMap):

| Variable | Description | Example |
| --- | --- | --- |
| NODE_ENV | Environment name | staging |
| APP_PORT | Service port | 3000 |
| LOG_LEVEL | Log verbosity | info |
| RATE_LIMIT_DEFAULT | Default rate limit | 10000 |

Sensitive (→ Secret):

| Variable | Description |
| --- | --- |
| DATABASE_URL | PostgreSQL connection string |
| REDIS_URL | Redis connection string |
| SESSION_SECRET | Cookie signing secret |
| INTERNAL_API_KEY | Inter-service auth key |
| SENTRY_DSN | Sentry data source name |
| SHOPIFY_* | Shopify OAuth credentials |
| SENDCLOUD_* | Sendcloud API credentials |
| SIMPLYPRINT_* | SimplyPrint API credentials |

3. Create a configuration reference document

Create docs/05-deployment/configuration-reference.md listing every environment variable, its purpose, default value, and whether it's sensitive.


Phase 6: Resource Constraints (1 hour)

Priority: P2 | Impact: Medium | Dependencies: None

Add resource limits to Docker Compose services. These translate directly to Kubernetes resource requests and limits.

1. Add deploy.resources to each service

services:
  gateway:
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 256M

| Service | CPU Request | CPU Limit | Memory Request | Memory Limit |
| --- | --- | --- | --- | --- |
| Gateway | 0.25 | 0.50 | 256M | 512M |
| Order Service | 0.25 | 0.50 | 256M | 512M |
| Print Service | 0.15 | 0.30 | 192M | 384M |
| Shipping Service | 0.15 | 0.30 | 192M | 384M |
| GridFlock Service | 0.25 | 0.50 | 256M | 512M |
| Slicer | 0.50 | 1.00 | 512M | 1024M |
| Web (static) | 0.10 | 0.25 | 64M | 128M |
| Redis | 0.15 | 0.30 | 128M | 256M |
| Traefik | 0.15 | 0.30 | 128M | 256M |
| ClickHouse | 0.50 | 1.00 | 512M | 1536M |
| Grafana | 0.15 | 0.30 | 128M | 256M |
| OTel Collector | 0.15 | 0.30 | 128M | 256M |
| Uptime Kuma | 0.10 | 0.25 | 128M | 256M |
| Dozzle | 0.05 | 0.15 | 64M | 128M |

Adjust based on observed usage via docker stats.


Phase 7: Stateless Service Verification (1 hour)

Priority: P1 | Impact: High | Dependencies: None

For Kubernetes horizontal scaling, all application services must be stateless. State must live in external stores (PostgreSQL, Redis, S3).

1. Verify no local file storage

Check that no service writes to the local filesystem for state that needs to persist. Temporary files (e.g., STL processing in GridFlock/Slicer) should use /tmp and be cleaned up after processing.
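
A sketch of the throwaway-directory pattern for STL processing (the helper name is illustrative):

import { mkdtemp, rm } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Run work inside a fresh temp dir and always delete it afterwards,
// so no replica accumulates local state.
async function withTempDir<T>(fn: (dir: string) => Promise<T>): Promise<T> {
  const dir = await mkdtemp(join(tmpdir(), 'stl-'));
  try {
    return await fn(dir);
  } finally {
    await rm(dir, { recursive: true, force: true });
  }
}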

2. Verify session storage uses Redis

Sessions must be stored in Redis (not in-memory). The Gateway already uses Redis for sessions — verify this is consistently applied.

3. Verify no in-memory caches that require consistency

If any service maintains in-memory caches, they must tolerate cache inconsistency across replicas or be moved to Redis.

4. Document stateful dependencies

| Dependency | Type | Location | K8s Strategy |
| --- | --- | --- | --- |
| PostgreSQL | Database | DigitalOcean Managed DB | External (no migration needed) |
| Redis | Cache / Sessions / Queues | Docker container | DigitalOcean Managed Redis or StatefulSet |
| ClickHouse | Observability / Analytics DB | Docker container (volume) | StatefulSet with PVC or ClickHouse Cloud |
| Grafana | Dashboards / Datasource config | Docker container (volume) | StatefulSet with PVC or Grafana Cloud |
| Uptime Kuma | Monitor state / history | Docker container (volume) | StatefulSet with PVC |
| Let's Encrypt certs | TLS | Traefik volume | cert-manager in K8s |
| Uploaded files | STL files | Temporary local | DigitalOcean Spaces (S3) in future |

Phase 8: Docker Compose Networking Alignment (1 hour)

Priority: P2 | Impact: Medium | Dependencies: None

Kubernetes uses Service objects for service discovery (DNS-based: <service-name>.<namespace>.svc.cluster.local). Docker Compose already uses DNS-based service discovery within the network. Ensure the naming is consistent.

1. Verify service names match container references

In the Gateway's environment variables, downstream services are referenced as:

ORDER_SERVICE_URL=http://order-service:3001
PRINT_SERVICE_URL=http://print-service:3002
SHIPPING_SERVICE_URL=http://shipping-service:3003
GRIDFLOCK_SERVICE_URL=http://gridflock-service:3004

These names must match the Docker Compose service names exactly. In Kubernetes, these will become Kubernetes Service names — the URL pattern stays identical.

The observability pipeline also uses DNS-based service discovery:

# OTel Collector → ClickHouse
CLICKHOUSE_ENDPOINT=http://clickhouse:8123

# Application services → OTel Collector
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317

# Grafana → ClickHouse (via provisioned datasource)
# Configured in grafana/provisioning/datasources/
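
For reference, a provisioned ClickHouse datasource might look like this sketch (the plugin id and jsonData keys are assumptions — confirm against the installed plugin's documentation):

apiVersion: 1
datasources:
  - name: ClickHouse
    type: grafana-clickhouse-datasource   # assumed plugin id
    access: proxy
    jsonData:
      host: clickhouse   # Compose service name; becomes the K8s Service name later
      port: 9000
      protocol: native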

2. Use environment variables for all service URLs

Never hardcode inter-service URLs. Always use environment variables so the values can be changed for Kubernetes Service discovery:

# Docker Compose
ORDER_SERVICE_URL=http://order-service:3001

# Kubernetes (same pattern, different port if needed)
ORDER_SERVICE_URL=http://order-service.forma3d.svc.cluster.local:3001

Phase 9: Multi-Replica Readiness (2 hours)

Priority: P1 | Impact: High | Dependencies: Phase 4, Phase 7

Verify and configure all application services so they can run as multiple replicas behind Traefik (Docker Compose) and later behind a Kubernetes Ingress. This goes beyond statelessness verification — it validates actual concurrent execution.

1. Add deploy.replicas to Docker Compose

Add explicit replica counts (default 1) to all application services so they can be scaled up trivially:

services:
  gateway:
    deploy:
      replicas: 1   # Scale with: docker compose up -d --scale gateway=3
      resources:
        # ... (existing resource constraints)

2. Configure Traefik load-balancing across replicas

Traefik auto-discovers Docker containers by label. Verify that scaling up a service (e.g., docker compose up -d --scale gateway=3) results in Traefik distributing traffic across all replicas. Key considerations:

  • Do NOT expose ports: on application services. Use expose: instead so replicas don't fight over host ports. Only Traefik should map to host ports 80/443.
  • Verify Traefik labels use the service name, not a container name, so all replicas are included in the backend pool.
  • Add a round-robin or least-connections load-balancing strategy in Traefik's dynamic config if needed.

services:
  gateway:
    # ❌ WRONG — blocks scaling
    # ports:
    #   - "3000:3000"
    # ✅ CORRECT — allows multiple replicas
    expose:
      - "3000"
    labels:
      - "traefik.http.services.gateway.loadbalancer.server.port=3000"

3. Validate BullMQ worker concurrency

When running multiple replicas of a service with BullMQ workers, jobs are naturally distributed across workers (BullMQ uses Redis-based locking). Verify:

  • No duplicate processing: Two replicas must not process the same job. BullMQ handles this natively — verify no custom job-fetch logic bypasses it.
  • Worker concurrency settings: Ensure concurrency is set per-worker (not globally) so each replica processes its fair share.
  • Job events / progress: If the Gateway subscribes to job events via QueueEvents, ensure this works correctly with multiple producer replicas.
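
A worker-side sketch of these settings (queue name and processor are placeholders):

import { Worker } from 'bullmq';

// Concurrency is per replica: 2 replicas x concurrency 5 = up to 10 jobs in parallel.
// BullMQ's Redis-based locks ensure each job is picked up by exactly one worker.
const worker = new Worker('order-processing', processJob, {
  connection: { host: process.env['REDIS_HOST'] ?? 'redis', port: 6379 },
  concurrency: 5,
});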

4. Handle WebSocket sticky sessions

If any service uses WebSocket connections (Socket.IO for real-time events), multiple replicas require sticky sessions to ensure the WebSocket upgrade request reaches the same backend that holds the socket state.

Traefik supports sticky sessions via cookies:

labels:
  - "traefik.http.services.events.loadbalancer.sticky.cookie=true"
  - "traefik.http.services.events.loadbalancer.sticky.cookie.name=server_id"
  - "traefik.http.services.events.loadbalancer.sticky.cookie.httponly=true"

Socket.IO must also be configured to use the Redis adapter so pub/sub events propagate across replicas:

import { createAdapter } from '@socket.io/redis-adapter';
const pubClient = redisClient.duplicate(); // duplicate an existing Redis client
const subClient = redisClient.duplicate();
io.adapter(createAdapter(pubClient, subClient));

5. Test multi-replica operation

Run a manual scaling test for each application service:

# Scale up
docker compose up -d --scale gateway=2 --scale order-service=2

# Verify all replicas are healthy
docker compose ps

# Verify Traefik routes to all replicas
for i in $(seq 1 10); do
  curl -s https://staging-connect-api.forma3d.be/health/live \
    -o /dev/null -w "%{remote_ip}\n"
done

# Scale back down
docker compose up -d --scale gateway=1 --scale order-service=1

Document which services can and cannot be scaled (e.g., the Slicer may have constraints around GPU or temp file cleanup).


Phase 10: Rolling Update Strategy (1 hour)

Priority: P1 | Impact: High | Dependencies: Phase 3, Phase 4, Phase 9

Define and test a rolling update procedure for Docker Compose that achieves zero-downtime deployments. This same procedure translates directly to Kubernetes Deployment rolling update strategy.

1. Add deploy.update_config to Docker Compose

Configure rolling update behavior for each service:

services:
  gateway:
    deploy:
      replicas: 1
      update_config:
        parallelism: 1       # Update one replica at a time
        delay: 10s            # Wait 10s between replica updates
        order: start-first    # Start new replica before stopping old one
        failure_action: rollback
      rollback_config:
        parallelism: 1
        order: start-first

order: start-first is critical — it ensures the new container is healthy before the old one is removed, maintaining service availability throughout the update.

2. Document the rolling update procedure

Create a step-by-step update runbook:

# 1. Pull latest images
docker compose -f deployment/staging/docker-compose.yml pull

# 2. Rolling update — application services one at a time (no full restart)
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps gateway
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps order-service
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps print-service
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps shipping-service
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps gridflock-service
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps slicer

# 3. Verify health after each service update
curl -sf https://staging-connect-api.forma3d.be/health/ready || echo "UNHEALTHY"

# 4. Update the web frontend (stateless, fast restart)
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps web

# 5. Update observability & monitoring services
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps clickhouse
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps otel-collector
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps grafana
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps uptime-kuma
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps dozzle

The --no-deps flag ensures only the target service is recreated — not its dependencies. This prevents cascading restarts.

3. Ensure database migration compatibility

Rolling updates mean two versions of a service run simultaneously (old and new). Database migrations must be backward-compatible:

  • Adding a column: Safe — old code ignores it.
  • Removing a column: Unsafe — old code still queries it. Use a two-phase approach:
      1. Deploy code that stops using the column.
      2. Deploy the migration that removes the column.
  • Renaming a column: Unsafe — treat as add + deprecate + remove.

Document this in the deployment guide as the expand-and-contract migration pattern.

4. Add a pre-deployment health gate

Before starting a rolling update, verify the system is healthy:

#!/usr/bin/env bash
set -euo pipefail

SERVICES=("staging-connect-api.forma3d.be" "staging-connect.forma3d.be")

for svc in "${SERVICES[@]}"; do
  status=$(curl -sf -o /dev/null -w "%{http_code}" "https://${svc}/health/ready" || echo "000")
  if [ "$status" != "200" ]; then
    echo "❌ Pre-deploy health check failed for ${svc} (HTTP ${status}). Aborting."
    exit 1
  fi
done

echo "✅ All services healthy. Proceeding with rolling update."

5. Test rollback procedure

Document and test the rollback procedure:

# Rollback a single service to a specific image tag
# (docker compose up has no -e flag; override the variable inline instead)
GATEWAY_IMAGE_TAG=git-abc1234 docker compose -f deployment/staging/docker-compose.yml \
  up -d --no-deps gateway

# Or rollback by reverting the .env file and recreating
git checkout HEAD~1 -- deployment/staging/.env
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps gateway

Phase 11: Local Development Experience with Rancher Desktop + Tilt (3–4 hours)

Priority: P2 | Impact: High | Dependencies: Phase 3, Phase 4, Phase 5

Give every developer a one-command local environment that mirrors production networking, service discovery, and container runtime. The goal: git clone, tilt up, start coding.

Why Rancher Desktop + Tilt

| Tool | Role |
| --- | --- |
| Rancher Desktop | Desktop application that provides a local Kubernetes cluster (K3s under the hood), a container runtime (containerd or dockerd), and kubectl — one install replaces Docker Desktop + k3d + ctlptl |
| Tilt | Watches source files, live-syncs into running containers, manages builds and port-forwards, and provides a dashboard at localhost:10350 |
| Traefik Mesh | Lightweight service mesh — automatic mTLS, request metrics, no sidecars. Same Helm chart reused in staging and production on DOKS |
| KubeView | Real-time graphical visualization of cluster resources and their relationships at localhost:8000 |

The existing pnpm dev workflow (Nx parallel serve) remains available for quick, lightweight iteration. tilt up is the production-parity alternative.

| Aspect | pnpm dev | tilt up |
| --- | --- | --- |
| Infra setup | Manual (install PostgreSQL, Redis) | Automatic (K8s provisions everything) |
| Service discovery | localhost + hardcoded ports | K8s DNS (matches production) |
| Service mesh | None | Traefik Mesh (mTLS, metrics — same as staging/prod) |
| Hot reload | Nx watch mode | Tilt live_update (file sync into containers) |
| Debugging | Direct (same process) | Remote attach via --inspect port |
| Cluster visibility | None | KubeView (real-time resource graph) |
| Production parity | Low (no containers) | High (same images, same networking, same mesh) |

1. Prerequisites

Developers need two tools installed (plus one optional tool for webhook testing):

  1. Rancher Desktop — Provides the local Kubernetes cluster, container runtime, and kubectl. Download from the website or install via Homebrew:

# macOS
brew install --cask rancher

# Linux / Windows: download from https://rancherdesktop.io

  2. Tilt — Orchestrates the development workflow:

# macOS
brew install tilt-dev/tap/tilt

# Linux
curl -fsSL https://raw.githubusercontent.com/tilt-dev/tilt/master/scripts/install.sh | bash

  3. cloudflared (optional) — Exposes localhost:3000 to the internet so Shopify, SimplyPrint, and SendCloud can deliver real webhooks during local development. Not needed for most development — curl simulation is sufficient:

# macOS
brew install cloudflared

# Linux
curl -fsSL https://pkg.cloudflare.com/cloudflare-main.gpg | sudo tee /usr/share/keyrings/cloudflare-main.gpg >/dev/null
echo 'deb [signed-by=/usr/share/keyrings/cloudflare-main.gpg] https://pkg.cloudflare.com/cloudflared any main' | sudo tee /etc/apt/sources.list.d/cloudflared.list
sudo apt update && sudo apt install cloudflared

Rancher Desktop configuration:

After installing, open Rancher Desktop and verify these settings:

  • Kubernetes enabled (on by default)
  • Container runtime: dockerd (moby) — required for Tilt's docker_build() to work
  • Kubernetes version: 1.29+ recommended

Rancher Desktop bundles kubectl and manages the kubeconfig automatically. No other prerequisites — PostgreSQL and Redis run inside the cluster.

2. Cluster setup — Rancher Desktop

No cluster definition file is needed. Rancher Desktop provides the Kubernetes cluster out of the box. Verify the cluster is running:

kubectl cluster-info
# Should show: Kubernetes control plane running at https://127.0.0.1:6443

kubectl get nodes
# Should show: rancher-desktop   Ready

Rancher Desktop's built-in K3s has Traefik installed by default. This doesn't conflict with local development — Tilt handles port-forwarding directly to pods, bypassing any in-cluster ingress.

Image builds: Tilt's docker_build() uses the local Docker daemon provided by Rancher Desktop (when configured with dockerd (moby) runtime). Built images are available to the cluster immediately — no separate registry needed.

3. Kubernetes manifests — k8s/dev/

Create lightweight manifests for local development only. These are NOT the production Kubernetes manifests (those come later during the actual DOKS migration).

k8s/dev/namespace.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: forma3d-dev

k8s/dev/postgres.yaml — Single-node PostgreSQL with a PersistentVolumeClaim:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: forma3d-dev
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  namespace: forma3d-dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16-alpine
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_DB
              value: forma3d
            - name: POSTGRES_USER
              value: forma3d
            - name: POSTGRES_PASSWORD
              value: forma3d_dev
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
          readinessProbe:
            exec:
              command: [pg_isready, -U, forma3d]
            periodSeconds: 5
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: postgres-data
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: forma3d-dev
spec:
  selector:
    app: postgres
  ports:
    - port: 5432

k8s/dev/redis.yaml — Single-node Redis:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: forma3d-dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          readinessProbe:
            exec:
              command: [redis-cli, ping]
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: forma3d-dev
spec:
  selector:
    app: redis
  ports:
    - port: 6379

k8s/dev/configmap.yaml — Shared configuration for all application services:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: forma3d-dev
data:
  NODE_ENV: development
  LOG_LEVEL: debug
  DATABASE_URL: postgresql://forma3d:forma3d_dev@postgres.forma3d-dev.svc.cluster.local:5432/forma3d?schema=public
  REDIS_URL: redis://redis.forma3d-dev.svc.cluster.local:6379
  ORDER_SERVICE_URL: http://order-service.forma3d-dev.svc.cluster.local:3001
  PRINT_SERVICE_URL: http://print-service.forma3d-dev.svc.cluster.local:3002
  SHIPPING_SERVICE_URL: http://shipping-service.forma3d-dev.svc.cluster.local:3003
  GRIDFLOCK_SERVICE_URL: http://gridflock-service.forma3d-dev.svc.cluster.local:3004
  GATEWAY_URL: http://gateway.forma3d-dev.svc.cluster.local:3000
  API_URL: http://localhost:3000
  WEB_URL: http://localhost:4200

k8s/dev/secret.yaml.example — Template for sensitive values (not committed to git):

apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
  namespace: forma3d-dev
type: Opaque
stringData:
  SESSION_SECRET: local-dev-session-secret
  INTERNAL_API_KEY: local-dev-internal-api-key
  # Add your external service credentials below:
  # SHOPIFY_CLIENT_ID: ""
  # SHOPIFY_CLIENT_SECRET: ""
  # SIMPLYPRINT_API_KEY: ""
  # SENDCLOUD_PUBLIC_KEY: ""
  # SENDCLOUD_SECRET_KEY: ""
  # SENTRY_DSN: ""

Per-service manifests — Create one file per application service following this pattern (example for Gateway, repeat for order-service, print-service, shipping-service, gridflock-service):

k8s/dev/gateway.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway
  namespace: forma3d-dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gateway
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: gateway
          image: forma3d-connect-gateway
          ports:
            - containerPort: 3000
              name: http
            - containerPort: 9229
              name: debug
          envFrom:
            - configMapRef:
                name: app-config
            - secretRef:
                name: app-secrets
          env:
            - name: APP_PORT
              value: "3000"
          readinessProbe:
            httpGet:
              path: /health/live
              port: 3000
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: gateway
  namespace: forma3d-dev
spec:
  selector:
    app: gateway
  ports:
    - name: http
      port: 3000
    - name: debug
      port: 9229

Repeat the pattern for each backend service (order-service on 3001/9230, print-service on 3002/9231, shipping-service on 3003/9232, gridflock-service on 3004/9233). Each service gets its own debug port and APP_PORT env override (since the shared configmap cannot hold per-service port values). Adjust the readinessProbe path to /health for the downstream services (they use /health instead of /health/live).
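
For example, the order-service container spec differs from the gateway's only in these fields (excerpt of the Deployment template, following the pattern above):

      containers:
        - name: order-service
          image: forma3d-connect-order-service
          ports:
            - containerPort: 3001
              name: http
            - containerPort: 9230
              name: debug
          envFrom:
            - configMapRef:
                name: app-config
            - secretRef:
                name: app-secrets
          env:
            - name: APP_PORT
              value: "3001"
          readinessProbe:
            httpGet:
              path: /health
              port: 3001
            periodSeconds: 5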

k8s/dev/web.yaml — The React app runs vite dev (not nginx) for HMR:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: forma3d-dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: forma3d-connect-web-dev
          ports:
            - containerPort: 4200
          env:
            - name: VITE_API_URL
              value: http://localhost:3000
---
apiVersion: v1
kind: Service
metadata:
  name: web
  namespace: forma3d-dev
spec:
  selector:
    app: web
  ports:
    - port: 4200

k8s/dev/kubeview.yaml — Cluster visualization tool (KubeView):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubeview
  namespace: forma3d-dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kubeview
  template:
    metadata:
      labels:
        app: kubeview
    spec:
      serviceAccountName: kubeview
      containers:
        - name: kubeview
          image: ghcr.io/benc-uk/kubeview:latest
          ports:
            - containerPort: 8000
          env:
            - name: SINGLE_NAMESPACE
              value: forma3d-dev
            - name: NAMESPACE_FILTER
              value: "^kube-"
---
apiVersion: v1
kind: Service
metadata:
  name: kubeview
  namespace: forma3d-dev
spec:
  selector:
    app: kubeview
  ports:
    - port: 8000
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kubeview
  namespace: forma3d-dev
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubeview-reader
rules:
  - apiGroups: ["", "apps", "batch", "networking.k8s.io", "discovery.k8s.io", "autoscaling"]
    resources: ["*"]
    verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubeview-reader-binding
subjects:
  - kind: ServiceAccount
    name: kubeview
    namespace: forma3d-dev
roleRef:
  kind: ClusterRole
  name: kubeview-reader
  apiGroup: rbac.authorization.k8s.io

KubeView provides a real-time graphical view of pods, deployments, services, and their relationships at http://localhost:8000. Useful for understanding how Kubernetes resources map to the application architecture.

Traefik Mesh — Installed via Helm in the Tiltfile (not a static manifest). Traefik Mesh is a lightweight service mesh that provides:

  • Automatic mTLS between services
  • Request-level metrics and tracing
  • SMI (Service Mesh Interface) support for traffic policies

To opt a service into the mesh, add the label mesh.traefik.io/enabled: "true" to its Service object. For example, in k8s/dev/gateway.yaml:

apiVersion: v1
kind: Service
metadata:
  name: gateway
  namespace: forma3d-dev
  labels:
    mesh.traefik.io/enabled: "true"
spec:
  selector:
    app: gateway
  ports:
    - name: http
      port: 3000
    - name: debug
      port: 9229

The same Traefik Mesh Helm chart and service labels will be reused in staging and production on DigitalOcean Managed Kubernetes (DOKS), giving consistent service-to-service networking across all environments.

4. Development Dockerfiles

Create lightweight development Dockerfiles that support Tilt's live_update (file sync instead of full rebuild):

.dockerignore — Create a .dockerignore in the project root to keep the build context small (Tilt's docker_build() uses context='.'):

node_modules
dist
build
.nx
.git
.vscode
.cursor
coverage
*.log
.DS_Store
docs
load-tests
.specstory

apps/gateway/Dockerfile.dev (same pattern for all NestJS services — adjust the service name and --inspect port):

FROM node:20-alpine
WORKDIR /app
RUN apk add --no-cache openssl
RUN corepack enable && corepack prepare pnpm@9 --activate
COPY package.json pnpm-lock.yaml ./
COPY prisma ./prisma/
RUN pnpm install --frozen-lockfile
RUN pnpm prisma generate
COPY . .
CMD ["node_modules/.bin/tsx", "watch", "--inspect=0.0.0.0:9229", "apps/gateway/src/main.ts"]

Using tsx watch instead of the full Nx build pipeline — starts in seconds, restarts on file changes synced by Tilt. The --inspect flag enables remote debugging via VS Code (port per service: gateway=9229, order-service=9230, print-service=9231, shipping-service=9232, gridflock-service=9233). The openssl package is required by Prisma on Alpine. The prisma/ directory is copied separately before pnpm install so that prisma generate can run against the schema.

apps/web/Dockerfile.dev:

FROM node:20-alpine
WORKDIR /app
RUN corepack enable && corepack prepare pnpm@9 --activate
COPY package.json pnpm-lock.yaml ./
COPY prisma ./prisma/
RUN pnpm install --frozen-lockfile
COPY . .
EXPOSE 4200
CMD ["node_modules/.bin/vite", "--host", "0.0.0.0", "--port", "4200", "apps/web"]

5. Tiltfile

Create Tiltfile in the project root:

# ---------------------------------------------------------------------------
# Forma3D.Connect — Local Development with Rancher Desktop + Tilt
# Usage: tilt up
# Dashboard: http://localhost:10350
# ---------------------------------------------------------------------------

load('ext://namespace', 'namespace_create')

# --- Cluster bootstrap ---
namespace_create('forma3d-dev')

# --- Infrastructure (PostgreSQL + Redis) ---
k8s_yaml('k8s/dev/postgres.yaml')
k8s_yaml('k8s/dev/redis.yaml')
k8s_yaml('k8s/dev/configmap.yaml')

# Apply secrets (developer must copy secret.yaml.example → secret.yaml)
if os.path.exists('k8s/dev/secret.yaml'):
    k8s_yaml('k8s/dev/secret.yaml')
else:
    fail('k8s/dev/secret.yaml not found. Copy k8s/dev/secret.yaml.example and fill in your values.')

k8s_resource('postgres', port_forwards=['5432:5432'],
             labels=['infra'])
k8s_resource('redis', port_forwards=['6379:6379'],
             labels=['infra'])

# --- Prisma migrations (runs after postgres is ready) ---
local_resource('prisma-migrate',
    cmd='pnpm prisma migrate deploy',
    resource_deps=['postgres'],
    labels=['setup'])

local_resource('prisma-seed',
    cmd='pnpm prisma db seed',
    resource_deps=['prisma-migrate'],
    auto_init=False,  # manual trigger via Tilt UI button
    labels=['setup'])

# --- Backend services ---
BACKEND_SERVICES = {
    'gateway':          {'port': 3000, 'debug': 9229},
    'order-service':    {'port': 3001, 'debug': 9230},
    'print-service':    {'port': 3002, 'debug': 9231},
    'shipping-service': {'port': 3003, 'debug': 9232},
    'gridflock-service':{'port': 3004, 'debug': 9233},
}

for svc, cfg in BACKEND_SERVICES.items():
    docker_build(
        'forma3d-connect-' + svc,
        context='.',
        dockerfile='apps/' + svc + '/Dockerfile.dev',
        live_update=[
            sync('apps/' + svc + '/src', '/app/apps/' + svc + '/src'),
            sync('libs/', '/app/libs/'),
            sync('prisma/', '/app/prisma/'),
        ],
    )
    k8s_yaml('k8s/dev/' + svc + '.yaml')
    k8s_resource(svc,
        port_forwards=[
            str(cfg['port']) + ':' + str(cfg['port']),
            str(cfg['debug']) + ':' + str(cfg['debug']),
        ],
        resource_deps=['prisma-migrate'],
        labels=['backend'])

# --- Web (React + Vite HMR) ---
docker_build(
    'forma3d-connect-web-dev',
    context='.',
    dockerfile='apps/web/Dockerfile.dev',
    live_update=[
        sync('apps/web/src', '/app/apps/web/src'),
        sync('libs/', '/app/libs/'),
    ],
)
k8s_yaml('k8s/dev/web.yaml')
k8s_resource('web',
    port_forwards=['4200:4200'],
    resource_deps=['gateway'],
    labels=['frontend'])

# --- Traefik Mesh (service mesh — same in dev, staging, production) ---
local_resource('traefik-mesh-install',
    cmd='helm repo add traefik https://traefik.github.io/charts --force-update && '
        'helm upgrade --install traefik-mesh traefik/traefik-mesh '
        '--namespace forma3d-dev --wait',
    resource_deps=['postgres'],  # ensure namespace exists
    labels=['mesh'])

# --- KubeView (cluster visualization) ---
k8s_yaml('k8s/dev/kubeview.yaml')
k8s_resource('kubeview',
    port_forwards=['8000:8000'],
    labels=['tools'])

# --- Cloudflare Tunnel (webhook testing — on-demand) ---
# Exposes localhost:3000 (Gateway) to the internet so external services
# (Shopify, SimplyPrint, SendCloud) can deliver webhooks during local dev.
# Start manually from the Tilt dashboard when testing webhook flows.
local_resource('tunnel',
    serve_cmd='cloudflared tunnel --url http://localhost:3000',
    auto_init=False,
    labels=['tools'])

6. VS Code debugging

Add launch configurations for attaching to running services inside the cluster. Create or extend .vscode/launch.json:

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Attach: Gateway (Tilt)",
      "type": "node",
      "request": "attach",
      "port": 9229,
      "restart": true,
      "sourceMaps": true,
      "localRoot": "${workspaceFolder}",
      "remoteRoot": "/app"
    },
    {
      "name": "Attach: Order Service (Tilt)",
      "type": "node",
      "request": "attach",
      "port": 9230,
      "restart": true,
      "sourceMaps": true,
      "localRoot": "${workspaceFolder}",
      "remoteRoot": "/app"
    },
    {
      "name": "Attach: Print Service (Tilt)",
      "type": "node",
      "request": "attach",
      "port": 9231,
      "restart": true,
      "sourceMaps": true,
      "localRoot": "${workspaceFolder}",
      "remoteRoot": "/app"
    },
    {
      "name": "Attach: Shipping Service (Tilt)",
      "type": "node",
      "request": "attach",
      "port": 9232,
      "restart": true,
      "sourceMaps": true,
      "localRoot": "${workspaceFolder}",
      "remoteRoot": "/app"
    },
    {
      "name": "Attach: GridFlock Service (Tilt)",
      "type": "node",
      "request": "attach",
      "port": 9233,
      "restart": true,
      "sourceMaps": true,
      "localRoot": "${workspaceFolder}",
      "remoteRoot": "/app"
    }
  ]
}

To enable debugging, change the CMD in the service's Dockerfile.dev to include --inspect=0.0.0.0:9229 (adjusting the port per service), or add an environment variable toggle:

CMD ["node_modules/.bin/tsx", "watch", "--inspect=0.0.0.0:9229", "apps/gateway/src/main.ts"]

7. Developer workflow

First time setup:

# 1. Install Rancher Desktop (see Prerequisites) and ensure Kubernetes is running
kubectl cluster-info                                  # verify cluster is ready

# 2. Clone and install
git clone <repo-url> && cd forma-3d-connect
pnpm install                                          # install dependencies
cp k8s/dev/secret.yaml.example k8s/dev/secret.yaml    # add your API keys

# 3. Start
tilt up                                                # start everything

Daily development:

tilt up     # Rancher Desktop starts on login, cluster is always ready
# Edit files in apps/ or libs/ — changes sync into containers in <2 seconds
# Open http://localhost:4200 (web), http://localhost:3000 (API)
# Tilt dashboard at http://localhost:10350
# KubeView at http://localhost:8000 (cluster visualization)
tilt down   # stop all services (cluster persists)

Resetting the environment:

tilt down
kubectl delete namespace forma3d-dev    # remove all dev resources
tilt up                                 # recreate everything fresh

8. Webhook testing during local development

The application receives inbound webhooks from three external services:

| Provider | Webhook Path | Routed To |
| --- | --- | --- |
| Shopify | /api/v1/webhooks/shopify | order-service |
| SimplyPrint | /webhooks/simplyprint | print-service |
| SendCloud | /webhooks/sendcloud | shipping-service |

These external services cannot reach localhost. When testing flows that depend on real webhook delivery, a tunnel is needed to expose the Gateway to the internet.

Option A: Cloudflare Tunnel (recommended for real webhook testing)

Install cloudflared:

# macOS
brew install cloudflared

# Linux
curl -fsSL https://pkg.cloudflare.com/cloudflare-main.gpg | sudo tee /usr/share/keyrings/cloudflare-main.gpg >/dev/null
echo 'deb [signed-by=/usr/share/keyrings/cloudflare-main.gpg] https://pkg.cloudflare.com/cloudflared any main' | sudo tee /etc/apt/sources.list.d/cloudflared.list
sudo apt update && sudo apt install cloudflared

The Tiltfile includes a tunnel resource (disabled by default). Start it from the Tilt dashboard when needed — it prints a temporary public URL (e.g., https://<random>.trycloudflare.com). Configure the external service's webhook URL to point to this tunnel URL:

# Shopify: Update webhook URL in Shopify Partner Dashboard or via API
# SimplyPrint: Update webhook URL in SimplyPrint dashboard
# SendCloud: Update webhook URL in SendCloud panel

# Example: verify the tunnel forwards to the Gateway
curl https://<tunnel-url>/health/live

Option B: Simulate webhooks with curl (no tunnel needed)

For most development, simulating webhook payloads locally is faster and doesn't require internet access. Send signed requests directly to the Gateway:

# Simulate a Shopify order creation webhook
curl -X POST http://localhost:3000/api/v1/webhooks/shopify \
  -H "Content-Type: application/json" \
  -H "X-Shopify-Topic: orders/create" \
  -H "X-Shopify-Hmac-Sha256: <computed-hmac>" \
  -H "X-Shopify-Shop-Domain: your-shop.myshopify.com" \
  -d @apps/order-service/src/shopify/__tests__/fixtures/order-created.json

# Simulate a SimplyPrint job status webhook
curl -X POST http://localhost:3000/webhooks/simplyprint \
  -H "Content-Type: application/json" \
  -d '{"event": "job.status_changed", "data": {"job_id": "123", "status": "completed"}}'

# Simulate a SendCloud parcel status webhook
curl -X POST http://localhost:3000/webhooks/sendcloud \
  -H "Content-Type: application/json" \
  -d '{"action": "parcel_status_changed", "parcel": {"id": 123, "status": {"id": 11}}}'

To bypass HMAC signature verification during local development, set WEBHOOK_SKIP_VERIFICATION=true in k8s/dev/configmap.yaml (this env variable must only be respected when NODE_ENV=development).
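
A sketch of that guard (the function name is illustrative) — the skip path must be unreachable outside development:

function shouldSkipWebhookVerification(): boolean {
  return (
    process.env['NODE_ENV'] === 'development' &&
    process.env['WEBHOOK_SKIP_VERIFICATION'] === 'true'
  );
}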

When to use which:

| Scenario | Use |
| --- | --- |
| Testing webhook handler logic in isolation | Option B (curl) |
| Testing the full end-to-end flow with a real external service | Option A (Cloudflare Tunnel) |
| CI / automated tests | Option B (mock payloads in test fixtures) |

9. Excluded services

The following staging-only services are NOT included in the local development setup:

| Service | Reason |
| --- | --- |
| Slicer (BambuStudio) | Requires specific Linux binaries; optional for most development |
| ClickHouse | Observability infra, not needed for feature development |
| Grafana | Observability infra |
| Uptime Kuma | Monitoring infra |
| Dozzle | Log viewer — Tilt provides its own log aggregation |
| OTel Collector | Observability pipeline |
| Traefik Proxy (Ingress) | Tilt port-forwarding replaces the reverse proxy for local dev |

Included dev tools: KubeView (cluster visualization at localhost:8000) and Traefik Mesh (service mesh — same Helm chart reused in staging/production).

If a developer needs the Slicer locally, they can run it standalone via Docker alongside the Tilt-managed cluster:

docker run -p 3010:3010 -v $(pwd)/deployment/slicer/profiles:/profiles \
  forma3d-connect-slicer:latest

📊 Migration Path Summary

| Step | When | What Happens | DNS Impact |
| --- | --- | --- | --- |
| Now (this prompt) | Today | Lower DNS TTLs to 60s, standardize health checks, externalize config | TTL changes only |
| Multi-tenancy | Next quarter | System grows, consider DOKS | None |
| DOKS setup | When needed | Create DOKS cluster, deploy services, create DO Load Balancer | None |
| Cut-over | Migration day | Update DNS A records from Droplet IP to LB IP | DNS update (propagates in <60s) |
| Cleanup | Post-migration | Decommission Droplet | None |

✅ Validation Checklist

DNS Preparation

  • DNS TTL lowered to 60s on all staging A records
  • DNS resolution verified (dig shows correct IP and low TTL)
  • Droplet IP and datacenter documented in deployment docs and .env.example
  • Migration cut-over runbook documented
  • All services still accessible after TTL changes (TLS valid, health checks pass)

Container Registry

  • All services pull from DOCR (${REGISTRY_URL}/forma3d-connect-*)
  • CI pipeline tags images with git-<sha> and environment tags
  • pull_policy: always set on all application services in Docker Compose

Health Checks

  • All backend services expose GET /health/live (200 OK)
  • All backend services expose GET /health/ready (200 OK when healthy, 503 when not)
  • Docker Compose health checks use HTTP endpoints consistently
  • Health check intervals and thresholds are consistent across services
  • Third-party services (ClickHouse, Grafana, Uptime Kuma, Dozzle, OTel Collector) have functioning health checks

Graceful Shutdown

  • app.enableShutdownHooks() called in all NestJS services
  • BullMQ workers handle SIGTERM (close cleanly)
  • stop_grace_period set on all services (30s for app services, 60s for ClickHouse, 30s for OTel Collector)
  • Verified: docker compose stop completes without SIGKILL for all services

Configuration

  • No hardcoded URLs, ports, or secrets in application code
  • All environment variables documented with sensitivity classification
  • Configuration reference document created
  • .env.example is complete and up-to-date

Resource Constraints

  • deploy.resources (limits + reservations) set for all services (including ClickHouse, Grafana, OTel Collector, Uptime Kuma, Dozzle)
  • Resource values validated against docker stats observations

Statelessness

  • No application service writes persistent state to local filesystem
  • Sessions stored in Redis (not in-memory)
  • No singleton in-memory caches that break with multiple replicas
  • Stateful dependencies documented with Kubernetes migration strategy (including ClickHouse, Grafana, Uptime Kuma volumes)

Multi-Replica Readiness

  • expose: used instead of ports: on all application services (only Traefik exposes host ports)
  • deploy.replicas: 1 set explicitly on all application services
  • Traefik load-balances across replicas when scaling up (docker compose up -d --scale gateway=2)
  • BullMQ job processing verified with multiple worker replicas (no duplicate processing)
  • WebSocket sticky sessions configured in Traefik for Socket.IO services
  • Socket.IO Redis adapter configured for cross-replica pub/sub
  • Each application service tested at 2+ replicas (scaled up and back down)
  • Services that cannot be scaled documented with reasoning

Rolling Updates

  • deploy.update_config with order: start-first set on all application services
  • deploy.rollback_config set on all application services
  • Rolling update runbook documented (per-service --no-deps updates, including observability services)
  • Pre-deployment health gate script created and tested
  • Rollback procedure documented and tested
  • Database migration backward-compatibility rules documented (expand-and-contract pattern)
  • CI pipeline updated to support targeted per-service deploys
  • Zero-downtime verified: rolling update completes with no failed health checks

Local Development (Rancher Desktop + Tilt)

  • Rancher Desktop with Kubernetes enabled and dockerd (moby) runtime documented as prerequisite
  • k8s/dev/ contains manifests for namespace, postgres, redis, configmap, secret.yaml.example, and all application services
  • Tiltfile exists in the project root and loads all k8s/dev/ manifests
  • Dev Dockerfiles (Dockerfile.dev) exist for all application services (gateway, order-service, print-service, shipping-service, gridflock-service, web)
  • Dev Dockerfiles include openssl for Prisma on Alpine and --inspect flag for remote debugging
  • .dockerignore exists in the project root to exclude node_modules/, .git/, dist/, etc. from the Docker build context
  • Per-service APP_PORT env override is set in each K8s manifest (not in the shared configmap)
  • tilt up from a clean clone (after pnpm install and secret.yaml setup) starts all services successfully
  • PostgreSQL and Redis are provisioned automatically inside the cluster
  • Prisma migrations run automatically on startup (after postgres is ready)
  • Port-forwards work: localhost:3000 (gateway), localhost:4200 (web), localhost:5432 (postgres), localhost:6379 (redis), localhost:8000 (KubeView)
  • Live-update works: editing a file in apps/<service>/src/ triggers a container restart within 2 seconds
  • Web HMR works: editing a file in apps/web/src/ reflects immediately in the browser
  • Debug ports are accessible: localhost:9229 (gateway), 9230–9233 (other services)
  • VS Code can attach debugger to running services via launch configurations
  • KubeView shows all pods, deployments, and services in forma3d-dev namespace at localhost:8000
  • Traefik Mesh is installed via Helm and running (kubectl get pods -n forma3d-dev shows mesh controller and proxies)
  • Services with mesh.traefik.io/enabled: "true" label are routed through the mesh
  • tilt down cleanly stops all services
  • kubectl delete namespace forma3d-dev cleanly removes all dev resources
  • Existing pnpm dev workflow still works independently
  • k8s/dev/secret.yaml is in .gitignore
  • Tiltfile includes an on-demand tunnel resource (cloudflared tunnel --url http://localhost:3000) with auto_init=False
  • cloudflared documented as an optional prerequisite (only needed for real webhook testing)
  • Webhook simulation with curl documented with example payloads for Shopify, SimplyPrint, and SendCloud
  • WEBHOOK_SKIP_VERIFICATION env variable supported in development mode for local curl testing
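
A minimal Tiltfile sketch consistent with the checklist above (one service shown; image names, paths, and the chart reference are illustrative, and the restart_process and helm_resource Tilt extensions are assumed):

# Tiltfile — minimal sketch (Starlark)
load('ext://restart_process', 'docker_build_with_restart')
load('ext://helm_resource', 'helm_resource', 'helm_repo')

# Load all dev manifests, skipping *.example templates
k8s_yaml([f for f in listdir('k8s/dev') if f.endswith('.yaml')])

# Traefik Mesh via the same Helm chart planned for staging/production
helm_repo('traefik', 'https://traefik.github.io/charts')
helm_resource('traefik-mesh', 'traefik/traefik-mesh', namespace='forma3d-dev')

# Gateway: rebuild image on dependency changes, sync source for fast restarts
docker_build_with_restart(
    'forma3d/gateway-dev',
    context='.',
    dockerfile='apps/gateway/Dockerfile.dev',
    entrypoint=['node', '--inspect=0.0.0.0:9229', 'dist/main.js'],
    live_update=[sync('apps/gateway/src', '/app/apps/gateway/src')],
)
k8s_resource('gateway', port_forwards=['3000:3000', '9229:9229'])

# On-demand Cloudflare tunnel for real webhook testing (off by default)
local_resource(
    'webhook-tunnel',
    serve_cmd='cloudflared tunnel --url http://localhost:3000',
    auto_init=False,
    trigger_mode=TRIGGER_MODE_MANUAL,
)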

Verification Commands

# All services healthy and DNS TTLs lowered
curl -I https://staging-connect.forma3d.be
curl -I https://staging-connect-api.forma3d.be/health/live
curl -I https://staging-connect-api.forma3d.be/health/ready
dig +noall +answer staging-connect.forma3d.be  # TTL is the second field (expect 60)

# Docker Compose validation
docker compose -f deployment/staging/docker-compose.yml config --quiet

# Graceful shutdown test
docker compose stop gateway  # Should stop within 30s without SIGKILL

# Resource usage baseline
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"

# Build passes
pnpm nx run-many -t build --all

# Tests pass
pnpm nx run-many -t test --all --exclude=api-e2e,acceptance-tests
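
Additional spot-checks for the multi-replica and rolling-update items (service name illustrative):

# Scale a service up and back down behind Traefik
docker compose -f deployment/staging/docker-compose.yml up -d --scale gateway=2
docker compose -f deployment/staging/docker-compose.yml ps gateway
docker compose -f deployment/staging/docker-compose.yml up -d --scale gateway=1

# Per-service rolling update without restarting dependencies
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps gateway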

🚫 Constraints and Rules

MUST DO

  • Lower DNS TTLs to 60s on all staging A records (enables fast future cut-over)
  • Document the migration cut-over procedure (DNS update from Droplet IP to LB IP)
  • Verify all services expose HTTP health endpoints (/health/live and /health/ready)
  • Enable graceful shutdown hooks in all NestJS services (a sketch follows this list)
  • Add stop_grace_period to all Docker Compose application services
  • Audit and document all environment variables with sensitivity classification
  • Verify all application services are stateless
  • Add resource constraints to Docker Compose services
  • Use environment variables for all inter-service URLs (no hardcoding)
  • Create k8s/dev/ manifests (including KubeView), Tiltfile, and dev Dockerfiles for local Rancher Desktop + Tilt workflow
  • Install Traefik Mesh via Helm in the Tiltfile (same chart reused in staging/production on DOKS)
  • Verify tilt up starts all services from a clean state (clone + tilt up = working environment)
  • Preserve the existing pnpm dev workflow as a lightweight alternative
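
For the graceful-shutdown item above, a minimal NestJS sketch (module path and port variable are illustrative):

// main.ts — enable shutdown hooks so SIGTERM triggers a clean drain
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app/app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  // Listens for termination signals and runs the app's lifecycle hooks
  // (onModuleDestroy, beforeApplicationShutdown, onApplicationShutdown) before exit
  app.enableShutdownHooks();
  await app.listen(process.env.APP_PORT ?? 3000);
}
bootstrap();

Pair this with stop_grace_period: 30s in Compose so Docker waits for the drain to finish before sending SIGKILL.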

MUST NOT

  • Create any production Kubernetes manifests, Helm charts, or Kustomize configs — not yet (local-dev K8s manifests in k8s/dev/ are fine)
  • Deploy K8s manifests to staging or production — the k8s/dev/ manifests are for local development only
  • Install kubectl, helm, or any Kubernetes tooling on the Droplet
  • Change the Docker Compose deployment workflow
  • Remove or replace Traefik — it stays as the reverse proxy for now
  • Over-engineer for Kubernetes patterns that aren't needed yet (sidecars, custom operators, etc.) — Traefik Mesh is allowed as it's lightweight and non-invasive
  • Break any existing functionality or deployment process
  • Use any, ts-ignore, or eslint-disable

SHOULD DO (Nice to Have)

  • Document the Kubernetes migration path in docs/05-deployment/kubernetes-migration-plan.md
  • Set up DigitalOcean monitoring alerts for DNS records
  • Explore DigitalOcean's App Platform as an intermediate step before full Kubernetes
  • Add a tilt_config.json for per-developer overrides (e.g., enable/disable Slicer, toggle debug ports)
  • Create a CONTRIBUTING.md section documenting the tilt up workflow for new developers

🔄 Rollback Plan

All changes in this prompt are non-destructive:

  1. DNS TTL changes: Lowering TTLs is completely non-destructive. If needed, TTLs can be raised back to their original values.
  2. Health checks: Added endpoints don't affect existing functionality.
  3. Graceful shutdown: enableShutdownHooks() is additive — it doesn't change normal operation.
  4. Resource constraints: Docker Compose v2 enforces deploy.resources limits at runtime (legacy docker-compose v1 ignored them without the --compatibility flag or Swarm mode). If a limit proves too tight, raising or removing it and re-running docker compose up -d restores the previous behavior.
  5. Configuration audit: Documentation-only changes.
  6. Rancher Desktop + Tilt: All local dev files (k8s/dev/, Tiltfile, Dockerfile.dev) are additive. They don't affect staging/production deployment, CI pipeline, or the existing pnpm dev workflow. The Rancher Desktop cluster runs entirely on the developer's machine. KubeView is read-only. Traefik Mesh is opt-in (only affects services with the mesh.traefik.io/enabled label) and will be reused with the same Helm chart in staging/production.

📚 Key References

DigitalOcean:

  • DOCR: https://docs.digitalocean.com/products/container-registry/
  • DOKS: https://docs.digitalocean.com/products/kubernetes/
  • Load Balancers: https://docs.digitalocean.com/products/networking/load-balancers/
  • Reserved IPs (Droplet-only): https://docs.digitalocean.com/products/networking/reserved-ips/ — note: cannot be assigned to Load Balancers, only Droplets

Kubernetes Migration:

  • Docker Compose to Kubernetes: https://kubernetes.io/docs/tasks/configure-pod-container/translate-compose-kubernetes/
  • Kompose (migration tool): https://kompose.io/

Local Development (Rancher Desktop + Tilt):

  • Rancher Desktop: https://rancherdesktop.io/ — local Kubernetes with built-in K3s, container runtime, and kubectl
  • Tilt: https://docs.tilt.dev/ — live development orchestration
  • Tilt live_update: https://docs.tilt.dev/live_update_reference — file sync into running containers
  • Tilt + Rancher Desktop: https://docs.tilt.dev/choosing_clusters#rancher-desktop — official Tilt integration guide
  • KubeView: https://github.com/benc-uk/kubeview — lightweight Kubernetes cluster visualization
  • Traefik Mesh: https://doc.traefik.io/traefik-mesh/ — lightweight service mesh (no sidecars, SMI-compatible)
  • Traefik Mesh install: https://doc.traefik.io/traefik-mesh/install/ — Helm-based installation
  • Cloudflare Tunnel (cloudflared): https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/ — free tunnel for exposing local services to the internet (webhook testing)

NestJS:

  • Graceful shutdown: https://docs.nestjs.com/fundamentals/lifecycle-events#application-shutdown
  • Health checks (Terminus): https://docs.nestjs.com/recipes/terminus

Existing Codebase:

  • Docker Compose: deployment/staging/docker-compose.yml
  • Traefik config: deployment/staging/traefik.yml
  • Deployment guide: docs/05-deployment/staging-deployment-guide.md
  • CI Pipeline: azure-pipelines.yml


END OF PROMPT


This prompt prepares the Forma3D.Connect infrastructure for a future Docker Compose to Kubernetes migration. The key networking deliverable is lowering DNS TTLs to 60s — enabling a fast DNS-based cut-over to a Load Balancer + DOKS cluster in the future (propagation under 1 minute). Note: DO Reserved IPs cannot be assigned to Load Balancers, so the migration uses a DNS update strategy instead. Supporting changes include health check standardization, graceful shutdown, configuration externalization, resource constraints, statelessness verification, multi-replica readiness (load-balancing, sticky sessions, worker deduplication), and a rolling update strategy with health gates and backward-compatible migrations. The system stays on Docker Compose for staging/production but becomes "Kubernetes-ready." A local Rancher Desktop + Tilt development environment provides production-parity K8s-based development with a one-command tilt up workflow.