AI Prompt: Forma3D.Connect — Scaling Preparations (Docker Compose → Kubernetes)¶
Purpose: Prepare the current single-Droplet Docker Compose deployment for a seamless future migration to Kubernetes on DigitalOcean, without adding unnecessary complexity now
Estimated Effort: 14–19 hours
Prerequisites: Staging deployment operational on a single DigitalOcean Droplet with Docker Compose + Traefik
Output: Externalized configuration, container registry strategy, architectural guardrails that make a future Kubernetes migration a straightforward lift-and-shift, and a local Rancher Desktop + Tilt development environment for production-parity "clone and tilt up" workflow
Status: 🚧 TODO
🎯 Mission¶
Prepare the Forma3D.Connect infrastructure for a seamless future migration from Docker Compose on a single Droplet to DigitalOcean Managed Kubernetes (DOKS). The goal is to make changes now — while the system is simple — that will pay off when multi-tenancy drives the need for horizontal scaling.
This is NOT a Kubernetes migration. This prompt makes the Docker Compose deployment "Kubernetes-ready" by:
- DNS strategy with low TTL — Prepare DNS records for a fast cut-over to a future DO Load Balancer IP by lowering TTLs now
- Container registry strategy — Ensure all images are pulled from DigitalOcean Container Registry (DOCR) with proper tagging
- Configuration externalization — Move all configuration to environment variables and `.env` files so they map cleanly to Kubernetes ConfigMaps and Secrets
- Health check standardization — Ensure all services expose HTTP health endpoints that work identically as Kubernetes liveness/readiness probes
- Stateless service design — Verify all services are stateless (no local file storage, no in-memory sessions without Redis backing)
- Graceful shutdown — Ensure all services handle SIGTERM for zero-downtime rolling updates
- Resource awareness — Add resource constraints to Docker Compose that translate directly to Kubernetes resource requests/limits
- DNS and TLS strategy — Plan the DNS/TLS migration path from Traefik to DigitalOcean Load Balancer + cert-manager
- Multi-replica readiness — Validate that all containers can run as multiple replicas behind Traefik with proper load-balancing, sticky sessions, and worker deduplication
- Rolling update strategy — Define and test a zero-downtime rolling update procedure with `start-first` ordering, health gates, and backward-compatible database migrations
- Local development with Rancher Desktop + Tilt — Create a "clone and `tilt up`" developer experience using Rancher Desktop's built-in Kubernetes, Tilt live-update, port-forwarding, and VS Code debug attach
Important note on DigitalOcean Reserved IPs: DO Reserved IPs can only be assigned to Droplets, not to Load Balancers. This means we cannot use a Reserved IP as a stable entry point that gets reassigned from a Droplet to a Load Balancer. Instead, the migration strategy uses DNS-based cut-over: when moving to DOKS, update DNS A records from the Droplet IP to the Load Balancer's stable IP. Setting low TTLs (60s) on DNS records before migration minimizes propagation delay to under a minute.
Why now:
- Changes are cheap when the system is small (6 services + supporting containers)
- Retrofitting these patterns later is expensive and error-prone
- Multi-tenancy (the next major feature) will be the trigger for needing Kubernetes
- Lowering DNS TTLs now means the future DNS cut-over will propagate in under a minute
What stays unchanged:
- Docker Compose remains the deployment mechanism for now
- Traefik remains the reverse proxy for now
- Single Droplet remains the hosting model for now
- No Kubernetes manifests or Helm charts are created in this prompt
📐 Architecture¶
Current State¶
DNS A Records
│
┌────────────────┴────────────────────┐
│ staging-connect.forma3d.be │
│ staging-connect-api.forma3d.be │
│ staging-connect-docs.forma3d.be │
│ staging-connect-events.forma3d.be │
│ staging-connect-db.forma3d.be │
│ staging-connect-logs.forma3d.be │
│ staging-connect-uptime.forma3d.be │
└────────────────┬────────────────────┘
│
▼
┌──────────────────────┐
│ Droplet Public IP │
│ (e.g., 167.x.x.x) │
└──────────┬───────────┘
│
┌──────────┴───────────┐
│ Docker Compose │
│ + Traefik │
│ + All Services │
└──────────────────────┘
Target State (after this prompt)¶
DNS A Records (TTL: 60s)
│
┌────────────────┴────────────────────┐
│ staging-connect.forma3d.be │
│ staging-connect-api.forma3d.be │
│ (all subdomains) │
└────────────────┬────────────────────┘
│
▼
┌──────────────────────┐
│ Droplet Public IP │ ← Same IP, but DNS TTL lowered to 60s
│ (e.g., 167.x.x.x) │ so future cut-over propagates fast
└──────────┬───────────┘
│
┌──────────┴───────────┐
│ Docker Compose │ + Health checks standardized
│ + Traefik │ + Graceful shutdown enabled
│ + All Services │ + Resource constraints added
│ (K8s-ready) │ + Configuration externalized
└──────────────────────┘
Future State (Kubernetes — NOT this prompt)¶
DNS A Records (TTL: 60s)
│
▼
┌──────────────────────┐
│ DO Load Balancer │ ← DNS updated to LB's stable IP
│ (stable IP) │ Propagation: <1 min with 60s TTL
└──────────┬───────────┘
│
┌──────────┴───────────┐
│ DOKS Cluster │
│ ├── Ingress NGINX │
│ ├── Gateway Pod(s) │
│ ├── Order Svc Pod(s)│
│ ├── Print Svc Pod(s)│
│ ├── Ship Svc Pod(s) │
│ ├── GridFlock Pod(s)│
│ ├── Slicer Pod(s) │
│ └── Web Pod(s) │
└──────────────────────┘
Note: DigitalOcean Load Balancers have stable, persistent IP addresses that do not change throughout their lifetime. The migration requires a one-time DNS update from the Droplet IP to the LB IP. With a 60s TTL, this propagates in under a minute.
📋 Implementation Phases¶
Phase 1: DNS Preparation for Future Migration (1 hour)¶
Priority: P0 | Impact: Critical | Dependencies: None
Prepare DNS records so that a future migration to DOKS + Load Balancer can be done with minimal disruption. The key insight: DigitalOcean Reserved IPs can only be assigned to Droplets, not to Load Balancers. Therefore, we cannot use a Reserved IP as a stable entry point that moves between Droplet and LB. Instead, the migration strategy relies on low-TTL DNS cut-over.
Why NOT a Reserved IP for this use case:
- DO Reserved IPs are Droplet-only (cannot be assigned to Load Balancers)
- DO Load Balancers get their own stable, persistent IP addresses
- The migration requires updating DNS A records to the LB's new IP
- With low TTLs, this DNS update propagates globally in under a minute
1. Lower DNS TTL on all staging subdomains¶
Set TTL to 60 seconds on all A records:
| Record | TTL (current) | TTL (target) |
|---|---|---|
| `staging-connect.forma3d.be` | 3600s (typical default) | 60s |
| `staging-connect-api.forma3d.be` | 3600s | 60s |
| `staging-connect-docs.forma3d.be` | 3600s | 60s |
| `staging-connect-events.forma3d.be` | 3600s | 60s |
| `staging-connect-db.forma3d.be` | 3600s | 60s |
| `staging-connect-logs.forma3d.be` | 3600s | 60s |
| `staging-connect-uptime.forma3d.be` | 3600s | 60s |
A 60s TTL means that when we later update the A records to point to a Load Balancer IP, all DNS caches worldwide will pick up the new IP within 60 seconds.
2. Verify DNS resolution¶
dig staging-connect.forma3d.be +short
dig staging-connect-api.forma3d.be +short
# Verify TTL is showing 60s or less
dig staging-connect.forma3d.be | grep -i ttl
3. Document the Droplet IP and datacenter¶
Add to deployment documentation and .env.example:
# DigitalOcean Infrastructure
DO_DROPLET_IP=<current-droplet-ip>
DO_DATACENTER=ams3
# Note: DNS TTLs set to 60s for future migration agility
4. Document the migration cut-over procedure¶
Create a brief runbook entry for the future DNS cut-over:
- Deploy services to DOKS cluster
- Create DO Load Balancer → get its stable IP
- Verify services are healthy behind the LB
- Update all DNS A records: Droplet IP → LB IP
- Wait 60 seconds for propagation
- Verify all services resolve to new IP
- Decommission Droplet
Why low TTLs now: DNS TTL changes take effect only after the previous TTL expires. If TTLs are currently 1 hour (3600s), lowering them to 60s right before migration means you still need to wait up to 1 hour for the old TTL to expire from caches. By lowering TTLs now, the 60s TTL is already cached everywhere when migration day arrives.
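The step-2 verification can be wrapped in a small helper so it is repeatable across all subdomains (a sketch; `check_ttl` is a hypothetical helper and the live `dig` usage is shown commented):

```shell
# Hypothetical TTL gate: pass the observed TTL and the target; fails if the
# observed value is still above the target.
check_ttl() {
  observed="$1"; target="$2"
  [ "$observed" -le "$target" ]
}

# Live usage (requires network), e.g.:
#   ttl=$(dig +noall +answer staging-connect.forma3d.be | awk '{print $2; exit}')
#   check_ttl "$ttl" 60 || echo "TTL still too high"
check_ttl 60 60 && echo "TTL OK"
```

Run it against every subdomain in the table above before declaring Phase 1 done.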
Phase 2: Container Registry Hygiene (1 hour)¶
Priority: P0 | Impact: High | Dependencies: None
Ensure all container images are stored in DigitalOcean Container Registry (DOCR) with a consistent tagging strategy that works for both Docker Compose and Kubernetes.
1. Verify DOCR is the image source for all services¶
Current Docker Compose already uses ${REGISTRY_URL}/forma3d-connect-*:${*_IMAGE_TAG:-latest}. Verify all services follow this pattern.
2. Implement semantic image tagging¶
Instead of relying solely on latest, ensure the CI pipeline tags images with:
- `git-<short-sha>` — immutable reference to the exact commit
- `latest` — rolling tag for the most recent build
- `staging` / `production` — environment-specific rolling tags
docker tag forma3d-connect-gateway:latest ${REGISTRY_URL}/forma3d-connect-gateway:git-abc1234
docker tag forma3d-connect-gateway:latest ${REGISTRY_URL}/forma3d-connect-gateway:staging
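In CI, the immutable tag can be derived from the commit SHA before tagging (a sketch; the `GIT_COMMIT` variable and its fallback are assumptions about the CI environment):

```shell
# Derive the immutable git-<short-sha> tag from the full commit SHA.
FULL_SHA="${GIT_COMMIT:-$(git rev-parse HEAD 2>/dev/null || echo 0123456789abcdef0000)}"
SHORT_SHA=$(printf '%s' "$FULL_SHA" | cut -c1-7)
IMMUTABLE_TAG="git-${SHORT_SHA}"
echo "$IMMUTABLE_TAG"

# Then tag and push both the immutable and the environment tag:
# docker tag forma3d-connect-gateway:latest "${REGISTRY_URL}/forma3d-connect-gateway:${IMMUTABLE_TAG}"
# docker tag forma3d-connect-gateway:latest "${REGISTRY_URL}/forma3d-connect-gateway:staging"
```

Pinning deployments to the `git-` tag (via `*_IMAGE_TAG` in `.env`) is what makes the Phase 10 rollback procedure deterministic.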
3. Add image pull policy awareness¶
In Docker Compose, add explicit pull_policy to each service:
services:
gateway:
image: ${REGISTRY_URL}/forma3d-connect-gateway:${GATEWAY_IMAGE_TAG:-latest}
pull_policy: always
This mirrors Kubernetes' imagePullPolicy: Always behavior and ensures deployments always use the latest image for a given tag.
Phase 3: Health Check Standardization (2 hours)¶
Priority: P1 | Impact: High | Dependencies: None
Kubernetes uses three probe types: liveness (is the process alive?), readiness (can it serve traffic?), and startup (has it finished initializing?). Ensure all services expose HTTP endpoints that serve these purposes.
1. Verify health endpoints exist in all backend services¶
Each NestJS service should expose:
| Endpoint | Purpose | K8s Probe Type | Expected Response |
|---|---|---|---|
| `GET /health/live` | Process is alive | Liveness | 200 OK |
| `GET /health/ready` | Can serve traffic (DB connected, dependencies up) | Readiness | 200 OK or 503 |
2. Update Docker Compose health checks to use HTTP¶
Replace wget/curl health checks with consistent HTTP checks:
healthcheck:
test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://localhost:3000/health/live']
interval: 30s
timeout: 5s
retries: 3
start_period: 30s
3. Add readiness checks that verify dependencies¶
The /health/ready endpoint should verify:
- Database connection is active
- Redis connection is active (for services that use Redis)
- Downstream services are reachable (for the Gateway)
This is critical for Kubernetes: a pod that passes liveness but fails readiness is kept alive but removed from the Service's endpoint list (no traffic sent to it).
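The readiness aggregation can be sketched as a small helper that runs each dependency check and maps the result to 200/503 (illustrative; the helper and check names are not from the codebase):

```typescript
type Check = () => Promise<boolean>;

// Run every dependency check; any failure (or thrown error) makes the
// service not-ready, which maps to HTTP 503 on /health/ready.
async function readiness(
  checks: Record<string, Check>,
): Promise<{ status: number; detail: Record<string, boolean> }> {
  const detail: Record<string, boolean> = {};
  for (const [name, check] of Object.entries(checks)) {
    detail[name] = await check().catch(() => false);
  }
  const ok = Object.values(detail).every(Boolean);
  return { status: ok ? 200 : 503, detail };
}
```

In a NestJS service the same shape is typically provided by `@nestjs/terminus` health indicators; the point is that `/health/ready` reports 503 while any dependency is down, so Kubernetes withholds traffic without restarting the pod.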
4. Verify health checks for third-party / observability services¶
The following third-party services already have health checks in Docker Compose — verify they are consistent and functional:
| Service | Health Check | Notes |
|---|---|---|
| ClickHouse | `clickhouse-client --query 'SELECT 1'` | Confirms query engine is ready |
| Grafana | `wget --spider http://localhost:3000/api/health` | Built-in API health endpoint |
| Uptime Kuma | HTTP check on port 3001 | Verify `start_period` is sufficient for DB init |
| Dozzle | `/dozzle healthcheck` | Built-in healthcheck command |
| OTel Collector | Add `curl http://localhost:13133/` | Uses the `health_check` extension (port 13133) — verify this extension is enabled in `otel-collector-config.yaml` |
Phase 4: Graceful Shutdown (2 hours)¶
Priority: P1 | Impact: High | Dependencies: None
Kubernetes sends SIGTERM to pods during rolling updates, then waits terminationGracePeriodSeconds (default 30s) before sending SIGKILL. Services must handle SIGTERM to finish in-flight requests.
1. Verify NestJS graceful shutdown is enabled¶
In each service's main.ts:
app.enableShutdownHooks();
This ensures NestJS listens for SIGTERM and:
- Stops accepting new connections
- Waits for in-flight HTTP requests to complete
- Closes database connections cleanly
- Closes Redis connections cleanly
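What `enableShutdownHooks()` provides can be sketched in plain Node terms (illustrative only; Nest wires this up for you, and the DB/Redis cleanup shown as a comment is an assumption about the service's clients):

```typescript
import * as http from "node:http";

const server = http.createServer((_req, res) => res.end("ok"));

function gracefulShutdown(): Promise<void> {
  return new Promise((resolve) => {
    // close() stops accepting new connections; the callback fires only
    // after in-flight requests have completed.
    server.close(() => resolve());
    // Real services would also close DB/Redis clients here,
    // e.g. await prisma.$disconnect().
  });
}

process.on("SIGTERM", () => {
  gracefulShutdown().then(() => process.exit(0));
});

server.listen(0); // ephemeral port, for the sketch
```

The 30s window in `stop_grace_period` (and later `terminationGracePeriodSeconds`) is exactly the time this drain is allowed to take before SIGKILL.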
2. Add stop_grace_period to Docker Compose¶
For each service in docker-compose.yml:
services:
gateway:
stop_grace_period: 30s
This mirrors Kubernetes' terminationGracePeriodSeconds and ensures Docker Compose also waits before sending SIGKILL.
3. Verify BullMQ workers handle shutdown¶
For services with BullMQ workers (order processing, print job processing), ensure workers call worker.close() on SIGTERM to finish processing the current job before shutting down.
4. Add stop_grace_period to third-party services¶
ClickHouse, Grafana, OTel Collector, Uptime Kuma, and Dozzle should also have stop_grace_period set. ClickHouse is especially critical — it may need time to flush in-memory buffers to disk on shutdown:
services:
clickhouse:
stop_grace_period: 60s # needs time to flush write buffers
grafana:
stop_grace_period: 15s
otel-collector:
stop_grace_period: 30s # flush pending telemetry batches
uptime-kuma:
stop_grace_period: 15s
dozzle:
stop_grace_period: 10s
Phase 5: Configuration Externalization Audit (2 hours)¶
Priority: P1 | Impact: High | Dependencies: None
Kubernetes uses ConfigMaps for non-sensitive configuration and Secrets for sensitive values. The Docker Compose .env file maps directly to these concepts — but only if ALL configuration is externalized.
1. Audit all services for hardcoded values¶
Search for hardcoded URLs, ports, timeouts, feature flags, or connection strings in application code. All must come from environment variables.
Common patterns to look for:
// ❌ WRONG — hardcoded
const DB_URL = 'postgresql://localhost:5432/forma3d';
// ✅ CORRECT — from environment
const DB_URL = process.env['DATABASE_URL'];
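A fail-fast accessor turns missing configuration into an immediate startup error instead of a latent runtime bug (a sketch; real services may use `@nestjs/config` instead, and `requireEnv` is a hypothetical helper):

```typescript
// Read a required variable, throwing at startup if it is absent or empty.
function requireEnv(
  name: string,
  env: Record<string, string | undefined> = process.env,
): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage at bootstrap:
// const dbUrl = requireEnv("DATABASE_URL");
```

Failing at boot matters for Kubernetes: a pod with a missing Secret key should crash-loop visibly rather than serve traffic with a bad default.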
2. Categorize environment variables¶
Create a documented mapping of all environment variables into two categories:
Non-sensitive (→ ConfigMap):
| Variable | Description | Example |
|---|---|---|
| `NODE_ENV` | Environment name | `staging` |
| `APP_PORT` | Service port | `3000` |
| `LOG_LEVEL` | Log verbosity | `info` |
| `RATE_LIMIT_DEFAULT` | Default rate limit | `10000` |
Sensitive (→ Secret):
| Variable | Description |
|---|---|
| `DATABASE_URL` | PostgreSQL connection string |
| `REDIS_URL` | Redis connection string |
| `SESSION_SECRET` | Cookie signing secret |
| `INTERNAL_API_KEY` | Inter-service auth key |
| `SENTRY_DSN` | Sentry data source name |
| `SHOPIFY_*` | Shopify OAuth credentials |
| `SENDCLOUD_*` | Sendcloud API credentials |
| `SIMPLYPRINT_*` | SimplyPrint API credentials |
3. Create a configuration reference document¶
Create docs/05-deployment/configuration-reference.md listing every environment variable, its purpose, default value, and whether it's sensitive.
Phase 6: Resource Constraints (1 hour)¶
Priority: P2 | Impact: Medium | Dependencies: None
Add resource limits to Docker Compose services. These translate directly to Kubernetes resource requests and limits.
1. Add deploy.resources to each service¶
services:
gateway:
deploy:
resources:
limits:
cpus: '0.50'
memory: 512M
reservations:
cpus: '0.25'
memory: 256M
2. Recommended resource allocations¶
| Service | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| Gateway | 0.25 | 0.50 | 256M | 512M |
| Order Service | 0.25 | 0.50 | 256M | 512M |
| Print Service | 0.15 | 0.30 | 192M | 384M |
| Shipping Service | 0.15 | 0.30 | 192M | 384M |
| GridFlock Service | 0.25 | 0.50 | 256M | 512M |
| Slicer | 0.50 | 1.00 | 512M | 1024M |
| Web (static) | 0.10 | 0.25 | 64M | 128M |
| Redis | 0.15 | 0.30 | 128M | 256M |
| Traefik | 0.15 | 0.30 | 128M | 256M |
| ClickHouse | 0.50 | 1.00 | 512M | 1536M |
| Grafana | 0.15 | 0.30 | 128M | 256M |
| OTel Collector | 0.15 | 0.30 | 128M | 256M |
| Uptime Kuma | 0.10 | 0.25 | 128M | 256M |
| Dozzle | 0.05 | 0.15 | 64M | 128M |
Adjust based on observed usage via docker stats.
Phase 7: Stateless Service Verification (1 hour)¶
Priority: P1 | Impact: High | Dependencies: None
For Kubernetes horizontal scaling, all application services must be stateless. State must live in external stores (PostgreSQL, Redis, S3).
1. Verify no local file storage¶
Check that no service writes to the local filesystem for state that needs to persist. Temporary files (e.g., STL processing in GridFlock/Slicer) should use /tmp and be cleaned up after processing.
2. Verify session storage uses Redis¶
Sessions must be stored in Redis (not in-memory). The Gateway already uses Redis for sessions — verify this is consistently applied.
3. Verify no in-memory caches that require consistency¶
If any service maintains in-memory caches, they must tolerate cache inconsistency across replicas or be moved to Redis.
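If a local cache is kept, bounding entries with a short TTL keeps cross-replica staleness tolerable (a sketch; the class is illustrative, not existing code):

```typescript
// Per-replica cache whose entries expire after ttlMs, so two replicas can
// disagree for at most ttlMs before both re-read the source of truth.
class TtlCache<V> {
  private store = new Map<string, { value: V; expires: number }>();

  constructor(private ttlMs: number) {}

  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.store.get(key);
    if (!entry || entry.expires <= now) {
      this.store.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.store.set(key, { value, expires: now + this.ttlMs });
  }
}
```

Anything that must be read-your-writes consistent across replicas (sessions, rate-limit counters, queues) belongs in Redis instead.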
4. Document stateful dependencies¶
| Dependency | Type | Location | K8s Strategy |
|---|---|---|---|
| PostgreSQL | Database | DigitalOcean Managed DB | External (no migration needed) |
| Redis | Cache / Sessions / Queues | Docker container | DigitalOcean Managed Redis or StatefulSet |
| ClickHouse | Observability / Analytics DB | Docker container (volume) | StatefulSet with PVC or ClickHouse Cloud |
| Grafana | Dashboards / Datasource config | Docker container (volume) | StatefulSet with PVC or Grafana Cloud |
| Uptime Kuma | Monitor state / history | Docker container (volume) | StatefulSet with PVC |
| Let's Encrypt certs | TLS | Traefik volume | cert-manager in K8s |
| Uploaded files | STL files | Temporary local → S3 future | DigitalOcean Spaces |
Phase 8: Docker Compose Networking Alignment (1 hour)¶
Priority: P2 | Impact: Medium | Dependencies: None
Kubernetes uses Service objects for service discovery (DNS-based: <service-name>.<namespace>.svc.cluster.local). Docker Compose already uses DNS-based service discovery within the network. Ensure the naming is consistent.
1. Verify service names match container references¶
In the Gateway's environment variables, downstream services are referenced as:
ORDER_SERVICE_URL=http://order-service:3001
PRINT_SERVICE_URL=http://print-service:3002
SHIPPING_SERVICE_URL=http://shipping-service:3003
GRIDFLOCK_SERVICE_URL=http://gridflock-service:3004
These names must match the Docker Compose service names exactly. In Kubernetes, these will become Kubernetes Service names — the URL pattern stays identical.
The observability pipeline also uses DNS-based service discovery:
# OTel Collector → ClickHouse
CLICKHOUSE_ENDPOINT=http://clickhouse:8123
# Application services → OTel Collector
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
# Grafana → ClickHouse (via provisioned datasource)
# Configured in grafana/provisioning/datasources/
2. Use environment variables for all service URLs¶
Never hardcode inter-service URLs. Always use environment variables so the values can be changed for Kubernetes Service discovery:
# Docker Compose
ORDER_SERVICE_URL=http://order-service:3001
# Kubernetes (same pattern, different port if needed)
ORDER_SERVICE_URL=http://order-service.forma3d.svc.cluster.local:3001
Phase 9: Multi-Replica Readiness (2 hours)¶
Priority: P1 | Impact: High | Dependencies: Phase 4, Phase 7
Verify and configure all application services so they can run as multiple replicas behind Traefik (Docker Compose) and later behind a Kubernetes Ingress. This goes beyond statelessness verification — it validates actual concurrent execution.
1. Add deploy.replicas to Docker Compose¶
Add explicit replica counts (default 1) to all application services so they can be scaled up trivially:
services:
gateway:
deploy:
replicas: 1 # Scale with: docker compose up -d --scale gateway=3
resources:
# ... (existing resource constraints)
2. Configure Traefik load-balancing across replicas¶
Traefik auto-discovers Docker containers by label. Verify that scaling up a service (e.g., docker compose up -d --scale gateway=3) results in Traefik distributing traffic across all replicas. Key considerations:
- Do NOT expose `ports:` on application services. Use `expose:` instead so replicas don't fight over host ports. Only Traefik should map to host ports 80/443.
- Verify Traefik labels use the service name, not a container name, so all replicas are included in the backend pool.
- Traefik load-balances round-robin by default; adjust the strategy in Traefik's dynamic configuration only if needed.
services:
gateway:
# ❌ WRONG — blocks scaling
# ports:
# - "3000:3000"
# ✅ CORRECT — allows multiple replicas
expose:
- "3000"
labels:
- "traefik.http.services.gateway.loadbalancer.server.port=3000"
3. Validate BullMQ worker concurrency¶
When running multiple replicas of a service with BullMQ workers, jobs are naturally distributed across workers (BullMQ uses Redis-based locking). Verify:
- No duplicate processing: Two replicas must not process the same job. BullMQ handles this natively — verify no custom job-fetch logic bypasses it.
- Worker concurrency settings: Ensure
concurrencyis set per-worker (not globally) so each replica processes its fair share. - Job events / progress: If the Gateway subscribes to job events via
QueueEvents, ensure this works correctly with multiple producer replicas.
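The reason two replicas cannot grab the same job is Redis-side locking; the idea can be illustrated with a toy claim function (a deliberate simplification of what BullMQ does internally, not its API):

```typescript
// Toy job-claim table standing in for BullMQ's Redis lock: the first replica
// to claim a job wins; every other replica's claim is rejected.
const locks = new Map<string, string>();

function claimJob(jobId: string, workerId: string): boolean {
  if (locks.has(jobId)) return false; // another replica already owns it
  locks.set(jobId, workerId);
  return true;
}
```

BullMQ performs this claim atomically in Redis (with lock renewal and stalled-job recovery), which is why scaling worker replicas is safe as long as no custom code fetches jobs outside the library.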
4. Handle WebSocket sticky sessions¶
If any service uses WebSocket connections (Socket.IO for real-time events), multiple replicas require sticky sessions to ensure the WebSocket upgrade request reaches the same backend that holds the socket state.
Traefik supports sticky sessions via cookies:
labels:
- "traefik.http.services.events.loadbalancer.sticky.cookie=true"
- "traefik.http.services.events.loadbalancer.sticky.cookie.name=server_id"
- "traefik.http.services.events.loadbalancer.sticky.cookie.httponly=true"
Socket.IO must also be configured to use the Redis adapter so pub/sub events propagate across replicas:
import { createAdapter } from '@socket.io/redis-adapter';
io.adapter(createAdapter(pubClient, subClient));
5. Test multi-replica operation¶
Run a manual scaling test for each application service:
# Scale up
docker compose up -d --scale gateway=2 --scale order-service=2
# Verify all replicas are healthy
docker compose ps
# Verify Traefik routes to all replicas. Note: curl's remote IP is always the
# Droplet, so confirm distribution via Traefik's access logs or by having
# /health/live echo the container hostname.
for i in $(seq 1 10); do
  curl -s https://staging-connect-api.forma3d.be/health/live; echo
done
# Scale back down
docker compose up -d --scale gateway=1 --scale order-service=1
Document which services can and cannot be scaled (e.g., the Slicer may have constraints around GPU or temp file cleanup).
Phase 10: Rolling Update Strategy (1 hour)¶
Priority: P1 | Impact: High | Dependencies: Phase 3, Phase 4, Phase 9
Define and test a rolling update procedure for Docker Compose that achieves zero-downtime deployments. This same procedure translates directly to Kubernetes Deployment rolling update strategy.
1. Add deploy.update_config to Docker Compose¶
Configure rolling update behavior for each service:
services:
gateway:
deploy:
replicas: 1
update_config:
parallelism: 1 # Update one replica at a time
delay: 10s # Wait 10s between replica updates
order: start-first # Start new replica before stopping old one
failure_action: rollback
rollback_config:
parallelism: 1
order: start-first
order: start-first is critical — it ensures the new container is healthy before the old one is removed, maintaining service availability throughout the update. Note that plain docker compose does not enforce deploy.update_config (only Swarm does); declaring it now documents the intended behavior and maps one-to-one onto a Kubernetes Deployment's rolling update strategy.
2. Document the rolling update procedure¶
Create a step-by-step update runbook:
# 1. Pull latest images
docker compose -f deployment/staging/docker-compose.yml pull
# 2. Rolling update — application services one at a time (no full restart).
#    Images were pulled from DOCR in step 1, so no --build here.
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps gateway
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps order-service
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps print-service
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps shipping-service
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps gridflock-service
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps slicer
# 3. Verify health after each service update
curl -sf https://staging-connect-api.forma3d.be/health/ready || echo "UNHEALTHY"
# 4. Update the web frontend (stateless, fast restart)
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps web
# 5. Update observability & monitoring services
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps clickhouse
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps otel-collector
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps grafana
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps uptime-kuma
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps dozzle
The --no-deps flag ensures only the target service is recreated — not its dependencies. This prevents cascading restarts.
3. Ensure database migration compatibility¶
Rolling updates mean two versions of a service run simultaneously (old and new). Database migrations must be backward-compatible:
- Adding a column: Safe — old code ignores it.
- Removing a column: Unsafe — old code still queries it. Use a two-phase approach:
- Deploy code that stops using the column.
- Deploy migration that removes the column.
- Renaming a column: Unsafe — treat as add + deprecate + remove.
Document this in the deployment guide as the expand-and-contract migration pattern.
4. Add a pre-deployment health gate¶
Before starting a rolling update, verify the system is healthy:
#!/usr/bin/env bash
set -euo pipefail
SERVICES=("staging-connect-api.forma3d.be" "staging-connect.forma3d.be")
for svc in "${SERVICES[@]}"; do
status=$(curl -sf -o /dev/null -w "%{http_code}" "https://${svc}/health/ready" || echo "000")
if [ "$status" != "200" ]; then
echo "❌ Pre-deploy health check failed for ${svc} (HTTP ${status}). Aborting."
exit 1
fi
done
echo "✅ All services healthy. Proceeding with rolling update."
5. Test rollback procedure¶
Document and test the rollback procedure:
# Rollback a single service to a specific image tag
# (docker compose up has no -e flag; override the tag via the environment)
GATEWAY_IMAGE_TAG=git-abc1234 \
  docker compose -f deployment/staging/docker-compose.yml up -d --no-deps gateway
# Or rollback by reverting the .env file and recreating
git checkout HEAD~1 -- deployment/staging/.env
docker compose -f deployment/staging/docker-compose.yml up -d --no-deps gateway
Phase 11: Local Development Experience with Rancher Desktop + Tilt (3–4 hours)¶
Priority: P2 | Impact: High | Dependencies: Phase 3, Phase 4, Phase 5
Give every developer a one-command local environment that mirrors production networking, service discovery, and container runtime. The goal: git clone, tilt up, start coding.
Why Rancher Desktop + Tilt¶
| Tool | Role |
|---|---|
| Rancher Desktop | Desktop application that provides a local Kubernetes cluster (K3s under the hood), container runtime (containerd or dockerd), and kubectl — one install replaces Docker Desktop + k3d + ctlptl |
| Tilt | Watches source files, live-syncs into running containers, manages builds, port-forwards, and provides a dashboard at localhost:10350 |
| Traefik Mesh | Lightweight service mesh — automatic mTLS, request metrics, no sidecars. Same Helm chart reused in staging and production on DOKS |
| KubeView | Real-time graphical visualization of cluster resources and their relationships at localhost:8000 |
The existing pnpm dev workflow (Nx parallel serve) remains available for quick, lightweight iteration. tilt up is the production-parity alternative.
| Aspect | `pnpm dev` | `tilt up` |
|---|---|---|
| Infra setup | Manual (install PostgreSQL, Redis) | Automatic (K8s provisions everything) |
| Service discovery | `localhost` + hardcoded ports | K8s DNS (matches production) |
| Service mesh | None | Traefik Mesh (mTLS, metrics — same as staging/prod) |
| Hot reload | Nx watch mode | Tilt `live_update` (file sync into containers) |
| Debugging | Direct (same process) | Remote attach via `--inspect` port |
| Cluster visibility | None | KubeView (real-time resource graph) |
| Production parity | Low (no containers) | High (same images, same networking, same mesh) |
1. Prerequisites¶
Developers need two tools installed (plus one optional tool for webhook testing):
- Rancher Desktop — Provides the local Kubernetes cluster, container runtime, and `kubectl`. Download from the website or install via Homebrew:
# macOS
brew install --cask rancher
# Linux / Windows: download from https://rancherdesktop.io
- Tilt — Orchestrates the development workflow:
# macOS
brew install tilt-dev/tap/tilt
# Linux
curl -fsSL https://raw.githubusercontent.com/tilt-dev/tilt/master/scripts/install.sh | bash
- cloudflared (optional) — Exposes `localhost:3000` to the internet so Shopify, SimplyPrint, and SendCloud can deliver real webhooks during local development. Not needed for most development — `curl` simulation is sufficient:
# macOS
brew install cloudflared
# Linux
curl -fsSL https://pkg.cloudflare.com/cloudflare-main.gpg | sudo tee /usr/share/keyrings/cloudflare-main.gpg >/dev/null
sudo apt update && sudo apt install cloudflared
Rancher Desktop configuration:
After installing, open Rancher Desktop and verify these settings:
- Kubernetes enabled (on by default)
- Container runtime: dockerd (moby) — required for Tilt's docker_build() to work
- Kubernetes version: 1.29+ recommended
Rancher Desktop bundles kubectl and manages the kubeconfig automatically. No other prerequisites — PostgreSQL and Redis run inside the cluster.
2. Cluster setup — Rancher Desktop¶
No cluster definition file is needed. Rancher Desktop provides the Kubernetes cluster out of the box. Verify the cluster is running:
kubectl cluster-info
# Should show: Kubernetes control plane running at https://127.0.0.1:6443
kubectl get nodes
# Should show: rancher-desktop Ready
Rancher Desktop's built-in K3s has Traefik installed by default. This doesn't conflict with local development — Tilt handles port-forwarding directly to pods, bypassing any in-cluster ingress.
Image builds: Tilt's docker_build() uses the local Docker daemon provided by Rancher Desktop (when configured with dockerd (moby) runtime). Built images are available to the cluster immediately — no separate registry needed.
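A minimal Tiltfile tying this together might look like the following (a sketch; every path, image name, and sync target is an assumption about the repo layout, and only the Gateway is shown):

```python
# Tiltfile (Starlark) -- minimal sketch, one service shown.
k8s_yaml([
    'k8s/dev/namespace.yaml',
    'k8s/dev/postgres.yaml',
    'k8s/dev/redis.yaml',
    'k8s/dev/configmap.yaml',
    'k8s/dev/secret.yaml',
    'k8s/dev/gateway.yaml',
])

docker_build(
    'forma3d-connect-gateway',
    context='.',
    dockerfile='apps/gateway/Dockerfile',
    live_update=[
        # Sync source changes straight into the running container
        sync('apps/gateway/src', '/app/apps/gateway/src'),
    ],
)

# Forward the app port and the Node --inspect debug port
k8s_resource('gateway', port_forwards=['3000:3000', '9229:9229'])
```

Repeat the `docker_build`/`k8s_resource` pair per service; `tilt up` then builds images against Rancher Desktop's dockerd and serves the dashboard at localhost:10350.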
3. Kubernetes manifests — k8s/dev/¶
Create lightweight manifests for local development only. These are NOT the production Kubernetes manifests (those come later during the actual DOKS migration).
k8s/dev/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: forma3d-dev
k8s/dev/postgres.yaml — Single-node PostgreSQL with a PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-data
namespace: forma3d-dev
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: postgres
namespace: forma3d-dev
spec:
replicas: 1
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
containers:
- name: postgres
image: postgres:16-alpine
ports:
- containerPort: 5432
env:
- name: POSTGRES_DB
value: forma3d
- name: POSTGRES_USER
value: forma3d
- name: POSTGRES_PASSWORD
value: forma3d_dev
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
readinessProbe:
exec:
command: [pg_isready, -U, forma3d]
periodSeconds: 5
volumes:
- name: data
persistentVolumeClaim:
claimName: postgres-data
---
apiVersion: v1
kind: Service
metadata:
name: postgres
namespace: forma3d-dev
spec:
selector:
app: postgres
ports:
- port: 5432
k8s/dev/redis.yaml — Single-node Redis:
apiVersion: apps/v1
kind: Deployment
metadata:
name: redis
namespace: forma3d-dev
spec:
replicas: 1
selector:
matchLabels:
app: redis
template:
metadata:
labels:
app: redis
spec:
containers:
- name: redis
image: redis:7-alpine
ports:
- containerPort: 6379
readinessProbe:
exec:
command: [redis-cli, ping]
periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
name: redis
namespace: forma3d-dev
spec:
selector:
app: redis
ports:
- port: 6379
k8s/dev/configmap.yaml — Shared configuration for all application services:
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
namespace: forma3d-dev
data:
NODE_ENV: development
LOG_LEVEL: debug
DATABASE_URL: postgresql://forma3d:forma3d_dev@postgres.forma3d-dev.svc.cluster.local:5432/forma3d?schema=public
REDIS_URL: redis://redis.forma3d-dev.svc.cluster.local:6379
ORDER_SERVICE_URL: http://order-service.forma3d-dev.svc.cluster.local:3001
PRINT_SERVICE_URL: http://print-service.forma3d-dev.svc.cluster.local:3002
SHIPPING_SERVICE_URL: http://shipping-service.forma3d-dev.svc.cluster.local:3003
GRIDFLOCK_SERVICE_URL: http://gridflock-service.forma3d-dev.svc.cluster.local:3004
GATEWAY_URL: http://gateway.forma3d-dev.svc.cluster.local:3000
API_URL: http://localhost:3000
WEB_URL: http://localhost:4200
k8s/dev/secret.yaml.example — Template for sensitive values (not committed to git):
apiVersion: v1
kind: Secret
metadata:
name: app-secrets
namespace: forma3d-dev
type: Opaque
stringData:
SESSION_SECRET: local-dev-session-secret
INTERNAL_API_KEY: local-dev-internal-api-key
# Add your external service credentials below:
# SHOPIFY_CLIENT_ID: ""
# SHOPIFY_CLIENT_SECRET: ""
# SIMPLYPRINT_API_KEY: ""
# SENDCLOUD_PUBLIC_KEY: ""
# SENDCLOUD_SECRET_KEY: ""
# SENTRY_DSN: ""
Per-service manifests — Create one file per application service following this pattern (example for Gateway, repeat for order-service, print-service, shipping-service, gridflock-service):
k8s/dev/gateway.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: gateway
namespace: forma3d-dev
spec:
replicas: 1
selector:
matchLabels:
app: gateway
template:
metadata:
labels:
app: gateway
spec:
containers:
- name: gateway
image: forma3d-connect-gateway
ports:
- containerPort: 3000
name: http
- containerPort: 9229
name: debug
envFrom:
- configMapRef:
name: app-config
- secretRef:
name: app-secrets
env:
- name: APP_PORT
value: "3000"
readinessProbe:
httpGet:
path: /health/live
port: 3000
periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
name: gateway
namespace: forma3d-dev
spec:
selector:
app: gateway
ports:
- name: http
port: 3000
- name: debug
port: 9229
Repeat the pattern for each backend service (order-service on 3001/9230, print-service on 3002/9231, shipping-service on 3003/9232, gridflock-service on 3004/9233). Each service gets its own debug port and APP_PORT env override (since the shared configmap cannot hold per-service port values). Adjust the readinessProbe path to /health for the downstream services (they use /health instead of /health/live).
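Since the four remaining manifests differ only in name, ports, and probe path, the repetition can be scripted. A sketch (hypothetical helper, not part of the repo) that stamps them out from the gateway pattern:

```shell
#!/bin/sh
# Hypothetical generator for the per-service dev manifests.
# Each entry is <service>:<app-port>:<debug-port>, matching the list above.
set -eu
mkdir -p k8s/dev
for entry in order-service:3001:9230 print-service:3002:9231 \
             shipping-service:3003:9232 gridflock-service:3004:9233; do
  name=${entry%%:*}
  rest=${entry#*:}
  port=${rest%%:*}
  debug=${rest#*:}
  cat > "k8s/dev/${name}.yaml" <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${name}
  namespace: forma3d-dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ${name}
  template:
    metadata:
      labels:
        app: ${name}
    spec:
      containers:
        - name: ${name}
          image: forma3d-connect-${name}
          ports:
            - containerPort: ${port}
              name: http
            - containerPort: ${debug}
              name: debug
          envFrom:
            - configMapRef:
                name: app-config
            - secretRef:
                name: app-secrets
          env:
            - name: APP_PORT
              value: "${port}"
          readinessProbe:
            httpGet:
              path: /health
              port: ${port}
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: ${name}
  namespace: forma3d-dev
spec:
  selector:
    app: ${name}
  ports:
    - name: http
      port: ${port}
    - name: debug
      port: ${debug}
EOF
done
```

Running it once and committing the generated files keeps them greppable; regenerating is only needed when the pattern itself changes.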
k8s/dev/web.yaml — The React app runs vite dev (not nginx) for HMR:
apiVersion: apps/v1
kind: Deployment
metadata:
name: web
namespace: forma3d-dev
spec:
replicas: 1
selector:
matchLabels:
app: web
template:
metadata:
labels:
app: web
spec:
containers:
- name: web
image: forma3d-connect-web-dev
ports:
- containerPort: 4200
env:
- name: VITE_API_URL
value: http://localhost:3000
---
apiVersion: v1
kind: Service
metadata:
name: web
namespace: forma3d-dev
spec:
selector:
app: web
ports:
- port: 4200
k8s/dev/kubeview.yaml — Cluster visualization tool (KubeView):
apiVersion: apps/v1
kind: Deployment
metadata:
name: kubeview
namespace: forma3d-dev
spec:
replicas: 1
selector:
matchLabels:
app: kubeview
template:
metadata:
labels:
app: kubeview
spec:
serviceAccountName: kubeview
containers:
- name: kubeview
image: ghcr.io/benc-uk/kubeview:latest
ports:
- containerPort: 8000
env:
- name: SINGLE_NAMESPACE
value: forma3d-dev
- name: NAMESPACE_FILTER
value: "^kube-"
---
apiVersion: v1
kind: Service
metadata:
name: kubeview
namespace: forma3d-dev
spec:
selector:
app: kubeview
ports:
- port: 8000
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: kubeview
namespace: forma3d-dev
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kubeview-reader
rules:
- apiGroups: ["", "apps", "batch", "networking.k8s.io", "discovery.k8s.io", "autoscaling"]
resources: ["*"]
verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kubeview-reader-binding
subjects:
- kind: ServiceAccount
name: kubeview
namespace: forma3d-dev
roleRef:
kind: ClusterRole
name: kubeview-reader
apiGroup: rbac.authorization.k8s.io
KubeView provides a real-time graphical view of pods, deployments, services, and their relationships at http://localhost:8000. Useful for understanding how Kubernetes resources map to the application architecture.
Traefik Mesh — Installed via Helm in the Tiltfile (not a static manifest). Traefik Mesh is a lightweight service mesh that provides:
- Automatic mTLS between services
- Request-level metrics and tracing
- SMI (Service Mesh Interface) support for traffic policies
To opt a service into the mesh, add the label mesh.traefik.io/enabled: "true" to its Service object. For example, in k8s/dev/gateway.yaml:
apiVersion: v1
kind: Service
metadata:
name: gateway
namespace: forma3d-dev
labels:
mesh.traefik.io/enabled: "true"
spec:
selector:
app: gateway
ports:
- name: http
port: 3000
- name: debug
port: 9229
The same Traefik Mesh Helm chart and service labels will be reused in staging and production on DigitalOcean Managed Kubernetes (DOKS), giving consistent service-to-service networking across all environments.
4. Development Dockerfiles¶
Create lightweight development Dockerfiles that support Tilt's live_update (file sync instead of full rebuild):
.dockerignore — Create a .dockerignore in the project root to keep the build context small (Tilt's docker_build() uses context='.'):
node_modules
dist
build
.nx
.git
.vscode
.cursor
coverage
*.log
.DS_Store
docs
load-tests
.specstory
apps/gateway/Dockerfile.dev (same pattern for all NestJS services — adjust the service name and --inspect port):
FROM node:20-alpine
WORKDIR /app
RUN apk add --no-cache openssl
RUN corepack enable && corepack prepare pnpm@9 --activate
COPY package.json pnpm-lock.yaml ./
COPY prisma ./prisma/
RUN pnpm install --frozen-lockfile
RUN pnpm prisma generate
COPY . .
CMD ["node_modules/.bin/tsx", "watch", "--inspect=0.0.0.0:9229", "apps/gateway/src/main.ts"]
Using tsx watch instead of the full Nx build pipeline — starts in seconds, restarts on file changes synced by Tilt. The --inspect flag enables remote debugging via VS Code (port per service: gateway=9229, order-service=9230, print-service=9231, shipping-service=9232, gridflock-service=9233). The openssl package is required by Prisma on Alpine. The prisma/ directory is copied separately before pnpm install so that prisma generate can run against the schema.
apps/web/Dockerfile.dev:
FROM node:20-alpine
WORKDIR /app
RUN corepack enable && corepack prepare pnpm@9 --activate
COPY package.json pnpm-lock.yaml ./
COPY prisma ./prisma/
RUN pnpm install --frozen-lockfile
COPY . .
EXPOSE 4200
CMD ["node_modules/.bin/vite", "--host", "0.0.0.0", "--port", "4200", "apps/web"]
5. Tiltfile¶
Create Tiltfile in the project root:
# ---------------------------------------------------------------------------
# Forma3D.Connect — Local Development with Rancher Desktop + Tilt
# Usage: tilt up
# Dashboard: http://localhost:10350
# ---------------------------------------------------------------------------
load('ext://namespace', 'namespace_create')
# --- Cluster bootstrap ---
namespace_create('forma3d-dev')
# --- Infrastructure (PostgreSQL + Redis) ---
k8s_yaml('k8s/dev/postgres.yaml')
k8s_yaml('k8s/dev/redis.yaml')
k8s_yaml('k8s/dev/configmap.yaml')
# Apply secrets (developer must copy secret.yaml.example → secret.yaml)
if os.path.exists('k8s/dev/secret.yaml'):
k8s_yaml('k8s/dev/secret.yaml')
else:
fail('k8s/dev/secret.yaml not found. Copy k8s/dev/secret.yaml.example and fill in your values.')
k8s_resource('postgres', port_forwards=['5432:5432'],
labels=['infra'])
k8s_resource('redis', port_forwards=['6379:6379'],
labels=['infra'])
# --- Prisma migrations (runs after postgres is ready) ---
local_resource('prisma-migrate',
cmd='pnpm prisma migrate deploy',
resource_deps=['postgres'],
labels=['setup'])
local_resource('prisma-seed',
cmd='pnpm prisma db seed',
resource_deps=['prisma-migrate'],
auto_init=False, # manual trigger via Tilt UI button
labels=['setup'])
# --- Backend services ---
BACKEND_SERVICES = {
'gateway': {'port': 3000, 'debug': 9229},
'order-service': {'port': 3001, 'debug': 9230},
'print-service': {'port': 3002, 'debug': 9231},
'shipping-service': {'port': 3003, 'debug': 9232},
'gridflock-service':{'port': 3004, 'debug': 9233},
}
for svc, cfg in BACKEND_SERVICES.items():
docker_build(
'forma3d-connect-' + svc,
context='.',
dockerfile='apps/' + svc + '/Dockerfile.dev',
live_update=[
sync('apps/' + svc + '/src', '/app/apps/' + svc + '/src'),
sync('libs/', '/app/libs/'),
sync('prisma/', '/app/prisma/'),
],
)
k8s_yaml('k8s/dev/' + svc + '.yaml')
k8s_resource(svc,
port_forwards=[
str(cfg['port']) + ':' + str(cfg['port']),
str(cfg['debug']) + ':' + str(cfg['debug']),
],
resource_deps=['prisma-migrate'],
labels=['backend'])
# --- Web (React + Vite HMR) ---
docker_build(
'forma3d-connect-web-dev',
context='.',
dockerfile='apps/web/Dockerfile.dev',
live_update=[
sync('apps/web/src', '/app/apps/web/src'),
sync('libs/', '/app/libs/'),
],
)
k8s_yaml('k8s/dev/web.yaml')
k8s_resource('web',
port_forwards=['4200:4200'],
resource_deps=['gateway'],
labels=['frontend'])
# --- Traefik Mesh (service mesh — same in dev, staging, production) ---
local_resource('traefik-mesh-install',
cmd='helm repo add traefik https://traefik.github.io/charts --force-update && '
'helm upgrade --install traefik-mesh traefik/traefik-mesh '
'--namespace forma3d-dev --wait',
resource_deps=['postgres'], # ensure namespace exists
labels=['mesh'])
# --- KubeView (cluster visualization) ---
k8s_yaml('k8s/dev/kubeview.yaml')
k8s_resource('kubeview',
port_forwards=['8000:8000'],
labels=['tools'])
# --- Cloudflare Tunnel (webhook testing — on-demand) ---
# Exposes localhost:3000 (Gateway) to the internet so external services
# (Shopify, SimplyPrint, SendCloud) can deliver webhooks during local dev.
# Start manually from the Tilt dashboard when testing webhook flows.
local_resource('tunnel',
serve_cmd='cloudflared tunnel --url http://localhost:3000',
auto_init=False,
labels=['tools'])
6. VS Code debugging¶
Add launch configurations for attaching to running services inside the cluster. Create or extend .vscode/launch.json:
{
"version": "0.2.0",
"configurations": [
{
"name": "Attach: Gateway (Tilt)",
"type": "node",
"request": "attach",
"port": 9229,
"restart": true,
"sourceMaps": true,
"localRoot": "${workspaceFolder}",
"remoteRoot": "/app"
},
{
"name": "Attach: Order Service (Tilt)",
"type": "node",
"request": "attach",
"port": 9230,
"restart": true,
"sourceMaps": true,
"localRoot": "${workspaceFolder}",
"remoteRoot": "/app"
},
{
"name": "Attach: Print Service (Tilt)",
"type": "node",
"request": "attach",
"port": 9231,
"restart": true,
"sourceMaps": true,
"localRoot": "${workspaceFolder}",
"remoteRoot": "/app"
},
{
"name": "Attach: Shipping Service (Tilt)",
"type": "node",
"request": "attach",
"port": 9232,
"restart": true,
"sourceMaps": true,
"localRoot": "${workspaceFolder}",
"remoteRoot": "/app"
},
{
"name": "Attach: GridFlock Service (Tilt)",
"type": "node",
"request": "attach",
"port": 9233,
"restart": true,
"sourceMaps": true,
"localRoot": "${workspaceFolder}",
"remoteRoot": "/app"
}
]
}
To enable debugging, the service's Dockerfile.dev CMD must include --inspect=0.0.0.0:9229 (adjusting the port per service), as the Dockerfiles above already do — or add an environment variable toggle so the inspector is opt-in:
CMD ["node_modules/.bin/tsx", "watch", "--inspect=0.0.0.0:9229", "apps/gateway/src/main.ts"]
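The environment-variable toggle can be sketched as a small entrypoint helper that assembles the tsx command line; `DEBUG_INSPECT`, `DEBUG_PORT`, and `build_cmd` are hypothetical names, not existing repo conventions:

```shell
#!/bin/sh
# Sketch: opt-in inspector. The entrypoint builds the tsx command line from
# DEBUG_INSPECT / DEBUG_PORT instead of hardcoding --inspect in the CMD.
build_cmd() {
  entry=$1
  if [ "${DEBUG_INSPECT:-false}" = "true" ]; then
    echo "node_modules/.bin/tsx watch --inspect=0.0.0.0:${DEBUG_PORT:-9229} ${entry}"
  else
    echo "node_modules/.bin/tsx watch ${entry}"
  fi
}

build_cmd apps/gateway/src/main.ts
# prints: node_modules/.bin/tsx watch apps/gateway/src/main.ts
DEBUG_INSPECT=true DEBUG_PORT=9230 build_cmd apps/order-service/src/main.ts
# prints: node_modules/.bin/tsx watch --inspect=0.0.0.0:9230 apps/order-service/src/main.ts
```

In a real entrypoint the last line would be `exec $(build_cmd apps/gateway/src/main.ts)`, with `DEBUG_INSPECT` set via the Deployment's `env:` block.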
7. Developer workflow¶
First time setup:
# 1. Install Rancher Desktop (see Prerequisites) and ensure Kubernetes is running
kubectl cluster-info # verify cluster is ready
# 2. Clone and install
git clone <repo-url> && cd forma-3d-connect
pnpm install # install dependencies
cp k8s/dev/secret.yaml.example k8s/dev/secret.yaml # add your API keys
# 3. Start
tilt up # start everything
Daily development:
tilt up # Rancher Desktop starts on login, cluster is always ready
# Edit files in apps/ or libs/ — changes sync into containers in <2 seconds
# Open http://localhost:4200 (web), http://localhost:3000 (API)
# Tilt dashboard at http://localhost:10350
# KubeView at http://localhost:8000 (cluster visualization)
tilt down # stop all services (cluster persists)
Resetting the environment:
tilt down
kubectl delete namespace forma3d-dev # remove all dev resources
tilt up # recreate everything fresh
8. Webhook testing during local development¶
The application receives inbound webhooks from three external services:
| Provider | Webhook Path | Routed To |
|---|---|---|
| Shopify | `/api/v1/webhooks/shopify` | `order-service` |
| SimplyPrint | `/webhooks/simplyprint` | `print-service` |
| SendCloud | `/webhooks/sendcloud` | `shipping-service` |
These external services cannot reach localhost. When testing flows that depend on real webhook delivery, a tunnel is needed to expose the Gateway to the internet.
Option A: Cloudflare Tunnel (recommended for real webhook testing)
Install cloudflared:
# macOS
brew install cloudflared
# Linux
curl -fsSL https://pkg.cloudflare.com/cloudflare-main.gpg | sudo tee /usr/share/keyrings/cloudflare-main.gpg >/dev/null
echo 'deb [signed-by=/usr/share/keyrings/cloudflare-main.gpg] https://pkg.cloudflare.com/cloudflared any main' | sudo tee /etc/apt/sources.list.d/cloudflared.list
sudo apt update && sudo apt install cloudflared
The Tiltfile includes a tunnel resource (disabled by default). Start it from the Tilt dashboard when needed — it prints a temporary public URL (e.g., https://<random>.trycloudflare.com). Configure the external service's webhook URL to point to this tunnel URL:
# Shopify: Update webhook URL in Shopify Partner Dashboard or via API
# SimplyPrint: Update webhook URL in SimplyPrint dashboard
# SendCloud: Update webhook URL in SendCloud panel
# Example: verify the tunnel forwards to the Gateway
curl https://<tunnel-url>/health/live
Option B: Simulate webhooks with curl (no tunnel needed)
For most development, simulating webhook payloads locally is faster and doesn't require internet access. Send signed requests directly to the Gateway:
# Simulate a Shopify order creation webhook
curl -X POST http://localhost:3000/api/v1/webhooks/shopify \
-H "Content-Type: application/json" \
-H "X-Shopify-Topic: orders/create" \
-H "X-Shopify-Hmac-Sha256: <computed-hmac>" \
-H "X-Shopify-Shop-Domain: your-shop.myshopify.com" \
-d @apps/order-service/src/shopify/__tests__/fixtures/order-created.json
# Simulate a SimplyPrint job status webhook
curl -X POST http://localhost:3000/webhooks/simplyprint \
-H "Content-Type: application/json" \
-d '{"event": "job.status_changed", "data": {"job_id": "123", "status": "completed"}}'
# Simulate a SendCloud parcel status webhook
curl -X POST http://localhost:3000/webhooks/sendcloud \
-H "Content-Type: application/json" \
-d '{"action": "parcel_status_changed", "parcel": {"id": 123, "status": {"id": 11}}}'
To bypass HMAC signature verification during local development, set `WEBHOOK_SKIP_VERIFICATION=true` in `k8s/dev/configmap.yaml` (this environment variable must only be respected when `NODE_ENV=development`).
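When exercising the real verification path instead, the Shopify header can be computed locally: Shopify signs webhooks with a base64-encoded HMAC-SHA256 of the raw request body, keyed with the app's client secret. A sketch with a placeholder secret and body:

```shell
#!/bin/sh
# Compute an X-Shopify-Hmac-Sha256 value for a simulated webhook body.
SECRET="local-dev-client-secret"                  # placeholder; use the real app secret
printf '{"id":123,"name":"#1001"}' > body.json    # stand-in for a fixture file
HMAC=$(openssl dgst -sha256 -hmac "$SECRET" -binary < body.json | base64)
echo "$HMAC"
```

Send the body with `--data-binary @body.json` rather than `-d @body.json`: `-d` strips newlines from the file, which changes the digest and breaks verification.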
When to use which:
| Scenario | Use |
|---|---|
| Testing webhook handler logic in isolation | Option B (curl) |
| Testing the full end-to-end flow with a real external service | Option A (Cloudflare Tunnel) |
| CI / automated tests | Option B (mock payloads in test fixtures) |
9. Excluded services¶
The following staging-only services are NOT included in the local development setup:
| Service | Reason |
|---|---|
| Slicer (BambuStudio) | Requires specific Linux binaries; optional for most development |
| ClickHouse | Observability infra, not needed for feature development |
| Grafana | Observability infra |
| Uptime Kuma | Monitoring infra |
| Dozzle | Log viewer — Tilt provides its own log aggregation |
| OTel Collector | Observability pipeline |
| Traefik Proxy (Ingress) | Tilt port-forwarding replaces the reverse proxy for local dev |
Included dev tools: KubeView (cluster visualization at localhost:8000) and Traefik Mesh (service mesh — same Helm chart reused in staging/production).
If a developer needs the Slicer locally, they can run it standalone via Docker alongside the Tilt-managed cluster:
docker run -p 3010:3010 -v $(pwd)/deployment/slicer/profiles:/profiles \
forma3d-connect-slicer:latest
📊 Migration Path Summary¶
| Step | When | What Happens | DNS Impact |
|---|---|---|---|
| Now (this prompt) | Today | Lower DNS TTLs to 60s, standardize health checks, externalize config | TTL changes only |
| Multi-tenancy | Next quarter | System grows, consider DOKS | None |
| DOKS setup | When needed | Create DOKS cluster, deploy services, create DO Load Balancer | None |
| Cut-over | Migration day | Update DNS A records from Droplet IP to LB IP | DNS update (propagates in <60s) |
| Cleanup | Post-migration | Decommission Droplet | None |
✅ Validation Checklist¶
DNS Preparation¶
- DNS TTL lowered to 60s on all staging A records
- DNS resolution verified (`dig` shows correct IP and low TTL)
- Droplet IP and datacenter documented in deployment docs and `.env.example`
- Migration cut-over runbook documented
- All services still accessible after TTL changes (TLS valid, health checks pass)
Container Registry¶
- All services pull from DOCR (`${REGISTRY_URL}/forma3d-connect-*`)
- CI pipeline tags images with `git-<sha>` and environment tags
- `pull_policy: always` set on all application services in Docker Compose
Health Checks¶
- All backend services expose `GET /health/live` (200 OK)
- All backend services expose `GET /health/ready` (200 OK when healthy, 503 when not)
- Docker Compose health checks use HTTP endpoints consistently
- Health check intervals and thresholds are consistent across services
- Third-party services (ClickHouse, Grafana, Uptime Kuma, Dozzle, OTel Collector) have functioning health checks
Graceful Shutdown¶
- `app.enableShutdownHooks()` called in all NestJS services
- BullMQ workers handle SIGTERM (close cleanly)
- `stop_grace_period` set on all services (30s for app services, 60s for ClickHouse, 30s for OTel Collector)
- Verified: `docker compose stop` completes without SIGKILL for all services
Configuration¶
- No hardcoded URLs, ports, or secrets in application code
- All environment variables documented with sensitivity classification
- Configuration reference document created
- `.env.example` is complete and up-to-date
Resource Constraints¶
- `deploy.resources` (limits + reservations) set for all services (including ClickHouse, Grafana, OTel Collector, Uptime Kuma, Dozzle)
- Resource values validated against `docker stats` observations
Statelessness¶
- No application service writes persistent state to local filesystem
- Sessions stored in Redis (not in-memory)
- No singleton in-memory caches that break with multiple replicas
- Stateful dependencies documented with Kubernetes migration strategy (including ClickHouse, Grafana, Uptime Kuma volumes)
Multi-Replica Readiness¶
- `expose:` used instead of `ports:` on all application services (only Traefik exposes host ports)
- `deploy.replicas: 1` set explicitly on all application services
- Traefik load-balances across replicas when scaling up (`docker compose up -d --scale gateway=2`)
- WebSocket sticky sessions configured in Traefik for Socket.IO services
- Socket.IO Redis adapter configured for cross-replica pub/sub
- Each application service tested at 2+ replicas (scaled up and back down)
- Services that cannot be scaled documented with reasoning
Rolling Updates¶
- `deploy.update_config` with `order: start-first` set on all application services
- `deploy.rollback_config` set on all application services
- Rolling update runbook documented (per-service `--no-deps` updates, including observability services)
- Rollback procedure documented and tested
- Database migration backward-compatibility rules documented (expand-and-contract pattern)
- CI pipeline updated to support targeted per-service deploys
- Zero-downtime verified: rolling update completes with no failed health checks
Local Development (Rancher Desktop + Tilt)¶
- Rancher Desktop with Kubernetes enabled and `dockerd (moby)` runtime documented as prerequisite
- `k8s/dev/` contains manifests for namespace, postgres, redis, configmap, secret.yaml.example, and all application services
- `Tiltfile` exists in the project root and loads all `k8s/dev/` manifests
- Dev Dockerfiles (`Dockerfile.dev`) exist for all application services (gateway, order-service, print-service, shipping-service, gridflock-service, web)
- Dev Dockerfiles include `openssl` for Prisma on Alpine and the `--inspect` flag for remote debugging
- `.dockerignore` exists in the project root to exclude `node_modules/`, `.git/`, `dist/`, etc. from the Docker build context
- Per-service `APP_PORT` env override is set in each K8s manifest (not in the shared configmap)
- `tilt up` from a clean clone (after `pnpm install` and secret.yaml setup) starts all services successfully
- PostgreSQL and Redis are provisioned automatically inside the cluster
- Prisma migrations run automatically on startup (after postgres is ready)
- Port-forwards work: `localhost:3000` (gateway), `localhost:4200` (web), `localhost:5432` (postgres), `localhost:6379` (redis), `localhost:8000` (KubeView)
- Live-update works: editing a file in `apps/<service>/src/` triggers a container restart within 2 seconds
- Web HMR works: editing a file in `apps/web/src/` reflects immediately in the browser
- Debug ports are accessible: `localhost:9229` (gateway), `9230`–`9233` (other services)
- VS Code can attach debugger to running services via launch configurations
- KubeView shows all pods, deployments, and services in the `forma3d-dev` namespace at `localhost:8000`
- Traefik Mesh is installed via Helm and running (`kubectl get pods -n forma3d-dev` shows mesh controller and proxies)
- Services with the `mesh.traefik.io/enabled: "true"` label are routed through the mesh
- `tilt down` cleanly stops all services
- `kubectl delete namespace forma3d-dev` cleanly removes all dev resources
- Existing `pnpm dev` workflow still works independently
- `k8s/dev/secret.yaml` is in `.gitignore`
- Tiltfile includes an on-demand `tunnel` resource (`cloudflared tunnel --url http://localhost:3000`) with `auto_init=False`
- `cloudflared` documented as an optional prerequisite (only needed for real webhook testing)
- Webhook simulation with `curl` documented with example payloads for Shopify, SimplyPrint, and SendCloud
- `WEBHOOK_SKIP_VERIFICATION` env variable supported in development mode for local `curl` testing
Verification Commands¶
# All services healthy and DNS TTLs lowered
curl -I https://staging-connect.forma3d.be
curl -I https://staging-connect-api.forma3d.be/health/live
curl -I https://staging-connect-api.forma3d.be/health/ready
dig staging-connect.forma3d.be | grep TTL
# Docker Compose validation
docker compose -f deployment/staging/docker-compose.yml config --quiet
# Graceful shutdown test
docker compose stop gateway # Should stop within 30s without SIGKILL
# Resource usage baseline
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
# Build passes
pnpm nx run-many -t build --all
# Tests pass
pnpm nx run-many -t test --all --exclude=api-e2e,acceptance-tests
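The "pre-deployment health gate script" from the rolling-update checklist can be sketched as a polling loop; `gate` and `GATE_INTERVAL` are hypothetical names, and in real use the probe is a `curl -fsS` against `/health/ready`:

```shell
#!/bin/sh
# Hypothetical pre-deployment health gate: run the probe command up to N times,
# succeed as soon as it passes, fail the deploy if it never does.
gate() {
  attempts=$1; shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "healthy"
      return 0
    fi
    i=$((i + 1))
    sleep "${GATE_INTERVAL:-2}"
  done
  echo "unhealthy after ${attempts} attempts" >&2
  return 1
}

# Real usage (gate a rolling update on staging):
#   gate 30 curl -fsS https://staging-connect-api.forma3d.be/health/ready || exit 1
gate 3 true   # demo probe that passes immediately; prints "healthy"
```

Wiring this in front of `docker compose up -d --no-deps <service>` in CI gives the zero-downtime guarantee the checklist asks for: the deploy aborts instead of cutting over to an unhealthy replica.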
🚫 Constraints and Rules¶
MUST DO¶
- Lower DNS TTLs to 60s on all staging A records (enables fast future cut-over)
- Document the migration cut-over procedure (DNS update from Droplet IP to LB IP)
- Verify all services expose HTTP health endpoints (`/health/live` and `/health/ready`)
- Enable graceful shutdown hooks in all NestJS services
- Add `stop_grace_period` to all Docker Compose application services
- Audit and document all environment variables with sensitivity classification
- Verify all application services are stateless
- Add resource constraints to Docker Compose services
- Use environment variables for all inter-service URLs (no hardcoding)
- Create `k8s/dev/` manifests (including KubeView), `Tiltfile`, and dev Dockerfiles for the local Rancher Desktop + Tilt workflow
- Install Traefik Mesh via Helm in the Tiltfile (same chart reused in staging/production on DOKS)
- Verify `tilt up` starts all services from a clean state (clone + `tilt up` = working environment)
- Preserve the existing `pnpm dev` workflow as a lightweight alternative
MUST NOT¶
- Create any production Kubernetes manifests, Helm charts, or Kustomize configs — not yet (local-dev K8s manifests in `k8s/dev/` are fine)
- Deploy K8s manifests to staging or production — the `k8s/dev/` manifests are for local development only
- Change the Docker Compose deployment workflow
- Remove or replace Traefik — it stays as the reverse proxy for now
- Over-engineer for Kubernetes patterns that aren't needed yet (sidecars, custom operators, etc.) — Traefik Mesh is allowed as it's lightweight and non-invasive
- Break any existing functionality or deployment process
- Use `any`, `ts-ignore`, or `eslint-disable`
SHOULD DO (Nice to Have)¶
- Document the Kubernetes migration path in `docs/05-deployment/kubernetes-migration-plan.md`
- Set up DigitalOcean monitoring alerts for DNS records
- Explore DigitalOcean's App Platform as an intermediate step before full Kubernetes
- Add a `tilt_config.json` for per-developer overrides (e.g., enable/disable Slicer, toggle debug ports)
- Create a `CONTRIBUTING.md` section documenting the `tilt up` workflow for new developers
🔄 Rollback Plan¶
All changes in this prompt are non-destructive:
- DNS TTL changes: Lowering TTLs is completely non-destructive. If needed, TTLs can be raised back to their original values.
- Health checks: Added endpoints don't affect existing functionality.
- Graceful shutdown: `enableShutdownHooks()` is additive — it doesn't change normal operation.
- Resource constraints: Docker Compose ignores `deploy.resources` unless using `docker compose up` with the `--compatibility` flag or Docker Swarm mode. In standalone Docker Compose, these serve as documentation.
- Configuration audit: Documentation-only changes.
- Rancher Desktop + Tilt: All local dev files (`k8s/dev/`, `Tiltfile`, `Dockerfile.dev`) are additive. They don't affect staging/production deployment, the CI pipeline, or the existing `pnpm dev` workflow. The Rancher Desktop cluster runs entirely on the developer's machine. KubeView is read-only. Traefik Mesh is opt-in (only affects services with the `mesh.traefik.io/enabled` label) and will be reused with the same Helm chart in staging/production.
📚 Key References¶
DigitalOcean: - DOCR: https://docs.digitalocean.com/products/container-registry/ - DOKS: https://docs.digitalocean.com/products/kubernetes/ - Load Balancers: https://docs.digitalocean.com/products/networking/load-balancers/ - Reserved IPs (Droplet-only): https://docs.digitalocean.com/products/networking/reserved-ips/ — Note: cannot be assigned to Load Balancers, only Droplets
Kubernetes Migration: - Docker Compose to Kubernetes: https://kubernetes.io/docs/tasks/configure-pod-container/translate-compose-kubernetes/ - Kompose (migration tool): https://kompose.io/
Local Development (Rancher Desktop + Tilt): - Rancher Desktop: https://rancherdesktop.io/ — local Kubernetes with built-in K3s, container runtime, and kubectl - Tilt: https://docs.tilt.dev/ — live development orchestration - Tilt live_update: https://docs.tilt.dev/live_update_reference — file sync into running containers - Tilt + Rancher Desktop: https://docs.tilt.dev/choosing_clusters#rancher-desktop — official Tilt integration guide - KubeView: https://github.com/benc-uk/kubeview — lightweight Kubernetes cluster visualization - Traefik Mesh: https://doc.traefik.io/traefik-mesh/ — lightweight service mesh (no sidecars, SMI-compatible) - Traefik Mesh install: https://doc.traefik.io/traefik-mesh/install/ — Helm-based installation - Cloudflare Tunnel (cloudflared): https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/ — free tunnel for exposing local services to the internet (webhook testing)
NestJS: - Graceful shutdown: https://docs.nestjs.com/fundamentals/lifecycle-events#application-shutdown - Health checks (Terminus): https://docs.nestjs.com/recipes/terminus
Existing Codebase:
- Docker Compose: deployment/staging/docker-compose.yml
- Traefik config: deployment/staging/traefik.yml
- Deployment guide: docs/05-deployment/staging-deployment-guide.md
- CI Pipeline: azure-pipelines.yml
END OF PROMPT
This prompt prepares the Forma3D.Connect infrastructure for a future Docker Compose to Kubernetes migration. The key networking deliverable is lowering DNS TTLs to 60s, enabling a fast DNS-based cut-over to a Load Balancer + DOKS cluster in the future (propagation under 1 minute). Note: DO Reserved IPs cannot be assigned to Load Balancers, so the migration uses a DNS update strategy instead.

Supporting changes include health check standardization, graceful shutdown, configuration externalization, resource constraints, statelessness verification, multi-replica readiness (load-balancing, sticky sessions, worker deduplication), and a rolling update strategy with health gates and backward-compatible migrations.

The system stays on Docker Compose for staging/production but becomes "Kubernetes-ready." A local Rancher Desktop + Tilt development environment provides production-parity K8s-based development with a one-command `tilt up` workflow.