Forma3D.Connect Operations Runbook

Version: 1.1
Date: January 18, 2026
Status: Active

Overview

This runbook provides operational procedures for Forma3D.Connect in production. It covers health checks, common issues, incident response, and maintenance procedures.


1. Service Architecture

Components

| Component | URL | Purpose |
|-----------|-----|---------|
| API | https://connect-api.forma3d.be | Backend NestJS application |
| Web | https://connect.forma3d.be | React dashboard |
| Database | PostgreSQL (managed) | Data persistence |
| Traefik | Internal | Reverse proxy with TLS |

Staging Environment

| Component | URL |
|-----------|-----|
| API | https://staging-connect-api.forma3d.be |
| Web | https://staging-connect.forma3d.be |

External Dependencies

| Service | Purpose | Documentation | Status Page |
|---------|---------|---------------|-------------|
| Shopify | E-commerce platform | Shopify API Docs | status.shopify.com |
| SimplyPrint | 3D print management | SimplyPrint API | - |
| Sendcloud | Shipping labels | Sendcloud API | status.sendcloud.com |
| Sentry | Error monitoring | Sentry Dashboard | - |

2. Health Checks

Endpoints

# Full health check (includes database)
curl https://connect-api.forma3d.be/health

# Liveness probe (process running)
curl https://connect-api.forma3d.be/health/live

# Readiness probe (database connected)
curl https://connect-api.forma3d.be/health/ready

# External dependencies (Shopify, SimplyPrint, Sendcloud)
curl https://connect-api.forma3d.be/health/dependencies
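
The four probes above can be run in one pass. The following is a minimal sketch, assuming curl is installed; the function name probe_endpoints and the 10-second timeout are illustrative choices and not part of the Forma3D.Connect tooling.

```shell
#!/usr/bin/env bash
# Sketch: print the HTTP status code returned by each health endpoint.
# probe_endpoints and the 10 s timeout are illustrative, not project tooling.

probe_endpoints() {
  local base_url="${1:-https://connect-api.forma3d.be}"
  local path code failures=0
  for path in /health /health/live /health/ready /health/dependencies; do
    # curl prints 000 when no HTTP response was received (DNS, TLS, timeout).
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "${base_url}${path}") || true
    printf '%-22s %s\n' "$path" "$code"
    [ "$code" = "200" ] || failures=$((failures + 1))
  done
  # Non-zero exit status when any endpoint did not return 200.
  [ "$failures" -eq 0 ]
}

# Usage: probe_endpoints https://staging-connect-api.forma3d.be
```

Because the function exits non-zero when any endpoint fails, it can be chained in a deployment script or cron job.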

Expected Responses

Healthy System:

{
  "status": "ok",
  "database": "connected",
  "timestamp": "2026-01-18T10:00:00.000Z",
  "version": "20260118100000",
  "build": {
    "number": "20260118100000",
    "date": "2026-01-18T10:00:00.000Z",
    "commit": "abc123def456"
  },
  "uptime": 86400
}

Note: The version field carries the CI/CD build number (identical to build.number), which identifies the deployment that is currently running.

Dependencies Healthy:

{
  "status": "ok",
  "info": {
    "shopify": { "status": "up", "message": "Shopify API is reachable" },
    "simplyprint": { "status": "up", "message": "SimplyPrint API is connected" },
    "sendcloud": { "status": "up", "message": "Sendcloud API is reachable" }
  }
}

Degraded (External Service Down):

{
  "status": "error",
  "error": {
    "shopify": {
      "status": "down",
      "message": "Connection timeout"
    }
  },
  "info": {
    "simplyprint": { "status": "up" },
    "sendcloud": { "status": "up" }
  }
}

Monitoring with cURL

# Quick health check script
while true; do
  HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://connect-api.forma3d.be/health)
  if [ "$HTTP_STATUS" != "200" ]; then
    echo "[$(date)] ALERT: Health check failed with status $HTTP_STATUS"
  else
    echo "[$(date)] OK: Health check passed"
  fi
  sleep 60
done
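
The loop above reports every failed probe and can stall if a connection hangs. A hedged variant is sketched below: it adds a request timeout and raises an alert only after several consecutive failures. The names FAILURE_THRESHOLD, record_probe and monitor_health, and the default threshold of 3, are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: alert only after several consecutive failed probes.
# FAILURE_THRESHOLD, record_probe and monitor_health are illustrative names.

FAILURE_THRESHOLD="${FAILURE_THRESHOLD:-3}"
consecutive_failures=0

# Update the failure counter from one HTTP status code and print a log line.
record_probe() {
  local status="$1"
  if [ "$status" = "200" ]; then
    consecutive_failures=0
    echo "[$(date)] OK: Health check passed"
  else
    consecutive_failures=$((consecutive_failures + 1))
    if [ "$consecutive_failures" -ge "$FAILURE_THRESHOLD" ]; then
      echo "[$(date)] ALERT: $consecutive_failures consecutive failures (last status $status)"
    else
      echo "[$(date)] WARN: Health check failed with status $status"
    fi
  fi
}

monitor_health() {
  local url="${1:-https://connect-api.forma3d.be/health}" status
  while true; do
    # --max-time stops a hung connection from stalling the loop.
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || true
    record_probe "$status"
    sleep 60
  done
}

# Usage: monitor_health
```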

3. Common Issues and Resolutions

High Error Rate

Symptoms:

  - Error rate > 5%
  - Sentry alerts increasing
  - Users reporting failures

Investigation:

  1. Check Sentry for error patterns
  2. Review API logs: docker logs forma3d-api --tail 100
  3. Check database connectivity: curl .../health/ready
  4. Verify external service status: curl .../health/dependencies

Resolution:

  1. If database issue: Check connection pool, restart API if needed
  2. If external service: Enable fallback/degraded mode, wait for recovery
  3. If code bug: Deploy hotfix via normal pipeline

High Latency

Symptoms:

  - 95th percentile response time > 2 seconds
  - Dashboard feels slow
  - Webhook processing delays

Investigation:

  1. Check database query performance
  2. Review slow query logs
  3. Check external API response times in logs
  4. Monitor memory/CPU usage: docker stats

Resolution:

  1. Scale up resources if needed
  2. Optimize slow database queries
  3. Add caching if appropriate
  4. Contact external service provider if their API is slow

Database Connection Failed

Symptoms:

  - Health check returns database: disconnected
  - API returns 500 errors
  - Readiness probe fails

Investigation:

  1. Check PostgreSQL status in DigitalOcean dashboard
  2. Verify connection string in environment
  3. Check network connectivity
  4. Review connection pool settings

Resolution:

# SSH to droplet
ssh root@<DROPLET_IP>

# Check API container
docker ps | grep forma3d-api

# Restart API container
docker-compose restart api

# Check logs for connection errors
docker logs forma3d-api --tail 50 | grep -i database

Shopify Webhooks Not Arriving

Symptoms:

  - New Shopify orders not appearing
  - Order count not increasing
  - No webhook logs in event log

Investigation:

  1. Check Shopify admin > Settings > Notifications > Webhooks
  2. Verify webhook URL is correct
  3. Check for failed webhook deliveries in Shopify
  4. Review API logs for HMAC verification failures

Resolution:

  1. Verify SHOPIFY_WEBHOOK_SECRET matches Shopify settings
  2. Re-register webhooks if needed
  3. Check firewall/security group rules
  4. Test webhook delivery manually from Shopify admin
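
When HMAC verification fails, it helps to recompute the signature by hand. Shopify signs each webhook with HMAC-SHA256 over the raw request body and sends the base64-encoded digest in the X-Shopify-Hmac-Sha256 header. The sketch below assumes openssl is installed and that the raw body of a failing delivery has been saved to a file; the function names are illustrative, and this runbook does not show how the API itself performs the check.

```shell
#!/usr/bin/env bash
# Sketch: recompute the HMAC-SHA256 digest Shopify sends in the
# X-Shopify-Hmac-Sha256 header. Function names are illustrative.

# Print the base64-encoded HMAC-SHA256 of a file containing the raw body.
shopify_hmac() {
  local secret="$1" body_file="$2"
  openssl dgst -sha256 -hmac "$secret" -binary < "$body_file" | openssl base64 -A
}

# Succeed only when the header value matches the recomputed digest.
verify_shopify_hmac() {
  local secret="$1" body_file="$2" header_value="$3"
  [ "$(shopify_hmac "$secret" "$body_file")" = "$header_value" ]
}

# Usage (payload.json must hold the exact bytes Shopify sent):
#   verify_shopify_hmac "$SHOPIFY_WEBHOOK_SECRET" payload.json "<header value>" \
#     && echo "signature matches" || echo "signature mismatch"
```

A mismatch on an unmodified payload points to a wrong SHOPIFY_WEBHOOK_SECRET. Passing the secret as an argument exposes it in the process list, so run this on a trusted host only.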

SimplyPrint API Unreachable

Symptoms:

  - Print jobs not being created
  - /health/dependencies shows SimplyPrint down
  - Error logs show SimplyPrint connection failures

Investigation:

  1. Check SimplyPrint dashboard manually
  2. Verify API credentials: SIMPLYPRINT_API_KEY, SIMPLYPRINT_COMPANY_ID
  3. Test API directly with curl

Resolution:

  1. Wait for SimplyPrint service recovery
  2. Failed jobs will be retried automatically via retry queue
  3. Manual retry: POST /api/v1/print-jobs/{id}/retry with API key
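
The manual retry can be issued with curl. This is only a sketch: the runbook states that the endpoint needs an API key but not how it is passed, so the X-API-Key header and the FORMA3D_API_KEY variable below are hypothetical and must be replaced with whatever the API actually expects.

```shell
#!/usr/bin/env bash
# Sketch: trigger a manual retry for one print job.
# ASSUMPTIONS: the X-API-Key header name and the FORMA3D_API_KEY variable are
# hypothetical; the runbook only states that the endpoint requires an API key.

retry_print_job() {
  local job_id="$1"
  local base_url="${2:-https://connect-api.forma3d.be}"
  if [ -z "$job_id" ] || [ -z "${FORMA3D_API_KEY:-}" ]; then
    echo "usage: FORMA3D_API_KEY=... retry_print_job <print-job-id> [base-url]" >&2
    return 2
  fi
  # --fail makes curl exit non-zero on HTTP 4xx/5xx responses.
  curl --fail -sS -X POST \
    -H "X-API-Key: ${FORMA3D_API_KEY}" \
    "${base_url}/api/v1/print-jobs/${job_id}/retry"
}

# Usage: retry_print_job <print-job-id> https://staging-connect-api.forma3d.be
```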

Orders Stuck in Processing

Symptoms:

  - Orders in PROCESSING state for > 60 minutes
  - No print job status updates
  - Customer complaints about delays

Investigation:

  1. Check print job status in dashboard
  2. Verify SimplyPrint job status in their dashboard
  3. Check for failed webhooks from SimplyPrint
  4. Review retry queue: GET /api/v1/admin/retry-queue

Resolution:

  1. Force refresh print job status from SimplyPrint
  2. Manually update order status if needed
  3. Contact SimplyPrint support for print issues
  4. Check printer status in SimplyPrint dashboard

Retry Queue Backlog

Symptoms:

  - More than 50 items in retry queue for > 15 minutes
  - Alerts for retry queue backlog
  - Failed operations not recovering

Investigation:

  1. Check retry queue status
  2. Identify failing job types
  3. Review error messages in retry entries
  4. Check if external services are down

Resolution:

  1. Fix underlying issue causing failures
  2. Clear old/stale entries if safe:

     DELETE FROM "RetryQueue"
     WHERE status = 'FAILED'
     AND "createdAt" < NOW() - INTERVAL '7 days';

  3. Increase retry processing capacity if needed
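
Before deleting retry entries, it is safer to count the rows that the cleanup statement would remove. The sketch below generates either a preview query or the delete statement, reusing the table and column names from the cleanup statement in this section. The helper name retry_queue_sql is illustrative, and the usage lines assume psql can connect with DATABASE_URL.

```shell
#!/usr/bin/env bash
# Sketch: emit SQL to preview or delete stale FAILED retry-queue entries.
# Table and column names come from the cleanup statement in this runbook;
# retry_queue_sql is an illustrative helper name.

retry_queue_sql() {
  local mode="$1" days="${2:-7}"
  # Accept digits only so the value is safe to interpolate into SQL.
  case "$days" in
    ''|*[!0-9]*) echo "days must be a whole number" >&2; return 2 ;;
  esac
  local filter="WHERE status = 'FAILED' AND \"createdAt\" < NOW() - INTERVAL '${days} days'"
  case "$mode" in
    preview) echo "SELECT COUNT(*) AS stale_entries FROM \"RetryQueue\" ${filter};" ;;
    delete)  echo "DELETE FROM \"RetryQueue\" ${filter};" ;;
    *) echo "usage: retry_queue_sql preview|delete [days]" >&2; return 2 ;;
  esac
}

# Usage:
#   retry_queue_sql preview | psql "$DATABASE_URL"
#   retry_queue_sql delete  | psql "$DATABASE_URL" -v ON_ERROR_STOP=1
```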


4. Incident Response

Severity Levels

| Level | Description | Response Time | Examples |
|-------|-------------|---------------|----------|
| P1 - Critical | Complete service outage | 15 minutes | Database down, API unresponsive |
| P2 - High | Major feature broken | 1 hour | Webhooks failing, fulfillments stuck |
| P3 - Medium | Degraded performance | 4 hours | High latency, intermittent errors |
| P4 - Low | Minor issue | 1 business day | UI bugs, documentation issues |

Incident Response Process

  1. Detect - Alert received or user report
  2. Assess - Determine severity and impact
  3. Communicate - Notify stakeholders if P1/P2
  4. Investigate - Follow runbook procedures
  5. Resolve - Apply fix or workaround
  6. Document - Create incident report
  7. Review - Post-incident review (for P1/P2)
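
For alerting or chat-ops scripts, the severity rules can be encoded as a small lookup. This is a sketch only: the values mirror the severity levels and the P1/P2 conditions in the process above, and the function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: look up the response target for a severity level and whether
# stakeholder communication and a post-incident review are required.
# Function names are illustrative.

response_target() {
  case "$1" in
    P1) echo "15 minutes" ;;
    P2) echo "1 hour" ;;
    P3) echo "4 hours" ;;
    P4) echo "1 business day" ;;
    *)  echo "unknown severity: $1" >&2; return 2 ;;
  esac
}

# Steps 3 and 7 (Communicate, Review) apply to P1 and P2 only.
requires_escalation() {
  case "$1" in
    P1|P2) return 0 ;;
    *)     return 1 ;;
  esac
}

# Usage: requires_escalation P2 && echo "notify stakeholders within $(response_target P2)"
```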

Incident Template

## Incident Report

**Date:** YYYY-MM-DD
**Severity:** P1/P2/P3/P4
**Duration:** HH:MM - HH:MM (X hours Y minutes)
**Impact:** [Description of user impact]

### Timeline
- HH:MM - Issue detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Fix deployed
- HH:MM - Issue resolved

### Root Cause
[Description of what caused the issue]

### Resolution
[What was done to fix it]

### Prevention
[What will be done to prevent recurrence]

### Action Items
- [ ] Action 1
- [ ] Action 2

5. Maintenance Procedures

Deploying Updates

Updates are deployed automatically by the Azure DevOps pipeline when changes are merged to main. For manual deployment:

# SSH to staging/production server
ssh root@<DROPLET_IP>

# Pull latest images
docker-compose pull

# Recreate each service in turn (a single-container service is briefly unavailable while it is replaced)
docker-compose up -d --no-deps api
docker-compose up -d --no-deps web

# Verify deployment
curl https://<domain>/health
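
A single curl immediately after the restart may run before the API is ready. The following sketch polls the health endpoint until it returns 200 or a retry budget runs out. The function name wait_for_healthy and its defaults (30 attempts, 2 seconds apart) are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: poll the health endpoint after a deployment until it returns 200.
# wait_for_healthy and its defaults (30 attempts, 2 s apart) are illustrative.

wait_for_healthy() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}"
  local i=1 code=""
  while [ "$i" -le "$attempts" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || true
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still unhealthy after $attempts attempts (last status $code)" >&2
  return 1
}

# Usage: wait_for_healthy https://connect-api.forma3d.be/health
```

Once the endpoint is healthy, compare the version field of the response with the expected build number to confirm that the new image is the one running.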

Database Migrations

Migrations run automatically during deployment. For manual migration:

# SSH to server
ssh root@<DROPLET_IP>

# Enter API container
docker exec -it forma3d-api sh

# Run migrations
npx prisma migrate deploy

# Exit container
exit

Rolling Back a Migration:

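# Mark a failed migration as rolled back in Prisma's migration history.
# This only updates the history table; it does not revert schema changes,
# so undo those manually (or restore a backup) before redeploying.
npx prisma migrate resolve --rolled-back MIGRATION_NAME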

Log Management

Logs are automatically rotated by Docker. Manual cleanup:

# SSH to server
ssh root@<DROPLET_IP>

# Check disk usage
df -h

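# Reclaim disk space. This removes stopped containers, unused networks,
# dangling images, build cache and unused volumes; it does not truncate
# the logs of running containers.
docker system prune --volumes -f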

# View specific container logs
docker logs forma3d-api --since 1h --tail 100
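
To see how much disk space a container's log actually uses, inspect the log file directly. The sketch below assumes the json-file logging driver, which exposes the log location through LogPath, and root access to read the file. The function name container_log_size is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: report the on-disk size of a container's log file.
# Assumes the json-file logging driver, which exposes LogPath;
# container_log_size is an illustrative name. Reading the file needs root.

container_log_size() {
  local container="${1:-forma3d-api}" log_path
  log_path=$(docker inspect --format '{{.LogPath}}' "$container") || return 1
  if [ -z "$log_path" ]; then
    echo "no log file for $container (logging driver may not write to disk)" >&2
    return 1
  fi
  du -h "$log_path"
}

# Usage:
#   container_log_size forma3d-api
#   docker inspect --format '{{json .HostConfig.LogConfig}}' forma3d-api   # rotation settings
```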

Backup Procedures

Database backups are handled by DigitalOcean Managed Databases (automatic daily backups with 7-day retention).

Manual backup:

# Export database
pg_dump "$DATABASE_URL" > "backup_$(date +%Y%m%d_%H%M%S).sql"

# Compress backup
gzip backup_*.sql
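
A manual backup should be checked before anyone relies on it. The sketch below wraps the two commands above, writes the dump under a temporary name so that a failed pg_dump never leaves a file that looks valid, and tests the archive with gzip -t. The function names backup_database and verify_backup are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: create a compressed dump and verify the archive before trusting it.
# backup_database and verify_backup are illustrative names.

# Succeed only if the file is non-empty and passes gzip's integrity test.
verify_backup() {
  local file="$1"
  [ -s "$file" ] && gzip -t < "$file" 2>/dev/null
}

backup_database() {
  local base
  base="backup_$(date +%Y%m%d_%H%M%S).sql"
  # Dump to a temporary name so a failed dump never looks like a valid backup.
  if ! pg_dump "$DATABASE_URL" > "${base}.partial"; then
    rm -f "${base}.partial"
    echo "pg_dump failed" >&2
    return 1
  fi
  mv "${base}.partial" "$base" && gzip "$base" \
    && verify_backup "${base}.gz" && echo "${base}.gz"
}

# Usage: backup_database    # prints the archive name on success
```

The integrity test only proves that the archive is readable; restoring it into a scratch database is the only full verification.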

Certificate Renewal

TLS certificates are automatically renewed by Traefik via Let's Encrypt. Verify renewal:

# Check certificate expiry
echo | openssl s_client -servername connect-api.forma3d.be -connect connect-api.forma3d.be:443 2>/dev/null | openssl x509 -noout -dates

# Check Traefik logs for renewal activity
docker logs forma3d-traefik 2>&1 | grep -i "acme\|certificate\|renew"
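
Reading the expiry dates by eye is easy to forget, so the check can be turned into a pass/fail test using the -checkend option of openssl x509. This is a sketch under the assumption that openssl is installed; the function names and the 14-day default are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: fail when a certificate expires within a given number of days.
# Function names and the 14-day default are illustrative.

# Read a PEM certificate on stdin; succeed if it is still valid in $1 days.
pem_valid_for_days() {
  local days="$1"
  openssl x509 -noout -checkend "$((days * 86400))" >/dev/null
}

# Fetch the certificate served by a host and apply the same check.
cert_valid_for_days() {
  local host="$1" days="${2:-14}"
  echo | openssl s_client -servername "$host" -connect "${host}:443" 2>/dev/null \
    | pem_valid_for_days "$days"
}

# Usage:
#   cert_valid_for_days connect-api.forma3d.be 14 || echo "certificate expires soon"
```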

6. Contact Information

| Role | Contact | Escalation |
|------|---------|------------|
| On-call Engineer | [email/phone] | Primary |
| Tech Lead | [email/phone] | If P1/P2 |
| DevOps | [email/phone] | Infrastructure issues |

7. Quick Reference Commands

# SSH to staging
ssh root@<STAGING_IP>

# SSH to production  
ssh root@<PRODUCTION_IP>

# View running containers
docker ps

# View container logs
docker logs forma3d-api --tail 100 -f
docker logs forma3d-web --tail 100 -f
docker logs forma3d-traefik --tail 100 -f

# Restart API
docker-compose restart api

# Restart all services
docker-compose restart

# Check disk space
df -h

# Check memory
free -h

# Check Docker resource usage
docker stats --no-stream

8. Load Testing

Running Load Tests

Load tests verify the system can handle expected production workloads (500+ orders/day).

# Run against staging
pnpm load-test:staging

# Run in baseline mode (no threshold failures)
k6 run --env ENV=staging --env BASELINE=true load-tests/k6/scenarios/order-throughput.js

Load Test via Azure DevOps Pipeline

  1. Navigate to Pipelines in Azure DevOps
  2. Click Run pipeline on the main branch
  3. Enable "Run Load Tests (optional)"
  4. Optionally enable "Load Test Baseline Mode" for data collection without threshold enforcement
  5. Click Run

The load test stage runs between Acceptance Tests and Production deployment. Results are available in the HTML Viewer tab with visual metric cards and raw JSON data.

Performance Thresholds

| Metric | Threshold | NFR Reference |
|--------|-----------|---------------|
| HTTP request duration | p(95) < 2000ms | NFR-PE-003 |
| HTTP request failed | < 1% | NFR-AV-001 |
| Checks pass rate | > 99% | - |
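
When reviewing results by hand, for example after a baseline run, the same thresholds can be applied to the measured numbers. The sketch below is a standalone helper that uses awk for the decimal comparisons. check_thresholds is an illustrative name and is not part of the k6 scenarios.

```shell
#!/usr/bin/env bash
# Sketch: compare measured load-test metrics with the thresholds above.
# check_thresholds is an illustrative helper, not part of the k6 scenarios.

# Args: p95 latency in ms, failed-request rate in %, check pass rate in %.
check_thresholds() {
  awk -v p95="$1" -v failed="$2" -v checks="$3" 'BEGIN {
    ok = 1
    if (!(p95 < 2000))  { print "FAIL: p(95) " p95 " ms is not < 2000 ms"; ok = 0 }
    if (!(failed < 1))  { print "FAIL: failed rate " failed "% is not < 1%"; ok = 0 }
    if (!(checks > 99)) { print "FAIL: check pass rate " checks "% is not > 99%"; ok = 0 }
    if (ok) print "PASS: all thresholds met"
    exit(ok ? 0 : 1)
  }'
}

# Usage: check_thresholds 1500 0.4 99.8
```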

Troubleshooting Load Test Failures

High failure rate (> 1%):

  - Check /health/dependencies for external service issues
  - Review Sentry for error spikes during the test window
  - Check database connection pool exhaustion (DATABASE_POOL_SIZE)

Slow response times (p95 > 2s):

  - Check database query performance in Sentry
  - Review N+1 query patterns
  - Check for resource contention (CPU/memory) on the droplet

Test seeding failures:

  - Verify NODE_ENV is not production (seeding disabled in prod)
  - Check if test-seeding module is loaded


Revision History:

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2026-01-18 | Phase 6 Implementation | Initial runbook |
| 1.1 | 2026-01-18 | Phase 6 Implementation | Added load testing section, updated version format |