Forma3D.Connect Operations Runbook

Version: 1.1
Date: January 18, 2026
Status: Active

Overview

This runbook provides operational procedures for Forma3D.Connect in production. It covers health checks, common issues, incident response, and maintenance procedures.

1. Service Architecture

Components

Component	URL	Purpose
API	`https://connect-api.forma3d.be`	Backend NestJS application
Web	`https://connect.forma3d.be`	React dashboard
Database	PostgreSQL (managed)	Data persistence
Traefik	Internal	Reverse proxy with TLS

Staging Environment

Component	URL
API	`https://staging-connect-api.forma3d.be`
Web	`https://staging-connect.forma3d.be`

External Dependencies

Service	Purpose	Documentation	Status Page
Shopify	E-commerce platform	Shopify API Docs	status.shopify.com
SimplyPrint	3D print management	SimplyPrint API	-
Sendcloud	Shipping labels	Sendcloud API	status.sendcloud.com
Sentry	Error monitoring	Sentry Dashboard	-

2. Health Checks

Endpoints

# Full health check (includes database)
curl https://connect-api.forma3d.be/health

# Liveness probe (process running)
curl https://connect-api.forma3d.be/health/live

# Readiness probe (database connected)
curl https://connect-api.forma3d.be/health/ready

# External dependencies (Shopify, SimplyPrint, Sendcloud)
curl https://connect-api.forma3d.be/health/dependencies

Expected Responses

Healthy System:

{
  "status": "ok",
  "database": "connected",
  "timestamp": "2026-01-18T10:00:00.000Z",
  "version": "20260118100000",
  "build": {
    "number": "20260118100000",
    "date": "2026-01-18T10:00:00.000Z",
    "commit": "abc123def456"
  },
  "uptime": 86400
}

Note: The `version` field displays the CI/CD build number (same as `build.number`), which makes it easy to identify which deployment is running.
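Since the version mirrors the build number, a deployment can be confirmed by comparing it with the build you expect. The following is a minimal sketch that assumes the response shape shown above; the function names `health_version` and `is_expected_build` are illustrative and not part of the project.

```shell
# Extract the "version" string from a /health response body read on stdin.
# Assumes the flat JSON shape shown above; a JSON-aware tool such as jq
# would be more robust if it is installed on the host.
health_version() {
  sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Return 0 when the running version equals the expected build number.
# $1 = expected build number, $2 = /health response body
is_expected_build() {
  [ "$(printf '%s\n' "$2" | health_version)" = "$1" ]
}
```

Possible usage, with `BUILD_NUMBER` standing in for the build you just deployed: `is_expected_build "$BUILD_NUMBER" "$(curl -s https://connect-api.forma3d.be/health)" || echo "unexpected build running"`.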

Dependencies Healthy:

{
  "status": "ok",
  "info": {
    "shopify": { "status": "up", "message": "Shopify API is reachable" },
    "simplyprint": { "status": "up", "message": "SimplyPrint API is connected" },
    "sendcloud": { "status": "up", "message": "Sendcloud API is reachable" }
  }
}

Degraded (External Service Down):

{
  "status": "error",
  "error": {
    "shopify": {
      "status": "down",
      "message": "Connection timeout"
    }
  },
  "info": {
    "simplyprint": { "status": "up" },
    "sendcloud": { "status": "up" }
  }
}
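When the dependencies endpoint reports an error, it helps to pull out just the names of the failing services. This sketch assumes the response shapes shown above and uses only `tr`, `grep`, and `sed`; `down_dependencies` is a hypothetical helper name, and a JSON parser would be a sturdier choice where one is available.

```shell
# Print the name of every dependency whose status is "down", one per line.
# Reads a /health/dependencies response body on stdin.
down_dependencies() {
  tr -d '\n' |
    grep -o '"[A-Za-z0-9_-]*"[[:space:]]*:[[:space:]]*{[[:space:]]*"status"[[:space:]]*:[[:space:]]*"down"' |
    sed 's/^"\([^"]*\)".*/\1/'
}
```

Possible usage: `curl -s https://connect-api.forma3d.be/health/dependencies | down_dependencies`. Empty output means no dependency reported itself as down.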

Monitoring with cURL

# Quick health check script
while true; do
  HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 https://connect-api.forma3d.be/health)
  if [ "$HTTP_STATUS" != "200" ]; then
    echo "[$(date)] ALERT: Health check failed with status $HTTP_STATUS"
  else
    echo "[$(date)] OK: Health check passed"
  fi
  sleep 60
done
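A single failed probe is often just a transient blip, so one possible refinement of the loop above is to alert only after several consecutive failures. This is a sketch under assumptions: `FAIL_THRESHOLD`, `record_check`, and `CHECK_RESULT` are hypothetical names, and the default of three failures is an arbitrary starting point.

```shell
# Number of consecutive failures before escalating (assumed default).
FAIL_THRESHOLD="${FAIL_THRESHOLD:-3}"
fail_streak=0
CHECK_RESULT=""

# Record one probe result and set CHECK_RESULT to OK, WARN, or ALERT.
# $1 = HTTP status code returned by the health endpoint
record_check() {
  if [ "$1" = "200" ]; then
    fail_streak=0
    CHECK_RESULT="OK"
  else
    fail_streak=$((fail_streak + 1))
    if [ "$fail_streak" -ge "$FAIL_THRESHOLD" ]; then
      CHECK_RESULT="ALERT"
    else
      CHECK_RESULT="WARN"
    fi
  fi
}
```

Inside the monitoring loop, call `record_check "$HTTP_STATUS"` directly (not in a `$(...)` substitution, which would lose the counter) and then log `$CHECK_RESULT`.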

3. Common Issues and Resolutions

High Error Rate

Symptoms:
- Error rate > 5%
- Sentry alerts increasing
- Users reporting failures

Investigation:
1. Check Sentry for error patterns
2. Review API logs: `docker logs forma3d-api --tail 100`
3. Check database connectivity: `curl .../health/ready`
4. Verify external service status: `curl .../health/dependencies`

Resolution:
1. If database issue: Check connection pool, restart API if needed
2. If external service: Enable fallback/degraded mode, wait for recovery
3. If code bug: Deploy hotfix via normal pipeline
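To put a number on the 5% symptom threshold, the error rate can be computed from a list of response status codes. The sketch below assumes you can produce one HTTP status code per line (for example by extracting them from the API logs); that input format and the function names are assumptions, since the log layout is not described here.

```shell
# Print the share of 5xx responses as a percentage with one decimal.
# Reads one HTTP status code per line on stdin; other lines are ignored.
error_rate_pct() {
  awk '
    /^[0-9][0-9][0-9]$/ { total++; if ($1 >= 500) errors++ }
    END {
      if (total == 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * errors / total
    }'
}

# Return 0 when the given rate is above the 5% threshold from this runbook.
# $1 = error rate in percent
error_rate_high() {
  awk -v rate="$1" 'BEGIN { exit !(rate > 5) }'
}
```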

High Latency

Symptoms:
- 95th percentile response time > 2 seconds
- Dashboard feels slow
- Webhook processing delays

Investigation:
1. Check database query performance
2. Review slow query logs
3. Check external API response times in logs
4. Monitor memory/CPU usage: `docker stats`

Resolution:
1. Scale up resources if needed
2. Optimize slow database queries
3. Add caching if appropriate
4. Contact external service provider if their API is slow
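The 95th percentile symptom can be checked against a sample of response times. This sketch uses the nearest-rank method and assumes one duration in milliseconds per line on stdin; how those durations are collected is left open, and `p95` is a hypothetical helper name.

```shell
# Print the 95th percentile (nearest-rank) of numeric values read from stdin.
# Returns non-zero when there is no input.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((95 * NR + 99) / 100)   # integer form of ceil(0.95 * NR)
      print v[rank]
    }'
}
```

A result above 2000 corresponds to the 2-second symptom threshold listed above.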

Database Connection Failed

Symptoms:
- Health check returns `database: disconnected`
- API returns 500 errors
- Readiness probe fails

Investigation:
1. Check PostgreSQL status in DigitalOcean dashboard
2. Verify connection string in environment
3. Check network connectivity
4. Review connection pool settings

Resolution:

# SSH to droplet
ssh root@<DROPLET_IP>

# Check API container
docker ps | grep forma3d-api

# Restart API container
docker-compose restart api

# Check logs for connection errors
docker logs forma3d-api --tail 50 | grep -i database

Shopify Webhooks Not Arriving

Symptoms:
- New Shopify orders not appearing
- Order count not increasing
- No webhook logs in event log

Investigation:
1. Check Shopify admin > Settings > Notifications > Webhooks
2. Verify webhook URL is correct
3. Check for failed webhook deliveries in Shopify
4. Review API logs for HMAC verification failures

Resolution:
1. Verify `SHOPIFY_WEBHOOK_SECRET` matches Shopify settings
2. Re-register webhooks if needed
3. Check firewall/security group rules
4. Test webhook delivery manually from Shopify admin
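Shopify signs each webhook with a base64-encoded HMAC-SHA256 of the raw request body and sends it in the `X-Shopify-Hmac-Sha256` header. When chasing HMAC verification failures, the signature can be recomputed by hand from a captured request body. The sketch below is a diagnostic aid only, not the application's verification code; the function names are hypothetical.

```shell
# Compute the base64 HMAC-SHA256 of a raw webhook body.
# $1 = webhook secret, $2 = path to a file holding the exact raw request body
# Note: the secret is visible in the process list while openssl runs.
shopify_hmac() {
  openssl dgst -sha256 -hmac "$1" -binary < "$2" | openssl base64 -A
}

# Return 0 when the computed signature equals the received header value.
# $1 = secret, $2 = body file, $3 = X-Shopify-Hmac-Sha256 header value
verify_shopify_hmac() {
  [ "$(shopify_hmac "$1" "$2")" = "$3" ]
}
```

A mismatch on a body that Shopify reports as delivered usually points to a wrong secret or to the body being altered (for example re-serialized JSON) before the check.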

SimplyPrint API Unreachable

Symptoms:
- Print jobs not being created
- `/health/dependencies` shows SimplyPrint down
- Error logs show SimplyPrint connection failures

Investigation:
1. Check SimplyPrint dashboard manually
2. Verify API credentials: `SIMPLYPRINT_API_KEY`, `SIMPLYPRINT_COMPANY_ID`
3. Test API directly with curl

Resolution:
1. Wait for SimplyPrint service recovery
2. Failed jobs will be retried automatically via retry queue
3. Manual retry: `POST /api/v1/print-jobs/{id}/retry` with API key

Orders Stuck in Processing

Symptoms:
- Orders in PROCESSING state for > 60 minutes
- No print job status updates
- Customer complaints about delays

Investigation:
1. Check print job status in dashboard
2. Verify SimplyPrint job status in their dashboard
3. Check for failed webhooks from SimplyPrint
4. Review retry queue: `GET /api/v1/admin/retry-queue`

Resolution:
1. Force refresh print job status from SimplyPrint
2. Manually update order status if needed
3. Contact SimplyPrint support for print issues
4. Check printer status in SimplyPrint dashboard
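The 60-minute rule can be applied mechanically to an exported list of orders. This sketch assumes a hypothetical three-column export (order id, status, last-update time in epoch seconds); the real schema and export method are not specified in this runbook, so treat the column layout as a placeholder.

```shell
# Print the ids of orders that have been in PROCESSING longer than the limit.
# stdin: lines of "<order_id> <status> <updated_epoch_seconds>" (assumed format)
# $1 = current time in epoch seconds, $2 = max age in minutes (default 60)
stuck_orders() {
  awk -v now="$1" -v max_min="${2:-60}" '
    $2 == "PROCESSING" && (now - $3) > max_min * 60 { print $1 }'
}
```

Possible usage: `stuck_orders "$(date +%s)" < orders.txt`, where `orders.txt` is whatever export you produced.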

Retry Queue Backlog

Symptoms:
- More than 50 items in retry queue for > 15 minutes
- Alerts for retry queue backlog
- Failed operations not recovering

Investigation:
1. Check retry queue status
2. Identify failing job types
3. Review error messages in retry entries
4. Check if external services are down

Resolution:
1. Fix underlying issue causing failures
2. Clear old/stale entries if safe:

DELETE FROM "RetryQueue" 
WHERE status = 'FAILED' 
AND "createdAt" < NOW() - INTERVAL '7 days';

3. Increase retry processing capacity if needed
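Before deleting anything, it is worth counting the rows the cleanup would remove. The sketch below generates either a `COUNT` or the `DELETE` statement with the same `WHERE` clause as the query above, so the preview and the cleanup cannot drift apart. `stale_retry_sql` is a hypothetical helper; piping its output to `psql "$DATABASE_URL"` is one assumed way to run it.

```shell
# Print SQL that counts or deletes FAILED retry entries older than N days.
# $1 = "count" or "delete", $2 = age in days (default 7)
stale_retry_sql() {
  local days="${2:-7}"
  case "$days" in
    ''|*[!0-9]*) echo "days must be a whole number" >&2; return 1 ;;
  esac
  local where="WHERE status = 'FAILED' AND \"createdAt\" < NOW() - INTERVAL '$days days'"
  case "$1" in
    count)  echo "SELECT COUNT(*) FROM \"RetryQueue\" $where;" ;;
    delete) echo "DELETE FROM \"RetryQueue\" $where;" ;;
    *) echo "usage: stale_retry_sql count|delete [days]" >&2; return 1 ;;
  esac
}
```

Run the `count` form first and proceed with `delete` only if the number matches expectations.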

4. Incident Response

Severity Levels

Level	Description	Response Time	Examples
P1 - Critical	Complete service outage	15 minutes	Database down, API unresponsive
P2 - High	Major feature broken	1 hour	Webhooks failing, fulfillments stuck
P3 - Medium	Degraded performance	4 hours	High latency, intermittent errors
P4 - Low	Minor issue	1 business day	UI bugs, documentation issues

Incident Response Process

1. Detect - Alert received or user report
2. Assess - Determine severity and impact
3. Communicate - Notify stakeholders if P1/P2
4. Investigate - Follow runbook procedures
5. Resolve - Apply fix or workaround
6. Document - Create incident report
7. Review - Post-incident review (for P1/P2)

Incident Template

## Incident Report

**Date:** YYYY-MM-DD
**Severity:** P1/P2/P3/P4
**Duration:** HH:MM - HH:MM (X hours Y minutes)
**Impact:** [Description of user impact]

### Timeline
- HH:MM - Issue detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Fix deployed
- HH:MM - Issue resolved

### Root Cause
[Description of what caused the issue]

### Resolution
[What was done to fix it]

### Prevention
[What will be done to prevent recurrence]

### Action Items
- [ ] Action 1
- [ ] Action 2

5. Maintenance Procedures

Deploying Updates

Updates are deployed automatically by the Azure DevOps pipeline when changes are merged to main. For manual deployment:

# SSH to staging/production server
ssh root@<DROPLET_IP>

# Pull latest images
docker-compose pull

# Recreate services one at a time (expect a brief interruption while each container restarts)
docker-compose up -d --no-deps api
docker-compose up -d --no-deps web

# Verify deployment
curl https://<domain>/health
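A freshly recreated container may need a few seconds before it answers, so a single `curl` can report a false failure. One option is to poll the health endpoint until it returns 200 or a retry budget runs out. This is a sketch; `wait_for_health` and its default of 30 attempts two seconds apart are assumptions, not project settings.

```shell
# Poll a health URL until it returns HTTP 200 or the attempts run out.
# $1 = health URL, $2 = max attempts (default 30), $3 = seconds between attempts (default 2)
wait_for_health() {
  local url="$1"
  local attempts="${2:-30}"
  local delay="${3:-2}"
  local i=1
  local code
  while [ "$i" -le "$attempts" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || code=000
    [ "$code" = "200" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Possible usage after `docker-compose up`: `wait_for_health https://<domain>/health || echo "deployment did not become healthy"`.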

Database Migrations

Migrations run automatically during deployment. For manual migration:

# SSH to server
ssh root@<DROPLET_IP>

# Enter API container
docker exec -it forma3d-api sh

# Run migrations
npx prisma migrate deploy

# Exit container
exit

Rolling Back a Migration:

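# Mark a failed migration as rolled back in Prisma's migration history.
# This only updates the migration records; it does not revert schema changes,
# so undo any partially applied changes separately before redeploying.
npx prisma migrate resolve --rolled-back MIGRATION_NAME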

Log Management

Logs are automatically rotated by Docker. Manual cleanup:

# SSH to server
ssh root@<DROPLET_IP>

# Check disk usage
df -h

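# Reclaim disk space: removes stopped containers, unused networks, dangling images,
# and build cache. This does not truncate the logs of running containers.
# Add --volumes only after confirming that no unused volume holds data you need.
docker system prune -f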

# View specific container logs
docker logs forma3d-api --since 1h --tail 100

Backup Procedures

Database backups are handled by DigitalOcean Managed Databases (automatic daily backups with 7-day retention).

Manual backup:

# Export database
pg_dump "$DATABASE_URL" > "backup_$(date +%Y%m%d_%H%M%S).sql"

# Compress backup
gzip backup_*.sql
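A manual backup is only useful if it can be read back. The sketch below checks that a compressed dump is non-empty, is a valid gzip archive, and ends with the completion marker that `pg_dump` writes at the end of a plain-text dump. `verify_backup` is a hypothetical helper, and passing these checks does not replace a real restore test.

```shell
# Sanity-check a compressed plain-text pg_dump file.
# $1 = path to a .sql.gz file produced by the commands above
verify_backup() {
  [ -s "$1" ] || { echo "missing or empty: $1" >&2; return 1; }
  gzip -t "$1" 2>/dev/null || { echo "corrupt gzip archive: $1" >&2; return 1; }
  # A finished plain-text dump ends with this marker near the last lines.
  gzip -dc "$1" | tail -n 10 | grep -q 'PostgreSQL database dump complete' ||
    { echo "dump looks truncated: $1" >&2; return 1; }
}
```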

Certificate Renewal

TLS certificates are automatically renewed by Traefik via Let's Encrypt. Verify renewal:

# Check certificate expiry
echo | openssl s_client -connect connect-api.forma3d.be:443 2>/dev/null | openssl x509 -noout -dates

# Check Traefik logs for renewal activity
docker logs forma3d-traefik 2>&1 | grep -i "acme\|certificate\|renew"
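Reading the expiry dates by eye works, but `openssl x509 -checkend` can turn the check into a pass/fail result suitable for a scheduled job. The sketch below assumes a 14-day warning window in its usage example, which is an arbitrary choice; `cert_valid_for_days` is a hypothetical helper name.

```shell
# Return 0 when the certificate read from stdin is still valid N days from now.
# $1 = number of days; stdin = PEM certificate (s_client output also works)
cert_valid_for_days() {
  openssl x509 -noout -checkend "$(( $1 * 86400 ))" >/dev/null
}
```

Possible usage: `echo | openssl s_client -connect connect-api.forma3d.be:443 -servername connect-api.forma3d.be 2>/dev/null | cert_valid_for_days 14 || echo "certificate expires within 14 days"`.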

6. Contact Information

Role	Contact	Escalation
On-call Engineer	[email/phone]	Primary
Tech Lead	[email/phone]	If P1/P2
DevOps	[email/phone]	Infrastructure issues

7. Quick Reference Commands

# SSH to staging
ssh root@<STAGING_IP>

# SSH to production  
ssh root@<PRODUCTION_IP>

# View running containers
docker ps

# View container logs
docker logs forma3d-api --tail 100 -f
docker logs forma3d-web --tail 100 -f
docker logs forma3d-traefik --tail 100 -f

# Restart API
docker-compose restart api

# Restart all services
docker-compose restart

# Check disk space
df -h

# Check memory
free -h

# Check Docker resource usage
docker stats --no-stream
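The disk check above can be reduced to a single number for alerting. This sketch parses the portable `df -P` output format; the 80% default threshold is an assumption, not a documented limit, and both function names are hypothetical.

```shell
# Print the usage percentage (integer) of the first filesystem in `df -P` output.
# stdin = output of `df -P <path>`
disk_usage_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 when usage is at or above the threshold.
# $1 = usage percent, $2 = threshold percent (default 80, assumed)
disk_usage_high() {
  [ "$1" -ge "${2:-80}" ]
}
```

Possible usage: `disk_usage_high "$(df -P / | disk_usage_pct)" && echo "disk usage high on /"`.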

8. Load Testing

Running Load Tests

Load tests verify the system can handle expected production workloads (500+ orders/day).

# Run against staging
pnpm load-test:staging

# Run in baseline mode (no threshold failures)
k6 run --env ENV=staging --env BASELINE=true load-tests/k6/scenarios/order-throughput.js

Load Test via Azure DevOps Pipeline

1. Navigate to Pipelines in Azure DevOps
2. Click Run pipeline on the main branch
3. Enable "Run Load Tests (optional)"
4. Optionally enable "Load Test Baseline Mode" for data collection without threshold enforcement
5. Click Run

The load test stage runs between Acceptance Tests and Production deployment. Results are available in the HTML Viewer tab with visual metric cards and raw JSON data.

Performance Thresholds

Metric	Threshold	NFR Reference
HTTP request duration	p(95) < 2000ms	NFR-PE-003
HTTP request failed	< 1%	NFR-AV-001
Checks pass rate	> 99%	-
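When reviewing baseline-mode results, where k6 does not enforce thresholds, the three limits in the table can be applied by hand. The sketch below takes the values as arguments; how you read them out of the result JSON is left open, and `check_thresholds` is a hypothetical helper, not part of the load test suite.

```shell
# Compare load test results with the thresholds in the table above.
# $1 = p95 request duration in ms, $2 = failed request rate in %, $3 = checks pass rate in %
# Prints PASS or one FAIL line per breached threshold; returns non-zero on any breach.
check_thresholds() {
  awk -v p95="$1" -v fail="$2" -v checks="$3" 'BEGIN {
    ok = 1
    if (!(p95 < 2000))  { print "FAIL http_req_duration p(95)=" p95 "ms (limit < 2000ms)"; ok = 0 }
    if (!(fail < 1))    { print "FAIL http_req_failed=" fail "% (limit < 1%)"; ok = 0 }
    if (!(checks > 99)) { print "FAIL checks=" checks "% (limit > 99%)"; ok = 0 }
    if (ok) print "PASS"
    exit !ok
  }'
}
```

For example, `check_thresholds 1500 0.2 99.8` prints `PASS`.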

Troubleshooting Load Test Failures

High failure rate (> 1%):
- Check `/health/dependencies` for external service issues
- Review Sentry for error spikes during the test window
- Check database connection pool exhaustion (`DATABASE_POOL_SIZE`)

Slow response times (p95 > 2s):
- Check database query performance in Sentry
- Review N+1 query patterns
- Check for resource contention (CPU/memory) on the droplet

Test seeding failures:
- Verify `NODE_ENV` is not production (seeding disabled in prod)
- Check if test-seeding module is loaded

Revision History:

Version	Date	Author	Changes
1.0	2026-01-18	Phase 6 Implementation	Initial runbook
1.1	2026-01-18	Phase 6 Implementation	Added load testing section, updated version format