Forma3D.Connect Operations Runbook
Version: 1.1
Date: January 18, 2026
Status: Active
Overview
This runbook provides operational procedures for Forma3D.Connect in production. It covers health checks, common issues, incident response, and maintenance procedures.
1. Service Architecture
Components
| Component | URL | Purpose |
|---|---|---|
| API | https://connect-api.forma3d.be | Backend NestJS application |
| Web | https://connect.forma3d.be | React dashboard |
| Database | PostgreSQL (managed) | Data persistence |
| Traefik | Internal | Reverse proxy with TLS |
Staging Environment
| Component | URL |
|---|---|
| API | https://staging-connect-api.forma3d.be |
| Web | https://staging-connect.forma3d.be |
External Dependencies
| Service | Purpose | Documentation | Status Page |
|---|---|---|---|
| Shopify | E-commerce platform | Shopify API Docs | status.shopify.com |
| SimplyPrint | 3D print management | SimplyPrint API | - |
| Sendcloud | Shipping labels | Sendcloud API | status.sendcloud.com |
| Sentry | Error monitoring | Sentry Dashboard | - |
2. Health Checks
Endpoints
# Full health check (includes database)
curl https://connect-api.forma3d.be/health
# Liveness probe (process running)
curl https://connect-api.forma3d.be/health/live
# Readiness probe (database connected)
curl https://connect-api.forma3d.be/health/ready
# External dependencies (Shopify, SimplyPrint, Sendcloud)
curl https://connect-api.forma3d.be/health/dependencies
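The four probes above can be run in one pass. The following is a minimal sketch, not an official script: the function name `check_health_endpoints` is hypothetical, and it assumes every endpoint answers HTTP 200 when healthy, which this runbook's monitoring loop relies on only for `/health`.

```shell
# Probe each health endpoint and print "<path> <http status>" per line.
# Returns non-zero if any endpoint answers with something other than 200
# (assumption: 200 means healthy for all four endpoints).
check_health_endpoints() {
  local base_url="${1:-https://connect-api.forma3d.be}"
  local failed=0 path code
  for path in /health /health/live /health/ready /health/dependencies; do
    # -s hides progress, -o discards the body, -w prints only the status code
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "${base_url}${path}")
    echo "${path} ${code}"
    [ "$code" = "200" ] || failed=1
  done
  return "$failed"
}
```

Calling `check_health_endpoints https://staging-connect-api.forma3d.be` runs the same probes against staging.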
Expected Responses
Healthy System:
{
"status": "ok",
"database": "connected",
"timestamp": "2026-01-18T10:00:00.000Z",
"version": "20260118100000",
"build": {
"number": "20260118100000",
"date": "2026-01-18T10:00:00.000Z",
"commit": "abc123def456"
},
"uptime": 86400
}
Note: The version field now displays the CI/CD build number (same as build.number), making it easy to identify which deployment is running.
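Because the version field carries the build number, a quick comparison can confirm which deployment is live. This is a rough sketch: `health_version` is a hypothetical helper, and it uses a `sed` pattern instead of a JSON parser, so it assumes the response contains exactly one top-level "version" key, as in the sample above.

```shell
# Extract the value of the "version" key from a /health response on stdin.
# Pattern-based, so it is only suitable for the simple response shape shown
# in this runbook; a JSON-aware tool would be more robust.
health_version() {
  sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (performs a network call, so it is left commented out):
#   curl -s https://connect-api.forma3d.be/health | health_version
```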
Dependencies Healthy:
{
"status": "ok",
"info": {
"shopify": { "status": "up", "message": "Shopify API is reachable" },
"simplyprint": { "status": "up", "message": "SimplyPrint API is connected" },
"sendcloud": { "status": "up", "message": "Sendcloud API is reachable" }
}
}
Degraded (External Service Down):
{
"status": "error",
"error": {
"shopify": {
"status": "down",
"message": "Connection timeout"
}
},
"info": {
"simplyprint": { "status": "up" },
"sendcloud": { "status": "up" }
}
}
Monitoring with cURL
# Quick health check script
while true; do
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://connect-api.forma3d.be/health)
if [ "$HTTP_STATUS" != "200" ]; then
echo "[$(date)] ALERT: Health check failed with status $HTTP_STATUS"
else
echo "[$(date)] OK: Health check passed"
fi
sleep 60
done
3. Common Issues and Resolutions
High Error Rate
Symptoms:
- Error rate > 5%
- Sentry alerts increasing
- Users reporting failures
Investigation:
1. Check Sentry for error patterns
2. Review API logs: docker logs forma3d-api --tail 100
3. Check database connectivity: curl .../health/ready
4. Verify external service status: curl .../health/dependencies
Resolution:
1. If database issue: Check connection pool, restart API if needed
2. If external service: Enable fallback/degraded mode, wait for recovery
3. If code bug: Deploy hotfix via normal pipeline
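The 5% symptom threshold is a plain ratio of failed to total requests. The sketch below shows that arithmetic only. The helper name `error_rate_exceeds` is hypothetical, and where the two counts come from (Sentry, access logs, a metrics store) is not specified in this runbook.

```shell
# Return 0 (true) when errors/total is strictly greater than the threshold
# percentage (default 5). Uses cross-multiplication to avoid float rounding.
error_rate_exceeds() {
  local errors="$1" total="$2" threshold="${3:-5}"
  awk -v e="$errors" -v t="$total" -v th="$threshold" 'BEGIN {
    if (t == 0) exit 1                # no traffic: nothing to alert on
    if (e * 100 > th * t) exit 0      # same as (e / t) * 100 > th
    exit 1
  }'
}
```

For example, `error_rate_exceeds 60 1000` succeeds (6% > 5%), while `error_rate_exceeds 50 1000` does not (exactly 5%).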
High Latency
Symptoms:
- 95th percentile response time > 2 seconds
- Dashboard feels slow
- Webhook processing delays
Investigation:
1. Check database query performance
2. Review slow query logs
3. Check external API response times in logs
4. Monitor memory/CPU usage: docker stats
Resolution:
1. Scale up resources if needed
2. Optimize slow database queries
3. Add caching if appropriate
4. Contact external service provider if their API is slow
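To check the 95th-percentile symptom against raw timings, the percentile can be computed with standard tools. This is a sketch under assumptions: `p95` is a hypothetical helper, it uses the nearest-rank definition of a percentile, and it expects one response time in milliseconds per line. How those timings are extracted from the logs is not covered by this runbook.

```shell
# Print the 95th-percentile value (nearest-rank method) of the numbers
# read from stdin, one per line. Exits non-zero on empty input.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)   # integer form of ceil(0.95 * N)
      print v[rank]
    }'
}
```

A result above 2000 (milliseconds) corresponds to the "> 2 seconds" symptom.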
Database Connection Failed
Symptoms:
- Health check returns database: disconnected
- API returns 500 errors
- Readiness probe fails
Investigation:
1. Check PostgreSQL status in DigitalOcean dashboard
2. Verify connection string in environment
3. Check network connectivity
4. Review connection pool settings
Resolution:
# SSH to droplet
ssh root@<DROPLET_IP>
# Check API container
docker ps | grep forma3d-api
# Restart API container
docker-compose restart api
# Check logs for connection errors
docker logs forma3d-api --tail 50 | grep -i database
Shopify Webhooks Not Arriving
Symptoms:
- New Shopify orders not appearing
- Order count not increasing
- No webhook logs in event log
Investigation:
1. Check Shopify admin > Settings > Notifications > Webhooks
2. Verify webhook URL is correct
3. Check for failed webhook deliveries in Shopify
4. Review API logs for HMAC verification failures
Resolution:
1. Verify SHOPIFY_WEBHOOK_SECRET matches Shopify settings
2. Re-register webhooks if needed
3. Check firewall/security group rules
4. Test webhook delivery manually from Shopify admin
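When HMAC verification failures appear in the logs, it can help to recompute a signature by hand. Shopify sends the base64-encoded HMAC-SHA256 of the raw request body in the `X-Shopify-Hmac-Sha256` header. The sketch below reproduces that calculation with `openssl`. The function names and the idea of saving a captured body to a file are illustrative assumptions, and the body must be byte-for-byte identical to what Shopify sent.

```shell
# Print base64(HMAC-SHA256(secret, raw body)) for a body stored in a file.
shopify_hmac() {
  local secret="$1" body_file="$2"
  openssl dgst -sha256 -hmac "$secret" -binary < "$body_file" | base64
}

# Return 0 when the computed signature equals the header value received.
verify_shopify_hmac() {
  local secret="$1" body_file="$2" header_value="$3"
  [ "$(shopify_hmac "$secret" "$body_file")" = "$header_value" ]
}

# Usage (body.json and the header value are placeholders):
#   verify_shopify_hmac "$SHOPIFY_WEBHOOK_SECRET" body.json "<header value>"
```

A mismatch with a known-good payload usually points at the wrong secret or at a body that was re-serialized before signing.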
SimplyPrint API Unreachable
Symptoms:
- Print jobs not being created
- /health/dependencies shows SimplyPrint down
- Error logs show SimplyPrint connection failures
Investigation:
1. Check SimplyPrint dashboard manually
2. Verify API credentials: SIMPLYPRINT_API_KEY, SIMPLYPRINT_COMPANY_ID
3. Test API directly with curl
Resolution:
1. Wait for SimplyPrint service recovery
2. Failed jobs will be retried automatically via retry queue
3. Manual retry: POST /api/v1/print-jobs/{id}/retry with API key
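The manual retry in step 3 could be wrapped as follows. Treat this as a sketch with loud assumptions: the runbook does not say how the API key is transmitted, so the `X-API-Key` header name is a guess (overridable via `API_KEY_HEADER`), and `retry_print_job`, `BASE_URL`, and `API_KEY` are hypothetical names.

```shell
# POST to the retry endpoint for one print job.
# ASSUMPTION: the key is sent in an X-API-Key header; confirm against the
# API's authentication docs and set API_KEY_HEADER if it differs.
retry_print_job() {
  local job_id="$1"
  local base_url="${BASE_URL:-https://connect-api.forma3d.be}"
  local header="${API_KEY_HEADER:-X-API-Key}"
  if [ -z "${API_KEY:-}" ]; then
    echo "API_KEY must be set" >&2
    return 1
  fi
  # --fail makes curl exit non-zero on HTTP 4xx/5xx responses
  curl --fail -s -X POST \
    -H "${header}: ${API_KEY}" \
    "${base_url}/api/v1/print-jobs/${job_id}/retry"
}
```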
Orders Stuck in Processing
Symptoms:
- Orders in PROCESSING state for > 60 minutes
- No print job status updates
- Customer complaints about delays
Investigation:
1. Check print job status in dashboard
2. Verify SimplyPrint job status in their dashboard
3. Check for failed webhooks from SimplyPrint
4. Review retry queue: GET /api/v1/admin/retry-queue
Resolution:
1. Force refresh print job status from SimplyPrint
2. Manually update order status if needed
3. Contact SimplyPrint support for print issues
4. Check printer status in SimplyPrint dashboard
Retry Queue Backlog
Symptoms:
- More than 50 items in retry queue for > 15 minutes
- Alerts for retry queue backlog
- Failed operations not recovering
Investigation:
1. Check retry queue status
2. Identify failing job types
3. Review error messages in retry entries
4. Check if external services are down
Resolution:
1. Fix underlying issue causing failures
2. Clear old/stale entries if safe:
DELETE FROM "RetryQueue"
WHERE status = 'FAILED'
AND "createdAt" < NOW() - INTERVAL '7 days';
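Before running the DELETE, it may be worth counting the rows it would remove. The helper below is a sketch that prints a COUNT query with the same filter. The function name `stale_retry_count_sql` is hypothetical, and piping the result to `psql "$DATABASE_URL"` assumes `psql` is installed wherever the command is run.

```shell
# Print a COUNT query that mirrors the DELETE filter, for a given age in days.
stale_retry_count_sql() {
  local days="${1:-7}"
  # Accept digits only, so the value is safe to interpolate into SQL
  case "$days" in
    ''|*[!0-9]*) echo "days must be a non-negative integer" >&2; return 1 ;;
  esac
  cat <<SQL
SELECT COUNT(*) FROM "RetryQueue"
WHERE status = 'FAILED'
  AND "createdAt" < NOW() - INTERVAL '${days} days';
SQL
}

# Usage (needs database access, so it is left commented out):
#   stale_retry_count_sql 7 | psql "$DATABASE_URL"
```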
4. Incident Response
Severity Levels
| Level | Description | Response Time | Examples |
|---|---|---|---|
| P1 - Critical | Complete service outage | 15 minutes | Database down, API unresponsive |
| P2 - High | Major feature broken | 1 hour | Webhooks failing, fulfillments stuck |
| P3 - Medium | Degraded performance | 4 hours | High latency, intermittent errors |
| P4 - Low | Minor issue | 1 business day | UI bugs, documentation issues |
Incident Response Process
- Detect - Alert received or user report
- Assess - Determine severity and impact
- Communicate - Notify stakeholders if P1/P2
- Investigate - Follow runbook procedures
- Resolve - Apply fix or workaround
- Document - Create incident report
- Review - Post-incident review (for P1/P2)
Incident Template
## Incident Report
**Date:** YYYY-MM-DD
**Severity:** P1/P2/P3/P4
**Duration:** HH:MM - HH:MM (X hours Y minutes)
**Impact:** [Description of user impact]
### Timeline
- HH:MM - Issue detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Fix deployed
- HH:MM - Issue resolved
### Root Cause
[Description of what caused the issue]
### Resolution
[What was done to fix it]
### Prevention
[What will be done to prevent recurrence]
### Action Items
- [ ] Action 1
- [ ] Action 2
5. Maintenance Procedures
Deploying Updates
Updates are deployed automatically via Azure DevOps pipeline when merging to main. For manual deployment:
# SSH to staging/production server
ssh root@<DROPLET_IP>
# Pull latest images
docker-compose pull
# Deploy with zero downtime
docker-compose up -d --no-deps api
docker-compose up -d --no-deps web
# Verify deployment
curl https://<domain>/health
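A single curl right after recreating the containers may run before the API has finished starting. The sketch below polls the health endpoint until it answers 200 or a retry budget runs out. `wait_for_healthy` is a hypothetical helper, and the default of 10 attempts with a 5 second pause is an arbitrary starting point, not a documented value.

```shell
# Poll a URL until it returns HTTP 200, up to a fixed number of attempts.
# Arguments: url [attempts=10] [delay_seconds=5]
wait_for_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-5}"
  local i=1 code=""
  while [ "$i" -le "$attempts" ]; do
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url")
    if [ "$code" = "200" ]; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after ${attempts} attempts (last status: ${code})" >&2
  return 1
}
```

Because the function returns non-zero on timeout, it can gate later steps in a manual deployment, for example `wait_for_healthy https://connect-api.forma3d.be/health && echo "deploy verified"`.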
Database Migrations
Migrations run automatically during deployment. For manual migration:
# SSH to server
ssh root@<DROPLET_IP>
# Enter API container
docker exec -it forma3d-api sh
# Run migrations
npx prisma migrate deploy
# Exit container
exit
Rolling Back a Migration:
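# Mark a failed migration as rolled back in Prisma's migration history.
# This only updates the migration record; it does not revert schema changes that were already applied.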
npx prisma migrate resolve --rolled-back MIGRATION_NAME
Log Management
Logs are automatically rotated by Docker. Manual cleanup:
# SSH to server
ssh root@<DROPLET_IP>
# Check disk usage
df -h
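# Reclaim disk space (removes stopped containers, unused networks, dangling images, and build cache;
# logs of running containers are not affected). Add --volumes only if unused volumes are safe to delete.
docker system prune -f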
# View specific container logs
docker logs forma3d-api --since 1h --tail 100
Backup Procedures
Database backups are handled by DigitalOcean Managed Databases (automatic daily backups with 7-day retention).
Manual backup:
# Export database
pg_dump "$DATABASE_URL" > "backup_$(date +%Y%m%d_%H%M%S).sql"
# Compress backup
gzip backup_*.sql
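A compressed dump is only useful if it can be read back. As a minimal sanity check (not a substitute for a test restore), the sketch below confirms that a backup file is non-empty and that the gzip stream is intact. `verify_backup` is a hypothetical helper name.

```shell
# Return 0 when the file exists, is non-empty, and passes gzip's integrity test.
verify_backup() {
  local file="$1"
  # -s: exists and has a size greater than zero; gzip -t: decompresses cleanly
  [ -s "$file" ] && gzip -t "$file"
}

# Usage (the file name is a placeholder):
#   verify_backup backup_20260118_100000.sql.gz && echo "backup looks intact"
```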
Certificate Renewal
TLS certificates are automatically renewed by Traefik via Let's Encrypt. Verify renewal:
# Check certificate expiry
echo | openssl s_client -connect connect-api.forma3d.be:443 2>/dev/null | openssl x509 -noout -dates
# Check Traefik logs for renewal activity
docker logs forma3d-traefik 2>&1 | grep -i "acme\|certificate\|renew"
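The expiry check prints a `notAfter=` date, which is easier to act on as a number of days. The following sketch converts that date string into days remaining. `days_until_expiry` is a hypothetical helper, and it assumes GNU `date` (for the `-d` option), as found on typical Linux servers.

```shell
# Print whole days between "now" and a certificate expiry date.
# $1: the text after "notAfter=", e.g. "Apr 18 10:00:00 2026 GMT"
# $2: optional reference time (defaults to now), useful for dry runs
days_until_expiry() {
  local expiry_epoch now_epoch
  expiry_epoch=$(date -u -d "$1" +%s) || return 1
  now_epoch=$(date -u -d "${2:-now}" +%s) || return 1
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Usage (performs a network call, so it is left commented out):
#   expiry=$(echo | openssl s_client -connect connect-api.forma3d.be:443 2>/dev/null \
#     | openssl x509 -noout -enddate | cut -d= -f2)
#   days_until_expiry "$expiry"
```

A value that keeps shrinking toward zero suggests automatic renewal is not happening and the Traefik logs deserve a closer look.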
6. Contact Information
| Role | Contact | Escalation |
|---|---|---|
| On-call Engineer | [email/phone] | Primary |
| Tech Lead | [email/phone] | If P1/P2 |
| DevOps | [email/phone] | Infrastructure issues |
7. Quick Reference Commands
# SSH to staging
ssh root@<STAGING_IP>
# SSH to production
ssh root@<PRODUCTION_IP>
# View running containers
docker ps
# View container logs
docker logs forma3d-api --tail 100 -f
docker logs forma3d-web --tail 100 -f
docker logs forma3d-traefik --tail 100 -f
# Restart API
docker-compose restart api
# Restart all services
docker-compose restart
# Check disk space
df -h
# Check memory
free -h
# Check Docker resource usage
docker stats --no-stream
8. Load Testing
Running Load Tests
Load tests verify the system can handle expected production workloads (500+ orders/day).
# Run against staging
pnpm load-test:staging
# Run in baseline mode (no threshold failures)
k6 run --env ENV=staging --env BASELINE=true load-tests/k6/scenarios/order-throughput.js
Load Test via Azure DevOps Pipeline
- Navigate to Pipelines in Azure DevOps
- Click Run pipeline on the main branch
- Enable "Run Load Tests (optional)"
- Optionally enable "Load Test Baseline Mode" for data collection without threshold enforcement
- Click Run
The load test stage runs between Acceptance Tests and Production deployment. Results are available in the HTML Viewer tab with visual metric cards and raw JSON data.
Performance Thresholds
| Metric | Threshold | NFR Reference |
|---|---|---|
| HTTP request duration | p(95) < 2000ms | NFR-PE-003 |
| HTTP request failed | < 1% | NFR-AV-001 |
| Checks pass rate | > 99% | - |
Troubleshooting Load Test Failures
High failure rate (> 1%):
- Check /health/dependencies for external service issues
- Review Sentry for error spikes during the test window
- Check database connection pool exhaustion (DATABASE_POOL_SIZE)
Slow response times (p95 > 2s):
- Check database query performance in Sentry
- Review N+1 query patterns
- Check for resource contention (CPU/memory) on the droplet
Test seeding failures:
- Verify NODE_ENV is not production (seeding disabled in prod)
- Check if test-seeding module is loaded
Revision History:
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-01-18 | Phase 6 Implementation | Initial runbook |
| 1.1 | 2026-01-18 | Phase 6 Implementation | Added load testing section, updated version format |