Disaster Recovery Research Report¶
Status: Research Document
Created: February 2026
Scope: Forma 3D Connect Platform
Table of Contents¶
- Executive Summary
- System Overview
- Risk Assessment Matrix
- Disaster Scenarios and Mitigation
- Backup and Recovery Strategies
- Incident Response Procedures
- Postmortem Process
- Status Page Implementation
- Alerting and Notification Strategy
- Service Level Agreements (SLAs)
- Recommendations and Next Steps
1. Executive Summary¶
This document outlines a comprehensive disaster recovery (DR) strategy for the Forma 3D Connect platform. The system orchestrates 3D printing fulfillment by integrating with Shopify (e-commerce), SimplyPrint (print management), and SendCloud (shipping). A robust DR plan is essential to ensure business continuity, protect customer data, and maintain SLA commitments.
Key Findings¶
- Critical Dependencies: PostgreSQL database, three external APIs (Shopify, SimplyPrint, SendCloud)
- Current Strengths: Retry queue system, health checks, event logging, webhook idempotency
- Gaps Identified: No formal backup strategy documented, no status page, limited alerting beyond Sentry
- Recommended RTO: 4 hours for critical services, 24 hours for full restoration
- Recommended RPO: 1 hour (maximum data loss tolerance)
2. System Overview¶
Architecture Components¶
┌─────────────────────────────────────────────────────────────────────────┐
│ EXTERNAL SERVICES │
├──────────────────┬──────────────────┬──────────────────┬────────────────┤
│ Shopify │ SimplyPrint │ SendCloud │ SMTP │
│ (E-commerce) │ (Printing) │ (Shipping) │ (Email) │
└────────┬─────────┴────────┬─────────┴────────┬─────────┴────────┬───────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ NestJS API │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐│
│ │ Webhooks │ │ Orders │ │ Print Jobs │ │ Fulfillment ││
│ │ Handlers │ │ Service │ │ Service │ │ Service ││
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────────┘│
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐│
│ │ Retry Queue │ │ Health │ │ Sentry │ │ Notifications ││
│ │ Processor │ │ Checks │ │ Integration │ │ (Email/Push) ││
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────────┘│
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ PostgreSQL Database │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────────┐ │
│ │ Orders │ │PrintJobs │ │ Shipments│ │AuditLogs │ │ RetryQueue │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Critical Data Assets¶
| Asset | Description | Sensitivity | Recovery Priority |
|---|---|---|---|
| Orders | Customer orders from Shopify | High | Critical |
| Print Jobs | Job status and tracking | Medium | Critical |
| Product Mappings | Product to 3D file mappings (by Shopify product/variant ID, SKU optional) | Medium | High |
| Shipments | Tracking and label data | High | Critical |
| User Accounts | RBAC and authentication | High | Critical |
| Audit Logs | Compliance and debugging | Medium | Medium |
| Event Logs | Operational visibility | Low | Low |
3. Risk Assessment Matrix¶
Risk Scoring¶
- Probability: 1 (Rare) to 5 (Frequent)
- Impact: 1 (Minimal) to 5 (Catastrophic)
- Risk Score: Probability × Impact
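The scoring rule above is simple enough to automate when maintaining the register. A minimal TypeScript sketch (the `Risk` shape and `prioritize` helper are illustrative, not part of the platform):

```typescript
// Risk register entry: probability and impact both on a 1-5 scale
interface Risk {
  id: string;
  probability: number;
  impact: number;
}

// Risk Score = Probability × Impact
const riskScore = (r: Risk): number => r.probability * r.impact;

// Sort a register so the highest-scoring risks surface first
const prioritize = (risks: Risk[]): Risk[] =>
  [...risks].sort((a, b) => riskScore(b) - riskScore(a));
```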
Identified Risks¶
| Risk ID | Risk Description | Probability | Impact | Score | Category |
|---|---|---|---|---|---|
| R01 | Database server crash | 2 | 5 | 10 | Infrastructure |
| R02 | Database corruption | 1 | 5 | 5 | Infrastructure |
| R03 | Application server crash | 3 | 4 | 12 | Infrastructure |
| R04 | Shopify API unavailable | 2 | 4 | 8 | External Service |
| R05 | SimplyPrint API unavailable | 3 | 4 | 12 | External Service |
| R06 | SendCloud API unavailable | 2 | 3 | 6 | External Service |
| R07 | Security breach / hack | 2 | 5 | 10 | Security |
| R08 | Credential compromise | 2 | 5 | 10 | Security |
| R09 | Major misconfiguration | 3 | 4 | 12 | Operational |
| R10 | Data loss (accidental deletion) | 2 | 5 | 10 | Operational |
| R11 | Network/DNS failure | 2 | 4 | 8 | Infrastructure |
| R12 | Certificate expiration | 2 | 3 | 6 | Operational |
| R13 | Disk space exhaustion | 3 | 3 | 9 | Infrastructure |
| R14 | Memory exhaustion / OOM | 3 | 3 | 9 | Infrastructure |
| R15 | Webhook flooding / DDoS | 2 | 4 | 8 | Security |
Risk Matrix Visualization¶
┌──────────────────┬───────────┬─────────────────┬─────────────┬────────────┬──────────────┐
│ Impact ↓         │ Rare (1)  │ Unlikely (2)    │ Possible (3)│ Likely (4) │ Frequent (5) │
├──────────────────┼───────────┼─────────────────┼─────────────┼────────────┼──────────────┤
│ Catastrophic (5) │ R02       │ R01,R07,R08,R10 │             │            │              │
│ Major (4)        │           │ R04,R11,R15     │ R03,R05,R09 │            │              │
│ Moderate (3)     │           │ R06,R12         │ R13,R14     │            │              │
│ Minor (2)        │           │                 │             │            │              │
│ Minimal (1)      │           │                 │             │            │              │
└──────────────────┴───────────┴─────────────────┴─────────────┴────────────┴──────────────┘
4. Disaster Scenarios and Mitigation¶
4.1 Database Crashes¶
Scenario: PostgreSQL Server Becomes Unavailable¶
Symptoms:
- Health checks fail (/health/ready returns 503)
- All API endpoints return 500 errors
- Sentry floods with database connection errors
Immediate Response:
1. Check database container/server status
2. Review database logs for crash cause
3. Attempt restart if container crashed
4. Failover to replica if available
Mitigation Strategies:
| Strategy | Implementation | RTO Impact |
|---|---|---|
| Database Replication | PostgreSQL streaming replication with read replica | < 5 min failover |
| Managed Database | Azure Database for PostgreSQL / AWS RDS with automatic failover | < 2 min failover |
| Connection Pooling | PgBouncer for connection management | Reduces crash risk |
| Regular Backups | Automated daily backups with point-in-time recovery | 1-4 hours |
Current System Safeguards:
- Connection pooling configured (DATABASE_POOL_SIZE)
- Health checks detect issues quickly
- Retry queue persists failed operations for later retry
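For reference, the kind of exponential backoff such a retry queue typically applies can be sketched as follows (the base delay and cap are illustrative defaults, not the platform's documented schedule):

```typescript
// Exponential backoff with a cap: 1s, 2s, 4s, 8s, ... up to 60s.
// Production systems usually add jitter on top of this to avoid
// synchronized retry storms.
function backoffDelayMs(attempt: number, baseMs = 1_000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```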
Recommended Actions:
# PostgreSQL High Availability Setup
primary:
- Streaming replication to standby
- WAL archiving to object storage
- Automated backups every 6 hours
standby:
- Hot standby for read queries
- Automatic promotion via Patroni/pgpool
backup_retention:
daily: 7 days
weekly: 4 weeks
monthly: 12 months
4.2 Application Server Crashes¶
Scenario: NestJS API Container Crashes or Becomes Unresponsive¶
Symptoms:
- Kubernetes/Docker health probes fail
- No response on any endpoint
- WebSocket connections drop

Immediate Response:
1. Container orchestrator auto-restarts (if configured)
2. Check container logs for crash reason
3. Verify resource limits (memory/CPU)
4. Scale up if load-related
Mitigation Strategies:
| Strategy | Implementation | Benefit |
|---|---|---|
| Multiple Replicas | Kubernetes Deployment with 2+ replicas | Zero downtime |
| Health Probes | Liveness and readiness probes | Auto-recovery |
| Horizontal Pod Autoscaler | Scale based on CPU/memory | Handle load spikes |
| Graceful Shutdown | Handle SIGTERM properly | No dropped requests |
Kubernetes Deployment Example:
apiVersion: apps/v1
kind: Deployment
spec:
replicas: 3
template:
spec:
containers:
- name: api
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "1000m"
livenessProbe:
httpGet:
path: /health/live
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health/ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 5
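The "Graceful Shutdown" row in the mitigation table deserves a sketch. In NestJS this is typically enabled via `app.enableShutdownHooks()`; the underlying drain logic, shown framework-agnostic and simplified here as an assumption about how one might implement it, looks roughly like this:

```typescript
// Minimal drain tracker: stop accepting new work once SIGTERM arrives,
// resolve drain() only when all in-flight requests have finished.
class GracefulShutdown {
  private inFlight = 0;
  private draining = false;
  private onIdle: (() => void) | null = null;

  // Call at the start of each request; false means "shutting down, reject".
  begin(): boolean {
    if (this.draining) return false;
    this.inFlight += 1;
    return true;
  }

  // Call when a request completes.
  end(): void {
    this.inFlight -= 1;
    if (this.draining && this.inFlight === 0) this.onIdle?.();
  }

  // Called from the SIGTERM handler; resolves once all work is done.
  drain(): Promise<void> {
    this.draining = true;
    if (this.inFlight === 0) return Promise.resolve();
    return new Promise((resolve) => {
      this.onIdle = resolve;
    });
  }
}

// Wiring sketch:
// process.on('SIGTERM', async () => { await shutdown.drain(); process.exit(0); });
```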
4.3 Security Breach / System Hack¶
Scenario: Unauthorized Access or Data Breach¶
Attack Vectors:
1. Credential theft (API keys, database passwords)
2. SQL injection (mitigated by Prisma ORM)
3. Webhook spoofing (mitigated by HMAC verification)
4. Session hijacking
5. Dependency vulnerabilities
Immediate Response Checklist:
## Security Incident Response Checklist
### Phase 1: Contain (First 30 minutes)
- [ ] Isolate affected systems (network level)
- [ ] Revoke compromised credentials immediately
- [ ] Enable maintenance mode if needed
- [ ] Preserve logs and evidence (DO NOT delete)
### Phase 2: Assess (1-4 hours)
- [ ] Identify attack vector and timeline
- [ ] Determine scope of data exposure
- [ ] Check for persistence mechanisms (backdoors)
- [ ] Review audit logs for suspicious activity
### Phase 3: Remediate (4-24 hours)
- [ ] Rotate all secrets and API keys
- [ ] Patch vulnerabilities
- [ ] Reset user passwords if needed
- [ ] Review and update access controls
### Phase 4: Recover (24-72 hours)
- [ ] Restore from clean backup if needed
- [ ] Re-enable services gradually
- [ ] Enhanced monitoring period
- [ ] Notify affected parties (GDPR: 72 hours)
Preventive Measures:
| Measure | Current Status | Recommendation |
|---|---|---|
| HMAC Webhook Verification | ✅ Implemented | Maintain |
| Rate Limiting | ✅ Implemented | Add IP-based limits |
| Audit Logging | ✅ Implemented | Add alerting |
| Secret Management | ⚠️ Environment variables | Use Azure Key Vault / HashiCorp Vault |
| Dependency Scanning | ❌ Not implemented | Add Dependabot / Snyk |
| WAF | ❌ Not implemented | Consider Azure WAF / Cloudflare |
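The HMAC webhook verification marked as implemented above follows a standard pattern: compute an HMAC-SHA256 of the raw request body with the shared secret and compare it in constant time against the signature header (Shopify sends it base64-encoded in `X-Shopify-Hmac-Sha256`). A generic sketch, not the platform's actual handler:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Verify a Shopify-style webhook signature: HMAC-SHA256 over the raw
// body, base64-encoded, compared in constant time to avoid timing leaks.
function verifyWebhookHmac(
  rawBody: string,
  headerHmacBase64: string,
  secret: string,
): boolean {
  const digest = createHmac('sha256', secret).update(rawBody, 'utf8').digest();
  const header = Buffer.from(headerHmacBase64, 'base64');
  return header.length === digest.length && timingSafeEqual(header, digest);
}
```

Note that verification must run against the raw body bytes, before any JSON parsing, or re-serialization differences will break the comparison.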
Credential Rotation Procedure:
#!/bin/bash
# Emergency credential rotation script
# 1. Database password
echo "Rotating database credentials..."
# Update in secret manager, then rolling restart
# 2. Shopify credentials
echo "Rotating Shopify API credentials..."
# Generate new keys in Shopify Partner Dashboard
# Update SHOPIFY_API_KEY, SHOPIFY_API_SECRET
# 3. SimplyPrint API key
echo "Rotating SimplyPrint credentials..."
# Generate new API key in SimplyPrint dashboard
# Update SIMPLYPRINT_API_KEY
# 4. SendCloud credentials
echo "Rotating SendCloud credentials..."
# Generate new API keys in SendCloud dashboard
# Update SENDCLOUD_PUBLIC_KEY, SENDCLOUD_SECRET_KEY
# 5. Internal secrets
echo "Rotating internal secrets..."
# SESSION_SECRET, INTERNAL_API_KEY, SHOPIFY_TOKEN_ENCRYPTION_KEY
# 6. Rolling restart
echo "Performing rolling restart..."
kubectl rollout restart deployment/api
4.4 Major Misconfiguration¶
Scenario: Production Environment Misconfigured¶
Common Misconfigurations:
1. Wrong database URL (connecting to staging/dev)
2. Incorrect API keys for wrong environment
3. CORS misconfiguration blocking frontend
4. Invalid webhook URLs preventing order ingestion
5. Wrong SMTP settings causing notification failures
Detection Mechanisms:
| Misconfiguration | Detection Method | Time to Detect |
|---|---|---|
| Database URL wrong | Health check fails | < 1 minute |
| API keys invalid | First API call fails, logged to Sentry | < 5 minutes |
| Webhook URL wrong | No orders coming in, monitoring alert | 5-30 minutes |
| SMTP misconfigured | Email delivery fails, retry queue grows | 30 minutes |
Prevention Strategies:
1. Configuration Validation on Startup:

// apps/api/src/config/config.validation.ts
export function validateConfiguration(config: Record<string, unknown>) {
  const errors: string[] = [];

  // Validate database URL format
  if (!config.DATABASE_URL?.toString().startsWith('postgresql://')) {
    errors.push('DATABASE_URL must be a valid PostgreSQL connection string');
  }

  // Validate required secrets are present
  const requiredSecrets = [
    'SESSION_SECRET',
    'SHOPIFY_API_KEY',
    'SIMPLYPRINT_API_KEY',
    'SENDCLOUD_PUBLIC_KEY',
  ];
  for (const secret of requiredSecrets) {
    if (!config[secret]) {
      errors.push(`Missing required secret: ${secret}`);
    }
  }

  if (errors.length > 0) {
    throw new Error(`Configuration validation failed:\n${errors.join('\n')}`);
  }
}

2. Environment-Specific Safeguards:

// Prevent production database operations in non-production
if (
  process.env.NODE_ENV !== 'production' &&
  process.env.DATABASE_URL?.includes('production')
) {
  throw new Error('DANGER: Non-production environment connected to production database!');
}

3. Infrastructure as Code:
- Use Terraform/Pulumi for consistent deployments
- Environment-specific variable files
- Code review for infrastructure changes
4.5 Third-Party Service Unavailability¶
4.5.1 Shopify API Unavailable¶
Impact: Cannot receive new orders, cannot create fulfillments
Detection:
- /health/dependencies shows Shopify unhealthy
- Webhook deliveries fail (Shopify retries for 48 hours)
- Manual order creation fails
Current Mitigations:
- ✅ Shopify retries webhooks with exponential backoff
- ✅ Processed webhook idempotency prevents duplicates on retry
- ✅ Health indicator detects issues

Response Procedure:
1. Check Shopify Status
2. If Shopify is up, check our API connectivity
3. Queue fulfillment operations for later retry
4. Monitor Shopify's status for resolution
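The webhook idempotency mitigation boils down to recording each processed webhook ID and skipping repeats. A simplified in-memory sketch for illustration only; the real system would back this with the processed-webhook table and a unique constraint, not a Set:

```typescript
// Deduplicate webhook deliveries by ID so provider retries are harmless.
class WebhookDeduplicator {
  private readonly seen = new Set<string>();

  // Returns true exactly once per webhook ID; retries return false.
  shouldProcess(webhookId: string): boolean {
    if (this.seen.has(webhookId)) return false;
    this.seen.add(webhookId);
    return true;
  }
}
```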
Recommendations:
// Add Shopify-specific circuit breaker
import { Injectable, ServiceUnavailableException } from '@nestjs/common';

@Injectable()
export class ShopifyCircuitBreaker {
  private failures = 0;
  private isOpen = false;
  private lastFailure: Date | null = null;
  private readonly failureThreshold = 5;
  private readonly resetTimeoutMs = 30_000;

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.isOpen && this.shouldAttemptReset()) {
      this.isOpen = false; // half-open: allow one trial call
    }
    if (this.isOpen) {
      throw new ServiceUnavailableException('Shopify circuit breaker is open');
    }
    try {
      const result = await operation();
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures++;
      this.lastFailure = new Date();
      if (this.failures >= this.failureThreshold) {
        this.isOpen = true;
      }
      throw error;
    }
  }

  private shouldAttemptReset(): boolean {
    return (
      this.lastFailure !== null &&
      Date.now() - this.lastFailure.getTime() > this.resetTimeoutMs
    );
  }
}
4.5.2 SimplyPrint API Unavailable¶
Impact: Cannot create print jobs, cannot get status updates
Detection:
- /health/dependencies shows SimplyPrint unhealthy
- Polling service logs connection errors
- Print job creation fails
Current Mitigations:
- ✅ Retry queue for print job creation
- ✅ Polling handles temporary failures gracefully
- ✅ Health indicator detects issues

Response Procedure:
1. Check SimplyPrint status/support channels
2. Print jobs queue in PENDING state
3. Polling will resume automatically when service returns
4. Manual reconciliation may be needed after extended outage

Degraded Mode Operation:
- Orders continue to be received and stored
- Print jobs are queued but not submitted
- Dashboard shows "SimplyPrint Unavailable" warning
- Operators can manually track printing progress
4.5.3 SendCloud API Unavailable¶
Impact: Cannot create shipping labels
Detection:
- /health/dependencies shows SendCloud unhealthy
- Shipment creation fails with errors
- Label generation queued in retry
Current Mitigations:
- ✅ Retry queue for shipment creation
- ✅ Health indicator detects issues

Response Procedure:
1. Check SendCloud Status
2. Shipments queue in PENDING state
3. Once restored, retry queue processes pending shipments
4. Operators may need to manually create labels for urgent orders
5. Backup and Recovery Strategies¶
5.1 Backup Strategy¶
Database Backups¶
| Backup Type | Frequency | Retention | Storage |
|---|---|---|---|
| Full Backup | Daily (3 AM) | 30 days | Azure Blob / S3 |
| Incremental (WAL) | Continuous | 7 days | Azure Blob / S3 |
| Point-in-Time | Continuous | 7 days | Managed service |
| Monthly Archive | Monthly | 1 year | Cold storage |
Backup Configuration (Azure Database for PostgreSQL):
{
"backup": {
"geoRedundantBackup": "Enabled",
"backupRetentionDays": 35,
"earliestRestoreDate": "2026-01-01T00:00:00Z"
},
"storage": {
"storageSizeGB": 100,
"autoGrow": "Enabled"
}
}
Manual Backup Script:
#!/bin/bash
# backup-database.sh
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="forma3d_backup_${DATE}.sql.gz"
# Create backup
pg_dump $DATABASE_URL | gzip > /tmp/$BACKUP_FILE
# Upload to cloud storage
az storage blob upload \
--container-name backups \
--file /tmp/$BACKUP_FILE \
--name "database/$BACKUP_FILE"
# Cleanup local file
rm /tmp/$BACKUP_FILE
# Verify backup
az storage blob exists \
--container-name backups \
--name "database/$BACKUP_FILE"
5.2 Recovery Procedures¶
Database Recovery¶
Scenario: Point-in-Time Recovery
# 1. Identify target recovery time
TARGET_TIME="2026-02-05T14:30:00Z"
# 2. Create new database from backup (Azure)
az postgres server restore \
--resource-group forma3d-rg \
--name forma3d-db-recovered \
--source-server forma3d-db \
--restore-point-in-time $TARGET_TIME
# 3. Verify data integrity
psql $RECOVERED_DATABASE_URL -c "SELECT COUNT(*) FROM orders;"
# 4. Update connection string and restart
kubectl set env deployment/api DATABASE_URL=$RECOVERED_DATABASE_URL
# 5. Run integrity checks
npm run db:integrity-check
Scenario: Full Database Restore from Backup
# 1. Download latest backup
az storage blob download \
--container-name backups \
--name "database/forma3d_backup_latest.sql.gz" \
--file /tmp/restore.sql.gz
# 2. Create new database
createdb -h $DB_HOST forma3d_restored
# 3. Restore backup
gunzip -c /tmp/restore.sql.gz | psql -h $DB_HOST forma3d_restored
# 4. Run migrations to ensure schema is current
DATABASE_URL="..." npx prisma migrate deploy
# 5. Switch over
# Update Kubernetes secrets and restart
5.3 Recovery Time and Point Objectives¶
| Scenario | RTO Target | RPO Target | Current Capability |
|---|---|---|---|
| Database crash (with replica) | 5 minutes | 0 (sync replication) | ❌ Need to implement |
| Database crash (backup only) | 2 hours | 1 hour | ✅ Achievable |
| Application crash | 2 minutes | 0 | ✅ With K8s auto-restart |
| Full disaster (region failure) | 4 hours | 1 hour | ❌ Need geo-redundancy |
| Security breach | 24 hours | Varies | ✅ From clean backup |
6. Incident Response Procedures¶
6.1 Incident Severity Levels¶
| Severity | Description | Examples | Response Time |
|---|---|---|---|
| SEV-1 | Critical - Complete service outage | Database down, all APIs failing | Immediate (< 15 min) |
| SEV-2 | Major - Significant functionality impaired | Cannot process new orders, fulfillment broken | < 30 min |
| SEV-3 | Moderate - Partial functionality affected | One integration down, degraded performance | < 2 hours |
| SEV-4 | Minor - Low impact issues | Slow dashboard, cosmetic issues | < 24 hours |
6.2 Incident Response Workflow¶
graph TD
A[Incident Detected] --> B{Severity Assessment}
B -->|SEV-1/2| C[Page On-Call]
B -->|SEV-3/4| D[Create Ticket]
C --> E[Acknowledge]
E --> F[Investigate]
F --> G{Root Cause Found?}
G -->|No| H[Escalate]
G -->|Yes| I[Implement Fix]
I --> J[Verify Resolution]
J --> K[Update Status Page]
K --> L[Create Postmortem]
D --> F
H --> F
6.3 On-Call Procedures¶
On-Call Rotation:
- Primary: Responds to all pages
- Secondary: Backup if primary unavailable
- Escalation: Engineering lead for SEV-1
Escalation Path:
Primary On-Call (5 min)
↓
Secondary On-Call (10 min)
↓
Engineering Lead (15 min)
↓
CTO (30 min for SEV-1)
On-Call Checklist:
## Incident Response Checklist
### Initial Response (0-5 min)
- [ ] Acknowledge alert
- [ ] Open incident channel (#incident-YYYYMMDD-HHMM)
- [ ] Initial severity assessment
- [ ] Update status page to "Investigating"
### Investigation (5-30 min)
- [ ] Check health endpoints: /health, /health/dependencies
- [ ] Review Sentry for recent errors
- [ ] Check database connectivity
- [ ] Review recent deployments
- [ ] Check external service status pages
### Mitigation (varies)
- [ ] Document attempted fixes
- [ ] Rollback if deployment-related
- [ ] Fail over to backup systems if needed
- [ ] Update status page with progress
### Resolution
- [ ] Confirm service restored
- [ ] Update status page to "Resolved"
- [ ] Document timeline and actions
- [ ] Schedule postmortem (SEV-1/2)
7. Postmortem Process¶
7.1 When to Write a Postmortem¶
- All SEV-1 and SEV-2 incidents
- SEV-3 incidents with customer impact
- Near-misses that could have been worse
- Security incidents regardless of severity
- Data loss events
7.2 Postmortem Template¶
# Postmortem: [Incident Title]
**Date**: YYYY-MM-DD
**Duration**: X hours Y minutes
**Severity**: SEV-X
**Author**: [Name]
**Status**: [Draft | Final]
## Summary
[2-3 sentence summary of what happened and impact]
## Impact
- **Affected Services**: [List services]
- **Customer Impact**: [Description]
- **Orders Affected**: [Number]
- **Duration of Impact**: [Time]
## Timeline (All times in UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert triggered |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service fully restored |
## Root Cause
[Detailed explanation of what caused the incident]
## Detection
**How was the incident detected?**
- [ ] Automated alerting
- [ ] Customer report
- [ ] Internal observation
- [ ] External service notification
**Detection gap**: [If applicable, why wasn't this detected sooner?]
## Resolution
[Step-by-step description of how the incident was resolved]
## Lessons Learned
### What Went Well
- [Positive item 1]
- [Positive item 2]
### What Went Wrong
- [Issue 1]
- [Issue 2]
### Where We Got Lucky
- [Lucky circumstance that limited impact]
## Action Items
| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| [Action 1] | [Owner] | P1 | YYYY-MM-DD | [ ] Open |
| [Action 2] | [Owner] | P2 | YYYY-MM-DD | [ ] Open |
## Appendix
### Supporting Data
- [Links to dashboards, logs, etc.]
### Related Incidents
- [Links to previous related incidents]
7.3 Blameless Postmortem Culture¶
Principles:
1. Focus on systems, not individuals
2. Assume everyone acted with best intentions
3. Identify process gaps, not scapegoats
4. Share learnings openly
5. Follow through on action items

Anti-Patterns to Avoid:
- "Human error" as root cause (ask why the error was possible)
- Single point of blame
- Superficial action items
- Skipping postmortems for "obvious" issues
8. Status Page Implementation¶
8.1 Why a Status Page?¶
- Transparency: Customers see real-time system status
- Reduced Support Load: Fewer "is it down?" inquiries
- Trust Building: Proactive communication during incidents
- Accountability: Public track record of reliability
8.2 Recommended Solution: Statping-ng¶
Based on the project requirements and open-source preference, Statping-ng is recommended.
Why Statping-ng:
- Self-hosted (data sovereignty)
- Lightweight (~20MB Docker image)
- Built-in monitoring and alerts
- Beautiful, customizable UI
- Prometheus exporter included
- Multiple notification channels
Alternative Options:
| Solution | Type | Cost | Pros | Cons |
|---|---|---|---|---|
| Statping-ng | Self-hosted | Free | Full control, lightweight | Self-maintenance |
| Cachet | Self-hosted | Free | Simple, Laravel-based | Less active development |
| Upptime | GitHub-based | Free | Zero infrastructure | Limited features |
| Better Stack | SaaS | $20+/mo | Managed, incident management | Vendor lock-in |
| Atlassian Statuspage | SaaS | $29+/mo | Industry standard | Cost, complexity |
8.3 Statping-ng Implementation¶
Docker Compose Configuration:
# docker-compose.statuspage.yml
version: "3.8"
services:
statping:
image: adamboutcher/statping-ng:latest
container_name: statping
restart: always
ports:
- "8080:8080"
volumes:
- statping_data:/app
environment:
- DB_CONN=postgres
- DB_HOST=postgres
- DB_PORT=5432
- DB_DATABASE=statping
- DB_USER=statping
- DB_PASS=${STATPING_DB_PASS}
- NAME=Forma 3D Connect Status
- DESCRIPTION=Real-time status of Forma 3D Connect services
depends_on:
- postgres
postgres:
image: postgres:15-alpine
container_name: statping-db
restart: always
volumes:
- postgres_data:/var/lib/postgresql/data
environment:
- POSTGRES_DB=statping
- POSTGRES_USER=statping
- POSTGRES_PASSWORD=${STATPING_DB_PASS}
volumes:
statping_data:
postgres_data:
Services to Monitor:
| Service | Check Type | Interval | Timeout |
|---|---|---|---|
| API Health | HTTP GET /health | 30s | 10s |
| API Ready | HTTP GET /health/ready | 30s | 10s |
| Shopify Integration | HTTP GET /health/dependencies | 60s | 15s |
| SimplyPrint Integration | HTTP GET /health/dependencies | 60s | 15s |
| SendCloud Integration | HTTP GET /health/dependencies | 60s | 15s |
| Web Dashboard | HTTP GET / | 60s | 10s |
| Database | TCP postgres:5432 | 30s | 5s |
Prometheus Integration:
# prometheus.yml
scrape_configs:
- job_name: 'statping'
bearer_token: '${STATPING_API_SECRET}'
static_configs:
- targets: ['statping:8080']
8.4 Status Page Content¶
Recommended Components:
1. Core Platform
   - API Services
   - Web Dashboard
   - Database
2. Integrations
   - Shopify Connection
   - SimplyPrint Connection
   - SendCloud Connection
   - Email Notifications
3. Background Services
   - Order Processing
   - Print Job Sync
   - Retry Queue
Incident Communication Templates:
## Investigating: [Service] Performance Degradation
We are currently investigating reports of slow response times on [service].
Updates will be provided every 30 minutes.
## Identified: [Service] Outage
We have identified the cause of the [service] outage. Our team is working on a fix.
Estimated time to resolution: [X hours]
## Monitoring: [Service] Restored
[Service] has been restored. We are monitoring for stability.
A postmortem will be published within 48 hours.
## Resolved: [Service] Incident
The incident affecting [service] has been fully resolved.
[Link to postmortem when available]
9. Alerting and Notification Strategy¶
9.1 Current Alerting Capabilities¶
| Channel | Implementation | Coverage |
|---|---|---|
| Sentry | ✅ Implemented | Application errors, performance |
| Email | ✅ Implemented | Failed operations (via retry queue) |
| Push Notifications | ✅ Implemented | Order/job status (user-facing) |
| Slack/Teams | ❌ Not implemented | Operational alerts |
| PagerDuty/Opsgenie | ❌ Not implemented | On-call escalation |
9.2 Recommended Alert Configuration¶
Alert Categories:
| Category | Priority | Channel | Examples |
|---|---|---|---|
| Page | P1 | PagerDuty + Slack | Database down, API 5xx spike |
| Urgent | P2 | Slack + Email | Integration failures, queue backup |
| Warning | P3 | Slack | Elevated error rates, slow queries |
| Info | P4 | Slack (low-priority) | Deployment success, stats |
Alert Rules:
# alerting-rules.yml
groups:
- name: forma3d-critical
rules:
- alert: DatabaseDown
expr: pg_up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "PostgreSQL database is down"
description: "Database has been unreachable for more than 1 minute"
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High HTTP error rate detected"
description: "Error rate is above 5% for 5 minutes"
- alert: RetryQueueBacklog
expr: retry_queue_pending_count > 100
for: 10m
labels:
severity: warning
annotations:
summary: "Retry queue backlog growing"
description: "More than 100 items pending in retry queue"
- alert: IntegrationUnhealthy
expr: health_dependency_status{service=~"shopify|simplyprint|sendcloud"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "External integration unhealthy"
description: "{{ $labels.service }} has been unhealthy for 5 minutes"
9.3 Notification Channels Setup¶
Slack Integration:
// libs/observability/src/alerting/slack-notifier.ts
import { Inject, Injectable } from '@nestjs/common';
import { WebClient } from '@slack/web-api';
// Alert, Config, and the CONFIG injection token are app-local definitions
@Injectable()
export class SlackNotifier {
private client: WebClient;
constructor(@Inject(CONFIG) private config: Config) {
this.client = new WebClient(config.SLACK_BOT_TOKEN);
}
async sendAlert(alert: Alert): Promise<void> {
const channel = this.getChannel(alert.severity);
await this.client.chat.postMessage({
channel,
blocks: [
{
type: 'header',
text: {
type: 'plain_text',
text: `🚨 ${alert.title}`,
},
},
{
type: 'section',
fields: [
{ type: 'mrkdwn', text: `*Severity:*\n${alert.severity}` },
{ type: 'mrkdwn', text: `*Service:*\n${alert.service}` },
],
},
{
type: 'section',
text: { type: 'mrkdwn', text: alert.description },
},
{
type: 'actions',
elements: [
{
type: 'button',
text: { type: 'plain_text', text: 'View in Sentry' },
url: alert.sentryUrl,
},
{
type: 'button',
text: { type: 'plain_text', text: 'Runbook' },
url: `https://docs.forma3d.com/runbooks/${alert.type}`,
},
],
},
],
});
}
private getChannel(severity: string): string {
switch (severity) {
case 'critical': return '#incidents';
case 'warning': return '#alerts';
default: return '#monitoring';
}
}
}
10. Service Level Agreements (SLAs)¶
10.1 Defining SLAs for Multi-Tenant System¶
When offering the platform to tenants, clear SLAs establish expectations and accountability.
Recommended SLA Tiers:
| Tier | Target Uptime | Support Response | Price Point |
|---|---|---|---|
| Basic | 99.0% (7.3 h/month downtime) | 24 hours | Entry level |
| Professional | 99.5% (3.65 h/month downtime) | 4 hours | Mid-tier |
| Enterprise | 99.9% (43.8 min/month downtime) | 1 hour | Premium |
10.2 SLA Components¶
Uptime Calculation¶
Uptime % = ((Total Minutes - Downtime Minutes) / Total Minutes) × 100
Exclusions:
- Scheduled maintenance (with 48h notice)
- Third-party service outages (Shopify, SimplyPrint, SendCloud)
- Force majeure events
- Customer-caused issues
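The uptime formula above inverts neatly into an allowed-downtime budget per tier. Using an average month of 365.25 / 12 days (≈ 43,830 minutes), which is how the 43.8 min and 7.3 h figures in the tier table are derived:

```typescript
// Allowed downtime per month for a given uptime target.
// Average month = 365.25 days / 12 ≈ 43,830 minutes.
function allowedDowntimeMinutes(uptimePercent: number): number {
  const minutesPerMonth = (365.25 * 24 * 60) / 12;
  return minutesPerMonth * (1 - uptimePercent / 100);
}
```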
Service Level Indicators (SLIs)¶
| SLI | Measurement | Target |
|---|---|---|
| Availability | Successful health checks / Total checks | 99.9% |
| Latency (API) | p95 response time | < 500ms |
| Latency (Dashboard) | p95 page load | < 3s |
| Order Processing | Time from webhook to processing start | < 30s |
| Error Rate | 5xx responses / Total responses | < 0.1% |
Service Level Objectives (SLOs)¶
# SLO Configuration
slos:
api_availability:
target: 99.9%
window: 30d
indicator: probe_success{job="api-health"}
api_latency:
target: 99.0%
window: 30d
indicator: histogram_quantile(0.95, http_request_duration_seconds_bucket) < 0.5
order_processing:
target: 99.5%
window: 30d
indicator: order_processing_time_seconds < 30
10.3 Error Budget¶
Concept: The allowed amount of unreliability within SLA targets.
99.9% uptime = 0.1% error budget = 43.8 minutes/month
Error Budget Remaining = Target Downtime - Actual Downtime
If error budget exhausted:
- Freeze non-critical deployments
- Focus on reliability improvements
- Increase testing requirements
Error Budget Policy:
| Budget Remaining | Actions |
|---|---|
| > 50% | Normal operations |
| 25-50% | Review deployment frequency |
| 10-25% | Pause feature releases, focus on reliability |
| < 10% | Emergency mode, critical fixes only |
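The budget formula and policy table above translate directly into code. A sketch with illustrative function names:

```typescript
// Remaining error budget as a percentage of the allowed downtime.
function budgetRemainingPercent(
  allowedDowntimeMin: number,
  actualDowntimeMin: number,
): number {
  return Math.max(0, ((allowedDowntimeMin - actualDowntimeMin) / allowedDowntimeMin) * 100);
}

// Map remaining budget onto the action tiers from the policy table.
function errorBudgetAction(remainingPercent: number): string {
  if (remainingPercent > 50) return 'normal operations';
  if (remainingPercent > 25) return 'review deployment frequency';
  if (remainingPercent > 10) return 'pause feature releases';
  return 'emergency mode';
}
```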
10.4 SLA Communication Template¶
# Forma 3D Connect Service Level Agreement
## Service Commitment
Forma 3D Connect commits to providing [TIER] level service with the following guarantees:
### Availability
- **Target**: [99.X]% monthly uptime
- **Measurement**: Based on successful responses to health check endpoints
- **Exclusions**: Scheduled maintenance, third-party outages
### Support Response
- **Critical Issues**: Response within [X] hours
- **Major Issues**: Response within [X] hours
- **Minor Issues**: Response within [X] business days
### Remedies
If monthly uptime falls below the target:
| Uptime | Service Credit |
|--------|---------------|
| < 99.X% but ≥ 99.Y% | 10% of monthly fee |
| < 99.Y% but ≥ 99.Z% | 25% of monthly fee |
| < 99.Z% | 50% of monthly fee |
Credits must be requested within 30 days of the incident.
### Exclusions
This SLA does not apply to:
- Scheduled maintenance announced 48+ hours in advance
- Third-party service disruptions (Shopify, SimplyPrint, SendCloud)
- Customer misuse or misconfiguration
- Features labeled as "Beta" or "Preview"
- Free tier accounts
11. Recommendations and Next Steps¶
11.1 Priority Matrix¶
| Priority | Action | Effort | Impact | Timeline |
|---|---|---|---|---|
| P0 | Implement automated database backups | Medium | Critical | Week 1-2 |
| P0 | Document incident response procedures | Low | Critical | Week 1 |
| P1 | Deploy Statping-ng status page | Medium | High | Week 2-3 |
| P1 | Set up Slack alerting integration | Medium | High | Week 2 |
| P1 | Configure Prometheus + Grafana monitoring | High | High | Week 3-4 |
| P2 | Implement database replication | High | Critical | Month 2 |
| P2 | Add circuit breakers to integrations | Medium | Medium | Month 2 |
| P2 | Create runbooks for common issues | Medium | Medium | Month 2 |
| P3 | Implement PagerDuty on-call rotation | Medium | Medium | Month 3 |
| P3 | Conduct disaster recovery drill | Medium | High | Month 3 |
| P3 | Define and publish tenant SLAs | Low | Medium | Month 3 |
11.2 Implementation Roadmap¶
gantt
title Disaster Recovery Implementation
dateFormat YYYY-MM-DD
section Phase 1: Foundation
Document incident procedures :a1, 2026-02-10, 5d
Implement database backups :a2, 2026-02-10, 10d
section Phase 2: Visibility
Deploy status page :b1, 2026-02-17, 7d
Set up Slack alerts :b2, 2026-02-17, 5d
Configure monitoring stack :b3, 2026-02-20, 10d
section Phase 3: Resilience
Database replication :c1, 2026-03-01, 14d
Circuit breakers :c2, 2026-03-01, 7d
Create runbooks :c3, 2026-03-08, 10d
section Phase 4: Operations
PagerDuty setup :d1, 2026-03-15, 7d
DR drill :d2, 2026-03-22, 3d
Publish SLAs :d3, 2026-03-25, 5d
11.3 Success Metrics¶
| Metric | Current | Target | Measurement |
|---|---|---|---|
| MTTD (Mean Time to Detect) | Unknown | < 5 min | Alert to detection |
| MTTR (Mean Time to Recover) | Unknown | < 30 min | Detection to resolution |
| Incident frequency | Unknown | < 2/month | SEV-1/SEV-2 incidents |
| Postmortem completion | 0% | 100% | SEV-1/SEV-2 within 5 days |
| Backup test frequency | Never | Monthly | Restore verification |
11.4 Cost Estimates¶
| Component | Option A (Budget) | Option B (Recommended) | Option C (Enterprise) |
|---|---|---|---|
| Status Page | Statping-ng (Free) | Statping-ng (Free) | Atlassian ($79/mo) |
| Monitoring | Prometheus (Free) | Grafana Cloud ($50/mo) | Datadog ($100+/mo) |
| Alerting | Slack (Free) | PagerDuty ($25/user/mo) | PagerDuty + OpsGenie |
| Database HA | Manual failover | Managed DB ($100/mo) | Multi-region ($300+/mo) |
| Total | ~$0/mo | ~$175-250/mo | ~$500+/mo |
Appendix A: Quick Reference Cards¶
A.1 Emergency Contacts¶
## Emergency Contacts
| Role | Name | Phone | Email |
|------|------|-------|-------|
| Primary On-Call | [TBD] | [TBD] | [TBD] |
| Secondary On-Call | [TBD] | [TBD] | [TBD] |
| Engineering Lead | [TBD] | [TBD] | [TBD] |
## External Services
| Service | Status Page | Support |
|---------|-------------|---------|
| Shopify | shopifystatus.com | partners@shopify.com |
| SimplyPrint | [TBD] | support@simplyprint.io |
| SendCloud | status.sendcloud.sc | support@sendcloud.sc |
A.2 Critical Commands¶
# Health check
curl https://api.forma3d.com/health | jq
# Database connection test
psql $DATABASE_URL -c "SELECT 1"
# View recent errors in Sentry
# (via Sentry dashboard)
# Check retry queue
curl -H "Authorization: Bearer $API_KEY" \
https://api.forma3d.com/retry-queue/stats
# Force retry queue processing
curl -X POST -H "Authorization: Bearer $API_KEY" \
https://api.forma3d.com/retry-queue/process
# Kubernetes pod restart
kubectl rollout restart deployment/api -n forma3d
# View recent logs
kubectl logs -l app=api -n forma3d --tail=100 -f
# Database backup
pg_dump $DATABASE_URL | gzip > backup_$(date +%Y%m%d).sql.gz
A.3 Runbook Index¶
| Issue | Runbook Location |
|---|---|
| Database unreachable | docs/runbooks/database-connection.md |
| Shopify webhooks failing | docs/runbooks/shopify-webhooks.md |
| SimplyPrint sync issues | docs/runbooks/simplyprint-sync.md |
| High error rate | docs/runbooks/error-rate-spike.md |
| Memory exhaustion | docs/runbooks/memory-oom.md |
| Disk space low | docs/runbooks/disk-space.md |
| Certificate expiration | docs/runbooks/ssl-renewal.md |
Appendix B: Compliance Considerations¶
GDPR Requirements¶
- Breach Notification: 72 hours to supervisory authority
- Data Subject Rights: Must be maintained during incidents
- Data Recovery: Backups must be restorable and tested
Audit Trail¶
- All incidents logged in the AuditLog table
- Postmortems stored for minimum 3 years
- Access logs retained per data retention policy
Document Version: 1.0
Last Updated: February 2026
Next Review: August 2026