Disaster Recovery Research Report¶
Status: Research Document
Created: February 2026
Scope: Forma 3D Connect Platform
Table of Contents¶
- Executive Summary
- System Overview
- Risk Assessment Matrix
- Disaster Scenarios and Mitigation
- Backup and Recovery Strategies
- Incident Response Procedures
- Postmortem Process
- Status Page Implementation
- Alerting and Notification Strategy
- Service Level Agreements (SLAs)
- Recommendations and Next Steps
1. Executive Summary¶
This document outlines a comprehensive disaster recovery (DR) strategy for the Forma 3D Connect platform. The system orchestrates 3D printing fulfillment by integrating with Shopify (e-commerce), SimplyPrint (print management), and SendCloud (shipping). A robust DR plan is essential to ensure business continuity, protect customer data, and maintain SLA commitments.
Key Findings¶
- Critical Dependencies: PostgreSQL database, three external APIs (Shopify, SimplyPrint, SendCloud)
- Current Strengths: Retry queue system, health checks, event logging, webhook idempotency
- Gaps Identified: No formal backup strategy documented, no status page, limited alerting beyond Sentry
- Recommended RTO: 4 hours for critical services, 24 hours for full restoration
- Recommended RPO: 1 hour (maximum data loss tolerance)
2. System Overview¶
Architecture Components¶
┌─────────────────────────────────────────────────────────────────────────┐
│ EXTERNAL SERVICES │
├──────────────────┬──────────────────┬──────────────────┬────────────────┤
│ Shopify │ SimplyPrint │ SendCloud │ SMTP │
│ (E-commerce) │ (Printing) │ (Shipping) │ (Email) │
└────────┬─────────┴────────┬─────────┴────────┬─────────┴────────┬───────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ NestJS API │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐│
│ │ Webhooks │ │ Orders │ │ Print Jobs │ │ Fulfillment ││
│ │ Handlers │ │ Service │ │ Service │ │ Service ││
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────────┘│
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐│
│ │ Retry Queue │ │ Health │ │ Sentry │ │ Notifications ││
│ │ Processor │ │ Checks │ │ Integration │ │ (Email/Push) ││
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────────┘│
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ PostgreSQL Database │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────────┐ │
│ │ Orders │ │PrintJobs │ │ Shipments│ │AuditLogs │ │ RetryQueue │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Critical Data Assets¶
| Asset | Description | Sensitivity | Recovery Priority |
|---|---|---|---|
| Orders | Customer orders from Shopify | High | Critical |
| Print Jobs | Job status and tracking | Medium | Critical |
| Product Mappings | Product to 3D file mappings (by Shopify product/variant ID, SKU optional) | Medium | High |
| Shipments | Tracking and label data | High | Critical |
| User Accounts | RBAC and authentication | High | Critical |
| Audit Logs | Compliance and debugging | Medium | Medium |
| Event Logs | Operational visibility | Low | Low |
3. Risk Assessment Matrix¶
Risk Scoring¶
- Probability: 1 (Rare) to 5 (Frequent)
- Impact: 1 (Minimal) to 5 (Catastrophic)
- Risk Score: Probability × Impact
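The scoring rule above is simple enough to automate when maintaining the register. A minimal TypeScript sketch (the `Risk` shape and `prioritize` helper are illustrative, not part of the platform):

```typescript
// Risk register entry: probability and impact both on a 1-5 scale
interface Risk {
  id: string;
  probability: number;
  impact: number;
}

// Risk Score = Probability × Impact
const riskScore = (r: Risk): number => r.probability * r.impact;

// Sort a register so the highest-scoring risks surface first
const prioritize = (risks: Risk[]): Risk[] =>
  [...risks].sort((a, b) => riskScore(b) - riskScore(a));
```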
Identified Risks¶
| Risk ID | Risk Description | Probability | Impact | Score | Category |
|---|---|---|---|---|---|
| R01 | Database server crash | 2 | 5 | 10 | Infrastructure |
| R02 | Database corruption | 1 | 5 | 5 | Infrastructure |
| R03 | Application server crash | 3 | 4 | 12 | Infrastructure |
| R04 | Shopify API unavailable | 2 | 4 | 8 | External Service |
| R05 | SimplyPrint API unavailable | 3 | 4 | 12 | External Service |
| R06 | SendCloud API unavailable | 2 | 3 | 6 | External Service |
| R07 | Security breach / hack | 2 | 5 | 10 | Security |
| R08 | Credential compromise | 2 | 5 | 10 | Security |
| R09 | Major misconfiguration | 3 | 4 | 12 | Operational |
| R10 | Data loss (accidental deletion) | 2 | 5 | 10 | Operational |
| R11 | Network/DNS failure | 2 | 4 | 8 | Infrastructure |
| R12 | Certificate expiration | 2 | 3 | 6 | Operational |
| R13 | Disk space exhaustion | 3 | 3 | 9 | Infrastructure |
| R14 | Memory exhaustion / OOM | 3 | 3 | 9 | Infrastructure |
| R15 | Webhook flooding / DDoS | 2 | 4 | 8 | Security |
Risk Matrix Visualization¶
┌──────────────────┬───────────┬─────────────────┬─────────────┬────────────┬──────────────┐
│ Impact ↓         │ Rare (1)  │ Unlikely (2)    │ Possible (3)│ Likely (4) │ Frequent (5) │
├──────────────────┼───────────┼─────────────────┼─────────────┼────────────┼──────────────┤
│ Catastrophic (5) │ R02       │ R01,R07,R08,R10 │             │            │              │
│ Major (4)        │           │ R04,R11,R15     │ R03,R05,R09 │            │              │
│ Moderate (3)     │           │ R06,R12         │ R13,R14     │            │              │
│ Minor (2)        │           │                 │             │            │              │
│ Minimal (1)      │           │                 │             │            │              │
└──────────────────┴───────────┴─────────────────┴─────────────┴────────────┴──────────────┘
4. Disaster Scenarios and Mitigation¶
4.1 Database Crashes¶
Scenario: PostgreSQL Server Becomes Unavailable¶
Symptoms:
- Health checks fail (/health/ready returns 503)
- All API endpoints return 500 errors
- Sentry floods with database connection errors
Immediate Response:
1. Check database container/server status
2. Review database logs for crash cause
3. Attempt restart if container crashed
4. Failover to replica if available
Mitigation Strategies:
| Strategy | Implementation | RTO Impact |
|---|---|---|
| Database Replication | PostgreSQL streaming replication with read replica | < 5 min failover |
| Managed Database | Azure Database for PostgreSQL / AWS RDS with automatic failover | < 2 min failover |
| Connection Pooling | PgBouncer for connection management | Reduces crash risk |
| Regular Backups | Automated daily backups with point-in-time recovery | 1-4 hours |
Current System Safeguards:
- Connection pooling configured (DATABASE_POOL_SIZE)
- Health checks detect issues quickly
- Retry queue persists failed operations for later retry
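For reference, the kind of exponential backoff such a retry queue typically applies can be sketched as follows (the base delay and cap are illustrative defaults, not the platform's documented schedule):

```typescript
// Exponential backoff with a cap: 1s, 2s, 4s, 8s, ... up to 60s.
// Production systems usually add jitter on top of this to avoid
// synchronized retry storms.
function backoffDelayMs(attempt: number, baseMs = 1_000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```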
Recommended Actions:
# PostgreSQL High Availability Setup
primary:
- Streaming replication to standby
- WAL archiving to object storage
- Automated backups every 6 hours
standby:
- Hot standby for read queries
- Automatic promotion via Patroni/pgpool
backup_retention:
daily: 7 days
weekly: 4 weeks
monthly: 12 months
4.2 Application Server Crashes¶
Scenario: NestJS API Container Crashes or Becomes Unresponsive¶
Symptoms:
- Kubernetes/Docker health probes fail
- No response on any endpoint
- WebSocket connections drop

Immediate Response:
1. Container orchestrator auto-restarts (if configured)
2. Check container logs for crash reason
3. Verify resource limits (memory/CPU)
4. Scale up if load-related
Mitigation Strategies:
| Strategy | Implementation | Benefit |
|---|---|---|
| Multiple Replicas | Kubernetes Deployment with 2+ replicas | Zero downtime |
| Health Probes | Liveness and readiness probes | Auto-recovery |
| Horizontal Pod Autoscaler | Scale based on CPU/memory | Handle load spikes |
| Graceful Shutdown | Handle SIGTERM properly | No dropped requests |
Kubernetes Deployment Example:
apiVersion: apps/v1
kind: Deployment
spec:
replicas: 3
template:
spec:
containers:
- name: api
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "1000m"
livenessProbe:
httpGet:
path: /health/live
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health/ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 5
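The "Graceful Shutdown" row in the mitigation table deserves a sketch. In NestJS this is typically enabled via `app.enableShutdownHooks()`; the underlying drain logic, shown framework-agnostic and simplified here as an assumption about how one might implement it, looks roughly like this:

```typescript
// Minimal drain tracker: stop accepting new work once SIGTERM arrives,
// resolve drain() only when all in-flight requests have finished.
class GracefulShutdown {
  private inFlight = 0;
  private draining = false;
  private onIdle: (() => void) | null = null;

  // Call at the start of each request; false means "shutting down, reject".
  begin(): boolean {
    if (this.draining) return false;
    this.inFlight += 1;
    return true;
  }

  // Call when a request completes.
  end(): void {
    this.inFlight -= 1;
    if (this.draining && this.inFlight === 0) this.onIdle?.();
  }

  // Called from the SIGTERM handler; resolves once all work is done.
  drain(): Promise<void> {
    this.draining = true;
    if (this.inFlight === 0) return Promise.resolve();
    return new Promise((resolve) => {
      this.onIdle = resolve;
    });
  }
}

// Wiring sketch:
// process.on('SIGTERM', async () => { await shutdown.drain(); process.exit(0); });
```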
4.3 Security Breach / System Hack¶
Scenario: Unauthorized Access or Data Breach¶
Attack Vectors:
1. Credential theft (API keys, database passwords)
2. SQL injection (mitigated by Prisma ORM)
3. Webhook spoofing (mitigated by HMAC verification)
4. Session hijacking
5. Dependency vulnerabilities
Immediate Response Checklist:
## Security Incident Response Checklist
### Phase 1: Contain (First 30 minutes)
- [ ] Isolate affected systems (network level)
- [ ] Revoke compromised credentials immediately
- [ ] Enable maintenance mode if needed
- [ ] Preserve logs and evidence (DO NOT delete)
### Phase 2: Assess (1-4 hours)
- [ ] Identify attack vector and timeline
- [ ] Determine scope of data exposure
- [ ] Check for persistence mechanisms (backdoors)
- [ ] Review audit logs for suspicious activity
### Phase 3: Remediate (4-24 hours)
- [ ] Rotate all secrets and API keys
- [ ] Patch vulnerabilities
- [ ] Reset user passwords if needed
- [ ] Review and update access controls
### Phase 4: Recover (24-72 hours)
- [ ] Restore from clean backup if needed
- [ ] Re-enable services gradually
- [ ] Enhanced monitoring period
- [ ] Notify affected parties (GDPR: 72 hours)
Preventive Measures:
| Measure | Current Status | Recommendation |
|---|---|---|
| HMAC Webhook Verification | ✅ Implemented | Maintain |
| Rate Limiting | ✅ Implemented | Add IP-based limits |
| Audit Logging | ✅ Implemented | Add alerting |
| Secret Management | ⚠️ Environment variables | Use Azure Key Vault / HashiCorp Vault |
| Dependency Scanning | ❌ Not implemented | Add Dependabot / Snyk |
| WAF | ❌ Not implemented | Consider Azure WAF / Cloudflare |
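The HMAC webhook verification marked as implemented above follows a standard pattern: compute an HMAC-SHA256 of the raw request body with the shared secret and compare it in constant time against the signature header (Shopify sends it base64-encoded in `X-Shopify-Hmac-Sha256`). A generic sketch, not the platform's actual handler:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Verify a Shopify-style webhook signature: HMAC-SHA256 over the raw
// body, base64-encoded, compared in constant time to avoid timing leaks.
function verifyWebhookHmac(
  rawBody: string,
  headerHmacBase64: string,
  secret: string,
): boolean {
  const digest = createHmac('sha256', secret).update(rawBody, 'utf8').digest();
  const header = Buffer.from(headerHmacBase64, 'base64');
  return header.length === digest.length && timingSafeEqual(header, digest);
}
```

Note that verification must run against the raw body bytes, before any JSON parsing, or re-serialization differences will break the comparison.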
Credential Rotation Procedure:
#!/bin/bash
# Emergency credential rotation script
# 1. Database password
echo "Rotating database credentials..."
# Update in secret manager, then rolling restart
# 2. Shopify credentials
echo "Rotating Shopify API credentials..."
# Generate new keys in Shopify Partner Dashboard
# Update SHOPIFY_API_KEY, SHOPIFY_API_SECRET
# 3. SimplyPrint API key
echo "Rotating SimplyPrint credentials..."
# Generate new API key in SimplyPrint dashboard
# Update SIMPLYPRINT_API_KEY
# 4. SendCloud credentials
echo "Rotating SendCloud credentials..."
# Generate new API keys in SendCloud dashboard
# Update SENDCLOUD_PUBLIC_KEY, SENDCLOUD_SECRET_KEY
# 5. Internal secrets
echo "Rotating internal secrets..."
# SESSION_SECRET, INTERNAL_API_KEY, SHOPIFY_TOKEN_ENCRYPTION_KEY
# 6. Rolling restart
echo "Performing rolling restart..."
kubectl rollout restart deployment/api
4.4 Major Misconfiguration¶
Scenario: Production Environment Misconfigured¶
Common Misconfigurations:
1. Wrong database URL (connecting to staging/dev)
2. Incorrect API keys for wrong environment
3. CORS misconfiguration blocking frontend
4. Invalid webhook URLs preventing order ingestion
5. Wrong SMTP settings causing notification failures
Detection Mechanisms:
| Misconfiguration | Detection Method | Time to Detect |
|---|---|---|
| Database URL wrong | Health check fails | < 1 minute |
| API keys invalid | First API call fails, logged to Sentry | < 5 minutes |
| Webhook URL wrong | No orders coming in, monitoring alert | 5-30 minutes |
| SMTP misconfigured | Email delivery fails, retry queue grows | 30 minutes |
Prevention Strategies:
1. Configuration Validation on Startup:

// apps/api/src/config/config.validation.ts
export function validateConfiguration(config: Record<string, unknown>) {
  const errors: string[] = [];

  // Validate database URL format
  if (!config.DATABASE_URL?.toString().startsWith('postgresql://')) {
    errors.push('DATABASE_URL must be a valid PostgreSQL connection string');
  }

  // Validate required secrets are present
  const requiredSecrets = [
    'SESSION_SECRET',
    'SHOPIFY_API_KEY',
    'SIMPLYPRINT_API_KEY',
    'SENDCLOUD_PUBLIC_KEY',
  ];
  for (const secret of requiredSecrets) {
    if (!config[secret]) {
      errors.push(`Missing required secret: ${secret}`);
    }
  }

  if (errors.length > 0) {
    throw new Error(`Configuration validation failed:\n${errors.join('\n')}`);
  }
}

2. Environment-Specific Safeguards:

// Prevent production database operations in non-production
if (
  process.env.NODE_ENV !== 'production' &&
  process.env.DATABASE_URL?.includes('production')
) {
  throw new Error('DANGER: Non-production environment connected to production database!');
}

3. Infrastructure as Code:
- Use Terraform/Pulumi for consistent deployments
- Environment-specific variable files
- Code review for infrastructure changes
4.5 Third-Party Service Unavailability¶
4.5.1 Shopify API Unavailable¶
Impact: Cannot receive new orders, cannot create fulfillments
Detection:
- /health/dependencies shows Shopify unhealthy
- Webhook deliveries fail (Shopify retries for 48 hours)
- Manual order creation fails
Current Mitigations:
- ✅ Shopify retries webhooks with exponential backoff
- ✅ Processed webhook idempotency prevents duplicates on retry
- ✅ Health indicator detects issues

Response Procedure:
1. Check Shopify Status
2. If Shopify is up, check our API connectivity
3. Queue fulfillment operations for later retry
4. Monitor Shopify's status for resolution
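The webhook idempotency mitigation boils down to recording each processed webhook ID and skipping repeats. A simplified in-memory sketch for illustration only; the real system would back this with the processed-webhook table and a unique constraint, not a Set:

```typescript
// Deduplicate webhook deliveries by ID so provider retries are harmless.
class WebhookDeduplicator {
  private readonly seen = new Set<string>();

  // Returns true exactly once per webhook ID; retries return false.
  shouldProcess(webhookId: string): boolean {
    if (this.seen.has(webhookId)) return false;
    this.seen.add(webhookId);
    return true;
  }
}
```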
Recommendations:
// Add Shopify-specific circuit breaker
import { Injectable, ServiceUnavailableException } from '@nestjs/common';

@Injectable()
export class ShopifyCircuitBreaker {
  private failures = 0;
  private isOpen = false;
  private lastFailure: Date | null = null;
  private readonly failureThreshold = 5;
  private readonly resetTimeoutMs = 30_000;

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.isOpen && this.shouldAttemptReset()) {
      this.isOpen = false; // half-open: allow one trial call
    }
    if (this.isOpen) {
      throw new ServiceUnavailableException('Shopify circuit breaker is open');
    }
    try {
      const result = await operation();
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures++;
      this.lastFailure = new Date();
      if (this.failures >= this.failureThreshold) {
        this.isOpen = true;
      }
      throw error;
    }
  }

  private shouldAttemptReset(): boolean {
    return (
      this.lastFailure !== null &&
      Date.now() - this.lastFailure.getTime() > this.resetTimeoutMs
    );
  }
}
4.5.2 SimplyPrint API Unavailable¶
Impact: Cannot create print jobs, cannot get status updates
Detection:
- /health/dependencies shows SimplyPrint unhealthy
- Polling service logs connection errors
- Print job creation fails
Current Mitigations:
- ✅ Retry queue for print job creation
- ✅ Polling handles temporary failures gracefully
- ✅ Health indicator detects issues

Response Procedure:
1. Check SimplyPrint status/support channels
2. Print jobs queue in PENDING state
3. Polling will resume automatically when service returns
4. Manual reconciliation may be needed after extended outage

Degraded Mode Operation:
- Orders continue to be received and stored
- Print jobs are queued but not submitted
- Dashboard shows "SimplyPrint Unavailable" warning
- Operators can manually track printing progress
4.5.3 SendCloud API Unavailable¶
Impact: Cannot create shipping labels
Detection:
- /health/dependencies shows SendCloud unhealthy
- Shipment creation fails with errors
- Label generation queued in retry
Current Mitigations:
- ✅ Retry queue for shipment creation
- ✅ Health indicator detects issues

Response Procedure:
1. Check SendCloud Status
2. Shipments queue in PENDING state
3. Once restored, retry queue processes pending shipments
4. Operators may need to manually create labels for urgent orders
5. Backup and Recovery Strategies¶
5.1 Backup Strategy¶
Database Backups¶
| Backup Type | Frequency | Retention | Storage |
|---|---|---|---|
| Full Backup | Daily (3 AM) | 30 days | Azure Blob / S3 |
| Incremental (WAL) | Continuous | 7 days | Azure Blob / S3 |
| Point-in-Time | Continuous | 7 days | Managed service |
| Monthly Archive | Monthly | 1 year | Cold storage |
Backup Configuration (Azure Database for PostgreSQL):
{
"backup": {
"geoRedundantBackup": "Enabled",
"backupRetentionDays": 35,
"earliestRestoreDate": "2026-01-01T00:00:00Z"
},
"storage": {
"storageSizeGB": 100,
"autoGrow": "Enabled"
}
}
Manual Backup Script:
#!/bin/bash
# backup-database.sh
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="forma3d_backup_${DATE}.sql.gz"
# Create backup
pg_dump $DATABASE_URL | gzip > /tmp/$BACKUP_FILE
# Upload to cloud storage
az storage blob upload \
--container-name backups \
--file /tmp/$BACKUP_FILE \
--name "database/$BACKUP_FILE"
# Cleanup local file
rm /tmp/$BACKUP_FILE
# Verify backup
az storage blob exists \
--container-name backups \
--name "database/$BACKUP_FILE"
5.2 Recovery Procedures¶
Database Recovery¶
Scenario: Point-in-Time Recovery
# 1. Identify target recovery time
TARGET_TIME="2026-02-05T14:30:00Z"
# 2. Create new database from backup (Azure)
az postgres server restore \
--resource-group forma3d-rg \
--name forma3d-db-recovered \
--source-server forma3d-db \
--restore-point-in-time $TARGET_TIME
# 3. Verify data integrity
psql $RECOVERED_DATABASE_URL -c "SELECT COUNT(*) FROM orders;"
# 4. Update connection string and restart
kubectl set env deployment/api DATABASE_URL=$RECOVERED_DATABASE_URL
# 5. Run integrity checks
npm run db:integrity-check
Scenario: Full Database Restore from Backup
# 1. Download latest backup
az storage blob download \
--container-name backups \
--name "database/forma3d_backup_latest.sql.gz" \
--file /tmp/restore.sql.gz
# 2. Create new database
createdb -h $DB_HOST forma3d_restored
# 3. Restore backup
gunzip -c /tmp/restore.sql.gz | psql -h $DB_HOST forma3d_restored
# 4. Run migrations to ensure schema is current
DATABASE_URL="..." npx prisma migrate deploy
# 5. Switch over
# Update Kubernetes secrets and restart
5.3 Recovery Time and Point Objectives¶
| Scenario | RTO Target | RPO Target | Current Capability |
|---|---|---|---|
| Database crash (with replica) | 5 minutes | 0 (sync replication) | ❌ Need to implement |
| Database crash (backup only) | 2 hours | 1 hour | ✅ Achievable |
| Application crash | 2 minutes | 0 | ✅ With K8s auto-restart |
| Full disaster (region failure) | 4 hours | 1 hour | ❌ Need geo-redundancy |
| Security breach | 24 hours | Varies | ✅ From clean backup |
6. Incident Response Procedures¶
6.1 Incident Severity Levels¶
| Severity | Description | Examples | Response Time |
|---|---|---|---|
| SEV-1 | Critical - Complete service outage | Database down, all APIs failing | Immediate (< 15 min) |
| SEV-2 | Major - Significant functionality impaired | Cannot process new orders, fulfillment broken | < 30 min |
| SEV-3 | Moderate - Partial functionality affected | One integration down, degraded performance | < 2 hours |
| SEV-4 | Minor - Low impact issues | Slow dashboard, cosmetic issues | < 24 hours |
6.2 Incident Response Workflow¶
graph TD
A[Incident Detected] --> B{Severity Assessment}
B -->|SEV-1/2| C[Page On-Call]
B -->|SEV-3/4| D[Create Ticket]
C --> E[Acknowledge]
E --> F[Investigate]
F --> G{Root Cause Found?}
G -->|No| H[Escalate]
G -->|Yes| I[Implement Fix]
I --> J[Verify Resolution]
J --> K[Update Status Page]
K --> L[Create Postmortem]
D --> F
H --> F
6.3 On-Call Procedures¶
On-Call Rotation:
- Primary: Responds to all pages
- Secondary: Backup if primary unavailable
- Escalation: Engineering lead for SEV-1
Escalation Path:
Primary On-Call (5 min)
↓
Secondary On-Call (10 min)
↓
Engineering Lead (15 min)
↓
CTO (30 min for SEV-1)
On-Call Checklist:
## Incident Response Checklist
### Initial Response (0-5 min)
- [ ] Acknowledge alert
- [ ] Open incident channel (#incident-YYYYMMDD-HHMM)
- [ ] Initial severity assessment
- [ ] Update status page to "Investigating"
### Investigation (5-30 min)
- [ ] Check health endpoints: /health, /health/dependencies
- [ ] Review Sentry for recent errors
- [ ] Check database connectivity
- [ ] Review recent deployments
- [ ] Check external service status pages
### Mitigation (varies)
- [ ] Document attempted fixes
- [ ] Rollback if deployment-related
- [ ] Fail over to backup systems if needed
- [ ] Update status page with progress
### Resolution
- [ ] Confirm service restored
- [ ] Update status page to "Resolved"
- [ ] Document timeline and actions
- [ ] Schedule postmortem (SEV-1/2)
7. Postmortem Process¶
7.1 When to Write a Postmortem¶
- All SEV-1 and SEV-2 incidents
- SEV-3 incidents with customer impact
- Near-misses that could have been worse
- Security incidents regardless of severity
- Data loss events
7.2 Postmortem Template¶
# Postmortem: [Incident Title]
**Date**: YYYY-MM-DD
**Duration**: X hours Y minutes
**Severity**: SEV-X
**Author**: [Name]
**Status**: [Draft | Final]
## Summary
[2-3 sentence summary of what happened and impact]
## Impact
- **Affected Services**: [List services]
- **Customer Impact**: [Description]
- **Orders Affected**: [Number]
- **Duration of Impact**: [Time]
## Timeline (All times in UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert triggered |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service fully restored |
## Root Cause
[Detailed explanation of what caused the incident]
## Detection
**How was the incident detected?**
- [ ] Automated alerting
- [ ] Customer report
- [ ] Internal observation
- [ ] External service notification
**Detection gap**: [If applicable, why wasn't this detected sooner?]
## Resolution
[Step-by-step description of how the incident was resolved]
## Lessons Learned
### What Went Well
- [Positive item 1]
- [Positive item 2]
### What Went Wrong
- [Issue 1]
- [Issue 2]
### Where We Got Lucky
- [Lucky circumstance that limited impact]
## Action Items
| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| [Action 1] | [Owner] | P1 | YYYY-MM-DD | [ ] Open |
| [Action 2] | [Owner] | P2 | YYYY-MM-DD | [ ] Open |
## Appendix
### Supporting Data
- [Links to dashboards, logs, etc.]
### Related Incidents
- [Links to previous related incidents]
7.3 Blameless Postmortem Culture¶
Principles:
1. Focus on systems, not individuals
2. Assume everyone acted with best intentions
3. Identify process gaps, not scapegoats
4. Share learnings openly
5. Follow through on action items

Anti-Patterns to Avoid:
- "Human error" as root cause (ask why the error was possible)
- Single point of blame
- Superficial action items
- Skipping postmortems for "obvious" issues
8. Status Page Implementation¶
8.1 Why a Status Page?¶
- Transparency: Customers see real-time system status
- Reduced Support Load: Fewer "is it down?" inquiries
- Trust Building: Proactive communication during incidents
- Accountability: Public track record of reliability
8.2 Recommended Solution: Statping-ng¶
Based on the project requirements and open-source preference, Statping-ng is recommended.
Why Statping-ng:
- Self-hosted (data sovereignty)
- Lightweight (~20MB Docker image)
- Built-in monitoring and alerts
- Beautiful, customizable UI
- Prometheus exporter included
- Multiple notification channels
Alternative Options:
| Solution | Type | Cost | Pros | Cons |
|---|---|---|---|---|
| Statping-ng | Self-hosted | Free | Full control, lightweight | Self-maintenance |
| Cachet | Self-hosted | Free | Simple, Laravel-based | Less active development |
| Upptime | GitHub-based | Free | Zero infrastructure | Limited features |
| Better Stack | SaaS | $20+/mo | Managed, incident management | Vendor lock-in |
| Atlassian Statuspage | SaaS | $29+/mo | Industry standard | Cost, complexity |
8.3 Statping-ng Implementation¶
Docker Compose Configuration:
# docker-compose.statuspage.yml
version: "3.8"
services:
statping:
image: adamboutcher/statping-ng:latest
container_name: statping
restart: always
ports:
- "8080:8080"
volumes:
- statping_data:/app
environment:
- DB_CONN=postgres
- DB_HOST=postgres
- DB_PORT=5432
- DB_DATABASE=statping
- DB_USER=statping
- DB_PASS=${STATPING_DB_PASS}
- NAME=Forma 3D Connect Status
- DESCRIPTION=Real-time status of Forma 3D Connect services
depends_on:
- postgres
postgres:
image: postgres:15-alpine
container_name: statping-db
restart: always
volumes:
- postgres_data:/var/lib/postgresql/data
environment:
- POSTGRES_DB=statping
- POSTGRES_USER=statping
- POSTGRES_PASSWORD=${STATPING_DB_PASS}
volumes:
statping_data:
postgres_data:
Services to Monitor:
| Service | Check Type | Interval | Timeout |
|---|---|---|---|
| API Health | HTTP GET /health | 30s | 10s |
| API Ready | HTTP GET /health/ready | 30s | 10s |
| Shopify Integration | HTTP GET /health/dependencies | 60s | 15s |
| SimplyPrint Integration | HTTP GET /health/dependencies | 60s | 15s |
| SendCloud Integration | HTTP GET /health/dependencies | 60s | 15s |
| Web Dashboard | HTTP GET / | 60s | 10s |
| Database | TCP postgres:5432 | 30s | 5s |
Prometheus Integration:
# prometheus.yml
scrape_configs:
- job_name: 'statping'
bearer_token: '${STATPING_API_SECRET}'
static_configs:
- targets: ['statping:8080']
8.4 Status Page Content¶
Recommended Components:
1. Core Platform
   - API Services
   - Web Dashboard
   - Database
2. Integrations
   - Shopify Connection
   - SimplyPrint Connection
   - SendCloud Connection
   - Email Notifications
3. Background Services
   - Order Processing
   - Print Job Sync
   - Retry Queue
Incident Communication Templates:
## Investigating: [Service] Performance Degradation
We are currently investigating reports of slow response times on [service].
Updates will be provided every 30 minutes.
## Identified: [Service] Outage
We have identified the cause of the [service] outage. Our team is working on a fix.
Estimated time to resolution: [X hours]
## Monitoring: [Service] Restored
[Service] has been restored. We are monitoring for stability.
A postmortem will be published within 48 hours.
## Resolved: [Service] Incident
The incident affecting [service] has been fully resolved.
[Link to postmortem when available]
9. Alerting and Notification Strategy¶
9.1 Current Alerting Capabilities¶
| Channel | Implementation | Coverage |
|---|---|---|
| Sentry | ✅ Implemented | Application errors, performance |
| Email | ✅ Implemented | Failed operations (via retry queue) |
| Push Notifications | ✅ Implemented | Order/job status (user-facing) |
| Slack/Teams | ❌ Not implemented | Operational alerts |
| PagerDuty/Opsgenie | ❌ Not implemented | On-call escalation |
9.2 Recommended Alert Configuration¶
Alert Categories:
| Category | Priority | Channel | Examples |
|---|---|---|---|
| Page | P1 | PagerDuty + Slack | Database down, API 5xx spike |
| Urgent | P2 | Slack + Email | Integration failures, queue backup |
| Warning | P3 | Slack | Elevated error rates, slow queries |
| Info | P4 | Slack (low-priority) | Deployment success, stats |
Alert Rules:
# alerting-rules.yml
groups:
- name: forma3d-critical
rules:
- alert: DatabaseDown
expr: pg_up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "PostgreSQL database is down"
description: "Database has been unreachable for more than 1 minute"
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High HTTP error rate detected"
description: "Error rate is above 5% for 5 minutes"
- alert: RetryQueueBacklog
expr: retry_queue_pending_count > 100
for: 10m
labels:
severity: warning
annotations:
summary: "Retry queue backlog growing"
description: "More than 100 items pending in retry queue"
- alert: IntegrationUnhealthy
expr: health_dependency_status{service=~"shopify|simplyprint|sendcloud"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "External integration unhealthy"
description: "{{ $labels.service }} has been unhealthy for 5 minutes"
9.3 Notification Channels Setup¶
Slack Integration:
// libs/observability/src/alerting/slack-notifier.ts
import { Inject, Injectable } from '@nestjs/common';
import { WebClient } from '@slack/web-api';
// Alert, Config, and the CONFIG injection token are app-local definitions
@Injectable()
export class SlackNotifier {
private client: WebClient;
constructor(@Inject(CONFIG) private config: Config) {
this.client = new WebClient(config.SLACK_BOT_TOKEN);
}
async sendAlert(alert: Alert): Promise<void> {
const channel = this.getChannel(alert.severity);
await this.client.chat.postMessage({
channel,
blocks: [
{
type: 'header',
text: {
type: 'plain_text',
text: `🚨 ${alert.title}`,
},
},
{
type: 'section',
fields: [
{ type: 'mrkdwn', text: `*Severity:*\n${alert.severity}` },
{ type: 'mrkdwn', text: `*Service:*\n${alert.service}` },
],
},
{
type: 'section',
text: { type: 'mrkdwn', text: alert.description },
},
{
type: 'actions',
elements: [
{
type: 'button',
text: { type: 'plain_text', text: 'View in Sentry' },
url: alert.sentryUrl,
},
{
type: 'button',
text: { type: 'plain_text', text: 'Runbook' },
url: `https://docs.forma3d.com/runbooks/${alert.type}`,
},
],
},
],
});
}
private getChannel(severity: string): string {
switch (severity) {
case 'critical': return '#incidents';
case 'warning': return '#alerts';
default: return '#monitoring';
}
}
}
10. Service Level Agreements (SLAs)¶
10.1 Defining SLAs for Multi-Tenant System¶
When offering the platform to tenants, clear SLAs establish expectations and accountability.
Recommended SLA Tiers:
| Tier | Target Uptime | Support Response | Price Point |
|---|---|---|---|
| Basic | 99.0% (7.3 h/month downtime) | 24 hours | Entry level |
| Professional | 99.5% (3.65 h/month downtime) | 4 hours | Mid-tier |
| Enterprise | 99.9% (43.8 min/month downtime) | 1 hour | Premium |
10.2 SLA Components¶
Uptime Calculation¶
Uptime % = ((Total Minutes - Downtime Minutes) / Total Minutes) × 100
Exclusions:
- Scheduled maintenance (with 48h notice)
- Third-party service outages (Shopify, SimplyPrint, SendCloud)
- Force majeure events
- Customer-caused issues
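The uptime formula above inverts neatly into an allowed-downtime budget per tier. Using an average month of 365.25 / 12 days (≈ 43,830 minutes), which is how the 43.8 min and 7.3 h figures in the tier table are derived:

```typescript
// Allowed downtime per month for a given uptime target.
// Average month = 365.25 days / 12 ≈ 43,830 minutes.
function allowedDowntimeMinutes(uptimePercent: number): number {
  const minutesPerMonth = (365.25 * 24 * 60) / 12;
  return minutesPerMonth * (1 - uptimePercent / 100);
}
```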
Service Level Indicators (SLIs)¶
| SLI | Measurement | Target |
|---|---|---|
| Availability | Successful health checks / Total checks | 99.9% |
| Latency (API) | p95 response time | < 500ms |
| Latency (Dashboard) | p95 page load | < 3s |
| Order Processing | Time from webhook to processing start | < 30s |
| Error Rate | 5xx responses / Total responses | < 0.1% |
Service Level Objectives (SLOs)¶
# SLO Configuration
slos:
api_availability:
target: 99.9%
window: 30d
indicator: probe_success{job="api-health"}
api_latency:
target: 99.0%
window: 30d
indicator: histogram_quantile(0.95, http_request_duration_seconds_bucket) < 0.5
order_processing:
target: 99.5%
window: 30d
indicator: order_processing_time_seconds < 30
10.3 Error Budget¶
Concept: The allowed amount of unreliability within SLA targets.
99.9% uptime = 0.1% error budget = 43.8 minutes/month
Error Budget Remaining = Target Downtime - Actual Downtime
If error budget exhausted:
- Freeze non-critical deployments
- Focus on reliability improvements
- Increase testing requirements
Error Budget Policy:
| Budget Remaining | Actions |
|---|---|
| > 50% | Normal operations |
| 25-50% | Review deployment frequency |
| 10-25% | Pause feature releases, focus on reliability |
| < 10% | Emergency mode, critical fixes only |
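The budget formula and policy table above translate directly into code. A sketch with illustrative function names:

```typescript
// Remaining error budget as a percentage of the allowed downtime.
function budgetRemainingPercent(
  allowedDowntimeMin: number,
  actualDowntimeMin: number,
): number {
  return Math.max(0, ((allowedDowntimeMin - actualDowntimeMin) / allowedDowntimeMin) * 100);
}

// Map remaining budget onto the action tiers from the policy table.
function errorBudgetAction(remainingPercent: number): string {
  if (remainingPercent > 50) return 'normal operations';
  if (remainingPercent > 25) return 'review deployment frequency';
  if (remainingPercent > 10) return 'pause feature releases';
  return 'emergency mode';
}
```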
10.4 SLA Communication Template¶
# Forma 3D Connect Service Level Agreement
## Service Commitment
Forma 3D Connect commits to providing [TIER] level service with the following guarantees:
### Availability
- **Target**: [99.X]% monthly uptime
- **Measurement**: Based on successful responses to health check endpoints
- **Exclusions**: Scheduled maintenance, third-party outages
### Support Response
- **Critical Issues**: Response within [X] hours
- **Major Issues**: Response within [X] hours
- **Minor Issues**: Response within [X] business days
### Remedies
If monthly uptime falls below the target:
| Uptime | Service Credit |
|--------|---------------|
| < 99.X% but ≥ 99.Y% | 10% of monthly fee |
| < 99.Y% but ≥ 99.Z% | 25% of monthly fee |
| < 99.Z% | 50% of monthly fee |
Credits must be requested within 30 days of the incident.
### Exclusions
This SLA does not apply to:
- Scheduled maintenance announced 48+ hours in advance
- Third-party service disruptions (Shopify, SimplyPrint, SendCloud)
- Customer misuse or misconfiguration
- Features labeled as "Beta" or "Preview"
- Free tier accounts
11. Recommendations and Next Steps¶
11.1 Priority Matrix¶
| Priority | Action | Effort | Impact | Timeline |
|---|---|---|---|---|
| P0 | Implement automated database backups | Medium | Critical | Week 1-2 |
| P0 | Document incident response procedures | Low | Critical | Week 1 |
| P1 | Deploy Statping-ng status page | Medium | High | Week 2-3 |
| P1 | Set up Slack alerting integration | Medium | High | Week 2 |
| P1 | Configure Prometheus + Grafana monitoring | High | High | Week 3-4 |
| P2 | Implement database replication | High | Critical | Month 2 |
| P2 | Add circuit breakers to integrations | Medium | Medium | Month 2 |
| P2 | Create runbooks for common issues | Medium | Medium | Month 2 |
| P3 | Implement PagerDuty on-call rotation | Medium | Medium | Month 3 |
| P3 | Conduct disaster recovery drill | Medium | High | Month 3 |
| P3 | Define and publish tenant SLAs | Low | Medium | Month 3 |
11.2 Implementation Roadmap¶
gantt
title Disaster Recovery Implementation
dateFormat YYYY-MM-DD
section Phase 1: Foundation
Document incident procedures :a1, 2026-02-10, 5d
Implement database backups :a2, 2026-02-10, 10d
section Phase 2: Visibility
Deploy status page :b1, 2026-02-17, 7d
Set up Slack alerts :b2, 2026-02-17, 5d
Configure monitoring stack :b3, 2026-02-20, 10d
section Phase 3: Resilience
Database replication :c1, 2026-03-01, 14d
Circuit breakers :c2, 2026-03-01, 7d
Create runbooks :c3, 2026-03-08, 10d
section Phase 4: Operations
PagerDuty setup :d1, 2026-03-15, 7d
DR drill :d2, 2026-03-22, 3d
Publish SLAs :d3, 2026-03-25, 5d
11.3 Success Metrics¶
| Metric | Current | Target | Measurement |
|---|---|---|---|
| MTTD (Mean Time to Detect) | Unknown | < 5 min | Alert to detection |
| MTTR (Mean Time to Recover) | Unknown | < 30 min | Detection to resolution |
| Incident frequency | Unknown | < 2/month | SEV-1/SEV-2 incidents |
| Postmortem completion | 0% | 100% | SEV-1/SEV-2 within 5 days |
| Backup test frequency | Never | Monthly | Restore verification |
11.4 Cost Estimates¶
| Component | Option A (Budget) | Option B (Recommended) | Option C (Enterprise) |
|---|---|---|---|
| Status Page | Statping-ng (Free) | Statping-ng (Free) | Atlassian ($79/mo) |
| Monitoring | Prometheus (Free) | Grafana Cloud ($50/mo) | Datadog ($100+/mo) |
| Alerting | Slack (Free) | PagerDuty ($25/user/mo) | PagerDuty + OpsGenie |
| Database HA | Manual failover | Managed DB ($100/mo) | Multi-region ($300+/mo) |
| Total | ~$0/mo | ~$175-250/mo | ~$500+/mo |
Appendix A: Quick Reference Cards¶
A.1 Emergency Contacts¶
## Emergency Contacts
| Role | Name | Phone | Email |
|------|------|-------|-------|
| Primary On-Call | [TBD] | [TBD] | [TBD] |
| Secondary On-Call | [TBD] | [TBD] | [TBD] |
| Engineering Lead | [TBD] | [TBD] | [TBD] |
## External Services
| Service | Status Page | Support |
|---------|-------------|---------|
| Shopify | shopifystatus.com | partners@shopify.com |
| SimplyPrint | [TBD] | support@simplyprint.io |
| SendCloud | status.sendcloud.sc | support@sendcloud.sc |
A.2 Critical Commands¶
# Health check
curl https://api.forma3d.com/health | jq
# Database connection test
psql $DATABASE_URL -c "SELECT 1"
# View recent errors in Sentry
# (via Sentry dashboard)
# Check retry queue
curl -H "Authorization: Bearer $API_KEY" \
https://api.forma3d.com/retry-queue/stats
# Force retry queue processing
curl -X POST -H "Authorization: Bearer $API_KEY" \
https://api.forma3d.com/retry-queue/process
# Kubernetes pod restart
kubectl rollout restart deployment/api -n forma3d
# View recent logs
kubectl logs -l app=api -n forma3d --tail=100 -f
# Database backup
pg_dump $DATABASE_URL | gzip > backup_$(date +%Y%m%d).sql.gz
A.3 Runbook Index¶
| Issue | Runbook Location |
|---|---|
| Database unreachable | docs/runbooks/database-connection.md |
| Shopify webhooks failing | docs/runbooks/shopify-webhooks.md |
| SimplyPrint sync issues | docs/runbooks/simplyprint-sync.md |
| High error rate | docs/runbooks/error-rate-spike.md |
| Memory exhaustion | docs/runbooks/memory-oom.md |
| Disk space low | docs/runbooks/disk-space.md |
| Certificate expiration | docs/runbooks/ssl-renewal.md |
Appendix B: Compliance Considerations¶
GDPR Requirements¶
- Breach Notification: 72 hours to supervisory authority
- Data Subject Rights: Must be maintained during incidents
- Data Recovery: Backups must be restorable and tested
Audit Trail¶
- All incidents logged in the AuditLog table
- Postmortems stored for minimum 3 years
- Access logs retained per data retention policy
Document Version: 1.0
Last Updated: February 2026
Next Review: August 2026