Disaster Recovery Research Report

Status: Research Document
Created: February 2026
Scope: Forma 3D Connect Platform

Table of Contents

  1. Executive Summary
  2. System Overview
  3. Risk Assessment Matrix
  4. Disaster Scenarios and Mitigation
  5. Backup and Recovery Strategies
  6. Incident Response Procedures
  7. Postmortem Process
  8. Status Page Implementation
  9. Alerting and Notification Strategy
  10. Service Level Agreements (SLAs)
  11. Recommendations and Next Steps

1. Executive Summary

This document outlines a comprehensive disaster recovery (DR) strategy for the Forma 3D Connect platform. The system orchestrates 3D printing fulfillment by integrating with Shopify (e-commerce), SimplyPrint (print management), and SendCloud (shipping). A robust DR plan is essential to ensure business continuity, protect customer data, and maintain SLA commitments.

Key Findings

  • Critical Dependencies: PostgreSQL database, three external APIs (Shopify, SimplyPrint, SendCloud)
  • Current Strengths: Retry queue system, health checks, event logging, webhook idempotency
  • Gaps Identified: No formal backup strategy documented, no status page, limited alerting beyond Sentry
  • Recommended RTO: 4 hours for critical services, 24 hours for full restoration
  • Recommended RPO: 1 hour (maximum data loss tolerance)

2. System Overview

Architecture Components

┌─────────────────────────────────────────────────────────────────────────┐
│                          EXTERNAL SERVICES                               │
├──────────────────┬──────────────────┬──────────────────┬────────────────┤
│    Shopify       │   SimplyPrint    │    SendCloud     │     SMTP       │
│  (E-commerce)    │   (Printing)     │   (Shipping)     │   (Email)      │
└────────┬─────────┴────────┬─────────┴────────┬─────────┴────────┬───────┘
         │                  │                  │                  │
         ▼                  ▼                  ▼                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           NestJS API                                     │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐│
│  │  Webhooks    │ │   Orders     │ │  Print Jobs  │ │   Fulfillment    ││
│  │  Handlers    │ │   Service    │ │   Service    │ │   Service        ││
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────────┘│
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐│
│  │ Retry Queue  │ │   Health     │ │   Sentry     │ │   Notifications  ││
│  │  Processor   │ │   Checks     │ │ Integration  │ │   (Email/Push)   ││
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────────┘│
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                          PostgreSQL Database                             │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────────┐ │
│  │  Orders  │ │PrintJobs │ │ Shipments│ │AuditLogs │ │ RetryQueue     │ │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘ └────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘

Critical Data Assets

| Asset | Description | Sensitivity | Recovery Priority |
|-------|-------------|-------------|-------------------|
| Orders | Customer orders from Shopify | High | Critical |
| Print Jobs | Job status and tracking | Medium | Critical |
| Product Mappings | Product to 3D file mappings (by Shopify product/variant ID, SKU optional) | Medium | High |
| Shipments | Tracking and label data | High | Critical |
| User Accounts | RBAC and authentication | High | Critical |
| Audit Logs | Compliance and debugging | Medium | Medium |
| Event Logs | Operational visibility | Low | Low |

3. Risk Assessment Matrix

Risk Scoring

  • Probability: 1 (Rare) to 5 (Frequent)
  • Impact: 1 (Minimal) to 5 (Catastrophic)
  • Risk Score: Probability × Impact
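As a worked example, the scoring rule is a single bounded multiplication; the helper below is illustrative, not part of the platform:

```typescript
// Risk score = probability (1-5) × impact (1-5), as defined above.
function riskScore(probability: number, impact: number): number {
  if (![probability, impact].every((v) => Number.isInteger(v) && v >= 1 && v <= 5)) {
    throw new RangeError('probability and impact must be integers from 1 to 5');
  }
  return probability * impact;
}

// R03 (application server crash): probability 3, impact 4 → score 12.
console.log(riskScore(3, 4)); // 12
```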

Identified Risks

| Risk ID | Risk Description | Probability | Impact | Score | Category |
|---------|------------------|-------------|--------|-------|----------|
| R01 | Database server crash | 2 | 5 | 10 | Infrastructure |
| R02 | Database corruption | 1 | 5 | 5 | Infrastructure |
| R03 | Application server crash | 3 | 4 | 12 | Infrastructure |
| R04 | Shopify API unavailable | 2 | 4 | 8 | External Service |
| R05 | SimplyPrint API unavailable | 3 | 4 | 12 | External Service |
| R06 | SendCloud API unavailable | 2 | 3 | 6 | External Service |
| R07 | Security breach / hack | 2 | 5 | 10 | Security |
| R08 | Credential compromise | 2 | 5 | 10 | Security |
| R09 | Major misconfiguration | 3 | 4 | 12 | Operational |
| R10 | Data loss (accidental deletion) | 2 | 5 | 10 | Operational |
| R11 | Network/DNS failure | 2 | 4 | 8 | Infrastructure |
| R12 | Certificate expiration | 2 | 3 | 6 | Operational |
| R13 | Disk space exhaustion | 3 | 3 | 9 | Infrastructure |
| R14 | Memory exhaustion / OOM | 3 | 3 | 9 | Infrastructure |
| R15 | Webhook flooding / DDoS | 2 | 4 | 8 | Security |

Risk Matrix Visualization

                   │  Rare   │    Unlikely     │  Possible   │ Likely │ Frequent │
                   │   (1)   │       (2)       │     (3)     │  (4)   │   (5)    │
┌──────────────────┼─────────┼─────────────────┼─────────────┼────────┼──────────┤
│ Catastrophic (5) │   R02   │ R01,R07,R08,R10 │             │        │          │
│ Major (4)        │         │   R04,R11,R15   │ R03,R05,R09 │        │          │
│ Moderate (3)     │         │     R06,R12     │   R13,R14   │        │          │
│ Minor (2)        │         │                 │             │        │          │
│ Minimal (1)      │         │                 │             │        │          │
└──────────────────┴─────────┴─────────────────┴─────────────┴────────┴──────────┘
  (rows: Impact, columns: Probability)

4. Disaster Scenarios and Mitigation

4.1 Database Crashes

Scenario: PostgreSQL Server Becomes Unavailable

Symptoms:

  • Health checks fail (/health/ready returns 503)
  • All API endpoints return 500 errors
  • Sentry floods with database connection errors

Immediate Response:

  1. Check the database container/server status
  2. Review database logs for the crash cause
  3. Attempt a restart if the container crashed
  4. Fail over to a replica if available

Mitigation Strategies:

| Strategy | Implementation | RTO Impact |
|----------|----------------|------------|
| Database Replication | PostgreSQL streaming replication with read replica | < 5 min failover |
| Managed Database | Azure Database for PostgreSQL / AWS RDS with automatic failover | < 2 min failover |
| Connection Pooling | PgBouncer for connection management | Reduces crash risk |
| Regular Backups | Automated daily backups with point-in-time recovery | 1-4 hours |

Current System Safeguards:

  • Connection pooling configured (DATABASE_POOL_SIZE)
  • Health checks detect issues quickly
  • Retry queue persists failed operations for later retry
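The retry queue's deferral behaviour can be sketched as exponential backoff with jitter. The schedule, field names, and limits below are illustrative assumptions, not the platform's actual parameters:

```typescript
// Exponential backoff with jitter for retry-queue items (illustrative).
interface RetryItem {
  attempts: number;     // how many times the operation has failed so far
  nextAttemptAt: Date;  // when the processor should try again
}

const BASE_DELAY_MS = 30_000;   // 30s after the first failure (assumed)
const MAX_DELAY_MS = 3_600_000; // cap at 1 hour (assumed)
const MAX_ATTEMPTS = 10;        // then hand off to dead-letter handling (assumed)

function scheduleRetry(item: RetryItem, now: Date = new Date()): RetryItem | null {
  const attempts = item.attempts + 1;
  if (attempts > MAX_ATTEMPTS) return null; // give up; surface for manual review

  // delay = base × 2^(attempts-1), capped, plus up to 10% random jitter
  const exponential = Math.min(BASE_DELAY_MS * 2 ** (attempts - 1), MAX_DELAY_MS);
  const jitter = Math.floor(Math.random() * exponential * 0.1);
  return { attempts, nextAttemptAt: new Date(now.getTime() + exponential + jitter) };
}
```

The jitter spreads retries out so that a recovering dependency is not hit by a synchronized thundering herd.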

Recommended Actions:

# PostgreSQL High Availability Setup
primary:
  - Streaming replication to standby
  - WAL archiving to object storage
  - Automated backups every 6 hours

standby:
  - Hot standby for read queries
  - Automatic promotion via Patroni/pgpool

backup_retention:
  daily: 7 days
  weekly: 4 weeks
  monthly: 12 months


4.2 Application Server Crashes

Scenario: NestJS API Container Crashes or Becomes Unresponsive

Symptoms:

  • Kubernetes/Docker health probes fail
  • No response on any endpoint
  • WebSocket connections drop

Immediate Response:

  1. Container orchestrator auto-restarts (if configured)
  2. Check container logs for the crash reason
  3. Verify resource limits (memory/CPU)
  4. Scale up if load-related

Mitigation Strategies:

| Strategy | Implementation | Benefit |
|----------|----------------|---------|
| Multiple Replicas | Kubernetes Deployment with 2+ replicas | Zero downtime |
| Health Probes | Liveness and readiness probes | Auto-recovery |
| Horizontal Pod Autoscaler | Scale based on CPU/memory | Handle load spikes |
| Graceful Shutdown | Handle SIGTERM properly | No dropped requests |
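For the graceful-shutdown row: NestJS applications typically enable this via `app.enableShutdownHooks()`. The plain-Node sketch below shows the underlying idea; the deadline value is an assumption:

```typescript
import * as http from 'node:http';

// On SIGTERM (sent by Kubernetes before killing the pod), stop accepting new
// connections, let in-flight requests finish, then exit. A hard deadline
// prevents a stuck request from blocking the rollout forever.
function setupGracefulShutdown(server: http.Server, deadlineMs = 10_000): void {
  process.on('SIGTERM', () => {
    server.close(() => process.exit(0));                  // fires once requests drain
    setTimeout(() => process.exit(1), deadlineMs).unref(); // hard deadline (assumed 10s)
  });
}

const server = http.createServer((_req, res) => res.end('ok'));
setupGracefulShutdown(server);
```

The deadline should be shorter than Kubernetes' `terminationGracePeriodSeconds` so the process exits on its own terms rather than being SIGKILLed.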

Kubernetes Deployment Example:

apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: api
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5


4.3 Security Breach / System Hack

Scenario: Unauthorized Access or Data Breach

Attack Vectors:

  1. Credential theft (API keys, database passwords)
  2. SQL injection (mitigated by Prisma ORM)
  3. Webhook spoofing (mitigated by HMAC verification)
  4. Session hijacking
  5. Dependency vulnerabilities

Immediate Response Checklist:

## Security Incident Response Checklist

### Phase 1: Contain (First 30 minutes)
- [ ] Isolate affected systems (network level)
- [ ] Revoke compromised credentials immediately
- [ ] Enable maintenance mode if needed
- [ ] Preserve logs and evidence (DO NOT delete)

### Phase 2: Assess (1-4 hours)
- [ ] Identify attack vector and timeline
- [ ] Determine scope of data exposure
- [ ] Check for persistence mechanisms (backdoors)
- [ ] Review audit logs for suspicious activity

### Phase 3: Remediate (4-24 hours)
- [ ] Rotate all secrets and API keys
- [ ] Patch vulnerabilities
- [ ] Reset user passwords if needed
- [ ] Review and update access controls

### Phase 4: Recover (24-72 hours)
- [ ] Restore from clean backup if needed
- [ ] Re-enable services gradually
- [ ] Enhanced monitoring period
- [ ] Notify affected parties (GDPR: 72 hours)

Preventive Measures:

| Measure | Current Status | Recommendation |
|---------|----------------|----------------|
| HMAC Webhook Verification | ✅ Implemented | Maintain |
| Rate Limiting | ✅ Implemented | Add IP-based limits |
| Audit Logging | ✅ Implemented | Add alerting |
| Secret Management | ⚠️ Environment variables | Use Azure Key Vault / HashiCorp Vault |
| Dependency Scanning | ❌ Not implemented | Add Dependabot / Snyk |
| WAF | ❌ Not implemented | Consider Azure WAF / Cloudflare |
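The HMAC verification listed as implemented works along these lines for Shopify webhooks (the base64 HMAC-SHA256 of the raw body in the `X-Shopify-Hmac-Sha256` header is Shopify's documented scheme; the function itself is a standalone sketch, not the platform's actual code):

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Verify a Shopify webhook: recompute the base64 HMAC-SHA256 of the raw
// request body with the app's shared secret and compare it in constant time
// against the X-Shopify-Hmac-Sha256 header to avoid timing attacks.
function verifyShopifyHmac(rawBody: Buffer, headerHmac: string, secret: string): boolean {
  const digest = createHmac('sha256', secret).update(rawBody).digest('base64');
  const a = Buffer.from(digest);
  const b = Buffer.from(headerHmac);
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Note the raw, unparsed body must be used; re-serializing parsed JSON will produce a different digest and reject legitimate deliveries.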

Credential Rotation Procedure:

#!/bin/bash
# Emergency credential rotation script
set -euo pipefail

# 1. Database password
echo "Rotating database credentials..."
# Update in secret manager, then rolling restart

# 2. Shopify credentials
echo "Rotating Shopify API credentials..."
# Generate new keys in Shopify Partner Dashboard
# Update SHOPIFY_API_KEY, SHOPIFY_API_SECRET

# 3. SimplyPrint API key
echo "Rotating SimplyPrint credentials..."
# Generate new API key in SimplyPrint dashboard
# Update SIMPLYPRINT_API_KEY

# 4. SendCloud credentials
echo "Rotating SendCloud credentials..."
# Generate new API keys in SendCloud dashboard
# Update SENDCLOUD_PUBLIC_KEY, SENDCLOUD_SECRET_KEY

# 5. Internal secrets
echo "Rotating internal secrets..."
# SESSION_SECRET, INTERNAL_API_KEY, SHOPIFY_TOKEN_ENCRYPTION_KEY

# 6. Rolling restart
echo "Performing rolling restart..."
kubectl rollout restart deployment/api


4.4 Major Misconfiguration

Scenario: Production Environment Misconfigured

Common Misconfigurations:

  1. Wrong database URL (connecting to staging/dev)
  2. Incorrect API keys for the wrong environment
  3. CORS misconfiguration blocking the frontend
  4. Invalid webhook URLs preventing order ingestion
  5. Wrong SMTP settings causing notification failures

Detection Mechanisms:

| Misconfiguration | Detection Method | Time to Detect |
|------------------|------------------|----------------|
| Database URL wrong | Health check fails | < 1 minute |
| API keys invalid | First API call fails, logged to Sentry | < 5 minutes |
| Webhook URL wrong | No orders coming in, monitoring alert | 5-30 minutes |
| SMTP misconfigured | Email delivery fails, retry queue grows | 30 minutes |

Prevention Strategies:

  1. Configuration Validation on Startup:

    // apps/api/src/config/config.validation.ts
    export function validateConfiguration(config: Record<string, unknown>) {
      const errors: string[] = [];
    
      // Validate database URL format (DATABASE_URL is unknown, so narrow first)
      if (typeof config.DATABASE_URL !== 'string' ||
          !config.DATABASE_URL.startsWith('postgresql://')) {
        errors.push('DATABASE_URL must be a valid PostgreSQL connection string');
      }
    
      // Validate required secrets are present
      const requiredSecrets = [
        'SESSION_SECRET',
        'SHOPIFY_API_KEY',
        'SIMPLYPRINT_API_KEY',
        'SENDCLOUD_PUBLIC_KEY',
      ];
    
      for (const secret of requiredSecrets) {
        if (!config[secret]) {
          errors.push(`Missing required secret: ${secret}`);
        }
      }
    
      if (errors.length > 0) {
        throw new Error(`Configuration validation failed:\n${errors.join('\n')}`);
      }
    }
    

  2. Environment-Specific Safeguards:

    // Prevent production database operations in non-production
    if (process.env.NODE_ENV !== 'production' && 
        process.env.DATABASE_URL?.includes('production')) {
      throw new Error('DANGER: Non-production environment connected to production database!');
    }
    

  3. Infrastructure as Code:

  4. Use Terraform/Pulumi for consistent deployments
  5. Environment-specific variable files
  6. Code review for infrastructure changes

4.5 Third-Party Service Unavailability

4.5.1 Shopify API Unavailable

Impact: Cannot receive new orders, cannot create fulfillments

Detection:

  • /health/dependencies shows Shopify unhealthy
  • Webhook deliveries fail (Shopify retries for 48 hours)
  • Manual order creation fails

Current Mitigations:

  • ✅ Shopify retries webhooks with exponential backoff
  • ✅ Processed webhook idempotency prevents duplicates on retry
  • ✅ Health indicator detects issues

Response Procedure:

  1. Check Shopify Status
  2. If Shopify is up, check our API connectivity
  3. Queue fulfillment operations for later retry
  4. Monitor Shopify's status for resolution
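The idempotency mitigation mentioned above amounts to claiming each delivery ID before processing, so a webhook redelivered during or after an outage is handled at most once. The in-memory store below is purely illustrative; the platform persists processed webhook IDs in PostgreSQL:

```typescript
// Webhook idempotency sketch (in-memory store; illustrative only).
class ProcessedWebhookStore {
  private seen = new Set<string>();

  /** Returns true if this delivery ID is new and was claimed; false if duplicate. */
  claim(webhookId: string): boolean {
    if (this.seen.has(webhookId)) return false;
    this.seen.add(webhookId);
    return true;
  }
}

async function handleWebhook(
  store: ProcessedWebhookStore,
  webhookId: string,
  process: () => Promise<void>,
): Promise<'processed' | 'duplicate'> {
  if (!store.claim(webhookId)) return 'duplicate'; // retry of an already-handled delivery
  await process();
  return 'processed';
}
```

In a multi-replica deployment the claim must be a unique-constraint insert in the shared database, not an in-process set, so that two replicas cannot both claim the same delivery.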

Recommendations:

// Shopify-specific circuit breaker (sketch): open after repeated failures,
// then allow a trial call after a cool-down period (half-open state)
import { Injectable, ServiceUnavailableException } from '@nestjs/common';

@Injectable()
export class ShopifyCircuitBreaker {
  private static readonly FAILURE_THRESHOLD = 5;
  private static readonly RESET_TIMEOUT_MS = 60_000; // cool-down before retrying

  private failures = 0;
  private isOpen = false;
  private lastFailure?: Date;

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.isOpen && this.shouldAttemptReset()) {
      this.isOpen = false; // half-open: let one trial call through
    }

    if (this.isOpen) {
      throw new ServiceUnavailableException('Shopify circuit breaker is open');
    }

    try {
      const result = await operation();
      this.failures = 0; // success closes the circuit fully
      return result;
    } catch (error) {
      this.failures++;
      this.lastFailure = new Date();
      if (this.failures >= ShopifyCircuitBreaker.FAILURE_THRESHOLD) {
        this.isOpen = true;
      }
      throw error;
    }
  }

  private shouldAttemptReset(): boolean {
    return (
      this.lastFailure !== undefined &&
      Date.now() - this.lastFailure.getTime() > ShopifyCircuitBreaker.RESET_TIMEOUT_MS
    );
  }
}

4.5.2 SimplyPrint API Unavailable

Impact: Cannot create print jobs, cannot get status updates

Detection:

  • /health/dependencies shows SimplyPrint unhealthy
  • Polling service logs connection errors
  • Print job creation fails

Current Mitigations:

  • ✅ Retry queue for print job creation
  • ✅ Polling handles temporary failures gracefully
  • ✅ Health indicator detects issues

Response Procedure:

  1. Check SimplyPrint status/support channels
  2. Print jobs queue in PENDING state
  3. Polling will resume automatically when the service returns
  4. Manual reconciliation may be needed after an extended outage

Degraded Mode Operation:

  • Orders continue to be received and stored
  • Print jobs are queued but not submitted
  • Dashboard shows a "SimplyPrint Unavailable" warning
  • Operators can manually track printing progress

4.5.3 SendCloud API Unavailable

Impact: Cannot create shipping labels

Detection:

  • /health/dependencies shows SendCloud unhealthy
  • Shipment creation fails with errors
  • Label generation queued in retry

Current Mitigations:

  • ✅ Retry queue for shipment creation
  • ✅ Health indicator detects issues

Response Procedure:

  1. Check SendCloud Status
  2. Shipments queue in PENDING state
  3. Once restored, the retry queue processes pending shipments
  4. Operators may need to manually create labels for urgent orders


5. Backup and Recovery Strategies

5.1 Backup Strategy

Database Backups

| Backup Type | Frequency | Retention | Storage |
|-------------|-----------|-----------|---------|
| Full Backup | Daily (3 AM) | 30 days | Azure Blob / S3 |
| Incremental (WAL) | Continuous | 7 days | Azure Blob / S3 |
| Point-in-Time | Continuous | 7 days | Managed service |
| Monthly Archive | Monthly | 1 year | Cold storage |

Backup Configuration (Azure Database for PostgreSQL):

{
  "backup": {
    "geoRedundantBackup": "Enabled",
    "backupRetentionDays": 35,
    "earliestRestoreDate": "2026-01-01T00:00:00Z"
  },
  "storage": {
    "storageSizeGB": 100,
    "autoGrow": "Enabled"
  }
}

Manual Backup Script:

#!/bin/bash
# backup-database.sh
set -euo pipefail

DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="forma3d_backup_${DATE}.sql.gz"

# Create backup (pipefail ensures a pg_dump failure aborts the script)
pg_dump "$DATABASE_URL" | gzip > "/tmp/$BACKUP_FILE"

# Upload to cloud storage
az storage blob upload \
  --container-name backups \
  --file "/tmp/$BACKUP_FILE" \
  --name "database/$BACKUP_FILE"

# Cleanup local file
rm "/tmp/$BACKUP_FILE"

# Verify backup
az storage blob exists \
  --container-name backups \
  --name "database/$BACKUP_FILE"

5.2 Recovery Procedures

Database Recovery

Scenario: Point-in-Time Recovery

# 1. Identify target recovery time
TARGET_TIME="2026-02-05T14:30:00Z"

# 2. Create new database from backup (Azure)
az postgres server restore \
  --resource-group forma3d-rg \
  --name forma3d-db-recovered \
  --source-server forma3d-db \
  --restore-point-in-time $TARGET_TIME

# 3. Verify data integrity
psql $RECOVERED_DATABASE_URL -c "SELECT COUNT(*) FROM orders;"

# 4. Update connection string and restart
kubectl set env deployment/api DATABASE_URL=$RECOVERED_DATABASE_URL

# 5. Run integrity checks
npm run db:integrity-check

Scenario: Full Database Restore from Backup

# 1. Download latest backup
az storage blob download \
  --container-name backups \
  --name "database/forma3d_backup_latest.sql.gz" \
  --file /tmp/restore.sql.gz

# 2. Create new database
createdb -h $DB_HOST forma3d_restored

# 3. Restore backup
gunzip -c /tmp/restore.sql.gz | psql -h $DB_HOST forma3d_restored

# 4. Run migrations to ensure schema is current
DATABASE_URL="..." npx prisma migrate deploy

# 5. Switch over
# Update Kubernetes secrets and restart

5.3 Recovery Time and Point Objectives

| Scenario | RTO Target | RPO Target | Current Capability |
|----------|------------|------------|--------------------|
| Database crash (with replica) | 5 minutes | 0 (sync replication) | ❌ Need to implement |
| Database crash (backup only) | 2 hours | 1 hour | ✅ Achievable |
| Application crash | 2 minutes | 0 | ✅ With K8s auto-restart |
| Full disaster (region failure) | 4 hours | 1 hour | ❌ Need geo-redundancy |
| Security breach | 24 hours | Varies | ✅ From clean backup |

6. Incident Response Procedures

6.1 Incident Severity Levels

| Severity | Description | Examples | Response Time |
|----------|-------------|----------|---------------|
| SEV-1 | Critical - Complete service outage | Database down, all APIs failing | Immediate (< 15 min) |
| SEV-2 | Major - Significant functionality impaired | Cannot process new orders, fulfillment broken | < 30 min |
| SEV-3 | Moderate - Partial functionality affected | One integration down, degraded performance | < 2 hours |
| SEV-4 | Minor - Low impact issues | Slow dashboard, cosmetic issues | < 24 hours |

6.2 Incident Response Workflow

graph TD
    A[Incident Detected] --> B{Severity Assessment}
    B -->|SEV-1/2| C[Page On-Call]
    B -->|SEV-3/4| D[Create Ticket]
    C --> E[Acknowledge]
    E --> F[Investigate]
    F --> G{Root Cause Found?}
    G -->|No| H[Escalate]
    G -->|Yes| I[Implement Fix]
    I --> J[Verify Resolution]
    J --> K[Update Status Page]
    K --> L[Create Postmortem]
    D --> F
    H --> F

6.3 On-Call Procedures

On-Call Rotation:

  • Primary: Responds to all pages
  • Secondary: Backup if primary is unavailable
  • Escalation: Engineering lead for SEV-1

Escalation Path:

Primary On-Call (5 min)
    ↓
Secondary On-Call (10 min)
    ↓
Engineering Lead (15 min)
    ↓
CTO (30 min for SEV-1)

On-Call Checklist:

## Incident Response Checklist

### Initial Response (0-5 min)
- [ ] Acknowledge alert
- [ ] Open incident channel (#incident-YYYYMMDD-HHMM)
- [ ] Initial severity assessment
- [ ] Update status page to "Investigating"

### Investigation (5-30 min)
- [ ] Check health endpoints: /health, /health/dependencies
- [ ] Review Sentry for recent errors
- [ ] Check database connectivity
- [ ] Review recent deployments
- [ ] Check external service status pages

### Mitigation (varies)
- [ ] Document attempted fixes
- [ ] Rollback if deployment-related
- [ ] Fail over to backup systems if needed
- [ ] Update status page with progress

### Resolution
- [ ] Confirm service restored
- [ ] Update status page to "Resolved"
- [ ] Document timeline and actions
- [ ] Schedule postmortem (SEV-1/2)


7. Postmortem Process

7.1 When to Write a Postmortem

  • All SEV-1 and SEV-2 incidents
  • SEV-3 incidents with customer impact
  • Near-misses that could have been worse
  • Security incidents regardless of severity
  • Data loss events

7.2 Postmortem Template

# Postmortem: [Incident Title]

**Date**: YYYY-MM-DD
**Duration**: X hours Y minutes
**Severity**: SEV-X
**Author**: [Name]
**Status**: [Draft | Final]

## Summary

[2-3 sentence summary of what happened and impact]

## Impact

- **Affected Services**: [List services]
- **Customer Impact**: [Description]
- **Orders Affected**: [Number]
- **Duration of Impact**: [Time]

## Timeline (All times in UTC)

| Time | Event |
|------|-------|
| HH:MM | First alert triggered |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service fully restored |

## Root Cause

[Detailed explanation of what caused the incident]

## Detection

**How was the incident detected?**
- [ ] Automated alerting
- [ ] Customer report
- [ ] Internal observation
- [ ] External service notification

**Detection gap**: [If applicable, why wasn't this detected sooner?]

## Resolution

[Step-by-step description of how the incident was resolved]

## Lessons Learned

### What Went Well
- [Positive item 1]
- [Positive item 2]

### What Went Wrong
- [Issue 1]
- [Issue 2]

### Where We Got Lucky
- [Lucky circumstance that limited impact]

## Action Items

| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| [Action 1] | [Owner] | P1 | YYYY-MM-DD | [ ] Open |
| [Action 2] | [Owner] | P2 | YYYY-MM-DD | [ ] Open |

## Appendix

### Supporting Data
- [Links to dashboards, logs, etc.]

### Related Incidents
- [Links to previous related incidents]

7.3 Blameless Postmortem Culture

Principles:

  1. Focus on systems, not individuals
  2. Assume everyone acted with the best intentions
  3. Identify process gaps, not scapegoats
  4. Share learnings openly
  5. Follow through on action items

Anti-Patterns to Avoid:

  • "Human error" as a root cause (ask why the error was possible)
  • A single point of blame
  • Superficial action items
  • Skipping postmortems for "obvious" issues


8. Status Page Implementation

8.1 Why a Status Page?

  • Transparency: Customers see real-time system status
  • Reduced Support Load: Fewer "is it down?" inquiries
  • Trust Building: Proactive communication during incidents
  • Accountability: Public track record of reliability

8.2 Recommended Solution

Based on the project requirements and the preference for open-source tooling, Statping-ng is recommended.

Why Statping-ng:

  • Self-hosted (data sovereignty)
  • Lightweight (~20MB Docker image)
  • Built-in monitoring and alerts
  • Attractive, customizable UI
  • Prometheus exporter included
  • Multiple notification channels

Alternative Options:

| Solution | Type | Cost | Pros | Cons |
|----------|------|------|------|------|
| Statping-ng | Self-hosted | Free | Full control, lightweight | Self-maintenance |
| Cachet | Self-hosted | Free | Simple, Laravel-based | Less active development |
| Upptime | GitHub-based | Free | Zero infrastructure | Limited features |
| Better Stack | SaaS | $20+/mo | Managed, incident management | Vendor lock-in |
| Atlassian Statuspage | SaaS | $29+/mo | Industry standard | Cost, complexity |

8.3 Statping-ng Implementation

Docker Compose Configuration:

# docker-compose.statuspage.yml
version: "3.8"

services:
  statping:
    image: adamboutcher/statping-ng:latest
    container_name: statping
    restart: always
    ports:
      - "8080:8080"
    volumes:
      - statping_data:/app
    environment:
      - DB_CONN=postgres
      - DB_HOST=postgres
      - DB_PORT=5432
      - DB_DATABASE=statping
      - DB_USER=statping
      - DB_PASS=${STATPING_DB_PASS}
      - NAME=Forma 3D Connect Status
      - DESCRIPTION=Real-time status of Forma 3D Connect services
    depends_on:
      - postgres

  postgres:
    image: postgres:15-alpine
    container_name: statping-db
    restart: always
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      - POSTGRES_DB=statping
      - POSTGRES_USER=statping
      - POSTGRES_PASSWORD=${STATPING_DB_PASS}

volumes:
  statping_data:
  postgres_data:

Services to Monitor:

| Service | Check Type | Interval | Timeout |
|---------|------------|----------|---------|
| API Health | HTTP GET /health | 30s | 10s |
| API Ready | HTTP GET /health/ready | 30s | 10s |
| Shopify Integration | HTTP GET /health/dependencies | 60s | 15s |
| SimplyPrint Integration | HTTP GET /health/dependencies | 60s | 15s |
| SendCloud Integration | HTTP GET /health/dependencies | 60s | 15s |
| Web Dashboard | HTTP GET / | 60s | 10s |
| Database | TCP postgres:5432 | 30s | 5s |

Prometheus Integration:

# prometheus.yml
scrape_configs:
  - job_name: 'statping'
    bearer_token: '${STATPING_API_SECRET}'
    static_configs:
      - targets: ['statping:8080']

8.4 Status Page Content

Recommended Components:

  1. Core Platform
     • API Services
     • Web Dashboard
     • Database

  2. Integrations
     • Shopify Connection
     • SimplyPrint Connection
     • SendCloud Connection
     • Email Notifications

  3. Background Services
     • Order Processing
     • Print Job Sync
     • Retry Queue

Incident Communication Templates:

## Investigating: [Service] Performance Degradation
We are currently investigating reports of slow response times on [service].
Updates will be provided every 30 minutes.

## Identified: [Service] Outage
We have identified the cause of the [service] outage. Our team is working on a fix.
Estimated time to resolution: [X hours]

## Monitoring: [Service] Restored
[Service] has been restored. We are monitoring for stability.
A postmortem will be published within 48 hours.

## Resolved: [Service] Incident
The incident affecting [service] has been fully resolved.
[Link to postmortem when available]

9. Alerting and Notification Strategy

9.1 Current Alerting Capabilities

| Channel | Implementation | Coverage |
|---------|----------------|----------|
| Sentry | ✅ Implemented | Application errors, performance |
| Email | ✅ Implemented | Failed operations (via retry queue) |
| Push Notifications | ✅ Implemented | Order/job status (user-facing) |
| Slack/Teams | ❌ Not implemented | Operational alerts |
| PagerDuty/Opsgenie | ❌ Not implemented | On-call escalation |

9.2 Alert Categories and Rules

Alert Categories:

| Category | Priority | Channel | Examples |
|----------|----------|---------|----------|
| Page | P1 | PagerDuty + Slack | Database down, API 5xx spike |
| Urgent | P2 | Slack + Email | Integration failures, queue backup |
| Warning | P3 | Slack | Elevated error rates, slow queries |
| Info | P4 | Slack (low-priority) | Deployment success, stats |

Alert Rules:

# alerting-rules.yml
groups:
  - name: forma3d-critical
    rules:
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL database is down"
          description: "Database has been unreachable for more than 1 minute"

      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP error rate detected"
          description: "Error rate is above 5% for 5 minutes"

      - alert: RetryQueueBacklog
        expr: retry_queue_pending_count > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Retry queue backlog growing"
          description: "More than 100 items pending in retry queue"

      - alert: IntegrationUnhealthy
        expr: health_dependency_status{service=~"shopify|simplyprint|sendcloud"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "External integration unhealthy"
          description: "{{ $labels.service }} has been unhealthy for 5 minutes"

9.3 Notification Channels Setup

Slack Integration:

// libs/observability/src/alerting/slack-notifier.ts
import { Inject, Injectable } from '@nestjs/common';
import { WebClient } from '@slack/web-api';

@Injectable()
export class SlackNotifier {
  private client: WebClient;

  constructor(@Inject(CONFIG) private config: Config) {
    this.client = new WebClient(config.SLACK_BOT_TOKEN);
  }

  async sendAlert(alert: Alert): Promise<void> {
    const channel = this.getChannel(alert.severity);

    await this.client.chat.postMessage({
      channel,
      blocks: [
        {
          type: 'header',
          text: {
            type: 'plain_text',
            text: `🚨 ${alert.title}`,
          },
        },
        {
          type: 'section',
          fields: [
            { type: 'mrkdwn', text: `*Severity:*\n${alert.severity}` },
            { type: 'mrkdwn', text: `*Service:*\n${alert.service}` },
          ],
        },
        {
          type: 'section',
          text: { type: 'mrkdwn', text: alert.description },
        },
        {
          type: 'actions',
          elements: [
            {
              type: 'button',
              text: { type: 'plain_text', text: 'View in Sentry' },
              url: alert.sentryUrl,
            },
            {
              type: 'button',
              text: { type: 'plain_text', text: 'Runbook' },
              url: `https://docs.forma3d.com/runbooks/${alert.type}`,
            },
          ],
        },
      ],
    });
  }

  private getChannel(severity: string): string {
    switch (severity) {
      case 'critical': return '#incidents';
      case 'warning': return '#alerts';
      default: return '#monitoring';
    }
  }
}


10. Service Level Agreements (SLAs)

10.1 Defining SLAs for Multi-Tenant System

When offering the platform to tenants, clear SLAs establish expectations and accountability.

Recommended SLA Tiers:

| Tier | Target Uptime | Support Response | Price Point |
|------|---------------|------------------|-------------|
| Basic | 99.0% (7.3 h/month downtime) | 24 hours | Entry level |
| Professional | 99.5% (3.65 h/month downtime) | 4 hours | Mid-tier |
| Enterprise | 99.9% (43.8 min/month downtime) | 1 hour | Premium |

10.2 SLA Components

Uptime Calculation

Uptime % = ((Total Minutes - Downtime Minutes) / Total Minutes) × 100

Exclusions:
- Scheduled maintenance (with 48h notice)
- Third-party service outages (Shopify, SimplyPrint, SendCloud)
- Force majeure events
- Customer-caused issues
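The formula translates directly into code. The helper names and the 43,830-minute average month (365.25 days / 12) are illustrative assumptions:

```typescript
// Uptime % = ((total - downtime) / total) × 100, per the formula above.
function uptimePercent(totalMinutes: number, downtimeMinutes: number): number {
  return ((totalMinutes - downtimeMinutes) / totalMinutes) * 100;
}

// Allowed downtime for a target, over an average month of 43,830 minutes.
function allowedDowntimeMinutes(targetPercent: number, monthMinutes = 43_830): number {
  return monthMinutes * (1 - targetPercent / 100);
}

console.log(allowedDowntimeMinutes(99.9).toFixed(1)); // "43.8"
```

The 43.8-minute result matches the Enterprise tier figure in the table above.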

Service Level Indicators (SLIs)

| SLI | Measurement | Target |
|-----|-------------|--------|
| Availability | Successful health checks / Total checks | 99.9% |
| Latency (API) | p95 response time | < 500ms |
| Latency (Dashboard) | p95 page load | < 3s |
| Order Processing | Time from webhook to processing start | < 30s |
| Error Rate | 5xx responses / Total responses | < 0.1% |
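For the latency SLIs, p95 means the value below which 95% of samples fall. A nearest-rank sketch over a raw sample window is shown below; in production these figures typically come from Prometheus histograms rather than raw samples:

```typescript
// Nearest-rank percentile over a window of latency samples (sketch).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based nearest rank
  return sorted[Math.max(0, rank - 1)];
}

// Five response times in ms; the p95 is the slowest of this small window.
console.log(percentile([120, 95, 250, 180, 110], 95)); // 250
```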

Service Level Objectives (SLOs)

# SLO Configuration
slos:
  api_availability:
    target: 99.9%
    window: 30d
    indicator: probe_success{job="api-health"}

  api_latency:
    target: 99.0%
    window: 30d
    indicator: histogram_quantile(0.95, http_request_duration_seconds_bucket) < 0.5

  order_processing:
    target: 99.5%
    window: 30d
    indicator: order_processing_time_seconds < 30

10.3 Error Budget

Concept: The allowed amount of unreliability within SLA targets.

99.9% uptime = 0.1% error budget = 43.8 minutes/month

Error Budget Remaining = Target Downtime - Actual Downtime

If error budget exhausted:
- Freeze non-critical deployments
- Focus on reliability improvements
- Increase testing requirements

Error Budget Policy:

| Budget Remaining | Actions |
|------------------|---------|
| > 50% | Normal operations |
| 25-50% | Review deployment frequency |
| 10-25% | Pause feature releases, focus on reliability |
| < 10% | Emergency mode, critical fixes only |
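The policy table maps onto a simple guard function. The thresholds mirror the table; the action names are illustrative:

```typescript
type BudgetAction =
  | 'normal'                   // > 50% remaining
  | 'review-deploy-frequency'  // 25-50%
  | 'pause-features'           // 10-25%
  | 'critical-fixes-only';     // < 10%

// Map remaining error budget (fraction of the monthly allowance, 0-1)
// onto the policy table above.
function errorBudgetAction(remainingFraction: number): BudgetAction {
  if (remainingFraction > 0.5) return 'normal';
  if (remainingFraction > 0.25) return 'review-deploy-frequency';
  if (remainingFraction > 0.1) return 'pause-features';
  return 'critical-fixes-only';
}

console.log(errorBudgetAction(0.18)); // 'pause-features'
```

Wiring this into the deployment pipeline makes the policy enforceable rather than advisory: a release job can query the remaining budget and refuse to proceed outside 'normal'.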

10.4 SLA Communication Template

# Forma 3D Connect Service Level Agreement

## Service Commitment

Forma 3D Connect commits to providing [TIER] level service with the following guarantees:

### Availability
- **Target**: [99.X]% monthly uptime
- **Measurement**: Based on successful responses to health check endpoints
- **Exclusions**: Scheduled maintenance, third-party outages

### Support Response
- **Critical Issues**: Response within [X] hours
- **Major Issues**: Response within [X] hours
- **Minor Issues**: Response within [X] business days

### Remedies

If monthly uptime falls below the target:

| Uptime | Service Credit |
|--------|---------------|
| < 99.X% but ≥ 99.Y% | 10% of monthly fee |
| < 99.Y% but ≥ 99.Z% | 25% of monthly fee |
| < 99.Z% | 50% of monthly fee |

Credits must be requested within 30 days of the incident.

### Exclusions

This SLA does not apply to:
- Scheduled maintenance announced 48+ hours in advance
- Third-party service disruptions (Shopify, SimplyPrint, SendCloud)
- Customer misuse or misconfiguration
- Features labeled as "Beta" or "Preview"
- Free tier accounts

11. Recommendations and Next Steps

11.1 Priority Matrix

| Priority | Action | Effort | Impact | Timeline |
|----------|--------|--------|--------|----------|
| P0 | Implement automated database backups | Medium | Critical | Week 1-2 |
| P0 | Document incident response procedures | Low | Critical | Week 1 |
| P1 | Deploy Statping-ng status page | Medium | High | Week 2-3 |
| P1 | Set up Slack alerting integration | Medium | High | Week 2 |
| P1 | Configure Prometheus + Grafana monitoring | High | High | Week 3-4 |
| P2 | Implement database replication | High | Critical | Month 2 |
| P2 | Add circuit breakers to integrations | Medium | Medium | Month 2 |
| P2 | Create runbooks for common issues | Medium | Medium | Month 2 |
| P3 | Implement PagerDuty on-call rotation | Medium | Medium | Month 3 |
| P3 | Conduct disaster recovery drill | Medium | High | Month 3 |
| P3 | Define and publish tenant SLAs | Low | Medium | Month 3 |

11.2 Implementation Roadmap

gantt
    title Disaster Recovery Implementation
    dateFormat  YYYY-MM-DD
    section Phase 1: Foundation
    Document incident procedures     :a1, 2026-02-10, 5d
    Implement database backups       :a2, 2026-02-10, 10d
    section Phase 2: Visibility
    Deploy status page               :b1, 2026-02-17, 7d
    Set up Slack alerts              :b2, 2026-02-17, 5d
    Configure monitoring stack       :b3, 2026-02-20, 10d
    section Phase 3: Resilience
    Database replication             :c1, 2026-03-01, 14d
    Circuit breakers                 :c2, 2026-03-01, 7d
    Create runbooks                  :c3, 2026-03-08, 10d
    section Phase 4: Operations
    PagerDuty setup                  :d1, 2026-03-15, 7d
    DR drill                         :d2, 2026-03-22, 3d
    Publish SLAs                     :d3, 2026-03-25, 5d

11.3 Success Metrics

| Metric | Current | Target | Measurement |
|--------|---------|--------|-------------|
| MTTD (Mean Time to Detect) | Unknown | < 5 min | Incident start to detection |
| MTTR (Mean Time to Recover) | Unknown | < 30 min | Detection to resolution |
| Incident frequency | Unknown | < 2/month | SEV-1/2 incidents |
| Postmortem completion | 0% | 100% | SEV-1/2 within 5 days |
| Backup test frequency | Never | Monthly | Restore verification |
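
MTTD and MTTR can be derived directly from incident timestamps. A minimal sketch; the record fields (`started`, `detected`, `resolved`) are assumed names, not the platform's actual schema:

```python
from datetime import datetime
from statistics import mean

def mttd_mttr_minutes(incidents):
    """Compute mean time to detect and mean time to recover, in minutes.

    Each incident is a dict with ISO-8601 'started', 'detected', and
    'resolved' timestamps (hypothetical field names).
    """
    parse = datetime.fromisoformat
    detect = [(parse(i["detected"]) - parse(i["started"])).total_seconds() / 60
              for i in incidents]
    recover = [(parse(i["resolved"]) - parse(i["detected"])).total_seconds() / 60
               for i in incidents]
    return mean(detect), mean(recover)

incidents = [
    {"started": "2026-02-01T10:00:00", "detected": "2026-02-01T10:04:00",
     "resolved": "2026-02-01T10:25:00"},
    {"started": "2026-02-12T02:00:00", "detected": "2026-02-12T02:06:00",
     "resolved": "2026-02-12T02:41:00"},
]
mttd, mttr = mttd_mttr_minutes(incidents)
print(f"MTTD {mttd:.0f} min, MTTR {mttr:.0f} min")  # MTTD 5 min, MTTR 28 min
```

In practice these timestamps would come from the incident tracker or the AuditLog table mentioned in Appendix B.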

11.4 Cost Estimates

| Component | Option A (Budget) | Option B (Recommended) | Option C (Enterprise) |
|-----------|-------------------|------------------------|-----------------------|
| Status Page | Statping-ng (Free) | Statping-ng (Free) | Atlassian Statuspage ($79/mo) |
| Monitoring | Prometheus (Free) | Grafana Cloud ($50/mo) | Datadog ($100+/mo) |
| Alerting | Slack (Free) | PagerDuty ($25/user/mo) | PagerDuty + OpsGenie |
| Database HA | Manual failover | Managed DB ($100/mo) | Multi-region ($300+/mo) |
| **Total** | ~$0/mo | ~$175-250/mo | ~$500+/mo |

Appendix A: Quick Reference Cards

A.1 Emergency Contacts

## Emergency Contacts

| Role | Name | Phone | Email |
|------|------|-------|-------|
| Primary On-Call | [TBD] | [TBD] | [TBD] |
| Secondary On-Call | [TBD] | [TBD] | [TBD] |
| Engineering Lead | [TBD] | [TBD] | [TBD] |

## External Services

| Service | Status Page | Support |
|---------|-------------|---------|
| Shopify | shopifystatus.com | partners@shopify.com |
| SimplyPrint | [TBD] | support@simplyprint.io |
| SendCloud | status.sendcloud.sc | support@sendcloud.sc |

A.2 Critical Commands

# Health check
curl https://api.forma3d.com/health | jq

# Database connection test
psql $DATABASE_URL -c "SELECT 1"

# View recent errors in Sentry
# (via Sentry dashboard)

# Check retry queue
curl -H "Authorization: Bearer $API_KEY" \
  https://api.forma3d.com/retry-queue/stats

# Force retry queue processing
curl -X POST -H "Authorization: Bearer $API_KEY" \
  https://api.forma3d.com/retry-queue/process

# Kubernetes pod restart
kubectl rollout restart deployment/api -n forma3d

# View recent logs
kubectl logs -l app=api -n forma3d --tail=100 -f

# Database backup
pg_dump $DATABASE_URL | gzip > backup_$(date +%Y%m%d).sql.gz

A.3 Runbook Index

| Issue | Runbook Location |
|-------|------------------|
| Database unreachable | docs/runbooks/database-connection.md |
| Shopify webhooks failing | docs/runbooks/shopify-webhooks.md |
| SimplyPrint sync issues | docs/runbooks/simplyprint-sync.md |
| High error rate | docs/runbooks/error-rate-spike.md |
| Memory exhaustion | docs/runbooks/memory-oom.md |
| Disk space low | docs/runbooks/disk-space.md |
| Certificate expiration | docs/runbooks/ssl-renewal.md |

Appendix B: Compliance Considerations

GDPR Requirements

  • Breach Notification: 72 hours to supervisory authority
  • Data Subject Rights: Must be maintained during incidents
  • Data Recovery: Backups must be restorable and tested

Audit Trail

  • All incidents logged in AuditLog table
  • Postmortems stored for minimum 3 years
  • Access logs retained per data retention policy

Document Version: 1.0
Last Updated: February 2026
Next Review: August 2026