External Service Resilience Research

Status: Research Document
Created: February 2026
Scope: Forma 3D Connect — Shopify, SimplyPrint, SendCloud outage resilience
Trigger: SendCloud planned maintenance notification

Table of Contents

  1. Executive Summary
  2. External Service Dependency Map
  3. Current Resilience Mechanisms
  4. Failure Scenario Analysis
  5. Identified Gaps
  6. Recommendations
  7. Implementation Priority

1. Executive Summary

Forma 3D Connect depends on three external services for its core order-to-delivery pipeline: Shopify (e-commerce), SimplyPrint (3D print management), and SendCloud (shipping/labels). When any of these services go down — whether through planned maintenance, unplanned outages, or network glitches — our system must continue to operate gracefully and recover automatically once the service returns.

Key Findings

| Area | Current State | Risk Level |
| --- | --- | --- |
| Shopify order ingestion | Well protected — webhook idempotency + backfill polling every 5 min | Low |
| Shopify fulfillment creation | Protected — retry queue with exponential backoff (5 attempts) | Medium |
| SimplyPrint print job creation | Vulnerable — no retry, immediate FAILED status | High |
| SimplyPrint status tracking | Well protected — webhooks + polling every 30 s + reconciliation every 1 min | Low |
| SendCloud shipment creation | Broken — retry jobs enqueued but never processed (missing handler) | Critical |
| SendCloud status tracking | Well protected — webhooks + reconciliation every 5 min | Low |
| Circuit breaker (all services) | Not implemented — repeated calls to a down service waste resources and risk cascading failure | High |

Bottom Line

The system will mostly recover from short outages (< 5 minutes) thanks to webhook retry behavior from Shopify and the reconciliation services for SimplyPrint and SendCloud. However, longer outages expose real gaps: SimplyPrint print job creation has no automatic retry, and SendCloud shipment retries are enqueued but silently dropped because the retry queue processor has no SHIPMENT handler. There is no circuit breaker to prevent cascading failures during prolonged downtime.


2. External Service Dependency Map

Data Flow

Shopify                  SimplyPrint                SendCloud
   │                         │                          │
   │  webhooks +             │  webhooks +              │  webhooks +
   │  polling backfill       │  polling + reconciliation│  reconciliation
   │                         │                          │
   ▼                         ▼                          ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        Forma 3D Connect                             │
│                                                                     │
│  Order Service          Print Service           Shipping Service    │
│  ┌──────────────┐       ┌──────────────┐       ┌──────────────┐    │
│  │ ShopifyAPI   │       │SimplyPrintAPI│       │ SendcloudAPI │    │
│  │ Client       │       │ Client       │       │ Client       │    │
│  └──────┬───────┘       └──────┬───────┘       └──────┬───────┘    │
│         │                      │                      │            │
│  ┌──────▼───────┐       ┌──────▼───────┐       ┌──────▼───────┐    │
│  │ Retry Queue  │       │ (no retry)   │       │ Retry Queue  │    │
│  │ (DB-backed)  │       │              │       │ (DB-backed)  │    │
│  └──────────────┘       └──────────────┘       └──────────────┘    │
│                                                                     │
│  ┌──────────────┐       ┌──────────────┐       ┌──────────────┐    │
│  │ Backfill     │       │Reconciliation│       │Reconciliation│    │
│  │ (every 5min) │       │ (every 1min) │       │ (every 5min) │    │
│  └──────────────┘       └──────────────┘       └──────────────┘    │
└─────────────────────────────────────────────────────────────────────┘

Interaction Types per Service

| Service | Inbound (they → us) | Outbound (us → them) |
| --- | --- | --- |
| Shopify | Webhooks (orders/create, orders/updated, orders/cancelled, orders/fulfilled) | REST + GraphQL API (fulfillments, products, draft orders, variants) |
| SimplyPrint | Webhooks (job.started, job.done, job.failed, job.cancelled, etc.) | REST API (create job, add to queue, file management, get status) |
| SendCloud | Webhooks (parcel_status_changed) | REST API (create parcel, get label, cancel parcel, get tracking) |

What Each Service Controls

| Service | If down, we cannot... |
| --- | --- |
| Shopify | Receive new orders (webhooks), create fulfillments (mark orders as shipped), manage products/variants |
| SimplyPrint | Create print jobs, check print status, manage printer queue |
| SendCloud | Create shipping labels, get tracking info, cancel parcels |

3. Current Resilience Mechanisms

3.1 Shopify

Inbound: Order Ingestion

| Mechanism | How it works | Coverage |
| --- | --- | --- |
| Webhook retry (Shopify-side) | Shopify retries failed webhooks for up to 48 hours with exponential backoff | Covers our downtime up to 48 h |
| Webhook idempotency | ProcessedWebhook table prevents duplicate processing if the same webhook arrives twice | Prevents duplicates |
| Backfill polling | ShopifyBackfillService polls the Shopify Orders API every 5 minutes using a durable since_id watermark | Catches anything webhooks missed |
| Test order filtering | Test/bogus orders are detected and skipped | Prevents noise |

Assessment: Order ingestion is well protected. Even during extended downtime, Shopify's own webhook retry (48h) combined with our backfill polling creates a robust safety net. The watermark-based backfill ensures no orders are lost even if both mechanisms overlap.
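
As a rough sketch of the watermark pattern (the interface and function names below are illustrative, not the actual ShopifyBackfillService code):

// Hypothetical sketch of since_id watermark polling — names are illustrative.
interface ShopifyOrdersApi {
  listOrders(params: { since_id: string; limit: number }): Promise<Array<{ id: string }>>;
}
interface WatermarkStore {
  get(key: string): Promise<string>;
  set(key: string, value: string): Promise<void>;
}

async function backfillOrders(
  shopify: ShopifyOrdersApi,
  watermarks: WatermarkStore,
  upsertOrder: (order: { id: string }) => Promise<void>, // idempotent upsert
): Promise<void> {
  const sinceId = await watermarks.get('shopify-orders');           // durable watermark
  const orders = await shopify.listOrders({ since_id: sinceId, limit: 250 });
  for (const order of orders) {
    await upsertOrder(order);                                       // safe even if the webhook already delivered it
    await watermarks.set('shopify-orders', order.id);               // advance only after successful processing
  }
}

Advancing the watermark only after a successful upsert is what makes the overlap with webhooks safe: at worst an order is processed twice, and the idempotency layer absorbs the duplicate.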

Outbound: Fulfillment Creation

| Mechanism | How it works | Coverage |
| --- | --- | --- |
| Retry queue | Failed fulfillment attempts are enqueued with exponential backoff (5 attempts, 1 s initial delay, 1 h max delay, 2x multiplier, ±10% jitter) | Covers transient failures |
| Retryable error detection | HTTP 429, 500, 502, 503, 504, timeouts, ECONNRESET, and ECONNREFUSED are classified as retryable | Distinguishes transient from permanent |
| Rate limit handling | 429 responses are retried after the delay given in the Retry-After header | Prevents rate limit violations |
| Request timeout | AbortController with a configurable timeout prevents hanging requests | Prevents resource leaks |

Assessment: Fulfillment creation is reasonably protected. The 5-attempt retry with exponential backoff handles most transient failures. However, the five attempts complete within minutes (see Appendix A), so any outage longer than that exhausts the retries, and the fulfillment is permanently marked as failed, requiring manual intervention.
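
A minimal sketch of the classification logic described above (a hypothetical helper, not the client's actual code):

// Hypothetical helper mirroring the retryable statuses and codes listed above.
const RETRYABLE_STATUSES = new Set([429, 500, 502, 503, 504]);
const RETRYABLE_CODES = new Set(['ETIMEDOUT', 'ECONNRESET', 'ECONNREFUSED']);

function isRetryableError(err: { status?: number; code?: string }): boolean {
  if (err.status !== undefined) return RETRYABLE_STATUSES.has(err.status); // HTTP-level transients
  return err.code !== undefined && RETRYABLE_CODES.has(err.code);          // network-level transients
}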

Key files

  • apps/order-service/src/shopify/shopify-api.client.ts — API client with rate limit handling
  • apps/order-service/src/shopify/shopify.service.ts — Webhook processing
  • apps/order-service/src/shopify/shopify-backfill.service.ts — Polling-based order recovery
  • apps/order-service/src/retry-queue/retry-queue.service.ts — Retry queue with backoff
  • apps/order-service/src/retry-queue/retry-queue.processor.ts — Processes FULFILLMENT retries

3.2 SimplyPrint

Inbound: Print Job Status Updates

| Mechanism | How it works | Coverage |
| --- | --- | --- |
| Webhooks | SimplyPrint sends job.started, job.done, job.failed, etc. | Primary status channel |
| Webhook idempotency | WebhookIdempotencyRepository prevents duplicate processing | Prevents duplicates |
| Polling | SimplyPrintService polls SimplyPrint every 30 seconds for job status updates | Covers missed webhooks |
| Reconciliation | SimplyPrintReconciliationService runs every minute, comparing all active jobs against the SimplyPrint API | Catches any status drift |

Assessment: Status tracking is well protected through three independent mechanisms (webhooks, polling, reconciliation). Even if all webhooks are lost during downtime, reconciliation will catch up within 1 minute.
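
The reconciliation pass can be pictured as a simple compare-and-heal loop (a sketch with illustrative names, not the actual service code):

// Hypothetical sketch of the reconciliation pass — names are illustrative.
async function reconcilePrintJobs(
  repo: { findActive(): Promise<Array<{ id: string; externalId: string; status: string }>> },
  api: { getJobStatus(externalId: string): Promise<string> },
  applyStatus: (jobId: string, status: string) => Promise<void>,
): Promise<void> {
  for (const job of await repo.findActive()) {
    const remote = await api.getJobStatus(job.externalId); // remote status is authoritative
    if (remote !== job.status) {
      await applyStatus(job.id, remote);                   // heal drift from missed webhooks
    }
  }
}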

Outbound: Print Job Creation

| Mechanism | How it works | Coverage |
| --- | --- | --- |
| None | addToQueue() failures immediately set the job to FAILED status | No automatic recovery |
| Manual retry | retryJob() method allows manual retry of FAILED/CANCELLED jobs | Requires human action |

Assessment: This is the biggest gap. If SimplyPrint is down when a new order arrives and print jobs need to be created, those jobs are immediately marked as FAILED with no automatic retry. An operator must manually retry each failed job after SimplyPrint recovers. During a maintenance window, this could affect all incoming orders.

Key files

  • apps/print-service/src/simplyprint/simplyprint-api.client.ts — API client (no retry logic)
  • apps/print-service/src/print-jobs/print-jobs.service.ts — Job creation and manual retry
  • apps/print-service/src/simplyprint/simplyprint.service.ts — Webhook processing + polling
  • apps/print-service/src/simplyprint/simplyprint-reconciliation.service.ts — Status reconciliation

3.3 SendCloud

Inbound: Shipment Status Updates

| Mechanism | How it works | Coverage |
| --- | --- | --- |
| Webhooks | SendCloud sends parcel_status_changed events | Primary status channel |
| Webhook idempotency | WebhookIdempotencyRepository prevents duplicate processing | Prevents duplicates |
| Reconciliation | SendcloudReconciliationService runs every 5 minutes, checking all active shipments against the SendCloud API | Catches missed webhooks |

Assessment: Status tracking is well protected. The reconciliation service will detect any status changes missed during downtime within 5 minutes.

Outbound: Shipment/Label Creation

| Mechanism | How it works | Coverage |
| --- | --- | --- |
| Retry queue (enqueue only) | Failed shipment creation enqueues RetryJobType.SHIPMENT with 3 max attempts | Jobs are saved... |
| Retry processor (missing handler) | RetryQueueProcessor.processJob() has no case for RetryJobType.SHIPMENT — it falls to the default, which logs a warning and does nothing | ...but never processed |

Assessment: This is a critical bug. The SendcloudService correctly identifies retryable errors and enqueues retry jobs, but the RetryQueueProcessor in the shipping-service has no handler for SHIPMENT jobs. They sit in the database indefinitely. After max retries are "exhausted" (they never actually ran), they are marked as FAILED and require manual intervention. This is a code path that was designed but never completed.

Key files

  • apps/shipping-service/src/sendcloud/sendcloud-api.client.ts — API client
  • apps/shipping-service/src/sendcloud/sendcloud.service.ts — Shipment creation (enqueues retries at lines 457-461)
  • apps/shipping-service/src/retry-queue/retry-queue.processor.ts — Missing SHIPMENT case (lines 71-91)
  • apps/shipping-service/src/sendcloud/sendcloud-reconciliation.service.ts — Status reconciliation

4. Failure Scenario Analysis

Scenario 1: SendCloud Planned Maintenance (1-4 hours)

What happens today:

| Step | Event | System Response | Data Impact |
| --- | --- | --- | --- |
| 1 | SendCloud API returns 503 | createParcel() throws error | No label created |
| 2 | Error classified as retryable | handleShipmentError() enqueues SHIPMENT retry job | Job saved in DB |
| 3 | Retry processor runs every 30 s | No handler for SHIPMENT — job is silently skipped | Job never retried |
| 4 | Max attempts exhausted (time-based) | Job marked as FAILED, requiresAttention: true logged | Permanent failure |
| 5 | SendCloud comes back online | Nothing happens automatically | Labels remain uncreated |
| 6 | Operator notices FAILED shipments | Must manually trigger shipment creation for each affected order | Manual work |

Impact: Every order that reaches the shipping stage during the maintenance window will require manual intervention. For a 4-hour window, this could be dozens of orders.

Scenario 2: SimplyPrint Unplanned Outage (30 min - 2 hours)

What happens today:

| Step | Event | System Response | Data Impact |
| --- | --- | --- | --- |
| 1 | New Shopify order arrives (via webhook or backfill) | Order created successfully | Order saved |
| 2 | Print job creation triggered | addToQueue() call to SimplyPrint fails | API error |
| 3 | Error caught | Job immediately marked as FAILED | No retry |
| 4 | SimplyPrint comes back | Reconciliation runs but only checks existing jobs — never re-creates failed ones | Gap persists |
| 5 | Operator notices | Must manually retry each failed print job | Manual work |

Impact: All orders arriving during the outage have their print jobs permanently failed. The reconciliation service only reconciles status of jobs that were already successfully created in SimplyPrint — it does not retry failed job creation.

Scenario 3: Shopify Brief Network Glitch (< 5 minutes)

What happens today:

| Step | Event | System Response | Data Impact |
| --- | --- | --- | --- |
| 1 | Our fulfillment API call to Shopify fails | Error classified as retryable | Retry enqueued |
| 2 | Retry queue processes (every 30 s) | Retries with exponential backoff | Usually succeeds |
| 3 | Shopify webhooks during glitch | Shopify retries automatically (48 h window) | Nothing lost |
| 4 | Backfill runs | Catches any orders missed by webhooks | Belt and suspenders |

Impact: Minimal. The system handles this scenario well.

Scenario 4: Shopify Extended Outage (> 1 hour)

What happens today:

| Step | Event | System Response | Data Impact |
| --- | --- | --- | --- |
| 1 | No webhooks arriving | Backfill attempts also fail (Shopify is down) | No new orders |
| 2 | Fulfillment retries exhaust | After 5 attempts (within minutes; see Appendix A), fulfillments permanently fail | Requires manual action |
| 3 | Shopify recovers | Webhooks start flowing again, backfill catches missed orders | Orders recovered |
| 4 | Failed fulfillments | Not automatically retried — remain in FAILED state | Manual action needed |

Impact: Order ingestion recovers automatically. Fulfillments that exhausted retries during the outage require manual intervention.

Scenario 5: Cascading Failure (any service down + high load)

What happens today:

| Step | Event | System Response | Data Impact |
| --- | --- | --- | --- |
| 1 | Service returns errors/timeouts | Every incoming request still attempts the failing API call | Wasted resources |
| 2 | Request threads blocked on timeouts | API response time degrades across all endpoints | Degraded user experience |
| 3 | Health check reports dependency unhealthy | No automatic response — informational only | No mitigation |
| 4 | Retry queue floods with jobs | All retries attempt the still-failing API | Wasted work |

Impact: Without a circuit breaker, the system wastes resources hammering a known-dead service. This can degrade the entire platform, not just the affected integration. The retry queue also fills up with doomed retries that will all fail.


5. Identified Gaps

5.1 Critical: SendCloud Retry Handler Missing

Location: apps/shipping-service/src/retry-queue/retry-queue.processor.ts lines 71-91

The switch statement handles FULFILLMENT, PRINT_JOB_CREATION, NOTIFICATION, and CANCELLATION but has no case for SHIPMENT. The SendcloudService at lines 457-461 enqueues RetryJobType.SHIPMENT jobs that are never processed.

Fix complexity: Low — add a case RetryJobType.SHIPMENT handler that calls SendcloudService.createShipment().

5.2 High: No Retry for SimplyPrint Job Creation

Location: apps/print-service/src/print-jobs/print-jobs.service.ts

Failed addToQueue() calls immediately mark the print job as FAILED with no retry mechanism. There is no retry queue in the print-service for this operation.

Fix complexity: Medium — implement a retry queue in the print-service (can reuse the same pattern from order-service/shipping-service) and enqueue failed addToQueue calls.

5.3 High: No Circuit Breaker on Any External Service

Location: All API clients

When an external service is down, the system continues to make API calls that are guaranteed to fail. This wastes resources, blocks request threads on timeouts, and floods the retry queue.

Fix complexity: Medium — implement circuit breaker using the opossum library (already recommended in docs/03-architecture/patterns-evaluation.md lines 1696-1754).

5.4 Medium: Exhausted Retries Are Not Re-Retriable

When all retry attempts are exhausted, jobs are permanently marked as FAILED. Even after the external service recovers, there is no mechanism to automatically retry these "exhausted" jobs. This requires manual operator intervention.

Fix complexity: Medium — add a "re-enqueue exhausted jobs" feature that checks for FAILED retry jobs whose target service is now healthy (using the existing health check indicators).

5.5 Medium: Print Job Creation Retry Not Implemented

Location: apps/order-service/src/retry-queue/retry-queue.processor.ts lines 113-122

The PRINT_JOB_CREATION case in the order-service retry processor throws a ConflictError('Print job retry not yet implemented'). Any print job creation failures enqueued in the order-service retry queue therefore fail on every attempt until retries are exhausted.

5.6 Low: No Alerting for Maintenance Windows

There is no mechanism to pre-emptively adjust behavior when a planned maintenance window is known. The system treats planned maintenance the same as an unplanned outage.


6. Recommendations

6.1 Fix: Add SHIPMENT Handler to Shipping-Service Retry Processor

Priority: Immediate — this is a bug

Add the missing case to RetryQueueProcessor:

case RetryJobType.SHIPMENT:
  await this.processShipmentRetry(job);
  break;

With a handler that calls SendcloudService.createShipment():

private async processShipmentRetry(job: RetryQueue): Promise<void> {
  // Payload shape matches what handleShipmentError() enqueues on failure.
  const payload = job.payload as { orderId: string; action: string };
  await this.sendcloudService.createShipment(payload.orderId);
}

This requires injecting SendcloudService into the RetryQueueProcessor.
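
One design question to settle while implementing this: if the original call failed after SendCloud had actually created the parcel (for example, a timeout while reading the response), a blind retry could create a duplicate. A hedged guard, assuming the shipment record stores the parcel ID once known (the repository name is illustrative):

// Hypothetical duplicate guard — assumes a stored parcel ID on the shipment record.
private async processShipmentRetry(job: RetryQueue): Promise<void> {
  const payload = job.payload as { orderId: string; action: string };
  const shipment = await this.shipmentRepo.findByOrderId(payload.orderId); // illustrative lookup
  if (shipment?.parcelId) {
    return; // parcel already exists upstream; nothing to retry
  }
  await this.sendcloudService.createShipment(payload.orderId);
}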

6.2 Fix: Add Retry Queue for SimplyPrint Job Creation

Priority: High

Apply the same retry queue pattern already used in order-service and shipping-service (see the sketch after this list):

  1. When addToQueue() fails with a retryable error, enqueue a PRINT_JOB_CREATION retry job instead of immediately marking as FAILED
  2. Keep the job in QUEUED status (not FAILED) while retries are pending
  3. Add a PRINT_JOB_CREATION handler in the print-service retry processor
  4. Only mark as FAILED after all retries are exhausted
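
A minimal sketch of steps 1-3, assuming a retry-queue service mirroring the one in order-service/shipping-service (names are illustrative):

// Hypothetical failure path in print job creation — names are illustrative.
try {
  await this.simplyPrintClient.addToQueue(printJob);
} catch (err) {
  if (isRetryableError(err)) {
    await this.retryQueueService.enqueue({
      type: RetryJobType.PRINT_JOB_CREATION,
      payload: { printJobId: printJob.id },
    });
    // Job stays QUEUED — only the retry processor may transition it to FAILED.
  } else {
    await this.printJobsService.markFailed(printJob.id, err); // permanent errors fail fast
  }
}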

6.3 Implement: Circuit Breaker Pattern

Priority: High

Use the opossum library to wrap all external API clients:

import CircuitBreaker from 'opossum';

const options = {
  timeout: 10000,          // 10 second timeout per call
  errorThresholdPercentage: 50,  // Open circuit after 50% failures
  resetTimeout: 30000,     // Try again after 30 seconds
  volumeThreshold: 5,      // Minimum 5 calls before evaluation
};

const breaker = new CircuitBreaker(apiCall, options);

breaker.on('open', () => {
  logger.warn('Circuit breaker OPEN — service is down, fast-failing requests');
});

breaker.on('halfOpen', () => {
  logger.log('Circuit breaker HALF-OPEN — testing if service recovered');
});

breaker.on('close', () => {
  logger.log('Circuit breaker CLOSED — service recovered, resuming normal operation');
});

Benefits:

  • Fast failure: when a service is known to be down, fail immediately instead of waiting for a timeout
  • Automatic recovery: periodically tests whether the service is back, and automatically resumes when it is
  • Resource protection: prevents thread exhaustion from requests blocked on timeouts
  • Retry queue synergy: retryable errors still go to the retry queue, while the circuit breaker keeps the queue from hammering a dead service
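
Calls then go through breaker.fire(); when the circuit is open, opossum rejects immediately with an error whose code is EOPENBREAKER, which can be routed to the retry queue instead of blocking a request thread. A hedged wiring sketch (the client and queue names are illustrative):

// Hypothetical wiring of one client method through the breaker.
const createParcelBreaker = new CircuitBreaker(
  (dto: CreateParcelDto) => sendcloudApiClient.createParcel(dto), // illustrative client
  options,
);

async function createParcelResilient(dto: CreateParcelDto) {
  try {
    return await createParcelBreaker.fire(dto);
  } catch (err: any) {
    if (err.code === 'EOPENBREAKER') {
      // Fast-fail path: service known to be down — enqueue a retry instead of blocking.
      await retryQueue.enqueue({ type: 'SHIPMENT', payload: { dto } });
      return null;
    }
    throw err; // normal failure: the existing retryable/permanent classification applies
  }
}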

6.4 Implement: Auto-Retry for Exhausted Jobs After Service Recovery

Priority: Medium

Add a scheduled job that checks for FAILED retry jobs whose target service is now healthy:

  1. Use the existing health check indicators (/health/dependencies) to determine service status
  2. When a service transitions from unhealthy → healthy, scan for FAILED retry jobs targeting that service
  3. Re-enqueue them with a fresh attempt counter
  4. Rate-limit the re-enqueue to avoid flooding the recovered service

This closes the gap between "retries exhausted during downtime" and "service came back but nobody retried."
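
A sketch of such a scan, assuming NestJS-style scheduling and hypothetical repository/health-indicator methods:

// Hypothetical scheduled scan — repository and health-indicator methods are illustrative.
@Cron('*/5 * * * *') // every 5 minutes (illustrative schedule)
async requeueExhaustedJobs(): Promise<void> {
  for (const service of ['shopify', 'simplyprint', 'sendcloud']) {
    const healthy = await this.healthIndicators.isHealthy(service);
    const wasUnhealthy = this.lastKnownStatus.get(service) === false;
    this.lastKnownStatus.set(service, healthy);
    if (!healthy || !wasUnhealthy) continue;      // act only on the unhealthy → healthy transition

    const failedJobs = await this.retryRepo.findFailedByService(service);
    for (const job of failedJobs.slice(0, 20)) {  // rate-limit the drain per tick
      await this.retryRepo.resetAttempts(job.id); // fresh attempt counter re-enqueues the job
    }
  }
}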

6.5 Implement: Maintenance Mode

Priority: Low

Add a per-service "maintenance mode" that:

  1. Can be activated via admin API or config flag
  2. Pauses outbound API calls to the service (queue them instead of calling)
  3. Keeps accepting inbound webhooks and processing other work
  4. When deactivated, drains the queued calls with rate limiting
  5. Optionally: integrate with service status pages to auto-activate

This is especially useful for known maintenance windows like the SendCloud notification that triggered this research.
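
In code, this amounts to a guard at the top of each outbound call; a minimal sketch with illustrative names:

// Hypothetical maintenance-mode guard — service and queue names are illustrative.
async createShipment(orderId: string): Promise<void> {
  if (await this.maintenanceMode.isActive('sendcloud')) {
    // Queue instead of calling; drained with rate limiting when the mode is deactivated.
    await this.retryQueue.enqueue({ type: 'SHIPMENT', payload: { orderId } });
    return;
  }
  // ...normal call path
}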

6.6 Monitoring: Circuit Breaker Dashboard Integration

Priority: Low

Expose circuit breaker state through the existing health check endpoints and the admin dashboard (see the sketch after this list):

  • Current state per service (CLOSED / OPEN / HALF-OPEN)
  • Failure count and threshold
  • Time since last state transition
  • Number of queued retry jobs per service
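
opossum exposes opened/halfOpen/closed getters and a rolling stats snapshot, so the mapping into a health payload is small (a sketch; the payload shape is illustrative):

// Hypothetical health payload built from opossum's public getters.
function breakerHealth(service: string, breaker: CircuitBreaker) {
  return {
    service,
    state: breaker.opened ? 'OPEN' : breaker.halfOpen ? 'HALF-OPEN' : 'CLOSED',
    failures: breaker.stats.failures,   // rolling-window failure count
    successes: breaker.stats.successes, // rolling-window success count
  };
}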

7. Implementation Priority

Phase 1: Bug Fixes (1-2 days)

| Item | Effort | Impact |
| --- | --- | --- |
| Add SHIPMENT handler to shipping-service retry processor | 2 hours | Fixes critical bug — SendCloud retries will actually work |
| Add PRINT_JOB_CREATION retry logic to print-service | 1 day | SimplyPrint job creation will survive outages |

Phase 2: Circuit Breaker (2-3 days)

| Item | Effort | Impact |
| --- | --- | --- |
| Install opossum and create CircuitBreakerService wrapper | 4 hours | Reusable infrastructure |
| Wrap Shopify API client | 2 hours | Prevents cascading failure |
| Wrap SimplyPrint API client | 2 hours | Prevents cascading failure |
| Wrap SendCloud API client | 2 hours | Prevents cascading failure |
| Expose circuit breaker state in health checks | 2 hours | Observability |

Phase 3: Self-Healing (1-2 days)

| Item | Effort | Impact |
| --- | --- | --- |
| Auto-retry exhausted jobs after service recovery | 1 day | Eliminates most manual intervention |
| Alerting for circuit breaker state changes | 4 hours | Early warning |

Phase 4: Operational Excellence (optional)

| Item | Effort | Impact |
| --- | --- | --- |
| Maintenance mode per service | 1 day | Graceful handling of known windows |
| Dashboard integration | 1 day | Visibility |

Appendix A: Current Retry Queue Configuration

| Parameter | Value |
| --- | --- |
| Max retries | 5 (order-service), 3 (shipping-service shipments) |
| Initial delay | 1,000 ms |
| Max delay | 3,600,000 ms (1 hour) |
| Backoff multiplier | 2 |
| Jitter | ±10% |
| Processing interval | Every 30 seconds |
| Cleanup | Daily at 3 AM; removes jobs older than 7 days |

Retry timeline example (5 attempts):

| Attempt | Approx. delay | Wall clock (cumulative) |
| --- | --- | --- |
| 1 | ~1 s | ~1 s |
| 2 | ~2 s | ~3 s |
| 3 | ~4 s | ~7 s |
| 4 | ~8 s | ~15 s |
| 5 | ~16 s | ~31 s |

With jitter, actual delays vary by ±10%, and because the retry processor only runs every 30 seconds, each attempt executes on the next processing tick — so the five attempts realistically complete within a few minutes. The max delay cap (1 hour) only matters for configurations with many more retries.
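
The schedule above follows directly from the parameters; a sketch of the delay formula they imply:

// Sketch of the backoff formula implied by the parameters above (jitter omitted from the table).
function backoffDelayMs(attempt: number): number {
  const base = 1_000 * 2 ** (attempt - 1);        // 1 s, 2 s, 4 s, 8 s, 16 s...
  const capped = Math.min(base, 3_600_000);       // 1 hour cap (rarely reached with 5 attempts)
  const jitter = 1 + (Math.random() * 0.2 - 0.1); // ±10%
  return Math.round(capped * jitter);
}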

Appendix B: Reconciliation Service Summary

| Service | Interval | Scope | Resilience Provided |
| --- | --- | --- | --- |
| ShopifyBackfillService | Every 5 min | All orders since watermark | Catches missed webhooks, recovers order ingestion |
| SimplyPrintReconciliationService | Every 1 min | All active print jobs | Catches missed status webhooks, detects status drift |
| SendcloudReconciliationService | Every 5 min | All active shipments | Catches missed status webhooks, detects status drift |

All reconciliation services:

  • Are configurable via environment variables (*_RECONCILIATION_ENABLED)
  • Run an initial check shortly after startup (10-45 second delay)
  • Use a mutex (isReconciling flag) to prevent concurrent runs
  • Log results to EventLog with severity based on error count
  • Report errors to Sentry

Appendix C: Webhook Behavior by Service

| Service | Retry on failure? | Retry window | Idempotency key |
| --- | --- | --- | --- |
| Shopify | Yes (automatic) | 48 hours, exponential backoff | ${topic}:${shopify_order_id} |
| SimplyPrint | Unknown / not documented | — | ${webhook_id}:${event}:${job.uid}:${timestamp} |
| SendCloud | Yes (automatic) | Limited retries | ${parcel.id}-${parcel.status.id}-${timestamp} |

Shopify is the most robust — if we return a non-2xx response, Shopify will keep retrying for 48 hours. Our system deliberately returns 200 OK for non-critical errors to avoid unnecessary Shopify retries, and only returns 500 for critical errors (like database connection failures) to trigger Shopify's retry mechanism.
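
The policy in the last sentence can be sketched as a handler shape (hypothetical controller code, not the actual implementation):

// Hypothetical webhook handler illustrating the 200-vs-500 policy.
@Post('webhooks/shopify')
async handleWebhook(@Body() payload: unknown, @Res() res: Response) {
  try {
    await this.shopifyService.process(payload);
    return res.status(200).send();
  } catch (err) {
    if (isCriticalError(err)) {          // e.g. database connection failure
      return res.status(500).send();     // non-2xx → Shopify keeps retrying (48 h window)
    }
    return res.status(200).send();       // deliberately absorb non-critical errors
  }
}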

SimplyPrint webhook retry behavior is not well documented. The reconciliation service and polling provide the primary safety net here.

SendCloud performs limited webhook retries. The reconciliation service compensates for any missed deliveries.