External Service Resilience Research¶
Status: Research Document
Created: February 2026
Scope: Forma 3D Connect — Shopify, SimplyPrint, SendCloud outage resilience
Trigger: SendCloud planned maintenance notification
Table of Contents¶
- Executive Summary
- External Service Dependency Map
- Current Resilience Mechanisms
- Failure Scenario Analysis
- Identified Gaps
- Recommendations
- Implementation Priority
1. Executive Summary¶
Forma 3D Connect depends on three external services for its core order-to-delivery pipeline: Shopify (e-commerce), SimplyPrint (3D print management), and SendCloud (shipping/labels). When any of these services go down — whether through planned maintenance, unplanned outages, or network glitches — our system must continue to operate gracefully and recover automatically once the service returns.
Key Findings¶
| Area | Current State | Risk Level |
|---|---|---|
| Shopify order ingestion | Well protected — webhook idempotency + backfill polling every 5 min | Low |
| Shopify fulfillment creation | Protected — retry queue with exponential backoff (5 attempts) | Medium |
| SimplyPrint print job creation | Vulnerable — no retry, immediate FAILED status | High |
| SimplyPrint status tracking | Well protected — webhooks + polling every 30s + reconciliation every 1 min | Low |
| SendCloud shipment creation | Broken — retry jobs enqueued but never processed (missing handler) | Critical |
| SendCloud status tracking | Well protected — webhooks + reconciliation every 5 min | Low |
| Circuit breaker (all services) | Not implemented — repeated calls to a down service waste resources and risk cascading failure | High |
Bottom Line¶
The system will mostly recover from short outages (< 5 minutes) thanks to webhook retry behavior from Shopify and the reconciliation services for SimplyPrint and SendCloud. However, longer outages expose real gaps: SimplyPrint print job creation has no automatic retry, and SendCloud shipment retries are enqueued but silently dropped because the retry queue processor has no SHIPMENT handler. There is no circuit breaker to prevent cascading failures during prolonged downtime.
2. External Service Dependency Map¶
Data Flow¶
```
       Shopify                SimplyPrint                   SendCloud
          │                       │                            │
          │ webhooks +            │ webhooks +                 │ webhooks +
          │ polling backfill      │ polling + reconciliation   │ reconciliation
          │                       │                            │
          ▼                       ▼                            ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          Forma 3D Connect                           │
│                                                                     │
│   Order Service          Print Service          Shipping Service    │
│  ┌──────────────┐       ┌───────────────┐      ┌──────────────┐     │
│  │ ShopifyAPI   │       │ SimplyPrintAPI│      │ SendcloudAPI │     │
│  │ Client       │       │ Client        │      │ Client       │     │
│  └──────┬───────┘       └──────┬────────┘      └──────┬───────┘     │
│         │                      │                      │             │
│  ┌──────▼───────┐       ┌──────▼────────┐      ┌──────▼───────┐     │
│  │ Retry Queue  │       │ (no retry)    │      │ Retry Queue  │     │
│  │ (DB-backed)  │       │               │      │ (DB-backed)  │     │
│  └──────────────┘       └───────────────┘      └──────────────┘     │
│                                                                     │
│  ┌──────────────┐       ┌───────────────┐      ┌───────────────┐    │
│  │ Backfill     │       │ Reconciliation│      │ Reconciliation│    │
│  │ (every 5min) │       │ (every 1min)  │      │ (every 5min)  │    │
│  └──────────────┘       └───────────────┘      └───────────────┘    │
└─────────────────────────────────────────────────────────────────────┘
```
Interaction Types per Service¶
| Service | Inbound (they → us) | Outbound (us → them) |
|---|---|---|
| Shopify | Webhooks (orders/create, orders/updated, orders/cancelled, orders/fulfilled) | REST + GraphQL API (fulfillments, products, draft orders, variants) |
| SimplyPrint | Webhooks (job.started, job.done, job.failed, job.cancelled, etc.) | REST API (create job, add to queue, file management, get status) |
| SendCloud | Webhooks (parcel_status_changed) | REST API (create parcel, get label, cancel parcel, get tracking) |
What Each Service Controls¶
| Service | If down, we cannot... |
|---|---|
| Shopify | Receive new orders (webhooks), create fulfillments (mark orders as shipped), manage products/variants |
| SimplyPrint | Create print jobs, check print status, manage printer queue |
| SendCloud | Create shipping labels, get tracking info, cancel parcels |
3. Current Resilience Mechanisms¶
3.1 Shopify¶
Inbound: Order Ingestion¶
| Mechanism | How it works | Coverage |
|---|---|---|
| Webhook retry (Shopify-side) | Shopify retries failed webhooks for up to 48 hours with exponential backoff | Covers our downtime up to 48h |
| Webhook idempotency | `ProcessedWebhook` table prevents duplicate processing if the same webhook arrives twice | Prevents duplicates |
| Backfill polling | `ShopifyBackfillService` polls the Shopify Orders API every 5 minutes using a durable `since_id` watermark | Catches anything webhooks missed |
| Test order filtering | Test/bogus orders are detected and skipped | Prevents noise |
Assessment: Order ingestion is well protected. Even during extended downtime, Shopify's own webhook retry (48h) combined with our backfill polling creates a robust safety net. The watermark-based backfill ensures no orders are lost even if both mechanisms overlap.
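The idempotency guard can be sketched as follows. This is a minimal illustration, not the actual order-service code: the `WebhookDeduplicator` class and its in-memory `Set` are stand-ins for the real `ProcessedWebhook` table, using the `${topic}:${shopify_order_id}` key format listed in Appendix C.

```typescript
// Sketch of the webhook idempotency check. An in-memory Set stands in for the
// ProcessedWebhook table; the real implementation persists keys in the database
// so deduplication survives restarts.

type ShopifyWebhook = { topic: string; orderId: string };

class WebhookDeduplicator {
  private seen = new Set<string>(); // stand-in for the ProcessedWebhook table

  /** Returns true if the webhook should be processed, false if it is a duplicate. */
  shouldProcess(hook: ShopifyWebhook): boolean {
    const key = `${hook.topic}:${hook.orderId}`;
    if (this.seen.has(key)) return false; // already handled — skip silently
    this.seen.add(key);
    return true;
  }
}
```

Because the key includes the topic, an `orders/create` and an `orders/updated` event for the same order are deduplicated independently.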
Outbound: Fulfillment Creation¶
| Mechanism | How it works | Coverage |
|---|---|---|
| Retry queue | Failed fulfillment attempts are enqueued with exponential backoff (5 attempts, 1s→1h max delay, 2x multiplier, ±10% jitter) | Covers transient failures |
| Retryable error detection | HTTP 429, 500, 502, 503, 504, timeouts, ECONNRESET, ECONNREFUSED classified as retryable | Distinguishes transient vs permanent |
| Rate limit handling | 429 responses trigger retry using the `Retry-After` header | Prevents rate limit violations |
| Request timeout | `AbortController` with configurable timeout prevents hanging requests | Prevents resource leaks |
Assessment: Fulfillment creation is reasonably protected. The 5-attempt retry with exponential backoff handles most transient failures. However, a prolonged outage would exhaust the retries — per Appendix A, five attempts span only about 31 seconds of backoff, or a few minutes of wall clock including the 30-second processing interval — and the fulfillment would be permanently marked as failed, requiring manual intervention.
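The backoff parameters above (1s initial delay, 2x multiplier, ±10% jitter, 1h cap) can be expressed as a small pure function. This is a sketch of the policy, not the retry-queue's actual API — `retryDelayMs` and `BackoffConfig` are illustrative names.

```typescript
// Sketch of the documented backoff policy: exponential growth with a cap and
// symmetric jitter. Attempt numbers are 1-based.

interface BackoffConfig {
  initialDelayMs: number;
  multiplier: number;
  maxDelayMs: number;
  jitterRatio: number; // 0.1 means ±10%
}

function retryDelayMs(attempt: number, cfg: BackoffConfig): number {
  // attempt 1 → initial delay, attempt 2 → initial * multiplier, ...
  const base = Math.min(
    cfg.initialDelayMs * Math.pow(cfg.multiplier, attempt - 1),
    cfg.maxDelayMs,
  );
  const jitter = (Math.random() * 2 - 1) * cfg.jitterRatio; // uniform in [-ratio, +ratio]
  return Math.round(base * (1 + jitter));
}

const fulfillmentRetryConfig: BackoffConfig = {
  initialDelayMs: 1_000,
  multiplier: 2,
  maxDelayMs: 3_600_000, // 1 hour cap
  jitterRatio: 0.1,
};
```

The jitter spreads retries out so that many jobs failing at the same moment do not all hit Shopify again in lockstep.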
Key files¶
- `apps/order-service/src/shopify/shopify-api.client.ts` — API client with rate limit handling
- `apps/order-service/src/shopify/shopify.service.ts` — Webhook processing
- `apps/order-service/src/shopify/shopify-backfill.service.ts` — Polling-based order recovery
- `apps/order-service/src/retry-queue/retry-queue.service.ts` — Retry queue with backoff
- `apps/order-service/src/retry-queue/retry-queue.processor.ts` — Processes FULFILLMENT retries
3.2 SimplyPrint¶
Inbound: Print Job Status Updates¶
| Mechanism | How it works | Coverage |
|---|---|---|
| Webhooks | SimplyPrint sends job.started, job.done, job.failed, etc. | Primary status channel |
| Webhook idempotency | `WebhookIdempotencyRepository` prevents duplicate processing | Prevents duplicates |
| Polling | `SimplyPrintService` polls SimplyPrint every 30 seconds for job status updates | Covers missed webhooks |
| Reconciliation | `SimplyPrintReconciliationService` runs every minute, comparing all active jobs against the SimplyPrint API | Catches any status drift |
Assessment: Status tracking is well protected through three independent mechanisms (webhooks, polling, reconciliation). Even if all webhooks are lost during downtime, reconciliation will catch up within 1 minute.
Outbound: Print Job Creation¶
| Mechanism | How it works | Coverage |
|---|---|---|
| None | `addToQueue()` failures immediately set the job to FAILED status | No automatic recovery |
| Manual retry | `retryJob()` method allows manual retry of FAILED/CANCELLED jobs | Requires human action |
Assessment: This is the biggest gap. If SimplyPrint is down when a new order arrives and print jobs need to be created, those jobs are immediately marked as FAILED with no automatic retry. An operator must manually retry each failed job after SimplyPrint recovers. During a maintenance window, this could affect all incoming orders.
Key files¶
- `apps/print-service/src/simplyprint/simplyprint-api.client.ts` — API client (no retry logic)
- `apps/print-service/src/print-jobs/print-jobs.service.ts` — Job creation and manual retry
- `apps/print-service/src/simplyprint/simplyprint.service.ts` — Webhook processing + polling
- `apps/print-service/src/simplyprint/simplyprint-reconciliation.service.ts` — Status reconciliation
3.3 SendCloud¶
Inbound: Shipment Status Updates¶
| Mechanism | How it works | Coverage |
|---|---|---|
| Webhooks | SendCloud sends `parcel_status_changed` events | Primary status channel |
| Webhook idempotency | `WebhookIdempotencyRepository` prevents duplicate processing | Prevents duplicates |
| Reconciliation | `SendcloudReconciliationService` runs every 5 minutes, checking all active shipments against the SendCloud API | Catches missed webhooks |
Assessment: Status tracking is well protected. The reconciliation service will detect any status changes missed during downtime within 5 minutes.
Outbound: Shipment/Label Creation¶
| Mechanism | How it works | Coverage |
|---|---|---|
| Retry queue (enqueue only) | Failed shipment creation enqueues `RetryJobType.SHIPMENT` with 3 max attempts | Jobs are saved... |
| Retry processor (missing handler) | `RetryQueueProcessor.processJob()` has no `case RetryJobType.SHIPMENT` — it falls through to the default, which logs a warning and does nothing | ...but never processed |
Assessment: This is a critical bug. The SendcloudService correctly identifies retryable errors and enqueues retry jobs, but the RetryQueueProcessor in the shipping-service has no handler for SHIPMENT jobs. They sit in the database indefinitely. After max retries are "exhausted" (they never actually ran), they are marked as FAILED and require manual intervention. This is a code path that was designed but never completed.
Key files¶
- `apps/shipping-service/src/sendcloud/sendcloud-api.client.ts` — API client
- `apps/shipping-service/src/sendcloud/sendcloud.service.ts` — Shipment creation (enqueues retries at lines 457-461)
- `apps/shipping-service/src/retry-queue/retry-queue.processor.ts` — Missing `SHIPMENT` case (lines 71-91)
- `apps/shipping-service/src/sendcloud/sendcloud-reconciliation.service.ts` — Status reconciliation
4. Failure Scenario Analysis¶
Scenario 1: SendCloud Planned Maintenance (1-4 hours)¶
What happens today:
| Step | Event | System Response | Data Impact |
|---|---|---|---|
| 1 | SendCloud API returns 503 | `createParcel()` throws an error | No label created |
| 2 | Error classified as retryable | `handleShipmentError()` enqueues a SHIPMENT retry job | Job saved in DB |
| 3 | Retry processor runs every 30s | No handler for SHIPMENT — job is silently skipped | Job never retried |
| 4 | After max attempts exhausted (time-based) | Job marked as FAILED, `requiresAttention: true` logged | Permanent failure |
| 5 | SendCloud comes back online | Nothing happens automatically | Labels remain uncreated |
| 6 | Operator notices FAILED shipments | Must manually trigger shipment creation for each affected order | Manual work |
Impact: Every order that reaches the shipping stage during the maintenance window will require manual intervention. For a 4-hour window, this could be dozens of orders.
Scenario 2: SimplyPrint Unplanned Outage (30 min - 2 hours)¶
What happens today:
| Step | Event | System Response | Data Impact |
|---|---|---|---|
| 1 | New Shopify order arrives (via webhook or backfill) | Order created successfully | Order saved |
| 2 | Print job creation triggered | `addToQueue()` call to SimplyPrint fails | API error |
| 3 | Error caught | Job immediately marked as FAILED | No retry |
| 4 | SimplyPrint comes back | Reconciliation runs but only checks existing jobs — never re-creates failed ones | Gap persists |
| 5 | Operator notices | Must manually retry each failed print job | Manual work |
Impact: All orders arriving during the outage have their print jobs permanently failed. The reconciliation service only reconciles status of jobs that were already successfully created in SimplyPrint — it does not retry failed job creation.
Scenario 3: Shopify Brief Network Glitch (< 5 minutes)¶
What happens today:
| Step | Event | System Response | Data Impact |
|---|---|---|---|
| 1 | Our fulfillment API call to Shopify fails | Error classified as retryable | Retry enqueued |
| 2 | Retry queue processes (every 30s) | Retries with exponential backoff | Usually succeeds |
| 3 | Shopify webhooks during glitch | Shopify retries automatically (48h window) | Nothing lost |
| 4 | Backfill runs | Catches any orders missed by webhooks | Belt-and-suspenders |
Impact: Minimal. The system handles this scenario well.
Scenario 4: Shopify Extended Outage (> 1 hour)¶
What happens today:
| Step | Event | System Response | Data Impact |
|---|---|---|---|
| 1 | No webhooks arriving | Backfill attempts also fail (Shopify is down) | No new orders |
| 2 | Fulfillment retries exhaust | After 5 attempts (a few minutes of backoff, per Appendix A), fulfillments permanently fail | Requires manual action |
| 3 | Shopify recovers | Webhooks start flowing again, backfill catches missed orders | Orders recovered |
| 4 | Failed fulfillments | Not automatically retried — remain in FAILED state | Manual action needed |
Impact: Order ingestion recovers automatically. Fulfillments that exhausted retries during the outage require manual intervention.
Scenario 5: Cascading Failure (any service down + high load)¶
What happens today:
| Step | Event | System Response | Data Impact |
|---|---|---|---|
| 1 | Service returns errors/timeouts | Every incoming request still attempts the failing API call | Wasted resources |
| 2 | Request threads blocked on timeouts | API response time degrades across all endpoints | User experience |
| 3 | Health check reports dependency unhealthy | But no automatic response — just informational | No mitigation |
| 4 | Retry queue floods with jobs | All retries attempt the still-failing API | Wasted work |
Impact: Without a circuit breaker, the system wastes resources hammering a known-dead service. This can degrade the entire platform, not just the affected integration. The retry queue also fills up with doomed retries that will all fail.
5. Identified Gaps¶
5.1 Critical: SendCloud Retry Handler Missing¶
Location: apps/shipping-service/src/retry-queue/retry-queue.processor.ts lines 71-91
The switch statement handles FULFILLMENT, PRINT_JOB_CREATION, NOTIFICATION, and CANCELLATION but has no case for SHIPMENT. The SendcloudService at line 457-461 enqueues RetryJobType.SHIPMENT jobs that are never processed.
Fix complexity: Low — add a case RetryJobType.SHIPMENT handler that calls SendcloudService.createShipment().
5.2 High: No Retry for SimplyPrint Job Creation¶
Location: apps/print-service/src/print-jobs/print-jobs.service.ts
Failed addToQueue() calls immediately mark the print job as FAILED with no retry mechanism. There is no retry queue in the print-service for this operation.
Fix complexity: Medium — implement a retry queue in the print-service (can reuse the same pattern from order-service/shipping-service) and enqueue failed addToQueue calls.
5.3 High: No Circuit Breaker on Any External Service¶
Location: All API clients
When an external service is down, the system continues to make API calls that are guaranteed to fail. This wastes resources, blocks request threads on timeouts, and floods the retry queue.
Fix complexity: Medium — implement circuit breaker using the opossum library (already recommended in docs/03-architecture/patterns-evaluation.md lines 1696-1754).
5.4 Medium: Exhausted Retries Are Not Re-Retriable¶
When all retry attempts are exhausted, jobs are permanently marked as FAILED. Even after the external service recovers, there is no mechanism to automatically retry these "exhausted" jobs. This requires manual operator intervention.
Fix complexity: Medium — add a "re-enqueue exhausted jobs" feature that checks for FAILED retry jobs whose target service is now healthy (using the existing health check indicators).
5.5 Medium: Print Job Creation Retry Not Implemented¶
Location: apps/order-service/src/retry-queue/retry-queue.processor.ts line 113-122
The `PRINT_JOB_CREATION` case in the order-service retry processor throws a `ConflictError('Print job retry not yet implemented')`. Any print job creation failures enqueued in the order-service retry queue therefore fail on every processing attempt and are never actually retried.
5.6 Low: No Alerting for Maintenance Windows¶
There is no mechanism to pre-emptively adjust behavior when a planned maintenance window is known. The system treats planned maintenance the same as an unplanned outage.
6. Recommendations¶
6.1 Fix: Add SHIPMENT Handler to Shipping-Service Retry Processor¶
Priority: Immediate — this is a bug
Add the missing case to RetryQueueProcessor:
```typescript
case RetryJobType.SHIPMENT:
  await this.processShipmentRetry(job);
  break;
```

With a handler that calls `SendcloudService.createShipment()`:

```typescript
private async processShipmentRetry(job: RetryQueue): Promise<void> {
  const payload = job.payload as { orderId: string; action: string };
  await this.sendcloudService.createShipment(payload.orderId);
}
```
This requires injecting SendcloudService into the RetryQueueProcessor.
6.2 Fix: Add Retry Queue for SimplyPrint Job Creation¶
Priority: High
Apply the same retry queue pattern already used in order-service and shipping-service:
- When `addToQueue()` fails with a retryable error, enqueue a `PRINT_JOB_CREATION` retry job instead of immediately marking the job as FAILED
- Keep the job in `QUEUED` status (not `FAILED`) while retries are pending
- Add a `PRINT_JOB_CREATION` handler in the print-service retry processor
- Only mark the job as `FAILED` after all retries are exhausted
6.3 Implement: Circuit Breaker Pattern¶
Priority: High
Use the opossum library to wrap all external API clients:
```typescript
import CircuitBreaker from 'opossum';

const options = {
  timeout: 10000,               // 10-second timeout per call
  errorThresholdPercentage: 50, // Open circuit after 50% failures
  resetTimeout: 30000,          // Try again after 30 seconds
  volumeThreshold: 5,           // Minimum 5 calls before evaluation
};

const breaker = new CircuitBreaker(apiCall, options);

breaker.on('open', () => {
  logger.warn('Circuit breaker OPEN — service is down, fast-failing requests');
});
breaker.on('halfOpen', () => {
  logger.log('Circuit breaker HALF-OPEN — testing if service recovered');
});
breaker.on('close', () => {
  logger.log('Circuit breaker CLOSED — service recovered, resuming normal operation');
});
```
Benefits:
- Fast failure: when a service is known to be down, fail immediately instead of waiting for a timeout
- Automatic recovery: the breaker periodically tests whether the service is back and resumes automatically when it is
- Resource protection: prevents thread exhaustion from requests blocked on timeouts
- Retry queue synergy: retryable errors still go to the retry queue, while the circuit breaker prevents the queue from hammering a dead service
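To make the state transitions concrete, here is a minimal, dependency-free state machine illustrating the mechanics opossum implements. It is deliberately simplified (consecutive-failure threshold instead of a failure percentage, caller-supplied clock, no event emitters) and is a sketch, not a replacement for the library.

```typescript
// Minimal circuit-breaker state machine: CLOSED → OPEN after repeated failures,
// OPEN → HALF_OPEN after a cooldown, HALF_OPEN → CLOSED on success or back to
// OPEN on failure. The caller passes the current time for determinism.

type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class MiniBreaker {
  private state: BreakerState = 'CLOSED';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold: number, // consecutive failures before opening
    private resetTimeoutMs: number,   // cooldown before a trial call is allowed
  ) {}

  getState(now: number): BreakerState {
    if (this.state === 'OPEN' && now - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'HALF_OPEN'; // cooldown elapsed: allow a trial call through
    }
    return this.state;
  }

  canCall(now: number): boolean {
    return this.getState(now) !== 'OPEN'; // OPEN means fast-fail without calling
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  recordFailure(now: number): void {
    this.failures += 1;
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = now;
    }
  }
}
```

The key property for Scenario 5 is `canCall()` returning false while OPEN: incoming requests fail in microseconds instead of blocking on a 10-second timeout against a dead service.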
6.4 Implement: Auto-Retry for Exhausted Jobs After Service Recovery¶
Priority: Medium
Add a scheduled job that checks for FAILED retry jobs whose target service is now healthy:
- Use the existing health check indicators (`/health/dependencies`) to determine service status
- When a service transitions from unhealthy → healthy, scan for `FAILED` retry jobs targeting that service
- Re-enqueue them with a fresh attempt counter
- Rate-limit the re-enqueue to avoid flooding the recovered service
This closes the gap between "retries exhausted during downtime" and "service came back but nobody retried."
6.5 Implement: Maintenance Mode¶
Priority: Low
Add a per-service "maintenance mode" that:
- Can be activated via admin API or config flag
- Pauses outbound API calls to the service (queue them instead of calling)
- Keeps accepting inbound webhooks and processing other work
- When deactivated, drains the queued calls with rate limiting
- Optionally: integrate with service status pages to auto-activate
This is especially useful for known maintenance windows like the SendCloud notification that triggered this research.
6.6 Monitoring: Circuit Breaker Dashboard Integration¶
Priority: Low
Expose circuit breaker state through the existing health check endpoints and the admin dashboard:
- Current state per service (CLOSED / OPEN / HALF-OPEN)
- Failure count and threshold
- Time since last state transition
- Number of queued retry jobs per service
7. Implementation Priority¶
Phase 1: Bug Fixes (1-2 days)¶
| Item | Effort | Impact |
|---|---|---|
| Add SHIPMENT handler to shipping-service retry processor | 2 hours | Fixes critical bug — SendCloud retries will actually work |
| Add PRINT_JOB_CREATION retry logic to print-service | 1 day | SimplyPrint job creation will survive outages |
Phase 2: Circuit Breaker (2-3 days)¶
| Item | Effort | Impact |
|---|---|---|
| Install opossum and create a CircuitBreakerService wrapper | 4 hours | Reusable infrastructure |
| Wrap Shopify API client | 2 hours | Prevents cascading failure |
| Wrap SimplyPrint API client | 2 hours | Prevents cascading failure |
| Wrap SendCloud API client | 2 hours | Prevents cascading failure |
| Expose circuit breaker state in health checks | 2 hours | Observability |
Phase 3: Self-Healing (1-2 days)¶
| Item | Effort | Impact |
|---|---|---|
| Auto-retry exhausted jobs after service recovery | 1 day | Eliminates most manual intervention |
| Alerting for circuit breaker state changes | 4 hours | Early warning |
Phase 4: Operational Excellence (optional)¶
| Item | Effort | Impact |
|---|---|---|
| Maintenance mode per service | 1 day | Graceful handling of known windows |
| Dashboard integration | 1 day | Visibility |
Appendix A: Current Retry Queue Configuration¶
| Parameter | Value |
|---|---|
| Max retries | 5 (order-service), 3 (shipping-service for shipments) |
| Initial delay | 1,000 ms |
| Max delay | 3,600,000 ms (1 hour) |
| Backoff multiplier | 2 |
| Jitter | ±10% |
| Processing interval | Every 30 seconds |
| Cleanup | Daily at 3 AM, removes jobs older than 7 days |
Retry timeline example (5 attempts):
| Attempt | Approx delay | Wall clock (cumulative) |
|---|---|---|
| 1 | ~1s | ~1s |
| 2 | ~2s | ~3s |
| 3 | ~4s | ~7s |
| 4 | ~8s | ~15s |
| 5 | ~16s | ~31s |
With jitter, actual delays vary by ±10%. The max delay cap (1 hour) only matters for configurations with many more retries.
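The timeline above can be reproduced with a few lines. This sketch uses the configuration from the table (1s initial delay, 2x multiplier, 1h cap) and omits jitter so the output is deterministic; `retrySchedule` is an illustrative name, not the retry-queue's API.

```typescript
// Computes the jitter-free delay before each attempt under the documented
// configuration: initial * multiplier^(attempt-1), capped at maxDelayMs.

function retrySchedule(
  attempts: number,
  initialMs = 1_000,
  multiplier = 2,
  capMs = 3_600_000, // 1 hour
): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(initialMs * multiplier ** i, capMs));
  }
  return delays;
}

const delays = retrySchedule(5);                   // [1000, 2000, 4000, 8000, 16000]
const totalMs = delays.reduce((a, b) => a + b, 0); // 31000 → ~31s cumulative
```

This also shows why the 1-hour cap is irrelevant at 5 attempts: the cap is first reached around attempt 13 (1000 · 2¹² > 3,600,000).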
Appendix B: Reconciliation Service Summary¶
| Service | Interval | Scope | Resilience Provided |
|---|---|---|---|
| `ShopifyBackfillService` | Every 5 min | All orders since watermark | Catches missed webhooks, recovers order ingestion |
| `SimplyPrintReconciliationService` | Every 1 min | All active print jobs | Catches missed status webhooks, detects status drift |
| `SendcloudReconciliationService` | Every 5 min | All active shipments | Catches missed status webhooks, detects status drift |
All reconciliation services:
- Are configurable via environment variables (`*_RECONCILIATION_ENABLED`)
- Run an initial check shortly after startup (10-45 second delay)
- Use a mutex (`isReconciling` flag) to prevent concurrent runs
- Log results to `EventLog` with severity based on error count
- Report errors to Sentry
Appendix C: Webhook Behavior by Service¶
| Service | Retry on failure? | Retry window | Idempotency key |
|---|---|---|---|
| Shopify | Yes (automatic) | 48 hours, exponential backoff | ${topic}:${shopify_order_id} |
| SimplyPrint | Unknown / not documented | — | ${webhook_id}:${event}:${job.uid}:${timestamp} |
| SendCloud | Yes (automatic) | Limited retries | ${parcel.id}-${parcel.status.id}-${timestamp} |
Shopify is the most robust — if we return a non-2xx response, Shopify will keep retrying for 48 hours. Our system deliberately returns 200 OK for non-critical errors to avoid unnecessary Shopify retries, and only returns 500 for critical errors (like database connection failures) to trigger Shopify's retry mechanism.
SimplyPrint webhook retry behavior is not well documented. The reconciliation service and polling provide the primary safety net here.
SendCloud performs limited webhook retries. The reconciliation service compensates for any missed deliveries.
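The Shopify response policy described above — return 200 for non-critical errors so Shopify does not re-deliver, return 500 only when re-delivery would help — can be sketched as a small classifier. The error categories here are illustrative, not the webhook handler's actual error types.

```typescript
// Sketch of the webhook response policy: only errors that a redelivery could
// actually fix (e.g. a temporarily unavailable database) should trigger
// Shopify's 48-hour retry loop via a 500 response.

type WebhookOutcome = 'processed' | 'duplicate' | 'invalid_payload' | 'database_unavailable';

function webhookResponseStatus(outcome: WebhookOutcome): 200 | 500 {
  switch (outcome) {
    case 'database_unavailable':
      return 500; // critical: we failed to persist — let Shopify retry for up to 48h
    default:
      return 200; // processed, duplicate, or unprocessable: redelivery won't help
  }
}
```

Returning 200 for an invalid payload is deliberate: Shopify would otherwise redeliver the same unprocessable payload for 48 hours, generating noise without ever succeeding.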