External Service Resilience Research¶
Status: Research Document
Created: February 2026
Scope: Forma 3D Connect — Shopify, SimplyPrint, SendCloud outage resilience
Trigger: SendCloud planned maintenance notification
Table of Contents¶
- Executive Summary
- External Service Dependency Map
- Current Resilience Mechanisms
- Failure Scenario Analysis
- Identified Gaps
- Recommendations
- Implementation Priority
1. Executive Summary¶
Forma 3D Connect depends on three external services for its core order-to-delivery pipeline: Shopify (e-commerce), SimplyPrint (3D print management), and SendCloud (shipping/labels). When any of these services go down — whether through planned maintenance, unplanned outages, or network glitches — our system must continue to operate gracefully and recover automatically once the service returns.
Key Findings¶
| Area | Current State | Risk Level |
|---|---|---|
| Shopify order ingestion | Well protected — webhook idempotency + backfill polling every 5 min | Low |
| Shopify fulfillment creation | Protected — retry queue with exponential backoff (5 attempts) | Medium |
| SimplyPrint print job creation | Vulnerable — no retry, immediate FAILED status | High |
| SimplyPrint status tracking | Well protected — webhooks + polling every 30s + reconciliation every 1 min | Low |
| SendCloud shipment creation | Broken — retry jobs enqueued but never processed (missing handler) | Critical |
| SendCloud status tracking | Well protected — webhooks + reconciliation every 5 min | Low |
| Circuit breaker (all services) | Not implemented — repeated calls to a down service waste resources and risk cascading failure | High |
Bottom Line¶
The system will mostly recover from short outages (< 5 minutes) thanks to webhook retry behavior from Shopify and the reconciliation services for SimplyPrint and SendCloud. However, longer outages expose real gaps: SimplyPrint print job creation has no automatic retry, and SendCloud shipment retries are enqueued but silently dropped because the retry queue processor has no SHIPMENT handler. There is no circuit breaker to prevent cascading failures during prolonged downtime.
2. External Service Dependency Map¶
Data Flow¶
```
       Shopify                SimplyPrint                   SendCloud
          │                       │                            │
          │ webhooks +            │ webhooks +                 │ webhooks +
          │ polling backfill      │ polling + reconciliation   │ reconciliation
          │                       │                            │
          ▼                       ▼                            ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          Forma 3D Connect                           │
│                                                                     │
│   Order Service          Print Service          Shipping Service    │
│  ┌──────────────┐       ┌───────────────┐      ┌──────────────┐     │
│  │ ShopifyAPI   │       │ SimplyPrintAPI│      │ SendcloudAPI │     │
│  │ Client       │       │ Client        │      │ Client       │     │
│  └──────┬───────┘       └──────┬────────┘      └──────┬───────┘     │
│         │                      │                      │             │
│  ┌──────▼───────┐       ┌──────▼────────┐      ┌──────▼───────┐     │
│  │ Retry Queue  │       │ (no retry)    │      │ Retry Queue  │     │
│  │ (DB-backed)  │       │               │      │ (DB-backed)  │     │
│  └──────────────┘       └───────────────┘      └──────────────┘     │
│                                                                     │
│  ┌──────────────┐       ┌───────────────┐      ┌───────────────┐    │
│  │ Backfill     │       │ Reconciliation│      │ Reconciliation│    │
│  │ (every 5min) │       │ (every 1min)  │      │ (every 5min)  │    │
│  └──────────────┘       └───────────────┘      └───────────────┘    │
└─────────────────────────────────────────────────────────────────────┘
```
Interaction Types per Service¶
| Service | Inbound (they → us) | Outbound (us → them) |
|---|---|---|
| Shopify | Webhooks (orders/create, orders/updated, orders/cancelled, orders/fulfilled) | REST + GraphQL API (fulfillments, products, draft orders, variants) |
| SimplyPrint | Webhooks (job.started, job.done, job.failed, job.cancelled, etc.) | REST API (create job, add to queue, file management, get status) |
| SendCloud | Webhooks (parcel_status_changed) | REST API (create parcel, get label, cancel parcel, get tracking) |
What Each Service Controls¶
| Service | If down, we cannot... |
|---|---|
| Shopify | Receive new orders (webhooks), create fulfillments (mark orders as shipped), manage products/variants |
| SimplyPrint | Create print jobs, check print status, manage printer queue |
| SendCloud | Create shipping labels, get tracking info, cancel parcels |
3. Current Resilience Mechanisms¶
3.1 Shopify¶
Inbound: Order Ingestion¶
| Mechanism | How it works | Coverage |
|---|---|---|
| Webhook retry (Shopify-side) | Shopify retries failed webhooks for up to 48 hours with exponential backoff | Covers our downtime up to 48h |
| Webhook idempotency | `ProcessedWebhook` table prevents duplicate processing if the same webhook arrives twice | Prevents duplicates |
| Backfill polling | `ShopifyBackfillService` polls the Shopify Orders API every 5 minutes using a durable `since_id` watermark | Catches anything webhooks missed |
| Test order filtering | Test/bogus orders are detected and skipped | Prevents noise |
Assessment: Order ingestion is well protected. Even during extended downtime, Shopify's own webhook retry (48h) combined with our backfill polling creates a robust safety net. The watermark-based backfill ensures no orders are lost even if both mechanisms overlap.
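The idempotency guard can be sketched as follows. This is a minimal illustration, not the actual order-service code: the `WebhookDeduplicator` class and its in-memory `Set` are stand-ins for the real `ProcessedWebhook` table, using the `${topic}:${shopify_order_id}` key format listed in Appendix C.

```typescript
// Sketch of the webhook idempotency check. An in-memory Set stands in for the
// ProcessedWebhook table; the real implementation persists keys in the database
// so deduplication survives restarts.

type ShopifyWebhook = { topic: string; orderId: string };

class WebhookDeduplicator {
  private seen = new Set<string>(); // stand-in for the ProcessedWebhook table

  /** Returns true if the webhook should be processed, false if it is a duplicate. */
  shouldProcess(hook: ShopifyWebhook): boolean {
    const key = `${hook.topic}:${hook.orderId}`;
    if (this.seen.has(key)) return false; // already handled — skip silently
    this.seen.add(key);
    return true;
  }
}
```

Because the key includes the topic, an `orders/create` and an `orders/updated` event for the same order are deduplicated independently.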
Outbound: Fulfillment Creation¶
| Mechanism | How it works | Coverage |
|---|---|---|
| Retry queue | Failed fulfillment attempts are enqueued with exponential backoff (5 attempts, 1s→1h max delay, 2x multiplier, ±10% jitter) | Covers transient failures |
| Retryable error detection | HTTP 429, 500, 502, 503, 504, timeouts, ECONNRESET, ECONNREFUSED classified as retryable | Distinguishes transient vs permanent |
| Rate limit handling | 429 responses trigger retry using the `Retry-After` header | Prevents rate limit violations |
| Request timeout | `AbortController` with configurable timeout prevents hanging requests | Prevents resource leaks |
Assessment: Fulfillment creation is reasonably protected. The 5-attempt retry with exponential backoff handles most transient failures. However, a prolonged outage would exhaust the retries — per Appendix A, five attempts span only about 31 seconds of backoff, or a few minutes of wall clock including the 30-second processing interval — and the fulfillment would be permanently marked as failed, requiring manual intervention.
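The backoff parameters above (1s initial delay, 2x multiplier, ±10% jitter, 1h cap) can be expressed as a small pure function. This is a sketch of the policy, not the retry-queue's actual API — `retryDelayMs` and `BackoffConfig` are illustrative names.

```typescript
// Sketch of the documented backoff policy: exponential growth with a cap and
// symmetric jitter. Attempt numbers are 1-based.

interface BackoffConfig {
  initialDelayMs: number;
  multiplier: number;
  maxDelayMs: number;
  jitterRatio: number; // 0.1 means ±10%
}

function retryDelayMs(attempt: number, cfg: BackoffConfig): number {
  // attempt 1 → initial delay, attempt 2 → initial * multiplier, ...
  const base = Math.min(
    cfg.initialDelayMs * Math.pow(cfg.multiplier, attempt - 1),
    cfg.maxDelayMs,
  );
  const jitter = (Math.random() * 2 - 1) * cfg.jitterRatio; // uniform in [-ratio, +ratio]
  return Math.round(base * (1 + jitter));
}

const fulfillmentRetryConfig: BackoffConfig = {
  initialDelayMs: 1_000,
  multiplier: 2,
  maxDelayMs: 3_600_000, // 1 hour cap
  jitterRatio: 0.1,
};
```

The jitter spreads retries out so that many jobs failing at the same moment do not all hit Shopify again in lockstep.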
Key files¶
- `apps/order-service/src/shopify/shopify-api.client.ts` — API client with rate limit handling
- `apps/order-service/src/shopify/shopify.service.ts` — Webhook processing
- `apps/order-service/src/shopify/shopify-backfill.service.ts` — Polling-based order recovery
- `apps/order-service/src/retry-queue/retry-queue.service.ts` — Retry queue with backoff
- `apps/order-service/src/retry-queue/retry-queue.processor.ts` — Processes FULFILLMENT retries
3.2 SimplyPrint¶
Inbound: Print Job Status Updates¶
| Mechanism | How it works | Coverage |
|---|---|---|
| Webhooks | SimplyPrint sends job.started, job.done, job.failed, etc. | Primary status channel |
| Webhook idempotency | `WebhookIdempotencyRepository` prevents duplicate processing | Prevents duplicates |
| Polling | `SimplyPrintService` polls SimplyPrint every 30 seconds for job status updates | Covers missed webhooks |
| Reconciliation | `SimplyPrintReconciliationService` runs every minute, comparing all active jobs against the SimplyPrint API | Catches any status drift |
Assessment: Status tracking is well protected through three independent mechanisms (webhooks, polling, reconciliation). Even if all webhooks are lost during downtime, reconciliation will catch up within 1 minute.
Outbound: Print Job Creation¶
| Mechanism | How it works | Coverage |
|---|---|---|
| None | `addToQueue()` failures immediately set the job to FAILED status | No automatic recovery |
| Manual retry | `retryJob()` method allows manual retry of FAILED/CANCELLED jobs | Requires human action |
Assessment: This is the biggest gap. If SimplyPrint is down when a new order arrives and print jobs need to be created, those jobs are immediately marked as FAILED with no automatic retry. An operator must manually retry each failed job after SimplyPrint recovers. During a maintenance window, this could affect all incoming orders.
Key files¶
- `apps/print-service/src/simplyprint/simplyprint-api.client.ts` — API client (no retry logic)
- `apps/print-service/src/print-jobs/print-jobs.service.ts` — Job creation and manual retry
- `apps/print-service/src/simplyprint/simplyprint.service.ts` — Webhook processing + polling
- `apps/print-service/src/simplyprint/simplyprint-reconciliation.service.ts` — Status reconciliation
3.3 SendCloud¶
Inbound: Shipment Status Updates¶
| Mechanism | How it works | Coverage |
|---|---|---|
| Webhooks | SendCloud sends `parcel_status_changed` events | Primary status channel |
| Webhook idempotency | `WebhookIdempotencyRepository` prevents duplicate processing | Prevents duplicates |
| Reconciliation | `SendcloudReconciliationService` runs every 5 minutes, checking all active shipments against the SendCloud API | Catches missed webhooks |
Assessment: Status tracking is well protected. The reconciliation service will detect any status changes missed during downtime within 5 minutes.
Outbound: Shipment/Label Creation¶
| Mechanism | How it works | Coverage |
|---|---|---|
| Retry queue (enqueue only) | Failed shipment creation enqueues `RetryJobType.SHIPMENT` with 3 max attempts | Jobs are saved... |
| Retry processor (missing handler) | `RetryQueueProcessor.processJob()` has no `case RetryJobType.SHIPMENT` — it falls through to the default, which logs a warning and does nothing | ...but never processed |
Assessment: This is a critical bug. The SendcloudService correctly identifies retryable errors and enqueues retry jobs, but the RetryQueueProcessor in the shipping-service has no handler for SHIPMENT jobs. They sit in the database indefinitely. After max retries are "exhausted" (they never actually ran), they are marked as FAILED and require manual intervention. This is a code path that was designed but never completed.
Key files¶
- `apps/shipping-service/src/sendcloud/sendcloud-api.client.ts` — API client
- `apps/shipping-service/src/sendcloud/sendcloud.service.ts` — Shipment creation (enqueues retries at lines 457-461)
- `apps/shipping-service/src/retry-queue/retry-queue.processor.ts` — Missing `SHIPMENT` case (lines 71-91)
- `apps/shipping-service/src/sendcloud/sendcloud-reconciliation.service.ts` — Status reconciliation
4. Failure Scenario Analysis¶
Scenario 1: SendCloud Planned Maintenance (1-4 hours)¶
What happens today:
| Step | Event | System Response | Data Impact |
|---|---|---|---|
| 1 | SendCloud API returns 503 | `createParcel()` throws an error | No label created |
| 2 | Error classified as retryable | `handleShipmentError()` enqueues a SHIPMENT retry job | Job saved in DB |
| 3 | Retry processor runs every 30s | No handler for SHIPMENT — job is silently skipped | Job never retried |
| 4 | After max attempts exhausted (time-based) | Job marked as FAILED, `requiresAttention: true` logged | Permanent failure |
| 5 | SendCloud comes back online | Nothing happens automatically | Labels remain uncreated |
| 6 | Operator notices FAILED shipments | Must manually trigger shipment creation for each affected order | Manual work |
Impact: Every order that reaches the shipping stage during the maintenance window will require manual intervention. For a 4-hour window, this could be dozens of orders.
Scenario 2: SimplyPrint Unplanned Outage (30 min - 2 hours)¶
What happens today:
| Step | Event | System Response | Data Impact |
|---|---|---|---|
| 1 | New Shopify order arrives (via webhook or backfill) | Order created successfully | Order saved |
| 2 | Print job creation triggered | `addToQueue()` call to SimplyPrint fails | API error |
| 3 | Error caught | Job immediately marked as FAILED | No retry |
| 4 | SimplyPrint comes back | Reconciliation runs but only checks existing jobs — never re-creates failed ones | Gap persists |
| 5 | Operator notices | Must manually retry each failed print job | Manual work |
Impact: All orders arriving during the outage have their print jobs permanently failed. The reconciliation service only reconciles status of jobs that were already successfully created in SimplyPrint — it does not retry failed job creation.
Scenario 3: Shopify Brief Network Glitch (< 5 minutes)¶
What happens today:
| Step | Event | System Response | Data Impact |
|---|---|---|---|
| 1 | Our fulfillment API call to Shopify fails | Error classified as retryable | Retry enqueued |
| 2 | Retry queue processes (every 30s) | Retries with exponential backoff | Usually succeeds |
| 3 | Shopify webhooks during glitch | Shopify retries automatically (48h window) | Nothing lost |
| 4 | Backfill runs | Catches any orders missed by webhooks | Belt-and-suspenders |
Impact: Minimal. The system handles this scenario well.
Scenario 4: Shopify Extended Outage (> 1 hour)¶
What happens today:
| Step | Event | System Response | Data Impact |
|---|---|---|---|
| 1 | No webhooks arriving | Backfill attempts also fail (Shopify is down) | No new orders |
| 2 | Fulfillment retries exhaust | After 5 attempts (a few minutes of backoff, per Appendix A), fulfillments permanently fail | Requires manual action |
| 3 | Shopify recovers | Webhooks start flowing again, backfill catches missed orders | Orders recovered |
| 4 | Failed fulfillments | Not automatically retried — remain in FAILED state | Manual action needed |
Impact: Order ingestion recovers automatically. Fulfillments that exhausted retries during the outage require manual intervention.
Scenario 5: Cascading Failure (any service down + high load)¶
What happens today:
| Step | Event | System Response | Data Impact |
|---|---|---|---|
| 1 | Service returns errors/timeouts | Every incoming request still attempts the failing API call | Wasted resources |
| 2 | Request threads blocked on timeouts | API response time degrades across all endpoints | User experience |
| 3 | Health check reports dependency unhealthy | But no automatic response — just informational | No mitigation |
| 4 | Retry queue floods with jobs | All retries attempt the still-failing API | Wasted work |
Impact: Without a circuit breaker, the system wastes resources hammering a known-dead service. This can degrade the entire platform, not just the affected integration. The retry queue also fills up with doomed retries that will all fail.
5. Identified Gaps¶
5.1 Critical: SendCloud Retry Handler Missing¶
Location: apps/shipping-service/src/retry-queue/retry-queue.processor.ts lines 71-91
The switch statement handles FULFILLMENT, PRINT_JOB_CREATION, NOTIFICATION, and CANCELLATION but has no case for SHIPMENT. The SendcloudService at line 457-461 enqueues RetryJobType.SHIPMENT jobs that are never processed.
Fix complexity: Low — add a case RetryJobType.SHIPMENT handler that calls SendcloudService.createShipment().
5.2 High: No Retry for SimplyPrint Job Creation¶
Location: apps/print-service/src/print-jobs/print-jobs.service.ts
Failed addToQueue() calls immediately mark the print job as FAILED with no retry mechanism. There is no retry queue in the print-service for this operation.
Fix complexity: Medium — implement a retry queue in the print-service (can reuse the same pattern from order-service/shipping-service) and enqueue failed addToQueue calls.
5.3 High: No Circuit Breaker on Any External Service¶
Location: All API clients
When an external service is down, the system continues to make API calls that are guaranteed to fail. This wastes resources, blocks request threads on timeouts, and floods the retry queue.
Fix complexity: Medium — implement circuit breaker using the opossum library (already recommended in docs/03-architecture/patterns-evaluation.md lines 1696-1754).
5.4 Medium: Exhausted Retries Are Not Re-Retriable¶
When all retry attempts are exhausted, jobs are permanently marked as FAILED. Even after the external service recovers, there is no mechanism to automatically retry these "exhausted" jobs. This requires manual operator intervention.
Fix complexity: Medium — add a "re-enqueue exhausted jobs" feature that checks for FAILED retry jobs whose target service is now healthy (using the existing health check indicators).
5.5 Medium: Print Job Creation Retry Not Implemented¶
Location: apps/order-service/src/retry-queue/retry-queue.processor.ts line 113-122
The `PRINT_JOB_CREATION` case in the order-service retry processor throws a `ConflictError('Print job retry not yet implemented')`. Any print job creation failures enqueued in the order-service retry queue therefore fail on every processing attempt and are never actually retried.
5.6 Low: No Alerting for Maintenance Windows¶
There is no mechanism to pre-emptively adjust behavior when a planned maintenance window is known. The system treats planned maintenance the same as an unplanned outage.
6. Recommendations¶
6.1 Fix: Add SHIPMENT Handler to Shipping-Service Retry Processor¶
Priority: Immediate — this is a bug
Add the missing case to RetryQueueProcessor:
```typescript
case RetryJobType.SHIPMENT:
  await this.processShipmentRetry(job);
  break;
```

With a handler that calls `SendcloudService.createShipment()`:

```typescript
private async processShipmentRetry(job: RetryQueue): Promise<void> {
  const payload = job.payload as { orderId: string; action: string };
  await this.sendcloudService.createShipment(payload.orderId);
}
```
This requires injecting SendcloudService into the RetryQueueProcessor.
6.2 Fix: Add Retry Queue for SimplyPrint Job Creation¶
Priority: High
Apply the same retry queue pattern already used in order-service and shipping-service:
- When `addToQueue()` fails with a retryable error, enqueue a `PRINT_JOB_CREATION` retry job instead of immediately marking the job as FAILED
- Keep the job in `QUEUED` status (not `FAILED`) while retries are pending
- Add a `PRINT_JOB_CREATION` handler in the print-service retry processor
- Only mark the job as `FAILED` after all retries are exhausted
6.3 Implement: Circuit Breaker Pattern¶
Priority: High
Use the opossum library to wrap all external API clients:
```typescript
import CircuitBreaker from 'opossum';

const options = {
  timeout: 10000,               // 10-second timeout per call
  errorThresholdPercentage: 50, // Open circuit after 50% failures
  resetTimeout: 30000,          // Try again after 30 seconds
  volumeThreshold: 5,           // Minimum 5 calls before evaluation
};

const breaker = new CircuitBreaker(apiCall, options);

breaker.on('open', () => {
  logger.warn('Circuit breaker OPEN — service is down, fast-failing requests');
});
breaker.on('halfOpen', () => {
  logger.log('Circuit breaker HALF-OPEN — testing if service recovered');
});
breaker.on('close', () => {
  logger.log('Circuit breaker CLOSED — service recovered, resuming normal operation');
});
```
Benefits:
- Fast failure: when a service is known to be down, fail immediately instead of waiting for a timeout
- Automatic recovery: the breaker periodically tests whether the service is back and resumes automatically when it is
- Resource protection: prevents thread exhaustion from requests blocked on timeouts
- Retry queue synergy: retryable errors still go to the retry queue, while the circuit breaker prevents the queue from hammering a dead service
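To make the state transitions concrete, here is a minimal, dependency-free state machine illustrating the mechanics opossum implements. It is deliberately simplified (consecutive-failure threshold instead of a failure percentage, caller-supplied clock, no event emitters) and is a sketch, not a replacement for the library.

```typescript
// Minimal circuit-breaker state machine: CLOSED → OPEN after repeated failures,
// OPEN → HALF_OPEN after a cooldown, HALF_OPEN → CLOSED on success or back to
// OPEN on failure. The caller passes the current time for determinism.

type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class MiniBreaker {
  private state: BreakerState = 'CLOSED';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold: number, // consecutive failures before opening
    private resetTimeoutMs: number,   // cooldown before a trial call is allowed
  ) {}

  getState(now: number): BreakerState {
    if (this.state === 'OPEN' && now - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'HALF_OPEN'; // cooldown elapsed: allow a trial call through
    }
    return this.state;
  }

  canCall(now: number): boolean {
    return this.getState(now) !== 'OPEN'; // OPEN means fast-fail without calling
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  recordFailure(now: number): void {
    this.failures += 1;
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = now;
    }
  }
}
```

The key property for Scenario 5 is `canCall()` returning false while OPEN: incoming requests fail in microseconds instead of blocking on a 10-second timeout against a dead service.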
6.4 Implement: Auto-Retry for Exhausted Jobs After Service Recovery¶
Priority: Medium
Add a scheduled job that checks for FAILED retry jobs whose target service is now healthy:
- Use the existing health check indicators (`/health/dependencies`) to determine service status
- When a service transitions from unhealthy → healthy, scan for `FAILED` retry jobs targeting that service
- Re-enqueue them with a fresh attempt counter
- Rate-limit the re-enqueue to avoid flooding the recovered service
This closes the gap between "retries exhausted during downtime" and "service came back but nobody retried."
6.5 Implement: Maintenance Mode¶
Priority: Low
Add a per-service "maintenance mode" that:
- Can be activated via admin API or config flag
- Pauses outbound API calls to the service (queue them instead of calling)
- Keeps accepting inbound webhooks and processing other work
- When deactivated, drains the queued calls with rate limiting
- Optionally: integrate with service status pages to auto-activate
This is especially useful for known maintenance windows like the SendCloud notification that triggered this research.
6.6 Monitoring: Circuit Breaker Dashboard Integration¶
Priority: Low
Expose circuit breaker state through the existing health check endpoints and the admin dashboard:
- Current state per service (CLOSED / OPEN / HALF-OPEN)
- Failure count and threshold
- Time since last state transition
- Number of queued retry jobs per service
7. Implementation Priority¶
Phase 1: Bug Fixes (1-2 days)¶
| Item | Effort | Impact |
|---|---|---|
| Add SHIPMENT handler to shipping-service retry processor | 2 hours | Fixes critical bug — SendCloud retries will actually work |
| Add PRINT_JOB_CREATION retry logic to print-service | 1 day | SimplyPrint job creation will survive outages |
Phase 2: Circuit Breaker (2-3 days)¶
| Item | Effort | Impact |
|---|---|---|
| Install opossum and create a CircuitBreakerService wrapper | 4 hours | Reusable infrastructure |
| Wrap Shopify API client | 2 hours | Prevents cascading failure |
| Wrap SimplyPrint API client | 2 hours | Prevents cascading failure |
| Wrap SendCloud API client | 2 hours | Prevents cascading failure |
| Expose circuit breaker state in health checks | 2 hours | Observability |
Phase 3: Self-Healing (1-2 days)¶
| Item | Effort | Impact |
|---|---|---|
| Auto-retry exhausted jobs after service recovery | 1 day | Eliminates most manual intervention |
| Alerting for circuit breaker state changes | 4 hours | Early warning |
Phase 4: Operational Excellence (optional)¶
| Item | Effort | Impact |
|---|---|---|
| Maintenance mode per service | 1 day | Graceful handling of known windows |
| Dashboard integration | 1 day | Visibility |
Appendix A: Current Retry Queue Configuration¶
| Parameter | Value |
|---|---|
| Max retries | 5 (order-service), 3 (shipping-service for shipments) |
| Initial delay | 1,000 ms |
| Max delay | 3,600,000 ms (1 hour) |
| Backoff multiplier | 2 |
| Jitter | ±10% |
| Processing interval | Every 30 seconds |
| Cleanup | Daily at 3 AM, removes jobs older than 7 days |
Retry timeline example (5 attempts):
| Attempt | Approx delay | Wall clock (cumulative) |
|---|---|---|
| 1 | ~1s | ~1s |
| 2 | ~2s | ~3s |
| 3 | ~4s | ~7s |
| 4 | ~8s | ~15s |
| 5 | ~16s | ~31s |
With jitter, actual delays vary by ±10%. The max delay cap (1 hour) only matters for configurations with many more retries.
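The timeline above can be reproduced with a few lines. This sketch uses the configuration from the table (1s initial delay, 2x multiplier, 1h cap) and omits jitter so the output is deterministic; `retrySchedule` is an illustrative name, not the retry-queue's API.

```typescript
// Computes the jitter-free delay before each attempt under the documented
// configuration: initial * multiplier^(attempt-1), capped at maxDelayMs.

function retrySchedule(
  attempts: number,
  initialMs = 1_000,
  multiplier = 2,
  capMs = 3_600_000, // 1 hour
): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(initialMs * multiplier ** i, capMs));
  }
  return delays;
}

const delays = retrySchedule(5);                   // [1000, 2000, 4000, 8000, 16000]
const totalMs = delays.reduce((a, b) => a + b, 0); // 31000 → ~31s cumulative
```

This also shows why the 1-hour cap is irrelevant at 5 attempts: the cap is first reached around attempt 13 (1000 · 2¹² > 3,600,000).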
Appendix B: Reconciliation Service Summary¶
| Service | Interval | Scope | Resilience Provided |
|---|---|---|---|
| `ShopifyBackfillService` | Every 5 min | All orders since watermark | Catches missed webhooks, recovers order ingestion |
| `SimplyPrintReconciliationService` | Every 1 min | All active print jobs | Catches missed status webhooks, detects status drift |
| `SendcloudReconciliationService` | Every 5 min | All active shipments | Catches missed status webhooks, detects status drift |
All reconciliation services:
- Are configurable via environment variables (`*_RECONCILIATION_ENABLED`)
- Run an initial check shortly after startup (10-45 second delay)
- Use a mutex (`isReconciling` flag) to prevent concurrent runs
- Log results to `EventLog` with severity based on error count
- Report errors to Sentry
Appendix C: Webhook Behavior by Service¶
| Service | Retry on failure? | Retry window | Idempotency key |
|---|---|---|---|
| Shopify | Yes (automatic) | 48 hours, exponential backoff | ${topic}:${shopify_order_id} |
| SimplyPrint | Unknown / not documented | — | ${webhook_id}:${event}:${job.uid}:${timestamp} |
| SendCloud | Yes (automatic) | Limited retries | ${parcel.id}-${parcel.status.id}-${timestamp} |
Shopify is the most robust — if we return a non-2xx response, Shopify will keep retrying for 48 hours. Our system deliberately returns 200 OK for non-critical errors to avoid unnecessary Shopify retries, and only returns 500 for critical errors (like database connection failures) to trigger Shopify's retry mechanism.
SimplyPrint webhook retry behavior is not well documented. The reconciliation service and polling provide the primary safety net here.
SendCloud performs limited webhook retries. The reconciliation service compensates for any missed deliveries.
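The Shopify response policy described above — return 200 for non-critical errors so Shopify does not re-deliver, return 500 only when re-delivery would help — can be sketched as a small classifier. The error categories here are illustrative, not the webhook handler's actual error types.

```typescript
// Sketch of the webhook response policy: only errors that a redelivery could
// actually fix (e.g. a temporarily unavailable database) should trigger
// Shopify's 48-hour retry loop via a 500 response.

type WebhookOutcome = 'processed' | 'duplicate' | 'invalid_payload' | 'database_unavailable';

function webhookResponseStatus(outcome: WebhookOutcome): 200 | 500 {
  switch (outcome) {
    case 'database_unavailable':
      return 500; // critical: we failed to persist — let Shopify retry for up to 48h
    default:
      return 200; // processed, duplicate, or unprocessable: redelivery won't help
  }
}
```

Returning 200 for an invalid payload is deliberate: Shopify would otherwise redeliver the same unprocessable payload for 48 hours, generating noise without ever succeeding.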