
Architecture Decision Records (ADR)

Project: Forma3D.Connect
Version: 4.5
Last Updated: March 22, 2026

This document captures the significant architectural decisions made during the development of Forma3D.Connect.


Table of Contents

  1. ADR-001: Monorepo with Nx
  2. ADR-002: NestJS for Backend Framework
  3. ADR-003: React 19 with Vite for Frontend
  4. ADR-004: PostgreSQL with Prisma ORM
  5. ADR-005: TypeScript Strict Mode
  6. ADR-006: Azure DevOps for CI/CD with Digital Ocean Hosting
  7. ADR-007: Layered Architecture with Repository Pattern
  8. ADR-008: Event-Driven Internal Communication
  9. ADR-009: OpenAPI/Swagger for API Documentation
  10. ADR-010: HMAC Verification for Webhooks
  11. ADR-011: Idempotent Webhook Processing
  12. ADR-012: Assembly Parts Model for Product Mapping
  13. ADR-013: Shared Domain Library
  14. ADR-014: SimplyPrint as Unified Print Farm Controller
  15. ADR-015: Aikido Security Platform (Superseded)
  16. ADR-016: Sentry Observability with OpenTelemetry
  17. ADR-017: Docker + Traefik Deployment Strategy
  18. ADR-018: Nx Affected Conditional Deployment Strategy
  19. ADR-019: SimplyPrint Webhook Verification
  20. ADR-020: Hybrid Status Monitoring (Polling + Webhooks)
  21. ADR-021: Retry Queue with Exponential Backoff
  22. ADR-022: Event-Driven Fulfillment Architecture
  23. ADR-023: Email Notification Strategy
  24. ADR-024: API Key Authentication for Admin Endpoints
  25. ADR-025: Cosign Image Signing for Supply Chain Security
  26. ADR-026: CycloneDX SBOM Attestations
  27. ADR-027: TanStack Query for Server State Management
  28. ADR-028: Socket.IO for Real-Time Dashboard Updates
  29. ADR-029: API Key Authentication for Dashboard
  30. ADR-030: Sendcloud for Shipping Integration
  31. ADR-031: Automated Container Registry Cleanup
  32. ADR-032: Domain Boundary Separation with Interface Contracts
  33. ADR-033: Database-Backed Webhook Idempotency
  34. ADR-034: Docker Log Rotation & Resource Cleanup
  35. ADR-035: Progressive Web App (PWA) for Cross-Platform Access
  36. ADR-036: localStorage Fallback for PWA Install Detection
  37. ADR-037: Keep a Changelog for Release Documentation
  38. ADR-038: Zensical for Publishing Project Documentation
  39. ADR-039: Global API Key Authentication (Fail-Closed)
  40. ADR-040: Shopify Order Backfill for Downtime Recovery
  41. ADR-041: SimplyPrint Webhook Idempotency and Job Reconciliation
  42. ADR-042: SendCloud Webhook Integration for Shipment Status Updates
  43. ADR-043: PWA Version Mismatch Detection
  44. ADR-044: Role-Based Access Control and Tenant-Ready Architecture
  45. ADR-045: pgAdmin for Staging Database Administration
  46. ADR-046: PostgreSQL Session Store for Persistent Authentication
  47. ADR-047: Two-Tier Logging Strategy (Application + Business Events)
  48. ADR-048: Shopify OAuth 2.0 Authentication
  49. ADR-049: Optional SKU with Shopify Product/Variant ID Matching Priority
  50. ADR-050: Apache ECharts for Dashboard Analytics
  51. ADR-051: Decompose Monolithic API into Domain-Aligned Microservices
  52. ADR-052: BullMQ Event Queues for Inter-Service Async Communication
  53. ADR-053: Buffer-Based GridFlock Pipeline (No Local File Storage)
  54. ADR-054: SimplyPrint API Files for Gcode Upload
  55. ADR-055: BambuStudio CLI Slicer Container
  56. ADR-056: Redis for Sessions, Event Queues, and Socket.IO Adapter
  57. ADR-057: Self-Hosted Build Agent with Hybrid Pipeline Strategy
  58. ADR-058: Self-Hosted Log Infrastructure (ClickHouse + Grafana via OpenTelemetry)
  59. ADR-059: Nx Affected Resilience via Last-Successful-Deploy Tag
  60. ADR-060: Single Source of Truth for STL Preview Generation
  61. ADR-061: Plate-Level Preview Cache with Dynamic Border Assembly
  62. ADR-062: Inventory Tracking and Stock Replenishment
  63. ADR-063: ORDER-over-STOCK Print Queue Priority
  64. ADR-064: Stock Replenishment Event Subscriber for SimplyPrint Queue
  65. ADR-065: SonarCloud for Continuous Code Quality Analysis
  66. ADR-066: CodeCharta City Visualization for Codebase Insight
  67. ADR-067: Grype CVE Scanning with EPSS-Informed Risk Acceptance
  68. ADR-068: Dependency License Compliance Check
  69. ADR-069: Agent CLAUDE.md Governance — Repo as Source of Truth
  70. ADR-070: Per-Agent Claude Model Selection

ADR-001: Monorepo with Nx

Attribute Value
ID ADR-001
Status Accepted
Date 2026-01-09
Context Need to manage multiple applications (API, Web, Desktop, Mobile) and shared libraries in a single repository

Decision

Use Nx (v19.x) as the monorepo management tool with pnpm as the package manager.

Rationale

  • Unified tooling: Single command to build, test, lint all projects
  • Dependency graph: Nx understands project dependencies and can run only affected tests
  • Caching: Local and remote caching speeds up CI/CD pipelines
  • Code sharing: Shared libraries (@forma3d/domain, @forma3d/utils, etc.) are first-class citizens
  • Plugin ecosystem: Built-in support for NestJS, React, and other frameworks

Consequences

  • ✅ Fast CI through affected commands and caching
  • ✅ Consistent tooling across all projects
  • ✅ Easy code sharing via path aliases
  • ⚠️ Learning curve for developers unfamiliar with Nx
  • ⚠️ Initial setup complexity

Alternatives Considered

Alternative Reason for Rejection
Turborepo Less mature NestJS support
Lerna Deprecated in favor of Nx
Separate repositories Too much overhead for shared code

ADR-002: NestJS for Backend Framework

Attribute Value
ID ADR-002
Status Accepted
Date 2026-01-09
Context Need a robust, scalable backend framework for the integration API

Decision

Use NestJS (v10.x) as the backend framework.

Rationale

  • Enterprise-grade: Built-in support for dependency injection, modules, guards, interceptors
  • TypeScript-first: Native TypeScript support with decorators
  • Modular architecture: Easy to organize code by feature
  • Excellent documentation: Well-documented with active community
  • Testing support: Built-in testing utilities with Jest
  • OpenAPI support: First-class Swagger/OpenAPI integration via @nestjs/swagger

Consequences

  • ✅ Clean, maintainable code structure
  • ✅ Easy to add new features as modules
  • ✅ Built-in validation with class-validator
  • ✅ Excellent integration with Prisma
  • ⚠️ Verbose compared to Express.js
  • ⚠️ Decorator-heavy syntax

Alternatives Considered

Alternative Reason for Rejection
Express.js Too low-level, lacks structure
Fastify Less ecosystem support
Hono Too new, less enterprise features

ADR-003: React 19 with Vite for Frontend

Attribute Value
ID ADR-003
Status Accepted
Date 2026-01-09
Context Need a modern frontend framework for the admin dashboard

Decision

Use React 19 with Vite as the bundler and Tailwind CSS for styling.

Rationale

  • React 19: Latest version with improved performance and new features
  • Vite: Extremely fast development server and build times
  • Tailwind CSS: Utility-first CSS for rapid UI development
  • TanStack Query: Excellent server state management
  • React Router: Standard routing solution

Consequences

  • ✅ Fast development experience with HMR
  • ✅ Modern React features (Server Components ready)
  • ✅ Consistent styling with Tailwind
  • ⚠️ Tailwind learning curve for traditional CSS developers

Alternatives Considered

Alternative Reason for Rejection
Next.js Overkill for admin dashboard, SSR not needed
Angular Less flexibility, steeper learning curve
Vue.js Team expertise in React

ADR-004: PostgreSQL with Prisma ORM

Attribute Value
ID ADR-004
Status Accepted
Date 2026-01-09
Context Need a reliable database with type-safe access

Decision

Use PostgreSQL 16 as the database with Prisma 5 as the ORM.

Rationale

  • PostgreSQL: Robust, ACID-compliant, excellent JSON support
  • Prisma: Type-safe database access, auto-generated client
  • Schema-first: Prisma schema as single source of truth
  • Migrations: Built-in migration system
  • Studio: Visual database browser for development

Consequences

  • ✅ Full type safety from database to API
  • ✅ Easy schema changes with migrations
  • ✅ No raw SQL in application code
  • ⚠️ Prisma Client must be regenerated after schema changes
  • ⚠️ Some complex queries require raw SQL

Schema Design Decisions

  • UUIDs for primary keys (portability, no sequence conflicts)
  • JSON columns for flexible data (shipping address, print profiles)
  • Decimal type for monetary values (precision)
  • Timestamps with timezone (audit trail)

ADR-005: TypeScript Strict Mode

Attribute Value
ID ADR-005
Status Accepted
Date 2026-01-09
Context Need to ensure code quality and catch errors early

Decision

Enable TypeScript strict mode with additional strict checks:

{
  "strict": true,
  "noImplicitAny": true,
  "strictNullChecks": true,
  "noUnusedLocals": true,
  "noUnusedParameters": true
}

Rationale

  • Early error detection: Catch type errors at compile time
  • Self-documenting code: Types serve as documentation
  • Refactoring safety: IDE can safely refactor with full type information
  • No any type: Prevents type escape hatches

Consequences

  • ✅ Higher code quality
  • ✅ Better IDE support and autocomplete
  • ✅ Safer refactoring
  • ⚠️ More verbose code with explicit types
  • ⚠️ Stricter null checking requires careful handling
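
As a small illustration of what strictNullChecks enforces (a hedged sketch; formatEmail is a hypothetical helper, not project code):

```typescript
// With strictNullChecks, a possibly-null value must be narrowed before use.
function formatEmail(email: string | null): string {
  // Returning email.toLowerCase() directly would be a compile error:
  // "'email' is possibly 'null'."
  if (email === null) {
    return '(no email on file)';
  }
  return email.toLowerCase();
}

console.log(formatEmail('Jane@Example.com')); // jane@example.com
console.log(formatEmail(null)); // (no email on file)
```

The null branch is forced by the compiler, which is exactly the "careful handling" the consequence above refers to.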

ADR-006: Azure DevOps for CI/CD with Digital Ocean Hosting

Attribute Value
ID ADR-006
Status Accepted
Date 2026-01-09
Context Need a CI/CD pipeline for automated testing and deployment

Decision

Use Azure DevOps Pipelines for CI/CD and Digital Ocean for hosting.

Rationale

  • Azure DevOps: Existing team expertise with YAML pipelines
  • Digital Ocean: Cost-effective, simple infrastructure for small-scale deployment
  • Separation of concerns: CI/CD tooling separate from hosting
  • Docker-based: Consistent container deployment across environments
  • Managed Database: Digital Ocean managed PostgreSQL for reliability

Infrastructure

Component Service Purpose
CI/CD Azure DevOps Pipelines Build, test, deploy automation
Container Registry Digital Ocean Registry Docker image storage
Staging Digital Ocean Droplet Staging environment
Production Digital Ocean Droplet Production environment
Database Digital Ocean Managed PostgreSQL Data persistence

Pipeline Stages

  1. Validate & Test: Lint, type check, and unit tests (parallel across 3 agents)
  2. Build & Package: Detect affected, build projects, Docker images on self-hosted agents
  3. Deploy Staging: Auto-deploy affected services on main branch
  4. Acceptance Test: Playwright tests against staging
  5. Deploy Production: Manual approval gate

Updated Feb 2026: Stages merged and agents optimized per ADR-057.

Consequences

  • ✅ Automated quality gates
  • ✅ Fast feedback on PRs
  • ✅ Consistent deployments
  • ✅ Cost-effective hosting
  • ⚠️ Need to manage Docker deployments on Droplets

ADR-007: Layered Architecture with Repository Pattern

Attribute Value
ID ADR-007
Status Accepted
Date 2026-01-09
Context Need a clean separation of concerns in the backend

Decision

Implement a layered architecture with the Repository Pattern:

(UML diagram omitted)

Layer Responsibilities

Layer Responsibility Example
Controller HTTP handling, validation, routing OrdersController
Service Business logic, orchestration OrdersService
Repository Data access, Prisma queries OrdersRepository
DTO Data transfer, validation CreateOrderDto

Rationale

  • Testability: Each layer can be tested in isolation
  • Single responsibility: Clear separation of concerns
  • Flexibility: Easy to swap implementations (e.g., different databases)
  • Maintainability: Changes in one layer don't affect others

Consequences

  • ✅ Clean, maintainable code
  • ✅ Easy to unit test with mocks
  • ✅ Prisma isolated to repository layer
  • ⚠️ More files per feature
  • ⚠️ Some boilerplate code
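
The layering can be sketched in plain TypeScript (framework decorators and the actual Order/OrdersRepository shapes omitted; names here are illustrative, not the project's real code):

```typescript
// Minimal illustrative entity.
interface Order {
  id: string;
  status: string;
}

// Repository abstraction: the only layer that would touch Prisma in the real app.
interface OrdersRepository {
  findById(id: string): Promise<Order | null>;
  updateStatus(id: string, status: string): Promise<Order>;
}

// Service: business logic, depends only on the repository interface.
class OrdersService {
  constructor(private readonly repo: OrdersRepository) {}

  async cancelOrder(id: string): Promise<Order> {
    const order = await this.repo.findById(id);
    if (!order) throw new Error(`Order ${id} not found`);
    return this.repo.updateStatus(id, 'CANCELLED');
  }
}

// In unit tests, an in-memory repository replaces Prisma entirely.
class InMemoryOrdersRepository implements OrdersRepository {
  private orders = new Map<string, Order>([['o1', { id: 'o1', status: 'NEW' }]]);

  async findById(id: string): Promise<Order | null> {
    return this.orders.get(id) ?? null;
  }

  async updateStatus(id: string, status: string): Promise<Order> {
    const order = { ...this.orders.get(id)!, status };
    this.orders.set(id, order);
    return order;
  }
}
```

Swapping InMemoryOrdersRepository for the Prisma-backed implementation is the "easy to swap implementations" benefit listed above.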

ADR-008: Event-Driven Internal Communication

Attribute Value
ID ADR-008
Status ✅ Implemented
Date 2026-01-09 (Updated: 2026-01-25)
Context Need to decouple components and trigger actions on state changes

Decision

Use NestJS EventEmitter for internal event-driven communication.

Events Defined

Order Events (ORDER_EVENTS)

Event Trigger Listeners
order.created New order from Shopify webhook OrchestrationService, EventsGateway, PushService
order.status_changed Order status update EventsGateway, PushService
order.cancelled Order cancellation CancellationService, EventsGateway
order.ready-for-fulfillment All print jobs completed SendcloudService, FulfillmentService, EventsGateway
order.fulfilled Order shipped EventsGateway
order.failed Order processing failed EventsGateway
Print Job Events

Event Trigger Listeners
printjob.created Print job created in SimplyPrint EventsGateway
printjob.status-changed Print job status update OrchestrationService, EventsGateway, PushService
printjob.completed Print job finished successfully OrchestrationService, EventsGateway
printjob.failed Print job failed OrchestrationService, EventsGateway, NotificationsService
printjob.cancelled Print job cancelled EventsGateway
printjob.retry-requested Print job retry initiated (EventLogService)

Orchestration Events (ORCHESTRATION_EVENTS)

Event Trigger Listeners
order.ready-for-fulfillment All print jobs for order complete SendcloudService, FulfillmentService
order.partially-completed Some jobs complete, some pending (Logging)
order.all-jobs-failed All print jobs for order failed (Logging)

SimplyPrint Events (SIMPLYPRINT_EVENTS)

Event Trigger Listeners
simplyprint.job-status-changed SimplyPrint webhook/poll update PrintJobsService

Shipment Events (SHIPMENT_EVENTS)

Event Trigger Listeners
shipment.created Shipment created FulfillmentService, PushService
shipment.label-ready Shipping label downloaded PushService
shipment.failed Shipment creation failed (Logging)
shipment.updated Shipment status update (Logging)

SendCloud Webhook Events (SENDCLOUD_WEBHOOK_EVENTS)

Event Trigger Listeners
sendcloud.shipment.status_changed SendCloud webhook/reconciliation ShipmentsService

Fulfillment Events (FULFILLMENT_EVENTS)

Event Trigger Listeners
fulfillment.created Shopify fulfillment created (Logging)
fulfillment.failed Shopify fulfillment failed NotificationsService
fulfillment.retrying Fulfillment retry in progress (Logging)

Event Flow Diagram

Shopify Webhook → OrdersService → order.created
                                       ↓
                              OrchestrationService
                                       ↓
                              PrintJobsService → printjob.created
                                       ↓
                              SimplyPrint API

SimplyPrint Webhook → SimplyPrintService → simplyprint.job-status-changed
                                                  ↓
                                          PrintJobsService → printjob.status-changed
                                                  ↓                    ↓
                                          printjob.completed    printjob.failed
                                                  ↓                    ↓
                                          OrchestrationService ←───────┘
                                                  ↓
                                          order.ready-for-fulfillment
                                                  ↓
                                ┌─────────────────┴─────────────────┐
                                ↓                                   ↓
                        SendcloudService                   FulfillmentService
                                ↓                                   ↓
                        shipment.created ──────────────────→ Shopify Fulfillment
                                ↓
                        SendCloud Webhook → sendcloud.shipment.status_changed

Rationale

  • Decoupling: Services don't directly depend on each other
  • Extensibility: Easy to add new listeners
  • Async processing: Events can be processed asynchronously
  • Audit trail: Events naturally support logging
  • Orchestration: Clean separation between job creation and completion tracking
  • Real-time updates: EventsGateway broadcasts to dashboard via Socket.IO
  • Push notifications: PushService sends alerts to subscribed PWA clients

Consequences

  • ✅ Loose coupling between modules
  • ✅ Easy to add new functionality
  • ✅ Clear event flow
  • ✅ Enables reactive order completion tracking
  • ✅ Real-time dashboard updates via Socket.IO
  • ✅ Push notifications for mobile/desktop PWA
  • ⚠️ Harder to trace execution flow (mitigated by correlation IDs and logging)
  • ⚠️ Eventual consistency considerations
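
The decoupling above can be sketched with Node's built-in EventEmitter (the real app uses @nestjs/event-emitter with @OnEvent decorators; the event name matches the tables above, the listener bodies are illustrative):

```typescript
import { EventEmitter } from 'node:events';

const events = new EventEmitter();
const audit: string[] = [];

// Listeners register independently — the producer never knows who consumes.
events.on('printjob.completed', (jobId: string) => {
  audit.push(`orchestration: job ${jobId} complete`);
});
events.on('printjob.completed', (jobId: string) => {
  audit.push(`gateway: broadcast job ${jobId}`);
});

// The producer just emits; adding a new listener requires no change here.
events.emit('printjob.completed', 'job-42');
```

Adding, say, a notification listener later is a one-line registration, which is the extensibility benefit claimed above.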

ADR-009: OpenAPI/Swagger for API Documentation

Attribute Value
ID ADR-009
Status Accepted
Date 2026-01-10
Context Need interactive API documentation for developers

Decision

Use @nestjs/swagger for OpenAPI 3.0 documentation with Swagger UI.

Implementation

  • Swagger UI: Available at /api/docs
  • OpenAPI JSON: Available at /api/docs-json
  • Environment restriction: Only enabled in non-production
  • Decorator-based: All endpoints documented via decorators

Decorators Used

Decorator Purpose
@ApiTags Group endpoints by feature
@ApiOperation Describe endpoint purpose
@ApiResponse Document response schemas
@ApiProperty Document DTO properties
@ApiParam Document path parameters
@ApiQuery Document query parameters

Consequences

  • ✅ Interactive API testing
  • ✅ Auto-generated documentation
  • ✅ Type-safe documentation
  • ⚠️ Must keep decorators in sync with code

ADR-010: HMAC Verification for Webhooks

Attribute Value
ID ADR-010
Status Accepted
Date 2026-01-09
Context Need to verify webhook requests are genuinely from Shopify

Decision

Implement HMAC-SHA256 signature verification for all Shopify webhooks.

Implementation

// ShopifyWebhookGuard (excerpt)
const hash = crypto.createHmac('sha256', webhookSecret).update(rawBody, 'utf8').digest('base64');

// timingSafeEqual throws if the buffers differ in length, so compare lengths first
const expected = Buffer.from(hash);
const received = Buffer.from(hmacHeader);
return expected.length === received.length && crypto.timingSafeEqual(expected, received);

Rationale

  • Security: Prevents forged webhook requests
  • Shopify standard: Required by Shopify webhook specification
  • Timing-safe comparison: Prevents timing attacks
  • Raw body access: NestJS configured to preserve raw body

Consequences

  • ✅ Secure webhook endpoint
  • ✅ Compliant with Shopify requirements
  • ⚠️ Requires raw body access (special NestJS configuration)

ADR-011: Idempotent Webhook Processing

Attribute Value
ID ADR-011
Status Accepted
Date 2026-01-09
Context Shopify may send duplicate webhooks; need to handle gracefully

Decision

Implement idempotent webhook processing using:

  1. Webhook ID tracking (in-memory Set)
  2. Database unique constraints (shopifyOrderId)

Implementation

// ShopifyService
private readonly processedWebhooks = new Set<string>();

if (this.processedWebhooks.has(webhookId)) {
  return; // Skip duplicate
}
this.processedWebhooks.add(webhookId);

// OrdersService
const existing = await this.ordersRepository.findByShopifyOrderId(id);
if (existing) {
  return existing; // Return existing, don't create duplicate
}

Consequences

  • ✅ No duplicate orders created
  • ✅ Safe to retry failed webhooks
  • ⚠️ In-memory Set resets on restart (database constraint is primary guard)

ADR-012: Assembly Parts Model for Product Mapping

Attribute Value
ID ADR-012
Status Accepted
Date 2026-01-09
Context A single Shopify product may require multiple 3D printed parts

Decision

Implement ProductMapping → AssemblyPart one-to-many relationship.

Data Model

(UML diagram omitted)

Fields

  • ProductMapping: shopifyProductId, SKU (optional), defaultPrintProfile
  • AssemblyPart: partName, partNumber, simplyPrintFileId, quantityPerProduct

Rationale

  • Flexibility: Supports both single-part and multi-part products
  • Quantity support: quantityPerProduct for parts needed multiple times (e.g., 4 wheels)
  • Print profiles: Override default profile per part if needed

Consequences

  • ✅ Supports complex assemblies
  • ✅ Clear part ordering via partNumber
  • ✅ Flexible print settings per part
  • ⚠️ More complex order processing logic
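
The quantity logic can be made concrete with a short sketch (types mirror the fields listed above; the names and sample data are illustrative):

```typescript
// Illustrative shapes based on the fields listed above.
interface AssemblyPart {
  partName: string;
  partNumber: number;
  simplyPrintFileId: string;
  quantityPerProduct: number;
}

interface ProductMapping {
  shopifyProductId: string;
  sku?: string;
  parts: AssemblyPart[];
}

// Total printed parts for a line item:
// sum over parts of (quantityPerProduct × ordered quantity).
function printJobCount(mapping: ProductMapping, orderedQty: number): number {
  return mapping.parts.reduce((sum, p) => sum + p.quantityPerProduct * orderedQty, 0);
}

const toyCar: ProductMapping = {
  shopifyProductId: 'gid://shopify/Product/1',
  parts: [
    { partName: 'chassis', partNumber: 1, simplyPrintFileId: 'f1', quantityPerProduct: 1 },
    { partName: 'wheel', partNumber: 2, simplyPrintFileId: 'f2', quantityPerProduct: 4 },
  ],
};
// Ordering 2 cars: 2 × (1 chassis + 4 wheels) = 10 printed parts.
```

This is the "more complex order processing logic" the consequence refers to: one line item fans out into many print jobs.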

ADR-013: Shared Domain Library

Attribute Value
ID ADR-013
Status Accepted
Date 2026-01-09
Context Need to share types between frontend, backend, and external integrations

Decision

Create a shared @forma3d/domain library containing:

  • Entity types
  • Enums
  • Shopify types
  • Common interfaces

Structure

libs/domain/src/
├── entities/
│   ├── order.ts
│   ├── line-item.ts
│   ├── print-job.ts
│   └── product-mapping.ts
├── enums/
│   ├── order-status.ts
│   ├── line-item-status.ts
│   └── print-job-status.ts
├── shopify/
│   ├── shopify-order.entity.ts
│   └── shopify-product.entity.ts
└── index.ts

Rationale

  • Single source of truth: Types defined once, used everywhere
  • Type safety: Frontend and backend share exact same types
  • Nx integration: Clean imports via path aliases

Consequences

  • ✅ Consistent types across codebase
  • ✅ No type drift between frontend/backend
  • ✅ Easy to update types in one place
  • ⚠️ Must rebuild library on changes
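
A shared enum illustrates the pattern (the values and helper below are hypothetical, not the library's actual contents; both apps would import via the @forma3d/domain path alias):

```typescript
// Illustrative shape of libs/domain/src/enums/order-status.ts —
// values are invented for this example.
enum OrderStatus {
  RECEIVED = 'RECEIVED',
  PRINTING = 'PRINTING',
  READY_FOR_FULFILLMENT = 'READY_FOR_FULFILLMENT',
  FULFILLED = 'FULFILLED',
}

// A helper defined once and reused by API and Web alike, e.g.
// import { OrderStatus, isTerminal } from '@forma3d/domain';
function isTerminal(status: OrderStatus): boolean {
  return status === OrderStatus.FULFILLED;
}
```

Because both sides compile against the same declaration, a renamed enum member breaks the build everywhere at once instead of drifting silently.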

ADR-014: SimplyPrint as Unified Print Farm Controller

Attribute Value
ID ADR-014
Status ✅ Implemented (Phase 2)
Date 2026-01-10 (Updated: 2026-01-13)
Context Need to control multiple 3D printer brands (Prusa, Bambu Lab) from one API

Decision

Use SimplyPrint as the unified print farm management solution with an edge device connecting to all printers via LAN.

Architecture

(UML diagram omitted)

Rationale

  • Unified API: Single integration point for all printer brands
  • LAN mode: Direct communication with printers, no cloud dependency for print control
  • Edge device: Handles printer communication, buffering, and monitoring
  • Multi-brand support: Prusa and Bambu Lab printers managed together
  • No Bambu Cloud dependency: Avoids Bambu Lab Cloud API limitations

Printer Support

Brand Models Connection
Prusa MK3S+, XL, Mini LAN via SimplyPrint edge device
Bambu Lab X1 Carbon, P1S LAN via SimplyPrint edge device

Implementation Details (Phase 2)

API Client (apps/api/src/simplyprint/simplyprint-api.client.ts):

  • HTTP Basic Authentication with Company ID and API Key
  • Typed methods for files, jobs, printers, and queue operations
  • Automatic connection verification on startup
  • Sentry integration for 5xx error tracking

Webhook Controller (apps/api/src/simplyprint/simplyprint-webhook.controller.ts):

  • Endpoint: POST /webhooks/simplyprint
  • X-SP-Token verification via guard
  • Event-driven status updates

Print Jobs Service (apps/api/src/print-jobs/print-jobs.service.ts):

  • Creates print jobs in SimplyPrint when orders arrive
  • Updates local status based on SimplyPrint events
  • Supports cancel and retry operations

API Endpoints Used:

Endpoint Method Purpose
/{companyId}/files/GetFiles GET List available print files
/{companyId}/printers/Get GET Get printer statuses
/{companyId}/printers/actions/CreateJob POST Create new print job
/{companyId}/printers/actions/Cancel POST Cancel active job
/{companyId}/queue/GetItems GET Get queue items
/{companyId}/queue/AddItem POST Add item to queue
/{companyId}/queue/RemoveItem POST Remove from queue
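
Request construction for the endpoints above might look like the following sketch. The base URL and the companyId:apiKey Basic-auth layout are assumptions for illustration; the authoritative scheme lives in simplyprint-api.client.ts and the SimplyPrint API docs:

```typescript
// Hypothetical request builder for the endpoint table above.
// ASSUMPTIONS: base URL and user:password layout are illustrative only.
function buildRequest(
  companyId: number,
  apiKey: string,
  endpoint: string,
): { url: string; headers: Record<string, string> } {
  // Doc states HTTP Basic auth with Company ID and API Key;
  // encoding them as "companyId:apiKey" here is an assumption.
  const credentials = Buffer.from(`${companyId}:${apiKey}`).toString('base64');
  return {
    url: `https://api.simplyprint.io/${companyId}${endpoint}`,
    headers: {
      Authorization: `Basic ${credentials}`,
      'Content-Type': 'application/json',
    },
  };
}
```

A typed client would wrap this in one method per endpoint (GetFiles, CreateJob, etc.), which is what the API client described above does.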

Consequences

  • ✅ Single API for all printers
  • ✅ No dependency on Bambu Lab Cloud
  • ✅ Local network resilience
  • ✅ Real-time printer status via edge device
  • ✅ Typed API client with full error handling
  • ✅ Webhook and polling support for status updates
  • ⚠️ Requires edge device on print farm network
  • ⚠️ SimplyPrint subscription required

ADR-015: Aikido Security Platform for Continuous Security Monitoring

Attribute Value
ID ADR-015
Status Superseded by ADR-067 (Grype CVE Scanning)
Date 2026-01-10
Context Need continuous security monitoring, vulnerability scanning, and SBOM generation

Decision

Use Aikido Security Platform as the centralized security monitoring and compliance solution integrated into the CI/CD pipeline.

Security Checks Implemented

Check Status Description
Open Source Dependency Monitoring Active Monitors 3rd party dependencies for vulnerabilities
Exposed Secrets Monitoring Compliant Detects accidentally exposed secrets in source code
License Management Compliant Validates dependency licenses for legal compliance
SAST Compliant Static Application Security Testing
IaC Testing Compliant Infrastructure as Code security analysis
Malware Detection Compliant Detects malware in dependencies
Mobile Issues Compliant Mobile manifest file monitoring
SBOM Generation Active Software Bill of Materials for supply chain security

Rationale

  • Comprehensive coverage: Single platform covers multiple security domains
  • CI/CD integration: Automated scanning on every code change
  • SBOM generation: Critical for supply chain security and compliance
  • License compliance: Automated license validation prevents legal issues
  • Developer-friendly: Clear dashboards and actionable remediation guidance
  • Proactive detection: Continuous monitoring catches issues before production

Future Enhancements

  • Code Quality Analysis: Will be enabled in a subsequent phase to complement security scanning

Consequences

  • ✅ Continuous security visibility across the codebase
  • ✅ Automated vulnerability detection in dependencies
  • ✅ SBOM generation for supply chain transparency
  • ✅ License compliance validation
  • ✅ Secrets exposure prevention
  • ⚠️ Requires Aikido platform subscription
  • ⚠️ May flag false positives requiring triage

Alternatives Considered

Alternative Reason for Rejection
Snyk More expensive, less comprehensive for our needs
GitHub Advanced Security Limited to GitHub, not as comprehensive
Manual audits Not scalable, too slow for continuous delivery
Dependabot only Only covers dependency vulnerabilities, not comprehensive

ADR-016: Sentry Observability with OpenTelemetry

Attribute Value
ID ADR-016
Status ✅ Implemented (Updated by ADR-058: structured logging moved to ClickHouse)
Date 2026-01-10
Context Need comprehensive observability: error tracking, performance monitoring, distributed tracing

Decision

Use Sentry as the observability platform with an OpenTelemetry-first architecture for vendor neutrality.

Architecture

(UML diagram omitted)

Implementation Details

Backend (NestJS):

  • @sentry/nestjs for error tracking and performance
  • @sentry/profiling-node for profiling
  • nestjs-pino for structured JSON logging
  • OpenTelemetry auto-instrumentation for Prisma queries
  • Global exception filter with Sentry capture
  • Logging interceptor with correlation IDs

Frontend (React):

  • @sentry/react for error tracking
  • Custom ErrorBoundary component with Sentry integration
  • Browser tracing for page navigation
  • User-friendly error fallback UI

Sampling Configuration (Free Tier Compatible):

Environment Traces Profiles Errors
Development 100% 100% 100%
Production 10% 10% 100%
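
The table maps onto the SDK's tracesSampleRate/profilesSampleRate options roughly as follows (a sketch; the helper function is ours, not the shared sentry.config.ts — errors are always captured, so only traces and profiles are sampled down):

```typescript
interface SamplingConfig {
  tracesSampleRate: number;
  profilesSampleRate: number;
}

// Derive sampling rates from the environment per the table above.
function samplingFor(env: string): SamplingConfig {
  const isProd = env === 'production';
  return {
    tracesSampleRate: isProd ? 0.1 : 1.0,
    profilesSampleRate: isProd ? 0.1 : 1.0,
  };
}

// These values would be spread into Sentry.init({ dsn, ...samplingFor(env) }).
```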

Rationale

  • Sentry: Industry-leading error tracking with excellent stack trace support
  • OpenTelemetry: Vendor-neutral instrumentation standard, future-proof
  • Structured Logging: JSON logs enable log aggregation and searching
  • Correlation IDs: End-to-end request tracing across frontend and backend
  • Free Tier: Sufficient for small-scale production (10K errors/month)

Data Privacy

Sensitive data is automatically scrubbed:

  • Authorization headers
  • Cookies
  • API tokens
  • Passwords
  • Shopify access tokens

Implementation Details (Phase 1b)

Backend (apps/api):

  • instrument.ts - Sentry initialization with profiling (imported first in main.ts)
  • ObservabilityModule - Global module with Pino logger and Sentry integration
  • SentryExceptionFilter - Captures all exceptions with request context
  • LoggingInterceptor - Request/response logging with correlation IDs
  • ObservabilityController - Test endpoints for verifying observability (non-prod only)
  • Prisma service enhanced with Sentry breadcrumbs for query tracing

Frontend (apps/web):

  • sentry.ts - Sentry initialization with browser tracing and session replay
  • ErrorBoundary.tsx - React error boundary with Sentry integration

Shared Library (libs/observability):

  • sentry.config.ts - Shared Sentry configuration with 100% sampling
  • otel.config.ts - OpenTelemetry configuration
  • constants.ts - Trace/request ID header constants

Sampling Decision:

  • 100% sampling for all environments (traces and profiles)
  • Rationale: Full visibility needed during early development
  • Can be reduced when traffic increases and limits are reached

Consequences

  • ✅ Comprehensive error visibility with stack traces and context
  • ✅ Performance monitoring for API endpoints and database queries
  • ✅ Distributed tracing across frontend and backend
  • ✅ Structured logs with correlation IDs for debugging
  • ✅ Vendor-neutral instrumentation via OpenTelemetry
  • ✅ Test endpoints for verifying observability in development
  • ⚠️ Requires Sentry account (free tier available)
  • ⚠️ Must initialize Sentry before other imports in main.ts
  • ⚠️ 100% sampling may hit free tier limits with high traffic

Alternatives Considered

Alternative Reason for Rejection
Datadog Expensive for small-scale, overkill for current needs
New Relic Expensive, complex pricing model
Grafana + Loki Requires self-hosting, more operational overhead
ELK Stack Complex to set up and maintain, expensive at scale
Console.log only No centralized visibility, hard to debug production issues

ADR-017: Docker + Traefik Deployment Strategy

Attribute Value
ID ADR-017
Status ⏳ In Progress
Date 2026-01-10
Context Need a deployment strategy for staging/production on DigitalOcean with TLS and zero-downtime

Decision

Use Docker Compose with Traefik reverse proxy for deploying to DigitalOcean Droplets.

Architecture

(UML diagram omitted)

Deployment Components

Component Technology Purpose
Reverse Proxy Traefik v3 TLS termination, routing, load balancing
TLS Certificates Let's Encrypt Automatic certificate issuance/renewal
Container Orchestration Docker Compose Service definition and networking
Image Registry DigitalOcean Registry Private Docker image storage
Database DO Managed PostgreSQL Persistent data storage with TLS

Traefik Configuration

Feature Implementation
Entry Points HTTP (:80) with redirect to HTTPS (:443)
Certificate Resolver Let's Encrypt with HTTP challenge
Service Discovery Docker labels on containers
Health Checks HTTP health endpoints (/health, /health/live, /health/ready)
Logging JSON format for log aggregation

Staging URLs

Service URL
API https://staging-connect-api.forma3d.be
Web https://staging-connect.forma3d.be

Pipeline Integration

Stage Trigger Action
Package develop branch Build Docker images, push to DO Registry
Deploy Staging develop branch SSH + docker compose up
Deploy Production main branch Manual approval + SSH deploy

Image Tagging Strategy

Tag Format Example Purpose
Pipeline Instance 20260110143709 Immutable deployment reference
Latest latest Convenience for development

Database Migration Strategy

Prisma migrations run before container deployment:

# Executed in pipeline before docker compose up
docker compose run --rm api npx prisma migrate deploy

Rationale

  • Traefik: Automatic TLS, Docker-native, label-based configuration
  • Docker Compose: Simple, declarative, easy to understand
  • SSH deployment: Direct control, no additional orchestration overhead
  • Managed PostgreSQL: Reliability, automated backups, TLS built-in
  • Let's Encrypt: Free, automated TLS certificates

Zero-Downtime Deployment

# Pull new images
docker compose pull

# Run migrations (idempotent)
docker compose run --rm api npx prisma migrate deploy

# Start new containers (Compose handles replacement)
docker compose up -d --remove-orphans

# Clean up old images
docker image prune -f

Consequences

  • ✅ Automatic TLS certificate management
  • ✅ Simple deployment via SSH + Docker Compose
  • ✅ Zero-downtime container replacement
  • ✅ Docker labels for routing configuration
  • ✅ Consistent image tagging with pipeline ID
  • ⚠️ Single droplet = single point of failure (acceptable for staging)
  • ⚠️ Requires manual SSH key management in Azure DevOps

Alternatives Considered

Alternative Reason for Rejection
Kubernetes Overkill for current scale, operational complexity
Docker Swarm Less ecosystem support, not needed for single-node
Nginx Manual certificate management, less dynamic
Caddy Less mature Docker integration than Traefik
DigitalOcean App Platform Less control, higher cost

ADR-018: Nx Affected Conditional Deployment Strategy

Attribute Value
ID ADR-018
Status ✅ Implemented
Date 2026-01-11
Context Need to avoid unnecessary Docker builds and deployments when only part of the codebase changes

Decision

Use Nx affected to detect which applications have changed and conditionally run package/deploy stages only for affected apps.

Architecture

(UML diagram of the affected-detection pipeline flow — not reproduced in this export.)

Pipeline Parameters

Parameter Type Default Purpose
ForceFullVersioningAndDeployment boolean true Bypass affected detection, deploy all apps
breakingMigration boolean false Stop API before migrations

How Affected Detection Works

The pipeline runs pnpm nx show projects --affected --type=app to identify which applications have changed compared to the base branch (origin/main).

Scenarios:

Change Location API Affected Web Affected Reason
apps/api/** ✅ ❌ Only API code changed
apps/web/** ❌ ✅ Only Web code changed
libs/domain/** ✅ ✅ Shared library affects both apps
libs/api-client/** ❌ ✅ API client only used by Web
prisma/** ✅ ❌ Database schema affects API
docs/**, *.md ❌ ❌ Docs are published as a separate static site (Zensical)
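
The scenarios above can be expressed as a small decision helper. The sketch below assumes the Nx project names are `api` and `web` and simply parses the command's newline-separated output; the actual pipeline script may differ.

```typescript
// Sketch: deciding which apps to deploy from `pnpm nx show projects --affected --type=app` output.
// Project names ("api", "web") and the parsing are illustrative, not the real pipeline script.

function parseAffectedApps(nxOutput: string): { api: boolean; web: boolean } {
  const affected = new Set(
    nxOutput
      .split('\n')
      .map((line) => line.trim())
      .filter((line) => line.length > 0),
  );
  return { api: affected.has('api'), web: affected.has('web') };
}

// e.g. a change under libs/domain/** affects both apps:
const flags = parseAffectedApps('api\nweb\n');
console.log(flags); // { api: true, web: true }
```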

Migration Safety

The deployment follows a specific order to ensure database safety:

  1. Pull new images (uses latest code with new Prisma schema)
  2. Stop API (only if breakingMigration=true)
  3. Run migrations (using new image via docker compose run --rm)
  4. Start API (after migrations complete)

Migration Types:

Migration Type Safe During Old API? Recommended Action
Add nullable column ✅ Safe Normal deployment
Add column with default ✅ Safe Normal deployment
Add new table ✅ Safe Normal deployment
Drop column ❌ Dangerous Use breakingMigration=true
Rename column ❌ Dangerous Use breakingMigration=true
Add non-nullable column ❌ Dangerous Use breakingMigration=true

Rationale

  • Efficiency: Avoid building/pushing Docker images when code hasn't changed
  • Cost reduction: Fewer container registry pushes, less storage used
  • Faster deployments: Only affected services are restarted
  • Cleaner versioning: New version tags only when actual code changes
  • Nx integration: Leverages existing monorepo tooling for dependency detection

Consequences

  • ✅ Significantly faster CI/CD for partial changes
  • ✅ Reduced container registry costs
  • ✅ Cleaner deployment history (versions reflect actual changes)
  • ✅ Safe migration order (migrations before restart)
  • ✅ Support for breaking migrations with explicit parameter
  • ✅ Override available via ForceFullVersioningAndDeployment parameter
  • ⚠️ First pipeline run on new branch may show all apps affected
  • ⚠️ Shared library changes trigger both app deployments (by design)
  • ⚠️ breakingMigration requires manual assessment of migration type

Alternatives Considered

Alternative Reason for Rejection
Always build both apps Wasteful, slow, unnecessary version proliferation
Manual selection of apps Error-prone, requires human decision each time
Git diff on Dockerfiles only Misses shared library changes
Separate pipelines per app Loses monorepo benefits, harder to maintain

ADR-019: SimplyPrint Webhook Verification

Attribute Value
ID ADR-019
Status ✅ Implemented
Date 2026-01-13
Context Need to verify webhook requests are genuinely from SimplyPrint

Decision

Implement X-SP-Token header verification with timing-safe comparison for all SimplyPrint webhooks.

Implementation

// SimplyPrintWebhookGuard
import {
  CanActivate,
  ExecutionContext,
  Injectable,
  Logger,
  UnauthorizedException,
} from '@nestjs/common';
import * as crypto from 'node:crypto';

@Injectable()
export class SimplyPrintWebhookGuard implements CanActivate {
  private readonly logger = new Logger(SimplyPrintWebhookGuard.name);
  private readonly webhookSecret?: string; // loaded from SIMPLYPRINT_WEBHOOK_SECRET (wiring elided)

  canActivate(context: ExecutionContext): boolean {
    const request = context.switchToHttp().getRequest();
    const token = request.headers['x-sp-token'];

    if (!this.webhookSecret) {
      this.logger.warn('SimplyPrint webhook secret not configured, skipping verification');
      return true;
    }

    if (!token) {
      throw new UnauthorizedException('Missing X-SP-Token header');
    }

    // Timing-safe comparison to prevent timing attacks.
    // timingSafeEqual throws if buffer lengths differ, so check lengths first.
    const tokenBuffer = Buffer.from(token);
    const secretBuffer = Buffer.from(this.webhookSecret);

    if (tokenBuffer.length !== secretBuffer.length) {
      throw new UnauthorizedException('Invalid SimplyPrint webhook signature');
    }

    if (!crypto.timingSafeEqual(tokenBuffer, secretBuffer)) {
      throw new UnauthorizedException('Invalid SimplyPrint webhook signature');
    }

    return true;
  }
}

Rationale

  • Security: Prevents forged webhook requests
  • SimplyPrint standard: Uses the X-SP-Token header as per SimplyPrint documentation
  • Timing-safe comparison: Prevents timing attacks on secret comparison
  • Graceful degradation: Allows bypassing verification in development when secret not configured

Webhook Endpoint

Endpoint Method Purpose
/webhooks/simplyprint POST Receive SimplyPrint events

Supported Events

Event Action
job.started Update job status to PRINTING
job.done Update job status to COMPLETED
job.failed Update job status to FAILED
job.cancelled Update job status to CANCELLED
job.paused Keep as PRINTING (temporary state)
job.resumed Keep as PRINTING
printer.* Ignored (no job status change)
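
The event table above translates directly into a mapping function. This is a standalone sketch mirroring the mapWebhookEventToStatus helper referenced later in this document; the enum is a local stand-in for the real PrintJobStatus.

```typescript
// Sketch of the event-to-status mapping; PrintJobStatus here is illustrative.
enum PrintJobStatus {
  PRINTING = 'PRINTING',
  COMPLETED = 'COMPLETED',
  FAILED = 'FAILED',
  CANCELLED = 'CANCELLED',
}

function mapWebhookEventToStatus(event: string): PrintJobStatus | null {
  switch (event) {
    case 'job.started':
      return PrintJobStatus.PRINTING;
    case 'job.done':
      return PrintJobStatus.COMPLETED;
    case 'job.failed':
      return PrintJobStatus.FAILED;
    case 'job.cancelled':
      return PrintJobStatus.CANCELLED;
    case 'job.paused':
    case 'job.resumed':
      return PrintJobStatus.PRINTING; // temporary states keep the job as PRINTING
    default:
      return null; // printer.* and unknown events cause no job status change
  }
}
```

Returning PRINTING for pause/resume is harmless because status updates are idempotent (see the deduplication logic in ADR-020).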

Consequences

  • ✅ Secure webhook endpoint
  • ✅ Protection against timing attacks
  • ✅ Clear event-to-status mapping
  • ✅ Development-friendly (optional verification)
  • ⚠️ Requires SIMPLYPRINT_WEBHOOK_SECRET environment variable

ADR-020: Hybrid Status Monitoring (Polling + Webhooks)

Attribute Value
ID ADR-020
Status ✅ Implemented
Date 2026-01-13
Context Need reliable print job status updates even if webhooks fail or are delayed

Decision

Implement a hybrid approach using both SimplyPrint webhooks (primary) and periodic polling (fallback) for job status monitoring.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Status Update Sources                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   SimplyPrint Cloud                                              │
│         │                                                        │
│         ├─── Webhooks (Primary, Real-time) ───┐                  │
│         │    • Immediate notification         │                  │
│         │    • Event: job.started/done/failed │                  │
│         │                                     ▼                  │
│         │                            SimplyPrintService          │
│         │                                     │                  │
│         └─── Polling (Fallback, 30s) ────────►│                  │
│              • @Cron every 30 seconds         │                  │
│              • Checks queue and printers      │                  │
│              • Catches missed webhooks        ▼                  │
│                                    simplyprint.job-status-changed│
│                                               │                  │
│                                               ▼                  │
│                                       PrintJobsService           │
│                                               │                  │
│                                               ▼                  │
│                                        Database Update           │
└─────────────────────────────────────────────────────────────────┘

Implementation

Webhook Handler (Primary):

async handleWebhook(payload: SimplyPrintWebhookPayload): Promise<void> {
  const jobData = payload.data.job;
  if (!jobData) return;

  const newStatus = this.mapWebhookEventToStatus(payload.event);
  if (!newStatus) return;

  this.eventEmitter.emit(SIMPLYPRINT_EVENTS.JOB_STATUS_CHANGED, {
    simplyPrintJobId: jobData.uid,
    newStatus,
    printerId: payload.data.printer?.id,
    printerName: payload.data.printer?.name,
    timestamp: new Date(payload.timestamp * 1000),
  });
}

Polling Fallback:

@Cron(CronExpression.EVERY_30_SECONDS)
async pollJobStatuses(): Promise<void> {
  if (!this.pollingEnabled || this.isPolling) return;

  this.isPolling = true;
  try {
    const printers = await this.simplyPrintClient.getPrinters();

    for (const printer of printers) {
      if (printer.currentJobId && printer.status === 'printing') {
        this.eventEmitter.emit(SIMPLYPRINT_EVENTS.JOB_STATUS_CHANGED, {
          simplyPrintJobId: printer.currentJobId,
          newStatus: PrintJobStatus.PRINTING,
          printerId: printer.id,
          printerName: printer.name,
          timestamp: new Date(),
        });
      }
    }
  } finally {
    this.isPolling = false;
  }
}

Configuration

Environment Variable Default Description
SIMPLYPRINT_POLLING_ENABLED true Enable/disable polling fallback
SIMPLYPRINT_POLLING_INTERVAL_MS 30000 Polling interval in milliseconds
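
A minimal sketch of how these settings might be read with the documented defaults; the helper name and env-parsing details are illustrative, not the service's actual code.

```typescript
// Sketch: reading the polling configuration with the defaults from the table above.
function readPollingConfig(env: Record<string, string | undefined>) {
  return {
    // Any value other than the literal string "false" keeps polling enabled.
    enabled: env.SIMPLYPRINT_POLLING_ENABLED !== 'false',
    intervalMs: Number(env.SIMPLYPRINT_POLLING_INTERVAL_MS ?? 30_000),
  };
}

const cfg = readPollingConfig({});
console.log(cfg); // { enabled: true, intervalMs: 30000 }
```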

Rationale

  • Reliability: Webhooks can fail due to network issues, SimplyPrint outages, or configuration problems
  • Real-time updates: Webhooks provide immediate notification when status changes
  • Consistency: Polling catches any status changes that webhooks might miss
  • Idempotency: Status updates check current status before updating, preventing duplicate updates
  • Configurable: Polling can be disabled in environments where webhooks are reliable

Status Deduplication

The system handles duplicate status updates gracefully:

async updateJobStatus(simplyPrintJobId: string, newStatus: PrintJobStatus): Promise<PrintJob> {
  const printJob = await this.findBySimplyPrintJobId(simplyPrintJobId);

  // Skip if status unchanged (idempotent)
  if (printJob.status === newStatus) {
    return printJob;
  }

  // Update and emit events
  // ...
}

Consequences

  • ✅ High reliability for status updates
  • ✅ Real-time updates via webhooks
  • ✅ Catches missed webhooks via polling
  • ✅ Configurable polling interval
  • ✅ Idempotent status updates
  • ⚠️ Polling adds API calls every 30 seconds (minimal overhead)
  • ⚠️ Potential for slight delay if only relying on polling

Alternatives Considered

Alternative Reason for Rejection
Webhooks only Single point of failure, missed events cause stale status
Polling only Higher latency, unnecessary API calls when webhooks work
WebSocket connection SimplyPrint doesn't offer a WebSocket API
Manual refresh button Poor UX, requires operator intervention

ADR-021: Retry Queue with Exponential Backoff

Attribute Value
ID ADR-021
Status ✅ Implemented
Date 2026-01-14
Context Need to handle transient failures in external API calls (Shopify, SimplyPrint) gracefully

Decision

Implement a database-backed retry queue with exponential backoff and jitter for all retryable operations.

Configuration

Setting Value Description
Max Retries 5 Maximum retry attempts
Initial Delay 1 second First retry delay
Max Delay 1 hour Maximum retry delay
Backoff Multiplier 2 Exponential growth factor
Jitter ±10% Randomization to prevent thundering herd
Cleanup 7 days Old completed jobs deleted

Implementation

// attempt is 1-based; settings from the table above:
// initialDelayMs = 1000, backoffMultiplier = 2, maxDelayMs = 3_600_000
calculateDelay(attempt: number): number {
  let delay = this.initialDelayMs * Math.pow(this.backoffMultiplier, attempt - 1);
  delay = Math.min(delay, this.maxDelayMs);
  const jitter = delay * 0.1 * (Math.random() * 2 - 1); // ±10% jitter
  return Math.round(delay + jitter);
}
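
With the configured settings, successive retries land near 1 s, 2 s, 4 s, 8 s, 16 s (each ±10%), capped at one hour. A standalone sketch of that schedule:

```typescript
// Standalone sketch of the retry schedule (initial 1 s, multiplier 2, 1 h cap, ±10% jitter).
function retryDelayMs(attempt: number): number {
  const initialDelayMs = 1_000;
  const backoffMultiplier = 2;
  const maxDelayMs = 3_600_000; // 1 hour cap
  let delay = initialDelayMs * Math.pow(backoffMultiplier, attempt - 1);
  delay = Math.min(delay, maxDelayMs);
  const jitter = delay * 0.1 * (Math.random() * 2 - 1); // ±10%
  return Math.round(delay + jitter);
}

for (let attempt = 1; attempt <= 5; attempt++) {
  console.log(`attempt ${attempt}: ~${retryDelayMs(attempt)} ms`);
}
```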

Supported Job Types

Job Type Description
FULFILLMENT Shopify fulfillment creation
PRINT_JOB_CREATION SimplyPrint job creation
CANCELLATION Job cancellation operations
NOTIFICATION Email notification sending

Consequences

  • ✅ Automatic recovery from transient failures
  • ✅ Prevents thundering herd with jitter
  • ✅ Persistent queue survives application restarts
  • ✅ Failed jobs trigger operator alerts
  • ⚠️ Adds database table for queue persistence

ADR-022: Event-Driven Fulfillment Architecture

Attribute Value
ID ADR-022
Status ✅ Implemented
Date 2026-01-14
Context Need to automatically create Shopify fulfillments when all print jobs complete

Decision

Use NestJS Event Emitter to trigger fulfillment creation when the orchestration service determines all print jobs for an order are complete.

Event Flow

PrintJob.COMPLETED → OrchestrationService checks all jobs
                   → If all complete: emit order.ready-for-fulfillment
                   → FulfillmentService listens and creates Shopify fulfillment

Key Events

Event Producer Consumer
order.ready-for-fulfillment OrchestrationService FulfillmentService
fulfillment.created FulfillmentService NotificationsService
fulfillment.failed FulfillmentService NotificationsService
order.cancelled OrdersService CancellationService

Consequences

  • ✅ Loose coupling between order management and fulfillment
  • ✅ Easy to add additional listeners (logging, analytics)
  • ✅ Failure in fulfillment doesn't block order completion
  • ⚠️ Event ordering not guaranteed (acceptable for this use case)

ADR-023: Email Notification Strategy

Attribute Value
ID ADR-023
Status ✅ Implemented
Date 2026-01-14
Context Need to alert operators when automated processes fail and require attention

Decision

Implement email notifications via SMTP using Nodemailer with Handlebars templates for operator alerts.

Notification Triggers

Trigger Severity Description
Fulfillment failed (final) ERROR Fulfillment failed after max retries
Print job failed (final) ERROR Print job failed after max retries
Cancellation needs review WARNING Order cancelled with in-progress prints
Retry exhausted ERROR Any retry job exceeded max attempts

Configuration

SMTP_HOST=smtp.example.com
SMTP_PORT=587
SMTP_USER=notifications@forma3d.be
SMTP_PASS=***
SMTP_FROM=noreply@forma3d.be
OPERATOR_EMAIL=operator@forma3d.be
NOTIFICATIONS_ENABLED=true
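
The "graceful degradation" behavior can be sketched as a send-or-skip guard: notifications are skipped, never thrown, when disabled or when SMTP is not configured. shouldSend and notifyOperator are illustrative names, and the Nodemailer call is elided.

```typescript
// Sketch: skip sending when notifications are disabled or SMTP is unconfigured.
type Alert = { severity: 'ERROR' | 'WARNING'; subject: string; body: string };

function shouldSend(env: Record<string, string | undefined>): boolean {
  return env.NOTIFICATIONS_ENABLED !== 'false' && Boolean(env.SMTP_HOST);
}

async function notifyOperator(
  alert: Alert,
  env: Record<string, string | undefined> = process.env,
): Promise<boolean> {
  if (!shouldSend(env)) return false; // degrade gracefully, never throw
  // await transporter.sendMail({ to: env.OPERATOR_EMAIL, subject: alert.subject, ... });
  return true;
}
```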

Consequences

  • ✅ Operators notified of issues requiring attention
  • ✅ Email templates are maintainable and customizable
  • ✅ Graceful degradation if email unavailable
  • ✅ Can be disabled in development
  • ⚠️ Requires SMTP configuration for each environment

ADR-024: API Key Authentication for Admin Endpoints

Attribute Value
ID ADR-024
Status ✅ Implemented
Date 2026-01-14
Context Admin endpoints (fulfillment, cancellation) need protection from unauthorized access

Decision

Implement API key authentication using a custom NestJS guard for all admin endpoints that modify order state.

Implementation

// ApiKeyGuard
import {
  CanActivate,
  ExecutionContext,
  Injectable,
  UnauthorizedException,
} from '@nestjs/common';
import * as crypto from 'node:crypto';

@Injectable()
export class ApiKeyGuard implements CanActivate {
  private readonly isEnabled: boolean; // false when INTERNAL_API_KEY is unset (development mode)
  private readonly apiKey: string; // loaded from INTERNAL_API_KEY (wiring elided)

  canActivate(context: ExecutionContext): boolean {
    if (!this.isEnabled) return true; // Development mode

    const request = context.switchToHttp().getRequest();
    const providedKey = request.headers['x-api-key'];

    if (!providedKey) {
      throw new UnauthorizedException('API key required');
    }

    // Timing-safe comparison to prevent timing attacks.
    // timingSafeEqual throws if buffer lengths differ, so compare lengths first.
    const providedBuffer = Buffer.from(providedKey);
    const expectedBuffer = Buffer.from(this.apiKey);

    if (
      providedBuffer.length !== expectedBuffer.length ||
      !crypto.timingSafeEqual(providedBuffer, expectedBuffer)
    ) {
      throw new UnauthorizedException('Invalid API key');
    }

    return true;
  }
}

Protected Endpoints

Endpoint Method Purpose
/api/v1/fulfillments/order/:orderId POST Create fulfillment
/api/v1/fulfillments/order/:orderId/force POST Force fulfill order
/api/v1/fulfillments/order/:orderId/status GET Get fulfillment status
/api/v1/cancellations/order/:orderId POST Cancel order
/api/v1/cancellations/print-job/:jobId POST Cancel single print job

Authentication Methods Summary

Endpoint Type Method Header Verification
Shopify Webhooks HMAC-SHA256 Signature X-Shopify-Hmac-Sha256 Timing-safe comparison
SimplyPrint Webhooks Token Verification X-SP-Token Timing-safe comparison
Admin Endpoints API Key X-API-Key Timing-safe comparison
Public Endpoints None - -

Configuration

# Generate secure API key
openssl rand -hex 32

# Environment variable
INTERNAL_API_KEY="your-secure-api-key"

Security Considerations

  1. Timing-safe comparison: Prevents timing attacks on key validation
  2. Generic error messages: Returns "API key required" or "Invalid API key" to prevent information leakage
  3. Audit logging: Access attempts are logged for security monitoring
  4. Development mode: If INTERNAL_API_KEY not set, endpoints are accessible (development only)

Rationale

  • IDOR Prevention: Addresses Insecure Direct Object Reference (IDOR) vulnerabilities flagged by security scanners
  • Defense in Depth: Additional layer of protection for sensitive operations
  • Simple Implementation: API keys are stateless and easy to rotate
  • Swagger Integration: API key documented in OpenAPI spec for easy testing

Consequences

  • ✅ Protection against unauthorized access to admin functions
  • ✅ IDOR vulnerability mitigated
  • ✅ Timing-safe implementation prevents timing attacks
  • ✅ Development-friendly (optional in dev mode)
  • ✅ Documented in Swagger UI
  • ⚠️ Requires secure key management in production
  • ⚠️ Key must be rotated if compromised

Alternatives Considered

Alternative Reason for Rejection
OAuth 2.0 / JWT Overkill for internal B2B system with no user accounts
IP Whitelisting Too inflexible, requires network configuration
mTLS Complex certificate management for simple use case
No authentication Unacceptable security risk (IDOR vulnerability)

ADR-025: Cosign Image Signing for Supply Chain Security

Attribute Value
ID ADR-025
Status ✅ Implemented
Date 2026-01-14
Context Need to cryptographically sign container images and create attestations for promotion tracking

Decision

Implement key-based container image signing using cosign from the Sigstore project, with attestations to track image promotions through environments.

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                        Azure DevOps Pipeline                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Build & Package          Acceptance Test           Production           │
│  ┌─────────────┐          ┌─────────────┐          ┌─────────────┐      │
│  │ Build Docker│          │ Deploy to   │          │ Verify      │      │
│  │ Images      │          │ Staging     │          │ Staging     │      │
│  └──────┬──────┘          └──────┬──────┘          │ Attestation │      │
│         │                        │                 └──────┬──────┘      │
│         ▼                        ▼                        │             │
│  ┌─────────────┐          ┌─────────────┐                 ▼             │
│  │ Sign with   │          │ Run Tests   │          ┌─────────────┐      │
│  │ cosign.key  │          └──────┬──────┘          │ Deploy to   │      │
│  └──────┬──────┘                 │                 │ Production  │      │
│         │                        ▼                 └──────┬──────┘      │
│         │                 ┌─────────────┐                 │             │
│         │                 │ Create      │                 ▼             │
│         │                 │ Staging     │          ┌─────────────┐      │
│         │                 │ Attestation │          │ Create Prod │      │
│         │                 └─────────────┘          │ Attestation │      │
│         │                                          └─────────────┘      │
└─────────┼──────────────────────────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────┐     ┌─────────────────────┐
│   DigitalOcean Container Registry   │     │     Repository      │
│   ─────────────────────────────────│     │  ─────────────────  │
│   • Image:tag                       │     │  • cosign.pub       │
│   • Image.sig (signature)           │◄────│    (public key)     │
│   • Image.att (attestation)         │     │                     │
└─────────────────────────────────────┘     └─────────────────────┘

Implementation

Key-Based Signing (chosen approach):

# Azure DevOps Pipeline
- task: DownloadSecureFile@1
  name: cosignKey
  inputs:
    secureFile: 'cosign.key'

- script: |
    cosign sign \
      --key $(cosignKey.secureFilePath) \
      --annotations "build.number=$(imageTag)" \
      $(dockerRegistry)/$(imageName)@$(digest)
  env:
    COSIGN_PASSWORD: $(COSIGN_PASSWORD)

Attestation for Promotion Tracking:

{
  "_type": "https://forma3d.com/attestations/promotion/v1",
  "environment": "staging",
  "promotedAt": "2026-01-14T16:00:00+00:00",
  "build": {
    "number": "20260114160000",
    "pipeline": "forma-3d-connect",
    "commit": "abc123..."
  },
  "verification": {
    "healthCheckPassed": true,
    "acceptanceTestsPassed": true
  }
}

Key Management

File Location Purpose
cosign.key Azure DevOps Secure Files Sign images (private)
COSIGN_PASSWORD Azure DevOps Variable Group Decrypt private key
cosign.pub Repository root (/cosign.pub) Verify signatures (public)

Signing Workflow

Stage Action Artifact Created
Build & Package Sign image after push Image signature (.sig)
Staging Deploy Create staging attestation Staging attestation (.att)
Production Deploy Verify staging attestation, then sign Production attestation
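
The production stage's "verify staging attestation" step can be sketched as a gate over the attestation predicate shown earlier (after cosign has already verified its signature). The predicate shape matches the example JSON above; the function itself is illustrative.

```typescript
// Sketch: refuse production deployment unless the staging attestation's checks passed.
type PromotionPredicate = {
  environment: string;
  verification: { healthCheckPassed: boolean; acceptanceTestsPassed: boolean };
};

function canPromoteToProduction(predicate: PromotionPredicate): boolean {
  return (
    predicate.environment === 'staging' &&
    predicate.verification.healthCheckPassed &&
    predicate.verification.acceptanceTestsPassed
  );
}
```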

Rationale

  • Supply chain security: Cryptographic proof that images were built by the CI/CD pipeline
  • Promotion tracking: Attestations provide audit trail without modifying image tags
  • Tamper detection: Modifications to signed images are detectable
  • Key-based over keyless: Keyless (OIDC) signing requires workload identity federation, which adds complexity; key-based signing is simpler and fully functional in Azure DevOps

Why Key-Based Instead of Keyless

Sigstore's "keyless" signing uses OIDC tokens from identity providers (GitHub Actions, Google Cloud, etc.). While elegant, it has challenges in Azure DevOps:

Approach Pros Cons
Keyless (OIDC) No key management, identity-based Requires Azure Workload Identity Federation, falls back to device flow in CI (fails)
Key-Based Works immediately in any CI Requires secure key storage and rotation

We chose key-based because:

  1. Azure DevOps doesn't have native OIDC integration with Sigstore
  2. Device flow authentication cannot work in non-interactive CI
  3. Key-based signing is well-supported and reliable

Security Considerations

  1. Private key protection: Stored in Azure DevOps Secure Files (encrypted at rest)
  2. Password protection: Private key is encrypted, password in secret variable
  3. Timing-safe verification: Public key verification uses constant-time comparison
  4. Key rotation: Documented procedure for rotating keys periodically (see Cosign Setup Guide)

Pipeline Parameters

Parameter Type Default Description
enableSigning boolean true Enable/disable image signing and attestations

Verification Commands

# Verify image signature
cosign verify --key cosign.pub \
  registry.digitalocean.com/forma-3d/forma3d-connect-api:20260114160000

# View attestations attached to image
cosign tree registry.digitalocean.com/forma-3d/forma3d-connect-api:20260114160000

# Verify and decode attestation
cosign verify-attestation --key cosign.pub --type custom \
  registry.digitalocean.com/forma-3d/forma3d-connect-api@sha256:... \
  | jq '.payload | @base64d | fromjson | .predicate'

Local Tooling

A script is provided to view image promotion status:

# List all images with their promotion status
./scripts/list-image-promotions.sh

# Output shows signed status and promotion level
  TAG                PROMOTION    SIGNED   UPDATED
  20260114160000     STAGING              2026-01-14
  20260114120000     none                 2026-01-14

Consequences

  • ✅ Cryptographic proof of image provenance
  • ✅ Tamper detection for container images
  • ✅ Audit trail for environment promotions
  • ✅ Works reliably in Azure DevOps without OIDC setup
  • ✅ Can verify images locally with public key
  • ⚠️ Requires secure key management
  • ⚠️ Keys must be rotated periodically (recommended: 6-12 months)
  • ⚠️ Pipeline requires secure files and variables to be configured

Alternatives Considered

Alternative Reason for Rejection
No signing No supply chain security, no tamper detection
Keyless signing (OIDC) Falls back to device flow in Azure DevOps, requires manual auth
Docker Content Trust (DCT) Less flexible, no custom attestations, vendor lock-in
Image tags for promotion Tags can be overwritten, no cryptographic verification
External attestation store Additional infrastructure, attestations separate from images

ADR-026: CycloneDX SBOM Attestations

Attribute Value
ID ADR-026
Status ✅ Implemented
Date 2026-01-16
Context Need to generate and attach Software Bill of Materials (SBOM) to container images for supply chain transparency

Decision

Generate CycloneDX SBOMs using Syft and attach them as signed attestations using cosign.

Architecture

Each container image in the registry will have multiple attestations stored as separate OCI artifacts:

Container Image (e.g., forma3d-connect-api:20260116120000)
├── Image signature (.sig) ─────────────── cosign sign
├── SBOM attestation (.att) ────────────── cosign attest --type cyclonedx
├── Staging promotion attestation (.att) ─ cosign attest --type custom
└── Production promotion attestation (.att) cosign attest --type custom

Why CycloneDX over SPDX

Criteria CycloneDX SPDX
Primary Focus Security & DevSecOps License compliance
VEX Support Native Separate spec
Tool Ecosystem Excellent (Grype, Syft) Good
Format Complexity Simpler More complex
OWASP Alignment Yes (OWASP project) No

CycloneDX was chosen because:

  1. Better integration with vulnerability scanners (Grype, Trivy)
  2. Native support for VEX (Vulnerability Exploitability eXchange)
  3. Simpler format for debugging
  4. Aligns with OWASP security practices
  5. Growing adoption in DevSecOps pipelines

Implementation

Pipeline Step (after image signing):

- script: |
    set -e

    # Install Syft
    curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin

    # Generate CycloneDX SBOM
    syft $(dockerRegistry)/$(imageName)@$(digest) \
      --output cyclonedx-json=sbom.cdx.json

    # Attach as signed attestation
    cosign attest \
      --yes \
      --key $(cosignKey.secureFilePath) \
      --predicate sbom.cdx.json \
      --type cyclonedx \
      $(dockerRegistry)/$(imageName)@$(digest)
  displayName: 'Generate and Attach SBOM'
  env:
    COSIGN_PASSWORD: $(COSIGN_PASSWORD)

Attestation Types in Registry

After deployment, each image has multiple separate attestations:

Attestation Type Purpose Created By
Signature Proves image was built by CI/CD cosign sign
CycloneDX SBOM Lists all components/packages cosign attest --type cyclonedx
Staging Proves image passed staging cosign attest --type custom
Production Proves image deployed to prod cosign attest --type custom

Verification Commands

# View all attestations attached to an image
cosign tree registry.digitalocean.com/forma-3d/forma3d-connect-api:latest

# Verify and extract SBOM
cosign verify-attestation \
  --key cosign.pub \
  --type cyclonedx \
  registry.digitalocean.com/forma-3d/forma3d-connect-api@sha256:... \
  | jq -r '.payload' | base64 -d | jq '.predicate'

# Count components in SBOM
cosign verify-attestation --key cosign.pub --type cyclonedx \
  registry.digitalocean.com/forma-3d/forma3d-connect-api@sha256:... \
  | jq -r '.payload' | base64 -d | jq '.predicate.components | length'

Scanning for Vulnerabilities

With the SBOM attached, you can scan for vulnerabilities without pulling the full image:

# Extract SBOM and scan with Grype
cosign verify-attestation --key cosign.pub --type cyclonedx \
  registry.digitalocean.com/forma-3d/forma3d-connect-api@sha256:... \
  | jq -r '.payload' | base64 -d | jq '.predicate' > sbom.cdx.json

grype sbom:sbom.cdx.json

Rationale

  • Supply chain transparency: SBOM provides complete visibility into image contents
  • Vulnerability management: Enables scanning without pulling full images
  • Compliance: Meets requirements for software transparency (US Executive Order 14028)
  • Signed attestation: SBOM itself is cryptographically signed, preventing tampering
  • Tool-agnostic: CycloneDX is an open standard supported by many tools

Consequences

  • ✅ Complete visibility into image dependencies
  • ✅ Enables vulnerability scanning from SBOM
  • ✅ Signed attestation prevents SBOM tampering
  • ✅ Supports compliance requirements
  • ✅ Works with existing cosign infrastructure
  • ⚠️ Adds ~10-15 seconds to pipeline per image
  • ⚠️ SBOM attestation adds ~2KB manifest to registry

Alternatives Considered

Alternative Reason for Rejection
SPDX format More focused on licensing, less security tooling
Syft native format Not an industry standard, limited tool support
Docker Buildx --sbom Requires buildx, less control over format
No SBOM Missing supply chain transparency
SBOM in image labels Not cryptographically signed, can be tampered with

Tools Used

Tool License Purpose
Syft Apache 2.0 Generate CycloneDX SBOM
Cosign Apache 2.0 Sign and attach as attestation
Grype Apache 2.0 Vulnerability scanning (optional)

ADR-027: TanStack Query for Server State Management

Attribute Value
ID ADR-027
Status Accepted
Date 2026-01-14
Context Need to manage server state in the React dashboard with caching, refetching, and loading states

Decision

Use TanStack Query (v5.x, formerly React Query) for server state management in the dashboard.

Rationale

  • Automatic caching: Query results are cached and deduplicated automatically
  • Background refetching: Data stays fresh with configurable stale times and refetch intervals
  • Loading/error states: Built-in loading, error, and success states reduce boilerplate
  • Optimistic updates: Supports optimistic updates for better UX on mutations
  • DevTools: React Query DevTools for debugging cache state
  • TypeScript support: Excellent TypeScript integration with inferred types

Implementation

// Query client configuration (apps/web/src/lib/query-client.ts)
const queryClient = new QueryClient({
  defaultOptions: {
    queries: {
      staleTime: 30 * 1000, // 30 seconds
      gcTime: 5 * 60 * 1000, // 5 minutes cache
      retry: 1,
      refetchOnWindowFocus: false,
    },
  },
});

// Example hook (apps/web/src/hooks/use-orders.ts)
export function useOrders(query: OrdersQuery = {}) {
  return useQuery({
    queryKey: ['orders', query],
    queryFn: () => apiClient.orders.list(query),
  });
}
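
The "query keys" caveat below comes from TanStack Query's default fuzzy matching: invalidating `['orders']` also refetches `['orders', { status: 'OPEN' }]`, because invalidation matches by key prefix. This standalone helper only mimics that prefix match for illustration; it is not TanStack Query's actual implementation.

```typescript
// Sketch: prefix matching in the spirit of TanStack Query's invalidateQueries.
function keyMatchesPrefix(queryKey: unknown[], prefix: unknown[]): boolean {
  return prefix.every(
    (part, i) => JSON.stringify(queryKey[i]) === JSON.stringify(part),
  );
}

console.log(keyMatchesPrefix(['orders', { status: 'OPEN' }], ['orders'])); // true
console.log(keyMatchesPrefix(['printers'], ['orders'])); // false
```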

Consequences

  • ✅ Eliminates manual loading/error state management
  • ✅ Automatic cache invalidation on mutations
  • ✅ Integrates well with Socket.IO for real-time updates
  • ✅ Reduces API calls through intelligent caching
  • ⚠️ Requires understanding of query keys for proper cache invalidation

Alternatives Considered

Alternative Reason for Rejection
Redux Too much boilerplate for server state
SWR Fewer features than TanStack Query
Apollo Client GraphQL-focused, overkill for REST API
Manual fetch Requires implementing caching/loading states manually

ADR-028: Socket.IO for Real-Time Dashboard Updates

Attribute Value
ID ADR-028
Status Accepted
Date 2026-01-14
Context Dashboard needs real-time updates when orders and print jobs change status

Decision

Use Socket.IO for real-time WebSocket communication between backend and dashboard.

Architecture

Backend Events          WebSocket Gateway         React Dashboard
     │                        │                        │
     │  order.created         │                        │
     ├───────────────────────►│                        │
     │                        │  order:created         │
     │                        ├───────────────────────►│
     │                        │                        │ invalidateQueries()
     │                        │                        │ toast.success()

Implementation

Backend (NestJS WebSocket Gateway):

// apps/api/src/gateway/events.gateway.ts
@WebSocketGateway({ namespace: '/events' })
export class EventsGateway {
  @WebSocketServer()
  server!: Server;

  @OnEvent(ORDER_EVENTS.CREATED)
  handleOrderCreated(event: OrderEventPayload): void {
    this.server.emit('order:created', { ... });
  }
}

Frontend (React Context):

// apps/web/src/contexts/socket-context.tsx
socketInstance.on('order:created', (data) => {
  toast.success(`New order: #${data.orderNumber}`);
  queryClient.invalidateQueries({ queryKey: ['orders'] });
});
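
Note the naming convention bridging the two sides: internal events are dot-separated (`order.created`, per ADR-008) while socket channels are colon-separated (`order:created`). A tiny sketch of that convention (illustrative only; the real gateway maps each event explicitly):

```typescript
// Illustrative helper showing the internal-to-socket naming convention
// observed above ('order.created' -> 'order:created'). The actual gateway
// hard-codes each mapping per @OnEvent handler rather than using a helper.
const toSocketEvent = (internalEvent: string): string =>
  internalEvent.replace(/\./g, ':');
```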

Rationale

  • Already installed: Socket.IO server was already in dependencies for Phase 3
  • Bidirectional: Supports future features like notifications and chat
  • Automatic reconnection: Handles network interruptions gracefully
  • Namespace support: Can separate different event channels
  • Browser compatibility: Works across all modern browsers

Consequences

  • ✅ Real-time updates without polling
  • ✅ Toast notifications on important events
  • ✅ Automatic TanStack Query cache invalidation
  • ✅ Connection status visible in UI
  • ⚠️ Requires WebSocket support in infrastructure

Alternatives Considered

| Alternative | Reason for Rejection |
| --- | --- |
| Polling | Higher latency; more server load |
| Server-Sent Events | One-directional only |
| Raw WebSockets | Fewer features than Socket.IO (rooms, reconnection) |
| Pusher/Ably | External dependency; cost |

ADR-029: API Key Authentication for Dashboard

| Attribute | Value |
| --- | --- |
| ID | ADR-029 |
| Status | Accepted |
| Date | 2026-01-14 |
| Context | Dashboard needs authentication to protect admin operations |

Decision

Use API key authentication stored in browser localStorage for dashboard authentication.

Implementation

// apps/web/src/contexts/auth-context.tsx
import { useState, type ReactNode } from 'react';
import { Navigate } from 'react-router-dom';

const AUTH_STORAGE_KEY = 'forma3d_api_key';

export function AuthProvider({ children }: { children: ReactNode }) {
  const [apiKey, setApiKey] = useState<string | null>(() => {
    return localStorage.getItem(AUTH_STORAGE_KEY);
  });

  const login = (key: string) => {
    localStorage.setItem(AUTH_STORAGE_KEY, key);
    setApiKey(key);
  };
  // ...
}

// Protected routes redirect to /login if not authenticated
function ProtectedRoute({ children }: { children: ReactNode }) {
  const { isAuthenticated } = useAuth();
  if (!isAuthenticated) return <Navigate to="/login" replace />;
  return <>{children}</>;
}

Rationale

  • Simplicity: No session management, token refresh, or OAuth complexity
  • Consistent with API: Uses same API key authentication as backend (ADR-024)
  • Offline-capable: Works without server validation on page load
  • Single operator: System is used by single operator, not public users

Security Considerations

  • API key stored in localStorage (acceptable for internal admin tool)
  • Key sent via X-API-Key header for mutations
  • HTTPS required in production
  • Key should be rotated periodically
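
As a sketch of how the stored key accompanies requests (the helper name and shape are assumptions; the project's actual API client may differ):

```typescript
// Hypothetical helper: attach the stored API key as an X-API-Key header.
// Names are illustrative, not the project's actual client code.
type Init = { method?: string; headers?: Record<string, string> };

function withApiKey(init: Init, key: string | null): Init {
  if (!key) return init; // unauthenticated requests pass through unchanged
  return { ...init, headers: { ...init.headers, 'X-API-Key': key } };
}

// Browser usage sketch:
//   fetch('/api/orders', withApiKey({ method: 'POST' },
//     localStorage.getItem('forma3d_api_key')));
```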

Consequences

  • ✅ Simple implementation and user experience
  • ✅ Consistent with existing API key guard on backend
  • ✅ No additional authentication infrastructure needed
  • ⚠️ API key visible in localStorage (acceptable for admin tool)
  • ⚠️ No role-based access control (single admin role)

Alternatives Considered

| Alternative | Reason for Rejection |
| --- | --- |
| OAuth/OIDC | Overkill for single-operator system |
| JWT tokens | Adds complexity without benefit for this use case |
| Session cookies | Requires server-side session management |
| No auth | Admin operations must be protected |

ADR-030: Sendcloud for Shipping Integration

| Attribute | Value |
| --- | --- |
| ID | ADR-030 |
| Status | Accepted |
| Date | 2026-01-16 |
| Context | Need to generate shipping labels and sync tracking information to Shopify |

Decision

Use Sendcloud API (custom integration) rather than the native Sendcloud-Shopify app for shipping label generation and tracking.

Rationale

Why Sendcloud as a Platform

  • Multi-carrier support: Single API for PostNL, DPD, DHL, UPS, and 80+ other carriers
  • European focus: Strong presence in Belgium/Netherlands matching Forma3D's primary market
  • Simple API: REST API with Basic Auth, parcel creation returns label PDF immediately
  • Automatic tracking: Tracking numbers and URLs provided on parcel creation
  • Webhook support: Status updates available via webhooks (for future enhancement)
  • Competitive pricing: Pay-per-label pricing suitable for small business volumes
  • Label formats: Supports A4, A6, and thermal printer formats

Why Custom API Integration vs Native Shopify-Sendcloud App

Sendcloud offers a native Shopify integration that automatically syncs orders. However, we chose a custom API integration for the following reasons:

| Aspect | Native Sendcloud-Shopify App | Our Custom API Integration |
| --- | --- | --- |
| Trigger | Manual — operator must create label in Sendcloud dashboard | Automatic — triggered when all print jobs complete |
| Print awareness | None — doesn't know about 3D printing workflow | Full — waits for SimplyPrint jobs to finish |
| Unified dashboard | Split across Shopify + Sendcloud panels | Single dashboard — orders, prints, shipments in one place |
| Audit trail | Separate logs in each system | Integrated event log with full traceability |
| Custom workflow | Generic e-commerce flow | Custom print-to-ship automation |
| Tracking sync timing | After manual label creation | Immediate — included in Shopify fulfillment |

Key insight: The native integration doesn't know when 3D printing is complete. An operator would need to:

  1. Monitor SimplyPrint for job completion
  2. Switch to Sendcloud dashboard
  3. Find the order and create a label
  4. Wait for tracking to sync back to Shopify

Our custom integration automates this entire workflow:

Print Jobs Complete → Auto-Generate Label → Auto-Fulfill with Tracking → Customer Notified

This reduces manual intervention from ~5 minutes per order to zero, which is critical for scaling order volumes.

Implementation

apps/api/src/
├── sendcloud/
│   ├── sendcloud-api.client.ts    # HTTP client with Basic Auth
│   ├── sendcloud.service.ts       # Business logic, event listener
│   ├── sendcloud.controller.ts    # REST endpoints
│   └── sendcloud.module.ts
├── shipments/
│   ├── shipments.repository.ts    # Prisma queries for Shipment
│   ├── shipments.controller.ts    # REST endpoints
│   └── shipments.module.ts

libs/api-client/src/
└── sendcloud/
    └── sendcloud.types.ts         # Typed DTOs for Sendcloud API

Event Flow

  1. All print jobs complete → OrchestrationService emits order.ready-for-fulfillment
  2. SendcloudService listens → creates parcel via Sendcloud API
  3. Sendcloud returns label URL + tracking number
  4. Shipment record stored in database
  5. SendcloudService emits shipment.created event
  6. FulfillmentService listens → creates Shopify fulfillment with tracking info
  7. Customer receives email notification with tracking link
┌─────────────┐    ┌──────────────┐    ┌─────────────┐    ┌─────────────┐
│ SimplyPrint │───▶│ Orchestration│───▶│  Sendcloud  │───▶│ Fulfillment │
│  (prints)   │    │   Service    │    │   Service   │    │   Service   │
└─────────────┘    └──────────────┘    └─────────────┘    └─────────────┘
                         │                    │                  │
                         │ order.ready-       │ shipment.        │ Shopify
                         │ for-fulfillment    │ created          │ Fulfillment
                         ▼                    ▼                  ▼
                   [All jobs done]      [Label + tracking]  [Customer notified]
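
The listener at the center of this flow can be sketched as follows (method and dependency names are assumptions; the real logic lives in sendcloud.service.ts and uses NestJS event listeners):

```typescript
// Hedged sketch of the fulfillment listener (names assumed). Dependencies
// are injected so the flow is testable without NestJS or a live Sendcloud
// account.
interface ParcelResult { labelUrl: string; trackingNumber: string; }

async function handleReadyForFulfillment(
  orderId: string,
  deps: {
    createParcel: (orderId: string) => Promise<ParcelResult>;          // Sendcloud API (steps 2-3)
    saveShipment: (orderId: string, p: ParcelResult) => Promise<void>; // database (step 4)
    emit: (event: string, payload: unknown) => void;                   // event bus (step 5)
  }
): Promise<void> {
  const parcel = await deps.createParcel(orderId);
  await deps.saveShipment(orderId, parcel);
  deps.emit('shipment.created', { orderId, ...parcel });
}
```

FulfillmentService then picks up `shipment.created` and completes steps 6-7 against Shopify.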

Consequences

  • ✅ Single integration for multiple carriers
  • ✅ Automatic label PDF generation
  • ✅ Tracking information synced to Shopify fulfillments
  • ✅ Dashboard displays shipment status and label download
  • ⚠️ Dependent on Sendcloud uptime and API availability
  • ⚠️ Limited to carriers supported by Sendcloud
  • ⚠️ Requires Sendcloud account and sender address configuration

Environment Variables

SENDCLOUD_PUBLIC_KEY=xxx
SENDCLOUD_SECRET_KEY=xxx
SENDCLOUD_API_URL=https://panel.sendcloud.sc/api/v2
DEFAULT_SHIPPING_METHOD_ID=8
DEFAULT_SENDER_ADDRESS_ID=12345
SHIPPING_ENABLED=true

Alternatives Considered

| Alternative | Reason for Rejection |
| --- | --- |
| Native Sendcloud-Shopify app | Requires manual label creation; no print workflow awareness |
| Direct carrier APIs | Too many integrations to maintain, each with different APIs |
| ShipStation | US-focused; less European carrier support |
| EasyPost | Less European carrier coverage than Sendcloud |
| Manual labels | Does not meet automation requirements; ~5 min overhead per order |

ADR-031: Automated Container Registry Cleanup

| Attribute | Value |
| --- | --- |
| ID | ADR-031 |
| Status | Accepted |
| Date | 2026-01-16 |
| Context | Container registries accumulate old images over time, increasing storage costs and clutter |

Decision

Implement automated container registry cleanup that runs after each successful staging deployment and attestation. The cleanup uses attestation-based policies to determine which images to keep or delete.

Rationale

The Problem

Without automated cleanup, the DigitalOcean Container Registry accumulates images indefinitely:

  • Each CI build creates new images with timestamped tags (e.g., 20260116120000)
  • Signature and attestation artifacts add ~2KB per image
  • Storage costs grow linearly with deployment frequency
  • Old images provide no value after newer versions are verified in production

Attestation-Based Cleanup Policy

The cleanup leverages the cosign attestation system (ADR-025) to make intelligent retention decisions:

| Image Status | Action | Rationale |
| --- | --- | --- |
| PRODUCTION attestation | Keep | May be needed for rollback |
| Currently deployed | Keep | Active in production/staging |
| Recent (last 5) | Keep | Recent builds for debugging |
| STAGING-only attestation | Delete | Superseded by newer staging builds |
| No attestation | Delete | Never passed acceptance tests |

This policy ensures:

  1. Rollback capability: Production-attested images are always available
  2. Debugging support: Recent images preserved for investigation
  3. Automatic garbage collection: Old staging/unsigned images removed
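
The retention table can be encoded as a small decision function (illustrative TypeScript; the shipped logic lives in scripts/cleanup-registry.sh, and the field names below are assumptions):

```typescript
// Illustrative encoding of the retention policy (names assumed).
type Attestation = 'PRODUCTION' | 'STAGING' | null;

function shouldKeep(img: {
  attestation: Attestation;
  deployed: boolean;   // reported by the /health endpoints
  recentRank: number;  // 1 = newest build in the repository
}): boolean {
  if (img.attestation === 'PRODUCTION') return true; // rollback candidates
  if (img.deployed) return true;                     // never delete running images
  if (img.recentRank <= 5) return true;              // keep last 5 for debugging
  return false;                                      // STAGING-only or unattested
}
```

Note that the keep rules are checked first, so a currently deployed but unattested image is still retained.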

Integration with Health Endpoints

The cleanup script queries the /health endpoints to determine which images are currently deployed:

# API health endpoint returns current build number
curl https://staging-connect-api.forma3d.be/health
# Response: { "build": { "number": "20260116120000" }, ... }

This prevents accidental deletion of running containers.

Implementation

scripts/
└── cleanup-registry.sh    # Cleanup script with attestation checking

azure-pipelines.yml
└── RegistryMaintenance stage  # Runs on every main branch pipeline
    └── CleanupRegistry job    # Cleans manifests + triggers GC

Cleanup Script

The scripts/cleanup-registry.sh script:

  1. Authenticates to DigitalOcean Container Registry via doctl
  2. Queries health endpoints to find currently deployed image tags
  3. Lists all images in the registry for each repository
  4. Checks attestations using cosign verify-attestation with the public key
  5. Applies retention policy based on attestation status
  6. Deletes eligible images via doctl registry repository delete-manifest
  7. Triggers garbage collection to reclaim storage space

Pipeline Integration

The cleanup runs in a dedicated RegistryMaintenance stage that executes on every main branch pipeline, even when no apps are affected (DeployStaging skipped):

- stage: RegistryMaintenance
  dependsOn: [Build, DeployStaging]
  condition: and(not(canceled()), eq(variables.isMain, true))

Cleanup Flow

┌─────────────────────────────────────────────────────────────────────┐
│                    RegistryMaintenance Stage                          │
│            (runs on every main branch pipeline)                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│                          CleanupRegistry                             │
│                               │                                      │
│                               ▼                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ 1. Query /health endpoints for deployed versions             │   │
│  │ 2. List all images in registry                               │   │
│  │ 3. For each image:                                           │   │
│  │    - Check if PRODUCTION attested → KEEP                     │   │
│  │    - Check if currently deployed → KEEP                      │   │
│  │    - Check if in top 5 recent → KEEP                         │   │
│  │    - Check if STAGING-only attested → DELETE                 │   │
│  │    - Check if no attestation → DELETE                        │   │
│  │ 4. Wait for any active GC, start new GC, verify completion   │   │
│  │ 5. EXIT trap ensures GC runs even if script crashes          │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Usage

Local Testing (Dry Run)

# Preview what would be deleted
./scripts/cleanup-registry.sh \
  --key cosign.pub \
  --api-url https://staging-connect-api.forma3d.be \
  --web-url https://staging-connect.forma3d.be \
  --dry-run \
  --verbose

Manual Cleanup

# Perform actual cleanup
./scripts/cleanup-registry.sh \
  --key cosign.pub \
  --api-url https://staging-connect-api.forma3d.be \
  --web-url https://staging-connect.forma3d.be \
  --verbose

Script Options

| Option | Description |
| --- | --- |
| -k, --key FILE | Public key for attestation verification (required) |
| --api-url URL | API health endpoint URL (required) |
| --web-url URL | Web health endpoint URL (required) |
| --keep-recent N | Keep N most recent images (default: 5) |
| --dry-run | Preview deletions without executing |
| -v, --verbose | Show detailed output |

Consequences

  • ✅ Automatic storage management reduces costs
  • ✅ Attestation-based policy ensures production rollback capability
  • ✅ Health endpoint check prevents deletion of running containers
  • ✅ Dry-run mode enables safe testing
  • ✅ Garbage collection reclaims space after deletion
  • ⚠️ Requires health endpoints to return build information
  • ⚠️ Dependent on cosign/doctl availability in pipeline

Alternatives Considered

| Alternative | Reason for Rejection |
| --- | --- |
| Time-based retention (e.g., 30 days) | Doesn't account for promotion status; may delete production-ready images |
| Tag-based retention (e.g., keep latest) | latest tag is mutable; doesn't guarantee correct image |
| Manual cleanup | Error-prone, inconsistent, doesn't scale |
| Registry auto-purge policies | DigitalOcean doesn't support attestation-aware policies |

ADR-032: Domain Boundary Separation with Interface Contracts

| Attribute | Value |
| --- | --- |
| ID | ADR-032 |
| Title | Domain Boundary Separation with Interface Contracts |
| Status | Implemented |
| Context | Prepare the modular monolith for potential future microservices extraction by establishing clean domain boundaries |
| Date | 2026-01-17 |

Context

As the application grows, we need to ensure domain boundaries are well-defined to:

  1. Enable future microservices extraction without major refactoring
  2. Reduce coupling between modules
  3. Enable independent testing of domain logic
  4. Provide distributed tracing capabilities

Decision

We implement domain boundary separation with the following patterns:

1. Domain Contracts Library (libs/domain-contracts)

Create a dedicated library containing:

  • Interface definitions (IOrdersService, IPrintJobsService, etc.)
  • DTOs for cross-domain communication (OrderDto, PrintJobDto, etc.)
  • Symbol injection tokens (ORDERS_SERVICE, PRINT_JOBS_SERVICE, etc.)

2. Correlation ID Infrastructure

Add correlation ID propagation for distributed tracing:

  • CorrelationMiddleware extracts/generates x-correlation-id headers
  • CorrelationService uses AsyncLocalStorage for context propagation
  • All domain events include correlationId, timestamp, and source fields
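
A minimal sketch of the CorrelationService built on Node's AsyncLocalStorage (method names are assumptions; the real implementation lives in apps/api/src/common/correlation/):

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

// Hedged sketch (method names assumed). The middleware calls run() once per
// request; any code on the same async path can then read the id without
// passing it through every function signature.
class CorrelationService {
  private readonly als = new AsyncLocalStorage<{ correlationId: string }>();

  run<T>(incomingId: string | undefined, fn: () => T): T {
    // reuse an incoming x-correlation-id header, or mint a new one
    return this.als.run({ correlationId: incomingId ?? randomUUID() }, fn);
  }

  getCorrelationId(): string | undefined {
    return this.als.getStore()?.correlationId;
  }
}
```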

3. Repository Encapsulation

Repositories are internal implementation details:

  • Modules stop exporting repositories
  • Only interface tokens are exported for cross-domain communication
  • Services implement domain interfaces

4. Event-Based Base Interfaces

Define base event interfaces that all domain events extend:

interface BaseEvent {
  correlationId: string;
  timestamp: Date;
  source: string;
}
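
A concrete event then extends the base; a hedged example (the event and factory names below are illustrative, repeating BaseEvent for self-containment):

```typescript
interface BaseEvent {
  correlationId: string;
  timestamp: Date;
  source: string;
}

// Hypothetical concrete event (field names assumed for illustration)
interface OrderReadyForFulfillmentEvent extends BaseEvent {
  orderId: string;
}

function makeEvent(orderId: string, correlationId: string): OrderReadyForFulfillmentEvent {
  // source identifies the emitting domain module
  return { orderId, correlationId, timestamp: new Date(), source: 'order-service' };
}
```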

Implementation

| Component | Path | Description |
| --- | --- | --- |
| Domain Contracts | libs/domain-contracts/ | Interface definitions and DTOs |
| Correlation Service | apps/api/src/common/correlation/ | Request context propagation |
| Base Events | libs/domain/src/events/ | Base event interfaces |

Interface Tokens Pattern

// In domain-contracts library
export const ORDERS_SERVICE = Symbol('IOrdersService');

export interface IOrdersService {
  findById(id: string): Promise<OrderDto | null>;
  updateStatus(id: string, status: OrderStatus): Promise<OrderDto>;
  // ... other methods
}

// In module
@Module({
  providers: [OrdersService, { provide: ORDERS_SERVICE, useExisting: OrdersService }],
  exports: [ORDERS_SERVICE], // No longer exports repository
})
export class OrdersModule {}

// In consumer service
@Injectable()
export class FulfillmentService {
  constructor(
    @Inject(ORDERS_SERVICE)
    private readonly ordersService: IOrdersService
  ) {}
}

Scope

Interface tokens (@Inject(ORDERS_SERVICE), etc.) enforce boundaries between domains. Services that live within the same domain module should inject the concrete class directly rather than going through the token indirection. For example, OrchestrationService injects PrintJobsService directly because both live inside the order-service; it injects IOrdersService via ORDERS_SERVICE because orders are a separate domain boundary.

Consequences

Positive:

  • Clear domain boundaries enable future microservices extraction
  • Reduced coupling between modules
  • Better testability with interface-based mocking
  • Distributed tracing via correlation IDs
  • Repository details are now private implementation

Negative:

  • Slight increase in boilerplate (interface definitions, DTOs)
  • Need to maintain DTO mapping logic
  • Some forwardRef() usages remain for circular retry patterns

Related Decisions

  • ADR-007: Layered Architecture with Repository Pattern
  • ADR-008: Event-Driven Internal Communication
  • ADR-013: Shared Domain Library

ADR-033: Database-Backed Webhook Idempotency

| Attribute | Value |
| --- | --- |
| ID | ADR-033 |
| Title | Database-Backed Webhook Idempotency |
| Status | Implemented |
| Context | In-memory webhook idempotency cache doesn't work in multi-instance deployments |
| Date | 2026-01-17 |

Context

The original implementation used an in-memory Set<string> for webhook idempotency tracking:

private readonly processedWebhooks = new Set<string>();

This approach had critical problems:

  1. Horizontal Scaling Failure: In a multi-instance deployment, each API instance has its own cache. Webhooks may be processed multiple times across instances.
  2. Memory Leak: The Set grows unbounded as webhooks are processed, causing memory pressure in long-running instances.
  3. Restart Data Loss: All idempotency data is lost on application restart, allowing duplicate processing during restarts.

Decision

Use a PostgreSQL table (ProcessedWebhook) for webhook idempotency instead of Redis or in-memory caching.

Rationale

  • No additional infrastructure: Uses existing PostgreSQL database
  • Transactional safety: Database unique constraint ensures race-condition-safe idempotency
  • Simple cleanup: Scheduled job removes expired records hourly
  • Debugging support: Records include metadata (webhook type, order ID, timestamps)
  • Horizontal scaling: Works correctly across multiple API instances

Implementation

// Atomic check-and-mark using the unique constraint on webhookId
async isProcessedOrMark(webhookId: string, type: string): Promise<boolean> {
  const expiresAt = new Date(Date.now() + 24 * 60 * 60 * 1000); // retention window (e.g. 24h)
  try {
    await this.prisma.processedWebhook.create({
      data: { webhookId, webhookType: type, expiresAt },
    });
    return false; // first time processing
  } catch (error) {
    // P2002 = Prisma's unique-constraint violation code
    if (error instanceof Prisma.PrismaClientKnownRequestError && error.code === 'P2002') {
      return true; // already processed by this or another instance
    }
    throw error;
  }
}
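
The hourly cleanup then reduces to a single deleteMany over expired rows; a hedged sketch (names are assumptions, and the scheduling decorator would come from @nestjs/schedule):

```typescript
// Hedged sketch of the hourly cleanup job (names assumed). Injecting the
// delete operation keeps the logic testable without Prisma.
async function cleanupExpiredWebhooks(
  deleteExpiredBefore: (cutoff: Date) => Promise<number>,
  now: () => Date = () => new Date()
): Promise<number> {
  // delete every ProcessedWebhook whose expiresAt has passed
  return deleteExpiredBefore(now());
}

// Real wiring would look roughly like:
//   @Cron(CronExpression.EVERY_HOUR)
//   async cleanup() {
//     await this.prisma.processedWebhook.deleteMany({
//       where: { expiresAt: { lte: new Date() } },
//     });
//   }
```

The `@@index([expiresAt])` in the schema below keeps this query cheap.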

Database Schema

model ProcessedWebhook {
  id          String   @id @default(uuid())
  webhookId   String   @unique  // The Shopify webhook ID
  webhookType String            // e.g., "orders/create"
  processedAt DateTime @default(now())
  expiresAt   DateTime          // When this record can be cleaned up
  orderId     String?           // Associated order for debugging

  @@index([expiresAt])          // For cleanup job queries
  @@index([processedAt])        // For monitoring
}

Alternatives Considered

| Alternative | Pros | Cons | Decision |
| --- | --- | --- | --- |
| Redis | TTL support, fast | Additional infrastructure | Rejected |
| Distributed Lock | Works with DB | Complex, race conditions | Rejected |
| Database Table | Simple, no new infra | Needs cleanup job | Selected |

Consequences

Positive:

  • ✅ Works correctly in multi-instance deployments
  • ✅ Survives application restarts
  • ✅ No memory leaks
  • ✅ Auditable (can query processed webhooks)
  • ✅ Race-condition safe via unique constraint

Negative:

  • ⚠️ Slightly higher latency than in-memory (< 10ms)
  • ⚠️ Requires cleanup job (runs hourly)

Related Decisions

  • ADR-007: Layered Architecture with Repository Pattern
  • ADR-021: Retry Queue with Exponential Backoff

ADR-034: Docker Infrastructure Hardening (Log Rotation & Resource Cleanup)

| Status | Date | Context |
| --- | --- | --- |
| Accepted | 2026-01-19 | Prevent disk exhaustion from Docker logs and images |

Context

During staging operations, the server disk filled to 100% due to:

  1. Unbounded Docker logs: The default json-file log driver has no size limits, causing container logs to grow indefinitely
  2. Accumulated old images: Each deployment pulls new images but old versions remained on disk
  3. Health check failures: When disk was full, Docker couldn't execute health checks, causing containers to be marked unhealthy and Traefik to stop routing traffic

Decision

Implement automated infrastructure hardening in the deployment pipeline:

  1. Docker Log Rotation: Configure daemon-level log rotation with size limits
  2. Aggressive Resource Cleanup: Remove unused images, volumes, and networks after each deployment
  3. Separate Image Tags: Use independent version tags for API and Web to support partial deployments

Implementation

1. Docker Log Rotation Configuration

The pipeline automatically creates /etc/docker/daemon.json if missing:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

This limits each container to:

  • Maximum 10MB per log file
  • Maximum 3 rotated files
  • Total: 30MB per container (90MB for all 3 containers)

2. Deployment Cleanup Steps

After container restart, the pipeline runs:

# Remove dangling images
docker image prune -f

# Remove unused images older than 24h
docker image prune -a -f --filter "until=24h"

# Clean up unused volumes and networks
docker volume prune -f
docker network prune -f

3. Separate Image Tags

docker-compose.yml now uses independent tags:

api:
  image: ${REGISTRY_URL}/forma3d-connect-api:${API_IMAGE_TAG:-latest}

web:
  image: ${REGISTRY_URL}/forma3d-connect-web:${WEB_IMAGE_TAG:-latest}

This allows:

  • Deploying only API without changing Web version
  • Deploying only Web without changing API version
  • Independent rollbacks for each service

Consequences

Positive:

  • ✅ Prevents disk exhaustion from unbounded log growth
  • ✅ Reduces disk usage by cleaning old images after deployment
  • ✅ Supports independent versioning for API and Web
  • ✅ Self-healing: Pipeline automatically configures log rotation if missing
  • ✅ No manual intervention required

Negative:

  • ⚠️ Docker daemon restart required if log rotation config is missing (brief container interruption)
  • ⚠️ Log history limited to ~30MB per container (may need external log aggregation for production)

Configuration Summary

| Setting | Value | Rationale |
| --- | --- | --- |
| max-size | 10m | Balance between history and disk usage |
| max-file | 3 | Keeps ~30MB per container |
| Image cleanup filter | 24h | Keeps recent images for quick rollback |

Related Decisions

  • ADR-017: Docker + Traefik Deployment Strategy
  • ADR-031: Automated Container Registry Cleanup

ADR-035: Progressive Web App (PWA) for Cross-Platform Access

| Attribute | Value |
| --- | --- |
| ID | ADR-035 |
| Status | Accepted |
| Date | 2026-01-19 |
| Context | Need to provide mobile and desktop access for operators monitoring print jobs and managing orders while away from desk |

Decision

Adopt Progressive Web App (PWA) technology for the existing React web application, replacing the planned Tauri (desktop) and Capacitor (mobile) native shell applications.

The web application will be enhanced with:

  1. Web App Manifest for installability
  2. Service Worker for offline caching and push notifications
  3. Web Push API for real-time alerts on print job status

Rationale

PWA Suitability for Admin Dashboards

Research conducted in January 2026 confirms PWA is an ideal fit for Forma3D.Connect:

  • Application type: Admin dashboards and SaaS tools are PWA's primary use case
  • Feature requirements: Order management, real-time updates, and push notifications are fully supported
  • Device features: No deep hardware integration (Bluetooth, NFC, sensors) required

iOS/Safari PWA Support (2026)

Apple has significantly improved PWA support:

| Feature | iOS Version | Status |
| --- | --- | --- |
| Web Push Notifications | iOS 16.4+ | ✅ Supported (Home Screen install required) |
| Badging API | iOS 16.4+ | ✅ Supported |
| Declarative Web Push | iOS 18.4+ | ✅ Improved reliability |
| Standalone Display Mode | iOS 16.4+ | ✅ Supported |

Cost-Benefit Analysis

| Aspect | Tauri + Capacitor | PWA |
| --- | --- | --- |
| Initial development | 40-80 hours | 8-16 hours |
| CI/CD pipelines | Additional complexity | None |
| Code signing | Required (Apple, Windows) | None |
| App store submissions | Required | None |
| Update cycle | Days (app store review) | Instant |
| Maintenance | Ongoing | Minimal |

Estimated savings: 80-150 hours initial + ongoing maintenance reduction

Tauri/Capacitor Provided No Real Advantage

Both planned native apps were WebView wrappers:

  • Container(desktop, "Tauri, Rust", "Native desktop shell wrapping the web application")
  • Container(mobile, "Capacitor", "Mobile shell for on-the-go monitoring")

PWA provides the same experience (installable, app-like, offline capable) without:

  • Separate build pipelines
  • Platform-specific debugging
  • App store management
  • Code signing certificates

Implementation

Phase 1: PWA Foundation

  1. Add vite-plugin-pwa to the web application
  2. Create manifest.json with app metadata and icons
  3. Configure service worker for asset caching
  4. Enable HTTPS (already implemented)

Example manifest.json:

{
  "name": "Forma3D.Connect",
  "short_name": "Forma3D",
  "start_url": "/",
  "display": "standalone",
  "background_color": "#ffffff",
  "theme_color": "#0066cc"
}
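
Wiring this up with vite-plugin-pwa might look as follows (a sketch: the manifest fields mirror the example above, but the plugin options shown are assumptions to be checked against the plugin docs):

```typescript
// vite.config.ts sketch (option values assumed; consult vite-plugin-pwa docs)
import { defineConfig } from 'vite';
import { VitePWA } from 'vite-plugin-pwa';

export default defineConfig({
  plugins: [
    VitePWA({
      registerType: 'autoUpdate', // service worker updates without a prompt
      manifest: {
        name: 'Forma3D.Connect',
        short_name: 'Forma3D',
        start_url: '/',
        display: 'standalone',
        background_color: '#ffffff',
        theme_color: '#0066cc',
      },
    }),
  ],
});
```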

Phase 2: Push Notifications

  1. Implement Web Push API in frontend
  2. Add VAPID key configuration to API
  3. Create notification service (integrate with existing email notifications)
  4. User permission flow in dashboard settings
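
Subscribing the browser for Web Push requires converting the base64url-encoded VAPID public key into a Uint8Array; the standard helper for that (generic Web Push boilerplate, not project code):

```typescript
// Standard Web Push boilerplate: decode a base64url VAPID public key for
// pushManager.subscribe({ applicationServerKey }). Not project-specific.
function urlBase64ToUint8Array(base64Url: string): Uint8Array {
  const padding = '='.repeat((4 - (base64Url.length % 4)) % 4);
  const base64 = (base64Url + padding).replace(/-/g, '+').replace(/_/g, '/');
  const g = globalThis as any; // atob in browsers, Buffer fallback in Node
  const raw: string =
    typeof g.atob === 'function'
      ? g.atob(base64)
      : g.Buffer.from(base64, 'base64').toString('binary');
  return Uint8Array.from(raw, (c) => c.charCodeAt(0));
}

// Browser usage sketch (VAPID_PUBLIC_KEY served by the API, per step 2):
//   const reg = await navigator.serviceWorker.ready;
//   await reg.pushManager.subscribe({
//     userVisibleOnly: true,
//     applicationServerKey: urlBase64ToUint8Array(VAPID_PUBLIC_KEY),
//   });
```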

Phase 3: Enhanced Offline Support

  1. IndexedDB for offline data caching
  2. Background sync for queued actions
  3. Optimistic UI updates

Consequences

Positive:

  • ✅ Significant reduction in development and maintenance effort
  • ✅ Single codebase, single deployment target
  • ✅ Instant updates for all users (no app store delays)
  • ✅ No platform-specific bugs or WebView inconsistencies
  • ✅ No code signing or app store management
  • ✅ Works on any device with a modern browser

Negative:

  • ⚠️ iOS requires Home Screen install for full PWA features
  • ⚠️ No notification sounds on iOS PWA (visual only)
  • ⚠️ Limited system tray integration on desktop

Removed from Project:

  • apps/desktop (Tauri) - removed from roadmap
  • apps/mobile (Capacitor) - removed from roadmap

Updated Architecture

The C4 Container diagram has been updated to reflect the PWA-only architecture:

Before:
├── Web Application (React 19)
├── Desktop App (Tauri) [future]
├── Mobile App (Capacitor) [future]
└── API Server (NestJS)

After:
├── Progressive Web App (React 19 + PWA)
└── API Server (NestJS)

Alternatives Considered

| Alternative | Reason for Rejection |
| --- | --- |
| Keep Tauri + Capacitor plan | Unnecessary complexity; WebView wrappers provide no advantage over PWA |
| React Native for mobile | Requires separate codebase; overkill for admin dashboard |
| Electron for desktop | Large bundle size; same WebView approach as Tauri but less efficient |
| Flutter | Requires separate codebase; not justified for simple dashboard |

ADR-036: localStorage Fallback for PWA Install Detection

| Attribute | Value |
| --- | --- |
| ID | ADR-036 |
| Status | Accepted |
| Date | 2026-01-20 |
| Context | Need to detect if PWA is installed when user views site in browser, to show appropriate messaging and avoid duplicate install prompts |

Decision

Use a dual detection strategy combining the getInstalledRelatedApps() API with localStorage persistence as a fallback for PWA installation detection.

Rationale

The Problem

When a user installs a PWA and later visits the same site in a regular browser:

  • The browser doesn't know the PWA is installed
  • The site shows "Install App" even though it's already installed
  • This creates a confusing user experience

API Limitations

The navigator.getInstalledRelatedApps() API can detect installed PWAs, but has limitations:

| Platform | Chrome Version | Support |
| --- | --- | --- |
| Android | 80+ | ✅ Full support |
| Windows | 85+ | ✅ Supported |
| macOS | 140+ | ✅ Same-scope only |
| iOS/Safari | - | ❌ Not supported |

Even where supported, the API can be unreliable due to:

  • Scope restrictions (must be same origin/scope)
  • Timing issues during page load
  • Browser implementation quirks

Dual Detection Strategy

  1. Primary: getInstalledRelatedApps() API
     • Query the browser for installed related apps
     • Works when supported and correctly configured
  2. Fallback: localStorage persistence
     • Store pwa-installed: true when:
       • User installs via the appinstalled event
       • App is opened in standalone mode
       • API successfully detects installation
     • Check localStorage on page load

Implementation

// Detection flow
useEffect(() => {
  // 1. Check standalone mode (running inside PWA)
  const isStandalone = window.matchMedia('(display-mode: standalone)').matches;
  if (isStandalone) {
    setIsInstalled(true);
    localStorage.setItem('pwa-installed', 'true');
    return;
  }

  // 2. Check localStorage fallback
  if (localStorage.getItem('pwa-installed') === 'true') {
    setIsInstalled(true);
  }

  // 3. Try getInstalledRelatedApps API
  if (navigator.getInstalledRelatedApps) {
    navigator.getInstalledRelatedApps().then((apps) => {
      if (apps.some((app) => app.platform === 'webapp')) {
        setIsInstalled(true);
        localStorage.setItem('pwa-installed', 'true');
      }
    });
  }
}, []);

// Persist on install
window.addEventListener('appinstalled', () => {
  localStorage.setItem('pwa-installed', 'true');
});

Consequences

Positive:

  • ✅ Works across all browsers and platforms
  • ✅ Provides consistent UX when switching between PWA and browser
  • ✅ No false "Install App" prompts when already installed
  • ✅ Gracefully degrades when API not supported

Negative:

  • ⚠️ localStorage can become stale if user uninstalls PWA externally
  • ⚠️ No automatic cleanup mechanism for uninstalled apps
  • ⚠️ Per-browser storage (installing in Chrome won't reflect in Firefox)

Trade-off Accepted:

The risk of showing "Installed" for an uninstalled app is acceptable because:

  • Users rarely uninstall and then want to reinstall immediately
  • Clearing site data will reset the state
  • Better UX than constantly prompting to install an already-installed app

Alternatives Considered

| Alternative | Reason for Rejection |
| --- | --- |
| API only | Too unreliable; doesn't work on Safari/iOS |
| localStorage only | Misses installations from other sessions |
| Server-side tracking | Requires authentication; overcomplicated |
| Cookie-based | Cleared more frequently than localStorage |

ADR-037: Keep a Changelog for Release Documentation

| Attribute | Value |
| --- | --- |
| ID | ADR-037 |
| Status | Accepted |
| Date | 2026-01-20 |
| Context | Need a standardized way to document changes between releases for developers, operators, and stakeholders |

Decision

Adopt the Keep a Changelog format for documenting all notable changes to the project, combined with Semantic Versioning for version numbers.

Rationale

Why Keep a Changelog?

  1. Human-readable: Written for humans, not machines - focuses on what matters to users
  2. Standardized format: Well-known convention reduces cognitive load
  3. Categorized changes: Clear sections (Added, Changed, Deprecated, Removed, Fixed, Security)
  4. Release-oriented: Groups changes by version, making it easy to see what's in each release
  5. Unreleased section: Accumulates changes before a release, making release notes easy

Why Semantic Versioning?

  • MAJOR.MINOR.PATCH format communicates impact:
    • MAJOR: Breaking changes
    • MINOR: New features (backward compatible)
    • PATCH: Bug fixes (backward compatible)
  • Industry standard, well understood by developers
  • Enables automated tooling and dependency management

Benefits for AI-Generated Codebase

This project is primarily AI-generated, making structured documentation critical:

  1. Context for AI: Changelog provides history context for future AI sessions
  2. Audit trail: Documents what was added/changed in each phase
  3. Stakeholder communication: Non-technical stakeholders can understand progress
  4. Debugging aid: When issues arise, changelog helps identify when changes were introduced

Implementation

File location: CHANGELOG.md in repository root

Format:

# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [0.7.0] - 2026-01-19

### Added

- Feature description

### Changed

- Change description

### Fixed

- Bug fix description

### Security

- Security fix description

Change categories (use only those that apply):

  • Added: New features
  • Changed: Changes to existing functionality
  • Deprecated: Features marked for removal
  • Removed: Features removed
  • Fixed: Bug fixes
  • Security: Vulnerability fixes

Guidelines

  1. Update with every PR: Add changelog entry as part of the PR
  2. Write for humans: Describe the user impact, not implementation details
  3. Link to issues/PRs: Reference related issues where helpful
  4. Keep Unreleased current: Move entries to versioned section on release
  5. One entry per change: Don't combine unrelated changes

Consequences

Positive:

  • ✅ Clear release history for all stakeholders
  • ✅ Standardized format reduces documentation overhead
  • ✅ Supports both manual reading and automated parsing
  • ✅ Integrates well with CI/CD release workflows
  • ✅ Provides context for AI-assisted development sessions

Negative:

  • ⚠️ Requires discipline to update with each change
  • ⚠️ Can become verbose if too granular

Alternatives Considered

| Alternative | Reason for Rejection |
| --- | --- |
| Git commit history only | Too granular; hard to see high-level changes |
| GitHub Releases only | Tied to GitHub; not in repository |
| Auto-generated from commits | Requires strict commit conventions; often too noisy |
| Wiki-based changelog | Separate from code; easy to forget to update |

ADR-038: Zensical for Publishing Project Documentation

| Attribute | Value |
| --- | --- |
| ID | ADR-038 |
| Status | Accepted |
| Date | 2026-01-21 |
| Context | Need a maintainable, deployable documentation website built from the repository docs/ |

Decision

Publish the repository documentation in docs/ as a static website built with Zensical.

The docs site is:

  • Built from docs/ with configuration in zensical.toml
  • Rendered with PlantUML pre-rendering (SVG/PNG) for existing diagrams
  • Packaged as a container image forma3d-connect-docs and published to the existing container registry
  • Deployed to staging behind Traefik at https://staging-connect-docs.forma3d.be
  • Managed by the existing Azure DevOps pipeline using docsAffected detection

Rationale

  • Single source of truth: docs live next to the code they describe (docs/)
  • Static output: simple, fast, cacheable; no backend runtime required
  • Pipeline parity: follows the same build/sign/SBOM/deploy controls as api and web
  • Diagram support: preserves existing PlantUML investment via deterministic CI rendering

Implementation

  • Config: zensical.toml (sets site name, logo, PlantUML markdown extension)
  • Container build: deployment/docs/Dockerfile (builds site + serves via Nginx)
  • Staging service: deployment/staging/docker-compose.yml (docs service + Traefik labels)
  • CI/CD: azure-pipelines.yml
    • Detect changes to docs/** or zensical.toml via docsAffected
    • Build/push/sign/SBOM the forma3d-connect-docs image
    • Deploy conditionally to staging

Consequences

Positive:

  • ✅ Documentation changes can be delivered independently of API/Web
  • ✅ Consistent hosting model (Traefik + container) across services
  • ✅ PlantUML diagrams render in the published docs site

Negative:

  • ⚠️ Docs builds can be slower due to diagram rendering (mitigated by caching)
  • ⚠️ Local preview requires Zensical + Java/Graphviz (documented in developer workflow)

Alternatives Considered

| Alternative | Reason for Rejection |
| --- | --- |
| Host Markdown in repo UI | Not a branded, searchable documentation site |
| MkDocs Material | Zensical provides a modern, batteries-included path with similar ecosystem compatibility |
| Convert all diagrams to Mermaid | High migration effort; risk of losing diagram fidelity |

ADR-039: Global API Key Authentication (Fail-Closed)

| Attribute | Value |
| --- | --- |
| ID | ADR-039 |
| Status | Accepted |
| Date | 2026-01-21 |
| Context | The API exposed non-health endpoints when INTERNAL_API_KEY was missing, risking data access |

Decision

Enforce API key authentication globally for the API application, with explicit public exceptions.

  • All HTTP routes require X-API-Key by default
  • Only the following are public:
    • /health/** (orchestration/monitoring probes)
    • External webhook receivers (secured by their own verification guards)
  • Authentication is fail-closed:
    • If INTERNAL_API_KEY is not configured, non-public endpoints return an error (no “development bypass”)
  • Real-time channel is also secured:
    • Socket.IO /events requires the same internal API key during handshake

Rationale

  • Default-secure posture: avoids accidental exposure in development/staging due to missing env vars
  • Consistency: one policy applied across all controllers (no “forgot to add @UseGuards” drift)
  • Clear separation: health/webhooks remain reachable for infrastructure and external platforms
  • Parity with dashboard: matches the operator dashboard’s expectation that API access is gated

Implementation

  • Global guard: register API key guard as an APP_GUARD in apps/api
  • Public routes: introduce @Public() decorator to opt out for /health/** and webhook controllers
  • Fail-closed config: if INTERNAL_API_KEY is missing, non-public HTTP routes are rejected
  • WebSocket guard: add WsApiKeyGuard to EventsGateway for the /events namespace
  • Webhook verification: require SIMPLYPRINT_WEBHOOK_SECRET for SimplyPrint inbound verification (fail-closed)
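The fail-closed decision can be condensed into one predicate. The following is an illustrative, framework-free sketch (the actual guard is a NestJS APP_GUARD; the names here are hypothetical, not the real code):

```typescript
interface RouteContext {
  isPublic: boolean;        // set by the @Public() decorator
  providedApiKey?: string;  // value of the X-API-Key header
}

function isRequestAllowed(route: RouteContext, configuredKey: string | undefined): boolean {
  if (route.isPublic) return true;  // /health/**, webhook receivers
  if (!configuredKey) return false; // fail-closed: missing INTERNAL_API_KEY rejects everything non-public
  // Production code should use a constant-time comparison here.
  return route.providedApiKey === configuredKey;
}
```

Note that the "dev mode" branch rejected in the alternatives table is exactly the inverse of the second line: an unconfigured key denies access rather than granting it.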

Consequences

Positive:

  • ✅ Eliminates unauthenticated access to operational/admin API endpoints
  • ✅ Prevents misconfiguration from silently reducing security
  • ✅ Makes “secure endpoints” the default, with explicit public exceptions
  • ✅ Secures both REST and realtime update channels consistently

Negative:

  • ⚠️ Local development now requires configuring INTERNAL_API_KEY to use non-health endpoints
  • ⚠️ Clients (dashboard, tools) must always send X-API-Key for non-public routes

Alternatives Considered

| Alternative | Reason for Rejection |
| --- | --- |
| Per-controller @UseGuards(ApiKeyGuard) only | Easy to miss a controller; inconsistent over time |
| Allow all when key missing (“dev mode”) | Unsafe default; makes staging/prod exposure more likely |
| Network-only restrictions (IP allowlist) | Harder operationally; not sufficient on its own |
References

  • ADR-024 / ADR-029 (previous API key authentication decisions)
  • apps/api/src/common/guards/api-key.guard.ts
  • apps/api/src/common/decorators/public.decorator.ts

ADR-040: Shopify Order Backfill for Downtime Recovery

| Attribute | Value |
| --- | --- |
| ID | ADR-040 |
| Status | Accepted |
| Date | 2026-01-22 |
| Context | Shopify webhooks retry for only ~4 hours; extended downtime can permanently lose order events |

Decision

Implement a scheduled backfill service that periodically polls Shopify's Orders API to catch any orders missed during webhook delivery failures.

Strategy:

  • Store a durable since_id watermark in the SystemConfig table
  • Every 5 minutes (configurable), fetch orders from Shopify with since_id pagination
  • For each order not in our database, create it using the same mapping logic as webhooks
  • Advance watermark only after successful processing (not before, unlike webhook path)
  • Provide admin endpoints for manual backfill trigger, status check, and watermark reset
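The strategy above can be sketched as a single backfill pass. This is a condensed, synchronous illustration (the real ShopifyBackfillService is async and Prisma-backed; all names here are illustrative). The key property is that the watermark advances only after the whole batch has been processed:

```typescript
interface ShopifyOrder { id: number }

interface BackfillDeps {
  getWatermark(): number;                       // since_id from SystemConfig
  fetchOrders(sinceId: number, limit: number): ShopifyOrder[];
  orderExists(shopifyOrderId: number): boolean; // dedupe by shopifyOrderId
  createOrder(order: ShopifyOrder): void;       // same mapping logic as webhooks
  setWatermark(sinceId: number): void;
}

function runBackfillOnce(deps: BackfillDeps, batchSize = 50): number {
  const orders = deps.fetchOrders(deps.getWatermark(), batchSize);
  let created = 0;
  for (const order of orders) {
    if (!deps.orderExists(order.id)) {
      deps.createOrder(order);
      created++;
    }
  }
  // Advance only after successful processing: a crash mid-batch re-fetches
  // the same orders next run, which is safe because creation is idempotent.
  if (orders.length > 0) {
    deps.setWatermark(Math.max(...orders.map((o) => o.id)));
  }
  return created;
}
```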

Rationale

  • Shopify retry window is limited: Webhooks are retried only 8 times over ~4 hours (as of September 2024)
  • Downtime recovery: If service is down longer than 4 hours, orders would be permanently lost without backfill
  • Idempotent by design: Order creation is already deduplicated by shopifyOrderId, so re-processing is safe
  • Operational visibility: Admin endpoints allow operators to trigger backfill after incidents
  • Consistent mapping: Reuses the same ShopifyService.buildCreateOrderInput() method as webhooks

Implementation

  • SystemConfigService: New service for persisting key-value configuration (watermarks, etc.)
  • ShopifyBackfillService: Scheduled job with @Cron(EVERY_5_MINUTES) plus startup run
  • ShopifyAdminController: Admin endpoints at /api/v1/admin/shopify/backfill/*
  • Shared mapping: Extracted buildCreateOrderInput() and checkUnmappedSkus() in ShopifyService (uses findUnmappedLineItems() with product/variant ID + SKU matching)

Configuration

| Environment Variable | Default | Description |
| --- | --- | --- |
| SHOPIFY_BACKFILL_ENABLED | true | Enable/disable scheduled backfill |
| SHOPIFY_BACKFILL_BATCH_SIZE | 50 | Orders to fetch per API call |

Consequences

Positive:

  • ✅ Guarantees order recovery after extended downtime (not dependent on webhook retry window)
  • ✅ Uses existing idempotency (no duplicates even with aggressive backfill)
  • ✅ Operators can manually trigger backfill after incidents
  • ✅ Observable via event logs and admin status endpoint

Negative:

  • ⚠️ Adds Shopify API calls even during normal operation (rate limit aware)
  • ⚠️ Does not reconstruct intermediate webhook events (e.g., multiple orders/updated during downtime)
  • ⚠️ Initial backfill on existing system may take time to paginate through history

Alternatives Considered

| Alternative | Reason for Rejection |
| --- | --- |
| Only rely on Shopify retries | 4-hour window insufficient for extended outages |
| Event sourcing / webhook queue | Over-engineered for current scale; adds infrastructure |
| Manual import after incidents | Error-prone, delays recovery, requires operator intervention |
| Time-based polling (updated_at_min) | Harder to paginate reliably; since_id is simpler and robust |
References

  • ADR-011 (Idempotent Webhook Processing)
  • ADR-033 (Database-Backed Webhook Idempotency)
  • apps/api/src/shopify/shopify-backfill.service.ts
  • apps/api/src/config/system-config.service.ts

ADR-041: SimplyPrint Webhook Idempotency and Job Reconciliation

| Attribute | Value |
| --- | --- |
| ID | ADR-041 |
| Status | Accepted |
| Date | 2026-01-22 |
| Context | SimplyPrint webhooks lacked idempotency; polling only detected PRINTING status, not completed/failed jobs |

Decision

Add database-backed webhook idempotency to SimplyPrint webhook handling and implement a job reconciliation service that periodically syncs print job statuses with SimplyPrint's API.

Webhook Idempotency:

  • Reuse the existing WebhookIdempotencyRepository (same as Shopify)
  • Deduplicate by webhook_id from SimplyPrint payload
  • Key format: simplyprint/{event} (e.g., simplyprint/job.started)
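A minimal sketch of the deduplication check (the key prefix follows the ADR; how the key composes with webhook_id, and the in-memory store standing in for the database-backed WebhookIdempotencyRepository, are illustrative assumptions):

```typescript
function idempotencyKey(event: string): string {
  return `simplyprint/${event}`; // e.g. simplyprint/job.started
}

// In-memory stand-in for the database-backed idempotency store.
const seen = new Set<string>();

function shouldProcess(webhookId: string, event: string): boolean {
  const entry = `${idempotencyKey(event)}#${webhookId}`;
  if (seen.has(entry)) return false; // duplicate delivery: acknowledge but skip
  seen.add(entry);
  return true;
}
```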

Job Reconciliation:

  • Scheduled job runs every minute to check active print jobs
  • Query all print jobs with simplyPrintJobId in non-terminal states (QUEUED, ASSIGNED, PRINTING)
  • Compare local status with SimplyPrint's queue and printer states
  • Emit JOB_STATUS_CHANGED events for discrepancies

Rationale

  • Webhook idempotency: SimplyPrint may retry webhooks on timeout; duplicate events could cause issues
  • Existing polling was limited: Only detected PRINTING status via printer polling
  • History-based terminal state detection: If COMPLETED/FAILED/CANCELLED webhooks are missed, the reconciliation service queries SimplyPrint's print history API (GET /{id}/jobs/Get) after a 5-minute grace period to detect terminal states automatically
  • Hybrid approach: Webhooks for real-time + reconciliation for reliability (belt and suspenders)

Implementation

  • SimplyPrintService.handleWebhook(): Added idempotency check using WebhookIdempotencyRepository
  • SimplyPrintReconciliationService: New service with @Cron(EVERY_MINUTE) that reconciles job statuses
  • SimplyPrintReconciliationService.handleMissingJob(): Two-step lookup for missing jobs — getJob() (GetDetails) then getJobHistory() (history list) — with grace period (5 min), rate limiting (max 10/cycle), and escalation logging (30 min)
  • SimplyPrintApiClient.getJobHistory(): Queries the print history endpoint to find completed/failed/cancelled jobs no longer in the queue or on a printer
  • Direct Prisma access: Reconciliation uses PrismaService directly to avoid circular dependency with PrintJobsModule

SimplyPrint Job ID Resolution

SimplyPrint uses three different identifiers for the same logical job:

| Identifier | Source | Format | When available |
| --- | --- | --- | --- |
| Queue-item created_id | AddItem response | Integer (e.g. 385029) | At queue time |
| Job uid | Webhooks, GetDetails | UUID (e.g. da69d2a4-...) | After job starts |
| Job numeric id | Webhooks | Integer (e.g. 552252) | After job starts |

When a job is queued, we store created_id as both simplyPrintJobId (mutable) and simplyPrintQueueItemId (persistent). When the first job.started webhook arrives, simplyPrintJobId is updated to the job UID for fast future lookups, but simplyPrintQueueItemId is never overwritten.

Lookup chain in PrintJobsService.handleSimplyPrintStatusChange():

  1. Primary: Find by simplyPrintJobId = webhook job UID
  2. Fallback 1: Find by simplyPrintJobId = webhook numeric job ID
  3. Fallback 2: Call GetDetails API for the job's queued.id, find by simplyPrintJobId = queued.id
  4. Fallback 3: Find by simplyPrintQueueItemId = queued.id (handles re-queued jobs where simplyPrintJobId was already overwritten with the first job's UID)

Fallback 3 includes a safety check: before adopting the matched print job, it verifies that the new job UID is not already linked to another print job in the database (prevents accidentally hijacking a webhook for a different order's job).

Re-queue scenario: When SimplyPrint cancels a job and the operator clears the bed, SimplyPrint revives the same queue item and creates a new job with a different UID but the same queued.id. Fallback 3 matches via simplyPrintQueueItemId and adopts the cancelled print job, updating its simplyPrintJobId to the new UID.
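The lookup chain above can be condensed into one resolver. This is an illustrative sketch only — the real logic lives in PrintJobsService.handleSimplyPrintStatusChange(), fallback 2 involves a GetDetails API call, and the record shapes here are hypothetical:

```typescript
interface PrintJobRow {
  id: string;
  simplyPrintJobId: string | null;       // mutable: overwritten with the job UID
  simplyPrintQueueItemId: string | null; // persistent: never overwritten
}

function resolvePrintJob(
  jobs: PrintJobRow[],
  webhook: { jobUid: string; jobNumericId: string; queuedId: string | null },
): PrintJobRow | undefined {
  const byJobId = (v: string) => jobs.find((j) => j.simplyPrintJobId === v);

  // 1. Primary: job UID
  let match = byJobId(webhook.jobUid);
  // 2. Fallback 1: numeric job id
  if (!match) match = byJobId(webhook.jobNumericId);
  if (!match && webhook.queuedId) {
    // 3. Fallback 2: queue-item id still stored as simplyPrintJobId
    match = byJobId(webhook.queuedId);
    // 4. Fallback 3: persistent queue-item id (re-queued job whose
    //    simplyPrintJobId already holds the first job's UID). The real code
    //    also verifies the new UID isn't linked to another print job.
    if (!match) {
      match = jobs.find((j) => j.simplyPrintQueueItemId === webhook.queuedId);
    }
  }
  return match;
}
```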

Configuration

| Environment Variable | Default | Description |
| --- | --- | --- |
| SIMPLYPRINT_RECONCILIATION_ENABLED | true | Enable/disable scheduled reconciliation |

Consequences

Positive:

  • ✅ Prevents duplicate event processing from webhook retries
  • ✅ Catches missed PRINTING status changes via reconciliation
  • ✅ Uses existing idempotency infrastructure (no new tables)
  • ✅ Observable via event logs
  • ✅ Re-queued jobs are automatically matched back to their original print job via persistent simplyPrintQueueItemId

Negative:

  • ⚠️ Terminal states missed by webhooks are detected only after the 5-minute reconciliation grace period, so COMPLETED/FAILED updates can lag
  • ⚠️ Adds API calls to SimplyPrint every minute (rate limit aware)
  • ⚠️ Jobs "missing" from SimplyPrint are logged but not auto-updated (avoids incorrect state changes)

Alternatives Considered

| Alternative | Reason for Rejection |
| --- | --- |
| Extend existing polling for all states | Printer polling only surfaces in-progress jobs; terminal states require separate history queries, handled by reconciliation |
| Store last-seen status for comparison | Over-complicated; event emission on change is sufficient |
| Skip idempotency (rely on status check) | Status check is partial protection; true idempotency is safer |
References

  • ADR-033 (Database-Backed Webhook Idempotency)
  • ADR-040 (Shopify Order Backfill)
  • apps/api/src/simplyprint/simplyprint.service.ts
  • apps/api/src/simplyprint/simplyprint-reconciliation.service.ts

ADR-042: SendCloud Webhook Integration for Shipment Status Updates

| Attribute | Value |
| --- | --- |
| ID | ADR-042 |
| Status | Accepted |
| Date | 2026-01-22 |
| Context | Shipment statuses only updated at label creation; no visibility into transit/delivery state |

Decision

Implement SendCloud webhook receiver for real-time shipment status updates with HMAC-SHA256 signature verification and a reconciliation service for backfill.

Webhook Handling:

  • New endpoint: POST /webhooks/sendcloud
  • Verify Sendcloud-Signature header using HMAC-SHA256
  • Process parcel_status_changed events
  • Database-backed idempotency using existing infrastructure
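A minimal sketch of the signature check performed by the webhook guard, assuming the Sendcloud-Signature header carries a hex-encoded HMAC-SHA256 of the raw request body (verify the exact encoding against SendCloud's documentation before relying on this):

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

function verifySendcloudSignature(rawBody: string, signatureHeader: string, secret: string): boolean {
  const expected = createHmac('sha256', secret).update(rawBody, 'utf8').digest('hex');
  const a = Buffer.from(expected, 'utf8');
  const b = Buffer.from(signatureHeader, 'utf8');
  // timingSafeEqual throws on length mismatch, so guard first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Verification must run against the raw body bytes, before JSON parsing, since any re-serialization would change the digest.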

Status Mapping:

| SendCloud Status ID | ShipmentStatus |
| --- | --- |
| 1-10 | LABEL_CREATED |
| 11-99 | ANNOUNCED |
| 1000-1098 | IN_TRANSIT |
| 1100-1199 | CANCELLED |
| 1999, 2001+ | FAILED |
| 2000 | DELIVERED |
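The mapping table translates directly into range checks. A sketch (the function name is illustrative; the real mapping lives in SendcloudWebhookService):

```typescript
type ShipmentStatus =
  | 'LABEL_CREATED' | 'ANNOUNCED' | 'IN_TRANSIT'
  | 'CANCELLED' | 'FAILED' | 'DELIVERED' | 'UNKNOWN';

function mapSendcloudStatus(statusId: number): ShipmentStatus {
  if (statusId >= 1 && statusId <= 10) return 'LABEL_CREATED';
  if (statusId >= 11 && statusId <= 99) return 'ANNOUNCED';
  if (statusId >= 1000 && statusId <= 1098) return 'IN_TRANSIT';
  if (statusId >= 1100 && statusId <= 1199) return 'CANCELLED';
  if (statusId === 2000) return 'DELIVERED';          // checked before the FAILED range
  if (statusId === 1999 || statusId >= 2001) return 'FAILED';
  return 'UNKNOWN'; // IDs outside the documented ranges
}
```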

Reconciliation:

  • Scheduled job runs every 5 minutes
  • Polls SendCloud API (getParcel) for active shipments
  • Updates status for any discrepancies found

Rationale

  • Customer visibility: Users need to see when shipments are in transit, delivered, or failed
  • Operational awareness: Operators need to know if shipments encounter problems. The shipmentStatus field on every order API response enables shipping status badges and dedicated shipping filters (Ready to Ship, In Transit, Delivered, Shipping Issues) on the orders list page.
  • Existing UI ready: ShippingInfo component already displays all statuses with color-coded badges. The orders list now shows a shipping badge (with truck icon) alongside the order status badge.
  • Webhook reliability: SendCloud may retry webhooks; idempotency prevents duplicate processing

Implementation

  • SendcloudWebhookGuard: Verifies HMAC-SHA256 signature
  • SendcloudWebhookService: Processes status changes, maps statuses, updates shipments
  • SendcloudReconciliationService: Polls SendCloud API every 5 minutes for active shipments
  • IShipmentsService.findBySendcloudParcelId(): Added to interface for parcel ID lookups
  • OrderResponseDto.shipmentStatus: Every order API response includes the associated shipment status (null if no shipment exists). The OrderQueryDto supports shipmentStatus filter (by exact status) and readyToShip boolean filter (completed orders with PENDING/LABEL_CREATED/ANNOUNCED shipments).
  • ShipmentStatus enum: Shared via @forma3d/domain (libs/domain/src/enums/shipment-status.ts) for use across backend and frontend.

Configuration

| Environment Variable | Default | Description |
| --- | --- | --- |
| SENDCLOUD_WEBHOOK_SECRET | - | HMAC secret for signature verification (same as API Secret Key) |
| SENDCLOUD_RECONCILIATION_ENABLED | true | Enable/disable scheduled reconciliation |

Consequences

Positive:

  • ✅ Real-time shipment status updates in UI
  • ✅ Automatic detection of delivered/failed shipments
  • ✅ Backfill for missed webhooks via reconciliation
  • ✅ Uses existing idempotency infrastructure
  • ✅ Shipping status surfaced on orders list via shipmentStatus field — operators can filter by shipping status (Ready to Ship, In Transit, Delivered, Shipping Issues) and see badges on each order row

Negative:

  • ⚠️ Requires webhook configuration in SendCloud panel
  • ⚠️ Additional API calls for reconciliation (rate limit aware)
  • ⚠️ Webhook secret must be configured for production security
References

  • ADR-033 (Database-Backed Webhook Idempotency)
  • ADR-041 (SimplyPrint Webhook Idempotency)
  • apps/api/src/sendcloud/sendcloud-webhook.service.ts
  • apps/api/src/sendcloud/sendcloud-reconciliation.service.ts

ADR-043: PWA Version Mismatch Detection

| Attribute | Value |
| --- | --- |
| ID | ADR-043 |
| Status | Accepted |
| Date | 2026-01-23 |
| Context | Users may run outdated PWA versions if they dismiss the update prompt or if the service worker hasn't yet detected updates |

Decision

Implement automatic version mismatch detection on the Settings page that compares the cached PWA version against the server version and triggers the service worker update prompt when they differ.

Rationale

Problem Statement

The PWA displays the frontend version in two places:

  1. Settings page - Shows version from cached /build-info.json
  2. Sidebar footer - Shows version from cached /build-info.json

When a new version is deployed:

  • The service worker checks for updates hourly
  • Users may have dismissed the "Update now" prompt
  • The cached version can become stale

Users visiting the Settings page to check version information should be prompted to update if running an outdated version.

Solution

When the user navigates to the Settings page:

  1. Fetch /build-info.json from the server with cache-busting headers
  2. Compare the server version against the cached PWA version
  3. If versions differ, call registration.update() on the service worker
  4. This triggers the "New version available!" prompt

Implementation

New Components

| Component | Path | Description |
| --- | --- | --- |
| ServiceWorkerContext | apps/web/src/contexts/service-worker-context.tsx | Centralized SW state management, exposes checkForUpdates() |
| useServerVersion | apps/web/src/hooks/use-server-version.ts | Fetches fresh version with cache-busting |
| useVersionMismatchCheck | apps/web/src/hooks/use-version-mismatch-check.ts | Compares versions, triggers update on mismatch |

Architecture

User visits Settings page
        │
        ▼
useVersionMismatchCheck({ checkOnMount: true })
        │
        ▼
fetch('/build-info.json?_=timestamp', { cache: 'no-store' })
        │
        ▼
Compare serverVersion vs cachedVersion
        │
        ▼ (if different)
serviceWorkerContext.checkForUpdates()
        │
        ▼
registration.update() detects new SW
        │
        ▼
needRefresh = true → Update prompt shown

Cache-Busting Strategy

const response = await fetch(`/build-info.json?_=${Date.now()}`, {
  cache: 'no-store',
  headers: {
    'Cache-Control': 'no-cache, no-store, must-revalidate',
    Pragma: 'no-cache',
  },
});

Consequences

Positive:

  • ✅ Users are reliably prompted to update when viewing version info
  • ✅ Works even if previous update prompt was dismissed
  • ✅ No polling overhead - only checks when user visits Settings
  • ✅ Centralized service worker state via React Context
  • ✅ Reusable hooks for future version-aware features

Negative:

  • ⚠️ Extra network request on Settings page load
  • ⚠️ Relies on service worker being registered
References

  • ADR-035 (Progressive Web App for Cross-Platform Access)
  • ADR-036 (localStorage Fallback for PWA Install Detection)
  • apps/web/src/pwa/sw-update-prompt.tsx
  • apps/web/src/pages/settings/index.tsx

ADR-044: Role-Based Access Control and Tenant-Ready Architecture

Attribute Value
ID ADR-044
Status Accepted
Date 2026-01-24
Context Need to implement multi-user authentication with role-based access control, while preparing for future multi-tenancy

Decision

Implement in-app RBAC and tenant-ready data isolation without external identity providers (no Keycloak/OpenID Connect yet).

Key Decisions

  1. Session-Based Authentication
     • HTTP-only cookies with express-session and PostgreSQL session store
     • Argon2id password hashing with automatic rehashing
     • Legacy API key authentication preserved for backward compatibility
  2. Permission-Based Authorization
     • Permissions are string constants (e.g., orders.read, orders.write)
     • Roles are named bundles of permissions (e.g., admin, operator, viewer)
     • Users can have multiple roles; effective permissions = union of all role permissions
     • Server-side enforcement via NestJS guards (SessionGuard, PermissionsGuard)
  3. Tenant-Ready Data Model
     • All tenant-owned entities include tenantId foreign key
     • Repositories enforce tenant scoping in all queries
     • Single default tenant (00000000-0000-0000-0000-000000000001) for current operations
     • Architecture supports future multi-tenant expansion
  4. Security Auditing
     • AuditLog table captures security-relevant actions
     • Actor identity and tenant context attached to Sentry error reports
     • No logging of sensitive data (passwords, tokens, API keys)
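The effective-permission rule (union of all role permissions) can be sketched as follows. Role names and permission strings below are illustrative examples, not the seeded defaults:

```typescript
const rolePermissions: Record<string, string[]> = {
  operator: ['orders.read', 'orders.write', 'printjobs.read'],
  viewer: ['orders.read', 'printjobs.read'],
};

// Effective permissions = union of all role permission sets.
function effectivePermissions(roles: string[]): Set<string> {
  return new Set(roles.flatMap((role) => rolePermissions[role] ?? []));
}

// The server-side check a guard like PermissionsGuard performs per route.
function hasPermissions(userRoles: string[], required: string[]): boolean {
  const granted = effectivePermissions(userRoles);
  return required.every((p) => granted.has(p));
}
```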

Database Schema Additions

-- Core RBAC tables
Tenant, User, Role, Permission, UserRole, RolePermission, Session, AuditLog

-- Tenant scoping on all existing tables
Order, LineItem, PrintJob, ProductMapping, AssemblyPart, Shipment, EventLog, etc.

Implementation Details

Backend (NestJS)

  • SessionGuard: Global guard that validates sessions or falls back to API key
  • PermissionsGuard: Route-level guard that checks required permissions
  • @CurrentUser(): Decorator to inject authenticated user into controllers
  • @RequirePermissions(): Decorator to specify required permissions
  • @Public(): Decorator to mark routes as public (bypass authentication)
  • TenantContextService: Request-scoped service providing tenant context
  • AuditService: Centralized security audit logging

Frontend (React)

  • AuthContext: Provides user state, login/logout, permission checks
  • usePermissions(): Hook for permission-based UI rendering
  • ProtectedRoute: Redirects unauthenticated users to login
  • PermissionGatedRoute: Hides routes based on permissions

User Management UI

  • Location: Settings page → Administration section (visible to users with users.read permission)
  • Route: /admin/users (requires users.read permission to access)
  • Components:
    • UserFormModal: Create/edit users with email, password, and role selection
    • ChangePasswordModal: Change password for existing users
    • UsersPage: User list with search, filtering, and CRUD operations
  • Features:
    • Create new users with email, password, and role assignment
    • Edit existing user email and roles
    • Change user passwords (separate modal for security)
    • Deactivate/reactivate users (soft delete pattern)
    • Role selection with visual indicators for selected roles
    • Permission-gated UI (actions hidden if user lacks users.write)

Default Roles

| Role | Description | Permissions |
| --- | --- | --- |
| admin | Full system access | All permissions |
| operator | Day-to-day operations | Orders, print jobs, mappings, shipments, logs (read/write) |
| viewer | Read-only access | View-only access to operational data |
| legacy-admin | API key compatibility | All permissions (deprecated) |

Consequences

Positive:

  • ✅ Multiple users can sign in with different access levels
  • ✅ Server-side permission enforcement (not UI-only security)
  • ✅ Audit trail for security-relevant actions
  • ✅ Architecture ready for future multi-tenancy
  • ✅ Backward compatibility with existing API key integrations
  • ✅ Sentry error reports enriched with user/tenant context

Negative:

  • ⚠️ Session management adds infrastructure complexity
  • ⚠️ All repositories needed updates for tenant scoping
  • ⚠️ Coverage thresholds temporarily lowered for new modules

Migration Path

  1. Run Prisma migration to add RBAC and tenant tables
  2. Run seed script to create default tenant, roles, permissions, and admin user
  3. Existing data migrated to default tenant
  4. Legacy API key authentication continues to work during transition
References

  • ADR-024 (API Key Authentication for Admin Endpoints)
  • ADR-029 (API Key Authentication for Dashboard)
  • apps/api/src/auth/ module
  • apps/api/src/audit/ module
  • apps/api/src/tenancy/ module
  • apps/api/src/users/ module

ADR-045: pgAdmin for Staging Database Administration

| Attribute | Value |
| --- | --- |
| ID | ADR-045 |
| Status | Accepted |
| Date | 2026-01-24 |
| Context | Need a web-based interface to inspect, query, and manage the PostgreSQL staging database |

Decision

Deploy pgAdmin 4 as a Docker container in the staging environment, exposed via Traefik with TLS.

Rationale

  • Official PostgreSQL tool: pgAdmin is the official GUI administration tool for PostgreSQL
  • Web-based access: No need to install desktop software or configure VPN/SSH tunnels
  • Full SQL capabilities: Execute queries, view data, manage schemas, backup/restore
  • Secure access: TLS via Let's Encrypt, separate credentials from database credentials
  • No database exposure: Database remains inaccessible from the internet; pgAdmin connects internally via Docker network

Implementation

| Component | Value |
| --- | --- |
| Container Image | dpage/pgadmin4:latest |
| Subdomain | staging-connect-db.forma3d.be |
| Docker Network | forma3d-network (internal) |
| Data Persistence | pgadmin-data Docker volume |
| TLS Certificate | Auto-provisioned via Let's Encrypt |

Environment Variables

| Variable | Description | Secret? |
| --- | --- | --- |
| PGADMIN_DEFAULT_EMAIL | Login email for pgAdmin | No |
| PGADMIN_DEFAULT_PASSWORD | Login password for pgAdmin | Yes |

Usage

  1. Navigate to https://staging-connect-db.forma3d.be
  2. Log in with PGADMIN_DEFAULT_EMAIL and PGADMIN_DEFAULT_PASSWORD
  3. Add a new server connection:
     • Name: Forma3D Staging
     • Host: Database hostname from DATABASE_URL (DigitalOcean managed PostgreSQL hostname)
     • Port: 25060 (DigitalOcean managed PostgreSQL port)
     • Database: defaultdb (or your database name)
     • Username/Password: From DATABASE_URL
     • SSL Mode: Require (set in Connection > SSL tab)

Security Considerations

  • pgAdmin credentials are separate from database credentials
  • Database credentials are entered manually in pgAdmin (not stored in environment)
  • Enhanced cookie protection enabled
  • Access is restricted to those who know the pgAdmin login credentials
  • TLS encrypts all traffic

Consequences

Positive:

  • ✅ Easy database inspection without SSH access
  • ✅ Web-based access from any device
  • ✅ Full SQL query capabilities
  • ✅ Visual schema exploration
  • ✅ Data export/import capabilities

Negative:

  • ⚠️ Additional attack surface (mitigated by strong password + TLS)
  • ⚠️ Resource overhead (minimal - pgAdmin is lightweight)
  • ⚠️ Users must manually configure the database server connection
References

  • deployment/staging/docker-compose.yml
  • deployment/staging/env.staging.template
  • docs/05-deployment/staging-deployment-guide.md

ADR-046: PostgreSQL Session Store for Persistent Authentication

| Attribute | Value |
| --- | --- |
| ID | ADR-046 |
| Status | Accepted |
| Date | 2026-01-26 |
| Context | User sessions were lost on server restarts, causing frequent re-authentication during deployments |

Decision

Replace the default in-memory session store with PostgreSQL-backed sessions using connect-pg-simple, and extend session duration from 24 hours to 7 days.

Rationale

The default express-session in-memory store has critical limitations:

| Problem | Impact | Solution |
| --- | --- | --- |
| Sessions lost on restart | Users logged out during every deployment | PostgreSQL persistence |
| No session sharing | Cannot scale to multiple API instances | Shared database store |
| Short session duration | Users had to re-login frequently | Extended to 7 days |
| Memory consumption | Sessions consume server RAM | Offloaded to database |

Implementation

Package: connect-pg-simple with @types/connect-pg-simple

Migration: prisma/migrations/20260126000000_add_session_store/migration.sql

CREATE TABLE "session" (
  "sid" varchar NOT NULL COLLATE "default",
  "sess" json NOT NULL,
  "expire" timestamp(6) NOT NULL
);
ALTER TABLE "session" ADD CONSTRAINT "session_pkey" PRIMARY KEY ("sid");
CREATE INDEX "IDX_session_expire" ON "session" ("expire");

Configuration: apps/api/src/main.ts

| Setting | Value | Description |
| --- | --- | --- |
| store | PgSession | PostgreSQL-backed session store |
| tableName | session | Database table for sessions |
| pruneSessionInterval | 3600 (1 hour) | Expired session cleanup interval |
| maxAge | 7 days (configurable) | Session cookie lifetime |
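The settings above wire together roughly as follows. This is a hedged fragment of the bootstrap (the real code is in apps/api/src/main.ts; `app` here stands for the Nest application's underlying Express instance, and the exact option values are illustrative):

```typescript
import session from 'express-session';
import connectPgSimple from 'connect-pg-simple';

const PgSession = connectPgSimple(session);
const maxAgeDays = Number(process.env.SESSION_MAX_AGE_DAYS ?? '7');

app.use(
  session({
    store: new PgSession({
      conString: process.env.DATABASE_URL, // reuses the existing PostgreSQL instance
      tableName: 'session',                // table created by the migration above
      pruneSessionInterval: 3600,          // seconds between expired-session sweeps
    }),
    secret: process.env.SESSION_SECRET!,   // required: signing key for session cookies
    resave: false,
    saveUninitialized: false,
    cookie: {
      httpOnly: true,
      secure: process.env.NODE_ENV === 'production',
      maxAge: maxAgeDays * 24 * 60 * 60 * 1000, // 7 days by default
    },
  }),
);
```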

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| SESSION_SECRET | (required) | Secret key for signing session cookies |
| SESSION_MAX_AGE_DAYS | 7 | Session duration in days |

Session Lifecycle

User Login → Session created in PostgreSQL → Cookie sent to browser
            ↓
Browser Request → Cookie validated → Session loaded from PostgreSQL
            ↓
Session Expires → Pruned by hourly cleanup job

Consequences

Positive:

  • ✅ Sessions survive server restarts and deployments
  • ✅ Sessions shared across multiple API instances (horizontal scaling ready)
  • ✅ 7-day sessions reduce login friction for users
  • ✅ Automatic cleanup of expired sessions (no manual maintenance)
  • ✅ No additional infrastructure (uses existing PostgreSQL)

Negative:

  • ⚠️ Slight latency increase for session lookups (negligible with connection pooling)
  • ⚠️ Database storage for sessions (minimal - each session ~1-2KB)
  • ⚠️ Migration required on existing deployments

References

  • apps/api/src/main.ts - Session configuration
  • prisma/migrations/20260126000000_add_session_store/migration.sql - Database schema
  • .env.example - Environment variable documentation
  • deployment/staging/docker-compose.yml - Container configuration

ADR-047: Three-Tier Logging Strategy (Application + Business Events + Sentry Logs)

Attribute Value
ID ADR-047
Status Superseded by ADR-058 (Sentry Logs tier replaced by ClickHouse + Grafana)
Date 2026-01-27
Context Need comprehensive observability with different log types for debugging vs. compliance/audit

Decision

Implement a three-tier logging strategy that separates application logs, business event logs, and security audit logs, with Sentry Logs layered on top for centralized visibility:

Tier Storage Purpose Examples
Application Logs Pino (stdout) Debugging, performance HTTP requests, service calls
Business Events PostgreSQL (EventLog) + Sentry Logs Business audit trail Order created, shipment status changed
Security Audit PostgreSQL (AuditLog) + Sentry Logs Compliance, security Login success/failure, permission denied
Sentry Logs Sentry (cloud) Centralized visibility All business + audit events in one place

Rationale

Different log types serve different purposes:

Concern Application Logs Business Events Audit Logs Sentry Logs
Retention Short (days/weeks) Long (months/years) Regulatory (years) 30 days (configurable)
Query needs Full-text search Structured filtering Compliance reporting Real-time search
Access control DevOps/Developers Business users Administrators only DevOps team
Storage cost High volume, low cost Moderate volume Low volume, high value Included in Sentry plan

Why Sentry Logs?

  • Single pane of glass: View errors, traces, and logs in one place
  • No additional tooling: Already using Sentry for error tracking
  • Structured attributes: Filter by orderId, eventType, userId, etc.
  • Real-time: Logs appear immediately for debugging
  • Cost-effective: Included in existing Sentry subscription (with limits)

Implementation

Application Logging (Pino via nestjs-pino):

  • Configured in apps/api/src/observability/observability.module.ts
  • Environment-based formatting (pretty dev, JSON prod)
  • Automatic request/response logging via interceptors
  • Redacts sensitive fields (passwords, tokens, cookies)

Business Event Logging (EventLogService):

  • Stored in EventLog PostgreSQL table
  • Structured metadata with orderId, printJobId associations
  • Severity levels: INFO, WARNING, ERROR
  • Triple output: Database + Application logger + Sentry Logs

Security Audit Logging (AuditService):

  • Stored in AuditLog PostgreSQL table
  • Captures actor, action, target, IP address, user agent
  • Tenant-scoped for multi-tenancy support
  • Admin-only access via audit.read permission
  • Also sent to Sentry Logs for real-time visibility

Sentry Logs Integration (SentryLoggerService):

  • Wrapper around Sentry.logger API
  • Centralized service in apps/api/src/observability/services/sentry-logger.service.ts
  • Structured attributes for filtering (eventType, orderId, userId, etc.)
  • Automatic integration with EventLogService and AuditService
  • View in Sentry: Explore > Logs

Event Types

Business Events (EventLog):

Event Type Severity Trigger
order.created INFO Shopify webhook creates order
order.status_changed INFO Order status transition
order.cancelled WARNING Order cancellation
printjob.created INFO Print job created in SimplyPrint
printjob.status_changed INFO/ERROR SimplyPrint status update
shipment.created INFO Shipment record created
shipment.status_changed INFO/WARNING Sendcloud status update
shipment.tracking_updated INFO Tracking number assigned
shipment.label_generated INFO Shipping label created
shipment.cancelled WARNING Shipment cancellation
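
The default severities above can be encoded as a small lookup; a sketch (names are illustrative, and types listed with two levels, such as printjob.status_changed, default to INFO here and are escalated by the caller at emit time):

```typescript
type Severity = "INFO" | "WARNING" | "ERROR";

// Default severity per business event type (from the table above).
// Dual-severity types (printjob.status_changed, shipment.status_changed)
// default to INFO; the emitter escalates based on the actual status.
const defaultSeverity: Record<string, Severity> = {
  "order.created": "INFO",
  "order.status_changed": "INFO",
  "order.cancelled": "WARNING",
  "printjob.created": "INFO",
  "printjob.status_changed": "INFO",
  "shipment.created": "INFO",
  "shipment.status_changed": "INFO",
  "shipment.tracking_updated": "INFO",
  "shipment.label_generated": "INFO",
  "shipment.cancelled": "WARNING",
};

function severityFor(eventType: string): Severity {
  // Unknown event types fall back to INFO rather than failing the log call
  return defaultSeverity[eventType] ?? "INFO";
}
```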

Audit Events (AuditLog):

Action Success Trigger
auth.login.success true User successfully logged in
auth.login.failure false Invalid credentials
auth.logout true User logged out
permission.denied false Access denied to protected resource
user.created true New user account created
user.updated true User profile updated
password.changed true Password changed

API Endpoints

Endpoint Permission Purpose
GET /api/v1/logs logs.read View business event logs
GET /api/v1/audit-logs audit.read View security audit logs (Admin only)

UI Access

  • Activity Logs: Sidebar → Activity Logs
  • Audit Logs: Settings → Administration → Audit Logs

Consequences

Positive:

  • ✅ Clear separation of concerns (debugging vs. compliance vs. security)
  • ✅ Business events queryable by order/print job for troubleshooting
  • ✅ Audit logs provide compliance trail for security reviews
  • ✅ Structured metadata enables powerful filtering
  • ✅ Admin-only audit access protects sensitive security data

Negative:

  • ⚠️ Three separate log stores to maintain
  • ⚠️ Potential for inconsistency if logging calls are missed
  • ⚠️ Database storage for events (mitigated by pruning/archival)

References

  • apps/api/src/observability/observability.module.ts - Pino configuration
  • apps/api/src/observability/services/sentry-logger.service.ts - Sentry Logs integration
  • apps/api/src/event-log/event-log.service.ts - Business event logging
  • apps/api/src/audit/audit.service.ts - Security audit logging
  • apps/api/src/audit/audit.controller.ts - Audit logs API endpoint
  • apps/web/src/pages/admin/audit-logs/index.tsx - Audit logs UI
  • Sentry Dashboard: Explore > Logs

ADR-048: Shopify OAuth 2.0 Authentication

Attribute Value
ID ADR-048
Status Implemented
Date 2026-01-28
Context Shopify deprecated legacy custom apps for merchants (January 1, 2026). New merchant stores require OAuth-authenticated apps.

Decision

Implement Shopify OAuth 2.0 Authorization Code Grant flow for app installation and authentication, replacing the static access token approach.

Rationale

  • Production requirement: As of January 2026, Shopify merchants can only install OAuth-authenticated apps
  • Multi-shop support: OAuth enables connecting multiple shops per tenant
  • Token refresh: Offline access tokens have 90-day expiry with refresh capability
  • Security: Tokens encrypted at rest using AES-256-GCM
  • Backward compatibility: Legacy static token mode preserved for development/testing

Implementation

Database Schema:

model ShopifyShop {
  id           String   @id @default(uuid())
  tenantId     String
  shopDomain   String   // e.g., "example.myshopify.com"
  accessToken  String   // Encrypted OAuth access token
  tokenType    String   @default("offline")
  scopes       String[]
  expiresAt    DateTime?
  refreshToken String?
  installedAt  DateTime @default(now())
  uninstalledAt DateTime?
  isActive     Boolean  @default(true)

  tenant Tenant @relation(...)
  @@unique([tenantId, shopDomain])
}

OAuth Flow Endpoints:

Endpoint Purpose
GET /shopify/oauth/authorize?shop=xxx Initiate OAuth flow, redirect to Shopify consent
GET /shopify/oauth/callback Exchange authorization code for token
POST /shopify/oauth/uninstall Handle app uninstallation webhook
GET /shopify/oauth/shops List connected shops for tenant
DELETE /shopify/oauth/shops/:domain Disconnect a shop
GET /shopify/oauth/status Check OAuth/legacy configuration status

Security Measures:

  • HMAC verification on all OAuth callbacks (timing-safe comparison)
  • State parameter with cryptographic nonce (CSRF protection)
  • Token encryption at rest (AES-256-GCM with unique IV per token)
  • Automatic token refresh 24 hours before expiry
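
The timing-safe HMAC check on OAuth callbacks can be sketched with Node's crypto module (simplified; the helper name is illustrative, and real Shopify callbacks may additionally require URL decoding and array-parameter handling):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// `query` is the parsed callback query string; `secret` is the app's
// client secret (SHOPIFY_API_SECRET).
function verifyShopifyHmac(query: Record<string, string>, secret: string): boolean {
  const { hmac, ...rest } = query;
  if (!hmac) return false;
  // Shopify signs the remaining parameters sorted by key, joined as key=value&...
  const message = Object.keys(rest)
    .sort()
    .map((k) => `${k}=${rest[k]}`)
    .join("&");
  const digest = createHmac("sha256", secret).update(message).digest("hex");
  const expected = Buffer.from(digest, "hex");
  const received = Buffer.from(hmac, "hex");
  // timingSafeEqual throws on length mismatch, so guard first
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```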

Configuration:

# OAuth mode (required for production)
SHOPIFY_API_KEY=<client-id>
SHOPIFY_API_SECRET=<client-secret>
SHOPIFY_APP_URL=https://connect-api.forma3d.be
SHOPIFY_SCOPES=read_orders,write_orders,read_products,write_products,read_fulfillments,write_fulfillments,read_inventory,read_merchant_managed_fulfillment_orders,write_merchant_managed_fulfillment_orders
SHOPIFY_TOKEN_ENCRYPTION_KEY=<64-hex-char-key>

# Legacy mode (optional, for development)
SHOPIFY_SHOP_DOMAIN=forma3d-dev.myshopify.com
SHOPIFY_ACCESS_TOKEN=shpat_xxx
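
The at-rest token encryption described under Security Measures might be sketched as follows (illustrative helper names; `key` is the 32-byte buffer decoded from the 64-hex-character SHOPIFY_TOKEN_ENCRYPTION_KEY):

```typescript
import { randomBytes, createCipheriv, createDecipheriv } from "node:crypto";

// Hypothetical helpers illustrating AES-256-GCM with a unique IV per token.
function encryptToken(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // unique IV for every encryption
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Store IV and auth tag alongside the ciphertext
  return [iv, tag, ciphertext].map((b) => b.toString("hex")).join(":");
}

function decryptToken(stored: string, key: Buffer): string {
  const [iv, tag, ciphertext] = stored.split(":").map((h) => Buffer.from(h, "hex"));
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag); // GCM authenticates as well as decrypts
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```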

API Client Modes:

The ShopifyApiClient supports both modes:

// Legacy mode (static token)
await shopifyClient.createFulfillment(orderId, data);

// OAuth mode (per-shop token)
await shopifyClient.createFulfillmentForShop(tenantId, shopDomain, orderId, data);

Consequences

Positive:

  • ✅ Production-ready for merchant app installations
  • ✅ Multi-shop support enables B2B scenarios
  • ✅ Automatic token refresh prevents authentication failures
  • ✅ Encrypted tokens protect against database leaks
  • ✅ Backward compatible - existing deployments continue working

Negative:

  • ⚠️ Additional complexity for OAuth flow
  • ⚠️ Requires additional environment variables for production
  • ⚠️ Token refresh failures need monitoring

References

  • apps/api/src/shopify/shopify-oauth.controller.ts - OAuth endpoints
  • apps/api/src/shopify/shopify-oauth.service.ts - OAuth flow logic
  • apps/api/src/shopify/shopify-token.service.ts - Token management/encryption
  • apps/api/src/shopify/shopify-shop.repository.ts - Database access
  • prisma/migrations/20260128000000_add_shopify_oauth/ - Database migration
  • Shopify OAuth Documentation

Document History

Version Date Author Changes
1.0 2026-01-10 AI Assistant Initial ADR document with 13 decisions
1.1 2026-01-10 AI Assistant Updated ADR-006 for Digital Ocean hosting, added ADR-014 for SimplyPrint
1.2 2026-01-10 AI Assistant Added ADR-015 for Aikido Security Platform
1.3 2026-01-10 AI Assistant Added ADR-016 for Sentry Observability with OpenTelemetry
1.4 2026-01-10 AI Assistant Marked ADR-016 as implemented, added implementation details
1.5 2026-01-10 AI Assistant Added ADR-017 for Docker + Traefik Deployment Strategy
1.6 2026-01-11 AI Assistant Added ADR-018 for Nx Affected Conditional Deployment Strategy
1.7 2026-01-13 AI Assistant Phase 2 updates: Updated ADR-008 with implemented events, added ADR-019 (SimplyPrint Webhook Verification), ADR-020 (Hybrid Status Monitoring)
1.8 2026-01-14 AI Assistant Phase 3 updates: Added ADR-021 (Retry Queue), ADR-022 (Event-Driven Fulfillment), ADR-023 (Email Notifications)
1.9 2026-01-14 AI Assistant Security update: Added ADR-024 (API Key Authentication for Admin Endpoints)
2.0 2026-01-14 AI Assistant Supply chain security: Added ADR-025 (Cosign Image Signing)
2.1 2026-01-14 AI Assistant Phase 4 updates: Added ADR-027 (TanStack Query), ADR-028 (Socket.IO Real-Time), ADR-029 (Dashboard Authentication)
2.2 2026-01-16 AI Assistant SBOM attestations: Added ADR-026 (CycloneDX SBOM Attestations with Syft)
2.3 2026-01-16 AI Assistant Phase 5 updates: Added ADR-030 (Sendcloud for Shipping Integration)
2.4 2026-01-16 AI Assistant Registry cleanup: Added ADR-031 (Automated Container Registry Cleanup)
2.5 2026-01-17 AI Assistant Domain boundary separation: Added ADR-032 (Domain Boundary Separation with Interface Contracts)
2.6 2026-01-17 AI Assistant Critical tech debt resolution: Added ADR-033 (Database-Backed Webhook Idempotency)
2.7 2026-01-19 AI Assistant Infrastructure hardening: Added ADR-034 (Docker Log Rotation & Resource Cleanup)
2.8 2026-01-19 AI Assistant Cross-platform strategy: Added ADR-035 (PWA replaces Tauri/Capacitor native apps)
2.9 2026-01-20 AI Assistant PWA detection: Added ADR-036 (localStorage Fallback for PWA Install Detection)
3.0 2026-01-20 AI Assistant Documentation: Added ADR-037 (Keep a Changelog for Release Documentation)
3.1 2026-01-21 AI Assistant Documentation: Added ADR-038 (Zensical for publishing project documentation from docs/)
3.2 2026-01-22 AI Assistant Resilience: Added ADR-040 (Shopify Order Backfill for Downtime Recovery)
3.3 2026-01-22 AI Assistant Resilience: Added ADR-041 (SimplyPrint Webhook Idempotency and Job Reconciliation)
3.4 2026-01-22 AI Assistant Feature: Added ADR-042 (SendCloud Webhook Integration for Shipment Status Updates)
3.5 2026-01-23 AI Assistant PWA enhancement: Added ADR-043 (PWA Version Mismatch Detection on Settings page)
3.6 2026-01-24 AI Assistant Security: Added ADR-044 (Role-Based Access Control and Tenant-Ready Architecture)
3.7 2026-01-25 AI Assistant Feature: Updated ADR-044 with User Management UI implementation details
3.8 2026-01-25 AI Assistant Documentation: Updated ADR-008 with complete event catalog (Order, PrintJob, Orchestration, SimplyPrint, Shipment, SendCloud, Fulfillment events); all ADRs have Status field indicating implementation
3.9 2026-01-24 AI Assistant Infrastructure: Added ADR-045 (pgAdmin for Staging Database Administration)
4.0 2026-01-26 AI Assistant Session management: Added ADR-046 (PostgreSQL Session Store for Persistent Authentication)
4.1 2026-01-27 AI Assistant Observability: Added ADR-047 (Three-Tier Logging Strategy with Application, Business, and Audit logs)
4.2 2026-01-28 AI Assistant Authentication: Added ADR-048 (Shopify OAuth 2.0 Authentication for production merchant stores)
4.3 2026-02-07 AI Assistant Data model: Added ADR-049 (Optional SKU with Shopify Product/Variant ID Matching Priority)

ADR-049: Optional SKU with Shopify Product/Variant ID Matching Priority

Attribute Value
ID ADR-049
Status Implemented
Date 2026-02-07
Context Shopify product variants may have null/empty SKUs. Merchants do not always configure SKUs, making SKU-only matching unreliable for linking incoming orders to product mappings.

Decision

Make the sku field optional on the ProductMapping model and implement a product/variant ID-first matching strategy with SKU as fallback.

Rationale

  • Shopify reality: SKU is optional on Shopify variants — many merchants leave it empty
  • Reliable matching: Shopify Product ID and Variant ID are always present on order line items and are immutable identifiers
  • Backward compatible: Existing mappings with SKUs continue to work via the fallback path
  • No data loss: SKU remains as a display/search field when available; PostgreSQL treats multiple NULL SKUs as distinct for the unique constraint

Matching Priority

  1. Exact match: shopifyProductId + shopifyVariantId → specific variant mapping
  2. Product-level catch-all: shopifyProductId only (variant ID is null on the mapping) → applies to all variants of that product
  3. SKU fallback: sku match → legacy path for backward compatibility
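
The priority order can be sketched as a pure function (illustrative shapes; the real logic lives in ProductMappingsService.findMappingForLineItem and runs against the database):

```typescript
// Illustrative types; the real models live in the Prisma schema.
interface ProductMapping {
  sku: string | null;
  shopifyProductId: string | null;
  shopifyVariantId: string | null;
}
interface MatchableLineItem {
  sku: string | null;
  shopifyProductId: string | null;
  shopifyVariantId: string | null;
}

function findMappingForLineItem(
  item: MatchableLineItem,
  mappings: ProductMapping[],
): ProductMapping | undefined {
  // 1. Exact match: product ID + variant ID
  const exact = mappings.find(
    (m) =>
      item.shopifyProductId != null &&
      item.shopifyVariantId != null &&
      m.shopifyProductId === item.shopifyProductId &&
      m.shopifyVariantId === item.shopifyVariantId,
  );
  if (exact) return exact;
  // 2. Product-level catch-all: mapping has no variant ID
  const catchAll = mappings.find(
    (m) =>
      item.shopifyProductId != null &&
      m.shopifyProductId === item.shopifyProductId &&
      m.shopifyVariantId == null,
  );
  if (catchAll) return catchAll;
  // 3. Legacy SKU fallback
  return mappings.find((m) => item.sku != null && m.sku === item.sku);
}
```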

Implementation

Schema Changes:

model ProductMapping {
  sku  String?  // Was: String — now nullable
  // shopifyProductId and shopifyVariantId remain as before
}

model LineItem {
  shopifyProductId  String?  // NEW — stored from webhook payload
  shopifyVariantId  String?  // NEW — stored from webhook payload
  @@index([shopifyProductId])
}

Service Changes:

  • ProductMappingsRepository.findByShopifyProduct(productId, variantId?) — new method for ID-based lookup
  • ProductMappingsService.findUnmappedLineItems() — replaces findUnmappedSkus(), accepts line item objects with IDs + SKU
  • ProductMappingsService.findMappingForLineItem() — encapsulates the matching priority
  • PrintJobsService.createPrintJobsForLineItem() — uses new matching: tries product ID first, then SKU

Frontend Changes:

  • SKU field marked as optional in the mapping form
  • Search and display handle null SKUs with a fallback display value

Consequences

  • Positive: System works for all Shopify merchants regardless of SKU configuration
  • Positive: More reliable matching — Shopify IDs are guaranteed present and immutable
  • Neutral: Existing mappings with SKUs continue working unchanged
  • Consideration: When both a variant-specific and product-level mapping exist, the variant-specific one takes priority

ADR-050: Apache ECharts for Dashboard Analytics

Attribute Value
ID ADR-050
Status Implemented
Date 2026-02-13
Context The dashboard displayed only static stat cards and lists. Operators lacked visual insight into order, print job, and shipment status distributions, revenue trends, and day-over-day comparisons.

Decision

Adopt Apache ECharts (v6) via echarts-for-react as the charting library for the dashboard analytics feature. Use on-demand imports for bundle optimization and lazy loading (React.lazy) for all chart components.

Rationale

  • Richest chart variety: Donut, bar, line, gauge — all required chart types in a single library
  • On-demand imports: Tree-shakeable core (~225 KB shared bundle) vs ~320 KB full import
  • Dark theme support: Native theme registration via echarts.registerTheme() — consistent with existing dark UI
  • TypeScript-first: Complete type definitions for all chart options and callbacks
  • Active maintenance: Apache Foundation project with large community
  • React wrapper: echarts-for-react provides declarative React component with event handling

Implementation

Frontend — Chart Components (apps/web/src/components/charts/):

  • echarts-setup.ts — On-demand ECharts core with registered forma3d dark theme
  • chart-card.tsx — Reusable wrapper with title, subtitle, loading state, and empty state
  • donut-chart.tsx — Generic donut chart with custom center labels and click-to-filter
  • bar-chart.tsx — Generic bar chart with value labels and prefix/suffix formatting
  • line-chart.tsx — Generic line chart with gradient area fill
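
The on-demand setup in echarts-setup.ts might look roughly like this (sketch; the registered components and theme values are illustrative):

```typescript
import * as echarts from "echarts/core";
import { PieChart, BarChart, LineChart, GaugeChart } from "echarts/charts";
import { GridComponent, TooltipComponent, LegendComponent } from "echarts/components";
import { CanvasRenderer } from "echarts/renderers";

// Register only the pieces the dashboard uses (keeps the shared chunk small)
echarts.use([
  PieChart, BarChart, LineChart, GaugeChart,
  GridComponent, TooltipComponent, LegendComponent,
  CanvasRenderer,
]);

// Dark theme consistent with the existing UI (values illustrative)
echarts.registerTheme("forma3d", {
  backgroundColor: "transparent",
  textStyle: { color: "#e5e7eb" },
});

export default echarts;
```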

Frontend — Analytics Components (apps/web/src/components/analytics/):

  • OrderStatusChart — Order status donut with click-to-filter navigation
  • PrintJobStatusChart — Print job status donut showing active job count
  • ShipmentStatusChart — Shipment status donut showing in-transit count
  • RevenueTrendChart — Weekly revenue bar chart
  • OrderTrendChart — 30-day order volume line chart
  • AnalyticsPeriodDropdown — Shared period selector (Today / Week / Month / All Time)

Frontend — Dashboard Integration (apps/web/src/pages/dashboard.tsx):

  • Enhanced stat cards with trend delta indicators (up/down arrows, day-over-day change)
  • Lazy-loaded chart components with React.lazy() and <Suspense> fallback
  • Shared AnalyticsPeriod state driving all three donut charts simultaneously

Backend — Analytics Module (apps/api/src/analytics/):

  • AnalyticsRepository — Prisma groupBy for status distributions, $queryRaw for daily trend aggregation
  • AnalyticsService — Business logic for percentages, success rates, comparison deltas
  • AnalyticsController — 6 REST endpoints under /api/v1/analytics/*
  • DTOs with Swagger decorators for API documentation

Shared Contracts (libs/domain-contracts/src/api/analytics.api.ts):

  • AnalyticsPeriod, OrderAnalyticsApiResponse, PrintJobAnalyticsApiResponse
  • ShipmentAnalyticsApiResponse, TrendsApiResponse, EnhancedDashboardStatsApiResponse

Database Indexes (prisma/schema.prisma):

  • Composite indexes @@index([tenantId, status, createdAt]) on Order, PrintJob, and Shipment models

Data Fetching (apps/web/src/hooks/use-analytics.ts):

  • TanStack Query hooks: useOrderAnalytics, usePrintJobAnalytics, useShipmentAnalytics
  • Trend hooks: useRevenueTrend, useOrderTrend
  • Enhanced stats: useEnhancedDashboardStats (30s refresh for KPI tiles)

Consequences

  • Positive: Operators get immediate visual insight into 3D print farm operations
  • Positive: Bundle size managed through on-demand imports and lazy loading (~225 KB shared chunk)
  • Positive: Consistent dark theme integration with existing UI
  • Positive: Click-to-filter on donut slices enables quick navigation to filtered views
  • Neutral: New dependency (echarts + echarts-for-react) — well-maintained Apache project
  • Consideration: ECharts v6 has stricter TypeScript types requiring CallbackDataParams for formatters
  • Trade-off: Raw SQL used for date truncation in trend queries (Prisma groupBy lacks DATE() support)

Alternatives Considered

Library Pros Cons
Recharts Simple React API Limited donut customization, fewer chart types
Chart.js Lightweight Weak TypeScript, less donut label flexibility
Nivo Beautiful defaults Heavier bundle, React-specific only
D3.js Maximum flexibility High complexity, no React integration out of the box
ECharts Rich charts, tree-shakeable Larger full bundle (mitigated by on-demand)

Test Coverage

  • Backend: 34 unit tests across analytics.repository.spec.ts, analytics.service.spec.ts, analytics.controller.spec.ts
  • Frontend: 14 hook tests in use-analytics.test.tsx with MSW handlers for all 6 analytics endpoints
  • Type Safety: Full TypeScript strict mode compliance (ECharts v6 types)

ADR-051: Decompose Monolithic API into Domain-Aligned Microservices

Attribute Value
ID ADR-051
Status Accepted
Date 2026-02-15
Context The monolithic apps/api had grown to 300+ files across multiple domains (orders, print jobs, shipping, fulfillment, GridFlock). Feature work required understanding the entire codebase, and deployments restarted all domains even for single-domain changes. The upcoming GridFlock pipeline added compute-intensive STL generation that could block the API request thread.

Decision

Decompose the monolithic API into five domain-aligned microservices plus an API Gateway:

Service Port Domain
API Gateway 3000 Auth, routing, WebSocket, sessions
Order Service 3001 Orders, mappings, orchestration
Print Service 3002 Print jobs, SimplyPrint integration
Shipping Service 3003 Shipments, Sendcloud integration
GridFlock Service 3004 STL generation, slicing pipeline
Slicer Container 3010 BambuStudio CLI headless slicing

The Gateway is the single entry point for all external traffic, routing to downstream services via HTTP proxy. Services communicate asynchronously via BullMQ event queues (Redis) and synchronously via internal HTTP APIs protected by an X-Internal-Key header.
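
The Gateway's prefix-to-service routing can be sketched as a simple lookup (the route prefixes and fallback below are assumptions for illustration; the real proxy configuration may differ):

```typescript
// Illustrative route prefixes mapped to the service ports from the table above.
const routes: Record<string, number> = {
  "/api/v1/orders": 3001,      // Order Service
  "/api/v1/print-jobs": 3002,  // Print Service
  "/api/v1/shipments": 3003,   // Shipping Service
  "/api/v1/gridflock": 3004,   // GridFlock Service
};

function targetPort(path: string): number {
  const prefix = Object.keys(routes).find((p) => path.startsWith(p));
  // Unmatched paths (auth, sessions, WebSocket) are handled by the Gateway itself
  return prefix ? routes[prefix] : 3000;
}
```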

Rationale

  • Independent deployability: Each service can be deployed without affecting others
  • Domain isolation: GridFlock compute-intensive work cannot block order processing
  • Horizontal scalability: Services can be scaled independently based on load
  • Team scalability: Different domains can be worked on independently
  • Fault isolation: A failure in one service does not cascade to others

Consequences

  • Positive: Independent deployment and scaling per domain
  • Positive: GridFlock STL generation isolated from order processing
  • Positive: Clear domain boundaries enforced by service boundaries
  • Positive: Each service has a smaller, focused codebase
  • Negative: Increased operational complexity (more containers to manage)
  • Negative: Network latency added for inter-service calls
  • Negative: Distributed transaction complexity
  • Trade-off: Shared database via Prisma (no per-service database yet)

Alternatives Considered

Approach Pros Cons
Keep monolith Simple operations Growing complexity, deployment coupling
Modular monolith Simpler networking Still single deployment unit
Microservices Full isolation, scalability More containers, networking complexity
Serverless functions Auto-scaling Cold starts, vendor lock-in

ADR-052: BullMQ Event Queues for Inter-Service Async Communication

Attribute Value
ID ADR-052
Status Accepted
Date 2026-02-15
Context The monolithic API used EventEmitter2 for internal events. In a microservice architecture, events need to cross process boundaries. We need reliable, at-least-once delivery with retry capability.

Decision

Use BullMQ (backed by Redis) for inter-service asynchronous event communication. Each event type gets its own dedicated BullMQ queue. The @forma3d/service-common library provides a shared BullMqEventBus abstraction.

Event types:

  • order.created, order.ready-for-fulfillment, order.cancelled
  • print-job.completed, print-job.failed, print-job.status-changed, print-job.cancelled
  • shipment.created, shipment.status-changed
  • gridflock.mapping-ready, gridflock.pipeline-failed

Configuration:

  • Concurrency: 5 workers per queue
  • Retries: 3 attempts with exponential backoff
  • Dead letter: Failed events retained (removeOnFail: 5000)
  • Completed cleanup: removeOnComplete: 1000

Rationale

  • At-least-once delivery: BullMQ guarantees delivery with retries
  • Redis already present: Required for sessions and Socket.IO adapter
  • Built-in retry: Exponential backoff without custom implementation
  • Visibility: Job status, progress, and failure tracking via Bull Board
  • NestJS integration: @nestjs/bullmq provides native module support

Consequences

  • Positive: Reliable cross-service event delivery with retries
  • Positive: Dead letter queue for debugging failed events
  • Positive: Event handlers are idempotent (check before acting)
  • Negative: Redis becomes a critical infrastructure dependency
  • Trade-off: At-least-once semantics require idempotent handlers

Alternatives Considered

Approach Pros Cons
RabbitMQ Feature-rich, routing Additional infrastructure, more complex
Kafka High throughput, replay Overkill for this scale, complex setup
BullMQ Simple, Redis-native Redis single point of failure
HTTP webhooks Simple to implement No retry guarantees, no backpressure
AWS SQS/SNS Managed, scalable Vendor lock-in, latency

ADR-053: Buffer-Based GridFlock Pipeline (No Local File Storage)

Attribute Value
ID ADR-053
Status Accepted
Date 2026-02-15
Context The GridFlock pipeline generates STL files, slices them to gcode, and uploads to SimplyPrint. In a containerized environment with potential horizontal scaling, local file storage creates state that prevents scaling and requires cleanup.

Decision

The entire GridFlock pipeline operates on in-memory buffers. No files are written to the local filesystem at any point in the pipeline:

  1. STL Generation: JSCAD generates geometry → serialized to STL binary buffer
  2. Slicing: STL buffer sent to Slicer container via HTTP → gcode buffer returned
  3. Upload: Gcode buffer uploaded directly to SimplyPrint Files API
  4. Mapping: ProductMapping created in database referencing SimplyPrint file ID

Plates in a plate set are processed sequentially to bound memory usage (one plate buffer at a time).
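
The sequential plate loop can be sketched as follows (the three stage callbacks stand in for JSCAD STL generation, the Slicer container HTTP call, and the SimplyPrint Files API upload; their signatures are illustrative):

```typescript
async function processPlateSet(
  plates: string[],
  generateStl: (plate: string) => Promise<Buffer>,
  slice: (stl: Buffer) => Promise<Buffer>,
  upload: (gcode: Buffer) => Promise<string>, // returns a SimplyPrint file ID
): Promise<string[]> {
  const fileIds: string[] = [];
  for (const plate of plates) {
    // Awaiting each stage in turn keeps at most one plate's buffers in memory
    const stl = await generateStl(plate);
    const gcode = await slice(stl);
    fileIds.push(await upload(gcode));
  }
  return fileIds;
}
```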

Rationale

  • Stateless containers: No local state means any replica can handle any request
  • Horizontal scaling: Multiple GridFlock Service instances can run concurrently
  • No cleanup needed: No temp files to garbage collect
  • SimplyPrint as storage: The only permanent storage is SimplyPrint (source of truth for gcode)

Consequences

  • Positive: Fully stateless, horizontally scalable
  • Positive: No disk I/O bottleneck
  • Positive: No file cleanup cron jobs needed
  • Negative: Memory-bound (large plate sets consume RAM)
  • Mitigation: Sequential plate processing bounds peak memory to ~50MB per plate

ADR-054: SimplyPrint API Files for Gcode Upload

Attribute Value
ID ADR-054
Status Accepted
Date 2026-02-15
Context After slicing GridFlock baseplates, the gcode must be stored and made available for printing via SimplyPrint. SimplyPrint offers a Files API for uploading files to the print farm. The SimplyPrint cloud slicer is not accessible via API.

Decision

Upload sliced gcode files to SimplyPrint via the Files API (requires Print Farm plan). Each gcode file is uploaded as a buffer with metadata, and SimplyPrint returns a file ID used for creating print jobs.

Rationale

  • Single source of truth: SimplyPrint stores all printable files
  • No local storage: Aligns with buffer-based pipeline (ADR-053)
  • Print job integration: File IDs directly used in SimplyPrint print queue
  • Existing API: Files API already used for manual file uploads

Consequences

  • Positive: No separate file storage infrastructure needed
  • Positive: Files immediately available for printing
  • Negative: Requires SimplyPrint Print Farm plan (API file access)
  • Negative: Upload latency adds to pipeline time (~2-5 seconds per file)

ADR-055: BambuStudio CLI Slicer Container

Attribute Value
ID ADR-055
Status Accepted
Date 2026-02-15
Context The GridFlock pipeline needs to slice STL files into gcode with specific printer profiles (nozzle diameter, layer height, filament type). SimplyPrint's cloud slicer is not API-accessible. We need a headless slicer that supports Bambu Lab and Prusa printer profiles.

Decision

Run BambuStudio CLI (fork of PrusaSlicer/SuperSlicer) in a dedicated Docker container as a headless slicing service. The container exposes an HTTP API that accepts STL buffers and returns gcode buffers. Printer profiles are configurable per tenant via SystemConfig.

Rationale

  • Bambu Lab support: Native profiles for X1 Carbon, P1S printers
  • Prusa support: Backward-compatible with PrusaSlicer profiles
  • Deterministic: Same input always produces same output
  • Containerized: Isolated from other services, independently scalable
  • CLI-based: No GUI dependencies, runs in headless mode

Consequences

  • Positive: Full control over slicing parameters
  • Positive: Tenant-configurable print profiles
  • Positive: No dependency on SimplyPrint cloud slicer
  • Negative: Additional container to maintain and update
  • Negative: BambuStudio updates require container rebuilds

Alternatives Considered

Approach Pros Cons
SimplyPrint slicer No additional container Not API-accessible
PrusaSlicer CLI Lighter weight No native Bambu Lab profiles
BambuStudio CLI Bambu + Prusa support Heavier container (~1.5GB)
CuraEngine Popular slicer Different profile format

ADR-056: Redis for Sessions, Event Queues, and Socket.IO Adapter

Attribute Value
ID ADR-056
Status Accepted
Date 2026-02-15
Context The microservice architecture requires shared infrastructure for sessions, inter-service events, and WebSocket broadcasting. Rather than introducing multiple infrastructure components, a single Redis instance can serve all three purposes.

Decision

Use a single Redis 7 instance for three purposes:

  1. Session Store: Gateway stores Express sessions in Redis (replaces PostgreSQL connect-pg-simple), enabling session sharing across Gateway replicas
  2. Event Bus: BullMQ queues for inter-service async events (see ADR-052)
  3. Socket.IO Adapter: Redis adapter enables WebSocket event broadcasting across multiple Gateway instances

Rationale

  • Single infrastructure: One Redis instance serves all three use cases
  • Horizontal scaling: Sessions and WebSockets work across Gateway replicas
  • Performance: In-memory data store with sub-millisecond latency
  • Proven stack: Redis + BullMQ + Socket.IO Redis adapter is a well-tested combination

Consequences

  • Positive: Enables horizontal scaling of Gateway
  • Positive: Single infrastructure dependency for multiple features
  • Positive: Fast session lookups (vs. PostgreSQL round-trip)
  • Negative: Redis becomes a critical dependency (all services depend on it)
  • Mitigation: Redis is deployed with persistence enabled and health checks
  • Trade-off: Session data is ephemeral (Redis restart clears sessions)

ADR-057: Self-Hosted Build Agent with Hybrid Pipeline Strategy

Attribute Value
ID ADR-057
Status Accepted
Date 2026-02-15
Context Pipeline build times grew significantly with 8+ microservice Docker images

Decision

Deploy a self-hosted Azure DevOps build agent on a DigitalOcean droplet (4 vCPU / 8 GB RAM) running 2 agent instances, and adopt a hybrid agent strategy:

  • MS-hosted agent handles lightweight jobs (lint, Nx builds, deployments)
  • Self-hosted agents handle Docker packaging with persistent local layer cache
  • Merged Validate & Test stage runs lint, typecheck, and unit tests in parallel across all 3 agents

Rationale

  • Docker layer caching: MS-hosted agents are ephemeral (cold cache every run). Self-hosted agents maintain a warm Docker layer cache between builds, reducing per-service build time from ~7-10 min to ~2-3 min
  • Pre-installed tools: Cosign and Syft are pre-installed on the self-hosted agent instead of being downloaded on every job (~1 min saved per service)
  • Cost-effective parallelism: 2 self-hosted agent instances at $48/month provide more parallelism than buying 1 extra MS-hosted parallel job at $40/month
  • Stage merging: Combining Validate and Test into a single stage eliminates ~5 min of sequential stage overhead by running Lint, TypeCheck, and UnitTests in parallel

Pipeline Architecture

| Agent | Stage | Jobs |
|---|---|---|
| MS-hosted | Validate & Test | Lint |
| DO Agent 1 | Validate & Test | TypeCheck |
| DO Agent 2 | Validate & Test | UnitTests |
| MS-hosted | Build & Package | DetectAffected, BuildAll |
| DO Agent 1+2 | Build & Package | All Package* Docker jobs |
| MS-hosted | Deploy, Acceptance, Production | All remaining stages |

Infrastructure

| Component | Specification |
|---|---|
| Droplet | DigitalOcean s-4vcpu-8gb ($48/month) |
| OS | Ubuntu 22.04 LTS |
| Agent Pool | DO-Build-Agents (self-hosted) |
| Agent Instances | 2 (do-build-agent-1, do-build-agent-2) |
| Setup Script | deployment/build-agent/setup-build-agent.sh |

Performance Impact

| Metric | Before (1 MS-hosted) | After (hybrid) | Improvement |
|---|---|---|---|
| Validate + Test | ~18 min (sequential) | ~8 min (parallel) | 56% faster |
| Build & Package (full) | ~75 min | ~15 min | 80% faster |
| Full pipeline (main) | ~133 min | ~63 min | 53% faster |
| Monthly cost | $0 | $48 | — |

Consequences

  • Positive: Dramatically faster builds, especially Docker packaging
  • Positive: Cost-effective compared to buying MS-hosted parallel jobs
  • Positive: Docker layer cache persists between builds
  • Negative: Self-hosted agent requires maintenance (Docker cleanup, OS updates, agent updates)
  • Mitigation: Automated weekly Docker cleanup and daily disk monitoring cron jobs
  • Negative: Single point of failure (if droplet goes down, Docker builds queue on MS-hosted)
  • Mitigation: Pipeline falls back gracefully; Package jobs wait for agent availability

ADR-058: Self-Hosted Log Infrastructure (ClickHouse + Grafana via OpenTelemetry)

| Attribute | Value |
|---|---|
| ID | ADR-058 |
| Status | Accepted |
| Date | 2026-02-21 |
| Context | Sentry Logs has limited retention (30 days), query capabilities, and cost scalability for structured business/audit logs |

Decision

Migrate structured logging (business events, audit logs, observability) from Sentry Logs to a self-hosted ClickHouse + Grafana stack, using OpenTelemetry Collector as the ingestion pipeline and Pino as the application-level logger. Sentry remains the platform for error tracking, distributed tracing, performance monitoring, and profiling.

Architecture

[Architecture diagram: application Pino loggers → OTLP → OpenTelemetry Collector → ClickHouse → Grafana dashboards, with Sentry continuing to receive errors, traces, and profiles in parallel]

Responsibility Split

| Concern | Platform | Rationale |
|---|---|---|
| Error tracking | Sentry | Best-in-class stack traces, issue grouping, alerting |
| Distributed tracing | Sentry | End-to-end traces across services with Sentry UI |
| Performance monitoring | Sentry | Request latency, database query profiling |
| Profiling | Sentry | Node.js CPU profiling in production |
| Structured logging | ClickHouse + Grafana | Unlimited retention, SQL queries, self-hosted cost control |
| Business event logs | ClickHouse + Grafana | Long-term queryable audit trail |
| Security audit logs | ClickHouse + Grafana | Compliance-grade retention |
| Log dashboards | Grafana | Custom dashboards, alerting rules |

Implementation Details

Shared Library (libs/observability):

  • otel-logger.ts — Pino logger factory with configurable log level and pino-pretty for development
  • OtelLoggerService — NestJS injectable service replacing SentryLoggerService, providing info, warn, error, debug, logEvent, and logAudit methods

Instrumentation (apps/*/src/observability/instrument.ts):

  • OpenTelemetry SDK initializes before Sentry to enable Pino-OTel bridging
  • @opentelemetry/instrumentation-pino auto-bridges Pino logs to OTLP
  • @opentelemetry/exporter-logs-otlp-grpc exports logs to OTel Collector
  • Sentry _experiments: { enableLogs: true } flag removed
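
The initialization order matters: OTel must come up before Sentry so the Pino bridge is registered first. A hypothetical sketch of instrument.ts using the packages listed above — option names vary across OTel SDK versions (older releases use the singular logRecordProcessor), and the collector endpoint is a placeholder:

```typescript
// Config-fragment sketch of apps/*/src/observability/instrument.ts.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { PinoInstrumentation } from '@opentelemetry/instrumentation-pino';
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-grpc';
import { BatchLogRecordProcessor } from '@opentelemetry/sdk-logs';
import * as Sentry from '@sentry/node';

// OTel SDK first, so Pino logs are auto-bridged to OTLP.
const sdk = new NodeSDK({
  serviceName: 'gateway', // illustrative
  instrumentations: [new PinoInstrumentation()],
  logRecordProcessors: [
    new BatchLogRecordProcessor(
      new OTLPLogExporter({ url: 'http://otel-collector:4317' }), // assumed endpoint
    ),
  ],
});
sdk.start();

// Sentry keeps error tracking, tracing, and profiling; logs no longer go here.
Sentry.init({ dsn: process.env.SENTRY_DSN });
```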

Infrastructure (deployment/staging/):

  • otel-collector-config.yaml — OTLP receiver → batch processor → ClickHouse exporter
  • clickhouse-config.xml — S3 backup disk with from_env credential injection
  • clickhouse-users.xml — otel user for collector writes
  • grafana/provisioning/datasources/clickhouse.yaml — ClickHouse datasource with OTel schema
  • scripts/backup-clickhouse-logs.sh — Daily backup cron job

Pipeline (azure-pipelines.yml):

  • 7 new variables: CLICKHOUSE_PASSWORD, GRAFANA_ADMIN_PASSWORD, DO_SPACES_KEY, DO_SPACES_SECRET, DO_SPACES_REGION, DO_SPACES_BUCKET, DO_SPACES_LOG_PREFIX
  • Variables flow from Azure DevOps → .env → Docker Compose → container environment → ClickHouse from_env XML

Log Retention (ClickHouse TTL)

| Tier | Retention | Data |
|---|---|---|
| Hot | 30 days | Full structured logs |
| Warm | 90 days | Aggregated summaries |
| Archive | 365 days | Daily backups to DigitalOcean Spaces |

Consequences

Positive:

  • ✅ Unlimited log retention at self-hosted cost (~$0 incremental on existing droplet)
  • ✅ Full SQL query capability via Grafana for log analysis
  • ✅ Sentry retains its strengths (error tracking, tracing, profiling) without log clutter
  • ✅ ClickHouse columnar storage compresses logs 10-20x vs PostgreSQL
  • ✅ Vendor-neutral via OpenTelemetry — can swap backends without code changes
  • ✅ Daily automated backups to DigitalOcean Spaces (S3-compatible)

Negative:

  • ⚠️ Additional infrastructure to maintain (3 new containers: OTel Collector, ClickHouse, Grafana)
  • ⚠️ ~1 GB additional RAM on staging droplet
  • ⚠️ Backup credentials require Azure DevOps variable management
  • Mitigation: All configuration is pipeline-driven; containers are stateless except ClickHouse data volume

Alternatives Considered

| Alternative | Reason for Rejection |
|---|---|
| Keep Sentry Logs | Only 30-day retention, limited querying, potential cost at scale |
| Grafana + Loki | Loki less efficient for structured logs than ClickHouse |
| ELK Stack (Elasticsearch) | Heavy resource requirements, complex to operate |
| Datadog / New Relic | Expensive SaaS, vendor lock-in |
  • ClickHouse + Grafana Logging Research
  • ADR-016: Sentry Observability with OpenTelemetry (updated — Sentry no longer handles structured logging)
  • ADR-047: Three-Tier Logging Strategy (superseded — Sentry Logs tier replaced by ClickHouse)

ADR-059: Nx Affected Resilience via Last-Successful-Deploy Tag

| Attribute | Value |
|---|---|
| ID | ADR-059 |
| Status | Accepted |
| Date | 2026-02-22 |
| Context | Nx affected with HEAD~1 base loses track of undeployed changes when a pipeline run fails partway through |

Decision

Replace the hard-coded --base=HEAD~1 in the DetectAffected job with a last-successful-deploy git tag that is only advanced after the full pipeline succeeds. A new UpdateDeployTag stage at the end of the pipeline pushes the tag forward on success.

Problem

When the pipeline runs on main, nx affected --base=HEAD~1 compares the current commit against the previous commit. If the pipeline fails (e.g., during DeployStaging or AcceptanceTest), the changes in that commit are never deployed. When the next commit arrives with a fix, HEAD~1 now points to the failed commit — the originally undeployed changes are invisible to nx affected and are permanently skipped.

main:  A ─── B (deploy fails) ─── C (fix)
              │                     │
              └── HEAD~1 base ──────┘
              changes X,Y,Z         only fix Z is detected
              never deployed         X,Y permanently lost

Solution

main:  A ─── B (deploy fails) ─── C (fix)
       │                           │
       └── last-successful-deploy  └── HEAD
       tag stays at A              affected sees A→C (includes X,Y,Z + fix)

DetectAffected job:

  1. Check if last-successful-deploy tag exists
  2. If yes, use it as --base for nx affected and git diff
  3. If no (first run), fall back to HEAD~1

UpdateDeployTag stage:

  1. Depends on all pipeline stages (Build, DeployStaging, AcceptanceTest, LoadTest, DeployProduction, SmokeTest)
  2. Runs only when Build, DeployStaging, and AcceptanceTest did not fail (disabled stages like DeployProduction/SmokeTest are allowed to be skipped)
  3. Force-pushes the last-successful-deploy tag to the current commit
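
Both sides of the tag protocol can be exercised in a throwaway git repository. A self-contained sketch (the real pipeline force-pushes the tag to `origin`; here the tag is moved locally instead):

```typescript
// Sketch of ADR-059: DetectAffected resolves the base from the
// last-successful-deploy tag, and UpdateDeployTag advances it on success.
import { execSync } from 'node:child_process';
import { mkdtempSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

const repo = mkdtempSync(join(tmpdir(), 'adr059-'));
const git = (cmd: string) =>
  execSync(`git -c user.email=ci@x -c user.name=ci ${cmd}`, { cwd: repo })
    .toString()
    .trim();

// DetectAffected: use the tag if it exists, otherwise fall back to HEAD~1
const base = () => {
  try {
    git('rev-parse -q --verify refs/tags/last-successful-deploy');
    return 'last-successful-deploy';
  } catch {
    return 'HEAD~1'; // first run: the tag does not exist yet
  }
};

git('init -q');
git('commit -q --allow-empty -m A');
console.log(`first run base: ${base()}`);       // first run base: HEAD~1

// UpdateDeployTag: advance the tag only after the full pipeline succeeded
git('tag -f last-successful-deploy');

git('commit -q --allow-empty -m B');
console.log(`next run base: ${base()}`);        // next run base: last-successful-deploy
```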

Bootstrap

No manual setup required. On the first pipeline run, the tag does not exist and affected detection falls back to HEAD~1. After the first successful run, the tag is created automatically.

Consequences

Positive:

  • No changes are ever "forgotten" — failed deployments are re-evaluated on the next run
  • Self-healing: a fix PR automatically includes all previously missed changes
  • ForceFullVersioningAndDeployment parameter remains as a manual override
  • Zero infrastructure dependencies (uses git tags, no external state store)

Negative:

  • Requires persistCredentials: true on the UpdateDeployTag checkout step for git push
  • Force-pushing tags requires appropriate repository permissions for the pipeline service account
  • After extended failures, the affected set may be large (all changes since last success)
  • Mitigation: This is intentional — better to rebuild too much than to skip changes

Alternatives Considered

| Alternative | Reason for Rejection |
|---|---|
| HEAD~1 (previous approach) | Loses undeployed changes after pipeline failures |
| External state store (S3, SSM) | Additional infrastructure dependency for a simple use case |
| Nx Cloud affected tracking | Requires Nx Cloud subscription; overkill for current scale |
| Manual ForceFullVersioningAndDeployment after failures | Error-prone; depends on a human remembering to toggle it |
  • ADR-018: Nx Affected Conditional Deployment Strategy (extended by this ADR)
  • ADR-057: Self-Hosted Build Agent with Hybrid Pipeline Strategy
  • ADR-006: Azure DevOps for CI/CD

ADR-060: Single Source of Truth for STL Preview Generation

| Attribute | Value |
|---|---|
| ID | ADR-060 |
| Status | Accepted (extended by ADR-061) |
| Date | 2026-02-27 |
| Context | STL preview generation logic lived in the NestJS service but was also needed by offline cache population scripts, creating a duplication risk |

Decision

Extract all STL preview generation logic from GridflockPreviewService into a standalone generatePreviewStl() function in @forma3d/gridflock-core. Both the NestJS service and offline scripts import the same function — a single source of truth for preview STL generation.

Problem

The GridflockPreviewService contained the full preview generation pipeline (plate set calculation, offset computation, parallel plate generation, STL combining) as private methods and hardcoded constants. To pre-populate the STL preview cache offline (eliminating cold-start latency for 16,000+ dimension combinations), this logic needed to be callable from a standalone CLI script without depending on NestJS, Prisma, Redis, or any server infrastructure.

Duplicating the generation logic in the script would create a maintenance risk — any change to STL generation would need to be applied in two places, and divergence would produce inconsistent cached files.

Solution

New module: libs/gridflock-core/src/lib/preview-generator.ts

Exports generatePreviewStl(widthMm, heightMm, options?) which orchestrates the full pipeline:

  1. Calculate plate set using PRINTER_PROFILES['bambu-a1']
  2. Compute X/Y offsets per plate
  3. Generate plates in parallel via generatePlatesParallel() (with configurable maxWorkers)
  4. Combine into a single binary STL via combineStlBuffers()

Moved from service to library:

  • computeUniformOffsets() — cumulative X-axis offset calculation
  • computeOffsetsPerColumn() — per-column Y-axis offset calculation
  • PLATE_GAP_MM = 10 — gap constant between plates in the preview
  • DEFAULT_PREVIEW_OPTIONS — intersection-puzzle connectors, magnets disabled

maxWorkers parameter: Added to generatePlatesParallel() (backward-compatible optional parameter) so the offline script can limit worker threads per combination when running multiple combinations concurrently.

Service refactored: GridflockPreviewService.generatePreview() became a thin wrapper:

  1. Normalize dimensions (larger dimension first)
  2. Attempt plate-level assembly → return if successful
  3. Fall back to generatePreviewStl(w, h, { log }) → return
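
The normalization in step 1 is what makes cache entries order-independent, so 320×450 and 450×320 resolve to the same key (see Consequences below). A minimal sketch with a hypothetical key format:

```typescript
// Illustrative cache-key normalization: larger dimension first.
// The key format "preview-{a}x{b}.stl" is an assumption, not the real scheme.
function previewCacheKey(widthMm: number, heightMm: number): string {
  const [a, b] = widthMm >= heightMm ? [widthMm, heightMm] : [heightMm, widthMm];
  return `preview-${a}x${b}.stl`;
}

console.log(previewCacheKey(320, 450)); // preview-450x320.stl
console.log(previewCacheKey(450, 320)); // preview-450x320.stl
```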

Note: ADR-061 extends this architecture with a plate-level cache that assembles previews from ~268 cached base plates + dynamically generated border geometry (~60 MB total) while supporting any input resolution. The legacy full-preview disk cache (16,471 files, ~32 GB) was removed in March 2026.

Consequences

Positive:

  • Single source of truth — changes to STL generation logic happen in one place
  • Offline scripts produce byte-for-byte identical output to the server
  • maxWorkers enables CPU-aware parallelism in the population script without oversubscription
  • Cache key normalization prevents duplicate entries (320×450 = 450×320)
  • @forma3d/gridflock-core remains NestJS-independent — usable in any Node.js context

Negative:

  • Preview generation parameters (printer profile, connector type) are hardcoded in the library rather than configurable per-tenant
  • Mitigation: When multi-tenant preview customization is needed, generatePreviewStl can accept a configuration parameter

Alternatives Considered

| Alternative | Reason for Rejection |
|---|---|
| Duplicate generation logic in the script | Maintenance risk — two codepaths that must stay in sync |
| Import NestJS service in the script | Would require NestJS DI container, Prisma, Redis — heavyweight for an offline tool |
| Use the server's REST API from the script | Network-bound, requires a running server, doesn't leverage local CPU cores |
  • ADR-053: Buffer-Based GridFlock Pipeline (No Local File Storage)
  • ADR-061: Plate-Level Preview Cache with Dynamic Border Assembly
  • STL Cache Pre-Population Research
  • scripts/populate-plate-cache.ts — plate-level cache population script

ADR-061: Plate-Level Preview Cache with Dynamic Border Assembly

| Attribute | Value |
|---|---|
| ID | ADR-061 |
| Status | Accepted |
| Date | 2026-03-01 |
| Context | Full-preview-per-dimension cache requires ~32 GB for 0.5 cm resolution (16,471 files) and ~853 GB for 1 mm resolution (406,351 files), making fine-grained drawer-fit precision impractical |

Decision

Replace the full-preview-per-dimension cache with a plate-level cache of 200 base plates (~41 MB) that are assembled on the fly with dynamically generated border geometry to produce previews for any dimension at any resolution.

Problem

The storefront configurator's 3D preview required pre-populating one STL file per dimension pair. At 0.5 cm resolution (step 5 mm), this was 16,471 files at ~32 GB — already impractical to generate (10–14 hours) and deploy. Supporting 1 mm input precision (for exact drawer fit) would require 406,351 files at ~853 GB, which was not feasible.

Analysis revealed that each preview's plate geometry is determined by only two factors: grid size (1–6 × 1–6) and connector edge pattern (4 booleans). The border around the grid cells is what creates the combinatorial explosion (147,500 unique plates with border vs only 200 without border). The border itself is a trivial rectangular solid that can be generated in microseconds.

Solution

Three-tier architecture:

  1. 200 base plates cached — each is a unique (gridSize, connectorEdges) combination with zero border and no plate number, generated via JSCAD CSG and stored as binary STL
  2. Border strips generated on the fly — simple rectangular cuboids (12 triangles, 684 bytes each) created as raw binary STL without any JSCAD dependency
  3. Assembly via combineStlBuffers() — existing buffer concatenation with vertex offset translation

New modules in @forma3d/gridflock-core:

  • preview-generator.ts additions:
  • basePlateCacheKey() — deterministic key plate-{cols}x{rows}-{NESW}.stl
  • generateBasePlateStl() — JSCAD plate with NO_BORDER and plateNumber: false
  • enumerateAllBasePlateKeys() — discovers all 200 keys via representative dimensions
  • assemblePreviewFromPlateCache() — assembly function accepting a plate-lookup callback
  • border-generator.ts — pure binary STL box generation, no JSCAD dependency
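
The "12 triangles, 684 bytes" figure follows directly from the binary STL layout: an 80-byte header, a 4-byte uint32 triangle count, then 50 bytes per triangle (normal + three vertices + attribute count). A hypothetical standalone sketch, not the actual border-generator.ts:

```typescript
// Illustrative JSCAD-free binary STL generation for an axis-aligned cuboid.
// 80-byte header + 4-byte count + 12 triangles * 50 bytes = 684 bytes.
function boxStl(w: number, d: number, h: number): Buffer {
  // 8 corner vertices of the cuboid at the origin
  const v: [number, number, number][] = [];
  for (const z of [0, h]) for (const y of [0, d]) for (const x of [0, w]) v.push([x, y, z]);
  // 12 triangles (two per face), indexing into v; winding is illustrative only
  const faces: [number, number, number][] = [
    [0, 1, 2], [1, 3, 2], [4, 6, 5], [5, 6, 7], // bottom, top
    [0, 4, 1], [1, 4, 5], [2, 3, 6], [3, 7, 6], // front, back
    [0, 2, 4], [2, 6, 4], [1, 5, 3], [3, 5, 7], // left, right
  ];
  const buf = Buffer.alloc(80 + 4 + faces.length * 50); // zero-filled header
  buf.writeUInt32LE(faces.length, 80);
  let off = 84;
  for (const [a, b, c] of faces) {
    off += 12; // leave the normal as (0,0,0); most loaders recompute it
    for (const i of [a, b, c]) {
      buf.writeFloatLE(v[i][0], off);
      buf.writeFloatLE(v[i][1], off + 4);
      buf.writeFloatLE(v[i][2], off + 8);
      off += 12;
    }
    off += 2; // attribute byte count (unused)
  }
  return buf;
}

console.log(boxStl(10, 2, 5).length); // 684
```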

New service in gridflock-service:

  • PlateCacheService — loads all 200 base plates into memory at startup (~41 MB), provides synchronous lookup by key

Preview resolution cascade:

  1. Plate-level assembly (base plates + dynamic borders) → 10–100 ms
  2. Full JSCAD generation via generatePreviewStl() → 12–30 seconds (fallback)

Dimension validation updated:

  • Shopify configurator: step="0.1" (1 mm precision), max="100" (100 cm)
  • Backend DTOs: @Max(1000) for both width and height (both preview and checkout)
  • Sub-millimeter inputs rounded down (floor) to nearest 0.1 cm

Consequences

Positive:

  • Cache reduced from ~32 GB (16,471 files) to ~60 MB (~268 files)
  • Population time reduced from 10–14 hours to 2–5 minutes
  • Supports any input resolution (1 mm, 0.5 cm, continuous) with the same cache
  • Preview assembly completes in 10–100 ms — no perceptible delay
  • Production STL generation path is completely unchanged (byte-identical output)
  • Legacy full-preview cache was removed entirely (March 2026) — no longer needed

Negative:

  • Preview STLs are not byte-identical to the legacy full previews (different border geometry, no plate numbers)
  • Mitigation: Visually equivalent in the 3D viewer; the differences (rounded vs square border corners, absent plate numbers) are invisible in the preview context. Production plates remain unchanged.

Alternatives Considered

| Alternative | Reason for Rejection |
|---|---|
| Cache individual plates with border baked in | 147,500 files at ~29 GB — still too large, does not solve the scaling problem |
| Generate full previews at 1 mm resolution | ~853 GB, ~69 hours to generate — completely impractical |
| Server-side rendering (image preview instead of STL) | Loses the interactive 3D preview that customers value; would require a rendering pipeline |
  • ADR-060: Single Source of Truth for STL Preview Generation
  • ADR-053: Buffer-Based GridFlock Pipeline (No Local File Storage)
  • Plate-Level Preview Cache Prompt — Full analysis with combinatorics
  • STL Cache Pre-Population Research
  • scripts/populate-plate-cache.ts — base plate population script
  • libs/gridflock-core/src/lib/border-generator.ts — pure binary STL border generation
  • apps/gridflock-service/src/gridflock/plate-cache.service.ts — in-memory plate cache

ADR-062: Inventory Tracking and Stock Replenishment

| Attribute | Value |
|---|---|
| ID | ADR-062 |
| Status | Accepted |
| Date | 2026-03-08 |
| Context | Forma3D.Connect operates as a pure print-to-order platform, meaning every order triggers a new print job. Popular products are reprinted constantly, causing fulfillment delays during peak demand and leaving printers idle during quiet periods. |

Decision

Introduce a hybrid fulfillment model with opt-in inventory tracking at the ProductMapping level, scheduled stock replenishment during quiet periods, and stock-aware order fulfillment that consumes available stock before creating print jobs.

Problem

Popular products (best-sellers) follow a predictable demand pattern, yet every order triggers a full print cycle (4–24 hours). During weekend order surges, backlogs build up. During weekday quiet periods, printers sit idle. There is no mechanism to pre-print popular products or track physical stock of completed units.

Solution

Three new capabilities, all placed in order-service (tightly coupled with orchestration):

1. Inventory Tracking (InventoryModule)

  • Extended ProductMapping with stock fields: currentStock, minimumStock, maximumStock, replenishmentPriority, replenishmentBatchSize
  • Stock management is opt-in: minimumStock = 0 (default) keeps the product as print-to-order; minimumStock > 0 enables stock tracking
  • One stock unit = one complete set of all AssemblyParts for a product
  • All stock mutations (production, consumption, adjustment, scrapping) create InventoryTransaction records for a full audit trail
  • currentStock can never go negative; all mutations are atomic via database transactions

2. Stock Replenishment (StockReplenishmentModule)

  • Cron scheduler evaluates stock levels every 10 minutes
  • Respects configurable allowedHours, allowedDays to run during quiet periods only
  • Skips when order print queue exceeds orderQueueThreshold (order jobs always take priority)
  • Skips when active stock jobs exceed maxConcurrentStockJobs capacity
  • For each product where currentStock < minimumStock, calculates deficit (accounting for pending StockBatches) and creates batches
  • One StockBatch = one sellable unit = PrintJob records for all AssemblyParts × quantityPerProduct
  • PrintJob records created with purpose = 'STOCK', lineItemId = null, stockBatchId set
  • Global STOCK_REPLENISHMENT_ENABLED environment variable acts as master switch
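
The deficit calculation above can be sketched as a pure function. Field names mirror the ProductMapping stock fields; the exact sizing policy (e.g. how replenishmentBatchSize and replenishmentPriority factor in) is simplified here and is an assumption:

```typescript
// Hypothetical sketch of the replenishment deficit check (ADR-062).
interface StockConfig {
  currentStock: number;
  minimumStock: number;   // 0 = pure print-to-order, replenishment skipped
  maximumStock: number;
  pendingBatches: number; // StockBatches created but not yet completed
}

// Number of new StockBatches (one sellable unit each) to create.
function batchesToCreate(c: StockConfig): number {
  if (c.minimumStock === 0) return 0; // opt-in: stock tracking disabled
  const projected = c.currentStock + c.pendingBatches;
  if (projected >= c.minimumStock) return 0;
  // refill to minimumStock, but never schedule past maximumStock
  return Math.min(c.minimumStock - projected, c.maximumStock - projected);
}

console.log(
  batchesToCreate({ currentStock: 1, minimumStock: 5, maximumStock: 10, pendingBatches: 2 }),
); // 2
```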

3. Stock-Aware Order Fulfillment (updated OrchestrationService)

  • OrchestrationService.handleOrderCreated() now calls InventoryService.tryConsumeStock() before creating print jobs
  • If stock covers the full order quantity, no print jobs are created; order completes immediately
  • Partial fulfillment supported: consume available stock, print remaining units
  • Products with minimumStock = 0 bypass stock consumption entirely (unchanged print-to-order flow)
  • GridFlock products bypass stock consumption (custom STL pipeline unchanged)
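
The partial-fulfillment split can be illustrated with a small helper (a simplification of what tryConsumeStock returns; the real method also records an InventoryTransaction atomically):

```typescript
// Illustrative stock-aware fulfillment split: consume what is available,
// print the remainder. Never lets stock go negative.
function consumeStock(currentStock: number, orderedQty: number) {
  const consumed = Math.min(currentStock, orderedQty);
  return { consumed, remainingToPrint: orderedQty - consumed };
}

console.log(consumeStock(2, 5)); // { consumed: 2, remainingToPrint: 3 }
```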

4. Schema Changes

  • PrintJob.lineItemId made nullable (stock jobs have no line item)
  • PrintJob.purpose field added (enum: ORDER | STOCK, default ORDER)
  • PrintJob.stockBatchId FK added (nullable, references StockBatch)
  • New StockBatch model (id, productMappingId, status, totalJobs, completedJobs)
  • New InventoryTransaction model (id, productMappingId, transactionType, quantity, direction, referenceType, referenceId, notes, createdBy)
  • New enums: PrintJobPurpose, StockBatchStatus, InventoryTransactionType, StockDirection, InventoryReferenceType

5. API Endpoints (proxied via gateway at /api/v1/inventory/*)

| Method | Path | Permission | Purpose |
|---|---|---|---|
| GET | /api/v1/inventory/stock | inventory.read | Stock levels for all managed products |
| PUT | /api/v1/inventory/stock/:id/config | inventory.write | Update stock configuration |
| POST | /api/v1/inventory/stock/:id/adjust | inventory.write | Manual stock adjustment with audit trail |
| POST | /api/v1/inventory/stock/:id/scrap | inventory.write | Scrap damaged stock with audit trail |
| GET | /api/v1/inventory/stock/:id/transactions | inventory.read | Transaction history (paginated) |
| GET | /api/v1/inventory/replenishment/status | inventory.read | Replenishment system status |

6. Event Flow

  • STOCK_REPLENISHMENT_SCHEDULED event published per created stock print job (BullMQ)
  • print-job.completed events with purpose === 'STOCK' route to InventoryService.handleStockJobCompleted() instead of order orchestration
  • When a StockBatch completes (all jobs done), ProductMapping.currentStock is incremented by 1 and a PRODUCED transaction is recorded

Consequences

Positive:

  • Best-seller orders can be fulfilled in minutes (from stock) instead of 4–24 hours (printing)
  • Printer utilization improves — idle periods fill with stock replenishment work
  • Weekend order surges are absorbed by pre-built stock
  • Full audit trail of every stock movement via InventoryTransaction ledger
  • Existing print-to-order flow is completely unchanged for products with minimumStock = 0
  • Existing GridFlock custom STL pipeline is unaffected

Negative:

  • PrintJob.lineItemId is now nullable, requiring safe access patterns (?. and ?? '') across order-service and print-service
  • Mitigation: All existing services updated; comprehensive unit tests added for both null and non-null cases
  • Stock jobs consume printer capacity that could serve order jobs during unexpected demand spikes
  • Mitigation: orderQueueThreshold ensures stock replenishment yields to order jobs; replenishment only runs during configurable quiet periods

Alternatives Considered

| Alternative | Reason for Rejection |
|---|---|
| Separate inventory microservice | Tight coupling with orchestration logic (stock consumption happens during order processing); would require distributed transactions or a saga pattern for atomicity |
| Track inventory at the AssemblyPart level | Overly complex; operators think in "sellable units," not individual parts. Part-level tracking would require complex partial-unit logic |
| Manual-only replenishment (no scheduling) | Defeats the purpose of utilizing idle printer capacity; operators would need to manually evaluate and trigger batches |
| Priority queue preemption for order jobs | Too complex for v1; a simple threshold check achieves the same goal with much less risk |
  • ADR-008: Event-Driven Internal Communication
  • ADR-012: Assembly Parts Model for Product Mapping
  • ADR-022: Event-Driven Fulfillment Architecture
  • ADR-051: Decompose Monolithic API into Domain-Aligned Microservices
  • ADR-052: BullMQ Event Queues for Inter-Service Async Communication
  • Stock Management Prompt — Full implementation specification
  • apps/order-service/src/inventory/ — InventoryModule implementation
  • apps/order-service/src/stock-replenishment/ — StockReplenishmentModule implementation
  • libs/domain-contracts/src/api/inventory.api.ts — API response contracts

ADR-063: ORDER-over-STOCK Print Queue Priority

| Attribute | Value |
|---|---|
| ID | ADR-063 |
| Status | Accepted |
| Date | 2026-03-09 |
| Context | With the introduction of stock replenishment (ADR-062), both ORDER-purpose and STOCK-purpose print jobs share the same SimplyPrint print queue. Without explicit ordering, STOCK jobs scheduled during quiet periods could delay incoming customer orders. |

Decision

Implement best-effort FIFO-within-priority-class queue ordering. ORDER-purpose jobs always precede STOCK-purpose jobs in the SimplyPrint queue. Within each class, jobs are processed in FIFO order.

Problem

When a customer order arrives and the SimplyPrint queue already contains STOCK replenishment jobs, the new ORDER job is appended at the end of the queue. This means the customer's order waits behind pre-printing stock jobs, causing unnecessary fulfillment delays — defeating the purpose of replenishment (which is meant to improve, not degrade, customer experience).

Solution

After each ORDER-purpose print job is added to the SimplyPrint queue, the system:

  1. Queries the current queue from SimplyPrint (GET /{id}/queue/GetItems)
  2. Queries local PrintJob records to identify which queue items are ORDER-purpose (findActiveOrderQueueItemIds)
  3. Calculates the correct position: existingOrderCount + 1 (after all existing ORDER items)
  4. Moves the new item if its current position is behind the target, using SimplyPrint's SetOrder endpoint (GET /{id}/queue/SetOrder?queue_item={id}&to={position})

STOCK-purpose jobs are never explicitly reordered — they naturally accumulate after ORDER jobs as new ORDER items are inserted before them.
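
The positioning rule can be sketched as a pure function over the queue snapshot (interface and function names are hypothetical, not the real service API):

```typescript
// Illustrative ORDER-over-STOCK positioning (ADR-063).
interface QueueItem { id: number; purpose: 'ORDER' | 'STOCK'; }

// 1-based position a newly added ORDER item should occupy:
// directly after all existing ORDER items, ahead of any STOCK items.
function targetOrderPosition(queue: QueueItem[], newItemId: number): number {
  const existingOrderCount = queue.filter(
    (q) => q.purpose === 'ORDER' && q.id !== newItemId,
  ).length;
  return existingOrderCount + 1;
}

// A new ORDER item appended behind two STOCK items should move to position 2.
const queue: QueueItem[] = [
  { id: 1, purpose: 'ORDER' },
  { id: 2, purpose: 'STOCK' },
  { id: 3, purpose: 'STOCK' },
  { id: 4, purpose: 'ORDER' }, // just appended by AddItem
];
const target = targetOrderPosition(queue, 4);
const current = queue.findIndex((q) => q.id === 4) + 1;
console.log(target, current > target ? 'move' : 'keep'); // 2 move
```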

Key design properties:

  • FIFO within each class: New ORDER jobs go after existing ORDER jobs; STOCK jobs maintain their original arrival order
  • Best-effort: If the reorder API call fails, the job remains in the queue at its default position. The operation logs a warning but does not fail the print job creation
  • Retry-aware: Retried ORDER jobs also receive priority positioning
  • No preemption: Jobs already assigned to printers or in-progress are not affected

Consequences

Positive:

  • Customer orders are always printed before stock replenishment items
  • FIFO ordering within each priority class prevents starvation and ensures fairness
  • Best-effort approach means a SimplyPrint API failure cannot block print job creation
  • No additional database schema changes required

Negative:

  • Two additional API calls per ORDER print job (getQueue + setQueueOrder) add latency
  • Mitigation: Both calls are fast (<100ms each) and only occur for ORDER jobs
  • Race condition between concurrent ORDER job insertions could result in suboptimal ordering
  • Mitigation: The overall constraint (ORDER before STOCK) is maintained; within-ORDER FIFO may have minor deviation under high concurrency, which is acceptable
  • Relies on SimplyPrint's SetOrder API being available and correctly implemented
  • Mitigation: Graceful degradation — jobs remain in queue at default position on failure

Alternatives Considered

| Alternative | Reason for Rejection |
|---|---|
| SimplyPrint priority field on AddItem | SimplyPrint's API does not expose a priority parameter for queue items |
| Insert ORDER jobs at position 1 (to=1) | Breaks FIFO among ORDER jobs — later orders would execute before earlier ones (LIFO) |
| Separate SimplyPrint queues per purpose | SimplyPrint does not support multiple queues per company; would require separate company accounts |
| Threshold-only approach (ADR-062 original) | Prevents stock jobs from being created when orders are busy, but does not help when stock jobs are already queued and a new order arrives |
  • ADR-062: Inventory Tracking and Stock Replenishment
  • ADR-052: BullMQ Event Queues for Inter-Service Async Communication
  • apps/print-service/src/print-jobs/print-jobs.service.ts — prioritizeOrderJobInQueue() implementation
  • apps/print-service/src/simplyprint/simplyprint-api.client.ts — setQueueOrder() method
  • SimplyPrint Queue API — SetOrder endpoint documentation

ADR-064: Stock Replenishment Event Subscriber for SimplyPrint Queue

| Attribute | Value |
|---|---|
| ID | ADR-064 |
| Status | Accepted |
| Date | 2026-03-09 |
| Context | ADR-062 introduced stock replenishment with the StockReplenishmentService creating StockBatch and PrintJob records and publishing STOCK_REPLENISHMENT_SCHEDULED events via BullMQ. However, no subscriber was wired up to consume these events, so stock print jobs remained in QUEUED status indefinitely and never reached the SimplyPrint print queue. |

Decision

Wire a subscriber for STOCK_REPLENISHMENT_SCHEDULED events in the order-service's EventSubscriberService that queues each stock print job to SimplyPrint via SimplyPrintApiClient.addToQueue().

Problem

After the stock replenishment scheduler creates PrintJob records with purpose = 'STOCK' and publishes STOCK_REPLENISHMENT_SCHEDULED events, nothing consumes those events. The print jobs sit in QUEUED status in the database but never reach the SimplyPrint API queue. Printers never receive stock jobs, defeating the purpose of the replenishment system.

Solution

Add a subscription for SERVICE_EVENTS.STOCK_REPLENISHMENT_SCHEDULED in EventSubscriberService.onModuleInit(). The handler (queueStockJobToSimplyPrint) mirrors the ORDER-purpose flow in PrintJobsService.createSinglePrintJob() but without order/line-item context or ORDER-over-STOCK priority reordering:

  1. Validates fileId is present (skips if null)
  2. Checks SimplyPrintApiClient.isEnabled() (skips if not configured)
  3. Looks up the PrintJob by ID (skips if not found)
  4. Checks idempotency — skips if job already has a simplyPrintJobId
  5. Calls SimplyPrintApiClient.addToQueue({ fileId, amount: 1 })
  6. Releases any stale simplyPrintJobId from old jobs (SimplyPrint may reuse queue-item IDs)
  7. Updates the PrintJob with simplyPrintJobId and simplyPrintQueueItemId
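
The guard sequence can be condensed into a dependency-free sketch (the real handler is queueStockJobToSimplyPrint in EventSubscriberService; the stale-ID release step is omitted here for brevity):

```typescript
// Illustrative idempotent STOCK-job handler (ADR-064), with the SimplyPrint
// client stubbed out as a callback.
interface PrintJob { id: string; fileId: string | null; simplyPrintJobId: string | null; }

async function queueStockJob(
  job: PrintJob | undefined,
  addToQueue: (fileId: string) => Promise<string>, // returns a queue-item id
): Promise<string> {
  if (!job) return 'skipped: job not found';
  if (!job.fileId) return 'skipped: no fileId';
  if (job.simplyPrintJobId) return 'skipped: already queued'; // idempotency guard
  job.simplyPrintJobId = await addToQueue(job.fileId);
  return `queued as ${job.simplyPrintJobId}`;
}

(async () => {
  const job: PrintJob = { id: 'pj-1', fileId: 'file-9', simplyPrintJobId: null };
  const fakeAdd = async () => 'sp-42';
  console.log(await queueStockJob(job, fakeAdd)); // queued as sp-42
  console.log(await queueStockJob(job, fakeAdd)); // skipped: already queued
})();
```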

Why the order-service subscribes to its own event (not the print-service):

  • The ORDER-purpose print job flow runs entirely within the order-service (via OrchestrationService → PrintJobsService → SimplyPrintApiClient)
  • The order-service already has SimplyPrintApiClient, PrintJobsRepository, and the idempotency/release logic
  • Keeping STOCK job queuing in the same service avoids duplicating SimplyPrint integration code across services
  • The EventSubscriberService already bridges BullMQ events to local handlers for print-job completion, shipments, and integrations

STOCK jobs are not priority-reordered. ORDER-over-STOCK priority (ADR-063) is handled by PrintJobsService.prioritizeOrderJobInQueue() on the ORDER side. STOCK jobs naturally queue behind ORDER jobs.

Consequences

Positive:

  • Stock replenishment now works end-to-end: cron → batch creation → event → SimplyPrint queue → printer
  • Idempotent: duplicate events are safely ignored (job already has SimplyPrint ID)
  • Graceful degradation: SimplyPrint API failures are logged but don't crash the event processing
  • No changes to the StockReplenishmentService itself — it continues to publish events as designed

Negative:

  • EventSubscriberService now depends on SimplyPrintApiClient (previously it only used repositories and re-emitted events)
  • Mitigation: SimplyPrintModule was already imported in EventsModule; the additional dependency is minimal

Alternatives Considered

| Alternative | Reason for Rejection |
|---|---|
| Subscribe in print-service EventSubscriberService | Print-service's ORDER flow is unused (order-service handles everything); would duplicate SimplyPrint integration code and require maintaining two parallel queue paths |
| Call SimplyPrintApiClient.addToQueue() directly in StockReplenishmentService.evaluateAndSchedule() | Mixes batch-creation concern with queue-dispatch concern; the event-driven approach allows retry/replay and keeps the replenishment service focused on scheduling logic |
| Create a dedicated StockJobDispatcherService | Over-engineering for a single subscriber; EventSubscriberService already handles similar bridge logic for print-job and shipment events |
References

  • ADR-062: Inventory Tracking and Stock Replenishment
  • ADR-063: ORDER-over-STOCK Print Queue Priority
  • ADR-052: BullMQ Event Queues for Inter-Service Async Communication
  • apps/order-service/src/events/event-subscriber.service.ts — queueStockJobToSimplyPrint() implementation
  • apps/order-service/src/stock-replenishment/stock-replenishment.service.ts — event publisher
  • docs/03-architecture/sequences/C4_Seq_11_StockReplenishment.puml — updated sequence diagram

ADR-065: SonarCloud for Continuous Code Quality Analysis

Status: Accepted
Date: 2026-03-12
Context: The codebase has ESLint for linting and Syft + Grype for container security scanning, but lacks cross-cutting code quality metrics: duplicated code detection, cognitive complexity scoring, technical debt quantification, and historical trend tracking. A dedicated static analysis platform was needed to fill this gap.

Decision

Adopt SonarCloud Team ($32/month) as the continuous code quality platform, integrated into the Azure DevOps CI/CD pipeline.

Problem

Without a cross-cutting code quality platform:

  • Duplicated code across microservices was invisible — at initial scan, 19.5% of the codebase was duplicated
  • Cognitive complexity of functions was unchecked, leading to unmaintainable business logic
  • Security hotspots (regex DoS, pseudorandom generators, publicly writable directories) were undetected
  • Technical debt had no quantification or trend tracking
  • PR reviews lacked automated quality gate enforcement

Solution

SonarCloud analyzes every push to main and every PR, providing:

  1. sonar-project.properties at the repository root — configures source directories, exclusions, coverage report paths, and rule suppressions
  2. CodeQuality job in the ValidateAndTest pipeline stage — runs after UnitTests, downloads coverage artifacts, executes SonarCloud analysis
  3. PR decoration — SonarCloud posts quality gate status and issue summaries directly on Azure DevOps pull requests
  4. Coverage integration — lcov.info reports from Vitest/Jest are uploaded to SonarCloud; sonar.coverage.exclusions aligns the denominator with the test frameworks' exclusion patterns
  5. Quality gate — blocks merges when new code introduces bugs, vulnerabilities, or excessive duplication
  6. Rule suppression — false positives and won't-fix items are suppressed via sonar.issue.ignore.multicriteria in sonar-project.properties (inline // NOSONAR comments do not work for JS/TS in SonarCloud)
  7. AI Code Assurance — SonarCloud applies a stricter quality gate to AI-generated code, requiring higher coverage and zero issues
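A condensed sketch of what sonar-project.properties might contain (items 1, 4, and 6 above). The project key, paths, and the typescript:S1234 rule key are placeholders, not the project's actual values:

```properties
# Sketch — actual keys, paths, and suppressed rules differ.
sonar.projectKey=forma3d-connect
sonar.sources=apps,libs
sonar.exclusions=**/node_modules/**,**/*.spec.ts

# Coverage: point SonarCloud at the lcov reports produced in CI
sonar.javascript.lcov.reportPaths=coverage/**/lcov.info
# Align the coverage denominator with Vitest/Jest exclusion patterns
sonar.coverage.exclusions=**/*.spec.ts,**/main.ts

# Suppressions (inline // NOSONAR does not work for JS/TS on SonarCloud)
sonar.issue.ignore.multicriteria=e1
sonar.issue.ignore.multicriteria.e1.ruleKey=typescript:S1234
sonar.issue.ignore.multicriteria.e1.resourceKey=**/legacy/**
```

Each `eN` entry pairs a rule key with a resource glob, which is what makes the suppressions auditable in one place.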

Key Configuration Decisions

| Decision | Rationale |
|---|---|
| SonarCloud (SaaS) over SonarQube (self-hosted) | Zero infrastructure overhead; staging droplet already at 96% memory usage |
| Monorepo-level scan (not per-project) | Single quality gate covers all 14 source directories; simpler than Nx-per-project scanning |
| configMode: 'file' in pipeline | All configuration centralized in sonar-project.properties, not scattered in YAML |
| CodeQuality job on MS-hosted agent | No load on the self-hosted DO build agent; SonarCloud analysis is network-bound, not CPU-bound |
| Rule suppressions in properties file | // NOSONAR does not work for TypeScript/JavaScript in SonarCloud; properties file suppressions are auditable and centralized |
| Inline comments with rule keys | Each suppressed code location has a // Sonar suppression — typescript:SXXXX: reason comment for traceability |

Results (First Week)

| Metric | Before (2026-03-12) | After (2026-03-13) | Change |
|---|---|---|---|
| Total issues | 769 | 244 | -68% |
| Bugs | 9 | 0 | -100% |
| Vulnerabilities | 12 | 0 | -100% |
| Code smells | 748 | 244 | -67% |
| Security hotspots | 6 (TO_REVIEW) | 0 | -100% |
| Duplication | 19.5% | 15.7% | -3.8pp |
| Duplicated lines | ~13,300 | 10,743 | -19% |

Consequences

Positive:

  • Every PR now has automated quality gate enforcement with inline issue annotations
  • Code duplication is visible and measurable — drove the extraction of libs/service-common (12,900 duplicated lines removed)
  • Security hotspots are reviewed and tracked
  • Technical debt is quantified with effort estimates
  • Coverage discrepancies between Azure DevOps and SonarCloud are resolved by aligning sonar.coverage.exclusions with Vitest/Jest collectCoverageFrom patterns

Negative:

  • Monthly cost of $32 for SonarCloud Team plan
  • Mitigation: Cost is trivial compared to the engineering time saved on code review and technical debt discovery
  • sonar.issue.ignore.multicriteria in properties file must be maintained as rules are suppressed
  • Mitigation: Each suppression has a documented rationale and inline code comments

Alternatives Considered

| Alternative | Reason for Rejection |
|---|---|
| SonarQube self-hosted | Staging droplet already at 96% memory; would need separate infrastructure (~$20/month for a 2 GB droplet + maintenance overhead) |
| ESLint-only (no SonarCloud) | ESLint cannot detect cross-file duplication, cognitive complexity trends, or security hotspots; no PR decoration or historical dashboards |
| CodeClimate | Less mature TypeScript support; no native Azure DevOps integration; higher cost at scale |
| Codacy | Similar capabilities but SonarCloud has stronger NestJS/React ecosystem support and the team already evaluated it in the research phase |

ADR-066: CodeCharta City Visualization for Codebase Insight

Status: Accepted
Date: 2026-03-14
Context: SonarCloud provides numeric code quality metrics (complexity, duplication, code smells, coverage), but these numbers lack spatial context. Developers cannot easily identify hotspots — large, complex, frequently-changed files — or knowledge silos (single-author modules). A visual representation was needed to make these metrics actionable for sprint planning, retrospectives, and onboarding.

Decision

Integrate CodeCharta into the CI/CD pipeline (Option C from the research document) to generate a 3D city map from SonarCloud metrics + git history, served from the existing docs container with shareable URLs.

Problem

  • SonarCloud metrics are presented as flat lists and numeric summaries — they do not convey spatial relationships between files
  • Identifying complexity hotspots, change frequency patterns, and knowledge silos requires manually correlating multiple SonarCloud views
  • New team members have no visual onboarding aid to understand codebase structure
  • No way to share preconfigured metric views with the team via bookmarkable URLs

Solution

  1. GenerateCodeCharta pipeline job in the Build stage — runs on MS-hosted ubuntu-latest, uses the codecharta/codecharta-analysis Docker image (~1.2 GB, CI-only), imports SonarCloud metrics via ccsh sonarimport and parses git history via ccsh gitlogparser, then merges both into forma3d.cc.json
  2. Artifact handoff — the .cc.json is published as a pipeline artifact and downloaded by the PackageDocs job
  3. Dockerfile integration — COPY codecharta/forma3d.cc.jso[n] in the docs Dockerfile uses a glob pattern to gracefully handle the file's absence on PR branches
  4. Nginx CORS — a /codecharta/ location block serves the file with Access-Control-Allow-Origin: https://codecharta.com so the hosted Web Studio can fetch it via XHR
  5. Shareable URLs — bookmarkable links to codecharta.com/visualization/app/index.html?file=... with preconfigured metric mappings (area=ncloc, height=cognitive_complexity, color=code_smells)
  6. Settings page link — a "Codebase City Map" link in the Help & Support section, visible only to admin users of the default tenant
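The CORS block (step 4) might look like the following sketch — the root path and exact directive set in the real nginx config may differ:

```nginx
# Sketch: serve the CodeCharta data file to the hosted Web Studio only.
location /codecharta/ {
    # Not a wildcard — only codecharta.com may fetch the file via XHR
    add_header Access-Control-Allow-Origin "https://codecharta.com" always;
    # Users always see the latest map after a pipeline run
    add_header Cache-Control "no-cache" always;
    try_files $uri =404;
}
```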

Key Configuration Decisions

| Decision | Rationale |
|---|---|
| Option C (hosted Web Studio + docs-served data) over Options A/B/D | No new container, no new DNS record, no self-hosted visualization; reuses existing docs infrastructure |
| Separate read-only SonarCloud token (SONARCLOUD_CODECHARTA_TOKEN) | Security isolation from the service connection used for analysis; revocable without affecting CI |
| fetchDepth: 0 in GenerateCodeCharta job only | Full git history needed for gitlogparser; other jobs keep shallow clones for speed |
| CORS restricted to https://codecharta.com | Not a wildcard *; only the CodeCharta Web Studio can fetch the file |
| Cache-Control: no-cache on /codecharta/ | Users always see the latest map after a pipeline run |
| Glob trick forma3d.cc.jso[n] in Dockerfile | Docker COPY fails on missing source files; the glob pattern makes it optional without conditional logic |
| PackageDocs condition updated with or() | Docs rebuild on main when CodeCharta succeeds, even if docs content hasn't changed — ensures fresh maps |
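The glob trick boils down to a single Dockerfile line (the destination path here is illustrative):

```dockerfile
# COPY fails outright on a missing source file, but a glob that matches
# zero or one file succeeds either way — the [n] character class makes
# the artifact optional without any conditional logic.
COPY codecharta/forma3d.cc.jso[n] /usr/share/nginx/html/codecharta/
```

On PR branches, where the GenerateCodeCharta job doesn't run, the glob matches nothing and the build proceeds; on main it copies the freshly generated file.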

Consequences

Positive:

  • The team can visualize the codebase as a 3D city — buildings represent files, dimensions encode metrics (lines of code, cognitive complexity, code smells, coverage, change frequency)
  • Hotspots, knowledge silos, and temporal coupling are immediately visible
  • Shareable URLs enable preconfigured views for sprint planning and retrospectives
  • Zero infrastructure cost — reuses existing docs container and publicly hosted CodeCharta Web Studio
  • The .cc.json contains only file paths and numeric metrics — no source code is exposed

Negative:

  • The GenerateCodeCharta job adds ~2–3 minutes to the pipeline on main branch builds
  • Mitigation: Runs on MS-hosted agents, in parallel with other packaging jobs
  • Dependency on the publicly hosted CodeCharta Web Studio at codecharta.com
  • Known limitation: CodeCharta's CSP (default-src 'self') blocks XHR to external origins, so the shareable ?file= URL approach does not work. Users must download the .cc.json file and drag-and-drop it into the Web Studio manually.
  • Mitigation: Option D (self-hosted visualization) can be adopted later to restore shareable URLs with a relaxed CSP
  • The codecharta/codecharta-analysis Docker image is ~1.2 GB, pulled on every main build
  • Mitigation: Docker layer caching on MS-hosted agents; image is not deployed to staging

Alternatives Considered

| Alternative | Reason for Rejection |
|---|---|
| Option A: Local developer workstation only | Not shareable; requires local tooling setup; maps not versioned |
| Option B: CI-generated artifacts without serving | Team cannot visualize without downloading and opening manually |
| Option D: Self-hosted visualization container | Adds a new container to staging; increases memory pressure and maintenance overhead |
| Custom visualization dashboard | Significant development effort; CodeCharta Web Studio is mature and feature-rich |
References

  • CodeCharta City Visualization Research
  • ADR-065: SonarCloud for Continuous Code Quality Analysis
  • ADR-006: Azure DevOps for CI/CD with Digital Ocean Hosting
  • ADR-038: Zensical for Publishing Project Documentation
  • ADR-057: Self-Hosted Build Agent with Hybrid Pipeline Strategy

ADR-067: Grype CVE Scanning with EPSS-Informed Risk Acceptance

Status: Accepted
Date: 2026-03-17
Context: The pipeline generates CycloneDX SBOMs for every container image (ADR-026), but SBOMs alone are passive inventories — they list components without evaluating them for known vulnerabilities. A quality gate was needed to prevent deploying images with exploitable CVEs.

Decision

Integrate Grype (by Anchore) into the CI/CD pipeline to scan every SBOM for known CVEs, configured to fail on High severity vulnerabilities that have available fixes (--fail-on high --only-fixed). Use a .grype.yaml exclusion file for vulnerabilities that cannot be patched at the project level. Exclude the Slicer container from scanning entirely due to its unpatchable base image.

Key Concepts

CVE (Common Vulnerabilities and Exposures): A standardized identifier for a publicly known security vulnerability. Each CVE has a severity rating (Critical, High, Medium, Low) based on the CVSS scoring system.

EPSS (Exploit Prediction Scoring System): A data-driven model maintained by FIRST.org that estimates the probability a CVE will be exploited in the wild within 30 days, expressed as a percentage (0–100%) and a percentile rank. Unlike CVSS severity (which measures potential impact), EPSS measures likelihood of actual exploitation. For example:

  • CVE-2024-9680 (Firefox): EPSS 30.8% (96th percentile) — actively exploited, high urgency
  • GHSA-p436-gjf2-799p (docker/cli): EPSS < 0.1% (1st percentile) — theoretically vulnerable but extremely unlikely to be exploited

EPSS is used in this project to inform risk acceptance decisions: Go module CVEs with near-zero EPSS scores from Alpine's docker-cli package are excluded from the scan rather than blocking deployments.
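The combined gate can be summarized as a small predicate. This is an illustrative sketch, not part of Grype: Grype implements only the `--fail-on high --only-fixed` part, while the EPSS step is the manual review that decides whether a finding becomes a documented `.grype.yaml` exclusion. Field names and the percentile threshold are assumptions:

```typescript
// Illustrative model of one scan finding; epssPercentile comes from the
// FIRST EPSS feed (0-100), not from Grype's own output.
interface Finding {
  id: string;
  severity: 'Critical' | 'High' | 'Medium' | 'Low';
  fixAvailable: boolean;
  epssPercentile: number;
}

function gateDecision(
  f: Finding,
  epssFloor = 1, // percentile at or below which exploitation is treated as negligible
): 'block' | 'accept-with-rationale' | 'informational' {
  const severe = f.severity === 'Critical' || f.severity === 'High';
  // --fail-on high --only-fixed: only fixable High/Critical findings can fail the gate
  if (!severe || !f.fixAvailable) return 'informational';
  // EPSS review: near-zero exploitation probability -> documented exclusion candidate
  return f.epssPercentile <= epssFloor ? 'accept-with-rationale' : 'block';
}
```

Under this model, CVE-2024-9680 (96th percentile) blocks, while GHSA-p436-gjf2-799p (1st percentile) is accepted with a documented rationale.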

SBOM (Software Bill of Materials): A complete inventory of components in a container image, generated by Syft in CycloneDX format (see ADR-026). Grype scans the SBOM rather than the image directly, which is faster and produces deterministic results.

Problem

  • Container images contained transitive npm dependencies with High severity CVEs (cross-spawn, minimatch, tar, glob, serialize-javascript)
  • The node:20-alpine base image bundles npm at runtime, which includes its own vulnerable dependencies (tar@6.2.1, glob@10.4.2) even though npm is not needed for running the Node.js application
  • Alpine system packages (docker-cli, zlib) contained Go module and C library CVEs
  • The Slicer container (BambuStudio v1 base image) had 800+ CVEs from its Debian 12 desktop environment
  • Without automated scanning, these vulnerabilities would accumulate silently

Solution

Pipeline integration:

Each container packaging job includes a Grype scan step after SBOM generation:

- script: |
    grype sbom:<service>-sbom.cdx.json --output table --fail-on high --only-fixed
  displayName: 'Scan SBOM for CVEs (<Service>)'
  condition: eq('${{ parameters.enableSigning }}', 'true')

The --only-fixed flag is critical: it only reports CVEs that have a fix available, preventing false failures from vulnerabilities that no one can remediate yet.

Remediation strategy (three layers):

| Layer | Source | Fix |
|---|---|---|
| npm transitive dependencies | cross-spawn, minimatch, file-type, lodash, ajv, bn.js, serialize-javascript, qs | pnpm overrides in package.json forcing patched versions |
| Docker base image bundled npm | tar@6.2.1, glob@10.4.2, cross-spawn@7.0.3 from node:20-alpine's npm | Strip npm from production images (rm -rf /usr/local/lib/node_modules/npm) |
| Alpine system packages | zlib, docker-cli Go binaries | apk upgrade --no-cache in Dockerfile production stage |
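The pnpm-overrides layer lives in the root package.json. A sketch — the version ranges below are placeholders, not the project's actual pins; the real values are whatever patched versions Grype reports as fixes:

```json
{
  "pnpm": {
    "overrides": {
      "cross-spawn": ">=7.0.6",
      "minimatch": ">=9.0.0",
      "serialize-javascript": ">=6.0.2"
    }
  }
}
```

An override applies to every occurrence of the package in the dependency tree, which is what makes it effective against transitive dependencies that cannot be updated directly.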

Risk acceptance (.grype.yaml):

Go module CVEs compiled into Alpine's docker-cli and containerd packages cannot be patched without Alpine shipping updated packages. These are excluded from the scan with documented rationale:

ignore:
  - vulnerability: GHSA-p436-gjf2-799p  # docker/cli v28→29 (High, EPSS <0.1%)
    package:
      type: go-module

All excluded CVEs have EPSS scores at the 0th–1st percentile (near-zero exploitation probability).

Slicer exclusion:

The Slicer container uses linuxserver/bambustudio:01.08.03 (Debian 12) with 38,731 SBOM components and 800+ CVEs from system packages (Firefox ESR, glibc, ffmpeg, GStreamer, Qt5), Go binaries (buildkit, runc, containerd), and Python packages (cryptography). These cannot be fixed without an upstream base image update. The grype scan is commented out with rationale, and a TODO.md entry tracks the BambuStudio v2 upgrade.

Key Design Decisions

| Decision | Rationale |
|---|---|
| --fail-on high (not critical) | Critical-only would miss many actionable High CVEs; High threshold catches the most important vulnerabilities while keeping Medium/Low informational |
| --only-fixed | Prevents pipeline failures from CVEs with no available fix — avoids blocking deployments on problems no one can solve |
| Strip npm from production images | The runtime only needs node, not npm/npx; removing npm eliminates an entire class of bundled dependency CVEs |
| pnpm overrides (not dependency updates) | Transitive dependencies can't be updated directly; overrides force specific versions across the entire dependency tree |
| .grype.yaml exclusions scoped to type: go-module | Exclusions are narrow — they only apply to Go binaries, not npm or Alpine packages |
| Slicer excluded entirely (not just Go modules) | The base image has CVEs across all layers (deb, Go, Python, npm); partial exclusions would still fail the scan |
| EPSS for risk acceptance | CVSS severity alone doesn't indicate exploitation likelihood; EPSS provides data-driven prioritization |

Consequences

Positive:

  • Every container image is scanned for CVEs before deployment — vulnerabilities cannot reach staging silently
  • The --only-fixed flag eliminates false positives from unfixable CVEs
  • Risk acceptance is documented and auditable (.grype.yaml with comments)
  • EPSS-informed decisions prevent security theater (blocking on theoretical vulnerabilities with zero exploitation probability)
  • npm stripping reduces production image attack surface beyond just CVE remediation

Negative:

  • Go module CVEs in Alpine packages require manual exclusion maintenance
  • Mitigation: .grype.yaml includes review notes; exclusions should be removed when Alpine ships updates
  • The Slicer has no CVE scanning coverage
  • Mitigation: TODO.md tracks BambuStudio v2 upgrade to reinstate scanning
  • Grype must be installed on each pipeline run (~3 seconds)
  • Mitigation: Installed to $HOME/.local/bin which persists across steps within a job
References

  • ADR-025: Cosign Image Signing for Supply Chain Security
  • ADR-026: CycloneDX SBOM Attestations
  • ADR-055: BambuStudio CLI Slicer Container
  • ADR-057: Self-Hosted Build Agent with Hybrid Pipeline Strategy
  • FIRST EPSS Model
  • Grype Documentation

ADR-068: Dependency License Compliance Check

Status: Accepted
Date: 2026-03-19
Context: The project uses ~200 npm dependencies (direct + transitive). Without automated checking, a non-permissive license (GPL, AGPL, SSPL, Commons Clause) could enter the dependency tree unnoticed through a transitive update, creating legal risk for a proprietary/commercial product. The pipeline already has CVE scanning (Grype, ADR-067) and code quality gates (SonarCloud, ADR-065), but no license compliance gate.

Decision

Add a lightweight dependency license check to the CI pipeline that fails the build if any package in the dependency tree has a non-permissive license. Use license-checker-rseidelsohn — an actively maintained fork of the original license-checker — with a small custom script (scripts/check-licenses.js).

Problem

  • Transitive dependency updates (via pnpm update or lockfile refresh) can silently introduce packages with strong copyleft licenses (GPL-2.0, GPL-3.0, AGPL-3.0) or restrictive terms (SSPL, Commons Clause)
  • The project already encountered a licensing concern with Gridfinity GRIPS (non-permissive), leading to the creation of GridFlock under MIT — demonstrating that license awareness is an active concern
  • Manual auditing of license changes in pnpm-lock.yaml is impractical at scale

Solution

Script (scripts/check-licenses.js):

A ~50-line Node.js script that:

  1. Uses license-checker-rseidelsohn to scan the full dependency tree
  2. Matches each package's license string against a disallowed pattern: GPL, AGPL, SSPL, Commons Clause (case-insensitive)
  3. Excludes private packages (the project's own UNLICENSED root)
  4. Exits with code 1 and lists offending packages if any match
  5. Exits with code 0 on success
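The matching logic (step 2) can be sketched as a pure function — shown in TypeScript for illustration; the actual check-licenses.js is plain Node.js and obtains its package-to-license map from license-checker-rseidelsohn's scan. The map shape assumed here ("name@version" → SPDX expression) is illustrative:

```typescript
// Deny-list pattern from the ADR: GPL, AGPL, SSPL, Commons Clause,
// matched case-insensitively against each package's license string.
const DISALLOWED = /GPL|AGPL|SSPL|Commons Clause/i;

// Returns human-readable "name@version: license" entries for every match;
// an empty result means the check passes (exit 0).
function findViolations(packages: Record<string, string>): string[] {
  return Object.entries(packages)
    .filter(([, license]) => DISALLOWED.test(license))
    .map(([name, license]) => `${name}: ${license}`);
}
```

Because the pattern is deliberately broad, compound expressions such as `(MIT OR GPL-2.0)` are flagged too — matching the dual-license behavior described under Consequences below.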

Pipeline integration:

The check runs as the first step in the Lint job (both azure-pipelines.yml and ci.yml), after dependency installation but before linting:

- script: pnpm run license-check
  displayName: 'Check dependency licenses (fail on non-permissive)'

Why license-checker-rseidelsohn:

| Option | Status | Rationale |
|---|---|---|
| license-checker (davglass) | Abandoned (last release Jan 2019, 75 open issues) | Not suitable for a maintained project |
| license-checker-rseidelsohn | Actively maintained fork (~200k weekly downloads) | Compatible API, receives updates and bugfixes |
| Grant (Anchore) | Active | Heavier; designed for SBOM/container scanning rather than npm dependency trees |
| pnpm licenses list | Built-in | No built-in fail-on-disallowed; requires more scripting to parse output |

Key Design Decisions

| Decision | Rationale |
|---|---|
| Deny-list (not allow-list) | New permissive licenses (e.g. BlueOak-1.0.0) shouldn't require allowlist updates; only known problematic licenses are blocked |
| Case-insensitive regex | License strings in package.json vary in casing (GPL-3.0, gpl-3.0-only, etc.) |
| Run in Lint job | Lint is the fastest-feedback job and already runs on every push; license violations are caught before tests or builds run |
| excludePrivatePackages: true | The project root has "license": "UNLICENSED" which is valid for a private project but would false-positive against a strict allow-list |
| Custom script (not CLI flags) | --failOn only matches exact license names; a regex handles compound expressions like (GPL-2.0 OR MIT) correctly |

Consequences

Positive:

  • Non-permissive licenses cannot enter the dependency tree without failing the pipeline
  • Developers get fast feedback (license check runs in ~1 second)
  • No external service dependency — runs offline against node_modules
  • Complements Grype CVE scanning: vulnerabilities are caught by Grype, license violations by license-checker

Negative:

  • Dual-licensed packages where one option is permissive (e.g. MIT OR GPL-3.0) will be flagged
  • Mitigation: The deny-list regex is intentionally broad; if a valid dual-licensed package is flagged, add a documented exception to the script
  • Does not cover non-npm dependencies (e.g. Docker base image licenses, system packages)
  • Mitigation: Container-level license scanning could be added via Grant if needed in the future
References

  • ADR-067: Grype CVE Scanning with EPSS-Informed Risk Acceptance
  • ADR-026: CycloneDX SBOM Attestations
  • ADR-065: SonarCloud for Continuous Code Quality Analysis
  • license-checker-rseidelsohn

ADR-069: Agent CLAUDE.md Governance — Repo as Source of Truth

Status: Accepted
Date: 2026-03-22
Context: The Nanoclaw agentic team (Ryan, Sam, Cody) each have a CLAUDE.md file that defines their identity, responsibilities, protocols, and behavioral rules. These files are mounted read-write into agent containers, meaning agents can technically modify their own instructions. During initial deployment, agents occasionally self-modified their CLAUDE.md or had their files overwritten during sync operations, leading to drift between what the repo contained and what was running on the droplet.

Decision

Adopt a strict governance model for agent CLAUDE.md files:

  1. The repo (agentic-team/agents/) is the single source of truth. All canonical versions of agent CLAUDE.md files live here.
  2. Individual agents must not self-modify. Each agent's CLAUDE.md contains a governance rule: "You MUST NOT modify your own CLAUDE.md or any other agent's CLAUDE.md."
  3. The Team agent (main channel) is the only agent authorized to edit CLAUDE.md files. It has write access to all group folders via additionalMounts. When Jan requests a behavioral change in the main chat, Team makes the edit.
  4. Jan and the AI assistant (Cursor) are reviewers. Changes made by Team on the droplet should be periodically synced back to the repo. Changes made in the repo should be pushed to the droplet via scp or the deploy script.

Flow

Jan (WhatsApp main chat or Cursor) → Team agent edits CLAUDE.md on droplet
                                    ↓
                          Periodic sync: droplet → repo (manual)
                                    ↓
                          Repo is the canonical record

Consequences

Positive:

  • No silent behavioral drift — agents cannot quietly rewrite their own rules
  • All changes are auditable through the repo's git history
  • Team agent provides a conversational interface for behavioral changes without needing SSH or Cursor
  • Clear chain of authority: Jan → Team → individual agents

Negative:

  • Requires discipline to sync droplet changes back to the repo — if forgotten, the repo becomes stale
  • Mitigation: Before pushing repo files to the droplet, always compare checksums first (md5sum on droplet vs md5 locally) to avoid overwriting agent-side changes
  • Agents cannot adapt their own instructions based on learned patterns — all adaptations require human approval
  • Mitigation: Agents can suggest changes by asking Jan in their group chat; Jan routes through Team

ADR-070: Per-Agent Claude Model Selection

Status Accepted
Date 2026-03-22
Context The Nanoclaw agentic team has three agents with different cognitive demands. Ryan (DevOps) and Sam (Infra) primarily run SSH health checks, query APIs, and route information — tasks that don't require deep code reasoning. Cody (Dev) diagnoses code failures, writes fixes, and opens PRs — tasks that benefit from the strongest available model. All agents were initially running on Claude Sonnet 4.6 (the default). The Anthropic API pricing difference is significant: Sonnet costs \(3/\)15 per MTok (input/output) while Opus costs \(15/\)75 — a 5x multiplier.

Decision

Configure per-agent model selection via Nanoclaw's containerConfig.model field in the registered_groups database:

  • Ryan (DevOps): Claude Sonnet 4.6 (default) — SSH checks, API queries, routing
  • Sam (Infra): Claude Sonnet 4.6 (default) — health monitoring, diagnostics
  • Cody (Dev): Claude Opus 4.6 — code reasoning, fix generation, PR creation
  • Team (main): Claude Sonnet 4.6 (default) — admin tasks, group management

The model is passed as a CLAUDE_MODEL environment variable to the agent container, which the agent-runner forwards to the Claude Agent SDK's query() call. Agents without a model override use the SDK's default (currently Sonnet).

Implementation

  1. Added model?: string to Nanoclaw's ContainerConfig type
  2. Container-runner reads group.containerConfig.model and injects -e CLAUDE_MODEL=... into Docker args
  3. Agent-runner passes process.env.CLAUDE_MODEL to the SDK's query({ options: { model } })
  4. Cody's database entry: container_config = '{"model":"claude-opus-4-6"}'
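The docker-arg injection (step 2) reduces to a small helper. This is a sketch, not Nanoclaw's actual code — only the `containerConfig.model` field name and the `CLAUDE_MODEL` variable come from the ADR; the function and type names are illustrative:

```typescript
// Illustrative model of the per-group container config (ADR-070).
interface ContainerConfig {
  model?: string; // e.g. 'claude-opus-4-6'; absent -> SDK default (currently Sonnet)
}

// Build the extra `docker run` env args for a group's agent container.
// Groups without an override get no env var, so the agent-runner falls
// back to the Claude Agent SDK's default model.
function buildEnvArgs(config: ContainerConfig): string[] {
  return config.model ? ['-e', `CLAUDE_MODEL=${config.model}`] : [];
}
```

On the container side, the agent-runner then forwards process.env.CLAUDE_MODEL into the SDK's query({ options: { model } }) call, as described in step 3.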

Limitation

Agents cannot reliably self-report which model they are using when asked conversationally. The Claude Agent SDK does not expose the active model name to the agent's own context. Verification must be done through the Anthropic console usage logs or by inspecting the container's environment variables (docker inspect).

Consequences

Positive:

  • Cody produces higher-quality fixes with fewer retry loops, potentially offsetting the higher per-token cost
  • Ryan and Sam stay cost-efficient on Sonnet for tasks that don't need deep reasoning
  • Model selection is per-agent, not global — can be tuned independently
  • Easy to change: single database update + restart, no code changes

Negative:

  • Opus invocations are 5x more expensive — a typical Cody fix cycle costs $2.50-10.00 vs $0.50-2.00 on Sonnet
  • Mitigation: Prepaid Anthropic credit with no auto-reload acts as a hard spending cap; CONTAINER_TIMEOUT limits per-invocation token burn
  • Modifying Nanoclaw's upstream source (container-runner, agent-runner, types) means patches must be re-applied after upgrades
  • Mitigation: Document the patches in agentic-team/README.md troubleshooting section

References