Architecture Decision Records (ADR)¶
Project: Forma3D.Connect
Version: 4.5
Last Updated: March 22, 2026
This document captures the significant architectural decisions made during the development of Forma3D.Connect.
Table of Contents¶
- ADR-001: Monorepo with Nx
- ADR-002: NestJS for Backend Framework
- ADR-003: React 19 with Vite for Frontend
- ADR-004: PostgreSQL with Prisma ORM
- ADR-005: TypeScript Strict Mode
- ADR-006: Azure DevOps for CI/CD
- ADR-007: Layered Architecture with Repository Pattern
- ADR-008: Event-Driven Internal Communication
- ADR-009: OpenAPI/Swagger for API Documentation
- ADR-010: HMAC Verification for Webhooks
- ADR-011: Idempotent Webhook Processing
- ADR-012: Assembly Parts Model for Product Mapping
- ADR-013: Shared Domain Library
- ADR-014: SimplyPrint as Unified Print Farm Controller
- ADR-015: Aikido Security Platform (Superseded)
- ADR-016: Sentry Observability with OpenTelemetry
- ADR-017: Docker + Traefik Deployment Strategy
- ADR-018: Nx Affected Conditional Deployment Strategy
- ADR-019: SimplyPrint Webhook Verification
- ADR-020: Hybrid Status Monitoring (Polling + Webhooks)
- ADR-021: Retry Queue with Exponential Backoff
- ADR-022: Event-Driven Fulfillment Architecture
- ADR-023: Email Notification Strategy
- ADR-024: API Key Authentication for Admin Endpoints
- ADR-025: Cosign Image Signing for Supply Chain Security
- ADR-026: CycloneDX SBOM Attestations
- ADR-027: TanStack Query for Server State Management
- ADR-028: Socket.IO for Real-Time Dashboard Updates
- ADR-029: API Key Authentication for Dashboard
- ADR-030: Sendcloud for Shipping Integration
- ADR-031: Automated Container Registry Cleanup
- ADR-032: Domain Boundary Separation with Interface Contracts
- ADR-033: Database-Backed Webhook Idempotency
- ADR-034: Docker Log Rotation & Resource Cleanup
- ADR-035: Progressive Web App (PWA) for Cross-Platform Access
- ADR-036: localStorage Fallback for PWA Install Detection
- ADR-037: Keep a Changelog for Release Documentation
- ADR-038: Zensical for Publishing Project Documentation
- ADR-039: Global API Key Authentication (Fail-Closed)
- ADR-040: Shopify Order Backfill for Downtime Recovery
- ADR-041: SimplyPrint Webhook Idempotency and Job Reconciliation
- ADR-042: SendCloud Webhook Integration for Shipment Status Updates
- ADR-043: PWA Version Mismatch Detection
- ADR-044: Role-Based Access Control and Tenant-Ready Architecture
- ADR-045: pgAdmin for Staging Database Administration
- ADR-046: PostgreSQL Session Store for Persistent Authentication
- ADR-047: Two-Tier Logging Strategy (Application + Business Events)
- ADR-048: Shopify OAuth 2.0 Authentication
- ADR-049: Optional SKU with Shopify Product/Variant ID Matching Priority
- ADR-050: Apache ECharts for Dashboard Analytics
- ADR-051: Decompose Monolithic API into Domain-Aligned Microservices
- ADR-052: BullMQ Event Queues for Inter-Service Async Communication
- ADR-053: Buffer-Based GridFlock Pipeline (No Local File Storage)
- ADR-054: SimplyPrint API Files for Gcode Upload
- ADR-055: BambuStudio CLI Slicer Container
- ADR-056: Redis for Sessions, Event Queues, and Socket.IO Adapter
- ADR-057: Self-Hosted Build Agent with Hybrid Pipeline Strategy
- ADR-058: Self-Hosted Log Infrastructure (ClickHouse + Grafana via OpenTelemetry)
- ADR-059: Nx Affected Resilience via Last-Successful-Deploy Tag
- ADR-060: Single Source of Truth for STL Preview Generation
- ADR-061: Plate-Level Preview Cache with Dynamic Border Assembly
- ADR-062: Inventory Tracking and Stock Replenishment
- ADR-063: ORDER-over-STOCK Print Queue Priority
- ADR-064: Stock Replenishment Event Subscriber for SimplyPrint Queue
- ADR-065: SonarCloud for Continuous Code Quality Analysis
- ADR-066: CodeCharta City Visualization for Codebase Insight
- ADR-067: Grype CVE Scanning with EPSS-Informed Risk Acceptance
- ADR-068: Dependency License Compliance Check
- ADR-069: Agent CLAUDE.md Governance — Repo as Source of Truth
- ADR-070: Per-Agent Claude Model Selection
ADR-001: Monorepo with Nx¶
| Attribute | Value |
|---|---|
| ID | ADR-001 |
| Status | Accepted |
| Date | 2026-01-09 |
| Context | Need to manage multiple applications (API, Web, Desktop, Mobile) and shared libraries in a single repository |
Decision¶
Use Nx (v19.x) as the monorepo management tool with pnpm as the package manager.
Rationale¶
- Unified tooling: Single command to build, test, lint all projects
- Dependency graph: Nx understands project dependencies and can run only affected tests
- Caching: Local and remote caching speeds up CI/CD pipelines
- Code sharing: Shared libraries (`@forma3d/domain`, `@forma3d/utils`, etc.) are first-class citizens
- Plugin ecosystem: Built-in support for NestJS, React, and other frameworks
Consequences¶
- ✅ Fast CI through affected commands and caching
- ✅ Consistent tooling across all projects
- ✅ Easy code sharing via path aliases
- ⚠️ Learning curve for developers unfamiliar with Nx
- ⚠️ Initial setup complexity
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Turborepo | Less mature NestJS support |
| Lerna | Deprecated in favor of Nx |
| Separate repositories | Too much overhead for shared code |
ADR-002: NestJS for Backend Framework¶
| Attribute | Value |
|---|---|
| ID | ADR-002 |
| Status | Accepted |
| Date | 2026-01-09 |
| Context | Need a robust, scalable backend framework for the integration API |
Decision¶
Use NestJS (v10.x) as the backend framework.
Rationale¶
- Enterprise-grade: Built-in support for dependency injection, modules, guards, interceptors
- TypeScript-first: Native TypeScript support with decorators
- Modular architecture: Easy to organize code by feature
- Excellent documentation: Well-documented with active community
- Testing support: Built-in testing utilities with Jest
- OpenAPI support: First-class Swagger/OpenAPI integration via `@nestjs/swagger`
Consequences¶
- ✅ Clean, maintainable code structure
- ✅ Easy to add new features as modules
- ✅ Built-in validation with class-validator
- ✅ Excellent integration with Prisma
- ⚠️ Verbose compared to Express.js
- ⚠️ Decorator-heavy syntax
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Express.js | Too low-level, lacks structure |
| Fastify | Less ecosystem support |
| Hono | Too new, less enterprise features |
ADR-003: React 19 with Vite for Frontend¶
| Attribute | Value |
|---|---|
| ID | ADR-003 |
| Status | Accepted |
| Date | 2026-01-09 |
| Context | Need a modern frontend framework for the admin dashboard |
Decision¶
Use React 19 with Vite as the bundler and Tailwind CSS for styling.
Rationale¶
- React 19: Latest version with improved performance and new features
- Vite: Extremely fast development server and build times
- Tailwind CSS: Utility-first CSS for rapid UI development
- TanStack Query: Excellent server state management
- React Router: Standard routing solution
Consequences¶
- ✅ Fast development experience with HMR
- ✅ Modern React features (Server Components ready)
- ✅ Consistent styling with Tailwind
- ⚠️ Tailwind learning curve for traditional CSS developers
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Next.js | Overkill for admin dashboard, SSR not needed |
| Angular | Less flexibility, steeper learning curve |
| Vue.js | Team expertise in React |
ADR-004: PostgreSQL with Prisma ORM¶
| Attribute | Value |
|---|---|
| ID | ADR-004 |
| Status | Accepted |
| Date | 2026-01-09 |
| Context | Need a reliable database with type-safe access |
Decision¶
Use PostgreSQL 16 as the database with Prisma 5 as the ORM.
Rationale¶
- PostgreSQL: Robust, ACID-compliant, excellent JSON support
- Prisma: Type-safe database access, auto-generated client
- Schema-first: Prisma schema as single source of truth
- Migrations: Built-in migration system
- Studio: Visual database browser for development
Consequences¶
- ✅ Full type safety from database to API
- ✅ Easy schema changes with migrations
- ✅ No raw SQL in application code
- ⚠️ Prisma Client must be regenerated after schema changes
- ⚠️ Some complex queries require raw SQL
Schema Design Decisions¶
- UUIDs for primary keys (portability, no sequence conflicts)
- JSON columns for flexible data (shipping address, print profiles)
- Decimal type for monetary values (precision)
- Timestamps with timezone (audit trail)
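The four schema conventions above can be sketched in a single Prisma model. This is an illustrative fragment, not the actual schema; the model and field names are assumptions.

```prisma
// Illustrative model showing the four conventions; not the real Forma3D.Connect schema.
model Order {
  id              String   @id @default(uuid())            // UUID primary key, no sequence conflicts
  shippingAddress Json                                     // flexible JSON column
  totalPrice      Decimal  @db.Decimal(10, 2)              // exact monetary values
  createdAt       DateTime @default(now()) @db.Timestamptz // timezone-aware audit trail
}
```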
ADR-005: TypeScript Strict Mode¶
| Attribute | Value |
|---|---|
| ID | ADR-005 |
| Status | Accepted |
| Date | 2026-01-09 |
| Context | Need to ensure code quality and catch errors early |
Decision¶
Enable TypeScript strict mode with additional strict checks:
```json
{
  "strict": true,
  "noImplicitAny": true,
  "strictNullChecks": true,
  "noUnusedLocals": true,
  "noUnusedParameters": true
}
```
Rationale¶
- Early error detection: Catch type errors at compile time
- Self-documenting code: Types serve as documentation
- Refactoring safety: IDE can safely refactor with full type information
- No `any` type: Prevents type escape hatches
Consequences¶
- ✅ Higher code quality
- ✅ Better IDE support and autocomplete
- ✅ Safer refactoring
- ⚠️ More verbose code with explicit types
- ⚠️ Stricter null checking requires careful handling
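As a minimal sketch of what `strictNullChecks` enforces (the types and data here are illustrative, not from the codebase): the compiler rejects dereferencing a nullable field until the null case is handled.

```typescript
// Hypothetical example: strictNullChecks makes nullability explicit in the type.
interface Order {
  id: string;
  trackingNumber: string | null;
}

// Calling `order.trackingNumber.toUpperCase()` directly would be a compile error;
// strict mode forces the null branch to be handled first.
function formatTracking(order: Order): string {
  if (order.trackingNumber === null) {
    return 'not shipped yet';
  }
  return order.trackingNumber.toUpperCase();
}
```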
ADR-006: Azure DevOps for CI/CD with Digital Ocean Hosting¶
| Attribute | Value |
|---|---|
| ID | ADR-006 |
| Status | Accepted |
| Date | 2026-01-09 |
| Context | Need a CI/CD pipeline for automated testing and deployment |
Decision¶
Use Azure DevOps Pipelines for CI/CD and Digital Ocean for hosting.
Rationale¶
- Azure DevOps: Existing team expertise with YAML pipelines
- Digital Ocean: Cost-effective, simple infrastructure for small-scale deployment
- Separation of concerns: CI/CD tooling separate from hosting
- Docker-based: Consistent container deployment across environments
- Managed Database: Digital Ocean managed PostgreSQL for reliability
Infrastructure¶
| Component | Service | Purpose |
|---|---|---|
| CI/CD | Azure DevOps Pipelines | Build, test, deploy automation |
| Container Registry | Digital Ocean Registry | Docker image storage |
| Staging | Digital Ocean Droplet | Staging environment |
| Production | Digital Ocean Droplet | Production environment |
| Database | Digital Ocean Managed PostgreSQL | Data persistence |
Pipeline Stages¶
- Validate & Test: Lint, type check, and unit tests (parallel across 3 agents)
- Build & Package: Detect affected, build projects, Docker images on self-hosted agents
- Deploy Staging: Auto-deploy affected services on main branch
- Acceptance Test: Playwright tests against staging
- Deploy Production: Manual approval gate
Updated Feb 2026: Stages merged and agents optimized per ADR-057.
Consequences¶
- ✅ Automated quality gates
- ✅ Fast feedback on PRs
- ✅ Consistent deployments
- ✅ Cost-effective hosting
- ⚠️ Need to manage Docker deployments on Droplets
ADR-007: Layered Architecture with Repository Pattern¶
| Attribute | Value |
|---|---|
| ID | ADR-007 |
| Status | Accepted |
| Date | 2026-01-09 |
| Context | Need a clean separation of concerns in the backend |
Decision¶
Implement a layered architecture with the Repository Pattern:
Layer Responsibilities¶
| Layer | Responsibility | Example |
|---|---|---|
| Controller | HTTP handling, validation, routing | OrdersController |
| Service | Business logic, orchestration | OrdersService |
| Repository | Data access, Prisma queries | OrdersRepository |
| DTO | Data transfer, validation | CreateOrderDto |
Rationale¶
- Testability: Each layer can be tested in isolation
- Single responsibility: Clear separation of concerns
- Flexibility: Easy to swap implementations (e.g., different databases)
- Maintainability: Changes in one layer don't affect others
Consequences¶
- ✅ Clean, maintainable code
- ✅ Easy to unit test with mocks
- ✅ Prisma isolated to repository layer
- ⚠️ More files per feature
- ⚠️ Some boilerplate code
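The three layers can be sketched as follows, with an in-memory repository standing in for Prisma. All names and the `PENDING` status are illustrative, not the actual Forma3D.Connect code.

```typescript
// Minimal layered-architecture sketch: Controller → Service → Repository.
interface Order { id: string; status: string; }

class OrdersRepository {
  // In-memory stand-in for Prisma; only this layer touches "the database".
  private rows = new Map<string, Order>();
  save(order: Order): Order { this.rows.set(order.id, order); return order; }
  findById(id: string): Order | undefined { return this.rows.get(id); }
}

class OrdersService {
  constructor(private readonly repo: OrdersRepository) {}
  create(id: string): Order {
    // Business logic lives here; data access is delegated to the repository.
    return this.repo.save({ id, status: 'PENDING' });
  }
}

class OrdersController {
  constructor(private readonly service: OrdersService) {}
  // The real controller would also handle HTTP routing and DTO validation.
  post(id: string): Order { return this.service.create(id); }
}

const controller = new OrdersController(new OrdersService(new OrdersRepository()));
```

Because each layer depends only on the one below it, a unit test can replace `OrdersRepository` with a mock without touching the controller.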
ADR-008: Event-Driven Internal Communication¶
| Attribute | Value |
|---|---|
| ID | ADR-008 |
| Status | ✅ Implemented |
| Date | 2026-01-09 (Updated: 2026-01-25) |
| Context | Need to decouple components and trigger actions on state changes |
Decision¶
Use NestJS EventEmitter for internal event-driven communication.
Events Defined¶
Order Events (ORDER_EVENTS)¶
| Event | Trigger | Listeners |
|---|---|---|
| order.created | New order from Shopify webhook | OrchestrationService, EventsGateway, PushService |
| order.status_changed | Order status update | EventsGateway, PushService |
| order.cancelled | Order cancellation | CancellationService, EventsGateway |
| order.ready-for-fulfillment | All print jobs completed | SendcloudService, FulfillmentService, EventsGateway |
| order.fulfilled | Order shipped | EventsGateway |
| order.failed | Order processing failed | EventsGateway |
Print Job Events (PRINT_JOB_EVENTS)¶
| Event | Trigger | Listeners |
|---|---|---|
| printjob.created | Print job created in SimplyPrint | EventsGateway |
| printjob.status-changed | Print job status update | OrchestrationService, EventsGateway, PushService |
| printjob.completed | Print job finished successfully | OrchestrationService, EventsGateway |
| printjob.failed | Print job failed | OrchestrationService, EventsGateway, NotificationsService |
| printjob.cancelled | Print job cancelled | EventsGateway |
| printjob.retry-requested | Print job retry initiated | (EventLogService) |
Orchestration Events (ORCHESTRATION_EVENTS)¶
| Event | Trigger | Listeners |
|---|---|---|
| order.ready-for-fulfillment | All print jobs for order complete | SendcloudService, FulfillmentService |
| order.partially-completed | Some jobs complete, some pending | (Logging) |
| order.all-jobs-failed | All print jobs for order failed | (Logging) |
SimplyPrint Events (SIMPLYPRINT_EVENTS)¶
| Event | Trigger | Listeners |
|---|---|---|
| simplyprint.job-status-changed | SimplyPrint webhook/poll update | PrintJobsService |
Shipment Events (SHIPMENT_EVENTS)¶
| Event | Trigger | Listeners |
|---|---|---|
| shipment.created | Shipment created | FulfillmentService, PushService |
| shipment.label-ready | Shipping label downloaded | PushService |
| shipment.failed | Shipment creation failed | (Logging) |
| shipment.updated | Shipment status update | (Logging) |
SendCloud Webhook Events (SENDCLOUD_WEBHOOK_EVENTS)¶
| Event | Trigger | Listeners |
|---|---|---|
| sendcloud.shipment.status_changed | SendCloud webhook/reconciliation | ShipmentsService |
Fulfillment Events (FULFILLMENT_EVENTS)¶
| Event | Trigger | Listeners |
|---|---|---|
| fulfillment.created | Shopify fulfillment created | (Logging) |
| fulfillment.failed | Shopify fulfillment failed | NotificationsService |
| fulfillment.retrying | Fulfillment retry in progress | (Logging) |
Event Flow Diagram¶
```
Shopify Webhook → OrdersService → order.created
                                       ↓
                              OrchestrationService
                                       ↓
                      PrintJobsService → printjob.created
                                       ↓
                               SimplyPrint API

SimplyPrint Webhook → SimplyPrintService → simplyprint.job-status-changed
                                       ↓
                      PrintJobsService → printjob.status-changed
                              ↓                    ↓
                     printjob.completed     printjob.failed
                              ↓                    ↓
                     OrchestrationService ←────────┘
                              ↓
                  order.ready-for-fulfillment
                              ↓
              ┌───────────────┴───────────────┐
              ↓                               ↓
      SendcloudService               FulfillmentService
              ↓                               ↓
      shipment.created ──────────────→ Shopify Fulfillment
              ↓
SendCloud Webhook → sendcloud.shipment.status_changed
```
Rationale¶
- Decoupling: Services don't directly depend on each other
- Extensibility: Easy to add new listeners
- Async processing: Events can be processed asynchronously
- Audit trail: Events naturally support logging
- Orchestration: Clean separation between job creation and completion tracking
- Real-time updates: EventsGateway broadcasts to dashboard via Socket.IO
- Push notifications: PushService sends alerts to subscribed PWA clients
Consequences¶
- ✅ Loose coupling between modules
- ✅ Easy to add new functionality
- ✅ Clear event flow
- ✅ Enables reactive order completion tracking
- ✅ Real-time dashboard updates via Socket.IO
- ✅ Push notifications for mobile/desktop PWA
- ⚠️ Harder to trace execution flow (mitigated by correlation IDs and logging)
- ⚠️ Eventual consistency considerations
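The emit/listen pattern above can be sketched with Node's built-in `EventEmitter` standing in for NestJS's `EventEmitter2`. The event name mirrors the tables; the listener body is illustrative.

```typescript
import { EventEmitter } from 'node:events';

// Stand-in for the NestJS event bus; in the real app, listeners are methods
// decorated with @OnEvent and emitters call EventEmitter2.emit.
const events = new EventEmitter();
const audit: string[] = [];

// A listener such as OrchestrationService subscribes to order.created.
events.on('order.created', (orderId: string) => {
  audit.push(`orchestrating ${orderId}`);
});

// OrdersService emits after persisting the Shopify order; the emitter
// has no direct dependency on whichever services react to the event.
events.emit('order.created', 'ord-42');
```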
ADR-009: OpenAPI/Swagger for API Documentation¶
| Attribute | Value |
|---|---|
| ID | ADR-009 |
| Status | Accepted |
| Date | 2026-01-10 |
| Context | Need interactive API documentation for developers |
Decision¶
Use @nestjs/swagger for OpenAPI 3.0 documentation with Swagger UI.
Implementation¶
- Swagger UI: Available at `/api/docs`
- OpenAPI JSON: Available at `/api/docs-json`
- Environment restriction: Only enabled in non-production
- Decorator-based: All endpoints documented via decorators
Decorators Used¶
| Decorator | Purpose |
|---|---|
| @ApiTags | Group endpoints by feature |
| @ApiOperation | Describe endpoint purpose |
| @ApiResponse | Document response schemas |
| @ApiProperty | Document DTO properties |
| @ApiParam | Document path parameters |
| @ApiQuery | Document query parameters |
Consequences¶
- ✅ Interactive API testing
- ✅ Auto-generated documentation
- ✅ Type-safe documentation
- ⚠️ Must keep decorators in sync with code
ADR-010: HMAC Verification for Webhooks¶
| Attribute | Value |
|---|---|
| ID | ADR-010 |
| Status | Accepted |
| Date | 2026-01-09 |
| Context | Need to verify webhook requests are genuinely from Shopify |
Decision¶
Implement HMAC-SHA256 signature verification for all Shopify webhooks.
Implementation¶
```typescript
// ShopifyWebhookGuard
const hash = crypto.createHmac('sha256', webhookSecret).update(rawBody, 'utf8').digest('base64');
const expected = Buffer.from(hash);
const received = Buffer.from(hmacHeader);
// Guard lengths first: timingSafeEqual throws if the buffers differ in length
return expected.length === received.length && crypto.timingSafeEqual(expected, received);
```
Rationale¶
- Security: Prevents forged webhook requests
- Shopify standard: Required by Shopify webhook specification
- Timing-safe comparison: Prevents timing attacks
- Raw body access: NestJS configured to preserve raw body
Consequences¶
- ✅ Secure webhook endpoint
- ✅ Compliant with Shopify requirements
- ⚠️ Requires raw body access (special NestJS configuration)
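The check can be exercised end to end as a self-contained function; the secret and body below are sample values, and the function name is illustrative rather than the guard's actual API.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Sketch of the guard's verification logic with sample inputs.
function verifyShopifyHmac(rawBody: string, secret: string, hmacHeader: string): boolean {
  const digest = createHmac('sha256', secret).update(rawBody, 'utf8').digest('base64');
  const a = Buffer.from(digest);
  const b = Buffer.from(hmacHeader);
  // timingSafeEqual throws on length mismatch, so reject unequal lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}

// A genuine request carries the digest Shopify computed with the shared secret:
const body = '{"id":12345}';
const header = createHmac('sha256', 'sample-secret').update(body, 'utf8').digest('base64');
```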
ADR-011: Idempotent Webhook Processing¶
| Attribute | Value |
|---|---|
| ID | ADR-011 |
| Status | Accepted |
| Date | 2026-01-09 |
| Context | Shopify may send duplicate webhooks; need to handle gracefully |
Decision¶
Implement idempotent webhook processing using:
- Webhook ID tracking (in-memory Set)
- Database unique constraints (shopifyOrderId)
Implementation¶
```typescript
// ShopifyService
private readonly processedWebhooks = new Set<string>();

if (this.processedWebhooks.has(webhookId)) {
  return; // Skip duplicate
}
this.processedWebhooks.add(webhookId);

// OrdersService
const existing = await this.ordersRepository.findByShopifyOrderId(id);
if (existing) {
  return existing; // Return existing, don't create duplicate
}
```
Consequences¶
- ✅ No duplicate orders created
- ✅ Safe to retry failed webhooks
- ⚠️ In-memory Set resets on restart (database constraint is primary guard)
ADR-012: Assembly Parts Model for Product Mapping¶
| Attribute | Value |
|---|---|
| ID | ADR-012 |
| Status | Accepted |
| Date | 2026-01-09 |
| Context | A single Shopify product may require multiple 3D printed parts |
Decision¶
Implement ProductMapping → AssemblyPart one-to-many relationship.
Data Model¶
Fields¶
- ProductMapping: shopifyProductId, SKU (optional), defaultPrintProfile
- AssemblyPart: partName, partNumber, simplyPrintFileId, quantityPerProduct
Rationale¶
- Flexibility: Support both single-part (1 part) and multi-part products
- Quantity support: `quantityPerProduct` for parts needed multiple times (e.g., 4 wheels)
- Print profiles: Override default profile per part if needed
Consequences¶
- ✅ Supports complex assemblies
- ✅ Clear part ordering via `partNumber`
- ⚠️ More complex order processing logic
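The quantity logic can be sketched as: total prints per part = ordered quantity × `quantityPerProduct`. The types and sample data below are illustrative, not the real schema.

```typescript
// Illustrative slice of the AssemblyPart model.
interface AssemblyPart {
  partName: string;
  quantityPerProduct: number;
}

// Expand a line item into the number of prints needed per part.
function partsToPrint(orderedQty: number, parts: AssemblyPart[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const part of parts) {
    totals.set(part.partName, orderedQty * part.quantityPerProduct);
  }
  return totals;
}

// Hypothetical multi-part product: a toy car with one body and four wheels.
const toyCarParts: AssemblyPart[] = [
  { partName: 'body', quantityPerProduct: 1 },
  { partName: 'wheel', quantityPerProduct: 4 },
];
```

Ordering two cars would therefore queue 2 bodies and 8 wheels.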
ADR-013: Shared Domain Library¶
| Attribute | Value |
|---|---|
| ID | ADR-013 |
| Status | Accepted |
| Date | 2026-01-09 |
| Context | Need to share types between frontend, backend, and external integrations |
Decision¶
Create a shared @forma3d/domain library containing:
- Entity types
- Enums
- Shopify types
- Common interfaces
Structure¶
```
libs/domain/src/
├── entities/
│   ├── order.ts
│   ├── line-item.ts
│   ├── print-job.ts
│   └── product-mapping.ts
├── enums/
│   ├── order-status.ts
│   ├── line-item-status.ts
│   └── print-job-status.ts
├── shopify/
│   ├── shopify-order.entity.ts
│   └── shopify-product.entity.ts
└── index.ts
```
Rationale¶
- Single source of truth: Types defined once, used everywhere
- Type safety: Frontend and backend share exact same types
- Nx integration: Clean imports via path aliases
Consequences¶
- ✅ Consistent types across codebase
- ✅ No type drift between frontend/backend
- ✅ Easy to update types in one place
- ⚠️ Must rebuild library on changes
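A sketch of what such a shared module exports, following the structure above; the exact enum members and entity fields are assumptions.

```typescript
// Illustrative contents of libs/domain — e.g. enums/order-status.ts and entities/order.ts.
export enum OrderStatus {
  PENDING = 'PENDING',
  PRINTING = 'PRINTING',
  FULFILLED = 'FULFILLED',
}

export interface OrderEntity {
  id: string;
  status: OrderStatus;
}

// Both apps import the same definitions via the Nx path alias, e.g.:
//   import { OrderStatus, OrderEntity } from '@forma3d/domain';
const order: OrderEntity = { id: 'ord-1', status: OrderStatus.PENDING };
```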
ADR-014: SimplyPrint as Unified Print Farm Controller¶
| Attribute | Value |
|---|---|
| ID | ADR-014 |
| Status | ✅ Implemented (Phase 2) |
| Date | 2026-01-10 (Updated: 2026-01-13) |
| Context | Need to control multiple 3D printer brands (Prusa, Bambu Lab) from one API |
Decision¶
Use SimplyPrint as the unified print farm management solution with an edge device connecting to all printers via LAN.
Architecture¶
Rationale¶
- Unified API: Single integration point for all printer brands
- LAN mode: Direct communication with printers, no cloud dependency for print control
- Edge device: Handles printer communication, buffering, and monitoring
- Multi-brand support: Prusa and Bambu Lab printers managed together
- No Bambu Cloud dependency: Avoids Bambu Lab Cloud API limitations
Printer Support¶
| Brand | Models | Connection |
|---|---|---|
| Prusa | MK3S+, XL, Mini | LAN via SimplyPrint edge device |
| Bambu Lab | X1 Carbon, P1S | LAN via SimplyPrint edge device |
Implementation Details (Phase 2)¶
API Client (apps/api/src/simplyprint/simplyprint-api.client.ts):
- HTTP Basic Authentication with Company ID and API Key
- Typed methods for files, jobs, printers, and queue operations
- Automatic connection verification on startup
- Sentry integration for 5xx error tracking
Webhook Controller (apps/api/src/simplyprint/simplyprint-webhook.controller.ts):
- Endpoint: `POST /webhooks/simplyprint`
- X-SP-Token verification via guard
- Event-driven status updates
Print Jobs Service (apps/api/src/print-jobs/print-jobs.service.ts):
- Creates print jobs in SimplyPrint when orders arrive
- Updates local status based on SimplyPrint events
- Supports cancel and retry operations
API Endpoints Used:
| Endpoint | Method | Purpose |
|---|---|---|
| /{companyId}/files/GetFiles | GET | List available print files |
| /{companyId}/printers/Get | GET | Get printer statuses |
| /{companyId}/printers/actions/CreateJob | POST | Create new print job |
| /{companyId}/printers/actions/Cancel | POST | Cancel active job |
| /{companyId}/queue/GetItems | GET | Get queue items |
| /{companyId}/queue/AddItem | POST | Add item to queue |
| /{companyId}/queue/RemoveItem | POST | Remove from queue |
Consequences¶
- ✅ Single API for all printers
- ✅ No dependency on Bambu Lab Cloud
- ✅ Local network resilience
- ✅ Real-time printer status via edge device
- ✅ Typed API client with full error handling
- ✅ Webhook and polling support for status updates
- ⚠️ Requires edge device on print farm network
- ⚠️ SimplyPrint subscription required
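The client's authentication and endpoint construction can be sketched as below. HTTP Basic auth with Company ID and API key is from the ADR; the helper names, sample values, and base URL are assumptions for illustration.

```typescript
// Build the Basic auth header from Company ID and API key (per the ADR).
function basicAuthHeader(companyId: string, apiKey: string): string {
  return 'Basic ' + Buffer.from(`${companyId}:${apiKey}`).toString('base64');
}

// Expand the /{companyId}/... paths from the endpoint table.
// The base URL here is an assumption, not taken from the document.
function endpoint(companyId: string, path: string): string {
  return `https://api.simplyprint.io/${companyId}/${path}`;
}

// Hypothetical usage with sample credentials:
const headers = { Authorization: basicAuthHeader('1234', 'sp_test_key') };
const filesUrl = endpoint('1234', 'files/GetFiles');
```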
ADR-015: Aikido Security Platform for Continuous Security Monitoring¶
| Attribute | Value |
|---|---|
| ID | ADR-015 |
| Status | Superseded by ADR-067 (Grype CVE Scanning) |
| Date | 2026-01-10 |
| Context | Need continuous security monitoring, vulnerability scanning, and SBOM generation |
Decision¶
Use Aikido Security Platform as the centralized security monitoring and compliance solution integrated into the CI/CD pipeline.
Security Checks Implemented¶
| Check | Status | Description |
|---|---|---|
| Open Source Dependency Monitoring | Active | Monitors 3rd party dependencies for vulnerabilities |
| Exposed Secrets Monitoring | Compliant | Detects accidentally exposed secrets in source code |
| License Management | Compliant | Validates dependency licenses for legal compliance |
| SAST | Compliant | Static Application Security Testing |
| IaC Testing | Compliant | Infrastructure as Code security analysis |
| Malware Detection | Compliant | Detects malware in dependencies |
| Mobile Issues | Compliant | Mobile manifest file monitoring |
| SBOM Generation | Active | Software Bill of Materials for supply chain security |
Rationale¶
- Comprehensive coverage: Single platform covers multiple security domains
- CI/CD integration: Automated scanning on every code change
- SBOM generation: Critical for supply chain security and compliance
- License compliance: Automated license validation prevents legal issues
- Developer-friendly: Clear dashboards and actionable remediation guidance
- Proactive detection: Continuous monitoring catches issues before production
Future Enhancements¶
- Code Quality Analysis: Will be enabled in a subsequent phase to complement security scanning
Consequences¶
- ✅ Continuous security visibility across the codebase
- ✅ Automated vulnerability detection in dependencies
- ✅ SBOM generation for supply chain transparency
- ✅ License compliance validation
- ✅ Secrets exposure prevention
- ⚠️ Requires Aikido platform subscription
- ⚠️ May flag false positives requiring triage
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Snyk | More expensive, less comprehensive for our needs |
| GitHub Advanced Security | Limited to GitHub, not as comprehensive |
| Manual audits | Not scalable, too slow for continuous delivery |
| Dependabot only | Only covers dependency vulnerabilities, not comprehensive |
ADR-016: Sentry Observability with OpenTelemetry ✅¶
| Attribute | Value |
|---|---|
| ID | ADR-016 |
| Status | ✅ Implemented (Updated by ADR-058: structured logging moved to ClickHouse) |
| Date | 2026-01-10 |
| Context | Need comprehensive observability: error tracking, performance monitoring, distributed tracing |
Decision¶
Use Sentry as the observability platform with an OpenTelemetry-first architecture for vendor neutrality.
Architecture¶
Implementation Details¶
Backend (NestJS):
- `@sentry/nestjs` for error tracking and performance
- `@sentry/profiling-node` for profiling
- `nestjs-pino` for structured JSON logging
- OpenTelemetry auto-instrumentation for Prisma queries
- Global exception filter with Sentry capture
- Logging interceptor with correlation IDs
Frontend (React):
- `@sentry/react` for error tracking
- Custom `ErrorBoundary` component with Sentry integration
- Browser tracing for page navigation
- User-friendly error fallback UI
Sampling Configuration (Free Tier Compatible):
| Environment | Traces | Profiles | Errors |
|---|---|---|---|
| Development | 100% | 100% | 100% |
| Production | 10% | 10% | 100% |
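The sampling table can be expressed as a small environment-derived helper. The option names loosely follow Sentry's `tracesSampleRate` / `profilesSampleRate` init options, but this is a sketch, not the project's actual configuration code.

```typescript
// Derive sampling rates from the environment, per the table above.
function samplingFor(env: string) {
  const isProd = env === 'production';
  return {
    tracesSampleRate: isProd ? 0.1 : 1.0,
    profilesSampleRate: isProd ? 0.1 : 1.0,
    // Errors are captured at 100% in both environments.
  };
}
```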
Rationale¶
- Sentry: Industry-leading error tracking with excellent stack trace support
- OpenTelemetry: Vendor-neutral instrumentation standard, future-proof
- Structured Logging: JSON logs enable log aggregation and searching
- Correlation IDs: End-to-end request tracing across frontend and backend
- Free Tier: Sufficient for small-scale production (10K errors/month)
Data Privacy¶
Sensitive data is automatically scrubbed:
- Authorization headers
- Cookies
- API tokens
- Passwords
- Shopify access tokens
Implementation Details (Phase 1b)¶
Backend (apps/api):
- `instrument.ts` - Sentry initialization with profiling (imported first in `main.ts`)
- `ObservabilityModule` - Global module with Pino logger and Sentry integration
- `SentryExceptionFilter` - Captures all exceptions with request context
- `LoggingInterceptor` - Request/response logging with correlation IDs
- `ObservabilityController` - Test endpoints for verifying observability (non-prod only)
- Prisma service enhanced with Sentry breadcrumbs for query tracing
Frontend (apps/web):
- `sentry.ts` - Sentry initialization with browser tracing and session replay
- `ErrorBoundary.tsx` - React error boundary with Sentry integration
Shared Library (libs/observability):
- `sentry.config.ts` - Shared Sentry configuration with 100% sampling
- `otel.config.ts` - OpenTelemetry configuration
- `constants.ts` - Trace/request ID header constants
Sampling Decision:
- 100% sampling for all environments (traces and profiles)
- Rationale: Full visibility needed during early development
- Can be reduced when traffic increases and limits are reached
Consequences¶
- ✅ Comprehensive error visibility with stack traces and context
- ✅ Performance monitoring for API endpoints and database queries
- ✅ Distributed tracing across frontend and backend
- ✅ Structured logs with correlation IDs for debugging
- ✅ Vendor-neutral instrumentation via OpenTelemetry
- ✅ Test endpoints for verifying observability in development
- ⚠️ Requires Sentry account (free tier available)
- ⚠️ Must initialize Sentry before other imports in main.ts
- ⚠️ 100% sampling may hit free tier limits with high traffic
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Datadog | Expensive for small-scale, overkill for current needs |
| New Relic | Expensive, complex pricing model |
| Grafana + Loki | Requires self-hosting, more operational overhead |
| ELK Stack | Complex to set up and maintain, expensive at scale |
| Console.log only | No centralized visibility, hard to debug production issues |
ADR-017: Docker + Traefik Deployment Strategy¶
| Attribute | Value |
|---|---|
| ID | ADR-017 |
| Status | ⏳ In Progress |
| Date | 2026-01-10 |
| Context | Need a deployment strategy for staging/production on DigitalOcean with TLS and zero-downtime |
Decision¶
Use Docker Compose with Traefik reverse proxy for deploying to DigitalOcean Droplets.
Architecture¶
Deployment Components¶
| Component | Technology | Purpose |
|---|---|---|
| Reverse Proxy | Traefik v3 | TLS termination, routing, load balancing |
| TLS Certificates | Let's Encrypt | Automatic certificate issuance/renewal |
| Container Orchestration | Docker Compose | Service definition and networking |
| Image Registry | DigitalOcean Registry | Private Docker image storage |
| Database | DO Managed PostgreSQL | Persistent data storage with TLS |
Traefik Configuration¶
| Feature | Implementation |
|---|---|
| Entry Points | HTTP (:80) with redirect to HTTPS (:443) |
| Certificate Resolver | Let's Encrypt with HTTP challenge |
| Service Discovery | Docker labels on containers |
| Health Checks | HTTP health endpoints (/health, /health/live, /health/ready) |
| Logging | JSON format for log aggregation |
Staging URLs¶
| Service | URL |
|---|---|
| API | https://staging-connect-api.forma3d.be |
| Web | https://staging-connect.forma3d.be |
Pipeline Integration¶
| Stage | Trigger | Action |
|---|---|---|
| Package | develop branch | Build Docker images, push to DO Registry |
| Deploy Staging | develop branch | SSH + docker compose up |
| Deploy Production | main branch | Manual approval + SSH deploy |
Image Tagging Strategy¶
| Tag Format | Example | Purpose |
|---|---|---|
| Pipeline Instance | 20260110143709 | Immutable deployment reference |
| Latest | latest | Convenience for development |
Database Migration Strategy¶
Prisma migrations run before container deployment:
```bash
# Executed in pipeline before docker compose up
docker compose run --rm api npx prisma migrate deploy
```
Rationale¶
- Traefik: Automatic TLS, Docker-native, label-based configuration
- Docker Compose: Simple, declarative, easy to understand
- SSH deployment: Direct control, no additional orchestration overhead
- Managed PostgreSQL: Reliability, automated backups, TLS built-in
- Let's Encrypt: Free, automated TLS certificates
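The label-based configuration Traefik reads could look like the fragment below. This is a sketch: the router/service names, certificate resolver name, registry path, and container port are assumptions, not the project's actual compose file.

```yaml
services:
  api:
    image: registry.digitalocean.com/forma3d/api:latest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.api.rule=Host(`staging-connect-api.forma3d.be`)"
      - "traefik.http.routers.api.entrypoints=websecure"
      - "traefik.http.routers.api.tls.certresolver=letsencrypt"
      - "traefik.http.services.api.loadbalancer.server.port=3000"
```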
Zero-Downtime Deployment¶
```bash
# Pull new images
docker compose pull

# Run migrations (idempotent)
docker compose run --rm api npx prisma migrate deploy

# Start new containers (Compose handles replacement)
docker compose up -d --remove-orphans

# Clean up old images
docker image prune -f
```
Consequences¶
- ✅ Automatic TLS certificate management
- ✅ Simple deployment via SSH + Docker Compose
- ✅ Zero-downtime container replacement
- ✅ Docker labels for routing configuration
- ✅ Consistent image tagging with pipeline ID
- ⚠️ Single droplet = single point of failure (acceptable for staging)
- ⚠️ Requires manual SSH key management in Azure DevOps
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Kubernetes | Overkill for current scale, operational complexity |
| Docker Swarm | Less ecosystem support, not needed for single-node |
| Nginx | Manual certificate management, less dynamic |
| Caddy | Less mature Docker integration than Traefik |
| DigitalOcean App Platform | Less control, higher cost |
ADR-018: Nx Affected Conditional Deployment Strategy¶
| Attribute | Value |
|---|---|
| ID | ADR-018 |
| Status | ✅ Implemented |
| Date | 2026-01-11 |
| Context | Need to avoid unnecessary Docker builds and deployments when only part of the codebase changes |
Decision¶
Use Nx affected to detect which applications have changed and conditionally run package/deploy stages only for affected apps.
Architecture¶
Pipeline Parameters¶
| Parameter | Type | Default | Purpose |
|---|---|---|---|
| `ForceFullVersioningAndDeployment` | boolean | `true` | Bypass affected detection, deploy all apps |
| `breakingMigration` | boolean | `false` | Stop API before migrations |
How Affected Detection Works¶
The pipeline runs `pnpm nx show projects --affected --type=app` to identify which applications have changed compared to the base branch (`origin/main`).
Scenarios:
| Change Location | API Affected | Web Affected | Reason |
|---|---|---|---|
| `apps/api/**` | ✅ | ❌ | Only API code changed |
| `apps/web/**` | ❌ | ✅ | Only Web code changed |
| `libs/domain/**` | ✅ | ✅ | Shared library affects both apps |
| `libs/api-client/**` | ❌ | ✅ | API client only used by Web |
| `prisma/**` | ✅ | ❌ | Database schema affects API |
| `docs/**`, `*.md` | ❌ | ❌ | Docs are published as a separate static site (Zensical) |
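The table above reduces to a simple decision: an app is deployed when it appears in the affected list, or when the force parameter is set. A minimal sketch using a hypothetical helper (not the actual pipeline script):

```typescript
// Hypothetical helper mirroring the affected-detection table above.
// `affected` would come from `pnpm nx show projects --affected --type=app`.
function decideDeployments(
  affected: string[],
  forceFullDeployment = false,
): { deployApi: boolean; deployWeb: boolean } {
  return {
    deployApi: forceFullDeployment || affected.includes('api'),
    deployWeb: forceFullDeployment || affected.includes('web'),
  };
}
```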
Migration Safety¶
The deployment follows a specific order to ensure database safety:
1. Pull new images (uses the latest code with the new Prisma schema)
2. Stop API (only if `breakingMigration=true`)
3. Run migrations (using the new image via `docker compose run --rm`)
4. Start API (after migrations complete)
Migration Types:
| Migration Type | Safe During Old API? | Recommended Action |
|---|---|---|
| Add nullable column | ✅ Safe | Normal deployment |
| Add column with default | ✅ Safe | Normal deployment |
| Add new table | ✅ Safe | Normal deployment |
| Drop column | ❌ Dangerous | Use breakingMigration=true |
| Rename column | ❌ Dangerous | Use breakingMigration=true |
| Add non-nullable column | ❌ Dangerous | Use breakingMigration=true |
Rationale¶
- Efficiency: Avoid building/pushing Docker images when code hasn't changed
- Cost reduction: Fewer container registry pushes, less storage used
- Faster deployments: Only affected services are restarted
- Cleaner versioning: New version tags only when actual code changes
- Nx integration: Leverages existing monorepo tooling for dependency detection
Consequences¶
- ✅ Significantly faster CI/CD for partial changes
- ✅ Reduced container registry costs
- ✅ Cleaner deployment history (versions reflect actual changes)
- ✅ Safe migration order (migrations before restart)
- ✅ Support for breaking migrations with explicit parameter
- ✅ Override available via `ForceFullVersioningAndDeployment` parameter
- ⚠️ First pipeline run on a new branch may show all apps affected
- ⚠️ Shared library changes trigger both app deployments (by design)
- ⚠️ `breakingMigration` requires manual assessment of migration type
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Always build both apps | Wasteful, slow, unnecessary version proliferation |
| Manual selection of apps | Error-prone, requires human decision each time |
| Git diff on Dockerfiles only | Misses shared library changes |
| Separate pipelines per app | Loses monorepo benefits, harder to maintain |
ADR-019: SimplyPrint Webhook Verification¶
| Attribute | Value |
|---|---|
| ID | ADR-019 |
| Status | ✅ Implemented |
| Date | 2026-01-13 |
| Context | Need to verify webhook requests are genuinely from SimplyPrint |
Decision¶
Implement X-SP-Token header verification with timing-safe comparison for all SimplyPrint webhooks.
Implementation¶
```typescript
// SimplyPrintWebhookGuard
import * as crypto from 'crypto';
import {
  CanActivate,
  ExecutionContext,
  Injectable,
  Logger,
  UnauthorizedException,
} from '@nestjs/common';

@Injectable()
export class SimplyPrintWebhookGuard implements CanActivate {
  private readonly logger = new Logger(SimplyPrintWebhookGuard.name);
  // Loaded from the SIMPLYPRINT_WEBHOOK_SECRET environment variable
  private readonly webhookSecret?: string;

  canActivate(context: ExecutionContext): boolean {
    const request = context.switchToHttp().getRequest();
    const token = request.headers['x-sp-token'];

    if (!this.webhookSecret) {
      this.logger.warn('SimplyPrint webhook secret not configured, skipping verification');
      return true;
    }
    if (!token) {
      throw new UnauthorizedException('Missing X-SP-Token header');
    }

    // Timing-safe comparison to prevent timing attacks.
    // Lengths are checked first because timingSafeEqual throws on unequal-length buffers.
    const tokenBuffer = Buffer.from(token);
    const secretBuffer = Buffer.from(this.webhookSecret);
    if (tokenBuffer.length !== secretBuffer.length) {
      throw new UnauthorizedException('Invalid SimplyPrint webhook signature');
    }
    if (!crypto.timingSafeEqual(tokenBuffer, secretBuffer)) {
      throw new UnauthorizedException('Invalid SimplyPrint webhook signature');
    }
    return true;
  }
}
```
Rationale¶
- Security: Prevents forged webhook requests
- SimplyPrint standard: Uses the X-SP-Token header as per SimplyPrint documentation
- Timing-safe comparison: Prevents timing attacks on secret comparison
- Graceful degradation: Allows bypassing verification in development when secret not configured
Webhook Endpoint¶
| Endpoint | Method | Purpose |
|---|---|---|
| `/webhooks/simplyprint` | POST | Receive SimplyPrint events |
Supported Events¶
| Event | Action |
|---|---|
| `job.started` | Update job status to PRINTING |
| `job.done` | Update job status to COMPLETED |
| `job.failed` | Update job status to FAILED |
| `job.cancelled` | Update job status to CANCELLED |
| `job.paused` | Keep as PRINTING (temporary state) |
| `job.resumed` | Keep as PRINTING |
| `printer.*` | Ignored (no job status change) |
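The event-to-status mapping above can be sketched as a pure function — a hedged illustration of what `mapWebhookEventToStatus` might look like; the actual implementation may differ:

```typescript
type PrintJobStatus = 'PRINTING' | 'COMPLETED' | 'FAILED' | 'CANCELLED';

// Sketch of the event-to-status mapping from the table above.
function mapWebhookEventToStatus(event: string): PrintJobStatus | null {
  switch (event) {
    case 'job.started':
    case 'job.paused': // temporary state, job remains PRINTING
    case 'job.resumed':
      return 'PRINTING';
    case 'job.done':
      return 'COMPLETED';
    case 'job.failed':
      return 'FAILED';
    case 'job.cancelled':
      return 'CANCELLED';
    default:
      return null; // printer.* and unknown events are ignored
  }
}
```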
Consequences¶
- ✅ Secure webhook endpoint
- ✅ Protection against timing attacks
- ✅ Clear event-to-status mapping
- ✅ Development-friendly (optional verification)
- ⚠️ Requires `SIMPLYPRINT_WEBHOOK_SECRET` environment variable
ADR-020: Hybrid Status Monitoring (Polling + Webhooks)¶
| Attribute | Value |
|---|---|
| ID | ADR-020 |
| Status | ✅ Implemented |
| Date | 2026-01-13 |
| Context | Need reliable print job status updates even if webhooks fail or are delayed |
Decision¶
Implement a hybrid approach using both SimplyPrint webhooks (primary) and periodic polling (fallback) for job status monitoring.
Architecture¶
```text
┌─────────────────────────────────────────────────────────────────┐
│                     Status Update Sources                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SimplyPrint Cloud                                              │
│        │                                                        │
│        ├─── Webhooks (Primary, Real-time) ───┐                  │
│        │    • Immediate notification         │                  │
│        │    • Event: job.started/done/failed │                  │
│        │                                     ▼                  │
│        │                          SimplyPrintService            │
│        │                                     │                  │
│        └─── Polling (Fallback, 30s) ────────►│                  │
│             • @Cron every 30 seconds         │                  │
│             • Checks queue and printers      ▼                  │
│                               simplyprint.job-status-changed    │
│                                              │                  │
│                                              ▼                  │
│                                     PrintJobsService            │
│                                              │                  │
│                                              ▼                  │
│                                      Database Update            │
└─────────────────────────────────────────────────────────────────┘
```
Implementation¶
Webhook Handler (Primary):
```typescript
async handleWebhook(payload: SimplyPrintWebhookPayload): Promise<void> {
  const jobData = payload.data.job;
  if (!jobData) return;

  const newStatus = this.mapWebhookEventToStatus(payload.event);
  if (!newStatus) return;

  this.eventEmitter.emit(SIMPLYPRINT_EVENTS.JOB_STATUS_CHANGED, {
    simplyPrintJobId: jobData.uid,
    newStatus,
    printerId: payload.data.printer?.id,
    printerName: payload.data.printer?.name,
    timestamp: new Date(payload.timestamp * 1000),
  });
}
```
Polling Fallback:
```typescript
@Cron(CronExpression.EVERY_30_SECONDS)
async pollJobStatuses(): Promise<void> {
  if (!this.pollingEnabled || this.isPolling) return;
  this.isPolling = true;
  try {
    const printers = await this.simplyPrintClient.getPrinters();
    for (const printer of printers) {
      if (printer.currentJobId && printer.status === 'printing') {
        this.eventEmitter.emit(SIMPLYPRINT_EVENTS.JOB_STATUS_CHANGED, {
          simplyPrintJobId: printer.currentJobId,
          newStatus: PrintJobStatus.PRINTING,
          printerId: printer.id,
          printerName: printer.name,
          timestamp: new Date(),
        });
      }
    }
  } finally {
    this.isPolling = false;
  }
}
```
Configuration¶
| Environment Variable | Default | Description |
|---|---|---|
| `SIMPLYPRINT_POLLING_ENABLED` | `true` | Enable/disable polling fallback |
| `SIMPLYPRINT_POLLING_INTERVAL_MS` | `30000` | Polling interval in milliseconds |
Rationale¶
- Reliability: Webhooks can fail due to network issues, SimplyPrint outages, or configuration problems
- Real-time updates: Webhooks provide immediate notification when status changes
- Consistency: Polling catches any status changes that webhooks might miss
- Idempotency: Status updates check current status before updating, preventing duplicate updates
- Configurable: Polling can be disabled in environments where webhooks are reliable
Status Deduplication¶
The system handles duplicate status updates gracefully:
```typescript
async updateJobStatus(simplyPrintJobId: string, newStatus: PrintJobStatus): Promise<PrintJob> {
  const printJob = await this.findBySimplyPrintJobId(simplyPrintJobId);

  // Skip if status unchanged (idempotent)
  if (printJob.status === newStatus) {
    return printJob;
  }

  // Update and emit events
  // ...
}
```
Consequences¶
- ✅ High reliability for status updates
- ✅ Real-time updates via webhooks
- ✅ Catches missed webhooks via polling
- ✅ Configurable polling interval
- ✅ Idempotent status updates
- ⚠️ Polling adds API calls every 30 seconds (minimal overhead)
- ⚠️ Potential for slight delay if only relying on polling
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Webhooks only | Single point of failure, missed events cause stale status |
| Polling only | Higher latency, unnecessary API calls when webhooks work |
| WebSocket connection | SimplyPrint doesn't offer WebSocket API |
| Manual refresh button | Poor UX, requires operator intervention |
ADR-021: Retry Queue with Exponential Backoff¶
| Attribute | Value |
|---|---|
| ID | ADR-021 |
| Status | ✅ Implemented |
| Date | 2026-01-14 |
| Context | Need to handle transient failures in external API calls (Shopify, SimplyPrint) gracefully |
Decision¶
Implement a database-backed retry queue with exponential backoff and jitter for all retryable operations.
Configuration¶
| Setting | Value | Description |
|---|---|---|
| Max Retries | 5 | Maximum retry attempts |
| Initial Delay | 1 second | First retry delay |
| Max Delay | 1 hour | Maximum retry delay |
| Backoff Multiplier | 2 | Exponential growth factor |
| Jitter | ±10% | Randomization to prevent thundering herd |
| Cleanup | 7 days | Old completed jobs deleted |
Implementation¶
```typescript
calculateDelay(attempt: number): number {
  // Exponential backoff: initialDelayMs * multiplier^(attempt - 1), capped at maxDelayMs
  let delay = this.initialDelayMs * Math.pow(this.backoffMultiplier, attempt - 1);
  delay = Math.min(delay, this.maxDelayMs);
  // ±10% jitter to avoid thundering-herd retries
  const jitter = delay * 0.1 * (Math.random() * 2 - 1);
  return Math.round(delay + jitter);
}
```
Supported Job Types¶
| Job Type | Description |
|---|---|
| `FULFILLMENT` | Shopify fulfillment creation |
| `PRINT_JOB_CREATION` | SimplyPrint job creation |
| `CANCELLATION` | Job cancellation operations |
| `NOTIFICATION` | Email notification sending |
Consequences¶
- ✅ Automatic recovery from transient failures
- ✅ Prevents thundering herd with jitter
- ✅ Persistent queue survives application restarts
- ✅ Failed jobs trigger operator alerts
- ⚠️ Adds database table for queue persistence
ADR-022: Event-Driven Fulfillment Architecture¶
| Attribute | Value |
|---|---|
| ID | ADR-022 |
| Status | ✅ Implemented |
| Date | 2026-01-14 |
| Context | Need to automatically create Shopify fulfillments when all print jobs complete |
Decision¶
Use NestJS Event Emitter to trigger fulfillment creation when the orchestration service determines all print jobs for an order are complete.
Event Flow¶
```text
PrintJob.COMPLETED → OrchestrationService checks all jobs
  → If all complete: emit order.ready-for-fulfillment
  → FulfillmentService listens and creates Shopify fulfillment
```
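The orchestration check itself is a simple predicate over an order's print jobs — a minimal sketch using a hypothetical helper, with job statuses as plain strings:

```typescript
type PrintJob = { status: string };

// An order is ready for fulfillment only when it has jobs and all are COMPLETED.
function isReadyForFulfillment(jobs: PrintJob[]): boolean {
  return jobs.length > 0 && jobs.every((job) => job.status === 'COMPLETED');
}
```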
Key Events¶
| Event | Producer | Consumer |
|---|---|---|
| `order.ready-for-fulfillment` | OrchestrationService | FulfillmentService |
| `fulfillment.created` | FulfillmentService | NotificationsService |
| `fulfillment.failed` | FulfillmentService | NotificationsService |
| `order.cancelled` | OrdersService | CancellationService |
Consequences¶
- ✅ Loose coupling between order management and fulfillment
- ✅ Easy to add additional listeners (logging, analytics)
- ✅ Failure in fulfillment doesn't block order completion
- ⚠️ Event ordering not guaranteed (acceptable for this use case)
ADR-023: Email Notification Strategy¶
| Attribute | Value |
|---|---|
| ID | ADR-023 |
| Status | ✅ Implemented |
| Date | 2026-01-14 |
| Context | Need to alert operators when automated processes fail and require attention |
Decision¶
Implement email notifications via SMTP using Nodemailer with Handlebars templates for operator alerts.
Notification Triggers¶
| Trigger | Severity | Description |
|---|---|---|
| Fulfillment failed (final) | ERROR | Fulfillment failed after max retries |
| Print job failed (final) | ERROR | Print job failed after max retries |
| Cancellation needs review | WARNING | Order cancelled with in-progress prints |
| Retry exhausted | ERROR | Any retry job exceeded max attempts |
Configuration¶
```bash
SMTP_HOST=smtp.example.com
SMTP_PORT=587
SMTP_USER=notifications@forma3d.be
SMTP_PASS=***
SMTP_FROM=noreply@forma3d.be
OPERATOR_EMAIL=operator@forma3d.be
NOTIFICATIONS_ENABLED=true
```
Consequences¶
- ✅ Operators notified of issues requiring attention
- ✅ Email templates are maintainable and customizable
- ✅ Graceful degradation if email unavailable
- ✅ Can be disabled in development
- ⚠️ Requires SMTP configuration for each environment
ADR-024: API Key Authentication for Admin Endpoints¶
| Attribute | Value |
|---|---|
| ID | ADR-024 |
| Status | ✅ Implemented |
| Date | 2026-01-14 |
| Context | Admin endpoints (fulfillment, cancellation) need protection from unauthorized access |
Decision¶
Implement API key authentication using a custom NestJS guard for all admin endpoints that modify order state.
Implementation¶
```typescript
// ApiKeyGuard
import * as crypto from 'crypto';
import { CanActivate, ExecutionContext, Injectable, UnauthorizedException } from '@nestjs/common';

@Injectable()
export class ApiKeyGuard implements CanActivate {
  private readonly isEnabled!: boolean; // false when INTERNAL_API_KEY is not configured
  private readonly apiKey!: string; // from INTERNAL_API_KEY

  canActivate(context: ExecutionContext): boolean {
    if (!this.isEnabled) return true; // Development mode

    const request = context.switchToHttp().getRequest();
    const providedKey = request.headers['x-api-key'];
    if (!providedKey) {
      throw new UnauthorizedException('API key required');
    }

    // Timing-safe comparison to prevent timing attacks.
    // The length check is required: timingSafeEqual throws on unequal-length buffers.
    const providedBuffer = Buffer.from(providedKey);
    const expectedBuffer = Buffer.from(this.apiKey);
    if (
      providedBuffer.length !== expectedBuffer.length ||
      !crypto.timingSafeEqual(providedBuffer, expectedBuffer)
    ) {
      throw new UnauthorizedException('Invalid API key');
    }
    return true;
  }
}
```
Protected Endpoints¶
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/v1/fulfillments/order/:orderId` | POST | Create fulfillment |
| `/api/v1/fulfillments/order/:orderId/force` | POST | Force fulfill order |
| `/api/v1/fulfillments/order/:orderId/status` | GET | Get fulfillment status |
| `/api/v1/cancellations/order/:orderId` | POST | Cancel order |
| `/api/v1/cancellations/print-job/:jobId` | POST | Cancel single print job |
Authentication Methods Summary¶
| Endpoint Type | Method | Header | Verification |
|---|---|---|---|
| Shopify Webhooks | HMAC-SHA256 Signature | `X-Shopify-Hmac-Sha256` | Timing-safe comparison |
| SimplyPrint Webhooks | Token Verification | `X-SP-Token` | Timing-safe comparison |
| Admin Endpoints | API Key | `X-API-Key` | Timing-safe comparison |
| Public Endpoints | None | - | - |
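All three timing-safe verifications reduce to the same constant-time comparison pattern. A minimal standalone sketch (`tokensMatch` is a hypothetical helper, not part of the actual guards):

```typescript
import { timingSafeEqual } from 'node:crypto';

// Constant-time secret comparison. The length check is required because
// timingSafeEqual throws when the buffers have different lengths.
function tokensMatch(provided: string, expected: string): boolean {
  const a = Buffer.from(provided);
  const b = Buffer.from(expected);
  return a.length === b.length && timingSafeEqual(a, b);
}
```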
Configuration¶
```bash
# Generate secure API key
openssl rand -hex 32

# Environment variable
INTERNAL_API_KEY="your-secure-api-key"
```
Security Considerations¶
- Timing-safe comparison: Prevents timing attacks on key validation
- Generic error messages: Returns "API key required" or "Invalid API key" to prevent information leakage
- Audit logging: Access attempts are logged for security monitoring
- Development mode: If `INTERNAL_API_KEY` is not set, endpoints are accessible (development only)
Rationale¶
- IDOR Prevention: Addresses Insecure Direct Object Reference (IDOR) vulnerabilities flagged by security scanners
- Defense in Depth: Additional layer of protection for sensitive operations
- Simple Implementation: API keys are stateless and easy to rotate
- Swagger Integration: API key documented in OpenAPI spec for easy testing
Consequences¶
- ✅ Protection against unauthorized access to admin functions
- ✅ IDOR vulnerability mitigated
- ✅ Timing-safe implementation prevents timing attacks
- ✅ Development-friendly (optional in dev mode)
- ✅ Documented in Swagger UI
- ⚠️ Requires secure key management in production
- ⚠️ Key must be rotated if compromised
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| OAuth 2.0 / JWT | Overkill for internal B2B system with no user accounts |
| IP Whitelisting | Too inflexible, requires network configuration |
| mTLS | Complex certificate management for simple use case |
| No authentication | Unacceptable security risk (IDOR vulnerability) |
ADR-025: Cosign Image Signing for Supply Chain Security¶
| Attribute | Value |
|---|---|
| ID | ADR-025 |
| Status | ✅ Implemented |
| Date | 2026-01-14 |
| Context | Need to cryptographically sign container images and create attestations for promotion tracking |
Decision¶
Implement key-based container image signing using cosign from the Sigstore project, with attestations to track image promotions through environments.
Architecture¶
```text
┌─────────────────────────────────────────────────────────────────────────┐
│                         Azure DevOps Pipeline                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Build & Package        Acceptance Test         Production              │
│  ┌─────────────┐        ┌─────────────┐        ┌─────────────┐          │
│  │ Build Docker│        │ Deploy to   │        │ Verify      │          │
│  │ Images      │        │ Staging     │        │ Staging     │          │
│  └──────┬──────┘        └──────┬──────┘        │ Attestation │          │
│         │                      │               └──────┬──────┘          │
│         ▼                      ▼                      │                 │
│  ┌─────────────┐        ┌─────────────┐               ▼                 │
│  │ Sign with   │        │ Run Tests   │        ┌─────────────┐          │
│  │ cosign.key  │        └──────┬──────┘        │ Deploy to   │          │
│  └──────┬──────┘               │               │ Production  │          │
│         │                      ▼               └──────┬──────┘          │
│         │               ┌─────────────┐               │                 │
│         │               │ Create      │               ▼                 │
│         │               │ Staging     │        ┌─────────────┐          │
│         │               │ Attestation │        │ Create Prod │          │
│         │               └─────────────┘        │ Attestation │          │
│         │                                      └─────────────┘          │
└─────────┼──────────────────────────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────┐     ┌─────────────────────┐
│  DigitalOcean Container Registry    │     │  Repository         │
│  ─────────────────────────────────  │     │  ─────────────────  │
│  • Image:tag                        │     │  • cosign.pub       │
│  • Image.sig (signature)            │◄────│    (public key)     │
│  • Image.att (attestation)          │     │                     │
└─────────────────────────────────────┘     └─────────────────────┘
```
Implementation¶
Key-Based Signing (chosen approach):
```yaml
# Azure DevOps Pipeline
- task: DownloadSecureFile@1
  name: cosignKey
  inputs:
    secureFile: 'cosign.key'

- script: |
    cosign sign \
      --key $(cosignKey.secureFilePath) \
      --annotations "build.number=$(imageTag)" \
      $(dockerRegistry)/$(imageName)@$(digest)
  env:
    COSIGN_PASSWORD: $(COSIGN_PASSWORD)
```
Attestation for Promotion Tracking:
```json
{
  "_type": "https://forma3d.com/attestations/promotion/v1",
  "environment": "staging",
  "promotedAt": "2026-01-14T16:00:00+00:00",
  "build": {
    "number": "20260114160000",
    "pipeline": "forma-3d-connect",
    "commit": "abc123..."
  },
  "verification": {
    "healthCheckPassed": true,
    "acceptanceTestsPassed": true
  }
}
```
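Attaching such a promotion attestation could look like the following pipeline step, mirroring the signing step above (the predicate filename and display name are assumptions, not the actual pipeline configuration):

```yaml
- script: |
    cosign attest \
      --yes \
      --key $(cosignKey.secureFilePath) \
      --predicate staging-attestation.json \
      --type custom \
      $(dockerRegistry)/$(imageName)@$(digest)
  displayName: 'Create Staging Attestation'
  env:
    COSIGN_PASSWORD: $(COSIGN_PASSWORD)
```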
Key Management¶
| File | Location | Purpose |
|---|---|---|
| `cosign.key` | Azure DevOps Secure Files | Sign images (private) |
| `COSIGN_PASSWORD` | Azure DevOps Variable Group | Decrypt private key |
| `cosign.pub` | Repository root (`/cosign.pub`) | Verify signatures (public) |
Signing Workflow¶
| Stage | Action | Artifact Created |
|---|---|---|
| Build & Package | Sign image after push | Image signature (.sig) |
| Staging Deploy | Create staging attestation | Staging attestation (.att) |
| Production Deploy | Verify staging attestation, then sign | Production attestation |
Rationale¶
- Supply chain security: Cryptographic proof that images were built by the CI/CD pipeline
- Promotion tracking: Attestations provide audit trail without modifying image tags
- Tamper detection: Modifications to signed images are detectable
- Key-based over keyless: Keyless (OIDC) requires workload identity federation which adds complexity; key-based is simpler and fully functional in Azure DevOps
Why Key-Based Instead of Keyless¶
Sigstore's "keyless" signing uses OIDC tokens from identity providers (GitHub Actions, Google Cloud, etc.). While elegant, it has challenges in Azure DevOps:
| Approach | Pros | Cons |
|---|---|---|
| Keyless (OIDC) | No key management, identity-based | Requires Azure Workload Identity Federation, falls back to device flow in CI (fails) |
| Key-Based | Works immediately in any CI | Requires secure key storage and rotation |
We chose key-based because:
- Azure DevOps doesn't have native OIDC integration with Sigstore
- Device flow authentication cannot work in non-interactive CI
- Key-based signing is well-supported and reliable
Security Considerations¶
- Private key protection: Stored in Azure DevOps Secure Files (encrypted at rest)
- Password protection: Private key is encrypted, password in secret variable
- Timing-safe verification: Public key verification uses constant-time comparison
- Key rotation: Documented procedure for rotating keys periodically (see Cosign Setup Guide)
Pipeline Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `enableSigning` | boolean | `true` | Enable/disable image signing and attestations |
Verification Commands¶
```bash
# Verify image signature
cosign verify --key cosign.pub \
  registry.digitalocean.com/forma-3d/forma3d-connect-api:20260114160000

# View attestations attached to image
cosign tree registry.digitalocean.com/forma-3d/forma3d-connect-api:20260114160000

# Verify and decode attestation
cosign verify-attestation --key cosign.pub --type custom \
  registry.digitalocean.com/forma-3d/forma3d-connect-api@sha256:... \
  | jq '.payload | @base64d | fromjson | .predicate'
```
Local Tooling¶
A script is provided to view image promotion status:
```bash
# List all images with their promotion status
./scripts/list-image-promotions.sh

# Output shows signed status and promotion level
TAG              PROMOTION   SIGNED   UPDATED
20260114160000   STAGING     ✓        2026-01-14
20260114120000   none        ✓        2026-01-14
```
Consequences¶
- ✅ Cryptographic proof of image provenance
- ✅ Tamper detection for container images
- ✅ Audit trail for environment promotions
- ✅ Works reliably in Azure DevOps without OIDC setup
- ✅ Can verify images locally with public key
- ⚠️ Requires secure key management
- ⚠️ Keys must be rotated periodically (recommended: 6-12 months)
- ⚠️ Pipeline requires secure files and variables to be configured
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| No signing | No supply chain security, no tamper detection |
| Keyless signing (OIDC) | Falls back to device flow in Azure DevOps, requires manual auth |
| Docker Content Trust (DCT) | Less flexible, no custom attestations, vendor lock-in |
| Image tags for promotion | Tags can be overwritten, no cryptographic verification |
| External attestation store | Additional infrastructure, attestations separate from images |
Related Documentation¶
- Cosign Setup Guide - Step-by-step key generation and Azure DevOps configuration
- Sigstore Documentation - Official cosign documentation
- Container Image Promotion - Usage instructions for promotion scripts
ADR-026: CycloneDX SBOM Attestations¶
| Attribute | Value |
|---|---|
| ID | ADR-026 |
| Status | ✅ Implemented |
| Date | 2026-01-16 |
| Context | Need to generate and attach Software Bill of Materials (SBOM) to container images for supply chain transparency |
Decision¶
Generate CycloneDX SBOMs using Syft and attach them as signed attestations using cosign.
Architecture¶
Each container image in the registry will have multiple attestations stored as separate OCI artifacts:
```text
Container Image (e.g., forma3d-connect-api:20260116120000)
├── Image signature (.sig) ─────────────── cosign sign
├── SBOM attestation (.att) ────────────── cosign attest --type cyclonedx
├── Staging promotion attestation (.att) ─ cosign attest --type custom
└── Production promotion attestation (.att) cosign attest --type custom
```
Why CycloneDX over SPDX¶
| Criteria | CycloneDX | SPDX |
|---|---|---|
| Primary Focus | Security & DevSecOps | License compliance |
| VEX Support | Native | Separate spec |
| Tool Ecosystem | Excellent (Grype, Syft) | Good |
| Format Complexity | Simpler | More complex |
| OWASP Alignment | Yes (OWASP project) | No |
CycloneDX was chosen because:
- Better integration with vulnerability scanners (Grype, Trivy)
- Native support for VEX (Vulnerability Exploitability eXchange)
- Simpler format for debugging
- Aligns with OWASP security practices
- Growing adoption in DevSecOps pipelines
Implementation¶
Pipeline Step (after image signing):
```yaml
- script: |
    set -e
    # Install Syft
    curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin

    # Generate CycloneDX SBOM
    syft $(dockerRegistry)/$(imageName)@$(digest) \
      --output cyclonedx-json=sbom.cdx.json

    # Attach as signed attestation
    cosign attest \
      --yes \
      --key $(cosignKey.secureFilePath) \
      --predicate sbom.cdx.json \
      --type cyclonedx \
      $(dockerRegistry)/$(imageName)@$(digest)
  displayName: 'Generate and Attach SBOM'
  env:
    COSIGN_PASSWORD: $(COSIGN_PASSWORD)
```
Attestation Types in Registry¶
After deployment, each image has multiple separate attestations:
| Attestation Type | Purpose | Created By |
|---|---|---|
| Signature | Proves image was built by CI/CD | cosign sign |
| CycloneDX SBOM | Lists all components/packages | cosign attest --type cyclonedx |
| Staging | Proves image passed staging | cosign attest --type custom |
| Production | Proves image deployed to prod | cosign attest --type custom |
Verification Commands¶
```bash
# View all attestations attached to an image
cosign tree registry.digitalocean.com/forma-3d/forma3d-connect-api:latest

# Verify and extract SBOM
cosign verify-attestation \
  --key cosign.pub \
  --type cyclonedx \
  registry.digitalocean.com/forma-3d/forma3d-connect-api@sha256:... \
  | jq -r '.payload' | base64 -d | jq '.predicate'

# Count components in SBOM
cosign verify-attestation --key cosign.pub --type cyclonedx \
  registry.digitalocean.com/forma-3d/forma3d-connect-api@sha256:... \
  | jq -r '.payload' | base64 -d | jq '.predicate.components | length'
```
Scanning for Vulnerabilities¶
With the SBOM attached, you can scan for vulnerabilities without pulling the full image:
```bash
# Extract SBOM and scan with Grype
cosign verify-attestation --key cosign.pub --type cyclonedx \
  registry.digitalocean.com/forma-3d/forma3d-connect-api@sha256:... \
  | jq -r '.payload' | base64 -d | jq '.predicate' > sbom.cdx.json
grype sbom:sbom.cdx.json
```
Rationale¶
- Supply chain transparency: SBOM provides complete visibility into image contents
- Vulnerability management: Enables scanning without pulling full images
- Compliance: Meets requirements for software transparency (US Executive Order 14028)
- Signed attestation: SBOM itself is cryptographically signed, preventing tampering
- Tool-agnostic: CycloneDX is an open standard supported by many tools
Consequences¶
- ✅ Complete visibility into image dependencies
- ✅ Enables vulnerability scanning from SBOM
- ✅ Signed attestation prevents SBOM tampering
- ✅ Supports compliance requirements
- ✅ Works with existing cosign infrastructure
- ⚠️ Adds ~10-15 seconds to pipeline per image
- ⚠️ SBOM attestation adds ~2KB manifest to registry
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| SPDX format | More focused on licensing, less security tooling |
| Syft native format | Not an industry standard, limited tool support |
| Docker Buildx --sbom | Requires buildx, less control over format |
| No SBOM | Missing supply chain transparency |
| SBOM in image labels | Not cryptographically signed, can be tampered |
Tools Used¶
| Tool | License | Purpose |
|---|---|---|
| Syft | Apache 2.0 | Generate CycloneDX SBOM |
| Cosign | Apache 2.0 | Sign and attach as attestation |
| Grype | Apache 2.0 | Vulnerability scanning (optional) |
ADR-027: TanStack Query for Server State Management¶
| Attribute | Value |
|---|---|
| ID | ADR-027 |
| Status | Accepted |
| Date | 2026-01-14 |
| Context | Need to manage server state in the React dashboard with caching, refetching, and loading states |
Decision¶
Use TanStack Query (v5.x, formerly React Query) for server state management in the dashboard.
Rationale¶
- Automatic caching: Query results are cached and deduplicated automatically
- Background refetching: Data stays fresh with configurable stale times and refetch intervals
- Loading/error states: Built-in loading, error, and success states reduce boilerplate
- Optimistic updates: Supports optimistic updates for better UX on mutations
- DevTools: React Query DevTools for debugging cache state
- TypeScript support: Excellent TypeScript integration with inferred types
Implementation¶
```typescript
// Query client configuration (apps/web/src/lib/query-client.ts)
const queryClient = new QueryClient({
  defaultOptions: {
    queries: {
      staleTime: 30 * 1000, // 30 seconds
      gcTime: 5 * 60 * 1000, // 5 minutes cache
      retry: 1,
      refetchOnWindowFocus: false,
    },
  },
});

// Example hook (apps/web/src/hooks/use-orders.ts)
export function useOrders(query: OrdersQuery = {}) {
  return useQuery({
    queryKey: ['orders', query],
    queryFn: () => apiClient.orders.list(query),
  });
}
```
Consequences¶
- ✅ Eliminates manual loading/error state management
- ✅ Automatic cache invalidation on mutations
- ✅ Integrates well with Socket.IO for real-time updates
- ✅ Reduces API calls through intelligent caching
- ⚠️ Requires understanding of query keys for proper cache invalidation
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Redux | Too much boilerplate for server state |
| SWR | Fewer features than TanStack Query |
| Apollo Client | GraphQL-focused, overkill for REST API |
| Manual fetch | Requires implementing caching/loading states manually |
ADR-028: Socket.IO for Real-Time Dashboard Updates¶
| Attribute | Value |
|---|---|
| ID | ADR-028 |
| Status | Accepted |
| Date | 2026-01-14 |
| Context | Dashboard needs real-time updates when orders and print jobs change status |
Decision¶
Use Socket.IO for real-time WebSocket communication between backend and dashboard.
Architecture¶
```text
Backend Events          WebSocket Gateway         React Dashboard
      │                        │                        │
      │ order.created          │                        │
      ├───────────────────────►│                        │
      │                        │ order:created          │
      │                        ├───────────────────────►│
      │                        │                        │ invalidateQueries()
      │                        │                        │ toast.success()
```
Implementation¶
Backend (NestJS WebSocket Gateway):
```typescript
// apps/api/src/gateway/events.gateway.ts
@WebSocketGateway({ namespace: '/events' })
export class EventsGateway {
  @WebSocketServer()
  server!: Server;

  @OnEvent(ORDER_EVENTS.CREATED)
  handleOrderCreated(event: OrderEventPayload): void {
    this.server.emit('order:created', { ... });
  }
}
```
Frontend (React Context):
```typescript
// apps/web/src/contexts/socket-context.tsx
socketInstance.on('order:created', (data) => {
  toast.success(`New order: #${data.orderNumber}`);
  queryClient.invalidateQueries({ queryKey: ['orders'] });
});
```
Rationale¶
- Already installed: Socket.IO server was already in dependencies for Phase 3
- Bidirectional: Supports future features like notifications and chat
- Automatic reconnection: Handles network interruptions gracefully
- Namespace support: Can separate different event channels
- Browser compatibility: Works across all modern browsers
Consequences¶
- ✅ Real-time updates without polling
- ✅ Toast notifications on important events
- ✅ Automatic TanStack Query cache invalidation
- ✅ Connection status visible in UI
- ⚠️ Requires WebSocket support in infrastructure
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Polling | Higher latency, more server load |
| Server-Sent Events | One-directional only |
| Raw WebSockets | Fewer features than Socket.IO (rooms, reconnection) |
| Pusher/Ably | External dependency, cost |
ADR-029: API Key Authentication for Dashboard¶
| Attribute | Value |
|---|---|
| ID | ADR-029 |
| Status | Accepted |
| Date | 2026-01-14 |
| Context | Dashboard needs authentication to protect admin operations |
Decision¶
Use API key authentication stored in browser localStorage for dashboard authentication.
Implementation¶
// apps/web/src/contexts/auth-context.tsx
const AUTH_STORAGE_KEY = 'forma3d_api_key';
export function AuthProvider({ children }: { children: ReactNode }) {
const [apiKey, setApiKey] = useState<string | null>(() => {
return localStorage.getItem(AUTH_STORAGE_KEY);
});
const login = (key: string) => {
localStorage.setItem(AUTH_STORAGE_KEY, key);
setApiKey(key);
};
// ...
}
// Protected routes redirect to /login if not authenticated
function ProtectedRoute({ children }: { children: ReactNode }) {
const { isAuthenticated } = useAuth();
if (!isAuthenticated) return <Navigate to="/login" replace />;
return <>{children}</>;
}
Rationale¶
- Simplicity: No session management, token refresh, or OAuth complexity
- Consistent with API: Uses same API key authentication as backend (ADR-024)
- Offline-capable: Works without server validation on page load
- Single operator: System is used by single operator, not public users
Security Considerations¶
- API key stored in localStorage (acceptable for internal admin tool)
- Key sent via `X-API-Key` header for mutations
- HTTPS required in production
- Key should be rotated periodically
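How the stored key reaches the backend can be sketched with a small helper; the helper name and the request layer are illustrative, not the project's actual implementation:

```typescript
// Hypothetical helper: builds headers for authenticated dashboard requests.
// Assumes the key was read from localStorage by the AuthProvider shown above.
function buildAuthHeaders(apiKey: string | null): Record<string, string> {
  const headers: Record<string, string> = { 'Content-Type': 'application/json' };
  if (apiKey) {
    headers['X-API-Key'] = apiKey; // matches the backend API key guard (ADR-024)
  }
  return headers;
}
```

Requests made before login simply omit the header and are rejected by the backend guard.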
Consequences¶
- ✅ Simple implementation and user experience
- ✅ Consistent with existing API key guard on backend
- ✅ No additional authentication infrastructure needed
- ⚠️ API key visible in localStorage (acceptable for admin tool)
- ⚠️ No role-based access control (single admin role)
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| OAuth/OIDC | Overkill for single-operator system |
| JWT tokens | Adds complexity without benefit for this use case |
| Session cookies | Requires server-side session management |
| No auth | Admin operations must be protected |
ADR-030: Sendcloud for Shipping Integration¶
| Attribute | Value |
|---|---|
| ID | ADR-030 |
| Status | Accepted |
| Date | 2026-01-16 |
| Context | Need to generate shipping labels and sync tracking information to Shopify |
Decision¶
Use Sendcloud API (custom integration) rather than the native Sendcloud-Shopify app for shipping label generation and tracking.
Rationale¶
Why Sendcloud as a Platform¶
- Multi-carrier support: Single API for PostNL, DPD, DHL, UPS, and 80+ other carriers
- European focus: Strong presence in Belgium/Netherlands matching Forma3D's primary market
- Simple API: REST API with Basic Auth, parcel creation returns label PDF immediately
- Automatic tracking: Tracking numbers and URLs provided on parcel creation
- Webhook support: Status updates available via webhooks (for future enhancement)
- Competitive pricing: Pay-per-label pricing suitable for small business volumes
- Label formats: Supports A4, A6, and thermal printer formats
Why Custom API Integration vs Native Shopify-Sendcloud App¶
Sendcloud offers a native Shopify integration that automatically syncs orders. However, we chose a custom API integration for the following reasons:
| Aspect | Native Sendcloud-Shopify App | Our Custom API Integration |
|---|---|---|
| Trigger | Manual — operator must create label in Sendcloud dashboard | Automatic — triggered when all print jobs complete |
| Print awareness | None — doesn't know about 3D printing workflow | Full — waits for SimplyPrint jobs to finish |
| Unified dashboard | Split across Shopify + Sendcloud panels | Single dashboard — orders, prints, shipments in one place |
| Audit trail | Separate logs in each system | Integrated event log with full traceability |
| Custom workflow | Generic e-commerce flow | Custom print-to-ship automation |
| Tracking sync timing | After manual label creation | Immediate — included in Shopify fulfillment |
Key insight: The native integration doesn't know when 3D printing is complete. An operator would need to:
- Monitor SimplyPrint for job completion
- Switch to Sendcloud dashboard
- Find the order and create a label
- Wait for tracking to sync back to Shopify
Our custom integration automates this entire workflow:
Print Jobs Complete → Auto-Generate Label → Auto-Fulfill with Tracking → Customer Notified
This reduces manual intervention from ~5 minutes per order to zero, which is critical for scaling order volumes.
Implementation¶
apps/api/src/
├── sendcloud/
│ ├── sendcloud-api.client.ts # HTTP client with Basic Auth
│ ├── sendcloud.service.ts # Business logic, event listener
│ ├── sendcloud.controller.ts # REST endpoints
│ └── sendcloud.module.ts
├── shipments/
│ ├── shipments.repository.ts # Prisma queries for Shipment
│ ├── shipments.controller.ts # REST endpoints
│ └── shipments.module.ts
libs/api-client/src/
└── sendcloud/
└── sendcloud.types.ts # Typed DTOs for Sendcloud API
Event Flow¶
- All print jobs complete → `OrchestrationService` emits `order.ready-for-fulfillment`
- `SendcloudService` listens → creates parcel via Sendcloud API
- Sendcloud returns label URL + tracking number
- Shipment record stored in database
- `SendcloudService` emits `shipment.created` event
- `FulfillmentService` listens → creates Shopify fulfillment with tracking info
- Customer receives email notification with tracking link
┌─────────────┐ ┌──────────────┐ ┌─────────────┐ ┌─────────────┐
│ SimplyPrint │───▶│ Orchestration│───▶│ Sendcloud │───▶│ Fulfillment │
│ (prints) │ │ Service │ │ Service │ │ Service │
└─────────────┘ └──────────────┘ └─────────────┘ └─────────────┘
│ │ │
│ order.ready- │ shipment. │ Shopify
│ for-fulfillment │ created │ Fulfillment
▼ ▼ ▼
[All jobs done] [Label + tracking] [Customer notified]
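The mapping step inside `SendcloudService` can be sketched as follows. The field names mirror Sendcloud's v2 parcel payload but are illustrative here, as are the function name and the `ShippingAddress` shape:

```typescript
// Illustrative sketch of the parcel-creation payload; not the actual DTOs.
interface ShippingAddress {
  name: string;
  address: string;
  city: string;
  postalCode: string;
  country: string; // ISO 3166-1 alpha-2, e.g. "BE"
}

function buildParcelPayload(
  orderNumber: string,
  addr: ShippingAddress,
  shippingMethodId: number, // e.g. DEFAULT_SHIPPING_METHOD_ID from the env vars below
) {
  return {
    parcel: {
      name: addr.name,
      address: addr.address,
      city: addr.city,
      postal_code: addr.postalCode,
      country: addr.country,
      order_number: orderNumber,
      shipment: { id: shippingMethodId },
      request_label: true, // label PDF is returned immediately on creation
    },
  };
}
```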
Consequences¶
- ✅ Single integration for multiple carriers
- ✅ Automatic label PDF generation
- ✅ Tracking information synced to Shopify fulfillments
- ✅ Dashboard displays shipment status and label download
- ⚠️ Dependent on Sendcloud uptime and API availability
- ⚠️ Limited to carriers supported by Sendcloud
- ⚠️ Requires Sendcloud account and sender address configuration
Environment Variables¶
SENDCLOUD_PUBLIC_KEY=xxx
SENDCLOUD_SECRET_KEY=xxx
SENDCLOUD_API_URL=https://panel.sendcloud.sc/api/v2
DEFAULT_SHIPPING_METHOD_ID=8
DEFAULT_SENDER_ADDRESS_ID=12345
SHIPPING_ENABLED=true
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Native Sendcloud-Shopify app | Requires manual label creation; no print workflow awareness |
| Direct carrier APIs | Too many integrations to maintain, each with different APIs |
| ShipStation | US-focused, less European carrier support |
| EasyPost | Less European carrier coverage than Sendcloud |
| Manual labels | Does not meet automation requirements; ~5 min overhead per order |
ADR-031: Automated Container Registry Cleanup¶
| Attribute | Value |
|---|---|
| ID | ADR-031 |
| Status | Accepted |
| Date | 2026-01-16 |
| Context | Container registries accumulate old images over time, increasing storage costs and clutter |
Decision¶
Implement automated container registry cleanup that runs after each successful staging deployment and attestation. The cleanup uses attestation-based policies to determine which images to keep or delete.
Rationale¶
The Problem¶
Without automated cleanup, the DigitalOcean Container Registry accumulates images indefinitely:
- Each CI build creates new images with timestamped tags (e.g., `20260116120000`)
- Signature and attestation artifacts add ~2KB per image
- Storage costs grow linearly with deployment frequency
- Old images provide no value after newer versions are verified in production
Attestation-Based Cleanup Policy¶
The cleanup leverages the cosign attestation system (ADR-025) to make intelligent retention decisions:
| Image Status | Action | Rationale |
|---|---|---|
| PRODUCTION attestation | Keep | May need for rollback |
| Currently deployed | Keep | Active in production/staging |
| Recent (last 5) | Keep | Recent builds for debugging |
| STAGING-only attestation | Delete | Superseded by newer staging builds |
| No attestation | Delete | Never passed acceptance tests |
This policy ensures:
- Rollback capability: Production-attested images are always available
- Debugging support: Recent images preserved for investigation
- Automatic garbage collection: Old staging/unsigned images removed
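The retention table translates directly into a predicate. This sketch assumes attestation status and deploy state have already been resolved; the real script shells out to `cosign` and `doctl` for those inputs:

```typescript
// Illustrative encoding of the retention policy table above.
type Attestation = 'PRODUCTION' | 'STAGING' | null;

interface ImageInfo {
  tag: string;
  attestation: Attestation;
  isDeployed: boolean; // from the /health endpoint check
  recencyRank: number; // 1 = newest build
}

function shouldKeep(image: ImageInfo, keepRecent = 5): boolean {
  if (image.attestation === 'PRODUCTION') return true; // rollback capability
  if (image.isDeployed) return true;                   // active in production/staging
  if (image.recencyRank <= keepRecent) return true;    // recent builds for debugging
  return false; // STAGING-only or unattested images are deleted
}
```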
Integration with Health Endpoints¶
The cleanup script queries the /health endpoints to determine which images are currently deployed:
# API health endpoint returns current build number
curl https://staging-connect-api.forma3d.be/health
# Response: { "build": { "number": "20260116120000" }, ... }
This prevents accidental deletion of running containers.
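The deployed-version lookup can be sketched against the response shape shown above (minimal error handling; the function name is illustrative):

```typescript
// Extracts the deployed build number from a /health response body.
// Response shape taken from the example above: { "build": { "number": "..." } }
function deployedBuildNumber(healthBody: string): string | null {
  try {
    const parsed = JSON.parse(healthBody) as { build?: { number?: string } };
    return parsed.build?.number ?? null;
  } catch {
    return null; // non-JSON response: treat as "no deployed version known"
  }
}
```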
Implementation¶
scripts/
└── cleanup-registry.sh # Cleanup script with attestation checking
azure-pipelines.yml
└── RegistryMaintenance stage # Runs on every main branch pipeline
└── CleanupRegistry job # Cleans manifests + triggers GC
Cleanup Script¶
The `scripts/cleanup-registry.sh` script:
- Authenticates to DigitalOcean Container Registry via `doctl`
- Queries health endpoints to find currently deployed image tags
- Lists all images in the registry for each repository
- Checks attestations using `cosign verify-attestation` with the public key
- Applies retention policy based on attestation status
- Deletes eligible images via `doctl registry repository delete-manifest`
- Triggers garbage collection to reclaim storage space
Pipeline Integration¶
The cleanup runs in a dedicated RegistryMaintenance stage that executes on every main branch pipeline, even when no apps are affected (DeployStaging skipped):
- stage: RegistryMaintenance
dependsOn: [Build, DeployStaging]
condition: and(not(canceled()), eq(variables.isMain, true))
Cleanup Flow¶
┌─────────────────────────────────────────────────────────────────────┐
│ RegistryMaintenance Stage │
│ (runs on every main branch pipeline) │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ CleanupRegistry │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ 1. Query /health endpoints for deployed versions │ │
│ │ 2. List all images in registry │ │
│ │ 3. For each image: │ │
│ │ - Check if PRODUCTION attested → KEEP │ │
│ │ - Check if currently deployed → KEEP │ │
│ │ - Check if in top 5 recent → KEEP │ │
│ │ - Check if STAGING-only attested → DELETE │ │
│ │ - Check if no attestation → DELETE │ │
│ │ 4. Wait for any active GC, start new GC, verify completion │ │
│ │ 5. EXIT trap ensures GC runs even if script crashes │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Usage¶
Local Testing (Dry Run)¶
# Preview what would be deleted
./scripts/cleanup-registry.sh \
--key cosign.pub \
--api-url https://staging-connect-api.forma3d.be \
--web-url https://staging-connect.forma3d.be \
--dry-run \
--verbose
Manual Cleanup¶
# Perform actual cleanup
./scripts/cleanup-registry.sh \
--key cosign.pub \
--api-url https://staging-connect-api.forma3d.be \
--web-url https://staging-connect.forma3d.be \
--verbose
Script Options¶
| Option | Description |
|---|---|
| `-k, --key FILE` | Public key for attestation verification (required) |
| `--api-url URL` | API health endpoint URL (required) |
| `--web-url URL` | Web health endpoint URL (required) |
| `--keep-recent N` | Keep N most recent images (default: 5) |
| `--dry-run` | Preview deletions without executing |
| `-v, --verbose` | Show detailed output |
Consequences¶
- ✅ Automatic storage management reduces costs
- ✅ Attestation-based policy ensures production rollback capability
- ✅ Health endpoint check prevents deletion of running containers
- ✅ Dry-run mode enables safe testing
- ✅ Garbage collection reclaims space after deletion
- ⚠️ Requires health endpoints to return build information
- ⚠️ Dependent on cosign/doctl availability in pipeline
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Time-based retention (e.g., 30 days) | Doesn't account for promotion status; may delete production-ready images |
| Tag-based retention (e.g., keep `latest`) | `latest` tag is mutable; doesn't guarantee correct image |
| Manual cleanup | Error-prone, inconsistent, doesn't scale |
| Registry auto-purge policies | DigitalOcean doesn't support attestation-aware policies |
ADR-032: Domain Boundary Separation with Interface Contracts¶
| Attribute | Value |
|---|---|
| ID | ADR-032 |
| Title | Domain Boundary Separation with Interface Contracts |
| Status | Implemented |
| Context | Prepare the modular monolith for potential future microservices extraction by establishing clean domain boundaries |
| Date | 2026-01-17 |
Context¶
As the application grows, we need to ensure domain boundaries are well-defined to:
- Enable future microservices extraction without major refactoring
- Reduce coupling between modules
- Enable independent testing of domain logic
- Provide distributed tracing capabilities
Decision¶
We implement domain boundary separation with the following patterns:
1. Domain Contracts Library (libs/domain-contracts)¶
Create a dedicated library containing:
- Interface definitions (`IOrdersService`, `IPrintJobsService`, etc.)
- DTOs for cross-domain communication (`OrderDto`, `PrintJobDto`, etc.)
- Symbol injection tokens (`ORDERS_SERVICE`, `PRINT_JOBS_SERVICE`, etc.)
2. Correlation ID Infrastructure¶
Add correlation ID propagation for distributed tracing:
- `CorrelationMiddleware` extracts/generates `x-correlation-id` headers
- `CorrelationService` uses `AsyncLocalStorage` for context propagation
- All domain events include `correlationId`, `timestamp`, and `source` fields
3. Repository Encapsulation¶
Repositories are internal implementation details:
- Modules stop exporting repositories
- Only interface tokens are exported for cross-domain communication
- Services implement domain interfaces
4. Event-Based Base Interfaces¶
Define base event interfaces that all domain events extend:
interface BaseEvent {
correlationId: string;
timestamp: Date;
source: string;
}
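A minimal sketch of the correlation idea using Node's `AsyncLocalStorage`. The actual `CorrelationService` and middleware wiring in NestJS differ; these function names are hypothetical:

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

// Context store: one correlation ID per logical request.
const correlationStore = new AsyncLocalStorage<{ correlationId: string }>();

// Run a unit of work with a correlation ID (normally taken from x-correlation-id).
function runWithCorrelation<T>(fn: () => T, correlationId: string = randomUUID()): T {
  return correlationStore.run({ correlationId }, fn);
}

function currentCorrelationId(): string | undefined {
  return correlationStore.getStore()?.correlationId;
}

// Stamp a domain event with the BaseEvent fields defined above.
function makeEvent(source: string): { correlationId: string; timestamp: Date; source: string } {
  return {
    correlationId: currentCorrelationId() ?? randomUUID(),
    timestamp: new Date(),
    source,
  };
}
```

Any event emitted inside `runWithCorrelation` picks up the same ID, which is what makes cross-domain traces joinable.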
Implementation¶
| Component | Path | Description |
|---|---|---|
| Domain Contracts | `libs/domain-contracts/` | Interface definitions and DTOs |
| Correlation Service | `apps/api/src/common/correlation/` | Request context propagation |
| Base Events | `libs/domain/src/events/` | Base event interfaces |
Interface Tokens Pattern¶
// In domain-contracts library
export const ORDERS_SERVICE = Symbol('IOrdersService');
export interface IOrdersService {
findById(id: string): Promise<OrderDto | null>;
updateStatus(id: string, status: OrderStatus): Promise<OrderDto>;
// ... other methods
}
// In module
@Module({
providers: [OrdersService, { provide: ORDERS_SERVICE, useExisting: OrdersService }],
exports: [ORDERS_SERVICE], // No longer exports repository
})
export class OrdersModule {}
// In consumer service
@Injectable()
export class FulfillmentService {
constructor(
@Inject(ORDERS_SERVICE)
private readonly ordersService: IOrdersService
) {}
}
Scope¶
Interface tokens (@Inject(ORDERS_SERVICE), etc.) enforce boundaries between domains. Services that live within the same domain module should inject the concrete class directly rather than going through the token indirection. For example, OrchestrationService injects PrintJobsService directly because both live inside the order-service; it injects IOrdersService via ORDERS_SERVICE because orders are a separate domain boundary.
Consequences¶
Positive:
- Clear domain boundaries enable future microservices extraction
- Reduced coupling between modules
- Better testability with interface-based mocking
- Distributed tracing via correlation IDs
- Repository details are now private implementation
Negative:
- Slight increase in boilerplate (interface definitions, DTOs)
- Need to maintain DTO mapping logic
- Some `forwardRef()` usages remain for circular retry patterns
Related ADRs¶
- ADR-007: Layered Architecture with Repository Pattern
- ADR-008: Event-Driven Internal Communication
- ADR-013: Shared Domain Library
ADR-033: Database-Backed Webhook Idempotency¶
| Attribute | Value |
|---|---|
| ID | ADR-033 |
| Title | Database-Backed Webhook Idempotency |
| Status | Implemented |
| Context | In-memory webhook idempotency cache doesn't work in multi-instance deployments |
| Date | 2026-01-17 |
Context¶
The original implementation used an in-memory Set<string> for webhook idempotency tracking:
private readonly processedWebhooks = new Set<string>();
This approach had critical problems:
- Horizontal Scaling Failure: In a multi-instance deployment, each API instance has its own cache. Webhooks may be processed multiple times across instances.
- Memory Leak: The Set grows unbounded as webhooks are processed, causing memory pressure in long-running instances.
- Restart Data Loss: All idempotency data is lost on application restart, allowing duplicate processing during restarts.
Decision¶
Use a PostgreSQL table (ProcessedWebhook) for webhook idempotency instead of Redis or in-memory caching.
Rationale¶
- No additional infrastructure: Uses existing PostgreSQL database
- Transactional safety: Database unique constraint ensures race-condition-safe idempotency
- Simple cleanup: Scheduled job removes expired records hourly
- Debugging support: Records include metadata (webhook type, order ID, timestamps)
- Horizontal scaling: Works correctly across multiple API instances
Implementation¶
// Atomic check-and-mark using unique constraint
async isProcessedOrMark(webhookId: string, type: string): Promise<boolean> {
  const expiresAt = new Date(Date.now() + 24 * 60 * 60 * 1000); // illustrative retention window
  try {
await this.prisma.processedWebhook.create({
data: { webhookId, webhookType: type, expiresAt }
});
return false; // First time processing
} catch (error) {
if (error.code === 'P2002') return true; // Already processed
throw error;
}
}
Database Schema¶
model ProcessedWebhook {
id String @id @default(uuid())
webhookId String @unique // The Shopify webhook ID
webhookType String // e.g., "orders/create"
processedAt DateTime @default(now())
expiresAt DateTime // When this record can be cleaned up
orderId String? // Associated order for debugging
@@index([expiresAt]) // For cleanup job queries
@@index([processedAt]) // For monitoring
}
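The race-safety argument can be demonstrated without a database by simulating the unique constraint. The `'P2002'` code mimics Prisma's unique-violation error; the store class is purely illustrative:

```typescript
// Illustrative stand-in for the ProcessedWebhook unique constraint.
class FakeWebhookStore {
  private seen = new Set<string>();

  create(webhookId: string): void {
    if (this.seen.has(webhookId)) {
      // Mimic Prisma's unique-constraint violation error code.
      const err = new Error('Unique constraint failed') as Error & { code: string };
      err.code = 'P2002';
      throw err;
    }
    this.seen.add(webhookId);
  }
}

// Same create-then-catch shape as the repository method above.
function isProcessedOrMark(store: FakeWebhookStore, webhookId: string): boolean {
  try {
    store.create(webhookId);
    return false; // first time processing
  } catch (error) {
    if ((error as { code?: string }).code === 'P2002') return true; // duplicate delivery
    throw error;
  }
}
```

Because the insert itself is the check, two instances racing on the same webhook ID cannot both see "first time": one insert wins, the other hits the constraint.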
Alternatives Considered¶
| Alternative | Pros | Cons | Decision |
|---|---|---|---|
| Redis | TTL support, fast | Additional infrastructure | Rejected |
| Distributed Lock | Works with DB | Complex, race conditions | Rejected |
| Database Table | Simple, no new infra | Needs cleanup job | Selected |
Consequences¶
Positive:
- ✅ Works correctly in multi-instance deployments
- ✅ Survives application restarts
- ✅ No memory leaks
- ✅ Auditable (can query processed webhooks)
- ✅ Race-condition safe via unique constraint
Negative:
- ⚠️ Slightly higher latency than in-memory (< 10ms)
- ⚠️ Requires cleanup job (runs hourly)
Related ADRs¶
- ADR-007: Layered Architecture with Repository Pattern
- ADR-021: Retry Queue for Resilient Operations
ADR-034: Docker Infrastructure Hardening (Log Rotation & Resource Cleanup)¶
| Status | Date | Context |
|---|---|---|
| Accepted | 2026-01-19 | Prevent disk exhaustion from Docker logs and images |
Context¶
During staging operations, the server disk filled to 100% due to:
- Unbounded Docker logs: The default `json-file` log driver has no size limits, causing container logs to grow indefinitely
- Accumulated old images: Each deployment pulled new images, but old versions remained on disk
- Health check failures: When disk was full, Docker couldn't execute health checks, causing containers to be marked unhealthy and Traefik to stop routing traffic
Decision¶
Implement automated infrastructure hardening in the deployment pipeline:
- Docker Log Rotation: Configure daemon-level log rotation with size limits
- Aggressive Resource Cleanup: Remove unused images, volumes, and networks after each deployment
- Separate Image Tags: Use independent version tags for API and Web to support partial deployments
Implementation¶
1. Docker Log Rotation Configuration¶
The pipeline automatically creates /etc/docker/daemon.json if missing:
{
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
}
}
This limits each container to:
- Maximum 10MB per log file
- Maximum 3 rotated files
- Total: 30MB per container (90MB for all 3 containers)
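The ceiling follows directly from the two settings; expressed as a quick sanity check (illustrative helper):

```typescript
// Worst-case log footprint: max-size × max-file × number of containers.
function maxLogBytes(maxSizeMb: number, maxFiles: number, containers: number): number {
  return maxSizeMb * maxFiles * containers * 1024 * 1024;
}
```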
2. Deployment Cleanup Steps¶
After container restart, the pipeline runs:
# Remove dangling images
docker image prune -f
# Remove unused images older than 24h
docker image prune -a -f --filter "until=24h"
# Clean up unused volumes and networks
docker volume prune -f
docker network prune -f
3. Separate Image Tags¶
docker-compose.yml now uses independent tags:
api:
image: ${REGISTRY_URL}/forma3d-connect-api:${API_IMAGE_TAG:-latest}
web:
image: ${REGISTRY_URL}/forma3d-connect-web:${WEB_IMAGE_TAG:-latest}
This allows:
- Deploying only API without changing Web version
- Deploying only Web without changing API version
- Independent rollbacks for each service
Consequences¶
Positive:
- ✅ Prevents disk exhaustion from unbounded log growth
- ✅ Reduces disk usage by cleaning old images after deployment
- ✅ Supports independent versioning for API and Web
- ✅ Self-healing: Pipeline automatically configures log rotation if missing
- ✅ No manual intervention required
Negative:
- ⚠️ Docker daemon restart required if log rotation config is missing (brief container interruption)
- ⚠️ Log history limited to ~30MB per container (may need external log aggregation for production)
Configuration Summary¶
| Setting | Value | Rationale |
|---|---|---|
max-size |
10m | Balance between history and disk usage |
max-file |
3 | Keeps ~30MB per container |
| Image cleanup filter | 24h | Keeps recent images for quick rollback |
Related ADRs¶
- ADR-017: Docker + Traefik Deployment Strategy
- ADR-031: Automated Container Registry Cleanup
ADR-035: Progressive Web App (PWA) for Cross-Platform Access¶
| Attribute | Value |
|---|---|
| ID | ADR-035 |
| Status | Accepted |
| Date | 2026-01-19 |
| Context | Need to provide mobile and desktop access for operators monitoring print jobs and managing orders while away from desk |
Decision¶
Adopt Progressive Web App (PWA) technology for the existing React web application, replacing the planned Tauri (desktop) and Capacitor (mobile) native shell applications.
The web application will be enhanced with:
- Web App Manifest for installability
- Service Worker for offline caching and push notifications
- Web Push API for real-time alerts on print job status
Rationale¶
PWA Suitability for Admin Dashboards¶
Research conducted in January 2026 confirms PWA is an ideal fit for Forma3D.Connect:
- Application type: Admin dashboards and SaaS tools are PWA's primary use case
- Feature requirements: Order management, real-time updates, and push notifications are fully supported
- Device features: No deep hardware integration (Bluetooth, NFC, sensors) required
iOS/Safari PWA Support (2026)¶
Apple has significantly improved PWA support:
| Feature | iOS Version | Status |
|---|---|---|
| Web Push Notifications | iOS 16.4+ | ✅ Supported (Home Screen install required) |
| Badging API | iOS 16.4+ | ✅ Supported |
| Declarative Web Push | iOS 18.4+ | ✅ Improved reliability |
| Standalone Display Mode | iOS 16.4+ | ✅ Supported |
Cost-Benefit Analysis¶
| Aspect | Tauri + Capacitor | PWA |
|---|---|---|
| Initial development | 40-80 hours | 8-16 hours |
| CI/CD pipelines | Additional complexity | None |
| Code signing | Required (Apple, Windows) | None |
| App store submissions | Required | None |
| Update cycle | Days (app store review) | Instant |
| Maintenance | Ongoing | Minimal |
Estimated savings: 80-150 hours initial + ongoing maintenance reduction
Tauri/Capacitor Provided No Real Advantage¶
Both planned native apps were WebView wrappers:
Container(desktop, "Tauri, Rust", "Native desktop shell wrapping the web application")
Container(mobile, "Capacitor", "Mobile shell for on-the-go monitoring")
PWA provides the same experience (installable, app-like, offline capable) without:
- Separate build pipelines
- Platform-specific debugging
- App store management
- Code signing certificates
Implementation¶
Phase 1: PWA Foundation¶
- Add `vite-plugin-pwa` to the web application
- Create `manifest.json` with app metadata and icons
- Configure service worker for asset caching
- Enable HTTPS (already implemented)
{
"name": "Forma3D.Connect",
"short_name": "Forma3D",
"start_url": "/",
"display": "standalone",
"background_color": "#ffffff",
"theme_color": "#0066cc"
}
Phase 2: Push Notifications¶
- Implement Web Push API in frontend
- Add VAPID key configuration to API
- Create notification service (integrate with existing email notifications)
- User permission flow in dashboard settings
Phase 3: Enhanced Offline Support¶
- IndexedDB for offline data caching
- Background sync for queued actions
- Optimistic UI updates
Consequences¶
Positive:
- ✅ Significant reduction in development and maintenance effort
- ✅ Single codebase, single deployment target
- ✅ Instant updates for all users (no app store delays)
- ✅ No platform-specific bugs or WebView inconsistencies
- ✅ No code signing or app store management
- ✅ Works on any device with a modern browser
Negative:
- ⚠️ iOS requires Home Screen install for full PWA features
- ⚠️ No notification sounds on iOS PWA (visual only)
- ⚠️ Limited system tray integration on desktop
Removed from Project:
- ❌ `apps/desktop` (Tauri) - removed from roadmap
- ❌ `apps/mobile` (Capacitor) - removed from roadmap
Updated Architecture¶
The C4 Container diagram has been updated to reflect the PWA-only architecture:
Before:
├── Web Application (React 19)
├── Desktop App (Tauri) [future]
├── Mobile App (Capacitor) [future]
└── API Server (NestJS)
After:
├── Progressive Web App (React 19 + PWA)
└── API Server (NestJS)
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Keep Tauri + Capacitor plan | Unnecessary complexity; WebView wrappers provide no advantage over PWA |
| React Native for mobile | Requires separate codebase; overkill for admin dashboard |
| Electron for desktop | Large bundle size; same WebView approach as Tauri but less efficient |
| Flutter | Requires separate codebase; not justified for simple dashboard |
Related Documents¶
- PWA Feasibility Study - Detailed research and analysis
- C4 Container Diagram - Updated architecture diagram
ADR-036: localStorage Fallback for PWA Install Detection¶
| Attribute | Value |
|---|---|
| ID | ADR-036 |
| Status | Accepted |
| Date | 2026-01-20 |
| Context | Need to detect if PWA is installed when user views site in browser, to show appropriate messaging and avoid duplicate install prompts |
Decision¶
Use a dual detection strategy combining the getInstalledRelatedApps() API with localStorage persistence as a fallback for PWA installation detection.
Rationale¶
The Problem¶
When a user installs a PWA and later visits the same site in a regular browser:
- The browser doesn't know the PWA is installed
- The site shows "Install App" even though it's already installed
- This creates a confusing user experience
API Limitations¶
The navigator.getInstalledRelatedApps() API can detect installed PWAs, but has limitations:
| Platform | Chrome Version | Support |
|---|---|---|
| Android | 80+ | ✅ Full support |
| Windows | 85+ | ✅ Supported |
| macOS | 140+ | ✅ Same-scope only |
| iOS/Safari | - | ❌ Not supported |
Even where supported, the API can be unreliable due to:
- Scope restrictions (must be same origin/scope)
- Timing issues during page load
- Browser implementation quirks
Dual Detection Strategy¶
- Primary: `getInstalledRelatedApps()` API
  - Query the browser for installed related apps
  - Works when supported and correctly configured
- Fallback: localStorage persistence
  - Store `pwa-installed: true` when:
    - User installs via the `appinstalled` event
    - App is opened in standalone mode
    - API successfully detects installation
  - Check localStorage on page load
Implementation¶
// Detection flow
useEffect(() => {
// 1. Check standalone mode (running inside PWA)
const isStandalone = window.matchMedia('(display-mode: standalone)').matches;
if (isStandalone) {
setIsInstalled(true);
localStorage.setItem('pwa-installed', 'true');
return;
}
// 2. Check localStorage fallback
if (localStorage.getItem('pwa-installed') === 'true') {
setIsInstalled(true);
}
// 3. Try getInstalledRelatedApps API
if (navigator.getInstalledRelatedApps) {
navigator.getInstalledRelatedApps().then((apps) => {
if (apps.some((app) => app.platform === 'webapp')) {
setIsInstalled(true);
localStorage.setItem('pwa-installed', 'true');
}
});
}
}, []);
// Persist on install
window.addEventListener('appinstalled', () => {
localStorage.setItem('pwa-installed', 'true');
});
Consequences¶
Positive:
- ✅ Works across all browsers and platforms
- ✅ Provides consistent UX when switching between PWA and browser
- ✅ No false "Install App" prompts when already installed
- ✅ Gracefully degrades when API not supported
Negative:
- ⚠️ localStorage can become stale if user uninstalls PWA externally
- ⚠️ No automatic cleanup mechanism for uninstalled apps
- ⚠️ Per-browser storage (installing in Chrome won't reflect in Firefox)
Trade-off Accepted:
The risk of showing "Installed" for an uninstalled app is acceptable because:
- Users rarely uninstall and then want to reinstall immediately
- Clearing site data will reset the state
- Better UX than constantly prompting to install an already-installed app
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| API only | Too unreliable; doesn't work on Safari/iOS |
| localStorage only | Misses installations from other sessions |
| Server-side tracking | Requires authentication; overcomplicated |
| Cookie-based | Cleared more frequently than localStorage |
Related Documents¶
- ADR-035: Progressive Web App (PWA)
- `apps/web/src/hooks/use-pwa-install.ts` - Implementation
ADR-037: Keep a Changelog for Release Documentation¶
| Attribute | Value |
|---|---|
| ID | ADR-037 |
| Status | Accepted |
| Date | 2026-01-20 |
| Context | Need a standardized way to document changes between releases for developers, operators, and stakeholders |
Decision¶
Adopt the Keep a Changelog format for documenting all notable changes to the project, combined with Semantic Versioning for version numbers.
Rationale¶
Why Keep a Changelog?¶
- Human-readable: Written for humans, not machines - focuses on what matters to users
- Standardized format: Well-known convention reduces cognitive load
- Categorized changes: Clear sections (Added, Changed, Deprecated, Removed, Fixed, Security)
- Release-oriented: Groups changes by version, making it easy to see what's in each release
- Unreleased section: Accumulates changes before a release, making release notes easy
Why Semantic Versioning?¶
- MAJOR.MINOR.PATCH format communicates impact:
- MAJOR: Breaking changes
- MINOR: New features (backward compatible)
- PATCH: Bug fixes (backward compatible)
- Industry standard, well understood by developers
- Enables automated tooling and dependency management
Benefits for AI-Generated Codebase¶
This project is primarily AI-generated, making structured documentation critical:
- Context for AI: Changelog provides history context for future AI sessions
- Audit trail: Documents what was added/changed in each phase
- Stakeholder communication: Non-technical stakeholders can understand progress
- Debugging aid: When issues arise, changelog helps identify when changes were introduced
Implementation¶
File location: CHANGELOG.md in repository root
Format:
# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased]
## [0.7.0] - 2026-01-19
### Added
- Feature description
### Changed
- Change description
### Fixed
- Bug fix description
### Security
- Security fix description
Change categories (use only those that apply):
- Added: New features
- Changed: Changes to existing functionality
- Deprecated: Features marked for removal
- Removed: Features removed
- Fixed: Bug fixes
- Security: Vulnerability fixes
Guidelines¶
- Update with every PR: Add changelog entry as part of the PR
- Write for humans: Describe the user impact, not implementation details
- Link to issues/PRs: Reference related issues where helpful
- Keep Unreleased current: Move entries to versioned section on release
- One entry per change: Don't combine unrelated changes
Consequences¶
Positive:
- ✅ Clear release history for all stakeholders
- ✅ Standardized format reduces documentation overhead
- ✅ Supports both manual reading and automated parsing
- ✅ Integrates well with CI/CD release workflows
- ✅ Provides context for AI-assisted development sessions
Negative:
- ⚠️ Requires discipline to update with each change
- ⚠️ Can become verbose if too granular
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Git commit history only | Too granular; hard to see high-level changes |
| GitHub Releases only | Tied to GitHub; not in repository |
| Auto-generated from commits | Requires strict commit conventions; often too noisy |
| Wiki-based changelog | Separate from code; easy to forget to update |
Related Documents¶
- CHANGELOG.md - The changelog file
- Semantic Versioning
- Keep a Changelog
ADR-038: Zensical for Publishing Project Documentation¶
| Attribute | Value |
|---|---|
| ID | ADR-038 |
| Status | Accepted |
| Date | 2026-01-21 |
| Context | Need a maintainable, deployable documentation website built from the repository docs/ |
Decision¶
Publish the repository documentation in docs/ as a static website built with Zensical.
The docs site is:
- Built from docs/ with configuration in zensical.toml
- Rendered with PlantUML pre-rendering (SVG/PNG) for existing diagrams
- Packaged as a container image forma3d-connect-docs and published to the existing container registry
- Deployed to staging behind Traefik at https://staging-connect-docs.forma3d.be
- Managed by the existing Azure DevOps pipeline using docsAffected detection
Rationale¶
- Single source of truth: docs live next to the code they describe (docs/)
- Static output: simple, fast, cacheable; no backend runtime required
- Pipeline parity: follows the same build/sign/SBOM/deploy controls as api and web
- Diagram support: preserves existing PlantUML investment via deterministic CI rendering
Implementation¶
- Config: zensical.toml (sets site name, logo, PlantUML markdown extension)
- Container build: deployment/docs/Dockerfile (builds site + serves via Nginx)
- Staging service: deployment/staging/docker-compose.yml (docs service + Traefik labels)
- CI/CD: azure-pipelines.yml
  - Detect changes to docs/** or zensical.toml via docsAffected
  - Build/push/sign/SBOM the forma3d-connect-docs image
  - Deploy conditionally to staging
Consequences¶
Positive:
- ✅ Documentation changes can be delivered independently of API/Web
- ✅ Consistent hosting model (Traefik + container) across services
- ✅ PlantUML diagrams render in the published docs site
Negative:
- ⚠️ Docs builds can be slower due to diagram rendering (mitigated by caching)
- ⚠️ Local preview requires Zensical + Java/Graphviz (documented in developer workflow)
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Host Markdown in repo UI | Not a branded, searchable documentation site |
| MkDocs Material | Zensical provides a modern, batteries-included path with similar ecosystem compatibility |
| Convert all diagrams to Mermaid | High migration effort; risk of losing diagram fidelity |
Related Documents¶
- docs/README.md - Documentation index
- docs/05-deployment/staging-deployment-guide.md - Staging deployment guide
- Zensical: Get started
- Zensical: Logo and icons
ADR-039: Global API Key Authentication (Fail-Closed)¶
| Attribute | Value |
|---|---|
| ID | ADR-039 |
| Status | Accepted |
| Date | 2026-01-21 |
| Context | The API exposed non-health endpoints when INTERNAL_API_KEY was missing, risking data access |
Decision¶
Enforce API key authentication globally for the API application, with explicit public exceptions.
- All HTTP routes require X-API-Key by default
- Only the following are public:
  - /health/** (orchestration/monitoring probes)
  - External webhook receivers (secured by their own verification guards)
- Authentication is fail-closed:
  - If INTERNAL_API_KEY is not configured, non-public endpoints return an error (no “development bypass”)
- Real-time channel is also secured:
  - Socket.IO /events requires the same internal API key during handshake
Rationale¶
- Default-secure posture: avoids accidental exposure in development/staging due to missing env vars
- Consistency: one policy applied across all controllers (no “forgot to add @UseGuards” drift)
- Clear separation: health/webhooks remain reachable for infrastructure and external platforms
- Parity with dashboard: matches the operator dashboard’s expectation that API access is gated
Implementation¶
- Global guard: register the API key guard as an APP_GUARD in apps/api
- Public routes: introduce a @Public() decorator to opt out for /health/** and webhook controllers
- Fail-closed config: if INTERNAL_API_KEY is missing, non-public HTTP routes are rejected
- WebSocket guard: add WsApiKeyGuard to EventsGateway for the /events namespace
- Webhook verification: require SIMPLYPRINT_WEBHOOK_SECRET for SimplyPrint inbound verification (fail-closed)
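The fail-closed decision applied by the global guard can be sketched as a plain function. This is an illustrative shape, not the actual guard API, and a production guard should compare keys with a constant-time comparison rather than `===`:

```typescript
// Illustrative sketch of the fail-closed authorization decision (ADR-039).
// Names (RouteContext, isRequestAllowed) are hypothetical, not the real guard.

interface RouteContext {
  isPublic: boolean;     // set via the @Public() decorator metadata
  providedKey?: string;  // value of the X-API-Key header, if present
}

function isRequestAllowed(
  route: RouteContext,
  configuredKey: string | undefined, // INTERNAL_API_KEY from the environment
): boolean {
  if (route.isPublic) return true;   // /health/**, webhook receivers
  if (!configuredKey) return false;  // fail closed: missing key rejects everything
  return route.providedKey === configuredKey;
}
```

The key property is the middle branch: an absent INTERNAL_API_KEY rejects every non-public request instead of silently allowing them.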
Consequences¶
Positive:
- ✅ Eliminates unauthenticated access to operational/admin API endpoints
- ✅ Prevents misconfiguration from silently reducing security
- ✅ Makes “secure endpoints” the default, with explicit public exceptions
- ✅ Secures both REST and realtime update channels consistently
Negative:
- ⚠️ Local development now requires configuring INTERNAL_API_KEY to use non-health endpoints
- ⚠️ Clients (dashboard, tools) must always send X-API-Key for non-public routes
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Per-controller @UseGuards(ApiKeyGuard) only | Easy to miss a controller; inconsistent over time |
| Allow all when key missing (“dev mode”) | Unsafe default; makes staging/prod exposure more likely |
| Network-only restrictions (IP allowlist) | Harder operationally; not sufficient on its own |
Related Documents¶
- ADR-024 / ADR-029 (previous API key authentication decisions)
- apps/api/src/common/guards/api-key.guard.ts
- apps/api/src/common/decorators/public.decorator.ts
ADR-040: Shopify Order Backfill for Downtime Recovery¶
| Attribute | Value |
|---|---|
| ID | ADR-040 |
| Status | Accepted |
| Date | 2026-01-22 |
| Context | Shopify webhooks retry for only ~4 hours; extended downtime can permanently lose order events |
Decision¶
Implement a scheduled backfill service that periodically polls Shopify's Orders API to catch any orders missed during webhook delivery failures.
Strategy:
- Store a durable since_id watermark in the SystemConfig table
- Every 5 minutes (configurable), fetch orders from Shopify with since_id pagination
- For each order not in our database, create it using the same mapping logic as webhooks
- Advance watermark only after successful processing (not before, unlike webhook path)
- Provide admin endpoints for manual backfill trigger, status check, and watermark reset
Rationale¶
- Shopify retry window is limited: Webhooks are retried only 8 times over ~4 hours (as of September 2024)
- Downtime recovery: If service is down longer than 4 hours, orders would be permanently lost without backfill
- Idempotent by design: Order creation is already deduplicated by shopifyOrderId, so re-processing is safe
- Operational visibility: Admin endpoints allow operators to trigger backfill after incidents
- Consistent mapping: Reuses the same ShopifyService.buildCreateOrderInput() method as webhooks
Implementation¶
- SystemConfigService: New service for persisting key-value configuration (watermarks, etc.)
- ShopifyBackfillService: Scheduled job with @Cron(EVERY_5_MINUTES) plus a startup run
- ShopifyAdminController: Admin endpoints at /api/v1/admin/shopify/backfill/*
- Shared mapping: Extracted buildCreateOrderInput() and checkUnmappedSkus() in ShopifyService (uses findUnmappedLineItems() with product/variant ID + SKU matching)
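The watermark-driven backfill loop can be sketched as follows. The dependency interface is a hypothetical stand-in for the real services (SystemConfigService, the Shopify API client), not their actual API; the important detail is that the watermark only advances after a batch has been processed successfully:

```typescript
// Sketch of the since_id backfill loop (ADR-040); names are assumptions.

interface ShopifyOrder { id: number; }

interface BackfillDeps {
  getWatermark(): Promise<number>;                         // from SystemConfig
  setWatermark(sinceId: number): Promise<void>;
  fetchOrdersSince(sinceId: number, limit: number): Promise<ShopifyOrder[]>;
  orderExists(shopifyOrderId: number): Promise<boolean>;   // idempotency check
  createOrder(order: ShopifyOrder): Promise<void>;         // same mapping as webhooks
}

async function runBackfill(deps: BackfillDeps, batchSize = 50): Promise<number> {
  let created = 0;
  let sinceId = await deps.getWatermark();
  for (;;) {
    const orders = await deps.fetchOrdersSince(sinceId, batchSize);
    if (orders.length === 0) break;
    for (const order of orders) {
      if (!(await deps.orderExists(order.id))) {
        await deps.createOrder(order);
        created++;
      }
      sinceId = order.id;
    }
    // Advance the durable watermark only after the batch processed successfully,
    // so a crash mid-batch re-fetches (safely, thanks to idempotency).
    await deps.setWatermark(sinceId);
  }
  return created;
}
```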
Configuration¶
| Environment Variable | Default | Description |
|---|---|---|
| SHOPIFY_BACKFILL_ENABLED | true | Enable/disable scheduled backfill |
| SHOPIFY_BACKFILL_BATCH_SIZE | 50 | Orders to fetch per API call |
Consequences¶
Positive:
- ✅ Guarantees order recovery after extended downtime (not dependent on webhook retry window)
- ✅ Uses existing idempotency (no duplicates even with aggressive backfill)
- ✅ Operators can manually trigger backfill after incidents
- ✅ Observable via event logs and admin status endpoint
Negative:
- ⚠️ Adds Shopify API calls even during normal operation (rate limit aware)
- ⚠️ Does not reconstruct intermediate webhook events (e.g., multiple orders/updated during downtime)
- ⚠️ Initial backfill on an existing system may take time to paginate through history
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Only rely on Shopify retries | 4-hour window insufficient for extended outages |
| Event sourcing / webhook queue | Over-engineered for current scale; adds infrastructure |
| Manual import after incidents | Error-prone, delays recovery, requires operator intervention |
| Time-based polling (updated_at_min) | Harder to paginate reliably; since_id is simpler and more robust |
Related Documents¶
- ADR-011 (Idempotent Webhook Processing)
- ADR-033 (Database-Backed Webhook Idempotency)
- apps/api/src/shopify/shopify-backfill.service.ts
- apps/api/src/config/system-config.service.ts
ADR-041: SimplyPrint Webhook Idempotency and Job Reconciliation¶
| Attribute | Value |
|---|---|
| ID | ADR-041 |
| Status | Accepted |
| Date | 2026-01-22 |
| Context | SimplyPrint webhooks lacked idempotency; polling only detected PRINTING status, not completed/failed jobs |
Decision¶
Add database-backed webhook idempotency to SimplyPrint webhook handling and implement a job reconciliation service that periodically syncs print job statuses with SimplyPrint's API.
Webhook Idempotency:
- Reuse the existing WebhookIdempotencyRepository (same as Shopify)
- Deduplicate by webhook_id from the SimplyPrint payload
- Key format: simplyprint/{event} (e.g., simplyprint/job.started)
Job Reconciliation:
- Scheduled job runs every minute to check active print jobs
- Query all print jobs with simplyPrintJobId in non-terminal states (QUEUED, ASSIGNED, PRINTING)
- Compare local status with SimplyPrint's queue and printer states
- Emit JOB_STATUS_CHANGED events for discrepancies
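A minimal sketch of the dedup check, with a Set standing in for WebhookIdempotencyRepository. How the simplyprint/{event} source key combines with webhook_id is an assumption here, as is the function shape:

```typescript
// Sketch of database-backed webhook idempotency (ADR-041); names hypothetical.

interface SimplyPrintWebhook {
  webhook_id: string;
  event: string;
}

function idempotencyKey(hook: SimplyPrintWebhook): string {
  // ADR key format simplyprint/{event}, scoped by the payload's webhook_id
  return `simplyprint/${hook.event}:${hook.webhook_id}`;
}

async function processOnce(
  hook: SimplyPrintWebhook,
  seen: Set<string>, // stands in for WebhookIdempotencyRepository
  handler: (hook: SimplyPrintWebhook) => Promise<void>,
): Promise<boolean> {
  const key = idempotencyKey(hook);
  if (seen.has(key)) return false; // duplicate delivery: skip processing
  await handler(hook);
  seen.add(key); // record only after the handler succeeds
  return true;
}
```

Recording the key only after successful handling means a crashed handler lets the retry through, while a completed one suppresses duplicates.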
Rationale¶
- Webhook idempotency: SimplyPrint may retry webhooks on timeout; duplicate events could cause issues
- Existing polling was limited: Only detected PRINTING status via printer polling
- History-based terminal state detection: If COMPLETED/FAILED/CANCELLED webhooks are missed, the reconciliation service queries SimplyPrint's print history API (GET /{id}/jobs/Get) after a 5-minute grace period to detect terminal states automatically
- Hybrid approach: Webhooks for real-time updates + reconciliation for reliability (belt and suspenders)
Implementation¶
- SimplyPrintService.handleWebhook(): Added idempotency check using WebhookIdempotencyRepository
- SimplyPrintReconciliationService: New service with @Cron(EVERY_MINUTE) that reconciles job statuses
- SimplyPrintReconciliationService.handleMissingJob(): Two-step lookup for missing jobs — getJob() (GetDetails) then getJobHistory() (history list) — with a grace period (5 min), rate limiting (max 10/cycle), and escalation logging (30 min)
- SimplyPrintApiClient.getJobHistory(): Queries the print history endpoint to find completed/failed/cancelled jobs no longer in the queue or on a printer
- Direct Prisma access: Reconciliation uses PrismaService directly to avoid a circular dependency with PrintJobsModule
SimplyPrint Job ID Resolution¶
SimplyPrint uses three different identifiers for the same logical job:
| Identifier | Source | Format | When available |
|---|---|---|---|
| Queue-item created_id | AddItem response | Integer (e.g. 385029) | At queue time |
| Job uid | Webhooks, GetDetails | UUID (e.g. da69d2a4-...) | After job starts |
| Job numeric id | Webhooks | Integer (e.g. 552252) | After job starts |
When a job is queued, we store created_id as both simplyPrintJobId (mutable) and simplyPrintQueueItemId (persistent). When the first job.started webhook arrives, simplyPrintJobId is updated to the job UID for fast future lookups, but simplyPrintQueueItemId is never overwritten.
Lookup chain in PrintJobsService.handleSimplyPrintStatusChange():
- Primary: Find by simplyPrintJobId = webhook job UID
- Fallback 1: Find by simplyPrintJobId = webhook numeric job ID
- Fallback 2: Call the GetDetails API for the job's queued.id, then find by simplyPrintJobId = queued.id
- Fallback 3: Find by simplyPrintQueueItemId = queued.id (handles re-queued jobs where simplyPrintJobId was already overwritten with the first job's UID)
Fallback 3 includes a safety check: before adopting the matched print job, it verifies that the new job UID is not already linked to another print job in the database (prevents accidentally hijacking a webhook for a different order's job).
Re-queue scenario: When SimplyPrint cancels a job and the operator clears the bed, SimplyPrint revives the same queue item and creates a new job with a different UID but the same queued.id. Fallback 3 matches via simplyPrintQueueItemId and adopts the cancelled print job, updating its simplyPrintJobId to the new UID.
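The lookup chain can be sketched as follows. The repository shape, webhook fields, and the GetDetails call are simplified assumptions based on the description above; the Fallback 3 safety check is noted in a comment rather than modeled:

```typescript
// Sketch of the four-step job resolution chain (ADR-041); names hypothetical.

interface PrintJobRecord {
  id: string;
  simplyPrintJobId: string | null;
  simplyPrintQueueItemId: number | null;
}

interface LookupDeps {
  findByJobId(value: string): PrintJobRecord | undefined;
  findByQueueItemId(value: number): PrintJobRecord | undefined;
  fetchQueuedId(jobUid: string): number | undefined; // GetDetails API call
}

function resolvePrintJob(
  deps: LookupDeps,
  jobUid: string,
  numericId: number,
): PrintJobRecord | undefined {
  // Primary: stored job UID
  const byUid = deps.findByJobId(jobUid);
  if (byUid) return byUid;
  // Fallback 1: numeric job id stored as simplyPrintJobId
  const byNumeric = deps.findByJobId(String(numericId));
  if (byNumeric) return byNumeric;
  // Fallbacks 2 and 3 need the queue item id from GetDetails
  const queuedId = deps.fetchQueuedId(jobUid);
  if (queuedId === undefined) return undefined;
  // Fallback 2: created_id still stored as simplyPrintJobId
  const byCreatedId = deps.findByJobId(String(queuedId));
  if (byCreatedId) return byCreatedId;
  // Fallback 3: persistent queue item id (re-queued jobs). The real service
  // first verifies the new UID is not already linked to another print job.
  return deps.findByQueueItemId(queuedId);
}
```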
Configuration¶
| Environment Variable | Default | Description |
|---|---|---|
| SIMPLYPRINT_RECONCILIATION_ENABLED | true | Enable/disable scheduled reconciliation |
Consequences¶
Positive:
- ✅ Prevents duplicate event processing from webhook retries
- ✅ Catches missed PRINTING status changes via reconciliation
- ✅ Uses existing idempotency infrastructure (no new tables)
- ✅ Observable via event logs
- ✅ Re-queued jobs are automatically matched back to their original print job via the persistent simplyPrintQueueItemId
Negative:
- ⚠️ Terminal states (COMPLETED/FAILED/CANCELLED) are detected via the history lookup only after the 5-minute grace period, so webhooks remain the real-time source
- ⚠️ Adds API calls to SimplyPrint every minute (rate limit aware)
- ⚠️ Jobs "missing" from SimplyPrint are logged but not auto-updated (avoids incorrect state changes)
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Extend existing polling for all states | Printer polling only exposes active jobs; terminal states require the separate history lookup |
| Store last-seen status for comparison | Over-complicated; event emission on change is sufficient |
| Skip idempotency (rely on status check) | Status check is partial protection; true idempotency is safer |
Related Documents¶
- ADR-033 (Database-Backed Webhook Idempotency)
- ADR-040 (Shopify Order Backfill)
- apps/api/src/simplyprint/simplyprint.service.ts
- apps/api/src/simplyprint/simplyprint-reconciliation.service.ts
ADR-042: SendCloud Webhook Integration for Shipment Status Updates¶
| Attribute | Value |
|---|---|
| ID | ADR-042 |
| Status | Accepted |
| Date | 2026-01-22 |
| Context | Shipment statuses only updated at label creation; no visibility into transit/delivery state |
Decision¶
Implement SendCloud webhook receiver for real-time shipment status updates with HMAC-SHA256 signature verification and a reconciliation service for backfill.
Webhook Handling:
- New endpoint: POST /webhooks/sendcloud
- Verify the Sendcloud-Signature header using HMAC-SHA256
- Process parcel_status_changed events
- Database-backed idempotency using existing infrastructure
Status Mapping:
| SendCloud Status ID | ShipmentStatus |
|---|---|
| 1-10 | LABEL_CREATED |
| 11-99 | ANNOUNCED |
| 1000-1098 | IN_TRANSIT |
| 1100-1199 | CANCELLED |
| 1999, 2001+ | FAILED |
| 2000 | DELIVERED |
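The mapping table transcribes directly into a function. Status IDs outside the listed ranges are not specified by the table, so the UNKNOWN fallback here is an assumption:

```typescript
// Direct transcription of the SendCloud status mapping table (ADR-042).

type ShipmentStatus =
  | "LABEL_CREATED" | "ANNOUNCED" | "IN_TRANSIT"
  | "CANCELLED" | "FAILED" | "DELIVERED" | "UNKNOWN";

function mapSendcloudStatus(statusId: number): ShipmentStatus {
  if (statusId >= 1 && statusId <= 10) return "LABEL_CREATED";
  if (statusId >= 11 && statusId <= 99) return "ANNOUNCED";
  if (statusId >= 1000 && statusId <= 1098) return "IN_TRANSIT";
  if (statusId >= 1100 && statusId <= 1199) return "CANCELLED";
  if (statusId === 2000) return "DELIVERED";
  if (statusId === 1999 || statusId >= 2001) return "FAILED";
  return "UNKNOWN"; // IDs not covered by the table (assumption)
}
```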
Reconciliation:
- Scheduled job runs every 5 minutes
- Polls the SendCloud API (getParcel) for active shipments
- Updates status for any discrepancies found
Rationale¶
- Customer visibility: Users need to see when shipments are in transit, delivered, or failed
- Operational awareness: Operators need to know if shipments encounter problems. The shipmentStatus field on every order API response enables shipping status badges and dedicated shipping filters (Ready to Ship, In Transit, Delivered, Shipping Issues) on the orders list page.
- Existing UI ready: The ShippingInfo component already displays all statuses with color-coded badges. The orders list now shows a shipping badge (with truck icon) alongside the order status badge.
- Webhook reliability: SendCloud may retry webhooks; idempotency prevents duplicate processing
Implementation¶
- SendcloudWebhookGuard: Verifies HMAC-SHA256 signature
- SendcloudWebhookService: Processes status changes, maps statuses, updates shipments
- SendcloudReconciliationService: Polls SendCloud API every 5 minutes for active shipments
- IShipmentsService.findBySendcloudParcelId(): Added to interface for parcel ID lookups
- OrderResponseDto.shipmentStatus: Every order API response includes the associated shipment status (null if no shipment exists). The OrderQueryDto supports a shipmentStatus filter (by exact status) and a readyToShip boolean filter (completed orders with PENDING/LABEL_CREATED/ANNOUNCED shipments).
- ShipmentStatus enum: Shared via @forma3d/domain (libs/domain/src/enums/shipment-status.ts) for use across backend and frontend.
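The signature check in SendcloudWebhookGuard might look roughly like this sketch using Node's built-in crypto module. It assumes a hex-encoded HMAC-SHA256 digest of the raw request body; the NestJS guard wiring is omitted:

```typescript
// Sketch of HMAC-SHA256 webhook signature verification (ADR-042).
// Assumes the Sendcloud-Signature header carries a hex digest of the raw body.
import { createHmac, timingSafeEqual } from "node:crypto";

function verifySendcloudSignature(
  rawBody: string,
  signatureHeader: string,
  secret: string, // SENDCLOUD_WEBHOOK_SECRET
): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected, "utf8");
  const b = Buffer.from(signatureHeader, "utf8");
  // timingSafeEqual requires equal lengths; a length mismatch is a failed check
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Using timingSafeEqual instead of string equality avoids leaking signature prefixes through timing differences.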
Configuration¶
| Environment Variable | Default | Description |
|---|---|---|
| SENDCLOUD_WEBHOOK_SECRET | - | HMAC secret for signature verification (same as the API Secret Key) |
| SENDCLOUD_RECONCILIATION_ENABLED | true | Enable/disable scheduled reconciliation |
Consequences¶
Positive:
- ✅ Real-time shipment status updates in UI
- ✅ Automatic detection of delivered/failed shipments
- ✅ Backfill for missed webhooks via reconciliation
- ✅ Uses existing idempotency infrastructure
- ✅ Shipping status surfaced on the orders list via the shipmentStatus field — operators can filter by shipping status (Ready to Ship, In Transit, Delivered, Shipping Issues) and see badges on each order row
Negative:
- ⚠️ Requires webhook configuration in SendCloud panel
- ⚠️ Additional API calls for reconciliation (rate limit aware)
- ⚠️ Webhook secret must be configured for production security
Related Documents¶
- ADR-033 (Database-Backed Webhook Idempotency)
- ADR-041 (SimplyPrint Webhook Idempotency)
- apps/api/src/sendcloud/sendcloud-webhook.service.ts
- apps/api/src/sendcloud/sendcloud-reconciliation.service.ts
ADR-043: PWA Version Mismatch Detection¶
| Attribute | Value |
|---|---|
| ID | ADR-043 |
| Status | Accepted |
| Date | 2026-01-23 |
| Context | Users may run outdated PWA versions if they dismiss the update prompt or if the service worker hasn't yet detected updates |
Decision¶
Implement automatic version mismatch detection on the Settings page that compares the cached PWA version against the server version and triggers the service worker update prompt when they differ.
Rationale¶
Problem Statement¶
The PWA displays the frontend version in two places:
- Settings page - Shows version from cached /build-info.json
- Sidebar footer - Shows version from cached /build-info.json
When a new version is deployed:
- The service worker checks for updates hourly
- Users may have dismissed the "Update now" prompt
- The cached version can become stale
Users visiting the Settings page to check version information should be prompted to update if running an outdated version.
Solution¶
When the user navigates to the Settings page:
- Fetch /build-info.json from the server with cache-busting headers
- Compare the server version against the cached PWA version
- If the versions differ, call registration.update() on the service worker
- This triggers the "New version available!" prompt
Implementation¶
New Components¶
| Component | Path | Description |
|---|---|---|
| ServiceWorkerContext | apps/web/src/contexts/service-worker-context.tsx | Centralized SW state management, exposes checkForUpdates() |
| useServerVersion | apps/web/src/hooks/use-server-version.ts | Fetches fresh version with cache-busting |
| useVersionMismatchCheck | apps/web/src/hooks/use-version-mismatch-check.ts | Compares versions, triggers update on mismatch |
Architecture¶
User visits Settings page
│
▼
useVersionMismatchCheck({ checkOnMount: true })
│
▼
fetch('/build-info.json?_=timestamp', { cache: 'no-store' })
│
▼
Compare serverVersion vs cachedVersion
│
▼ (if different)
serviceWorkerContext.checkForUpdates()
│
▼
registration.update() detects new SW
│
▼
needRefresh = true → Update prompt shown
Cache-Busting Strategy¶
const response = await fetch(`/build-info.json?_=${Date.now()}`, {
cache: 'no-store',
headers: {
'Cache-Control': 'no-cache, no-store, must-revalidate',
Pragma: 'no-cache',
},
});
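The comparison step that consumes this fetch can be sketched with injected dependencies. The real logic lives in useVersionMismatchCheck and ServiceWorkerContext; this plain-function shape is illustrative:

```typescript
// Sketch of the version mismatch check (ADR-043); dependency shape hypothetical.

interface VersionCheckDeps {
  cachedVersion: string;                  // from the bundled build-info
  fetchServerVersion(): Promise<string>;  // cache-busting fetch of /build-info.json
  checkForUpdates(): Promise<void>;       // ServiceWorkerContext -> registration.update()
}

async function checkVersionMismatch(deps: VersionCheckDeps): Promise<boolean> {
  const serverVersion = await deps.fetchServerVersion();
  if (serverVersion === deps.cachedVersion) return false;
  // Versions differ: ask the service worker to check for the new build,
  // which surfaces the "New version available!" prompt.
  await deps.checkForUpdates();
  return true;
}
```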
Consequences¶
Positive:
- ✅ Users are reliably prompted to update when viewing version info
- ✅ Works even if previous update prompt was dismissed
- ✅ No polling overhead - only checks when user visits Settings
- ✅ Centralized service worker state via React Context
- ✅ Reusable hooks for future version-aware features
Negative:
- ⚠️ Extra network request on Settings page load
- ⚠️ Relies on service worker being registered
Related Documents¶
- ADR-035 (Progressive Web App for Cross-Platform Access)
- ADR-036 (localStorage Fallback for PWA Install Detection)
- apps/web/src/pwa/sw-update-prompt.tsx
- apps/web/src/pages/settings/index.tsx
ADR-044: Role-Based Access Control and Tenant-Ready Architecture¶
| Attribute | Value |
|---|---|
| ID | ADR-044 |
| Status | Accepted |
| Date | 2026-01-24 |
| Context | Need to implement multi-user authentication with role-based access control, while preparing for future multi-tenancy |
Decision¶
Implement in-app RBAC and tenant-ready data isolation without external identity providers (no Keycloak/OpenID Connect yet).
Key Decisions¶
- Session-Based Authentication
  - HTTP-only cookies with express-session and a PostgreSQL session store
  - Argon2id password hashing with automatic rehashing
  - Legacy API key authentication preserved for backward compatibility
- Permission-Based Authorization
  - Permissions are string constants (e.g., orders.read, orders.write)
  - Roles are named bundles of permissions (e.g., admin, operator, viewer)
  - Users can have multiple roles; effective permissions = union of all role permissions
  - Server-side enforcement via NestJS guards (SessionGuard, PermissionsGuard)
- Tenant-Ready Data Model
  - All tenant-owned entities include a tenantId foreign key
  - Repositories enforce tenant scoping in all queries
  - Single default tenant (00000000-0000-0000-0000-000000000001) for current operations
  - Architecture supports future multi-tenant expansion
- Security Auditing
  - AuditLog table captures security-relevant actions
  - Actor identity and tenant context attached to Sentry error reports
  - No logging of sensitive data (passwords, tokens, API keys)
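The union semantics for effective permissions can be sketched in a few lines; the data shapes are simplified stand-ins for the RBAC tables, not the Prisma models:

```typescript
// Sketch of effective-permission resolution (ADR-044): union across all roles.

interface Role {
  name: string;
  permissions: string[]; // string constants such as "orders.read"
}

function effectivePermissions(roles: Role[]): Set<string> {
  const perms = new Set<string>();
  for (const role of roles) {
    for (const p of role.permissions) perms.add(p);
  }
  return perms;
}

function hasPermission(roles: Role[], required: string): boolean {
  return effectivePermissions(roles).has(required);
}
```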
Database Schema Additions¶
-- Core RBAC tables
Tenant, User, Role, Permission, UserRole, RolePermission, Session, AuditLog
-- Tenant scoping on all existing tables
Order, LineItem, PrintJob, ProductMapping, AssemblyPart, Shipment, EventLog, etc.
Implementation Details¶
Backend (NestJS)¶
- SessionGuard: Global guard that validates sessions or falls back to API key
- PermissionsGuard: Route-level guard that checks required permissions
- @CurrentUser(): Decorator to inject the authenticated user into controllers
- @RequirePermissions(): Decorator to specify required permissions
- @Public(): Decorator to mark routes as public (bypass authentication)
- TenantContextService: Request-scoped service providing tenant context
- AuditService: Centralized security audit logging
Frontend (React)¶
- AuthContext: Provides user state, login/logout, permission checks
- usePermissions(): Hook for permission-based UI rendering
- ProtectedRoute: Redirects unauthenticated users to login
- PermissionGatedRoute: Hides routes based on permissions
User Management UI¶
- Location: Settings page → Administration section (visible to users with the users.read permission)
- Route: /admin/users (requires the users.read permission to access)
- Components:
  - UserFormModal: Create/edit users with email, password, and role selection
  - ChangePasswordModal: Change password for existing users
  - UsersPage: User list with search, filtering, and CRUD operations
- Features:
  - Create new users with email, password, and role assignment
  - Edit existing user email and roles
  - Change user passwords (separate modal for security)
  - Deactivate/reactivate users (soft delete pattern)
  - Role selection with visual indicators for selected roles
  - Permission-gated UI (actions hidden if the user lacks users.write)
Default Roles¶
| Role | Description | Permissions |
|---|---|---|
| admin | Full system access | All permissions |
| operator | Day-to-day operations | Orders, print jobs, mappings, shipments, logs (read/write) |
| viewer | Read-only access | View-only access to operational data |
| legacy-admin | API key compatibility | All permissions (deprecated) |
Consequences¶
Positive:
- ✅ Multiple users can sign in with different access levels
- ✅ Server-side permission enforcement (not UI-only security)
- ✅ Audit trail for security-relevant actions
- ✅ Architecture ready for future multi-tenancy
- ✅ Backward compatibility with existing API key integrations
- ✅ Sentry error reports enriched with user/tenant context
Negative:
- ⚠️ Session management adds infrastructure complexity
- ⚠️ All repositories needed updates for tenant scoping
- ⚠️ Coverage thresholds temporarily lowered for new modules
Migration Path¶
- Run Prisma migration to add RBAC and tenant tables
- Run seed script to create default tenant, roles, permissions, and admin user
- Existing data migrated to default tenant
- Legacy API key authentication continues to work during transition
Related Documents¶
- ADR-024 (API Key Authentication for Admin Endpoints)
- ADR-029 (API Key Authentication for Dashboard)
- apps/api/src/auth module
- apps/api/src/audit module
- apps/api/src/tenancy module
- apps/api/src/users module
ADR-045: pgAdmin for Staging Database Administration¶
| Attribute | Value |
|---|---|
| ID | ADR-045 |
| Status | Accepted |
| Date | 2026-01-24 |
| Context | Need a web-based interface to inspect, query, and manage the PostgreSQL staging database |
Decision¶
Deploy pgAdmin 4 as a Docker container in the staging environment, exposed via Traefik with TLS.
Rationale¶
- Official PostgreSQL tool: pgAdmin is the official GUI administration tool for PostgreSQL
- Web-based access: No need to install desktop software or configure VPN/SSH tunnels
- Full SQL capabilities: Execute queries, view data, manage schemas, backup/restore
- Secure access: TLS via Let's Encrypt, separate credentials from database credentials
- No database exposure: Database remains inaccessible from the internet; pgAdmin connects internally via Docker network
Implementation¶
| Component | Value |
|---|---|
| Container Image | dpage/pgadmin4:latest |
| Subdomain | staging-connect-db.forma3d.be |
| Docker Network | forma3d-network (internal) |
| Data Persistence | pgadmin-data Docker volume |
| TLS Certificate | Auto-provisioned via Let's Encrypt |
Environment Variables¶
| Variable | Description | Secret? |
|---|---|---|
| PGADMIN_DEFAULT_EMAIL | Login email for pgAdmin | No |
| PGADMIN_DEFAULT_PASSWORD | Login password for pgAdmin | Yes |
Usage¶
- Navigate to https://staging-connect-db.forma3d.be
- Log in with PGADMIN_DEFAULT_EMAIL and PGADMIN_DEFAULT_PASSWORD
- Add a new server connection:
  - Name: Forma3D Staging
  - Host: Database hostname from DATABASE_URL (DigitalOcean managed PostgreSQL hostname)
  - Port: 25060 (DigitalOcean managed PostgreSQL port)
  - Database: defaultdb (or your database name)
  - Username/Password: From DATABASE_URL
  - SSL Mode: Require (set in Connection > SSL tab)
Security Considerations¶
- pgAdmin credentials are separate from database credentials
- Database credentials are entered manually in pgAdmin (not stored in environment)
- Enhanced cookie protection enabled
- Access is restricted to those who know the pgAdmin login credentials
- TLS encrypts all traffic
Consequences¶
Positive:
- ✅ Easy database inspection without SSH access
- ✅ Web-based access from any device
- ✅ Full SQL query capabilities
- ✅ Visual schema exploration
- ✅ Data export/import capabilities
Negative:
- ⚠️ Additional attack surface (mitigated by strong password + TLS)
- ⚠️ Resource overhead (minimal - pgAdmin is lightweight)
- ⚠️ Users must manually configure the database server connection
Related Documents¶
- deployment/staging/docker-compose.yml
- deployment/staging/env.staging.template
- docs/05-deployment/staging-deployment-guide.md
ADR-046: PostgreSQL Session Store for Persistent Authentication¶
| Attribute | Value |
|---|---|
| ID | ADR-046 |
| Status | Accepted |
| Date | 2026-01-26 |
| Context | User sessions were lost on server restarts, causing frequent re-authentication during deployments |
Decision¶
Replace the default in-memory session store with PostgreSQL-backed sessions using connect-pg-simple, and extend session duration from 24 hours to 7 days.
Rationale¶
The default express-session in-memory store has critical limitations:
| Problem | Impact | Solution |
|---|---|---|
| Sessions lost on restart | Users logged out during every deployment | PostgreSQL persistence |
| No session sharing | Cannot scale to multiple API instances | Shared database store |
| Short session duration | Users had to re-login frequently | Extended to 7 days |
| Memory consumption | Sessions consume server RAM | Offloaded to database |
Implementation¶
Package: connect-pg-simple with @types/connect-pg-simple
Migration: prisma/migrations/20260126000000_add_session_store/migration.sql
CREATE TABLE "session" (
"sid" varchar NOT NULL COLLATE "default",
"sess" json NOT NULL,
"expire" timestamp(6) NOT NULL
);
ALTER TABLE "session" ADD CONSTRAINT "session_pkey" PRIMARY KEY ("sid");
CREATE INDEX "IDX_session_expire" ON "session" ("expire");
Configuration: apps/api/src/main.ts
| Setting | Value | Description |
|---|---|---|
| store | PgSession | PostgreSQL-backed session store |
| tableName | session | Database table for sessions |
| pruneSessionInterval | 3600 (1 hour) | Expired session cleanup interval |
| maxAge | 7 days (configurable) | Session cookie lifetime |
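The wiring in main.ts might look roughly like this configuration sketch. It is an illustration under the settings in the table above, not the actual file; option names follow express-session and connect-pg-simple:

```typescript
// Configuration sketch for the PostgreSQL session store (ADR-046).
import session from "express-session";
import connectPgSimple from "connect-pg-simple";
import { Pool } from "pg";

// Builds the session middleware; the caller passes it to app.use(...).
export function buildSessionMiddleware() {
  const PgSession = connectPgSimple(session);
  const maxAgeDays = Number(process.env.SESSION_MAX_AGE_DAYS ?? "7");
  return session({
    store: new PgSession({
      pool: new Pool({ connectionString: process.env.DATABASE_URL }),
      tableName: "session",       // matches the migration above
      pruneSessionInterval: 3600, // seconds between expired-session sweeps
    }),
    secret: process.env.SESSION_SECRET as string,
    resave: false,
    saveUninitialized: false,
    cookie: {
      httpOnly: true,
      maxAge: maxAgeDays * 24 * 60 * 60 * 1000, // 7 days by default
    },
  });
}
```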
Environment Variables¶
| Variable | Default | Description |
|---|---|---|
| SESSION_SECRET | (required) | Secret key for signing session cookies |
| SESSION_MAX_AGE_DAYS | 7 | Session duration in days |
Session Lifecycle¶
User Login → Session created in PostgreSQL → Cookie sent to browser
↓
Browser Request → Cookie validated → Session loaded from PostgreSQL
↓
Session Expires → Pruned by hourly cleanup job
Consequences¶
Positive:
- ✅ Sessions survive server restarts and deployments
- ✅ Sessions shared across multiple API instances (horizontal scaling ready)
- ✅ 7-day sessions reduce login friction for users
- ✅ Automatic cleanup of expired sessions (no manual maintenance)
- ✅ No additional infrastructure (uses existing PostgreSQL)
Negative:
- ⚠️ Slight latency increase for session lookups (negligible with connection pooling)
- ⚠️ Database storage for sessions (minimal - each session ~1-2KB)
- ⚠️ Migration required on existing deployments
Related Documents¶
- apps/api/src/main.ts - Session configuration
- prisma/migrations/20260126000000_add_session_store/migration.sql - Database schema
- .env.example - Environment variable documentation
- deployment/staging/docker-compose.yml - Container configuration
ADR-047: Three-Tier Logging Strategy (Application + Business Events + Sentry Logs)¶
| Attribute | Value |
|---|---|
| ID | ADR-047 |
| Status | Superseded by ADR-058 (Sentry Logs tier replaced by ClickHouse + Grafana) |
| Date | 2026-01-27 |
| Context | Need comprehensive observability with different log types for debugging vs. compliance/audit |
Decision¶
Implement a three-tier logging strategy that separates application logs from business event logs, with Sentry Logs for centralized visibility:
| Tier | Storage | Purpose | Examples |
|---|---|---|---|
| Application Logs | Pino (stdout) | Debugging, performance | HTTP requests, service calls |
| Business Events | PostgreSQL (EventLog) + Sentry Logs | Business audit trail | Order created, shipment status changed |
| Security Audit | PostgreSQL (AuditLog) + Sentry Logs | Compliance, security | Login success/failure, permission denied |
| Sentry Logs | Sentry (cloud) | Centralized visibility | All business + audit events in one place |
Rationale¶
Different log types serve different purposes:
| Concern | Application Logs | Business Events | Audit Logs | Sentry Logs |
|---|---|---|---|---|
| Retention | Short (days/weeks) | Long (months/years) | Regulatory (years) | 30 days (configurable) |
| Query needs | Full-text search | Structured filtering | Compliance reporting | Real-time search |
| Access control | DevOps/Developers | Business users | Administrators only | DevOps team |
| Storage cost | High volume, low cost | Moderate volume | Low volume, high value | Included in Sentry plan |
Why Sentry Logs?
- Single pane of glass: View errors, traces, and logs in one place
- No additional tooling: Already using Sentry for error tracking
- Structured attributes: Filter by orderId, eventType, userId, etc.
- Real-time: Logs appear immediately for debugging
- Cost-effective: Included in existing Sentry subscription (with limits)
Implementation¶
Application Logging (Pino via nestjs-pino):
- Configured in `apps/api/src/observability/observability.module.ts`
- Environment-based formatting (pretty dev, JSON prod)
- Automatic request/response logging via interceptors
- Redacts sensitive fields (passwords, tokens, cookies)
Business Event Logging (EventLogService):
- Stored in the `EventLog` PostgreSQL table
- Structured metadata with orderId, printJobId associations
- Severity levels: INFO, WARNING, ERROR
- Triple output: Database + Application logger + Sentry Logs
Security Audit Logging (AuditService):
- Stored in the `AuditLog` PostgreSQL table
- Captures actor, action, target, IP address, user agent
- Tenant-scoped for multi-tenancy support
- Admin-only access via the `audit.read` permission
- Also sent to Sentry Logs for real-time visibility
Sentry Logs Integration (SentryLoggerService):
- Wrapper around the `Sentry.logger` API
- Centralized service in `apps/api/src/observability/services/sentry-logger.service.ts`
- Structured attributes for filtering (eventType, orderId, userId, etc.)
- Automatic integration with EventLogService and AuditService
- View in Sentry: Explore > Logs
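To illustrate the wrapper idea, the following hedged sketch decouples the structured-attribute logic from a concrete `Sentry.logger`-style sink, which keeps it unit-testable. All names here are hypothetical, not the actual `SentryLoggerService`:

```typescript
// Illustrative sketch of a structured-logging wrapper (hypothetical names).
// A LoggerSink is anything with a Sentry.logger-like info/warn/error surface.
type LogAttributes = Record<string, string | number | boolean | undefined>;

interface LoggerSink {
  info(message: string, attributes?: LogAttributes): void;
  warn(message: string, attributes?: LogAttributes): void;
  error(message: string, attributes?: LogAttributes): void;
}

class StructuredLogger {
  constructor(
    private readonly sink: LoggerSink,
    private readonly base: LogAttributes = {}, // shared attrs, e.g. service name
  ) {}

  businessEvent(eventType: string, attrs: LogAttributes): void {
    // Business events carry eventType plus associations (orderId, etc.).
    this.sink.info(`event ${eventType}`, { ...this.base, eventType, ...attrs });
  }

  auditEvent(action: string, success: boolean, attrs: LogAttributes): void {
    // Failed audit actions are surfaced at warn level for visibility.
    const log = success ? this.sink.info : this.sink.warn;
    log.call(this.sink, `audit ${action}`, { ...this.base, action, success, ...attrs });
  }
}
```

Injecting the sink (rather than calling `Sentry.logger` directly everywhere) is what lets EventLogService and AuditService share one integration point.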
Event Types¶
Business Events (EventLog):
| Event Type | Severity | Trigger |
|---|---|---|
| `order.created` | INFO | Shopify webhook creates order |
| `order.status_changed` | INFO | Order status transition |
| `order.cancelled` | WARNING | Order cancellation |
| `printjob.created` | INFO | Print job created in SimplyPrint |
| `printjob.status_changed` | INFO/ERROR | SimplyPrint status update |
| `shipment.created` | INFO | Shipment record created |
| `shipment.status_changed` | INFO/WARNING | Sendcloud status update |
| `shipment.tracking_updated` | INFO | Tracking number assigned |
| `shipment.label_generated` | INFO | Shipping label created |
| `shipment.cancelled` | WARNING | Shipment cancellation |
Audit Events (AuditLog):
| Action | Success | Trigger |
|---|---|---|
| `auth.login.success` | true | User successfully logged in |
| `auth.login.failure` | false | Invalid credentials |
| `auth.logout` | true | User logged out |
| `permission.denied` | false | Access denied to protected resource |
| `user.created` | true | New user account created |
| `user.updated` | true | User profile updated |
| `password.changed` | true | Password changed |
API Endpoints¶
| Endpoint | Permission | Purpose |
|---|---|---|
| `GET /api/v1/logs` | `logs.read` | View business event logs |
| `GET /api/v1/audit-logs` | `audit.read` | View security audit logs (Admin only) |
UI Access¶
- Activity Logs: Sidebar → Activity Logs
- Audit Logs: Settings → Administration → Audit Logs
Consequences¶
Positive:
- ✅ Clear separation of concerns (debugging vs. compliance vs. security)
- ✅ Business events queryable by order/print job for troubleshooting
- ✅ Audit logs provide compliance trail for security reviews
- ✅ Structured metadata enables powerful filtering
- ✅ Admin-only audit access protects sensitive security data
Negative:
- ⚠️ Three separate log stores to maintain
- ⚠️ Potential for inconsistency if logging calls are missed
- ⚠️ Database storage for events (mitigated by pruning/archival)
Related Documents¶
- `apps/api/src/observability/observability.module.ts` - Pino configuration
- `apps/api/src/observability/services/sentry-logger.service.ts` - Sentry Logs integration
- `apps/api/src/event-log/event-log.service.ts` - Business event logging
- `apps/api/src/audit/audit.service.ts` - Security audit logging
- `apps/api/src/audit/audit.controller.ts` - Audit logs API endpoint
- `apps/web/src/pages/admin/audit-logs/index.tsx` - Audit logs UI
- Sentry Dashboard: Explore > Logs
ADR-048: Shopify OAuth 2.0 Authentication¶
| Attribute | Value |
|---|---|
| ID | ADR-048 |
| Status | Implemented |
| Date | 2026-01-28 |
| Context | Shopify deprecated legacy custom apps for merchants (January 1, 2026). New merchant stores require OAuth-authenticated apps. |
Decision¶
Implement Shopify OAuth 2.0 Authorization Code Grant flow for app installation and authentication, replacing the static access token approach.
Rationale¶
- Production requirement: As of January 2026, Shopify merchants can only install OAuth-authenticated apps
- Multi-shop support: OAuth enables connecting multiple shops per tenant
- Token refresh: Offline access tokens have 90-day expiry with refresh capability
- Security: Tokens encrypted at rest using AES-256-GCM
- Backward compatibility: Legacy static token mode preserved for development/testing
Implementation¶
Database Schema:
model ShopifyShop {
id String @id @default(uuid())
tenantId String
shopDomain String // e.g., "example.myshopify.com"
accessToken String // Encrypted OAuth access token
tokenType String @default("offline")
scopes String[]
expiresAt DateTime?
refreshToken String?
installedAt DateTime @default(now())
uninstalledAt DateTime?
isActive Boolean @default(true)
tenant Tenant @relation(...)
@@unique([tenantId, shopDomain])
}
OAuth Flow Endpoints:
| Endpoint | Purpose |
|---|---|
| `GET /shopify/oauth/authorize?shop=xxx` | Initiate OAuth flow, redirect to Shopify consent |
| `GET /shopify/oauth/callback` | Exchange authorization code for token |
| `POST /shopify/oauth/uninstall` | Handle app uninstallation webhook |
| `GET /shopify/oauth/shops` | List connected shops for tenant |
| `DELETE /shopify/oauth/shops/:domain` | Disconnect a shop |
| `GET /shopify/oauth/status` | Check OAuth/legacy configuration status |
Security Measures:
- HMAC verification on all OAuth callbacks (timing-safe comparison)
- State parameter with cryptographic nonce (CSRF protection)
- Token encryption at rest (AES-256-GCM with unique IV per token)
- Automatic token refresh 24 hours before expiry
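As an illustration of the timing-safe HMAC verification, the sketch below uses only Node's `crypto` module. The query-parameter canonicalization is simplified here; the real implementation follows Shopify's OAuth documentation:

```typescript
// Illustrative timing-safe HMAC check for an OAuth callback.
// Canonicalization is simplified; consult Shopify's docs for the exact rules.
import { createHmac, timingSafeEqual } from 'node:crypto';

function verifyCallbackHmac(params: Record<string, string>, secret: string): boolean {
  const { hmac, ...rest } = params;
  if (!hmac) return false;
  // Sort remaining params and join as key=value pairs.
  const message = Object.keys(rest)
    .sort()
    .map((k) => `${k}=${rest[k]}`)
    .join('&');
  const expected = createHmac('sha256', secret).update(message).digest('hex');
  const a = Buffer.from(expected, 'hex');
  const b = Buffer.from(hmac, 'hex');
  // timingSafeEqual throws on length mismatch, so guard the lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

The constant-time comparison matters because a naive `===` on hex strings can leak how many leading characters matched, enabling byte-by-byte forgery attempts.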
Configuration:
# OAuth mode (required for production)
SHOPIFY_API_KEY=<client-id>
SHOPIFY_API_SECRET=<client-secret>
SHOPIFY_APP_URL=https://connect-api.forma3d.be
SHOPIFY_SCOPES=read_orders,write_orders,read_products,write_products,read_fulfillments,write_fulfillments,read_inventory,read_merchant_managed_fulfillment_orders,write_merchant_managed_fulfillment_orders
SHOPIFY_TOKEN_ENCRYPTION_KEY=<64-hex-char-key>
# Legacy mode (optional, for development)
SHOPIFY_SHOP_DOMAIN=forma3d-dev.myshopify.com
SHOPIFY_ACCESS_TOKEN=shpat_xxx
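For illustration, token encryption at rest with AES-256-GCM and a unique IV per token can be sketched with Node's `crypto` module. The 64-hex-char key decodes to the 32 bytes AES-256 requires; the `iv:tag:ciphertext` storage format is an assumption, not necessarily what `shopify-token.service.ts` does:

```typescript
// Sketch of AES-256-GCM token encryption (storage format is an assumption).
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

function encryptToken(plain: string, hexKey: string): string {
  const key = Buffer.from(hexKey, 'hex'); // 64 hex chars -> 32 bytes (AES-256)
  const iv = randomBytes(12);             // fresh 96-bit IV per token
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plain, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();        // GCM authentication tag
  // Persist iv:tag:ciphertext together so decryption is self-describing.
  return [iv, tag, ciphertext].map((b) => b.toString('hex')).join(':');
}

function decryptToken(stored: string, hexKey: string): string {
  const [iv, tag, ciphertext] = stored.split(':').map((h) => Buffer.from(h, 'hex'));
  const decipher = createDecipheriv('aes-256-gcm', Buffer.from(hexKey, 'hex'), iv);
  decipher.setAuthTag(tag); // decryption fails if token or tag was tampered with
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString('utf8');
}
```

GCM's authentication tag means a tampered ciphertext fails to decrypt at all, rather than silently yielding garbage credentials.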
API Client Modes:
The ShopifyApiClient supports both modes:
// Legacy mode (static token)
await shopifyClient.createFulfillment(orderId, data);
// OAuth mode (per-shop token)
await shopifyClient.createFulfillmentForShop(tenantId, shopDomain, orderId, data);
Consequences¶
Positive:
- ✅ Production-ready for merchant app installations
- ✅ Multi-shop support enables B2B scenarios
- ✅ Automatic token refresh prevents authentication failures
- ✅ Encrypted tokens protect against database leaks
- ✅ Backward compatible - existing deployments continue working
Negative:
- ⚠️ Additional complexity for OAuth flow
- ⚠️ Requires additional environment variables for production
- ⚠️ Token refresh failures need monitoring
Related Documents¶
- `apps/api/src/shopify/shopify-oauth.controller.ts` - OAuth endpoints
- `apps/api/src/shopify/shopify-oauth.service.ts` - OAuth flow logic
- `apps/api/src/shopify/shopify-token.service.ts` - Token management/encryption
- `apps/api/src/shopify/shopify-shop.repository.ts` - Database access
- `prisma/migrations/20260128000000_add_shopify_oauth/` - Database migration
- Shopify OAuth Documentation
Document History¶
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-01-10 | AI Assistant | Initial ADR document with 13 decisions |
| 1.1 | 2026-01-10 | AI Assistant | Updated ADR-006 for Digital Ocean hosting, added ADR-014 for SimplyPrint |
| 1.2 | 2026-01-10 | AI Assistant | Added ADR-015 for Aikido Security Platform |
| 1.3 | 2026-01-10 | AI Assistant | Added ADR-016 for Sentry Observability with OpenTelemetry |
| 1.4 | 2026-01-10 | AI Assistant | Marked ADR-016 as implemented, added implementation details |
| 1.5 | 2026-01-10 | AI Assistant | Added ADR-017 for Docker + Traefik Deployment Strategy |
| 1.6 | 2026-01-11 | AI Assistant | Added ADR-018 for Nx Affected Conditional Deployment Strategy |
| 1.7 | 2026-01-13 | AI Assistant | Phase 2 updates: Updated ADR-008 with implemented events, added ADR-019 (SimplyPrint Webhook Verification), ADR-020 (Hybrid Status Monitoring) |
| 1.8 | 2026-01-14 | AI Assistant | Phase 3 updates: Added ADR-021 (Retry Queue), ADR-022 (Event-Driven Fulfillment), ADR-023 (Email Notifications) |
| 1.9 | 2026-01-14 | AI Assistant | Security update: Added ADR-024 (API Key Authentication for Admin Endpoints) |
| 2.0 | 2026-01-14 | AI Assistant | Supply chain security: Added ADR-025 (Cosign Image Signing) |
| 2.1 | 2026-01-14 | AI Assistant | Phase 4 updates: Added ADR-027 (TanStack Query), ADR-028 (Socket.IO Real-Time), ADR-029 (Dashboard Authentication) |
| 2.2 | 2026-01-16 | AI Assistant | SBOM attestations: Added ADR-026 (CycloneDX SBOM Attestations with Syft) |
| 2.3 | 2026-01-16 | AI Assistant | Phase 5 updates: Added ADR-030 (Sendcloud for Shipping Integration) |
| 2.4 | 2026-01-16 | AI Assistant | Registry cleanup: Added ADR-031 (Automated Container Registry Cleanup) |
| 2.5 | 2026-01-17 | AI Assistant | Domain boundary separation: Added ADR-032 (Domain Boundary Separation with Interface Contracts) |
| 2.6 | 2026-01-17 | AI Assistant | Critical tech debt resolution: Added ADR-033 (Database-Backed Webhook Idempotency) |
| 2.7 | 2026-01-19 | AI Assistant | Infrastructure hardening: Added ADR-034 (Docker Log Rotation & Resource Cleanup) |
| 2.8 | 2026-01-19 | AI Assistant | Cross-platform strategy: Added ADR-035 (PWA replaces Tauri/Capacitor native apps) |
| 2.9 | 2026-01-20 | AI Assistant | PWA detection: Added ADR-036 (localStorage Fallback for PWA Install Detection) |
| 3.0 | 2026-01-20 | AI Assistant | Documentation: Added ADR-037 (Keep a Changelog for Release Documentation) |
| 3.1 | 2026-01-21 | AI Assistant | Documentation: Added ADR-038 (Zensical for publishing project documentation from docs/) |
| 3.2 | 2026-01-22 | AI Assistant | Resilience: Added ADR-040 (Shopify Order Backfill for Downtime Recovery) |
| 3.3 | 2026-01-22 | AI Assistant | Resilience: Added ADR-041 (SimplyPrint Webhook Idempotency and Job Reconciliation) |
| 3.4 | 2026-01-22 | AI Assistant | Feature: Added ADR-042 (SendCloud Webhook Integration for Shipment Status Updates) |
| 3.5 | 2026-01-23 | AI Assistant | PWA enhancement: Added ADR-043 (PWA Version Mismatch Detection on Settings page) |
| 3.6 | 2026-01-24 | AI Assistant | Security: Added ADR-044 (Role-Based Access Control and Tenant-Ready Architecture) |
| 3.7 | 2026-01-25 | AI Assistant | Feature: Updated ADR-044 with User Management UI implementation details |
| 3.8 | 2026-01-25 | AI Assistant | Documentation: Updated ADR-008 with complete event catalog (Order, PrintJob, Orchestration, SimplyPrint, Shipment, SendCloud, Fulfillment events); all ADRs have Status field indicating implementation |
| 3.9 | 2026-01-24 | AI Assistant | Infrastructure: Added ADR-045 (pgAdmin for Staging Database Administration) |
| 4.0 | 2026-01-26 | AI Assistant | Session management: Added ADR-046 (PostgreSQL Session Store for Persistent Authentication) |
| 4.1 | 2026-01-27 | AI Assistant | Observability: Added ADR-047 (Three-Tier Logging Strategy with Application, Business Events, and Audit logs) |
| 4.2 | 2026-01-28 | AI Assistant | Authentication: Added ADR-048 (Shopify OAuth 2.0 Authentication for production merchant stores) |
| 4.3 | 2026-02-07 | AI Assistant | Data model: Added ADR-049 (Optional SKU with Shopify Product/Variant ID Matching Priority) |
ADR-049: Optional SKU with Shopify Product/Variant ID Matching Priority¶
| Attribute | Value |
|---|---|
| ID | ADR-049 |
| Status | Implemented |
| Date | 2026-02-07 |
| Context | Shopify product variants may have null/empty SKUs. Merchants do not always configure SKUs, making SKU-only matching unreliable for linking incoming orders to product mappings. |
Decision¶
Make the sku field optional on the ProductMapping model and implement a product/variant ID-first matching strategy with SKU as fallback.
Rationale¶
- Shopify reality: SKU is optional on Shopify variants — many merchants leave it empty
- Reliable matching: Shopify Product ID and Variant ID are always present on order line items and are immutable identifiers
- Backward compatible: Existing mappings with SKUs continue to work via the fallback path
- No data loss: SKU remains as a display/search field when available; PostgreSQL treats multiple NULL SKUs as distinct for the unique constraint
Matching Priority¶
1. Exact match: `shopifyProductId` + `shopifyVariantId` → specific variant mapping
2. Product-level catch-all: `shopifyProductId` only (variant ID is null on the mapping) → applies to all variants of that product
3. SKU fallback: `sku` match → legacy path for backward compatibility
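The matching priority can be sketched as a pure function. Types and names below are illustrative, not the actual `ProductMappingsService` code:

```typescript
// Illustrative sketch of the three-step matching priority (hypothetical types).
interface Mapping {
  id: string;
  sku: string | null;
  shopifyProductId: string | null;
  shopifyVariantId: string | null;
}

interface LineItem {
  sku: string | null;
  shopifyProductId: string | null;
  shopifyVariantId: string | null;
}

function findMappingForLineItem(item: LineItem, mappings: Mapping[]): Mapping | undefined {
  // 1. Exact product + variant match (most specific wins).
  const exact = mappings.find(
    (m) =>
      m.shopifyProductId !== null &&
      m.shopifyProductId === item.shopifyProductId &&
      m.shopifyVariantId !== null &&
      m.shopifyVariantId === item.shopifyVariantId,
  );
  if (exact) return exact;

  // 2. Product-level catch-all: mapping has a product ID but no variant ID.
  const catchAll = mappings.find(
    (m) =>
      m.shopifyProductId !== null &&
      m.shopifyProductId === item.shopifyProductId &&
      m.shopifyVariantId === null,
  );
  if (catchAll) return catchAll;

  // 3. Legacy SKU fallback (only when the line item actually has a SKU).
  if (item.sku) return mappings.find((m) => m.sku === item.sku);
  return undefined;
}
```

Ordering the checks from most to least specific is what makes a variant-specific mapping win over a product-level one, as noted in the Consequences below.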
Implementation¶
Schema Changes:
model ProductMapping {
sku String? // Was: String — now nullable
// shopifyProductId and shopifyVariantId remain as before
}
model LineItem {
shopifyProductId String? // NEW — stored from webhook payload
shopifyVariantId String? // NEW — stored from webhook payload
@@index([shopifyProductId])
}
Service Changes:
- `ProductMappingsRepository.findByShopifyProduct(productId, variantId?)` — new method for ID-based lookup
- `ProductMappingsService.findUnmappedLineItems()` — replaces `findUnmappedSkus()`, accepts line item objects with IDs + SKU
- `ProductMappingsService.findMappingForLineItem()` — encapsulates the matching priority
- `PrintJobsService.createPrintJobsForLineItem()` — uses new matching: tries product ID first, then SKU
Frontend Changes:
- SKU field marked as optional in the mapping form
- Search/display handles null SKUs with fallback display (`—`)
Consequences¶
- Positive: System works for all Shopify merchants regardless of SKU configuration
- Positive: More reliable matching — Shopify IDs are guaranteed present and immutable
- Neutral: Existing mappings with SKUs continue working unchanged
- Consideration: When both a variant-specific and product-level mapping exist, the variant-specific one takes priority
ADR-050: Apache ECharts for Dashboard Analytics¶
| Attribute | Value |
|---|---|
| ID | ADR-050 |
| Status | Implemented |
| Date | 2026-02-13 |
| Context | The dashboard displayed only static stat cards and lists. Operators lacked visual insight into order, print job, and shipment status distributions, revenue trends, and day-over-day comparisons. |
Decision¶
Adopt Apache ECharts (v6) via echarts-for-react as the charting library for the dashboard analytics feature. Use on-demand imports for bundle optimization and lazy loading (React.lazy) for all chart components.
Rationale¶
- Richest chart variety: Donut, bar, line, gauge — all required chart types in a single library
- On-demand imports: Tree-shakeable core (~225 KB shared bundle) vs ~320 KB full import
- Dark theme support: Native theme registration via `echarts.registerTheme()` — consistent with existing dark UI
- TypeScript-first: Complete type definitions for all chart options and callbacks
- Active maintenance: Apache Foundation project with large community
- React wrapper: `echarts-for-react` provides a declarative React component with event handling
Implementation¶
Frontend — Chart Components (apps/web/src/components/charts/):
- `echarts-setup.ts` — On-demand ECharts core with registered `forma3d` dark theme
- `chart-card.tsx` — Reusable wrapper with title, subtitle, loading state, and empty state
- `donut-chart.tsx` — Generic donut chart with custom center labels and click-to-filter
- `bar-chart.tsx` — Generic bar chart with value labels and prefix/suffix formatting
- `line-chart.tsx` — Generic line chart with gradient area fill
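A hypothetical version of the on-demand setup might look like the following. The module paths follow ECharts' documented tree-shaking convention; the exact chart/component set and the theme values are assumptions:

```typescript
// Hypothetical echarts-setup.ts sketch: register only the chart types the
// dashboard uses, so the shared bundle stays small (~225 KB per the rationale).
import * as echarts from 'echarts/core';
import { BarChart, LineChart, PieChart, GaugeChart } from 'echarts/charts';
import { GridComponent, TooltipComponent, LegendComponent } from 'echarts/components';
import { CanvasRenderer } from 'echarts/renderers';

echarts.use([
  BarChart,
  LineChart,
  PieChart,   // donut charts are pie charts with an inner radius
  GaugeChart,
  GridComponent,
  TooltipComponent,
  LegendComponent,
  CanvasRenderer,
]);

// Register the dark theme once; charts reference it by name ('forma3d').
// These theme values are illustrative, not the project's actual palette.
echarts.registerTheme('forma3d', {
  backgroundColor: 'transparent',
  textStyle: { color: '#e5e7eb' },
});

export { echarts };
```

Importing from `echarts/core` plus explicit `use()` registration is what makes the unused chart types tree-shakeable, versus importing the full `echarts` package.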
Frontend — Analytics Components (apps/web/src/components/analytics/):
- `OrderStatusChart` — Order status donut with click-to-filter navigation
- `PrintJobStatusChart` — Print job status donut showing active job count
- `ShipmentStatusChart` — Shipment status donut showing in-transit count
- `RevenueTrendChart` — Weekly revenue bar chart
- `OrderTrendChart` — 30-day order volume line chart
- `AnalyticsPeriodDropdown` — Shared period selector (Today / Week / Month / All Time)
Frontend — Dashboard Integration (apps/web/src/pages/dashboard.tsx):
- Enhanced stat cards with trend delta indicators (up/down arrows, day-over-day change)
- Lazy-loaded chart components with `React.lazy()` and `<Suspense>` fallback
- Shared `AnalyticsPeriod` state driving all three donut charts simultaneously
Backend — Analytics Module (apps/api/src/analytics/):
- `AnalyticsRepository` — Prisma `groupBy` for status distributions, `$queryRaw` for daily trend aggregation
- `AnalyticsService` — Business logic for percentages, success rates, comparison deltas
- `AnalyticsController` — 6 REST endpoints under `/api/v1/analytics/*`
- DTOs with Swagger decorators for API documentation
Shared Contracts (libs/domain-contracts/src/api/analytics.api.ts):
- `AnalyticsPeriod`, `OrderAnalyticsApiResponse`, `PrintJobAnalyticsApiResponse`
- `ShipmentAnalyticsApiResponse`, `TrendsApiResponse`, `EnhancedDashboardStatsApiResponse`
Database Indexes (prisma/schema.prisma):
- Composite indexes `@@index([tenantId, status, createdAt])` on Order, PrintJob, and Shipment models
Data Fetching (apps/web/src/hooks/use-analytics.ts):
- TanStack Query hooks: `useOrderAnalytics`, `usePrintJobAnalytics`, `useShipmentAnalytics`
- Trend hooks: `useRevenueTrend`, `useOrderTrend`
- Enhanced stats: `useEnhancedDashboardStats` (30s refresh for KPI tiles)
Consequences¶
- Positive: Operators get immediate visual insight into 3D print farm operations
- Positive: Bundle size managed through on-demand imports and lazy loading (~225 KB shared chunk)
- Positive: Consistent dark theme integration with existing UI
- Positive: Click-to-filter on donut slices enables quick navigation to filtered views
- Neutral: New dependency (`echarts` + `echarts-for-react`) — well-maintained Apache project
- Consideration: ECharts v6 has stricter TypeScript types requiring `CallbackDataParams` for formatters
- Trade-off: Raw SQL used for date truncation in trend queries (Prisma `groupBy` lacks `DATE()` support)
Alternatives Considered¶
| Library | Pros | Cons |
|---|---|---|
| Recharts | Simple React API | Limited donut customization, fewer chart types |
| Chart.js | Lightweight | Weak TypeScript, less donut label flexibility |
| Nivo | Beautiful defaults | Heavier bundle, React-specific only |
| D3.js | Maximum flexibility | High complexity, no React integration out of box |
| ECharts | Rich charts, tree-shakeable | Larger full bundle (mitigated by on-demand) |
Test Coverage¶
- Backend: 34 unit tests across `analytics.repository.spec.ts`, `analytics.service.spec.ts`, and `analytics.controller.spec.ts`
- Frontend: 14 hook tests in `use-analytics.test.tsx` with MSW handlers for all 6 analytics endpoints
use-analytics.test.tsxwith MSW handlers for all 6 analytics endpoints - Type Safety: Full TypeScript strict mode compliance (ECharts v6 types)
ADR-051: Decompose Monolithic API into Domain-Aligned Microservices¶
| Attribute | Value |
|---|---|
| ID | ADR-051 |
| Status | Accepted |
| Date | 2026-02-15 |
| Context | The monolithic apps/api was growing beyond 300+ files across multiple domains (orders, print jobs, shipping, fulfillment, GridFlock). Feature work required understanding the entire codebase, and deployments restarted all domains even for single-domain changes. The upcoming GridFlock pipeline added compute-intensive STL generation that could block the API request thread. |
Decision¶
Decompose the monolithic API into five domain-aligned microservices plus an API Gateway:
| Service | Port | Domain |
|---|---|---|
| API Gateway | 3000 | Auth, routing, WebSocket, sessions |
| Order Service | 3001 | Orders, mappings, orchestration |
| Print Service | 3002 | Print jobs, SimplyPrint integration |
| Shipping Service | 3003 | Shipments, Sendcloud integration |
| GridFlock Service | 3004 | STL generation, slicing pipeline |
| Slicer Container | 3010 | BambuStudio CLI headless slicing |
The Gateway is the single entry point for all external traffic, routing to downstream services via HTTP proxy. Services communicate asynchronously via BullMQ event queues (Redis) and synchronously via internal HTTP APIs protected by an `X-Internal-Key` header.
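The internal-key check can be sketched as follows. This is guard logic only; the real NestJS guard class and its wiring may differ:

```typescript
// Minimal sketch of an internal-API key check (illustrative, not the actual guard).
import { timingSafeEqual } from 'node:crypto';

function isInternalRequest(
  headers: Record<string, string | undefined>,
  expectedKey: string,
): boolean {
  const presented = headers['x-internal-key']; // Node lowercases header names
  if (!presented) return false;
  const a = Buffer.from(presented);
  const b = Buffer.from(expectedKey);
  // Constant-time comparison avoids leaking key prefixes via response timing.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Combined with Docker-network isolation (internal service ports are not exposed through Traefik), the header acts as a second layer rather than the sole defense.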
Rationale¶
- Independent deployability: Each service can be deployed without affecting others
- Domain isolation: GridFlock compute-intensive work cannot block order processing
- Horizontal scalability: Services can be scaled independently based on load
- Team scalability: Different domains can be worked on independently
- Fault isolation: A failure in one service does not cascade to others
Consequences¶
- Positive: Independent deployment and scaling per domain
- Positive: GridFlock STL generation isolated from order processing
- Positive: Clear domain boundaries enforced by service boundaries
- Positive: Each service has a smaller, focused codebase
- Negative: Increased operational complexity (more containers to manage)
- Negative: Network latency added for inter-service calls
- Negative: Distributed transaction complexity
- Trade-off: Shared database via Prisma (no per-service database yet)
Alternatives Considered¶
| Approach | Pros | Cons |
|---|---|---|
| Keep monolith | Simple operations | Growing complexity, deployment coupling |
| Modular monolith | Simpler networking | Still single deployment unit |
| Microservices | Full isolation, scalability | More containers, networking complexity |
| Serverless functions | Auto-scaling | Cold starts, vendor lock-in |
ADR-052: BullMQ Event Queues for Inter-Service Async Communication¶
| Attribute | Value |
|---|---|
| ID | ADR-052 |
| Status | Accepted |
| Date | 2026-02-15 |
| Context | The monolithic API used EventEmitter2 for internal events. In a microservice architecture, events need to cross process boundaries. We need reliable, at-least-once delivery with retry capability. |
Decision¶
Use BullMQ (backed by Redis) for inter-service asynchronous event communication. Each event type gets its own dedicated BullMQ queue. The @forma3d/service-common library provides a shared BullMqEventBus abstraction.
Event types:
- `order.created`, `order.ready-for-fulfillment`, `order.cancelled`
- `print-job.completed`, `print-job.failed`, `print-job.status-changed`, `print-job.cancelled`
- `shipment.created`, `shipment.status-changed`
- `gridflock.mapping-ready`, `gridflock.pipeline-failed`
Configuration:
- Concurrency: 5 workers per queue
- Retries: 3 attempts with exponential backoff
- Dead letter: Failed events retained (`removeOnFail: 5000`)
- Completed cleanup: `removeOnComplete: 1000`
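The configuration above maps onto BullMQ job and worker options roughly as follows. The backoff base delay is an assumption; the source only specifies exponential backoff:

```typescript
// Sketch of the shared BullMQ options implied by the configuration above.
// (Base delay of 1000 ms is an assumption, flagged per the lead-in.)
const defaultJobOptions = {
  attempts: 3,                                            // 3 retry attempts
  backoff: { type: 'exponential' as const, delay: 1000 }, // exponential backoff
  removeOnComplete: 1000,                                 // keep last 1000 completed
  removeOnFail: 5000,                                     // retain failures for debugging
};

const workerOptions = {
  concurrency: 5, // 5 workers per queue
};
```

These objects would typically be passed to `new Queue(name, { defaultJobOptions, ... })` and `new Worker(name, processor, { concurrency, ... })` respectively.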
Rationale¶
- At-least-once delivery: BullMQ guarantees delivery with retries
- Redis already present: Required for sessions and Socket.IO adapter
- Built-in retry: Exponential backoff without custom implementation
- Visibility: Job status, progress, and failure tracking via Bull Board
- NestJS integration: `@nestjs/bullmq` provides native module support
Consequences¶
- Positive: Reliable cross-service event delivery with retries
- Positive: Dead letter queue for debugging failed events
- Positive: Event handlers are idempotent (check before acting)
- Negative: Redis becomes a critical infrastructure dependency
- Trade-off: At-least-once semantics require idempotent handlers
Alternatives Considered¶
| Approach | Pros | Cons |
|---|---|---|
| RabbitMQ | Feature-rich, routing | Additional infrastructure, more complex |
| Kafka | High throughput, replay | Overkill for this scale, complex setup |
| BullMQ | Simple, Redis-native | Redis single point of failure |
| HTTP webhooks | Simple to implement | No retry guarantees, no backpressure |
| AWS SQS/SNS | Managed, scalable | Vendor lock-in, latency |
ADR-053: Buffer-Based GridFlock Pipeline (No Local File Storage)¶
| Attribute | Value |
|---|---|
| ID | ADR-053 |
| Status | Accepted |
| Date | 2026-02-15 |
| Context | The GridFlock pipeline generates STL files, slices them to gcode, and uploads to SimplyPrint. In a containerized environment with potential horizontal scaling, local file storage creates state that prevents scaling and requires cleanup. |
Decision¶
The entire GridFlock pipeline operates on in-memory buffers. No files are written to the local filesystem at any point in the pipeline:
- STL Generation: JSCAD generates geometry → serialized to STL binary buffer
- Slicing: STL buffer sent to Slicer container via HTTP → gcode buffer returned
- Upload: Gcode buffer uploaded directly to SimplyPrint Files API
- Mapping: ProductMapping created in database referencing SimplyPrint file ID
Plates in a plate set are processed sequentially to bound memory usage (one plate buffer at a time).
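The four steps can be sketched as a sequential buffer pipeline. The step bodies below are stand-ins for JSCAD, the slicer HTTP call, and the SimplyPrint Files API; only the control flow reflects the decision:

```typescript
// Sketch of the buffer-only pipeline: STL buffer -> gcode buffer -> upload,
// one plate at a time. Step implementations are illustrative stand-ins.
type PlateSpec = { name: string };

async function generateStl(plate: PlateSpec): Promise<Buffer> {
  return Buffer.from(`solid ${plate.name}`); // stand-in for JSCAD geometry output
}

async function slice(stl: Buffer): Promise<Buffer> {
  return Buffer.from(`;gcode for ${stl.length} bytes`); // stand-in for slicer HTTP call
}

async function upload(gcode: Buffer): Promise<string> {
  return `file-${gcode.length}`; // stand-in for the SimplyPrint file ID
}

async function runPlateSet(plates: PlateSpec[]): Promise<string[]> {
  const fileIds: string[] = [];
  // Sequential for-of loop with await (not Promise.all): only one plate's
  // buffers are alive at a time, bounding peak memory per the decision above.
  for (const plate of plates) {
    const stl = await generateStl(plate);
    const gcode = await slice(stl);
    fileIds.push(await upload(gcode));
  }
  return fileIds;
}
```

Swapping the loop for `Promise.all(plates.map(...))` would process plates concurrently and break the ~50MB-per-plate memory bound cited in the mitigation below.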
Rationale¶
- Stateless containers: No local state means any replica can handle any request
- Horizontal scaling: Multiple GridFlock Service instances can run concurrently
- No cleanup needed: No temp files to garbage collect
- SimplyPrint as storage: The only permanent storage is SimplyPrint (source of truth for gcode)
Consequences¶
- Positive: Fully stateless, horizontally scalable
- Positive: No disk I/O bottleneck
- Positive: No file cleanup cron jobs needed
- Negative: Memory-bound (large plate sets consume RAM)
- Mitigation: Sequential plate processing bounds peak memory to ~50MB per plate
ADR-054: SimplyPrint API Files for Gcode Upload¶
| Attribute | Value |
|---|---|
| ID | ADR-054 |
| Status | Accepted |
| Date | 2026-02-15 |
| Context | After slicing GridFlock baseplates, the gcode must be stored and made available for printing via SimplyPrint. SimplyPrint offers a Files API for uploading files to the print farm. The SimplyPrint cloud slicer is not accessible via API. |
Decision¶
Upload sliced gcode files to SimplyPrint via the Files API (requires Print Farm plan). Each gcode file is uploaded as a buffer with metadata, and SimplyPrint returns a file ID used for creating print jobs.
Rationale¶
- Single source of truth: SimplyPrint stores all printable files
- No local storage: Aligns with buffer-based pipeline (ADR-053)
- Print job integration: File IDs directly used in SimplyPrint print queue
- Existing API: Files API already used for manual file uploads
Consequences¶
- Positive: No separate file storage infrastructure needed
- Positive: Files immediately available for printing
- Negative: Requires SimplyPrint Print Farm plan (API file access)
- Negative: Upload latency adds to pipeline time (~2-5 seconds per file)
ADR-055: BambuStudio CLI Slicer Container¶
| Attribute | Value |
|---|---|
| ID | ADR-055 |
| Status | Accepted |
| Date | 2026-02-15 |
| Context | The GridFlock pipeline needs to slice STL files into gcode with specific printer profiles (nozzle diameter, layer height, filament type). SimplyPrint's cloud slicer is not API-accessible. We need a headless slicer that supports Bambu Lab and Prusa printer profiles. |
Decision¶
Run BambuStudio CLI (fork of PrusaSlicer/SuperSlicer) in a dedicated Docker container as a headless slicing service. The container exposes an HTTP API that accepts STL buffers and returns gcode buffers. Printer profiles are configurable per tenant via SystemConfig.
Rationale¶
- Bambu Lab support: Native profiles for X1 Carbon, P1S printers
- Prusa support: Backward-compatible with PrusaSlicer profiles
- Deterministic: Same input always produces same output
- Containerized: Isolated from other services, independently scalable
- CLI-based: No GUI dependencies, runs in headless mode
Consequences¶
- Positive: Full control over slicing parameters
- Positive: Tenant-configurable print profiles
- Positive: No dependency on SimplyPrint cloud slicer
- Negative: Additional container to maintain and update
- Negative: BambuStudio updates require container rebuilds
Alternatives Considered¶
| Approach | Pros | Cons |
|---|---|---|
| SimplyPrint slicer | No additional container | Not API-accessible |
| PrusaSlicer CLI | Lighter weight | No native Bambu Lab profiles |
| BambuStudio CLI | Bambu + Prusa support | Heavier container (~1.5GB) |
| CuraEngine | Popular slicer | Different profile format |
ADR-056: Redis for Sessions, Event Queues, and Socket.IO Adapter¶
| Attribute | Value |
|---|---|
| ID | ADR-056 |
| Status | Accepted |
| Date | 2026-02-15 |
| Context | The microservice architecture requires shared infrastructure for sessions, inter-service events, and WebSocket broadcasting. Rather than introducing multiple infrastructure components, a single Redis instance can serve all three purposes. |
Decision¶
Use a single Redis 7 instance for three purposes:
- Session Store: Gateway stores Express sessions in Redis (replaces PostgreSQL `connect-pg-simple`), enabling session sharing across Gateway replicas
- Event Bus: BullMQ queues for inter-service async events (see ADR-052)
- Socket.IO Adapter: Redis adapter enables WebSocket event broadcasting across multiple Gateway instances
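A rough sketch of gateway wiring for the three Redis roles. The package choices (`ioredis`, `connect-redis`, `@socket.io/redis-adapter`) are the common stack for this combination, and all option values are assumptions:

```typescript
// Hypothetical wiring sketch for the three Redis roles (not the actual gateway code).
import { Redis } from 'ioredis';
import session from 'express-session';
import { RedisStore } from 'connect-redis';
import { Queue } from 'bullmq';
import { Server } from 'socket.io';
import { createAdapter } from '@socket.io/redis-adapter';

const redis = new Redis(process.env.REDIS_URL!);

// 1. Session store: sessions shared across Gateway replicas.
const sessionMiddleware = session({
  store: new RedisStore({ client: redis }),
  secret: process.env.SESSION_SECRET!,
  resave: false,
  saveUninitialized: false,
});

// 2. Event bus: BullMQ queues reuse the same Redis instance (see ADR-052).
const orderEvents = new Queue('order.created', { connection: redis });

// 3. Socket.IO adapter: broadcasts WebSocket events across Gateway instances.
const io = new Server();
io.adapter(createAdapter(redis, redis.duplicate()));
```

Note the Socket.IO adapter needs two connections (pub and sub), hence `redis.duplicate()`; BullMQ workers additionally require a connection with `maxRetriesPerRequest: null`, an ioredis option omitted here for brevity.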
Rationale¶
- Single infrastructure: One Redis instance serves all three use cases
- Horizontal scaling: Sessions and WebSockets work across Gateway replicas
- Performance: In-memory data store with sub-millisecond latency
- Proven stack: Redis + BullMQ + Socket.IO Redis adapter is a well-tested combination
Consequences¶
- Positive: Enables horizontal scaling of Gateway
- Positive: Single infrastructure dependency for multiple features
- Positive: Fast session lookups (vs. PostgreSQL round-trip)
- Negative: Redis becomes a critical dependency (all services depend on it)
- Mitigation: Redis is deployed with persistence enabled and health checks
- Trade-off: Session data is ephemeral (Redis restart clears sessions)
ADR-057: Self-Hosted Build Agent with Hybrid Pipeline Strategy¶
| Attribute | Value |
|---|---|
| ID | ADR-057 |
| Status | Accepted |
| Date | 2026-02-15 |
| Context | Pipeline build times grew significantly with 8+ microservice Docker images |
Decision¶
Deploy a self-hosted Azure DevOps build agent on a DigitalOcean droplet (4 vCPU / 8 GB RAM) running 2 agent instances, and adopt a hybrid agent strategy:
- MS-hosted agent handles lightweight jobs (lint, Nx builds, deployments)
- Self-hosted agents handle Docker packaging with persistent local layer cache
- Merged Validate & Test stage runs lint, typecheck, and unit tests in parallel across all 3 agents
Rationale¶
- Docker layer caching: MS-hosted agents are ephemeral (cold cache every run). Self-hosted agents maintain a warm Docker layer cache between builds, reducing per-service build time from ~7-10 min to ~2-3 min
- Pre-installed tools: Cosign and Syft are pre-installed on the self-hosted agent instead of being downloaded on every job (~1 min saved per service)
- Cost-effective parallelism: 2 self-hosted agent instances at $48/month provide more parallelism than buying 1 extra MS-hosted parallel job at $40/month
- Stage merging: Combining Validate and Test into a single stage eliminates ~5 min of sequential stage overhead by running Lint, TypeCheck, and UnitTests in parallel
Pipeline Architecture¶
| Agent | Stage | Jobs |
|---|---|---|
| MS-hosted | Validate & Test | Lint |
| DO Agent 1 | Validate & Test | TypeCheck |
| DO Agent 2 | Validate & Test | UnitTests |
| MS-hosted | Build & Package | DetectAffected, BuildAll |
| DO Agent 1+2 | Build & Package | All Package* Docker jobs |
| MS-hosted | Deploy, Acceptance, Production | All remaining stages |
Infrastructure¶
| Component | Specification |
|---|---|
| Droplet | DigitalOcean s-4vcpu-8gb ($48/month) |
| OS | Ubuntu 22.04 LTS |
| Agent Pool | DO-Build-Agents (self-hosted) |
| Agent Instances | 2 (do-build-agent-1, do-build-agent-2) |
| Setup Script | deployment/build-agent/setup-build-agent.sh |
Performance Impact¶
| Metric | Before (1 MS-hosted) | After (hybrid) | Improvement |
|---|---|---|---|
| Validate + Test | ~18 min (sequential) | ~8 min (parallel) | 56% faster |
| Build & Package (full) | ~75 min | ~15 min | 80% faster |
| Full pipeline (main) | ~133 min | ~63 min | 53% faster |
| Monthly cost | $0 | $48 | — |
Consequences¶
- Positive: Dramatically faster builds, especially Docker packaging
- Positive: Cost-effective compared to buying MS-hosted parallel jobs
- Positive: Docker layer cache persists between builds
- Negative: Self-hosted agent requires maintenance (Docker cleanup, OS updates, agent updates)
- Mitigation: Automated weekly Docker cleanup and daily disk monitoring cron jobs
- Negative: Single point of failure (if droplet goes down, Docker builds queue on MS-hosted)
- Mitigation: Pipeline falls back gracefully; Package jobs wait for agent availability
Related¶
- Self-Hosted Build Agent Documentation
- Pipeline Reference
- ADR-006: Azure DevOps for CI/CD
- ADR-018: Nx Affected Conditional Deployment Strategy
ADR-058: Self-Hosted Log Infrastructure (ClickHouse + Grafana via OpenTelemetry)¶
| Attribute | Value |
|---|---|
| ID | ADR-058 |
| Status | Accepted |
| Date | 2026-02-21 |
| Context | Sentry Logs has limited retention (30 days), query capabilities, and cost scalability for structured business/audit logs |
Decision¶
Migrate structured logging (business events, audit logs, observability) from Sentry Logs to a self-hosted ClickHouse + Grafana stack, using OpenTelemetry Collector as the ingestion pipeline and Pino as the application-level logger. Sentry remains the platform for error tracking, distributed tracing, performance monitoring, and profiling.
Architecture¶
Responsibility Split¶
| Concern | Platform | Rationale |
|---|---|---|
| Error tracking | Sentry | Best-in-class stack traces, issue grouping, alerting |
| Distributed tracing | Sentry | End-to-end traces across services with Sentry UI |
| Performance monitoring | Sentry | Request latency, database query profiling |
| Profiling | Sentry | Node.js CPU profiling in production |
| Structured logging | ClickHouse + Grafana | Unlimited retention, SQL queries, self-hosted cost control |
| Business event logs | ClickHouse + Grafana | Long-term queryable audit trail |
| Security audit logs | ClickHouse + Grafana | Compliance-grade retention |
| Log dashboards | Grafana | Custom dashboards, alerting rules |
Implementation Details¶
Shared Library (libs/observability):
- `otel-logger.ts` — Pino logger factory with configurable log level and pino-pretty for development
- `OtelLoggerService` — NestJS injectable service replacing `SentryLoggerService`, providing `info`, `warn`, `error`, `debug`, `logEvent`, and `logAudit` methods
Instrumentation (apps/*/src/observability/instrument.ts):
- OpenTelemetry SDK initializes before Sentry to enable Pino-OTel bridging
- `@opentelemetry/instrumentation-pino` auto-bridges Pino logs to OTLP
- `@opentelemetry/exporter-logs-otlp-grpc` exports logs to the OTel Collector
- Sentry's `_experiments: { enableLogs: true }` flag removed
Infrastructure (deployment/staging/):
- `otel-collector-config.yaml` — OTLP receiver → batch processor → ClickHouse exporter
- `clickhouse-config.xml` — S3 backup disk with `from_env` credential injection
- `clickhouse-users.xml` — `otel` user for collector writes
- `grafana/provisioning/datasources/clickhouse.yaml` — ClickHouse datasource with OTel schema
- `scripts/backup-clickhouse-logs.sh` — Daily backup cron job
Pipeline (azure-pipelines.yml):
- 7 new variables: `CLICKHOUSE_PASSWORD`, `GRAFANA_ADMIN_PASSWORD`, `DO_SPACES_KEY`, `DO_SPACES_SECRET`, `DO_SPACES_REGION`, `DO_SPACES_BUCKET`, `DO_SPACES_LOG_PREFIX`
- Variables flow from Azure DevOps → `.env` → Docker Compose → container environment → ClickHouse `from_env` XML
Log Retention (ClickHouse TTL)¶
| Tier | Retention | Data |
|---|---|---|
| Hot | 30 days | Full structured logs |
| Warm | 90 days | Aggregated summaries |
| Archive | 365 days | Daily backups to DigitalOcean Spaces |
Consequences¶
Positive:
- ✅ Unlimited log retention at self-hosted cost (~$0 incremental on existing droplet)
- ✅ Full SQL query capability via Grafana for log analysis
- ✅ Sentry retains its strengths (error tracking, tracing, profiling) without log clutter
- ✅ ClickHouse columnar storage compresses logs 10-20x vs PostgreSQL
- ✅ Vendor-neutral via OpenTelemetry — can swap backends without code changes
- ✅ Daily automated backups to DigitalOcean Spaces (S3-compatible)
Negative:
- ⚠️ Additional infrastructure to maintain (3 new containers: OTel Collector, ClickHouse, Grafana)
- ⚠️ ~1 GB additional RAM on staging droplet
- ⚠️ Backup credentials require Azure DevOps variable management
- Mitigation: All configuration is pipeline-driven; containers are stateless except ClickHouse data volume
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Keep Sentry Logs only | 30-day retention, limited querying, potential cost at scale |
| Grafana + Loki | Loki less efficient for structured logs than ClickHouse |
| ELK Stack (Elasticsearch) | Heavy resource requirements, complex to operate |
| Datadog / New Relic | Expensive SaaS, vendor lock-in |
Related¶
- ClickHouse + Grafana Logging Research
- ADR-016: Sentry Observability with OpenTelemetry (updated — Sentry no longer handles structured logging)
- ADR-047: Three-Tier Logging Strategy (superseded — Sentry Logs tier replaced by ClickHouse)
ADR-059: Nx Affected Resilience via Last-Successful-Deploy Tag¶
| Attribute | Value |
|---|---|
| ID | ADR-059 |
| Status | Accepted |
| Date | 2026-02-22 |
| Context | Nx affected with HEAD~1 base loses track of undeployed changes when a pipeline run fails partway through |
Decision¶
Replace the hard-coded --base=HEAD~1 in the DetectAffected job with a last-successful-deploy git tag that is only advanced after the full pipeline succeeds. A new UpdateDeployTag stage at the end of the pipeline pushes the tag forward on success.
Problem¶
When the pipeline runs on main, nx affected --base=HEAD~1 compares the current commit against the previous commit. If the pipeline fails (e.g., during DeployStaging or AcceptanceTest), the changes in that commit are never deployed. When the next commit arrives with a fix, HEAD~1 now points to the failed commit — the originally undeployed changes are invisible to nx affected and are permanently skipped.
main: A ─── B (deploy fails) ─── C (fix)
│ │
└── HEAD~1 base ──────┘
changes X,Y,Z only fix Z is detected
never deployed X,Y permanently lost
Solution¶
main: A ─── B (deploy fails) ─── C (fix)
│ │
└── last-successful-deploy └── HEAD
tag stays at A affected sees A→C (includes X,Y,Z + fix)
DetectAffected job:
- Check if the `last-successful-deploy` tag exists
- If yes, use it as `--base` for `nx affected` and `git diff`
- If no (first run), fall back to `HEAD~1`
UpdateDeployTag stage:
- Depends on all pipeline stages (Build, DeployStaging, AcceptanceTest, LoadTest, DeployProduction, SmokeTest)
- Runs only when Build, DeployStaging, and AcceptanceTest did not fail (disabled stages like DeployProduction/SmokeTest are allowed to be skipped)
- Force-pushes the `last-successful-deploy` tag to the current commit
Bootstrap¶
No manual setup required. On the first pipeline run, the tag does not exist and affected detection falls back to HEAD~1. After the first successful run, the tag is created automatically.
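The fallback described above is small enough to sketch. The helper below is hypothetical (the actual pipeline implements this logic as a script step inside the DetectAffected job); it only illustrates the tag-or-`HEAD~1` decision:

```typescript
// Hypothetical helper mirroring the DetectAffected base-ref resolution.
// Names are illustrative; the real pipeline does this in a pipeline script step.
const DEPLOY_TAG = 'last-successful-deploy';

function resolveNxBase(existingTags: string[]): { base: string; bootstrap: boolean } {
  const found = existingTags.includes(DEPLOY_TAG);
  return {
    base: found ? DEPLOY_TAG : 'HEAD~1', // first run: tag absent, fall back to HEAD~1
    bootstrap: !found,
  };
}

// Usage sketch: feed it the output of `git tag --list`, then run
// `npx nx affected --base=<base> --target=build` with the resolved base.
```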
Consequences¶
Positive:
- No changes are ever "forgotten" — failed deployments are re-evaluated on the next run
- Self-healing: a fix PR automatically includes all previously missed changes
- The `ForceFullVersioningAndDeployment` parameter remains as a manual override
- Zero infrastructure dependencies (uses git tags, no external state store)
Negative:
- Requires `persistCredentials: true` on the UpdateDeployTag checkout step for `git push`
- Force-pushing tags requires appropriate repository permissions for the pipeline service account
- After extended failures, the affected set may be large (all changes since last success)
- Mitigation: This is intentional — better to rebuild too much than to skip changes
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| `HEAD~1` (previous approach) | Loses undeployed changes after pipeline failures |
| External state store (S3, SSM) | Additional infrastructure dependency for a simple use case |
| Nx Cloud affected tracking | Requires Nx Cloud subscription; overkill for current scale |
| Manual `ForceFullVersioningAndDeployment` after failures | Error-prone; depends on a human remembering to toggle it |
Related¶
- ADR-018: Nx Affected Conditional Deployment Strategy (extended by this ADR)
- ADR-057: Self-Hosted Build Agent with Hybrid Pipeline Strategy
- ADR-006: Azure DevOps for CI/CD
ADR-060: Single Source of Truth for STL Preview Generation¶
| Attribute | Value |
|---|---|
| ID | ADR-060 |
| Status | Accepted (extended by ADR-061) |
| Date | 2026-02-27 |
| Context | STL preview generation logic was duplicated between the NestJS service and needed by offline cache population scripts |
Decision¶
Extract all STL preview generation logic from GridflockPreviewService into a standalone generatePreviewStl() function in @forma3d/gridflock-core. Both the NestJS service and offline scripts import the same function — a single source of truth for preview STL generation.
Problem¶
The GridflockPreviewService contained the full preview generation pipeline (plate set calculation, offset computation, parallel plate generation, STL combining) as private methods and hardcoded constants. To pre-populate the STL preview cache offline (eliminating cold-start latency for 16,000+ dimension combinations), this logic needed to be callable from a standalone CLI script without depending on NestJS, Prisma, Redis, or any server infrastructure.
Duplicating the generation logic in the script would create a maintenance risk — any change to STL generation would need to be applied in two places, and divergence would produce inconsistent cached files.
Solution¶
New module: libs/gridflock-core/src/lib/preview-generator.ts
Exports generatePreviewStl(widthMm, heightMm, options?) which orchestrates the full pipeline:
- Calculate the plate set using `PRINTER_PROFILES['bambu-a1']`
- Compute X/Y offsets per plate
- Generate plates in parallel via `generatePlatesParallel()` (with configurable `maxWorkers`)
- Combine into a single binary STL via `combineStlBuffers()`
Moved from service to library:
- `computeUniformOffsets()` — cumulative X-axis offset calculation
- `computeOffsetsPerColumn()` — per-column Y-axis offset calculation
- `PLATE_GAP_MM = 10` — gap constant between plates in the preview
- `DEFAULT_PREVIEW_OPTIONS` — intersection-puzzle connectors, magnets disabled
maxWorkers parameter: Added to generatePlatesParallel() (backward-compatible optional parameter) so the offline script can limit worker threads per combination when running multiple combinations concurrently.
Service refactored: GridflockPreviewService.generatePreview() became a thin wrapper:
- Normalize dimensions (larger dimension first)
- Attempt plate-level assembly → return if successful
- Fall back to `generatePreviewStl(w, h, { log })` → return
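The normalization step (larger dimension first) is what makes cache keys symmetric, so 320×450 and 450×320 resolve to one entry. A minimal sketch, with a hypothetical key format (the service's actual file naming is not specified here):

```typescript
// Hypothetical sketch of cache-key normalization. The key format is
// illustrative; the symmetry property is the point.
function previewCacheKey(widthMm: number, heightMm: number): string {
  // Larger dimension first, so 320x450 and 450x320 share one cache entry.
  const [a, b] = widthMm >= heightMm ? [widthMm, heightMm] : [heightMm, widthMm];
  return `preview-${a}x${b}.stl`;
}
```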
Note: ADR-061 extends this architecture with a plate-level cache that assembles previews from ~268 cached base plates + dynamically generated border geometry (~60 MB total) while supporting any input resolution. The legacy full-preview disk cache (16,471 files, ~32 GB) was removed in March 2026.
Consequences¶
Positive:
- Single source of truth — changes to STL generation logic happen in one place
- Offline scripts produce byte-for-byte identical output to the server
- `maxWorkers` enables CPU-aware parallelism in the population script without oversubscription
- Cache key normalization prevents duplicate entries (320×450 = 450×320)
- `@forma3d/gridflock-core` remains NestJS-independent — usable in any Node.js context
Negative:
- Preview generation parameters (printer profile, connector type) are hardcoded in the library rather than configurable per-tenant
- Mitigation: When multi-tenant preview customization is needed, `generatePreviewStl` can accept a configuration parameter
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Duplicate generation logic in the script | Maintenance risk — two codepaths that must stay in sync |
| Import NestJS service in the script | Would require NestJS DI container, Prisma, Redis — heavyweight for an offline tool |
| Use the server's REST API from the script | Network-bound, requires running server, doesn't leverage local CPU cores |
Related¶
- ADR-053: Buffer-Based GridFlock Pipeline (No Local File Storage)
- ADR-061: Plate-Level Preview Cache with Dynamic Border Assembly
- STL Cache Pre-Population Research
- `scripts/populate-plate-cache.ts` — plate-level cache population script
ADR-061: Plate-Level Preview Cache with Dynamic Border Assembly¶
| Attribute | Value |
|---|---|
| ID | ADR-061 |
| Status | Accepted |
| Date | 2026-03-01 |
| Context | Full-preview-per-dimension cache requires ~32 GB for 0.5 cm resolution (16,471 files) and ~853 GB for 1 mm resolution (406,351 files), making fine-grained drawer-fit precision impractical |
Decision¶
Replace the full-preview-per-dimension cache with a plate-level cache of 200 base plates (~41 MB) that are assembled on the fly with dynamically generated border geometry to produce previews for any dimension at any resolution.
Problem¶
The storefront configurator's 3D preview required pre-populating one STL file per dimension pair. At 0.5 cm resolution (step 5 mm), this was 16,471 files at ~32 GB — already impractical to generate (10–14 hours) and deploy. Supporting 1 mm input precision (for exact drawer fit) would require 406,351 files at ~853 GB, which was not feasible.
Analysis revealed that each preview's plate geometry is determined by only two factors: grid size (1–6 × 1–6) and connector edge pattern (4 booleans). The border around the grid cells is what creates the combinatorial explosion (147,500 unique plates with border vs only 200 without border). The border itself is a trivial rectangular solid that can be generated in microseconds.
Solution¶
Three-tier architecture:
- 200 base plates cached — each is a unique (gridSize, connectorEdges) combination with zero border and no plate number, generated via JSCAD CSG and stored as binary STL
- Border strips generated on the fly — simple rectangular cuboids (12 triangles, 684 bytes each) created as raw binary STL without any JSCAD dependency
- Assembly via `combineStlBuffers()` — existing buffer concatenation with vertex offset translation
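Because a border strip is plain axis-aligned geometry, it can be emitted directly as binary STL with no CSG library. The sketch below is illustrative, not the actual `border-generator.ts` API; note that a 12-triangle cuboid is exactly 84 + 12 × 50 = 684 bytes, matching the figure above:

```typescript
// Binary STL layout: 80-byte header, uint32 triangle count, then 50 bytes per
// triangle (float32 normal + 3 float32 vertices + 2-byte attribute count).
type Vec3 = [number, number, number];

function boxStl(sx: number, sy: number, sz: number, [ox, oy, oz]: Vec3 = [0, 0, 0]): Buffer {
  // 8 corner vertices; index = x + 2y + 4z with each axis flag in {0, 1}
  const v: Vec3[] = [];
  for (const z of [oz, oz + sz])
    for (const y of [oy, oy + sy])
      for (const x of [ox, ox + sx]) v.push([x, y, z]);
  // 12 triangles, two per face
  const tris: [number, number, number][] = [
    [0, 2, 1], [1, 2, 3], // bottom (z = oz)
    [4, 5, 6], [5, 7, 6], // top
    [0, 1, 4], [1, 5, 4], // front (y = oy)
    [2, 6, 3], [3, 6, 7], // back
    [0, 4, 2], [2, 4, 6], // left (x = ox)
    [1, 3, 5], [3, 7, 5], // right
  ];
  const buf = Buffer.alloc(84 + tris.length * 50);
  buf.writeUInt32LE(tris.length, 80);
  let off = 84;
  for (const [a, b, c] of tris) {
    off += 12; // leave the normal zeroed — viewers typically recompute normals
    for (const i of [a, b, c])
      for (const coord of v[i]) { buf.writeFloatLE(coord, off); off += 4; }
    off += 2; // attribute byte count (zero)
  }
  return buf;
}
```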
New modules in @forma3d/gridflock-core:
- `preview-generator.ts` additions:
  - `basePlateCacheKey()` — deterministic key `plate-{cols}x{rows}-{NESW}.stl`
  - `generateBasePlateStl()` — JSCAD plate with `NO_BORDER` and `plateNumber: false`
  - `enumerateAllBasePlateKeys()` — discovers all 200 keys via representative dimensions
  - `assemblePreviewFromPlateCache()` — assembly function accepting a plate-lookup callback
- `border-generator.ts` — pure binary STL box generation, no JSCAD dependency
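A deterministic key function of this shape can be sketched as follows. The exact NESW encoding (here one `0`/`1` flag per edge, north through west) is an assumption inferred from the `plate-{cols}x{rows}-{NESW}.stl` pattern above:

```typescript
// Assumption: NESW encodes each connector edge as a 0/1 flag in
// north, east, south, west order; the real encoding may differ.
interface ConnectorEdges { n: boolean; e: boolean; s: boolean; w: boolean }

function basePlateCacheKey(cols: number, rows: number, edges: ConnectorEdges): string {
  const nesw = [edges.n, edges.e, edges.s, edges.w].map((f) => (f ? '1' : '0')).join('');
  return `plate-${cols}x${rows}-${nesw}.stl`;
}
```

Because the key depends only on grid size and edge pattern, every dimension pair that maps to the same (gridSize, connectorEdges) combination reuses the same cached plate.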
New service in gridflock-service:
- `PlateCacheService` — loads all 200 base plates into memory at startup (~41 MB) and provides synchronous lookup by key
Preview resolution cascade:
- Plate-level assembly (base plates + dynamic borders) → 10–100 ms
- Full JSCAD generation via `generatePreviewStl()` → 12–30 seconds (fallback)
Dimension validation updated:
- Shopify configurator: `step="0.1"` (1 mm precision), `max="100"` (100 cm)
- Backend DTOs: `@Max(1000)` for both width and height (both preview and checkout)
- Sub-millimeter inputs rounded down (floor) to the nearest 0.1 cm
Consequences¶
Positive:
- Cache reduced from ~32 GB (16,471 files) to ~60 MB (~268 files)
- Population time reduced from 10–14 hours to 2–5 minutes
- Supports any input resolution (1 mm, 0.5 cm, continuous) with the same cache
- Preview assembly completes in 10–100 ms — no perceptible delay
- Production STL generation path is completely unchanged (byte-identical output)
- Legacy full-preview cache was removed entirely (March 2026) — no longer needed
Negative:
- Preview STLs are not byte-identical to the legacy full previews (different border geometry, no plate numbers)
- Mitigation: Visually equivalent in the 3D viewer; the differences (rounded vs square border corners, absent plate numbers) are invisible in the preview context. Production plates remain unchanged.
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Cache individual plates with border baked in | 147,500 files at ~29 GB — still too large, does not solve the scaling problem |
| Generate full previews at 1 mm resolution | ~853 GB, ~69 hours to generate — completely impractical |
| Server-side rendering (image preview instead of STL) | Loses the interactive 3D preview that customers value; would require a rendering pipeline |
Related¶
- ADR-060: Single Source of Truth for STL Preview Generation
- ADR-053: Buffer-Based GridFlock Pipeline (No Local File Storage)
- Plate-Level Preview Cache Prompt — Full analysis with combinatorics
- STL Cache Pre-Population Research
- `scripts/populate-plate-cache.ts` — base plate population script
- `libs/gridflock-core/src/lib/border-generator.ts` — pure binary STL border generation
- `apps/gridflock-service/src/gridflock/plate-cache.service.ts` — in-memory plate cache
ADR-062: Inventory Tracking and Stock Replenishment¶
| Attribute | Value |
|---|---|
| ID | ADR-062 |
| Status | Accepted |
| Date | 2026-03-08 |
| Context | Forma3D.Connect operates as a pure print-to-order platform, meaning every order triggers a new print job. Popular products are reprinted constantly, causing fulfillment delays during peak demand and leaving printers idle during quiet periods. |
Decision¶
Introduce a hybrid fulfillment model with opt-in inventory tracking at the ProductMapping level, scheduled stock replenishment during quiet periods, and stock-aware order fulfillment that consumes available stock before creating print jobs.
Problem¶
Popular products (best-sellers) follow a predictable demand pattern, yet every order triggers a full print cycle (4–24 hours). During weekend order surges, backlogs build up. During weekday quiet periods, printers sit idle. There is no mechanism to pre-print popular products or track physical stock of completed units.
Solution¶
Three new capabilities (plus supporting schema, API, and event changes), all placed in order-service because stock consumption is tightly coupled with order orchestration:
1. Inventory Tracking (InventoryModule)
- Extended `ProductMapping` with stock fields: `currentStock`, `minimumStock`, `maximumStock`, `replenishmentPriority`, `replenishmentBatchSize`
- Stock management is opt-in: `minimumStock = 0` (default) keeps the product as print-to-order; `minimumStock > 0` enables stock tracking
- One stock unit = one complete set of all `AssemblyParts` for a product
- All stock mutations (production, consumption, adjustment, scrapping) create `InventoryTransaction` records for a full audit trail
- `currentStock` can never go negative; all mutations are atomic via database transactions
2. Stock Replenishment (StockReplenishmentModule)
- Cron scheduler evaluates stock levels every 10 minutes
- Respects configurable `allowedHours`, `allowedDays` to run during quiet periods only
- Skips when the order print queue exceeds `orderQueueThreshold` (order jobs always take priority)
- Skips when active stock jobs exceed `maxConcurrentStockJobs` capacity
- For each product where `currentStock < minimumStock`, calculates the deficit (accounting for pending `StockBatches`) and creates batches
- One `StockBatch` = one sellable unit = `PrintJob` records for all `AssemblyParts` × `quantityPerProduct`
- `PrintJob` records created with `purpose = 'STOCK'`, `lineItemId = null`, `stockBatchId` set
- Global `STOCK_REPLENISHMENT_ENABLED` environment variable acts as a master switch
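The deficit calculation above reduces to a small pure function. The sketch below assumes the replenishment target is `minimumStock` and that each pending batch yields exactly one unit, as stated above; the actual target policy in `StockReplenishmentModule` may differ:

```typescript
// Hypothetical deficit calculation for one product. Assumption: replenish up
// to minimumStock, counting each pending StockBatch as one future unit.
function batchesToCreate(p: { currentStock: number; minimumStock: number; pendingBatches: number }): number {
  if (p.minimumStock === 0) return 0; // opt-out: product stays print-to-order
  const projected = p.currentStock + p.pendingBatches; // pending batches each add one unit
  return Math.max(0, p.minimumStock - projected);
}
```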
3. Stock-Aware Order Fulfillment (updated OrchestrationService)
- `OrchestrationService.handleOrderCreated()` now calls `InventoryService.tryConsumeStock()` before creating print jobs
- If stock covers the full order quantity, no print jobs are created; the order completes immediately
- Partial fulfillment supported: consume available stock, print the remaining units
- Products with `minimumStock = 0` bypass stock consumption entirely (unchanged print-to-order flow)
- GridFlock products bypass stock consumption (custom STL pipeline unchanged)
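The branching above can be sketched as a pure decision function. This is a hypothetical shape; the real `tryConsumeStock()` also records a `CONSUMED` transaction atomically alongside the decrement:

```typescript
// How an incoming order quantity splits between stock and new print jobs.
interface FulfillmentPlan { consumed: number; toPrint: number }

function planFulfillment(currentStock: number, minimumStock: number, orderedQty: number): FulfillmentPlan {
  if (minimumStock === 0) return { consumed: 0, toPrint: orderedQty }; // pure print-to-order
  const consumed = Math.min(currentStock, orderedQty); // partial fulfillment allowed
  return { consumed, toPrint: orderedQty - consumed };
}
```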
4. Schema Changes
- `PrintJob.lineItemId` made nullable (stock jobs have no line item)
- `PrintJob.purpose` field added (enum: `ORDER` | `STOCK`, default `ORDER`)
- `PrintJob.stockBatchId` FK added (nullable, references `StockBatch`)
- New `StockBatch` model (id, productMappingId, status, totalJobs, completedJobs)
- New `InventoryTransaction` model (id, productMappingId, transactionType, quantity, direction, referenceType, referenceId, notes, createdBy)
- New enums: `PrintJobPurpose`, `StockBatchStatus`, `InventoryTransactionType`, `StockDirection`, `InventoryReferenceType`
5. API Endpoints (proxied via gateway at /api/v1/inventory/*)
| Method | Path | Permission | Purpose |
|---|---|---|---|
| GET | `/api/v1/inventory/stock` | `inventory.read` | Stock levels for all managed products |
| PUT | `/api/v1/inventory/stock/:id/config` | `inventory.write` | Update stock configuration |
| POST | `/api/v1/inventory/stock/:id/adjust` | `inventory.write` | Manual stock adjustment with audit trail |
| POST | `/api/v1/inventory/stock/:id/scrap` | `inventory.write` | Scrap damaged stock with audit trail |
| GET | `/api/v1/inventory/stock/:id/transactions` | `inventory.read` | Transaction history (paginated) |
| GET | `/api/v1/inventory/replenishment/status` | `inventory.read` | Replenishment system status |
6. Event Flow
- `STOCK_REPLENISHMENT_SCHEDULED` event published per created stock print job (BullMQ)
- `print-job.completed` events with `purpose === 'STOCK'` route to `InventoryService.handleStockJobCompleted()` instead of order orchestration
- When a `StockBatch` completes (all jobs done), `ProductMapping.currentStock` is incremented by 1 and a `PRODUCED` transaction is recorded
Consequences¶
Positive:
- Best-seller orders can be fulfilled in minutes (from stock) instead of 4–24 hours (printing)
- Printer utilization improves — idle periods fill with stock replenishment work
- Weekend order surges are absorbed by pre-built stock
- Full audit trail of every stock movement via the `InventoryTransaction` ledger
- Existing print-to-order flow is completely unchanged for products with `minimumStock = 0`
- Existing GridFlock custom STL pipeline is unaffected
Negative:
- `PrintJob.lineItemId` is now nullable, requiring safe access patterns (`?.` and `?? ''`) across order-service and print-service
- Mitigation: All existing services updated; comprehensive unit tests added for both null and non-null cases
- Stock jobs consume printer capacity that could serve order jobs during unexpected demand spikes
- Mitigation: `orderQueueThreshold` ensures stock replenishment yields to order jobs; replenishment only runs during configurable quiet periods
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Separate inventory microservice | Tight coupling with orchestration logic (stock consumption happens during order processing); would require distributed transactions or saga pattern for atomicity |
| Track inventory at the AssemblyPart level | Overly complex; operators think in "sellable units," not individual parts. Part-level tracking would require complex partial-unit logic |
| Manual-only replenishment (no scheduling) | Defeats the purpose of utilizing idle printer capacity; operators would need to manually evaluate and trigger batches |
| Priority queue preemption for order jobs | Too complex for v1; simple threshold check achieves the same goal with much less risk |
Related¶
- ADR-008: Event-Driven Internal Communication
- ADR-012: Assembly Parts Model for Product Mapping
- ADR-022: Event-Driven Fulfillment Architecture
- ADR-051: Decompose Monolithic API into Domain-Aligned Microservices
- ADR-052: BullMQ Event Queues for Inter-Service Async Communication
- Stock Management Prompt — Full implementation specification
- `apps/order-service/src/inventory/` — InventoryModule implementation
- `apps/order-service/src/stock-replenishment/` — StockReplenishmentModule implementation
- `libs/domain-contracts/src/api/inventory.api.ts` — API response contracts
ADR-063: ORDER-over-STOCK Print Queue Priority¶
| Attribute | Value |
|---|---|
| ID | ADR-063 |
| Status | Accepted |
| Date | 2026-03-09 |
| Context | With the introduction of stock replenishment (ADR-062), both ORDER-purpose and STOCK-purpose print jobs share the same SimplyPrint print queue. Without explicit ordering, STOCK jobs scheduled during quiet periods could delay incoming customer orders. |
Decision¶
Implement best-effort FIFO-within-priority-class queue ordering. ORDER-purpose jobs always precede STOCK-purpose jobs in the SimplyPrint queue. Within each class, jobs are processed in FIFO order.
Problem¶
When a customer order arrives and the SimplyPrint queue already contains STOCK replenishment jobs, the new ORDER job is appended at the end of the queue. This means the customer's order waits behind pre-printing stock jobs, causing unnecessary fulfillment delays — defeating the purpose of replenishment (which is meant to improve, not degrade, customer experience).
Solution¶
After each ORDER-purpose print job is added to the SimplyPrint queue, the system:
- Queries the current queue from SimplyPrint (`GET /{id}/queue/GetItems`)
- Queries local `PrintJob` records to identify which queue items are ORDER-purpose (`findActiveOrderQueueItemIds`)
- Calculates the correct position: `existingOrderCount + 1` (after all existing ORDER items)
- Moves the new item if its current position is behind the target, using SimplyPrint's `SetOrder` endpoint (`GET /{id}/queue/SetOrder?queue_item={id}&to={position}`)
STOCK-purpose jobs are never explicitly reordered — they naturally accumulate after ORDER jobs as new ORDER items are inserted before them.
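The position calculation reduces to counting ORDER items already in the queue. A sketch with a hypothetical signature (the real implementation works against SimplyPrint queue-item payloads and local `PrintJob` lookups, not plain id arrays):

```typescript
// Returns the 1-based target position for a newly added ORDER item, or null if
// no move is needed. Assumption: SetOrder positions are 1-based.
function targetPosition(queueIds: string[], orderIds: Set<string>, newItemId: string): number | null {
  // FIFO within class: the new ORDER item goes directly after existing ORDER items.
  const existingOrders = queueIds.filter((id) => orderIds.has(id) && id !== newItemId).length;
  const target = existingOrders + 1;
  const current = queueIds.indexOf(newItemId) + 1;
  return current > target ? target : null; // null: already at or before target
}
```

Because STOCK items are never counted, they drift toward the back of the queue as each new ORDER item is inserted ahead of them, which is exactly the ORDER-over-STOCK property described above.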
Key design properties:
- FIFO within each class: New ORDER jobs go after existing ORDER jobs; STOCK jobs maintain their original arrival order
- Best-effort: If the reorder API call fails, the job remains in the queue at its default position. The operation logs a warning but does not fail the print job creation
- Retry-aware: Retried ORDER jobs also receive priority positioning
- No preemption: Jobs already assigned to printers or in-progress are not affected
Consequences¶
Positive:
- Customer orders are always printed before stock replenishment items
- FIFO ordering within each priority class prevents starvation and ensures fairness
- Best-effort approach means a SimplyPrint API failure cannot block print job creation
- No additional database schema changes required
Negative:
- Two additional API calls per ORDER print job (getQueue + setQueueOrder) add latency
- Mitigation: Both calls are fast (<100ms each) and only occur for ORDER jobs
- Race condition between concurrent ORDER job insertions could result in suboptimal ordering
- Mitigation: The overall constraint (ORDER before STOCK) is maintained; within-ORDER FIFO may have minor deviation under high concurrency, which is acceptable
- Relies on SimplyPrint's `SetOrder` API being available and correctly implemented
- Mitigation: Graceful degradation — jobs remain in the queue at their default position on failure
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| SimplyPrint priority field on `AddItem` | SimplyPrint's API does not expose a priority parameter for queue items |
| Insert ORDER jobs at position 1 (`to=1`) | Breaks FIFO among ORDER jobs — later orders would execute before earlier ones (LIFO) |
| Separate SimplyPrint queues per purpose | SimplyPrint does not support multiple queues per company; would require separate company accounts |
| Threshold-only approach (ADR-062 original) | Prevents stock jobs from being created when orders are busy, but does not help when stock jobs are already queued and a new order arrives |
Related¶
- ADR-062: Inventory Tracking and Stock Replenishment
- ADR-052: BullMQ Event Queues for Inter-Service Async Communication
- `apps/print-service/src/print-jobs/print-jobs.service.ts` — `prioritizeOrderJobInQueue()` implementation
- `apps/print-service/src/simplyprint/simplyprint-api.client.ts` — `setQueueOrder()` method
- SimplyPrint Queue API — SetOrder endpoint documentation
ADR-064: Stock Replenishment Event Subscriber for SimplyPrint Queue¶
| Attribute | Value |
|---|---|
| ID | ADR-064 |
| Status | Accepted |
| Date | 2026-03-09 |
| Context | ADR-062 introduced stock replenishment with the StockReplenishmentService creating StockBatch and PrintJob records and publishing STOCK_REPLENISHMENT_SCHEDULED events via BullMQ. However, no subscriber was wired up to consume these events, so stock print jobs remained in QUEUED status indefinitely and never reached the SimplyPrint print queue. |
Decision¶
Wire a subscriber for STOCK_REPLENISHMENT_SCHEDULED events in the order-service's EventSubscriberService that queues each stock print job to SimplyPrint via SimplyPrintApiClient.addToQueue().
Problem¶
After the stock replenishment scheduler creates PrintJob records with purpose = 'STOCK' and publishes STOCK_REPLENISHMENT_SCHEDULED events, nothing consumes those events. The print jobs sit in QUEUED status in the database but never reach the SimplyPrint API queue. Printers never receive stock jobs, defeating the purpose of the replenishment system.
Solution¶
Add a subscription for SERVICE_EVENTS.STOCK_REPLENISHMENT_SCHEDULED in EventSubscriberService.onModuleInit(). The handler (queueStockJobToSimplyPrint) mirrors the ORDER-purpose flow in PrintJobsService.createSinglePrintJob() but without order/line-item context or ORDER-over-STOCK priority reordering:
- Validates `fileId` is present (skips if null)
- Checks `SimplyPrintApiClient.isEnabled()` (skips if not configured)
- Looks up the `PrintJob` by ID (skips if not found)
- Checks idempotency — skips if the job already has a `simplyPrintJobId`
- Calls `SimplyPrintApiClient.addToQueue({ fileId, amount: 1 })`
- Releases any stale `simplyPrintJobId` from old jobs (SimplyPrint may reuse queue-item IDs)
- Updates the `PrintJob` with `simplyPrintJobId` and `simplyPrintQueueItemId`
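The skip-and-idempotency shape of the handler can be sketched against hypothetical interfaces (the real handler uses `SimplyPrintApiClient` and a Prisma-backed repository, and also performs the stale-ID release step omitted here):

```typescript
// Minimal sketch of the idempotent queueing flow, with fake types standing in
// for the real client and repository record.
interface PrintJobRecord { id: string; fileId: string | null; simplyPrintJobId: string | null }
interface QueueClient {
  isEnabled(): boolean;
  addToQueue(args: { fileId: string; amount: number }): Promise<{ jobId: string; queueItemId: string }>;
}

async function queueStockJob(job: PrintJobRecord | undefined, client: QueueClient): Promise<string | null> {
  if (!job || !job.fileId || !client.isEnabled()) return null; // skip: missing job/file or integration disabled
  if (job.simplyPrintJobId) return job.simplyPrintJobId;       // idempotency: duplicate event, already queued
  const res = await client.addToQueue({ fileId: job.fileId, amount: 1 });
  job.simplyPrintJobId = res.jobId; // persisted via the repository in the real service
  return res.jobId;
}
```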
Why the order-service subscribes to its own event (not the print-service):
- The ORDER-purpose print job flow runs entirely within the order-service (via `OrchestrationService` → `PrintJobsService` → `SimplyPrintApiClient`)
- The order-service already has `SimplyPrintApiClient`, `PrintJobsRepository`, and the idempotency/release logic
- Keeping STOCK job queuing in the same service avoids duplicating SimplyPrint integration code across services
- The `EventSubscriberService` already bridges BullMQ events to local handlers for print-job completion, shipments, and integrations
STOCK jobs are not priority-reordered. ORDER-over-STOCK priority (ADR-063) is handled by PrintJobsService.prioritizeOrderJobInQueue() on the ORDER side. STOCK jobs naturally queue behind ORDER jobs.
Consequences¶
Positive:
- Stock replenishment now works end-to-end: cron → batch creation → event → SimplyPrint queue → printer
- Idempotent: duplicate events are safely ignored (job already has SimplyPrint ID)
- Graceful degradation: SimplyPrint API failures are logged but don't crash the event processing
- No changes to the `StockReplenishmentService` itself — it continues to publish events as designed
Negative:
- `EventSubscriberService` now depends on `SimplyPrintApiClient` (previously it only used repositories and re-emitted events)
- Mitigation: `SimplyPrintModule` was already imported in `EventsModule`; the additional dependency is minimal
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Subscribe in print-service `EventSubscriberService` | Print-service's ORDER flow is unused (order-service handles everything); would duplicate SimplyPrint integration code and require maintaining two parallel queue paths |
| Call `SimplyPrintApiClient.addToQueue()` directly in `StockReplenishmentService.evaluateAndSchedule()` | Mixes the batch-creation concern with the queue-dispatch concern; the event-driven approach allows retry/replay and keeps the replenishment service focused on scheduling logic |
| Create a dedicated `StockJobDispatcherService` | Over-engineering for a single subscriber; `EventSubscriberService` already handles similar bridge logic for print-job and shipment events |
Related¶
- ADR-062: Inventory Tracking and Stock Replenishment
- ADR-063: ORDER-over-STOCK Print Queue Priority
- ADR-052: BullMQ Event Queues for Inter-Service Async Communication
- `apps/order-service/src/events/event-subscriber.service.ts` — `queueStockJobToSimplyPrint()` implementation
- `apps/order-service/src/stock-replenishment/stock-replenishment.service.ts` — event publisher
- `docs/03-architecture/sequences/C4_Seq_11_StockReplenishment.puml` — updated sequence diagram
ADR-065: SonarCloud for Continuous Code Quality Analysis¶
| Status | Accepted |
| Date | 2026-03-12 |
| Context | The codebase has ESLint for linting and Syft + Grype for container security scanning, but lacks cross-cutting code quality metrics: duplicated code detection, cognitive complexity scoring, technical debt quantification, and historical trend tracking. A dedicated static analysis platform was needed to fill this gap. |
Decision¶
Adopt SonarCloud Team ($32/month) as the continuous code quality platform, integrated into the Azure DevOps CI/CD pipeline.
Problem¶
Without a cross-cutting code quality platform:
- Duplicated code across microservices was invisible — at initial scan, 19.5% of the codebase was duplicated
- Cognitive complexity of functions was unchecked, leading to unmaintainable business logic
- Security hotspots (regex DoS, pseudorandom generators, publicly writable directories) were undetected
- Technical debt had no quantification or trend tracking
- PR reviews lacked automated quality gate enforcement
Solution¶
SonarCloud analyzes every push to main and every PR, providing:
- `sonar-project.properties` at the repository root — configures source directories, exclusions, coverage report paths, and rule suppressions
- `CodeQuality` job in the `ValidateAndTest` pipeline stage — runs after `UnitTests`, downloads coverage artifacts, executes SonarCloud analysis
- PR decoration — SonarCloud posts quality gate status and issue summaries directly on Azure DevOps pull requests
- Coverage integration — `lcov.info` reports from Vitest/Jest are uploaded to SonarCloud; `sonar.coverage.exclusions` aligns the denominator with the test frameworks' exclusion patterns
- Quality gate — blocks merges when new code introduces bugs, vulnerabilities, or excessive duplication
- Rule suppression — false positives and won't-fix items are suppressed via `sonar.issue.ignore.multicriteria` in `sonar-project.properties` (inline `// NOSONAR` comments do not work for JS/TS in SonarCloud)
- AI Code Assurance — SonarCloud applies a stricter quality gate to AI-generated code, requiring higher coverage and zero issues
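For illustration, a `sonar-project.properties` along these lines would express the configuration described above. The keys `sonar.coverage.exclusions` and `sonar.issue.ignore.multicriteria` are taken from this ADR; the project key, paths, and rule ID are placeholders, not the project's actual values:

```properties
# Illustrative sketch — values are placeholders.
sonar.projectKey=forma3d-connect
sonar.sources=apps,libs

# Coverage: lcov reports from Vitest/Jest; exclusions aligned with the
# test frameworks' own collectCoverageFrom patterns.
sonar.javascript.lcov.reportPaths=coverage/lcov.info
sonar.coverage.exclusions=**/*.spec.ts,**/main.ts

# Centralized rule suppressions (inline // NOSONAR does not work for JS/TS).
sonar.issue.ignore.multicriteria=e1
sonar.issue.ignore.multicriteria.e1.ruleKey=typescript:S2245
sonar.issue.ignore.multicriteria.e1.resourceKey=**/test-helpers/**
```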
Key Configuration Decisions¶
| Decision | Rationale |
|---|---|
| SonarCloud (SaaS) over SonarQube (self-hosted) | Zero infrastructure overhead; staging droplet already at 96% memory usage |
| Monorepo-level scan (not per-project) | Single quality gate covers all 14 source directories; simpler than Nx-per-project scanning |
| `configMode: 'file'` in pipeline | All configuration centralized in `sonar-project.properties`, not scattered in YAML |
| `CodeQuality` job on MS-hosted agent | No load on the self-hosted DO build agent; SonarCloud analysis is network-bound, not CPU-bound |
| Rule suppressions in properties file | // NOSONAR does not work for TypeScript/JavaScript in SonarCloud; properties file suppressions are auditable and centralized |
| Inline comments with rule keys | Each suppressed code location has a // Sonar suppression — typescript:SXXXX: reason comment for traceability |
Results (First Week)¶
| Metric | Before (2026-03-12) | After (2026-03-13) | Change |
|---|---|---|---|
| Total issues | 769 | 244 | -68% |
| Bugs | 9 | 0 | -100% |
| Vulnerabilities | 12 | 0 | -100% |
| Code smells | 748 | 244 | -67% |
| Security hotspots | 6 (TO_REVIEW) | 0 | -100% |
| Duplication | 19.5% | 15.7% | -3.8pp |
| Duplicated lines | ~13,300 | 10,743 | -19% |
Consequences¶
Positive:
- Every PR now has automated quality gate enforcement with inline issue annotations
- Code duplication is visible and measurable — drove the extraction of `libs/service-common` (12,900 duplicated lines removed)
- Security hotspots are reviewed and tracked
- Technical debt is quantified with effort estimates
- Coverage discrepancies between Azure DevOps and SonarCloud are resolved by aligning `sonar.coverage.exclusions` with Vitest/Jest `collectCoverageFrom` patterns
Negative:
- Monthly cost of $32 for SonarCloud Team plan
- Mitigation: Cost is trivial compared to the engineering time saved on code review and technical debt discovery
- `sonar.issue.ignore.multicriteria` in the properties file must be maintained as rules are suppressed
- Mitigation: Each suppression has a documented rationale and inline code comments
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| SonarQube self-hosted | Staging droplet already at 96% memory; would need separate infrastructure (~$20/month for a 2 GB droplet + maintenance overhead) |
| ESLint-only (no SonarCloud) | ESLint cannot detect cross-file duplication, cognitive complexity trends, or security hotspots; no PR decoration or historical dashboards |
| CodeClimate | Less mature TypeScript support; no native Azure DevOps integration; higher cost at scale |
| Codacy | Similar capabilities but SonarCloud has stronger NestJS/React ecosystem support and the team already evaluated it in the research phase |
Related¶
- SonarCloud Code Quality Research
- SonarCloud Issue Triage Report — 2026-03-12
- SonarCloud Issue Triage Report — 2026-03-13
- ADR-006: Azure DevOps for CI/CD with Digital Ocean Hosting
- ADR-015: Aikido Security Platform for Continuous Security Monitoring (Superseded by ADR-067)
- ADR-057: Self-Hosted Build Agent with Hybrid Pipeline Strategy
ADR-066: CodeCharta City Visualization for Codebase Insight¶
| Status | Accepted |
| Date | 2026-03-14 |
| Context | SonarCloud provides numeric code quality metrics (complexity, duplication, code smells, coverage), but these numbers lack spatial context. Developers cannot easily identify hotspots — large, complex, frequently-changed files — or knowledge silos (single-author modules). A visual representation was needed to make these metrics actionable for sprint planning, retrospectives, and onboarding. |
Decision¶
Integrate CodeCharta into the CI/CD pipeline (Option C from the research document) to generate a 3D city map from SonarCloud metrics + git history, served from the existing docs container with shareable URLs.
Problem¶
- SonarCloud metrics are presented as flat lists and numeric summaries — they do not convey spatial relationships between files
- Identifying complexity hotspots, change frequency patterns, and knowledge silos requires manually correlating multiple SonarCloud views
- New team members have no visual onboarding aid to understand codebase structure
- No way to share preconfigured metric views with the team via bookmarkable URLs
Solution¶
- `GenerateCodeCharta` pipeline job in the `Build` stage — runs on MS-hosted `ubuntu-latest`, uses the `codecharta/codecharta-analysis` Docker image (~1.2 GB, CI-only), imports SonarCloud metrics via `ccsh sonarimport` and parses git history via `ccsh gitlogparser`, then merges both into `forma3d.cc.json`
- Artifact handoff — the `.cc.json` is published as a pipeline artifact and downloaded by the `PackageDocs` job
- Dockerfile integration — `COPY codecharta/forma3d.cc.jso[n]` in the docs Dockerfile uses a glob pattern to gracefully handle the file's absence on PR branches
- Nginx CORS — a `/codecharta/` location block serves the file with `Access-Control-Allow-Origin: https://codecharta.com` so the hosted Web Studio can fetch it via XHR
- Shareable URLs — bookmarkable links to `codecharta.com/visualization/app/index.html?file=...` with preconfigured metric mappings (area=ncloc, height=cognitive_complexity, color=code_smells)
- Settings page link — a "Codebase City Map" link in the Help & Support section, visible only to admin users of the default tenant
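The nginx CORS and caching behavior described above could be sketched roughly like this. The `/codecharta/` path, the `codecharta.com` origin, and the `no-cache` policy come from this ADR; the `root` path is a placeholder:

```nginx
# Illustrative sketch — root path is a placeholder, not the actual config.
location /codecharta/ {
    root /usr/share/nginx/html;

    # Only the hosted CodeCharta Web Studio may fetch the file via XHR,
    # not a wildcard origin.
    add_header Access-Control-Allow-Origin "https://codecharta.com" always;

    # Users always see the latest map after a pipeline run.
    add_header Cache-Control "no-cache" always;
}
```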
Key Configuration Decisions¶
| Decision | Rationale |
|---|---|
| Option C (hosted Web Studio + docs-served data) over Options A/B/D | No new container, no new DNS record, no self-hosted visualization; reuses existing docs infrastructure |
| Separate read-only SonarCloud token (`SONARCLOUD_CODECHARTA_TOKEN`) | Security isolation from the service connection used for analysis; revocable without affecting CI |
| `fetchDepth: 0` in `GenerateCodeCharta` job only | Full git history needed for `gitlogparser`; other jobs keep shallow clones for speed |
| CORS restricted to `https://codecharta.com` | Not a wildcard `*`; only the CodeCharta Web Studio can fetch the file |
| `Cache-Control: no-cache` on `/codecharta/` | Users always see the latest map after a pipeline run |
| Glob trick `forma3d.cc.jso[n]` in Dockerfile | Docker `COPY` fails on missing source files; the glob pattern makes it optional without conditional logic |
| `PackageDocs` condition updated with `or()` | Docs rebuild on main when CodeCharta succeeds, even if docs content hasn't changed — ensures fresh maps |
Consequences¶
Positive:
- The team can visualize the codebase as a 3D city — buildings represent files, dimensions encode metrics (lines of code, cognitive complexity, code smells, coverage, change frequency)
- Hotspots, knowledge silos, and temporal coupling are immediately visible
- Shareable URLs enable preconfigured views for sprint planning and retrospectives
- Zero infrastructure cost — reuses existing docs container and publicly hosted CodeCharta Web Studio
- The `.cc.json` contains only file paths and numeric metrics — no source code is exposed
Negative:
- The `GenerateCodeCharta` job adds ~2–3 minutes to the pipeline on main branch builds
- Mitigation: Runs on MS-hosted agents, in parallel with other packaging jobs
- Dependency on the publicly hosted CodeCharta Web Studio at `codecharta.com`
- Known limitation: CodeCharta's CSP (`default-src 'self'`) blocks XHR to external origins, so the shareable `?file=` URL approach does not work. Users must download the `.cc.json` file and drag-and-drop it into the Web Studio manually.
- Mitigation: Option D (self-hosted visualization) can be adopted later to restore shareable URLs with a relaxed CSP
- The `codecharta/codecharta-analysis` Docker image is ~1.2 GB, pulled on every main build
- Mitigation: Docker layer caching on MS-hosted agents; image is not deployed to staging
Alternatives Considered¶
| Alternative | Reason for Rejection |
|---|---|
| Option A: Local developer workstation only | Not shareable; requires local tooling setup; maps not versioned |
| Option B: CI-generated artifacts without serving | Team cannot visualize without downloading and opening manually |
| Option D: Self-hosted visualization container | Adds a new container to staging; increases memory pressure and maintenance overhead |
| Custom visualization dashboard | Significant development effort; CodeCharta Web Studio is mature and feature-rich |
Related¶
- CodeCharta City Visualization Research
- ADR-065: SonarCloud for Continuous Code Quality Analysis
- ADR-006: Azure DevOps for CI/CD with Digital Ocean Hosting
- ADR-038: Zensical for Publishing Project Documentation
- ADR-057: Self-Hosted Build Agent with Hybrid Pipeline Strategy
ADR-067: Grype CVE Scanning with EPSS-Informed Risk Acceptance¶
| Status | Accepted |
| Date | 2026-03-17 |
| Context | The pipeline generates CycloneDX SBOMs for every container image (ADR-026), but SBOMs alone are passive inventories — they list components without evaluating them for known vulnerabilities. A quality gate was needed to prevent deploying images with exploitable CVEs. |
Decision¶
Integrate Grype (by Anchore) into the CI/CD pipeline to scan every SBOM for known CVEs, configured to fail on High severity vulnerabilities that have available fixes (--fail-on high --only-fixed). Use a .grype.yaml exclusion file for vulnerabilities that cannot be patched at the project level. Exclude the Slicer container from scanning entirely due to its unpatchable base image.
Key Concepts¶
CVE (Common Vulnerabilities and Exposures): A standardized identifier for a publicly known security vulnerability. Each CVE has a severity rating (Critical, High, Medium, Low) based on the CVSS scoring system.
EPSS (Exploit Prediction Scoring System): A data-driven model maintained by FIRST.org that estimates the probability a CVE will be exploited in the wild within 30 days, expressed as a percentage (0–100%) and a percentile rank. Unlike CVSS severity (which measures potential impact), EPSS measures likelihood of actual exploitation. For example:
- CVE-2024-9680 (Firefox): EPSS 30.8% (96th percentile) — actively exploited, high urgency
- GHSA-p436-gjf2-799p (docker/cli): EPSS < 0.1% (1st percentile) — theoretically vulnerable but extremely unlikely to be exploited
EPSS is used in this project to inform risk acceptance decisions: Go module CVEs with near-zero EPSS scores from Alpine's docker-cli package are excluded from the scan rather than blocking deployments.
SBOM (Software Bill of Materials): A complete inventory of components in a container image, generated by Syft in CycloneDX format (see ADR-026). Grype scans the SBOM rather than the image directly, which is faster and produces deterministic results.
Problem¶
- Container images contained transitive npm dependencies with High severity CVEs (cross-spawn, minimatch, tar, glob, serialize-javascript)
- The `node:20-alpine` base image bundles npm at runtime, which includes its own vulnerable dependencies (`tar@6.2.1`, `glob@10.4.2`) even though npm is not needed for running the Node.js application
- The Slicer container (BambuStudio v1 base image) had 800+ CVEs from its Debian 12 desktop environment
- Without automated scanning, these vulnerabilities would accumulate silently
Solution¶
Pipeline integration:
Each container packaging job includes a Grype scan step after SBOM generation:
```yaml
- script: |
    grype sbom:<service>-sbom.cdx.json --output table --fail-on high --only-fixed
  displayName: 'Scan SBOM for CVEs (<Service>)'
  condition: eq('${{ parameters.enableSigning }}', 'true')
```
The --only-fixed flag is critical: it only reports CVEs that have a fix available, preventing false failures from vulnerabilities that no one can remediate yet.
Remediation strategy (three layers):
| Layer | Source | Fix |
|---|---|---|
| npm transitive dependencies | cross-spawn, minimatch, file-type, lodash, ajv, bn.js, serialize-javascript, qs | pnpm overrides in package.json forcing patched versions |
| Docker base image bundled npm | tar@6.2.1, glob@10.4.2, cross-spawn@7.0.3 from node:20-alpine's npm |
Strip npm from production images (rm -rf /usr/local/lib/node_modules/npm) |
| Alpine system packages | zlib, docker-cli Go binaries | apk upgrade --no-cache in Dockerfile production stage |
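For illustration, the pnpm overrides layer might look like the fragment below in the root `package.json`. The package names come from this ADR; the version numbers are illustrative placeholders, not the project's actual pinned versions:

```json
{
  "pnpm": {
    "overrides": {
      "cross-spawn": "^7.0.5",
      "minimatch": "^9.0.5",
      "serialize-javascript": "^6.0.2"
    }
  }
}
```

Overrides apply across the entire dependency tree, which is why they work where direct updates of transitive dependencies cannot.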
Risk acceptance (.grype.yaml):
Go module CVEs compiled into Alpine's docker-cli and containerd packages cannot be patched without Alpine shipping updated packages. These are excluded from the scan with documented rationale:
```yaml
ignore:
  - vulnerability: GHSA-p436-gjf2-799p # docker/cli v28→29 (High, EPSS <0.1%)
    package:
      type: go-module
```
All excluded CVEs have EPSS scores at the 0th–1st percentile (near-zero exploitation probability).
Slicer exclusion:
The Slicer container uses linuxserver/bambustudio:01.08.03 (Debian 12) with 38,731 SBOM components and 800+ CVEs from system packages (Firefox ESR, glibc, ffmpeg, GStreamer, Qt5), Go binaries (buildkit, runc, containerd), and Python packages (cryptography). These cannot be fixed without an upstream base image update. The grype scan is commented out with rationale, and a TODO.md entry tracks the BambuStudio v2 upgrade.
Key Design Decisions¶
| Decision | Rationale |
|---|---|
| `--fail-on high` (not `critical`) | Critical-only would miss many actionable High CVEs; High threshold catches the most important vulnerabilities while keeping Medium/Low informational |
| `--only-fixed` | Prevents pipeline failures from CVEs with no available fix — avoids blocking deployments on problems no one can solve |
| Strip npm from production images | The runtime only needs node, not npm/npx; removing npm eliminates an entire class of bundled dependency CVEs |
| pnpm overrides (not dependency updates) | Transitive dependencies can't be updated directly; overrides force specific versions across the entire dependency tree |
| `.grype.yaml` exclusions scoped to `type: go-module` | Exclusions are narrow — they only apply to Go binaries, not npm or Alpine packages |
| Slicer excluded entirely (not just Go modules) | The base image has CVEs across all layers (deb, Go, Python, npm); partial exclusions would still fail the scan |
| EPSS for risk acceptance | CVSS severity alone doesn't indicate exploitation likelihood; EPSS provides data-driven prioritization |
Consequences¶
Positive:
- Every container image is scanned for CVEs before deployment — vulnerabilities cannot reach staging silently
- The `--only-fixed` flag eliminates false positives from unfixable CVEs
- Risk acceptance is documented and auditable (`.grype.yaml` with comments)
- EPSS-informed decisions prevent security theater (blocking on theoretical vulnerabilities with zero exploitation probability)
- npm stripping reduces production image attack surface beyond just CVE remediation
Negative:
- Go module CVEs in Alpine packages require manual exclusion maintenance
- Mitigation: `.grype.yaml` includes review notes; exclusions should be removed when Alpine ships updates
- Mitigation: TODO.md tracks BambuStudio v2 upgrade to reinstate scanning
- Grype must be installed on each pipeline run (~3 seconds)
- Mitigation: Installed to `$HOME/.local/bin`, which persists across steps within a job
Related¶
- ADR-025: Cosign Image Signing for Supply Chain Security
- ADR-026: CycloneDX SBOM Attestations
- ADR-055: BambuStudio CLI Slicer Container
- ADR-057: Self-Hosted Build Agent with Hybrid Pipeline Strategy
- FIRST EPSS Model
- Grype Documentation
ADR-068: Dependency License Compliance Check¶
| Status | Accepted |
| Date | 2026-03-19 |
| Context | The project uses ~200 npm dependencies (direct + transitive). Without automated checking, a non-permissive license (GPL, AGPL, SSPL, Commons Clause) could enter the dependency tree unnoticed through a transitive update, creating legal risk for a proprietary/commercial product. The pipeline already has CVE scanning (Grype, ADR-067) and code quality gates (SonarCloud, ADR-065), but no license compliance gate. |
Decision¶
Add a lightweight dependency license check to the CI pipeline that fails the build if any package in the dependency tree has a non-permissive license. Use license-checker-rseidelsohn — an actively maintained fork of the original license-checker — with a small custom script (scripts/check-licenses.js).
Problem¶
- Transitive dependency updates (via `pnpm update` or lockfile refresh) can silently introduce packages with strong copyleft licenses (GPL-2.0, GPL-3.0, AGPL-3.0) or restrictive terms (SSPL, Commons Clause)
- The project already encountered a licensing concern with Gridfinity GRIPS (non-permissive), leading to the creation of GridFlock under MIT — demonstrating that license awareness is an active concern
- Manual auditing of license changes in `pnpm-lock.yaml` is impractical at scale
Solution¶
Script (scripts/check-licenses.js):
A ~50-line Node.js script that:
- Uses `license-checker-rseidelsohn` to scan the full dependency tree
- Matches each package's license string against a disallowed pattern: `GPL`, `AGPL`, `SSPL`, `Commons Clause` (case-insensitive)
- Excludes private packages (the project's own `UNLICENSED` root)
- Exits with code 1 and lists offending packages if any match
- Exits with code 0 on success
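The deny-list core of such a script could be sketched as below. In the real `scripts/check-licenses.js` the `{ package: licenseString }` map would come from `license-checker-rseidelsohn`'s scan; here it is passed in directly so the sketch stays self-contained, and the function names are illustrative:

```javascript
// Hypothetical sketch of the deny-list matching in scripts/check-licenses.js.
// In the actual script, the license map is produced by license-checker-rseidelsohn;
// here it is a plain object argument. Function names are illustrative.
const DISALLOWED = /(GPL|AGPL|SSPL|Commons Clause)/i;

function findViolations(licenses) {
  return Object.entries(licenses)
    .filter(([, license]) => DISALLOWED.test(String(license)))
    .map(([pkg, license]) => `${pkg}: ${license}`);
}

function checkLicenses(licenses) {
  const violations = findViolations(licenses);
  if (violations.length > 0) {
    console.error('Non-permissive licenses found:');
    violations.forEach((v) => console.error(`  ${v}`));
    return 1; // pipeline-failing exit code
  }
  return 0;
}
```

Because the regex matches anywhere in the license string, compound expressions like `(GPL-2.0 OR MIT)` are flagged too — the deliberately broad behavior the Consequences section notes as a trade-off.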
Pipeline integration:
The check runs as the first step in the Lint job (both azure-pipelines.yml and ci.yml), after dependency installation but before linting:
```yaml
- script: pnpm run license-check
  displayName: 'Check dependency licenses (fail on non-permissive)'
```
Why license-checker-rseidelsohn:
| Option | Status | Rationale |
|---|---|---|
| `license-checker` (davglass) | Abandoned (last release Jan 2019, 75 open issues) | Not suitable for a maintained project |
| `license-checker-rseidelsohn` | Actively maintained fork (~200k weekly downloads) | Compatible API, receives updates and bugfixes |
| Grant (Anchore) | Active | Heavier; designed for SBOM/container scanning rather than npm dependency trees |
| `pnpm licenses list` | Built-in | No built-in fail-on-disallowed; requires more scripting to parse output |
Key Design Decisions¶
| Decision | Rationale |
|---|---|
| Deny-list (not allow-list) | New permissive licenses (e.g. BlueOak-1.0.0) shouldn't require allowlist updates; only known problematic licenses are blocked |
| Case-insensitive regex | License strings in package.json vary in casing (GPL-3.0, gpl-3.0-only, etc.) |
| Run in Lint job | Lint is the fastest-feedback job and already runs on every push; license violations are caught before tests or builds run |
| `excludePrivatePackages: true` | The project root has `"license": "UNLICENSED"`, which is valid for a private project but would false-positive against a strict allow-list |
| Custom script (not CLI flags) | --failOn only matches exact license names; a regex handles compound expressions like (GPL-2.0 OR MIT) correctly |
Consequences¶
Positive:
- Non-permissive licenses cannot enter the dependency tree without failing the pipeline
- Developers get fast feedback (license check runs in ~1 second)
- No external service dependency — runs offline against `node_modules`
Negative:
- Dual-licensed packages where one option is permissive (e.g. `MIT OR GPL-3.0`) will be flagged
- Does not cover non-npm dependencies (e.g. Docker base image licenses, system packages)
- Mitigation: Container-level license scanning could be added via Grant if needed in the future
Related¶
- ADR-067: Grype CVE Scanning with EPSS-Informed Risk Acceptance
- ADR-026: CycloneDX SBOM Attestations
- ADR-065: SonarCloud for Continuous Code Quality Analysis
- license-checker-rseidelsohn
ADR-069: Agent CLAUDE.md Governance — Repo as Source of Truth¶
| Status | Accepted |
| Date | 2026-03-22 |
| Context | The Nanoclaw agentic team (Ryan, Sam, Cody) each have a CLAUDE.md file that defines their identity, responsibilities, protocols, and behavioral rules. These files are mounted read-write into agent containers, meaning agents can technically modify their own instructions. During initial deployment, agents occasionally self-modified their CLAUDE.md or had their files overwritten during sync operations, leading to drift between what the repo contained and what was running on the droplet. |
Decision¶
Adopt a strict governance model for agent CLAUDE.md files:
- The repo (`agentic-team/agents/`) is the single source of truth. All canonical versions of agent CLAUDE.md files live here.
- Individual agents must not self-modify. Each agent's CLAUDE.md contains a governance rule: "You MUST NOT modify your own CLAUDE.md or any other agent's CLAUDE.md."
- The Team agent (main channel) is the only agent authorized to edit CLAUDE.md files. It has write access to all group folders via `additionalMounts`. When Jan requests a behavioral change in the main chat, Team makes the edit.
- Jan and the AI assistant (Cursor) are reviewers. Changes made by Team on the droplet should be periodically synced back to the repo. Changes made in the repo should be pushed to the droplet via scp or the deploy script.
Flow¶
Jan (WhatsApp main chat or Cursor) → Team agent edits CLAUDE.md on droplet
↓
Periodic sync: droplet → repo (manual)
↓
Repo is the canonical record
Consequences¶
Positive:
- No silent behavioral drift — agents cannot quietly rewrite their own rules
- All changes are auditable through the repo's git history
- Team agent provides a conversational interface for behavioral changes without needing SSH or Cursor
- Clear chain of authority: Jan → Team → individual agents
Negative:
- Requires discipline to sync droplet changes back to the repo — if forgotten, the repo becomes stale
- Mitigation: Before pushing repo files to the droplet, always compare checksums first (`md5sum` on the droplet vs `md5` locally) to avoid overwriting agent-side changes
- Agents cannot adapt their own instructions based on learned patterns — all adaptations require human approval
- Mitigation: Agents can suggest changes by asking Jan in their group chat; Jan routes through Team
Related¶
- ADR-070: Per-Agent Claude Model Selection
- Nanoclaw containerConfig documentation
ADR-070: Per-Agent Claude Model Selection¶
| Status | Accepted |
| Date | 2026-03-22 |
| Context | The Nanoclaw agentic team has three agents with different cognitive demands. Ryan (DevOps) and Sam (Infra) primarily run SSH health checks, query APIs, and route information — tasks that don't require deep code reasoning. Cody (Dev) diagnoses code failures, writes fixes, and opens PRs — tasks that benefit from the strongest available model. All agents were initially running on Claude Sonnet 4.6 (the default). The Anthropic API pricing difference is significant: Sonnet costs $3/$15 per MTok (input/output) while Opus costs $15/$75 — a 5x multiplier. |
Decision¶
Configure per-agent model selection via Nanoclaw's containerConfig.model field in the registered_groups database:
- Ryan (DevOps): Claude Sonnet 4.6 (default) — SSH checks, API queries, routing
- Sam (Infra): Claude Sonnet 4.6 (default) — health monitoring, diagnostics
- Cody (Dev): Claude Opus 4.6 — code reasoning, fix generation, PR creation
- Team (main): Claude Sonnet 4.6 (default) — admin tasks, group management
The model is passed as a CLAUDE_MODEL environment variable to the agent container, which the agent-runner forwards to the Claude Agent SDK's query() call. Agents without a model override use the SDK's default (currently Sonnet).
Implementation¶
- Added `model?: string` to Nanoclaw's `ContainerConfig` type
- Container-runner reads `group.containerConfig.model` and injects `-e CLAUDE_MODEL=...` into Docker args
- Agent-runner passes `process.env.CLAUDE_MODEL` to the SDK's `query({ options: { model } })`
- Cody's database entry: `container_config = '{"model":"claude-opus-4-6"}'`
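The container-runner's injection step can be sketched as below. The `ContainerConfig.model` field and the `CLAUDE_MODEL` variable come from this ADR; the helper function name is illustrative, not Nanoclaw's actual API:

```typescript
// Hypothetical sketch — the helper name is illustrative.
interface ContainerConfig {
  model?: string; // e.g. "claude-opus-4-6"
}

// Build the extra Docker args for an agent container. With no override,
// no env var is set and the agent-runner falls back to the SDK default
// (currently Sonnet).
function buildModelEnvArgs(config: ContainerConfig | undefined): string[] {
  if (!config?.model) return [];
  return ['-e', `CLAUDE_MODEL=${config.model}`];
}
```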
Limitation¶
Agents cannot reliably self-report which model they are using when asked conversationally. The Claude Agent SDK does not expose the active model name to the agent's own context. Verification must be done through the Anthropic console usage logs or by inspecting the container's environment variables (docker inspect).
Consequences¶
Positive:
- Cody produces higher-quality fixes with fewer retry loops, potentially offsetting the higher per-token cost
- Ryan and Sam stay cost-efficient on Sonnet for tasks that don't need deep reasoning
- Model selection is per-agent, not global — can be tuned independently
- Easy to change: single database update + restart, no code changes
Negative:
- Opus invocations are 5x more expensive — a typical Cody fix cycle costs $2.50-10.00 vs $0.50-2.00 on Sonnet
- Mitigation: Prepaid Anthropic credit with no auto-reload acts as a hard spending cap; `CONTAINER_TIMEOUT` limits per-invocation token burn
- Modifying Nanoclaw's upstream source (container-runner, agent-runner, types) means patches must be re-applied after upgrades
- Mitigation: Document the patches in the `agentic-team/README.md` troubleshooting section
Related¶
- ADR-069: Agent CLAUDE.md Governance
- Anthropic API Pricing
- Claude Agent SDK documentation
References¶
- Nx Documentation
- NestJS Documentation
- Prisma Documentation
- C4 Model
- ADR GitHub Organization
- Syft — SBOM Generator
- Grype — CVE Scanner
- Sentry Documentation
- Sentry for NestJS
- Sentry for React
- OpenTelemetry Documentation
- Pino Logger
- Sigstore/Cosign Documentation
- Syft Documentation
- CycloneDX Specification
- Grype Vulnerability Scanner
- DigitalOcean Container Registry
- doctl CLI
- Zensical Documentation
- plantuml-markdown
- Apache ECharts
- echarts-for-react
- ClickHouse Documentation
- Grafana Documentation
- OpenTelemetry Collector
- SonarCloud Documentation
- SonarCloud Azure DevOps Extension
- CodeCharta
- CodeCharta GitHub
- license-checker-rseidelsohn