Project Timeline — Forma3D.Connect¶
An AI-built production system, guided by a human
This document traces how Forma3D.Connect evolved from first commit to a multi-service production platform — entirely through AI-human collaboration. It highlights the major milestones, architectural shifts, and the surprisingly impactful role that human intuition played alongside AI velocity.
At a Glance¶
The Numbers¶
| Metric | Value |
|---|---|
| Calendar days (Jan 9 – Mar 2) | 53 |
| Active days | 50 (only 3 days with zero activity) |
| Total AI chat sessions | 460+ |
| Average sessions per active day | ~9.2 |
| Peak day: Feb 17 (microservices deploy) | 19 sessions |
| Human interventions acknowledged by AI | 90+ |
| Pipeline failures pasted by human | 50+ (~1/day) |
| Pipeline failures caused by AI code | ~63% |
| CHANGELOG versions | 17 (0.0.0 → 20260302) |
| Human estimated duration (Phases 0–7) | 26.5 weeks |
| Human estimated duration (Phases 0–13) | 48.5 weeks |
| Actual duration (Phases 0–7) | 10 days |
| Actual duration (Phases 0–13) | 53 days |
| Cost (AI usage, Jan 2026) | €655 |
| Cost (human team estimate) | €120,000 – €160,000 (Phases 0–7) |
Human vs AI: Implementation Plan Timeline Comparison¶
The implementation plan estimated 27.5 work-weeks for a team of 3–4 mid-level full-stack developers to complete Phases 0–8. The AI + human pair completed the same scope in 18 calendar days — including weekends.
The two Gantt charts below use matching phase labels so the scale difference speaks for itself.
Human team estimate (Phases 0–8)¶
Sequential work-weeks for a 3–4 person team. Sub-phases (1b–1d, 5b–5k) are grouped under their parent phase.
AI + human actual (Phases 0–8)¶
One human operator + AI. Weekends included — the AI doesn't take days off.
Phase-by-phase breakdown¶
| Phase | Scope | Human Estimate | AI Actual | Ratio |
|---|---|---|---|---|
| Phase 0 | Foundation, Nx, Prisma, CI | 2 weeks | 1 day (Jan 9) | 10x |
| Phase 1 + 1b–1d | Shopify, Observability, Staging, Tests | 5.5 weeks | 4 days (Jan 9–12) | 10x |
| Phase 2 | SimplyPrint integration | 3 weeks | 1 day (Jan 13) | 15x |
| Phase 3 | Fulfillment automation | 2 weeks | 1 day (Jan 14) | 10x |
| Phase 4 | Dashboard MVP | 3 weeks | 1 day (Jan 14) | 15x |
| Phase 5 | Shipping (Sendcloud) | 2 weeks | 1 day (Jan 16) | 10x |
| Tech Debt 5b–5k | 9 items (domain boundaries, tests, schemas) | 5.5 weeks | 4 hours (Jan 17) | 97x |
| Phase 6 | Hardening, runbooks, load testing | 2 weeks | 1 day (Jan 18) | 10x |
| Phase 7 | PWA, push notifications | 1 week | 1 hour (Jan 19) | 56x |
| Phase 8 | RBAC, session auth, user management | 1 week | 5 days (Jan 21–25) | 1.4x |
| Total | — | 27.5 weeks | 18 days | ~11x |
Phase 8 (RBAC) stands out as the closest to the human estimate. Security-critical features — password hashing, session management, role enforcement, audit logging — benefit less from AI velocity because the complexity is in getting the design right, not in writing code fast.
Summary¶
Detailed Timeline¶
Week 1 — Foundation and Core Integrations (Jan 9–14)¶
Key human intervention — Jan 9: The human spotted that the database schema assumed a 1:1 relationship between Shopify products and print files. The AI acknowledged: "Excellent observation! You're right — the current schema doesn't account for assemblies." This led to the assembly parts model that persists today.
Key human intervention — Jan 10: The human caught the AI overwriting the .env.example file (destroying SimplyPrint config) when adding Sentry variables. Also pushed for 100% Sentry sampling since traffic would be low initially.
Key human intervention — Jan 11: The human spotted that the deployment script would restart containers before running database migrations — a classic production incident waiting to happen. The AI called it "exactly the kind of subtle ordering bug that can cause production incidents!"
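The corrected ordering can be sketched as a tiny deploy script. This is a minimal illustration, not the project's actual script; the command bodies are placeholder `echo`s standing in for the real migration and restart commands:

```shell
#!/bin/sh
set -e  # abort the deploy if any step fails

run_migrations() {
  # the real script would run the schema migration here,
  # e.g. `npx prisma migrate deploy` against the live database
  echo "migrate"
}

restart_containers() {
  # the real script would recreate the containers here,
  # e.g. `docker compose up -d`
  echo "restart"
}

# Migrations run BEFORE the restart, so new containers never
# boot against an old schema.
deploy() {
  run_migrations
  restart_containers
}

deploy
```

The whole fix is the call order inside `deploy`: reversing it reintroduces the incident the human caught.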
Week 2 — Shipping, Hardening, PWA (Jan 16–21)¶
Key human intervention — Jan 17: The human asked "Isn't that quick fix a violation against ADR-032: Domain Boundary Separation?" — catching the AI taking a shortcut that would violate the project's own architecture decisions. The fix was reimplemented properly.
Key human intervention — Jan 18: The human pointed out that relying on existing orders in the database for load tests was brittle — the staging database might be empty. Also caught a version mismatch between the API (0.0.1) and frontend (0.4.0).
Key human intervention — Jan 19: The human identified that the PWA manifest's theme_color is static and can't change when the user toggles dark mode. Also asked about Docker log rotation after a disk-full incident, leading to the json-file logging driver configuration.
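The resulting log-rotation setting looks roughly like this in a compose file. The service name and the size limits below are illustrative assumptions, not the project's actual values:

```yaml
services:
  api:
    logging:
      driver: json-file   # Docker's default driver, now with explicit caps
      options:
        max-size: "10m"   # rotate each log file at 10 MB
        max-file: "3"     # keep at most 3 rotated files per container
```

Without the `options` block, `json-file` logs grow without bound, which is exactly how the disk filled up.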
Week 3 — Auth, Users, Reconciliation (Jan 22–28)¶
Key human intervention — Jan 23: The human raised two issues about pagination: the API wasn't consistently returning pagination metadata, and not all components handled it. This became a project-wide API contract rule: "The API should always return pagination info (page, pageSize, and total)."
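As a sketch of what the contract guarantees, every list response carries the same envelope. The field names come from the quoted rule; the helper itself is illustrative, not project code:

```shell
# Emit the pagination metadata the API contract requires on every
# list response: page, pageSize, and total (field names from the rule).
pagination_meta() {
  page="$1"; page_size="$2"; total="$3"
  printf '{"page":%s,"pageSize":%s,"total":%s}\n' "$page" "$page_size" "$total"
}

pagination_meta 1 20 57
# → {"page":1,"pageSize":20,"total":57}
```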
Key human intervention — Jan 24: The human noticed that documentation URLs, pgAdmin URLs, and other links were hardcoded in the frontend. These would break in production. Environment-based URL configuration was implemented.
Week 4 — OAuth and Research (Jan 29 – Feb 3)¶
Key human intervention — Feb 1: The human asked whether the registry cleanup script would accidentally remove the :cache tag needed for Docker inline caching. "Good catch" — the script was updated to preserve cache tags.
Key human intervention — Feb 1: The human noticed that Docker build times actually got slower after optimization attempts. "You're right — 5m 44s is actually slower than before. The cache isn't working as expected." The cache strategy was reworked.
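The Feb 1 cleanup fix can be sketched as a filter that whitelists the tags inline caching depends on. The exact protected-tag list here is an assumption; only the `cache` tag is confirmed by the source:

```shell
# Given candidate tags on stdin (one per line), print only the ones
# safe to delete — never the :cache tag used for inline build
# caching, and (assumed) never :latest.
prune_candidates() {
  grep -v -E '^(cache|latest)$' || true
}
```

For example, `printf 'cache\nlatest\nabc123\n' | prune_candidates` leaves only `abc123` eligible for deletion.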
Week 5 — Advanced Features (Feb 4–12)¶
Key human intervention — Feb 8: The human identified three missing items in a single review: typechecking hadn't been run, tests weren't written, and Shopify theme docs weren't updated. "Good catches" — all three were addressed.
Key human intervention — Feb 9: The human corrected the AI multiple times about SimplyPrint's queue behavior. The AI was guessing API response field names instead of reading documentation. The human insisted: read the docs first. The AI responded: "You're absolutely right, and I owe you an honest answer."
Key human intervention — Feb 10: The human asked "What is the purpose of the isActive field on ProductMapping? I never asked for it." — prompting a code archaeology investigation that led to removing an unrequested feature.
Week 6 — The Big Split (Feb 13–18)¶
Key human intervention — Feb 15: The human caught a critical deployment bug: the .env file was being overwritten with all image tags on every pipeline run. If a service wasn't rebuilt (due to Nx affected), its tag would be overwritten with the wrong value, potentially bringing down unaffected services.
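The redesigned flow can be sketched as an update that rewrites only the tags of services actually rebuilt in this run, leaving every other line of the `.env` file untouched. The `*_IMAGE_TAG` variable naming scheme below is an assumption for illustration:

```shell
# Update IMAGE_TAG entries in an env file for the services rebuilt
# this run, leaving every other service's tag untouched.
# Args: env file path, then "service=tag" pairs for rebuilt services.
update_env_tags() {
  env_file="$1"; shift
  for pair in "$@"; do
    svc="${pair%%=*}"
    tag="${pair#*=}"
    # order-service -> ORDER_SERVICE_IMAGE_TAG (naming is illustrative)
    var="$(printf '%s' "$svc" | tr 'a-z-' 'A-Z_')_IMAGE_TAG"
    if grep -q "^${var}=" "$env_file"; then
      sed -i "s|^${var}=.*|${var}=${tag}|" "$env_file"
    else
      echo "${var}=${tag}" >> "$env_file"
    fi
  done
}
```

The bug was the opposite behavior: regenerating the whole file every run, so services skipped by `nx affected` got stale or wrong tags.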
Key human intervention — Feb 15: The human noticed cosign was pre-installed on self-hosted agents but the pipeline was redundantly trying to re-download it every run, failing because it was owned by root.
Key human intervention — Feb 16: The human correctly identified that docker builder prune was running on the staging server (inside an SSH heredoc), not on the build agent — the AI had confused which machine was being affected, which was wiping Docker cache after every deployment.
Key human intervention — Feb 17: The human raised two critical issues about database containers in CI: (1) even with trust auth, two agents can't bind to the same port, and (2) 80+ orphaned Docker volumes proved Azure DevOps wasn't cleaning up on cancellation. Both were addressed with dynamic port allocation and cleanup cron jobs.
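Dynamic port allocation can be sketched as deriving a per-agent host port, so two agents on the same machine never bind the same Postgres port. The base port and hashing scheme below are assumptions, not the project's actual values:

```shell
# Derive a distinct Postgres host port per build agent so concurrent
# agents on one machine never collide on a fixed port.
pick_db_port() {
  base=15432
  # hash the agent name into an offset of 0-99 (scheme is illustrative)
  offset=$(printf '%s' "${AGENT_NAME:-local}" | cksum | awk '{print $1 % 100}')
  echo $((base + offset))
}
```

The same agent name always yields the same port, which keeps per-agent cleanup predictable.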
Week 7 — Documentation, Integration UI, and Observability (Feb 18–22)¶
This week focused on EventCatalog architecture documentation, event traceability fields (eventId, source), and database-backed UI configuration for SimplyPrint and Sendcloud connections with AES-256-GCM encrypted credential storage. The ClickHouse + Grafana centralized logging pipeline was deployed — replacing Sentry Logs with a self-hosted stack (OTel Collector → ClickHouse → Grafana). pgAdmin was moved to an on-demand container managed via the DevTools UI. The Nx affected pipeline was hardened with a last-successful-deploy git tag (ADR-059).
Week 8 — Preview Infrastructure and Security (Feb 23 – Mar 2)¶
Key achievements — Week 8:
- Preview cache revolution: Extracted STL preview generation into `@forma3d/gridflock-core` as a reusable library. Created offline scripts for pre-populating preview caches with CPU-aware parallelism. Implemented plate-level caching (268 files, 60 MB) replacing the legacy per-dimension cache (16,471 files, 35 GB) — a 99.8% storage reduction with 100–300x faster preview assembly.
- Security baseline: Configured Aikido for continuous security scanning. Initial findings identified Prisma operator injection risks, Express security header gaps, and dependency CVEs — all tracked for remediation.
- Storefront fix: Grid configurator dimension rounding changed from `Math.round` to `Math.floor` to prevent dimensions like 74.41 cm from rounding up to 74.5 cm.
- SaaS research: Comprehensive SaaS launch readiness research covering onboarding, pricing, billing, GDPR, multi-tenancy considerations, and Stripe integration.
The Architectural Evolution¶
Session Category Breakdown¶
Observation: Bug fixing is the second-largest category (19.9%), reflecting the reality that AI speed doesn't eliminate bugs — it just creates and fixes them faster. The high documentation percentage (16.9%) is unusual for a fast-moving project and reflects the explicit quality mandate in the project rules.
Human-AI Cooperation Analysis¶
The Human Contribution — By the Numbers¶
AI Acknowledgment Patterns¶
| AI Response Pattern | Count | Typical Context |
|---|---|---|
| "Good catch" | 20 | Human found a bug, logic error, or inconsistency |
| "You're right" / "You're correct" | 19 | Human corrected a factual mistake or wrong assumption |
| "Good question" / "Great question" | 32 | Human asked something that revealed a gap in the solution |
| "You're absolutely right" | 10 | Human identified a significant issue the AI missed entirely |
| "Good point" / "Great point" | 3 | Human made a strategic or practical recommendation |
| "Excellent observation" / "Excellent point" | 5 | Human's insight changed the approach |
The Most Critical Human Interventions¶
These interventions directly prevented production incidents or architectural rot:
| Date | Intervention | Severity | AI Response |
|---|---|---|---|
| Jan 9 | Assembly model gap (1:1 → 1:many) | Architecture | "Excellent observation!" |
| Jan 11 | Migration-before-restart ordering | Critical | "Exactly the kind of subtle ordering bug that can cause production incidents!" |
| Jan 17 | ADR-032 violation in a quick fix | Architecture | Reimplemented properly |
| Feb 1 | Registry cleanup deleting cache tags | Deployment | Script updated to preserve cache |
| Feb 9 | Insisting AI read API docs instead of guessing | Process | "I owe you an honest answer" |
| Feb 15 | .env overwrite destroying unaffected services | Critical | Deploy flow redesigned |
| Feb 16 | Identifying wrong machine for `docker builder prune` | Deployment | Cache strategy corrected |
| Feb 17 | Port collision + orphaned volumes in CI | CI/CD | Dynamic ports + cleanup cron |
Patterns in Human-AI Cooperation¶
Human Intervention Frequency Over Time¶
Key insight: Human interventions spike during two specific periods:
- Week 1 (foundation) — setting patterns and catching early design flaws before they propagate.
- Week 6 (microservices migration) — the most complex architectural change with new infrastructure, multiple deployment targets, and CI pipeline rewrites.
The quieter weeks (3–5) represent periods where patterns were established, the AI was operating within known boundaries, and fewer novel decisions needed human judgment.
Pipeline Interventions — The Human as Build Monitor¶
Beyond architectural and design interventions, the human played a constant role as the project's build monitor — pasting pipeline output into the chat 43 times when something broke. In 95% of cases, the human pasted raw Azure DevOps pipeline logs (full ##[section] blocks with timestamps, exit codes, and stack traces). Only twice was the problem described verbally.
Pipeline Failure Categories¶
Root Causes — Who Broke It?¶
The headline number: nearly 63% of all pipeline failures were caused by AI-generated code changes. The AI writes code, pushes it, and the human discovers it broke the pipeline by pasting the output back. This is the dominant feedback loop in the project.
Pipeline Failures Over Time¶
Recurring Failure Patterns¶
Several failure classes kept resurfacing — they weren't one-off issues but structural weaknesses:
| Pattern | Occurrences | Root Problem |
|---|---|---|
| Playwright/test result publishing | 4 times | Azure DevOps task configuration never fully stable |
| Coverage threshold not met | 4 times | AI ships new code without enough tests, every time |
| Cosign/Sigstore attestation | 3 times | Device flow auth fundamentally incompatible with CI |
| Docker build failures | 6 times (5 in Feb) | Each new microservice introduced a new Dockerfile to get wrong |
| Acceptance tests breaking | 8 times | Most fragile stage — any application change can break them |
The "It Used to Work" Pattern¶
At least 3 times, the human expressed explicit frustration:
- "What has changed? It used to work perfectly." (deployment verification, Jan 26)
- "Still failing. What has changed? It used to work all the time." (registry cleanup, Feb 1)
- "The tests still seem not to publish correctly." (test results, Feb 16)
These all trace back to AI changes breaking previously-working pipeline functionality — a side effect of the AI modifying pipeline YAML or build scripts without fully understanding the downstream effects.
The Self-Hosted Agent Effect¶
When the project moved to self-hosted build agents around Feb 15, an entirely new class of failures appeared:
| Failure | Why It Didn't Happen on Hosted Agents |
|---|---|
| `syft: command not found` | Hosted agents have most tools pre-installed |
| `git fetch: non-fast-forward` | Hosted agents start with a clean workspace every time |
| PostgreSQL port collisions | Hosted agents don't share ports between pipeline runs |
| 80+ orphaned Docker volumes | Hosted agents are ephemeral — no accumulation |
What This Reveals About AI-First Development¶
The key metric: with 43 pipeline failures across 41 active days, the project averaged roughly 1 pipeline failure per day that required human intervention. This is the operational cost of AI-first development — the AI writes code fast but doesn't run the pipeline locally, so the human becomes the feedback loop between the AI and the CI system.
What the Data Reveals¶
The Human Is Not Just a Rubber Stamp¶
With 74 acknowledged interventions across 296 sessions, the human contributed meaningfully in roughly 25% of all sessions. These weren't cosmetic suggestions — they prevented production incidents, caught architectural violations, and injected domain knowledge the AI didn't have.
The AI's Blind Spots Are Predictable¶
The AI consistently struggles with:
- Cross-boundary effects — changing one file without checking if it affects deployments, docs, or other services
- Real-world behavior vs. documented behavior — APIs that behave differently than their docs suggest
- Infrastructure edge cases — concurrent agents, cancelled pipelines, disk-full scenarios
- Optimizing for speed over safety — taking shortcuts that violate the project's own architecture decisions
- Visual/spatial reasoning — the logo SVG required 10+ rounds of human correction
- Pipeline awareness — the AI caused 62.8% of all pipeline failures by pushing code without local verification; coverage thresholds were violated 4 times by shipping new code without tests
The Collaboration Gets More Efficient¶
Early weeks required more interventions per session. As the project established patterns, rules (like the .cursorrules file), and ADRs, the AI needed less correction. The microservices migration (Week 6) temporarily disrupted this — introducing new infrastructure always resets the learning curve.
The Human's Superpower: Asking "What About...?"¶
The single most impactful human behavior was asking about scenarios the AI didn't consider:
- "What about when you cancel the pipeline?" → discovered orphaned containers
- "What about concurrent build agents?" → discovered port collisions
- "What about an empty staging database?" → discovered brittle load tests
- "What about the other documents that mention endpoints?" → discovered stale docs
- "What about the cache tag?" → prevented cache strategy from being destroyed
These "what about" questions account for roughly a third of all human interventions and nearly all of the most critical ones.
Cost and Velocity Summary¶
Conclusion¶
Forma3D.Connect was built by AI — but it was shaped by a human. The AI provided velocity, breadth of knowledge, and tireless consistency. The human provided judgment, domain expertise, and the crucial ability to ask "but what about...?"
Neither could have built this system alone. The AI without the human would have shipped a fragile system full of subtle deployment bugs, architectural shortcuts, and hardcoded assumptions. The human without the AI would still be in Week 3 of the original 26.5-week plan.
The data shows that human oversight is not overhead — it's a force multiplier. The 74 interventions didn't slow the project down; they prevented the kind of problems that derail projects for days or weeks. The cost of those interventions (a few minutes of human attention each) was trivial compared to the production incidents they prevented.
This is what AI-human collaboration looks like when it works: AI velocity, human wisdom, and a shared commitment to getting it right.
Addendum: The ClickHouse Death Spiral (March 3, 2026)¶
One day after the timeline was "complete," the AI proved its value again.
The Question¶
The human noticed ClickHouse consuming 83% CPU on the staging Dozzle dashboard and asked a simple question: "Is it normal that ClickHouse is taking more than 80% CPU while the system is doing nothing?"
What the AI Found¶
Within minutes of SSH-ing into the server, the AI uncovered a merge death spiral — a self-reinforcing feedback loop that was silently burning both CPUs and would have become a production crisis:
- System tables had grown unchecked. ClickHouse's own internal logging tables had accumulated 138 million rows in `asynchronous_metric_log`, 3 million rows / 788 MiB in `text_log`, and 82 active parts in `metric_log` — all without any TTL or retention policy.
- Background merges were failing. ClickHouse needed to merge these bloated tables, but every merge attempt hit `MEMORY_LIMIT_EXCEEDED` (Code 241). The server had 3.82 GiB of RAM shared across 16 containers, but ClickHouse was configured to claim 80% of the host's RAM (3.06 GiB) — memory it could never actually get.
- Failed merges created more data. Each OOM failure logged a massive stack trace to `system.text_log`, which created more parts, which triggered more merge attempts, which failed, which generated more errors. Classic death spiral.
- 713 threads spinning on retry loops. 48 MergeTree background threads were all caught in this fail-log-retry cycle, pegging both CPUs at 163% with zero useful work.
- The OTel filter was defined but never wired. A `filter/drop-debug` processor existed in the OpenTelemetry collector config but wasn't included in the pipeline — debug-level logs were being ingested unnecessarily.
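Wiring an already-defined processor in means listing it in the logs pipeline, not just declaring it. Roughly, in OTel collector config terms (the receiver/exporter names and the exact filter condition here are illustrative assumptions):

```yaml
processors:
  filter/drop-debug:
    logs:
      log_record:
        # drop records below INFO severity (i.e. debug/trace)
        - severity_number < SEVERITY_NUMBER_INFO
service:
  pipelines:
    logs:
      receivers: [otlp]
      # the processor only takes effect once it appears in this list
      processors: [filter/drop-debug, batch]
      exporters: [clickhouse]
```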
The Fix¶
The AI applied a layered fix in under 30 minutes:
| Layer | Change | Impact |
|---|---|---|
| Immediate | Truncated all bloated system tables | CPU dropped from 163% to 3.6% instantly |
| Docker Compose | Added `mem_limit: 1536m`, `cpus: 1.0` to ClickHouse | Container can no longer starve other services |
| ClickHouse config | Added 3–7 day TTL to all system tables | Tables auto-purge, preventing unbounded growth |
| ClickHouse config | Set `text_log` level to `warning` | Broke the error-logging feedback loop |
| ClickHouse config | Reduced `background_schedule_pool_size` from 512 to 32 | Fewer idle threads on a 2-CPU machine |
| OTel collector | Activated the `filter/drop-debug` processor in the pipeline | Stopped unnecessary debug log ingestion |
Before and After¶
| Metric | Before | After |
|---|---|---|
| CPU | 162.84% | 3.57% |
| Memory | 937.9 MiB (no limit) | 290.8 MiB / 1.5 GiB |
| Threads | 713 | 212 |
| Error rate | ~5 errors/sec | 0 |
| System table size | ~1.3 GiB | Truncated + TTL |
Why This Matters¶
This incident perfectly illustrates the value of AI-human collaboration in operations:
- The human noticed. A glance at a dashboard, a gut feeling that 83% CPU during idle was wrong. No alert fired. No user complained. The human's pattern-matching caught it.
- The AI diagnosed. Within minutes, it traced the root cause through five layers of infrastructure — Docker stats, ClickHouse system tables, background thread metrics, error logs, and config files. A human SRE would need significant ClickHouse expertise to identify a merge death spiral this quickly.
- The AI fixed it safely. It applied the fix incrementally (truncate first for immediate relief, then config changes, then restart), caught and corrected a startup failure caused by ClickHouse's pool size constraints, and verified the fix was stable before declaring success.
Left unchecked, this spiral would have continued degrading the staging environment and — once the same ClickHouse configuration was promoted to production — would have caused a production outage. The total time from "is this normal?" to "CPU at 3.57% and stable" was under 30 minutes.
This is the "what about...?" pattern in action — except this time, it happened after the project was declared complete.
Addendum: The ClickHouse Death Spiral Returns (March 18, 2026)¶
Two weeks later, the same spiral — proving that fixing symptoms without fixing architecture is borrowing time.
The Recurrence¶
The human noticed ClickHouse at 51% CPU on the Dozzle dashboard and asked the AI to check. SSH diagnostics revealed the same pattern: 69% CPU, 417 MiB / 1.5 GiB memory, 1,524 OOM errors in 10 minutes, merge tasks stuck in a retry loop. The system.metric_log table's merge task had reached level 748, and system.text_log had re-bloated to 113 MiB.
Every fix from March 3 was in place: memory limits, TTLs, thread pool tuning, log level filtering, debug log dropping. All correctly configured. All ineffective.
Why the Previous Fix Failed¶
The March 3 fix treated the correct symptoms but missed the architectural root cause:
| What was configured | Why it didn't work |
|---|---|
| 3-day TTLs on system tables | TTL cleanup runs during merges. When merges OOM, TTLs can't execute either — a circular dependency. |
| `toYYYYMM` monthly partitions | With a 3-day TTL inside a monthly partition, expired rows must be rewritten out of the part (a merge-like operation). On a 1.5 GiB container, rewriting a month of `metric_log` exceeds memory. |
| `error_log` | Never configured at all — no TTL, no retention. Accumulated 825K rows of OOM errors, fueling the spiral. |
| `metric_log` collected every 10s | Created ~1 part per 10 seconds → 155K parts over 18 days. Merge tree depth reached level 748 before a single merge exceeded memory. |
| `flush_interval: 7.5s` | Too frequent — each flush creates a new part. More parts = more merge pressure. |
The fundamental issue: TTLs and monthly partitions are architecturally incompatible on a memory-constrained system. TTL cleanup on monthly partitions requires the same merge operations that caused the OOM in the first place.
The Actual Fix¶
| Layer | Change | Why it works |
|---|---|---|
| Partition scheme | `toYYYYMM` → `toYYYYMMDD` (daily) on all system tables | Expired data = entire day-part → dropped with zero I/O and zero memory. No merge-rewrite needed. |
| Missing table | Added `error_log` config with 3-day TTL | Closes the gap that let OOM errors accumulate without retention. |
| Missing table | Added `query_metric_log` config with 3-day TTL | Another table that had no retention policy. |
| Collection rate | `metric_log` collect interval 10s → 60s | 6× fewer parts created, dramatically lower merge pressure. |
| Flush rate | All flush intervals 7.5s → 30–60s | More rows batched per part, far fewer parts to merge. |
| Disabled tables | `trace_log`, `processors_profile_log`, `asynchronous_insert_log` removed | Zero merge/storage overhead for tables nobody reads on staging. |
| Table migration | Dropped old monthly-partitioned tables; ClickHouse recreated them with the daily scheme | Config only applies at table creation; existing tables needed manual recreation. |
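In ClickHouse's server config the new scheme looks roughly like this (one table shown in full; the values mirror the table above, but the exact file layout and the `remove` attribute usage are assumptions about this deployment):

```xml
<clickhouse>
    <metric_log>
        <!-- daily parts: TTL expiry drops a whole day-part, no merge-rewrite -->
        <partition_by>toYYYYMMDD(event_date)</partition_by>
        <ttl>event_date + INTERVAL 3 DAY DELETE</ttl>
        <!-- fewer, larger parts: collect every 60s, flush every 30s -->
        <collect_interval_milliseconds>60000</collect_interval_milliseconds>
        <flush_interval_milliseconds>30000</flush_interval_milliseconds>
    </metric_log>
    <!-- previously unconfigured table now gets retention too -->
    <error_log>
        <partition_by>toYYYYMMDD(event_date)</partition_by>
        <ttl>event_date + INTERVAL 3 DAY DELETE</ttl>
    </error_log>
    <!-- tables nobody reads on staging: disabled entirely -->
    <trace_log remove="1"/>
</clickhouse>
```

The key line is `partition_by`: with daily partitions, TTL enforcement becomes a cheap drop-part instead of the merge-rewrite that kept hitting the memory limit.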
Before and After¶
| Metric | Before | After |
|---|---|---|
| CPU | 69% | 3.2% |
| Memory | 417 MiB / 1.5 GiB | 110 MiB / 1.5 GiB |
| System table partitions | Monthly (merge-rewrite for TTL) | Daily (drop-part for TTL) |
| Error rate | ~150 errors/min | 0 |
| `metric_log` part creation | 1 part / 10s | 1 part / 60s |
The Lesson¶
Configuration correctness ≠ operational correctness. Every TTL, every memory limit, every thread pool setting from the March 3 fix was syntactically correct and semantically appropriate. But the interaction between monthly partitions and short TTLs created a situation where the TTLs could never execute — they required the very merge operations that were failing. The fix looked right in the config file but was architecturally impossible at runtime.
This is a class of bug that's invisible to code review and config audits. The only way to catch it is to understand how ClickHouse implements TTL cleanup (via merges) and reason about whether the merge operations themselves are feasible within the resource constraints. It's a second-order failure mode: not "does this setting exist?" but "can the mechanism that enforces this setting actually run?"
Addendum: The Cosign 401 Mystery (March 6–7, 2026)¶
When the AI's analytical approach hit a wall, the human's memory of past struggles broke through.
The Problem¶
After introducing self-hosted DigitalOcean build agents alongside the default Microsoft-hosted agents, all staging promotion attestations started failing with a 401 Unauthorized error. The cosign tool could sign images and record entries in the Sigstore transparency log, but couldn't push the attestation back to the DigitalOcean Container Registry:
```
Error: signing registry.digitalocean.com/forma-3d/forma3d-connect-order-service@sha256:...
GET https://api.digitalocean.com/v2/registry/auth?scope=repository:forma-3d/forma3d-connect-order-service:push,pull
→ unexpected status code 401 Unauthorized
```
The confusing part: image builds, pushes, cosign signing, and SBOM attestations all succeeded in the earlier build stage — on the same agent pool, with the same token. Only the staging promotion attestations (which ran later in the pipeline) failed.
The AI's Attempts¶
The AI approached the problem analytically, trying three successive hypotheses:
| Attempt | Hypothesis | Fix Applied | Result |
|---|---|---|---|
| 1 | Custom `DOCKER_CONFIG` path was wrong | Removed custom config, used default `~/.docker/config.json` | Still 401 |
| 2 | Credential helper on self-hosted agents intercepting `docker login` | Reset Docker config to `{}` before login, verified inline auth | Still 401 |
| 3 | Missing `set -e` hiding a silent login failure | Added `set -e` to all login blocks | Still 401 |
Each hypothesis was reasonable. Each was wrong. The AI was trapped in a pattern of analyzing the current pipeline configuration without considering DigitalOcean-specific authentication behavior.
The Human's Breakthrough¶
The human remembered: "I feel like we have had this problem a long time ago." They asked the AI to search the Specstory conversation history in .specstory/history/ — the archive of all previous AI-human sessions.
The AI found it immediately: a conversation from January 14, 2026 (container-image-promotion-issue) where the exact same 401 Unauthorized error had been encountered and solved. The fix was simple but non-obvious:
`docker login` with a raw DigitalOcean API token produces credentials that the Docker CLI can use but cosign cannot. DigitalOcean's registry auth endpoint rejects the raw token format when cosign presents it. The solution is `doctl registry login`, which generates a properly scoped registry credential that both Docker and cosign understand.
The Fix¶
Three lines replaced the broken docker login pattern across all attestation jobs:
```shell
doctl auth init --access-token $DOCR_TOKEN
doctl registry login
```
This matched the pattern already used successfully in the pipeline's CleanupRegistry job — which also runs cosign operations against the DigitalOcean registry.
Why This Matters¶
This incident reveals a fundamental limitation and a fundamental strength of AI-human collaboration:
- The AI's limitation: It can analyze what's in front of it with extraordinary depth. It traced credential flows through Azure DevOps variable expansion, Docker config files, and cosign's `go-containerregistry` library. But it couldn't recall that this exact problem had been solved before — it had no memory across sessions.
- The human's strength: The human had a feeling. Not a precise recollection, but an intuition born from having lived through the January debugging session. That vague memory — "didn't we fix something like this before?" — was worth more than three rounds of systematic analysis.
- The archive as shared memory: The `.specstory/history/` folder — containing 460+ conversation logs — acted as a bridge between the human's fuzzy recall and the AI's precise execution. The human knew something was there; the AI could find and apply it in seconds.
This is the inverse of the ClickHouse incident. There, the AI diagnosed a novel problem the human couldn't have solved alone. Here, the human's experiential memory broke through where the AI's analytical approach kept missing. The conversation archive turned an individual's vague memory into an actionable solution — a form of institutional knowledge that neither human nor AI could have leveraged alone.
Addendum: Intelligent Ralph Wiggum Loops (January 9 – March 8, 2026)¶
When the human becomes the build monitor and the AI becomes the code monkey in a feedback loop.
What Is a Ralph Wiggum Loop?¶
A Ralph Wiggum loop is the practice of repeatedly feeding the same prompt to an AI until the task is fully complete. In its original form, it's a dumb bash loop: `while :; do cat PROMPT.md | claude; done`. The AI runs, the pipeline checks, the loop repeats until everything passes.
In Forma3D.Connect, we ran an intelligent variant: a human monitors the pipeline, identifies what still fails, and feeds only the relevant failure logs back to the AI. The AI proposes a fix, the human pushes it, the pipeline runs, and the cycle repeats until the pipeline goes green — or the human gives up and changes strategy.
The Numbers¶
| Metric | Value |
|---|---|
| Total intelligent Ralph Wiggum loops identified | 21 |
| Cross-session recurring patterns | 3 |
| Intentional verification loops | 1 |
| Total iterations across all loops | ~85 |
| Loops ending in clean success | 13 (62%) |
| Loops abandoned or worked around | 4 (19%) |
| Loops partially resolved or evolved | 4 (19%) |
| Longest loop by iterations | 10+ (Cosign 401, Mar 6–8) |
| Longest loop by duration | 44 hours (Cosign 401, Mar 6–8) |
| Shortest successful loop | 25 minutes (Cosign flag mismatch, Feb 26) |
| Cases where the human broke the loop | 5 |
| Average iterations per loop | ~4 |
All 21 Loops — Chronological Catalog¶
RW #1 — Azure Pipeline Type Check Cascade (Jan 9)¶
| Field | Value |
|---|---|
| Iterations | 4 |
| Duration | ~2 hours |
| Outcome | Success |
TypeScript's React 19 JSX namespace change triggered the first cascade: fix TS2503 → nx affected can't find main branch → unit tests exit with "No tests found" → deprecated Azure DevOps task versions. Four distinct errors, each revealed only after the previous was fixed.
RW #2 — Publish Test Results + Badge Fix (Jan 9)¶
| Field | Value |
|---|---|
| Iterations | 2 |
| Duration | ~30 min |
| Outcome | Success |
No JUnit XML files being generated → wrong Azure DevOps badge URL. Two sequential pipeline issues fixed in the project's first day.
RW #3 — Docker Push Authorization Saga (Jan 11)¶
| Field | Value |
|---|---|
| Iterations | 5 |
| Duration | ~3 hours |
| Outcome | Success |
The most dramatic early loop. Docker login fails → AI fixes env mapping → push still fails → human spots wrong registry URL → push succeeds but hits repo limit → human creates new registry → AI rewrites all image naming. The human's intervention at iteration 3 (spotting the wrong registry) was the breakthrough the AI couldn't reach analytically.
RW #4 — CI Pipeline Triggering + Deployment (Jan 12)¶
| Field | Value |
|---|---|
| Iterations | 3–4 |
| Duration | ~2 hours |
| Outcome | Success |
Pipeline stops auto-triggering → fix trigger path exclusions → deployment verification fails → debug deployment steps. Mixed loop that evolved from one problem to another.
RW #5 — Staging Deployment Health Verification (Jan 12)¶
| Field | Value |
|---|---|
| Iterations | 4+ |
| Duration | ~4 hours |
| Outcome | Resolved (with SSH escalation) |
Health endpoint returns 404 → AI investigates Traefik routing → AI given SSH access to debug server directly → traces through Docker logs, .env files, Prisma migrations, network config. The loop escalated from "paste pipeline log" to "give the AI SSH access and let it dig."
RW #6 — Acceptance Test Post-Deploy (Jan 12)¶
| Field | Value |
|---|---|
| Iterations | 2 |
| Duration | ~1 hour |
| Outcome | Partial (continued in other sessions) |
Acceptance tests fail after first successful deploy → fix Prisma binary targets for Alpine → still failing → continued debugging in subsequent sessions.
RW #7 — Playwright Report Publishing (Jan 14, cross-session)¶
| Field | Value |
|---|---|
| Iterations | 3 (across 2 sessions) |
| Duration | ~4 hours |
| Outcome | Success |
Azure DevOps PublishHtmlReport@1 can't find attachment → AI masks with continueOnError → human opens new session with same error → AI adds file-existence checks → user says "Again..." → AI realizes reportDir must point to a file, not a directory. Third time's the charm.
RW #8 — Acceptance Tests: Shipping Integration (Jan 16)¶
| Field | Value |
|---|---|
| Iterations | 4 |
| Duration | ~3 hours |
| Outcome | Abandoned |
Four shipping tests fail with wrong HTTP status codes → AI fixes error handling → still failing → AI tries again → still failing → human abandons the loop and disables shipping in staging (SHIPPING_ENABLED=false) since there are no SendCloud API keys configured anyway. A pragmatic exit.
RW #9 — Order Metadata Schema Type Check (Jan 17)¶
| Field | Value |
|---|---|
| Iterations | 2 |
| Duration | ~1 hour |
| Outcome | Success |
Zod z.record() needs 2 args in v4 → fix type signature → pipeline advances to unit tests → fix coverage thresholds.
RW #10 — Rate Limiting / Orders API Timeout (Jan 18, cross-session)¶
| Field | Value |
|---|---|
| Iterations | 2+ (across 2 sessions) |
| Duration | ~2 hours |
| Outcome | Unresolved |
Acceptance tests hit 429 rate limiting → AI adds @SkipThrottle() to missing controllers → human opens new session: "Are you sure the fix was deployed?" → same 429s plus new 503 errors emerge. The loop stalled on uncertainty about whether the fix had actually been deployed.
RW #11 — Acceptance Test Run Failure (Jan 21–22)¶
| Field | Value |
|---|---|
| Iterations | 3 |
| Duration | ~9 hours |
| Outcome | Likely resolved |
Acceptance tests fail in CI → AI fixes test locators → unit tests fail → AI fixes → acceptance tests fail again → AI fixes more test logic. Ended with the human asking the AI to document the local test procedure, suggesting eventual success.
RW #12 — Real-Time Updates Acceptance Test Failure "The Marathon" (Jan 24–25)¶
| Field | Value |
|---|---|
| Iterations | 8+ |
| Duration | ~20 hours |
| Outcome | Partially resolved |
The most intense loop in the project. Missing env vars → AI fixes → still failing → AI fixes more config → human says "I am tired of going back and forth with CI" → retries after local test → "Still 8 failing" → AI fixes locators → user screams "FIX IT" → AI refactors tests → deployment verification fails (version mismatch) → AI discovers its own nginx proxy broke the version check. Tests went from 45 failing to 1, but the deployment verification remained broken. The 37,897-line conversation file is the longest in the entire project history.
RW #13 — User Creation Authentication Error "The Deployment Saga" (Jan 26)¶
| Field | Value |
|---|---|
| Iterations | 6 |
| Duration | ~3 hours |
| Outcome | Success |
Can't create users → Prisma role seeding missing → CI acceptance tests fail → deployment check fails → Docker container warnings → 1 test still fails → all resolved.
RW #14 — Deployment Verification "Version Mismatch Whack-a-Mole" (Jan 26)¶
| Field | Value |
|---|---|
| Iterations | 4 |
| Duration | ~2 hours |
| Outcome | Success |
Version mismatch on staging → AI investigates imageTag propagation → "I reran, still fails" → AI SSHes into server → "From 1 failing back to 45 failing?" — a regression caused by the fix itself. Eventually resolved after the AI identified a Docker image build issue.
RW #15 — isActive Feature Removal Cascade (Feb 10)¶
| Field | Value |
|---|---|
| Iterations | 5 (2 initial + 3 whack-a-mole) |
| Duration | ~13 hours |
| Outcome | Success |
Removing an unrequested isActive field triggered a cascade across three pipeline stages: lint (unused import) → Docker build (seed.ts still references isActive) → BDD generation (step definition text mismatch). User frustration marker: "Failed again. Make sure to run acceptance tests locally before I commit..." The AI learned — in the final iteration, it ran bddgen locally before declaring the fix.
RW #16 — Gridflock Crash Loop (Feb 16)¶
| Field | Value |
|---|---|
| Iterations | 2 |
| Duration | ~1 hour |
| Outcome | Cut off (continues in RW #17) |
Gridflock service crash-looping due to missing NestJS module imports → AI fixes ServiceClientModule → still crash-looping → different missing dependency (SlicerClient). Each fix resolved one missing dependency but revealed another.
RW #17 — The Grand Marathon (Feb 16)¶
| Field | Value |
|---|---|
| Iterations | 8 |
| Duration | ~12 hours |
| Outcome | Partial success |
The day-long debugging marathon after a deployment broke everything: 44+ acceptance test failures spanning auth forwarding, WebSocket config, gridflock crashes, proxy 404s, PostgreSQL auth, and pipeline configuration. Progressive failure reduction: 44+ → ~20 → ~5 → 3 → 2 → then infrastructure issues (PostgreSQL auth on build agents, pg_isready: command not found) derailed progress. The AI had to completely rewrite a PostgreSQL wait script from bash to Node.js mid-loop. 27 conversation turns over 12 hours.
RW #18 — Sendcloud Test Loop (Feb 16)¶
| Field | Value |
|---|---|
| Iterations | 2 |
| Duration | ~1 hour |
| Outcome | Abandoned (workaround) |
Late evening after the day-long marathon. Shipping/sendcloud tests still failing → AI proposes fix → still failing → human breaks the loop: "Ok, I will merge my latest changes and force a full build on main." Pragmatism over persistence.
RW #19 — Docs Docker Image Build (Feb 17)¶
| Field | Value |
|---|---|
| Iterations | 2 |
| Duration | ~2 hours |
| Outcome | Success |
CHANGELOG.md not accessible inside container → AI fixes .dockerignore + COPY → same error persists → AI realizes the real root cause: pymdownx.snippets' restrict_base_path blocks .. in snippet paths. Textbook depth-of-analysis loop: first fix was necessary but insufficient.
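For reference, the relevant knob lives in the MkDocs configuration. A minimal sketch of the kind of change involved (the project's actual mkdocs.yml may differ):

```yaml
# mkdocs.yml -- pymdownx.snippets rejects snippet paths containing ".."
# unless restrict_base_path is disabled (or base_path is widened instead).
markdown_extensions:
  - pymdownx.snippets:
      restrict_base_path: false  # allows snippets like --8<-- "../CHANGELOG.md"
```

Disabling the restriction is the blunt fix; pointing `base_path` at the repository root is the safer alternative.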
RW #20 — PostgreSQL Auth on Self-Hosted Agents (Feb 17)¶
| Field | Value |
|---|---|
| Iterations | 2 (then evolved into architecture discussion) |
| Duration | ~3 hours |
| Outcome | Evolved |
PostgreSQL password auth failing → AI rewrites with trust auth and dynamic ports → same failure → AI SSHes into agent, finds host-level PostgreSQL interfering. Rather than continuing to iterate, the conversation pivoted to forward-looking infrastructure design (Testcontainers, MS-hosted agents, cleanup strategies). The intelligent version of breaking the loop: recognizing when iteration won't help and changing approach.
RW #21 — Type Check Cascade (Feb 23)¶
| Field | Value |
|---|---|
| Iterations | 3 |
| Duration | ~45 min |
| Outcome | Success |
The most efficient loop. TypeScript error (unused import) → fix → unit tests fail (missing tenantId) → fix → all 14 projects build successfully. Progressive pipeline stage unlocking: typecheck → tests → build → green.
RW #22 — Cosign Flag Mismatch (Feb 26)¶
| Field | Value |
|---|---|
| Iterations | 2 |
| Duration | ~25 min |
| Outcome | Success |
SBOM too large for Rekor transparency log → AI adds `--no-tlog-upload` → unknown flag error → AI corrects to `--tlog-upload=false` (a flag rename between cosign versions). The shortest successful loop.
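A rename like this is easy to guard against with a version gate. A hypothetical helper; the exact release where the old flag disappeared is an assumption here, not verified against cosign's changelog:

```typescript
// Hypothetical guard for the cosign flag rename described above.
// ASSUMPTION: treats 2.x as the boundary between the two flag spellings.
function tlogDisableFlag(cosignVersion: string): string {
  const major = Number(cosignVersion.split(".")[0].replace(/^v/, ""));
  return major >= 2 ? "--tlog-upload=false" : "--no-tlog-upload";
}
```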
RW #23 — The Grand Cosign 401 (Mar 6–8)¶
| Field | Value |
|---|---|
| Iterations | 10+ |
| Duration | 44 hours |
| Outcome | Success |
The longest and most dramatic loop in the entire project. `cosign attest` returned 401 Unauthorized when pushing to DigitalOcean Container Registry. The AI proposed 7+ wrong hypotheses over 10 iterations:
| # | AI Hypothesis | Result |
|---|---|---|
| 1 | Token issue / missing `set -e` | Still 401 |
| 2 | Custom `DOCKER_CONFIG` interference | Still 401 |
| 3 | Stale env vars | Still 401 |
| 4 | Dirty `config.json` | Still 401 |
| 5 | Wrong login method (docker login vs doctl) | Still 401 |
| 6 | Pipeline config drift from known-good state | Still 401 |
| 7 | Self-hosted agent state corruption | Still 401 (even on MS-hosted agent!) |
The human broke the loop twice:
- At iteration 5: asked the AI to search `.specstory/history/` for past solutions (the AI's conversation archive as institutional memory)
- At iteration 10: identified the actual root cause — the registry was locked during garbage collection, so attestation needed to run after cleanup, not in parallel
User frustration escalation: "Still failing" → "Hmmm fails again" → "Still not working??????" → "It worked! Finally! GOOD JOB."
Cross-Session Clusters¶
Some loops didn't occur in isolation but formed clusters — waves of related failures spanning multiple conversations.
Cluster A (Feb 15–17) is the most intense: a single deployment event triggered a cascade of 7 interconnected sessions spanning 28 hours and 5 Ralph Wiggum loops. The overall failure arc continued across session boundaries even when individual sessions reached local conclusions. By the end, the conversation had pivoted from reactive bug-fixing to proactive infrastructure redesign.
Cluster B (Jan 22 → Feb 3) reveals a structural weakness: the coverage threshold was lowered from 78% to 72% after the first failure, but coverage actually dropped further (from 76.75% to 71.18%) because new code kept being shipped without sufficient branch coverage.
Cluster C (Jan 26 → Feb 1) shows the container registry cleanup script breaking in a new way each time it was "fixed." The human's frustrated "Still failing. What has changed? It used to work all the time." captures the pattern perfectly.
Cluster D (Feb 26 + Mar 6–8) connects the shortest and longest loops in the project — both involving cosign + DigitalOcean, but with entirely different root causes.
Loop Outcomes Analysis¶
How the Human Broke the Loop¶
In 5 cases, spanning 4 distinct loops, the human actively broke the cycle rather than letting the AI iterate to a solution:
| Loop | How the Human Broke It | Why |
|---|---|---|
| #8 (Shipping tests) | Disabled shipping in staging | No SendCloud API keys configured — the tests couldn't pass regardless |
| #18 (Sendcloud tests) | Merged and force-built on main | Late evening after 12-hour marathon — pragmatism over persistence |
| #20 (PostgreSQL auth) | Pivoted to architecture discussion | Recognized the root cause was systemic (host-level PG) — iteration wouldn't help |
| #23 (Cosign 401), attempt 1 | Directed AI to search conversation history | The human's vague memory ("didn't we fix this before?") was more valuable than the AI's analysis |
| #23 (Cosign 401), attempt 2 | Identified actual root cause (GC locking registry) | 7 AI hypotheses failed — the human's infrastructure intuition broke through |
Pattern: The human breaks the loop when the problem is environmental, systemic, or outside the AI's analytical reach. The AI excels at code-level fixes but struggles with infrastructure timing, external service behavior, and problems it has solved before but can't remember.
Loops Per Week¶
Observation: Loops cluster around two events: initial setup (Week 1, 6 loops) and the microservices migration (Week 6, 5 loops). The quiet period (Weeks 4–5) coincides with feature development on established patterns — exactly when you'd expect the loop frequency to drop. The lone Week 9 outlier (the Grand Cosign 401) is an infrastructure timing problem that no amount of code-level iteration could solve.
What the Loops Reveal About AI-First Development¶
The Ralph Wiggum loop is the dominant operational pattern in this project. With 21 loops comprising ~85 iterations across 53 days, the project averaged roughly 1.6 iterations of human-AI pipeline ping-pong per day.
Three meta-patterns emerge:
- **Cascade loops** (RW #1, #15, #17, #21): Fix one error, reveal the next. Each pipeline stage acts as a gate — typecheck → lint → unit tests → Docker build → deploy → acceptance tests. Fixing a failure at one gate just means hitting the next gate. These loops are the most productive: each iteration makes genuine progress.
- **Same-problem loops** (RW #3, #7, #8, #10, #23): The same error keeps coming back despite fixes. These indicate the AI is treating symptoms rather than root causes. The Grand Cosign 401 (RW #23) is the extreme example: 7 wrong hypotheses over 44 hours because the real problem (registry locked during garbage collection) was outside the AI's analytical frame.
- **Regression loops** (RW #14, #17): The fix makes things worse. RW #14 went from 1 failing test to 45 after an AI fix. These are the most demoralizing — and the most dangerous without human oversight.
The intelligent variant matters. In a "dumb" Ralph Wiggum loop, the entire prompt is fed back every time and the AI has no guidance about what changed. In this project, the human acted as a filter — pasting only the relevant failure, providing context about what was already tried, and critically, knowing when to break the loop entirely. Five of the 21 loops were resolved not by the AI iterating to a solution, but by the human changing strategy.
This is the case for the "intelligent" in intelligent Ralph Wiggum loops: the human's judgment about when to stop iterating is as valuable as the AI's ability to keep iterating.
Addendum: The Stock Management Feature — Anatomy of a Multi-Session Implementation (February 15 – March 9, 2026)¶
When a single feature request spawned 7 AI sessions, revealed 15 gaps, and demonstrated that the human's most valuable intervention is asking "but does this actually work?"
The Request¶
On February 15, the human described a hybrid fulfillment model: "In spare time or in the weekend it could be beneficial to already print some of the best selling products when there are not many orders." The system should maintain a minimum stock per product, pre-print during quiet periods, and consume from stock when orders arrive to speed up fulfillment.
This seemingly straightforward feature would become the most revealing case study of AI-human collaboration gaps in the entire project.
The Seven Sessions¶
Session 1 — The Prompt (Stock management prompt)¶
February 15, 2026
The human asked the AI to design the feature. The AI created a comprehensive 9-phase implementation prompt in docs/_internal/prompts/done/prompt-inventory-stock-management.md covering inventory tracking, pre-production scheduling, stock-aware fulfillment, and an audit trail.
Three latent gaps were introduced in the design phase:
| # | Gap | Why It Went Unnoticed |
|---|---|---|
| 1 | Inventory tracked at AssemblyPart level instead of ProductMapping | Seemed logical from a parts perspective, but ignored the GridFlock problem: printing one part of a grid doesn't make a complete product |
| 2 | `maximumStock` field designed into the schema but never used in the deficit formula | The prompt included the field in the schema but the deficit calculation only referenced `minimumStock` |
| 3 | Named "pre-production" | A confusing term in a 3D printing context where "production" and "printing" are synonymous |
The human filed the prompt as TODO. No corrections were made. The gaps were invisible because the prompt looked comprehensive — the devil was in the formulae.
Session 2 — The Implementation (Implement stock management)¶
March 8–9, 2026
The human said: "Implement prompt docs/_internal/prompts/done/prompt-inventory-stock-management.md"
The AI built the entire feature across 9 phases: Prisma schema changes, domain contracts, inventory module (service, repository, controller), stock replenishment cron, stock-aware orchestration, print job completion handler, permissions, and gateway proxy.
Six new gaps were introduced during implementation:
The AI also broke 11 existing test suites (1,499 tests total) by not adding the new purpose and stockBatchId fields to mock PrintJob objects across the codebase. The human discovered this when asking "Do we need more unit tests?" — the AI then found and fixed all 141 broken test files.
Key human interventions during implementation:
- Pasted 5 separate CI failure logs requiring iterative fixes
- Asked the AI to write acceptance tests after staging deployment
- Pointed out RBAC permissions weren't seeded — "I do not have access" after deployment
- Insisted stock management config should be per-tenant (SystemConfig), not environment variables — the AI had used global env vars, which wouldn't work for a multi-tenant SaaS
Session 3 — The Architecture Questions (Flowchart and architecture)¶
March 9, 2026
The human reviewed the roadmap flowchart and caught three design flaws:
- **Flowchart loop bug**: After queuing one part for replenishment, the diagram looped back to "Parts needing replenishment?" instead of the top. The human correctly noted: "In the mean time pre-production could have been disabled, new production orders and jobs could have arrived."
- **Inventory level question**: The human asked whether inventory should be tracked at ProductMapping or AssemblyPart level, noting that for GridFlock grids, "printing one part alone makes no sense." The AI agreed ProductMapping was correct.
- **Toggle redundancy**: The human asked: "What is the purpose of `preProductionEnabled` if `minimumStock > 0` already implies replenishment intent?" The AI agreed it was redundant.
All three were prompt design flaws that had survived since February 15 — 22 days undetected.
Session 4 — The maximumStock Discovery (Prompt update & maximumStock)¶
March 9, 2026
This session produced the most significant human discovery. The human asked the AI to update the prompt for the current microservices architecture and rename "pre-production" to "stock replenishment."
During the review, the human asked: "The UI says the maximum stock is optional? What does this mean? If maximum stock is 0 will the system just keep printing?"
The AI investigated and discovered: maximumStock existed in the schema, DTOs, API contracts, and the UI — but had zero functional role. The deficit formula was minimumStock - currentStock. The field was pure decoration. The system always replenished to minimumStock only, regardless of maximumStock.
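The before/after can be illustrated with toy versions of the two formulas. Field names follow the document; the post-fix behavior (replenishing up to `maximumStock` when set) is an assumption about the Session 4 change, and the real service code differs:

```typescript
// Toy versions of the deficit formulas. Field names follow the document;
// the real production code differs.
interface StockConfig {
  minimumStock: number;
  maximumStock?: number; // present in the schema, originally unused
}

// Shipped behavior: always replenish up to minimumStock only.
function deficitOriginal(cfg: StockConfig, currentStock: number): number {
  return Math.max(0, cfg.minimumStock - currentStock);
}

// Assumed post-fix behavior: replenish up to maximumStock when set,
// otherwise fall back to minimumStock.
function deficitWithTarget(cfg: StockConfig, currentStock: number): number {
  const target = cfg.maximumStock ?? cfg.minimumStock;
  return Math.max(0, target - currentStock);
}
```

With `maximumStock: 50` and `currentStock: 20`, the target-based formula reports a deficit of 30, which is exactly the surprise the "Full Stock Widget" test data later produced.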
Critical human correction: The AI initially only updated the prompt document. The human pushed back: "Changes only made to the prompt? No. Implement the maximum stock stuff now, add or update unit tests and acceptance tests, update docs." The AI then discovered the system was already fully implemented (not a TODO) and made the code changes.
Session 5 — The Dead Pipeline (Replenishment event wiring)¶
March 9, 2026
The human deployed to staging and tested the feature. The UI showed: deficit = 1, replenishment enabled, correct settings — but 0 pending batches and no print jobs in the SimplyPrint queue.
The human shared screenshots and asked: "Shouldn't the print jobs have been created? The SimplyPrint queue is empty."
The AI investigated and found the most critical bug in the entire feature: Nobody subscribed to STOCK_REPLENISHMENT_SCHEDULED events. The StockReplenishmentService created StockBatch and PrintJob records in the database and published events — but the EventSubscriberService only listened for ORDER_CREATED and ORDER_CANCELLED. The entire replenishment pipeline was a dead end.
The fix: Added STOCK_REPLENISHMENT_SCHEDULED subscription to EventSubscriberService — lookup print job, validate file ID, call SimplyPrint's addToQueue(), update print job with queue item ID.
This bug could only have been found by testing on staging. The unit tests all passed because each component worked in isolation. The gap was in the wiring between components — exactly the kind of integration bug that unit tests can't catch.
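A toy event bus makes the failure mode concrete: publishing to an event with zero subscribers succeeds silently, so every component-level test stays green. Event names follow the document; the bus itself is a stand-in for the real infrastructure:

```typescript
// Toy event bus demonstrating the silent dead-end. Event names follow
// the document; the real event infrastructure differs.
type Handler = (payload: unknown) => void;

class EventBus {
  private handlers = new Map<string, Handler[]>();

  subscribe(event: string, handler: Handler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  // Returns the number of subscribers reached; 0 means the event
  // went nowhere, and nothing raised an error.
  publish(event: string, payload?: unknown): number {
    const list = this.handlers.get(event) ?? [];
    for (const handler of list) handler(payload);
    return list.length;
  }
}

const bus = new EventBus();
bus.subscribe("ORDER_CREATED", () => { /* fulfillment logic */ });

// The replenishment service published this event, but nobody listened:
const delivered = bus.publish("STOCK_REPLENISHMENT_SCHEDULED");
// delivered is 0; the pipeline dead-ends without any test failing.
```

A cheap safeguard is to assert `publish()` returned a nonzero count for events that are supposed to drive downstream work.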
Session 6 — Test Failures & Acceptance Test Pollution (CI fixes & stock cleanup)¶
March 9, 2026
The human pasted CI pipeline output showing two test failures caused by the maximumStock changes:
- `inventory.controller.spec.ts`: Mock data missing the new `replenishmentTarget` property (TS2345)
- `inventory.service.spec.ts`: The "Full Stock Widget" test data had `maximumStock: 50` with `currentStock: 20`, giving an unexpected `deficit: 30` — the test expected 1 product needing replenishment but got 2
After fixing the tests, the human noticed the staging inventory showed stock of 16 on the real "Colored Benchy" product. The human asked: "How did the system arrive at stock 16?" and "Shouldn't I be able to manually edit the current stock?"
Two more gaps discovered:
- **Acceptance tests using real product mappings.** The step `"there is a product mapping with stock management"` grabbed the first existing stock-managed product — the human's real "Colored Benchy." Each test run added net +8 units (+5 +3 +2 -1 -1). Two CI runs = 16 phantom units.
- **No manual stock adjustment UI.** The backend had full `adjustStock` and `scrapStock` API endpoints, the frontend had `useAdjustStock()` and `useScrapStock()` hooks — but no page in the app actually used them. The feature was 90% built but invisible to the user.
Session 7 — The Missing UI & The Silent Mutation (current session)¶
March 9, 2026
The human asked: "Make it so that I can do stock adjustments on the product mapping from the UI."
The AI built a StockAdjustmentModal component with three modes (Add Stock / Remove Stock / Scrap Stock), quantity input, required reason for audit trail, and a live preview of the resulting stock level. Integrated in two places: the product mapping edit page and the inventory stock levels page.
The human tested the modal on staging and reported: "Clicking the 'Add Stock' button does not close the modal window and does not update the numbers in the modal window. This is confusing because the stock IS being adjusted but you do not see it." Even after closing the modal manually (via X or Cancel), the "Current stock" label on the product mapping page showed the stale value — a full page refresh was required.
Root cause — another wiring gap between backend and frontend:
The fix was three-layered:
- Backend: Changed `@HttpCode(HttpStatus.OK)` to `@HttpCode(HttpStatus.NO_CONTENT)` on both `adjustStock` and `scrapStock` controller methods — semantically correct for operations that return no data
- Frontend `request()` function: Added a safety net to read the response as text first and only parse JSON if content exists — prevents any future 200-with-empty-body from silently breaking mutations
- Modal `onError` handler: Added an explicit error toast so mutation failures are never silent again
This gap is a cascade from gap #10 (no stock adjust UI). The AI built the modal with correct onSuccess wiring (close modal, show toast, invalidate queries) — but didn't test the full HTTP round-trip where the backend returned an unexpected response format. The same "each piece works in isolation" pattern as the dead event pipeline (gap #6).
Gap #14 — Acceptance tests expected 200, got 204. The fix for gap #13 changed the HTTP status code from 200 to 204. The CI pipeline immediately caught two failing acceptance tests:
```
Expected: 200
Received: 204
```
The feature file scenarios "Stock can be adjusted positively" and "Stock can be scrapped" both asserted `Then the response status should be 200`. A textbook cascade: fixing the backend response format broke the acceptance test assertions that hardcoded the old status code. Fixed by updating both scenarios to expect 204.
Gap #15 — Inventory nav item not gated by feature flag. The human disabled the stockManagement feature flag on the Feature Flags settings page and observed that:
- The Stock Management tile in Settings correctly disappeared
- The Stock Management section on Product Mapping edit correctly disappeared
- But the Inventory menu item in the sidebar (and mobile nav) remained visible
Navigating to /inventory showed the error: "Failed to load stock levels: Stock management is not enabled for this tenant." The nav items were defined as a static array — the feature flag was never consulted.
This is an implementation gap from the original stock management build (Session 2). The AI correctly gated the Settings tile and the Product Mapping section behind features?.stockManagement, but forgot to apply the same gate to the navigation. A partial feature flag implementation — the feature was hidden from two of three entry points, but the primary entry point (the nav menu) was left wide open.
The fix: Added a featureFlag property to nav items and filtered the navigation arrays in both Sidebar and MobileNav components based on useFeatureFlags(). The Inventory item is now only visible when stockManagement is enabled.
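The shape of that fix, sketched with assumed names (`featureFlag`, `visibleNav`); the real Sidebar and MobileNav components differ:

```typescript
// Sketch of flag-gated navigation with assumed names; the project's
// actual Sidebar/MobileNav code differs.
interface FeatureFlags {
  stockManagement: boolean;
}

interface NavItem {
  label: string;
  path: string;
  featureFlag?: keyof FeatureFlags; // absent = always visible
}

const navItems: NavItem[] = [
  { label: "Orders", path: "/orders" },
  { label: "Inventory", path: "/inventory", featureFlag: "stockManagement" },
];

function visibleNav(items: NavItem[], flags: FeatureFlags): NavItem[] {
  // Items without a flag are always shown; flagged items only when enabled.
  return items.filter(
    (item) => item.featureFlag === undefined || flags[item.featureFlag],
  );
}
```

Declaring the gate as data on the nav item (rather than scattering conditionals) is what makes the cross-cutting concern hard to miss on the next entry point.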
The human also noted a UX issue: the "Edit" button for stock management settings was positioned at the same level as "Adjust Stock" and "Disable," but it controls the parameters below (Min Stock, Max Stock, Priority, Batch Size). The button was moved down to sit alongside the settings grid, with a "Settings" sub-label — grouping the control with the content it modifies.
The Gap Lifecycle¶
Discovery Attribution¶
The human found 67% of all gaps. The CI pipeline caught type errors and assertion failures (27%). The AI self-discovered only one bug (7%) — the tenant-wide pendingBatches count — and only because it was actively refactoring the deficit calculation for the maximumStock fix.
The Pattern: Prompt vs. Implementation vs. Wiring¶
The 15 gaps cluster into three categories that reveal where AI struggles most:
| Category | Gaps | Example | Root Cause |
|---|---|---|---|
| Design gaps (prompt) | 5 | maximumStock in schema but not in formula |
AI creates comprehensive-looking documents where individual details contradict each other |
| Wiring gaps (implementation) | 7 | No event subscriber for replenishment; nav item not gated by feature flag | AI implements each component correctly in isolation but misses the connections between them — including applying the same gate to all entry points |
| Cascade gaps (fixes) | 3 | Mock data incomplete after adding field; modal mutation silently failing; acceptance tests expecting old HTTP status | Fixing one gap exposes another — the "whack-a-mole" pattern |
The most dangerous category is wiring gaps. Each component passes its unit tests. The cron creates batches. The event bus publishes. The subscriber service runs. But nobody wired them together. This is the software equivalent of building a beautiful bridge where each section is structurally sound — but the sections don't connect.
The Human's Questions — In Order¶
The entire gap discovery process was driven by ten observations from the human:
- "Shouldn't the flowchart loop back to the top?" → Flowchart logic error (22 days old)
- "Should we track at ProductMapping or AssemblyPart level?" → Wrong abstraction level (22 days old)
- "What is the purpose of preProductionEnabled if minimumStock > 0 already implies it?" → Redundant toggle (22 days old)
- "The UI says maximum stock is optional? What does this mean?" → `maximumStock` non-functional (22 days old)
- "Shouldn't the print jobs have been created? The SimplyPrint queue is empty." → Dead event pipeline (1 day old)
- "How did the system arrive at stock 16?" → Acceptance tests polluting real data (1 day old)
- "Shouldn't I be able to manually edit the current stock?" → Missing UI (1 day old)
- "Clicking Add Stock does not close the modal. The stock IS being adjusted but you do not see it." → Silent mutation failure (minutes old)
- "I feel like 'Edit' should be one level down on the same level as the parameters because it is for editing those parameters." → UX hierarchy mismatch (minutes old)
- "When I set the Stock Management feature flag to false, the Inventory menu item still exists." → Partial feature flag implementation (minutes old)
Every question was a variant of the project's recurring pattern: "What about...?" — the human's superpower identified earlier in this timeline. The human didn't need to read the code. They just used the system and noticed when reality didn't match expectations. Question #8 is particularly telling: the human reported the exact symptom — "the stock IS being adjusted but you do not see it" — which immediately pointed the AI toward a response-handling issue rather than a backend bug. Question #10 reveals the pattern at its most systematic: the human toggled a feature flag off and methodically checked every place in the UI where the feature should disappear — finding the one the AI missed.
Conclusion¶
The stock management feature was implemented across 7 sessions spanning 22 days. It produced 15 gaps — 5 in the prompt, 7 during implementation, and 3 as cascades from fixes. The human discovered 10 of them (67%), the CI pipeline caught 4 (27%), and the AI self-discovered 1 (7%).
The most important lesson: a feature can look fully implemented — schema, API, UI, tests, documentation — and still be fundamentally broken if the wiring between components is missing. The replenishment pipeline created database records but never sent print jobs to SimplyPrint. The maximumStock field appeared everywhere in the UI but did nothing. The acceptance tests ran green while silently corrupting production data. The stock adjustment modal had perfect onSuccess logic — close the modal, show a toast, invalidate queries — but a mismatch between the backend's empty HTTP 200 and the frontend's JSON parser meant the success callback never fired. The feature flag correctly hid the feature from two of three UI entry points — but left the primary navigation link visible.
The cascade chain is particularly instructive: gap #10 (no UI) → gap #13 (modal doesn't close due to HTTP 200/empty body) → gap #14 (acceptance tests fail because they expected 200, now get 204). Each fix peeled back a layer and exposed the next. Meanwhile, gap #15 (nav item not gated) demonstrates a different AI failure mode: incomplete application of a cross-cutting concern. The AI applied the feature flag to the Settings page and the Product Mapping page but missed the navigation — the most visible entry point. The human found it in seconds by simply toggling the flag and looking at the sidebar.
The AI excelled at building each piece. The human excelled at asking whether the pieces actually worked together. Neither could have shipped this feature alone — but for very different reasons than the earlier sections of this timeline describe. Here, the AI's failure mode wasn't speed-induced sloppiness. It was the gap between "implemented" and "functional" — the subtle difference between code that exists and code that works.
Addendum: The SonarQube Saga — From 769 Issues to Zero in 43 Hours (March 12–14, 2026)¶
When a phone call about a demo turned into a full code quality overhaul — and AI proved that "65 hours of developer work" is a negotiable concept.
The Catalyst¶
On the evening of March 12, the human had a phone call with Steven Robijns about an upcoming demo of the Forma3D.Connect platform. Steven asked a simple but pointed question: "How good is the code quality that AI generates?"
The human didn't have a data-driven answer. The codebase had been built entirely by AI over 9 weeks, with human guidance — but no independent quality assessment had ever been performed. That evening, the human sat down and asked the AI three things:
- Research: Create a research report on integrating SonarQube into the project
- Prompt: Design a prompt for implementing SonarCloud integration
- Execute: Implement the prompt
What followed was a 43-hour sprint that would transform the codebase from 769 issues to zero — and provide the data-driven answer Steven's question demanded.
The Timeline¶
March 12, Evening — The Phone Call and Research (17:00–21:00Z)¶
| Time | Event | Session |
|---|---|---|
| ~17:00Z | Phone call with Steven Robijns about upcoming demo | — |
| 17:18Z | Human asks AI to research SonarQube CE vs SonarCloud | Research report created |
| 17:18Z | Human asks AI to create integration prompt | prompt-sonarcloud-integration.md |
| 17:21Z | AI researches SonarQube CE limitations, TypeScript support, branch analysis | Research document with PlantUML |
| ~18:00Z | Human registers project in SonarCloud, provides credentials | — |
| ~18:30Z | AI implements SonarCloud integration in Azure Pipeline | SonarCloudPrepare@4, SonarCloudAnalyze@4, SonarCloudPublish@4 |
| 20:37Z | First scan results arrive: 769 issues | AI begins fixing S2933 (readonly modifiers) |
The first scan was sobering:
| Metric | Value |
|---|---|
| Total issues | 769 |
| Code smells | 748 |
| Vulnerabilities | 12 |
| Bugs | 9 |
| Security hotspots | 6 |
| Code duplication | 19.5% |
| SonarCloud estimated fix effort | 3,890 minutes (~65 hours) |
The AI immediately produced a triage report (sonarcloud-issue-triage-20260312.md) categorizing all 769 issues by severity and action:
| Severity | Count | Real Problems | False Positive | Won't Fix |
|---|---|---|---|---|
| BLOCKER | 13 | 1 | 12 | 0 |
| CRITICAL | 39 | 36 | 0 | 3 |
| MAJOR | 132 | 132 | 0 | 0 |
| MINOR | 585 | 490 | 15 | 80 |
| Total | 769 | 659 | 27 | 83 |
March 13, Morning — The Blitz (07:00–10:00Z)¶
The human directed the AI to fix issues in waves. Twelve parallel AI sessions ran between 07:00 and 08:00Z alone:
| Time | Rule | Description | Issues Fixed |
|---|---|---|---|
| 07:20Z | — | Remove all `// NOSONAR` comments (not supported in TS) | 14 |
| 07:34Z | S3863 | Merge duplicate imports | 80 |
| 07:34Z | S7773 | `parseInt()` → `Number.parseInt()` | ~15 |
| 07:34Z | S7781 | `.replace(/g)` → `.replaceAll()` | 16 |
| 07:34Z | S7748 | Remove unnecessary decimal points | 23 |
| 07:39Z | S7735 | Flip negated conditions | 40 |
| 07:39Z | S6582 | Prefer optional chaining | 10 |
| 07:47Z | S7764 | `window` → `globalThis` | 36 |
| 07:47Z | S4325 | Remove redundant type assertions | 14 |
| 07:47Z | S7778 | `indexOf` → `includes` | 9 |
| 07:47Z | S7757 | Consistent type imports | 2 |
| 07:47Z | S7776 | `startsWith`/`endsWith` | 2 |
By 10:00Z, issues had dropped from 769 to 244 — a 68% reduction in 3 hours.
March 13, Afternoon — Moderate Fixes and Coverage (13:00–15:00Z)¶
The AI produced a second triage report and executed phases 1, 2, and 5:
| Time | Rule | Description | Issues Fixed |
|---|---|---|---|
| 13:58Z | Phase 1 | Quick auto-fixes (replaceAll, ??=, .at()) | 22 |
| 14:02Z | S6759 | React props should be `Readonly<>` (3 sessions) | 49 |
| 14:04Z | S4624 | Extract nested template literals | 10 |
| 14:04Z | S6819 | `div role="region"` → `<section>` | 6 |
| 14:04Z | S4323 | Type aliases | 2 |
| 14:04Z | S6571 | Redundant `unknown` in unions | 2 |
| 14:18Z | S4325 | Type assertion review | 4 |
| 14:18Z | S6853 | Form label association (`htmlFor`/`id`) (2 sessions) | 8 |
| 14:23Z | S6551 | String coercion (`String()` wrapping) | 47 |
| 14:44Z | S107 | Too many constructor parameters | 2 |
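The largest batch in this round was S6759. The sketch below shows the shape of that fix with plain functions standing in for React components — the prop names are invented for illustration:

```typescript
// Illustrative S6759 fix; prop and component names are hypothetical,
// and plain functions stand in for React components to stay self-contained.

type StockModalProps = { productId: string; onClose: () => void };

// Before: props are mutable inside the component body.
function StockModalBefore(props: StockModalProps): string {
  return props.productId;
}

// After: Readonly<> makes accidental prop mutation a compile-time error.
function StockModalAfter(props: Readonly<StockModalProps>): string {
  return props.productId;
}

console.log(StockModalBefore({ productId: "p1", onClose: () => {} }));
console.log(StockModalAfter({ productId: "p1", onClose: () => {} }));
```

The change is purely type-level, which is why three parallel sessions could sweep 49 components without risking runtime regressions.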
March 13, Evening — The Coverage Problem (20:00–22:00Z)¶
The human checked SonarCloud and was confused: "I thought we upped coverage and downed duplication. Still seems off?"
| Metric | Azure DevOps | SonarCloud | Why Different |
|---|---|---|---|
| Coverage | 73% | 57.9% | SonarCloud counts uninstrumented files as 0% covered |
| Duplication | — | 10.1% | 6,662 duplicated lines across 141 blocks |
Root cause: SonarCloud's `sonar.sources` included files excluded from test instrumentation. The AI aligned `sonar.coverage.exclusions` with the Jest/Vitest exclusion patterns.
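The shape of that alignment looks roughly like this — `sonar.coverage.exclusions` and `sonar.qualitygate.wait` are real SonarCloud analysis parameters, but the glob patterns below are illustrative, not the project's actual configuration:

```properties
# Sketch of a sonar-project.properties alignment (patterns are examples only).
# The coverage denominator should match what Jest/Vitest actually instrument.
sonar.sources=apps,libs
sonar.coverage.exclusions=**/*.spec.ts,**/*.stories.tsx,**/prisma/**,**/*.config.ts
# Added later in the saga: make the pipeline fail when the quality gate fails.
sonar.qualitygate.wait=true
```

Once the exclusion lists matched, SonarCloud's coverage number converged toward the Azure DevOps figure instead of counting uninstrumented files as 0% covered.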
The AI also fixed 18 more duplicate import issues across the service layer (S3863).
By end of March 13: 61 issues remaining, duplication at 10.1%.
March 14, Morning — The Final Push (07:00–13:00Z)¶
| Time | Action | Result |
|---|---|---|
| 07:38Z | Human asks why main still shows "Failed" | AI explains PR vs. main quality gate difference |
| 08:19Z | Fix shallow clone warning in pipeline | Git fetch depth configured |
| 08:40Z | CodeCharta integration (3D city map from SonarCloud) | Visualization pipeline job added |
| 10:00Z | Human directs: "Fix or won't-fix the 47 issues" | AI resolves all 47 via API + code fixes |
| 10:00Z | Duplication reduction target: 9.8% → ~3% | Refactored duplicated service code |
| 12:06Z | Coverage improvement sprint begins | 12 new test files created |
| 12:07Z | S107 constructor parameter fix | SendcloudBaseService refactored |
| 12:09Z | Pipeline enforcement: quality gate must pass | `sonar.qualitygate.wait=true` added |
| 12:14Z | Final test files for service-common | Controllers and services covered |
By 13:00Z on March 14: 0 issues, quality gate passing, pipeline enforcing.
The Numbers¶
| Metric | First Scan (Mar 12) | Final (Mar 14) | Change |
|---|---|---|---|
| Total issues | 769 | 0 | -769 (-100%) |
| Code smells | 748 | 0 | -748 |
| Vulnerabilities | 12 | 0 | -12 |
| Bugs | 9 | 0 | -9 |
| Code duplication | 19.5% | ~3% | -16.5pp |
| Duplicated lines | 13,366 | ~2,000 | -85% |
| Coverage (SonarCloud) | 57.9% | 70%+ | +12pp |
| Quality gate | FAILED | PASSED | — |
AI Sessions Breakdown¶
| Category | Sessions | Purpose |
|---|---|---|
| Research & setup | 3 | SonarQube CE research, SonarCloud integration prompt, pipeline setup |
| Triage reports | 2 | Categorize all issues by severity and action |
| Batch rule fixes | 20 | Fix specific SonarCloud rules across codebase |
| Coverage & duplication | 4 | Align coverage metrics, reduce duplication |
| Pipeline & quality gate | 3 | Enforce quality gate, shallow clone, CodeCharta |
| Test coverage sprint | 6 | New test files to raise coverage from 59% to 70%+ |
| Total | ~38 | — |
Cost Analysis¶
Human Cost¶
| Activity | Estimated Time | Notes |
|---|---|---|
| Phone call with Steven Robijns | 30 min | The catalyst |
| Research direction & initial setup | 30 min | "I want SonarCloud, evaluate CE vs Cloud" |
| Registering project in SonarCloud | 15 min | Manual step in SonarCloud UI |
| Directing AI fix sessions | 2 hours | Pasting rule IDs, reviewing progress |
| Reviewing SonarCloud dashboard | 1 hour | Checking metrics between fix rounds |
| Asking clarifying questions | 30 min | "Why does main still fail?", coverage discrepancy |
| Pipeline verification | 30 min | Confirming quality gate enforcement |
| Total human time | ~5 hours | Spread across 2 evenings + 1 morning |
AI Cost¶
| Resource | Estimate |
|---|---|
| AI sessions | ~38 |
| Estimated AI compute cost | ~€40–60 |
| Files modified | 200+ |
| Lines changed | 3,000+ |
Total Project Cost¶
| Item | Cost |
|---|---|
| Human time (~5 hrs × €75/hr) | ~€375 |
| AI compute | ~€50 |
| SonarCloud (free for open-source tier) | €0 |
| Total | ~€425 |
SonarCloud's Estimate vs. Reality¶
This is where the numbers become striking. SonarCloud estimates fix effort per issue based on industry averages for a human developer working manually.
| Metric | SonarCloud Estimate (Human) | Actual (AI + Human) | Ratio |
|---|---|---|---|
| Work effort | 65 hours | ~8 hours (5 human + 3 AI) | 8x faster |
| Calendar time | ~8 working days | ~2 calendar days | 4x faster |
| Cost | ~€4,875 (at €75/hr) | ~€425 | 11x cheaper |
| Issues resolved per hour | ~12/hr | ~96/hr (elapsed) | 8x throughput |
Important caveat: SonarCloud's effort estimates assume a single human developer reading code, understanding context, making changes, running tests, and reviewing. The AI skips the "reading and understanding" phase — it can grep the entire codebase in seconds and apply mechanical transformations across hundreds of files simultaneously. For mechanical fixes (replace `.replace(/g)` with `.replaceAll()`, add `Readonly<>` to React props), the AI's advantage is extreme. For complex refactoring (reducing duplication, restructuring services), the advantage narrows but remains significant.
What SonarCloud Revealed About AI-Generated Code¶
The 769 issues tell a story about how AI writes code:
| Pattern | Issues | What It Reveals |
|---|---|---|
| Missing `readonly` modifiers (S2933) | 8 | AI doesn't default to immutability |
| Duplicate imports (S3863) | 80+ | AI adds imports incrementally without consolidating |
| Legacy patterns (S7781, S7773, S7778) | 60+ | AI uses older JS idioms (`replace(/g)`, `parseInt()`, `indexOf`) |
| Negated conditions (S7735) | 40 | AI writes `if (!x) { a } else { b }` instead of `if (x) { b } else { a }` |
| React props not `Readonly<>` (S6759) | 49 | AI doesn't wrap React props in `Readonly<>` by default |
| Template literal nesting (S4624) | 10 | AI creates complex nested interpolations |
| String coercion (S6551) | 47 | AI interpolates unknown values without `String()` |
| Too many parameters (S107) | 2 | AI mirrors DI framework patterns without questioning parameter count |
The meta-pattern: AI writes functional code — it works, it passes tests, it handles edge cases. But it doesn't write idiomatic code. It uses patterns from its training data rather than modern best practices. SonarCloud catches exactly these kinds of style and maintainability gaps.
The irony: AI generated the code quality issues. AI also fixed them all. The human's role was to introduce the quality gate (SonarCloud) and direct the AI to fix what it found. The same AI that wrote `parseInt()` instead of `Number.parseInt()` could instantly fix all 15 occurrences when told to — it just didn't know it should until SonarCloud flagged it.
The Pipeline Integration¶
The SonarCloud integration became a permanent part of the development process:
Quality gate conditions (Sonar way for AI Code):
| Condition | Threshold | Scope |
|---|---|---|
| New issues | 0 | New code only |
| New coverage | ≥ 80% | New code only |
| New duplication | ≤ 3% | New code only |
| Reliability rating | A | Overall |
| Security rating | A | Overall |
Why This Matters for the Demo¶
Steven Robijns asked: "How good is the code quality that AI generates?"
The data now provides a nuanced answer:
- **AI-generated code starts with quality gaps.** 769 issues in a 53,000-line codebase is ~14.5 issues per 1,000 lines. This is within industry norms but reveals that AI doesn't naturally write to SonarCloud standards.
- **AI can fix its own quality gaps at extreme speed.** What SonarCloud estimated would take a human developer 65 hours was completed in 2 calendar days with ~5 hours of human oversight.
- **The quality is now enforced.** With `sonar.qualitygate.wait=true` in the pipeline, every future commit must pass the quality gate. AI can no longer introduce issues without immediately being asked to fix them.
- **The real answer: AI-generated code quality is exactly as good as the quality gates you enforce.** Without SonarCloud, the 769 issues would have accumulated silently. With SonarCloud, they were eliminated in 43 hours and can never return. The quality of AI code is a function of the guardrails, not the AI itself.
The Steven Robijns Pattern¶
This incident reveals a pattern seen throughout the project: external pressure drives quality improvements that internal development wouldn't prioritize.
The codebase had existed for 9 weeks without a code quality scanner. The AI never suggested adding one. The human never prioritized it. It took a phone call about a demo — an external event with social stakes — to trigger the integration.
Once triggered, the actual work was trivial for the AI. The bottleneck was never "can AI fix code quality issues?" — it was "does anyone ask it to?" This is the human's role distilled to its essence: not writing code, not even reviewing code, but deciding what questions to ask about the code.
Steven Robijns didn't write a single line of code. He asked a single question. That question led to 769 fixes, a permanent quality gate, and a data-driven answer ready for the demo. The highest-leverage intervention in this entire saga was a phone call.
Addendum: The Grype CVE Saga (March 17, 2026)¶
The supply chain security pipeline¶
Background¶
On March 17, 2026 at 09:45, the AI introduced container vulnerability scanning into the CI/CD pipeline. Using Grype (by Anchore), every Docker image now gets its SBOM scanned for known CVEs before deployment. The pipeline was configured to fail on High-severity vulnerabilities that have available fixes (`--fail-on high --only-fixed`).
The very first pipeline run with Grype enabled immediately failed. Every single service image had CVEs. What followed was a 4-hour investigation and remediation session between the human and AI.
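The scan step looks roughly like this — `--fail-on` and `--only-fixed` are real Grype CLI flags, but the image name, SBOM format choice, and step layout below are illustrative, not the project's actual pipeline YAML:

```yaml
# Sketch of an Azure Pipelines scan step (image name is a placeholder).
# Syft generates the SBOM; Grype scans it and fails the job on fixable High CVEs.
- script: |
    syft forma3d/gateway:latest -o cyclonedx-json=sbom.json
    grype sbom:./sbom.json --fail-on high --only-fixed
  displayName: "SBOM + Grype CVE scan"
```

The `--only-fixed` flag is the pragmatic choice here: it keeps the pipeline red only for vulnerabilities that can actually be remediated, which matters once a base image like the Slicer's ships hundreds of unfixable CVEs.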
What Grype uncovered¶
Across the 9 container images (Gateway, Order Service, Print Service, Shipping Service, GridFlock Service, Web, Docs, EventCatalog, Slicer), Grype discovered vulnerabilities in three distinct layers:
Layer 1: npm transitive dependencies (all NestJS services)
| Package | Vulnerable Version | Fixed Version | Severity |
|---|---|---|---|
| cross-spawn | 7.0.3 | 7.0.5 | High |
| minimatch | 5.1.6, 9.0.5 | 5.1.9, 9.0.9 | High |
| glob | 10.4.2 | 10.5.0 | High |
| tar | 6.2.1 | 7.x | High (6 CVEs) |
| file-type | 21.2.0 | 21.3.3 | Medium |
| lodash | 4.17.21 | 4.17.23 | Medium |
| serialize-javascript | 6.0.2 | 7.0.4 | High |
| ajv | 8.17.1 | 8.18.0 | Medium |
| bn.js | 4.12.2 | 5.2.3 | Medium |
| qs | 6.14.1 | 6.15.0 | Low |
Layer 2: System packages from Docker base images
| Source | Packages Affected | Cause |
|---|---|---|
| npm bundled in `node:20-alpine` | tar@6.2.1, glob@10.4.2, cross-spawn@7.0.3 | System npm not needed at runtime |
| Alpine `docker-cli` package | Go stdlib, containerd, docker/cli, otel/sdk | Go binaries with old dependencies |
| Alpine `zlib` | zlib 1.3.1-r2 | Outdated Alpine package |
Layer 3: The Slicer (linuxserver/bambustudio:01.08.03)
The Slicer was in a category of its own: 38,731 SBOM components and 800+ CVEs with fixes available, including Critical findings with active exploitation (CVE-2024-9680 in Firefox ESR with 30.8% EPSS). The base image ships an entire Debian 12 desktop environment (Firefox, GStreamer, Qt5, CUPS, ffmpeg, GhostScript) with packages that hadn't been patched in over a year.
How they were solved¶
npm transitive dependencies — Added 9 pnpm overrides to package.json:
```json
"pnpm": {
  "overrides": {
    "ajv@>=8": ">=8.18.0",
    "bn.js": ">=4.12.3",
    "cross-spawn": ">=7.0.5",
    "file-type": ">=21.3.2",
    "lodash": ">=4.17.23",
    "minimatch@<6": "5.1.9",
    "minimatch@>=9 <10": "9.0.9",
    "qs": ">=6.14.2",
    "serialize-javascript": ">=7.0.3"
  }
}
```
A key insight: the tar@6 → 7 override was deliberately not applied. The AI investigated and found tar@7 is ESM-only with a completely different API — it would break prisma-uml in development. The tar@6.2.1 in production images came from the bundled npm, not from project dependencies.
System packages in Docker base images — Modified 5 Dockerfiles (Gateway, Order Service, Print Service, Shipping Service, GridFlock Service) to add two lines to each production stage:
```dockerfile
RUN apk add --no-cache openssl ... && \
    apk upgrade --no-cache && \
    rm -rf /usr/local/lib/node_modules/npm /usr/local/bin/npm /usr/local/bin/npx
```
- `apk upgrade` picks up patched Alpine packages (zlib, docker-cli Go binaries)
- Removing npm strips the bundled tar/glob/cross-spawn that the runtime doesn't need
The Slicer — After analyzing the 800+ CVEs, the conclusion was that they are unfixable without an upstream base image update. BambuStudio v2 had compatibility issues, and the container runs internally without internet exposure. The grype scan was commented out with a detailed rationale, and a TODO.md entry was created to revisit after BambuStudio v2 research.
The tar investigation¶
The AI's handling of the tar package is worth highlighting. When first asked "can this be fixed?", a naive approach would have been to add `"tar": ">=7.5.11"` to overrides. Instead, the AI traced two separate sources:

- The `tar@6` in the project's own dependency tree (pulled in by prisma-uml), where the ESM-only `tar@7` API change would break development tooling
- The `tar@6.2.1` flagged in production images, which came from the npm bundled in the base image rather than from project dependencies
This kind of dependency forensics — cross-referencing lockfile versions against container scan versions, tracing transitive dependency trees, evaluating major version compatibility — is exactly the work that makes CVE remediation time-consuming for humans.
Time analysis¶
| Activity | Clock time | Notes |
|---|---|---|
| Grype introduction to pipeline | ~09:45 | AI added SBOM generation + Grype scan to all 9 service jobs |
| First failed pipeline run | ~12:00 | Gateway scan revealed 30 CVEs |
| Human reviews logs, pastes to AI | ~12:00–13:30 | Human provided scan output for each service, asked "same issues?" |
| AI investigation + all fixes applied | ~13:30 | pnpm overrides, 5 Dockerfiles, Slicer exclusion, TODO.md |
| Smoke test builds pass | ~13:30 | Gateway + Order Service + GridFlock builds verified |
| Total wall clock | ~4 hours | From introduction to all fixes committed |
AI effort: The AI performed dependency tree analysis (pnpm why for 10+ packages), lockfile forensics, Dockerfile inspection across 9 services, version compatibility research, applied 15+ file edits, ran build verification, and wrote the TODO.md entry. Estimated equivalent: ~15 minutes of compute time across all interactions.
Human effort: The human pasted 6 pipeline log outputs, asked clarifying questions ("same issues?", "will this fix the pipeline?"), and made one strategic decision (exclude Slicer grype with rationale). Estimated: ~30 minutes of active work, mostly reading and deciding.
Estimated time without AI: A senior developer performing the same work manually would need to:
- Understand each CVE and its severity (~1 hour reading advisories)
- Trace each vulnerable package through the dependency tree (~2 hours with `npm ls`/`pnpm why`)
- Research which overrides are safe vs breaking (tar@7 ESM, minimatch cross-major) (~2 hours)
- Apply and test pnpm overrides (~1 hour)
- Investigate Docker base image CVE sources vs project CVEs (~2 hours)
- Modify 5 Dockerfiles and verify builds (~1 hour)
- Analyze the Slicer's 800+ CVEs and determine unfixability (~2 hours)
- Write documentation and TODO entries (~1 hour)
Estimated total without AI: 12–16 hours (2 full working days), assuming the developer has prior experience with container security scanning, pnpm overrides, and Alpine/Debian package management.
The pattern¶
This incident follows the same pattern seen throughout the project: the human provides context and makes strategic decisions; the AI provides velocity and thoroughness.
The human's highest-leverage contributions were:
- Recognizing the Slicer was different — rather than asking the AI to "fix everything," the human asked "will this actually fix the pipeline?" which led to the honest answer that Go module CVEs in the base image are unfixable
- Making the exclude decision — weighing the security risk (internal-only container) against the engineering cost (BambuStudio v2 compatibility research) and choosing to defer with documentation
- Asking about each service — by methodically going through Gateway → Print → Shipping → Order → GridFlock → Slicer, the human ensured nothing was missed and that service-specific CVEs (bn.js in Order, ajv/serialize-javascript in GridFlock) were caught
The AI's key contribution was turning each question into immediate action — no context switching, no documentation lookup, no "I'll look into it tomorrow." Each service's scan output was analyzed in seconds and cross-referenced against the fixes already applied.
Generated from CHANGELOG.md and 500+ chat sessions in .specstory/history/
Forma3D.Connect — January 9 – March 17, 2026