Incident Report: Cosign 401 Unauthorized (Staging Attestation Failure)
Date: March 6–8, 2026
Duration: ~32 hours investigation and remediation (across 3 days)
Severity: P3 (staging attestations blocked; deployments unaffected)
Status: Resolved
Pipeline Runs Affected: #1893 through #1903
Summary
All cosign attest operations in the staging promotion pipeline step started failing with 401 Unauthorized against DigitalOcean's container registry auth endpoint. The error persisted across 14 commits and multiple hypothesis-driven fixes. The root cause was a race condition: the AttestStagingPromotion job ran concurrently with the RegistryMaintenance stage, which triggers DigitalOcean container registry garbage collection. During GC, the registry's auth endpoint (api.digitalocean.com/v2/registry/auth) rejects all token exchange requests with HTTP 401, even though docker login (which validates against a different endpoint) succeeds.
The fix was to extract the attestation job into its own pipeline stage (AttestStaging) that depends on RegistryMaintenance completing before it starts.
The Error
Every cosign attest call produced the same error:
Error: signing registry.digitalocean.com/forma-3d/forma3d-connect-docs@sha256:...
GET https://api.digitalocean.com/v2/registry/auth
?scope=repository:forma-3d/forma3d-connect-docs:push,pull
&service=registry.digitalocean.com
: unexpected status code 401 Unauthorized
Key observations:
- docker login always succeeded ("Login Succeeded")
- The Sigstore transparency log entry was always created successfully
- The failure happened specifically when cosign tried to push the attestation to the registry
- The same DOCR_TOKEN worked for docker push and cosign sign in the Build stage
Pipeline Architecture (Before Fix)
The AcceptanceTest stage and RegistryMaintenance stage both depended on Build and DeployStaging, meaning they ran in parallel. The AttestStagingPromotion job (inside AcceptanceTest) would often start while garbage collection was still running.
Timeline
| Time | Commit | Action | Result |
|---|---|---|---|
| Feb 25, 23:23 | a251940 | Extracted RegistryMaintenance into its own stage (parallel with AcceptanceTest) | ✅ Worked by luck; this introduced the latent race condition |
| Mar 5, 19:57 | 34f611d | Last working pipeline (PR #489 merged) | ✅ All attestations pass (GC happened to finish first) |
| Mar 5, 23:06 | 9e24ea7 | Introduced DOCKER_CONFIG env var for attestation steps | ❌ 401 (falsely correlated; GC race started manifesting) |
| Mar 6, 14:06 | 3cc30dd | Added set -e to login scripts | ❌ 401 |
| Mar 7, 10:14 | 5eb2a27 | Removed custom DOCKER_CONFIG, simplified login | ❌ 401 |
| Mar 7, 11:10 | 1fa4437 | Added credential validation with docker info | ❌ 401 |
| Mar 7, 11:37 | 6c0df9d | Switched to doctl registry login | ❌ doctl: command not found |
| Mar 7, 18:13 | 50063b2 | Fixed doctl PATH for same-step execution | ❌ 401 |
| Mar 7, 18:48 | 2b3cc07 | Used doctl registry docker-config --read-write | ❌ 401 |
| Mar 7, 19:18 | 328b81c | Added debug diagnostics (curl, crane, cosign version) | ❌ 401 (debug confirmed credentials valid) |
| Mar 7, 20:21 | ee7501e | Fixed cosign install to always prepend PATH | ❌ 401 |
| Mar 7, 21:09 | 5bbf20c | Forced specific cosign v2.2.4 binary | ❌ 401 |
| Mar 8, 00:20 | 270b153 | Reverted to system cosign + simple docker login | ❌ 401 |
| Mar 8, 01:09 | d4605f5 | Reverted YAML to exact working state of 34f611d | ❌ 401 |
| Mar 8, 01:48 | 059e70d | Moved to MS-hosted ubuntu-latest agent | ❌ 401 |
| Mar 8, 06:24 | n/a | Regenerated DOCR_TOKEN in DigitalOcean | ❌ 401 |
| Mar 8, 07:36 | 2ea404a | Extracted AttestStaging into own stage after RegistryMaintenance | ✅ Fixed |
Hypotheses Explored
Hypothesis 1: DOCKER_CONFIG propagation failure
Theory: The DOCKER_CONFIG environment variable set via ##vso[task.setvariable] wasn't reaching the attestation step scripts.
Evidence for: The commit 9e24ea7 introduced DOCKER_CONFIG and the error started appearing in the same pipeline run.
Evidence against: Reverting the YAML to the exact working state (34f611d) still produced 401. The DOCKER_CONFIG change was a false correlation — the real cause was the GC race condition introduced on Feb 25 (a251940) that had been a latent timing bomb for 8 days.
Verdict: ❌ Red herring. The temporal correlation was coincidence — GC timing simply shifted enough to start hitting the failure window around the same time.
Hypothesis 2: Credential helper interference on self-hosted agent
Theory: The self-hosted DO build agent had a Docker credential helper (credsStore or credHelpers) that was intercepting docker login and storing credentials in a way cosign couldn't read.
Evidence for: The working version had echo '{}' > ~/.docker/config.json (wiping any credential helper config) before docker login.
Evidence against: Even with the config reset restored, the error persisted. Also failed on a fresh MS-hosted agent (no credential helpers).
Verdict: ❌ Ruled out.
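To check quickly whether a given agent has a credential helper configured at all, something like the following could be used. This is a minimal sketch, not a step from the actual pipeline: `has_cred_helper` is a hypothetical helper name, and the check is a plain text match on the `credsStore` and `credHelpers` keys rather than real JSON parsing, so an empty `credHelpers` object would also be reported.

```shell
#!/usr/bin/env bash
# Sketch only: has_cred_helper is a hypothetical helper, not part of the pipeline.

# has_cred_helper [CONFIG_FILE]
# Returns 0 if the Docker config file mentions a credential store or
# per-registry credential helpers, non-zero otherwise (including when the
# file does not exist). Defaults to ~/.docker/config.json.
has_cred_helper() {
  local config=${1:-$HOME/.docker/config.json}
  [ -f "$config" ] && grep -Eq '"(credsStore|credHelpers)"[[:space:]]*:' "$config"
}
```

Running `has_cred_helper && echo "credential helper configured"` before `docker login` would show whether credentials might be stored somewhere other than the config file itself.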
Hypothesis 3: Wrong cosign binary version
Theory: A system-wide cosign at /usr/local/bin/cosign (older version) was being used instead of the pipeline's intended v2.2.4, and the older version handled auth differently.
Evidence for: Debug output showed the if ! command -v cosign check was skipping installation because system cosign existed. The ##vso[task.prependpath] was inside the if block, so it was also skipped.
Evidence against: Forcing v2.2.4 explicitly still produced 401. Using the system cosign also produced 401. On ubuntu-latest with freshly installed v2.2.4, still 401.
Verdict: ❌ Ruled out (but the install logic was genuinely buggy — see Lessons Learned).
Hypothesis 4: doctl registry login needed
Theory: Based on a previous incident (January 2026), doctl registry login was needed instead of plain docker login for cosign compatibility.
Evidence for: Historical precedent where doctl registry login fixed a similar 401.
Evidence against: Both doctl registry login and doctl registry docker-config --read-write still produced 401.
Verdict: ❌ Ruled out in this context (the January fix was likely for a different underlying cause).
Hypothesis 5: DOCR_TOKEN expired or invalid
Theory: The DigitalOcean Personal Access Token may have expired around March 6.
Evidence for: docker login succeeds (validates against registry.digitalocean.com/v2/) but cosign's token exchange (against api.digitalocean.com/v2/registry/auth) fails — these are different endpoints with potentially different token validation.
Evidence against: Regenerating the token didn't fix the issue. The build stage's cosign sign and cosign attest (SBOM) continued to work with the same token.
Verdict: ❌ Ruled out.
Hypothesis 6: Self-hosted agent environment corruption
Theory: Something changed on the devops-buildagent machine (Docker update, credential store, system config) that broke cosign's auth flow.
Evidence for: The YAML was identical to the working version but still failed on the self-hosted agent.
Evidence against: Moving to a fresh MS-hosted ubuntu-latest agent with a completely clean environment still produced the identical 401 error.
Verdict: ❌ Definitively ruled out.
Hypothesis 7: Registry inaccessible during garbage collection ✅
Theory: DigitalOcean's container registry auth endpoint returns 401 during active garbage collection. The RegistryMaintenance stage triggers GC and runs in parallel with AcceptanceTest, so the attestation job starts while GC is still running.
Evidence for:
- Every other hypothesis was exhaustively ruled out
- The error occurred on both self-hosted and MS-hosted agents
- The error occurred with fresh tokens, every cosign version, and every auth method
- The Build stage's cosign operations (before GC starts) always worked
- Moving attestation to run after RegistryMaintenance immediately fixed the issue
Evidence against: None.
Verdict: ✅ Root cause confirmed.
The Fix
Changes Made (commit 2ea404a)
- Extracted AttestStagingPromotion from the AcceptanceTest stage into a new top-level stage AttestStaging
- New stage dependencies:

      dependsOn:
        - Build                # For image digests and affected flags
        - AcceptanceTest       # Must succeed (acceptance tests passed)
        - RegistryMaintenance  # Must complete (GC finished, success or failure)

- Condition: Requires AcceptanceTest to succeed; allows RegistryMaintenance to be in any completed state:

      condition: |
        and(
          succeeded('AcceptanceTest'),
          in(dependencies.RegistryMaintenance.result, 'Succeeded', 'SucceededWithIssues', 'Skipped', 'Failed'),
          eq(variables.isMain, true),
          eq('${{ parameters.enableSigning }}', 'true'),
          eq('${{ parameters.loadTestOnly }}', 'false')
        )

- Agent pool: vmImage: 'ubuntu-latest' (MS-hosted, clean environment every run)
- Updated DeployProduction to depend on the new AttestStaging stage
Follow-up: Retry on attestation (March 2026)
Intermittent 401s were still observed occasionally after the stage-order fix. DigitalOcean’s auth endpoint can return 401 for a short period after GC is reported complete (e.g. internal cleanup or propagation delay). To handle this tail window without manual retries:
- Each cosign attest in the AttestStaging job was wrapped in a retry loop: up to 3 attempts with a 20s delay between attempts. On transient 401, the step retries automatically; only after 3 failures does the job fail.
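The retry loop described above could look roughly like the sketch below. It is an illustration under assumptions, not the pipeline's actual script: the function name `retry` and its argument order are hypothetical, and it retries on any non-zero exit code rather than inspecting the output for a 401 specifically.

```shell
#!/usr/bin/env bash
# Sketch only: a generic retry wrapper; names and argument order are illustrative.

# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_ATTEMPTS attempts have failed.
# Retries on ANY non-zero exit code, not only on a 401 from the registry.
retry() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if (( attempt >= max_attempts )); then
      echo "Attempt ${attempt}/${max_attempts} failed; giving up." >&2
      return 1
    fi
    echo "Attempt ${attempt}/${max_attempts} failed; retrying in ${delay}s..." >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
}
```

With the values from this follow-up, each call would be written as `retry 3 20 cosign attest ...`, keeping the existing cosign arguments unchanged. Retrying on any failure is simpler than matching on the error text, at the cost of also delaying the report of genuine, non-transient failures by up to 40 seconds.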
Why This Wasn't Caught Earlier
On February 25 (a251940), the RegistryMaintenance stage was extracted from DeployStaging into its own top-level stage. This made it run in parallel with AcceptanceTest — both depend on Build and DeployStaging. Before this change, cleanup ran as a job inside DeployStaging, which completed before AcceptanceTest started, so GC never overlapped with attestation.
After the extraction, the pipeline worked for 8 days (Feb 25 – Mar 5) because GC timing happened to fall in the safe window — GC finished before the attestation job started. This was a latent race condition that only became deterministic when the pipeline's workload or GC duration shifted slightly.
The DOCKER_CONFIG commit (9e24ea7 on Mar 5) was a false correlation — it happened to land at the moment the race condition started manifesting consistently. This misdirected all debugging effort toward credential configuration when the real cause was pipeline stage ordering.
Why docker login Succeeds but cosign Fails During GC
DigitalOcean's container registry uses Docker Registry V2 auth with two separate endpoints:
- docker login validates credentials directly against registry.digitalocean.com/v2/; this endpoint remains accessible during GC
- cosign attest needs to push to the registry, which requires a Bearer token obtained via token exchange at api.digitalocean.com/v2/registry/auth; this endpoint returns 401 during GC
This is why docker login always reported success while cosign consistently failed.
Lessons Learned
1. Timing-dependent failures are invisible to YAML diffs
The pipeline YAML was reverted to the exact working version and still failed. When reverting the suspect commit does not restore the working behavior, the cause lies outside the diff: it is environmental or temporal.
2. Eliminate variables systematically
The debugging process followed good scientific method — changing one variable at a time. But the breakthrough came from the user's domain knowledge about the registry GC, not from code-level debugging.
3. Registry operations should never overlap with GC
Any pipeline step that interacts with the container registry (push, pull, sign, attest, verify) should be sequenced after garbage collection completes. Use stage-level dependsOn to enforce this.
4. The cosign install pattern had a real bug
While not the root cause, the if ! command -v cosign pattern was genuinely buggy on self-hosted agents:
# BUGGY: If system cosign exists, PATH prepend is skipped
if ! command -v cosign &>/dev/null; then
# install cosign...
echo "##vso[task.prependpath]$HOME/.local/bin" # Only runs if installed
fi
On the MS-hosted agents this is fine (no system cosign), but on self-hosted agents it could pick up an unexpected system binary. This was fixed as a side effect of moving to ubuntu-latest.
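One way to avoid this class of bug on self-hosted agents is to separate "install if missing" from "put the pinned binary first on PATH". The sketch below assumes the pinned binary lives in `$HOME/.local/bin`, as in the snippet above; `COSIGN_DIR`, `ensure_cosign`, and `install_cosign` are hypothetical names, and `install_cosign` is a placeholder for the real download step rather than the pipeline's actual code.

```shell
#!/usr/bin/env bash
# Sketch only: COSIGN_DIR, ensure_cosign and install_cosign are hypothetical names.
COSIGN_DIR="${COSIGN_DIR:-$HOME/.local/bin}"

# Placeholder for the real install step (download the pinned release into "$1").
install_cosign() {
  echo "install_cosign: not implemented in this sketch" >&2
  return 1
}

ensure_cosign() {
  # Check for the pinned binary in our own directory, not for "any cosign on
  # PATH", so a system-wide cosign cannot cause the install to be skipped.
  if [ ! -x "$COSIGN_DIR/cosign" ]; then
    mkdir -p "$COSIGN_DIR"
    install_cosign "$COSIGN_DIR" || return 1
  fi
  # Always prepend, whether or not an install just happened.
  # The logging command takes effect for later steps in the job...
  echo "##vso[task.prependpath]$COSIGN_DIR"
  # ...so export PATH as well to cover the rest of the current step.
  export PATH="$COSIGN_DIR:$PATH"
}
```

After `ensure_cosign` returns, `command -v cosign` should resolve to the pinned copy in both the current step and subsequent ones, regardless of what is installed system-wide.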
5. Two endpoints, two auth stories
DigitalOcean separates the Docker registry (registry.digitalocean.com) from the API (api.digitalocean.com). The registry endpoint stays up during GC; the API auth endpoint does not. This distinction is critical for any tool that uses the Docker V2 token exchange flow (cosign, crane, skopeo, etc.).
Recommendations
- Monitor GC duration: Add timing output to the RegistryMaintenance stage to track how long GC takes. If it grows beyond a threshold, consider optimizing cleanup or scheduling GC separately.
- Add retry logic to cosign attest: Even with proper stage ordering, transient registry errors can occur. A simple retry wrapper (3 attempts, 30s delay) would add resilience.
- Document the DigitalOcean GC behavior: This behavior is not well-documented by DigitalOcean. Adding a note to the deployment runbook ensures future engineers understand the constraint.
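For the GC duration recommendation, a small timing wrapper around the cleanup command may be enough. The following is a sketch under assumptions: `timed_step` is a hypothetical helper, the threshold is an arbitrary number of seconds chosen by the caller, and the warning uses the Azure Pipelines `task.logissue` logging command so that it shows up on the run summary.

```shell
#!/usr/bin/env bash
# Sketch only: timed_step is a hypothetical helper, not part of the pipeline.

# timed_step LABEL THRESHOLD_SECONDS COMMAND [ARGS...]
# Runs COMMAND, prints how long it took, and emits a pipeline warning when
# the duration exceeds THRESHOLD_SECONDS. Returns COMMAND's exit code.
timed_step() {
  local label=$1 threshold=$2
  shift 2
  local start end elapsed status=0
  start=$(date +%s)
  "$@" || status=$?
  end=$(date +%s)
  elapsed=$(( end - start ))
  echo "${label} took ${elapsed}s (exit code ${status})"
  if (( elapsed > threshold )); then
    echo "##vso[task.logissue type=warning]${label} took ${elapsed}s, above the ${threshold}s threshold"
  fi
  return "$status"
}
```

In the RegistryMaintenance stage this would wrap whatever command currently triggers and waits for garbage collection, for example `timed_step "Registry GC" 600 <existing GC command>`. The 600-second threshold is a placeholder; a sensible value would come from the durations observed over the first few runs.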
Files Changed
| File | Change |
|---|---|
| azure-pipelines.yml | Extracted AttestStagingPromotion from AcceptanceTest stage into new AttestStaging stage with dependsOn: [Build, AcceptanceTest, RegistryMaintenance] |
Related
- Previous cosign issue (Jan 2026)
- TIMELINE.md — Addendum entry for this incident
- GridFlock Pipeline Failure (Feb 2026) — Similar multi-cause investigation pattern