
Incident Report: Cosign 401 Unauthorized — Staging Attestation Failure

Date: March 6–8, 2026
Duration: ~32 hours of investigation and remediation (across 3 days)
Severity: P3 — Staging attestations blocked; deployments unaffected
Status: Resolved
Pipeline Runs Affected: #1893 through #1903


Summary

All cosign attest operations in the staging promotion step of the pipeline started failing with 401 Unauthorized against DigitalOcean's container registry auth endpoint. The error persisted across 14 commits and multiple hypothesis-driven fixes. The root cause was a race condition: the AttestStagingPromotion job ran concurrently with the RegistryMaintenance stage, which triggers DigitalOcean container registry garbage collection. During GC, the registry's auth endpoint (api.digitalocean.com/v2/registry/auth) rejects all token exchange requests with HTTP 401, even though docker login (which validates against a different endpoint) succeeds.

The fix was to extract the attestation job into its own pipeline stage (AttestStaging) that depends on RegistryMaintenance completing before it starts.


The Error

Every cosign attest call produced the same error:

Error: signing registry.digitalocean.com/forma-3d/forma3d-connect-docs@sha256:...
  GET https://api.digitalocean.com/v2/registry/auth
    ?scope=repository:forma-3d/forma3d-connect-docs:push,pull
    &service=registry.digitalocean.com
  : unexpected status code 401 Unauthorized

Key observations:

- docker login always succeeded ("Login Succeeded")
- The Sigstore transparency log entry was always created successfully
- The failure happened specifically when cosign tried to push the attestation to the registry
- The same DOCR_TOKEN worked for docker push and cosign sign in the Build stage


Pipeline Architecture (Before Fix)

[Diagram: pre-fix stage graph. AcceptanceTest and RegistryMaintenance both depend on Build and DeployStaging, so they run in parallel]

The AcceptanceTest stage and RegistryMaintenance stage both depended on Build and DeployStaging, meaning they ran in parallel. The AttestStagingPromotion job (inside AcceptanceTest) would often start while garbage collection was still running.
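As a sketch, the pre-fix stage wiring looked roughly like this (stage and job names are taken from this report; all other details are illustrative, not the actual azure-pipelines.yml):

```yaml
# Illustrative pre-fix layout: both stages declare the same dependsOn,
# so Azure DevOps schedules them in parallel.
stages:
  - stage: AcceptanceTest
    dependsOn: [Build, DeployStaging]
    jobs:
      - job: RunAcceptanceTests
      - job: AttestStagingPromotion    # races with the registry GC below
  - stage: RegistryMaintenance
    dependsOn: [Build, DeployStaging]
    jobs:
      - job: TriggerGarbageCollection  # auth endpoint returns 401 while this runs
```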


Timeline

| Time | Commit | Action | Result |
| --- | --- | --- | --- |
| Feb 25, 23:23 | a251940 | Extracted RegistryMaintenance into its own stage (parallel with AcceptanceTest) | ✅ Worked by luck — this introduced the latent race condition |
| Mar 5, 19:57 | 34f611d | Last working pipeline (PR #489 merged) | ✅ All attestations pass (GC happened to finish first) |
| Mar 5, 23:06 | 9e24ea7 | Introduced DOCKER_CONFIG env var for attestation steps | ❌ 401 — falsely correlated; GC race started manifesting |
| Mar 6, 14:06 | 3cc30dd | Added set -e to login scripts | ❌ 401 |
| Mar 7, 10:14 | 5eb2a27 | Removed custom DOCKER_CONFIG, simplified login | ❌ 401 |
| Mar 7, 11:10 | 1fa4437 | Added credential validation with docker info | ❌ 401 |
| Mar 7, 11:37 | 6c0df9d | Switched to doctl registry login | ❌ doctl: command not found |
| Mar 7, 18:13 | 50063b2 | Fixed doctl PATH for same-step execution | ❌ 401 |
| Mar 7, 18:48 | 2b3cc07 | Used doctl registry docker-config --read-write | ❌ 401 |
| Mar 7, 19:18 | 328b81c | Added debug diagnostics (curl, crane, cosign version) | ❌ 401 (debug confirmed credentials valid) |
| Mar 7, 20:21 | ee7501e | Fixed cosign install to always prepend PATH | ❌ 401 |
| Mar 7, 21:09 | 5bbf20c | Forced specific cosign v2.2.4 binary | ❌ 401 |
| Mar 8, 00:20 | 270b153 | Reverted to system cosign + simple docker login | ❌ 401 |
| Mar 8, 01:09 | d4605f5 | Reverted YAML to exact working state of 34f611d | ❌ 401 |
| Mar 8, 01:48 | 059e70d | Moved to MS-hosted ubuntu-latest agent | ❌ 401 |
| Mar 8, 06:24 | | Regenerated DOCR_TOKEN in DigitalOcean | ❌ 401 |
| Mar 8, 07:36 | 2ea404a | Extracted AttestStaging into own stage after RegistryMaintenance | ✅ Fixed |

Hypotheses Explored

Hypothesis 1: DOCKER_CONFIG propagation failure

Theory: The DOCKER_CONFIG environment variable set via ##vso[task.setvariable] wasn't reaching the attestation step scripts.

Evidence for: The commit 9e24ea7 introduced DOCKER_CONFIG and the error started appearing in the same pipeline run.

Evidence against: Reverting the YAML to the exact working state (34f611d) still produced 401. The DOCKER_CONFIG change was a false correlation — the real cause was the GC race condition introduced on Feb 25 (a251940) that had been a latent timing bomb for 8 days.

Verdict: ❌ Red herring. The temporal correlation was coincidence — GC timing simply shifted enough to start hitting the failure window around the same time.


Hypothesis 2: Credential helper interference on self-hosted agent

Theory: The self-hosted DO build agent had a Docker credential helper (credsStore or credHelpers) that was intercepting docker login and storing credentials in a way cosign couldn't read.

Evidence for: The working version had echo '{}' > ~/.docker/config.json (wiping any credential helper config) before docker login.

Evidence against: Even with the config reset restored, the error persisted. Also failed on a fresh MS-hosted agent (no credential helpers).

Verdict: ❌ Ruled out.

Hypothesis 3: Wrong cosign binary version

Theory: A system-wide cosign at /usr/local/bin/cosign (older version) was being used instead of the pipeline's intended v2.2.4, and the older version handled auth differently.

Evidence for: Debug output showed the if ! command -v cosign check was skipping installation because system cosign existed. The ##vso[task.prependpath] was inside the if block, so it was also skipped.

Evidence against: Forcing v2.2.4 explicitly still produced 401. Using the system cosign also produced 401. On ubuntu-latest with freshly installed v2.2.4, still 401.

Verdict: ❌ Ruled out (but the install logic was genuinely buggy — see Lessons Learned).

Hypothesis 4: doctl registry login needed

Theory: Based on a previous incident (January 2026), doctl registry login was needed instead of plain docker login for cosign compatibility.

Evidence for: Historical precedent where doctl registry login fixed a similar 401.

Evidence against: Both doctl registry login and doctl registry docker-config --read-write still produced 401.

Verdict: ❌ Ruled out in this context (the January fix was likely for a different underlying cause).

Hypothesis 5: DOCR_TOKEN expired or invalid

Theory: The DigitalOcean Personal Access Token may have expired around March 6.

Evidence for: docker login succeeds (validates against registry.digitalocean.com/v2/) but cosign's token exchange (against api.digitalocean.com/v2/registry/auth) fails — these are different endpoints with potentially different token validation.

Evidence against: Regenerating the token didn't fix the issue. The build stage's cosign sign and cosign attest (SBOM) continued to work with the same token.

Verdict: ❌ Ruled out.

Hypothesis 6: Self-hosted agent environment corruption

Theory: Something changed on the devops-buildagent machine (Docker update, credential store, system config) that broke cosign's auth flow.

Evidence for: The YAML was identical to the working version but still failed on the self-hosted agent.

Evidence against: Moving to a fresh MS-hosted ubuntu-latest agent with a completely clean environment still produced the identical 401 error.

Verdict: ❌ Definitively ruled out.

Hypothesis 7: Registry inaccessible during garbage collection ✅

Theory: DigitalOcean's container registry auth endpoint returns 401 during active garbage collection. The RegistryMaintenance stage triggers GC and runs in parallel with AcceptanceTest, so the attestation job starts while GC is still running.

Evidence for:

- Every other hypothesis was exhaustively ruled out
- The error occurred on both self-hosted and MS-hosted agents
- The error occurred with fresh tokens, every cosign version, and every auth method
- The Build stage's cosign operations (before GC starts) always worked
- Moving attestation to run after RegistryMaintenance immediately fixed the issue

Evidence against: None.

Verdict: ✅ Root cause confirmed.


The Fix

[Diagram: post-fix stage graph. AttestStaging runs only after Build, AcceptanceTest, and RegistryMaintenance have completed]

Changes Made (commit 2ea404a)

  1. Extracted AttestStagingPromotion from the AcceptanceTest stage into a new top-level stage AttestStaging

  2. New stage dependencies:

    dependsOn:
      - Build              # For image digests and affected flags
      - AcceptanceTest     # Must succeed (acceptance tests passed)
      - RegistryMaintenance # Must complete (GC finished, success or failure)
    

  3. Condition: Requires AcceptanceTest to succeed; allows RegistryMaintenance to be in any completed state:

    condition: |
      and(
        succeeded('AcceptanceTest'),
        in(dependencies.RegistryMaintenance.result,
           'Succeeded', 'SucceededWithIssues', 'Skipped', 'Failed'),
        eq(variables.isMain, true),
        eq('${{ parameters.enableSigning }}', 'true'),
        eq('${{ parameters.loadTestOnly }}', 'false')
      )
    

  4. Agent pool: vmImage: 'ubuntu-latest' (MS-hosted, clean environment every run)

  5. Updated DeployProduction to depend on the new AttestStaging stage

Follow-up: Retry on attestation (March 2026)

Intermittent 401s were still observed occasionally after the stage-order fix. DigitalOcean’s auth endpoint can return 401 for a short period after GC is reported complete (e.g. internal cleanup or propagation delay). To handle this tail window without manual retries:

  • Each cosign attest in the AttestStaging job was wrapped in a retry loop: up to 3 attempts with a 20s delay between attempts. On transient 401, the step retries automatically; only after 3 failures does the job fail.
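The retry loop described above can be sketched as a small shell wrapper. This is a minimal sketch: the function name, the RETRY_* environment overrides, and the commented cosign invocation are illustrative, not the pipeline's exact script.

```shell
#!/usr/bin/env bash
# Retry wrapper sketch: up to 3 attempts, 20s apart, matching the behavior
# described above. RETRY_ATTEMPTS/RETRY_DELAY are hypothetical overrides.
set -euo pipefail

retry_attest() {
  local attempts="${RETRY_ATTEMPTS:-3}" delay="${RETRY_DELAY:-20}" i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0                                # attestation pushed successfully
    fi
    echo "attempt $i/$attempts failed; retrying" >&2
    if (( i < attempts )); then sleep "$delay"; fi
  done
  return 1                                    # fail the job only after all attempts
}

# Hypothetical usage inside the AttestStaging job:
# retry_attest cosign attest --yes --predicate promotion.json \
#   "registry.digitalocean.com/forma-3d/forma3d-connect-docs@${DIGEST}"
```

On a transient 401 the wrapped command simply runs again after the delay; a persistent failure still surfaces as a failed step.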

Why This Wasn't Caught Earlier

On February 25 (a251940), the RegistryMaintenance stage was extracted from DeployStaging into its own top-level stage. This made it run in parallel with AcceptanceTest — both depend on Build and DeployStaging. Before this change, cleanup ran as a job inside DeployStaging, which completed before AcceptanceTest started, so GC never overlapped with attestation.

After the extraction, the pipeline worked for 8 days (Feb 25 – Mar 5) because GC timing happened to fall in the safe window — GC finished before the attestation job started. This was a latent race condition that only became deterministic when the pipeline's workload or GC duration shifted slightly.

The DOCKER_CONFIG commit (9e24ea7 on Mar 5) was a false correlation — it happened to land at the moment the race condition started manifesting consistently. This misdirected all debugging effort toward credential configuration when the real cause was pipeline stage ordering.



Why docker login Succeeds but cosign Fails During GC

DigitalOcean's container registry uses Docker Registry V2 auth with two separate endpoints:

[Diagram: the two auth paths. docker login validates against registry.digitalocean.com/v2/; cosign's token exchange goes to api.digitalocean.com/v2/registry/auth]

  1. docker login validates credentials directly against registry.digitalocean.com/v2/ — this endpoint remains accessible during GC
  2. cosign attest needs to push to the registry, which requires a Bearer token obtained via token exchange at api.digitalocean.com/v2/registry/auth — this endpoint returns 401 during GC

This is why docker login always reported success while cosign consistently failed.
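The asymmetry can be probed directly. The helper below just reconstructs the token-exchange URL from the 401 error shown earlier; the commented curl lines are an illustrative diagnostic (they require a valid DOCR_TOKEN and only show the split when run during GC):

```shell
# Rebuild the token-exchange URL that cosign hits, from the error output above.
docr_auth_url() {
  printf 'https://api.digitalocean.com/v2/registry/auth?scope=repository:%s:push,pull&service=registry.digitalocean.com' "$1"
}

# Probing both endpoints with the same token makes the asymmetry visible.
# During GC, the first request succeeds while the second returns 401:
#
# curl -s -o /dev/null -w '%{http_code}\n' \
#   -u "$DOCR_TOKEN:$DOCR_TOKEN" https://registry.digitalocean.com/v2/
# curl -s -o /dev/null -w '%{http_code}\n' \
#   -u "$DOCR_TOKEN:$DOCR_TOKEN" "$(docr_auth_url forma-3d/forma3d-connect-docs)"
```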


Lessons Learned

1. Timing-dependent failures are invisible to YAML diffs

The pipeline YAML was reverted to the exact working version and still failed. When a failure correlates with a commit but isn't caused by it, the real cause is environmental or temporal.

2. Eliminate variables systematically

The debugging process followed good scientific method — changing one variable at a time. But the breakthrough came from domain knowledge about the registry's GC behavior, not from code-level debugging.

3. Registry operations should never overlap with GC

Any pipeline step that interacts with the container registry (push, pull, sign, attest, verify) should be sequenced after garbage collection completes. Use stage-level dependsOn to enforce this.

4. The cosign install pattern had a real bug

While not the root cause, the if ! command -v cosign pattern was genuinely buggy on self-hosted agents:

# BUGGY: If system cosign exists, PATH prepend is skipped
if ! command -v cosign &>/dev/null; then
  # install cosign...
  echo "##vso[task.prependpath]$HOME/.local/bin"  # Only runs if installed
fi

On the MS-hosted agents this is fine (no system cosign), but on self-hosted agents it could pick up an unexpected system binary. This was fixed as a side effect of moving to ubuntu-latest.
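A sketch of the repaired pattern, with the install directory and the pinned version carried over from the buggy snippet above (the commented download URL is illustrative):

```shell
# FIXED sketch: the PATH prepend runs unconditionally, so the pipeline's
# pinned cosign shadows any system binary such as /usr/local/bin/cosign.
set -euo pipefail
COSIGN_DIR="$HOME/.local/bin"
mkdir -p "$COSIGN_DIR"

if [ ! -x "$COSIGN_DIR/cosign" ]; then
  # Install the pinned version here, e.g.:
  # curl -sSfL -o "$COSIGN_DIR/cosign" \
  #   "https://github.com/sigstore/cosign/releases/download/v2.2.4/cosign-linux-amd64"
  # chmod +x "$COSIGN_DIR/cosign"
  :
fi

echo "##vso[task.prependpath]$COSIGN_DIR"   # always emitted, outside the if
```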

5. Two endpoints, two auth stories

DigitalOcean separates the Docker registry (registry.digitalocean.com) from the API (api.digitalocean.com). The registry endpoint stays up during GC; the API auth endpoint does not. This distinction is critical for any tool that uses the Docker V2 token exchange flow (cosign, crane, skopeo, etc.).


Recommendations

  1. Monitor GC duration: Add timing output to the RegistryMaintenance stage to track how long GC takes. If it grows beyond a threshold, consider optimizing cleanup or scheduling GC separately.

  2. Add retry logic to cosign attest: Even with proper stage ordering, transient registry errors can occur. A simple retry wrapper (3 attempts, 30s delay) would add resilience.

  3. Document the DigitalOcean GC behavior: This behavior is not well-documented by DigitalOcean. Adding a note to the deployment runbook ensures future engineers understand the constraint.
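For recommendation 1, a minimal timing wrapper for the RegistryMaintenance stage might look like this. The 10-minute threshold, variable names, and the commented doctl call are assumptions, not existing pipeline code:

```shell
# Sketch: measure GC duration and surface an Azure DevOps warning when it
# exceeds a threshold, so growth is visible before it causes trouble.
set -euo pipefail
GC_WARN_SECONDS="${GC_WARN_SECONDS:-600}"   # hypothetical 10-minute threshold

gc_start=$(date +%s)
# doctl registry garbage-collection start --force   # the actual GC trigger
# ...poll until the GC run reports completion...
gc_end=$(date +%s)
gc_duration=$(( gc_end - gc_start ))

echo "##vso[task.setvariable variable=gcDurationSeconds]$gc_duration"
if (( gc_duration > GC_WARN_SECONDS )); then
  echo "##vso[task.logissue type=warning]Registry GC took ${gc_duration}s (threshold ${GC_WARN_SECONDS}s)"
fi
```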


Files Changed

| File | Change |
| --- | --- |
| azure-pipelines.yml | Extracted AttestStagingPromotion from AcceptanceTest stage into new AttestStaging stage with dependsOn: [Build, AcceptanceTest, RegistryMaintenance] |