push auth fails when 5 to 10 Minutes after pull auth (with Workload Identity in GCP) #5852

MichaelKorn opened this issue Mar 18, 2025 · 3 comments · May be fixed by #5859

Comments

@MichaelKorn

Contributing guidelines and issue reporting guide

Well-formed report checklist

  • I have found a bug that the documentation does not mention anything about my problem
  • I have found a bug that there are no open or closed issues that are related to my problem
  • I have provided version/information about my environment and done my best to provide a reproducer

Description of bug

Bug description

When using Google Artifact Registry with Workload Identity for authentication, image pushes fail with an authorization error if the push happens roughly 5 to 10 minutes after the cache pull. The error is:

Error: buildx failed with: ERROR: failed to solve: error writing layer blob: failed to authorize: failed to fetch oauth token: unexpected status from GET request to https://europe-west3-docker.pkg.dev/v2/token?scope=repository%3A__our-project__%2F__our-registry__%2Ftest-nginx-image%3Apull%2Cpush&service=europe-west3-docker.pkg.dev: 401 Unauthorized

The error seems to come from authprovider.go#L140, and the issue could result from authprovider.go#L62.
I tried changing the code to:

// Tokens for Google Artifact Registry via Workload Identity expire after 5 minutes
return time.Since(created) > 5*time.Minute-10*time.Second

However, these changes (I also tried changing a log message) do not take effect after I build the BuildKit image and use it with buildx.

Reproduction

  1. Docker Registry in Google Artifact Registry
  2. Run the build via GitHub Actions
  3. Use Workload Identity Federation for authentication against Google services from the GitHub workflow run
  4. Use buildx
  5. The build needs to have two [auth] steps:
    1. The first is triggered by --cache-from: [auth] .../test-nginx-image:pull token for europe-west3-docker.pkg.dev
    2. The second is triggered by --cache-to or --push: [auth] .../test-nginx-image:pull,push token for europe-west3-docker.pkg.dev
  6. The second [auth] needs to happen more than 5 minutes, but less than 10 minutes, after the first [auth].
    1. Initially this was a complicated Dockerfile, but a simple build reproduces it: sleep 270 works, sleep 300 fails, and sleep 600 (or much longer) works again.
  • Using other authentication mechanisms works fine.
  • I tried several sleeps, also before the build; the problem seems to be related only to the timing within a single docker buildx build call.
  • As a workaround, we can do a build without push, followed by a build with push. The --cache-from can stay in the second call: because everything is cached, no [auth] for the remote cache is needed (or appears in the log) during the second run.

Version information

/usr/bin/docker buildx version
  github.com/docker/buildx v0.21.3 7b5fecbd7a62d73843f7a73a6d4ec353c0555ef5
/usr/bin/docker buildx inspect --bootstrap --builder builder-db441f8f-6bde-49ee-b10d-ccac2e79b5c6
  #1 [internal] booting buildkit
  #1 pulling image moby/buildkit:buildx-stable-1
  #1 pulling image moby/buildkit:buildx-stable-1 4.9s done
  #1 creating container buildx_buildkit_builder-db441f8f-6bde-49ee-b10d-ccac2e79b5c60
  #1 creating container buildx_buildkit_builder-db441f8f-6bde-49ee-b10d-ccac2e79b5c60 13.0s done
  #1 DONE 18.0s
  Name:          builder-db441f8f-6bde-49ee-b10d-ccac2e79b5c6
  Driver:        docker-container
  Last Activity: 2025-03-18 18:15:21 +0000 UTC
  
  Nodes:
  Name:                  builder-db441f8f-6bde-49ee-b10d-ccac2e79b5c60
  Endpoint:              unix:///run/docker/docker.sock
  Status:                running
  BuildKit daemon flags: --debug --allow-insecure-entitlement=network.host
  BuildKit version:      v0.20.1
  Platforms:             linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/amd64/v4, linux/386
  Labels:
   org.mobyproject.buildkit.worker.executor:         oci
   org.mobyproject.buildkit.worker.hostname:         56024517c3e3
   org.mobyproject.buildkit.worker.network:          host
   org.mobyproject.buildkit.worker.oci.process-mode: sandbox
   org.mobyproject.buildkit.worker.selinux.enabled:  false
   org.mobyproject.buildkit.worker.snapshotter:      overlayfs
  GC Policy rule#0:
   All:            false
   Filters:        type==source.local,type==exec.cachemount,type==source.git.checkout
   Keep Duration:  48h0m0s
   Max Used Space: 488.3MiB
  GC Policy rule#1:
   All:            false
   Keep Duration:  1440h0m0s
   Reserved Space: 9.313GiB
   Max Used Space: 93.13GiB
   Min Free Space: 36.32GiB
  GC Policy rule#2:
   All:            false
   Reserved Space: 9.313GiB
   Max Used Space: 93.13GiB
   Min Free Space: 36.32GiB
  GC Policy rule#3:
   All:            true
   Reserved Space: 9.313GiB
   Max Used Space: 93.13GiB
   Min Free Space: 36.32GiB
/usr/bin/docker version
  Client:
   Version:           28.0.1
   API version:       1.48
   Go version:        go1.23.6
   Git commit:        068a01e
   Built:             Wed Feb 26 10:40:04 2025
   OS/Arch:           linux/amd64
   Context:           default
  
  Server: Docker Engine - Community
   Engine:
    Version:          28.0.1
    API version:      1.48 (minimum version 1.24)
    Go version:       go1.23.6
    Git commit:       bbd0a17
    Built:            Wed Feb 26 10:41:19 2025
    OS/Arch:          linux/amd64
    Experimental:     false
   containerd:
    Version:          v1.7.25
    GitCommit:        bcc810d6b9066471b0b6fa75f557a15a1cbf31bb
   runc:
    Version:          1.2.5
    GitCommit:        v1.2.5-0-g59923ef
   docker-init:
    Version:          0.19.0
    GitCommit:        de40ad0
/usr/bin/docker info
  Client:
   Version:    28.0.1
   Context:    default
   Debug Mode: false
   Plugins:
    buildx: Docker Buildx (Docker Inc.)
      Version:  v0.21.3
      Path:     /usr/local/lib/docker/cli-plugins/docker-buildx
    compose: Docker Compose (Docker Inc.)
      Version:  v2.34.0
      Path:     /usr/local/lib/docker/cli-plugins/docker-compose
  
  Server:
   Containers: 0
    Running: 0
    Paused: 0
    Stopped: 0
   Images: 0
   Server Version: 28.0.1
   Storage Driver: overlay2
    Backing Filesystem: extfs
    Supports d_type: true
    Using metacopy: false
    Native Overlay Diff: true
    userxattr: true
   Logging Driver: json-file
   Cgroup Driver: cgroupfs
   Cgroup Version: 2
   Plugins:
    Volume: local
    Network: bridge host ipvlan macvlan null overlay
    Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
   Swarm: inactive
   Runtimes: io.containerd.runc.v2 runc
   Default Runtime: runc
   Init Binary: docker-init
   containerd version: bcc810d6b9066471b0b6fa75f557a15a1cbf31bb
   runc version: v1.2.5-0-g59923ef
   init version: de40ad0
   Security Options:
    seccomp
     Profile: builtin
    cgroupns
   Kernel Version: 5.15.0-1073-gke
   Operating System: Alpine Linux v3.21
   OSType: linux
   Architecture: x86_64
   CPUs: 16
   Total Memory: 125.8GiB
   Name: gcp-tiny-qfhqz-runner-wdnfq
   ID: efccffe4-c154-4c21-8d4c-1cdb57c2dceb
   Docker Root Dir: /var/lib/docker
   Debug Mode: false
   Experimental: false
   Insecure Registries:
    ::1/128
    127.0.0.0/8
   Registry Mirrors:
    https://mirror.gcr.io/
    https://...../
   Live Restore Enabled: false
   Product License: Community Engine
@tonistiigi
Member

and the issue could result from authprovider.go#L62.

Tokens for Google Artifact Registry via Workload Identity expire after 5 minutes

Note that there are two timeouts here. When the registry asks for an authentication token, buildx will contact the auth service with your credentials to pull one. This token has an expiration time as a field, and buildkit/buildx will refresh it and get a new one if the previous token gets close to expiration. This is in https://github.com/moby/buildkit/blob/master/util/resolver/authorizer.go#L330 .

The other case is that for some credential helpers, the credentials themselves are not static and will expire (there are multiple levels of credentials). Because there is significant overhead in the credential helpers, buildx will cache them. When the token expires, buildx will try to generate a new one but may receive an error because the cached credentials don't work anymore. This is the 10min cache expiration time you are pointing to.

You can try with a custom build that changes the timeout to 5min. If it works, we can consider making it configurable or lowering the default. If the error returned from the token endpoint is typed, we could also consider a fix where we try again with uncached credentials. Note that this code runs on the client side, so if you are using buildx you need to update the https://github.com/moby/buildkit/blob/master/session/auth/authprovider/authprovider.go#L62 in the buildx repository.
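The retry idea mentioned above might look roughly like the following sketch. Everything here is hypothetical and only illustrates the control flow: `ErrUnauthorized`, `provider`, `load`, and `exchange` are invented for this example and are not BuildKit types, and whether the real token endpoint error is typed is exactly the open question.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnauthorized is a hypothetical typed error standing in for a
// 401 response from the token endpoint.
var ErrUnauthorized = errors.New("unauthorized")

type credentials struct{ secret string }

// provider is a hypothetical client-side credential cache.
type provider struct {
	cached   *credentials
	load     func() (credentials, error)       // e.g. a credential helper call
	exchange func(credentials) (string, error) // e.g. the registry token endpoint
}

// token tries the cached credentials first and, only on ErrUnauthorized,
// drops the cache and retries once with freshly loaded credentials.
func (p *provider) token() (string, error) {
	if p.cached != nil {
		tok, err := p.exchange(*p.cached)
		if err == nil {
			return tok, nil
		}
		if !errors.Is(err, ErrUnauthorized) {
			return "", err
		}
		p.cached = nil // cached credentials are stale
	}
	creds, err := p.load()
	if err != nil {
		return "", err
	}
	p.cached = &creds
	return p.exchange(creds)
}

func main() {
	p := &provider{
		cached: &credentials{secret: "stale"},
		load:   func() (credentials, error) { return credentials{secret: "fresh"}, nil },
		exchange: func(c credentials) (string, error) {
			if c.secret != "fresh" {
				return "", fmt.Errorf("token endpoint: %w", ErrUnauthorized)
			}
			return "token-for-" + c.secret, nil
		},
	}
	tok, err := p.token()
	fmt.Println(tok, err)
}
```

Retrying only on the typed error keeps other failures, such as network errors, from triggering an extra credential helper call.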

@MichaelKorn
Author

Thanks a lot @tonistiigi for the quick reply.
I had seen the code that handles the actual expiration of tokens. The problem here is that we are dealing with "id_tokens", and these are somewhat tricky: they are very short-lived, there is confusion about how long they actually live (we have observed 5, 10, 15, and 60 minutes in different places), and the tokens themselves are opaque (Google offers an endpoint that returns some information when you send the id_token to it).
And if you use Workload Identity, then you have "id_tokens" everywhere.

In the meantime, my colleague @nobbs has patched buildx so that the relevant place ends up with a 5-minute timeout (a few seconds less would be even better, as I indicated above): nobbs/buildx@444aa01
We have now been able to confirm in several tests that with this change the issue is resolved for us.

Should I create a PR with the change proposed in the issue description?

@tonistiigi
Member

Should I create a PR with the change proposed in the issue description?

Yes, lowering the default to ~5min seems ok. But make the change in buildkit instead of overriding in buildx so all clients have a better default.

FYI @cpuguy83

MichaelKorn added a commit to MichaelKorn/buildkit that referenced this issue Mar 19, 2025