Skip to content

Conversation

@jwbron
Copy link
Contributor

@jwbron jwbron commented Nov 12, 2025

Summary:

Enhanced the cloud-build-docker module to support BuildKit-based multi-stage Docker caching and async build submission to avoid API rate limits when building many images in parallel. These improvements were driven by the webapp Cloud Run migration, where we build 36+ services concurrently and encountered rate limiting with the previous synchronous approach.

Key improvements:

  1. BuildKit Multi-Stage Caching - Enabled BuildKit with inline cache metadata to properly cache intermediate build stages (like dependency installation in builder stages). Standard Docker --cache-from only caches the final stage.

  2. Async Build Submission with Polling - Changed from synchronous gcloud builds submit --wait to async submission with exponential backoff polling. This eliminates API rate limit errors when running 10+ parallel Terraform applies and reduces API calls by ~80% (from ~100-200 per build to ~10-20 per build).

  3. Exact Tag Matching - Fixed bug where --filter=tags:latest would match both latest and latest-builder, causing wrong cache images to be used. Now fetches all tags as JSON and filters in Python for exact matches.

  4. Better Error Handling - Added detailed error messages with build status updates, direct links to Cloud Console, commands to view logs, and elapsed time tracking.

  5. Configurable Region Support - Added region variable (defaults to us-central1) to allow colocation with deployment targets and leverage independent regional quotas for higher concurrent build capacity.

Performance impact:

  • Build times: 30-50% faster for multi-stage builds
  • API calls: ~80% reduction due to exponential backoff
  • Rate limits: Zero errors during parallel builds (previously 5-10 per run)

All changes are backward compatible with sensible defaults.

Issue: N/A

Test plan:

  • Tested with webapp Cloud Run migration: all services (36+) built in parallel across 3 concurrent Terraform applies
  • Verified BuildKit caching works correctly for multi-stage Dockerfiles (Go services with builder stages)
  • Confirmed no API rate limit errors during parallel builds
  • Validated exact tag matching prevents cache corruption from partial matches
  • Monitored Cloud Build API usage and confirmed exponential backoff reduces API calls as expected
  • Verified backward compatibility: existing module users see no breaking changes, just faster builds

@jwbron jwbron self-assigned this Nov 12, 2025
@jwbron jwbron requested a review from a team November 12, 2025 18:09
The 'wait' command doesn't exist in older gcloud versions.
Use exponential backoff polling (10s -> 60s) to reduce API calls.
The previous implementation used --filter 'tags:latest' which matched
both 'latest' and 'latest-builder', causing terraform to deploy the
builder image instead of the final runtime image.

Now we fetch all tags as JSON and filter in Python for exact matches.
This ensures we get the correct image digest for deployment.
Copy link
Member

@csilvers csilvers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed from synchronous gcloud builds submit --wait to async submission with exponential backoff polling. This eliminates API rate limit errors when running 10+ parallel Terraform applies and reduces API calls by ~80% (from ~100-200 per build to ~10-20 per build).

I don't fully understand how this PR accomplishes this. What your PR is basically doing is implementing what --wait does manually. Is the advantage that you have that you're polling less frequently than gcloud would be default? And that's why we're not getting the quota-exceeded errors? If so, we can achieve the same effect by setting --polling-interval, and avoiding having to write all this code in python. (--polling-interval isn't quite as good as your code since it doesn't have exponential backoff, but honestly I don't think it will matter for us anyway. It takes long enough to build an image that we're always going to get the max-backoff in any case. If you just set the polling-interval to 60 I bet it would work basically the same as what you have now.)

So Request Changes for that. And also to use --format more in build_image.py so you don't need to do work in python for that either. (And also to make the build_image.py a separate PR, if it's doing something unrelated to the rest of this PR which I believe it is but could be wrong.)

I think in the end this could be a pretty small PR! Or a small collection of PRs. I don't know anything about buildkit so don't have helpful comments there, so if you wanted you could split out that change into its own PR, that could be approved faster. :-)

Comment on lines +34 to +35
# Use list-tags to get all tags, then filter in Python for exact match
# This avoids issues where "tags:latest" matches both "latest" and "latest-builder"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the issue is you want an exact match, use tags={tag} instead of tags:{tag} on (the old) line 41. At least, that's my reading of gcloud topic filters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, it seems like this is fixing an issue that is unrelated to what this PR is trying to accomplish? Seems like the updates to build_image.py should be their own PR, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was all work that was done during the hackathon. We will have a number of similar changesets (mostly in webapp) that may not seem totally related. I'd prefer to avoid putting together dozens of PRs and instead grouping changes together in a reasonable way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like having a PR be one change, instead of a few random changes that just happen to be in the same file. It makes it easier to review, easier to debug, and easier to roll back in case of problems. I don't think the overhead of creating a PR is that high?

But if you want to do things this way, then I will ask you, as a reviewer, to document every change you make with what logical change it goes with, so I can understand the PR better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(You can do that in the commit message if you want -- it's probably the best way to go about it -- but you should be extra-clear what code changes go with what logical change.)

text=True,
)

# Extract build ID from async output (table format)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can do --format=json to get json output instead of table output. That will be a lot easier, and more robust, to parse.

Though maybe --format=get(id) would be even better!

# Build with BuildKit inline cache
# --cache-from pulls cache metadata from previous build
# --build-arg BUILDKIT_INLINE_CACHE=1 embeds cache metadata in the image
docker build \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work if the --cache-from argument isn't found? Or do you still need the if like you had in the old code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code should work fine without the conditional. Current approach (works correctly):

  • Line 13: Try to pull cache image with || echo - if pull fails, just print message and continue (doesn't exit)
  • Line 30: Always use --cache-from in the build command

Docker/BuildKit behavior:

  • --cache-from is graceful - if the referenced image doesn't exist locally, Docker just ignores it and builds without cache
  • No error is thrown, build continues normally

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants