Improve Cloud Build module for parallel builds and multi-stage Docker caching #29
base: main
Conversation
The 'wait' command doesn't exist in older gcloud versions. Use exponential backoff polling (10s -> 60s) to reduce API calls.
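For illustration, a minimal Python sketch of such a polling loop, assuming the build was submitted with `--async` and its status is checked via `gcloud builds describe` (the function name and terminal-status set are illustrative, not the PR's actual code):

```python
import subprocess
import time

TERMINAL_STATUSES = {"SUCCESS", "FAILURE", "INTERNAL_ERROR", "TIMEOUT", "CANCELLED", "EXPIRED"}


def wait_for_build(build_id, project):
    """Poll a Cloud Build job, backing off from 10s to 60s between checks."""
    interval = 10
    while True:
        status = subprocess.run(
            ["gcloud", "builds", "describe", build_id,
             "--project", project, "--format=value(status)"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval)
        interval = min(interval * 2, 60)  # exponential backoff, capped at 60s
```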
The previous implementation used --filter 'tags:latest' which matched both 'latest' and 'latest-builder', causing terraform to deploy the builder image instead of the final runtime image. Now we fetch all tags as JSON and filter in Python for exact matches. This ensures we get the correct image digest for deployment.
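Similarly, a sketch of the exact-match lookup, assuming the image is in a registry that `gcloud container images list-tags` can query (function and variable names are illustrative):

```python
import json
import subprocess


def digest_for_exact_tag(image, tag):
    """Return the digest whose tag list contains exactly `tag`, or None.

    An exact membership test avoids the substring problem where "latest"
    also matched "latest-builder".
    """
    output = subprocess.run(
        ["gcloud", "container", "images", "list-tags", image, "--format=json"],
        capture_output=True, text=True, check=True,
    ).stdout
    for entry in json.loads(output):
        if tag in entry.get("tags", []):
            return entry["digest"]
    return None
```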
Force-pushed from f944502 to 1683f6e
csilvers left a comment
Changed from synchronous gcloud builds submit --wait to async submission with exponential backoff polling. This eliminates API rate limit errors when running 10+ parallel Terraform applies and reduces API calls by ~80% (from ~100-200 per build to ~10-20 per build).
I don't fully understand how this PR accomplishes this. What your PR is basically doing is implementing what --wait does manually. Is the advantage that you're polling less frequently than gcloud does by default? And that's why we're not getting the quota-exceeded errors? If so, we can achieve the same effect by setting --polling-interval, and avoid having to write all this code in Python. (--polling-interval isn't quite as good as your code since it doesn't have exponential backoff, but honestly I don't think it will matter for us anyway. It takes long enough to build an image that we're always going to hit the max backoff in any case. If you just set the polling interval to 60 I bet it would work basically the same as what you have now.)
So Request Changes for that. And also to use --format more in build_image.py so you don't need to do work in python for that either. (And also to make the build_image.py a separate PR, if it's doing something unrelated to the rest of this PR which I believe it is but could be wrong.)
I think in the end this could be a pretty small PR! Or a small collection of PRs. I don't know anything about buildkit so don't have helpful comments there, so if you wanted you could split out that change into its own PR, that could be approved faster. :-)
terraform/modules/cloud-build-docker/.digests/khan-academy-staging_admin_latest.txt
# Use list-tags to get all tags, then filter in Python for exact match
# This avoids issues where "tags:latest" matches both "latest" and "latest-builder"
If the issue is you want an exact match, use tags={tag} instead of tags:{tag} on (the old) line 41. At least, that's my reading of gcloud topic filters.
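For illustration, a sketch of that suggestion against a hypothetical list-tags call (the image name is made up; per the reading of gcloud topic filters above, `=` asks for an exact match where `:` also matches tags like "latest-builder"):

```python
import subprocess

IMAGE = "gcr.io/my-project/my-service"  # hypothetical image path

# "tags=latest" requests an exact tag match, unlike "tags:latest".
result = subprocess.run(
    ["gcloud", "container", "images", "list-tags", IMAGE,
     "--filter=tags=latest",
     "--format=value(digest)"],
    capture_output=True, text=True, check=True,
)
digest = result.stdout.strip()
```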
In any case, it seems like this is fixing an issue that is unrelated to what this PR is trying to accomplish? Seems like the updates to build_image.py should be their own PR, no?
This was all work that was done during the hackathon. We will have a number of similar changesets (mostly in webapp) that may not seem totally related. I'd prefer to avoid putting together dozens of PRs and instead group changes together in a reasonable way.
I like having a PR be one change, instead of a few random changes that just happen to be in the same file. It makes it easier to review, easier to debug, and easier to roll back in case of problems. I don't think the overhead of creating a PR is that high?
But if you want to do things this way, then I will ask you, as a reviewer, to document every change you make with what logical change it goes with, so I can understand the PR better.
(You can do that in the commit message if you want -- it's probably the best way to go about it -- but you should be extra-clear what code changes go with what logical change.)
    text=True,
)

# Extract build ID from async output (table format)
I think you can do --format=json to get json output instead of table output. That will be a lot easier, and more robust, to parse.
Though maybe --format=get(id) would be even better!
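For illustration, a sketch of that (the config path and source directory are hypothetical; the flag spelling follows the suggestion above):

```python
import subprocess

# Submit asynchronously and let gcloud print only the build ID, so no table
# parsing is needed. --format=json plus json.loads() would also work if more
# fields are wanted.
build_id = subprocess.run(
    ["gcloud", "builds", "submit", ".",
     "--config", "cloudbuild.yaml",
     "--async",
     "--format=get(id)"],
    capture_output=True, text=True, check=True,
).stdout.strip()
```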
# Build with BuildKit inline cache
# --cache-from pulls cache metadata from previous build
# --build-arg BUILDKIT_INLINE_CACHE=1 embeds cache metadata in the image
docker build \
Does this work if the --cache-from argument isn't found? Or do you still need the if like you had in the old code?
The current code should work fine without the conditional.

Current approach (works correctly):
- Line 13: Try to pull the cache image with `|| echo` - if the pull fails, just print a message and continue (doesn't exit)
- Line 30: Always use `--cache-from` in the build command

Docker/BuildKit behavior:
- `--cache-from` is graceful - if the referenced image doesn't exist locally, Docker just ignores it and builds without cache
- No error is thrown; the build continues normally
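For illustration, a minimal Python sketch of that pull-then-build flow (image names are hypothetical; the module itself does this in shell, and this only mirrors the behavior described above):

```python
import os
import subprocess

CACHE_IMAGE = "gcr.io/my-project/my-service:latest"  # hypothetical cache source

env = dict(os.environ, DOCKER_BUILDKIT="1")  # enable BuildKit

# Best-effort pull: a missing cache image just prints a message and continues,
# mirroring the `docker pull ... || echo ...` pattern described above.
if subprocess.run(["docker", "pull", CACHE_IMAGE], env=env).returncode != 0:
    print(f"No cache image at {CACHE_IMAGE}; building without cache")

# Always pass --cache-from: if the referenced image isn't available, the build
# proceeds without reusing cache rather than failing.
subprocess.run(
    ["docker", "build",
     "--build-arg", "BUILDKIT_INLINE_CACHE=1",
     "--cache-from", CACHE_IMAGE,
     "-t", CACHE_IMAGE,
     "."],
    env=env,
    check=True,
)
```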
Summary:

Enhanced the `cloud-build-docker` module to support BuildKit-based multi-stage Docker caching and async build submission to avoid API rate limits when building many images in parallel. These improvements were driven by the webapp Cloud Run migration, where we build 36+ services concurrently and encountered rate limiting with the previous synchronous approach.

Key improvements:

1. BuildKit Multi-Stage Caching - Enabled BuildKit with inline cache metadata to properly cache intermediate build stages (like dependency installation in builder stages). Standard Docker `--cache-from` only caches the final stage.
2. Async Build Submission with Polling - Changed from synchronous `gcloud builds submit --wait` to async submission with exponential backoff polling. This eliminates API rate limit errors when running 10+ parallel Terraform applies and reduces API calls by ~80% (from ~100-200 per build to ~10-20 per build).
3. Exact Tag Matching - Fixed a bug where `--filter=tags:latest` would match both `latest` and `latest-builder`, causing wrong cache images to be used. Now fetches all tags as JSON and filters in Python for exact matches.
4. Better Error Handling - Added detailed error messages with build status updates, direct links to Cloud Console, commands to view logs, and elapsed time tracking (see the sketch after this summary).
5. Configurable Region Support - Added a `region` variable (defaults to `us-central1`) to allow colocation with deployment targets and leverage independent regional quotas for higher concurrent build capacity.

Performance impact:
All changes are backward compatible with sensible defaults.
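As a rough illustration of the error-handling item above (not the module's actual code; the Cloud Console URL format shown here is an assumption):

```python
import time


def format_build_failure(build_id, project, region, status, started_at):
    """Build a failure message with status, console link, log command, and elapsed time."""
    elapsed = int(time.time() - started_at)
    console_url = (
        f"https://console.cloud.google.com/cloud-build/builds;region={region}/"
        f"{build_id}?project={project}"
    )
    return (
        f"Build {build_id} finished with status {status} after {elapsed}s.\n"
        f"  Console: {console_url}\n"
        f"  Logs:    gcloud builds log {build_id} --region={region} --project={project}"
    )
```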
Issue: N/A
Test plan: