kgo: avoid updating metadata before issuing ListOffsets / OffsetForLeaderEpoch #1185
+23
−75
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The metadata update was added originally with the thinking that we needed the most up to date metadata so that we could issue these requests correctly.
I think automatic request sharding came later? Not sure, but anyway, ListOffsets and OffsetForLeaderEpoch don't use the metadata that that triggerUpdateMetadata updates anyway: triggerUpdateMetadata updates the main metadata used for producing & consuming, whereas all admin requests use a cached metadata. The cache already evicts topics or partitions that have errors.
Point is, the metadata update in listOrEpoch, at this point, existed only to have a bit of a pause to collapse many listOrEpoch calls into one chunk of running logic.
Removing the whole metadata updating makes it easier to reason about (do we need to update metadata now, or later?), and we can keep the same listOrEpoch coalescing behavior by waiting 5s on retries and merging ever listOrEpoch that fires during those 5s.
It's possible that users may see 5s pauses in some scenarios where they previously would not: a retry may pause things while other non-retry requests are trying to come in. In general we can point to retry collapsing being better anyway, but if the need arises, we can separate the wait to ONLY happen on requests that are retrying, and to ALWAYS allow non-retrying requests to go through ASAP with no merging.