-
Notifications
You must be signed in to change notification settings - Fork 180
Defer updating workflow completion metrics until completion accepted by server #2742
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| if (context.isCancelRequested()) { | ||
| workflowStateMachines.cancelWorkflow(); | ||
| metricsScope.counter(MetricsType.WORKFLOW_CANCELED_COUNTER).inc(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These completion counters are now tracked inside workflow state machines object
temporal-sdk/src/test/java/io/temporal/client/functional/MetricsTest.java
Outdated
Show resolved
Hide resolved
|
@maciejdudko as co-maintainer may want to review as well |
maciejdudko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests could be more split up but it's not blocking. LGTM
| assertEquals( | ||
| 2, counts.get(io.temporal.worker.MetricsType.WORKFLOW_COMPLETED_COUNTER).intValue()); | ||
| assertEquals( | ||
| 1, counts.get(io.temporal.worker.MetricsType.WORKFLOW_FAILED_COUNTER).intValue()); | ||
| assertEquals( | ||
| 1, | ||
| counts | ||
| .get(io.temporal.worker.MetricsType.WORKFLOW_CONTINUE_AS_NEW_COUNTER) | ||
| .intValue()); | ||
| assertEquals( | ||
| 1, counts.get(io.temporal.worker.MetricsType.WORKFLOW_CANCELED_COUNTER).intValue()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: splitting each workflow run into separate testcase would more strongly verify that the right counter was incremented in each scenario. Could also add checking presence of WORKFLOW_E2E_LATENCY.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I split the tests. I wasn't going to and was going to add e2e check in here, but it turns out we calculate e2e as current - start event time, and the latter is extremely skewed with time skipping causing negative values (which Tally ignores), so it was easier to check in separate tests.
What was changed
Defer updating workflow completion metrics until completion accepted by server. This is done by capturing the metrics to update in the task handler result and then providing a post-completion runnable to be invoked if the gRPC method is a success.
This technically changes behavior to no longer record metrics in situations like unhandled commands, which is a more accurate approach.
Checklist