Potential Temporal workers activity mix up? #2248
Closed
algirdasci
started this conversation in
General
Replies: 1 comment 25 replies
Hey @algirdasci 👋🏻
Hey,
We have been using Temporal Cloud for our workflows for a couple of months now. Recently, we had downtime caused by a non-deterministic Temporal error:
We did not have any deployments or code changes in the hour before this downtime, so I'm 100% sure the issue was not caused by a new code deployment. What we did have, though, was a higher-than-usual volume of workflows dispatched (~2.5k) within a short period of time. We run one dedicated RoadRunner pod with 8 Temporal workers, so they were very busy during execution, but the resource usage history graph does not show anything abnormal. What is abnormal is that the non-deterministic error was caused by an activity execution that should not belong to that workflow:
The workflow updates the database based on an action ID, so in this case it should be 64517126. The ID is immutable, so it cannot be changed in code and is preserved throughout the workflow:
You can see that event ID 23 appeared with a completely different action ID of 64511416, which should not be in this workflow in the first place and belongs to a different workflow type, which also failed with the same pattern, by the way. Events 23 and 24 have almost identical timing, which makes me believe that there might be some kind of race condition / message ID collision when reporting activities to Temporal, especially under higher-than-usual load. I'm thinking of enabling the max_jobs / max_memory_limit options to reload workers periodically. But maybe you have other insight into what could have gone wrong here and how to prevent it from happening in the future?
Debug info:
RoadRunner version 2025.1.3
Laravel with roadrunner-php/laravel-bridge 6.30, PHP 8.3
.rr.base.yaml:
.rr.temporal.yaml
First logs after error started to occur:
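For reference, the periodic worker reload mentioned above could look roughly like the fragment below. This is only a sketch, not the actual .rr.temporal.yaml from this setup. It assumes the Temporal activity pool accepts the standard RoadRunner pool options, and that the max_memory_limit idea maps to the pool supervisor's max_worker_memory setting. Apart from the worker count of 8, every value is an illustrative placeholder.

```yaml
# Hypothetical fragment: values are illustrative, not taken from the real config.
temporal:
  activities:
    num_workers: 8            # matches the 8 workers described in the post
    max_jobs: 500             # assumed value: restart a worker after this many jobs
    supervisor:
      max_worker_memory: 256  # assumed value, in MB: restart a worker above this usage
```

Recycling workers this way limits how long any single PHP process lives, but it would only mask a race condition in activity reporting rather than explain it.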