-
Notifications
You must be signed in to change notification settings - Fork 649
[orchagent] Event-based Retry Strategy #3699
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
hi @a114j0y is there a design document for the changes implemented in this PR? |
Hi @prabhataravind could you help merge it? So we could include it in the 202511 release ! |
|
@a114j0y have you tested this with warm-boot? @vaibhavhd for viz. |
|
|
||
| for (auto &it : m_consumerMap) | ||
| { | ||
| count += retryToSync(it.first, threshold - count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment here.
orchagent/orch.cpp
Outdated
| return false; | ||
| } | ||
|
|
||
| size_t Orch::retryToSync(const std::string &executorName, size_t threshold) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add documentation for this function explaining what threshold is.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| return count; | ||
| } | ||
|
|
||
| void Orch::notifyRetry(Orch *retryOrch, const std::string &executorName, const Constraint &cst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@a114j0y who is calling this function for other constraints like RETRY_CST_NHG, RETRY_CST_NHG_REF, RETRY_CST_ECMP etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These codes would be in a separate NHG-related PR, which employs RetryCache for optimization. I can move these definitions to that pr.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Yijiao Qin <[email protected]>
Signed-off-by: Yijiao Qin <[email protected]>
Signed-off-by: Yijiao Qin <[email protected]>
Signed-off-by: Yijiao Qin <[email protected]>
Signed-off-by: Yijiao Qin <[email protected]>
Signed-off-by: Yijiao Qin <[email protected]>
Signed-off-by: Yijiao Qin <[email protected]>
Co-authored-by: Copilot <[email protected]> Signed-off-by: Yijiao Qin <[email protected]>
Signed-off-by: Yijiao Qin <[email protected]>
Signed-off-by: Yijiao Qin <[email protected]>
Signed-off-by: Yijiao Qin <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
What I did
I have implemented
RetryCacheallows agents to communicate in the push mode to schedule retry.
Document - RetryCache High Level Design (HLD PR#1822)
The document explains
Why I did it
Orchagent currently uses the pull mode to retry, it polls periodically to check whether the constraints are removed.
Our goal is to skip unnecessary retries, instead of pulling all, we allow some consumers to push the news.
Pushmode is better in some senarionhgorchpush the notification (next hop creation) torouteorch, instead ofrouteorchpulling periodically fromnhgorch.pullmode still good in most scenariospullis necessary.pushmode to notify constraint resolutions and kick-off corresponding retry.How I verified it
I injected invalid routes with inject_invalid_routes.py, then measure the gaps between two event loop iterations. The gap duration is linear with the number of invalid routes pending in the syncing cache.
However, after
retrycacheis introduced, those pending tasks are archived aside. The consumer no longer retries them at the end of an event loop. Then the gap gets decoupled from the number of pending tasks, which is O(1).