Fixing replica checkpointing divergence with explicit CommitAOF calls (#1305)
* repro for failure
* move test to appropriate file; implement RoleRead lock for recovery, fix replica commits (still have other paths to lock down)
* cleanup
* extend role prevention changes to all ops on StoreApi; refactor a bit so we can express this with a using; save reusable instance of the locking object, acquire in one place
* derp; restore checking the actual role -_-
* failing test for when ReplicaSyncTask dies - killing it the same way we've seen in practice, which is the replica throwing, causing the connection from the primary to be torn down
* restore ignored parameter
* this works, but I don't love the polling
* Revert "this works, but I don't love the polling"
This reverts commit 7a05d61.
* attempt to resume replication in response to a GOSSIP and the fact that no active replication task exists
* throttle re-establishment attempts; put behind a setting
* include new setting in .conf
* introduce notion of an upgradeable read lock for role reading, and hold it while resuming replication; prevents a TOCTOU issue with resuming replication from a replica
* downgrade logic was jacked, correct
* big ol' test for our particular setup (1:1 primary replica, matched settings, out of band commits); example of Replica losing all data if intervention for promotion not timely; allow Replica to come up AS a Replica even if its Primary is down
* wait for replica takeover to complete before proceeding in test
* add test for Replica comes back up as Replica, even if Primary unreachable
* fault on Primary side too, knock out that todo
* formatting
* on Replicas, wait until we can connect to a Primary to toss any loaded data; punch a config in to allow Replicas to load checkpoint & aof from disk; clean up some new tests for reliability
* update defaults.conf with new setting
* restore cancellation for test
* add tests for new configs
* address feedback; wrap this giant list of parameters in a record struct
* address feedback; do not block GOSSIP message while resyncing, move that work onto a background task
* formatting
* address feedback; default values in GarnetServerOptions
* kick up KeraLua so Windows Lua bits have CFG
* per call discussion; bump version
---------
Co-authored-by: Vasileios Zois <[email protected]>
libs/cluster/Server/ClusterManagerWorkerState.cs
18 additions & 4 deletions
@@ -3,6 +3,7 @@
 
 using System;
 using System.Collections.Generic;
+using System.Diagnostics;
 using System.Text;
 using System.Threading;
 using Garnet.common;
@@ -141,11 +142,17 @@ public ReadOnlySpan<byte> TryReset(bool soft, int expirySeconds = 60)
 /// Try to make this node a replica of node with nodeid
 /// </summary>
 /// <param name="nodeid"></param>
-/// <param name="force">Check if node is clean (i.e. is PRIMARY without any assigned nodes)</param>
+/// <param name="force">If false, checks if node is clean (i.e. is PRIMARY without any assigned nodes) before making changes.</param>
+/// <param name="upgradeLock">If true, allows for a <see cref="RecoveryStatus.ReadRole"/> read lock to be upgraded to <see cref="RecoveryStatus.ClusterReplicate"/>.</param>
 /// <param name="errorMessage">The ASCII encoded error response if the method returned <see langword="false"/>; otherwise <see langword="default"/></param>