Bug Report: Encoded Semi-Sync Heartbeat Monitor Behavior Can Lead to Errant ERS #18885

### Overview of the Issue

The semi-sync monitor was added in https://github.com/vitessio/vitess/pull/17763. When there are no other writes happening, it generates its own heartbeat writes to try to unblock semi-sync. Semi-sync is blocked when the last write waits indefinitely for a semi-sync ACK because, at the time of that write on the primary, there were no replicas available to ACK it. This can be unblocked by performing another write once replicas are available to ACK again: the ACK of a later GTID means the replica also received the earlier ones, so a later ACK unblocks any previous writes still waiting for an ACK.
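
The ACK semantics described above can be illustrated with a small Go sketch. It is a simplified model, not Vitess or MySQL code: `pendingWrites`, `commit`, and `ack` are hypothetical names, and plain increasing sequence numbers stand in for GTIDs.

```go
package main

import "fmt"

// pendingWrites models commits on the primary that are still waiting for a
// semi-sync ACK. Sequence numbers are a stand-in for GTIDs (assumption).
type pendingWrites struct {
	waiting []int64 // sequence numbers in commit order
}

// commit records a write that now waits for an ACK.
func (p *pendingWrites) commit(seq int64) {
	p.waiting = append(p.waiting, seq)
}

// ack releases every waiting write whose sequence number is <= acked,
// because a replica that received the later write also received the earlier ones.
func (p *pendingWrites) ack(acked int64) []int64 {
	var released, still []int64
	for _, seq := range p.waiting {
		if seq <= acked {
			released = append(released, seq)
		} else {
			still = append(still, seq)
		}
	}
	p.waiting = still
	return released
}

func main() {
	p := &pendingWrites{}
	p.commit(1) // user write made while no replica could ACK
	p.commit(2) // heartbeat write made after a replica is available again
	released := p.ack(2)
	fmt.Println("released:", released, "still waiting:", len(p.waiting))
}
```

In this model a single ACK for write 2 releases both writes, which is why a heartbeat write is enough to unblock an earlier stuck write.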

The initial implementation has two major issues, however. Both cause problems in write-heavy, heavily loaded replica sets, where the system as a whole can become overloaded (TCP retries, packet loss, etc.) and the resulting delays make the timing of events atypical:
  1. When [determining if semi-sync is blocked](https://github.com/vitessio/vitess/blob/d671bafcf0533c3895d90ee35a4dfd1f259aed10/go/vt/vttablet/tabletmanager/semisyncmonitor/monitor.go#L40), it only [looks to see if there are 1 or more connections/sessions waiting for a semi-sync ACK](https://github.com/vitessio/vitess/blob/d671bafcf0533c3895d90ee35a4dfd1f259aed10/go/vt/vttablet/tabletmanager/semisyncmonitor/monitor.go#L199-L225). This is wrong. It's not unusual to be waiting for an ACK for a brief period of time when the replicas are heavily loaded and lagging. Waiting != hung/blocked. It only means we're waiting 🙂. We need to check whether successful ACKs are still happening: if they are, then even though 1 or more sessions may have been (or still are) waiting for an ACK, we ARE making progress and thus we are NOT blocked.
  2. When we believe that semi-sync is blocked, we [start sending writes](https://github.com/vitessio/vitess/blob/d671bafcf0533c3895d90ee35a4dfd1f259aed10/go/vt/vttablet/tabletmanager/semisyncmonitor/monitor.go#L291-L311) (up to 15, each one in its own connection/session, each one [doing an INSERT into the `_vt.semisync_heartbeat` table](https://github.com/vitessio/vitess/blob/d671bafcf0533c3895d90ee35a4dfd1f259aed10/go/vt/vttablet/tabletmanager/semisyncmonitor/monitor.go#L344-L370)) until one of them goes through. During this time we never re-check to see if semi-sync is unblocked. The monitor assumes that no writes outside of the monitor (from the user/app) made it through during this time. And because the system is likely already overloaded, these new writes can themselves be delayed or blocked, and cause further issues.
      - The monitor can leave itself unable to function and cause the errant ERS in this way:
           - The primary has an unreliable network connection to the replicas for a period, which causes the issue noted in point 1.
           - The first INSERT the monitor does blocks waiting for an ACK it won't receive.
           - The ["cleanup" ticker](https://github.com/vitessio/vitess/blob/d671bafcf0533c3895d90ee35a4dfd1f259aed10/go/vt/vttablet/tabletmanager/semisyncmonitor/monitor.go#L151), which keeps the size of the `_vt.semisync_heartbeat` table minimal, fires ([every 24 hours](https://github.com/vitessio/vitess/blob/d671bafcf0533c3895d90ee35a4dfd1f259aed10/go/vt/vttablet/tabletmanager/semisyncmonitor/monitor.go#L44)) and the TRUNCATE is executed. The TRUNCATE tries to get a table-level lock and blocks behind the INSERT (so we never even attempt to commit and request an ACK).
           - All subsequent INSERT statements that the monitor runs to try and unblock things then block behind the TRUNCATE. Remember, the monitor is no longer checking to see if semi-sync was otherwise unblocked outside of the monitor, and it will not check again until [`--semi-sync-monitor-interval`](https://vitess.io/docs/23.0/reference/programs/vttablet/) ([defaults to 10s](https://github.com/vitessio/vitess/blob/d671bafcf0533c3895d90ee35a4dfd1f259aed10/go/vt/vttablet/tabletmanager/semisyncmonitor/monitor.go#L100)) is hit again.
           - We then hit the [15 writers limit](https://github.com/vitessio/vitess/blob/d671bafcf0533c3895d90ee35a4dfd1f259aed10/go/vt/vttablet/tabletmanager/semisyncmonitor/monitor.go#L43) and the `FullStatus` `vttablet` [RPC response to `vtorc` indicates that semi-sync is stuck](https://github.com/vitessio/vitess/blob/d671bafcf0533c3895d90ee35a4dfd1f259aed10/go/vt/vttablet/tabletmanager/rpc_replication.go#L190) and that it should perform an ERS to unblock the shard.
           - In the end, we tell `vtorc` that semi-sync is fully blocked and it needs to do an ERS, when in reality everything is fine (just experiencing a reasonable amount of delay in the system). The behavior in this scenario is worse than if we had no monitor at all.
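
To make the "waiting != blocked" distinction from point 1 concrete, here is a minimal Go sketch of a progress-based check. It is not the monitor's actual code: `sample` and `isBlocked` are hypothetical names, and it assumes the two counters could be read from MySQL's `Rpl_semi_sync_source_wait_sessions` and `Rpl_semi_sync_source_yes_tx` status variables (the `master` variants on older versions), sampled at two points in time.

```go
package main

import "fmt"

// sample is one reading of the two counters (hypothetical type).
// waitSessions: sessions currently waiting for a semi-sync ACK.
// ackedTx: cumulative count of commits that were ACKed by a replica.
type sample struct {
	waitSessions uint64
	ackedTx      uint64
}

// isBlocked reports true only when sessions were waiting at both samples
// AND no commit was ACKed in between. Any change in ackedTx (progress, or
// a counter reset) is treated as "not blocked".
func isBlocked(prev, cur sample) bool {
	if prev.waitSessions == 0 || cur.waitSessions == 0 {
		return false // nobody waiting at one of the samples
	}
	return cur.ackedTx == prev.ackedTx
}

func main() {
	prev := sample{waitSessions: 3, ackedTx: 100}

	// Loaded but healthy: sessions are waiting, yet ACKs keep arriving.
	fmt.Println(isBlocked(prev, sample{waitSessions: 2, ackedTx: 140}))

	// Stuck: sessions are waiting and no ACK arrived between samples.
	fmt.Println(isBlocked(prev, sample{waitSessions: 3, ackedTx: 100}))
}
```

Requiring waiters at both samples is a deliberately conservative choice in this sketch, so a session that only just started waiting does not count as blocked.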

So today the monitor is ideal for replica sets with very low write rates, but potentially harmful for replica sets with high write rates. 
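
For the second issue, one possible direction is to re-check the blocked state before starting each additional writer. The Go sketch below is only an illustration under stated assumptions: `unblockWithRecheck`, `stillBlocked`, and `startWriter` are hypothetical names, the limit of 15 is taken from the description above, and the loop is sequential for simplicity even though the real monitor uses a separate connection per writer.

```go
package main

import "fmt"

// maxWriters mirrors the 15 writers limit mentioned in this issue.
const maxWriters = 15

// unblockWithRecheck starts up to limit writers, but re-checks the blocked
// state before each one so that progress made outside the monitor (for
// example by user/app writes) stops the loop early. It returns how many
// writers were started and whether semi-sync still looks stuck at the end.
func unblockWithRecheck(limit int, stillBlocked func() bool, startWriter func()) (started int, stuck bool) {
	for started < limit {
		if !stillBlocked() {
			return started, false // unblocked elsewhere; do not report stuck
		}
		startWriter()
		started++
	}
	// Limit reached: report stuck only if a final check still says blocked.
	return started, stillBlocked()
}

func main() {
	checks := 0
	// Simulated check: blocked for the first three checks, then unblocked.
	blocked := func() bool {
		checks++
		return checks <= 3
	}
	started, stuck := unblockWithRecheck(maxWriters, blocked, func() {})
	fmt.Println("writers started:", started, "stuck:", stuck)
}
```

With this shape, reaching the writer limit alone would not be enough to tell `vtorc` that semi-sync is stuck; the final check would also have to agree.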

### Reproduction Steps

See the code walkthrough in the overview above.

### Binary Version

```sh
vtgate version Version: 24.0.0-SNAPSHOT (Git revision 3d2cdc546a6ac223866e0287782e8b3912efe2ca branch 'improve_semi-sync_monitor') built on Fri Nov  7 05:42:51 UTC 2025 by matt@pslord.local using go1.25.3 darwin/arm64
```

### Operating System and Environment details

```sh
N/A
```

### Log Fragments

```sh

```
