Glad to know I’m not crazy.

theanomaly · 2025-11-14T20:02:58 1763150578

AWS Support initially pushed back and suggested it's because of high replication lag but they were looking at metrics that were more than 24 hours old. What kind of failure did you encounter? I really want to understand what edge case we triggered in their failover process - especially since we could not reproduce it in other regions.

robben1234 · 2025-11-15T00:39:59 1763167199

My cluster recently started to failover every few days whenever it experiences the load to trigger scale up from 1-2 to 20+ acu.

And then I also encountered errors just like op in my app layer about trying to execute a write query via read-only transaction.

The workaround so far is to invalidate connection on error. When app reconnects the cluster write endpoint correctly leads to current primary.