This article seems to indicate that manually triggered failovers will always fail if your application tries to maintain its normal write traffic during that process.
Not that I'm discounting the author's experience, but something doesn't quite add up:
- How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?
- If they know, how is this not an urgent P0 issue for AWS? It would mean one of the most basic usability features is completely broken.
- Is there something more nuanced to the failure case, such as a dependency on in-flight transactions? I can see how the failover might wait for in-flight transactions to close, hit a timeout, and then accidentally proceed with the rest of the failover anyway. That could explain why the issue doesn't seem more widespread.
> How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?
If it's anything like how Azure handles this kind of issue, it's likely "lots of people have experienced it, a restart fixes it so no one cares that much, few have any idea how to figure out a root cause on their own, and the process to find a root cause with the vendor is so painful that no one ever sees it through"
An experience not exclusive to cloud vendors :) Even better when the vendor throws their hands up because the issue is not reliably reproducible.
That was when I scripted a test that ran hundreds of times a day in a lower environment, attempting a repro. As they say, at scale even insignificant issues become significant. I don't remember exactly, but I think there was a 5-10% chance of the issue triggering on any given run.
At least confirming the fix, which we did eventually receive, was mostly a breeze. We had to provide an inordinate amount of captures, logs, and data to get there, though. It was a grueling few weeks, especially all the calls laden with office politics.
I've had customers live with load-related bugs for years simply because they'd reboot when the problem happened. When dealing with the F100, it seems there is a rather limited number of people in these organizations who can troubleshoot complex issues - that, or they're locked away out of sight.
It is a tough bargain to be fair, and it is seen in other places too. From developers copying out their stuff from their local git repo, recloning from remote, then pasting their stuff back, all the way to phone repair just meaning "here's a new device, we synced all your data across for you", it's fairly hard to argue with the economic factors and the effectiveness of this approach at play.
With all the enterprise solutions being distributed, loosely coupled, self-healing, redundant, and fault-tolerant, issues like this essentially just slot in perfectly. Compound that with man-hours (especially expert ones) being a lot harder to justify for any one particular bump in tail latency, and the economics just aren't there for chasing any of this.
What gets us specifically to look into things is either the issue being operationally gnarly (e.g. frequent, impacting, or both), or management being swayed enough by principled thinking (or at least pretending to be). I'd imagine it's the same elsewhere. The latter would mostly happen if fixing a given thing becomes an office-politics concern, or a corporate-reputation one. You might wonder if those individual issues ever snowballed into a big one, but it turns out human nature takes care of that just "sufficiently enough" before it would manifest "too severely". [0]
Otherwise, you're looking at fixing / RCA'ing / working around someone else's product defect on their behalf, and giving your engineers a "fun challenge". Fun doesn't pay the bills, and we rarely saw much in return from the vendor in exchange for our research. I'd love to entertain the idea that maybe behind closed doors the negotiations went a little better because of these, but for various reasons, I really doubt it in hindsight.
[0] as delightfully subjective as those get of course
Theoretically you're supposed to assign lower priority to issues with known workarounds, but then there should also be reporting for product management (which weights issues by age of first occurrence and total count of similar reports).
Amazon is mature enough for processes to reflect this, so my guess for why something like this could slip through is either too many new feature requests or many more critical issues to resolve.
I'm surprised this hasn't come up more often too. When we worked with AWS on this, they confirmed there was nothing unique about our traffic pattern that would trigger this issue. We also didn't run into this race condition in any of our other regions running similar workloads. What's particularly concerning is that this seems to be a fundamental flaw in Aurora's failover mechanism that could theoretically affect anyone doing manual failover.
If I'm reading this correctly, it sounds like the connection was already using autocommit by default? In that situation, if you initiate a transaction, and then it gets rolled back, you're back in autocommit unless/until you initiate another transaction.
If so, that part is all totally normal and expected. It's just that due to a bug in the Python client library (16 years ago), the rollback was happening silently because the error was not surfaced properly by the client library.
Is there any scenario in a sane world where a transaction ceases to be in scope just because it went into an error state? I'd have expected the client to send an explicit ROLLBACK when it realizes a transaction is in an error state, not for the server to end it and just notify the client. This is how psql presents it to the end user:
    postgres=# begin;
    BEGIN
    postgres=*# bork;
    ERROR:  syntax error at or near "bork"
    LINE 1: bork;
            ^
    postgres=!# select 1;
    ERROR:  current transaction is aborted, commands ignored until end of transaction block
    postgres=!# rollback;
    ROLLBACK
    postgres=# select 1;
     ?column?
    ----------
            1
    (1 row)

    postgres=#
Every DBMS handles errors slightly differently. In a sane world you shouldn't ever ignore errors from the database. It's unfortunate to hear that a Python MySQL client library had a bug that failed to expose errors properly in one specific situation 16 years ago, but that's not terribly relevant to today.
Postgres behavior with errors isn't even necessarily desirable -- in terms of ergonomics, why should every typo in an interactive session require me to start my transaction over from scratch?
No, the situation described upthread is about a deadlock error, not a typo. In MySQL, syntax errors throw a statement-level error but it does not affect the transaction state in any way. If you were in a transaction, you're still in a transaction after a typo.
With deadlocks in MySQL, the error you receive is "Deadlock found when trying to get lock; try restarting transaction" i.e. it specifically tells you the transaction needs to start over in that situation.
In programmatic contexts, transactions are typically represented by an object/struct type, and a correctly-implemented database driver for MySQL handles this properly and invalidates the transaction object as appropriate if the error dictates it. So this isn't really even a common foot-gun in practical terms.
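To make that concrete, here is a minimal sketch of the retry pattern that error message is asking for, assuming PyMySQL; the accounts table and transfer() helper are made up for illustration, not taken from the thread:

    # Sketch: restart a MySQL transaction when the server reports a deadlock (1213).
    # Assumes PyMySQL; the accounts table and transfer() helper are hypothetical.
    import pymysql

    def transfer(conn, src, dst, amount, retries=3):
        for attempt in range(retries):
            try:
                conn.begin()  # explicit transaction (suspends autocommit if enabled)
                with conn.cursor() as cur:
                    cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                                (amount, src))
                    cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                                (amount, dst))
                conn.commit()
                return
            except pymysql.MySQLError as exc:
                conn.rollback()  # server already aborted the transaction; reset client state too
                if exc.args and exc.args[0] == 1213 and attempt < retries - 1:
                    continue     # "try restarting transaction", exactly as the error says
                raise

Either way, the driver's only job is to surface the error; deciding whether to restart the transaction is the application's call.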
What do you mean? Autocommit mode is the default mode in Postgres and MS SQL Server as well. This is by no means a MySQL-specific behavior!
When you're in autocommit mode, BEGIN starts an explicit transaction, but after that transaction (either COMMIT or ROLLBACK), you return to autocommit mode.
The situation being described upthread is a case where a transaction was started, and then rolled back by the server due to deadlock error. So it's totally normal that you're back in autocommit mode after the rollback. Most DBMS handle this identically.
The bug described was entirely in the client library failing to surface the deadlock error. There's simply no autocommit-related bug as it was described.
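For anyone who hasn't internalized that lifecycle, here's a rough sketch of it from a client's point of view (my own illustration, assuming PyMySQL against any MySQL-compatible server; the audit_log table and connection details are placeholders):

    # Sketch of the autocommit lifecycle described above (placeholder table/credentials).
    import pymysql

    conn = pymysql.connect(host="db.example.internal", user="app", password="secret",
                           database="app", autocommit=True)

    with conn.cursor() as cur:
        cur.execute("INSERT INTO audit_log (msg) VALUES ('a')")   # autocommit: committed immediately

    conn.begin()                 # explicit transaction begins; autocommit is suspended
    with conn.cursor() as cur:
        cur.execute("INSERT INTO audit_log (msg) VALUES ('b')")
    conn.rollback()              # transaction ends (commit or rollback, either way)

    with conn.cursor() as cur:
        cur.execute("INSERT INTO audit_log (msg) VALUES ('c')")   # back in autocommit mode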
Lack of autocommit would be bad for performance at scale, since it would add latency to every single query. And the MVCC implications are non-trivial, especially for interactive queries (human taking their time typing) while using REPEATABLE READ isolation or stronger... every interactive query would effectively disrupt purge/vacuum until the user commits. And as the sibling comment noted, that would be quite harmful if the user completely forgets to commit, which is common.
In any case, that's a subjective opinion on database design, not a bug. Anyway it's fairly tangential to the client library bug described up-thread.
Autocommit mode is pretty handy for ad-hoc queries at least. You wouldn't want to have to remember to close the transaction since keeping a transaction open is often really bad for the DB
My experience with AWS is that they are extremely, extremely parsimonious about any information they give out. It is near-impossible to get them to give you any details about what is happening beyond the level of their API. So my gut hunch is that they think that there's something very rare about this happening, but they refuse to give the article writer the information that might or might not help them avoid the bug.
If you pay for the highest level of support you will get extremely good support. But it comes with signing an NDA, so you're not going to read about anything coming out of it on a blog.
I've had AWS engineers confirm very detailed and specific technical implementation details many, many times. But these were at companies that happily spent over $1M/year with AWS.
Nah, if your monthly spend is really significant then you will get good support, and issues you care about will get prioritized. Going from a startup with $50K/month spend to a large company spending untold millions per month, the experience is night and day. We have dev managers and engineers from key AWS teams present in meetings when needed, we get issues we raise prioritized and added to dev roadmaps, etc.
Yeah I agree, this seems like a pretty critical feature of the Aurora product itself. We saw similar behavior recently where we had a connection pooler in between, which suggests something is wrong with how they propagate DNS changes during the failover. wtf aws
Whenever we have to do any type of AWS Aurora or RDS cluster modification in prod we always have the entire emergency response crew standing by right outside the door.
Their docs are not good and things frequently don't behave how you expect them to.
The article is low quality. It does not mention which Aurora PostgreSQL version
was involved, and it provides no real detail about how the staging environment
differed from production, only saying that staging “didn’t reproduce the exact conditions,” which is not actionable.
In “Amazon Aurora PostgreSQL updates”, under Aurora PostgreSQL 17.5.3 (September 16, 2025), the critical stability enhancements include a potential match:
“...Fixed a race condition where an old writer instance may not step down after a new writer instance is promoted and continues to write…”
If that is the underlying issue, it would be serious, but without more specifics
we can’t draw conclusions.
For context: I do not work for AWS, but I do run several production systems on Aurora PostgreSQL. I will try to reproduce this using the latest versions over the next few hours. If I do not post an update within 24 hours, assume my tests did not surface anything.
That would not rule out a real issue in certain edge cases, configurations, or version combinations but it would at least suggest it is not broadly reproducible.
We're running Aurora PostgreSQL 15.12, which includes the fix mentioned in the release notes. Looking at this comment and the AWS documentation, I think there's an important distinction to make about what was actually fixed in Aurora PostgreSQL 15.12.4. Based on our experience and analysis, we believe AWS's fix primarily focused on data protection rather than eliminating the race condition itself.
Here's what we think is happening:
Before the fix (pre-15.12.4):
1. Failover starts
2. Both instances accept and process writes simultaneously
3. Failover eventually completes after the writer steps down
4. Result: Potential data consistency issues ???
After the fix (15.12.4+):
1. Failover starts
2. If the old writer doesn't demote before the new writer is promoted, the storage layer now detects this and rejects write requests
3. Both instances restart/crash
4. Failover fails or requires manual intervention
The underlying race condition between writer demotion and reader promotion still exists - AWS just added a safety mechanism at the storage layer to prevent the dangerous scenario of two writers operating simultaneously. They essentially converted a data inconsistency risk into an availability issue.
This would explain why we're still seeing failover failures on 15.12 - the race condition wasn't eliminated, just made safer.
The release note saying they "fixed a race condition where an old writer instance may not step down" is somewhat misleading - it's more accurate to say they "mitigated the consequences of the race condition" by having the storage layer reject writes when it detects the problematic state. That is probably why AWS Support did not point us to this release when we raised the issue.
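For what it's worth, one way to watch for this during a failover drill (my own sketch, not something from AWS or the article; endpoint names, credentials, and the 1s interval are placeholders) is to poll each instance endpoint and log which ones claim to be writable, e.g. via pg_is_in_recovery():

    # Sketch: poll both Aurora PostgreSQL instance endpoints during a failover test
    # and log which ones report themselves as the writer. Assumes psycopg2.
    import time
    import psycopg2

    ENDPOINTS = [
        "instance-1.abc123.us-east-1.rds.amazonaws.com",  # hypothetical
        "instance-2.abc123.us-east-1.rds.amazonaws.com",  # hypothetical
    ]

    def role(host):
        try:
            conn = psycopg2.connect(host=host, dbname="postgres", user="monitor",
                                    password="secret", connect_timeout=2)
            conn.autocommit = True
            with conn.cursor() as cur:
                cur.execute("SELECT pg_is_in_recovery()")
                in_recovery = cur.fetchone()[0]
            conn.close()
            return "reader" if in_recovery else "WRITER"
        except Exception as exc:
            return "unreachable: %s" % exc

    while True:
        print(time.strftime("%H:%M:%S"), [(h.split(".")[0], role(h)) for h in ENDPOINTS])
        time.sleep(1)

If two endpoints ever report WRITER at the same time, or neither does for long enough that clients give up, that would be the kind of window being described here.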
fwiw we haven't seen issues doing manual failovers for maintenance using the same or a similar procedure to the one described in the article. I imagine there is something more nuanced here, and it's hard to draw too many conclusions without a lot more detail from AWS.
It sounds like part of the problem was how the application reacted to the reverted failover. They had to restart their service to get writes accepted, implying some sort of broken caching behavior where it kept sending queries to the wrong primary.
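For illustration, a rough sketch of the client-side handling that avoids the restart (my own, assuming psycopg2 and the cluster writer endpoint; not a claim about how the article's application is built): tear down and rebuild the connection whenever the server turns out to be read-only, so the next attempt re-resolves DNS and can land on the new writer.

    # Sketch: retry a write through the cluster (writer) endpoint, reconnecting when
    # the cached connection points at a read-only instance. Assumes psycopg2;
    # endpoint, credentials, and SQL are placeholders. DNS re-resolution still
    # depends on the record's TTL being honored.
    import psycopg2
    from psycopg2 import errors

    WRITER_ENDPOINT = "mydb.cluster-abc123.us-east-1.rds.amazonaws.com"  # hypothetical

    def connect():
        return psycopg2.connect(host=WRITER_ENDPOINT, dbname="app", user="app",
                                password="secret", connect_timeout=2)

    def execute_write(sql, params, attempts=5):
        conn = connect()
        for _ in range(attempts):
            try:
                with conn, conn.cursor() as cur:   # commits on success, rolls back on error
                    cur.execute(sql, params)
                conn.close()
                return
            except (errors.ReadOnlySqlTransaction, psycopg2.OperationalError):
                conn.close()      # drop the stale connection...
                conn = connect()  # ...and reconnect, re-resolving the endpoint
        raise RuntimeError("writes still failing after retries")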
It's at least possible that this sort of aborted failover happens a fair amount, but if there's no downtime then users just try again and it succeeds, so they never bother complaining to AWS. Unless AWS is specifically monitoring for it, they might be blind to it happening.
It could be that most people pause writes, because executing a write against an instance that refuses to accept writes is going to create errors, and for some people those errors might not be recoverable. So they just have some option in their application that puts it into maintenance mode, where writes are hard-rejected at the application layer.
P0 if it happens to everyone, right? Like the USE1 outage recently. If it hits 0.001% of customers (still enough to get an HN story), it may not be ranked that high. Maybe this customer is on a migration or upgrade path under the hood. Or just on a bad unit in the rack.
Although the article has an SEO-optimized vibe, I think it's reasonable to take it as true until refuted. My rule of thumb is that any rarely executed, very tricky operation (e.g. database writer failover) is likely not to work, because there are too many variables in play and way too few opportunities to find and fix bugs. So the overall story sounds very plausible to me. It has the feel of: it doesn't work under continuous heavy write load, in combination with some set of hardware performance parameters that plays badly with some arbitrary timeout.
Note that the system didn't actually fail. It just didn't process the failover operation. It reverted to the original configuration and afaics preserved data.