The article is low quality. It does not say which Aurora PostgreSQL version was involved, and it gives no real detail about how the staging environment differed from production. It only says that staging “didn’t reproduce the exact conditions,” which is not actionable.
The AWS release notes (“Amazon Aurora PostgreSQL updates”) include a potential match under Aurora PostgreSQL 17.5.3, September 16, 2025, “Critical stability enhancements”:
“...Fixed a race condition where an old writer instance may not step down after a new writer instance is promoted and continues to write…”
If that is the underlying issue, it would be serious, but without more specifics
we can’t draw conclusions.
For context: I do not work for AWS, but I do run several production systems on Aurora PostgreSQL. I will try to reproduce this using the latest versions over the next few hours. If I do not post an update within 24 hours, assume my tests did not surface anything.
That would not rule out a real issue in certain edge cases, configurations, or version combinations, but it would at least suggest it is not broadly reproducible.
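For anyone who wants to attempt a similar reproduction, here is a minimal sketch, assuming boto3 and a disposable test cluster. It is not necessarily how the test above is being run. `current_writers` and `failover_and_watch` are hypothetical helper names; the boto3 calls used are `failover_db_cluster` and `describe_db_clusters`. It only records the control-plane view of who the writer is, so it would need to be paired with a write workload against both instances to say anything about a data-plane race.

```python
import time
from typing import List


def current_writers(cluster: dict) -> List[str]:
    """Return the instance IDs that the control plane reports as cluster writer.

    `cluster` is one entry of the "DBClusters" list from describe_db_clusters.
    """
    return [
        member["DBInstanceIdentifier"]
        for member in cluster["DBClusterMembers"]
        if member["IsClusterWriter"]
    ]


def failover_and_watch(cluster_id: str, polls: int = 60, interval: float = 5.0) -> List[List[str]]:
    """Trigger a failover, then poll and record the reported writer(s) over time."""
    import boto3  # imported lazily so current_writers works without AWS access

    rds = boto3.client("rds")
    rds.failover_db_cluster(DBClusterIdentifier=cluster_id)

    history = []
    for _ in range(polls):
        cluster = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)["DBClusters"][0]
        history.append(current_writers(cluster))
        time.sleep(interval)
    return history
```

Calling `failover_and_watch("my-test-cluster")` (a placeholder identifier) returns one list of writer IDs per poll; a failover that never settles on a single new writer would show up in that history.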
We're running Aurora PostgreSQL 15.12, which includes the fix mentioned in the release notes. Looking at this comment and the AWS documentation, I think there's an important distinction to make about what was actually fixed in Aurora PostgreSQL 15.12.4. Based on our experience and analysis, we believe AWS's fix primarily focused on data protection rather than eliminating the race condition itself.
Here's what we think is happening:
Before the fix (pre-15.12.4):
1. Failover starts
2. Both instances accept and process writes simultaneously
3. Failover eventually completes after the writer steps down
4. Result: potential data consistency issues (our guess, unconfirmed)
After the fix (15.12.4+):
1. Failover starts
2. If the old writer doesn't demote before the new writer is promoted, the storage layer now detects this and rejects write requests
3. Both instances restart/crash
4. Failover fails or requires manual intervention
The underlying race condition between writer demotion and reader promotion still exists - AWS just added a safety mechanism at the storage layer to prevent the dangerous scenario of two writers operating simultaneously. They essentially converted a data inconsistency risk into an availability issue.
This would explain why we're still seeing failover failures on 15.12 - the race condition wasn't eliminated, just made safer.
The wording in the release notes, "fixed a race condition where an old writer instance may not step down", is somewhat misleading. It would be more accurate to say they "mitigated the consequences of the race condition" by having the storage layer reject writes when it detects the problematic state. That is probably also why AWS Support did not point us to this release when we raised the issue.
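To make the before/after distinction concrete, here is a toy simulation of the idea, a sketch under our own assumptions and not AWS's actual implementation. It assumes the storage layer tracks a writer "epoch" and rejects writes that carry a stale one. `Storage`, `Instance`, `FencedWriteError`, and the epoch scheme are all hypothetical names for illustration. It models only the write rejection, not the restarts that follow.

```python
from dataclasses import dataclass, field


class FencedWriteError(Exception):
    """Raised when a write carries a stale writer epoch."""


@dataclass
class Storage:
    # Hypothetical shared storage layer; `epoch` identifies the current writer.
    fencing: bool = True
    epoch: int = 0
    log: list = field(default_factory=list)

    def promote(self) -> int:
        """Issue a new epoch to a newly promoted writer."""
        self.epoch += 1
        return self.epoch

    def write(self, writer_epoch: int, record: str) -> None:
        # With fencing on, only the most recently promoted writer may write.
        if self.fencing and writer_epoch != self.epoch:
            raise FencedWriteError(f"stale epoch {writer_epoch}, current is {self.epoch}")
        self.log.append(record)


@dataclass
class Instance:
    name: str
    storage: Storage
    epoch: int = -1  # -1 means "never promoted"

    def write(self, record: str) -> None:
        self.storage.write(self.epoch, f"{self.name}:{record}")


def simulate(fencing: bool):
    """Promote a new writer while the old one has not stepped down."""
    storage = Storage(fencing=fencing)
    old = Instance("old-writer", storage)
    old.epoch = storage.promote()
    old.write("a")

    new = Instance("new-writer", storage)
    new.epoch = storage.promote()  # the race: old writer still thinks it is the writer
    new.write("b")

    try:
        old.write("c")
        fenced = False
    except FencedWriteError:
        fenced = True
    return storage.log, fenced
```

With `simulate(fencing=False)` the old writer's late write lands in the log alongside the new writer's, which is the consistency risk. With `simulate(fencing=True)` that write is rejected, which protects the data but leaves the old writer in a failed state: the availability problem we are describing.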
For reference, this is the AWS documentation section: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraPostgreSQ...