
> Their bot management system is designed to push a configuration out to their entire network rapidly.

Once every 5m is not "rapidly". It isn't uncommon for configuration systems to do it every few seconds [0].

> While it’s certainly useful to examine the root cause in the code.

I believe the issue is as much that the output of a periodic run (a ClickHouse query), altered by a change that on the surface looked unrelated, caused this failure. That is, the system that validated the configuration (FL2) was different from the one that generated it (ML Bot Management DB).

Ideally, the system that vends a complex configuration also vends and tests the library that consumes it. Failing that, the system that consumes it should "taste" the configuration first before devouring it unconditionally [1].
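
To make "tasting" concrete, here is a rough Python sketch of a consumer that validates a candidate configuration off to the side and keeps serving the last known good one if validation fails. Everything in it is an assumption for illustration: the JSON shape, the names (BotConfig, ConfigHolder, taste, offer) and the MAX_FEATURES limit are hypothetical and not taken from Cloudflare's code.

```python
import json
from dataclasses import dataclass

MAX_FEATURES = 200  # hypothetical hard limit enforced by the consumer


class ConfigRejected(Exception):
    """Raised when a candidate config fails validation."""


@dataclass(frozen=True)
class BotConfig:
    features: tuple


def taste(raw: str) -> BotConfig:
    """Parse and validate a candidate config without touching live state."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ConfigRejected(f"not valid JSON: {exc}") from exc

    features = doc.get("features") if isinstance(doc, dict) else None
    if not isinstance(features, list) or not features:
        raise ConfigRejected("missing or empty 'features' list")
    if not all(isinstance(name, str) for name in features):
        raise ConfigRejected("feature names must be strings")
    if len(features) > MAX_FEATURES:
        raise ConfigRejected(f"{len(features)} features exceeds {MAX_FEATURES}")
    if len(set(features)) != len(features):
        raise ConfigRejected("duplicate feature names")
    return BotConfig(tuple(features))


class ConfigHolder:
    """Holds the active config; swaps it only after a candidate passes taste()."""

    def __init__(self, initial: BotConfig):
        self.active = initial
        self.rejected = 0  # count of refused candidates, worth alerting on

    def offer(self, raw: str) -> bool:
        try:
            candidate = taste(raw)
        except ConfigRejected:
            self.rejected += 1
            return False  # keep serving the last known good config
        self.active = candidate
        return True
```

The point of the design is that a bad file degrades into "stale config plus an alert" rather than a crash in the request path. Calling holder.offer(raw) on every refresh is enough; a rising holder.rejected counter is the signal to page someone.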

Of course, as with all distributed system failures, this is all easier said than done, and much clearer in hindsight.

[0] Avoiding overload in distributed systems by putting the smaller service in control (pg 4), https://d1.awsstatic.com/builderslibrary/pdfs/Avoiding%20ove...

[1] Lessons from CloudFront (2016), https://youtube.com/watch?v=n8qQGLJeUYA&t=1050



>Once every 5m is not "rapidly".

Isn't rapidly more of how long it takes to get from A to Z rather than how often it is performed? You can push out a configuration update every fortnight but if it goes through all of your global servers in three seconds, I'd call it quite rapid.


By rapid I mean a rapid rollout of changes to 100% of the fleet, not how often changes are made.


Thanks for sharing that AWS doc



