To be fair, this failed in the non-Rust path too, because bot management returned that all traffic was a bot. But yes, FL2 needs to catch panics from individual components, though I'm not sure failing open is necessarily that much better (it was in this case, but the next incident could easily be the result of failing open).
But more generally you could catch the panic at the FL2 layer to make that decision intentional - missing logic at that layer IMHO.
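A minimal sketch of what that missing logic might look like, assuming `catch_unwind` at the layer boundary; `run_layer`, `Verdict`, and `FailurePolicy` are made-up names for illustration, not FL2's actual internals:

```rust
use std::panic::{self, AssertUnwindSafe};

/// Per-layer choice for how to degrade when the layer panics.
enum FailurePolicy {
    Open,   // let the request proceed without this layer's verdict
    Closed, // reject the request
}

/// Illustrative verdict type; not FL2's real API.
#[derive(Debug)]
enum Verdict {
    Allow,
    Deny,
}

/// Run one layer, converting any panic into that layer's declared policy,
/// so the fail-open/fail-closed decision is explicit rather than an
/// accident of wherever the panic escapes.
fn run_layer<F>(layer: F, policy: FailurePolicy) -> Verdict
where
    F: FnOnce() -> Verdict,
{
    match panic::catch_unwind(AssertUnwindSafe(layer)) {
        Ok(verdict) => verdict,
        Err(_) => match policy {
            FailurePolicy::Open => Verdict::Allow,
            FailurePolicy::Closed => Verdict::Deny,
        },
    }
}

fn main() {
    // A layer that panics (e.g. an `unwrap` on a bad feature file) no
    // longer takes the whole request down with it.
    let verdict = run_layer(|| panic!("feature file too large"), FailurePolicy::Open);
    println!("{verdict:?}");
}
```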
Catching the panic probably isn't a great idea if there's any unsafe code in the system. (Do the unsafe blocks really maintain heap invariants across panics?)
Unsafe blocks have nothing to do with it. Yes, they maintain all the same invariants as safe blocks across panics, or those unsafe blocks are unsound regardless of panics. But there are millions of ways to architect this (e.g. a supervisor process that notices which layer in FL2 is crashing and completely disables that layer when it starts the proxy back up; a rough sketch below). There are challenges here, because then you have to figure out what counts as perma-crashing (e.g. what if it's only crashing for 20% of all sites? Do you disable?). And in the general case you have the fail-open/fail-closed decision anyway, which you should just annotate individual layers with.
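A minimal sketch of that supervisor idea, assuming the proxy records which layer was executing when it crashed somewhere the supervisor can read; the marker file, binary path, crash threshold, and `--disable-layer` flag are all invented for illustration:

```rust
use std::collections::HashMap;
use std::process::Command;
use std::time::Duration;

// Hypothetical contract: the proxy writes the name of the layer that was
// executing when it crashed to this file before aborting.
const CRASH_MARKER: &str = "/var/run/fl2/last_crashing_layer";

fn main() {
    let mut crash_counts: HashMap<String, u32> = HashMap::new();
    let mut disabled: Vec<String> = Vec::new();

    loop {
        let mut cmd = Command::new("/usr/bin/fl2-proxy"); // illustrative path
        for layer in &disabled {
            cmd.arg("--disable-layer").arg(layer); // assumed flag
        }
        let status = cmd.status().expect("failed to spawn proxy");
        if status.success() {
            break; // clean shutdown, nothing left to supervise
        }
        // The proxy crashed: decide whether one layer is perma-crashing.
        if let Ok(layer) = std::fs::read_to_string(CRASH_MARKER) {
            let layer = layer.trim().to_string();
            let n = crash_counts.entry(layer.clone()).or_insert(0);
            *n += 1;
            if *n >= 3 && !disabled.contains(&layer) {
                eprintln!("layer {layer} crashed {n} times; disabling on restart");
                disabled.push(layer);
            }
        }
        std::thread::sleep(Duration::from_secs(1)); // crude restart backoff
    }
}
```

This also catches the failures `catch_unwind` can't, like aborts and segfaults in native code.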
But the bigger change is to make sure that config changes roll out gradually instead of all at once. That's the source of 99% of all widespread outages.
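For the gradual-rollout part, one common technique is to gate each config version on a deterministic hash, so a fixed cohort of machines picks up the new version first and a bad config only ever hits that cohort before rollback. Everything here (names, the percentage, the ramp) is illustrative:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Decide whether this machine should pick up the new config version yet.
/// `rollout_pct` ramps from, say, 0.1 to 100 over hours, with automated
/// rollback if error rates rise at any step.
fn takes_new_config(machine_id: &str, config_version: u64, rollout_pct: f64) -> bool {
    let mut h = DefaultHasher::new();
    (machine_id, config_version).hash(&mut h);
    // Deterministic per (machine, version): the same machines stay in the
    // early cohort for a given version, so a bad config hits them first.
    (h.finish() % 10_000) as f64 / 100.0 < rollout_pct
}

fn main() {
    // Example: at 0.5% rollout, roughly 1 in 200 machines loads v42.
    let eligible = takes_new_config("metal-1234", 42, 0.5);
    println!("load v42 now? {eligible}");
}
```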
I think the parent is implying that the panic should be "caught" via a supervisor process, Erlang-style, rather than implying the literal use of `catch_unwind` to resume within the same process.
A supervisor is the brutalist way. But `catch_unwind` may be needed for perf and other reasons.
But ultimately it's not the panic that's the problem; it's the failure to specify how panics within FL2 layers should be handled. Each layer is at least one team, and FL2's job is to provide a playground where everyone can safely coexist regardless of the misbehavior of any single component.
But as always, such failures are emblematic of multiple things going wrong at once. You probably want to end up using both: `catch_unwind` for the typical case, and the supervisor for the case where there's a segfault in some unsafe code you call or a native library you invoke.
I also mentioned the fundamental tension of whether you want to fail open or fail closed. Most layers should probably fail open; for some layers (e.g. auth) it's safer to fail closed.
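Concretely, that annotation could be as small as a declarative table each layer registers in; layer names here are made up, not FL2's real module list:

```rust
/// Fail-open vs fail-closed, declared per layer instead of decided
/// implicitly by wherever a panic happens to escape.
enum FailurePolicy {
    Open,
    Closed,
}

// Layer names are made up for illustration.
const LAYER_POLICIES: &[(&str, FailurePolicy)] = &[
    ("ddos_mitigation", FailurePolicy::Open), // degraded protection beats a global 5xx
    ("bot_management", FailurePolicy::Open),  // a wrong bot score is recoverable
    ("auth", FailurePolicy::Closed),          // letting traffic through IS the hazard
];

fn main() {
    for (name, policy) in LAYER_POLICIES {
        let p = match policy {
            FailurePolicy::Open => "fail open",
            FailurePolicy::Closed => "fail closed",
        };
        println!("{name}: {p}");
    }
}
```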
The `unwrap` should be replaced by code that creates enough alerting to make a P0 incident out of their canary deployment immediately (see the sketch below).
Or, even better, the bot code crashing should itself be generating alerts.
The canary deployment would then be automatically rolled back until the P0 incident was resolved.
All of this could probably have happened and been contained, at their scale, in less than a minute: they would likely have generated enough "omg the proxy cannot handle its config" alerts off a deployment to 0.001% of machines near-immediately.
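A sketch of the `unwrap` replacement, assuming the failure mode described in the postmortem (a feature file growing past a hard limit); `load_features`, `FeatureSet`, and `page_oncall` are hypothetical stand-ins, not Cloudflare's actual code:

```rust
#[derive(Debug)]
struct FeatureSet(Vec<String>);

impl FeatureSet {
    /// Fall back to the last config that was successfully applied.
    fn last_known_good() -> Self {
        FeatureSet(vec!["cached-v1".into()])
    }
}

fn load_features(raw: &str) -> Result<FeatureSet, String> {
    // Stand-in for the real parse; the outage case is the Err branch.
    if raw.lines().count() > 200 {
        return Err("feature file exceeds 200-entry limit".into());
    }
    Ok(FeatureSet(raw.lines().map(String::from).collect()))
}

fn page_oncall(msg: &str) {
    // Stand-in for whatever emits the P0 alert / increments the metric.
    eprintln!("ALERT[P0]: {msg}");
}

fn main() {
    let raw = "f1\nf2"; // pretend this came from the config pipeline
    let features = match load_features(raw) {
        Ok(f) => f,
        Err(e) => {
            // This is the "proxy cannot handle its config" signal that
            // should halt and roll back the canary immediately.
            page_oncall(&format!("feature file rejected: {e}"));
            FeatureSet::last_known_good() // fail open on the old config
        }
    };
    println!("running with {} features", features.0.len());
}
```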
Agreed - a big question is why the file wasn't test-driven in staging and progressively rolled out. And also what alerting was missing within FL2 that they couldn't pinpoint the `unwrap` instantly.