To be fair, this failed in the non-Rust path too, because bot management returned that all traffic was a bot. But yes, FL2 needs to catch panics from individual components, though I'm not sure failing open is necessarily that much better (it was in this case, but the next incident could easily be the result of failing open).
But more generally you could catch the panic at the FL2 layer to make that decision intentional - missing logic at that layer IMHO.
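A minimal sketch of what that missing logic might look like, assuming `catch_unwind` at the layer boundary; `run_layer`, `Verdict`, and `FailurePolicy` are made-up names for illustration, not FL2's actual internals:

```rust
use std::panic::{self, AssertUnwindSafe};

/// Per-layer choice for how to degrade when the layer panics.
enum FailurePolicy {
    Open,   // let the request proceed without this layer's verdict
    Closed, // reject the request
}

/// Illustrative verdict type; not FL2's real API.
#[derive(Debug)]
enum Verdict {
    Allow,
    Deny,
}

/// Run one layer, converting any panic into that layer's declared policy,
/// so the fail-open/fail-closed decision is explicit rather than an
/// accident of wherever the panic escapes.
fn run_layer<F>(layer: F, policy: FailurePolicy) -> Verdict
where
    F: FnOnce() -> Verdict,
{
    match panic::catch_unwind(AssertUnwindSafe(layer)) {
        Ok(verdict) => verdict,
        Err(_) => match policy {
            FailurePolicy::Open => Verdict::Allow,
            FailurePolicy::Closed => Verdict::Deny,
        },
    }
}

fn main() {
    // A layer that panics (e.g. an `unwrap` on a bad feature file) no
    // longer takes the whole request down with it.
    let verdict = run_layer(|| panic!("feature file too large"), FailurePolicy::Open);
    println!("{verdict:?}");
}
```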
Catching the panic probably isn't a great idea if there's any unsafe code in the system. (Do the unsafe blocks really maintain heap invariants across panics?)
Unsafe blocks have nothing to do with it. Yes, they maintain all the same invariants as safe blocks across panics, or those unsafe blocks are unsound regardless of panics. But there are millions of ways to architect this (e.g. a supervisor process that notices which layer in FL2 is crashing and completely disables that layer when it starts the proxy back up; a rough sketch below). There are challenges here, because then you have to figure out what counts as perma-crashing (e.g. what if it's only crashing for 20% of all sites? Do you disable?). And in the general case you have the fail-open/fail-closed decision anyway, which you should just annotate individual layers with.
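A minimal sketch of that supervisor idea, assuming the proxy records which layer was executing when it crashed somewhere the supervisor can read; the marker file, binary path, crash threshold, and `--disable-layer` flag are all invented for illustration:

```rust
use std::collections::HashMap;
use std::process::Command;
use std::time::Duration;

// Hypothetical contract: the proxy writes the name of the layer that was
// executing when it crashed to this file before aborting.
const CRASH_MARKER: &str = "/var/run/fl2/last_crashing_layer";

fn main() {
    let mut crash_counts: HashMap<String, u32> = HashMap::new();
    let mut disabled: Vec<String> = Vec::new();

    loop {
        let mut cmd = Command::new("/usr/bin/fl2-proxy"); // illustrative path
        for layer in &disabled {
            cmd.arg("--disable-layer").arg(layer); // assumed flag
        }
        let status = cmd.status().expect("failed to spawn proxy");
        if status.success() {
            break; // clean shutdown, nothing left to supervise
        }
        // The proxy crashed: decide whether one layer is perma-crashing.
        if let Ok(layer) = std::fs::read_to_string(CRASH_MARKER) {
            let layer = layer.trim().to_string();
            let n = crash_counts.entry(layer.clone()).or_insert(0);
            *n += 1;
            if *n >= 3 && !disabled.contains(&layer) {
                eprintln!("layer {layer} crashed {n} times; disabling on restart");
                disabled.push(layer);
            }
        }
        std::thread::sleep(Duration::from_secs(1)); // crude restart backoff
    }
}
```

This also catches the failures `catch_unwind` can't, like aborts and segfaults in native code.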
But the bigger change is to make sure that config changes roll out gradually instead of all at once. That's the source of 99% of all widespread outages.
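For the gradual-rollout part, one common technique is to gate each config version on a deterministic hash, so a fixed cohort of machines picks up the new version first and a bad config only ever hits that cohort before rollback. Everything here (names, the percentage, the ramp) is illustrative:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Decide whether this machine should pick up the new config version yet.
/// `rollout_pct` ramps from, say, 0.1 to 100 over hours, with automated
/// rollback if error rates rise at any step.
fn takes_new_config(machine_id: &str, config_version: u64, rollout_pct: f64) -> bool {
    let mut h = DefaultHasher::new();
    (machine_id, config_version).hash(&mut h);
    // Deterministic per (machine, version): the same machines stay in the
    // early cohort for a given version, so a bad config hits them first.
    (h.finish() % 10_000) as f64 / 100.0 < rollout_pct
}

fn main() {
    // Example: at 0.5% rollout, roughly 1 in 200 machines loads v42.
    let eligible = takes_new_config("metal-1234", 42, 0.5);
    println!("load v42 now? {eligible}");
}
```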
I think the parent is implying that the panic should be "caught" via a supervisor process, Erlang-style, rather than implying the literal use of `catch_unwind` to resume within the same process.
A supervisor is the brutalist way. But `catch_unwind` may be needed for perf and other reasons.
But ultimately it's not the panic that's the problem; it's the failure to specify how panics within FL2 layers should be handled. Each layer is at least one team, and FL2's job is to provide a playground where everyone can safely coexist regardless of the misbehavior of any single component.
But as always, such failures are emblematic of multiple things going wrong at once. You probably want to end up using both: `catch_unwind` for the typical case, and the supervisor for the case where there's a segfault in some unsafe code you call or a native library you invoke.
I also mentioned the fundamental tension of whether you want to fail open or fail closed. Most layers should probably fail open; for some layers (e.g. auth) it's safer to fail closed.
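Concretely, that annotation could be as small as a declarative table each layer registers in; layer names here are made up, not FL2's real module list:

```rust
/// Fail-open vs fail-closed, declared per layer instead of decided
/// implicitly by wherever a panic happens to escape.
enum FailurePolicy {
    Open,
    Closed,
}

// Layer names are made up for illustration.
const LAYER_POLICIES: &[(&str, FailurePolicy)] = &[
    ("ddos_mitigation", FailurePolicy::Open), // degraded protection beats a global 5xx
    ("bot_management", FailurePolicy::Open),  // a wrong bot score is recoverable
    ("auth", FailurePolicy::Closed),          // letting traffic through IS the hazard
];

fn main() {
    for (name, policy) in LAYER_POLICIES {
        let p = match policy {
            FailurePolicy::Open => "fail open",
            FailurePolicy::Closed => "fail closed",
        };
        println!("{name}: {p}");
    }
}
```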
The `unwrap` should be replaced by code that creates enough alerting to make a P0 incident out of their canary deployment immediately (see the sketch below).
Or, even better, the bot code crashing should itself be generating alerts.
The canary deployment would then be automatically rolled back until the P0 incident was resolved.
All of this could probably have happened and been contained, at their scale, in less than a minute: they would likely have generated enough "omg the proxy cannot handle its config" alerts off a deployment to 0.001% of machines near-immediately.
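A sketch of the `unwrap` replacement, assuming the failure mode described in the postmortem (a feature file growing past a hard limit); `load_features`, `FeatureSet`, and `page_oncall` are hypothetical stand-ins, not Cloudflare's actual code:

```rust
#[derive(Debug)]
struct FeatureSet(Vec<String>);

impl FeatureSet {
    /// Fall back to the last config that was successfully applied.
    fn last_known_good() -> Self {
        FeatureSet(vec!["cached-v1".into()])
    }
}

fn load_features(raw: &str) -> Result<FeatureSet, String> {
    // Stand-in for the real parse; the outage case is the Err branch.
    if raw.lines().count() > 200 {
        return Err("feature file exceeds 200-entry limit".into());
    }
    Ok(FeatureSet(raw.lines().map(String::from).collect()))
}

fn page_oncall(msg: &str) {
    // Stand-in for whatever emits the P0 alert / increments the metric.
    eprintln!("ALERT[P0]: {msg}");
}

fn main() {
    let raw = "f1\nf2"; // pretend this came from the config pipeline
    let features = match load_features(raw) {
        Ok(f) => f,
        Err(e) => {
            // This is the "proxy cannot handle its config" signal that
            // should halt and roll back the canary immediately.
            page_oncall(&format!("feature file rejected: {e}"));
            FeatureSet::last_known_good() // fail open on the old config
        }
    };
    println!("running with {} features", features.0.len());
}
```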
Agreed - a big question is why the file wasn't test-driven in staging and progressively rolled out. And also what alerting was missing within FL2 that they couldn't pinpoint the `unwrap` instantly.