Extracted system prompts are usually very, very accurate.
It's a slightly noisy process, and there may be minor changes to wording and formatting. Worst case, sections may be omitted intermittently. But system prompts that are extracted by AI-whispering shamans are usually very consistent - and a very good match for what those companies reveal officially.
In a few cases, the extracted prompts were compared to what the companies revealed themselves later, and it was basically a 1:1 match.
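That kind of match is easy to quantify yourself. A quick sketch using Python's difflib - the file names here are made up, just stand-ins for an extracted prompt and the officially published one:

```python
import difflib

# Hypothetical file names: one holds the community-extracted prompt,
# the other the version the company later published itself.
extracted = open("extracted_prompt.txt").read().split()
official = open("official_prompt.txt").read().split()

# Word-level similarity ratio; near 1.0 for the "basically 1:1" cases above.
ratio = difflib.SequenceMatcher(None, extracted, official).ratio()
print(f"similarity: {ratio:.1%}")
```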
If this "soul document" is a part of the system prompt, then I would expect the same level of accuracy.
If it's learned - embedded in the model weights? Much less accurate. It could probably still be recovered mostly intact, with decent reliability, but only with statistical methods - sampling many attempted recitations and aggregating them - and at least a few hundred dollars' worth of AI compute.
It's very unclear to me how it could have been recovered if it wasn't part of the system prompt - and especially how Claude would know it's called the "soul doc", if that was just an internal nickname.
I mean, obviously we know how it happened - the text was shown to the model multiple times during late post-training or SFT. That's the only way it could have memorized it. But I don't see the point in having it memorize such a document.
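A minimal sketch of what that statistical recovery might look like - the sampling function is a hypothetical stand-in for whatever API you'd actually call, and real recovery would need proper sequence alignment on top of this:

```python
from collections import Counter

def sample_recitation(prompt: str) -> list[str]:
    """Hypothetical stand-in: query the model at temperature > 0 and
    return its attempted recitation of the document, split into words."""
    raise NotImplementedError("replace with a real API call")

def reconstruct(prompt: str, n_samples: int = 200) -> str:
    """Aggregate many noisy recitations by per-position majority vote,
    so that random omissions and paraphrases get washed out. Assumes the
    samples are roughly aligned; aligning them properly is where most of
    the compute budget would actually go."""
    samples = [sample_recitation(prompt) for _ in range(n_samples)]
    length = max(len(s) for s in samples)
    consensus = []
    for i in range(length):
        words_at_i = [s[i] for s in samples if i < len(s)]
        consensus.append(Counter(words_at_i).most_common(1)[0][0])
    return " ".join(consensus)
```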
There are a few weirder training methods that involve wiring explicit bits of knowledge into the model.
I imagine that if you push them hard enough with the exact same text, you can get full word-for-word memorization. This may be intentional, or a side effect of trying to wire other knowledge into the model while this document also happens to be loaded into the context.
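As a toy illustration of the mechanism - not a claim about Anthropic's actual pipeline; the model name and file path are placeholders - repeatedly fine-tuning on one fixed document drives its loss toward zero, at which point the model reproduces it verbatim from a short prefix:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any small causal LM shows the effect
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

document = open("soul_doc.txt").read()  # hypothetical file name
inputs = tok(document, return_tensors="pt", truncation=True, max_length=1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(50):  # the exact same text, over and over
    out = model(**inputs, labels=inputs["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# After enough repetition, greedy decoding from the document's opening
# words tends to continue it word for word.
model.eval()
prefix = tok(document[:100], return_tensors="pt")
generated = model.generate(**prefix, max_new_tokens=200, do_sample=False)
print(tok.decode(generated[0]))
```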