It's not an architecture problem of the transformer at all. This is the result o...

It's not an architecture problem of the transformer at all. This is the result of thinking the idea that you can make inviolable rules for a system you don't understand is not anything but ridiculous. You're never going to make inviolable rules for a neural network because we don't understand what is going on on the inside.