It's not a problem with the transformer architecture at all. This is what happens when people fail to see that the idea of making inviolable rules for a system you don't understand is ridiculous. You're never going to make inviolable rules for a neural network, because we don't understand what is going on inside it.