I strongly suspect it's a tokenization problem. Text and symbols fit nicely in tokens, but having something like a single "dog leg" token is a tough problem to solve.
The neural network in the retina actually pre-processes visual information into something akin to "tokens": basic shapes that are probably somewhat evolutionarily conserved. I wonder if we could somehow mimic those for tokenization purposes. Most likely someone out there is already trying.
AFAIK this is actually a separate mechanism, which is part of the visual cortex and not the retina. Essentially recognizing even a single object requires the complete attention of pretty much your entire brain in the moment of recognition.
What I am referring to is a much more basic form of shape recognition that goes on at the level of the neural networks in the retina.
I think in this case, tokenization and perception are somewhat analogous, if you'll allow the analogy. It is probably the case that our current tokenization schemes are really simplistic compared to what nature is working with.
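To make "simplistic" concrete, here is a rough sketch of fixed-grid patch tokenization, the kind of scheme commonly used to feed images to transformer models. This is only an illustration under assumptions: `patchify` and the toy 4x4 image are hypothetical names and data made up for this example, not anyone's actual implementation.

```python
def patchify(image, patch):
    """Split a 2D grayscale image (a list of rows) into flattened patch x patch tiles.

    Hypothetical helper for illustration: each tile stands in for one "visual token".
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    tokens = []
    # Walk the grid row-major, one tile at a time.
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [
                image[top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
            ]
            tokens.append(tile)
    return tokens


# Toy 4x4 image made of four uniform 2x2 regions (assumed data).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]
tokens = patchify(image, 2)
print(tokens)
```

The grid is chosen without looking at the content, so something like a dog's leg ends up spread across whichever tiles it happens to overlap. Nothing in this scheme groups those tiles into one shape-level unit, which is roughly the gap being discussed above.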