A "480ns GPIO roundtrip" @ 100MHz implies 48 cycles for a single GPIO access. I would understand one or two cycles, but what does it spend the other ~46 cycles on? Does Python really have a >40x overhead compared to assembler or C, even on optimised hardware, or is the benchmark code that bad?
You're right that it can definitely be faster. There's real room for optimization.
When I have time, I may write a blog post that will explain where the cycles go, why it's different from raw assembler toggling, and how it could be improved.
Also, just to keep things in perspective, don't forget to compare apples to apples:
On a Pyboard running MicroPython, a simple GPIO roundtrip takes about 14 microseconds.
PyXL is already achieving 480 nanoseconds, roughly 29x less, so it’s a very different baseline.
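
For anyone who wants to check the arithmetic in this thread, here is a rough sketch. It uses only the figures quoted above (480 ns and 100 MHz for PyXL, about 14 µs for the Pyboard); the helper name `cycles` and the constant names are made up for illustration. The Pyboard's clock speed isn't given here, so the comparison between the two is in wall-clock time only, not in cycles.

```python
def cycles(latency_ns: float, clock_mhz: float) -> float:
    """Convert a latency in nanoseconds to clock cycles at a given frequency."""
    # cycles = seconds * Hz = (ns * 1e-9) * (MHz * 1e6) = ns * MHz / 1000
    return latency_ns * clock_mhz / 1000.0


# Figures quoted in the thread (not measured here).
PYXL_ROUNDTRIP_NS = 480.0         # PyXL GPIO roundtrip
PYXL_CLOCK_MHZ = 100.0            # PyXL clock
PYBOARD_ROUNDTRIP_NS = 14_000.0   # "about 14 microseconds" on a Pyboard

pyxl_cycles = cycles(PYXL_ROUNDTRIP_NS, PYXL_CLOCK_MHZ)

# Wall-clock ratio only; the two boards run at different clock speeds.
latency_ratio = PYBOARD_ROUNDTRIP_NS / PYXL_ROUNDTRIP_NS

print(f"PyXL roundtrip: {pyxl_cycles:.0f} cycles at {PYXL_CLOCK_MHZ:.0f} MHz")
print(f"Pyboard / PyXL latency ratio: {latency_ratio:.1f}x")
```

So both points hold at once: 48 cycles is a lot next to a one- or two-cycle store in assembler, and 480 ns is still a large step down from 14 µs.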