This feels a bit ironic given how much the for loop has been villainized in numerical computing, by none other than languages like Python (and Matlab and R), where for loops are so awfully slow that you can only get performance if you avoid them like the plague and write (sometimes awkward) vectorized code that pushes all the for loops down into some C library (or C++ or Fortran). Compare that with, say, Julia, where people are encouraged to "just write yourself a dang for loop", because it's not only effective and simple but also fast. I guess what I'm saying is that even though Python may embrace the for loop as syntax, that affinity seems superficial at best: the moment you care about performance, Python rejects the for loop.
In Python numeric computing it's common for your outer loops to be for loops and your inner loops to be vectorized PyTorch/whatever.
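For example, a typical training loop looks something like this (a minimal sketch; the model, data, and hyperparameters are made up for illustration):

```python
import torch

# Hypothetical toy setup: shapes and values are made up for illustration.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]

for epoch in range(5):       # outer loop: plain Python for
    for x, y in batches:     # outer loop: plain Python for
        loss = ((model(x) - y) ** 2).mean()  # inner "loop": vectorized tensor ops
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The outer loops run a few hundred Python iterations in total, which costs almost nothing; all the work that matters happens inside the vectorized tensor operations.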
I personally like being able to easily comprehend and control what's being vectorized. Maybe it would be nice if my compiler could automatically replace any inefficient loops with vectorized equivalents, and I could think in whichever idiom came more naturally to the problem at hand. But I don't think there's anything too illogical about looping over epochs and batches, and then computing your loss function with matrices. Maybe I'm just used to a suboptimal way of doing things :)
> Maybe it would be nice if my compiler could automatically replace any inefficient loops with vectorized equivalents
The trouble is that a for loop is much more expressive than vectorized operations, so most for loops cannot be transformed into vectorized equivalents. The reason convincing people to write vectorized code works for performance is that you're constraining what they can express to a small set of operations that you already have fast code for (written in C). Instead of relying on compiler cleverness, this approach relies on human cleverness to express complex computations with that restricted set of vectorized primitives. Which is why it can feel like such a puzzle to write vectorized code that does what you want—because it is! So even if a compiler could spot some simple patterns and vectorize them for you, it would be incredibly brittle in the sense that as soon as you change the code just a little, it would immediately fall off a massive performance cliff.
I guess that's actually the second problem—the first problem is that there isn't any compiler in CPython to do this for you.
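To make the expressiveness point concrete, here's a toy NumPy sketch (array names and sizes are made up for illustration). The first loop is elementwise, so it maps directly onto primitives NumPy already has fast C code for; introduce a loop-carried dependency and there's no off-the-shelf primitive for the whole recurrence anymore:

```python
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Vectorizable: every element is independent of the others.
out = np.empty_like(a)
for i in range(len(a)):
    out[i] = 3.0 * a[i] + b[i]
out_fast = 3.0 * a + b  # same result, orders of magnitude faster

# Change it just a little -- each element now depends on the previous
# one -- and no single NumPy primitive computes this recurrence.
y = np.empty_like(a)
y[0] = b[0]
for i in range(1, len(a)):
    y[i] = y[i - 1] * a[i] + b[i]
```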
With hindsight, these languages turned out to be a bit "too dynamic" for their own good. Very few programs change a variable's type often enough for that flexibility to be useful, and the downside is that it makes type bugs possible (or more likely) and slows down every access. Par for the course; I'd say slow loops are a symptom, not a cause.
MATLAB has been JIT-compiled for years now; the "for loops are slow" dogma is over.
Numerical computing in Python is kind of weird in that it wasn't the language's original purpose and the fast math libraries were bolted on as an afterthought. But even then, tools like Numba do the same thing in Python, although there's a bunch of nuance in writing simple enough Python and hinting at the correct types for the variables in order to get it to compile something reasonable.
Julia's "let's use strict types, JIT-compile everything from day one, and avoid lock-in" approach is nice, though.
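For what it's worth, here's a minimal sketch of what Numba looks like in the easy case (reusing the toy recurrence from upthread; whether real code compiles this cleanly depends on staying inside the NumPy subset Numba supports):

```python
import numpy as np
import numba

@numba.njit  # nopython mode: compilation fails if anything falls outside Numba's supported subset
def recurrence(a, b):
    y = np.empty_like(a)
    y[0] = b[0]
    for i in range(1, len(a)):
        y[i] = y[i - 1] * a[i] + b[i]
    return y

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
y = recurrence(a, b)  # first call triggers JIT compilation; later calls run at compiled speed
```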
After Numba'ing several nontrivial pieces of NumPy code in my life: unless it's very trivial stuff, you might as well skip nopython mode and just rewrite it in Cython. Numba's errors and its partial coverage of NumPy are a huge time sink in my experience.