I think speed is the wrong word here. A better word is throughput.
The underlying issue with Python is that it doesn't support threading well (due to the global interpreter lock) and mostly handles concurrency by forking processes instead. The traditional way to improve throughput is to run more processes, which is expensive (e.g. you need more memory). This is a common pattern in other languages like Ruby, PHP, etc.
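For example, the classic pre-fork deployment looks something like this (a sketch; "myapp:app" and the worker count are placeholders):

    # each sync worker is a separate forked Python interpreter,
    # so memory use scales roughly linearly with --workers
    gunicorn myapp:app --workers 16 --worker-class sync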
Other languages use green threads / coroutines to implement async behavior and let a single thread handle multiple connections. On paper this should work in Python as well, except it has a few bottlenecks (which the article outlines) that leave its throughput somewhat worse than the multi-process, synchronous versions.
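For reference, the "one thread, many connections" coroutine model looks roughly like this with stdlib asyncio (a minimal echo-style sketch, not code from the article):

    import asyncio

    async def handle(reader, writer):
        # while this coroutine awaits I/O, the event loop
        # services other connections on the same thread
        data = await reader.readline()
        writer.write(data)              # echo the line back
        await writer.drain()
        writer.close()
        await writer.wait_closed()

    async def main():
        server = await asyncio.start_server(handle, "127.0.0.1", 8000)
        async with server:
            await server.serve_forever()

    asyncio.run(main())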
Memory is cheap; the cost is in constant de/serialization. Same with "just rewrite the hotspots in C!"-style advice; de/serialization can easily eat anything you saved by multiprocessing/rewriting. Python is a deceptively hard language, and a lot of this is a direct result of the "all of CPython is the public C-extension interface!" design decision (significant limitations on optimizations => heavy dependency on C extensions for anything remotely performance-sensitive => package management has to deal extensively with the nightmare that is C packaging => no meaningful cross-platform artifacts or cross compilation => etc).
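To make the de/serialization point concrete: everything crossing a process boundary gets pickled both ways, and that round trip can dwarf the work you offloaded. A rough sketch (payload size and the trivial helper are arbitrary):

    import pickle, time
    from multiprocessing import Pool

    def count_bytes(rows):
        # trivial "hotspot": the interesting cost is shipping the
        # argument and result between processes, not this loop
        return sum(len(r) for r in rows)

    if __name__ == "__main__":
        rows = [b"x" * 1000 for _ in range(100_000)]   # ~100 MB of payload

        t = time.perf_counter()
        pickle.loads(pickle.dumps(rows))               # the hidden round trip
        print("pickle round-trip:", time.perf_counter() - t)

        t = time.perf_counter()
        with Pool(processes=2) as pool:
            pool.apply(count_bytes, (rows,))           # pays that cost, plus IPC
        print("via worker process:", time.perf_counter() - t)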
Memory is not cheap when you're dealing with the real-world cost of deploying a production system. The pre-fork worker model used in many sync setups is very resource intensive, and depending on the number of workers you're probably paying a lot more for the box it's running on. Of course this is different if you're running on your own metal, but I have other issues with that.
> Memory is not cheap when you're dealing with the real-world cost of deploying a production system.
What? What makes you say that? What did you think I was talking about if not a production system? To be clear, we're talking about the overhead of single-digit additional python interpreters unless I'm misunderstanding something...
Observed costs from companies running the pre-fork worker model vs alternative deployment methods. And just in this benchmark they're running double-digit interpreters, which in my experience is the more common (and more expensive) setup.
Double-digit interpreters per host? Where is the expense? Interpreters have a relatively small memory overhead (<10 MB). If you're running 100 interpreters per host (you shouldn't be), that's an extra $50/host/year. But you should be running <10/host, so an extra $5/host/year. Not ideal, but not "expensive", and if you care about costs your biggest mistake was using Python in the first place.
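Spelled out, the arithmetic I'm assuming (~10 MB per interpreter and roughly $50 per GB-year of memory):

    100 interpreters x 10 MB = ~1 GB   -> ~$50/host/year
     10 interpreters x 10 MB = ~0.1 GB -> ~$5/host/year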
I don't know where you're seeing the <10 MB; in the situation I saw they were easily consuming 30 MB per interpreter. Even a cursory search now shows them at roughly 15-20 MB, so even assuming the 30 MB Gunicorn was just misconfigured, that's still an extra $100 per host using your estimate and what I'm finding by Googling around. Across a setup with multiple public APIs, that adds up pretty quickly.
Another Google search suggests that Gunicorn, for instance, using a lot of memory per forked worker isn't exactly uncommon either.
Edit: I reworded some stuff up there and tried to make my point more clear.
Yes, C dependency management is awful, and because Python is only practical with C extensions for performance critical code, it ends up being a nightmare as well.
In our use case, switching to asyncio is like moving from 12 cores to 3... (and I'm pretty sure we're handling more concurrency: from 24-30 req/s to 150 req/s).
But our workload is mostly network related (db, external services...)
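Roughly why that helps for us: the waits on the db and the external services overlap instead of stacking up. A toy sketch (fetch_db / fetch_api are made-up stand-ins, not our real code):

    import asyncio

    async def fetch_db():
        await asyncio.sleep(0.05)        # stand-in for an awaited db query
        return {"rows": []}

    async def fetch_api():
        await asyncio.sleep(0.10)        # stand-in for an external service call
        return {"status": "ok"}

    async def handle_request():
        # the two waits overlap instead of adding up, which is where
        # most of the req/s improvement comes from on I/O-bound work
        db, api = await asyncio.gather(fetch_db(), fetch_api())
        return db, api

    print(asyncio.run(handle_request()))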
maybe the author is concerned that many people are jumping the gun on async/await before we all fully understand why we need it at all. and that's true. but that paradigm was introduced (borrowed) to solve a completely different issue.
i would love to see how many concurrent connections those sync processes handle.
Hi - not sure what you mean by this. The sync workers handle one request (to completion) per worker. So 16 workers means 16 concurrent requests. For the async workers it's different - they do more concurrently - but as discussed their throughput is not better (and latency much worse).
Maybe what you're getting at is cases where there are a large number of (fairly sleepy) open connections? E.g. for push updates and other websockety things. I didn't test that, I'm afraid. The state of the art there seems to be using async, and I think that's a broadly appropriate usage, though it's generally not very performance-sensitive code, except that you try to do as little as possible in your connection manager code.
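For a sense of scale, a single event loop can hold a lot of idle connections cheaply; a toy sketch (idle tasks standing in for open sockets, numbers arbitrary):

    import asyncio

    async def idle_client(i):
        # a "sleepy" connection: mostly waiting, occasionally pushed to
        await asyncio.sleep(3600)

    async def main():
        # one thread, one event loop, ten thousand open-but-idle tasks;
        # each costs a small task/coroutine object rather than a worker process
        tasks = [asyncio.create_task(idle_client(i)) for i in range(10_000)]
        await asyncio.sleep(1)           # they all just sit there
        for t in tasks:
            t.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)

    asyncio.run(main())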
When everything works smoothly, that model may play out fine. But if you get a client that times out, or worse, a slow connection, then it ties up one of your workers for a long time in a synchronous model. In the async model this has less of a footprint, as you are still accepting other connections despite the slow progress on one of them.
yes many open connections is what i meant (suggested by other people as well). by the way, i really liked the writing, it's refreshing. and i agree with you that people aren't using async for the right reasons.
Thanks :), really appreciate that. I think all technology goes through a period of wild over-application early on. My country is full of (hand-dug) canals, for example.
Therefore, I would have liked to see how much memory all those workers use, and how many concurrent connections they can handle.
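If anyone wants to measure the memory side, one rough approach (assumes the psutil package; note that RSS counts copy-on-write pages shared between forked workers, so the sum overstates real usage):

    import psutil

    def worker_memory_mb(master_pid):
        # resident set size of a master process (e.g. a gunicorn
        # master) and each of its forked worker children, in MB
        master = psutil.Process(master_pid)
        procs = [master] + master.children(recursive=True)
        return [(p.pid, p.memory_info().rss / 1024 / 1024) for p in procs]

    # usage: print(worker_memory_mb(<gunicorn master pid>))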