
> profilers are more important than careful design.

> I have found that it's actually possible to have worse performance with threads, if you write in a blocking fashion

But isn't excessive blocking/synchronization something that should already be tackled in your design, rather than reworked after the fact?

I would expect profiling to mostly lead to micro-optimisations, e.g. combining or splitting the regions over which a lock is held. But while you're still designing, you can look at avoiding as much need for synchronization as possible, e.g. sharing data copy-on-write (no locks required as long as you hold a reference) instead of locking the data on every access. A sketch of that idea follows.
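For concreteness, here's a minimal C++ sketch of the copy-on-write idea (names are made up, and it assumes concurrent writers are serialized elsewhere; the pre-C++20 atomic shared_ptr free functions are used):

    #include <atomic>
    #include <memory>
    #include <vector>

    // Readers pin an immutable snapshot without locking;
    // a writer copies, modifies, and atomically publishes a new snapshot.
    class SharedTable {
    public:
        std::shared_ptr<const std::vector<int>> snapshot() const {
            return std::atomic_load(&data_);  // lock-free read path
        }
        void append(int value) {              // writers serialized by caller
            auto next = std::make_shared<std::vector<int>>(
                *std::atomic_load(&data_));   // private copy of current data
            next->push_back(value);
            std::atomic_store(&data_,
                std::shared_ptr<const std::vector<int>>(std::move(next)));
        }
    private:
        std::shared_ptr<const std::vector<int>> data_ =
            std::make_shared<const std::vector<int>>();
    };

Readers never take a mutex; they just hold the snapshot they grabbed, and any in-flight readers keep the old version alive until they drop it.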

As another commenter says:

> with asyncio we deploy a thread per worker (loop), and a worker per core. We also move cpu bound functions to a thread pool

you can't easily go from, e.g., thread-per-connection to a worker pool. That should have been caught during design. (A rough sketch of the worker-pool shape is below.)
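For reference, this is roughly the shape being contrasted with thread-per-connection: a hedged C++ sketch of a fixed pool (one worker per core, tasks queued), not anyone's production code:

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Fixed-size worker pool: N workers drain one task queue,
    // instead of spawning one thread per connection.
    class WorkerPool {
    public:
        explicit WorkerPool(unsigned n = std::thread::hardware_concurrency()) {
            for (unsigned i = 0; i < n; ++i)
                workers_.emplace_back([this] { run(); });
        }
        ~WorkerPool() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_all();
            for (auto& w : workers_) w.join();
        }
        void submit(std::function<void()> task) {
            { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); }
            cv_.notify_one();
        }
    private:
        void run() {
            for (;;) {
                std::function<void()> task;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
                    if (done_ && tasks_.empty()) return;
                    task = std::move(tasks_.front());
                    tasks_.pop();
                }
                task();  // runs outside the lock
            }
        }
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::function<void()>> tasks_;
        std::vector<std::thread> workers_;
        bool done_ = false;
    };

CPU-bound work gets submit()-ed rather than getting its own thread; retrofitting this onto a thread-per-connection codebase means rethreading every blocking call site, which is why it's a design-time decision.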



> But isn't excessive blocking/synchronization something that should already be tackled in your design, rather than reworked after the fact?

Yes and no. Again, I have not profiled or optimized servers or interpreted/JIT-compiled languages, so I bet there's a different ruleset there.

Blocking can come from unexpected places. For example, if we use dependencies, then we don't have much control over the resources accessed by the dependency.

Sometimes, these dependencies are the OS or the standard library. We would sometimes have to choose alternate system calls, as the ones we initially chose caused issues that were not exposed until we ran the profiler. One classic example of this is sketched below.
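An illustration of that kind of swap (not necessarily what the author hit): with lseek()+read(), threads sharing a file descriptor contend on the shared file offset and serialize, while pread() takes an explicit offset and sidesteps that entirely:

    #include <unistd.h>

    ssize_t read_block(int fd, char* buf, size_t len, off_t offset) {
        // Instead of: lseek(fd, offset, SEEK_SET); read(fd, buf, len);
        // pread() carries its own offset, so concurrent readers on the
        // same fd don't fight over the shared file position.
        return pread(fd, buf, len, offset);
    }

That's exactly the kind of change nothing flags until a profile shows threads piling up behind a single descriptor.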

In my experience, the killer for us was often cache-breaking. Things like the length of the data in a variable could determine whether it stayed in a register or low-level cache or got evicted, and the impact could be astounding. This could lead to remedies like applying a visitor to break up a [supposedly] inconsequential temp buffer into cache-friendly bites, as in the sketch below.
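A toy version of that remedy (names and sizes invented; the real chunk size would be tuned to the target's cache): run every processing stage over one cache-sized chunk before moving on, instead of making several full passes over a buffer too big to stay resident.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    constexpr std::size_t kChunk = 16 * 1024;  // hypothetical cache-friendly size

    void process(std::vector<float>& buf) {
        for (std::size_t base = 0; base < buf.size(); base += kChunk) {
            const std::size_t end = std::min(buf.size(), base + kChunk);
            // Do all stages on this chunk while it is still hot in cache.
            for (std::size_t i = base; i < end; ++i) buf[i] *= 2.0f;  // stage 1
            for (std::size_t i = base; i < end; ++i) buf[i] += 1.0f;  // stage 2
        }
    }
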

Also, we sometimes had to recombine work that we had split across threads, because the split itself hurt the cache.
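The usual culprit when a split across threads hurts the cache is false sharing; the standard fix, sketched here under the common assumption of 64-byte cache lines (C++17 for the over-aligned vector storage), is to pad per-thread data onto separate lines:

    #include <cstdint>
    #include <thread>
    #include <vector>

    struct alignas(64) PaddedCounter {  // one cache line per counter
        std::uint64_t value = 0;
    };

    std::uint64_t count_in_parallel(unsigned n_threads, std::uint64_t iters) {
        std::vector<PaddedCounter> counters(n_threads);
        std::vector<std::thread> threads;
        for (unsigned t = 0; t < n_threads; ++t)
            threads.emplace_back([&counters, t, iters] {
                for (std::uint64_t i = 0; i < iters; ++i)
                    ++counters[t].value;  // each thread touches only its own line
            });
        for (auto& th : threads) th.join();
        std::uint64_t total = 0;
        for (auto& c : counters) total += c.value;
        return total;
    }

Without the alignas, adjacent counters land on one cache line and the cores ping-pong it back and forth, which can make the threaded version slower than recombined single-threaded work.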

Unit testing could be useless. For example, the test images we often used were of the classic "Photo Test Diorama" variety: a bunch of stuff crammed onto a well-lit table, with a few targets.

Then, we would run an image from a pro shooter, with a Western prairie skyline, and the lengths of some of the convolution target blocks would be different. This could sometimes hurt the cache, demoting a buffer. It taught us to use a large pool of test images, which was sometimes quite difficult. In some cases, we actually had to use synthesized images.

Since we were working on image processing software, we were already doing this in other work, but we learned to do it in the optimization work, too.

When my team was working on C++ optimization, we had a team from Intel come in and profile our apps.

It was pretty humbling.



