Thanks for the example! Sounds like 1.6 GB/s on an entire Tesla K80 (300W TDP).
This is in fact several times slower than our results on Skylake (with half the TDP), but note that the K80 is from 2014.
The "25-fold speedup", as is often the case for such reports, comes from not optimizing the CPU side.