I appreciate the time and care that went into this post, and there’s a nice discussion of various features.
Unfortunately the performance charts are completely divorced from reality, and in particular the discussion on tensor cores may be true from an instruction-count perspective but does not reflect any third-party benchmark I’ve seen. For example: https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks.... Nvidia has a history of straight-up lying about tensor core and other benchmarks (for example, see this thread from right after Nvidia announced an 8x improvement in speed on ImageNet in TensorFlow for the V100: https://github.com/tensorflow/benchmarks/issues/77)
In general, fp16 is only 30-40% faster than fp32, and occasionally 2x in really optimal conditions.
> Unfortunately the performance charts are completely divorced from reality, and in particular the discussion on tensor cores may be true from an instruction-count perspective but does not reflect any third-party benchmark I’ve seen. For example: https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks...
The performance numbers posted here appear to almost exactly reflect the LambdaLabs numbers.
Lambda Labs: the RTX 2080 Ti is 96% as fast as Titan V, 73% as fast as Tesla V100 (32 GB)
timdettmers: RTX 2080 Ti normalized to 1, Titan V looks about 1.1 to 1.2, V100 is just below 1.5
> for example, see this thread from right after Nvidia announced an 8x improvement in speed on imagenet in tensorflow for V100
In my experience NVidia benchmark numbers in deep learning are rarely lies - they are highly optimised, in optimal conditions and rarely achievable in the real world. About what you'd expect from a vendor benchmark.
Thank you for cross-referencing that; the data does look accurate and my statement now seems exaggerated. I do think we need skepticism on the A100 charts, though, until third-party benchmarks come out.
> In my experience NVidia benchmark numbers in deep learning are rarely lies - they are highly optimised, in optimal conditions and rarely achievable in the real world.
Right, but Nvidia claimed 1360 images/sec for ResNet-50 on ImageNet. To my knowledge this still hasn’t been realized by a third party. It also isn’t a 4x improvement for fp16 over fp32; that figure comes from comparing against the previous generation. The improvement is more like 1.5x: https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v...
Dell hit 5,243 images/sec with one of their 4x V100 servers, which comes to 1,310 images/sec per V100. I find it very believable that NVidia would get ~200 images/sec more, since Dell jumped 50% with a change in their CPU/GPU connection topology.
Thank you for finding that! I’m glad to see the situation has improved since I last looked. 3x improvement over fp32 is impressive for sure. Their marketing claims of 8x still bother me though.
The only claim I've seen is a fairly limited one ("NVIDIA GPUs offer up to 8x more half precision arithmetic throughput when compared to single-precision, thus speeding up math-limited layers."[1]), which is probably true.
The problem with performance improvements is the diminishing-returns part of Amdahl's Law: an 8x improvement in math performance just means the math part becomes less important for absolute performance (a quick sketch of this follows below).
In any case, I've found NVidia's claims in the machine learning area to be pretty good. Like most claims you have to read carefully to see exactly what the claim is, but that's not uncommon with performance claims.
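To make the Amdahl's Law point concrete, here's a minimal sketch (the math-limited fractions are purely illustrative, not measurements of any particular network):

    def amdahl_speedup(math_fraction, math_speedup):
        """Overall speedup when only the math-limited fraction of runtime gets faster."""
        return 1.0 / ((1.0 - math_fraction) + math_fraction / math_speedup)

    # Even with 8x faster math, end-to-end gains shrink quickly once
    # memory-bound time starts to dominate.
    for frac in (0.9, 0.7, 0.5):
        print(f"math fraction {frac:.0%}: overall speedup {amdahl_speedup(frac, 8):.2f}x")

Even when 90% of the runtime is math-limited, an 8x math speedup only buys you about 4.7x end to end; at 50% it's under 1.8x.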
> NVIDIA GPUs offer up to 8x more half precision arithmetic throughput when compared to single-precision, thus speeding up math-limited layers
Right, but I’ve benchmarked the best-case scenario, i.e. a large GEMM call in C++, and still not seen anywhere close to 8x. I’ve never seen a code example, no matter how limited, showing an 8x speed-up.
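If anyone wants to reproduce that kind of best-case measurement without writing raw cuBLAS C++, here's a rough PyTorch sketch of the same idea (matrix size and iteration count are arbitrary, and the fp16/fp32 ratio you get will depend heavily on the card and library versions):

    import time
    import torch

    def bench_gemm(dtype, n=8192, iters=50):
        a = torch.randn(n, n, device="cuda", dtype=dtype)
        b = torch.randn(n, n, device="cuda", dtype=dtype)
        for _ in range(5):          # warm-up
            a @ b
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            a @ b
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - t0
        return 2 * n**3 * iters / elapsed / 1e12   # achieved TFLOPS

    fp32 = bench_gemm(torch.float32)
    fp16 = bench_gemm(torch.float16)
    print(f"fp32: {fp32:.1f} TFLOPS, fp16: {fp16:.1f} TFLOPS, ratio: {fp16 / fp32:.2f}x")

Large square matrices with dimensions that are multiples of 8 should give the tensor cores their best shot, so this is about as favourable a setup as you can get.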
I should have been a bit clearer about what went into the charts. I do not use theoretical marketing numbers but real-life benchmark data from NVIDIA and 4 other benchmark sources covering the Titan V, V100, RTX 2080, RTX 2080 Ti, and Titan RTX. Since I calibrate a model that needs to satisfy all sources as best as it can, I think the numbers are pretty accurate.
Thank you for clarifying! I’m still skeptical of the chart’s A100 values but appreciate your reasonable attempt to de-bias. It’s always easier to critique than create, so I also want to make sure I compliment you on an excellent article :).
Thank you, I just updated the blog post with more detailed clarification of where the data comes from.
One thing that I am quite sure of for the A100 is its transformer performance. It turns out large transformers are so strongly bottlenecked by memory bandwidth that you can just use memory bandwidth alone to measure performance — even across GPU architectures. The error between Volta and Turing with a pure bandwidth model is less than 5%. The NVIDIA transformer A100 benchmark data shows similar scaling (sketched below). So I am pretty confident in the transformer numbers.
The computer vision numbers are more dependent on the network, and it is difficult to generalize across all CNNs. For example, group convolution or depthwise separable convolution based CNNs do not scale well with better GPUs and speedups will be small (1.2x-1.5x), whereas other networks like ResNet get pretty straightforward improvements (1.6x-1.7x). So CNN values are less straightforward because there is more diversity between CNNs compared to transformers.
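Here is roughly what that bandwidth-only model looks like in code (the bandwidth figures are approximate spec-sheet values, and the ratios are what the model predicts for large transformers, not measured benchmarks):

    # Approximate peak memory bandwidth in GB/s (spec-sheet ballpark values).
    bandwidth = {
        "RTX 2080 Ti": 616,
        "Titan V": 653,
        "V100 (SXM2)": 900,
        "A100 (40 GB)": 1555,
    }

    base = bandwidth["RTX 2080 Ti"]
    for gpu, bw in bandwidth.items():
        # Bandwidth-bound model: predicted speedup is simply the bandwidth ratio.
        print(f"{gpu}: predicted transformer speedup vs. RTX 2080 Ti ~ {bw / base:.2f}x")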
It really depends on the workload. For ImageNet and ResNet-type architectures it's not unusual to get a 3x speed-up. It also depends on whether you do full fp16, leave BNs alone, etc. This matters a lot: instead of training for 6 days, you're now training for 2 days.
> For ImageNet and ResNet-type architectures it's not unusual to get a 3x speed-up
Source on this? I've done a good bit of CV benchmarking work and I don't recall anything like a 3x boost. 30-40% improvement is much more in line with what I remember.
Bought a second-hand GTX 1060 a while ago to play around a bit more seriously with neural networks. It's a good balance: cheap, 6 GB of memory, but still serious enough to get some work done. If you are a professional researcher or do multiple Kaggles per month, then yes, get the best card. But I suspect a lot of people are one category "below" that, where a 1060 is sufficient most of the time and you go to the cloud for the actual big workloads.
This strategy can keep you going for a couple of years. With models becoming as big as they are, I doubt how much SOTA work an RTX 3070 will manage in 2-3 years (actually, none of these cards come close to GPT-3). By then you can pick up a second-hand RTX 30-series card and still get the latest offerings in the cloud.
Buying a second-hand GPU comes with a bit of risk, by the way; someone suggested only buying if the price is less than half of the original (can't remember the link).
Interestingly, the 1060 was mentioned among the suggested cards in one of the old versions of this same article (it gets updated every time Nvidia releases a new series).
> Do not buy GTX 16s series cards. These cards do not have tensor cores and, as such, provide relatively poor deep learning performance. I would choose a used RTX 2070 / RTX 2060 / RTX 2060 Super any day over a GTX 16s series card
...a few paragraphs later...
> If that is too expensive, a used GTX 980 Ti (6GB $150) or a used GTX 1650 Super ($190).
Having a GPU is better than not having one, even if it doesn't have tensor cores.
If you are on a tight budget, his advice is to pick in this order:
> I have little money: Buy used cards. Hierarchy: RTX 2070 ($400), RTX 2060 ($300), GTX 1070 ($220), GTX 1070 Ti ($230), GTX 1650 Super ($190), GTX 980 Ti (6GB $150).
I guess it should be taken as:
* Strongly prefer cards with Tensor cores -> suggested cards
* If you need a card because you're currently CPU-only and you're strapped for cash, these are your best bets, but only if you have no other option.
The intent is probably “get the newest RTX you can afford if you’re a full-time researcher; otherwise most recent GTX cards do fine”, but it’s not explicitly stated in the text.
Kudos to the author on producing an excellent, comprehensive yet readable post (based on my initial, very brief review)! Much appreciated. One thing that jumped out at me, though, is the recommendation for the "I want to try deep learning, but I am not serious about it" scenario. Advising the use of a physical GPU (unless it's already part of the system at hand), especially an RTX 2060 Super, IMO does not make sense in this case. Using cheap cloud GPU instances is the optimal way to try deep learning.
Thank you! You have a good point. I think I would agree with you: if somebody already has cloud computing skills, then the cloud is a much more powerful way to learn deep learning than your own GPU.
I figured that most people who start with deep learning might also lack cloud computing skills. Learning one thing at a time is easier, and as such, just sticking a GPU into your desktop and focusing on deep learning software / programming might yield a better experience.
I might update my blog post in the future with this detail.
You're welcome! While I understand your rationale now, I'm afraid I still disagree with it. :-) Simply because I find it very unlikely that people interested in and having enough skills to embark on any reasonable deep learning journey (even if just to try) would lack enough cloud computing skills to use cloud GPU instances. After all, using GPUs in the cloud is not much different from (and, thus, not more complex than) using physical GPUs in your local machine.
Fun fact: I've saved your post as a PDF for offline reading and it clocked in at 649(!) pages at the time of saving (32 pages for the post per se and the rest for the blog comments). Combining that with the feedback here at HN, it is clear that there is quite a lot of interest in the topic...
Haha, 649 pages! Thanks for the discussion. I can understand your perspective. Maybe it would be best to add something to the blog post that discusses it from both perspectives and readers can then choose which perspective suits them better.
I should also say a bit more in general about cloud computing; it seems some people agree that the post ran a bit short on that. At some point I just wanted to be done with it though — editing 10k-word blog posts is not so much fun anymore!
It's my pleasure. I agree - it's a good idea to present both perspectives and allow readers to decide what works best for them.
I can certainly understand you being hesitant to add more content to an already sizeable post. Perhaps several small paragraphs on the important relevant aspects might still be worth considering (take it with a grain of salt, since I haven't actually read your post in detail, including the cloud-related parts). Anyway, thank you very much, again, for your time and effort. Keep it up!
You're very welcome! BTW, do you have any interest in and time for potential consulting or advisory for a frontier/deep tech startup (ambitious goals, challenging tasks, great impact)? Not an immediate need, but, hopefully, it will be a more real opportunity in not so distant future.
Just noting that you can (still) do very well on non top-of-the-line cards.
I've won multiple silver Kaggle medals on a 1070. It's true that more power would be helpful, but I feel it's lack of technique (and time!) rather than compute that has held me back from gold medals.
Not disagreeing with you, but from what I've heard, the people who win Kaggle competitions are the people who can try more things; having a faster card presumably allows you to try more things because each attempt takes less time.
There is some truth in this, but it's not the entire picture.
I've thought about it a lot, and talked to lots of really good Kagglers about it. Most of them use multiple machines, rather than having maximum performance in a single machine.
This lets them run multiple completely different experiments at once, and then put extra compute onto the ones that seem good.
That is a big difference to me, when I have to experiment sequentially. The parallelism is more important than absolute speed a lot of the time.
Good article! But the comment on NVLink / PCIe 4.0 (and the claim that PCIe 3.0 x4 is good enough) doesn’t fit my experience. PCIe 3.0 x4 can really hurt your all-reduce performance for models such as transformers (rough arithmetic below). For ResNet it can matter a bit too, but only in the 5% to 10% range the author mentioned.
I am also interested in whether PCIe 4.0 can help unified memory for larger models. Guess I'll have to wait for the actual RTX 3090 release.
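Rough arithmetic on why the interconnect dominates all-reduce time (the model size, GPU count, and effective bus speeds below are illustrative assumptions, not measurements):

    def allreduce_seconds(params, n_gpus, bus_gb_per_s, bytes_per_grad=4):
        """Ring all-reduce moves ~2*(N-1)/N of the gradient buffer over the bus per GPU per step."""
        grad_bytes = params * bytes_per_grad
        traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes
        return traffic / (bus_gb_per_s * 1e9)

    params = 300e6  # a mid-sized transformer with fp32 gradients (assumed)
    for name, bw in [("PCIe 3.0 x4 (~3 GB/s effective)", 3.0),
                     ("PCIe 3.0 x16 (~12 GB/s effective)", 12.0),
                     ("NVLink (~45 GB/s effective)", 45.0)]:
        print(f"{name}: ~{allreduce_seconds(params, 4, bw) * 1000:.0f} ms per step")

With these assumed numbers, syncing ~1.2 GB of gradients across 4 GPUs is hundreds of milliseconds per step over PCIe 3.0 x4, which is exactly the kind of overhead you feel on transformers.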
I agree. When training production speech recognition models (wav2letter) on huge datasets, I hit PCIe, NVLink, network, and probably storage bottlenecks when scaling. (This isn’t comparable to training on 40GB of OpenWebText; it’s terabytes of compressed audio.)
When I tested the DGX2, 16-card bandwidth was finally not an issue (I was extremely happy with the NVSwitch)... but I hit a CPU bottleneck going from [64vcpu + 8 V100] to [96vcpu + 16 V100] (even with tricks to reduce CUDA CPU load).
I’m excited to sometime try my workload on 2x NVLinked 3090s with a 64-core AMD CPU, optimized PCIe v4 NVMe data path, additional caching, and some hand tuning of the training code. It’s possible this will be competitive with an 8x V100 google cloud instance based on some of my scaling pain graphs.
I’d consider 4x 3090s, but between only having a single NVLink port and having to figure out how to even fit four 3-slot cards in a case without water cooling, it seems prudent to start with two.
What are you talking about? Are you writing custom CUDA code to run those large models? Because there's zero support for unified memory in any of the existing DL frameworks.
I want an external 3090 chassis on Thunderbolt, similar to this box. No muss, no fuss. AC, liquid cooling, all in an engineered box. Gaming or ML with my laptop or tablet PC.
I'm sure that eGPU enclosure manufacturers will eventually produce models fully compatible with GeForce RTX 30 series. In the meantime, FYI there are some relevant discussions regarding potential compatibility (with pin adapters) of some eGPU enclosures currently available on the market, e.g.: https://egpu.io/forums/thunderbolt-enclosures/razer-core-x-c....
Just buy a good ITX PC. eGPU boxes often cost more than a PC, are larger as well, and perform worse due to the TB3 link and overall clunkiness.
Thunderbolt 4 is on its way with Intel Tiger Lake, which doubles the bandwidth, but that still leaves limitations: you're limited to x8 PCIe lanes, which means you're leaving performance on the table with high-end GPUs, and you're locked into the Intel ecosystem, so no Ryzen CPUs for you :(
And they also cost $$$, which makes them a luxury solution only suitable for users who absolutely want a thin, portable notebook for on the go and a powerful GPU at home for AI/gaming.
Out of curiosity: there are now enterprise sellers offloading old Tesla cards relatively cheap (i.e. K40 ~$100-$150, K80/M40 ~$150-$200); are these worth looking at on a budget versus the 900- or 1600-series cards? Especially given the memory options.
Is anyone else trying to decide whether to upgrade from 1080 Tis?
I have a box with four of them which have served me well for a while now, and based on just raw performance it looks like upgrading to two RTX 3080s would exceed the performance of my current system.
I'm wondering if I should rush to sell off the cards on the used market before the prices crash and then use that money to swap over to Ampere.
Then there’s also the question of whether there will be an RTX 3080 Ti that blows away the RTX 3080 and is a viable card for the next five years like the 1080 Ti.
I'm really uncertain about what to do and wonder what calculus other people have done on this decision.
4x 1080Ti are probably about as fast as 2x 3080 if you're able to use FP16. They are most likely faster for FP32. And in any case they provide almost 2x memory.
Mmm, I don't see any mention of cloud offerings or the T4. The Google Cloud team published a more detailed and comprehensive blog post which includes more important aspects: inference, training, cost, and time. https://cloud.google.com/blog/products/ai-machine-learning/y...
Really great to publish these builds and GPU suggestions. Putting together a system that works can be really frustrating, and knowing where to start is really helpful.
I've gone with GPU spot instances for my personal experiments. The key is to be able to bring up a machine in a couple minutes so you're ok with always tearing one down. A combination of ansible and some scripts that push code around helped a lot to create a useful environment for experimenting.
One thing I did after the RTX 30-series announcement was a back-of-the-envelope comparison of performance per dollar and performance per watt, taking NVIDIA's numbers at face value. The 3070 and 3080 are surprisingly close on both metrics. You pay a substantial premium for the 3090, but it does have the best performance per watt.
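If anyone wants to redo the same exercise, here's the sort of thing I mean. The MSRP and TDP values are the announced launch figures, but the relative performance values are placeholder assumptions (not measurements) that you should swap for whatever numbers you trust:

    # MSRP (USD) and TDP (W) from the launch announcement; relative_perf is a
    # placeholder assumption (RTX 3070 = 1.0), not a benchmark result.
    cards = {
        "RTX 3070": {"price": 499, "tdp": 220, "relative_perf": 1.00},
        "RTX 3080": {"price": 699, "tdp": 320, "relative_perf": 1.40},
        "RTX 3090": {"price": 1499, "tdp": 350, "relative_perf": 1.65},
    }

    for name, c in cards.items():
        per_dollar = c["relative_perf"] / c["price"] * 1000   # perf per $1000
        per_watt = c["relative_perf"] / c["tdp"] * 100        # perf per 100 W
        print(f"{name}: {per_dollar:.2f} perf/$1000, {per_watt:.2f} perf/100 W")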
For NLP applications the deciding factor is actually GPU memory, so the choice is limited (the V100 32GB is the best, and nothing below 16GB is worth considering).
Answer: Whatever your current generation mid-level Nvidia Geforce is. It's been this way for a while.
Though you should probably use AWS ML Compute, since they even have Nvidia Ampere A100s, which cost $10,000 each, and it'll probably be more cost effective for heavier workloads.
Would be interested to know how the Tesla T4s fit in, if at all. They seem to be by far the most "affordable" option if your goal is to get in at the low end in the data center space. But I'm not at all sure whether they represent value in terms of $/compute.
T4s are essentially a 2070 Super with twice the device memory (plus minor changes to clock speeds to account for power/cooling). ~5x the price, but suitable for the data centre and larger networks.
Does anyone here know of an article comparing GPU architectures for non-deep-learning work with a depth similar to this article, addressing things like memory latencies, cycles, caches, etc.? Kudos to the author of the above article, I really enjoyed reading it!
CUDA owns the industry. AMD has to step up with a good alternative to disrupt it. Any serious work is being done in CUDA and there are lots of resources for CUDA.
You don't need to use CUDA for GPU programming, no? It's simply lock-in. But I'm sure Nvidia was pushing it quite a lot to make sure it's hard not to use it. But it clearly has to go. It's not healthy for the industry.
> You don't need to use CUDA for GPU programming, no?
No, you don't _need_ to use CUDA for GPU programming, you can use OpenCL or Vulkan or probably even PHP instead.
I do, however, _want_ to use the best programming language for the task at hand. If that task is GPU programming, CUDA is the best language I know for that, much better than SyCL, OpenCL, Vulkan / OpenGL + shaders, etc.
If these other technologies would be better, I would use them instead.
CUDA can't be the best if it's tied to a single GPU vendor. It's DOA by definition. This idea of "a language that only works on this hardware" is out of some dinosaur lock-in handbook from the last century.
CUDA compiles to CPUs, and AMD has support for CUDA via HIP.
Not that this matters because your argument is flawed.
The claim that CUDA is not worth using because it lacks portability only holds if there is hardware worth using that's not supported by CUDA.
The only GPUs worth buying for compute are from nvidia and support CUDA, so your claim isn't true.
The only thing you achieve today by not using CUDA is paying a huge price in development quality for portability that you can't use.
The startup cemetery is filled with companies that made this trade-off and picked OpenCL just in case they wanted to use non-nvidia hardware. They were all killed by the velocity of their competitors, who were using CUDA to deliver better products that paid the bills.
The only people for which it might make sense to avoid CUDA are "non-professionals" (hobbyist, etc.). If you only want to use OpenCL to "learn OpenCL", then OpenCL is the right choice. But if you want to make money, then CUDA was the right choice 15 years ago and still is the right choice today.
If that makes you angry, direct your anger properly. It isn't NVIDIA's fault that CUDA is really good. It is, however, AMD's, Intel's, Apple's, Qualcomm's, ARM's... fault that everything else _sucks hard_. Being angry at nvidia for delivering good products is just stupid. It's the other companies' fault that they can't seem to get their sh* together when it comes to GPU computing.
> The only GPUs worth buying for compute are from nvidia
That sounds like marketing kool-aid to me. AMD GCN was more compute-oriented than Nvidia's architectures for years, and only lately has AMD increased its focus on gaming with RDNA.
That's a fact: check the HPL, MLPerf, SPEC, etc. results. MLPerf is the perfect example, where your results are only accepted if they can be verified by others. Where is AMD in there? (Nowhere; their products suck for compute.)
> AMD GCN was more compute oriented than Nvidia for years
No, the only thing AMD GCN was good for is as a very expensive stove.
AMD GCN had a lot of compute, on paper, and higher numbers than nvidia GPUs of the time. Unfortunately, AMD GCN's memory subsystem sucked, and it was impossible to deliver data fast enough to actually be able to use the compute.
So nvidia's hardware essentially destroyed GCN for any useful practical application.
IIRC, the only application for which GCN got some use was bitcoin mining, which avoided hitting GCN's issues because it just requires doing a ton of useless work on a tiny amount of memory. Perfect for GCN, right? Nope, nvidia's hardware was still better, but sold out, and GCN wasn't horrible at this, so it got some use.
AMD actually fired the architect of GCN over this. Yet this still perfectly summarizes AMD's GPGPU strategy of the last 15 years: higher numbers on paper that cannot be achieved in practice, and lower than the numbers that nvidia's hardware achieves in practice.
Though this is a snapshot from ~2015, I don't think anything has significantly changed since then:
I developed a (more or less popular, though not in wide use) ML framework[0] back then. NVIDIA pretty quickly contacted us and offered to send a Titan X (top-of-the-line back then) our way.
We also tried to make the framework work with OpenCL, but were severely limited in doing so. Mostly because Nvidia intentionally limits their GPUs to an older OpenCL version that has an uncompetitive featureset. They started doing that once they had a decent lead in the market over AMD. With that, OpenCL was no longer vendor-agnostic, but rather AMD-only, and you would always need to fully support both backends to support both vendors.
Yes, Nvidia may produce the best GPUs right now and have great market and mindshare, but there is nothing "honest and deserved" about how they got there.
> because Nvidia intentionally limits their GPUs to an older OpenCL version that has an uncompetitive featureset
So I suppose you still used OpenCL for AMD and Intel hardware because that worked well there? That is, OpenCL only worked poorly on nvidia hardware?
If that's the case, I wonder what's your take on OpenCL 3.0 reverting all OpenCL > 1.2 features, such that OpenCL 3 is essentially OpenCL 1.2. The reason given by Khronos is that _nobody_ was implementing these features anyway. Yet your story sounds like AMD had great support for the newer OpenCL features, and only nvidia's support was poor, kind of contradicting Khronos themselves.
Maybe you meant that nvidia has poor support for OpenCL 1?
Not sure if I can offer a good take, as I don't 100% remember. We didn't have many resources to tackle OpenCL, as it was already hard enough to compete against the other frameworks on the CUDA side alone, so that's where our focus was until our company shut down. IIRC we had all the basics working with OpenCL 1.2 (with an Intel CPU and a Titan X as test devices) and were looking into transitioning towards 2.0, as the abstractions used there were much closer to those in CUDA. I particularly remember the memory management in 1.2 being a pain, along with difficulties around async RAM<->VRAM copies and subsequent async kernel execution (the overlap pattern sketched below). The latter in particular accounted for the biggest performance differences when profiling the same tasks on the Nvidia GPU.
I'm not really informed about what happened with OpenCL 3.0, but I would count any 2.x features as essential when trying to compete with modern CUDA. I don't think the support for OpenCL _the standard_ was bad, but there wasn't really any comparable tooling for OpenCL, and a set of good cuBLAS, cuDNN, etc. alternatives was missing. If OpenCL 3 is really reverting back to 1.2, that sounds like a big mistake.
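For reference, the copy/compute overlap I mentioned is only a few lines on the CUDA side these days; here's a minimal PyTorch sketch of the pattern (the batch size and the tiny model are arbitrary placeholders):

    import torch

    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream()

    # Pinned host memory is required for truly asynchronous host->device copies.
    host_batch = torch.randn(256, 3, 224, 224).pin_memory()
    model = torch.nn.Conv2d(3, 64, 3).to(device)

    # Issue the copy on a side stream so it can overlap with compute on the default stream.
    with torch.cuda.stream(copy_stream):
        dev_batch = host_batch.to(device, non_blocking=True)

    # ... unrelated GPU work could run on the default stream here ...

    # Make the default stream wait for the copy before touching the data.
    torch.cuda.current_stream().wait_stream(copy_stream)
    out = model(dev_batch)
    torch.cuda.synchronize()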
The fact that OpenCL is stuck on a C dialect, while CUDA is a polyglot runtime already puts OpenCL out of the game for me.
In fact it was thanks to this that Khronos finally woke up, but it was too late; almost no one cares to target SPIR, while SYCL just decided to go backend-agnostic with the whole 1.2 => 3.0 rebranding.
This post looks very thorough, and came just in time for me. I'm looking to snag an upgrade from my GTX 970 for a mix of flight sim 2020 and digging into Fast.ai's course part 2.
The 970 has been my big hold-up, right now even simple models take a really long time to work with.
Have you tried using Google Colab and other online platforms?
I started the course a few days ago and so far Colab works well (I don’t really like Jupyter but that’s a detail...). It’s free and you have the choice between a CPU, a GPU, and a TPU.
Disclosure (since Colab is a Google product): I work at Google, but everything I say is my personal opinion and experience.
I really dig the overall idea of cloud notebooks. Back when I did fast.ai part 1, I used Paperspace Gradient. It was a pretty good experience, but moving files around was a bit of a hassle. For example, getting the images for the Planet Labs exercise took a round trip of downloading from Kaggle to my computer and re-uploading into Jupyter to do analysis.
Because of all those moving parts, I decided to give running things locally a try. To my surprise, setup was super easy and I was quickly productive! I really dig how customizable a local Jupyter server is, too.
I do use Colab, it's particularly great for collaboration/sharing notebooks, but my past experience has me hooked on the idea of a capable ML machine at home.
Plus: I can pitch it to myself and my spouse as an investment in personal development that happens to be able to game :D
Go for it! I have my own Jupyter docker image that I run on a server in the basement. PyCharm can even do code completion for TensorFlow inside a remote docker container. So it's instant, reproducible, and I don't hear a thing in my office :)
Thank you for posting the archived link. I read it off this archive page, and it's great. This article is so good that it made it to the HN front page without even being available!
From just Hacker News traffic? How is that possible, even a (properly configured) Wordpress installation on a Nanode can handle a spike of 10,000+ connections :/