I appreciate the time and care that went into this post, and there’s a nice discussion of various features.
Unfortunately the performance charts are completely divorced from reality, and in particular the discussion on tensor cores may be true from an instruction-count perspective but does not reflect any third-party benchmark I’ve seen. For example: https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks.... Nvidia has a history of straight-up lying about tensor core and other benchmarks (for example, see this thread from right after Nvidia announced an 8x improvement in speed on ImageNet in TensorFlow for the V100: https://github.com/tensorflow/benchmarks/issues/77)
In general, fp16 is only 30-40% faster than fp32, and occasionally 2x in really optimal conditions.
> Unfortunately the performance charts are completely divorced from reality, and in particular the discussion on tensor cores may be true from an instruction-count perspective but does not reflect any third-party benchmark I’ve seen. For example: https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks...
The performance numbers posted here appear to almost exactly reflect the LambdaLabs numbers.
Lambda Labs: the RTX 2080 Ti is 96% as fast as Titan V, 73% as fast as Tesla V100 (32 GB)
timdettmers: RTX 2080 Ti normalized to 1, Titan V looks about 1.1 to 1.2, V100 is just below 1.5
> for example, see this thread from right after Nvidia announced an 8x improvement in speed on imagenet in tensorflow for V100
In my experience NVidia benchmark numbers in deep learning are rarely lies - they are highly optimised, in optimal conditions and rarely achievable in the real world. About what you'd expect from a vendor benchmark.
Thank you for cross-referencing that; the data does look accurate and my statement now seems exaggerated. I do think we need skepticism on the A100 charts, though, until third-party benchmarks come out.
> In my experience NVidia benchmark numbers in deep learning are rarely lies - they are highly optimised, in optimal conditions and rarely achievable in the real world.
Right, but Nvidia claimed 1360 images/sec for ResNet-50 on ImageNet. To my knowledge this still hasn’t been realized by a third party. It also isn’t a 4x improvement for fp16 over fp32; that figure comes from comparing against the previous generation. The improvement is more like 1.5x: https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v...
Dell hit 5,243 images/sec with one of their 4x V100 servers, which comes to 1,310 images/sec per V100. I find it very believable that NVidia would get ~200 images/sec more, since Dell jumped 50% with a change in their CPU/GPU connection topology.
Thank you for finding that! I’m glad to see the situation has improved since I last looked. 3x improvement over fp32 is impressive for sure. Their marketing claims of 8x still bother me though.
The only claim I've seen is a fairly limited one ("NVIDIA GPUs offer up to 8x more half precision arithmetic throughput when compared to single-precision, thus speeding up math-limited layers."[1]), which is probably true.
The problem with performance improvements is the diminishing-returns part of Amdahl's Law: an 8x improvement in math performance just means the math part becomes less important for absolute performance (a quick sketch of this follows below).
In any case, I've found NVidia's claims in the machine learning area to be pretty good. Like most claims you have to read carefully to see exactly what the claim is, but that's not uncommon with performance claims.
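To make the Amdahl's Law point concrete, here's a minimal sketch (the math-limited fractions are purely illustrative, not measurements of any particular network):

    def amdahl_speedup(math_fraction, math_speedup):
        """Overall speedup when only the math-limited fraction of runtime gets faster."""
        return 1.0 / ((1.0 - math_fraction) + math_fraction / math_speedup)

    # Even with 8x faster math, end-to-end gains shrink quickly once
    # memory-bound time starts to dominate.
    for frac in (0.9, 0.7, 0.5):
        print(f"math fraction {frac:.0%}: overall speedup {amdahl_speedup(frac, 8):.2f}x")

Even when 90% of the runtime is math-limited, an 8x math speedup only buys you about 4.7x end to end; at 50% it's under 1.8x.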
> NVIDIA GPUs offer up to 8x more half precision arithmetic throughput when compared to single-precision, thus speeding up math-limited layers
Right, but I’ve benchmarked the best-case scenario, i.e. a large GEMM call in C++, and still not seen anywhere close to 8x. I’ve never seen a code example, no matter how limited, showing an 8x speed-up.
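If anyone wants to reproduce that kind of best-case measurement without writing raw cuBLAS C++, here's a rough PyTorch sketch of the same idea (matrix size and iteration count are arbitrary, and the fp16/fp32 ratio you get will depend heavily on the card and library versions):

    import time
    import torch

    def bench_gemm(dtype, n=8192, iters=50):
        a = torch.randn(n, n, device="cuda", dtype=dtype)
        b = torch.randn(n, n, device="cuda", dtype=dtype)
        for _ in range(5):          # warm-up
            a @ b
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            a @ b
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - t0
        return 2 * n**3 * iters / elapsed / 1e12   # achieved TFLOPS

    fp32 = bench_gemm(torch.float32)
    fp16 = bench_gemm(torch.float16)
    print(f"fp32: {fp32:.1f} TFLOPS, fp16: {fp16:.1f} TFLOPS, ratio: {fp16 / fp32:.2f}x")

Large square matrices with dimensions that are multiples of 8 should give the tensor cores their best shot, so this is about as favourable a setup as you can get.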
I should have been a bit clearer about what went into the charts. I do not use theoretical marketing numbers but real-life benchmark data from NVIDIA and 4 other benchmark sources covering the Titan V, V100, RTX 2080, RTX 2080 Ti, and Titan RTX. Since I calibrate a model that needs to satisfy all sources as best as it can, I think the numbers are pretty accurate.
Thank you for clarifying! I’m still skeptical of the chart’s A100 values but appreciate your reasonable attempt to de-bias. It’s always easier to critique than create, so I also want to make sure I compliment you on an excellent article :).
Thank you, I just updated the blog post with more detailed clarification of where the data comes from.
One thing that I am quite sure of for the A100 is its transformer performance. It turns out large transformers are so strongly bottlenecked by memory bandwidth that you can just use memory bandwidth alone to measure performance — even across GPU architectures. The error between Volta and Turing with a pure bandwidth model is less than 5%. The NVIDIA transformer A100 benchmark data shows similar scaling (sketched below). So I am pretty confident in the transformer numbers.
The computer vision numbers are more dependent on the network, and it is difficult to generalize across all CNNs. For example, group convolution or depthwise separable convolution based CNNs do not scale well with better GPUs and speedups will be small (1.2x-1.5x), whereas other networks like ResNet get pretty straightforward improvements (1.6x-1.7x). So CNN values are less straightforward because there is more diversity between CNNs compared to transformers.
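Here is roughly what that bandwidth-only model looks like in code (the bandwidth figures are approximate spec-sheet values, and the ratios are what the model predicts for large transformers, not measured benchmarks):

    # Approximate peak memory bandwidth in GB/s (spec-sheet ballpark values).
    bandwidth = {
        "RTX 2080 Ti": 616,
        "Titan V": 653,
        "V100 (SXM2)": 900,
        "A100 (40 GB)": 1555,
    }

    base = bandwidth["RTX 2080 Ti"]
    for gpu, bw in bandwidth.items():
        # Bandwidth-bound model: predicted speedup is simply the bandwidth ratio.
        print(f"{gpu}: predicted transformer speedup vs. RTX 2080 Ti ~ {bw / base:.2f}x")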
It really depends on the workload. For ImageNet and ResNet-type architectures it's not unusual to get a 3x speed-up. It also depends on whether you do full fp16, leave BNs alone, etc. This matters a lot: instead of training for 6 days, you're now training for 2 days.
> For ImageNet and ResNet-type architectures it's not unusual to get a 3x speed-up
Source on this? I've done a good bit of CV benchmarking work and I don't recall anything like a 3x boost. 30-40% improvement is much more in line with what I remember.
Bought a second-hand GTX 1060 a while ago to play around a bit more seriously with neural networks. It's a good balance: cheap, 6 GB of memory, but still serious enough to get some work done. If you are a professional researcher or do multiple Kaggles per month, then yes, get the best card. But I suspect a lot of people are one category "below" that, where a 1060 is sufficient most of the time and you go to the cloud for the actual big workloads.
This strategy can keep you going for a couple of years. With models becoming as big as they are, I doubt how much SOTA work an RTX 3070 will manage in 2-3 years (actually, none of these cards come close to GPT-3). By then you can pick up a second-hand RTX 30-series card and still get the latest offerings in the cloud.
Buying a second-hand GPU comes with a bit of risk, by the way; someone suggested only buying if the price is less than half of the original (can't remember the link).
Interestingly, the 1060 was mentioned among the suggested cards in one of the old versions of this same article (it gets updated every time Nvidia releases a new series).
> Do not buy GTX 16s series cards. These cards do not have tensor cores and, as such, provide relatively poor deep learning performance. I would choose a used RTX 2070 / RTX 2060 / RTX 2060 Super any day over a GTX 16s series card
...a few paragraphs later...
> If that is too expensive, a used GTX 980 Ti (6GB $150) or a used GTX 1650 Super ($190).
Having a GPU is better than not having one, even if it doesn't have tensor cores.
If you are on a tight budget, his advice is to pick in this order:
> I have little money: Buy used cards. Hierarchy: RTX 2070 ($400), RTX 2060 ($300), GTX 1070 ($220), GTX 1070 Ti ($230), GTX 1650 Super ($190), GTX 980 Ti (6GB $150).
I guess it should be taken as:
* Strongly prefer cards with Tensor cores -> suggested cards
* If you need a card because you're currently CPU-only and you're strapped for cash, these are your best bets, but only if you have no other option.
The intent is probably “get the newest RTX you can afford if you’re a full-time researcher; otherwise most recent GTX cards do fine”, but it’s not explicitly stated in the text.
Kudos to the author on producing an excellent, comprehensive yet readable post (based on my initial, very brief review)! Much appreciated. One thing that jumped out at me, though, is the recommendation for the "I want to try deep learning, but I am not serious about it" scenario. Advising the use of a physical GPU (unless it's already part of the system at hand), especially an RTX 2060 Super, IMO does not make sense in this case. Using cheap cloud GPU instances is the optimal way to try deep learning.
Thank you! You have a good point. I think I would agree with you: if somebody already has cloud computing skills, then the cloud is a much more powerful way to learn deep learning than your own GPU.
I figured that most people who start with deep learning might also lack cloud computing skills. Learning one thing at a time is easier, and as such, just sticking a GPU into your desktop and focusing on deep learning software / programming might yield a better experience.
I might update my blog post in the future with this detail.
You're welcome! While I understand your rationale now, I'm afraid I still disagree with it. :-) Simply because I find it very unlikely that people interested in and having enough skills to embark on any reasonable deep learning journey (even if just to try) would lack enough cloud computing skills to use cloud GPU instances. After all, using GPUs in the cloud is not much different from (and, thus, not more complex than) using physical GPUs in your local machine.
Fun fact: I've saved your post as a PDF for offline reading and it clocked in at 649(!) pages at the time of saving (32 pages for the post per se and the rest for the blog comments). Combining that with the feedback here at HN, it is clear that there is quite a lot of interest in the topic...
Haha, 649 pages! Thanks for the discussion. I can understand your perspective. Maybe it would be best to add something to the blog post that discusses it from both perspectives and readers can then choose which perspective suits them better.
I should also say a bit more in general about cloud computing; it seems some people agree that the post ran a bit short on that. At some point I just wanted to be done with it though — editing 10k-word blog posts is not so much fun anymore!
It's my pleasure. I agree - it's a good idea to present both perspectives and allow readers to decide what works best for them.
I can certainly understand you being hesitant to add more content to an already sizeable post. Perhaps several small paragraphs on the important relevant aspects might still be worth considering (take it with a grain of salt, since I haven't actually read your post in detail, including the cloud-related parts). Anyway, thank you very much, again, for your time and effort. Keep it up!
You're very welcome! BTW, do you have any interest in and time for potential consulting or advisory for a frontier/deep tech startup (ambitious goals, challenging tasks, great impact)? Not an immediate need, but, hopefully, it will be a more real opportunity in not so distant future.
Just noting that you can (still) do very well on non top-of-the-line cards.
I've won multiple silver Kaggle medals on a 1070. It's true that more power would be helpful, but I feel it's lack of technique (and time!) rather than compute that has held me back from gold medals.
Not disagreeing with you, but from what I've heard, the people who win Kaggle competitions are the people who can try more things; having a faster card presumably allows you to try more things because each attempt takes less time.
There is some truth in this, but it's not the entire picture.
I've thought about it a lot, and talked to lots of really good Kagglers about it. Most of them use multiple machines, rather than having maximum performance in a single machine.
This lets them run multiple completely different experiments at once, and then put extra compute onto the ones that seem good.
That is a big difference to me, when I have to experiment sequentially. The parallelism is more important than absolute speed a lot of the time.
Good article! But the comment on NVLink / PCIe 4.0 (and the claim that PCIe 3.0 x4 is good enough) doesn’t fit my experience. PCIe 3.0 x4 can really hurt your all-reduce performance for models such as transformers (rough arithmetic below). For ResNet it can matter a bit too, but only in the 5% to 10% range the author mentioned.
I am also interested in whether PCIe 4.0 can help unified memory for larger models. Guess I'll have to wait for the actual RTX 3090 release.
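Rough arithmetic on why the interconnect dominates all-reduce time (the model size, GPU count, and effective bus speeds below are illustrative assumptions, not measurements):

    def allreduce_seconds(params, n_gpus, bus_gb_per_s, bytes_per_grad=4):
        """Ring all-reduce moves ~2*(N-1)/N of the gradient buffer over the bus per GPU per step."""
        grad_bytes = params * bytes_per_grad
        traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes
        return traffic / (bus_gb_per_s * 1e9)

    params = 300e6  # a mid-sized transformer with fp32 gradients (assumed)
    for name, bw in [("PCIe 3.0 x4 (~3 GB/s effective)", 3.0),
                     ("PCIe 3.0 x16 (~12 GB/s effective)", 12.0),
                     ("NVLink (~45 GB/s effective)", 45.0)]:
        print(f"{name}: ~{allreduce_seconds(params, 4, bw) * 1000:.0f} ms per step")

With these assumed numbers, syncing ~1.2 GB of gradients across 4 GPUs is hundreds of milliseconds per step over PCIe 3.0 x4, which is exactly the kind of overhead you feel on transformers.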
I agree. When training production speech recognition models (wav2letter) on huge datasets, I hit PCIe, NVLink, network, and probably storage bottlenecks when scaling. (This isn’t comparable to training on 40GB of OpenWebText; it’s terabytes of compressed audio.)
When I tested the DGX2, 16-card bandwidth was finally not an issue (I was extremely happy with the NVSwitch)... but I hit a CPU bottleneck going from [64vcpu + 8 V100] to [96vcpu + 16 V100] (even with tricks to reduce CUDA CPU load).
I’m excited to sometime try my workload on 2x NVLinked 3090s with a 64-core AMD CPU, optimized PCIe v4 NVMe data path, additional caching, and some hand tuning of the training code. It’s possible this will be competitive with an 8x V100 google cloud instance based on some of my scaling pain graphs.
I’d consider 4x 3090s, but between only having a single NVLink port and having to figure out how to even fit four 3-slot cards in a case without water cooling, it seems prudent to start with two.
What are you talking about? Are you writing custom CUDA code to run those large models? Because there's zero support for unified memory in any of the existing DL frameworks.
I want an external 3090 chassis on Thunderbolt, similar to this box. No muss, no fuss. AC, liquid cooling, all in an engineered box. Gaming or ML with my laptop or tablet PC.
I'm sure that eGPU enclosure manufacturers will eventually produce models fully compatible with GeForce RTX 30 series. In the meantime, FYI there are some relevant discussions regarding potential compatibility (with pin adapters) of some eGPU enclosures currently available on the market, e.g.: https://egpu.io/forums/thunderbolt-enclosures/razer-core-x-c....
Just buy a good ITX PC. eGPU boxes often cost more than a PC, are larger as well, and perform worse due to the TB3 link and overall clunkiness.
Thunderbolt 4 is on its way with Intel Tiger Lake, which doubles the bandwidth, but that still leaves limitations: you're limited to x8 PCIe lanes, which means you're leaving performance on the table with high-end GPUs, and you're locked into the Intel ecosystem, so no Ryzen CPUs for you :(
And they also cost $$$, which makes them a luxury solution only suitable for users who absolutely want a thin, portable notebook for on the go and a powerful GPU at home for AI/gaming.
Out of curiosity: there are now enterprise sellers offloading old Tesla cards relatively cheap (i.e. K40 ~$100-$150, K80/M40 ~$150-$200); are these worth looking at on a budget versus the 900- or 1600-series cards? Especially given the memory options.
Is anyone else trying to decide whether to upgrade from 1080 Tis?
I have a box with four of them which have served me well for a while now, and based on just raw performance it looks like upgrading to two RTX 3080s would exceed the performance of my current system.
I'm wondering if I should rush to sell off the cards on the used market before the prices crash and then use that money to swap over to Ampere.
Then there’s also the question of whether there will be an RTX 3080 Ti that blows away the RTX 3080 and is a viable card for the next five years like the 1080 Ti.
I'm really uncertain about what to do and wonder what calculus other people have done on this decision.
4x 1080Ti are probably about as fast as 2x 3080 if you're able to use FP16. They are most likely faster for FP32. And in any case they provide almost 2x memory.
Mmm, I don't see any mention of cloud offerings or the T4. The Google Cloud team published a more detailed and comprehensive blog post which includes more important aspects: inference, training, cost, and time. https://cloud.google.com/blog/products/ai-machine-learning/y...
Really great to publish these builds and GPU suggestions. Putting together a system that works can be really frustrating, and knowing where to start is really helpful.
I've gone with GPU spot instances for my personal experiments. The key is to be able to bring up a machine in a couple minutes so you're ok with always tearing one down. A combination of ansible and some scripts that push code around helped a lot to create a useful environment for experimenting.
One thing I did after the RTX 30-series announcement was a back-of-the-envelope comparison of performance per dollar and performance per watt, taking NVIDIA's numbers at face value. The 3070 and 3080 are surprisingly close on both metrics. You pay a substantial premium for the 3090, but it does have the best performance per watt.
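If anyone wants to redo the same exercise, here's the sort of thing I mean. The MSRP and TDP values are the announced launch figures, but the relative performance values are placeholder assumptions (not measurements) that you should swap for whatever numbers you trust:

    # MSRP (USD) and TDP (W) from the launch announcement; relative_perf is a
    # placeholder assumption (RTX 3070 = 1.0), not a benchmark result.
    cards = {
        "RTX 3070": {"price": 499, "tdp": 220, "relative_perf": 1.00},
        "RTX 3080": {"price": 699, "tdp": 320, "relative_perf": 1.40},
        "RTX 3090": {"price": 1499, "tdp": 350, "relative_perf": 1.65},
    }

    for name, c in cards.items():
        per_dollar = c["relative_perf"] / c["price"] * 1000   # perf per $1000
        per_watt = c["relative_perf"] / c["tdp"] * 100        # perf per 100 W
        print(f"{name}: {per_dollar:.2f} perf/$1000, {per_watt:.2f} perf/100 W")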
For NLP applications the deciding factor is actually GPU memory, so the choice is limited (the V100 32GB is the best, and nothing below 16GB is worth considering).
Answer: Whatever your current generation mid-level Nvidia Geforce is. It's been this way for a while.
Though you should probably use AWS ML Compute, since they even have Nvidia Ampere A100s, which cost $10,000 each, and it'll probably be more cost effective for heavier workloads.
Would be interested to know how the Tesla T4s fit in, if at all. They seem to be by far the most "affordable" option if your goal is to get in at the low end in the data center space. But I'm not at all sure whether they represent value in terms of $/compute.
T4s are essentially a 2070 Super with twice the device memory (plus minor changes to clock speeds to account for power/cooling). ~5x the price, but suitable for the data centre and larger networks.
Does anyone here know of an article comparing GPU architectures for non-deep-learning work with a depth similar to this article, addressing things like memory latencies, cycles, caches, etc.? Kudos to the author of the above article, I really enjoyed reading it!
CUDA owns the industry. AMD has to step up with a good alternative to disrupt it. Any serious work is being done in CUDA and there are lots of resources for CUDA.
You don't need to use CUDA for GPU programming, no? It's simply lock-in. But I'm sure Nvidia was pushing it quite a lot to make sure it's hard not to use it. But it clearly has to go. It's not healthy for the industry.
> You don't need to use CUDA for GPU programming, no?
No, you don't _need_ to use CUDA for GPU programming, you can use OpenCL or Vulkan or probably even PHP instead.
I do, however, _want_ to use the best programming language for the task at hand. If that task is GPU programming, CUDA is the best language I know for that, much better than SyCL, OpenCL, Vulkan / OpenGL + shaders, etc.
If these other technologies would be better, I would use them instead.
CUDA can't be the best if it's tied to a single GPU vendor. It's DOA by definition. This idea of "a language that only works on this hardware" is out of some dinosaur lock-in handbook from the last century.
CUDA compiles to CPUs, and AMD has support for CUDA via HIP.
Not that this matters because your argument is flawed.
The claim that CUDA is not worth using because it lacks portability only holds if there is hardware worth using that's not supported by CUDA.
The only GPUs worth buying for compute are from nvidia and support CUDA, so your claim isn't true.
The only thing you achieve today by not using CUDA is paying a huge price in development quality for portability that you can't use.
The startup cemetery is filled with companies that made this trade-off and picked OpenCL just in case they wanted to use non-nvidia hardware. They were all killed by the velocity of their competitors, who were using CUDA to deliver better products that paid the bills.
The only people for which it might make sense to avoid CUDA are "non-professionals" (hobbyist, etc.). If you only want to use OpenCL to "learn OpenCL", then OpenCL is the right choice. But if you want to make money, then CUDA was the right choice 15 years ago and still is the right choice today.
If that makes you angry, direct your anger properly. It isn't NVIDIA's fault that CUDA is really good. It is, however, AMD's, Intel's, Apple's, Qualcomm's, ARM's... fault that everything else _sucks hard_. Being angry at nvidia for delivering good products is just stupid. It's the other companies' fault that they can't seem to get their sh* together when it comes to GPU computing.
> The only GPUs worth buying for compute are from nvidia
That sounds like marketing kool-aid to me. AMD GCN was more compute-oriented than Nvidia's architectures for years, and only lately has AMD increased its focus on gaming with RDNA.
That's a fact: check the HPL, MLPerf, SPEC, etc. results. MLPerf is the perfect example, where your results are only accepted if they can be verified by others. Where is AMD in there? (Nowhere; their products suck for compute.)
> AMD GCN was more compute oriented than Nvidia for years
No, the only thing AMD GCN was good for is as a very expensive stove.
AMD GCN had a lot of compute, on paper, and higher numbers than nvidia GPUs of the time. Unfortunately, AMD GCN's memory subsystem sucked, and it was impossible to deliver data fast enough to actually be able to use the compute.
So nvidia's hardware essentially destroyed GCN for any useful practical application.
IIRC, the only application for which GCN got some use was bitcoin mining, which avoided hitting GCN's issues because it just requires doing a ton of useless work on a tiny amount of memory. Perfect for GCN, right? Nope, nvidia's hardware was still better, but sold out, and GCN wasn't horrible at this, so it got some use.
AMD actually fired the architect of GCN over this. Yet this still perfectly summarizes AMD's GPGPU strategy of the last 15 years: higher numbers on paper that cannot be achieved in practice, and lower than the numbers that nvidia's hardware achieves in practice.
Though this is a snapshot from ~2015, I don't think anything has significantly changed since then:
I developed a (more or less popular, though not in wide use) ML framework[0] back then. NVIDIA pretty quickly contacted us and offered to send a Titan X (top-of-the-line back then) our way.
We also tried to make the framework work with OpenCL, but were severely limited in doing so. Mostly because Nvidia intentionally limits their GPUs to an older OpenCL version that has an uncompetitive featureset. They started doing that once they had a decent lead in the market over AMD. With that, OpenCL was no longer vendor-agnostic, but rather AMD-only, and you would always need to fully support both backends to support both vendors.
Yes, Nvidia may produce the best GPUs right now and have great market and mindshare, but there is nothing "honest and deserved" about how they got there.
> because Nvidia intentionally limits their GPUs to an older OpenCL version that has an uncompetitive featureset
So I suppose you still used OpenCL for AMD and Intel hardware because that worked well there? That is, OpenCL only worked poorly on nvidia hardware?
If that's the case, I wonder what's your take on OpenCL 3.0 reverting all OpenCL > 1.2 features, such that OpenCL 3 is essentially OpenCL 1.2. The reason given by Khronos is that _nobody_ was implementing these features anyway. Yet your story sounds like AMD had great support for the newer OpenCL features, and only nvidia's support was poor, kind of contradicting Khronos themselves.
Maybe you meant that nvidia has poor support for OpenCL 1?
Not sure if I can offer a good take, as I don't 100% remember. We didn't have many resources to tackle OpenCL, as it was already hard enough to compete against the other frameworks on the CUDA side alone, so that's where our focus was until our company shut down. IIRC we had all the basics working with OpenCL 1.2 (with an Intel CPU and a Titan X as test devices) and were looking into transitioning towards 2.0, as the abstractions used there were much closer to those in CUDA. I particularly remember the memory management in 1.2 being a pain, along with difficulties around async RAM<->VRAM copies and subsequent async kernel execution (the overlap pattern sketched below). The latter in particular accounted for the biggest performance differences when profiling the same tasks on the Nvidia GPU.
I'm not really informed about what happened with OpenCL 3.0, but I would count any 2.x features as essential when trying to compete with modern CUDA. I don't think the support for OpenCL _the standard_ was bad, but there wasn't really any comparable tooling for OpenCL, and a set of good cuBLAS, cuDNN, etc. alternatives was missing. If OpenCL 3 is really reverting back to 1.2, that sounds like a big mistake.
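For reference, the copy/compute overlap I mentioned is only a few lines on the CUDA side these days; here's a minimal PyTorch sketch of the pattern (the batch size and the tiny model are arbitrary placeholders):

    import torch

    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream()

    # Pinned host memory is required for truly asynchronous host->device copies.
    host_batch = torch.randn(256, 3, 224, 224).pin_memory()
    model = torch.nn.Conv2d(3, 64, 3).to(device)

    # Issue the copy on a side stream so it can overlap with compute on the default stream.
    with torch.cuda.stream(copy_stream):
        dev_batch = host_batch.to(device, non_blocking=True)

    # ... unrelated GPU work could run on the default stream here ...

    # Make the default stream wait for the copy before touching the data.
    torch.cuda.current_stream().wait_stream(copy_stream)
    out = model(dev_batch)
    torch.cuda.synchronize()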
The fact that OpenCL is stuck on a C dialect, while CUDA is a polyglot runtime already puts OpenCL out of the game for me.
In fact it was thanks to this that Khronos finally woke up, but it was too late; almost no one cares to target SPIR, while SYCL just decided to go backend-agnostic with the whole 1.2 => 3.0 rebranding.
This post looks very thorough, and came just in time for me. I'm looking to snag an upgrade from my GTX 970 for a mix of flight sim 2020 and digging into Fast.ai's course part 2.
The 970 has been my big hold-up, right now even simple models take a really long time to work with.
Have you tried using Google Colab and other online platforms?
I started the course a few days ago and so far Colab works well (I don’t really like Jupyter but that’s a detail...). It’s free and you have the choice between a CPU, a GPU, and a TPU.
Disclosure (since Colab is a Google product): I work at Google, but everything I say is my personal opinion and experience.
I really dig the overall idea of cloud notebooks. Back when I did fast.ai part 1, I used Paperspace Gradient. It was a pretty good experience, but moving files around was a bit of a hassle. For example, getting the images for the Planet Labs exercise took a round trip of downloading from Kaggle to my computer and re-uploading into Jupyter to do analysis.
Because of all those moving parts, I decided to give running things locally a try. To my surprise, setup was super easy and I was quickly productive! I really dig how customizable a local Jupyter server is, too.
I do use Colab, it's particularly great for collaboration/sharing notebooks, but my past experience has me hooked on the idea of a capable ML machine at home.
Plus: I can pitch it to myself and my spouse as an investment in personal development that happens to be able to game :D
Go for it! I have my own Jupyter docker image that I run on a server in the basement. PyCharm can even do code completion for TensorFlow inside a remote docker container. So it's instant, reproducible, and I don't hear a thing in my office :)
Thank you for posting the archived link. I read it off this archive page, and it's great. This article is so good that it made it to the HN front page without even being available!
From just Hacker News traffic? How is that possible, even a (properly configured) Wordpress installation on a Nanode can handle a spike of 10,000+ connections :/