Thank you, I just updated the blog post with more detailed clarification of where the data comes from.
One thing that I am quite sure of for the A100 is its transformer performance. It turns out, large transformers are so strongly bottlenecked by memory bandwidth that you can use memory bandwidth alone to predict performance — even across GPU architectures. The error between Volta and Turing with a pure bandwidth model is less than 5%. The NVIDIA transformer A100 benchmark data shows similar scaling. So I am pretty confident in the transformer numbers.
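To make the bandwidth-only model concrete, here is a minimal sketch. The bandwidth figures are published memory specs; the function name and structure are my own illustration, and the model simply assumes transformer runtime scales inversely with memory bandwidth, as described above.

```python
# Published peak memory bandwidth in GB/s (spec-sheet values).
BANDWIDTH_GBPS = {
    "V100": 900,         # Volta, HBM2
    "RTX 2080 Ti": 616,  # Turing, GDDR6
    "A100": 1555,        # Ampere, HBM2e
}

def predicted_speedup(baseline: str, target: str) -> float:
    """Speedup of `target` over `baseline` under the bandwidth-only model."""
    return BANDWIDTH_GBPS[target] / BANDWIDTH_GBPS[baseline]

print(round(predicted_speedup("V100", "A100"), 2))  # ~1.73x
```

Under this model the A100's large bandwidth jump over the V100 translates directly into a roughly 1.7x transformer speedup, which is why a bandwidth-only estimate lands so close to benchmark results.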
The computer vision numbers depend more on the specific network, and it is difficult to generalize across all CNNs. For example, CNNs based on group convolution or depth-wise separable convolution do not scale well with better GPUs, so speedups will be small (1.2x to 1.5x), whereas other networks like ResNet see fairly straightforward improvements (1.6x to 1.7x). So the CNN values are less clear-cut because there is more diversity among CNNs than among transformers.