
Isn't it misleading to just add up the output width of all SIMD ALU pipelines and call the sum "datapath width", when you can't freely mix and match, because the available ALU pipelines determine which operations you can compute at full width?


You are right that in most CPUs the 3 or 4 vector execution units are not completely identical.

Therefore some operations may use the entire datapath width, while others may use only a fraction, e.g. a half, two thirds, or three quarters.

However you cannot really discuss these details without listing all such instructions, i.e. reproducing the tables from the Intel or AMD optimization guides or from Agner Fog's optimization documents.

For the purpose of this discussion thread, these details are not really relevant, because Intel and AMD classify the instructions in mostly the same way. Cheap instructions, like additions, can be executed in all execution units, using the entire datapath width, while certain more expensive operations, like multiplication, division, square root, or shuffle, may be done only in a subset of the execution units, so they can use only a fraction of the datapath width (though when possible they will be paired with simple instructions using the remainder of the datapath, maintaining a total throughput equal to the datapath width).
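As a concrete illustration (my own sketch, not from the thread or the optimization guides), the classification shows up in a toy microbenchmark contrasting a cheap instruction (VPADDD, typically executable on every vector ALU port) with a more expensive one (VPMULLD, typically restricted to a subset of ports). Several independent accumulators are used so that throughput rather than latency limits the loops; the ratio you measure depends on the microarchitecture:

    /* Compile with e.g.: gcc -O2 -mavx2 bench.c */
    #include <immintrin.h>
    #include <stdio.h>
    #include <time.h>

    enum { ITERS = 100 * 1000 * 1000 };

    static double seconds(void) {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return t.tv_sec + t.tv_nsec * 1e-9;
    }

    int main(void) {
        __m256i a0 = _mm256_set1_epi32(1), a1 = a0, a2 = a0, a3 = a0;
        const __m256i b = _mm256_set1_epi32(3);

        double t0 = seconds();
        for (long i = 0; i < ITERS; i++) {   /* cheap: runs on all vector ports */
            a0 = _mm256_add_epi32(a0, b);
            a1 = _mm256_add_epi32(a1, b);
            a2 = _mm256_add_epi32(a2, b);
            a3 = _mm256_add_epi32(a3, b);
        }
        double t_add = seconds() - t0;

        t0 = seconds();
        for (long i = 0; i < ITERS; i++) {   /* expensive: restricted to fewer ports */
            a0 = _mm256_mullo_epi32(a0, b);
            a1 = _mm256_mullo_epi32(a1, b);
            a2 = _mm256_mullo_epi32(a2, b);
            a3 = _mm256_mullo_epi32(a3, b);
        }
        double t_mul = seconds() - t0;

        /* keep the accumulators live so the loops are not deleted */
        __m256i s = _mm256_add_epi32(_mm256_add_epi32(a0, a1),
                                     _mm256_add_epi32(a2, a3));
        printf("add: %.3f s  mullo: %.3f s  (sink=%d)\n",
               t_add, t_mul, _mm256_extract_epi32(s, 0));
        return 0;
    }

More independent chains may be needed to fully saturate the multiplier pipes, since VPMULLD also has a longer latency than VPADDD.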

Because most instructions are classified by cost in the same way by AMD and Intel, the throughput ratio between AMD and Intel is typically the same both for instructions using the full datapath width and for those using only a fraction.

As I have said, with very few exceptions (including FMUL/FMA/LD/ST), the throughput for 512-bit instructions has been the same for Zen 4 and the Intel CPUs with AVX-512 support, as determined by the common 1024-bit datapath width, including for the instructions that could use only a half-width 512-bit datapath.


Wouldn't it be 1536-bit for 2 256-bit FMA/cycle, with FMA taking 3 inputs? (This applies equally to both, so it doesn't change anything materially; and it even goes back to Haswell, which is also capable of 2 256-bit FMA/cycle.)


That is why I have written the throughput "for results", to clarify the meaning (the throughput for output results is determined by the number of execution units; it does not depend on the number of input operands).

The vector register file has a number of read and write ports, e.g. 10 x 512-bit read ports for recent AMD CPUs (i.e. 10 ports can provide the input operands for 2 x FMA + 2 x FADD, when no store instructions are done simultaneously).
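For concreteness, here is a sketch (my own illustration, not from any vendor documentation) of a loop body whose steady state consumes exactly that 10-read-port budget, assuming the compiler keeps all operands in zmm registers after inlining:

    /* 2 x FMA  x 3 inputs = 6 reads
     * 2 x FADD x 2 inputs = 4 reads
     * total               = 10 x 512-bit register reads per cycle
     * Compile with e.g.: gcc -O2 -mavx512f */
    #include <immintrin.h>

    static inline void step(__m512 *s0, __m512 *s1, __m512 *s2, __m512 *s3,
                            __m512 a, __m512 b, __m512 c, __m512 d,
                            __m512 e, __m512 f) {
        *s0 = _mm512_fmadd_ps(a, b, *s0); /* FMA: reads a, b, *s0 */
        *s1 = _mm512_fmadd_ps(c, d, *s1); /* FMA: reads c, d, *s1 */
        *s2 = _mm512_add_ps(*s2, e);      /* FADD: reads *s2, e   */
        *s3 = _mm512_add_ps(*s3, f);      /* FADD: reads *s3, f   */
    }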

So a detailed explanation of the "datapath widths" would have to take into account the number of read and write ports, because some combinations of instructions cannot be executed simultaneously even when there are available execution units, because the paths between the register file and the execution units are occupied.

To complicate matters further, some combinations of instructions that would be prohibited by not having enough register read and write ports can actually be done simultaneously, because there are bypass paths between the execution units that allow sharing some input operands, or using output operands directly as input operands, without passing through the register file.

The structure of the Intel vector execution units, with 3 x 256-bit execution units, 2 of which can do FMA, goes indeed back to Haswell, as you say.

The Lion Cove core launched in 2024 is the first Intel core that uses the enhanced structure that AMD Zen has used for many years, with 4 execution units, where 2 can do FMA/FMUL, but all 4 can do FADD.

Starting with the Skylake Server CPUs, the Intel CPUs with AVX-512 support retain the Haswell structure when executing 256-bit or narrower instructions, but when executing 512-bit instructions, 2 x 256-bit execution units are paired to make a 512-bit execution unit, while the third 256-bit execution unit is paired with an otherwise unused 256-bit execution unit to make a second 512-bit execution unit.

Of these 2 x 512-bit execution units, only one can do FMA. Certain Intel SKUs add a second 512-bit FMA unit, so in those both 512-bit execution units can do FMA (this fact is mentioned where applicable in the CPU descriptions from the Intel Ark site).


So the 1024-bit number is the number of vector output bits per cycle, i.e. 2×FMA+2×FADD = (2+2)×256-bit? Is the term "datapath width" used for that anywhere else? (I guess you've prefixed that with "total " in some places, which makes much more sense)


"Datapath width" is somewhat ambiguous.

For most operations, an ALU has a width measured in 1-bit subunits, e.g. adders. The number of subunits equals the width in bit lines of the output path and of each of the 2 input paths used for most input operands. Some operations use only one input path, while others, like FMA or bit select, may need 3 input paths.

The width of the datapath is normally taken to be the number of 1-bit subunits of the execution units, which is equal to the width in bit lines of the output path.

Depending on the implemented instruction set, the number of input paths having the same width as the output path may vary, e.g. between 2 and 3. In reality this is even more complicated, e.g. for 4 execution units you may have 10 input paths whose connections can be changed dynamically, so they may provide 3 input paths for some execution units and 2 input paths for others, depending on which micro-operations happen to be executed there during a clock cycle. Moreover, there may be many additional bypass operand paths.

Therefore, if you say that the datapath width for a single execution unit is 256 bits, because it has 256 x 1-bit ALU subunits and 256 bit lines for output, that does not completely determine the complexity of the execution unit, because its total input path width may vary, e.g. from 512 bit lines to 1024 bit lines or even more (selected with multiplexers).

The datapath width of a single execution unit matters very little for the performance of a CPU or GPU. What matters is the total datapath width, summed over all available execution units, which is what determines the throughput when executing some program.

For AVX programs, starting with Zen 2 the AMD CPUs had a total datapath width of 1024 bits vs. 768 bits for Intel, which is why they easily beat the Intel CPUs in AVX benchmarks.
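A toy bookkeeping sketch (my own, just to make those totals explicit):

    #include <stdio.h>

    /* output bit lines = number of 1-bit ALU subunits per unit */
    struct exec_unit { int width_bits; };

    static int total_datapath_width(const struct exec_unit *u, int n) {
        int total = 0;
        for (int i = 0; i < n; i++)
            total += u[i].width_bits;
        return total;
    }

    int main(void) {
        struct exec_unit zen2[]    = { {256}, {256}, {256}, {256} };
        struct exec_unit haswell[] = { {256}, {256}, {256} };
        printf("Zen 2+:      %d bits\n", total_datapath_width(zen2, 4));
        printf("Haswell-era: %d bits\n", total_datapath_width(haswell, 3));
        return 0;   /* prints 1024 and 768 */
    }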

For 512-bit AVX-512 instructions, Zen 4 and the Intel Xeon CPUs with P-cores have the same total datapath width for instructions other than FMUL/FMA/LD/ST, which has resulted in the same throughput per clock cycle for programs that do not depend heavily on floating-point multiplications. Because Zen 4 had higher clock frequencies in power-limited conditions, it has typically beaten the Xeons in AVX-512 benchmarks, with the exception of programs that can use the AMX instruction set, which AMD has not yet implemented.

The "double-pumped" term used about Zen 4 has created a lot of confusion, because it does not refer to the datapath width, but only to the number of available floating-point multipliers, which is half of that of the top models of Intel Xeons, so any FP multiplications must require a double number of clock cycles on Zen 4.

The term "double-pumped" is actually true for many models of AMD Radeon GPUs, where e.g. a 2048-bit instruction (64 wavefront size) is executed in 2 clock cycles as 2 x 1024-bit micro-operations (32 wavefront size).

On Zen 4, it is not at all certain that this is how the 512-bit instructions are executed, because unlike on Radeon, on Zen 4 there are 2 parallel execution units that can execute the instruction halves simultaneously, which results in the same throughput as when the execution is "double-pumped" in a single execution unit.
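To make the distinction concrete, here is a sketch (my own illustration; the real hardware does this splitting in the decoder/scheduler, not in software) of a 512-bit add expressed as two 256-bit halves. The two halves could be issued back-to-back in one 256-bit unit ("double-pumped") or simultaneously in two units; the throughput is the same either way, only the latency differs:

    /* Compile with e.g.: gcc -O2 -mavx512f halves.c */
    #include <immintrin.h>

    __m512i add512_as_two_halves(__m512i x, __m512i y) {
        __m256i xlo = _mm512_castsi512_si256(x);        /* low 256 bits  */
        __m256i xhi = _mm512_extracti64x4_epi64(x, 1);  /* high 256 bits */
        __m256i ylo = _mm512_castsi512_si256(y);
        __m256i yhi = _mm512_extracti64x4_epi64(y, 1);
        __m256i lo = _mm256_add_epi32(xlo, ylo);  /* half 1 */
        __m256i hi = _mm256_add_epi32(xhi, yhi);  /* half 2: independent of
                                                     half 1, so it can run in
                                                     a second unit in parallel */
        return _mm512_inserti64x4(_mm512_castsi256_si512(lo), hi, 1);
    }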


Oops, Haswell has only 3 SIMD ALUs, i.e. 768 bits of output per cycle, not 1024.



