AMD did not start with a narrower datapath, even though this is a widespread myth. It only had a narrower path between the inner CPU core and the L1 data cache.
The most recent Intel and AMD CPU cores (Lion Cove and Zen 5) have identical vector datapath widths, but for many years, for 256-bit AVX Intel had a narrower datapath than AMD, 768-bit for Intel (3 x 256-bit) vs. 1024-bit for AMD (4 x 256-bit).
Only when executing 512-bit AVX-512 instructions was Intel's vector datapath extended to 1024 bits (2 x 512-bit), matching the datapath used by AMD for all vector instructions.
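The widths quoted above follow from simple unit arithmetic; a toy sketch using only the unit counts stated in this comment:

```python
# Per-cycle vector datapath width = number of vector execution units x unit width.
# Unit counts are the ones stated above for the pre-Lion-Cove / pre-Zen-5 cores.

def datapath_bits(units: int, unit_width: int) -> int:
    """Total vector datapath width in bits per clock cycle."""
    return units * unit_width

intel_avx = datapath_bits(3, 256)     # 3 x 256-bit units -> 768 bits/cycle
amd_avx = datapath_bits(4, 256)       # 4 x 256-bit units -> 1024 bits/cycle
intel_avx512 = datapath_bits(2, 512)  # units paired into 2 x 512-bit -> 1024 bits/cycle

print(intel_avx, amd_avx, intel_avx512)  # 768 1024 1024
```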
There were only 2 advantages of Intel AVX-512 vs. AMD executing AVX or the initial AVX-512 implementation of Zen 4.
The first was that some Intel CPU models, but only the more expensive SKUs, i.e. most of the Gold and all of the Platinum, had 2 x 512-bit FMA units, while the cheap Intel CPUs and AMD Zen 4 had only one 512-bit FMA unit (but AMD Zen 4 still had 2 x 512-bit FADD units). Therefore Intel could do 2 FMUL or FMA per clock cycle, while Zen 4 could do only 1 FMUL or FMA (+ 1 FADD).
The second was that Intel had a double-width link to the L1 cache, so it could do 2 x 512-bit loads + 1 x 512-bit store per clock cycle, while Zen 4 could do only 1 x 512-bit load per cycle + 1 x 512-bit store every other cycle. (In a balanced CPU core design the throughput for vector FMA and for vector loads from the L1 cache must be the same, which is true for both old and new Intel and AMD CPU cores.)
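The "balanced design" claim above can be checked numerically: per-cycle FMA result throughput should match per-cycle L1 load bandwidth on both designs. A minimal sketch, using the figures from the two points above:

```python
# Balance check: FMA result bits per cycle vs. L1 load bits per cycle,
# using the per-cycle counts stated above for each core.

def per_cycle_bits(count: int, width: int) -> int:
    return count * width

# Intel (expensive SKUs): 2 x 512-bit FMA units, 2 x 512-bit loads per cycle
intel_fma = per_cycle_bits(2, 512)   # 1024 bits of FMA results per cycle
intel_load = per_cycle_bits(2, 512)  # 1024 bits loaded from L1 per cycle

# Zen 4: 1 x 512-bit FMA unit, 1 x 512-bit load per cycle
zen4_fma = per_cycle_bits(1, 512)    # 512
zen4_load = per_cycle_bits(1, 512)   # 512

# both designs are balanced, at different absolute bandwidths
assert intel_fma == intel_load and zen4_fma == zen4_load
```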
With the exception of vector load/store and FMUL/FMA, Zen 4 had the same or better AVX-512 throughput for most instructions, in comparison with Intel Sapphire Rapids/Emerald Rapids. There were a few instructions with a poor implementation on Zen 4 and a few instructions with a poor implementation on Intel, where either Intel or AMD were significantly better than the other.
> AMD did not start with a narrower datapath, even if this is a widespread myth. It only had a narrower path between the inner CPU core and the L1 data cache memory.
"Thus as many of us predicted, 512-bit instructions are split into 2 x 256-bit of the same instruction. And 512-bit is always half-throughput of their 256-bit versions."
Is that wrong?
There's a lot of it being described as "double pumped" going around…
(tbh I couldn't care less about how wide the interface buses are, as long as they can deliver in sum total a reasonable bandwidth at a reasonable latency… especially on the further out cache hierarchies the latency overshadows the width so much it doesn't matter if it comes down to 1×512 or 2×256. The question at hand here is the total width of the ALUs and effective IPC.)
Sorry, but you did not read carefully that good article and you did not read the AMD documentation and the Intel documentation.
The AMD Zen cores had for several generations, until Zen 4, 4 (four) vector execution units with a width of 256 bits, i.e. a total datapath width of 1024 bits.
On a 1024-bit datapath, you can execute either four 256-bit instructions per clock cycle or two 512-bit instructions per clock cycle.
While the number of instructions executed per cycle varies, the data processing throughput is the same, 1024 bits per clock cycle, as determined by the datapath width.
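The invariant described above is that instructions-per-cycle varies with instruction width while bits-per-cycle stays fixed; a toy check:

```python
# Zen 2..Zen 4 total vector datapath width, per the comment above.
DATAPATH_BITS = 1024

for instr_width in (256, 512):
    instrs_per_cycle = DATAPATH_BITS // instr_width
    # throughput in bits/cycle is the same regardless of instruction width
    assert instrs_per_cycle * instr_width == DATAPATH_BITS
    print(instr_width, instrs_per_cycle)  # 256 -> 4, 512 -> 2
```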
The use of the word "double-pumped" by the AMD CEO has been a very unfortunate choice, because it has been completely misunderstood by most people, who have never read the AMD technical documentation and who have never tested the behavior of the micro-architecture of the Zen CPU cores.
On Zen 4, the advantage of using AVX-512 comes not from a different throughput, but from a better instruction set and from avoiding bottlenecks in the CPU core front-end, at instruction fetching, decoding, renaming and dispatching.
On the Intel P cores before Lion Cove, the datapath for 256-bit instructions had a width of 768 bits, as they had three 256-bit execution units. For most 256-bit instructions the throughput was 768 bits per clock cycle. However, the three execution units were not identical, so some 256-bit instructions had a throughput of only 512 bits per cycle.
When the older Intel P cores executed 512-bit instructions, the instructions with a 512 bit/cycle throughput remained at that throughput, but most of the instructions with a 768 bit/cycle throughput had their throughput increased to 1024 bit/cycle, matching the AMD throughput, by using an additional 256-bit datapath section that stayed unused when executing 256-bit or narrower instructions.
While what is said above applies to most vector instructions, floating-point multiplication and FMA have different rules, because their throughput is not determined by the width of the datapath, but it may be smaller, being determined by the number of available floating-point multipliers.
Cheap Intel CPUs and AMD Zen 2/Zen 3/Zen 4 had FP multipliers with a total throughput of 512 bits of results per clock cycle, while the expensive Xeon Gold and Platinum had FP multipliers with a total throughput of 1024 bits of results per clock cycle.
The "double-pumped" term is applicable only to FP multiplication, where Zen 4 and cheap Intel CPUs require twice as many clock cycles to produce the same results as expensive Intel CPUs. It may also be applied, though even less appropriately, to vector load and store, where the path to the L1 data cache was narrower in Zen 4 than in Intel CPUs.
The "double-pumped" term is not applicable to the very large number of other AVX-512 instructions, whose throughput is determined by the width of the vector datapath, not by the width of the FP multipliers or by the L1 data cache connection.
Zen 5 doubles the vector datapath width to 2048 bits, so many 512-bit AVX-512 instructions have a 2048 bit/cycle throughput, except FMUL/FMA, which have a 1024 bit/cycle throughput, determined by the width of the FP multipliers. (Because there are only 4 execution units, 256-bit instructions cannot use the full datapath.)
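The Zen 5 numbers above can be derived from the same unit arithmetic; a sketch using the counts stated in this comment (4 units, 2 of which do FMA):

```python
# Zen 5, per the comment above: 4 execution units, each now 512 bits wide.
UNITS = 4
UNIT_WIDTH = 512

total = UNITS * UNIT_WIDTH         # 2048-bit total vector datapath
fma_units = 2                      # only 2 of the 4 units do FMUL/FMA
fma_bits = fma_units * UNIT_WIDTH  # 1024 bits of FMA results per cycle

# 256-bit (AVX) code occupies each unit with only 256 of its 512 bits,
# so it cannot use the full datapath: 4 x 256 = 1024 bits/cycle.
avx2_bits = UNITS * 256

assert (total, fma_bits, avx2_bits) == (2048, 1024, 1024)
```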
Intel Diamond Rapids, expected by the end of 2026, is likely to have the same vector throughput as Zen 5. Until then, the Lion Cove cores from consumer CPUs, like Arrow Lake S, Arrow Lake H and Lunar Lake, are crippled by a half-width 1024-bit datapath, which cannot compete with a Zen 5 executing AVX-512 instructions.
Isn't it misleading to just add up the output width of all SIMD ALU pipelines and call the sum "datapath width", because you can't freely mix and match when the available ALUs pipelines determine what operations you can compute at full width?
You are right that in most CPUs the 3 or 4 vector execution units are not completely identical.
Therefore some operations may use the entire datapath width, while others may use only a fraction, e.g. one half, two thirds, or three quarters.
However, you cannot really discuss these details without listing all such instructions, i.e. reproducing the tables from the Intel or AMD optimization guides or from Agner Fog's optimization documents.
For the purpose of this discussion thread, these details are not really relevant, because for Intel and AMD the classification of the instructions is mostly the same. The cheap instructions, like addition operations, can be executed in all execution units, using the entire datapath width, while certain more expensive operations, like multiplication, division, square root or shuffle, may be done only in a subset of the execution units, so they can use only a fraction of the datapath width (but when possible they will be paired with simple instructions using the remainder of the datapath, maintaining a total throughput equal to the datapath width).
Because most instructions are classified by cost in the same way by AMD and Intel, the throughput ratio between AMD and Intel is typically the same both for instructions using the full datapath width and for those using only a fraction.
As I have said, with very few exceptions (including FMUL/FMA/LD/ST), the throughput for 512-bit instructions has been the same for Zen 4 and the Intel CPUs with AVX-512 support, as determined by the common 1024-bit datapath width, including for the instructions that could use only a half-width 512-bit datapath.
Wouldn't it be 1536-bit for 2 256-bit FMA/cycle, with FMA taking 3 inputs? (applies equally to both so doesn't change anything materially; And even goes back to Haswell, which too is capable of 2 256-bit FMA/cycle)
That is why I have written the throughput "for results", to clarify the meaning (the throughput for output results is determined by the number of execution units; it does not depend on the number of input operands).
The vector register file has a number of read and write ports, e.g. 10 x 512-bit read ports for recent AMD CPUs (i.e. 10 ports can provide the input operands for 2 x FMA + 2 FADD, when no store instructions are done simultaneously).
So a detailed explanation of the "datapath widths" would have to take into account the number of read and write ports, because some combinations of instructions cannot be executed simultaneously, even when execution units are available, because the paths between the register file and the execution units are occupied.
To complicate things further, some combinations of instructions that would be prohibited by not having enough register read and write ports can actually be done simultaneously, because there are bypass paths between the execution units that allow the sharing of some input operands, or the direct use of output operands as input operands, without passing through the register file.
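The port accounting in the 10-read-port example above works out exactly: 2 FMA micro-operations need 3 inputs each and 2 FADD micro-operations need 2 inputs each. A toy check:

```python
# Register-file read-port accounting for the example above
# (10 x 512-bit read ports on recent AMD CPUs).
FMA_UOPS, FMA_INPUTS = 2, 3    # FMA takes 3 input operands
FADD_UOPS, FADD_INPUTS = 2, 2  # FADD takes 2 input operands

reads_needed = FMA_UOPS * FMA_INPUTS + FADD_UOPS * FADD_INPUTS
READ_PORTS = 10

# All ports are consumed; a simultaneous store would need a spare
# port or a bypass path, which matches the restriction stated above.
assert reads_needed == READ_PORTS
```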
The structure of the Intel vector execution units, with 3 x 256-bit execution units, 2 of which can do FMA, goes indeed back to Haswell, as you say.
The Lion Cove core launched in 2024 is the first Intel core that uses the enhanced structure used by AMD Zen for many years, with 4 execution units, where 2 can do FMA/FMUL, but all 4 can do FADD.
Starting with the Skylake Server CPUs, the Intel CPUs with AVX-512 support retain the Haswell structure when executing 256-bit or narrower instructions, but when executing 512-bit instructions, 2 x 256-bit execution units are paired to make a 512-bit execution unit, while the third 256-bit execution unit is paired with an otherwise unused 256-bit execution unit to make a second 512-bit execution unit.
Of these 2 x 512-bit execution units, only one can do FMA. Certain Intel SKUs add a second 512-bit FMA unit, so in those both 512-bit execution units can do FMA (this fact is mentioned where applicable in the CPU descriptions from the Intel Ark site).
So the 1024-bit number is the number of vector output bits per cycle, i.e. 2×FMA+2×FADD = (2+2)×256-bit? Is the term "datapath width" used for that anywhere else? (I guess you've prefixed that with "total " in some places, which makes much more sense)
For most operations, an ALU has a width in 1-bit subunits, e.g. adders, and the same number as the number of subunits is the width in bit lines of the output path and of each of the 2 input paths that are used for most input operands. Some operations use only one input path, while others, like FMA or bit select may need 3 input paths.
The width of the datapath is normally taken to be the number of 1-bit subunits of the execution units, which is equal to the width in bit lines of the output path.
Depending on the implemented instruction set, the number of input paths having the same width as the output path may vary, e.g. either 2 or 3. In reality this is even more complicated, e.g. for 4 execution units you may have 10 input paths whose connections can be changed dynamically, so they may provide 3 input paths for some execution units and 2 input paths for other execution units, depending on which micro-operations happen to be executed there during a clock cycle. Moreover, there may be many additional bypass operand paths.
Therefore, if you say that the datapath width for a single execution unit is 256 bits, because it has 256 x 1-bit ALU subunits and 256 bit lines for output, that does not completely determine the complexity of the execution unit, because it may have a total input path width varying e.g. from 512 bit lines to 1024 bit lines or even more (selected with multiplexers).
The datapath width of a single execution unit matters very little for the performance of a CPU or GPU. What matters is the total datapath width, summed over all available execution units, which is what determines the CPU throughput when executing a program.
For AVX programs, starting with Zen 2 the AMD CPUs had a total datapath width of 1024 bits vs. 768 bits for Intel, which is why they easily beat the Intel CPUs in AVX benchmarks.
For 512-bit AVX-512 instructions, Zen 4 and the Intel Xeon CPUs with P-cores have the same total datapath width for instructions other than FMUL/FMA/LD/ST, which has resulted in the same throughput per clock cycle for the programs that do not depend heavily on floating-point multiplications. Because Zen 4 had higher clock frequencies in power-limited conditions, Zen 4 has typically beaten the Xeons in AVX-512 benchmarks, with the exception of the programs that can use the AMX instruction set, which is not implemented yet by AMD.
The "double-pumped" term used about Zen 4 has created a lot of confusion, because it does not refer to the datapath width, but only to the number of available floating-point multipliers, which is half that of the top models of Intel Xeons, so FP multiplications require twice as many clock cycles on Zen 4.
The term "double-pumped" is actually true for many models of AMD Radeon GPUs, where e.g. a 2048-bit instruction (64 wavefront size) is executed in 2 clock cycles as 2 x 1024-bit micro-operations (32 wavefront size).
On Zen 4, it is not at all certain that this is how the 512-bit instructions are executed, because unlike on Radeon, on Zen 4 there are 2 parallel execution units that can execute the instruction halves simultaneously, which results in the same throughput as when the execution is "double-pumped" in a single execution unit.
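The point above, that sequential ("double-pumped") and parallel issue of the two halves give identical throughput, can be shown with a toy issue model: for a stream of 512-bit instructions split into 2 x 256-bit micro-operations and fed to 4 units, the cycle count is the same regardless of which unit each half lands on.

```python
import math

def cycles(n_512bit_instrs: int, units: int = 4, unit_width: int = 256) -> int:
    """Cycles to retire a stream of 512-bit instructions on `units` execution
    units of width `unit_width`. Issue order (sequential halves in one unit
    vs. parallel halves in two units) does not affect the total."""
    uops = n_512bit_instrs * (512 // unit_width)  # each splits into 2 halves
    return math.ceil(uops / units)

assert cycles(4) == 2     # 8 uops on 4 units -> 2 cycles, either issue order
assert cycles(100) == 50  # steady state: 1024 bits of results per cycle
```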
> Sorry, but you did not read carefully that good article and you did not read the AMD documentation and the Intel documentation.
I think you are the one who hasn't read documentation or tested the behavior of Zen cores. Read literally any AMD material about Zen4: it mentions that the AVX512 implementation is done over two cycles because there are 256-bit datapaths.
On page 34 of the Zen4 Software Optimization Guide[^1], it literally says:
> Because the data paths are 256 bits wide, the scheduler uses two consecutive cycles to issue a 512-bit operation.
It is not certain whether what AMD writes there is true, because it is almost impossible to determine by testing whether the 2 halves of a 512-bit instruction are executed sequentially, in 2 clock cycles of the same execution unit, or simultaneously, in the same clock cycle in 2 execution units.
Some people have attempted to test this claim of AMD by measuring instruction latencies. The results have not been clear, but they tended to support that this AMD claim is false.
Regardless whether this AMD claim is true or false, this does not change anything for the end user.
For any relevant 512-bit instruction, there are 2 or 4 available execution units. The 512-bit instructions are split into 2 x 256-bit micro-operations, and then either 4 or 2 such micro-operations are issued simultaneously, corresponding to the total 1024-bit datapath width, or to the partial datapath width available for a few instructions, e.g. FMUL and FMA. The result is a throughput of 1024 bits of results per clock cycle for most instructions (512 bits for FMA/FMUL), the same as for any Intel CPU supporting AVX-512 (except FMA/FMUL, where the throughput matches only the cheaper Xeon SKUs).
The throughput would be the same, i.e. 1024 bits per cycle, regardless of whether what AMD said is true, i.e. that when executing 8 x 256-bit micro-operations in 2 clock cycles, the pair of micro-operations executed in the same execution unit comes from a single instruction, or whether the claim is false and the pair comes from 2 distinct instructions.
The throughput depends only on the total datapath width of 1024 bits and it does not depend on the details of the order in which the micro-operations are issued to the execution units.
The fact that one execution unit has a data path of 256 bits is irrelevant for the throughput of a CPU. Only the total datapath width matters.
For instance, an ARM Cortex-X4 CPU core has a per-execution-unit datapath width of only 128 bits. That does not mean it is slower than a consumer Intel CPU core that supports only AVX, which has a per-execution-unit datapath width of 256 bits.
In fact both CPU cores have the same vector FMA throughput, because they have the same total datapath width for FMA instructions of 512 bits, i.e. 4 x 128 bits for Cortex-X4 and 2 x 256 bits for a consumer Intel P-core, e.g. Raptor Cove.
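The equivalence above is just the same total-width arithmetic applied across vendors; a one-line check using the unit counts stated in this comment:

```python
# Total FMA datapath width, per the comparison above:
cortex_x4 = 4 * 128    # 4 x 128-bit vector units on Cortex-X4
raptor_cove = 2 * 256  # 2 x 256-bit AVX FMA units on a consumer Intel P-core

# same total width -> same vector FMA throughput per clock cycle
assert cortex_x4 == raptor_cove == 512
```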
It is not enough to read the documentation if you do not think about what you read, to assess whether it is correct or not.
Technical documentation is not usually written by the engineers that have designed the device, so it frequently contains errors when the technical writer has not understood what the designers have said, or the writer has attempted to synthesize or simplify the information, but that has resulted in a changed meaning.
It doesn't really matter if the two "halves" are issued in sequence or in parallel¹; either way they use 2 "slots" of execution which are therefore not available for other use — whether that other use be parallel issue, OOE or HT². To my knowledge, AVX512 code tends to be "concentrated", there's generally not a lot of non-AVX512 code mixed in that would lead to a more even spread on resources. If that were the case, the 2-slot approach would be less visible, but that's not really in the nature of SIMD code paths.
But at the same time, 8×256bit units would be better than 4×512, as the former would allow more thruput with non-AVX512 code. But that costs other resources (and would probably also limit achievable clocks since increasing complexity generally strains timing…) 3 or 4 units seems to be what Intel & AMD engineers decided to be best in tradeoffs. But all the more notable then that Zen4→Zen5 is not only a 256→512 width change but also a 3→4 unit increase³, even if the added unit is "only" a FADD one.
(I guess this is what you've been trying to argue all along. It hasn't been very clear. I'm not sure why you brought up load/store widths to begin with, and arguing "AMD didn't have a narrower datapath" isn't quite productive when the point seems to be "Intel had the same narrower datapath"?)
¹ the latency difference should be minor in context of existing pipeline depth, but of course a latency difference exists. As you note it seems not very easy to measure.
² HT is probably the least important there, though I'd also assume there are quite a few AVX512 workloads that can in fact load all cores and threads of a CPU.