SMT (simultaneous multithreading) is a feature in Ryzen which lets two threads share one core's execution resources at the same time, improving throughput in multi-threaded workloads, which are increasingly common.
Intel has been doing this for a full 15 years, starting with the Xeon in February of 2002. It would only be fair to assume that AMD's first try would not measure up to a well-optimized technology that has been revised and improved upon immensely during its long lifespan.
Is that the case? In short: NO, AMD has effectively knocked it out of the park on their very first try!
Ryzen's SMT implementation shows incredible scaling. Nearly 60% peak!
What's more, there are only a few cases where enabling SMT has any downside. That was not the case prior to AGESA 126.96.36.199, when the SMT penalties were as high as 15%.
The average performance uplift for multi-threaded workloads is 35.76%. The SMT penalty mostly averages out to zero, but is as high as 3.37%.
More to come...
Enabling SMT generally comes at a price. That price is most easily demonstrated in workloads that actually lose performance when SMT is enabled.
On Ryzen there are three structures which are statically partitioned between the two CPU threads that run on a core:
However, very few workloads can reach, let alone exceed, 3 instructions per cycle (IPC), and those that can will usually be limited by the store queue on Ryzen, which is effectively 22 entries deep per thread when SMT is enabled. One of the few examples that can reach, and even exceed, an IPC of 3.0 is CPU-Z's built-in benchmark, which reaches 3.16 IPC on Ryzen whether SMT is enabled or disabled.
After some searching, the only benchmark I've found to show lower PEAK IPC per thread with SMT enabled is Geekbench 3, where peak IPC per thread drops from 4.05 to 3.1 for the entire run. However, that is per thread; peak throughput on a single core actually rose almost 50%.
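To make the per-thread versus per-core distinction concrete, here is a rough back-of-the-envelope sketch using the Geekbench 3 figures above. It assumes both SMT threads hit their peak IPC in the same cycle, which is a best case rather than something measured, so the result is an upper bound and not a replacement for the measured figure. The function name is purely illustrative.

```python
def core_throughput_gain(ipc_single_thread, ipc_per_thread_smt, threads=2):
    """Relative gain in per-core IPC when `threads` SMT threads share a core.

    Assumption (not from the measurements): every thread reaches its
    peak IPC at the same moment, so this is an upper bound.
    """
    combined_ipc = ipc_per_thread_smt * threads
    return combined_ipc / ipc_single_thread - 1.0


# Peak IPC per thread: 4.05 with SMT off, 3.1 with SMT on.
per_thread_change = 3.1 / 4.05 - 1.0
per_core_gain = core_throughput_gain(4.05, 3.1)

print(f"per-thread peak IPC change: {per_thread_change:.1%}")
print(f"per-core peak IPC gain (upper bound): {per_core_gain:.1%}")
```

Each thread gives up roughly a quarter of its peak IPC, but two threads together can push the core to about 6.2 IPC in the best case. That puts the ceiling at around 53%, which fits with the observed gain of almost 50%.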
Sometimes the pipelines are so fast that they are starved by the front end, or the store queue fills up and induces a stall. A SIMD instruction split into smaller operations can run with different data on multiple pipelines at the same time. One such instruction can operate on up to four pipelines at the same time!
If you mix integer and floating-point SIMD instructions, you can repeatedly get EIGHT results in the same cycle. With a 22-deep store queue that takes 4 cycles to store a line back to the cache, you will bottleneck. Disabling SMT gives a 44-deep store queue, which will not prove to be a bottleneck. Enlarging the store queue alone on Zen 2.0 should mostly resolve the SMT penalties we see.
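The store-queue argument can be illustrated with a deliberately crude toy model. This is a sketch under stated assumptions, not a description of how Zen actually schedules stores: it assumes every one of the eight results per cycle needs a store-queue entry, that exactly one entry drains every 4 cycles, and that the workload is a short burst of 40 stores. The burst size and the function name are made up for illustration.

```python
def stall_cycles(depth, burst_stores, stores_per_cycle=8, drain_interval=4):
    """Count cycles where the core wanted to issue stores but the queue was full.

    Toy model assumptions: up to `stores_per_cycle` stores are issued each
    cycle, and one queue entry drains every `drain_interval` cycles.
    """
    queued = 0                # entries currently held in the store queue
    remaining = burst_stores  # stores the core still wants to issue
    stalls = 0
    cycle = 0
    while remaining > 0:
        cycle += 1
        # Drain one entry to the cache every `drain_interval` cycles.
        if cycle % drain_interval == 0 and queued > 0:
            queued -= 1
        # Issue as many stores as there is room for, up to the per-cycle limit.
        wanted = min(stores_per_cycle, remaining)
        issued = min(wanted, depth - queued)
        queued += issued
        remaining -= issued
        if issued < wanted:
            stalls += 1       # at least one store had to wait this cycle
    return stalls


for depth in (22, 44):        # SMT enabled vs. SMT disabled
    print(f"{depth}-deep queue: {stall_cycles(depth, burst_stores=40)} stall cycles")
```

In this model the 44-deep queue absorbs the whole burst without stalling, while the 22-deep queue fills within a few cycles and then waits on the drain rate. A sustained stream of stores would eventually fill any queue, so the model only speaks to bursty code.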
Eight-core / Dual CCX Scaling
More to come...