SMT Scaling

SMT (simultaneous multithreading) is a feature in Ryzen that lets two threads execute on one core at the same time, improving performance in increasingly common multi-threaded scenarios.

Intel has been doing this for a full 15 years, starting with the Xeon in February 2002. It would be fair to assume that AMD's first attempt would not measure up to a well-optimized technology that has been revised and improved immensely over its long life-span.

Is that the case? In short: NO, AMD has effectively knocked it out of the park on their very first try!

[Chart: SMT Scaling. Ryzen's SMT implementation shows incredible scaling, peaking near 60%.]

What's more, there are only a few cases where enabling SMT shows any downside at all. That was not the case prior to AGESA 1.0.0.4, when SMT penalties ran as high as 15%.

The average performance uplift for multi-threaded workloads is 35.76%. The SMT penalty mostly averages out to zero, but is as high as 3.37%.
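
For reference, the uplift figures here follow the usual definition; this formula is my restatement, not something pulled from the data tables:

    SMT uplift (%) = (score with SMT / score without SMT - 1) x 100

So a workload scoring 1,000 points with SMT disabled and 1,358 with it enabled shows a 35.8% uplift.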

More to come...

SMT Penalty
When enabling SMT there's generally a price to be paid, and it is most easily demonstrated by workloads that actually lose performance with SMT turned on.

On Ryzen there are three structures which are statically partitioned between the two CPU threads that run on a core:
  • Micro-op Queue
  • Store Queue
  • Retirement Queue
This means that the queue capacity available to each thread, and with it a single thread's maximum potential throughput, is cut in half when SMT is enabled. AMD sought to alleviate this by using a six-wide micro-op queue, an eight-wide retirement queue with 192 entries, and a 44-entry store queue. As the chart above demonstrates, this has been largely successful.
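
One way to see the static partitioning directly is to pin two busy threads to the two hardware threads of a single core and compare throughput against pinning them to separate cores. Below is a minimal Linux/pthreads sketch; the sibling IDs 0 and 8 are an assumption, and the real mapping should be read from /sys/devices/system/cpu/cpu0/topology/thread_siblings_list:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Busy-work loop; volatile stops the compiler from deleting it. */
    static void *spin(void *arg) {
        (void)arg;
        volatile unsigned long acc = 0;
        for (unsigned long i = 0; i < 2000000000UL; i++)
            acc += i;
        return NULL;
    }

    /* Start a thread pinned to one logical CPU. */
    static pthread_t pinned(int cpu) {
        cpu_set_t set;
        pthread_attr_t attr;
        pthread_t t;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        pthread_create(&t, &attr, spin, NULL);
        return t;
    }

    int main(void) {
        /* ASSUMPTION: logical CPUs 0 and 8 are SMT siblings of core 0.
           Verify via /sys/devices/system/cpu/cpu0/topology/thread_siblings_list. */
        pthread_t a = pinned(0), b = pinned(8);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }

Build with gcc -O2 -pthread, time the run, then change the second pin to a different core's thread (e.g. CPU 1) and compare wall-clock times.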

However, very few workloads can reach, let alone exceed, 3 instructions per cycle (IPC), and those that can will usually be limited by Ryzen's store queue, which is effectively 22 entries deep per thread when SMT is enabled. One of the few examples that can reach, and even exceed, an IPC of 3.0 is CPU-Z's built-in benchmark, which hits 3.16 IPC on Ryzen whether SMT is enabled or disabled.
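
If you want to reproduce IPC figures like these yourself, Linux's perf tool reports instructions per cycle directly; the benchmark binary name below is just a placeholder:

    taskset -c 0 perf stat -e instructions,cycles ./your_benchmark

perf prints an "insn per cycle" line, which is the IPC for the run, and pinning with taskset keeps the measurement on a single hardware thread.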

After some searching, the only benchmark I've found that shows a lower peak IPC per thread with SMT enabled is Geekbench 3, where peak IPC per thread drops from 4.05 to 3.1 across the entire run. That is per thread, however: peak throughput on a single core actually rose by almost 50%.

SMT penalties are a side-effect of high-performance pipelines.

Sometimes the pipelines are so fast that they become starved by the front-end, or the store queue fills up and induces a stall. A SIMD instruction split into smaller operations can run with different data on multiple pipelines at the same time; one such instruction can occupy up to four pipelines at once!
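
To make that concrete: Zen executes 256-bit AVX operations as two 128-bit micro-ops, so even a single intrinsic can occupy more than one FP pipe in the same cycle. A trivial C sketch (the function name is mine):

    #include <immintrin.h>

    /* A single 256-bit AVX add: Zen cracks this into two 128-bit
       micro-ops that can issue to two FP pipes simultaneously. */
    __m256 add8(__m256 a, __m256 b) {
        return _mm256_add_ps(a, b);
    }

Compile with -mavx; a stream of independent adds like this can keep multiple pipes busy every cycle.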

If you mix integer and floating-point SIMD instructions, you can repeatedly produce EIGHT results in the same cycle. With a 22-deep store queue that takes 4 cycles to store a line back to the cache, you will bottleneck. Disabling SMT gives the thread the full 44-deep store queue, which does not prove to be a bottleneck. Enlarging the store queue alone on Zen 2 should mostly resolve the SMT penalties we see here.
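
A rough back-of-the-envelope with those numbers (my arithmetic, under the stated assumptions of 8 store results produced per cycle and one line written back every 4 cycles):

    fill rate:   8 entries/cycle
    drain rate:  1 line / 4 cycles (~0.25/cycle)
    22 entries / ~8 net per cycle  =  ~3 cycles to fill (SMT on)
    44 entries / ~8 net per cycle  =  ~6 cycles to fill (SMT off)

Once the queue is full, the pipelines stall until stores drain, so the full 44-deep queue lets a burst run twice as long before stalling; since real code produces stores in bursts rather than continuously, that extra headroom is usually enough to hide the write-back latency.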

Despite these drawbacks, it's usually better to leave SMT enabled; I had to work hard to expose these penalties, and none of them represent real-world workloads. Still, some games are known to be sensitive to this, just none that I tested.

CCX Penalties

Eight-core / Dual CCX Scaling
More to come...

CCX Latencies
More to come...
