25 February 2020

Analyse This: The Next Gen Consoles (Part 7) [UPDATED]


Yesterday's information release from Microsoft contained one final tidbit that I didn't address: the allusion to the Xbox Series X being four times as powerful, in terms of CPU power, as the Xbox One:
"Delivering four times the processing power of an Xbox One and enabling developers to leverage 12 TFLOPS of GPU (Graphics Processing Unit) performance – twice that of an Xbox One X and more than eight times the original Xbox One."

Yes, it isn't specifically stated that this is the CPU improvement but, by a process of elimination, it appears to refer to CPU processing power: 4x the original Xbox One's GPU performance would fall below even the One X's TFLOPS value, and the GPU improvement is separately stated as being more than eight times. So what, exactly, does this mean and how does it fit into the information we have for current Ryzen APU offerings?


Codenames and performance...


I noted back in November 2019 that it looked, to me, like the leaked codenames for the development kit processors were referencing mid-tier performance; an X6XX part (e.g. R5 1600, 2600, or 3600). The leaked performance numbers for the "Flute" APU also loosely corroborated this, with performance just below a Ryzen 7 1700X over at Userbenchmark. Since then, the benchmarking suite has changed the weightings of the tests it runs, so it's a bit difficult to compare those figures with current results. However, although the original Flute results are no longer listed in the database, they were compared against a Ryzen 7 3700X at the time - so I can use the 3700X's current results alongside the previous ones to see how the weightings changed and "interpolate" the Flute results into the current scheme.

It's not a perfect manipulation but it gives us results of 1-core: 88.6; 4-core: 341.4; and 8-core: 626.6. We'll come back to these results later on...
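For anyone who wants to reproduce that manipulation, a minimal sketch of the rescaling is below; the input scores are illustrative placeholders rather than values pulled from the live database:

```python
# Rescale old-scheme Userbenchmark scores into the current weighting scheme
# using a reference CPU (the 3700X) that has results under both schemes.
# All numbers below are illustrative placeholders, not database values.

def rescale(old_score: float, ref_old: float, ref_new: float) -> float:
    """Scale an old-scheme score by the ratio observed on the reference CPU."""
    return old_score * (ref_new / ref_old)

flute_old_1c = 92.0      # placeholder: Flute's original 1-core score
r3700x_old_1c = 134.0    # placeholder: 3700X 1-core, old weightings
r3700x_new_1c = 129.0    # placeholder: 3700X 1-core, current weightings

print(round(rescale(flute_old_1c, r3700x_old_1c, r3700x_new_1c), 1))  # ~88.6
```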

However, we also have Cinebench scores for many prior and current CPUs and APUs that are applicable to this comparison. Digital Foundry noted that an Athlon 5370 clocked at 1.6 GHz roughly corresponds to a PS4. Extrapolating these Cinebench R15 results to the 1.75 GHz clock speed of the Xbox One, by cross-referencing them with results for a stock 2.05 GHz Athlon 5350 and a 2.3 GHz Athlon 5370, you can approximate an R15 single-core score of 38 and a multi-core score of 140.
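The extrapolation itself is just linear scaling of score with clock speed. A quick sketch, with a placeholder standing in for the published Athlon result:

```python
# Linearly scale a Cinebench R15 score to a different clock speed,
# assuming the same architecture and linear score/clock behaviour.

def scale_to_clock(score: float, ref_ghz: float, target_ghz: float) -> float:
    return score * (target_ghz / ref_ghz)

athlon_5350_1c = 44.5  # placeholder: R15 single-core score @ 2.05 GHz
print(round(scale_to_clock(athlon_5350_1c, 2.05, 1.75)))  # ~38, as above
```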

The orange cells are calculated from surrounding data. Data that could not be calculated, or that was absent entirely, is marked with a dark grey cell.

One aspect of the Digital Foundry results that must be taken into account is that their multi-core "simulation" of the PS4 CPU uses only 4 cores, not the 8 found in the PS4/XBO. Whilst this has no impact on the single-core figures (though you'll note that the stated 4.7x improvement between a 3.2 GHz 3700X and a "stock" PS4 is actually more like 4.3x - I presume a transcription error, since the other calculated ratios are correct), the multi-core score would need to be multiplied by a factor of 2 in order to be closer to the actual performance.

The PS4 is not the console we have a performance number for, though. I performed the same extrapolation as above for the XBO Userbenchmark scores in order to have a second test suite that might corroborate the approximations. Multiplying all of these calculated scores by a factor of 4 (as per the Xbox news blogpost) and factoring in the 4 extra cores for the multi-core benchmark by multiplying the result by 2, we arrive at a theoretical Xbox Series X single-core score of 153 and a multi-core score of 1120. The equivalent Userbenchmark scores are 117, 427 and 888 for the 1-core, 4-core and 8-core results, respectively.
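The arithmetic behind those targets is straightforward (note that starting from the rounded single-core score of 38 lands on 152 rather than 153):

```python
# Derive the theoretical Xbox Series X CPU targets from the Xbox One
# estimates: a 4x uplift (per Microsoft's blog post), with the multi-core
# score also doubled to account for 8 cores instead of 4.

xbo_r15_1c, xbo_r15_mc = 38, 140   # extrapolated Xbox One R15 scores

sx_1c = xbo_r15_1c * 4             # -> 152 (153 above, from the unrounded base)
sx_mc = xbo_r15_mc * 4 * 2         # -> 1120

print(sx_1c, sx_mc)
```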

Now that we have these theoretical values we can look at the Cinebench scores for the Zen 2-based APUs, the 4800U and the 4800H (listed in the table above, along with their Userbenchmark scores). Here we can see the huge performance uplift and optimisation of the APU Zen 2 cores compared to the Zen, Zen+ and Zen 2 desktop parts. With a TDP of just 15-45 W, the APUs almost match the single-core performance of the 3700X and beat the 1600, 1600 AF and 2600 considerably. However, as I noted last time, reducing the TDP and the L2 + L3 cache sizes has affected multi-core performance: the chips are unable to run as hot for as long, and are bottlenecked by having to draw in/swap out data to main system memory much sooner than their desktop counterparts.

What is apparent is that whatever "Flute" was, it has been improved upon considerably because, as per the APUs' results, it was only matching a Ryzen 5 3550H part (included in the table above) for the 1-core and 4-core scores whilst outperforming the 3550H in the 8-core score with 627 vs 493. This is a very confusing result and it raises the question of what, exactly, "Flute" was meant to be testing and what architectural makeup it had.

Comparison of theoretical Xbox Series X CPU performance (4x Xbox One) with derived results for down-clocked 4800U and 4800H parts. The "Middle" columns allow for a slight increase in TDP, and these scores show that a Zen 2 APU clocked at 2.55 GHz in a 25 W TDP envelope would potentially match the 4x improvement over the XBO.

Either way, the theoretical Xbox Series X (above) gives us our target benchmark results. Looking at the 4800U and 4800H APU benchmark results, we can see that these are already in the ballpark of a 3.78x to 6.11x performance improvement across the various benchmarks. However, there are likely to be significant TDP and power constraints on the Xbox SX in comparison with these two APUs because they sport only 8 and 7 GPU CUs (Compute Units), respectively, compared to the >40 CUs rumoured to be present on the SX APU die. It is likely that, in order to guarantee the performance of the GPU, the CPU would necessarily be clocked lower in order to allow a greater TDP budget for the GPU portion of the die.

Previously, I tried to estimate relative die areas for raytracing, CPU and GPU cores. I looked at existing EPYC embedded parts on the market and concluded that the 3251 would fit the design spec of the rumoured next gen consoles, though I determined that its 2.5 GHz, 50 W TDP would need to be scaled back in order to "fit" within a console form factor and match the Gonzalo leak's 1.6 GHz base clock.

I also calculated that Flute was operating below its maximum boost clock (though above its base clock) during the leaked testing, averaging around 2.6 GHz. In the course of that analysis, I also determined that it was unlikely to be final silicon (and may not even have been the final architecture [i.e. Zen 2 - though at the time I thought that the consoles would inherit Zen 3]) and that it was not operating at the desired performance target at the time of its testing. We can see above that Flute is some way behind the theoretical Xbox Series X and, I presume, an almost identical PS5 CPU block. Personally, I believe that there will be a smaller difference between the two consoles in the CPU than in the rest of the system.

Anyway, to get back to the point at hand: I calculated the clock speed at which the 4800U and 4800H APUs would match the theoretical single-core performance in Cinebench R15 - this gave 2.55 GHz. From this I recalculated each remaining benchmark for these two APUs at the new, downclocked frequency. Then, seeing that the 15 W part was not quite performant enough to match the multi-core score, I increased the theoretical TDP by interpolating between the U and H parts to 25 W. This gives us a CPU approximating all of the benchmark results apart from the Userbenchmark 8-core result, which lags behind by 197 points.
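A sketch of that TDP interpolation step, using placeholder scores in lieu of the full table data:

```python
# Estimate a 25 W operating point by interpolating linearly between the
# downclocked 15 W (4800U) and 45 W (4800H) results. The input scores are
# illustrative placeholders, not the values from the tables above.

def interp_tdp(score_15w: float, score_45w: float, tdp_w: float = 25.0) -> float:
    """Linearly interpolate a score between the 15 W and 45 W points."""
    return score_15w + (score_45w - score_15w) * (tdp_w - 15.0) / (45.0 - 15.0)

mc_15w, mc_45w = 1050.0, 1260.0  # placeholder downclocked multi-core scores
print(round(interp_tdp(mc_15w, mc_45w)))  # ~1120 with these placeholders
```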

The reason for this discrepancy could be down to the way I extrapolated the theoretical 8-core result for the Xbox One. I multiplied the Cinebench multi-core score by 2 in order to account for the extra 4 cores and performed the same, simple approximation for the Userbenchmark score. However, whilst this appears to have been fine for Cinebench, it may not be appropriate for Userbenchmark because of the way that testing suite weights the results of its testing procedure: with double the cores, you may not get double the score.

Ideally, we need to find score results for two CPUs using the same architecture and similar base and boost clocks - though one needs to have 4 cores and the other 8. Unfortunately, the third generation Ryzen desktop processors bottom out at 6 cores and the mobile APUs have not been completely benchmarked on Userbenchmark (the 4700U is not found). However, we can compare the 4300U (4C/4T) to the 4800U (8C/16T), and the 3200GE (4C/4T) to the 3400GE (4C/8T).

Compiling the data (in the table below), you can see that there is an observable effect on the benchmark results when adding more threads. It seems that Userbenchmark's testing overly favours additional threads (whether logical or physical) over increased clock speed - the score does not scale 1:1 with frequency or available threads/cores. Most notably, if we apply the scaling factor between the (4C/4T) 3200GE and the (4C/8T) 3400GE to our theoretical scaling from the Athlon 5370 @ 1.75 GHz to the Xbox One, we reach an 8-core score of 174 points. Multiplying that by a factor of 4 gives us 697 points, which is in line with what we see for the other scores in our "Middle" CPU, above.
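In code, the corrected extrapolation looks something like this (the SMT factor is a placeholder standing in for the ratio visible in the table):

```python
# Apply the observed SMT scaling (3200GE -> 3400GE) to the Xbox One's
# extrapolated 4-core score instead of naively doubling it, then apply
# the 4x uplift from the Microsoft announcement.

xbo_4c = 427 / 4      # the theoretical SX 4-core score above, un-multiplied
smt_factor = 1.63     # placeholder: 3400GE 8-thread / 3200GE 8-thread ratio

xbo_8c = xbo_4c * smt_factor
print(round(xbo_8c), round(xbo_8c * 4))  # ~174 and ~696-697, as above
```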

So I'm not that concerned about this small discrepancy.

If you take a look back at those numbers in the two tables above, you'll see that my original assumption was pretty close to the mark - this performance level is right around that of the Ryzen 5 1600 and 1600 AF. It also correlates pretty closely with the Flute CPU scores, though somewhat improved.

In this context, it could be inferred that Flute was a development chip testing out TDP and approximating the performance of a fully enabled 8C/16T Zen+ die running at or near R5 1600 clock speeds. In fact, this could hint at the entire reason the 1600 AF exists: as a precursor, scaling down existing core technology as a test case for the console development kits.

Userbenchmark's suite responds in an outsized way to increased logical and physical thread counts...

So, what have we got? Well, I think this approximation looks pretty good. A CPU operating at a 25 W TDP and 2.55 GHz provides almost exactly four times the processing power of the original Xbox One CPU. This TDP leaves plenty of thermal room for a large GPU to operate on-die, considering that the One X had a TDP of 180 W - that's around 150 W of headroom before reaching that prior upper limit. Given that RDNA 2 is meant to be much more power efficient than Navi, 36-40 CUs could easily fit into that TDP (for the PS5), though the SX might have more of an issue if its GPU is as large as rumoured.

However, I'm not quite so sure that the SX will have 56 CUs. My calculations put the number of possible CUs in the Xbox SX at around 48. Doing some more back-of-the-napkin maths, I arrive at 12.00 TFLOPS for a 48 CU GPU running at 1.84 GHz with 14 Gbps GDDR6. In comparison, with the same memory configuration, 36 CUs running at 2.0 GHz (for the PS5) yields 9.78 TFLOPS. Yes, that's a bit of a difference, but I spoke last time about the trade-offs between general and dedicated hardware that appear to be the difference in direction between Microsoft and SONY.

What is interesting here is that, if we assume that the PS5 has a 36 CU GPU running under a 125 W or 130 W TDP (putting the total at around 165 W for the total APU die*), that would peg the SX at around 133 W TDP for a 48 CU GPU (putting the total at around 155-160 W TDP for the total APU die**). [Please see below for an update on these calculations]
*I'm assuming 5 W for ancillary silicon not accounted for by the CPU and GPU.
**I'm assuming no extra wattage due to the ancillary elements being accounted for by the GPU in the case of the SX. 
Now, I should stress that these thermal meanderings are far less rigorous than the calculations I performed for the number of CUs and the CPU power/frequency, so don't pay attention to them until we get more information from the platform holders. I'm just eyeballing the TDP of AMD's RX 5000 series cards and extrapolating some optimisation based on the Vega CUs present in the 4800U (1792 GFLOPS @ 8 CUs) and 4800H (1433 GFLOPS @ 7 CUs) APUs.

When you run those 7 nm Vega numbers through a conversion, you arrive at 11.3 TFLOPS for a 48 CU Vega-based APU at 1840 MHz and 9.2 TFLOPS for a 36 CU Vega-based APU at 2000 MHz. These numbers are incredibly close to those slated for the Xbox Series X and rumoured for the PS5 - and RDNA 2 should be able to push more than Vega does.
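That conversion is just the standard shader-throughput formula - CUs x 64 shaders x 2 ops per clock (FMA) x frequency - which reproduces both the APU figures and the scaled-up estimates:

```python
# GPU throughput: CUs x 64 stream processors x 2 FLOPs per clock x frequency.

def tflops(cus: int, clock_ghz: float) -> float:
    return cus * 64 * 2 * clock_ghz / 1000.0

print(round(tflops(8, 1.75) * 1000))   # 4800U: 1792 GFLOPS
print(round(tflops(7, 1.60) * 1000))   # 4800H: ~1434 GFLOPS (1433 truncated)
print(round(tflops(48, 1.84), 1))      # 11.3 TFLOPS @ 48 CUs, 1840 MHz
print(round(tflops(36, 2.00), 1))      # 9.2 TFLOPS @ 36 CUs, 2000 MHz
```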

Anyway, as always, I hope you've enjoyed this descent into madness. I'll post another update on this as and when information comes to light.

[UPDATE]

I realised I had estimated the TDPs incorrectly because I had not subtracted the 25 W of the CPU + 8 CUs from the initial value for the PS5, nor taken the lower clock of the SX's CUs into account. Looking at the TDPs of the two APUs I used in the comparison, it is clear that the majority of the heat is being emitted by the CPU portion of the die. The 4800H runs 7 CUs at a lower clockspeed (1.6 GHz) but its CPU runs at both a faster base and boost frequency (2.9/3.8 GHz). The 4800U runs 8 CUs at a higher clockspeed (1.75 GHz) but its CPU is clocked lower for both base and boost (1.8/3.2 GHz).

If we assume, for the 15 W part, that 8 W is generated by the CPU and 7 W by the GPU under full load, then we could estimate that the 45 W part has 40 W from the CPU and 5 W from the GPU. It's probably not that cut and dried but it's a starting point. We can then extrapolate that a static 2.55 GHz chip (with no boost frequency) with 8 CUs at 2 GHz in a 25 W envelope could have 15-16 W from the GPU and 9-10 W from the CPU* for the PS5. The SX would have a different TDP because the CPU portion remains the same but the heat generated by the GPU would be lower due to the lower clockspeed. So, let's say we keep that 9-10 W CPU and decrease the 8 CUs to 10 W - which gives a 20 W total for the unit.
*I'm going with this split because, when chips are pushed to higher frequencies, they run hotter in a non-linear manner. Since the CPU would be static and within the turbo range of the 4800U, I'm not expecting it to generate much more heat - and that heat would be constant rather than fluctuating, and so easier to wick away.
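To make those assumed splits explicit (and they are assumptions, not measurements), here they are as data:

```python
# The assumed CPU/GPU power splits from above, with derived unit totals.

splits_w = {
    "4800U @ 15 W": {"cpu": 8.0, "gpu": 7.0},
    "4800H @ 45 W": {"cpu": 40.0, "gpu": 5.0},
    "PS5 CPU+8CU":  {"cpu": 9.5, "gpu": 15.5},  # midpoints of 9-10 / 15-16 W
    "SX CPU+8CU":   {"cpu": 9.5, "gpu": 10.0},  # lower GPU clock -> less heat
}

for part, s in splits_w.items():
    print(f"{part}: {s['cpu'] + s['gpu']:.1f} W total")  # PS5 ~25 W, SX ~20 W
```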
Taking those assumptions into account, I'm realising that I was lowballing the TDP of an optimised 36 CU 2 GHz part. If the RX 5700 @ 1.625 GHz is a 180 W TDP part then, even with optimisations, we're looking at around 130 W at that same clockspeed - and a 2 GHz part would be higher still. Looking across the aisle at Nvidia, a 36 CU optimised successor to the RX 5700 could look something like the base RTX 2070 (it's only a 20%-ish jump in performance) - and that's still a 175 W part versus the 180 W TDP of the RX 5700...

Looking back at historical Radeon cards, the RX 5700 has 22% more performance than the RX Vega 56 whilst generating 85% of the TDP and using 64% of the CUs (36 vs 56). To make a CU-to-CU comparison with the RX 580, the 5700 is 54% more performant whilst generating 97% of the TDP and using the same number of CUs. The RX 5700 actually has almost equal performance to the higher-end RX Vega 64, but that chip was pushed incredibly hard, almost to its limit (in fact, undervolting the Vega series was a thing), and that produced an incredibly hot chip (295 W TDP!).


So, if the 7 nm process combined with the switch from GCN to Navi could result in a TDP drop to 61%, it seems possible that moving to the 7 nm+ node and RDNA 2 could result in another 20-40% drop too. That would put 36 CUs at 110 W TDP when running at 1.625 GHz. Pushing that up to 2 GHz could land us at 130-150 W TDP - within the realm of my initial thoughts.
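A quick sketch of that scaling chain, taking the aggressive end of the assumed 20-40% drop and then scaling linearly with clock:

```python
# Start from the RX 5700's 180 W @ 1.625 GHz, apply a ~39% drop for the
# assumed 7 nm+/RDNA 2 improvements, then scale linearly up to 2 GHz.

rx5700_tdp_w = 180.0
rdna2_at_1625 = rx5700_tdp_w * 0.61             # ~110 W, per the assumed drop
linear_at_2000 = rdna2_at_1625 * (2.0 / 1.625)  # ~135 W with linear scaling

print(f"{rdna2_at_1625:.0f} W @ 1.625 GHz, {linear_at_2000:.0f} W @ 2 GHz")
# Non-linear frequency/power behaviour would push the 2 GHz figure towards
# the upper end of the 130-150 W range quoted above.
```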

Putting all those things together, this would mean that a theoretical PS5 CPU+8CU combo would be a 43 W TDP part*. This is not that crazy given that 8 CUs in an RX 5700 correspond to 40 W of TDP and, in an RX 5700 XT, 45 W. Yes, it's a pretty large improvement in efficiency and performance but we saw the same thing from Vega to Navi.
* [150/(36/8) = 33.3] + 10 W CPU = 43.3 W
The SX, though running at a lower frequency, has more CUs. I *know* that the relationship between frequency and TDP is non-linear; however, I don't know of an equation that can help me estimate a curved TDP/frequency relationship for any given GPU architecture. So, as a first approximation, I'm going to assume a linear relationship...

We have 33.3 W TDP per 8 CUs @ 2 GHz for the PS5; linearly speaking, we get 30.7 W TDP for those 8 CUs @ 1.84 GHz and thus 184 W TDP for the full 48 CU GPU. This would mean that the theoretical TDP of the SX CPU+8CU combo would be 40.7 W. That's still incredibly close to the estimated TDP of the PS5 CPU block, which is somewhat reassuring - from a mathematically beautiful standpoint. The equivalent linear reduction in Navi (RX 5700 XT to RX 5700) gives a 6 W TDP decrease over the 2 GHz to 1.84 GHz range. Yes, my ~3 W figure is only half of that - implying a sizeable increase in efficiency - but it's not that unreasonable based on all the assumed improvements.
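Written out, with rounding that differs slightly from the figures above:

```python
# Linear TDP/frequency scaling for the SX GPU - an explicitly non-physical
# simplification, as noted above.

ps5_per_8cu_w = 33.3                         # 8 CUs @ 2.0 GHz (PS5 estimate)
sx_per_8cu_w = ps5_per_8cu_w * (1.84 / 2.0)  # ~30.6-30.7 W @ 1.84 GHz
sx_gpu_w = sx_per_8cu_w * (48 / 8)           # ~184 W for the full 48 CUs
sx_cpu_plus_8cu_w = sx_per_8cu_w + 10.0      # ~40.7 W with the 10 W CPU

print(f"{sx_per_8cu_w:.1f} W, {sx_gpu_w:.0f} W, {sx_cpu_plus_8cu_w:.1f} W")
```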

Finally, after all that recalculation, the PS5 now has an estimated 43 W + 116.5 W + 5 W ≈ 165 W total die TDP. The Xbox Series X has an estimated 10 W + 184 W = 194 W... or 10 W + 163 W = 173 W TDP if we assume the larger wattage decrease seen in the current Navi GPUs.
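And the final totals, summed from the components above:

```python
# Final die TDP totals under the two GPU-scaling assumptions.

ps5_total_w = 43 + 116.5 + 5   # CPU+8CU, the other 28 CUs, ancillary silicon
sx_rdna2_w = 10 + 184          # 10 W CPU + linearly-scaled 48 CU GPU
sx_navi_w = 10 + 163           # 10 W CPU + Navi-style 6 W/8CU reduction

print(ps5_total_w, sx_rdna2_w, sx_navi_w)  # ~165 (164.5), 194 and 173 W
```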

Looking at the historical APU TDPs of the current generation of consoles, these numbers are pretty much in line with what we've seen before: a 160 W PS4 Pro and a 180 W One X.

Once again, these TDP musings are nowhere near as accurate or definitive as the CPU/GPU performance calculations and even those are estimations based on extrapolations from current and prior technology. I have no insider information - this is purely for fun.
