11 March 2020

Analyse This: Performance VS Processing power... (I was wrong!)

The Xbox Series X SoC in all its glory...
This post has been in the works since I last posted and is in response to certain commentators on the merits and accuracy of that post. I'll get around to the full Xbox Series X and PS5 reveals this week but for now, let me indulge in a bit of navel gazing and shoulder shrugging...

Recently, I wrote a post about the potential CPU performance of the Xbox Series X, based upon statements issued from the official Xbox news blog. This, of course, caused some waves. Many people were unhappy with the theoretical performance of the proposed CPU - despite the fact that it would be incredibly performant by any console standard - i.e. it would be the most powerful CPU ever to be placed into a console form-factor! This is not a "weak" CPU and the Ryzen 5 1600 AF is not a weak CPU (though not what we'd be getting!) and an underclocked 4800H would also not be a weak CPU...

The second most common refrain was from people saying that 4x the processing power correlates with 2x clock speed, combined with 2x the number of processing threads. This argument is flawed on several levels and, aside from that, it's nonsense. Here's why:
This is where I get into a bit of a scientific debate about what is and what is not a metric. TFLOPS are a metric, meters are a metric (the nomenclature being derived from the Greek "metron", a "measure"), clock speed is a metric, core count and thread count are both metrics.

You cannot combine metrics without changing what you are measuring. This is a fundamental part of science and reality. If you measure how far you travelled within a given timeframe, you have measured a new metric - average velocity. Let's say you're travelling at 8 m/s for 4 seconds. When you measure the distance you travelled (32 metres) and the time taken (4 seconds), you don't multiply or add those two measurements to get the final result, i.e. it's not 32x4=128 or 32+4=36. You divide one by the other: 32/4 = 8 m/s.
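As a quick sketch of the travel example, here is the same arithmetic in Python. The function name `average_velocity` is my own, purely for illustration.

```python
def average_velocity(distance_m: float, time_s: float) -> float:
    """Combine two base metrics (distance, time) into a derived one (m/s)."""
    return distance_m / time_s

distance_m = 32.0  # metres travelled
time_s = 4.0       # seconds taken

print(average_velocity(distance_m, time_s))  # 8.0 m/s, the derived metric
print(distance_m * time_s)                   # 128.0, a number but not a velocity
print(distance_m + time_s)                   # 36.0, adds metres to seconds: meaningless
```

Only the first line gives a figure with sensible units. The other two produce numbers, but they don't measure anything.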

The same thing goes for all CPU metrics as well. They are their own measure and, when combined, form a new measure - somewhat unrelated to the original measures that were used to create it but not independent.

A FLOP is a single floating point operation. FLOPS is the number of floating point operations that can be performed per second, and TFLOPS is that number divided by 1x10^12. Now, it just so happens that there are a lot of computational operations that require these calculations and, thus, TFLOPS or GFLOPS are pretty good indicators of performance. HOWEVER, they are not the only indicator of computational performance. The point is, you can compare the number of TFLOPS or GFLOPS as much as you want but when a given benchmark is run, the underlying architectural differences will have a big impact on your final score.
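For anyone who wants to see where a headline TFLOPS figure comes from, here is a rough sketch of the usual peak-throughput calculation. Everything in it is an assumption on my part rather than something taken from this post: the function name, 64 shaders per compute unit, 2 floating point operations per shader per clock (a fused multiply-add), and the publicly reported CU counts and GPU clocks for the two consoles.

```python
def theoretical_tflops(compute_units: int, clock_ghz: float,
                       shaders_per_cu: int = 64,
                       flops_per_clock: int = 2) -> float:
    """Peak single-precision TFLOPS = shaders x FLOPs per clock x clock speed."""
    shaders = compute_units * shaders_per_cu
    gflops = shaders * flops_per_clock * clock_ghz  # a GHz clock yields GFLOPS
    return gflops / 1000.0

# Assumed, publicly reported figures (not from this post).
print(round(theoretical_tflops(52, 1.825), 2))  # Xbox Series X: 12.15
print(round(theoretical_tflops(40, 1.172), 2))  # Xbox One X: 6.0
```

Note that nothing in this formula knows anything about the architecture. It is a ceiling on raw arithmetic throughput, which is exactly why it can't tell you how a benchmark will score.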

This is brought into stark relief with the Navi VS GCN comparisons that have been doing the rounds over the last year or so. The rumoured, speculated and then confirmed 12 TFLOPS performance of the Xbox SX GPU was so over-analysed that it was easy to get lost along the way. Many people speculated that it could be a comparative performance - i.e. 9-10 TFLOPS of the Navi architecture would perform as well as 12 TFLOPS of the GPU architectures found in the Xbox One (XBO) and Xbox One X (XBOX).

Of course, this wasn't the case. The same blogpost that claimed a 4x processing uplift also confirmed the 12 TFLOPS (2x the performance of the old architecture) as a raw TFLOP to TFLOP comparison. In actuality, this means even greater things for the next generation consoles because 12 TFLOPS of Navi performance will be more like 15-16 TFLOPS of the XBO architecture - it is just that much more efficient and performant. 
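To put a number on "that much more efficient", here is a small sketch that backs out the per-TFLOP efficiency factor implied by the 15-16 TFLOPS estimate. The function name is mine, and the factor is only as good as that estimate.

```python
def efficiency_factor(equivalent_tflops: float, raw_tflops: float) -> float:
    """Implied per-TFLOP advantage of the newer architecture over the older one."""
    return equivalent_tflops / raw_tflops

navi_tflops = 12.0  # raw figure quoted for the Series X GPU

for equivalent in (15.0, 16.0):  # estimated XBO-architecture equivalents
    factor = efficiency_factor(equivalent, navi_tflops)
    print(f"{equivalent:.0f} / {navi_tflops:.0f} = {factor:.2f}x per TFLOP")
```

In other words, the estimate assumes each Navi TFLOP does the work of roughly 1.25 to 1.33 TFLOPS on the older architecture.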

However, you may be asking yourselves: "How did we get to that conclusion?"

Well, it's all because of benchmarks. Benchmarks are another metric - an artificial one (or synthetic, as we like to call it in the business) because the benchmark itself only tests for the specific idiosyncrasies of the given programme. However, the metric that a benchmark tests for and provides as a number is not directly correlated to any physical makeup of any particular computational architecture. Some architectures perform very well on certain benchmarks and not on others.

Cinebench is a standard synthetic benchmarking tool used to understand the relative performance of processors...
Going back to the original premise of the article, processor clockspeed and threadcount are two independent metrics that have no direct causation with regards to pure performance. They are correlated with the overall performance of any given computational architecture but they do not guarantee it.

A simple example of this can be found in any CPU family. A single-core Pentium 4 CPU could run at 3.0 GHz. Does that mean each core of a 4 core/4 thread (4C/4T) Core i5-6500 @ 3.2 GHz is only slightly better?

Not even a chance!!!!!

Clockspeed is not an indicator of performance outside of a given microarchitecture because this metric has no direct linear relationship to the overall performance of a piece of silicon. In the same way, the number of threads is not an indicator of performance within or between microarchitectures. Why? Higher numbers of threads with poorer architectural efficiency, or lower numbers of threads operating at a higher clockspeed within a given architecture, will give vastly different performance. Hence why a 2600X might be better than a 2700 when gaming, or why Threadripper is not so much better than any Ryzen chips for gaming, despite the increased core count.

Look, an R7 1700X is 27% less performant than the 3700X, even though the 3700X has only 200 MHz more clock speed (both of them being 8C/16T). That's all down to the IPC and other efficiency benefits that Zen 2 has over Zen 1. There's also some overlap with the improvement in performance granted by the process shrink from 14 to 7 nm but I think that mostly shows up in heat dissipation and power draw - not in raw performance numbers, such as IPC. In fact, you can see from my prior analysis that a 1600 AF is not so much more performant than a 1600 or much less performant than a 2600... despite the process shrink from 14 to 12 nm.
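Here is a back-of-the-envelope sketch of how much of that 1700X/3700X gap is left once you divide out clock speed. The function name is mine, and I'm assuming the 200 MHz refers to base clocks of 3.4 GHz (1700X) and 3.6 GHz (3700X). Real chips boost, so treat the result as a rough guide and not a measured IPC figure.

```python
def per_clock_uplift(perf_ratio: float, clock_ratio: float) -> float:
    """Performance gap that remains after dividing out the clock speed gap."""
    return perf_ratio / clock_ratio

# From the text: the 1700X is 27% less performant than the 3700X.
perf_ratio = 1.0 / (1.0 - 0.27)  # 3700X relative to 1700X, about 1.37x

# Assumption: base clocks of 3.4 GHz (1700X) and 3.6 GHz (3700X).
clock_ratio = 3.6 / 3.4          # about 1.06x

print(round(per_clock_uplift(perf_ratio, clock_ratio), 2))  # 1.29
```

So under these assumptions, clock speed explains only around 6% of the difference and the remaining ~29% has to come from somewhere else, i.e. the architecture.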

So how does this all shake out?

The point is that you need a standardised benchmark in order to make sense of inter-generational architectural performance. You cannot rely on a standard metric because a single metric does not encompass all the potential of a given architecture. However, performance over a specific (or multiple) benchmark(s) gives you perspective across architectural changes and other metric variations (such as differences in clockspeed).

So what exactly does this mean for that Xbox One VS Xbox Series X comparison? It means that TFLOPS are TFLOPS. They are not related to any performance other than their definition - they are not a relative metric. HOWEVER, processing performance is a relative metric, i.e. it's pegged against a benchmark*.
*Somewhere in AMD's or Microsoft's testing centres there's a benchmarking programme that has run on both original Xbox One hardware and SX hardware... and has come up with a number of four times the performance.
It means that these performance numbers and metrics are hardware, platform and architecture agnostic - they don't care what you're running (as long as the hardware supports the instruction sets the benchmark is testing against). In order to be able to compare between any of these items, you need to be agnostic in your determination...

So, coming back to the original premise: those who dissent (i.e. those who believe that "4x processing power of an Xbox One" means "a summation of 2x clockspeed and 2x thread count") are not only wrong but wrong in every assumption, premise and in their understanding of measuring anything. If you scaled the original 8086 processor up by 8 times and increased the clock speed to 3.0 GHz you would not get something comparable to modern processors... the same is true here.

Let's look at a 2x clockspeed plus 2x thread count CPU comparison: a Core 2 Duo E4400 VS a Core i3-7320. That's 2 GHz @ 2 threads VS 4.1 GHz @ 4 threads... It's not a fair comparison and no one is doing this or even wanting to equate these two chips across generations. However, Intel's generational performance has been pretty dire for quite a while now and what we're looking at here is a marginal increase in performance on the single core result because the Core architectures from Nehalem onwards are related to the Core 2 architecture of the Duo E4400.

What this means in terms of comparisons is that, even inter-generationally, Intel's chips have a lot in common but still exhibit big improvements in performance (especially multi-core) though mostly through a reliance on increased clock speed.
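For reference, this is what the "multiply the clocks by the threads" model from the dissenting argument actually predicts for that Intel pair. It's a sketch of the flawed model, not of anything I'd recommend, and the function name is my own.

```python
def naive_scaling(old_ghz: float, old_threads: int,
                  new_ghz: float, new_threads: int) -> float:
    """The flawed model: performance ratio = clock ratio x thread ratio."""
    return (new_ghz / old_ghz) * (new_threads / old_threads)

# Core 2 Duo E4400 (2 GHz, 2 threads) VS Core i3-7320 (4.1 GHz, 4 threads)
print(naive_scaling(2.0, 2, 4.1, 4))  # 4.1
```

The function has no input for IPC, cache, memory or instruction sets, so it would return the same 4.1x for any two chips with those clocks and thread counts, whatever the silicon underneath. Only a benchmark run on both chips can tell you what the real ratio is.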

Switching over to AMD, things aren't quite so straightforward... The Athlon X2 250 (2C/2T @ 3 GHz) versus the Ryzen 3 1200 (4C/4T @ 3.1 GHz) shows that if you could just multiply the increase in cores by the increase in clocks and obtain "4x" the performance, the Athlon would be performing above the Ryzen, clock for clock, in single-threaded performance.

Clearly, in real world situations this is not, and could not be, the case; otherwise AMD would have stuck with that architecture and would not have bothered innovating beyond the discontinued K10 or Bulldozer architectures.

Due to AMD's switching architectures, direct core/frequency comparisons are actually much more difficult. However, suffice it to say that Ryzen and Zen 2 are significantly more performant than prior architectures, whereas Intel's architectures are more small-step evolutions, iterating on a common design footprint (albeit with numerous optimisations and instruction set inclusions).

Finally, I come to the "shoulder shrugging" portion of my blogpost:

If I was an emoji, this is what I'd be right now...

So, I was wrong.

Xbox Series X has its CPU pegged at 8C/8T @ 3.8 GHz or 8C/16T @ 3.6 GHz.

HOWEVER, that's not the whole story. Yes, I was wrong in my estimation... but I was wrong because Microsoft stated something that is patently false. I've looked at the specs and there's only one thing about the SX which is 4x anything of the XBO and that's the number of CUs/shaders in the GPU. They completely messed up the wording of that sentence.

HOWEVER, to be fair to Microsoft - one thing I didn't realise at the time was that there was clarification further down the article. They state that the Zen 2 CPU gives "4x the performance of an Xbox One X". That's quite a bit of a difference, there, Microsoft!

4x what?!!
Digital Foundry had that comparison I had previously linked to, where they simulated an Xbox One and Xbox One X, then extrapolated to a "Next Gen Xbox". Pegging the Athlon 5370 at 2.3 GHz, they achieved a Cinebench single core score of 49 and a multicore score of 183. If we multiply those values by 4, we get 196 and 732 (1464 when doubled for 16 threads), respectively. The single core score is very close to the Ryzen 7 4800U's score of 192 but the multicore score is slightly in excess of the U's 1302 and slightly lower than the 4800H's 1712.
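The arithmetic above is easy to check. Here's a short sketch using only the scores quoted in this post; the helper name `pct_diff` is mine.

```python
def pct_diff(value: float, reference: float) -> float:
    """Percentage by which value differs from reference."""
    return (value - reference) / reference * 100.0

# Simulated One X scores (Athlon 5370 @ 2.3 GHz), as quoted above.
one_x_single, one_x_multi = 49, 183

target_single = one_x_single * 4           # 196
target_multi_16t = one_x_multi * 4 * 2     # 732, doubled for 16 threads = 1464

print(f"single VS 4800U (192):  {pct_diff(target_single, 192):+.1f}%")     # +2.1%
print(f"multi VS 4800U (1302):  {pct_diff(target_multi_16t, 1302):+.1f}%")  # +12.4%
print(f"multi VS 4800H (1712):  {pct_diff(target_multi_16t, 1712):+.1f}%")  # -14.5%
```

So the "4x a One X" target lands about 2% above the 4800U in single core and sits between the U and the H in multicore.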

I discussed last time that the Renoir's reduction in cache means that multicore performance will suffer slightly and since the 3700X is a fully-enabled desktop part (though running at a decreased clockspeed in this scenario) it still retains that extra L2 & L3 cache. This explains the discrepancy with regards to multicore performance.

However (again), there's still a discrepancy between the stated clockspeeds of the SX and the clockspeeds of the 3700X, 4800U and 4800H. A 3700X operates between 3.6 - 4.0 GHz and a U operates between 1.8 - 3.2 GHz whilst an H operates between 2.9 - 4.2 GHz. So, in reality, it's kind of strange that the performance isn't more like 5 or 6 times a One X, but it seems that the Renoir cores are constrained in some manner (perhaps the cache?), which means that single core performance doesn't scale well with clockspeed - both the U and H have identical 1-core performance in Cinebench R15 and both are slightly below the 3700X, despite a 1.0 GHz difference in boost frequency.

So, in the end, my previous methodology was correct but I fell victim to a typo/mis-typed sentence from Microsoft's big reveal.

Next time... the analysis of the two revealed consoles' specs and what they mean!
