Hole in my head: Analyse This: Next Generation PC gaming... [UPDATED 04-Sep-2020]

Nvidia's much-anticipated event has literally just wrapped up and there's actually not too much to parse in terms of reveals. The big one - no increase in release price - is, for me, the primary shocker. I think it's great... but at the same time, we're still talking RTX 20 series levels of pricing... but I find it hard to complain about getting the performance we really deserved with the 20 series for the price increase. Let's not forget, graphics hardware releases are slowing considerably and what was once a 50% performance increase over 2 years has now become a 70-80% performance increase over 4 years, at an increased price point.

But let's put that behind us and just jump in and have a poke around the reveal...

I'm going to ignore game-related stuff and focus purely on the tech side of the equation since I personally don't care whether game "X" supports raytracing or not. I'm just interested in the technology and what's coming in the future. This won't be a deep dive because, frankly, I don't have the time or expertise to do one, but these are the big impressions that were made on me during the presentation. Bearing that in mind, some of these "hot takes" may be a little off in my understanding because of the limited information available to the public.

Ampere overview...

First up, the interesting thing to me is that each SM is now able to perform 2 simultaneous shader operations per clock per processing block, of which there are 4 per SM (see the headline graphic). This comes despite an apparently "static" Texture Processing Cluster (TPC) configuration in the Streaming Multiprocessor (SM), giving rise to 8704 ~~4352~~ CUDA cores across 68 ~~136~~ SMs, or around 296% of the FP32 CUDA core count of the 2080 (an increase of ~~48%~~ roughly 196%). What's interesting here is that, if there's been no reconfiguration of the TPCs and SMs, then the CUDA cores themselves have been reconfigured: it seems that the number of CUDA cores has increased in each SM to enable two separate instructions to be processed simultaneously. Given finite resources, this is likely to mean that each SM is able to perform "half" instructions on ~~just two blocks of~~ the four available (~~16~~ 32 CUDA + 2 Tensor) blocks when INT calculations are required.
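As a rough sanity check on those numbers, here is a small Python sketch of the per-SM arithmetic. The 8704 CUDA cores, 68 SMs and 4 processing blocks per SM are the figures quoted above; the 2944 CUDA core count for the RTX 2080 is an assumption brought in for the comparison and is not stated in this post.

```python
# Figures quoted in the post for the RTX 3080.
CUDA_CORES_3080 = 8704
SM_COUNT_3080 = 68
BLOCKS_PER_SM = 4

# Assumption (not from the post): the RTX 2080 ships with 2944 CUDA cores.
CUDA_CORES_2080 = 2944

# How many FP32 CUDA cores land in each SM and in each processing block.
cores_per_sm = CUDA_CORES_3080 // SM_COUNT_3080
cores_per_block = cores_per_sm // BLOCKS_PER_SM

# Core count of the 3080 relative to the 2080.
ratio_vs_2080 = CUDA_CORES_3080 / CUDA_CORES_2080

print(f"CUDA cores per SM: {cores_per_sm}")
print(f"CUDA cores per processing block: {cores_per_block}")
print(f"3080 core count relative to the 2080: {ratio_vs_2080:.2f}x")
```

Under those assumptions this works out to 128 CUDA cores per SM and 32 per processing block, which is double the 16 per block that the earlier 4352-core figure would imply.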

[UPDATE] The proper specs were listed on Nvidia's site so I updated this section. See the end of the post for the proper analysis...

This is the SM diagram from the Turing architecture...

The full 2080 die* is able to do 1.7x the RT calc and also double the inference of the sparse matrix information for DLSS upscaling and other AI-related tasks that can utilise such algorithms at the same boost clock speed of 1710 MHz. This is pretty impressive! However, we don't really have enough information to understand how these performance improvements have occurred unless there are now double the number of Tensor and RT cores per SM... and as far as I'm aware, there have been no leaks surrounding these items.

*Jensen is careful to speak ONLY in terms of the 3080 configuration throughout the presentation...

This is sort of mimicking the technology that is being used on the next gen Xbox consoles... but not quite.

I/O boost...

Nvidia announced their RTX I/O, which is their added-value version of the DirectStorage API (a part of DirectX) from Microsoft. Interestingly, Microsoft also released a new dev blog today covering this new technology - but I'll get to that in a moment. I've covered before how next gen games are going to require 1-2 CPU cores to move around uncompressed data from the SSD storage. The presentation slide above confirms this but then goes a step further and covers what processing power would be required for compressed data* - 24 CPU cores.

Now, Nvidia didn't cover what sort of compression method was being utilised so I can't compare between the Xbox Series X and this example but it's telling that where Xbox will save the work of 5 CPU cores, using only 0.1 of a CPU core in conjunction with the hardware decompression block, Nvidia will be saving the work of 24 cores, using only 0.5 of a CPU core. For comparison, Sony will be saving 10-11 CPU cores but we don't know how much remaining overhead is required from their CPU block.

*Interestingly, Microsoft have stated that the Series X's hardware decompression unit is saving the work of 5 CPU cores in terms of decompression. Nvidia's figure of 24 cores is nearly five times that.
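To put the two offload claims side by side, here is a minimal Python sketch that divides the CPU cores' worth of decompression work avoided by the fraction of a core still used. The input numbers are the ones quoted above; the "offload ratio" itself is just an illustrative metric, not something either company has published.

```python
# (cores' worth of decompression work avoided, CPU cores still used),
# as quoted in the post.
OFFLOAD_CLAIMS = {
    "Xbox Series X": (5, 0.1),
    "Nvidia RTX I/O": (24, 0.5),
}


def offload_ratio(cores_saved, cores_used):
    """Cores of work avoided per CPU core still spent (illustrative metric)."""
    return cores_saved / cores_used


ratios = {
    name: offload_ratio(saved, used)
    for name, (saved, used) in OFFLOAD_CLAIMS.items()
}

for name, ratio in ratios.items():
    print(f"{name}: ~{ratio:.0f} cores of work avoided per core used")
```

On these figures both solutions come out at roughly 50:1, which suggests the bigger headline number from Nvidia reflects a higher data rate rather than a fundamentally more efficient offload.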

It's clear that the amount of CPU resources "saved" is directly linked to the I/O requirements, which are directly linked to the output of the hardware. Which is where the DirectStorage blogpost comes into play. It is strongly hinted at in the post that supported systems and NVMe drives will be compatible with this API. However, it isn't really clear what "supported" means given the lack of clarification in the post itself.

Leaning on the Nvidia keynote, I am betting that the limitation is not whether you have a compatible graphics card (though this certainly will be required) but PCIe Gen 4. We've already seen that Horizon Zero Dawn is helped when moving from 8x to 16x PCIe Gen 3 and, if we had more powerful graphics cards that supported Gen 4, this trend would likely continue. We also see that its performance improves with increasing cores, despite this being a game that runs well on an underpowered PS4 base platform.

At any rate, Nvidia are indicating that up to 14 GB/s could be transferred in this manner, opening up the way to native high refresh rate 4K and 8K gaming being possible.

[UPDATE]

I forgot to note that, at the very end of the blog post, Microsoft state that they plan to give a preview to developers in 2021... so don't expect this to matter until 2023 at the earliest! Don't worry about compatibility or buying a PC that can handle this right now... it is not going to affect games in any way.

This is the card I'm most interested in out of this line-up... it's also interesting that it features a shortened PCB with a pass-through front fan design. It also explains why the power pin is in the middle of the card - there is no card for it to be part of at the end of the casing!

The Ecosystem...

Nvidia's whole presentation was around 40 minutes long and they spent ~11.7 mins of it speaking about the software ecosystem they're building around their hardware. That's significant and it's readily apparent that they plan on leveraging their in-built advantage when it comes to machine learning in their RTX architecture - something that AMD has yet to show us.

In the same way that I (and many other commentators) have previously said that "the pixel is dead", it is now apparent that Nvidia believes that traditional graphics performance will no longer "win" a generation. Instead, like Apple before them, Nvidia are very focussed on providing additional value, beyond mere rendering. To that end, they have DLSS, RTX Voice, Machine Learning (character concepting, voice to facial animation, machine learning character animation, video to 3D, sparsity denoising, physics simulation), Omniverse Machinima, Nvidia Reflex, Nvidia Broadcast, ShadowPlay and Ansel... I may even have missed a couple!

In the same way that microchip superiority is no longer about Moore's Law, the graphics acceleration business is no longer about performance - we've reached peak performance!

It's strange but they mention the 2080 compared to the 3080 but it's not on their chart (I added it for them) and released performance numbers from Digital Foundry put it at around 1.7-1.8x faster, not 2x.... I wonder which cherry-picked game is 2x faster?

The Cost of Performance...

Going from the above numbers I mentioned in the intro paragraph, the RTX 3080 is around 44-47% faster than an RX 5700 XT (26%*1.7 or 1.8) and we're getting that performance at $699 whereas the RX 5700 XT was $399. If the rumours are true and "Big Navi" is simply a doubling of the number of compute units on the graphics die (along with RT and other new additions and optimisations) then we can expect around 74% performance uplift from the RX 5700 XT to the 80 CU RDNA 2.0 die, putting it slightly ahead of the RTX 3080 in rasterisation performance on average*.

*We still don't know how good RT is on RDNA 2.0 or if it even supports machine learning super-sampling (MLSS) to the same extent as Nvidia does with DLSS.

However, a theoretical RX 6700 XT is unlikely to be the Big Navi - this would probably be reserved for an RX 6900 XT variant in the naming scheme. So where would an RX 6700 XT fall? So far, various leaks have pointed to 80, 72 and 56 CU counts in next gen Navi-based cards. Those might end up being the 6900 XT, 6800 XT and 6700 XT providing us with increases of 74%, 59% and 30% of the RX 5700 XT performance (48%, 33% and 4% faster than the RTX 2080 respectively at the same clock speed as the RX 5700 XT).
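The CU-count estimates above can be reproduced with a short Python sketch. It assumes the 74% uplift for a doubling of compute units scales linearly with the number of extra CUs, and that the RX 5700 XT has 40 CUs (neither is spelled out in the post, so treat both as assumptions). The 26% figure is the post's lead for the RTX 2080 over the RX 5700 XT. The sketch prints the simple percentage-point difference, which is what the quoted figures match, alongside a compounded ratio for comparison.

```python
BASE_CUS = 40                # RX 5700 XT compute units (assumption)
UPLIFT_PER_DOUBLING = 0.74   # post's estimate for going from 40 to 80 CUs
RTX_2080_LEAD = 0.26         # post's figure for the RTX 2080 over the RX 5700 XT


def uplift_vs_5700xt(cus):
    """Estimated uplift over the RX 5700 XT, linear in the extra CUs (assumption)."""
    return UPLIFT_PER_DOUBLING * (cus / BASE_CUS - 1)


def lead_vs_2080_points(cus):
    """Percentage-point difference, matching the figures quoted in the post."""
    return uplift_vs_5700xt(cus) - RTX_2080_LEAD


def lead_vs_2080_compounded(cus):
    """Ratio of the two relative performances, expressed as a lead."""
    return (1 + uplift_vs_5700xt(cus)) / (1 + RTX_2080_LEAD) - 1


for cus in (80, 72, 56):
    print(
        f"{cus} CUs: {uplift_vs_5700xt(cus) * 100:.0f}% over the RX 5700 XT, "
        f"{lead_vs_2080_points(cus) * 100:.0f} points over the RTX 2080, "
        f"{lead_vs_2080_compounded(cus) * 100:.0f}% over the RTX 2080 when compounded"
    )
```

The linear model gives back 74%, 59% and 30%, and the point differences give back 48%, 33% and 4%. Compounding the ratios instead gives somewhat smaller leads of roughly 38%, 26% and 3%, so the quoted figures are probably best read as an upper bound.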

Even if next gen Navi cards do not manage to significantly raise their clock speeds above 1800 MHz they are likely to perform very competitively with Nvidia's offerings and that, I believe, is why Nvidia did not raise prices again for this coming generation.

I do wonder about AMD's pricing strategy though... Do they need to keep prices lower in order to increase marketshare or will pure performance metrics alone manage that in their eyes? If the theoretical RX 6900 XT is around 10% faster on average compared to the RTX 3080 and comes in at $499-599, is that enough? If the RX 6700 XT comes in at a static $499 price point with RTX 2080 raster performance but better ray tracing, is that enough?

[UPDATE] I realised that I messed up here since my comparison above is for the 2080, NOT the 2080 Ti. This means that the performance of the RX 6900 XT would be around 3-5% faster than an RTX 3070 at current RX 5700 XT clock speeds, NOT a 3080...

~~This puts an entirely new light on this conclusion - Nvidia have probably left AMD in the dust and it also means that the next gen consoles have very low levels of performance...~~

Nvidia have gone all-in on the ecosystem this time around and it's an impressive ecosystem that is bound to capture the imaginations and interest of the gaming public. Yes, AMD might have caught up in pure numbers but they're still lagging far behind in the gamer mindshare and brand... and you can argue that it was Nvidia's investment in AI these last ten years that has set them up to jump ahead of AMD just as they have finally caught up...

[PROPER UPDATE]

I'm not usually someone who writes quickly or covers things in almost realtime and this one attempt really came to bite me in the ass. No sooner had I published this than it turned out that the official AIB details were all wrong, giving me the wrong information on the composition of the Streaming Multiprocessors (which I've now corrected).

I also made a slight error because I calculated the theoretical performance of the next gen AMD cards but then compared the performance numbers to the wrong card! Oops! That's now corrected as well. But I want to address both of these points to make them a bit clearer:

My mock-up of a GA102 Streaming Multiprocessor (SM), mushed together from the A100 whitepaper and the information given by Nvidia at the presentation for the RTX 30 series cards... and one provided by Ryan Smith from AnandTech over on Twitter.

[UPDATE] I've added in a new diagram that's surfaced as the embargo is apparently to lift today so we'll get much more accurate information and I won't have to speculate any further on this! :) I'll add a link once the article itself drops. However, the interesting thing I see is that the INT pipeline is merged with an FP32 pipeline (fewer supporting blocks are required, hence the reduction in LD/ST units).

I am a bit concerned if the block diagram is 100% accurate because it appears that there's only a single tensor core unit? I understand that these are new tensor cores but it's still a halving of what was in Turing. It may just be a layout thing and their improved operation from that point but I'm now left with a lot of questions regarding the claimed tensor core performance.

In the whitepaper covering A100, the performance claim over Volta was 10x (20x with sparsity) with a total of 156 TFLOPS (312 TFLOPS) for TF32 TC ops over 108 SMs. However, for the GA102 presentation, with only 84 SMs, Nvidia was claiming 238 Tensor TFLOPS. This corresponds to the sparsity figure and shows a slightly lower than 1:1 scaling for the reduction in SM count. But I don't think Turing had the ability to work with sparse matrices so the comparison is more like 121 TFLOPS (156/108*84) to 89 TFLOPS, which is only 1.4x as much throughput... in the worst case scenarios - for the RTX 3090! (It now turns out that the 89 TFLOPS figure is for the 2080 Super and not the full-die 2080 Ti!! AND it is FP16, NOT FP32... with Deep Learning. Nvidia's messaging and marketing is really causing a lot of unnecessary confusion here!)

The RTX 3080 will feature slightly less than 80% of those numbers and not so large of a leap over the RTX 2080 as we're moving from 48 SMs to 68 SMs - approximately a 42% increase overall. This would be around 89 tensor TFLOPS to 98-119 TFLOPS, an increase of 1.1-1.3x before we add in additional throughput with sparsity (2.7x).
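The SM-count scaling used in the last two paragraphs can be written out as a small Python sketch. All of the inputs are figures quoted above; the one real assumption is that tensor throughput scales linearly with SM count, which ignores clock speed and any differences between the A100 and GA102 tensor cores.

```python
A100_SMS = 108
A100_DENSE_TFLOPS = 156      # TF32 tensor ops, without sparsity
A100_SPARSE_TFLOPS = 312     # TF32 tensor ops, with sparsity
GA102_SMS = 84               # full GA102 die
RTX_3080_SMS = 68
GA102_CLAIMED_TFLOPS = 238   # Nvidia's presentation figure
TURING_TFLOPS = 89           # Turing figure used for the comparison


def scale_by_sm(tflops, from_sms, to_sms):
    """Scale throughput linearly with SM count (a simplifying assumption)."""
    return tflops / from_sms * to_sms


ga102_dense = scale_by_sm(A100_DENSE_TFLOPS, A100_SMS, GA102_SMS)
ga102_sparse = scale_by_sm(A100_SPARSE_TFLOPS, A100_SMS, GA102_SMS)
rtx_3080_dense = scale_by_sm(A100_DENSE_TFLOPS, A100_SMS, RTX_3080_SMS)

print(f"Full GA102, dense estimate: {ga102_dense:.0f} TFLOPS")
print(f"Full GA102, sparse estimate: {ga102_sparse:.0f} TFLOPS (claimed: {GA102_CLAIMED_TFLOPS})")
print(f"Dense estimate vs Turing: {ga102_dense / TURING_TFLOPS:.1f}x")
print(f"Claimed figure vs Turing: {GA102_CLAIMED_TFLOPS / TURING_TFLOPS:.1f}x")
print(f"RTX 3080, dense estimate: {rtx_3080_dense:.0f} TFLOPS")
```

This gives back the 121 TFLOPS dense estimate for the full die, about 243 TFLOPS with sparsity (against the 238 claimed), and roughly 98 TFLOPS for a 68 SM part, which is the lower end of the range quoted for the RTX 3080.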

We really need these cards in the hands of reviewers because I'm not yet convinced of the overall improvement we're going to see for this new generation... there are too many questions about the data we've been given so far!

First off, Nvidia have pulled a little bit of a trick here. Yes, they've vastly increased the number of CUDA cores (what they call the floating point blocks in their diagram above) and have actually ~~replaced the 8x FP64 blocks in the design of the A100 SMs with an additional~~ combined the 16x INT32 blocks with 16x FP32 blocks but what it seems they haven't done is vastly improve the throughput of the processing blocks within each SM or the connectivity between SMs and the rest of the architecture.

I also wonder if the efficiency of those combined blocks is less than pure INT or FP units...

So, just like the A100 server architecture, you're not getting a 2x improvement in performance for the additional CUDA cores because the amount of improvement is related to which types of workloads are being performed. In fact, it seems like you're getting around 1.25x to 2.7x depending on whether you're working on 4xFP16, 2xFP32, or mixed INTXX/FPXX etc. workloads.

I'm going to go out on a limb here and call the RTX 30 series cards the FX-series of the GPU world... However, Nvidia have managed to brute force their way to actual performance, unlike AMD in their CPU line-up.

This is an architecture which is going to have a highly variable performance that is entirely dependent on the workload. Now, it may be that in some workloads, performance is on par with SM-equivalent RTX series cards but in certain others it is vastly superior on a per-SM basis.

This is an opportunity for AMD and their RX 6000 series of cards.

The simplified and obfuscated Series X GPU block diagram...

As I said above, the RX 6900 XT, 6800 XT & 6700 XT are probably 48%, 33% & 4% faster than an RTX 2080 at RX 5700 XT clock speeds (i.e. 1755 MHz game clock). However, that's not the full story. The expected clock speeds for Navi 2x are around 2.1 GHz. Working that into the equation we get a completely different result.

Whilst the RTX 30 series cards' clock speeds have not really budged past a 200 MHz improvement, Navi is expected to improve by around 250-350 MHz. Taking the worst case scenario of a ~250 MHz increase to 2.0 GHz, we get a 69%, 52% & 18% increase over the RTX 2080. A best case scenario (~350 MHz) gives us 77%, 59% & 24% over the RTX 2080.
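For anyone wanting to check the clock-speed adjustment, here is a minimal Python sketch. It assumes performance scales linearly with clock speed (optimistic, since memory bandwidth and power limits usually get in the way) and takes 2000 MHz and 2100 MHz as the worst and best case targets. The base-clock leads are the figures quoted above.

```python
BASE_CLOCK_MHZ = 1755  # RX 5700 XT game clock, as quoted in the post

# Leads over the RTX 2080 at RX 5700 XT clocks, as quoted in the post.
LEAD_AT_BASE_CLOCK = {
    "RX 6900 XT": 0.48,
    "RX 6800 XT": 0.33,
    "RX 6700 XT": 0.04,
}


def lead_at_clock(lead, target_mhz, base_mhz=BASE_CLOCK_MHZ):
    """Scale relative performance linearly with clock (a simplifying assumption)."""
    return (1 + lead) * (target_mhz / base_mhz) - 1


for target_mhz in (2000, 2100):
    for card, lead in LEAD_AT_BASE_CLOCK.items():
        scaled = lead_at_clock(lead, target_mhz)
        print(f"{card} @ {target_mhz} MHz: {scaled * 100:.1f}% over the RTX 2080")
```

The results land within a percentage point of the figures above; the small differences are down to rounding.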

Given Digital Foundry's analysis of the games they showcased, the RTX 3080 has a 60-80% increase over the RTX 2080.

I know that Navi 2x's architecture is quite different from Ampere's but it's apparent that both architectures have made trade-offs in the workloads they are able to perform. I explained Ampere's idiosyncrasies above but I did also outline Navi 2x's before - being unable to perform RT and texture ops at the same time.

After all the corrections and misconceptions flying around after the Nvidia event I had become confused but now I've had enough time to absorb the information and get it into a better collection in my mind.

The RX 6000 cards are likely very competitive with Nvidia's RTX 30 series. Not with the RTX 3090 but then when you cut down your SMs to 80% of the total die's worth, you're losing a lot of performance. The RX 6900 XT is likely to perform equivalently to the RTX 3080 (320 W) with a lower TDP (say, 240 W for 3 W per CU in an 80 CU card) and, personally, I think that's a massive achievement. Whether they will dethrone Nvidia will be determined by the price point. If they are able to come in at $50-100 less than Nvidia's price point with more RAM and using less power (considering their process node's higher yields), I think they could really hurt Nvidia.

Could AMD really kill Nvidia? No. They're late, again. They're quiet, again. They're obtuse, again.

Nvidia really understand how to market and how to position their products. AMD are really not able to compete with them in this field. On the CPU side, AMD are lucky that Intel are completely incompetent at the moment when it comes to execution on the technical front but also on the marketing front as well.

Hole in my head

1 September 2020
