21 July 2023

The Performance Uplift of Ada Lovelace over Ampere...

 

One thing I feel I may be known for is being one of the first people to point out that there was no "IPC" uplift for AMD's RDNA 2 over the original RDNA architecture. I never had an RX 5000 series card to check with, but Hardware Unboxed later confirmed this in practice, so it was nice to feel validated.

Now, I am aware that RDNA 3 is little more than a frequency-adjusted RDNA 2 (because its extra FP32 configurations do not appear to be easily used by existing programmes), but the question still burns within me: have Nvidia been able to increase the clock-for-clock performance of their architecture from Ampere to Lovelace?

Let's find out...


The Intro...


Previously, I predicted that Nvidia would mostly be relying on process node advantages for their next architecture: I wondered if it was really necessary for Nvidia to focus on rasterisation improvements over other features like raytracing, given the lacklustre increase in performance from the RTX 20 to 30 series...

In contrast, RDNA 2 felt like a big step for AMD. Sure, it had lots of success in improving data management and, as such, helped reduce the architecture's energy use. But, as I conjectured back then, the performance improvements were coming from the increased clock speeds, not from any other architectural magic.

On the surface, Nvidia's Lovelace architecture looks a lot like the tech giant took a sneaky glance at AMD's homework and applied their own variation on the theme of increasing cache size and decreasing the width of the memory interface.

In fact, the internals of the SM are the same between the two architectures. But can the large L2 cache actually help improve calculation speed? On the face of it, you might assume it could, since the 3D V-Cache on AMD's CPUs vastly helps throughput by keeping important working data local.

Now, unfortunately, there is no facility that I'm aware of that allows the user to disable parts of the GPU silicon (as we can with CPU cores) to simulate lower-tier parts with fewer resources. Fortunately, we (and really, I specifically mean "me") are blessed with a GPU from each of the RTX 30 and RTX 40 series lineups configured with the same number of cores. In fact, a quick comparison between the RTX 3070 and RTX 4070 shows that they are remarkably similar! (Which is not the case for many of the available parts.)

Core specs of the two cards...

Sure, the number of ROPs is reduced and, of course, the amount of L2 cache has been increased (a major part of the architectural upgrade talking points) and, as AMD did with RDNA 2, the video memory bus width has been decreased on account of that larger L2 cache. This last point comes about because the so-called "hit rate" for data will be higher on-chip, so there is less need to travel out to VRAM to fetch what is needed for calculations.

This is great, because memory controllers are expensive in terms of die area and energy cost... plus, they are expensive in terms of circuit board design and component cost - you have to run those traces, place the GDDR chips, and manage the power phases for the additional components. If you can get effectively the same performance (or close enough!) by removing some of them, then you've got a product that is cheaper to produce...

The trade-off is that this cache effect is stronger at lower resolutions - at higher rendering resolutions, the wider VRAM bus wins out, i.e. high-end cards need a large memory bus width. Now, some of this discrepancy can be countered by greatly increasing the frequency of the memory: faster memory can do more in the same period of time. This is precisely what Nvidia have done with the RTX 4070, pairing it with fast GDDR6X running at 21 Gbps compared to the relatively measly 14 Gbps GDDR6 on the RTX 3070.
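For reference, here's a quick back-of-the-envelope sketch (in Python) of how those memory speeds translate into peak bandwidth. I'm using the published 256-bit and 192-bit bus widths of the 3070 and 4070 here, so treat it as a rough illustration rather than anything exhaustive:

# Peak VRAM bandwidth = (bus width in bits / 8) * per-pin data rate in Gbps
def bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    return bus_width_bits / 8 * data_rate_gbps

print(bandwidth_gb_s(256, 14))  # RTX 3070: 256-bit GDDR6 @ 14 Gbps -> 448 GB/s
print(bandwidth_gb_s(192, 21))  # RTX 4070: 192-bit GDDR6X @ 21 Gbps -> 504 GB/s

So, even with a quarter of the bus lopped off, the faster GDDR6X keeps the 4070 slightly ahead on paper.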

So, my hope is that this will be an interesting test to see how much the increased L2 cache really helps with the Ada Lovelace architecture...


AMD released this slide, speaking about the trade-off between cache size and data hit-rate...



The Setup...


Just like HUB did in their RDNA 2 testing, the aim is to lock the core clock frequency to a setting that both cards can achieve. I used MSI Afterburner for this as it integrates well with Nvidia GPUs. This was more of a challenge for the Ampere card, which kept wanting to drop its clocks very slightly. However, you will see that, in practice, a ~20 MHz difference isn't really an issue.

In contrast, the Lovelace card had no issues and took the down-clocking in its stride. Of course, this strategy isn't fool-proof: by reducing the core clocks, we are also slowing down all the on-die caches, and this may have a larger, non-linear effect on performance than we might expect.

As a result, I will also compare the stock and underclocked results to see how the performance scales.

One thing I also wanted to mitigate as much as possible was the vast gulf in memory frequency between the two parts. At stock, the 3070 has 14 Gbps memory while the 4070 has 21 Gbps, owing to the use of GDDR6X.

So, I increased the 3070's memory to 16 Gbps and reduced the 4070's to 20 Gbps*. This doesn't close the gap as much as I'd like but it's better than nothing!** 

One last note - as usual, I am recording the tests using Nvidia's FrameView, as I find that internal game benchmarks do not usually line up with the actual results. For instance, though I've used the included benchmark in Returnal, the values given for average, minimum, and maximum fps vary slightly from the results calculated using FrameView.
*I am not aware of any way to reduce this further as underclocking the memory is quite limited in the Afterburner software. 

**In retrospect, I should have left the 4070's memory speed where it was as these alterations actually put the theoretical memory bandwidth of the 3070 at 512 GB/s and the 4070 at 480 GB/s - swapping them! Next time, I won't make the same mistake :) 
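Running the same sum as before for the adjusted speeds shows the swap I'm talking about (again, just simple arithmetic in Python):

# Same formula as earlier, with the adjusted memory speeds used in this test
print(256 / 8 * 16)  # RTX 3070 @ 16 Gbps -> 512 GB/s
print(192 / 8 * 20)  # RTX 4070 @ 20 Gbps -> 480 GB/s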

Here, we can compare the effect of memory speed and core clock speed on the result...

Returnal is a game that is completely GPU-limited, so the above results are not being constrained by the i5-12400 in my system. Looking at the results, we can see that the performance drop is, indeed, non-linear: for a 30% drop in core clock speed, we lose around 18% of the performance, and we claw back another couple of percent as we bump the memory frequency from 20 Gbps to 21 Gbps and then 22 Gbps, for a loss of only 16%.
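One way to put a rough number on that non-linearity is to fit a simple power law to the percentages quoted above. The exponent below is purely illustrative - a back-of-the-envelope figure for this title at these settings, not a measured property of the card:

import math

clock_ratio = 0.70  # ~30% core clock reduction (from the text above)
perf_ratio = 0.82   # ~18% performance loss (from the text above)

# If performance scaled perfectly with clock speed, this exponent would be 1.0
exponent = math.log(perf_ratio) / math.log(clock_ratio)
print(f"performance ~ clock^{exponent:.2f}")  # roughly clock^0.56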

I should note that I am not quite sure why the minimum fps is lower at the stock settings, but this could be one of those runs where there was a hiccup in the system. I would disregard comparisons to the minimum fps value in the stock configuration for this test, since the same behaviour was not observed when raytracing was enabled.

While I didn't perform the memory scaling test with RT enabled, we see a similar 19% performance loss for the 30% reduction in core clock frequency - which I think is pretty impressive!

What IS interesting to note is that we are confirming, here, that the RTX 4070 is limited by its memory bandwidth. Unfortunately, I could not manage a higher memory overclock on my card, but it seems clear to me that Nvidia could pull out more performance by upping the spec even further to the 24 or 26 Gbps standards (or, you know, by widening the bus!).

This, in my opinion, has implications for the RTX 4060 series cards - they use GDDR6, with the 4060 on 17 Gbps memory and the Ti variants on 18 Gbps memory. We've already seen in reviews from many outlets that these cards are bandwidth-constrained, but I wonder at the potential for extra performance if they had been paired with GDDR6X like their bigger siblings...


With Returnal, we can see zero uplift between the two architectures...


With the downclocking, we can see that the RTX 4070 is essentially an RTX 3070 in performance - there is zero architectural improvement observed here, just as with RDNA 2. What is surprising to me is that the RTX 3070 actually slightly beats its successor in almost all metrics - though the 4070 has a slightly more consistent presentation, with fewer sequential frametime excursions beyond 3 standard deviations. I would speculate that this might be the larger L2 cache in action. However, it's not a large improvement and the player would probably not notice the difference.


In CPU-bound areas with high-density assets, the frametime spikes can be pretty brutal, even after a few patches...


Jedi Survivor's semi-open worlds provide a different challenge, one that is bottlenecked by the CPU and system memory - resulting in large frametime spikes when traversing the world. Again, the experience between the 3070 and the clock-matched 4070 is very similar, with the 4070 only pulling ahead when ray tracing is enabled. Again, something I am guessing is related to the larger L2 cache, since we know that some RT workloads can benefit from the higher bandwidth and lower latency of staying on-chip.

The lower maximum frametime spike in the RTX 4070 RT-on test is explainable: I messed up that run by getting into an unplanned extended fight with some battle droids, and just looped around again without re-loading. That spike, which is pretty consistent across all first-time runs, is reduced on the second go-around as some data must already be resident in RAM, significantly reducing its severity. I bet you can guess the location of the spike, too - it's just as you enter Pyloon's Saloon...



Next up, I have two tests from within Hogwarts Legacy, as I wanted to test the effect of moving quickly through the world as well as the traditional run around Hogsmeade. Each of these tests is performed from a fresh load into the game, so there are no shenanigans like the ones above with Jedi Survivor.

With RT off, both cards perform very similarly - though the RTX 4070 is more consistent in its presentation, with fewer excursions noted. RT on tells a different story, though: the 3070 outperforms the 4070 by a decent margin in Hogsmeade! To be honest, I cannot fully explain this result... I might have guessed that it was related to the lower memory bandwidth, but I tested with 21 Gbps memory and got a result that was within the margin of error.

It's possible that this is related to down-clocking of the die itself, with everything running a little slower.


Moving on to the broom flight test, we see the expected results - with RT off, the 3070 very slightly outperforms its successor, while with RT enabled, the 4070 pulls ahead by a margin similar to the one the 3070 enjoyed in the prior test.

Our second-to-last test is one of my favourites, as it's actually fun to swing through the city! Spider-Man continues the trend of the RTX 3070 outperforming the 4070, both with RT enabled and with it off. However, once again, the 4070 closes the gap significantly with RT enabled, which could be down to a combination of the improved RT cores and the larger L2 cache...


The last of the benchmarks is a return to The Last of Us - which, after many, many patches, is now performing quite well. In this test, we see a similar result for both cards, with the RTX 4070 performing slightly better in terms of presentational consistency.

Once again, we're looking at virtually identical performance...



The Conclusion...


As was the case with the move from RDNA 1 to RDNA 2, the main performance benefit of the new RTX 40 series architecture over the RTX 30 series comes from the increase in core clock frequency. Yes, the larger L2 cache helps to mitigate the narrower memory bus, but it is not really having any effect on the absolute graphical performance of the card, as the underlying architecture is essentially the same.

Nvidia themselves put forth the idea that (paraphrasing) "increasing cache hit rate increases frame rate".

In terms of compensating for a narrower bus width and avoiding bottlenecks, they can be correct: in a data-starved scenario where increased latency would cause frametime spikes, the L2 can reduce those spikes because the rest of the die has to wait less time for the data it requires. They are also correct that this change improves efficiency by a large margin by reducing memory traffic to VRAM.
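This is really just the standard average-memory-access-time argument. Here's a minimal sketch of it in Python - the hit rates and latencies are made-up, illustrative figures, not measurements of either card:

def effective_latency_ns(l2_hit_rate, l2_latency_ns, vram_latency_ns):
    # Average latency seen by the rest of the die for a memory request
    return l2_hit_rate * l2_latency_ns + (1 - l2_hit_rate) * vram_latency_ns

# Hypothetical numbers: a small L2 catching 35% of requests vs a big L2 catching 65%
print(effective_latency_ns(0.35, 100, 400))  # ~295 ns on average
print(effective_latency_ns(0.65, 100, 400))  # ~205 ns on average

The same logic applies to traffic: every extra hit in the L2 is a request that never has to go out over the narrower bus, which is where the efficiency gain comes from.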

However, we can see in my data that the L2 is, for the most part, having no noticeable effect on performance compared to the last generation part.

What is likely having a negative effect on performance is the combination of lower VRAM bandwidth and the smaller number of ROP units (running at the lower core frequency!) on the die, and these cut-backs could be masking any benefit that the increased L2 cache might have in a theoretical, iso-frequency comparison.


Nvidia's comparison of an RTX 4060 Ti with 32 MB of L2 cache versus a special version with only 2 MB of L2 cache...


So, Nvidia might not have been telling fibs - they were just speaking in a general sense, not about the specific designs they put into their new GPU lineup relative to the prior generation of cards, since they have shown (above) that adding the larger L2 does increase performance on the RTX 4060 Ti.

What would have been interesting to see is the effect of the larger L2 cache coupled with the same number of ROPs and the same memory bus width as the RTX 3070... Alas, we can only dream. What is undeniable is the amount of energy saving going on in the new architecture: across all results, the RTX 4070 was pulling roughly half the power of the 3070 at the same performance.

And that, as they say, is that.

The performance of the new architecture is technically the same as the old, but only because Nvidia have removed aspects that helped with that performance and countered those removals with the additional cache. If the cards were identical in all other aspects, then we might actually be seeing a larger performance uplift than we are...
