21 April 2021

DirectStorage and its impact on PC gaming...


At the beginning of the year, I railed against the focus on standardising game API features across PC and console, specifically with regard to designing systems around the limitations of the console hardware. Well, today we got some updated information on how DirectStorage will work and it's basically as bad as I said it would be...


Like a Broken Record...


Back in January, I made the argument that it is more efficient to pre-load large amounts of data into system memory (RAM) and feed the GPU from there, asking that users have a larger quantity of RAM in their gaming systems in order to facilitate this. I also made the observation that the majority of games aren't even fully utilising 16 GB of system memory: a demanding title such as Cyberpunk 2077 only results in around 10 GB of total memory allocation, and that figure includes all of the other then-running system programmes!

Given that Windows typically occupies around 3-4 GB from a fresh boot, we're looking at this computationally and graphically demanding title utilising around 6-7 GB at ultra settings. That's not a lot and it's nowhere near saturating the system resources that are available (my current PC has 32 GB of RAM). For consoles (where memory is tight) and for PC systems with fewer memory and CPU/GPU resources available, it does make sense to optimise for data throughput from storage rather than keeping data in system memory. But it doesn't make sense for systems where this isn't the case.
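To put those numbers side by side, here's the back-of-the-envelope sum. These are the ballpark figures quoted above from my own system, not measurements you should expect on yours:

```python
# Rough arithmetic using the ballpark figures quoted above; none of these are
# precise measurements, just the approximate numbers I've observed.
total_ram_gb    = 32    # my current PC
windows_idle_gb = 3.5   # Windows typically sits at ~3-4 GB after a fresh boot
total_in_use_gb = 10    # total allocation observed with Cyberpunk 2077 at ultra

game_footprint_gb = total_in_use_gb - windows_idle_gb  # ~6-7 GB for the game itself
idle_headroom_gb  = total_ram_gb - total_in_use_gb     # RAM sitting there doing nothing

print(f"Game footprint: ~{game_footprint_gb:.1f} GB")
print(f"Unused RAM that could be holding pre-loaded assets: ~{idle_headroom_gb:.1f} GB")
```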

This is how things currently work... is this flow really improved with DirectStorage?

What we need is for game engines to scale to use the amount of RAM in a given system and, unfortunately, from what I can tell, this isn't currently the case. In contrast, game engines are already very good at scaling to meet the available CPU and GPU resources and, with the advent of DirectStorage and SamplerFeedback, they will scale with the capabilities of the system storage too. However, as I predicted, and as is obvious from the block diagrams shown in the presentation, system RAM, the lynchpin of everything that happens on a gaming rig, is left out of the equation.
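As a rough sketch of what I mean by scaling to the installed RAM (purely illustrative: psutil is a real Python library, but the reserve and cap policy here is my own invention, not something any engine actually does):

```python
# Illustrative only: size a pre-load/streaming budget from the RAM that's
# actually installed, rather than assuming a fixed, console-like amount.
# The 6 GB reserve and 75% cap are arbitrary values chosen for the example.
import psutil

def asset_preload_budget_bytes(os_reserve_gb: float = 6.0, cap_fraction: float = 0.75) -> int:
    total = psutil.virtual_memory().total
    usable = total - int(os_reserve_gb * 1024**3)   # leave room for the OS and other apps
    return max(0, min(usable, int(total * cap_fraction)))

if __name__ == "__main__":
    budget_gb = asset_preload_budget_bytes() / 1024**3
    print(f"Asset pre-load budget on this system: ~{budget_gb:.1f} GB")
```

On a 16 GB machine that policy would pre-load around 10 GB of assets; on a 32 GB machine it would happily use around 24 GB. That's the kind of scaling I'd like to see.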

DirectStorage doesn't bypass the RAM to give faster access to GPU memory from storage, like it does on both of Microsoft's current generation consoles*. It just means that developers are able to keep making less optimal use of system resources than they could be programming for, because it's never optimal to operate on the edge of available resources**.
*Notably because neither console has separate system RAM (they use a unified memory pool), but that's a moot point as it's the principle of operation we're interested in here. Sony has similar functionality in their hardware as well, but it's not utilising this Microsoft API.
**In this scenario, the resource is the availability of data being pulled off system storage for the game engine to process on the GPU. Data that isn't in memory when it's needed will result in dropped frames or lower MIP levels being used (i.e. texture/asset LOD pop-in).
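To make footnote ** a bit more concrete, here's a toy sketch of what happens when the data a frame needs hasn't made it off storage in time. Every name in it is made up for illustration; this isn't DirectStorage's or any real engine's API:

```python
# Purely illustrative: a miss in the in-memory cache either stalls the frame or
# forces a blurrier mip level, which is the pop-in described in the footnote.
MAX_MIP = 12                       # mip 0 = full detail, higher numbers = blurrier
PLACEHOLDER = "flat_grey_texture"  # what you see when nothing usable is resident

class StreamingCache:
    def __init__(self):
        self.resident = {}    # (asset_id, mip) -> texture data already in memory
        self.pending = set()  # reads still in flight from the SSD

    def request(self, asset_id, mip):
        self.pending.add((asset_id, mip))  # the data arrives some frames later

def texture_for_frame(cache, asset_id, wanted_mip):
    if (asset_id, wanted_mip) in cache.resident:
        return cache.resident[(asset_id, wanted_mip)]   # hit: full detail this frame
    cache.request(asset_id, wanted_mip)                 # miss: queue the read...
    for mip in range(wanted_mip + 1, MAX_MIP + 1):      # ...and draw something blurrier
        if (asset_id, mip) in cache.resident:
            return cache.resident[(asset_id, mip)]
    return PLACEHOLDER                                  # worst case for this frame

cache = StreamingCache()
print(texture_for_frame(cache, "rock_wall", wanted_mip=0))  # cold cache -> placeholder
```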
Despite providing no benefit for latency from storage to RAM, this API then also promotes utilisation of GPU resources to decompress the data. Yeah, sure, you're going to save on VRAM requirements (making Nvidia happy) but wasn't that already the point of SamplerFeedback?

So... basically everything is the same except that decompression happens on the GPU, stealing limited graphics-displaying resources?

This is completely backwards to me. GPU resources are the most limited part of running any game - especially at higher resolutions, where gaming is fantastically GPU-limited and extra CPU resources just don't matter at all. Look at any CPU/GPU scaling benchmarks (I'm linking the fantastic work done by Hardware Unboxed over at Techspot as one example and Tom's Hardware as another) and you will see that at lower resolutions a better CPU will get you more fps, but at 4K (and to some extent 1440p) you just need a bigger, faster GPU to push your fps higher... while upgrading your CPU will have basically no effect.

Are developers really saying that decompression on the CPU is holding back the development of games with more detailed worlds? If so, how in the hell are gamers meant to see those more detailed worlds when their GPUs are losing performance every frame decompressing assets stored in their VRAM? Especially when we're talking about a there-and-back latency of around 500 ns* to the VRAM from the GPU core. This paradigm will heavily affect the ability of GPUs to output higher framerates, as decompression will eat into not only the compute resources of a graphics card but also the available VRAM bandwidth for the assets that are actually in use in the current frame...
*2x ~250 ns and not including actual processing time!
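To illustrate the bandwidth side of that argument, here's a very rough estimate of the VRAM traffic that per-frame decompression could generate. Every input is a guess on my part (how much is streamed per frame, the compression ratio, the card's bandwidth), so treat it as a shape-of-the-problem sketch rather than a benchmark, and note that it doesn't count the compute cost of the decompression itself:

```python
# Back-of-the-envelope guess at VRAM traffic from on-GPU decompression.
# All inputs are illustrative assumptions, not measured values.
streamed_per_frame_mb = 64     # compressed asset data decompressed each frame (guess)
compression_ratio     = 2.0    # decompressed output is ~2x the compressed size (guess)
fps                   = 120    # target framerate

read_gb_s  = streamed_per_frame_mb * fps / 1024                      # reading compressed data
write_gb_s = streamed_per_frame_mb * compression_ratio * fps / 1024  # writing decompressed data
total_gb_s = read_gb_s + write_gb_s

vram_bandwidth_gb_s = 512      # ballpark for a 2021-era high-end card

print(f"Decompression traffic: ~{total_gb_s:.1f} GB/s, "
      f"~{100 * total_gb_s / vram_bandwidth_gb_s:.1f}% of {vram_bandwidth_gb_s} GB/s VRAM bandwidth")
```

And every byte of that traffic is competing with the textures and buffers the GPU is actually trying to render with in the same frame.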
We're already in situations where people are saying it's crazy to couple mid-to-high-end GPUs with low-to-mid-range CPUs, with Geforce cards' CPU driver overhead issues compounding that fact. So it's likely that in any system where the CPU is weak enough for data decompression to be an issue for a game engine, a weaker GPU will also be present. So that doesn't help. Are gamers meant to couple RX 6800 XTs with 4 core/8 thread CPUs now?

An example of the Geforce driver using limited CPU resources and resulting in worse overall performance... (Hardware Unboxed)

Meanwhile, the CPU will have more resources available to it to do basically nothing. At higher resolutions, it can't push out more frames because it's waiting on the GPU. At lower resolutions, the majority of 6-core CPUs are already pushing out more than 100 fps with last-gen midrange GPUs (e.g. RX 5700 and RTX 2070), and in titles where that's not the case, the CPU and GPU are already being pushed heavily, so shifting even more of that workload onto the GPU isn't likely to help matters any*...
*See WatchDogs Legion or AC Valhalla for examples of this...

You can see that the component of performance that most affects the number of frames outputted per second is very heavily tied to the GPU... (Tom's Hardware)

...and what's even crazier is that CPU resources are set to increase at a faster rate than GPU resources relative to what games actually demand of a system. 8-core Zen 2 CPUs are in the consoles now, with 7 of those cores dedicated to running the game process. On PC, 6 cores has become a sort of minimum standard for gaming systems, with the latest processor lineups from both Intel and AMD starting at 6 cores, and each of those six cores has more performance* than the ones found in the consoles. Is having the decompression of game assets on the CPU really holding games back on PC? I'm struggling to see how it is...
*Due to a combination of better IPC and higher sustained clock frequencies...
The last thing to add to this equation of confusing decision-making is that SSD performance can vary wildly between drives, and even for the same drive depending on how full it is! Drives filled to 80% capacity can lose up to 30% of their random read performance at queue depths larger than 64 and 40% of their sequential read performance at queue depths of less than 4. Games are big, they take up space!*
*News at 10!
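Plugging those figures into a trivial example shows what that means in practice. The baseline drive specs here are hypothetical; the percentage losses are the ones cited above:

```python
# Hypothetical NVMe drive specs; the loss percentages are the figures cited above.
seq_read_mb_s  = 3000      # quoted sequential read speed (assumed)
rand_read_iops = 500_000   # quoted random read IOPS (assumed)

seq_at_80_percent_full  = seq_read_mb_s * (1 - 0.40)   # -40% sequential at QD < 4
rand_at_80_percent_full = rand_read_iops * (1 - 0.30)  # -30% random at QD > 64

print(f"Sequential read (QD < 4):  {seq_read_mb_s} -> {seq_at_80_percent_full:.0f} MB/s")
print(f"Random read (QD > 64):     {rand_read_iops:,} -> {rand_at_80_percent_full:,.0f} IOPS")
```

That's the sort of variance a game engine streaming in real time from storage has to design around, and the developer has no idea how full your drive is.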
The consoles have dedicated, specialised hardware to decompress data, leaving the CPU and GPU free to do their thing...

Relying on the uncertain read performance of an SSD for loading assets into your system RAM, then transferring them to the GPU and decompressing them with limited GPU resources in real time, is the fever dream of a madperson. Instead, we could be pre-loading assets into RAM, relying on the very consistent performance of system RAM and the increased CPU resources to do the same job. But what do I know?

In Conclusion...


Like I said last time, it's boggling my mind that (in my opinion) time is being wasted on developing and implementing this feature in the PC space when all indications are that these problems are specific to last-gen hardware and, going forward, exist only on the consoles. As I said above, the systems where this technology could really help are already weak, and placing a higher burden on the GPU is not likely to help them. In higher-end systems, RAM quantity and CPU and GPU performance are not likely to be lacking, so DirectStorage can't help there either: it's much worse to steal GPU resources and bandwidth, and it's worse to draw data directly from the SSD instead of from system RAM as a standard operation.

I honestly feel like a broken record, here...

4 comments:

Neutro said...

Nice article, thanks for this information. Keep it up x).

Duoae said...

Thanks! :)

ratty said...

isn't the decompression hardware accelerated somehow? like how you can set PhysX to both CPU and GPU. but selecting GPU doesn't really take away GPU power , at least noticeably? while selecting CPU for PhysX definitely slows things down.

Duoae said...

I think it depends, ratty. How many things running concurrently don't take away "power"? There probably is a performance difference between Physx running on the GPU and with it off, it's just not huge. But add on this, add on RT, tessellation etc.... At some point you're going to notice it.

Plus, the problem right now is the way the API is designed, it still needs to work through CPU and main system memory. You're not gaining anything by taking resources away from the GPU.