25 March 2022

DirectStorage, again...

 

It's approximately eleven months since I last trashed DirectStorage and, helpfully, to ease us into the anniversary, we have the official release of the API and another presentation covering the technology - this time covering an actual implementation in a game that will release relatively soon!

Actually, it's quite exciting because I get to see whether I was correct in my understanding of how things were going to play out.

So, let's dive in!

I'm going to quash any curiosity right off the bat: It makes no difference.*

This is actually a rather poor way to show off the technology because it conflates several different technological aspects, creating confusion in the actual comparison... Plus, it really looks like the game and game engine they have created in Forspoken are optimised AMAZINGLY well with regards to data I/O. So, we honestly don't see any real difference.
*Actually, it's more nuanced than that - specifically, it makes no real difference in this game.
But let's take a step back and start with the presentation given by Luminous Productions' Teppei Ono as part of GDC this week.

From the very first slide... ugh!

Why do people keep doing this? Why do they conflate sequential reads from storage with random access (IOPS)? The first point is great - DS API manages small files (many data requests) really well. That's great for games! 

The second point? You don't ever get those huge sequential read speeds advertised on the SSD specs in a game because the data isn't sequential! It's lots of little files (yes, sometimes all packaged up into a single file or .pak or whatever). The idea of "data being contiguous" on the storage volume died when we moved to solid state storage. Even a very large data file doesn't need to have all its bits physically stored next to each other on the drive(s)!

An SSD controller simply cannot service a multitude of random read requests across myriad files at the data rate of a single sequential request for one very large file - even though the distinction between the two scenarios is a somewhat wishy-washy abstraction. It's why we see these differences in the benchmark numbers!
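To make that concrete, here's a toy model - a sketch under assumed numbers (the 7 GB/s sequential speed and the 20 µs per-request overhead are illustrative, not measured) - showing why effective throughput collapses as request sizes shrink:

```python
# Toy model (illustrative numbers, not a real benchmark): a fixed
# per-request overhead on top of the raw transfer time is enough to
# explain why small random reads never hit the advertised sequential speed.

SEQ_THROUGHPUT_GBPS = 7.0       # assumed advertised sequential read speed
PER_REQUEST_OVERHEAD_US = 20.0  # assumed controller + software stack cost per request

def effective_throughput_gbps(request_size_kib: float) -> float:
    """Effective GB/s when data arrives in chunks of request_size_kib."""
    size_gib = request_size_kib / (1024 * 1024)
    transfer_s = size_gib / SEQ_THROUGHPUT_GBPS
    total_s = transfer_s + PER_REQUEST_OVERHEAD_US / 1e6
    return size_gib / total_s

for size_kib in (4, 64, 1024, 65536):  # 4 KiB .. 64 MiB requests
    print(f"{size_kib:>6} KiB requests -> {effective_throughput_gbps(size_kib):5.2f} GB/s")
```

In this toy model, 4 KiB requests land down near the fractions-of-a-GB/s figures you see in 4K random-read benchmarks, while multi-megabyte requests approach the advertised sequential number - the per-request overhead, not the flash itself, separates the two scenarios.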

Finally, the point about CPU bottlenecks and most games having loading times of 10+ seconds just falls flat for me because this presentation does not prove that these are anything other than "lack of purchasing high-end hardware" and/or "lack of specific game engine optimisation".

That sequential read throughput is mostly meaningless for gaming...

Next up, we get onto a brief overview of how DirectStorage will work and also how it currently works... but I've discussed that before. Ono then gets to an interesting bit: load times.

This has been something of a pet peeve of mine for a while and my attempts to point it out have fallen on deaf ears - so it's nice to finally have an actual industry source to point to and say, "Hey, look! I'm not just some crazy kook on the internet mashing on their keyboard...*".
*Well, most of the time...
Processing time for games is now (and has been for a while) a significant portion of the loading times we experience as gamers. We arrived at this situation just by moving to SATA SSDs! And even with the "crappy" implementation of the file I/O system on Windows 7 and 10, NVMe adoption still wasn't handicapped enough to erase that benefit.

There was a reason I was advocating for people to buy "more cores", and why many people say to buy higher-end CPUs: those CPUs have higher IPC, clock speeds, and cache sizes, resulting in better performance when loading and processing data.

Sure, the NVMe drive with DS is 1.69x the throughput of the same drive on the Win32 API... but the difference in loading time is 0.2 seconds. This is also where we come back to my initial point about this being a poor way to show off the technology because the system specs for this test are just insane... An R9 5900X, 64 GB RAM and an RX 6800 XT?! I... what differences are we even expecting to see here? 
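To illustrate - with an assumed 0.5 s I/O / 1.5 s CPU split that is mine, not a figure from the presentation - a simple serial read-then-process model shows how a 1.69x I/O speedup barely dents a load that is dominated by CPU work:

```python
# Toy serial load model: read everything, then process it on the CPU.
# The 0.5 s I/O / 1.5 s CPU split is an assumption for illustration only.

def total_load_time_s(io_s: float, cpu_s: float, io_speedup: float = 1.0) -> float:
    # Only the I/O portion benefits from faster storage; CPU work is unchanged.
    return io_s / io_speedup + cpu_s

io_s, cpu_s = 0.5, 1.5
before = total_load_time_s(io_s, cpu_s)       # 2.00 s
after = total_load_time_s(io_s, cpu_s, 1.69)  # ~1.80 s
print(f"before: {before:.2f} s, after: {after:.2f} s, saved: {before - after:.2f} s")
```

Under those assumptions, the 1.69x uplift saves roughly 0.2 seconds of a two-second load - exactly the order of difference the presentation shows.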

On a more "average" or "normal" (perhaps "median" is a better description) system, is there ANY observable difference? And what about on Windows 10? Benchmarkers like Hardware Unboxed pair their GPUs with the highest-performing CPU in order to fully challenge them - or, conversely, pair all their CPUs with the highest-performing GPU - precisely so that differences show up. So, will these incredibly slight differences evaporate on mid-range hardware?

I would be heavily betting towards an emphatic, "Yes!"


Yes, great advancements in streaming of assets... only you've moved the bottleneck 5 minutes down the road.

Now, this isn't Luminous Productions' issue; this is Microsoft wanting a showcase to prove they're not wasting their time working on an API that has very little real-world utility instead of spending those resources elsewhere... but I've beaten that drum before, so I'll move on.

Following this, there's a brief interlude comparing some gameplay loads, which again showcases very slight differences, before we move onto another faux pas in the presentation of data: percentage differences. The famous saying, "There are lies, damned lies, and statistics!" applies here.

The way I was taught to present data, especially when dealing with small values or relative changes in small values, was to use the units you are measuring in - in this case, units of time (it looks like milliseconds, based on the CPU utilisation graph). So we're talking about a reduction by half for decompression time and asset initialisation. However, without the actual absolute time values there, it's difficult for a consumer of the presentation to digest what this means in real terms. 40% and 50% are big, impressive numbers... but if we're talking about reducing a phase from 200 ms to 110 ms, for example, within an overall process that goes from 2000 ms to 1600 ms, then it's not so impressive.
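Using those same hypothetical numbers (mine, for illustration - not figures from the slides), the arithmetic is straightforward:

```python
# Hypothetical numbers for illustration: one phase of a load improves a lot
# in percentage terms, but the absolute saving is small next to the whole load.

phase_before_ms, phase_after_ms = 200, 110     # hypothetical decompression phase
total_before_ms, total_after_ms = 2000, 1600   # hypothetical overall load

phase_pct = (phase_before_ms - phase_after_ms) / phase_before_ms * 100
overall_pct = (total_before_ms - total_after_ms) / total_before_ms * 100
phase_saving_ms = phase_before_ms - phase_after_ms

print(f"phase improvement: {phase_pct:.0f}%")       # 45% -- looks big on a slide
print(f"overall improvement: {overall_pct:.0f}%")   # 20%
print(f"absolute saving from that phase: {phase_saving_ms} ms")  # 90 ms
```

A 45% cut in one phase sounds dramatic, but stated in the measured units it's 90 ms - which is why the absolute times belong on the slide.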

This is reflected in the relatively minor improvements to loading times.


Nice, large percentages... so what's the absolute improvement given the -0.2 sec to loading you showed a moment earlier?

There is one counterpoint I've seen touted around on the internet from commenters* regarding this presentation: that DirectStorage's REAL benefit is for streaming assets during gameplay - supposedly helping with frame-time spikes (and thus dips in frame rate). I've not seen any indication that this is a solution to that problem - the problem being that developers need to prioritise the decompression of assets and the compilation of shaders for the specific hardware configuration, both of which require CPU resources.

This current iteration of DirectStorage doesn't fix that issue for three reasons:
  1. The developers still have to manually do it in their game code.
  2. The specs of a given CPU haven't changed: there aren't more resources to do it in real time, and reducing the latency from storage to RAM for operation on the CPU, before shipping the result back for copying to GPU memory, doesn't really help with that.
  3. This demonstration shows that the high throughput is achieved for HUGE amounts of parallel data (we're talking GB here), not for a few small files feeding shader compilation... This isn't like GDDR, where operations are made faster by putting more chips in parallel on the board to increase the total bit width of the connection; SSDs don't scale that way... (well, unless they're in RAID arrays)

*I mean literally in the comments, not the main article - though currently one of the promoted comments states this...
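That argument can be sketched as a toy per-frame budget (all numbers are mine, purely illustrative): even if DirectStorage sped up the fetch by the 1.69x seen earlier, the CPU-side decompression cost is unchanged and the spike survives.

```python
# Toy per-frame streaming model (illustrative numbers only): an asset must be
# fetched and then decompressed on the CPU within the same frame.

FRAME_BUDGET_MS = 16.7  # ~60 fps frame budget

def frame_time_ms(io_ms: float, decompress_ms: float, io_speedup: float = 1.0) -> float:
    # Faster storage only shrinks the fetch; the CPU decompression cost remains.
    return io_ms / io_speedup + decompress_ms

spike_without_ds = frame_time_ms(io_ms=4.0, decompress_ms=20.0)
spike_with_ds = frame_time_ms(io_ms=4.0, decompress_ms=20.0, io_speedup=1.69)

print(f"no DS:   {spike_without_ds:.1f} ms (budget {FRAME_BUDGET_MS} ms)")
print(f"with DS: {spike_with_ds:.1f} ms - still well over budget")
```

In this sketch, the frame blows its budget either way, because the dominant cost was never the storage fetch.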


Let's see, "M.2", SATA SSD, and (SATA?) HDD? Or is it an IDE HDD?! (Okay, I'm being facetious)


Conclusion...


I've yet to see anything that gets me excited about the use of DirectStorage for gaming, and this presentation didn't bring me any closer. However, there is one final thing I did not address until now:

This comparison made no sense from a software perspective.

Forspoken's engine looks really well optimised with regards to I/O, as I mentioned before. This comparison is like having a professional athlete run 5 km in BestBuy's own-brand running shoes and then in the highest-end bespoke running shoes to see if there's a difference.

Ideally, you would show this sort of exhibition event on an engine that typically doesn't deal well with I/O... but maybe I'm crazy here?

Additionally, there was a huge elephant in the room regarding compatibility. 

Microsoft have specifically stated that DirectStorage is part of the Windows 11 I/O subsystem (storage stack), and that Windows 11 has a new I/O subsystem (storage stack) compared to Windows 10.

Microsoft have also stated that DirectStorage requires NVMe to work, whilst explaining devices on other protocols will still work, just without the optimisations.

So what is even being compared between DirectStorage on "M.2"*, SATA SSD, and HDD versus the same drives without DirectStorage?

Have they changed the specs of the protocol and performed black magic to get the optimisations running over the SATA interface? Enquiring minds would like to know!


*This is terrible nomenclature in this particular instance because M.2 is the form factor, not the protocol. Specifically, SATA M.2 drives are a thing and run on the same bus as SATA SSDs. They should have used the NVMe branding.

If not, this is even more damning of the performance benefits of DirectStorage - the uplift we're seeing on the middle and right-hand side is not due to DS but due to the Windows storage stack and I/O updates...

At any rate, I'm sure everything will come to light, given time.


Wake me up in another year when we have more games and examples in different, less optimised engines that may show bigger differences...
