28 March 2020

Analyse This: The Next Gen Consoles (Part 9) [UPDATED]


So, the Xbox Series X is mostly unveiled at this point, with only some questions regarding audio implementation and underlying graphics architecture remaining. The PS5 is partly unveiled, with questions surrounding its physical makeup and operating system "costs". I want to place a caveat here that these are very complicated technical discussions - I'm not an expert and it's possible I've misunderstood something - but, where possible, I take my information from multiple sources and viewpoints in order to understand a problem from both top-down and bottom-up approaches... something that I think you can see in my meanderings.

Let's take a look at what each system has confirmed and how those specs may affect the user experience...
If you want the simple breakdown of specs, Eurogamer, via Digital Foundry, have the best articles on the subject, IMO. Instead, I'm going to do a deeper dive on the three to four ways that the consoles are different and how I think that affects their relative performance. But are there big differences between the two consoles?

We've been hearing for months that there's not much between the two devices from Microsoft and SONY, with "sources" on both sides of the argument claiming that each console has an advantage in specific scenarios. Incontrovertibly, Microsoft has the more performant graphics capabilities, with 1.4x the physical makeup (in Compute Units) of the PlayStation 5's GPU core. That's only part of the story though, with the PS5's GPU running at least 1.2x faster clock speeds than the Series X's across the majority of workload scenarios. That narrows the advantage of the SX (in terms of pure, averaged, GPU compute power) to around 1.18x-1.2x that of the PS5.

But what about the CPU? Performing the same, simple ratio calculation, you can work out that the SX's CPU is 1.02x - 1.10x more powerful than the PS5's, depending on the scenario. Not that big a difference, really... and the CPU/GPU should sport pretty much the same feature set on both consoles.
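
For anyone who wants to check my working, here's the back-of-the-envelope version of those ratios in Python, using the publicly quoted CU counts and clock speeds (and assuming the usual 64 shaders x 2 FP32 ops per clock per CU):

```python
# Back-of-the-envelope ratios from the publicly quoted specs:
# Series X GPU: 52 CUs @ 1.825 GHz; PS5 GPU: 36 CUs @ up to 2.23 GHz
# Series X CPU: 3.8 GHz (3.66 GHz with SMT); PS5 CPU: up to 3.5 GHz

def gpu_tflops(cus, clock_ghz, shaders_per_cu=64, ops_per_clock=2):
    """FP32 TFLOPS = CUs x shaders per CU x 2 ops per clock x clock (GHz) / 1000."""
    return cus * shaders_per_cu * ops_per_clock * clock_ghz / 1000

sx_tf = gpu_tflops(52, 1.825)    # ~12.1 TFLOPS
ps5_tf = gpu_tflops(36, 2.23)    # ~10.3 TFLOPS

print(f"CU ratio (SX/PS5):    {52 / 36:.2f}x")         # ~1.44x wider GPU on the SX
print(f"Clock ratio (PS5/SX): {2.23 / 1.825:.2f}x")    # ~1.22x faster clock on the PS5
print(f"Compute ratio:        {sx_tf / ps5_tf:.2f}x")  # ~1.18x net advantage to the SX

print(f"CPU, no SMT:   {3.8 / 3.5:.2f}x")    # ~1.09x
print(f"CPU, with SMT: {3.66 / 3.5:.2f}x")   # ~1.05x (the 1.02x lower bound assumes slightly different clocks)
```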

However, everyone and their dog have been talking about the GPUs for a long time and, until more of their underlying architectures are revealed, they're not all that interesting to me. Those three to four* things that are more interesting to me are:
  • RAM
  • I/O
  • SSD speed and function
  • Audio hardware
*I'm including I/O with the SSD as a complete concept despite listing them as separate bullet points.
Unfortunately, we don't have the full information on the SX's audio hardware implementation, meaning we can't yet do a proper comparison between the two consoles for that. So let's begin with the RAM configuration.

RAM


Let me put this bluntly - the memory configuration on the Series X is sub-optimal.

I understand there are rumours that the SX had 24 GB or 20 GB at some point early in its design process but the credible leaks have always pointed to 16 GB, which means that, if those larger configurations ever existed, they were dropped very early on in the development of the console. So what are we (and developers) stuck with? 16 GB of GDDR6 @ 14 GHz connected to a 320-bit bus (that's 5 x 64-bit memory controllers).

Microsoft is touting the 10 GB @ 560 GB/s and 6 GB @ 336 GB/s asymmetric configuration as a bonus but it's sort-of not. We've had this specific situation at least once before in the form of the NVidia GTX 650 Ti and a similar situation in the form of the 660 Ti. Both of those cards suffered from an asymmetrical configuration, affecting memory performance once the "symmetrical" portion of the interface was "full".

Interleaved memory access for the SX's asymmetric memory configuration, including an averaged value and one where simultaneous access is possible using pseudo-channel mode. You can see that, overall, the Xbox SX will only ever reach its maximum access speed if it never accesses the less wide portion of the memory...
This diagram was updated to reflect the error I made below, by counting the 4x1 GB chips twice. Effective, averaged access drops to 280 GB/s when equal switching is performed across the two address spaces... I was over-estimating it before.
Now, you may be asking what I mean by "full". Well, it comes down to two things: first is that, despite what some commentators might believe, the maximum bandwidth of the interface is limited by the 320-bit controllers and the matching 10 chip x 32 bits per chip x 14 Gbps per pin interface of the GDDR6 memory.

That means that the maximum theoretical bandwidth is 560 GB/s, not 896 GB/s (560 + 336). Secondly, memory has to be interleaved in order to function on a given clock timing and to improve the parallelism of the configuration. Interleaving is why you don't get a single 16 GB RAM chip; instead we get multiple 1 GB or 2 GB chips because it's vastly more efficient. HBM is a different story because the dies are parallel, with multiple channels per pin, and multiple frequencies can be run across each chip in a stack, unlike DDR/GDDR which has to have all chips run at the same frequency.
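
As a quick sanity check on those figures - nothing clever, just bus width multiplied by the per-pin data rate:

```python
# Peak GDDR6 bandwidth comes from bus width x per-pin data rate; the "10 GB"
# and "6 GB" figures can't simply be added together.

def gddr6_bandwidth_gbs(bus_width_bits, data_rate_gbps=14):
    """Peak bandwidth in GB/s for a GDDR6 interface."""
    return bus_width_bits * data_rate_gbps / 8

print(gddr6_bandwidth_gbs(320))      # Series X: 10 chips x 32 bits = 560 GB/s
print(gddr6_bandwidth_gbs(6 * 32))   # just the 6x 2 GB chips (192 bits) = 336 GB/s
print(gddr6_bandwidth_gbs(256))      # PS5: 8 chips x 32 bits = 448 GB/s
```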

However, what this means is that you need to have address space symmetry in order to have interleaving of the RAM, i.e. you need to have all your chips presenting the same "capacity" of memory in order for it to work. Looking at the diagram above, you can see the SX's configuration: the first 1 GB of each RAM chip is interleaved across the entire 320-bit memory interface, giving rise to 10 GB operating with a bandwidth of 560 GB/s. But what about the other 6 GB of RAM?

Those two banks of three chips either side of the processor house 2 GB per chip. How does that extra 1 GB get accessed? It can't be accessed at the same time as the first 1 GB because the memory interface is saturated. What happens, instead, is that the memory controller must "switch" to the interleaved addressable space covered by those 6x 1 GB portions. This means that, for the 6 GB of "slower" memory (in reality, it's not slower but less wide), the memory interface must address it on a separate clock cycle if it is to be accessed at the full width of the available bus.

The fallout of this can be quite complicated depending on how Microsoft have worked out their memory bus architecture. It could be a complete "switch" whereby on one clock cycle the memory interface uses the interleaved 10 GB portion and on the following clock cycle it accesses the 6 GB portion. This implementation would have the effect of averaging the effective bandwidth for all the memory. If you average this access, you get 280 GB/s for the 10 GB portion and 168 GB/s for the 6 GB portion over a given time frame, though individual cycles would be counted at their full bandwidth. (I originally quoted 392 GB/s for the larger portion because I was counting the 4x 1 GB chips twice - apologies for the mistake!)

Interleaved memory configuration for the PS5's symmetric memory configuration... You can see that, overall, the PS5 has the edge in pure, consistent throughput...
However, there is another scenario, with memory being assigned to each portion based on availability. In this configuration, the memory bandwidth (and access) is dependent on how much RAM is in use. Below 10 GB, the RAM will always operate at 560 GB/s. Above 10 GB utilisation, the memory interface must start switching or splitting the access to the memory portions. I don't know if it's technically possible to actually access two different interleaved portions of memory simultaneously by using the two 16-bit channels of each GDDR6 chip but, if it were (and the standard appears to allow for it), you'd end up with the 392/168 GB/s memory bandwidths shown in the pseudo-channel scenario in the diagram above.

If Microsoft were able to simultaneously access and decouple individual chips from the interleaved portions of memory through their memory controller then you could theoretically push the access to an asymmetric balance, being able to switch between a pure 560 GB/s for the 10 GB of RAM and a mix of 224 GB/s from 4 GB of that same portion plus the full 336 GB/s of the 6 GB portion (also pictured above). This seems unlikely, given my understanding of how things work, and undesirable from a technical standpoint in terms of both game memory access and architecture design.
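
To put numbers on the access scenarios I've just described - and I'll stress that this is my interpretation of the possibilities, not anything Microsoft have confirmed - here's the arithmetic behind the diagram above:

```python
# All figures in GB/s, assuming 14 Gbps GDDR6 with 32 data pins per chip.
PER_CHIP = 14 * 32 / 8       # 56 GB/s per chip
FAST = 10 * PER_CHIP         # 10 GB region, interleaved across all 10 chips = 560
SLOW = 6 * PER_CHIP          # 6 GB region, interleaved across the 6x 2 GB chips = 336

# 1) Below 10 GB in use: the full 320-bit bus serves the wide region.
print("wide only:      ", FAST, 0)                 # 560 / 0

# 2) Strict alternation between the two address spaces on successive cycles:
print("alternating avg:", FAST / 2, SLOW / 2)      # 280 / 168

# 3) Pseudo-channel split: the 6x 2 GB chips give one 16-bit channel to each
#    region while the 4x 1 GB chips stay entirely on the wide region.
print("pseudo-channel: ", 4 * PER_CHIP + 6 * PER_CHIP / 2, 6 * PER_CHIP / 2)   # 392 / 168

# 4) Fully decoupled chips (the unlikely scenario above): the 4x 1 GB chips
#    serve part of the wide region while the 6x 2 GB chips run at full width.
print("decoupled:      ", 4 * PER_CHIP, 6 * PER_CHIP)                          # 224 / 336
```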

In comparison, the PS5 has a static 448 GB/s bandwidth for the entire 16 GB of GDDR6 (also operating at 14 GHz, across a 256-bit interface). Yes, the SX has 2.5 GB reserved for system functions and we don't know how much the PS5 reserves for that similar functionality but it doesn't matter - the Xbox SX either has only 7.5 GB of interleaved memory operating at 560 GB/s for game utilisation before it has to start "lowering" the effective bandwidth of the memory below that of the PS5... or the SX has an averaged, mixed memory bandwidth that is always below that of the baseline PS5*. Either option puts the SX at a disadvantage to the PS5 for more memory intensive games and the latter puts it at a disadvantage all of the time.
*Guys, if you notice a typo, please point it out in the comments of the article in question - not on some random Neogaf thread. :p Plus, if developers don't say there's an issue - it's because it isn't an issue; the console architecture still works, it just works more slowly than it has the potential to. Hence, "suboptimal".
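
To make that trade-off a bit more concrete, here's a minimal, purely illustrative sketch of the SX's blended bandwidth as a function of how much bus time is spent in each pool (the same A x 560 + B x 336 framing that comes up in the comments below) - we don't know how Microsoft's memory controller actually schedules the two address spaces, so treat the split as a free parameter:

```python
# A = fraction of bus time spent on the 560 GB/s region, (1 - A) on the 336 GB/s region.
def sx_effective_bandwidth(a):
    return a * 560 + (1 - a) * 336

for a in (1.0, 0.75, 0.5, 0.25, 0.0):
    print(f"A = {a:.2f}: {sx_effective_bandwidth(a):.0f} GB/s")

# The PS5's flat 448 GB/s is matched at exactly A = 0.5 - the SX only stays
# ahead in total throughput while at least half of the bus time is spent in
# the wide, 10 GB region.
```
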
The Xbox's custom SSD hasn't been entirely clarified yet but the majority of devices on the market for PCIe 4.0 operate on an 8 channel interface...


I/O and Storage


Moving onto the I/O and SSD access, we're faced with a similar scenario - though Microsoft have done nothing sub-optimal here, they just have a slower interface.

14 GHz GDDR6 RAM operates at around 1.75 GB/s per pin (14 Gbps), with 32 data pins per chip x 10 chips giving a total potential bandwidth of 560 GB/s - matching the 320-bit interface. Originally, I was concerned that would be too close to the total bandwidth of the SSD but Microsoft have upgraded to a 2.4/4.8 GB/s read interface with their SSD which is, in theory, only enough to occupy the equivalent of 1.7% of the bandwidth of 5 GDDR6 chips receiving the decompressed data in parallel each second, leaving a lot of overhead for further operations on those chips and the remaining chips free for completely separate operations. (4.8 GB / 5 (1 GB) chips / (1.75 x 32 GB/s))

In comparison, SONY can utilise the equivalent of 3.2% of the bandwidth of 5 GDDR6 chips, in parallel, per second (9 GB / 5 (2 GB) chips / (1.75 x 32 GB/s)) due to the combination of a unified interleaved address space and a unified, larger RAM capacity (i.e. all the chips are 2 GB in capacity so, unlike the SX, the interface does not need to use more chips [or a larger portion of their total bandwidth] to store the same amount of data).

Turning this around to the unified pool of memory, the SX can utilise 0.86% of the total pin bandwidth whereas the PS5 can use 2.01% of the total pin bandwidth. All of this puts the SX at just under half the theoretical performance (ratio of 0.42) of the PS5 for moving things from the system storage.

Unfortunately, we don't know the random read IOPS for either console - that number would more accurately reflect the real world performance of the drives - but going on the above figures, the SX can fill the RAM with raw data (2.4 GB/s) in 6.67 seconds whereas the PS5 can fill the RAM (5.5 GB/s) in 2.9 seconds - again, 2.3x the rate of the SX (this is just the inverse of the ratio in the above comparison with the decompressed data).
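
Here's the arithmetic behind those percentages and fill times in one place, using the manufacturers' quoted rates (and 9 GB/s as the lower-bound decompressed figure for the PS5, as above):

```python
sx_ssd_raw, sx_ssd_comp = 2.4, 4.8      # GB/s, raw and compressed-effective
ps5_ssd_raw, ps5_ssd_comp = 5.5, 9.0    # GB/s, raw and lower-bound decompressed
sx_ram_bw, ps5_ram_bw = 560, 448        # GB/s, total RAM pin bandwidth

print(f"SX:  {sx_ssd_comp / sx_ram_bw:.2%} of total pin bandwidth")    # ~0.86%
print(f"PS5: {ps5_ssd_comp / ps5_ram_bw:.2%} of total pin bandwidth")  # ~2.01%
print(f"SX/PS5 ratio: {(sx_ssd_comp / sx_ram_bw) / (ps5_ssd_comp / ps5_ram_bw):.2f}")  # just under half

print(f"SX fill time (raw):  {16 / sx_ssd_raw:.2f} s")    # ~6.67 s for 16 GB
print(f"PS5 fill time (raw): {16 / ps5_ssd_raw:.2f} s")   # ~2.91 s for 16 GB
```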

However, that's not the entire story. We also have to look at the custom I/O solutions and other technology that both console makers have placed on-die in order to overcome many potential bottlenecks and limitations:

The decompression capabilities and I/O management of both consoles are very impressive but, again, SONY edges out Microsoft with the equivalent of 10-11 Zen 2 CPU cores versus 5 cores in pure decompression power. This optimisation on SONY's part really lifts the pressure off of the CPU, allowing it to be almost entirely focussed on the game programme and OS functions. That means that the PS5 can move up to 5.5 GB/s of compressed data from the SSD and the decompression chip can decompress up to 22 GB/s from that 5.5 GB/s of compressed data, depending on the compressibility of the underlying raw data (with 9 GB/s as a lower bound figure).

Data fill rates for the entire memory configuration of each console; the PS5 unsurprisingly outperforms the SX... *I used the "bonus" 20% figure for the SX's BCPack compression algorithm.

Meanwhile, the SX can move up to 4.8 GB/s of compressed data from the SSD and the decompression chip can decompress up to 6 GB/s of compressed data. However, Microsoft also have a specific decompression algorithm for texture data* called BCPack (an evolution of the BCn formats) which can potentially add another 20% compression on top of that achieved by the PS5's Kraken algorithm (which this engineer estimates at a 20-30% compression factor). However, that's not an apples-to-apples comparison because this is on uncompressed data, whereas the PS5 should be using a form of RDO which the same specialist reckons will bridge the gap in compression of texture data when combined with Kraken. So, in the name of fairness and lack of information, I'm going to leave only the confirmed stats from the hardware manufacturers and not speculate about further potential compression advantages.
*Along with a prediction engine called Sampler Feedback Streaming [SFS] that tries to reduce the amount of texture data moved to memory, improving the efficiency of RAM usage - in terms of texture use - by 2x-3x, i.e. 2.7 MB per 4K texture instead of 8 MB.
While the SFS won't help with data loading from the SSD, it will help with data management within the RAM, potentially allowing for less frequent re-loading of data into RAM - which will improve the efficiency of the system, overall - something which is impossible to even measure at this point in time, especially because the PS5 will also have systems in place to manage data more intelligently.**
[UPDATE] After reading an explanation from James Stanard (over on Twitter) regarding how SFS works, it seems that it does also help reduce loading of data from the SSD. I had initially thought that this part of the velocity architecture was silicon-based and so the whole MIP would need to be loaded into a buffer from the SSD before the unnecessary information was discarded prior to loading to RAM but apparently it's more software-based. Of course, the overall, absolute benefit of this is not clear - not all data loaded into the RAM is texture data and not all of that is the highest MIP level. PS5 also has similar functionality baked into the coherency engines in the I/O but that has not been fully revealed as-yet so we'll have to see how this aspect of the two consoles stacks up. Either way, reducing memory overhead is a big part of graphics technology for both NVidia and AMD so I don't think this is such a big deal...
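
Pulling those quoted rates together, here's an indicative version of the fill-rate comparison from the chart above - note that treating BCPack's potential 20% as a straight 1.2x multiplier on throughput is my assumption, not a confirmed figure:

```python
# Time to fill the full 16 GB of RAM at each quoted (or assumed) transfer rate.
rates_gbps = {
    "SX raw (2.4 GB/s)":               2.4,
    "SX compressed (4.8 GB/s)":        4.8,
    "SX + 20% BCPack (assumed)":       4.8 * 1.2,
    "PS5 raw (5.5 GB/s)":              5.5,
    "PS5 Kraken lower bound (9 GB/s)": 9.0,
    "PS5 Kraken best case (22 GB/s)":  22.0,
}
for label, rate in rates_gbps.items():
    print(f"{label:34s} {16 / rate:5.2f} s")
```
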
This I/O capability, combined with the consistent access to the entirety of the system memory, enables the PS5 to have more detailed level design in the form of geometry, models and meshes. It's been said by Alexander Battaglia that this increased speed won't lead to more detailed open worlds because most open worlds are based on variation achieved through procedural methods. However, in my opinion, this isn't entirely true or accurate.

The majority of open world games utilise procedural content on top of static geometry and meshes. Think of Assassin's Creed Odyssey/Origins, Batman Arkham City/Origins/Knight, Red Dead Redemption 2, GTA 5 or Subnautica. All of them are open worlds, and all of their "variations" are small aspects drawn from a standard pre-made piece of art - whether that's just a palette swap or model compositing. The only open world game that is heavily procedurally generated that I can think of is No Man's Sky. Even games such as Factorio or Satisfactory do not go the route of No Man's Sky...

In the majority of games, procedural generation still accounts for a vast minority of the content generation. Texture and geometry draws are the vast majority of data required from the disk. Even in games such as No Man's Sky, there are meshes that are composited or even just entirely drawn from disk.

The Series X's SSD actually looks like it can be replaced... although you'd have to disassemble the entire console to be able to do so...
Looking at the performance of the two consoles on last-gen games, you'll see that it takes 830 milliseconds on PS5 compared to 8,100 milliseconds on PS4 Pro for Spiderman to load, whereas it takes State of Decay 2 an average of 9,775 milliseconds to load on the SX compared to 45,250 milliseconds on One X. (Videos here) That's an improvement of 9.76x on the PS5 and 4.62x on the SX... and that's for last gen games which don't even fill up as much RAM as I would expect for next generation titles.

Here I attempted to estimate the RAM usage of each game based on the time it took to swap out the RAM contents and, thus, the game session. We can see that State of Decay 2 has some overhead issues - perhaps it's not entirely optimised for this scenario... This is a simple model and not accurate to actual system RAM contents, since I'm just dividing by 2, but it gives us a look at potential bottlenecks in the I/O system of the SX.

Now, this really isn't a fair test and isn't necessarily a "true" indication of either console's performance but these are the examples that both companies are putting out there for us to consume and understand. Why is it perhaps not a true indication of their performance? Well, combining the numbers above for the SSD performance you would get either (2.4 GB/s) x 9.78 secs = 23.4 GB of raw data or (4.8 GB/s) x 9.78 secs = 46.9 GB of compressed data... which are both impossible. State of Decay 2 does not (and cannot) ship that much data into memory for the game to load. Not to mention that swapping games on the SX takes approximately the same amount of time... Therefore, it's only logical to assume there are some inherent load buffers in the game that delay or prolong the loading times which do not port over well to the next generation.

In comparison, the Spiderman demo is either (5.5 GB/s) x 0.83 secs = 4.6 GB or (9 GB/s) x 0.83 secs = 7.47 GB, both of which are plausible. However, since I don't know the real memory footprint of Spiderman I don't know which number is accurate.
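
For completeness, here's that load-time sanity check spelled out (times taken from the demonstration videos linked above):

```python
sod2_load_s, spiderman_load_s = 9.775, 0.83   # seconds, from the videos

print(f"SX improvement over One X:    {45250 / 9775:.2f}x")   # ~4.63x
print(f"PS5 improvement over PS4 Pro: {8100 / 830:.2f}x")     # ~9.76x

# Implied data moved during the State of Decay 2 load on the SX:
print(f"  raw:        {2.4 * sod2_load_s:.1f} GB")   # ~23.5 GB - implausible
print(f"  compressed: {4.8 * sod2_load_s:.1f} GB")   # ~46.9 GB - impossible

# Implied data moved during the Spiderman load on the PS5:
print(f"  raw:        {5.5 * spiderman_load_s:.1f} GB")   # ~4.6 GB - plausible
print(f"  compressed: {9.0 * spiderman_load_s:.1f} GB")   # ~7.5 GB - plausible
```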

This is a really interesting implementation of using a power envelope to determine the activity across the die...


Audio Hardware


In my opinion, the "pixel" is well and truly dead. The majority of PC players in the world play at 1080p resolution. The majority of TVs in people's houses are 720-1080p. 4K is a vast minority - yes, of course it's gaining ground as people replace their screens but the point is that most people are happy with their current setup and don't see the added bonus of unnecessarily upgrading the resolution or size of their screen.

Unfortunately, Microsoft have pushed their audio features much less than SONY have - I presume because it was not a huge focus of the console; instead they decided to focus on raytracing, graphics throughput, variable refresh rate, auto low latency mode and HDR. If you're not going to use the added rasterisation power by targeting a higher resolution, instead opting for optimisations that allow you to render at lower resolutions and scale up, why bother modelling the console around that processing power in the first place?

In comparison, SONY haven't even name-checked HDR output in the way that Microsoft have at least name-checked 3D audio.

What we do know about the SX's audio solution is that it is a custom audio hardware block which will output compatible signals in the Dolby Atmos, DTS:X and Windows Sonic codecs. This hardware will handle wave propagation and diffraction but has not officially (as far as I can find) been linked with the ray tracing engine on the GPU (Thanks to TheBigBacon for the correction).

SONY, on the other hand, have gone all-in on their audio implementation. I had speculated previously that the audio solution might be based on AMD's TrueAudioNext and their GPU CU cores. Thinking that, I had presumed that the console designers would provide a subset of their total CU count on the GPU for this function. Instead, SONY have actually modified the CU units from AMD's design to make them more like the SPUs in the PS3's Cell architecture (no SRAM cache, direct data access from the DMA controller through the CPU/GPU and back out again to the system memory). We don't know how many altered CUs are present in this Tempest engine but SONY have said that the SIMD computational power is equivalent to the entire 8 core Jaguar CPU that was in the PS4.

Essentially, SONY decided to reduce the amount of fully fledged CUs available to the GPU in order to provide this audio solution. This also means that the PS5's sound processing will take less CPU power from the system compared to the SX - which, again, counts against the SX in terms of resources available to run games.

I guess that I'll have more on this as the features are fully revealed.

SONY's implementation of RT is able to be spread across many different systems...

Conclusion


The numbers are clear - the PS5 has the bandwidth and I/O silicon in place to optimise the data transfer between all the various computing and storage elements, whereas the SX has some sub-optimal implementations combined with really smart prediction engines which, according to what has been announced by Microsoft, perform below the specs of the PS5. Sure, the GPU might be much larger in the SX but the system itself can't supply as much data for that computation to take place.

Yes, the PS5 has a narrower GPU but the system supporting that GPU is much stronger and more in-line with what the GPU is expecting to be handed to it.

Added to this, the audio solution in the PS5 also alleviates processing overhead from the CPU, allowing it to focus on the game executable. I'm sure the SX has ways of offloading audio processing to its own custom hardware but I seriously doubt that it has a) the same latency as this solution, b) equal capabilities or c) the ability to be altered through code updates afterwards.

In contrast, the SX has the bigger and wider GPU but, given all the technical solutions that are being implemented to render games at lower than the final output resolution and have them look as good, does pushing more pixels really matter?


There was also a very interesting (and long) video released this morning from Coreteks that has his own point of view on these features - largely agreeing with my own conclusions, including my original prediction that the 7 nm+ process node would be utilised for these SoCs.

50 comments:

Unknown said...

"The numbers are clear". Not really when your RAM chart is very inaccurate.

Duoae said...

Hi, Giacomo. MS stated that the 10 GB is "GPU optimal" whereas the 6 GB is "standard". According to how interleaving works for RAM access, it would be a performance hit to always utilise the split RAM when the console didn't need to.

Therefore, since the performance of "standard" memory is the same for those bits of memory it can reside in the "fast"/wide pool of RAM until it needs to be pushed over to the "slower" pool.

I was saying that the console could reach 7.5 + 2.5 GB (10 GB if my maths is correct) before it has to switch to the 336 GB/s address space. It can't utilise both interleaved spaces at the same time because the 320-bit bus can't surpass 560 GB/s. Therefore, that's when the access/speed penalty starts coming into play. (Yes, 13.5 GB total still available to the game, but it's a mix of fast/ slow address spaces.)

That 7.5 and 2.5 is a best case scenario, i mention later in the article that it's possible the console is always switching between the fast and slow address spaces - but i don't think that would make sense.

You could address both by using pseudo channel mode... but that would be a nightmare because you can't predict when specific data will need to be pulled from both fast and slow address spaces, which I think would result in a lot of latency.

https://www.eurogamer.net/articles/digitalfoundry-2020-inside-xbox-series-x-full-specs

Duoae said...

Can you clarify which one you're talking about? The memory diagram, the memory fill rate, or the estimated RAM use when swapping between the XBO games?

It's possible i made a mistake but if i did it'd be easier if you helped me see it. :)

Duoae said...

No problem! Have a good day, Giacomo. :)

Pete said...

This is great stuff. I sort of expected more of the tech press to break things down like this.

The only question I have is that MS specifically stated the following:

"In terms of how the memory is allocated, games get a total of 13.5GB in total, which encompasses all 10GB of GPU optimal memory and 3.5GB of standard memory. This leaves 2.5GB of GDDR6 memory from the slower pool for the operating system and the front-end shell"

Doesn't this sort of imply that they indeed found a way to disconnect/decouple the top 4 chips (as per the diagram) when accessing the extra 6GB from the 6 chips on the side? Which you sort of alluded to in your breakdown?

If that is not the case, isn't MS lying basically?

Duoae said...

Hey, Unknown. Not at all. MS have told the truth in everything they've stated. Obviously, i don't know what solution they have implemented on the controller side of things but you'll note that in the portion you quoted, they are specifically speaking about a situation where the entire 16 GB is allocated.

Unfortunately, because the maximum bandwidth of the 320-bit bus is 560 GB/s, the two portions cannot be accessed simultaneously at their maximum speeds as this would go above this limit. There are two offerings here - sequential access and simultaneous but shared bandwidth (which results in that reduced scenario in the diagram).

Basically, in certain scenarios, the specs will perform as stated, in some others - not. Just like how TFLOPS on desktop graphics cards are calculated on boost clocks (theoretical maximum) as opposed to their constant guaranteed clock speed.

Unknown guy said...

Hello,

Can i ask your opinion about the comments below? The guy says he is a "computer engineer":

"Completely different situation since the CPU isn't trying to share this memory. The CPU doesn't need the bandwidth that the GPU does for its workloads (the slower bandwidth is fine for a CPU) and it has cache to offload the RAM. The cache is extremely effective for CPU workloads. They've had great cache and branch prediction algorithms since the 90s and they've only gotten better. I've posted this before and it's still being ignored.

Go look up the ram bandwidth that a 12 core Ryzen 3900X has. Here's a hint. It's much less than 100GB/s.

You guys are barking up the wrong tree and you're missing the bigger picture of how this RAM setup will actually be used."

"The cache gets loaded up, the CPU gets into a loop, RAM doesn't need to be accessed as frequently. This is why cache exists. The GPU gets the full run of the bus at these times at the full 560GB/s rate. I'm a computer engineer. This is what I do. I'm not wrong here."


"We don't know how large the cache is, but you'd be surprised at how effective cache is when it comes to executing CPU code for both data and instructions. The cache doesn't need to be a huge percentage of system memory. Program loops are often quite small. Branch prediction is excellent and the code is pre-loaded into the cache as it is executed. Data is iterated on and is loaded into the cache. Cache is very effective. Do some reading on the subject. There are tons of papers out there."

Duoae said...

Hi Unknown,

Reading through this person's comments seems like he's speaking truthfully and correctly (at least within the bounds of my limited knowledge). His comments actually don't disagree with my own. Essentially, when the content of the narrower memory address space needs to be accessed, the bus must share the bandwidth (560 GB/s in total) between the two address spaces. This will result in the situations I've outlined in this article.

Now, the frequency of this access is up for open debate and something that I doubt this person knows unless they're working on the Series X engineering team or for a development house that's creating a game for the console. They might be comparing to a 3900X but the consoles use the cut-down Renoir Zen 2 cores. Microsoft themselves stated that the SX has only 76 MB of cache across the entire chip (that's including the 52 CUs in the GPU). I'd expect the CPU cores to have the reduced Renoir cache sizes for L2 and L3.

RDNA 1.0 has 128 KB of L1 and 4 MB of L2 cache across a dual compute unit. If we assume the same size for RDNA 2.0 and in the custom chip of the SX then we'd get 3.3 MB and 104 MB respectively plus 512 KB x2 L2 and 4MB x2 L3 for a total of ~116.3 MB. Clearly that cannot be the case as it's higher than 76 MB... therefore the CPU of the SX might have less cache than even the mobile Zen 2 chips and nowhere near that available to the CPU of the 3900X (70.7 MB).

Duoae said...

Oops, forgot to write the conclusion - So the SX chip will need to make more frequent trips out to system RAM than any desktop part. In fact, you can see the reduction in multicore scores for the mobile Zen 2 parts compared to their desktop counterparts in various benchmarks (I covered that in my prior post where I tried to estimate the power of the Xbox chip based on Microsoft's comments).

However, even in the event that the CPU doesn't need to access the RAM very often, it could not randomly access the RAM when the game logic needs to access its data as the performance the game was expecting would not be there. That's why I suggested that there might be alternating access, in order to provide a consistent environment that is entirely predictable.

Pete said...

Thanks for the reply.

Let me just preface this by saying that I only have experience on low level embedded devices (think Cortex). So I'm basically just asking questions with that frame of reference. I have no idea how things work when it get to GDDR6 level, but I'm hoping that some of my basic understanding does translate.

This might be a low level question, but it might help me understand things better. If your memory is configured like the XSX's, 10x1GB with a 320bit bus, does this mean your bytes are in essence interleaved over the 10 chips? I.e. byte 0-3 on chip 0, byte 4-7 on chip 1, ..., byte 36-39 on chip 9? Meaning if you as a programmer were to do a 40 byte data fetch (0-40) from RAM to CPU/GPU, you could clock that out all at once over the bus?

If so, isn't it just a matter of the memory controller configuring the bus and chips based on the address of the memory being accessed? Similarly to how you can have a single memory controller on a Cortex MCU either access SDRAM or NOR Flash depending on only the address of the memory being accessed, even though these have different control signals and latencies.

So for example, if a GPU/CPU needs to access data in the address space of 0 - 10GB, it configures the bus in full 560GB/s mode 10x1GB @ 320bit. If it needs to access data in the 10 - 16GB address space, it configures the bus in 336GB/s mode 6x1GB @ 192bit. I'm assuming you can somehow just deselect the top 4x1GB chips and use some control signal to select the upper/lower half of the 2GB chip in order to expose the correct 1GB segment/bank.

Regards,
Pete

Duoae said...

Hi Pete,

If I'm understanding your question correctly - isn't that exactly what I did in the blogpost? The controller switches between the two address spaces whenever it needs to access data from either one.

However, you can't just have random access occurring whenever you want. Imagine there's a game running and all of a sudden there's a request from a friend to join a group in another game - the OS needs to work with the data in the RAM (or load some in). At that same time, the game is requesting something (texture data, let's say) and due to the other address space being active, it doesn't get the data in time for the next frame or two. My suggestion was that the access might be pre-arranged through a switching mechanism whereby the whole system knows when the access to the different address spaces is going to be available, allowing all the programmes to be able to queue items effectively.

There's a similar scenario for when a game has data in the first 10 GB of space and the 3.5 GB of space and needs to request data from both (e.g. texture data in address space 1 and sound data in address space 2).

This isn't like on a desktop where the CPU has the DDR RAM and the GPU has its VRAM. Nothing from the OS is going to interrupt the queued items happening on the VRAM unless the game itself is interrupted to show video data from the OS - in which case most games automatically pause (or crash) - because they're two different buses.

What this suggested solution would mean is that the GDDR6 RAM is operating across shared bus cycles, meaning that the actual access is averaged. Sure, you look at a single cycle of access and it's 560 GB/s or 336 GB/s but if you compare 2 or 4 cycles and count the data rate of access to the RAM over multiple cycles you get the averaged figure I posted in the blogpost because in each alternate cycle, one of the address spaces has 0 GB/s.

I keep trying to get the point across that the two address spaces are not faster or slower, they're just more wide and less wide. The RAM is always operating at the same frequency, it's just there's less parallelism. So, yes, the bytes are probably (or should be expected to be) interleaved across the entire address space (across each chip as you gave in your example).

Pete said...

Hi Duoae,

I would expect some sort of priority based interrupt mechanism in place for the CPU/GPU to gain access to the memory while it's busy, no?

However, I think your point stands. It's not so much the individual bandwidth to the different memory addresses that is important. Rather, it's the fact that your "faster" memory will get interrupted by your "slower" memory. Effectively making your total RAM bandwidth something more akin to A x 560GB/s + B x 336GB/s, with A + B = 1.

And I guess herein lies the problem. In order to maximize that total bandwidth, you probably put extra burden on the developers for managing that split pool. Whereas for Sony, since everything is a flat 448GB/s, that burden is not there.

Or you could implement a hardware solution as you proposed where the controller just switches between the two address spaces. But that would almost effectively half your total bandwidth (A = B = 0.5) and that is assuming you actually do grab data from both memory spaces every other clock.

Regards,
Pete

jamesnormio said...

What is this bullshit blog? How much did sony pay you to write this FUD? its full of crap and WRONG on so many levels! wtf

N900 said...

I think this is the worst blog post I've ever seen. Really. Get a life man, no need to invest so much time in defending a stupid plasticbox... s

Duoae said...

Exactly! Whilst we don't know what solution is in place, we know that "effective" memory access will be below what the possible max access speeds are when averaged over time. The only thing i would contest in your post is that half your bandwidth would not be A = B = 0.5, as that would give 280 GB/s for both address spaces. It would be A/2 + B/2 because the address spaces are not symmetrical. (Maybe i made a mistake there because I'm just quickly doing this without any "paper" but that would give an average of 448 GB/s - equal to that symmetrical bandwidth of the PS5. In fact, i just realised that i averaged badly in the article and counted the 4×1 GB RAM modules twice... giving more bandwidth than would exist. Maybe that's what the first reply was in reference to? D'oh!)

Pete said...

You're correct yes. Good catch. However I do wonder if A >> B, making it a bit moot. But I suspect that's only something a developer would know.

Unknown guy said...

Thanks for the clear explanation :)

Josh said...

Very interesting.

Duoae said...

Thanks, Unknown Guy & Josh!

fybyfyby said...

Hi , I dont understand one thing. For XSX and PS5 both CPU and GPU are acessing shared memory pool (at least in terms of memory controller).

So both will suffer from random cpu gpu accesss. If CPU of PS5 access memory, GPU cant. If there is symetrical mix of accesses, GPU access to ram in case of PS5 will be half - 224 GB/s. So I dont see, how much different this is according to XSX.

If PS5 has different route to memory for CPU and GPU, I understand, that it will have advantage over XSX. But here it hasnt.

Duoae said...

Hi fy (sorry for shortening :) ). The difference is that the PS5 can use all 4 GDDR controllers to access all of the 16 GB, the SX can only use 4 of the 5 controllers to access the 6 GB and cannot do this at the same time as accessing the 10 GB using the 5 controllers - at least not at full 'speed'.

Put it another way, if SX had 10x 1 GB chips instead of the mixed 6x 2 GB & 4x 1GB chips the 560 GB/s bandwidth across 5 controllers would still be present as the bandwidth is limited to 32-bit per chip, no matter the capacity.

Duoae said...

To finish the thought, the reason we don't usually see these asymmetric configurations on graphics cards is because there's a trade-off in access to the RAM when going above the larger, wider pool. Take a look at that 560 Ti article link in my post for another explanation on the subject.

fybyfyby said...

Hi , thank you for the quick reply. I understand what are you trying to tell me (I hope :-) ). Simply put, if you need to address both address spaces, throughput will be dramaticaly lower.

I think in case of XSX scenario of usage is important.

If you strictly use fast memory for gpu and slow for cpu, it is similar like in case of PS5. With random access, you get slower speed for CPU and slower for GPU. Difference is, that (in ideal case) half of time for xsx you use fast access and half of time slow access. Whereas in case of PS5 you use half of time normal (slower than fast and faster than slow) access and other half of time normal accesss also.

You can than complicate things with XSX that way, if CPU will try to access also fast ram - and that will be case of that gpus - one processor accessing both pools.
And that can be pain. You cant rely on constant speed of memory. So if you want to use advantage of fast ram (to outperform ps5), you have to really isolate cpu and gpu parts. If you do that, you get similar case like ps5 is. But its playing with numbers. What is important is real use by memory, cpu time spending reading from memory etc... So I understand this can be little harder to manage. On PS5 you have simply one pool of same speed memory and it doesnt matter in what address space is gpu and cpu. But you have to also try to restrict cpu access to memory, or gpu acces will fall down.

So I think I understood it, but I think it is maybe little pain to manage access to memory in case of xsx. So you are practically restricted for 3,5 GB ram for cpu to not mess with fast ram. And even then if you access slow ram with cpu, you will lower down access of gpu - but that is normal bus sharing method and its predictable.

On the case of gpu - I think main pain was that developers simply arent used to split memory management on gpu on fast and slow and try to maximaly utilize fast access without pinging into slower pool.

Is that what you mean?

Thank you.

Duoae said...

I think you're mostly there, the only thing is that on SX you cannot access the slow memory without impacting the speed of access to the fast memory because they work through the same controllers.

On PS5, assuming that the JEDEC spec is adhered to, you can even improve access to multiple bits of memory by arranging them within the same column but on different rows, essentially parallelising certain memory operations. You couldn't do that across the two memory pools in the SX because the memory needs to be interleaved in order to accomplish that.

Duoae said...

This is to say that, yes, bandwidth to a particular component will always take away from other components but since we don't have a block diagram of the architectures of the two consoles, that's not what I'm comparing - I'm specifically comparing performance of the RAM setup.

fybyfyby said...

Oh yeah, I see the point with interleaving. That could be a problem. Im looking forward for another tech analysis!

Unknown said...

In regards to the statements that the asymmetric memory is a negative. You're assuming that the developer will need/want to exceed 10GB of the allocated GPU based memory. However some of the most expensive PC GPUs on the market at the moment use 11GB 352bit memory.
If the developer does not exceed the 10GB then the memory bandwidth is significantly faster and a clever implementation given budget restrictions.
The developer only has 3.5GB of "CPU" memory left anyway, and I fail to believe that the developer will not have the majority if not all of that utilised already for CPU related tasks.
I do not see how this "Asymmetric" setup is any different to development for the PC where majority of people's rigs have lots of DDR4 memory for the CPU and separate GDDR6 memory on the GPU.

So basically I agree with your blog post about the negatives. But I don't think it will actually be a problem that comes up. But the gain in GPU based memory clocks is well worth the risk.
If I am wrong about the developer not utilising the 3.5GB for CPU related tasks. Then I agree the uniform approach the PS5 has taken is better.

Duoae said...

I agree with you - this is the point of my 7.5 + 2.5 GB example. I believe that most initial games this next generation will use far less than 13.5 GB. Plus, combined with their prediction engines (e.g. SFS) less RAM will be used overall.

As i said in the blogpost - in circumstances where a game utilises more than 7.5 GB of RAM, (forcing the OS to be moved to the slower pool) then the PS5 memory setup will have the advantage.

As soon as the 6 GB pool of RAM needs to be accessed, there is a performance penalty.

In terms of comparison to PC, actually this is quite different. It's more accurate to compare it to a GPU. You see, each DDR stick has a transparent controller embedded in it, interleaving the identical capacity DDR RAM chips on the stick into a unified pool of RAM. In comparison, GDDR has greater speed and parallelism due to the broader bandwidth provided by 64-bit wide controllers that couple with two GDDR RAM chips.

In a general computer scenario, the graphics card never utilises or interrupts data access to the slower DDR and the OS, etc. doesn't access the GDDR. The exception to this is seen on laptops or desktop APUs, where often system RAM will be utilised for graphics data and you will always see a massive performance cost for doing things this way.

So, I don't think the comparison with a normal desktop is accurate or relevant.

RKO said...

Ciao

RKO said...

Hi, I'm Italian. Congratulations on your write-up. I wanted to ask: in your opinion, does the PS5's SSD, with its ID-based access (compared to accessing files by name), make a big difference? Thanks

RKO said...

Developers will be able to choose between low or high level access but it is the new I / O API that will allow them to take advantage of the extreme speed of the new hardware. The concept of files and paths has been superseded in favor of a new ID-based system, which tells the machine exactly where to find all the data it needs in the shortest possible time. Developers simply need to specify the ID, the initial and final location, and after a few milliseconds the data is displayed. Two lists of commands are sent to the hardware: one with a list of IDs while the other focuses on memory allocation and deallocation (so as to make sure that the memory is always free to accommodate new data).

Unknown said...

Where is the 7.5 figure coming from?
The OS is always on the slower memory.
The spread is 10 and 3.5. With 2.5 for the OS in the slower memory reserve.

https://gamingbolt.com/xbox-series-x-allocates-13-5-gb-of-memory-to-games

My comparison to a Pc is to think of the 10GB as physically part of the GPU and treat the slower memory as DDR4.

Pete said...

Hi Duoae,

It seems like the biggest issue with the split interface is when you interrupt a high bandwidth transfer with a low bandwidth one. Effectively lowering the bandwidth of the initial transfer.

However, thinking about it a bit more, I'm wondering if this is not something you could (or has been) solve(d) in programming? Let's assume your game engine is running in these 33ms(30fps) or 16ms(60fps) time slots. If you schedule your high bandwidth transfers/operations to happen first, followed by your low bandwidth transfers/operations, with some idle time left in the time slot, I don't see the split interface being *such* an issue.

Since I'm not a game dev, I have no idea if this is standard practice or completely impractical. But just an interesting thought none the less.

The above could however break down if the predefined memory split doesn't work for your game. I.e. you need more than 10GB of graphics related memory so you start to dip into the slower 3.5GB. Then things might get hairy.

Duoae said...

Hi Pete, yes you're correct in your reasoning. However, i think the point above about cache is important. MS have stated there's only 76 MB of SRAM on-die. That amount of cache shared between CPU & GPU as well as DMA and I/O is actually very small.

Within a given frame (let's take your example of 16 ms since 60 fps is a next gen target), the CPU must request data from the SSD (slow - 4.8 GB/s), send data to the CPU for update from the game .exe (let's say we're using the slow RAM), send data to the GPU and audio processors (fast and also maybe slow) and possibly even return data from each to the RAM in order for another component to provide secondary processing on it (e.g. ray tracing information passed to the audio component).

That seems like a good deal of potential switching between the two pools of memory. If my quick mental calculations are correct, the GDDR can switch 224 times in that 16 ms period which equates to 56 MHz of GDDR6 frequency. Using that figure across a 320-bit bus gives 9 GB transferred for the wider pool of RAM and 5 GB transferred for the narrower pool of RAM (if they are accessed exclusively for that period).

That doesn't take into account actual processing times for the components or overhead in switching the controller between the two address spaces (since we can't possibly know those) but let's say that the data transfer is a 70/30 GPU/CPU+audio split with no OS overhead. You get 6.3 GB transferred to the GPU and 1.5 GB for the CPU+audio for a total of 7.8 GB.

I actually think the GPU number doesn't sound outrageous though maybe the other is quite a high value but like you I'm not involved in making games. This value can fluctuate between 6.6 - 9 GB theoretical max, depending on the split...

On the PS5 you can perform the same calculation and I get a flat 7 GB for all scenarios.

Assuming I've performed that calculation correctly, that's the theoretical maximum for this particular ratio split for the SX, the reality will be lower due to those overheads i was mentioning (including SSD transfer speed). The PS5 has that transfer rate as a guarantee, minus its SSD overhead (which should be a bit less due to its faster speed) along with similar overheads from having to hand data back to the RAM depending on how much SRAM it has on-die (we don't have a number from Sony yet).

Now, if the PS5 had the same number of CUs in the GPU as the SX then that could potentially be a bottleneck but it has 0.84x the data requirements of the SX GPU (freq x CU and not taking into account features like VRS).

That's still looking to me like the components of the PS5 could be potentially being fed what they need more easily and reliably than those on the SX, even with mitigations such as VRS and SFS.

Duoae said...

I'm writing a follow-up article to explain this. Many people are not following the example i was trying to give. The short version is - MS never said the OS couldn't reside on the fast pool of RAM, just that these functions see no benefit from doing so.

Duoae said...

Also, as i stated above or maybe elsewhere, you don't have to stop access to the GDDR6 in a graphics card to access the DDR4. In the case of the SX, you sort of have to.

Duoae said...

Oops, i was re-reading my comments and realised that i said the controller was on the DDR dimms. I meant to write channel. My brain must have temporarily melted :/

Duoae said...

I'm not sure I'm convinced about this particular tech, at least not from the description. Files and folders are abstractions for us humans, not the systems. Maybe I'm wrong but i thought all file systems use pointers and memory column/row information in order to quickly access data.

I don't see anything different in this description. They seem to be happy with the solution though so maybe there's more to it?

Unknown said...

OK. So I've re-read the section and understand what you mean in regards to the OS being pushed to 'slow memory' after the 320bit bus is saturated (10GB). I.e. once interleaving is required to access memory beyond the addressable 320bit bus.

I do have one question though. Isn't this exactly the same issue for PS5 but after 256bit bus (8GB) is saturated?

Unless they "decouple the individual chips from the interleaved portions of memory" then whenever the GPU or the CPU needs to access seperate halves of the 2gb chip then the interleaving kicks in as an issue and halves the attainable throughput. Or am I still way off base in understanding this?

Ps I'm sorry if I've come off as hostile. I legitimately am curious and not trying to say you are wrong.

Duoae said...

Hey, Unknown.

You're not coming off as hostile at all! Actually, people asking me questions also helps me understand things better because it makes me look at what i know (or think i know) from a different perspective.

So, just one thing to correct: the bus is how wide a road is, i.e. how many cars can be in parallel - not how much space is in the car park. After all, cars can leave the car park and be replaced by new ones.

So the system would access the narrower memory pool when the first one is full. *in theory!!*

As we've been saying in my discussion with Pete- we don't know whether the system mandates access to both pools of RAM in given time frames or whether it's entirely programmer- controlled. That makes a big difference in terms of how this all plays out.

My scenario of the OS residing in the fast RAM is entirely hypothetical. It makes sense to me from an architectural and efficiency standpoint but maybe that's not how Microsoft have decided to do it...

One last thing, i think that if the memory is interleaved then you cab hr decile the chips - just the access width. I.e. you use one 16-bit channel to access each memory pool. The JEDEC seems to allow for this but it's not a given.

Duoae said...

Damn, autocorrect.

"Then you can't decouple the chips - just the access width."

Ger said...

Hi Duoae...

http://disruptiveludens.com/reply-to-duoae

😃

Duoae said...

Thanks, I'll take a look when I get a chance!

LateToTheParty said...

This is a very interesting read. I'm not a techy person, at all, but if I understood correctly, the 10GB of "GPU optimized" RAM cannot run at a bandwidth of 560 GB/s if the other 6GB is accessed because once you do that, the "slower" RAM will occupy the 32 lanes to each memory controller. If that happens, then the bandwidth of the "fast" RAM will go down because the bandwidth is determined by (frequency) x (# of chips) x (# of lanes) / 8 and the less lanes occupied by the "fast" RAM, the lower its bandwidth.

Reading your writeup makes me wonder if the asymmetrical RAM setup will exacerbate the XSX's issue with how it needs to have separate pools for loading and streaming assets. According to this person, the XSX has to do this because you can't load data into frame while the frame is rendering. He predicted that the XSX would use 5GB for the working set, 2.5GB for loading, and 2.5 for streaming. However, he did this with the assumption that the OS would use the "slower" RAM. If the OS is occupying the "fast" RAM, then wouldn't that make make the asymmetrical RAM setup an even bigger albatross? The user also mentioned how with the PS5, it doesn't need to have a separate loading and streaming pool because the GPU cache scrubbers will "reload assets on the fly while the GPU is rendering in frame".

Sorry to dump even more reading material on you, but as someone who has extremely limited knowledge on this stuff (I know how to build a PC, but that's it), I'm just very curious about all of this stuff.

https://www.neogaf.com/threads/ssd-and-loading-times-demystified.1532466/

Duoae said...

Thanks, I'll take a look. I should be releasing my follow-up to this article tonight.

You understand the gist of the article, yes. However, I have no idea about what you wrote about until i take a look at what that person wrote. In theory, each pool of RAM should be parallelised, meaning that you don't partition individual chips or "sizes" of memory between different operations - it should just be about access to the full pool.

Whether there are limitations on the hardware connecting to the memory controllers, I cannot say or know until more information is released on the internal makeup of each console. I'd be surprised if this person is able to say that either, so looking forward to reading it!

Duoae said...

Hey, LateToTheParty.

I think this person has some interesting information but is making some assumptions based on a single game engine. The numbers they are quoting are specific to Guerilla's engine and how it manages the memory available in the PS4. In fact, when you add up all the memory you get some weird total...

I'll look into this a bit deeper and do another post around this concept but I'm going to need a bit of time to fully absorb and (internally) debate the ramifications of this...

LateToTheParty said...

Thanks for the responses. I'm completely uneducated on how these things work, but you managed to explain things well enough that I can understand them. Apparently, this interview with a Crytek engineer, Ali Salehi, got around the interwebs and his main criticism on the XSX hardware is also its RAM setup. And the interview has been removed possibly due to NDA reasons, but the transcript can be found here: https://www.neogaf.com/threads/ali-salehi-a-rendering-engineer-at-crytek-contrasts-the-next-gen-consoles-in-interview-up-tweets-article-removed.1535138/

I look forward to your next writeup. I've already learned a lot just from reading from your stuff.

Duoae said...

Hi LateToTheParty,

Thanks and no problem!

Actually, I already saw the Salehi interview. It's kind of nice to know I'm not completely crazy, especially when so many people were just saying "No! You're wrong!" ;)

I don't think there's much point in my covering his rescinded interview because I'd just be retreading old ground (plus, he really didn't say anything outside of repeating what other people had said - though I can see how he'd get in trouble). Until I get more info on either console I think my part 10 will be the last entry (on console stuff) for a while...

dc_coder_84 said...

When comparing CPUs I would only compare them in their fastest hardware mode, which means with hardware multithreading. The CPU mode without SMT is only for "old" game engines. So the XSX CPU is only 3.66/3.5 = 1.045 times faster than the PS5 CPU. It's misleading when you say the XSX CPU is 1.1x faster.

MetalSpirit said...

Duoae

You stated:

"RDNA 1.0 has 128 KB of L1 and 4 MB of L2 cache across a dual compute unit. If we assume the same size for RDNA 2.0 and in the custom chip of the SX then we'd get 3.3 MB and 104 MB respectively plus 512 KB x2 L2 and 4MB x2 L3 for a total of ~116.3 MB. Clearly that cannot be the case as it's higher than 76 MB... therefore the CPU of the SX might have less cache than even the mobile Zen 2 chips and nowhere near that available to the CPU of the 3900X (70.7 MB)."

Are you sure about this?

RDNA L1 caches are not on the CUs, but on the Arrays. There are two Shader Engines, each with two shader Arrays, each shader array with 128 KB L1 cache (512 KB total on all 4 Arrays).
L2 is a total of 4 MB, and there is no L3 cache!

https://hexus.net/media/uploaded/2019/6/3af2863e-d5a3-40d5-8757-252ae540215b.PNG

There is a L0 cache with 32 KB in size, shared by all 4 simd in a dual CU.

So total GPU cache for a 40 cu RDNA unit would be 640 KB L0, 512 KB L1, 4096 KB L2.

Correct me if wrong.