11 April 2020

Analyse This: The Next Gen Consoles (Part 10) - Comments

Last time I got some quite in-depth comments that required a long response. I figured that I'd give a proper blogpost over to the answers... so, to cover everything from Pete, MetalSpirit and LateToTheParty:

Graph showing the various rates of data able to be transferred per 16 ms frame depending on the percentage amount of time allocated to each portion of RAM...
MetalSpirit said:
"As far as I see it, GDDR6 is accessed on a dual 16 bits bus. So each 64 bits controller is accessing each memory module on a 2x 16 bits bus, and two modules per controller.

So if more than 10 GB is used, regardless of bandwidth need, if that memory is accessed we need at least to divert six 16-bit channels, removing 96 bits from the bus, or 168 GB/s from the fast memory, giving it 392 GB/s.
So, even with the extra 6 GB being used for stuff with low accesses, every time an access is made, bandwidth drops to 392 GB/s on the fast memory, regardless of you pulling 168 GB/s from the slow memory, or not!
And if you pass 168 GB/s pulled from that 6 GB memory, you will need to divert the second 16-bit channel to that memory, leaving 224 GB/s on the fast memory.
But if a simple access will divert the bus channels, then you cannot ever rely on always having 560 GB/s on the fast memory. 392 GB/s seems a much more secure number.
For 560 GB/s you would need to use only 10 GB or pull from both memories at once, but having only 2.5 GB available on those 6 GB I find it difficult for you to manage to pull the extra 168 GB/s.
Am I wrong?"
LateToTheParty said:
"This Resetera user by the name of Lady Gaia measured both systems in terms of GB/s/TF. Turns out if the CPU needs 48 GB/s, the GB/s/TF is the same for both systems so you're pretty dead on. For the XSX, if you need to free up X GB/s bandwidth of the CPU, the theoretical peak GPU bandwidth actually takes a larger penalty than X GB/s because of its asymmetrical RAM design."
Yes, any time you access the 6 GB portion of RAM you impact the bandwidth available to the 10 GB portion. You can see that in the graph above and also the crude APU/CPU diagrams I've been posting for this and the last entry - those numbers match what MetalSpirit posted.
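The figures in MetalSpirit's comment can be reproduced from bus width alone. The following is a minimal sketch, assuming a per-pin data rate of 14 Gbps (the rate that yields 560 GB/s on a 320-bit bus); the function and variable names are made up for illustration.

```python
# Assumption: every data pin runs at 14 Gbps, which is what gives
# 560 GB/s on the full 320-bit bus quoted for the Series X.
GBPS_PER_PIN = 14


def bandwidth_gb_s(bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s for a bus of the given width in bits."""
    return bus_width_bits * GBPS_PER_PIN / 8


full_bus = bandwidth_gb_s(320)                        # ten chips x 32 bits
one_channel_diverted = bandwidth_gb_s(320 - 6 * 16)   # six 16-bit channels removed
both_channels_diverted = bandwidth_gb_s(320 - 12 * 16)
slow_pool_one_channel = bandwidth_gb_s(6 * 16)        # what the 6 GB pool receives

print(full_bus, one_channel_diverted, both_channels_diverted, slow_pool_one_channel)
```

Under these assumptions each group of six diverted 16-bit channels costs the fast pool 168 GB/s.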

There are a few questions over Microsoft's /exact/ implementation, though. Even Lady Gaia's post has some assumptions which disagree with my own. That person basically takes a best case scenario whereby each chip is accessed individually and there is no interleaving of the address space.

I tried to cover that a bit in the first post on this subject but it wasn't something I explored properly because it goes against how RAM is usually arranged and controlled... and it goes against the JEDEC standards. This was also the gist of the disagreement I had with Urian over at Disruptive Ludens - I was speaking about what happens after the piece of the hardware he was labelling as the DCT. Both he and Lady Gaia are looking at RAM access from a component/system point of view and I'm looking the other way from a RAM perspective.

Yes, the CPU might only pull 48 GB/s bandwidth due to its interface with the system but the RAM setup doesn't care about what limits individual components have in accessing it. That is why the equation is not simply (560 - 48 = 512 GB/s & 448 - 48 = 400 GB/s). You can only subtract that amount (that simply) when your components are all addressing the same memory address space. Actually, in this example, the calculation for the PS5 works out the same but not for the SX because of the issues we've discussed over the last two posts.

Going back to the unknown idiosyncrasies of the SX, there are effectively three possibilities that everyone is speaking about:

  • Non-interleaved individual access to each module or controller (i.e. two RAM modules)
  • Simultaneous interleaved access to the entire controller setup
  • Non-simultaneous interleaved access to the entire controller setup

The last two are what I was covering and the first one is what both Urian and Lady Gaia were talking about, even if they never mention it.

Non-Interleaved Individual Access

From what I understand, that first scenario is very unlikely because a) it goes against the JEDEC standard; & b) it goes against most RAM setups (DDR and GDDR) ever implemented and has huge costs in terms of complexity.

When I say it goes against the JEDEC standard, what I mean is that RAM is usually interleaved in order to improve parallelism and access times. This is the reason that DDR and GDDR modules must all work at the same rated frequency. If they are working at different frequencies then the bus becomes desynchronised and you have bottlenecks and misses all over the place... The exception to this is HBM RAM which can operate (theoretically) with different frequencies on each layer of a stack of HBM RAM. Of course, AFAIK, there has never been a system implemented in this manner due to the added complexity!! (But the capability is there)

Now, for the memory in both next gen consoles, there is a consistent clock (frequency) across all modules and the access bandwidth for each address space is large (560, 336 & 448). If the modules were accessed independently, those wide bandwidth numbers could not be claimed. You'd have 3x64-bit for OS and 10x64-bit / 4x64-bit depending on the moment-to-moment access from the different components. This would also be awkward because the memory would not be as easily shared between the different components - a big selling point of this type of setup. It would also result in huge amounts of wasted bandwidth whereby the chips that were waiting for the OS/game sounds to be accessed would be left idle when not needed.

So, thinking about this logically, this first possible RAM configuration does not appear to be implemented. As in - I can't logically justify it from a rational point of view.

Simultaneous Interleaved Access

This configuration could allow for 16-bit access across individual chips, resulting in 280/168 GB/s access across the two address spaces. This would leave the other channel of the 4x 1GB modules unused but the access bandwidth would be synchronised, making the configuration much simpler to manage.

There is another potential of this as well – asynchronous access across a single address space. 4 modules running at 32-bit and the other 6 shared across two 16-bit interfaces. That would theoretically give the above figures of 392/168 GB/s.
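The two variants described here can be written out chip by chip. This is only a sketch of the arithmetic, assuming 14 Gbps per pin; it says nothing about whether the real controller supports either mode, and the helper name is hypothetical.

```python
GBPS_PER_PIN = 14  # assumed per-pin data rate in Gbps


def pool_bandwidth(chips: int, bits_per_chip: int) -> float:
    """Peak GB/s for a number of chips each exposing bits_per_chip to a pool."""
    return chips * bits_per_chip * GBPS_PER_PIN / 8


# Variant 1: every chip gives one 16-bit channel to each address space,
# so the second channel of the four 1 GB chips is left unused.
variant1_fast = pool_bandwidth(10, 16)
variant1_slow = pool_bandwidth(6, 16)

# Variant 2: the four 1 GB chips run at the full 32 bits while the six
# 2 GB chips split their two 16-bit channels between the address spaces.
variant2_fast = pool_bandwidth(4, 32) + pool_bandwidth(6, 16)
variant2_slow = pool_bandwidth(6, 16)

print(variant1_fast, variant1_slow)
print(variant2_fast, variant2_slow)
```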

Of course, I don’t know if this is possible. From what I have read of the JEDEC standard, both possibilities are theoretically allowed, but I don’t know how the reality would shake out.

Non-Simultaneous Interleaved Access

The third possibility is basically the standard setup for both DDR and GDDR implementations. You have an interleaved address space across which is written data in parallel fashion in order for that data to be accessed more quickly. If there’s a file of X kb across Y modules then you can pull it off at the access speed of Y modules multiplied by their access width. If the data is across just one module, you’re limited to the access speed of the controller covering that module.

In this scenario, data could be moved on a priority basis (as Pete pointed out in the comments last time) meaning that the developer can minimise the amount of time spent accessing the slower pool of memory.
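One way to put numbers on this scenario is to treat a frame as being time-sliced between the two pools. The sketch below is an approximation under that assumption (a 16 ms frame, 560 GB/s for the 10 GB pool, 336 GB/s for the 6 GB pool, no switching overhead); the function name is illustrative only.

```python
FRAME_S = 0.016      # one frame at roughly 60 fps
FAST_GB_S = 560.0    # 10 GB pool, full 320-bit access
SLOW_GB_S = 336.0    # 6 GB pool, 192-bit access


def data_per_frame_gb(slow_fraction: float) -> tuple[float, float]:
    """Return (fast_gb, slow_gb) movable in one frame when slow_fraction
    of the frame time is spent accessing the 6 GB pool."""
    if not 0.0 <= slow_fraction <= 1.0:
        raise ValueError("slow_fraction must be between 0 and 1")
    fast_gb = FAST_GB_S * FRAME_S * (1.0 - slow_fraction)
    slow_gb = SLOW_GB_S * FRAME_S * slow_fraction
    return fast_gb, slow_gb


for pct in (0, 10, 25, 50):
    fast_gb, slow_gb = data_per_frame_gb(pct / 100)
    print(f"{pct:>3}% on slow pool: {fast_gb:.2f} GB fast, {slow_gb:.2f} GB slow")
```

The less time a developer spends in the slower pool, the closer the fast pool gets to its full per-frame budget of roughly 9 GB.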

A crude system schematic showing the potential RAM access modes that the Series X might be using...

Going back to Lady Gaia’s example of 80 GB/s: if it was a single address space of memory then a simple ratio (like they did) between different memory sizes would work. That isn’t the case here, because it’s 48 GB/s of access (and this isn’t confirmed as the limit for the bandwidth of the CPU) removed from whatever width of address space is currently being accessed, whether that’s the 10 GB portion or the 6 GB portion. The bandwidth of those 6 modules is 336 GB/s when accessed at their full potential, so 48 GB is covered by that, meaning that you don’t need to perform a ratio calculation because it’s unlikely that such an amount of data would need to be moved per frame.

To put it another way – it would take ~9x 16 ms frames to move 48 GB but why would such a large amount ever need/(be expected) to be moved to or from the CPU in such a timeframe? That makes it a moot point, in my opinion... The access cost for the CPU is always 48 GB/s, however the access cost to the narrower pool of RAM (192-bit/336 GB/s) is separate from that calculation and is omnipresent regardless of what’s accessing it.
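The "~9 frames" figure follows directly from the numbers above. A quick sketch, assuming the full 336 GB/s is available to the narrower pool for the whole of each 16 ms frame:

```python
import math

FRAME_S = 0.016  # one 16 ms frame


def frames_to_move(data_gb: float, bandwidth_gb_s: float) -> float:
    """Number of frames needed to move data_gb at a sustained bandwidth."""
    per_frame_gb = bandwidth_gb_s * FRAME_S
    return data_gb / per_frame_gb


frames = frames_to_move(48, 336)
print(f"{frames:.2f} frames, i.e. {math.ceil(frames)} whole frames")
```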

The fill rates available to the two consoles, showing the relative strength of the PS5...

System requirements

Pete said:
"Another interesting metric is if you calculate (best case) bandwidth per GPU TeraFlop for each system.
XSX = 560 / 12.15 = 46.1
PS5 = 448 / 10.28 = 43.6
Surprisingly close ;-)"

Regarding Pete’s observation on the ratio of (best case) RAM access bandwidth to GPU TFLOP, this may not be the best metric to measure but it is another point that’s indicative of the underlying “hunger” of each system’s GPU. The only reason I say this is because the TFLOP is a calculation-based metric, whereas a lot of data moved to the GPU will inform other operations which do not necessitate a part of these calculations. As Mark Cerny pointed out in his talk – other parts of the GPU run faster when running at higher clock speeds. In that scenario, the other parts of the PS5 GPU require more feeding with data… so the numbers are not so clear-cut.
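For reference, Pete's ratio can be recomputed from the building blocks. This sketch assumes the publicly stated GPU configurations (52 CUs at 1.825 GHz for the Series X, 36 CUs at up to 2.23 GHz for the PS5) and the usual estimate of 64 lanes per CU at 2 floating-point operations per clock; those inputs are not taken from this post.

```python
def tflops(compute_units: int, clock_ghz: float) -> float:
    """Peak FP32 TFLOPs: CUs x 64 lanes x 2 ops per clock x clock (GHz)."""
    return compute_units * 64 * 2 * clock_ghz / 1000


def gb_s_per_tflop(bandwidth_gb_s: float, compute_units: int, clock_ghz: float) -> float:
    """Best-case memory bandwidth available per TFLOP of GPU compute."""
    return bandwidth_gb_s / tflops(compute_units, clock_ghz)


xsx_ratio = gb_s_per_tflop(560, 52, 1.825)
ps5_ratio = gb_s_per_tflop(448, 36, 2.23)
print(f"XSX: {xsx_ratio:.1f} GB/s per TFLOP, PS5: {ps5_ratio:.1f} GB/s per TFLOP")
```

Because the PS5 reaches its figure with fewer CUs at a higher clock, the same ratio hides a different balance between the calculation units and the parts of the GPU that scale with frequency.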

It’s for this reason that I like to stay more abstract and compare static resources (such as cores/Compute Units) to frequency, and bandwidth. I’m already treading on shaky ground just keeping to these abstracts, if I go with a more specific analysis then there’s a possibility I’m looking too low-level and drawing conclusions on specific parts of an architecture that could belie the whole picture.


RKO said...

Hi, but the PS5 could have hUMA, AMD's technology.

RKO said...

hi, ps5 could mount AMD's huma technology.

MetalSpirit said...

First of all, thank you for the reply.
The best solution, in terms of bandwidth seems to be what Pete mentioned.
But even so, it seems to me that even if time spent on that memory is small, maximizing bandwidth, drops would occur. And although bandwidth can on average be above 392 GB/s, counting on more could cause drops in performance when the 6 GB memory is accessed.
So, if I'm seeing this right, the limitations caused by the split can never really be removed, although they can be minimized.

Duoae said...

Hi MetalSpirit,

Yes, developers can work around them - much in the same way the PS3 and Xbox One could be worked around but there's always a cost.

However, saying that - the Series X is so powerful and we have so many more "tricks" that both consoles will be utilising (VRS, etc) that it really doesn't matter from a user standpoint. It just means that both consoles will be presenting themselves effectively identically to the end-user, except for a couple of games.

I think the power of the SX is "reduced" partially because of this and partially because cross-platform games will target the lowest common denominator - early on, the PS4 Pro and Xbox One X and, later in the generation, the PS5.

Duoae said...


I don't know if it's still called HUMA anymore but, as far as I understand it, both PS4 and Xbox One use this already... it's guaranteed that since the PS5 and SX *only* have a single block of GDDR6 that this concept is being used.

So, the answer is yes.

MetalSpirit said...

Btw, do you know what the Geometry Engine is? Is it something PS5 exclusive?
Is this different from the Geometry Processor, standard in RDNA?

Duoae said...

Hi MetalSpirit,

I'll take a look into it and get back to you. My basic understanding is that they're different and that it is similar to the implementation in NVidia cards: a way of culling unnecessary detail from mesh models. Unlike level geometry (which traditionally utilises BSP trees or voxel structures), mesh models are usually fully rendered and in active memory even when partially or mostly obscured.

These new techniques are used to reduce graphical overhead and increase frame rate through lessening the amount of processing the graphics card has to do.

DavidB said...

Also to the point of JEDEC standards, in the end both MS and Sony have to design devices that can be mass produced in a cost effective manner. Custom non-JEDEC 16GB of RAM in my simple mind doesn't meet such a standard?

Duoae said...

Hi DavidB,

Well, the RAM modules themselves are standard. The controllers are integrated into the custom SoC from AMD so I guess anything is *possible*... whether it's *probable* is another discussion :)

Pete said...

Hi Duoae,

Great post. I think this should clear up a lot of confusion.

Thanks for all the research and effort. I won't dare to open a GDDR6 datasheet, never mind try and explain it.

I find it weird that console tech is so criminally under-explained compared to PC tech in the enthusiast space. For example, what's the equivalent of AnandTech in the console segment? The irony is that there's a lot more interesting custom tech in consoles than your average gaming PC. At least in this gen it seems.

Looking forward to it if and when you do another deep dive.

Andy said...

The problem with the system MetalSpirit proposed is that an interleaved system demands all chips have the same performance, so if the 6GB portion is accessed using 6x 16-bit channels the remaining 4 chips must run at half speed to match the other 6 chips for the interleaved system to work, giving the GPU 280GB/s and the CPU 168GB/s.

Am I wrong?

Andy said...

"4 modules running at 32-bit and the other 6 shared across two 16-bit interfaces. That would theoretically give the above figures of 392/168 GB/s"

What do you mean by asynchronous access? And how could GPU interleaved memory possibly work with 4 chips (56GB/s each) being faster than the 6 other chips (28GB/s each), giving 392GB/s?

Duoae said...

I'll explain below:

Duoae said...

Hi Andy,

The GDDR6 spec is a bit vague when it comes to this aspect. What is stated is that the memory modules must run at the same frequency and that the two 16-bit channels can operate independently (or linked together [pseudo channel], or even split in two[x8]! - each with their own cost in terms of access/bandwidth).



In terms of an interleaved address space this is an abstraction, so the actual implementation can be defined by the persons who design the controller. There are two types of interleaving (High and Low order). As far as I understand it, for high order interleaving, the only restriction is on unified *access*... not simultaneous access width, since the memory address is specified with each access. Therefore, it would be possible for the separate channels (16-bit) to operate asynchronously over different controllers and access data of different widths across each module.

This would have the penalty of not being able to access contiguous address areas, resulting in potentially slower data retrieval for some operations but this could be offset by the higher overall bandwidth available by choosing to do so.
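The difference between the two interleaving types is easiest to see as an address mapping. The sketch below uses the generic textbook definitions with made-up module counts and sizes; it is not a description of the actual controller in the Series X SoC.

```python
def low_order(address: int, modules: int) -> tuple[int, int]:
    """Low-order interleaving: consecutive addresses rotate across modules.
    Returns (module index, offset within that module)."""
    return address % modules, address // modules


def high_order(address: int, module_size: int) -> tuple[int, int]:
    """High-order interleaving: one module is filled before the next.
    Returns (module index, offset within that module)."""
    return address // module_size, address % module_size


MODULES = 4       # hypothetical number of modules
MODULE_SIZE = 8   # hypothetical words per module

for addr in range(10):
    print(addr, low_order(addr, MODULES), high_order(addr, MODULE_SIZE))
```

With low-order interleaving a contiguous run of addresses touches every module, which is where the parallelism comes from; with high-order interleaving a contiguous run stays on one module, so each module can be addressed more independently.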


Andy said...

Hi Duoae

I don't have a problem with 16-bit addressing. What doesn't make sense is chips providing different bandwidth; for interleaved memory to work all chips must perform the same.

For the model in which the GPU accesses 4 chips at 56GB/s each and 6 chips at 28GB/s, wouldn't this have the same issues you mentioned for chips running at different frequencies? "the bus becomes desynchronised and you have bottlenecks and misses all over the place... "

Duoae said...

Hi Andy,

I've looked at this over the last few weeks and I've found no documentation that specifies this.

Desynchronisation is specifically to do with frequency and clock regulation, not bandwidth. Bandwidth is related to the bit-width per pin on the RAM.

The only stipulation I've seen for interleaved memory for DDR/GDDR is that it must have the same clock and capacity. But that seems specific to those types of RAM. There doesn't appear to be any limitation in terms of interleaved memory (you can even mix different types and speeds of RAM/ROM) as long as the controller supports it.

Since we don't know what implementation is in place and what the AMD controllers on the SoC are capable of, there is the possibility of both scenarios being plausible.

Duoae said...

Of course, there would be a penalty for doing it that way - you'd have to potentially wait for data off the modules being accessed across a single 16-bit interface but there might be scenarios where it was preferable to accessing zero data from that whole bank of interleaved address space...

My personal preference (in terms of simplicity) would be for non-simultaneous access. But in a closed, standardised system you could do it in a more complicated manner in order to eke out as much performance as possible.

Of course, i wouldn't have gone with the SX's memory setup in the first place :)

Duoae said...

I was thinking about this a bit more - maybe i confused matters when i said "asynchronous". I should have written something different but i cannot think of a word to describe a "narrower" access like this...

DavidB said...

"Of course, i wouldn't have gone with the SX's memory setup in the first place :)"

Maybe I missed where you said WHY MS would have done this? Why make it so overly complicated, forcing devs to decide in which set of RAM to store particular assets? Will such a memory scheme make the PCB easier to design, or cheaper to produce, or give a non-trivial cost reduction, or easier cross-development for PC, or benefit backwards compatibility somehow, or ??? Or will the SX be able to feed data to the GPU just that much marginally "faster", thus a frames/second marketing advantage over the PS5? I'm just struggling, as an engineer, to understand why do this...

Pete said...

The one reason I can think of would be cost. Let's assume they *need* the 560GB/s bandwidth to feed their GPU. If they wanted a single bandwidth over the full memory pool (à la PS5), they needed 20GB (10x2GB). Maybe this adds up? Not sure what the going rate on GDDR6 is these days.

Andy said...

Hi Duoae
Are you sure this is the case? "Desynchronisation is specifically to do with frequency and clock regulation, not bandwidth. Bandwidth is related to the bit-width per pin on the RAM." Isn't frequency related to bandwidth?

My high-level understanding of an interleaved memory system is that the chips the GPU accesses act as one uniform pool; that's why all chips must run at the same frequency and be of the same size.

If you have chips providing different bandwidth how would the pool synchronize reads/writes? In your example the 4 chips at 56GB/s would stall waiting for the 6 28GB/s chips. That's why I suggested capping the 4 remaining chips at 28GB/s Giving GPU 280GB/s

The complexity increase required to have multiple memory buses to treat each group as a discrete pool is not suited for gaming and the cost increase would undermine the decision to go with 4x1GB chips. Going for 10x2GB chips might well be cheaper.

Duoae said...

Hi DavidB,

Supposedly they had planned a 20 GB capacity but then budget cuts happened (I mentioned it briefly in one of my posts). The only thing with this story is that the very early leaks of the SX had only 16 GB of RAM so I'm not sure how much exaggeration is going on with this story (i.e. it may have been in the original early design but when the overall cost of the system was estimated they had to scale some things back and compromise in order to reach that estimated BoM).

Duoae said...

Hi Andy,

Well, as far as I understand things, yes. Synchronisation is specifically about clock synchronicity. Bandwidth is irrelevant. In fact, your example (56 GB/s waiting for 28 GB/s) makes no sense because neither are waiting for or limiting each other.

To put it this way, data is stored across various chips in one of several ways when interleaving is in place. Most data is actually quite small in size - people talk about words in this instance as the data is stored within a "word" of arrayed bits on the RAM and this is the actual address (column/row) that the external call is asking for.


Now, the bandwidth is how many bits can be sent/received per given timeframe. The available bandwidth is limited by the interface to the memory modules - in the case of GDDR6 we're talking about two pins/channels, each with 16-bits width.

The memory controllers are able to talk independently across each pin in order to access the memory addresses requested. It is up to the controllers to prioritise and coordinate recall of data within an interleaved address space - regardless of the available bandwidth.

Let's say there's a "large" file to be retrieved and it is stored across 4 chips, in a single word of each chip. The controller must know that the data is stored where it is and thus coordinates the sending of the address spaces for each chip to each chip at the same time, within the same clock. If the frequency of the chips would be different, then that controller would be waiting across several cycles to coordinate the returned data, if they're all the same (synchronised) then the data can be sent on the same 16n pre-fetch and returned on the next 32-bit word operation - there's no waiting around. (This is a 32-bit size, it could have been a 16-bit size address)

Let's go with the more complicated situation that we're talking about here. Let's say two of the 4 chips are 2 GB and two are 1 GB - with two, separate, interleaved address spaces. Let's say we have the same data being requested but at the same time we have another piece of data required to be retrieved from the 1x 1 GB of the second address space. The controller (through priority indication from the developer in the game code) decides which is more important and performs this operation first before switching to the second retrieval.

Changing the hypothetical scenario a little, let's make that first piece of data smaller, so that it fits across just two of the 4x 1GB address space of the interleaved memory - the two which are not 2 GB capacity and thus not sharing bandwidth with the second address space. The second piece of data on the second address space is unchanged.

Now, the memory controllers receive the same instructions with the same priority calls from the programme but they find there is no conflict in address spaces. The retrieval can occur at the same time because the 32-bit word can be read and sent within the same period of time from the two sets of 2 chips.

The problem really only occurs when data is being shared over chips/pins that are being accessed simultaneously. Now, if you want to wait a bit longer or if you want to sub-divide your access to the 16-bit width of a single pin, there's no real reason why you couldn't, just that there could potentially be a penalty in access as the controller must wait for the remainder of the information to be passed into its cache.

It would be less efficient.

The read and writes do not need to take place simultaneously but they need to be delivered to/from the memory controllers from/to the external requesting hardware simultaneously.

Regarding multiple memory buses, well... graphics cards already have multiple controllers - it's just that there's a piece of hardware which makes this transparent to the system. The Radeon RX 5700 has 4x 64-bit wide controllers that manage the workflow of the DRAM modules.

Andy said...

Thanks for the reply, can you answer this and tell me where I'm wrong:

Let's say the GPU is working with 3GB of data spread across 10 chips for maximum bandwidth. If 4 of those 10 chips are running at half speed (28GB/s) then wouldn't the GPU have to wait for the 4 slower chips to catch up in order to access the full package of data it's working with?

I read online that as little as a few MBs worth of data/files are spread across all chips because of the nature of interleaving. So for example in order for the GPU to access a 4MB asset it has to read all memory modules to piece it together.

Duoae said...

Hi Andy,

It may be easier to speak in bits than GB/s. Access is never likely to be for an entire second; you'd be talking about reading the entire 16GB RAM 35 times - something that isn't really happening.

The simple answer to your question is "yes". The more complicated answer is that your 24 billion bits (3 GB) isn't going to be accessed in a single cycle/ memory operation.

Now, it's a bit early in the morning but by my calculations it will take 75 Hz to complete that 3 GB retrieval (or 0.005 sec). That's at 10x 32-bit interface. If you reduce 4 of those widths you'd get a 256-bit interface. With that you'd complete a 3 GB retrieval in 93.75 Hz (or 0.0067 sec).

Those are 30-40 % of a 16 ms frame time. **with actually having to fetch address spaces, the transfer will never be this efficient but let's say it's 1.5x these numbers, so 40-65 % of a 16 ms frame time**

So, yes, the hardware needs to wait for the data but that can be said for any operation. 3 GB will take longer than 1 GB, for example. However, it's still doable within a 60 fps frame time.
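The timings above can be checked with a few lines of arithmetic. This is a sketch assuming 14 Gbps per pin (14 billion transfers per second across the bus) and ignoring all addressing overhead; the function name is illustrative.

```python
TRANSFERS_PER_S = 14e9   # assumed: 14 Gbps per pin
FRAME_S = 0.016          # one 16 ms frame


def transfer_stats(data_bytes: float, bus_bits: int) -> tuple[float, float, float]:
    """Return (bus transfers needed, seconds taken, share of a 16 ms frame)."""
    transfers = data_bytes * 8 / bus_bits
    seconds = transfers / TRANSFERS_PER_S
    return transfers, seconds, seconds / FRAME_S


for width in (320, 256):
    transfers, seconds, share = transfer_stats(3e9, width)
    print(f"{width}-bit: {transfers / 1e6:.2f} million transfers, "
          f"{seconds * 1000:.2f} ms, {share:.0%} of a frame")
```

Under these assumptions 3 GB needs 75 million bus transfers at 320 bits and 93.75 million at 256 bits, which is where the 0.005 s and 0.0067 s figures come from.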

Now the issue with moving 3GB is that those operations can be affected by access to the second pool of RAM. Typically, this access will not require as much data as the GPU operations (i.e. code and calculations take less bytes than images) but it will still require a decent amount between the audio, game logic (AI, etc) and OS calls, especially considering various overheads. This will reduce the overall bit-width available to the fast portion of RAM within a given frame time.

Obviously, the requirements of a particular game will be different but at some point, (see my graph) the GPU will become data starved.

Duoae said...

Oops, i wrote Hz instead of cycles when speaking about the no. of cycles required to complete a 3 GB transfer. Sorry, i was tired. Just to clarify, the RAM is working at an effective 14,000 MHz and roughly 75 million of those cycles are required for the transfer across a 320-bit bus.

Andy said...

It just all seems so abstract to me to talk in small bits quantities when devs will be working with 10GB VRAM and most of it will be active (since fast SSD will drastically reduce the need to cache in vram assets for the next several seconds of gameplay)

The concept of interleaving is that all chips function as a single fast pool that's how the GPU would see it correct?

So say the GPU is working on the 10GB pool at 560GB/s when the CPU needs to access the 6GB pool using the 392/168 GB/s model:

How would the GPU see the 10 chips act as a uniform pool if 4 chips have higher throughput (56GB/s) than the other 6 chips (28GB/s)? The data the GPU is working with is theoretically spread across all chips.

To help me understand can you answer my previous question: a 4MB asset (32000000 bits) would be spread across all 10 chips, correct? So if the GPU wants to access it, it would have to wait for the other 6 chips to catch up to the faster 4 chips. Wouldn't this stall be the equivalent of capping the 4x1GB chips to 28GB/s for a total bandwidth of 280GB/s?

Thanks for your responses, I really appreciate them. I hope I'm not becoming a pain in the rear.

Pete said...

Hi Andy,

The way I look at this (and this is just conjecture), is that it actually uses the 32-bit bus over all 10 chips. If you look at the memory config diagrams that Duoae posted, you basically just jump between either the top left (if you access the high BW 10GB) and the bottom left (if you access the narrower BW 6GB).

You then basically have the six 2GB chips with their memory split in half. The one half is interleaved with the high BW 10GB config, and the other half is interleaved with the narrower BW 6GB config.

The four 1GB chips are disabled/disconnected when accessing the narrower BW 6GB config.

Duoae said...

Hi guys,

Not accessing one interleaved address space when accessing the other is the absolute worst case scenario.

I.e. it would only happen when the data needed was all stored on those 6x 2GB chips.

In actuality, as Pete said, the total addressable bandwidth to each address space is varying between 320 - 128-bit and 192 - 0-bit (for cases where either address space is fully accessed).

Since i can't model that (and i don't think anyone can because it would be application specific and need knowledge of OS requirements!), I've been working on the approximation of time spent per frame (thanks to Pete's excellent comparison).

The reason why the large address space can be accessed at the same time as the smaller one is because the 64-bit controllers are separate and can send their own requests to the modules they are connected to (with the previous stipulation that only data that is not spread across the inaccessible modules can be accessed in this manner).

Again, Andy, i think it's hurting your understanding to speak about this in GB/s instead of bit width.

People like to use the comparison of lanes on a motorway. Well, each 16-bit channel on each module is a separate lane for a total of 20 lanes. The motorway has a speed limit that you cannot go above or below (14,000 MHz effective) and travel time is instantaneous so the vehicles spend zero time on the motorway - they just appear at the destination from their origin - but more than one vehicle cannot use a lane at a time (one memory cycle, i.e. 1/14,000th of a microsecond).

Imagine there are two cities linked to 12 of the lanes (6 modules) at one end of the motorway and there's a factory at the other end of the motorway that, in rush hour, workers must compete to use to get to work on time. The first, larger city also has another 8 lanes all to itself.

When traffic from the second, smaller city wants to go to the factory, the traffic from the bigger city on those 12 lanes is stopped but can continue on the 8 remaining lanes.

However, how many cars are able to travel at the same time from the two cities changes based on which lanes are being used from each city. The total possible number of vehicles never changes (it's 20, one for each lane) but if only one vehicle from the smaller city is travelling, the remaining 19 vehicles can be from the larger city.

Does that make any sense?
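The lane counting in the analogy can be written down directly. A toy sketch, assuming 20 lanes of 16 bits each, 12 of which are shared between the two cities; the names are invented for the example.

```python
TOTAL_LANES = 20     # ten modules x two 16-bit channels
SHARED_LANES = 12    # the six 2 GB modules serve both cities
BITS_PER_LANE = 16


def lanes_for_large_city(small_city_lanes: int) -> int:
    """Lanes left for the large city (10 GB pool) in one cycle when the
    small city (6 GB pool) occupies small_city_lanes of the shared lanes."""
    if not 0 <= small_city_lanes <= SHARED_LANES:
        raise ValueError("the small city can only use the 12 shared lanes")
    return TOTAL_LANES - small_city_lanes


for small in (0, 1, 6, 12):
    large = lanes_for_large_city(small)
    print(f"small city: {small * BITS_PER_LANE:>3}-bit, "
          f"large city: {large * BITS_PER_LANE:>3}-bit")
```

The two extremes are 320-bit for the large city with nothing for the small one, and 128-bit for the large city with 192-bit for the small one.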

Duoae said...

The bit where the scenario becomes complicated is that there's a car pool initiative implemented from the factory whereby vehicles from a single city meet up at a pit stop mid-way down the motorway, they then get into a single vehicle which delivers them to the factory.

This car pool vehicle cannot leave until every passenger arrives at the pit stop. Ideally, all passengers arrive at the same time but sometimes the car pool vehicle has to wait a cycle or two before it fills up.

(I think this should cover the scenario where a single file is spread across multiple modules)

RKO said...

Hi, is it possible that the split RAM of the Xbox Series X is there to maintain backwards compatibility with games for the Xbox One, which has ESRAM, and the Xbox One X, which has 12 GB of RAM on a 384-bit bus at 326 GB/s?

Unknown guy said...

Hi Duoae,

Do you think we'll see many games with a RAM budget >10GB? Does it mean that if this situation happens, the XSX can potentially push fewer pixels than the PS5 due to bandwidth limitations?

Thanks for this very informative blog by the way.

Duoae said...

Hey Unknown Guy,

Honestly? Yes and no. And I think that the answer isn't that simple. :)

There's an issue with the premise of this question. My own reasoning is that (from my perspective and from the released statements during the reveal of the Series X) Microsoft's OS could sit in the wide pool of memory until it's pushed out. People forget that the pools of memory are an abstraction. The 16-bit channels to each memory module have access to all address spaces on the RAM. It's the controllers that decide how to split them and access them.

However, if everyone who's been dismissing me in this respect is correct, it doesn't even matter if games utilise less than 7.0-7.5 GB RAM as the OS will always have a penalty to access it and /also/ non-graphics data will always sit in the narrower bandwidth memory. It doesn't make any sense to me to do things that way but it's a possibility.

So, there could be no benefit to running a game at less than that.

In terms of whether games will breach 10 GB in their RAM allocation? Yes, absolutely. The reason is *because they can* (and because the platform will support it).

Let's put it this way - Control at 4K resolution eats up 8 GB of VRAM. That doesn't even take into account system RAM usage... and I couldn't find numbers for that (and since I don't own the game I couldn't test it). I think that game probably already meets or just breaks the 10 GB barrier (if the system in question allows for it).


35% of Steam systems have 8 GB of system RAM and 40% have 16+ GB, so I'm not worried about games not supporting less RAM but I think developers will eat up that RAM budget very easily when targeting 60-120 fps and 2K/4K resolutions.


Regarding the second part of your question - no, I think the GPU on the SX is basically more powerful so it will always be able to push more raw pixels. However, it might limit itself to something closer to the end performance of the PS5 through being a bit RAM hungry and not being able to get that bandwidth (depending on the game). I think at that point, developers will drop the internal rendered resolution and rely on upscaling techniques to output at 4K. Which makes the comparison meaningless anyway :)

Unknown guy said...

Thanks Duoae. What I meant by "pushing less pixels" is exactly as you say: limiting itself in the case it's starving for bandwidth, forcing the dev to drop the native resolution a little (not a big deal with the reconstruction algorithms and dynamic res anyway).

By the way, what are the chances that Sony upgrades the GDDR6 chips to 16 Gbps (or better)? 512GB/s would be perfect, unless these consoles need less bandwidth than we think (with all the little caches everywhere)?

Duoae said...

Hi Unknown Guy,

To be honest, I think the price/profitability of the PS5 is already razor-thin. I don't think 16 Gbps GDDR6 is viable at this point in time for products that aren't high-margin (same story with HBM2).

I don't think the PS5 is bandwidth starved.

MetalSpirit said...

Hi Duoae

As for the RAM usage on the Xbox, maybe Sampler Feedback Streaming can greatly reduce memory needs.
Since total memory is 12.5 GB, with it it might just be possible to avoid the last 2.5 GB.
But as you say, and I see it as you do, even the OS on that RAM will cause bandwidth cuts. So counting on over 392 GB/s is tricky.

But X has 76 MB of caches... If I'm seeing it right, this is a lot! RDNA seems to have 512 KB L1 caches and 4 MB L2 caches. Even adding L0 caches (less than 1 MB) and CPU caches (16 MB per CCX), this seems a lot. (Correct me if I'm wrong, I'm checking this cache data for the first time, and as I type.)
Maybe the cache is there to compensate for these short periods spent accessing the lower speed pool!

RKO said...

In case it's useful to you.

Andy said...

Duoae & Pete can you answer this example to help me understand?
392/168 GB/s model:

First, do you agree that every single asset/file/texture must be spread across all 10 chips? That's how interleaving works, no? To take full advantage of 560GB/s. Micromanaging assets in 16 different pools and having it all sync together sounds like a developer nightmare worse than the CELL BE.

So back to my example: a 4MB asset (32000000 bits) would be spread across all 10 chips, correct? So if the GPU wants to access this asset it would have to wait for the slower 6 chips (28GB/s) to catch up to the faster 4 chips (56GB/s). Wouldn't this stall be the equivalent of capping the 4x1GB chips to 28GB/s for a total bandwidth of 280GB/s?

That is why I think the 392GB/s 168GB/s model in practice performs the same as the 280GB/s 168GB/s model.

Pete said...

Hey Andy,

I have no GDDR6 experience, so take what I say with a grain of salt. Having said that, I think that the bandwidth to each chip is always 56GB/s (32bit x 14Gbps). The difference is in the number of chips active/selected when accessing the different “pools”. When accessing the fast memory, all 10 chips are active, with data (4MB in your example) interleaved over all 10 chips @ 560GB/s. When the slower memory is accessed, only the 6 2GB chips are active/selected, with data (4MB) interleaved over these 6 chips @ 336GB/s.
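
Pete's figures can be checked with a few lines of arithmetic. The following is only a minimal sketch, assuming the publicly quoted 14 Gbps per pin and a 32-bit interface per chip; the function names are made up for this illustration and say nothing about how the memory controller actually behaves.

```python
# Illustrative arithmetic only. The figures are the publicly quoted ones
# (14 Gbps per pin, 32 data pins per GDDR6 chip); the names are hypothetical.
PIN_RATE_GBPS = 14   # gigabits per second per data pin
BITS_PER_CHIP = 32   # two 16-bit channels per chip


def chip_bandwidth_gb_s(bits=BITS_PER_CHIP, pin_rate=PIN_RATE_GBPS):
    """Peak bandwidth of one chip in GB/s (bits * Gbps / 8)."""
    return bits * pin_rate / 8


def pool_bandwidth_gb_s(active_chips):
    """Peak bandwidth when `active_chips` chips are accessed in parallel."""
    return active_chips * chip_bandwidth_gb_s()


print(chip_bandwidth_gb_s())     # 56.0
print(pool_bandwidth_gb_s(10))   # 560.0 (all ten chips, the "fast" 10 GB)
print(pool_bandwidth_gb_s(6))    # 336.0 (the six 2 GB chips, the "slow" 6 GB)
```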

Even though the 6 2GB chips contain data from both the fast pool and slower pool, the data doesn't clash, since the memory controller can just swap to the upper or lower half depending on which configuration it's in (fast or slow). In technical terms, it could be as easy as flipping the MSB on the address bus on the 2GB chips.
Now I haven't gone the extra mile to go through the GDDR6 spec (as Duoae did), but at least from a traditional memory point of view, I can see the above being a feasible solution.

This is also not really something that the developer needs to be aware of. All this can be abstracted by making use of address spaces. The memory controller could then configure itself based on these addresses.

One downside of this approach, as mentioned in Duoae's article, is that you can't have simultaneous access from GPU and CPU. But 1, I don't see a nice configuration where you can allow for this without impacting bandwidth, and 2, I don't think you would want simultaneous access to the CPU memory since this seems to be a shared pool between CPU/GPU (as per some people's comments) and this could create race conditions. So I would just go with the above approach and use priorities and interrupts to manage access to the memory controller.

Pete said...

To maybe rephrase a bit: I know you can have two channels per chip, but I think they're going with the unified approach, aka treating it as one 32bit access.

Pete said...

Also, I notice that your question is mainly targeted at the 392/168 model. :facepalm:

In this case, I think the confusion is that for the 4x1GB chips, you're saving 32 bits at a time, and for the 6x2GB chips you're saving 16 bits at a time (since the other 16bits are reserved for the other channel used by the CPU). Therefore it doesn't stall, it just uses that bandwidth to save more data.

Having said that, it does pose a dilemma in the sense that your address space is not unified per se. For the 2GB chips you would have 1GB available per channel whereas for the 1GB chips, you have 512MB available per channel. Resulting in 392GB/s when accessing the first 7GB of data (interleaved over 10chips/12channels), but drop to 168GB/s for the final 3GB.

All in all, not a preferred approach I would say.

PS. @Duoae, let me know if I'm talking nonsense.

Andy said...

I agree with your first comments, as that is most likely how it will work in practice. Even MS's recommendation alluded to switching on a cycle by cycle basis.

The problem I have is with the simultaneous access using the 392/168 GB/s model.
If CPU & GPU have to access memory simultaneously, the bandwidth of the 6x2GB chips will be split between the two.

At a glance this becomes 392GB/s for GPU & 168GB/s for CPU BUT that's not taking into account interleaved memory access. Data is spread across all 10 chips evenly, so if the GPU wants to pull, say, a 3MB asset it will have to draw it from all 10 chips: the 4x1GB (56GB/s) chips will deliver their portion of the 3MB asset, however the GPU will have to wait for the remaining 6 chips (28GB/s) to deliver the remaining bits of the 3MB asset. In practice this stall limits the GPU to 280GB/s.

Duoae said...

Hi MetalSpirit,

Sorry for the late reply. I've been a bit busy these last few days. Yes, for sure SFS will alleviate some unnecessary loading of data into RAM.

However, the actual access to and from the SSDs is quite slow and narrow in comparison to the access the APU enjoys. If you look at the specs, we can see that the compressed data rate is 4.8 GB/s guaranteed with up to 6 GB/s for data that compresses better. That's still about three whole seconds to fill the RAM (I'm averaging here and not counting the full 16 GB).

Now, the only time this is going to happen is when a game is initially loading into memory or a new level is loading. Most assets don't need such a large amount of data in such a small amount of time. Most likely, you'd see a game loading in only a portion of the potential per second.

Just doing some simple maths allows us to see that the SX's SSD can move 76.8 MB per 16 ms frame, which is effectively a rounding error on the bandwidth available to the GDDR6 (8.96 GB for the same time period when accessing the larger address space). SFS will just make this rounding error smaller and allow more effective utilisation of RAM in order to load more assets per frame in the game.
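
For anyone who wants to reproduce the per-frame numbers, here is a minimal sketch. It assumes decimal units, a 16 ms frame, the guaranteed 4.8 GB/s compressed SSD rate and the 560 GB/s peak figure; the helper name is hypothetical.

```python
FRAME_S = 0.016  # 16 ms frame, as used throughout the post


def gb_per_frame(rate_gb_s, frame_s=FRAME_S):
    """Data moved in one frame at a sustained rate, in GB."""
    return rate_gb_s * frame_s


ssd = gb_per_frame(4.8)   # compressed SSD rate
ram = gb_per_frame(560)   # GDDR6, wide address space

print(f"SSD: {ssd * 1000:.1f} MB per frame")   # SSD: 76.8 MB per frame
print(f"RAM: {ram:.2f} GB per frame")          # RAM: 8.96 GB per frame
print(f"SSD share of RAM traffic: {ssd / ram:.2%}")   # 0.86%
```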

The way I see it, the necessity to switch between the two address spaces and manage that is the big issue because, as the linked post analysing Shadowfall's memory utilisation showed, you begin to take time away from accessing certain data when you have to do that... and that hurts performance.

Regarding the cache sizes, yeah it's a lot. I agree and I mentioned that too. It will help stop the GPU from heading out to the main memory all the time... but I don't believe that most of that cache will be for either CPU or GPU, I think it may be for the custom silicon (e.g. the decompression engine, sound engine, etc). Going from my estimation last time, almost half of that SRAM figure would be for non-CPU/GPU functions... but, as you say - it's possible what you're saying is in effect.

Duoae said...

Thanks, RKO. I was good with the video though :)

Duoae said...

I think Pete replied quite thoroughly. I would say this though - data won't necessarily be stored equally across all memory modules in the address space. Memory has fixed sizes of address spaces (if my nomenclature is correct, these are called words). Each file is put into the minimum amount of words that it needs to be in. If your file is small (say 2 words) then maybe it's only stored across two modules. If it's large (say 20 words) then those are spread, somehow, across the ten modules.

As I tried to explain above, if your data being accessed doesn't need the entire array of modules to be accessed then the remaining can be used to access the other address space - presuming there's no conflict in access requirements.

Duoae said...

I think the only thing I disagree with is thinking about CPU/GPU access. The bandwidth is from the DCT/memory controllers to the GDDR6, anything behind that doesn't know what's going on in the memory array. Remember that the GPU also has access to the narrower address space too... as do other parts of the silicon (i.e. through DMA).

I actually am not really following/ understanding what you mean by saving data, though...

Andy said...

Hi duoae
"Each file is put into the minimum amount of words that it needs to be in. If your file is small, (say 2 words)"

Even a tiny 1KB (8000bits) file will take 250 32bit words... That's the nature of interleaved memory: Every single file is spread across all chips evenly. That is why the 392GB/s 168 GB/s model in practice performs the same as the 280GB/s 168GB/s model

"if your data being accessed doesn't need the entire array of modules"
I don't think this discrimination exists on interleaved memory; every piece of data on the 10GB pool is spread across all 10 chips evenly.

Andy said...

For the 1KB example: 1kb (8000 bits) is spread across 10 modules, 25 32bit words for each module (total 250 words)

The 1GB modules will deliver 25 words each (100 words total) in 1.78 nanoseconds while the 2GB modules will deliver the remaining 150 words in 3.57 nanoseconds.

The GPU will have to wait 3.57 nanoseconds to access the 250 words (8000 bits) that form the 1KB file. This is 280GB/s.
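
Andy's stall argument can be written out as a short calculation. This is only a sketch of the model as he describes it, assuming the file is striped evenly over ten chips, the 1GB chips run at 56GB/s, the 2GB chips at 28GB/s, and the read completes when the slowest chip finishes. Whether real hardware behaves this way is exactly what is being debated; the names are hypothetical.

```python
def transfer_ns(num_bytes, gb_per_s):
    """Time in nanoseconds to move num_bytes at gb_per_s (1 GB/s == 1 byte/ns)."""
    return num_bytes / gb_per_s


FILE_BYTES = 1000                  # the "1KB (8000 bits)" file
CHIPS = 10
per_chip = FILE_BYTES / CHIPS      # 100 bytes, i.e. 25 32-bit words per chip

t_fast = transfer_ns(per_chip, 56)   # 1GB chips using both channels
t_slow = transfer_ns(per_chip, 28)   # 2GB chips using one channel

# In this model the read is gated by the slowest chip.
effective_gb_s = FILE_BYTES / max(t_fast, t_slow)

print(f"{t_fast:.3f} ns, {t_slow:.3f} ns")   # 1.786 ns, 3.571 ns
print(round(effective_gb_s))                 # 280
```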

Pete said...

Yes you're right. I got confused with simultaneous access of the chip meaning simultaneous access of the memory space. They're two independent channels accessing two independent memory spaces.

Also, replace "memory saving" with "memory access".

Duoae said...

Okay, I stand corrected. Thanks for taking the time to explain to me. I guess I was thinking too abstractly.

Then I suppose the edge cases where individual controllers could access the data within will be rather small.... which brings me back to my original limitations of having to switch between the two memory pools entirely.

Pete said...

I think you should check your assumption of data being interleaved evenly.

Because the 2GB chips have 1 channel (16bits) reserved for the slow pool, only 1 channel (16bits) is used for the fast pool.

So let's say the GPU wants to access 28 bytes. These bytes will be interleaved as follows: 2 bytes per 2GB chip x6 + 4 bytes per 1GB chip x4 = 28 bytes. So in 1 cycle it'll access all 28 bytes, giving you a bandwidth of 392GB/s normalised.
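
The uneven split Pete describes works out as follows. This is a sketch under his stated assumption (one 16-bit channel of each 2GB chip plus both channels of each 1GB chip serving the fast pool) and with "cycle" read as one data transfer at 14 Gbps per pin; the layout table is purely illustrative.

```python
TRANSFERS_PER_NS = 14   # 14 Gbps per pin means 14 transfers per nanosecond

# Bytes delivered per transfer by each chip type in this hypothetical split.
layout = [
    {"chips": 6, "bytes_per_transfer": 2},   # 2GB chips, one 16-bit channel
    {"chips": 4, "bytes_per_transfer": 4},   # 1GB chips, both channels
]

bytes_per_transfer = sum(g["chips"] * g["bytes_per_transfer"] for g in layout)
bandwidth_gb_s = bytes_per_transfer * TRANSFERS_PER_NS   # bytes/ns == GB/s

print(bytes_per_transfer, bandwidth_gb_s)   # 28 392
```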

Duoae said...

But wouldn't that result in a situation whereby Microsoft couldn't claim 560 GB/s / 336 GB/s (320 / 192-bit)?

As far as I understand (and I could be wrong!) both channels to each chip have access to both memory spaces.

Pete said...

Yes. Which is why I think the 392/168 GB/s model doesn't work.

I was just trying to explain to Andy how that model *could* work.

Ger said...

Hello again Duoae, what do you think about this? I need more points of view http://disruptiveludens.com/el-cuello-de-botella-en-la-next-gen

Pete said...

Hi Duoae,

Yes you're right. Seems like I misinterpreted the Micron design guide. It's actually dual ported. _sigh_

After doing some extra reading, it seems like we might need to reconsider memory being accessed (and interleaved) over one big bus (320/192-bit). It seems like the most common unit of access will be 64-byte cache lines. This, together with making full use of the GDDR6 16x prefetch, means we're most likely not going to see interleaving on a wide scale.
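
As a rough illustration of why a 64-byte access doesn't need the whole bus: assuming a 16-bit GDDR6 channel and the 16n prefetch mentioned above, one burst on one channel already carries 32 bytes. The constant names below are only for this sketch.

```python
CHANNEL_BITS = 16       # one GDDR6 channel
PREFETCH = 16           # 16n prefetch: 16 transfers per burst
CACHE_LINE_BYTES = 64   # assumed unit of access

burst_bytes = CHANNEL_BITS * PREFETCH // 8          # bytes per burst per channel
bursts_per_line = CACHE_LINE_BYTES // burst_bytes   # bursts needed per cache line

print(burst_bytes, bursts_per_line)   # 32 2
```

So, under these assumptions, a cache line could be served by two bursts on one channel, or one burst on each of a chip's two channels, leaving the other chips free for other requests.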

So instead of single and wide access, we might see a bunch of smaller accesses happen in parallel. The end result might still be the same in terms of bandwidth throughput. It might actually be better, since in the case of CPU/GPU contention, you're not holding up the full bus. Maybe this was common knowledge, but not for me, which caused some confusion on my part.

The missing piece of the puzzle for me, and hopefully you can provide some information here, is what the bus between the GPU and memory looks like. Because in order for this parallelism to hold, the bus should actually be a group of busses, each being able to make their own request.

RKO said...

Hi, the PS5's GPU scrubbers.
They clean the GPU cache, which the Xbox doesn't have.
How much difference does that make?
How much performance does the GPU lose?

Andy said...

@pete I'm not sure I follow, would you mind explaining your reasoning? I'm far from an expert.

The way I understand it, for simultaneous CPU/GPU access the GPU will use a 16bit address space (1GB) on the 2GB modules and access the 1GB modules on a 16bit address space as well, to keep it symmetrical.

Just think about it: if you start assigning data unevenly, favouring the 1GB modules on a 2:1 ratio, you'll eventually find yourself out of space and end up with a worse outcome: 6GB at 392GB/s and 4GB at 96 bits (168GB/s). I'm not even sure this Frankenstein configuration is possible and, even if it were, it's not worth the hassle for developers because you end up with worse performance.

Andy said...

Hi Duoae, I tried to post my comment many times but it won't go through.

Unknown guy said...

Hi Duoae :

There is something I don't understand in your post.

I have read Lady Gaia's comment again and here is what it says:

"The CPU and GPU share the same bus to the same pool of RAM on both the Series X and PS5. There's no way around that. Only when the CPU is doing literally nothing can the GPU utilize the full bandwidth on either one (except it won't have any work to do because the CPU is what queues up work, so that's quite literally never going to happen.)

As I stated above, if you assume the CPU needs the same 48GB/s on both, then you have 400GB/s remaining on the PS5 or about 39.1GB/s/TF for the 10.23TF PS5 (which is a pretty meaningless measure but it's what you're considering here.) On the Series X you have 480GB/s left or about 39.7GB/s/TF for the 12.1 TF GPU. They're in pretty much the same shape at these rates.

The reason why 48GB/s of bus traffic for the CPU on Series X costs you 80GB/s of the theoretical peak GPU bandwidth is because it's tying up the whole 320-bit bus to transmit only 192 bits of data per cycle from the slower portion of RAM we've been told will be typically used by the CPU. 48GB / 192 bits * 320 bits = 80GB of effective bandwidth used to make that 48GB/s available. This is because only six of the ten RAM chips can contribute to that additional 6GB over and above the 10GB of RAM that can be accessed more quickly when all ten are used in parallel."
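
The quoted calculation can be reproduced directly. The sketch below just restates Lady Gaia's assumptions (48GB/s of CPU traffic served only from the 192-bit portion, tying up the whole 320-bit bus while it does so) and the TF figures from the quote; the function name is made up here.

```python
def effective_cost(cpu_gb_s, narrow_bits=192, full_bits=320):
    """Full-bus bandwidth tied up while cpu_gb_s is served over the narrow bus."""
    return cpu_gb_s / narrow_bits * full_bits


CPU_GB_S = 48   # assumed CPU demand from the quote

xsx_left = 560 - effective_cost(CPU_GB_S)
ps5_left = 448 - CPU_GB_S   # symmetric bus: cost equals demand

print(effective_cost(CPU_GB_S))                # 80.0
print(xsx_left, round(xsx_left / 12.1, 1))     # 480.0 39.7
print(ps5_left, round(ps5_left / 10.23, 1))    # 400 39.1
```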

It says the result for SX bandwidth is 480GB/s and not 512 GB/s as mentioned in your article (even if the CPU bandwidth of 48 GB/s has not been confirmed).

Could you please clarify this point ?

Thanks :)

Duoae said...

Hey Andy,

Sorry about that. It seems that with the move to the new "blogger" many comments are or were getting marked for moderation but, using the old interface, I wasn't seeing them. I switched to the new, less functional interface the other day and started going through everything and found many comments dating back to April!

Quite annoying! Worse still, there's no indication as to why a comment is marked for moderation...

Duoae said...

Hi Unknown guy,

For me it's pointless assigning a bandwidth per second to the CPU/GPU separately because every game is going to have different requirements and because it's certain that some operations do not require either processor block to perform.

It's also a bit pointless because the CPU and GPU don't *only* access one or other pool of RAM.

Combining that knowledge, the figure of 80GB/s assumes a 50:50 split per second between access to both portions of RAM on the SX. That ignores any simultaneous access to either portion of RAM for both CPU/GPU.

For example, the CPU needs to instruct that a sound is played and must retrieve it from the smaller pool of RAM within a frame's worth of time (I've been using a standard 16ms). The GPU needs to access some data from the same pool. They schedule it at the same time and the GPU loses no bandwidth.

So, you see how it's not that simple? That's why the graph is based on access per frame time. You can easily see how access is affected by each component's requirements. It's unlikely to ever be a flat number.
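
A minimal sketch of the per-frame view behind the graph, assuming the wide/narrow model in which the bus serves only one address space at a time (560GB/s for the 10 GB portion, 336GB/s for the 6 GB portion) and a 16 ms frame. The function and parameter names are hypothetical.

```python
FRAME_S = 0.016
FAST_GB_S, SLOW_GB_S = 560, 336


def gb_per_frame(slow_fraction):
    """Data moved in one frame if slow_fraction of the time goes to the 6 GB pool."""
    fast = (1 - slow_fraction) * FAST_GB_S * FRAME_S
    slow = slow_fraction * SLOW_GB_S * FRAME_S
    return fast, slow


for pct in (0, 10, 25, 50):
    fast, slow = gb_per_frame(pct / 100)
    print(f"{pct:>3}% slow: {fast:.2f} GB fast + {slow:.2f} GB slow = {fast + slow:.2f} GB")
```

The split is a property of each frame's access pattern rather than a fixed number, which is why a single GB/s figure per processor is hard to pin down.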

For another example - in Mark Cerny's talk (or maybe the Eurogamer interview afterward?) when speaking about SmartShift he mentioned that maybe the CPU finishes its frame computations early and the APU can divert more energy to the GPU, allowing it to increase in frequency.

Duoae said...

Hi Ger,

Sorry for the late reply. The new blogger system has been eating my comments and I didn't notice.

I read Urian's piece but I think there are two main areas where I either disagree or think he is assuming too much.

1) I think he assumes that PS5 will have a static amount of RAM dedicated to the GPU. This may not be the case. I think the only thing we can guarantee is a static portion devoted to the OS - like in the desktop environment he mentions. Personally, I think Sony would be a little crazy to define a limit on graphics or non-graphics access when there is none on the SX. The SX has an artificial boundary on its RAM segmentation but not on access - that's something that commentators on the internet have said (which is what I've pointed out many times).

2) His bytes/flops calculation is incorrect, in my opinion, because he only considers one part of the memory architecture. We know for a fact that the GPU cannot access that full 560 GB/s bandwidth all the time (though many of us disagree on exactly by how much that is).

Yes, PS5 will also share its bandwidth but there'll be no point in time where it will dip below 448 GB/s total. As discussed in the comments on this blog: Whenever the SX has to access the narrower memory pool, the 10 GB of RAM is essentially "not there". How much and how severe that is will differ from game to game but it will be present.

I'm pretty sure I did some sort of similar calculation and saw that the SX had a lower ratio...

Duoae said...

Yeah, at this point in time, we don't know about the memory bus configuration and I doubt we'll find out until after the consoles release.

I keep flip flopping on whether the wide/narrow or the 64-bit multi narrow design is correct, as you probably saw in our previous discussions. However, I think, based on Microsoft's explanation as to the split and memory bandwidth, that wide/narrow is correct and that individual domains within the memory pools wouldn't make any sense despite being allowed by the spec.

E.g. if you spread a 200 MB file across 10 chips (320 bit), the read performance is much better than across 2 chips (64 bit)...
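
To put rough numbers on that example, here is a small sketch assuming 56GB/s per chip, even striping and decimal units; the helper is hypothetical and ignores any real-world overheads.

```python
def read_time_ms(size_mb, chips, gb_s_per_chip=56):
    """Milliseconds to read size_mb striped evenly over `chips` chips."""
    bandwidth_gb_s = chips * gb_s_per_chip
    return size_mb / bandwidth_gb_s   # MB / (GB/s) == ms in decimal units


print(round(read_time_ms(200, 10), 3))   # 0.357
print(round(read_time_ms(200, 2), 3))    # 1.786
```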

I think the last time I spoke on this subject I had been convinced that for larger real world files (not the theoretical small files I was imagining) small sub-domains with individual bus access make no sense. It needs to be a unified bus like on graphics cards and in system memory.

Duoae said...


I see the scrubber as semi-analogous to SFS. SFS is front loaded (though I'm a bit uncertain as to how it pre-emptively culls unwanted data) whereas the scrubber in PS5 is there to deal with things once they're in RAM.

I don't think either system will suffer in this aspect.