23 October 2019

Analyse This: The Next Gen Consoles...

DISCLAIMER:  This is NOT an official slide from AMD or anyone else!!!! 

I've been following the drip feed of leaks and official releases from Nvidia, Intel, AMD, SONY and Microsoft about their respective expected technologies for many years. However, I recently began noticing a trend regarding the potential technologies in the next XBOX and the PS5 that, when I looked deeper, seemed to make some sort of sense despite appearing crazy at a surface level. But, let me lay everything out for you to decide for yourselves...

Now, many people are reading the Wired article as saying that the PS5 will feature some sort of NVMe drive integrated onto the motherboard or somesuch, but I think this is misguided: it would make every console a ticking time bomb, waiting for the point where the solid state memory fails... and NVMe is no more specialised than a SATA SSD in terms of technology. Both of those interfaces are well understood and nobody has been able to point to a hard drive interface that could meet the term "more specialised". There's just no indication from any storage technology provider that a huge jump in solid state performance is around the corner, otherwise we'd be hearing about consumer-orientated devices as well.

Now, here's where it gets interesting and where the theory could fall apart. The codename Prospero hints at the EPYC Milan CPU being used in the PS5. Whilst 7nm Zen 3 is slated as being in use for EPYC Milan, in that same article the EPYC Milan design is stated as being complete and is slated to ship in 2020 - putting it in the same timeframe as the PS5/XBOX Scarlett. Saying that, EPYC Milan is Gen 3 of the EPYC processor line, not third generation "Ryzen".

Ryzen generations, as defined by AMD.

However, there could have been a bit of miscommunication or purposeful obfuscation on Mr. Cerny's part here. Technically, the AMD series has been "Zen, Zen+, Zen 2 and then Zen 3"... in some readings, a third generation Ryzen line would be "Zen 2" (3rd gen Ryzen, 2nd gen EPYC) but in others it could be "Zen 3" - which would power EPYC Milan and Ryzen 4000 series chips. In fact, in the follow-up article on Wired, the specifics are removed and only Ryzen/Navi are mentioned.

To be fair, the distinction between Ryzen and EPYC/Threadripper can be one of semantics because the main differences between those tiers of chips are the way the core technologies are linked together and the security features, I/O and instruction set support. The base CPU complex (CCX) or core layout is identical between each Ryzen, EPYC and Threadripper within a Zen generation. For example, Zen 2 had 2x 4-core CCXs on each 8-core chiplet, with a separate 16MB L3 cache per 4 cores. However, Zen 3 is said to have a unified 32MB L3 cache shared between all 8 cores.

This is actually an official slide from the press pack. You can see the evolution between Zen 2 and Zen 3 core philosophy.

Further to that, Anandtech had confirmation from AMD that no Ryzen Zen 2 (Matisse) Accelerated Processing Unit (APU) would be released and that any potential future APU would use a different design. At the current time, AMD's Raven Ridge (Zen architecture, 14 nm process) is several design iterations behind the state of the art from AMD but did give the processor and iGPU shared access to DDR4 (not GDDR5), whereas the PS4 and Xbox One X both utilise GDDR5 linked to their semi-custom APUs. However, rumours of AMD's Tarnhelm APU design (which never saw the light of day) described an APU combined with die-mounted HBM memory. While HBM is quite expensive, it is incredibly power efficient and also has a speed advantage over DDR and GDDR due to its bandwidth and proximity to the CPU/GPU, despite having to work through an interposer which adds latency.

This cancelled Tarnhelm APU could have been the groundwork to achieving a potential EPYC-designed APU with integrated HBM.

One aspect that I think supports the possibility of an EPYC-like chip being utilised in the next gen consoles is the required size of any potential solution. No consumer Ryzen APU parts have utilised more than 4 cores (8 threads). That spans the Ryzen 5 2400G, 3400G and the V1807B integrated part. Plus, all of those parts only had 11 Vega compute units integrated into their chip design, with die/chip sizes around 210 mm^2 using a manufacturing node of 12 nm - smaller than any of the existing current gen consoles (smallest is Xbox One S at 240 mm^2), which utilise smaller Jaguar CPU cores and graphical compute units. The largest (and most powerful), the XBOX One X, has a die size of 359 mm^2 using a manufacturing node of 16 nm.

Ryzen chips (as they are currently envisioned) do not have a large enough size to accommodate 8 cores and any significant amount of Navi CUs. For a comparison, Eurogamer performed a speculative test to emulate what might be expected in a next gen console based on the limited information we currently have of their capabilities. This paired a downclocked Ryzen 7 3700X with an RX 5700 and saw a 100% performance increase in a couple of carefully selected games. However, the die size of the RX 5700 is around 251 mm^2 and the Ryzen 7 2700X is 213 mm^2 (I wasn't able to accurately source the die size of the Ryzen 7 3700X but it is estimated at around 204 mm^2. However, the Ryzen 7 1700X was stated to have the same die size as the 2700X).

As an interesting aside, I found another source stating a consistent 192 mm^2 area for the Ryzen 7 dies but the difference is rather small. I'm just mentioning it here for completeness.

This puts a combined die/chip size of around 464 mm^2, assuming a one-to-one addition (it may not be as simple as that due to the requirement of laying down added infinity fabric to link the two functions). This is basically a doubling of the current Ryzen 8 core die size and isn't possible on a Ryzen platform, which could only theoretically accept a further 80 mm^2 die on the package. That may be why there were no Ryzen 3000 APUs planned: the number of CUs you could fit onto the chip would be less than the 11 CUs featured on the available APUs, so there just wouldn't be the required performance uplift over the currently available 2400G using four Zen cores. To do better, you'd need to develop a custom socket and/or chip.
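The one-to-one addition above can be sketched in a few lines of Python. The figures are the estimates quoted in this post, the variable and function names are purely my own, and the 80 mm^2 headroom is the rough figure mentioned above rather than anything from an AMD specification.

```python
# Die-area estimates (mm^2) quoted in the post; all are approximate.
RX_5700_DIE = 251.0        # GPU die used in the Eurogamer test
RYZEN_7_2700X_DIE = 213.0  # 8-core die, used here as a stand-in for the CPU
PACKAGE_HEADROOM = 80.0    # extra die area the post assumes a Ryzen package could accept


def combined_die_area(cpu_mm2: float, gpu_mm2: float) -> float:
    """Naive one-to-one addition; ignores any extra fabric or shared I/O."""
    return cpu_mm2 + gpu_mm2


combined = combined_die_area(RYZEN_7_2700X_DIE, RX_5700_DIE)
ratio_to_cpu_die = combined / RYZEN_7_2700X_DIE
gpu_fits_in_headroom = RX_5700_DIE <= PACKAGE_HEADROOM

print(f"Combined die area: {combined:.0f} mm^2")
print(f"Ratio to the 8-core CPU die alone: {ratio_to_cpu_die:.2f}x")
print(f"GPU die fits in the assumed package headroom: {gpu_fits_in_headroom}")
```

Swapping in the 192 mm^2 or 204 mm^2 CPU estimates changes the ratio slightly but not the conclusion: the GPU die alone is roughly three times the assumed headroom.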

It just wouldn't be worth the design effort or tying up the manufacturing capacity in order to put such a product onto the market. Given that TSMC (the current provider of AMD's 7 nm parts) is production constrained and has a 6-month lead time, whereas the Ryzen 5 2400G can be manufactured through GlobalFoundries without issue, it would make no sense to have a low value, low volume part clogging up production for the more valuable AMD product lines.

The Chinese-only, never released, Subor Z+.

For comparison, Threadripper die sizes (16 cores, 14 nm) are around 426 mm^2 and don't include any graphics processing silicon but those use a much larger socket and chip design to achieve that density of dies. However, there is a more recent integrated Zen+ design that was an APU - the unreleased Subor Z+ console. Although the console never made it into consumer hands, there were several models sent for review to western tech outlets and the die size was estimated to be 398 mm^2. The Subor Z+ had 24 Vega CUs combined with a 4/8 core/thread Zen design - basically an upscaled variation on the official AMD APUs with a larger area dedicated to the graphics processor.

This die size lines up very nicely with the slightly smaller current console generation dies which feature fewer CUs and more (but smaller) CPU cores. This console also had an SSD drive for launching the system and a mechanical HDD for storage - potentially similar to what would be expected for the XBOX Scarlett and PS5.

It was observed that the Ryzen APUs did not outperform the Subor Z+, which essentially matched a Kaby Lake G part that operated at a higher frequency (but which was generationally older) if you ignore the effect that HBM had on the benchmarks for the Intel APU. The Subor Z+ was also benchmarked as performing somewhere between an RX 580 and a GTX 1060. These statistics are all well below what is expected (and has been rumoured) from the next generation of consoles and what is possible on current desktop parts. I summarised the Cinebench results from the two Eurogamer articles from Digital Foundry in the table below:

From the Digital Foundry articles at Eurogamer

Going back to that rumour about the EPYC Milan variant with 15 tiles (and looking back up to the EPYC chip design in the image I posted above in the article), while WCCFTech seem to be under the impression that the extra tiles would be HBM due to the limitations of including DDR4 on the chipset, there doesn't appear to be any reason why the only other silicon would be HBM. What is interesting about this is that Kaby Lake G (mentioned above) was actually a collaboration between Intel and AMD and had a chip with separate dies (chiplets) for processor, graphics and memory on the same integrated chip. This is similar to the current concept for the Zen 2 and Zen 3 designs from AMD where multiple chiplets at varying manufacturing process nodes can be combined on a single chip.

The only information I was able to find put the size of the Kaby Lake G chip at 1872 mm^2, with the CPU die estimated at around 125 mm^2 and the Vega graphics die at 208 mm^2, in a package that operated with a 100 W TDP - not so unusual for a console-type of power draw. If the graphical representation, below, is to scale then that puts the 4GB HBM module at around 90 mm^2.

A Kaby Lake G representation from Intel.

This estimated size matches relatively well with the reported die footprint for an HBM gen 2 stack of 91.99 mm^2, manufactured on a 21 nm process node. It is said, anecdotally, that AMD's GPUs are quite bandwidth-limited and their performance can be increased through pairing with the relatively expensive HBM. This could be the reason for the pairing in the Kaby Lake G... or it could simply be a size and power optimisation, with 1 GB GDDR5 (4x256 MB dies) taking up 672 mm^2 of chip space compared to that single stack footprint of 92 mm^2. Either way, there is precedent for inclusion of HBM on an AMD chip in CPU datacentre applications and for graphics applications at the mid and lower end of the segment.

There is another precedent for a locally available fast cache of RAM in the console space as well - the Xbox 360. The 360 had a 10 MB block of eDRAM which was used to perform memory intensive operations which would benefit from increased bandwidth (256 GB/s compared to the 22.4 GB/s of bandwidth to the 512 MB GDDR3 main system memory). Whilst it is arguable whether the 360 really benefitted from this unusual architecture or not, the concept might be applied to the next generation of consoles where having fast loading of game assets is a number one priority. If we were going for a usable number, 8 or 16 GB of HBM on-chip memory would be sufficient to have a greatly improved loading experience if coupled with 8 GB of GDDR6 (as seen in current AMD graphics cards) for a total of 16 or 24 GB shared system memory.

This would allow for a significant portion of game assets to rest in the incredibly fast access of the HBM and the not so terrible access of the main system RAM, resulting in vastly decreased loading times and streaming times. This combination could also provide a "best of both worlds" scenario whereby high immediate bandwidth is balanced with slower but intermediate (to the SSD/HDD) working memory:

A comparison of the bandwidth available to each type of RAM (via anandtech and SK Hynix)

GDDR6 has an interface width of 32-bit* per chip, maxing out with a bandwidth of 448 GB/s for 8x 1 GB chips @ 14 Gbps per pin across a 256-bit interface. HBM2 has a memory interface width of 1024-bit per stack, maxing out with a bandwidth of 1024 GB/s for 4x 4 GB stacks @ 2 Gbps per pin across a 4096-bit interface. In theory, fewer stacks of more dense memory would actually result in "worse" performance: spreading the memory over more stacks (8x 1 GB HBM2 stacks @ 2 Gbps per pin with an 8192-bit interface) could reach a bandwidth of 2048 GB/s but, at the same time, would result in increased heat production. Ultimately, implementation of HBM is a trade-off as the specification allows up to a stack of 12x 1 GB RAM layers, though no one is manufacturing more than 8.
*The spec says 16-bit per chip but this is actually a 32-bit width split in two, so for the sake of the calculation of the available bandwidth, 32-bit works fine.

Conversely, NVMe SSDs can only muster theoretical bandwidths of 32 Gbps / 3.9 GB/s. However, those are peak read speeds and generally, over an extended period of activity, real world transfer speeds reduce somewhat. On the "write" side of things, NVMe drives are around 2.0 GB/s... and this is for a Samsung 970 EVO - one of the higher-end devices. This is why I think that the "custom solution" hinted at by PS5 architect Mark Cerny cannot be purely an SSD with whatever interface. The performance of the NAND flash is just not there.
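For anyone who wants to check the bandwidth figures, here is a minimal Python sketch of the arithmetic: peak bandwidth is the bus width in bits multiplied by the per-pin data rate, divided by 8 to get bytes. It treats the quoted speed figures as effective per-pin data rates, which is an assumption on my part, and the function name is made up for illustration.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s: bus width (bits) x per-pin rate (Gbit/s) / 8."""
    return bus_width_bits * data_rate_gbps / 8


# Configurations discussed above (per-pin rates are assumed to be effective rates).
gddr6 = peak_bandwidth_gb_s(8 * 32, 14.0)           # 8 chips x 32-bit
hbm2_4_stacks = peak_bandwidth_gb_s(4 * 1024, 2.0)  # 4 stacks x 1024-bit
hbm2_8_stacks = peak_bandwidth_gb_s(8 * 1024, 2.0)  # 8 stacks x 1024-bit

NVME_PEAK_READ_GB_S = 3.9  # theoretical peak read figure quoted above

for label, bandwidth in [
    ("GDDR6, 256-bit", gddr6),
    ("HBM2, 4096-bit", hbm2_4_stacks),
    ("HBM2, 8192-bit", hbm2_8_stacks),
]:
    ratio = bandwidth / NVME_PEAK_READ_GB_S
    print(f"{label}: {bandwidth:.0f} GB/s (~{ratio:.0f}x the NVMe peak read)")
```

Even the slowest RAM configuration here is two orders of magnitude ahead of the NVMe peak read figure, which is the gap the paragraph above is pointing at.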

The other aspect of this is that games and programmes benefit from being stored in the system RAM instead of being swapped back onto the hard drive in a page file. At the moment, on PC, system RAM of 8 GB is a minimum and, ideally, you'd be looking to have 16 GB dual channel RAM. This means that it is unlikely that any next gen console would be sporting 8 GB total system RAM as this quantity is already currently found in the Playstation 4/Pro and XBOX One S while the XBOX One X has 12 GB system RAM. Looking specifically at the One X, this all but guarantees that 16 GB system RAM will be the minimum in the next gen systems.

Let's add these things all up and see what we get in terms of die size and therefore chip size requirements:
  • 8 Zen 3 (Milan) cores - 81 mm^2
  • I/O - 130 mm^2
  • HBM2E (2x 4GB stacks, 2048-bit @ 2.4 Gbps per pin giving 614 GB/s bandwidth) - 2x 92 mm^2
  • RX 5700 (36 Navi CUs) - 251 mm^2
  • AMD secure processor (ARM A5 CPU) - 0.3 mm^2
Total die size = 649 mm^2
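The tally above is easy to reproduce. This is just a sketch using the figures from the list, with dictionary keys of my own choosing. A straight sum of the figures as listed comes out a few mm^2 under the quoted total, which is well inside the uncertainty of these estimates.

```python
# Die-area estimates (mm^2) as listed above; every figure is approximate.
components_mm2 = {
    "8 Zen 3 cores": 81.0,
    "I/O": 130.0,
    "HBM2E (2 stacks)": 2 * 92.0,
    "RX 5700 (36 Navi CUs)": 251.0,
    "AMD secure processor": 0.3,
}

total_mm2 = sum(components_mm2.values())

# Print each component with its share of the summed area.
for name, area in components_mm2.items():
    share = 100 * area / total_mm2
    print(f"{name:<24} {area:7.1f} mm^2  ({share:4.1f}%)")
print(f"{'Total':<24} {total_mm2:7.1f} mm^2")
```

The graphics die dominates the budget, followed by the two HBM2E stacks, so those are the two estimates that matter most if any of these numbers turn out to be off.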

This doesn't include the 3D audio solution or any hardware raytracing (RT), which we know is present based on the Wired articles. However, it IS below even the chip size of Kaby Lake G (1872 mm^2), though it's probably not possible to fit into that package given that the Kaby Lake G chiplets only total approximately 425 mm^2 of die, and it is well below what is possible on an EPYC/Threadripper 4411 mm^2 chip (1008 mm^2 die size).

The EPYC die size is more than large enough to accommodate the Zen 3 cores, I/O, 2x HBM2E, AMD secure, 3D sound and 36 Navi CUs. That's around "7" chiplets. However, those chiplets are not all a common size: the Ryzen I/O chiplet can fit into the EPYC I/O 3 times, and the AMD secure processor (Cortex-A5) is tiny in comparison with the rest of the silicon on offer. Given how "expensive" in terms of area the RTX raytracing cores are, I wouldn't be surprised if a large portion of the remaining chip space is related to implementing that in hardware. Reports put the die "cost" of raytracing on RTX cards at 1/3... though some people dispute that, with their own, inaccurate, calculations putting it at around 24% - not so different from 33% and in fact well within the margin of error.

If we take "overall die size" on an EPYC chip in comparison with what we have, we can even make a more educated guess as to the total die area dedicated to each item which might fit with a "15 tiles" design. The RTX 2070 has a die area of 445 mm^2. Taking 20% of that for RT capabilities gives us 89 mm^2 - similar to the 81 mm^2 chiplet size of 8 Zen 3 cores and associated silicon.

Going on that assumption, in terms of cost of implementation (especially for a console product) we're looking at an available extra 359 mm^2 die space on an EPYC-sized integrated part. That gives the possibility of exactly 4x RT cores being implemented (This is just a massive coincidence!). However, this doesn't leave any room for the 3D audio, which is unlikely to be an off-chip implementation. So let's say there's only 2x 89 mm^2 for the RT cores (achieving greater than RTX 2080 levels of RT), leaving us with 181 mm^2 silicon die remaining for ancillary items. That's more than enough...
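The area budget above can be written out as a short Python sketch. It uses the post's own estimates (the 1008 mm^2 EPYC die total, the 649 mm^2 SoC estimate and the 20% RT share of an RTX 2070 die). The names are mine, and treating an "RT block" as a fixed 20% slice of that die is the same rough assumption the text makes.

```python
# Estimates taken from the post (mm^2); all are rough.
EPYC_TOTAL_DIE_MM2 = 1008.0
ESTIMATED_SOC_MM2 = 649.0
RTX_2070_DIE_MM2 = 445.0
RT_SHARE = 0.20  # assumed fraction of the RTX 2070 die spent on raytracing

rt_block_mm2 = RTX_2070_DIE_MM2 * RT_SHARE
spare_mm2 = EPYC_TOTAL_DIE_MM2 - ESTIMATED_SOC_MM2
max_rt_blocks = int(spare_mm2 // rt_block_mm2)


def remaining_after(rt_blocks: int) -> float:
    """Spare die area (mm^2) left once the given number of RT blocks is placed."""
    return spare_mm2 - rt_blocks * rt_block_mm2


print(f"One RT block: {rt_block_mm2:.0f} mm^2")
print(f"Spare area on an EPYC-sized part: {spare_mm2:.0f} mm^2")
print(f"RT blocks that would fit: {max_rt_blocks}")
print(f"Left over with 2 RT blocks: {remaining_after(2):.0f} mm^2")
```

Changing `RT_SHARE` to 0.24 or 0.33 shows how sensitive the block count is to the disputed raytracing "cost" figure.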

The supposed PS5 developer kit leaked from the Instituto Nacional da Propriedade Industrial in Brazil. Supposedly confirmed by Gizmodo.

However, this unconventional part must match something that is already existing in terms of Zen 2 and EPYC.... Looking at the integrated EPYC product stack, we can see that there is already the 3201 with an 8/8 core/thread split with a base clock of 1.5 GHz and a boost clock of 3.1 GHz - very close to the "leaked" Gonzalo CPU. There also exists the 3251 which boasts an 8/16 core/thread split with a base clock of 2.5 GHz and a boost clock of 3.1 GHz - which is basically identical except for the TDP which is 30 and 50 W respectively. It seems to me that there's very little reason why the 8/16 core/thread part could not operate at a base frequency of 1.5 GHz instead of 2.5 whilst using SMT.

These factors squarely put any 13 TFLOP beast in the chip area range of the EPYC processors, even on 7 nm or 7++ nm manufacturing process nodes. These SoCs are going to be very large and relatively expensive to produce. Though their predicted scale of production (based on this current console generation) should reduce their bulk cost somewhat, it does raise the question of whether both SONY and Microsoft will be losing a lot of money on each console... For reference, an 8 core EPYC 7232P processor cost $450 in August 2019 and had a TDP of 120 W, though the EPYC integrated 3251 processor cost only $315 in 2018 and had a TDP of 50 W.

I may be crazy but that doesn't seem outside of the range of possibility for next gen consoles....
