Zen 5 already has extremely impressive memory bandwidth, but it is overcommitted. A 192 core EPYC overcommits its memory bandwidth roughly twice as heavily as a 16 core AM5 part, and don't forget that both of those CPU complexes can run twice as many threads as they have cores. Even if they double RAM bandwidth, they are going to need a ton more lanes and a ton more cache because the latency will be wild. Now, imagine a world in which each chiplet has 32GB of RAM directly attached at a fraction of a contemporary memory bus latency. Performance per core would go up dramatically. We would have fewer CPU cores, each burning through memory faster. What's more, we would get a lower pin count socket for some cost savings.
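As a rough sanity check on that "twice as overcommitted" figure, here is the back-of-the-envelope math, assuming a 12-channel DDR5-6000 config for the EPYC and a dual-channel DDR5-6000 config for the 16 core AM5 part (those memory configs are my assumptions, not quoted from anywhere):

```python
# Back-of-the-envelope per-core memory bandwidth (assumed configs, not official specs).
CHANNEL_BW_GBS = 6000 * 8 / 1000  # DDR5-6000, 64-bit channel -> ~48 GB/s per channel

epyc_cores, epyc_channels = 192, 12   # assumed 12-channel server part
am5_cores, am5_channels = 16, 2       # assumed dual-channel 16-core desktop part

epyc_per_core = epyc_channels * CHANNEL_BW_GBS / epyc_cores  # ~3 GB/s per core
am5_per_core = am5_channels * CHANNEL_BW_GBS / am5_cores     # ~6 GB/s per core

print(f"EPYC per-core bandwidth: {epyc_per_core:.1f} GB/s")
print(f"AM5 per-core bandwidth:  {am5_per_core:.1f} GB/s")
print(f"AM5 has {am5_per_core / epyc_per_core:.1f}x the bandwidth per core")
```

Under those assumptions the big EPYC ends up with roughly half the per-core bandwidth of the desktop part, which is where the 2x figure comes from.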
According to a couple of large retailers, AMD motherboards are outselling Intel motherboards 10 to 1. Wow. Intel is a dumpster fire. They are in an existential crisis.
The I/O Die is from Zen 4 and should have been updated for Zen 5 to meet the increasing needs of RAM speed and capacity. It just signals to people that Zen 5 is a stepping stone from Zen 4 to Zen 6. For most PC users, the current offerings are 100% fine. For people who lament the sanely priced HEDT era (like me), what is being offered is gimped I/O and RAM capacity to push the HEDT crowd into (now) obscenely priced TR setups. Current high-end AM5 boards are insanely priced and are a balancing act of trade-offs with the limited lanes. There is nothing that an X870E has over X670E other than new Wi-Fi and mandated USB 4. In fact, some boards are regressions in expandability due to the USB 4 and limited PCIe lanes. Now, what you propose sounds nice, and if there is a push for integrating RAM onto the CPU package with a forward-thinking amount and speed, I would absolutely be fine with that. The problem is that, most of the time, integration is code for planned obsolescence and e-waste. I do not doubt that at all.
This is not an effective strategy. To a small extent, you can mitigate latency with cache, and all CPUs do some of that, but what happens when the CPU branches and needs to start fetching a bunch of new cache lines? Suddenly you have a cache miss and we're stalled for a bunch of cycles while memory is loaded into the CPU. The higher the clock speed, the more important latency becomes. In one cycle at 5GHz, an electrical signal can only travel roughly 4cm. To lower latency, distance must be reduced.
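As a quick check on that 4cm figure, here is the arithmetic, assuming signal propagation at roughly 0.65c in copper traces (the propagation factor is an assumption; in a vacuum it would be closer to 6cm):

```python
# Rough distance an electrical signal covers in one core clock cycle.
C = 3.0e8            # speed of light, m/s
CLOCK_HZ = 5.0e9     # 5 GHz core clock
PROP_FACTOR = 0.65   # assumed propagation velocity as a fraction of c

cycle_time_s = 1 / CLOCK_HZ                   # 0.2 ns per cycle
distance_m = C * PROP_FACTOR * cycle_time_s   # ~0.039 m

print(f"One 5GHz cycle: {cycle_time_s * 1e9:.2f} ns, "
      f"signal travels ~{distance_m * 100:.1f} cm")
```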
I do not foresee RAM being integrated anytime soon. If evolving the I/O Die is not the solution due to latency, then what is?
This is because legions of folks will declare it a bad idea without understanding it or even testing it. Sooner or later, someone will notice that making the numbers higher is not helping outside the marketing department. We are approaching a limit of diminishing returns. I'm not saying we're there, just that it is on the medium-term horizon.

The Fujitsu A64FX is at the core of a couple of the most powerful supercomputers in the world, and it has a surprisingly low ARM core count with 32GB of on-package HBM2 RAM per part. The A64FX was the highest performing CPU when it was released, and the next iteration of the package is expected to restore that crown. That makes more sense than continuing what we are doing.

Show me an application that randomly hits more than 64GB hard. Even database engines don't do that. I can't think of one that hits more than 4GB hard, including Postgres. Apps tend to load data in big chunks and blow through it mostly sequentially. I know you're racking your brain right now to find a one-in-a-million exception that proves me wrong. No need for that.

PCIe 3 was really fast. PCIe 4 has twice the bandwidth, but with variable aperture DMA it's way more efficient than PCIe 3, putting it more in the range of 3~4x faster. PCIe 5 doubles the bandwidth again, to roughly 4GB/s per lane. My M.2 SSD is PCIe 3. It's a WD Black from a few years ago and it's really fast. An SSD 8x faster than mine would be really, really, really fast. Don't forget, M.2 is a connector with 4 lanes of PCIe.

We are near the point where it would make a lot of sense to place 32 or 64GB of RAM on the package and scale above that with a paging swap partition. Latency would go down a lot, the vast majority of the time. Sure, the SSD is going to have high latency by RAM standards, but don't forget that DDR5 column access latencies are over 30 cycles right now, and some modules are well into the 40s. That's quite a bit.

We could even leave the RAM slots as a massive page swap area and add the on-package RAM as a massive L4 cache. I suspect at that point it would make a lot of sense to delete the L3 cache and use the RAM chiplet as L3. So if someone does pursue this architecture in the future, they will be able to load a 2TB database into RAM and directly access any cell, exactly like we can do right now with the current architecture. There is no need to panic over not being able to stuff cheap Chinese RAM into expansion slots.
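For reference, a minimal sketch of the bandwidth and latency figures being compared here, using the standard per-lane PCIe rates after encoding overhead and a hypothetical DDR5-6000 CL40 module (the specific module is my assumption, not something from the thread):

```python
# Approximate per-lane PCIe throughput by generation (GB/s, after encoding overhead).
PCIE_GBS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}

for gen, per_lane in PCIE_GBS_PER_LANE.items():
    m2_x4 = per_lane * 4  # M.2 is a 4-lane connector
    print(f"PCIe {gen}: ~{per_lane:.1f} GB/s per lane, ~{m2_x4:.1f} GB/s over an M.2 x4 link")

# DDR5 CAS latency in wall-clock time, assuming a hypothetical DDR5-6000 CL40 module.
transfer_rate_mt = 6000
cas_cycles = 40
io_clock_mhz = transfer_rate_mt / 2          # DDR: two transfers per I/O clock
cas_ns = cas_cycles / io_clock_mhz * 1000    # ~13.3 ns for the column access alone
print(f"DDR5-6000 CL{cas_cycles}: ~{cas_ns:.1f} ns CAS latency")
```

The point of the comparison: even a "fast" DDR5 column access already costs dozens of core cycles at 5GHz, which is exactly the gap that on-package RAM is meant to shrink.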