High Bandwidth Memory: Wide & Slow Makes It Fast

Architecturally, the single most notable addition to AMD’s collection of technologies for Fiji is High Bandwidth Memory (HBM). HBM is a next-generation memory standard that will ultimately come to many (if not all) GPUs as the successor to GDDR5. HBM promises a significant increase in memory bandwidth through the use of an ultra-wide, relatively low-clocked memory bus, with die-stacked DRAM used to efficiently place the many DRAM dies needed to drive that wide bus.

As part of their pre-Fury X launch activities, AMD briefed the press on HBM back in May, offering virtually every detail one could want on HBM, how it worked, and the benefits of the technology. So for today’s launch there’s relatively little that’s new to say on the subject, but I wanted to quickly recap what we have seen so far.

After several years of GDDR5 – first used on the Radeon HD 4870 in 2008 – HBM arrives at a time when GDDR5 is reaching its limits, and companies have been working on its successors. As awesome as GDDR5 is (and it delivers quite a bit of memory bandwidth compared to just about anything else), it is already a bit of a power hog and rather complex to implement. GDDR5’s immediate successors would deliver more bandwidth, but they would also exacerbate these problems by drawing even more power and introducing all of the complexity inherent in differential I/O.

So to succeed GDDR5, AMD, Hynix, and the JEDEC as a whole have taken a very different path. Rather than attempting to push a very high bandwidth, narrow(ish) memory bus standard even higher, they have opted to go in the opposite direction with HBM. HBM would significantly back off of the clockspeeds used, but in return it would go wider than GDDR5. Much, much wider.

The ultimate result is a very wide memory bus running at a relatively low frequency. For Fiji, AMD has implemented a 4096-bit memory bus running at an effective 1Gbps per pin (500MHz DDR). The use of such a wide bus more than offsets the reduction in clockspeed, allowing the R9 Fury X to deliver 512GB/sec of memory bandwidth, 60% more than the 320GB/sec of the R9 290X’s GDDR5 implementation.
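
For those who want to check the math, the back-of-the-envelope calculation below reproduces that 60% figure from the published bus widths and per-pin data rates; this is just an illustrative sketch, not anything AMD provides.

```python
# Peak memory bandwidth: (bus width in bits * per-pin data rate in Gbps) / 8 bits-per-byte
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    return bus_width_bits * data_rate_gbps / 8

fury_x  = peak_bandwidth_gbs(4096, 1.0)  # Fiji HBM:     4096-bit @ 1Gbps -> 512 GB/s
r9_290x = peak_bandwidth_gbs(512, 5.0)   # Hawaii GDDR5:  512-bit @ 5Gbps -> 320 GB/s
print(f"{fury_x:.0f} GB/s vs {r9_290x:.0f} GB/s "
      f"({(fury_x / r9_290x - 1) * 100:.0f}% more)")  # 512 GB/s vs 320 GB/s (60% more)
```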

On the technical side of things, creating HBM has required several technologies to be created or improved in order to assemble the final product. The memory bus itself is rather simple (which is in and of itself a benefit), but a 4096-bit memory bus is by conventional standards absurdly wide. It requires thousands of contacts and traces, many times more than even a 512-bit GDDR5 bus required (and that was already a lot).

To solve this problem HBM introduces the concept of a silicon interposer. With traditional packaging not up to the challenge of routing so many traces, the one material capable of hitting the necessary density is fabbed silicon, and thus the silicon interposer. Essentially a partially fabbed chip containing just metal layers and no logic, the interposer is a large piece of silicon whose purpose is to carry the ultra-wide 4096-bit memory bus between the GPU and its VRAM, implemented as traces in those metal layers. The interposer itself is not especially complex; however, because of its sheer size (it needs to be large enough to hold the GPU and the VRAM stacks), it brings its own challenges.

Meanwhile, even though the interposer solves the immediate challenge of implementing a 4096-bit memory bus, the next issue is where to put the necessary DRAM dies. It takes 16 dies at 256-bits each to populate the 4096-bit bus, yet even at its largest the interposer offers only a fraction of the PCB space that traditional GDDR5 chips occupy. As a result, the DRAM for an HBM solution needs to be packed into a far smaller footprint than ever before.

The solution to that problem is die-stacked DRAM. If you can’t go wider, go taller, which is exactly what has happened with HBM. In HBM1 a stack can go up to 4 DRAM dies high, allowing the necessary 16 dies to be consolidated into a far more manageable 4 stacks. With a base logic die at the bottom of each stack serving as the PHY between the DRAM and the GPU (technically making the complete stack 5 dies), stacking the DRAM is what makes it practical to put so much RAM so close to the GPU.
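
Putting the stack arithmetic in one place, here is a small sketch of how four HBM1 stacks add up to Fiji’s 4096-bit bus and 4GB of VRAM; the 1024-bit interface and four 2Gb dies per stack are the published HBM1 parameters.

```python
# HBM1 organization as used on Fiji: 4 stacks, each with a 1024-bit interface
# (8 channels x 128 bits) and four 2Gb DRAM dies sitting on a base logic die.
STACKS            = 4
DIES_PER_STACK    = 4
BITS_PER_STACK    = 1024
DIE_CAPACITY_GBIT = 2

total_bus_width = STACKS * BITS_PER_STACK                          # 4096 bits
bits_per_die    = BITS_PER_STACK // DIES_PER_STACK                 # 256 bits per DRAM die
total_capacity  = STACKS * DIES_PER_STACK * DIE_CAPACITY_GBIT / 8  # 4.0 GB
print(total_bus_width, bits_per_die, total_capacity)               # 4096 256 4.0
```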

The final new piece of technology in HBM comes in the die stacks themselves. With the need to route a 1024-bit memory bus through 4 memory dies, traditional package-on-package wire bonding is no longer sufficient. To connect the memory dies, much like with the interposer itself, a newer, denser connectivity method is required.


TSVs. Image Courtesy The International Center for Materials Nanoarchitectonics

To solve that problem, the HBM memory stacks implement Through-Silicon Vias (TSVs), which run vias straight through the silicon dies in order to connect the layers. The end result is something vaguely akin to DRAM dies surface mounted on top of each other via microbumps, but with the ability to communicate through the layers. From a manufacturing standpoint, TSVs are the more difficult of the two technologies to master (more so than the silicon interposer), as they essentially combine all of the challenges of DRAM fabbing with the challenges of stacking those DRAM dies on top of each other.

Brought together as a single product, HBM is the next generation of GPU memory technology because it offers multiple benefits over GDDR5. Memory bandwidth is of course a big part, but of similar significance is the power savings. The greatly simplified memory bus requires far less power to be spent on the bus itself, and as a result the amount of power spent on VRAM overall is reduced. As we discussed earlier, AMD is looking at a 20-30W VRAM power savings on the R9 Fury X relative to the R9 290X.
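
As a rough sanity check on that claim, AMD’s earlier HBM presentation quoted memory interface efficiency of roughly 10.66GB/s per watt for GDDR5 versus 35+GB/s per watt for HBM. Plugging the two cards’ bandwidth figures into those vendor-supplied numbers (interface power only, so it won’t line up exactly with the 20-30W total) gives a sketch like this:

```python
# Approximate memory interface power from AMD's quoted efficiency figures.
# These are vendor-supplied approximations for the interface alone, not
# measurements of total VRAM power.
gddr5_w = 320 / 10.66   # R9 290X:   320 GB/s at ~10.66 GB/s per watt -> ~30 W
hbm_w   = 512 / 35      # R9 Fury X: 512 GB/s at  ~35+  GB/s per watt -> ~15 W
print(f"~{gddr5_w:.0f} W vs ~{hbm_w:.0f} W (~{gddr5_w - hbm_w:.0f} W saved)")
```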

The third major benefit of HBM over GDDR5 goes back to the size advantages discussed earlier. Because all of the VRAM in an HBM setup fits on-package, a significant amount of board space is freed up. The R9 Fury X PCB is 3” shorter than the R9 290X PCB, and the bulk of that reduction comes from the space savings of using HBM. Along with the immediate space savings of 4 small HBM stacks as opposed to 16 GDDR5 memory chips, AMD also gets to cut down on the amount of power delivery circuitry needed to support the VRAM, further saving space and some bill of materials costs in the process.

On the downside though, it is the bill of materials that is the biggest question hanging over HBM. Since HBM introduces several new technologies there are any number of things that can go wrong, all of which can drive up the costs. Of particular concern is the yield on the HBM memory stacks, as the TSV technology is especially intricate and said to be difficult to master. The interposer on the other hand is simpler, but it still represents something that has never been done before, and AMD admits upfront that the manufacturing facilities being used to create the interposer are old 65nm lines originally used for full chip production. So while the interposer does not approach the cost of a full logic chip, there is still the matter of the existing manufacturing lines being sub-optimal for high-volume low-cost production. Meanwhile AMD does get to enjoy some cost savings as well – the HBM PHYs are certainly much easier to implement than GDDR5 PHYs on Fiji itself, and the overall package is cheaper since it doesn't have GDDR5 memory running through it – though it's unlikely that these savings outweigh the other costs of implementing HBM at this time.

Ultimately AMD is not willing to discuss HBM costs or yields at this time. Practically speaking it’s not a consumer matter – what matters to video card buyers is the $650 price tag on the R9 Fury X – and from a trade secrets perspective AMD is loath to share too much about what they have learned, since they are the first HBM customer and want to enjoy as much of that advantage as possible. At this point I feel it’s a safe bet that the 4GB HBM implementation on Fiji is costing AMD more than the 4GB (or even 8GB) GDDR5 implementations on Hawaii cards, but beyond that it’s difficult to say much more about costs.

That said, regardless of what the costs are now, HBM will be the future for AMD and for the GPU industry as a whole. NVIDIA has already committed to using HBM technology for their high-end Pascal GPU in 2016, so AMD will be joined by other parties next year. Meanwhile AMD has much grander plans for HBM, intending to bring it to other products as costs allow. HBM on lower-priced GPUs is practically a given, while equipping AMD’s APUs with HBM would solve one of the greatest problems AMD faces today on the iGPU performance front: the 128-bit DDR3 memory bus that bottlenecks the iGPU on their Kaveri APUs. AMD could build a better iGPU, if only they had more bandwidth to feed it. This is a problem HBM is well positioned to solve.
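
To put the iGPU bandwidth problem in perspective, a quick comparison of a Kaveri-style 128-bit DDR3 interface against even a single HBM1 stack looks like this; DDR3-2133, the fastest speed Kaveri officially supports, is assumed here purely for illustration.

```python
# Peak bandwidth of a 128-bit DDR3-2133 interface vs. a single 1024-bit HBM1 stack.
# DDR3-2133 is an assumption for illustration; real-world sustained bandwidth is lower.
ddr3_gbs      = 128 * 2.133 / 8   # ~34 GB/s, shared between the CPU and the iGPU
hbm_stack_gbs = 1024 * 1.0 / 8    # 128 GB/s from one HBM1 stack alone
print(f"128-bit DDR3-2133: ~{ddr3_gbs:.0f} GB/s, single HBM1 stack: {hbm_stack_gbs:.0f} GB/s")
```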

Finally, at the end of the day, what can’t be perfectly captured in words is AMD’s pride in being the first to roll out HBM. AMD was the first (and only) company to support GDDR4, the first to support GDDR5, and now the first to support HBM. The company has put significant resources into helping to develop the technology alongside Hynix, UMC, ASE, Amkor, and the JEDEC, and they see the launch of the technology as a testament to their engineering capabilities.

Furthermore, they see being first as a significant advantage going forward, as it means they have a generational head start on arch-rival NVIDIA in implementing the technology. Case in point: NVIDIA’s first GDDR5 memory controller was by all accounts an underperformer, and it wasn’t until their second-generation GDDR5 controller for Kepler that NVIDIA was able to hit (and even exceed) their targeted memory clockspeeds. Admittedly this comes down to AMD hoping NVIDIA will stumble again, but at the end of the day the company is optimistic that all of their work will allow them to get more out of HBM than NVIDIA can.

Comments

  • D. Lister - Thursday, July 2, 2015 - link

    "AMD had tessellation years before nVidia, but it went unused until DX11, by which time nVidia knew AMD's capabilities and intentionally designed a way to stay ahead in tessellation. AMD's own technology being used against it only because it released it so early. HBM, I fear, will be another example of this. AMD helped to develop HBM and interposer technologies and used them first, but I bet nVidia will benefit most from them."

    AMD is often first at announcing features. Nvidia is often first at implementing them properly. It is clever marketing vs clever engineering. At the end of the day, one gets more customers than the other.
  • sabrewings - Thursday, July 2, 2015 - link

    While you're right that Nvidia paid for the chips used in 980 Tis, they're still most likely not fit for Titan X use and are cut to remove the underperforming sections. Without really knowing what their GM200 yields are like, I'd be willing to bet the $1000 price of the Titan X was already paying for the 980 Ti chips. So, Nvidia gets to play with binned chips to sell at $650 while AMD has to rely on fully enabled chips added to an expensive interposer with more expensive memory and a more expensive cooling solution to meet the same price point for performance. Nvidia definitely forced AMD into a corner here, so as I said I would say they won.

    Though, I don't necessarily say that AMD lost; they just make it look much harder to do what Nvidia was already doing while making beaucoup cash at it. This only makes AMD's problems worse, as they won't get the volume to gain marketshare and they're not hitting the margins needed to heavily reinvest in R&D for the next round.
  • Kutark - Friday, July 3, 2015 - link

    So basically what you're saying is Nvidia is a better run company with smarter people working there.
  • squngy - Friday, July 3, 2015 - link

    "and they cost more per chip to produce than AMD's Fiji GPU."

    Unless AMD has a genie making it for them, that's impossible.
    Not only is Fiji larger, it also uses a totally new technology (HBM).
  • JumpingJack - Saturday, July 4, 2015 - link

    "AMD had tessellation years before nVidia, but it went unused until DX11, by which time nVidia knew AMD's capabilities and intentionally designed a way to stay ahead in tessellation. AMD's own technology being used against it only because it released it so early. HBM, I fear, will be another example of this. AMD helped to develop HBM and interposer technologies and used them first, but I bet nVidia will benefit most from them."

    AMD fanboys make it sound like AMD can actually walk on water. AMD did work with Hynix, but the magic of HBM comes from the density of die stacking, which AMD had no hand in (they are no longer the actual chipmaker, as you probably know). As for interposers, this is not new technology; interposers are a well-established technique for condensing an array of devices into one package.

    AMD deserves credit for bringing the technology to market, no doubt, but their actual IP contribution is quite small.
  • ianmills - Thursday, July 2, 2015 - link

    Good that you are feeling better Ryan and thanks for the review :)
    That being said, Anandtech needs to keep us better informed when things come up.... The way this site handled it, though, is gonna lose this site readers...
  • Kristian Vättö - Thursday, July 2, 2015 - link

    Ryan tweeted about the Fiji schedule several times and we were also open about it in the comments whenever someone asked, even though it wasn't relevant to the article in question. It's not like we were secretive about it, and I think a full article about an article delay would be a little overkill.
  • sabrewings - Thursday, July 2, 2015 - link

    Those tweets are even featured on the site in the side bar. Not sure how much clearer it could get without an article about a delayed article.
  • testbug00 - Sunday, July 5, 2015 - link

    A pipeline story would work... I don't know about the title, but explain it in the text there. Include a link to THG, as it's owned by the same company now, for readers who want to read a review immediately.

    Twitter is non-ideal.
  • funkforce - Monday, July 6, 2015 - link

    The problem isn't only the delays; since Ryan took over as Editor in Chief, I suspect his workload has become too large.
    This also happened with the Nvidia GTX 960 review. He told 5-6 people (including me) for 5 weeks that it would come, and then it didn't, and he stopped responding to inquiries about it.
    Now how is that a good way to build a relationship and trust between you and your readers?
    I love Ryan's writing, and this article was one of the best I've read in a long time. But not everyone is good at everything; maybe Ryan needs to focus only on GPU reviews and not on running the site or whatever his other responsibilities are as Editor in Chief.

    The reviews are what most ppl. come here for and what built this site. You guys are amazing, but as far as I can remember AT never used to miss releasing articles the same day the NDA was lifted. And promising things and then not delivering, sticking your head in the sand, and not even apologizing isn't a way to build up trust and uphold and strengthen the large following this site has.

    I love this site, been reading it since the 1st year it came out, and that's why I care and I want you to continue and prosper.
    Since a lot of ppl. can't read the twitter feed, what you did here: http://www.anandtech.com/show/8923/nvidia-launches...
    is the way to go if something comes up, but then you have to deliver on your promises.
