GDDR5 has long been established as the DRAM of choice for high-performance GPUs. However, as bandwidth requirements continue to grow, GDDR5 consumes ever more power, and because graphics processing is largely power and heat limited, every watt given to the DRAM is a watt that cannot be given to the GPU.
The problem is that GDDR5 is reaching the stage where supplying the memory with enough power to hit its bandwidth targets eats into the power available for GPU computation, leading to what is effectively a performance flat line. Fortunately this has not appeared out of the blue, as AMD has been working for several years on a fix for this situation.
The chart below illustrates these factors, plotting total power and performance against time:
Another problem is the space GDDR5 takes up on the PCB itself. The chips are not getting any smaller: the latest devices are very dense and capable of 8 Gb/s data rates, but their 32-bit interface means a large number of them is still needed to reach a high total bandwidth, and all of those packages consume board area.
Up until now, issues like this have been dealt with by shrinking components and integrating them onto the main processor die. That is not a viable option for DRAM, however, because of its architecture, requirements and build process; combining it with the GPU on a single die would make the system far more expensive to fabricate.
The picture below shows how this integration technique has progressed over the years to increase bandwidth:
With both scaling up the off-chip design and full on-die integration ruled out, a middle ground is needed to push memory bandwidth further. The solution AMD has created is what's called a "silicon interposer", which is quite literally a middle layer sitting between the processor and the DRAM dies.
The interposer sits above the package substrate, so the DRAM and the processor are combined into a single component even though they are still not on the same die. The connection between them is direct and short, letting data travel from the GPU through the interposer to the DRAM over a much wider memory interface. Because so much data can be moved in a single cycle, the memory can run at a much lower clock speed, which in turn reduces the power spent per bit and increases the bandwidth delivered per watt.
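As a rough sketch of that trade-off, the short Python snippet below compares a narrow, fast interface with a wide, slow one. The bus widths and per-pin rates are illustrative assumptions rather than figures from this article, and the helper `bandwidth_gbs` is just a name chosen here.

```python
# Rough model: peak bandwidth = bus width (bits) * per-pin data rate (Gb/s) / 8
# The figures below are illustrative assumptions, not specs quoted in the article.

def bandwidth_gbs(bus_width_bits: int, pin_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s for a memory interface."""
    return bus_width_bits * pin_rate_gbps / 8

# Narrow-but-fast GDDR5-style interface: 512-bit bus at 7 Gb/s per pin.
gddr5 = bandwidth_gbs(512, 7.0)      # 448 GB/s

# Wide-but-slow HBM-style interface: 4096-bit bus at only 1 Gb/s per pin.
hbm = bandwidth_gbs(4096, 1.0)       # 512 GB/s

print(f"GDDR5-style: {gddr5:.0f} GB/s, HBM-style: {hbm:.0f} GB/s")
```

The wide interface reaches a comparable (here even higher) bandwidth while each pin toggles far more slowly, which is where the power-per-bit saving comes from.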
The picture below illustrates the new architecture that achieves this result:
With such a wide memory bus, the new design requires a whole new type of memory, known as High Bandwidth Memory, or HBM. Its key feature is that it is three-dimensional: DRAM layers are stacked on top of one another and placed upon a logic die underneath, which is where the connection to the processor is made.
Getting this to work is complex, and two primary factors need to be taken into account. First, the DRAM dies need to be extremely thin, in the 100-micron range, which can cause issues during the build process. Second, the dies have to be connected together, and to do this a new interconnect has been created called the TSV (through-silicon via). A TSV is literally a hole poked vertically through the thinned silicon wafer and filled with copper; each die carries many TSVs, and when the dies are stacked on top of each other they are joined using high-density solder micro-bumps, which are also used to connect the processor to the interposer.
While this process is complicated, it results in extremely short interconnections between the dies, with the accompanying benefits of reduced size and power consumption. The pictures below illustrate this process and the resulting architecture:
A single stack of four HBM dies has a total bus width of 1,024 bits, made up of 8 independent 128-bit channels (two per die). Running at an effective 1 GHz (1 Gb/s per pin), that produces 128 GB/s of memory bandwidth per stack, and just as importantly the required voltage drops to 1.2 V where GDDR5 typically requires 1.5 V, a substantial improvement in power efficiency.
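As a quick check on those figures, the arithmetic below reproduces the per-stack numbers from this paragraph (a minimal sketch; the variable names are simply chosen for clarity):

```python
# Per-stack figures for first-generation HBM, as quoted above.
channels_per_stack = 8       # independent channels
channel_width_bits = 128     # bits per channel, two channels per die
pin_rate_gbps = 1.0          # effective 1 Gb/s per pin

bus_width_bits = channels_per_stack * channel_width_bits   # 1,024 bits
bandwidth_gbs = bus_width_bits * pin_rate_gbps / 8          # bits -> bytes

print(f"Bus width per stack: {bus_width_bits} bits")         # 1024 bits
print(f"Bandwidth per stack: {bandwidth_gbs:.0f} GB/s")      # 128 GB/s
```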
The picture below compares the two types of architecture:
This new design is much more power efficient, rebalancing the split between DRAM and processor power consumption, and in future products it can grow into one of the primary ways of increasing the memory bandwidth available on the device.
The picture below illustrates the increase in GB/sec of bandwidth per watt of power used by the system:
The space savings shown in the comparison are, if anything, slightly understated: the 1 GB of HBM pictured is set against four 2 Gb GDDR5 dies, and Samsung is developing 8 Gb parts at the time of writing. As the picture below shows, the space savings are massive.
What we have looked at so far is only the first generation of HBM, and it is set to improve vastly in future generations. Currently a single 1 GB stack of four 2 Gb dies running at 1 GHz delivers 128 GB/s of bandwidth per stack. SK Hynix has outlined that in future generations each die will have an 8 Gb capacity and each stack will contain four or eight dies, giving a 4 GB or 8 GB stack. With clocks reaching up to 2 GHz, that produces 256 GB/s of bandwidth per stack, and with AMD suggesting that 16-high stacks are a possibility, this approach could eventually deliver some of the highest memory bandwidths achievable, though that is most likely some way into the future.
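The second-generation figures follow from the same arithmetic, sketched below; the 1,024-bit per-stack bus width is an assumption carried over from the first generation, consistent with the 256 GB/s number quoted above.

```python
# Second-generation HBM figures as quoted above (a sketch, not a spec sheet).
dies_per_stack = 8           # stacks of four or eight dies; take eight here
die_capacity_gbit = 8        # 8 Gb per die
bus_width_bits = 1024        # assumed unchanged from the first generation
pin_rate_gbps = 2.0          # up to 2 Gb/s per pin

stack_capacity_GB = dies_per_stack * die_capacity_gbit / 8   # 8 GB per stack
stack_bandwidth_GBs = bus_width_bits * pin_rate_gbps / 8     # 256 GB/s per stack

print(f"Capacity per stack:  {stack_capacity_GB:.0f} GB")
print(f"Bandwidth per stack: {stack_bandwidth_GBs:.0f} GB/s")
```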
The picture below shows the two generations and their differences as they currently stand: