While this year's Spring GTC event doesn't see any new GPUs or GPU architectures from NVIDIA, the company is still in the process of rolling out new products based on the Hopper and Ada Lovelace GPUs it introduced last year. At the high end of the market, the company today is announcing a new H100 accelerator variant specifically aimed at heavy users of language models: the H100 NVL.
The H100 NVL is an interesting variant of NVIDIA's H100 PCIe card that, in a sign of the times and of NVIDIA's extensive success in the AI field, is aimed at a singular market: large language model (LLM) deployment. There are a few things that make this card atypical of NVIDIA's usual server fare, not the least of which is that it is two H100 PCIe cards that come already bridged together, but the big draw is the large memory capacity. The combined dual-GPU card offers 188GB of HBM3 memory, 94GB per card, delivering more memory per GPU than any other NVIDIA part to date, even within the H100 family.
NVIDIA H100 Accelerator Specification Comparison

| | H100 NVL | H100 PCIe | H100 SXM |
|---|---|---|---|
| CUDA FP32 Cores | 2 x 16896? | 14592 | 16896 |
| Tensor Cores | 2 x 528? | 456 | 528 |
| Boost Clock | 1.98GHz? | 1.75GHz | 1.98GHz |
| Memory Clock | ~5.1Gbps HBM3 | 3.2Gbps HBM2e | 5.23Gbps HBM3 |
| Memory Bus Width | 6144-bit | 5120-bit | 5120-bit |
| Memory Bandwidth | 2 x 3.9TB/sec | 2TB/sec | 3.35TB/sec |
| VRAM | 2 x 94GB (188GB) | 80GB | 80GB |
| FP32 Vector | 2 x 67 TFLOPS? | 51 TFLOPS | 67 TFLOPS |
| FP64 Vector | 2 x 34 TFLOPS? | 26 TFLOPS | 34 TFLOPS |
| INT8 Tensor | 2 x 1980 TOPS | 1513 TOPS | 1980 TOPS |
| FP16 Tensor | 2 x 990 TFLOPS | 756 TFLOPS | 990 TFLOPS |
| TF32 Tensor | 2 x 495 TFLOPS | 378 TFLOPS | 495 TFLOPS |
| FP64 Tensor | 2 x 67 TFLOPS? | 51 TFLOPS | 67 TFLOPS |
| Interconnect | NVLink 4, 18 Links (900GB/sec) | NVLink 4 (600GB/sec) | NVLink 4, 18 Links (900GB/sec) |
| GPU | 2 x GH100 (814mm2) | GH100 (814mm2) | GH100 (814mm2) |
| Transistor Count | 2 x 80B | 80B | 80B |
| TDP | 700W | 350W | 700W-800W |
| Manufacturing Process | TSMC 4N | TSMC 4N | TSMC 4N |
| Interface | 2x PCIe 5.0 (Quad Slot) | PCIe 5.0 (Dual Slot) | SXM5 |
| Architecture | Hopper | Hopper | Hopper |
Driving this SKU is a specific niche: memory capacity. Large language models like the GPT family are in many respects bound by memory capacity, as they will quickly fill up even an H100 accelerator just in order to hold all of their parameters (175B in the case of the largest GPT-3 models). As a result, NVIDIA has opted to put together a new H100 SKU that offers a bit more memory per GPU than its regular H100 parts, which top out at 80GB per GPU.
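To put that constraint in perspective, here is a rough, illustrative sketch (in Python) of what just the weights of a 175B-parameter model occupy at a couple of common precisions, compared against a standard 80GB H100 and the 188GB of an H100 NVL pair. The parameter count and capacities come from the figures above; the bytes-per-parameter values are standard datatype sizes, and the estimate deliberately ignores activations, KV caches, and runtime overhead.

```python
# Back-of-the-envelope weight-memory estimate for a GPT-3-class 175B model.
# Bytes-per-parameter are standard datatype sizes; activations, KV cache,
# and framework overhead are deliberately ignored.
import math

PARAMS = 175e9  # parameter count cited above

BYTES_PER_PARAM = {"FP16/BF16": 2, "INT8": 1}
CAPACITY_GB = {"H100 80GB": 80, "H100 NVL (2 x 94GB)": 188}

for precision, nbytes in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * nbytes / 1e9
    print(f"{precision}: ~{weights_gb:.0f} GB of weights")
    for part, cap in CAPACITY_GB.items():
        needed = math.ceil(weights_gb / cap)
        print(f"  -> at least {needed} x {part} just to hold the weights")
```

Under those assumptions, even an INT8-quantized 175B model overflows a single 80GB card, which is exactly the gap this SKU is chasing.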
Under the hood, what we're looking at is essentially a special bin of the GH100 GPU placed on top of a PCIe card. All GH100 GPUs come with six stacks of HBM memory, either HBM2e or HBM3, with a capacity of 16GB per stack. However, for yield reasons, NVIDIA only ships its regular H100 parts with five of the six HBM stacks enabled. So while there is nominally 96GB of VRAM on each GPU, only 80GB is available on standard SKUs.
The H100 NVL, in turn, is the mythical fully enabled SKU, with all six stacks activated. By lighting up the sixth HBM stack, NVIDIA is able to access the additional memory and additional memory bandwidth it affords. It will have some material impact on yields (just how much is a closely guarded NVIDIA secret), but the LLM market is apparently big enough, and willing to pay a high enough premium for near-perfect GH100 packages, to make it worth NVIDIA's while.
Even then, it's worth noting that customers don't get access to the full 96GB per card. Rather, with a total memory capacity of 188GB, they effectively get 94GB per card. NVIDIA didn't go into detail on this design quirk in our pre-briefing ahead of today's keynote, but we suspect this too is for yield reasons, giving NVIDIA some slack to disable bad cells (or layers) within the HBM3 memory stacks. The net result is that the new SKU offers 14GB more memory per GH100 GPU, a 17.5% increase in memory. Meanwhile, the aggregate memory bandwidth of the card stands at 7.8TB/second, which works out to 3.9TB/second for the individual boards.
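As a quick sanity check on those figures, the following snippet reproduces the capacity and bandwidth math; the only inputs are the 16GB-per-stack capacity, the stack counts, and the per-board numbers already quoted above.

```python
# Sanity check of the capacity and bandwidth figures quoted above.
stack_gb = 16        # capacity of each HBM stack on GH100
total_stacks = 6     # stacks physically present per GPU
enabled_std = 5      # stacks enabled on standard H100 parts

nominal_gb = total_stacks * stack_gb   # 96 GB physically on the package
standard_gb = enabled_std * stack_gb   # 80 GB exposed on standard SKUs
nvl_gb = 94                            # per-GPU capacity exposed on the H100 NVL

extra_gb = nvl_gb - standard_gb        # 14 GB
print(f"Extra memory per GPU: {extra_gb} GB ({extra_gb / standard_gb:.1%})")  # 17.5%

per_board_bw = 3.9                     # TB/s per board
print(f"Aggregate bandwidth: {2 * per_board_bw:.1f} TB/s")                    # 7.8 TB/s
```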
Besides the increased memory capacity, in many ways the individual cards within the dual-GPU/dual-card H100 NVL look a lot like the SXM5 version of the H100 placed on a PCIe card. While the normal H100 PCIe is held back by the use of slower HBM2e memory, fewer active SMs/tensor cores, and lower clock speeds, the tensor core performance figures NVIDIA is quoting for the H100 NVL are all on par with the H100 SXM5, indicating that this card isn't further cut down like the normal PCIe card. We're still waiting on the final, complete specifications for the product, but assuming everything here is as presented, the GH100s going into the H100 NVL would represent the highest-binned GH100s currently available.
And an emphasis on the plural is necessary here. As noted earlier, the H100 NVL is not a single GPU part, but rather a dual-GPU/dual-card part, and it presents itself to the host system as such. The hardware itself is based on two PCIe form factor H100s that are strapped together using three NVLink 4 bridges. Physically, this is virtually identical to NVIDIA's existing H100 PCIe design, which can already be paired via NVLink bridges, so the difference isn't in the construction of the two-board/four-slot behemoth, but rather in the quality of the silicon within. In other words, you can strap together regular H100 PCIe cards today, but it wouldn't match the memory bandwidth, memory capacity, or tensor throughput of the H100 NVL.
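From the software side, that plurality is what matters. Below is a minimal sketch, assuming a host running PyTorch with CUDA (device indices are purely illustrative), of how a framework would enumerate the two GPUs of an H100 NVL and confirm peer-to-peer access over the bridge before sharding a model across them.

```python
# Minimal sketch: an H100 NVL appears to the host as two CUDA devices,
# with the NVLink bridges exposed as peer-to-peer access between them.
import torch

assert torch.cuda.is_available(), "No CUDA devices visible"

count = torch.cuda.device_count()
for i in range(count):
    print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")

# Frameworks typically verify peer access before enabling tensor-parallel
# layouts that split a large model's weights across the two GPUs.
if count >= 2 and torch.cuda.can_device_access_peer(0, 1):
    print("cuda:0 and cuda:1 can access each other's memory directly")
```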
Surprisingly, despite the top-tier specs, the TDPs stay almost the same. The H100 NVL is a 700W to 800W part, which breaks down to 350W to 400W per board, the lower bound of which is the same TDP as the regular H100 PCIe. In this case NVIDIA looks to be prioritizing compatibility over peak performance, as few server chassis can handle PCIe cards over 350W (and fewer still over 400W), meaning that TDPs need to stay put. Still, given the higher performance figures and memory bandwidth, it's unclear just how NVIDIA is delivering the extra performance. Power binning can go a long way here, but it may also be a case of NVIDIA giving the card a higher-than-usual boost clock speed, since the target market is primarily concerned with tensor performance and won't be lighting up the entire GPU at once.
Otherwise, NVIDIA's decision to release what is essentially the best H100 bin is an unusual one given their general preference for SXM parts, but it's a decision that makes sense in the context of what LLM customers need. Large SXM-based H100 clusters can easily scale up to 8 GPUs, but the amount of NVLink bandwidth available between any two of them is hampered by the need to go through NVSwitches. For just a two-GPU configuration, pairing a set of PCIe cards is much more direct, with the fixed link guaranteeing 600GB/second of bandwidth between the cards.
But perhaps more important than that is simply being able to quickly deploy H100 NVLs into existing infrastructure. Rather than requiring the installation of H100 HGX carrier boards purpose-built to pair up GPUs, LLM customers can simply toss H100 NVLs into new server builds, or as a relatively quick upgrade to existing server builds. NVIDIA is going after a very specific market here, after all, so the usual advantage of SXM (and NVIDIA's ability to throw its collective weight around) may not apply.
All told, NVIDIA is touting the H100 NVL as offering 12x the GPT-3 175B inference throughput of a last-generation HGX A100 (8 H100 NVLs vs. 8 A100s). For customers looking to deploy and scale up their systems for LLM workloads as quickly as possible, that is certainly going to be tempting. As noted earlier, the H100 NVL doesn't bring anything new to the table in terms of architectural features (much of its performance uplift comes from the Hopper architecture's new transformer engines), but it will serve a specific niche as the fastest PCIe H100 option, and the option with the largest GPU memory pool.
Wrapping things up, according to NVIDIA, H100 NVL cards will begin shipping in the second half of this year. The company isn't quoting a price, but for what is essentially a top GH100 bin, we'd expect them to fetch a premium price, especially in light of how the explosion in LLM usage is turning into another gold rush for the server GPU market.