Nvidia claims new software library doubles LLM inference speed on H100 GPU

Open source TensorRT-LLM comes out next month, targets generative AI workloads

Nvidia plans to release an open-source software library that it claims will double the speed of large language model (LLM) inference on its H100 GPUs.

TensorRT-LLM will be integrated into Nvidia's NeMo LLM framework as part of the Nvidia AI Enterprise software suite early next month. It is currently available in early access.

“We’ve doubled the performance by using the latest techniques, the latest schedulers, and incorporating the latest optimizations and kernels,” Ian Buck, VP of hyperscale and HPC at Nvidia, said.

“Those techniques improve performance, not just by increasing efficiency but also optimizing the algorithm end-to-end.”

TensorRT-LLM will also support other Nvidia GPUs, including the A100, L4, L40, L40S, and the upcoming Grace Hopper Superchip (which is an H100 combined with a Grace CPU).

The software library includes a new "in-flight batching" scheduler, which allows work to enter and exit the GPU independently of other tasks. The library also offers automatic FP8 conversion, a deep learning compiler for kernel fusion, and a mixed-precision optimizer.
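To illustrate why letting work enter and exit independently can help, here is a minimal toy simulation in Python. It is not TensorRT-LLM code and does not use its API; the function names, the slot model, and the example request lengths are all assumptions made for illustration. Each request is reduced to the number of decode steps it needs, and the GPU is modeled as a fixed number of batch slots.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Steps needed when each batch runs until its longest request finishes.

    `lengths` holds the (positive) number of decode steps per request.
    Hypothetical model: no new request starts until the whole batch is done.
    """
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def in_flight_batching_steps(lengths, batch_size):
    """Steps needed when a finished request frees its slot immediately.

    Hypothetical model of in-flight batching: before every decode step,
    any free slot is filled from the queue of waiting requests.
    """
    queue = deque(lengths)
    active = []  # remaining decode steps for each occupied slot
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # Run one decode step for every active request.
        steps += 1
        # Requests with one step left have just finished; drop them.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


if __name__ == "__main__":
    # Illustrative request lengths (assumed, not measured data).
    request_lengths = [2, 8, 3, 8, 2, 2, 3, 2]
    print("static:", static_batching_steps(request_lengths, batch_size=4))
    print("in-flight:", in_flight_batching_steps(request_lengths, batch_size=4))
```

In this model, short requests no longer wait for the longest request in their batch, so the slots stay busier when output lengths vary. The sketch ignores real-world factors such as memory limits and prefill cost, so it says nothing about the size of the speedup Nvidia reports.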
