Nvidia plans to release an open-source software library that it claims will double the speed of large language model (LLM) inference on its H100 GPUs.
TensorRT-LLM will be integrated into Nvidia's NeMo LLM framework as part of the Nvidia AI Enterprise software suite early next month. It is currently available in early access.
“We’ve doubled the performance by using the latest techniques, the latest schedulers, and incorporating the latest optimizations and kernels,” Ian Buck, VP of hyperscale and HPC at Nvidia, said.
“Those techniques improve performance, not just by increasing efficiency but also optimizing the algorithm end-to-end.”
TensorRT-LLM will also support other Nvidia GPUs, including the A100, L4, L40, L40S, and the upcoming Grace Hopper Superchip (an H100 combined with a Grace CPU).
The software library includes a new in-flight batching scheduler, which allows work to enter and exit the GPU independently of other tasks. The library also offers automatic FP8 conversion, a deep learning compiler for kernel fusion, and a mixed-precision optimizer.
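To illustrate the idea behind in-flight batching, here is a toy Python sketch. This is not TensorRT-LLM's actual API; the scheduler loop, function name, and request format are all illustrative assumptions. The point it demonstrates is that finished sequences leave the batch each decoding step and queued requests immediately take their slots, rather than the whole batch waiting for its slowest member.

```python
from collections import deque

def inflight_batching(requests, max_batch=4):
    """Toy model of in-flight (continuous) batching.

    requests: list of (request_id, tokens_to_generate) pairs.
    Returns the completion order and the total number of steps.
    """
    queue = deque(requests)   # work waiting to be scheduled
    active = []               # sequences currently in the batch
    completions = []          # order in which requests finish

    steps = 0
    while queue or active:
        # Admit new work into any free batch slots.
        while queue and len(active) < max_batch:
            active.append(list(queue.popleft()))
        # One decoding step: every active sequence emits a token.
        for seq in active:
            seq[1] -= 1
        # Retire finished sequences independently of the rest,
        # freeing their slots for the next iteration.
        for seq in [s for s in active if s[1] == 0]:
            active.remove(seq)
            completions.append(seq[0])
        steps += 1
    return completions, steps

# Requests with different output lengths: short ones finish early
# and hand their slots to queued work mid-flight.
done, steps = inflight_batching(
    [("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)]
)
```

With static batching, the first batch of four would run for five steps before the fifth request could even start; here, "e" is admitted as soon as "c" finishes, so the whole workload completes in five steps instead of seven.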