At a certain point, it pays to stop generalizing. Artificial intelligence has come a long way using graphics processing units (GPUs), a sector led by Nvidia. Now Intel is pitching specialist Training and Inference chips as a lower-cost alternative.

AI specialist Nervana was just two years old when it was bought by Intel for around $400m in 2016. Since then the company has been absorbed as a division of the chip giant and is hard at work developing application-specific integrated circuits (ASICs) designed for Training and Inference.

Nervana founder Naveen Rao is now Intel's corporate VP and GM for Artificial Intelligence. In November, DCD caught up with him at Intel’s AI Summit in San Francisco.

He said: “With this next phase of AI, we’re reaching a breaking point in terms of computational hardware and memory." In other words, GPUs just aren't going to cut it.

“Purpose-built hardware like Intel’s Nervana NNP range is necessary to continue the incredible progress in AI,” Rao said.

"You're going to see this benefiting everybody because the whole purpose of the computer is shifting to be an AI machine.

"It appears that pretty much every application is going to need AI [in the future], probably a lot of Inference and, at some point, Training.”


[Image: Intel NNP-I M.2 card – Intel]

The two core aspects of deep learning are Training and Inference.

AI Training involves feeding large amounts of data into an infant AI model or neural network, again and again until the model can make an accurate prediction. Inference is the deployment of the trained model to make decisions in the field.
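
For readers who want to see the distinction in code, the sketch below is a minimal, generic illustration in PyTorch (our choice of framework, not Intel's stack, and the tiny model is a hypothetical stand-in): training loops over labelled data and keeps adjusting the model's weights, while inference simply runs the frozen model on a new input.

```python
import torch
import torch.nn as nn

# A tiny stand-in model; real networks are vastly larger.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Training: repeated passes over (data, label) batches, updating weights each time.
for _ in range(100):
    data = torch.randn(32, 4)             # stand-in training batch
    labels = torch.randint(0, 2, (32,))   # stand-in labels
    optimizer.zero_grad()
    loss = loss_fn(model(data), labels)
    loss.backward()                       # compute gradients
    optimizer.step()                      # adjust the weights

# Inference: the trained model makes a prediction; no gradients, no weight updates.
model.eval()
with torch.no_grad():
    prediction = model(torch.randn(1, 4)).argmax(dim=1)
```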

These roles are usually handled by GPUs because of their ability to perform vast numbers of mathematical operations in parallel. The newest, cheapest and lowest-power GPU that Nvidia has is the 70W Tesla T4.

Inference

The NNP range replaces the GPU with specialist hardware, conducting Training and Inference on separate NNP-T and NNP-I chips, which use less power and are much more scalable.

For example, the Inference chip, the NNP-I 1000, is available in two products: the NNP-I 1100, a 12W card which holds a single NNP-I chip, and the NNP-I 1300, a 75W card which holds two NNP-I chips.

“Power matters. You can't keep throwing up a power[-hungry] rack [of] computation to solve these problems in the IoT world with data centers,” said Rao.

The more powerful Training chip, the NNP-T 1000, contains up to 24 tensor processing cores, along with on-package high-bandwidth memory and a fast inter-chip communications link (ICL) made up of 16 channels running at 112Gb/sec each.
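
Taken at face value, those figures imply a hefty amount of chip-to-chip bandwidth. The back-of-the-envelope sum below treats the 16 channels as independent 112Gb/sec serial links and ignores encoding overhead (our simplification, not an Intel figure):

```python
# Rough aggregate ICL bandwidth from the figures quoted above.
channels = 16
gbps_per_channel = 112
total_gbps = channels * gbps_per_channel      # 1,792 Gb/s
print(f"~{total_gbps / 1000:.1f} Tb/s of raw inter-chip bandwidth per NNP-T")
```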

Like the Inference chip, the Training chip is available in two products: the 300W NNP-T 1300 card and the 375W NNP-T 1400 mezzanine card.

Despite the prominence of the NNP range, Intel is still pitching GPUs for broader, more demanding workloads.

Mere days after the AI Summit, it announced that America’s first exascale system, Aurora, would house the company’s new Ponte Vecchio GPU.

Described as the 'workhorse' for HPC and AI convergence, Ponte Vecchio is a powerful GPU designed to take on more taxing roles.

Intel’s VP and GM of Enterprise and Government Group, Rajeeb Hazra, made it clear that the company’s grand plan for AI is to have a chip for any particular need, specific or general.

[Image: Intel NNP-T mezzanine card – Intel]

Aurora will be required to multitask, hence the need for powerful, general-purpose GPUs.

Hazra said: "As high-performance computing moves from traditional modeling and simulation to the advent of data, there will be a drive for diverse computing needs, which will [spur] a new tailwind for heterogeneous computing.

"One size doesn’t fit all. We must look at the architectures [and how they are] tuned to the various needs of this era.

"If you need a general-purpose solution then Ponte Vecchio has already described its leadership performance [for] when you get those workloads that have tremendous bandwidth requirements and dense floating-point operations.

"If you were then to take the next step and say ‘I am very, very interested in the best possible performance for AI deep learning, Training and Inference,’ that's what our NNP families are for.

"They are less general-purpose in some sense than the GPU, which runs a broader set of workloads, but [are] acutely and solely focused on deep learning and scale.

"And so that is what we believe is the right approach to a diverse set of workloads that are also morphing quite quickly as the industry experiments and innovates."

CPUs, GPUs, and ASICs

To cut a long story short, Intel wants to give customers a choice of chip to suit whatever circumstances they are facing.

Naveen Rao said: “CPUs have hooks in them to help with Inference and Training so customers can start on Xeons [CPUs], which they probably already have, or they may eventually move to an NNP or even to an FPGA depending on what kind of flexibility they need.”
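
As an illustration of that "start on the Xeons you already have" path, the hedged sketch below uses PyTorch's dynamic int8 quantization for CPU Inference. It is a generic example of CPU-side Inference, not Intel's own software stack (tools such as OpenVINO or oneDNN would be the vendor route), and the model here is a hypothetical stand-in.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a model a customer already runs.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Convert the Linear layers to int8 weights; activations are quantized on the fly.
# On recent Xeons, int8 paths can exploit the CPU's AI acceleration instructions.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    scores = quantized(torch.randn(1, 512))   # CPU-only Inference pass
print(scores.shape)
```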

Splitting deep learning functions onto ASICs such as the NNP devices is by no means a "novel" idea. Google has taken the same approach with its Tensor Processing Unit (TPU), whose Inference and Training variants were released on its cloud service in 2017.

Google has data centers full of TPUs that are available to rent. Notable customers who use the processors include Lyft, Twitter, and HSBC.

At the AI Summit, Intel showed a 10-rack pod of 480 NNP-T cards connected over their ICL links, with no external switch. Intel said the platform trained multi-billion-parameter models in a reasonable amount of time.

[Image: Intel NNP-T – Intel]

For field deployments, NNP-I chips will be placed inside a regular server rack, Intel says.

Intel’s tests implied that the NNP-I would outpunch its rival in a quarter of the physical space. Intel deploys the NNP-I in a single rack unit (1U) chassis, which holds up to 32 NNP-I chips in the “ruler” form factor, a long slim module.

Intel said this had 3.7 times the density of the 4U chassis that would be required to hold 20 Nvidia Tesla T4s. Nvidia declined to comment to DCD in time for print.

Naveen Rao said: “I think a lot of data center operators have already latched onto that and have an expandable infrastructure as it is. The NNP-I is probably the new order. It's an order of magnitude more efficient than [doing] that [on] general-purpose [hardware].

“So, if you know Inference is 30 percent of your total workloads, then it makes a lot of sense to incorporate something like this.”

Early adopters

Facebook’s machine learning compiler, Glow, already targets the NNP-I. The social media giant’s AI director, Mikhail Smelyanskiy, said: “With 2.4 billion users today, there are a lot of seemingly unrelated products or services, but in reality, there are many AI algorithms that are running underneath. And some examples are photo tagging [or translation]."

Similarly, Baidu is an early adopter of the new NNP-T. Kenneth Church, an AI Research Fellow at Baidu, said the company is focused on implementing the training chip in PaddlePaddle, its open-source deep-learning platform used by 1.5 million developers in China, and on using the chip to power its X-Man 4.0 Open Accelerator Infrastructure.

Gadi Singer, AI Products group VP and head of the design team on the NNP-I, gave DCD some extra details on its deployment. “Unlike some other services that were very specifically focused on solving a particular problem, we built it for a family of [general deployment] issues,” Singer said.

“This is a plugin… so you would see different types of deployment. In some cases, you would see a rack like today or you have sockets that [are] today used for SSDs.

“Because of data centers… we needed something that will work within existing infrastructure, allowing [data center operators] to simply plug the rack onto extension sockets.

“It's built as a toolbox. When new usages arise, you can use this in a very diverse manner and use more of these to scale up.

“One thing that is very clear in our space is that, by the time you're finished finding a solution to a problem, the problem has already changed.”