Where will the models live?
This is the question that could define the next generation of tech titans, as the data center industry scrambles to support an expected - albeit far from guaranteed - surge in generative artificial intelligence (AI) workloads.
Model training will happen in large data centers, closer in design to the supercomputers of the last decade than the classic enterprise facilities of today. They will likely require enormous amounts of power, liquid cooling, and tens of thousands of GPUs.
But what about inference, the phase after training, when the model is put to production work? Where will AI models live and operate when they are all grown up and ready to work?
In total, an AI model will likely need more compute in this phase, because it is trained only a handful of times but used by millions daily. Inference will also be more distributed, running on lower-end GPUs or CPUs close to the users.
Training can happen far from users, as models take months to create and the process is not latency-sensitive. But once a model is out in the real world serving end users, the time it takes to respond could become business-critical.
That adds up to a demand for inference at the Edge, according to Edge infrastructure operator Cloudflare.
Cloudflare said it would have Nvidia GPUs in more than 100 cities by the end of last year, offering its ‘Workers AI’ service for generative AI workloads. Within a year, it expects to have deployed them ‘nearly everywhere’ across its network, which spans data centers in more than 300 cities.
The company began as a content delivery network (CDN) operator, but has expanded into broader networking and security services, slowly becoming more and more like a cloud company.
Now, it wants to dominate the AI inferencing space.
Where the model lives
Some people have proposed that AI inference could be delegated right down to the end user devices that are delivering the results to users. Today’s phones certainly have a lot of processing power - the A17 Pro chip in the iPhone 15 Pro has a six-core GPU capable of driving 4K video at up to 60fps - but John Engates, Cloudflare field CTO, says this is nowhere near enough for inference.
“A certain amount [of generative AI work] is going to be done on the device,” Engates told DCD. “But it's limited, the device has only a certain amount of processing power and battery capacity. The GPUs are nowhere near as capable as what lives in a data center.
“People like to talk about what the latest iPhone is capable of in terms of GPU, but when you stack it up against an Nvidia GPU running in a server, it’s orders of magnitude on top of orders of magnitude different in terms of capability.”
While some smaller models can run on devices - just as voice recognition for AI systems like Google Assistant is handled by the phone - Engates believes that the limitations of the hardware will mean that the larger and better models are more suited for the Edge.
“[Meta’s] Llama 2 is more than 100 gigabytes,” he said, far too large for portable devices.
“If we can host that at the Edge, and do some inference with these GPUs, we can take a lot of the bandwidth limitations and performance limitations away and combine those with what lives on the device. It's not an ‘either-or,’ but maybe a ‘both.’”
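The arithmetic behind Engates’ point is simple: a model’s weight footprint is roughly its parameter count times the bytes per parameter. A minimal sketch, using Llama 2 70B as an illustrative example (the exact figures depend on file format and quantization):

```python
# Rough memory footprint of an LLM's weights: parameters x bytes per parameter.
# The 70B figure and byte widths are illustrative assumptions; real checkpoint
# sizes vary with format and quantization.

def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate size of a model's weights in gigabytes."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# Llama 2 70B in 16-bit floats: ~140 GB of weights alone.
full_precision = weights_gb(70, 2)

# The same model quantized to 8-bit integers still needs ~70 GB -
# well beyond any phone, but comfortable for a GPU-equipped Edge node.
quantized = weights_gb(70, 1)

print(f"fp16: ~{full_precision:.0f} GB, int8: ~{quantized:.0f} GB")
```

Even aggressive quantization leaves the larger models orders of magnitude beyond the memory of a handset, which is why the split Engates describes lands small models on the device and large ones at the Edge.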
Where latency matters
“Our whole business model is built on small data centers everywhere - some of them are fairly substantial, but generally speaking, small and everywhere,” Engates said. “They're living inside of cloud providers or telcos or the data centers that exist in a particular geography. Every geo is different; every country has its own challenges.”
That has led to a vast, globe-spanning infrastructure focused on reducing latency.
“We’re 50 milliseconds from 95 percent of the world's population,” he said. “What can you do with that? Security makes sense, and distributing content makes sense. And then AI inference at the Edge makes a lot of sense, because you've got to really think about how latency affects the performance and what we could do to turbocharge applications.”
This bears further unpacking: With generative AI changing so rapidly, the exact end use cases remain unknown. Certain workloads like image generation take seconds to create artwork, so shaving tens of milliseconds off network latency will have limited impact.
Users have reported frustration at the speed of ChatGPT conversations, but that is likely more to do with the time the model takes to run (alongside GPU shortages) than physical proximity to users. While chat will still benefit from being at the Edge, Engates says latency will become more critical in the next generation of AI.
“Think about a voice application like Siri. You’ll want it to be immediate, you’ll want it to be like the conversation you and I are having right now,” he said. “And that's going to require a pretty cool combination of on-device, in the cloud, and at the Edge.”
Engates admitted that we don’t yet know what the latency-sensitive applications will be, noting that self-driving cars could benefit from generative AI to help perceive the world.
While current autonomous vehicles have become skilled at image recognition, a large language model could help explain those images to the car - for example, the car may recognize an adult or a child by the side of the road, but the LLM would be better at understanding that the child is more likely to suddenly dash into oncoming traffic.
Such cars, however, are likely to continue to rely on on-board compute for inference, given the obvious need for extremely low latency.
The Edge will also serve another, more mundane, function for generative AI: Compliance. Data is already tightly regulated in some regions, but the mainstream breakout nature of generative AI could lead to far more government oversight. Different nations will demand different versions of models to suit their own takes on freedom of information, copyright, job protections, and privacy.
Cloudflare’s Workers AI will include its own restrictions. It will not support customer-provided models, initially offering only Meta's Llama 2 7B and M2M100-1.2B, OpenAI's Whisper, Hugging Face's DistilBERT-sst-2-int8, Microsoft's ResNet-50, and BAAI's bge-base-en-v1.5.
Cloudflare plans to add more models in the future, with the help of Hugging Face.
“You've got to start somewhere,” Engates said, seeing this approach as ensuring that “the basic use cases are up and running.”
But he expects the use cases to expand: “We're going to have to figure out some systems for managing the costs associated with hosting your own models and how those live in our cloud. I think caching is probably the biggest thing - how many places do you want the same model to live? How fast does it need to be available in these different locations?
“There will be customers that ask us for very specific things over time, and we'll have to figure out how to enable those. This was about trying to show people what's possible, and get it out there quickly. Then the team goes back to work and iterates for the next round of releases.”
The first wave
There’s enough demand for this first step into generative AI to support the initial roll-out, Engates said.
“People are all trying to experiment with what they're going to do with generative AI - I saw a number of people building their own chatbots right on top of Cloudflare’s Edge. Another person built a Google Translate-type system in 18 lines of code. The goal is just to make it as easy as possible for developers to try things out and get them up and running. It's early days and a lot of these things are still in beta mode.”
But he hopes that Workers AI will move beyond experimentation and allow new projects to come out of the infrastructure, with the ‘build it and they will come’ mentality that Edge proponents have often hoped for.
“I imagine very soon these will mature and turn into things that people will rely on every day with very, very strict SLAs around uptime and performance,” he said. “We have to get it out there for people to tell us what they want.”
Engates is hopeful that the market feedback will point to something transformative, akin to the key technological leaps of times past.
“It reminds me of these big inflection points in our lifetime,” he said. “My career goes all the way back: When I started in the early ‘90s, the Internet was new. I started an ISP right out of university, and I left to go help start Rackspace as the CTO for almost 18 years.
“The next big inflection was mobile, and then the next one was cloud. Now we're here with AI, and it seems to me almost bigger than the others combined. It's taking advantage of all of them and it's building on them to launch this new thing.”
A new network
It’s hard to say just how profound this moment is.
There’s a possibility that the bubble bursts and Cloudflare will have to curb any wider ambitions and repurpose the GPUs for other applications, including its ongoing efforts to use AI to make its network smarter.
Then there’s the possibility that the concept lives up to the hype - that every business runs its own model (or at least version of a model), and every person regularly converses with an instantaneous virtual assistant over voice or even video.
That could require a step change in the scale at which Cloudflare operates. It may need more capacity than the smaller or more telco-focused data centers it often frequents can provide, necessitating more wholesale deals and bigger Edge deployments.
“Within Cloudflare, there are different layers of what we consider Edge. There's the Edge that's inside of a cabinet in somebody else's data center, versus larger infrastructure in places like New York that have considerable populations,” Engates said.
“Cloudflare’s network is going to evolve and change over time - this is a living, breathing thing,” he said. “We've invested in people that really understand the hyperscale market very well, our teams are growing in terms of being able to innovate in that context.
“It’s all so that we can become the foundation for all this cool stuff that we think is coming.”