Data center networks have changed, and there’s no going back. They have become so complex that human beings cannot understand them in real time.
They have to be automated. But that brings a danger: how can you be sure they have been automated correctly?
The complexity comes from the demands of the applications the networks carry, and from the need to run those applications across multiple clouds and infrastructure.
“There are a lot of new demands and pressures being placed on data center networks,” says analyst Brad Casemore of IDC, speaking at a DCD online event. “Application architectures have really redefined data center networking requirements. As a result of the evolution of application architectures, there's a need for modernization within data centers.”
Applications rule the network
Networks used to dictate terms to the applications that ran on them, says Sanjeevan Srikrishnan, senior global solutions architect at Equinix: “In the good old days, we'd go out and build a killer network. We'd be like, hey, I want 100G backbone links and all of this crazy infrastructure. Then the business would come to us and say, ‘Can I run my app on your network?’ and we’d say, ‘No, sorry, it doesn't meet our needs.’”
Network architects could actually ask the business to go away and rebuild applications to suit the network: “Take it away, break it into these three tiers. Bring it back to me like this.”
That has changed: “Nobody does it that way anymore. Application is the king Kahuna. It's the bottom of the triangle. It’s the base of the technology equivalent of Maslow's hierarchy of needs. User experience is king.”
Meeting user experience demands would be complex enough, but networks are now constructed from diversified parts, and have to respond coherently: “Everything is really responsive to the applications,” says Sagi Brody, CTO of managed service provider Opti9. “And the production environment for an enterprise organization today is typically hybrid. It spans across colocation, private clouds, public clouds, and SaaS.”
These complex networks have been put together from parts that were historically siloed - and under the covers some of that hasn’t changed: “You're seeing organizations go from fixed and siloed configurations, into this new digital world,” says Srikrishnan, “and it's never a clean migration. You always have ‘tech debt’ that sits there and it may stick around for 10 to 15 years.”
Alongside that, responsibilities shift: “Many of the things you thought the provider was going to own you still own. You are jumbling together four different types of services, and you have to own the compliance and security of all of those individually, as well as how they work together,” says Brody.
These networks also distribute more network decisions, says Russ White, infrastructure architect at Juniper: “From a network architecture perspective, how do I build these networks that can handle this Edge traffic and distribute stuff intelligently and still have some sort of a core?”
Fast is not good enough
The services that run on these hybrid distributed networks have to respond instantly - but also very consistently - says White: “When I work on hyper-scaling networks, it's not even really the delay that matters. It's the jitter.”
Delay is when network packets take a long while to arrive. Jitter is when they arrive, but the delay is variable, garbling real-time traffic such as voice calls, he explains: “Consistency is a huge key right now, how can I make the network perform consistently all the time?”
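The difference can be shown with a small sketch. The Python below is illustrative only: the delay samples are invented, and jitter is taken here as the mean absolute change in delay between consecutive packets, which is one common simplification rather than the metric any of the speakers uses.

```python
from statistics import mean

def average_delay(delays_ms):
    """Mean delay across all packets, in milliseconds."""
    return mean(delays_ms)

def jitter(delays_ms):
    """Mean absolute change in delay between consecutive packets."""
    diffs = [abs(b - a) for a, b in zip(delays_ms, delays_ms[1:])]
    return mean(diffs) if diffs else 0

# Hypothetical per-packet delays for two links (assumed values)
steady = [30, 30, 30, 30, 30]
erratic = [10, 50, 10, 50, 30]

for name, samples in (("steady", steady), ("erratic", erratic)):
    print(name, "delay:", average_delay(samples), "jitter:", jitter(samples))
```

Both links show the same average delay, but only the second would garble a voice call, which is why an average alone says little about consistency.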
With all these different parts of the network to manage, it’s impossible for network administrators to respond quickly enough to keep up with changing demands.
“The hybrid use cases are forcing us into scenarios where we need to deploy things like VPN and VXLAN,” says Brody. “These are just literally not configurable by hand anymore.”
The obvious thing to do is to use automated tools to control the network’s response to changing conditions, and to take the burden off the admin: “I believe in automating the crap out of everything,” says Srikrishnan.
But what exactly is being automated? Srikrishnan says the network is a “nebulous” term. “Are we talking about the virtual networks that the developers see? Or are we talking about the underlying infrastructure that powers all of that? Those are two very different things.”
Another issue is that automation is not simple. The first approach was to make a set of rules which provide a canned version of the response an administrator would make to specific events. That works fine most of the time, but if an event is slightly outside the possibilities considered by the network programmer, the response may actually be wrong - and sometimes disastrously so.
Fast automation is dangerous
Brody says: “Automation is important. But it could also be dangerous, it has to be done right. It has to be use case-specific.”
“The intelligence has to come in and add some layers of logic,” he says, to check whether any action will cause problems. “An example is IPAM [IP address management]. If the IPAM says that a subnet is free and not in use, before we go and assign an IP address, let's check if it's routable.”
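One reading of that IPAM check can be sketched in a few lines of Python. This is an assumption-laden illustration, not Opti9's tooling: the function name, the set of free subnets and the routing table are all hypothetical, and "routable" is interpreted here as "already covered by an installed route".

```python
import ipaddress

def safe_to_assign(candidate, ipam_free, routing_table):
    """Approve a subnet only if IPAM lists it as free AND no known route overlaps it."""
    subnet = ipaddress.ip_network(candidate)
    if subnet not in ipam_free:
        return False  # IPAM itself says the subnet is taken or unknown
    # Second opinion: is something on the network already routing this space?
    return not any(subnet.overlaps(route) for route in routing_table)

# Hypothetical IPAM records and routing table (IPv4 only in this sketch)
ipam_free = {
    ipaddress.ip_network("10.0.1.0/24"),
    ipaddress.ip_network("10.0.2.0/24"),
}
routing_table = [
    ipaddress.ip_network("10.0.2.0/25"),    # stale route IPAM does not know about
    ipaddress.ip_network("192.168.0.0/16"),
]

print(safe_to_assign("10.0.1.0/24", ipam_free, routing_table))
print(safe_to_assign("10.0.2.0/24", ipam_free, routing_table))
```

The second subnet is rejected even though IPAM lists it as free, because part of it is already routed. A real routing table would need its default route excluded first, since a default route overlaps every subnet.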
Casemore wants to see verification: “So when you automate a change at scale, it's not going to cause all sorts of problems or potentially break down part of your network.”
“We're in a transition phase,” says White. “We had all these really smart people who could type on the keyboard and get the console working. And we thought we would just automate them out of a job. But we haven't turned that corner.”
Brody has been through that cycle: “Years ago I built a lot of network automation myself. There's always this natural progression. You build the automation, and then at some point, it fails. At some point, it takes down your network, it does the exact opposite of what you want it to do.”
Intelligent automation
Brody says the answer is to make a network that is not just automated, but intelligent: “You add some intelligence, you add some logical checking, and so on.”
White agrees: “I want the network to be down as little as possible. And I think we're almost over-relying on automation and under-relying on intelligent automation. We should put the emphasis on intelligence and not on automation.”
IDC’s Casemore adds: “The automation not only becomes more comprehensive, but it becomes smarter and a little more anticipatory. We move to a more proactive form of automation.”
But this has to be done without adding layers of complexity. Brody wants to bring it back to a simpler view:
“We have to turn it around. We need to focus on declarative models, imposing our ideal configuration on the network.” Instead of the automation configuring the network, he wants to see a “single point of truth”, a configuration imposed on the physical network.
“This is a new paradigm,” he says. “We need to move towards machine-to-machine interfaces. And we need to rethink.”
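A declarative model of this kind can be sketched as a comparison between the declared configuration and what is actually running. The Python below is a minimal illustration under assumed names: the VLAN-to-subnet dictionaries and the `plan_changes` function are hypothetical and do not represent any particular product.

```python
def plan_changes(desired, actual):
    """List the changes needed to make the live config match the declared one."""
    changes = []
    for key, value in sorted(desired.items()):
        if key not in actual:
            changes.append(("add", key, value))
        elif actual[key] != value:
            changes.append(("update", key, value))
    # Anything running that was never declared is drift, and gets removed
    for key in sorted(actual.keys() - desired.keys()):
        changes.append(("remove", key, actual[key]))
    return changes

# Hypothetical single point of truth, and what the network actually runs
desired = {"vlan10": "10.0.10.0/24", "vlan20": "10.0.20.0/24"}
actual = {
    "vlan10": "10.0.10.0/24",
    "vlan20": "10.0.99.0/24",
    "vlan30": "10.0.30.0/24",
}

for change in plan_changes(desired, actual):
    print(change)
```

The operator only ever edits `desired`. Once the network matches it, the plan is empty, so running the same comparison again changes nothing.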
People used to think they could build things as complex as they wanted, he says, “as long as we automate it. And I think we need to get away from that line of thinking and start thinking about how do I make my network more intelligent, so I can actually automate less, but have the automation be more intelligent.”
Brody says: “I think we're moving away from a world where you can half-automate, and half do things manually. We have to focus on simplicity, everything has to be as simple as possible.”
Srikrishnan thinks the answer may be automating early, from the ground up: “If we talk about automation early on, and you use best practice, you're not using hands to keyboard to deploy anything, unless you're using a product like Terraform or Ansible to push your code up into production into your infrastructure. As you do this, you should be validating it.”
Observability for security
Network behavior also has to be “observable,” a keyword emerging in network discussions. “I think it's a whole new genre of software,” says Brody. “I was at [the AWS tech event] re:Invent this year, and the big buzzword was observability. Because we've made things so complex, we now have this new challenge of how do we observe what's happening where? And how do we troubleshoot it? That wasn't a problem years ago.”
Srikrishnan agrees that “observability is huge,” and says a network has to be able to “receive logs and respond to events in real-time.”
For instance, what if a user is normally in Toronto, but suddenly shows up in Manila?
“What's going on there? Is this a legitimate use case? Or is this a bad actor?” says Srikrishnan. “That user in Manila may have left their iPad at home, and the iPad is now checking in for emails, but the user is physically in Manila.
“Do you now take the traditional SecOps approach and kill their user account because you notice malicious activity? Or do you say, hey, wait a minute, this could be legitimate. Let me prompt them for credentials. And if it is a legitimate use case, do I need to now spin up digital infrastructure in Singapore to support them, because they need reliable secured connectivity back to my core infrastructure?”
In a zero-trust network, user authentication is automatic and continuous, says White: “When I was at Cisco, and I talked about network security, I had a slide which said we could do a crunchy edge with a really nice DMZ [De-Militarized Zone]. And the inside of the network could be really chewy, like a chocolate chip cookie. Nowadays, I’m sorry, but the entire network has to be crunchy all the way through. Security has got to be built in from the ground up and all the way through.”
Lifecycle
Network automation also has to be able to handle the lifecycle of a network, during which time it will be maintained by multiple people with different levels of skill.
“If you're a senior network engineer, and you've been in the trenches, you know what to build,” says Brody. “But someone newer and more junior may be tasked with simply deploying hardware and plugging it into some automation software. My fear is, how do I ensure that it's not going to do more damage than good?”
It’s tempting to design a network for Day Zero and deliver it on Day One, expecting it to carry on working, says Casemore: “It’s not just Day Zero and Day One. When you plan and design something, you have to deal with things like troubleshooting and remediation, and that closed loop.”
Automation has to work on Day N, he says: “So you're able to optimize change management, and ensure that the network is continually refined so that it produces the results that it needs to deliver for those applications that it supports.”
For Brody, the important thing is to have a reference architecture that determines how different clouds and services can be combined as needed.
White says it’s a matter of trying to build networks in a simple, modular way that can be automated: “Because there's a limit to how much you can hold in your head. And if you've made it too complex, you can't be flexible, because nobody can figure out how to make it work.”
How it works in practice
So far, so theoretical. But what happens when you want to actually deliver an automated network? Take, as an example, Apstra, the network automation system that Juniper acquired and now uses.
Apstra coined the term “intent-based” network for the jump from automation to intelligent automation, explains Juniper network engineer Mikko Kiukkanen: “You're describing the intent. What you want to do, not how you get there.”
Some tools automate tasks like IP address generation, but don’t verify them. An intent-based network will be based on a reference design or “graph” which describes what the network is meant to achieve. This is mapped onto a network that can be made from multiple vendors’ hardware.
“We generate the syntax, after validating that the configuration and the design is correct, and push the configuration to the switches,” he says. “This happens on day one, where you implement it, and hand it over to operations. After that, we do the day-to-day operations, which means the monitoring and troubleshooting side of things.”
The network behavior is generated from the network design, which is stored in a graph database on Day Zero, says Kiukkanen: “It's a complex data store, which is connected to a router. It gives us now a much, much more granular view into the data center.”
When the network is running, the graph runs in sync with the real network, he says: “Rather than querying devices and looking at log files in real time, we can query the graph, because it's a single source of truth.”
The automation runs on the “control plane”, the management interface of the switches, not the “data plane”, the traffic they carry: “This gives you flexibility to add equipment, because when you add things to it you're pushing things into the graph using a graphical interface.”
The system can continually probe whether the graph matches the intent, i.e. whether there is a fault or a failure.
The intents can include service levels, so if a network link needs to operate at no more than 90 percent capacity, the system will flag up when a change breaks that intent, he says.
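A service-level intent like that can be checked before a change is applied. The sketch below is a hypothetical illustration, not Apstra's implementation: the link names, capacities and the `violations` function are assumptions, and only the 90 percent ceiling comes from the example above.

```python
MAX_UTILIZATION = 0.90  # the intent: no link above 90 percent of capacity

def violations(links, extra_load_gbps):
    """Return the links a proposed change would push past the intent."""
    broken = []
    for name, (capacity, current) in sorted(links.items()):
        projected = current + extra_load_gbps.get(name, 0)
        if projected / capacity > MAX_UTILIZATION:
            broken.append(name)
    return broken

# Hypothetical links: name -> (capacity in Gbps, current load in Gbps)
links = {
    "spine1-leaf1": (100, 70),
    "spine1-leaf2": (100, 85),
}
# Hypothetical change: extra traffic each link would carry afterwards
proposed = {"spine1-leaf1": 15, "spine1-leaf2": 10}

print(violations(links, proposed))
```

Here the first link would end up at 85 percent and passes, while the second would reach 95 percent and is flagged, so the change can be held back before it touches a switch.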
“If there is an anomaly like a duplicate address, we flag it. And then you just hit one click, and it will say don't let it happen,” he says.
“We can make an alarm for an anomaly,” he says, and issue a trouble ticket automatically for the fix if human intervention is required.
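The duplicate-address anomaly is simple to picture in code. This is a minimal sketch with invented device names and addresses. It only shows the detection step, not the one-click remediation or ticketing Kiukkanen describes.

```python
from collections import defaultdict

def duplicate_addresses(assignments):
    """Map each address claimed by more than one device to the devices claiming it."""
    owners = defaultdict(list)
    for device, address in assignments:
        owners[address].append(device)
    return {addr: devs for addr, devs in owners.items() if len(devs) > 1}

# Hypothetical (device, address) pairs read from the network model
assignments = [
    ("leaf1", "10.0.10.1"),
    ("leaf2", "10.0.10.2"),
    ("leaf3", "10.0.10.1"),  # clashes with leaf1
]

for address, devices in duplicate_addresses(assignments).items():
    print("ANOMALY: duplicate address", address, "on", ", ".join(devices))
```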
“It’s like autonomous cars,” says Kiukkanen. “We want a self-driving or self-operating network. Are we there yet? Not quite. But we have the pieces.”
No going back
In the pandemic, network automation was put to the test, as thousands of users started to work from home: “An inflexible core data center architecture would not have allowed that.”
Intelligent networks have to operate autonomously, adjusting to deal with faults and surges in demand. “This is the coolest time when you're talking about intelligence and automation,” says Srikrishnan.
But it’s always going to be a limited kind of autonomy, he says: “I don't want to put the intention forward or the message forward that we're trying to build Skynet here with Intelligent Automation. It's a little different.”