What Service Meshes Are, and Why Istio Leads the Pack
Cloud-native technologies like containers and microservices have made infrastructure dramatically more complex. Service meshes are there to help.
In tech, one thing always leads to another.
Virtual machines, for example, led to containers, and containers led to cloud-native technologies like microservices to better harness the portability and agility containers had brought into play. Cloud-native infrastructure, with applications and services being scaled up and down at the drop of a hat and across platforms, then necessitated even newer technologies -- like network proxies and service meshes, led by Envoy and Istio.
What Is a Service Mesh?
Service meshes were developed to deal with the great increase in complexity that came along with containers, microservices, and other cloud-native technologies, which spread workloads traditionally handled by large monolithic applications across multiple servers and even multiple clouds. While this has been a boon for DevOps teams -- development and deployment don't have to be put on hold while modifications are made elsewhere in an application -- it has created a new set of issues.
"The challenge that this brings that wasn't present before is that the network is now an integral part of our application," Zachary Butcher, a founding engineer at Tetrate, a San Francisco-based startup that's a major contributer to Istio, explained to Data Center Knowledge. "Before, when we were in one big monolith, we didn't have a lot of network communication, because we're all in one program, so we can just communicate directly. Now we have to communicate over the network, and this introduces all kinds of failure modes that simply didn't exist in our applications before."
One example of these new problems is endpoint discovery -- how one service finds and connects to another, Butcher said. Another is retries, which address connection failures without overwhelming the server. Yet another is deploying a new version of a service on a live system. There are also new security issues, since the old method of protecting the network perimeter with a firewall no longer applies in today's dynamic, orchestrator-enabled environments.
"What a service mesh does is provide a new layer that sits between your application and the network, and it provides all this functionality," he explained. "It provides things like service discovery, fine-grained traffic control, and things like per-request retries."
In addition, a mesh can generate metrics to give DevOps teams a look into the inner workings of their system. Those metrics are especially useful when troubleshooting.
"Because it's pretending to be this network between the real network and the application, we can see all the stuff that's going on, all the network traffic, and I can generate metrics on behalf of your application that describe the traffic that we're seeing," Butcher explained. "In particular, I can generate what we call the RED metrics, the rate of Requests, the rate of Errors, and the Duration or the latency of each request. Those are the critical black-box metrics to have about a service to be able to assess its health, and those can be generated automatically out of the box."
Also included in the service mesh toolkit is application identity, which Butcher said is essential for taking full advantage of the portability that containers and microservices offer.
"Before, we had network-level identity. Therefore the policy that we wrote about what things can communicate is in terms of the network: this subnet can talk to that subnet through my firewall, but those other subnets can't," he said. "Now, instead of doing it based on subnets we can start to give applications their own proper identity that's presented at runtime, and that identity can be used to write policy that doesn't depend on the network. That means that I can start to allow portability where I couldn't before, because I can move workloads around. My policies move with me, where network-based policies don't move when I move."
Although service meshes have only been around for about three years, several open source meshes are already under development, most notably Linkerd, a lightweight mesh originally developed by Buoyant and now a Cloud Native Computing Foundation project; Consul from HashiCorp; and Istio, an open source project controlled and managed by Google.
What's the Big Deal About Istio?
Although usage statistics for the various meshes aren't readily available, Istio appears to be leading the way in terms of adoption -- at least if media ink is any indication. Built on the back of Envoy, CNCF's network proxy that originally came out of Lyft, Istio reached its production-ready 1.0 release in July 2018 and is currently up to version 1.3.3. A beta of 1.4 just came out.
The project is controlled by Google, which recently said it has no immediate plans to contribute it to a foundation -- a move that would bring the project the open governance many open source advocates consider important. It's currently governed by a steering committee whose members are drawn solely from Google's and IBM's ranks, and directed by a technical oversight committee that adds VMware to the mix.
Although many open source projects controlled by a single entity have had difficulty building communities of outside developers, Istio hasn't had that problem, having attracted a diverse group of developers. Lin Sun, IBM's technical lead for Istio, who holds seats on both committees, told Data Center Knowledge that the project's developer community has over 400 members representing more than 300 companies. All members are active contributors, she said, pointing out that becoming a member requires participating in at least one pull request.
Sun and Butcher agree that much of the project's appeal has to do with its tight integration with Envoy, which supplies the data plane that does most of the heavy lifting by using "sidecars," small utility containers automatically deployed alongside application containers to handle service discovery, health checking, routing, load balancing, and the like.
Istio provides the control plane, automating configuration of the data plane, a task that would otherwise be tedious and time-consuming.
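On Kubernetes, that pairing of application and sidecar is wired up automatically. As a minimal sketch, marking a namespace for Istio's sidecar injection is a single label; every pod created in it afterward gets an Envoy proxy container alongside the application:

```yaml
# Label the "default" namespace for automatic sidecar injection.
# Pods deployed here afterward will show two containers: the
# application itself plus the istio-proxy (Envoy) sidecar.
apiVersion: v1
kind: Namespace
metadata:
  name: default
  labels:
    istio-injection: enabled
```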
"Istio tends to be the most featureful of the existing meshes on the market, and that's largely due to the fact that it's the longest-lived project of the ones using Envoy," Butcher said. "Linkerd has been around for longer than Istio has, but they went and totally rewrote their system into a version-two a while back, and they opted to build their own proxy."
He said Envoy has become something of a standard for supplying the basic framework for service meshes. Most newer meshes are also Envoy-based, because Envoy has more functionality than other proxies.
"Istio effectively provides 100 percent of Envoy's feature set," he added. "You'll find none of the other Envoy-based meshes today provide a feature parity."
Both Sun and Butcher also agree that the full-featuredness of Envoy and Istio makes the software more than a little difficult to use. Butcher said most IT departments shouldn't rush to download and deploy upstream Istio out of the box but should instead consider looking for a downstream vendor that has incorporated it into an implementation that automates or hides some of the complexity from users.
"Istio does not try, as a goal, to hide the complexity of Envoy or to hide the complexity of the underlying system," he said. "This is one of the reasons that Istio is able to ship all of the features that Envoy exposes. The pain point of course is that then you have all the complexity of all the features that Envoy exposes. There are some third parties that are starting to attempt to layer on top to hide some of this complexity."
Butcher, who spent three years as a software engineer at Google before joining Tetrate, compared this with Kubernetes, which many users have found difficult to master out of the box but which has become invaluable when packaged by downstream vendors, with some configurations and features automated and others set to defaults and hidden from the user.
"One thing that's kind of frequently missed in these projects is that in Google we had management planes for these. In open source there are no management planes for these," he said. "Humans don't work with the control plane, and part of the reason that Kubernetes is painful -- part of the reason Istio is painful -- is that humans are coming and trying to interact with these systems directly, but that was never really their intended use case. Their use case was always for other systems to interact with them, but that was never really something that Google spelled out when it was pushing these systems."
Not that Istio developers aren't attempting to remove some of those pain points. Sun, for example, has established a user experience working group within the community. "It's kind of a home for users to share with us some of their experiences and for people to share the projects that they are building to improve Istio's user experience, in addition to improving the user experience in the core," she said about the group.
Input from the working group, launched earlier this year, has already found its way into Istio 1.3, making the software easier to use and understand without having to pore through pages of documentation -- what Sun called a "smooth onboarding experience."
Not for Containers Only
Although Istio from its inception was designed to run on Kubernetes, it's also used in deployments that include virtual machines running alongside containers. Butcher said it can also be useful for companies dependent on monolithic legacy applications that are just starting to move to cloud-native infrastructure.
"I would argue that you should be thinking about service mesh from the very outset of your cloud strategy," he said. "As you're getting your architecture together, as you're nailing down what your cloud plan looks like and how you're going to execute that migration, that is where I think that you should start to pull in the service mesh functionality. This is where it's going to start to provide value in being able to do things like quickly shift traffic from your new container-based deployment back into the old legacy stuff that you know works, by bridging the communication that has to occur between the new stuff and the old stuff."