Perhaps you have heard the term service mesh in your recent conversations with co-workers or at a tech conference, or seen it in passing on the internet, and wondered what all the fuss is about. It's certainly one of the buzzwords in today's technological landscape.
Microservices have become a cornerstone of modern application development. As your use of microservices grows, you begin to notice cross-cutting functionality that must be implemented in each service: security, retries, observability, service discovery, and tracing. You could encapsulate all of that functionality in a common set of libraries that every service shares, but then a team within your organization would have to own and maintain them, adding to your existing technical debt. Such libraries would also not be language-agnostic: you would need to develop and maintain a separate one for each language used in your organization.
Wouldn't it be nice to have such an abstraction provided without incurring additional technical debt? Service mesh to the rescue.
What is a service mesh?
A service mesh simplifies the process of connecting, securing, and monitoring microservices. It is an abstraction layer that manages service-to-service communication, security, tracing, observability, and resilience in modern, cloud-native applications. Providing this abstraction decouples service-to-service communication from the applications themselves, enabling dynamic, predictable configuration at runtime.
The mesh is built with resiliency in mind, assuming the underlying infrastructure and network fabric on which the services run will fail. When they do, automatic retry attempts will be made on your service's behalf. In recent years, it has become a critical component of a cloud-native stack as more and more Fortune 500 companies have embraced it in production.
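To make the retry behavior concrete, here is a minimal Python sketch of the kind of retry-with-deadline logic a mesh proxy performs on a service's behalf. The function name, backoff parameters, and the use of `ConnectionError` as the failure signal are illustrative assumptions, not any particular mesh's implementation.

```python
import time

def call_with_retries(send, request, deadline_s=1.0, base_delay_s=0.05):
    """Retry a failed call on the service's behalf until the deadline elapses.

    `send` is any callable that performs the network call; the service itself
    never sees the retries. (Hypothetical sketch, not a real mesh API.)
    """
    start = time.monotonic()
    attempt = 0
    while True:
        attempt += 1
        try:
            return send(request)
        except ConnectionError as exc:
            # Exponential backoff between attempts, capped by the deadline.
            delay = base_delay_s * (2 ** (attempt - 1))
            if time.monotonic() - start + delay > deadline_s:
                raise TimeoutError(
                    f"deadline exceeded after {attempt} attempts"
                ) from exc
            time.sleep(delay)
```

Because the retry loop lives in the proxy rather than the application, every service in the mesh gets the same behavior without any library dependency.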
Service mesh functionality rests on two core components: the control plane and the data plane. The control plane encapsulates all cross-cutting concerns and configuration. It communicates with the data plane, which consists of a lightweight network proxy (often referred to as a sidecar) deployed alongside your service. Each service instance is co-located with a sidecar proxy that handles all communication with the control plane and all traffic to and from the service.
This brings about a big win: existing applications and services can immediately benefit from service mesh abstraction, since no additional code needs to be written to take advantage of it.
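The "no code changes" property can be sketched in a few lines of Python. The `Sidecar` class, the identity-header check, and the metric names below are all hypothetical; the point is only that the service function stays untouched while the proxy layered around it enforces policy and records telemetry.

```python
class Sidecar:
    """Proxy co-located with a service instance; adds cross-cutting
    behavior without any change to the service's own code.
    (Conceptual sketch, not a real mesh component.)"""

    def __init__(self, service_handler):
        self.handler = service_handler          # the unmodified service
        self.metrics = {"requests": 0, "errors": 0}

    def handle(self, request):
        # Hypothetical control-plane policy: reject calls without an identity.
        if "identity" not in request:
            self.metrics["errors"] += 1
            raise PermissionError("unauthenticated request rejected by sidecar")
        self.metrics["requests"] += 1
        return self.handler(request)

# The existing service knows nothing about the mesh:
def service_a(request):
    return {"status": 200, "body": "hello from A"}

proxy = Sidecar(service_a)
```

All traffic enters through `proxy.handle`, so authentication and metrics come for free; `service_a` would run identically outside the mesh.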
A worked example
Let's take a close look at an example to illustrate a simple flow through a service mesh. We will assume an application that comprises three services: A, B, and C. A relies on communications with B and C to fulfill certain application functionality.
When a request arrives for service A through the service mesh:
- The service mesh uses dynamic routing rules to decipher the intended service target. Should the traffic be routed to the service in our local datacenter or the copy located in our cloud deployment? Which version of that service should fulfill the request? All those routing rules are configurable at runtime without needing to restart or redeploy the service.
- Once the target service is determined, an instance is selected from the pool via service discovery, weighing factors derived from metrics the mesh has observed on recent requests. For example, the chosen instance is likely the one that will return the fastest response.
- The request is sent to that instance, and telemetry about the call, such as latency, response code, error details, and 99th-percentile timings, is published to the metrics server.
- The request is automatically retried if the instance is unresponsive, fails, or is down, until the request deadline has elapsed. If an instance errors consistently, the service mesh removes it from the pool and periodically probes it before returning it to rotation.
- If A must call services B or C to satisfy the in-flight request, the call is made via the sidecar, which helps reduce network chattiness and minimize hops, providing more reliable service-to-service communication.
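The first two steps above, runtime-configurable routing followed by metrics-driven instance selection, can be sketched as follows. The routing table, the addresses, and the p99 figures are invented for illustration; a real mesh would populate them from the control plane and from observed telemetry.

```python
# Routing table: configurable at runtime, with no redeploy of service A.
# Keys are (service, version); values are the instance pool for that target.
ROUTES = {
    ("service-b", "v1"): ["192.168.1.9"],            # local datacenter
    ("service-b", "v2"): ["10.0.0.5", "10.0.0.6"],   # cloud deployment
}

# Observed 99th-percentile latency (ms) per instance, fed by recent telemetry.
OBSERVED_P99_MS = {"10.0.0.5": 42.0, "10.0.0.6": 18.0, "192.168.1.9": 95.0}

def pick_instance(service, version):
    """Resolve a routing rule, then pick the instance most likely to
    respond fastest based on observed latency. (Hypothetical sketch.)"""
    pool = ROUTES[(service, version)]
    return min(pool, key=lambda inst: OBSERVED_P99_MS.get(inst, float("inf")))
```

Updating `ROUTES` at runtime, say, shifting `service-b` traffic from `v1` to `v2`, changes where requests land without restarting or redeploying any service, which is the property the list above describes.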
The CNCF landscape
Several projects in the Cloud Native Computing Foundation (CNCF) landscape and the wider cloud-native ecosystem aim to address this need. The most popular include Istio, Linkerd, Kuma, Consul, and Open Service Mesh. Each comes with different trade-offs in performance, operational complexity, and capability.
In an upcoming post, we will explore some of these implementations, how to apply them to an existing application, and the pros and cons.