The proxy-based service mesh is an emerging technology that simplifies building distributed micro-services systems by using specialized proxies to provide cross-cutting infrastructure functionality such as service discovery, load balancing, circuit breaking, metrics monitoring, distributed tracing and more at the container orchestration level. Because this boilerplate code no longer lives in your services, you are free to write them with any technology and any programming language.

Traditionally, if you wanted to build a distributed micro-services system, you had to find a set of components and frameworks that provide service discovery, load balancing and circuit breaking, and then craft your services specifically to work with these components. Often these components are more or less tied to a specific language or framework.

Netflix OSS, for example, is really easy to start with when used with Spring Cloud Netflix. Just by adding a few annotations to a Spring Boot app you are able to fire up a Zuul proxy and a Eureka server. By adding a different set of annotations you mark your micro-service as a Eureka client and there it is, registering to make itself available. If you need to call a downstream service you use Feign. And you can guard your downstream calls using Hystrix.

However, as soon as you leave the Java / Spring Boot realm, things become more complicated.

If you need to add a service written in C++ or Go, you need to build the integration with Netflix OSS yourself. That’s even more boilerplate and you have to do it each time you add a new language to the stack.

The Service Mesh

The rise of containers and container orchestration technologies has enabled a new kind of infrastructure that allows us to break free of discovery/load balancing/circuit breaking frameworks. This new piece of infrastructure is the “service mesh”, and when I say new, I mean that at the time of this writing it does not even have a Wikipedia page yet.

So what is it? A service mesh is an infrastructure layer (mainly a collection of proxies, one per logical service) that integrates with container orchestration solutions such as Docker Swarm or Kubernetes and provides service discovery, load balancing, circuit breaking, fault injection, security, monitoring, tracing and more out of the box in a non-intrusive way.

It does not care what technology or programming language you used to write your micro-services, as it operates at the container level. You can write your micro-services as simple HTTP servers in Java, C++, Rust, Go or Node.js; it really does not matter.

You can effectively think of a service mesh as infrastructure-level Aspect-Oriented Programming for your distributed, containerized app. The proxies in the service mesh act like an aspect in AOP. They wrap a containerized micro-service just like an AspectJ aspect can wrap and instrument a Java method, simplifying the system by separating cross-cutting concerns.
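If the AOP analogy is unfamiliar, a small Python decorator shows the same idea at function scale. This is only an illustration of the analogy, not of how a mesh is implemented: the names logged and get_price are made up, and the "aspect" here just records when the wrapped function is entered and left.

```python
import functools


def logged(calls):
    """Aspect-like wrapper: record entry and exit of the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            calls.append(("enter", func.__name__))
            try:
                return func(*args, **kwargs)
            finally:
                # Runs whether the wrapped function returned or raised.
                calls.append(("exit", func.__name__))
        return wrapper
    return decorator


calls = []


@logged(calls)
def get_price(item):
    # Business logic only; it knows nothing about the logging around it.
    return {"apple": 3}.get(item, 0)
```

The function body stays free of the cross-cutting concern, in the same way a micro-service stays free of discovery or tracing code when a proxy wraps it.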

Under the hood

Getting all this for free in a non-intrusive way is cool, but how does it work? Let’s take it one by one:

Service Discovery and Load Balancing

A service mesh will hook into the orchestration layer (Docker Swarm or Kubernetes) and get notified each time a container is started or stopped.

When the first container carrying a “service1” instance is started, the service mesh will create a proxy and apply an iptables configuration that catches all traffic to “service1” and manages it. As more instances of “service1” come up, the service mesh is notified and adds each new instance to the proxy’s configuration.

As traffic flows in, the proxy will provide:

  • service discovery by configuring the software-defined network to resolve host name “service1” to itself
  • load balancing by evenly distributing incoming requests to available service instances

These two features effectively replace Eureka and Ribbon in the Netflix OSS stack.
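To make the bookkeeping concrete, here is a minimal Python sketch of what such a proxy has to keep track of. The ServiceProxy class, its method names and the addresses are hypothetical, and round-robin is just one simple way to distribute requests evenly; a real mesh proxy does this at the network level rather than in application code.

```python
from itertools import count


class ServiceProxy:
    """Toy model of the per-service proxy: an instance list plus round-robin."""

    def __init__(self, service_name):
        self.service_name = service_name
        self.instances = []      # addresses of the instances currently running
        self._counter = count()  # ever-increasing request counter

    def on_container_started(self, address):
        # Assumed callback: the orchestrator reports a new instance.
        if address not in self.instances:
            self.instances.append(address)

    def on_container_stopped(self, address):
        # Assumed callback: the orchestrator reports a stopped instance.
        if address in self.instances:
            self.instances.remove(address)

    def pick_instance(self):
        # Round-robin over whatever instances are registered right now.
        if not self.instances:
            raise LookupError("no instance available for " + self.service_name)
        return self.instances[next(self._counter) % len(self.instances)]
```

With two registered instances, successive calls to pick_instance() alternate between them, and once one is reported as stopped all requests go to the remaining one.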

Circuit Breaking

Circuit breaking is a fail-fast mechanism. If the underlying service instance becomes slow and does not answer a call within a configured time, the circuit breaker will fail the request and return an appropriate error code to the client. The client can then retry and eventually reach a more responsive instance. This provides a much better user experience than leaving the client waiting on a slow instance.

Typically, the slow instance will be marked inactive and given some time to recover before receiving any traffic again.

In a service mesh, the proxy sitting in front of all instances of a service also acts as a circuit breaker. This effectively replaces the Hystrix circuit breaker from the Netflix OSS stack.
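The mechanism can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how any particular proxy implements it: the names CircuitBreaker, CircuitOpenError and call_through are hypothetical, the failure threshold and recovery time are arbitrary, and the timeout is only checked after the call returns, whereas a real proxy would abort the slow call.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the instance is marked inactive and the call is refused."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means closed: traffic flows normally

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Open: let a trial request through once the recovery time has passed.
        return self.clock() - self.opened_at >= self.recovery_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # mark the instance inactive


def call_through(breaker, func, timeout_seconds, clock=time.monotonic):
    """Call func() behind the breaker; too-slow calls count as failures."""
    if not breaker.allow_request():
        raise CircuitOpenError("instance is recovering, failing fast")
    started = clock()
    try:
        result = func()
    except Exception:
        breaker.record_failure()
        raise
    if clock() - started > timeout_seconds:
        breaker.record_failure()
        raise TimeoutError("call exceeded the configured time")
    breaker.record_success()
    return result
```

The clock parameter is there only so the behavior can be exercised without really waiting; by default both pieces use time.monotonic.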

Fault Injection

Distributed cloud native applications must be designed to be fault tolerant. The hardware on which your application runs in the cloud can fail at any moment. Machines can be taken out for scheduled maintenance. Any instance of any service may become unresponsive at any time. When such an incident takes down an instance of a downstream service, your application must handle the situation gracefully, degrading the user experience as little as possible or not at all.

But since such situations occur randomly in the cloud, it’s hard to reproduce them in a controlled environment and study their effect on the system, even though that would be a very useful thing to do.

To solve this issue the good people at Netflix came up with Chaos Monkey and the whole suite of related tools, but these are not easy to deploy and operate.

With a service mesh, the same functionality is achieved through proxy configuration. We can introduce two types of random perturbations per service:

  • delay a percentage of the incoming requests to observe the effects of increasing that service’s latency on the distributed system, and make sure the system as a whole can handle it
  • fail a percentage of the incoming requests with random error codes and make sure the system can survive that

Measuring system throughput and end-user perceived latency while varying these perturbations during typical workloads allows you to determine which downstream services are critical and which are not.
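As a rough illustration of the two perturbations, here is a Python sketch of a proxy step that either aborts, delays or simply forwards a request. Everything in it is an assumption made for the example: the FaultInjector name, its parameters, the list of error codes and the (status, body) response shape are hypothetical, and in a real mesh this behavior comes from proxy configuration, not from code you write.

```python
import random
import time


class FaultInjector:
    def __init__(self, delay_percent=0.0, delay_seconds=0.0, abort_percent=0.0,
                 error_codes=(500, 502, 503), rng=None, sleep=time.sleep):
        self.delay_percent = delay_percent
        self.delay_seconds = delay_seconds
        self.abort_percent = abort_percent
        self.error_codes = error_codes
        self.rng = rng or random.Random()
        self.sleep = sleep

    def handle(self, forward):
        """forward is a zero-argument callable returning (status, body)."""
        # Fail a percentage of requests with a randomly chosen error code.
        if self.rng.random() < self.abort_percent / 100.0:
            return self.rng.choice(self.error_codes), "injected fault"
        # Delay a percentage of the remaining requests before forwarding.
        if self.rng.random() < self.delay_percent / 100.0:
            self.sleep(self.delay_seconds)
        return forward()
```

Passing a seeded random.Random as rng makes a test run repeatable, which helps when comparing throughput and latency across different percentages.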

Rate Limiting, Quotas, Monitoring and Tracing

Because a service mesh proxy handles all the traffic to a service, it knows what goes right and what goes wrong, and it can enforce usage policies such as quotas and rate limiting.
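One common way to implement rate limiting is a token bucket, sketched below in Python. The article does not say which algorithm a given mesh uses, so treat this as one possible approach; the TokenBucket name and its parameters are hypothetical.

```python
import time


class TokenBucket:
    """Allow `rate_per_second` requests on average, with bursts up to `burst`."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = burst
        self.clock = clock
        self.tokens = float(burst)   # start with a full bucket
        self.updated_at = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.updated_at
        self.updated_at = now
        # Refill according to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: the proxy would reject this request
```

A proxy would typically keep one such bucket per client or per service and answer with an error such as HTTP 429 when allow() returns False.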

The proxy will notice every failure and SLA violation. This makes it the perfect place to monitor service performance.

Because each call from a front-facing micro-service all the way down to the last downstream service goes through the proxies, that’s also the perfect place to implement tracing.
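The sketch below shows, in Python, the kind of bookkeeping a proxy could do for tracing: reuse the trace ID if the request already carries one, start a new span, and remember the caller's span as its parent. The header names are the B3 headers used by Zipkin; the function name, the spans list and the dictionary layout are hypothetical, and the sketch assumes each service passes the incoming headers along on its downstream calls.

```python
import uuid

TRACE_HEADER = "X-B3-TraceId"
SPAN_HEADER = "X-B3-SpanId"
PARENT_HEADER = "X-B3-ParentSpanId"


def trace_through_proxy(headers, service_name, spans):
    """Return the headers to forward and record one span in `spans`."""
    outgoing = dict(headers)
    # Reuse the caller's trace ID, or start a new trace at the edge.
    trace_id = outgoing.get(TRACE_HEADER) or uuid.uuid4().hex
    parent_id = outgoing.get(SPAN_HEADER)  # the caller's span, if any
    span_id = uuid.uuid4().hex[:16]
    outgoing[TRACE_HEADER] = trace_id
    outgoing[SPAN_HEADER] = span_id
    if parent_id is not None:
        outgoing[PARENT_HEADER] = parent_id
    spans.append({
        "service": service_name,
        "trace_id": trace_id,
        "span_id": span_id,
        "parent_id": parent_id,
    })
    return outgoing
```

Chaining two calls, one for a front-facing service and one for a downstream service, produces two spans that share a trace ID, with the second span pointing at the first as its parent.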

A good service mesh could automatically install monitoring and tracing infrastructure pieces such as Prometheus and Zipkin, along with visualization tools such as Grafana.

The Current Service Mesh Landscape

Today there are two players in the service mesh area: Linkerd and Istio.

Linkerd came first and is the more mature, production-proven solution. It is best known for being used in production by Monzo, a new online-only UK bank.

Istio, on the other hand, is the emerging challenger. It has not reached production quality yet but is moving very fast and will be supported on Google Cloud Platform from the beginning. It is built on top of the Envoy proxy developed at Lyft.

Lead Architect at LiquidShare, building a cloud native, blockchain enabled, financial services SaaS platform.