Microservices at scale: A complexity management issue
The benefits of microservices have been touted for years, and their popularity is clear when you consider the explosion in the use of technologies such as Kubernetes over the last few years. Judging by the number of successful implementations, that popularity seems deserved.
For example, according to a 2020 survey by O’Reilly, 92% of respondents reported some success with microservices, with 54% describing their experience as “mostly successful” and under 10% describing a “complete success.”
But building and managing all of these smaller units of code adds a lot of complexity to the equation, and getting that management right is essential to achieving those successes. Developers can create as many microservices as they need, but those services must be well governed, especially as their number grows.
According to Mike Tria, head of platform engineering at Atlassian, there are two schools of thought when it comes to managing the proliferation of microservices. One is to keep the number of microservices to a minimum so that developers don’t have to think about concerns like scale and security.
“Every time they’re spinning up a new microservice they try to keep them small,” Tria said. “That works fine for a limited number of use cases and specific domains, because what will happen is those microservices will become large. You’ll end up with, as they say, a distributed monolith.”
The other option is to let developers spin up microservices whenever they want, which requires some additional considerations, according to Tria. Incorporating automation into the process is the key to ensuring this can be done successfully.
“If every time you’re building some new microservice, you have to think about all of those concerns about security, where you’re going to host it, what’s the IAM user and role that you need access to, what other services can it talk to—If developers need to figure all that stuff out every time, then you’re going to have a real scaling challenge. So the key is through automating those capabilities away, make it such that you could spin up microservices without having to do all of those things,” said Tria.
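The provisioning concerns Tria lists can be folded into a scaffolding step so that a developer declares only what is unique to their service and the platform fills in the rest. The sketch below is purely illustrative (the field names, defaults, and `scaffold` function are hypothetical, not Atlassian’s actual tooling), but it shows the shape of the idea: a minimal spec expands into a manifest with a generated IAM role, a default hosting target, and a deny-by-default network policy.

```python
# Hypothetical sketch of "automating those capabilities away": the developer
# supplies a name, an owning team, and allowed peers; the platform generates
# the cross-cutting concerns. All names and defaults here are illustrative.
from dataclasses import dataclass, field


@dataclass
class ServiceSpec:
    name: str
    team: str
    allowed_peers: list = field(default_factory=list)


def scaffold(spec: ServiceSpec) -> dict:
    """Expand a minimal spec into a full deployment manifest."""
    return {
        "name": spec.name,
        "owner": spec.team,
        "iam_role": f"svc-{spec.name}-role",   # generated, least-privilege
        "hosting": "default-compute-cluster",  # platform picks placement
        "network_policy": {
            "ingress": spec.allowed_peers,     # deny-by-default: empty list
        },
    }


manifest = scaffold(ServiceSpec(name="billing", team="payments",
                                allowed_peers=["checkout"]))
print(manifest["iam_role"])  # svc-billing-role
```

With this kind of automation in place, spinning up a new microservice no longer requires each team to reason about IAM, hosting, and network access from scratch.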
According to Tria, the main benefits of automation are scalability, reliability, and speed. Scalability, because new microservices can be created without burdening developers. Reliability, because failures are contained within individual microservices, making the system as a whole more resilient. And speed, because each team can build microservices at its own pace.
Atlassian built its own tool for managing its microservices, but Tria recommends starting small with an off-the-shelf tool. This lets you get to know your microservices and figure out your needs, rather than trying to predict those needs and buying an expensive solution that has features you don’t need or lacks features you do.
“It’s way too easy with microservices to overdo it right at the start,” Tria said. “Honestly, I think that’s the mistake more companies make getting started. They go too heavy on microservices, and right at the start they throw too much on the compute layer, too much service mesh, Kubernetes, proxy, etc. People go too, too far. And so what happens is they get bogged down in process, in bureaucracy, in too much configuration when people just want to build features really, really fast.”
In addition to incorporating automation, there are a number of other ways to ensure success with scaling microservices.
Because of their nature, microservices tend to raise additional security concerns, according to Tzury Bar Yochay, CTO and co-founder of application security company Reblaze. Traditional software architectures take a castle-and-moat approach with a limited number of ingress points, which makes it possible to secure just the perimeter with a single security solution.
Microservices, however, are independent entities, each of which may be Internet-facing. “Every microservice that can accept incoming connections from the outside world is potentially exposed to threats within the incoming traffic stream, and it has other security requirements as well (such as integrating with authentication and authorization services). These requirements are much more challenging than the ones typically faced by traditional applications,” said Bar Yochay.
According to Bar Yochay, new and better approaches to securing cloud-native architectures are constantly being invented. Service meshes, an addition to microservices architectures that enables services to communicate with each other, are one example: traffic filtering can be built right into the mesh itself, blocking hostile requests before the microservice ever receives them. Beyond added security, service meshes offer benefits like load balancing, service discovery, failure recovery, and metrics.
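The filtering idea Bar Yochay describes can be pictured as a sidecar-style check that runs before a request reaches the service. The toy sketch below is an assumption-laden illustration, not a real WAF or any particular mesh’s API: the blocked patterns and function names are invented for the example.

```python
# Minimal sketch of mesh-level traffic filtering: a sidecar-style check
# runs before the request ever reaches the microservice. The patterns
# below are toy examples, not production filtering rules.
BLOCKED_PATTERNS = ("../", "<script", "' OR 1=1")


def sidecar_filter(request: dict) -> bool:
    """Return True if the request may be forwarded to the service."""
    payload = request.get("path", "") + request.get("body", "")
    return not any(p in payload for p in BLOCKED_PATTERNS)


def handle(request: dict, service) -> str:
    if not sidecar_filter(request):
        return "403 Forbidden"  # blocked in the mesh; service never sees it
    return service(request)


orders_service = lambda req: "200 OK"
print(handle({"path": "/orders", "body": ""}, orders_service))         # 200 OK
print(handle({"path": "/../etc/passwd", "body": ""}, orders_service))  # 403 Forbidden
```

Because the filter lives in the mesh layer rather than in each service, every microservice behind it gets the same protection without any per-service code.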
These advantages of service meshes are most apparent when they are deployed across a large number of microservices, but smaller architectures can benefit from them as well, according to Bar Yochay.
Of course, the developers in charge of these microservices are also responsible for security, but there are a lot of challenges in their way. For example, there can often be friction between developers and security teams because developers want to add new features, while security wants to slow things down and be more cautious. “As more apps and services are being maintained, there are more opportunities for these cultural issues to arise,” Bar Yochay said.
To alleviate the friction between developers and security, Bar Yochay recommends investing in developer-friendly security tools for microservices. Many solutions on the market today allow security to be built directly into containers or into service meshes, he said. Security vendors are also advancing their use of technology, for example by applying machine learning to behavioral analysis and threat detection.
“We’ve seen microservices turn into monolithic microservices and you get kind of a macroservice pretty quickly if you don’t keep and maintain it and keep on top of those things,” said Bob Quillin, chief ecosystem officer at vFunction, a company that helps migrate applications to microservices architectures.
Dead code is one thing that can quickly lead to microservices that are bigger than they need to be. “There is a lot of software where you’re not quite sure what it does,” said Quillin. “You and your team are maintaining it because it’s safer to keep it than to get rid of it. And that’s what I think that eventually creates these larger and larger microservices that become almost like monoliths themselves.”
Rather than having individuals own a microservice, Tria recommends that a team own it.
“Like in the equivalent of it takes a village, it takes a team to keep a microservice healthy, to upgrade it to make sure it’s checking in on its dependencies, on its rituals, around things like reliability and SLO. So I think the good practices have a team on it,” said Tria.
For example, Atlassian has about 3,000 developers and roughly 1,400 microservices. Assuming teams of five to 10 developers, this works out to every team owning two or three microservices, on average, Tria explained.
One of the benefits of microservices, the freedom to be polyglot, is also one of the downsides. According to Tria, one of Atlassian’s initial attractions to microservices was that they could be written in any language.
“We had services written in Go, Kotlin, Java, Python, Scala, you name it. There’s languages I’ve never even heard of that we had microservices written in, which from an autonomy perspective and letting those teams run was really great. Individual teams could all run off on their own and go and build their services,” said Tria.
However, this flexibility made it difficult to transfer services and developers between teams. In addition, a microservice written in a particular language needed developers familiar with that language to maintain it. Eventually, Tria’s team realized they needed to standardize on two or three languages.
Another recommendation Tria makes, based on his team’s experience, is to understand how much you are asking of the network, and to invest in things like service discovery early on. “[At the start] all of our services found each other just through DNS. You would reach another service through a domain name. What that did is it put a lot of pressure on our own internal networking systems, specifically DNS,” said Tria.
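The pressure Tria describes comes from every service-to-service call triggering a name resolution. A dedicated discovery layer relieves that by letting clients consult a registry and cache the answer. The sketch below is a simplified illustration under assumed names (`Registry`, `CachingClient`), not any specific discovery product:

```python
# Illustrative sketch of why a service-discovery layer relieves DNS:
# instead of resolving a name on every call, clients hit the registry
# once per TTL window and serve subsequent lookups from a local cache.
import time


class Registry:
    def __init__(self):
        self._endpoints = {}  # service name -> list of addresses
        self.lookups = 0      # count of registry hits (stand-in for DNS load)

    def register(self, name, addr):
        self._endpoints.setdefault(name, []).append(addr)

    def resolve(self, name):
        self.lookups += 1
        return self._endpoints[name]


class CachingClient:
    def __init__(self, registry, ttl=30.0):
        self.registry, self.ttl, self._cache = registry, ttl, {}

    def resolve(self, name):
        entry = self._cache.get(name)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]                  # served from local cache
        addrs = self.registry.resolve(name)  # one registry hit per TTL window
        self._cache[name] = (addrs, time.monotonic())
        return addrs


reg = Registry()
reg.register("billing", "10.0.0.5:8080")
client = CachingClient(reg)
for _ in range(100):
    client.resolve("billing")
print(reg.lookups)  # 1 -- the other 99 calls were served from cache
```

A hundred calls produce a single registry lookup instead of a hundred DNS queries, which is the kind of load reduction early investment in discovery buys.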
Figuring out a plan for microservices automation at Atlassian
Atlassian’s Tria is a proponent of incorporating automation into microservices management, but his team had to learn that lesson the hard way.
According to Tria, when Atlassian first started using microservices back in early 2016, it had about 50 to 60 microservices total and all of the microservices were written on a Confluence page. They listed every microservice, who owned it, whether it had passed SOC2 compliance yet, and the on-call contact for that microservice.
“I remember at that time we had this long table, and we kept adding columns to the table and the columns were things like when was the last time a performance test was run against it, or another column was what are all the services that depend on it? What are all the services it depends on? What reliability tier is it for uptime? Is it tier one where it needs very high uptime, tier two where it needs less? And we just kept expanding those columns.”
Once the table hit 100 columns, the team realized that wouldn’t be maintainable for very long. Instead, they created a new project to take the capabilities they had in Confluence and turn them into a tool.
“The idea was we would have a system where when you build a microservice, the system essentially registers it into a central repository that we have,” said Tria. “That repository has a list of all of our services. It has the owners, it has the reliability tiers, and anyone within the company can just search and look up a service. And we made the tool pretty pluggable for when we have new capabilities that we’re adding to our services.”
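The registry Tria describes is, at its core, the old Confluence table turned into a queryable data structure whose “columns” can be plugged in over time. The sketch below is a rough, hypothetical rendering of that shape (the class and field names are invented for illustration, not Atlassian’s tool):

```python
# Rough sketch of a central service catalog: each service registers with
# an owner and a reliability tier, and new "columns" (capabilities) can
# be plugged in later without changing the core. Names are illustrative.
class ServiceCatalog:
    def __init__(self):
        self._services = {}
        self._plugins = {}  # capability name -> function(record) -> value

    def register(self, name, owner, tier):
        self._services[name] = {"name": name, "owner": owner, "tier": tier}

    def add_capability(self, capability, fn):
        """Plug in a new column, e.g. last performance-test date."""
        self._plugins[capability] = fn

    def lookup(self, name):
        record = dict(self._services[name])
        for cap, fn in self._plugins.items():
            record[cap] = fn(record)  # computed columns, added on read
        return record


catalog = ServiceCatalog()
catalog.register("search", owner="team-alpha", tier=1)
catalog.add_capability("needs_high_uptime", lambda r: r["tier"] == 1)
print(catalog.lookup("search")["needs_high_uptime"])  # True
```

Where the Confluence table needed a new hand-maintained column for each concern, here a capability is one registered function, which is what keeps the tool maintainable as the column count grows.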