Observability in Microservices: A Future to Get Excited About

Posted by Tomasz Kruszewski

Microservices have a problem: observability.

Microservice architecture has revolutionised modern applications. It simplifies builds, allows for dynamic scaling that takes full advantage of the cloud, and supports the incremental, continuous integration and delivery (CI/CD) approach of DevOps.

Microservices are so popular they are associated with their own buzzword: cloud-native.

Microservices achieve all of these wonderful outcomes by disaggregating applications into component pieces using containers or VMs — isolating each from its surrounding environment. Containerisation modularises applications, allowing components to be switched on and off, and new pieces bolted on.

The problem is that the disaggregation that makes microservices useful is also the source of their weakness: it creates complex webs of services that traditional monitoring tools cannot see into.

However, the future isn't so bleak. Here, we are going to explain what has changed in microservice observability and how it will impact your approach to DevOps.

The centrality of observability and microservices to DevOps

Before jumping into the observability crisis in microservices, it is worth taking a step back to understand why observability matters so much in the first place.

Modern application development and operations rely on DevOps: the collaboration of these two functions in a single, harmonious process. Central to the success of DevOps is the purposeful use of production performance data to feed information back into development, reducing errors and driving innovation based on real-time insights into how customers actually use the application.

DevOps also depends on rapid release cycles, prioritising incremental releases of small changes. This incremental approach is part of why containers have grown so rapidly in popularity: they are effectively an architectural choice optimised for modern development strategies. But with leading companies like Amazon now releasing code every 11.6 seconds on average, pre-production deployment mechanisms and test tools struggle to keep up.
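To put that release rate in perspective, a quick back-of-the-envelope calculation helps. It assumes, purely for illustration, that releases are spread evenly across a 24-hour day, which an average does not guarantee:

```latex
\frac{86\,400\ \text{s/day}}{11.6\ \text{s/release}} \approx 7\,448\ \text{releases per day} \quad (\approx 310\ \text{per hour})
```

At that pace, no manual pre-production gate can inspect every change, which is why the feedback has to come from production itself.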

To be truly effective and drive innovation without creating risk, DevOps needs observability over production environments, both to keep operations running smoothly and to capture the data it relies on to drive continuous code releases.

Limitations of legacy tools and methods for observability in microservices

Keeping tabs on how applications are performing in production is not a new requirement. What has changed is that DevOps demands better access to higher-quality production data, while microservices make acquiring that data extremely difficult. The two legacy approaches most widely used to collect this data are application performance monitoring (APM) tools and logging.

Application performance monitoring (APM)

APM has become a large and important market in recent decades. Although APM tools offer some benefits, they can only tell you how the application is performing, not what is happening inside it from moment to moment.

Knowing how your applications are performing is critical to identifying the existence of a failure. The value of this should not be understated. If you can’t identify when an application has failed, users will be the first to know, damaging your brand’s credibility.

Performance monitoring can also provide valuable insights into system and application bottlenecks. However, without real information on why your application has failed, root-cause analysis is impossible. Not only can this cause unacceptable downtime, but you will also miss the level of detail that is so valuable to a modern, DevOps-enabled, feedback-driven approach to application development.

Log-based observability

Getting down into the details has traditionally been the role of logs. Given enough time, logs can give you limited visibility into your application when it behaves incorrectly or, in the worst case, falls over, and even then only partially. Microservices make this far harder. By disaggregating and decontextualising every segment of your application, they drive up the number of places log-based tools need to look.

In microservice production environments, logging confronts billions of instructions per second, all isolated in virtualised environments. There is too much going on to even log, much less actually assess: modern applications can generate tens of thousands of log entries or more each minute.

Traditional logs also fail to provide the real context around events. Teams are stuck making best guesses at what actually happened based on the decontextualised fragments of information they are delivered.

With logs alone, developers are left trying to reproduce errors, a process that can tie up several people for days or weeks. It involves guesswork and waiting, and often ends in Band-Aid fixes that add technical debt where real solutions are needed.

The development of true observability tools for microservices

Software flight recorders (record & replay) have gained some popularity in testing as a way of providing more context and flexibility than traditional log-based debugging. They allow developers to step back and forth through run-time actions, seeing every detail of what went wrong.
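The core idea behind record & replay can be illustrated with a deliberately tiny sketch. It assumes that the only sources of nondeterminism are the system clock and a random-number generator, and the `Recorder`, `Replayer` and `handle_request` names are hypothetical; real flight recorders capture far more state, at a much lower level, than this. While recording, every nondeterministic value is logged as it is produced; during replay, the logged values are fed back in the same order, so the run reproduces exactly.

```python
import random
import time
from typing import Any, Callable, List


class Recorder:
    """Hypothetical recorder: logs the result of every nondeterministic call."""

    def __init__(self) -> None:
        self.log: List[Any] = []

    def call(self, source: Callable[[], Any]) -> Any:
        value = source()        # ask the real environment (clock, RNG, ...)
        self.log.append(value)  # remember exactly what it returned
        return value


class Replayer:
    """Hypothetical replayer: returns recorded values in order instead of
    touching the real environment."""

    def __init__(self, log: List[Any]) -> None:
        self._log = list(log)
        self._pos = 0

    def call(self, source: Callable[[], Any]) -> Any:
        value = self._log[self._pos]  # `source` is ignored during replay
        self._pos += 1
        return value


def handle_request(env) -> str:
    """Hypothetical service logic whose outcome depends on nondeterministic inputs."""
    started = env.call(time.time)
    roll = env.call(random.random)
    if roll < 0.1:
        return f"error at {started:.3f} (roll={roll:.3f})"
    return f"ok at {started:.3f} (roll={roll:.3f})"


# Record a "production" run...
recorder = Recorder()
original = handle_request(recorder)

# ...then replay it later: the same inputs yield the same outcome.
replayer = Replayer(recorder.log)
replayed = handle_request(replayer)

print(original)
print(replayed)
```

Because replay never consults the real clock or random-number generator, a failure seen once in production can be re-run as many times as needed, which is what makes stepping through it afterwards possible.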

Although valuable, record & replay has had limited impact to date because of the slowdown it introduces at run time. This creates challenges in testing and rules out deploying record & replay technology in production entirely. Recent changes to how next-generation software flight recorders are implemented, however, are changing that.

Software flight recorders (record & replay) in production

The breakthrough is software flight recorder technology that is lightweight enough to deploy in production and requires no changes to your normal DevOps workflows: no run-time agents and no special hoops to jump through. The key shift is away from observability platforms that instrument at run time and towards doing the heavy lifting at compile time, working with compiler optimisations to deliver the contextualised, detailed information of traditional record & replay tools at a fraction of the run-time performance cost.

Technology firms like RevDeBug have cut the performance hit of software flight recorders from a slowdown of 10x, 100x or even more down to 10% overhead out of the box, with further optimisations bringing the impact close to zero.

A self-healing DevOps pipeline

Bringing record & replay capabilities into production environments has opened the door to even more possibilities. For example, monitoring tools can display heat maps of all global production deployments, overlaid with real-time error data and access to a 100% reproducer for each error.

The granular data from software flight recorders enables automated rollbacks, rapid root-cause analysis and automated redeployment, and this entire process can be completed within minutes. The first part, identifying the problem and mitigating downtime, happens within milliseconds. Teams can then start work on a solution immediately, using contextualised, root-cause-informed data.
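The automated-rollback step can be sketched as a simple threshold rule. This is a minimal illustration under assumptions of our own, not RevDeBug's mechanism: the `Deployment` record, the `choose_version` helper and the 1% error-rate threshold are all hypothetical, and in practice the request and error counts would come from production monitoring data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Deployment:
    version: str
    requests: int  # requests served in the observation window
    errors: int    # errors captured in the same window


def error_rate(d: Deployment) -> float:
    """Fraction of requests that failed; 0.0 if nothing was served yet."""
    return d.errors / d.requests if d.requests else 0.0


def choose_version(history: List[Deployment], max_error_rate: float = 0.01) -> str:
    """Return the version that should be live.

    `history` is ordered oldest -> newest. If the newest deployment's error
    rate exceeds the (assumed) threshold, roll back to the most recent earlier
    deployment that stayed within it.
    """
    latest = history[-1]
    if error_rate(latest) <= max_error_rate:
        return latest.version
    for previous in reversed(history[:-1]):
        if error_rate(previous) <= max_error_rate:
            return previous.version
    raise RuntimeError("no healthy version to roll back to")


history = [
    Deployment("v1.4.0", requests=10_000, errors=12),  # 0.12% errors
    Deployment("v1.5.0", requests=2_000, errors=180),  # 9% errors
]
print(choose_version(history))  # v1.5.0 breaches 1%, so v1.4.0 is chosen
```

The rollback only limits downtime; the recorded failure from the bad release is what lets developers find and fix the root cause before redeploying.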

The detail and contextualisation of flight recorder technology cut through the confusion of microservices. Issues can be rapidly traced back to the source, removing the problems with traditional logs, providing a 100% reproducer of the issues from the start and making the analysis of that data simple.

Combining the thoroughness of reverse debuggers with the in-production capabilities of APM tools supplies the much-needed missing piece of the DevOps toolchain, creating a self-healing DevOps pipeline that is free to take full advantage of the possibilities microservices offer.

The future of true observability tools

The observability platform ecosystem is evolving rapidly. There are a number of incumbent players in the record and replay space, including Mozilla’s rr, and brands like Undo, RevPDB and Chronon.

When it comes to deploying this technology in production, and building monitoring and automation features around that core technology, RevDeBug is the innovator. DevOps teams should watch this space with excitement. As the product category becomes an industry standard, the number of features and capabilities will only expand. However, the capabilities are already here. DevOps teams that upgrade their toolset first will grab the largest competitive advantage.  

We are just at the start of bringing record & replay technology into production as a missing piece of the DevOps toolchain, delivering true observability to microservices and cloud-native. True observability is the monitoring solution that DevOps needs. It enables root-cause analysis in real-time, providing actionable insights for rapid solutions.    

If you are seeking a long-term solution to your company's observability problems, one that gets the most out of your DevOps methodology and your microservices, exploring the emerging market of true observability platforms is one of the best moves you can make.
