Observability: the last choice

May 22, 2023

In my latest poll, I was surprised to discover that Observability tooling was ranked last among various software development practices. In this blog post, I aim to shed light on the significance of Observability and how it can kick-start your journey towards quality engineering and Site Reliability Engineering (SRE). Additionally, I will discuss some effective tools that have helped me educate both engineers and non-technical teams about the importance of this practice.

Understanding Observability

Observability is a critical aspect of modern software development and operations. It refers to the ability to gain insights into complex systems by collecting and analysing relevant data. Unlike monitoring, which focuses on specific metrics and alerts, Observability provides a holistic view of the system’s behaviour, uncovering hidden issues and ensuring its overall health.

The Role of Observability in Quality Engineering

Observability plays a pivotal role in enhancing the quality engineering process. By leveraging Observability practices, organisations can proactively identify and rectify potential issues before they impact end-users. Some key benefits of Observability in quality engineering include:

Early Detection: Observability enables the identification of anomalies, errors, and performance bottlenecks at an early stage, allowing for timely intervention.
Rapid Debugging: With comprehensive system visibility, engineers can quickly debug issues and reduce the time it takes to resolve them, leading to faster release cycles and improved customer satisfaction.
Continuous Improvement: By leveraging Observability data, teams can gain valuable insights into system behaviour, enabling them to make informed decisions for continuous improvement and optimisation.

SRE and Observability

Observability is closely aligned with the principles of SRE, a discipline that focuses on ensuring the reliability and resilience of software systems. Here’s how Observability contributes to SRE:

Effective Incident Response: With Observability, SRE teams can quickly identify the root causes of incidents, reducing downtime and minimising the impact on users.
Capacity Planning: Observability data allows SREs to understand resource utilisation patterns, enabling them to plan and allocate resources effectively.
Service-Level Objectives (SLOs) Monitoring: Observability helps SRE teams measure and track SLOs, ensuring that the system meets the defined reliability standards.
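
To make the SLO point concrete, here is a minimal sketch of how an availability SLO and its error budget could be computed from request counts. The function name, the 99.9% target and the request numbers are all hypothetical; in practice the counts would come from whichever Observability tool you use.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float     # fraction of requests that succeeded
    error_budget: float     # allowed failure fraction (1 - target)
    budget_consumed: float  # share of the error budget already used


def slo_report(total_requests: int, failed_requests: int, target: float) -> SloReport:
    """Compare measured availability against an SLO target such as 0.999."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    failure_rate = failed_requests / total_requests
    error_budget = 1.0 - target
    return SloReport(
        availability=1.0 - failure_rate,
        error_budget=error_budget,
        budget_consumed=failure_rate / error_budget,
    )


# Hypothetical numbers: one million requests, 400 failures, 99.9% target.
report = slo_report(total_requests=1_000_000, failed_requests=400, target=0.999)
print(f"availability={report.availability:.2%}, budget used={report.budget_consumed:.0%}")
```

A budget that is consumed faster than expected is usually the signal to slow down releases and focus on reliability work.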

Essential Observability Tools

To help educate both engineers and non-technical teams about operational quality, it is crucial to highlight the tools that can facilitate its implementation. Here are some Observability tools I have used in the past, along with their use cases.

NOTE: These use cases do not cover everything the tools can do, and I am not being paid to promote them.

New Relic and service discovery

After spending several years in a large enterprise, I was surprised when I joined a media company that had recently implemented a micro-service infrastructure but had omitted to document their architecture. Realising the importance of having a clear reference architecture, I approached the head of engineering for guidance. To my surprise, they introduced me to a tool I had never encountered before.

🤯 I was truly astounded by the capabilities of this tool, as it provided me with a comprehensive view of our ecosystem, including real-time interactions and transactions. As a Quality Engineer, this tool proved to be a game changer, enabling me to visualise bottlenecks and identify potential issues even before they could impact our customers.

I consider myself fortunate to have worked in an environment where the adoption of cloud and infrastructure templates was prioritised. This approach allowed for all applications to be meticulously mapped, along with their corresponding interactions.

As I progressed in my career, I found that not every organisation is able to adopt such a capability, and many are still on a long journey to achieve it.

Splunk: the importance of logs and traceability

Observability started to shape the way I worked, and I began to look more closely at the relationship between operations teams and quality engineers.

It was when we had to investigate performance degradation in a new project that I truly understood the benefit of logs and how I could leverage them to justify why something was not working as expected.

The team and I needed to implement some performance metrics around how our new partner APIs could support our business case, and this is when we started to look at the importance of enhancing our logs for our Quality requirements.

Splunk is a powerful platform that demonstrates the importance of logs and traceability within an organisation. By effectively collecting, analysing, and visualising log data, Splunk enables businesses to gain valuable insights, enhance operational efficiency, and ensure compliance with regulatory requirements.

By centralising log management, we were able to better understand how long a process took to write to the database and how long an API call took to calculate the information we needed. With this, we were able to go to our partners and discuss strategies for reducing the load on on-prem applications that could not scale up to our requirements.
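
As an illustration of the kind of timing analysis described above, here is a small sketch in plain Python (not Splunk's own query language). The log format, the `op` and `duration_ms` field names and the sample lines are assumptions made up for this example; your own logs will look different.

```python
import math
import re
from collections import defaultdict

# Hypothetical log format; real field names depend on your own logging setup.
LINE_PATTERN = re.compile(r"op=(?P<op>\w+)\s+duration_ms=(?P<ms>\d+(?:\.\d+)?)")


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarise(lines):
    """Group durations by operation and report count, median (p50) and p95."""
    durations = defaultdict(list)
    for line in lines:
        match = LINE_PATTERN.search(line)
        if match:  # skip lines that carry no timing information
            durations[match["op"]].append(float(match["ms"]))
    return {
        op: {"count": len(ms), "p50": percentile(ms, 50), "p95": percentile(ms, 95)}
        for op, ms in durations.items()
    }


# Made-up sample lines standing in for centralised logs.
sample_logs = [
    "2023-05-22T10:00:01Z INFO op=db_write duration_ms=42",
    "2023-05-22T10:00:02Z INFO op=partner_api duration_ms=830",
    "2023-05-22T10:00:03Z INFO op=db_write duration_ms=55",
    "2023-05-22T10:00:04Z INFO op=partner_api duration_ms=1210",
    "2023-05-22T10:00:05Z INFO op=db_write duration_ms=48",
    "2023-05-22T10:00:06Z WARN retrying connection",
]
summary = summarise(sample_logs)
for op, stats in summary.items():
    print(op, stats)
```

Percentiles tend to be more useful than averages in conversations with partners, because a handful of very slow calls can hide behind a healthy-looking mean.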

AWS CloudWatch: just the basics

In a recent role, I was enthusiastic about leveraging my observability and telemetry knowledge to enhance our capabilities and implement innovative metrics to support my SRE initiatives. Then I realised that not everyone had access to these advanced tools.

I found it perplexing that an organisation would not prioritise investing in observability and telemetry tooling, considering the significant effort required to build and maintain such capabilities. Nevertheless, I recognised that not every organisation has the resources, time, or budget to allocate towards these endeavours. Consequently, I quickly adapted and learned how to work with limited resources by relying solely on analysing the application logs.

To my surprise, I discovered that the team had already done an excellent job of utilising standard log libraries across most of the applications. This allowed us to query and analyse the logs effectively, extracting valuable insights and driving our observability efforts forward.
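
To show why consistent logging makes logs queryable, here is a minimal sketch using Python's standard `logging` module. It is not the setup the team used and it does not call CloudWatch; the `JsonFormatter` class, the service names and the messages are hypothetical, and the "query" is a simple count of errors per service.

```python
import io
import json
import logging
from collections import Counter


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line so it can be queried later."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        })


def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def errors_per_service(raw_logs):
    """Count ERROR entries per service from JSON-lines log text."""
    counts = Counter()
    for line in raw_logs.splitlines():
        entry = json.loads(line)
        if entry["level"] == "ERROR":
            counts[entry["service"]] += 1
    return counts


# Hypothetical services sharing one log stream and one format.
stream = io.StringIO()
checkout = build_logger("checkout", stream)
payments = build_logger("payments", stream)
checkout.info("order created")
payments.error("card declined by provider")
payments.error("timeout calling provider")

error_counts = errors_per_service(stream.getvalue())
print(dict(error_counts))
```

Because every application emits the same fields, the same query works across all of them, which is what makes a basic log store go a long way.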

In conclusion

Embracing observability and telemetry practices, along with leveraging appropriate tools, empowers organisations to unlock the full potential of their software systems. By prioritising observability and harnessing the insights provided by telemetry, businesses can drive operational excellence, enhance system reliability, optimise performance, and ultimately deliver exceptional experiences to their customers.