Complex Systems Fail in Complex Ways

With the rise of technology, the complexity of our business applications has dramatically increased. Virtualization, microservices and artificial intelligence will soon dominate our IT landscape. In this post, I will write about problem spots and proactive solutions.

Expensive downtimes

In one of my projects several years ago, I was tasked with a firefighting exercise. Customers were unable to configure their mobile device contracts; sales staff started to open new contracts on paper, and back office teams were overloaded with a high number of paper requests. The reason for this nightmare was a performance bottleneck in the new contract management system. IT teams started log file analysis, added more hardware and restarted their business applications several times a day. After some days it was apparent that neither more infrastructure nor log file analysis would get the issue fixed.

Our biggest concern was the lack of insight. The system resource metrics and log files indicated that there was no issue at all. However, the customers were extremely frustrated about the reliability of this new business application. A closer look at the log files, which was a challenge in itself due to the distributed architecture, showed that all service response times slowed down after a while. After this initial analysis, we agreed that it would take ages to nail down the problem spot with the limited monitoring information.
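
A gradual slowdown like this is easy to miss when you only look at individual log entries. As a minimal sketch, assuming the response times have already been extracted from the distributed logs into one time-ordered list, the following compares recent samples against an early baseline. The function name, the window size and the sample values are made up for illustration and are not taken from the project.

```python
from statistics import mean


def slowdown_ratio(response_times_ms, window=3):
    """Return mean(last `window` samples) / mean(first `window` samples).

    A ratio well above 1.0 suggests the service got slower over time.
    """
    if len(response_times_ms) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    baseline = mean(response_times_ms[:window])
    if baseline <= 0:
        raise ValueError("baseline response time must be positive")
    recent = mean(response_times_ms[-window:])
    return recent / baseline


# Hypothetical response times in milliseconds, ordered by time.
samples = [100, 110, 90, 150, 200, 290, 300, 310]
print(f"slowdown ratio: {slowdown_ratio(samples):.1f}")
```

A real APM product does far more than this, but the idea of tracking response times against a baseline is the same.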

Modernization of the monitoring stack

Due to the high pressure and the onset of revenue loss, we decided to close the visibility gap in the application monitoring chain. Priority one was insight into the user transactions, their response times, and continuous service response time monitoring. We decided to integrate an application performance monitoring solution for this purpose. Priority two was 24 x 7 health monitoring and automatic alerting when thresholds were exceeded. We automated significant end user interactions and executed them in the availability monitoring solution on a 5-minute schedule.
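
To illustrate the idea of scheduled checks with threshold alerting, here is a minimal sketch. It is not the monitoring product we used; the names run_check and alerts, the statuses and the threshold values are assumptions chosen for this example. Each check runs one automated user interaction, measures how long it takes and flags it when it fails or exceeds its threshold.

```python
import time


def run_check(name, action, threshold_seconds, clock=time.perf_counter):
    """Run one automated user interaction and classify the outcome."""
    start = clock()
    try:
        action()
        succeeded = True
    except Exception:
        # Any exception counts as a failed check in this sketch.
        succeeded = False
    elapsed = clock() - start

    if not succeeded:
        status = "FAILED"
    elif elapsed > threshold_seconds:
        status = "SLOW"
    else:
        status = "OK"
    return {"name": name, "status": status, "elapsed_seconds": elapsed}


def alerts(results):
    """Return only the results that should trigger an alert."""
    return [r for r in results if r["status"] != "OK"]


def configure_contract():
    """Hypothetical stand-in for a scripted end user interaction."""
    time.sleep(0.01)


results = [run_check("configure contract", configure_contract, threshold_seconds=2.0)]
for alert in alerts(results):
    print(f"ALERT: {alert['name']} is {alert['status']}")
```

In practice a scheduler would call such checks every 5 minutes and route the alerts to the support team; that part is left out here.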

Quick error detection and remediation

Thanks to the improved monitoring stack we were able to identify the cause of that nasty issue. The integrated APM platform captured significant details such as too few threads, inappropriate JDBC pool sizing, deadlocks, chatty application behavior and high garbage collection suspension times. After some tuning cycles, we removed those hotspots, and the overall health reached the expected level. Support teams installed watch machines and displayed the health and performance metrics permanently on their online monitoring dashboards.

Make monitoring part of your development and testing activities and share your metrics across the organization. This continuous insight will help to eliminate hotspots proactively and keep you from costly outages.
