As technology advances, the complexity of our business applications has increased dramatically. Virtualization, microservices and artificial intelligence are set to dominate our IT landscape. In this post, I will write about typical problem spots and proactive solutions.

Expensive downtimes

Several years ago, I was tasked with a firefighting exercise in one of my projects. Customers were unable to configure their mobile device contracts; sales staff started to open new contracts on paper, and back office teams were overloaded with the resulting volume of paper requests. The reason for this nightmare was a performance bottleneck in the new contract management system. IT teams started log file analysis, added more hardware and restarted their business applications several times a day. After some days it was apparent that neither more infrastructure nor log file analysis would get the issue fixed.

Our biggest concern was the lack of insight. The system resource metrics and log files indicated that there was no issue at all. However, the customers were extremely frustrated with the reliability of this new business application. A closer look at the log files, which was a challenge in itself due to the distributed architecture, showed that all service response times slowed down after a while. After this initial analysis, we agreed that it would take ages to nail down the problem spot with the limited monitoring information.
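To illustrate the kind of analysis we did by hand, here is a minimal sketch that flags the point at which response times start to drift away from their initial level. It assumes the response times have already been extracted from the log files into a list of milliseconds; the function name, window sizes and the factor of 1.5 are hypothetical choices for this example, not what we used in the project.

```python
from statistics import median


def detect_slowdown(response_times_ms, baseline_size=5, window_size=5, factor=1.5):
    """Return the index of the first sample at which the median of the
    trailing window exceeds `factor` times the baseline median.

    The baseline is the median of the first `baseline_size` samples.
    Returns None if no slowdown is found or there are too few samples.
    """
    if len(response_times_ms) < baseline_size + window_size:
        return None

    baseline = median(response_times_ms[:baseline_size])

    # Slide a window over the samples that follow the baseline period.
    for end in range(baseline_size + window_size, len(response_times_ms) + 1):
        window = response_times_ms[end - window_size:end]
        if median(window) > factor * baseline:
            return end - 1  # index of the last sample in the offending window
    return None


if __name__ == "__main__":
    # Hypothetical data: stable at 100 ms, then degrading to 200 ms.
    samples = [100] * 10 + [200] * 3
    print(detect_slowdown(samples))
```

Using the median rather than the mean keeps a single slow request from triggering a false alarm, which matters when the data comes from noisy production logs.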

Modernization of the monitoring stack

Due to the high pressure and the first revenue losses, we decided to close the visibility gap in the application monitoring chain. Priority one was insight into user transactions, their response times, and continuous monitoring of service response times. We decided to integrate an application performance monitoring (APM) solution for this purpose. Priority two was 24 x 7 health monitoring with automatic alerting when thresholds were exceeded. We automated significant end user interactions and executed them in the availability monitoring solution on a 5-minute schedule.
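The idea behind such a scheduled check can be sketched in a few lines. This is not the availability monitoring product we used, only an illustration under stated assumptions: `transaction` stands for any scripted end user interaction, and `run_check`, `CheckResult` and the threshold value are hypothetical names and numbers for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable

CHECK_INTERVAL_SECONDS = 5 * 60  # the 5-minute schedule mentioned above


@dataclass
class CheckResult:
    name: str
    ok: bool            # did the transaction complete without an error?
    duration_ms: float  # measured response time
    alert: bool         # should someone be notified?
    reason: str


def run_check(name: str, transaction: Callable[[], None], threshold_ms: float,
              clock: Callable[[], float] = time.perf_counter) -> CheckResult:
    """Execute one scripted transaction, time it and compare it to a threshold."""
    start = clock()
    try:
        transaction()
    except Exception as exc:
        duration_ms = (clock() - start) * 1000
        return CheckResult(name, False, duration_ms, True, f"failed: {exc}")

    duration_ms = (clock() - start) * 1000
    if duration_ms > threshold_ms:
        return CheckResult(name, True, duration_ms, True,
                           "response time above threshold")
    return CheckResult(name, True, duration_ms, False, "healthy")


def run_forever(name: str, transaction: Callable[[], None], threshold_ms: float,
                notify: Callable[[CheckResult], None]) -> None:
    """Repeat the check on a fixed schedule and notify on every alert."""
    while True:
        result = run_check(name, transaction, threshold_ms)
        if result.alert:
            notify(result)
        time.sleep(CHECK_INTERVAL_SECONDS)
```

A call such as `run_forever("configure contract", my_scripted_login, 3000, print)` would execute the hypothetical `my_scripted_login` transaction every five minutes and print each result that breaches the threshold or fails outright. Treating a failed transaction and a slow transaction as separate alert reasons makes it easier to tell an outage from a creeping slowdown.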

Quick error detection and remediation

Thanks to the improved monitoring stack we were able to identify the cause of that nasty issue. The integrated APM platform captured significant details such as too few threads, inappropriate JDBC pool sizing, deadlocks, chatty application behavior and long garbage collection pauses. After some tuning cycles, we had removed those hotspots, and the overall health reached the expected level. Support teams installed watch machines and displayed the health and performance metrics permanently on their online monitoring dashboards.

Make monitoring part of your development and testing activities and share your metrics across the organization. This continuous insight will help you eliminate hotspots proactively and avoid costly outages.


Posted by JM

Resourceful, solution-focused and intuitive reliability engineer with over 15 years of demonstrated success in architecting, developing and maintaining effective testing and monitoring solutions. Offers a wealth of knowledge and experience surrounding modern application architecture and development of best practices.
