Regulatory agencies such as MAS or FINMA closely monitor how financial institutions build, test and operate their business-critical services. When a critical service fails, the agencies have to be informed within one hour. In this post, I will show you how to build a reliable monitoring stack that helps you detect and escalate critical outages automatically.
The major elements of full stack monitoring are:
1. Active monitoring
2. Passive monitoring
# Active monitoring
Active monitoring mimics the clicks of a real user. It helps you see your entire application through the eyes of your end users. You select a set of top business processes and automate them with a synthetic monitoring platform. The resulting monitoring scripts run on execution machines close to your users’ locations, and an execution schedule controls the testing frequency. If an error occurs, a notification is sent to your support teams early enough to give them time for corrective measures.
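To make the idea more concrete, here is a minimal sketch of a synthetic check in Python. It is an assumption-laden illustration, not how any of the commercial platforms work: real synthetic monitoring usually drives a full browser, while this sketch only issues plain HTTP requests. The names `run_step`, `run_business_process`, the step tuples and the `notify` callback are all hypothetical.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    detail: str = ""


def http_get(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Fetch a URL and return (status code, body text)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def run_step(name: str, url: str, expected_text: str,
             fetch: Callable[[str], Tuple[int, str]] = http_get) -> StepResult:
    """Run one step and check availability (status) and accuracy (content)."""
    start = time.monotonic()
    try:
        status, body = fetch(url)
        ok = status == 200 and expected_text in body
        detail = "" if ok else f"status={status} or expected text missing"
    except Exception as exc:  # timeouts, DNS errors, HTTP errors, ...
        ok, detail = False, f"{type(exc).__name__}: {exc}"
    return StepResult(name, ok, time.monotonic() - start, detail)


def run_business_process(steps: List[Tuple[str, str, str]],
                         fetch: Callable[[str], Tuple[int, str]] = http_get,
                         notify: Callable[[str], None] = print) -> List[StepResult]:
    """Run the steps in order; notify and stop at the first failure."""
    results = []
    for name, url, expected_text in steps:
        result = run_step(name, url, expected_text, fetch)
        results.append(result)
        if not result.ok:
            notify(f"ALERT: step '{name}' failed ({result.detail})")
            break  # later steps usually depend on earlier ones
    return results
```

The execution schedule is left out on purpose: a scheduler such as cron on each execution machine could call `run_business_process` at the desired frequency. Passing `fetch` and `notify` as parameters keeps the check easy to test without touching a real system.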
Some familiar active monitoring solutions are:
- Silk Performance Manager from Micro Focus
- UserView, BrowserView, ServerView from dotcom-monitor
# Passive monitoring
Simulating the most important use cases in production, so-called active monitoring, is a good way to check the availability and accuracy of your core services. However, real users’ permissions, data constraints and the runtime behavior of a business application can still have a high impact on those services, and a handful of scripted scenarios cannot cover them all. Passive monitoring provides the missing transparency.
The term “passive” indicates that this kind of monitoring acts as a listener. No robot-based simulation of user actions is involved. In contrast to active monitoring, passive monitoring depends on real customer requests: it collects all user transactions, including input parameters, request details and many other application-specific details. This type of monitoring requires a monitoring agent that is installed on your application servers and collects all relevant details.
From application performance, transaction and user experience monitoring to system resource and logfile monitoring, all of these fall under the passive monitoring umbrella. Such passively collected metrics are useful for error analysis, continuous optimization and capacity planning because they tell you exactly how many users were affected and which component caused a specific issue.
Some passive monitoring tools are:
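The listener idea can be illustrated with a few lines of Python. The sketch below is a toy WSGI middleware, assuming a Python web application; it is not how commercial agents are built, since those instrument the runtime far more deeply. The class name `RequestRecorder` and the `sink` callback are hypothetical.

```python
import time
from typing import Callable, Dict


class RequestRecorder:
    """WSGI middleware that passively records every real request it sees."""

    def __init__(self, app, sink: Callable[[Dict], None]):
        self.app = app    # the wrapped application
        self.sink = sink  # where records go, e.g. a log shipper (assumption)

    def __call__(self, environ, start_response):
        start = time.monotonic()
        seen = {"status": "unknown"}

        def recording_start_response(status, headers, exc_info=None):
            seen["status"] = status  # remember what the application answered
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, recording_start_response)
        except Exception as exc:
            seen["status"] = f"500 {type(exc).__name__}"
            raise
        finally:
            # Time until the application returned its response iterable;
            # streaming the body to the client is not included.
            self.sink({
                "method": environ.get("REQUEST_METHOD", ""),
                "path": environ.get("PATH_INFO", ""),
                "query": environ.get("QUERY_STRING", ""),
                "status": seen["status"],
                "seconds": time.monotonic() - start,
            })
```

Nothing is simulated here: the recorder only produces data when a real user sends a request, which is exactly what distinguishes it from the synthetic checks of active monitoring.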
- APM and UEM: Dynatrace, New Relic, AppDynamics
- Infrastructure: Nagios, Tivoli
- Logfile: Splunk
Automated incident alerting is a useful feature, but in many cases it is still not sufficient. A visualization of the current health status is also recommended. Many teams use monitoring screens that chart the core performance and health figures of their critical services. The benefit of this approach is that your teams can see at a glance that everything is fine when all traffic lights are green. Charted key performance indicators can also be shared with business and management teams, and the dashboards are easy to reuse for internal reporting.
As a good practice, the following KPI visualizations can be used:
- Create combined or dedicated performance cockpits
- Add traffic lights to visualize actual and historical health metrics
- Add real user stats such as action time, requests executed, failure rate
- Add active monitoring status such as availability and accuracy
- Add host health status such as CPU, Memory, Network and Disk utilization stats
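A traffic light is simply a metric mapped to a color through thresholds. The following sketch shows one possible way to do that in Python. The metric names and threshold values are made up for illustration, and the sketch assumes every metric is expressed so that a higher value is worse (for example failure rate instead of availability).

```python
from dataclasses import dataclass
from typing import Dict

SEVERITY = ["green", "yellow", "red"]  # ordered from healthy to critical


@dataclass(frozen=True)
class Thresholds:
    warn: float      # at or above this value the light turns yellow
    critical: float  # at or above this value the light turns red


def traffic_light(value: float, limits: Thresholds) -> str:
    """Map one metric value to a traffic light color."""
    if value >= limits.critical:
        return "red"
    if value >= limits.warn:
        return "yellow"
    return "green"


def cockpit_status(metrics: Dict[str, float],
                   limits: Dict[str, Thresholds]) -> Dict[str, object]:
    """Return one light per metric plus the worst light as overall status."""
    lights = {name: traffic_light(value, limits[name])
              for name, value in metrics.items() if name in limits}
    overall = max(lights.values(), key=SEVERITY.index, default="green")
    return {"overall": overall, "lights": lights}
```

Taking the worst individual light as the overall status is a deliberately cautious choice: one red component is enough to turn the whole cockpit red.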
# Alarming and Incident Notification
Software errors, hardware defects or human error can impact the performance or uptime of your business services. You can’t avoid such problems completely. Full stack monitoring helps you detect such issues early and escalate them to the teams in charge. When a preconfigured threshold is reached, an e-mail, SMS or trouble ticket is sent out. This alerting practice is extremely useful because your teams can’t watch their monitoring screens 24/7.
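Threshold-based alerting can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `AlertRule` and `Alerter` names are hypothetical, and the channels are plain callables standing in for a real e-mail gateway, SMS provider or ticketing system. The sketch notifies only when a metric newly crosses its threshold, so a lasting outage does not flood the team with repeated messages.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

Channel = Callable[[str], None]  # e.g. send e-mail, send SMS, open ticket


@dataclass
class AlertRule:
    metric: str
    threshold: float     # alert when the value is at or above this
    channels: List[str]  # names of the channels to notify


@dataclass
class Alerter:
    rules: List[AlertRule]
    channels: Dict[str, Channel]
    _firing: Set[str] = field(default_factory=set)  # metrics currently in alert

    def evaluate(self, metrics: Dict[str, float]) -> List[str]:
        """Check all rules against the latest metrics and send new alerts."""
        sent = []
        for rule in self.rules:
            value = metrics.get(rule.metric)
            if value is None:
                continue  # no data for this metric in this cycle
            breached = value >= rule.threshold
            if breached and rule.metric not in self._firing:
                self._firing.add(rule.metric)
                message = (f"{rule.metric}={value} reached "
                           f"threshold {rule.threshold}")
                for name in rule.channels:
                    self.channels[name](message)
                sent.append(message)
            elif not breached:
                self._firing.discard(rule.metric)  # recovered, re-arm the rule
        return sent
```

In practice, `evaluate` would be called on every monitoring cycle with the latest values from your active and passive monitoring.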
From problem detection to resolution:
- Active or passive monitoring detects a deviation from expected results
- An automated alert is sent to the support team
- Problem spot is highlighted on the monitoring cockpit
- Support team starts the investigation
- Support team involves full stack engineers
- Support team eliminates the cause of this issue
- Support team closes the trouble ticket
Good luck with your full stack monitoring. We are here to help you integrate these systems. Just ask!