Monitoring metrics basics
Application monitoring is generally an afterthought when building smaller applications. You should agree on monitoring tools when making core technical decisions, like choosing the database or programming language. It will make later-stage integration much easier and less painful.
There are a lot of SaaS solutions out there. These tools are also called APMs (application performance management systems). To name a few: New Relic, Datadog, Cisco AppDynamics. My goal here is to talk more about why and what to monitor and the types of monitoring out there, rather than which tools you need to choose.
I will also go over sample application monitoring requirements.
Why monitor applications?
- Know when things go wrong
- Be able to debug and get insights
- Trending to see changes over time and to drive technical/business decisions
- To feed into other systems/processes (e.g. QA, security, automation)
Facebook's monitoring challenges are:
- Collecting monitoring data. This means capturing the state of the system and tracking changes over time.
Usually, this is best done using time-series data.
The main problem here is scale (i.e. the number of servers and services)
- Analysing the monitoring data
- Detecting anomalies
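Time-series monitoring data usually boils down to a metric name, a value, a timestamp, and a set of tags identifying the source. A minimal sketch of such a sample in Python (the field names and text line format are illustrative, not taken from any specific tool):

```python
import time
from dataclasses import dataclass, field

@dataclass
class MetricSample:
    """One point in a time series: what was measured, when, and with which tags."""
    name: str                                  # e.g. "queries_total"
    value: float
    timestamp: float = field(default_factory=time.time)
    tags: dict = field(default_factory=dict)   # e.g. {"host": "node-01"}

    def to_line(self) -> str:
        # Render as a simple "name{tags} value timestamp" text line
        tag_str = ",".join(f"{k}={v}" for k, v in sorted(self.tags.items()))
        return f"{self.name}{{{tag_str}}} {self.value} {int(self.timestamp)}"
```

Tracking changes over time then means appending one such sample per collection interval and letting the monitoring backend aggregate them.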
World without monitoring
When running your applications in production without monitoring, the normal workflow would most likely look something like this:
- You push some code to production before the end of the day
- Leave work knowing that everything works
- At home you want to check out the changes, but they do not work
- Then some of your friends call to say they cannot use your new fancy feature
- Now your manager calls and tells you to return to the office and fix the issue right away
- Back in the office, you dig through the logs and the git history to find the issue, then fix it
World with monitoring
Now let’s talk about when the system has monitoring turned on.
- The day ends with a code push to production
- Before leaving the office, you get a notification on your phone that some monitoring metrics are not OK
- You open your laptop and check out the detailed report of the metrics
- You discover that the cause of the issue is the last code change
- Fix it by rolling the package back from production and leave the office
Types of automated monitoring
Notes taken from How to Build and Deploy Open Source Application Monitoring Solutions
- Load balancers want to know whether there is free capacity to use
- Auto-scaling tools want to know the application utilization: should the application be scaled in or out?
- Health management checks whether the application is working as expected. It sends notifications when problems are detected.
- Accounting wants to know the operational cost and ROI
Let’s take a sample use case to analyze. Below is a simple diagram of a central logging application with multiple components.
Let’s go over what each component we plan to monitor will do in our system.
Filebeat is a simple log shipper that runs as a service on each application node. Its only job is to watch application log files and send each row to Kafka.
Apache Kafka is an open-source stream processing platform. It will collect the messages sent by Filebeat into queues. Kafka provides a messaging service and runs in high-availability mode.
Gobblin is a distributed big data integration framework for batch and streaming systems. In our case, it pulls log data from Kafka and schedules processing jobs using YARN. It runs as a cron job at a predefined interval.
We won’t be monitoring Hadoop. Let’s assume it will be monitored by other infrastructure teams.
Logging metrics to track
The logging system has multiple components, and each of these applications has different monitoring requirements.
The main metrics we should track are:
- Total number of queries executed
- Successful queries completed
- Queries terminated by the user - query info to see how we could reduce this kind of query. The benefit for the end user is less time spent sending useless queries; in our case, fewer system resources wasted
- Failed queries - system errors grouped by type, i.e. invalid period, invalid site name, out of memory
- Average query period length - to help future optimizations for the most common period lengths
- Query info per session - will help us improve the tools in the future to lower the number of queries
- Kafka queue size
- YARN job running time
- Gobblin cron job success rate per job type
- Gobblin YARN job running time
- Failed Gobblin YARN job info - to quickly find the cause of a job failure. Grouping failures should be possible in the future
- Number of running Filebeat services. This is checked against the total number of nodes using central logging
- Filebeat Kafka connection status. Not being able to connect to Kafka is the main reason why Filebeat does not work correctly
- Check that the Filebeat prospectors configuration is present on the node
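Many of the query metrics above are simple counters. A rough sketch of how the application could track them in memory before shipping them off (the class and metric names are my own; a real setup would forward these counts to the monitoring system):

```python
from collections import Counter

class QueryMetrics:
    """In-memory counters for query outcomes (illustrative only)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, outcome: str, error_type: str = None):
        # Every query increments the total, then its specific outcome counter.
        self.counts["queries_total"] += 1
        if outcome == "success":
            self.counts["queries_success"] += 1
        elif outcome == "terminated":
            self.counts["queries_terminated_by_user"] += 1
        elif outcome == "failed":
            # Group failures by type, e.g. invalid_period, out_of_memory
            self.counts[f"queries_failed.{error_type or 'unknown'}"] += 1
```

Grouping failures by type this way makes the "failed queries by type" breakdown a matter of reading the counter keys.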
How does active monitoring work?
Active monitoring means the monitoring tool actively polls the system status and checks it against the expected result. In other words, it could be described as running integration/system tests periodically.
Components that should have these types of tests run are:
- Central log reader - should run at least once a day to make sure the command line tool works as expected
- Central log server - checks must run as often as possible; to start, once an hour is sufficient
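An active check like these can be as small as a script that polls an endpoint and compares the result against what is expected. A sketch in Python, assuming a hypothetical /heartbeat HTTP endpoint on the central log server:

```python
import urllib.request
import urllib.error

def check_heartbeat(url: str, timeout: float = 5.0) -> bool:
    """Active check: poll the service and verify the expected response.
    The URL is a placeholder for the real central log server endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Unreachable, refused, or timed out - the check fails.
        return False

# Scheduled from cron once an hour; alerting hook is hypothetical:
# if not check_heartbeat("http://applogs-server.internal/heartbeat"):
#     send_alert("central log server heartbeat failed")
```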
How do the push and pull models work?
There are two common monitoring models: push and pull.
- Push: the application sends data to a monitoring system.
Pros of this type of system are:
- Monitored application has full control over data sent and how often it’s sent.
- Security - the application does not expose system information through an endpoint
Cons of this type of system are:
- The application is aware of the monitoring system, i.e. the monitoring system's location
- The data may need to be formatted into the correct format
Main use cases:
- Applications behind heavy firewalls
- Short-running processes, i.e. CLI applications
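As an illustration of the push model, here is a minimal sketch that sends a counter to a StatsD-compatible collector over UDP (the host, port, and metric name are placeholders; the "name:value|c" counter line is the StatsD text format):

```python
import socket

def push_metric(name: str, value: int,
                host: str = "127.0.0.1", port: int = 8125):
    """Push model: the application sends a metric to the collector.
    Fire-and-forget UDP, so a short-lived process is not slowed down."""
    payload = f"{name}:{value}|c".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

# A CLI tool can fire this right before exiting:
# push_metric("applogs.cli.queries_total", 1)
```

Note how the application has to know the collector's address, which is exactly the coupling listed under the cons above.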
- Pull: the monitoring system asks the application directly for its current status.
Pull system pros would be:
- Monitoring system schedules the data collection interval
- The monitoring system collects all the data it requires. Sometimes minimal development is required in the application to support metrics collection
- The application does not have to know where the monitoring system is or what it uses
Cons related to pull monitoring:
- The monitoring system must know the monitored application's location in the network. When you have a large cluster and a lot of applications, adding them all to monitoring is a bit difficult
- A very high polling rate may put a strain on the application's resources
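For the pull model, the application exposes an endpoint that the monitoring system scrapes on its own schedule. A minimal sketch using Python's standard library (the /metrics path and plain-text "name value" format are assumptions, loosely modeled on Prometheus-style scraping):

```python
import http.server
import threading

# Example values; a real application would read live counters here.
METRICS = {"queries_total": 42, "kafka_queue_size": 7}

class MetricsHandler(http.server.BaseHTTPRequestHandler):
    """Pull model: the monitoring system decides when to scrape this."""

    def do_GET(self):
        if self.path == "/metrics":
            body = "\n".join(f"{k} {v}"
                             for k, v in sorted(METRICS.items())).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        # Silence per-request logging; scrapes would flood the log.
        pass

def serve(port: int = 0) -> http.server.HTTPServer:
    """Start the metrics endpoint in a background thread; port 0 picks a free one."""
    server = http.server.HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```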
In most cases, the pull method is preferred; in some cases, the push option should be implemented. Below I list each component and its monitoring method.
- Filebeat - pull method preferred. When the application or server is behind a firewall, use the push method instead
- Gobblin - push, because it is a cron job. Most likely we will use the JMX metrics Gobblin provides and add a small daemon to collect them and send them to the monitoring system
- Kafka - pull; requires a third-party plugin to collect metric data
- applogs job scheduler - pull method; live integration tests for the heartbeat should also be run at least daily, but preferably once an hour
- applogs reader/CLI - push, because it is not known when the command line tool is run and all collected metrics are related to the executed query
Both the applogs job scheduler and the applogs CLI may require development work to add the missing metrics data collection.
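The small daemon mentioned for Gobblin could look roughly like this: a loop that collects metrics and pushes them at a fixed interval. The JMX collection itself is stubbed out here, since the real bean names depend on the Gobblin setup:

```python
import time

def collect_jmx_metrics() -> dict:
    """Placeholder: a real daemon would read Gobblin's JMX beans here,
    e.g. through a JMX-to-HTTP bridge. Names and values are made up."""
    return {"gobblin.jobs.succeeded": 3, "gobblin.jobs.failed": 0}

def run_daemon(push, interval: float = 60.0, iterations: int = None):
    """Collect and push metrics on a fixed interval.
    `push(name, value)` is the send function (e.g. a UDP push);
    `iterations` bounds the loop so it can be tested."""
    n = 0
    while iterations is None or n < iterations:
        for name, value in collect_jmx_metrics().items():
            push(name, value)
        n += 1
        if iterations is None or n < iterations:
            time.sleep(interval)
```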