Quite often, we face a dilemma: is network monitoring or log management alone enough? If I already have a log management solution, do I need to look at metrics at all? Worse, some users feel that as long as their customers tell them when something goes down, they don’t really need either kind of solution. This post sets out to answer the following questions:

  • What is the right approach?
  • Is relying on either metrics or logs alone enough?
  • If not, when should I use log analytics and when should I use metric analytics? Or do I need both?
  • How do I monitor operations and performance in the best possible way?


Metrics vs Logs

Many people confuse the two or mistake one for the other, probably because many network monitoring system (NMS) vendors claim to offer both. To be fair, there is a significant amount of overlap between them, even though they differ in enough ways to be considered separate entities in their own right.

Before we deep dive, it may be a good idea to understand the basics of metrics and logs.

What are Metrics?

Metrics measure a system’s KPIs, such as CPU utilisation and memory utilisation, at a specific timestamp, whereas logs record a particular event, such as a new login or a spike in bandwidth utilisation. A single metric measurement typically carries a timestamp, a value, and an identifier (such as the source it came from or a tag) that says what it applies to. Logs are collected whenever an event occurs, but metrics are generally collected at specific, predefined time intervals.

Data collected this way forms a time series. It can be visualized in an NMS dashboard through various widgets and graphs, such as heat maps, gauges, bar charts, counters, and timers.
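As a rough illustration, here is a minimal Python sketch of a metric sample (name, timestamp, value, tags) and of a series collected at a fixed interval. The MetricPoint and sample_series names, the tag layout, and the readings are all hypothetical; timestamps are computed rather than waited for, so the sketch runs instantly.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MetricPoint:
    """One time-series sample: what was measured, when, its value, and tags."""
    name: str                      # hypothetical metric name, e.g. "cpu.utilization"
    timestamp: float               # Unix epoch seconds
    value: float                   # the measured number
    tags: Dict[str, str] = field(default_factory=dict)  # e.g. {"host": "web-01"}


def sample_series(read_value: Callable[[], float], name: str, tags: Dict[str, str],
                  start: float, interval_s: float, count: int) -> List[MetricPoint]:
    """Build a series by reading a value once every `interval_s` seconds.

    Timestamps are computed as start + i * interval_s instead of sleeping.
    """
    return [MetricPoint(name, start + i * interval_s, read_value(), dict(tags))
            for i in range(count)]


# Hypothetical CPU readings, one per minute.
readings = iter([12.0, 15.5, 14.0, 80.2])
series = sample_series(lambda: next(readings), "cpu.utilization",
                       {"host": "web-01"}, start=1_700_000_000.0,
                       interval_s=60.0, count=4)
for p in series:
    print(p.timestamp, p.name, p.value, p.tags)
```

Each sample is small and shares the same name and tags, which is what lets dashboards plot the series directly.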

Although you can track system health with performance log files, this approach can be very expensive: each time, you create a fresh log entry carrying all of the system’s metadata just to record a single number. A metric avoids this overhead because you don’t need to log the same context over and over again. A metric is therefore much lighter and still serves the purpose, since it is a tiny fraction of the size of a complete log entry.
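To put rough numbers on this, the sketch below compares one CPU reading stored as a JSON log line with the same reading stored as a (timestamp, value) sample. The field names, the host, and the assumption that a time-series store writes the series key once are illustrative, and real systems compress both kinds of data, so treat the byte counts as orders of magnitude only.

```python
import json
from typing import Tuple

# Hypothetical log entry that records one CPU reading plus the usual context fields.
log_entry = {
    "timestamp": "2023-11-14T22:13:20Z",
    "level": "INFO",
    "host": "web-01",
    "service": "checkout-api",
    "pid": 4312,
    "logger": "perf.monitor",
    "message": "CPU utilization sampled",
    "cpu_utilization": 42.5,
}

# The same reading as a metric sample. Many time-series stores keep the series key
# (metric name + tags) once; only the (timestamp, value) pair repeats per sample.
series_key = "cpu.utilization{host=web-01}"
metric_sample = (1700000000, 42.5)

log_bytes = len(json.dumps(log_entry).encode("utf-8"))
metric_bytes = len(json.dumps(metric_sample).encode("utf-8"))


def storage_cost(n_samples: int) -> Tuple[int, int]:
    """Rough uncompressed bytes for n readings as log lines vs. as one series."""
    as_logs = n_samples * log_bytes
    as_metric = len(series_key.encode("utf-8")) + n_samples * metric_bytes
    return as_logs, as_metric


# One reading per minute for a day is 1440 samples.
print(storage_cost(1440))
```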

What are Logs?

Whenever an event occurs in your system, a log entry is generated. It carries a set of data describing that event, for example which resource was accessed, who accessed it, and when. Different events in your system carry different types of data.

Logs are typically classified by level. Informational logs record events in the normal course of operation; debug logs are mainly used when troubleshooting code; warning logs flag something that may be amiss but should not directly impact your system; and error logs indicate that an issue has actually occurred.
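Most logging libraries expose these levels directly. The sketch below uses Python’s standard logging module; the logger name, the messages, and the in-memory ListHandler (used only so the records can be inspected) are hypothetical.

```python
import logging


class ListHandler(logging.Handler):
    """Keeps (level, message) pairs in memory so they can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append((record.levelname, record.getMessage()))


logger = logging.getLogger("orders")   # hypothetical application logger
logger.setLevel(logging.DEBUG)         # accept every level for this demo
handler = ListHandler()
logger.addHandler(handler)
logger.propagate = False               # don't also send records to the root logger

logger.info("user %s logged in", "alice")                    # informational
logger.debug("cart for %s has %d items", "alice", 3)         # debugging detail
logger.warning("disk usage at %d%%", 85)                     # may need attention
logger.error("payment gateway timed out after %.1fs", 30.0)  # something failed

for level, msg in handler.records:
    print(level, msg)
```

In production you would usually set the level to INFO or higher so that DEBUG messages are filtered out.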

To summarise, logs tell you the story of what actually happened in the system that led to the problem you’re now fixing.

To learn more about managing logs from a central repository, see my previous blog post on the topic: Centralized Logging: A Big Data Challenge

Understanding Log & Metric Data Sources

Typically, vendors build network monitoring solutions (NMS) around either log data or metrics data. The table below summarizes the difference:

You would typically need a separate monitoring tool for each of these sources. A metrics monitoring tool wouldn’t have information from server log files; similarly, a log management tool will not track the performance of your applications or servers. Many tools available today do offer integrations to bridge the gap between the two, but they leave some blind spots behind. Let us take a deeper look at their differences and at why network monitoring and log management matter together.

Stepping Out of the Traditional Monitoring Paradigm

Modern monitoring solutions offer logs and metrics on the same platform, and this combination can significantly improve your understanding of your IT infrastructure. Log management solutions can parse individual log fields and convert them into metrics, and some log management tools can also produce summary metrics. However, this approach averages out the log data, so critical details that might be needed later are lost. Similarly, many metrics-based solutions track error logs as events, which gives you the number of occurrences of a particular event, but there is much more to a log than its count.
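Here is a minimal sketch of turning logs into a metric, assuming a hypothetical one-line-per-event format of timestamp, level, and message. It counts ERROR entries per minute. Note how the resulting numbers no longer say that the 22:14 error was a full disk rather than a database failure, which is exactly the kind of detail that summarizing throws away.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical log format: "<ISO-8601 timestamp> <LEVEL> <message>"
LINE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def errors_per_minute(lines: List[str]) -> Dict[str, int]:
    """Turn raw log lines into a metric: number of ERROR entries per minute."""
    counts: Counter = Counter()
    for line in lines:
        m = LINE.match(line)
        if m and m.group("level") == "ERROR":
            minute = m.group("ts")[:16]  # keep "YYYY-MM-DDTHH:MM"
            counts[minute] += 1
    return dict(counts)


logs = [
    "2023-11-14T22:13:05Z INFO user alice logged in",
    "2023-11-14T22:13:20Z ERROR db connection refused (host=db-02)",
    "2023-11-14T22:13:41Z ERROR db connection refused (host=db-02)",
    "2023-11-14T22:14:02Z ERROR disk full on /var (host=web-01)",
]
print(errors_per_minute(logs))
```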

When to Use Metrics and When to Use Logs

Metrics are best suited for profiling, monitoring, and alerting. Because they summarize data efficiently, you can economically store them for longer durations, which makes them great for monitoring and performance profiling. Logs, on the other hand, give you the detailed information required for debugging, troubleshooting, and auditing. A unified solution can be great for alerting because it is faster and more efficient, and a single platform can take roughly 90% of the monitoring burden off your team.
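As an example of why metrics suit alerting, the sketch below fires an alert only when a value stays above a threshold for several consecutive samples, a common way to avoid alerting on a single blip. The function name, the threshold, and the CPU data are hypothetical and not taken from any particular NMS.

```python
from typing import List, Tuple


def sustained_breaches(samples: List[Tuple[float, float]], threshold: float,
                       min_consecutive: int) -> List[float]:
    """Return the timestamps at which `value > threshold` has held for
    `min_consecutive` samples in a row (the points where an alert would fire)."""
    fired = []
    run = 0
    for ts, value in samples:
        run = run + 1 if value > threshold else 0  # reset on any normal sample
        if run == min_consecutive:                 # fire once per breach episode
            fired.append(ts)
    return fired


# Hypothetical CPU utilisation (%) sampled every 60 seconds.
cpu = [(0, 40.0), (60, 92.0), (120, 95.0), (180, 97.0), (240, 50.0), (300, 91.0)]
print(sustained_breaches(cpu, threshold=90.0, min_consecutive=3))
```

The lone 91% reading at the end does not fire, because it has not yet lasted three samples.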

Do we really need both metrics and logs?

To operate a highly reliable and responsive service for your end customers, you can’t afford to leave out either use case. If you monitor only metrics, you might miss important details needed to debug tough, recurring problems, and you may end up wasting precious time for no reason. Likewise, if you’ve implemented log management but lack alerts and a performance overview, you may end up in a trap full of confusion and downtime.

Logs record something that has already occurred in a system, while metrics measure the overall health and performance of the system. To make this easier to understand, think of it this way. You’re sick and you go to see a doctor. You tell him everywhere you went recently so that he can find the culprit behind your upset stomach; you’ll probably mention things like having eaten street food. Once your doctor has enough clues, he may be able to pinpoint your ailment and suggest remedies. This is like fixing your system by looking only at logs.

To be confident in his diagnosis and in the right treatment, though, he will need your vital stats, such as blood pressure, body temperature, and blood samples. These measurements can be taken regardless of whether you’re sick or healthy. These stats are the equivalent of metrics.

In theory, if your doctor has a consistent flow of these measurements over a long period, he can notice when your temperature rises outside your usual range. He can also draw conclusions from the history of where you ate the day before. He can then conclude that the outside food you recently ate made you sick: the next day your body temperature started rising slowly and steadily without you even noticing, and the day after that you started showing symptoms. Based on both sources of information, he can name the ailment with full confidence. This example illustrates the difference between logs and metrics and how they work together.

Now let us relate this to IT infrastructure monitoring and management. If you start noticing errors on your server, you would refer to the server logs. But the issue might stem from the server’s dependencies, and you do not know exactly when it started, so it can take a significant amount of time to dig in and find the actual culprit.

Metrics or logs on their own can be confusing, but looking at both together can give you real insight and help you work toward resolving the problem. Suppose your metrics show that memory utilization has been fairly consistent, and then you suddenly notice a spike in your server’s memory utilization. Without server logs, you can’t determine the cause. Gartner’s document “Monitoring Modern Services and Infrastructure,” published last year, also identifies logs and metrics as two important domains that are required together for IT infrastructure management.

Combining metrics and logs gives much more clarity: during this unusual rise in memory utilization, there may be unusual log entries that point to the cause.
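The sketch below shows this correlation in its simplest form: it flags a memory sample that sits more than k standard deviations above the mean of the earlier samples, then pulls the log entries that fall within a time window around it. The spike rule is a simple stand-in chosen for illustration, not the method of any specific product, and the series, the window, and the log messages are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Series = List[Tuple[float, float]]   # (timestamp in seconds, value)
LogList = List[Tuple[float, str]]    # (timestamp in seconds, message)


def find_spikes(series: Series, k: float = 3.0, warmup: int = 5) -> List[float]:
    """Flag timestamps whose value exceeds mean + k * stdev of all earlier samples.

    The first `warmup` samples only build the baseline and are never flagged.
    """
    spikes = []
    for i in range(warmup, len(series)):
        history = [v for _, v in series[:i]]
        mu, sigma = mean(history), pstdev(history)
        ts, v = series[i]
        if v > mu + k * max(sigma, 1e-9):  # guard against a perfectly flat baseline
            spikes.append(ts)
    return spikes


def logs_near(logs: LogList, ts: float, window: float) -> LogList:
    """Return log entries within +/- `window` seconds of `ts`."""
    return [(t, msg) for t, msg in logs if abs(t - ts) <= window]


# Hypothetical memory utilisation (%) every 60 s, and application logs.
memory = [(0, 41.0), (60, 42.0), (120, 40.5), (180, 41.5), (240, 42.0),
          (300, 41.0), (360, 78.0), (420, 79.0)]
app_logs = [(250, "INFO nightly report job started"),
            (345, "WARNING cache warm-up loading 2.1M keys into memory"),
            (355, "INFO cache warm-up in progress"),
            (900, "INFO user bob logged in")]

for spike_ts in find_spikes(memory):
    print("memory spike at", spike_ts)
    for t, msg in logs_near(app_logs, spike_ts, window=30):
        print("  ", t, msg)
```

Only the first jump to 78% is flagged: once that high value enters the baseline, the next sample is no longer far enough above it, which is one of the trade-offs of this simple rule.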

Conclusion

Using a single platform like Motadata’s NMS for both metrics and logs gives you the maximum cost and operational advantage. A unified platform is essential for a holistic, broader view and for the speed and scalability required in today’s digital economy. If you want to leverage your logs and also get the most sensible infrastructure view, which only metrics can offer, use Motadata to bring metrics and logs under one umbrella. To learn more, visit www.motadata.com