Choosing the right Health and Performance Monitoring tool for your Sitecore solution

In this blog post, I'll share some of what I've learned over the past few months about choosing and incorporating health and performance monitoring (APM) tools into your Sitecore solutions.

The Big Why

Although this is a slightly different topic than what I normally blog about, I've always considered monitoring of software solutions to be one of the most important things to consider when working with medium to large sized solutions. This is usually a topic most technically minded people can relate to, and agree should be highly prioritized.

From my daily work, I see that this is something that tends to be neglected, one way or another, and there are many reasons for this. One reason is that the team of developers doesn't have the knowledge to determine which tools to use, or even how to use them. Another reason might be that the client concludes that the return on investment doesn't justify spending time and money on such tools. This happens even though the development team has most likely recommended having such tools available, either to help track down errors when they occur (and believe me, they will occur eventually, even standard off-the-shelf products have errors in them), or to predict upcoming failures based on trends in previous failures.

From the perspective of Sitecore, a whole other argument that developers seem to find reasonable is this: if an error occurs, why not simply review the log files in Sitecore? That's what they are for, and you can just download the log files and run them through the Sitecore Log Analyzer (SCLA), right?

Let me start by saying that I really appreciate all the work that has gone into making the SCLA, and that I use this (awesome) tool in my daily work. However, I would also like to point out that I prefer not having to log into every Sitecore server, pull down every questionable log file, run each of them through the SCLA, and then compare the log entries across the individual servers, hoping to see some sort of pattern that might actually be correlated to the error(s).

There has to be a different way, right?

As mentioned, the problem is one of scale and, in the context of a large solution, of getting correct information about the actual state of the solution. In practice, when you have a medium to large sized solution, not to mention adding Sitecore to the mix, it takes time to manually scan through the log files in order to pin down which server(s) might be causing problems. Moreover, it is very tricky to get an overall picture of how your solution performs as a whole, since you lack a centralized place for gathering and aggregating all information about the current health and performance state of the solution.

Instead, what we want to be able to do is to:

Monitor the performance of the application to make sure that it is healthy
Rapidly diagnose applications or systems that are failing
Monitor live applications, individually or across the entire solution as a whole
Log events that do not necessarily relate to errors in an application

Let me put your mind at ease by saying that all of the above is exactly what health and performance monitoring tools are meant to help you achieve.

The tools

When you decide on using a health and performance monitoring tool, you should be aware that there is a plethora of tools at your disposal, some free, others priced in the range of tens to hundreds of dollars per month. Although this can be quite overwhelming, the trick is to understand which of the available tools deserve a closer look, and how they differ from each other.

To get you started, I've listed some of the tools I've been working with lately to give you an idea of what kinds of solutions are available on the market.

Application Insights

Application Insights is Microsoft's answer to a fully-fledged APM, which can be used to monitor your application(s) while they are running live. Application Insights is aimed at the development team, to help you understand how your app is performing and how it's being used. It will automatically detect performance anomalies and notify the development team about them.

Once configured, Application Insights will monitor the following about your solution:

Request rates, response times, and failure rates
Dependency rates, response times, and failure rates
Page views and load performance
AJAX calls from web pages
User and session counts
Performance counters from your server machines, such as CPU, memory, and network usage
Diagnostic trace logs from your app (so that you can correlate trace events with requests)

As for pricing, this depends on whether you go with Microsoft's cloud-based solution in Azure, or use it on-premises. The cloud-based solution offers two pricing options: Basic and Enterprise. With Basic, you pay based on the volume of data your solution sends to Application Insights, with a free allowance of 1 GB per month. With Enterprise, you pay for the number of nodes that host your application, and you get a daily allowance of data per node. Each node will cost you around $15, and lets you send 200 MB of data per node each day. If you need to send more data than either the 1 GB included in Basic, or the 200 MB per node per day in Enterprise, this will cost you $2.30 per additional GB sent.
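To make the two pricing options a bit more concrete, here is a minimal sketch in Python that estimates a monthly bill from the figures above. The function names are my own, and the sketch rests on a few assumptions that are not part of the official price list: the $15 node price is taken as a monthly price, the per-node allowance is pooled across all nodes, a month has 30 days, and 1 GB equals 1000 MB.

```python
# Rough monthly cost estimate based on the figures quoted in this post.
# Assumptions (not from the official price list): node price is per month,
# the per-node allowance is pooled across nodes, 30 days/month, 1 GB = 1000 MB.

BASIC_FREE_GB_PER_MONTH = 1.0
OVERAGE_USD_PER_GB = 2.30
ENTERPRISE_USD_PER_NODE = 15.0
ENTERPRISE_MB_PER_NODE_PER_DAY = 200.0


def basic_monthly_cost(gb_per_month):
    """Basic: pay only for data beyond the free 1 GB per month."""
    billable_gb = max(0.0, gb_per_month - BASIC_FREE_GB_PER_MONTH)
    return billable_gb * OVERAGE_USD_PER_GB


def enterprise_monthly_cost(nodes, gb_per_month, days=30):
    """Enterprise: pay per node, plus any data beyond the included allowance."""
    included_gb = nodes * ENTERPRISE_MB_PER_NODE_PER_DAY * days / 1000.0
    billable_gb = max(0.0, gb_per_month - included_gb)
    return nodes * ENTERPRISE_USD_PER_NODE + billable_gb * OVERAGE_USD_PER_GB


if __name__ == "__main__":
    # Example: a 4-node solution sending 30 GB of telemetry per month.
    print(f"Basic:      ${basic_monthly_cost(30):.2f}")
    print(f"Enterprise: ${enterprise_monthly_cost(4, 30):.2f}")
```

Under these assumptions, which option comes out cheaper depends mostly on how much data each node sends, so it is worth plugging in your own numbers before deciding.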

I've personally been using the cloud solution for the past 4-5 months, and I've been very satisfied with the results. It requires very little effort to get Application Insights up and running, and once you have data being sent to Application Insights, the different analytics and diagnostics are very easy to use, and you quickly get a very good overview of the health and performance of your solution.

Elmah.io

Although it is one of the older players in the field, Elmah.io is a really nice product, and I recommend that you check it out.

In a nutshell, Elmah.io helps you monitor your solution for crashes, which in turn gives you an overview of the quality of your solution. If an error occurs, you'll get a notification over a variety of communication channels (mail, Slack or such) - heck, Elmah.io even assists you in fixing your bugs by combining error diagnostic information with quick fixes and answers from Stack Overflow (I was quite amazed by this feature when I first used Elmah.io).

Once the Elmah.io bits have been dropped into your solution and configured appropriately, you get the following facilities without changing a single line of your code:

Logging of nearly all unhandled exceptions
A view of the entire log of recorded exceptions
A view of the full details of any one logged exception
An e-mail notification of each error at the time it occurs

The pricing is reasonable, ranging from $17 to $89 a month, and depends on how many log messages you send to Elmah.io each month - not on the amount of data, as is the case for Application Insights.

I've used Elmah.io on different solutions, including a large sized Sitecore solution, and it really helped me get an overview of what was going on in the different logs. However, I should emphasize that Elmah.io only provides you with an overview of the crashes and errors, no more, no less. This is not necessarily a bad thing, as Elmah.io might be enough for you to start getting a better overview of your solution's overall quality. Later on, once you have other needs as well, you may consider switching over to a fully-fledged APM tool.

Other tools worth checking out

Apart from the tools mentioned, I also recommend that you check out the following:

Important: I highly recommend that you check out each of the tools mentioned and review them closely before picking which tool to use. Most of the tools come with a free trial period of around 14 days, which should give you enough time to get a good feel for the tool, and decide if this is something you want to continue with.

Solution X, I'm choosing you! Now, how do I get you to work (nicely) together with Sitecore?

At this point, you've settled on the tool you want to use. Naturally, the next question to ask is: how do you get it working with Sitecore?

The first thing to do is to go over the contributions from the community (see further down below), and check if there is a "ready to use" implementation you can use straight off the shelf. If you can't find such an implementation, you need to implement it yourself. In this case, you have to be aware that there are a few things you need to do when you want to implement custom logging in Sitecore.

The troubles that arise due to Sitecore's log4net implementation

I think it goes without saying that Sitecore's log4net wrapper implementation is known for its limitations. Bas Lijten gives a pretty good summary in one of his blog posts, where he writes:

... Because the log4net implementation is a) outdated and b) being hosted in a Sitecore assembly, it’s not possible to easily use 3rd party solutions for log4net with Sitecore. The 3rd party solutions generally use the newer implementations of the LogEventInformation class (which has been altered over time) and they can’t find the log4net assembly, because it isn’t there ...

In practice, this means that if you try to install a log4net extension created for the tool you've chosen (like the Application Insights log4net appender), you'll quickly see that it won't work once you run your Sitecore solution.

To work around this issue, you have to implement your own log4net appender by inheriting from Sitecore's AppenderSkeleton implementation. From here, the easiest way is to grab the log4net appender you want to use, decompile it, and re-implement it using Sitecore's appender implementation - you can see examples of how this can be done in the different community contributions.

It's expensive to send data to cloud hosted tools

By default, everything in Sitecore is logged, which can be a bit of an issue with cloud-based tools, since they typically restrict the amount of data you can log (or, put another way, you will have to pay the big bucks to keep a log of everything Sitecore logs by default).

As such, you should restrict the log levels so that you only log messages with level WARN and above. To give some perspective, we went down from logging 200,000 log entries to around 2,000, simply by filtering out INFO messages.
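If you want a feel for how much a WARN threshold would save you before changing anything, you can run a quick count over an existing log file. The following is a minimal sketch in Python, not part of Sitecore or any of the tools mentioned. It assumes log lines of the form "thread time LEVEL message" (for example "4512 12:00:03 ERROR Application error."), so adjust the pattern if your layout differs. All names in it are my own.

```python
import re
from collections import Counter

# Severity order used for the threshold comparison.
LEVELS = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3, "FATAL": 4}

# Assumed line layout: "<thread> <HH:MM:SS> <LEVEL> <message>".
LINE_RE = re.compile(
    r"^(?P<thread>.+?)\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>DEBUG|INFO|WARN|ERROR|FATAL)\s+(?P<message>.*)$"
)


def level_of(line):
    """Return the log level of a line, or None for continuation lines."""
    match = LINE_RE.match(line)
    return match.group("level") if match else None


def count_by_level(lines):
    """Count log entries per level (continuation lines are not counted)."""
    return Counter(level for level in map(level_of, lines) if level)


def filter_by_threshold(lines, threshold="WARN"):
    """Keep entries at or above the threshold.

    Lines without a level (such as stack traces) follow the entry above them.
    """
    minimum = LEVELS[threshold]
    kept = []
    keep_current = False
    for line in lines:
        level = level_of(line)
        if level is not None:
            keep_current = LEVELS[level] >= minimum
        if keep_current:
            kept.append(line)
    return kept


if __name__ == "__main__":
    sample = [
        "ManagedPoolThread #0 12:00:01 INFO  Job started: Index_Update",
        "4512 12:00:02 WARN  Timeout while loading item",
        "4512 12:00:03 ERROR Application error.",
        "Exception: System.NullReferenceException",
        "ManagedPoolThread #0 12:00:04 INFO  Job ended: Index_Update",
    ]
    print(count_by_level(sample))
    print(len(filter_by_threshold(sample)), "of", len(sample), "lines kept")
```

Keeping the stack trace lines together with their ERROR entry is a deliberate choice here, since an error without its exception details is rarely useful once it reaches the monitoring tool.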

Community contributions

To help you get started quickly, I've listed the different community contributions I've found available online:

As always, if you have additional details about the content of this blog post, or if you know of other contributions that should make it onto the list, please drop me a note in the comment section below.