Aggregated monitoring of BizTalk solutions using "BizMon"
January 6th, 2009
Update 2009-08-11: This project turned out to be far bigger and more complicated than I first expected (ever heard that before?). Because of that, and because we wanted a company behind it that could offer full-time support and stability, “BizMon” has been released as a commercial product that you can find here.
I’d love to get some help from you to test it and make it as good as possible. Even though it is commercial and costs money, we have a free alternative for small environments and we work hard to keep the license price as low as possible.
Update 2009-02-25: In the original post I said I’d post more on the architecture and the code during February 09. I’m however currently struggling to get the needed legal rights etc. to be able to talk further about the "BizMon" solution. It was harder than I thought … I’ll get back to posting on the subject as soon as I have that sorted.
Integration of enterprise processes often ends up being very business critical. If an integration fails to deliver the messages it was supposed to, it usually means the business will be affected in a very negative way (for example losing money or delivering bad service). That of course means that monitoring the status of the integrations soon becomes very important (if you’re not into getting yelled at or potentially losing your job).
Strangely enough, BizTalk Server 2006 R2 in my humble opinion doesn’t come with the right tool to efficiently monitor big enterprise integration solutions!
What do I mean by monitoring?
Before I get myself deeper into trouble I’d like to define what I mean by monitoring. I think monitoring a BizTalk integration solution could be divided into four categories.
- Infrastructure (traditional)
This is the easy one and one that IT-pros and the like are used to monitoring. Hardware status, network traffic, disk space, event logs etc all fall under this category. If the storing area for the databases starts running low on memory we can be pretty sure it’ll eventually affect the integration somehow.
- BizTalk infrastructure
This is where it starts getting a bit trickier. This category includes the status of receive locations, orchestrations, host instances and send ports. If a receive location is down no messages will be picked up (but we can also be sure of not getting any suspended messages).
- Suspended messages
As most readers of this blog probably know, a suspended message is due to some sort of failure in BizTalk. It can be an actual exception in code or something that went wrong while trying to send messages. It’s however an important category to monitor.
- Heartbeat (monitoring actual successful traffic)
While points 1-3 above focus on errors and on things being inactive, this category actually monitors that the integration runs as expected.
To me this final point is almost the most important one. What I mean is that if everything runs as expected and we’re sending the expected amount of messages at the right pace everything else must be ok – right? It’s however the one that in my experience is almost always overlooked!
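The heartbeat idea can be sketched in a few lines. The following is only an illustration in Python and not code from "BizMon": the function name `is_heartbeat_ok`, the `max_silence` threshold and the example timestamps are all assumptions made for the sake of the example. It simply treats an integration as healthy when its last successful message is recent enough.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_heartbeat_ok(last_message_at: Optional[datetime],
                    now: datetime,
                    max_silence: timedelta) -> bool:
    """Return True if traffic was seen recently enough.

    Hypothetical helper: an integration counts as healthy only when its
    most recent successful message is no older than `max_silence`.
    """
    if last_message_at is None:
        # No traffic has ever been recorded, so there is no heartbeat.
        return False
    return now - last_message_at <= max_silence


# Example: an integration expected to pass at least one message per hour.
now = datetime(2009, 1, 6, 12, 0)
recent = is_heartbeat_ok(datetime(2009, 1, 6, 11, 30), now, timedelta(hours=1))
stale = is_heartbeat_ok(datetime(2009, 1, 6, 9, 0), now, timedelta(hours=1))
```

A real check would probably need a different threshold per integration, since a nightly batch and a busy order flow have very different normal paces.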
"What do you mean ‘Not the right tools to monitor’? We have loads of tools in BizTalk 2006 R2!"
OK. So let’s see what tools we have available to actually monitor the categories above.
- Infrastructure (traditional)
I won’t discuss this kind of monitoring in this post. There are loads of tools for this (everything from the huge, expensive enterprise ones to plenty of good open-source alternatives) and you’re probably using one or several of them already.
- BizTalk infrastructure
There are a couple of ways of achieving this. One of them is to use the Microsoft BizTalk Server Management Pack for Operations Manager. It does however of course require that you have invested in System Center Operations Manager already …
Another way is to either use the ExplorerOM classes or connect directly to the BizTalk configuration database and code your own report of some sort.
The final way (and the most common one in my experience) is to try and document the correct configuration and settings and then have someone check these manually (if you’re that person I feel for you …).
- Suspended messages
Suspended messages are of course very important to monitor and they are for some reason also the first thing developers think of monitoring when developing BizTalk integrations (maybe because they’re similar to traditional exceptions in software). Here, too, there are a couple of different ways to solve the problem.
Microsoft BizTalk Server Management Pack for Operations Manager, mentioned above, has the functionality to monitor and alert on suspended messages.
BizTalk Server fires the MSBTS_ServiceInstanceSuspendedEvent WMI event every time a service instance gets suspended. It’s fully possible to write a service that watches for this event and then for example sends some sort of alert. Darren Jefford has an example of how to do something like that in this post.
In BizTalk 2006 Failed Message Routing was introduced. This gives the developer the possibility to subscribe to suspended messages. These can then for example be sent out to the file system or written to a database. The Exception management component of Microsoft ESB Guidance for BizTalk Server 2006 R2 uses this approach. The problem with this approach is however that the message is moved out of BizTalk and one loses all the built-in possibilities of resending it etc.
- Heartbeat (monitoring actual successful traffic)
As I said before I think this is a very important metric. If you can see that messages travel through BizTalk at a normal rate things must be pretty ok – right? Without doing too much coding and developing your own pipeline components for tracking etc there are two options.
The first one is of course using the Health and Activity Tracking tool (HAT). This shows a simple view of received, processed and sent messages. I hate to say it but the HAT tool is bad. It’s slow, it’s hard to use, it’s hard to filter information, it times out, it doesn’t aggregate information, it’s basically almost useless … (Just to make one thing clear: I make my living working with BizTalk and I really enjoy the product but tracking and monitoring is really one of its ugly sides. I hate to say it.)
The other option is to develop a simple BAM tracking profile to monitor the send and receive ports of the different processes.
So to repeat what I said earlier: no, I don’t think BizTalk comes with the right tool to monitor integration solutions. I do however think that the platform has the capabilities needed to create something that could close that gap in the product.
What I need!
Much of what’s discussed in this post can be solved using the BizTalk Administration Console (to manually monitor BizTalk infrastructure status) or the Health and Activity Tracking tool (to manually monitor traffic). The aim of this post is however to discuss the possibilities of using this information, aggregating it and giving the persons responsible for monitoring integrations a dashboard that shows the current status of all integrations within the enterprise.
The dashboard monitor application needs the following main features.
- In one single screen, give an overview of the overall status of all the integrations. By status I mean whether there are ports, orchestrations or host instances that should be running but aren’t, or whether there is any suspended traffic on that particular integration.
- The possibility to show detailed information for a specific integration: which artifacts (ports, host instances etc) are/aren’t running, how much traffic has been sent/received via the integration, when traffic was sent/received and whether there are any suspended messages on the integration.
- The option to exclude specific artifacts from monitoring (for example receive locations that are usually turned off etc).
- Setting up monitoring by for example email, and also defining which integrations are included in one specific monitoring (different persons are usually responsible for monitoring different integrations).
Introducing "BizMon"
Based on the needs and "requirements" above I’ve started developing an application. The idea is to release it as open-source as soon as I get to a first stable version (I’d be very interested in help on the practical details of how to do so). For now I’ll demonstrate it by showing some screenshots. The application is a web application based on ASP.NET MVC.
Screenshot: "Applications" page
The above image shows a screenshot of the start page of the BizMon-application, which shows the aggregated status of the entire BizTalk group it’s connected to. The application is built to monitor one BizTalk group and the shown page displays all applications within that BizTalk group.
In the example image the two first rows have an OK status. That means that all of the monitored artifacts (receive locations, send ports, orchestrations and host instances) within that application are in a running and OK status.
The yellow line on the YIT.NO.Project-application indicates a warning. That means that all the artifacts are in an OK status but there are suspended messages within that application. The red line indicates that one or more of the monitored artifacts are in an inactive status.
Each application row shows when the last message on that application was received and/or sent. It also shows how many suspended messages exist and when the last message got suspended.
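The colour rules described for the "Applications" page boil down to a small decision function. The sketch below is in Python purely for illustration (the application itself is ASP.NET MVC), and the names `Status`, `Artifact` and `application_status` are made up for this example rather than taken from the BizMon code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Status(Enum):
    OK = "ok"            # all monitored artifacts running, nothing suspended
    WARNING = "warning"  # yellow row: artifacts fine, but suspended messages exist
    ERROR = "error"      # red row: at least one monitored artifact is inactive


@dataclass
class Artifact:
    """A monitored receive location, send port, orchestration or host instance."""
    name: str
    running: bool


def application_status(artifacts: List[Artifact], suspended_count: int) -> Status:
    """Aggregate artifact states and suspended messages into one row status."""
    if any(not artifact.running for artifact in artifacts):
        return Status.ERROR
    if suspended_count > 0:
        return Status.WARNING
    return Status.OK


# Hypothetical artifacts for one application.
ports = [Artifact("ReceiveOrders", True), Artifact("SendOrders", True)]
all_good = application_status(ports, suspended_count=0)
with_suspended = application_status(ports, suspended_count=3)
port_down = application_status(ports + [Artifact("SendInvoices", False)], suspended_count=0)
```

Checking for inactive artifacts first means a stopped port always wins over suspended messages, which matches the red row taking precedence over the yellow one.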
Screenshot: "Application-detail" page
When clicking on an application on the main page previously shown, the application-detail page is displayed for that application. This page shows detailed information on each of the artifacts within that application. It also shows suspended messages and the date and time of the last suspension.
It also displays a graph showing how many messages have been processed by each of the ports. Currently the graph can view data from the last 7 days. In the screenshot above data from the 6th of January is shown, and as it’s set to display data for a specific day the data is grouped by the hours of that day. It’s also possible to view the aggregated data from all the traced days as shown below. When viewing data from all days the graph is grouped by days.
(The graph only shows data from the 6th of January as this is from a test environment and there was no traffic on the previous days, but I’m sure you get the idea …)
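The two graph modes, hours for a single day and days for the whole period, can be illustrated with a short grouping routine. This is a hedged Python sketch rather than the actual implementation; the function name `group_message_counts` and the label formats are assumptions, and in the real application the timestamps would come from the tracking data rather than a hard-coded list.

```python
from collections import Counter
from datetime import date, datetime
from typing import Dict, Iterable, Optional


def group_message_counts(timestamps: Iterable[datetime],
                         day: Optional[date] = None) -> Dict[str, int]:
    """Count messages per hour of `day`, or per day when no day is given."""
    if day is not None:
        # Single-day view: keep only that day's messages and bucket by hour.
        keys = (f"{ts.hour:02d}:00" for ts in timestamps if ts.date() == day)
    else:
        # All-days view: bucket every message by its calendar day.
        keys = (ts.date().isoformat() for ts in timestamps)
    return dict(sorted(Counter(keys).items()))


# Made-up processing times for one port.
processed = [
    datetime(2009, 1, 5, 23, 50),
    datetime(2009, 1, 6, 8, 5),
    datetime(2009, 1, 6, 8, 40),
    datetime(2009, 1, 6, 14, 15),
]
per_hour = group_message_counts(processed, day=date(2009, 1, 6))
per_day = group_message_counts(processed)
```

With the sample data, `per_hour` holds two messages in the 08:00 bucket and one in 14:00, while `per_day` holds one message for the 5th and three for the 6th.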
Screenshot: "Application-detail" page with inactive artifacts
This final page shows details of an application with some inactive artifacts. The small cross highlighted by the arrow in the image shows the possibility to filter out a single artifact from monitoring. If an excluded artifact is failing the overall status of the application will still be OK and no alerts will be sent.
Help!
I’d love to get some input and feedback on all this. What do you think could be useful, what do you think won’t? Do you know of something similar, how do you solve this kind of monitoring?
I’d also like to know of any suitable places to publish the code as an open-source project, or is the best thing to just release it here on the blog? What do you think? Use the comments or send me a mail.
What’s next?
I have a few things left on the alerts part of the application and then I’ll release a first version. I’m hoping that could happen at the end of February 09 (look at the update at the top of the post). Make sure to let me know what you think!
I’ll publish a follow-up post discussing the technical details and the architecture more in detail shortly.
January 7th, 2009 at 8:17 am
Great comprehensive post as always. Nice one.
How much out of the box functionality are you using? One could argue that you could for example use the BAM tracking to log things like suspended messages thus enabling you to also use alerts and subscriptions from the out of the box functionality, as well as having API’s at the ready. But then again, since you already have a history, you must be using either BAM, or HAT data, or a custom storage solution of some kind.
The next obvious step when looking at a monitoring solution after finding out that something went wrong is of course to look at what went wrong. Thus looking at details for suspended messages. That would be a logical evolution. We generally solve this kind of monitoring using the kind of BAM tracking profile you suggested, and based on the scenario, tie the suspended event or the failed message routing to that as well.
Another interesting approach with this would be to merge this with the ESB Guidance portal. As that already contains much value adding functionality and has a set frame to work from. Having value adding functionality is all good, but it’s even better when you don’t have it in many different places – which was also one of your additional pain points.
When you start to add other kind of measures and statistics (like KB, No of Msgs, etc.) as well is when it gets really interesting. We’ve previously added things like SLA monitoring to the ESB Guidance portal using data from BAM tracking profiles for example.
I’d be interested in evolving this further with you (in whatever form) once you publish it (which you should do to Codeplex imo).
January 7th, 2009 at 9:05 am
Great post, I also think that the HAT tool just sucks.
January 7th, 2009 at 9:29 am
@Johan Hedberg: Thanks!
I’ll try and publish a post discussing the architecture shortly but yes you’re right, all the history is based on a generic BAM tracking model. One of my goals is to make this as easy as possible to configure. I’ve worked with similar solutions that required adding pipeline components to all my pipelines etc and that’s not working for me. This should act as an extra layer on top of your existing solution, just as BAM works. But as you know BAM requires a lot of configuration so I somehow have to make that easier, but that definitely has lower priority and will be something I’ll try to get in there in later versions.
When it comes to adding extra error information from suspended messages etc I’m not sure. My idea is to create a tool that aggregates information and could tell support persons that “something is wrong and someone needs to look at this”. For all the detailed information we have the existing tools … I don’t want to replace something, just aggregate and collect the information into a dashboard. But that’s just my idea and could change (hey, this is open-source)! Love the input!
I haven’t looked that much into the ESB Guidance portal (just the exception management) but it looks interesting. All this is however built on ASP.NET MVC, jQuery, LINQ and I don’t think it’d be that easy to merge the two. It could be interesting to look into further down the road.
What’s important to me is that we create something that could exist on top of any BizTalk solution, based on ESB or not, and that it’s dead simple to configure and set up.
I’ll get back to you as soon as I get a stable first version up on codeplex!
January 7th, 2009 at 4:58 pm
I can definitely see the usefulness of something like this. HAT is flat-out miserable to use. Admin Console can be sluggish at times, and failed message routing in 2006 is better, but not ideal. We have implemented ESB-like cross-system consolidated exception logging, so there is yet another source of diagnostic info we look at.
So jumping from place to place, system to system to diagnose a problem is one of those tasks that one takes for granted because that is what we have been given by MS, and “that’s how it has always been”. I’m always open to 3rd party solutions that can make life easier. (case in point – I am a big fan of the BizTalk Deployment Framework on Codeplex)
I’m anxious to see what comes out of this initiative. Good luck.
January 8th, 2009 at 1:13 pm
Richard,
Just joining the chorus here, and cheering for a release. I’ll add to the code if I have the skills required.
Henrik
January 9th, 2009 at 10:56 am
Richard,
Just came across this post and need to go through in more detail, but this is a great idea.
I think CodePlex is a good place to host this.
Regarding the architecture, if you adopt (or have adopted) a model where all the information aggregation is done by UI independent components (and even available through say, webservices) , then as Johan suggested, it could even be plugged into the ESB Portal or other portals (including say WPF clients).
I’ll send more feedback soon.
Regards,
Benjy
January 9th, 2009 at 11:27 am
@Santosh
Yep, all the aggregation, status checking etc will be done by a separate project and those dll:s will be GAC:ed. These assemblies will be used both by the web GUI layer (as shown in the post) but also by a windows service handling sending all the alerts etc and could of course be used by other clients as well!
January 23rd, 2009 at 10:35 pm
Looks really good, and I’m curious to see your generic activity, although I’ve read your related posts. Are you providing any tooling for enabling tracking for ports or are you relying on TPE?
Looking forward to see you post it on CodePlex!
//Mikael
February 12th, 2009 at 10:39 pm
Hi the tools look great and i would love to contribute if required, our project needs such tool cuz its very difficult to go everytime on prod box to see the biztalk admin console for health monitoring and if such tool is available we can access it from our own system, i would suggest to include some alert functionality.
February 13th, 2009 at 9:27 am
@Mikael Håkansson: First I actually thought about releasing it without any support for setting up TPE-profiles and that users would have to use TPE-tool for that. But as I’ve seen my internal users struggle with this I’ve decided to at least look into using the approach you have in your code and see if I can support this within the actual tool somehow.
That would also make it easier to actually create one database model per application and not have one big activity for all. I think one big activity for all applications wouldn’t really scale that well performance-wise once we use it on some of our more traffic-intensive integrations.
@Arihant Jain: Yes, I’ve just finished the alerts functionality. It’ll be in there. You can set up subscriptions and have a list of emails connected to each subscription. A subscription can then monitor x number of applications. An email is sent out as soon as something fails or a message gets stuck. Another email is sent out once the subscription doesn’t contain any errors anymore. An NT Service is used to do the checking and the sending of email.
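The subscription behaviour described here is essentially a two-state machine: send one kind of email when a subscription goes from healthy to failing, and another when it goes back. Below is a rough Python sketch of that logic, not the real NT Service code. `Subscription`, `check_subscription` and the returned labels are names invented for this example, and alerting only on the state change (rather than on every new failure) is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Subscription:
    """A list of recipients watching a set of applications."""
    name: str
    emails: List[str]
    applications: List[str]
    has_errors: bool = False  # state remembered from the previous check


def check_subscription(sub: Subscription, failing: Dict[str, bool]) -> Optional[str]:
    """Return which email to send ("alert" or "recovered"), or None.

    `failing` maps application names to True when something has failed or a
    message is suspended. Only a change of state triggers an email.
    """
    now_failing = any(failing.get(app, False) for app in sub.applications)
    previously_failing = sub.has_errors
    sub.has_errors = now_failing
    if now_failing and not previously_failing:
        return "alert"
    if previously_failing and not now_failing:
        return "recovered"
    return None


# Hypothetical subscription checked three times by a polling service.
sub = Subscription("order-team", ["ops@example.com"], ["Orders", "Invoices"])
first = check_subscription(sub, {"Orders": True})    # failure appears
second = check_subscription(sub, {"Orders": True})   # still failing, no new email
third = check_subscription(sub, {"Orders": False})   # errors cleared
```

In a polling service, the returned label would decide which message goes to everyone in `sub.emails`.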
February 27th, 2009 at 4:00 am
That’s really cool, let me know when you are releasing its beta so that I can try to use it in my project, evaluate it and suggest more inputs to you.
April 14th, 2009 at 3:48 pm
Great Post, looks like a very very good tool!
Is there any more activity on it?
Hopefully you’ll get over the legal issues….