Correlation Technology Finds the Root Cause to Performance Bottlenecks
Glassbox is a utility that monitors Java enterprise applications (WAR or EAR). Glassbox identifies thread contention, excess memory usage, and slow database connections. TestMaker integrates with Glassbox to correlate monitored performance problems with TestMaker's load test operations. Used in combination, TestMaker and Glassbox deliver root cause analysis and mitigation of performance bottlenecks in Java enterprise applications.
Frank Cohen: Hi. This is Frank Cohen, CEO and founder at PushToTest and I'm glad to introduce Ron Bodkin who is the Glassbox project leader. PushToTest TestMaker 5.3 is going to integrate Glassbox to be able to enable root cause analysis and mitigation especially at the Java application server level. So Ron is here to present a presentation that he put together for TheServerSide Java Symposium and then I'm going to ask him some questions as he's going through his presentation that I hope will be interesting to you.
Ron Bodkin: Thanks, Frank. Well, I'm certainly glad to be here today and excited about the better integration with PushToTest and Glassbox. We started Glassbox back around four years ago when we were thinking about how to apply Aspect-Oriented Programming to improve performance management and have been developing the software since then. Overall, it's an open source troubleshooter for Java applications so the concept of Glassbox lets you identify problems and failures in your application and pinpoint the root cause of those errors giving you a summary of what's going on, what errors are happening and are not happening. The idea is to have a troubleshooting capability that you can build in each phase of software development from development through, of course, testing and into production so that you'd take an ability to troubleshoot at the development phase and bake it in so you have good monitoring in your application.
Now, when I say "bake it in," one of the nice things about Glassbox is that you don't have to make code changes to start using it. It will integrate with existing applications with low overhead, so you can start taking advantage of it even if you haven't developed your application for it, and you can do a little bit of customization of Glassbox to make it work better for your application. The sense is, though, that it's really useful to start development with a tool like this so you get the advantage of it from the beginning of the development cycle, and you can continue to use it with confidence as you roll out large-scale testing and deployment of the system, knowing it works because it's been built that way. It provides a nice automated troubleshooter that's well worth considering. So we'll look at an introduction to Glassbox and some demos, talk about using the tool and how to extend it, and finish with some conclusions.
So Glassbox gives you an automated diagnosis capability. It looks at the structure of your application, identifies issues such as slowness or failures that are happening, suggests causes, and also identifies areas where the data shows they are not causing problems. And as I noted before, it's useful in development, QA, and production. Some of the key features of Glassbox are these: it features a drop-in installation so you can get up and running in minutes on existing applications without having to make changes to the source code or how the software gets built. You have one-click diagnosis of problems, so the 80% of common issues that you run into frequently and that cause a lot of problems are easy to detect. So Glassbox is focused on the areas where there's the most productivity overhead and not necessarily on catching all the most obscure problems that might occur in an application.
Glassbox learns the application so the user doesn't have to do a lot of configuration. The goal is to get in and make something that's useful out of the box, and you can do a little bit of configuration once you're already getting value from it. It flags service level agreements; we'll talk more about how those work. It gives you clear descriptions with supporting evidence. As I noted before, it can be run as a low-overhead tool: the default mode has little impact on production applications. And it's extensible. You can add application-specific logic to get more details for your application. Typically, you'll do that after you've got it working well on a baseline for your application.
So one of the key things that we find happens in most organizations is this interesting problem. On the one side, you have a lot of enterprise monitoring. You have a lot of ways to say that there are significant problems: applications may be slow, you have complaints from users that people can't log in, you see a business impact such as conversion rates being down. On the other side, you have a lot of technical data from monitoring: there are more bugs in the code, performance is low, the disks are running busy, CPU is high. But the real problem is that there's this actionability gap. There are many things that are potentially worrisome or problematic technically and there are many things that could be a concern from a business standpoint. But how do you thread them together to know which technical things you're seeing are really causing business problems?
A classic example of this comes from one of the first projects we did when we were testing Glassbox. We consulted with an organization that had just hired a new system administrator, who saw an alarming critical error message in the logs and so went to his boss. He talked to the VP of engineering, who routed it through a project manager to the developer for that line of code. This was in a small company, by the way. And the developer looked at the log and said, "Oh, that? Oh, that's nothing." So it's a classic example where you get a lot of alarming-looking technical information and it may in fact turn out not to be critical. So how do you decide which of the maelstrom of information we have about our applications matters?
With that, I'll show you how Glassbox tackles this problem. So the first thing I want to show you what it's like to run Glassbox and how easily you can get up and running. So I'm going to switch over to connecting to a Tomcat on my local server. I'm going to start by just pulling up a sample application that's currently running on this Tomcat. We have a pet store application, just as an example of web application that's running. We'll return to the pet store to see how we can monitor it and see if there are some problems in the installation of the pet store that we can improve. Switch over and take the Glassbox WAR file that I've downloaded. Make a copy of that and come over to the web applications directory and install it.
So now I have deployed this application. Tomcat will be deploying this application for me and I can just go ahead and type it, access the application, and you will see that there is some installer screen that comes up as I run the Glassbox application. So basically, I can deploy the WAR file and then it will let me set up an installation to add some environment variables in a wrapping script for how to run Tomcat or my favorite app server. Glassbox supports a number of application servers including, of course, Tomcat and JBoss and WebLogic and a number of other popular application servers, Jetty and WebSphere. We have people using it with WebSphere as well, of course.
So here it's giving me a few prompts about how I want to set things up. I'm going to pick the defaults to wrap the Catalina scripts, Catalina and Startup and I'm not going to overwrite. There's support for -- if you're in an organization that's running your application server through a custom script that can be accommodated too. So I'm going to click the install button and it's going to go ahead and finish installing this service and it has now actually gone and run that. So what I will do is I will take my Tomcat server that has finished installing and shut it down and I will do a directory and I've now got some new files that got laid down. Specifically, I've got a Startup with Glassbox file that was just installed on the server and I will run that.
Frank Cohen: So aside from installing a WAR file into my application server, do I have to do anything else to get Glassbox up and running?
Ron Bodkin: So what it did in addition to the WAR file is that it created some local configuration files and it created some additional parameters for starting the Java VM for running the server. So we wrote down a wrapper script that basically sets these environment variables that you can see on the screen and calls out to the original Startup script. And now I should have Glassbox monitoring installed in my application server.
Frank Cohen: So I haven’t had to do anything to the pet store application itself to have Glassbox monitoring?
Ron Bodkin: That's right. There's no change in the code at all. I'm just installing some environment variables and jars in the path so that I can get instrumentation through Glassbox. So I'll go ahead and we'll click this verify link just to make sure that we have installed Glassbox correctly, and you can see that in fact that's working. And now I can click on the web client link to have a view of Glassbox showing the information about what it's monitoring on the system. So let that run. You can see that Glassbox is monitoring itself, watching the various operations that it itself is performing, and we'll go ahead and we'll again use the pet store. So we'll click on the pet store link and we will see everybody's favorite sample application. This is a Struts version of the pet store that's using still fairly commonly used patterns for application development, using Struts to connect back to a MySQL database. And we'll go ahead and we'll try out using the pet store. So we're clicking through and we can see that it's a little bit slow as I click on some of the links. So we'll test this out here.
Frank Cohen: But its slowness isn’t because of Glassbox. Its slowness is because pet store itself is running slow.
Ron Bodkin: That's right. We have a version of the pet store that's running a little bit slowly on this machine. And in fact, you can see here that I've switched over to the Glassbox tab and it has observed the various operations the pet store is performing, and when I click on the ViewCategoryAction, it's telling me that in fact it's slow because of a slow database.
Frank Cohen: So ViewCategoryAction is something that the pet store implemented and then Glassbox is observing ViewCategoryAction connecting to the database.
Ron Bodkin: That's right.
Frank Cohen: Cool.
Ron Bodkin: So here I'm going to hit control left-click to open in a new window just so we can see a full page of the analysis of that operation. It's just coming up on the screen now. You can see that the way Glassbox organizes a report on an operation is that it gives you an executive summary of what's going on. You've got a slow operation. It's slow because of the database response and CPU use, and it exceeded a 1 second goal all the time: one out of one time(s) (100%), with an average time of 3.15 seconds. The technical summary gives you a little more detail. So it's now telling you what database it was connecting to. I'm going to highlight that on the screen here so you can see the JDBC connection. And as I get into more details technically, it's going to give me more information like the database that I was trying to connect to, the number of times it tries to connect per operation, the mean time per operation spent connecting, and the CPU use. And indeed it shows me a stack trace of Java code where there's time being spent in the Java code. In this case, it's some startup cost of instrumentation. It also gives me information about what URLs are slow. So when do I see this? You can see the one URL that had a slow request is recorded here. It will also record parameters. In this case, they're just embedded in the HTTP request, but for POST parameters those can also be reported.
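The executive summary described above (a 1 second goal, exceeded one out of one times, with a 3.15 second average) amounts to simple per-operation bookkeeping. The following is a minimal sketch of that idea, not Glassbox's code: the `OperationStats` class, its method names, and the summary wording are assumptions made for illustration.

```java
import java.util.Locale;

/** Hypothetical per-operation bookkeeping; not Glassbox's actual classes. */
class OperationStats {
    private final String name;
    private final long goalMillis;   // service-level goal, e.g. 1000 ms
    private int count;               // requests seen
    private int slowCount;           // requests that exceeded the goal
    private long totalMillis;        // sum of all response times

    OperationStats(String name, long goalMillis) {
        this.name = name;
        this.goalMillis = goalMillis;
    }

    /** Record one completed request and note whether it missed the goal. */
    void record(long elapsedMillis) {
        count++;
        totalMillis += elapsedMillis;
        if (elapsedMillis > goalMillis) {
            slowCount++;
        }
    }

    int slowCount() {
        return slowCount;
    }

    double averageSeconds() {
        return count == 0 ? 0.0 : totalMillis / 1000.0 / count;
    }

    double slowPercent() {
        return count == 0 ? 0.0 : 100.0 * slowCount / count;
    }

    /** Build a one-line summary in the spirit of the report shown in the demo. */
    String summary() {
        return String.format(Locale.ROOT,
            "%s exceeded a %.0f second goal %d out of %d time(s) (%.0f%%) "
                + "with an average time of %.2f seconds",
            name, goalMillis / 1000.0, slowCount, count, slowPercent(), averageSeconds());
    }

    public static void main(String[] args) {
        OperationStats stats = new OperationStats("ViewCategoryAction", 1000);
        stats.record(3150); // the single slow request from the demo
        System.out.println(stats.summary());
    }
}
```

Tracking only a count, a slow count, and a running total keeps the per-request cost to a few additions, which fits the low-overhead goal discussed in the talk.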
Frank Cohen: I'm curious. How does Glassbox know when something is slow?
Ron Bodkin: So Glassbox measures the performance of operations and tracks the resources used, and we'll talk a little bit more about how it uses Aspect-Oriented Programming to instrument the applications to record that information. So it gives you some common solutions, like how you might mitigate slow database calls or CPU use, and identifies other things that have been checked for and are working correctly in this request. So what we'll do now is we'll come back to the pet store demo and we'll go ahead and proceed to checkout and we will go ahead and authenticate. And when we check out we have an order, all of which is working fine. And what we will do is, when we have that, there's a GUI feature that was added to this release for viewing the order with the web service, and we went in and made it an example of how people sometimes don't implement web services as efficiently as they might. So we'll click a slightly modified version of the View Web Service link and, as we can see, it's quite slow. And if we look back at the Glassbox monitoring, what we'll see is that in fact a slow remote call in ViewOrderAction is a problem.
So you can see at the top of the screen you've got a summary of all the operations, sorted by application and by status; the yellow ones are ones that are slow. In this case, it was slow because it spent a lot of time making a remote service call to this order detail URL, and we can scroll down a little bit and get a little bit more information again about that. In this case there weren't any parameters, and it made two calls to that operation. So we can see that there are again some performance problems in the remote calls. And one more thing I wanted to show is, if we go back here and click on this link for a lizard in the application, we'll get an error where it was trying to connect to a database and we put in invalid database credentials. And there we now see that there is an error in Glassbox. It's signified by red that the ViewCategoryAction now is suffering from an error condition, and Glassbox will also tell us about errors in the application. So here we've got a database connection failure for the same database, and if we scroll down we can see it failed for one distinct request. It gives me information like SQL error codes and stack traces that I can expand to see where the error occurred, et cetera.
Glassbox can also identify prepared statements and identify parameters to those as well as URLs for giving you information. So that's a little bit of a summary of how Glassbox can be easily installed and used to monitor an existing application. It's also worth looking at another thing Glassbox provides you: JMX statistics. And so what we'll do here is we'll run JConsole, which is the standard viewer for JMX data that comes with the Java Development Kit, JDK, and we've added Glassbox statistics as a management topic that you can view through this tool. And we'll see here that we've got now a more detailed structure. Let me expand this to be full screen.
You can see the various operations inside of the JPetStore application and if we scroll down we can see inside of the Struts action servlet, we've got a number of actions for different things we're doing in this application. ViewCategoryAction is one of them and we can see that we have a count of failures for the various connections and success, how much time was spent, and indeed inside of them we've got details about queries that were run, time spent preparing them, running them, connecting and fetching results. So we've got some detailed breakdown of how things work. Likewise, for the web services call we have some information on that.
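For readers who have not published statistics over JMX, the sketch below shows the general mechanism that makes counts, failures, and times browsable in JConsole: a standard MBean registered with the platform MBean server. It uses only the standard `javax.management` API. The `ActionStats` bean, its attributes, and the `demo.monitor` object name are hypothetical and are not the names Glassbox itself publishes.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Illustrative only: the bean and object names here are hypothetical, not Glassbox's. */
class JmxStatsDemo {
    /** Standard MBean convention: a public interface named after the class plus "MBean". */
    public interface ActionStatsMBean {
        long getCount();
        long getFailureCount();
        long getTotalMillis();
    }

    public static class ActionStats implements ActionStatsMBean {
        private long count;
        private long failures;
        private long totalMillis;

        /** Record one request: how long it took and whether it failed. */
        public synchronized void record(long elapsedMillis, boolean failed) {
            count++;
            totalMillis += elapsedMillis;
            if (failed) {
                failures++;
            }
        }

        @Override public synchronized long getCount() { return count; }
        @Override public synchronized long getFailureCount() { return failures; }
        @Override public synchronized long getTotalMillis() { return totalMillis; }
    }

    /** Register the bean under an assumed object name, replacing any earlier registration. */
    static ObjectName register(String action, ActionStats stats) {
        try {
            ObjectName name = new ObjectName("demo.monitor:type=ActionStats,action=" + action);
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            if (server.isRegistered(name)) {
                server.unregisterMBean(name);
            }
            server.registerMBean(stats, name);
            return name;
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Read a numeric attribute back through the MBean server, as a JMX client would. */
    static long readLong(ObjectName name, String attribute) {
        try {
            return (Long) ManagementFactory.getPlatformMBeanServer().getAttribute(name, attribute);
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ActionStats stats = new ActionStats();
        ObjectName name = register("ViewCategoryAction", stats);
        stats.record(3150, false);
        System.out.println(name + " Count=" + readLong(name, "Count"));
    }
}
```

In a long-running process such as an application server, a bean registered this way would appear in JConsole's MBeans tab under the chosen domain while the JVM is alive.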
Frank Cohen: So you're seeing like a live profile of the application, but I haven't had to instrument the app itself, and I'm now able to kind of debug what the root causes of these performance slowdowns are?
Ron Bodkin: That's right. So you've got a nice summary of what the problems were and what happened through the web interface, and if you wanted to dive deeper you could use the JMX console to get in and see more precise information and details of exactly what had happened, as we're seeing here. So this is not the tool I would want to use to do my primary monitoring of application performance. But it's a great complement to the nice summary tool that the web application provides.
Frank Cohen: So the web interface to Glassbox provides a nice formatted report and this would be a way to drill down into that report.
Ron Bodkin: That's right.
Frank Cohen: That's great.
Ron Bodkin: So now with that, let's talk a little bit more about using Glassbox. As we saw in the demonstration, the overhead was low as we were running. We didn't really notice a significant increase in time. Typically, once it's up and running, you get about a 1% increase in end-to-end response times.
Frank Cohen: So less than 1% would be suitable for production environments. So you could run Glassbox in production?
Ron Bodkin: That's right. You could run in production.
Frank Cohen: That's great.
Ron Bodkin: Its data capture is focused on slow operations. So the typical approach is that for applications that do things like database queries, there is some amount of time spent on the network performing those operations, so having a little bit of bookkeeping, a little bit of extra work to record what's going on, is negligible overhead. It's not heavily instrumented. It's not instrumenting all the Java code in the application. Instead, to complement monitoring known places where you're using extra resources, Glassbox is doing some stack sampling on Java 5 and later VMs, so that every hundred milliseconds it captures stack frames to see where code is running, so it can identify places -- unknown places that were slow but were not expected in advance or instrumented --
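The stack sampling idea (capture stack frames every hundred milliseconds and see where code tends to be running) can be sketched with the standard `Thread.getAllStackTraces()` call. This is an illustration of the general technique under simplifying assumptions, not Glassbox's sampler: the `StackSampler` class is hypothetical, it counts only the top frame of each runnable thread, and it uses Java 8 syntax for brevity.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sampler that counts how often each method is on top of a runnable thread. */
class StackSampler {
    private final Map<String, Integer> hits = new HashMap<>();

    /** Count the top frame of one captured stack; empty stacks are ignored. */
    synchronized void record(StackTraceElement[] frames) {
        if (frames.length == 0) {
            return;
        }
        String top = frames[0].getClassName() + "." + frames[0].getMethodName();
        hits.merge(top, 1, Integer::sum);
    }

    /** Take one sample of every runnable thread except the sampling thread itself. */
    void sampleOnce() {
        Thread self = Thread.currentThread();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            if (e.getKey() != self && e.getKey().getState() == Thread.State.RUNNABLE) {
                record(e.getValue());
            }
        }
    }

    /** Return the most frequently sampled frame, or null if nothing was recorded. */
    synchronized String hottest() {
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Stand-in for application code that unexpectedly burns CPU. */
    static void burnCpu() {
        double x = 0;
        while (true) {
            x += Math.sqrt(x + 1);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(StackSampler::burnCpu, "worker");
        worker.setDaemon(true);
        worker.start();

        StackSampler sampler = new StackSampler();
        for (int i = 0; i < 20; i++) {
            Thread.sleep(100);          // sample every hundred milliseconds
            sampler.sampleOnce();
        }
        System.out.println("Most frequently sampled frame: " + sampler.hottest());
    }
}
```

Because sampling looks at whatever happens to be running, it can surface hot spots that nobody thought to instrument in advance, at a cost proportional to the sampling rate rather than to the amount of code.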
Frank Cohen: So it's not just going to focus on like database connectivity or memory or threads. I mean it can adapt to the application itself.
Ron Bodkin: Definitely. It can learn --
Frank Cohen: That's great.
Ron Bodkin: Thank you. It learns by looking at what code is running to see where time is being spent. Glassbox uses load-time weaving, which has little effect on end-to-end speed. It does slow down startup, so class loading is typically about 40% slower when you initialize the application, and, depending on how much memory the application uses, there is about 20% memory overhead. On the AspectJ load-time weaving, I know that the AspectJ team is again looking at optimizing this. Glassbox has a somewhat tailored, optimized release of AspectJ load-time weaving to minimize memory overhead, and that's why we think we typically hit about a 20% memory overhead. For cases, though, where people want to optimize further, it's definitely possible to run Glassbox in a build-time weaving mode as well, which removes all of that overhead of initialization and instrumentation and removes most of the memory overhead as well.
Frank Cohen: Let me make sure I understand this. So if I was running Glassbox in a production environment then I could expect a 20% memory overhead but then the performance itself would be negligible as the app is running. It's just when the app instantiates its objects for the first time like when the app server is coming up that I'd see a 40% slow down in just that app coming up.
Ron Bodkin: Yeah, that's right.
Frank Cohen: That's great.
Ron Bodkin: Thanks.
Frank Cohen: And you've mentioned the term "aspects" quite a bit. Are you going to talk at all about what those mean?
Ron Bodkin: I will, yeah. So Glassbox is open source. It uses AOP, which we'll talk more about, as well as JMX data to capture what's going on in the application. It captures that data and then feeds it to an analysis layer that identifies common problems, summarizes resource use, et cetera, to be able to report sensibly about what's going on through the AJAX web client that's being updated as the application runs. It's open source under an LGPL license and it works with Java 1.4 and later.
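To give a feel for what "capturing what's going on" without touching application code looks like, the sketch below times every call through an interface. It deliberately swaps in a JDK dynamic proxy as a stand-in for AOP "around" advice; Glassbox itself, per the talk, uses AspectJ weaving, which this example does not attempt to reproduce. `OrderService`, `monitor`, and the two maps are hypothetical names.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Stand-in for "around" advice using a JDK dynamic proxy. All names are hypothetical. */
class TimingProxyDemo {
    /** Example application interface; its implementation knows nothing about monitoring. */
    interface OrderService {
        String viewOrder(int id);
    }

    /** Number of calls and total nanoseconds observed, keyed by method name. */
    static final Map<String, AtomicLong> CALLS = new ConcurrentHashMap<>();
    static final Map<String, AtomicLong> NANOS = new ConcurrentHashMap<>();

    /** Wrap target so every call through the interface is counted and timed. */
    static <T> T monitor(Class<T> type, T target) {
        InvocationHandler handler = (proxy, method, args) -> {
            long start = System.nanoTime();
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause();   // rethrow the application's own exception
            } finally {
                String key = method.getName();
                CALLS.computeIfAbsent(key, k -> new AtomicLong()).incrementAndGet();
                NANOS.computeIfAbsent(key, k -> new AtomicLong())
                     .addAndGet(System.nanoTime() - start);
            }
        };
        Object proxy = Proxy.newProxyInstance(
            type.getClassLoader(), new Class<?>[] {type}, handler);
        return type.cast(proxy);
    }

    public static void main(String[] args) {
        OrderService real = id -> "order-" + id;
        OrderService monitored = monitor(OrderService.class, real);
        System.out.println(monitored.viewOrder(42));
        System.out.println("viewOrder calls: " + CALLS.get("viewOrder"));
    }
}
```

A proxy only sees calls made through the wrapped reference, whereas weaving can intercept calls inside existing classes; that difference is why a weaver is the better fit for monitoring an application that was not written with monitoring in mind.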
You saw some of the problems that Glassbox can solve. What are the kind of 80% problems that it solves? It can identify database problems like slow queries and failures; connection problems; remote calls, including web services, EJB, even AJAX requests; and thread contention and bottlenecks, where people are over-synchronizing methods or requests are otherwise being locked up. That's a good example of a kind of problem that load testing tools like PushToTest are great at helping you identify early in the cycle. Glassbox can help you find out why this thing is slow. Oh, because all the threads are waiting to take their turn, right? Clicking through an application won't tell you that, but a load test will. It also covers a variety of other resource problems like JavaMail and FTP and, more generally, some problems with Java code as determined by stack sampling.
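The "threads waiting to take their turn" case can be observed with the JDK's own management API. The sketch below is an assumption-laden illustration rather than Glassbox's detection logic: it holds a lock, lets a second thread try to enter the same synchronized block, and reads that thread's blocked count from the standard `ThreadMXBean`. `ContentionDemo` and its method are hypothetical names.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical demo: observe monitor contention through the standard ThreadMXBean. */
class ContentionDemo {
    private static final Object LOCK = new Object();

    /**
     * Holds LOCK while a second thread tries to enter the same synchronized block,
     * then reports how many times that thread has blocked on a monitor so far.
     */
    static long blockedCountWhileLocked() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Thread waiter = new Thread(() -> {
            synchronized (LOCK) {
                // stands in for an over-synchronized application method
            }
        }, "waiter");
        waiter.setDaemon(true);

        synchronized (LOCK) {
            waiter.start();
            // Wait until the second thread is actually stuck behind the lock.
            while (waiter.getState() != Thread.State.BLOCKED) {
                Thread.yield();
            }
            ThreadInfo info = threads.getThreadInfo(waiter.getId());
            return info == null ? 0 : info.getBlockedCount();
        }
    }

    public static void main(String[] args) {
        System.out.println("Times the waiter blocked: " + blockedCountWhileLocked());
    }
}
```

Under a load test, many request threads showing rising blocked counts on the same lock is the kind of signal that points to over-synchronization rather than to slow queries or remote calls.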
So Glassbox won't do everything. It won't solve problems that haven't happened yet. It's not trying to predict problems. You want to have something actionable. Usually, we have enough trouble solving the things that are already going wrong without forecasting things that might go wrong. It won't diagnose every problem; the goal is to clear out the common things that are easy to identify and minimize the time wasted on those, so you can focus talented specialists on the things that actually are hard to solve. It doesn't provide data crunching tools and it's not a workflow tool. Our view is that Glassbox, by providing a shared web server that anyone in the organization can look at to have a common view of the problem with a clear description of what's going on, is doing a lot to enable people to work better together without replacing tools that allow you to assign tickets and so forth. There are some great tools out there for managing those, and instead what's needed is a common understanding of problems.



