Eclipse or SUNrise...

...JAVA for sure

Wednesday, January 25, 2012

IBM's WAIT analysis tool

Today I would like to write a few words about a tool that can be very helpful for JVM analysis. It is called WAIT, which stands for Whole-system Analysis of Idle Time. IBM makes the tool available on the Internet, and the good news is that it is free! More good news: it can analyze not only IBM's JVM but also JVMs from other vendors, so Open Source is welcome (yay!!).

There are, however, a couple of problems with it, but I'll leave them for later. First, what can it do and how does it work?

First of all, the tool analyzes the standard memory and thread dumps generated by the JVM, so it can see what the JVM is busy with (for example I/O operations or HTTP traffic). In that respect it won't tell us more than another free tool from IBM, the IBM Support Assistant (ISA), can. But the great thing about WAIT is that it processes multiple dump files at once and creates a single report with all of them lined up on a time-line, so we can see how the JVM load changed over time. To be honest, I never liked the ISA tool. In my opinion it is buggy, slow and unstable. It can drill really deep into the analyzed files, but it is not easy to use. WAIT won't show you as much as ISA, but in most cases you are interested in overall performance and the bottlenecks, so let's see how WAIT works.

To let the tool analyze multiple files, there is a simple script for each platform (Linux/AIX, Windows and z/OS) that you download and copy to your environment. You then run it against the PID of your JVM process, and it starts triggering JVM dumps (it does this with the kill -3 command). When you think the script has gathered enough files, you simply hit Ctrl+C to terminate it. The script is prepared for this interruption: when it receives the signal, it compresses all the gathered files into a single archive. Below is a sample run of the script; 290990, 327682 and 376870 are the PIDs of my JVM processes:

./waitDataCollector.sh 290990 327682 376870
Switching to bash

WAIT data collector!
-------------------
Collector version 7.0
Collecting data for PIDs: 290990 327682 376870
Sleep time between java cores: 30
Number of iterations to collect: 300
Sleep time between ps invocations: 8
Raw data being stored in is in /tmp/waitCollectionData.CollectorPid_405602

Found websphere log directory for PID 290990: [/ibm/WebSphere/AppServer/profiles/xxx/logs/xxx_Srv01]

Found websphere log directory for PID 327682: [/ibm/WebSphere/AppServer/profiles/xxx/logs/nodeagent]

Press CTRL-C to stop collection and gather data


Triggering snapshot 1: ( 20120118 13:21:35:%N GMT)
Triggering kill -3 for 290990
Triggering kill -3 for 327682
Triggering kill -3 for 376870

Triggering snapshot 2: ( 20120118 13:22:09:%N GMT)
Triggering kill -3 for 290990
Triggering kill -3 for 327682
Triggering kill -3 for 376870
Collected 3 IBM javacores

Triggering snapshot 3: ( 20120118 13:22:42:%N GMT)
Triggering kill -3 for 290990
Triggering kill -3 for 327682
Triggering kill -3 for 376870
Collected 3 IBM javacores

Triggering snapshot 4: ( 20120118 13:23:15:%N GMT)
Triggering kill -3 for 290990
Triggering kill -3 for 327682
Triggering kill -3 for 376870
Collected 3 IBM javacores

Triggering snapshot 5: ( 20120118 13:23:47:%N GMT)
Triggering kill -3 for 290990
Triggering kill -3 for 327682
Triggering kill -3 for 376870
Collected 3 IBM javacores

Triggering snapshot 6: ( 20120118 13:24:20:%N GMT)
Triggering kill -3 for 290990
Triggering kill -3 for 327682
Triggering kill -3 for 376870
Collected 3 IBM javacores

Triggering snapshot 7: ( 20120118 13:24:53:%N GMT)
Triggering kill -3 for 290990
Triggering kill -3 for 327682
Triggering kill -3 for 376870
Collected 3 IBM javacores

Triggering snapshot 8: ( 20120118 13:25:25:%N GMT)
Triggering kill -3 for 290990
Triggering kill -3 for 327682
Triggering kill -3 for 376870
Collected 3 IBM javacores

Triggering snapshot 9: ( 20120118 13:25:58:%N GMT)
Triggering kill -3 for 290990
Triggering kill -3 for 327682
Triggering kill -3 for 376870
Collected 3 IBM javacores

Triggering snapshot 10: ( 20120118 13:26:31:%N GMT)
Triggering kill -3 for 290990
Triggering kill -3 for 327682
Triggering kill -3 for 376870
^CCollected 3 IBM javacores

Zipping up wait data found in /tmp/waitCollectionData.CollectorPid_405602
Trying gzip
Collected data stored in waitData.tar.gz
Collected 30 javacores total
Cleaning up raw data dir from [/tmp/waitCollectionData.CollectorPid_405602]

Please submit waitData.tar.gz to the wait server to see a WAIT report
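The run above boils down to a small loop: signal every PID, sleep, repeat until the iteration limit or Ctrl+C, then pack the results. Here is a minimal Python sketch of that idea, not the actual IBM collector. The function names (trigger_dumps, collect, archive) and their parameters are my own, the defaults simply mirror the values printed in the log, and the step where the real script locates the javacores written by each JVM is left out. It assumes a Unix-like system, where kill -3 corresponds to SIGQUIT.

```python
import os
import signal
import tarfile
import time


def trigger_dumps(pids, send=os.kill):
    """Send SIGQUIT (the signal behind `kill -3`) to each PID.

    Returns the PIDs that were signalled successfully.
    """
    signalled = []
    for pid in pids:
        try:
            send(pid, signal.SIGQUIT)
            signalled.append(pid)
        except ProcessLookupError:
            # The process no longer exists; skip it and carry on.
            pass
    return signalled


def collect(pids, interval_s=30, iterations=300, send=os.kill, sleep=time.sleep):
    """Trigger snapshots until the iteration limit is reached or Ctrl+C is hit.

    Returns the number of snapshots that were triggered.
    """
    taken = 0
    try:
        for _ in range(iterations):
            trigger_dumps(pids, send=send)
            taken += 1
            sleep(interval_s)
    except KeyboardInterrupt:
        # Ctrl+C is the expected way to stop; fall through to the return.
        pass
    return taken


def archive(raw_dir, out_path="waitData.tar.gz"):
    """Pack a directory of collected files into one gzip-compressed tar archive."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(raw_dir, arcname=os.path.basename(raw_dir))
    return out_path
```

Calling collect([290990, 327682, 376870]) and then archive(...) on the directory holding the dumps would follow the same flow as the log. The send and sleep parameters exist only so the loop can be tried out without signalling real processes.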

Simple? Sure it is, and it should be! It is great that you don't have to install any additional software on your machines. I tried it on an AIX system and had no problems. Now all you have to do is upload the file to the WAIT site.

OK, so what does the analysis look like? Here is an example from my work. I configured the script to grab a snapshot every 10 seconds and left it running for 5 minutes to get a better overall picture. It was a WebSphere Process Server JVM version 5. After uploading, you will see a dynamic view of this kind:

As you can see, there are 3 graphs with a common time-line: the first shows CPU utilization, the second shows the number of running threads in the JVM, and the last shows how the threads are utilized (for example, whether they are working or waiting). Below the graphs there are additional panels that drill deeper into other aspects; if you click on a chart, they display what was going on at that moment. The second image shows this panel:

I don't want to dig into the analysis itself in this post; I just marked the interesting things you can see on the charts. I'll deal with it in another post.
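To get a rough feel for the kind of data behind the thread-utilization graph, you can count thread states per javacore yourself. The Python sketch below is only an illustration and says nothing about how WAIT does it internally. It assumes the IBM javacore layout in which each thread is introduced by a 3XMTHREADINFO line carrying a state: code (as far as I know R is runnable, CW is condition wait, B is blocked, P is parked). The function names and the timestamps you pass in are my own choices.

```python
import re
from collections import Counter

# Assumed shape of an IBM javacore thread header line, for example:
# 3XMTHREADINFO      "main" J9VMThread:0x..., j9thread_t:0x..., java/lang/Thread:0x..., state:R, prio=5
THREAD_LINE = re.compile(r'^3XMTHREADINFO\s+"(?P<name>[^"]*)".*\bstate:(?P<state>[A-Z]+)')


def count_thread_states(javacore_text):
    """Count threads per state code in the text of one javacore."""
    counts = Counter()
    for line in javacore_text.splitlines():
        match = THREAD_LINE.match(line)
        if match:
            counts[match.group("state")] += 1
    return counts


def timeline(snapshots):
    """Turn (timestamp, javacore_text) pairs into (timestamp, Counter) pairs.

    The result is ordered by timestamp, so it can be read as a time-line.
    """
    ordered = sorted(snapshots, key=lambda item: item[0])
    return [(ts, count_thread_states(text)) for ts, text in ordered]
```

Feeding timeline() the text of each collected javacore, together with the time it was taken, gives one row of state counts per snapshot, which is roughly what the third graph plots.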

IBM WAIT is a cool tool, but it has suffered from some problems lately. For example, there was a problem with its certificate, which had been revoked... This doesn't look professional, but hopefully they will keep a valid certificate and the page will stay fully accessible. Right now it is, and they use a real CA certificate from GeoTrust, so I hope it will remain OK. Ahhh, right, the SSL. There is an important reason why they use SSL: remember that if you upload files from production environments, you are transferring your clients' data, so it is critical that the transfer is safe. Before each analysis you also have to agree to their terms, which sounds reasonable.

2 comments:

DesuRamesh said...

We are not able to register for WAIT. Do we need to be IBMers to register?
Is there a common user ID and password we can use?

Thanks for this Blog

Sebastian Kapciak said...

You do not need to be an IBMer to use the WAIT tool. Just register here https://wait.ibm.com/newuser.html and use this account to work with the tool.

I'm glad you like the blog, thanks!