First of all, make sure that what you are instrumenting is appropriate. For example, the basic recommendation is not to instrument get/set calls. These simply return or set a single value and are very fast. Transactions with a very small transaction time do not need to be instrumented for performance.
Then note that Diagnostics is designed to use a level of instrumentation that provides adequate information to troubleshoot a temporary or hard-to-reproduce performance issue while imposing an overhead low enough to be tolerated in most production environments.
To achieve this goal, Diagnostics provides two mechanisms which automatically adjust data collection in response to the performance characteristics of the currently executing server request.
The first such mechanism is latency-based trimming. If a particular invocation of an instrumented method is fast, the invocation is not reported (there will be no corresponding node in the Call Profile). This cuts the overhead substantially, as the Diagnostics Agent does not have to create the necessary object and place it in the call tree. At the same time, it is assumed that such fast calls are of no interest to a user who is trying to pinpoint performance issues. You can adjust the reporting threshold (51 ms by default) to eliminate more of these fast calls (represented by very thin bars in the call profile). These calls have relatively high overhead and probably do not provide any information that helps diagnose performance issues.
Another automatic data collection mechanism is stack trace sampling (for Java 1.5 or later). This feature reports long-running methods even if they are not instrumented. By enabling this feature and tuning it to provide an adequate level of information, the user can turn off some of the instrumentation and trust that any potential performance issues in that module will be reported by stack trace sampling.
As for light-weight code injection, we do exactly that. Our instrumentation is as light-weight as possible. One should realize, though, that a major portion of the overhead is caused just by taking a time stamp (which is necessary to calculate the latency).
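As a rough illustration of latency-based trimming (a sketch of the idea only, not the Diagnostics Agent's actual code), the Python snippet below times a call and records a node only when the latency reaches the threshold. The names timed_call, call_tree and THRESHOLD_MS are hypothetical, and whether the real comparison is inclusive is an assumption; the 51 ms value is the default quoted above.

```python
import time

THRESHOLD_MS = 51.0  # default reporting threshold quoted in the text


def timed_call(call_tree, name, func, *args,
               threshold_ms=THRESHOLD_MS, clock=time.perf_counter):
    """Run func and append (name, latency_ms) to call_tree only if it was slow.

    Hypothetical helper: it illustrates trimming, not the agent's real code.
    """
    start = clock()                          # a time stamp is needed even for trimmed calls
    result = func(*args)
    latency_ms = (clock() - start) * 1000.0
    if latency_ms >= threshold_ms:           # assumption: the threshold is inclusive
        call_tree.append((name, latency_ms))  # slow call: a node is created
    # fast call: nothing is allocated or stored, which is where the saving comes from
    return result
```

Note that both branches still pay for the two clock reads, which is why trimming reduces the overhead but cannot remove it entirely.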
Each numeric metric for an entity (CPU of a host, heap used in a VM…) can have a threshold value set. The threshold is evaluated against the metric data points received, usually every 5 seconds. A metric with a threshold set has one of the following status levels: Green, Yellow, or Red. The entity’s status is derived from all its metric statuses according to worst-child rules (if any metric for the entity is Red, the entity is Red).
As long as the metric value does not exceed the threshold, the status remains Green. If 3 or more metric data points are beyond the threshold, the status turns Yellow. If the average metric value over the last 5 minutes is beyond the threshold, the status becomes Red. Once the 5-minute average goes below the threshold, the status becomes Green again. Note that the Diagnostics status does not revert to Yellow; it goes directly back to Green.
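The worst-child rule can be sketched in a few lines of Python. This is an illustration only; the function name entity_status and the choice to treat an entity with no thresholded metrics as Green are assumptions, not documented Diagnostics behavior.

```python
SEVERITY = {"Green": 0, "Yellow": 1, "Red": 2}


def entity_status(metric_statuses):
    """Return the worst status among an entity's metrics.

    Assumption: an entity with no thresholded metrics is reported as Green.
    """
    return max(metric_statuses, key=SEVERITY.__getitem__, default="Green")
```

For example, an entity whose metrics are Green, Yellow and Red is itself Red, and one whose metrics are Green and Yellow is Yellow.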
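The status transitions described here can be modelled as a small state machine. The sketch below is one possible reading, not the server's implementation. It makes several assumptions the text does not confirm: "beyond" means greater than, the 3 data points must be consecutive, the 5-minute window is 60 samples at 5-second intervals, and the average is taken over whatever samples have arrived so far. The class name ThresholdStatus is hypothetical.

```python
from collections import deque


class ThresholdStatus:
    """Illustrative Green/Yellow/Red tracker for one metric (assumptions above)."""

    def __init__(self, threshold, window=60, yellow_points=3):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # 60 samples x 5 s = 5 minutes
        self.yellow_points = yellow_points
        self.streak = 0                      # consecutive points beyond the threshold
        self.status = "Green"

    def update(self, value):
        self.samples.append(value)
        self.streak = self.streak + 1 if value > self.threshold else 0
        average = sum(self.samples) / len(self.samples)
        if average > self.threshold:
            self.status = "Red"
        elif self.status == "Red":
            # Recovery goes straight back to Green, never through Yellow.
            self.status = "Green"
            self.streak = 0
        elif self.streak >= self.yellow_points:
            self.status = "Yellow"
        else:
            self.status = "Green"
        return self.status
```

With a threshold of 80 and a window shortened to 4 samples, feeding four values of 10, then four values of 100, then one value of 10 yields Green for the first six points, Yellow on the third high value, Red on the fourth, and Green again on the final low value.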
The threshold values for metrics are configurable in the UI (details pane) and some metrics also have default thresholds set. The default threshold configuration is set in the server’s etc directory in thresholds.configuration.
If you need to set thresholds on specific methods, add a separate entry in the points file for each method. This allows you to set up thresholds and alerts for that specific method.
You can get a list of the active users seen by the Diagnostics server in the last 60 seconds. You can also see the Queries/sec value, which indicates how much load each user generates with summary or trend queries.
From the main Diagnostics UI, select Configure Diagnostics; the Components page is displayed. (You can also reach the Components page by selecting the Maintenance link in any Diagnostics view.) Select the query link, and then select the Active Users link at the bottom of that page to display a list of active users. This data is also available under the Mercury System groupby.
Assuming you have put all probes for the JVMs in the cluster into the same probe group, you can add the count metric to the entity table in the Aggregate Server Request view; it shows the total number of requests across the cluster. You can drill into the aggregate server request to see the server request performance in each JVM. Again, use the count metric to see the number of server request instances for each JVM.
To see performance for the cluster, first put all probes for the JVMs in the cluster into the same probe group. Then use the Aggregate Server Request view to see the performance of the whole cluster and drill down to individual probes.
The thread metrics shown in the Profiler’s Threads tab are collected independently of the server requests.
The values shown in the table are cumulative metrics for:
• CPU time spent in OS kernel
• CPU time spent in user mode
• Time spent in waiting state (in Object.wait(…))
• Time spent in blocked mode (lock contention for “synchronized” methods or blocks)
The values can only increase, and they correspond to the usage since the thread was created.
The graph shows the difference in these metrics between the last two thread snapshots. The graph makes little sense unless you enable automatic update with a constant frequency.
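To make the relationship between the cumulative table values and the graph concrete, here is a minimal Python sketch of the subtraction involved. The function name snapshot_delta and the metric key names are hypothetical; the Profiler's internal representation is not documented here.

```python
def snapshot_delta(previous, current):
    """Per-metric difference between two cumulative thread snapshots."""
    return {name: current[name] - previous[name]
            for name in current if name in previous}


# Two cumulative snapshots for one thread (values in milliseconds, made up).
earlier = {"kernel": 100, "user": 400, "waiting": 1000, "blocked": 20}
later = {"kernel": 120, "user": 450, "waiting": 1500, "blocked": 20}
delta = snapshot_delta(earlier, later)
```

Because each plotted value is just the amount accumulated between two snapshots, snapshots taken at irregular intervals produce deltas that are not comparable with each other, which is why a constant update frequency matters.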
There is no way to associate this blocked time with the server requests.
When a thread is returned to a pool its ID doesn’t change.
Telnet from the Diagnostics server to localhost:2612.
– If this fails, check the timemanager.time_source= setting in /etc/server.properties.
– If timemanager.time_source=ntp, check that the Diagnostics server can access the internet.
From the probe machine, telnet to port 2612 on the Mediator specified by the “mediator.host.name” property in the probe’s /etc/dynamic.properties file.
Try to open the probe URL /profiler/metrics?xml=true from the Diagnostics server.
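If telnet is not installed on the machine, the same reachability check can be approximated with a few lines of Python. This is a generic TCP connect test, not a Diagnostics tool; the function name can_connect is hypothetical, and the host you pass in is whatever the steps above call for (localhost, or the value of mediator.host.name).

```python
import socket


def can_connect(host, port=2612, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unresolvable host, etc.
        return False
```

Calling can_connect("localhost") on the Diagnostics server corresponds to the first telnet test. A False result only says the port could not be reached; it does not distinguish a firewall from a service that is not running.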
• Check for Firewalls.
• Check if there is a proxy specified or needed.
• Try setting force.1.2.event.channel=true in /etc/dispatcher.properties
• Try starting probe with -Dhttp.nonProxyHosts=<fully qualified name of the Diagnostics server>
To disable the capture class map, open the probe.properties file in <probe_install>/etc and look for use.class.map=. When the capture class map is enabled, the entry reads use.class.map=true. To disable it and restore the default state, change the entry to use.class.map=auto.
Enable Corba cross-VM on both probes. Diagnostics 8.x added cross-VM support for pure IIOP, which also covers RMI over IIOP.
Enable Corba Cross-VM by following the steps below on both the probes:
a) Disable RMI in the auto_detect.points file ([RMI] active = false)
b) Enable the Corba points (there is a Corba section towards the end of the auto_detect.points file)
c) Read the documentation under [Corba cross-VM Documentation] section of the points file and follow ALL the steps listed there.
d) After doing all of the above, the "jvmEntries" should look something like below.
LoadRunner: 9.52 running on Windows 2003, HP Diagnostics: Version 8.0, WebSphere Commerce: 6.x on AIX.
Retrieving data for the "Standard Views" graphs in Diagnostics works fine, but nothing is displayed for the "IBM WebSphere" views.
Java 2 Security is NOT enabled. The HP Diagnostics Performance Monitoring Infrastructure (PMI) statistic sets are selected as Extended.
Checking the probe.log on the AIX machine shows the following WARN entry:
2010-05-06 10:20:09,836 WARN com.mercury.diagnostics.capture.metrics [Metrics Collection] Error initializing com.mercury.diagnostics.capture.metrics.jmx.JMXCollector@ca1f548
java.lang.IllegalAccessError: com.mercury.diagnostics.capture.metrics.jmx.JMXCollector tried to access method com/mercury/diagnostics/capture/metrics/jmx/JMXCollector$AttributesAndDescriptors.add(Ljava/lang/String;Lcom/mercury/diagnostics/common/metrics/MetricDescriptor;)V
Invalid classpath configuration on the IBM WebSphere – HP Diagnostics boot loader on the AIX machine.
Check the Boot Classpath configuration in the WebSphere Admin Console and ensure it has all the correct entries.
In this case it should have read:
but it did not have the first entry (1.4.2__1)