Reduce instrumentation overhead

First, make sure that what you are instrumenting is appropriate. For example, the basic recommendation is not to instrument get/set calls: they simply return or set a single value and are very fast. Transactions with a very small transaction time generally do not need to be instrumented for performance.

Note also that Diagnostics is designed to use a level of instrumentation that provides adequate information to troubleshoot a temporary or hard-to-reproduce performance issue, while imposing an overhead low enough to be tolerated in most production environments.

To achieve this goal, Diagnostics provides two mechanisms which automatically adjust data collection in response to the performance characteristics of the currently executing server request.

The first such mechanism is latency-based trimming. If a particular invocation of an instrumented method is fast, the invocation is not reported (there is no corresponding node in the Call Profile). This cuts the overhead substantially, because the Diagnostics Agent does not have to create the necessary object and place it in the call tree. It is assumed that such fast calls are of no interest to a user who is trying to pinpoint performance issues. You can adjust the reporting threshold (51 ms by default) to eliminate more of these fast calls, which are represented by very thin bars in the Call Profile. These calls have relatively high overhead and probably do not provide information that helps diagnose performance issues.
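The idea behind latency-based trimming can be illustrated with a minimal sketch. This is not the Diagnostics Agent's code: the `CallRecorder` and `CallNode` names, the `record` method, and the choice of `>=` for the comparison are all hypothetical. Only the 51 ms default comes from the text above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

# Default reporting threshold quoted in the text above (milliseconds).
REPORT_THRESHOLD_MS = 51.0


@dataclass
class CallNode:
    """One reported invocation in a hypothetical call tree."""
    name: str
    latency_ms: float
    children: List["CallNode"] = field(default_factory=list)


class CallRecorder:
    """Hypothetical recorder: builds a call tree, skipping fast calls."""

    def __init__(self, threshold_ms: float = REPORT_THRESHOLD_MS,
                 clock: Callable[[], float] = time.perf_counter):
        self.threshold_ms = threshold_ms
        self.clock = clock          # returns seconds
        self.roots: List[CallNode] = []
        self._stack = [self.roots]  # child lists of the calls in progress

    def record(self, name: str, fn: Callable[[], object]) -> object:
        children: List[CallNode] = []
        self._stack.append(children)
        start = self.clock()        # the time stamp is taken even if trimmed
        try:
            return fn()
        finally:
            latency_ms = (self.clock() - start) * 1000.0
            self._stack.pop()
            # Only calls at or above the threshold get a node in the tree.
            if latency_ms >= self.threshold_ms:
                self._stack[-1].append(CallNode(name, latency_ms, children))
```

Because a call is at least as slow as any call nested inside it, a trimmed call never has reported children, so nothing is lost by discarding its child list. The sketch still takes two time stamps per call; it only avoids creating and linking the node.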

The other automatic data collection mechanism is stack trace sampling (for Java 1.5 or later). This feature reports long-running methods even if they are not instrumented. By enabling this feature and tuning it to provide an adequate level of information, you can turn off some of the instrumentation and trust that any potential performance issues in that module will be reported by stack trace sampling.
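To show why sampling can reveal long-running methods without instrumenting them, here is a rough sketch that works on stack samples that have already been collected. It is not how Diagnostics implements the feature: the function name, the input format (one list of method names per sample), and the "longest consecutive run times the sampling interval" estimate are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def estimate_long_running(samples: Sequence[List[str]],
                          interval_ms: float,
                          min_duration_ms: float) -> Dict[str, float]:
    """Estimate method durations from periodic stack samples.

    A method seen in n consecutive samples is assumed to have run for
    roughly n * interval_ms. Only estimates of at least min_duration_ms
    are returned.
    """
    longest: Dict[str, int] = {}   # longest run of samples per method
    current: Dict[str, int] = {}   # length of the run still in progress
    for stack in samples:
        on_stack = set(stack)
        # A method that left the stack ends its current run.
        for name in list(current):
            if name not in on_stack:
                del current[name]
        for name in on_stack:
            current[name] = current.get(name, 0) + 1
            longest[name] = max(longest.get(name, 0), current[name])
    return {name: n * interval_ms
            for name, n in longest.items()
            if n * interval_ms >= min_duration_ms}
```

For example, with a 100 ms interval, a method that appears in three consecutive samples is estimated at about 300 ms. The estimate is coarse: a method that returns and is called again between two samples looks like one long call.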

As for light-weight code injection, we do exactly that: our instrumentation is as light-weight as possible. Keep in mind, though, that a major portion of the overhead comes simply from taking a time stamp, which is necessary to calculate the latency.

Diagnostics thresholds

Each numeric metric for an entity (CPU of a host, heap used in a VM…) can have a threshold value set. The threshold is evaluated against the metric data points received, usually every 5 seconds. A metric with a threshold set has one of the following status levels: Green, Yellow, or Red. The entity’s status is derived from all its metric statuses according to worst-child rules (if any metric for the entity is Red, the entity is Red).
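The worst-child rule can be sketched in a few lines. The `Status` enum and `entity_status` function are hypothetical names, and two details are assumptions rather than documented behavior: that Yellow outranks Green in the same way Red outranks both, and that an entity with no thresholded metrics has no status (`None`).

```python
from enum import IntEnum
from typing import Mapping, Optional


class Status(IntEnum):
    # Ordered so that a larger value means a worse status.
    GREEN = 0
    YELLOW = 1
    RED = 2


def entity_status(metric_statuses: Mapping[str, Status]) -> Optional[Status]:
    """Return the worst status among the entity's metrics, or None if
    no metric has a threshold set (an assumption of this sketch)."""
    return max(metric_statuses.values(), default=None)
```

For instance, an entity whose CPU metric is Green and whose heap metric is Red comes out Red.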

As long as the metric value does not exceed the threshold, the status remains Green. If 3 or more metric data points are beyond the threshold, the status turns Yellow. If the average metric value over the last 5 minutes is beyond the threshold, the status becomes Red. Once the 5-minute average goes below the threshold, the status becomes Green again. Note that the status does not revert to Yellow on the way down; it goes directly back to Green.
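The status transitions described above can be modeled as a small state machine. This is only a sketch under stated assumptions, not the server's logic: the `ThresholdTracker` name is hypothetical, "beyond" is taken to mean "above", the 3 data points are counted within the same 5-minute window, the average is computed over whatever points the window currently holds, and violations seen before a Red-to-Green transition are ignored afterwards so that the status does not fall back to Yellow.

```python
from collections import deque


class ThresholdTracker:
    """Hypothetical model of Green/Yellow/Red status for one metric."""

    def __init__(self, threshold: float, window_s: float = 300.0,
                 yellow_points: int = 3):
        self.threshold = threshold
        self.window_s = window_s            # 5 minutes by default
        self.yellow_points = yellow_points  # points needed for Yellow
        self.points = deque()               # (timestamp_s, value) pairs
        self.status = "Green"
        self._ignore_before = float("-inf")

    def add(self, ts: float, value: float) -> str:
        """Record one data point and return the resulting status."""
        self.points.append((ts, value))
        # Drop points that have fallen out of the averaging window.
        while self.points[0][0] <= ts - self.window_s:
            self.points.popleft()
        average = sum(v for _, v in self.points) / len(self.points)
        if average > self.threshold:
            self.status = "Red"
        elif self.status == "Red":
            # Leaving Red goes straight to Green (assumption: older
            # violations no longer count toward Yellow).
            self.status = "Green"
            self._ignore_before = ts
        else:
            violations = sum(1 for t, v in self.points
                             if v > self.threshold and t > self._ignore_before)
            self.status = ("Yellow" if violations >= self.yellow_points
                           else "Green")
        return self.status
```

With data points arriving every 5 seconds, the default window holds 60 points, so a short spike of three high values turns the metric Yellow well before the 5-minute average crosses the threshold.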

The threshold values for metrics are configurable in the UI (details pane), and some metrics also have default thresholds. The default threshold configuration is set in thresholds.configuration in the server’s etc directory.

If you need to set thresholds on specific methods, add a separate entry in the points file for each method. This allows you to set up thresholds and alerts for that specific method.

Get user access information

You can get a list of the active users seen by the Diagnostics server in the last 60 seconds. The list also shows Queries/sec, which indicates how much load each user generates with summary or trend queries.

From the main Diagnostics UI, select Configure Diagnostics to display the Components page. (You can also reach the Components page by selecting the Maintenance link in any Diagnostics view.) Select the query link, and then select the Active Users link at the bottom of that page to display a list of active users. This data is also available under the Mercury System groupby.

Identify load balancing issues in a cluster

Assuming you’ve put all probes for the JVMs in the cluster into the same probe group, you can add the count metric to the entity table in the Aggregate Server Request view. This tells you the total number of requests across the cluster. You can then drill into the aggregate server request to see the server request performance in each JVM. Again, use the count metric to see the number of server request instances for each JVM.
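Once you have read the per-JVM counts from the UI, a quick calculation shows how evenly the load is spread. The sketch below is an illustration only: the function names, the JVM names, and the 10-percentage-point tolerance are hypothetical, and the counts are assumed to be entered by hand.

```python
from typing import Dict, List, Mapping


def request_shares(counts: Mapping[str, int]) -> Dict[str, float]:
    """Fraction of the cluster's requests handled by each JVM."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: count / total for name, count in counts.items()}


def unbalanced(counts: Mapping[str, int], tolerance: float = 0.10) -> List[str]:
    """JVMs whose share differs from an even split by more than
    `tolerance` (a hypothetical cut-off, expressed as a fraction)."""
    if not counts:
        return []
    even_share = 1.0 / len(counts)
    shares = request_shares(counts)
    return sorted(name for name, share in shares.items()
                  if abs(share - even_share) > tolerance)
```

For example, `unbalanced({"jvm1": 400, "jvm2": 400, "jvm3": 400, "jvm4": 50})` singles out `jvm4`, which handles 4% of the requests against an even share of 25%.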

Diagnostics and SiteScope integration

What port does the Diagnostics/SiteScope integration use? You point SiteScope to a Diagnostics MEDIATOR on the standard 2006 port. For example, you’d set the Receiver URL to:

Once I tag an existing SiteScope monitor with this Diagnostics integration, will I need to do any restarts or touch my files? No, there is no need to bounce any servers or touch any files.

When you first try to view SiteScope data in the Diagnostics External Monitors view, by default the monitor’s status is gray and no data is graphed. To see a status (red, yellow, green), you must first set a threshold on a metric (in the details pane). To see data in the graph, you must first select a metric to be charted (in the details pane).

Information on the threads metrics

The thread metrics shown in the Profiler’s Threads tab are collected independently of the server requests.

The values shown in the table are cumulative metrics for:

• CPU time spent in OS kernel

• CPU time spent in user mode

• Time spent in waiting state (in Object.wait(…))

• Time spent in blocked mode (lock contention for “synchronized” methods or blocks)

The values can only increase, and they correspond to the usage since the thread was created.

The graph shows the difference in these metrics between the last two thread snapshots. The graph makes little sense unless you enable automatic update with a constant frequency.
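The relationship between the cumulative table values and the graphed values can be sketched as a simple subtraction. The `ThreadSnapshot` type and its field names are hypothetical and are not the Profiler's data model; the four fields mirror the four metrics listed above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ThreadSnapshot:
    """Cumulative per-thread metrics at one point in time (hypothetical)."""
    taken_at_s: float
    kernel_cpu_ms: float
    user_cpu_ms: float
    waited_ms: float
    blocked_ms: float


METRICS = ("kernel_cpu_ms", "user_cpu_ms", "waited_ms", "blocked_ms")


def snapshot_delta(prev: ThreadSnapshot, curr: ThreadSnapshot) -> Dict[str, float]:
    """Difference of each cumulative metric between two snapshots."""
    return {m: getattr(curr, m) - getattr(prev, m) for m in METRICS}
```

Each difference covers whatever time elapsed between the two snapshots, so a point taken after a 5-second gap and one taken after a 60-second gap are not comparable. That is why the graph is only meaningful when snapshots are taken at a constant frequency.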

There is no way to associate this blocked time with the server requests.

When a thread is returned to a pool its ID doesn’t change.