Monitoring your JAVA process using collectd

Need for Monitoring

The USE Methodology and drill-down analysis are two of the most effective methods to quickly identify, triage and analyze performance and capacity issues. I use them all the time at work.

USE methodology involves looking at the following:

  • Utilization
  • Saturation
  • Errors

By looking at the above factors for all resources, one can employ a tried, tested and trusted methodology to identify performance issues.

Drill-Down Analysis involves the following three steps done in that order:

  1. Monitor the systems of interest
  2. Identify specific areas of interest for the problem and look for bottlenecks
  3. Analyze the above areas further using techniques such as profiling

The two methodologies discussed above are from Brendan Gregg’s book Systems Performance. Brendan also maintains an amazing blog filled with interesting stories about performance and capacity problems in the enterprise and the cloud. I highly recommend checking it out.

Going back to our discussion, both methodologies need one fundamental thing: a tool, CLI or otherwise, to monitor the utilization and saturation of the resources of interest. Traditionally this has been done with tools such as top, ps, iostat and mpstat. The downside is that these tools print their results to the terminal, and one is left wishing for an easy way to represent the values graphically. You might also want to store the results so that you can look at the utilization at a particular point in time. Collectd solves this exact problem.

 

Collectd Basics

Collectd is a lightweight daemon that you install on the host/VM you want to monitor. It gathers metrics from the system and its applications and can send them over the network, in real time, to a time-series database (TSDB) such as Graphite. The time-series data in Graphite can be visualized using a front-end tool such as Grafana, enabling the systems administrator/performance engineer not only to actively monitor the environment but also to look at system resource utilization at any point in time. I skimmed over a lot of minute details here, but they will become more familiar once you start using the Collectd/Graphite/Grafana stack more often. It is worth noting that although I am going to talk about sending the metrics collected by Collectd over the network to Graphite, it is entirely possible to use another backend such as InfluxDB, or in fact to store the metrics as RRD files to be graphed by RRDtool.

Collectd is written in C and has a plugin-based architecture, which is why it is one of the most popular system metrics gathering daemons used in the enterprise. The plugin-based architecture means that there is a collectd plugin for everything: to send data to Graphite, to collect per-process metrics, to collect CPU metrics, to collect memory utilization. You can even write your own plugin in Python/Java, and collectd then does the heavy lifting of collecting your custom metrics every collection interval (the default is 10s) and sending them to the TSDB you configured. Everything is a plugin in collectd; in fact, you have to use the “logfile” plugin to even get collectd to log to a file. All these plugins are configured in the collectd configuration file at /etc/collectd.conf (you can also use other custom configuration files and have collectd consume them).
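To give a feel for what writing your own plugin in Python looks like, here is a minimal sketch of a read plugin that reports the 1-minute load average. The plugin name "example_load" and the helper read_load are made up for illustration. The collectd module is only importable when the script is loaded by collectd’s python plugin, so the import is guarded here to keep the file runnable on its own.

```python
import os

try:
    import collectd  # only available inside the collectd daemon (python plugin)
except ImportError:
    collectd = None  # lets the file be imported and tested outside collectd


def read_load():
    """Return the 1-minute load average as a float."""
    return os.getloadavg()[0]


def read_callback():
    # collectd calls this once per collection interval.
    # "example_load" is a hypothetical plugin name chosen for this sketch.
    values = collectd.Values(plugin="example_load", type="gauge")
    values.dispatch(values=[read_load()])


if collectd is not None:
    # Register the callback so collectd polls it every interval.
    collectd.register_read(read_callback)
```

collectd would pick this up once the python plugin is loaded and configured to import the module; the exact configuration depends on where you place the file.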

In this blog post, we are going to see how to get collectd to monitor a java process, send those metrics over to Graphite and display them using Grafana. Setting up and configuring a Graphite/Grafana server is beyond the scope of this post.

To begin with, you would need the following packages:

  • collectd
  • collectd-java
  • collectd-generic-jmx

If you are using a Red Hat based Linux distro, you can simply yum/dnf install them after enabling the epel repo.

The GenericJMX plugin, provided by the collectd-generic-jmx package, is the one that helps us monitor the JAVA process. It gives us valuable metrics about heap memory, garbage collection etc. that are missing from the metrics gathered by the “processes” plugin (which primarily gathers CPU utilization and RSS/virtual memory information). The GenericJMX plugin queries memory, thread and other metrics from the JVM via Java Management Extensions (JMX). Since GenericJMX is a JAVA-based collectd plugin, we also need the “java” plugin provided by the collectd-java package.

We’ve been talking quite a bit about the tooling, but have you wondered what JAVA process we are going to monitor for the purposes of this discussion? We are going to monitor OpenDaylight, an SDN controller for network virtualization. OpenDaylight runs inside of Apache Karaf, so it turns out that we would be monitoring the “karaf” process.

The collectd configuration used is below. It is also available here as a gist for quick download.

 

# Interval default is 10s
Interval 10

# Hostname for this machine, if not defined, use gethostname(2) system call
Hostname "overcloud-controller-0"

# Loaded Plugins:
LoadPlugin "logfile"
<Plugin "logfile">
File "/var/log/collectd.log"
LogLevel "info"
PrintSeverity true
Timestamp true
</Plugin>

LoadPlugin write_graphite
LoadPlugin unixsock
LoadPlugin java
# Open unix domain socket for collectdctl
<Plugin unixsock>
SocketFile "/var/run/collectd-unixsock"
SocketGroup "collectd"
SocketPerms "0770"
DeleteSocket true
</Plugin>
# Graphite Host Configuration
<Plugin write_graphite>
<Carbon>
Host "10.12.21.1"
Port "2003"
Prefix "odl-osp12-non-containerized."
Protocol "tcp"
LogSendErrors true
StoreRates true
AlwaysAppendDS false
EscapeCharacter "_"
</Carbon>
</Plugin>
<Plugin "java">
JVMArg "-Djava.class.path=/usr/share/collectd/java/collectd-api.jar:/usr/share/collectd/java/generic-jmx.jar"
LoadPlugin "org.collectd.java.GenericJMX"
<Plugin "GenericJMX">
<MBean "gc-count">
ObjectName "java.lang:type=GarbageCollector,*"
InstancePrefix "gc-"
InstanceFrom "name"
<Value>
Type "derive"
Table false
Attribute "CollectionCount"
InstancePrefix "count"
</Value>
</MBean>
<MBean "gc-time">
ObjectName "java.lang:type=GarbageCollector,*"
InstancePrefix "gc-"
InstanceFrom "name"
<Value>
Type "derive"
Table false
Attribute "CollectionTime"
InstancePrefix "time"
</Value>
</MBean>
<MBean "memory_pool">
ObjectName "java.lang:type=MemoryPool,*"
InstancePrefix "memory_pool-"
InstanceFrom "name"
<Value>
Type "memory"
Table true
Attribute "Usage"
</Value>
</MBean>
<MBean "memory-heap">
ObjectName "java.lang:type=Memory"
InstancePrefix "memory-heap"
<Value>
Type "memory"
Table true
Attribute "HeapMemoryUsage"
</Value>
</MBean>
<MBean "memory-nonheap">
ObjectName "java.lang:type=Memory"
InstancePrefix "memory-nonheap"
<Value>
Type "memory"
Table true
Attribute "NonHeapMemoryUsage"
</Value>
</MBean>
<MBean "thread">
ObjectName "java.lang:type=Threading"
InstancePrefix "threading"
<Value>
Type "gauge"
Table false
Attribute "ThreadCount"
InstancePrefix "count"
</Value>
</MBean>
<MBean "thread-daemon">
ObjectName "java.lang:type=Threading"
InstancePrefix "threading"
<Value>
Type "gauge"
Table false
Attribute "DaemonThreadCount"
InstancePrefix "count-daemon"
</Value>
</MBean>
<Connection>
ServiceURL "service:jmx:rmi:///jndi/rmi://localhost:1099/karaf-root"
Collect "memory_pool"
Collect "memory-heap"
Collect "memory-nonheap"
Collect "gc-count"
Collect "gc-time"
Collect "thread"
Collect "thread-daemon"
User "karaf"
Password "karaf"
</Connection>
</Plugin>
</Plugin>
# Include other collectd configuration files
Include "/etc/collectd.d" 

 

You will see that we have loaded not only the java plugin (which hosts GenericJMX) but also logfile, write_graphite and unixsock. The logfile plugin writes collectd’s log messages to a file, which helps us debug when our collectd plugins don’t work. The write_graphite plugin sends the metrics collected by collectd to Graphite over the network. The unixsock plugin lets us see the values collectd is collecting on the host via the collectdctl command, which helps determine whether a problem lies in collectd collecting the metrics or in sending them over the network.

Everything under <Plugin "java"> is the actual configuration for our plugin that queries JAVA-specific metrics using JMX. Our configuration enables the collection of several heap/non-heap memory, thread and garbage collection metrics. In the <Connection> block, the ServiceURL, User and Password are what collectd needs to be able to talk to JMX. For Karaf, the default User and Password are “karaf” and “karaf” respectively. The service URL for Karaf JMX can be obtained from the Karaf documentation. You can use pretty much the exact collectd configuration in this post to monitor your own JAVA process by changing these three parameters.
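To make the naming options in the <MBean> blocks less abstract, here is a small sketch of how they appear to combine into the identifiers that collectd reports. This is inferred from the configuration above and from the collectdctl listval output shown later in this post; it is a simplification, not the plugin’s actual code, and the function name jmx_identifier is made up.

```python
def jmx_identifier(host, mbean_prefix, instance_from, value_type,
                   value_prefix="", composite_key=""):
    """Build a collectd identifier of the form host/plugin-instance/type-instance.

    mbean_prefix  : InstancePrefix of the <MBean> block
    instance_from : value of the ObjectName property named by InstanceFrom ("" if unused)
    value_type    : Type of the <Value> block
    value_prefix  : InstancePrefix of the <Value> block ("" if unused)
    composite_key : key of a composite attribute when Table is true (e.g. "used")
    """
    plugin_instance = mbean_prefix + instance_from
    type_instance = value_prefix + composite_key
    return f"{host}/GenericJMX-{plugin_instance}/{value_type}-{type_instance}"


# The "gc-count" MBean, for the collector whose JMX name is "PS MarkSweep":
print(jmx_identifier("overcloud-controller-0", "gc-", "PS MarkSweep",
                     "derive", value_prefix="count"))

# The "memory-heap" MBean with Table true, for the "used" key of HeapMemoryUsage:
print(jmx_identifier("overcloud-controller-0", "memory-heap", "",
                     "memory", composite_key="used"))
```

Knowing this mapping makes it easier to predict what a new <MBean> block will be called in Graphite before you add it.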

After editing the collectd configuration and restarting the collectd service (via systemd), we would expect the GenericJMX metrics to be collected. However, we can see errors in the collectd logs:

[2017-09-02 22:34:41] [error] lt_dlopen ("/usr/lib64/collectd/java.so") failed: file not found. The most common cause for this problem is missing dependencies. Use ldd(1) to check the dependencies of the plugin / shared object.
[2017-09-02 22:34:41] [error] plugin_load: Load plugin "java" failed with status 1.
[2017-09-02 22:34:41] [warning] Found a configuration for the `java' plugin, but the plugin isn't loaded or didn't register a configuration callback.
[2017-09-02 22:34:41] [warning] Found a configuration for the `java' plugin, but the plugin isn't loaded or didn't register a configuration callback.
[2017-09-02 22:34:41] [warning] There is a `Plugin' block within the configuration for the java plugin. The plugin either only expects "simple" configuration statements or wasn't loaded using `LoadPlugin'. Please check your configuration.

The error message suggests checking the plugin’s dependencies with ldd(1), so we do that:

[root@overcloud-controller-0 heat-admin]# ldd /usr/lib64/collectd/java.so
linux-vdso.so.1 =>  (0x00007fff605cb000)
libjvm.so => not found
libdl.so.2 => /lib64/libdl.so.2 (0x00007f94c9e2f000)
libc.so.6 => /lib64/libc.so.6 (0x00007f94c9a6c000)
/lib64/ld-linux-x86-64.so.2 (0x0000559875586000)

The output shows that libjvm.so cannot be found. To fix this, create a symlink from the libjvm.so file installed by OpenJDK to /usr/lib64/libjvm.so, which is where collectd expects to find the file, and then restart collectd.
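The symlink step could be scripted roughly as follows. This is only a sketch: it assumes OpenJDK is installed somewhere under /usr/lib/jvm (the exact path varies by distro and JDK version, and is not given in this post), and the function names are made up. It needs to run as root to write to /usr/lib64.

```python
import os
from pathlib import Path


def find_libjvm(jvm_root="/usr/lib/jvm"):
    """Return the first libjvm.so found under jvm_root, or None if there is none.

    jvm_root is an assumption about where the JDK lives on this host.
    """
    root = Path(jvm_root)
    if not root.is_dir():
        return None
    matches = sorted(root.rglob("libjvm.so"))
    return matches[0] if matches else None


def link_libjvm(jvm_root="/usr/lib/jvm", target="/usr/lib64/libjvm.so"):
    """Create target as a symlink to the libjvm.so found under jvm_root."""
    source = find_libjvm(jvm_root)
    if source is None:
        raise FileNotFoundError(f"no libjvm.so found under {jvm_root}")
    # Leave an existing file or link alone rather than overwrite it.
    if not os.path.lexists(target):
        os.symlink(source, target)
    return source
```

If several JDKs are installed, the sketch simply picks the first match in sorted order, so check that it is the one you intend before linking. Afterwards, running ldd on java.so again should no longer report libjvm.so as not found.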

We can verify that collectd is collecting our metrics using the below command

[root@overcloud-controller-0 heat-admin]# collectdctl listval
overcloud-controller-0/GenericJMX-gc-PS MarkSweep/derive-count
overcloud-controller-0/GenericJMX-gc-PS MarkSweep/derive-time
overcloud-controller-0/GenericJMX-gc-PS Scavenge/derive-count
overcloud-controller-0/GenericJMX-gc-PS Scavenge/derive-time
overcloud-controller-0/GenericJMX-memory-heap/memory-committed
overcloud-controller-0/GenericJMX-memory-heap/memory-init
overcloud-controller-0/GenericJMX-memory-heap/memory-max
overcloud-controller-0/GenericJMX-memory-heap/memory-used
overcloud-controller-0/GenericJMX-memory-nonheap/memory-committed
overcloud-controller-0/GenericJMX-memory-nonheap/memory-init
overcloud-controller-0/GenericJMX-memory-nonheap/memory-max
overcloud-controller-0/GenericJMX-memory-nonheap/memory-used
overcloud-controller-0/GenericJMX-memory_pool-Code Cache/memory-committed
overcloud-controller-0/GenericJMX-memory_pool-Code Cache/memory-init
overcloud-controller-0/GenericJMX-memory_pool-Code Cache/memory-max
overcloud-controller-0/GenericJMX-memory_pool-Code Cache/memory-used
overcloud-controller-0/GenericJMX-memory_pool-Compressed Class Space/memory-committed
overcloud-controller-0/GenericJMX-memory_pool-Compressed Class Space/memory-init
overcloud-controller-0/GenericJMX-memory_pool-Compressed Class Space/memory-max
overcloud-controller-0/GenericJMX-memory_pool-Compressed Class Space/memory-used
overcloud-controller-0/GenericJMX-memory_pool-Metaspace/memory-committed
overcloud-controller-0/GenericJMX-memory_pool-Metaspace/memory-init
overcloud-controller-0/GenericJMX-memory_pool-Metaspace/memory-max
overcloud-controller-0/GenericJMX-memory_pool-Metaspace/memory-used
overcloud-controller-0/GenericJMX-memory_pool-PS Eden Space/memory-committed
overcloud-controller-0/GenericJMX-memory_pool-PS Eden Space/memory-init
overcloud-controller-0/GenericJMX-memory_pool-PS Eden Space/memory-max
overcloud-controller-0/GenericJMX-memory_pool-PS Eden Space/memory-used
overcloud-controller-0/GenericJMX-memory_pool-PS Old Gen/memory-committed
overcloud-controller-0/GenericJMX-memory_pool-PS Old Gen/memory-init
overcloud-controller-0/GenericJMX-memory_pool-PS Old Gen/memory-max
overcloud-controller-0/GenericJMX-memory_pool-PS Old Gen/memory-used
overcloud-controller-0/GenericJMX-memory_pool-PS Survivor Space/memory-committed
overcloud-controller-0/GenericJMX-memory_pool-PS Survivor Space/memory-init
overcloud-controller-0/GenericJMX-memory_pool-PS Survivor Space/memory-max
overcloud-controller-0/GenericJMX-memory_pool-PS Survivor Space/memory-used
overcloud-controller-0/GenericJMX-threading/gauge-count
overcloud-controller-0/GenericJMX-threading/gauge-count-daemon
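If you would rather check this from a script than from collectdctl, the unixsock plugin speaks a simple line-based text protocol. The sketch below sends LISTVAL to the socket configured above and parses the reply, assuming the reply consists of a status line starting with the number of values, followed by one "timestamp identifier" line per value. The function names and the sample timestamps in the usage note are made up.

```python
import socket


def parse_listval(response):
    """Parse a LISTVAL reply into a list of (timestamp, identifier) tuples."""
    lines = response.splitlines()
    if not lines:
        return []
    # The status line starts with the number of values, or a negative number on error.
    count = int(lines[0].split(" ", 1)[0])
    if count < 0:
        raise RuntimeError(lines[0])
    values = []
    for line in lines[1:1 + count]:
        # Identifiers may contain spaces ("PS MarkSweep"), so split only once.
        timestamp, identifier = line.split(" ", 1)
        values.append((float(timestamp), identifier))
    return values


def listval(socket_path="/var/run/collectd-unixsock"):
    """Ask a running collectd for the list of values it currently knows about."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(b"LISTVAL\n")
        reader = sock.makefile("r")
        status = reader.readline()
        count = int(status.split(" ", 1)[0])
        body = [reader.readline() for _ in range(max(count, 0))]
    return parse_listval(status + "".join(body))
```

For example, filtering the result of listval() for identifiers containing "GenericJMX" gives a quick yes/no answer on whether the JMX metrics are being collected. The caller needs permission to open the socket (the "collectd" group in the configuration above).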

Graphing with Grafana

We will now see how to configure an example graph in Grafana. We are going to configure a graph that plots the JAVA heap for this process. Several other graphs can also be plotted from the metrics collectd is collecting such as threads, garbage collection etc.

Log in to your Grafana instance (the Graphite server housing the collectd data must already have been set up as the Data Source for this Grafana instance, as documented here), click the “Grafana” icon on the top left, select “Dashboards” from the drop-down and then select “New”. You will then be presented with a screen like the one below.

[Screenshot: Selection_221]

Then click on the “Graph” icon, and you will be presented with a screen like the one below.

[Screenshot: Selection_222]

Click on “Panel Title” and then select “Edit”.

You can then configure the query under the “Metrics” tab as a dot-separated path made up of the prefix, then the hostname, then the metric. The prefix is an arbitrary name you choose; it is set by the Prefix option of the write_graphite plugin in the collectd configuration.
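As a rough guide to what the full metric path looks like, the sketch below builds a Graphite path from the parts of a collectd identifier, using the Prefix and EscapeCharacter from the configuration in this post. It reflects my understanding of how write_graphite names metrics (dots and whitespace inside each part are replaced by the escape character); the function names are made up, so do verify the result against the metric tree in your Graphite instance.

```python
def escape(part, escape_char="_"):
    """Replace dots and whitespace, which would otherwise break the Graphite path."""
    return "".join(escape_char if (c == "." or c.isspace()) else c for c in part)


def graphite_path(prefix, host, plugin, plugin_instance, type_, type_instance,
                  escape_char="_"):
    """Build prefix + host.plugin-instance.type-instance with each part escaped."""
    plugin_part = plugin + ("-" + plugin_instance if plugin_instance else "")
    type_part = type_ + ("-" + type_instance if type_instance else "")
    parts = (host, plugin_part, type_part)
    # The prefix is used verbatim; in this post it already ends with a dot.
    return prefix + ".".join(escape(p, escape_char) for p in parts)


# Used heap memory, as configured in this post:
print(graphite_path("odl-osp12-non-containerized.", "overcloud-controller-0",
                    "GenericJMX", "memory-heap", "memory", "used"))
```

Under these assumptions, the space in a name such as "PS MarkSweep" would show up as an underscore in Graphite, which is worth remembering when typing garbage collection queries by hand.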

 

[Screenshot: odl-collectd-1]

You can see the values being graphed in various colors as above.

You will now need to configure the axes for this graph. The X-axis is time, and the Y-axis unit should be bytes, since JMX reports memory usage in bytes.

[Screenshot: selection_223.png]

You can now save the dashboard and that’s all.

In this blog post, we have seen how to use the Collectd/Graphite/Grafana stack to monitor JAVA-specific metrics. This can be extremely useful when analyzing JAVA performance issues caused by heap memory and garbage collection.
