Gábor Samu
Creator of this blog.
Jan 24, 2023 41 min read

Monitoring .-.. ... ..-. (IBM Spectrum LSF) with the TIG stack


Much like dashboards in automobiles, dashboards in the context of HPC infrastructure are crucial to get an understanding of what’s happening under the hood of your HPC cluster - at a glance. During my IT career, I’ve used a myriad of monitoring solutions ranging from SNMP and Ganglia, to the ELK (Elasticsearch, Logstash, Kibana) stack. For example, I’ve recently written an overview on how it is possible to visualize IBM Spectrum LSF (LSF) data in Grafana. LSF is an HPC job scheduler which brings to the table three decades of experience in workload and resource management.

For this blog, I decided to take this to the next level by monitoring IBM Spectrum LSF with the well known TIG (Telegraf, InfluxDB, Grafana) stack. This article is not meant to be a debate on the advantages of one monitoring stack over another. Rather, the focus is to demonstrate what is feasible in terms of monitoring Spectrum LSF clusters with the TIG stack, given the many available ways to query LSF for key information using CLI commands.


The Journey

There already exist many write-ups on how to deploy the TIG stack to monitor systems. This isn’t meant to be a guide on setting up the TIG stack. Rather, it’s assumed that the reader already has some familiarity with the TIG stack. If not, then [insert your favourite search engine] is your friend.

On my home network, I decided to set up a VM on my trusty Traverse Ten64 running Fedora, where InfluxDB was installed. The idea was to run InfluxDB on a system that is guaranteed to be always on in my home environment and that is energy efficient. Installing Telegraf on all of the LSF cluster servers (x3) proved to be straightforward. Note that in all cases, I used the OS-supplied versions of InfluxDB and Telegraf. Finally, I already had a Grafana server running on a server in my network.

Out of the box, Telegraf has the ability to monitor numerous system metrics. Furthermore, there exist literally hundreds of plugins for Telegraf to monitor a wide variety of devices, services and software. A search, however, didn’t reveal the existence of any plugin to monitor LSF. So it was time to get creative.


What to monitor?

A bit of research revealed that InfluxDB supports what is known as “line protocol”. This is a well-defined text-based format for writing data to InfluxDB. I used the following reference on “line protocol” to guide me. Using line protocol, it would ultimately be possible to write a plugin for Telegraf to effectively scrape information from Spectrum LSF and output it in line protocol format for writing to InfluxDB.
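As a rough illustration of the format, a line protocol point consists of a measurement name, optional comma-separated tags, one or more fields and an optional timestamp. The sketch below builds such a line in Python. The helper name `to_line_protocol` is hypothetical, and the sketch assumes integer fields only and tag values without spaces or commas (no escaping is attempted).

```python
import time

def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=val field=1i timestamp.

    Hypothetical helper for illustration only: it handles integer fields
    and does not escape special characters in names or tag values.
    """
    tag_part = "".join(f",{k}={v}" for k, v in tags.items())
    field_part = ",".join(f"{k}={int(v)}i" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"

# Example point using the current time in nanoseconds since the epoch.
line = to_line_protocol("lsf_jobs", {"state": "running"}, {"value": 32}, time.time_ns())
print(line)
```

The trailing `i` marks a field value as an integer; without it, InfluxDB treats a bare number as a float.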

Before I could begin writing the plugin, the key was to determine what information from Spectrum LSF would be useful to display in the dashboard, and how that information could be extracted. For this I followed the KISS principle to keep things as simple as possible. The key metrics I decided to report on were servers, queues and jobs (oh my!), as well as process information for the LSF scheduler daemons. Refer to the following table for details:


Metric(s) | Command
LSF scheduler performance metrics | badmin perfmon view -json
LSF available servers, CPUs, cores, slots | badmin showstatus
LSF server by status (total number Ok, closed, unreachable, unavailable) | badmin showstatus
LSF job statistics (total number running, suspended, pending) | badmin showstatus
LSF queue statistics (per queue, total number of jobs running, suspended, pending) | bqueues -json -o queue_name:12 njobs pend run susp rsv ususp ssusp
LSF mbatchd process metrics | (Telegraf - inputs.procstat)
LSF mbschd process metrics | (Telegraf - inputs.procstat)
LSF management lim process metrics | (Telegraf - inputs.procstat)

Scrapin' fun

The above metrics would give a good idea of the state of the Spectrum LSF cluster at a glance. With the list of metrics prepared, the next step was to create a plugin script which would scrape data from the noted commands. Both bqueues and badmin perfmon view support output in JSON format with the appropriate flags specified. However, badmin showstatus does not support output in JSON format. This meant that for badmin showstatus it was necessary to scrape data assuming hard-coded field positions in the output.
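To illustrate why JSON output is easier to consume than positional scraping, the sketch below reads queue records by key. The sample document is hand-written to mirror the keys that the plugin script looks up (QUEUES, RECORDS, QUEUE_NAME, NJOBS, PEND, RUN). It is an assumption about the shape of the output, not captured bqueues output, and the function name `queue_counts` is hypothetical.

```python
import json

# Hand-written sample, NOT real bqueues output: an assumed shape that uses
# the same keys the plugin script reads.
sample = """
{"QUEUES": 2,
 "RECORDS": [
   {"QUEUE_NAME": "priority", "NJOBS": "93951", "PEND": "93923", "RUN": "28"},
   {"QUEUE_NAME": "short", "NJOBS": "2504", "PEND": "2504", "RUN": "0"}
 ]}
"""

def queue_counts(text):
    """Return {queue_name: {field: int}} from a bqueues-style JSON document."""
    doc = json.loads(text)
    result = {}
    for record in doc["RECORDS"]:
        name = record["QUEUE_NAME"]
        # int() accepts both "28" and 28, so either representation works.
        result[name] = {
            key.lower(): int(value)
            for key, value in record.items()
            if key != "QUEUE_NAME"
        }
    return result

counts = queue_counts(sample)
print(counts["priority"]["pend"])
```

Because every value is looked up by name, a reordered or extended record does not break the parser, which is not the case when counting whitespace-separated tokens.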

A copy of the Telegraf plugin for Spectrum LSF is provided below. This is just an example and is provided “as is” for testing purposes. Your mileage may vary.


Example lsf_telegraf_agent.py script:
#!/usr/bin/python3.8

# Copyright International Business Machines Corp, 2023
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#
# script: lsf_telegraf_agent.py
# version: 0.9
# Sample inputs.exec script for Telegraf which outputs metrics from an
# IBM Spectrum LSF management node in InfluxDB Line Protocol input format.
#
# NOTE: It is required to set the lsf_envfile variable to point to the
# LSF profile.lsf file for the given LSF installation.
#
 
import os
import json
import time
import subprocess
import sys
from pathlib import Path

#
# Variable declarations
# **NOTE: lsf_envfile needs to be set to point to the profile.lsf file for the LSF installation. 
#
lsf_envfile = "/opt/ibm/lsfsuite/lsf/conf/profile.lsf"

#
# Source the Spectrum LSF profile.  
# Check for the existence of lsf_envfile (profile.lsf) and source the environment. 
# If the specified file does not exist, then exit.  
#
path = Path(lsf_envfile)
if path.is_file(): 
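    # Use the POSIX "." command, since "source" is not available in every sh.
    lsf_env = (f'env -i sh -c ". {lsf_envfile} && env"')
    for line in subprocess.getoutput(lsf_env).split("\n"):
        # Split on the first "=" only, since values may themselves contain "=".
        key, sep, value = line.partition("=")
        if sep:
            os.environ[key] = value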
else:
    sys.exit(f'The file {lsf_envfile} does not exist.')
    
# 
# Get the time in nanoseconds since the epoch. 
# This is required as part of the InfluxDB line protocol reference. 
# Only supported on Python 3.7+
#
time_nanosec = time.time_ns()

#
# Here we set the LSF environment variable LSB_NTRIES. This will be used to determine the 
# number of retries before failure of a LSF batch command. This is used to cover the case 
# when the LSF mbatchd is not running. 
#
os.environ["LSB_NTRIES"] = "2"

#
# Check if LSF performance metric monitoring is enabled. This is done by running
# 'badmin perfmon view'. If badmin is not found, then exit. 
#
# Check the return status from 'badmin perfmon view' and take the appropriate action:
#  - If return status is 7, it means that performance monitoring is not enabled. The script
#    will enable LSF performance metric monitoring by running 'badmin perfmon start'.
#    Note that a 70 second sleep is required before LSF metrics will be available.  
#  - If return status is 65, it means that the badmin command reported that the
#    LSF batch system is down. This is a fatal error which will cause the script
#    to exit. 
#
lsf_path = os.environ['LSF_BINDIR']
badmin_path = lsf_path + "/badmin"
bqueues_path = lsf_path + "/bqueues"

path = Path(badmin_path)
if path.is_file():
    cmd = [badmin_path, 'perfmon', 'view']
    # subprocess.run() waits for the command to finish and returns its status.
    return_code = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL).returncode
    if return_code == 7:
        cmd = [badmin_path, 'perfmon', 'start']
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        time.sleep(70)
    elif return_code == 65:
        sys.exit('The LSF batch system is down.')
else:
    sys.exit(f'{badmin_path} does not exist.')

#
# Run badmin with the "perfmon view" keywords and the -json option to produce JSON output.
# We assume here that the LSF batch system is responsive (a check was done above); if
# the mbatchd is very busy there is a possibility that it may not be responsive here. This
# case is not considered; the LSB_NTRIES setting will determine how many tries are made
# before badmin gives up the ghost.  
# 
# Note: We previously checked for the existence of the 'badmin' binary. 
#
cmd = [badmin_path, 'perfmon', 'view', '-json'] 
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True) 
stdout, stderr = p.communicate()
#
# Guard for the case that the performance monitor has just been enabled, but is not
# producing any data as the first sample period has not elapsed. 
#
if stdout == "":
    sys.exit(f'Output from badmin perfmon view -json is empty.')
else: 
    data = json.loads(stdout)

# 
# Run badmin showstatus
# Next, run the command 'badmin showstatus' and capture the output. Note that badmin showstatus
# does not produce JSON output. So here we must do some scraping of the output. 
# The output from 'badmin showstatus' is placed into the array 'showstatus'. The hard-coded
# positions in the output of 'badmin showstatus' are assumed when building the output 
# strings below. Should the format of the output of 'badmin showstatus' change, this will
# need to be updated. 
cmd = [badmin_path, 'showstatus']
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True)
stdout, stderr = p.communicate()
# Convert badmin showstatus output into an array
showstatus = stdout.split()

#
# Run bqueues
#
cmd = [bqueues_path, '-json', '-o', 'queue_name:12 njobs pend run susp rsv ususp ssusp']
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True)
stdout, stderr = p.communicate()
data_queues = json.loads(stdout)

#
# At this stage, we've captured the output from 'badmin perfmon view -json',
# 'badmin showstatus' and 'bqueues -json'. We're now ready to print to standard
# output the metric strings in InfluxDB line protocol format. 
#
# Details about the line protocol format can be found here:
# https://docs.influxdata.com/influxdb/v2.6/reference/syntax/line-protocol/
# 
# 

#
# LSF server status
#
print("lsf_servers,","status=total"," value=",showstatus[21],"i ",time_nanosec,sep='')
print("lsf_servers,","status=ok"," value=",showstatus[23],"i ",time_nanosec,sep='')
print("lsf_servers,","status=closed"," value=",showstatus[25],"i ",time_nanosec,sep='')
print("lsf_servers,","status=unreachable"," value=",showstatus[27],"i ",time_nanosec,sep='')
print("lsf_servers,","status=unavailable"," value=",showstatus[29],"i ",time_nanosec,sep='')

#
# LSF job status
#
print("lsf_jobs,","state=total"," value=",showstatus[33],"i ",time_nanosec,sep='')
print("lsf_jobs,","state=running"," value=",showstatus[35],"i ",time_nanosec,sep='')
print("lsf_jobs,","state=suspended"," value=",showstatus[37],"i ",time_nanosec,sep='')
print("lsf_jobs,","state=pending"," value=",showstatus[39],"i ",time_nanosec,sep='')
print("lsf_jobs,","state=finished"," value=",showstatus[41],"i ",time_nanosec,sep='')

#
# LSF user stats
#
print("lsf_users,","state=numusers"," value=",showstatus[45],"i ",time_nanosec,sep='')
print("lsf_users,","state=numgroups"," value=",showstatus[50],"i ",time_nanosec,sep='')
print("lsf_users,","state=numactive"," value=",showstatus[55],"i ",time_nanosec,sep='')

#
# LSF hosts stats
# First we split out the current and peak values for clients, servers, cpus, cores, and slots.
# The current and peak values are separated by the "/" delimiter.
# 
clientssplit = showstatus[9].split("/")
serverssplit = showstatus[11].split("/")
cpussplit = showstatus[13].split("/")
coressplit = showstatus[15].split("/")
slotssplit = showstatus[17].split("/")

print("lsf_hosts,","state=clients"," current=",clientssplit[0],"i,","peak=",clientssplit[1],"i ",time_nanosec,sep='')
print("lsf_hosts,","state=servers"," current=",serverssplit[0],"i,","peak=",serverssplit[1],"i ",time_nanosec,sep='')
print("lsf_hosts,","state=cpus"," current=",cpussplit[0],"i,","peak=",cpussplit[1],"i ",time_nanosec,sep='')
print("lsf_hosts,","state=cores"," current=",coressplit[0],"i,","peak=",coressplit[1],"i ",time_nanosec,sep='')
print("lsf_hosts,","state=slots"," current=",slotssplit[0],"i,","peak=",slotssplit[1],"i ",time_nanosec,sep='')

#
# Print mbatchd query metrics
#
print("lsf_mbatchd,","query=job"," value=",data['record'][1]['current'],"i ",time_nanosec,sep='')
print("lsf_mbatchd,","query=host"," value=",data['record'][2]['current'],"i ",time_nanosec,sep='')
print("lsf_mbatchd,","query=queue"," value=",data['record'][3]['current'],"i ",time_nanosec,sep='')

#
# Print mbatchd job metrics
#
print("lsf_mbatchd,","jobs=submitreqs"," value=",data['record'][4]['current'],"i ",time_nanosec,sep='')
print("lsf_mbatchd,","jobs=submitted"," value=",data['record'][5]['current'],"i ",time_nanosec,sep='')
print("lsf_mbatchd,","jobs=dispatched"," value=",data['record'][6]['current'],"i ",time_nanosec,sep='')
print("lsf_mbatchd,","jobs=completed"," value=",data['record'][7]['current'],"i ",time_nanosec,sep='')
print("lsf_mbatchd,","jobs=sentremote"," value=",data['record'][8]['current'],"i ",time_nanosec,sep='')
print("lsf_mbatchd,","jobs=acceptremote"," value=",data['record'][9]['current'],"i ",time_nanosec,sep='')
print("lsf_mbatchd,","sched=interval"," value=",data['record'][10]['current'],"i ",time_nanosec,sep='')
print("lsf_mbatchd,","sched=matchhost"," value=",data['record'][11]['current'],"i ",time_nanosec,sep='')
print("lsf_mbatchd,","sched=buckets"," value=",data['record'][12]['current'],"i ",time_nanosec,sep='')
print("lsf_mbatchd,","sched=reordered"," value=",data['record'][13]['current'],"i ",time_nanosec,sep='')

#
# Print mbatchd efficiency metrics. Here check if the efficiency metric indicated is "-". If so, 
# then assume a zero value. The trailing "%" sign on the metrics (percentages) is also stripped here. 
#
slots = (data['record'][14]['current'])
slots_percent = slots
if slots_percent == "-":
    slots_percent = "0"
elif slots_percent != "0":
    # Strip % sign and decimal. This is to work around issue inserting float to InfluxDB
    # "type float, already exists as type integer dropped ..."
    slots_percent = slots[:-4]

memory = (data['record'][15]['current'])
memory_percent = memory
if memory_percent == "-":
    memory_percent = "0"
elif memory_percent != "0":
    # Strip % sign and decimal. This is to work around issue inserting float to InfluxDB
    # "type float, already exists as type integer dropped ..."
    memory_percent = memory[:-4]

print("lsf_mbatchd,","utilization=slots"," value=",slots_percent,"i ",time_nanosec,sep='')
print("lsf_mbatchd,","utilization=memory"," value=",memory_percent,"i ",time_nanosec,sep='')

#
# Print mbatchd file descriptor usage
#
print("lsf_mbatchd,","fd=free"," value=",data['fd']['free'],"i ",time_nanosec,sep='')
print("lsf_mbatchd,","fd=used"," value=",data['fd']['used'],"i ",time_nanosec,sep='')
print("lsf_mbatchd,","fd=total"," value=",data['fd']['total'],"i ",time_nanosec,sep='')

#
# Print LSF queue status (njobs)
#
iterations = data_queues["QUEUES"]

for n in range(iterations):
    print("lsf_queues,","name=", data_queues['RECORDS'][n]['QUEUE_NAME'], " njobs=", data_queues['RECORDS'][n]['NJOBS'],"i,",
          "pend=", data_queues['RECORDS'][n]['PEND'],"i,",
          "run=", data_queues['RECORDS'][n]['RUN'],"i,",
          "susp=", data_queues['RECORDS'][n]['SUSP'],"i,",
          "rsv=", data_queues['RECORDS'][n]['RSV'],"i,",
          "ususp=", data_queues['RECORDS'][n]['USUSP'],"i,",
          "ssusp=", data_queues['RECORDS'][n]['SSUSP'],"i ",
          time_nanosec, sep='')

sys.exit(0)
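The efficiency handling in the script above removes the last four characters of the value, which assumes a percentage with exactly two decimal places (for example "100.00%"). A slightly more tolerant conversion could look like the following sketch. The helper name `percent_to_int` is hypothetical, and the set of input formats is an assumption based on the comments in the script.

```python
def percent_to_int(value):
    """Convert strings such as '100.00%', '37.5%', '0' or '-' to a whole number.

    The fraction is truncated rather than rounded, which matches the script's
    approach of dropping everything after the decimal point.
    """
    text = str(value).strip().rstrip("%")
    if text in ("", "-"):
        # A dash means that no value is available; report zero as the script does.
        return 0
    return int(float(text))

for raw in ("100.00%", "37.5%", "0", "-"):
    print(raw, "->", percent_to_int(raw))
```

Returning an integer keeps the field type stable, which avoids the InfluxDB type conflict mentioned in the script comments.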

Bringing it all together

For completeness, below are the details of the environment configuration. It should be noted that the simple test environment consists of a single server running IBM Spectrum LSF Suite for HPC and a separate server which runs the InfluxDB instance.


Hostname | Component | Version
kilenc | OS (LSF mgmt server) | CentOS Stream release 8 (ppc64le)
kilenc | Spectrum LSF Suite for HPC | v10.2.0.13
adatbazis | OS (InfluxDB server) | Fedora release 36 (aarch64)
adatbazis | InfluxDB | v1.8.10
kilenc | Telegraf | v1.24.3
kilenc | Grafana | v9.1.6

The following steps assume that IBM Spectrum LSF Suite for HPC, InfluxDB and Telegraf have been installed.

  1. Start InfluxDB on the host adatbazis

  2. On the LSF management server kilenc, configure telegraf to connect to the InfluxDB instance on host adatbazis. Edit the configuration /etc/telegraf/telegraf.conf and specify the correct URL in the outputs.influxdb section as follows:

# # Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
#   ## The full HTTP or UDP URL for your InfluxDB instance.
#   ##
#   ## Multiple URLs can be specified for a single cluster, only ONE of the
#   ## urls will be written to each interval.
#   # urls = ["unix:///var/run/influxdb.sock"]
#   # urls = ["udp://127.0.0.1:8089"]
#   # urls = ["http://127.0.0.1:8086"]
# Added gsamu Jan 04 2023
urls = ["http://adatbazis:8086"]
  3. On the LSF management server kilenc, configure telegraf with the custom plugin script lsf_telegraf_agent.py to collect and log metrics from IBM Spectrum LSF Suite for HPC. Edit the configuration /etc/telegraf/telegraf.conf and specify the correct command path in the section inputs.exec. Additionally, set data_format equal to influx. Note that the script lsf_telegraf_agent.py was copied to the directory /etc/telegraf/telegraf.d/scripts with permissions octal 755 and owner set to user telegraf. Note: User telegraf was automatically created during the installation of telegraf.
 
# ## Gather LSF metrics
[[inputs.exec]]
  ## Commands array
   commands = [  "/etc/telegraf/telegraf.d/scripts/lsf_telegraf_agent.py" ]
   timeout = "30s"
   interval = "30s"
   data_format = "influx"
 # ## End LSF metrics
  4. Telegraf provides the ability to collect metrics on processes. Here we’ll use the telegraf procstat facility to monitor the LSF mbatchd, mbschd and management lim processes. The mbatchd and mbschd daemons are the key daemons involved in handling query requests and making scheduling decisions for jobs in the environment. Edit the configuration /etc/telegraf/telegraf.conf and configure the following three inputs.procstat sections.
# ## Monitor CPU and memory utilization for LSF processes
# ## mbatchd, mbschd, lim (manager)
[[inputs.procstat]]
exe = "lim"
pattern = "lim"
pid_finder = "pgrep"

[[inputs.procstat]]
exe = "mbschd"
pattern = "mbschd"
pid_finder = "pgrep"

[[inputs.procstat]]
exe = "mbatchd"
pattern = "mbatchd"
pid_finder = "pgrep"
  5. With the telegraf configuration complete, it’s now time to test whether the configuration and the custom LSF agent are functioning as expected. Note that the following operation is performed on the LSF management candidate host kilenc and assumes that the LSF daemons are up and running. This is achieved by running the command: telegraf --config /etc/telegraf/telegraf.conf --test. Note: Any errors in the configuration file /etc/telegraf/telegraf.conf will result in errors in the output.
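Independently of the Telegraf test, it can help to check that lines printed by the agent resemble line protocol. The sketch below is a loose sanity check, not a full parser: the regular expression and the function name `looks_like_line_protocol` are assumptions made for illustration, and only the integer-field form emitted by the script is accepted.

```python
import re

# Loose pattern: measurement, optional tags, integer fields, epoch timestamp.
LINE_RE = re.compile(
    r"^[A-Za-z_]\w*"                  # measurement name
    r"(,\w+=[^,= ]+)*"                # optional tags: ,key=value
    r" \w+=-?\d+i(,\w+=-?\d+i)*"      # one or more integer fields
    r" \d+$"                          # timestamp (nanoseconds since the epoch)
)

def looks_like_line_protocol(line):
    """Return True if the line matches the loose pattern above."""
    return LINE_RE.match(line) is not None

samples = [
    "lsf_servers,status=total value=1i 1674246976000000000",
    "lsf_hosts,state=cpus current=2i,peak=2i 1674246976000000000",
    "lsf_jobs state=running 32",  # malformed on purpose: no integer field
]
for sample in samples:
    print(looks_like_line_protocol(sample), sample)
```

A check like this could be applied to the standard output of lsf_telegraf_agent.py when run by hand, before relying on Telegraf to report parse errors.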

Output of telegraf --config /etc/telegraf/telegraf.conf --test:
[root@kilenc telegraf]# pwd
/etc/telegraf
[root@kilenc telegraf]# telegraf --config /etc/telegraf/telegraf.conf --test
> mem,host=kilenc active=1938817024i,available=6820003840i,available_percent=20.653390597462806,buffered=4849664i,cached=6317735936i,commit_limit=33560395776i,committed_as=18635292672i,dirty=4128768i,free=2623799296i,high_free=0i,high_total=0i,huge_page_size=2097152i,huge_pages_free=0i,huge_pages_total=0i,inactive=13852016640i,low_free=0i,low_total=0i,mapped=1007353856i,page_tables=22478848i,shared=259063808i,slab=4946919424i,sreclaimable=902234112i,sunreclaim=4044685312i,swap_cached=3866624i,swap_free=16994729984i,swap_total=17049780224i,total=33021231104i,used=24074846208i,used_percent=72.90717336424115,vmalloc_chunk=0i,vmalloc_total=562949953421312i,vmalloc_used=0i,write_back=0i,write_back_tmp=0i 1674246976000000000
> kernel,host=kilenc boot_time=1673790850i,context_switches=1943864437i,entropy_avail=4037i,interrupts=1294179599i,processes_forked=4255316i 1674246976000000000
> swap,host=kilenc free=16994729984i,total=17049780224i,used=55050240i,used_percent=0.3228794698626609 1674246976000000000
> swap,host=kilenc in=172032i,out=851968i 1674246976000000000
> net,host=kilenc,interface=lo bytes_recv=90039931116i,bytes_sent=90039931116i,drop_in=0i,drop_out=0i,err_in=0i,err_out=0i,packets_recv=17245997i,packets_sent=17245997i 1674246976000000000
> net,host=kilenc,interface=enP4p1s0f0 bytes_recv=0i,bytes_sent=0i,drop_in=0i,drop_out=0i,err_in=0i,err_out=0i,packets_recv=0i,packets_sent=0i 1674246976000000000
> net,host=kilenc,interface=enP4p1s0f1 bytes_recv=11791041280i,bytes_sent=1701152001i,drop_in=0i,drop_out=0i,err_in=0i,err_out=0i,packets_recv=10322276i,packets_sent=4594948i 1674246976000000000
> net,host=kilenc,interface=all icmp_inaddrmaskreps=0i,icmp_inaddrmasks=0i,icmp_incsumerrors=0i,icmp_indestunreachs=8609i,icmp_inechoreps=20i,icmp_inechos=11i,icmp_inerrors=1084i,icmp_inmsgs=8640i,icmp_inparmprobs=0i,icmp_inredirects=0i,icmp_insrcquenchs=0i,icmp_intimeexcds=0i,icmp_intimestampreps=0i,icmp_intimestamps=0i,icmp_outaddrmaskreps=0i,icmp_outaddrmasks=0i,icmp_outdestunreachs=4805i,icmp_outechoreps=11i,icmp_outechos=94i,icmp_outerrors=0i,icmp_outmsgs=4910i,icmp_outparmprobs=0i,icmp_outredirects=0i,icmp_outsrcquenchs=0i,icmp_outtimeexcds=0i,icmp_outtimestampreps=0i,icmp_outtimestamps=0i,icmpmsg_intype0=20i,icmpmsg_intype3=8609i,icmpmsg_intype8=11i,icmpmsg_outtype0=11i,icmpmsg_outtype3=4805i,icmpmsg_outtype8=94i,ip_defaultttl=64i,ip_forwarding=1i,ip_forwdatagrams=0i,ip_fragcreates=62958i,ip_fragfails=0i,ip_fragoks=12611i,ip_inaddrerrors=1i,ip_indelivers=21324370i,ip_indiscards=0i,ip_inhdrerrors=0i,ip_inreceives=21324371i,ip_inunknownprotos=0i,ip_outdiscards=0i,ip_outnoroutes=30i,ip_outrequests=21248264i,ip_reasmfails=0i,ip_reasmoks=0i,ip_reasmreqds=0i,ip_reasmtimeout=0i,tcp_activeopens=763497i,tcp_attemptfails=96617i,tcp_currestab=118i,tcp_estabresets=1917i,tcp_incsumerrors=0i,tcp_inerrs=0i,tcp_insegs=19488475i,tcp_maxconn=-1i,tcp_outrsts=137188i,tcp_outsegs=20220038i,tcp_passiveopens=675805i,tcp_retranssegs=9827i,tcp_rtoalgorithm=1i,tcp_rtomax=120000i,tcp_rtomin=200i,udp_ignoredmulti=10509i,udp_incsumerrors=0i,udp_indatagrams=1816997i,udp_inerrors=0i,udp_memerrors=0i,udp_noports=264i,udp_outdatagrams=1506724i,udp_rcvbuferrors=0i,udp_sndbuferrors=0i,udplite_ignoredmulti=0i,udplite_incsumerrors=0i,udplite_indatagrams=0i,udplite_inerrors=0i,udplite_memerrors=0i,udplite_noports=0i,udplite_outdatagrams=0i,udplite_rcvbuferrors=0i,udplite_sndbuferrors=0i 1674246976000000000
> diskio,host=kilenc,name=dm-2 io_time=9739370i,iops_in_progress=0i,merged_reads=0i,merged_writes=0i,read_bytes=4015612416i,read_time=604060i,reads=40592i,weighted_io_time=60563370i,write_bytes=47025459712i,write_time=59959310i,writes=1079691i 1674246976000000000
> diskio,host=kilenc,name=sda1 io_time=1460i,iops_in_progress=0i,merged_reads=0i,merged_writes=0i,read_bytes=4849664i,read_time=1304i,reads=67i,weighted_io_time=1304i,write_bytes=0i,write_time=0i,writes=0i 1674246976000000000
> diskio,host=kilenc,name=sda3 io_time=45872430i,iops_in_progress=0i,merged_reads=623i,merged_writes=1061314i,read_bytes=16398521856i,read_time=3371612i,reads=139298i,weighted_io_time=311521720i,write_bytes=133715422208i,write_time=308150107i,writes=7031512i 1674246976000000000
> diskio,host=kilenc,name=dm-1 io_time=5780i,iops_in_progress=0i,merged_reads=0i,merged_writes=0i,read_bytes=5636096i,read_time=3030i,reads=81i,weighted_io_time=26500i,write_bytes=13631488i,write_time=23470i,writes=208i 1674246976000000000
> disk,device=dm-0,fstype=xfs,host=kilenc,mode=rw,path=/ free=9315028992i,inodes_free=18214222i,inodes_total=19822888i,inodes_used=1608666i,total=53660876800i,used=44345847808i,used_percent=82.64093032486566 1674246976000000000
> disk,device=sda2,fstype=ext4,host=kilenc,mode=rw,path=/boot free=309653504i,inodes_free=65264i,inodes_total=65536i,inodes_used=272i,total=1020702720i,used=640585728i,used_percent=67.41310045173972 1674246976000000000
> disk,device=dm-2,fstype=xfs,host=kilenc,mode=rw,path=/home free=856442515456i,inodes_free=452529686i,inodes_total=453312512i,inodes_used=782826i,total=927930712064i,used=71488196608i,used_percent=7.704044674735306 1674246976000000000
> disk,device=dm-2,fstype=xfs,host=kilenc,mode=rw,path=/home/opt/at13.0/lib free=856442515456i,inodes_free=452529686i,inodes_total=453312512i,inodes_used=782826i,total=927930712064i,used=71488196608i,used_percent=7.704044674735306 1674246976000000000
> disk,device=dm-2,fstype=xfs,host=kilenc,mode=rw,path=/home/opt/at13.0/lib64 free=856442515456i,inodes_free=452529686i,inodes_total=453312512i,inodes_used=782826i,total=927930712064i,used=71488196608i,used_percent=7.704044674735306 1674246976000000000
> disk,device=ST31000524AS/raktar,fstype=zfs,host=kilenc,mode=rw,path=/mnt/ST31000524AS free=210837438464i,inodes_free=411792117i,inodes_total=412304487i,inodes_used=512370i,total=965496143872i,used=754658705408i,used_percent=78.16278813725106 1674246976000000000
> diskio,host=kilenc,name=sda io_time=45899860i,iops_in_progress=0i,merged_reads=650i,merged_writes=1061332i,read_bytes=16495536128i,read_time=3440899i,reads=141325i,weighted_io_time=311596362i,write_bytes=133715696640i,write_time=308155462i,writes=7031531i 1674246976000000000
> disk,device=ST31000524AS,fstype=zfs,host=kilenc,mode=rw,path=/ST31000524AS free=210837438464i,inodes_free=411792117i,inodes_total=411792123i,inodes_used=6i,total=210837569536i,used=131072i,used_percent=0.00006216728844316324 1674246976000000000
> diskio,host=kilenc,name=sda2 io_time=18060i,iops_in_progress=0i,merged_reads=27i,merged_writes=18i,read_bytes=88372224i,read_time=31224i,reads=436i,weighted_io_time=36579i,write_bytes=274432i,write_time=5355i,writes=19i 1674246976000000000
> diskio,host=kilenc,name=dm-0 io_time=38788720i,iops_in_progress=0i,merged_reads=0i,merged_writes=0i,read_bytes=12341294080i,read_time=1143210i,reads=51814i,weighted_io_time=303329620i,write_bytes=86676331008i,write_time=302186410i,writes=6798400i 1674246976000000000
> diskio,host=kilenc,name=sdb io_time=668810i,iops_in_progress=0i,merged_reads=9i,merged_writes=58i,read_bytes=104550912i,read_time=746540i,reads=31054i,weighted_io_time=1445858i,write_bytes=10845920256i,write_time=699318i,writes=124780i 1674246976000000000
> diskio,host=kilenc,name=sdb1 io_time=341330i,iops_in_progress=0i,merged_reads=0i,merged_writes=58i,read_bytes=95562240i,read_time=383066i,reads=25026i,weighted_io_time=1082385i,write_bytes=10845920256i,write_time=699318i,writes=124780i 1674246976000000000
> diskio,host=kilenc,name=sdb9 io_time=190i,iops_in_progress=0i,merged_reads=0i,merged_writes=0i,read_bytes=4980736i,read_time=37i,reads=69i,weighted_io_time=37i,write_bytes=0i,write_time=0i,writes=0i 1674246976000000000
> system,host=kilenc load1=2.06,load15=2.12,load5=2.12,n_cpus=32i,n_users=0i 1674246976000000000
> system,host=kilenc uptime=456127i 1674246976000000000
> system,host=kilenc uptime_format="5 days,  6:42" 1674246976000000000
> processes,host=kilenc blocked=1i,dead=0i,idle=569i,paging=0i,parked=1i,running=0i,sleeping=412i,stopped=0i,total=1366i,total_threads=2683i,unknown=0i,zombies=0i 1674246976000000000
> lsf_servers,host=kilenc,status=total value=1i 1674246976000000000
> lsf_servers,host=kilenc,status=ok value=1i 1674246976000000000
> lsf_servers,host=kilenc,status=closed value=0i 1674246976000000000
> lsf_servers,host=kilenc,status=unreachable value=0i 1674246976000000000
> lsf_servers,host=kilenc,status=unavailable value=0i 1674246976000000000
> lsf_jobs,host=kilenc,state=total value=121776i 1674246976000000000
> lsf_jobs,host=kilenc,state=running value=32i 1674246976000000000
> lsf_jobs,host=kilenc,state=suspended value=0i 1674246976000000000
> lsf_jobs,host=kilenc,state=pending value=120771i 1674246976000000000
> lsf_jobs,host=kilenc,state=finished value=973i 1674246976000000000
> lsf_users,host=kilenc,state=numusers value=4i 1674246976000000000
> lsf_users,host=kilenc,state=numgroups value=1i 1674246976000000000
> lsf_users,host=kilenc,state=numactive value=1i 1674246976000000000
> lsf_hosts,host=kilenc,state=clients current=0i,peak=0i 1674246976000000000
> lsf_hosts,host=kilenc,state=servers current=1i,peak=1i 1674246976000000000
> lsf_hosts,host=kilenc,state=cpus current=2i,peak=2i 1674246976000000000
> lsf_hosts,host=kilenc,state=cores current=32i,peak=32i 1674246976000000000
> lsf_hosts,host=kilenc,state=slots current=32i,peak=32i 1674246976000000000
> lsf_mbatchd,host=kilenc,query=job value=0i 1674246976000000000
> lsf_mbatchd,host=kilenc,query=host value=0i 1674246976000000000
> lsf_mbatchd,host=kilenc,query=queue value=2i 1674246976000000000
> lsf_mbatchd,host=kilenc,jobs=submitreqs value=0i 1674246976000000000
> lsf_mbatchd,host=kilenc,jobs=submitted value=0i 1674246976000000000
> lsf_mbatchd,host=kilenc,jobs=dispatched value=19i 1674246976000000000
> lsf_mbatchd,host=kilenc,jobs=completed value=12i 1674246976000000000
> lsf_mbatchd,host=kilenc,jobs=sentremote value=0i 1674246976000000000
> lsf_mbatchd,host=kilenc,jobs=acceptremote value=0i 1674246976000000000
> lsf_mbatchd,host=kilenc,sched=interval value=1i 1674246976000000000
> lsf_mbatchd,host=kilenc,sched=matchhost value=5i 1674246976000000000
> lsf_mbatchd,host=kilenc,sched=buckets value=5i 1674246976000000000
> lsf_mbatchd,host=kilenc,sched=reordered value=7i 1674246976000000000
> lsf_mbatchd,host=kilenc,utilization=slots value=100i 1674246976000000000
> lsf_mbatchd,host=kilenc,utilization=memory value=0i 1674246976000000000
> lsf_mbatchd,fd=free,host=kilenc value=65509i 1674246976000000000
> lsf_mbatchd,fd=used,host=kilenc value=26i 1674246976000000000
> lsf_mbatchd,fd=total,host=kilenc value=65535i 1674246976000000000
> lsf_queues,host=kilenc,name=admin njobs=0i,pend=0i,rsv=0i,run=0i,ssusp=0i,susp=0i,ususp=0i 1674246976000000000
> lsf_queues,host=kilenc,name=owners njobs=0i,pend=0i,rsv=0i,run=0i,ssusp=0i,susp=0i,ususp=0i 1674246976000000000
> lsf_queues,host=kilenc,name=priority njobs=93951i,pend=93923i,rsv=0i,run=28i,ssusp=0i,susp=0i,ususp=0i 1674246976000000000
> lsf_queues,host=kilenc,name=night njobs=0i,pend=0i,rsv=0i,run=0i,ssusp=0i,susp=0i,ususp=0i 1674246976000000000
> lsf_queues,host=kilenc,name=short njobs=2504i,pend=2504i,rsv=0i,run=0i,ssusp=0i,susp=0i,ususp=0i 1674246976000000000
> lsf_queues,host=kilenc,name=dataq njobs=0i,pend=0i,rsv=0i,run=0i,ssusp=0i,susp=0i,ususp=0i 1674246976000000000
> lsf_queues,host=kilenc,name=normal njobs=1750i,pend=1750i,rsv=0i,run=0i,ssusp=0i,susp=0i,ususp=0i 1674246976000000000
> lsf_queues,host=kilenc,name=interactive njobs=0i,pend=0i,rsv=0i,run=0i,ssusp=0i,susp=0i,ususp=0i 1674246976000000000
> lsf_queues,host=kilenc,name=sendq njobs=22598i,pend=22594i,rsv=0i,run=4i,ssusp=0i,susp=0i,ususp=0i 1674246976000000000
> lsf_queues,host=kilenc,name=idle njobs=0i,pend=0i,rsv=0i,run=0i,ssusp=0i,susp=0i,ususp=0i 1674246976000000000
> cpu,cpu=cpu0,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu4,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu8,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu12,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu16,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=98.03921568448419,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=1.9607843137324836 1674246977000000000
> cpu,cpu=cpu20,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu24,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu28,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu32,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu36,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu40,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=98.03921568448419,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=1.9607843136879006,usage_user=0 1674246977000000000
> cpu,cpu=cpu44,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu48,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu52,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=0,usage_iowait=100,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu56,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu60,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu64,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=87.99999999906868,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=10.000000001155058,usage_user=2.0000000002764864 1674246977000000000
> cpu,cpu=cpu68,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu72,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=86.27450980280263,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=11.764705882127403,usage_user=1.9607843137324836 1674246977000000000
> cpu,cpu=cpu76,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu80,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=92.30769231113655,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=3.8461538464431086,usage_user=3.84615384653056 1674246977000000000
> cpu,cpu=cpu84,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=94.11764706486585,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=5.882352941197451 1674246977000000000
> cpu,cpu=cpu88,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu92,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=70.58823529344627,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=29.411764701983955,usage_user=0 1674246977000000000
> cpu,cpu=cpu96,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=96.15384615040192,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=3.8461538460125784,usage_user=0 1674246977000000000
> cpu,cpu=cpu100,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=97.99999999813735,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=1.999999999998181,usage_user=0 1674246977000000000
> cpu,cpu=cpu104,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=96.07843137993407,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=3.92156862782338,usage_user=0 1674246977000000000
> cpu,cpu=cpu108,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=96.07843136896838,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=1.9607843136879006,usage_user=1.9607843137324836 1674246977000000000
> cpu,cpu=cpu112,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu116,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=95.91836734305988,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=4.08163265313509,usage_user=0 1674246977000000000
> cpu,cpu=cpu120,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=84.61538461280144,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=3.8461538460344413,usage_user=11.53846153830009 1674246977000000000
> cpu,cpu=cpu124,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1674246977000000000
> cpu,cpu=cpu-total,host=kilenc usage_guest=0,usage_guest_nice=0,usage_idle=93.47826086554115,usage_iowait=3.1055900618243673,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=2.484472049468532,usage_user=0.9316770186919254 1674246977000000000
> procstat,exe=mbatchd,host=kilenc,process_name=mbatchd,user=root child_major_faults=0i,child_minor_faults=0i,cpu_time=0i,cpu_time_guest=0,cpu_time_guest_nice=0,cpu_time_idle=0,cpu_time_iowait=0,cpu_time_irq=0,cpu_time_nice=0,cpu_time_soft_irq=0,cpu_time_steal=0,cpu_time_system=0.03,cpu_time_user=0.05,cpu_usage=0,created_at=1674246974000000000i,involuntary_context_switches=1i,major_faults=0i,memory_data=834994176i,memory_locked=0i,memory_rss=815595520i,memory_stack=327680i,memory_swap=0i,memory_usage=2.469912528991699,memory_vms=1091108864i,minor_faults=726i,nice_priority=20i,num_fds=10i,num_threads=2i,pid=62056i,ppid=4103699i,read_bytes=0i,read_count=27i,realtime_priority=0i,rlimit_cpu_time_hard=9223372036854775807i,rlimit_cpu_time_soft=9223372036854775807i,rlimit_file_locks_hard=9223372036854775807i,rlimit_file_locks_soft=9223372036854775807i,rlimit_memory_data_hard=9223372036854775807i,rlimit_memory_data_soft=9223372036854775807i,rlimit_memory_locked_hard=67108864i,rlimit_memory_locked_soft=67108864i,rlimit_memory_rss_hard=9223372036854775807i,rlimit_memory_rss_soft=9223372036854775807i,rlimit_memory_stack_hard=9223372036854775807i,rlimit_memory_stack_soft=8388608i,rlimit_memory_vms_hard=9223372036854775807i,rlimit_memory_vms_soft=9223372036854775807i,rlimit_nice_priority_hard=0i,rlimit_nice_priority_soft=0i,rlimit_num_fds_hard=262144i,rlimit_num_fds_soft=65535i,rlimit_realtime_priority_hard=0i,rlimit_realtime_priority_soft=0i,rlimit_signals_pending_hard=118856i,rlimit_signals_pending_soft=118856i,signals_pending=0i,voluntary_context_switches=5i,write_bytes=0i,write_count=16i 1674246977000000000
> procstat,exe=mbschd,host=kilenc,process_name=mbschd,user=lsfadmin child_major_faults=0i,child_minor_faults=2457641i,cpu_time=320i,cpu_time_guest=0,cpu_time_guest_nice=0,cpu_time_idle=0,cpu_time_iowait=0.02,cpu_time_irq=0,cpu_time_nice=0,cpu_time_soft_irq=0,cpu_time_steal=0,cpu_time_system=8.4,cpu_time_user=312.14,cpu_usage=1.836645120693344,created_at=1674227581000000000i,involuntary_context_switches=3553i,major_faults=1i,memory_data=228851712i,memory_locked=0i,memory_rss=236847104i,memory_stack=196608i,memory_swap=0i,memory_usage=0.717257022857666,memory_vms=246808576i,minor_faults=2137969i,nice_priority=20i,num_fds=3i,num_threads=1i,pid=4103740i,ppid=4103699i,read_bytes=1552384i,read_count=936861i,realtime_priority=0i,rlimit_cpu_time_hard=9223372036854775807i,rlimit_cpu_time_soft=9223372036854775807i,rlimit_file_locks_hard=9223372036854775807i,rlimit_file_locks_soft=9223372036854775807i,rlimit_memory_data_hard=9223372036854775807i,rlimit_memory_data_soft=9223372036854775807i,rlimit_memory_locked_hard=67108864i,rlimit_memory_locked_soft=67108864i,rlimit_memory_rss_hard=9223372036854775807i,rlimit_memory_rss_soft=9223372036854775807i,rlimit_memory_stack_hard=9223372036854775807i,rlimit_memory_stack_soft=8388608i,rlimit_memory_vms_hard=9223372036854775807i,rlimit_memory_vms_soft=9223372036854775807i,rlimit_nice_priority_hard=0i,rlimit_nice_priority_soft=0i,rlimit_num_fds_hard=262144i,rlimit_num_fds_soft=65535i,rlimit_realtime_priority_hard=0i,rlimit_realtime_priority_soft=0i,rlimit_signals_pending_hard=118856i,rlimit_signals_pending_soft=118856i,signals_pending=0i,voluntary_context_switches=43952i,write_bytes=0i,write_count=42311i 1674246977000000000
> procstat_lookup,exe=mbschd,host=kilenc,pid_finder=pgrep,result=success pid_count=1i,result_code=0i,running=1i 1674246977000000000
> procstat,exe=mbatchd,host=kilenc,process_name=mbatchd,user=root child_major_faults=2i,child_minor_faults=4476280i,cpu_time=177i,cpu_time_guest=0,cpu_time_guest_nice=0,cpu_time_idle=0,cpu_time_iowait=6.68,cpu_time_irq=0,cpu_time_nice=0,cpu_time_soft_irq=0,cpu_time_steal=0,cpu_time_system=51.01,cpu_time_user=126.42,cpu_usage=0,created_at=1674227573000000000i,involuntary_context_switches=4993i,major_faults=3i,memory_data=834994176i,memory_locked=0i,memory_rss=827785216i,memory_stack=327680i,memory_swap=0i,memory_usage=2.5068273544311523,memory_vms=1091108864i,minor_faults=2406945i,nice_priority=20i,num_fds=26i,num_threads=3i,pid=4103699i,ppid=4103684i,read_bytes=21008384i,read_count=364726i,realtime_priority=0i,rlimit_cpu_time_hard=9223372036854775807i,rlimit_cpu_time_soft=9223372036854775807i,rlimit_file_locks_hard=9223372036854775807i,rlimit_file_locks_soft=9223372036854775807i,rlimit_memory_data_hard=9223372036854775807i,rlimit_memory_data_soft=9223372036854775807i,rlimit_memory_locked_hard=67108864i,rlimit_memory_locked_soft=67108864i,rlimit_memory_rss_hard=9223372036854775807i,rlimit_memory_rss_soft=9223372036854775807i,rlimit_memory_stack_hard=9223372036854775807i,rlimit_memory_stack_soft=8388608i,rlimit_memory_vms_hard=9223372036854775807i,rlimit_memory_vms_soft=9223372036854775807i,rlimit_nice_priority_hard=0i,rlimit_nice_priority_soft=0i,rlimit_num_fds_hard=262144i,rlimit_num_fds_soft=65535i,rlimit_realtime_priority_hard=0i,rlimit_realtime_priority_soft=0i,rlimit_signals_pending_hard=118856i,rlimit_signals_pending_soft=118856i,signals_pending=0i,voluntary_context_switches=172583i,write_bytes=1562181632i,write_count=12164760i 1674246977000000000
> procstat_lookup,exe=mbatchd,host=kilenc,pid_finder=pgrep,result=success pid_count=2i,result_code=0i,running=2i 1674246977000000000
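
Each line of the test output above is InfluxDB line protocol: a measurement name, comma-separated tags, a space, comma-separated fields, a space, and a nanosecond timestamp. As a quick sanity check, the following minimal Python sketch splits one of the lsf_queues lines into its parts. The helper name parse_line is hypothetical, and the parser assumes there are no escaped commas or spaces and no quoted string fields, which holds for the lines shown here but not for line protocol in general.

```python
def parse_line(line):
    """Parse one line of Telegraf test output into its parts.

    Assumes simple line protocol: no escaped commas or spaces and no
    quoted string fields, which holds for the lsf_* lines shown above.
    """
    line = line.lstrip("> ").strip()
    head, field_part, timestamp = line.split(" ")
    measurement, *tag_pairs = head.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    fields = {}
    for pair in field_part.split(","):
        key, raw = pair.split("=", 1)
        # A trailing "i" marks an integer field; everything else here is a float.
        fields[key] = int(raw[:-1]) if raw.endswith("i") else float(raw)
    return measurement, tags, fields, int(timestamp)


sample = ("> lsf_queues,host=kilenc,name=priority "
          "njobs=93951i,pend=93923i,rsv=0i,run=28i,ssusp=0i,susp=0i,ususp=0i "
          "1674246976000000000")
measurement, tags, fields, ts = parse_line(sample)
print(measurement, tags["name"], fields["pend"], fields["run"])
```

For the priority queue, the pend and run fields add up to njobs, which is a simple way to confirm that the collection script reports consistent values.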

  1. Assuming there were no errors in the previous step with telegraf, proceed to start the telegraf process via systemd.
[root@kilenc telegraf]# systemctl start telegraf
[root@kilenc telegraf]# systemctl status telegraf
● telegraf.service - Telegraf
   Loaded: loaded (/usr/lib/systemd/system/telegraf.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2023-01-19 14:13:51 EST; 1 day 1h ago
     Docs: https://github.com/influxdata/telegraf
 Main PID: 3225959 (telegraf)
    Tasks: 35 (limit: 190169)
   Memory: 192.6M
   CGroup: /system.slice/telegraf.service
           └─3225959 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/tele>

Jan 19 14:13:51 kilenc systemd[1]: Starting Telegraf...
Jan 19 14:13:51 kilenc systemd[1]: Started Telegraf.
  1. On the host running the database instance, adatbazis, run queries to check that the telegraf database exists and that LSF-related data is being logged. This is confirmed in the output below.

Output from InfluxDB queries.
[root@adatbazis fedora]# influx
Connected to https://localhost:8086 version 1.8.10
InfluxDB shell version: 1.8.10
> auth
username: influx
password: 
> show databases
name: databases
name
----
_internal
telegraf
> use telegraf
Using database telegraf
> show field keys
name: cpu
fieldKey         fieldType
--------         ---------
usage_guest      float
usage_guest_nice float
usage_idle       float
usage_iowait     float
usage_irq        float
usage_nice       float
usage_softirq    float
usage_steal      float
usage_system     float
usage_user       float

name: disk
fieldKey     fieldType
--------     ---------
free         integer
inodes_free  integer
inodes_total integer
inodes_used  integer
total        integer
used         integer
used_percent float

name: diskio
fieldKey         fieldType
--------         ---------
io_time          integer
iops_in_progress integer
merged_reads     integer
merged_writes    integer
read_bytes       integer
read_time        integer
reads            integer
weighted_io_time integer
write_bytes      integer
write_time       integer
writes           integer

name: kernel
fieldKey         fieldType
--------         ---------
boot_time        integer
context_switches integer
entropy_avail    integer
interrupts       integer
processes_forked integer

name: lsf_hosts
fieldKey fieldType
-------- ---------
current  integer
peak     integer

name: lsf_jobs
fieldKey fieldType
-------- ---------
value    integer

name: lsf_mbatchd
fieldKey fieldType
-------- ---------
value    integer

name: lsf_queues
fieldKey fieldType
-------- ---------
njobs    integer
pend     integer
rsv      integer
run      integer
ssusp    integer
susp     integer
ususp    integer

name: lsf_servers
fieldKey fieldType
-------- ---------
value    integer

name: lsf_users
fieldKey fieldType
-------- ---------
value    integer

name: mem
fieldKey          fieldType
--------          ---------
active            integer
available         integer
available_percent float
buffered          integer
cached            integer
commit_limit      integer
committed_as      integer
dirty             integer
free              integer
high_free         integer
high_total        integer
huge_page_size    integer
huge_pages_free   integer
huge_pages_total  integer
inactive          integer
low_free          integer
low_total         integer
mapped            integer
page_tables       integer
shared            integer
slab              integer
sreclaimable      integer
sunreclaim        integer
swap_cached       integer
swap_free         integer
swap_total        integer
total             integer
used              integer
used_percent      float
vmalloc_chunk     integer
vmalloc_total     integer
vmalloc_used      integer
write_back        integer
write_back_tmp    integer

name: net
fieldKey              fieldType
--------              ---------
bytes_recv            integer
bytes_sent            integer
drop_in               integer
drop_out              integer
err_in                integer
err_out               integer
icmp_inaddrmaskreps   integer
icmp_inaddrmasks      integer
icmp_incsumerrors     integer
icmp_indestunreachs   integer
icmp_inechoreps       integer
icmp_inechos          integer
icmp_inerrors         integer
icmp_inmsgs           integer
icmp_inparmprobs      integer
icmp_inredirects      integer
icmp_insrcquenchs     integer
icmp_intimeexcds      integer
icmp_intimestampreps  integer
icmp_intimestamps     integer
icmp_outaddrmaskreps  integer
icmp_outaddrmasks     integer
icmp_outdestunreachs  integer
icmp_outechoreps      integer
icmp_outechos         integer
icmp_outerrors        integer
icmp_outmsgs          integer
icmp_outparmprobs     integer
icmp_outredirects     integer
icmp_outsrcquenchs    integer
icmp_outtimeexcds     integer
icmp_outtimestampreps integer
icmp_outtimestamps    integer
icmpmsg_intype0       integer
icmpmsg_intype3       integer
icmpmsg_intype8       integer
icmpmsg_outtype0      integer
icmpmsg_outtype3      integer
icmpmsg_outtype8      integer
ip_defaultttl         integer
ip_forwarding         integer
ip_forwdatagrams      integer
ip_fragcreates        integer
ip_fragfails          integer
ip_fragoks            integer
ip_inaddrerrors       integer
ip_indelivers         integer
ip_indiscards         integer
ip_inhdrerrors        integer
ip_inreceives         integer
ip_inunknownprotos    integer
ip_outdiscards        integer
ip_outnoroutes        integer
ip_outrequests        integer
ip_reasmfails         integer
ip_reasmoks           integer
ip_reasmreqds         integer
ip_reasmtimeout       integer
packets_recv          integer
packets_sent          integer
tcp_activeopens       integer
tcp_attemptfails      integer
tcp_currestab         integer
tcp_estabresets       integer
tcp_incsumerrors      integer
tcp_inerrs            integer
tcp_insegs            integer
tcp_maxconn           integer
tcp_outrsts           integer
tcp_outsegs           integer
tcp_passiveopens      integer
tcp_retranssegs       integer
tcp_rtoalgorithm      integer
tcp_rtomax            integer
tcp_rtomin            integer
udp_ignoredmulti      integer
udp_incsumerrors      integer
udp_indatagrams       integer
udp_inerrors          integer
udp_memerrors         integer
udp_noports           integer
udp_outdatagrams      integer
udp_rcvbuferrors      integer
udp_sndbuferrors      integer
udplite_ignoredmulti  integer
udplite_incsumerrors  integer
udplite_indatagrams   integer
udplite_inerrors      integer
udplite_memerrors     integer
udplite_noports       integer
udplite_outdatagrams  integer
udplite_rcvbuferrors  integer
udplite_sndbuferrors  integer

name: processes
fieldKey      fieldType
--------      ---------
blocked       integer
dead          integer
idle          integer
paging        integer
parked        integer
running       integer
sleeping      integer
stopped       integer
total         integer
total_threads integer
unknown       integer
zombies       integer

name: procstat
fieldKey                     fieldType
--------                     ---------
child_major_faults           integer
child_minor_faults           integer
cpu_time_guest               float
cpu_time_guest_nice          float
cpu_time_idle                float
cpu_time_iowait              float
cpu_time_irq                 float
cpu_time_nice                float
cpu_time_soft_irq            float
cpu_time_steal               float
cpu_time_system              float
cpu_time_user                float
cpu_usage                    float
created_at                   integer
involuntary_context_switches integer
major_faults                 integer
memory_data                  integer
memory_locked                integer
memory_rss                   integer
memory_stack                 integer
memory_swap                  integer
memory_usage                 float
memory_vms                   integer
minor_faults                 integer
num_threads                  integer
pid                          integer
ppid                         integer
voluntary_context_switches   integer

name: procstat_lookup
fieldKey    fieldType
--------    ---------
pid_count   integer
result_code integer
running     integer

name: swap
fieldKey     fieldType
--------     ---------
free         integer
in           integer
out          integer
total        integer
used         integer
used_percent float

name: system
fieldKey       fieldType
--------       ---------
load1          float
load15         float
load5          float
n_cpus         integer
n_unique_users integer
n_users        integer
uptime         integer
uptime_format  string
> select * from metrics
> SELECT * FROM "lsf_hosts";
name: lsf_hosts
time                current host   peak state
----                ------- ----   ---- -----
1674493170000000000 0       kilenc 0    clients
1674493170000000000 32      kilenc 32   slots
1674493170000000000 32      kilenc 32   cores
1674493170000000000 1       kilenc 1    servers
1674493170000000000 2       kilenc 2    cpus
1674493200000000000 1       kilenc 1    servers
1674493200000000000 2       kilenc 2    cpus
1674493200000000000 32      kilenc 32   slots
1674493200000000000 0       kilenc 0    clients
1674493200000000000 32      kilenc 32   cores
1674493230000000000 0       kilenc 0    clients
1674493230000000000 32      kilenc 32   cores
1674493230000000000 2       kilenc 2    cpus
1674493230000000000 1       kilenc 1    servers
1674493230000000000 32      kilenc 32   slots
1674493260000000000 1       kilenc 1    servers
1674493260000000000 32      kilenc 32   slots
1674493260000000000 0       kilenc 0    clients
1674493260000000000 2       kilenc 2    cpus
1674493260000000000 32      kilenc 32   cores
> quit
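
The same check can be scripted against the InfluxDB 1.x HTTP API instead of the interactive shell. The sketch below is a rough illustration only: it builds a URL for the /query endpoint and reduces a response to the most recent value per state. It does not send the request, authentication and TLS handling are left out, the function names are hypothetical, and the sample response is hand-written from the rows shown above.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, database, influxql):
    """Build a URL for the InfluxDB 1.x /query endpoint."""
    params = urlencode({"db": database, "q": influxql, "epoch": "ns"})
    return f"{base_url}/query?{params}"


def latest_by_state(response_text):
    """Return {state: current} using the newest row seen for each state."""
    payload = json.loads(response_text)
    series = payload["results"][0]["series"][0]
    cols = series["columns"]
    t, state, current = cols.index("time"), cols.index("state"), cols.index("current")
    latest = {}
    for row in sorted(series["values"], key=lambda r: r[t]):
        latest[row[state]] = row[current]
    return latest


# Hand-written response in the shape InfluxDB 1.x returns, using rows from the output above.
sample = json.dumps({"results": [{"series": [{
    "name": "lsf_hosts",
    "columns": ["time", "current", "host", "peak", "state"],
    "values": [
        [1674493230000000000, 32, "kilenc", 32, "slots"],
        [1674493260000000000, 1, "kilenc", 1, "servers"],
        [1674493260000000000, 32, "kilenc", 32, "slots"],
        [1674493260000000000, 0, "kilenc", 0, "clients"],
        [1674493260000000000, 2, "kilenc", 2, "cpus"],
        [1674493260000000000, 32, "kilenc", 32, "cores"],
    ]}]}]})

url = build_query_url("https://localhost:8086", "telegraf", 'SELECT * FROM "lsf_hosts"')
print(url)
print(latest_by_state(sample))
```

In a real script the URL would be fetched with an HTTP client of your choice, passing the InfluxDB credentials, and the response body handed to latest_by_state.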

  1. With telegraf successfully logging data to the InfluxDB instance, it is now possible to create a data source in Grafana in order to build a dashboard containing LSF metrics. As noted at the outset, this article is not meant to be an extensive guide to the creation of dashboards in Grafana. In the Grafana navigation, select Configuration > Data sources.
  1. Select the Add data source button, followed by InfluxDB, which is listed under Time series databases. On the settings page, specify the following values:

Variable    Value
URL         http://adatbazis:8086
Database    telegraf
Basic auth  (enable)
User        <influxdb_username>
Password    <influxdb_password>

Next, click on Save & test. If all variables and settings were properly specified, the message "datasource is working. 17 measurements found." is displayed.
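
For those who prefer to script this step, Grafana also exposes an HTTP API for creating data sources. The following Python sketch only builds a JSON body that mirrors the settings listed above; it does not contact Grafana. The field names follow the data source API as I understand it for Grafana 9.x and should be verified against the documentation for your version. The function name and the data source name "InfluxDB" are hypothetical.

```python
import json


def influxdb_datasource_payload(url, database, user, password, name="InfluxDB"):
    """Build a JSON body mirroring the data source settings listed above."""
    return {
        "name": name,              # hypothetical display name
        "type": "influxdb",
        "access": "proxy",         # Grafana server makes the requests
        "url": url,
        "database": database,
        "basicAuth": True,
        "basicAuthUser": user,
        # Secrets go under secureJsonData so Grafana stores them encrypted.
        "secureJsonData": {"basicAuthPassword": password},
    }


payload = influxdb_datasource_payload(
    "http://adatbazis:8086", "telegraf", "<influxdb_username>", "<influxdb_password>"
)
print(json.dumps(payload, indent=2))
```

The resulting body would be sent as a POST to the /api/datasources endpoint of the Grafana server, authenticated with an administrator account or API key.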

  1. With the data source configured in Grafana, the final step is to create a dashboard. Creating a dashboard requires creating panels which display data pulled from the configured data source using targeted queries. With a bit of effort, I was able to piece together the following dashboard, which includes both metrics from LSF and metrics from the Telegraf inputs.procstat plugin for the LSF processes mbatchd, mbschd and the management lim.
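
Each entry under "targets" in the dashboard JSON describes one query built in Grafana's InfluxDB query editor. To make it easier to read those entries, here is a minimal Python sketch that turns a target into roughly the InfluxQL text the editor displays. This is an approximation written for illustration, not Grafana's own code: the function name is hypothetical, multiple tag filters are simply joined with AND, and $timeFilter and $__interval are left as the placeholders Grafana substitutes at query time.

```python
def target_to_influxql(target):
    """Approximate the InfluxQL text for one query-builder target."""
    expr = ""
    for part in target["select"][0]:
        if part["type"] == "field":
            expr = '"{}"'.format(part["params"][0])
        else:
            # Aggregations such as last, mean or distinct wrap the field.
            expr = "{}({})".format(part["type"], expr)

    # A policy of "default" means no retention policy prefix is written.
    policy = target.get("policy", "default")
    source = '"{}"'.format(target["measurement"])
    if policy != "default":
        source = '"{}".{}'.format(policy, source)

    conditions = " AND ".join(
        "\"{}\" {} '{}'".format(t["key"], t["operator"], t["value"])
        for t in target.get("tags", [])
    )
    where = "({}) AND $timeFilter".format(conditions) if conditions else "$timeFilter"

    group = []
    fill = ""
    for part in target.get("groupBy", []):
        if part["type"] == "time":
            group.append("time({})".format(part["params"][0]))
        elif part["type"] == "fill":
            fill = " fill({})".format(part["params"][0])
        elif part["type"] == "tag":
            group.append('"{}"'.format(part["params"][0]))
    group_by = " GROUP BY " + ", ".join(group) + fill if group else ""

    return "SELECT {} FROM {} WHERE {}{}".format(expr, source, where, group_by)


# The "Ok" target of the LSF server status panel, reduced to the keys used here.
ok_target = {
    "measurement": "lsf_servers",
    "policy": "autogen",
    "select": [[{"params": ["value"], "type": "field"}, {"params": [], "type": "last"}]],
    "tags": [{"key": "status", "operator": "=", "value": "ok"}],
    "groupBy": [{"params": ["$__interval"], "type": "time"},
                {"params": ["null"], "type": "fill"}],
}
print(target_to_influxql(ok_target))
```

Reading the targets this way makes it easier to reproduce an individual panel query in the influx shell when a panel shows no data.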

Example dashboard definition (JSON).
{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "datasource",
          "uid": "grafana"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "target": {
          "limit": 100,
          "matchAny": false,
          "tags": [],
          "type": "dashboard"
        },
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": 21,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "collapsed": false,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 35,
      "panels": [],
      "title": "Cluster aggregate current statistics",
      "type": "row"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "eNfWCy5Vk"
      },
      "description": "A view of the current status of the LSF servers in the cluster. Servers can be in one of four states: Ok, Unavailable, Closed and Unreachable. ",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            }
          },
          "decimals": 2,
          "mappings": []
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 9,
        "x": 0,
        "y": 1
      },
      "id": 32,
      "options": {
        "displayLabels": [
          "name",
          "value"
        ],
        "legend": {
          "displayMode": "table",
          "placement": "right",
          "showLegend": true,
          "sortBy": "Value",
          "sortDesc": true,
          "values": [
            "value",
            "percent"
          ]
        },
        "pieType": "donut",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "tooltip": {
          "mode": "multi",
          "sort": "none"
        }
      },
      "targets": [
        {
          "alias": "Ok",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_servers",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "A",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "last"
              }
            ]
          ],
          "tags": [
            {
              "key": "status",
              "operator": "=",
              "value": "ok"
            }
          ]
        },
        {
          "alias": "Closed",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_servers",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "B",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "last"
              }
            ]
          ],
          "tags": [
            {
              "key": "status",
              "operator": "=",
              "value": "closed"
            }
          ]
        },
        {
          "alias": "Unreachable",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_servers",
          "orderByTime": "ASC",
          "policy": "default",
          "refId": "C",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "last"
              }
            ]
          ],
          "tags": [
            {
              "key": "status",
              "operator": "=",
              "value": "unreachable"
            }
          ]
        },
        {
          "alias": "Unavailable",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_servers",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "D",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "status",
              "operator": "=",
              "value": "unavailable"
            }
          ]
        }
      ],
      "title": "Current aggregate LSF server statistics",
      "type": "piechart"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "eNfWCy5Vk"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 3,
        "x": 9,
        "y": 1
      },
      "id": 43,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "text": {},
        "textMode": "auto"
      },
      "pluginVersion": "9.1.6",
      "targets": [
        {
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_jobs",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "A",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "distinct"
              }
            ]
          ],
          "tags": [
            {
              "key": "state",
              "operator": "=",
              "value": "running"
            }
          ]
        }
      ],
      "title": "Currently running",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "eNfWCy5Vk"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "light-red",
                "value": null
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 3,
        "x": 12,
        "y": 1
      },
      "id": 45,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "text": {},
        "textMode": "auto"
      },
      "pluginVersion": "9.1.6",
      "targets": [
        {
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "measurement": "lsf_jobs",
          "orderByTime": "ASC",
          "policy": "default",
          "refId": "A",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "state",
              "operator": "=",
              "value": "suspended"
            }
          ]
        }
      ],
      "title": "Currently suspended",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "eNfWCy5Vk"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            }
          },
          "decimals": 2,
          "mappings": []
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 9,
        "x": 15,
        "y": 1
      },
      "id": 33,
      "options": {
        "displayLabels": [
          "name",
          "value"
        ],
        "legend": {
          "displayMode": "table",
          "placement": "right",
          "showLegend": true,
          "sortBy": "Value",
          "sortDesc": true,
          "values": [
            "value",
            "percent"
          ]
        },
        "pieType": "donut",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "tooltip": {
          "mode": "multi",
          "sort": "none"
        }
      },
      "targets": [
        {
          "alias": "Running",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_jobs",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "A",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "last"
              }
            ]
          ],
          "tags": [
            {
              "key": "state",
              "operator": "=",
              "value": "running"
            }
          ]
        },
        {
          "alias": "Pending",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_jobs",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "B",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "last"
              }
            ]
          ],
          "tags": [
            {
              "key": "state",
              "operator": "=",
              "value": "pending"
            }
          ]
        },
        {
          "alias": "Suspended",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_jobs",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "C",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "last"
              }
            ]
          ],
          "tags": [
            {
              "key": "state",
              "operator": "=",
              "value": "suspended"
            }
          ]
        }
      ],
      "title": "Current aggregate LSF job statistics",
      "type": "piechart"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "eNfWCy5Vk"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "yellow",
                "value": null
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 3,
        "x": 9,
        "y": 5
      },
      "id": 44,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "text": {},
        "textMode": "auto"
      },
      "pluginVersion": "9.1.6",
      "targets": [
        {
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "measurement": "lsf_jobs",
          "orderByTime": "ASC",
          "policy": "default",
          "refId": "A",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "state",
              "operator": "=",
              "value": "pending"
            }
          ]
        }
      ],
      "title": "Currently pending",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "eNfWCy5Vk"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "blue",
                "value": null
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 3,
        "x": 12,
        "y": 5
      },
      "id": 46,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "text": {},
        "textMode": "auto"
      },
      "pluginVersion": "9.1.6",
      "targets": [
        {
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "measurement": "lsf_jobs",
          "orderByTime": "ASC",
          "policy": "default",
          "refId": "A",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "state",
              "operator": "=",
              "value": "finished"
            }
          ]
        }
      ],
      "title": "Finished (past hour)",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "eNfWCy5Vk"
      },
      "description": "Spectrum LSF queue statistics. Shows the number of running, pending and suspended jobs.",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 9,
        "x": 0,
        "y": 9
      },
      "id": 41,
      "options": {
        "displayMode": "lcd",
        "minVizHeight": 10,
        "minVizWidth": 0,
        "orientation": "horizontal",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "showUnfilled": true
      },
      "pluginVersion": "9.1.6",
      "targets": [
        {
          "alias": "Running",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "measurement": "lsf_queues",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "A",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "run"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "last"
              }
            ]
          ],
          "tags": [
            {
              "key": "name",
              "operator": "=~",
              "value": "/^$Queue$/"
            }
          ]
        },
        {
          "alias": "Pending",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_queues",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "B",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "pend"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "last"
              }
            ]
          ],
          "tags": [
            {
              "key": "name",
              "operator": "=~",
              "value": "/^$Queue$/"
            }
          ]
        },
        {
          "alias": "Suspended",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_queues",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "C",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "susp"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "last"
              }
            ]
          ],
          "tags": [
            {
              "key": "name",
              "operator": "=~",
              "value": "/^$Queue$/"
            }
          ]
        }
      ],
      "title": "Current queue statistics ($Queue)",
      "type": "bargauge"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "eNfWCy5Vk"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "min": 0,
          "thresholds": {
            "mode": "percentage",
            "steps": [
              {
                "color": "green",
                "value": null
              }
            ]
          },
          "unit": "none"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 3,
        "x": 9,
        "y": 9
      },
      "id": 53,
      "options": {
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "/^lsf_hosts\\.last$/",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true
      },
      "pluginVersion": "9.1.6",
      "targets": [
        {
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_hosts",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "A",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "current"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "last"
              }
            ],
            [
              {
                "params": [
                  "peak"
                ],
                "type": "field"
              }
            ]
          ],
          "tags": [
            {
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            },
            {
              "condition": "AND",
              "key": "state",
              "operator": "=",
              "value": "servers"
            }
          ]
        }
      ],
      "title": "Servers",
      "type": "gauge"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "eNfWCy5Vk"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "min": 0,
          "thresholds": {
            "mode": "percentage",
            "steps": [
              {
                "color": "yellow",
                "value": null
              }
            ]
          },
          "unit": "none"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 3,
        "x": 12,
        "y": 9
      },
      "id": 54,
      "options": {
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "/^lsf_hosts\\.last$/",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true
      },
      "pluginVersion": "9.1.6",
      "targets": [
        {
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_hosts",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "A",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "current"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "last"
              }
            ],
            [
              {
                "params": [
                  "peak"
                ],
                "type": "field"
              }
            ]
          ],
          "tags": [
            {
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            },
            {
              "condition": "AND",
              "key": "state",
              "operator": "=",
              "value": "cpus"
            }
          ]
        }
      ],
      "title": "CPUs",
      "type": "gauge"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "eNfWCy5Vk"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "stepBefore",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "log": 2,
              "type": "log"
            },
            "showPoints": "auto",
            "spanNulls": true,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 9,
        "x": 15,
        "y": 9
      },
      "id": 42,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "alias": "Running",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_jobs",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "A",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "last"
              }
            ]
          ],
          "tags": [
            {
              "key": "state",
              "operator": "=",
              "value": "running"
            }
          ]
        },
        {
          "alias": "Pending",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_jobs",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "B",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "last"
              }
            ]
          ],
          "tags": [
            {
              "key": "state",
              "operator": "=",
              "value": "pending"
            }
          ]
        },
        {
          "alias": "Suspended",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_jobs",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "C",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "last"
              }
            ]
          ],
          "tags": [
            {
              "key": "state",
              "operator": "=",
              "value": "suspended"
            }
          ]
        }
      ],
      "title": "Aggregate LSF job statistics",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "eNfWCy5Vk"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "min": 0,
          "thresholds": {
            "mode": "percentage",
            "steps": [
              {
                "color": "light-red",
                "value": null
              }
            ]
          },
          "unit": "none"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 3,
        "x": 9,
        "y": 13
      },
      "id": 55,
      "options": {
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "/^lsf_hosts\\.last$/",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true
      },
      "pluginVersion": "9.1.6",
      "targets": [
        {
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_hosts",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "A",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "current"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "last"
              }
            ],
            [
              {
                "params": [
                  "peak"
                ],
                "type": "field"
              }
            ]
          ],
          "tags": [
            {
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            },
            {
              "condition": "AND",
              "key": "state",
              "operator": "=",
              "value": "cores"
            }
          ]
        }
      ],
      "title": "Cores",
      "type": "gauge"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "eNfWCy5Vk"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "min": 0,
          "thresholds": {
            "mode": "percentage",
            "steps": [
              {
                "color": "blue",
                "value": null
              }
            ]
          },
          "unit": "none"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 3,
        "x": 12,
        "y": 13
      },
      "id": 56,
      "options": {
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "/^lsf_hosts\\.last$/",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true
      },
      "pluginVersion": "9.1.6",
      "targets": [
        {
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_hosts",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "A",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "current"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "last"
              }
            ],
            [
              {
                "params": [
                  "peak"
                ],
                "type": "field"
              }
            ]
          ],
          "tags": [
            {
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            },
            {
              "condition": "AND",
              "key": "state",
              "operator": "=",
              "value": "slots"
            }
          ]
        }
      ],
      "title": "Slots",
      "type": "gauge"
    },
    {
      "collapsed": false,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 17
      },
      "id": 37,
      "panels": [],
      "title": "LSF scheduler statistics",
      "type": "row"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "eNfWCy5Vk"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "graph": false,
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": true,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 18
      },
      "id": 20,
      "options": {
        "graph": {},
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "right",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "pluginVersion": "7.5.15",
      "targets": [
        {
          "alias": "CPU utilization (%)",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "measurement": "procstat",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "A",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "cpu_usage"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "exe",
              "operator": "=",
              "value": "mbatchd"
            },
            {
              "condition": "AND",
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            }
          ]
        },
        {
          "alias": "Memory utilization (%)",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "procstat",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "B",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "memory_usage"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "exe",
              "operator": "=",
              "value": "mbatchd"
            },
            {
              "condition": "AND",
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            }
          ]
        },
        {
          "alias": "Number of threads",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "procstat",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "C",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "num_threads"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "exe",
              "operator": "=",
              "value": "mbatchd"
            },
            {
              "condition": "AND",
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            }
          ]
        },
        {
          "alias": "File descriptors",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_mbatchd",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "D",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "fd",
              "operator": "=",
              "value": "used"
            },
            {
              "condition": "AND",
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            }
          ]
        }
      ],
      "title": "LSF mbatchd process metrics",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "eNfWCy5Vk"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "graph": false,
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": true,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 18
      },
      "id": 57,
      "options": {
        "graph": {},
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "right",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "pluginVersion": "7.5.15",
      "targets": [
        {
          "alias": "CPU utilization (%)",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "measurement": "procstat",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "A",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "cpu_usage"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "exe",
              "operator": "=",
              "value": "lim"
            },
            {
              "condition": "AND",
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            }
          ]
        },
        {
          "alias": "Memory utilization (%)",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "procstat",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "B",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "memory_usage"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "exe",
              "operator": "=",
              "value": "lim"
            },
            {
              "condition": "AND",
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            }
          ]
        },
        {
          "alias": "Number of threads",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "procstat",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "C",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "num_threads"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "exe",
              "operator": "=",
              "value": "lim"
            },
            {
              "condition": "AND",
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            }
          ]
        }
      ],
      "title": "LSF management lim process metrics",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "eNfWCy5Vk"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "graph": false,
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": true,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green"
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 26
      },
      "id": 27,
      "options": {
        "graph": {},
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "right",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "pluginVersion": "7.5.15",
      "targets": [
        {
          "alias": "Job buckets",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "measurement": "lsf_mbatchd",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "A",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "sched",
              "operator": "=",
              "value": "buckets"
            },
            {
              "condition": "AND",
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            }
          ]
        },
        {
          "alias": "Matching host criteria",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_mbatchd",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "B",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "sched",
              "operator": "=",
              "value": "matchhost"
            },
            {
              "condition": "AND",
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            }
          ]
        },
        {
          "alias": "Scheduling interval (seconds)",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "lsf_mbatchd",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "C",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "value"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "sched",
              "operator": "=",
              "value": "interval"
            },
            {
              "condition": "AND",
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            }
          ]
        }
      ],
      "title": "LSF scheduler metrics",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "eNfWCy5Vk"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "graph": false,
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": true,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green"
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 26
      },
      "id": 58,
      "options": {
        "graph": {},
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "right",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "pluginVersion": "7.5.15",
      "targets": [
        {
          "alias": "CPU utilization (%)",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "measurement": "procstat",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "A",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "cpu_usage"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "exe",
              "operator": "=",
              "value": "mbschd"
            },
            {
              "condition": "AND",
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            }
          ]
        },
        {
          "alias": "Memory utilization (%)",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "procstat",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "B",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "memory_usage"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "exe",
              "operator": "=",
              "value": "mbschd"
            },
            {
              "condition": "AND",
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            }
          ]
        },
        {
          "alias": "Number of threads",
          "datasource": {
            "type": "influxdb",
            "uid": "eNfWCy5Vk"
          },
          "groupBy": [
            {
              "params": [
                "$__interval"
              ],
              "type": "time"
            },
            {
              "params": [
                "null"
              ],
              "type": "fill"
            }
          ],
          "hide": false,
          "measurement": "procstat",
          "orderByTime": "ASC",
          "policy": "autogen",
          "refId": "C",
          "resultFormat": "time_series",
          "select": [
            [
              {
                "params": [
                  "num_threads"
                ],
                "type": "field"
              },
              {
                "params": [],
                "type": "mean"
              }
            ]
          ],
          "tags": [
            {
              "key": "exe",
              "operator": "=",
              "value": "mbschd"
            },
            {
              "condition": "AND",
              "key": "host",
              "operator": "=",
              "value": "kilenc"
            }
          ]
        }
      ],
      "title": "LSF mbschd process metrics",
      "type": "timeseries"
    }
  ],
  "refresh": "30s",
  "schemaVersion": 37,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": [
      {
        "current": {
          "selected": true,
          "text": [
            "priority"
          ],
          "value": [
            "priority"
          ]
        },
        "datasource": {
          "type": "influxdb",
          "uid": "oSnSlVc4k"
        },
        "definition": "show tag values from \"lsf_queues\" with key=\"name\"",
        "hide": 0,
        "includeAll": false,
        "multi": false,
        "name": "Queue",
        "options": [],
        "query": "show tag values from \"lsf_queues\" with key=\"name\"",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 0,
        "tagValuesQuery": "",
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      }
    ]
  },
  "time": {
    "from": "now-1h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "LSF dashboard",
  "uid": "-tdhK5x4k",
  "version": 2,
  "weekStart": ""
}
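
Each entry under `targets` in the dashboard JSON above is a query-builder description that Grafana turns into an InfluxQL statement. To make that mapping easier to follow, here is a minimal Python sketch that renders one target as a query string. The function name `render_influxql` and the `example_target` dictionary are hypothetical and exist only for this illustration; the output approximates what Grafana generates and is not Grafana's own code.

```python
def render_influxql(target):
    """Render a Grafana InfluxDB query-builder target as an InfluxQL string.

    Hypothetical helper for illustration; it covers only the part types
    used in the dashboard above (field, simple functions, time, fill, tag).
    """
    # SELECT: start from the field, then wrap it in each function in order.
    columns = []
    for parts in target["select"]:
        expr = ""
        for part in parts:
            if part["type"] == "field":
                expr = '"{}"'.format(part["params"][0])
            else:
                expr = "{}({})".format(part["type"], expr)
        columns.append(expr)

    # FROM: prefix the retention policy unless it is "default".
    source = '"{}"'.format(target["measurement"])
    policy = target.get("policy", "default")
    if policy != "default":
        source = '"{}".{}'.format(policy, source)

    # WHERE: chain the tag filters, then add Grafana's time-range macro.
    filters = ""
    for index, tag in enumerate(target.get("tags", [])):
        if index > 0:
            filters += " {} ".format(tag.get("condition", "AND"))
        filters += "\"{}\" {} '{}'".format(
            tag["key"], tag["operator"], tag["value"]
        )
    where = "({}) AND $timeFilter".format(filters) if filters else "$timeFilter"

    # GROUP BY: time and tag terms are comma-separated; fill() is appended.
    group_terms = []
    fill = ""
    for part in target.get("groupBy", []):
        if part["type"] == "fill":
            fill = " fill({})".format(part["params"][0])
        elif part["type"] == "time":
            group_terms.append("time({})".format(part["params"][0]))
        else:
            group_terms.append('"{}"'.format(part["params"][0]))
    group_by = (" GROUP BY " + ", ".join(group_terms) + fill) if group_terms else ""

    return "SELECT {} FROM {} WHERE {}{}".format(
        ", ".join(columns), source, where, group_by
    )


# The mbatchd CPU utilization target from the dashboard, reduced to the
# keys that influence the query text.
example_target = {
    "measurement": "procstat",
    "policy": "autogen",
    "select": [[
        {"type": "field", "params": ["cpu_usage"]},
        {"type": "mean", "params": []},
    ]],
    "tags": [
        {"key": "exe", "operator": "=", "value": "mbatchd"},
        {"condition": "AND", "key": "host", "operator": "=", "value": "kilenc"},
    ],
    "groupBy": [
        {"type": "time", "params": ["$__interval"]},
        {"type": "fill", "params": ["null"]},
    ],
}

query = render_influxql(example_target)
print(query)
```

The `$timeFilter` and `$__interval` macros are left as-is because Grafana substitutes them when it runs the query. To try the statement directly in the `influx` CLI, you would replace them with a concrete time range and interval.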

As you can see, a short plugin script that collects information from LSF is all it takes to monitor your LSF cluster with the TIG stack. It’s important to note that IBM also offers powerful monitoring and reporting tools as add-ons to LSF: IBM Spectrum LSF RTM and IBM Spectrum LSF Explorer. You can find more details about the add-on capabilities for LSF here.