Higher system load caused by hbthread feature "llt_hb" after applying infoscale-rhel7_x86_64-Patch-7.3.1.3600

Article ID: 100050817

Description

Error Message

No errors are reported, but multiple llt_hb heartbeat threads are observed in uninterruptible sleep (state D):
 

# ps auxH | awk '$8 ~ /^D/{print}'
root 2488 0.0 0.0 0 0 ? D 04:00 0:00 [llt_hb/0]
root 2490 0.0 0.0 0 0 ? D 04:00 0:00 [llt_hb/1]
root 2491 0.0 0.0 0 0 ? D 04:00 0:00 [llt_hb/2]
root 2492 0.0 0.0 0 0 ? D 04:00 0:00 [llt_hb/3]
root 2493 0.0 0.0 0 0 ? D 04:00 0:00 [llt_hb/4]
root 2494 0.0 0.0 0 0 ? D 04:00 0:00 [llt_hb/5]
root 2495 0.0 0.0 0 0 ? D 04:00 0:00 [llt_hb/6]
root 2496 0.0 0.0 0 0 ? D 04:00 0:00 [llt_hb/7]
root 2497 0.0 0.0 0 0 ? D 04:00 0:00 [llt_hb/8]
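
The check above can be turned into a count for comparison with the number of CPUs. The following is a minimal sketch, not part of the product: the function name count_blocked_llt_hb is hypothetical, and it assumes the default `ps aux` column layout, where field 8 is STAT and field 11 is COMMAND.

```shell
#!/bin/sh
# Hypothetical helper: count llt_hb kernel threads in uninterruptible sleep.
# Reads `ps auxH`-style output on stdin.
# Assumes field 8 is the process state and field 11 is the command name.
count_blocked_llt_hb() {
    awk '$8 ~ /^D/ && $11 ~ /^\[llt_hb\// { n++ } END { print n + 0 }'
}
```

For example, `ps auxH | count_blocked_llt_hb` prints the count, which can then be compared with the output of `nproc`.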

 

Cause

This behavior results from a change introduced to support VMware vMotion and snapshots. The LLT heartbeat thread was not getting sufficient CPU cycles to run, resulting in LLT packet send/receive failures. This would eventually result in nodes being ejected from the cluster due to missing heartbeats.

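One heartbeat thread is created per CPU to increase the efficiency of the heartbeat. Linux counts tasks in uninterruptible sleep (state D) toward the load average, which is consistent with the reported load rising to roughly the number of CPUs.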
 

Resolution

Because this change is only required in VMware environments, the feature can be disabled in other environments.

a. Verify the feature is enabled:

# lltconfig -H query
Current LLT miscellaneous values:
sleepalloc = 0
hbthread = 1
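
For scripted checks, the hbthread value can be extracted from the query output. This is a sketch under the assumption that the output keeps the `name = value` layout shown above; the function name hbthread_value is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the hbthread value from `lltconfig -H query`
# output read on stdin. Assumes lines of the form "hbthread = 1".
# Prints nothing if no such line is present.
hbthread_value() {
    awk '$1 == "hbthread" && $2 == "=" { print $3 }'
}
```

For example, `lltconfig -H query | hbthread_value` would print 1 while the feature is enabled.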

b. Disable the feature:

# lltconfig -H hbthread:0

 

To enable or disable the feature permanently, add the corresponding line to the /etc/llttab file.


To enable:

# vi /etc/llttab

set-misc hbthread:1
 

To disable: 

# vi /etc/llttab

set-misc hbthread:0


NOTE: The change will be effective from the next cluster start.
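
Instead of editing the file by hand, the entry can be set with a small helper. The following is a minimal sketch, assuming the entry is written as `set-misc hbthread:<value>` at the start of a line; the function name set_hbthread and the .bak backup copy are illustrative choices, not part of InfoScale.

```shell
#!/bin/sh
# Hypothetical helper: set "set-misc hbthread:<value>" in an llttab-style file.
# Replaces an existing hbthread entry, or appends one if none exists.
# A backup copy is kept as <file>.bak.
set_hbthread() {
    file=$1
    value=$2
    cp -p "$file" "$file.bak" || return 1
    if grep -q '^set-misc[[:space:]]\{1,\}hbthread:' "$file"; then
        # Rewrite the existing entry, reading from the backup copy.
        sed "s/^set-misc[[:space:]]\{1,\}hbthread:.*/set-misc hbthread:$value/" \
            "$file.bak" > "$file"
    else
        # Make sure the file ends with a newline before appending.
        if [ -n "$(tail -c 1 "$file")" ]; then
            echo >> "$file"
        fi
        printf 'set-misc hbthread:%s\n' "$value" >> "$file"
    fi
}
```

For example, `set_hbthread /etc/llttab 0` (run as root) would disable the feature persistently. Trying the helper on a copy of the file first is advisable.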

Issue/Introduction

After applying the infoscale-rhel7_x86_64-Patch-7.3.1.3600 patch, the system load increases dramatically.

Example:

Before patch:

12:56:08 up 16 min, 2 users, load average: 0.21, 0.21, 0.27

After patch:

# uptime
14:50:38 up 12 min, 2 users, load average: 3.34, 4.25, 2.69

The load average increases to approximately the number of CPUs installed on the system. Here is an example on an 8-CPU server:

# uptime
05:29:04 up 18 min, 1 user, load average: 7.17, 3.19, 1.29

# grep processor /proc/cpuinfo
processor : 0
processor : 1
processor : 2
processor : 3
processor : 4
processor : 5
processor : 6
processor : 7

# uptime
05:29:18 up 18 min, 1 user, load average: 7.43, 3.44, 1.40

# uptime
05:29:25 up 18 min, 1 user, load average: 7.52, 3.59, 1.47

# uptime
05:29:27 up 18 min, 1 user, load average: 7.52, 3.59, 1.47

# uptime
05:34:57 up 23 min, 1 user, load average: 8.00, 6.54, 3.43

# uptime
05:34:59 up 23 min, 1 user, load average: 8.00, 6.54, 3.43
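
To make the comparison between load and CPU count explicit, the 1-minute load average can be expressed per CPU. This is a minimal sketch: the function name load_per_cpu is hypothetical, and it assumes the Linux /proc/loadavg format, in which the first field is the 1-minute load average.

```shell
#!/bin/sh
# Hypothetical helper: print the 1-minute load average divided by CPU count.
# $1: contents of /proc/loadavg, e.g. "8.00 6.54 3.43 9/250 3011"
# $2: number of CPUs, e.g. the output of nproc
load_per_cpu() {
    echo "$1" | awk -v cpus="$2" 'cpus > 0 { printf "%.2f\n", $1 / cpus }'
}
```

For example, `load_per_cpu "$(cat /proc/loadavg)" "$(nproc)"` prints a value close to 1.00 when the load average is approximately equal to the number of CPUs, as in the 8-CPU example above.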