InfoScale 7.4: Clustered File System (CFS) hangs and other service outages after upgrading to 7.4, caused by a GLM defect (vxg_svar_sleep_unlock)


Article ID: 100044752


Description

Error Message


Sample stack #1

[37831.842357] INFO: task chmod:19494 blocked for more than 120 seconds.
[37831.843475] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[37831.844329] chmod           D ffff8e71f87ebf40     0 19494  19391 0x00000080
[37831.844334] Call Trace:
[37831.844344]  [] ? kmem_cache_alloc+0x35/0x1f0
[37831.844352]  [] schedule+0x29/0x70
[37831.844363]  [] vxg_svar_sleep_unlock+0x78/0xf0 [vxglm]
[37831.844368]  [] ? wake_up_state+0x20/0x20
[37831.844375]  [] vxg_grant_sleep+0x157/0x1b0 [vxglm]
[37831.844381]  [] vxg_cmn_lock+0x54d/0x870 [vxglm]
[37831.844388]  [] ? vxg_lock_ilock_omnibus+0x36d/0x3c0 [vxglm]
[37831.844392]  [] vxg_api_lock+0x8b/0xc0 [vxglm]
[37831.844398]  [] ? vxg_def_nohash2+0x10/0x10 [vxglm]
[37831.844452]  [] vx_glm_lock+0x2e/0x60 [vxfs]
[37831.844481]  [] vxg_svar_sleep_unlock+0x78/0xf0 [vxglm]
[] vxg_grant_sleep+0x157/0x1b0 [vxglm]
[] vxg_cmn_lock+0x54d/0x870 [vxglm]
[] vxg_api_lock+0x8b/0xc0 [vxglm]
[] vx_glm_lock+0x2e/0x60 [vxfs]
[] vx_ihlock+0x2b/0xa0 [vxfs]
[] vx_cfs_iread+0x105/0x220 [vxfs]
[] vx_iget+0xae1/0x1520 [vxfs]
[] vx_dirlook+0x1b9/0x780 [vxfs]
[] vx_int_lookup+0x43b/0x5e0 [vxfs]
[] vx_do_lookup2+0x231/0x280 [vxfs]
[] vx_do_lookup+0x57/0x110 [vxfs]
[] vx_lookup+0x116/0x490 [vxfs]
[] lookup_real+0x23/0x60
[] __lookup_hash+0x42/0x60
[] lookup_slow+0x42/0xa7
[] path_lookupat+0x838/0x8b0
[] filename_lookup+0x2b/0xc0
[] user

 

Sample stack #2
 

[151881.197047] INFO: task vx_worklist_thr:1333 blocked for more than 120 seconds.
[151881.197714] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[151881.198402] vx_worklist_thr D ffff913c79804f10     0  1333      2 0x00000080
[151881.198405] Call Trace:
[151881.198408]  [] schedule+0x29/0x70
[151881.198410]  [] schedule_timeout+0x239/0x2c0
[151881.198413]  [] wait_for_completion+0xfd/0x140
[151881.198415]  [] ? wake_up_state+0x20/0x20
[151881.198452]  [] vx_bc_biowait+0x19/0x40 [vxfs]
[151881.198470]  [] vx_bc_bwrite+0xe8/0x1e0 [vxfs]
[151881.198488]  [] vx_bwrite+0xd3/0x270 [vxfs]
[151881.198508]  [] vx_tflush_map+0x51a/0x690 [vxfs]
[151881.198543]  [] vx_put_dele+0x51a/0xb50 [vxfs]
[151881.198551]  [] ? vxg_lock_ilock_omnibus+0x36d/0x3c0 [vxglm]
[151881.198555]  [] ? check_preempt_wakeup+0x11d/0x250
[151881.198559]  [] ? vxg_def_nohash2+0x10/0x10 [vxglm]
[151881.198588]  [] vx_idele_release_fs+0x35e/0x3c0 [vxfs]
[151881.198623]  [] vx_do_fsext+0x24/0x40 [vxfs]
[151881.198656]  [] ? vx_umount_thaw+0x30/0x30 [vxfs]
[151881.198690]  [] vx_workitem_process+0x1c/0x40 [vxfs]
[151881.198723]  [] vx_worklist_process+0x108/0x230 [vxfs]
[151881.198757]  [] vx_walk_fslist+0x2ef/0x300 [vxfs]
[151881.198787]  [] ? vx_recv_revokedele+0x2e0/0x2e0 [vxfs]
[151881.198814]  [] ? vx_mdele_rele_ip+0x40/0x40 [vxfs]
[151881.198842]  [] vx_idele_release+0x2f/0x40 [vxfs]
[151881.198875]  [] vx_workitem_process+0x1c/0x40 [vxfs]
[151881.198908]  [] vx_worklist_process+0x215/0x230 [vxfs]
[151881.198942]  [] ? vx_osdep_deinit+0x1d0/0x1d0 [vxfs]
[151881.198980]  [] vx_worklist_thread+0x98/0x100 [vxfs]
[151881.199023]  [] ? vx_worklist_process+0x230/0x230 [vxfs]
[151881.199056]  [] vx_kthread_init+0x46/0x50 [vxfs]

 

Sample stack #3

Oct 6 02:42:08 server01 kernel: INFO: task sas:28980 blocked for more than 120 seconds.
Oct 6 02:42:08 server01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 6 02:42:08 server01 kernel: sas D ffff8dfa6d7d8fd0 0 28980 28805 0x00000080
Oct 6 02:42:08 server01 kernel: Call Trace:
Oct 6 02:42:08 server01 kernel: [] ? kmem_cache_alloc+0x35/0x1f0
Oct 6 02:42:08 server01 kernel: [] ? vxg_alloc_direct+0x38/0x110 [vxglm]
Oct 6 02:42:08 server01 kernel: [] schedule+0x29/0x70
Oct 6 02:42:08 server01 kernel: [] vxg_svar_sleep_unlock+0x78/0xf0 [vxglm]
Oct 6 02:42:08 server01 kernel: [] ? wake_up_state+0x20/0x20
Oct 6 02:42:08 server01 kernel: [] vxg_grant_sleep+0x157/0x1b0 [vxglm]
Oct 6 02:42:08 server01 kernel: [] vxg_cmn_lock+0x54d/0x870 [vxglm]
Oct 6 02:42:08 server01 kernel: [] ? vxg_lock_ilock_omnibus+0x36d/0x3c0 [vxglm]
Oct 6 02:42:08 server01 kernel: [] ? vxg_lock_ilock_omnibus+0x36d/0x3c0 [vxglm]
Oct 6 02:42:08 server01 kernel: [] ? vxg_lock_ilock_omnibus+0x36d/0x3c0 [vxglm]
Oct 6 02:42:08 server01 kernel: [] vxg_api_lock+0x8b/0xc0 [vxglm]
Oct 6 02:42:08 server01 kernel: [] ? vxg_def_nohash2+0x10/0x10 [vxglm]
Oct 6 02:42:08 server01 kernel: [] vx_do_cfs_frlock+0x184/0x440 [vxfs]
Oct 6 02:42:08 server01 kernel: [] ? mntput+0x24/0x40
Oct 6 02:42:08 server01 kernel: [] ? terminate_walk+0x49/0x50
Oct 6 02:42:08 server01 kernel: [] ? do_last+0x66d/0x12c0
Oct 6 02:42:08 server01 kernel: [] vx_cfs_frlock+0x99/0xc0 [vxfs]
Oct 6 02:42:08 server01 kernel: [] vx_frlock+0x109/0x350 [vxfs]
Oct 6 02:42:08 server01 kernel: [] vfs_lock_file+0x35/0x60
Oct 6 02:42:08 server01 kernel: [] locks_remove_posix.part.27+0x89/0xd0
Oct 6 02:42:08 server01 kernel: [] locks_remove_posix+0x20/0x30
Oct 6 02:42:08 server01 kernel: [] filp_close+0x56/0x90
Oct 6 02:42:08 server01 kernel: [] __close_fd+0x8c/0xb0
Oct 6 02:42:08 server01 kernel: [] SyS_close+0x23/0x50
Oct 6 02:42:08 server01 kernel: [] system_call_fastpath+0x1c/0x21

Cause


GLM defect details.
 

A series of private hot-fixes has been created to address the following GLM-related defect.

---------------------------------------
This patch fixes the following incidents:

Patch ID: 7.4.0.1101

* 3961283 (Tracking ID: 3960823)

SYMPTOM:
A permanent hang is observed with a stack trace similar to the following:

schedule
vxg_svar_sleep_unlock
vxg_grant_sleep
vxg_cmn_lock
vxg_api_lock
vx_glm_lock
vx_cfs_ifcntllock
vx_do_cfs_frlk_setlk
vx_do_cfs_frlock
vx_cfs_frlock
vx_frlock
vfs_lock_file
fcntl_setlk
 

DESCRIPTION:

With the introduction of PIN GRANT support in GLM, new flags/variables were introduced that are manipulated using bitwise operations. These new flags/variables were not protected by the same lock, which resulted in corruption of the variable values and led to a deadlock.
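
The failure mode described here has the shape of a classic lost update. The following is a minimal Python sketch of that pattern, not GLM code: the flag names PIN_FLAG and GRANT_FLAG and the shared word are hypothetical, and the overlap between two threads is replayed step by step so the outcome is deterministic.

```python
# Hypothetical flag bits sharing one integer word (names are illustrative,
# not taken from the GLM source).
PIN_FLAG = 0x1    # updated by "thread A" while holding lock A
GRANT_FLAG = 0x2  # updated by "thread B" while holding lock B


def interleaved_update(word: int) -> int:
    """Replay two overlapping read-modify-write updates on one word.

    Each thread holds a different lock, so nothing prevents the two
    read-modify-write sequences from overlapping.
    """
    a_snapshot = word                # thread A reads the word
    b_snapshot = word                # thread B reads the same old value
    word = a_snapshot | PIN_FLAG     # thread A writes back: PIN_FLAG set
    word = b_snapshot | GRANT_FLAG   # thread B writes back: A's update is lost
    return word


result = interleaved_update(0)
print(f"final word = {result:#x}")
print("PIN_FLAG lost:", not (result & PIN_FLAG))
```

In this sketch the final word holds only GRANT_FLAG. A thread that sleeps until the lost bit is observed would never be woken, which is one plausible way such corruption could surface as a permanent hang.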

 

RESOLUTION:
The GLM code has been changed to separate the flags/variables so that bitwise operations no longer corrupt their values.
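
As an illustration of why separating the storage helps, the sketch below keeps each flag group in its own field, so a read-modify-write on one field cannot overwrite the other even when the two updates overlap. The LockState class and its field names are assumptions made for this example and do not reflect the actual GLM data structures.

```python
from dataclasses import dataclass


@dataclass
class LockState:
    """Hypothetical lock state with each flag group in its own field."""
    pin_flags: int = 0    # only modified while holding lock A
    grant_flags: int = 0  # only modified while holding lock B


PIN_SET = 0x1
GRANT_SET = 0x1  # bit values may overlap because the fields are independent


def overlapping_updates(state: LockState) -> LockState:
    """Replay overlapping read-modify-write sequences on separate fields."""
    a_snapshot = state.pin_flags                # thread A reads its own field
    b_snapshot = state.grant_flags              # thread B reads its own field
    state.pin_flags = a_snapshot | PIN_SET      # thread A writes back
    state.grant_flags = b_snapshot | GRANT_SET  # thread B writes back
    return state


state = overlapping_updates(LockState())
print(state)
```

Both updates survive because neither write touches the other thread's field.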

 

 

Resolution

Please contact Veritas Technical Support to obtain the private hot-fixes for InfoScale 7.4:

glm-rhel6_x86_64-HotFix-7.4.0.1201

glm-rhel7_x86_64-HotFix-7.4.0.1201

glm-sles11_x86_64-HotFix-7.4.0.1201

glm-sles12_x86_64-HotFix-7.4.0.1201

 

Reference: Etrack 3960823

 

Issue/Introduction


When upgrading to InfoScale 7.4, you may encounter issues in Cluster File System (CFS) environments. The GLM (Global Lock Manager) defect outlined in this article may result in CFS hangs and other service outages, such as system panics triggered by vxfen as a result of LLT symptoms caused by hangs in GLM. When reviewing the system log files, you may notice several vxfs/glm threads reported as blocked for more than 120 seconds.
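
To check a saved system log for this signature, a short script can help. The following Python sketch lists the hung-task reports whose call trace includes vxg_svar_sleep_unlock. The function name find_glm_hangs and the matching rules are assumptions for illustration and are not part of any Veritas tooling. The sample lines are taken from the stacks in this article.

```python
import re
from typing import Iterable, List

# Kernel hung-task report header, e.g.
# "INFO: task chmod:19494 blocked for more than 120 seconds."
HUNG_TASK = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
SIGNATURE = "vxg_svar_sleep_unlock"


def find_glm_hangs(lines: Iterable[str]) -> List[str]:
    """Return "name:pid" for blocked tasks whose trace contains SIGNATURE."""
    hits: List[str] = []
    current = None  # task whose call trace is currently being read
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            current = f"{match.group(1)}:{match.group(2)}"
        elif current and SIGNATURE in line:
            hits.append(current)
            current = None  # count each report only once
    return hits


sample = [
    "[37831.842357] INFO: task chmod:19494 blocked for more than 120 seconds.",
    "[37831.844352]  [] schedule+0x29/0x70",
    "[37831.844363]  [] vxg_svar_sleep_unlock+0x78/0xf0 [vxglm]",
    "[151881.197047] INFO: task vx_worklist_thr:1333 blocked for more than 120 seconds.",
    "[151881.198452]  [] vx_bc_biowait+0x19/0x40 [vxfs]",
]
print(find_glm_hangs(sample))
```

For a real log, pass an open file object instead of the sample list. A task that appears in the result matches the stack pattern of sample stacks #1 and #3. A blocked task without the signature, like the vx_worklist_thr example, may still be a secondary effect of the same hang.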