Primary to Secondary migration on a large CFS filesystem takes a long time

Description

Error Message

The following messages are logged when the CFS filesystem is unmounted on the Primary node, or when the Primary role is migrated to another node:

# umount /large_cfs_filesystem

or

# fsclustadm setprimary /large_cfs_filesystem (on any secondary node)

Jan 27 18:29:38 vcs15 kernel: BUG: soft lockup - CPU#3 stuck for 67s! [vx_ctl_thread:12253]
Jan 27 18:29:38 vcs15 kernel: Modules linked in: nfs xfs ext3 jbd ext2 vxodm(P)(U) vxgms(P)(U) amf(P)(U) vxglm(P)(U) vxfen(P)(U) gab(P)(U) llt(P)(U) nfsd lockd nfs_acl auth_rpcgss sunrpc autofs4 coretemp vxspec(P)(U) vxio(P)(U) vxdmp(P)(U) cachefiles fscache(T) dummy bonding 8021q garp stp llc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad iscsi_tcp vxcafs(P)(U) vxportal(P)(U) fdd(P)(U) vxfs(P)(U) exportfs vhost_net macvtap macvlan tun uinput ppdev microcode vmware_balloon parport_pc parport sg i2c_piix4 i2c_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif e1000 mptspi mptscsih mptbase scsi_transport_spi sr_mod cdrom ahci pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod crc32c_intel be2iscsi bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi iw_cxgb3 ib_core ib_addr ipv6 cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: ipmi_msghandler]
Jan 27 18:29:38 vcs15 kernel: CPU 3
Jan 27 18:29:38 vcs15 kernel: Modules linked in: nfs xfs ext3 jbd ext2 vxodm(P)(U) vxgms(P)(U) amf(P)(U) vxglm(P)(U) vxfen(P)(U) gab(P)(U) llt(P)(U) nfsd lockd nfs_acl auth_rpcgss sunrpc autofs4 coretemp vxspec(P)(U) vxio(P)(U) vxdmp(P)(U) cachefiles fscache(T) dummy bonding 8021q garp stp llc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad iscsi_tcp vxcafs(P)(U) vxportal(P)(U) fdd(P)(U) vxfs(P)(U) exportfs vhost_net macvtap macvlan tun uinput ppdev microcode vmware_balloon parport_pc parport sg i2c_piix4 i2c_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif e1000 mptspi mptscsih mptbase scsi_transport_spi sr_mod cdrom ahci pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod crc32c_intel be2iscsi bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi iw_cxgb3 ib_core ib_addr ipv6 cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: ipmi_msghandler]
Jan 27 18:29:38 vcs15 kernel:
Jan 27 18:29:38 vcs15 kernel: Pid: 12253, comm: vx_ctl_thread Tainted: P --------------- T 2.6.32-504.3.3.el6.x86_64 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
Jan 27 18:29:38 vcs15 kernel: RIP: 0010:[] [] vx_dalist_remau+0x6c/0x90 [vxfs]
Jan 27 18:29:38 vcs15 kernel: RSP: 0018:ffff880228e39d60 EFLAGS: 00000212
Jan 27 18:29:38 vcs15 kernel: RAX: 00000000000561dc RBX: ffff880228e39d60 RCX: 00000000002b0ee8
Jan 27 18:29:38 vcs15 kernel: RDX: ffff880222400000 RSI: 00000000002b0ee0 RDI: ffff88023636e080
Jan 27 18:29:38 vcs15 kernel: RBP: ffffffff8100bb8e R08: 000000000006aefe R09: 0000000000000000
Jan 27 18:29:38 vcs15 kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff880228e39d90
Jan 27 18:29:38 vcs15 kernel: R13: ffff88022ade2c00 R14: ffff8802181a4d40 R15: 0000000000005062
Jan 27 18:29:38 vcs15 kernel: FS: 00007f3dd73dd700(0000) GS:ffff880028380000(0000) knlGS:0000000000000000
Jan 27 18:29:38 vcs15 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 27 18:29:38 vcs15 kernel: CR2: 000000320a8acd50 CR3: 00000001d3027000 CR4: 00000000000007e0
Jan 27 18:29:38 vcs15 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 27 18:29:38 vcs15 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 27 18:29:38 vcs15 kernel: Process vx_ctl_thread (pid: 12253, threadinfo ffff880228e38000, task ffff880228981500)
Jan 27 18:29:38 vcs15 kernel: Stack:
Jan 27 18:29:38 vcs15 kernel: ffff880228e39d90 ffffffffa0497db6 ffff88022ade2c00 0000000000015062
Jan 27 18:29:38 vcs15 kernel:  ffff88023636e080 ffff8802181a4d40 ffff880228e39df0 ffffffffa0497ea7
Jan 27 18:29:38 vcs15 kernel:  ffff880200000001 0000000000000000 ffff880228e39de0 ffff88022ade2d70
Jan 27 18:29:38 vcs15 kernel: Call Trace:
Jan 27 18:29:38 vcs15 kernel: [] ? vx_mark_dele+0xe6/0x190 [vxfs]
Jan 27 18:29:38 vcs15 kernel: [] ? vx_abd_dele_dv+0x47/0x190 [vxfs]
Jan 27 18:29:38 vcs15 kernel: [] ? vx_abd_dele+0x50/0x150 [vxfs]
Jan 27 18:29:38 vcs15 kernel: [] ? vx_abdicate+0x1d/0x30 [vxfs]
Jan 27 18:29:38 vcs15 kernel: [] ? vx_recv_vrtmigr+0x1f0/0x2f0 [vxfs]
Jan 27 18:29:38 vcs15 kernel: [] ? vx_recvvrt+0x6c/0xe0 [vxfs]
Jan 27 18:29:38 vcs15 kernel: [] ? vx_ctl_process_thread+0x2ac/0x360 [vxfs]
Jan 27 18:29:38 vcs15 kernel: [] ? vx_recvvrt+0x0/0xe0 [vxfs]
Jan 27 18:29:38 vcs15 kernel: [] ? vx_ctl_process_thread+0x0/0x360 [vxfs]
Jan 27 18:29:38 vcs15 kernel: [] ? vx_kthread_init+0x7b/0x90 [vxfs]
Jan 27 18:29:38 vcs15 kernel: [] ? vx_ctl_process_thread+0x0/0x360 [vxfs]
Jan 27 18:29:38 vcs15 kernel: [] ? child_rip+0xa/0x20
Jan 27 18:29:38 vcs15 kernel: [] ? vx_ctl_process_thread+0x0/0x360 [vxfs]
Jan 27 18:29:38 vcs15 kernel: [] ? vx_kthread_init+0x0/0x90 [vxfs]
Jan 27 18:29:38 vcs15 kernel: [] ? child_rip+0x0/0x20
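
To check whether a node has logged this signature, the system log can be searched for soft lockup reports raised by vx_ctl_thread. The following is a minimal sketch, not a Veritas-supplied tool. The function name find_cfs_lockups is hypothetical, and the default log path /var/log/messages is an assumption that depends on the local syslog configuration.

```shell
#!/bin/sh
# Hypothetical helper: print soft lockup reports raised by the VxFS
# control thread (vx_ctl_thread), as seen in the messages above.
# Usage: find_cfs_lockups [logfile]
# The default path is an assumption; adjust it to the local syslog setup.
find_cfs_lockups() {
    logfile="${1:-/var/log/messages}"
    # -F matches the strings literally, so "[" and "#" need no escaping.
    grep -F 'BUG: soft lockup' "$logfile" | grep -F 'vx_ctl_thread'
}
```

Running find_cfs_lockups with no argument searches the default log. No output means that no matching lockup report was found in that file.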

Cause

When the filesystem is unmounted on the CFS Primary, or the Primary role is swapped, the current Primary node must process the delegated allocation units.

The larger the filesystem, the longer the list of allocation units, and the longer the migration takes.

Resolution

A supported hotfix has been made available for this issue. Please contact Veritas Technical Support to obtain this fix. This hotfix has not yet gone through extensive QA testing. Consequently, if you are not adversely affected by this problem and have a satisfactory temporary workaround in place, we recommend that you wait for the public release of this hotfix.

Veritas Technologies LLC currently plans to address this issue by way of a patch or hotfix to the current version of the software. Please note that Veritas Technologies LLC reserves the right to remove any fix from the targeted release if it does not pass quality assurance tests. Veritas’ plans are subject to change and any action taken by you based on the above information or your reliance upon the above information is made at your own risk.

Applies To

Linux systems running Storage Foundation Cluster File System (SFCFS) 6.1 and above, with large shared filesystems (larger than 20 TB).
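
To see whether a mounted filesystem falls in this size range, its total size can be compared against the 20 TB mark. The following is a rough sketch under stated assumptions: the function names are hypothetical, 20 TB is interpreted as 20 TiB (the article does not say which unit is meant), and the fsclustadm -v showprimary call is included only to show which node currently holds the Primary role, and is skipped if the command is not installed.

```shell
#!/bin/sh
# Assumption: "20TB" is taken as 20 TiB, expressed here in 1 KiB blocks.
THRESHOLD_KB=$((20 * 1024 * 1024 * 1024))

# Hypothetical helper: succeed if SIZE_KB is above the threshold.
exceeds_threshold() {
    [ "$1" -gt "$THRESHOLD_KB" ]
}

# Hypothetical helper: total size of a mounted filesystem in 1 KiB
# blocks, read from the second column of POSIX-format df output.
fs_size_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $2 }'
}

# Hypothetical helper: report the size of a mount point and, where
# fsclustadm is available, the current CFS Primary for it.
check_cfs_mount() {
    mnt="$1"
    size_kb=$(fs_size_kb "$mnt")
    [ -n "$size_kb" ] || return 1
    if exceeds_threshold "$size_kb"; then
        echo "$mnt: $size_kb KB, above 20 TB"
    else
        echo "$mnt: $size_kb KB, not above 20 TB"
    fi
    if command -v fsclustadm >/dev/null 2>&1; then
        fsclustadm -v showprimary "$mnt"
    fi
}
```

For example, check_cfs_mount /large_cfs_filesystem reports the size of the mount point used in this article.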

Issue/Introduction

With a large shared filesystem, migrating the CFS Primary role to a secondary node with fsclustadm setprimary, or unmounting the filesystem on the Primary, takes a long time and can appear to hang.

Additional Information

ETrack: 3233276
