Solaris 11 SPARC environments running VRTSvxvm 8.0.2.1500 or higher may encounter unwanted vxconfigd restarts.
This can occur more frequently when the number of paths per DMPNODE is higher than 8.
To reduce the chances of vxconfigd restarting, limit the DMPNODE path count to a maximum of 8 paths.
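As a rough aid for spotting DMP nodes above that limit, the sketch below filters the output of `vxdmpadm getdmpnode`. The function name `dmpnodes_over_limit` is hypothetical, and the filter assumes that column 1 of the output is NAME and column 4 is PATHS; confirm the header on your system before relying on it.

```shell
#!/bin/sh
# Hypothetical filter: read `vxdmpadm getdmpnode` output on stdin and print
# the DMP nodes whose path count exceeds a limit (default 8).
# Assumption: column 1 is NAME and column 4 is PATHS. Header and separator
# lines are skipped because their fourth field is not a number.
dmpnodes_over_limit() {
    limit=${1:-8}
    awk -v limit="$limit" \
        '$4 ~ /^[0-9]+$/ && $4 + 0 > limit + 0 { print $1, $4 }'
}

# Usage: vxdmpadm getdmpnode | dmpnodes_over_limit 8
```

Each output line is a DMP node name followed by its path count; no output means no node exceeded the limit.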
The issue is related to the code changes for incident 4155091, made as part of the 8.0.2u2 patch release.
Incident: 4155091 Add a tunable to control log file permissions and honour the tunable
The code changes made for incident 4155091 can cause race conditions between vxconfigd threads.
The incident impacted all platforms.
ONEOFF hotfix available
Veritas has released a VRTSvxvm 8.0.2.1570 ONEOFF hotfix for Solaris 11.4 Sparc environments. The issue is applicable to all platforms.
A supported hotfix has been made available for this issue. Please contact Technical Support to obtain this fix. This hotfix has not yet gone through any extensive QA testing. Consequently, if you are not adversely affected by this problem and have a satisfactory temporary workaround in place, we recommend that you wait for the public release of this hotfix.
The Product Engineering Team currently plans to address this issue by way of a patch or hotfix to the current version of the software. Please note that we as a company reserve the right to remove any fix from the targeted release if it does not pass quality assurance tests. Our plans are subject to change and any action taken by you based on the above information or your reliance upon the above information is made at your own risk.
Please contact your Sales representative or the Sales group for upgrade information including upgrade eligibility to the release containing the resolution for this issue.
Troubleshooting steps
Determine whether vxconfigd has generated a core file.
Run "vxgetcore" against the vxconfigd core file.
Location: /opt/VRTSspt/vxgetcore
Sample syntax:
# /opt/VRTSspt/vxgetcore/vxgetcore -c path-to-core### -b /sbin/vxconfigd
Sample pstack.out file
# cat pstack.out
core '/core' of 20772: /sbin/vxconfigd -k -x syslog
------------ lwp# 1 / thread# 1 ---------------
ff0cc000 pthread_mutex_unlock (0xff12bb40?, 0x0?, 0xff000000?, 0xfed02a40?, 0xffffff?, 0x0?) + 1d0
001983b8 vol_cbr_oplistfree (0x1528a28?, 0x12?, 0x3fe4c0?, 0x420d98?, 0x110e318?, 0x2a229f0?) + 34
00197fd0 vol_clntaddop (0xe?, 0x111?, 0x12c6738?, 0x12?, 0x1528a28?, 0x6be?) + e0
001a1e24 vol_cbr_translog (0x1528a28?, 0x400af78?, 0x12?, 0xfdc1bdf0?, 0xb5eb1c?, 0x20?) + 90f0
001737d8 vold_preprocess_request (0x400af78?, 0x1528a28?, 0x1110?, 0x1110?, 0x3a8800?, 0x3b1828?) + 170
00173a5c vold_dispatch_requests (0x0?, 0x1528a28?, 0x420c18?, 0x3b1828?, 0x3b3b38?, 0x3b3ad0?) + 204
ff0d3b24 _lwp_start (0x0?, 0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
------------ lwp# 62 / thread# 62 ---------------
ff0d8760 __pollsys (0xfd44bee4?, 0x1?, 0xfd44be68?, 0x0?, 0x0?, 0x0?) + 8
ff0132f0 poll (0xfd44bee4?, 0x1?, 0x3e8?, 0x40?, 0x0?, 0x0?) + 84
001a4510 cmdship_reader (0x455b68?, 0x5400?, 0x0?, 0x1?, 0xfd44bee4?, 0x5698?) + 64
ff0d3b24 _lwp_start (0x0?, 0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
------------ lwp# 308 / thread# 308 ---------------
ff0c17c8 write (0x24?, 0x3620e8?, 0x4241c0?, 0x0?, 0x24?, 0x320acc?) + 74
0025e278 get_logfile (0x24?, 0x0?, 0x2?, 0x20000?, 0x0?, 0xfed02a40?) + e8
001978bc vol_translog (0x4f6d18?, 0x3cbb5?, 0x12?, 0x420c00?, 0x420c00?, 0x24?) + 110
00198374 vol_cbr_dolog (0xffffff?, 0x0?, 0xfdc1ba78?, 0x1983b8?, 0x320d90?, 0x3cbb4?) + 314
00197fc8 vol_clntaddop (0xe?, 0x111?, 0x12c6738?, 0x12?, 0x1528a28?, 0x6be?) + d8
001a1e24 vol_cbr_translog (0x1528a28?, 0x400af78?, 0x12?, 0xfdc1bdf0?, 0xb5eb1c?, 0x20?) + 90f0
001737d8 vold_preprocess_request (0x400af78?, 0x1528a28?, 0x1110?, 0x1110?, 0x3a8800?, 0x3b1828?) + 170
00173a5c vold_dispatch_requests (0x0?, 0x1528a28?, 0x420c18?, 0x3b1828?, 0x3b3b38?, 0x3b3ad0?) + 204
ff0d3b24 _lwp_start (0x0?, 0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
------------ lwp# 309 / thread# 309 ---------------
ff0d8730 __pause (0x0?, 0x0?, 0x0?, 0xfffffff7?, 0x0?, 0x0?) + 8
00173e1c vold_timeout_handler (0x0?, 0x0?, 0x3b3aec?, 0x3b3800?, 0xfd9fbf10?, 0x173b50?) + 78
ff0d3b24 _lwp_start (0x0?, 0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
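In the sample above, one thread is in get_logfile while another is in vol_cbr_oplistfree. The following is a minimal sketch for checking whether a collected pstack.out contains both function names. The helper name `check_pstack` is hypothetical, and a match is only a hint that the stacks resemble this sample, not confirmation of the defect.

```shell
#!/bin/sh
# Hypothetical helper: report whether a pstack.out file contains both
# function names seen in the sample stacks (get_logfile and
# vol_cbr_oplistfree). Prints "match" or "no match".
check_pstack() {
    file=$1
    if grep 'get_logfile' "$file" >/dev/null &&
       grep 'vol_cbr_oplistfree' "$file" >/dev/null; then
        echo "match"
    else
        echo "no match"
    fi
}

# Usage: check_pstack pstack.out
```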
Sample VCS messages
/var/VRTSvcs/log/engine_A.log
2024/08/02 06:42:46 VCS ERROR V-16-2-13027 (server1) Resource(cvm_vxconfigd) - monitor procedure did not complete within the expected time.
2024/08/02 06:42:51 VCS ERROR V-16-2-13027 (server1) Resource(CRON) - monitor procedure did not complete within the expected time.
2024/08/02 06:43:30 VCS ERROR V-16-2-13027 (server1) Resource(DISC8DATA2_DG) - monitor procedure did not complete within the expected time.
2024/08/02 06:43:52 VCS ERROR V-16-2-13027 (server1) Resource(DISC8DATA_DG) - monitor procedure did not complete within the expected time.
2024/08/02 06:44:45 VCS ERROR V-16-2-13210 (server1) Agent is calling clean for resource(cvm_vxconfigd) because 2 successive invocations of the monitor procedure did not complete within the expected time.
2024/08/02 06:44:47 VCS INFO V-16-2-13068 (server1) Resource(cvm_vxconfigd) - clean completed successfully.
2024/08/02 06:44:47 VCS ERROR V-16-2-13074 (server1) The monitoring program for resource(cvm_vxconfigd) has consistently failed to determine the resource status within the expected time. Agent is restarting (attempt number 1 of 5) the resource.
2024/08/02 06:46:30 VCS ERROR V-16-2-13027 (server1) Resource(DISCREDO_DG) - monitor procedure did not complete within the expected time.
2024/08/02 06:46:30 VCS ERROR V-16-2-13027 (server1) Resource(CLAIMDATA_DG) - monitor procedure did not complete within the expected time.
2024/08/02 06:46:50 VCS ERROR V-16-2-13027 (server1) Resource(DISC8DATA2_DG) - monitor procedure did not complete within the expected time.
2024/08/02 06:47:20 VCS INFO V-16-2-13026 (server1) Resource(cvm_vxconfigd) - monitor procedure finished successfully after failing to complete within the expected time for (2) consecutive times.
2024/08/02 06:47:20 VCS INFO V-16-2-13026 (server1) Resource(DISC8DATA_DG) - monitor procedure finished successfully after failing to complete within the expected time for (2) consecutive times.
2024/08/02 06:47:20 VCS NOTICE V-16-2-13076 (server1) Agent has successfully restarted resource(cvm_vxconfigd).
2024/08/02 06:47:20 VCS INFO V-16-1-55031 Resource cvm_vxconfigd in online state received recurring online message on system dhpdiscd
2024/08/02 06:47:53 VCS INFO V-16-2-13026 (server1) Resource(CRON) - monitor procedure finished successfully after failing to complete within the expected time for (3) consecutive times.
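To see which resources are timing out most often, the V-16-2-13027 messages can be tallied per resource. This is a sketch only: the function name `count_monitor_timeouts` is hypothetical, and it assumes the log lines follow the "Resource(name)" layout shown in the sample above.

```shell
#!/bin/sh
# Hypothetical helper: count V-16-2-13027 (monitor timeout) messages per
# resource in a VCS engine log. Prints "<resource> <count>" lines.
count_monitor_timeouts() {
    grep 'V-16-2-13027' "$1" |
        sed 's/.*Resource(\([^)]*\)).*/\1/' |   # keep only the resource name
        LC_ALL=C sort | uniq -c |               # count identical names
        awk '{ print $2, $1 }'                  # reorder to "name count"
}

# Usage: count_monitor_timeouts /var/VRTSvcs/log/engine_A.log
```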
Workaround:
1.] Reduce the number of DMP paths to a maximum of 8 for each DMPNODE.
To display the number of paths per DMPNODE, type:
# vxdisk -px LIST_DMP -u g list
2.] Increase FaultOnMonitorTimeouts, the number of monitor timeouts before a fault is declared. Zero disables the feature.
Make the VCS configuration read-writable.
# haconf -makerw
# hatype -modify <resource-type> FaultOnMonitorTimeouts 2
3.] Increase the default MonitorTimeout value from 60 to 120 to give vxconfigd more tolerance if it is restarted:
# hatype -display CVMVxconfigd -attribute MonitorTimeout
#Type Attribute Value
CVMVxconfigd MonitorTimeout 60
# hatype -display DiskGroup -attribute MonitorTimeout
#Type Attribute Value
DiskGroup MonitorTimeout 60
Sample syntax
# hatype -modify <resource-type> MonitorTimeout 120
# hatype -modify CVMVxconfigd MonitorTimeout 120
# hatype -modify DiskGroup MonitorTimeout 120
4.] Verify the MonitorTimeout values are shown as 120 for CVMVxconfigd and DiskGroup resource types:
# hatype -display CVMVxconfigd -attribute MonitorTimeout
#Type Attribute Value
CVMVxconfigd MonitorTimeout 120
# hatype -display DiskGroup -attribute MonitorTimeout
#Type Attribute Value
DiskGroup MonitorTimeout 120
5.] Save and make the VCS configuration read-only.
# haconf -dump -makero
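The verification in step 4 can be scripted. The sketch below reads `hatype -display` output on stdin and checks the MonitorTimeout value; the function name `check_monitor_timeout` is hypothetical, and it assumes the three-column Type, Attribute, Value layout shown in the sample output above.

```shell
#!/bin/sh
# Hypothetical check: read `hatype -display <type> -attribute MonitorTimeout`
# output on stdin and confirm the value equals the expected one.
# Returns non-zero if the value differs or no MonitorTimeout line is found.
check_monitor_timeout() {
    expected=$1
    awk -v want="$expected" '
        $2 == "MonitorTimeout" {
            seen = 1
            if ($3 != want) {
                print $1 ": MonitorTimeout is " $3 ", expected " want
                bad = 1
            }
        }
        END { if (!seen || bad) exit 1 }
    '
}

# Usage:
#   hatype -display CVMVxconfigd -attribute MonitorTimeout | check_monitor_timeout 120
#   hatype -display DiskGroup -attribute MonitorTimeout | check_monitor_timeout 120
```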
Request the vm_sol11_sparc_8.0.2.1570 patch from Technical Support.
JIRA: STESC-8999 ETrack: 4155091