VxVM 8.0.2.1500 (Solaris) may result in unwanted vxconfigd restarts due to vol_cbr_oplistfree EO log permissions feature


Article ID: 100070446


Description


 

Solaris 11 SPARC environments running VRTSvxvm 8.0.2.1500 or higher may encounter unwanted vxconfigd restarts.
This can occur more frequently when the number of paths per DMPNODE is higher than 8.

To reduce the chances of vxconfigd restarting, limit the DMPNODE path count to a maximum of 8 paths.
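As a rough illustration of the 8-path limit, the sketch below flags any DMPNODE with more than a given number of paths. The function name flag_wide_dmpnodes and the two-column "DMPNODE PATH" input format are assumptions made for this example; they are not the output format of any VxVM command, so real command output would need to be reshaped first. On Solaris, a POSIX awk such as /usr/xpg4/bin/awk may be needed for the -v option.

```shell
# Hypothetical helper: reads "DMPNODE PATH" pairs (one pair per line) on stdin
# and prints each DMPNODE that has more than the given number of paths,
# followed by its path count. The input format is an assumption.
flag_wide_dmpnodes() {
    max_paths="${1:-8}"
    awk -v max="$max_paths" '
        NF >= 2 { count[$1]++ }               # one line per path
        END {
            for (node in count)
                if (count[node] > max)
                    printf "%s %d\n", node, count[node]
        }
    ' | LC_ALL=C sort
}
```

An empty result means that no DMPNODE in the input exceeds the limit.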


The issue is related to the code changes for incident 4155091, delivered as part of the 8.0.2u2 patch release.

Incident: 4155091  Add a tunable to control log file permissions and honour the tunable

Code changes done via incident 4155091 can cause race conditions between vxconfigd threads.
The incident impacted all platforms.

 

ONEOFF hotfix available


Veritas has released a VRTSvxvm 8.0.2.1570 ONEOFF hotfix for Solaris 11.4 Sparc environments. The issue is applicable to all platforms.

A supported hotfix has been made available for this issue. Please contact Technical Support to obtain this fix. This hotfix has not yet gone through any extensive QA testing. Consequently, if you are not adversely affected by this problem and have a satisfactory temporary workaround in place, we recommend that you wait for the public release of this hotfix.

The Product Engineering Team currently plans to address this issue by way of a patch or hotfix to the current version of the software. Please note that we as a company reserve the right to remove any fix from the targeted release if it does not pass quality assurance tests. Our plans are subject to change and any action taken by you based on the above information or your reliance upon the above information is made at your own risk.

Please contact your Sales representative or the Sales group for upgrade information including upgrade eligibility to the release containing the resolution for this issue.


Troubleshooting steps
 

Determine whether vxconfigd has generated a core file.

Run "vxgetcore" against the vxconfigd core file

Location:  /opt/VRTSspt/vxgetcore

Sample syntax:

# /opt/VRTSspt/vxgetcore/vxgetcore -c path-to-core###  -b /sbin/vxconfigd


Sample pstack.out file

# cat pstack.out
core '/core' of 20772:  /sbin/vxconfigd -k -x syslog
------------  lwp# 1 / thread# 1  ---------------
 ff0cc000 pthread_mutex_unlock (0xff12bb40?, 0x0?, 0xff000000?, 0xfed02a40?, 0xffffff?, 0x0?) + 1d0
 001983b8 vol_cbr_oplistfree (0x1528a28?, 0x12?, 0x3fe4c0?, 0x420d98?, 0x110e318?, 0x2a229f0?) + 34
 00197fd0 vol_clntaddop (0xe?, 0x111?, 0x12c6738?, 0x12?, 0x1528a28?, 0x6be?) + e0
 001a1e24 vol_cbr_translog (0x1528a28?, 0x400af78?, 0x12?, 0xfdc1bdf0?, 0xb5eb1c?, 0x20?) + 90f0
 001737d8 vold_preprocess_request (0x400af78?, 0x1528a28?, 0x1110?, 0x1110?, 0x3a8800?, 0x3b1828?) + 170
 00173a5c vold_dispatch_requests (0x0?, 0x1528a28?, 0x420c18?, 0x3b1828?, 0x3b3b38?, 0x3b3ad0?) + 204
 ff0d3b24 _lwp_start (0x0?, 0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
------------  lwp# 62 / thread# 62  ---------------
 ff0d8760 __pollsys (0xfd44bee4?, 0x1?, 0xfd44be68?, 0x0?, 0x0?, 0x0?) + 8
 ff0132f0 poll     (0xfd44bee4?, 0x1?, 0x3e8?, 0x40?, 0x0?, 0x0?) + 84
 001a4510 cmdship_reader (0x455b68?, 0x5400?, 0x0?, 0x1?, 0xfd44bee4?, 0x5698?) + 64
 ff0d3b24 _lwp_start (0x0?, 0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
------------  lwp# 308 / thread# 308  ---------------
 ff0c17c8 write    (0x24?, 0x3620e8?, 0x4241c0?, 0x0?, 0x24?, 0x320acc?) + 74
 0025e278 get_logfile (0x24?, 0x0?, 0x2?, 0x20000?, 0x0?, 0xfed02a40?) + e8
 001978bc vol_translog (0x4f6d18?, 0x3cbb5?, 0x12?, 0x420c00?, 0x420c00?, 0x24?) + 110
 00198374 vol_cbr_dolog (0xffffff?, 0x0?, 0xfdc1ba78?, 0x1983b8?, 0x320d90?, 0x3cbb4?) + 314
 00197fc8 vol_clntaddop (0xe?, 0x111?, 0x12c6738?, 0x12?, 0x1528a28?, 0x6be?) + d8
 001a1e24 vol_cbr_translog (0x1528a28?, 0x400af78?, 0x12?, 0xfdc1bdf0?, 0xb5eb1c?, 0x20?) + 90f0
 001737d8 vold_preprocess_request (0x400af78?, 0x1528a28?, 0x1110?, 0x1110?, 0x3a8800?, 0x3b1828?) + 170
 00173a5c vold_dispatch_requests (0x0?, 0x1528a28?, 0x420c18?, 0x3b1828?, 0x3b3b38?, 0x3b3ad0?) + 204
 ff0d3b24 _lwp_start (0x0?, 0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
------------  lwp# 309 / thread# 309  ---------------
 ff0d8730 __pause  (0x0?, 0x0?, 0x0?, 0xfffffff7?, 0x0?, 0x0?) + 8
 00173e1c vold_timeout_handler (0x0?, 0x0?, 0x3b3aec?, 0x3b3800?, 0xfd9fbf10?, 0x173b50?) + 78
 ff0d3b24 _lwp_start (0x0?, 0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
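To check a collected pstack.out for the same pattern as the sample above, a small filter along these lines may help. It is only a sketch: the function name has_oplistfree_signature is made up for this example, and it assumes the file uses the same layout as the sample, with an "lwp#" separator line before each thread and one frame per line.

```shell
# Hypothetical helper: exits 0 when some thread in the pstack output shows
# pthread_mutex_unlock called directly from vol_cbr_oplistfree, 1 otherwise.
has_oplistfree_signature() {
    awk '
        /lwp#/ { prev = ""; next }            # new thread: reset frame tracking
        {
            if (prev ~ /pthread_mutex_unlock/ && $0 ~ /vol_cbr_oplistfree/)
                found = 1
            prev = $0                         # remember the previous frame
        }
        END { exit(found ? 0 : 1) }
    ' "$1"
}

# Example: has_oplistfree_signature pstack.out && echo "signature present"
```

A match only indicates that the stack resembles the one in this article; it does not confirm the root cause on its own.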




Sample VCS messages

/var/VRTSvcs/log/engine_A.log
 

2024/08/02 06:42:46 VCS ERROR V-16-2-13027 (server1) Resource(cvm_vxconfigd) - monitor procedure did not complete within the expected time.
2024/08/02 06:42:51 VCS ERROR V-16-2-13027 (server1) Resource(CRON) - monitor procedure did not complete within the expected time.
2024/08/02 06:43:30 VCS ERROR V-16-2-13027 (server1) Resource(DISC8DATA2_DG) - monitor procedure did not complete within the expected time.
2024/08/02 06:43:52 VCS ERROR V-16-2-13027 (server1) Resource(DISC8DATA_DG) - monitor procedure did not complete within the expected time.
2024/08/02 06:44:45 VCS ERROR V-16-2-13210 (server1) Agent is calling clean for resource(cvm_vxconfigd) because 2 successive invocations of the monitor procedure did not complete within the expected time.
2024/08/02 06:44:47 VCS INFO V-16-2-13068 (server1) Resource(cvm_vxconfigd) - clean completed successfully.
2024/08/02 06:44:47 VCS ERROR V-16-2-13074 (server1) The monitoring program for resource(cvm_vxconfigd) has consistently failed to determine the resource status within the expected time. Agent is restarting (attempt number 1 of 5) the resource.
2024/08/02 06:46:30 VCS ERROR V-16-2-13027 (server1) Resource(DISCREDO_DG) - monitor procedure did not complete within the expected time.
2024/08/02 06:46:30 VCS ERROR V-16-2-13027 (server1) Resource(CLAIMDATA_DG) - monitor procedure did not complete within the expected time.
2024/08/02 06:46:50 VCS ERROR V-16-2-13027 (server1) Resource(DISC8DATA2_DG) - monitor procedure did not complete within the expected time.
2024/08/02 06:47:20 VCS INFO V-16-2-13026 (server1) Resource(cvm_vxconfigd) - monitor procedure finished successfully after failing to complete within the expected time for (2) consecutive times.
2024/08/02 06:47:20 VCS INFO V-16-2-13026 (server1) Resource(DISC8DATA_DG) - monitor procedure finished successfully after failing to complete within the expected time for (2) consecutive times.
2024/08/02 06:47:20 VCS NOTICE V-16-2-13076 (server1) Agent has successfully restarted resource(cvm_vxconfigd).
2024/08/02 06:47:20 VCS INFO V-16-1-55031 Resource cvm_vxconfigd in online state received recurring online message on system dhpdiscd
2024/08/02 06:47:53 VCS INFO V-16-2-13026 (server1) Resource(CRON) - monitor procedure finished successfully after failing to complete within the expected time for (3) consecutive times.
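To see which resources are hitting monitor timeouts, the V-16-2-13027 messages can be counted per resource. The sketch below assumes the log lines follow the format of the sample above; the function name count_monitor_timeouts is invented for this example.

```shell
# Hypothetical helper: counts V-16-2-13027 (monitor timeout) messages per
# resource in a VCS engine log and prints "resource count" lines.
count_monitor_timeouts() {
    awk '
        /V-16-2-13027/ && match($0, /Resource\([^)]*\)/) {
            # Strip the leading "Resource(" and the trailing ")".
            name = substr($0, RSTART + 9, RLENGTH - 10)
            count[name]++
        }
        END { for (name in count) printf "%s %d\n", name, count[name] }
    ' "$1" | LC_ALL=C sort
}

# Example: count_monitor_timeouts /var/VRTSvcs/log/engine_A.log
```

Timeouts for the cvm_vxconfigd resource together with several disk group resources in the same time window would be consistent with the sample messages above.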



Workaround:


1.] Reduce the number of DMP paths to a maximum of 8 for each DMPNODE.

To display the number of paths per DMPNODE, type:

# vxdisk -px LIST_DMP -u g list
 

2.] Increase FaultOnMonitorTimeouts, the number of consecutive monitor timeouts before a fault is declared. Zero disables the feature.

Make the VCS configuration read-writable.

# haconf -makerw

# hatype -modify <resource-type> FaultOnMonitorTimeouts 2

 

3.] Increase the default MonitorTimeout value from 60 to 120 seconds to give vxconfigd more tolerance if it is restarted. Display the current values first:
 

# hatype -display CVMVxconfigd  -attribute MonitorTimeout
#Type        Attribute              Value
CVMVxconfigd MonitorTimeout         60

 

# hatype -display DiskGroup  -attribute MonitorTimeout
#Type        Attribute              Value
DiskGroup    MonitorTimeout         60


Sample syntax:

# hatype -modify <resource-type> MonitorTimeout 120


# hatype -modify CVMVxconfigd MonitorTimeout 120
# hatype -modify DiskGroup MonitorTimeout 120


4.] Verify that the MonitorTimeout values are shown as 120 for the CVMVxconfigd and DiskGroup resource types:

# hatype -display CVMVxconfigd  -attribute MonitorTimeout
#Type        Attribute              Value
CVMVxconfigd MonitorTimeout         120

 

# hatype -display DiskGroup  -attribute MonitorTimeout
#Type        Attribute              Value
DiskGroup    MonitorTimeout         120



5.] Save and make the VCS configuration read-only.

# haconf -dump -makero
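The MonitorTimeout part of the workaround (steps 3 to 5) could be wrapped in a small script such as the sketch below. The function names run_or_print and apply_vxconfigd_workaround and the DRY_RUN variable are assumptions made for this example; only the haconf and hatype commands come from the steps above. Step 2 is left out because the resource type for FaultOnMonitorTimeouts depends on the environment. By default the script only prints the commands.

```shell
# Hypothetical wrapper: prints each command when DRY_RUN is 1 (the default)
# and executes it when DRY_RUN is 0.
run_or_print() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Sets MonitorTimeout for the CVMVxconfigd and DiskGroup resource types,
# then saves the configuration and makes it read-only again.
apply_vxconfigd_workaround() {
    timeout="${1:-120}"
    run_or_print haconf -makerw || return 1
    for type in CVMVxconfigd DiskGroup; do
        run_or_print hatype -modify "$type" MonitorTimeout "$timeout" || return 1
    done
    run_or_print haconf -dump -makero
}

# Example (print only):  apply_vxconfigd_workaround 120
# Example (execute):     DRY_RUN=0 apply_vxconfigd_workaround 120
```

If a command fails part-way through, the configuration may be left read-writable, so it is worth confirming its state and running "haconf -dump -makero" manually if needed.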
 

 

Request the vm_sol11_sparc_8.0.2.1570 patch from Technical Support.

 


Additional Information

JIRA: STESC-8999
ETrack: 4155091