Veritas NetBackup Emergency Engineering Binary (EEB) NBAPP_EEB_ET4057043-4.0.0.1-1.x86_64 prevents false DCPA events & dynamically handles SAN zoning events with VxVM (vm-rhel7_x86_64-HotFix-7.4.2.2205)


Article ID: 100052420



Description

Error Message


Sample error

vxvm:vxconfigd: V-5-1-12223 Error in claiming /dev/sdclh: Too many open files
 

Data Corruption Protection Activated (DCPA) messages
 

V-5-1-14523 LUN serial number of the OS device path with device number 8/32 has changed from 600605B00B7B222026ADF17F59D0ECF8 (sdc) to 00f8ecd0597ff1ad2620227b0bb00506 (sdc)
V-5-1-14522 Attempt to isolate DMP node 8/32 failed, error retuned is Device or resource busy: Device or resource busy
V-5-1-8769 ddl_find_devices_in_system: ddl_reconfigure_all failed: Device or resource busy: Device or resource busy
V-5-1-16011 Data Corruption Protection Activated - User Corrective Action Needed:
To recover, first ensure that the OS device tree is up to date (requires OS specific commands).
Then, execute 'vxdisk rm' on the following devices before reinitiating device discovery using 'vxdisk scandisks'
V-5-1-0 fred_disk_6
V-5-1-13790 No device configuration changes have been applied to DMP kernel database.
V-5-1-13791 Please consult the documentation for correct procedure to replace disk/path.
 VxVM vxdmp V-5-0-0 [Error] i/o error occurred (errno=0x5) on dmpnode 201/0x20

 

See related article:
False DCPA (Data Corruption Protection Activated) events for JBOD or Internal disks can occur when the original LUN Serial Number cannot be retrieved from the original Page Code offset
https://www.veritas.com/support/en_US/article.100049898
 

The DCPA events above are related to file open failures: vxconfigd holds too many open files and does not release them correctly, resulting in file descriptor (FD) leaks in vxconfigd. As the devices are internal to the NetBackup Appliance, such DCPA messages are unexpected, and they subsequently result in unintended DMPNODE failures.

Further investigation shows the LUN Serial Number (LSN) changing, which triggers the DCPA events. The impacted DMPNODEs are then disabled to protect data integrity.

The LSN changes were caused by SCSI inquiries failing against page code 0x80 while inquiries against page code 0x83 succeeded. As each page code location returns a different LSN, a different LSN is reported for the same disk.
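
For triage, the device number and both serial numbers can be extracted from a V-5-1-14523 message. The following is a minimal Python sketch that assumes the message layout matches the sample shown earlier in this article; the name parse_lsn_change is illustrative and is not part of any Veritas tool.

```python
import re

# Pattern modelled on the sample V-5-1-14523 message in this article (assumed layout).
LSN_CHANGE = re.compile(
    r"V-5-1-14523 LUN serial number of the OS device path with device number "
    r"(?P<devno>\d+/\d+) has changed from (?P<old>[0-9A-Fa-f]+) \((?P<old_dev>\w+)\) "
    r"to (?P<new>[0-9A-Fa-f]+) \((?P<new_dev>\w+)\)"
)

def parse_lsn_change(line):
    """Return a dict with devno, old, old_dev, new, new_dev, or None if no match."""
    match = LSN_CHANGE.search(line)
    return match.groupdict() if match else None

sample = (
    "V-5-1-14523 LUN serial number of the OS device path with device number 8/32 "
    "has changed from 600605B00B7B222026ADF17F59D0ECF8 (sdc) "
    "to 00f8ecd0597ff1ad2620227b0bb00506 (sdc)"
)
info = parse_lsn_change(sample)
# The OS device name is the same on both sides; only the reported serial differs.
same_device = info["old_dev"] == info["new_dev"]
serial_changed = info["old"].lower() != info["new"].lower()
print(info["devno"], same_device, serial_changed)
```

A result where the device name is unchanged but the serial differs is the pattern this article associates with a false DCPA event on an internal disk.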

We believe the SCSI inquiries against page code 0x80 initially failed because of the FD leak in vxconfigd, which ultimately triggered the false DCPA events.
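
One way to watch for the descriptor leak described above is to compare the number of open file descriptors held by vxconfigd with its soft limit. The sketch below assumes the standard Linux /proc layout; the function names are illustrative, and the caller is expected to supply the PID (for example from pidof vxconfigd).

```python
import os

def open_fd_count(pid, proc_root="/proc"):
    """Count entries under <proc_root>/<pid>/fd (one entry per open descriptor)."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))

def soft_nofile_limit(pid, proc_root="/proc"):
    """Read the soft 'Max open files' limit from <proc_root>/<pid>/limits, or None."""
    with open(os.path.join(proc_root, str(pid), "limits")) as handle:
        for line in handle:
            if line.startswith("Max open files"):
                # Columns: Max open files <soft> <hard> files
                value = line.split()[3]
                return None if value == "unlimited" else int(value)
    return None

def fd_usage_ratio(pid, proc_root="/proc"):
    """Fraction of the soft limit in use, or None when no finite limit is found."""
    limit = soft_nofile_limit(pid, proc_root)
    if not limit:
        return None
    return open_fd_count(pid, proc_root) / limit
```

A ratio that climbs steadily toward 1.0 over time would be consistent with the leak; the "Too many open files" error in the sample above is what appears once the limit is reached.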

 

Cause


When disks attached to a system are in an "online invalid" state, the VxVM vxconfigd daemon reads these devices. The read is necessary to confirm whether the underlying devices have a VxVM private region present on them, and it cannot be avoided from a VxVM perspective.

Therefore, to prevent such unwanted read requests, the "online invalid" disks can be offlined.
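
To see which disks would be candidates for offlining, the STATUS column of "vxdisk list" output can be filtered for "online invalid". The sketch below is a hedged illustration: the sample output and device names are hypothetical, the five-column layout (DEVICE, TYPE, DISK, GROUP, STATUS) is assumed, and the script only prints the "vxdisk offline" commands rather than running them.

```python
def online_invalid_devices(vxdisk_list_output):
    """Return device names whose STATUS column reads 'online invalid'."""
    devices = []
    for line in vxdisk_list_output.splitlines():
        fields = line.split()
        # Assumed columns: DEVICE TYPE DISK GROUP STATUS (status may span two words).
        if len(fields) >= 6 and fields[0] != "DEVICE" and fields[4:6] == ["online", "invalid"]:
            devices.append(fields[0])
    return devices

# Hypothetical sample output; the device names are made up for illustration.
sample = """\
DEVICE       TYPE            DISK         GROUP        STATUS
fred_disk_0  auto:cdsdisk    disk_0       nbuapp       online
fred_disk_6  auto:none       -            -            online invalid
fred_disk_7  auto:none       -            -            online invalid
"""

for device in online_invalid_devices(sample):
    # Printed for review only; nothing is executed against VxVM.
    print("vxdisk offline", device)
```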

Veritas Volume Manager (VxVM) disks listed in an online invalid can cause intermittent hangs of vxconfigd and general slowness
https://www.veritas.com/support/en_US/article.100050354

 

Resolution

To better manage SAN events, Veritas engineering has released a VxVM patch via a Veritas NetBackup Emergency Engineering Binary (EEB) to dynamically handle the removal and addition of Backup images (LUNs) following SAN zoning events.

The Veritas InfoScale engineering team has created the private hot-fix vm-rhel7_x86_64-HotFix-7.4.2.2205 and the Array Support Library (ASL) update VRTSaslapm_Linux_7.4.2.2601. Both are rolled into the EEB NBAPP_EEB_ET4057043-4.0.0.1-1.x86_64 to automate the handling of SAN zoned devices.

In addition, SAN attached disks that appear in an "online invalid" state will be offlined automatically by the vxattachd daemon (/opt/VRTS/bin/vxattachd).

When the EEB NBAPP_EEB_ET4057043-4.0.0.1-1.x86_64 is installed, the EEB installer also creates the default template file containing the vxattachd tunables:

# more /etc/default/vxattachd
handle_invalid_disk=on
skip_offline=on
remove_disable_dmpnode=on
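
After installation, the three tunables listed above can be checked programmatically. The following Python sketch parses key=value text such as the content of /etc/default/vxattachd and reports any tunable that is not set to "on". The helper names are illustrative, and the handling of blank lines and "#" comments is an assumption about the file format rather than documented behaviour.

```python
# Values shown in this article for the template file created by the EEB installer.
EXPECTED = {
    "handle_invalid_disk": "on",
    "skip_offline": "on",
    "remove_disable_dmpnode": "on",
}

def parse_tunables(text):
    """Parse key=value lines, skipping blanks and '#' comments (assumed syntax)."""
    tunables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        tunables[key.strip()] = value.strip()
    return tunables

def mismatches(text, expected=EXPECTED):
    """Return {name: (found, wanted)} for tunables that differ from the expected values."""
    found = parse_tunables(text)
    return {k: (found.get(k), v) for k, v in expected.items() if found.get(k) != v}

sample = "handle_invalid_disk=on\nskip_offline=on\nremove_disable_dmpnode=on\n"
print(mismatches(sample))  # an empty dict means all three tunables are on
```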
     
To reduce the number of "vxdisk scandisks" commands executed by the as-collector (autosupport), Veritas also recommends installing EEB patch NBAPP_EEB_ET4046344-4.0.0.0-1.x86_64.

Over the years we have found the DMP tunable "dmp_monitor_fabric" can cause performance issues across various platforms.

The "dmp_monitor_fabric" tunable is designed to detect errors proactively by catching SAN events. With it off, DMP detects errors reactively. Either way, DMP can handle the errors coming from the lower layers; the only difference is whether the approach is proactive or reactive.

The tunable is turned off by default on AIX.
 

Ensure the EEB patch has also disabled the "dmp_monitor_fabric" tunable:
https://www.veritas.com/support/en_US/article.100051027
 

fred:/home/maintenance # vxdmpadm gettune dmp_monitor_fabric
            Tunable               Current Value  Default Value
------------------------------    -------------  -------------
dmp_monitor_fabric                      off               on
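
The same check can be scripted by reading the Current Value column from the "vxdmpadm gettune" output shown above. This is a minimal sketch that assumes the three-column layout of that output; the function name tunable_values is illustrative.

```python
def tunable_values(gettune_output, name):
    """Return (current, default) for a tunable from 'vxdmpadm gettune' output, or None."""
    for line in gettune_output.splitlines():
        fields = line.split()
        # Data rows have exactly three columns: name, current value, default value.
        if len(fields) == 3 and fields[0] == name:
            return fields[1], fields[2]
    return None

sample = """\
            Tunable               Current Value  Default Value
------------------------------    -------------  -------------
dmp_monitor_fabric                      off               on
"""

current, default = tunable_values(sample, "dmp_monitor_fabric")
if current != "off":
    print("dmp_monitor_fabric is still", current)
```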


The following article outlines how to disable (unbind) and enable (bind) HBA access on Red Hat:

How to temporarily (unbind) disable HBA (Fibre Channel) ports (controllers) on RedHat for testing DMP & vxattachd interoperability

https://www.veritas.com/support/en_US/article.100050716

 

EEB Overview
 

NBAPP_4.0        - 4046344

Problem Description

The EEB updates the interval time for the following collector plugins: MegaRAIDCollectorPlugin.yaml, msdp.py, partition.yaml, partition.py, Sas3ircu.yaml and sas3ircu.py.

NOTES:
After installing the EEB, check the interval time for each collector plugin in /log/autosupport/collector.log.

NB_Appliance Installed Files
/opt/NBUAppliance/scripts/post_uninstall-4046344.pl
/opt/NBUAppliance/scripts/install-4046344.pl

Installation Requires: All NetBackup services can remain running.
 

NBAPP_4.0.0.1    - 4057043

Problem Description

EEB will bring down all NetBackup Appliance Services. EEB will update the VRTSvxvm and VRTSaslapm RPMs. EEB will set tunable dmp_monitor_fabric to off.
EEB will reboot the NetBackup Appliance after installation or rollback completes.

EEB will do the following:
- Bring down all NetBackup and Appliance Services
- Update the InfoScale RPMs VRTSvxvm and VRTSaslapm.
- Set VxVM tunable dmp_monitor_fabric to off.
- Update /etc/default/vxattachd
- Reboot NetBackup Appliance Server after installation or RollBack Completes.

NB_Appliance Installed Files
/opt/NBUAppliance/scripts/install-4057043.pl
/opt/NBUAppliance/scripts/post_uninstall-4057043.pl
/opt/NBUAppliance/scripts/pre_proc_uninstall_4057043.pl
/opt/NBUAppliance/scripts/preprocess_install_4057043.pl
/opt/NBUAppliance/scripts/rpms_4057043.tar

Installation Requires: Shutdown and restart all NetBackup services.


 

EEB Install Steps 

 

Log in to the NetBackup Appliance:

 

fred.Main_Menu> Manage

Entering appliance management view...

fred.Manage> Software

• Check that the EEBs have been downloaded:

fred.Software> List Downloaded

 

EEB 1 INSTALL

fred.Software> Install NBAPP_EEB_ET4057043-4.0.0.1-1.x86_64.rpm

- [Info] The file: /opt/NBUAppliance/scripts/rpms_4057043.tar has been extracted successfully.
- [Info] Installing RPM VRTSvxvm-7.4.2.2205-RHEL7.x86_64.rpm.
- [Info] Successfully installed VRTSvxvm-7.4.2.2205-RHEL7.x86_64.rpm
- [Info] Installing RPM VRTSaslapm-7.4.2.2601-RHEL7.x86_64.rpm.
- [Info] Successfully installed VRTSaslapm-7.4.2.2601-RHEL7.x86_64.rpm
- [Info] Successfully imported nbuapp
- [Info] Successfully set dmp_monitor_fabric to off
- [Info] Installation done, rebooting the setup.

Shutdown scheduled for Sat 2022-02-05 21:28:34 PST, use 'shutdown -c' to cancel.

 

EEB 2 INSTALL

fred.Software> Install NBAPP_EEB_ET4046344-4.0.0.0-1.x86_64.rpm

Extracting post-process script -- /opt/NBUAppliance/scripts/install-4046344.pl

Changing interval time for MegaRAIDCollectorPlugin.yaml to 1443...
- [Info] Successfully changed interval value for MegaRAIDCollectorPlugin.yaml.
Changing interval time for msdp.py to "60*17"...
- [Info] Successfully changed interval value for msdp.py.
Changing interval time for partition.py to "13 * 60"...
- [Info] Successfully changed interval value for partition.py.
Changing telemtry interval time for partition.py to "21 * 60"...
- [Info] Successfully changed telemetry interval value for partition.py.
Changing interval time for partition.yaml to 913...
- [Info] Successfully changed interval value for partition.yaml.
Changing interval time for sas3ircu.py to "60*19"...
- [Info] Successfully changed interval value for sas3ircu.py.
Changing interval time for Sas3ircu.yaml to 1023...
- [Info] Successfully changed interval value for Sas3ircu.yaml.
- [Info] Successfully restarted the Collector service.

post-process complete.

- [Info] Installer finished execution.
- [Info] Install script exited successfully.
- [Warning] No recipients are configured to receive software notifications. Use 'Settings->Alerts->Email Software Add' command to configure the appropriate Email address.
- [Info] Successfully installed the EEB NBAPP_EEB_ET4046344-4.0.0.0-1.x86_64.rpm.
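
The interval values reported by the installer are a mix of plain numbers and quoted products such as "60*17". The sketch below, under the assumption that the log lines keep the wording shown above, extracts each plugin and evaluates the product so the result can be compared with collector.log. The article does not state the unit of these values, so none is assumed here; the function name is illustrative.

```python
import re

# Matches the installer lines above, including the log's own "telemtry" spelling.
CHANGE = re.compile(
    r'Changing (?P<kind>\w+ )?interval time for (?P<plugin>\S+) to "?(?P<expr>[\d* ]+)"?\.\.\.'
)

def interval_changes(log_text):
    """Return {(plugin, is_telemetry): value}, multiplying out 'a * b' expressions."""
    changes = {}
    for match in CHANGE.finditer(log_text):
        value = 1
        for factor in match.group("expr").split("*"):
            value *= int(factor)  # int() tolerates surrounding spaces
        changes[(match.group("plugin"), match.group("kind") is not None)] = value
    return changes

sample = '''\
Changing interval time for MegaRAIDCollectorPlugin.yaml to 1443...
Changing interval time for msdp.py to "60*17"...
Changing interval time for partition.py to "13 * 60"...
Changing telemtry interval time for partition.py to "21 * 60"...
'''

for (plugin, is_telemetry), value in interval_changes(sample).items():
    print(plugin, "telemetry" if is_telemetry else "interval", value)
```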
 

fred.Software> List EEBs
List of installed EEBs:
NBAPP_EEB_ET4046344-4.0.0.0-1.x86_64
NBAPP_EEB_ET4057043-4.0.0.1-1.x86_64
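
The "List EEBs" output can be checked for both EEBs recommended in this article. The following sketch splits an EEB name into its parts and reports any required ET number that is missing. The field labels (et, version, build, arch) and the function names are illustrative assumptions based on the names shown here, not an official naming specification.

```python
import re

# Name layout inferred from the two EEB names in this article (assumption).
EEB_NAME = re.compile(
    r"NBAPP_EEB_ET(?P<et>\d+)-(?P<version>\d+(?:\.\d+)*)-(?P<build>\d+)"
    r"\.(?P<arch>\w+?)(?:\.rpm)?"
)

def parse_eeb(name):
    """Split an EEB name into et, version, build and arch, or return None."""
    match = EEB_NAME.fullmatch(name.strip())
    return match.groupdict() if match else None

def missing_eebs(list_output, required=("4057043", "4046344")):
    """Return the required ET numbers that do not appear in 'List EEBs' output."""
    installed = set()
    for line in list_output.splitlines():
        eeb = parse_eeb(line)
        if eeb:
            installed.add(eeb["et"])
    return sorted(set(required) - installed)

sample = """\
List of installed EEBs:
NBAPP_EEB_ET4046344-4.0.0.0-1.x86_64
NBAPP_EEB_ET4057043-4.0.0.1-1.x86_64
"""
print(missing_eebs(sample))  # an empty list means both EEBs are installed
```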

 

After the reboot, all file systems mount:

/dev/vx/dsk/nbuapp/advol      5368709120   493432  5326276512   1% /advanceddisk/dp1/advol
/dev/vx/dsk/nbuapp/pdvol     47365624680  2613992 47094576256   1% /msdp/data/dp1/pdvol
/dev/vx/dsk/nbuapp/catvol     2147483648  2529528  2128196944   1% /cat
/dev/vx/dsk/nbuapp/cfgvol      104857600   418072   103624088   1% /config
/dev/vx/dsk/nbuapp/pdcatvol  10737418240   662032 10652875752   1% /msdp/cat
/dev/vx/dsk/nbuapp/1pdvol    27796303000  1580264 27577577456   1% /msdp/data/dp1/1pdvol
/dev/vx/dsk/nbuapp/nbrepovol   104857600    74120   103964872   1% /nbrepo
tmpfs                           26352908        0    26352908   0% /run/user/0
tmpfs                           26352908        0    26352908   0% /run/user/1007
tmpfs                           26352908        0    26352908   0% /run/user/888
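
To confirm the mounts without reading the listing by eye, the VxVM volumes can be mapped to their mount points. This sketch assumes df-style output with six columns, as in the listing above, and the disk group name nbuapp; the function name vxvm_mounts is illustrative.

```python
def vxvm_mounts(df_output, diskgroup="nbuapp"):
    """Map each VxVM volume in the given disk group to its mount point."""
    prefix = "/dev/vx/dsk/%s/" % diskgroup
    mounts = {}
    for line in df_output.splitlines():
        fields = line.split()
        # Assumed columns: device, size, used, available, use%, mount point.
        if len(fields) >= 6 and fields[0].startswith(prefix):
            mounts[fields[0][len(prefix):]] = fields[5]
    return mounts

sample = """\
/dev/vx/dsk/nbuapp/catvol     2147483648  2529528  2128196944   1% /cat
/dev/vx/dsk/nbuapp/cfgvol      104857600   418072   103624088   1% /config
tmpfs                           26352908        0    26352908   0% /run/user/0
"""
mounted = vxvm_mounts(sample)
for volume in ("catvol", "cfgvol"):
    print(volume, mounted.get(volume, "NOT MOUNTED"))
```

The list of volumes to check is site specific; the two used here are taken from the listing above.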

 

RPMs are upgraded.

fred:/home/maintenance # rpm -qa | egrep "VRTSvxvm|VRTSaslapm"
VRTSaslapm-7.4.2.2601-RHEL7.x86_64
VRTSvxvm-7.4.2.2205-RHEL7.x86_64
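
The package check can also be scripted against the output of the rpm query above. The sketch below compares the installed versions with the versions named in this article; the helper names are illustrative, and the name parsing assumes package names without hyphens, which holds for these two packages.

```python
# Versions delivered by EEB NBAPP_EEB_ET4057043, as listed in this article.
EXPECTED_RPMS = {"VRTSvxvm": "7.4.2.2205", "VRTSaslapm": "7.4.2.2601"}

def installed_versions(rpm_output):
    """Map package name to version from lines like 'VRTSvxvm-7.4.2.2205-RHEL7.x86_64'."""
    versions = {}
    for entry in rpm_output.split():
        name, _, rest = entry.partition("-")
        versions[name] = rest.split("-")[0]
    return versions

def outdated(rpm_output, expected=EXPECTED_RPMS):
    """Return {name: (found, wanted)} for packages not at the expected version."""
    found = installed_versions(rpm_output)
    return {n: (found.get(n), v) for n, v in expected.items() if found.get(n) != v}

sample = "VRTSaslapm-7.4.2.2601-RHEL7.x86_64\nVRTSvxvm-7.4.2.2205-RHEL7.x86_64\n"
print(outdated(sample))  # an empty dict means both packages are at the expected version
```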

Issue/Introduction


Veritas Volume Manager (VxVM) needs to be informed when LUNs are being removed and added to a system.
When a path or DMPNODE fails, VxVM continues to look out for the impacted paths/DMPNODEs, not knowing whether the LUN was intentionally removed.

Storage vendors do NOT send a clear signal upstream to the operating system (OS) layers to indicate that a change has been made at the storage layer. Therefore, in most cases the OS is unable to dynamically determine whether disks have been added to or removed from the host.

When SAN zoning changes are performed, the OS layers receive events from the lower layers informing the system that access to an OS device handle has been lost or added. The Red Hat UDEV framework co-exists with Veritas Dynamic Multi-pathing (DMP) in many positive ways, enabling Veritas to react with built-in intelligence and reduce many labour-intensive tasks.

VxVM with Veritas Dynamic Multi-pathing (vxdmp) can react to these events and take dynamic action to automate the removal or addition of controllers, paths and enclosures without the need for human intervention. If the correct steps are not performed, a series of issues can result, impacting the running and operational state of the VxVM vxconfigd daemon.