The issue applies only to VxVM layered volumes that have a DCO (Data Change Object) attached.
If DCOs are not present on the layered volumes, the corruption (missed writes) does not occur.
When DCOs are not added to mirrored volumes, a full resynchronization of the data between plexes is required when a detached plex is reattached.
In this case, we found that the sub-volume's start offset within the parent volume's virtual address space was not aligned to the FMR (FastResync) region size.
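The alignment condition can be illustrated with simple modulo arithmetic. The sketch below is not a Veritas tool: `check_region_alignment` is a hypothetical helper, the offsets are the sub-volume (sv) start offsets from the vxprint output in the reproduction steps, and the region size of 128 sectors (64 KB with 512-byte sectors) is an assumption for illustration only. Use the region size actually configured on the DCO.

```shell
# Report whether a sub-volume start offset falls on an FMR region boundary.
# Both arguments are expressed in sectors.
# ASSUMPTION: the 128-sector region size used below is illustrative only.
check_region_alignment() {
    offset=$1
    region=$2
    remainder=$(( offset % region ))
    if [ "$remainder" -eq 0 ]; then
        echo "offset $offset: aligned"
    else
        echo "offset $offset: misaligned by $remainder sectors"
    fi
}

# Sub-volume start offsets copied from the sv records of vol01
for off in 0 1048469696 2096939392; do
    check_region_alignment "$off" 128
done
```

Under these assumed values, the second sub-volume starts part-way through a region, which is the kind of misalignment described above.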
To isolate the inconsistent plex content, change the plex read policy to read from a specific set of plexes, i.e. all the plexes associated with a given enclosure (site):
1. Stop the application
2. Set the read policy for each sub-layer volume to reference the plexes for a single enclosure
# vxvol -g
3. Start the application
If the application reports errors, switch the read preference to the plexes for the other enclosure:
4. Stop the application
5. Set the read policy for each sub-layer volume to reference the plexes for the other enclosure
# vxvol -g
6. Start the application
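As a rough aid for the steps above, the following sketch prints (but does not run) one `vxvol ... rdpol prefer` command per sub-layer volume, so the list can be reviewed before it is executed. `print_rdpol_commands` is a hypothetical helper, and the disk group, volume and plex names are taken from the reproduction example later in this article; substitute the names for your own configuration.

```shell
# Print one "vxvol rdpol prefer" command per sub-layer volume (dry run).
# Arguments: disk group name, followed by volume:plex pairs.
print_rdpol_commands() {
    dg=$1
    shift
    for pair in "$@"; do
        vol=${pair%%:*}     # text before the colon: sub-layer volume
        plex=${pair#*:}     # text after the colon: preferred plex
        echo "vxvol -g $dg rdpol prefer $vol $plex"
    done
}

# Example mapping (assumed): the first plex of each sub-layer volume
print_rdpol_commands testdg \
    vol01-L01:vol01-P01 vol01-L02:vol01-P03 vol01-L03:vol01-P05
```

To switch to the other enclosure, pass the second plex of each sub-layer volume instead.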
Veritas engineering has released the private hot-fix below. Contact support to obtain the fix.
The vm-rhel7_x86_64-HotFix-7.3.1.2703 hot-fix includes multiple incidents:
Patch ID: 7.3.1.2703
3991737 (3976392) Memory corruption might happen in VxVM (Veritas Volume Manager) while processing Plex detach request.
3991996 (3950335) Support for throttling of Administrative IO for layered volumes
3992054 (3992053) Data corruption may happen with layered volumes due to some data not re-synced while attaching a plex.
3992302 (3991580) Deadlock may happen if IO performed on both source and snapshot volumes.
NOTE: The layered volumes issue impacts all VxVM versions and platforms.
Reproduction Steps
1. Create a layered volume, e.g. layout=concat-mirror
# vxassist -bg testdg make vol01 1t layout=concat-mirror
NOTE: It can take some time to create the volume, depending on the volume size specified.
2. Add a DCO log to the volume
# vxsnap -g testdg prepare vol01
The vxprint output will look similar to the below:
# vxprint -qhtg testdg
dg testdg default default 23000 1579091402.34.gpk630r4c-08
dm 3pardata0_129 3pardata0_129 auto 65536 1048469696 -
dm 3pardata0_130 3pardata0_130 auto 65536 1048469696 -
dm 3pardata0_131 3pardata0_131 auto 65536 1048469696 -
dm 3pardata0_132 3pardata0_132 auto 65536 1048469696 -
dm 3pardata0_133 3pardata0_133 auto 65536 1048469696 -
dm 3pardata0_134 3pardata0_134 auto 65536 1048469696 -
dm 3pardata0_135 3pardata0_135 auto 65536 1048469696 -
dm 3pardata0_136 3pardata0_136 auto 65536 1048469696 -
dm 3pardata0_137 3pardata0_137 auto 65536 1048469696 -
dm 3pardata0_138 3pardata0_138 auto 65536 1048469696 -
v vol01 - ENABLED ACTIVE 2147483648 SELECT - fsgen
pl vol01-03 vol01 ENABLED ACTIVE 2147483648 CONCAT - RW
sv vol01-S01 vol01-03 vol01-L01 1 1048469696 0 2/2 ENA
sv vol01-S02 vol01-03 vol01-L02 1 1048469696 1048469696 2/2 ENA
sv vol01-S03 vol01-03 vol01-L03 1 50544256 2096939392 2/2 ENA
dc vol01_dco vol01 vol01_dcl
v vol01_dcl - ENABLED ACTIVE 143488 SELECT - gen
pl vol01_dcl-01 vol01_dcl ENABLED ACTIVE 143488 CONCAT - RW
sd 3pardata0_133-01 vol01_dcl-01 3pardata0_133 50544256 143488 0 3pardata0_133 ENA
pl vol01_dcl-02 vol01_dcl ENABLED ACTIVE 143488 CONCAT - RW
sd 3pardata0_134-01 vol01_dcl-02 3pardata0_134 50544256 143488 0 3pardata0_134 ENA
v vol01-L01 - ENABLED ACTIVE 1048469696 SELECT - fsgen
pl vol01-P01 vol01-L01 ENABLED ACTIVE 1048469696 CONCAT - RW
sd 3pardata0_129-02 vol01-P01 3pardata0_129 0 1048469696 0 3pardata0_129 ENA
pl vol01-P02 vol01-L01 ENABLED ACTIVE 1048469696 CONCAT - RW
sd 3pardata0_130-02 vol01-P02 3pardata0_130 0 1048469696 0 3pardata0_130 ENA
v vol01-L02 - ENABLED ACTIVE 1048469696 SELECT - fsgen
pl vol01-P03 vol01-L02 ENABLED ACTIVE 1048469696 CONCAT - RW
sd 3pardata0_131-02 vol01-P03 3pardata0_131 0 1048469696 0 3pardata0_131 ENA
pl vol01-P04 vol01-L02 ENABLED ACTIVE 1048469696 CONCAT - RW
sd 3pardata0_132-02 vol01-P04 3pardata0_132 0 1048469696 0 3pardata0_132 ENA
v vol01-L03 - ENABLED ACTIVE 50544256 SELECT - fsgen
pl vol01-P05 vol01-L03 ENABLED ACTIVE 50544256 CONCAT - RW
sd 3pardata0_133-02 vol01-P05 3pardata0_133 0 50544256 0 3pardata0_133 ENA
pl vol01-P06 vol01-L03 ENABLED ACTIVE 50544256 CONCAT - RW
sd 3pardata0_134-02 vol01-P06 3pardata0_134 0 50544256 0 3pardata0_134 ENA
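To pull the sub-volume start offsets out of output like the above, a small filter can be used. This is a minimal sketch, assuming the whitespace-separated field order shown in the sv records above (name in field 2, underlying volume in field 4, length in field 6, offset in field 7); `list_subvolume_offsets` is a hypothetical helper, not part of VxVM.

```shell
# Read "vxprint -qht" style output on stdin and print, for each sub-volume
# (sv) record: name, underlying layered volume, length and start offset.
# ASSUMPTION: field positions match the sample output in this article.
list_subvolume_offsets() {
    awk '$1 == "sv" { print $2, $4, $6, $7 }'
}

# Demonstration using records copied from the sample output
list_subvolume_offsets <<'EOF'
v vol01 - ENABLED ACTIVE 2147483648 SELECT - fsgen
pl vol01-03 vol01 ENABLED ACTIVE 2147483648 CONCAT - RW
sv vol01-S01 vol01-03 vol01-L01 1 1048469696 0 2/2 ENA
sv vol01-S02 vol01-03 vol01-L02 1 1048469696 1048469696 2/2 ENA
sv vol01-S03 vol01-03 vol01-L03 1 50544256 2096939392 2/2 ENA
EOF
```

On a live system the function would be fed from the command itself, for example `vxprint -qhtg testdg | list_subvolume_offsets`.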
To prevent hot-relocation (vxrelocd) from trying to relocate subdisks to other available space, stop the vxrelocd processes.
Example:
# ps -ef | grep -i vxrelocd
root 6317 1 0 Jan15 ? 00:00:00 /bin/sh - /usr/lib/vxvm/bin/vxrelocd root
root 6386 6317 0 Jan15 ? 00:00:00 /bin/sh - /usr/lib/vxvm/bin/vxrelocd root
root 32078 13648 0 09:52 pts/0 00:00:00 grep --color=auto -i vxrelocd
# kill -9 6317 6386
# ps -ef | grep -i vxrelocd
root 32080 13648 0 09:52 pts/0 00:00:00 grep --color=auto -i vxrelocd
3. Ideally you would have two enclosures for redundancy. In this instance, however, the second plex of each sub-layer volume is detached by disabling the corresponding dmpnodes
# vxdmpadm -f disable dmpnodename=
Examples:
# vxdmpadm -f disable dmpnodename=3pardata0_130
# vxdmpadm -f disable dmpnodename=3pardata0_132
# vxdmpadm -f disable dmpnodename=3pardata0_134
4. Leave I/O running for 30 minutes to an hour or more so that the surviving attached sub-layer plexes are updated while the other plexes remain in a detached state (DISABLED NODEVICE)
# vxprint -qhtg testdg
dg testdg default default 23000 1579091402.34.gpk630r4c-08
dm 3pardata0_129 3pardata0_129 auto 65536 1048469696 -
dm 3pardata0_130 - - - - NODEVICE
dm 3pardata0_131 3pardata0_131 auto 65536 1048469696 -
dm 3pardata0_132 - - - - NODEVICE
dm 3pardata0_133 3pardata0_133 auto 65536 1048469696 -
dm 3pardata0_134 - - - - NODEVICE
dm 3pardata0_135 3pardata0_135 auto 65536 1048469696 -
dm 3pardata0_136 3pardata0_136 auto 65536 1048469696 -
dm 3pardata0_137 3pardata0_137 auto 65536 1048469696 -
dm 3pardata0_138 3pardata0_138 auto 65536 1048469696 -
v vol01 - ENABLED ACTIVE 2147483648 SELECT - fsgen
pl vol01-03 vol01 ENABLED ACTIVE 2147483648 CONCAT - RW
sv vol01-S01 vol01-03 vol01-L01 1 1048469696 0 1/2 ENA
sv vol01-S02 vol01-03 vol01-L02 1 1048469696 1048469696 1/2 ENA
sv vol01-S03 vol01-03 vol01-L03 1 50544256 2096939392 1/2 ENA
dc vol01_dco vol01 vol01_dcl
v vol01_dcl - ENABLED ACTIVE 143488 SELECT - gen
pl vol01_dcl-01 vol01_dcl ENABLED ACTIVE 143488 CONCAT - RW
sd 3pardata0_133-01 vol01_dcl-01 3pardata0_133 50544256 143488 0 3pardata0_133 ENA
pl vol01_dcl-02 vol01_dcl DISABLED NODEVICE 143488 CONCAT - RW
sd 3pardata0_134-01 vol01_dcl-02 3pardata0_134 50544256 143488 0 - RLOC
v vol01-L01 - ENABLED ACTIVE 1048469696 SELECT - fsgen
pl vol01-P01 vol01-L01 ENABLED ACTIVE 1048469696 CONCAT - RW
sd 3pardata0_129-02 vol01-P01 3pardata0_129 0 1048469696 0 3pardata0_129 ENA
pl vol01-P02 vol01-L01 DISABLED NODEVICE 1048469696 CONCAT - RW
sd 3pardata0_130-02 vol01-P02 3pardata0_130 0 1048469696 0 - RLOC
v vol01-L02 - ENABLED ACTIVE 1048469696 SELECT - fsgen
pl vol01-P03 vol01-L02 ENABLED ACTIVE 1048469696 CONCAT - RW
sd 3pardata0_131-02 vol01-P03 3pardata0_131 0 1048469696 0 3pardata0_131 ENA
pl vol01-P04 vol01-L02 DISABLED NODEVICE 1048469696 CONCAT - RW
sd 3pardata0_132-02 vol01-P04 3pardata0_132 0 1048469696 0 - NDEV
v vol01-L03 - ENABLED ACTIVE 50544256 SELECT - fsgen
pl vol01-P05 vol01-L03 ENABLED ACTIVE 50544256 CONCAT - RW
sd 3pardata0_133-02 vol01-P05 3pardata0_133 0 50544256 0 3pardata0_133 ENA
pl vol01-P06 vol01-L03 DISABLED NODEVICE 50544256 CONCAT - RW
sd 3pardata0_134-02 vol01-P06 3pardata0_134 0 50544256 0 - RLOC
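To confirm which plexes are detached without scanning the full listing by eye, the plex (pl) records can be filtered on their kernel state and state columns. This is a minimal sketch, assuming the field order shown above (plex name in field 2, volume in field 3, KSTATE in field 4, STATE in field 5); `list_detached_plexes` is a hypothetical helper, not a VxVM command.

```shell
# Read "vxprint -qht" style output on stdin and print the name and parent
# volume of every plex that is in the DISABLED NODEVICE state.
# ASSUMPTION: field positions match the sample output in this article.
list_detached_plexes() {
    awk '$1 == "pl" && $4 == "DISABLED" && $5 == "NODEVICE" { print $2, $3 }'
}

# Demonstration using records copied from the sample output
list_detached_plexes <<'EOF'
pl vol01-P01 vol01-L01 ENABLED ACTIVE 1048469696 CONCAT - RW
pl vol01-P02 vol01-L01 DISABLED NODEVICE 1048469696 CONCAT - RW
pl vol01-P03 vol01-L02 ENABLED ACTIVE 1048469696 CONCAT - RW
pl vol01-P04 vol01-L02 DISABLED NODEVICE 1048469696 CONCAT - RW
EOF
```

The same filter produces no output once every plex has returned to ENABLED ACTIVE, which gives a simple way to check that recovery has finished before changing the read policy.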
5. Enable the disabled dmpnodes for the detached plexes and wait for the vxattachd daemon (180 seconds+) to detect the returning disks and perform the plex recovery
# vxdmpadm enable dmpnodename=
Examples:
# vxdmpadm enable dmpnodename=3pardata0_130
# vxdmpadm enable dmpnodename=3pardata0_132
# vxdmpadm enable dmpnodename=3pardata0_134
6. Stop the application
7. Once the plexes have been resynced, set the plex read policy to read from the resynced plexes
# vxvol -g
Examples:
# vxvol -g testdg rdpol prefer vol01-L01 vol01-P02
# vxvol -g testdg rdpol prefer vol01-L02 vol01-P04
# vxvol -g testdg rdpol prefer vol01-L03 vol01-P06
8. Start the application and check whether it reports any errors
9. If errors are reported, stop the application and switch the read preference back to the first plex of each sub-layer volume
# vxvol -g
Examples:
# vxvol -g testdg rdpol prefer vol01-L01 vol01-P01
# vxvol -g testdg rdpol prefer vol01-L02 vol01-P03
# vxvol -g testdg rdpol prefer vol01-L03 vol01-P05
10. Start the application and check whether it reports any errors