Consider the following primary RVG containing two rlinks (i.e. there are two secondary nodes):
rv testrvg 2 ENABLED ACTIVE primary 2 srlvol
rl rlink1 testrvg CONNECT ACTIVE pnd1 vvrdg2 toprimary
rl rlink2 testrvg CONNECT ACTIVE pnd2 vvrdg2 toprimary
...
Note that both rlinks are in a state of CONNECT ACTIVE, meaning that they are connected to the secondary nodes and replicating. Furthermore, we can see that both rlinks are up to date (i.e. there is no outstanding data buffered in the primary SRL or DCM logs waiting to be replicated):
VxVM VVR vxrlink INFO V-5-1-4467 Rlink rlink1 is up to date
VxVM VVR vxrlink INFO V-5-1-4467 Rlink rlink2 is up to date
For the issue to occur, the following steps must take place:
1. Due to some kind of issue with the network connection or the secondary node, the SRL begins to fill for rlink1. This continues until the SRL overflows for rlink1 and as such it switches to DCM logging mode. rlink2 remains replicating from the SRL as normal:
VxVM VVR vxrlink INFO V-5-1-12887 DCM is in use on rlink rlink1. DCM contains 174080 Kbytes (85%) of the Data Volume(s).
VxVM VVR vxrlink INFO V-5-1-4467 Rlink rlink2 is up to date
2. A resynchronisation of the DCM logs is started for rlink1 in an attempt to switch back to replication from the SRL. As a result of this the amount of outstanding data held by DCM logs begins to fall:
# vxrvg resync testrvg
...
VxVM VVR vxrlink INFO V-5-1-12887 DCM is in use on rlink rlink1. DCM contains 139264 Kbytes (68%) of the Data Volume(s).
VxVM VVR vxrlink INFO V-5-1-12887 DCM is in use on rlink rlink1. DCM contains 137216 Kbytes (67%) of the Data Volume(s).
...
3. Whilst the DCM resynchronisation for rlink1 continues, there is some kind of issue with the network connection or secondary node causing the SRL to start to fill for rlink2:
VxVM VVR vxrlink INFO V-5-1-4640 Rlink rlink2 has 308 outstanding writes, occupying 79618 Kbytes (39%) on the SRL
...
VxVM VVR vxrlink INFO V-5-1-4640 Rlink rlink2 has 559 outstanding writes, occupying 144501 Kbytes (70%) on the SRL
Ultimately, the SRL becomes 100% full and overflows for rlink2 which, due to the settings for SRL protection, causes rlink2 to attempt to switch to DCM logging mode.
4. At this stage, the DCM logs are already in use for the resynchronisation of rlink1, which is still in progress. Because of this, the DCM logs are not available for DCM logging by rlink2, causing rlink2 to fail to switch to DCM logging mode. As rlink2 has no way to track further incoming data volume writes for replication (the SRL has overflowed and the DCM logs are busy), it is forced to detach due to a 'log overflow error':
VxVM VVR vxrlink INFO V-5-1-4640 Rlink rlink2 has 788 outstanding writes, occupying 203698 Kbytes (99%) on the SRL
VxVM VVR vxrlink INFO V-5-1-3539 Rlink rlink2 is not currently attached (STALE).
Note that the following messages are shown in the syslog (or errpt on AIX) indicating that the rlink has been detached:
WARNING VxVM VVR vxio V-5-0-113 Disconnecting rep rlink2 to shift to DCM protection
...
WARNING VxVM VVR vxio V-5-0-280 Rlink rlink2 STALE due to log overflow
A vxprint of the disk group confirms that whilst rlink1 remains CONNECT ACTIVE, rlink2 is now DETACHED STALE:
rv testrvg 2 ENABLED ACTIVE primary 2 srlvol
rl rlink1 testrvg CONNECT ACTIVE pnd1 vvrdg2 toprimary
rl rlink2 testrvg DETACHED STALE pnd2 vvrdg2 toprimary
...
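The status messages quoted in the steps above follow a small number of fixed patterns, so they can be reduced to one short record per message when monitoring an RVG. The following is a minimal sketch only: the function name `summarise_status` is hypothetical, and the field positions are assumed from the message text exactly as it is quoted in this article (other releases or locales may format the messages differently).

```shell
# Reduce status messages (as quoted in this article) to
# "<rlink> <mode> <percent>" records. Reads message lines on stdin.
# Field positions are an assumption based on the sample messages only.
summarise_status() {
    awk '
    # Strip the surrounding "(" ")" and "%" from a field such as "(85%)".
    function pct(s) { gsub(/[()%]/, "", s); return s }

    $5 == "V-5-1-4467"  { print $7, "uptodate", 0 }           # up to date
    $5 == "V-5-1-4640"  { print $7, "srl", pct($15) }         # SRL backlog
    $5 == "V-5-1-12887" {                                     # DCM in use
        name = $12; sub(/\.$/, "", name)                      # drop trailing "."
        print name, "dcm", pct($17)
    }
    $5 == "V-5-1-3539"  { print $7, "detached", "-" }         # not attached
    '
}
```

Feeding the function the messages from steps 1 to 4, however they were captured, gives one line per message, which makes it easier to spot an rlink whose SRL usage is approaching 100% while another rlink is still reported as using the DCM.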
Because rlink2 is marked DETACHED STALE, it requires some form of synchronisation to restore consistency of the secondary node. Common methods are an autosync or a difference-based synchronisation. Regardless of the method used, however, the rlink refuses to connect and as such replication using rlink2 is unable to restart. For example:
# vxprint -qtr | grep rlink2
rl rlink2 testrvg DETACHED STALE pnd2 vvrdg2 toprimary
# vradmin -a startrep testrvg pnd2
VxVM VVR vxrlink WARNING V-5-1-3359 Attaching rlink to non-empty rvg. Autosync will be performed.
...
# vxprint -qtr | grep rlink2
rl rlink2 testrvg ENABLED ACTIVE pnd2 vvrdg2 toprimary
Note that the rlink is marked ENABLED ACTIVE rather than CONNECT ACTIVE, meaning that replication is NOT taking place. In addition, if an autosync is performed, the 'resync_paused' flag will be set against the rlink:
flags: write enabled attached consistent disconnected asynchronous autosync resync_paused
Note that rlink1 continues to operate as normal.
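The three rlink states described so far (CONNECT ACTIVE, ENABLED ACTIVE and DETACHED STALE) can be told apart mechanically from the `rl` records in the vxprint listings. The sketch below is illustrative only: `classify_rlinks` and the status labels it prints are hypothetical names, and the column layout is assumed from the listings shown in this article.

```shell
# Classify each rlink from vxprint-style "rl" records read on stdin.
# Column layout assumed from the listings in this article:
#   rl <name> <rvg> <kstate> <state> <remote_host> <remote_dg> <remote_rlink>
classify_rlinks() {
    awk '$1 == "rl" {
        status = "unknown"
        if ($4 == "CONNECT"  && $5 == "ACTIVE") status = "replicating"
        if ($4 == "ENABLED"  && $5 == "ACTIVE") status = "attached-not-connected"
        if ($4 == "DETACHED" && $5 == "STALE")  status = "detached-needs-sync"
        print $2, status
    }'
}
```

For example, piping the output of `vxprint -qtr` into `classify_rlinks` after the failed `startrep` above would be expected to report rlink1 as replicating and rlink2 as attached-not-connected.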
The failure of rlink2 to reconnect after being detached is due to a defect in the VVR product that causes unexpected flags to be set within the rlink configuration during the failed transition to DCM logging mode. These flags prevent the rlink from operating as normal.
To clear the flags, one of the following workarounds can be used:
Force attach the DETACHED STALE rlink:
Note that before attempting this procedure any DCM resynchronisation taking place for other rlinks in the RVG should be completed such that the DCMs are no longer busy.
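As a rough pre-check for the condition in the note above, the status messages for the RVG can be scanned for any other rlink that is still reported as using the DCM. This is a conservative sketch under stated assumptions: `dcm_busy_rlinks` is a hypothetical helper, it relies on the V-5-1-12887 message format quoted earlier in this article, and that message does not by itself distinguish DCM logging from an in-progress DCM resynchronisation.

```shell
# Print the names of rlinks, other than the one given as $1, whose status
# message reports that the DCM is in use. Reads status messages on stdin.
# Empty output suggests no other rlink is currently using the DCM.
dcm_busy_rlinks() {
    awk -v target="$1" '
    $5 == "V-5-1-12887" {
        name = $12; sub(/\.$/, "", name)   # "rlink1." -> "rlink1"
        if (name != target) print name
    }'
}
```

If the function prints anything when called with the DETACHED STALE rlink as its argument, it is safer to wait for those rlinks to finish before continuing with the steps below.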
1. The DETACHED STALE rlink should be force attached on the primary node. As explained above this will cause it to transition to a state of ENABLED ACTIVE:
# vxrlink -f att rlink2
rv testrvg 2 ENABLED ACTIVE primary 2 srlvol
rl rlink1 testrvg CONNECT ACTIVE pnd1 vvrdg2 toprimary
rl rlink2 testrvg ENABLED ACTIVE pnd2 vvrdg2 toprimary
...
2. Because the rlink is in a state of ENABLED ACTIVE, it cannot replicate data to the secondary node. Despite this, data volume writes will still be logged in the SRL. As such, the SRL starts to fill:
VxVM VVR vxrlink INFO V-5-1-4640 Rlink rlink2 has 3876 outstanding writes, occupying 69278 Kbytes (33%) on the SRL
...
VxVM VVR vxrlink INFO V-5-1-4640 Rlink rlink2 has 4144 outstanding writes, occupying 138556 Kbytes (67%) on the SRL
3. Ultimately the SRL will overflow for the problematic rlink, causing it to attempt to transition to DCM logging. As the DCM logs are no longer busy (the resynchronisation of rlink1 has completed), this transition succeeds:
VxVM VVR vxrlink INFO V-5-1-4640 Rlink rlink2 has 4392 outstanding writes, occupying 202664 Kbytes (99%) on the SRL
VxVM VVR vxrlink INFO V-5-1-12887 DCM is in use on rlink rlink2. DCM contains 8704 Kbytes (4%) of the Data Volume(s).
Note that this causes the rlink to transition from ENABLED ACTIVE to CONNECT ACTIVE, as the rlink configuration is corrected by the successful switch to DCM logging:
rv testrvg 2 ENABLED ACTIVE primary 2 srlvol
rl rlink1 testrvg CONNECT ACTIVE pnd1 vvrdg2 toprimary
rl rlink2 testrvg CONNECT ACTIVE pnd2 vvrdg2 toprimary
...
4. The problematic rlink can now be detached and further synchronisation attempted as normal:
# vxrlink -g vvrdg2 -f det rlink2
# vxrlink -g vvrdg2 -a att rlink2
Note that this time the rlink transitions to CONNECT ACTIVE as expected:
rv testrvg 2 ENABLED ACTIVE primary 2 srlvol
rl rlink1 testrvg CONNECT ACTIVE pnd1 vvrdg2 toprimary
rl rlink2 testrvg CONNECT ACTIVE pnd2 vvrdg2 toprimary
...
In addition we can see that the autosync is progressing as normal:
...
VxVM VVR vxrlink INFO V-5-1-4464 Rlink rlink2 is in AUTOSYNC. 215040 Kbytes remaining.
VxVM VVR vxrlink INFO V-5-1-4464 Rlink rlink2 is in AUTOSYNC. 207616 Kbytes remaining.
...
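To confirm that the autosync really is advancing, the "Kbytes remaining" figure can be compared between successive status messages. The following is a minimal sketch: `autosync_progress` is a hypothetical helper, and it assumes the V-5-1-4464 message format exactly as quoted above.

```shell
# For each AUTOSYNC status message on stdin, print:
#   <rlink> <kbytes_remaining> <kbytes_transferred_since_previous_sample>
# The third column is "-" for the first sample seen for an rlink.
autosync_progress() {
    awk '$5 == "V-5-1-4464" {
        if ($7 in prev) print $7, $11, prev[$7] - $11
        else            print $7, $11, "-"
        prev[$7] = $11
    }'
}
```

A positive value in the third column indicates that data was transferred between the two samples; a value that stays at 0 over several samples would suggest that the synchronisation has stalled.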
Destroy and Recreate the DETACHED STALE rlink:
Alternatively, the DETACHED STALE rlink can simply be destroyed and recreated to clean up its configuration. Once recreated the rlink can be synchronised as normal. For example:
rv testrvg 2 ENABLED ACTIVE primary 2 srlvol
rl rlink1 testrvg CONNECT ACTIVE pnd1 vvrdg2 toprimary
rl rlink2 testrvg DETACHED STALE pnd2 vvrdg2 toprimary
...
1. Destroy the DETACHED STALE rlink:
# vradmin delsec testrvg pnd2
Note that this removes the DETACHED STALE rlink from the RVG configuration.
2. Recreate the DETACHED STALE rlink:
# vradmin addsec testrvg sptaixvcs1 pnd2 prlink=rlink2 srlink=topri
3. The rlink is recreated and can now be synchronised as normal:
rv testrvg 2 ENABLED ACTIVE primary 2 srlvol
rl rlink1 testrvg CONNECT ACTIVE pnd1 vvrdg2 toprimary
rl rlink2 testrvg DETACHED STALE pnd2 vvrdg2 topri
...
# vradmin -a startrep testrvg pnd2
...
As expected, the rlink transitions correctly to CONNECT ACTIVE:
rv testrvg 2 ENABLED ACTIVE primary 2 srlvol
rl rlink1 testrvg CONNECT ACTIVE pnd1 vvrdg2 toprimary
rl rlink2 testrvg CONNECT ACTIVE pnd2 vvrdg2 topri
...
Note that this issue is being tracked internally at Veritas via e2226531. It is expected that a permanent fix for this issue will be included in a forthcoming release of Veritas Volume Replicator. Please contact Veritas technical support for further information.
Applies To
Veritas Volume Replicator RVGs with multiple secondary nodes (i.e. a one to many configuration with multiple attached rlinks in the primary RVG) - note that one to one configurations (i.e. single primary and single secondary) are not affected by this issue
Volumes in the RVG must have DCM logs attached and SRL protection should be configured to use DCM logging in the event of SRL overflow
To trigger the issue the SRL for a specific rlink must overflow and cause an attempted transition to DCM logging whilst a DCM replay is already being performed for an alternate rlink in the same RVG.
Note that there are no issues with multiple rlinks logging to DCM logs in parallel.
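The trigger condition above can be summarised as a simple decision rule. The function below is a toy model of the behaviour described in this article, not a representation of how VVR is implemented; the name `srl_overflow_outcome` and its arguments are hypothetical.

```shell
# Toy model of the outcome when the SRL overflows for an rlink.
#   $1      rlink whose SRL has overflowed
#   $2...   rlinks for which a DCM replay (resynchronisation) is in progress
# Prints "DETACHED STALE" if a DCM replay is running for a different rlink
# in the same RVG, otherwise "DCM logging".
srl_overflow_outcome() {
    overflowing=$1
    shift
    for busy in "$@"; do
        if [ "$busy" != "$overflowing" ]; then
            echo "DETACHED STALE"
            return 0
        fi
    done
    echo "DCM logging"
}
```

In the scenario from this article, `srl_overflow_outcome rlink2 rlink1` models step 4 (rlink2 detaches because rlink1 is replaying the DCM), whereas `srl_overflow_outcome rlink2` models the successful switch to DCM logging once no replay is running.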