When replicating data using the VVR option in SFW 5.0 RP1a, a server crash (BSOD) or hang can occur when replication attempts to reinitialize following an outage.

Article ID: 100021830

Resolution

When replicating data using the VVR option in SFW 5.0 RP1a, a server crash (BSOD) or hang can occur on the secondary server when replication attempts to reinitialize. This can occur during the VVR autosync or fast failback operations due to the DCM log volume being unavailable because of a failure activating the DCM.

If VVR is being used in a non-clustered configuration, the VVR Secondary server crashes or hangs immediately following a Diskgroup import.

If VVR is being used in a clustering solution with SFW-HA or MSCS, the issue experienced will be that the VVR Secondary server will crash or hang when the cluster service ('HAD' for SFW-HA and 'Cluster Service' for MSCS) is started as this will result in the VVR resources being brought online automatically in most cases. If the resources are manually brought online, the crash or hang would occur once the VMDg and VVR IP resources are both online.

An example of a potential bugcheck that can be seen with this issue is as follows:

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high.  This is usually caused by drivers using improper addresses. If kernel debugger is available get stack backtrace.
Arguments:
Arg1: 00000001, memory referenced
Arg2: d0000002, IRQL
Arg3: 00000000, value 0 = read operation, 1 = write operation

READ_ADDRESS:  00000001
CURRENT_IRQL:  2
FAULTING_IP:
vxio!vol_dcm_set_region+79

DEFAULT_BUCKET_ID:  DRIVER_FAULT

BUGCHECK_STR:  0xD1
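
When triaging a crash, it can help to check whether a saved debugger analysis resembles the example above. The following is a minimal Python sketch, assuming the analysis text uses the same `BUGCHECK_STR:` and `FAULTING_IP:` field labels shown in the example. The function name `matches_dcm_bugcheck` is hypothetical, and because the bugcheck above is only one example of what this issue can produce, a non-match does not rule the issue out.

```python
import re


def matches_dcm_bugcheck(analysis_text: str) -> bool:
    """Return True if the text resembles the example bugcheck in this article.

    Assumption: the text contains 'BUGCHECK_STR:' and 'FAULTING_IP:' fields
    laid out as in the example (the faulting symbol may be on the next line).
    """
    bugcheck = re.search(r"BUGCHECK_STR:\s*(\S+)", analysis_text)
    faulting = re.search(r"FAULTING_IP:\s*(\S+)", analysis_text)
    if not bugcheck or not faulting:
        return False
    is_d1 = bugcheck.group(1).lower() == "0xd1"
    # Ignore the offset after the symbol name (for example '+79').
    in_dcm = faulting.group(1).lower().startswith("vxio!vol_dcm_set_region")
    return is_d1 and in_dcm


EXAMPLE_ANALYSIS = """\
READ_ADDRESS:  00000001
CURRENT_IRQL:  2
FAULTING_IP:
vxio!vol_dcm_set_region+79

DEFAULT_BUCKET_ID:  DRIVER_FAULT

BUGCHECK_STR:  0xD1
"""

if __name__ == "__main__":
    print(matches_dcm_bugcheck(EXAMPLE_ANALYSIS))
```

A match only suggests that the crash is worth comparing against this article; the support representative should confirm the diagnosis.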

Cause
The crash results from the following sequence:

1. The Primary RVG goes down, and the Secondary server performs a takeover with failback logging and becomes the Primary RVG.

2. When the original Primary RVG comes back online and the Diskgroup containing this RVG is imported, the connection between the original Primary RVG and the current Primary RVG occurs (to re-establish replication).

3. The original Primary RVG instructs the current Primary RVG to become a Secondary, and also attempts to mark the changes made to the current Primary RVG to its DCM. However, it crashes while accessing the DCM bitmap because the DCM has not been properly activated.

Solution
The source of this issue has been identified and a Hotfix is available from Veritas Enterprise Technical Support. To obtain the Hotfix, contact Veritas Enterprise Technical Support and reference this article during the call. A support representative will be available to assist in troubleshooting this issue. If it is determined that the private fix addresses the problem the support representative will further assist in obtaining the Hotfix.

Note: This fix specifically addresses the problem identified above. It has not been fully tested and should be applied in a test environment before placing into production. If the systems are not critically impaired, it is recommended to delay the installation of this private fix until the next scheduled maintenance release. The support representative will help in determining the best course of action.

Important Information
Once the private fix has been put in place, this issue will no longer occur. If the VVR configuration is currently impacted by this issue, the following steps must be taken to resolve it:

1. From the primary server (i.e. the server that does NOT crash when the Diskgroup is imported), run the command: vxprint -VPl  (capital V, capital P, lowercase L)

2. Locate the proper RVG in the output of the command by finding the name of the Diskgroup for which the RVG is configured. Below is an example of the complete output for a single Diskgroup taking part in VVR replication:
 
Diskgroup = SQL2K5
 
Rvg        : SQLRVG
state      : state=ACTIVE kernel=ENABLED
assoc     : datavols=\Device\HarddiskDmVolumes\SQL2K5\SQLData
             srl=\Device\HarddiskDmVolumes\SQL2K5\SRL
             rlinks=rlk_SERVER1_25042
att        : rlinks=rlk_SERVER1_25042
checkpoint :
flags      : primary enabled attached clustered

3. Find the 'rlinks=' line and note the name of the rlink. In the example above, the rlink name is rlk_SERVER1_25042

4. Run the following command: vxrlink det <rlink_name>   *Where <rlink_name> is the name of the rlink noted in Step 3

  Ex: vxrlink det rlk_SERVER1_25042

5. Once detached, move to the secondary VVR server (the server that hangs or crashes when VVR attempts to initialize), open the Veritas Enterprise Administrator (VEA), and select the 'Replication Network' option from the 'Select Host' drop-down box (See Figure 1)

6.  Expand the RDS and locate the Secondary RVG, right-click, and choose the 'Change Replication Settings' option. Make note of all configured options as these will be needed again when readding the Secondary RDS (IP Addresses, Replication Type, DCM Mode). Make sure to also select the 'Advanced' button and make note of the Advanced settings as well. This information is also provided in the vxprint -VPl output that was run in Step 1.

7. Once complete, right-click on the Secondary RVG again and select the 'Delete Secondary' option as shown in Figure 1.

Figure 1


Please Note: The Secondary is the RVG which has the inbound green arrow as part of its icon. The Primary has the outbound blue arrow.

8. Once complete, highlight the RDS, right-click and choose the 'Add Secondary' option as shown in Figure 2

Figure 2


9. Complete the wizard to re-add the Secondary with the original options documented in Step 6. A full resynchronization will start afterwards.
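
Steps 1 through 4 of the procedure above can be sketched in Python for cases where several Diskgroups are listed and the rlink name has to be picked out of a long vxprint -VPl listing. This is an illustrative sketch only. It assumes the output follows the layout of the example in Step 2 (a `Diskgroup = <name>` line followed by `rlinks=` entries), and it assumes that multiple rlinks, if present, are comma-separated. The function names `find_rlinks` and `detach_commands` are hypothetical. The sketch only builds the command strings; it does not run them.

```python
import re


def find_rlinks(vxprint_output: str, diskgroup: str) -> list:
    """Collect rlink names listed under the given Diskgroup.

    Assumption: the text looks like the 'vxprint -VPl' example in Step 2.
    """
    current_dg = None
    rlinks = []
    for line in vxprint_output.splitlines():
        dg_match = re.match(r"\s*Diskgroup\s*=\s*(\S+)", line)
        if dg_match:
            current_dg = dg_match.group(1)
            continue
        if current_dg != diskgroup:
            continue
        # 'rlinks=' appears on both the assoc and att lines; de-duplicate.
        for value in re.findall(r"rlinks=(\S+)", line):
            for name in value.split(","):  # comma separation is an assumption
                if name and name not in rlinks:
                    rlinks.append(name)
    return rlinks


def detach_commands(rlinks: list) -> list:
    """Build the Step 4 command for each rlink name."""
    return [f"vxrlink det {name}" for name in rlinks]


SAMPLE_OUTPUT = r"""
Diskgroup = SQL2K5

Rvg        : SQLRVG
state      : state=ACTIVE kernel=ENABLED
assoc     : datavols=\Device\HarddiskDmVolumes\SQL2K5\SQLData
             srl=\Device\HarddiskDmVolumes\SQL2K5\SRL
             rlinks=rlk_SERVER1_25042
att        : rlinks=rlk_SERVER1_25042
checkpoint :
flags      : primary enabled attached clustered
"""

if __name__ == "__main__":
    for command in detach_commands(find_rlinks(SAMPLE_OUTPUT, "SQL2K5")):
        print(command)
```

Review the printed command against the vxprint -VPl output before running it on the primary server, since detaching the wrong rlink would interrupt replication for a different RVG.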

Issue/Introduction

When replicating data using the VVR option in SFW 5.0 RP1a, a server crash (BSOD) or hang can occur when replication attempts to reinitialize following an outage. This is normally seen immediately following an import of the Diskgroup or after the cluster software (SFW-HA / MSCS) starts and brings the VVR resources online.