Storage Foundation Cluster File System for Oracle RAC does not support Veritas's implementation of SCSI-3 PGR based I/O fencing and Oracle Clusterware (CRS) is expected to handle any split-brain situations

Article ID: 100018826

Resolution

SFCFS RAC does not support Veritas's implementation of SCSI-3 PGR-based I/O fencing; Oracle Clusterware (CRS) is expected to handle any split-brain situation. To minimize the chance of data corruption in this configuration, Veritas Cluster File System (VxCFS) recovery must start only after Oracle Clusterware has completed its cluster reconfiguration and has rebooted one side of the network partition. VxCFS then performs its recovery without any loss of files.

To ensure that VxCFS recovery takes place after the Oracle Clusterware reconfiguration has completed, set the LLT peerinact timeout high enough that GAB delivers the cluster reconfiguration to VxCFS only after Oracle Clusterware has finished its own reconfiguration. VxCFS then starts its recovery.

Use the following procedure to set the LLT peerinact timeout, based on the Oracle version, when configuring an SFCFS RAC cluster:

General

First, check whether the default CSSD misscount value is in use by running the following command from any cluster node:

# $CRS_HOME/bin/crsctl get css misscount

If the command shows no value and instead prints a message such as "Configuration parameter misscount is not defined", the default CSSD misscount is in use for CRS. In that case, refer to the Oracle 10g or Oracle 11g section below, according to the Oracle version installed. If a non-default value is in use, refer to the "Non-default CSSD misscount value" section.

The default CSSD misscount is 60 seconds for Oracle 10g and 30 seconds for Oracle 11g. If the command output explicitly shows that default value, likewise refer to the Oracle 10g or Oracle 11g section below.
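The decision above can be sketched as a small shell function. This is an illustration only: the function name misscount_section and its arguments are hypothetical, and the misscount value is passed in by hand (empty when crsctl reports that the parameter is not defined) rather than parsed from crsctl output.

```shell
#!/bin/sh
# Hypothetical helper: report which section of this article applies.
#   $1 = Oracle version, "10g" or "11g"
#   $2 = CSSD misscount in seconds as reported by crsctl, or an empty
#        string if crsctl reported that misscount is not defined
misscount_section() {
    version=$1
    value=$2
    # Default CSSD misscount per Oracle version, as stated in this article.
    case "$version" in
        10g) default=60 ;;
        11g) default=30 ;;
        *)   echo "unknown"; return 1 ;;
    esac
    # An undefined value or an explicit default both count as "default".
    if [ -z "$value" ] || [ "$value" -eq "$default" ]; then
        echo "Oracle $version"
    else
        echo "Non-default CSSD misscount value"
    fi
}

misscount_section 10g ""   # prints: Oracle 10g
misscount_section 11g 30   # prints: Oracle 11g
misscount_section 11g 60   # prints: Non-default CSSD misscount value
```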

Oracle 10g

If the default CSSD misscount (60 seconds for Oracle 10g) is not in use, go to the "Non-default CSSD misscount value" section. Otherwise, perform the following steps on each cluster node:

Note: Set the LLT peerinact timeout to 150 seconds if the patch for Oracle bug # 5736843 has not been applied; otherwise use 90 seconds. See the "Important Note" in the "Non-default CSSD misscount value" section below for details about the Oracle bug. The steps below use 150 seconds (15000) as the example value; substitute 90 seconds (9000) throughout if the patch has been applied.

1. Set LLT peerinact timeout to 150 seconds using the following command:
# lltconfig -T peerinact:15000
[The peerinact value is in units of 0.01 seconds]

2. Verify that peerinact has been set to 150 seconds:
# lltconfig -T query

It should show output like this (with peerinact set to 150 seconds):

Current LLT timer values (.01 sec units):
heartbeat = 50
heartbeatlo = 100
peertrouble = 200
peerinact = 15000
oos = 10
retrans = 10
service = 100
arp = 30000
arpreq = 3000
Current LLT flow control values (in packets):
lowwater = 40

3. Append the following line at the end of /etc/llttab file to make LLT peerinact value persistent across reboots:

set-timer peerinact:15000

After appending the above line, /etc/llttab file should look like this:
# cat /etc/llttab
set-node host1
set-cluster 1234
link eth2 eth-00:15:17:48:b5:80 - ether - -
link eth3 eth-00:15:17:48:b5:81 - ether - -
set-timer peerinact:15000

4. Repeat steps 1-3 on other cluster nodes.
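When step 3 is scripted, appending blindly can leave duplicate set-timer lines if the procedure is run twice. The following is a minimal sketch of a guarded append, assuming a hypothetical function name append_peerinact. It is demonstrated against a scratch file rather than the live /etc/llttab.

```shell
#!/bin/sh
# Hypothetical helper: append "set-timer peerinact:<value>" to an
# llttab-style file only if no peerinact set-timer line exists yet.
#   $1 = path to the file
#   $2 = peerinact value in 0.01-second units (for example 15000)
append_peerinact() {
    file=$1
    value=$2
    if grep -q '^set-timer[[:space:]]*peerinact:' "$file"; then
        echo "peerinact already set in $file; edit the existing line instead" >&2
        return 1
    fi
    printf 'set-timer peerinact:%s\n' "$value" >> "$file"
}

# Demonstration on a scratch copy, not on /etc/llttab itself.
demo=$(mktemp)
printf 'set-node host1\nset-cluster 1234\n' > "$demo"
append_peerinact "$demo" 15000
cat "$demo"
rm -f "$demo"
```

A second call on the same file returns a non-zero status and leaves the file unchanged, so an existing value is never silently duplicated.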

Oracle 11g

If the default CSSD misscount (30 seconds for Oracle 11g) is not in use, go to the "Non-default CSSD misscount value" section. Otherwise, perform the following steps on each cluster node:

Note: Set the LLT peerinact timeout to 90 seconds if the patch for Oracle bug # 5736843 has not been applied; otherwise use 60 seconds. See the "Important Note" in the "Non-default CSSD misscount value" section below for details about the Oracle bug. The steps below use 90 seconds (9000) as the example value; substitute 60 seconds (6000) throughout if the patch has been applied.

1. Set LLT peerinact timeout to 90 seconds using the following command:
# lltconfig -T peerinact:9000
[The peerinact value is in units of 0.01 seconds]

2. Verify that peerinact has been set to 90 seconds:
# lltconfig -T query

It should show output like this (with peerinact set to 90 seconds):

Current LLT timer values (.01 sec units):
heartbeat = 50
heartbeatlo = 100
peertrouble = 200
peerinact = 9000
oos = 10
retrans = 10
service = 100
arp = 30000
arpreq = 3000
Current LLT flow control values (in packets):
lowwater = 40

3. Append the following line at the end of /etc/llttab file to make LLT peerinact value persistent across reboots:

set-timer peerinact:9000

After appending the above line, /etc/llttab file should look like this:
# cat /etc/llttab
set-node host1
set-cluster 1234
link eth2 eth-00:15:17:48:b5:80 - ether - -
link eth3 eth-00:15:17:48:b5:81 - ether - -
set-timer peerinact:9000

4. Repeat steps 1-3 on other cluster nodes.
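The check in step 2 can also be done by script instead of by eye. The sketch below assumes the query output has the "name = value" layout shown in step 2. The function name peerinact_from_query is hypothetical, and the sample text stands in for real command output.

```shell
#!/bin/sh
# Hypothetical helper: read "lltconfig -T query" output on stdin and
# print the peerinact value (in 0.01-second units).
peerinact_from_query() {
    awk -F'=' '$1 ~ /^[ \t]*peerinact[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Sample text taken from the expected output in step 2. On a live node
# the input would instead be: lltconfig -T query | peerinact_from_query
sample='Current LLT timer values (.01 sec units):
heartbeat = 50
heartbeatlo = 100
peertrouble = 200
peerinact = 9000
oos = 10'

actual=$(printf '%s\n' "$sample" | peerinact_from_query)
if [ "$actual" = "9000" ]; then
    echo "peerinact is 9000 (90 seconds)"
else
    echo "peerinact is '$actual', expected 9000" >&2
fi
```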

Non-default CSSD misscount value

If the default CSSD misscount (60 seconds for Oracle 10g and 30 seconds for Oracle 11g) is not in use, calculate the LLT peerinact value as follows:

First, get the CSSD misscount by running the following command on any cluster node:
# $CRS_HOME/bin/crsctl get css misscount

Now, calculate LLT peerinact timeout in seconds:
(LLT peerinact timeout) = 2 * (CSSD misscount) + 30

Important Note: Use the above rule if the patch for Oracle bug # 5736843 has not been applied. According to Oracle Metalink, this bug appears to have been fixed in 10.2.0.4 and 11.1.0.6, but the issue may still occur for other reasons.

For the exact symptoms of this bug and related issues, refer to the bug description on Oracle Metalink. If the bug is not present, use the following rule to calculate the LLT peerinact timeout in seconds:

(LLT peerinact timeout) = (CSSD misscount) + 30
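As a worked illustration, the two rules can be written as one shell function. The function name peerinact_seconds and the "patched"/"unpatched" keywords are hypothetical, and the misscount of 45 is only an example of a non-default value.

```shell
#!/bin/sh
# Hypothetical helper: compute the LLT peerinact timeout in seconds.
#   $1 = CSSD misscount in seconds
#   $2 = "unpatched" if the patch for Oracle bug 5736843 has not been
#        applied, "patched" if it has
peerinact_seconds() {
    misscount=$1
    case "$2" in
        unpatched) echo $(( 2 * misscount + 30 )) ;;
        patched)   echo $(( misscount + 30 )) ;;
        *)         echo "second argument must be patched or unpatched" >&2
                   return 1 ;;
    esac
}

peerinact_seconds 45 unpatched   # prints: 120
peerinact_seconds 45 patched     # prints: 75
# The default misscounts reproduce the values used in the sections above:
peerinact_seconds 60 unpatched   # prints: 150
peerinact_seconds 30 patched     # prints: 60
```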

After calculating LLT peerinact timeout value based on the above rule, perform the following steps on each cluster node:

1. Multiply the calculated LLT peerinact timeout in seconds (say N) by 100 to get M = N*100, and use M in the following command:

# lltconfig -T peerinact:M
[The peerinact value is in units of 0.01 seconds]

2. Verify that peerinact has been set to the new value M (in units of 0.01 seconds):
# lltconfig -T query

It should show output like this (with peerinact set to M):

Current LLT timer values (.01 sec units):
heartbeat = 50
heartbeatlo = 100
peertrouble = 200
peerinact = M
oos = 10
retrans = 10
service = 100
arp = 30000
arpreq = 3000
Current LLT flow control values (in packets):
lowwater = 40

3. Append the following line at the end of /etc/llttab file to make LLT peerinact value persistent across reboots:

set-timer peerinact:M

After appending the above line, /etc/llttab should look like this:
# cat /etc/llttab
set-node host1
set-cluster 1234
link eth2 eth-00:15:17:48:b5:80 - ether - -
link eth3 eth-00:15:17:48:b5:81 - ether - -
set-timer peerinact:M

4. Repeat steps 1-3 on other cluster nodes.
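To avoid unit mistakes when substituting M, the conversion from N seconds can be scripted. The sketch below is a dry run under a hypothetical function name, peerinact_commands. It only prints the command from step 1 and the llttab line from step 3 and does not execute anything.

```shell
#!/bin/sh
# Hypothetical dry-run helper: convert N seconds to M = N*100 and print
# the lltconfig command and the /etc/llttab line that use it.
#   $1 = LLT peerinact timeout in seconds (N)
peerinact_commands() {
    n=$1
    m=$(( n * 100 ))   # LLT timers are expressed in 0.01-second units
    echo "lltconfig -T peerinact:$m"
    echo "set-timer peerinact:$m"
}

peerinact_commands 120
# prints:
#   lltconfig -T peerinact:12000
#   set-timer peerinact:12000
```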

Applies To

Applies to SFCFS RAC environments.

Note: This article does not apply to SF Oracle RAC.

Issue/Introduction

Storage Foundation Cluster File System for Oracle RAC (SFCFS RAC) does not support Veritas's implementation of SCSI-3 PGR-based I/O fencing; Oracle Clusterware (CRS) is expected to handle any split-brain situation.
