InfoScale 8.x Solaris system panics when LLT loses heartbeats from a peer node

Article ID: 100068482

Description

Error Message:


Sample: /var/adm/messages

Jun 27 00:47:53 solaris01 ^Mpanic[cpu3]/thread=2a10196db00:
Jun 27 00:47:53 solaris01 unix: [ID 103648 kern.notice] recursive mutex_enter, lp=1840032d02230 owner=2a10196db00 thread=2a10196db00
Jun 27 00:47:53 solaris01 unix: [ID 100000 kern.notice]
Jun 27 00:47:53 solaris01 genunix: [ID 723222 kern.notice] 000002a10196b990 unix:mutex_panic+90 (1013e0c2, 1840032d02230, 207ec000, 1013dc00, 0, 20674000)
Jun 27 00:47:53 solaris01 genunix: [ID 179002 kern.notice]   %l0-3: 0000000000000001 000002a10196db01 00000000207ec000 0000000000000058
Jun 27 00:47:53 solaris01   %l4-7: 0000000000007fff 0000000000007c00 000000001013dc00 000002a10196db01
Jun 27 00:47:53 solaris01 genunix: [ID 723222 kern.notice] 000002a10196ba40 unix:mutex_vector_enter+314 (2, 2a10196db00, 2a10196db01, 2012d000, 1840032d02230, 1)
Jun 27 00:47:53 solaris01 genunix: [ID 179002 kern.notice]   %l0-3: 000000002012d218 0000000000000000 0000000000000000 000000002012d208
Jun 27 00:47:53 solaris01   %l4-7: 000002a10196db00 0000000000000001 0000000000000000 0000000000000000
Jun 27 00:47:53 solaris01 genunix: [ID 723222 kern.notice] 000002a10196baf0 llt:llt_msg_recv1+e08 (1840040b73480, 3, 0, 1840032d02230, 1840032d02c40, 1840032d02228)
Jun 27 00:47:53 solaris01 genunix: [ID 179002 kern.notice]   %l0-3: 0000000000000001 0000000000000000 0001840042e2e030 0000000070ec7df4
Jun 27 00:47:53 solaris01   %l4-7: 0001840030b94000 0000000000000060 000000000000000c 0001840030b943c0
Jun 27 00:47:53 solaris01 genunix: [ID 723222 kern.notice] 000002a10196bbe0 llt:llt_msg_recv+524 (1840040b73480, 70ed0, 0, 0, 0, 0)
Jun 27 00:47:53 solaris01 genunix: [ID 179002 kern.notice]   %l0-3: 0001840042e2e030 0000000000070ec9 0000000000008088 0000000000000020
Jun 27 00:47:53 solaris01   %l4-7: 0000000000000001 000000000000ffff 0000000070ed0000 000000000000ffff
Jun 27 00:47:53 solaris01 genunix: [ID 723222 kern.notice] 000002a10196bcb0 llt:llt_tpi_lrput+708 (0, 1840040a736c0, 70c00, 0, 0, 184002afdeb00)
Jun 27 00:47:53 solaris01 genunix: [ID 179002 kern.notice]   %l0-3: 0000000070ec7d64 0000000000000000 0000000070ecf000 0000000000070ecf
Jun 27 00:47:53 solaris01   %l4-7: 0000000000070c00 0000000000000000 0000000000000000 0000000000070c00
Jun 27 00:47:53 solaris01 genunix: [ID 723222 kern.notice] 000002a10196bd60 llt:llt_lrput+21c (184004d3fa650, 1840040a736c0, 70ec6000, 1, 1, 70ec6ec4)
Jun 27 00:47:53 solaris01 genunix: [ID 179002 kern.notice]   %l0-3: 00000000000000c8 0000000000070c00 0000000000000000 00000000003e7923
Jun 27 00:47:53 solaris01   %l4-7: 0000000000000000 0000000000000000 0001840030b94000 000000007b54a740
Jun 27 00:47:53 solaris01 genunix: [ID 723222 kern.notice] 000002a10196bee0 unix:putnext+220 (184004d3fa840, 184004d3fa650, 1840040a736c0, 20881cd4, 0, 184003eb89300)
Jun 27 00:47:53 solaris01 genunix: [ID 179002 kern.notice]   %l0-3: 0000000000000000 0000000000000000 0000000000000001 0000000000000000
Jun 27 00:47:53 solaris01   %l4-7: 0000000000000000 0000000000000000 0000000000005f90 000000007b558c4c
Jun 27 00:47:53 solaris01 genunix: [ID 723222 kern.notice] 000002a10196bf90 ip:udp_ulp_recv+3dc (1840041a88ec0, 1840040a736c0, 30, 2a10196c4e8, 1840040a736c0, 80000008)
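
To check whether a node has logged this panic, the two distinctive strings in the trace above can be searched for in the messages file. The following is a minimal sketch, not a Veritas-supplied tool: the function name has_llt_recursive_mutex_panic is hypothetical, and the check assumes that finding both strings in the same file is a good enough indicator.

```shell
#!/bin/sh
# Sketch: report whether a log file contains the panic signature from this article.
# has_llt_recursive_mutex_panic is a hypothetical helper name, not an InfoScale command.
has_llt_recursive_mutex_panic() {
    logfile=$1
    # Both search strings are copied from the sample stack trace.
    # Assumption: the two strings appearing anywhere in the same file
    # indicates this panic; the lines are not checked for adjacency.
    grep 'recursive mutex_enter' "$logfile" >/dev/null 2>&1 &&
        grep 'llt:llt_msg_recv1' "$logfile" >/dev/null 2>&1
}
```

Example use on a cluster node: has_llt_recursive_mutex_panic /var/adm/messages && echo "panic signature found". Because the two strings are matched independently, a positive result should still be confirmed by reading the trace.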
 

Cause

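When the network drops packets, LLT is prone to losing heartbeats. When heartbeats are lost, LLT sends an LLT_NULL packet to request a heartbeat from the peer node.

The issue occurs when the peer node receives the LLT_NULL packet. As the stack trace shows, the receiving thread in llt:llt_msg_recv1 tries to enter a mutex that it already owns (owner and thread are both 2a10196db00), which results in the recursive mutex_enter panic.

NOTE: A stable network is less likely to encounter LLT-related problems.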
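
Because the trigger is lost heartbeats, it can help to check the state of the LLT links on each node. The sketch below counts links reported as DOWN in the output of lltstat -nvv. The function name count_down_llt_links is hypothetical, and the parsing assumes that each link line carries a whitespace-separated status token of UP or DOWN; verify this against the output on your own systems.

```shell
#!/bin/sh
# Sketch: count LLT links reported as DOWN.
# count_down_llt_links is a hypothetical helper name. It reads lltstat -nvv
# output on stdin. Assumption: link lines contain a standalone status
# token that is either UP or DOWN.
count_down_llt_links() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i == "DOWN") { n++; break }   # count each line at most once
        }
    }
    END { print n + 0 }'                       # print 0 when nothing matched
}
```

Example use on a cluster node: lltstat -nvv | count_down_llt_links. A non-zero count suggests that the private interconnect should be investigated.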

 

Resolution

A supported hotfix has been made available for this issue. Please contact Technical Support to obtain this fix. This hotfix has not yet gone through extensive QA testing. Consequently, if you are not adversely affected by this problem and have a satisfactory temporary workaround in place, we recommend that you wait for the public release of this hotfix.

The Product Engineering Team currently plans to address this issue by way of a patch or hotfix to the current version of the software. Please note that we as a company reserve the right to remove any fix from the targeted release if it does not pass quality assurance tests. Our plans are subject to change and any action taken by you based on the above information or your reliance upon the above information is made at your own risk.

Please contact your Sales representative or the Sales group for upgrade information including upgrade eligibility to the release containing the resolution for this issue.

Veritas Private hotfix Solaris LLT 8.0.2.1


SUMMARY OF INCIDENTS FIXED BY THE PATCH
---------------------------------------
Patch ID: 8.0.2.1
* 4166061 (4167791) Recursive mutex_enter from LLT - void llt:llt_msg_recv1 on Solaris.


Once the Veritas LLT Private hotfix has been applied to all nodes in the cluster, please reboot all nodes.


# pkg info VRTSllt
             Name: VRTSllt
          Summary: Veritas Low Latency Transport
      Description: The package contains Veritas Low Latency Transport
         Category: Drivers/Networking
            State: Installed
        Publisher: Veritas
          Version: 8.0.2.1
           Branch: None
   Packaging Date: July  4, 2024 at 10:49:29 AM
Last Install Time: June 25, 2024 at 11:27:23 AM
 Last Update Time: July  4, 2024 at  1:06:32 PM
             Size: 1.70 MB
             FMRI: pkg://Veritas/VRTSllt@8.0.2.1:20240704T104929Z
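
The same check can be scripted across nodes. The sketch below extracts the Version field from pkg info output and compares it with 8.0.2.1. Both function names are hypothetical, the comparison assumes purely numeric dotted versions, and it assumes (as the output above suggests) that a node with the hotfix reports version 8.0.2.1 or later.

```shell
#!/bin/sh
# Sketch: verify the installed VRTSllt version from `pkg info` output.
# Both function names are hypothetical helpers, not InfoScale commands.

# Print the value of the "Version:" field from pkg info output on stdin.
llt_version_from_pkg_info() {
    sed -n 's/^ *Version: *//p' | head -n 1
}

# Return 0 if dotted version $1 is >= dotted version $2.
# Assumption: every component is a plain decimal number.
version_at_least() {
    have=$1
    want=$2
    while [ -n "$have" ] || [ -n "$want" ]; do
        h=${have%%.*}
        w=${want%%.*}
        # A missing component is treated as 0 (8.0.2 equals 8.0.2.0).
        [ "${h:-0}" -gt "${w:-0}" ] && return 0
        [ "${h:-0}" -lt "${w:-0}" ] && return 1
        # Drop the component that was just compared.
        case $have in *.*) have=${have#*.} ;; *) have= ;; esac
        case $want in *.*) want=${want#*.} ;; *) want= ;; esac
    done
    return 0
}
```

Example use on a cluster node: v=$(pkg info VRTSllt | llt_version_from_pkg_info), followed by version_at_least "$v" 8.0.2.1 || echo "VRTSllt is below 8.0.2.1".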


 

Issue/Introduction


InfoScale 8.x systems on Solaris may panic when LLT loses heartbeats from a peer node. The issue affects InfoScale 8.0 and 8.0.2 Solaris environments.



Additional Information

JIRA: STESC-8897