Node has completely failed and the Cluster Configuration Wizard (VCW) fails to remove it.

Article ID: 100020150

Resolution

Perform the following steps to remove a node that has completely failed from an SFW-HA or VCS cluster. Repeat the steps for each node that must be removed.

Note:

  • Always attempt to remove the node first by using Tools > System Manager in the Cluster Manager - Java Console, and then by using the Cluster Configuration Wizard (VCW) if possible.
  • If the node cannot be removed using the Cluster Configuration Wizard (VCW), follow the procedure below to manually remove the failed node from the cluster.
  1. Stop the SFW-HA services on all nodes in the cluster, leaving the resources running, by executing the following two commands from a Windows command prompt:

haconf -dump -makero     <- saves and closes the current configuration so that it is no longer writable
hastop -all -force

  1. Stop LLT and GAB so that their configuration files can be edited. Execute the following command from a Windows command prompt (the /y switch confirms stopping the services that depend on LLT, which include GAB):

net stop llt /y

  1. On one node only, make a backup copy of %vcs_home%\conf\config\main.cf, then open the file in a text editor and remove all entries that relate to the failed node. Search for the name of the failed node to ensure that every entry is removed (e.g. the systems in the cluster, service group SystemList entries, and per-system attributes; IP and NIC resources, for instance, list the node name along with the MAC address). Once all entries for the node have been removed, save and close main.cf. (You can run "hacf -verify" against the directory that contains the edited main.cf to make sure the syntax remains correct.)
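
The search for leftover references can be sketched in Python. This is a minimal illustration, not a Veritas tool: the function name, the node names (NODE1 to NODE3) and the sample main.cf lines are all hypothetical, and the sample only approximates the layout of a real main.cf. The sketch reports every line that mentions the failed node as a whole word, so that NODE3 does not also match NODE30.

```python
import re
from typing import List, Tuple

# Hypothetical excerpt of a main.cf; a real file will contain far more entries.
SAMPLE_MAIN_CF = "\n".join([
    'system NODE1',
    'system NODE2',
    'system NODE3',
    'group SG1 (',
    '    SystemList = { NODE1 = 0, NODE2 = 1, NODE3 = 2 }',
    '    )',
    '    NIC SG1-NIC (',
    '        MACAddress @NODE1 = "00-11-22-33-44-55"',
    '        MACAddress @NODE3 = "00-11-22-33-44-77"',
    '        )',
])


def find_node_references(main_cf_text: str, node_name: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs that mention node_name as a whole word."""
    # Reject matches that are part of a longer name (letters, digits, "_" or "-").
    pattern = re.compile(
        r"(?<![\w-])" + re.escape(node_name) + r"(?![\w-])", re.IGNORECASE
    )
    return [
        (number, line.rstrip())
        for number, line in enumerate(main_cf_text.splitlines(), start=1)
        if pattern.search(line)
    ]


if __name__ == "__main__":
    for number, line in find_node_references(SAMPLE_MAIN_CF, "NODE3"):
        print(f"{number}: {line}")
```

To check a real file, read the backup or the edited main.cf into a string and pass it in place of SAMPLE_MAIN_CF. An empty result after editing suggests that no references to the failed node remain.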

 

  1. Edit the %vcs_root%\comms\llt\llthosts.txt file and remove the entry for the failed node.
  1. Copy the edited llthosts.txt file to all other active nodes in the cluster.
  1. Edit the %vcs_root%\comms\gab\gabtab.txt file and reduce the number of nodes in the cluster by the number of failed nodes being removed. For example, if there are three nodes in the cluster and one node has completely failed, the existing entry will be gabconfig -c -n 3; change the 3 to a 2.
  1. Copy the edited gabtab.txt file to all other active nodes in the cluster.
  1. Once main.cf has been edited and the updated llthosts.txt and gabtab.txt files have been copied to the active nodes in the cluster, start the cluster services on the node where %vcs_home%\conf\config\main.cf was edited by executing the following commands from a Windows command prompt:

net start llt
lltconfig -c
net start gab
net start vcscomm
hastart
hasys -state

Note: Repeat the hasys -state command until the local server is shown in the RUNNING state. At that point, run the following command to start HAD on all other nodes in the cluster:

hastart -all
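
The wait described in the note above can be sketched in Python. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, and the parsing assumes that hasys -state prints one row per system with the system name in the first column and the state in the last column, which should be checked against the output on your own cluster.

```python
import subprocess
import time
from typing import Callable, Optional


def run_hasys_state() -> str:
    """Run 'hasys -state' and return its standard output (assumes hasys is on PATH)."""
    result = subprocess.run(
        ["hasys", "-state"], capture_output=True, text=True, check=False
    )
    return result.stdout


def system_state(output: str, node: str) -> Optional[str]:
    """Return the state reported for node, or None if the node is not listed."""
    for line in output.splitlines():
        fields = line.split()
        # Assumed row layout: <system> <attribute> <value>; '#' marks the header row.
        if len(fields) >= 3 and not line.startswith("#") and fields[0].lower() == node.lower():
            return fields[-1]
    return None


def wait_until_running(
    node: str,
    fetch: Callable[[], str] = run_hasys_state,
    interval: float = 5.0,
    attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until node reports RUNNING; return False if attempts run out."""
    for _ in range(attempts):
        if system_state(fetch(), node) == "RUNNING":
            return True
        sleep(interval)
    return False
```

For example, wait_until_running("NODE1"), with NODE1 standing in for the local node name, returns True once the local node reports RUNNING, at which point hastart -all can be run. The fetch and sleep parameters exist only so the polling logic can be exercised without a live cluster.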

  1. Verify that the cluster service is running on all nodes by executing the following command from a Windows command prompt:

hasys -state
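
The llthosts.txt and gabtab.txt edits from the procedure above can be sketched in Python as plain text transformations. This is a minimal sketch rather than a supported utility: the function names and node names are hypothetical, llthosts.txt is assumed to hold one "<node id> <node name>" entry per line, and gabtab.txt is assumed to contain the gabconfig -c -n entry quoted in the procedure.

```python
import re


def remove_llt_host(llthosts_text: str, node: str) -> str:
    """Drop the llthosts entry for node; the remaining node IDs are left unchanged."""
    kept = []
    for line in llthosts_text.splitlines():
        fields = line.split()
        # Assumed entry layout: "<node id> <node name>".
        if len(fields) >= 2 and fields[1].lower() == node.lower():
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


def reduce_gab_node_count(gabtab_text: str, failed_nodes: int = 1) -> str:
    """Lower the count in 'gabconfig -c -n <count>' by the number of failed nodes."""

    def lower_count(match: "re.Match[str]") -> str:
        new_count = int(match.group(2)) - failed_nodes
        if new_count < 1:
            raise ValueError("the node count would drop below 1")
        return match.group(1) + str(new_count)

    updated, changes = re.subn(r"(gabconfig\s+-c\s+-n\s*)(\d+)", lower_count, gabtab_text)
    if changes == 0:
        raise ValueError("no 'gabconfig -c -n' entry found")
    return updated


if __name__ == "__main__":
    print(remove_llt_host("0 NODE1\n1 NODE2\n2 NODE3\n", "NODE3"), end="")
    print(reduce_gab_node_count("gabconfig -c -n 3\n"), end="")
```

Both functions work on text held in memory, so the originals stay untouched until the result has been reviewed and written back by hand. Back up both files before changing them, as with main.cf.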

Issue/Introduction

How to manually remove a node from a Storage Foundation for Windows High Availability (SFW-HA) or Veritas Cluster Server (VCS) cluster, when a node has completely failed and the Cluster Configuration Wizard (VCW) fails to remove it.