Veritas Volume Replicator (VVR) 6.0.3 and below may encounter a replication stalled drain state when the VVR SRL "vol_max_rdback_sz" (RDBACK) tunable value is not set high enough in a shared environment


Article ID: 100010157


Cause

Asynchronous mode: 
 

Network performance issues: 

Asynchronous mode ensures write requests are not delayed if network capacity is insufficient. Instead, excess requests accumulate on the SRL (as long as the SRL is large enough to hold them). 
If there is a persistent shortfall in network capacity, the SRL eventually overflows. 

The SRL can be used as a buffer to handle temporary shortfalls in network capacity, such as periods of peak usage, provided that these periods are followed by periods during which the Secondary can catch up as the SRL drains. 
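
The buffering behaviour described above can be illustrated with a minimal sketch. It is a simplified model rather than VVR's actual accounting: the function name, the per-interval units, and the sample figures are assumptions made for illustration only.

```python
def simulate_srl_backlog(write_rates, bandwidth, srl_capacity):
    """Track the SRL backlog per interval; return (history, overflowed).

    write_rates:  data written by the application in each interval (e.g. MB)
    bandwidth:    data the network can ship per interval (same unit)
    srl_capacity: usable SRL size (same unit)
    All three parameters are hypothetical inputs, not VVR tunables.
    """
    backlog = 0
    history = []
    for written in write_rates:
        # The backlog grows by the shortfall and drains when capacity is spare.
        backlog = max(0, backlog + written - bandwidth)
        history.append(backlog)
        if backlog > srl_capacity:
            return history, True  # the SRL has overflowed
    return history, False


# Temporary peak followed by a quiet period: the SRL drains again.
print(simulate_srl_backlog([50, 150, 200, 50, 50, 50], 100, 200))
# Persistent shortfall: the backlog keeps growing until the SRL overflows.
print(simulate_srl_backlog([150] * 5, 100, 200))
```

In the first call the backlog peaks and then returns to zero; in the second, the write rate exceeds the bandwidth in every interval, so the backlog eventually exceeds the SRL capacity.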

NOTE: You can use the "bandwidth_limit" attribute to set the maximum network bandwidth (in bits per second) that can be used during replication.

Asynchronous mode handles bursts of I/O or congestion on the network by using the SRL. This minimizes the impact of network bandwidth fluctuations on application performance. 
The average network bandwidth must be adequate for the average write rate of the application. 

Asynchronous replication does not compensate for a slow network, as the Secondary needs to reflect the state of the Primary at some point in time. However, the Primary may have committed transactions that have not yet been written to the Secondary. 

VVR enables you to manage latency protection, by specifying how many outstanding writes are acceptable, and what action to take if that limit is exceeded. 

Asynchronous replication also minimizes impact on application performance because the I/O completes without waiting for the network acknowledgment from the Secondary.

Resolution


In some circumstances the maximum amount of memory allocated to the VVR readback pool ("vol_max_rdback_sz") is not sufficient. This tunable sets the maximum memory that VVR uses when write requests are being read back from the SRL. Readback normally occurs in VVR environments where writes cannot be sent to the Secondary as they arrive, for example because a surge of writes exceeds the available network bandwidth.

When replicating we are mainly concerned with the write operations, as read operations do not affect replication.


The Veritas article "HOWTO85017" describes how to tune the "vol_max_rdback_sz" tunable.
 

In asynchronous mode, where the Secondary or network bandwidth cannot keep up with the incoming write rate, the Primary kernel memory buffers fill up.

For VVR to continue to provide memory for incoming writes and to continue its processing, it must free the memory held by writes that have been written to the Primary data volume but not yet sent to the Secondary.

When VVR is ready to send the unsent writes that were freed, the writes must first be “READ BACK” from the SRL.

In synchronous mode the data is always available in memory, while in asynchronous mode VVR may FREQUENTLY have to “READ BACK” the data from the SRL. Consequently, replication performance might suffer because of the delay of the additional read operation.

Synchronous replication, however, can significantly decrease application performance by adding the network round trip to the latency of each write request.

KEY POINT: VVR does not need to “READ BACK” from the SRL if the “NETWORK BANDWIDTH” is sufficient and the Secondary always keeps up with the incoming write rate, or if the Secondary only falls behind for short periods during which the accumulated writes are small enough to fit in the VVR kernel buffer.
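
The key point above can be expressed as a rough calculation. The sketch below is a simplification under stated assumptions: it treats the write rate and network rate as constant during a burst and compares the accumulated shortfall with the kernel buffer size. The function and parameter names are hypothetical, and the real decision in VVR also depends on other factors such as the low memory threshold.

```python
def readback_needed(write_rate, network_rate, burst_seconds, kernel_buffer_bytes):
    """Estimate whether a burst forces VVR to read back from the SRL.

    write_rate, network_rate: bytes per second (assumed constant in the burst)
    burst_seconds:            how long the Secondary falls behind
    kernel_buffer_bytes:      memory available to hold unsent writes
    """
    # Only the part of the write rate the network cannot absorb accumulates.
    shortfall = max(0, write_rate - network_rate)
    accumulated = shortfall * burst_seconds
    # Readback is needed once the unsent writes no longer fit in memory.
    return accumulated > kernel_buffer_bytes


buffer_bytes = 128 * 1024 * 1024
print(readback_needed(20_000_000, 10_000_000, 5, buffer_bytes))   # short burst
print(readback_needed(20_000_000, 10_000_000, 60, buffer_bytes))  # long burst
```

With these sample figures, the 5-second burst accumulates about 50 MB and fits in the buffer, while the 60-second burst accumulates about 600 MB and does not.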


As previously stated, in a shared environment, VVR always “READS BACK” from the SRL when replicating in asynchronous mode.


If VVR reads back from the SRL frequently, striping the SRL volume over several dedicated disks (not used by data volumes) could improve performance, unless striping is already done at the array level.

To determine whether VVR is reading back from the SRL, use the “vxstat” command. In the output, note the number of read operations on the SRL.
 

 

 

Issue/Introduction

Veritas Volume Replicator (VVR) may encounter a replication stalled drain state when the VVR SRL "vol_max_rdback_sz" (RDBACK) tunable value is not set high enough in a shared environment.
  VVR Design Overview:

In an ideal configuration, data is replicated at the speed at which it is generated by the application. As a result, all Secondary hosts remain up to date. A write to a data volume in the Primary flows through various components and across the network until it reaches the Secondary data volume.

For the data on the Secondary to be up to date, each component in the configuration must be able to keep up with the incoming writes. The goal when configuring replication is that VVR is able to handle TEMPORARY bottlenecks, such as occasional surges in writes and network problems.

If one of the components cannot keep up with the write rate over the long term, the application could slow down because of increased write latency, or the Secondary could fall behind. If a component that completes the write on the Primary cannot keep up, latency might be added to each write, which leads to poor application performance. If writes on the Primary proceed at the normal pace but a later component cannot keep up, updates accumulate in the SRL. As a result, the Secondary falls behind and the SRL eventually overflows.

Therefore, it is important to examine each component to ensure that it can support the expected application write rate.

IMPORTANT: In a shared environment, VVR always “READS BACK” from the SRL when replicating in asynchronous mode. Synchronous mode is not recommended in a shared environment.

How VVR uses buffers between the Primary and Secondary
Figure 1.0



When using asynchronous mode, VVR copies the data into a kernel buffer on the Primary and then writes a header and the data update to the SRL (the header describes the write). From the kernel buffer, VVR sends the write to all Secondary hosts and writes it to the Primary data volume. The kernel buffer cannot be freed until the write to the Primary data volume is complete.

A Secondary in asynchronous mode might be out of date for various reasons, such as network outages or a surge of writes that exceeds the available network bandwidth. As the Secondary falls behind, the data to be sent to the Secondary starts accumulating in the write buffer space on the Primary. If the Secondary cannot keep up with the application write rate, VVR may need to free the Primary kernel buffer so that incoming write requests are not delayed. Secondary hosts that fall behind in this manner are serviced by reading back the writes from the Primary SRL.

In this case, the writes are sent from the "vol_max_rdback_sz" (RDBACK) buffer rather than from the Primary kernel buffer. The readback process continues until the Secondary catches up with the Primary. Once the Secondary has caught up, VVR reverts to sending writes from the kernel buffer instead of reading them back from the SRL.
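
The switch between the two send paths can be pictured as a two-state model. The sketch below is a conceptual illustration only, not VVR's implementation: the enum, the function, and its boolean inputs are hypothetical names chosen to mirror the description above.

```python
from enum import Enum


class SendSource(Enum):
    KERNEL_BUFFER = "kernel buffer"
    SRL_READBACK = "SRL readback"


def next_send_source(current, unsent_writes_freed, secondary_caught_up):
    """Return where the next writes for a Secondary are sent from.

    unsent_writes_freed: the Primary freed buffer memory that held writes
                         not yet sent to this Secondary
    secondary_caught_up: the Secondary has caught up with the Primary
    """
    if current is SendSource.KERNEL_BUFFER and unsent_writes_freed:
        # The freed writes now exist only in the SRL, so they must be read back.
        return SendSource.SRL_READBACK
    if current is SendSource.SRL_READBACK and secondary_caught_up:
        # Once caught up, sending resumes directly from the kernel buffer.
        return SendSource.KERNEL_BUFFER
    return current


state = SendSource.KERNEL_BUFFER
state = next_send_source(state, unsent_writes_freed=True, secondary_caught_up=False)
print(state.value)
state = next_send_source(state, unsent_writes_freed=False, secondary_caught_up=True)
print(state.value)
```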

In a shared environment we want to avoid using “synchronous” mode replication, because VVR must wait for the Secondary hosts to send a network acknowledgment that the write was received (that is, received in the VVR kernel memory on the Secondary), which delays writes further. In "synchronous" mode, once all Secondary hosts have acknowledged the write, VVR notifies the application that the write is complete.


Summary of pools (tunables)
=========================

vol_rvio_maxpool_sz      RVIO Pool Size (The amount of buffer space that can be allocated within the operating system to handle incoming writes). This tunable has a direct impact on the performance of VVR as it prevents one I/O operation from using all the memory in the system. The value of vol_rvio_maxpool_sz must be at least 10 times greater than the value of the maximum I/O size.

vol_min_lowmem_sz        Low Memory Threshold (The minimum buffer space. VVR frees the write if the amount of buffer space available is below this threshold. This value is auto-tunable. The value that you specify is used as an initial value and could change depending on the application write behaviour.)

vol_max_rdback_sz        Readback Pool Size (The amount of buffer space available for readbacks). Maximum memory that will be used by VVR when write requests are being read back from the SRL.

vol_max_nmpool_sz        NMCOM Pool Size (The amount of buffer space available for requests coming in to the Secondary over the network).

vol_max_wrspool_sz       Write Shipping Pool Size (The write ship buffer space, which is the amount of buffer space that can be allocated on the logowner to receive writes sent by the non-logowner).

Note: To enable auto-tuning for the vol_min_lowmem_sz tunable, set the tunable value to -1. Auto-tuning is supported only for the “vol_min_lowmem_sz” tunable.
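
Two of the rules in this summary lend themselves to a quick sanity check: the RVIO pool must be at least 10 times the maximum I/O size, and a value of -1 enables auto-tuning for vol_min_lowmem_sz only. The helper names and sample values below are hypothetical, and the sketch assumes both sizes are expressed in the same unit.

```python
def rvio_pool_is_large_enough(vol_rvio_maxpool_sz, max_io_size):
    """True when the RVIO pool is at least 10 times the maximum I/O size."""
    return vol_rvio_maxpool_sz >= 10 * max_io_size


def lowmem_is_auto_tuned(vol_min_lowmem_sz):
    """True when vol_min_lowmem_sz is set to -1, which enables auto-tuning."""
    return vol_min_lowmem_sz == -1


# Sample values in bytes, chosen for illustration only.
max_io = 1 * 1024 * 1024
print(rvio_pool_is_large_enough(8 * 1024 * 1024, max_io))    # pool too small
print(rvio_pool_is_large_enough(128 * 1024 * 1024, max_io))  # pool large enough
print(lowmem_is_auto_tuned(-1))
```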


Reference: https://sort.Veritas.com/public/documents/sfha/6.0.1/aix/productguides/html/sfcfs_admin/apas06s04.htm


Tunable parameter for the readback buffer space

The amount of buffer space available for readbacks is defined by the "vol_max_rdback_sz" (RDBACK) tunable. To accommodate reading back more data, especially in a shared environment, increase the value of the "vol_max_rdback_sz" tunable.

The figure below shows how the vol_max_rdback_sz tunable is involved when VVR reads back data.

How VVR uses buffers during a readback
Figure 2.0

The "vol_max_rdback_sz" tunable may need to be increased where multiple Secondaries are configured in asynchronous mode for one or more RVGs. Changing the value of vol_max_rdback_sz changes the readback pool size of all the RVGs configured on the host.

We can use the vxmemstat command to monitor the buffer space. If the output indicates that the amount of space available is completely used, increase the value of the vol_max_rdback_sz tunable to improve readback performance. When decreasing the value of the "vol_max_rdback_sz" tunable, pause replication to all the Secondaries to which VVR is currently replicating.
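
The guidance above reduces to two simple rules, sketched here. The sketch does not parse vxmemstat output; it assumes the used and maximum readback pool sizes have already been read from that output by hand, and the function names are hypothetical.

```python
def readback_pool_exhausted(used_bytes, max_bytes):
    """True when the readback pool is completely used.

    In that case the article suggests increasing vol_max_rdback_sz.
    """
    return used_bytes >= max_bytes


def must_pause_replication(current_value, new_value):
    """True when the tunable is being decreased.

    Replication to all Secondaries should be paused before a decrease.
    """
    return new_value < current_value


mib = 1024 * 1024
print(readback_pool_exhausted(128 * mib, 128 * mib))  # pool fully used
print(must_pause_replication(256 * mib, 128 * mib))   # decreasing the value
print(must_pause_replication(128 * mib, 256 * mib))   # increasing the value
```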

Tunable parameters for the VVR buffer spaces

The amount of buffer space available to VVR affects the application and replication performance. You can use the following tunables to manage buffer space according to your requirements:

* vol_rvio_maxpool_sz
* vol_min_lowmem_sz
* vol_max_rdback_sz
* vol_max_nmpool_sz
* vol_max_wrspool_sz

Use the vxmemstat command to monitor the buffer space used by VVR.

Focus on two tunables

vol_rvio_maxpool_sz      RVIO Pool Size (The amount of buffer space that can be allocated within the operating system to handle incoming writes). This tunable has a direct impact on the performance of VVR as it prevents one I/O operation from using all the memory in the system. The value of vol_rvio_maxpool_sz must be at least 10 times greater than the value of the maximum I/O size.

vol_max_rdback_sz         Readback Pool Size (The amount of buffer space available for readbacks). Maximum memory that will be used by VVR when write requests are being read back from the SRL.

Increase "vol_max_rdback_sz" as described in the Veritas article "HOWTO85017".