Instant Snapshot Resync Performance sporadic drops with Resync Writes Average Time over 3 seconds

Article ID: 100027615

Description

Error Message

The vxstat command option "-f r" can be used to monitor snapshot resynchronization performance. When the reported "RESYNC WRITES AVG(ms)" value is high, for example over 3000 ms, resynchronization performance drops significantly.
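
The check described above can be automated with a small filter. The following is a minimal sketch, not a supported tool: the function name flag_slow_resync is hypothetical, and it assumes that volume rows have the form "vol NAME OPS BLOCKS AVG(ms)", as in the sample "vxstat -f r" output in this article.

```shell
# Print volume rows from "vxstat -f r" output whose RESYNC WRITES AVG(ms)
# exceeds a threshold in milliseconds (default 3000).
# Assumption: rows look like "vol NAME OPS BLOCKS AVG(ms)".
flag_slow_resync() {
    threshold_ms="${1:-3000}"
    awk -v limit="$threshold_ms" '
        $1 == "vol" && NF >= 5 && $5 + 0 > limit + 0 {
            printf "%s: resync write avg %.1f ms exceeds %d ms\n", $2, $5, limit
        }'
}
```

It could be used as "vxstat -g diskgroup_name -i 1 -f r volume_name | flag_slow_resync 3000".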

Cause

High resync latency is expected behaviour when there is heavy I/O contention on the system. The following is a general description of the snapshot synchronization process in DCO version 20.

Synchronization of a Data Change Object (DCO) region involves the following operations:
a.      Read the state of the region in the DCO page of the snapshot volume (may need I/O, depending on the VxVM Kernel Paging Module (volpagemod) memory size)
b.      Read the state of the region in the DCO page of the primary volume (may need I/O, depending on the VxVM Kernel Paging Module (volpagemod) memory size)
c.      Read the region data from the primary volume (needs I/O)
d.      Write the region data to the snapshot (needs I/O). The sync task reads the data from the primary volume and then tries to write it to the snapshot. At this stage it checks the map in the DCO to verify whether a pushed write (an application write operation) has already been done on the snapshot; if not, it writes the data atomically.
e.      Update the state of the region in the DCO of the snapshot volume (needs I/O)
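
As a rough illustration of why the volpagemod memory size matters, the steps above can be turned into simple arithmetic. This is a simplified sketch under stated assumptions: the function name resync_io_bounds is hypothetical, each step that needs I/O is counted as exactly one I/O operation, and lock contention is ignored.

```shell
# Lower and upper bounds on I/O operations needed to synchronize N regions.
# Steps c, d and e always need I/O (3 per region). Steps a and b need I/O
# only when the DCO page is not held in volpagemod memory (up to 2 more).
# Assumption: one I/O operation per step.
resync_io_bounds() {
    regions="$1"
    echo "min=$((regions * 3)) max=$((regions * 5))"
}
```

For example, "resync_io_bounds 1000" prints "min=3000 max=5000": under these assumptions, up to 40% of the I/O operations come from DCO page reads that sufficient volpagemod memory would avoid.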

Writes to a volume generally contend for some shared VxVM kernel locks, and writes to the same or overlapping regions contend for resources.

If application write operations are concentrated on the area where the snapshot synchronization is currently being performed, slow snapshot performance (high RESYNC WRITES AVG(ms)) can be observed.

Please note that because writes are generally distributed randomly, this contention should not normally occur very frequently.
 

Resolution

Tuning Proposals

- Increase the I/O size during snapshot synchronization

The Veritas VxVM engineering team also analyzed the effect of the synchronization I/O size on synchronization performance. The test results show that increasing the I/O size substantially improves performance. By default the synchronization I/O size is 1MB. As a first step, increase the synchronization I/O size to 4MB and monitor the synchronization performance. The I/O size can then be increased in steps of 4MB, up to 16MB, while monitoring performance, in order to find the optimal I/O size for your environment.

The synchronization I/O size can be specified by using the vxsnap command option "-o iosize=value". For example:

# vxsnap -g diskgroup_name -o iosize=4m refresh snapshot_volume_name sync=on

or

# vxsnap -g diskgroup_name refresh snapshot_volume_name sync=off
# vxsnap -g diskgroup_name -o iosize=4m syncstart snapshot_volume_name
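
To plan the stepwise tuning described above, the candidate commands can be listed in advance. The following is a dry-run sketch only: the function name print_iosize_trials is hypothetical, the disk group and snapshot volume names are placeholders, and the function only prints commands without running vxsnap. Each printed command is intended for a separate resynchronization cycle, after a refresh with sync=off.

```shell
# Dry run: print one syncstart command per candidate I/O size,
# starting at 4MB and increasing in steps of 4MB up to 16MB.
# Nothing is executed against VxVM.
print_iosize_trials() {
    dg="$1"        # disk group name (placeholder)
    snapvol="$2"   # snapshot volume name (placeholder)
    size_mb=4
    while [ "$size_mb" -le 16 ]; do
        echo "vxsnap -g $dg -o iosize=${size_mb}m syncstart $snapvol"
        size_mb=$((size_mb + 4))
    done
}
```

Running "print_iosize_trials diskgroup_name snapshot_volume_name" lists the commands for 4m, 8m, 12m and 16m, so that the "RESYNC WRITES AVG(ms)" value can be recorded for each size.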
 

- Adopt appropriate snapshot type according to service scenarios

If the snapshot will be used only after full synchronization, a better option is to create it using break-off snapshot operations. A break-off snapshot adds snapshot mirror(s) to the volume and does not have the contention issues that an instant snapshot has.

For break-off snapshots, it is suggested to upgrade to Veritas Storage Foundation 6.0 to take advantage of DCO version 30. Sample output for DCO version 30:

# vxprint -g diskgroup_name -m dco_name
dco dco_name
        tutil0="
        tutil1="
        tutil2="
        parent_vol=volisnap
        log_vol=volisnap_dcl
        comment="DCO for volisnap
        rid=0.1095
        putil0="
        putil1="
        putil2="
        p_flag_move=off
        badlog=off
        parent_vol_rid=0.1055
        log_vol_rid=0.1087
        sp_num=1
        version=30           <<<< DCO version 30
        dcoregionsz=128
        drlregionsz=128
        drlmapsz=2048
        drl=on
        sequentialdrl=off
        drllogging=on
        snap=volfmr3x_snp

With the new DRL design and asynchronous writes introduced in DCO version 30, greater throughput is expected, especially for random writes on large volumes. According to internal performance tests, some I/O scenarios improve by over 30% compared to the previous version.

Please note that in DCO version 30 the VxVM Kernel Paging Module (volpagemod) memory size is not significant, due to the new design of the DCO.


Additional Notes

Please note that when the snapshot performance issue occurs with DCO version 20, the first step is to check that the VxVM Kernel Paging Module (volpagemod) memory size is large enough. If it is not, increase it so that the VxVM kernel module has enough memory to work without accessing the on-disk DCO data. The volpagemod memory size is controlled by the VxVM tunable parameter volpagemod_max_memsz.

Please refer to SymWISE article 000029655 for how to calculate and tune the required memory size.
 


Applies To

Veritas Volume Manager using DCO version 20 on all platforms

Issue/Introduction

It is observed that Veritas Volume Manager (VxVM) snapshot resynchronization performance sometimes drops significantly with DCO version 20. The RESYNC WRITES AVG(ms) time shown in the "vxstat -f r" output can be more than 3 seconds.

# vxprint -g diskgroup_name -m dco_name | grep version
        version=20           <<<< DCO version 20

# vxstat -g datadg -i 1 -f r volisnap
                        RESYNC WRITES
TYP NAME                OPS    BLOCKS AVG(ms)

11 September 2012 10:15:38 AM
vol volisnap             1      2048 3030.0             <<< 3000ms (3 seconds) for each Resync Write Operation

11 September 2012 10:15:38 AM
vol volisnap             1      2048 3030.0

11 September 2012 10:15:38 AM
vol volisnap             1      2048 3020.0
....

Additional options can be added to the "vxstat" command to monitor the underlying I/O performance and confirm that the slow resynchronization is not caused by slow hardware. For example, the options "-f aoprs -vps" display the major performance statistics for volumes, plexes and subdisks:

# vxstat -g diskgroup_name -i 10 -f aoprs -vps
                         ATOMIC COPIES       READS FOR SNAPSHOTS       RESYNC WRITES           PUSH WRITES         OPERATIONS        BLOCKS       AVG TIME(ms)
TYP NAME               OPS  BLOCKS AVG(ms)    OPS  BLOCKS AVG(ms)    OPS  BLOCKS AVG(ms)    OPS  BLOCKS AVG(ms)     READ  WRITE     READ  WRITE     READ  WRITE
dm  529logbpdn           0       0     0.0      0       0     0.0      0       0     0.0      0       0     0.0        0      4        0    512      0.0   25.0
vol PFIbatchprog         0       0     0.0      0       0     0.0      1    2048  3010.0      0       0     0.0        0      0        0      0      0.0    0.0
sd  529logbpdn-01        0       0     0.0      0       0     0.0      0       0     0.0      0       0     0.0        0      4        0    512      0.0   25.0

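In the above vxstat output, the "RESYNC WRITES AVG(ms)" value for the volume is 3010.0 ms, while the write "AVG TIME(ms)" for the disk media (dm) and subdisk (sd) is only 25 ms. This indicates that the slow resynchronization is not caused by slow hardware.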
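
The comparison above can also be summarized with a small filter. This is a minimal sketch, not a supported tool: the function name compare_resync_to_disk is hypothetical, and it assumes the column layout of the sample "vxstat -f aoprs -vps" output in this article, where the RESYNC WRITES AVG(ms) value is the 11th field and the write AVG TIME(ms) value is the 20th field.

```shell
# Report the highest RESYNC WRITES AVG(ms) seen on volume rows and the
# highest write AVG TIME(ms) seen on disk media (dm) rows.
# Assumption: field positions match the sample output in this article
# (field 11 = resync write average, field 20 = write average time).
compare_resync_to_disk() {
    awk '
        BEGIN { resync = 0; disk = 0 }
        $1 == "vol" && NF >= 20 && $11 + 0 > resync { resync = $11 + 0 }
        $1 == "dm"  && NF >= 20 && $20 + 0 > disk   { disk = $20 + 0 }
        END {
            printf "max resync write avg: %.1f ms, max disk write avg: %.1f ms\n", resync, disk
        }'
}
```

A large gap between the two values, as in the sample output (3010.0 ms against 25.0 ms), points to contention inside VxVM rather than to the underlying disks.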