Slow Cluster File System (CFS) file information access when running commands like ls and du


Article ID: 100031805

Description

1. Whether the inodes are cached in the kernel memory or not

If the inodes are still on disk and not cached in memory, inode access is slow because VxFS/CFS must read them from disk. Once the inodes are cached in memory, access is much faster. The performance difference can be significant: disk access times are measured in milliseconds, memory access times in nanoseconds.

To get a general idea of whether the inodes are cached, use the vxfsstat command to check the vxi_icache_inuseino counter.
 

# vxfsstat -v | grep ino
 
For example,

rhel7vm10# vxfsstat -v /volmanyfiles/ | grep ino
....
vxi_icache_curino            245094    vxi_iaccess                 2156902
vxi_icache_inuseino             100    vxi_icache_maxino            398044
vxi_icache_peakino           398044    vxi_bcache_recycleage        114442
 


In the above example output, vxi_icache_inuseino is 100, which means that only 100 inodes are currently cached in kernel memory. The vxfsstat -v output reports global values that are not specific to a particular file system. As a result, vxi_icache_inuseino gives only a general idea of how many inodes are currently cached in memory. A high number does not necessarily mean that the cached inodes belong to the file system you are interested in.
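As a minimal sketch, the two counters discussed above can be pulled out of the vxfsstat -v output and expressed as a percentage. The function name icache_usage is hypothetical and not part of VxFS. It assumes the "name value name value" layout shown in the example output.

```shell
# icache_usage (hypothetical helper): reads `vxfsstat -v` output on stdin and
# reports how much of the inode cache is in use.
icache_usage() {
    awk '
        {
            # Counters are printed as "name value" pairs, two pairs per line.
            for (i = 1; i < NF; i += 2) {
                if ($i == "vxi_icache_inuseino") inuse = $(i + 1)
                if ($i == "vxi_icache_maxino")   maxino = $(i + 1)
            }
        }
        END {
            if (maxino > 0)
                printf "inuse=%d max=%d (%.1f%% of maximum)\n", inuse, maxino, inuse * 100 / maxino
        }'
}
```

Usage: vxfsstat -v /volmanyfiles | icache_usage. Because the counters are global, the result still describes all VxFS file systems on the node, not only the one named on the command line.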

While the inodes are being accessed, we can use vxfsstat -i to monitor the inode cache hit rate. This gives a more accurate indication of whether the inodes we need are cached.

# vxfsstat -i



Example:

The following command accesses the information of every inode in the specified directory. Since vxi_icache_inuseino is close to 0 (100 in the output above), almost no inodes are cached. The command runs for a longer time because the inodes have to be read directly from the disk.
 

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m40.047s  << It took 40 seconds to read the inodes from disk; this is the disk speed
user    0m1.572s
sys     0m9.643s
 



While VxFS is reading the inodes from disk, the inode cache hit rate is 0% because each inode has to be read from the disk, and can't be found in the inode cache.
 

# vxfsstat -i -t 5 /volmanyfiles/
11:22:31.103 Thu 21 Jan 2016 11:22:31 AM AEDT -- absolute sample    <<< first record is absolute data
....
11:22:56.109 Thu 21 Jan 2016 11:22:56 AM AEDT -- delta (5.001 sec sample)    <<< subsequent data is delta between sampling
Lookup, DNLC & Directory Cache Statistics
        0 maximum entries in dnlc
        0 total lookups             0.00% fast lookup
        0 total dnlc lookup         0.00% dnlc hit rate
        0 total enter               0.00  hit per enter
        0 total dircache setup      0.00  calls per setup
        0 total directory scan      0.00% fast directory scan
inode cache statistics
   393710 inodes current    398044 peak               398044 maximum
        0 lookups             0.00% hit rate        <<< 0% hit rate
        0 inodes alloced          0 freed
    36594 sec recycle age [not limited by maximum]
      600 sec free age
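The inode cache hit rate line in the vxfsstat -i output above can be extracted with a small filter, which is convenient when sampling with -t. This is a sketch only: inode_hit_rate is a hypothetical name, and the parsing assumes the "inode cache statistics" layout shown in this article.

```shell
# inode_hit_rate (hypothetical helper): reads `vxfsstat -i` output on stdin and
# prints "<lookups> <hit rate %>" for each inode cache sample it finds.
inode_hit_rate() {
    awk '
        /inode cache statistics/ { in_icache = 1; next }
        in_icache && $2 == "lookups" {
            rate = $3
            sub(/%$/, "", rate)      # "100.00%" -> "100.00"
            print $1, rate
            in_icache = 0
        }'
}
```

Usage: vxfsstat -i -t 5 /volmanyfiles | inode_hit_rate. Read the rate together with the lookup count, since a sample with very few lookups says little about how well the cache is working.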
 



After the initial access of the inodes finished, we can check vxi_icache_inuseino for the number of cached inodes.
 

# vxfsstat -v -t 10 /volmanyfiles/  | awk '$2 != 0 || $4 != 0 {print}'

# vxfsstat -v /volmanyfiles/ | egrep 'ino'
....
vxi_icache_curino            393710    vxi_iaccess                 2551685
vxi_icache_inuseino          373167    vxi_icache_maxino            398044   <<< inuseino is now 373167
vxi_icache_peakino           398044    vxi_bcache_recycleage          2171
....


Now, if we access the inodes again, it will take much less time because they are all cached in memory already.
 

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m2.202s         <<< 2 seconds; this is the memory access speed.
user    0m0.771s
sys     0m1.367s

The corresponding vxfsstat -i output.

# vxfsstat -i -t 5 /volmanyfiles/
11:30:41.574 Thu 21 Jan 2016 11:30:41 AM AEDT -- delta (5.002 sec sample)
Lookup, DNLC & Directory Cache Statistics
        0 maximum entries in dnlc
    56622 total lookups             0.00% fast lookup
   394380 total dnlc lookup        85.64% dnlc hit rate
        0 total enter           337758.00  hit per enter
        0 total dircache setup      0.00  calls per setup
    56622 total directory scan      0.00% fast directory scan
inode cache statistics
   393710 inodes current    398044 peak               398044 maximum
    57010 lookups           100.00% hit rate              <<< 100% hit rate
        0 inodes alloced          0 freed
      442 sec recycle age [not limited by maximum]
      600 sec free age
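To put the two timed runs above side by side (40.047 seconds from disk, 2.202 seconds from the inode cache), the "real" values can be converted to seconds and divided. The helper names to_seconds and speedup are hypothetical and exist only for this illustration.

```shell
# to_seconds (hypothetical helper): converts the "XmY.YYYs" format printed by
# the shell time keyword into seconds.
to_seconds() {
    echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# speedup COLD WARM (hypothetical helper): prints how many times faster the
# warm (cached) run was than the cold (from disk) run.
speedup() {
    awk -v cold="$(to_seconds "$1")" -v warm="$(to_seconds "$2")" \
        'BEGIN { if (warm > 0) printf "%.1f\n", cold / warm }'
}
```

Usage: speedup 0m40.047s 0m2.202s prints 18.2, that is, the cached run in this example was roughly 18 times faster.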
 


 

2. Whether the CFS inodes have masterless locks or normalized locks

In a CFS environment, when an inode is accessed for the first time cluster-wide, it is cached in kernel memory with a "masterless" lock. Masterless means that the inode is treated, in effect, as not being part of the CFS cluster. An inode with a masterless lock behaves like an inode of a locally mounted file system, and the inode access performance is similar as well.

But once an inode is accessed from more than one CFS cluster node at the same time, CFS normalizes the corresponding lock. From then on the inode participates in the CFS locking mechanism, and the inode access performance is constrained by that mechanism. Global Lock Manager (GLM) is the CFS component that manages cluster-wide locking.

The following command can be used to check the number of times the CFS functions are called to create masterless locks or called to normalize previously masterless locks.

# vxfsstat -s | grep master

Please note that the vxfsstat -s output is file system specific, so you have to specify the corresponding mount point in order to get the parameter for that particular file system.

Continuing with the example in the first section, we accessed the files from only one node in the CFS cluster. Since this is the first time the inodes were accessed, they are cached in the kernel memory with masterless locks.

From node 0 in the CFS cluster:

# vxfsstat -s /volmanyfiles/  | grep master
vxi_masterless_locks         393641    vxi_normalize_locks              31    <<< CFS function was called 393641 times to create masterless locks
 

On node 1 in the CFS cluster, no activity yet.
 

# vxfsstat -s /volmanyfiles | grep master
vxi_masterless_locks              2    vxi_normalize_locks               0
 

If we access the inodes from node 0 (one with inodes cached) and provided that the inodes are still in the inode cache, the performance will be very good (reading from memory).
 

# vxfsstat -v /volmanyfiles/ | egrep 'inuse'
vxi_icache_inuseino          373167    vxi_icache_maxino            398044   
 

Note: The above vxi_icache_inuseino only shows that 373167 inodes are cached. They are not necessarily from the file system we are interested in. Check the vxfsstat -i inode cache hit rate during the inode access to confirm whether our inodes are cached.
 
# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m1.870s     <<< memory access speed
user    0m0.810s
sys     0m1.086s
 

 
Accessing the inodes from another node in the cluster for the first time is one of the "bad" scenarios for CFS inode access. The performance here is hit by two important factors.

First, the inodes are not cached in the memory yet, and they need to be read from the disk (with much slower performance than memory). Check this with the vxfsstat -v command.
 

# vxfsstat -v /volmanyfiles | grep inuse
vxi_icache_inuseino             103    vxi_icache_maxino            398044      <<< currently only 103 inodes are in memory
 

Secondly, CFS now has to normalize the locks cluster-wide. This involves a lot of Global Lock Manager (GLM) messaging. The GLM locking activity can be monitored with the vxfsstat -s command. Because vxfsstat -s is specific to a CFS file system, you need to specify the corresponding mount point to get the data.

# vxfsstat -s | egrep 'hlock|glock' | egrep 'grant|revoke'
Example:
# vxfsstat -s /volmanyfiles | egrep 'hlock|glock' | egrep 'grant|revoke'
vxi_hlockgrant               393643    vxi_hlockrevoke                 694        <<< hlock grant and revoke
....
vxi_rwlock_iupdat                 0    vxi_glockgrant               393222       <<< glock grant
vxi_glockrevoke                   0    vxi_glockgrant_pbhit              0           <<< glock revoke
....
 

 
GLM communicates, cluster-wide, using the GAB (Global Atomic Broadcast) protocol. The GAB messages are transported using the LLT driver. As a result, the performance of the lock normalization will highly depend on the performance of the LLT (Low Latency Transport) links. GLM uses GAB port f for the communication, which will map to the LLT port 5.  We can use the lltshow -p 5 command to monitor the LLT performance.
 
# /opt/VRTSllt/lltshow -p 5
=== LLT port 5:
.....
txrate=3/0/0 pkts, 0/0/0 KB per s/10s/30s (0.00 Gb/sec)           <<< transmit rate
....
txlatency dist (in millisec):                            <<< transmit latency; the lower the better
   7744097 (0 ms)     28686 (1 ms)      7175 (2 ms)      2046 (3 ms)
       916 (4 ms)       196 (5 ms)       357 (6 ms)       285 (7 ms)
         0 (8 ms)        43 (9 ms)         0 (10ms)         0 (11ms)
         0 (12ms)         0 (13ms)         0 (14ms)         0 (15ms)
         3 (>=16ms)
....
rxrate=3/0/0 pkts, 0/0/0 KB per s/10s/30s (0.00 Gb/sec)      <<< receive rate
rxlatency dist (in millisec):                                      <<< receive latency
   7606116 (0 ms)     32625 (1 ms)      8499 (2 ms)      2129 (3 ms)
       604 (4 ms)       107 (5 ms)        18 (6 ms)       101 (7 ms)
         3 (8 ms)        16 (9 ms)        11 (10ms)         0 (11ms)
         0 (12ms)         0 (13ms)         1 (14ms)         1 (15ms)
       936 (>=16ms)
....
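The latency distribution above is easier to compare between nodes or links when reduced to two numbers: how many packets fell outside the 0 ms bucket, and how many packets were counted in total. The following is a sketch that assumes the lltshow layout shown in this article. The function name llt_latency_summary is hypothetical.

```shell
# llt_latency_summary tx|rx (hypothetical helper): reads `lltshow -p 5` output
# on stdin and prints "<packets at 1 ms or more> <total packets>" for the
# chosen direction.
llt_latency_summary() {
    awk -v dir="$1" '
        $1 == (dir "latency") { in_dist = 1; next }     # "txlatency dist ..." header
        in_dist && $0 !~ /ms\)/ { in_dist = 0 }         # first line without a bucket ends the table
        in_dist {
            n = split($0, parts, ")")
            for (i = 1; i <= n; i++) {
                # Each part looks like "   28686 (1 ms" or "   3 (>=16ms".
                if (split(parts[i], kv, "(") != 2) continue
                bucket = kv[2]
                gsub(/[ ms]/, "", bucket)               # "0 ms" -> "0", ">=16ms" -> ">=16"
                total += kv[1]
                if (bucket != "0") slow += kv[1]
            }
        }
        END { printf "%d %d\n", slow, total }'
}
```

Usage: /opt/VRTSllt/lltshow -p 5 | llt_latency_summary tx. For the transmit table in the example above this works out to 39707 of 7783804 packets, or about 0.5 percent, at 1 ms or more.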


 
We can also use glmstat to monitor the GLM lock activity. Statistics collection is enabled with the glmstat -e command and disabled with the glmstat -d command. Once collection is enabled, the statistics can be printed with the glmstat -m command.

Example:

# glmstat -e

# glmstat -m
         message     all      rw       g      pg       h     buf     oth    loop
master send:
           GRANT       9       0       0       0       0       0       9       6            <<< GRANT
          REVOKE       9       0       0       0       0       0       9       3          <<< REVOKE
        subtotal      18       0       0       0       0       0      18       9

master recv:
            LOCK       9       0       0       0       0       0       9       6
         RELEASE       9       0       0       0       0       0       9       3
        subtotal      18       0       0       0       0       0      18       9

    master total      36       0       0       0       0       0      36      18

proxy send:
            LOCK       6       0       0       0       0       0       6       6
         RELEASE       3       0       0       0       0       0       3       3
        subtotal       9       0       0       0       0       0       9       9

proxy recv:
           GRANT       6       0       0       0       0       0       6       6
          REVOKE       3       0       0       0       0       0       3       3
        subtotal       9       0       0       0       0       0       9       9

     proxy total      18       0       0       0       0       0      18      18

recovery send:
        subtotal       0                                                       0

recovery recv:
        subtotal       0                                                       0

  recovery total       0                                                       0

      send total      27       0       0       0       0       0      27      18
      recv total      27       0       0       0       0       0      27      18
           total      54       0       0       0       0       0      54      36

# glmstat -d

 
Continuing with our example, we access the inode from another node in the cluster, for the first time.

Node 0:

# vxfsstat -v /volmanyfiles/ | egrep 'inuse'
vxi_icache_inuseino          373158    vxi_icache_maxino            398044     <<< 373158 inodes are cached

# vxfsstat -s /volmanyfiles  | grep master
vxi_masterless_locks         393641    vxi_normalize_locks              31    <<< up till now, the CFS function was called 393641 times to create masterless locks, and only 31 locks were normalized
 

 
Node 1:

# vxfsstat -v /volmanyfiles | grep inuse
vxi_icache_inuseino             103    vxi_icache_maxino            398044

# vxfsstat -s /volmanyfiles | grep master
vxi_masterless_locks              2    vxi_normalize_locks               0


# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    2m43.470s        <<< 2 minutes 43 seconds
user    0m2.207s
sys     0m13.954s
 

 
The command took much longer to finish because CFS had to normalize the locks. Part of the reason for the slow performance is that the machines have slow LLT links, as shown below. The LLT links on the test machines transfer only about 3 to 4 MB per second.
 

# while :
> do
> /opt/VRTSllt/lltshow -p 5 | grep xrate
> sleep 5
> done
txrate=13868/12683/12171 pkts, 4363/4059/3905 KB per s/10s/30s (0.03 Gb/sec)
rxrate=13994/13811/13341 pkts, 3060/2997/2895 KB per s/10s/30s (0.02 Gb/sec)
txrate=12956/12401/12140 pkts, 4086/3970/3894 KB per s/10s/30s (0.03 Gb/sec)
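The "3 to 4 MB per second" figure comes from the KB values on the txrate and rxrate lines. A small filter can do the conversion while the loop above is running. This is a sketch: llt_rate_mb is a hypothetical name, and it assumes that 1 MB is 1024 KB and that the first of the three values is the 1-second average, as the "per s/10s/30s" label suggests.

```shell
# llt_rate_mb (hypothetical helper): reads lltshow output on stdin and prints
# the 1-second tx/rx throughput in MB/s.
llt_rate_mb() {
    awk '
        /^(tx|rx)rate=/ {
            split($1, name, "=")     # "txrate=13868/12683/12171" -> "txrate"
            split($3, kb, "/")       # "4363/4059/3905" -> 1s, 10s and 30s values
            printf "%s %.2f MB/s\n", name[1], kb[1] / 1024
        }'
}
```

Usage: /opt/VRTSllt/lltshow -p 5 | llt_rate_mb. For the first txrate line above it prints "txrate 4.26 MB/s".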
 

 
The vxfsstat -s output shows significant lock grant and revoke activity.

Node 0:

# vxfsstat -s /volmanyfiles/ -t 10 | awk '$2 != 0 || $4 != 0 {print}'
13:06:58.001 Thu 21 Jan 2016 01:06:58 PM AEDT -- absolute sample       <<< absolute
                        <<< first record is absolute value; accumulated value since file system was mounted on this node
.....
13:14:45.833 Thu 21 Jan 2016 01:14:45 PM AEDT -- delta (10.000 sec sample)    <<< delta value  per 10 seconds
vxi_hlockgrant                    0    vxi_hlockrevoke               46064           <<< revoked 46064  h locks on node 0 in 10 seconds
vxi_masterless_locks              0    vxi_normalize_locks           23043     <<< normalized 23043 locks
....
vxi_glockrevoke               23022    vxi_glockgrant_pbhit              0     <<< revoked 23022 g locks
 

 
Node 1:
# vxfsstat -s /volmanyfiles/ -t 10 | awk '$2 != 0 || $4 != 0 {print}'
13:07:47.238 Thu 21 Jan 2016 01:07:47 PM AEDT -- absolute sample              <<< absolute accumulated values since mount
                        <<< first record is absolute value; accumulated value since file system was mounted on this node
13:14:57.897 Thu 21 Jan 2016 01:14:57 PM AEDT -- delta (10.000 sec sample)    <<< delta value per 10 seconds
vxi_recv_open_mbr                 0    vxi_hlock_init                23875         <<< initialized 23875 h lock in 10 seconds on node 1
vxi_hlockgrant                47725    vxi_hlockrevoke                   0           <<< granted 47725 h locks
....
vxi_rwlockgrant               23875    vxi_rwlockrevoke                  0          <<< granted 23875 rw locks
....
vxi_rwlock_iupdat                 0    vxi_glockgrant                23850           <<< granted 23850 g locks
....
vxi_staleowner                    0    vxi_strong_pullowner          23850        <<< pulled the ownership of 23850 inodes from node 0
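The delta samples above also allow a rough plausibility check of the elapsed time. If Node 0 normalizes about 23043 locks per 10-second sample, then normalizing the 393641 masterless locks would take on the order of 171 seconds, which is in the same range as the 2 minutes 43 seconds measured. The sketch below does that arithmetic. normalize_eta is a hypothetical name, and the estimate assumes the rate of a single sample stays constant for the whole run.

```shell
# normalize_eta TOTAL PER_SAMPLE INTERVAL_SEC (hypothetical helper): estimates
# how many seconds TOTAL lock normalizations take if PER_SAMPLE of them
# complete in every INTERVAL_SEC-second sample.
normalize_eta() {
    awk -v total="$1" -v per="$2" -v secs="$3" \
        'BEGIN { if (per > 0) printf "%.0f\n", total * secs / per }'
}
```

Usage: normalize_eta 393641 23043 10 prints 171.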
 

 
After the locks are normalized, we have the following vxfsstat data.

Node 0 has the inodes cached in memory, and the locks are normalized. 
 

# vxfsstat -v /volmanyfiles/ | egrep 'inuse'
vxi_icache_inuseino          310772    vxi_icache_maxino            398044

# vxfsstat -s /volmanyfiles  | grep master
vxi_masterless_locks         393641    vxi_normalize_locks          393635   <<< number of times CFS functions called so far
 

 

Node 1 now also has the inodes cached in memory, but it did not need to create any masterless locks or normalize any locks. The creation of the masterless locks and their normalization were done on Node 0.
 

# vxfsstat -v /volmanyfiles | egrep 'inuse'
vxi_icache_inuseino          373166    vxi_icache_maxino            398044

# vxfsstat -s /volmanyfiles | grep master
vxi_masterless_locks              2    vxi_normalize_locks               0
 

 

3. Whether lazy_isize_enable is enabled or not

Once the CFS inode locks are normalized, each time the information of an inode with a normalized lock is accessed, CFS has to go through the CFS/GLM protocol to get accurate data. This involves sending messages through GAB and LLT.

Before Veritas Storage Foundation 6.1, every time a CFS inode is accessed on a node, CFS has to take two sets of locks on the inode while taking the ownership. This involves a lot of GLM lock grants and revokes, as shown below.

Continuing with the above example, Node 1 just pulled the ownership from Node 0 according to the previous vxfsstat -s output.


Node 0:


# vxfsstat -s /volmanyfiles  | grep vxi_strong_pullowner
vxi_staleowner                    0    vxi_strong_pullowner              2 
 

    

Node 1:


# vxfsstat -s /volmanyfiles | grep vxi_strong_pullowner
vxi_staleowner                    0    vxi_strong_pullowner         393228   <<< number of times CFS function called to pull the ownership
 


 

Since now Node 1 has the ownership, the inode access on Node 1 will be quick.

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m1.663s
user    0m0.714s
sys     0m0.985s
 


 

Over time, the locks are revoked automatically (but slowly), and the inode access performance on this node degrades accordingly.

Node 1:

14:48:38.984 Thu 21 Jan 2016 02:48:38 PM AEDT -- delta (10.000 sec sample)
...
vxi_hlockgrant                    0    vxi_hlockrevoke                  50                 <<< about 50 revokes in this 10-second sample
...
vxi_rwlockgrant                   0    vxi_rwlockrevoke                 50
 


 

Now go back to Node 0. Since the inodes are now owned by Node 1, accessing the inodes from Node 0 will be slow because Node 0 has to take the ownership back.

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    1m49.765s
user    0m2.298s
sys     0m13.438s
 

 

Node 0:

# vxfsstat -s /volmanyfiles/ -t 10 | awk '$2 != 0 || $4 != 0 {print}'
14:52:35.351 Thu 21 Jan 2016 02:52:35 PM AEDT -- delta (10.000 sec sample)
vxi_hlockgrant                    0    vxi_hlockrevoke               32296           <<<  h lock revoke
...
vxi_glockrevoke               32296    vxi_glockgrant_pbhit              0        <<< g lock revoke
....
 


 

Node 1:

# vxfsstat -s -t 10 /volmanyfiles/ | awk '$2 != 0 || $4 != 0 {print}'
14:52:49.006 Thu 21 Jan 2016 02:52:49 PM AEDT -- delta (10.000 sec sample)
vxi_recv_open_mbr                 0    vxi_hlock_init                30652           <<< h lock init
vxi_hlockgrant                61305    vxi_hlockrevoke               30652          <<< h lock grant / revoke
.....
vxi_inode_btranidflush            0    vxi_hlock_deinit              30652         <<< h lock deinit
vxi_rwlockgrant               30653    vxi_rwlockrevoke              30652          <<< rw lock grant / revoke
.....
vxi_rwlock_iupdat                 0    vxi_glockgrant                30653             <<< g lock grant
vxi_glockrevoke                   0    vxi_glockgrant_pbhit          30653
.....
vxi_staleowner                    0    vxi_strong_pullowner          30653           <<< pull ownership
 


 

Once Node 0 has the locks back, inode access will be fast on Node 0. But if we access the inodes from Node 1 again, the same lock revoke and grant activity will start all over again.

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m1.720s
user    0m0.684s
sys     0m1.067s
 


 

To minimize the lock grants and revokes when accessing inode information, the lazy_isize_enable tunable was introduced in SFCFS (Storage Foundation Cluster File System) 6.1.

The lazy_isize_enable tunable enables or disables a performance optimization in Cluster File System. When one node in a cluster is extending a file, the optimization is to not reflect the updated file size immediately on the other nodes. Note that if this tunable is enabled, the file size reported by stat might be stale, but this has no impact on other file operations. You can specify the following values for lazy_isize_enable:

              0  Disables the performance optimization
              1  Enables the performance optimization

The default value of lazy_isize_enable is 0.

Caution :

Do not turn on this tunable if the application runs on Cluster File System, the file is written from multiple nodes in the cluster, and the write offset is based on the size of the file (determined using stat()). With the tunable turned on, stat() may return a stale file size, and the application may end up writing to the wrong offset.

lazy_isize_enable can be turned on by using the vxtunefs command.


# vxtunefs /volmanyfiles | grep lazy_isize
lazy_isize_enable = 0
 




Turning on lazy_isize_enable avoids the g lock grants and revokes, which improves performance.
 

# vxtunefs -o lazy_isize_enable=1 /volmanyfiles
UX:vxfs vxtunefs: INFO: V-3-22525: Parameters successfully set for /volmanyfiles

# vxtunefs /volmanyfiles | grep lazy_isize
lazy_isize_enable = 1

# vxtunefs -o lazy_isize_enable=1 /volmanyfiles
UX:vxfs vxtunefs: INFO: V-3-22525: Parameters successfully set for /volmanyfiles

# vxtunefs /volmanyfiles | grep lazy_isize_enable
lazy_isize_enable = 1
 


 

To make the parameter persistent across system reboots, you have to specify it in /etc/vx/tunefstab. Refer to the vxtunefs manual page for details.

Continuing with the above example, the inode locks now go back to Node 0.

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m4.379s
user    0m0.837s
sys     0m2.355s
 

 

Now with lazy_isize_enable turned on, we access the inodes from Node 1.

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m40.971s
user    0m1.920s
sys     0m5.303s


 

Node 0:

# vxfsstat -s -t 10 /volmanyfiles/ | awk '$2 != 0 || $4 != 0 {print}'
18:08:50.854 Thu 21 Jan 2016 06:08:50 PM AEDT -- delta (10.000 sec sample)
.....
vxi_hlockgrant                    0    vxi_hlockrevoke               97190       <<< there are still h lock revoke, but no more g lock revoke
....
 


 

Node 1:

# vxfsstat -s /volmanyfiles/ -t 10 | awk '$2 != 0 || $4 != 0 {print}'
18:08:41.829 Thu 21 Jan 2016 06:08:41 PM AEDT -- delta (10.000 sec sample)
vxi_hlockgrant                95810    vxi_hlockrevoke                   0        <<< h lock grant but no g lock grant
....
vxi_staleowner                    0    vxi_strong_pullowner          95840    <<< pull ownership
......
 


 

4. Accessing the inode information from more than one node in parallel

Try to avoid accessing the inodes from more than one node at a time. This may cause the locks to be passed back and forth.

For example, we access the inodes from two systems, at the same time, with lazy_isize_enable turned on.
# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m54.465s
user    0m2.192s
sys     0m6.993s

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m54.356s
user    0m2.306s
sys     0m7.277s
 

 
Node 0:
# vxfsstat -s -t 10 /volmanyfiles/ | awk '$2 != 0 || $4 != 0 {print}'
18:21:26.254 Thu 21 Jan 2016 06:21:26 PM AEDT -- delta (10.000 sec sample)
vxi_hlockgrant                73310    vxi_hlockrevoke               72645       <<< h lock grant and revoke at the same time
.....
vxi_staleowner                    0    vxi_strong_pullowner          73310
 

 
Node 1:

# vxfsstat -s /volmanyfiles/ -t 10 | awk '$2 != 0 || $4 != 0 {print}'
18:21:28.807 Thu 21 Jan 2016 06:21:28 PM AEDT -- delta (10.000 sec sample)
vxi_hlockgrant                72432    vxi_hlockrevoke               72661           <<<
.....
vxi_staleowner                    0    vxi_strong_pullowner          72443
 


 
With lazy_isize_enable turned off, the performance is worse.
 

# vxtunefs -o lazy_isize_enable=0 /volmanyfiles
UX:vxfs vxtunefs: INFO: V-3-22525: Parameters successfully set for /volmanyfiles

# vxtunefs -o lazy_isize_enable=0 /volmanyfiles
UX:vxfs vxtunefs: INFO: V-3-22525: Parameters successfully set for /volmanyfiles

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m54.747s
user    0m2.257s
sys     0m6.752s

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    1m29.402s
user    0m2.369s
sys     0m9.597s
 


 
Node 0:
# vxfsstat -s -t 10 /volmanyfiles/ | awk '$2 != 0 || $4 != 0 {print}'
18:25:56.487 Thu 21 Jan 2016 06:25:56 PM AEDT -- delta (10.000 sec sample)
vxi_hlockgrant                73358    vxi_hlockrevoke               38365               <<< h lock grant / revoke
.....
vxi_glockrevoke               38364    vxi_glockgrant_pbhit              0           <<< g lock revoke
....
vxi_staleowner                    0    vxi_strong_pullowner          73358

Node 1:
# vxfsstat -s /volmanyfiles/ -t 10 | awk '$2 != 0 || $4 != 0 {print}'
18:25:57.716 Thu 21 Jan 2016 06:25:57 PM AEDT -- delta (10.001 sec sample)
vxi_hlockgrant                38436    vxi_hlockrevoke               73576           <<<  h lock grant / revoke
.....
vxi_rwlock_iupdat                 0    vxi_glockgrant                38436       <<< g lock grant
.....
vxi_staleowner                    0    vxi_strong_pullowner          38436
 


 

General recommendations for CFS tuning

The following tunings are recommended to improve CFS inode access performance.

1. Mount the CFS file system with noatime option.

2. Mount the CFS file system with nomtime option.

3. Turn on lazy_isize_enable using vxtunefs.

Please refer to the mount_vxfs and vxtunefs manual pages for details on the above parameters.

If only the file system space usage is required, and not the space usage of a particular directory, please consider using df instead of du. The df command obtains the required information without going through every single inode in the file system.


 

Tuning for LLT Flow Control

If the speed of the LLT links is 1 Gb/s or above, please tune the following LLT Flow Control parameters to take full advantage of the available bandwidth.


# lltconfig -F lowwater:8000
# lltconfig -F highwater:10000
# lltconfig -F rportlowwater:8000
# lltconfig -F rporthighwater:10000
# lltconfig -F window:5000

# lltconfig -F query
Current LLT flow control values (in packets):
  lowwater  = 8000
  highwater = 10000
  rportlowwater  = 8000
  rporthighwater = 10000
  window    = 5000

To make the setting persistent across system reboots, please add the following lines to the /etc/llttab file.

set-flow lowwater:8000
set-flow highwater:10000
set-flow rportlowwater:8000
set-flow rporthighwater:10000
set-flow window:5000

Please note that if high Snd retransmit data is observed in the lltstat output after tuning these parameters, the value of the window parameter should be reduced until the retransmission rate (Snd retransmit data / Snd data packets) is less than 0.1%. The other parameters need not be changed.

# lltstat
LLT statistics:
    10189592   Snd data packets
    4                 Snd retransmit data     
....
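The retransmission rate described above can be computed directly from the two lltstat counters. This is a minimal sketch, assuming the counter labels appear exactly as in the example output. llt_retransmit_pct is a hypothetical name.

```shell
# llt_retransmit_pct (hypothetical helper): reads lltstat output on stdin and
# prints Snd retransmit data / Snd data packets as a percentage.
llt_retransmit_pct() {
    awk '
        /Snd data packets/    { data = $1 }
        /Snd retransmit data/ { retrans = $1 }
        END { if (data > 0) printf "%.4f\n", retrans * 100 / data }'
}
```

Usage: lltstat | llt_retransmit_pct. A printed value of 0.1000 or more corresponds to the 0.1% threshold at which the window parameter should be reduced. In the example output above (4 retransmits out of 10189592 packets) the rate is far below that threshold.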
 



 

Other Performance Monitoring Tools

Apart from monitoring the LLT performance using lltshow -p 5 as mentioned above, the gabshow.pl output can also be used to check whether GAB message delivery is affected by LLT network flow control.
 
# /opt/VRTSgab/gabshow.pl -port f flowcontrol
============================================================
Send side
                gp_xm_flwfrd(null)
                gp_xm_flwbck(null)
                gp_xm_flwctl         0
                gp_xm_flwsetcnt      0  <<< number of times transmit flow control is called
                gp_xm_flwclrcnt      0  <<< number of times transmit flow control is cleared
Receive side
                gp_rv_flwctl         0
                gp_rv_flwsetcnt      0  <<< receive flow control set
                gp_rv_flwclrcnt      0  <<< receive flow control clear
 

 
OS-native network monitoring tools can also be used to check the health of the network interfaces. For example, on Linux, the netstat -i command can be used to check for network interface errors.
 

# netstat -i
Kernel Interface table
Iface      MTU      RX-OK    RX-ERR RX-DRP    RX-OVR        TX-OK TX-ERR TX-DRP TX-OVR Flg
bond0     1500   19162882         0  85590         0     17556872      0      0      0 BMmRU
em1       1500    9460921         0      1         0      8789424      0      0      0 BMsRU
em2       1500    9701961         0      2         0      8767448      0      0      0 BMsRU
em3       1500 1986097475 223215861  83332 223215861   2157816519      0      0      0 BMRU
em4       1500 2002717440 199940042  83350 199940042   2157776051      0      0      0 BMRU
lo       65536      21149         0      0         0        21149      0      0      0 LRU

Please refer to the corresponding platform manual pages for details.  
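On a node with many interfaces, the netstat -i table can be filtered down to the interfaces that report errors. The sketch below finds the RX-ERR and TX-ERR columns by header name, since the column layout can differ between netstat versions. nic_errors is a hypothetical name.

```shell
# nic_errors (hypothetical helper): reads `netstat -i` output on stdin and
# prints the interfaces whose RX-ERR or TX-ERR counter is non-zero.
nic_errors() {
    awk '
        $1 == "Iface" {
            # Remember the position of every column named in the header line.
            for (i = 1; i <= NF; i++) col[$i] = i
            rx = col["RX-ERR"]; tx = col["TX-ERR"]
            next
        }
        rx > 0 && tx > 0 {
            if ($rx + 0 > 0 || $tx + 0 > 0)
                print $1, "RX-ERR=" $rx, "TX-ERR=" $tx
        }'
}
```

Usage: netstat -i | nic_errors. With the example output above, this lists em3 and em4, the two interfaces with large RX-ERR counts.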



 

2. Whether the CFS inodes are having masterless locks or normalized locks

In a CFS enviroment, when an inode is accessed the first time cluster-wide, the inodes are cached in the kernel memory with "masterless" locks. The meaning of masterless is that the inode will be treated, virtually, as not part of the CFS cluster. An inode with a masterless lock behaves similarly to an inode of a locally mounted file system, and the inode access performance will be similar as well.

But once an inode is accessed from more than one CFS cluster node, at the same time, CFS will normalize the corresponding lock. Now that inode will participate in the CFS locking mechanism, and the inode access performance will be constrained by the CFS locking mechanism. Global Lock Manager (GLM) is the CFS component which manages the CFS cluster-wide locking.  

The following command can be used to check the number of times the CFS functions are called to create masterless locks or called to normalize previously masterless locks.

# vxfsstat -s | grep master

Please note that the vxfsstat -s output is file system specific, so you have to specify the corresponding mount point in order to get the parameter for that particular file system.

Continuing with the example in the first section, we accessed the files from only one node in the CFS cluster. Since this is the first time the inodes were accessed, they are cached in the kernel memory with masterless locks.

From node 0 in the CFS cluster:

# vxfsstat -s /volmanyfiles/  | grep master
vxi_masterless_locks         393641    vxi_normalize_locks              31    <<< CFS function was called 393641 times to create masterless locks
 

On node 1 in the CFS cluster, no activity yet.
 

# vxfsstat -s /volmanyfiles | grep master
vxi_masterless_locks              2    vxi_normalize_locks               0
 

If we access the inodes from node 0 (one with inodes cached) and provided that the inodes are still in the inode cache, the performance will be very good (reading from memory).
 

# vxfsstat -v /volmanyfiles/ | egrep 'inuse'
vxi_icache_inuseino          373167    vxi_icache_maxino            398044   
 

Note: The above vxi_icache_inuseino only shows that there are 373167 inodes cached. They are not necessarily from the file system we are interested.  Check the vxfsstat -i inode cache hit rate during the inode access in order to confirm if our inodes are cached or not.
 
# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m1.870s     <<< memory access speed
user    0m0.810s
sys     0m1.086s
 

 
When the inodes are accessed from the another node in the cluster the first time, this is one of the "bad" scenarios for CFS inode access. The performance here is hit by two important factors.  

First, the inodes are not cached in the memory yet, and they need to be read from the disk (with much slower performance than memory). Check this with the vxfsstat -v command.
 

# vxfsstat -v /volmanyfiles | grep inuse
vxi_icache_inuseino             103    vxi_icache_maxino            398044      <<< currently only 103 inodes are in memory
 

Secondly, CFS now has to normalize the locks cluster-wide. This involves a lot of Global Lock Manager (GLM) messaging. The GLM locking activity can be monitored with the vxfsstat -s command. Because vxfsstat -s is file system specific, you need to specify the corresponding mount point to get the data.

# vxfsstat -s | egrep 'hlock|glock' | egrep 'grant|revoke'
Example:
# vxfsstat -s /volmanyfiles | egrep 'hlock|glock' | egrep 'grant|revoke'
vxi_hlockgrant               393643    vxi_hlockrevoke                 694        <<< hlock grant and revoke
....
vxi_rwlock_iupdat                 0    vxi_glockgrant               393222       <<< glock grant
vxi_glockrevoke                   0    vxi_glockgrant_pbhit              0           <<< glock revoke
....
 

 
GLM communicates cluster-wide using the GAB (Global Atomic Broadcast) protocol, and the GAB messages are transported by the LLT (Low Latency Transport) driver. As a result, the performance of lock normalization depends heavily on the performance of the LLT links. GLM uses GAB port f for its communication, which maps to LLT port 5. We can use the lltshow -p 5 command to monitor the LLT performance.
 
# /opt/VRTSllt/lltshow -p 5
=== LLT port 5:
.....
txrate=3/0/0 pkts, 0/0/0 KB per s/10s/30s (0.00 Gb/sec)           <<< transmit rate
....
txlatency dist (in millisec):                            <<< transmit latency; the lower the better
   7744097 (0 ms)     28686 (1 ms)      7175 (2 ms)      2046 (3 ms)
       916 (4 ms)       196 (5 ms)       357 (6 ms)       285 (7 ms)
         0 (8 ms)        43 (9 ms)         0 (10ms)         0 (11ms)
         0 (12ms)         0 (13ms)         0 (14ms)         0 (15ms)
         3 (>=16ms)
....
rxrate=3/0/0 pkts, 0/0/0 KB per s/10s/30s (0.00 Gb/sec)      <<< receive rate
rxlatency dist (in millisec):                                      <<< receive latency
   7606116 (0 ms)     32625 (1 ms)      8499 (2 ms)      2129 (3 ms)
       604 (4 ms)       107 (5 ms)        18 (6 ms)       101 (7 ms)
         3 (8 ms)        16 (9 ms)        11 (10ms)         0 (11ms)
         0 (12ms)         0 (13ms)         1 (14ms)         1 (15ms)
       936 (>=16ms)
....
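
To turn a latency distribution like the one above into a single number, the buckets can be parsed and the share of slow packets computed. This is a minimal sketch, assuming the "count (N ms)" bucket layout shown in this article. The function names and the 16 ms threshold are illustrative choices, not part of lltshow.

```python
import re

# Matches buckets such as "28686 (1 ms)", "0 (10ms)" and "936 (>=16ms)".
_BUCKET = re.compile(r"(\d+)\s+\((>=)?(\d+)\s*ms\)")


def parse_latency_buckets(text):
    """Return a list of (latency_ms, packet_count) tuples."""
    return [(int(ms), int(count)) for count, _, ms in _BUCKET.findall(text)]


def share_at_or_above(buckets, threshold_ms):
    """Fraction of packets whose bucket latency is >= threshold_ms."""
    total = sum(count for _, count in buckets)
    slow = sum(count for ms, count in buckets if ms >= threshold_ms)
    return slow / total if total else 0.0


# Receive latency distribution copied from the lltshow -p 5 example above.
rx_sample = """
   7606116 (0 ms)     32625 (1 ms)      8499 (2 ms)      2129 (3 ms)
       604 (4 ms)       107 (5 ms)        18 (6 ms)       101 (7 ms)
         3 (8 ms)        16 (9 ms)        11 (10ms)         0 (11ms)
         0 (12ms)         0 (13ms)         1 (14ms)         1 (15ms)
       936 (>=16ms)
"""
buckets = parse_latency_buckets(rx_sample)
slow_share = share_at_or_above(buckets, 16)
print(f"{slow_share:.4%} of received packets took 16 ms or longer")
```

Comparing this share between samples taken before and during a slow find or du run gives a rough indication of whether LLT latency is getting worse under lock traffic.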


 
We can also use glmstat to monitor the GLM lock activity. Statistics collection is enabled with the glmstat -e command and disabled with the glmstat -d command. Once it is enabled, the statistics can be printed with the glmstat -m command.

Example:

# glmstat -e

# glmstat -m
         message     all      rw       g      pg       h     buf     oth    loop
master send:
           GRANT       9       0       0       0       0       0       9       6            <<< GRANT
          REVOKE       9       0       0       0       0       0       9       3          <<< REVOKE
        subtotal      18       0       0       0       0       0      18       9

master recv:
            LOCK       9       0       0       0       0       0       9       6
         RELEASE       9       0       0       0       0       0       9       3
        subtotal      18       0       0       0       0       0      18       9

    master total      36       0       0       0       0       0      36      18

proxy send:
            LOCK       6       0       0       0       0       0       6       6
         RELEASE       3       0       0       0       0       0       3       3
        subtotal       9       0       0       0       0       0       9       9

proxy recv:
           GRANT       6       0       0       0       0       0       6       6
          REVOKE       3       0       0       0       0       0       3       3
        subtotal       9       0       0       0       0       0       9       9

     proxy total      18       0       0       0       0       0      18      18

recovery send:
        subtotal       0                                                       0

recovery recv:
        subtotal       0                                                       0

  recovery total       0                                                       0

      send total      27       0       0       0       0       0      27      18
      recv total      27       0       0       0       0       0      27      18
           total      54       0       0       0       0       0      54      36

# glmstat -d

 
Continuing with our example, we access the inodes from another node in the cluster for the first time.

Node 0:

# vxfsstat -v /volmanyfiles/ | egrep 'inuse'
vxi_icache_inuseino          373158    vxi_icache_maxino            398044     <<< 373158 inodes are cached

# vxfsstat -s /volmanyfiles  | grep master
vxi_masterless_locks         393641    vxi_normalize_locks              31    <<< up to now, the CFS function was called 393641 times to create masterless locks, and the normalize function was called only 31 times
 

 
Node 1:

# vxfsstat -v /volmanyfiles | grep inuse
vxi_icache_inuseino             103    vxi_icache_maxino            398044

# vxfsstat -s /volmanyfiles | grep master
vxi_masterless_locks              2    vxi_normalize_locks               0


# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    2m43.470s        <<< 2 minutes 43 seconds
user    0m2.207s
sys     0m13.954s
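
The size of the penalty can be worked out from the two "real" values reported so far. The sketch below assumes the bash `time` format (for example 2m43.470s) used in this article. The helper name real_time_seconds is hypothetical.

```python
import re


def real_time_seconds(value):
    """Convert a bash `time` value such as '2m43.470s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Node 0: inodes cached, masterless locks.
cached_local = real_time_seconds("0m1.870s")
# Node 1: first access, inodes read from disk and locks normalized.
first_remote = real_time_seconds("2m43.470s")
slowdown = first_remote / cached_local
print(f"first remote access was about {slowdown:.0f} times slower")
```

In this example the first access from the second node is roughly two orders of magnitude slower than the cached access on the first node.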
 

 
It took much longer to finish the command when CFS had to normalize the locks. Part of the reason for the slow performance is that the test machines have slow LLT links, as shown below: they only transfer 3 to 4 MB per second.
 

# while :
> do
> /opt/VRTSllt/lltshow -p 5 | grep xrate
> sleep 5
> done
txrate=13868/12683/12171 pkts, 4363/4059/3905 KB per s/10s/30s (0.03 Gb/sec)
rxrate=13994/13811/13341 pkts, 3060/2997/2895 KB per s/10s/30s (0.02 Gb/sec)
txrate=12956/12401/12140 pkts, 4086/3970/3894 KB per s/10s/30s (0.03 Gb/sec)
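
The rate lines above can be parsed to see how much of a link they represent. This is a minimal sketch, assuming the txrate/rxrate line layout shown in this article, that 1 KB means 1024 bytes, and a hypothetical 1 Gb/s link for comparison. The function names are illustrative.

```python
import re

_RATE = re.compile(
    r"(tx|rx)rate=(\d+)/(\d+)/(\d+) pkts, (\d+)/(\d+)/(\d+) KB per s/10s/30s"
)
_WINDOWS = ("1s", "10s", "30s")


def parse_rate_line(line):
    """Parse one lltshow txrate/rxrate line; return None if it does not match."""
    match = _RATE.search(line)
    if match is None:
        return None
    numbers = [int(n) for n in match.groups()[1:]]
    return {
        "direction": match.group(1),
        "pkts": dict(zip(_WINDOWS, numbers[:3])),
        "kb": dict(zip(_WINDOWS, numbers[3:])),
    }


def link_share(kb_per_second, link_gbit=1.0):
    """Fraction of a link of link_gbit Gb/s used (assumes 1 KB = 1024 bytes)."""
    return kb_per_second * 1024 * 8 / (link_gbit * 1e9)


line = "txrate=13868/12683/12171 pkts, 4363/4059/3905 KB per s/10s/30s (0.03 Gb/sec)"
parsed = parse_rate_line(line)
share = link_share(parsed["kb"]["1s"])
print(parsed["direction"], parsed["kb"], f"{share:.1%} of a 1 Gb/s link")
```

The computed share is consistent with the 0.03 Gb/sec figure that lltshow prints at the end of the line.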
 

 
The vxfsstat -s output will show significant lock grant and revoke activity.

Node 0:

# vxfsstat -s /volmanyfiles/ -t 10 | awk '$2 != 0 || $4 != 0 {print}'
13:06:58.001 Thu 21 Jan 2016 01:06:58 PM AEDT -- absolute sample       <<< absolute
                        <<< first record is absolute value; accumulated value since file system was mounted on this node
.....
13:14:45.833 Thu 21 Jan 2016 01:14:45 PM AEDT -- delta (10.000 sec sample)    <<< delta value  per 10 seconds
vxi_hlockgrant                    0    vxi_hlockrevoke               46064           <<< revoked 46064  h locks on node 0 in 10 seconds
vxi_masterless_locks              0    vxi_normalize_locks           23043     <<< normalized 23043 locks
....
vxi_glockrevoke               23022    vxi_glockgrant_pbhit              0     <<< revoked 23022 g locks
 

 
Node 1:
# vxfsstat -s /volmanyfiles/ -t 10 | awk '$2 != 0 || $4 != 0 {print}'
13:07:47.238 Thu 21 Jan 2016 01:07:47 PM AEDT -- absolute sample              <<< absolute accumulated values since mount
                        <<< first record is absolute value; accumulated value since file system was mounted on this node
13:14:57.897 Thu 21 Jan 2016 01:14:57 PM AEDT -- delta (10.000 sec sample)    <<< delta value per 10 seconds
vxi_recv_open_mbr                 0    vxi_hlock_init                23875         <<< initialized 23875 h locks in 10 seconds on node 1
vxi_hlockgrant                47725    vxi_hlockrevoke                   0           <<< granted 47725 h locks
....
vxi_rwlockgrant               23875    vxi_rwlockrevoke                  0          <<< granted 23875 rw locks
....
vxi_rwlock_iupdat                 0    vxi_glockgrant                23850           <<< granted 23850 g locks
....
vxi_staleowner                    0    vxi_strong_pullowner          23850        <<< pulled the ownership of 23850 inodes from node 0
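
As a rough cross-check, the delta samples above can be used to estimate how long the normalization of all locks should take. The sketch below assumes the normalization rate stays constant at the value seen in one 10-second sample, which is a simplification. The function name is hypothetical.

```python
def estimated_seconds(total_locks, locks_per_sample, sample_seconds=10.0):
    """Estimate the time to process total_locks at a constant sampled rate."""
    if locks_per_sample <= 0:
        raise ValueError("locks_per_sample must be positive")
    rate_per_second = locks_per_sample / sample_seconds
    return total_locks / rate_per_second


# 393641 masterless locks on Node 0, normalized at 23043 per 10-second sample.
estimate = estimated_seconds(393641, 23043)
observed = 2 * 60 + 43.470  # "real" time of the find command on Node 1
print(f"estimated {estimate:.0f} s, observed {observed:.0f} s")
```

The estimate of about 171 seconds is close to the observed 2 minutes 43 seconds, which supports the view that lock normalization dominates the run time in this example.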
 

 
After the locks are normalized, we have the following vxfsstat data.

Node 0 has the inodes cached in memory, and the locks are normalized. 
 

# vxfsstat -v /volmanyfiles/ | egrep 'inuse'
vxi_icache_inuseino          310772    vxi_icache_maxino            398044

# vxfsstat -s /volmanyfiles  | grep master
vxi_masterless_locks         393641    vxi_normalize_locks          393635   <<< number of times CFS functions called so far
 

 

Node 1 now also has the inodes cached in memory, but it did not need to create any masterless locks or normalize any locks. The creation of the masterless locks and their normalization were done on Node 0.
 

# vxfsstat -v /volmanyfiles | egrep 'inuse'
vxi_icache_inuseino          373166    vxi_icache_maxino            398044

# vxfsstat -s /volmanyfiles | grep master
vxi_masterless_locks              2    vxi_normalize_locks               0
 

 

3. Whether lazy_isize_enable is enabled or not

Once the CFS inode locks are normalized, each time the information of an inode with a normalized lock is accessed, CFS has to go through the CFS/GLM protocol to get accurate data. This involves sending messages through GAB and LLT.

Before Veritas Storage Foundation 6.1, every time a CFS inode is accessed on a node, CFS has to take two sets of locks on the inode while taking the ownership. This involves a lot of GLM lock grants and revokes, as shown below.

Continuing with the above example, Node 1 just pulled the ownership from Node 0 according to the previous vxfsstat -s output.


Node 0:


# vxfsstat -s /volmanyfiles  | grep vxi_strong_pullowner
vxi_staleowner                    0    vxi_strong_pullowner              2 
 

    

Node 1:


# vxfsstat -s /volmanyfiles | grep vxi_strong_pullowner
vxi_staleowner                    0    vxi_strong_pullowner         393228   <<< number of times CFS function called to pull the ownership
 


 

Since now Node 1 has the ownership, the inode access on Node 1 will be quick.

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m1.663s
user    0m0.714s
sys     0m0.985s
 


 

Over time, the locks will be revoked automatically (but slowly), and the inode access performance on this node will degrade.

Node 1:

14:48:38.984 Thu 21 Jan 2016 02:48:38 PM AEDT -- delta (10.000 sec sample)
...
vxi_hlockgrant                    0    vxi_hlockrevoke                  50                 <<< about 50 revokes per 30 seconds
...
vxi_rwlockgrant                   0    vxi_rwlockrevoke                 50
 


 

Now go back to Node 0. Since the inodes are now owned by Node 1, accessing the inodes from Node 0 will be slow because Node 0 has to take the ownership back.

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    1m49.765s
user    0m2.298s
sys     0m13.438s
 

 

Node 0:

# vxfsstat -s /volmanyfiles/ -t 10 | awk '$2 != 0 || $4 != 0 {print}'
14:52:35.351 Thu 21 Jan 2016 02:52:35 PM AEDT -- delta (10.000 sec sample)
vxi_hlockgrant                    0    vxi_hlockrevoke               32296           <<<  h lock revoke
...
vxi_glockrevoke               32296    vxi_glockgrant_pbhit              0        <<< g lock revoke
....
 


 

Node 1:

# vxfsstat -s -t 10 /volmanyfiles/ | awk '$2 != 0 || $4 != 0 {print}'
14:52:49.006 Thu 21 Jan 2016 02:52:49 PM AEDT -- delta (10.000 sec sample)
vxi_recv_open_mbr                 0    vxi_hlock_init                30652           <<< h lock init
vxi_hlockgrant                61305    vxi_hlockrevoke               30652          <<< h lock grant / revoke
.....
vxi_inode_btranidflush            0    vxi_hlock_deinit              30652         <<< h lock deinit
vxi_rwlockgrant               30653    vxi_rwlockrevoke              30652          <<< rw lock grant / revoke
.....
vxi_rwlock_iupdat                 0    vxi_glockgrant                30653             <<< g lock grant
vxi_glockrevoke                   0    vxi_glockgrant_pbhit          30653
.....
vxi_staleowner                    0    vxi_strong_pullowner          30653           <<< pull ownership
 


 

Once Node 0 has the locks back, inode access will be fast on Node 0. But if we access the inodes from Node 1 again, the same lock revoke and grant activity will start all over again.

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m1.720s
user    0m0.684s
sys     0m1.067s
 


 

In order to minimize the lock grants and revokes when accessing the inode information, the lazy_isize_enable tunable was introduced in SFCFS (Storage Foundation Cluster File System) 6.1.

The lazy_isize_enable tunable enables or disables a performance optimization in Cluster File System. When one node in a cluster is extending a file, the optimization is to not reflect the updated file size immediately on the other nodes. Note that if this tunable is enabled, the file size reported by stat might be stale, but this has no impact on other file operations. You can specify the following values for lazy_isize_enable:

              0  Disables the performance optimization
              1  Enables the performance optimization

The default value of lazy_isize_enable is 0.

Caution :

If an application running on Cluster File System writes to a file from multiple nodes in the cluster, and the write offset is based on the file size determined using stat(), this tunable should not be turned on: stat() may return a stale file size, and the application may end up writing to the wrong offset.
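
The access pattern that the caution refers to looks roughly like the following sketch. It is an illustration only: the function name and the record contents are hypothetical, and the demonstration runs with a single writer on a local temporary file, where the reported size is always current.

```python
import os
import tempfile


def append_using_stat_size(path, data):
    """Write data at the offset reported by stat() and return that offset.

    This is the pattern the caution warns about: with lazy_isize_enable=1
    and writers on several nodes, st_size may be stale, so the computed
    offset can point into data that another node has already written.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        offset = os.stat(path).st_size  # possibly stale on CFS with the tunable on
        os.pwrite(fd, data, offset)
        return offset
    finally:
        os.close(fd)


# Single-writer demonstration on a local temporary file.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"header\n")
    demo_path = handle.name

first_offset = append_using_stat_size(demo_path, b"record-1\n")
second_offset = append_using_stat_size(demo_path, b"record-2\n")
with open(demo_path, "rb") as handle:
    contents = handle.read()
os.unlink(demo_path)
print(first_offset, second_offset, contents)
```

Applications that compute write offsets this way from several cluster nodes are the ones for which lazy_isize_enable should stay at 0.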

lazy_isize_enable can be turned on by using the vxtunefs command.


# vxtunefs /volmanyfiles | grep lazy_isize
lazy_isize_enable = 0
 




Turning on lazy_isize_enable avoids the g lock grants and revokes, which helps the performance.
 

# vxtunefs -o lazy_isize_enable=1 /volmanyfiles
UX:vxfs vxtunefs: INFO: V-3-22525: Parameters successfully set for /volmanyfiles

# vxtunefs /volmanyfiles | grep lazy_isize
lazy_isize_enable = 1

# vxtunefs -o lazy_isize_enable=1 /volmanyfiles
UX:vxfs vxtunefs: INFO: V-3-22525: Parameters successfully set for /volmanyfiles

# vxtunefs /volmanyfiles | grep lazy_isize_enable
lazy_isize_enable = 1
 


 

In order to make the parameter persistent across system reboots, you have to specify it in the /etc/vx/tunefstab file. Refer to the vxtunefs manual page for details.

Continuing with the above example, the inode locks are now back on Node 0.

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m4.379s
user    0m0.837s
sys     0m2.355s
 

 

Now with lazy_isize_enable turned on, we access the inodes from Node 1.

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m40.971s
user    0m1.920s
sys     0m5.303s


 

Node 0:

# vxfsstat -s -t 10 /volmanyfiles/ | awk '$2 != 0 || $4 != 0 {print}'
18:08:50.854 Thu 21 Jan 2016 06:08:50 PM AEDT -- delta (10.000 sec sample)
.....
vxi_hlockgrant                    0    vxi_hlockrevoke               97190       <<< there are still h lock revoke, but no more g lock revoke
....
 


 

Node 1:

# vxfsstat -s /volmanyfiles/ -t 10 | awk '$2 != 0 || $4 != 0 {print}'
18:08:41.829 Thu 21 Jan 2016 06:08:41 PM AEDT -- delta (10.000 sec sample)
vxi_hlockgrant                95810    vxi_hlockrevoke                   0        <<< h lock grant but no g lock grant
....
vxi_staleowner                    0    vxi_strong_pullowner          95840    <<< pull ownership
......
 


 

4. Accessing the inode information from more than one node in parallel

Try to avoid accessing the inodes from more than one node at a time. This may cause the locks to be passed back and forth.

For example, we access the inodes from two systems at the same time with lazy_isize_enable turned on.
# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m54.465s
user    0m2.192s
sys     0m6.993s

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m54.356s
user    0m2.306s
sys     0m7.277s
 

 
Node 0:
# vxfsstat -s -t 10 /volmanyfiles/ | awk '$2 != 0 || $4 != 0 {print}'
18:21:26.254 Thu 21 Jan 2016 06:21:26 PM AEDT -- delta (10.000 sec sample)
vxi_hlockgrant                73310    vxi_hlockrevoke               72645       <<< h lock grant and revoke at the same time
.....
vxi_staleowner                    0    vxi_strong_pullowner          73310
 

 
Node 1:

# vxfsstat -s /volmanyfiles/ -t 10 | awk '$2 != 0 || $4 != 0 {print}'
18:21:28.807 Thu 21 Jan 2016 06:21:28 PM AEDT -- delta (10.000 sec sample)
vxi_hlockgrant                72432    vxi_hlockrevoke               72661           <<<
.....
vxi_staleowner                    0    vxi_strong_pullowner          72443
 


 
With lazy_isize_enable turned off, the performance is worse.
 

# vxtunefs -o lazy_isize_enable=0 /volmanyfiles
UX:vxfs vxtunefs: INFO: V-3-22525: Parameters successfully set for /volmanyfiles

# vxtunefs -o lazy_isize_enable=0 /volmanyfiles
UX:vxfs vxtunefs: INFO: V-3-22525: Parameters successfully set for /volmanyfiles

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m54.747s
user    0m2.257s
sys     0m6.752s

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    1m29.402s
user    0m2.369s
sys     0m9.597s
 


 
Node 0:
# vxfsstat -s -t 10 /volmanyfiles/ | awk '$2 != 0 || $4 != 0 {print}'
18:25:56.487 Thu 21 Jan 2016 06:25:56 PM AEDT -- delta (10.000 sec sample)
vxi_hlockgrant                73358    vxi_hlockrevoke               38365               <<< h lock grant / revoke
.....
vxi_glockrevoke               38364    vxi_glockgrant_pbhit              0           <<< g lock revoke
....
vxi_staleowner                    0    vxi_strong_pullowner          73358

Node 1:

# vxfsstat -s /volmanyfiles/ -t 10 | awk '$2 != 0 || $4 != 0 {print}'
18:25:57.716 Thu 21 Jan 2016 06:25:57 PM AEDT -- delta (10.001 sec sample)
vxi_hlockgrant                38436    vxi_hlockrevoke               73576           <<<  h lock grant / revoke
.....
vxi_rwlock_iupdat                 0    vxi_glockgrant                38436       <<< g lock grant
.....
vxi_staleowner                    0    vxi_strong_pullowner          38436
 


 

General recommendation on the CFS tuning

The following tunings are recommended to improve the CFS inode access performance.

1. Mount the CFS file system with the noatime option.

2. Mount the CFS file system with the nomtime option.

3. Turn on lazy_isize_enable using vxtunefs.

Please refer to the mount_vxfs and vxtunefs manual pages for details on the above parameters.

If only the file system space usage is required, and not the space usage of a particular directory, please consider using df instead of du. The df command obtains the required information without going through every single inode in the file system.
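
The difference between the two approaches can be sketched as follows. This is a simplified illustration: the function names are hypothetical, and the du-like walk sums apparent file sizes rather than allocated blocks, so its result will not match the real du output exactly.

```python
import os
import shutil
import tempfile


def used_bytes_like_df(mount_point):
    """One query for the whole file system; no per-inode access."""
    return shutil.disk_usage(mount_point).used


def used_bytes_like_du(directory):
    """Walk the tree and lstat every entry; this touches every inode."""
    total = 0
    for root, dirs, files in os.walk(directory):
        for name in dirs + files:
            total += os.lstat(os.path.join(root, name)).st_size
    return total


# Demonstration on a local temporary directory with three 100-byte files.
with tempfile.TemporaryDirectory() as demo_dir:
    for index in range(3):
        with open(os.path.join(demo_dir, f"file{index}"), "wb") as handle:
            handle.write(b"x" * 100)
    walked_bytes = used_bytes_like_du(demo_dir)
    fs_used_bytes = used_bytes_like_df(demo_dir)

print(walked_bytes, fs_used_bytes)
```

On a CFS file system with hundreds of thousands of files, the walk performs one inode access per entry, and each of those accesses is subject to the caching and locking costs described in this article, while the file system level query is not.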


 

Tuning for LLT Flow Control

If the speed of the LLT links is 1 Gb/s or above, please tune the following LLT Flow Control parameters to take full advantage of the available bandwidth.


# lltconfig -F lowwater:8000
# lltconfig -F highwater:10000
# lltconfig -F rportlowwater:8000
# lltconfig -F rporthighwater:10000
# lltconfig -F window:5000

# lltconfig -F query
Current LLT flow control values (in packets):
  lowwater  = 8000
  highwater = 10000
  rportlowwater  = 8000
  rporthighwater = 10000
  window    = 5000

To make the setting persistent across system reboots, please add the following lines to the /etc/llttab file.

set-flow lowwater:8000
set-flow highwater:10000
set-flow rportlowwater:8000
set-flow rporthighwater:10000
set-flow window:5000

Please note that after tuning these parameters, if a high Snd retransmit data count is observed in the lltstat output, the value of the window parameter should be reduced until the retransmission rate (Snd retransmit data / Snd data packets) is less than 0.1%. The other parameters need not be changed.

# lltstat
LLT statistics:
    10189592   Snd data packets
    4                 Snd retransmit data     
....
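
The retransmission rate check described above is simple arithmetic, sketched here with the two values from the lltstat example. The function names are hypothetical; the 0.1% threshold is the one given in this article.

```python
THRESHOLD = 0.001  # 0.1%, the limit given in this article


def retransmit_rate(snd_retransmit_data, snd_data_packets):
    """Return Snd retransmit data / Snd data packets as a fraction."""
    if snd_data_packets == 0:
        return 0.0
    return snd_retransmit_data / snd_data_packets


def window_needs_reduction(snd_retransmit_data, snd_data_packets):
    """True when the retransmission rate is at or above the threshold."""
    return retransmit_rate(snd_retransmit_data, snd_data_packets) >= THRESHOLD


rate = retransmit_rate(4, 10189592)
verdict = "reduce window" if window_needs_reduction(4, 10189592) else "no change needed"
print(f"retransmission rate {rate:.6%}: {verdict}")
```

With 4 retransmissions out of 10189592 data packets, the rate is far below 0.1%, so the window value in this example does not need to be reduced.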
 



 

Other Performance Monitoring Tools

Apart from monitoring the LLT performance using lltshow -p 5 as mentioned above, the gabshow.pl output can also be used to check whether the performance of the GAB message delivery is affected by LLT network flow control.
 
# /opt/VRTSgab/gabshow.pl -port f flowcontrol
============================================================
Send side
                gp_xm_flwfrd(null)
                gp_xm_flwbck(null)
                gp_xm_flwctl         0
                gp_xm_flwsetcnt      0  <<< number of times transmit flow control is called
                gp_xm_flwclrcnt      0  <<< number of times transmit flow control is cleared
Receive side
                gp_rv_flwctl         0
                gp_rv_flwsetcnt      0  <<< receive flow control set
                gp_rv_flwclrcnt      0  <<< receive flow control clear
 

 
OS-native network monitoring tools can also be used to check the health of the network interfaces. For example, on Linux, the netstat -i command can be used to check for network interface errors.
 

# netstat -i
Kernel Interface table
Iface      MTU      RX-OK    RX-ERR RX-DRP    RX-OVR        TX-OK TX-ERR TX-DRP TX-OVR Flg
bond0     1500   19162882         0  85590         0     17556872      0      0      0 BMmRU
em1       1500    9460921         0      1         0      8789424      0      0      0 BMsRU
em2       1500    9701961         0      2         0      8767448      0      0      0 BMsRU
em3       1500 1986097475 223215861  83332 223215861   2157816519      0      0      0 BMRU
em4       1500 2002717440 199940042  83350 199940042   2157776051      0      0      0 BMRU
lo       65536      21149         0      0         0        21149      0      0      0 LRU

Please refer to the corresponding platform manual pages for details.  
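
To pick out the interfaces that need attention from output like the above, the table can be parsed and filtered on the error columns. This is a minimal sketch, assuming the Linux netstat -i column layout shown in this article. The function name and the choice of columns to check are illustrative.

```python
ERROR_COLUMNS = ("RX-ERR", "RX-DRP", "RX-OVR", "TX-ERR", "TX-DRP", "TX-OVR")


def interfaces_with_errors(netstat_output):
    """Return {iface: {column: count}} for interfaces with non-zero error counters."""
    rows = [line.split() for line in netstat_output.splitlines() if line.strip()]
    header = next((fields for fields in rows if fields[0] == "Iface"), None)
    if header is None:
        return {}
    problems = {}
    for fields in rows[rows.index(header) + 1:]:
        if len(fields) != len(header):
            continue  # skip lines that do not match the table layout
        row = dict(zip(header, fields))
        counters = {
            column: int(row[column])
            for column in ERROR_COLUMNS
            if column in row and int(row[column]) > 0
        }
        if counters:
            problems[row["Iface"]] = counters
    return problems


# Rows copied from the netstat -i example above.
sample = """Kernel Interface table
Iface      MTU      RX-OK    RX-ERR RX-DRP    RX-OVR        TX-OK TX-ERR TX-DRP TX-OVR Flg
bond0     1500   19162882         0  85590         0     17556872      0      0      0 BMmRU
em3       1500 1986097475 223215861  83332 223215861   2157816519      0      0      0 BMRU
lo       65536      21149         0      0         0        21149      0      0      0 LRU
"""
problems = interfaces_with_errors(sample)
for iface, counters in problems.items():
    print(iface, counters)
```

In this sample, em3 stands out with a very large number of receive errors and overruns, which would be worth investigating if the interface carries LLT traffic.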



 

3. Whether lazy_isize_enable is enabled or not

Once the CFS inode locks are normalized, each time the information of an inode with normalized lock is accessed, CFS has go through the CFS/GLM protocol to get the accurate data. This will involve sending messages through GAB and LLT.

Before Veritas Storage Foundation 6.1, every time an CFS inode is accessed on a node, CFS has to take two sets of locks of the inodes while taking the ownership.  This involves a lot of GLM lock grant and revoke as shown below.

Continuing with the above example, Node 1 just pulled the ownership from Node 0 according to the previous vxfsstat -s output.


Node 0:


# vxfsstat -s /volmanyfiles  | grep vxi_strong_pullowner
vxi_staleowner                    0    vxi_strong_pullowner              2 
 

    

Node 1:


# vxfsstat -s /volmanyfiles | grep vxi_strong_pullowner
vxi_staleowner                    0    vxi_strong_pullowner         393228   <<< number of times CFS function called to pull the ownership
 


 

Since now Node 1 has the ownership, the inode access on Node 1 will be quick.

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m1.663s
user    0m0.714s
sys     0m0.985s
 


 

Over time, the locks will be revoked automatically (but slowly).  And the inode access performance will be degraded on this node over time.

Node 1:

14:48:38.984 Thu 21 Jan 2016 02:48:38 PM AEDT -- delta (10.000 sec sample)
...
vxi_hlockgrant                    0    vxi_hlockrevoke                  50                 <<< about 50 revoke per 30 seconds
...
vxi_rwlockgrant                   0    vxi_rwlockrevoke                 50
 


 

Now go back to Node 0, since the inodes are now owned by Node1, accessing the inodes from Node 0 will be slow because it has to take the ownership back.

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    1m49.765s
user    0m2.298s
sys     0m13.438s
 

 

Node 0:

# vxfsstat -s /volmanyfiles/ -t 10 | awk '$2 != 0 || $4 != 0 {print}'
14:52:35.351 Thu 21 Jan 2016 02:52:35 PM AEDT -- delta (10.000 sec sample)
vxi_hlockgrant                    0    vxi_hlockrevoke               32296           <<<  h lock revoke
...
vxi_glockrevoke               32296    vxi_glockgrant_pbhit              0        <<< g lock revoke
....
 


 

Node 1:

# vxfsstat -s -t 10 /volmanyfiles/ | awk '$2 != 0 || $4 != 0 {print}'
14:52:49.006 Thu 21 Jan 2016 02:52:49 PM AEDT -- delta (10.000 sec sample)
vxi_recv_open_mbr                 0    vxi_hlock_init                30652           <<< h lock init
vxi_hlockgrant                61305    vxi_hlockrevoke               30652          <<< h lock grant / revoke
.....
vxi_inode_btranidflush            0    vxi_hlock_deinit              30652         <<< h lock deinit
vxi_rwlockgrant               30653    vxi_rwlockrevoke              30652          <<< rw lock grant / revoke
.....
vxi_rwlock_iupdat                 0    vxi_glockgrant                30653             <<< g lock grant
vxi_glockrevoke                   0    vxi_glockgrant_pbhit          30653
.....
vxi_staleowner                    0    vxi_strong_pullowner          30653           <<< pull ownership
 


 

Once Node 0 has the locks back, inode access will be fast on Node 0. But if we access the node 1 again, the same lock revoke and grant activity will start all over again.

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m1.720s
user    0m0.684s
sys     0m1.067s
 


 

In order to minimize the lock grant and revoke when accessing the inode information, starting from SFCFS (Storage Foundation Cluster File System) 6.1 the lazy_isize_enable tunable is introduced.

The lazy_isize_enable tunable enables or disables a performance optimization in Cluster File System. When one node in a cluster is extending the file, optimization is to not reflect the updated file size immediately on other nodes. Note that if this tunable is enabled, file size reported by stat might be stale, but it will not have any impact on other file operations. You can specify the following values for lazy_isize_enable:

              0  Disables the performance optimization
              1  Enables the performance optimization

The default value of lazy_isize_enable is 0.

Caution :

If the application is running on Cluster File System AND file is getting written from multiple nodes in cluster and write offset is based on the size of the file (determined using stat()), then this tunable should not be turned on as it results in returning STALE file size if the tunable is turned on. An application may end up in writing to wrong offset.

lazy_isize_enable can be turned on by using the vxtune command.


# vxtunefs /volmanyfiles | grep lazy_isize
lazy_isize_enable = 0
 




By turning on lazy_isize_enable the g lock grant and revoke are avoided and this helps the performance.
 

# vxtunefs -o lazy_isize_enable=1 /volmanyfiles
UX:vxfs vxtunefs: INFO: V-3-22525: Parameters successfully set for /volmanyfiles

# vxtunefs /volmanyfiles | grep lazy_isize
lazy_isize_enable = 1

# vxtunefs -o lazy_isize_enable=1 /volmanyfiles
UX:vxfs vxtunefs: INFO: V-3-22525: Parameters successfully set for /volmanyfiles

# vxtunefs /volmanyfiles | grep lazy_isize_enable
lazy_isize_enable = 1
 


 

In order to make the parameters permanent across system reboot, you have to specify it in the /etc/vx/tunefstab.  Refer to the vxtunefs manual page for details.

Continue with the above example, the locks of the inodes go back to Node 0 now.

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m4.379s
user    0m0.837s
sys     0m2.355s
 

 

Now with lazy_isize_enable turned on, we access the inodes from Node 1.

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m40.971s
user    0m1.920s
sys     0m5.303s


 

Node 0:

# vxfsstat -s -t 10 /volmanyfiles/ | awk '$2 != 0 || $4 != 0 {print}'
18:08:50.854 Thu 21 Jan 2016 06:08:50 PM AEDT -- delta (10.000 sec sample)
.....
vxi_hlockgrant                    0    vxi_hlockrevoke               97190       <<< there are still h lock revoke, but no more g lock revoke
....
 


 

Node 1:

# vxfsstat -s /volmanyfiles/ -t 10 | awk '$2 != 0 || $4 != 0 {print}'
18:08:41.829 Thu 21 Jan 2016 06:08:41 PM AEDT -- delta (10.000 sec sample)
vxi_hlockgrant                95810    vxi_hlockrevoke                   0        <<< h lock grant but no g lock grant
....
vxi_staleowner                    0    vxi_strong_pullowner          95840    <<< pull ownership
......
 


 

4. Accessing the inode information from more than one node in parallel

Try to avoid accessing the inodes from more than one node at a time. This may cause the locks to be passed back and forth.

For example, we access the inodes from two systems, at the same time, with lazy_isize_enable turned on.
# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m54.465s
user    0m2.192s
sys     0m6.993s

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m54.356s
user    0m2.306s
sys     0m7.277s
 

 
Node 0:
# vxfsstat -s -t 10 /volmanyfiles/ | awk '$2 != 0 || $4 != 0 {print}'
18:21:26.254 Thu 21 Jan 2016 06:21:26 PM AEDT -- delta (10.000 sec sample)
vxi_hlockgrant                73310    vxi_hlockrevoke               72645       <<< h lock grant and revoke at the same time
.....
vxi_staleowner                    0    vxi_strong_pullowner          73310
 

 
Node 1:

# vxfsstat -s /volmanyfiles/ -t 10 | awk '$2 != 0 || $4 != 0 {print}'
18:21:28.807 Thu 21 Jan 2016 06:21:28 PM AEDT -- delta (10.000 sec sample)
vxi_hlockgrant                72432    vxi_hlockrevoke               72661           <<<
.....
vxi_staleowner                    0    vxi_strong_pullowner          72443
 


 
With lazy_isize_enable turned off, the performance is worse.
 

# vxtunefs -o lazy_isize_enable=0 /volmanyfiles
UX:vxfs vxtunefs: INFO: V-3-22525: Parameters successfully set for /volmanyfiles

# vxtunefs -o lazy_isize_enable=0 /volmanyfiles
UX:vxfs vxtunefs: INFO: V-3-22525: Parameters successfully set for /volmanyfiles

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    0m54.747s
user    0m2.257s
sys     0m6.752s

# time find /volmanyfiles/manyfiles -ls | wc -l
393604
real    1m29.402s
user    0m2.369s
sys     0m9.597s
 


 
Node 0:
# vxfsstat -s -t 10 /volmanyfiles/ | awk '$2 != 0 || $4 != 0 {print}'
18:25:56.487 Thu 21 Jan 2016 06:25:56 PM AEDT -- delta (10.000 sec sample)
vxi_hlockgrant                73358    vxi_hlockrevoke               38365               <<< h lock grant / revoke
.....
vxi_glockrevoke               38364    vxi_glockgrant_pbhit              0           <<< g lock grant
....
vxi_staleowner                    0    vxi_strong_pullowner          73358

# vxfsstat -s /volmanyfiles/ -t 10 | awk '$2 != 0 || $4 != 0 {print}'
18:25:57.716 Thu 21 Jan 2016 06:25:57 PM AEDT -- delta (10.001 sec sample)
vxi_hlockgrant                38436    vxi_hlockrevoke               73576           <<<  h lock grant / revoke
.....
vxi_rwlock_iupdat                 0    vxi_glockgrant                38436       <<< g lock grant
.....
vxi_staleowner                    0    vxi_strong_pullowner          38436
 


 

General recommendation on the CFS tuning

The following tunings are recommended to improve the CFS inode access.

1. Mount the CFS file system with noatime option.

2. Mount the CFS file system with nomtime option.

3. Turn on lazy_isize_enable using vxtunefs.

Please refer to the mount_vxfs and vxtunefs manual pages for details on the above parameters.

If only the file system space usage is required, and not particular directory space usage, please consider using df instead of  du. The command df obtains the required information without going through every single inode in the file systems.


 

Tuning for LLT Flow Control

If the speed of the LLT links is 1 Gb/s or above, please tune the following LLT Flow Control parameters to take full advantage of the available bandwidth.


# lltconfig -F lowwater:8000
# lltconfig -F highwater:10000
# lltconfig -F rportlowwater:8000
# lltconfig -F rporthighwater:10000
# lltconfig -F window:5000

# lltconfig -F query
Current LLT flow control values (in packets):
  lowwater  = 8000
  highwater = 10000
  rportlowwater  = 8000
  rporthighwater = 10000
  window    = 5000

To make the setting persistent across system reboots, please add the following lines to the /etc/llttab file.

set-flow lowwater:8000
set-flow highwater:10000
set-flow rportlowwater:8000
set-flow rporthighwater:10000
set-flow window:5000
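
The set-flow lines above could be added with a small helper like the one below. This is only a sketch: the function name is hypothetical, and the file path is passed as an argument so the change can be tried on a copy before /etc/llttab itself is touched. It appends each line only if an identical line is not already present, and it does not rewrite existing set-flow entries that carry different values.

```shell
# Append each set-flow line to an llttab-style file only if the exact
# line is not already there. Existing entries are left untouched.
add_flow_settings() {
    file=$1
    for line in \
        "set-flow lowwater:8000" \
        "set-flow highwater:10000" \
        "set-flow rportlowwater:8000" \
        "set-flow rporthighwater:10000" \
        "set-flow window:5000"
    do
        # -x matches the whole line, -F treats the text literally.
        grep -qxF "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
    done
}

# Example: work on a copy, review it, then install it as /etc/llttab.
# cp /etc/llttab /tmp/llttab.new && add_flow_settings /tmp/llttab.new
```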

Please note that after tuning these parameters, if a high Snd retransmit data count is observed in the lltstat output, the value of the window parameter should be reduced until the retransmission rate (Snd retransmit data / Snd data packets) is less than 0.1%. The other parameters need not be changed.

# lltstat
LLT statistics:
    10189592   Snd data packets
    4                 Snd retransmit data     
....
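
The retransmission rate described above can be worked out from the two counters with a short awk filter. The sketch below assumes the lltstat output has the layout shown in the example, with the counter value in the first column; the function name is hypothetical.

```shell
# Read lltstat output on stdin and print the retransmission rate as a
# percentage. Exit status: 0 if the rate is below 0.1%, 1 if it is at or
# above 0.1%, 2 if the counters were not found or no packets were sent.
llt_retransmit_rate() {
    awk '
        /Snd retransmit data/ { retrans = $1; have_r = 1 }
        /Snd data packets/    { sent = $1;    have_s = 1 }
        END {
            if (!have_r || !have_s || sent == 0) exit 2
            rate = retrans / sent * 100
            printf "%.6f\n", rate
            exit (rate < 0.1 ? 0 : 1)
        }'
}

# Usage: lltstat | llt_retransmit_rate
```

For the sample output above, 4 retransmits out of 10189592 data packets is far below the 0.1% threshold, so the window value would not need to be reduced.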
 



 

Other Performance Monitoring Tools

Apart from monitoring the LLT performance using lltshow -p 5 as mentioned above, the gabshow.pl output can also be used to check whether the performance of GAB message delivery is affected by LLT network flow control.
 
# /opt/VRTSgab/gabshow.pl -port f flowcontrol
============================================================
Send side
                gp_xm_flwfrd(null)
                gp_xm_flwbck(null)
                gp_xm_flwctl         0
                gp_xm_flwsetcnt      0  <<< number of times transmit flow control is called
                gp_xm_flwclrcnt      0  <<< number of times transmit flow control is cleared
Receive side
                gp_rv_flwctl         0
                gp_rv_flwsetcnt      0  <<< receive flow control set
                gp_rv_flwclrcnt      0  <<< receive flow control clear
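
When this output is collected repeatedly, it can help to print only the flow-control counters that are not zero. The filter below is a sketch that assumes the output layout shown above, with the counter name followed by a numeric value; the function name is hypothetical.

```shell
# Read "gabshow.pl -port f flowcontrol" output on stdin and print every
# flow-control counter whose value is not zero. Lines such as
# "gp_xm_flwfrd(null)" have no separate numeric field and are skipped.
gab_flowcontrol_nonzero() {
    awk '$1 ~ /^gp_(xm|rv)_flw/ && $2 ~ /^[0-9]+$/ && $2 != 0 { print $1, $2 }'
}

# Usage: /opt/VRTSgab/gabshow.pl -port f flowcontrol | gab_flowcontrol_nonzero
```

No output means that none of the counters recorded a flow-control event, as in the sample above.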
 

 
OS-native network monitoring tools can also be used to check the health of the network interfaces. For example, on Linux, the netstat -i command can be used to check for network interface errors.
 

# netstat -i
Kernel Interface table
Iface      MTU      RX-OK    RX-ERR RX-DRP    RX-OVR        TX-OK TX-ERR TX-DRP TX-OVR Flg
bond0     1500   19162882         0  85590         0     17556872      0      0      0 BMmRU
em1       1500    9460921         0      1         0      8789424      0      0      0 BMsRU
em2       1500    9701961         0      2         0      8767448      0      0      0 BMsRU
em3       1500 1986097475 223215861  83332 223215861   2157816519      0      0      0 BMRU
em4       1500 2002717440 199940042  83350 199940042   2157776051      0      0      0 BMRU
lo       65536      21149         0      0         0        21149      0      0      0 LRU

Please refer to the corresponding platform manual pages for details.  
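
To pick out the interfaces that need attention, the netstat -i output can be filtered for error, drop and overrun counters that are not zero. The sketch below assumes a header line starting with Iface, as in the Linux example above, and reads the column positions from that header; the function name is hypothetical.

```shell
# Read "netstat -i" output on stdin and list interfaces whose error, drop
# or overrun counters are not zero. Column positions are taken from the
# header line, so a fixed column order is not assumed.
netstat_iface_problems() {
    awk '
        $1 == "Iface" {
            ncol = NF
            for (i = 1; i <= NF; i++)
                if ($i ~ /^(RX|TX)-(ERR|DRP|OVR)$/) name[i] = $i
            next
        }
        {
            msg = ""
            for (i = 1; i <= ncol; i++)
                if ((i in name) && $i + 0 > 0) msg = msg " " name[i] "=" $i
            if (msg != "") print $1 ":" msg
        }'
}

# Usage: netstat -i | netstat_iface_problems
```

Applied to the sample above, em3 and em4 stand out with large RX-ERR and RX-OVR counts, while lo is not listed at all.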



 

Cause

The information about each file is stored in a data structure called an inode. Inodes are stored permanently on disk. When VxFS (Veritas File System) or CFS needs to access this information, the inodes are read from disk and stored in the inode cache, which is part of the kernel memory.

The following factors can affect the performance of inode access.

1. Whether the inodes are cached in the kernel memory
2. Whether the CFS inodes have masterless locks or normalized locks
3. Whether lazy_isize_enable is enabled
4. Whether the inode information is accessed from more than one node in parallel

Resolution

The following are detailed descriptions for each factor.

Issue/Introduction

Cluster File System (CFS) file information access is slow. File information can be accessed through commands like ls -l, and the space used by files can be obtained with the du command. In a CFS environment, a number of factors can affect the performance of obtaining this file information.