Oracle Database performance impact with Extent Fragmentation in file system

Article ID: 100025766

Updated On:

Description

Error Message

 

Severe VxFS extent fragmentation was identified to be the root cause of the performance degradation.
 
Oracle file extent fragmentation can be observed using the ‘/opt/VRTS/bin/fsadm -Ef’ command. An Oracle file can be considered highly fragmented if the “Average # Extents” field shows several thousand extents. The following output demonstrates a file with 22938 extents:
 
/opt/VRTS/bin/fsadm -Ef TESTFILE.dbf
  Extent Fragmentation Report
        Total    Average      Average     Total
        Files    File Blks    # Extents   Free Blks
            1    33554840       22938   478123039
    blocks used for indirects: 416
    % Free blocks in extents smaller than 64 blks: 0.08
    % Free blocks in extents smaller than  8 blks: 0.00
    % blks allocated to extents 64 blks or larger: 99.82
    Free Extents By Size
           1:          5            2:          5            4:          4   
           8:      26198           16:      10533           32:         14   
          64:          1          128:          0          256:          1   
         512:          0         1024:          1         2048:          1   
        4096:          0         8192:          0        16384:          1   
       32768:          1        65536:          1       131072:          0   
      262144:          0       524288:          1      1048576:          0
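The check described above can be automated. The following is a minimal Python sketch, assuming the report has the layout shown in the sample output; the function names and the 1000-extent threshold are illustrative assumptions, not part of any Veritas tool.

```python
import re

# Summary row of the report: four integer columns
# (Total Files, Average File Blks, Average # Extents, Total Free Blks).
SUMMARY_ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")

SAMPLE_REPORT = """\
  Extent Fragmentation Report
        Total    Average      Average     Total
        Files    File Blks    # Extents   Free Blks
            1    33554840       22938   478123039
    blocks used for indirects: 416
"""


def parse_summary(report):
    """Return the summary row of an 'fsadm -Ef' style report as a dict."""
    for line in report.splitlines():
        match = SUMMARY_ROW.match(line)
        if match:
            files, avg_blks, avg_extents, free_blks = map(int, match.groups())
            return {
                "files": files,
                "avg_file_blks": avg_blks,
                "avg_extents": avg_extents,
                "free_blks": free_blks,
            }
    raise ValueError("no summary row found in report")


def is_highly_fragmented(summary, extent_threshold=1000):
    """Flag a file whose average extent count exceeds a threshold.

    The threshold is an assumption chosen for illustration; the article
    only says 'several thousand extents'.
    """
    return summary["avg_extents"] > extent_threshold


if __name__ == "__main__":
    summary = parse_summary(SAMPLE_REPORT)
    print(summary, is_highly_fragmented(summary))
```

Feeding the script the captured output of the command for each datafile gives a quick list of candidates for defragmentation.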
 
Additionally, the ‘/opt/VRTS/bin/fsmap -aH’ command shows the size of each extent:
 
/opt/VRTS/bin/fsmap -aH TESTFILE.dbf
                    Volume Extent Type     File Offset     Extent Size     File
                     vol01         Data         0 Bytes         8.00 KB     TESTFILE.dbf
                     vol01         Data         8.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data        16.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data        24.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data        32.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data        40.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data        48.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data        56.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data        64.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data        72.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data        80.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data        88.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data        96.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data       104.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data       112.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data       120.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data       128.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data       136.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data       144.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data       152.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data       160.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data       168.00 KB         8.00 KB     TESTFILE.dbf
 
In this case, every extent is only 8KB, so the file is considered extremely fragmented and corrective action should be taken.
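A per-extent listing like the one above can be summarized with a short script. This is a sketch that assumes the seven-column, human-readable layout shown in the sample; the unit table and function names are assumptions made for illustration.

```python
from collections import Counter

# Unit suffixes assumed to appear in human-readable output.
UNITS = {"Bytes": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

SAMPLE_MAP = """\
                    Volume Extent Type     File Offset     Extent Size     File
                     vol01         Data         0 Bytes         8.00 KB     TESTFILE.dbf
                     vol01         Data         8.00 KB         8.00 KB     TESTFILE.dbf
                     vol01         Data        16.00 KB         8.00 KB     TESTFILE.dbf
"""


def parse_extent_map(text):
    """Return a list of (offset_bytes, size_bytes) tuples, one per extent."""
    extents = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 7 fields: volume, type, offset, unit, size, unit, file.
        if len(fields) != 7 or fields[3] not in UNITS or fields[5] not in UNITS:
            continue  # header or unrelated line
        offset = int(float(fields[2]) * UNITS[fields[3]])
        size = int(float(fields[4]) * UNITS[fields[5]])
        extents.append((offset, size))
    return extents


def size_histogram(extents):
    """Count how many extents exist for each extent size in bytes."""
    return Counter(size for _, size in extents)


if __name__ == "__main__":
    extents = parse_extent_map(SAMPLE_MAP)
    for size, count in sorted(size_histogram(extents).items()):
        print(f"{size:>12} bytes: {count} extents")
```

A histogram dominated by the smallest size, as in this sample, points to the same conclusion as the listing itself.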
 
Oracle files can have another type of fragmentation that is internal and not caused by VxFS extent fragmentation. This can be seen by tracing the system calls issued by the Oracle process while a sequential workload is in progress:
10325/1:         0.0001 lseek(256, 106496, SEEK_SET)                    = 106496
10325/1:         0.0006 readv(256, 0xFFFFFFFF7FFF0C78, 16)              = 131072
10325/1:         0.0013 lseek(256, 253952, SEEK_SET)                    = 253952
10325/1:         0.0005 readv(256, 0xFFFFFFFF7FFF0C78, 16)              = 131072
10325/1:         0.0033 readv(256, 0xFFFFFFFF7FFF0C78, 2)               = 16384
10325/1:         0.0003 lseek(256, 417792, SEEK_SET)                   = 417792
10325/1:         0.0002 readv(256, 0xFFFFFFFF7FFF0C78, 4)               = 32768
10325/1:         0.0004 lseek(256, 12722176, SEEK_SET)                  = 12722176
10325/1:         0.0018 readv(256, 0xFFFFFFFF7FFF0C78, 16)              = 131072
10325/1:         0.0012 lseek(256, 12869632, SEEK_SET)                  = 12869632
10325/1:         0.0005 readv(256, 0xFFFFFFFF7FFF0C78, 16)              = 131072
 
The trace shows many lseek operations, which can have an impact on performance. Compare it with the same trace after an Oracle export/import operation:
 
6270/1:          0.0008 readv(257, 0xFFFFFFFF7FFF4548, 16)              = 131072
6270/1:          0.0015 readv(257, 0xFFFFFFFF7FFF4548, 4)               = 32768
6270/1:          0.0001 lseek(257, 0x02044000, SEEK_SET)                = 0x02044000
6270/1:          0.0005 readv(257, 0xFFFFFFFF7FFF4548, 16)              = 131072
6270/1:          0.0006 readv(257, 0xFFFFFFFF7FFF4548, 3)               = 24576
6270/1:          0.0006 readv(257, 0xFFFFFFFF7FFF4548, 16)              = 131072
6270/1:          0.0007 readv(257, 0xFFFFFFFF7FFF4548, 4)               = 32768
6270/1:          0.0005 readv(257, 0xFFFFFFFF7FFF4548, 16)              = 131072
6270/1:          0.0002 readv(257, 0xFFFFFFFF7FFF4548, 4)               = 32768
6270/1:          0.0002 lseek(257, 0x020BC000, SEEK_SET)                = 0x020BC000
6270/1:          0.0006 readv(257, 0xFFFFFFFF7FFF4548, 16)              = 131072
6270/1:          0.0006 readv(257, 0xFFFFFFFF7FFF4548, 3)               = 24576
6270/1:          0.0007 readv(257, 0xFFFFFFFF7FFF4548, 16)              = 131072
6270/1:          0.0004 readv(257, 0xFFFFFFFF7FFF4548, 4)               = 32768
6270/1:          0.0005 readv(257, 0xFFFFFFFF7FFF4548, 16)              = 131072
6270/1:          0.0003 readv(257, 0xFFFFFFFF7FFF4548, 4)               = 32768
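The difference between the two traces can be quantified by counting seeks against reads. The sketch below assumes the trace line layout shown above (process/LWP, elapsed time, system call); the helper name and the shortened sample excerpts are illustrative only.

```python
import re
from collections import Counter

# Matches lines such as: "10325/1:   0.0001 lseek(256, 106496, SEEK_SET) = 106496"
CALL = re.compile(r"^\s*\d+/\d+:\s+[\d.]+\s+(\w+)\(")

BEFORE = """\
10325/1:         0.0001 lseek(256, 106496, SEEK_SET)                    = 106496
10325/1:         0.0006 readv(256, 0xFFFFFFFF7FFF0C78, 16)              = 131072
10325/1:         0.0013 lseek(256, 253952, SEEK_SET)                    = 253952
10325/1:         0.0005 readv(256, 0xFFFFFFFF7FFF0C78, 16)              = 131072
10325/1:         0.0033 readv(256, 0xFFFFFFFF7FFF0C78, 2)               = 16384
"""

AFTER = """\
6270/1:          0.0008 readv(257, 0xFFFFFFFF7FFF4548, 16)              = 131072
6270/1:          0.0015 readv(257, 0xFFFFFFFF7FFF4548, 4)               = 32768
6270/1:          0.0001 lseek(257, 0x02044000, SEEK_SET)                = 0x02044000
6270/1:          0.0005 readv(257, 0xFFFFFFFF7FFF4548, 16)              = 131072
6270/1:          0.0006 readv(257, 0xFFFFFFFF7FFF4548, 3)               = 24576
"""


def seeks_and_reads(trace):
    """Return (lseek_count, readv_count) for a system call trace."""
    counts = Counter()
    for line in trace.splitlines():
        match = CALL.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts["lseek"], counts["readv"]


if __name__ == "__main__":
    for label, trace in (("before", BEFORE), ("after", AFTER)):
        seeks, reads = seeks_and_reads(trace)
        print(f"{label}: {seeks} lseek per {reads} readv")
```

A lower number of seeks per read indicates that the data is laid out more sequentially inside the file.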
 

 

Cause

 

Oracle file creation and file extension operations can cause VxFS extent fragmentation because of the I/O pattern Oracle uses in a non-ODM environment, as the following trace of a datafile creation shows:
 
0.0001 stat("/u01/oradata/TEST/T1.dbf", 0xFFFFFFFF7FFF8458) Err#2 ENOENT
11343/1:         0.0002 open("/u01/oradata/TEST/T1.dbf", O_RDWR|O_SYNC|O_CREAT|O_EXCL, 0660) = 12
11343/1:         0.0001 fstat(12, 0xFFFFFFFF7FFF8458)                   = 0
11343/1:         0.0000 fstatvfs(12, 0xFFFFFFFF7FFF8308)                = 0
11343/1:         0.0001 lseek(12, 0, SEEK_SET)                          = 0
11343/1:         0.0007 write(12, "\0A2\0\0FFC0\0\0\0\0\0\0".., 8192)   = 8192
11343/1:         0.0008 fcntl(12, F_FREESP, 0xFFFFFFFF7FFF8350)         = 0
11343/1:         0.0001 close(12)                                       = 0
 
The file is created, the file pointer is set to offset 0, and the first 8KB is written. Of interest to VxFS here is the fcntl(12, F_FREESP, …) call, which sets the size of the file by punching a “hole” at the end of the inode. This creates a sparse file. Oracle then initializes the file by allocating space within this “hole”:
 
11343/1:         0.0000 open("/u01/test.dbf", O_RDWR) = 12
11343/1:         0.0002 getrlimit(RLIMIT_NOFILE, 0xFFFFFFFF7FFF8828)    = 0
11343/1:         0.0001 fcntl(12, F_DUPFD, 0x00000100)                  = 259
11343/1:         0.0000 close(12)                                       = 0
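The create-then-extend sequence described above can be imitated outside Oracle for experiments. The following Python sketch is only an approximation: it uses the portable os.ftruncate call as a stand-in for the Solaris fcntl(F_FREESP) call in the trace, omits O_SYNC, and uses a hypothetical file name and size.

```python
import os
import tempfile


def create_sparse_datafile(path, size_bytes, header=b"\0" * 8192):
    """Create a file, write an 8 KB header, then set the final size.

    Setting the size past the written data leaves a hole, so no data
    blocks are allocated for the remainder until it is written.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o660)
    try:
        os.write(fd, header)          # first 8 KB, as in the trace
        os.ftruncate(fd, size_bytes)  # stand-in for fcntl(fd, F_FREESP, ...)
    finally:
        os.close(fd)


def demo():
    """Create a 16 MB sparse file in a temporary directory and report its size."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "T1.dbf")  # hypothetical file name
        create_sparse_datafile(path, 16 * 1024 * 1024)
        return os.path.getsize(path)


if __name__ == "__main__":
    print("logical size:", demo())
```

On file systems that support sparse files, the logical size reported here is larger than the space actually allocated until the hole is filled by later writes.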
 
Oracle then issues a series of 1MB writes into the hole using 4 LWPs, each writing in parallel to a different offset within the file:
 
11343/5:         0.0076 pwrite(259, "\0A2\0\0\0\0\001\0\0\0\0".., 1040384, 8192) = 1040384
11343/7:         0.0099 pwrite(259, "\0A2\0\0\0\0\080\0\0\0\0".., 1048576, 1048576) = 1048576
11343/9:         0.0106 pwrite(259, "\0A2\0\0\0\001\0\0\0\0\0".., 1048576, 2097152) = 1048576
11343/5:         0.0109 pwrite(259, "\0A2\0\0\0\002\0\0\0\0\0".., 1048576, 4194304) = 1048576
11343/7:         0.0130 pwrite(259, "\0A2\0\0\0\00280\0\0\0\0".., 1048576, 5242880) = 1048576
11343/9:         0.0132 pwrite(259, "\0A2\0\0\0\003\0\0\0\0\0".., 1048576, 6291456) = 1048576
11343/11:        0.0264 pwrite(259, "\0A2\0\0\0\00180\0\0\0\0".., 1048576, 3145728) = 1048576
11343/5:         0.0144 pwrite(259, "\0A2\0\0\0\004\0\0\0\0\0".., 1048576, 8388608) = 1048576
11343/7:         0.0152 pwrite(259, "\0A2\0\0\0\00480\0\0\0\0".., 1048576, 9437184) = 1048576
11343/9:         0.0148 pwrite(259, "\0A2\0\0\0\005\0\0\0\0\0".., 1048576, 10485760) = 1048576
11343/11:        0.0204 pwrite(259, "\0A2\0\0\0\00380\0\0\0\0".., 1048576, 7340032) = 1048576
11343/5:         0.0131 pwrite(259, "\0A2\0\0\0\006\0\0\0\0\0".., 1048576, 12582912) = 1048576
11343/7:         0.0130 pwrite(259, "\0A2\0\0\0\00680\0\0\0\0".., 1048576, 13631488) = 1048576
11343/9:         0.0142 pwrite(259, "\0A2\0\0\0\007\0\0\0\0\0".., 1048576, 14680064) = 1048576
11343/11:        0.0200 pwrite(259, "\0A2\0\0\0\00580\0\0\0\0".., 1048576, 11534336) = 1048576
11343/5:         0.0145 pwrite(259, "\0A2\0\0\0\0\b\0\0\0\0\0".., 1048576, 0x01000000) = 1048576
11343/7:         0.0169 pwrite(259, "\0A2\0\0\0\0\b80\0\0\0\0".., 1048576, 0x01100000) = 1048576
11343/9:         0.0168 pwrite(259, "\0A2\0\0\0\0\t\0\0\0\0\0".., 1048576, 0x01200000) = 1048576
11343/11:        0.0225 pwrite(259, "\0A2\0\0\0\00780\0\0\0\0".., 1048576, 15728640) = 1048576
11343/5:         0.0210 pwrite(259, "\0A2\0\0\0\0\n\0\0\0\0\0".., 1048576, 0x01400000) = 1048576
11343/7:         0.0214 pwrite(259, "\0A2\0\0\0\0\n80\0\0\0\0".., 1048576, 0x01500000) = 1048576
11343/9:         0.0208 pwrite(259, "\0A2\0\0\0\0\v\0\0\0\0\0".., 1048576, 0x01600000) = 1048576
11343/11:        0.0278 pwrite(259, "\0A2\0\0\0\0\t80\0\0\0\0".., 1048576, 0x01300000) = 1048576
11343/5:         0.0151 pwrite(259, "\0A2\0\0\0\0\f\0\0\0\0\0".., 1048576, 0x01800000) = 1048576
11343/7:         0.0145 pwrite(259, "\0A2\0\0\0\0\f80\0\0\0\0".., 1048576, 0x01900000) = 1048576
 

This I/O workload, with 4 parallel writers writing to the same file, is not optimal for the VxFS extent allocator and can result in fragmentation. The number of writers is determined automatically by Oracle based on hardware considerations.
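The interleaving in a trace like the one above can be measured by checking how often a write does not start where the previous write ended. The sketch below assumes the pwrite line layout shown in the trace; the regular expression, function name and the seven-line sample excerpt are illustrative assumptions.

```python
import re

# Captures the LWP id, the write length and the file offset (decimal or hex).
PWRITE = re.compile(
    r"^\s*\d+/(\d+):.*pwrite\(.*,\s*(\d+),\s*(0x[0-9A-Fa-f]+|\d+)\)\s*="
)

SAMPLE_TRACE = r'''
11343/5:         0.0076 pwrite(259, "\0A2\0\0\0\0\001\0\0\0\0".., 1040384, 8192) = 1040384
11343/7:         0.0099 pwrite(259, "\0A2\0\0\0\0\080\0\0\0\0".., 1048576, 1048576) = 1048576
11343/9:         0.0106 pwrite(259, "\0A2\0\0\0\001\0\0\0\0\0".., 1048576, 2097152) = 1048576
11343/5:         0.0109 pwrite(259, "\0A2\0\0\0\002\0\0\0\0\0".., 1048576, 4194304) = 1048576
11343/7:         0.0130 pwrite(259, "\0A2\0\0\0\00280\0\0\0\0".., 1048576, 5242880) = 1048576
11343/9:         0.0132 pwrite(259, "\0A2\0\0\0\003\0\0\0\0\0".., 1048576, 6291456) = 1048576
11343/11:        0.0264 pwrite(259, "\0A2\0\0\0\00180\0\0\0\0".., 1048576, 3145728) = 1048576
'''


def analyze_pwrites(trace):
    """Return (number_of_writers, non_contiguous_writes) for a trace."""
    writes = []
    for line in trace.splitlines():
        match = PWRITE.match(line)
        if match:
            lwp = int(match.group(1))
            length = int(match.group(2))
            offset = int(match.group(3), 0)  # base 0 accepts decimal and 0x hex
            writes.append((lwp, length, offset))

    writers = {lwp for lwp, _, _ in writes}
    jumps = 0
    expected = None  # offset where a strictly sequential writer would continue
    for _, length, offset in writes:
        if expected is not None and offset != expected:
            jumps += 1
        expected = offset + length
    return len(writers), jumps


if __name__ == "__main__":
    writers, jumps = analyze_pwrites(SAMPLE_TRACE)
    print(f"{writers} writers, {jumps} non-contiguous writes")
```

Each non-contiguous write is a point where the allocator is asked for space that is not adjacent to the most recently allocated range.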

 

Resolution

 

There are several methods that can be used to mitigate this effect. VxFS tunables set with the /opt/VRTS/bin/vxtunefs command include:
 
initial_extent_size:
Changes the default size of the initial extent. VxFS determines, based on the first write to a new file, the size of the first extent to allocate to the file. Typically the first extent is the smallest power of 2 that is larger than the size of the first write. If that power of 2 is less than 8K, the first extent allocated is 8K. After the initial extent, the file system increases the size of subsequent extents (see max_seqio_extent_size) with each allocation.
Because most applications write to files using a buffer size of 8K or less, the increasing extents start doubling from a small initial extent. initial_extent_size changes the default initial extent size to a larger value, so the doubling policy starts from a much larger initial size, and the file system won't allocate a set of small extents at the start of the file.
Use this parameter only on file systems that have a very large average file size. On such file systems, there are fewer extents per file and less fragmentation.
initial_extent_size is measured in file system blocks.
max_seqio_extent_size:
Increases or decreases the maximum size of an extent. When the file system is following its default allocation policy for sequential writes to a file, it allocates an initial extent that is large enough for the first write to the file. When additional extents are allocated, they are progressively larger (the algorithm tries to double the size of the file with each new extent), so each extent can hold several writes' worth of data. This reduces the total number of extents in anticipation of continued sequential writes. When there are no more writes to the file, unused space is freed for other files to use.
In general, this allocation stops increasing the size of extents at 2048 blocks, which prevents one file from holding too much unused space.
max_seqio_extent_size is measured in file system blocks.
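The doubling policy described for these two tunables can be illustrated with a simplified model. The Python sketch below is not the VxFS allocator: it assumes a single sequential writer, an initial extent of 8 blocks and a cap of 2048 blocks as defaults, and that each new extent tries to double the file size up to the cap.

```python
def sequential_extents(file_blocks, initial_extent=8, max_extent=2048):
    """Return the extent sizes (in blocks) of an idealized sequential write.

    Simplified model of the documented policy: the first extent has
    initial_extent blocks; every later extent tries to double the file,
    capped at max_extent. Unused space in the last extent is released.
    """
    extents = []
    allocated = 0
    while allocated < file_blocks:
        if not extents:
            size = initial_extent
        else:
            size = min(allocated, max_extent)  # doubling the file, up to the cap
        size = min(size, file_blocks - allocated)  # trailing unused space is freed
        extents.append(size)
        allocated += size
    return extents


if __name__ == "__main__":
    # A 500 MB file expressed in 1 KB blocks (the block size is an assumption).
    blocks = 500 * 1024
    print("defaults:", len(sequential_extents(blocks)), "extents")
    print("tuned:   ", len(sequential_extents(blocks, 32768, 65536)), "extents")
```

The model only describes the best case for one sequential writer; with several writers filling a hole at different offsets, the real extent count can be far higher than the model predicts.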
 
These settings can help depending on the workload, but not in all cases. Acceptable results were achieved by setting the VxFS tunables “initial_extent_size” and “max_seqio_extent_size” to 32768 and then to 65536. In each case, when a new 500MB database file was created, the first extent was 32MB or 64MB respectively, but there were hundreds of smaller extents after that. The “initial_extent_size” tunable works for the first extent, but “max_seqio_extent_size” is not effective for sizing subsequent extents.
 
Another way to influence the VxFS extent allocator is the ‘setext’ command. ‘setext -r 1048576’ will reserve 1GB of space (1048576 file system blocks, assuming the 1KB block size implied by the figures above) in as few VxFS extents as possible. Any allocations done within the first 1GB of such a file will not require interaction with the extent allocator, so no fragmentation will occur. This method requires knowing how large the file might become, so it may not be a good solution for long-lived, dynamic Oracle database files whose size is constantly adjusted by the AUTOEXTEND feature.
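Because reservations and tunables are expressed in file system blocks, it helps to convert sizes explicitly. The following sketch assumes a 1KB file system block size, which is consistent with the figures in this article but should be checked on the actual file system; the function names are illustrative.

```python
def bytes_to_blocks(size_bytes, block_size=1024):
    """Convert a size in bytes to file system blocks.

    block_size=1024 is an assumption; use the block size of the
    file system in question.
    """
    if size_bytes % block_size:
        raise ValueError("size is not a whole number of blocks")
    return size_bytes // block_size


def blocks_to_mb(blocks, block_size=1024):
    """Convert a block count to megabytes (1 MB = 1024 * 1024 bytes)."""
    return blocks * block_size / (1024 * 1024)


if __name__ == "__main__":
    print("1 GB reservation :", bytes_to_blocks(1024**3), "blocks")
    print("32768 blocks     :", blocks_to_mb(32768), "MB")
    print("65536 blocks     :", blocks_to_mb(65536), "MB")
```

With a larger block size the same byte count corresponds to proportionally fewer blocks, so the values passed to the commands need to be scaled accordingly.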
 
An additional ‘setext’ command can be used that will also control the VxFS extent allocator: ‘setext -e 32m’. This command will always attempt to allocate space in 32MB extents for any new allocations. It works best after the file has first been defragmented with commands such as ‘/opt/VRTS/bin/fsadm -ef’ (per-file defragmentation) or ‘/opt/VRTS/bin/fsadm -e’ (per-filesystem defragmentation).
 
Another method of defragmenting existing files, used in a recent customer escalation, is to set the minimum extent size for the file and then use the Veritas /opt/VRTS/bin/cp command to copy the fragmented file to a new file name. The following commands require the Oracle instance to be down:
/opt/VRTS/bin/setext -e 32m
/opt/VRTS/bin/cp -e warn
 
Then rename the new file back to the original file name. The ‘cp’ command will use 32MB extents and reduce fragmentation.
 
Setting the extent size to 32MB using ‘setext -e 32m’ forces VxFS to create 32MB extents (when possible) for allocating writes. This works quite well and minimizes extent fragmentation, which results in better performance.
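The copy-based procedure can be written down as an ordered command plan before it is run by hand. The sketch below only builds the command lists and executes nothing. The placement of the file arguments, the ".new" suffix, the use of mv for the rename and the example path are assumptions, since the article does not show the arguments; verify them against the setext and cp manual pages, and run the steps only while the Oracle instance is down.

```python
def copy_defrag_plan(datafile, extent_size="32m"):
    """Return the commands for the set-extent, copy, rename procedure.

    Nothing is executed. Argument order and the temporary file name are
    assumptions made for illustration.
    """
    new_file = datafile + ".new"  # hypothetical temporary name
    return [
        # 1. Set the minimum extent size on the existing file.
        ["/opt/VRTS/bin/setext", "-e", extent_size, datafile],
        # 2. Copy with the Veritas cp so extent attributes are carried over.
        ["/opt/VRTS/bin/cp", "-e", "warn", datafile, new_file],
        # 3. Rename the copy back to the original name.
        ["mv", new_file, datafile],
    ]


if __name__ == "__main__":
    # Hypothetical datafile path, used only to show the output format.
    for step in copy_defrag_plan("/u01/oradata/TEST/T1.dbf"):
        print(" ".join(step))
```

Printing the plan first makes it easy to review each step and confirm there is enough free space for the copy before anything is changed.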
 
In conclusion, our tests indicate:
  • Oracle data files can also be fragmented at creation time, because parallel writer threads access the same file at different offsets.
  • ODM handles AUTOEXTEND well.
  • If ODM is not implemented, avoid using AUTOEXTEND; instead, add a new datafile to make extra space in the tablespace.
  • If AUTOEXTEND is used without ODM on a VxFS file system, setting the minimum extent size using ‘setext -e’ will prevent future extents from being created smaller than the specified size, thus preventing new fragmentation. Defragmenting the file system using ‘fsadm -e’ will minimize existing fragmentation.
 

 


Applies To

Oracle datafiles created in VxFS mounted file systems in a non-ODM environment

Issue/Introduction

Performance degradation was observed in Oracle under a sequential workload when the AUTOEXTEND feature was enabled in a non-ODM environment on a VxFS file system.