Issue
- Do we need to return capacity using the fstrim command?
- Is there any way to automatically reclaim disk space for deleted data without performing fstrim?
- If we don't perform fstrim, can the OSDs become full even though files are deleted?
Resolution
Do we need to return capacity using the fstrim command?
In a traditional file system, deleting a file marks the corresponding directory entry and inode pointers as "not in use", but does not erase the data in the underlying data blocks.
For a new write, the file system allocator will allocate blocks that are marked as "not in use".
This behavior is the same for Ceph.
The file system has deleted the file, and hence its reference, but has not cleared the underlying blocks. As a result the file is no longer visible in the RBD-mapped mount point, but the corresponding objects are still present on the RBD device.
A new write will either overwrite these objects or create new ones, as required.
Therefore, it is normal for objects to remain after a file has been deleted, and it is not necessary to forcibly reclaim them unless you need to free up pool capacity temporarily.
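If you do need to release the unused blocks back to Ceph right away (for example, to temporarily free pool capacity), fstrim can be run manually against the mount point. A minimal sketch, assuming the file system on the RBD device is mounted on /mnt:
# fstrim -v /mnt
The -v option prints how many bytes were trimmed. The device and the file system must both support TRIM for this to have any effect.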
The tests below verify that the remaining objects are reused.
STEP 1) The pool used for testing is rbd and the mount point is /mnt. A 128M test file exists, and the number of existing objects is 54.
[root@node2 ~]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
3058M 2814M 244M 7.98
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 138M 4.53 0 54 <---- there are 54 objects in rbd.
[root@node2 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rhel-root 5.8G 1.1G 4.8G 18% /
devtmpfs 487M 0 487M 0% /dev
tmpfs 497M 0 497M 0% /dev/shm
tmpfs 497M 6.7M 490M 2% /run
tmpfs 497M 0 497M 0% /sys/fs/cgroup
/dev/sda1 497M 125M 373M 25% /boot
/dev/sdb1 1020M 35M 986M 4% /var/lib/ceph/osd/ceph-1
tmpfs 100M 0 100M 0% /run/user/0
/dev/rbd0 10G 161M 9.9G 2% /mnt
[root@node2 ~]# ls -alh /mnt/test
-rw-r--r-- 1 root root 128M Mar 8 23:19 /mnt/test
STEP 2) Verify that the number of objects stays the same after deleting the file.
[root@node2 ~]# rm /mnt/test
rm: remove regular file ‘/mnt/test’? y
[root@node2 ~]# ceph -s
cluster 78c40b72-af41-4d40-8d42-12f79c36747f
.....
osdmap e127: 3 osds: 3 up, 3 in
pgmap v383: 128 pgs, 1 pools, 138 MB data, 54 objects <------ Remaining Objects
244 MB used, 2814 MB / 3058 MB avail
STEP 3) Create two files and check the number of objects to confirm that the existing objects are reused.
[root@node2 ~]# dd if=/dev/zero of=/mnt/test bs=1M count=128
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 0.0707993 s, 1.9 GB/s
[root@node2 ~]#
[root@node2 ~]# ls -alh /mnt/test
-rw-r--r-- 1 root root 128M Mar 8 23:48 /mnt/test <----- Confirm generation of test file
[root@node2 ~]# ceph -s
cluster 78c40b72-af41-4d40-8d42-12f79c36747f
.......
osdmap e127: 3 osds: 3 up, 3 in
pgmap v389: 128 pgs, 1 pools, 138 MB data, 54 objects <------ The number of objects is the same
244 MB used, 2814 MB / 3058 MB avail
[root@node2 ~]# dd if=/dev/zero of=/mnt/test1 bs=1M count=128 <------ Create the test1 file
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 0.0898858 s, 1.5 GB/s
[root@node2 ~]# ls -alh /mnt/*
-rw-r--r-- 1 root root 128M Mar 8 23:48 /mnt/test
-rw-r--r-- 1 root root 128M Mar 8 23:49 /mnt/test1
[root@node2 ~]#
[root@node2 ~]# ceph -s
cluster 78c40b72-af41-4d40-8d42-12f79c36747f
....
osdmap e127: 3 osds: 3 up, 3 in
pgmap v393: 128 pgs, 1 pools, 266 MB data, 86 objects <---------- the object count increases
495 MB used, 2563 MB / 3058 MB avail
[root@node2 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rhel-root 5.8G 1.1G 4.8G 18% /
devtmpfs 487M 0 487M 0% /dev
tmpfs 497M 0 497M 0% /dev/shm
tmpfs 497M 6.7M 490M 2% /run
tmpfs 497M 0 497M 0% /sys/fs/cgroup
/dev/sda1 497M 125M 373M 25% /boot
/dev/sdb1 1020M 35M 986M 4% /var/lib/ceph/osd/ceph-1
tmpfs 100M 0 100M 0% /run/user/0
/dev/rbd0 10G 289M 9.8G 3% /mnt
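For reference, the space an RBD image actually consumes at the Ceph level can be estimated by summing the extents reported by rbd diff. This is only a sketch, and the image name rbd/test-image is an assumption; replace it with one of the images shown by "rbd ls":
# rbd diff rbd/test-image | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'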
Is there any way to automatically reclaim disk space for deleted data without performing fstrim?
The actual capacity currently in use can be confirmed with the "df -h" command, and the space belonging to deleted data can be reclaimed automatically with the discard mount option.
The discard mount option behaves similarly to fstrim: when files are deleted, the corresponding objects are cleaned up on the backend by using the TRIM support of the underlying device.
To use this option, both the disk and the file system must support TRIM. For reference, most recent disks and file systems support TRIM.
However, the discard option can cause performance degradation, because a TRIM request is issued for every block that is freed.
The discard option is used as follows:
#mount -o discard [device] [mount point]
Example)
#mount -o discard /dev/rbd1 /mnt/rbd
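To make the option persistent across remounts, discard can also be added to the mount options in /etc/fstab. The entry below is only a sketch, assuming an xfs file system on /dev/rbd1 that is mapped before the mount is attempted (for example by the rbdmap service); adjust the device, mount point, and file system type to your environment. The noauto option keeps the boot from hanging if the RBD device has not been mapped yet:
/dev/rbd1  /mnt/rbd  xfs  defaults,noauto,discard  0 0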
If we don't perform fstrim, can the OSDs become full even though files are deleted?
No. Not reclaiming the capacity with fstrim will not by itself cause the OSDs to become full.
As mentioned above, the objects left behind by deleted files are reused for new writes, so they do not keep accumulating as additional usage.
The criterion for determining OSD full is calculated from the "available" space of each OSD, compared against the full_ratio, rather than from the "used" value.
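Per-OSD utilization can be checked against these thresholds with the commands below (a short sketch; "ceph osd df" is available in recent Ceph releases):
# ceph osd df                          <---- per-OSD used/available space and %USE
# ceph pg dump | grep ratio            <---- full_ratio and nearfull_ratio thresholds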
The following test demonstrates this behavior.
STEP 1) Check the current disk usage, the number of objects, and the full ratio values.
[root@node2 ~]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
3058M 2199M 859M 28.10
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 414M 13.54 0 115
[root@node2 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rhel-root 5.8G 1.1G 4.8G 18% /
devtmpfs 487M 0 487M 0% /dev
tmpfs 497M 0 497M 0% /dev/shm
tmpfs 497M 6.7M 490M 2% /run
tmpfs 497M 0 497M 0% /sys/fs/cgroup
/dev/sda1 497M 125M 373M 25% /boot
/dev/sdb1 1020M 35M 986M 4% /var/lib/ceph/osd/ceph-1
tmpfs 100M 0 100M 0% /run/user/0
/dev/rbd0 1014M 433M 582M 43% /mnt
[root@node2 ~]# ceph pg dump | head
dumped all in format plain
version 474
stamp 2017-03-09 01:16:45.895113
last_osdmap_epoch 141
last_pg_scan 21
full_ratio 0.9 <---- check the full ratio
nearfull_ratio 0.8 <---- check the nearfull ratio
..
STEP 2) Create files to make an OSD full.
[root@node2 ~]# dd if=/dev/zero of=/mnt/test5 bs=1M count=100 <---- add more files to fill osd
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.0605148 s, 1.7 GB/s
[root@node2 ~]# dd if=/dev/zero of=/mnt/test6 bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.0931641 s, 1.1 GB/s
[root@node2 ~]# dd if=/dev/zero of=/mnt/test7 bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 1.4708 s, 71.3 MB/s
STEP 3) Verify that one OSD is full due to the file creation, and check the number of objects with the "ceph df" command.
[root@node2 ~]# ceph -s
cluster 78c40b72-af41-4d40-8d42-12f79c36747f
health HEALTH_ERR
......
1 full osd(s) <---- Confirm that one of the OSDs is full
......
[root@node2 ~]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
3058M 2002M 1056M 34.55
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 629M 20.57 0 169
[root@node2 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rhel-root 5.8G 1.1G 4.8G 18% /
devtmpfs 487M 0 487M 0% /dev
tmpfs 497M 0 497M 0% /dev/shm
tmpfs 497M 6.7M 490M 2% /run
tmpfs 497M 0 497M 0% /sys/fs/cgroup
/dev/sda1 497M 125M 373M 25% /boot
/dev/sdb1 1020M 35M 986M 4% /var/lib/ceph/osd/ceph-1
tmpfs 100M 0 100M 0% /run/user/0
/dev/rbd0 1014M 733M 282M 73% /mnt <---- check usage
STEP 4) Confirm that the OSD full condition is cleared by deleting files, without running fstrim.
[root@node2 ~]# rm /mnt/test* <---- Clear the OSD full condition by deleting the files
rm: remove regular file ‘/mnt/test’? y
rm: remove regular file ‘/mnt/test1’? y
rm: remove regular file ‘/mnt/test3’? y
rm: remove regular file ‘/mnt/test4’? y
rm: remove regular file ‘/mnt/test5’? y
rm: remove regular file ‘/mnt/test6’? y
[root@node2 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rhel-root 5.8G 1.1G 4.8G 18% /
devtmpfs 487M 0 487M 0% /dev
tmpfs 497M 0 497M 0% /dev/shm
tmpfs 497M 6.7M 490M 2% /run
tmpfs 497M 0 497M 0% /sys/fs/cgroup
/dev/sda1 497M 125M 373M 25% /boot
/dev/sdb1 1020M 35M 986M 4% /var/lib/ceph/osd/ceph-1
tmpfs 100M 0 100M 0% /run/user/0
/dev/rbd0 1014M 33M 982M 4% /mnt
[root@node2 ~]# ceph -s
cluster 78c40b72-af41-4d40-8d42-12f79c36747f
health HEALTH_WARN
Monitor clock skew detected <---- The full osd(s) error is gone; the remaining warning is unrelated clock skew
monmap e1: 3 mons at {node1=192.168.50.1:6789/0,node2=192.168.50.2:6789/0,node3=192.168.50.3:6789/0}
election epoch 112, quorum 0,1,2 node1,node2,node3
osdmap e143: 3 osds: 3 up, 3 in
pgmap v493: 128 pgs, 1 pools, 698 MB data, 187 objects
871 MB used, 2186 MB / 3058 MB avail
........
[root@node2 ~]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
3058M 2186M 871M 28.51
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 698M 22.83 0 187
From the test, even though the files were deleted without running fstrim, the number of objects does not decrease, but the OSD full condition disappears.
Therefore, you can confirm that OSD full does not occur even when fstrim is not performed after deleting files.
Note that the object count of 169 at the time of the OSD full condition was taken while the writes were still in progress; the count after the writes completed is 187.
Root Cause
Like a traditional file system, Ceph does not delete the underlying objects when a file is deleted, so the objects still remain on the RBD device.
A new write will either overwrite these objects or create new ones, as required.
Because the objects are still present in the pool, 'ceph df' will show the pool as occupied by these objects even though they are no longer referenced by the file system.
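If you want to see these leftover objects directly, they can be listed from the pool by the image's object name prefix. A minimal sketch, assuming the image is named rbd/test-image; replace <block_name_prefix> with the value reported by rbd info:
# rbd info rbd/test-image | grep block_name_prefix    <---- object name prefix of the image
# rados -p rbd ls | grep <block_name_prefix> | wc -l  <---- count the objects belonging to the image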