Issue
- Do we need to return capacity using the fstrim command?
- Is there any way to automatically reclaim disk space for deleted data without performing fstrim?
- If we don't perform fstrim, can the OSDs become full even though files are deleted?
Resolution
Do we need to return capacity using the fstrim command?
In a traditional file system, deleting a file marks the corresponding directory entry and inode pointers as "not in use", but does not erase the data in the underlying data blocks.
For a new write, the file system allocator will allocate blocks that are marked as "not in use".
This behavior is the same for Ceph.
The file system has deleted the file, and hence its reference, but has not cleared the underlying blocks. As a result the file is no longer visible in the RBD-mapped mount point, but the corresponding objects are still present on the RBD device.
A new write will either overwrite these objects or create new ones, as required.
Therefore, it is normal for objects to remain after a file has been deleted, and it is not necessary to forcibly reclaim them unless you need to free up pool capacity temporarily.
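If you do need to release the unused blocks back to Ceph right away (for example, to temporarily free pool capacity), fstrim can be run manually against the mount point. A minimal sketch, assuming the file system on the RBD device is mounted on /mnt:
# fstrim -v /mnt
The -v option prints how many bytes were trimmed. The device and the file system must both support TRIM for this to have any effect.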
The tests below verify that the remaining objects are reused.
STEP 1) The pool used for testing is rbd and the mount point is /mnt. A 128M test file exists, and the number of existing objects is 54.
[root@node2 ~]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
3058M 2814M 244M 7.98
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 138M 4.53 0 54 <---- there are 54 objects in rbd.
[root@node2 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rhel-root 5.8G 1.1G 4.8G 18% /
devtmpfs 487M 0 487M 0% /dev
tmpfs 497M 0 497M 0% /dev/shm
tmpfs 497M 6.7M 490M 2% /run
tmpfs 497M 0 497M 0% /sys/fs/cgroup
/dev/sda1 497M 125M 373M 25% /boot
/dev/sdb1 1020M 35M 986M 4% /var/lib/ceph/osd/ceph-1
tmpfs 100M 0 100M 0% /run/user/0
/dev/rbd0 10G 161M 9.9G 2% /mnt
[root@node2 ~]# ls -alh /mnt/test
-rw-r--r-- 1 root root 128M Mar 8 23:19 /mnt/test
STEP 2) Verify that the number of objects stays the same after deleting the file.
[root@node2 ~]# rm /mnt/test
rm: remove regular file ‘/mnt/test’? y
[root@node2 ~]# ceph -s
cluster 78c40b72-af41-4d40-8d42-12f79c36747f
.....
osdmap e127: 3 osds: 3 up, 3 in
pgmap v383: 128 pgs, 1 pools, 138 MB data, 54 objects <------ Remaining Objects
244 MB used, 2814 MB / 3058 MB avail
STEP 3) Create two files and check the number of objects to confirm that the existing objects are reused.
[root@node2 ~]# dd if=/dev/zero of=/mnt/test bs=1M count=128
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 0.0707993 s, 1.9 GB/s
[root@node2 ~]#
[root@node2 ~]# ls -alh /mnt/test
-rw-r--r-- 1 root root 128M Mar 8 23:48 /mnt/test <----- Confirm generation of test file
[root@node2 ~]# ceph -s
cluster 78c40b72-af41-4d40-8d42-12f79c36747f
.......
osdmap e127: 3 osds: 3 up, 3 in
pgmap v389: 128 pgs, 1 pools, 138 MB data, 54 objects <------ The number of objects is the same
244 MB used, 2814 MB / 3058 MB avail
[root@node2 ~]# dd if=/dev/zero of=/mnt/test1 bs=1M count=128 <------ Create the test1 file
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 0.0898858 s, 1.5 GB/s
[root@node2 ~]# ls -alh /mnt/*
-rw-r--r-- 1 root root 128M Mar 8 23:48 /mnt/test
-rw-r--r-- 1 root root 128M Mar 8 23:49 /mnt/test1
[root@node2 ~]#
[root@node2 ~]# ceph -s
cluster 78c40b72-af41-4d40-8d42-12f79c36747f
....
osdmap e127: 3 osds: 3 up, 3 in
pgmap v393: 128 pgs, 1 pools, 266 MB data, 86 objects <---------- the object count increases
495 MB used, 2563 MB / 3058 MB avail
[root@node2 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rhel-root 5.8G 1.1G 4.8G 18% /
devtmpfs 487M 0 487M 0% /dev
tmpfs 497M 0 497M 0% /dev/shm
tmpfs 497M 6.7M 490M 2% /run
tmpfs 497M 0 497M 0% /sys/fs/cgroup
/dev/sda1 497M 125M 373M 25% /boot
/dev/sdb1 1020M 35M 986M 4% /var/lib/ceph/osd/ceph-1
tmpfs 100M 0 100M 0% /run/user/0
/dev/rbd0 10G 289M 9.8G 3% /mnt
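For reference, the space an RBD image actually consumes at the Ceph level can be estimated by summing the extents reported by rbd diff. This is only a sketch, and the image name rbd/test-image is an assumption; replace it with one of the images shown by "rbd ls":
# rbd diff rbd/test-image | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'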
Is there any way to automatically reclaim disk space for deleted data without performing fstrim?
The actual capacity currently in use can be confirmed with the "df -h" command, and the space belonging to deleted data can be reclaimed automatically with the discard mount option.
The discard mount option behaves similarly to fstrim: when files are deleted, the corresponding objects are cleaned up on the backend by using the TRIM support of the underlying device.
To use this option, both the disk and the file system must support TRIM. For reference, most recent disks and file systems support TRIM.
However, the discard option can cause performance degradation, because a TRIM request is issued for every block that is freed.
The discard option is used as follows:
#mount -o discard [device] [mount point]
Example)
#mount -o discard /dev/rbd1 /mnt/rbd
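To make the option persistent across remounts, discard can also be added to the mount options in /etc/fstab. The entry below is only a sketch, assuming an xfs file system on /dev/rbd1 that is mapped before the mount is attempted (for example by the rbdmap service); adjust the device, mount point, and file system type to your environment. The noauto option keeps the boot from hanging if the RBD device has not been mapped yet:
/dev/rbd1  /mnt/rbd  xfs  defaults,noauto,discard  0 0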
If we don't perform fstrim, can the OSDs become full even though files are deleted?
No. Not reclaiming the capacity with fstrim will not by itself cause the OSDs to become full.
As mentioned above, the objects left behind by deleted files are reused for new writes, so they do not keep accumulating as additional usage.
The criterion for determining OSD full is calculated from the "available" space of each OSD, compared against the full_ratio, rather than from the "used" value.
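Per-OSD utilization can be checked against these thresholds with the commands below (a short sketch; "ceph osd df" is available in recent Ceph releases):
# ceph osd df                          <---- per-OSD used/available space and %USE
# ceph pg dump | grep ratio            <---- full_ratio and nearfull_ratio thresholds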
The following test demonstrates this behavior.
STEP 1) Check the current disk usage, the number of objects, and the full ratio values.
[root@node2 ~]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
3058M 2199M 859M 28.10
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 414M 13.54 0 115
[root@node2 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rhel-root 5.8G 1.1G 4.8G 18% /
devtmpfs 487M 0 487M 0% /dev
tmpfs 497M 0 497M 0% /dev/shm
tmpfs 497M 6.7M 490M 2% /run
tmpfs 497M 0 497M 0% /sys/fs/cgroup
/dev/sda1 497M 125M 373M 25% /boot
/dev/sdb1 1020M 35M 986M 4% /var/lib/ceph/osd/ceph-1
tmpfs 100M 0 100M 0% /run/user/0
/dev/rbd0 1014M 433M 582M 43% /mnt
[root@node2 ~]# ceph pg dump | head
dumped all in format plain
version 474
stamp 2017-03-09 01:16:45.895113
last_osdmap_epoch 141
last_pg_scan 21
full_ratio 0.9 <---- check the full ratio
nearfull_ratio 0.8 <---- check the nearfull ratio
..
STEP 2) Create files to make an OSD full.
[root@node2 ~]# dd if=/dev/zero of=/mnt/test5 bs=1M count=100 <---- add more files to fill osd
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.0605148 s, 1.7 GB/s
[root@node2 ~]# dd if=/dev/zero of=/mnt/test6 bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.0931641 s, 1.1 GB/s
[root@node2 ~]# dd if=/dev/zero of=/mnt/test7 bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 1.4708 s, 71.3 MB/s
STEP 3) Verify that one OSD is full due to the file creation, and check the number of objects with the "ceph df" command.
[root@node2 ~]# ceph -s
cluster 78c40b72-af41-4d40-8d42-12f79c36747f
health HEALTH_ERR
......
1 full osd(s) <---- Confirm that one of the OSDs is full
......
[root@node2 ~]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
3058M 2002M 1056M 34.55
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 629M 20.57 0 169
[root@node2 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rhel-root 5.8G 1.1G 4.8G 18% /
devtmpfs 487M 0 487M 0% /dev
tmpfs 497M 0 497M 0% /dev/shm
tmpfs 497M 6.7M 490M 2% /run
tmpfs 497M 0 497M 0% /sys/fs/cgroup
/dev/sda1 497M 125M 373M 25% /boot
/dev/sdb1 1020M 35M 986M 4% /var/lib/ceph/osd/ceph-1
tmpfs 100M 0 100M 0% /run/user/0
/dev/rbd0 1014M 733M 282M 73% /mnt <---- check usage
STEP 4) Confirm that the OSD full condition is cleared by deleting files, without running fstrim.
[root@node2 ~]# rm /mnt/test* <---- Clear the OSD full condition by deleting the files
rm: remove regular file ‘/mnt/test’? y
rm: remove regular file ‘/mnt/test1’? y
rm: remove regular file ‘/mnt/test3’? y
rm: remove regular file ‘/mnt/test4’? y
rm: remove regular file ‘/mnt/test5’? y
rm: remove regular file ‘/mnt/test6’? y
[root@node2 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rhel-root 5.8G 1.1G 4.8G 18% /
devtmpfs 487M 0 487M 0% /dev
tmpfs 497M 0 497M 0% /dev/shm
tmpfs 497M 6.7M 490M 2% /run
tmpfs 497M 0 497M 0% /sys/fs/cgroup
/dev/sda1 497M 125M 373M 25% /boot
/dev/sdb1 1020M 35M 986M 4% /var/lib/ceph/osd/ceph-1
tmpfs 100M 0 100M 0% /run/user/0
/dev/rbd0 1014M 33M 982M 4% /mnt
[root@node2 ~]# ceph -s
cluster 78c40b72-af41-4d40-8d42-12f79c36747f
health HEALTH_WARN
Monitor clock skew detected <---- The full osd(s) error is gone; the remaining warning is unrelated clock skew
monmap e1: 3 mons at {node1=192.168.50.1:6789/0,node2=192.168.50.2:6789/0,node3=192.168.50.3:6789/0}
election epoch 112, quorum 0,1,2 node1,node2,node3
osdmap e143: 3 osds: 3 up, 3 in
pgmap v493: 128 pgs, 1 pools, 698 MB data, 187 objects
871 MB used, 2186 MB / 3058 MB avail
........
[root@node2 ~]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
3058M 2186M 871M 28.51
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 698M 22.83 0 187
From the test, even though the files were deleted without running fstrim, the number of objects does not decrease, but the OSD full condition disappears.
Therefore, you can confirm that OSD full does not occur even when fstrim is not performed after deleting files.
Note that the object count of 169 at the time of the OSD full condition was taken while the writes were still in progress; the count after the writes completed is 187.
Root Cause
Like a traditional file system, Ceph does not delete the underlying objects when a file is deleted, so the objects still remain on the RBD device.
A new write will either overwrite these objects or create new ones, as required.
Because the objects are still present in the pool, 'ceph df' will show the pool as occupied by these objects even though they are no longer referenced by the file system.
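If you want to see these leftover objects directly, they can be listed from the pool by the image's object name prefix. A minimal sketch, assuming the image is named rbd/test-image; replace <block_name_prefix> with the value reported by rbd info:
# rbd info rbd/test-image | grep block_name_prefix    <---- object name prefix of the image
# rados -p rbd ls | grep <block_name_prefix> | wc -l  <---- count the objects belonging to the image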