Resolution
Conceptually, in RocksDB every piece of information is stored in files. RocksDB distinguishes three types of storage and expects each to be suited to different performance requirements.
1) db
2) db.slow
3) db.wal
BlueFS implements the filesystem abstraction that RocksDB expects and maps it onto a structure well suited for raw block devices.
1) db is allocated from the device options.bluestore_block_db_path (informally: block.db)
2) db.slow is allocated from the device options.bluestore_block_path (informally: block)
3) db.wal is allocated from the device options.bluestore_block_wal_path (informally: block.wal)
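As a rough illustration of this mapping, here is a Python sketch (not actual Ceph code; the BDEV_* names mirror identifiers used inside BlueFS, while the routing logic itself is simplified):

    # Sketch: BlueFS picks the target block device from the top-level
    # directory of the path that RocksDB asks for.
    BDEV_WAL, BDEV_DB, BDEV_SLOW = 0, 1, 2

    DIR_TO_DEVICE = {
        "db":      BDEV_DB,    # bluestore_block_db_path  -> block.db
        "db.slow": BDEV_SLOW,  # bluestore_block_path     -> block
        "db.wal":  BDEV_WAL,   # bluestore_block_wal_path -> block.wal
    }

    def device_for(path):
        top = path.lstrip("/").split("/", 1)[0]
        return DIR_TO_DEVICE[top]

    assert device_for("/db/000123.sst") == BDEV_DB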
There are two different behaviors that constitute spillover:
1) During processing, RocksDB has a temporarily higher demand for space. It asks to create a file under "/db/" but exhausts space on block.db, so BlueFS starts consuming block space instead. This redirection of allocation is done internally in BlueFS, and RocksDB is unaware that the relocation occurred. A file named "/db/xxxxx" is therefore located on the slow block device, while RocksDB still thinks it is fast. After the peak is gone, the data allocated on block remains there, even though plenty of space is free on block.db (see the sketch after this list).
2) RocksDB starts creating files in the "/db.slow/" directory, disregarding the fact that all the data could still fit into the regular "db" area.
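A minimal Python sketch of behavior 1, using made-up capacities (ToyBlueFS and all names here are illustrative, not BlueFS code):

    # Toy model of spillover behavior 1: allocation silently falls back
    # from block.db to block, while the file keeps its fast-looking
    # "/db/" name, so RocksDB never learns about the relocation.
    class ToyBlueFS:
        def __init__(self, db_free, slow_free):
            self.free = {"block.db": db_free, "block": slow_free}
            self.files = {}  # path -> device the data actually landed on

        def create(self, path, size):
            device = "block.db" if self.free["block.db"] >= size else "block"
            self.free[device] -= size
            self.files[path] = device

    fs = ToyBlueFS(db_free=100, slow_free=1000)
    fs.create("/db/000001.sst", 80)  # fits -> lands on block.db
    fs.create("/db/000002.sst", 80)  # block.db exhausted -> lands on block
    print(fs.files)  # both names start with "/db/", but one is on slow storage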
Internals regarding the values slow_total_bytes and slow_used_bytes:
The value slow_total_bytes directly reports how much space on the block device is reserved for BlueFS.
The value slow_used_bytes directly reports how much of slow_total_bytes is currently used by BlueFS.
The slow_total_bytes space is adjusted dynamically so that bluefs_ratio stays in the range:
osd_conf.bluestore_bluefs_min_ratio < bluefs_ratio < osd_conf.bluestore_bluefs_max_ratio
bluefs_ratio = "block free space for BlueFS" / ("block space for BlueFS" + "block.db space for BlueFS")

where:
"block space for BlueFS" <- the space cut from the block device and gifted to BlueFS as a free area to allocate from; this is reported as slow_total_bytes
"block free space for BlueFS" <- how much of "block space for BlueFS" is still unallocated; this is reported in reverse, as (slow_total_bytes - slow_used_bytes)
"block.db space for BlueFS" <- the size of the partition options.bluestore_block_db_path
The relation between bluefs_ratio and slow_total_bytes is well defined but indirect.
Each time the condition occurs:
bluefs_ratio < osd_conf.bluestore_bluefs_min_ratio, BlueFS cuts some additional space from the block device.
Each time the condition occurs:
osd_conf.bluestore_bluefs_max_ratio < bluefs_ratio, BlueFS releases some space back to the block device.
Both these actions change bluefs_ratio until it stabilizes within the configured range, as modeled in the sketch below.
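A simplified model of this feedback loop (a Python sketch; the thresholds and chunk size are example values, and the real BlueStore rebalancing logic is more involved):

    # Toy model of one rebalancing step: gift space to BlueFS when the
    # ratio drops below min_ratio, reclaim it when it rises above
    # max_ratio; either action nudges bluefs_ratio back into range.
    def rebalance_step(slow_total, slow_used, db_size,
                       min_ratio=0.02, max_ratio=0.90, chunk=1 << 20):
        ratio = (slow_total - slow_used) / (slow_total + db_size)
        if ratio < min_ratio:
            slow_total += chunk   # cut more space from the block device
        elif ratio > max_ratio:
            slow_total -= chunk   # release space back to the block device
        return slow_total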
Enhancement plans for this spillover behavior:
The BlueStore engineering team has been working to uncover the root cause of spillover from block.db to block. There are two paths of investigation:
1) Creating a RocksDB Iterator locks tables. For as long as such an Iterator exists, the tables it references cannot be freed.
If compaction runs in such a case, the size of the database can temporarily double. This bears an uncanny similarity to the reported behavior: "after I restarted the OSD, the DB is small again".
Engineering intends to add tracking of RocksDB Iterator instances to verify whether stale ones exist and, if so, fix them (the first sketch after this list models the effect).
2) In the current implementation, a request for a file under "/db.slow/" mandates that BlueFS allocate from the slow block device. Engineering is planning a change that will disregard the "/db.slow/" request when there is enough space on block.db, and move data to slow storage only when space actually runs out (see the second sketch below).
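Two illustrative Python sketches follow. The first models the Iterator-pinning effect from path 1 (arbitrary file names and sizes, not RocksDB code):

    # Toy model: a live Iterator pins the input tables of a compaction,
    # so the rewritten output coexists with the old tables on disk
    # until the Iterator is released.
    inputs = {"000010.sst": 100, "000011.sst": 100}  # 200 units total
    pinned = set(inputs)                             # held by a live Iterator

    output = {"000012.sst": 200}                     # merged rewrite of inputs
    deletable = [t for t in inputs if t not in pinned]

    on_disk = sum(inputs.values()) + sum(output.values()) \
        - sum(inputs[t] for t in deletable)
    print(on_disk)  # 400 while the Iterator lives; 200 once it is released

The second sketches the planned handling of "/db.slow/" requests from path 2 (hypothetical logic, not the actual patch):

    # Sketch: treat "/db.slow/" as a placement hint rather than an order;
    # spill to block only when block.db is actually out of space.
    def choose_device(path, db_free_bytes, size):
        # Even when 'path' starts with "/db.slow/", prefer block.db
        # while it still has room.
        return "block.db" if db_free_bytes >= size else "block"

    print(choose_device("/db.slow/000042.sst", db_free_bytes=1 << 30, size=1 << 20))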