Ceph - rbd-mirror has slow performance during rbd journal image replay 

Environment

  • Ceph Storage

Issue

  • How can the performance of rbd-mirror be increased when an asynchronous copy of RBD images to another cluster is slow?
  • rbd-mirror was originally designed for disaster recovery, copying RBD images from one cluster to another.
  • In other cases, a site may want to use it to migrate RBD images from one cluster to another before decommissioning the original cluster.
  • Regardless of the use case, mirroring many large RBD images between clusters can be slow with the default parameters.

Resolution

  • Out of the box, librbd and rbd-mirror are configured to conserve memory at the expense of performance, in order to support the case of thousands of images being mirrored with only a single rbd-mirror daemon handling the load.

  • To increase the performance of the one-way sync, add rbd_journal_max_payload_bytes = 8388608 to the [client] section on the "rbd import" node, as shown below. Normally, writes larger than 16 KiB are broken into multiple journal entries, and "rbd import" will attempt to write object-size (4 MiB) blocks per write.
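
    For example, a minimal sketch of the ceph.conf fragment on the "rbd import" node (the path /etc/ceph/ceph.conf is the usual default and may differ in your deployment):

      [client]
      # Allow journal entries of up to 8 MiB so that large writes are not
      # split into many small journal entries (the default limit is 16 KiB).
      rbd_journal_max_payload_bytes = 8388608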

  • Then add rbd_mirror_journal_max_fetch_bytes = 33554432 to the [client] section on the rbd-mirror daemon host and restart the daemon for the change to take effect, as shown below. Normally, the daemon fetches journal events in small batches to prevent excessive memory use in the case where potentially thousands of images are being mirrored.
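
    For example, a sketch of the corresponding fragment on the rbd-mirror daemon host:

      [client]
      # Fetch up to 32 MiB of journal events per request instead of small
      # batches, at the cost of higher memory use per mirrored image.
      rbd_mirror_journal_max_fetch_bytes = 33554432

    Then restart the daemon; the systemd unit name below is typical for a package-based deployment and may differ on your systems:

      systemctl restart ceph-rbd-mirror@<client-id>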

  • Before making these changes in a production environment, we recommend testing them in a test environment to understand their impact.
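
    To gauge the effect, the replay progress of a mirrored image can be checked with the standard status command; the pool and image names below are hypothetical:

      rbd mirror image status data/vm-disk-1

    For journal-based mirroring, the status description reports how far replay lags behind the primary (for example, an entries_behind_master counter); that value should drain faster after the tuning above.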