Things to consider when and why to expand a CEPH-cluster

CEPH, as a product built to be a storage backend, has very good capabilities when it comes to expanding for future storage needs. While the initial setup already has a crucial effect on the adequacy of the storage backend because of the pool structure and the replication size, CEPH is extremely flexible about expansions. Once the new storage node hardware is set up, installed and configured, you basically just tell CEPH to add the new node into the cluster. It automatically spreads data to the new member and keeps functioning fully in the meantime, since all the recovery data is just traffic on the backend network of the cluster.
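
To get a rough feel for how much backend-network traffic such an expansion causes, the small sketch below estimates the amount of data that has to move to the new node. It simply assumes data ends up balanced in proportion to raw capacity; the real placement is decided by CRUSH, so the helper name and the numbers are illustrative assumptions rather than exact behaviour.

    # Hedged back-of-the-envelope estimate of the data movement caused by
    # adding a new node. Real placement is decided by CRUSH and can differ;
    # all numbers below are illustrative assumptions.

    def estimated_rebalance_tb(stored_raw_tb: float,
                               old_capacity_tb: float,
                               added_capacity_tb: float) -> float:
        """Raw TB expected to migrate to the new node after expansion."""
        new_total_tb = old_capacity_tb + added_capacity_tb
        # In a balanced cluster each node ends up holding data roughly in
        # proportion to its share of the total raw capacity.
        return stored_raw_tb * (added_capacity_tb / new_total_tb)

    if __name__ == "__main__":
        # Example: a 500 TB raw cluster holding 375 TB of raw data gets one
        # new 100 TB node -> roughly 62.5 TB of backend recovery traffic.
        moved = estimated_rebalance_tb(stored_raw_tb=375,
                                       old_capacity_tb=500,
                                       added_capacity_tb=100)
        print(f"~{moved:.1f} TB expected to move to the new node")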

When it comes to clusters, the performance of the hardware should always be fairly similar. A cluster's speed is determined by its slowest member, so the best practice for expansions is to use hardware fairly similar to the original setup. Sometimes a few years pass before the next expansion and newer hardware has to be used, but that does not matter as long as the network speed and configuration stay the same and the nodes have the same amount of RAM. Disk size is less critical, but you can only make use of the space that the smallest disk in the cluster has; buying larger disks for an old CEPH cluster with smaller disks would only benefit the cluster by the size of the old disks (see the sketch after this paragraph). The good part is that you do not need to expand with a huge investment at once: the cluster can be expanded node by node. From a budget point of view you should still consider expanding as much as possible at a time, because work-wise it is easier to organize the physical installation of multiple nodes than to install them one at a time every once in a while.
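
The disk-size effect can be illustrated with a minimal sketch. It follows the simplified rule from this paragraph, namely that every disk can only be filled evenly up to the size of the smallest disk; a real cluster weights each OSD by its capacity, so this is a deliberately conservative assumption rather than exact CEPH behaviour.

    # Hedged sketch of the "smallest disk limits usable space" rule from the
    # paragraph above. A real cluster weights each OSD by size, so larger
    # disks do receive more data; this is the simplified, even-fill view.

    def even_fill_capacity_tb(disk_sizes_tb: list[float]) -> float:
        """Raw capacity usable if every disk may only be filled up to the
        size of the smallest disk in the cluster."""
        smallest_tb = min(disk_sizes_tb)
        return smallest_tb * len(disk_sizes_tb)

    if __name__ == "__main__":
        old_disks = [10.0] * 40      # existing nodes built on 10 TB disks
        new_disks = [16.0] * 10      # one new node with larger 16 TB disks
        cluster = old_disks + new_disks
        print("Nominal raw capacity:   ", sum(cluster), "TB")
        print("Evenly usable capacity: ", even_fill_capacity_tb(cluster), "TB")
        # The 16 TB disks only contribute 10 TB each under this rule.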

CEPH also has a default limitation on OSD-disks: it does not allow you to fill an OSD-disk beyond 95% of its capacity. Once that point is reached, the OSD-disk stops accepting writes to prevent more data being written to it; the system already starts alerting after an 85% fill rate on an OSD-disk, which gives you time to react. The percentages can be raised in the configuration, but that is not recommended, and the appropriate values depend a lot on the cluster's size and failure policy.
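
The sketch below maps an OSD-disk's fill rate to the levels described here, using 85% for the warning, 90% for the backfill limit mentioned in the example further down, and 95% for the hard stop. These are the commonly cited defaults; an actual cluster may have different ratios configured, so treat the constants as assumptions.

    # Hedged sketch of how an OSD's fill rate maps to the thresholds
    # described in the text. The ratio values are common defaults
    # (nearfull / backfillfull / full) and may differ per cluster.

    NEARFULL_RATIO = 0.85      # cluster starts warning
    BACKFILLFULL_RATIO = 0.90  # backfill to this OSD is refused
    FULL_RATIO = 0.95          # writes to the OSD are blocked

    def osd_state(used_tb: float, size_tb: float) -> str:
        """Classify a single OSD based on its fill rate."""
        fill = used_tb / size_tb
        if fill >= FULL_RATIO:
            return f"FULL ({fill:.0%}): writes blocked"
        if fill >= BACKFILLFULL_RATIO:
            return f"BACKFILLFULL ({fill:.0%}): no more backfill accepted"
        if fill >= NEARFULL_RATIO:
            return f"NEARFULL ({fill:.0%}): health warning raised"
        return f"OK ({fill:.0%})"

    if __name__ == "__main__":
        for used_tb in (7.0, 8.6, 9.2, 9.6):   # TB used on a 10 TB OSD
            print(osd_state(used_tb, 10.0))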

An example shows why it is important to have free space available in case of a failure. Your CEPH-cluster has 5 storage nodes, each with 10 x 10TB disks. One node can therefore hold up to 100TB of raw storage, and the whole cluster 500TB in total. With e.g. a pool replica of 2, that comes down to 250TB of usable storage in the cluster. Let's imagine the cluster is filled to 75% of the usable storage, which in this case means 187.5TB; that 187.5TB is spread over 5 nodes as 37.5TB each, leaving 62.5TB available for use in the whole cluster. Now let's imagine that 1 node breaks and is forced to shut down, so the 37.5TB it held has to be spread over 4 nodes instead of 5. The usable storage has come down to 200TB, while the consumed storage on each node increases to 46.875TB, which is already 93.75% of the usable storage of the whole cluster. The cluster would still remain functional, but back-filling is prevented above a 90% fill rate and the state of the cluster would be ERROR. You can recover from the situation just fine if everything goes as planned, but if you were to lose 2 more OSD-disks while the other node is down, the cluster would shut itself down to prevent even more damage. Not only would that have a huge impact on your infrastructure and services, but recovering the cluster from that state can take a serious amount of time and cause data loss.
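
The arithmetic of this example can be checked with the short sketch below. The node count, disk sizes, replica size and threshold percentages are taken straight from the scenario above; the rest is plain arithmetic, not a simulation of CEPH's placement or recovery behaviour.

    # Sketch reproducing the capacity arithmetic of the failure example.
    # Plain arithmetic only; no attempt to model actual placement.

    NODES = 5
    DISKS_PER_NODE = 10
    DISK_TB = 10
    REPLICA = 2
    FILL = 0.75            # usable storage consumed before the failure
    BACKFILLFULL = 0.90    # backfill stops above this fill rate
    FULL = 0.95            # writes stop above this fill rate

    raw_total_tb = NODES * DISKS_PER_NODE * DISK_TB        # 500 TB raw
    usable_tb = raw_total_tb / REPLICA                      # 250 TB usable
    used_tb = usable_tb * FILL                              # 187.5 TB used

    print(f"Before failure: {used_tb:.1f} / {usable_tb:.1f} TB used "
          f"({used_tb / usable_tb:.2%}), {used_tb / NODES:.1f} TB per node")

    # One node fails: the same data must fit on the remaining 4 nodes.
    usable_after_tb = usable_tb * (NODES - 1) / NODES       # 200 TB usable
    fill_after = used_tb / usable_after_tb                   # 0.9375

    print(f"After failure:  {used_tb:.1f} / {usable_after_tb:.1f} TB used "
          f"({fill_after:.2%}), {used_tb / (NODES - 1):.3f} TB per node")
    print("Backfill blocked (>= 90%):", fill_after >= BACKFILLFULL)
    print("Hard full ratio reached (>= 95%):", fill_after >= FULL)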