Ceph Storage for Private Cloud: An Architecture Guide

How Ceph delivers resilient, scalable software-defined storage for private cloud. Architecture, replication, and sizing fundamentals explained.

Storage is the part of a private cloud that everyone relies on and almost no one wants to think about, right up until it fails. Traditional storage arrays handled this by throwing money at the problem: dual controllers, proprietary disk shelves, and a support contract to match. Ceph takes a fundamentally different path. It turns a pool of ordinary servers and their disks into a single, self-healing, software-defined storage system that scales from a handful of nodes to thousands without a forklift upgrade.

If you are building or evaluating a private cloud, understanding how Ceph works is worth the effort, because it underpins the block, object, and file storage your virtual machines and applications depend on. This guide walks through the architecture from the ground up, explains the crucial choice between replication and erasure coding, and shows how Ceph plugs into OpenStack services like Cinder and Glance.

What Ceph actually is

Ceph is a distributed storage platform that presents three interfaces from one underlying cluster: block storage (RBD, the RADOS Block Device) for virtual machine disks, object storage (an S3-compatible gateway) for unstructured data and backups, and a POSIX file system (CephFS) for shared file access. All three sit on top of the same foundation, a reliable autonomic distributed object store known as RADOS.

The design goal behind RADOS is simple to state and hard to achieve: no single point of failure, and no central bottleneck. There is no master controller that every read and write must pass through. Instead, clients compute where data lives and talk to the responsible nodes directly. That property is what lets Ceph scale almost linearly as you add hardware.

The core components

A Ceph cluster is made of a few distinct daemons, each with a clear job. Understanding them demystifies most of what you will see in monitoring dashboards.

OSDs: where data lives

An Object Storage Daemon (OSD) manages a single storage device, typically one disk or SSD. It stores the actual data objects, handles replication to its peers, and participates in recovery when a device fails. A production cluster runs anywhere from a dozen to thousands of OSDs, and they do the heavy lifting of the entire system. As a rule of thumb you run one OSD per physical drive, so capacity and throughput scale together as drives are added.

Monitors: the source of truth

Monitor daemons (MONs) maintain the cluster map, the authoritative record of which OSDs exist, which are up, and how data should be distributed. Clients consult the map to know where to read and write. Because the map is critical, monitors run as a small odd-numbered quorum (typically three or five) so the cluster always has a consistent, agreed-upon view of its own state.
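The preference for odd numbers follows from majority arithmetic: a quorum needs more than half of the monitors to agree. The sketch below works that arithmetic through; the function names are illustrative and not part of Ceph.

```python
def quorum_size(monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can be lost while a majority still remains."""
    return monitors - quorum_size(monitors)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

Four monitors tolerate no more failures than three, so an even count adds a daemon without adding resilience.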

Managers and metadata servers

Manager daemons (MGRs) handle metrics, dashboards, and orchestration alongside the monitors. If you use CephFS, Metadata Servers (MDS) manage the file system namespace, directory hierarchy, and permissions so that file operations stay fast without burdening the object layer.

How Ceph places data: CRUSH

The single most important idea in Ceph is that it does not keep a central lookup table mapping every object to a location. A table like that would become a bottleneck and a liability at scale. Instead, Ceph uses an algorithm called CRUSH (Controlled Replication Under Scalable Hashing).

CRUSH lets any client or OSD calculate exactly which devices should hold a given piece of data, using only the cluster map and the object name. Data is grouped into placement groups, and CRUSH deterministically maps those groups onto OSDs according to rules you define. Because placement is computed rather than looked up, there is no central directory to overload, and the cluster can rebalance itself automatically when hardware changes.
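The idea of computing placement instead of looking it up can be sketched in a few lines. This is not the real CRUSH algorithm: it uses rendezvous hashing as a simple stand-in, and the function names, the object name, and the 12-OSD cluster are assumptions made for illustration. It shows only the two-step shape described above: object name to placement group, then placement group to OSDs.

```python
import hashlib


def stable_hash(text: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def object_to_pg(object_name: str, pg_count: int) -> int:
    """Map an object name to a placement group number."""
    return stable_hash(object_name) % pg_count


def pg_to_osds(pg: int, osds: list[int], replicas: int) -> list[int]:
    """Rank OSDs by a per-(PG, OSD) score and keep the top `replicas`.

    Rendezvous hashing, used here as a simplified stand-in for CRUSH.
    """
    ranked = sorted(osds, key=lambda osd: stable_hash(f"{pg}:{osd}"), reverse=True)
    return ranked[:replicas]


osds = list(range(12))  # hypothetical cluster with OSD ids 0..11
pg = object_to_pg("vm-disk-0001", pg_count=128)
print(pg, pg_to_osds(pg, osds, replicas=3))
```

Any client holding the same OSD list computes the same answer, so no directory service is consulted. When an OSD disappears from the list, only the placement groups that had ranked it among their top choices get a new member, which is the property that keeps rebalancing proportional to the change.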

CRUSH is also topology-aware. You describe your physical layout (which disks sit in which host, which hosts in which rack, which racks in which room), and CRUSH spreads replicas across those failure domains. The practical payoff is that you can tell Ceph never to place two copies of the same data in the same rack, so losing an entire rack does not lose your data.
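A failure-domain constraint of this kind can be illustrated with a small sketch. It is a simplified model and not Ceph's implementation: the rack names, OSD ids, and the function place_across_racks are hypothetical. It picks distinct racks first and then one OSD inside each, so two replicas never share a rack.

```python
import hashlib


def stable_hash(text: str) -> int:
    """Deterministic 64-bit hash, identical on every client."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def place_across_racks(pg: int, topology: dict[str, list[int]],
                       replicas: int) -> list[tuple[str, int]]:
    """Choose `replicas` OSDs for a placement group, at most one per rack."""
    if replicas > len(topology):
        raise ValueError("not enough racks to keep every replica in a separate rack")
    # Rank the racks for this placement group, then pick one OSD in each chosen rack.
    racks = sorted(topology, key=lambda rack: stable_hash(f"{pg}:{rack}"), reverse=True)
    placement = []
    for rack in racks[:replicas]:
        osd = max(topology[rack], key=lambda o: stable_hash(f"{pg}:{rack}:{o}"))
        placement.append((rack, osd))
    return placement


# Hypothetical layout: three racks with four OSDs each.
topology = {
    "rack-a": [0, 1, 2, 3],
    "rack-b": [4, 5, 6, 7],
    "rack-c": [8, 9, 10, 11],
}
print(place_across_racks(42, topology, replicas=3))
```

Asking for more replicas than there are racks raises an error rather than silently doubling up, which mirrors the planning point: the failure domain you choose must exist at least as many times as you have copies.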

Replication vs erasure coding

This is the decision that most shapes a cluster's cost and resilience profile, and it is worth understanding properly.

Replication

With replication, Ceph keeps multiple full copies of every object, conventionally three. Lose a disk and the data still exists in two other places, and the cluster automatically copies it elsewhere to restore the replica count. Replication is simple, fast, and recovers quickly, which makes it the natural choice for performance-sensitive block storage. The cost is capacity: three-way replication means you get one third of your raw disk as usable space.

Erasure coding

Erasure coding works like a smarter, distributed version of RAID. Instead of full copies, Ceph splits each object into data chunks plus a number of parity chunks and spreads them across many OSDs. A common 4+2 profile, for example, can tolerate the loss of any two chunks while delivering roughly 67 percent usable capacity, far better than the 33 percent of three-way replication.
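The capacity figures follow directly from the ratio of data to total chunks. The sketch below works the arithmetic for both schemes; the 600 TB raw figure and the function names are assumptions chosen for illustration.

```python
def usable_fraction_replicated(copies: int) -> float:
    """Each object is stored `copies` times, so only 1/copies of raw space is usable."""
    return 1 / copies


def usable_fraction_erasure(data_chunks: int, parity_chunks: int) -> float:
    """k data chunks out of k+m stored chunks carry user data."""
    return data_chunks / (data_chunks + parity_chunks)


raw_tb = 600  # hypothetical raw capacity
print(round(raw_tb * usable_fraction_replicated(3)))   # 200
print(round(raw_tb * usable_fraction_erasure(4, 2)))   # 400
```

On the same hardware, the 4+2 profile yields twice the usable space of three-way replication while still surviving two simultaneous chunk losses.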

The trade-off is compute and latency. Reconstructing data from parity costs CPU and involves more nodes per operation, so erasure coding suits large, throughput-oriented data such as backups, archives, and object storage more than it suits hot, latency-sensitive volumes. Many clusters use both: replicated pools for VM disks, erasure-coded pools for bulk and cold data.
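To make the reconstruction cost concrete, here is the simplest possible parity scheme: a single XOR parity chunk, which can repair one lost chunk. It is a deliberately reduced stand-in, since a profile such as 4+2 needs a more capable code to survive two losses. The function names are illustrative.

```python
from functools import reduce
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, data_chunks: int) -> list[bytes]:
    """Split data into equal chunks (zero-padded) and append one XOR parity chunk."""
    size = -(-len(data) // data_chunks)  # ceiling division
    padded = data.ljust(size * data_chunks, b"\0")
    chunks = [padded[i * size:(i + 1) * size] for i in range(data_chunks)]
    return chunks + [reduce(xor_bytes, chunks)]


def rebuild(chunks: list[Optional[bytes]]) -> list[bytes]:
    """Recover a single missing chunk by XOR-ing all surviving chunks."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one lost chunk")
    survivors = [c for c in chunks if c is not None]
    repaired = list(chunks)
    if missing:
        repaired[missing[0]] = reduce(xor_bytes, survivors)
    return repaired


stored = encode(b"virtual machine backup", data_chunks=4)
damaged = list(stored)
damaged[2] = None  # simulate the loss of the OSD holding chunk 2
assert rebuild(damaged) == stored
```

Note that rebuilding one chunk reads every surviving chunk, whereas a replicated pool only needs one intact copy. That extra fan-out is the latency and CPU cost described above.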

Self-healing and resilience in practice

Ceph is built to assume hardware will fail, because at scale it always does. When an OSD goes offline, the monitors update the cluster map and the affected placement groups are flagged as degraded. If the OSD does not return within a configurable grace period, it is marked out and the surviving OSDs re-replicate the missing data to restore full redundancy, all without operator intervention.

Scrubbing runs on a regular schedule in the background, comparing replicas and checksums to catch silent data corruption (bit rot) before it spreads. The result is a system that not only survives failures but actively works to maintain its own integrity. For an operator, day-to-day storage management shifts from reactive firefighting to capacity planning and monitoring.
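The comparison at the heart of scrubbing can be sketched as a checksum vote across replicas. This is a simplified model and not how Ceph implements scrubbing internally: the majority-vote rule, the scrub function, and the OSD names are assumptions made for illustration.

```python
import zlib
from collections import Counter


def scrub(replicas: dict[str, bytes]) -> list[str]:
    """Return the OSD names whose copy disagrees with the majority checksum."""
    checksums = {osd: zlib.crc32(data) for osd, data in replicas.items()}
    majority, votes = Counter(checksums.values()).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise ValueError("no majority among replicas; cannot pick an authoritative copy")
    return [osd for osd, checksum in checksums.items() if checksum != majority]


replicas = {
    "osd.1": b"block of data",
    "osd.7": b"block of data",
    "osd.9": b"block of dat\x00",  # simulated bit rot
}
print(scrub(replicas))  # ['osd.9']
```

The copy flagged as inconsistent would then be repaired from one of the agreeing replicas, which is why corruption caught early never reaches the point where every copy is bad.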

How Ceph powers an OpenStack private cloud

Ceph and OpenStack are a natural pairing, and Ceph is the de facto storage backend for serious OpenStack deployments. The integration touches several services. Cinder, the block storage service, provisions RBD volumes for virtual machine disks, so a VM can be created, snapshotted, and resized with storage that lives across the whole cluster rather than on one host. Glance, the image service, stores VM images in Ceph, and because both Glance and Cinder share the same backend, creating a volume from an image becomes a near-instant copy-on-write clone instead of a slow data transfer.
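Why a copy-on-write clone is near-instant can be seen in a toy model. This is not the RBD API: the Image class and its methods are hypothetical, and real clones are taken from a snapshot of the parent image. The point is only that a clone starts as a reference to its parent and stores nothing until it is written to.

```python
class Image:
    """Toy block image: a clone stores only the blocks written after cloning."""

    def __init__(self, blocks=None, parent=None):
        self.blocks = dict(blocks or {})  # block index -> bytes
        self.parent = parent

    def clone(self) -> "Image":
        # No data is copied; the child just remembers its parent.
        return Image(parent=self)

    def read(self, index: int) -> bytes:
        if index in self.blocks:
            return self.blocks[index]
        if self.parent is not None:
            return self.parent.read(index)  # fall through to the parent image
        return b"\0"  # unwritten block

    def write(self, index: int, data: bytes) -> None:
        self.blocks[index] = data  # only the changed block is stored in the clone


golden = Image({0: b"bootloader", 1: b"kernel", 2: b"rootfs"})
volume = golden.clone()
volume.write(2, b"rootfs+app")
print(volume.read(1), volume.read(2), golden.read(2), len(volume.blocks))
```

The new volume reads unchanged blocks from the image and holds only the one block it modified, so creation time does not depend on image size.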

Nova, the compute service, can boot instances directly from Ceph-backed volumes, which means a VM is no longer tied to the local disk of one hypervisor. If that host fails, the instance can be restarted elsewhere because its storage was never local in the first place. This is the foundation of live migration and high availability in a private cloud.

Sizing and planning a cluster

A few planning principles save a lot of pain later. Start with at least three nodes so replication and monitor quorum have somewhere to live; smaller clusters cannot tolerate a failure cleanly. Plan for capacity headroom, because a cluster that fills past roughly 80 percent loses the free space it needs to re-replicate after a failure. Match your network to your disks: recovery and replication traffic both move across the network, so a fast, dedicated storage network (often 25GbE or higher) keeps rebuilds quick and client I/O smooth. And keep failure domains in mind from day one, distributing nodes across racks and power feeds so CRUSH can place data safely.
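The headroom rule combines with the redundancy overhead into a quick sizing estimate. The sketch below is back-of-the-envelope arithmetic and not a sizing tool: the node count, drive size, and function name are assumptions, and the 0.80 fill limit is the rough threshold mentioned above.

```python
def usable_capacity_tb(nodes: int, drives_per_node: int, drive_tb: float,
                       efficiency: float, fill_limit: float = 0.80) -> float:
    """Plannable capacity: raw space, reduced by redundancy overhead and headroom."""
    raw = nodes * drives_per_node * drive_tb
    return raw * efficiency * fill_limit


# Hypothetical cluster: 6 nodes x 12 drives x 8 TB = 576 TB raw.
print(round(usable_capacity_tb(6, 12, 8, efficiency=1 / 3), 1))  # 153.6 (3x replication)
print(round(usable_capacity_tb(6, 12, 8, efficiency=4 / 6), 1))  # 307.2 (4+2 erasure coding)
```

Running the numbers this way early makes the gap between purchased and plannable capacity explicit: here, 576 TB of raw disk supports roughly 154 TB of replicated data.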

Why this matters for sovereign infrastructure

Ceph's appeal is not only technical. Because it runs on commodity hardware and is fully open source, it removes the dependence on a single storage vendor and the licensing that comes with it, which matters a great deal for organizations pursuing genuine control over their own infrastructure.

This is why Ceph sits at the heart of the clouditiv platform. We use it as the unified storage layer beneath our OpenStack private clouds, backing Cinder volumes and Glance images with replicated and erasure-coded pools tuned to each workload, all kept within German data centres and aligned with the compliance requirements European organizations face. For teams moving off legacy virtualization, our guide on migrating from VMware to OpenStack shows how a Ceph-backed cloud replaces the proprietary storage arrays of the past, and you can explore the full picture on our on-premise cloud platform.

The takeaway

Ceph replaces the monolithic storage array with something more flexible and more durable: a pool of standard servers that organizes, protects, and heals your data automatically. Its CRUSH algorithm removes the central bottleneck, its replication and erasure-coding options let you tune the balance of cost and resilience per workload, and its tight integration with OpenStack makes it the backbone of a modern private cloud. Understanding it is the difference between treating storage as a black box and designing it deliberately for the durability your applications deserve.