Setting Up a High Availability (HA) Cluster on Proxmox VE 9.1 with Ceph Storage

High Availability (HA) is one of the strongest features of Proxmox VE. Combined with Ceph, it allows virtual machines and containers to continue running even when a node fails. In this guide, we’ll walk through building a Proxmox VE 9.1 HA cluster backed by Ceph storage, from planning to validation.

This post focuses on clarity and completeness, not shortcuts.


Architecture Overview

Minimum Recommended Setup

  • 3 Proxmox VE 9.1 nodes (minimum for quorum)
  • 3 Ceph MONs
  • At least 1 OSD per node (3 per node recommended for performance)
  • Dedicated storage/network interface for Ceph traffic
  • Shared time source (NTP) across all nodes

Example Layout

Node     Role
pve01    Proxmox + Ceph MON + OSD
pve02    Proxmox + Ceph MON + OSD
pve03    Proxmox + Ceph MON + OSD

Step 1: Prepare the Nodes

1. Install Proxmox VE 9.1

  • Use identical hardware if possible
  • During installation:
    • Set static IP addresses
    • Configure correct hostname and FQDN
    • Use ZFS or ext4 for the system disk (Ceph disks must be separate)

2. Configure Networking

You should have:

  • Management Network (Proxmox GUI, corosync)
  • Ceph Network (storage replication traffic)

Example:

vmbr0 → Management (10.0.0.0/24)
vmbr1 → Ceph (172.16.0.0/24)
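As a sketch, the corresponding /etc/network/interfaces on one node might look like this (interface names eno1/eno2 and the .11 host addresses are examples; adjust to your hardware and addressing plan):

```
auto vmbr0
iface vmbr0 inet static
    address 10.0.0.11/24
    gateway 10.0.0.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

auto vmbr1
iface vmbr1 inet static
    address 172.16.0.11/24
    bridge-ports eno2
    bridge-stp off
    bridge-fd 0
```

Keeping vmbr1 gateway-free and physically separate ensures Ceph replication traffic never competes with management and corosync traffic.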

Step 2: Create the Proxmox Cluster

On the first node:

pvecm create prod-cluster

On each of the remaining nodes, join using the first node's IP:

pvecm add <IP_of_first_node>

Verify cluster status:

pvecm status

You should see:

  • All nodes listed
  • Quorum established

Step 3: Install Ceph on Proxmox

From the Proxmox GUI:

  1. In the GUI, select each node → Ceph
  2. Install the Ceph packages when prompted (repeat on every node)
  3. Configure:
    • Public Network: Ceph traffic network
    • Cluster Network (optional but recommended)
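If you prefer the command line, the same steps can be done with pveceph (a sketch; the subnet below assumes the example layout from Step 1, and these commands require a live Proxmox node):

```shell
# Install the Ceph packages (repeat on each node)
pveceph install

# Initialize the Ceph configuration once, pointing the public
# network at the dedicated Ceph subnet from the example layout;
# optionally add --cluster-network for a separate replication subnet
pveceph init --network 172.16.0.0/24
```
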

Step 4: Deploy Ceph MONs and Managers

Create MONs

  • Create one MON per node
  • Ensure all MONs show healthy

Create Ceph Managers

  • At least one active manager
  • One standby manager recommended
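The GUI handles both per node, but the equivalent pveceph commands are (run on each node that should host a MON or a manager):

```shell
# Create a monitor on this node
pveceph mon create

# Create a manager on this node (one becomes active, the rest standby)
pveceph mgr create
```
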

Check health:

ceph status

Step 5: Create Ceph OSDs

Disk Requirements

  • Raw disks (no partitions)
  • Same size disks recommended
  • SSDs or NVMe preferred

Create OSDs

From each node:

  • Select unused disk
  • Create OSD via Proxmox GUI
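On the command line, an OSD is created per disk. The device path /dev/sdb is an example; use lsblk to find your unused disk, and remember the disk must carry no partitions or filesystem signatures:

```shell
# List block devices to find an unused disk
lsblk

# Wipe leftover signatures if needed (destroys all data on the disk!)
# ceph-volume lvm zap /dev/sdb --destroy

# Create the OSD on the raw disk
pveceph osd create /dev/sdb
```
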

Confirm OSDs:

ceph osd tree

Step 6: Create Ceph Pools

Recommended pools:

  • rbd (VM disks)
  • cephfs_data (only if you need CephFS)
  • cephfs_metadata (only if you need CephFS)

Note that CephFS additionally requires at least one Metadata Server (MDS), which Proxmox sets up for you under each node's Ceph → CephFS panel.

Set the replication size (3 copies of each object; I/O continues as long as 2 copies remain available):

Size: 3
Min Size: 2

Enable autoscaling:

ceph osd pool set rbd pg_autoscale_mode on
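Creating the RBD pool with those settings from the CLI might look like the following (the pveceph flags are assumed from current Proxmox tooling; verify with pveceph pool create --help):

```shell
# Create the VM disk pool with 3 replicas; writes continue while >= 2 remain
pveceph pool create rbd --size 3 --min_size 2

# Verify the autoscaler state for all pools
ceph osd pool autoscale-status
```
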

Step 7: Configure Ceph as Proxmox Storage

  1. Go to Datacenter → Storage
  2. Add RBD
  3. Select the Ceph pool
  4. Enable:
    • Disk images
    • Snapshots

Test by creating a VM disk on Ceph storage.
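The resulting entry in /etc/pve/storage.cfg would look roughly like this (the storage ID ceph-rbd is an example name; with hyperconverged Ceph, no monitor hosts need to be listed):

```
rbd: ceph-rbd
    pool rbd
    content images,rootdir
    krbd 0
```

Here `content images,rootdir` allows both VM disks and container volumes on the pool.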


Step 8: Enable High Availability (HA)

Enable HA Services

The HA stack's services run automatically on every node once the cluster is formed; there is nothing extra to install.

Verify services:

systemctl status pve-ha-lrm
systemctl status pve-ha-crm

Step 9: Configure HA Groups (Optional but Recommended)

HA Groups control failover priority.

Example:

  • Group name: primary-group
  • Nodes: pve01 → pve02 → pve03
  • restricted option enabled (resources may only run on nodes in the group)
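The same group can be created from the shell with ha-manager; higher node priority numbers are preferred, and the values here are illustrative:

```shell
# pve01 is preferred (priority 3), then pve02, then pve03;
# --restricted keeps resources on these nodes only
ha-manager groupadd primary-group \
    --nodes "pve01:3,pve02:2,pve03:1" \
    --restricted 1
```
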

Step 10: Make VMs Highly Available

Requirements

  • VM disks must be on shared storage (Ceph)
  • VirtIO disk and network drivers recommended (for performance; not strictly required by HA)

Enable HA

  1. Select VM
  2. Go to HA
  3. Add to HA group
  4. Set:
    • Max. Restart (restart attempts on the same node)
    • Max. Relocate (relocation attempts to other nodes)
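From the CLI, adding a VM to HA looks like this (VMID 100 is hypothetical; the group name matches the example from Step 9):

```shell
# Manage VM 100 under HA: restart it up to 2 times on its node,
# then relocate it at most once to another node, and keep it started
ha-manager add vm:100 --group primary-group \
    --max_restart 2 --max_relocate 1 --state started
```
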

Step 11: Testing Failover

Test 1: Graceful Node Shutdown

On the node currently running an HA-managed VM:

shutdown now

Expected behavior:

  • VM stops on failed node
  • VM restarts on another node

Test 2: Hard Failure

  • Power off node abruptly
  • Watch HA Manager relocate VM

Check HA status:

ha-manager status

Step 12: Tuning and Best Practices

Ceph

  • Use dedicated Ceph network
  • Monitor latency regularly
  • Avoid mixing slow disks with fast disks

Proxmox HA

  • Don’t overcommit RAM heavily
  • Use VM startup delays
  • Keep fencing enabled

General

  • Monitor quorum status
  • Back up VMs even with HA
  • Keep all nodes on the same Proxmox version

Common Pitfalls

  • Running Ceph and management traffic on the same NIC
  • Using consumer-grade disks without power loss protection
  • Less than 3 nodes (no quorum)
  • HA without shared storage

Conclusion

A Proxmox VE 9.1 HA cluster with Ceph provides:

  • Automatic failover
  • Scalable storage
  • No single point of failure

While setup requires careful planning, the result is a resilient, enterprise-grade virtualization platform built entirely on open-source technologies.
