IBM Storage Ceph Expert - Production Engineering

Code: ISCEPHE

Description

An intensive 3-day program designed to tackle real-world crises and optimize production clusters at petabyte scale. It takes you from architecture to forensic troubleshooting in production.

The expert course focuses 100% on critical production operations:

  • Forensic troubleshooting when everything fails
  • Real disaster recovery (not simulations)
  • Advanced performance engineering
  • Complex multi-factor scenarios

The Ceph Advanced and Ceph Expert courses are complementary. The Advanced course covers "how to configure it well" and the Expert course covers "what to do when it fails badly".

The course is distribution-agnostic. The troubleshooting, DR, and optimization techniques taught are universal and work the same on any Linux distribution. Labs can be configured with:

  • Linux
    • Rocky Linux
    • Ubuntu
    • RHEL
    • Alma Linux
  • Storage
    • IBM Storage Ceph
    • Ceph upstream (Squid 19.2+)
    • Red Hat Ceph Storage
    • or whichever version you prefer

Audience

Administrators and engineers with production experience who need to master the real-world critical scenarios that vendors don't teach.

Prerequisites

You should have completed our IBM Storage Ceph Deployment and Administration and IBM Storage Ceph Advanced courses, or have equivalent knowledge.

This course assumes you have:

  • Mastery of Ceph architecture (MON/OSD/MGR)
  • Pool/PG/CRUSH management skills
  • Basic troubleshooting ability
  • Practical experience managing clusters in production (2+ years or equivalent courses)

Objectives

You will learn to solve:

  • Critical failures in 200TB+ clusters
  • Recovery of 40TB corrupted CephFS
  • Extreme tuning for AI/ML (500TB/day)
  • Troubleshooting under 24/7 pressure

Topics

Advanced Performance Engineering & Forensics

From architecture to forensic troubleshooting in production

  • Architectural Optimization
    • BlueStore internals: RocksDB tuning, compaction, write amplification
    • CPU optimization: C-states impact (labs showing 5x degradation), NUMA
    • Network: 100GbE patterns, TCP tuning, nf_conntrack
    • NVMe-specific: reactor tuning, bdevs_per_cluster optimization
  • Forensic Troubleshooting
    • Diagnostic toolchain: blktrace, perf, objectstore-tool
    • Real case studies: NVMe degradation, post-upgrade OSD flapping
    • Advanced PG lifecycle: stuck states, manual intervention
    • Labs: Cluster with real problems to diagnose
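The PG-lifecycle work above starts from the output of `ceph pg dump_stuck`. As a minimal sketch (not part of the course materials), the snippet below groups stuck PGs by state so the worst cases can be attacked first; the embedded sample JSON is a simplified assumption of the real schema, which varies by release.

```python
import json

# Sample shaped like `ceph pg dump_stuck unclean --format json`
# (simplified; the real schema varies by Ceph release)
SAMPLE = json.dumps([
    {"pgid": "3.1f", "state": "active+undersized+degraded", "up": [2, 5], "acting": [2, 5]},
    {"pgid": "3.2a", "state": "incomplete", "up": [1], "acting": [1]},
    {"pgid": "4.0b", "state": "active+undersized+degraded", "up": [0, 3], "acting": [0, 3]},
])

def triage_stuck_pgs(dump_json: str) -> dict:
    """Group stuck PGs by state so incomplete/down/stale PGs,
    which need manual intervention, surface immediately."""
    by_state: dict[str, list[str]] = {}
    for pg in json.loads(dump_json):
        by_state.setdefault(pg["state"], []).append(pg["pgid"])
    return by_state

for state, pgids in sorted(triage_stuck_pgs(SAMPLE).items()):
    print(f"{state}: {pgids}")
```

Only once a PG is isolated this way would tools such as `ceph-objectstore-tool` come into play on the affected OSDs.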

Disaster Recovery, Multi-Site & Petabyte Scaling

Extreme recovery and multi-site architectures

  • Advanced Disaster Recovery
    • Edinburgh 40TB case: complete error chain and recovery procedures
    • CephFS disasters: metadata corruption, MDS failure handling
    • RBD mirroring: pool vs image-based, failover automation
    • Physical DR: disk extraction, journal and whoami preservation
  • Multi-Site & Petabytes
    • RGW multisite: master zone failure, manual promotion, sync fairness
    • WAN planning: formulas for 1 GbE per 8TB daily ingest
    • Petabyte challenges: CERN 30PB (7,200 OSDs), 310M objects
    • Labs: Multi-site failover and recovery simulation
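The "1 GbE per 8TB daily ingest" rule of thumb above is just arithmetic: spread the day's ingest over 86,400 seconds and add headroom for bursts and sync catch-up. A minimal sketch (the 25% headroom factor is an assumption, not a course-specified value):

```python
def replication_bandwidth_gbps(daily_ingest_tb: float, headroom: float = 1.25) -> float:
    """Sustained WAN bandwidth (Gbit/s) needed to keep a secondary
    site in sync with a given daily ingest. Uses TB = 10**12 bytes."""
    bits_per_day = daily_ingest_tb * 10**12 * 8
    gbps = bits_per_day / 86_400 / 10**9
    return gbps * headroom

# The rule of thumb: ~1 GbE per 8 TB of daily ingest
print(f"{replication_bandwidth_gbps(8):.2f} Gbit/s")  # 0.93 Gbit/s, fits one 1 GbE link
```

Double the ingest to 16 TB/day and the result exceeds 1 Gbit/s, which is why the sizing is quoted per 8 TB increment.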

Security, AI/ML Workloads & Cost Engineering

Enterprise security and optimization for modern workloads

  • Security Hardening
    • Encryption: LUKS/dmcrypt OSDs, msgr2 secure, RGW SSE-S3/KMS
    • Key management: rotation (Squid 19.2.3+), Barbican integration
    • Compliance: HIPAA architecture, GDPR, audit logging
    • Threat detection: monitoring patterns, vulnerability management
  • AI/ML & ROI Engineering
    • S3 Select: Trino integration (2.5x-9x performance), analytics pushdown
    • AI/ML patterns: checkpointing, parallel access optimization
    • TCO analysis: EC efficiency, commodity hardware savings
    • Hybrid architectures: OpenStack DCN, edge-to-core, multi-cloud
  • Lab Specifications: Realistic enterprise cloud infrastructure
    • Infrastructure
      • Real 5-6 node cluster
      • 500GB+ pre-populated data per student
      • 24/7 access for 7+ days post-course
    • Real Scenarios
      • Disk failures & network partitions
      • Simulated metadata corruption
      • Injected performance degradation
    • Tools
      • blktrace, perf, objectstore-tool
      • Pre-installed debugging symbols
      • Real datasets with I/O patterns
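The EC-efficiency component of the TCO analysis mentioned above comes down to usable-to-raw ratios: replication gives 1/N, erasure coding gives k/(k+m). A minimal sketch of that comparison (the 1 PB figure and the 4+2 profile are illustrative choices, not course-mandated values):

```python
def usable_fraction_replicated(size: int) -> float:
    """Usable capacity as a fraction of raw for size-N replication."""
    return 1.0 / size

def usable_fraction_ec(k: int, m: int) -> float:
    """Usable fraction for an erasure-coded pool with k data + m coding chunks."""
    return k / (k + m)

def raw_tb_needed(usable_tb: float, usable_fraction: float) -> float:
    """Raw capacity required to provide a given usable capacity."""
    return usable_tb / usable_fraction

# Storing 1 PB usable: replica-3 vs EC 4+2
rep = raw_tb_needed(1000, usable_fraction_replicated(3))  # 3000 TB raw
ec = raw_tb_needed(1000, usable_fraction_ec(4, 2))        # 1500 TB raw
print(f"replica-3: {rep:.0f} TB raw, EC 4+2: {ec:.0f} TB raw")
```

Halving the raw capacity requirement is where the commodity-hardware savings in the TCO discussion come from, traded against EC's higher CPU cost and slower recovery.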

Price (ex. VAT)

€ 3.250,00 per person

Duration

3 days

Schedule

  • Virtual
  • 06-04-2026 - 08-04-2026
  • register

  • Virtual
  • 01-06-2026 - 03-06-2026
  • register

  • Virtual
  • 27-07-2026 - 29-07-2026
  • register

Delivery methods

  • Classroom
  • On-site (at your location)
  • Virtual (instructor online)

Questions?

Write to us and we will contact you to discuss your requirements.
contact us