IBM Storage Ceph Expert - Production Engineering

Code: ISCEPHE

Description

An intensive 3-day program designed to tackle real-world crises and optimize production clusters at petabyte scale. It takes you from architecture to forensic troubleshooting in production.

The expert course focuses 100% on critical production operations:

  • Forensic troubleshooting when everything fails
  • Real disaster recovery (not simulations)
  • Advanced performance engineering
  • Complex multi-factor scenarios

The Ceph Advanced and Ceph Expert courses are complementary. The Advanced course covers "how to configure it well" and the Expert course covers "what to do when it fails badly".

The course is distribution-agnostic. The troubleshooting, DR, and optimization techniques taught are universal and work the same on any Linux distribution. Labs can be configured with:

  • Linux
    • Rocky Linux
    • Ubuntu
    • RHEL
    • Alma Linux
  • Storage
    • IBM Storage Ceph
    • Ceph upstream (Squid 19.2+)
    • Red Hat Ceph Storage
    • or whichever version you prefer

Audience

Administrators and engineers with production experience who need to master the real-world critical scenarios that vendors don't teach.

Prerequisites

You should have completed our IBM Storage Ceph Deployment and Administration and IBM Storage Ceph Advanced courses, or have equivalent knowledge.

This course assumes you have:

  • Mastery of Ceph architecture (MON/OSD/MGR)
  • Pool/PG/CRUSH management skills
  • Basic troubleshooting ability
  • Practical experience managing clusters in production (2+ years or equivalent courses)

Objectives

You will learn to solve:

  • Critical failures in 200TB+ clusters
  • Recovery of 40TB corrupted CephFS
  • Extreme tuning for AI/ML (500TB/day)
  • Troubleshooting under 24/7 pressure

Topics

Advanced Performance Engineering & Forensics

From architecture to forensic troubleshooting in production

  • Architectural Optimization
    • BlueStore internals: RocksDB tuning, compaction, write amplification
    • CPU optimization: C-states impact (labs showing 5x degradation), NUMA
    • Network: 100GbE patterns, TCP tuning, nf_conntrack
    • NVMe-specific: reactor tuning, bdevs_per_cluster optimization
  • Forensic Troubleshooting
    • Diagnostic toolchain: blktrace, perf, objectstore-tool
    • Real case studies: NVMe degradation, post-upgrade OSD flapping
    • Advanced PG lifecycle: stuck states, manual intervention
    • Labs: Cluster with real problems to diagnose
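The PG-lifecycle work above starts from the output of `ceph pg dump_stuck`. As a minimal sketch (not part of the course materials), the snippet below groups stuck PGs by state so the worst cases can be attacked first; the embedded sample JSON is a simplified assumption of the real schema, which varies by release.

```python
import json

# Sample shaped like `ceph pg dump_stuck unclean --format json`
# (simplified; the real schema varies by Ceph release)
SAMPLE = json.dumps([
    {"pgid": "3.1f", "state": "active+undersized+degraded", "up": [2, 5], "acting": [2, 5]},
    {"pgid": "3.2a", "state": "incomplete", "up": [1], "acting": [1]},
    {"pgid": "4.0b", "state": "active+undersized+degraded", "up": [0, 3], "acting": [0, 3]},
])

def triage_stuck_pgs(dump_json: str) -> dict:
    """Group stuck PGs by state so incomplete/down/stale PGs,
    which need manual intervention, surface immediately."""
    by_state: dict[str, list[str]] = {}
    for pg in json.loads(dump_json):
        by_state.setdefault(pg["state"], []).append(pg["pgid"])
    return by_state

for state, pgids in sorted(triage_stuck_pgs(SAMPLE).items()):
    print(f"{state}: {pgids}")
```

Only once a PG is isolated this way would tools such as `ceph-objectstore-tool` come into play on the affected OSDs.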

Disaster Recovery, Multi-Site & Petabyte Scaling

Extreme recovery and multi-site architectures

  • Advanced Disaster Recovery
    • Edinburgh 40TB case: complete error chain and recovery procedures
    • CephFS disasters: metadata corruption, MDS failure handling
    • RBD mirroring: pool vs image-based, failover automation
    • Physical DR: disk extraction, journal and whoami preservation
  • Multi-Site & Petabytes
    • RGW multisite: master zone failure, manual promotion, sync fairness
    • WAN planning: formulas for 1 GbE per 8TB daily ingest
    • Petabyte challenges: CERN 30PB (7,200 OSDs), 310M objects
    • Labs: Multi-site failover and recovery simulation
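The "1 GbE per 8TB daily ingest" rule of thumb above is just arithmetic: spread the day's ingest over 86,400 seconds and add headroom for bursts and sync catch-up. A minimal sketch (the 25% headroom factor is an assumption, not a course-specified value):

```python
def replication_bandwidth_gbps(daily_ingest_tb: float, headroom: float = 1.25) -> float:
    """Sustained WAN bandwidth (Gbit/s) needed to keep a secondary
    site in sync with a given daily ingest. Uses TB = 10**12 bytes."""
    bits_per_day = daily_ingest_tb * 10**12 * 8
    gbps = bits_per_day / 86_400 / 10**9
    return gbps * headroom

# The rule of thumb: ~1 GbE per 8 TB of daily ingest
print(f"{replication_bandwidth_gbps(8):.2f} Gbit/s")  # 0.93 Gbit/s, fits one 1 GbE link
```

Double the ingest to 16 TB/day and the result exceeds 1 Gbit/s, which is why the sizing is quoted per 8 TB increment.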

Security, AI/ML Workloads & Cost Engineering

Enterprise security and optimization for modern workloads

  • Security Hardening
    • Encryption: LUKS/dmcrypt OSDs, msgr2 secure, RGW SSE-S3/KMS
    • Key management: rotation (Squid 19.2.3+), Barbican integration
    • Compliance: HIPAA architecture, GDPR, audit logging
    • Threat detection: monitoring patterns, vulnerability management
  • AI/ML & ROI Engineering
    • S3 Select: Trino integration (2.5x-9x performance), analytics pushdown
    • AI/ML patterns: checkpointing, parallel access optimization
    • TCO analysis: EC efficiency, commodity hardware savings
    • Hybrid architectures: OpenStack DCN, edge-to-core, multi-cloud
  • Lab Specifications: Realistic enterprise cloud infrastructure
    • Infrastructure
      • Real 5-6 node cluster
      • 500GB+ pre-populated data per student
      • 24/7 access for 7+ days post-course
    • Real Scenarios
      • Disk failures & network partitions
      • Simulated metadata corruption
      • Injected performance degradation
    • Tools
      • blktrace, perf, objectstore-tool
      • Pre-installed debugging symbols
      • Real datasets with I/O patterns
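The EC-efficiency component of the TCO analysis mentioned above comes down to usable-to-raw ratios: replication gives 1/N, erasure coding gives k/(k+m). A minimal sketch of that comparison (the 1 PB figure and the 4+2 profile are illustrative choices, not course-mandated values):

```python
def usable_fraction_replicated(size: int) -> float:
    """Usable capacity as a fraction of raw for size-N replication."""
    return 1.0 / size

def usable_fraction_ec(k: int, m: int) -> float:
    """Usable fraction for an erasure-coded pool with k data + m coding chunks."""
    return k / (k + m)

def raw_tb_needed(usable_tb: float, usable_fraction: float) -> float:
    """Raw capacity required to provide a given usable capacity."""
    return usable_tb / usable_fraction

# Storing 1 PB usable: replica-3 vs EC 4+2
rep = raw_tb_needed(1000, usable_fraction_replicated(3))  # 3000 TB raw
ec = raw_tb_needed(1000, usable_fraction_ec(4, 2))        # 1500 TB raw
print(f"replica-3: {rep:.0f} TB raw, EC 4+2: {ec:.0f} TB raw")
```

Halving the raw capacity requirement is where the commodity-hardware savings in the TCO discussion come from, traded against EC's higher CPU cost and slower recovery.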

Price (ex. VAT)

€ 3.250,00 per person

Duration

3 days

Schedule

  • Virtual
  • 06-04-2026 - 08-04-2026
  • register

  • Virtual
  • 01-06-2026 - 03-06-2026
  • register

  • Virtual
  • 27-07-2026 - 29-07-2026
  • register

Delivery methods

  • Classroom
  • On-site (at your location)
  • Virtual (instructor online)

Questions?

Write to us and we will contact you to discuss your requirements.
contact us