LIBRISTO
LIBROAMANTO
mandatory
Become part of a community of book lovers from all over the world and get access to a whole bunch of benefits. Create an account for free
0
Austrian Post 5.49 DPD courier 3.99 DPD point 2.99

HPC Observability

Production Monitoring, Profiling, and Site Reliability for Linux Clusters, GPUs, and Parallel Storage at Scale

Language EnglishEnglish
Book Paperback
Book HPC Observability M. Edwards
Libristo code: 52747456
Publishers Independently published, May 2026
HPC Observability is a hands-on guide for the engineers and administrators who keep high-performance... Full description
? points 55 b New New
22.29 VAT included
Expected in stock Expected 02. 06. 2026
Austria Delivery to Austria

30-day return policy

HPC Observability is a hands-on guide for the engineers and administrators who keep high-performance computing systems running reliably at scale. It brings together the operational knowledge scattered across vendor documentation, conference papers, and forum threads into a practical framework for turning HPC telemetry into actionable insight.

Modern HPC environments - Slurm clusters, GPU-dense AI systems, Lustre and GPFS storage, InfiniBand and Slingshot fabrics - generate more data than any team can manually interpret. The result is wasted node-hours, failed simulations, hidden storage bottlenecks, fabric congestion, and GPU failures that surface only after days of runtime.

This book provides a complete operational approach to HPC observability through a five-layer model covering hardware, operating systems, schedulers, applications, storage, and networks. Readers learn how to build metrics pipelines for clusters from hundreds to tens of thousands of nodes; monitor GPUs with DCGM; profile MPI and OpenMP applications with PAPI and Score-P; diagnose storage and network slowdowns; create useful dashboards and alerts; and run effective incident response and post-mortems.

Drawing on peer-reviewed research and real production experience, the book includes original diagrams, practical workflows, reference material, Prometheus alert examples, and a step-by-step lab environment for learning on a laptop.

Written in the voice of a senior HPC engineer rather than an academic text, HPC Observability assumes readers already understand the fundamentals and focuses instead on the operational realities of running large-scale Linux, AI, and research-computing infrastructure.

Actress & Polyglot
EWA KASP for
Play video
Ewa Kasp
Libristo has the largest selection of foreign-language books. That’s why I buy my books there.

About the book

Full name HPC Observability
Author M. Edwards
Language English
Binding Book - Paperback
Date of issue 2026
Number of pages 164
EAN 9798198765443
Libristo code 52747456
Weight 397
Dimensions 216 x 280 x 9
Give this book today
It's easy
1 Add to cart and choose Deliver as present at the checkout 2 We'll send you a voucher 3 The book will arrive at the recipient's address

Login

Log in to your account. Don't have a Libristo account? Create one now!

 
mandatory
mandatory

Don’t have an account? Discover the benefits of having a Libristo account!

With a Libristo account, you'll have everything under control.

Create a Libristo account
Book advisor Libroamiko
Hi, I'm Libroamiko, can I help?