How to Utilize a Cluster: All About SLURM

Hosted by Zach Mueller

129 students

In this video

What you'll learn

Why is managing multiple computer difficult?

We will learn *why* tools like SLURM, Kubernetes, and the like exist for when people mention "scale"

Why SLURM?

Why is SLURM my go-to choice between them all? What makes it special? Isn't it more... intense? (Spoiler: no)

I just use Colab... do I really need to know this?

We'll discuss why knowing SLURM (or in general how these systems work) is imperative in this modern age of Deep Learning

How can I set up SLURM at home?

We're going to see how I've managed a SLURM cluster *in my house*, so that you know how to apply it to your own cluster

Why this topic matters

In the modern age of Deep Learning, gigantic models are trained on hundreds and thousands of GPUs at a given time. Knowing how to utilize what the modern labs are doing for cluster management is a rapidly needed skill, which sadly means learning some Linux. Even if you may not have access to a cluster TODAY to apply this to, if you ever will, this lesson will be invaluable.

You'll learn from

Zach Mueller

Instructor, Technical Lead at Hugging Face

I've been in the field for almost a decade now. I first started in the fast.ai community, quickly learning how modern-day training pipelines are built and operated. Then I moved to Hugging Face, where I'm the Technical Lead on the accelerate project and manage the transformers Trainer.

I've written numerous blogs, courses, and given talks on distributed training and PyTorch throughout my career.

Previously at