How to Utilize a Cluster: All About SLURM
Hosted by Zach Mueller
What you'll learn
Why is managing multiple computers difficult?
We'll learn *why* tools like SLURM and Kubernetes exist, and what people actually mean when they talk about "scale"
Why SLURM?
Why is SLURM my go-to choice among them all? What makes it special? Isn't it more... intense? (Spoiler: no)
I just use Colab... do I really need to know this?
We'll discuss why knowing SLURM (or, more generally, how these systems work) is imperative in the modern age of Deep Learning
How can I set up SLURM at home?
We're going to see how I've managed a SLURM cluster *in my house*, so that you know how to apply it to your own cluster
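To give a taste of what working with SLURM looks like, here is a minimal batch-script sketch of the kind a home cluster could run. The job name, resource values, and `train.py` script are illustrative placeholders, not part of the lesson material:

```shell
#!/bin/bash
#SBATCH --job-name=train-demo   # name shown in the queue (squeue)
#SBATCH --nodes=1               # run on a single machine
#SBATCH --gres=gpu:1            # request one GPU on that node
#SBATCH --time=01:00:00         # wall-clock limit (HH:MM:SS)
#SBATCH --output=%x-%j.out      # log file named after job name + job ID

# srun launches the command under SLURM's control on the allocated node
srun python train.py
```

You would submit this with `sbatch train.sh` and watch it in the queue with `squeue`; the course covers these commands in depth.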
Why this topic matters
In the modern age of Deep Learning, gigantic models are trained on hundreds or thousands of GPUs at a time. Knowing how to use the cluster-management tools that modern labs rely on is an increasingly necessary skill, which sadly means learning some Linux. Even if you don't have access to a cluster today, if you ever do, this lesson will be invaluable.
You'll learn from
Zach Mueller
Instructor, Technical Lead at Hugging Face
I've been in the field for almost a decade now. I first started in the fast.ai community, quickly learning how modern-day training pipelines are built and operated. Then I moved to Hugging Face, where I'm the Technical Lead on the accelerate project and manage the transformers Trainer.
I've written numerous blogs, courses, and given talks on distributed training and PyTorch throughout my career.