Speeding up Runpod

# December 18, 2023

Runpod.io is my favorite GPU provider right now for smaller experiments. They have pretty consistent availability of 4x/8x configurations with A100 80GB GPUs, alongside some current-generation NVIDIA chips.

One issue I've observed a few times now is varying runtime performance box-to-box. My working mental model of VMs is that you have full control of your allocation; if you've been granted 4 CPUs you get the ability to push 4 CPUs to the brink of capacity. Of course, the reality is a bit more murky depending on your underlying kernel and virtual machine manager, but usually this simple model works out fine.

On Runpod, any configuration smaller than the full 8 GPUs is multi-tenant, so you might be competing with other workloads on the same host. A few times now I've observed sluggish performance on the box: batch preprocessing slow to complete, bash commands slow to echo back, and so on.

(Screenshot: an htop readout that you want to see at bootup. I might even have the box to myself.)

My default when starting up new boxes has become:

  • Immediately install htop and check the current server load. The load average and memory readouts cover the whole box, not just your container's share, so you can see what your neighbors are up to (see the sketch after this list).
  • When the box is overloaded, both of those numbers can creep toward 100% utilization. My guess is there's some default swapping allowed on these boxes, and that behavior makes everything noticeably slower once you hit the limit.
  • If you need fast network IO to external services or to your local machine, make sure the box is colocated in a nearby region when you create it.
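
If you'd rather see a quick number than eyeball htop, the same figures can be read straight from /proc. This is a minimal sketch, nothing Runpod-specific; like htop, it reports host-wide load and memory rather than just your container's share.

```python
import os


def load_average():
    """Return the 1/5/15-minute load averages for the whole host."""
    with open("/proc/loadavg") as f:
        one, five, fifteen = f.read().split()[:3]
    return float(one), float(five), float(fifteen)


def memory_gib():
    """Return (total_gib, available_gib) parsed from /proc/meminfo."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":")
            info[key] = int(value.split()[0])  # values are reported in kB
    return info["MemTotal"] / 1024**2, info["MemAvailable"] / 1024**2


if __name__ == "__main__":
    one, five, fifteen = load_average()
    total, available = memory_gib()
    print(f"load averages: {one:.1f} {five:.1f} {fifteen:.1f} across {os.cpu_count()} CPUs")
    print(f"memory: {available:.1f} GiB available of {total:.1f} GiB")
    # A load average already near the CPU count, or most memory consumed
    # before you've launched anything, usually means a busy neighbor.
```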

(Screenshot: the Secure Cloud creation panel. It defaults to anywhere in the world but lets you customize the device region.)

If this happens to you and you need quicker processing:

  • Consider using an A100 SXM 80GB configuration. I've found both the speed and the availability of these boxes to be better than the stock A100 80GBs.
  • As a last resort, consider upgrading your GPU allocation as well. You'll get more CPUs and memory alongside the extra GPUs, and it forces their task allocator to place you on a box with less overall load.

Switching to a less loaded box has cut the runtime of some of my processing tasks from 3h+ down to about 10 minutes. It can make a world of difference if you're observing performance that's meaningfully slower than what you see during local development.
