I've been reading up on last week's ZeRO-Offload release – here are a few introductory notes on what it is, when it's useful, how it can be used and how it works.

What is ZeRO-Offloading?

ZeRO-Offloading is a way of reducing GPU memory usage during neural network training by offloading data and compute from the GPU(s) to the host CPU. Crucially, this is done in a way that maintains high training throughput and avoids major slow-downs from moving data back and forth and from doing the offloaded computations on the CPU.

ZeRO-Offloading makes it possible to train models that are up to 10x larger than previously possible on the same hardware – even on a single GPU. You could, for instance, train a GPT-2-style model with roughly 10 billion parameters on a single V100 GPU with 32 GB of memory. It also promises near-linear scaling in multi-GPU settings.
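
To see why a ~10-billion-parameter model can plausibly fit, here's a rough back-of-the-envelope sketch of my own, using the standard memory accounting for mixed-precision Adam from the ZeRO paper (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for the fp32 master weights and Adam moments); activations and temporary buffers come on top of this:

```python
# Back-of-the-envelope model-state memory for a ~10B-parameter model trained
# with mixed-precision Adam. Activations, buffers and fragmentation come on
# top, so treat these as rough lower bounds.
GB = 1024 ** 3
n_params = 10e9

fp16_params = 2 * n_params    # stays on the GPU under ZeRO-Offload
fp16_grads  = 2 * n_params    # offloaded to CPU memory
optim_state = 12 * n_params   # fp32 weights + Adam momentum/variance, offloaded to CPU

print(f"everything on the GPU:      {(fp16_params + fp16_grads + optim_state) / GB:.0f} GiB")  # ~149 GiB
print(f"on the GPU with offloading: {fp16_params / GB:.0f} GiB")                               # ~19 GiB
print(f"offloaded to CPU memory:    {(fp16_grads + optim_state) / GB:.0f} GiB")                # ~130 GiB
```

With only the fp16 weights resident on the GPU, roughly 19 GiB of the V100's 32 GB goes to model state, leaving room for activations (which are typically reduced further with activation checkpointing).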

When is this going to be useful for you?

  • If you want to train larger models, or train your current models faster – ZeRO-Offloading frees up GPU memory, which lets you train with larger batch sizes.
  • If you're working with PyTorch and are willing/able to work with Microsoft's DeepSpeed library (other ZeRO-Offloading implementations are on the horizon but not available for now). Alternatively you could try to adapt the official implementation yourself.
  • If you're willing to take on some modelling constraints. The current version of ZeRO-Offloading is tied to mixed precision training with Adam, for instance.

How can you use it?

ZeRO-Offloading is implemented in Microsoft's DeepSpeed library. A native PyTorch ZeRO implementation is being developed, but offloading is not supported there at this point. The official implementation is available, so you could try to work with it directly.

Once you're set up with DeepSpeed, the additional effort required to use ZeRO-Offloading seems to be quite small: essentially it comes down to setting a few flags and adjusting a configuration file.
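
As an illustration – treat this as a sketch rather than a reference, since the exact configuration keys and keyword arguments differ between DeepSpeed versions (newer releases use "offload_optimizer": {"device": "cpu"} instead of "cpu_offload": true) – enabling ZeRO-Offload might look roughly like this:

```python
# Sketch of enabling ZeRO-Offload via a DeepSpeed config (values are illustrative).
import torch
import deepspeed

ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},            # ZeRO-Offload targets mixed-precision training
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,                       # optimizer state + gradient partitioning
        "cpu_offload": True,              # move optimizer state and gradients to CPU memory
    },
}

model = torch.nn.Linear(1024, 1024)       # stand-in for your actual model

# deepspeed.initialize wraps the model in an engine that handles the offloading;
# older releases take the config via `config_params` instead of `config`.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

In the training loop you then call model_engine.backward(loss) and model_engine.step() instead of the usual PyTorch calls, and launch the script with the deepspeed launcher.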

Hugging Face's transformers library has an experimental integration with DeepSpeed. Stas Bekman has a nice blog post describing how to use it and what kind of results you can expect on a few benchmarks.
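
The hook in that integration is the deepspeed argument of TrainingArguments (or the --deepspeed flag of the example scripts), which points at a DeepSpeed config file like the one above. A minimal, toy-sized sketch – model, dataset and hyperparameters here are placeholders, not Stas Bekman's setup:

```python
# Toy sketch of the experimental transformers/DeepSpeed integration: the Trainer
# picks up ZeRO-Offload from the config file passed via `deepspeed`.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(lambda b: tokenizer(b["text"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    fp16=True,
    deepspeed="ds_config.json",   # a DeepSpeed config with ZeRO-Offload enabled, as sketched above
)

Trainer(model=model, args=args, train_dataset=dataset, tokenizer=tokenizer).train()
```

As with the previous sketch, you'd launch this with the deepspeed launcher rather than plain python.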

Facebook Research's fairscale has a partial implementation of ZeRO, the multi-GPU memory optimization method on top of which ZeRO-Offloading is built. CPU offloading does not seem to be supported at this point.

How does it work?

ZeRO-Offloading is based on the Zero Redundancy Optimizer (ZeRO), so let's quickly review what ZeRO is and how it works.

The Zero Redundancy Optimizer

ZeRO, in a nutshell, is a memory optimization method for data-parallel training in which gradients, parameters and optimizer state are partitioned across the memory of the participating GPUs instead of being replicated on each of them. This is done in a way that keeps the communication overhead between GPUs relatively low.
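
To make the savings concrete, here is a small sketch of the model-state accounting from the ZeRO paper (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for the fp32 weights and Adam moments); the stages correspond to partitioning progressively more of that state:

```python
# Per-GPU memory for model states under the ZeRO stages, following the
# accounting in the ZeRO paper (activations are not included).
def model_state_bytes_per_gpu(n_params: float, n_gpus: int, stage: int) -> float:
    params, grads, optim = 2 * n_params, 2 * n_params, 12 * n_params
    if stage == 0:                                 # plain data parallelism: everything replicated
        return params + grads + optim
    if stage == 1:                                 # optimizer state partitioned
        return params + grads + optim / n_gpus
    if stage == 2:                                 # + gradients partitioned
        return params + (grads + optim) / n_gpus
    return (params + grads + optim) / n_gpus       # stage 3: parameters partitioned as well

GB = 1e9
for stage in range(4):
    gb = model_state_bytes_per_gpu(7.5e9, n_gpus=64, stage=stage) / GB
    print(f"ZeRO stage {stage}: {gb:5.1f} GB per GPU")   # 7.5B params on 64 GPUs: 120 / 31.4 / 16.6 / 1.9
```

ZeRO-Offloading builds on stage 2, i.e. the optimizer state and gradient partitioning, as the quote in the next section spells out.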

I recommend reading this introductory blog post – especially watching the animation – and then reading the paper if you want to go deeper. The official implementation of ZeRO can be found here.

Here's a figure that illustrates the distribution of parameters, gradients and optimizer states across GPUs (source):

ZeRO-Offloading

Quoting directly from a blog post on ZeRO-Offloading (I've added the Figure that is being referred to below):

(...) , ZeRO-Offload inherits the optimizer state and gradient partitioning from ZeRO-2. Unlike ZeRO-2, instead of having each GPU keep a partition of the optimizer state and gradients, ZeRO-Offload offloads both to host CPU memory. Optimizer states are kept in CPU memory for the entire training. Gradients, on the other hand, are computed and averaged using reduce-scatter on the GPUs during the backward pass, and each data-parallel process then offloads the averaged gradients belonging to its partition to the CPU memory (g offload in Figure 7) while discarding the rest. Once the gradients are available on the CPU, optimizer state partitions are updated in parallel by each data parallel process directly on the CPU (p update in Figure 7). After the update, parameter partitions are moved back to GPU followed by an all-gather operation on the GPU to gather all the updated parameters (g swap in Figure 7). ZeRO-Offload also exploits overlapping between communication (such as g offload and g swap) and computation (such as the backward pass and p update) using separate CUDA streams to maximize training efficiency.

This process is illustrated in this figure from the blog post:
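
To make that data flow concrete, here is a stripped-down, single-GPU schematic of the same idea – my own sketch, not DeepSpeed's code, with loss scaling, gradient partitioning and the stream-based overlapping all left out: the fp16 weights and the forward/backward pass stay on the GPU, while the fp32 master weights and the Adam state live in CPU memory.

```python
# Single-GPU schematic of the ZeRO-Offload data flow (not DeepSpeed's actual code).
import torch

device = "cuda"
model = torch.nn.Linear(4096, 4096).half().to(device)        # fp16 model on the GPU

# fp32 master weights and Adam state live in (pinned) CPU memory.
master = [p.detach().float().cpu().pin_memory().requires_grad_(True) for p in model.parameters()]
optimizer = torch.optim.Adam(master, lr=1e-4)                 # DeepSpeed uses DeepCPUAdam here

x = torch.randn(8, 4096, device=device, dtype=torch.float16)

for step in range(3):
    model.zero_grad(set_to_none=True)
    loss = model(x).float().pow(2).mean()                     # forward + loss on the GPU (no loss scaling here)
    loss.backward()                                           # fp16 gradients on the GPU

    # "g offload": copy the gradients to CPU memory (overlapped with backward in DeepSpeed).
    for p, mp in zip(model.parameters(), master):
        mp.grad = p.grad.detach().float().cpu()

    optimizer.step()                                          # "p update": Adam runs on the CPU

    # "g swap": copy the updated fp32 weights back to the GPU as fp16
    # (followed by an all-gather of the partitions in the multi-GPU case).
    with torch.no_grad():
        for p, mp in zip(model.parameters(), master):
            p.copy_(mp.to(device=device, dtype=torch.float16, non_blocking=True))
```

In the real implementation the per-partition copies are issued on separate CUDA streams, so the gradient offload overlaps with the rest of the backward pass and the parameter swap overlaps with the CPU update – that overlapping is what keeps throughput close to GPU-only training.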

One thing to note here is that ZeRO-Offloading is designed specifically for mixed-precision training with Adam. In particular, the current version uses DeepCPUAdam, an optimized CPU implementation of Adam, mainly to keep the optimizer computation on the CPU from becoming the bottleneck of the whole process. This implementation seems to be about 6x faster than PyTorch's Adam run on the CPU.
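
If you want a feel for that difference on your own machine, a quick and unscientific timing along these lines is easy to set up (DeepSpeedCPUAdam is the class name in recent DeepSpeed releases; don't expect to reproduce the exact 6x figure with this toy setup):

```python
# Toy timing of a CPU optimizer step: torch.optim.Adam vs DeepSpeed's CPU Adam.
# Results depend heavily on hardware, thread count and parameter count.
import time
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam   # builds a C++ extension on first use

N = 100_000_000                                    # 100M fp32 parameters, all on the CPU

def seconds_per_step(opt_cls, steps=5):
    p = torch.zeros(N, requires_grad=True)
    p.grad = torch.randn(N)
    opt = opt_cls([p], lr=1e-4)
    opt.step()                                     # warm-up (and extension build for DeepSpeed)
    start = time.perf_counter()
    for _ in range(steps):
        opt.step()
    return (time.perf_counter() - start) / steps

print("torch.optim.Adam:", seconds_per_step(torch.optim.Adam), "s per step")
print("DeepSpeedCPUAdam:", seconds_per_step(DeepSpeedCPUAdam), "s per step")
```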

Lastly, here are some of the results from the ZeRO-Offload paper I found particularly interesting:

  1. The largest models that can be trained with ZeRO-Offload:

  2. Near-linear scaling of throughput per GPU as the number of GPUs is increased – when used in combination with ZeRO:

  3. Throughput per GPU of PyTorch, L2L and ZeRO-Offload:

Here's the paper, ZeRO-Offload: Democratizing Billion-Scale Model Training.