Say you're training a deep learning model in PyTorch. What can you do to make your training finish faster?
In this post, I'll provide an overview of some of the lowest-effort, highest-impact ways of accelerating the training of deep learning models in PyTorch. For each of the methods, I'll briefly summarize the idea, try to estimate the expected speed-up, and discuss some limitations. I will focus on conveying the most important parts and point to further resources for each of them. I'll mostly stick to changes that can be made directly within PyTorch without introducing additional libraries, and I'll assume that you are training your model on GPU(s).
The suggestions – roughly sorted from largest to smallest expected speed-up – are:
- Consider using a different learning rate schedule.
- Use multiple workers and pinned memory in DataLoader.
- Max out the batch size.
- Use Automatic Mixed Precision (AMP).
- Consider using a different optimizer.
- Turn on cuDNN benchmarking.
- Beware of frequently transferring data between CPUs and GPUs.
- Use gradient/activation checkpointing.
- Use gradient accumulation.
- Use DistributedDataParallel for multi-GPU training.
- Set gradients to None rather than 0.
- Turn off debugging APIs if not needed.
- Use gradient clipping.
- Turn off bias before BatchNorm.
- Turn off gradient computation during validation.
- Use input and batch normalization.
1. Consider using another learning rate schedule
The learning rate (schedule) you choose has a large impact on the speed of convergence as well as the generalization performance of your model.
Cyclical Learning Rates and the 1Cycle learning rate schedule are both methods introduced by Leslie N. Smith (here and here), and then popularised by fast.ai's Jeremy Howard and Sylvain Gugger (here and here). Essentially, the 1Cycle learning rate schedule looks something like this:
[1cycle consists of] two steps of equal lengths, one going from a lower learning rate to a higher one, then going back to the minimum. The maximum should be the value picked with the Learning Rate Finder, and the lower one can be ten times lower. Then, the length of this cycle should be slightly less than the total number of epochs, and, in the last part of training, we should allow the learning rate to decrease more than the minimum, by several orders of magnitude.
In the best case this schedule achieves a massive speed-up – what Smith calls Superconvergence – as compared to conventional learning rate schedules. Using the 1Cycle policy, he needs roughly 10x fewer training iterations of a ResNet-56 on CIFAR-10 to match the performance of the original paper, for instance. The schedule seems to perform robustly well across common architectures and optimizers.
PyTorch implements both of these methods as torch.optim.lr_scheduler.CyclicLR and torch.optim.lr_scheduler.OneCycleLR, respectively; see the documentation.
One drawback of these schedulers is that they introduce a number of additional hyperparameters. This post and this repo offer a nice overview and implementation of how good hyperparameters can be found, including the Learning Rate Finder mentioned above.
Why does this work? It doesn't seem entirely clear, but one possible explanation is that regularly increasing the learning rate helps to traverse saddle points in the loss landscape more quickly.
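As a concrete sketch, here's how OneCycleLR is typically wired into a training loop. The model, data, and step counts are toy stand-ins, and max_lr is an assumed value you'd normally pick with an LR-finder run:

```python
import torch
from torch import nn

# Toy model and optimizer; the scheduling pattern is what matters here.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

epochs, steps_per_epoch = 3, 5
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,                  # peak LR, e.g. from an LR-finder run
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
)

for epoch in range(epochs):
    for step in range(steps_per_epoch):
        loss = model(torch.randn(4, 10)).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()         # OneCycleLR steps once per batch, not per epoch
```

Note that, unlike many classic schedulers, OneCycleLR is stepped after every batch rather than once per epoch.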
2. Use multiple workers and pinned memory in DataLoader
Szymon Migacz achieves a 2x speed-up for a single training epoch by using four workers and pinned memory.
A common rule of thumb for choosing the number of workers is to set it to four times the number of available GPUs; both a larger and a smaller number of workers tend to lead to a slowdown.
Note that increasing num_workers will increase your CPU memory consumption.
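Concretely, both settings are passed to the DataLoader constructor. This sketch uses a toy in-memory dataset and assumes a single GPU, so the worker count follows the 4-per-GPU rule of thumb:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset of 64 samples; any Dataset works the same way.
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))

loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=4,     # ~4 x number of GPUs, per the rule of thumb above
    pin_memory=True,   # speeds up subsequent host-to-GPU copies of each batch
)
```

With pin_memory=True, the batches land in page-locked host memory, which makes the later .to('cuda') copies faster (and lets them be asynchronous with non_blocking=True).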
3. Max out the batch size
This is a somewhat contentious point. Generally, however, it seems like using the largest batch size your GPU memory permits will accelerate your training (see NVIDIA's Szymon Migacz, for instance). Note that you will also have to adjust other hyperparameters, such as the learning rate, if you modify the batch size. A rule of thumb here is to double the learning rate as you double the batch size.
OpenAI has a nice empirical paper on the number of convergence steps needed for different batch sizes. Daniel Huynh runs some experiments with different batch sizes (also using the 1Cycle policy discussed above) where he achieves a 4x speed-up by going from batch size 64 to 512.
One of the downsides of using large batch sizes, however, is that they might lead to solutions that generalize worse than those trained with smaller batches.
4. Use Automatic Mixed Precision (AMP)
The release of PyTorch 1.6 included a native implementation of Automatic Mixed Precision training. The main idea here is that certain operations can be run faster, and without a loss of accuracy, in half precision (FP16) rather than in the single precision (FP32) used elsewhere. AMP then automatically decides which operation should be executed in which format. This allows both for faster training and a smaller memory footprint.
In the best case, the usage of AMP would look something like this:
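A minimal sketch following the standard autocast/GradScaler pattern from the PyTorch AMP docs. The model and data are toy stand-ins, and both AMP components are simply disabled when no GPU is present, since autocast currently targets CUDA ops:

```python
import torch

use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

model = torch.nn.Linear(10, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for _ in range(3):
    x = torch.randn(4, 10, device=device)
    optimizer.zero_grad()
    # Run the forward pass under autocast so eligible ops use FP16.
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = model(x).sum()
    # Scale the loss to avoid FP16 gradient underflow, then unscale and step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

The GradScaler is what keeps small FP16 gradients from underflowing to zero: it multiplies the loss before backward and divides the gradients again before the optimizer step.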
Benchmarking a number of common language and vision models on NVIDIA V100 GPUs, Huang and colleagues find that using AMP over regular FP32 training yields roughly 2x – but up to 5.5x – training speed-ups.
Currently, only CUDA ops can be autocast in this way. See the documentation here for more details on this and other limitations.
5. Consider using another optimizer
AdamW is Adam with weight decay (rather than L2-regularization), which was popularized by fast.ai and is now available natively in PyTorch as torch.optim.AdamW. AdamW seems to consistently outperform Adam in terms of both the error achieved and the training time. See this excellent blog post on why using weight decay instead of L2-regularization makes a difference for Adam.
Both Adam and AdamW work well with the 1Cycle policy described above.
NVIDIA's APEX implements fused versions of a number of common optimizers such as Adam. Compared to the PyTorch implementation of Adam, this implementation avoids a number of passes to and from GPU memory, yielding speed-ups in the range of 5%.
6. Turn on cuDNN benchmarking
If your model architecture remains fixed and your input size stays constant, setting torch.backends.cudnn.benchmark = True might be beneficial (docs). This enables the cuDNN autotuner, which will benchmark a number of different ways of computing convolutions in cuDNN and then use the fastest method from then on.
For a rough reference on the type of speed-up you can expect from this, Szymon Migacz achieves a speed-up of 70% on a forward pass for a convolution and a 27% speed-up for a forward + backward pass of the same convolution.
One caveat here is that this autotuning might become very slow if you max out the batch size as mentioned above.
7. Beware of frequently transferring data between CPUs and GPUs
Beware of frequently transferring tensors from a GPU to a CPU using tensor.cpu() and vice versa using tensor.cuda(), as these are relatively expensive. The same applies for .item() and .numpy() – use .detach() instead.
If you are creating a new tensor, you can also directly assign it to your GPU using the keyword argument device=torch.device('cuda').
If you do need to transfer data, using .to(device, non_blocking=True) might be useful, as long as you don't have any synchronization points after the transfer.
If you really have to, you might want to give Santosh Gupta's SpeedTorch a try, although it doesn't seem entirely clear when this actually does/doesn't provide speed-ups.
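A small sketch of both suggestions – creating tensors directly on the target device, and using pinned memory with a non-blocking copy when a transfer is unavoidable:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create the tensor on the target device directly, rather than
# building it on the CPU and transferring it afterwards.
x = torch.randn(64, 64, device=device)

# If a transfer is unavoidable, pin the source memory so the copy
# can be asynchronous, and pass non_blocking=True.
y = torch.randn(64, 64)
if device.type == "cuda":
    y = y.pin_memory()
y = y.to(device, non_blocking=True)
```

The non-blocking copy only overlaps with computation when the source is in pinned memory, which is also why pin_memory=True in the DataLoader (point 2) pairs well with it.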
8. Use gradient/activation checkpointing
Quoting directly from the documentation:
Checkpointing works by trading compute for memory. Rather than storing all intermediate activations of the entire computation graph for computing backward, the checkpointed part does not save intermediate activations, and instead recomputes them in backward pass. It can be applied on any part of a model.
Specifically, in the forward pass, function will run in torch.no_grad() manner, i.e., not storing the intermediate activations. Instead, the forward pass saves the inputs tuple and the function parameter. In the backwards pass, the saved inputs and function is retrieved, and the forward pass is computed on function again, now tracking the intermediate activations, and then the gradients are calculated using these activation values.
So while this might slightly increase your run time for a given batch size, you'll significantly reduce your memory footprint. This in turn will allow you to further increase the batch size you're using, allowing for better GPU utilization.
While checkpointing is implemented natively as
torch.utils.checkpoint (docs), it does seem to take some thought and effort to implement properly. Priya Goyal has a good tutorial demonstrating some of the key aspects of checkpointing.
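For the common case of a sequential model, torch.utils.checkpoint.checkpoint_sequential is the easiest entry point. A toy sketch follows – a model this small won't show real memory savings, and the segment count of 2 is an arbitrary choice:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# A toy deep MLP; checkpointing only pays off for much larger models.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(32, 32), nn.ReLU()) for _ in range(8)]
)
x = torch.randn(16, 32, requires_grad=True)

# Split the model into 2 segments: only the segment boundaries keep
# their activations; everything in between is recomputed on backward.
out = checkpoint_sequential(model, 2, x)
out.sum().backward()
```

More segments mean lower peak memory but more recomputation, so the segment count is a memory/compute trade-off you can tune.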
9. Use gradient accumulation
Another approach to increasing the batch size is to accumulate gradients across multiple .backward() passes before calling optimizer.step().
Following a post by Hugging Face's Thomas Wolf, gradient accumulation can be implemented as follows:
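A sketch of the pattern with a toy model and random data; dividing the loss by accumulation_steps keeps the accumulated gradients on the same scale as a single large-batch backward pass:

```python
import torch
from torch import nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps

optimizer.zero_grad()
for i in range(16):
    inputs = torch.randn(8, 10)
    targets = torch.randint(0, 2, (8,))
    loss = loss_fn(model(inputs), targets)
    # Normalize so the accumulated gradient matches a big-batch backward pass.
    (loss / accumulation_steps).backward()
    # Only step (and reset gradients) every accumulation_steps batches.
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```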
This method was developed mainly to circumvent GPU memory limitations, and I'm not entirely clear on the trade-off of having the additional .backward() passes. This discussion on the fastai forum seems to suggest that it can in fact accelerate training, so it's probably worth a try.
10. Use Distributed Data Parallel for multi-GPU training
Methods to accelerate distributed training probably warrant their own post, but one simple improvement is to use torch.nn.parallel.DistributedDataParallel rather than torch.nn.DataParallel. By doing so, each GPU will be driven by a dedicated CPU process, avoiding the GIL issues of DataParallel.
In general, I can strongly recommend the documentation on distributed training.
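A minimal runnable sketch; the "gloo" backend and world_size=1 here are purely for illustration – a real multi-GPU run would launch one process per GPU (e.g. via torch.distributed.launch) with the "nccl" backend and per-rank devices:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# In a real setup the launcher sets these and spawns one process per GPU;
# world_size=1 keeps this sketch runnable as a single process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(10, 2)
# With GPUs: model.to(rank) and DDP(model, device_ids=[rank]).
ddp_model = DDP(model)

out = ddp_model(torch.randn(4, 10))
out.sum().backward()  # DDP all-reduces gradients across processes here

dist.destroy_process_group()
```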
11. Set gradients to None rather than 0
Use .zero_grad(set_to_none=True) rather than .zero_grad().
Doing so will let the memory allocator handle the gradients rather than actively setting them to 0. This will only yield a modest speed-up, as the documentation notes, so don't expect any miracles.
Watch out, doing this is not side-effect free! Check the docs for the details on this.
12. Use .as_tensor() rather than .tensor()
torch.tensor() always copies data. If you have a numpy array that you want to convert, use torch.as_tensor() or torch.from_numpy() to avoid copying the data.
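A quick sketch of the difference – the shared tensor sees later modifications of the numpy array, while the copy does not:

```python
import numpy as np
import torch

a = np.zeros(3)

t_copy = torch.tensor(a)       # always copies the data
t_shared = torch.as_tensor(a)  # reuses the numpy buffer where possible

a[0] = 1.0  # visible in t_shared, but not in t_copy
```

The shared-memory behavior cuts both ways: it avoids a copy, but mutating the array afterwards silently changes the tensor too.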
13. Turn on debugging tools only when actually needed
PyTorch offers a number of useful debugging tools like autograd.profiler, autograd.gradcheck, and autograd.detect_anomaly. Use them when you need them to better understand your training, but make sure to turn them off when you don't, as they will slow down your training.
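For instance, the autograd profiler can be scoped to exactly the region you're investigating, and turning it off is as simple as removing the context manager (toy model here):

```python
import torch
from torch.autograd import profiler

model = torch.nn.Linear(64, 64)
x = torch.randn(32, 64)

# Only ops inside the context manager are profiled; outside of it,
# there is no profiling overhead.
with profiler.profile() as prof:
    model(x)

# Summarize the recorded ops, slowest first.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```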
14. Use gradient clipping
Originally used to avoid exploding gradients in RNNs, there is both some empirical evidence as well as some theoretical support that clipping gradients (roughly speaking: gradient = min(gradient, threshold)) accelerates convergence.
Hugging Face's Transformer implementation is a really clean example of how to use gradient clipping as well as some of the other methods such as AMP mentioned in this post.
In PyTorch this can be done using torch.nn.utils.clip_grad_norm_ (docs).
It's not entirely clear to me which models benefit how much from gradient clipping, but it seems to be robustly useful for RNNs, Transformer-based architectures, and ResNets, across a range of different optimizers.
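A sketch of where the clipping call goes in a training loop – after .backward() and before .step(); the max_norm of 1.0 is just an assumed value, and the model is a toy stand-in:

```python
import torch
from torch import nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(4, 10)).sum()
optimizer.zero_grad()
loss.backward()
# Rescale all gradients so their combined norm is at most max_norm,
# then take the optimizer step on the clipped gradients.
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```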
15. Turn off bias before BatchNorm
This is a very simple one: turn off the bias of layers directly before BatchNormalization layers. For a 2-D convolutional layer, this can be done by setting the bias keyword to False: torch.nn.Conv2d(..., bias=False, ...). (Here's a reminder of why this makes sense.)
You will save some parameters; I would, however, expect the speed-up from this to be relatively small compared to some of the other methods mentioned here.
16. Turn off gradient computation during validation
This one is straightforward: use torch.no_grad() during validation.
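A minimal validation sketch with a toy model; model.eval() is included as well since the two usually go together – eval() switches layers like dropout and batch norm to inference behavior, while no_grad() skips building the autograd graph:

```python
import torch
from torch import nn

model = nn.Linear(10, 2)
val_x = torch.randn(8, 10)

model.eval()              # inference behavior for dropout / batch norm
with torch.no_grad():     # no graph is built, saving memory and time
    preds = model(val_x)
```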
17. Use input and batch normalization
You're probably already doing this, but you might want to double-check: Are you normalizing your input? Are you using batch normalization? And here's a reminder of why you probably should.
Bonus tip from the comments: Use JIT to fuse point-wise operations.
If you have adjacent point-wise operations, you can use PyTorch JIT to combine them into one FusionGroup, which can then be launched on a single kernel rather than on multiple kernels, as would be the case by default. You'll also save some memory reads and writes.
Szymon Migacz shows how you can use the @torch.jit.script decorator to fuse the operations in a GELU, for instance:
```python
@torch.jit.script
def fused_gelu(x):
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))
```
In this case, fusing the operations leads to a 5x speed-up for the execution of fused_gelu as compared to the unfused version.
See also this post for an example of how Torchscript can be used to accelerate an RNN.
Hat tip to u/Patient_Atmosphere45 on Reddit for the suggestion.
Sources and additional resources
Thomas Wolf at Hugging Face has a number of interesting articles on accelerating deep learning – with a particular focus on language models.
Thanks to Ben Hahn, Kevin Klein and Robin Vaaler for their feedback on a draft of this post!