The main message of Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers (June 2020) by Li et al. is this:
Larger (wider and deeper) transformer models converge in far fewer gradient steps than smaller transformers. Often, this reduction in training iterations outweighs the higher per-iteration cost of the larger model.
Larger models also tolerate heavier compression (a smaller ratio of compressed size to original size) than smaller models when quantized or pruned. As a result, after compression the larger models can remain competitive with the smaller ones in inference cost.
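To make the pruning side of this concrete, here is a minimal sketch of magnitude pruning (one of the compression techniques the paper evaluates): zero out the fraction of weights with the smallest absolute value. The function name and shapes are my own illustration, not from the paper.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a copy of `weights` with the smallest-magnitude
    `sparsity` fraction of entries set to zero."""
    k = int(weights.size * sparsity)  # number of weights to drop
    if k == 0:
        return weights.copy()
    # Threshold = k-th smallest absolute value across all weights.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

# Toy weight matrix standing in for one transformer layer.
rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512))
pruned = magnitude_prune(w, sparsity=0.9)
print(float(np.mean(pruned == 0.0)))  # ~0.9
```

The paper's point is that a large model pruned to high sparsity like this can end up both more accurate and no more expensive at inference than a small dense model trained from scratch.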
I've really enjoyed this paper – do check it out!