A few notes on Measuring the Algorithmic Efficiency of Neural Networks (May 2020) by Danny Hernandez and Tom Brown at OpenAI.
Main Points
The authors suggest that there are three main factors driving progress in Deep Learning: algorithmic improvements, data, and compute. The aim of this paper is to propose a way of measuring algorithmic progress and then to apply this metric to image classification over the last few years.
The authors propose measuring algorithmic efficiency (and progress) in terms of how much compute it takes to train a model to a fixed performance target on some benchmark, e.g. AlexNet-level performance on ImageNet.
Based on this, they show that the number of floating point operations required to train a classifier to AlexNet-level performance on ImageNet decreased by a factor of 44x between 2012 and 2019. This corresponds to algorithmic efficiency doubling every 16 months over a period of 7 years.
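As a quick back-of-the-envelope check, the implied doubling time can be recovered directly from the two figures above. This is just arithmetic on the numbers quoted from the paper, not their methodology:

```python
import math

# Sanity check: a 44x reduction in training FLOPs over the 7 years
# between 2012 and 2019 implies roughly a 16-month doubling time.
efficiency_gain = 44               # reduction factor in training compute
months = 7 * 12                    # 2012 -> 2019

doublings = math.log2(efficiency_gain)   # ~5.5 doublings
doubling_time = months / doublings       # ~15.4 months

print(f"{doublings:.1f} doublings, one every {doubling_time:.1f} months")
```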

The authors then trace back the main factors that drove this large increase in the efficiency of image classification models:
We attribute the 44x efficiency gains to sparsity, batch normalization, residual connections, architecture search, and appropriate scaling
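For concreteness, here is a minimal sketch (my own illustration, not code from the paper) of what two of those ingredients, residual connections and batch normalization, look like in a ResNet-style block:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic residual block: two 3x3 convolutions with batch norm,
    plus a skip connection that adds the input back to the output."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # the residual (skip) connection
```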
Comments
I think the paper proposes a sensible way of measuring algorithmic efficiency and is very transparent about its current limitations.
The authors are very careful to point out that they do not want to make any predictions about how these trends will develop in the future or in areas other than vision. One additional data point that comes to mind here is the enormous amount of recent research on making large attention-based language models more efficient: within ~2 years, the computational complexity of self-attention has been reduced from quadratic to linear in the length of the input sequence!
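To get a feel for what that quadratic-to-linear shift means, here is a rough sketch (my own back-of-the-envelope estimate, ignoring the Q/K/V projections and constant factors, and assuming a kernelized linear-attention scheme) of how the per-layer FLOP count scales:

```python
def full_attention_flops(n: int, d: int) -> int:
    # QK^T (n x d @ d x n) plus attention-weights @ V (n x n @ n x d)
    return 2 * n * n * d

def linear_attention_flops(n: int, d: int) -> int:
    # K^T V (d x n @ n x d) first, then Q @ (K^T V) (n x d @ d x d)
    return 2 * n * d * d

d = 64  # per-head dimension, a typical value assumed for illustration
for n in (512, 4096, 32768):
    ratio = full_attention_flops(n, d) / linear_attention_flops(n, d)
    print(f"seq_len={n:6d}: full/linear FLOP ratio ~ {ratio:.0f}x")
```

The ratio grows linearly with sequence length, which is exactly the point: the longer the context, the bigger the win.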
A limitation (which the authors also acknowledge) is that some of the models evaluated in the paper were not specifically optimized to reach a specific target on a benchmark, let alone to reach that target in the most efficient way. Looking at DAWNBench results, where training to a given target as efficiently as possible is the official goal, can therefore also be a useful way of measuring progress. The DAWNBench submissions are not constrained in terms of hardware, however, so algorithmic and hardware progress are entangled.
Another thing that stood out to me is that David Page made a very similar suggestion about measuring algorithmic progress in his series of blog posts back in 2018:
State-of-the-art accuracy is an ill-conditioned target in the sense that throwing a larger model, more hyperparameter tuning, more data augmentation or longer training time at the problem will typically lead to accuracy gains, making fair comparison between works a delicate task. There is a danger that innovations in training or architectural design introduce additional hyperparameter dimensions and that tuning these may lead to better implicit optimisation of aspects of training that are otherwise unrelated to the extension under study. Ablation studies, which are generally considered best practice, cannot resolve this issue if the base model has a lower dimensional space of explicit hyperparameters to optimise. A result of this situation is that state-of-the-art models can seem like a menagerie of isolated results which are hard to compare, fragile to reproduce and difficult to build upon.
Given these problems, we propose that anything that makes it easier to compare within and between experimental works is a good thing. We believe that producing competitive benchmarks for resource constrained training is a way to alleviate some of these difficulties.
Do check out the paper, the accompanying blog post, and the GitHub repo where the authors plan to track new results.