I enjoyed this paper a lot! It reviews recent (up to January 2020) advances in pruning, quantization, and structural efficiency, where the latter covers distillation, weight sharing, special matrix structures optimized for faster multiplication, and hand-designed or automatically designed efficient architectures.

Here are some interesting bits quoted directly from the conclusion:

  • For quantization approaches, a common pattern in the most successful approaches is to combine real-valued representations, that help in maintaining the expressiveness of DNNs, with quantization to enhance the computationally intensive operations.
  • For pruning methods, we observed that the trend is moving towards structured pruning approaches that obtain smaller models whose data structures are compatible with highly optimized dense tensor operations.
  • On the structural level of DNNs, a lot of progress has been made in the development of specific building blocks that maintain a high expressiveness of the DNN while at the same time reducing the computational overhead substantially. The newly emerging neural architecture search (NAS) approaches are promising candidates to automate the design of application-specific architectures with negligible user interaction. However, it appears unlikely that current NAS approaches will discover new fundamental design principles as the resulting architectures highly depend on a-priori knowledge encoded in the architecture search space.
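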

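To make the first point a bit more concrete, here is a minimal numpy sketch of the idea behind many such quantization schemes: keep real-valued "master" weights for expressiveness, but run the expensive matrix multiplication on low-bit integer copies. The per-tensor affine scheme and all names below are my own illustration, not the paper's notation.

```python
import numpy as np

def quantize_uint8(x):
    """Map a float tensor to uint8 plus (scale, zero_point) so that
    x ~= scale * (q - zero_point). Simple per-tensor affine scheme."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = np.round(-lo / scale)
    q = np.clip(np.round(x / scale + zero_point), 0, 255).astype(np.uint8)
    return q, scale, zero_point

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128)).astype(np.float32)  # real-valued master weights
x = rng.normal(size=(128,)).astype(np.float32)     # real-valued activations

qW, sW, zW = quantize_uint8(W)
qx, sx, zx = quantize_uint8(x)

# The computationally intensive part runs on integers ...
acc = (qW.astype(np.int32) - int(zW)) @ (qx.astype(np.int32) - int(zx))
# ... and is then rescaled back to real values.
y_quant = sW * sx * acc

print(np.max(np.abs(y_quant - W @ x)))  # small quantization error
```

The structured-pruning point can be illustrated in the same toy fashion: dropping whole output channels yields a genuinely smaller dense weight matrix that still runs on standard dense kernels, unlike unstructured pruning, which only zeroes individual entries. Again a hedged sketch; the norm-based criterion is just one common heuristic, not the survey's specific method.

```python
import numpy as np

def prune_output_channels(W, b, keep_ratio=0.5):
    """Structured pruning sketch: drop entire output channels (rows of W)
    with the smallest L2 norm, returning genuinely smaller dense arrays."""
    norms = np.linalg.norm(W, axis=1)
    n_keep = max(1, int(round(keep_ratio * W.shape[0])))
    keep = np.sort(np.argsort(norms)[-n_keep:])  # channels to keep, in order
    return W[keep], b[keep], keep

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128)).astype(np.float32)  # layer weights
b = np.zeros(256, dtype=np.float32)

W_small, b_small, kept = prune_output_channels(W, b, keep_ratio=0.25)
print(W.shape, "->", W_small.shape)  # (256, 128) -> (64, 128)
# The pruned layer is still a plain dense matmul; the next layer's input
# dimension would have to be sliced to `kept` as well.
```
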
Recommended reading!