A summary of Deep Residual Learning for Image Recognition by Kaiming He et al.
Nicholas M. Synovic
Kaiming He et al., arXiv, 2015.
For the summary of the paper, go to the Summary section of this article.
First Pass
Read the title, abstract, introduction, section and sub-section headings, and conclusion
Problem
What is the problem addressed in the paper?
SOTA deep learning (DL) computer vision (CV) models are becoming increasingly common. The trend with these models is to make them deeper to get better results. However, the deeper a model is, the harder it is to train due to vanishing gradients, and the longer it takes to train because there are more parameters. Additionally, as the network gets deeper, accuracy starts to degrade. This paper addresses the latter two issues with training very deep networks for image recognition.
Motivation
Why should we care about this paper?
Because it introduces a deep residual learning framework. This framework is simple to implement, yet it allows models to become far deeper than before through its use of skip connections. A skip connection simply adds the input of a residual building block (an identity mapping) to that block's output, so each block only has to learn a residual correction, as sketched below.
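To make this concrete, here is a minimal sketch of one residual building block in PyTorch. The class name ResidualBlock, the channel count, and the 3x3 convolution layout are my own illustrative choices, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A two-layer building block computing relu(F(x) + x)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x  # the shortcut (skip) connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)  # add the block's input to its output

# The block leaves the tensor shape unchanged, so blocks stack freely.
block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))  # -> torch.Size([1, 64, 56, 56])
```

Because the shortcut is an identity, it adds no parameters; if extra depth is not useful, a block can drive its residual function toward zero and fall back to the identity mapping.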
Category
What type of paper is this work?
This is a DL CV paper; more specifically, it is both a framework paper and a model paper.
Context
What other types of papers is the work related to?
This paper is related to works that aim to create better CV models.
Contributions
What are the author’s main contributions?
Their main contributions are the deep residual learning framework and the SOTA ResNet suite of models, as well as an understanding of how residual learning works via shortcut connections.
Second Pass
A proper read-through of the paper is required to answer these questions
Background Work
What has been done prior to this paper?
Prior work has explored skip connections and residual representations, and has produced many other DL CV models.
Figures, Diagrams, Illustrations, and Graphs
Are the axes properly labeled? Are results shown with error bars, so that conclusions are statistically significant?
All of the figures are properly labeled and are clear to read.
Clarity
Is the paper well written?
The paper is well written.
Relevant Work
Mark relevant work for review
The following relevant work can be found in the Citations section of this article.
- F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
- R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv:1505.00387, 2015.
- R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. arXiv:1507.06228, 2015.
- V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
Methodology
What methodology did the authors use to validate their contributions?
The authors compared their ResNet family of models against CV models that do not follow the residual learning framework, along with other SOTA models, on image classification on the ImageNet and CIFAR-10 datasets, as well as on object detection on the PASCAL VOC and MS COCO datasets.
Author Assumptions
What assumptions do the authors make? Are those assumptions justified?
The authors assumed that better models can be made by just adding more layers.
Correctness
Do the assumptions seem valid?
Not entirely. Better-architected models do rely on more layers. However, better models also need more data to perform well. Additionally, model compression tools or optimizers can affect a CV model's performance as well.
Future Directions
My own proposed future directions for the work
I’d like to see how far I can push the ResNet model family with respect to quantization and optimization, in order to run these extremely deep models on mobile or low-power hardware; a sketch of one starting point follows.
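As one hedged starting point, recent versions of torchvision ship pre-quantized ResNet variants. The snippet below is a minimal sketch assuming that API; the choice of ResNet-18 and of int8 post-training quantization are my assumptions, not anything from the paper:

```python
import torch
from torchvision.models.quantization import resnet18

# Load a ResNet-18 whose convolutions use int8 weights
# (roughly 4x smaller than float32), suitable for CPU inference.
model = resnet18(weights="DEFAULT", quantize=True)
model.eval()

# Dummy forward pass to confirm the quantized model runs.
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```

Measuring the accuracy drop and latency gain of such a model on a mobile-class CPU would be the natural first experiment.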
Open Questions
What open questions do I have about the work?
What system specification did they use to train the ResNet model family?
Author Feedback
What feedback would I give to the authors?
This was a very clear and concise paper on a model architecture.
Summary
A summary of the paper
The paper, Deep Residual Learning for Image Recognition by Kaiming He et al. [1], introduced the ResNet family of DL CV models. These models are built on a deep residual learning framework that makes use of shortcut connections (summing a block's input, carried by an identity mapping, with that block's output) and the ReLU activation function, while keeping the total number of parameters lower than former SOTA methods (e.g., VGG) despite dramatically greater depth. The authors found that their deep models achieve SOTA performance on both object detection (using the Faster R-CNN method) and image recognition compared to existing SOTA models. Furthermore, they found that the deeper the network, the lower its image recognition error.
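In the paper's own notation, each building block computes

$$y = \mathcal{F}(x, \{W_i\}) + x,$$

where $x$ and $y$ are the input and output of the block and $\mathcal{F}$ is the residual mapping the stacked layers learn; the identity shortcut adds neither extra parameters nor computational complexity.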
Summarization Technique
This paper was summarized using a modified technique proposed by S. Keshav in his work How to Read a Paper [0].