A summary of Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers by Sixiao Zheng et al.
Nicholas M. Synovic
Sixiao Zheng et al. CVPR, 2021.
For the summary of the paper, go to the Summary section of this article.
First Pass
Read the title, abstract, introduction, section and sub-section headings, and conclusion
Problem
What is the problem addressed in the paper?
This paper develops a model architecture based on the Transformer for the purpose of semantic segmentation.
Motivation
Why should we care about this paper?
The reference model implemented in this paper achieved SOTA results on a variety of datasets and took first place on the ADE20K test server leaderboard on the day it was released.
Category
What type of paper is this work?
This paper is a semantic segmentation model architecture paper.
Context
What other types of papers is the work related to?
This paper is most closely related to works within the CV domain that use the Transformer architecture, as well as to papers that develop SOTA architectures for semantic segmentation.
Contributions
What are the author’s main contributions?
The authors' main contribution is a CV model architecture based on Transformers that achieves SOTA performance on several semantic segmentation datasets.
Second Pass
A proper read-through of the paper is required to answer these questions
Background Work
What has been done prior to this paper?
Prior work has developed semantic segmentation models that rely on FCN-style architectures, as well as CV models that use the Transformer architecture to perform image classification.
Figures, Diagrams, Illustrations, and Graphs
Are the axes properly labeled? Are results shown with error bars, so that conclusions are statistically significant?
The figures and tables are clear and easy to understand.
Clarity
Is the paper well written?
This is a well written paper.
Relevant Work
Mark relevant work for review
The following relevant work can be found in the Citations section of this article.
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, 2019.
Methodology
What methodology did the authors use to validate their contributions?
They evaluated their model against other SOTA models on the Cityscapes, ADE20K, and PASCAL Context datasets.
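For context on how these comparisons are scored: semantic segmentation benchmarks like these are typically reported as mean Intersection-over-Union (mIoU). Below is a minimal NumPy sketch of the metric (my own illustration, not code from the paper):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union across classes.

    pred, target: integer arrays of per-pixel class labels, same shape.
    Classes absent from both prediction and ground truth are skipped.
    """
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:  # class absent from both maps; skip it
            continue
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Toy example: two 2x2 label maps with 3 classes
pred = np.array([[0, 1], [1, 2]])
target = np.array([[0, 1], [2, 2]])
print(mean_iou(pred, target, num_classes=3))  # ~0.67
```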
Author Assumptions
What assumptions do the authors make? Are they justified?
The authors assume that the FCN architecture is not the best-performing architecture for semantic segmentation tasks.
Correctness
Do the assumptions seem valid?
No. While their model (SETR) does achieve SOTA performance on semantic segmentation tasks using the Transformer architecture, that alone does not mean the FCN architecture is dead. Rather, the paper shows that Transformers can also be used for this task.
Future Directions
My own proposed future directions for the work
Before working with this paper, I need to better understand what a Transformer is and how to use one.
Open Questions
What open questions do I have about the work?
The paper mentions that Transformers have quadratic complexity in the sequence length. To me, this suggests they are unusable on low-powered devices. Can these types of models be compressed or quantized?
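To make the quadratic-complexity concern concrete, here is a minimal single-head self-attention sketch in NumPy (my own illustration, not the authors' code). The score matrix is N x N for a sequence of N tokens, so compute and memory grow quadratically with sequence length:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence of N tokens.

    x: (N, d) token embeddings; w_q, w_k, w_v: (d, d) projections.
    The scores matrix is (N, N) -- this is where the quadratic cost lives.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[1])          # (N, N): quadratic in N
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

# A 512x512 image cut into 16x16 patches yields N = (512/16)**2 = 1024
# tokens, so the attention matrix alone has 1024**2 (~1M) entries per
# head, per layer.
N, d = 1024, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((N, d))
w = [rng.standard_normal((d, d)) for _ in range(3)]
print(self_attention(x, *w).shape)  # (1024, 64)
```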
Author Feedback
What feedback would I give to the authors?
I liked this paper. I thought it was interesting and exciting to hear that there is a potential for a paradigm shift. If you are going to report that you were first on the leaderboard for some competition, it would also be good to know how long your model held onto that spot.
Summary
A summary of the paper
The paper Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers by Sixiao Zheng et al. presents the idea of using the Transformer architecture for semantic segmentation. To do this, the authors describe in detail how to transform an image into a format suitable for a Transformer (see the sketch after this paragraph). They evaluate their model (SETR) on three semantic segmentation datasets and achieve SOTA scores on all of them, although SETR does not achieve SOTA overall. SETR did, however, take first place on the ADE20K test server leaderboard on the day it was released.
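As a rough sketch of that image-to-sequence step (my own illustration; the random projection and position embeddings stand in for learned parameters): the image is cut into fixed-size patches, each patch is flattened and linearly projected to an embedding, and a position embedding is added so the Transformer sees an ordered token sequence.

```python
import numpy as np

def image_to_sequence(img, patch_size=16, embed_dim=256, seed=0):
    """Turn an (H, W, C) image into an (num_patches, embed_dim) sequence.

    Illustration only: the projection and position embeddings are random
    here; in a real model they are learned parameters.
    """
    h, w, c = img.shape
    assert h % patch_size == 0 and w % patch_size == 0
    rng = np.random.default_rng(seed)
    # Cut into non-overlapping patches and flatten each one
    patches = (img.reshape(h // patch_size, patch_size,
                           w // patch_size, patch_size, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, patch_size * patch_size * c))
    proj = rng.standard_normal((patches.shape[1], embed_dim))  # learned in practice
    pos = rng.standard_normal((patches.shape[0], embed_dim))   # learned in practice
    return patches @ proj + pos

tokens = image_to_sequence(np.zeros((256, 256, 3)))
print(tokens.shape)  # (256, 256): a 16x16 grid of patches, 256-dim embeddings
```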
This paper challenges the use of FCNs as the primary architecture for semantic segmentation.
Summarization Technique
This paper was summarized using a modified technique proposed by S. Keshav in his work How to Read a Paper [0].