[Paper reading] Swin Transformer

2023. 9. 4. 07:31 · ArtificialIntelligence/PaperReading


Swin Transformer
Hierarchical Vision Transformer using Shifted Windows


Abstract

  • This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.
  • Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.
  • To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. (A small illustrative sketch of the shift operation is given after this list.)
  • These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.
  • The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
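To make the shifted windowing scheme above more concrete, here is a minimal sketch (my own illustration with assumed window size and tensor shapes, not the authors' code): before the regular window partition of the next layer, the feature map is cyclically shifted by half a window, so the new windows bridge the boundaries of the previous partition, and the shift is undone after attention.

import torch

M = 7                                            # window size (illustrative)
x = torch.randn(1, 56, 56, 96)                   # (batch, H, W, C) feature map, sizes assumed
# cyclic shift toward the top-left by floor(M/2) before partitioning into windows
shifted = torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))
# ... window-based self-attention would run on `shifted` here ...
# reverse the shift afterwards to restore the original spatial layout
restored = torch.roll(shifted, shifts=(M // 2, M // 2), dims=(1, 2))
assert torch.equal(restored, x)

In the paper, this cyclic shift is combined with an attention mask so that the shifted configuration can be computed with the same number of windows as the regular partition.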


Introduction

  • On the other hand, the evolution of network architectures in natural language processing (NLP) has taken a different path, where the prevalent architecture today is instead the Transformer. Designed for sequence modeling and transduction tasks, the Transformer is notable for its use of attention to model long-range dependencies in the data. Its tremendous success in the language domain has led researchers to investigate its adaptation to computer vision, where it has recently demonstrated promising results on certain tasks, specifically image classification and joint vision-language modeling.
  • In this paper, we seek to expand the applicability of Transformer such that it can serve as a general-purpose backbone for computer vision, as it does for NLP and as CNNs do in vision. We observe that significant challenges in transferring its high performance in the language domain to the visual domain can be explained by differences between the two modalities. 
  • One of these differences involves scale. Unlike the word tokens that serve as the basic elements of processing in language Transformers, visual elements can vary substantially in scale, a problem that receives attention in tasks such as object detection. In existing Transformer-based models, tokens are all of a fixed scale, a property unsuitable for these vision applications.
  • Another difference is the much higher resolution of pixels in images compared to words in passages of text. There exist many vision tasks such as semantic segmentation that require dense prediction at the pixel level, and this would be intractable for Transformer on high-resolution images, as the computational complexity of its self-attention is quadratic to image size. 
  • To overcome these issues, we propose a general-purpose Transformer backbone, called Swin Transformer, which constructs hierarchical feature maps and has linear computational complexity with respect to image size.
  • As illustrated in Figure 1, Swin Transformer constructs a hierarchical representation by starting from small-sized patches (outlined in gray) and gradually merging neighboring patches in deeper Transformer layers. With these hierarchical feature maps, the Swin Transformer model can conveniently leverage advanced techniques for dense prediction such as feature pyramid networks (FPN) or U-Net.
  • The linear computational complexity is achieved by computing self-attention locally within non-overlapping windows that partition an image (outlined in red). The number of patches in each window is fixed, and thus the complexity becomes linear in image size (see the partition sketch after this list).
  • These merits make Swin Transformer suitable as a general-purpose backbone for various vision tasks, in contrast to previous Transformer based architectures which produce feature maps of a single resolution and have quadratic complexity.
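Below is a minimal sketch of the window partition mentioned above (my own illustration with assumed sizes, not the official implementation), showing why the cost is linear: the H×W feature map is split into non-overlapping M×M windows and self-attention is computed over a fixed M² tokens per window, so the total cost scales with the number of windows rather than with (H×W)².

import torch

def window_partition(x, M):
    # x: (B, H, W, C) feature map -> (num_windows * B, M*M, C) window tokens
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

M = 7                                     # window size (illustrative)
x = torch.randn(1, 56, 56, 96)            # (B, H, W, C), assumed stage-1 shape
windows = window_partition(x, M)
print(windows.shape)                      # torch.Size([64, 49, 96])
# Each of the 64 windows holds a fixed 49 tokens; doubling the image area doubles
# the number of windows (linear cost) instead of quadrupling a global attention matrix.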


Conclusion

  • This paper presents Swin Transformer, a new vision Transformer which produces a hierarchical feature representation and has linear computational complexity with respect to input image size.
  • Swin Transformer achieves the state-of-the-art performance on COCO object detection and ADE20K semantic segmentation, significantly surpassing previous best methods. We hope that Swin Transformer’s strong performance on various vision problems will encourage unified modeling of vision and language signals.
  • As a key element of Swin Transformer, the shifted window based self-attention is shown to be effective and efficient on vision problems, and we look forward to investigating its use in natural language processing as well.


Review

Strengths

  • The introduction logically lays out the key problems that arise when moving from the CNN-based backbones traditionally used in vision to the Transformer (self-attention) architectures used in NLP, and how the paper goes about solving them.
  • On the other hand, the evolution of network architectures in natural language processing (NLP) has taken a different path, where the prevalent architecture today is instead the Transformer. Designed for sequence modeling and transduction tasks, the Transformer is notable for its use of attention to model long-range dependencies in the data. Its tremendous success in the language domain has led researchers to investigate its adaptation to computer vision, where it has recently demonstrated promising results on certain tasks, specifically image classification and joint vision-language modeling.
  • In this paper, we seek to expand the applicability of Transformer such that it can serve as a general-purpose backbone for computer vision, as it does for NLP and as CNNs do in vision. We observe that significant challenges in transferring its high performance in the language domain to the visual domain can be explained by differences between the two modalities.
  • These merits make Swin Transformer suitable as a general-purpose backbone for various vision tasks, in contrast to previous Transformer based architectures which produce feature maps of a single resolution and have quadratic complexity.
    • It was also impressive that, rather than simply claiming "general-purpose" in passing, the authors back it up with experiments on multiple datasets, presenting tabulated performance comparisons for image classification, object detection, and segmentation.

 

Big-O complexity

  • In the method section, the computational complexity is laid out as big-O formulas, which makes the difference between the global approach and the local (window-based) approach directly visible (the formulas are reproduced below for reference).
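For reference, the two complexity formulas from the paper (Eq. (1) and (2)), for a feature map of h × w patch tokens with channel dimension C and window size M:

    Ω(MSA)   = 4hwC² + 2(hw)²C
    Ω(W-MSA) = 4hwC² + 2M²hwC

The first term is the cost of the linear projections in both cases; the second term is the attention itself, which is quadratic in hw for global MSA but linear in hw for window-based attention when M is fixed.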

 

Output feature

  • It was also helpful that the paper gives the output-feature equations for the l-th block after it passes through the MSA (multi-head self-attention) and MLP blocks (reproduced below). At first, the way self-attention and shifted windows are applied felt vague and hard to follow, but the step-by-step equations made the process much easier to understand.
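The equations referred to here (Eq. (3) in the paper), where ẑ^l and z^l denote the output features of the (S)W-MSA module and the MLP module of block l, respectively:

    ẑ^l     = W-MSA(LN(z^(l−1))) + z^(l−1)
    z^l     = MLP(LN(ẑ^l)) + ẑ^l
    ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l
    z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)

W-MSA and SW-MSA denote window-based multi-head self-attention using regular and shifted window partitioning, respectively; consecutive blocks alternate between the two.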

 

 

Weaknesses

  • Relative position bias 
  • In computing self-attention, we follow prior work by including a relative position bias B ∈ R^(M²×M²) to each head in computing similarity:

    Attention(Q, K, V) = SoftMax(QKᵀ/√d + B) V

  • Here Q, K, V ∈ R^(M²×d) are the query, key and value matrices, d is the query/key dimension, and M² is the number of patches in a window. Since the relative position along each axis lies in the range [−M+1, M−1], we parameterize a smaller-sized bias matrix B̂ ∈ R^((2M−1)×(2M−1)), and values in B are taken from B̂.
  • We observe significant improvements over counterparts without this bias term or that use absolute position embedding, as shown in Table 4. Further adding absolute position embedding to the input, as in ViT, drops performance slightly; thus it is not adopted in our implementation.
  • The learnt relative position bias in pre-training can also be used to initialize a model for fine-tuning with a different window size through bi-cubic interpolation.
    • I did not fully understand what the relative position bias is or why it was used, and I am not sure how it differs from the inductive bias discussed in the original ViT paper (a small indexing sketch is given after this item).
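As a minimal sketch of the parameterization above (my own illustration, loosely following the idea rather than the official repository; window size and head count are assumed), this shows how each pair of positions inside an M×M window maps to one of the (2M−1)² learned offsets, from which the (M², M²) bias B added to QKᵀ/√d is gathered. Roughly speaking, unlike ViT's absolute position embedding added once to the input tokens, this bias depends only on the relative offset between two tokens within a window, which is a translation-invariant (more CNN-like) inductive bias.

import torch

M = 7                                                  # window size (assumed)
num_heads = 3                                          # number of heads (assumed)

# learnable table B̂: one bias per relative offset (2M−1 choices per axis) per head
# (in practice this would be an nn.Parameter; zeros are used here just for shape)
bias_table = torch.zeros((2 * M - 1) * (2 * M - 1), num_heads)

# absolute coordinates of the M*M tokens inside one window
coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
coords = coords.flatten(1)                             # (2, M²)

# pairwise relative coordinates; each axis lies in [−M+1, M−1]
rel = coords[:, :, None] - coords[:, None, :]          # (2, M², M²)
rel = rel.permute(1, 2, 0)                             # (M², M², 2)
rel = rel + (M - 1)                                    # shift both axes to start from 0
index = rel[:, :, 0] * (2 * M - 1) + rel[:, :, 1]      # flatten to a single table index

# gather the bias B of shape (num_heads, M², M²) that is added to QKᵀ/√d in each head
B = bias_table[index.reshape(-1)].reshape(M * M, M * M, num_heads).permute(2, 0, 1)
print(B.shape)                                         # torch.Size([3, 49, 49])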

 

  • Compared with the state-of-the-art ConvNets, i.e. RegNet and EfficientNet, the Swin Transformer achieves a slightly better speed-accuracy trade-off. Noting that while RegNet and EfficientNet are obtained via a thorough architecture search, the proposed Swin Transformer is adapted from the standard Transformer and has strong potential for further improvement.
    • In the image classification (ImageNet-1K training) results, Swin Transformer does not seem to show a large accuracy gain over the CNN-based EfficientNet despite using more computation and more parameters. From this experiment alone, it is hard to tell what Swin Transformer's advantage over CNNs is.

 

 

Points to improve & Questions

 

Reference: [Paper review] Swin Transformer - Hierarchical Vision Transformer using Shifted Windows (heeya-stupidbutstudying.tistory.com)
