[Paper reading] Transformers for image recognition, ViT

2023. 8. 28. 18:45 | ArtificialIntelligence/PaperReading


Transformers for image recognition

Model overview. We split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. In order to perform classification, we use the standard approach of adding an extra learnable “classification token” to the sequence. 


Abstract

  • While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited.
  • In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.
  • When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.


Conclusion

  • We have explored the direct application of Transformers to image recognition. Unlike prior works using self-attention in computer vision, we do not introduce image-specific inductive biases into the architecture apart from the initial patch extraction step.
  • Instead, we interpret an image as a sequence of patches and process it by a standard Transformer encoder as used in NLP. This simple, yet scalable, strategy works surprisingly well when coupled with pre-training on large datasets. Thus, Vision Transformer matches or exceeds the state of the art on many image classification datasets, whilst being relatively cheap to pre-train.
  • While these initial results are encouraging, many challenges remain. One is to apply ViT to other computer vision tasks, such as detection and segmentation. Our results, coupled with those in Carion et al. (2020), indicate the promise of this approach. Another challenge is to continue exploring self-supervised pre-training methods. Our initial experiments show improvement from self-supervised pre-training, but there is still a large gap between self-supervised and large-scale supervised pre-training. Finally, further scaling of ViT would likely lead to improved performance.


Method

  • The standard Transformer receives as input a 1D sequence of token embeddings. To handle 2D images, we reshape the image x ∈ R^(H×W×C) into a sequence of flattened 2D patches x_p ∈ R^(N×(P²·C)), where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each image patch, and N = HW/P² is the resulting number of patches, which also serves as the effective input sequence length for the Transformer.
  • The Transformer uses constant latent vector size D through all of its layers, so we flatten the patches and map them to D dimensions with a trainable linear projection (Eq. 1). We refer to the output of this projection as the patch embeddings. (A rough sketch of this patchify-and-project step is given below.)
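A minimal sketch of the patch extraction and linear projection described above, in plain PyTorch. The image size, patch size P = 16, and latent dimension D = 768 are assumed ViT-Base-style values, and the reshape order is one possible choice, not the official implementation:

```python
# Sketch (assumed shapes, not the official code): cut an image batch into P x P patches,
# flatten each patch, and project it to D dimensions with a trainable linear layer.
import torch
import torch.nn as nn

B, C, H, W = 2, 3, 224, 224   # batch, channels, height, width (assumed)
P, D = 16, 768                # patch size and latent dimension (ViT-Base-style values)
N = (H // P) * (W // P)       # number of patches N = HW / P^2 = 196

x = torch.randn(B, C, H, W)

# (B, C, H, W) -> (B, C, H/P, W/P, P, P): non-overlapping P x P patches.
patches = x.unfold(2, P, P).unfold(3, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)          # (B, H/P, W/P, C, P, P)
patches = patches.reshape(B, N, C * P * P)           # (B, N, P^2 * C) flattened patches

# Trainable linear projection E: R^(P^2 * C) -> R^D, giving the patch embeddings.
proj = nn.Linear(C * P * P, D)
patch_embeddings = proj(patches)                     # (B, N, D)
print(patch_embeddings.shape)                        # torch.Size([2, 196, 768])
```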

 

  • Position embeddings are added to the patch embeddings to retain positional information. We use standard learnable 1D position embeddings, since we have not observed significant performance gains from using more advanced 2D-aware position embeddings. The resulting sequence of embedding vectors serves as input to the encoder. (In other words, the sequence of vectors obtained after adding the position embeddings is what goes into the encoder; a short sketch of this step follows below.)
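Continuing the sketch above: prepend a learnable [class] token, add learnable 1D position embeddings, and feed the sequence to a standard Transformer encoder. The use of torch.nn.TransformerEncoder and the hyperparameters are assumptions for illustration, not the paper's exact configuration (for example, the default activation differs from ViT's GELU):

```python
# Sketch: [class] token + learnable 1D position embeddings + standard Transformer encoder.
import torch
import torch.nn as nn

B, N, D, num_classes = 2, 196, 768, 1000
patch_embeddings = torch.randn(B, N, D)              # output of the linear projection

cls_token = nn.Parameter(torch.zeros(1, 1, D))       # learnable classification token
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))   # learnable 1D position embeddings

tokens = torch.cat([cls_token.expand(B, -1, -1), patch_embeddings], dim=1)  # (B, N+1, D)
tokens = tokens + pos_embed                          # retain positional information

# Standard encoder (pre-norm, as in ViT) and a linear classification head on the [class] token.
encoder_layer = nn.TransformerEncoderLayer(d_model=D, nhead=12, dim_feedforward=3072,
                                           batch_first=True, norm_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)
head = nn.Linear(D, num_classes)

out = encoder(tokens)                                # (B, N+1, D)
logits = head(out[:, 0])                             # classify from the [class] token
print(logits.shape)                                  # torch.Size([2, 1000])
```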

 

  • Inductive bias. We note that Vision Transformer has much less image-specific inductive bias than CNNs.
  • In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model.
  • In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is used very sparingly: in the beginning of the model by cutting the image into patches and at fine-tuning time for adjusting the position embeddings for images of different resolution (as described below; a sketch of this adjustment is given after this list).
  • Other than that, the position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch.
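As a rough sketch of the fine-tuning-time adjustment referenced in the list above, the pre-trained position embeddings of the patch tokens can be reshaped to their 2D grid and interpolated to the new grid size. The grid sizes (14×14 pre-training, 24×24 fine-tuning) and the choice of bicubic interpolation are assumptions for illustration:

```python
# Sketch: 2D interpolation of pre-trained position embeddings for a larger patch grid.
import torch
import torch.nn.functional as F

D = 768
old_grid, new_grid = 14, 24
pos_embed = torch.randn(1, old_grid * old_grid + 1, D)        # [class] token + patch positions

cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
patch_pos = patch_pos.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)  # (1, D, 14, 14)
patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                          mode="bicubic", align_corners=False)               # (1, D, 24, 24)
patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
new_pos_embed = torch.cat([cls_pos, patch_pos], dim=1)        # (1, 24*24 + 1, D)
print(new_pos_embed.shape)                                    # torch.Size([1, 577, 768])
```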


Review

Strengths

  • In adapting the Transformer architecture, which handles 1D sequences in NLP, to process 2D images, the paper lays out the reasoning behind each design choice clearly. This helped me understand the model's structure.
  • Reporting the standard deviation in the tables, which earlier Transformer papers did not do, makes the results feel more trustworthy.

 

  • I liked that the image classification tasks were divided into three groups for analysis.
  • We also evaluate on the 19-task VTAB classification suite (Zhai et al., 2019b). VTAB evaluates low-data transfer to diverse tasks, using 1000 training examples per task. The tasks are divided into three groups: Natural – tasks like the above, Pets, CIFAR, etc. Specialized – medical and satellite imagery, and Structured – tasks that require geometric understanding like localization.

 

[Figure 7. Left: principal components of the learned embedding filters. Center: position embedding similarity. Right: attention distance by head and network depth.]

  • I liked that the paper analyzes and organizes the internal representations of the ViT model from multiple perspectives.
  • To begin to understand how the Vision Transformer processes image data, we analyze its internal representations. The first layer of the Vision Transformer linearly projects the flattened patches into a lower-dimensional space (Eq. 1). Figure 7 (left) shows the top principal components of the learned embedding filters. The components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch.
  • After the projection, a learned position embedding is added to the patch representations. Figure 7 (center) shows that the model learns to encode distance within the image in the similarity of position embeddings, i.e. closer patches tend to have more similar position embeddings. Further, the row-column structure appears; patches in the same row/column have similar embeddings. Finally, a sinusoidal structure is sometimes apparent for larger grids (Appendix D). That the position embeddings learn to represent 2D image topology explains why hand-crafted 2D-aware embedding variants do not yield improvements (Appendix D.4).
  • Self-attention allows ViT to integrate information across the entire image even in the lowest layers. We investigate to what degree the network makes use of this capability. Specifically, we compute the average distance in image space across which information is integrated, based on the attention weights (Figure 7, right). This “attention distance” is analogous to receptive field size in CNNs.
  • We find that some heads attend to most of the image already in the lowest layers, showing that the ability to integrate information globally is indeed used by the model. Other attention heads have consistently small attention distances in the low layers. This highly localized attention is less pronounced in hybrid models that apply a ResNet before the Transformer (Figure 7, right), suggesting that it may serve a similar function as early convolutional layers in CNNs. Further, the attention distance increases with network depth. Globally, we find that the model attends to image regions that are semantically relevant for classification (Figure 6).
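A rough sketch (my own reading, not the paper's released code) of the "attention distance" quoted above: for one layer's attention weights over the patch tokens, weight the pairwise pixel distance between patch centers by the attention weights and average, giving one value per head. The grid size, patch size, and attention tensor are assumed inputs:

```python
# Sketch: mean attention distance per head, computed from attention weights over patch tokens.
import torch

def mean_attention_distance(attn, grid_size, patch_size=16):
    """attn: (num_heads, N, N) attention weights over N = grid_size**2 patch tokens."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"), dim=-1)
    coords = coords.reshape(-1, 2).float() * patch_size          # (N, 2) pixel coordinates
    dist = torch.cdist(coords, coords)                           # (N, N) pairwise distances
    # Expected distance per query token, then averaged over queries -> one value per head.
    return (attn * dist).sum(-1).mean(-1)                        # (num_heads,)

heads, N = 12, 14 * 14
attn = torch.softmax(torch.randn(heads, N, N), dim=-1)           # placeholder attention weights
print(mean_attention_distance(attn, grid_size=14))               # per-head mean attention distance
```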

Figure 6: the model attends to information that is relevant for classification.

 

 

Weaknesses

 

LayerNorm — PyTorch 2.0 documentation (pytorch.org)

https://yonghyuc.wordpress.com/2020/03/04/batch-norm-vs-layer-norm/

 

Batch Norm vs Layer Norm (yonghyuc.wordpress.com)
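Alongside the links above, a minimal sketch of the axis difference the linked posts discuss, using shapes typical of ViT activations (the shapes here are assumptions):

```python
# Sketch: BatchNorm normalizes each feature over the batch, while LayerNorm normalizes each
# token over its features, which is why Transformers/ViT use LayerNorm on (B, N, D) activations.
import torch
import torch.nn as nn

B, N, D = 8, 196, 768                      # batch, tokens, feature dim (assumed shapes)
x = torch.randn(B, N, D)

layer_norm = nn.LayerNorm(D)               # statistics over the last dim (per token)
y_ln = layer_norm(x)                       # (B, N, D)

batch_norm = nn.BatchNorm1d(D)             # statistics over batch (and tokens) per feature
y_bn = batch_norm(x.transpose(1, 2)).transpose(1, 2)   # BatchNorm1d expects (B, C, L)

print(y_ln.shape, y_bn.shape)              # both torch.Size([8, 196, 768])
```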

 

  • Hybrid Architecture: I wish this part had presented more concretely how the hybrid of a CNN and a Transformer is actually constructed (why it was implemented this way, what advantages were expected, and so on).
  • As an alternative to raw image patches, the input sequence can be formed from feature maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding projection (Eq. 1) is applied to patches extracted from a CNN feature map. As a special case, the patches can have spatial size 1x1, which means that the input sequence is obtained by simply flattening the spatial dimensions of the feature map and projecting to the Transformer dimension. The classification input embedding and position embeddings are added as described above.
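A minimal sketch of the 1×1-patch special case described in the quote, using an assumed toy CNN backbone (the real hybrid uses a ResNet; the layers and shapes here are placeholders for illustration):

```python
# Sketch: form the Transformer input sequence from a CNN feature map (1x1 "patches"):
# flatten the spatial dimensions and project each position to the Transformer dimension D.
import torch
import torch.nn as nn

D = 768
backbone = nn.Sequential(                      # toy CNN stand-in for a ResNet stage
    nn.Conv2d(3, 256, kernel_size=7, stride=4, padding=3),
    nn.ReLU(),
    nn.Conv2d(256, 1024, kernel_size=3, stride=4, padding=1),
)

x = torch.randn(2, 3, 224, 224)
feat = backbone(x)                             # (B, 1024, 14, 14) CNN feature map
B, C, Hf, Wf = feat.shape

tokens = feat.flatten(2).transpose(1, 2)       # (B, Hf*Wf, C): one token per spatial position
proj = nn.Linear(C, D)                         # patch embedding projection (Eq. 1)
patch_embeddings = proj(tokens)                # (B, 196, D); then add [class] token + pos. emb.
print(patch_embeddings.shape)
```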

 

 

Points to improve & Questions

  • I am not sure exactly what the concepts of locality and globality mean in the discussion of inductive bias.
  • Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data. However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias. Our Vision Transformer (ViT) attains excellent results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints.
  • In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global. 
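As a small self-contained illustration related to this question (my own example, not from the paper): a convolution is translation-equivariant, meaning that shifting the input and then convolving gives the same result as convolving and then shifting, and it is local because each output only sees a small neighborhood. Self-attention has no such built-in constraint and mixes all patches globally. Circular padding and a circular shift are used so the equality holds exactly:

```python
# Sketch: translation equivariance of a convolution (shift-then-convolve == convolve-then-shift).
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, bias=False, padding_mode="circular")
x = torch.randn(1, 1, 32, 32)
shift = lambda t: torch.roll(t, shifts=2, dims=-1)   # shift 2 pixels along width

out1 = conv(shift(x))                                # shift the input, then convolve
out2 = shift(conv(x))                                # convolve, then shift the output
print(torch.allclose(out1, out2, atol=1e-5))         # True: the convolution commutes with shifts
```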


References & Code

https://github.com/google-research/vision_transformer

 

GitHub - google-research/vision_transformer (github.com)

 

https://medium.com/data-and-beyond/vision-transformers-vit-a-very-basic-introduction-6cd29a7e56f3

 

Vision Transformers [ViT]: A very basic introduction (medium.com)

 

Drawbacks of using Vision Transformers

1. ViTs are computationally expensive to train (any image-related task is expensive because of the large number of pixels per image). This is because they have a large number of parameters.

2. ViTs are not as efficient as CNNs at processing images. This is because they need to attend to every part of the image, even if it is not important for the task at hand.

3. ViTs are not as interpretable as CNNs. This means that it is difficult to understand how they make predictions.


https://hongl.tistory.com/235

 

Vision Transformer (4) - PyTorch implementation (hongl.tistory.com)

https://kimbg.tistory.com/31

 

[ML] ViT (20.10); Vision Transformer code implementation and explanation with PyTorch (kimbg.tistory.com)