[Paper reading] GoogLeNet

2023. 8. 18. 10:37 · ArtificialIntelligence/PaperReading

Inception Module

  • The fundamental way of solving both issues would be by ultimately moving from fully connected to sparsely connected architectures, even inside the convolutions.
  • adding an alternative parallel pooling path

 

  • As these “Inception modules” are stacked on top of each other, their output correlation statistics are bound to vary; as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease, suggesting that the ratio of 3×3 and 5×5 convolutions should increase as we move to higher layers.

 

  • One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity.
  • The ubiquitous use of dimension reduction (i.e., the 1×1 convolutions) allows for shielding the large number of input filters of the last stage to the next layer, first reducing their dimension before convolving over them with a large patch size.
  • Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.
  • The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties. Another way to utilize the Inception architecture is to create slightly inferior, but computationally cheaper versions of it. We have found that all the included knobs and levers allow for a controlled balancing of computational resources that can result in networks that are 2–3× faster than similarly performing networks with non-Inception architecture, however this requires careful manual design at this point. (A minimal sketch of one such module follows this list.)
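
A minimal PyTorch sketch of one Inception module with dimension reduction, to make the "parallel branches + 1×1 reduction + filter concatenation" idea concrete. This is my own illustration, not code from the paper; the channel counts passed in below are placeholders chosen for readability.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception module with 1x1 dimension reduction (illustrative sketch)."""
    def __init__(self, in_ch, ch1, ch3red, ch3, ch5red, ch5, pool_proj):
        super().__init__()
        # Branch 1: plain 1x1 convolution
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, ch1, 1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 reduction, then 3x3 convolution
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, ch3red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch3red, ch3, 3, padding=1), nn.ReLU(inplace=True))
        # Branch 3: 1x1 reduction, then 5x5 convolution
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, ch5red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch5red, ch5, 5, padding=2), nn.ReLU(inplace=True))
        # Branch 4: 3x3 max pooling, then 1x1 projection
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Every branch preserves the spatial size, so the outputs can be
        # concatenated along the channel axis ("filter concatenation").
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Usage with made-up channel counts: 64 + 128 + 32 + 32 = 256 output channels
x = torch.randn(1, 192, 28, 28)
m = InceptionModule(192, ch1=64, ch3red=96, ch3=128, ch5red=16, ch5=32, pool_proj=32)
print(m(x).shape)  # torch.Size([1, 256, 28, 28])
```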

 

 

 

1×1 convolution layer

  • Network-in-Network is an approach proposed by Lin et al. in order to increase the representational power of neural networks. When applied to convolutional layers, the method could be viewed as additional 1×1 convolutional layers followed typically by the rectified linear activation.
  • This enables it to be easily integrated in the current CNN pipelines. We use this approach heavily in our architecture. However, in our setting, 1×1 convolutions have a dual purpose:
  • they are used mainly as dimension reduction modules to remove computational bottlenecks, that would otherwise limit the size of our networks.
  • This allows for not just increasing the depth, but also the width of our networks without significant performance penalty.

 

  • 1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions.
  • Besides being used as reductions, they also include the use of rectified linear activation, which makes them dual-purpose. (A back-of-the-envelope cost comparison follows this list.)
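
To see why the 1×1 reductions matter, here is a rough multiply-accumulate count for a single 5×5 convolution on a 28×28 feature map, with and without a reduction layer in front. The channel numbers are assumptions for illustration, not values taken from the paper's network table.

```python
# Rough multiply-accumulate count for a 5x5 convolution on a 28x28 feature map,
# with and without a 1x1 reduction in front (channel counts are illustrative).
H = W = 28
in_ch, red_ch, out_ch = 192, 16, 32

direct = H * W * 5 * 5 * in_ch * out_ch          # 5x5 conv applied directly
reduced = (H * W * 1 * 1 * in_ch * red_ch        # 1x1 reduction first...
           + H * W * 5 * 5 * red_ch * out_ch)    # ...then 5x5 on far fewer channels

print(f"direct : {direct:,}")   # 120,422,400
print(f"reduced: {reduced:,}")  # 12,443,648  -> roughly 10x fewer operations
```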

 

 

 

Abstract & Conclusion

  • We propose a deep convolutional neural network architecture codenamed Inception, which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
  • The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing.
  • One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

 

  • Our results seem to yield solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision.
  • The main advantage of this method is a significant quality gain at a modest increase of computational requirements compared to shallower and less wide networks. Also note that our detection work was competitive despite neither utilizing context nor performing bounding box regression, and this fact provides further evidence of the strength of the Inception architecture.
  • Although it is expected that a similar quality of result can be achieved by much more expensive networks of similar depth and width, our approach yields solid evidence that moving to sparser architectures is a feasible and useful idea in general. This suggests promising future work towards creating sparser and more refined structures in automated ways.

 

 

 

Review

  • Strengths
    • The paper gives a rationale for increasing the number of convolutions.
    • The effect of the 1×1 convolutional layer is described in detail
      (the reason for moving from the naive version to the dimension-reduction module),
      and the paper follows a clear line of reasoning from the problem that arises to the proposed solution.
      • Why dimension reduction is used is explained logically.
    • By applying 1×1 filters inside the parallel structure, the computational cost is reduced, which in turn makes it possible to stack more depth.

 

 

  • Weaknesses
    • Inception module
      • It would help if the paper showed concretely how the outputs of the different-sized convolution filters are concatenated.
        Furthermore, it would be good to explain why abstracting the image information in parallel, through filters of multiple sizes inside the Inception module, is actually the better approach.

    • auxiliary classifiers
      • Given the relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern. One interesting insight is that the strong performance of relatively shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative.
        Why are such discriminative features produced in the middle of the network?
      • By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage discrimination in the lower stages in the classifier, increase the gradient signal that gets propagated back, and provide additional regularization.
      • These classifiers take the form of smaller convolutional networks put on top of the output of the Inception modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded. (A small sketch of this loss weighting follows this list.)
      • It would be better if the paper gave more concrete grounds for why the auxiliary classifiers were introduced and why they have to be used. Is it that exploiting sparsity improved model performance but did not solve the vanishing-gradient problem, so the auxiliary losses are added to compensate?
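
A minimal sketch of how the auxiliary losses could be combined during training, following the 0.3 discount weight quoted above; `out`, `aux1`, and `aux2` are placeholder names for the main and auxiliary classifier outputs, not identifiers from the paper.

```python
import torch.nn.functional as F

def total_loss(out, aux1, aux2, target, aux_weight=0.3):
    """Training loss: main loss plus the two auxiliary losses discounted by 0.3
    (the paper's weighting). At inference the auxiliary heads are discarded."""
    main = F.cross_entropy(out, target)
    aux = F.cross_entropy(aux1, target) + F.cross_entropy(aux2, target)
    return main + aux_weight * aux
```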

 

 

  • Improvements & Questions
    • Dropout is used in the network; what would happen if batch normalization were applied instead?
    • however the use of dropout remained essential even after removing the fully connected layers.
    • For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion.
      Would using Inception modules from the very beginning be better from a performance standpoint?

 

 

 

https://bskyvision.com/entry/CNN-알고리즘들-GoogLeNetinception-v1의-구조