Vision Transformer (ViT) Summary: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Related paper: [Transformer] Attention Is All You Need (arxiv.org)

Since the Transformer was introduced, it has already seen wide use in natural language processing, and only recently have Transformer-based models begun to reach SOTA in computer vision ..

2021. 11. 12.