Vision transformer

The architecture of a vision transformer. An input image is divided into patches, each of which is linearly mapped through a patch embedding layer before entering a standard Transformer encoder.

A vision transformer (ViT) is a transformer designed for computer vision.[1] A ViT decomposes an input image into a series of patches (rather than text into tokens), serializes each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.
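The patch-embedding step described above can be illustrated with a minimal, dependency-free sketch. It is not taken from any reference implementation: the function names `patchify` and `embed`, the toy sizes (8×8 RGB image, 4×4 patches, 6-dimensional embeddings), and the random weight matrix are all assumptions chosen for illustration, and the sketch covers only the split, flatten, and project steps, not the encoder itself.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patches.

    Returns (H/patch) * (W/patch) vectors, each of length patch*patch*C,
    ordered row by row across the image.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Serialize one patch: rows, then pixels, then channels.
            vec = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(vec)
    return patches


def embed(patches, weight):
    """Project every flattened patch with a single matrix multiplication.

    weight has shape (patch_dim, embed_dim), stored as a list of rows.
    """
    embed_dim = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(vec)) for j in range(embed_dim)]
        for vec in patches
    ]


# Toy sizes (hypothetical): 8x8 RGB image, 4x4 patches, 6-dim embeddings.
rng = random.Random(0)
image = [[[rng.random() for _ in range(3)] for _ in range(8)] for _ in range(8)]
patches = patchify(image, patch=4)

# Stand-in for a learned projection; a real model would train these weights.
weight = [[rng.gauss(0.0, 0.02) for _ in range(6)] for _ in range(4 * 4 * 3)]
tokens = embed(patches, weight)

print(len(patches), len(patches[0]), len(tokens), len(tokens[0]))  # 4 48 4 6
```

Each of the four patches is flattened to a 48-element vector (4 × 4 pixels × 3 channels) and projected down to 6 dimensions, giving a short sequence of vectors that an encoder could consume in the same way as token embeddings.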

ViTs were designed as alternatives to convolutional neural networks (CNNs) in computer vision applications. The two families differ in inductive biases, training stability, and data efficiency.[2] Compared to CNNs, ViTs are less data efficient but have higher capacity. Some of the largest modern computer vision models are ViTs, such as ViT-22B, which has 22 billion parameters.[3][4]

After the original ViT was published, many variants were proposed, including hybrid architectures that combine features of ViTs and CNNs. ViTs have found application in image recognition, image segmentation, weather prediction, and autonomous driving.[5][6]

  1. Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2021-06-03). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv:2010.11929 [cs.CV].
  2. Raghu, Maithra; Unterthiner, Thomas; Kornblith, Simon; Zhang, Chiyuan; Dosovitskiy, Alexey (2021). "Do Vision Transformers See Like Convolutional Neural Networks?". arXiv:2108.08810.
  3. Dehghani, Mostafa; Djolonga, Josip; Mustafa, Basil; Padlewski, Piotr; Heek, Jonathan; Gilmer, Justin; Steiner, Andreas; Caron, Mathilde; Geirhos, Robert (2023-02-10). "Scaling Vision Transformers to 22 Billion Parameters". arXiv:2302.05442.
  4. "Scaling vision transformers to 22 billion parameters". research.google. Retrieved 2024-08-07.
  5. Han, Kai; Wang, Yunhe; Chen, Hanting; Chen, Xinghao; Guo, Jianyuan; Liu, Zhenhua; Tang, Yehui; Xiao, An; Xu, Chunjing; Xu, Yixing; Yang, Zhaohui; Zhang, Yiman; Tao, Dacheng (2023-01-01). "A Survey on Vision Transformer". IEEE Transactions on Pattern Analysis and Machine Intelligence. 45 (1): 87–110. arXiv:2012.12556. doi:10.1109/TPAMI.2022.3152247. ISSN 0162-8828. PMID 35180075.
  6. Khan, Salman; Naseer, Muzammal; Hayat, Munawar; Zamir, Syed Waqas; Khan, Fahad Shahbaz; Shah, Mubarak (2022-09-13). "Transformers in Vision: A Survey". ACM Comput. Surv. 54 (10s): 200:1–200:41. arXiv:2101.01169. doi:10.1145/3505244. ISSN 0360-0300.
