Approximating vision transformers for edge: variational inference and mixed-precision for multi-modal data

Dewant Katare*, Sam Leroux, Marijn Janssen, Aaron Yi Ding

*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Vision transformer (ViT) models have shown higher accuracy, robustness, and large-volume data processing ability, setting new baselines and references for perception tasks. However, these advantages come at the cost of large memory footprints and high-performance processors and computing units, which makes model adaptation and deployment challenging in resource-constrained environments such as memory-restricted, battery-powered edge devices. This paper addresses these deployment challenges by proposing VI-ViT, a model approximation approach for edge deployment that uses variational inference with mixed precision to process multiple modalities, such as point clouds and images. Our experimental evaluation on the nuScenes and Waymo datasets shows reductions of up to 37% and 31% in model parameters and FLOPs while maintaining a mean average precision of 70.5, compared to 74.8 for the baseline model. This work presents a practical approach for approximating and optimizing Vision Transformers for edge AI applications by balancing model metrics such as parameter count, FLOPs, latency, energy consumption, and accuracy, and it can easily be adapted to other transformer models and datasets.
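The abstract only describes the approach at a high level, and the full paper is not reproduced here. The snippet below is therefore a minimal, illustrative sketch in PyTorch of the two ingredients named in the abstract: a linear layer with a Gaussian weight posterior (variational inference via the reparameterization trick) and a simple rule that picks a per-layer bit width from the learned posterior variance (mixed precision). The names `VariationalLinear`, `fake_quantize`, and `assign_bit_width`, as well as the thresholds and bit widths, are assumptions for illustration and not the authors' actual VI-ViT implementation.

```python
import torch
import torch.nn as nn

class VariationalLinear(nn.Module):
    """Linear layer with a Gaussian weight posterior (mean + log-variance).

    The learned variance can later guide per-layer bit-width selection.
    """
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight_mu = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.weight_logvar = nn.Parameter(torch.full((out_features, in_features), -8.0))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        if self.training:
            # Reparameterization trick: sample weights from the posterior.
            std = torch.exp(0.5 * self.weight_logvar)
            weight = self.weight_mu + std * torch.randn_like(std)
        else:
            # At inference time, use the posterior mean.
            weight = self.weight_mu
        return nn.functional.linear(x, weight, self.bias)


def fake_quantize(weight, num_bits):
    """Symmetric uniform fake-quantization of a weight tensor."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = weight.abs().max() / qmax
    return torch.round(weight / scale).clamp(-qmax, qmax) * scale


def assign_bit_width(layer, low_bits=4, high_bits=8, threshold=-6.0):
    """Hypothetical mixed-precision rule: layers with low posterior variance
    tolerate aggressive quantization; more uncertain layers keep more bits."""
    mean_logvar = layer.weight_logvar.mean().item()
    return low_bits if mean_logvar < threshold else high_bits


# Example usage on a single projection layer (dimensions are illustrative).
layer = VariationalLinear(384, 384)
bits = assign_bit_width(layer)
layer.weight_mu.data = fake_quantize(layer.weight_mu.data, bits)
```

In practice such a rule would be applied per layer across the transformer, with the image and point-cloud branches quantized according to their own variance statistics; the details above are only one plausible reading of the abstract.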

Original language: English
Article number: 71
Journal: Computing
Volume: 107
Issue number: 3
DOIs
Publication status: Published - 2025

Keywords

  • Edge AI
  • Mixed precision
  • Model approximation
  • Multimodality
  • Quantization
  • Variational parameters
  • Vision transformers
