TY - JOUR
T1 - Approximating vision transformers for edge
T2 - variational inference and mixed-precision for multi-modal data
AU - Katare, Dewant
AU - Leroux, Sam
AU - Janssen, Marijn
AU - Ding, Aaron Yi
PY - 2025
Y1 - 2025
N2 - Vision transformer (ViT) models have shown high accuracy, robustness, and the ability to process large volumes of data, setting new baselines and references for perception tasks. However, these advantages come at the cost of large memory footprints and high-performance processors and computing units, which makes adapting and deploying such models challenging in resource-constrained environments such as memory-limited, battery-powered edge devices. This paper addresses these deployment challenges by proposing VI-ViT, a model approximation approach for edge deployment that uses variational inference with mixed precision to process multiple modalities, such as point clouds and images. Our experimental evaluation on the nuScenes and Waymo datasets shows up to a 37% reduction in model parameters and a 31% reduction in FLOPs while maintaining a mean average precision of 70.5, compared to 74.8 for the baseline model. This work presents a practical approach for approximating and optimizing vision transformers for edge AI applications by balancing model metrics such as parameters, FLOPs, latency, energy consumption, and accuracy, and it can easily be adapted to other transformer models and datasets.
KW - Edge AI
KW - Mixed precision
KW - Model approximation
KW - Multimodality
KW - Quantization
KW - Variational parameters
KW - Vision transformers
UR - http://www.scopus.com/inward/record.url?scp=86000252906&partnerID=8YFLogxK
U2 - 10.1007/s00607-025-01427-w
DO - 10.1007/s00607-025-01427-w
M3 - Article
AN - SCOPUS:86000252906
SN - 0010-485X
VL - 107
JO - Computing
JF - Computing
IS - 3
M1 - 71
ER -