Poster
PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks
Marina Neseem · Conor McCullough · Randy Hsin · Chas Leichner · Shan Li · In Suk Chong · Andrew Howard · Lukasz Lew · Sherief Reda · Ville-Mikko Rautio · Daniele Moro
Arch 4A-E Poster #133
Abstract:
Low-precision quantization is recognized for its efficacy in neural network optimization. Our analysis reveals that non-quantized elementwise operations, which are prevalent in layers such as parameterized activation functions, batch normalization, and quantization scaling, dominate the inference cost of low-precision models. These non-quantized elementwise operations are commonly overlooked in SOTA efficiency metrics such as Arithmetic Computation Effort (*ACE*). In this paper, we propose $ACE_{v2}$, an extended version of *ACE* that aligns better with the inference cost of quantized models and their energy consumption on ML hardware. Moreover, we introduce *PikeLPN*, a model that addresses these efficiency issues by applying quantization to both elementwise and multiply-accumulate operations. In particular, we present a novel quantization technique for batch normalization layers named *QuantNorm*, which quantizes the batch normalization parameters without compromising model performance. Additionally, we propose *Double Quantization*, where the quantization scaling parameters are themselves quantized. Furthermore, we identify and resolve the distribution mismatch in Separable Convolution layers by introducing *Distribution-Heterogeneous Quantization*, which enables quantizing them to low precision. *PikeLPN* achieves Pareto optimality in the efficiency-accuracy trade-off, with up to *3*$\times$ efficiency improvement compared to SOTA low-precision models.
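The abstract names the techniques without giving implementation details. As a minimal illustration of the general idea behind *Double Quantization* (quantizing the floating-point scaling factors themselves so that rescaling no longer needs full-precision elementwise arithmetic), here is a NumPy sketch. The function names and the power-of-two scale format are illustrative assumptions, not the paper's actual scheme.

```python
import numpy as np

def quantize_symmetric(x, num_bits=4):
    """Symmetric uniform quantization of a tensor to signed integers."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax                      # floating-point scale
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def double_quantize_scale(scale):
    """Re-quantize the floating-point scale to a low-precision value.

    Here the scale is constrained to a power of two so that rescaling
    becomes a cheap shift instead of a floating-point multiply; this
    choice is an assumption for illustration, not necessarily the
    scheme used in PikeLPN.
    """
    return 2.0 ** np.round(np.log2(scale))

# Toy usage: quantize a weight vector, then quantize its scale.
w = np.random.randn(64).astype(np.float32)
q_w, fp_scale = quantize_symmetric(w, num_bits=4)
lp_scale = double_quantize_scale(fp_scale)
w_hat = q_w.astype(np.float32) * lp_scale                 # dequantized approximation
print("fp scale:", fp_scale, "quantized scale:", lp_scale)
```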