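Neural networks have become the tool of choice in many areas of machine learning. However, they are complex to train and need large amounts of data, so the demand for optimizing and accelerating them keeps growing. Against this background, the quantized neural network (Quantized Neural Network, QNN) has emerged as an active research topic. This post gives a brief introduction to quantized neural networks.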
Image source: Qualcomm
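- Convolutional neural networks are not very sensitive to noise, and quantization is roughly equivalent to adding a considerable amount of noise to the original input, so they can tolerate it.
- Models are too large: VGG19, for example, has more than 500 MB of parameters, which puts pressure on storage.
- The range of the weights in each layer is essentially fixed and fluctuates little, which makes the weights well suited to quantized compression.
- In addition, quantization reduces both memory access and the amount of computation.
- Training relies on backpropagation and gradient descent, but int8 values are discrete. For example, the learning rate is usually a small fraction well below 1, so the resulting updates do not fit the int8 grid and the weights cannot be updated through backpropagation.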
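The points above can be illustrated with a small NumPy sketch. It is not taken from the referenced articles: the function names (`quantize_int8`, `dequantize`, `sgd_step_on_int8_grid`), the per-tensor affine scheme, and the random "weights" are all assumptions chosen for illustration. The sketch shows that int8 storage is 4x smaller than float32, that the rounding noise is bounded by about half a quantization step, and that a typical small gradient update rounds to zero when applied directly on the int8 grid.

```python
import numpy as np

def quantize_int8(x):
    """Affine (asymmetric) per-tensor quantization of a float array to int8."""
    x = np.asarray(x, dtype=np.float64)
    # Extend the range to include 0 so that zero maps exactly to an integer code.
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / 255.0  # 256 int8 levels span 255 steps
    if scale == 0.0:                 # all-zero tensor: any positive scale works
        scale = 1.0
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate real values."""
    return (q.astype(np.float64) - zero_point) * scale

def sgd_step_on_int8_grid(q, grad, lr, scale):
    """Apply one SGD step directly to int8 codes (hypothetical, to show the problem).

    The real-valued update lr * grad is expressed in grid units and rounded,
    so any update smaller than half a step disappears.
    """
    step = np.round(lr * np.asarray(grad, dtype=np.float64) / scale)
    return np.clip(q.astype(np.int32) - step, -128, 127).astype(np.int8)

# Stand-in for one layer's weights (assumption: small, roughly Gaussian values).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=10_000).astype(np.float32)

q, scale, zero_point = quantize_int8(w)
w_hat = dequantize(q, scale, zero_point)

max_err = float(np.abs(w - w_hat).max())  # quantization noise, at most ~scale / 2
compression = w.nbytes / q.nbytes         # 4 bytes per float32 vs 1 byte per int8

# A small constant gradient with a small learning rate: the update is below
# half a quantization step, so the int8 weights do not move at all.
grad = np.full(w.shape, 0.01)
lr = 0.01
q_after = sgd_step_on_int8_grid(q, grad, lr, scale)
unchanged = bool(np.array_equal(q, q_after))

print(f"scale={scale:.6f}, max error={max_err:.6f}, compression={compression:.1f}x")
print(f"int8 weights unchanged after the SGD step: {unchanged}")
```

This is why quantization-aware training schemes usually keep a floating-point copy of the weights for the update and only simulate quantization in the forward pass.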
A White Paper on Neural Network Quantization
- https://arxiv.org/abs/2106.08295
- Abstract: While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings but the additional noise it induces can lead to accuracy degradation. In this white paper, we introduce state-of-the-art algorithms for mitigating the impact of quantization noise on the network’s performance while maintaining low-bit weights and activations. We start with a hardware motivated introduction to quantization and then consider two main classes of algorithms: Post-Training Quantization (PTQ) and Quantization-Aware-Training (QAT). PTQ requires no re-training or labelled data and is thus a lightweight push-button approach to quantization. In most cases, PTQ is sufficient for achieving 8-bit quantization with close to floating-point accuracy. QAT requires fine-tuning and access to labeled training data but enables lower bit quantization with competitive results. For both solutions, we provide tested pipelines based on existing literature and extensive experimentation that lead to state-of-the-art performance for common deep learning models and tasks.
Quantization for Neural Networks
- https://leimao.github.io/article/Neural-Networks-Quantization/
- Abstract: Quantization refers to techniques for performing computations and storing tensors at lower bit-widths than floating point precision. A quantized model executes some or all of the operations on tensors with integers rather than floating point values. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms. This technique is in particular useful at the inference time since it saves a lot of inference computation cost without sacrificing too much inference accuracies. So far, major deep learning frameworks, such as TensorFlow and PyTorch, have supported quantization natively. The users have been using the built-in quantization modules successfully without knowing how it works exactly. In this article, I would like to elucidate the mathematics of quantization for neural networks so that the developers would have some ideas about the quantization mechanisms.