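Neural networks have become the tool of choice in many areas of machine learning. Training them, however, is complex and requires large amounts of data, which has driven growing demand for ways to optimize and accelerate them. Against this background, quantized neural networks (QNNs) have emerged and become an active research topic. This article gives a brief introduction to them.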
Quantized Neural Networks (QNN)
A Brief Introduction
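As deep learning plays a growing role in making our electronic devices intelligent, the need for small, low-latency, energy-efficient neural network solutions is becoming more pressing. Neural networks now run in a wide range of devices and services, from smartphones and smart glasses to drones, robots, and self-driving cars. These devices typically impose strict limits on the execution time and energy consumption of the networks they run.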
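Quantization is one of the most effective ways to reduce the execution time and energy consumption of a neural network. In neural network quantization, weights and activation tensors are stored at a lower bit precision than the 16 or 32 bits normally used. Going from 32 bits to 8 bits, for example, cuts the memory needed to store a tensor by a factor of 4 and the computational cost of matrix multiplication by a factor of 16, because that cost scales quadratically with bit width. Experiments show that neural networks are fairly robust to quantization: they can run at lower bit widths with a relatively small impact on accuracy.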
Image source: Qualcomm
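The factor-of-4 and factor-of-16 figures above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function names are made up for this example, and the compute estimate assumes that the cost of a multiply grows with the square of the bit width, which is a simplification of real hardware.

```python
def storage_bytes(num_elements: int, bits: int) -> int:
    """Bytes needed to store `num_elements` values at the given bit width."""
    return (num_elements * bits + 7) // 8


def relative_matmul_cost(bits: int, baseline_bits: int = 32) -> float:
    """Cost relative to the baseline, ASSUMING cost scales with bits squared."""
    return (bits / baseline_bits) ** 2


num_weights = 1_000_000  # hypothetical tensor size

fp32_bytes = storage_bytes(num_weights, 32)
int8_bytes = storage_bytes(num_weights, 8)

memory_saving = fp32_bytes / int8_bytes        # how many times smaller
compute_saving = 1 / relative_matmul_cost(8)   # how many times cheaper

print(f"memory:  {memory_saving:.0f}x smaller")
print(f"compute: {compute_saving:.0f}x cheaper (under the quadratic assumption)")
```

The tensor size does not affect the ratios; it is there only to make the byte counts concrete.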
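Quantization can also usually be combined with other common neural network optimization methods, such as neural architecture search, compression, and pruning. Applying them together improves efficiency further and makes the gains in execution time and energy consumption more pronounced.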
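Quantization is not without its challenges, however. Low-bit-width quantization introduces noise into the network, which can reduce accuracy. Some networks are robust to this noise, but others need extra work before they can take full advantage of quantization. This calls for new methods and techniques that keep networks both efficient and accurate.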
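Some Questions
The three questions below come from the article 模型量化综述及应用 (a survey of model quantization and its applications).
In computer systems, quantization means establishing a mapping between fixed-point and floating-point data so that a worthwhile benefit is gained at the cost of a small loss in precision. Put simply, it means representing FP32 and similar values with "low-bit" numbers.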
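One common way to build such a mapping is uniform affine quantization, where a float `x` is mapped to an integer `q = clamp(round(x / scale) + zero_point)` and recovered approximately as `scale * (q - zero_point)`. The sketch below is a minimal pure-Python illustration of that scheme with unsigned 8-bit integers. It is not taken from the cited article; the function names and the example weights are assumptions made for this example.

```python
def choose_qparams(x_min: float, x_max: float, num_bits: int = 8):
    """Pick scale and zero point so that [x_min, x_max] maps onto the integer grid."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:  # all values are zero; any positive scale works
        scale = 1.0
    zero_point = round(qmin - x_min / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits: int = 8):
    """Map floats to integers in [0, 2**num_bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate float values."""
    return [scale * (q - zero_point) for q in q_values]


weights = [-1.0, -0.25, 0.0, 0.4, 1.0]  # hypothetical FP32 weights
scale, zero_point = choose_qparams(min(weights), max(weights))
q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", q_weights)
print("max round-trip error:", max_error, "step size:", scale)
```

The round-trip error stays within about one quantization step, which is the "small loss in precision" the definition refers to.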
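Why does quantization work?
- Quantization is equivalent to adding a considerable amount of noise to the original input, and convolutional neural networks are not very sensitive to noise.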
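Why use quantization?
- Models are too large: VGG19, for example, has more than 500 MB of parameters, which puts pressure on storage;
- The range of the weights in each layer is largely fixed and varies little, which makes them well suited to quantized compression;
- Quantization also reduces both memory access and the amount of computation.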
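Why not train a low-precision model directly?
- Training relies on backpropagation and gradient descent, while int8 values are discrete. For example, learning rates are usually small fractional values (a few tenths or less), so the resulting weight updates do not fit the int8 grid and the weights cannot be updated through backpropagation.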
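The point about discrete values can be made concrete with a toy example. The sketch below compares plain gradient descent on a float weight with the same update applied to a weight that is snapped back to an integer grid after every step. All numbers (step size of the grid, learning rate, gradient) are hypothetical and chosen only so that each update is smaller than half a grid step.

```python
def sgd_step_on_grid(q_weight: int, grad: float, lr: float, scale: float) -> int:
    """Apply one SGD update, then snap the result back onto the integer grid."""
    w = q_weight * scale          # integer code -> real value
    w_new = w - lr * grad         # ordinary gradient-descent update
    return round(w_new / scale)   # rounding discards sub-step changes


scale = 0.1        # hypothetical spacing between adjacent grid values
lr = 0.01          # hypothetical learning rate
grad = 0.5         # hypothetical constant gradient
num_steps = 100

q_w = 50           # represents the real value 5.0 on the grid
w_float = 5.0      # the same weight kept in floating point

for _ in range(num_steps):
    q_w = sgd_step_on_grid(q_w, grad, lr, scale)
    w_float -= lr * grad

print("grid weight after training: ", q_w * scale)
print("float weight after training:", w_float)
```

Each update here is 0.005, well below half a grid step (0.05), so the grid weight never moves while the float weight moves steadily. This is one reason quantization-aware training, as described in the white paper listed below, fine-tunes the network rather than training on low-bit values alone.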
Some introductory material on QNNs
A White Paper on Neural Network Quantization
- https://arxiv.org/abs/2106.08295
- Abstract: While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings but the additional noise it induces can lead to accuracy degradation. In this white paper, we introduce state-of-the-art algorithms for mitigating the impact of quantization noise on the network’s performance while maintaining low-bit weights and activations. We start with a hardware motivated introduction to quantization and then consider two main classes of algorithms: Post-Training Quantization (PTQ) and Quantization-Aware-Training (QAT). PTQ requires no re-training or labelled data and is thus a lightweight push-button approach to quantization. In most cases, PTQ is sufficient for achieving 8-bit quantization with close to floating-point accuracy. QAT requires fine-tuning and access to labeled training data but enables lower bit quantization with competitive results. For both solutions, we provide tested pipelines based on existing literature and extensive experimentation that lead to state-of-the-art performance for common deep learning models and tasks.
Quantization for Neural Networks
- https://leimao.github.io/article/Neural-Networks-Quantization/
- Abstract: Quantization refers to techniques for performing computations and storing tensors at lower bit-widths than floating point precision. A quantized model executes some or all of the operations on tensors with integers rather than floating point values. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms. This technique is in particular useful at the inference time since it saves a lot of inference computation cost without sacrificing too much inference accuracies. So far, major deep learning frameworks, such as TensorFlow and PyTorch, have supported quantization natively. The users have been using the built-in quantization modules successfully without knowing how it works exactly. In this article, I would like to elucidate the mathematics of quantization for neural networks so that the developers would have some ideas about the quantization mechanisms.