Same Network Architecture, Different Inference Speed? Notes on a Strange Pitfall

2021-11-28

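Background

This is a problem I ran into a while ago when deploying a model, and I never got around to writing it up. The symptom: we had a network (essentially ResNet50) that was already deployed on MNN. On a PC with a single thread it took about 600 ms. Later the model improved in quality (the architecture was unchanged; there was only more training data), so we decided to upgrade. Strangely, the new model took more than 2 s to run, over 3 times as long. As I understood things at the time, with the same architecture and the same parameter counts in every convolution and FC layer, the amount of computation should be identical, so I could not see where such a gap came from. These are my notes on how I tracked it down and fixed it.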

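Locating the Problem

The model ran fine under MXNet, so at first I assumed the problem was in MNN and filed an issue there: https://github.com/alibaba/MNN/issues/786. I then used timeProfile under MNN/tools/ to time each layer (with 4 threads). Aggregated by layer type:
Original model: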

Sort by time cost !
Node Type      Avg(ms)       %           Called times   Flops Rate
Reshape        0.072720      0.040518    2.000000       0.000386
Pooling        0.365080      0.203415    4.000000       0.017411
Eltwise        0.559910      0.311970    24.000000      0.026874
PReLU          1.148671      0.640014    25.000000      0.056019
Scale          1.955981      1.089830    52.000000      0.077988
Convolution    175.379929    97.717857   54.000000      99.814148
total time : 179.475815 ms, total mflops : 6321.113770
main, 112, cost time: 17968.628906 ms

New model:

Sort by time cost !
Node Type      Avg(ms)       %           Called times   Flops Rate
Reshape        0.072500      0.016967    2.000000       0.000386
Pooling        0.370410      0.086687    4.000000       0.017411
Eltwise        0.571718      0.133798    24.000000      0.026874
PReLU          1.229384      0.287711    25.000000      0.056019
Scale          1.930405      0.451770    52.000000      0.077988
Convolution    423.116730    99.021439   54.000000      99.814148
total time : 427.298096 ms, total mflops : 6321.113770
main, 112, cost time: 42751.425781 ms

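The gap is clearly in the convolution layers: with the same FLOPs, the new model's convolutions take about 423 ms versus about 175 ms. The MNN team was slow to respond, so I tested another framework, opencv_dnn, and saw the same behavior. I therefore filed an issue with OpenCV as well: https://github.com/opencv/opencv/issues/17259. The OpenCV team replied quickly (kudos) and reproduced the problem (it did not reproduce with the OpenVINO backend). They were puzzled too. Around then MNN also replied and suggested enabling the /fp:fast compiler option, but that made no difference in my tests; possibly it did not take effect on Windows. The OpenCV team found that with -DENABLE_FAST_MATH=ON the two models ran at about the same speed. That option may affect accuracy, so it is not a final solution, but it does suggest that the problem is related to numerical computation. Then one expert wrote:

I zeroed all weights that were smaller than 1e-15 and both give the same efficiency. I suspect that the fusion process is leading to a lot of denormals by multiplying small numbers with small numbers. I have some doubts on my claim though because it's a bit unusual to have models filled with so many tiny weights to cause serious performance degradation.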

Denormals have leading zeros in the mantissa which is not-so-normal representation. Normally, you would have leading zeros counted in the exponent to make room for having as many significant digits as possible in the mantissa. When the number becomes so small that you cannot make the exponent any smaller without an overflow, you will use leading zeros in the mantissa to store the value. Most hardware are optimized for handling normal numbers efficiently and often have alternate slow paths for dealing with denormals. When you enable fast math, you are also enabling flush-to-zero (treat denormals as zero). With FTZ, the hardware deals with them efficiently by simply ignoring them.

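The CUDA backend didn't face this issue probably because the convolutions are largely implemented using fused multiply-add ops and the FMA pipeline can handle denormals. Only multi-instruction sequences like sqrt require special handling of denormals.

In other words, the slowdown comes from too many denormals (denormalized, or subnormal, floating-point numbers). Loosely speaking these are extremely small floats, and processing them is much slower than processing normal floats. In our case the network weights are mostly small fractions, and some of them may simply be too small, so that many tiny values (denormal numbers) gradually appear and slow the computation down. I counted the weights with magnitude below 1e-15 in the two models, and the slow model did have far more of them.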
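The counting step above can be sketched roughly as follows. This is only an illustration, not the script I actually used: `SmallWeightStats`, `countSmallWeights` and `fuseWeight` are hypothetical names, and the weights are assumed to be available as a plain `std::vector<float>`. Note that 1e-15 is far above the smallest normal float (about 1.18e-38), so a weight below 1e-15 is usually not a denormal by itself; the sketch therefore counts both criteria and shows how multiplying two small but normal values (as in the fusion step mentioned in the quote) can produce a denormal.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper type: how many values in a weight buffer are "suspicious".
struct SmallWeightStats {
    std::size_t subnormal = 0;       // strictly denormal in the IEEE 754 sense
    std::size_t belowThreshold = 0;  // nonzero values with |w| < threshold
};

// Count denormal weights and nonzero weights whose magnitude is below a threshold.
// The default threshold 1e-15 is the value used in the OpenCV discussion.
SmallWeightStats countSmallWeights(const std::vector<float>& weights,
                                   float threshold = 1e-15f) {
    SmallWeightStats stats;
    for (float w : weights) {
        if (std::fpclassify(w) == FP_SUBNORMAL) {
            ++stats.subnormal;
        }
        if (w != 0.0f && std::fabs(w) < threshold) {
            ++stats.belowThreshold;
        }
    }
    return stats;
}

// Illustrative stand-in for one fusion step: a weight multiplied by a scale factor.
// The result is stored in a float so that it is rounded to single precision.
float fuseWeight(float weight, float scale) {
    float fused = weight * scale;
    return fused;
}
```

For example, `fuseWeight(1e-20f, 1e-20f)` is about 1e-40, which falls in the denormal range of `float` even though both inputs are normal numbers. Running `countSmallWeights` over the weights of both models should make the difference between them visible.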

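Solving the Problem

The root cause is the presence of excessively small values. In the OpenCV thread, they tested setting the too-small weights to signed zero; after that the two models ran at roughly the same speed, and accuracy was not affected.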

@@ -414,6 +414,10 @@ public:
                 cv::multiply(originWeights.row(i), weightsMultipliers[i], weightsMat.row(i));
                 biasvec[i] *= wi;
             }
+            Mat mask = (abs(weightsMat) <= 1e-15f) & (weightsMat > 0);
+            weightsMat.setTo(0, mask);  // Flush to zero (FTZ) denormal weights
+            mask = (abs(weightsMat) <= 1e-15f) & (weightsMat < 0);
+            weightsMat.setTo(-0, mask);  // Flush to zero (FTZ) denormal weights
         }


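Following the same idea, we can set every weight whose magnitude is below 1e-15 to zero. This can be done on the original model or during model conversion. I chose to modify MNN's converter, specifically tools/converter/source/optimizer/PostConverter.cpp, by adding the following at the end of optimizeNet: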

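    // Flush very small convolution weights to signed zero.
    // Needs <cmath> (std::abs) and <iostream> (std::cout) in this file.
    auto& op_list = newNet->oplists;
    size_t cnt = 0;
    for (auto& op : op_list)
    {
        if (op->type == MNN::OpType::OpType_Convolution ||
            op->type == MNN::OpType::OpType_ConvolutionDepthwise)
        {
            auto conv2D = op->main.AsConvolution2D();
            for (auto& w : conv2D->weight)
            {
                // Stricter alternative: std::fpclassify(w) == FP_SUBNORMAL
                if (std::abs(w) < 1e-15)
                {
                    // Note: this also counts weights that are already exactly zero.
                    cnt += 1;
                    if (w > 0.0f)
                    {
                        w = 0.0f;   // positive tiny weight -> +0
                    }
                    else if (w < 0.0f)
                    {
                        w = -0.0f;  // negative tiny weight -> -0
                    }
                }
            }
        }
    }

    std::cout << "weights too small cnt " << cnt << std::endl;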

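Then rebuild the converter. Strictly speaking, this fix is not fully correct, because such tiny values can also be produced during inference. The proper approach is to change the inference code so that denormals are ignored before and after the convolution operator runs. That is more work, and since the change above already solved my problem, I did not implement it. If you are interested, see how OpenCV handled it: https://github.com/opencv/opencv/pull/17295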
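One possible way to ignore denormals at inference time, matching the flush-to-zero behavior described in the quote above, is to set the FTZ and DAZ bits of the x86 MXCSR register around the heavy computation. The sketch below is an assumption on my part, not what MNN or the OpenCV pull request does: `ScopedFlushDenormals` and `FLUSH_GUARD_HAS_MXCSR` are made-up names, the guard only has an effect on x86 builds with SSE, it assumes a CPU that supports the DAZ bit, and it only affects the calling thread.

```cpp
// Detect x86 targets where the MXCSR control register is available.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // _mm_getcsr, _mm_setcsr
#define FLUSH_GUARD_HAS_MXCSR 1
#else
#define FLUSH_GUARD_HAS_MXCSR 0
#endif

// Hypothetical RAII guard: enables flush-to-zero (FTZ) and denormals-are-zero (DAZ)
// for the current thread while it is alive, then restores the previous MXCSR value.
// On targets without MXCSR it does nothing.
class ScopedFlushDenormals {
public:
    ScopedFlushDenormals() {
#if FLUSH_GUARD_HAS_MXCSR
        saved_ = _mm_getcsr();
        // MXCSR bit 15 = FTZ (denormal results become 0),
        // MXCSR bit 6  = DAZ (denormal inputs are read as 0).
        _mm_setcsr(saved_ | 0x8000u | 0x0040u);
#endif
    }

    ~ScopedFlushDenormals() {
#if FLUSH_GUARD_HAS_MXCSR
        _mm_setcsr(saved_);
#endif
    }

    ScopedFlushDenormals(const ScopedFlushDenormals&) = delete;
    ScopedFlushDenormals& operator=(const ScopedFlushDenormals&) = delete;

private:
    unsigned int saved_ = 0;
};
```

In use, one would create a `ScopedFlushDenormals` object at the start of each worker thread's convolution call, so that the mode is restored when the call returns. Like fast math, this changes numerical results slightly (denormals become zero), so it is worth checking model accuracy after enabling it.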

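Summary

For the same model architecture, a model that contains too many extremely small weights can suffer a serious inference slowdown (CUDA and OpenVINO were not affected). You can set such weights to 0 during training or handle them during model conversion; accuracy is not affected.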