GoogLeNet, published in 2015, presented an effective and simple idea: in later layers, instead of applying a 3×3 or 5×5 convolution layer directly, first apply a 1×1 convolution layer and then apply the real convolution layer. A 1×1 convolution filter spans all D depth values at each pixel and computes a weighted sum of them, so a single 1×1 filter reduces the depth to one. If its weights were all fixed at 1/D, the 1×1 convolution would be exactly a mean pooling along the D direction; in practice the weights are learned.
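A minimal NumPy sketch of this per-pixel behavior, using made-up shapes (M, N, D are illustrative, not from the paper): one 1×1 filter is just a vector of D weights dotted against the depth values at every pixel, and with uniform weights 1/D it coincides with a mean pooling over depth.

```python
import numpy as np

# Illustrative shapes only: an M x N feature map of depth D.
M, N, D = 4, 4, 100
x = np.random.rand(M, N, D)

# One 1x1 convolution filter = D weights, one per input channel.
w = np.random.rand(D)

# Applying it is a weighted sum over the depth axis at each pixel,
# collapsing the depth to one: shape (M, N, D) -> (M, N).
y = np.tensordot(x, w, axes=([2], [0]))
print(y.shape)  # (4, 4)

# With uniform weights 1/D, the 1x1 convolution is exactly a mean
# pooling along the depth direction.
y_mean = np.tensordot(x, np.full(D, 1.0 / D), axes=([2], [0]))
assert np.allclose(y_mean, x.mean(axis=2))
```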
It can dramatically reduce the number of parameters needed for training. In later layers, after the previous convolution layer has applied many filters (assume D of them), the feature map flowing forward has shape M×N×D, and D can be 100 or more. With a regular convolution layer, each filter must have size 5×5×D. But if we first apply a single 1×1 convolution layer, the depth D drops to one and the data has shape M×N. The 1×1 filter itself needs D weights (one per input channel), and the 5×5 filter needs 25 more, so the number of parameters is D + 5×5. If D is 100, the parameter count for that layer is reduced from 2500 to 125.
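The counting argument above can be checked with a few lines of arithmetic (biases ignored; D = 100 and the 5×5 filter size are the toy values from the text):

```python
# Parameter count for the toy example in the text (biases ignored).
D, k = 100, 5

# One 5x5 filter applied directly to a depth-D feature map.
direct = k * k * D            # 5 * 5 * 100 = 2500

# Bottleneck: one 1x1 filter (D weights) collapses depth to one,
# then one 5x5 filter on the single-channel result.
bottleneck = D + k * k        # 100 + 25 = 125

print(direct, bottleneck)     # 2500 125
```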
According to the paper, this reduction does not hurt the network's performance.