Yann LeCun’s classic paper Gradient-Based Learning Applied to Document Recognition is worth reading for every deep learning researcher. It runs to about 43 pages (not counting the references), which is something of a challenge for your patience and focus. In return for that price, you get detailed explanations and intuitive insights behind the development of deep learning.

Gap between testing error and training error

This part cites a classic formula to explain the trade-off between $E_{\text{train}}$ and the gap $E_{\text{test}} - E_{\text{train}}$, which is the familiar trade-off between bias and variance, or, synonymously, between underfitting and overfitting:

$$E_{\text{test}} - E_{\text{train}} = k \left( \frac{h}{P} \right)^{\alpha}$$

where $k$ is a constant, $h$ is an index measuring the effective capacity (complexity) of the model, $P$ is the number of training samples, and $\alpha$ is a constant in the range $[0.5, 1.0]$.
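As a quick illustration (my own sketch, not from the paper), here is how the gap behaves as $P$ grows; the values of $k$, $h$, and $\alpha$ below are made up purely for demonstration.

```python
# Sketch of the gap formula: E_test - E_train = k * (h / P) ** alpha.
# k, h, and alpha below are made-up illustrative values, not from the paper.
k, h, alpha = 1.0, 100, 0.75

for P in (1_000, 10_000, 100_000):
    gap = k * (h / P) ** alpha
    print(f"P = {P:>7}: expected gap ~ {gap:.4f}")
# The gap shrinks as the number of training samples P grows,
# and grows with the capacity index h.
```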

LeNet-5 Details

When I read Section II-B, I found there are lots of summary numbers about the famous LeNet-5 network, but this compact paper does not give the detailed steps behind those computations. In my opinion these computations are interesting and give us a deeper understanding of how the network works, so I’d like to fill in those details below.

Layer $C_1$ (6@28x28).

  • $6$ feature maps.
  • $5 \times 5$ size filter.
  • size of feature map: one dimension is $32 - 5 + 1 = 28$, thus the size is $28 \times 28$.
  • number of trainable parameters:
    • each feature map requires the same $5 \times 5$ filter, which contains $5 \times 5 = 25$ trainable coefficients,
    • and $1$ trainable bias parameter.

    So in all, each feature map requires $25 + 1 = 26$ trainable parameters.

    • as there are $6$ feature maps in $C_1$,

    The whole $C_1$ requires $26 \times 6 = 156$ trainable parameters.

  • number of connections: when we computed the above parameters, we treated every point in a $C_1$ feature map as sharing the same filter, which collapses the different spatial positions into one compute pattern. Thus, the number of connections can be obtained by undoing this sharing: $156 \times 28 \times 28 = 122{,}304$ (recomputed in the sketch below).
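To double-check the arithmetic for $C_1$, here is a minimal sketch (my own, not from the paper) that recomputes the numbers above.

```python
# C1: 6 feature maps, 5x5 filters applied to the 32x32 input image.
filter_size = 5 * 5
feature_maps = 6
map_side = 32 - 5 + 1                       # = 28

params_per_map = filter_size + 1            # 25 weights + 1 bias = 26
total_params = params_per_map * feature_maps
connections = total_params * map_side * map_side

print(total_params)   # 156
print(connections)    # 122304
```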

Layer $S_2$ (6@14x14).

  • $6$ feature maps.
  • size of feature map: as the $2 \times 2$ subsampling windows are non-overlapping, each dimension shrinks by a factor of $2$, i.e. $28 / 2 = 14$, so we get the feature map size $14 \times 14$.
  • number of trainable parameters:
    • considering the pooling process as $w \cdot (x_1 + x_2 + x_3 + x_4) + b$, only $w$ and $b$ are trainable, and all units of a feature map share the same trainable parameters, so we only need $2$ trainable parameters per feature map.
    • as there are $6$ feature maps, in total we require $2 \times 6 = 12$ trainable parameters.
  • number of connections: each unit has $2 \times 2 = 4$ input sampling connections, plus $1$ bias connection, within each feature map. Thus we get the total number of connections as $(4 + 1) \times 6 \times 14 \times 14 = 5880$ (recomputed in the sketch below).
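Again, a minimal sketch (my own) to recompute the $S_2$ numbers:

```python
# S2: 6 feature maps, 2x2 non-overlapping subsampling of C1's 28x28 maps.
feature_maps = 6
map_side = 28 // 2                          # = 14

params = 2 * feature_maps                   # one coefficient + one bias per map
connections = (2 * 2 + 1) * feature_maps * map_side * map_side

print(params)        # 12
print(connections)   # 5880
```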

Layer $C_3$ (16@10x10).

This layer is more special and differs from a common convolutional layer or subsampling (pooling) layer. As the paper says, each unit combines $5 \times 5$ neighborhoods at identical locations in a subset of $S_2$’s feature maps, which means the subsets run along the third dimension, the feature-map dimension. If we imagine the feature maps stacked from bottom to top, then a subset means a local group of maps along this stacking dimension. So for each group of feature maps in $C_3$, the effective filter depth may be different.

  • number of trainable parameters:
    • number of parameters in the first $6$ feature maps: $5 \times 5 \times 3 + 1 = 76$ for each map in this group (the $3$ indicates the number of neighboring $S_2$ maps it connects to), and these $6$ maps require $76 \times 6 = 456$ trainable parameters.
    • number of parameters in the next $6$ feature maps (by the same reasoning): $(5 \times 5 \times 4 + 1) \times 6 = 606$ trainable parameters (the $4$ indicates the number of neighboring $S_2$ maps).
    • number of parameters in the next $3$ feature maps: $(5 \times 5 \times 4 + 1) \times 3 = 303$ trainable parameters.
    • number of parameters in the last feature map: $5 \times 5 \times 6 + 1 = 151$ trainable parameters. Summing all the numbers, the whole $C_3$ requires $456 + 606 + 303 + 151 = 1516$ parameters.
  • number of connections: as in the discussion of $C_1$, every element of a feature map uses the same filter, and there are $10 \times 10 = 100$ elements in each feature map of $C_3$, so we get the total number of connections $1516 \times 10 \times 10 = 151{,}600$ (recomputed in the sketch below).
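A minimal sketch (my own) of the $C_3$ arithmetic, following the connection scheme of Table I in the paper:

```python
# C3: 16 feature maps, 5x5 filters over subsets of S2's 6 maps.
# Per Table I: 6 maps see 3 inputs, 6 see 4, 3 see 4, and 1 sees all 6.
groups = [(6, 3), (6, 4), (3, 4), (1, 6)]   # (number of maps, S2 inputs per map)

params = sum(n_maps * (5 * 5 * n_in + 1) for n_maps, n_in in groups)
connections = params * 10 * 10

print(params)        # 1516
print(connections)   # 151600
```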

Layer $S_4$ (16@5x5).

  • number of trainable parameters: $2 \times 16 = 32$.
  • number of connections: $(2 \times 2 + 1) \times 16 \times 5 \times 5 = 2000$ (recomputed in the sketch below).
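A minimal sketch (my own) of the $S_4$ numbers:

```python
# S4: 16 feature maps, 2x2 non-overlapping subsampling of C3's 10x10 maps.
feature_maps = 16
map_side = 10 // 2                          # = 5

params = 2 * feature_maps                   # one coefficient + one bias per map
connections = (2 * 2 + 1) * feature_maps * map_side * map_side

print(params)        # 32
print(connections)   # 2000
```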

Layer $C_5$ (120@1x1).

  • number of trainable parameters: $(5 \times 5 \times 16 + 1) \times 120 = 48{,}120$.
  • number of connections: since each feature map is $1 \times 1$, the number of connections equals the number of parameters, $48{,}120$ (recomputed in the sketch below).
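A minimal sketch (my own) of the $C_5$ numbers:

```python
# C5: 120 feature maps, 5x5 filters over all 16 of S4's 5x5 maps (output is 1x1).
params = (5 * 5 * 16 + 1) * 120
connections = params * 1 * 1                # 1x1 output, so connections == parameters

print(params)        # 48120
print(connections)   # 48120
```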

Layer $F_6$ (84@1x1).

This layer has the classic fully connected neural network structure. It gets its input from all the feature maps of $C_5$, which means $120$ maps of size $1 \times 1$. To generate one unit of $F_6$, it weights the $120$ elements of $C_5$ and adds one bias, which means each unit of $F_6$ requires $120 + 1 = 121$ trainable parameters.

This weighted sum, denoted $a_i$ for unit $i$, is then passed through a sigmoid squashing function. This gives the final output of unit $i$ in $F_6$: $x_i = f(a_i)$, where $f(a) = A \tanh(S a)$ is a scaled hyperbolic tangent.
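A small sketch of the squashing function, using the constants reported in the paper’s Appendix A ($A = 1.7159$, $S = 2/3$, chosen so that $f(\pm 1) = \pm 1$):

```python
import math

# Scaled hyperbolic tangent squashing function f(a) = A * tanh(S * a),
# with A = 1.7159 and S = 2/3 as given in the paper's Appendix A.
A, S = 1.7159, 2.0 / 3.0

def squash(a: float) -> float:
    return A * math.tanh(S * a)

print(squash(1.0))    # ~ 1.0
print(squash(-1.0))   # ~ -1.0
print(squash(10.0))   # saturates toward the amplitude A = 1.7159
```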

  • number of trainable parameters: $121 \times 84 = 10{,}164$.
  • number of connections: also $10{,}164$, since each unit’s $120$ inputs and $1$ bias each contribute exactly one connection (recomputed in the sketch below).
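A minimal sketch (my own) of the $F_6$ numbers:

```python
# F6: 84 units, fully connected to the 120 units of C5.
c5_units, f6_units = 120, 84

params = (c5_units + 1) * f6_units          # 120 weights + 1 bias per unit
connections = params                        # one connection per weight and per bias

print(params)        # 10164
print(connections)   # 10164
```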

Output Layer

Each unit in the OUTPUT layer computes the Euclidean distance between its input vector and its parameter vector: $y_i = \sum_{j} (x_j - w_{ij})^2$, where $i$ is the index of the output class, i.e. $10$ classes in total here, and $j$ is the index into the input vector, i.e. the $84$ units of $F_6$ here. Thus we can get:

  • number of trainable parameters: $84 \times 10 = 840$ (recomputed in the sketch below).
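A minimal sketch (my own) of the RBF computation and its parameter count. The parameter vectors below are random $\pm 1$ placeholders; the paper instead uses hand-designed $7 \times 12$ bitmaps.

```python
import random

# OUTPUT layer: one Euclidean RBF unit per class, y_i = sum_j (x_j - w_ij)^2.
n_classes, f6_units = 10, 84

# Placeholder +/-1 parameter vectors (the paper uses hand-designed bitmaps).
w = [[random.choice([-1.0, 1.0]) for _ in range(f6_units)] for _ in range(n_classes)]
x = [random.uniform(-1.7159, 1.7159) for _ in range(f6_units)]   # a fake F6 state

y = [sum((x_j - w_ij) ** 2 for x_j, w_ij in zip(x, w_i)) for w_i in w]

print(n_classes * f6_units)                        # 840 parameters
print(min(range(n_classes), key=lambda i: y[i]))   # class with the smallest penalty
```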

Remarks:

  1. The distance computation used above is called a Euclidean Radial Basis Function (RBF).
  2. As the paper discusses, although there are lots of trainable parameters here, the value of each parameter is simply $-1$ or $+1$.
  3. As each component of $F_6$ is the output of the sigmoid $f$, its value lies in the range $(-A, A)$. So the output of this OUTPUT layer acts as a penalty term measuring the fit between the input pattern and a model of the class associated with the RBF: the further away the input is from the parameter vector, the larger the RBF output.
  4. The rationale for treating the RBF output as a penalty term can be traced to probability theory. In probabilistic terms, the RBF output can be interpreted as the unnormalized negative log-likelihood of a Gaussian distribution in the space of configurations of layer $F_6$, which matches the common definition of a loss function.
  5. In this loss-function role, the configuration of $F_6$ is supposed to be as close as possible to the parameter vector of the RBF unit that corresponds to the pattern’s desired class.
  6. The RBF parameter vectors are chosen to represent a stylized image of the corresponding character class drawn on a $7 \times 12$ bitmap (hence the number $84$).
  7. Because it represents a stylized image, this encoding is very useful for recognizing groups of characters that have common, confusable shapes, especially when there is a post-processor that can correct such confusions. On the other hand, it is not particularly useful for recognizing isolated characters whose particular features are far from any stylized prototype.
  8. From a higher perspective, the classifier here uses distributed codes (i.e. codes built from pixel values) instead of the commonly used one-hot encoding. The reason is that non-distributed codes tend to behave badly when the number of classes is larger than a few dozen. That is why we often see one-hot encoding used to distinguish digits, but not larger character sets. (Here the size of the full printable ASCII set, i.e. the number of classes, is already too large for one-hot encoding.) A toy sketch of the two encodings follows.
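As a toy sketch (my own, with made-up bitmaps) of one-hot versus distributed codes for confusable classes:

```python
# One-hot (place) code vs. a distributed code, on a toy 3-class example.
# The tiny "bitmaps" below are made up for illustration; the paper uses
# 7x12 = 84-pixel stylized character images as the distributed code.
classes = ["O", "o", "0"]

one_hot = {c: [1 if i == j else 0 for j in range(len(classes))]
           for i, c in enumerate(classes)}

# Made-up 2x3 "stylized images": confusable classes get overlapping codes.
distributed = {
    "O": [1, 1, 1, 1, 0, 1],
    "o": [1, 1, 1, 1, 0, 0],
    "0": [1, 1, 1, 1, 1, 1],
}

print(one_hot["O"], one_hot["0"])          # orthogonal; no notion of similarity
print(distributed["O"], distributed["0"])  # similar codes for similar-looking shapes
```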