One common vectorization technique is useful in machine learning problems: assigning one row of a matrix $X$, say $X_i$, to a column of another matrix $W$, say $W_{:, j}$, where $j = f(i)$ is some index function.

Example 1: SVM Gradient

For the differential computation of the SVM margin term with respect to the correct-class weights, whenever the margin is violated we have

$$\frac{\partial}{\partial W_{:, y_i}} \max\!\left(0,\; f_j - f_{y_i} + \Delta\right) = -X_i^T .$$

Thus we get $dW_{:, y_i} \mathrel{+}= -X_i^T$, where $X_i$ is the $i$-th row of $X$. To construct this assignment, we use the fact that, e.g., to assign the first column of $X^T$ (that is, $X_1^T$) to the $y_1$-st column of $dW$, we can use the formula

$$X_1^T \begin{pmatrix} 0 & \cdots & -1 & \cdots & 0 \end{pmatrix} = \begin{pmatrix} 0 & \cdots & -X_1^T & \cdots & 0 \end{pmatrix},$$

where the position of the $-1$ is equal to the value of $y_1$.

Then, the whole computation will become

# put -1 at position (i, y[i]) for every training example i
mask = np.zeros((num_train, num_classes))
index = (np.arange(num_train), y)
mask[index] = -1
np.dot(X.T, mask)
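
As a sanity check, here is a minimal self-contained sketch of this trick (the sizes and data below are made up for illustration): column $y_i$ of the result accumulates $-X_i$ for every example $i$ with label $y_i$.

import numpy as np

# made-up sizes, for illustration only
num_train, num_features, num_classes = 5, 3, 4
X = np.random.randn(num_train, num_features)
y = np.random.randint(num_classes, size=num_train)

# vectorized: let the matrix product perform the column assignment
mask = np.zeros((num_train, num_classes))
mask[np.arange(num_train), y] = -1
dW = np.dot(X.T, mask)                      # shape (num_features, num_classes)

# loop version: column y[i] accumulates -X[i]
dW_loop = np.zeros((num_features, num_classes))
for i in range(num_train):
    dW_loop[:, y[i]] += -X[i]

assert np.allclose(dW, dW_loop)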

Example 2: Softmax Loss

For the differential of the maximum score, $\frac{\partial}{\partial W} \max_j f_j$, where $f_j = X_i W_{:, j}$, we can use the fact that, e.g., to assign the first column of $X^T$ (that is, $X_1^T$) to the $\arg\max_j f_{1j}$-th column of the result, we can use the formula

$$X_1^T \begin{pmatrix} 0 & \cdots & 1 & \cdots & 0 \end{pmatrix} = \begin{pmatrix} 0 & \cdots & X_1^T & \cdots & 0 \end{pmatrix},$$

where the position of the $1$ is equal to the position of $f_1$'s maximum element.

Then, the whole computation will become

# put 1 at position (i, argmax_j f[i, j]) for every training example i
mask = np.zeros((num_train, num_classes))
maxPos = np.argmax(f, axis=1)
index = (np.arange(num_train), maxPos)
mask[index] = 1
np.dot(X.T, mask)
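
The same kind of check applies here (again with made-up sizes; `f` is assumed to be the $N \times C$ score matrix): column $c$ of the result accumulates $X_i$ for every example whose maximum score falls in class $c$.

import numpy as np

num_train, num_features, num_classes = 5, 3, 4
X = np.random.randn(num_train, num_features)
f = X.dot(np.random.randn(num_features, num_classes))    # made-up score matrix

mask = np.zeros((num_train, num_classes))
mask[np.arange(num_train), np.argmax(f, axis=1)] = 1
out = np.dot(X.T, mask)

# loop version: X[i] goes to the column of its maximum score
out_loop = np.zeros((num_features, num_classes))
for i in range(num_train):
    out_loop[:, np.argmax(f[i])] += X[i]

assert np.allclose(out, out_loop)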

Example 3: Softmax Gradient

We can continue to use the trick from the above discussion to construct the vectorization of the softmax gradient. The per-example softmax loss is

$$L_i = -f_{y_i} + \log \sum_j e^{f_j}, \qquad f_j = X_i W_{:, j} .$$

For the first term, the differential is the same as in the first example: $dW_{:, y_i} \mathrel{+}= -X_i^T$, where $y_i$ is the label of example $i$. Thus we can use the code

# put -1 at position (i, y[i]) for every training example i
mask = np.zeros((num_train, num_classes))
index = (np.arange(num_train), y)
mask[index] = -1
np.dot(X.T, mask)

For the differential of the 2nd term, we get

$$\frac{\partial}{\partial W_{:, k}} \log \sum_j e^{f_j} = \frac{e^{f_k}}{\sum_j e^{f_j}} \, X_i^T .$$

Thus, apart from the assignment of $X_i^T$ to the $k$-th column, which can be obtained by using, e.g.,

$$X_i^T \begin{pmatrix} 0 & \cdots & 1 & \cdots & 0 \end{pmatrix},$$

where the position of the $1$ is equal to the value of $k$, we only need to multiply by the preceding coefficient $\frac{e^{f_k}}{\sum_j e^{f_j}}$, i.e.

$$X_i^T \begin{pmatrix} 0 & \cdots & \frac{e^{f_k}}{\sum_j e^{f_j}} & \cdots & 0 \end{pmatrix}.$$

This is the general term of the sum in the differential of the 2nd term. Let's expand a few of the first terms:

$$X_i^T \begin{pmatrix} \tfrac{e^{f_1}}{\sum_j e^{f_j}} & 0 & \cdots & 0 \end{pmatrix} + X_i^T \begin{pmatrix} 0 & \tfrac{e^{f_2}}{\sum_j e^{f_j}} & \cdots & 0 \end{pmatrix} + \cdots = X_i^T \begin{pmatrix} \tfrac{e^{f_1}}{\sum_j e^{f_j}} & \tfrac{e^{f_2}}{\sum_j e^{f_j}} & \cdots & \tfrac{e^{f_C}}{\sum_j e^{f_j}} \end{pmatrix}.$$

So, summing over all training examples as well, the whole sum becomes

$$X^T P, \qquad \text{where } P_{ik} = \frac{e^{f_{ik}}}{\sum_j e^{f_{ij}}}$$

is the $N \times C$ matrix of softmax probabilities.
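
Putting the two terms together, a minimal sketch of the fully vectorized softmax gradient might look like the following (the function name and the `probs` variable are illustrative; the loss here is summed over examples, with no regularization):

import numpy as np

def softmax_gradient(X, y, W):
    # scores, shape (num_train, num_classes)
    f = X.dot(W)
    f -= np.max(f, axis=1, keepdims=True)             # shift for numerical stability
    probs = np.exp(f) / np.sum(np.exp(f), axis=1, keepdims=True)

    # second term uses probs; first term puts -1 at (i, y[i])
    coeff = probs.copy()
    coeff[np.arange(X.shape[0]), y] -= 1
    return X.T.dot(coeff)

Note that the correct-class mask and the probability matrix are combined into a single coefficient matrix, so only one multiplication by X.T is needed.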

Dimension Analysis

The above discussion can be treated as a mathematical justification of the following technique: dimension analysis, which is easier to remember. Let's analyze the typical matrix multiplication in fully-connected neural network computation. Assume the formula we use is

$$Y = XW + b,$$

where $X$ is an $N \times D$ matrix, $W$ has dimension $D \times C$, and the dimension of $b$ is $1 \times C$ (broadcast across the $N$ rows). And we assume the derivative of $Y$, written $dY$, is already known. Obviously, its dimension is the same as $Y$'s, i.e. $N \times C$. Let's determine the derivatives of $X$, $W$, and $b$.

According to the chain rule, the derivative $dX$ must be some combination of $dY$ and $W$. As:

  • the dimension of $dX$ is the same as $X$'s, i.e. $N \times D$,
  • the dimension of $dY$ is $N \times C$,
  • and the dimension of $W$ is $D \times C$.

In order to get a result with dimension $N \times D$, we need to multiply a matrix of dimension $N \times C$ by a matrix of dimension $C \times D$. Thus, we get the answer $dX = dY \, W^T$.

Also, to get $dW$, we know it must be a combination of $dY$ and $X$. As:

  • the dimension of $dW$ is $D \times C$,
  • the dimension of $dY$ is $N \times C$,
  • the dimension of $X$ is $N \times D$,

we get $dW = X^T \, dY$.

Finally, for $db$, we know its dimension is $1 \times C$. And it is only related to $dY$, whose dimension is $N \times C$. We just need to construct the matrix multiplication with an all-one vector of dimension $1 \times N$. Thus we get the result $db = \mathbf{1}_{1 \times N} \, dY$, i.e. $dY$ summed over its rows.
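
A minimal sketch of these three results with a shape check (the function name `affine_backward` and the random data are illustrative only):

import numpy as np

def affine_backward(dY, X, W, b):
    # backward pass for Y = X @ W + b, obtained purely by matching dimensions
    dX = dY.dot(W.T)                         # (N, C) @ (C, D) -> (N, D)
    dW = X.T.dot(dY)                         # (D, N) @ (N, C) -> (D, C)
    db = np.ones((1, X.shape[0])).dot(dY)    # (1, N) @ (N, C) -> (1, C)
    return dX, dW, db

N, D, C = 4, 3, 2                            # made-up sizes
X, W, b = np.random.randn(N, D), np.random.randn(D, C), np.random.randn(1, C)
dY = np.random.randn(N, C)                   # pretend upstream gradient
dX, dW, db = affine_backward(dY, X, W, b)
assert dX.shape == X.shape and dW.shape == W.shape and db.shape == b.shape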

Surprisingly, we find that the results of dimension analysis are the same as those of the previous discussion. That's the power of dimension analysis!