One common vectorization technique is useful in many machine learning problems: assigning one row of a matrix A, say A[i], to a column of another matrix B, say B[:, k], where k = y(i) and y is some index function.

Example 1: SVM Gradient

For the gradient computation:

$$dw_k = \frac{\partial L_i}{\partial w_k} =
\begin{cases}
-x_i^T & (k = y_i), \quad i = 1, \ldots, N \\
0 & (k \neq y_i)
\end{cases}$$

Thus we get $dw_k = -x_i^T$, where $k = y_i\ (= y(i))$. To construct this assignment, we can use the following fact: to assign the column vector $X[0]^T$ to the $y[0]$-th column of $dw$, for example, we can use the outer product

$$X[0]^T \cdot [0, \ldots, 0, 1, 0, \ldots, 0]$$

where the position of the 1 is equal to the value of $y[0]$; since the gradient carries a minus sign, the code below places $-1$ at that position instead.
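As a quick sanity check, here is a minimal NumPy sketch of this outer-product trick for a single sample (the toy values of X, y, and num_classes are made up purely for illustration):

import numpy as np

# toy data: 2 samples with 3 features, 4 classes (hypothetical values)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
y = np.array([2, 0])
num_classes = 4

onehot = np.zeros(num_classes)
onehot[y[0]] = 1                          # the 1 sits at position y[0]
col = np.outer(X[0], onehot)              # X[0]^T times the one-hot row vector
# col has shape (3, 4): X[0] appears in column y[0], all other columns are zero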

Then, the whole computation will become

# place -1 at position (i, y[i]) for every training example i
mask = np.zeros((num_train, num_classes))
index = (np.arange(num_train), y)
mask[index] = -1
np.dot(X.T, mask)
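For reference, the vectorized expression above computes the same thing as the straightforward loop (a sketch, assuming dW is a gradient accumulator of shape (num_features, num_classes)):

# non-vectorized equivalent of the mask trick above
dW = np.zeros((X.shape[1], num_classes))
for i in range(num_train):
    dW[:, y[i]] -= X[i]                   # subtract x_i^T from column y_i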

Example 2: Softmax Loss

For the differential $dw_k = \frac{\partial L_i}{\partial w_k} = x_i^T$, where $k = \arg\max_j f_j$, we can use the same fact: to assign the column vector $X[0]^T$ to the column of $dw$ given by the argmax of $f[0]$, for example, we can use the formula

$$X[0]^T \cdot [0, \ldots, 0, 1, 0, \ldots, 0]$$

where the position of 1 is equal to the position of f[0]’s maximum element.

Then, the whole computation will become

# place 1 at the column of each row's maximum score
mask = np.zeros((num_train, num_classes))
maxPos = np.argmax(f, axis=1)             # positions of the maxima, not their values
index = (np.arange(num_train), maxPos)
mask[index] = 1
np.dot(X.T, mask)

Example 3: Softmax Gradient

We can continue to use the technique of the above discussion to vectorize the softmax gradient. The softmax loss is

$$L_i = -f_{y_i} + \log\Big(\sum_j e^{f_j}\Big)$$
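As an aside, the loss itself can also be computed in a fully vectorized way (a sketch; f is assumed to be the N-by-C score matrix, and the row-wise maximum is subtracted only for numerical stability):

# vectorized softmax loss over all training examples
f_shifted = f - np.max(f, axis=1, keepdims=True)   # stability shift, does not change the loss
log_probs = f_shifted - np.log(np.sum(np.exp(f_shifted), axis=1, keepdims=True))
loss = -np.mean(log_probs[np.arange(num_train), y])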

For the first term, it is the same as the first example: $dw_k = -x_i^T$, where $k = y_i$. Thus we can use the code

# place -1 at position (i, y[i]), exactly as in Example 1
mask = np.zeros((num_train, num_classes))
index = (np.arange(num_train), y)
mask[index] = -1
np.dot(X.T, mask)

For the differential of the second term, we get

$$\sum_k \frac{e^{f_k}}{\sum_j e^{f_j}} \, \nabla f_k$$

Thus each $\nabla f_k$ can be obtained with the same trick as before, e.g.

$$X[0]^T \cdot [0, \ldots, 0, 1, 0, \ldots, 0]$$

where the position of the 1 is $k$; we then only need to multiply in the coefficient $\frac{e^{f_k}}{\sum_j e^{f_j}}$, i.e.

$$X[0]^T \cdot \Big[0, \ldots, 0, \tfrac{e^{f_k}}{\sum_j e^{f_j}}, 0, \ldots, 0\Big]$$

This is the general term of the sum in the differential of the second term. Let's expand the first few terms:

$$\begin{aligned}
&X[0]^T \cdot \Big[\tfrac{e^{f_0}}{\sum_j e^{f_j}}, 0, 0, \ldots, 0\Big] \\
&X[0]^T \cdot \Big[0, \tfrac{e^{f_1}}{\sum_j e^{f_j}}, 0, \ldots, 0\Big] \\
&X[0]^T \cdot \Big[0, 0, \tfrac{e^{f_2}}{\sum_j e^{f_j}}, 0, \ldots, 0\Big]
\end{aligned}$$

So, the whole sum will become

$$X[0]^T \cdot \Big[\tfrac{e^{f_0}}{\sum_j e^{f_j}}, \tfrac{e^{f_1}}{\sum_j e^{f_j}}, \tfrac{e^{f_2}}{\sum_j e^{f_j}}, \ldots, \tfrac{e^{f_k}}{\sum_j e^{f_j}}, \ldots\Big]$$
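Putting the two terms together, a minimal vectorized sketch of the whole softmax gradient could look like the following (assuming f = X.dot(W) is the N-by-C score matrix; the softmax probabilities play exactly the role of the coefficient row above):

# second term: column k is weighted by e^{f_k} / sum_j e^{f_j}
probs = np.exp(f) / np.sum(np.exp(f), axis=1, keepdims=True)   # shape (num_train, num_classes)

# first term: subtract 1 at the correct class, as in Example 1
probs[np.arange(num_train), y] -= 1

dW = np.dot(X.T, probs)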

Dimension Analysis

The above discussion can be treated as a mathematical justification of the following technique, dimension analysis, which is easier to remember. Let's analyze the typical matrix multiplication in a fully-connected neural network. Assume the formula we use is

a=XW+b

where X is an N×D matrix, W has dimension D×M, and b has dimension M×1. Suppose we already know the derivative of a, denoted da. Obviously, its dimension is the same as that of a, i.e. N×M. Let's determine the derivatives of X, W, and b.

According to the chain rule, the derivative of X must be some combination of W and da. Since:

  • the dimension of dX is the same as that of X, i.e. N×D,
  • the dimension of W is D×M,
  • and the dimension of da is N×M.

In order to get a result with dimension N×D, we need to multiply a matrix of dimension N×M by a matrix of dimension M×D. Thus, we get the answer $dX = da \cdot W^T$.

Also, to get dW, we know it must be some combination of X and da. Since:

  • the dimension of dW is D×M,
  • the dimension of X is N×D,
  • the dimension of da is N×M

we get $dW = X^T \cdot da$.

Finally, for db, we know its dimension is M×1, and it is only related to da, which has dimension N×M. We just need to construct a matrix multiplication with the all-ones vector $u = [1, 1, \ldots, 1]^T$ of dimension N×1. Thus we get the result $db = da^T u$.
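These three results are easy to verify numerically, at least at the level of shapes. Here is a minimal sketch with made-up dimensions (none of these variable names come from a particular library):

import numpy as np

N, D, M = 4, 3, 2
X = np.random.randn(N, D)
W = np.random.randn(D, M)
b = np.random.randn(M)
a = np.dot(X, W) + b                   # forward pass, shape (N, M)
da = np.random.randn(N, M)             # upstream derivative, same shape as a

dX = np.dot(da, W.T)                   # (N, M) x (M, D) -> (N, D)
dW = np.dot(X.T, da)                   # (D, N) x (N, M) -> (D, M)
db = np.dot(da.T, np.ones(N))          # (M, N) x (N,)   -> (M,), i.e. da summed over rows

assert dX.shape == X.shape and dW.shape == W.shape and db.shape == b.shape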

Surprisingly, the results of dimension analysis are the same as those of the previous discussion. That's the power of dimension analysis!