Vectorization in numpy (draft)
One vectorization technique that is commonly useful in machine learning problems is the following: assign one row of a matrix A, say A[i], to the k-th column of another matrix B, say B[:, k], where k = y(i) and y is some index function.
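Before the concrete examples, here is a minimal sketch of this pattern with made-up sizes (the names A, B, y and the shapes below are illustrative assumptions only):
import numpy as np
N, D, C = 4, 3, 5                       # number of rows of A, row length, number of columns of B
A = np.arange(N * D, dtype=float).reshape(N, D)
y = np.array([2, 0, 4, 2])              # index function: row i goes to column y[i]
mask = np.zeros((N, C))
mask[np.arange(N), y] = 1               # one-hot rows: mask[i, y[i]] = 1
B = A.T.dot(mask)                       # column y[i] of B receives A[i]
Note that when y has repeated values (here y[0] = y[3] = 2), the corresponding column accumulates a sum of rows rather than a single assignment, which is exactly what gradient computations need.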
Example 1: SVM Gradient
For the differential computation:
$$ dw_k = \frac{\partial L_i}{\partial w_k} = \begin{cases} -x_i^T & (k = y_i) \\ 0 & (k \neq y_i) \end{cases}, \qquad i = 1, \cdots, N $$
Thus we get $dw_k = -x_i^T$, where $k = y_i (= y(i))$. To construct this assignment, we use the following fact: to assign $X[0]^T$ to the $y[0]$-th column of dw, we can use the formula
$$ X[0]^T * [0, \ldots, 0, -1, 0, \ldots, 0] $$
where the position of the $-1$ is equal to the value of $y[0]$.
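Concretely, this product of a column vector and a row vector is an outer product. A tiny numeric sketch (values and shapes are made up):
import numpy as np
x0 = np.array([1.0, 2.0, 3.0])              # pretend this is X[0], so X[0].T is a 3x1 column
onehot = np.array([0.0, -1.0, 0.0, 0.0])    # -1 sits at position y[0] = 1 (assumed)
contrib = np.outer(x0, onehot)              # 3x4 matrix whose column 1 is -x0, all other entries zero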
Then, the whole computation will become
mask = np.zeros((num_train, num_classes))
index = (np.arange(num_train), y)      # row i, column y[i]
mask[index] = -1
dW = np.dot(X.T, mask)                 # column y[i] of dW accumulates -X[i]^T
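As a sanity check, the mask-based product matches the explicit per-example loop; below is a minimal sketch with small made-up shapes:
import numpy as np
num_train, dim, num_classes = 5, 4, 3
X = np.random.randn(num_train, dim)
y = np.random.randint(num_classes, size=num_train)
mask = np.zeros((num_train, num_classes))
mask[np.arange(num_train), y] = -1
dW_vec = np.dot(X.T, mask)
dW_loop = np.zeros((dim, num_classes))
for i in range(num_train):
    dW_loop[:, y[i]] -= X[i]            # subtract X[i]^T from column y[i]
assert np.allclose(dW_vec, dW_loop)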
Example 2: Softmax Loss
For the differential $dw_k = \frac{\partial L_i}{\partial w_k} = x_i^T$, where $k = \arg\max_j f_j$, we can use the same trick: to assign $X[0]^T$ to the column of $dw$ given by the position of $f[0]$'s maximum element, we can use the formula
$$ X[0]^T * [0, \ldots, 0, 1, 0, \ldots, 0] $$
where the position of the $1$ is equal to the position of $f[0]$'s maximum element.
Then, the whole computation will become
mask = np.zeros((num_train, num_classes))
maxPos = np.argmax(f, axis=1)          # position of each row's maximum, not its value
index = (np.arange(num_train), maxPos)
mask[index] = 1
dW = np.dot(X.T, mask)                 # column argmax(f[i]) of dW accumulates X[i]^T
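To see what the mask looks like, here is a tiny made-up score matrix f (three examples, four classes, arbitrary values):
import numpy as np
f = np.array([[1.0, 3.0, 0.5, 2.0],     # argmax = 1
              [0.2, 0.1, 4.0, 1.0],     # argmax = 2
              [2.0, 2.5, 1.0, 3.5]])    # argmax = 3
num_train, num_classes = f.shape
mask = np.zeros((num_train, num_classes))
mask[np.arange(num_train), np.argmax(f, axis=1)] = 1
# mask is now:
# [[0., 1., 0., 0.],
#  [0., 0., 1., 0.],
#  [0., 0., 0., 1.]]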
Example 3: Softmax Gradient
We can continue to use the techniques from the discussion above to construct the vectorization of the softmax gradient. The loss formula is
$$ L_i = -f_{y_i} + \log\Big(\sum_j e^{f_j}\Big) $$
For the first term, it is the same as in the first example: $dw_k = -x_i^T$, where $k = y_i$. Thus we can use the code
mask = np.zeros((num_train, num_classes))
index = (np.arange(num_train), y)
mask[index] = -1
dW = np.dot(X.T, mask)                 # contribution of the first term
For the differential of the second term, we get
$$ \sum_k \frac{e^{f_k}}{\sum_j e^{f_j}} \, f'_k $$
Thus, apart from the $f'_k$, which can be obtained by using, e.g.
$$ X[0]^T * [0, \ldots, 0, 1, 0, \ldots, 0] $$
where the position of the $1$ is equal to $k$, we only need to multiply in the corresponding coefficient $\frac{e^{f_k}}{\sum_j e^{f_j}}$, i.e.
$$ X[0]^T * \Big[0, \ldots, 0, \frac{e^{f_k}}{\sum_j e^{f_j}}, 0, \ldots, 0\Big] $$
This is the general term of the sum in the differential of the second term. Let's expand the first few terms:
$$ X[0]^T * \Big[\frac{e^{f_0}}{\sum_j e^{f_j}}, 0, 0, \ldots, 0\Big] $$
$$ X[0]^T * \Big[0, \frac{e^{f_1}}{\sum_j e^{f_j}}, 0, \ldots, 0\Big] $$
$$ X[0]^T * \Big[0, 0, \frac{e^{f_2}}{\sum_j e^{f_j}}, 0, \ldots, 0\Big] $$
So the whole sum becomes
$$ X[0]^T * \Big[\frac{e^{f_0}}{\sum_j e^{f_j}}, \frac{e^{f_1}}{\sum_j e^{f_j}}, \frac{e^{f_2}}{\sum_j e^{f_j}}, \ldots, \frac{e^{f_k}}{\sum_j e^{f_j}}, \ldots\Big] $$
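Putting the two terms together, a fully vectorized softmax gradient might look like the sketch below; the shapes, the weight matrix W, and the omission of the 1/num_train scaling and regularization are assumptions for illustration:
import numpy as np
num_train, dim, num_classes = 6, 4, 3
X = np.random.randn(num_train, dim)
W = np.random.randn(dim, num_classes)
y = np.random.randint(num_classes, size=num_train)
f = X.dot(W)                                    # scores, shape (num_train, num_classes)
f -= f.max(axis=1, keepdims=True)               # shift each row for numerical stability
probs = np.exp(f) / np.exp(f).sum(axis=1, keepdims=True)   # e^{f_k} / sum_j e^{f_j}
mask = probs.copy()                             # second term: a coefficient in every column
mask[np.arange(num_train), y] -= 1              # first term: an extra -1 at column y[i]
dW = np.dot(X.T, mask)                          # accumulate each X[i]^T into the right columns
Each row of probs is exactly the row vector of coefficients in the last formula above, so the single matrix product sums the contributions of all training examples at once.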
Dimension Analysis
The above discussion can be treated as a mathematical justification of the following technique: dimension analysis, which is easier to remember. Let's analyze the typical matrix multiplication in a fully-connected neural network layer. Assume the formula we use is
$$ a = X \cdot W + b $$
where X is an N×D matrix, W has dimension D×M, and the dimension of b is M×1. Suppose we already know the derivative of a, denoted da; obviously, its dimension is the same as that of a, i.e. N×M. Let's determine the derivatives of X, W, and b.
According to the chain rule, the derivative of X must be some combination of W and da. Since:
- the dimension of dX is the same as that of X, i.e. N×D,
- the dimension of W is D×M,
- and the dimension of da is N×M.
In order to get a result with dimension N×D, we need to multiply a matrix of dimension N×M by one of dimension M×D. Thus we get the answer $dX = da \cdot W^T$.
Similarly, to get dW, we know it must be a combination of X and da. Since:
- the dimension of dW is D×M,
- the dimension of X is N×D,
- the dimension of da is N×M,
we get $dW = X^T \cdot da$.
Finally, for db, we know its dimension is M×1, and it is only related to da, whose dimension is N×M. We just need to construct a matrix multiplication with the all-ones vector $u = [1, 1, \cdots, 1]^T$ of dimension N×1. Thus we get the result $db = da^T \cdot u$.
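A quick shape check of all three results, as a minimal sketch with made-up sizes (note that $da^T \cdot u$ is simply the column-wise sum of da):
import numpy as np
N, D, M = 5, 4, 3
X = np.random.randn(N, D)
W = np.random.randn(D, M)
da = np.random.randn(N, M)              # upstream gradient, same shape as a = X.dot(W) + b
dX = da.dot(W.T)                        # (N×M) · (M×D) -> N×D
dW = X.T.dot(da)                        # (D×N) · (N×M) -> D×M
u = np.ones((N, 1))                     # all-ones vector of dimension N×1
db = da.T.dot(u)                        # (M×N) · (N×1) -> M×1, equal to da.sum(axis=0) as a column
assert dX.shape == (N, D) and dW.shape == (D, M) and db.shape == (M, 1)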
Surprisingly, we find that the results of dimension analysis are the same as those of the previous discussion. That's the power of dimension analysis!