cs231n assignment 1
Recently I have been following an online course on Convolutional Neural Networks (CNN) provided by Stanford. I find it very nice hands-on material: the slides and notes are easy to understand. Purely reading formulations can be confusing at times, but doing the experiments helps me better understand what the formulations, and the symbols in them, are expressing.
The first assignment is about basic classifiers. It also includes some practice on vectorization, which can make Python code much faster. I don't write the fully nested-loop version in some parts, and some of the code is only half-vectorized.
Prerequisites
1. KNN
KNN is the easiest part, but it is still worth doing, because it helps with understanding vectorization and cross validation.
Train
In KNN, the process of training is simply remembering X_train and y_train:
- X_train: shape (#train, #features). Each row corresponds to a training sample.
- y_train: shape (#train,). Labels.
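For reference, the train method of the KNearestNeighbor class used in the assignment boils down to something like this (a sketch from memory, not the verbatim assignment code):

def train(self, X, y):
    # KNN "training" just memorizes the data
    # X: shape (#train, #features), y: shape (#train,)
    self.X_train = X
    self.y_train = y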
Distances
The essence of KNN lies in computing distances. The input is simply X; the output is dists.
- X: shape (#test, #features). Each row corresponds to a test sample.
- X_train (written XT in the formula below): shape (#train, #features). Each row corresponds to a training sample.
- dists: shape (#test, #train). dists[i, j] is the distance from the $i^{th}$ test sample to the $j^{th}$ training sample.
For brevity, denote
- $D$ as #features
- $X^{(i)}$ as the $i^{th}$ row of $X$, i.e. the $i^{th}$ test sample
- $XT^{(j)}$ as the $j^{th}$ row of X_train, i.e. the $j^{th}$ training sample
Since
$$\text{dists}[i, j]^2 = \left\| X^{(i)} - XT^{(j)} \right\|^2 = \left\| X^{(i)} \right\|^2 - 2\, X^{(i)} \cdot XT^{(j)} + \left\| XT^{(j)} \right\|^2,$$
we can vectorize it like this:
def compute_distances_no_loops(self, X):
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    # vectorize: dists^2 = X2 - 2 * ip + XT2, using broadcasting
    ip = X.dot(self.X_train.T)               # inner products, shape (#test, #train)
    XT2 = np.sum(self.X_train ** 2, axis=1)  # squared norms of training samples, shape (#train,)
    X2 = np.sum(X ** 2, axis=1)              # squared norms of test samples, shape (#test,)
    dists = np.sqrt(-2 * ip + XT2 + X2.reshape(-1, 1))
    return dists
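To convince myself the vectorized formula is right, I like to compare it with scipy's cdist on a small random example. This check is my own addition (the notebook instead compares against the one-loop and two-loop versions); the small arrays below are made up for illustration:

import numpy as np
from scipy.spatial.distance import cdist

# small random check: 5 test samples, 7 training samples, 3 features
rng = np.random.RandomState(0)
X_small = rng.rand(5, 3)
X_train_small = rng.rand(7, 3)

ip = X_small.dot(X_train_small.T)
XT2 = np.sum(X_train_small ** 2, axis=1)
X2 = np.sum(X_small ** 2, axis=1)
dists = np.sqrt(-2 * ip + XT2 + X2.reshape(-1, 1))

print np.allclose(dists, cdist(X_small, X_train_small))  # expect True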
Predict
from collections import Counter

def predict_labels(self, dists, k=1):
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in xrange(num_test):
        # pick the k nearest neighbors
        closest_y = self.y_train[np.argsort(dists[i])[:k]]
        # count which class appears most often
        y_pred[i] = Counter(closest_y).most_common(1)[0][0]
    return y_pred
Cross validation
Cross validation is a process to determine hyper-parameters; in this case, k is a hyper-parameter. We have X_train and X_test; now we subdivide X_train into cv_X_train and cv_X_test, where cv_ means cross validation.
num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
X_train_folds = np.array(np.array_split(X_train, num_folds))
y_train_folds = np.array(np.array_split(y_train, num_folds))
k_to_accuracies = {}
dim = X_train.shape[1]
from cs231n.classifiers import KNearestNeighbor
for k in k_choices:
    k_to_accuracies[k] = [0] * num_folds
    for test_idx in range(num_folds):
        train_ids = [idx for idx in range(num_folds) if idx != test_idx]
        cv_X_train = X_train_folds[train_ids].reshape(-1, dim)
        cv_X_test = X_train_folds[test_idx]
        cv_y_train = y_train_folds[train_ids].reshape(-1)
        cv_y_test = y_train_folds[test_idx]
        classifier = KNearestNeighbor()
        classifier.train(cv_X_train, cv_y_train)
        dists = classifier.compute_distances_no_loops(cv_X_test)
        cv_y_test_pred = classifier.predict_labels(dists, k)
        num_correct = np.sum(cv_y_test_pred == cv_y_test)
        num_test = len(cv_y_test)
        accuracy = float(num_correct) / num_test
        k_to_accuracies[k][test_idx] = accuracy
# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print 'k = %d, accuracy = %f' % (k, accuracy)
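After this, one typically picks the k with the best mean accuracy across the folds. A minimal sketch of that selection step, continuing from the snippet above (best_k and mean_accuracies are my own names, not from the notebook):

# average accuracy over folds for each k, then take the best one
mean_accuracies = {k: np.mean(k_to_accuracies[k]) for k in k_to_accuracies}
best_k = max(mean_accuracies, key=mean_accuracies.get)
print 'best k = %d, mean accuracy = %f' % (best_k, mean_accuracies[best_k])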
2. SVM
Here I omit the pre-processing part (subtracting the mean image and appending an extra '1' dimension to each sample), which is explained in the lecture notes.
Denote
- $x^{(i)}$ as $i^{th}$ column of $X$, i.e. $i^{th}$ sample
- $w_{j}$ as $j^{th}$ row of $W$
- $N$ as # of training samples
- $C$ as # of classes
Then
- Score function (scores or wx): $f^{(i)} = W x^{(i)}$
- Loss function: $L_{i} = \sum_{j \neq y_{i}} \max \left( 0, f^{(i)}_{j} - f^{(i)}_{y_{i}} + \Delta \right)$
- Gradient of loss (ddW): for $j \neq y_{i}$, $\nabla_{w_{j}} L_{i} = \mathbb{1} \left( f^{(i)}_{j} - f^{(i)}_{y_{i}} + \Delta > 0 \right) x^{(i)T}$, and $\nabla_{w_{y_{i}}} L_{i} = - \left( \sum_{j \neq y_{i}} \mathbb{1} \left( f^{(i)}_{j} - f^{(i)}_{y_{i}} + \Delta > 0 \right) \right) x^{(i)T}$
- Total loss (loss): $L = \frac{1}{N} \sum_{i} L_{i} + \frac{1}{2} \lambda \sum_{k} \sum_{l} W_{k,l}^{2}$, where $\lambda$ corresponds to reg in the code
- Gradient of total loss (dW): $\nabla_{W} L = \frac{1}{N} \sum_{i} \nabla_{W} L_{i} + \lambda W$

Variables that represent these terms are used in the following snippets.
Train
Training means finding the best $W$, i.e. the one that minimizes the total loss $L$. Here we use gradient descent, so we need to compute the total loss and its gradient.
Loss by nested loop
The outer loop iterates i of $L_{i}$ over the $N$ samples; the inner loop iterates j of $w_{j}$ over the $C$ classes.
def svm_loss_naive(W, X, y, reg):
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    delta = 1
    num_classes = W.shape[0]
    num_train = X.shape[1]
    loss = 0.0
    for i in xrange(num_train):
        ddW = np.zeros(W.shape)
        ddWyi = np.zeros(W[0].shape)
        scores = W.dot(X[:, i])
        correct_class_score = scores[y[i]]
        for j in xrange(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct_class_score + delta
            if margin > 0:
                loss += margin
                ddW[j] = X[:, i]  # row j gets x_i (weight 1)
                ddWyi += ddW[j]
        ddW[y[i]] = -ddWyi
        dW += ddW
    # divide by num_train
    loss /= num_train
    dW /= num_train
    # add regularization term
    loss += 0.5 * reg * np.sum(W * W)
    dW += reg * W
    return loss, dW
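Before moving on, it is worth checking this analytic gradient numerically, as the assignment does. The assignment provides its own gradient-checking helper; the function below is my own simplified sketch of the idea (a centered finite difference on a few random entries of W), and X_dev, y_dev stand for the small development subset from the notebook:

def numeric_grad_check(f, W, analytic_grad, num_checks=5, h=1e-5):
    # compare a few randomly chosen entries of the analytic gradient
    # against a centered finite difference of the loss
    for _ in xrange(num_checks):
        ix = tuple([np.random.randint(m) for m in W.shape])
        old = W[ix]
        W[ix] = old + h
        fxph = f(W)   # loss at W + h
        W[ix] = old - h
        fxmh = f(W)   # loss at W - h
        W[ix] = old   # restore
        grad_numerical = (fxph - fxmh) / (2 * h)
        grad_analytic = analytic_grad[ix]
        rel_error = abs(grad_numerical - grad_analytic) / (abs(grad_numerical) + abs(grad_analytic) + 1e-12)
        print 'numerical: %f analytic: %f, relative error: %e' % (grad_numerical, grad_analytic, rel_error)

# usage:
# loss, dW = svm_loss_naive(W, X_dev, y_dev, 0.0)
# numeric_grad_check(lambda w: svm_loss_naive(w, X_dev, y_dev, 0.0)[0], W, dW)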
Loss by half vectorization
The inner loop can be vectorized. Rather than computing margin class by class under j, we can compute all of them at once. This requires scores and correct_class_score. scores is already a matrix with one column per sample, so we only need to obtain correct_class_score for each sample and broadcast it across the classes. Sorry about the naming: below I use wx for scores and judge for the margins. Here's the snippet:
def svm_loss_vectorized(W, X, y, reg):
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    wx = W.dot(X)
    delta = 1
    ### loss
    # wxy picks the score of the correct label for each sample
    # its shape is (#samples,)
    wxy = [wx[y[i], i] for i in xrange(wx.shape[1])]
    # judge holds the margin expression for every (class, sample) pair
    # remember to exclude the y[i] entries
    judge = wx - wxy + delta
    # make judge 0 on y[i]
    for i in xrange(wx.shape[1]):
        judge[y[i], i] = 0
    # mass is a matrix holding all useful temp results
    # shape of judge (and mass) is (#classes, #samples)
    mass = np.maximum(0, judge)
    loss = np.sum(mass) / X.shape[1]
    loss += 0.5 * reg * np.sum(W * W)
Now let me explain the dW part. In each iteration over i, dW accumulates a ddW. Each row of ddW is x[:, i] times a weight. For rows j that are not y[i], the weight is 1 if the margin is positive and 0 otherwise, i.e. ddW[j] = x[:, i] whenever the margin is positive. For the specific row y[i], the weight is the negative sum of all the other rows' weights. This logic corresponds to the definition of $\nabla_{w_{j}} L_{i}$ and $\nabla_{w_{y_{i}}} L_{i}$ above:
def svm_loss_vectorized(W, X, y, reg):
    # continued from the last snippet
    # weights to be multiplied with X
    # their shape is (#classes, #samples)
    weight = np.array((judge > 0).astype(int))
    # the weight on row y[i] needs special care
    weight_yi = -np.sum(weight, axis=0)
    for i in xrange(wx.shape[1]):
        weight[y[i], i] = weight_yi[i]
    # half vectorized
    for i in xrange(X.shape[1]):
        ddW = X[:, i] * weight[:, i].reshape(-1, 1)
        dW += ddW
    dW /= X.shape[1]
    dW += reg * W
    return loss, dW
Comparison of the two-loop and half-vectorized versions
Following the IPython notebook and testing on a training set of shape (3073, 49000) with 10 labels, one evaluation of the loss and gradient with the two-loop version takes around 7s, while the half-vectorized version takes around 4s. The asymptotic complexity doesn't change, i.e. it is still $O \left( N \cdot D \cdot C \right)$; the speed-up comes from the fact that Python is slow at for loops, while numpy's matrix operations are heavily optimized.
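For reference, this is roughly how such a timing comparison can be done (the notebook has its own timing cell; the code below is written from memory and assumes W, X_train, y_train are the preprocessed arrays):

import time

tic = time.time()
loss_naive, grad_naive = svm_loss_naive(W, X_train, y_train, 0.00001)
toc = time.time()
print 'naive loss: %e computed in %fs' % (loss_naive, toc - tic)

tic = time.time()
loss_vectorized, grad_vectorized = svm_loss_vectorized(W, X_train, y_train, 0.00001)
toc = time.time()
print 'vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic)

print 'difference: %f' % (loss_naive - loss_vectorized)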
Gradient descent
This part is comparatively easy. Having the gradient $\nabla_{W}L$, we simply subtract $\alpha \nabla_{W}L$ from $W$ in each iteration, where $\alpha$ is the learning rate.
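A minimal sketch of that training loop follows. In the assignment this lives inside a linear classifier's train method, which (if I remember correctly) also samples random mini-batches; batch sampling is omitted here, and loss_fn, learning_rate, reg, num_iters are illustrative names and values of my own:

def gradient_descent(W, X, y, loss_fn, learning_rate=1e-7, reg=5e4, num_iters=1500):
    # plain (full-batch) gradient descent; loss_fn is e.g. svm_loss_vectorized
    loss_history = []
    for it in xrange(num_iters):
        loss, dW = loss_fn(W, X, y, reg)
        loss_history.append(loss)
        W -= learning_rate * dW  # step against the gradient
        if it % 100 == 0:
            print 'iteration %d / %d: loss %f' % (it, num_iters, loss)
    return W, loss_history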
Predict
The process of prediction is simply choosing the label with the highest score; I omit the code here.
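For completeness, with the column-as-sample layout used above, prediction would look roughly like this (my own sketch, not the assignment's exact code):

import numpy as np

def predict(W, X):
    # X has shape (D, #samples); each column is a sample
    scores = W.dot(X)                   # shape (C, #samples)
    y_pred = np.argmax(scores, axis=0)  # index of the highest score per column
    return y_pred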
3. Softmax
Keeping the symbols defined in SVM, here we have:
- Score function: $f^{(i)} = W x^{(i)}$
- Output function: $p^{(i)}_{j} = \frac{e^{f^{(i)}_{j}}}{\sum_{k} e^{f^{(i)}_{k}}}$
- Loss function: $L_{i} = - \log p^{(i)}_{y_{i}} = - f^{(i)}_{y_{i}} + \log \sum_{k} e^{f^{(i)}_{k}}$
- Gradient function: $\nabla_{W} L_{i} = - \nabla_{W} f^{(i)}_{y_{i}} + \frac{1}{\sum_{k} e^{f^{(i)}_{k}}} \sum_{j} e^{f^{(i)}_{j}} \nabla_{W} f^{(i)}_{j}$, where $\nabla_{W} f^{(i)}_{j}$ is a matrix whose only non-zero row is the $j^{th}$ one, equal to $x^{(i)T}$. Or you may refer to the equivalent form $\nabla_{w_{j}} L_{i} = \left( p^{(i)}_{j} - \mathbb{1} \left( j = y_{i} \right) \right) x^{(i)T}$.
- Total loss: $L = \frac{1}{N} \sum_{i} L_{i} + \frac{1}{2} \lambda \sum_{k} \sum_{l} W_{k,l}^{2}$
- Gradient of total loss: $\nabla_{W} L = \frac{1}{N} \sum_{i} \nabla_{W} L_{i} + \lambda W$

According to the lecture notes, for numerical stability we adjust $f^{(i)}$ to $f^{(i)} - \max_{j} f^{(i)}_{j}$, which does not change the outputs or the loss. Here's the snippet, very similar to that of SVM:
def softmax_loss_naive(W, X, y, reg):
    dW = np.zeros_like(W)
    ### loss
    scores = W.dot(X)
    # shift scores by the column-wise max for numerical stability
    scores_max = np.max(scores, axis=0)
    scores -= scores_max
    exp_scores = np.exp(scores)
    sums = np.sum(exp_scores, axis=0)
    log_sums = np.log(sums)
    scores_y = np.array([scores[y[i], i] for i in xrange(X.shape[1])])
    loss = np.sum(-scores_y + log_sums)
    loss /= X.shape[1]
    loss += 0.5 * reg * np.sum(W * W)
    ### dW
    for i in xrange(X.shape[1]):
        # dW += 1./sums[i] * log_sums[i] * X[:, i]
        dW += 1. / sums[i] * exp_scores[:, i].reshape(-1, 1) * X[:, i]
        dW[y[i]] -= X[:, i]
    dW /= X.shape[1]
    dW += reg * W
    return loss, dW
Note the sigma term $\sum_{j} e^{f^{(i)}_{j}} \nabla_{W} f^{(i)}_{j}$. In Python it becomes exp_scores[:, i].reshape(-1, 1) * X[:, i], because for each j there is only one non-zero row in $\nabla_{W} f^{(i)}_{j}$ (the $j^{th}$ row, equal to $x^{(i)T}$), so the sum over $j$ is just the outer product of $e^{f^{(i)}}$ and $x^{(i)}$. The shape of $e^{f^{(i)}}$ is (C,) and the shape of $x^{(i)}$ is (D,). Making use of broadcasting, we reshape $e^{f^{(i)}}$ to (C, 1) and multiply to get a (C, D) matrix; that is exactly what we want to add to dW (up to the $\frac{1}{\sum_{k} e^{f^{(i)}_{k}}}$ factor in the code).
Final words
Thanks to everyone who has pointed out my mistakes. If there are any more, please don't hesitate to tell me. The original purpose of this blog is just to note down my immediate thoughts while following cs231n in my spare time. It is written entirely from an amateur's point of view; I would be happy if it helps you.
The official versions of SVM and Softmax are really nice, and they come up again in later assignments.