A From-Scratch Implementation of Neural Networks in Python
By Bas Machielsen
June 1, 2021
Introduction
In this blog post, I wanted to write down an easy implementation of neural networks in Python. The architecture of a neural network rests on three pillars:
- The actual model (Multi-layer perceptron network), with inputs, layers, and outputs, and weights connecting these.
- Forward propagation - the loop from inputs to subsequent layers and finally, the output, using given weights.
- Backward propagation - updating the weights with gradient descent to minimize the loss.
In this blog post, I want to write about all three of them. I am borrowing heavily from two good resources: first, the video series on neural networks by 3blue1brown on YouTube, and second, a tutorial by Valerio Velardo. If you make the effort to watch these videos, you will see that Valerio's series provides the implementation architecture of neural networks in Python, but I think the notation used by 3blue1brown is a little clearer. I will therefore slightly change the code so that it matches the notation used in the 3blue1brown videos. Another good resource is this book and its accompanying code, which also contains a Python implementation of neural networks that is virtually equivalent to mine, although it uses some operators with which I am less familiar.
Multilayer perceptron architecture and forward propagation
Our goal is to write a function that takes an \(n\)-dimensional vector \((n_1, \dots, n_n)\) as an input, passes it through one or more intermediate (hidden) layers, and finally transforms it into a \(k\)-dimensional output. Suppose the second layer has \(l\) perceptrons, \(a_1, \dots, a_l\). Then there have to be \(n \cdot l\) weights connecting the \(n\) input perceptrons to the \(l\) perceptrons in layer 2. Following the notation of 3blue1brown, we take element \((i,j)\) of the weight matrix to be the weight running to node \(i\) from node \(j\): when \(i\) changes, the node to which the weight goes changes, and when \(j\) changes, the node from which the weight comes changes. In other words, different rows represent different destination nodes, and different columns represent different origin nodes:
$$ W_{layer 1 \rightarrow layer 2} = \begin{bmatrix} w_{n_1 \rightarrow a_1} & w_{n_2 \rightarrow a_1} & \dots & w_{n_n \rightarrow a_1} \newline w_{n_1 \rightarrow a_2} & w_{n_2 \rightarrow a_2} & \dots & w_{n_n \rightarrow a_2} \newline \vdots & & \ddots & \vdots \newline w_{n_1 \rightarrow a_l} & w_{n_2 \rightarrow a_l} & \dots & w_{n_n \rightarrow a_l} \end{bmatrix} $$
This \(l \times n\) matrix is then multiplied with an \(n \times 1\) vector to obtain an \(l \times 1\) vector, which represents all the perceptrons in layer 2. Multiplying out the above matrix with an input vector \(x = (n_1, \dots, n_n)^T\) gives:
$$ W\mathbf{x} = \begin{bmatrix} w_{n_1 \rightarrow a_1} & w_{n_2 \rightarrow a_1} & \dots & w_{n_n \rightarrow a_1} \newline w_{n_1 \rightarrow a_2} & w_{n_2 \rightarrow a_2} & \dots & w_{n_n \rightarrow a_2} \newline \vdots & & \ddots & \vdots \newline w_{n_1 \rightarrow a_l} & w_{n_2 \rightarrow a_l} & \dots & w_{n_n \rightarrow a_l} \end{bmatrix} \begin{bmatrix} n_1 \newline \vdots \newline n_n \end{bmatrix} = \newline \begin{bmatrix} n_1 \cdot w_{n_1 \rightarrow a_1} + \dots + n_n \cdot w_{n_n \rightarrow a_1} \newline \vdots \newline n_1 \cdot w_{n_1 \rightarrow a_l} + \dots + n_n \cdot w_{n_n \rightarrow a_l} \end{bmatrix} $$
So each entry of the resulting vector is a weighted sum of the input entries. This represents the first step in ‘forward propagation’: the multiplication of weights and inputs. Subsequent steps only change the dimensions, that is, the matrix \(W\) changes to proceed from the \(l\) perceptrons to the next intermediate layer, or to the output layer. In general, if layer \(L-1\) has \(l\) elements and layer \(L\) has \(k\) elements, then the weight matrix should have \(k \times l\) elements, with each element \((i,j)\) linking element \(i\) in layer \(L\) to element \(j\) in layer \(L-1\). Of course, each new layer should also be passed through an activation function (e.g. the sigmoid), to which I don’t want to pay too much attention here.
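To make the dimensions concrete, here is a small numerical sketch of the matrix-vector product above, with hypothetical sizes \(n = 4\) and \(l = 3\):

```python
import numpy as np

rng = np.random.default_rng(1)

n, l = 4, 3              # hypothetical sizes: 4 input nodes, 3 nodes in layer 2
W = rng.random((l, n))   # row i holds the weights going *into* node a_i
x = rng.random(n)        # input vector (n_1, ..., n_n)

a = W @ x                # one entry per node in layer 2
print(a.shape)           # (3,)

# entry i is the dot product of row i of W with the input vector
print(np.allclose(a[0], W[0] @ x))  # True
```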
Implementation in Python looks as follows, where num_inputs and num_outputs represent the dimensions of the input and output layers, and hidden_layers is a list of numbers indicating (i) how many hidden layers there should be (the length of the list), and (ii) how many perceptrons (nodes) each of them should have:
import numpy as np

class MLP(object):

    def __init__(self, num_inputs=3, hidden_layers=[3, 3], num_outputs=2):
        self.num_inputs = num_inputs
        self.hidden_layers = hidden_layers
        self.num_outputs = num_outputs

        # create a list that saves the dimensions of the layers
        layers = [num_inputs] + hidden_layers + [num_outputs]

        # fill in random weights with the correct dimensions:
        # if layer i has k elements and layer i+1 has j elements, the matrix should have j x k elements
        weights = []
        for i in range(len(layers) - 1):
            w = np.random.rand(layers[i+1], layers[i])
            weights.append(w)
        self.weights = weights

    def forward_propagate(self, inputs):
        # the input layer activation is just the input itself
        activations = inputs

        # iterate through the network layers
        for w in self.weights:
            # matrix multiplication between the weight matrix and the previous activations
            net_inputs = w @ activations

            # apply the sigmoid activation function
            activations = self._sigmoid(net_inputs)

        # return the output layer activation
        return activations

    def _sigmoid(self, x):
        y = 1.0 / (1 + np.exp(-x))
        return y
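Stripped of the class machinery, the same forward pass is just a loop of multiply-and-squash steps. A minimal sketch, assuming the default layer sizes from above (3 inputs, two hidden layers of 3 nodes, 2 outputs):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

rng = np.random.default_rng(0)

# hypothetical layer sizes matching the class defaults: MLP(3, [3, 3], 2)
layers = [3, 3, 3, 2]
weights = [rng.random((layers[i+1], layers[i])) for i in range(len(layers) - 1)]

activations = rng.random(3)  # an arbitrary input vector
for w in weights:
    activations = sigmoid(w @ activations)

print(activations.shape)  # (2,): one value per output node
```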
Forward propagation is achieved by a loop that multiplies each successive layer (starting with the input layer) by the respective weight matrix. This implementation overwrites the activations on every iteration to save memory. Of course, it would also have been possible to save the product w @ activations for each iteration of the for loop. Next, we look at backward propagation, where we incorporate exactly that insight into the algorithm.
Backward propagation
Backward propagation calculates the partial derivatives of the cost function with respect to all weight matrices. We find those in a recursive way, as implemented in the following algorithm. We then use a gradient descent algorithm to change the weight matrices in the direction in which the cost function decreases the most. Mathematically, what we attempt to implement is a gradient consisting of the derivatives of the cost function with respect to the weights:
$$ \frac{\partial C}{\partial w^{(l)}_{jk}} = a^{(l-1)}_k \cdot \sigma'(z^{(l)}_j) \cdot \frac{\partial C}{\partial a^{(l)}_j} \text{ where: } \newline \frac{\partial C}{\partial a^{(l)}_k} = \begin{cases} \sum_{j=1}^{n_{l+1}} w^{(l+1)}_{jk} \cdot \sigma'(z_j^{(l+1)}) \cdot \frac{\partial C}{\partial a_j^{(l+1)}} &\text{ if } l \neq L \newline (a^{(L)}_k - y_k) &\text{ if } l = L \end{cases} $$
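Before implementing this, we can sanity-check the chain rule above on a tiny 1-1-1 network with hypothetical scalar weights, using the cost \(C = \frac{1}{2}(a^{(L)} - y)^2\) so that \(\partial C / \partial a^{(L)} = a^{(L)} - y\), and comparing the analytic derivatives against finite differences:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

# tiny 1-1-1 network with illustrative scalar values
w1, w2, x, y = 0.5, -0.3, 0.8, 1.0

def cost(w1, w2):
    a1 = sigmoid(w1 * x)   # hidden activation
    a2 = sigmoid(w2 * a1)  # output activation a^(L)
    return 0.5 * (a2 - y) ** 2

# analytic gradient via the chain rule above
a1 = sigmoid(w1 * x)
a2 = sigmoid(w2 * a1)
dC_da2 = a2 - y                           # base case (l = L)
dC_dw2 = a1 * a2 * (1 - a2) * dC_da2      # a^(l-1) * sigma'(z^(l)) * dC/da^(l)
dC_da1 = w2 * a2 * (1 - a2) * dC_da2      # recursive case (l != L)
dC_dw1 = x * a1 * (1 - a1) * dC_da1

# finite-difference check
eps = 1e-6
num_w2 = (cost(w1, w2 + eps) - cost(w1, w2 - eps)) / (2 * eps)
num_w1 = (cost(w1 + eps, w2) - cost(w1 - eps, w2)) / (2 * eps)
print(np.isclose(dC_dw2, num_w2))  # True
print(np.isclose(dC_dw1, num_w1))  # True
```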
In Python, the following code would implement the network with backpropagation using the same notational logic as we’ve used before. First, we define the MLP object and forward propagation, now saving all the activations and raw inputs (named z’s):
import numpy as np

class MLP(object):

    def __init__(self, num_inputs=3, hidden_layers=[3, 3], num_outputs=2):
        self.num_inputs = num_inputs
        self.hidden_layers = hidden_layers
        self.num_outputs = num_outputs

        # create a generic representation of the layers
        layers = [num_inputs] + hidden_layers + [num_outputs]

        # create random connection weights for the layers
        weights = []
        for i in range(len(layers) - 1):
            w = np.random.rand(layers[i+1], layers[i])
            weights.append(w)
        self.weights = weights

        # save derivatives per layer
        derivatives = []
        for i in range(len(layers) - 1):
            d = np.zeros((layers[i+1], layers[i]))
            derivatives.append(d)
        self.derivatives = derivatives

        # save activations per layer
        activations = []
        for i in range(len(layers)):
            a = np.zeros(layers[i])
            activations.append(a)
        self.activations = activations

        # save raw inputs per layer
        zs = []
        for i in range(len(layers)):
            z = np.zeros(layers[i])
            zs.append(z)
        self.zs = zs
Then, we redefine forward propagation while saving the activations and raw inputs:
def forward_propagate(self, inputs):
    # the input layer activation is just the input itself
    activations = inputs

    # save the activations for backpropagation
    self.activations[0] = activations

    # iterate through the network layers
    for i, w in enumerate(self.weights):
        # matrix multiplication between the weight matrix and the previous activations
        z = np.dot(w, activations)

        # apply the sigmoid activation function
        activations = self._sigmoid(z)

        # save the raw inputs and activations for backpropagation
        self.zs[i + 1] = z
        self.activations[i + 1] = activations

    # return the output layer activation
    return activations
Finally, we implement backward propagation as defined previously. We start with an ’error’, meaning \(a^{(L)} - y\) (up to sign: the code below passes target minus output, so the stored derivatives carry the descent sign). From it, we compute \(\delta^{(L)} = (a^{(L)} - y) \odot \sigma'(z^{(L)})\). Afterwards, we implement the derivatives \(\frac{\partial C}{\partial w^{(l)}_{jk}} = \delta^{(l)}_j \cdot a^{(l-1)}_k\) in matrix form, i.e. as the outer product \(\delta^{(l)} (a^{(l-1)})^T\). In this code, \(i = l-1\). Finally, we update the error by redefining \(\delta^{(l)} = ((w^{(l+1)})^T \delta^{(l+1)}) \odot \sigma'(z^{(l)})\).
def back_propagate(self, error):
    # iterate backwards through the network layers
    for i in reversed(range(len(self.derivatives))):
        # get the activations for this layer
        activations = self.activations[i+1]

        # apply the sigmoid derivative - * denotes the Hadamard product;
        # since sigma'(z) = sigma(z) * (1 - sigma(z)), we compute it from the activations
        delta = error * self._sigmoid_derivative(activations)

        # get the activations for the previous layer
        current_activations = self.activations[i]

        # save the derivative as the outer product delta * (activations)^T
        self.derivatives[i] = np.outer(delta, current_activations)

        # backpropagate the error to the preceding layer
        error = np.dot(self.weights[i].transpose(), delta)
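A detail worth noting: with 1-D activation vectors, np.dot(delta, current_activations.transpose()) would compute an inner product (a scalar), whereas the derivative \(\delta^{(l)} (a^{(l-1)})^T\) is an outer product with the same shape as the weight matrix, hence np.outer above. A small sketch with made-up numbers:

```python
import numpy as np

delta = np.array([0.1, -0.2])         # delta for a layer with 2 nodes
prev_act = np.array([0.3, 0.5, 0.7])  # activations of the previous layer (3 nodes)

# one row per destination node, one column per origin node,
# matching the shape of the corresponding weight matrix
grad = np.outer(delta, prev_act)
print(grad.shape)  # (2, 3)
```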
Conclusion
These code chunks represent the core of the neural network architecture. I have not emphasized gradient descent, which is a way to update the weights given the current weights and the derivatives. (Note that because the error is defined as target minus output in train below, the stored derivatives already point in the descent direction, which is why gradient_descent adds them to the weights.) All the auxiliary functions, including gradient descent, are included in the following chunk:
def gradient_descent(self, learningRate=1):
    # update the weights by stepping down the gradient
    for i in range(len(self.weights)):
        weights = self.weights[i]
        derivatives = self.derivatives[i]
        weights += derivatives * learningRate

def _sigmoid(self, x):
    y = 1.0 / (1 + np.exp(-x))
    return y

def _sigmoid_derivative(self, x):
    # note: x is expected to be a sigmoid activation a = sigmoid(z), so that a * (1 - a) = sigma'(z)
    return x * (1.0 - x)

def _mse(self, target, output):
    return np.average((target - output) ** 2)

def train(self, inputs, targets, epochs, learning_rate):
    # now enter the training loop
    for i in range(epochs):
        sum_errors = 0

        # iterate through all the training data
        for j, input in enumerate(inputs):
            target = targets[j]

            # activate the network!
            output = self.forward_propagate(input)

            error = target - output
            self.back_propagate(error)

            # now perform gradient descent on the derivatives
            # (this will update the weights)
            self.gradient_descent(learning_rate)

            # keep track of the MSE for reporting later
            sum_errors += self._mse(target, output)

        # epoch complete, report the training error
        print("Error: {} at epoch {}".format(sum_errors / len(inputs), i+1))

    print("Training complete!")
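Putting the pieces together, here is a condensed, self-contained sketch of the whole training loop on a hypothetical toy task (learning to add two numbers, in the spirit of Valerio Velardo's tutorial), using the same error convention and update rule as the code above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

rng = np.random.default_rng(42)

# toy dataset: add two numbers in [0, 0.5], so the target stays in sigmoid's (0, 1) range
inputs = rng.random((500, 2)) / 2
targets = inputs.sum(axis=1)

# network: 2 inputs, one hidden layer of 5 nodes, 1 output
layers = [2, 5, 1]
weights = [rng.random((layers[i+1], layers[i])) for i in range(len(layers) - 1)]

learning_rate = 1
for epoch in range(100):
    for x, t in zip(inputs, targets):
        # forward pass, saving every layer's activations
        acts = [x]
        for w in weights:
            acts.append(sigmoid(w @ acts[-1]))

        # backward pass, with the error convention target - output
        error = t - acts[-1]
        for i in reversed(range(len(weights))):
            # sigma'(z) computed from the activation: a * (1 - a)
            delta = error * acts[i+1] * (1 - acts[i+1])
            error = weights[i].T @ delta
            # derivatives already carry the descent sign, so we add them
            weights[i] += learning_rate * np.outer(delta, acts[i])

# the trained network should approximate 0.2 + 0.3 = 0.5
pred = sigmoid(weights[1] @ sigmoid(weights[0] @ np.array([0.2, 0.3])))
print(pred)  # close to 0.5
```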
I hope all of this was useful, thanks for reading!