The Mathematics Underlying Deep Learning
By Bas Machielsen
May 29, 2021
Introduction
In this post, I want to provide a quick primer on the mathematics underlying the backpropagation algorithm in neural networks and deep learning. I use the notation provided in this book, chapter 2, and prove/explain in depth where some of the equations come from. I hope this provides an intuitive and clear explanation of the logic underlying backpropagation, and that it makes the algorithm easy to implement in code.
Delta
\(\delta^l\) is defined as \(\frac{\partial C}{\partial z^l}\), or in component form: \(\delta^l_j = \frac{\partial C}{\partial z^l_j}\). The network is defined such that the matrix \(W^l\) represents the weights from layer \(a^{l-1}\) to layer \(a^l\). We start indexation at 1 (so not at 0!). Hence, layer \(a^0\) and weights \(W^1\) do not exist, and layer \(a^1\) is the input layer. There are a total of \(L\) layers in the network, each of which can have a variable number of nodes. First, we find an expression for \(\delta^L\), meaning delta in the last layer.
By the chain rule, and recognizing that the only way \(z^L_j\) influences the cost function is through \(a^L_j\), we know that:
$$ \delta^L_j = \frac{\partial C}{\partial a^L_j} \cdot \frac{\partial a^L_j}{\partial z^L_j} $$
The first part of this derivative is determined by the cost function, and the second part, \(\frac{\partial a^L_j}{\partial z^L_j}\), evaluates to \(\sigma'(z^L_j)\). Hence, in matrix form, the expression for \(\delta^L\) is:
$$ \begin{pmatrix} \vdots \newline \delta^L_j \newline \vdots \end{pmatrix} = \begin{pmatrix} \frac{\partial C}{\partial a^L_1} \newline \vdots \newline \frac{\partial C}{\partial a^L_j} \newline \vdots \end{pmatrix} \circ \begin{pmatrix} \sigma'(z^L_1) \newline \vdots \newline \sigma'(z^L_j) \newline \vdots \end{pmatrix} $$
or alternatively, simply (with \(\circ\) denoting the element-wise Hadamard product):
$$ \delta^L = \frac{\partial C}{\partial a^L} \circ \sigma'(z^L) $$
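As a minimal sketch of this first equation, assuming a quadratic cost \(C = \frac{1}{2} \lVert a^L - y \rVert^2\) (so that \(\frac{\partial C}{\partial a^L} = a^L - y\)) and sigmoid activations; the function names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def delta_last_layer(z_L, y):
    """delta^L = (a^L - y), element-wise with sigma'(z^L), for a quadratic cost."""
    a_L = sigmoid(z_L)
    return (a_L - y) * sigmoid_prime(z_L)  # element-wise (Hadamard) product
```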
From layer L to layer L-1
As soon as we have a \(\delta\) for layer \(L\), or for a layer \(l+1\) more generally, we want to find a delta for the previous layer. In the next subsection, it will become clear why. To fix ideas, suppose there are \(K\) elements in layer \(l+1\), indexed by \(k\), and there are \(J\) elements in layer \(l\), indexed by \(j\). Then, we start off with the definition of \(\delta^l\) in component form, recognizing that node \(j\) influences the cost through each of the nodes \(1\) to \(K\) in layer \(l+1\) that its weights proceed to:
$$ \delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \cdot \frac{\partial z^{l+1}_k}{\partial z^l_j} $$
The first term is by definition equal to \(\delta^{l+1}_k\). The second term can be obtained by recognizing that:
$$ z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j = \sum_j w^{l+1}_{kj} \sigma(z^l_j) $$
Differentiating this expression with respect to \(z^l_j\) gives (only the term containing that one specific \(z^l_j\) survives the differentiation):
$$ \frac{\partial z^{l+1}_k}{\partial z^l_j}=w^{l+1}_{kj}\sigma'(z^l_j) $$
Substituting this expression back into the expression for \(\delta^l_j\) gives:
$$ \delta^l_j = \sum_k \delta^{l+1}_k \cdot w_{kj}^{l+1} \cdot \sigma'(z^l_j) $$
In matrix form, this expression becomes:
$$ \delta^l = (W^{l+1})^T \delta^{l+1} \circ \sigma'(z^l) $$
or, more explicitly,
$$ \begin{pmatrix} \delta^l_1 \newline \vdots \newline \delta^l_J \end{pmatrix} = \begin{pmatrix} w^{l+1}_{11} & w^{l+1}_{21} & \dots & w^{l+1}_{K1} \newline w^{l+1}_{12} & \dots & \dots & w^{l+1}_{K2} \newline \vdots & \ddots & \ddots & \vdots \newline w^{l+1}_{1J} & \dots & \dots & w^{l+1}_{KJ} \end{pmatrix} \begin{pmatrix} \delta^{l+1}_1 \newline \vdots \newline \delta^{l+1}_K \end{pmatrix} \circ \begin{pmatrix} \sigma'(z^l_1) \newline \vdots \newline \sigma'(z^l_J) \end{pmatrix} $$
with the dimensions \(J \times 1\) (layer \(l\)) = \(J \times K\) (transpose of the weight matrix) times \(K \times 1\) (layer \(l+1\)), taken element-wise with \(J \times 1\) (layer \(l\)).
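In code, this recursion is one matrix-vector product followed by one element-wise product. A minimal NumPy sketch, reusing `sigmoid_prime` from the sketch above, with illustrative argument names:

```python
def delta_previous_layer(W_next, delta_next, z_l):
    """delta^l = (W^{l+1})^T delta^{l+1}, element-wise with sigma'(z^l).

    W_next:     W^{l+1}, shape (K, J)
    delta_next: delta^{l+1}, shape (K,)
    z_l:        z^l, shape (J,)
    """
    # (J, K) @ (K,) gives (J,); the element-wise product keeps shape (J,).
    return (W_next.T @ delta_next) * sigmoid_prime(z_l)
```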
Relating Deltas to Derivatives
Now that we have an expression for all \(\delta^l\)'s, we have to relate the weight derivatives \(\frac{\partial C}{\partial w^l_{jk}}\) to the \(\delta\)'s. To see this, we realize that a particular weight to layer \(l\) only influences one particular node \(z^l_j\) in layer \(l\). Hence, we can write the partial derivative that we are looking for as:
$$
\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \cdot \frac{\partial z^l_j}{\partial w^l_{jk}}
$$
Then, by definition, \(\frac{\partial C}{\partial z^l_j} = \delta^l_j\). For the second partial derivative, we realize that, for a layer \(l\) with \(J\) nodes and a layer \(l-1\) with \(K\) nodes:
$$
\begin{pmatrix} z^l_1 \newline \vdots \newline z^l_J \end{pmatrix} =
\begin{pmatrix} w^l_{11} & \dots & w^l_{1K} \newline \vdots & \ddots & \vdots \newline w^l_{J1} & \dots & w^l_{JK} \end{pmatrix}
\begin{pmatrix} a^{l-1}_1 \newline \vdots \newline a^{l-1}_K \end{pmatrix}
$$
And thus, the derivative of \(z^l_j\) with respect to \(w^l_{jk}\) is just \(a^{l-1}_k\). Substituting that into the expression for \(\frac{\partial C}{\partial w^l_{jk}}\) gives:
$$ \frac{\partial C}{\partial w^l_{jk}} = \delta^l_j \cdot a^{l-1}_k $$
Finally, in matrix form, this amounts to:
$$ \begin{pmatrix} \frac{\partial C}{\partial w^l_{jk}} \end{pmatrix} = \begin{pmatrix} \delta^l_1 \newline \vdots \newline \delta^l_J \end{pmatrix} \cdot \begin{pmatrix} a^{l-1}_1 \newline \vdots \newline a^{l-1}_K \end{pmatrix}^T $$
This gives back a \(J \times K\) matrix, which after multiplying out exactly corresponds to the component form sketched out above. This is also what you would use in an implementation of backpropagation in code.
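To close the loop, here is a minimal sketch that assembles the three results into a full backward pass, under the same assumptions as above (quadratic cost, sigmoid activations, and no bias terms, since the derivation omits them); `backprop` and `weights` are illustrative names, and `sigmoid`/`sigmoid_prime` come from the first sketch:

```python
def backprop(weights, x, y):
    """Return the gradients dC/dW^l for every layer.

    weights[0] corresponds to W^2 in the text (W^1 does not exist),
    so weights[i] maps layer a^{i+1} to layer a^{i+2}.
    """
    # Forward pass: store every weighted input z^l and activation a^l.
    a, zs, activations = x, [], [x]
    for W in weights:
        z = W @ a
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    grads = [None] * len(weights)
    # Last layer: delta^L = (a^L - y), element-wise with sigma'(z^L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    # dC/dW^L = delta^L (a^{L-1})^T, an outer product.
    grads[-1] = np.outer(delta, activations[-2])
    # Walk backwards: delta^l = (W^{l+1})^T delta^{l+1}, element-wise with sigma'(z^l).
    for i in range(len(weights) - 2, -1, -1):
        delta = (weights[i + 1].T @ delta) * sigmoid_prime(zs[i])
        grads[i] = np.outer(delta, activations[i])
    return grads
```

Each entry of `grads` has the \(J \times K\) shape derived above, so a gradient-descent step would just subtract a multiple of it from the corresponding weight matrix.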