The Mathematics Underlying Deep Learning

By Bas Machielsen

May 29, 2021

Introduction

In this post, I want to provide a quick primer on the mathematics underlying the backpropagation algorithm in neural networks and deep learning. I use the notation provided in this book, chapter 2, and prove/explain in depth where some of the equations come from. I hope this provides an intuitive and clear explanation of the logic underlying backpropagation, and that it makes the algorithm easy to implement in code.

Delta

\(\delta^l\) is defined as \(\frac{\partial C}{\partial z^l}\), or in component form: \(\delta^l_j = \frac{\partial C}{\partial z^l_j}\). The network is defined such that the matrix \(W^l\) contains the weights from layer \(a^{l-1}\) to layer \(a^l\). We start indexing at 1 (not at 0!). Hence, layer \(a^0\) and the weights \(W^1\) do not exist, and layer \(a^1\) is the input layer. There are a total of \(L\) layers in the network, each of which can have a variable number of nodes. First, we find an expression for \(\delta^L\), the delta in the last layer.

By the chain rule, and recognizing that the only way \(z^L_j\) influences the cost function is through \(a^L_j\), we know that:

$$ \delta^L_j = \frac{\partial C}{\partial a^L_j} \cdot \frac{\partial a^L_j}{\partial z^L_j} $$

The first part of this derivative is determined by the cost function, and the second part \(\frac{\partial a^L_j}{\partial z^L_j}\) evaluates to \(\sigma'(z^L_j)\). Hence, in matrix form, the expression for \(\delta^L\) is:

$$ \begin{pmatrix} \vdots \newline \delta^L_j \newline \vdots \end{pmatrix} = \begin{pmatrix} \frac{\partial C}{\partial a^L_1} \newline \vdots \newline \frac{\partial C}{\partial a^L_j} \newline \vdots \end{pmatrix} \circ \begin{pmatrix} \sigma'(z^L_1) \newline \vdots \newline \sigma'(z^L_j) \newline \vdots \end{pmatrix} $$

or alternatively, simply:

$$ \delta^L = \frac{\partial C}{\partial a^L} \circ \sigma'(z^L) $$
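To make this concrete, here is a minimal NumPy sketch of \(\delta^L\). It assumes a quadratic cost \(C = \frac{1}{2} \| a^L - y \|^2\), so that \(\frac{\partial C}{\partial a^L} = a^L - y\), and sigmoid activations; the function names are illustrative, not taken from any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def delta_output_layer(a_L, z_L, y):
    # delta^L = dC/da^L, multiplied element-wise with sigma'(z^L);
    # here dC/da^L = a^L - y because of the quadratic cost.
    return (a_L - y) * sigmoid_prime(z_L)
```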

From layer \(l+1\) to layer \(l\)

As soon as we have a \(\delta\) for layer \(L\), or more generally for layer \(l\), we want to find the delta for the previous layer. In the next subsection, it will become clear why. To fix ideas, suppose there are \(K\) elements in layer \(l+1\), indexed by \(k\), and \(J\) elements in layer \(l\), indexed by \(j\). We start from the definition of \(\delta^l\) in component form and expand it over all the nodes \(1\) to \(K\) in layer \(l+1\) that node \(j\) feeds into:

$$ \delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \cdot \frac{\partial z^{l+1}_k}{\partial z^l_j} $$

The first term is by definition equal to \(\delta^{l+1}_k\). The second term can be obtained by recognizing that:

$$ z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j = \sum_j w^{l+1}_{kj} \sigma(z^l_j) $$

Differentiating this expression with respect to \(z^l_j\) gives (because only the term containing that specific \(z^l_j\) survives):

$$ \frac{\partial z^{l+1}_k}{\partial z^l_j}=w^{l+1}_{kj}\sigma'(z^l_j) $$

Substituting this expression back in the expression for \(\delta^l_j\) gives:

$$ \delta^l_j = \sum_k \delta^{l+1}_k \cdot w_{kj}^{l+1} \cdot \sigma'(z^l_j) $$

In matrix form, this expression becomes:

$$ \delta^l = (W^{l+1})^T \delta^{l+1} \circ \sigma'(z^l) $$

or, more explicitly,

$$ \begin{pmatrix} \delta^l_1 \newline \vdots \newline \delta^l_J \end{pmatrix} = \begin{pmatrix} w^{l+1}_{11} & w^{l+1}_{21} & \dots & w^{l+1}_{K1} \newline w^{l+1}_{12} & \dots & \dots & w^{l+1}_{K2} \newline \vdots & \ddots & \ddots & \vdots \newline w^{l+1}_{1J} & \dots & \dots & w^{l+1}_{KJ} \end{pmatrix} \begin{pmatrix} \delta^{l+1}_1 \newline \vdots \newline \delta^{l+1}_K \end{pmatrix} \circ \begin{pmatrix} \sigma'(z^l_1) \newline \vdots \newline \sigma'(z^l_J) \end{pmatrix} $$

with dimensions \(J \times 1\) (layer \(l\)) = \(J \times K\) (transpose of the weight matrix) times \(K \times 1\) (layer \(l+1\)), multiplied element-wise by \(J \times 1\) (layer \(l\)).
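In code, this recursion is a single line. Here is a sketch assuming `W_next` holds \(W^{l+1}\) (shape \(K \times J\)), `delta_next` holds \(\delta^{l+1}\) (shape \(K \times 1\)), `z_l` holds \(z^l\) (shape \(J \times 1\)), and `sigmoid_prime` is defined as above:

```python
def delta_previous_layer(W_next, delta_next, z_l):
    # delta^l = (W^{l+1})^T delta^{l+1}, multiplied element-wise with sigma'(z^l).
    # Shapes: (J, K) @ (K, 1) -> (J, 1), then element-wise with (J, 1).
    return (W_next.T @ delta_next) * sigmoid_prime(z_l)
```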

Relating Deltas to Derivatives

Now that we have an expression for all \(\delta^l\)'s, we have to relate the weight derivatives \(\frac{\partial C}{\partial w^l_{jk}}\) to the deltas. To see this, we realize that a particular weight into layer \(l\) only influences one particular node \(z_j^l\) in layer \(l\). Hence, we can write the partial derivative we are looking for as:

$$ \frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \cdot \frac{\partial z^l_j}{\partial w^l_{jk}} $$

Then, by definition, \(\frac{\partial C}{\partial z^l_j} = \delta^l_j\). For the second partial derivative, we realize that, for a layer \(l\) with \(J\) nodes and a layer \(l-1\) with \(K\) nodes:

$$ \begin{pmatrix} z^l_1 \newline \vdots \newline z^l_J \end{pmatrix} = \begin{pmatrix} w^l_{11} & \dots & w^l_{1K} \newline \vdots & \ddots & \vdots \newline w^l_{J1} & \dots & w^l_{JK} \end{pmatrix} \begin{pmatrix} a^{l-1}_1 \newline \vdots \newline a^{l-1}_K \end{pmatrix} $$

Thus, the derivative of \(z^l_j\) with respect to \(w^l_{jk}\) is just \(a^{l-1}_k\). Substituting that into the expression for \(\frac{\partial C}{\partial w^l_{jk}}\) gives:

$$ \frac{\partial C}{\partial w^l_{jk}} = \delta^l_j \cdot a^{l-1}_k $$

Finally, in matrix form, this amounts to:

$$ \begin{pmatrix} \frac{\partial C}{\partial w^l_{jk}} \end{pmatrix} = \begin{pmatrix} \delta^l_1 \newline \vdots \newline \delta^l_J \end{pmatrix} \cdot \begin{pmatrix} a^{l-1}_1 \newline \vdots \newline a^{l-1}_K \end{pmatrix}^T $$

This gives back a \(J\) by \(K\) matrix which, after multiplying out, corresponds exactly to the component form sketched out above. This is also what you would use in an implementation of backpropagation in code.
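Putting the three results together, here is a minimal backpropagation sketch that mirrors the derivation above (weights only, no biases; quadratic cost and sigmoid activations assumed). The conventions are my own: `x` and `y` are column vectors, and `weights[i]` holds \(W^{i+2}\), i.e. the matrix mapping layer \(i+1\) to layer \(i+2\).

```python
def backprop(weights, x, y):
    # Forward pass: store every z^l and a^l.
    a = x
    activations = [a]
    zs = []
    for W in weights:
        z = W @ a
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # delta^L = (a^L - y), element-wise with sigma'(z^L), for the quadratic cost.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    # dC/dW^L = delta^L (a^{L-1})^T
    grads = [delta @ activations[-2].T]

    # Backward pass: delta^l = (W^{l+1})^T delta^{l+1}, element-wise with sigma'(z^l),
    # and dC/dW^l = delta^l (a^{l-1})^T.
    for i in range(2, len(weights) + 1):
        delta = (weights[-i + 1].T @ delta) * sigmoid_prime(zs[-i])
        grads.insert(0, delta @ activations[-i - 1].T)

    return grads  # grads[i] has the same shape as weights[i]
```

For example, with `weights = [np.random.randn(3, 2), np.random.randn(1, 3)]`, `x` of shape `(2, 1)` and `y` of shape `(1, 1)`, the returned gradients have shapes `(3, 2)` and `(1, 3)`, exactly the \(J \times K\) matrices derived above.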
