# The back-propagation equations for a convolutional network

In this post, we'll derive the back-propagation equations for our convolutional net, which has the structure shown below. The square blocks refer to 2D matrices, while the rectangles represent matrices with a single row. They are drawn vertically, however, to make the diagram cleaner.

| matrix | size | meaning |
|---|---|---|
| $$\textbf{k}^i$$ | $$1\times n^k$$ | $$i^{th}$$ kernel |
| $$\textbf{F}$$ | $$n^k \times n^0$$ | image frame captures |
| $$\textbf{l}^{ci}$$ | $$1\times n^0$$ | $$i^{th}$$ pre-pooled layer |
| $$\textbf{l}^0$$ | $$1\times n^0$$ | pooled convolution layer |
| $$\textbf{w}^{01}$$ | $$n^0 \times n^1$$ | weights from layer 0 to 1 |
| $$\textbf{l}^1$$ | $$1\times n^1$$ | hidden layer |
| $$\textbf{w}^{12}$$ | $$n^1 \times n^2$$ | weights from layer 1 to 2 |
| $$\textbf{l}^2$$ | $$1\times n^2$$ | output layer |

In terms of components, the convolutional layer is given by

$$l_i^{cj}=\sum_m k_m^j F_{mi},$$

and

$$l_k^0 = \tanh \left[ \max \left(l_k^{c0},l_k^{c1},...,l_k^{cN}\right) \right].$$
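As a rough illustration of these two equations, here is a minimal pure-Python sketch of the convolution, max-pooling and tanh steps. The function name `conv_pool_layer` and the toy values for the kernels and the frame matrix are assumptions made for this example only; they are not taken from the net described in this post.

```python
import math

def conv_pool_layer(kernels, F):
    """Sketch of the convolutional layer (hypothetical helper).

    kernels: list of kernels, each a list of length n_k
    F:       frame matrix as a list of n_k rows, each of length n_0
    Returns the pooled layer l0 and the winner index w(p) for each position p.
    """
    n0 = len(F[0])
    # Pre-pooled layers: l^{cj}_i = sum_m k^j_m F_{mi}
    pre = [[sum(k[m] * F[m][i] for m in range(len(k))) for i in range(n0)]
           for k in kernels]
    # w(p): index of the kernel with the largest pre-pooled value at position p
    winners = [max(range(len(kernels)), key=lambda j: pre[j][p])
               for p in range(n0)]
    # Pooled layer: tanh of the winning value
    l0 = [math.tanh(pre[winners[p]][p]) for p in range(n0)]
    return l0, winners

# Toy data (assumed values): two kernels of length 2, frame matrix of size 2 x 3
kernels = [[1.0, 0.0], [0.0, 1.0]]
F = [[0.5, -1.0, 2.0],
     [0.1, 0.3, -0.4]]
l0, winners = conv_pool_layer(kernels, F)
print(winners)  # which kernel "won" at each position
print(l0)
```

Returning the winner indices alongside the pooled layer is a design choice of this sketch: the same indices are needed again when differentiating through the max.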

For layers 1 and 2 we have

$$l_q^1=\tanh \left[\sum_r l_r^0 w_{rq}^{01}\right],$$

and

$$l_n^2 = \sigma \left[\sum_l l_l^1 w_{ln}^{12} \right],$$

where $$\sigma$$ is the softmax function. The loss function is

$$\epsilon = \frac{1}{2}\sum_p \left(l_p^2-y_p \right)^2,$$

and by applying the chain rule to these equations we will derive the back-propagation equations. As with the three-layer non-convolutional net, we have

$$\frac{\partial \epsilon}{\partial w_{qs}^{12}}=\left[\textbf{l}^1\otimes\left(\textbf{l}^2-\textbf{y}\right)\textbf{D}_{\sigma'}\right]_{qs}$$

and

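$$\frac{\partial \epsilon}{\partial w_{qs}^{01}}=\left[\textbf{l}^0 \otimes \left(\textbf{l}^2-\textbf{y}\right)\textbf{D}_{\sigma'}\textbf{w}^{12,T}\textbf{D}_{t'}^1\right]_{qs}.$$

Here $$\textbf{D}_{\sigma'}$$ and $$\textbf{D}_{t'}^1$$ are diagonal matrices whose diagonal entries are the activation derivatives $$\sigma_m'$$ and $$t_n^{1'}$$ of layers 2 and 1, with $$t$$ denoting $$\tanh$$.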
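The two outer-product formulas can be checked numerically with a small sketch. One assumption is made loudly here: the output activation is taken to be an elementwise logistic sigmoid, so that its derivative really is a diagonal matrix as in the formulas; for a true softmax the diagonal matrix would have to be replaced by the full Jacobian. The helper names and the toy values are hypothetical.

```python
import math

def sigmoid(x):
    # ASSUMPTION: elementwise logistic sigmoid stands in for the output activation
    return 1.0 / (1.0 + math.exp(-x))

def dense_forward(l0, w01, w12):
    """Layers 1 and 2 for a given pooled layer l0 (plain Python lists)."""
    l1 = [math.tanh(sum(l0[r] * w01[r][q] for r in range(len(l0))))
          for q in range(len(w01[0]))]
    l2 = [sigmoid(sum(l1[l] * w12[l][n] for l in range(len(l1))))
          for n in range(len(w12[0]))]
    return l1, l2

def loss(l2, y):
    return 0.5 * sum((a - b) ** 2 for a, b in zip(l2, y))

def outer(a, b):
    """Outer product of two row vectors, as a list of rows."""
    return [[x * z for z in b] for x in a]

def weight_gradients(l0, w01, w12, y):
    l1, l2 = dense_forward(l0, w01, w12)
    # (l^2 - y) D_sigma', using sigmoid' = l2 * (1 - l2)
    d2 = [(l2[s] - y[s]) * l2[s] * (1.0 - l2[s]) for s in range(len(l2))]
    # (l^2 - y) D_sigma' w^{12,T} D_t'^1, using tanh' = 1 - l1^2
    d1 = [sum(d2[s] * w12[n][s] for s in range(len(l2))) * (1.0 - l1[n] ** 2)
          for n in range(len(l1))]
    return outer(l0, d1), outer(l1, d2)  # gradients for w01 and w12

# Toy data (assumed values)
l0 = [0.4, -0.2, 0.7]
w01 = [[0.2, -0.5], [0.7, 0.1], [-0.3, 0.4]]
w12 = [[0.6, -0.4], [0.3, 0.8]]
y = [1.0, 0.0]
g01, g12 = weight_gradients(l0, w01, w12, y)

# Central-difference check on a single entry of w12
h = 1e-6
w12_plus = [row[:] for row in w12]
w12_minus = [row[:] for row in w12]
w12_plus[1][0] += h
w12_minus[1][0] -= h
numeric = (loss(dense_forward(l0, w01, w12_plus)[1], y)
           - loss(dense_forward(l0, w01, w12_minus)[1], y)) / (2 * h)
print(g12[1][0], numeric)  # the two numbers should agree closely
```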

In addition to these, we now have

$$\frac{\partial \epsilon}{\partial k_q^i}=\sum_{n,m,p}\frac{\partial \epsilon}{\partial l_m^2}\frac{\partial l_m^2}{\partial l_n^1}\frac{\partial l_n^1}{\partial l_p^0}\frac{\partial l_p^0}{\partial k_q^i}.$$

These derivatives are

$$\frac{\partial \epsilon}{\partial l_m^2}=\sum_p \left(l_p^2-y_p\right)\delta_{pm}=\left(l_m^2-y_m\right),$$

$$\frac{\partial l_m^2}{\partial l_n^1}=\sigma_m'\sum_l w_{lm}^{12}\delta_{ln}=\sigma_m' w_{nm}^{12},$$

$$\frac{\partial l_n^1}{\partial l_p^0}=t_n^{1'}\sum_r \delta_{pr}w_{rn}^{01}=t_n^{1'}w_{pn}^{01},$$

and

$$\frac{\partial l_p^0}{\partial k_q^i}=\frac{\partial}{\partial k_q^i}\{t\left[\max(l_p^{c0},l_p^{c1},...,l_p^{cN})\right]\}=t_p^{0'}\frac{\partial l_p^{c,w(p)}}{\partial k_q^i},$$

where $$w(p)$$ is the index of the largest $$l_p^{ci}$$ ("w" is for "winner"). This last derivative is given by

$$\frac{\partial l_p^{c,w(p)}}{\partial k_q^i}=\frac{\partial}{\partial k_q^i}\left[\sum_m k_m^{w(p)}F_{mp}\right]=\sum_m F_{mp}\delta^{i,w(p)}\delta_{mq}=F_{qp}\delta^{i,w(p)}.$$

Defining

$$\mathcal{F}_{qp}^i\equiv F_{qp}\delta^{i,w(p)},$$

this result can be written more compactly as

$$\frac{\partial l_p^{c,w(p)}}{\partial k_q^i}=\mathcal{F}_{qp}^i,$$

so that

$$\frac{\partial l_p^0}{\partial k_q^i}=t_p^{0'}\mathcal{F}_{qp}^i.$$
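One way to picture the masked frame matrix is as a copy of the frame matrix in which column p survives only for the kernel that won the max at position p. The sketch below builds one such matrix per kernel; the function name `masked_frames` and the toy values (including the winner indices) are assumptions for illustration.

```python
def masked_frames(F, winners, n_kernels):
    """Return one masked copy of F per kernel (hypothetical helper).

    Entry [i][q][p] equals F[q][p] if kernel i won position p, else 0.
    """
    return [[[F[q][p] if winners[p] == i else 0.0
              for p in range(len(winners))]
             for q in range(len(F))]
            for i in range(n_kernels)]

# Toy data (assumed values): 2 x 3 frame matrix, winner index per position
F = [[0.5, -1.0, 2.0],
     [0.1, 0.3, -0.4]]
winners = [0, 1, 0]
masks = masked_frames(F, winners, 2)
print(masks[0])  # columns 0 and 2 kept, column 1 zeroed
print(masks[1])  # only column 1 kept
```

Since every position has exactly one winner, the masked matrices add up to the original frame matrix, which is a quick sanity check on any implementation.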

Hence, for the components of the gradient corresponding to the kernels, we obtain

$$\frac{\partial \epsilon}{\partial k_{q}^i}=\sum_{n,m,p}\left(l_m^2-y_m\right)\sigma_m' w_{mn}^{12,T}t_n^{1'} w_{np}^{01,T}t_p^{0'}\mathcal{F}_{pq}^{i,T},$$

which can be written as

$$\frac{\partial \epsilon}{\partial k_{q}^i}=\left[\left(\textbf{l}^2-\textbf{y}\right)\textbf{D}_{\sigma'}\textbf{w}^{12,T}\textbf{D}_{t'}^1\textbf{w}^{01,T}\textbf{D}_{t'}^0 \mathcal{F}^{i,T} \right]_q.$$
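To sanity-check the kernel gradient, the sketch below evaluates the matrix product from right to left in plain Python and compares it with a central-difference estimate. As an explicit assumption, the output activation is an elementwise logistic sigmoid, so that its derivative is diagonal as in the formula; a true softmax would need its full Jacobian in that position. All function names and toy values are hypothetical.

```python
import math

def sigmoid(x):
    # ASSUMPTION: elementwise logistic sigmoid stands in for the output activation
    return 1.0 / (1.0 + math.exp(-x))

def forward(kernels, F, w01, w12):
    n0 = len(F[0])
    pre = [[sum(k[m] * F[m][p] for m in range(len(k))) for p in range(n0)]
           for k in kernels]
    win = [max(range(len(kernels)), key=lambda j: pre[j][p]) for p in range(n0)]
    l0 = [math.tanh(pre[win[p]][p]) for p in range(n0)]
    l1 = [math.tanh(sum(l0[r] * w01[r][q] for r in range(n0)))
          for q in range(len(w01[0]))]
    l2 = [sigmoid(sum(l1[l] * w12[l][n] for l in range(len(l1))))
          for n in range(len(w12[0]))]
    return win, l0, l1, l2

def loss(l2, y):
    return 0.5 * sum((a - b) ** 2 for a, b in zip(l2, y))

def kernel_gradient(kernels, F, w01, w12, y):
    win, l0, l1, l2 = forward(kernels, F, w01, w12)
    # (l^2 - y) D_sigma'
    d2 = [(l2[m] - y[m]) * l2[m] * (1.0 - l2[m]) for m in range(len(l2))]
    # ... w^{12,T} D_t'^1
    d1 = [sum(d2[m] * w12[n][m] for m in range(len(l2))) * (1.0 - l1[n] ** 2)
          for n in range(len(l1))]
    # ... w^{01,T} D_t'^0
    d0 = [sum(d1[n] * w01[p][n] for n in range(len(l1))) * (1.0 - l0[p] ** 2)
          for p in range(len(l0))]
    # ... masked frame transpose: only positions won by kernel i contribute
    return [[sum(d0[p] * F[q][p] for p in range(len(l0)) if win[p] == i)
             for q in range(len(F))]
            for i in range(len(kernels))]

def numeric_kernel_gradient(kernels, F, w01, w12, y, h=1e-6):
    grad = []
    for i in range(len(kernels)):
        row = []
        for q in range(len(kernels[i])):
            plus = [k[:] for k in kernels]
            minus = [k[:] for k in kernels]
            plus[i][q] += h
            minus[i][q] -= h
            e_plus = loss(forward(plus, F, w01, w12)[3], y)
            e_minus = loss(forward(minus, F, w01, w12)[3], y)
            row.append((e_plus - e_minus) / (2 * h))
        grad.append(row)
    return grad

# Toy data (assumed values), chosen so the winners are not near a tie
kernels = [[0.3, -0.2], [-0.1, 0.4]]
F = [[0.5, -1.0, 2.0],
     [0.1, 0.3, -0.4]]
w01 = [[0.2, -0.5], [0.7, 0.1], [-0.3, 0.4]]
w12 = [[0.6, -0.4], [0.3, 0.8]]
y = [1.0, 0.0]

analytic = kernel_gradient(kernels, F, w01, w12, y)
numeric = numeric_kernel_gradient(kernels, F, w01, w12, y)
max_diff = max(abs(a - b) for ra, rn in zip(analytic, numeric)
               for a, b in zip(ra, rn))
print(max_diff)  # expected to be tiny if the formula is implemented correctly
```

The finite-difference comparison is only meaningful away from ties in the max, because the winner indices are treated as constants when differentiating.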