The first part introduces the motivation and explains some background behind the task, and then sums up the backpropagation algorithm implementation. The second part discusses from the coding perspective and focuses mostly on performance, as for coder the time is always limited in supply and thus a topic of interest. The last part finally changes perspective to more on data science field and discuss ways to improve network learning capability.

I would like to thank my colleagues, especially Data Scientist Dr.**Henry Joutsijoki**, for the effort commenting on the context during the creation cycles, and others for all their valuable comments and efforts to improve this article.

CONTENT

#### Part 2:

#### Part 3:

## INTRODUCTION

A few years back my colleague came up with a coding assignment called Neuropia for recruitment assessment. The aim was to implement a simple machine learning solution. Yet the actual evaluated goal was to design an interface for perceptrons and multilayer feed-forward neural network and utilize that with a minimal neural network. The assignee was supposed to use the network to resolve logical binary gate having two inputs and one output with expected results. For example, if the network has been trained to act as NAND gate, then input array 1, 1 would produce 0, and other logical combinations would produce 1.

Neuropia revealed to be notoriously difficult. In general, it has been pretty tedious to adjust values to work just as a NAND gate. Implementing a single gate was quite tricky, and the most straightforward approach was just to set the activation function to implement a logical gate directly. That is a really simple artificial neuron and therefore fulfills the requirement, but is not a perceptron. Therefore such an implementation will not address the benefits and purpose of neural networks, thus a more comprehensive answer would be nice.

Not so long ago I started to follow recommendable Youtube channel 3Blue1Brown and watched episodes of neural networks and backpropagation algorithm. Along with that I got the idea to implement Neuropia in a way it can learn and be applied for more complex tasks than just resolving NAND logic.

## IMPLEMENTING LEARNING NETWORK

The Neuropia C++ interface implements two classes called Neuron and Layer. Neurons represent perceptrons and hold weights, i.e. multiplier for each input value and a bias, value that is summed up to the result, and the activation function that modifies the sum before passing it to the next neurons.

Layers implement a double linked list containing vector of Neurons and a mutable buffer to hold the most recent calculation results of this layer. In this implementation each layer is fully connected, meaning that each neuron in a layer is connected to each neuron in the succeeding layer. See Figure 1 a simple network that has an input layer of two inputs, a single hidden layer and an output layer with a single output.

The set of layers is a network and on creating the network, the first layer is the input layer, the last is the output layer, and layers in between are called hidden layers. The input layer will not modify data anyhow, it just passes the data to the next layer. The neuron properties inside the layer can be set individually or layer-wise.

## LOGISTIC GATE

One may find interesting that resolving an XOR gate problem, at least one hidden layer is required. Here we first create a network of three layers where the input layer has two neurons as well as a hidden layer, and the output layer has a single neuron to emit a boolean output value. The network is illustrated in Figure 1. All neurons are initialized to their default values.

`auto network = Neuropia::Layer(2).join(2).join(1);`

The `Layer::join`

function has few overloads. The variant used here creates a Layer with a given amount of neurons and bind them to the caller layer. The first layer is automatically an input layer, and the last an output layer.

Before training the network all weights and biases must initially set to random values.

`network.randomize();`

The train function is repeatedly called with input data and expected output data, aka labels. The third parameter, learning rate will be discussed later, but for the reference, a value 0.05 is used here. The `TRAININGS`

constant defines the needed training iterations, also called epochs.

`while(++loops < TRAININGS) {`

const auto train = trainData[random()];

network.train(train.inputs.begin(), train.labels.begin(), MY_LEARNINGRATE);

}

Verify the results and see the magic happen.

`for(const auto& input : verifyData) {`

std::cout << input << "->" << network.feed(input) << std::endl;

}

As a result we will get output for the gates in Figure 2:

From the results, you can see how network outputs probabilities are close to 0 and 1 as expected. It is Interesting that with a certain probability the output is completely incorrect as the training algorithm is not immune to getting stuck in the local minima. However, the given learning rate, and the other hyperparameters as discussed later, would help to configure the network so that this unfortunate case will not occur.

## BACKPROPAGATION IMPLEMENTATION

This Neuropia implementation uses backpropagation algorithm for training. I am not going through the math behind the implementation since there are more than plenty of mathematical papers, articles, books, youtube videos, and other sources. I hope the audience here will find it more interesting to dig into the programmer’s implementation perspective.

The backpropagation algorithm optimizes network for given training data, and if the training data represents the learnable subject well, the results can be applied in more general when feeding previously unseen data into the trained network.

The optimization process uses a gradient descent method that gradually finds the minima on the multidimensional plane where network data has at best adopted for the input. Later on, I will use the ‘minimum’ to refer that singular point where the network would theoretically reach its perfectness, i.e. neurons have values for their optimal stage.

Before we will dive deep into the algorithm, a few words about the Matrix class used. The backpropagation algorithm here could be implemented without matrices, and Neuron or Layer do not internally use any matrices. However, implementing the backpropagation algorithm is such a complex beast that the matrices are excellent tools to utilize. Due to the initial design, there is likely some performance hit when data has to be imported in between Neurons and matrices, and therefore for any real-world implementation the internal data layout would likely be a matrix. This Matrix class is implemented earlier for a completely different purpose and that may (or may not) explain a few of its peculiarities. Originally implemented from pure simplicity perspective and performance was not even considered as a requirement. However, I analyzed it with Valgrind and similar tools and fixed a few most obvious issues there. Please note that ‘*’ and ‘*= ‘ operators do elemental multiplication (also known as Hadamart product) per row and column same way as ‘+’ and ‘-’ operations. The matrix multiplication itself is done with `multiplication`

function.

So, backpropagation algorithm goes through network backwards (“hence the name” - as all tutorials keep saying), compares each layer input and output, calculates layer’s proportional error contribution and tunes gradually weights and biases to minimize the error between read output and expected output.

I will go through here the internal implementation of the training function. The Neuropia::Layer API function does forward feed first so this internal function gets the real network output for the training iteration. Please note that the code found in the repository may be different from the presented here as it may have been faced some further development.

`void train(IteratorItInput inputs, IteratorItOutput expectedOutputs, double learningRate, DerivativeFunction df) {`

The `Layer::feed`

function is the one used for passing values through the network. It takes input iterators to data as parameters and returns an output vector. Naturally, the input vector size has to match with the number of input layer neurons and the output vector then has the size of the output vectors.

`const auto out = feed(inputs, inputs + m_neurons.size());`

ValueVector expectedValues(out.size());

std::copy(expectedOutputs, expectedOutputs + out.size(), expectedValues.begin());

backpropagation(out, expectedValues, learningRate, df);

}

The `backpropagation`

function has output and expected output parameters as std::vectors and hence they have to be exported as matrices.

`void Layer::backpropagation`

(const std::vector<double>& outValues, const std::vector<double>& expectedValues, double learningRate, DerivativeFunction df) {

const auto expected = Matrix<double>::fromArray(expectedValues, Matrix<double>::VecDir::row);

auto lastValues = Matrix<double>::fromArray(outValues, Matrix<double>::VecDir::row);

At first, an error, the difference between expected and output values, is calculated. There are other options to get the error than just a linear delta between the result and expected values. The subtraction here is called a cost function and other approaches may improve the network performance, but to keep this simple, this implementation is fine.

`auto errors = expected - lastValues;`

The output layer is set as the initial layer.

`auto lastLayer = outLayer();`

Next, we extract the buffered values from the last hidden layer. In other words, values calculated within the feed function are called and stored into a Layer object. For each layer, the latest neuron output is cached in the mutable container to be used for backpropagation.

With an error and data of each Neuron, we know how it is off and therefore we are able to get its gradient. Then based on gradient we know how to adjust weights and biases to go towards minima on the next round.

`auto layerData = Matrix<double>::fromArray(previousLayer(lastLayer)->m_outBuffer, Matrix<double>::VecDir::row);`

So we loop through each layer.

`for(;;) {`

For the gradient, the derivative of `lastValues`

, i.e. current layer output is needed. Neuropia lets set your own activation function for perceptrons, but for training also its derivative function must be given. There will be more discussion about different activation functions later on in this document - but as an example, we can assume that sigmoid function is used. Sigmoid function, defined as

is a default activation function for a Neuron class and its derivative is as

. It looks a bit complicated, but the beautiful thing here is that within perceptrons each neuron output is an activation function, and therefore, the sigmoid function derivative function can be written as (in C++).

`DerivativeFunction df = [](double value) -> double { return value * (1.0 - value);});`

Hence we have an expression to get directions of changing values from the current Layer data. The Matrix::map function return a new Matrix where a given function is applied to each Matrix value.

`const auto gradientDelta = lastValues.map(df);`

As we know the error and its direction, we can construct the gradient scaled with a learning rate. Since the training algorithm is going towards minima only in average, it is crucial not to take too big steps and thus miss the minima, or otherwise, take too small, and iteration takes too long - or even ends up into arbitrary local minima without the ability to find a more optimal minima. Note that the learning rate can also be dynamically changed when learning progress. It can take good leaps in the beginning and end up to have tiny little steps towards the end to optimize minima approaching.

`const auto gradients = learningRate * errors * gradientDelta;`

We have a gradient that let us adjust the bias and weight values for the layer neurons.

`for(auto i = 0U; i < gradients.rows(); i++) {`

auto& n = lastLayer->m_neurons[i];

n.setBias(n.bias() + gradients(0, i));

}

The obvious complexity of the expression above is due to the Neuropia API. As matrix notation that can just be written as “biases += gradients”. But we have to inject the bias values back to the network and hence the complexity of the expression.

Calculating the weights for Neuron is pretty similar operation even the weights are stored as std::vector for each input value, and therefore the implementation, as well as math expression, is slightly more complex.

For matrix multiplication we need current layer output data transposed.

`const auto layerDataTransposed = layerData.transpose();`

const auto layerDeltas = Matrix<double>::multiply(gradients, layerDataTransposed);

Retrieve weights data as matrix.

`const auto weightsData =`

Matrix<double>::fromArray(lastLayer->m_neurons,

previousLayer(lastLayer)->size(),

[](const Neuron& n, int index) {

return n.weight(index);

});

If you wonder where that size of vector comes from, note the network is fully connected and thus the amount of weights is just the amount previous layer neurons. For readers not familiar with modern C++ syntax, `fromArray`

function accepts three parameters: the vector data is retrieved and its size (there are also iterator overloads in the Matrix class) and a function that does mapping from a Neuron to a weight value.

So, now we have new weights and we can add an adjusting delta to it.

`const auto weights = weightsData + layerDeltas;`

The weights are written back to the network.

`for(auto j = 0U; j < weights.rows(); j++) {`

auto& n = lastLayer->m_neurons[j];

for(auto i = 0U; i < weights.cols() ; i++){

n.setWeight(i, weights(i, j));

}

}

Ok, that is all about it, for this layer! Then we go for the next layer. We need errors of the next layer. As layer’s each neuron’s each weight vector contributes to the current error, we get a new error matrix by having multiply between each weight vector and current error vector. Please see details how that is possible in Backpropagation Step by Step. Personally, the explanation get me somehow convinced.

`const auto weightsDataTransposed = weightsData.transpose();`

errors = Matrix<double>::multiply(weightsDataTransposed, errors);

Now we are ready to go backward to the previous layer - unless it is an input layer and we are done.

`lastLayer = previousLayer(lastLayer);`

if(lastLayer == nullptr || lastLayer->isInput())

break; //we hit the input layer

Then we just get new buffered values for coming loop round.

`lastValues = layerData;`

layerData = Matrix<double>::fromArray(previousLayer(lastLayer)->m_outBuffer, Matrix<double>::VecDir::row);

And we are done! It was not that bad, right?

But hold on, there is more - the rabbit hole continues! I had a nice implementation and wanted to apply it to a bit more complex dataset than just the most simple logistic functions. Next, we look at what happens when we utilize the network to classify a set of 60000 images of handwritten numbers and look closer at the training.