## LINEAR UNIT FUNCTIONS

The sigmoid function, introduced earlier, is the original activation function used with the backpropagation algorithm. The Neuropia API lets you plug in your own activation functions, and I implemented two commonly used alternatives that have been developed since the early days of neural network research: Leaky ReLu and ELu. I first tested plain ReLu, which equals Leaky ReLu with the leaky parameter set to 0, but it made my network "die" too often. A network is dead when further training no longer changes its output values, because the gradients have reached zero and the activation function can no longer move them; the issue is also known as the vanishing gradient problem. I set the leaky parameter to 0.05, which seems to keep the network alive quite well when testing with MNIST data. Table 5 compares the different activation functions. ELu and ReLu seem to converge faster, but at some point the network may explode, with weight values growing beyond 64-bit floating point range. Note that a single hidden layer with Leaky ReLu gave me the best result so far.

Activation Function | 30000 iterations, [32,16] topology | 50000 iterations, [32,16] topology | 100000 iterations, [32,16] topology | 200000 iterations, [32,16] topology | 100000 iterations, [500] topology
---|---|---|---|---|---
Sigmoid | 87.5% | 89.6% | 92.6% | 92.3% | 94.3%
ELu | 91.2% | 92.1% | 92.3% | 93.7% | N/A
Leaky ReLu | 91.3% | 93.0% | 92.4% | N/A | 95.8%

#### Table 5: Accuracies obtained with different activation functions

Both the ReLu and ELu functions improve the network compared to the sigmoid function, but they also make training more unstable: it is easy for the network to die or explode during training. Exploding means that values grow beyond the numerical range of the number type used. There are ways to mitigate the issue, such as dropout and L2 regularization, discussed later.
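To make the shapes of these functions concrete, here is a minimal sketch of Leaky ReLu and ELu with their derivatives. The leaky slope 0.05 follows the text; the ELu alpha of 1.0 and all function names here are illustrative assumptions, not the actual Neuropia signatures.

```cpp
#include <cmath>

// Slope used for negative inputs; 0.05 as chosen in the text.
constexpr double LeakySlope = 0.05;

double leakyReLu(double x) {
    return x > 0.0 ? x : LeakySlope * x;
}

double leakyReLuDerivative(double x) {
    // The non-zero slope below zero is what keeps gradients alive.
    return x > 0.0 ? 1.0 : LeakySlope;
}

double elu(double x, double alpha = 1.0) {
    // Smoothly saturates towards -alpha for negative inputs.
    return x > 0.0 ? x : alpha * (std::exp(x) - 1.0);
}

double eluDerivative(double x, double alpha = 1.0) {
    return x > 0.0 ? 1.0 : alpha * std::exp(x);
}
```

Plain ReLu falls out of the first function by setting the slope to 0, which is exactly the configuration that caused the dead networks described above.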

One thing to note regarding activation functions is network initialization. Earlier, in Logistic Gate, I was using the Layer::randomize function that sets weight and bias values uniformly distributed between -1 and 1. However, considering how network values are supposed to develop during training, that is not optimal. The Neuropia `Layer::initalize` function provides better tuned initial weights and biases that help the network converge faster and, for non-sigmoid activation functions, make vanishing death or explosion to infinity less likely.
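Two commonly used tuned schemes are sketched below for illustration; whether `Layer::initalize` uses these exact formulas is an assumption, and the function names here are hypothetical. Both scale the spread of the initial weights by the number of inputs feeding a neuron (the "fan-in") so that signal variance stays stable across layers.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Xavier/Glorot initialization: commonly paired with sigmoid-like activations.
std::vector<double> xavierWeights(std::size_t fanIn, std::size_t count, std::mt19937& rng) {
    std::normal_distribution<double> dist(0.0, std::sqrt(1.0 / static_cast<double>(fanIn)));
    std::vector<double> w(count);
    for(auto& v : w) v = dist(rng);
    return w;
}

// He initialization: commonly paired with ReLu-family activations,
// compensating for the half of the inputs that ReLu zeroes out.
std::vector<double> heWeights(std::size_t fanIn, std::size_t count, std::mt19937& rng) {
    std::normal_distribution<double> dist(0.0, std::sqrt(2.0 / static_cast<double>(fanIn)));
    std::vector<double> w(count);
    for(auto& v : w) v = dist(rng);
    return w;
}
```

Compared to a uniform spread between -1 and 1, these keep early activations in the range where gradients are informative, which is why tuned initialization speeds up convergence.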

## DROPOUT

Dropout is an interesting concept for improving network training. Dropout turns off random neurons during training, making it harder for the network to overfit. Dropout can also be seen as an enormous ensemble, as each epoch effectively trains a different topology. Since fewer neurons are trained in a single epoch, more training iterations are needed when dropout is applied. The DropoutRate hyperparameter defines the share of neurons in each layer that are shut down; the output layer is left intact.
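The mechanics can be sketched as drawing a fresh on/off mask for the hidden layers each epoch. This is a minimal illustration, not Neuropia's actual code; the function name is made up for the example.

```cpp
#include <random>
#include <vector>

// Build a per-epoch mask: each hidden neuron is switched off with
// probability dropoutRate. The output layer never gets a mask,
// matching the description above.
std::vector<bool> makeDropoutMask(std::size_t neurons, double dropoutRate, std::mt19937& rng) {
    std::bernoulli_distribution drop(dropoutRate);
    std::vector<bool> active(neurons);
    for(std::size_t i = 0; i < neurons; ++i)
        active[i] = !drop(rng);  // true == neuron participates in this epoch
    return active;
}
```

During the forward and backward passes, masked-out neurons contribute nothing, so each epoch really does train a thinned-out sub-network, which is where the ensemble interpretation comes from.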

However, after numerous testing rounds, no dropout-applied network was able to improve accuracy. By increasing the number of epochs, a dropout network will eventually reach as good an accuracy as a network without dropout. My explanation is that the topologies used for MNIST are relatively simple and do not overfit, so there is no benefit from dropout. For example, a simple topology with a 500-neuron hidden layer and 150000 iterations gives 96.8% accuracy without dropout and 93.3% with 50% dropout. Dropout is also assumed to speed up training, since there are fewer neurons to calculate, but with Neuropia the difference is only 5%. That does not compensate for the extra iterations required: without dropout the same network reaches 94.8% with just 50000 iterations.

During these tests, the ReLu activation function climbed quite easily over the 96% accuracy barrier, which encouraged me to make a final attempt to pass 97%.

## L2 REGULARIZATION

L2 regularization adds a damping hyperparameter to the network. It regulates extreme weight values and therefore helps against overfitting and, more importantly for improving MNIST recognition accuracy here, decreases the chance of the network exploding.

There are alternative ways to apply the hyperparameter; in Neuropia, the implementation adjusts the gradient values as follows:

```cpp
if(lambdaL2 > 0.0) {
    // Sum the squared gradient values, scaled by the row count.
    const auto L2 = gradients.reduce<double>(0, [](auto a, auto r){return a + (r * r);}) /
        static_cast<double>(gradients.rows());
    // Subtract the scaled penalty from every gradient value.
    const auto l = lambdaL2 * L2;
    gradients.mapThis([l](auto v){return v - l;});
}
```

To test L2 regularization I decided to use a somewhat more complex network:

- 150000 epochs
- 500,100 topology
- InitStrategy auto
- LearningRateMin 0.00001
- LearningRateMax 0.02

Activation Function | L2 | Accuracy
---|---|---
ReLu | 0 | 97.82%
ReLu | 0.001 | 97.85%
ReLu, Sigmoid | 0 | 97.82%
ReLu, Sigmoid | 0.001 | 97.85%

#### Table 6: Bigger network

As seen in Table 6, there is not a big difference between the results; in fact, all of them are well within any error margins. L2 regularization does not seem to have a significant impact, nor does the extra case where the second hidden layer uses the Sigmoid activation function. The training time was a fairly static 36 minutes on a 2014 MacBook.

However, the results were so close to 98% accuracy that I decided to give it one more shot. And yes! With 200000 epochs, a hefty 51-minute run, I was able to reach 98.1% accuracy with Neuropia!

What next, towards 99%? I will not go any further here. However, I assume that parallel training with an ensemble may be doable - and of course there are more hyperparameter alternatives to research.

## SUMMARY

This Neuropia implementation is not supposed to replace your favorite neural network library. The sole purpose of this article is to offer a different view from the programmer's perspective, as an alternative to most of the tutorials and introductions that are written by data scientists and mathematicians. Here I tried to open up the topic and focus on issues that matter from a programmer's perspective - and naturally, just have some fun.

As a result, I was able to implement a feed-forward neural network that analyzes MNIST data with 98% accuracy. Presumably, 99% could be possible just by running more epochs and figuring out a set of hyperparameters that does not make the network explode, but quite likely that would require improved learning rate and cost function implementations, as suggested earlier.

Maybe the lesson learned here is that the universe of hyperparameters is vast and weird, and it is frustratingly difficult to find optimized values for the network. It seems the secret of neural networks lies not in their implementation, which is relatively simple, but in the (black) art of getting all the possible properties and parameters right. The MNIST problem is very simple, and even with it there seem to be endless possibilities for adjusting the values.

Neuropia is implemented in standard C++14 and tested on Clang, GCC, and MSVC17. All the code discussed here - Neuropia, IdxReader, the Matrix class, and the training code - is available in my repository on GitHub under the permissive MIT license. There is also a lot of code not explored here, such as parameter handling and the details of applying training. Happy cloning and exploring!