Road to Neuropia, Part 2

Software development

Road to Neuropia article consists of thee parts. Please see parts 1 & 2:

CLOSER LOOK TO IMPLEMENTATION

The MNIST database of handwritten digits is considered to be ‘Hello World’ for classifiers, containing a training set of 60000 images and test set of 10000 images. The data is stored in Idx format. I created a simple IdxReader class that implements an iterator over a file. The image data is stored in 28 x 28 8 bit grayscale arrays and therefore the resolving neural network input layer has 784 input neurons. The output layer has naturally ten neurons, one for each digit.

The first implementation was passing all training data to the backpropagation algorithm. The approach was not very pragmatic as calculation took several hours and so I changed training to enact stochastic gradient descent. It means that an arbitrary amount of training data samples are passed to the training algorithm in random order until the result is assumed to be good enough. IdxRandomReader can be used to hop randomly over data to construct a random image and label sample sets.

HYPERPARAMETERS

Defining the number of hidden layers and other characteristics of the network have a big impact on how it behaves and performs. These features are called hyperparameters, and getting them right can be tricky. In the Neuropia there are several hyperparameters, but at first, we will focus on just three of them: Iterations, learning rate, and network topology.

Iterations is the number of epochs network is executed while training.
Learning rate defines how big delta each Gradient Descent step would take.
Topology is the structure of the hidden layers. A common approach for topology is suppressing series within hidden layers. In the notation used here [32,16] topology mean a network of 784, 32, 16 and 10 layers.

In Table 1 there is a comparison of those hyperparameter combinations. Learning rate is fixed to be 0.01 and training time is capped to 30s. So instead of defining iterations, I implemented an alternative hyperparameter as maximum training time. The Iterations field in the table defines epochs in 30 seconds. The Time field represents the seconds of 10000 test images are fed through the generated network, i.e. speed efficiency of the network. The Accuracy defines how well the generated network then recognizes the verification test data, i.e how well network output will match with a given label.

Topology	Iterations	Time	Accuracy %
[16,16]	146033	0.19	88.9
[32,16]	74845	0.37	91.6
[256,128]	5750	3.14	89.2

Table 1: Results of 30s training

To improve these results I applied linearly decreasing learning rate. The stochastic gradient descent algorithm takes random steps that in general go towards goal until the result is seen good enough (or a certain number of iterations are done, or time’s out). As I mentioned earlier, if steps are too small the found minima is more likely only a local and there would be a deeper pit elsewhere (i.e. more accurate network state). Yet the step is too big, there is a higher probability to jump over the minima and never get close enough for an optimal solution. In Table 2 the learning rate decreases linearly from 0.3 to 0.01, and we can see immediately some improvements. Neuropia only supports a linear change of learning rate even though some logarithmic curve may provide better results as learning itself is very non-linear.

At this phase, it is good to point out that there is a certain amount of variation within numbers per test execution and therefore given here far too accurately. Thus consider all numbers only vaguely demonstrative. The variation within accuracies is about 1% units in between measurements.

Topology	Iterations / 30s	Time	Accuracy % / 15s	Accuracy % / 30s	Accuracy % / 60s	Accuracy % / 120s
[16]	155451	0.18	87.3	91.1	91.7	91.6
[16]	155451	0.18	87.3	91.1	91.7	91.6
[32]	82189	0.33	88.7	91.6	92.0	93.7
[16,16]	151388	0.19	90.4	91.8	92.8	92.9
[16,16,16]	146969	0.20	87.6	89.3	89.6	89.3
[32,16]	78819	0.37	90.6	92.2	93.4	94.5
[128]	19001	1.39	86.5	89.3	91.8	93.2
[256,128]	6346	3.36	88.7	90.8	93.1	94.1
[128,32,16]	18283	1.32	77.0	91.3	92.2	93.3

Table 2: Linear learning rate

As we can see the training time vs. accuracy is surprisingly stable. It is fairly easy to reach about 93% accuracy using any kind of network topology, but as we can see in the last third of this writing going beyond that would require significantly more effort. There is no easy way to speed up the training just varying the network topology. Nonetheless, with this data, two layers seem to provide marginally the best result, yet the difference seems to be so small that it really does not matter. However the verification time, the value really measures how efficiently the network is utilized, seems to have significant variation and benefits from small hidden layers.

PARALLEL TRAINING

I have eight cores on my laptop. Each core can execute two threads simultaneously. That made me wonder how to utilize that processing power since the results above are gained using only a single thread. By using Mini-Batch Gradient Descent the train data is split into sub-batches. The sub-batches can be executed parallel and the conclusion is combined as a single result. Since most of the work is executed in parallel, the overall execution should be faster than just using a single thread and a single batch.

I consider here two solutions to implement parallelism for training. The first one is kind of evolutionary: a random network is copied into threads that run mini-batches simultaneously, each of the networks is verified and the best one is selected for the next round and then fed with mini-batches again.

The alternative approach is to use the average of each mini-batch. First, train mini-batches in parallel and then merge networks by applying average weights and biases into the network used for the next round. Later on, these networks are called as Evolutional and Averaging.

In both approaches, it is assumed that gradient descents run faster towards minima when the network can be adjusted during the training. However, at first results were not so promising and I had to look closer to my code, and the analysis revealed that big issue was my Matrix class: the implementation used dynamic memory allocation extensively.

Therefore I implemented a dedicated C++ allocator that is taking advantage of the assumption that there are only a limited amount of different dimension matrices used. I wrote a free list algorithm that does not free memory upon deallocation, but instead implementation stores it into a list where consecutive allocation can get for re-allocation of a matrix of the same size. Furthermore, as I wanted to avoid any synchronization objects, the free list memory pool is implemented per thread so that, the pool is managed within thread local storage, and thus follows its life cycle. This design should also protect implementation from certain cache issues that are further elaborated in Dr. Dobb’s article.

For both implementations, I needed two more hyperparameters: concurrent threads used and the batch size. In the Evolutional I also had a verification batch size. After an extensive amount of trial and error rounds, I was able to find hyperparameters where Averaging was just about 2% more efficient than just a single threaded one as shown in Table 3. Using very small, only eight-item batches and just four parallel threads there was some improvement. Nevertheless, I would not rule out that there are more optimal parameters. This may be applied to Evolutional implementation that is about 3% less efficient to find MNIST numbers. The Matrix optimization above had also some positive impact on the single thread case.

Training	Accuracy %
Single thread	91.6
Evolutional	89.6
Averaging	93.3

Table 3: Different training approaches using 32,16 topology.

IMPROVING NETWORK

The previous chapter was most about the discuss how to improve the speed of the network by adjusting the topology and doing parallel calculations. However usually the more important issue is improving the predictability of network, i.e. how accurately network is expected to work.

Since the accuracy above 93% seemed to be quite easy to achieve, I looked further ways to improve it. At first my target was set to 95%, but along promising improvement, I raised it to 97%. That is a 67% improvement to incorrect results vs. 93% and generally can be considered a pretty good result for the feed-forward networks.

ENSEMBLE

Ensemble is like crowd knowledge; by collecting several results, combined, they shall form an improved output. The Neuropia network layers can be stored and restored to and from a file, therefore I was able to train multiple networks and reload them to form my ensemble results. There are several ways to conclude an ensemble output, but the most straightforward is hard and soft voting.

The network output layer results are given as a vector of probabilities. For example within MNIST that is ten values between 0 - 1.0. The index of the highest value is considered to be the network output. Then for hard voting, the index that appears the majority of networks is concluded to be the ensemble output. For soft voting, each of the probabilities is summed up and the index that has overall the highest value is concluded to be the ensemble output.

In Table 4 there are eight trainings that form an Ensemble. The Accuracy field represent the performance of each network after training.

Training	Accuracy
16	91.3%
32	92.2%
16,16	90.8%
16,16,16	86.2%
32,16	92.6%
128	90.6%
256,128	87.0%
128,32,16	84.1%

Table 4: Party of eight ensemble members

After reloading networks and forming an ensemble we get unified accuracies. The hard voting gives us 91.8% accuracy and the soft voting 92.7%. The hard voting was worse than the best network in the ensemble and soft voting was better, but only within error marginals.

The ensemble obviously works, but I suppose the networks trained are too homogenous; meaning that they initially have too good a mutual understanding of the result, whether the conclusion is correct or incorrect. The ensemble may shine if the party contains learners as support vector machines or decision tree, not just feed-forward neural networks as Neuropia implements, with mostly identical hyperparameters.

writer

Markus Mertama

Senior Software Developer, Insta Advance