Implementing Neural Networks: Image processing

Japanese character recognition

Author: Mohammad Hosseinzadeh | z5388543

Institution: UNSW

Faculty: Faculty of Engineering, School of Computer Science and Engineering

Course: Neural Networks, Deep Learning

Course code: ZZEN9444

Date: 27 September 2021

Course coordinator: Dr. Alan Blair




Task 1 - NetLin: Linear function followed by log softmax

The Kuzushiji-MNIST dataset contains 70,000 28x28 greyscale images spanning 10 classes. Each image therefore has shape 28x28x1 (greyscale images have a single colour channel), or 784 pixel values once flattened. We can compute the linear layer with the torch.nn.Linear class from the PyTorch machine learning library (Paszke et al. 2019).

We define our neural network by sub-classing nn.Module and initialize the network's layers in __init__. Every nn.Module sub-class implements the operations on its input data in the forward method (Paszke et al. 2019).

NetLin is a single-layer linear classifier of the same form as the perceptron. As shown by Frank Rosenblatt (1957), the perceptron learning algorithm will always classify the training data correctly, provided the data are linearly separable. The main limitation of this type of model is that most data, including the Japanese character images used here, are not linearly separable. Hence, NetLin produced the poorest accuracy of the three models.

The NetLin architecture as defined in the kuzu.py script is as follows:

NetLin Architecture
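
As a minimal sketch of such a model (an illustration based on the description above, not a copy of the kuzu.py code; the attribute name linear and the dummy input batch are assumptions), a single linear layer followed by log softmax could look like this in PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NetLin(nn.Module):
    """Single linear layer followed by log softmax (illustrative sketch)."""

    def __init__(self, in_features: int = 28 * 28, num_classes: int = 10):
        super().__init__()
        # One weight per (pixel, class) pair plus one bias per class.
        self.linear = nn.Linear(in_features, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Flatten (batch, 1, 28, 28) images into (batch, 784) vectors.
        x = torch.flatten(x, start_dim=1)
        # Return log-probabilities over the 10 classes.
        return F.log_softmax(self.linear(x), dim=1)


if __name__ == "__main__":
    dummy_batch = torch.zeros(4, 1, 28, 28)  # placeholder input, not real data
    print(NetLin()(dummy_batch).shape)       # torch.Size([4, 10])
```

Returning log-probabilities rather than probabilities pairs naturally with a negative log-likelihood loss during training.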

Below is the final confusion matrix for NetLin after 10 epochs:


Task 2 - NetFull: A fully connected 2-layer network

One solution to the limitation of the NetLin perceptron is to construct a two-layer neural network such as NetFull. For gradient descent to be applied successfully to multi-layer neural networks, the discontinuous step function needs to be replaced with a continuous, differentiable activation function, such as the hyperbolic tangent (tanh).

NetFull is a fully connected two-layer neural network with tanh activation at the hidden layer and log softmax at the output layer. The input to the hidden layer is the image flattened into a 1D vector of 28 x 28 x 1 = 784 features per sample. The hidden layer is therefore nn.Linear() with 784 input features and an arbitrarily chosen number of hidden units, 260. This is followed by a tanh() activation function and a second nn.Linear() layer, which reduces the 260 hidden features to the 10 output classes, followed by log softmax to produce the log-probabilities from which the predicted label is taken.

This network achieved a significant increase in accuracy over the simple linear model (85% vs 70%). This is because a neural network requires at least one hidden layer of activations, with a non-linearity in between, in order to learn arbitrary functions (Stevens et al. 2020). However, the accuracy of NetFull is still limited. The model is a shallow classifier and would require more layers and capacity to improve its performance. A further limitation of a fully connected network is that it computes every output feature as a linear combination of every input value in the sample (image), which is not translation invariant. The network therefore does not exploit the relative position of neighbouring elements in the image (Stevens et al. 2020).
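
The difference between the two activations can be checked numerically. The sketch below uses only the Python standard library; the helper names step, tanh_grad and numeric_grad are ours and purely illustrative. It shows that the step function has zero slope wherever a slope exists, so it gives gradient descent nothing to follow, whereas tanh has the non-zero derivative 1 - tanh(x)^2.

```python
import math


def step(x: float) -> float:
    """Heaviside step function, as used by the classic perceptron."""
    return 1.0 if x >= 0.0 else 0.0


def tanh_grad(x: float) -> float:
    """Analytic derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def numeric_grad(f, x: float, h: float = 1e-6) -> float:
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


if __name__ == "__main__":
    for x in (-2.0, 0.5, 2.0):
        # Step: slope is 0 away from the jump. Tanh: slope matches the formula.
        print(x, numeric_grad(step, x), numeric_grad(math.tanh, x), tanh_grad(x))
```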

The NetFull architecture, using PyTorch, as defined in the kuzu.py script is as follows:

NetFull Architecture
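
A minimal sketch of a network matching the description above (784 inputs, 260 tanh hidden units, 10 log softmax outputs) is given below. It is an illustration only and not the kuzu.py code itself; the attribute names hidden and output are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NetFull(nn.Module):
    """Two fully connected layers: tanh hidden layer, log softmax output."""

    def __init__(self, in_features: int = 28 * 28, hidden: int = 260,
                 num_classes: int = 10):
        super().__init__()
        self.hidden = nn.Linear(in_features, hidden)
        self.output = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Flatten (batch, 1, 28, 28) images into (batch, 784) vectors.
        x = torch.flatten(x, start_dim=1)
        # The non-linearity between the two linear layers is what lets the
        # network represent functions that a single linear layer cannot.
        x = torch.tanh(self.hidden(x))
        return F.log_softmax(self.output(x), dim=1)


if __name__ == "__main__":
    dummy_batch = torch.zeros(4, 1, 28, 28)  # placeholder input, not real data
    print(NetFull()(dummy_batch).shape)      # torch.Size([4, 10])
```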

Below is the final confusion matrix for NetFull after 10 epochs:


Task 3 - Convolutional neural network

The fully connected nature of NetFull means it has a large number of parameters, which makes it easier for the model to memorise the training set. Combined with the lack of position independence, this can make it difficult for the model to generalize. One way to address these limitations is to replace the linear layer with a different linear operation, convolution (Stevens et al. 2020).

Convolutions improve machine learning algorithms through sparse interactions, parameter sharing and equivariant representations of the input (Goodfellow et al. 2016). The network therefore provides locality and a degree of translation invariance, resulting in superior performance compared to our NetLin and NetFull networks.
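
To make the parameter argument concrete, the short calculation below counts the weights and biases of NetLin and NetFull from the layer sizes given earlier (784 inputs, 260 hidden units, 10 outputs) and compares them with a single convolution layer. The convolution layer used here (1 input channel, 16 filters, 5x5 kernel) is a hypothetical example chosen for illustration, not the layer used in kuzu.py.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv2d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights plus biases of a 2D convolution with a square kernel."""
    return c_in * c_out * kernel * kernel + c_out


netlin = linear_params(784, 10)
netfull = linear_params(784, 260) + linear_params(260, 10)
conv_example = conv2d_params(1, 16, 5)  # hypothetical layer, for comparison only

print(netlin)        # 7850
print(netfull)       # 206710
print(conv_example)  # 416
```

The convolution layer needs far fewer parameters because the same small kernel is reused at every image position.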
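
Parameter sharing and equivariance can be illustrated in one dimension with plain Python. In the sketch below (the signal, kernel and helper names are made up for illustration), the same three weights are applied at every position, and shifting the input by two positions shifts the output by the same amount.

```python
def cross_correlate(signal, kernel):
    """Valid 1D cross-correlation, the sliding-window operation of a conv layer."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def shift_right(values, n):
    """Shift a list right by n positions, padding with zeros on the left."""
    return [0] * n + values[: len(values) - n]


signal = [0, 0, 1, 2, 1, 0, 0, 0]  # toy input containing one small "feature"
kernel = [1, 0, -1]                # the same weights are reused at every position

out = cross_correlate(signal, kernel)
out_shifted = cross_correlate(shift_right(signal, 2), kernel)

print(out)          # [-1, -2, 0, 2, 1, 0]
print(out_shifted)  # [0, 0, -1, -2, 0, 2]
```

The response to the feature is unchanged apart from its position, which is the equivariance property; pooling over positions is what then moves the network towards invariance.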

Our NetConv network contains the following structure:

The architecture defined in the kuzu.py script, using PyTorch, is as follows:

NetConv1 Architecture

Our CNN model greatly improves performance, with an accuracy of 96%, an average loss of 0.2587 on the test set, and a computation time of 20 minutes and 40 seconds when run on CPU. Below is the final confusion matrix after 10 epochs:

Task 3.1 - CNN: Architecture optimization

The performance of the original NetConv network was satisfactory: it met the requirement of achieving 93% accuracy after 10 training epochs, reaching 96% with an average loss of 0.2587. However, the average loss did not decrease consistently from epoch to epoch, which suggests that the model may have been overfitting the data: it learns the training data well but could generalize poorly to new data. We can therefore try to build upon this network to improve both performance and computational efficiency (training time was 20 minutes and 40 seconds). The following metaparameters influence a network's training error:

Another problem with convolutional neural networks arises from the way the outputs of convolution layers are combined to create the output tensor. This can make the output feature maps sensitive to the location of elements in the input. Sub-sampling the feature maps with max pooling makes the sub-sampled outputs more robust to changes in the position of elements in the image (Glassner 2021). An additional benefit of max pooling is a reduction in memory use and execution time, as pooling reduces the size of the tensors passing through the network.
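
The size reduction from pooling can be worked out with the standard output-size rule, floor((size + 2 x padding - kernel) / stride) + 1. The sketch below applies it to a 28x28 input; the 5x5 kernel, 2x2 pooling window and 16 channels are hypothetical values chosen for illustration and are not taken from kuzu.py.

```python
def out_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling window (floor rule)."""
    return (size + 2 * padding - kernel) // stride + 1


channels = 16                                          # hypothetical filter count
after_conv = out_size(28, kernel=5)                    # 5x5 convolution, no padding
after_pool = out_size(after_conv, kernel=2, stride=2)  # 2x2 max pooling

print(after_conv, after_pool)       # 24 12
print(channels * after_conv ** 2)   # 9216 activations per image before pooling
print(channels * after_pool ** 2)   # 2304 activations per image after pooling
```

Under these assumptions a single 2x2 pooling step cuts the number of activations passed to the next layer by a factor of four.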

Therefore, through experimentation, the following architecture was found to significantly improve performance and efficiency:

The following is the PyTorch code used to define the optimized CNN model in the kuzu.py script:

CNN Optimal

This network achieved an accuracy of 96% (9569/10000) and an average loss of 0.1660, with the loss decreasing consistently after each epoch, and an execution time of 4 minutes and 32 seconds (CPU). The original NetConv network achieved an accuracy of 96% (9586/10000) and an average loss of 0.2587, with the loss fluctuating from epoch to epoch (suggesting overfitting), and a much slower execution time of 20 minutes and 40 seconds. This demonstrates the importance of understanding and fine-tuning our hyperparameters, and of experimenting with different network designs, in order to maintain high accuracy while minimizing loss and increasing computational efficiency.

Below is the final confusion matrix for our optimized CNN model:


Task 4 - Discussion of the confusion matrix for each model

To determine which characters are most likely to be mistaken for which other characters, we analyze the false negatives for each target label: samples that truly belong to that class but that the model assigned to a different class. The following are the most likely misclassifications for each model:

NetLin:

Character    Most often misclassified as
"o" (0)      "ha" (5)
"ki" (1)     "su" (2)
"su" (2)     "ki" (1)
"tsu" (3)    "su" (2)
"na" (4)     "su" (2)
"ha" (5)     "su" (2)
"ma" (6)     "su" (2)
"ya" (7)     "re" (8)
"re" (8)     "su" (2)
"wo" (9)     "su" (2)

NetFull:

Character    Most often misclassified as
"o" (0)      "ya" (7)
"ki" (1)     "ma" (6)
"su" (2)     "tsu" (3)
"tsu" (3)    "su" (2)
"na" (4)     "o" (0)
"ha" (5)     "su" (2)
"ma" (6)     "su" (2)
"ya" (7)     "wo" (9)
"re" (8)     "tsu" (3)
"wo" (9)     "su" (2)

NetConv:

Character    Most often misclassified as
"o" (0)      "na" (4)
"ki" (1)     "ma" (6)
"su" (2)     "o" (0)
"tsu" (3)    "su" (2)
"na" (4)     "o" (0)
"ha" (5)     "su" (2)
"ma" (6)     "su" (2)
"ya" (7)     "su" (2)
"re" (8)     "o" (0)
"wo" (9)     "tsu" (2)

The confusion matrices from our models show that the characters "tsu" (3), "ha" (5), "ma" (6) and "wo" (9) are most likely to be mistaken for the character "su" (2). This could be due to the similarity of the characters in shape. Messy handwriting or a lack of clarity in the sample images can also cause misclassification among similar-looking characters, especially with the shallow NetLin and NetFull networks. The convolutional network, NetConv, is able to detect complex features thanks to the hierarchical nature of its filter operations, resulting in significantly fewer misclassifications.
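
The per-character analysis used in this section can be sketched in a few lines of Python. The example assumes that rows of the confusion matrix are target characters and columns are predicted characters; the 3x3 matrix and the function name most_confused are made up for illustration and are not results from our models.

```python
def most_confused(confusion, labels):
    """For each target class (row), find the wrong class predicted most often."""
    result = {}
    for target, row in enumerate(confusion):
        # Ignore the diagonal entry, which counts correct predictions.
        off_diagonal = [
            (count, pred) for pred, count in enumerate(row) if pred != target
        ]
        # On ties, max() with a key returns the first (lowest-index) class.
        count, pred = max(off_diagonal, key=lambda pair: pair[0])
        result[labels[target]] = (labels[pred], count)
    return result


labels = ["o", "ki", "su"]  # toy subset of the 10 classes
confusion = [               # hypothetical counts: rows = target, cols = predicted
    [90, 3, 7],
    [4, 85, 11],
    [6, 9, 85],
]

print(most_confused(confusion, labels))
# {'o': ('su', 7), 'ki': ('su', 11), 'su': ('ki', 9)}
```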


References

Paszke, A, et al., 2019, PyTorch: An Imperative Style, High-Performance Deep Learning Library, Neural Information Processing Systems, no. 32, pp. 8024–8035, accessed 10 September 2021, URL.

Stevens, E, Antiga, LPG and Viehman, T, 2020, Deep Learning with PyTorch, Manning Publications, Shelter Island, NY, USA.

Goodfellow, I, Bengio, Y and Courville, A, 2016, Deep Learning, MIT Press, accessed 10 September 2021, Deep Learning eBook.

Glassner, A, 2021, Deep Learning: A Visual Approach, No Starch Press, San Francisco, CA, USA.

Masters, D and Luschi, C, 2018, Revisiting Small Batch Training For Deep Neural Networks, Graphcore Research, accessed 20 September 2021, Article URL.

Krohn, J, Beyleveld, G and Bassens, A, 2019, Deep Learning Illustrated: A Visual Interactive Guide to Artificial Intelligence, Addison-Wesley Professional, Boston, MA, USA.