Author: Mohammad Hosseinzadeh | z5388543
Institution: UNSW
Faculty: Faculty of Engineering, School of Computer Science and Engineering
Course: Neural Networks, Deep Learning
Course code: ZZEN9444
Date: 27 September 2021
Course coordinator: Dr. Alan Blair
# import libraries
import numpy as np
import pandas as pd
from tabulate import tabulate
import matplotlib.pyplot as plt
from ipypublish import nb_setup
from PIL import Image
from IPython.core.display import HTML
table_css = 'table {align:left;display:block} '
HTML('<style>{}</style>'.format(table_css))
The Kuzushiji-MNIST dataset contains 70,000 28x28 greyscale images spanning 10 classes. Each image is therefore 28x28x1, since greyscale images have only one colour channel. We can create an object that computes a linear layer using the torch.nn.Linear() class from the PyTorch machine learning library (Paszke et al. 2019):
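For instance, a minimal sketch of such a linear layer (the 28x28 input size and the 10 output classes come from the dataset; the variable names are illustrative):
import torch
import torch.nn as nn
# a linear layer mapping a flattened 28x28 image to 10 class scores
linear = nn.Linear(in_features=28 * 28, out_features=10)
x = torch.randn(64, 28 * 28)   # a batch of 64 flattened images
scores = linear(x)
print(scores.shape)            # torch.Size([64, 10])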
We define our neural network by subclassing nn.Module and initializing the layers in __init__. Every nn.Module subclass implements the operations on input data in its forward method (Paszke et al. 2019).
NetLin is essentially a single-layer, Perceptron-style linear classifier. As shown by Frank Rosenblatt (1957), this type of algorithm will always learn to classify the training data successfully, provided the data are linearly separable. Its main limitation is that most data, including the Kuzushiji character images, are not linearly separable. Hence, NetLin produced the poorest accuracy of the three models.
The NetLin architecture as defined in kuzu.py script is as follows:
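The exact kuzu.py listing is not reproduced in this section; the following is a minimal sketch of a NetLin-style model consistent with the description above (attribute names are assumptions):
import torch.nn as nn
import torch.nn.functional as F

class NetLin(nn.Module):
    # single linear layer followed by log softmax
    def __init__(self):
        super(NetLin, self).__init__()
        self.linear = nn.Linear(28 * 28, 10)   # 784 pixels -> 10 classes

    def forward(self, x):
        x = x.view(x.shape[0], -1)             # flatten each 28x28 image to a 784-vector
        return F.log_softmax(self.linear(x), dim=1)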
Below is the final confusion matrix for NetLin after 10 epochs:
## load final confusion matrix from kuzu_main.py for NetLin
## the confusion matrix was saved as a NumPy array - file 'conf_matrix_netlin.csv'
# setup pandas for pdf export
pd = nb_setup.setup_pandas(escape_latex=False)
# column names
japanese = ["o", "ki", "su", "tsu", "na", "ha", "ma", "ya", "re", "wo"]
class_values = np.arange(0,10)
cols = [japanese, class_values]
# Index label
target_label = 'Target Label:'
# column labels
header = ['Characters:', 'Predicted Label:']
# read saved confusion matrix from NetLin
data1 = pd.read_csv('conf_matrix_netlin.csv', sep=',', names=class_values)
# assign index names and label
data1.index = pd.Index(class_values, name=target_label)
# assign column names and labels
data1.columns = cols
data1.columns = data1.columns.rename(header, level=[0,1])
# display final accuracy result
print("Average loss = 1.0088 \n" +
"Accuracy = 6961/10000 (70%)")
#display confusion matrix df
data1
Average loss = 1.0088 Accuracy = 6961/10000 (70%)
Target Label: \ Predicted Label: | o (0) | ki (1) | su (2) | tsu (3) | na (4) | ha (5) | ma (6) | ya (7) | re (8) | wo (9) |
---|---|---|---|---|---|---|---|---|---|---|
0 | 766.0 | 5.0 | 8.0 | 14.0 | 30.0 | 64.0 | 2.0 | 62.0 | 31.0 | 18.0 |
1 | 7.0 | 669.0 | 106.0 | 17.0 | 27.0 | 22.0 | 58.0 | 14.0 | 26.0 | 54.0 |
2 | 8.0 | 63.0 | 689.0 | 26.0 | 26.0 | 21.0 | 46.0 | 36.0 | 47.0 | 38.0 |
3 | 4.0 | 37.0 | 58.0 | 756.0 | 16.0 | 57.0 | 14.0 | 18.0 | 28.0 | 12.0 |
4 | 59.0 | 52.0 | 84.0 | 20.0 | 620.0 | 20.0 | 32.0 | 35.0 | 20.0 | 58.0 |
5 | 8.0 | 28.0 | 124.0 | 17.0 | 19.0 | 727.0 | 28.0 | 8.0 | 33.0 | 8.0 |
6 | 5.0 | 22.0 | 146.0 | 10.0 | 25.0 | 24.0 | 726.0 | 19.0 | 8.0 | 15.0 |
7 | 16.0 | 32.0 | 29.0 | 11.0 | 80.0 | 15.0 | 54.0 | 624.0 | 91.0 | 48.0 |
8 | 11.0 | 36.0 | 96.0 | 39.0 | 7.0 | 31.0 | 43.0 | 7.0 | 709.0 | 21.0 |
9 | 8.0 | 54.0 | 86.0 | 4.0 | 52.0 | 30.0 | 20.0 | 32.0 | 39.0 | 675.0 |
One solution to the limitation of the NetLin Perceptron algorithm is to construct a two-layer neural network such as NetFull. For gradient descent to be applied successfully to multi-layer neural networks, the discontinuous step function needs to be replaced with a continuous, differentiable activation function, such as the hyperbolic tangent (tanh).
NetFull is a fully connected two-layer neural network with tanh activation at the hidden layer and log softmax at the output layer. The input to the hidden layer is the image flattened into a 1D vector of 28 x 28 x 1 = 784 features per sample. The hidden layer is therefore an nn.Linear() layer with 784 input features and an arbitrarily chosen 260 hidden features. This is followed by a tanh() activation and a second nn.Linear() layer that reduces the representation to the required 10 output features, followed by log softmax to produce the predicted labels.
This network delivers a significant increase in accuracy over the simple linear model (85% vs 70%), because learning arbitrary functions requires at least one hidden layer with a non-linearity between the linear layers (Stevens et al. 2020). However, NetFull's accuracy is still modest: it is a shallow classifier and would require more layers and capacity to improve further. A limitation of fully connected networks is that every output feature is computed as a linear combination of every input value, so the model is not translation invariant and does not exploit the relative position of neighbouring pixels in the image (Stevens et al. 2020).
The NetFull architecture using PyTorch as defined in kuzu.py script is as follows:
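Again, the kuzu.py source is not reproduced here; a minimal sketch consistent with the description above (784 inputs, 260 hidden nodes with tanh, 10 log-softmax outputs) is:
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetFull(nn.Module):
    # two fully connected layers: 784 -> 260 (tanh) -> 10 (log softmax)
    def __init__(self):
        super(NetFull, self).__init__()
        self.hidden = nn.Linear(28 * 28, 260)
        self.out = nn.Linear(260, 10)

    def forward(self, x):
        x = x.view(x.shape[0], -1)             # flatten to 784 features
        h = torch.tanh(self.hidden(x))
        return F.log_softmax(self.out(h), dim=1)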
Below is the final confusion matrix for NetFull after 10 epochs:
## load final confusion matrix from kuzu_main.py for NetFull
## the confusion matrix was saved as a NumPy array - file 'conf_matrix_netfull.csv'
# setup pandas for pdf export
pd = nb_setup.setup_pandas(escape_latex=False)
# column names
japanese = ["o", "ki", "su", "tsu", "na", "ha", "ma", "ya", "re", "wo"]
class_values = np.arange(0,10)
cols = [japanese, class_values]
# Index label
target_label = 'Target Label:'
# column labels
header = ['Characters:', 'Predicted Label:']
# read saved confusion matrix from NetFull
data2 = pd.read_csv('conf_matrix_netfull.csv', sep=',', names=class_values)
# assign index names and label
data2.index = pd.Index(class_values, name=target_label)
# assign column names and labels
data2.columns = cols
data2.columns = data2.columns.rename(header, level=[0,1])
# display final accuracy result
print("Average loss = 0.4907 \n" +
"Accuracy = 8478/10000 (85%)")
#display confusion matrix df
data2
Average loss = 0.4907 Accuracy = 8478/10000 (85%)
Target Label: \ Predicted Label: | o (0) | ki (1) | su (2) | tsu (3) | na (4) | ha (5) | ma (6) | ya (7) | re (8) | wo (9) |
---|---|---|---|---|---|---|---|---|---|---|
0 | 845.0 | 3.0 | 2.0 | 5.0 | 31.0 | 32.0 | 4.0 | 38.0 | 35.0 | 5.0 |
1 | 6.0 | 815.0 | 37.0 | 2.0 | 17.0 | 12.0 | 58.0 | 6.0 | 18.0 | 29.0 |
2 | 8.0 | 10.0 | 844.0 | 38.0 | 13.0 | 16.0 | 25.0 | 11.0 | 19.0 | 16.0 |
3 | 3.0 | 9.0 | 31.0 | 917.0 | 1.0 | 16.0 | 6.0 | 1.0 | 7.0 | 9.0 |
4 | 41.0 | 26.0 | 22.0 | 5.0 | 820.0 | 5.0 | 30.0 | 16.0 | 20.0 | 15.0 |
5 | 9.0 | 9.0 | 86.0 | 8.0 | 9.0 | 827.0 | 31.0 | 1.0 | 14.0 | 6.0 |
6 | 3.0 | 9.0 | 52.0 | 9.0 | 11.0 | 4.0 | 898.0 | 7.0 | 2.0 | 5.0 |
7 | 17.0 | 15.0 | 22.0 | 3.0 | 26.0 | 7.0 | 29.0 | 828.0 | 22.0 | 31.0 |
8 | 12.0 | 27.0 | 29.0 | 48.0 | 2.0 | 7.0 | 25.0 | 3.0 | 840.0 | 7.0 |
9 | 3.0 | 20.0 | 49.0 | 4.0 | 30.0 | 7.0 | 19.0 | 14.0 | 10.0 | 844.0 |
The fully connected nature of NetFull means the model has a large number of parameters, which makes it easier to memorise the training data, and its lack of position independence can make it harder to generalize. One way to address these limitations of a fully connected two-layer network is to replace the linear layer with another linear mathematical operation: convolution (Stevens et al. 2020).
Convolutions improve machine learning algorithms through sparse interactions, parameter sharing and equivariant representations of the input (Goodfellow et al. 2016). The network therefore exploits locality and translation invariance, resulting in superior performance compared with our NetLin and NetFull networks.
Our NetConv network contains the following structure:
The architecture defined in kuzu.py script, using PyTorch, is as follows:
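The original kuzu.py listing is not reproduced in this section; the sketch below shows a plausible NetConv of this kind, with two convolutional layers using relu activation feeding fully connected layers. The channel counts, kernel sizes and hidden size are assumptions, not the values actually used:
import torch.nn as nn
import torch.nn.functional as F

class NetConv(nn.Module):
    # two convolutional layers followed by fully connected layers
    # (channel counts, kernel sizes and hidden size are assumptions)
    def __init__(self):
        super(NetConv, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=5, padding=2)    # 1x28x28 -> 16x28x28
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5, padding=2)   # 16x28x28 -> 32x28x28
        self.fc1 = nn.Linear(32 * 28 * 28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = x.view(x.shape[0], -1)              # flatten the feature maps
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)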
Our CNN model greatly improves performance, with an accuracy of 96%, an average loss of 0.2587 on the test set, and computation time of 20 minutes and 40 seconds when run on CPU. Below is the final confusion matrix after 10 epochs:
## load final confusion matrix from kuzu_main.py for NetConv
## the confusion matrix was saved as a NumPy array - file 'conf_matrix_conv1.csv'
# setup pandas for pdf export
pd = nb_setup.setup_pandas(escape_latex=False)
# column names
japanese = ["o", "ki", "su", "tsu", "na", "ha", "ma", "ya", "re", "wo"]
class_values = np.arange(0,10)
cols = [japanese, class_values]
# Index label
target_label = 'Target Label:'
# column labels
header = ['Characters:', 'Predicted Label:']
# read saved confusion matrix from NetConv
data3 = pd.read_csv('conf_matrix_conv1.csv', sep=',', names=class_values)
# assign index names and label
data3.index = pd.Index(class_values, name=target_label)
# assign column names and labels
data3.columns = cols
data3.columns = data3.columns.rename(header, level=[0,1])
# display final accuracy result
print("Average loss = 0.2587 \n" +
"Accuracy = 9586/10000 (96%) \n" +
"Computation time = 20 min, 40 sec")
#display confusion matrix df
data3
Average loss = 0.2587 Accuracy = 9586/10000 (96%) Computation time = 20 min, 40 sec
Target Label: \ Predicted Label: | o (0) | ki (1) | su (2) | tsu (3) | na (4) | ha (5) | ma (6) | ya (7) | re (8) | wo (9) |
---|---|---|---|---|---|---|---|---|---|---|
0 | 949.0 | 5.0 | 3.0 | 0.0 | 28.0 | 4.0 | 0.0 | 5.0 | 4.0 | 2.0 |
1 | 1.0 | 944.0 | 8.0 | 1.0 | 10.0 | 0.0 | 20.0 | 1.0 | 7.0 | 8.0 |
2 | 8.0 | 5.0 | 934.0 | 19.0 | 8.0 | 7.0 | 7.0 | 7.0 | 2.0 | 3.0 |
3 | 0.0 | 1.0 | 12.0 | 976.0 | 0.0 | 5.0 | 2.0 | 2.0 | 2.0 | 0.0 |
4 | 9.0 | 3.0 | 3.0 | 11.0 | 949.0 | 6.0 | 6.0 | 4.0 | 7.0 | 2.0 |
5 | 2.0 | 3.0 | 21.0 | 11.0 | 2.0 | 948.0 | 6.0 | 0.0 | 4.0 | 3.0 |
6 | 3.0 | 2.0 | 13.0 | 1.0 | 5.0 | 2.0 | 972.0 | 1.0 | 0.0 | 1.0 |
7 | 7.0 | 5.0 | 5.0 | 0.0 | 5.0 | 0.0 | 4.0 | 966.0 | 6.0 | 2.0 |
8 | 0.0 | 1.0 | 3.0 | 2.0 | 7.0 | 2.0 | 1.0 | 1.0 | 983.0 | 0.0 |
9 | 8.0 | 3.0 | 3.0 | 1.0 | 5.0 | 1.0 | 3.0 | 4.0 | 7.0 | 965.0 |
The performance of the original NetConv network was satisfactory: it met the requirement of 93% accuracy after 10 training epochs, achieving 96% with an average loss of 0.2587. However, the average loss did not decrease consistently from epoch to epoch, which suggests the model may have been overfitting: it learns the training data well but could generalize poorly to new data. We can therefore try to build on this network to improve both performance and computational efficiency (training time was 20 minutes and 40 seconds). The following hyperparameters influence a network's training error:
Learning rate: perhaps the most important hyperparameter, as it is critical to learning via gradient descent. The learning rate controls how much the weights change at every update. Larger values lead to faster learning but may cause gradient descent to overshoot a minimum, increasing the error, while a value that is too small results in slow, inefficient learning (Glassner 2021). Backpropagation relies on making small changes to the weights, so care must be taken to find the right balance through trial and error.
Momentum: once the amount of change to each weight is computed, we can add in a small fraction of its change from the previous step using the momentum hyperparameter. This helps the weights move across flat plateaus and speeds up learning when the learning rate is small. When momentum is introduced, we typically reduce the learning rate at the same time, to compensate for the downhill motion being amplified by a factor of $\frac{1}{1-\alpha}$.
Dropout: a popular regularization method to delay the onset of overfitting (Glassner 2021). The role of dropout is to temporarily disconnect a random, fixed percentage of the neurons in the previous layer. Since these neurons are disconnected, they take no part in the forward calculation and are not included in backpropagation. When the batch is completed and the remaining weights are updated, the disconnected neurons and all of their connections are restored (Glassner 2021); at the start of the next batch a new random set of neurons is temporarily disconnected, and this process is repeated for each epoch. Dropout delays overfitting because the activation each neuron receives during testing is approximately the average of what it would have received during training, and it forces the network to build in redundancy, since it must learn to cope with some features being missing. These three settings are illustrated in the short sketch below.
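As a minimal sketch of these three settings in PyTorch (the learning rate, momentum and dropout probability below are assumed values, not necessarily those used in kuzu_main.py):
import torch.nn as nn
import torch.optim as optim

# a small fully connected model with dropout on the hidden layer
net = nn.Sequential(
    nn.Linear(28 * 28, 260), nn.Tanh(),
    nn.Dropout(p=0.5),                  # randomly zero 50% of hidden activations during training
    nn.Linear(260, 10), nn.LogSoftmax(dim=1),
)

# SGD with an assumed learning rate and momentum
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.5)

# PyTorch uses "inverted" dropout: retained activations are scaled by 1/(1-p)
# during training, so in eval mode dropout simply passes inputs through unchanged
net.eval()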
Another issue with convolutional layers is that their output feature maps record the precise location of features in the input, so small shifts in the input can produce noticeably different feature maps. Sub-sampling the feature maps with max pooling makes the sub-sampled outputs more robust to changes in the position of elements in the image (Glassner 2021). An additional benefit of max pooling is reduced memory use and execution time, since pooling shrinks the tensors flowing through the network.
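For example, a 2x2 max pooling layer halves each spatial dimension of the feature maps (the channel count below is illustrative):
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)          # 2x2 window, stride 2
feature_maps = torch.randn(1, 16, 28, 28)   # one sample, 16 channels of 28x28
pooled = pool(feature_maps)
print(pooled.shape)                         # torch.Size([1, 16, 14, 14])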
Therefore, through experimentation, the following architecture was found to significantly improve performance and efficiency:
The following is the PyTorch code used to define the optimized CNN model in kuzu.py script:
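The exact optimized architecture from kuzu.py is not reproduced here; the sketch below illustrates the kind of design described above, combining convolution, relu, max pooling and dropout. All layer sizes and the dropout rate are assumptions:
import torch.nn as nn
import torch.nn.functional as F

class NetConvOptimized(nn.Module):
    # convolution + relu + max pooling, with dropout before the output layer
    # (channel counts, kernel sizes, hidden size and dropout rate are assumptions)
    def __init__(self):
        super(NetConvOptimized, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=5, padding=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(2)                 # halves each spatial dimension
        self.drop = nn.Dropout(p=0.5)
        self.fc1 = nn.Linear(32 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))        # 28x28 -> 14x14
        x = self.pool(F.relu(self.conv2(x)))        # 14x14 -> 7x7
        x = x.view(x.shape[0], -1)
        x = self.drop(F.relu(self.fc1(x)))
        return F.log_softmax(self.fc2(x), dim=1)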
This network achieved an accuracy of 96% (9569/10000), an average loss of 0.1660 that decreased consistently after each epoch, and an execution time of 4 minutes and 32 seconds on CPU. The original NetConv network achieved 96% accuracy (9586/10000) but an average loss of 0.2587 that fluctuated from epoch to epoch (suggesting overfitting), and a much slower execution time of 20 minutes and 40 seconds. This demonstrates the importance of understanding and fine-tuning our hyperparameters, and of experimenting with different network designs, in order to maintain high accuracy while minimizing loss and improving computational efficiency.
Below is the final confusion matrix after 10 epochs for our optimized CNN model:
## load final confusion matrix from kuzu_main.py for NetConv
# experimented with a different CNN architecture
## the confusion matrix was saved as a NumPy array - file 'conf_matrix_conv2.csv'
# setup pandas for pdf export
pd = nb_setup.setup_pandas(escape_latex=False)
# column names
japanese = ["o", "ki", "su", "tsu", "na", "ha", "ma", "ya", "re", "wo"]
class_values = np.arange(0,10)
cols = [japanese, class_values]
# Index label
target_label = 'Target Label:'
# column labels
header = ['Characters:', 'Predicted Label:']
# read saved confusion matrix from NetConv
data4 = pd.read_csv('conf_matrix_conv2.csv', sep=',', names=class_values)
# assign index names and label
data4.index = pd.Index(class_values, name=target_label)
# assign column names and labels
data4.columns = cols
data4.columns = data4.columns.rename(header, level=[0,1])
# display final accuracy result
print("Optimized NetConv \n" +
"Average loss = 0.1660 \n" +
"Accuracy = 9569/10000 (96%) \n" +
"Computation time = 4 min, 32 sec")
#display confusion matrix df
data4
Optimized NetConv Average loss = 0.1660 Accuracy = 9569/10000 (96%) Computation time = 4 min, 32 sec
Target Label: \ Predicted Label: | o (0) | ki (1) | su (2) | tsu (3) | na (4) | ha (5) | ma (6) | ya (7) | re (8) | wo (9) |
---|---|---|---|---|---|---|---|---|---|---|
0 | 966.0 | 1.0 | 3.0 | 0.0 | 18.0 | 1.0 | 0.0 | 6.0 | 4.0 | 1.0 |
1 | 2.0 | 945.0 | 3.0 | 0.0 | 8.0 | 2.0 | 23.0 | 2.0 | 9.0 | 6.0 |
2 | 11.0 | 5.0 | 912.0 | 40.0 | 5.0 | 7.0 | 9.0 | 5.0 | 4.0 | 2.0 |
3 | 0.0 | 0.0 | 16.0 | 973.0 | 0.0 | 5.0 | 2.0 | 2.0 | 1.0 | 1.0 |
4 | 12.0 | 4.0 | 1.0 | 8.0 | 941.0 | 2.0 | 11.0 | 11.0 | 7.0 | 3.0 |
5 | 2.0 | 3.0 | 22.0 | 4.0 | 1.0 | 954.0 | 4.0 | 1.0 | 5.0 | 4.0 |
6 | 4.0 | 5.0 | 9.0 | 1.0 | 0.0 | 3.0 | 972.0 | 3.0 | 1.0 | 2.0 |
7 | 3.0 | 0.0 | 5.0 | 0.0 | 3.0 | 1.0 | 3.0 | 978.0 | 3.0 | 4.0 |
8 | 3.0 | 7.0 | 4.0 | 8.0 | 8.0 | 2.0 | 1.0 | 1.0 | 965.0 | 1.0 |
9 | 7.0 | 3.0 | 5.0 | 1.0 | 4.0 | 0.0 | 3.0 | 3.0 | 11.0 | 963.0 |
To determine which characters are most likely to be mistaken for which other characters, we examine, for each target label, the off-diagonal entries of the corresponding row of the confusion matrix: these are the samples of that character that the model predicted as some other character (the false negatives for that class). The most likely misclassifications for each model are listed below; a short sketch of the calculation follows the tables:
NetLin:
Character | Most Often Misclassified As |
---|---|
"o" (0) | "ha" (5) |
"ki" (1) | "su" (2) |
"su" (2) | "ki" (1) |
"tsu" (3) | "su" (2) |
"na" (4) | "su" (2) |
"ha" (5) | "su" (2) |
"ma" (6) | "su" (2) |
"ya" (7) | "re" (8) |
"re" (8) | "su" (2) |
"wo" (9) | "su" (2) |
NetFull:
Character | Most Often Misclassified As |
---|---|
"o" (0) | "ya" (7) |
"ki" (1) | "ma" (6) |
"su" (2) | "tsu" (3) |
"tsu" (3) | "su" (2) |
"na" (4) | "o" (0) |
"ha" (5) | "su" (2) |
"ma" (6) | "su" (2) |
"ya" (7) | "wo" (9) |
"re" (8) | "tsu" (3) |
"wo" (9) | "su" (2) |
NetConv:
Character | Most Often Misclassified As |
---|---|
"o" (0) | "na" (4) |
"ki" (1) | "ma" (6) |
"su" (2) | "o" (0) |
"tsu" (3) | "su" (2) |
"na" (4) | "o" (0) |
"ha" (5) | "su" (2) |
"ma" (6) | "su" (2) |
"ya" (7) | "su" (2) |
"re" (8) | "o" (0) |
"wo" (9) | "tsu" (2) |
The confusion matrices from our models show that the characters "tsu" (3), "ha" (5), "ma" (6) and "wo" (9) are most often mistaken for "su" (2). This could be due to similarity in the shapes of these characters; messy handwriting or a lack of clarity in the sample images can also result in misclassification among similar-looking characters, especially with the shallow NetLin and NetFull networks. The convolutional network, NetConv, can detect more complex features due to the hierarchical nature of its filter operations, resulting in significantly fewer misclassifications.
Paszke, A, et al., 2019, PyTorch: An Imperative Style, High-Performance Deep Learning Library, Advances in Neural Information Processing Systems, vol. 32, pp. 8024–8035, accessed 10 September 2021, URL.
Stevens, E, Antiga, LPG and Viehmann, T, 2020, Deep Learning with PyTorch, Manning Publications, Shelter Island, NY, USA.
Goodfellow, I, Bengio, Y and Courville, A, 2016, Deep Learning, MIT Press, accessed 10 September 2021, Deep Learning eBook.
Glassner, A, 2021, Deep Learning: A Visual Approach, No Starch Press, San Francisco, CA, USA.
Masters, D and Luschi, C, 2018, Revisiting Small Batch Training for Deep Neural Networks, Graphcore Research, accessed 20 September 2021, Article URL.
Krohn, J, Beyleveld, G and Bassens, A, 2019, Deep Learning Illustrated: A Visual Interactive Guide to Artificial Intelligence, Addison-Wesley Professional, Boston, MA, USA.