In this month’s column, we continue our discussion on deep learning.
A few months back, we started a discussion on deep learning and its application in natural language processing. However, over the last couple of months, we digressed a bit due to various other requests from our readers. From this month onwards, let’s get back to exploring deep learning. In prior columns, we discussed how supervised learning techniques are used to train neural networks. These learning techniques require labelled data. Given that today’s state-of-the-art neural networks are extremely deep, with many layers, we require large amounts of data so that we can train these deep networks without over-fitting. Getting labelled, annotated data is not easy. For instance, in image recognition tasks, specific regions of each image have to be marked out and labelled before a network can be trained to recognise human faces or animals. Labelling millions of images requires considerable human effort. On the other hand, if we decide to use smaller amounts of labelled data, we end up with over-fitting and poor performance on test data. As a result, many of the problems for which deep learning is an ideal solution do not get tackled, simply due to the lack of large volumes of labelled data. So the question is: is it possible to build deep learning systems based on unsupervised learning techniques?
In last month’s column, we discussed word embeddings which, along with paragraph and document embeddings, are now widely used as inputs in various natural language processing tasks. In fact, they are used as the input representation for many deep learning-based text processing systems. If you have gone through the word2vec paper at https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf, you would have seen that there is no labelled training data for the neural network described in this paper. The word2vec neural network is not a deep neural network. It has only three layers: an input layer, a hidden layer and an output layer. Just as in the case of neural networks trained with supervised learning, back propagation is used to train the weights of the network. Here is a quick question to our readers: why do we initialise the weights of the nodes to random values instead of initialising them to zero?
In a supervised learning setting, we have labelled data, which provides the correct (expected) output or label for a given input. The neural network produces an actual output for a given input. The difference between the expected output and the actual output is the error term at the output layer. As we saw earlier during the discussion on back propagation, this error term is propagated back to the nodes of earlier layers, whose weights are adjusted based on these error terms. The key idea behind back propagation is the following: the weights of each node are adjusted in proportion to how much that node contributes to the error terms of the next layer’s nodes, for which its output acts as input. For back propagation to work, the desired output of each output layer node needs to be known for a given input. The resulting error can then be back-propagated to the hidden layer’s neurons. Here is a question to our readers: in the word2vec paper, where the neural network is being trained, what is the expected output for a given input?
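Before answering that, here is a minimal sketch of the back propagation update described above, written in plain Python. It is only an illustration under several assumptions that do not come from earlier columns: a tiny network with two inputs, two hidden nodes and one output node, sigmoid activations, a squared-error loss, no bias terms, and hand-picked starting weights. The function names (forward, backprop_step) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Return (hidden activations, network output) for input vector x."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output

def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    """One gradient-descent update for the loss 0.5 * (output - target) ** 2."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error term at the output node: (actual - expected) times the sigmoid slope.
    delta_out = (output - target) * output * (1.0 - output)
    # Each hidden node's error term is its share of the output error,
    # weighted by the connection that carries its activation forward.
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    new_w_out = [w_out[j] - lr * delta_out * hidden[j]
                 for j in range(len(hidden))]
    new_w_hidden = [[w_hidden[j][i] - lr * delta_hidden[j] * x[i]
                     for i in range(len(x))]
                    for j in range(len(hidden))]
    return new_w_hidden, new_w_out

# Illustrative values only (assumed, not taken from any data set).
x, target = [1.0, 0.0], 1.0
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [0.2, -0.1]

_, before = forward(x, w_hidden, w_out)
w_hidden, w_out = backprop_step(x, target, w_hidden, w_out)
_, after = forward(x, w_hidden, w_out)
print(before, after)
```

After the single update, the network’s output for this input moves closer to the target. Note that the whole procedure depends on the target value being available, which is exactly why the question about word2vec matters.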
Well, the answer is simple. The expected output used for training comes from the input corpus itself. If you consider the simplest variant of the word2vec continuous bag-of-words model, with only one word per context, the task is to predict the target word, given one word in context. If we make the assumption that the context word and target word are adjacent, this reduces to the bigram model, and we need to predict the succeeding word given a word in a continuously moving window over the corpus. This task requires no explicitly labelled data, only the input corpus itself. Thus, word2vec is a great example of a neural network method which can scale to large data, but does not require explicitly labelled data.
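To make this concrete, here is a small sketch of how (input, expected output) training pairs can be generated from raw text alone. The function names and the window parameter are hypothetical choices for illustration, and the sketch deliberately leaves out everything else a real word2vec implementation does (vocabulary building, sub-sampling, negative sampling and so on).

```python
def context_target_pairs(tokens, window=1):
    """Return (context_word, target_word) pairs for every position in tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((tokens[j], target))
    return pairs

def bigram_pairs(tokens):
    """Special case from the text: predict the succeeding word from the current one."""
    return list(zip(tokens, tokens[1:]))

corpus = "the quick brown fox".split()
print(bigram_pairs(corpus))
print(context_target_pairs(corpus, window=1))
```

Every pair is read directly off the corpus, so the ‘labels’ come for free as long as we have text.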
By now, you may be wondering why we need unsupervised learning with neural networks at all. After all, supervised learning has proved itself to be quite successful with neural networks. As mentioned earlier, in many cases, we are not able to apply deep learning techniques due to a lack of labelled data. Another, more fundamental reason for the attractiveness of unsupervised neural network techniques is that they seem to be closer to the way human beings develop their knowledge and perception of the world. For instance, as children build their knowledge of the world around them, they are not explicitly provided with labelled data. More importantly, human brains appear to follow a hierarchical learning method: the ability to represent abstract concepts, on which higher-level concepts are then built. If human beings can obtain much of their knowledge through unsupervised learning, how can we employ similar techniques in artificial neural networks for deep learning? This requires us to take a brief digression from artificial neural networks to human cognition.
One question which has been debated a lot in human cognition is that of ‘localist’ vs ‘distributed’ representation. A localist representation means that one neuron, or a very small number of neurons, is used to represent a specific concept in the brain. On the other hand, in a distributed representation, a concept is represented by a pattern of activity spread over a large number of neurons, and each neuron participates in the representation of multiple concepts. In other words, in a localist representation, it is possible to interpret the behaviour of a single neuron as corresponding to a specific concept or feature. In a distributed representation, it is not possible to map the activity of a single neuron in isolation to a concept, as its behaviour depends on the activity of a host of other neurons. An extreme version of the localist approach is represented by what are known as ‘grandmother cells’. Are there neurons that are associated with a single concept only? For instance, if ‘grandmother’ is a concept represented in the human brain, is there a single neuron that fires on recognising this concept? An affirmative answer to that question would favour the localist approach.
There has been a raging debate on the presence or absence of grandmother cells. Certain experiments in the past seemed to indicate the presence of the grandmother cell (http://tandfonline.com/doi/full/10.1080/23273798.2016.1267782). But recent research indicates a lack of sufficient evidence for grandmother cells. Earlier experiments indicated that certain specific neurons fired only when recognising a specific concept such as ‘grandmother’ or a popular personality such as Jennifer Aniston (https://www.newscientist.com/article/dn7567-why-your-brain-has-a-jennifer-aniston-cell/). It is now possible to think of alternative explanations for these experimental results. One strong possibility is the presence of sparse coding of neurons in human brains. Sparse neural coding means that only a relatively small number of neurons are used to represent different features of a concept. For instance, let us consider an image which has 1000 x 1000 pixels. Of the million pixels, only some are used to encode horizontal lines, certain others to encode vertical lines and so on. In fact, each sub-concept or sub-feature is represented by a selective, limited set of neurons. This means that out of the million pixels, only a few are activated for each image, enabling us to recognise a very large number of images. Thus, sparse coding provides an alternative explanation for the grandmother cell experiments. When you are shown the image of your grandmother, a few selective neurons encode this concept and, hence, they are the ones that get activated. It is possible that these neurons are also involved in representing various other concepts. So the sparse coding explanation does not preclude a distributed representation of knowledge in human brains.
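As a toy illustration of the difference between a localist code and a sparse distributed code, consider the following sketch. The concepts, the number of units and the activation patterns are all invented for the example; nothing here is a model of real neurons.

```python
# Localist: one dedicated unit per concept (a one-hot code).
localist = {
    "grandmother": [1, 0, 0],
    "cat":         [0, 1, 0],
    "car":         [0, 0, 1],
}

# Sparse distributed: six units, each concept activates only two of them,
# and a unit may take part in more than one concept.
sparse = {
    "grandmother": [1, 1, 0, 0, 0, 0],
    "cat":         [0, 1, 1, 0, 0, 0],
    "car":         [0, 0, 0, 1, 1, 0],
}

def concepts_using_unit(code, unit):
    """List the concepts whose pattern activates the given unit index."""
    return [name for name, pattern in code.items() if pattern[unit] == 1]

def active_fraction(pattern):
    """Fraction of units that are active in one pattern."""
    return sum(pattern) / len(pattern)

print(concepts_using_unit(localist, 0))
print(concepts_using_unit(sparse, 0))
print(concepts_using_unit(sparse, 1))
print(active_fraction(sparse["grandmother"]))
```

In the sparse code, an experimenter who happened to record only from unit 0 would see it respond to ‘grandmother’ alone and might conclude it is a grandmother cell, even though unit 1, which belongs to the same pattern, also takes part in representing ‘cat’.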
By now, you may be wondering how the human brain, interesting as it is, is relevant to deep learning and artificial neural networks. Well, it is important because the human visual recognition and perception system is one of the best-designed natural neural networks, and the holy grail of deep learning is to figure out how to design networks which can match or surpass the power of human cognition. This brings us back to the question of designing efficient unsupervised deep learning networks. For instance, if we just provide unlabelled YouTube videos to a deep neural network and allow it to learn from them in an unsupervised manner, what can the network learn? Given a multi-layer neural network for this task, can we interpret the activity of certain nodes as recognising colour changes, of other nodes as recognising horizontal lines, and of still others as recognising vertical lines?
The answer to the above questions is ‘Yes’. There are neural networks known as auto-encoders which are used for unsupervised deep learning. An auto-encoder learns the weights of the network using back propagation, wherein the expected output is set to be equal to the input. We will discuss auto-encoders in more detail in our next column. Also, if you were wondering about the experiment wherein an auto-encoder learns without any supervision from YouTube videos, here is the paper which describes it: https://arxiv.org/abs/1112.6209.
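As a small preview, the idea of setting the expected output equal to the input can be sketched in a few lines of plain Python. This is a toy example under stated assumptions, not the architecture used in the paper linked above: a single hidden layer of two sigmoid nodes, sigmoid outputs, squared reconstruction error, no bias terms, and four hand-made input patterns. All names (train, total_loss and so on) and all hyper-parameter values are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_enc, w_dec):
    """Encode x into hidden activations, then decode back to a reconstruction."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_enc]
    recon = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_dec]
    return hidden, recon

def total_loss(data, w_enc, w_dec):
    """Sum of squared reconstruction errors over the whole data set."""
    loss = 0.0
    for x in data:
        _, recon = forward(x, w_enc, w_dec)
        loss += 0.5 * sum((r - xi) ** 2 for r, xi in zip(recon, x))
    return loss

def train(data, n_hidden=2, epochs=2000, lr=0.5, seed=0):
    """Train with per-example back propagation; the target is the input itself."""
    rng = random.Random(seed)
    n_in = len(data[0])
    # Small random starting weights (assumed range, chosen for illustration).
    w_enc = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    w_dec = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
    for _ in range(epochs):
        for x in data:
            hidden, recon = forward(x, w_enc, w_dec)
            # Output error terms: reconstruction minus input, times sigmoid slope.
            delta_out = [(r - xi) * r * (1.0 - r) for r, xi in zip(recon, x)]
            # Hidden error terms: output errors routed back through the decoder.
            delta_hid = [hidden[j] * (1.0 - hidden[j]) *
                         sum(delta_out[k] * w_dec[k][j] for k in range(n_in))
                         for j in range(n_hidden)]
            for k in range(n_in):
                for j in range(n_hidden):
                    w_dec[k][j] -= lr * delta_out[k] * hidden[j]
            for j in range(n_hidden):
                for i in range(n_in):
                    w_enc[j][i] -= lr * delta_hid[j] * x[i]
    return w_enc, w_dec

# Four unlabelled input patterns; no expected outputs are supplied anywhere.
data = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
w_init = train(data, epochs=0)      # untrained weights from the same seed
w_enc, w_dec = train(data)
print(total_loss(data, *w_init), total_loss(data, w_enc, w_dec))
```

The reconstruction error after training should be lower than before training, even though no labels were ever provided. The two hidden activations for each pattern form a compressed code for that input.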
If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, wishing all our readers a wonderful and productive month ahead!