The filter bank is copying what the human cochlea does, which is the initial step in our biological processing of sound. The software first identified phonemes based on distinguishing patterns of frequencies and then identified words based on characteristic sequences of phonemes.
A spectrogram of three vowels. From left to right: [i] as in "appreciate," [u] as in "acoustic," and [a] as in "ah." The Y axis represents frequency of sound; the darker the band, the more acoustic energy there is at that frequency.

A spectrogram of a person saying the word "hide." The horizontal lines show the formants, which are sustained frequencies that have especially high energy.

The result was partially successful. We could train our device to learn the patterns for a particular person using a moderate-sized vocabulary, measured in thousands of words. When we attempted to recognize tens of thousands of words, handle multiple speakers, and allow fully continuous speech (that is, speech with no pauses between words), we ran into the invariance problem. Different people enunciated the same phoneme differently: one person's "e" phoneme may sound like someone else's "ah." Even the same person was inconsistent in the way she spoke a particular phoneme. The pattern of a phoneme was often affected by other phonemes nearby. Many phonemes were left out completely. The pronunciation of words (that is, how phonemes are strung together to form words) was also highly variable and dependent on context. The linguistic rules we had programmed were breaking down and could not keep up with the extreme variability of spoken language.
It became clear to me at the time that the essence of human pattern and conceptual recognition was based on hierarchies. This is certainly apparent for human language, which constitutes an elaborate hierarchy of structures. But what is the element at the base of the structures? That was the first question I considered as I looked for ways to automatically recognize fully normal human speech.
Sound enters the ear as a vibration of the air and is converted by the approximately 3,000 inner hair cells in the cochlea into multiple frequency bands. Each hair cell is tuned to a particular frequency (note that we perceive frequencies as tones) and each acts as a frequency filter, emitting a signal whenever there is sound at or near its resonant frequency. As it leaves the human cochlea, sound is thereby represented by approximately 3,000 separate signals, each one signifying the time-varying intensity of a narrow band of frequencies (with substantial overlap among these bands).
Even though it was apparent that the brain was massively parallel, it seemed impossible to me that it was doing pattern matching on 3,000 separate auditory signals. I doubted that evolution could have been that inefficient. We now know that very substantial data reduction does indeed take place in the auditory nerve before sound signals ever reach the neocortex.
In our software-based speech recognizers, we also used filters implemented in software: sixteen, to be exact (which we later increased to thirty-two, as we found there was not much benefit to going much higher than this). So in our system, each point in time was represented by sixteen numbers. We needed to reduce these sixteen streams of data into one while at the same time emphasizing the features that are significant in recognizing speech.
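To make this concrete, here is a minimal sketch, in Python, of how a software filter bank might reduce one short frame of audio to sixteen band energies. The frame length, sample rate, windowing, and equal-width band spacing are illustrative assumptions, not the parameters of the system described here.

```python
import numpy as np

def filter_bank_energies(frame, num_bands=16):
    """Reduce one short audio frame to num_bands frequency-band energies.

    A simplified stand-in for the software filter bank described in the
    text; the windowing and equal-width band spacing are illustrative
    assumptions (a real system would likely space bands perceptually).
    """
    windowed = frame * np.hamming(len(frame))       # taper the frame edges
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2   # power spectrum
    # Split the spectrum into num_bands bands and sum the energy in each.
    bands = np.array_split(spectrum, num_bands)
    return np.array([band.sum() for band in bands])

# Each brief slice of sound becomes a vector of sixteen numbers.
frame = np.random.randn(160)   # stand-in for 10 ms of 16 kHz audio
vector = filter_bank_energies(frame)
print(vector.shape)            # (16,)
```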
We used a mathematically optimal technique to accomplish this, called vector quantization. Consider that at any particular point in time, sound (at least from one ear) was represented in our software by sixteen different numbers: that is, the outputs of the sixteen frequency filters. (In the human auditory system the figure would be 3,000, representing the outputs of the 3,000 cochlear inner hair cells.) In mathematical terminology, each such set of numbers (whether 3,000 in the biological case or 16 in our software implementation) is called a vector.
For simplicity, let's consider the process of vector quantization with vectors of two numbers. Each vector can be considered a point in two-dimensional space.
If we have a very large sample of such vectors and plot them, we are likely to notice clusters forming.
In order to identify the clusters, we need to decide how many we will allow. In our project we generally allowed 1,024 clusters so that we could number them and assign each cluster a 10-bit label (because 2¹⁰ = 1,024). Our sample of vectors represents the diversity that we expect. We tentatively assign the first 1,024 vectors to be one-point clusters. We then consider the 1,025th vector and find the point that it is closest to. If that distance is greater than the smallest distance between any pair of the 1,024 points, we consider it as the beginning of a new cluster. We then collapse the two (one-point) clusters that are closest together into a single cluster. We are thus still left with 1,024 clusters. (If instead the new vector is closer to an existing point than any two clusters are to each other, we simply add it to the cluster of the point it is closest to.) After processing the 1,025th vector, one of those clusters now has more than one point. We keep processing points in this way, always maintaining 1,024 clusters. After we have processed all the points, we represent each multipoint cluster by the geometric center of the points in that cluster.
We continue this iterative process until we have run through all the sample points. Typically we would process millions of points into 1,024 (2¹⁰) clusters; we've also used 2,048 (2¹¹) or 4,096 (2¹²) clusters. Each cluster is represented by one vector that is at the geometric center of all the points in that cluster. Thus the total of the distances of all the points in the cluster to the center point of the cluster is as small as possible.
The result of this technique is that instead of having the millions of points that we started with (and an even larger number of possible points), we have now reduced the data to just 1,024 points that use the space of possibilities optimally. Parts of the space that are never used are not assigned any clusters.
We then assign a number to each cluster (in our case, 0 to 1,023). That number is the reduced, "quantized" representation of that cluster, which is why the technique is called vector quantization. Any new input vector that arrives in the future is then represented by the number of the cluster whose center point is closest to this new input vector.
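The clustering procedure described above can be sketched in a few lines of code. This toy version uses two-dimensional vectors and eight clusters rather than sixteen dimensions and 1,024 clusters; the Euclidean distance and the seeding scheme are assumptions for illustration, not the original system's code.

```python
import numpy as np

def vector_quantize(samples, num_clusters=8):
    """Greedy clustering along the lines described in the text."""
    # Tentatively make the first num_clusters samples one-point clusters.
    clusters = [[p] for p in samples[:num_clusters]]

    def centers():
        # Each cluster is represented by the geometric center of its points.
        return np.array([np.mean(c, axis=0) for c in clusters])

    for point in samples[num_clusters:]:
        ctrs = centers()
        dists = np.linalg.norm(ctrs - point, axis=1)
        nearest = int(np.argmin(dists))
        # Smallest distance between any pair of current cluster centers.
        pair = np.linalg.norm(ctrs[:, None] - ctrs[None, :], axis=2)
        pair[np.diag_indices(len(ctrs))] = np.inf
        i, j = np.unravel_index(np.argmin(pair), pair.shape)
        if dists[nearest] > pair[i, j]:
            # Point is far from everything: merge the two closest clusters
            # and start a new cluster here, keeping the count fixed.
            clusters[i].extend(clusters[j])
            clusters[j] = [point]
        else:
            clusters[nearest].append(point)

    return centers()

codebook = vector_quantize(np.random.rand(1000, 2))
print(codebook)  # eight representative points covering the used space
```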
We can now precompute a table with the distance of the center point of every cluster to every other center point. We thereby have instantly available the distance of this new input vector (which we represent by this quantized point, in other words, by the number of the cluster that this new point is closest to) to every other cluster. Since we are only representing points by their closest cluster, we now know the distance of this point to any other possible point that might come along.
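Continuing the sketch above (the `codebook` array carries over from the previous example), the precomputed distance table is a small extension:

```python
import numpy as np

# Precompute the distance from every cluster center to every other center.
# Any two future inputs can then be compared by a single table lookup on
# their cluster numbers.
table = np.linalg.norm(codebook[:, None] - codebook[None, :], axis=2)

def quantize(vector):
    """Replace an input vector by the number of its nearest cluster."""
    return int(np.argmin(np.linalg.norm(codebook - vector, axis=1)))

a = quantize(np.array([0.1, 0.9]))
b = quantize(np.array([0.8, 0.2]))
print(table[a, b])  # approximate distance between the two inputs
```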
I described the technique above using vectors with only two numbers each, but working with sixteen-element vectors is entirely analogous to the simpler example. Because we chose vectors with sixteen numbers representing sixteen different frequency bands, each point in our system was a point in sixteen-dimensional space. It is difficult for us to imagine a space with more than three dimensions (perhaps four, if we include time), but mathematics has no such inhibitions.
We have accomplished four things with this process. First, we have greatly reduced the complexity of the data. Second, we have reduced sixteen-dimensional data to one-dimensional data (that is, each sample is now a single number). Third, we have improved our ability to find invariant features, because we are emphasizing portions of the space of possible sounds that convey the most information. Most combinations of frequencies are physically impossible or at least very unlikely, so there is no reason to give equal space to unlikely combinations of inputs as to likely ones. This technique reduces the data to equally likely possibilities. The fourth benefit is that we can use one-dimensional pattern recognizers, even though the original data consisted of many more dimensions. This turned out to be the most efficient approach to utilizing available computational resources.
Reading Your Mind with Hidden Markov Models
With vector quantization, we simplified the data in a way that emphasized key features, but we still needed a way to represent the hierarchy of invariant features that would make sense of new information. Having worked in the field of pattern recognition at that time (the early 1980s) for twenty years, I knew that one-dimensional representations were far more powerful, efficient, and amenable to invariant results. There was not a lot known about the neocortex in the early 1980s, but based on my experience with a variety of pattern recognition problems, I assumed that the brain was also likely to be reducing its multidimensional data (whether from the eyes, the ears, or the skin) using a one-dimensional representation, especially as concepts rose in the neocortex's hierarchy.
For the speech recognition problem, the organization of information in the speech signal appeared to be a hierarchy of patterns, with each pattern represented by a linear string of elements with a forward direction. Each element of a pattern could be another pattern at a lower level, or a fundamental unit of input (which in the case of speech recognition would be our quantized vectors).
You will recognize this situation as consistent with the model of the neocortex that I presented earlier. Human speech, therefore, is produced by a hierarchy of linear patterns in the brain. If we could examine these patterns in the brain of the person speaking, it would be a simple matter to match her new speech utterances against her brain patterns and understand what the person was saying. Unfortunately we do not have direct access to the brain of the speaker; the only information we have is what she actually said. Of course, that is the whole point of spoken language: the speaker is sharing a piece of her mind with her utterance.
So I wondered: Was there a mathematical technique that would enable us to infer the patterns in the speaker's brain based on her spoken words? One utterance would obviously not be sufficient, but if we had a large number of samples, could we use that information to essentially read the patterns in the speaker's neocortex (or at least formulate something mathematically equivalent that would enable us to recognize new utterances)?
People often fail to appreciate how powerful mathematics can be; keep in mind that our ability to search much of human knowledge in a fraction of a second with search engines is based on a mathematical technique. For the speech recognition problem I was facing in the early 1980s, it turned out that the technique of hidden Markov models fit the bill rather perfectly. The Russian mathematician Andrei Andreyevich Markov (1856–1922) built a mathematical theory of hierarchical sequences of states. The model was based on the possibility of traversing the states in one chain, and if that was successful, triggering a state in the next higher level in the hierarchy. Sound familiar?
A simple example of one layer of a hidden Markov model. S1 through S4 represent the "hidden" internal states. The Pi,j transitions each represent the probability of going from state Si to state Sj. These probabilities are determined by the system learning from training data (including during actual use). A new sequence (such as a new spoken utterance) is matched against these probabilities to determine the likelihood that this model produced the sequence.
Markov's model included probabilities of each state's successfully occurring. He went on to hypothesize a situation in which a system has such a hierarchy of linear sequences of states, but those are unable to be directly examined, hence the name hidden Markov models. The lowest level of the hierarchy emits signals, which are all we are allowed to see. Markov provides a mathematical technique to compute what the probabilities of each transition must be based on the observed output. The method was subsequently refined by Norbert Wiener in 1923. Wiener's refinement also provided a way to determine the connections in the Markov model; essentially any connection with too low a probability was considered not to exist. This is essentially how the human neocortex trims connections: if they are rarely or never used, they are considered unlikely and are pruned away. In our case, the observed output is the speech signal created by the person talking, and the state probabilities and connections of the Markov model constitute the neocortical hierarchy that produced it.
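As a concrete illustration of the machinery, here is a minimal sketch of the "forward" computation for a hidden Markov model: given assumed transition and emission probabilities, it returns the likelihood that the model produced an observed sequence. The two-state model and all its numbers are made up for illustration; they are not from the system described in the text.

```python
import numpy as np

def sequence_likelihood(observations, initial, transition, emission):
    """Forward algorithm: probability that an HMM with the given
    parameters produced the observed sequence.

    observations: sequence of symbol indices (e.g., quantized vectors)
    initial:      initial[i]       = P(first state is i)
    transition:   transition[i, j] = P(state j follows state i)
    emission:     emission[i, k]   = P(state i emits symbol k)
    """
    alpha = initial * emission[:, observations[0]]
    for symbol in observations[1:]:
        # Probability of reaching each state and emitting the next symbol,
        # summed over all paths so far.
        alpha = (alpha @ transition) * emission[:, symbol]
    return alpha.sum()

# Toy two-state model with illustrative (made-up) probabilities.
initial = np.array([0.6, 0.4])
transition = np.array([[0.7, 0.3],
                       [0.2, 0.8]])
emission = np.array([[0.9, 0.1],    # state 0 mostly emits symbol 0
                     [0.2, 0.8]])   # state 1 mostly emits symbol 1
print(sequence_likelihood([0, 0, 1], initial, transition, emission))
```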
I envisioned a system in which we would take samples of human speech, apply the hidden Markov model technique to infer a hierarchy of states with connections and probabilities (essentially a simulated neocortex for producing speech), and then use this inferred hierarchical network of states to recognize new utterances. To create a speaker-independent system, we would use samples from many different individuals to train the hidden Markov models. Because we added the element of hierarchy to represent the hierarchical nature of information in language, these models were properly called hierarchical hidden Markov models (HHMMs).
My colleagues at Kurzweil Applied Intelligence were skeptical that this technique would work, given that it was a self-organizing method reminiscent of neural nets, which had fallen out of favor and with which we had had little success. I pointed out that the network in a neural net system is fixed and does not adapt to the input: The weights adapt, but the connections do not. In the Markov model system, if it was set up correctly, the system would prune unused connections so as to essentially adapt the topology.
I established what was considered a "skunk works" project (an organizational term for a project off the beaten path that has little in the way of formal resources) that consisted of me, one part-time programmer, and an electrical engineer (to create the frequency filter bank). To the surprise of my colleagues, the effort turned out to be very successful: it recognized speech drawn from a large vocabulary with high accuracy.
After that experiment, all of our subsequent speech recognition efforts have been based on hierarchical hidden Markov models. Other speech recognition companies appeared to discover the value of this method independently, and since the mid-1980s most work in automated speech recognition has been based on this approach. Hidden Markov models are also used in speech synthesis; keep in mind that our biological cortical hierarchy is used not only to recognize input but also to produce output, for example, speech and physical movement.
HHMMs are also used in systems that understand the meaning of natural-language sentences, which represents going up the conceptual hierarchy.
Hidden Markov states and possible transitions to produce a sequence of words in natural-language text.
To understand how the HHMM method works, we start out with a network that consists of all the state transitions that are possible. The vector quantization method described above is critical here, because otherwise there would be too many possibilities to consider.
Here is a possible simplified initial topology:
A simple hidden Markov model topology to recognize two spoken words.
Sample utterances are processed one by one. For each, we iteratively modify the probabilities of the transitions to better reflect the input sample we have just processed. The Markov models used in speech recognition code the likelihood that specific patterns of sound are found in each phoneme, how the phonemes influence one another, and the likely orders of phonemes. The system can also include probability networks on higher levels of language structure, such as the order of words, the inclusion of phrases, and so on up the hierarchy of language.
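Here is a toy sketch of that iterative updating, under a strong simplification: it assumes each training utterance has already been aligned to a state sequence, whereas real training must infer the hidden alignment (for instance with the Baum-Welch procedure, which is not shown). The state count and sequences are made up for illustration.

```python
import numpy as np

def update_transition_probabilities(counts, state_sequences):
    """Re-estimate transition probabilities from training sequences.

    Simplification: each utterance is assumed to be already aligned to a
    single state sequence; real HHMM training infers hidden alignments.
    """
    for states in state_sequences:
        for i, j in zip(states, states[1:]):
            counts[i, j] += 1   # one more observed i -> j transition
    # Normalize each row so outgoing probabilities sum to 1.
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals,
                     out=np.zeros_like(counts), where=totals > 0)

counts = np.zeros((3, 3))   # three states, all transitions initially unseen
probs = update_transition_probabilities(counts, [[0, 1, 1, 2], [0, 1, 2, 2]])
print(probs)
```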
Whereas our previous speech recognition systems incorporated specific rules about phoneme structures and sequences explicitly coded by human linguists, the new HHMM-based system was not explicitly told that there are forty-four phonemes in English, the sequences of vectors that were likely for each phoneme, or what phoneme sequences were more likely than others. We let the system discover these ”rules” for itself from thousands of hours of transcribed human speech data. The advantage of this approach over hand-coded rules is that the models develop probabilistic rules of which human experts are often not aware. We noticed that many of the rules that the system had automatically learned from the data differed in subtle but important ways from the rules established by human experts.
Once the network was trained, we began to attempt to recognize speech by considering the alternate paths through the network and picking the path that was most likely, given the actual sequence of input vectors we had seen. In other words, if we saw a sequence of states that was likely to have produced that utterance, we concluded that the utterance came from that cortical sequence. This simulated HHMM-based neocortex included word labels, so it was able to propose a transcription of what it heard.
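Picking the most likely path through the network corresponds to what is generally known as Viterbi decoding; a minimal sketch follows, reusing the toy two-state model from the earlier example. It illustrates the principle rather than the original system's code.

```python
import numpy as np

def most_likely_path(observations, initial, transition, emission):
    """Viterbi decoding: the single most probable hidden-state path
    for an observed sequence of symbols."""
    prob = initial * emission[:, observations[0]]
    back = []
    for symbol in observations[1:]:
        # For each destination state, score every predecessor and keep
        # the best one.
        scores = prob[:, None] * transition
        back.append(scores.argmax(axis=0))
        prob = scores.max(axis=0) * emission[:, symbol]
    # Trace the best path backward from the most probable final state.
    path = [int(prob.argmax())]
    for best_prev in reversed(back):
        path.append(int(best_prev[path[-1]]))
    return list(reversed(path))

initial = np.array([0.6, 0.4])
transition = np.array([[0.7, 0.3], [0.2, 0.8]])
emission = np.array([[0.9, 0.1], [0.2, 0.8]])
print(most_likely_path([0, 1, 1], initial, transition, emission))
```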
We were then able to improve our results further by continuing to train the network while we were using it for recognition. As we have discussed, simultaneous recognition and learning also take place at every level in our biological neocortical hierarchy.
Evolutionary (Genetic) Algorithms