
There is another important consideration: How do we set the many parameters that control a pattern recognition system's functioning? These could include the number of vectors that we allow in the vector quantization step, the initial topology of hierarchical states (before the training phase of the hidden Markov model process prunes them back), the recognition threshold at each level of the hierarchy, the parameters that control the handling of the size parameters, and many others. We can establish these based on our intuition, but the results will be far from optimal.

We call these parameters "God parameters" because they are set prior to the self-organizing method of determining the topology of the hidden Markov models (or, in the biological case, before the person learns her lessons by similarly creating connections in her cortical hierarchy). This is perhaps a misnomer, given that these initial DNA-based design details are determined by biological evolution, though some may see the hand of God in that process (and while I do consider evolution to be a spiritual process, this discussion properly belongs in chapter 9).

When it came to setting these "God parameters" in our simulated hierarchical learning and recognizing system, we again took a cue from nature and decided to evolve them, in our case using a simulation of evolution. We used what are called genetic or evolutionary algorithms (GAs), which include simulated sexual reproduction and mutations.

Here is a simplified description of how this method works. First, we determine a way to code possible solutions to a given problem. If the problem is optimizing the design parameters for a circuit, then we define a list of all of the parameters (with a specific number of bits assigned to each parameter) that characterize the circuit. This list is regarded as the genetic code in the genetic algorithm. Then we randomly generate thousands or more genetic codes. Each such genetic code (which represents one set of design parameters) is considered a simulated "solution" organism.

Now we evaluate each simulated organism in a simulated environment by using a defined method to assess each set of parameters. This evaluation is a key to the success of a genetic algorithm. In our example, we would run each program generated by the parameters and judge it on appropriate criteria (did it complete the task, how long did it take, and so on). The best-solution organisms (the best designs) are allowed to survive, and the rest are eliminated.

Now we cause each of the survivors to multiply themselves until they reach the same number of solution creatures. This is done by simulating sexual reproduction: In other words, we create new offspring where each new creature draws one part of its genetic code from one parent and another part from a second parent. Usually no distinction is made between male or female organisms; it's sufficient to generate an offspring from any two arbitrary parents, so we're basically talking about same-sex marriage here. This is perhaps not as interesting as sexual reproduction in the natural world, but the relevant point here is having two parents. As these simulated organisms multiply, we allow some mutation (random change) in the chromosomes to occur.

We've now defined one generation of simulated evolution; now we repeat these steps for each subsequent generation. At the end of each generation we determine how much the designs have improved (that is, we compute the average improvement in the evaluation function over all the surviving organisms). When the degree of improvement in the evaluation of the design creatures from one generation to the next becomes very small, we stop this iterative cycle and use the best design(s) in the last generation. (For an algorithmic description of genetic algorithms, see this endnote.)11

The key to a genetic algorithm is that the human designers don't directly program a solution; rather, we let one emerge through an iterative process of simulated competition and improvement. Biological evolution is smart but slow, so to enhance its intelligence we greatly speed up its ponderous pace. The computer is fast enough to simulate many generations in a matter of hours or days, and we've occasionally had them run for as long as weeks to simulate hundreds of thousands of generations. But we have to go through this iterative process only once; as soon as we have let this simulated evolution run its course, we can apply the evolved and highly refined rules to real problems in a rapid fashion. In the case of our speech recognition systems, we used them to evolve the initial topology of the network and other critical parameters. We thus used two self-organizing methods: a GA to simulate the biological evolution that gave rise to a particular cortical design, and HHMMs to simulate the cortical organization that accompanies human learning.
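To make the steps above concrete, here is a minimal sketch of a genetic algorithm in Python. The function names, population sizes, and rates are illustrative assumptions, not the values used in our speech systems, and the toy evaluation function simply counts 1 bits in the genome.

```python
import random

def evolve(fitness, genome_length=8, population=200, survivors=20,
           mutation_rate=0.01, generations=500, tolerance=1e-9):
    """Minimal genetic algorithm over bit-string genomes (illustrative sketch)."""
    # Randomly generate an initial population of simulated "solution" organisms.
    pool = [[random.randint(0, 1) for _ in range(genome_length)]
            for _ in range(population)]
    best_avg = float("-inf")
    for _ in range(generations):
        # Evaluate every organism; the best designs survive, the rest are eliminated.
        pool.sort(key=fitness, reverse=True)
        elite = pool[:survivors]
        avg = sum(fitness(g) for g in elite) / len(elite)
        if avg - best_avg < tolerance:   # improvement has become very small: stop
            break
        best_avg = avg
        # Simulated sexual reproduction: each child draws part of its genetic
        # code from each of two arbitrary parents, with occasional mutation.
        pool = list(elite)
        while len(pool) < population:
            a, b = random.sample(elite, 2)
            cut = random.randrange(1, genome_length)
            child = [bit ^ (random.random() < mutation_rate)
                     for bit in a[:cut] + b[cut:]]
            pool.append(child)
    return max(pool, key=fitness)

# Toy evaluation function: maximize the number of 1 bits in the genome.
best = evolve(fitness=sum)
```

Note that the designer supplies only the evaluation function and the encoding; the solution itself emerges from the iterated cycle of selection, crossover, and mutation.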

Another major requirement for the success of a GA is a valid method of evaluating each possible solution. This evaluation needs to be conducted quickly, because it must take account of many thousands of possible solutions for each generation of simulated evolution. GAs are adept at handling problems with too many variables to compute precise analytic solutions. The design of an engine, for example, may involve more than a hundred variables and requires satisfying dozens of constraints; GAs used by researchers at General Electric were able to come up with jet engine designs that met the constraints more precisely than conventional methods.

When using GAs you must, however, be careful what you ask for. A genetic algorithm was used to solve a block-stacking problem, and it came up with a perfect solution...except that it had thousands of steps. The human programmers forgot to include minimizing the number of steps in their evaluation function.
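The lesson generalizes: the evaluation function must encode everything you actually care about. A brief sketch of the two possibilities (the Solution class and both fitness functions are hypothetical, purely for illustration):

```python
from dataclasses import dataclass

@dataclass
class Solution:
    moves: list            # the sequence of block-stacking steps
    final_stack: tuple     # the resulting arrangement of blocks

def naive_fitness(s, target):
    # Rewards only correctness: a 5,000-step solution that reaches
    # the target scores exactly the same as a 10-step one.
    return 1.0 if s.final_stack == target else 0.0

def better_fitness(s, target):
    # Also penalizes length, so simulated evolution prefers short solutions.
    if s.final_stack != target:
        return 0.0
    return 1.0 / (1.0 + len(s.moves))

short = Solution(moves=list(range(10)), final_stack=("A", "B", "C"))
long_ = Solution(moves=list(range(5000)), final_stack=("A", "B", "C"))
```

Under `naive_fitness` the two solutions tie, so evolution has no pressure toward brevity; `better_fitness` breaks the tie in favor of the ten-step solution.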

Scott Draves's Electric Sheep project is a GA that produces art. The evaluation function uses human evaluators in an open-source collaboration involving many thousands of people. The art moves through time and you can view it at electricsheep.org.

For speech recognition, the combination of genetic algorithms and hidden Markov models worked extremely well. Simulating evolution with a GA was able to substantially improve the performance of the HHMM networks. What evolution came up with was far superior to our original design, which was based on our intuition.

We then experimented with introducing a series of small variations in the overall system. For example, we would make perturbations (minor random changes) to the input. Another such change was to have adjacent Markov models "leak" into one another by causing the results of one Markov model to influence models that are "nearby." Although we did not realize it at the time, the sorts of adjustments we were experimenting with are very similar to the types of modifications that occur in biological cortical structures.

At first, such changes hurt performance (as measured by accuracy of recognition). But if we reran evolution (that is, reran the GA) with these alterations in place, it would adapt the system accordingly, optimizing it for these introduced modifications. In general, this would restore performance. If we then removed the changes we had introduced, performance would be again degraded, because the system had been evolved to compensate for the changes. The adapted system became dependent on the changes.

One type of alteration that actually helped performance (after rerunning the GA) was to introduce small random changes to the input. The reason for this is the well-known "overfitting" problem in self-organizing systems. There is a danger that such a system will overfit to the specific examples contained in the training sample. By making random adjustments to the input, the more invariant patterns in the data survive, and the system thereby learns these deeper patterns. This helped only if we reran the GA with the randomization feature on.
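This input-randomization technique can be sketched in a few lines; the noise level and the uniform distribution are assumptions for illustration rather than the parameters of our actual system.

```python
import random

def jitter(sample, noise=0.02):
    """Return a copy of the input vector with small random perturbations added."""
    return [x + random.uniform(-noise, noise) for x in sample]

def training_stream(data, epochs, noise=0.02):
    """Yield a freshly perturbed copy of each training sample on every pass,
    so only patterns invariant under the noise are reinforced across epochs."""
    for _ in range(epochs):
        for sample in data:
            yield jitter(sample, noise)
```

Because the learner never sees exactly the same input twice, it cannot memorize incidental quirks of individual samples; only the regularities that survive the perturbation are learned.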

This introduces a dilemma in our understanding of our biological cortical circuits. It had been noticed, for example, that there might indeed be a small amount of leakage from one cortical connection to another, resulting from the way that biological connections are formed: The electrochemistry of the axons and dendrites is apparently subject to the electromagnetic effects of nearby connections. Suppose we were able to run an experiment where we removed this effect in an actual brain. That would be difficult to actually carry out, but not necessarily impossible. Suppose we conducted such an experiment and found that the cortical circuits worked less effectively without this neural leakage. We might then conclude that this phenomenon was a very clever design by evolution and was critical to the cortex's achieving its level of performance. We might further point out that such a result shows that the orderly model of the flow of patterns up the conceptual hierarchy and the flow of predictions down the hierarchy was in fact much more complicated because of this intricate influence of connections on one another.

But that would not necessarily be an accurate conclusion. Consider our experience with a simulated cortex based on HHMMs, in which we implemented a modification very similar to interneuronal cross talk. If we then ran evolution with that phenomenon in place, performance would be restored (because the evolutionary process adapted to it). If we then removed the cross talk, performance would be compromised again. In the biological case, evolution (that is, biological evolution) was indeed "run" with this phenomenon in place. The detailed parameters of the system have thereby been set by biological evolution to be dependent on these factors, so that changing them will negatively affect performance unless we run evolution again. Doing so is feasible in the simulated world, where evolution only takes days or weeks, but in the biological world it would require tens of thousands of years.

So how can we tell whether a particular design feature of the biological neocortex is a vital innovation introduced by biological evolution-that is, one that is instrumental to our level of intelligence-or merely an artifact that the design of the system is now dependent on but could have evolved without? We can answer that question simply by running simulated evolution with and without these particular variations to the details of the design (for example, with and without connection cross talk). We can even do so with biological evolution if we're examining the evolution of a colony of microorganisms where generations are measured in hours, but it is not practical for complex organisms such as humans. This is another one of the many disadvantages of biology.

Getting back to our work in speech recognition, we found that if we ran evolution (that is, a GA) separately on the initial design of (1) the hierarchical hidden Markov models that were modeling the internal structure of phonemes and (2) the HHMMs' modeling of the structures of words and phrases, we got even better results. Both levels of the system were using HHMMs, but the GA would evolve design variations between these different levels. This approach still allowed the modeling of phenomena that occur in between the two levels, such as the smearing of phonemes that often happens when we string certain words together (for example, "How are you all doing?" might become "How're y'all doing?").

It is likely that a similar phenomenon took place in different biological cortical regions, in that they have evolved small differences based on the types of patterns they deal with. Whereas all of these regions use the same essential neocortical algorithm, biological evolution has had enough time to fine-tune the design of each of them to be optimal for their particular patterns. However, as I discussed earlier, neuroscientists and neurologists have noticed substantial plasticity in these areas, which supports the idea of a general neocortical algorithm. If the fundamental methods in each region were radically different, then such interchangeability among cortical regions would not be possible.

The systems we created in our research using this combination of self-organizing methods were very successful. In speech recognition, they were able for the first time to handle fully continuous speech and relatively unrestricted vocabularies. We were able to achieve a high accuracy rate on a wide variety of speakers, accents, and dialects. The current state of the art as this book is being written is represented by a product called Dragon Naturally Speaking (Version 11.5) for the PC from Nuance (formerly Kurzweil Computer Products). I suggest that people try it if they are skeptical about the performance of contemporary speech recognition: accuracies are often 99 percent or higher, after a few minutes of training on your voice, on continuous speech and relatively unrestricted vocabularies. Dragon Dictation is a simpler but still impressive free app for the iPhone that requires no voice training. Siri, the personal assistant on contemporary Apple iPhones, uses the same speech recognition technology with extensions to handle natural-language understanding.

The performance of these systems is a testament to the power of mathematics. With them we are essentially computing what is going on in the neocortex of a speaker (even though we have no direct access to that person's brain) as a vital step in recognizing what the person is saying and, in the case of systems like Siri, what those utterances mean. We might wonder, if we were to actually look inside the speaker's neocortex, would we see connections and weights corresponding to the hierarchical hidden Markov models computed by the software? Almost certainly we would not find a precise match; the neuronal structures would invariably differ in many details compared with the models in the computer. However, I would maintain that there must be an essential mathematical equivalence to a high degree of precision between the actual biology and our attempt to emulate it; otherwise these systems would not work as well as they do.

LISP

LISP (LISt Processor) is a computer language, originally specified by AI pioneer John McCarthy (1927–2011) in 1958. As its name suggests, LISP deals with lists. Each LISP statement is a list of elements; each element is either another list or an "atom," which is an irreducible item constituting either a number or a symbol. A list included in a list can be the list itself, hence LISP is capable of recursion. Another way that LISP statements can be recursive is if a list includes a list, which includes another in turn, and so on until the original list is reached. Because lists can include lists, LISP is also capable of hierarchical processing. A list can be a conditional such that it "fires" only if its elements are satisfied. In this way, hierarchies of such conditionals can be used to identify increasingly abstract qualities of a pattern.
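The fire-if-elements-are-satisfied idea can be illustrated with nested Python lists standing in for LISP lists (a simplified sketch; real LISP conditionals are richer than this all-elements-present test, and the "stroke" atoms are invented for the example):

```python
def fires(pattern, observed_atoms):
    """A pattern is either an atom or a list of sub-patterns.

    An atom fires if it has been observed; a list fires only when every
    one of its elements fires. Nested lists thus form a hierarchy of
    conditionals recognizing increasingly abstract patterns.
    """
    if not isinstance(pattern, list):          # an irreducible atom
        return pattern in observed_atoms
    return all(fires(element, observed_atoms) for element in pattern)

# A two-level hierarchy: the word fires only if both letter-level lists fire.
word = [["stroke-a1", "stroke-a2"], ["stroke-b1"]]
print(fires(word, {"stroke-a1", "stroke-a2", "stroke-b1"}))   # True
print(fires(word, {"stroke-a1", "stroke-b1"}))                # False
```

The recursion mirrors the list-within-list structure directly: evaluating the top-level list automatically evaluates every list beneath it.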

LISP became the rage in the artificial intelligence community in the 1970s and early 1980s. The conceit of the LISP enthusiasts of the earlier decade was that the language mirrored the way the human brain worked-that any intelligent process could most easily and efficiently be coded in LISP. There followed a mini-boomlet in ”artificial intelligence” companies that offered LISP interpreters and related LISP products, but when it became apparent in the mid-1980s that LISP itself was not a shortcut to creating intelligent processes, the investment balloon collapsed.

It turns out that the LISP enthusiasts were not entirely wrong. Essentially, each pattern recognizer in the neocortex can be regarded as a LISP statement: each one constitutes a list of elements, and each element can be another list. The neocortex is therefore indeed engaged in list processing of a symbolic nature very similar to that which takes place in a LISP program. Moreover, it processes all 300 million LISP-like "statements" simultaneously.

However, there were two important features missing from the world of LISP, one of which was learning. LISP programs had to be coded line by line by human programmers. There were attempts to automatically code LISP programs using a variety of methods, but these were not an integral part of the language's concept. The neocortex, in contrast, programs itself, filling its "statements" (that is, the lists) with meaningful and actionable information from its own experience and from its own feedback loops. This is a key principle of how the neocortex works: Each one of its pattern recognizers (that is, each LISP-like statement) is capable of filling in its own list and connecting itself both up and down to other lists. The second difference is the size parameters. One could create a variant of LISP (coded in LISP) that would allow for handling such parameters, but these are not part of the basic language.

LISP is consistent with the original philosophy of the AI field, which was to find intelligent solutions to problems and to code them directly in computer languages. The first attempt at a self-organizing method that would teach itself from experience, neural nets, was not successful because it did not provide a means to modify the topology of the system in response to learning. The hierarchical hidden Markov model effectively provided that through its pruning mechanism. Today, the HHMM together with its mathematical cousins makes up a major portion of the world of AI.

A corollary of the observation of the similarity of LISP and the list structure of the neocortex is an argument made by those who insist that the brain is too complicated for us to understand. These critics point out that the brain has trillions of connections, and since each one must be there specifically by design, they constitute the equivalent of trillions of lines of code. As we've seen, I've estimated that there are on the order of 300 million pattern processors in the neocortex, or 300 million lists where each element in the list is pointing to another list (or, at the lowest conceptual level, to a basic irreducible pattern from outside the neocortex). But 300 million is still a reasonably big number of LISP statements and indeed is larger than any human-written program in existence.

However, we need to keep in mind that these lists are not actually specified in the initial design of the nervous system. The brain creates these lists itself and connects the levels automatically from its own experiences. This is the key secret of the neocortex. The processes that accomplish this self-organization are much simpler than the 300 million statements that constitute the capacity of the neocortex. Those processes are specified in the genome. As I will demonstrate in chapter 11, the amount of unique information in the genome (after lossless compression) as applied to the brain is about 25 million bytes, which is equivalent to less than a million lines of code. The actual algorithmic complexity is even less than that, as most of the 25 million bytes of genetic information pertain to the biological needs of the neurons, and not specifically to their information-processing capability. However, even 25 million bytes of design information is a level of complexity we can handle.

Hierarchical Memory Systems

As I discussed in chapter 3, Jeff Hawkins and Dileep George in 2003 and 2004 developed a model of the neocortex incorporating hierarchical lists that was described in Hawkins and Blakeslee's 2004 book On Intelligence. A more up-to-date and very elegant presentation of the hierarchical temporal memory method can be found in Dileep George's 2008 doctoral dissertation.12 Numenta has implemented it in a system called NuPIC (Numenta Platform for Intelligent Computing) and has developed pattern recognition and intelligent data-mining systems for such clients as Forbes and Power Analytics Corporation. After working at Numenta, George has started a new company called Vicarious Systems with funding from the Founders Fund (managed by Peter Thiel, the venture capitalist behind Facebook, and Sean Parker, the first president of Facebook) and from Good Ventures, led by Dustin Moskovitz, cofounder of Facebook. George reports significant progress in automatically modeling, learning, and recognizing information with a substantial number of hierarchies. He calls his system a "recursive cortical network" and plans applications for medical imaging and robotics, among other fields.

The technique of hierarchical hidden Markov models is mathematically very similar to these hierarchical memory systems, especially if we allow the HHMM system to organize its own connections between pattern recognition modules. As mentioned earlier, HHMMs provide for an additional important element, which is modeling the expected distribution of the magnitude (on some continuum) of each input in computing the probability of the existence of the pattern under consideration. I have recently started a new company called Patterns, Inc., which intends to develop hierarchical self-organizing neocortical models that utilize HHMMs and related techniques for the purpose of understanding natural language.
An important emphasis will be on the ability for the system to design its own hierarchies in a manner similar to a biological neocortex. Our envisioned system will continually read a wide range of material such as Wikipedia and other knowledge resources as well as listen to everything you say and watch everything you write (if you let it). The goal is for it to become a helpful friend answering your questions, before you even formulate them, and giving you useful information and tips as you go through your day.

The Moving Frontier of AI: Climbing the Competence Hierarchy

1. A long tiresome speech delivered by a frothy pie topping.

2. A garment worn by a child, perhaps aboard an operatic ship.

3. Wanted for a twelve-year crime spree of eating King Hrothgar's warriors; officer Beowulf has been assigned the case.

4. It can mean to develop gradually in the mind or to carry during pregnancy.

5. National Teacher Day and Kentucky Derby Day.

6. Wordsworth said they soar but never roam.

7. Four-letter word for the iron fitting on the hoof of a horse or a card-dealing box in a casino.