Part 13 (1/2)
By the late 1980s expert systems were incorporating the idea of uncertainty and could combine many sources of probabilistic evidence to make a decision. The MYCIN system pioneered this approach. A typical MYCIN ”rule” reads:
If the infection which requires therapy is meningitis, and the type of the infection is fungal, and organisms were not seen on the stain of the culture, and the patient is not a compromised host, and the patient has been to an area that is endemic for coccidiomycoses, and the race of the patient is Black, Asian, or Indian, and the cryptococcal antigen in the csf test was not positive, THEN there is a 50 percent chance that cryptococcus is not one of the organisms which is causing the infection.
Although a single probabilistic rule such as this would not be sufficient by itself to make a useful statement, by combining thousands of such rules the evidence can be marshaled and combined to make reliable decisions.
Probably the longest-running expert system project is CYC (for enCYClopedic), created by Doug Lenat and his colleagues at Cycorp. Initiated in 1984, CYC has been coding commonsense knowledge to provide machines with an ability to understand the unspoken a.s.sumptions underlying human ideas and reasoning. The project has evolved from hard-coded logical rules to probabilistic ones and now includes means of extracting knowledge from written sources (with human supervision). The original goal was to generate one million rules, which reflects only a small portion of what the average human knows about the world. Lenat's latest goal is for CYC to master ”100 million things, about the number a typical person knows about the world, by 2007.”166 Another ambitious expert system is being pursued by Darryl Macer, a.s.sociate professor of biological sciences at the University of Tsukuba in j.a.pan. He plans to develop a system incorporating all human ideas.167 One application would be to inform policy makers of which ideas are held by which community. One application would be to inform policy makers of which ideas are held by which community.
Bayesian Nets. Over the last decade a technique called Bayesian logic has created a robust mathematical foundation for combining thousands or even millions of such probabilistic rules in what are called ”belief networks” or Bayesian nets. Originally devised by English mathematician Thomas Bayes and published posthumously in 1763, the approach is intended to determine the likelihood of future events based on similar occurrences in the past. Over the last decade a technique called Bayesian logic has created a robust mathematical foundation for combining thousands or even millions of such probabilistic rules in what are called ”belief networks” or Bayesian nets. Originally devised by English mathematician Thomas Bayes and published posthumously in 1763, the approach is intended to determine the likelihood of future events based on similar occurrences in the past.168 Many expert systems based on Bayesian techniques gather data from experience in an ongoing fas.h.i.+on, thereby continually learning and improving their decision making. Many expert systems based on Bayesian techniques gather data from experience in an ongoing fas.h.i.+on, thereby continually learning and improving their decision making.
The most promising type of spam filters are based on this method. I personally use a spam filter called SpamBayes, which trains itself on e-mail that you have identified as either ”spam” or ”okay.”169 You start out by presenting a folder of each to the filter. It trains its Bayesian belief network on these two files and a.n.a.lyzes the patterns of each, thus enabling it to automatically move subsequent e-mail into the proper category. It continues to train itself on every subsequent e-mail, especially when it's corrected by the user. This filter has made the spam situation manageable for me, which is saying a lot, as it weeds out two hundred to three hundred spam messages each day, letting more than one hundred ”good” messages through. Only about 1 percent of the messages it identifies as ”okay” are actually spam; it almost never marks a good message as spam. The system is almost as accurate as I would be and much faster. You start out by presenting a folder of each to the filter. It trains its Bayesian belief network on these two files and a.n.a.lyzes the patterns of each, thus enabling it to automatically move subsequent e-mail into the proper category. It continues to train itself on every subsequent e-mail, especially when it's corrected by the user. This filter has made the spam situation manageable for me, which is saying a lot, as it weeds out two hundred to three hundred spam messages each day, letting more than one hundred ”good” messages through. Only about 1 percent of the messages it identifies as ”okay” are actually spam; it almost never marks a good message as spam. The system is almost as accurate as I would be and much faster.
Markov Models. Another method that is good at applying probabilistic networks to complex sequences of information involves Markov models. Another method that is good at applying probabilistic networks to complex sequences of information involves Markov models.170 Andrei Andreyevich Markov (18561922), a renowned mathematician, established a theory of ”Markov chains,” which was refined by Norbert Wiener (18941964) in 1923. The theory provided a method to evaluate the likelihood that a certain sequence of events would occur. It has been popular, for example, in speech recognition, in which the sequential events are phonemes (parts of speech). The Markov models used in speech recognition code the likelihood that specific patterns of sound are found in each phoneme, how the phonemes influence each other, and likely orders of phonemes. The system can also include probability networks on higher levels of language, such as the order of words. The actual probabilities in the models are trained on actual speech and language data, so the method is self-organizing. Andrei Andreyevich Markov (18561922), a renowned mathematician, established a theory of ”Markov chains,” which was refined by Norbert Wiener (18941964) in 1923. The theory provided a method to evaluate the likelihood that a certain sequence of events would occur. It has been popular, for example, in speech recognition, in which the sequential events are phonemes (parts of speech). The Markov models used in speech recognition code the likelihood that specific patterns of sound are found in each phoneme, how the phonemes influence each other, and likely orders of phonemes. The system can also include probability networks on higher levels of language, such as the order of words. The actual probabilities in the models are trained on actual speech and language data, so the method is self-organizing.
Markov modeling was one of the methods my colleagues and I used in our own speech-recognition development.171 Unlike phonetic approaches, in which specific rules about phoneme sequences are explicitly coded by human linguists, we did not tell the system that there are approximately forty-four phonemes in English, nor did we tell it what sequences of phonemes were more likely than others. We let the system discover these ”rules” for itself from thousands of hours of transcribed human speech data. The advantage of this approach over hand-coded rules is that the models develop subtle probabilistic rules of which human experts are not necessarily aware. Unlike phonetic approaches, in which specific rules about phoneme sequences are explicitly coded by human linguists, we did not tell the system that there are approximately forty-four phonemes in English, nor did we tell it what sequences of phonemes were more likely than others. We let the system discover these ”rules” for itself from thousands of hours of transcribed human speech data. The advantage of this approach over hand-coded rules is that the models develop subtle probabilistic rules of which human experts are not necessarily aware.
Neural Nets. Another popular self-organizing method that has also been used in speech recognition and a wide variety of other pattern-recognition tasks is neural nets. This technique involves simulating a simplified model of neurons and interneuronal connections. One basic approach to neural nets can be described as follows. Each point of a given input (for speech, each point represents two dimensions, one being frequency and the other time; for images, each point would be a pixel in a two-dimensional image) is randomly connected to the inputs of the first layer of simulated neurons. Every connection has an a.s.sociated synaptic strength, which represents its importance and which is set at a random value. Each neuron adds up the signals coming into it. If the combined signal exceeds a particular threshold, the neuron fires and sends a signal to its output connection; if the combined input signal does not exceed the threshold, the neuron does not fire, and its output is zero. The output of each neuron is randomly connected to the inputs of the neurons in the next layer. There are multiple layers (generally three or more), and the layers may be organized in a variety of configurations. For example, one layer may feed back to an earlier layer. At the top layer, the output of one or more neurons, also randomly selected, provides the answer. (For an algorithmic description of neural nets, see this note: Another popular self-organizing method that has also been used in speech recognition and a wide variety of other pattern-recognition tasks is neural nets. This technique involves simulating a simplified model of neurons and interneuronal connections. One basic approach to neural nets can be described as follows. Each point of a given input (for speech, each point represents two dimensions, one being frequency and the other time; for images, each point would be a pixel in a two-dimensional image) is randomly connected to the inputs of the first layer of simulated neurons. Every connection has an a.s.sociated synaptic strength, which represents its importance and which is set at a random value. Each neuron adds up the signals coming into it. If the combined signal exceeds a particular threshold, the neuron fires and sends a signal to its output connection; if the combined input signal does not exceed the threshold, the neuron does not fire, and its output is zero. The output of each neuron is randomly connected to the inputs of the neurons in the next layer. There are multiple layers (generally three or more), and the layers may be organized in a variety of configurations. For example, one layer may feed back to an earlier layer. At the top layer, the output of one or more neurons, also randomly selected, provides the answer. (For an algorithmic description of neural nets, see this note:172) Since the neural-net wiring and synaptic weights are initially set randomly, the answers of an untrained neural net will be random. The key to a neural net, therefore, is that it must learn its subject matter. Like the mammalian brains on which it's loosely modeled, a neural net starts out ignorant. The neural net's teacher-which may be a human, a computer program, or perhaps another, more mature neural net that has already learned its lessons-rewards the student neural net when it generates the right output and punishes it when it does not. This feedback is in turn used by the student neural net to adjust the strengths of each interneuronal connection. Connections that were consistent with the right answer are made stronger. Those that advocated a wrong answer are weakened. Over time, the neural net organizes itself to provide the right answers without coaching. Experiments have shown that neural nets can learn their subject matter even with unreliable teachers. If the teacher is correct only 60 percent of the time, the student neural net will still learn its lessons.
A powerful, well-taught neural net can emulate a wide range of human pattern-recognition faculties. Systems using multilayer neural nets have shown impressive results in a wide variety of pattern-recognition tasks, including recognizing handwriting, human faces, fraud in commercial transactions such as credit-card charges, and many others. In my own experience in using neural nets in such contexts, the most challenging engineering task is not coding the nets but in providing automated lessons for them to learn their subject matter.
The current trend in neural nets is to take advantage of more realistic and more complex models of how actual biological neural nets work, now that we are developing detailed models of neural functioning from brain reverse NEAR engineering.173 Since we do have several decades of experience in using self-organizing paradigms, new insights from brain studies can quickly be adapted to neural-net experiments. Since we do have several decades of experience in using self-organizing paradigms, new insights from brain studies can quickly be adapted to neural-net experiments.
Neural nets are also naturally amenable to parallel processing, since that is how the brain works. The human brain does not have a central processor that simulates each neuron. Rather, we can consider each neuron and each interneuronal connection to be an individual slow processor. Extensive work is under way to develop specialized chips that implement neural-net architectures in parallel to provide substantially greater throughput.174
Genetic Algorithms (GAs). Another self-organizing paradigm inspired by nature is genetic, or evolutionary, algorithms, which emulate evolution, including s.e.xual reproduction and mutations. Here is a simplified description of how they work. First, determine a way to code possible solutions to a given problem. If the problem is optimizing the design parameters for a jet engine, define a list of the parameters (with a specific number of bits a.s.signed to each parameter). This list is regarded as the genetic code in the genetic algorithm. Then randomly generate thousands or more genetic codes. Each such genetic code (which represents one set of design parameters) is considered a simulated ”solution” organism. Another self-organizing paradigm inspired by nature is genetic, or evolutionary, algorithms, which emulate evolution, including s.e.xual reproduction and mutations. Here is a simplified description of how they work. First, determine a way to code possible solutions to a given problem. If the problem is optimizing the design parameters for a jet engine, define a list of the parameters (with a specific number of bits a.s.signed to each parameter). This list is regarded as the genetic code in the genetic algorithm. Then randomly generate thousands or more genetic codes. Each such genetic code (which represents one set of design parameters) is considered a simulated ”solution” organism.
Now evaluate each simulated organism in a simulated environment by using a defined method to evaluate each set of parameters. This evaluation is a key to the success of a genetic algorithm. In our example, we would apply each solution organism to a jet-engine simulation and determine how successful that set of parameters is, according to whatever criteria we are interested in (fuel consumption, speed, and so on). The best solution organisms (the best designs) are allowed to survive, and the rest are eliminated.
Now have each of the survivors multiply themselves until they reach the same number of solution creatures. This is done by simulating s.e.xual reproduction. In other words, each new offspring solution draws part of its genetic code from one parent and another part from a second parent. Usually no distinction is made between male or female organisms; it's sufficient to generate an offspring from two arbitrary parents. As they multiply, allow some mutation (random change) in the chromosomes to occur.
We've now defined one generation of simulated evolution; now repeat these steps for each subsequent generation. At the end of each generation determine how much the designs have improved. When the improvement in the evaluation of the design creatures from one generation to the next becomes very small, we stop this iterative cycle of improvement and use the best design(s) in the last generation. (For an algorithmic description of genetic algorithms, see this note.175) The key to a GA is that the human designers don't directly program a solution; rather, they let one emerge through an iterative process of simulated compet.i.tion and improvement. As we discussed, biological evolution is smart but slow, so to enhance its intelligence we retain its discernment while greatly speeding up its ponderous pace. The computer is fast enough to simulate many generations in a matter of hours or days or weeks. But we have to go through this iterative process only once; once we have let this simulated evolution run its course, we can apply the evolved and highly refined rules to real problems in a rapid fas.h.i.+on.
Like neural nets GAs are a way to harness the subtle but profound patterns that exist in chaotic data. A key requirement for their success is a valid way of evaluating each possible solution. This evaluation needs to be fast because it must take account of many thousands of possible solutions for each generation of simulated evolution.
GAs are adept at handling problems with too many variables to compute precise a.n.a.lytic solutions. The design of a jet engine, for example, involves more than one hundred variables and requires satisfying dozens of constraints. GAs used by researchers at General Electric were able to come up with engine designs that met the constraints more precisely than conventional methods.
When using GAs you must, however, be careful what you ask for. University of Suss.e.x researcher Jon Bird used a GA to optimally design an oscillator circuit. Several attempts generated conventional designs using a small number of transistors, but the winning design was not an oscillator at all but a simple radio circuit. Apparently the GA discovered that the radio circuit picked up an oscillating hum from a nearby computer.176 The GA's solution worked only in the exact location on the table where it was asked to solve the problem. The GA's solution worked only in the exact location on the table where it was asked to solve the problem.
Genetic algorithms, part of the field of chaos or complexity theory, are increasingly being used to solve otherwise intractable business problems, such as optimizing complex supply chains. This approach is beginning to supplant more a.n.a.lytic methods throughout industry. (See examples below.) The paradigm is also adept at recognizing patterns, and is often combined with neural nets and other self-organizing methods. It's also a reasonable way to write computer software, particularly software that needs to find delicate balances for competing resources.
In the novel usr/bin/G.o.d usr/bin/G.o.d, Cory Doctorow, a leading science-fiction writer, uses an intriguing variation of a GA to evolve an AI. The GA generates a large number of intelligent systems based on various intricate combinations of techniques, with each combination characterized by its genetic code. These systems then evolve using a GA.
. The evaluation function works as follows: each system logs on to various human chat rooms and tries to pa.s.s for a human, basically a covert Turing test. If one of the humans in a chat room says something like ”What are you, a chatterbot?” (chatterbot meaning an automatic program, which at today's level of development is expected to not understand language at a human level), the evaluation is over, that system ends its interactions, and reports its score to the GA. The score is determined by how long it was able to pa.s.s for human without being challenged in this way. The GA evolves more and more intricate combinations of techniques that are increasingly capable of pa.s.sing for human.
The main difficulty with this idea is that the evaluation function is fairly slow, although it will take an appreciable amount of time only after the systems are reasonably intelligent. Also, the evaluations can take place largely in parallel. It's an interesting idea and may actually be a useful method to finish the job of pa.s.sing the Turing test, once we get to the point where we have sufficiently sophisticated algorithms to feed into such a GA, so that evolving a Turing-capable AI is feasible.
Recursive Search. Often we need to search through a vast number of combinations of possible solutions to solve a given problem. A cla.s.sic example is in playing games such as chess. As a player considers her next move, she can list all of her possible moves, and then, for each such move, all possible countermoves by the opponent, and so on. It is difficult, however, for human players to keep a huge ”tree” of move-countermove sequences in their heads, and so they rely on pattern recognition-recognizing situations based on prior experience-whereas machines use logical a.n.a.lysis of millions of moves and countermoves. Often we need to search through a vast number of combinations of possible solutions to solve a given problem. A cla.s.sic example is in playing games such as chess. As a player considers her next move, she can list all of her possible moves, and then, for each such move, all possible countermoves by the opponent, and so on. It is difficult, however, for human players to keep a huge ”tree” of move-countermove sequences in their heads, and so they rely on pattern recognition-recognizing situations based on prior experience-whereas machines use logical a.n.a.lysis of millions of moves and countermoves.
Such a logical tree is at the heart of most game-playing programs. Consider how this is done. We construct a program called Pick Best Next Step to select each move. Pick Best Next Step starts by listing all of the possible moves from the current state of the board. (If the problem was solving a mathematical theorem, rather than game moves, the program would list all of the possible next steps in a proof.) For each move the program constructs a hypothetical board that reflects what would happen if we made this move. For each such hypothetical board, we now need to consider what our opponent would do if we made this move. Now recursion comes in, because Pick Best Next Step simply calls Pick Best Next Step (in other words, itself) to pick the best move for our opponent. In calling itself, Pick Best Next Step then lists all of the legal moves for our opponent.
The program keeps calling itself, looking ahead as many moves as we have time to consider, which results in the generation of a huge move-countermove tree. This is another example of exponential growth, because to look ahead an additional move (or countermove) requires multiplying the amount of available computation by about five. Key to the success of the recursive formula is pruning this huge tree of possibilities and ultimately stopping its growth. In the game context, if a board looks hopeless for either side, the program can stop the expansion of the move-countermove tree from that point (called a ”terminal leaf” of the tree) and consider the most recently considered move to be a likely win or loss. When all of these nested program calls are completed, the program will have determined the best possible move for the current actual board within the limits of the depth of recursive expansion that it had time to pursue and the quality of its pruning algorithm. (For an algorithmic description of recursive search, see this note:177) The recursive formula is often effective at mathematics. Rather than game moves, the ”moves” are the axioms of the field of math being addressed, as well as previously proved theorems. The expansion at each point is the possible axioms (or previously proved theorems) that can be applied to a proof at each step. (This was the approach used by Newell, Shaw, and Simons's General Problem Solver.) From these examples it may appear that recursion is well suited only for problems in which we have crisply defined rules and objectives. But it has also shown promise in computer generation of artistic creations. For example, a program I designed called Ray Kurzweil's Cybernetic Poet uses a recursive approach.178 The program establishes a set of goals for each word-achieving a certain rhythmic pattern, poem structure, and word choice that is desirable at that point in the poem. If the program is unable to find a word that meets these criteria, it backs up and erases the previous word it has written, reestablishes the criteria it had originally set for the word just erased, and goes from there. If that also leads to a dead end, it backs up again, thus moving backward and forward. Eventually, it forces itself to make up its mind by relaxing some of the constraints if all paths lead to dead ends. The program establishes a set of goals for each word-achieving a certain rhythmic pattern, poem structure, and word choice that is desirable at that point in the poem. If the program is unable to find a word that meets these criteria, it backs up and erases the previous word it has written, reestablishes the criteria it had originally set for the word just erased, and goes from there. If that also leads to a dead end, it backs up again, thus moving backward and forward. Eventually, it forces itself to make up its mind by relaxing some of the constraints if all paths lead to dead ends.
Combining Methods. The most powerful approach to building robust AI systems is to combine approaches, which is how the human brain works. As we discussed, the brain is not one big neural net but instead consists of hundreds of regions, each of which is optimized for processing information in a different way. None of these regions by itself operates at what we would consider human levels of performance, but dearly by definition the overall system does exactly that. The most powerful approach to building robust AI systems is to combine approaches, which is how the human brain works. As we discussed, the brain is not one big neural net but instead consists of hundreds of regions, each of which is optimized for processing information in a different way. None of these regions by itself operates at what we would consider human levels of performance, but dearly by definition the overall system does exactly that.
I've used this approach in my own AI work, especially in pattern recognition. In speech recognition, for example, we implemented a number of different pattern-recognition systems based on different paradigms. Some were specifically programmed with knowledge of phonetic and linguistic constraints from experts. Some were based on rules to pa.r.s.e sentences (which involves creating sentence diagrams showing word usage, similar to the diagrams taught in grade school). Some were based on self-organizing techniques, such as Markov models, trained on extensive libraries of recorded and annotated human speech. We then programmed a software ”expert manager” to learn the strengths and weaknesses of the different ”experts” (recognizers) and combine their results in optimal ways. In this fas.h.i.+on, a particular technique that by itself might produce unreliable results can nonetheless contribute to increasing the overall accuracy of the system.
There are many intricate ways to combine the varied methods in AI's toolbox. For example, one can use a genetic algorithm to evolve the optimal topology (organization of nodes and connections) for a neural net or a Markov model. The final output of the GA-evolved neural net can then be used to control the parameters of a recursive search algorithm. We can add in powerful signal- and image-processing techniques that have been developed for pattern-processing systems. Each specific application calls for a different architecture. Computer science professor and AI entrepreneur Ben Goertzel has written a series of books and articles that describe strategies and architectures for combining the diverse methods underlying intelligence. His Novamente architecture is intended to provide a framework for general-purpose AI.179 The above basic descriptions provide only a glimpse into how increasingly sophisticated current AI systems are designed. It's beyond the scope of this book to provide a comprehensive description of the techniques of AI, and even a doctoral program in computer science is unable to cover all of the varied approaches in use today.
Many of the examples of real-world narrow AI systems described in the next section use a variety of methods integrated and optimized for each particular task. Narrow AI is strengthening as a result of several concurrent trends: continued exponential gains in computational resources, extensive real-world experience with thousands of applications, and fresh insights into how the human brain makes intelligent decisions.
A Narrow AI Sampler
When I wrote my first AI book, The Age of Intelligent Machines, in the late 1980s, I had to conduct extensive investigations to find a few successful examples of AI in practice. The Internet was not yet prevalent, so I had to go to real libraries and visit the AI research centers in the United States, Europe, and Asia. I included in my book pretty much all of the reasonable examples I could identify. In my research for this book my experience has been altogether different. I have been inundated with thousands of compelling examples. In our reporting on the KurzweilAI.net Web site, we feature one or more dramatic systems almost every day.180 A 2003 study by Business Communications Company projected a $21 billion market by 2007 for AI applications, with average annual growth of 12.2 percent from 2002 to 2007.181 Leading industries for AI applications include business intelligence, customer relations, finance, defense and domestic security, and education. Here is a small sample of narrow AI in action. Leading industries for AI applications include business intelligence, customer relations, finance, defense and domestic security, and education. Here is a small sample of narrow AI in action.
Military and Intelligence. The U.S. military has been an avid user of AI systems. Pattern-recognition software systems guide autonomous weapons such as cruise missiles, which can fly thousands of miles to find a specific building or even a specific window. The U.S. military has been an avid user of AI systems. Pattern-recognition software systems guide autonomous weapons such as cruise missiles, which can fly thousands of miles to find a specific building or even a specific window.182 Although the relevant details of the terrain that the missile flies over are programmed ahead of time, variations in weather, ground cover, and other factors require a flexible level of real-time image recognition. Although the relevant details of the terrain that the missile flies over are programmed ahead of time, variations in weather, ground cover, and other factors require a flexible level of real-time image recognition.
The army has developed prototypes of self-organizing communication networks (called ”mesh networks”) to automatically configure many thousands of communication nodes when a platoon is dropped into a new location.183 Expert systems incorporating Bayesian networks and GAs are used to optimize complex supply chains that coordinate millions of provisions, supplies, and weapons based on rapidly changing battlefield requirements.
AI systems are routinely employed to simulate the performance of weapons, including nuclear bombs and missiles.
Advance warning of the September 11, 2001, terrorist attacks was apparently detected by the National Security Agency's AI-based Echelon system, which a.n.a.lyzes the agency's extensive monitoring of communications traffic.184 Unfortunately, Echelon's warnings were not reviewed by human agents until it was too late. Unfortunately, Echelon's warnings were not reviewed by human agents until it was too late.