Part 5 (2/2)

George Box is justly celebrated for his remark "All models are wrong but some are useful." The mark of great statisticians is their confidence in the face of fallibility. They recognize that no one can have a monopoly on the truth, which is unknowable as long as there is uncertainty in the world. But imperfect information does not intimidate them; they seek models that fit the available evidence more tightly than all alternatives. Box's writings on his experiences in industry have inspired generations of statisticians; to get a flavor of his engaging style, see the collection Improving Almost Anything, lovingly produced by his former students.

More ink than necessary has been spilled on the dichotomy between correlation and causation. Asking for the umpteenth time whether correlation implies causation is pointless (we already know it does not). The question Can correlation be useful without causation? is much more worthy of exploration. Forgetting what the textbooks say, most practitioners believe the answer is quite often yes. In the case of credit scoring, correlation-based statistical models have been wildly successful even though they do not yield simple explanations for why one customer is a worse credit risk than another. The parallel development of this type of model by researchers in numerous fields, such as pattern recognition, machine learning, knowledge discovery, and data mining, also confirms its practical value.

In explaining how credit scoring works, statisticians emphasize the similarity between traditional and modern methods; much of the criticism leveled at credit-scoring technology applies equally to credit officers who make underwriting decisions by handcrafted rules. Credit scores and rules of thumb both rely on information from credit reports, such as outstanding account balances and past payment behavior, and such materials contain inaccurate data independently of the method of analysis. Typically, any rule discovered by the computer is a rule the credit officer would also use if he or she knew about it. While the complaints from consumer advocates seem reasonable, no one has yet proposed alternatives that can overcome the problems common to both systems. Statisticians prefer the credit-scoring approach because computers are much more efficient than loan officers at generating scoring rules, the resulting rules are more complex and more precise, and they can be applied uniformly to all loan applicants, ensuring fairness. Industry leaders concur, pointing out that the advent of credit scoring precipitated an explosion in consumer credit, which boosted consumer spending, propping up the U.S. economy for decades. Consider this: since the 1970s, credit granted to American consumers has grown by 1,200 percent, while the deep recession that began in 2008 has led to retrenchment of less than 10 percent a year.

Statistical models do not relieve business managers of their responsibility to make prudent decisions. The credit-scoring algorithms make educated guesses on how likely each applicant will be to default on a loan but shed no light on how much risk an enterprise should shoulder. Two businesses with different appetites for risk will make different decisions, even if they use the same credit-scoring system.

When correlation is not enough to be useful without causation, the stakes get dramatically higher. Disease detectives must set their sights on the source of contaminated foods, as it is irresponsible to order food recalls, which cripple industries, based solely on evidence of correlation. The bagged spinach case of 2006 revealed the sophistication required to solve such a riddle. The epidemiologists used state-of-the-art statistical tools like the case-control study and information-sharing networks; because they respected the limits of these methods, they solicited help from laboratory and field personnel as well.

The case also demonstrated the formidable challenges of outbreak investigations: urgency mounted as more people reported sick, and key decisions had to be made under much uncertainty. In the bagged-spinach investigation, every piece of the puzzle fell neatly into place, allowing the complete causal path to be traced, from the infested farm to the infected stool. Investigators were incredibly lucky to capture the P227A lot code and discover the specific shift when the contamination had occurred. Many other investigations are less than perfect, and mistakes are not uncommon. For example, a Taco Bell outbreak in November 2006 was initially linked to green onions but later blamed on iceberg lettuce. In 2008, when the Food and Drug Administration (FDA) claimed tomatoes had caused a nationwide salmonella outbreak, stores and restaurants immediately yanked tomatoes from their offerings, only to discover later that they had been victims of a false alarm. Good statisticians are not daunted by these occasional failures. They understand the virtue in being wrong, as no model can be perfect; they particularly savor those days when everything works out, when we wonder how they manage to squeeze so much out of so little in such a short time.

Crossovers

Disney fans who use Len Testa's touring plans pack in an amazing number of attractions during their visits to Disney theme parks, about 70 percent more than the typical tourist; they also shave off three and a half hours of waiting time and are among the most gratified of Disney guests. In putting together these plans, Testa's team took advantage of correlations. Most of us realize that many factors influence wait times at a theme park, such as weather, holiday, time of day, day of the week, crowd level, popularity of the ride, and early-entry mornings. Similar to credit-scoring technology, Testa's algorithm computed the relative importance of these factors. He told us that the popularity of rides and time of day matter the most (both rated 10), followed by crowd level (9), holiday (8), early-entry morning (5), day of week (2), and weather (1). Thus, in terms of total waiting time, there really was no such thing as an off-peak day or a bad-weather day. How did Testa know so much?

Testa embraced what epidemiologists proudly called "shoe leather," or a lot of walking. On any brilliant summer day in Orlando, Florida, Testa could be spotted among the jumpy 8:00 A.M. crowd at the gates of Walt Disney World, his ankles taped up and toes greased, psyched up for the rope drop. The entire day, he would be shuttling between rides. He would get neither in line nor on any ride; every half hour, upon finishing one loop, he would start over at the first ride. He would walk for nine hours, logging eighteen miles. To cover even more ground, he had a small staff take turns with different rides, all year round. In this way, they collected wait times at every ride every thirty minutes. Back at the office, the computers scanned for patterns.

Testa's model did not attempt to explain why certain times of the day were busier than others; it was enough to know which times to avoid. As interesting as it would be to know how each step of a touring plan decreases wait times, Testa's millions of fans care about only one thing: whether the plan lets them visit more rides, enhancing the value of their entry tickets. The legion of satisfied readers is testimony to the usefulness of this correlational model.
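The relative-importance weighting Testa described can be pictured as a simple weighted score. The sketch below is purely illustrative: the weights come from the ratings quoted above, but the 0-to-1 factor values, the additive form, and the `congestion_score` name are assumptions, not Testa's actual algorithm.

```python
# Illustrative weighted-factor sketch (weights from the text; everything
# else, including the scoring scheme, is an assumption).
weights = {
    "ride_popularity": 10, "time_of_day": 10, "crowd_level": 9,
    "holiday": 8, "early_entry": 5, "day_of_week": 2, "weather": 1,
}

def congestion_score(factors):
    # factors: dict mapping each condition to a 0-1 severity value
    return sum(weights[name] * value for name, value in factors.items())

# A quiet morning with bad weather (hypothetical values):
quiet = {"ride_popularity": 0.2, "time_of_day": 0.1, "crowd_level": 0.3,
         "holiday": 0.0, "early_entry": 0.0, "day_of_week": 0.5, "weather": 0.9}
print(round(congestion_score(quiet), 1))  # 7.6
```

Note that even setting weather to 0.9 nudges the total by less than one point, which echoes the claim that, for total waiting time, there is no such thing as a bad-weather day.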

Polygraphs rely strictly on correlations between the act of lying and certain physiological metrics. Are correlations useful without causation? In this case, statisticians say no. To avoid falsely imprisoning innocent people based solely on evidence of correlation, they insist that lie detection technology adopt causal modeling of the type practiced in epidemiology. They caution against logical overreach: Liars breathe faster. Adam's breaths quickened. Therefore, Adam was a liar. Deception, or stress related to it, is only one of many possible causes for the increase in breathing rate, so variations in this or similar measures need not imply lying. As with epidemiologists studying spinach and E. coli, law enforcement officials must find corroborative evidence to strengthen their case, something rarely accomplished. A noteworthy finding of the 2002 NAS report was that scientific research into the causes of physiological changes associated with lying has not kept up with the spread of polygraphs. The distinguished review panel on the report underlined the need for coherent psychological theories that explain the connection between lying and various physiological measures.

For the same reason, data-mining models for detecting terrorists are both false and useless. Data-mining models uncover patterns of correlation. Statisticians tell us that rounding up suspects based on these models will inevitably ensnare hundreds or thousands of innocent citizens. Linking cause to effect requires a much more sophisticated, multidisciplinary approach, one that emphasizes shoe leather, otherwise known as human intelligence gathering.

The Dilemma of Being Together

In 2007, the average college-bound senior scored 502 in the Critical Reading (verbal) section of the SAT. In addition, girls performed just as well as boys (502 and 504, respectively), so nothing was lost by reporting the overall average score, and a bit of simplicity was gained. The same could not be said of blacks and whites, however, as the average black student tallied 433, almost 100 points below the average white student's score of 527. To aggregate or not to aggregate: that is the dilemma of being together. Should statisticians reveal several group averages or one overall average?

The rule of thumb is to keep groups together if they are alike and to set them apart if they are dissimilar. In our example, after the hurricane disasters of 2004-2005, insurers in Florida reassessed the risk exposure of coastal residents, deciding that the difference relative to inland properties had widened so drastically that the insurers could no longer justify keeping both groups together in an undifferentiated risk pool. Doing so would have been wildly unfair to the inland residents.

The issue of group differences is at the heart of the dilemma. When group differences exist, groups should be disaggregated. It is a small tragedy to have at our disposal ready-made groups to partition people into, such as racial groups, income groups, and geographical groups. This easy categorization conditions in us a cavalier attitude toward forming comparisons between blacks and whites, the rich and the poor, red and blue states, and so on. Statisticians tell us to examine such group differences carefully, as they frequently cover up nuances that break the general rule. For instance, the widely held notion that the rich vote Republican fell apart in a review of state-by-state data. Andrew Gelman, a statistician at Columbia University, found that this group difference in voting behavior surfaced in "poor" states like Mississippi but not in "rich" states like Connecticut. (See his fascinating book Red State, Blue State, Rich State, Poor State for more on this topic.)

Similarly, the Golden Rule settlement failed because the procedure for screening out unfair test items lumped together students with divergent ability levels. The mix of ability levels among black students varied from that among whites, so this rule produced many false alarms, flagging questions as unfair even when they were not.

Statisticians regard this as an instance of the famous Simpson's paradox: the simultaneous and seemingly contradictory finding that no difference exists between high-ability blacks and high-ability whites; no difference exists between low-ability blacks and low-ability whites; and yet when both ability levels are combined, blacks fare significantly worse than whites. To our amazement, the act of aggregation manufactures an apparent racial gap!

Here is what one would expect: since the group differences are zero for both high- and low-ability groups, the combined difference should also be zero. Here is the paradox: the statistics show that in aggregate, whites outperform blacks by 80 points (the bottom row of Figure C-2). However, the confusion dissipates upon realizing that white students typically enjoy better educational resources than blacks, a fact acknowledged by the education community, so the average score for whites is more heavily weighted toward the score for high-ability students, and the average for blacks toward the score for low-ability students. In resolving the paradox, statisticians compute an average for each ability level so as to compare like with like. Simpson's paradox is a popular topic in statistics books, though it can be a perplexing concept at first glance.
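The paradox can be verified with a few lines of arithmetic. The scores and ability mixes below are hypothetical stand-ins (chosen so that the aggregate gap comes out near 80 points); within each ability level, the racial gap is exactly zero.

```python
# Within each ability level, the two racial groups score identically.
scores = {"high": 600, "low": 400}  # hypothetical average scores

# Assumed ability mixes (illustrative, not the actual ETS data):
mix = {
    "white": {"high": 0.7, "low": 0.3},
    "black": {"high": 0.3, "low": 0.7},
}

def average(group):
    # Weighted average of ability-level scores for one group
    return sum(mix[group][level] * scores[level] for level in scores)

gap = average("white") - average("black")
print(round(gap))  # 80: aggregation alone manufactures the gap
```

Stratifying (comparing within "high" and within "low") finds no difference at all; only the unequal ability mixes create the aggregate gap.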

Figure C-2 Aggregation Creates a Difference: An Illustration of Simpson's Paradox [image]

The recognition of Simpson's paradox led to a breakthrough in fair testing. The procedure for differential item functioning (DIF) analysis, introduced in Chapter 3, divides examinees into groups of like ability and then compares average correct rates within these groups. Benefiting from research by the Educational Testing Service (ETS) in the 1980s, DIF analysis has rapidly gained acceptance as the scientific standard. In practice, ETS uses five ability groups based on total test score. For the sake of simplicity, we only concerned ourselves with the case of two groups.

The strategy of stratification (analyzing groups separately) is one way to create like groups for comparison. A superior alternative strategy is randomization, when feasible. Statisticians frequently assign test subjects randomly into one group or another; say, in a clinical trial, they will select at random some patients to be given placebos, and the remainder to receive the medicine under study. Because of random assignment, the groups will have similar characteristics: the mix of races will be the same, the mix of ages will be the same, and so on. In this way, "all else being equal" is assured when one group is chosen for special treatment. If the treatment has an effect, the researcher does not have to worry about other contributing factors. While statisticians prefer randomization to stratification, the former strategy is sometimes infeasible. For example, in DIF analysis, social norms would prevent one from exposing some students randomly to higher-quality schools and others to lower-quality schools.

By contrast, the attempt by Florida insurers to disaggregate the hurricane risk pools pushed the entire industry to the brink in the late 2000s. This consequence is hardly surprising if we recall the basic principle of insurance: that participants agree to cross-subsidize each other in times of need. When the high-risk coastal policies are split off and passed to take-out companies with modest capital bases, such as Poe Financial Group, or to Citizens Property Insurance Corporation, the state-run insurer of last resort, these entities must shoulder a severe concentration of exposure, putting their very survival into serious question. In 2006, Poe became insolvent after 40 percent of its customers bled a surplus of ten years dry in just two seasons.

Crossovers

By Arnold Barnett's estimation, between 1987 and 1996, air carriers in the developing world sustained 74 percent of worldwide crash fatalities while operating only 18 percent of all flights (see Figure C-3a). If all airlines were equally safe, we would expect the developing-world carriers to share around 18 percent of fatalities. To many of us, the message could not be clearer: U.S. travelers should stick to U.S. airlines.

Yet Barnett contended that Americans gained nothing by ”buying local,” because developing-world carriers were just as safe as those in the developed world. He looked at the same numbers as most of us but arrived at an opposite conclusion, one rooted in the statistics of group differences. Barnett discovered that the developing-world airlines had a much better safety record on ”between-worlds” routes than on other routes. Thus, lumping together all routes created the wrong impression.

Since domestic routes in most countries are dominated by home carriers, airlines compete with each other only on international routes; in other words, about the only time American travelers get to choose a developing-world carrier is when they are flying between the two worlds. Hence, only the between-worlds routes are relevant. On these relevant routes, over the same period, developing-world carriers suffered 55 percent of the fatalities while making 62 percent of the flights (see Figure C-3b). That indicates they weren't more dangerous than developed-world airlines.

Figure C-3 Stratifying Air Routes: Relative Proportion of Flights and Deaths by Developing-World and Developed-World Carriers, 1987-1996 [image]

Group differences entered the picture again when comparing developed-world and developing-world carriers on between-worlds routes only. The existence of a group difference in fatality rates between the two airline groups is what would compel us to reject the equal-safety hypothesis.

Any stratification strategy should come with a big warning sign, statisticians caution. Beware the cherry-picker who draws attention only to one group out of many. If someone presented only Figure C-3b, we could miss the mediocre safety record of developing-world carriers on their domestic routes, surely something we ought to know while touring around a foreign country.

Such mischief of omission can generally be countered by asking for information on every group, whether relevant or not.

Stratification produces like groups for comparison. This procedure proved essential to the proper fairness review of questions on standardized tests. Epidemiologists have known about this idea since Sir Bradford Hill and Sir Richard Doll published their landmark 1950 study linking smoking to lung cancer, which heralded the case-control study as a viable method for comparing groups. Recall that Melissa Plantenga, the analyst in Oregon, was the first to identify the eventual culprit in the bagged-spinach case, and she based her hunch on a 450-item shotgun questionnaire, which revealed that four out of five sickened patients had consumed bagged spinach. Disease detectives cannot rely solely on what proportion of the "cases" (those patients who report sickness) were exposed to a particular food; they need a point of reference: the exposure rate of "controls" (those who are similar to the cases but not ill). A food should arouse suspicion only if the cases have a much higher exposure rate to it than do the controls. Statisticians carefully match cases and controls to rule out other known factors that may also induce the illness in one group but not the other.

In 2005, a year before the large E. coli outbreak in spinach, pre-packaged lettuce salad was blamed for another outbreak of E. coli, also of class O157:H7, in Minnesota. The investigators interviewed ten cases, with ages ranging from three to eighty-four, and recruited two to three controls, with matching age, for each case patient. In the case-control study, they determined that the odds of exposure to prepackaged lettuce salad were eight times as large for cases as for controls; other evidence subsequently confirmed this hypothesis.

The result of the study can also be expressed thus: among like people, those in the group who fell ill were much more likely to have consumed prepackaged lettuce salad than those in the group who did not become ill (see Figure C-4). In this sense, the case-control study is a literal implementation of comparing like with like. When like groups are found to be different, statisticians will treat them separately.
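The odds-ratio calculation at the heart of a case-control study is simple arithmetic on a two-by-two table. The counts below are hypothetical, chosen only to reproduce the eightfold odds reported for the Minnesota study; they are not the investigators' actual data.

```python
# Hypothetical 2x2 table for a case-control study (illustrative counts):
cases = {"exposed": 8, "unexposed": 2}      # ill patients
controls = {"exposed": 8, "unexposed": 16}  # matched, healthy subjects

def odds_of_exposure(group):
    # Odds = exposed count divided by unexposed count within one group
    return group["exposed"] / group["unexposed"]

odds_ratio = odds_of_exposure(cases) / odds_of_exposure(controls)
print(odds_ratio)  # 8.0: cases had eight times the odds of exposure
```

An odds ratio near 1 would mean the food is no more common among the sick than among their matched controls; only a large ratio, as here, should arouse suspicion.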

Figure C-4 The Case-Control Study: Comparing Like with Like [image]

The Sway of Being Asymmetric

If all terrorists use barbecue as a code word and we know Joe is a terrorist, then we are certain Joe also uses the word barbecue. Applying a general truth (all terrorists) to a specific case (Joe the terrorist) is natural; going the other way, from the specific to the general, carries much peril, and that is the playground for statisticians. If we are told Joe the terrorist says "barbecue" a lot, we cannot be sure that all other terrorists also use that word, as even one counter-example invalidates the general rule.

Therefore, when making a generalization, statisticians always attach a margin of error, by which they admit a chance of mistake. The inaccuracy comes in two forms: false positives and false negatives, which are (unhelpfully) called type I and type II errors in statistics texts. They are better understood as false alarms and missed opportunities. Put differently, accuracy encompasses the ability to correctly detect positives as well as the ability to correctly detect negatives. In medical parlance, the ability to detect true positives is known as sensitivity, and the ability to detect true negatives is called specificity. Unfortunately, improving one type of accuracy inevitably leads to deterioration of the other. See the textbook Stats: Data and Models by Richard D. De Veaux for a formal discussion under the topic of hypothesis testing, and the series of illuminating expositions on the medical context by Douglas Altman, published in the British Medical Journal.
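The see-saw between sensitivity and specificity can be demonstrated by sliding a decision threshold across two sets of scores. The scores below are toy data invented for this sketch.

```python
# Toy screening scores: higher means more suspicious (made-up data).
liars = [0.9, 0.8, 0.7, 0.6, 0.4]          # true positives to catch
truth_tellers = [0.5, 0.4, 0.3, 0.2, 0.1]  # true negatives to clear

def rates(threshold):
    # Sensitivity: share of liars flagged; specificity: share of honest cleared
    sensitivity = sum(s >= threshold for s in liars) / len(liars)
    specificity = sum(s < threshold for s in truth_tellers) / len(truth_tellers)
    return sensitivity, specificity

print(rates(0.35))  # (1.0, 0.6): catch every liar, accuse 40% of the honest
print(rates(0.55))  # (0.8, 1.0): clear every honest subject, miss one liar
```

Lowering the threshold buys sensitivity at the price of specificity, and vice versa; no threshold delivers both at once.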

When anti-doping laboratories set the legal limit for any banned substance, they also fix the trade-off between false positives and false negatives. Similarly, when researchers configure the computer program for the PCASS portable lie detector to attain desired proportions of red, yellow, and green results, they express their tolerance of one type of error against the other. What motivates these specific modes of operation? Our discussion pays particular attention to the effect of incentives. This element falls under the subject of decision theory, an area that has experienced a burst of activity by so-called behavioral social scientists.

In most real-life situations, the costs of the two errors are unequal, or asymmetric, with one type being highly publicized and highly toxic, and the other going unnoticed. Such imbalance skews incentives. In steroid testing, false negatives are invisible unless the dopers confess, while false positives are invariably mocked in public. No wonder timid testers tend to underreport positives, providing inadvertent cover for many dopers. In national-security screening, false negatives could portend frightening disasters, while false positives are invisible until the authorities reverse their mistakes, and then only if the victims tell their tales. No wonder the U.S. Army configures the PCASS portable polygraph to minimize false negatives.

Not surprisingly, what holds sway with decision makers is the one error that can invite bad press. While their actions almost surely have made the other type of error worse, this effect is hidden from view and therefore neglected. Because of such incentives, we have to worry about false negatives in steroid testing and false positives in polygraph and terrorist screening. For each drug cheat caught by anti-doping labs, about ten other cheaters have escaped detection. For each terrorist trapped by polygraph screening, hundreds if not thousands of innocent citizens have been falsely implicated. These ratios are worse when the targets to be tested are rarer (and spies or terrorists are rare indeed).

The bestselling Freakonomics provides a marvelously readable overview of behavioral economics and incentives. The formulas for false positives and false negatives involve conditional probabilities and the famous Bayes' rule, a landmark of any introductory book on statistics or probability. For the sake of simplicity, textbook analysis often assumes the cost of each error to be the same. In practice, these costs tend to be unequal and influenced by societal goals such as fairness as well as individual characteristics such as integrity that may conflict with the objective of scientific accuracy.
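Bayes' rule makes the rare-target problem concrete. The prevalence and accuracy figures below are assumptions for illustration; the qualitative conclusion, that most flags are false alarms when targets are rare, holds for any similarly rare base rate.

```python
# Bayes' rule sketch: an accurate screen still misfires on rare targets.
# All numbers are assumptions for illustration.
base_rate = 1 / 10_000   # P(target): prevalence of spies or terrorists
sensitivity = 0.99       # P(flagged | target)
specificity = 0.99       # P(cleared | innocent)

# Total probability of a flag, over targets and innocents alike:
p_flagged = sensitivity * base_rate + (1 - specificity) * (1 - base_rate)
# Bayes' rule: probability a flagged person is actually a target
p_target_given_flag = sensitivity * base_rate / p_flagged
print(round(p_target_given_flag, 4))  # 0.0098: over 99% of flags are false alarms
```

Even at 99 percent accuracy on both error types, roughly a hundred innocents are flagged for every real target, which is the arithmetic behind the "hundreds if not thousands" ratio above.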

Crossovers

Banks rely on credit scores to make decisions on whether to grant credit to loan applicants. Credit scores predict how likely customers are to repay their loans; arising from statistical models, the scores are subject to errors. Like polygraph examiners, loan officers have strong incentives to reduce false negatives at the expense of false positives. False-negative mistakes put money in the hands of people who will subsequently default on their loans, leading to bad debt, write-offs, or even insolvency for the banks. False-positive errors result in lost sales, as the banks deny worthy applicants who would otherwise have fulfilled their obligations. Notice, however, that false positives are invisible to the banks: once the customers have been denied loans, the banks cannot know whether they would have met their obligations to repay. Unsurprisingly, such asymmetric costs coax loan officers into rejecting more good customers than necessary while reducing exposure to bad ones. It is no accident that these decisions are undertaken by the risk management department, rather than sales and marketing.
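The loan officer's asymmetric incentives can be sketched as an expected-cost rule from decision theory. The dollar figures below are hypothetical.

```python
# Expected-cost sketch of the loan decision (hypothetical dollar figures).
loss_if_default = 1000   # cost of a false negative: a loan that goes bad
profit_if_repaid = 100   # foregone by a false positive: a good customer denied

def approve(p_default):
    # Approve only when expected profit outweighs expected loss.
    return (1 - p_default) * profit_if_repaid > p_default * loss_if_default

print(approve(0.05))  # True: expected profit 95 beats expected loss 50
print(approve(0.15))  # False: expected loss 150 swamps expected profit 85
```

With this tenfold cost asymmetry, the break-even default probability is 100/1,100, roughly 9 percent; everyone riskier is turned away, however many of them would in fact have repaid.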

The incentive structure is never static; it changes with the business cycle. During the giant credit boom of the early 2000s, low interest rates pumped easy money into the economy and greased a cheap, abundant supply of loans of all types, raising the opportunity cost of false positives (missed sales). At the same time, the economic expansion lifted all boats and lessened the rate of default of the average borrower, curtailing the cost of false negatives (bad debt). Thus, bank managers were emboldened to chase higher sales at what they deemed lower risks. But there was no free lunch: dialing down false positives inevitably generated more false negatives, that is, more bad debt. Indeed, by the late 2000s, banks that had unwisely relaxed lending standards earlier in the decade sank under the weight of delinquent loans, which was a key factor that tipped the United States into recession.

Jeffrey Rosenthal applied some statistical thinking to prove that mom-and-pop store owners had defrauded Ontario's Encore lottery. Predictably, a howl of protests erupted from the accused. Leaders of the industry chimed in, too, condemning his damning report as "outrageous" and maintaining that store owners had "the highest level of integrity."

Was it a false alarm? From the statistical test, we know that if store owners had the same chance in the lotteries as everyone else, then the probability they could win at least 200 out of 5,713 prizes was one in a quindecillion (1 followed by forty-eight zeros), which was practically zero. Hence, Rosenthal rejected the no-fraud hypothesis as impossible. The suggestion that he had erred was tantamount to believing that the insiders had beaten the rarest of odds fair and square. The chance of this scenario occurring naturally, that is, the chance of a false alarm, would be exactly the previous probability. Thus, we are hard-pressed to doubt his conclusion.
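Rosenthal's headline number can be approximated with a binomial tail probability, summed in log space to dodge floating-point underflow. The per-prize win chance of 1/100 for store owners is an assumption for illustration (roughly their expected share under the no-fraud hypothesis); the function and its structure are a sketch, not Rosenthal's actual computation.

```python
import math

def log10_binom_tail(n, p, k_min):
    # log10 of P(X >= k_min) for X ~ Binomial(n, p), summed in log space
    log_terms = []
    for k in range(k_min, n + 1):
        lt = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
              + k * math.log(p) + (n - k) * math.log1p(-p))
        log_terms.append(lt)
        if lt < log_terms[0] - 50:  # terms shrink fast beyond the mode
            break
    m = max(log_terms)
    return (m + math.log(sum(math.exp(t - m) for t in log_terms))) / math.log(10)

# At least 200 wins out of 5,713 prizes, with an assumed 1/100 chance each:
print(log10_binom_tail(5713, 0.01, 200))  # about -49: the quindecillion scale
```

The naive sum underflows to zero in ordinary floating point, which is why the log-sum-exp trick is needed to see just how astronomical the probability is.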

(Recall that there is an unavoidable trade-off between false positives and false negatives. If Rosenthal chose to absorb a higher false-positive rate (as much as one in a hundred is typical), he could reduce the chance of a false negative, which is the failure to expose dishonest store owners. This explains why he could reject the no-fraud hypothesis for western Canada as well, even though the odds of 1 in 2.3 million were higher.)

The Power of Being Impossible

Statistical thinking is absolutely central to the scientific method, which requires theories to generate testable hypotheses. Statisticians have created a robust framework for judging whether there is sufficient evidence to support a given hypothesis. This framework is known as statistical testing, also called hypothesis testing or significance testing. See De Veaux's textbook Stats: Data and Models for a typically fluent introduction to this vast subject.

Take the fear of flying developing-world airlines. This anxiety is based on the hunch that air carriers in the developing world are more prone to fatal accidents than their counterparts in the developed world. Arnold Barnett turned this hypothesis around and reasoned as follows: if the two groups of carriers were equally safe, then crash fatalities during the past ten years should have been scattered randomly among the two groups in proportion to the mix of flights among them. Upon examining the flight data, Barnett did not find sufficient evidence to refute the equal-safety hypothesis.
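Barnett's logic follows the standard shape of a significance test. The sketch below invents an event count (40 fatal events, with the 62 percent flight share from the text) and ignores the clustering of deaths within a single crash, which a real analysis must address; it shows only the form of the reasoning, not Barnett's actual method.

```python
import math

# Under equal safety, fatal events should split between carrier groups
# in proportion to flights flown. Event counts here are hypothetical.
n_events = 40          # assumed fatal events on between-worlds routes
flight_share = 0.62    # developing-world carriers' share of those flights
observed = 22          # assumed events on developing-world carriers (55%)

expected = n_events * flight_share                                  # 24.8
std_dev = math.sqrt(n_events * flight_share * (1 - flight_share))   # ~3.07
z = (observed - expected) / std_dev
print(round(z, 2))  # -0.91: well within chance, so equal safety survives
```

A z-statistic this small offers no grounds to reject the equal-safety hypothesis; by contrast, the aggregate 74-percent-versus-18-percent comparison would yield an enormous one, which is exactly why stratifying by route matters.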

All of Barnett's various inquiries, the comparison between developed-world and developing-world carriers and the comparison among U.S. domestic carriers, pointed to the same general result: that it was not impossible for these airlines to have equal safety. That was what he meant by passengers having "nowhere to run"; the next u
