Noise: A Flaw in Human Judgment

Central ideas:

 

1 – In similar cases of embezzlement, one defendant had been sentenced to 117 days in prison, while another received twenty years. Listing several such cases, Judge Marvin Frankel deplored the exorbitant powers of federal judges in the US. The variability in these and other judgments is called noise.

2 – Robyn Dawes achieved a major simplification of predictive tasks. His idea: instead of using multiple regression to determine the precise weight of each predictor variable, he proposed equal weights for all of them.

3 – The invisibility of noise is a direct consequence of causal thinking. Noise is inherently statistical: it becomes visible only when we think statistically about a set of similar judgments.

4 – If we take the average of a hundred judgments, we reduce the noise by 90%; if we take the average of 400 judgments, we reduce it by 95%, practically eliminating it. This statistical law is the mechanism behind the wisdom-of-crowds approach.

5 – What would organizations look like if they were reformed to reduce noise? Hospitals, hiring committees, economic analysts, government agencies, criminal justice systems, and other sectors would become alert to the problem. Noise auditing would become routine.

 

About the authors:

Daniel Kahneman is a professor of psychology at Princeton University and of public affairs at the Woodrow Wilson School of Public and International Affairs. He has received numerous awards, including the Nobel Prize in Economics in 2002. He is the author of the international bestseller Thinking, Fast and Slow.

Olivier Sibony is a professor of strategy and business policy in the MBA programs of HEC Paris (École des Hautes Études Commerciales). He is a consultant specializing in strategic decision-making and the design of decision-making processes.

Cass R. Sunstein specializes in constitutional law, regulatory policy, and the economic analysis of law. He writes for numerous media outlets, including The New York Times and The Washington Post. He is coauthor of Nudge and The Cost of Rights.

 

Introduction

To understand error in judgment, we must understand both bias and noise. Sometimes, as we shall see, noise is the more important problem. But in public discussions of human error, and in organizations around the world, it is rarely acknowledged. Bias steals the show; noise is a bit player, usually kept behind the scenes. The subject of bias is discussed in thousands of scientific articles and dozens of popular books, yet few of them mention the problem of noise. This book is an attempt to redress the balance.

In real-world decisions, the amount of noise is often outrageously high. Here are some examples of alarming amounts of noise in situations where precision matters.

Medicine is noisy. For the same patient, different doctors offer different judgments about skin or breast cancer, heart disease, tuberculosis, pneumonia, depression, and a host of other ailments. The noise is particularly high in psychiatry, in which subjective judgment is obviously important.

Expert predictions are noisy. Professional analysts offer highly variable forecasts about the sales of a new product, the likely growth of the unemployment rate, the likelihood of bankruptcy of a company in crisis, and just about anything.

Personnel decisions are noisy. People conducting job interviews make widely different evaluations of the same candidates. Performance reviews of the same employee are also highly variable and depend more on who is doing the review than on the performance being reviewed.

Forensic science is noisy. We have been taught to think of fingerprint identification as infallible. But fingerprint examiners sometimes differ when comparing the print found at a crime scene with that of a suspect.

Part I: Finding the Noise

In the 1970s, the once universal enthusiasm for judicial discretion began to crumble for one simple reason: the alarming evidence of noise. In 1973, a famous judge, Marvin Frankel, called attention to the problem. Before that, he had been a passionate advocate for free speech and human rights, helping to found the Lawyers Committee for Human Rights (an organization now known as Human Rights First).

Frankel did not provide any kind of statistical analysis to support his argument but offered a series of compelling cases showing unjustifiable disparities in the treatment of similar individuals. Two men, both with no criminal record, had been convicted of cashing forged checks worth $58.40 and $35.20, respectively. The first had received fifteen years in prison, the second thirty days. In similar cases for embezzlement, one defendant had been sentenced to 117 days in prison, while another to twenty years. Listing several such cases, Frankel deplored what he called the “almost completely indiscriminate and unlimited powers” of federal judges, which resulted in “arbitrary cruelties perpetrated daily,” which he considered unacceptable in a “government of laws, not of men.”

In the 1970s, Frankel’s arguments and the empirical results on which they were based came to the attention of Edward M. Kennedy, brother of the assassinated President John F. Kennedy and one of the most influential members of the U.S. Senate. Kennedy was shocked. In 1975, the senator introduced sentencing-reform legislation that did not go forward. But he did not give up. Citing the evidence, he continued year after year to push for its passage. In 1984, he finally succeeded: to combat unwarranted variability, Congress passed the Sentencing Reform Act of 1984.

The new legislation was intended to attack the problem of noise in the system by restricting “the unrestricted discretionary power that the law confers on those judges and parole authorities responsible for imposing and implementing such sentences.” In particular, members of Congress referred to an “unjustifiably wide” sentencing disparity, citing the finding that in the New York area the penalty in identical actual cases could range from three to twenty years in prison. As Judge Frankel had recommended, the law created the United States Sentencing Commission, whose primary function was clear: to produce binding guidelines that would establish a narrow range for criminal sentences.

Part II: Your Mind Is a Measuring Instrument

 

Evaluative Judgment. Up to this point, our focus has been on predictive judgment tasks, and most of the judgments we will see are of this type. But Chapter 1, on Frankel and the noise in federal judges’ sentences, examined another kind of judgment. Sentencing someone for a crime is not predictive; it is an evaluative judgment that seeks to match the sentence to the seriousness of the crime. Judges at wine competitions, judges at skating or show-jumping competitions, professors grading students, and committees awarding research grants all make evaluative judgments.

A different kind of evaluative judgment occurs in decisions that involve multiple options and preferences among them. Consider managers selecting candidates, investors deciding between different strategies, or even presidents responding to an epidemic in Africa. No doubt the inputs to all these decisions depend on predictive judgments: for example, how the candidate will do in the first year, how the stock market will react to a certain move, or how quickly the epidemic will spread if left unchecked. But the final decisions require weighing the pros and cons of the various options, and that weighing is an evaluative judgment.

Like predictive judgment, evaluative judgment comes with an expectation of bounded disagreement. One will hardly hear a self-respecting federal judge say, “I prefer this punishment and don’t care if my colleagues think differently.” And the decision maker choosing among several strategic options expects that colleagues and observers armed with the same information and sharing the same goals will agree with him, or at least not disagree too much. Evaluative judgment depends in part on the values and preferences of the decision maker, but it is not merely a matter of taste or opinion.

For this reason, the boundary between predictive and evaluative judgments is vague, and the person making the judgment is often unaware of it: judges determining sentences and teachers assigning grades are simply focused on the task and committed to finding the “right” answer.

 

Part III: The Noise in Predictive Judgment

More simplicity: robustness and beauty. Robyn Dawes was another member of the extraordinary team in Eugene, Oregon, that studied judgment in the 1960s and 1970s. In 1974, Dawes made great progress in simplifying predictive tasks. His idea was surprising, almost heretical: instead of using multiple regression to determine the precise weight of each predictor variable, he proposed equal weights for all of them.

Dawes called the equal-weights formula an improper linear model. His surprising discovery was that these equal-weight models are almost as accurate as “proper” regression models, and far superior to clinical judgments.

Today, many years after Dawes’ great innovation, the statistical phenomenon that so astonished his contemporaries is well understood. As we explained earlier in this book, multiple regression computes the “optimal” weights that minimize squared errors. But multiple regression minimizes the error in the original data only. The formula therefore adjusts itself to predict the chance flukes in that data. If, for example, the sample includes some managers with high technical skills who also did exceptionally well for unrelated reasons, the model will overstate the weight of technical skill.

The challenge is this: when the formula is applied outside the sample – that is, when it is used to predict outcomes in a different data set – the weights are no longer optimal. The flukes of the original sample are no longer present, precisely because they were flukes; in the new sample, managers with high technical skills do not stand out, and the new sample has flukes of its own, which the formula cannot predict. The correct measure of a predictive model’s accuracy is its performance in a new sample, called the cross-validated correlation. In practice, a regression model fits the original sample too well, and the cross-validated correlation is almost always lower than the correlation in the original data. Dawes and Corrigan compared equal-weight models with cross-validated multiple regression models in several situations.

One of their examples involved predicting the first-year grade point average of ninety psychology graduate students at the University of Illinois, using variables linked to academic success: aptitude test scores, grades, peer ratings of traits such as extroversion, and self-ratings of traits such as integrity. The standard multiple regression model obtained a correlation of 0.69, which shrank to 0.57 (PC = 69%) in cross-validation. The correlation of the equal-weight model with first-year grade point average was about the same: 0.60 (PC = 70%). Similar results have been obtained in many other studies.
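To see this shrinkage concretely, here is a minimal simulation sketch in Python. It does not use Dawes and Corrigan’s data; the sample sizes, predictor weights, and noise level are illustrative assumptions. It fits ordinary least squares on a small training sample, measures the model’s correlation both in that sample and in a fresh validation sample, and compares the result with an equal-weight model.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sample(n, true_w, noise_sd=1.0):
    """Simulate standardized predictors and an outcome driven by true_w plus noise."""
    X = rng.standard_normal((n, len(true_w)))
    y = X @ true_w + noise_sd * rng.standard_normal(n)
    return X, y

true_w = np.array([0.4, 0.3, 0.2, 0.1])       # assumed "true" predictor weights
X_train, y_train = make_sample(90, true_w)    # small training sample, like the 90 students
X_valid, y_valid = make_sample(10_000, true_w)

# Multiple regression: weights that minimize squared error in the training sample.
w_ols, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Equal-weight ("improper") model: simply add up the standardized predictors.
w_equal = np.ones_like(true_w)

def corr(w, X, y):
    """Correlation between the model's prediction and the outcome."""
    return np.corrcoef(X @ w, y)[0, 1]

print("regression, in-sample:      ", round(corr(w_ols, X_train, y_train), 2))
print("regression, cross-validated:", round(corr(w_ols, X_valid, y_valid), 2))
print("equal weights, new sample:  ", round(corr(w_equal, X_valid, y_valid), 2))
```

On a typical run, the regression’s in-sample correlation is noticeably higher than its cross-validated correlation, while the equal-weight model does about as well on new data as the cross-validated regression, which is the pattern Dawes described.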

Part IV: How Noise Happens

Noise is statistical. As we have already noted, our normal way of thinking is causal. We naturally pay attention to the particular, following and creating causally coherent accounts of individual cases, in which failures are attributed to errors and errors to biases. The ease with which bad judgments can be explained away leaves no room for noise in our accounts of error.

The invisibility of noise is a direct consequence of causal reasoning. Noise is inherently statistical: it becomes visible only when we think statistically about a set of similar judgments. Once we do, it can hardly go unnoticed: it is the variability in retrospective statistics about sentencing decisions and insurance premiums; it is the range of possibilities when you and others consider how to predict a future outcome; it is the scattering of shots around a target. Causally, noise is nowhere; statistically, it is everywhere.

Adopting the statistical perspective is not easy. We readily invoke causes for the events we observe, but thinking statistically about them requires study and mastery of the subject. Causes are natural; statistics are difficult.

The result is a marked imbalance in our view of bias and noise as sources of error. Perhaps you have seen the type of illustration used in introductory psychology courses in which a detailed figure stands out against an indistinct background. Our attention is firmly fixed on the figure even when it is small against the background. Figure/ground demonstrations are an apt metaphor for our intuitions about bias and noise: bias is the figure that attracts the eye, while noise is the background, to which we pay no attention. In this way, we remain largely oblivious to an important flaw in our judgment.

We are properly focused on reducing biases. We should also be concerned with reducing the noise.

 

Part V: Improving Judgments

Improving predictions. Research also offers suggestions for reducing noise and bias. We will not review them exhaustively here, but we will cover two noise reduction strategies that have wide applications. One is the use of the principle mentioned earlier: selecting better judges produces better judgments. The other is one of the most universally applicable decision hygiene strategies: aggregating multiple independent estimates.

The easiest way to aggregate diverse predictions is to take their average. Averaging is a mathematical guarantee of noise reduction: specifically, it divides the noise by the square root of the number of judgments averaged. This means that if we average a hundred judgments, we reduce the noise by 90%; if we average 400 judgments, we reduce it by 95%, essentially eliminating it. This statistical law is the mechanism behind the wisdom-of-crowds approach.
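As a quick numerical check of this square-root law, the following sketch simulates independent noisy judgments (the true value and the noise level are made-up, illustrative numbers) and measures how much the noise of the average shrinks as the number of judgments grows.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 50.0
judgment_sd = 10.0            # noise (standard deviation) of a single judgment

def noise_of_average(n, trials=10_000):
    """Standard deviation of the mean of n independent noisy judgments."""
    judgments = true_value + judgment_sd * rng.standard_normal((trials, n))
    return judgments.mean(axis=1).std()

for n in (1, 100, 400):
    sd = noise_of_average(n)
    print(f"n = {n:>3}: noise ≈ {sd:5.2f} (reduction ≈ {100 * (1 - sd / judgment_sd):.0f}%)")
```

The printed reductions come out at roughly 90% for n = 100 and 95% for n = 400, matching the 1/√n rule.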

Since averaging does nothing to reduce bias, its effect on the total error, the mean squared error (MSE), depends on the proportions of bias and noise that the error contains. This is why the wisdom of crowds works best when judgments are independent, and therefore less likely to contain shared biases. Empirically, ample evidence suggests that averaging multiple predictions greatly increases accuracy; the “consensus” forecasts of stock market analysts are one example. For forecasts of sales, weather, and economic variables, the unweighted group average outperforms most, and sometimes all, individual forecasters. Averaging forecasts obtained by different methods has the same effect: in an analysis of thirty empirical comparisons in various areas, combined forecasts reduced errors by 12.5% on average.
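The following sketch illustrates this decomposition with simulated judges who share a common bias (the bias and noise values are invented for illustration): averaging a hundred judgments removes nearly all of the noise term from the mean squared error, but the squared bias remains untouched.

```python
import numpy as np

rng = np.random.default_rng(2)
true_value = 100.0
shared_bias = 5.0             # every judge overestimates by 5 on average (shared bias)
noise_sd = 10.0               # judge-to-judge variability (noise)
n_judges, trials = 100, 10_000

# One noisy, biased judgment per trial vs. the average of 100 such judgments.
single = true_value + shared_bias + noise_sd * rng.standard_normal(trials)
panel = (true_value + shared_bias
         + noise_sd * rng.standard_normal((trials, n_judges))).mean(axis=1)

def mse(estimates):
    """Mean squared error relative to the true value."""
    return np.mean((estimates - true_value) ** 2)

# MSE = bias^2 + noise^2: about 25 + 100 for one judge, about 25 + 1 for the average of 100.
print("single judge MSE:  ", round(mse(single), 1))
print("average of 100 MSE:", round(mse(panel), 1))
```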

The simple arithmetic mean is not the only way to aggregate forecasts. A select-crowd strategy, which ranks judges by the accuracy of their recent judgments and averages the judgments of a small number of them (for example, five), can be just as effective as the simple average. A strategy that combines aggregation with selection is also easier to accept for a decision maker who respects expert knowledge.
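A minimal sketch of such a select-crowd strategy, using simulated judges with assumed skill levels and made-up past cases, might look like this: rank the judges by their error on recent judgments, then average only the five most accurate on new cases.

```python
import numpy as np

rng = np.random.default_rng(3)
n_judges, n_past, n_new, truth = 50, 20, 1_000, 100.0
skill_sd = rng.uniform(2.0, 15.0, n_judges)        # each judge has a different noise level

# Past cases establish a track record; new cases are what we actually care about.
past = truth + skill_sd[:, None] * rng.standard_normal((n_judges, n_past))
new = truth + skill_sd[:, None] * rng.standard_normal((n_judges, n_new))

recent_error = np.abs(past - truth).mean(axis=1)   # accuracy on recent judgments
best = np.argsort(recent_error)[:5]                # the five most accurate judges

err_all = np.abs(new.mean(axis=0) - truth).mean()          # simple average of everyone
err_best = np.abs(new[best].mean(axis=0) - truth).mean()   # average of the select crowd
print(f"mean error, average of all {n_judges} judges: {err_all:.2f}")
print(f"mean error, average of best 5 judges: {err_best:.2f}")
```

On typical runs, the error of the five-judge select crowd is comparable to, and often slightly smaller than, that of the simple average of all fifty judges.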

Another method for producing aggregate forecasts is prediction markets, in which individuals bet on likely outcomes and are thereby given an incentive to make correct forecasts. Most of the time, prediction markets have done very well, in the sense that when the market price implies that an event has, say, a 70% chance of happening, it happens about 70% of the time. Many companies in various industries use prediction markets to aggregate diverse opinions.
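That kind of calibration can be checked directly: group events by market price and compare each price band with the observed frequency of occurrence. The sketch below runs the check on simulated prices and outcomes from a hypothetical, well-calibrated market, simply to show what the computation looks like.

```python
import numpy as np

rng = np.random.default_rng(4)
prices = rng.uniform(0.05, 0.95, 5_000)    # market-implied probabilities of 5,000 events
outcomes = rng.random(5_000) < prices      # simulate a well-calibrated market

# Compare each price band with the observed frequency of the events occurring.
bins = np.linspace(0.0, 1.0, 11)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (prices >= lo) & (prices < hi)
    if mask.any():
        print(f"price {lo:.1f}-{hi:.1f}: events happened {outcomes[mask].mean():.0%} of the time")
```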

Part VI: Optimal Noise

Setting standards without specifying details usually leads to noise, which can be controlled with some of the strategies already discussed, such as aggregating judgments and using the mediating assessments protocol. Sometimes leaders want to set rules but are unable to agree on them in practice. Constitutions themselves contain many standards (protecting freedom of religion, for example). The same is true of the Universal Declaration of Human Rights (“All human beings are born free and equal in dignity and rights”).

The great difficulty of getting diverse people to agree on rules is one reason for setting standards instead. The leaders of a company may be unable to agree on the specific words that would dictate how employees should deal with customers; standards may be the best they can achieve. The public sector is analogous. Legislators can agree on a standard (and tolerate the resulting noise) if that is the price of passing legislation at all. Doctors may agree on standards for diagnosing disease, but if someone tries to impose rules, intractable disagreement usually ensues.

But social and political divisions are not the only reason people resort to standards rather than rules. Sometimes the real problem is that they lack the information that would enable them to produce sensible rules. A university may be unable to write rules to govern its decisions to promote faculty members. An employer may have difficulty foreseeing all the circumstances that would lead him to retain or suspend an employee. Federal regulators may not know the proper permissible levels of air pollutants – particulate matter, ozone, nitrogen oxide, and lead. The best they can do is adopt some kind of standard and trust the experts to specify its meaning, even if the consequence is noise.

Rules can be biased in many ways. A law could bar women from the police force. Even when they embed a large bias, rules markedly reduce noise (if everyone abides by them). If the law prohibits people under 21 from buying or consuming alcoholic beverages, and everyone abides by it, there will probably be little noise. Standards, on the other hand, are an invitation to noise.

In business and government, the choice between rules and standards is often intuitive, but it can be made in a more disciplined way. As a first approximation, the choice depends on only two factors: (1) the cost of decisions, and (2) the cost of errors.

With standards, the cost of decisions can be very high for judges of all kinds, simply because it falls to them to give the standard content. The exercise of judgment can be costly. To make his best diagnosis, a doctor may have to reflect at length on each case (and the resulting judgments may well be noisy). When doctors have clear guidelines for deciding whether patients have pharyngitis, their decisions can be quick and relatively unambiguous. If the speed limit is 100 km/h, the police officer does not have to waste time thinking about how fast cars should be allowed to go; but if a standard states that drivers must not travel at an “unreasonable speed”, he will rack his brains to make up his mind (and enforcement will certainly be noisy). With rules, the costs of decisions are, in general, much lower.

Still, it is complicated. The enforcement of a rule can be unambiguous, but before the rule is established, someone has to decide what it consists of, and creating a rule can be difficult; sometimes the cost is prohibitive. That is why legal systems and private companies alike often use words like reasonable, prudent, and feasible, and why such terms play an equally important role in areas such as medicine and engineering.

 

Epilogue

Imagine what organizations would be like if they were reformed to reduce noise. Hospitals, hiring committees, economic analysts, government agencies, insurance companies, public health authorities, criminal justice systems, law firms, and universities would be more alert to the problem and try to reduce it. The noise audit would be routine: it could be conducted annually.

Leaders of organizations would use algorithms to replace human judgment, or to supplement it, in many more areas than today. People would break down complex judgments into simpler mediating assessments. They would know decision hygiene and follow its prescriptions. Independent judgments would be made and then aggregated. Business meetings would look very different; discussions would be more structured. The outside view would be more systematically integrated into the decision-making process. Open disagreements would be more frequent and resolved more constructively.

The result would be a less noisy world. This would save a lot of money, improve public health and safety, increase fairness, and prevent avoidable mistakes. Our goal in writing this book was to draw attention to this opportunity. We hope you will take advantage of it.

 

FACTSHEET:

Title: Noise: A Flaw in Human Judgment

Authors: Daniel Kahneman, Olivier Sibony and Cass R. Sunstein

Review: Rogério H. Jönck
