GASSSPP the Destroyer

I have characterized GASSSPP as a “black hole,” meaning that it destroys everything that comes close enough to cross the “event horizon,” which in our case is defined by the criteria we must meet to publish a piece of work.  That may seem a bit extreme, so let me illustrate some key features of the GASSSPP “mythomethodology” that combine to destroy the collective value of our research.

The first problem is the most pervasive and the most subtle, and for those reasons many people have a hard time accepting it when they first hear it.  This is the effective substitution of statistical significance for statistical effect (for some illustrations, see pages 11-13 of my recent paper, 50 Lost Years).  There are two related parts to this.  First, whenever we analyze data we have collected in a study, the objective really boils down to one of two questions: if there is a difference, for example between means, how big is the difference?  If there is an association (a correlation or regression coefficient), how strong is the association?  There is no rocket science here: these are measures of “statistical effect,” and they are the only measures that really matter in making a decision about the results of our study.  Second, when we designed the study, we should have specified either an α level expressing the maximum acceptable probability of a “Type I” error, that of rejecting the null hypothesis when it is true (a false positive), or have chosen to calculate an evidential p value.  We should be absolutely clear that p and α are not the same thing and cannot be used interchangeably, despite the fact that the large majority of published research treats them as if they were (Hubbard & Armstrong, 2006)!  The primary issue with α levels is risk: incorrectly rejecting a null hypothesis about the toxicity of a new drug carries a totally different risk from incorrectly rejecting a null hypothesis about the toxicity of a new rat poison.  People are not rats, and the stakes are entirely different, because in one case a Type I error might result in the deaths of people, while in the other it might merely leave some rats alive.  In the drug study we might accept only one chance in 10,000 (α = .0001), whereas in the rat poison study one chance in a hundred (α = .01) might be acceptable.  The p value is the level of “statistical significance,” and it tells us the probability of obtaining data at least as extreme as ours if the null hypothesis were true (P(D|H)), that is, under the conditions we assumed when we set up the study in the first place.  In most of our literature p and α are hopelessly confused, and we routinely see nonsense “tests” in which “significance” is claimed because p < α.
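
To make the distinction concrete, here is a minimal sketch in Python (using scipy, with summary statistics I invented purely for illustration): the same t-test attaches a tiny p value to a trivial difference when the sample is huge, and a large p value to a substantial difference when the sample is small.

```python
# Hypothetical summary statistics, invented for illustration, showing why
# statistical significance is not statistical effect: with a huge sample a
# trivial mean difference yields a tiny p value, while a substantial
# difference in a small sample does not reach p < .05.
from scipy import stats

# Case 1: trivial effect (difference of 0.02 SD), 50,000 observations per group
t1, p1 = stats.ttest_ind_from_stats(mean1=0.00, std1=1.0, nobs1=50_000,
                                    mean2=0.02, std2=1.0, nobs2=50_000)

# Case 2: substantial effect (difference of 0.50 SD), 20 observations per group
t2, p2 = stats.ttest_ind_from_stats(mean1=0.00, std1=1.0, nobs1=20,
                                    mean2=0.50, std2=1.0, nobs2=20)

print(f"Trivial effect, huge sample:      difference = 0.02 SD, p = {p1:.4f}")  # p is about .002
print(f"Substantial effect, small sample: difference = 0.50 SD, p = {p2:.4f}")  # p is about .12
# The p values point in the opposite direction from the effects.  The question
# that matters for a decision is "how big is the difference?", and only the
# effect size answers it.
```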

Several derivative confusions in the interpretation of these concepts (not too surprising, given that many statistics text authors seem to be competing for a prize in obfuscation) have become institutionalized in our literature.  The most fundamental is the belief that the p level tells us the probability that the hypothesis is correct given the data, or P(H|D).  The p level has absolutely nothing to do with this; it is an entirely different quantity.  But this is exactly the conclusion that the majority of published research asserts.  The misinterpretation of p has become so severe that many researchers now believe that any effect attaining statistical significance demonstrates the correctness of their idea, so that getting the magical p < .05 shows they have “discovered” some previously unknown important relationship.  Ironically, most research now reports effect sizes such as regression coefficients and differences between means, but these are simply ignored in favor of the p level.  In the worst cases, only the p levels are reported, and the effects are not shown at all!  At the risk of beating a dead horse: the research literature almost universally discusses p as if it means P(H|D) instead of P(D|H).  This is dead wrong.
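
The gap between the two conditional probabilities is easy to make concrete with Bayes’ rule.  The sketch below uses assumed numbers, not estimates from any study: a field in which only 10 percent of the null hypotheses tested are actually false, tests run at α = .05, and average power of .50.

```python
# A small numeric illustration (all numbers are assumptions chosen for
# illustration) of why P(D|H) and P(H|D) are different quantities.  Suppose
# that in some field only 10% of the null hypotheses tested are actually
# false (a real effect exists), tests are run at alpha = .05, and the average
# power to detect a real effect is .50.  Bayes' rule then gives the
# probability that a "significant" result corresponds to a real effect.

prior_real = 0.10   # P(real effect) before seeing the data
alpha      = 0.05   # P(significant | null true), a P(D|H)-type quantity
power      = 0.50   # P(significant | real effect)

p_significant = prior_real * power + (1 - prior_real) * alpha
p_real_given_significant = (prior_real * power) / p_significant   # P(H|D)

print(f"P(significant | null true)   = {alpha:.2f}")
print(f"P(real effect | significant) = {p_real_given_significant:.2f}")  # about 0.53
# The test controls the first number; researchers routinely act as if it were
# the second.  Under these assumptions a "significant" finding reflects a real
# effect only about half the time, a very different claim from "95 percent
# certain."
```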

Having made this fundamental mistake, we can now imbue p with a bunch of other “powers” (check the list on the GASSSPP page).  By substituting statistical significance for statistical effect, and using statistical significance as the sole criterion for deciding whether a study tells us anything, we guarantee two kinds of mistakes: (1) we automatically pay attention to statistically significant outcomes that have a negligible statistical effect, and (2) we ignore statistical effects that are not statistically significant but might be worth pursuing further.  The result is that we are absolutely assured of sucking large amounts of noise and garbage into the “body of knowledge” that comprises our collective research.  Among others, Hunter (1997) points out that the exclusive use of statistical significance to evaluate research outcomes produces an error rate of 50 percent, a coin toss.  If we use large samples, however, despite the benefit of increased power, the percentage correct drops to 40.

Claiming that statistical significance has been substituted for statistical effect may sound like an exaggeration, but it isn’t.  I gave a paper in Dubai in April 2011, and in one session I attended, a young man presented a study whose sole objective was to find out whether comparisons of the mean performance of microfinance programs in four countries, under two different modes of lending, displayed any statistically significant differences.  That was the entire objective: to see whether any of these comparisons were statistically significant.  What he “found” was that one of the four programs showed a significant performance difference; however, that mean difference (the statistical effect) was the smallest of the four comparisons, and it achieved significance only because the standard errors were small.  Meanwhile, he completely ignored two other comparisons with very material differences because they were not statistically significant!  I tried to get him to see why he was ignoring his best findings and paying attention to a meaningless difference, but I don’t know whether I succeeded.  The GASSSPP rot has clearly spread around the world!
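
For readers who prefer numbers, here is a hypothetical version of that pattern in Python (the figures are invented for illustration and are not his data): the smallest difference is the only “significant” one, purely because its standard errors are small.

```python
# Hypothetical figures (not the data from the study described above) showing
# how a small difference with small standard errors comes out "significant"
# while a much larger difference with larger standard errors does not.
from scipy import stats

# Programs A vs B: means differ by 0.5 units, very little spread in the data
tA, pA = stats.ttest_ind_from_stats(mean1=10.0, std1=0.8, nobs1=60,
                                    mean2=10.5, std2=0.8, nobs2=60)

# Programs C vs D: means differ by 4.0 units, but with much more spread
tC, pC = stats.ttest_ind_from_stats(mean1=10.0, std1=9.0, nobs1=30,
                                    mean2=14.0, std2=9.0, nobs2=30)

print(f"A vs B: difference = 0.5, p = {pA:.4f}")   # p < .01, trivial difference
print(f"C vs D: difference = 4.0, p = {pC:.4f}")   # p > .05, material difference
# Ranking comparisons by p value alone points straight at the smallest effect
# and away from the one most worth investigating further.
```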

Having now accepted a number of false interpretations of p, the GASSSPP expands its range of errors in several directions.  Many readers will recall the “cold fusion” (or Low Energy Nuclear Reaction—LENR) buzz from 1989, when Martin Fleischmann and Stanley Pons, then at the University of Utah, announced the discovery of apparent nuclear fusion in a tabletop low-temperature device.  This created huge global excitement in the hope that a safe, inexpensive, and scalable new source of electrical energy had been discovered.  The Japanese quickly started a LENR research project and eventually spent over $20 million on it.  LENR, however, flew in the face of conventional nuclear-fusion theory, and so the scientific community immediately set to work to replicate the Fleischmann-Pons experiments.  No one was able to do it, and to make a long story short, LENR is now considered a “pathological science,” meaning in part that no one else had been able to confirm the original experiments.  Funding for LENR research is now virtually impossible to obtain, and publication of any research on it is equally difficult, because no one else can reliably reproduce what Fleischmann and Pons first reported, and it is simply considered bogus science (despite work on LENR by a group at the U.S. Navy’s Space and Naval Warfare Systems Center in San Diego reporting some positive but puzzling results).

Many years before, in the early years of solar energy research, Stanford Ovshinsky caused a similar upset in that field when he claimed that amorphous silicon (like beach sand) could produce electricity from light.  The conventional wisdom at the time was that the production of electricity from silicon was the result of light bumping electrons out of a crystalline structure, and without that crystalline structure, it would not happen; hence, generation of electricity from amorphous (noncrystalline) silicon was not possible.  In this case, however, Ovshinsky’s results were verified through replication, and we now know that he was correct.

What both of these stories illustrate is that real science relies on the ability to reproduce results that challenge existing models and thinking, and when that can be done successfully the thinking has to change (what Kuhn, 1962, called a “paradigm shift”).  Failure to confirm such outcomes results in rejection of the new ideas.  One of the most glaring and persistent failures of GASSSPP research is the lack of replication studies, the second major destroyer of research value.  The mythomethodology here is the belief that the level of statistical significance obtained in a result is the likelihood that a replication of the study would produce a different outcome: if a study produced a result at p < .01, this is taken to mean that there is only a one percent chance that a repeat study would come out differently (or, equivalently, that it would be 99 percent likely to produce the same result).  Given that, the logic continues, why repeat the study?
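
The myth is easy to test with a simulation.  The sketch below assumes a modest true effect and small samples (both invented for illustration) and asks how often a study that happened to reach p < .01 is followed by an independent replication that reaches even p < .05.

```python
# A simulation sketch (hypothetical effect size and sample size; the point is
# the logic, not the particular numbers) of the claim that p < .01 means a
# repeat study is 99 percent likely to give the same result.  We draw many
# pairs of independent "original" and "replication" studies of the same true
# effect and ask: of the originals reaching p < .01, how often did the
# replication reach even p < .05?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect, n_per_group, n_pairs = 0.4, 30, 20_000

def one_study_p():
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(true_effect, 1.0, n_per_group)
    return stats.ttest_ind(a, b).pvalue

originals    = np.array([one_study_p() for _ in range(n_pairs)])
replications = np.array([one_study_p() for _ in range(n_pairs)])

hit = originals < 0.01                         # "originals" significant at .01
rep_rate = np.mean(replications[hit] < 0.05)
print(f"Originals with p < .01: {hit.sum()}")
print(f"Share of their replications reaching p < .05: {rep_rate:.0%}")
# Under these assumptions the replication rate is roughly the power of the
# replication study (about a third), nowhere near 99 percent.  The original
# p value says nothing about replicability.
```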

The problem is that this belief is simply not true—statistical significance has absolutely no bearing on, or relationship to, replication of a result.  I don’t know where this myth originated, and it would be fun to try to track it down, but someone somewhere managed to convince themselves that there was such a relationship, got that statement published, and it has since “gone viral” and become an unquestioned GASSSPP supposition.  But it is still incorrect.

Wrong or not, this myth is so pervasive that it is nearly impossible to publish a replication study of anything in our literature, and that is one huge reason why GASSSPP is the destroyer that it is.  Without replication we can neither expunge the erroneous nor promote the promising.  It is difficult to publish any study in the first place; once a study has been published, the topic is considered “done,” and so even a highly similar study is likely to be rejected as “merely a replication,” regardless of whether it reinforces or refutes the original work.  We have gone so far with this belief that we do not consider a study worthy of publication unless it is sufficiently differentiated from all previous work.  We have effectively imposed a requirement that research studies, like dissertations, must be unique; combined with the mistaken belief that significance levels predict replicability, this has created what I termed in my online book (2002) a situation of “dysfunctional uniqueness,” a third GASSSPP destroyer of value.

Dysfunctional uniqueness exacerbates a fourth GASSSPP weakness, that of measurement.  It is extremely difficult to measure variables in the social sciences, and the irony is that these are precisely the fields where measurement is of enormous importance.  Virtually nothing that we study in business and management research is the outcome of a single predictor variable; on the contrary, it is hard to imagine outcomes that depend on fewer than 10 variables.  For example, how an individual performs on a job is a function of multiple factors specific to the person (motivation, skills, goals, perceptual processes, etc.), interacting with situational factors (physical environment, organizational factors, externalities, and the like, each of which is itself multiple), and with transient factors that are both rare and unpredictable.  Any one factor that might account for, say, 15 percent of that performance is worth knowing about on its own, but it will challenge our ability to measure it accurately and consistently, i.e., validly and reliably.  Despite the obvious importance of measurement, there is nothing in the social sciences like the institutions devoted to standards and measures found in the exact sciences and in commerce.
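
A standard psychometric result, the attenuation formula, shows what is at stake.  The sketch below applies it to the 15 percent example; the reliability values are assumptions chosen only for illustration.

```python
# A small sketch of why measurement quality matters so much here.  It uses the
# classical attenuation formula from psychometrics: observed correlation =
# true correlation x the square root of the product of the two reliabilities.
# The reliability values below are assumptions for illustration.
import math

true_r = math.sqrt(0.15)   # a factor that truly explains 15% of performance
for reliability in (1.0, 0.8, 0.7, 0.5):
    # assume both measures are equally reliable
    observed_r = true_r * math.sqrt(reliability * reliability)
    print(f"reliability = {reliability:.1f}:  observed r = {observed_r:.2f}, "
          f"variance apparently explained = {observed_r**2:.1%}")
# With reliabilities around .70 (not unusual for attitude and performance
# measures), the same factor appears to explain only about 7% of the variance,
# and in a modest sample it may not even reach statistical significance.
```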

Peer review as it is practiced in the GASSSPP context also contributes to the problem.  Peer review is considered a sine qua non of science, and all of the journals considered to play a serious role in social science are peer-reviewed.  But as Bedeian (2004) illustrates, peer review is also a mechanism for the “social construction” of knowledge, and in that capacity it is just as much a gatekeeper for the status quo as a procedure for establishing the quality of content.  In fact, Jefferson et al. (2002a, 2002b) found little evidence that peer review improved the quality of articles in medicine, and Wager & Jefferson (2001) concluded that peer review in medical journals is “sometimes ineffective at identifying important research and even less effective at detecting fraud.”  In a classic study, Peters & Ceci (1982) showed that psychology journals evaluated their own recently published articles with wild inconsistency, rejecting eight of the nine resubmitted articles that went undetected as resubmissions!  But to many GASSSPP researchers, the fact that our journals are peer-reviewed is all that needs to be said when questions about research quality are raised.  I think that is far too generous; if nothing else, we cannot deny that peer review has been absolutely ineffective at detecting and correcting the false beliefs about statistical significance testing in our journals.

Other characteristics of GASSSPP also contribute to its role as a destroyer of knowledge, but I’ll stop with these five.  The bottom line is that for all the hard work scholars put into their research, once it conforms to the GASSSPP model it is doomed.  Whether a particular study was a work of genius, a bit of unintended luck, or absolute rubbish will never be known.  It will simply become a line on a vita; it may be fortunate enough to be cited by other GASSSPP researchers, but it will otherwise serve as neither a basis for sound science nor a guide to action in the outside world.

I fear that, as a profession, we do ourselves a great disservice by allowing this state of affairs to continue.  AACSB International, the premier accrediting organization for business schools, estimates that as of 2005 b-school research produced about 20,000 articles each year at a cost of USD 320 million (AACSB, 2007).  Yet there is no study anywhere showing that what we produce is of any relevance or value to the professional community (I will be adding an extensive reference page on this subject in the future); there are, unfortunately, studies showing what Porter & McKibbin (1988) concluded from their extensive first-hand study of management education: that managers “ignore business-school research with impunity.”  If, as some argue, we claim to have a valid basis for teaching others how to organize and manage in a competitive world, it is difficult to reconcile that claim with such a track record of wasted work; and in a period of financial and economic stress, I have to wonder how long it will be until those paying the cost ask whether they are getting any worthwhile return.  Watching our research being done with fatally flawed processes, a problem well documented in our own journals and carried on with the apparent sanction of our leading professional organizations and their leadership, I can only conclude that our profession is woefully unprepared to answer such a question, or even to defend its existence.

References for this page

AACSB International. Impact of Research Draft Report. St. Louis, Mo.: AACSB International; 2007.

Bedeian, Arthur G. Peer review and the social construction of knowledge in the management discipline. Academy of Management Learning & Education. 2004 Jun; 3(2):198-216.

Hubbard, Raymond and Armstrong, J. Scott. Why we don’t really know what “statistical significance” means: a major educational failure. Journal of Marketing Education. 2006; 28(2):114-120.

Hunter, John E. Needed: A ban on the significance test. Psychological Science. 1997; 8(1):3-7.

Jefferson, Tom; Alderson, Philip; Wager, Elizabeth, and Davidoff, Frank. Effects of editorial peer review: A systematic review. JAMA, The Journal of the American Medical Association. 2002a Jun 5; 287(21):2784-2786.

Jefferson, Tom; Wager, Elizabeth, and Davidoff, Frank. Measuring the quality of editorial peer review. JAMA, The Journal of the American Medical Association. 2002b Jun 5; 287(21):2786-2790.

Kmetz, John L. The skeptic’s handbook: Consumer guidelines and a critical assessment of business and management research. Social Science Research Network; 2002; http://ssrn.com/author=53148.

Kuhn, Thomas S. The structure of scientific revolutions. Chicago: University of Chicago Press; 1962.

Peters, Douglas and Ceci, Stephen J. Peer-review practices of psychological journals: the fate of published articles, submitted again. The Behavioral and Brain Sciences. 1982; 5:187-195.

Porter, Lyman W. and McKibbin, Lawrence E. Management education and development: Drift or thrust into the 21st century? New York: McGraw-Hill; 1988.

Wager, Elizabeth and Jefferson, Tom. Shortcomings of peer review in biomedical journals. Learned Publishing. 2001 Oct; 14:257-263.