Statistical Significance References

The following roughly 140 items form an annotated list of references for readers who may not be fully aware of the extent of the published material documenting that statistical significance is not an appropriate criterion for evaluating the strength or importance of results in social science research. A few items are included as examples of textbook errors in this area, and a few actually defend significance testing when it is interpreted correctly. None of these, however, supports the erroneous interpretation of significance testing as it is found in our literature.

…and statistics. The Economist. 2005 Sep 3; 376(8442):72.

Economics focus: Signifying nothing? The Economist. 2004 Jan 31; 370(8360):76.

Abelson, Robert P. On the surprising longevity of flogged horses: Why there is a case for the significance test. Psychological Science. 1997; 8(1):12-15.
Abstract: Criticisms of null-hypothesis significance tests (NHSTs) are reviewed.  Used as formal, two-valued decision procedures, they often generate misleading conclusions. However, critics who argue that NHSTs are totally meaningless because the null hypothesis is virtually always false are overstating their case. Critics also neglect the whole class of valuable significance tests that assess goodness of fit of models to data.  Even as applied to simple mean differences, NHSTs can be rhetorically useful in defending research against criticisms that random factors adequately explain the results, or that the direction of mean difference was not demonstrated convincingly.  Principled argument and counterargument produce the lore, or communal understanding, in a field, which in turn helps guide new research.  Alternative procedures—confidence intervals, effect sizes, and meta-analysis—are discussed.  Although these alternatives are not totally free from criticism either, they deserve more frequent use, without an unwise ban on NHSTs.

—. Statistics as principled argument. Hillsdale, N.J.: Lawrence Erlbaum; 1995; ISBN: 0-8058-0528-1.  Although Abelson defends significance testing when it is properly done, he provides one of the best short explanations (pp. 40-41) of the mistaken interpretation of p that dominates our literature.

Abrahamson, Eric. Torturing data until it speaks: Significance testing and the file drawer problem in organizational research. Paper presented at the symposium on Rejecting the Null Hypothesis, Academy of Management National Meeting, Miami, FL; 2005.

Aiken, Leona S.; West, S. G.; Sechrest, L.; Reno, R. R.; Roediger III, H. L.; Scarr, S.; Kazdin, A. E., and Sherman, S. J. Graduate training in statistics, methodology, and measurement in psychology: a survey of PhD programs in North America. American Psychologist. 1990 Jun; 45(6):721-734.
Abstract: A survey of all PhD programs in psychology in the United States and Canada assessed the extent to which advances in statistics, measurement, and methodology have been incorporated into doctoral training. In all, 84% of the 222 departments responded. The statistical and methodological curriculum has advanced little in 20 years; measurement has experienced a substantial decline. Typical first-year courses serve well only those students who undertake traditional laboratory research. Training in top-ranked schools differs little from that in other schools. New PhDs are judged to be competent to handle traditional techniques, but not newer and often more useful procedures, in their own research. Proposed remedies for these deficiencies include revamping the basic required quantitative and methodological curriculum, culling available training opportunities across campus, and training students in more informal settings, along with providing retraining opportunities for faculty. These strategies also require psychology to attend carefully to the human capital needs that support high-quality quantitative and methodological training and practice.

Anderson, David R.; Burnham, Kenneth P., and Thompson, William L. Null hypothesis testing: problems, prevalence, and an alternative. Journal of Wildlife Management. 2000; 64(4):912-923.

Armstrong, J. Scott. Significance tests harm progress in forecasting. International Journal of Forecasting. 2007 Apr-Jun; 23(2):321-327.
Abstract: I briefly summarize prior research showing that tests of statistical significance are improperly used even in leading scholarly journals. Attempts to educate researchers to avoid pitfalls have had little success. Even when done properly, however, statistical significance tests are of no value. Other researchers have discussed reasons for these failures. I was unable to find empirical evidence to support the use of significance tests under any conditions. I then show that tests of statistical significance are harmful to the development of scientific knowledge because they distract the researcher from the use of proper methods. I illustrate the dangers of significance tests by examining a re-analysis of the M3-Competition. Although the authors of the re-analysis conducted a proper series of statistical tests, they suggested that the original M3-Competition was not justified in concluding that combined forecasts reduce errors, and that the selection of the best method is dependent on the selection of a proper error measure. I show that the original conclusions were correct. Authors should avoid tests of statistical significance; instead, they should report on effect sizes, confidence intervals, replications/extensions, and meta-analyses. Practitioners should ignore significance tests and journals should discourage them.

—. Statistical significance tests are unnecessary even when properly done and properly interpreted: Reply to commentaries. International Journal of Forecasting. 2007 Apr-Jun; 23(2):335-336.
Abstract: The three commentators on my paper agree that statistical tests are often improperly used by researchers, and that even when properly used, readers may misinterpret them. These points have been well established by empirical studies. However, two of the commentators do not agree with my major point that significance tests are unnecessary even when properly used and interpreted.  [He then goes on to refute their arguments.]

Atkinson, Donald R.; Furlong, Michael J., and Wampold, Bruce E. Statistical significance, reviewer evaluations, and the scientific process: is there a (statistically) significant relationship? Journal of Counseling Psychology. 1982; 29(2):189-194.
Abstract: Although the opinion is widespread among psychology researchers that manuscripts reporting statistically nonsignificant findings are unlikely to be published in American Psychological Association journals, little empirical evidence exists to support this contention. Defenders of current publication policy maintain that there is no bias against statistically nonsignificant findings and that published studies are better designed than those rejected for publication. To test for an effect of statistical significance, 101 consulting editors of the Journal of Counseling Psychology and the Journal of Consulting and Clinical Psychology were asked to evaluate 3 versions of a research manuscript, differing only with regard to level of statistical significance. The statistically nonsignificant and approach-significance versions were more than 3 times as likely to be recommended for rejection as the statistically significant version. The research design rating was also found to be related to the level of statistical significance reported.

Badia, Pietro; Haber, Audrey, and Runyon, Richard P. Research problems in psychology. Reading, Mass.: Addison-Wesley; 1970.

Bakan, David. The test of significance in psychological research.  Psychological Bulletin. 1966 Dec; 66(6):423-437.

Bartko, John J. Proving the null hypothesis. American Psychologist. 1991 Oct; 46(12):1089.

Becker, Betsy J. Applying tests of combined significance in meta-analysis. Psychological Bulletin. 1987 Jul; 102(1):164-171.
Abstract: In this article, I examine the inferences that can be based on the meta-analysis summaries known as “tests of combined significance.” First, the effect size, significance value, and one example of a test of combined significance are introduced. Next, the statistical null and alternative hypotheses for combined significance tests are compared with those for analyses based on measures of effect magnitude. The hypotheses tested in effect-size analyses are more specific than the hypothesis tested in combined significance tests. Three previously analyzed sets of effect sizes are transformed into significance values and reanalyzed by using one of the most highly recommended tests of combined significance. Effect-size analyses appear more informative because the combined significance test gives identical results for three very different patterns of study outcomes. An assessment of the usefulness of combined significance methods concludes the article.

Berkson, Joseph. Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association. 1938; 33:526-542.

—. Tests of significance considered as evidence. Journal of the American Statistical Association. 1942; 37:325-335.

Blalock, Hubert M., Jr. Social Statistics. New York: McGraw-Hill; 1960.
Abstract: “…In other words, the smaller the risk of a type I error, the greater the probability of a type II error…. It is thus impossible to minimize the risks of both types of errors simultaneously unless one redesigns his study and selects additional cases or a different statistical test….” (p. 160, emphasis original)  “If there is no practical decision to be made other than whether or not to publish the results of a study, another rule of thumb should be followed. The researcher should lean over backwards to prove himself wrong or to obtain results that he actually does not want to obtain.” (1972 ed., p. 161, emphasis original)
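
As a minimal numerical sketch of the tradeoff Blalock describes (my illustration, not from the book; the one-sided z-test, effect size, and sample size are arbitrary assumptions), tightening alpha necessarily raises beta when the design and sample are held fixed:

```python
# Illustration of Blalock's point (assumed values, not from the book):
# for a fixed sample size and true effect, lowering alpha raises beta.
from scipy.stats import norm

d, n = 0.3, 50                              # assumed true effect size and sample size
se = 1 / n ** 0.5                           # standard error of the mean (sigma = 1)

for alpha in (0.05, 0.01):
    z_crit = norm.ppf(1 - alpha)            # one-sided critical value
    power = 1 - norm.cdf(z_crit - d / se)   # P(reject H0 | true effect = d)
    print(f"alpha = {alpha:.2f} -> beta = {1 - power:.2f}")
```

With these assumed values, beta rises from roughly .32 at alpha = .05 to roughly .58 at alpha = .01, which is exactly the tradeoff Blalock warns cannot be escaped without redesigning the study.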

Blampied, Neville M. Single-case research designs: a neglected alternative. American Psychologist. 2000 Aug; 55(8):960.

Bolles, Robert and Messick, S. Statistical utility in experimental inference. Psychological Reports. 1958; 4:223-227.

Bowditch, James L. and Buono, Anthony F. A primer on organizational behavior. second ed. New York: Wiley; 1990.
Abstract: An example of a text incorrectly explaining statistical significance. In the discussion of statistical significance (p. 337): “What this means, in shorthand, is that there is a probability of 5% (p < .05), 1% (p < .01), or 1/10 of 1% (p < .001) that the results could have occurred by ‘chance.’” “Chance” is defined in the preceding paragraph as the influence of uncontrolled variables on the findings of the study. (They cite Runyon & Haber for their interpretation of statistical significance.) But then in the next paragraph, they cite Hays (1963) warning about significant findings lacking “substantive significance,” which is not what Hays really said!

Campbell, John P. Editorial: some remarks from the outgoing editor. Journal of Applied Psychology. 1982; 67:691-700.

Campbell, John P. In: Cummings, Larry L. and Frost, Peter J., Editors. Publishing in the organizational sciences. Homewood, Ill.: Irwin; 1985.

Carver, Ronald P. The case against statistical significance testing. Harvard Educational Review. 1978:378-399.

—. The case against statistical significance testing, revisited. Journal of Experimental Education. 1993; 61 (4):287-292.
Abstract: At present, too many research results in education are blatantly described as significant, when they are in fact trivially small and unimportant. There are several things researchers can do to minimize the importance of statistical significance testing and get articles published without using these tests. First, they can insert statistically in front of significant in research reports. Second, results can be interpreted before p values are reported. Third, effect sizes can be reported along with measures of sampling error. Fourth, replication can be built into the design. The touting of insignificant results as significant because they are statistically significant is not likely to change until researchers break the stranglehold that statistical significance testing has on journal editors.

Cohen, Jacob. The earth is round (p < .05). American Psychologist. 1994 Dec; 49(12):997-1003.
Abstract:  After 4 decades of severe criticism, the ritual of null hypothesis significance testing – mechanical dichotomous decisions around a sacred .05 criterion – still persists. This article reviews the problems with this practice, including its near-universal misinterpretation of p as the probability that Ho is false, the misinterpretation that its complement is the probability of successful replication, and the mistaken assumption that if one rejects Ho one thereby affirms the theory that led to the test. Exploratory data analysis and the use of graphic methods, a steady improvement in and a movement toward standardization in measurement, an emphasis on estimating effect sizes using confidence intervals, and the informed use of available statistical methods is suggested. For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication.
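
The central misreading Cohen attacks, taking p for the probability that H0 is false, can be stated compactly. The restatement below is my addition, not part of the abstract; the symbols are generic.

```latex
% p is computed with H0 assumed true:
p = \Pr(\text{data at least this extreme} \mid H_0)
% The quantity researchers usually want runs the other way and needs a prior:
\Pr(H_0 \mid \text{data}) =
  \frac{\Pr(\text{data} \mid H_0)\,\Pr(H_0)}
       {\Pr(\text{data} \mid H_0)\,\Pr(H_0) + \Pr(\text{data} \mid H_1)\,\Pr(H_1)}
% The two conditionals can differ greatly, so a small p does not by itself
% make H0 improbable.
```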

—. The statistical power of abnormal-social psychological research: a review. Journal of Abnormal and Social Psychology. 1962 Mar; 65(3):145-153.

—. Things I have learned (so far). American Psychologist. 1990 Dec; 45(12):1304-1312.
Abstract: This is an account of what I have learned (so far) about the application of statistics in psychology and the other sociobiomedical sciences. It includes the principles “less is more” (fewer variables, more highly targeted issues, sharp rounding off), “simple is better” (graphic representation, unit weighting for linear composites), and “some things you learn aren’t so.” I have learned to avoid the many misconceptions that surround Fisherian null hypothesis testing. I have also learned the importance of power analysis and the determination of just how big (rather than how statistically significant) are the effects that we study. Finally, I have learned that there is no royal road to statistical induction, that the informed judgment of the investigator is the crucial element in the interpretation of data, and that things take time.

Combs, James G. Big samples and small effects: let’s not trade relevance and rigor for power. Academy of Management Journal . 2010 Feb; 53(1):9-13.
Abstract: The author discusses trends in samples and effects in management research methodology. He suggests that researchers are gathering larger samples but revealing smaller effects through analysis and discusses how he explored this phenomenon by studying correlations from quantitative research published in the journal. He comments that surveys revealed large effects while research utilizing secondary data exhibited smaller effects. He uses statistical theory to suggest that large samples eliminate random sampling errors that increase effect sizes.

Conner, Bradley T. When is the difference significant? Estimates of meaningfulness in clinical research. Clinical Psychology: Science and Practice . 2010 Mar; 17(1):52-57.
Abstract: Shearer-Underhill and Marker (2010) provide a review of effect sizes and clinical significance. They remind researchers that statistical significance and effect size estimates do not provide information that is easy for our clients to understand or use when choosing among treatment alternatives. They assert that we should be reporting more intuitive statistics, such as the number needed to treat, to convey information about clinical significance. They also present data indicating that, as a field, we are not following best practices in reporting statistical, practical, and clinical significance. The present commentary extends the discussion by providing further background information on practical and clinical significance and by making specific recommendations on how to improve our reporting of these statistics.

Dunnette, Marvin D. Fads, fashions, and folderol in psychology. American Psychologist. 1966; 21:343-352.

Edwards, Jeffrey R. To prosper, organizational psychology should …overcome methodological barriers to progress. Journal of Organizational Behavior. 2008; 29(4):469-491.
Abstract: Progress in organizational psychology (OP) research depends on the rigor and quality of the methods we use. This paper identifies ten methodological barriers to progress and offers suggestions for overcoming the barriers, in part or whole. The barriers address how we derive hypotheses from theories, the nature and scope of the questions we pursue in our studies, the ways we address causality, the manner in which we draw samples and measure constructs, and how we conduct statistical tests and draw inferences from our research. The paper concludes with recommendations for integrating research methods into our ongoing development goals as scholars and framing methods as tools that help us achieve shared objectives in our field.

Edwards, Jeffrey R. and  Berry, James W. The presence of something or the absence of nothing: Increasing theoretical precision in management research. Organizational Research Methods. 2010; 13( 4):668-689.
Abstract: In management research, theory testing confronts a paradox described by Meehl in which designing studies with greater methodological rigor puts theories at less risk of falsification. This paradox exists because most management theories make predictions that are merely directional, such as stating that two variables will be positively or negatively related. As methodological rigor increases, the probability that an estimated effect will differ from zero likewise increases, and the likelihood of finding support for a directional prediction boils down to a coin toss. This paradox can be resolved by developing theories with greater precision, such that their propositions predict something more meaningful than deviations from zero. This article evaluates the precision of theories in management research, offers guidelines for making theories more precise, and discusses ways to overcome barriers to the pursuit of theoretical precision.

Falk, Ruma and Greenbaum, Charles W. The amazing persistence of a probabilistic misconception. Theory & Psychology. 1995 Feb; 5(1):75-98.
Abstract:  We present a critique showing the flawed logical structure of statistical significance tests. We then attempt to analyze why, in spite of this faulty reasoning, the use of significance tests persists. We identify the illusion of probabilistic proof by contradiction as a central stumbling block, because it is based on a misleading generalization of reasoning from logic to inference under uncertainty. We present new data from a student sample and examples from the psychological literature showing the strength and prevalence of this illusion.  We identify some intrinsic cognitive mechanisms (similarity to modus tollens reasoning; verbal ambiguity in describing the meaning of significance tests; and the need to rule out chance findings) and extrinsic social pressures which help to maintain the illusion. We conclude by mentioning some alternative methods for presenting and analyzing psychological data, none of which can be considered the ultimate method.

Faulkner, Cathy; Fidler, Fiona, and Cumming, Geoff. The value of RCT evidence depends on the quality of statistical analysis. Behaviour Research and Therapy. 2008; 46:270-281.
Abstract: The authors examined statistical practices in 193 randomized controlled trials (RCTs) of psychological therapies published in prominent psychology and psychiatry journals during 1999–2003. Statistical significance tests were used in 99% of RCTs, 84% discussed clinical significance, but only 46% considered—even minimally—statistical power, 31% interpreted effect size and only 2% interpreted confidence intervals. In a second study, 42 respondents to an email survey of the authors of RCTs analyzed in the first study indicated they consider it very important to know the magnitude and clinical importance of the effect, in addition to whether a treatment effect exists. The present authors conclude that published RCTs focus on statistical significance tests (‘‘Is there an effect or difference?’’), and neglect other important questions: ‘‘How large is the effect?’’ and ‘‘Is the effect clinically important?’’ They advocate improved statistical reporting of RCTs especially by reporting and interpreting clinical significance, effect sizes and confidence intervals.

Ferguson, Christopher J. An effect size primer: a guide for clinicians and researchers. Professional Psychology: Research and Practice. 2009; 40( 5):532-539.
Abstract: Increasing emphasis has been placed on the use of effect size reporting in the analysis of social science data. Nonetheless, the use of effect size reporting remains inconsistent, and interpretation of effect size estimates continues to be confused. Researchers are presented with numerous effect size estimates, not all of which are appropriate for every research question. Clinicians also may have little guidance in the interpretation of effect sizes relevant for clinical practice. The current article provides a primer of effect size estimates for the social sciences. Common effect size estimates, their use, and interpretations are presented as a guide for researchers.

Fidler, F.; Thomason, N.; Cumming, G.; Finch, S., and Leeman, J. Editors can lead researchers to confidence intervals, but can’t make them think. Psychological Science. 2004; 15:119-126.

Finch, Sue; Cumming, Geoff, and Thomason, Neil. Colloquium on Effect Sizes: the Roles of Editors, Textbook Authors, and the Publication Manual: Reporting of Statistical Inference in the Journal of Applied Psychology: Little Evidence of Reform. Educational and Psychological Measurement. 2001; 61:181-205.
Abstract: Reformers have long argued that misuse of Null Hypothesis Significance Testing (NHST) is widespread and damaging. The authors analyzed 150 articles from the Journal of Applied Psychology (JAP) covering 1940 to 1999. They examined statistical reporting practices related to misconceptions about NHST, American Psychological Association (APA) guidelines, and reform recommendations. The analysis reveals (a) inconsistency in reporting alpha and p values, (b) the use of ambiguous language in describing NHST, (c) frequent acceptance of null hypotheses without consideration of power, (d) that power estimates are rarely reported, and (e) that confidence intervals were virtually never used. APA guidelines have been followed only selectively. Research methodology reported in JAP has increased greatly in sophistication over 60 years, but inference practices have shown remarkable stability. There is little sign that decades of cogent critiques by reformers had by 1999 led to changes in statistical reporting practices in JAP.

Fisher, Ronald A. Statistical methods for research workers. 4th ed. Oxford, England: Oliver & Boyd; 1932.

Frick, Robert W. The appropriate use of null hypothesis testing. Psychological Methods. 1996 Dec; 1(4):379-390.
Abstract: The many criticisms of null hypothesis testing suggest when it is not useful and what it should not be used for. This article explores when and why its use is appropriate. Null hypothesis testing is insufficient when size of effect is important, but it is ideal for testing ordinal claims relating the order of conditions, which are common in psychology. Null hypothesis testing also is insufficient for determining beliefs, but it is ideal for demonstrating sufficient evidential strength to support an ordinal claim, with sufficient evidence being 1 criterion for a finding entering the corpus of legitimate findings in psychology. The line between sufficient and insufficient evidence is currently set at p <.05; there is little reason for allowing experimenters to select their own value of alpha. Thus null hypothesis testing is an optimal method for demonstrating sufficient evidence for an ordinal claim.

—. A problem with confidence intervals. American Psychologist. 1995 Dec; 50(12):1102-1103.
Abstract: Comments on J. Cohen’s (see record 1995-12080-001) suggestion that experimenters should calculate confidence intervals. Two different ways of interpreting a confidence interval are discussed. The author asks how Cohen can criticize the logic of null hypothesis significance testing and then recommend reporting a statistic that relies on this logic.

Friedrich, James. The road to reform: of editors and educators. American Psychologist. 2000 Aug; 55(8):961-962.

Gill, Jeff. The insignificance of null hypothesis significance testing. Political Research Quarterly. 1999; 52(3):647-674.
Abstract: The current method of hypothesis testing in the social sciences is under intense criticism, yet most political scientists are unaware of the important issues being raised. Criticisms focus on the construction and interpretation of a procedure that has dominated the reporting of empirical results for over fifty years. There is evidence that null hypothesis significance testing as practiced in political science is deeply flawed and widely misunderstood. This is important since most empirical work argues the value of findings through the use of the null hypothesis significance test. In this article I review the history of the null hypothesis significance testing paradigm in the social sciences and discuss major problems, some of which are logical inconsistencies while others are more interpretive in nature. I suggest alternative techniques to convey effectively the importance of data-analytic findings. These recommendations are illustrated with examples using empirical political science publications.

Goodman, Steven N. Toward evidence-based medical statistics, 1: the p value fallacy. Annals of Internal Medicine. 1999; 130(12):995-1004.
Abstract: An important problem exists in the interpretation of modern medical research data:  Biological understanding and previous research play little formal role in the interpretation of quantitative results. This phenomenon is manifest in the discussion sections of research articles and ultimately can affect the reliability of conclusions. The standard statistical approach has created this situation by promoting the illusion that conclusions can be produced with certain “error rates,” without consideration of information from outside the experiment. This statistical approach, the key components of which are P values and hypothesis tests, is widely perceived as a mathematically coherent  approach to inference. There is little appreciation in the medical community that the methodology is an amalgam of incompatible elements, whose utility for scientific inference has been the subject of intense debate among statisticians for almost 70 years. This article introduces some of the key elements of that debate and traces the appeal and adverse impact of this methodology to the P value fallacy, the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result. This argument is made as a prelude to the suggestion that another measure of evidence should be used—the Bayes factor, which  properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings.

—. Toward evidence-based medical statistics, 2: the Bayes factor. Annals of Internal Medicine. 1999; 130:1005-1013.
Abstract: Bayesian inference is usually presented as a method for determining how scientific belief should be modified by data. Although Bayesian methodology has been one of the most active areas of statistical development in the past 20 years, medical researchers have been reluctant to embrace what they perceive as a subjective approach to data analysis. It is little understood that Bayesian methods have a data-based core, which can be used as a calculus of evidence. This core is the Bayes factor, which in its simplest form is also called a likelihood ratio. The minimum Bayes factor is objective and can be used in lieu of the P value as a measure of the evidential strength. Unlike P values, Bayes factors have a sound theoretical foundation and an interpretation that allows their use in both inference and decision making. Bayes factors show that P values greatly overstate the evidence against the null hypothesis. Most important, Bayes factors require the addition of background knowledge to be transformed into inferences—probabilities that a given conclusion is right or wrong. They make the distinction clear between experimental evidence and inferential  conclusions while providing a framework in which to combine prior with current evidence.
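
A concrete piece of Goodman's argument is the minimum Bayes factor for a normally distributed test statistic. The worked example below is my addition; the z value corresponding to two-sided p = .05 is the only input.

```latex
% Minimum Bayes factor (null versus the best-supported alternative) for a z statistic:
BF_{\min} = e^{-z^{2}/2}
% Two-sided p = .05 corresponds to z \approx 1.96, so
BF_{\min} \approx e^{-(1.96)^{2}/2} \approx 0.15
% The data are then at most about 1/0.15 \approx 7 times more likely under the
% alternative than under the null: much weaker evidence than "1 chance in 20" suggests.
```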

Goodman, Steven N. and Royall, Richard. Evidence and scientific research. American Journal of Public Health. 1988 Dec; 78(12):1568-1574.

Gorsuch, Richard L. Things learned from another perspective (so far). American Psychologist. 1991 Oct; 46(12):1089.

Grant, David A. Testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychological Review. 1962; pp. 54-61. [Reprinted in Badia, Haber, and Runyon, 1970.]

Greenwald, Anthony G. Consequences of prejudice against the null hypothesis. Psychological Bulletin. 1975; 82:1-20.

Hagen, Richard L. In praise of the null hypothesis significance test. American Psychologist. 1997 Jan; 52(1):15-24.
Abstract: Jacob Cohen (see record 1995-12080-001) raised a number of questions about the logic and information value of the null hypothesis statistical test (NHST). Specifically, he suggested that: (1) The NHST does not tell us what we want to know; (2) the null hypothesis is always false; and (3) the NHST lacks logical integrity. It is the author’s view that although there may be good reasons to give up the NHST, these particular points made by Cohen are not among those reasons. When addressing these points, the author also attempts to demonstrate the elegance and usefulness of the NHST.

Haig, Brian D. Explaining the use of statistical methods. American Psychologist. 2000 Aug; 55(8):962-963.

Harlow, Lisa Lavoie; Mulaik, Stanley A., and Steiger, James H. What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum; 1997.
Abstract: This book is the result of a spirited debate stimulated by a recent meeting of the Society of Multivariate Experimental Psychology. Although the viewpoints span a range of perspectives, the overriding theme that emerges states that significance testing may still be useful if supplemented with some or all of the following — Bayesian logic, caution, confidence intervals, effect sizes and power, other goodness of approximation measures, replication and meta-analysis, sound reasoning, and theory appraisal and corroboration. The book is organized into five general areas. The first presents an overview of significance testing issues that synthesizes the highlights of the remainder of the book. The next discusses the debate in which significance testing should be rejected or retained. The third outlines various methods that may supplement current significance testing procedures. The fourth discusses Bayesian approaches and methods and the use of confidence intervals versus significance tests. The last presents the philosophy of science perspectives. Rather than providing definitive prescriptions, the chapters are largely suggestive of general issues, concerns, and application guidelines. The editors allow readers to choose the best way to conduct hypothesis testing in their respective fields. For anyone doing research in the social sciences, this book is bound to become “must” reading.

Harris, Richard J. Significance tests have their place. Psychological Science. 1997; 8(1):8-11.
Abstract: Null-hypothesis significance tests (NHST), properly used, tell us whether we have sufficient evidence to be confident of the sign of the population effect—but only if we abandon two-valued logic in favor of Kaiser’s (1960) three-alternative hypothesis tests.  Confidence intervals provide a useful addition to NHSTs, and can be used to provide the same sign-determination function as NHST.  However, when so used, confidence intervals are subject to exactly the same Type I, II, and III error rates as NHST.  In addition, NHSTs provide two pieces of information about our data—maximum probability of a Type III error and probability of a successful exact replication—that confidence intervals do not.  The proposed alternative to NHST is just as susceptible to misinterpretation as is NHST.  The problem of bias due to censoring of data collection or publication can be handled by providing archives for all methodologically sound data sets, but reserving interpretations and conclusions for statistically significant results.

Hays, William L. Statistics. New York: Holt, Rinehart and Winston; 1963.
Abstract: “This points up the fallacy of evaluating the ‘goodness’ of a result in terms of statistical significance alone, without allowing for the sample size used. All significant results do not imply the same degree of true association between independent and dependent variables.
It is sad but true that researchers have been known to capitalize on this fact. There is a certain amount of ‘testmanship’ involved in using inferential statistics. Virtually any study can be made to show significant results if one uses enough subjects, regardless of how nonsensical the content may be. There is surely nothing on earth that is completely independent of anything else….” (p. 326; emphasis original)
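
Hays’s “testmanship” point is easy to demonstrate numerically. The sketch below is my illustration, not from the book; the correlation of .02 and the sample sizes are arbitrary assumptions.

```python
# Illustration of Hays's "testmanship" point (assumed values, not from the book):
# a fixed, trivially small correlation becomes "significant" once N is large enough.
from scipy import stats

r = 0.02                                             # assumed observed correlation
for n in (100, 10_000, 1_000_000):
    t = r * (n - 2) ** 0.5 / (1 - r ** 2) ** 0.5     # t statistic for a correlation
    p = 2 * stats.t.sf(t, df=n - 2)                  # two-sided p value
    print(f"n = {n:>9,}   p = {p:.3g}")
```

With these assumptions, r = .02 gives p of roughly .84 at n = 100, crosses p < .05 near n = 10,000, and is vanishingly small at n = 1,000,000, even though the correlation never explains more than 0.04% of the variance.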

Hedges, Larry V. How hard is hard science, how soft is soft science?  The empirical cumulativeness of research. American Psychologist. 1987 May; 42(2):443-455.

Hubbard, Raymond. Alphabet soup: Blurring the distinctions between p’s and α’s in psychological research. Theory & Psychology. 2004; 14(3):295-327.
Abstract: Confusion over the reporting and interpretation of results of commonly employed classical statistical tests is recorded in a sample of 1,645 papers from 12 psychology journals for the period 1990 through 2002. The confusion arises because researchers mistakenly believe that their interpretation is guided by a single unified theory of statistical inference. But this is not so: classical statistical testing is a nameless amalgamation of the rival and often contradictory approaches developed by Ronald Fisher, on the one hand, and Jerzy Neyman and Egon Pearson, on the other. In particular, there is extensive failure to acknowledge the incompatibility of Fisher’s evidential p value with the Type I error rate, α, of Neyman–Pearson statistical orthodoxy. The distinction between evidence (p’s) and errors (α’s) is not trivial. Rather, it reveals the basic differences underlying Fisher’s ideas on significance testing and inductive inference, and Neyman–Pearson views on hypothesis testing and inductive behavior. So complete is this misunderstanding over measures of evidence versus error that it is not viewed as even being a problem among the vast majority of researchers and other relevant parties. These include the APA Task Force on Statistical Inference, and those writing the guidelines concerning statistical testing mandated in APA Publication Manuals. The result is that, despite supplanting Fisher’s significance-testing paradigm some fifty years or so ago, recognizable applications of Neyman–Pearson theory are few and far between in psychology’s empirical literature. On the other hand, Fisher’s influence is ubiquitous.

Hubbard, Raymond and Armstrong, J. Scott. Why we don’t really know what “statistical significance” means: a major educational failure. Journal of Marketing Education. 2006 Aug; 28(2):114-120.
Abstract: The Neyman–Pearson theory of hypothesis testing, with the Type I error rate, alpha, as the significance level, is widely regarded as statistical testing orthodoxy. Fisher’s model of significance testing, where the evidential p value denotes the level of significance, nevertheless dominates statistical testing practice. This paradox has occurred because these two incompatible theories of classical statistical testing have been anonymously mixed together, creating the false impression of a single, coherent model of statistical inference. We show that this hybrid approach to testing, with its misleading p < alpha statistical significance criterion, is common in marketing research textbooks, as well as in a large random sample of papers from twelve marketing journals. That is, researchers attempt the impossible by simultaneously interpreting the p value as a Type I error rate and as a measure of evidence against the null hypothesis. The upshot is that many investigators do not know what our most cherished, and ubiquitous, research desideratum—“statistical significance”—really means. This, in turn, signals an educational failure of the first order. We suggest that tests of  statistical significance, whether p’s or alpha’s, be downplayed in statistics and marketing research courses. Classroom instruction should focus instead on teaching students to emphasize the use of confidence intervals around point estimates in individual studies, and the criterion of overlapping confidence intervals when one has estimates from similar studies.

Hubbard, Raymond and Bayarri, M. J. Confusion over measures of evidence (p’s) versus errors (α’s) in classical statistical testing (with comments). The American Statistician. 2003 Aug; 57:171-182.
Abstract: Confusion surrounding the reporting and interpretation of results of classical statistical tests is widespread among applied researchers. The confusion stems from the fact that most of these researchers are unaware of the historical development of classical statistical testing methods, and the mathematical and philosophical principles underlying them. Moreover, researchers erroneously believe that the interpretation of such tests is prescribed by a single coherent theory of statistical inference. This is not the case: Classical statistical testing is an anonymous hybrid of the competing and frequently contradictory approaches formulated by R.A. Fisher on the one hand, and Jerzy Neyman and Egon Pearson on the other. In particular, there is a widespread failure to appreciate the incompatibility of Fisher’s evidential p value with the Type I error rate, α, of Neyman–Pearson statistical orthodoxy. The distinction between evidence (p’s) and error (α’s) is not trivial. Instead, it reflects the fundamental differences between Fisher’s ideas on significance testing and inductive inference, and Neyman–Pearson views of hypothesis testing and inductive behavior. Unfortunately, statistics textbooks tend to inadvertently cobble together elements from both of these schools of thought, thereby perpetuating the confusion. So complete is this misunderstanding over measures of evidence versus error that it is not viewed as even being a problem among the vast majority of researchers. The upshot is that despite supplanting Fisher’s significance testing paradigm some fifty years or so ago, recognizable applications of Neyman–Pearson theory are few and far between in empirical work. In contrast, Fisher’s influence remains pervasive. Professional statisticians must adopt a leading role in lowering confusion levels by encouraging textbook authors to explicitly address the differences between Fisherian and Neyman–Pearson statistical testing frameworks.

Also: “The manner in which the results of statistical tests are reported in marketing journals is used as an empirical barometer for practices in other applied disciplines. We doubt whether the findings reported here would differ substantially from those in other fields.
More specifically, two randomly selected issues of each of three leading marketing journals—the Journal of Consumer Research, Journal of Marketing, and Journal of Marketing Research—were analyzed for the eleven-year period 1990 through 2000 in order to assess the number of empirical articles and notes published therein. This procedure yielded a sample of 478 empirical papers. These papers were then examined to see whether classical statistical tests had been used in the data analysis. Some 435, or 91.0%, employed such testing.
Although the evidential p value from a significance test violates the orthodox Neyman–Pearson behavioral hypothesis testing schema, Table 1 shows that p values are commonplace in marketing’s empirical literature. Conversely, α levels are in short supply.” Sec. 1.4, Confusion of p’s and α’s in marketing journals, p. 16.

Hubbard, Raymond and Lindsay, R. Murray. Why p values are not a useful measure of evidence in statistical significance testing. Theory & Psychology. 2008; 18(1):69-88.
Abstract: Reporting p values from statistical significance tests is common in psychology’s empirical literature. Sir Ronald Fisher saw the p value as playing a useful role in knowledge development by acting as an ‘objective’ measure of inductive evidence against the null hypothesis. We review several reasons why the p value is an unobjective and inadequate measure of evidence when statistically testing hypotheses. A common theme throughout many of these reasons is that p values exaggerate the evidence against H0. This, in turn, calls into question the validity of much published work based on comparatively small, including .05, p values. Indeed, if researchers were fully informed about the limitations of the p value as a measure of evidence, this inferential index could not possibly enjoy its ongoing ubiquity. Replication with extension research focusing on sample statistics, effect sizes, and their confidence intervals is a better vehicle for reliable knowledge development than using p values. Fisher would also have agreed with the need for replication research.

Hubbard, Raymond; Parsa, Rahul A., and Luthy, Michael R. The spread of statistical significance testing in psychology: the case of the Journal of Applied Psychology. Theory & Psychology. 1997 Aug; 7(4):545-554.
Abstract: Because the widespread use of statistical significance testing has deleterious consequences for the development of a cumulative knowledge base, the American Psychological Association’s Board of Scientific Affairs is in the process of appointing a Task Force whose charge includes the possibility of phasing out such testing in textbooks and journal articles. Just how popular is significance testing in psychology? This issue is examined in the present historical study, which uses data from randomly selected issues of the Journal of Applied Psychology for the period 1917-94. Results indicate that the practice of significance testing, at one time of restricted usage, has expanded to the point that it is virtually synonymous with empirical analysis. The data also lend support to Gigerenzer and Murray’s (1987) allegation that an inference revolution occurred in psychology during the period 1940-55. Unfortunately, it is concluded that the ubiquity of significance testing constitutes a classic example of the overadoption of a methodology.

Hubbard, Raymond and Ryan, Patricia A. The historical growth of statistical significance testing in psychology—and its future prospects. Educational and Psychological Measurement. 2000; 60:661-681.
Abstract: The historical growth in the popularity of statistical significance testing is examined using a random sample of annual data from 12 American Psychological Association (APA) journals. The results replicate and extend the findings of Hubbard, Parsa, and Luthy, who used data from only the Journal of Applied Psychology. The results also confirm Gigerenzer and Murray’s allegation that an inference revolution occurred in psychology between 1940 and 1955. An assessment of the future prospects for statistical significance testing is offered. It is concluded that replication with extension research, and its connections with meta-analysis, is a better vehicle for developing a cumulative knowledge base in the discipline than statistical significance testing. It is conceded, however, that statistical significance testing is likely here to stay.

Huberty, Carl J. Historical origins of statistical testing practices: the treatment of Fisher versus Neyman-Pearson views in textbooks. Journal of Experimental Education, Statistical Significance Testing in Contemporary Practice. 1993 Summer; 61(4):317-333.
Abstract: Textbook discussion of statistical testing is the topic of interest. Some 28 books published from 1910 to 1949, 19 books published from 1990 to 1992, plus five multiple-edition books were reviewed in terms of presentations of statistical testing. It was of interest to discover textbook coverage of the P-value (i.e., Fisher) and fixed-alpha (i.e., Neyman-Pearson) approaches to statistical testing. Also of interest in the review were some issues and concerns related to the practice and teaching of statistical testing: (a) levels of significance, (b) importance of effects, (c) statistical power and sample size, and (d) multiple testing. It is concluded that it is not statistical testing itself that is at fault; rather, some of the textbook presentation, teaching practices, and journal editorial reviewing may be questioned.

Hunter, John E. Needed: A ban on the significance test. Psychological Science. 1997; 8(1):3-7.
Abstract: The significance test as currently used is a disaster.  Whereas most researchers falsely believe that the significance test has an error rate of 5%, empirical studies show the average error rate across psychology is 60%—12 times higher than researchers think it to be.  The error rate for inference using the significance test is greater than the error rate using a coin toss to replace the empirical study. The significance test has devastated the research review process. Comprehensive reviews cite conflicting results on almost every issue. Yet quantitatively accurate review of the same results shows that the apparent conflicts stem almost entirely from the high error rate for the significance test. If 60% of studies falsely interpret their primary results, then reviewers who base their reviews on the interpreted study “findings” will have a 100% error rate in concluding that there is conflict between study results.
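
The arithmetic behind a figure like 60% is worth making explicit (my gloss on the argument; the round power value of .40, typical of the power surveys Hunter draws on, is an assumption of the illustration):

```latex
% If the null is false (Hunter argues this is the typical case) and average power
% is about .40, the error rate among such studies is the Type II rate:
\beta = 1 - \text{power} \approx 1 - 0.40 = 0.60
% So roughly 60% of significance-test decisions are wrong in those studies,
% not the nominal 5%, which applies only when the null is actually true.
```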

Hunter, John E. and Schmidt, Frank L. Meta-analysis: Correcting error and bias in research findings. Thousand Oaks, Ca.: Sage; 2004; ISBN: 1-4129-0479-X.

Hunter, John E.; Schmidt, Frank L., and Jackson, Gregg B. Meta-analysis: cumulating research findings across studies. Beverly Hills, Ca.: Sage; 1982.

Ioannidis, John P. A. Why most published research findings are false. PLoS Medicine. 2005 Aug 30; 2(8):e124.
Abstract: There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.

Iyengar, Sheena S. and Lepper, Mark R. When choice is demotivating: Can one desire too much of a good thing? Journal of Personality and Social Psychology. 2000; 79(6):995-1006.

Johnson, Douglas H. The insignificance of statistical significance testing. Journal of Wildlife Management. 1999; 63(3):763-772.
Abstract: Despite their wide use in scientific journals such as The Journal of Wildlife Management, statistical hypothesis tests add very little value to the products of research. Indeed, they frequently confuse the interpretation of data. This paper describes how statistical hypothesis tests are often viewed, and then contrasts that interpretation with the correct one. I discuss the arbitrariness of P-values, conclusions that the null hypothesis is true, power analysis, and distinctions between statistical and biological significance. Statistical hypothesis testing, in which the null hypothesis about the properties of a population is almost always known a priori to be false, is contrasted with scientific hypothesis testing, which examines a credible null hypothesis about phenomena in nature. More meaningful alternatives are briefly outlined, including estimation and confidence intervals for determining the importance of factors, decision theory for guiding actions in the face of uncertainty, and Bayesian approaches to hypothesis testing and other statistical practices.

Jones, L. V. Statistics and research design. Annual Review of Psychology. 1955; 6:405-430.
Abstract: Appropriate statistics are point estimates of effect sizes and confidence intervals around these point estimates.

Kalinowski, Pawel and Fidler, Fiona. Interpreting significance: the differences between statistical significance, effect size, and practical importance. Newborn & Infant Nursing Reviews. 2010 Mar; 10(1):51-54.
Abstract: It is a common misconception that statistical significance indicates a large and/or important effect. In fact the three concepts—statistical significance, effect size, and practical importance—are distinct from one another and a favorable result on one dimension does not guarantee the same on any other. In this article, we explain these concepts and distinguish between them. Finally, we propose reporting confidence intervals as a step toward disambiguating these concepts

Kish, Leslie. Some statistical problems in research design. American Sociological Review. 1959; 24:328-338.
Abstract: Appropriate statistics are point estimates of effect sizes and confidence intervals around these point estimates.

Kmetz, John L. Proposals to improve the science of organization. Paper presented to the Research Methods Division, Academy of Management National Meeting, Las Vegas, NV; 1992.

—. Science and the study of management: an opportunity to set the global standard for valid social science research.  Second International Conference on Corporate Governance and Corporate Social Responsibility. State University Higher School of Economics, Moscow, Russia; 2007 Nov 22.

—. The skeptic’s handbook: Consumer guidelines and a critical assessment of  business and management research.  Social Science Research Network; 2002.

—. What “food chain?”  The disregard of academic research in best-selling business books. Article in Second Review. 2011.

Knapp, Thomas R. Comments on the statistical significance testing articles. Research in the Schools. 1998; 5(2):39-41.
Abstract: [He notes a number of screwups in the articles in this special edition made by critical authors.]  This review assumes a middle-of-the-road position regarding the controversy. The author expresses that significance tests have their place, but generally prefers confidence intervals. His remarks concentrate on ten errors of commission or omission that, in his opinion, weaken the arguments. These possible errors include using the jackknife and bootstrap procedures for replicability purposes, omitting key references, misrepresenting the null hypothesis, omitting the weaknesses of confidence intervals, ignoring the difference between a hypothesized effect size and an obtained effect size, erroneously assuming a linear relationship between p and F, claiming Cohen chose power level arbitrarily, referring to the “reliability of a study,” inferring that inferential statistics are primarily for experiments, and recommending “what if” analyses.

Kupfersmid, Joel. Improving what is published: a model in search of an editor. American Psychologist. 1988;  43(8):635-642.

Loftus, Geoffrey R. Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science. 1996; 5:161-171.

Lykken, David T. Statistical significance in psychological research. Psychological Bulletin. 1968; 70(3):151-159. [Reprinted in Badia, Haber, and Runyon, 1970, p. 263.]

McClelland, Gary H. Increasing statistical power without increasing sample size. American Psychologist. 2000 Aug; 55(8):963-964.

McCloskey, Deirdre N. The bankruptcy of statistical significance.  Eastern Economic Journal. 1992 Summer; 18:359-361.

McCloskey, Deirdre N. and Ziliak, Stephen T. The standard error of regressions. Journal of Economic Literature. 1996 Mar; XXXIV:97-114.

McLean, James E. and Ernest, James M. The role of statistical significance testing in educational research.  Research in the Schools. 1998; 5(2):15-22.
Abstract: The research methodology literature in recent years has included a full frontal assault on statistical significance testing. The purpose of this paper is to promote the position that, while significance testing as the sole basis for result interpretation is a fundamentally flawed practice, significance tests can be useful as one of several elements in a comprehensive interpretation of data. Specifically, statistical significance is but one of three criteria that must be demonstrated to establish a position empirically. Statistical significance merely provides evidence that an event did not happen by chance. However, it provides no information about the meaningfulness (practical significance) of an event or if the result is replicable. Thus, we support other researchers who recommend that statistical significance testing must be accompanied by judgments of the event’s practical significance and replicability.

Meehl, Paul E. Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology. 1978; 46(4):806-834.
Abstract: Theories in “soft” areas of psychology (e.g., clinical, counseling, social, personality, school, and community) lack the cumulative character of scientific knowledge because they tend neither to be refuted nor corroborated, but instead merely fade away as people lose interest. Even though intrinsic subject matter difficulties (20 are listed) contribute to this, the excessive reliance on significance testing is partly responsible (Ronald A. Fisher). Karl Popper’s approach, with modifications, would be prophylactic. Since the null hypothesis is quasi-always false, tables summarizing research in terms of patterns of “significant differences” are little more than complex, causally uninterpretable outcomes of statistical power functions. Multiple paths to estimating numerical point values (“consistency tests”) are better, even if approximate with rough tolerances; and lacking this, ranges, orderings, 2nd-order differences, curve peaks and valleys, and function forms should be used. Such methods are usual in developed sciences that seldom report statistical significance. Consistency tests of a conjectural taxometric model yielded 94% success with no false negatives.

—. Theory-testing in psychology and physics: a methodological paradox. Philosophy of Science. 1967; 34(2):103-115.
Abstract: Summary of five “intellectual vices” in behavioral science (pp. 291-292); the technical argument concludes that, using current methods, increased precision in psychological tests yields a probability of about 1/2 that a theory will be supported by NHST, even if the theory is totally without merit.

Meehl, Paul E. What social scientists don’t understand. In: Fiske, D. W. and Shweder, R. A., Editors. Metatheory in social science: pluralisms and subjectivities. Chicago: University of Chicago Press; 1986.

Meehl, Paul E. Why summaries of research on psychological theories are often uninterpretable. Psychological Reports. 1990; 66 (Monograph Supplement 1-Vol. 66):195-244.
Abstract: Null hypothesis testing of correlational predictions from weak substantive theories in soft psychology is subject to the influence of ten obfuscating factors whose effects are usually (1) sizeable, (2) opposed, (3) variable, and (4) unknown. The net epistemic effect of these ten obfuscating influences is that the usual research literature review is well-nigh uninterpretable. Major changes in graduate education, conduct of research, and editorial policy are proposed.

Moonesinghe, Ramal; Khoury, Muin J., and Janssens, A. Cecile J. W. Most published research findings are false—but a little replication goes a long way. PLoS Medicine. 2007 Feb 27; 4(2):e28.
Abstract: None. The authors show that the predictive value of a research finding rises as multiple statistically significant studies accumulate, but that replication remains the best safeguard.

Morrison, Denton E. and Henkel, R. E. The significance test controversy: a reader. Chicago: Aldine; 1970.

Murray, Michael L. On the significance of significance (the significance of statistical information). CPCU Journal. 1993 Jun; 46(2):67-69.
Abstract: Research papers often have statistical tables which are accompanied by asterisks denoting statistical significance at some decimal level or another. Most people believe these asterisks are important since their absence would imply that if another study was done, the same results may not appear. However, statistical tests are often misused and causation is often erroneously equated with correlation.

Neyman, Jerzy and Pearson, Egon S. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society (A). 1933; 231:289-337.
Abstract: The probability of rejecting the null is a function of five factors: (1) the choice of a one- or two-tailed test; (2) the p level; (3) the standard deviation; (4) the amount of deviation from the null (i.e., the effect size); and (5) N. Should there be any deviation from the null, no matter how small, a large enough N will lead to rejection of the null.
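As a quick illustration of factors (4) and (5), here is a minimal sketch (my own, not from the paper; the 0.02-standard-deviation “effect” and the sample sizes are illustrative assumptions) of the approximate power of a two-sided one-sample z-test, showing how an arbitrarily small deviation from the null is eventually declared significant once N is large enough.

```python
# A minimal sketch (not from the paper): with any nonzero deviation from the
# null, a large enough N drives the probability of rejecting H0 toward 1.
# Approximate power of a two-sided one-sample z-test, standard library only.
from statistics import NormalDist

def power_one_sample_z(effect_size, n, alpha=0.05):
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # ~1.96 for alpha = .05
    ncp = effect_size * n ** 0.5           # noncentrality under the alternative
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)

# A "trivial" true effect of 0.02 standard deviations (an illustrative value):
for n in (100, 10_000, 100_000, 1_000_000):
    print(f"N = {n:>9,}: power = {power_one_sample_z(0.02, n):.3f}")
# Power climbs from about .05 toward 1.0 as N grows, so rejection of the null
# by itself says nothing about the size or importance of the deviation.
```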

Nix, Thomas W. and Barnette, J. Jackson. The data analysis dilemma: Ban or abandon.  A review of null hypothesis significance testing. Research in the Schools. 1998; 5(2):3-14.
Abstract: Null Hypothesis Significance Testing (NHST) is reviewed in a historical context. The most vocal criticisms of NHST that have appeared in the literature over the past 50 years are outlined. The authors conclude, based on the criticism of NHST and the alternative methods that have been proposed, that viable alternatives to NHST are currently available. The use of effect magnitude measures with surrounding confidence intervals and indications of the reliability of the study are recommended for individual research studies. Advances in the use of meta-analytic techniques provide us with opportunities to advance cumulative knowledge, and all research should be aimed at this goal. The authors provide discussions and references to more information on effect magnitude measures, replication techniques and meta-analytic techniques. A brief situational assessment of the research landscape and strategies for change are offered.

—. A review of hypothesis testing revisited: Rejoinder to Thompson, Knapp, and Levin. Research in the Schools. 1998; 5(2):55-57.
Abstract: This rejoinder seeks to clarify the authors’ position on NHST, to advocate the routine use of effect size, and to encourage reporting results in simple terms. It is concluded that the time for action, such as that advocated in Nix and Barnette’s original article, is overdue.

Nunnally, Jum C. The place of statistics in psychology. Educational and Psychological Measurement. 1960; 20(4):641-650.

Oakes, Michael W. Statistical inference: a commentary for the social and behavioural sciences. New York: Wiley; 1986.
Abstract: Argues that NHST impedes the cumulative development of knowledge in research.

Pocock Stuart J. and Ware, James H. Reply to Comment on “Translating statistical findings into plain English”. Lancet. 2009 Sep 26; 374:1065-1066.
Comment: See Stang 2009–Stang corrects them

Pocock, Stuart J. and Ware, James H. Translating statistical findings into plain English. Lancet. 2009 Apr 16; 373:1926-1928.
Comment: They get it wrong, are corrected by Stang, then try to “clear it up.”

Romano, Paul E. Editorial: Tchebysheff! Yet another reason you should NOT use “statistical significance = probability <.05”. Binocular Vision and Strabismus Quarterly. 1998; 13(1):15.

—. The insignificance of a probability value of P < 0.05 in the evaluation of medical scientific studies. Journal of Laboratory and Clinical Medicine. 1988; 111:501-503.

—. The “statistical significance = p <.05 trap”. Ophthalmology. 2002 Nov; 109(11):1949-1950.

Rosenthal, Robert. The “file drawer problem” and tolerance for null results. Psychological Bulletin. 1979; 86:638-641.

—. Meta-analytic procedures for social research. revised ed. Newbury Park, Ca.: Sage; 1991.

Rosenthal, Robert and Gaito, J. The interpretation of levels of significance by psychological researchers. Journal of Psychology. 1963; 55(1):33-38.
Abstract: 19 Ss (faculty and graduate students) were asked to indicate degree of belief in research results at different p values, once with an n of 10 and then with n equal to 100. Ss had greater confidence in the p levels when they were associated with the larger sample size. Graduate students tended to place more confidence in the p levels than did faculty Ss. Most of the Ss showed a more precipitous loss of confidence in moving from .05 to .10 than at any other levels of significance.

Rosnow, Ralph L. and Rosenthal, Robert. Focused tests of significance and effect size estimation in counseling psychology. Journal of Counseling Psychology. 1988; 38(3):203-208.

—. Statistical procedures and the justification of knowledge in psychological science. American Psychologist. 1989; 44:1276-1284.

Rozeboom, William W. The fallacy of the null-hypothesis significance test. Psychological Bulletin. 1960; 57(5):416-428.

Runyon, R. P. and Haber, A. Fundamentals of behavioral statistics. Reading, Mass.: Addison-Wesley; 1972.
Abstract: Another example of incorrect explanation of stat significance.  Cited in discussion of stat sig (p. 337, note 3) by Bowditch and Buono: “What this means, in shorthand, is that there is a probability of 5% (p < .05), 1% (p < .01),  or 1/10 of 1% (p < .001) that the  results could have occurred by ‘chance.'”  “Chance” is defined in the preceding paragraph as the influence of uncontrolled variables on the findings of the study.

Schafer, William D. Interpreting statistical significance and nonsignificance. Journal of Experimental Education, Statistical Significance Testing in Contemporary Practice. 1993 Summer; 61(4):383-387.
Abstract: Not an abstract, since this is a commentary, but here is the entire 3rd para:  As Huberty (1993) has shown, opinion is divided on the proper rationale for, let alone the desirability of, significance testing, but a reasonable position is that a test allows evaluation of the quality of data to estimate one or more parameters in a model. Assume a researcher performs a test with the null hypothesis that a parameter is zero. If the null hypothesis is not rejected, then the data can be thought to provide no evidence that the parameter is other than zero. On the other hand, if the null hypothesis is rejected, then the data do provide such evidence. This is a yes-or-no decision about the quality of the data.

Schmidt, Frank L. Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods. 1996 Jan; 1(2):115-129.

—. What do data really mean?  Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist. 1992 Oct; 47(10):1173-1181.

Schmidt, Frank L. and Hunter, John E. Eight common but false objections to the discontinuation of significance testing in the analysis of research data. Harlow, Lisa Lavoie; Mulaik, Stanley A., and Steiger, James H., editors. What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum; 1997; p. 37-.

Sedlmeier, Peter and Gigerenzer, Gerd. Do studies of statistical power have an effect on the power of studies? Psychological Bulletin. 1989 Feb; 105(2):309-316.
Abstract: The long-term impact of studies of statistical power is investigated using J. Cohen’s (1962) pioneering work as an example. We argue that the impact is nil; the power of studies in the same journal that Cohen reviewed (now the Journal of Abnormal Psychology) has not increased over the past 24 years. In 1960 the median power (i.e., the probability that a significant result will be obtained if there is a true effect) was .46 for a medium size effect, whereas in 1984 it was only .37. The decline of power is a result of alpha-adjusted procedures. Low power seems to go unnoticed: only 2 out of 64 experiments mentioned power, and it was never estimated. Nonsignificance was generally interpreted as confirmation of the null hypothesis (if this was the research hypothesis), although the median power was as low as .25 in these cases. We discuss reasons for the ongoing neglect of power.
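To give a feel for what a median power near .46 means, the following rough sketch (my own illustration, not the authors’ calculation; the per-group sample sizes are assumptions) approximates the power of a two-sided, two-group z-test for a “medium” standardized effect of d = 0.5.

```python
# A rough sketch (my illustration, not the authors' calculation): approximate
# power of a two-sided, two-group z-test for a "medium" effect (d = 0.5) at
# several per-group sample sizes chosen for illustration.
from statistics import NormalDist

def power_two_group_z(d, n_per_group, alpha=0.05):
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5     # noncentrality for equal group sizes
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)

for n in (20, 30, 64, 100):
    print(f"n per group = {n:>3}: power = {power_two_group_z(0.5, n):.2f}")
# Around 30 per group the power is close to .5 -- in the neighborhood of the
# medians reported above -- while roughly 64 per group is needed to reach the
# conventional .80 for a medium effect.
```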

Serlin, Ronald C. Confidence intervals and the scientific method: a case for Holm on the range. Journal of Experimental Education, Statistical Significance Testing in Contemporary Practice. 1993 Summer; 61(4):350-360.
Abstract: Based on principles of modern philosophy of science, it can be concluded that it is the magnitude of a population effect that is the essential quantity to examine in determining support or lack of support for a theoretical prediction. To test for theoretical support, the corresponding statistical null hypothesis must be derived from the theoretical prediction, which means that we must specify and test a range null hypothesis. Similarly, confidence intervals based on range null hypotheses are required. Certain of the newer multiple comparison procedures are discussed in terms of their applicability to the problem of generating confidence intervals based on range null hypotheses.

Shaver, James P. What statistical significance testing is, and what it is not. Journal of Experimental Education, Statistical Significance Testing in Contemporary Practice. 1993 Summer; 61(4):293-316.
Abstract: A test of statistical significance addresses the question, How likely is a result, assuming the null hypothesis to be true? Randomness, a central assumption underlying commonly used tests of statistical significance, is rarely attained, and the effects of its absence are rarely acknowledged. Statistical significance does not speak to the probability that the null hypothesis or an alternative hypothesis is true or false, to the probability that a result would be replicated, or to treatment effects, nor is it a valid indicator of the magnitude or the importance of a result. The persistence of statistical significance testing is due to many subtle factors. Journal editors are not to blame, but as publishing gatekeepers they could diminish its dysfunctional use.
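Shaver’s point that p says nothing about replication can be made concrete with a small sketch (my own illustration, not Shaver’s): under the simplifying assumption that the true effect exactly equals the observed one, the chance that an exact, same-size replication again reaches p < .05 follows directly from the observed z value.

```python
# A minimal sketch (my illustration, not Shaver's): p is not the probability
# that a result will replicate.  Assuming the true effect exactly equals the
# observed one, the chance that an exact, same-N replication again reaches
# p < .05 in the same direction follows from the observed z value.
from statistics import NormalDist

def replication_probability(p_observed, alpha=0.05):
    z = NormalDist()
    z_obs = z.inv_cdf(1 - p_observed / 2)  # z implied by the two-sided p
    z_crit = z.inv_cdf(1 - alpha / 2)
    return 1 - z.cdf(z_crit - z_obs)       # opposite-tail probability ignored

for p in (0.05, 0.01, 0.001):
    print(f"observed p = {p:<6}: chance of replicating at p < .05 = "
          f"{replication_probability(p):.2f}")
# Roughly .50, .73, and .91 -- nowhere near 1 - p, which is how p is often
# (mis)read.
```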

Shrout, Patrick E. Should significance tests be banned? Introduction to a special section exploring the pros and cons. Psychological Science. 1997; 8(1):1-2.
Abstract: Significance testing of null hypotheses is the standard epistemological method for advancing scientific knowledge in psychology, even though it has drawbacks and leads to common inferential mistakes. These mistakes include accepting the null hypothesis when it fails to be rejected, automatically interpreting rejected null hypotheses as theoretically meaningful, and failing to consider the likelihood of Type II errors.  Although these mistakes have been discussed repeatedly for decades, there is no evidence that the academic discussion has had an impact.  A group of methodologists is proposing a new approach—simply ban significance tests in psychology journals. The impact of a similar ban in public-health and epidemiology journals is reported.

Sohn, David. Significance testing and the science. American Psychologist. 2000 Aug; 55(8):964-965.
Abstract: A reaction to Wilkinson and the TFSI, the most critical of the responses; argues that the significance test, the reason for the TFSI’s existence, is barely mentioned, leaving scientists profoundly disappointed.  Quotes Loftus (1996, 161): “I have developed a certain angst over the intervening 30-something years–a constant, nagging feeling that our field spends a lot of time spinning wheels without really making much progress.”  Good quote from Sohn regarding TFSI: “What the report amounts to is a vote of confidence for business as usual in the conduct of the science of psychology.” (964).

Stang, Andreas. Comment on “Translating statistical findings into plain English”. Lancet. 2009 Sep 26; 374:1065-1066.
Comment: See Pocock and Ware 2009–they got it wrong, Stang corrects them

Starbuck, William H. Croquet with the Queen of Hearts; Anaheim, CA. Academy of Management meetings; 2008.

—. On behalf of naïveté. Baum, A. C. and Singh, J. V., editors. Evolutionary Dynamics of Organizations. New York: Oxford; 1994; pp. 205-220.

—. The production of knowledge: the challenge of social science research. New York: Oxford; 2006; ISBN: 0-19-928853-3.

—. A trip to view the elephants and the rattlesnakes in the garden of Aston. Pugh, Derek, editor. The Aston programme. Aldershot, England: Ashgate; 1998; III. ISBN: 1 84014 0577.

—. What if the Academy banned null hypothesis significance testing? Honolulu, HI: Academy of Management meetings; 2005.

Sterne, Jonathan A. C. Sifting the evidence—what’s wrong with significance tests? British Medical Journal. 2001 Jan 27; 322:226-231.

Task Force on Statistical Inference. Narrow and shallow. American Psychologist. 2000 Aug; 55(8):965.
Abstract: Reaction to critics in this issue of AP on TFSI report. They are defensive, argue that they need to go slowly and still allow NHST, although they claim to endorse the points raised in the comments published.  No need to ban NHST: “Instead, we believe psychological research can improve regardless of whether researchers use null hypothesis statistical tests, as long as psychologists pay more attention to the fundamental methodological issues that we raised.”  They also cite a replication of Aiken’s (1990) study:  “The results of her ongoing replication study promise to be equally distressing.”

Thompson, Bruce. Foreword. Journal of Experimental Education, Statistical Significance Testing in Contemporary Practice. 1993 Summer; 61(4):285-286.
First paragraphs:
Issues involving statistical significance have probably caused more confusion and controversy than any other aspect of contemporary analytic practice. For example, critics of the quantitative paradigm too often incorrectly assume that one must invoke statistical significance tests to conduct quantitative research. And way too many social scientists still believe that p is the probability that results in a given study will replicate.
Some of the blame for this confusion can be laid on the doorstep of journal editors. Melton provided a classic example of interpreting small p values as implying high probability of result replicability. After 12 years as editor of the Journal of Experimental Psychology, he seemingly boasted that “in editing the Journal there has been a strong reluctance to accept and publish results related to the principal concern of the researcher when those results were significant [only] at the .05 level…. It reflects a belief that it is the responsibility of the investigator in a science to reveal his effect in such a way that no reasonable man would be in a position to discredit the results by saying that they were the product of the way the ball bounces. (Melton, 1962, p. 554)”
Fortunately, considerable progress has been made in the last few years, at least at some journals. For example, one fellow editor I know will not tolerate sloppy writing regarding statistical tests. Whenever authors note in a manuscript that “the results approached statistical significance,” he always immediately writes the authors back with the query, “How do you know your results were not working very hard to avoid being statistically significant?”

—. If statistical significance tests are broken/misused, what practices should supplement or replace them? Theory & Psychology. 1999; 9(2):165-181.
Abstract: Given some consensus that statistical significance tests are broken, misused or at least have somewhat limited utility, the focus of discussion within the field ought to move beyond additional bashing of statistical significance tests, and toward more constructive suggestions for improved practice. Five suggestions for improved practice are recommended; these involve (a) required reporting of effect sizes, (b) reporting of effect sizes in an interpretable manner, (c) explicating the values that bear upon results, (d) providing evidence of result replicability, and (e) reporting confidence intervals. Though the five recommendations can be followed even if statistical significance tests are reported, social science will proceed most rapidly when research becomes the search for replicable effects noteworthy in magnitude in the context of both the inquiry and personal or social values.

—. Journal editorial policies regarding statistical significance tests: Heat is to fire as p is to importance. Educational Psychology Review. 1999; 11:157-169.

—. Statistical significance and effect size reporting: Portrait of a possible future. Research in the Schools. 1998; 5(2):33-38.
Abstract: The present paper comments on the matters raised regarding statistical significance tests by three sets of authors in this issue. These articles are placed within the context of contemporary literature. Next, additional empirical evidence is cited showing that the APA publication manual’s “encouraging” effect size reporting has had no appreciable effect. Editorial policy will be required to effect change, and some model policies are quoted. Science will move forward to the extent that both effect size and replicability evidence of one or more sorts are finally seriously considered within our inquiry.

—. The use of statistical significance tests in research: Bootstrap and other alternatives. Journal of Experimental Education, Statistical Significance Testing in Contemporary Practice. 1993 Summer; 61(4):361-377.
Abstract: Three of the various criticisms of conventional uses of statistical significance testing are elaborated. Three alternatives for augmenting statistical significance tests in interpreting results are then elaborated. These include emphasizing effect sizes, evaluating statistical significance tests in a sample size context, and evaluating result replicability. Ways of estimating result replicability from data in hand include cross-validation, jackknife, and bootstrap logics. The bootstrap is explored in some detail.

Tukey, John W. Conclusions vs. decisions. Technometrics. 1960 Nov; 2(4).

Tversky, Amos and Kahneman, Daniel. The belief in the “law of small numbers”. Psychological Bulletin. 1971; 76:105-110.

Tweney, Ryan D.; Doherty, Michael E., and Mynatt, Clifford R. On scientific thinking. New York: Columbia University Press; 1981.

Tyler, R. W. What is statistical significance? Educational Research Bulletin. 1931; 10(5):115-142.

Vacha-Haase, Tammi. Statistical significance should not be considered one of life’s guarantees: Effect sizes are needed. Educational and Psychological Measurement. 2001; 61:219-224.

Wainer, H. One cheer for null hypothesis significance testing. Psychological Methods. 1999; 4:212-213.

Wang, Lihshing Leigh. Retrospective statistical power: Fallacies and recommendations. Newborn & Infant Nursing Reviews. 2010 Mar; 10(1):55-59.
Abstract: The calculation of statistical power after a study has been concluded is a highly controversial practice in quantitative research. Retrospective power in association with statistical nonsignificance presents special challenge to applied researchers in interpreting their statistical outcomes. The purposes of the present study are to review the current debate on retrospective power analysis, to examine the evidential basis of some myths associated with it, and to recommend some practical guidelines for quantitative researchers. I first briefly explain the theoretical concepts of prospective and retrospective power. I then describe three fallacies in misusing and abusing retrospective power. I conclude by providing some recommendations for improving the statistical practice associated with statistically nonsignificant findings.
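One fallacy frequently discussed in this literature (the sketch below is my own illustration, not necessarily the example Wang uses) is treating “observed” retrospective power as new information: when power is computed by plugging the observed effect size back in, it is, for a z-test, a one-to-one function of the observed p-value and so adds nothing to p.

```python
# A minimal sketch (my illustration): "observed" retrospective power, computed
# by plugging the observed effect back in, is a one-to-one function of the
# observed p-value, so it carries no information beyond p itself.
from statistics import NormalDist

def observed_power(p_observed, alpha=0.05):
    z = NormalDist()
    z_obs = z.inv_cdf(1 - p_observed / 2)  # z implied by the two-sided p
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Treat the observed z as the true noncentrality (the questionable step).
    return (1 - z.cdf(z_crit - z_obs)) + z.cdf(-z_crit - z_obs)

for p in (0.30, 0.10, 0.05, 0.01):
    print(f"observed p = {p:<5}: 'observed power' = {observed_power(p):.2f}")
# A nonsignificant p of .30 maps to an "observed power" of about .18; every p
# maps to exactly one power value, which is why reporting it adds nothing.
```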

Webster, Jane and Starbuck, William H. Theory building in industrial and organizational psychology. Cooper, Cary L. and Robertson, Ivan T., editors.  International Review of Industrial and Organizational Psychology 1988. New York: Wiley; 1988; pp. 93-138.  ISBN: 0-471-91844-X.

White, Garland F.; Katz, Janet, and Scarborough, Kathryn E. The impact of professional football games upon violent assaults on women. Violence and Victims. 1992 Summer; 7(2):157-172.

Wilkinson, Leland and the Task Force on Statistical Inference. Statistical methods in psychology journals: Guidelines and explanations. American Psychologist. 1999 Aug; 54(8):594-604.

Ziliak, Stephen T. and McCloskey, Deirdre N. The cult of statistical significance: how the standard error costs us jobs, justice, and lives. University of Michigan Press; 2008.
Reviews: "McCloskey and Ziliak have been pushing this very elementary, very correct, very important argument through several articles over several years and for reasons I cannot fathom it is still resisted. If it takes a book to get it across, I hope this book will do it. It ought to."
—Thomas Schelling, Distinguished University Professor, School of Public Policy, University of Maryland, and 2005 Nobel Prize Laureate in Economics

“With humor, insight, piercing logic and a nod to history, Ziliak and McCloskey show how economists—and other scientists—suffer from a mass delusion about statistical analysis. The quest for statistical significance that pervades science today is a deeply flawed substitute for thoughtful analysis. . . . Yet few participants in the scientific bureaucracy have been willing to admit what Ziliak and McCloskey make clear: the emperor has no clothes.”
—Kenneth Rothman, Professor of Epidemiology, Boston University School of Public Health

The Cult of Statistical Significance shows, field by field, how “statistical significance,” a technique that dominates many sciences, has been a huge mistake. The authors find that researchers in a broad spectrum of fields, from agronomy to zoology, employ “testing” that doesn’t test and “estimating” that doesn’t estimate. The facts will startle the outside reader: how could a group of brilliant scientists wander so far from scientific magnitudes? This study will encourage scientists who want to know how to get the statistical sciences back on track and fulfill their quantitative promise. The book shows for the first time how wide the disaster is, and how bad for science, and it traces the problem to its historical, sociological, and philosophical roots.

Zuckerman, Miron; Hodgins, Holley S.; Zuckerman, Adam, and Rosenthal, Robert. Contemporary issues in the analysis of data: a survey of 551 psychologists. Psychological Science. 1993 Jan; 4(1):49-53.
Abstract: We asked active psychological researchers to answer a survey regarding the following data-analytic issues: (a) the effect of reliability on Type I and Type II errors, (b) the interpretation of interaction, (c) contrast analysis, and (d) the role of power and effect size in successful replications.  Our 551 participants (a 60% response rate) answered 59% of the questions correctly; 46% accuracy would be expected according to participants’ response preferences alone.  Accuracy was higher for respondents with higher academic ranks and for questions with “no” as the right answer.  It is suggested that although experienced researchers are able to answer difficult but basic data-analytic questions at better than chance levels, there is also a high degree of misunderstanding of some fundamental issues of data analysis.