**Aris Spanos**

Wilson E. Schmidt Professor of Economics

*Department of Economics, Virginia Tech*

**Recurring controversies about P values and confidence intervals revisited**

*Ecological Society of America (ESA) ECOLOGY*

Forum—P Values and Model Selection (pp. 609-654)

Volume 95, Issue 3 (March 2014): pp. 645-651

*INTRODUCTION*

The use, abuse, interpretations and reinterpretations of the notion of a *P* value have been a hot topic of controversy since the 1950s in statistics and several applied fields, including psychology, sociology, ecology, medicine, and economics.

The initial controversy between Fisher’s significance testing and the Neyman-Pearson (N-P; 1933) hypothesis testing concerned the extent to which the pre-data Type I error probability α can address the arbitrariness and potential abuse of Fisher’s *post-data threshold* for the *P* value. Fisher adopted a falsificationist stance and viewed the *P* value as an indicator of disagreement (inconsistency, contradiction) between data *x*_{0} and the null hypothesis (*H*_{0}). Indeed, Fisher (1925: 80) went as far as to claim that ‘‘The actual value of *p*…indicates the strength of evidence against the hypothesis.’’ Neyman’s behavioristic interpretation of the pre-data Type I and II error probabilities precluded any evidential interpretation of the accept/reject-*H*_{0} rules, insisting that accepting (rejecting) *H*_{0} does not connote the truth (falsity) of *H*_{0}. The last exchange between these protagonists (Fisher 1955, Pearson 1955, Neyman 1956) did nothing to shed light on these issues. By the early 1960s, it was clear that neither account of frequentist testing provided an adequate answer to the question (Mayo 1996): when do data *x*_{0} provide evidence for or against a hypothesis *H*?

The primary aim of this paper is to revisit several charges, interpretations, and comparisons of the *P* value with other procedures as they relate to their primary aims and objectives, the nature of the questions posed to the data, and the nature of their underlying reasoning and the ensuing inferences. The idea is to shed light on some of these issues using the *error-statistical* perspective; see Mayo and Spanos (2011).

…..

Click to read all of A. Spanos on “Recurring controversies“.

……

*SUMMARY AND CONCLUSIONS*

The paper focused primarily on certain charges, claims, and interpretations of the *P* value as they relate to CIs and the AIC. It was argued that some of these comparisons and claims are misleading because they ignore key differences in the procedures being compared, such as (1) their primary aims and objectives, (2) the nature of the questions posed to the data, as well as (3) the nature of their underlying reasoning and the ensuing inferences.

In the case of the *P* value, the crucial issue is whether Fisher’s evidential interpretation of the *P* value as ‘‘indicating the strength of evidence against *H*_{0}’’ is appropriate. It is argued that, despite Fisher’s maligning of the Type II error, a principled way to provide an adequate evidential account, in the form of post-data severity evaluation, calls for taking into account the power of the test.

The error-statistical perspective brings out a key weakness of the *P* value and addresses several foundational issues raised in frequentist testing, including the fallacies of acceptance and rejection as well as misinterpretations of observed CIs; see Mayo and Spanos (2011). The paper also uncovers the connection between model selection procedures and hypothesis testing, revealing the inherent unreliability of the former. Hence, the choice between different procedures should not be ‘‘stylistic’’ (Murtaugh 2013), but should depend on the questions of interest, the answers sought, and the reliability of the procedures.

Spanos, A. (2014). Recurring controversies about P values and confidence intervals revisited. *Ecology* 95(3): 645-651.

**Murtaugh: In defense of P values | Murtaugh: Rejoinder**

**Burnham & Anderson: P values are only an index to evidence; 20th- vs. 21st-century statistical science**

Thanks for posting this, Aris! A little while ago I tried to find the entire article and discussion online and was stymied. Now, at least I get to read a little bit of it.

Corey: You can read all of it, it’s linked here. Let me know if the link doesn’t work. Thanks.

I just quickly scanned Murtaugh’s paper. I credit him for at least trying to defend P-values against criticisms based on mere misuses. He doesn’t deal with Spanos’s points in his reply.

Noteworthy is his response to the Schervish criticism. He assumes that the P-value should obey the “special consequence” property, as characterized by Schervish. Note first that Bayesians, most of whom regard “confirmation” as measured by an increase in posterior probability (i.e., a B-boost) rather than by the posterior itself, are happy to reject the special consequence property; otherwise, so long as anything is confirmed, everything is confirmed. This is discussed at length on this blog:

https://errorstatistics.com/2013/10/19/bayesian-confirmation-philosophy-and-the-tacking-paradox-in-i/

(and follow-ups).

Most importantly, the proper use of a P-value, in relation to a measure of how severely tested the claims are, scotches this problem entirely. The trouble is when people look at methods based on error probabilities as “evidential-relation” logics. This is a mistake. From the perspective of how well tested the corresponding claims are, the appropriate answer emerges. I don’t know whether he talks about selection effects, stopping rules, and the like. These are also relevant here.

Hi–I discuss Murtaugh’s paper here. I agree with Murtaugh on some points and disagree with him on others, and here I speak as a statistician who has, on occasion, successfully used p-values and hypothesis testing in my own work, and who in other settings has reported p-values (or, equivalently, confidence intervals) in ways that I believe have done no harm, as a way to convey uncertainty about an estimate.

Thanks Andrew: I had linked it when you had it on your blog, and it’s perfect here.

I’ve added Murtaugh’s paper and his response to other papers into the post. I will try to link it here as well, we’ll see if it works.

By the way, I’ve never heard of this person and don’t know his work.

Paul Murtaugh is indeed a sensible statistician, willing to explore and use a variety of tools from the statistical tool chest in order to better understand structure in data, without heaping scorn upon some subset of reasonable tools for some specious reason such as that they are old. Full disclosure – Paul Murtaugh was a classmate of mine in the Biostatistics Ph.D. program at the University of Washington. I have followed his career with interest, and have admired his contributions towards assisting ecological researchers and maintaining a high level of statistical honesty and integrity, as illustrated by this forum.

What next, Burnham and Anderson, shall we abandon Bayesian methods, because they predate Fisher’s likelihood? Why, I can practically smell the mould on Bayes’ theorem.

Lavine’s comments regarding Schervish’s example prompted me to dig into the history of this specious example.

Schervish asserts that, because H0: theta in [-0.82, 0.52] is rejected for an observed X=2.18 (UMPU p-value = 0.0498) while H0: theta in [-0.50, 0.50] is not rejected (UMPU p-value = 0.0502), the P-values are incoherent, according to something or other from a paper by Gabriel in 1969.

Looking at Gabriel’s paper (K. R. Gabriel (1969). “Simultaneous Test Procedures – Some Theory of Multiple Comparisons”. The Annals of Mathematical Statistics, Vol. 40, No. 1, pp. 224–250), I see that it concerns multiple testing, and addresses the issue via a measure-theoretic approach, which no doubt scares off the vast majority of readers. But fear not the sigma algebra – there’s plenty of sensible discussion about multiple testing issues in Gabriel’s paper.

An important component of multiple testing is the construction of an omnibus test that assesses whether there is evidence of differences while maintaining an overall Type I error rate of alpha. So the first step when testing multiple components, say in an ANOVA with multiple conditions of interest, is an omnibus test. If the omnibus test yields a large p-value, then there is no statistical evidence of differences in means among the constituent components, and testing is done. No evidence is present to assert that some means differ from others. If the omnibus test yields a small p-value, then differences can be asserted subject to the overall Type I error rate alpha, and the next step is to assess which means differ substantially. There are many such procedures in statistics – Scheffé’s test, Tukey’s HSD, Dunnett’s test, and on and on, all derived to determine which differences can be asserted while maintaining the overall Type I error rate.
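The omnibus-then-post-hoc workflow described above can be sketched in a few lines of Python. All data and names below are invented; and to keep the sketch self-contained (stdlib only), the omnibus step uses a permutation version of the one-way ANOVA F-test rather than the tabulated F distribution.

```python
import random
from statistics import mean

def f_stat(groups):
    """One-way ANOVA F statistic: between-group vs. within-group variability."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean(v for g in groups for v in g)
    ssb = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ssw = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ssb / (k - 1)) / (ssw / (n - k))

def omnibus_p(groups, n_perm=2000, seed=1):
    """Permutation p-value for the omnibus null 'all group means are equal'."""
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    pooled = [v for g in groups for v in g]
    observed = f_stat(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        relabelled, start = [], 0
        for s in sizes:
            relabelled.append(pooled[start:start + s])
            start += s
        if f_stat(relabelled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Invented data: groups a and b overlap, group c is clearly shifted.
a = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2]
b = [5.0, 5.2, 4.7, 5.1, 4.9, 5.3]
c = [7.9, 8.1, 8.3, 7.8, 8.2, 8.0]

p = omnibus_p([a, b, c])
if p < 0.05:
    # Only now is it legitimate to probe which means differ, using an
    # alpha-controlling post-hoc procedure such as Tukey's HSD.
    pass
```

The point of the `if` guard is exactly Gabriel’s coherence requirement: if the omnibus test does not reject, no subset comparison gets examined at all.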

Gabriel’s point is that if the omnibus test fails to reject, you can not then go in and peek at the constituent components, and test them one against the other until you find a small p-value. That is the incoherence Gabriel lays out and seeks to codify mathematically.

If the set is rejected, subset differences may then be explored (subject to maintaining the overall alpha level at which the set was rejected).

If the set is not rejected (accepted), then it becomes incoherent to look at subset differences and reject any of them (overall alpha level then does not hold).

“If the common critical value happens to be between these two statistics’ values the subset would be rejected whilst the set was accepted. This would lead to the incoherence of asserting that all means of a set are equal whilst some of them differ from each other. Hence, such a procedure would be incoherent.” (Gabriel, p. 230)

So Gabriel’s point would be that if [-0.82, 0.52] was accepted, then incoherence would ensue if [-0.5, 0.5] was rejected.

Precisely the opposite is true in Schervish’s purported example of damnation.

The interval [-0.82, 0.52] is rejected. Thus the next question is, what subsets of [-0.82, 0.52] can also be rejected? This is the nub of the multiple comparisons conundrum that Gabriel discussed. Since Schervish picks out [-0.5, 0.5], we are left additionally with [0.5, 0.52] and [-0.82, -0.5]. Following Schervish’s p-value reasoning (UMPU, Lehmann style), the p-value for [-0.82, -0.5] is 0.0050 and the p-value for [0.5, 0.52] is 0.0949. We know from Schervish that the p-value for [-0.5, 0.5] is 0.0502 (adjusting these three p-values via the Benjamini-Hochberg procedure yields p-values 0.015, 0.0753, and 0.0949). We see that the observed X=2.18 would be highly unlikely if theta were in the region [-0.82, -0.5], and this is precisely why the UMPU p-value for [-0.82, 0.52] is smaller than that for [-0.5, 0.5]. It’s that large chunk of territory to the left of -0.5 that makes theta in [-0.82, 0.52] less likely than theta in [-0.5, 0.5] given an observed X=2.18. There is nothing incoherent or inconsistent about any of this at all.
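The arithmetic above can be checked directly. The stdlib-only sketch below computes interval-null p-values by imposing the UMPU equal-power constraint at both endpoints of [a, b] for a single X ~ N(theta, 1); treat it as an illustrative reconstruction of the calculation (not Schervish’s own code), which reproduces the 0.0498, 0.0502, 0.0050, 0.0949, and Benjamini-Hochberg figures quoted.

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def umpu_p(a, b, x):
    """UMPU p-value for H0: theta in [a, b], one draw X = x ~ N(theta, 1), x > b.

    The level-alpha UMPU test rejects when X < c1 or X > c2, with rejection
    probability equal at both endpoints theta = a and theta = b. Setting
    c2 = x (the boundary case) pins down c1 via that equal-power constraint;
    the p-value is the resulting size at either endpoint.
    """
    target = Phi(x - a) - Phi(x - b)
    # f(c1) = Phi(c1 - a) - Phi(c1 - b) is increasing on this bracket.
    lo, hi = -40.0, 0.5 * (a + b)
    for _ in range(200):          # plain bisection for the left cutoff c1
        mid = 0.5 * (lo + hi)
        if Phi(mid - a) - Phi(mid - b) < target:
            lo = mid
        else:
            hi = mid
    c1 = 0.5 * (lo + hi)
    return Phi(c1 - b) + 1.0 - Phi(x - b)

def bh_adjust(ps):
    """Benjamini-Hochberg step-up adjusted p-values."""
    n = len(ps)
    order = sorted(range(n), key=lambda i: ps[i])
    adj, running = [0.0] * n, 1.0
    for rank in range(n - 1, -1, -1):
        running = min(running, ps[order[rank]] * n / (rank + 1))
        adj[order[rank]] = running
    return adj

x = 2.18
p_wide  = umpu_p(-0.82, 0.52, x)   # ~0.0498: rejected at 0.05
p_inner = umpu_p(-0.50, 0.50, x)   # ~0.0502: not rejected
p_left  = umpu_p(-0.82, -0.50, x)  # ~0.0050
p_right = umpu_p(0.50, 0.52, x)    # ~0.0949
adjusted = bh_adjust([p_left, p_inner, p_right])  # ~[0.015, 0.0753, 0.0949]
```

As the comment argues, the wide interval earns the smaller p-value because so much of it ([-0.82, -0.5]) is wildly implausible given x = 2.18.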

Had we failed to reject [-0.82, 0.52], Gabriel’s point is that no subset of [-0.82, 0.52] should then be rejected – this would have been the incoherent outcome in Gabriel’s lines of reasoning, and this is not what Schervish discusses.

Schervish has misinterpreted the intent of Gabriel’s measure theoretic treatise, misrepresented the meaning and intent of “incoherence” as defined by Gabriel, and no one seems willing or able to point this out.

Steven: Thank you so much for your interesting comment. It’s nice to hear a little about the person associated with “Murtaugh”, whom I do not know at all, but am glad to hear he’s a sensible statistician. I have dealt with the general issue raised by Schervish a long time ago, fresh out of grad school actually, but not in relation to Gabriel’s paper, so far as I remember. I’m interested to hear the 1-sided vs 2-sided issue put in terms of multiple testing, because that’s exactly what’s happening, and why the associated error rates differ. I’ll look it up when I can, and study your comment. Overly pressed at the moment.

Gabriel’s paper “Simultaneous Test Procedures – Some Theory of Multiple Comparisons” can be found here:

http://projecteuclid.org/download/pdf_1/euclid.aoms/1177697819

Steven: I’m replying to your remarks on Burnham and Anderson in a comment below.

Murtaugh even concedes that Schervish’s example is a logical inconsistency (for P-values?): there is no such thing. He seems surprised by this cooked-up example from comparing 1-sided vs 2-sided tests.

Surely Schervish’s 1996 observation that two-sided P-values for interval null hypotheses can be ‘incoherent’ is nothing more than an argument against the use of interval null hypotheses and in favour of one-sided P-values.

Michael: The real point is that, far from its being pejorative to distinguish 1- and 2-sided inferences, as tests do, doing so is essential to their function in assessing error probabilities. However, it might properly be a concern for Bayesians that they violate the consequence condition, at least if they appraise evidence as an increase in “Bayesian confirmation”. In any event, the author doesn’t seem to be familiar with the issue.

It’s true that each field tends to grow its very own literature on significance tests; as more fields come on line statistically, they each launch reviews of the literature, repeating essentially the same points with slightly different emphasis. It’s good to hear someone not buying some of the well-worn, knee-jerk criticisms of p-values, but it is noteworthy, in my judgment, that there aren’t references to earlier discussions even in ecology, e.g., the Taper and Lele (2004) volume comes to mind.

I didn’t read everything but I read Aris’s discussion. Two remarks sparked by this:

1) I think that the term “accept” is very bad and misleading for “not rejecting the H0”. It invites misinterpretation as evidence “for H0”, and I therefore think one should make some effort to change the habit of using this term (at least I tell my students). I guess Aris would agree with this, but he still uses it at times.

2) Arguments for the usefulness of the AIC, as far as I know them, emphasize that if the aim is prediction quality, underfitting is much worse than overfitting, and this is reflected in what the AIC does. So the AIC may overfit too often, which seems connected to Aris’s criticism, but proponents of the AIC may object that this is not a big problem, and that the benefit of overfitting less is very rarely worth the risk of underfitting more. (The problem here may be that I read too quickly; but then, it’s a blog comment, and there are worse places for making a fool of oneself.)

Isn’t AICc the recommended practice anyway? I was under the impression that AICc is better about overfitting.

I wouldn’t agree that underfitting is “much worse” than overfitting. In fact, overfitting can be just as bad for prediction quality and has arguably done much more harm to scientific understanding than underfitting has.

One sociologically/psychologically insidious aspect of overfitting is that it’s much easier for scientists to mistake unbiasedness (which is really the allure of overfitting) for rigor and thoroughness.

vl: For me there is no generally “recommended practice” at all. What I think should be done always depends on the data, the background, and the aims of the analysis.

What I was writing was about the aim of having good prediction quality and the use of AIC. Of course overfitting can be very bad for other aims, and it can be even bad for prediction quality (but regarding the latter I mean overfitting in general, not the specific amount of overfitting you can expect from the AIC). But if we’re talking about the AIC and its merits for prediction quality, this is not so relevant.

What is relevant is that people tend to expect the outcome of a single method to be optimal in all kinds of respects, which is rarely justified.

vl PS: If I want to have prediction quality, I will always run proper cross-validation as long as I can afford the computation time, and not use any of the “IC”s.

Christian: I’m unclear about the relationship between the “IC’s”, as you call them, and other model selection or construction techniques that do involve proper cross-validation. This came up in the recent Potti post.

I also don’t really see how the reasoning or rationale can be the same, even if the end is to prefer the same model asymptotically. Can you explain?

This paper…

Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. Journal of the Royal Statistical Society, Series B, 39, 44–47.

…shows what the title says, namely that the AIC can be thought of as an asymptotic approximation of leave-one-out CV model selection.

Generally penalised likelihood-type ICs such as AIC, BIC and others try to approximate the prediction error or (BIC) the posterior probability of a certain model under uniform priors based on the information in the likelihood, whereas cross-validation gives a more “honest” assessment mimicking the fitting and prediction process for new observations not relying on any assumed distribution for the truth (although it requires i.i.d. at least in its basic form).

Although BIC was derived from a Bayesian point of view, it allows frequentist consistency results regarding estimating the model order, by the way. AIC doesn’t do that.
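As a toy illustration of Stone’s point (not a proof, and with invented data), one can check that AIC and leave-one-out CV rank a pair of simple models the same way. The AIC formula used here, n·log(RSS/n) + 2k, is the Gaussian-likelihood version up to additive constants.

```python
import random
from math import log

random.seed(0)
n = 40
xs = [i / 10 for i in range(n)]
ys = [2.0 * x + random.gauss(0, 1) for x in xs]  # invented: linear signal + noise

def fit_mean(xtr, ytr):
    """Intercept-only model."""
    m = sum(ytr) / len(ytr)
    return lambda x: m

def fit_line(xtr, ytr):
    """Simple least-squares line."""
    mx, my = sum(xtr) / len(xtr), sum(ytr) / len(ytr)
    sxx = sum((x - mx) ** 2 for x in xtr)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xtr, ytr))
    b1 = sxy / sxx
    return lambda x, b0=my - (sxy / sxx) * mx: b0 + b1 * x

def aic(fit, k):
    """Gaussian AIC up to constants: n*log(RSS/n) + 2k."""
    f = fit(xs, ys)
    rss = sum((y - f(x)) ** 2 for x, y in zip(xs, ys))
    return n * log(rss / n) + 2 * k

def loo_mse(fit):
    """Leave-one-out cross-validated mean squared prediction error."""
    errs = []
    for i in range(n):
        f = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errs.append((ys[i] - f(xs[i])) ** 2)
    return sum(errs) / n

models = {"mean-only": (fit_mean, 1), "linear": (fit_line, 2)}
aic_best = min(models, key=lambda name: aic(*models[name]))
loo_best = min(models, key=lambda name: loo_mse(models[name][0]))
# With a clear linear signal, both criteria prefer the linear model.
```

This only conveys the asymptotic flavour of the equivalence; the two criteria can and do disagree in small samples or among closely competing models.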

Christian: I do know the Stone paper. What are the implications, if any, for the business we discussed on Potti?

https://errorstatistics.com/2014/05/31/what-have-we-learned-from-the-anil-potti-training-and-test-data-fireworks-part-1/

And in relation to Corey’s comment:

https://errorstatistics.com/2014/05/31/what-have-we-learned-from-the-anil-potti-training-and-test-data-fireworks-part-1/#comment-34692

I have difficulties commenting on the Potti-case not having read the original literature. However, the exchange quoted by you looks as if Potti/Nevins got their cross-validation wrong and the others pointed that out. The only thing I can personally contribute is that what Potti/Nevins did (namely applying CV only after a model is selected based on all data) seems to be standard practice in certain quarters and I can understand why it sometimes happens (people don’t work systematically enough to separate test data from the beginning, and to use different levels of CV where required), and why in some cases it is argued that it may be harmless (but really this doesn’t hold in any generality although at times one stumbles over situations where it is fairly clear that it doesn’t do much harm). So it *is* wrong, but it is widespread enough that when I see it, I rather guess that people don’t understand the implications properly than that anybody is cheating on purpose.

Were there information criteria involved in the Potti-case? I didn’t think so.

Corey writes that information criteria are often used for “inference” rather than for prediction, which may be correct as a description of what many people do, but in the case of AIC this seems to be ill-advised, and even with other criteria, it doesn’t allow one to come up with any kind of error probabilities, at least currently.


Hennig: “Corey writes that information criteria are often used for “inference” rather than for prediction, which may be correct as a description of what many people do, but in the case of AIC this seems to be ill-advised, and even with other criteria, it doesn’t allow one to come up with any kind of error probabilities, at least currently.”

Indeed, or rather, we might not like them. Then how can Murtaugh (and others) equate AIC with significance testing? (Did I get this wrong? I’m traveling and read it quickly.) As for cross-validation (CV) (and I know there are many kinds), is it not intended to warrant an error probability assessment, which is why, for example, Baggerly and Coombes could raise the criticism of nonreproducibility based on a T-test between means?

Hennig: “Were there information criteria involved in the Potti-case? I didn’t think so.”

Hadn’t you been noting an equivalence, at least asymptotically between CV and AIC?

Hennig: “..it is widespread enough that when I see it, I rather guess that people don’t understand the implications properly than that anybody is cheating on purpose.”

Isn’t that so often the case with QRPs? And isn’t that the point of providing a foundational scrutiny of what is and is not accomplished by a given method?

Mayo: I can’t comment on Murtaugh but I guess that one can come up with a mathematical connection between the computation of p-values and the AIC under some rather severe assumptions. Now many things in mathematics can be related with many others, so this doesn’t amount to “equating” them, and perhaps Murtaugh (or his readers) over-interpreted what can be made of this connection.

Mayo: “As for cross-validation (CV) (and I know there are many kinds), is it not intended to warrant an error probability assessment, which is why, for example, Baggerly and Coombes could raise the criticism of nonreproducibility based on a T-test between means?”

No, as far as I understood it, the problem was not a basic problem with CV, but rather that Potti applied CV in the wrong way, not leaving test data aside when selecting the model.

Apart from this, CV indeed isn’t meant to warrant an error probability assessment, although this depends on the use of the term “error probability” – it is about estimating prediction errors indeed, but doesn’t care about error probabilities for getting the model wrong (wrong models can still be good at prediction, which according to CV then is fine).

Mayo: “Hadn’t you been noting an equivalence, at least asymptotically between CV and AIC?”

This is for leave-one-out CV, which isn’t what Potti was doing (as far as I get from the discussion – don’t forget I’m the wrong person to discuss Potti’s work).

My answer to your last questions should be “yes”, I think.

Christian: I agree that in an informal blog like mine one shouldn’t worry about being overly meticulous, and anyone who wants me to omit a comment they made should let me know. That’s better than having people scared off commenting.

In that spirit, let me venture something on your point (2), admitting to not having thoroughly read Murtaugh’s paper. I thought he said somewhere that AIC effectively makes it harder to reject a model, in contrast to what Aris and (I think) you say.*

On “accept”, Neyman says in a few places that it was intended as a shorthand for “do not reject”. On the other hand, there is a problem with merely saying “do not reject” and leaving it at that: it leads many people to suppose there is no information at all in that case, whereas one may and should be able to rule out discrepancies (that the test would have detected with high probability).
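The point about ruling out discrepancies after a non-rejection can be made concrete with a small severity calculation, sketched here for a one-sided normal test with known sigma (the numbers are invented): after failing to reject H0: mu <= mu0, SEV(mu <= mu1) is the probability the test would have produced a larger observed mean were mu actually mu1.

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def severity_mu_le(xbar, mu1, sigma, n):
    """Severity for the claim mu <= mu1 after a non-rejection:
    SEV(mu <= mu1) = P(Xbar > xbar; mu = mu1)."""
    return 1.0 - Phi((xbar - mu1) / (sigma / sqrt(n)))

# Invented example: H0: mu <= 0, sigma = 1, n = 100, observed xbar = 0.1.
# The test's p-value is P(Xbar >= 0.1; mu = 0) ~ 0.16, so H0 is not rejected at 0.05.
p_value = 1.0 - Phi((0.1 - 0.0) / (1.0 / sqrt(100)))
sev = severity_mu_le(0.1, 0.2, 1.0, 100)  # ~0.84: "mu <= 0.2" passes with high severity
```

So instead of reporting a bare “do not reject”, one can add that discrepancies as large as mu = 0.2 are ruled out with severity about 0.84, which is exactly the extra information a non-rejection carries.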

*I did send a link to this post to Murtaugh, inviting him to comment.

Mayo: It would be interesting to have Murtaugh comment on this. Generally the AIC is not a test, and therefore the word “reject” doesn’t really reflect what it does; it’s rather a device for ranking models, and it is, for example, not appropriate to say that a model was “rejected” if it was ranked second best. There are many situations in which there are not enough data to tell different models apart with satisfactory error probability and many cannot be rejected, but AIC will still pick a winner.

Regarding the term “accept”, I assume that Neyman understood his own method properly, but I’d guess that he wasn’t so aware of how tempting the term “accept” would be for many people to interpret it as a positive confirmation of the H0. If indeed more information than just “do not reject” is properly quantified (severity calculations), one should explain this in some detail rather than saying that the H0 is “accepted”. At least that’d be my preference.

Christian: Take a look at this post:

https://errorstatistics.com/2011/11/09/neymans-nursery-2-power-and-severity-continuation-of-oct-22-post/

I think the article links are worn out (I’ll get them replaced), but my own discussion gives enough of a sense of what Neyman had in mind for acceptance—at least in these “hidden” articles, and maybe elsewhere too.

I was curious as to what Steven was saying about Burnham and Anderson’s allegations that we cannot use significance tests because, well, they’re just too old! Well, here’s their paper: you can see Steven was putting it mildly; they actually say that using these methods is tantamount to being put in a hearse!

Some commentators on my latest post were questioning if the litany of howlers could really be raised by serious people in dismissing frequentist statistics….

https://errorstatistics.com/2014/06/14/statistical-science-and-philosophy-of-science-where-should-they-meet/

Look no further, we have a made-to-order example in Burnham and Anderson. They don’t get into the statistically complex examples, but settle for vague (but enthusiastic) cheerleading for contemporary, modern millie methods like AIC, claiming they give post-data probabilities to models, and besides, we know that “p-values are not proper evidence as they violate the likelihood principle” (627). (In fact, anyone who cares about controlling error probabilities must reject the likelihood principle (LP), as even most up-to-the-minute Bayesians now agree.)

We learn, too, about the “scandal” due to the fact that an alternative to a test or null hypothesis cannot be falsified, and they even namedrop Popper. The alternative, be it a one-sided parametric alternative, or a two-sided “real effect” hypothesis or a substantive alternative, is falsifiable in the way one would expect for a statistical test, namely statistically! What is scandalous are those ever-so-modern, new-fangled methods that we keep discovering lead to non-reproducible effects and models. And how do we manage to discover this? By statistically falsifying claims to have found genuine, reproducible effects and reliable models! Whether using age-old Fisherian checks, or analogous more modern fraudbusting techniques using simulations, the findings of “non-reproducibility” or “mere chance coincidence” or “overfitting” or even, in some cases, fraud (as Jens Forster recently), rest upon those dusty-old, tried and true methodologies and reasoning. Without them, there is no critical scrutiny.

Step right up to the future folks! Predictive modeling free of error probability control: you too can be part of a clinical trial testing someone’s exploratory “big data” model. You insist on error probability validation first? That’s so last week, man!

After reading your last comment, I had a quick look at the Burnham & Anderson book on model selection. Their Akaike weights are just model-wise maximum likelihoods standardised to sum up to one. They go on to call them “model probabilities” at times, despite admitting that they are not proper Bayesian probabilities, and without specifying any valid probability framework within which they could indeed be interpreted as “probabilities of models being true”. Later in the book they state that the idea of a “true model” should be discarded anyway, so if none of the models is true, shouldn’t all “model probabilities” be zero?
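For concreteness, the Akaike weights Burnham & Anderson define are rescaled AIC differences; they reduce to standardised maximum likelihoods only when all candidate models share the same number of parameters. A quick sketch, with invented AIC values:

```python
from math import exp

def akaike_weights(aics):
    """Akaike weights: w_i proportional to exp(-(AIC_i - AIC_min)/2),
    normalised to sum to one. Burnham & Anderson read w_i as the
    'probability' that model i is the best model in the candidate set."""
    m = min(aics)
    raw = [exp(-(a - m) / 2.0) for a in aics]
    total = sum(raw)
    return [r / total for r in raw]

weights = akaike_weights([100.0, 102.0, 110.0])  # invented AIC values
```

Note that the weights depend only on AIC differences within the given candidate set, which is part of why interpreting them as probabilities of models “being true” is hard to square with denying that any model is true.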

All this is extremely confusing in a rather ironic way, given how they complain about people being confused about p-values.

I have read Spanos’s article with interest. I think, however, that he is too dismissive of Lehmann’s view of the P-values ‘as the smallest significance level…at which the hypothesis would be rejected for the given observation’ (p70).

Even if one is completely Neyman-Pearson in approach, to regard this use of P-values as being unacceptable would require that one assume that the scientist who designs the experiment is mandated on behalf of all scientific posterity to make the reject/not reject decision.

If, however, Neyman habitually uses a different Type I error rate from Pearson, the P-value allows them to communicate the results of an experiment in such a way that each can make his own decision. Seen this way, the post-data/pre-data distinction is largely a red herring.

Furthermore, there is the curious historical fact that Fisher, in his statistical practice and through the tables he co-authored with Yates, introduced the habit of tabulating critical values for standard values of alpha.

See Senn, S. (2002). “A comment on replication, p-values and evidence, S.N.Goodman, Statistics in Medicine 1992; 11:875-879.” Stat Med 21(16): 2437-2444; author reply 2445-2437.

Thus, although I accept much of the stricture against a strength-of-evidence interpretation, I think a lot of it boils down to the fact that neither P-values nor NP tests are adequate to tell us the whole story. Applied statisticians typically use confidence intervals and likelihood in addition.

Finally, I should like to draw attention to Fisher’s criticism of the notion of an alternative hypothesis. He regarded null hypotheses as being more primitive than statistical tests but regarded tests as being more primitive than the alternative hypothesis. See

https://errorstatistics.com/2014/02/21/stephen-senn-fishers-alternative-to-the-alternative/

Stephen:

I didn’t catch where Aris dismissed Lehmann’s view of P-values; I thought it was something we generally agreed on. The post-data/pre-data distinction refers to the fact that post-data, you can evaluate tests in terms of the actual observed value (rather than things like worst-case cut-offs).

Yes, Fisher could be as or more “automatic” than Neyman.

I really like the Senn paper you reference; I hope all will study it.

I was picking up on the last paragraph of the first column, p. 646, of Spanos’s paper. Maybe I misinterpreted him, but I interpret Lehmann to mean that there is a perfectly respectable interpretation of the P-value from the NP perspective.

I think that the post-data, pre-data distinction is more or less irrelevant and in any case, I don’t even think that P-values versus significance levels is an important distinction between Fisher and Neyman. As I pointed out it was Fisher who introduced fixed levels of significance. Compare, for example, Student’s tabulation of the t-integral (1908) & Fisher’s similar (but more useful) approach in CP44 (1925) with Fisher’s Statistical Methods for Research Workers (1925) Table IV http://psychclassics.yorku.ca/Fisher/Methods/tabIV.gif . In the latter table, the table is entered by the significance level and it is the critical value that is delivered.

Furthermore, imagine a large community of scientists with very different habits regarding Type I error rates. Even if each of them uses an NP hypothesis test, and even if each of them abhors the idea of making an inference, if they stop to ask each other why so-and-so rejected and so-and-so did not, they will soon realise that by calibrating the proportion who rejected against the known distribution of tolerated Type I error rates over the scientists, they will come to the P-value.

Thus I have always found this one of the least interesting differences between Fisher and Neyman (if there is, indeed, any difference). I think that likelihood v power is far more important. My interpretation of the NP lemma is that power is (often) an incidental bonus of likelihood but likelihood is more fundamental. This, I think, is the reverse of what Neyman thought. (About E Pearson, I am not so sure.) My interpretation of Fisher’s thinking is that he thought that power was pretty pointless since precisely those cases that were sufficiently well-structured to permit its calculation permitted the calculation of likelihood. I think that this goes too far (and high dimensional problems are difficult with a pure likelihood approach but also for any philosophy of stats) but power worship delivers the sort of strange statistics that Perlman & Wu condemned in ‘The Emperor’s New Tests’ http://projecteuclid.org/euclid.ss/1009212517

See also my recent blog ‘Blood Simple’