I'm not being nitpicky - they are components of a problem of reproducibility, but orthogonal to each other. Bad UI design and a poor backend are both reasons "X website sucks!" but that doesn't mean they're the same.
A perfectly designed, un-p-hacked study should still perhaps be held to a stricter p-value criterion than 0.05.
And I am correct - because I've done it. I'm presently working on a paper where, because I primarily work with simulations, I can translate minute and meaningless differences into arbitrarily small p-values. And I use "arbitrarily" for a good reason - my personal record is the smallest value R can express.
Ironically, this isn't because I have a tiny sample, but because I can make tremendously large ones. All of this is possible because nowhere in the calculation of a p-value is the question "Does this difference matter?"
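To make that concrete, here's a minimal sketch of the effect (in Python rather than R, with a hypothetical `two_sample_p` helper doing a plain two-sample z-test): a difference of one hundredth of a standard deviation, which nobody would care about, becomes "highly significant" once the simulated sample is large enough.

```python
import math
import random

def two_sample_p(a, b):
    """Two-sided p-value from a two-sample z-test (n is huge, so z ~ t)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail

rng = random.Random(0)
n = 1_000_000  # simulations let you make the sample as large as you like
a = [rng.gauss(0.00, 1.0) for _ in range(n)]
b = [rng.gauss(0.01, 1.0) for _ in range(n)]  # 1/100 sd shift: practically meaningless
p = two_sample_p(a, b)
print(p)  # vanishingly small despite the trivial effect
```

Nothing in the calculation ever asks whether a 0.01-sd shift matters; the p-value only rewards the sample size.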
First off, I don’t have any experience with publishing based off the results of simulations. My (short) time in writing papers centered around economics research with observational datasets.
I can tell you that given a fixed size dataset, it is not possible to p-hack below a certain threshold in any meaningful way. My colleagues would’ve asked why the gigantic dataset we purchased had 1/4 of its observations thrown out etc.
Your claim that the threshold and the practice of p-hacking are orthogonal (independent?) is still puzzling to me. I think a better analogy would be trying to game something like your Pagespeed score. In order to get a higher score, you skimp on UX so the page loads faster, and cut out backend functionality because you want fewer HTTP requests. Making it harder to achieve a Pagespeed score forces you at some point to evaluate the tradeoffs of chasing that score.
I have two questions for you:
1) Would it take you more time to p-hack a lower threshold, or do all your results yield you a ~0.0000 p-value?
2) In simulation based research like yours, it seems to me that even other p-hacking “fixes” like forcing there to be a pre-announcement or sample size, sample structure, etc. wouldn’t address what you say you’re able to do. What can be done to fix it?
"I can tell you that given a fixed size dataset, it is not possible to p-hack below a certain threshold in any meaningful way. My colleagues would’ve asked why the gigantic dataset we purchased had 1/4 of its observations thrown out etc."
This is only true if you haven't collected your own data, the size of the original sample is known, and you used all of it. I would suggest that a fixed, known sample size is a relatively rare situation in many fields.
"Your claim that the threshold and the practice of p-hacking are orthogonal (independent?) is still puzzling to me."
The suggestion is that they're unrelated. Changing to, say, p = 0.005 will impact studies that aren't p-hacked, and it does not make evidence p-hacking-proof. It potentially makes things more difficult, but not in a predictable, field-agnostic fashion.
"1) Would it take you more time to p-hack a lower threshold, or do all your results yield you a ~0.0000 p-value?"
It might take me more time - but I could also write a script that does the analysis in place and simply stops when it meets a criterion. The question is whether it would take me meaningfully more time - "run it over the weekend instead of overnight" isn't a meaningful obstacle.
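A sketch of what such a script looks like (hypothetical names, pure-Python z-test, and a true null by construction): peek at the p-value after every batch of new data and stop at the first "hit". Even with no real effect, this finds p < 0.05 far more often than the nominal 5%.

```python
import math
import random

def p_one_sample(xs):
    """Two-sided z-test of mean 0; the data really do have mean 0 here."""
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / (n - 1)
    return math.erfc(abs(m / math.sqrt(v / n)) / math.sqrt(2))

def hack_until(threshold, max_batches, rng):
    """Add a batch of 20 null observations, peek at p, stop at the first 'hit'."""
    xs = []
    for _ in range(max_batches):
        xs.extend(rng.gauss(0, 1) for _ in range(20))
        if p_one_sample(xs) < threshold:
            return True  # "significant", despite a true null
    return False

rng = random.Random(1)
trials = 300
rate = sum(hack_until(0.05, 30, rng) for _ in range(trials)) / trials
print(rate)  # well above the nominal 5% false-positive rate
```

Lowering the threshold just raises the number of batches the loop needs - overnight becomes a weekend, which is no obstacle at all.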
"In simulation based research like yours, it seems to me that even other p-hacking “fixes” like forcing there to be a pre-announcement or sample size, sample structure, etc. wouldn’t address what you say you’re able to do. What can be done to fix it?"
My preference is to move past a reliance on significance testing and report effect sizes and measures of precision at the very least. If one must report a p-value, I'd also require reporting the minimum effect size that could be detected with your sample.
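That last number is cheap to compute. A sketch (hypothetical `mde_two_sample` helper; `NormalDist` is in Python's standard library) using the standard power-analysis formula for a two-sample z-test:

```python
from statistics import NormalDist

def mde_two_sample(n_per_group, sd=1.0, alpha=0.05, power=0.80):
    """Smallest true mean difference a two-sample z-test of this size
    would detect with the requested power (standard power-analysis formula)."""
    z = NormalDist()
    se = sd * (2 / n_per_group) ** 0.5
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * se

# A million observations per arm makes differences of a few
# thousandths of a standard deviation "detectable".
print(mde_two_sample(1_000_000))
```

Reporting this next to the p-value makes "significant but trivially small" visible on its face.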
Pre-announcing sample size would...just be a huge pain in the ass, generally.
>I can tell you that given a fixed size dataset, it is not possible to p-hack below a certain threshold in any meaningful way
Correct, but the most common methods of p-hacking involve changing the dataset size, either by repeating the experiment until the desired result is achieved (a la xkcd [0]), or by removing a large part of the dataset due to a seemingly-legitimate excuse (like the fivethirtyeight demo that has been linked already).
Pre-announcing your dataset size is pre-announcing your sample size. If you pre-announce your dataset, p-hacking is not possible. This is true. But most research doesn't use a public dataset that is pre-decided.
>Would it take you more time to p-hack a lower threshold
Yes.
>In simulation based research like yours, it seems to me that even other p-hacking “fixes” like forcing there to be a pre-announcement or sample size, sample structure, etc.
Sorry if the second question was unclear. My point was that for simulation based research, it doesn’t seem that pre-announcing your sample size would do much for preventing p-hacking.
E.g. if I say “I will do 10000 runs of my simulation”, what’s to prevent me from doing those runs multiple times, and selecting the one that gives me the desired p-value? For observational research, there’s obviously a physical limit to how many subjects you can observe etc. Would still love an answer from the grandparent comment.
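A rough sketch of why rerun-and-select defeats a pre-announced size (hypothetical helper, null effect by construction): even if every run honestly uses the announced sample size, quietly keeping the best of 20 reruns yields p < 0.05 most of the time.

```python
import math
import random

def p_null(n, rng):
    """Two-sided z-test of mean 0 on n draws from N(0, 1): the null is true."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / (n - 1)
    return math.erfc(abs(m / math.sqrt(v / n)) / math.sqrt(2))

rng = random.Random(2)
# "Pre-announced" n of 200 per run - but rerun 20 times and keep the best p.
trials = 300
wins = sum(min(p_null(200, rng) for _ in range(20)) < 0.05 for _ in range(trials))
rate = wins / trials
print(rate)  # near 1 - 0.95**20 ~ 0.64, nowhere close to the nominal 0.05
```

Reporting the success rate across all runs, rather than a single selected run, is what closes this loophole.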
One nice thing about simulation-based research is that it is often (more) reproducible, so a simulation can be run 10000 times, but then the paper might be expected to report how often the simulation succeeded. In other words, you can increase the simulation size to make p-hacking infeasible.
Note that in practice, pre-announcing your sample size doesn't prevent p-hacking unless your sample size is equal to a known, fixed sample. If you say "our sample size will be X" but you can collect 2 or 3 times X data, you can almost certainly p-hack.
Not to mention that I'm unaware of any field where people actually pre-announce their sample sizes. Does this happen on professors' web pages and I'm unaware, or as footnotes in prior papers?
Again, academia/research is not my profession. But some cool efforts in this area include osf.io, which is trying to be the arXiv or GitHub of preregistration for scientific studies.
The best preregistration plans will typically include a declared sample or population to observe (http://datacolada.org/64), or at least clear cut criteria for which participants or observations you will exclude.
I think for the type of economics/finance research I’m most familiar with, you often implicitly announce your sample when securing funding for a research proposal. E.g. if I’m trying to see if pursuing a momentum strategy with S&P 500 stocks is profitable (a la AQR’s work), it’s pretty obvious what the sample ought to be. This is partly why that meta study I linked to earlier was able to sniff out potential signs of p-hacking.