Reevaluating P Values | Stats + Stories Episode 93 / by Stats Stories


Nicole Lazar is Professor of Statistics at the University of Georgia. After receiving her BA in Statistics and Psychology from Tel Aviv University, she served three years as Statistics Officer in the Israel Defense Forces Department of Behavioral Sciences. She then moved to the US for graduate school, obtaining her MS in Statistics from Stanford University and Ph.D. in Statistics from The University of Chicago. She was Associate Professor of Statistics at Carnegie Mellon University before joining the Department of Statistics, University of Georgia.

Full Transcript

(Background music plays)

Rosemary Pennington: In hypothesis testing, a p-value can indicate how statistically significant a particular set of findings is. But there are concerns that researchers have become overly reliant on the measure, leading to a lack of nuance in some discussions of data, or to interesting work being discarded simply because the p-value isn't on the right side of 0.05. The debate over p-values is the focus of a special issue of The American Statistician and of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I’m Rosemary Pennington. Stats and Stories is a production of Miami University's Departments of Statistics and Media, Journalism and Film, as well as the American Statistical Association. Joining me in the studio are regular panelists John Bailer, chair of Miami’s Statistics Department, and Richard Campbell, former and founding chair of Media, Journalism and Film. Our guest today is Nicole Lazar. Lazar is a professor of statistics at the University of Georgia. She also recently edited, with Ron Wasserstein and Allen Schirm, a special issue of The American Statistician, which included 43 wide-ranging articles focused on p-values, contextualizing their history and their use, as well as the debate around their utility. Nicole, thank you so much for being here today. Before we get into the argument of the special issue and some of the articles in it, could we talk a bit about why p-values have become an issue?

Nicole Lazar: Sure. So there are a number of reasons, I think. Partly it has to do with the fact that scientists and researchers have become so reliant on this measure, and journals have sort of encouraged that, I guess, by putting such an emphasis on this 0.05 threshold, which is really just an arbitrary threshold, but it is taken to mean a whole lot more than it really does. P-values are also hard to understand and get misinterpreted a lot, and so this, together with some of the issues surrounding the perceived reproducibility crisis and other crises in science, has just created what we call a perfect storm of circumstances around this issue.

John Bailer: Oh this is great. I'm going to ask you to rewind just a little bit because I know my Dad's going to be listening to this!

(Collective giggles)

Bailer: So there are some things that were in Rosemary's introduction, and in your response that I'd love to get you to give a simple description, more broadly. You know like, what is a hypothesis test? What's being evaluated in a hypothesis test? And how are p-values used to choose between those hypotheses?

Lazar: OK, so the standard way that we teach our intro statistics courses, and actually even more advanced courses as well, is that we have a set of two competing hypotheses: one is called the null hypothesis, and that's the one that we sort of hope is not true, and the other is the alternative hypothesis, which is the one that maybe we would like to be true. And a lot of the statistical procedures that we teach, for instance in our classes, are built on basically testing these two hypotheses, one against the other. And the hope is that we can reject, as it's called, the null hypothesis in favor of the alternative hypothesis. And the p-value is a way of trying to get to that decision, whether we should reject the null hypothesis or not.
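
To make the procedure Lazar describes concrete, here is a minimal Python sketch, added for illustration and not part of the episode, of a two-sample test: a null hypothesis of "no difference between groups" is tested against the alternative that the groups differ, and the resulting p-value feeds the reject-or-not decision. The group names, sizes, and numbers are invented for the example.

```python
# A minimal sketch (not from the episode) of the textbook procedure described
# above: test a null hypothesis of "no difference" against an alternative,
# and use the p-value to decide whether to reject the null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical data: two groups drawn from populations whose means differ.
control = rng.normal(loc=10.0, scale=2.0, size=30)
treatment = rng.normal(loc=11.0, scale=2.0, size=30)

# Null hypothesis: the two population means are equal.
# Alternative hypothesis: they are not equal.
t_stat, p_value = stats.ttest_ind(treatment, control)

print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# The conventional (and, as the episode stresses, arbitrary) rule rejects the
# null hypothesis when p < 0.05.
```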

Richard Campbell: So as the journalism teacher here, there are two of us actually, I love the way you use metaphors, and for me, not being a statistician, when statisticians use metaphors to explain things, I am appreciative. So you used a timber and construction metaphor at the beginning of this article, in which I think you equate p-values with rotting timber and say that there needs to be a kind of new foundation. You also say in the article, and I loved this line too, “The tool has become the tyrant”.

Bailer: Oh I like that too.

Lazar: Yeah, this was Ron.

(Collective laughter)

Campbell: But I’ve read some of your other stuff and it's got metaphors in it too. So talk about what those mean. So the tool has become a tyrant, let's start there.

Lazar: OK. So the tool has become a tyrant: this goes back to what I was saying a couple of minutes ago about that 0.05 threshold being the magic threshold. So for instance, it's hard to get results published in journals, in good journals, if you have a p-value that's greater than 0.05, even if it's just a little bit greater than 0.05, which doesn't make a whole lot of sense. So in that regard, the tool has become the tyrant, because it's become a barrier to publication, which it never was meant to be and never should be. It also means that people, whether consciously or not, will try to get to that magic threshold, 0.05, so that they can publish their results, which is backwards, I think. We should let the results speak for themselves and not be so compelled to get to some arbitrary cutoff point just to get our publications out.

Bailer: So you have just one “don't” in your editorial then, that leads into this…

Lazar: Yeah.

Bailer: …this special issue and I find that's kind of an interesting one that says, don't say statistically significant. So that's your kind of banning a phrase or proposing...

Lazar: Well I didn’t say banning, although I certainly see people taking it that way…

Bailer: OK. So tell us…this is almost a statement that, in the past, was made to give a blessing to a research result: that somehow having this level, that somehow 1 in 20, is sufficiently special to merit publication and recognition.

Lazar: Right.

Bailer: But a result that's less unusual than 1 in 20 would be not so great, you know. So why do you say this? Why, in the editorial, are you suggesting this one particular don't?

Lazar: Well, first of all, the original statement from three years ago had a lot of don'ts. So we thought that saying don't, don't, don't doesn't really help to move the conversation forward, because ultimately people are going to say, well, if we can't do all these things that we're used to doing, what should we do instead? So that's why, in the editorial, we focused on the do’s. The particular don't around not saying "statistically significant" comes down to this misinterpretation, because the word significance has a meaning in everyday life as well. And you tend to think, if something is statistically significant, then that means that it's really important or interesting. But that's not necessarily true. There are statistically significant results that are not that interesting, and there are results that are not “statistically significant” that could be interesting as well. And so this confusion around “significant” was important to me; that's why we’re trying to move people away from it.
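
As a hedged illustration of the point that "statistically significant" need not mean important, here is a short simulation added here for context, not from the conversation: a practically negligible difference clears the 0.05 threshold simply because the sample is very large. All numbers are invented.

```python
# A hedged illustration (not from the episode): with a large enough sample,
# a practically negligible difference can still clear the 0.05 threshold,
# which is one reason "significant" should not be read as "important".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two populations whose means differ by only 0.02 standard deviations.
n = 200_000
group_a = rng.normal(loc=100.00, scale=1.0, size=n)
group_b = rng.normal(loc=100.02, scale=1.0, size=n)

t_stat, p_value = stats.ttest_ind(group_b, group_a)
print(f"mean difference = {group_b.mean() - group_a.mean():.3f}, p = {p_value:.2g}")
# The p-value is typically far below 0.05 even though the effect is tiny.
```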

Campbell: The article included an 1885 quote about the original meaning of this, which was to indicate when a result warrants further scrutiny. And I thought, that's very different today.

(Collective: Yes)

Campbell: Could you talk a little bit about that?

Lazar: Sure! So I think this gets back to the tool-becoming-the-tyrant idea. I mean, that's not what any of these thresholds or cut-offs were originally proposed for. It was just to see, should we look further, or is there anything here that might be potentially interesting. Fisher also, in some of his work on this, said, you know, you should never make a decision contingent on just one study. You should follow up and see what happens in subsequent studies as well, whether the interesting results persist. So 1 in 20, you know, we could argue about whether that in itself is even all that surprising, and we also need to think about the context that people were working in, in those days. The sample sizes were a lot smaller, the studies were more of the designed-experiment and sampling type, whereas today we're living in a completely different landscape from the one in which these tools were first developed, and that has shifted our practices as well.

Campbell: Very good!

Pennington: How has it become a tyrant? How did we find ourselves in this space where it's gone from an indicator that maybe you should explore further, to being this thing where I could imagine some, you know, peer reviewer seeing a study with a p-value slightly above 0.05 and rejecting the study outright. How did we get to this sort of tyranny?

Lazar: Gosh! I wish I knew! Maybe schedule an interview with Steve Stigler for the history of this, because this is really a historical and sociological question, and I'm not really sure how we got to this point. My best take on it is that there's been a sort of interlocking system of incentives coming from the journals, coming from the way that we teach, coming from the way that people get hired and promoted and get grants. I mean, it has all kind of worked together in a, I think, not-so-great way. But I don't know how we got from…well, maybe a little bit I do know how we got here. So you know, now we're so used to having computers that can calculate all of our statistics for us. But in the past, everything was done by hand. And if you go back and look at Biometrika and all these journals, they published tables of statistical distributions. But you had to then pick some thresholds and some cutoffs that you were going to look at. And so 0.05 was one of them, 0.1 was one of them, 0.01 was one of them. But there were very few values that they actually computed, and so those became, sort of by default, the values that we looked at, because that's all we had access to.

Bailer: I wonder also if there's just sort of such an attractive simplicity in having this dichotomous call. I mean, we see a lot of times where people report even summaries of distributions as single numbers. They'll show the average of this group versus the average of that group, without any kind of comment on the variability of the responses associated with those averages. I think it is part of the simplicity of the message and the simplicity of application.

Lazar: Oh absolutely, but the problem is, in this case, I mean life is not simple, and data are not simple, data analysis is not simple.

Campbell: That gets to the question of uncertainty, right? A lot of us want to be certain about these things and I know from working with John for many years that uncertainty is always sort of at the heart of this. And I think that's what you're talking about a lot, is the subject of uncertainty. Can you talk a little bit about that, what that term means to you?

Lazar: Yeah, so to me, I mean, you know, the world is an uncertain place, and the analogy I've been trying to use, it's not really an analogy, it's a thought experiment, is this: if I collect a sample today and I do some calculation on it, that's going to give me one value. And if I collect a different sample from the same population another day, it is going to give me a different value if I do the same analysis, and so on. So I could do this lots of times. The p-value itself is not a fixed number, so the p-values that I get out of the analyses that I do on these different samples will also be different, and some of them will be greater than 0.05 and some of them will be less than 0.05, as the studies go on and on. So that whole aspect of uncertainty gets lost, the repeated-sampling aspect of it. But you know, do you want certainty? I've been trying to introduce some of these ideas in my teaching, and my students, their reaction is, well, if we can't use the 0.05 threshold to make our decisions, what are we supposed to do? So they're scared of tackling that uncertainty.
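
Lazar's thought experiment is easy to simulate. The sketch below, added for illustration and not from the episode, draws many samples from the same pair of populations, runs the same test each time, and shows the p-value landing on both sides of 0.05; the effect size, sample size, and seed are invented.

```python
# A minimal simulation sketch (not from the episode) of the repeated-sampling
# thought experiment: the same analysis on fresh samples from the same
# populations yields p-values that bounce above and below 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

true_effect = 0.4   # a modest real difference between the two population means
n = 25              # sample size per group in each simulated "study"

p_values = []
for _ in range(1000):
    control = rng.normal(loc=0.0, scale=1.0, size=n)
    treatment = rng.normal(loc=true_effect, scale=1.0, size=n)
    p_values.append(stats.ttest_ind(treatment, control).pvalue)

p_values = np.array(p_values)
print(f"p-values range from {p_values.min():.4f} to {p_values.max():.4f}")
print(f"share of studies with p < 0.05: {np.mean(p_values < 0.05):.2f}")
```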

(Background music plays)

Pennington: You're listening to Stats and Stories, and today our guest is Nicole Lazar, who co-edited a special issue of The American Statistician exploring the issue of p-values. Nicole, there are 43 articles in this special issue of the journal. I was wondering if you could give us sort of a broad overview of the various arguments and maybe the main themes running throughout these articles as they appeared in the issue?

Lazar: Sure, I can try anyway. So we wanted the issue to be broadly accessible, not just meant for statisticians. Statisticians talking to statisticians about this issue is not going to change anything. So we really did try to have a range of papers that would be of interest to scientists and researchers in different fields as well as to a statistical audience. So there are papers that argue a variety of perspectives. There are some that say, well, we should continue to use p-values with thresholds, but here are the dangers of that threshold. There are some that propose alternative measures to p-values, which may also be thresholded, but not necessarily. There are papers about education: how do we change the way that we teach our students so that this cycle gets broken? There are papers about how we change the editorial process so that this cycle can get broken at the journal level. There are papers that we hope will be of interest to practitioners in different fields, ranging from medicine to the social sciences. So pretty much, I think, we were hoping to have something for everyone who is interested in this question.

Bailer: So I do think you succeeded in that regard, and I like that you tried to capture your do’s in a simple way. So you've got your ATOM acronym.

Lazar: Yes.

Bailer: So accept uncertainty, be thoughtful, open and modest. Could you just say a sentence or two about each of those components that are embedded in your recommendation?

Lazar: Sure. So I think we talked about uncertainty already, and why it's important to recognize and acknowledge it and not be afraid of it, because it's part of science and it’s part of the world. Being thoughtful: you know, I think using a fixed threshold is sort of the opposite of being thoughtful. So think about what's interesting, think about what's important, think about how you use your data. Be open: that starts to talk about open science as well, and some of the issues there that come at the reproducibility crisis from a different angle. Be open about what you've done; share it, share your data, share your analyses, share your code, and be open to criticisms, I would say, of your methods and your techniques. And being modest: don't put too much weight on any single result. You know, think about, in the end, sort of the bigger picture and how things tie together.

Campbell: So will there be a lot of pushback on this? And do you have any data on this?

Lazar: Not so far. I haven't heard too much pushback; most people are at least willing to have the conversation, even if they don't…I mean, not everybody agrees, obviously, with everything in the editorial or in the special issue, because the people in the special issue don't even agree with each other. I mean, that's exactly what should be happening, and you know, we were very deliberate: there's no one answer, there's no one-size-fits-all. In fact, we want to move away from one-size-fits-all. So I personally have not received any e-mail saying, oh, this is terrible! I have heard, through the back alleys, some saying this might ultimately be bad for science and statistics. But I'm not sure. My Dad asked if I'm being savaged on Twitter, but I told him I wasn’t.

(collective laughter)

Bailer: I can’t imagine you will be. I mean, the first statement, as you said, was well received and well promoted. So I expect that this is going to get the same kind of attention.

Lazar: And again you know, we're not trying to lay down a law for anybody. We're just trying to get people talking which I think we've succeeded in doing.

Bailer: I think so too. You mentioned the idea of reproducibility. Let’s talk a little bit more about that: how reproducible results are a concern now, and how does this p < 0.05 idea contribute to the problem?

Lazar: Sure. So this is obviously an area that's still developing, because people have different ideas about what is meant by a study being reproducible. Sort of a crude meaning of it might be, well, if the first study had p < 0.05, then the second study should have p < 0.05 as well, but that's not realistic for reasons that we already talked about. So you can see a couple of ways that the arbitrary threshold comes into this. Andreas Georgiou has a really nice expression that he uses in his comments from time to time, which is that studies are chasing noise. There may not even be much to find there, but you will analyze your data and torture your data set until it reveals the magic 0.05. If you do that, the next study that comes along is not going to reproduce that result. So that's one way that p-values have come into the problem. The other way, I think, is sort of the flip side of it: all those studies that don't appear in the literature because they didn't meet the threshold of 0.05, or whatever threshold the journal uses. Those could be studies showing that the effect is not really there, or the result is not really there, or is more elusive than might be indicated by the papers that make it into the journals. And so that also contributes, from the other side, to our ability to reproduce and replicate results. We just don’t have the whole picture.
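
The "chasing noise" point can be illustrated with a short simulation, added here as a hedged sketch rather than anything from the episode: when there is no real effect, only a lucky fraction of studies clears p < 0.05, and independent replications of those "successes" rarely clear it again. All numbers below are invented.

```python
# A hedged sketch (not from the episode) of "chasing noise" and the
# publication filter: with no true effect, roughly 5% of studies reach
# p < 0.05 by chance, and replications of those "significant" studies
# succeed at about the same low rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 30  # sample size per group; the null is true in every simulated study

def one_study():
    a = rng.normal(size=n)
    b = rng.normal(size=n)  # drawn from the same population as `a`
    return stats.ttest_ind(a, b).pvalue

originals = np.array([one_study() for _ in range(5000)])
published = originals < 0.05  # only "significant" studies get noticed
replications = np.array([one_study() for _ in range(int(published.sum()))])

print(f"original studies with p < 0.05: {published.mean():.3f}")
print(f"replications of those that also get p < 0.05: {np.mean(replications < 0.05):.3f}")
```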

Pennington: To go on a side track really quickly: I took part in a science-reporting workshop years ago, when I was still a journalist, and one of the people who were leading the workshop, one of the guest speakers, pointed to the 0.05 as a kind of shorthand for journalists who are trying to decide whether to cover a study or not, a quick way to evaluate whether a study is worth digging into more, whether you should pitch it when it comes into your inbox.

Campbell: Whether it is statistically significant.

Pennington: Yes, and worth it for news. So I'm wondering, for an audience that's not as quantitatively literate, who, you know, is reading scholarly work and sort of sifting through scientific data, what advice would you give them? If they're not going to look at a p-value, what else could they look for in a study to gauge whether it's something worth investing more time in?

Lazar: Well, that's a hard question. That’s one of the things that is still in flux; we will need to have new standards and new ideas. But some initial thoughts: I would encourage a person like that to look at how the study was done, whether it looks like it was done in a reasonable way. I would like to see effect sizes, and we can talk about that more if we need to, effect sizes reported along with measures of uncertainty for those effect sizes, as well as the p-value. A small p-value might still mean something; a large p-value might still mean something. Give us the information and let us think about it. Journalists might also have to retrain the way they think.
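
As a hedged sketch of the kind of reporting Lazar is asking for, added for illustration and not from the episode, the snippet below reports an effect size with a confidence interval alongside the p-value; the data and the equal-sample, roughly-equal-variance setup are invented for the example.

```python
# A minimal sketch (not from the episode) of reporting an effect size and its
# uncertainty alongside the p-value, rather than the p-value alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.normal(loc=50.0, scale=8.0, size=40)
treatment = rng.normal(loc=54.0, scale=8.0, size=40)

# Effect on the raw scale, with an approximate 95% confidence interval
# (a rough sketch, not a full analysis).
diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment)
             + control.var(ddof=1) / len(control))
df = len(treatment) + len(control) - 2
half_width = stats.t.ppf(0.975, df) * se

t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"difference = {diff:.2f} "
      f"(95% CI {diff - half_width:.2f} to {diff + half_width:.2f}), "
      f"p = {p_value:.3f}")
```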

Bailer: You know, one takeaway for me, as I look at this and think about the ideas here, is that it's pushing things a little bit earlier, before the data are collected. It's the idea of the well-designed study: that there's been good thought about what's going to be evaluated, what the primary hypotheses are that you want to test about a system; you make sure you have enough data to evaluate it critically, so that then you're not relying on just this magic dichotomous call. Doing the work upfront is what matters. So now you're asking people to have a better sense of how to critically evaluate how data are collected and how studies are designed. And that's a harder question.

Lazar: It is a harder question, no doubt about it. But if we want…well yeah, life is hard!

(Collective laughter)

Lazar: And I think, you know, it's not impossible, right? I mean, I don't want to bring too much personal stuff in, but you know, I was having this conversation with my soon-to-be-13-year-old the other day, and I said to him, you know, you should be skeptical. When you hear something, you should think about, well, how did that happen, and why did this happen, and why was it like this, and how were the data collected, and how was the analysis done. And he got it, you know, and so I think if we build that in, maybe, you know…so this is not something that's going to change overnight. It is going to take a lot of effort, but we didn't get here overnight either. I mean, it took 70 or 80 years for people to feel comfortable with "oh, p < 0.05", and still we don't really understand what it means.

Campbell: It seems like one of the challenges here is to change the way the journals accept papers, which is a tough one, because that's so important to how people get promoted, how they're recognized in the field. So do you have any…and I didn't read all of the articles, so do you have any ideas on how you change that culture?

Lazar: So some of the ideas that I've seen bandied about have to do with openness again, so there are more and more open publication models. I'm thinking of things like registered reports, pre-registered reports, as an example. So you put your whole study design out there: this is what we're planning to do, this is how we're going to collect the data, these are our hypotheses, this is how we're going to analyze. You put it all out there first, the paper gets accepted on the strength of the science, and the results that you get in the end don't matter, you know, whether you have small p-values or large p-values, as long as you followed your pre-registered protocol. So that's one way. There are other things like open peer-review platforms that are starting to pop up, so people put their papers out there, on a preprint archive say, and there's a pre-publication peer-review aspect. I'm thinking about other platforms that are starting to pop up here and there, where you put your paper out there and then readers can comment on it, give you suggestions, and then you can either submit to a regular “journal” if you want to, or you can just leave it up there, published on a site like that. So I think people are coming up with very creative ideas about how to change the publishing model. Of course, we still have the big society journals, and those are going to be slower to change. They may or they may not, it's unclear, but some journals are starting to make moves in that direction. And obviously there was the famous, or infamous, social psychology journal that banned hypothesis testing a few years ago, and that was part of this whole conversation as well.

Bailer: So my quick last question is, how has being part of this conversation and this editorial review process, and looking at all these papers, changed your practice, or how you think about the discipline, or how you think about training others in this discipline?

Lazar: Yeah, so it's definitely forced me to embrace the uncertainty.

(Collective laughter)

Lazar: You know, because all these things that you asked about, I mean I've been struggling with them as well and I don’t know, I have been trying to change some of my teaching habits, but it's hard because the students then become much more freaked out than they usually are about taking a stats class.

(Collective laughter)

Lazar: And these are also stats majors and minors. I know I'm telling them to abandon this crutch that they have been taught, and it's hard. It has made me think a lot more about how I present results in class, how I analyze them, and the model that I try to set for my students in that regard. But it is kind of like, I'm not really sure now what to do either, in some respects, and I think that's OK, because I want to try and figure out how to do things. So maybe more emphasis on modeling everything together, rather than relying so much on hypothesis testing and strict cutoffs, which is sort of ironic, because some of my other work is focused on the multiple testing problem, which hits this head-on, and so I'm kind of saying, well, all that stuff we don't want to be doing anymore anyway. Yeah.

Pennington: Well Nicole, that's all the time we have for this episode of Stats and Stories. Thank you so much for being here.

Lazar: Thank you!

Pennington: Stats and Stories is a partnership between Miami University’s Departments of Statistics and Media, Journalism and Film and the American Statistical Association. You can follow us on Twitter, Apple Podcasts or other places where you can find podcasts. If you'd like to share your thoughts on the program, send your e-mail to statsandstories@miamioh.edu or check us out at statsandstories.net. Be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.