The Winner’s Curse | Stats + Stories Episode 232 / by Stats Stories

Erik van Zwet (@erikvanzwet) is an Associate Professor in the Department of Biomedical Data Sciences of the Leiden University Medical Center where he has been since 2009. He joined the school wanting to do more applied work in the areas of statistics and data analysis and has since published multiple papers in Significance Magazine including the main focus of today’s episode, “Addressing exaggeration of effects from single RCTs”.

Episode Description

A randomized controlled trial is viewed as the gold standard in medical research, particularly as it relates to treatments or interventions. But there may be pitfalls to trusting that approach too much. That's the focus of this episode of Stats and Stories with guest Erik van Zwet.

Timestamps

What is an RCT? (1:15), What are the characteristics of a well-designed trial? (2:00), How did you get interested in this research? (3:45), Data obtained from the Cochrane Database (5:18), Power and how you got results (7:05), How does this affect the layperson? (9:49), Coverage of RCTs (12:00), Trends of exaggeration (14:17), What goes into exaggeration? (16:54), What needs to be done? (18:56), Across other fields (21:58)


Full Transcript

Rosemary Pennington
A randomized controlled trial is often viewed as the gold standard in medical research, particularly as it relates to treatments or interventions. But there may be pitfalls to trusting that approach too much. That's the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind those statistics. I'm Rosemary Pennington. Stats and Stories is a production of Miami University's Departments of Statistics and Media, Journalism and Film, as well as the American Statistical Association. Joining me as always is regular panelist John Bailer, Chair of Miami's Statistics Department. Our guest today is Erik van Zwet, an associate professor in the Department of Biomedical Data Sciences of the Leiden University Medical Center, where he's been since 2009. He joined the school wanting to do more applied work in the areas of statistics and data analysis. He is the co-author of a Significance article exploring the issue of effect exaggeration in randomized controlled trials. Erik, thank you so much for joining us today.

Erik van Zwet
You're very welcome.

Rosemary Pennington
Just to get started, could you explain what a randomized controlled trial is?

Erik van Zwet
Yeah, certainly. This is when somebody thinks of, perhaps, a new medical treatment, and they want to find out if it's working. The best thing to do is to take a group of patients who are eligible for the treatment, randomly divide them into two groups, and give one group a placebo or control, or potentially the current best treatment, and give the other half the new treatment. Then you see what the outcome is after a little while on the treatment, and you can compare the controls to the treatment group. In that way, you get an estimate of the treatment effect, together with a standard error for that estimate, which says something about how accurate the estimate is.
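
A rough sketch of the arithmetic Erik describes: the treatment effect estimate is the difference in group means, and its standard error comes from the within-group variability. The data and numbers below are simulated placeholders, not from any real trial.

```python
# Minimal sketch: effect estimate and standard error from a two-arm trial.
import numpy as np

rng = np.random.default_rng(0)
control   = rng.normal(10.0, 3.0, size=100)   # outcomes in the control arm (simulated)
treatment = rng.normal(11.0, 3.0, size=100)   # outcomes in the treatment arm (simulated)

effect_estimate = treatment.mean() - control.mean()
standard_error  = np.sqrt(treatment.var(ddof=1) / len(treatment)
                          + control.var(ddof=1) / len(control))

print(f"estimated effect: {effect_estimate:.2f} (SE {standard_error:.2f})")
```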

John Bailer
You know, there's a big planning part of designing randomized clinical trials, and a major aspect of that is determining how large a study must be. Could you talk a little bit about some of the decisions that have to be made? What are some of the targets, or what are the characteristics of a well-designed trial?

Erik van Zwet
Yeah, so these sample size calculations, as they're called, are something that I, as a statistician, get called in to do a lot. I'll talk to the medical investigators about planning the trial. The idea is that you want a good chance of finding a statistically significant result at the end of the trial, because then you will have shown that the treatment is working. And if the result is not statistically significant, you're really left empty-handed in a sense, because you also haven't shown that the treatment is not working; perhaps you just didn't have a large enough sample size. So it's all about the sample size being big enough to get a clear result. These sample size calculations are often bad news, because it often turns out that, given the variation between subjects and given a realistic effect size, you will need a lot more people than are generally available with the time and money that you have for the study. So then there's a bit of back and forth between the statistician and the researchers to come up with some sort of compromise. And that's how the sample size calculations kind of go in practice. In theory you want to have maybe an 80% probability of getting a significant result, but then you're going to have to imagine a slightly bigger effect size than is maybe plausible, or underestimate the variability, just to get a sample size that you can manage.
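
A sketch of the kind of calculation Erik describes, using the standard normal-approximation formula for comparing two means; the effect size and standard deviation below are hypothetical planning values, not from any actual trial.

```python
# Per-group sample size for a two-sided test at level alpha with a target power,
# using the usual normal approximation for a difference of two means.
from math import ceil
from scipy.stats import norm

def n_per_group(effect, sd, alpha=0.05, power=0.80):
    z_a = norm.ppf(1 - alpha / 2)   # critical value for the two-sided test
    z_b = norm.ppf(power)           # quantile corresponding to the target power
    return ceil(2 * ((z_a + z_b) * sd / effect) ** 2)

# Hypothetical planning numbers: effect of 0.5 units, between-subject SD of 1.0.
print(n_per_group(effect=0.5, sd=1.0))   # about 63 patients per arm
```

Halving the assumed effect size roughly quadruples the required sample size, which is why the "back and forth" Erik mentions so often ends with a smaller, less realistic effect being penciled in.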

Rosemary Pennington
Your article in Significance is about this idea of effect exaggeration. Could you talk a bit about how you got interested in this particular issue related to this kind of research, and maybe what led you to write this piece in Significance?

Erik van Zwet
Yeah, so there's a well-known principle, which is called the winner's curse. If you do a trial and you get a significant result, then that's a combination of maybe having an effective treatment and a bit of luck. It's a combination of the two, and you don't really know which one is dominant, but there's always luck involved. And that means that your effect is a little bit overestimated, because you've gotten lucky, and that's why your trial is significant. Somehow this winner's curse is well known, but it's kind of the elephant in the room; it gets ignored a lot. We're just happy with our significant result, because it looks good, it's convincing, it's big, but we also know it's overestimated. And then, on top of that, if a trial has very low statistical power, like we talked about earlier, because maybe the effect was really overestimated and the power really wasn't 80%, then you must have had a lot of luck to get a significant result, which means that the winner's curse is very severe. So it's like a perfect storm: significant trials are lucky trials, and underpowered significant trials are very lucky trials and are hugely overestimated.
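
A toy simulation of the winner's curse Erik describes. The true effect and standard error below are invented so that the power is low; this is an illustration of the selection mechanism, not the paper's analysis.

```python
# Simulate many identical low-power trials and compare the significant ones to the
# truth: the estimates that clear the significance bar are systematically too large.
import numpy as np

rng = np.random.default_rng(42)
true_effect, se = 0.1, 0.1                       # hypothetical values; power is low here
estimates = rng.normal(true_effect, se, 200_000) # estimated effects across trials
significant = np.abs(estimates / se) > 1.96      # two-sided test at the 5% level

print("power (share significant):", significant.mean())                      # roughly 0.17
print("mean estimate, all trials:", estimates.mean())                        # close to 0.1
print("mean estimate, significant trials:", estimates[significant].mean())   # far above 0.1
```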

John Bailer
Yeah, I thought that was really interesting, that you saw this and then decided to investigate it systematically, and part of your systematic investigation included use of the Cochrane Database of Systematic Reviews. Could you talk a little bit about the kinds of data that you were obtaining from this database, and how that started building the structure for really exploring the question of exaggerated effects?

Erik van Zwet
Yeah, so the Cochrane Database is a wonderful resource. It's a collection of results from clinical trials, tens of thousands of them, and they're gathered by dedicated people who really try to even find trials that are not published. So, you know, there will be some bias in there, and of course successful trials have a larger chance of being published, but the Cochrane Database is really the best we have. It's a record of clinical trials, tens of thousands of them, that have been conducted over the last decades. So it's really wonderful. It's public; you can go online and investigate a particular treatment if you're interested in that. A couple of years ago, I met Simon Schwab, in Zurich, who happened to have collected all these data systematically. You can actually also apply to Cochrane and get all these data. And then you can go one level up, and not just look at all the trials about a particular treatment, which is called a meta-analysis, but do a meta-meta-analysis, where you look at all the trials in the whole database and really say something about how clinical research is done in practice. So now we have a bird's eye view of how the sausage is made.

John Bailer
That's always a scary prospect, to see how the sausage is made. Looking through your paper, I was struck by a number of aspects of it. One was, when you were talking about the review of this database and these tens of thousands of RCTs, these randomized clinical trials that you looked at, you have this statement that nine out of ten of these trials have power less than their targeted 80%, and that the median actual power was 13%. That was horrifying to me. So can you talk a little bit about how you went through and did this kind of evaluation of each of these trials to come up with these types of estimates?

Erik van Zwet
Yeah, so power is a complicated concept, and the same word means different things. One way to do a power or sample size calculation is to imagine the smallest effect that you wouldn't want to miss. So you're saying, okay, this new treatment would only be interesting for patients if the effect is at least this large, and then you design a trial to be able to pick up an effect that's this large. Now, often, effects are not actually that large. So maybe the trial really was designed to have 80% power, an 80% probability of a significant result, if there is a really interesting effect. But that effect may just not be there; medical research is really hard. So it's to be expected that the power against the actual effect that's really there is lower than that. And on top of that, like I explained earlier, there is this compromising, because trials often require more people than are easily available, and people run out of money and out of time. So there's always a pressure to make trials a little bit smaller than is maybe wise. So on the one hand, the fact that the power against the true effect is lower than the power against the effect that you would like to see, or that would be interesting, is all in the game. And on top of that, there's the limited availability of subjects and of time and money, and that leads to trials tending to be underpowered for the actual effects. So while they may even be correctly powered for the effect that's interesting, they're underpowered for the effect that's actually there. And it's interesting that we can estimate this power distribution, because we never know the true effect, of course; we only get an estimate of the true effect in any particular trial. We don't know the true effect, but because we have these tens of thousands of trials, we can do some statistical trickery and get the distribution of the power against the true effects, even though we can never observe them.
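
A small sketch of what "power against the actual effect" means, under the usual normal approximation. The effect sizes and standard error below are hypothetical, chosen only to contrast a designed-for effect with a smaller, more realistic one.

```python
# Power of a two-sided test at the 5% level when the estimate is approximately
# normal with the given true effect and standard error.
from scipy.stats import norm

def power(true_effect, se, alpha=0.05):
    z_a = norm.ppf(1 - alpha / 2)
    z = abs(true_effect) / se
    return norm.cdf(z - z_a) + norm.cdf(-z - z_a)

print(power(true_effect=0.5, se=0.25))   # designed-for effect: power about 0.52
print(power(true_effect=0.2, se=0.25))   # smaller, more realistic effect: power about 0.13
```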

Rosemary Pennington
For someone who's not a statistician, hearing this discussion of power may be slightly confusing, so it might help to understand why this matters. What did you find in writing your paper, and why does it matter for someone who maybe is not in the field of stats and isn't going to care as much about the methods and the power? What is the takeaway for our producer, Charles, or perhaps even myself?

Erik van Zwet
Okay, so the power against the real effect is the real probability that your trial is going to be significant, given the number of patients that you have and given the actual effect of your treatment. If the power is low, that means you have a low probability of reaching significance. But of course, even if the probability of reaching significance is only 20%, you still have a one-in-five chance of getting a significant result. So that happens a lot, because many, many trials are done, and even these underpowered trials lead to significant results that make it into the literature, and people get excited about that. The problem is that if a trial with low power reaches significance, it's very lucky; it's one of the lucky one out of five. And that means that the effect is overestimated, because you got so lucky that, even though the probability that it would reach significance was low, it still made the bar. So the fact that lots of trials have been done with low power is in itself not such a big problem. The problem is that when you see the significant ones that make the headlines, you're looking at a very, very selected group. And because of this selection, and because of the fact that they got lucky, these effects are overestimated. And that's bad, because now these treatments look much better than they are. If you try to do a replication study, you will probably find a much lower effect than you thought you would, because the next time around you won't be as lucky. And if you want to do a new study, and you calculate the sample size on the basis of your old study that got lucky and had this overestimated effect size, you'll design a new study that again has too low power. So it's this low power that leads to all sorts of problems downstream.

Rosemary Pennington
You're listening to Stats and Stories, and today we're talking with Erik van Zwet of Leiden University. You mentioned news media coverage before the break, and so I do wonder, as I'm hearing you talk about this and thinking about medical decisions maybe being made based on what appears in some of these studies, what your thoughts are on the way that news media cover these kinds of studies, and whether there is something you could suggest journalists keep in mind when they're trying to decide what research in this area they should cover. Because I do think that journalists have a duty of care to report accurately and carefully, particularly when it comes to things that are related to medical interventions or treatments.

Erik van Zwet
I think that with any reporting, you always have to wonder what you're not seeing, because you're looking at it for a reason, maybe because it stands out in some way. And that's always a difficulty. Even apart from the statistics and the clinical trials setting, thinking about what you don't see is always a good idea, for anyone, not just journalists. But this is very hard. The research that we've done is important because it looks at all the trials that are being done, not just the significant ones, and that allows us to incorporate what we don't see into our interpretation of what we do see. So it means that if we do see a significant trial, we realize that it got lucky, it's been selected, and we don't see the other trials that didn't reach significance; they're in the Cochrane Database, but they don't make the headlines. And you can use statistics, some sort of Bayesian methods, to adjust the effect estimate, to shrink it a little bit, to make it a little bit smaller, to account for this overestimation. The fact that we have this Cochrane Database allows us to see what we're missing. So that's a recommendation we have in our paper: to supplement the usual estimates that we see with a shrunken version that accounts a little bit for the luck factor.
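
A minimal sketch of the kind of shrinkage adjustment Erik alludes to, using a generic normal prior for the effect. The prior parameters below are made up for illustration; the paper derives its prior empirically from the Cochrane data, which is not reproduced here.

```python
# Normal-normal shrinkage: pull the reported estimate toward a prior mean,
# shrinking more the noisier (larger SE) the estimate is.
def shrink(estimate, se, prior_mean=0.0, prior_sd=0.3):
    """Posterior mean of the effect under a normal prior and normal likelihood."""
    weight = prior_sd**2 / (prior_sd**2 + se**2)   # weight placed on the data
    return prior_mean + weight * (estimate - prior_mean)

print(shrink(estimate=0.50, se=0.25))   # noisy estimate, shrunk to about 0.30
print(shrink(estimate=0.50, se=0.10))   # more precise estimate, shrunk less: about 0.45
```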

John Bailer
You know, I was going to channel my inner Rosemary and ask a question about headlines, but she beat me to it. So I'm going to ask about another aspect of what you responded to just before the break, this issue of reproducibility. This really cries out for thinking about: okay, if you know this, and you do this shrinkage of the estimate, you could plan a subsequent study with that revision as a target. Or, the other question, and I don't know if you've done this, but considering a particular drug where there have been repeated studies, do you see trends in the trajectory of exaggeration ratios?

Erik van Zwet
Yeah. So, reproducibility. First of all, there's the fact that if your first study is significant, it tends to overestimate the effect. So if you don't account for that, you'll be disappointed in your replication study, which doesn't mean that the first study was a fluke; it just means that you have to be a little bit more realistic. We actually have a follow-up paper with Steve Goodman, which has just appeared in Statistics in Medicine, if I can plug that, where we really look at reproducibility in terms of the probability of again reaching a significant result if you would replicate the study exactly, and the probability that you would at least get the same direction of the effect if you would reproduce the study exactly, and also how much bigger your sample size would have to be to get a certain probability of reaching significance with your replication study. So we can say a lot more than is just in this paper; the Cochrane Database is such a wonderful resource that we can study these things. And then we can check them empirically, because this paper, the one that's in Significance here, and also the follow-up paper that I mentioned, just views the Cochrane Database as a huge pile of individual, separate studies, but they're really meta-analyses. We can use that structure to say something like: given that I saw a significant study, what's the probability that an exact replication would be significant again? I can use the structure of the Cochrane Database to check what comes out of the math. So we're also working on that, and I can say it works reasonably well. But there's not really so much of a trend in follow-up studies; it only looks like a trend when you first pick a significant study and then go for a follow-up, and then you'll see it go down. If you just look at a bunch of studies that have been done over time, there's no reason why there would be a trend in them.

John Bailer
So, you know, we've used this term and phrase, exaggeration ratio. That seems like a really important part of the story that you're telling here. Could you maybe help us unpack the ingredients of an exaggeration ratio? People might say, well, if that's big, that means you're exaggerating, and I think we're all with you there. But here you're saying an exaggeration ratio is something systematically larger than one, so what are the ingredients that go into that?

Erik van Zwet
Yeah, so if you have a clinical trial, you can kind of capture its essence in just three numbers. There's the true effect, there's the estimated effect, and there's the standard error of that estimate. What I call the exaggeration ratio is the ratio of the estimated effect to the true effect. So if that's, let's say, 1.2, it means that the magnitude of the true effect is overestimated by 20%. Now, of course, these things are random. If I select some random study from the Cochrane Database, then it will have some effect and some estimate, and both will vary across clinical trials. In the paper, we study the distribution of this exaggeration, given that you observe a certain z-score in a particular study. A z-score is what measures the significance of a study, and a z-score of about two means that the estimated effect is two standard errors from zero, which will be statistically significant at the 5% level. So given that you've got a study that's just significant, with a z-score of about two, you would expect the exaggeration to be about 60%. You would expect that half the studies that are just significant overestimate the true effect by about 60%.
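
A toy illustration of the exaggeration ratio. The true effect and standard error below are invented for one low-power scenario; the roughly 60% figure in the paper comes from fitting a distribution of effects to the whole Cochrane Database, which this sketch does not attempt.

```python
# Simulate one low-power scenario and look at the exaggeration ratio among
# trials whose z-score lands just above the significance threshold.
import numpy as np

rng = np.random.default_rng(7)
true_effect, se = 0.1, 0.1                      # hypothetical values; power is low here
estimates = rng.normal(true_effect, se, 1_000_000)
z = estimates / se
just_significant = (np.abs(z) > 1.96) & (np.abs(z) < 2.2)

exaggeration = np.abs(estimates[just_significant]) / true_effect
print("median exaggeration ratio:", np.median(exaggeration))   # well above 1
```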

Rosemary Pennington
Wow. So we talked about how reproducing these studies is not necessarily going to capture the exaggeration. I wonder if you and your co-authors have thoughts about how to ensure that these kinds of exaggeration ratios aren't being repeated or reproduced. What needs to be done? What is the intervention to ensure this doesn't keep happening? Because again, this is medical research that could be helping people decide what treatments they get, what drugs they get.

Erik van Zwet
Yeah, so I think that many statisticians will say that these medical researchers should just use larger sample sizes, because, you know, we're statisticians and we like good estimates; we want lots of information and reliable results, and that means large sample sizes. This is of course true, but I don't think it's going to work. Because, like I said, the reality is that you can't just open a can of patients. You don't have a lot of money, these drugs are expensive, and you don't have a lot of time, because maybe a PhD student needs to finish the paper. So there's always a pressure to have small studies; that's always going to be the case. I think this whole exaggeration is a real problem, and there are two ways to address it. The first way is a technical way, which we do in our paper: look at all of the Cochrane Database and say, okay, if this is a typical study, just like a random study from the Cochrane Database, then given the results from this study, we expect that the effect is overestimated by whatever, 60% or something, and shrink it back. So that's a technical solution. A more societal solution is to not focus so much on single studies, but to look more at meta-analyses, because they have a lot more power; they combine multiple studies, which means more patients are involved, and that really reduces this problem of exaggeration. So if you read about a single study, I think the effect must be overestimated, because I'm reading it, because it stood out. Then you can use our method a little bit and say, well, maybe it should be shrunk back by a factor of two. I think that's a safe thing to do: if you read a paper in a journal and it's significant, just divide the effect by two, roughly speaking, why not. But also, if you really want to make decisions about patient care, then don't just rely on a single study; rely on meta-analysis. And in practice, of course, this is done in a sense, because the European Medicines Agency and the FDA in the US require at least two significant studies, so that helps. But meta-analysis is really the way to go to combine information from multiple studies, which, as an added bonus, will also say something about how the effect varies in different populations and in different settings, because that's also something that's really, really important, and another part of the Cochrane Database that we can study.
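
A small sketch of why pooling studies helps: a fixed-effect, inverse-variance-weighted combination of several hypothetical estimates has a much smaller standard error than any single one, so the combined result needs far less luck to reach significance. The numbers below are invented for illustration.

```python
# Fixed-effect meta-analysis: combine estimates weighted by 1/SE^2.
import numpy as np

estimates = np.array([0.45, 0.20, 0.30, 0.10])   # hypothetical trial results
ses       = np.array([0.20, 0.15, 0.25, 0.18])   # their standard errors

weights = 1 / ses**2
pooled_estimate = np.sum(weights * estimates) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))

print(f"pooled estimate: {pooled_estimate:.3f} (SE {pooled_se:.3f})")
# The pooled SE is smaller than any single trial's SE.
```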

John Bailer
You know, I thought it was really interesting when you were looking at some of the results, particularly the exaggeration ratio, by thinking about studies that had small standard errors versus larger standard errors, depending on how variable they were. There were certain patterns where, if a study was more variable, you'd be likely to see higher degrees of exaggeration. But that got me thinking about other contexts. I wonder if this is universally true, or if you were doing a study of medical devices, sort of clinical trials for medical devices, whether you would see the same magnitude of exaggeration as you would in a study of some cancer treatment. So, I don't know, do you have any intuition or insight about whether this exaggeration would be expected to be the same across different application areas?

Erik van Zwet
Yeah, so I did a little bit of research about that. There's always going to be some exaggeration, because just the fact that something is significant means that it got lucky to some degree. This paper says something about clinical trials, all of them, and you can stratify, or zoom in, on certain types of clinical trials, of course. In the Cochrane Database there are different medical specialties, like oncology or neurology or psychiatry, and it's all pretty similar. However, there are also different phases of clinical research. Phase one and phase two trials are sort of trying it out a little bit in small populations; phase three trials are much more serious, a lot more money is riding on them, and they really want to nail it down. Phase three trials do have much more power, and consequently they don't need as much luck to become significant, and therefore they're not as exaggerated. So there are differences, and it does make a difference which setting you're thinking about. The Cochrane Database is unique in the sense that it's all about medical research. It would be very interesting to know something about sociological research, or psychological research, because of the replication crisis, as they say, which plays a big role in those areas. But there isn't a nice resource like the Cochrane Database for psychology. And just looking at the literature, we know that not everything makes it in; there's a lot of publication bias, the file drawer effect, where we really only get to see the wins. The nice thing about the Cochrane Database is that, even though there will be some selection in it, it's really pretty complete. I would love to have a resource like that for psychology, but it's just not there. Registered reports are becoming a thing, where journals will publish based on the study plan and not so much on the study result. That will be a real help, because then we'll get a complete record, and then we can see how research is done in other areas. But yeah, I have my suspicions; we just can't do the computations that we do for the Cochrane.

Rosemary Pennington
Well, that's all the time we have for this episode of Stats and Stories. Erik, thank you so much for joining us today. Stats and Stories is a partnership between Miami University’s Departments of Statistics, and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter, Apple podcasts, or other places you can find podcasts. If you’d like to share your thoughts on the program send your email to statsandstories@miamioh.edu or check us out at statsandstories.net, and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.