Yoav Benjamini is the Nathan and Lily Silver Professor of Applied Statistics in the Department of Statistics and Operations Research at Tel Aviv University. He is a co-developer of the widely used and cited False Discovery Rate concept and methodology. He received the Israel Prize for research in Statistics and Economics, is a member of the Israel Academy of Sciences and Humanities, and has been elected to receive the Karl Pearson Prize of the ISI this summer.

## Full Transcript

Rosemary Pennington: The Karl Pearson Prize for contemporary research contribution is one of the biggest awards in the statistical community. Given out every two years by the International Statistical Institute, the award recognizes research published sometime in the last 30 years that has had a "profound influence on statistical theory, methodology, practice or applications." The work of the latest winner is the subject of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington. Stats and Stories is a production of Miami University's departments of Statistics and Media, Journalism and Film, and the American Statistical Association. Joining me in the studio are regular panelists John Bailer, Chair of Miami's Statistics Department, and Richard Campbell, former Chair of Media, Journalism, and Film. Our guest today is Yoav Benjamini, the Nathan and Lily Silver Professor of Applied Statistics at Tel Aviv University. Benjamini is also the 2019 winner of the ISI's Karl Pearson Prize for his work developing the concept of the false discovery rate. Yoav, congratulations and thank you so much for being here today.

Yoav Benjamini: Thank you very much, it's a pleasure to be here with you.

Pennington: You're being recognized for a paper that was published in 1995 and has been cited more than 50,000 times; it has clearly had an impact on the field. How did this concept of false discovery rate come about, and how did you end up writing about it?

Benjamini: Well, that's an interesting story. The issue is addressing simultaneous and selective inference, which is the technical term; I'll explain shortly. It comes up when you're facing a problem of multiple inferences: having to compute many confidence intervals and choose the most promising one, or looking at many groups, making comparisons between them, and choosing the largest outcome and the smallest outcome and making that comparison. All of these questions raise the problem that the regular statistical tool fails to keep its properties. If we allow ourselves to do a test of hypothesis comparing treatment and control, and allow ourselves a 5% chance of erroneously making a statement about a discovery, then we know that there is a 5% chance that we are in error even when there is no real discovery. Suppose you're facing 20 such potential discoveries. Now, on average, you will find one which is within this error.
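
The arithmetic behind that last remark can be checked in a couple of lines. This is a sketch added for illustration, not code from the interview: with 20 independent tests of true null hypotheses at the 5% level, the expected number of false discoveries is exactly one, and the chance of making at least one is about 64%.

```python
# Sketch of the multiplicity problem: m independent tests of true
# null hypotheses, each at level alpha. The expected number of false
# rejections is m * alpha; the probability of at least one false
# rejection is 1 - (1 - alpha)**m.
m, alpha = 20, 0.05

expected_false = m * alpha              # 20 * 0.05 = 1.0
p_at_least_one = 1 - (1 - alpha) ** m   # roughly 0.64

print(f"expected false discoveries: {expected_false:.2f}")
print(f"P(at least one false discovery): {p_at_least_one:.3f}")
```

This is why a single-test error rate stops being meaningful as the number of potential discoveries grows.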

So the question is: how do you cope? How do you keep the original properties of the statistical procedure when facing a large number of potential outcomes? And these are not theoretical questions. If you look at any drug discovery experiment, you usually have one or two major measures, or, as they're called, endpoints: measures of success, of the efficacy of the drug. But on top of that, nowadays you have dozens of other indications of the success of the drug.

Let me give one example. Suppose there is a drug, and it's given in four different doses, and it is given together with another, commonly used drug. The efficacy is measured not only at five years, as I said before, but also half a year after, and then long-term survival, and then progression-free survival, if it is cancer. So you can easily accumulate dozens of indications of the success of the treatment. And if you then pick up those handfuls which show success and treat them as if they are pure and innocent, you lose the statistical properties. You have to protect against that.

Bailer: So, with multiple comparisons, like the example you just gave: if you wanted to compare seven groups, there were 21 pairs of group comparisons you could make, but false discovery rate didn't catch fire at that point. It was with more recent technology and testing that your ideas of false discovery caught on and have really been applied. Can you talk a little bit about those?

Benjamini: The point is that at that time the way to handle it, as suggested by Tukey and others, was to have very strong control; I called it panic control: control the probability of making even a single error. And that resulted in very strong requirements for valid proof. For us, for Yosi Hochberg and for me, I wasn't working in the area at first. For me the problem came up when he was involved in research about a drug related to blood pressure. There was a quality-of-life questionnaire with more than 100 items. There was a joint score, and that was a single indication of success, but you also wanted to say: does this drug reduce headaches? Does it improve other things? The problem was that using any of the methods that control the probability of making even one error (and Yosi was one of the designers of such methods, the Hochberg method), we ended up shorthanded. It didn't provide any indication of success. So we started to do some other things. At that time there was a paper that discussed a graphical method for estimating the number of null hypotheses from a graph of the p-values. The point was that you could use the bigger data to estimate what is really the problem that you're facing. Because if you understand that of the 100 hypotheses you're facing about 50 are real effects, you don't have to offer as much protection as otherwise. This is where we started with the false discovery rate.

Campbell: Talk about that.

Benjamini: The point was that, essentially, after reading a paper which argued about false discoveries: the author wasn't talking about control, he was talking about errors, saying let's look at the relative errors, the expected number of false discoveries out of all discoveries. We captured this and formulated it into a new concept, and this is the false discovery rate. Instead of controlling the number of errors, or the probability of making even one error, we control the proportion of errors among the discoveries. It becomes a ratio, and we are trying to control only that. Now, that means that if we make many rejections, we allow errors. And that was the issue. We were more lenient than the usual panic approach.

And so the objection from many of our reviewers was that here we abandon the strict protection that the usual approach kept: do not allow any error, with probability 0.05. Depending on how close they were to the problem, some people appreciated it and some did not. Interestingly, once we got the result, very quickly I sent it to John Tukey, who wasn't my advisor at Princeton, where I did my Ph.D., because I thought that somebody had already done it. And he just complained about the fact that the pages weren't numbered.

Pennington: That sounds like a professor.

Benjamini: So, a month later we submitted the paper; that was 1989. And as you mentioned, it took us three journals and almost five years; not for the original idea, but for the truncated idea. As I noted, we had a method of estimating the number of true null hypotheses, and that's useful in methods for controlling the false discovery rate. We didn't include that, and the result published in 1995 was the Benjamini-Hochberg procedure, which is non-adaptive in nature.
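
For readers who want to see the procedure concretely, here is a minimal sketch of the Benjamini-Hochberg step-up procedure as it is usually stated (my own illustration, with made-up p-values): sort the m p-values, find the largest rank k such that the k-th smallest p-value is at most (k/m)q, and reject the k hypotheses with the smallest p-values.

```python
# Minimal sketch of the Benjamini-Hochberg (BH) step-up procedure.
# Sort the m p-values, find the largest rank k with
# p_(k) <= (k / m) * q, and reject hypotheses with the k smallest
# p-values. Controls the false discovery rate at level q for
# independent tests.
def benjamini_hochberg(pvals, q=0.05):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k = rank          # step-up: keep the largest passing rank
    rejected = set(order[:k])
    return [i in rejected for i in range(m)]

# Illustrative p-values (already sorted for readability).
pvals = [0.0001, 0.0004, 0.0019, 0.0095, 0.0201,
         0.0278, 0.0298, 0.0344, 0.0459, 0.3240,
         0.4262, 0.5719, 0.6528, 0.7590, 1.0000]

flags = benjamini_hochberg(pvals, q=0.05)
print(sum(flags))   # the 4 smallest p-values are rejected here
```

Note the step-up character: later p-values face a more lenient threshold, which is exactly where the procedure is gentler than controlling the probability of even one error.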

I think one interesting note, to understand how statistics has changed, is the fact that in the first review we went through, we were questioned about the concept we had, so we changed it a bit, and we were asked to do a simulation. In the second version that we submitted to the same journal, we did a simulation for problems of size 4, 8, 16, and 64 hypotheses being tested. And we got complaints from the reviewer: how can someone think about doing 64 tests of hypotheses? You have to stop at 20; nobody does more than that.

Pennington: You're listening to Stats and Stories, and today we're talking with Tel Aviv University's Yoav Benjamini, winner of the 2019 Karl Pearson Prize from the International Statistical Institute. Yoav, this paper was published in 1995, and we've talked a bit about the problems you faced in getting it published. It's been cited more than 50,000 times. When did you and your co-author realize this was something that was picking up steam? After all the struggle of getting it published, when you saw people start picking it up, was there a moment when you thought, "Oh, yes, this is really going to have the impact that we want it to have on the field"?

Benjamini: 1995 wasn't the end of the road. The original idea was published in 2000; it took another five years. I knew from the beginning that it was something that would work. I had problems with the concept of controlling the probability of having even one error, because how can I, as a statistician, control the probability of making any error throughout my lifetime at 0.05? Controlling the false discovery rate is feasible if you do each study separately. So I had a good feeling that we were on the right road.

I think that there were a few people, and a few areas of science, that really made an impact on its acceptance. The first was a visit by David Donoho, which came after a collaborator of mine, Felix Abramovich, returned from sabbatical and brought in the idea of wavelets; in wavelet analysis we started to show that it makes sense to threshold wavelet coefficients with FDR, with the BH approach. Then David Donoho spent a sabbatical in Israel in 1994, got the idea, liked it very much, went back to Stanford and talked about it. People around Stanford, most notably Brad Efron and his students, and Jonathan Taylor, and then other people, got interested in it. Essentially at the same time, around the year 2000, the number of genes analyzed for differential gene expression was of the order of 2,000 to 4,000, and suddenly the idea of false discovery rate control became a practical solution for problems of this size. So two things happened jointly. The work with Donoho, together with Iain Johnstone and Felix Abramovich, continued through 1995, 1996, 1997; Johnstone talked about it in the Wald Lectures in 1997. The paper wasn't published until much later, but it had a big impact; the thinking behind it was terrifically well founded. Concepts like testimation came up: if you are facing a big problem, use FDR control to leave out the coefficients which don't pass the threshold of significance (strangely enough, significance makes sense here), and estimate all the other parameters. That's good enough, with theoretical justification that in sparse problems it achieves the right rate of asymptotic optimality, including the constant, and so on and so forth. So that also gave a push to the theoretical interest in the false discovery rate.

Bailer: So, with gene expression, with thousands of genes being studied, it's easy to imagine the number of comparisons one might make. I was reading some of the background material you sent, where it was noted that FDR has been used in astronomy and brain research and psychology. Could you talk about a few examples in areas other than gene expression where FDR has proven to be very useful?

Benjamini: Sure. One of the things I do is work myself in these fields: in animal behavior, in meteorology, in brain research. I'm a member of the Sagol School of Neuroscience, so my interests are very much in applied areas of statistics and other sciences. And one of the nice things about FDR is that these problems have given rise to the development of FDR methodology. So, for instance, our interest in climatological areas, together with my ex-student, now colleague, Dani Yekutieli, brought about an interest in dealing with dependency. That was our original work back in the 1990s, and then dependency, because of medical research, brought about the study of the properties of the procedures under positive dependency.

In brain research, especially in fMRI, you image the brain while the subject is functioning to try to locate areas of activation. In this functional MRI area of research, the problem is that there are many voxels: you can do inference on individual points in the brain, but you're actually interested in islands of activity, in regions of activity. Suddenly it's not only the individual points but clusters of points that are interesting. Many of us were interested in that, Nichols and others, and they showed the need to do a hierarchical analysis of the discovery rate: first at the level of clusters, and then going down inside them.

Our most recent work with Chiara Sabatti, Marina Bogomolov, and Christine Peterson deals with microbiome data, and that brings up trees: the biological tree built out of species, families, and so on and so forth. As we enter more and more interesting problems, we have to build new methodology based on the original concept, and sometimes modify the concepts as well.

Campbell: So one of the things you're bringing up here is the complexity of this kind of research in science and statistics. One of the jobs journalists have is to try to explain this to a general audience, and you've got a couple of journalists sitting here. Do you have any frustrations with journalists? Any tips for journalists who need to do a good job explaining the complexity of the work you've been involved in, and some of the things you've just been talking about?

Benjamini: You are asking this question on a bad day, because I've been involved over the last year in running a committee on designing strategies for data science at the Israel Academy, and today a description of what we did came out in the local paper, and it's not only completely out of focus, it's simply wrong.

Campbell: So, what did the journalists do wrong?

Benjamini: I think I learned this from one journalist. You see, people working on multiplicity, like Tukey and other people, were interested in the problem of selection. The fact that you're selecting particular inferences changes the statistical properties of what you do. It's the selection. And the way to deal with that is simultaneous inference. I think simultaneous inference is a difficult concept for journalists to grasp.

Campbell: Yes.

Benjamini: I think that the false discovery rate is more intuitive. Say I'm checking 130 types of food, looking at people who take more of them or less of them, and seeing how that affects their getting a specific cancer. Let's say eight of them seem to have an effect, so I have eight discoveries. With a false discovery rate, people understand that some of them might be false, because I'm never sure when I'm doing statistical analysis. So in that sense, I think that for journalists it is a reasonable concept. I'm not asking about the family and what defines the family and so on. I'm presenting you discoveries, and I want to tell you my assessment of the proportion of false discoveries here. I can't tell you exactly, but maybe I can tell you on the average. So in that sense, I think it's something that is communicable.
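
As a rough back-of-the-envelope reading of that food example (my own illustration; the numbers Benjamini mentions are treated as hypothetical): if the eight discoveries come from a procedure controlling the FDR at q = 0.05, then on average about a proportion q of them is expected to be false, here roughly 0.4 discoveries.

```python
# Hypothetical numbers from the food example: 130 foods screened,
# 8 flagged as discoveries by a procedure controlling FDR at q = 0.05.
# FDR control bounds the expected *proportion* of false discoveries
# among the discoveries, so a rough heuristic reading is about
# q * (number of discoveries) false ones on average.
foods_tested = 130
discoveries = 8
q = 0.05

expected_false_discoveries = q * discoveries
print(expected_false_discoveries)   # prints 0.4
```

This heuristic is exactly the communicable quantity Benjamini describes: a statement about the average, not a guarantee about any single discovery.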

Campbell: Yes. I understand that.

Bailer: So there have been recent discussions on reproducibility: the American Statistical Association has a p-value statement, there have been conferences, there have been special issues of journals talking about this. Can you reflect on this issue and what you think about p-values and the role of p-values in inference?

Benjamini: Yes, because it's very much related to what I was saying about the nature of selective inference. I think the whole attack on the p-value is not justified, and in fact there is a retreat from it. P-values are our purest defense against being fooled by randomness, in the sense that they need minimal assumptions, and in fact, if you are conducting a well-designed experiment, you can be sure that the assumptions will hold.

There is no other statistical method that can guarantee that. If you're computing an estimator, you're assuming something beyond that, which you cannot assure. So in that sense, there is no reason to attack the p-value. Now, the p-value was singled out because there are some problems with misinterpretation, but that's true of any other method; it is true of confidence intervals and likelihood ratios and so on. The point is that when the problems of replicability and reproducibility started, it was claimed that statistics wasn't part of the problem. There were issues like being open and so on and so forth, but the statistical methods weren't seen as part of it.

And there was a void, and the void was entered by people from outside the profession, and it was easy to attack the p-value. Especially when you see the factions within statistics, the Bayesian religion, the likelihood religion, and so on and so forth; that's an opportunity to attack the p-value. But that's not what's really behind it, and the answer is not swapping a threshold like p less than 0.05 for p less than 0.005. In my view, there are two basic problems that underlie all statistical problems, and one of them has changed in recent years. The first is selective inference. The second is hunting down the real uncertainty that assures replicability. And I'm talking about replicability, not reproducibility, because replicability is a property of the result of a study, and it can only be established by replication. But we can try to enhance replicability, and one of the obstacles to replicability is the fact that people don't address selective inference.

In the examples I gave, genomics, brain research and so on, people started to address the problem when the size of the problems came close to or past 1,000. In QTL analysis it was about 700; in gene expression it was 4,000. When you reach 4,000, people realize that they have to address selective inference. In medical research for regulatory purposes, they are very careful about selective inference. But in the regular papers being published in the best of the medical journals, like the New England Journal of Medicine, on average, in a sample of 100 papers, there are 27 different inferences made. In experimental psychology the average is around 70, and you can find papers with 600 to 700. And there are people that don't address selective inference at all, or if they do, it's in a minimal way, not appropriately. This is something that has changed because science was industrialized some 20 years ago, with the invention of tools like genomics and proteomics, brain imaging, and recording tools in psychology. It is so easy to collect so many things; epidemiology passed from pulling out patients' records by hand to pulling them out of the computer. It's so easy to collect the information. Slowly the size increased, and people didn't pay attention to that. So selective inference is not addressed, and this is true for all procedures: confidence intervals, p-values, likelihood ratios, or credence intervals. All of these suffer from this problem.

Pennington: Well, Yoav, I think we can probably talk about this all day long. Thank you so much for spending part of your day with us.

Benjamini: You're very welcome. I'm happy to share my ideas with you and with the audience.

Pennington: That's all the time we have for this episode of Stats and Stories. Stats and Stories is a partnership between Miami University's departments of Statistics, and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter or Apple podcasts or other places where you find podcasts. If you'd like to share your thoughts on the program send your emails to statsandstories@miamioh.edu or check us out at statsandstories.net, and be sure to listen for future editions of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics.