Stephen T. Ziliak is Professor of Economics at Roosevelt University and Conjoint Professor of Business and Law at the University of Newcastle-Australia. A major contributor to the American Statistical Association “Statement on Statistical Significance and P-values” (2016), he is probably best known for his book (with Deirdre N. McCloskey), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives (2008), which shows the damage done by a culture of mindless significance testing, the history of wrong turns, and the benefits that could be enjoyed by returning to Bayesian and Guinnessometric roots.
Full Transcript
Rosemary Pennington: Here are a few facts about Guinness: It is a dark, creamy stout first brewed in Dublin, Ireland in 1759. It remains the best-selling alcoholic drink in Ireland, there are articles written about the best way to pour a pint, and there are articles written about the statistics related to Guinness. That’s the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I’m Rosemary Pennington. Stats and Stories is a production of Miami University’s Departments of Statistics and Media, Journalism and Film, as well as the American Statistical Association. Joining me in the studio today is our regular panelist John Bailer, Chair of Miami’s Statistics Department. Richard Campbell is away today. Our guest is Stephen Ziliak. Regular listeners may remember Ziliak won our Better Bayes contest by cheating with a haiku about Bayes. Today the Roosevelt University Professor of Economics is here to talk about Guinnessometrics. So, Steve, thank you so much for being here again.
John Bailer: I think all of his answers should be in the form of a haiku.
Bailer: So Steve, you’re going to be constrained.
Pennington: So, this is a show about the story behind statistics. I’m just going to ask you to tell us the story of what Guinnessometrics is.
Ziliak: Sure, of course. Thank you so much, Rosemary and John, it’s so great to be back with you. Guinnessometrics is a name that I’ve given to an experimental approach to decision-making, which is based on statistical methodologies. Both the design of the experiments and observational studies and the analysis and decision making that comes after that analysis. It started at the Guinness brewery in the hands, largely, of a man who statisticians know by the name of Student. But Student is actually the pen name of an Oxford-trained natural scientist and mathematician named William Sealy Gosset. Mr. Gosset is the inventor of this design approach and analytical decision-making approach to statistics. And I think it’s quite important for our current crisis in validity, in what I call a crisis of validity. That is the lack of ability of scientific studies to replicate or reproduce, or even find something of real substantive importance. They’re focusing too much on statistical significance.
Bailer: First, I’ve got to tell you just how inspired it is doing your research in the archives at the Guinness Storehouse. When I visited Dublin as part of a World Stat Congress, the first thing I did was go there, and I stopped by the archives and I found the plaque to Gosset that’s on the walls. What a great place to visit. And a really neat story to talk about in terms of the inspiration for both the t-based inference and other ideas. As part of what you were talking about in your paper in that special American Statistician issue, you talked about the idea of people thinking in terms of expected value versus in terms of utility. Can you talk a little bit about that and maybe give an illustration?
Ziliak: Yes, that’s right. The difference between expected- well, let’s start with expected value. Expected value is the expected value of some event that has not yet occurred. There is some probability we think, that it will occur, and there is some value attached to it. It could be a positive value or a negative value. So let’s say that the winner of the flip of one coin, heads or tails, wins a dollar. So it’s just one flip and there’s no entry fee, so most people would say that the probability is equal. It’s one half, and the prize is going to be worth a dollar. So the expected value of that flip would be 50 cents- with no entry fee, assuming no cost of entering. The utility is different. Expected utility says what do people really value? What are their preferences? How do they rank risk and reward at different levels of risk and reward? And so if there’s no entry fee for the coin flip then many people would say sure I’ll play this game. But suppose instead that you change the rules. You have to pay 50 cents to get in the game and then flip the coin, and then whoever wins gets a dollar. That’s going to make some people say “I don’t want to play that game”. They’ll be right at the margin of playing because the expected value is exactly equal to the cost of entering the gamble. But now suppose the entry fee is $1,000, but the prize is $2,000 from one flip of a coin. Some people with very risky preferences will say, “Let’s go for it”. That sounds like me in Reno or Vegas, you know? But many other people would say “No way José, that’s not going to work”.
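The coin-flip arithmetic above can be sketched in a few lines of Python. This is only an illustration, not something from the episode: the starting wealth of $5,000 and the two utility functions (a logarithmic one standing in for a risk-averse player, a squared one for a risk-loving player) are assumptions chosen to make the contrast visible.

```python
import math


def expected_value(prize, p_win=0.5, entry_fee=0.0):
    """Expected net dollar payoff of one gamble."""
    return p_win * prize - entry_fee


def expected_utility_gain(wealth, prize, entry_fee, utility, p_win=0.5):
    """Expected utility of taking the gamble minus the utility of staying out."""
    win = utility(wealth - entry_fee + prize)
    lose = utility(wealth - entry_fee)
    return p_win * win + (1 - p_win) * lose - utility(wealth)


# Hypothetical starting wealth; not a figure from the conversation.
WEALTH = 5_000.0


def risk_averse(w):
    """Assumed concave utility: each extra dollar matters less."""
    return math.log(w)


def risk_loving(w):
    """Assumed convex utility: each extra dollar matters more."""
    return w ** 2


# The three games described above, all decided by one fair coin flip.
ev_free = expected_value(1.0)                           # no entry fee
ev_small = expected_value(1.0, entry_fee=0.50)          # 50-cent entry fee
ev_large = expected_value(2_000.0, entry_fee=1_000.0)   # $1,000 entry fee

averse_small = expected_utility_gain(WEALTH, 1.0, 0.50, risk_averse)
averse_large = expected_utility_gain(WEALTH, 2_000.0, 1_000.0, risk_averse)
loving_large = expected_utility_gain(WEALTH, 2_000.0, 1_000.0, risk_loving)
```

Under these assumptions both paid games have an expected value of exactly zero, yet the risk-averse player loses far more expected utility on the $1,000 game than on the 50-cent game, while the risk-loving player comes out ahead on the $1,000 game. Expected value treats the two paid games as identical; expected utility does not.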
Pennington: So, John and I were talking before we all got on here about what kind of approaches to research would look like if they took a perspective that was more Guinnessometric. I don’t know how to use that word in a sentence, but, sort of in your thinking of how researchers and scholars might adopt a more Gosset-esque approach to studies, what would you imagine them doing? And how is that different from what you think researchers are doing now?
Ziliak: That’s a great question. To use my little spontaneous coin flip example there, I think it helps show the importance of what Gosset, or what we call Student in statistics, was really doing there for the Guinness brewery, and indeed for Irish and English agriculture. Small sample distributions, the t-distribution, for example, small sample tests of significance, including Fisher’s p-test, those actually have an economic interpretation, that was Gosset’s point. It’s difficult, it’s costly to run an experiment on a new strain of barley or on a new type of hops to brew the beer. And understand that Guinness was the largest brewery in the world by 1900. They were selling unpasteurized Guinness stout and draft to the tune of 100 million gallons per year. Can you imagine? 100 million gallons per year, going by boat to the Horn of Africa. So switching barley varieties, switching hop varieties is going to be very expensive for this large brewery. Therefore, the Guinnessometric approach, Gosset’s approach, was to design a small series of independent, stratified, and balanced experiments on particular varieties of barley and hops, the chief ingredients for the beer, all around the Irish-growing regions and English-growing regions, in the case of hops. So in the case of Ireland, Gosset, together with the Irish Department of Agriculture, laid out ten different barley-growing regions, which differed by soil quality, rainfall, and all that kind of stuff. And then they found farmers who were willing to be commissioned by the Guinness brewery to annually run small experiments on new varieties, testing them for yield and brewing quality against previous best competitors. And this is so important, if you don’t mind me completing the thought, Rosemary, you asked me how does Guinnessometrics help advance statistics and science and society over and above what we’re already doing.
Well, right now the National Institutes of Health and World Bank and so on, they’re doing “one and done”. One and done is a very large randomized control study with many thousands of participants and so forth, and then it’s a non-repeated experiment, oftentimes. Well, Gosset and Guinnessometrics said, “No, that’s actually the wrong approach”. In fact, you don’t even want to use complete randomization, it’s better to systematically allocate treatments and controls to experimental units because of all sorts of other error-related issues that are not random. So I hope I answered some of your questions. Sorry.
Bailer: Well, let me ask just for a little clarification on that. My sense with things like NIH, or particularly as you get to the point of drug approval, is that a lot of times those studies are being done at a series of different clinical centers, so you do have essentially the stratification that’s being embedded in there. And also you have different organizations that look at meta-analytic efforts to pool the results across a series of independent studies. So, I mean, I guess I would think that NIH actually is looking, and particularly when I think about drug approval processes, that these types of multiple populations or multiple clinical centers are being built into that. So, I’m thinking that perhaps some of these issues are in fact embedded into what’s being done.
Ziliak: That’s a very interesting point, John. Let me back up and say that in economics and political science and you know, the social sciences, that one and done is- including with my friends and colleagues at the World Bank, one and done with the large-scale randomized control trial is currently considered gold-standard behavior, and you see it all over the journals. You’re right that when you get down to the nitty-gritty of drug approval that there are many other steps being taken and I agree with that. Now whether or not that actually gets us to the final, you know, what Deirdre McCloskey and I would call the correct and best-unbiased assessment of the “oomph” of some new drug, the true efficacy if we can speak in that way for a particular context, that we don’t know, do we? So I guess I would agree with Gosset that it’s better to have independent experiments- start with small series of independent experiments that are both stratified, and have allocation balance, and then you know that as you build up your evidence piece by piece that you’re doing it correctly along the way.
Pennington: Since you brought up McCloskey, I want to ask you a question about this point you raise in an article you published together, and I can’t remember what year it was, about this idea of fit not being the same idea as importance. Which I think you’re kind of dancing around here a little bit so can you explain what you were thinking when you wrote that?
Ziliak: Yes, that’s right. In our book The Cult of Statistical Significance, we’re basically arguing that statistical significance is neither necessary nor sufficient for proving a scientific or business or medical or economic result. And so, fit is not the same thing as importance, and importance is what we want, both as statisticians and decision-makers in our field. So could I give you a short illustration?
Pennington: Yeah, please.
Ziliak: Okay cool. Suppose your mom calls and says she wants to lose some weight. She’s thinking about a diet pill. She knows that you’re good with the computer and web-searching and so on, and you think about data, so could you please help? So, you say, “Sure mom”. So, you call her back after doing the research and you offer mom two different diet pills. The first pill promises to take 20 pounds off of mom on average, but it’s rather uncertain in its effects, at plus or minus ten pounds. This pill is called “pill oomph”. Promises to take 20 pounds off of mom, but it’s uncertain at plus or minus ten pounds. So it could take ten pounds off of mom, it could take 30 pounds off. On average research says it’ll take 20 pounds off. All you hear now is crickets on the phone. You’re like “Mom, are you there?” She’s like “Yeah, why? You were just going blah blah blah blah”. “No mom, stay with me”. One more pill. “Pill precision” that promises to take five pounds off. But pill precision is much more precisely estimated at plus or minus a half-pound, around that average. So pill precision can take five pounds off mom, that’s the average, that’s the prediction. But it could also take four and a half pounds off or five and a half pounds off. Much more tightly estimated. And so the question that we ask in the book of the reader is, which pill for mom?
Pennington: I would go with the one that’s going to give me ten pounds.
Bailer: Yeah, you don’t care about variability if the lower is that much better than the precise one.
Pennington: I’m looking at John to see if this is the right answer.
Ziliak: You would take the one that would guarantee you at least a ten pound weight loss because I presume you’re arguing that that minimum effect from pill oomph actually dominates the maximum effect of pill precision.
Ziliak: The best pill precision will do is take off five and a half right? That’s true. So that’s a- now can I just ask a follow-up question? Might it be the case that some people will want to choose pill precision?
Bailer: Why? I just have trouble imagining that. I mean, it’s precise, but it’s guaranteeing a lower value than something that’s uncertain, where you’re guaranteed a higher return, if that’s what you’re looking for.
Ziliak: Right but that’s the thing is, if that’s what you’re looking for, and if you can handle it. If you go back to that risk and reward example of expected utility versus expected value, you can see that some moms are starting with a baseline weight of 115 pounds, god bless them. So if they lost 30 it might kill them. On the other hand, maybe that same mom wants to fit into a pair of jeans for the summer garden party or whatever, and five pounds would be awesome.
Bailer: I think I was starting with different prior assumptions about where the weight was. So my prior belief of 110 was zero.
Ziliak: Precisely. And I think that’s why in my paper in The American Statistician, “How large are your g-values?” G-value number ten is at the top. It says consider best practice. Compare results with what’s actually going on right now. And that means understanding our baseline variables and evaluating them from the Guinnessometric point of view or the oomph point of view before we make any of these decisions.
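The two-pill illustration can also be written down as a small sketch. It rests on assumptions that go beyond the conversation: the plus-or-minus figures are treated as hard bounds (as the panel does when it speaks of a “guarantee”), and each mom is given a hypothetical `max_safe_loss` in pounds to stand in for her baseline. The names `Pill`, `dominates`, and `choose` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pill:
    name: str
    mean_loss: float  # average pounds lost
    margin: float     # plus-or-minus, in pounds (treated here as a hard bound)

    @property
    def low(self):
        """Smallest weight loss the pill is taken to deliver."""
        return self.mean_loss - self.margin

    @property
    def high(self):
        """Largest weight loss the pill is taken to deliver."""
        return self.mean_loss + self.margin


def dominates(a, b):
    """True when the worst case of pill `a` is at least the best case of pill `b`."""
    return a.low >= b.high


def choose(pills, max_safe_loss):
    """Pick the pill with the biggest guaranteed loss among those that cannot
    exceed `max_safe_loss` (a hypothetical per-person limit). Returns None
    when no pill is safe for this person."""
    safe = [p for p in pills if p.high <= max_safe_loss]
    return max(safe, key=lambda p: p.low) if safe else None


# Figures from the illustration: 20 +/- 10 pounds and 5 +/- 0.5 pounds.
OOMPH = Pill("oomph", mean_loss=20.0, margin=10.0)
PRECISION = Pill("precision", mean_loss=5.0, margin=0.5)
PILLS = [OOMPH, PRECISION]

heavier_mom = choose(PILLS, max_safe_loss=40.0)  # hypothetical limit
lighter_mom = choose(PILLS, max_safe_loss=6.0)   # hypothetical limit
```

With these numbers pill oomph dominates pill precision on size of effect alone, which is the answer Pennington and Bailer give. Once a baseline is added, the lighter mom in this sketch ends up with pill precision, because a 30-pound loss is outside what she can safely handle. The choice depends on the baseline and the stakes, not on the tightness of the estimate.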
Pennington: You’re listening to Stats and Stories and today we are talking with Roosevelt University’s Stephen Ziliak about Guinness, statistics, and perhaps some haiku.
Bailer: That’s right you have a couple of times where you mention that you have some statistics haikus, so we may get back to that. I’d like to change gears a little bit and talk about your progression from working as a caseworker to analyzing welfare to work programs to econometrics. So what were some of those early lessons from the experience of doing casework and welfare to work program analysis that you’ve carried forward, and some of the insights? How has that helped shape you as an econometrician?
Ziliak: That’s a great question, thank you so much for asking me that, John. Gosh, the first thing I guess I discovered as a caseworker is that statistics are made. We make them. They are not something that is given to us. And I think we make a mistake sometimes in our textbooks and in our textbook-based teaching, giving students the impression that data are found, that data are just out there to be picked up like, I don’t know, dollar bills off the sidewalk if you’re so lucky. But data is created. And that was probably the first thing that I learned as a welfare caseworker in my early 20s. Interestingly enough, I ended up writing my dissertation as an econometric and social-historical study of attempts to abolish public welfare in America in the 19th and 20th centuries. And what I learned is that my very city of Indianapolis, where I was living and working as a welfare caseworker, was the premier site for the birth of the Scientific Charity movement in the United States of America, and in fact it was the group, the Scientific Charity people, called the Charity Organization Society, they invented statistical casework. They literally invented it. And so I ended up going back to Indianapolis and looking in the Indiana historical archives at the original case records of welfare recipients and charity recipients from the 1870s to the early 1900s in Indianapolis. From there I created a dataset that enabled me to do some, what we call hazard analysis, and econometric analysis to estimate the different effects that household and external work and environmental variables have on people’s propensity to stay on welfare. So that ended up being very fulfilling for me to go back and actually look at the data, and in some cases, the exact same addresses that I had been visiting 100 years later. To look at welfare from that perspective and that econometric point of view.
But I guess to answer your question, John, probably the biggest thing that’s carried forward with me came from an experience I had at the Indiana Department of Employment and Training Services, now called the Indiana Department of Workforce Development. May I tell you briefly what happened there?
Bailer: Yes please.
Ziliak: Half of my job as a labor market analyst at Employment and Training was to behave as a reference librarian for the Indiana economy. So this is in the age before the internet, people would telephone my office and ask, for example, what’s the GDP per capita in the state of Indiana for the following years, and that kind of thing. I had to find that data for them. Well, one day a man from Gary, Indiana called and he wanted to know the distribution of unemployment rates for black youth workers in Indiana. The metropolitan statistical areas were Gary, Fort Wayne, Indianapolis, and so forth, so for each of the metro areas what’s the unemployment rate for 16- to 21-year-old black workers? I was so confident that I could supply that data for him, being a representative of the US Department of Labor, that I kept my landline on. I said “I’ll be right with you” and I just set the landline phone on the desk. But I couldn’t find the data. So I said I’ll call you back. My boss couldn’t find the data. His boss couldn’t find the data. His boss couldn’t find the data. Finally, we ended up with the Chicago US Department of Labor. “We collect black youth unemployment data for Indiana labor markets but we’re not publishing them.” “How come you’re not publishing the data?” “Well, the p-values are too high.”
Bailer: P-values for what?
Ziliak: The US Department of Labor had a rule at the time, and I believe that’s beginning to change, thanks in part to initiatives from the ASA, but the US Department of Labor in the 1980s had a rule that if the p-value, Fisher’s p-value came in at higher than 0.10, that they would not release the estimates.
Bailer: For testing what?
Ziliak: I’m sorry?
Bailer: What were they testing?
Ziliak: They’re testing unemployment rates for black youth workers relative to some previous benchmark. And what they’re saying is that essentially because of their small sample sizes and lack of investment in this particular area, they had too much variability, they thought, in their estimates. So since they could not find statistical significance in the new unemployment estimates they decided to withhold publication. My argument would be “No, black youth unemployment in Indiana is a major issue. That’s a public policy issue”. And the baseline there is probably 40% unemployed in Gary, Indiana in the late 1980s, and nowadays probably double that or more. So, we start looking at the fact that we have this thing in economics and society that needs to be discussed and solved. Now, whether or not we have statistical significance for the most recent estimates is probably not the most important issue. So basically what happened to me is that I got upset, and I made complaints to the Department of Labor and I said, “I think we’re doing something wrong here”, and I vowed then and there that I was going to fight statistical significance.
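To see how a publication rule of this kind interacts with sample size, here is a minimal sketch. Everything numerical in it is hypothetical except the 0.10 threshold and the roughly 40% baseline mentioned above: the observed rate of 50% and the sample sizes of 30 and 300 are invented for illustration, and the exact two-sided binomial test is an assumed stand-in, since the episode does not say which test the Department of Labor used.

```python
from math import comb

# Threshold quoted in the conversation: estimates were withheld above this p-value.
WITHHOLD_THRESHOLD = 0.10


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def exact_two_sided_p(k, n, p0):
    """Exact two-sided binomial p-value: total probability, under rate p0,
    of every outcome no more likely than the observed count k."""
    observed = binom_pmf(k, n, p0)
    total = sum(
        pmf
        for pmf in (binom_pmf(i, n, p0) for i in range(n + 1))
        if pmf <= observed * (1 + 1e-9)  # small tolerance for float ties
    )
    return min(1.0, total)


def withheld(p_value):
    """Apply the publication rule as described: withhold when p exceeds 0.10."""
    return p_value > WITHHOLD_THRESHOLD


BENCHMARK = 0.40  # baseline rate mentioned above

# Hypothetical samples with the same observed unemployment rate of 50%.
p_small = exact_two_sided_p(k=15, n=30, p0=BENCHMARK)    # 15 of 30 unemployed
p_large = exact_two_sided_p(k=150, n=300, p0=BENCHMARK)  # 150 of 300 unemployed
```

In this sketch the estimated rate is identical in both samples, ten points above the benchmark, yet only the small sample trips the withholding rule. The p-value here is reporting on the size of the sample, not on the size or importance of the unemployment problem.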
Pennington: Before we go, since we’re talking about your history leading up to your experience as an academic, I was reading on your website that you have taught introduction to economics with Grapes of Wrath. So since we’re talking about this issue of society and welfare I kind of want to get your quick rundown of how you use Grapes of Wrath to teach introduction to economics to students who are probably not inclined to care about economics. I would imagine that this is a large class that has students from across the university in it.
Ziliak: Yes, that’s right. You don’t really know who’s going to appear in Econ 100. Students come from all over the university, first and foremost. But I started teaching intro to economics using The Grapes of Wrath my very first semester as a teacher, if you can believe that. I don’t know why I did it at first.
Pennington: Because you have so much energy the first time you’re teaching.
Ziliak: But I kept it going, and I’m going to do it again this fall actually. I haven’t taught The Grapes of Wrath for about maybe five years because I’ve been teaching mostly graduate courses. But I’m actually going to do it again this fall, I’m really looking forward to it. On and off I’ve been doing this for 23-24 years. Economics textbooks suffer from many of the same problems that statistics textbooks do. The world is too pretty, the free markets are too perfect, the people are too rational, and all that kind of stuff. And The Grapes of Wrath is just a perfect counterpoint, oh my gosh. But you know The Grapes of Wrath, for people who don’t know, is centered on a family, which is a tenant-farming family in Oklahoma during the Dust Bowl and the Great Depression which comes along with it. And so this tenant-farming family has virtually no property, and they’re actually illiterate. There are actually three generations of illiterates in the Joad family, the central family in the story, and they’re forced to move. They’re kicked off the land by the bank. Much of the land has been foreclosed and then enclosed, so you have expansion of the farms and replacement of labor with capital, and all that kind of stuff. So the Joad family, this farming family, goes west. They follow a handbill that promises high wages, orange groves, and white picket fences if they’ll just come to California and work in the grape fields and so forth. Well, it doesn’t work out very well because a lot of people followed that same handbill. So supply exceeded demand and wages fell instead of rose. Not so funny. But that’s part of the lesson. That’s something that the students can really relate to because by the time that the wages start falling in California, the students have become completely enchanted by the Joad family and all of them, including Rosasharn, and is her baby going to survive the journey?
The Tom Joad character is also really important from a teaching point of view because he shows that a person might start off as homo economicus, all they care about is themselves, and they’re just going to allocate their limited budget to maximize utility for themselves and who cares what other people think and blah blah blah. But as you follow Tom Joad throughout the novel Grapes of Wrath, you see a person discovering other people, discovering his own empathy, you know? And I think that’s so important.
Pennington: Well, that’s all the time we have. Before we go, Steve, we would like to see if you would be willing to share one of your favorite statistical haiku on the way out?
Ziliak: I had one for you regarding Gosset, regarding the brewer. It’s not my favorite haiku, but if you want my favorite, I’m more than happy to tell you that one.
Bailer: We could have a two for one special, they’re not that long.
Ziliak: Okay. Good point. Okay, here’s the one- Karl Pearson, who was the reigning authority of statistics for a while in the early 1900s, called his friend Gosset a naughty brewer in one of the letters. He called him naughty for playing around with such small samples of numbers. Pearson being a large sample guy. So I have a little haiku, it goes like this: A naughty brewer / made a small sample of beer / and found Student’s t.
Ziliak: And now I’ll tell you my favorite one: Statistical fit / epistemological / strangling of wit.
Pennington: Well, thank you so much for being here today, Steve.
Ziliak: It’s my pleasure, thank you so much.
Pennington: Stats and Stories is a partnership between Miami University’s Departments of Statistics and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter or Apple Podcasts or other places where you can find podcasts. If you’d like to share your thoughts on the program, send your email to email@example.com or check us out at statsandstories.net, and be sure to listen for future episodes of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.