Big Data Soup | Stats + Stories Episode 190 / by Stats Stories


Xiao-Li Meng is the Whipple V. N. Jones Professor of Statistics and the Founding Editor-in-Chief of the Harvard Data Science Review. He is well known for his depth and breadth in research, his innovation and passion in pedagogy, his vision and effectiveness in administration, as well as for his engaging and entertaining style as a speaker and writer. Meng was named the best statistician under the age of 40 by COPSS (the Committee of Presidents of Statistical Societies) in 2001, and he is the recipient of numerous awards and honors for his more than 150 publications in at least a dozen theoretical and methodological areas, as well as in areas of pedagogy and professional development.

Episode Description

Big data, though not new, is often talked about as though it is. It's become something of a buzzword associated with everything from politics to record sales to epidemiology. But not all big data is created the same; some of it might not even be that big at all. That's the focus of this episode of Stats and Stories with guest Xiao-Li Meng.

Full Transcript

Rosemary Pennington
Big data, though not entirely new, is often talked about as though it is. It's become something of a buzzword associated with everything from politics to record sales to epidemiology. But not all big data are created equal; some might not even be as big as they seem. That's the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington. Stats and Stories is a production of Miami University's Departments of Statistics and Media, Journalism and Film, as well as the American Statistical Association. Joining me is regular panelist John Bailer, Chair of Miami's Statistics Department; Richard Campbell is away today. Our guest is Xiao-Li Meng, the Whipple V. N. Jones Professor of Statistics at Harvard University. His research interests include the theoretical foundations of statistics, statistical methodology, and the application of statistics in areas ranging from engineering to the social sciences. Meng is also the founding editor-in-chief of the Harvard Data Science Review. Thank you so much for coming back to the show.

Xiao-Li Meng
Thank you for having me again. I hope that's a vote of confidence.

John Bailer
Very much so, Xiao-Li, very much so. It's great to have you back. It's great to be back.

Rosemary Pennington
A recent Bloomberg Businessweek piece about big data quotes from a 2018 journal article you published about big data and data quality. To get this conversation started, would you explain how this issue of data quality in relation to big data became a concern for you?

Xiao-Li Meng
Right, thank you for asking that question. This actually goes back to 2012. I had a visitor to my department, Jeremy Wu, who was a manager at the Census Bureau. He was working on a great product called OnTheMap, which tried to study US employment and workforce dynamics, and they used a lot of data — survey data, administrative data, all kinds of data — something like 20 different data sources. He came to my department to give a talk, and he posed a question, and that's what got me thinking about this very seriously. The question he posed — because this was to a statistical audience — was this. He said, we all understand that a 5% random sample is better than a 5% non-random sample in measurable ways. And I think all statisticians know that when we say random sample, we don't mean a haphazard sample; we mean a carefully controlled, probabilistically constructed sample. We have studied this for something like 100 years: understanding the randomness helps us ensure a sense of representativeness.

But then the question he posed — and that's exactly what got me really thinking — was this. He said, can you tell me: if I have a 5% random sample, which I know is of good quality, and I have some large data set covering 80% of the population, and I have no idea what the quality of that data set is — which one should I trust more? The 5% of good quality, or the 80% of unknown quality? We're not saying it's of bad quality. Obviously, if the 80% had the same quality as the 5%, sure, we'd go with the 80%. But I don't know what that quality is. So he posed the question to the audience, and I was one of them. A statistical audience is very cautious; they sense this kind of trick question. So if you ask people, 5% versus 80%, which are you going to trust more? — a few smart people in the audience, myself included, will ask: trust for what purpose? That's actually a really essential question. Then Jeremy would clarify: say, for the purpose of estimating the population average. That's a very specific purpose, but it turns out to be a very general one, because many, many things we look at in real-life studies are essentially population averages. With COVID, for instance, the percentage of people who suffered, the vaccine uptake — these are all population averages. So now you ask the question, and how people answer depends on how statistically oriented the audience is. I have tried this myself at quite a few conferences, and for 80% versus 5%, most audiences will trust the 5%, because of their training. Then you can say, okay, what about 5% versus 90% — are you going to change your mind? You will see people start to get a little hesitant. Some still say, no, I'd still choose the 5%. Fine — what about 5% versus 95%? Versus 99%? Then you can see even the die-hard statisticians start to lower their hands. So when I give this talk — and I learned all of this really from Jeremy's single question — I pose it to the audience this way: the fact that you lower your hand means that somewhere there is a tipping point where you are going to switch. But what is the tipping point? How do you calculate it? And what causes it? My 2018 article came out of quite a few years of trying to study that question. And it's really interesting — and very fortunate, for me and probably for all of us — that for this particular question of estimating a population mean, it turns out there is a very simple, almost universal metric for the thing we want to measure, which is how to make this trade-off.
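To make Jeremy Wu's question concrete, here is a minimal simulation sketch (not from the episode): the population size, the true rate, and the self-selection tilt are all made-up numbers for illustration, but they show how a huge sample of unknown quality can be beaten by a small probability sample.

```python
# Hypothetical illustration: a 5% simple random sample vs. an 80% self-selected
# sample for estimating a population proportion. All numbers are made up.
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000                       # population size (hypothetical)
y = rng.binomial(1, 0.52, size=N)   # binary outcome; true mean is about 0.52

# 5% simple random sample: inclusion is unrelated to y.
srs = rng.choice(y, size=int(0.05 * N), replace=False)

# 80% self-selected sample: people with y = 1 are slightly more likely to respond.
p_include = np.where(y == 1, 0.82, 0.78)   # a tiny response tilt
included = rng.random(N) < p_include       # covers roughly 80% of the population
big = y[included]

print(f"true mean          : {y.mean():.4f}")
print(f"5% SRS estimate    : {srs.mean():.4f}  (n = {srs.size:,})")
print(f"80% biased estimate: {big.mean():.4f}  (n = {big.size:,})")
```

Even this barely-tilted 80% sample lands further from the truth than the 5% random sample, because its error is dominated by bias rather than by sampling variability.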

John Bailer
So with this trade-off — and I assume you're referring to your data defect index — can you help unpack that data defect index? You talk about bias: what do you mean by the relative bias, in terms of estimating this population mean as you've described it? And you have it related to some correlation structure and also to something about the population size. Could you describe that, and then talk about how the relative bias is related to those other constituent parts?

Xiao-Li Meng
Right, thank you for the great question — you clearly have read my article, thank you very much. What I was able to do — and here I also want to plug the importance of teaching — was this. When I started thinking about that question, I said, well, how do I even go about it? For measuring sampling variability we have all the statistical tools of probability: we can talk about simple random samples, we can talk about design effects. But the question Jeremy posed was much more about analyzing a non-probabilistic sample — something that is just out there. How do I even measure something like that with the statistical tools we have? And then I was very lucky, in a way, to be reminded of a formula I taught many years ago, when I became an assistant professor at the University of Chicago. In my first semester I was given a book to teach about sample surveys — the older statisticians probably know it, Leslie Kish's 1965 book. And I can tell you, this is not a secret, I can confess: the only reason I was qualified to teach the course is that I had never learned the material. So I dove into the book, and I learned a lot. In the process, I learned there is a particular formula used to control the bias of a particular type of estimator, called a ratio estimator. It's not quite relevant here, but when I looked at the formula, I realized it allows me to decompose any estimation error — for estimating the population mean by the sample mean — into three components, and only three components. One component is controlled by the data size as well as the population size; it's easy to see that if your sample size gets close to the population size, your estimation error should get lower and lower, so there is one term controlling that. Another is what I call the problem difficulty, which captures how hard it is to estimate the population average. In statisticians' terms it is essentially the standard deviation, because it measures how homogeneous the population is: if everybody in the population has exactly the same opinion, you only need one person to find the average, and the worst case is a 50/50 split.

Okay, so there is a term measuring that. What's left is the key term — the middle term — measuring what I call the data quality. That term is the correlation between the answers you are supposed to give and whether you get included in the sample or not. The simplest example is the early days of COVID-19, when we tried to find out how many people in the population would test positive. At the beginning, when we did these tests, we did not say, hey, let's randomly select a bunch of people and give them a test. We gave tests to the people we suspected — they had either already developed symptoms or had some serious exposure to people carrying the virus. So there is a positive correlation between being infected and being tested, and that's the correlation I'm talking about. It turns out you need to measure that correlation, and that's what I call the data defect correlation. Why do I call it a defect? Because the higher the correlation, the more selection bias there is in your sample. It's quite clear that if everybody in your sample is already infected, your sample conclusion will be that the whole population is infected, if you use only that sample as your estimate. So it is that correlation which is the key. The traditional way of getting good-quality data, through probabilistic sampling, is to ensure that correlation is zero on average. Because if your inclusion is determined by a sampling mechanism — you don't get to choose whether you are part of the study or not — then you can prove mathematically that the correlation is zero on average. And what I found when I did this study is that not only is it zero on average, practically it is so small: you can show mathematically that the correlation is on the order of one over the square root of the population size, not the sample size. That's the kind of thing most of us had not looked at that carefully. I can even be very precise: the variance of that correlation — because the correlation itself is random — is, for a simple random sample, one over capital N minus one, where capital N is the population size. So you will see that in practice, for a truly simple random sample, or equivalently these kinds of equal-probability samples, that quantity is incredibly small. The problem is when you have selection bias, either by design — you only test people who are more likely to have the disease — or by self-selection, like in the 2016 election, when people felt it was not a popular answer to say they were inclined to vote for a certain candidate, so they didn't tell you. That induces bias as well. In either case, the correlation is going to stay put; it is not going to shrink with the population size. And that's where the problem lies.
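For readers who want the formula itself, here is one compact way to write the decomposition described above, following the notation of Meng's 2018 article; R is the 0/1 indicator of whether a person ends up in the sample, Y is the quantity being averaged, n is the sample size, and N is the population size.

```latex
% Requires amsmath. Error decomposition for estimating the population mean
% \bar{Y}_N by the sample mean \bar{Y}_n.
\[
\underbrace{\bar{Y}_n - \bar{Y}_N}_{\text{estimation error}}
  \;=\;
\underbrace{\hat{\rho}_{R,Y}}_{\text{data quality}}
  \times
\underbrace{\sqrt{\frac{N-n}{n}}}_{\text{data quantity}}
  \times
\underbrace{\sigma_Y}_{\text{problem difficulty}}
\]
% The middle factor equals \sqrt{(1-f)/f} with sampling fraction f = n/N.
% Under simple random sampling, \hat{\rho}_{R,Y} has mean zero and variance
% 1/(N-1), i.e. it is of order 1/\sqrt{N} -- which is why probability samples
% work so well; under selection bias it stays roughly constant instead.
```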

Rosemary Pennington
You're listening to Stats and Stories, and today we're talking to Harvard University's Xiao-Li Meng, editor-in-chief of the Harvard Data Science Review. There's this quote the Bloomberg piece pulled out of your article where you talk about the wishful thinking of wanting to rely on the bigness of big data. Could you unpack that a little bit? What do you mean when you say there's this wishful thinking that you can rely on the bigness of big data?

Xiao-Li Meng
Right, because this is also the insight I gained from the mathematical formula, and initially I did not believe it myself, literally. You might say, yes, we understand there is some deterioration of the sample quality, but if I have enough of it, I should be fine. How wrong that can be. So what I did in the article was the following thought experiment. Let's say I know there is a defect correlation — say, half a percent. Where did I get the half percent? I actually used the 2016 election data to estimate it. After the election we knew who won, so you can use the truth to deduce what the correlation was, and it turned out to be roughly a minus half-percent correlation between the tendency not to respond — that's why it's a negative correlation — and actually voting for Trump. So you say, okay, take that half percent — it's not even 5%, it's not even 1% correlation — how much damage can it do? Since I have the mathematical formula, I can write down the equivalent sample size: the size of a sample without that defect that would deliver exactly the same statistical error. These are precise mathematical forms you can write down. My statistical audience will know what I'm talking about — it's called the effective sample size, effectively how much sample I really have. And that's where the mind-boggling result comes from; that's what Bloomberg put in their title. So here is what I did. I went back about two to three months before the November election and looked at how many surveys were reported. You will find something like 20, 30, 50 surveys — in newspapers, in media, in social media, all kinds of stuff. So I did a very rough calculation and put them all together. Each survey usually has something like 500 to 5,000 people; you rarely see opinion surveys with a lot more than 5,000, right?

So I put all these together and calculated roughly that about 2.3 million people had contributed their opinions. That's about 1% of the voting population. The question is, with retrospective understanding, what are these 2.3 million answers worth? That's roughly the equivalent of 2,300 surveys, each with 1,000 people in it. You would say, well, that's a lot of evidence — if they all say Clinton is going to win, then we should all believe it, right, with 2,300 surveys? But I'm teasing you — I'm not giving the answer yet; I tease the audience this way as well. I did the calculation, and it turns out that those 2.3 million people — because it's not the absolute number that matters, it's that they only represent 1% of the population — with that 1% coverage and this minus half-percent correlation, are equivalent to about 400 people's answers from a simple random sample. That is the reduction that gets quoted: a 99.98% reduction, no exaggeration whatsoever — it's a mathematical formula. But the first time I got it, I shocked myself. I said, wait, wait, wait, this cannot be — and all my colleagues said, Xiao-Li, you've got to get this right, because otherwise you're telling me all these things are wasted. I checked it multiple times, I had my best students help me, and it went through the Annals of Applied Statistics — it's all correct. And of course, as faculty we all understand why that's the case. Now, this is for my statistical audience: we know the mean squared error is the variance plus the squared bias, but the variance goes down as one over little n. When the sample size is reasonably large — 2.3 million — that one-over-n part disappears, so whatever is left is all bias. Now think about how large n would need to be to overcome this — it's enormous. So that's what I meant about trying to play this game of "I have enough people." Yes, you can eventually compensate, but we're talking about needing something like 99% of the population, and you don't have that. Usually, if you have 50% of the population you are very happy — but even then, with that half-percent correlation, you are still getting a very tiny effective sample size. So that's my point: you would rather spend money investing in good small-scale surveys than trust self-reported online data. Because, as we all know, the online population is different — it's people with strong opinions they want to express. That's what happens, and that does not really capture the big picture of the population.
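A back-of-the-envelope sketch of the effective-sample-size arithmetic described above; the formula used here is the standard approximation obtained by matching mean squared errors (a simplification of the exact expression in Meng's paper), and the inputs are the numbers quoted in the episode.

```python
# Effective sample size sketch: 2.3 million survey responses covering about 1%
# of the voting population, with a data defect correlation of about -0.005.
n = 2_300_000          # combined pre-election survey respondents (as quoted)
f = 0.01               # fraction of the voting population covered
rho = -0.005           # estimated data defect correlation for 2016

# Matching the biased sample's MSE to the variance of a simple random sample
# gives an approximate effective sample size of f/(1-f) * 1/rho^2.
n_eff = (f / (1 - f)) / rho**2

print(f"nominal sample size  : {n:,}")
print(f"effective sample size: {n_eff:,.0f}")                  # about 400
print(f"reduction            : {100 * (1 - n_eff / n):.2f}%")   # about 99.98%
```

Running this reproduces the figures from the conversation: roughly 400 effective respondents out of 2.3 million nominal ones, a 99.98% reduction.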

John Bailer
This is a great story, and as I'm listening to you tell it, I'm wondering: how do you communicate this in a way that reaches people who aren't going to read a technical paper, who aren't going to think about deconstructing the sources of variation in a study? There was a great headline grabber in Bloomberg — I love "big data can be 99.8% smaller than it appears." And I thought you approached it really nicely with one of your examples, your soup example. But are there others? Can you talk a little bit about that example, as well as other strategies for helping someone understand that bigger is not better — that more of the wrong thing doesn't make up for a smaller set of the right thing? How do you tell that story?

Xiao-Li Meng
That's a great point, and I want to emphasize it even more: when you have a lot of the wrong thing, you actually ensure you never get it right, because you will be so overly confident. Even statistically — for those who know there is something called a confidence interval — your confidence interval will be very narrow but situated in the wrong place, so you will never get it right. As for the analogy: when you're asked to taste how salty or how delicious a soup is, the size of the container does not matter. What matters is whether you mix it well. If you mix it well, then a few spoonfuls are all you need. But the problem is this assumption of mixing well. Think about mixing a small soup versus a gigantic soup — mixing the gigantic soup takes a lot more effort. Say, unfortunately, a lump of salt gets stuck somewhere in this gigantic pot. Unless you catch that lump of salt, you're just not going to get it right.

On the other hand, if it's in a small soup, you might have a better chance to catch it. So the problem is that once you don't mix well, things get much harder, and the larger the population, the harder it is to eliminate that kind of bias. Another analogy I use to help people understand this notion of mixing well: every one of us, unfortunately or fortunately, sometimes needs a blood test. When the nurse takes your blood, they don't ask how much you weigh; they don't take an amount of blood according to your size. Everybody gives pretty much the same amount of blood. Why? Because your blood is well mixed — hopefully. That's why blood taken from your arm can tell you about your liver, about your kidneys; you don't need to go to the kidney or the liver to take blood for the test. It's because it's mixed well. That's why mixing is so important: the idea is that the big thing becomes homogeneous, so any part of it will be representative. But once you don't have that mixing — either by design, or because you mixed it wrongly, or because people self-select — you can see the problem becomes much, much harder. And I think what we have not done in the statistical field itself is to really study what happens when you don't mix well — how do you quantify that? That's essentially what my 2018 paper tries to do, and really a lot more needs to be done. I was very lucky to have the low-hanging fruit: for estimating a population mean, one simple quantity can capture the whole problem. But there are many more problems that need to be worked on.

Rosemary Pennington
You've spoken to your statistical audience several times; what would you suggest for the non-statistical audience as they consume news about big data? Because it does come up in everything, right? Every time you turn around, people talk about the big data Facebook is scooping up, and they talk about how big data is going to change medical science. What advice would you give the casual consumer of news about big data, about how they could judge whether that big data is worth trusting or worth investigating?

Xiao-Li Meng
Yes, that's absolutely a great question. There's a short answer and a long answer. The long answer would take a PhD to really understand, but the short answer is very simple: always ask, where did the data come from, and who collected them? An additional question would be: originally collected for what purpose? The moment you say it's Facebook data — Facebook has done many, many studies, but we should all understand, and I don't think I have to convince people of this, that Facebook users are not necessarily representative of the whole country. Currently I'm engaged in a study looking at estimating vaccine uptake — what percentage of the population has taken a particular vaccine — and we're using data from a Facebook survey, from a US Census Bureau survey, as well as from the CDC, which supposedly should be the most accurate, because they are not doing a survey; they try to report everything. And in the study so far — this will be quite obvious to anyone looking at the numbers — their answers are very different. The Facebook estimate can be as high as about 70%, and the CDC's is about 50%. So we need to look into the reasons for these differences. Facebook might be overestimating; the CDC might have under-reporting for all kinds of reasons. But there is one thing we are sure of, and should be sure of: these differences are not due to sampling error. These are not the traditional statistical sampling errors. There are biases in the sampling frame — in the original population the sample is taken from — biases due to non-response, and biases that depend on how people respond, like someone who has not taken the vaccine but thinks the "correct" answer is to say, I have done that. There are all these issues. So for the general audience: always ask where the data come from and who collected them. The other question I would ask, especially when comparing different surveys, is how the question was asked. And there's another one most people may not realize — if you invite me back, I'll give you another story about the impact of the ordering of questions in a survey. An identical question, depending on where you put it, can change the estimated rate enormously.

John Bailer
Right, yeah. That framing stuff is really fascinating, and how it plays out in these studies. In this same Bloomberg piece, I really enjoyed the comment where you talk about the idea that big data is not a substitute for, but a supplement to, traditional data collection and analysis. And I know you've been involved in talking with colleagues at the UN Statistical Commission about some of the work they are considering and how they incorporate big data sources. Could you talk a little bit about what some of these national statistical organizations were thinking about doing, and what some of the cautions were that you gave them?

Xiao-Li Meng
Yeah, thank you. In fact, it was interesting that I got invited by the UN statistical agency because of the 2018 article — someone had read it. So they invited me, and they held a real international conference and workshop. They were trying to answer the question that many countries were and are asking: should we still do these time-consuming, costly, careful surveys when we have so many big data sources out there, like administrative data? That's why they invited me, because my article addressed exactly that question. You can probably guess that my answer to them was a resounding yes — you definitely still need to do the surveys. What I did was show, using the soup analogy and by presenting my study — and I think I did have a little bit of impact in convincing people — that you should never abandon these carefully designed studies, because probably all your truth comes from these quality studies, rather than from the unknown-quality big data.

Rosemary Pennington
That is all the time we have for this episode, Xiao-li thank you so much for joining us again. Stats and Stories is a partnership between Miami University’s Departments of Statistics, and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter, Apple podcasts, or other places you can find podcasts. If you’d like to share your thoughts on the program send your email to statsandstories@miamioh.edu or check us out at statsandstories.net, and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.