Xiao-Li Meng, the Whipple V. N. Jones Professor of Statistics, and the Founding Editor-in-Chief of Harvard Data Science Review, is well known for his depth and breadth in research, his innovation and passion in pedagogy, his vision and effectiveness in administration, as well as for his engaging and entertaining style as a speaker and writer. Meng was named the best statistician under the age of 40 by COPSS (Committee of Presidents of Statistical Societies) in 2001, and he is the recipient of numerous awards and honors for his more than 150 publications in at least a dozen theoretical and methodological areas, as well as in areas of pedagogy and professional development.
Full Transcript
Rosemary Pennington: There's a lot of talk about data science. Newsrooms are investing in staffers who are data savvy, and universities are creating data science related majors. However, there's no real consensus on just what data science is. A new journal is hoping to create a shared understanding of data science, and that's the focus of this episode of Stats and Stories. Stats and Stories is a production of Miami University's Departments of Statistics and Media, Journalism and Film, as well as the American Statistical Association. Joining me in the studio are regular panelists John Bailer, chair of Miami's Statistics Department, and Richard Campbell, former and founding chair of Media, Journalism & Film. Our guest today is Xiao-li Meng. Meng is the Whipple V.N. Jones Professor of Statistics at Harvard University. His research interests include the theoretical foundations of statistics, statistical methodologies, and the application of statistics in areas such as engineering and the social sciences. Meng is also the editor-in-chief of the recently launched Harvard Data Science Review. Xiao-li, thank you so much for being here today.
Xiao-li Meng: It's great to be here. Thank you.
Pennington: Just to get started, could you talk about what the mission of the journal is?
Meng: Absolutely. To put it very simply, the mission of Harvard Data Science Review, which is a pretty ambitious mission, is to help define and shape what data science is. As you said, there has been a lot of talk about data science, but it has been very hard to define what it actually is. What we hope is that, through the articles and content we select, we will showcase the relevant parts of data science. What I have discovered in the last year of working on this is that data science, as probably everybody expected, is really very broad, and I'm happy to follow up with some details later.
John Bailer: So, in your editorial in the first issue that just came out, you described data science as an ecosystem. Could you talk a little bit about why you picked that metaphor?
Meng: Yes, thank you for asking that question; I was hoping somebody would notice. But seriously: if you think about data science as a single discipline, you run into all sorts of issues, as I tried to elaborate in my editorial. If instead you think about data science as a way of thinking, and about how much it is involved in every kind of discipline, it's very hard to think of any single discipline where data science is irrelevant. So if you think of it holistically, and also look through all the members of my editorial board, you realize that the best word for what we're talking about really is "ecosystem." It certainly evolves, it feeds on itself, and it has all kinds of complications, like any ecosystem. It can even have disasters, just as you have natural disasters within a natural ecosystem, particularly with all these AI developments, the technology, and the impact on society. That's the best word I could come up with to describe the enormous scope of data science itself.
Richard Campbell: Can you talk a little bit about why you don't like the term "big data"? I've both read about this and seen you talk about it, so could you say a bit about that?
Meng: Absolutely. I guess any reputable statistician probably knows the limitations of the phrase. "Big data" is there for a very simple reason: it's a very catchy phrase, and I completely understand that as a kind of PR tool it has been very effective. But the problem with the term is that it emphasizes only the size of the data, without any indication of the quality of the data, and by now many people understand that data quality is really far more important than data quantity. I've done some recent work showing mathematically why it's absolutely crucial to take data quality into account, because it's so easy to be misled by saying, "Oh, I have a million answers." My most recent work shows, really mathematically, that you could have a few million people, yet in terms of real information the data are only worth a few hundred answers from a well-controlled study. So I think the term "big data" has probably encouraged people to think only about the size of data, without worrying about the quality, the complexity, the variety, all kinds of things about the data.
Bailer: Can you talk a little bit more about what you mean by data quality? Characterize that, and then differentiate it from quantity. I think we understand what quantity is in this context, but what do you mean when you're talking about the quality of data?
Meng: Let me give you a specific example, again from my most recent study, using the 2016 election. During that time there were many surveys and opinion polls conducted by newspapers, radio stations, and online outlets, so if you think about data quantity you would say, "Wow, there are so many of them." I did some rough calculations: in the couple of months before election day, probably up to about 2.3 million people had responded. That's about one percent of the voting population, and that is probably an overestimate by itself. To put that in plain terms, that's roughly 2,200 surveys, each with about a thousand people in it. If you work in survey research, or even if you're just a member of the public trying to understand what that means, you'll see that's a lot of surveys. A thousand people is a decent size; most surveys vary from a few hundred to a few thousand, and you almost never see an opinion poll with a hundred thousand people in it. So you would think that if I have over two thousand surveys, each with a thousand responses, and they all point in the same direction, you'd be pretty much convinced that that's the reality. And we all know what actually happened afterwards. Almost regardless of your ideology, whichever party you were supporting or not supporting, we were all surprised: how could all these surveys be wrong, and all in the same direction? The surprise comes not because any particular survey said Clinton was going to win; it's the collective numbers. We think, "That's a huge number of people who have responded." But in my calculation, since we knew what the answer was after the election, there is a way to back-calculate how much bias there was in people's response behavior.
When I say "bias" here, I mean that people refused to give an answer when they actually had an opinion. It doesn't necessarily mean people lied; it simply means they said, "I don't feel comfortable giving my answer." Taking that into account, my calculation shows that with those 2.3 million responses, the information in that data set is statistically equivalent, and I can show this mathematically, completely equivalent, to about 400 people responding to you honestly. That's the kind of calculation I mean by data quality, because the bias was induced by people's self-selected way of responding to the surveys. The whole statistical theory is mostly based on the idea that data are representative, and what I was able to show is how striking it can be when the data deviate from this representativeness. Even if it seems like a small deviation, because of the large population there is a multiplying factor, which is a bit of a detail I won't get into, but basically the phenomenon is that the larger the population, the more the bias gets compounded rather than eliminated. That goes against the common wisdom that if you have lots of data you should be getting it right. Unfortunately, lots of data only help to confirm the bias, and it's that kind of bias that affects data quality. There are specific ways to measure it, and that's what I mean by data quality.
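The back-of-the-envelope calculation Meng describes can be sketched with the effective-sample-size identity from his published work on the "big data paradox." A minimal illustration follows; note that the specific data defect correlation and eligible-voter count below are assumptions drawn from Meng's 2018 analysis, not figures stated in this interview:

```python
def effective_sample_size(n, N, rho):
    """Effective sample size for a biased (self-selected) sample.

    Per Meng (2018), "Statistical Paradises and Paradoxes in Big
    Data (I)", the error of a self-selected sample mean factors into
    a data defect correlation rho (between the act of responding and
    the answer itself), a data-quantity term, and the population
    standard deviation. A biased sample of size n from a population
    of size N then carries roughly as much information about the mean
    as a simple random sample of size n / ((N - n) * rho**2),
    ignoring the finite-population correction (negligible here).
    """
    return n / ((N - n) * rho ** 2)

# Illustrative 2016-election numbers (assumed values, in the spirit
# of Meng's published example):
n = 2_300_000      # ~2.3 million poll responses
N = 231_557_000    # ~231.6 million eligible US voters
rho = -0.005       # estimated data defect correlation

n_eff = effective_sample_size(n, N, rho)
print(round(n_eff))  # roughly 400: the "honest respondents" equivalent
```

Note how even a tiny defect correlation (half a percent) shrinks millions of responses to a few hundred respondents' worth of information, because the (N - n) factor grows with the population: the "law of large populations" Meng alludes to.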
Pennington: You've talked about the idea of data science being an ecosystem, and about this issue of data quality. What kinds of articles are you hoping to see published in the Harvard Data Science Review?
Meng: That's a great question. If you look at the first issue, just on the data quality issue itself, we had three articles. One is by a computer scientist, which is what people would normally expect: a computer scientist talking about data processing and data analysis. Yes, we have that article, but we also have two others. One is from a leading scholar in social science and information and library science, and she talks about the kind of issues that most of us probably know are there but don't realize how much they matter. She writes about what she calls the afterlife of data. The idea is that data get collected and used by someone, but good data sets, particularly large-scale data sets, will usually be used by future investigators and researchers, and in order for them to use the data properly a lot of documentation needs to be done: so-called data curation. These are issues that neither computer scientists nor statisticians usually work on, but they are crucial because they directly affect data quality. On the other end, I have a leading philosopher who wrote about how the notion of data itself is really a philosophical subject. Her point is that there is no such thing as raw data: the moment you decide what to collect, how to collect it, and how to measure it, all of that affects your analysis and your conclusions. So these go directly to data quality issues. The kind of article HDSR wants to publish goes beyond the typical data science journals, which focus on the more quantitative aspects, the analysis, the computing part; we really want to look at data science from a holistic perspective, including philosophical and qualitative investigations.
Not to mention there are huge areas around data privacy, data access, and algorithmic transparency; all of those things involve evolving legal studies, and most of the debates involve people's behavior, all kinds of things. So what we're really trying to do, and that's exactly why I use the word "ecosystem," is to present data science in ways that show even those of us who proudly call ourselves data scientists how broad the whole subject is; we may not have realized it.
Pennington: You're listening to Stats and Stories, and today we're talking with Harvard University's Xiao-li Meng, editor-in-chief of the Harvard Data Science Review.
Campbell: Xiao-li, you're both the editor of a journal and someone with a commitment to helping students become better writers. Where did that come from, and how important is it for statisticians, as well as data scientists, to be good writers? Talk about that challenge.
Meng: You know, writing, and more broadly communication, including speaking and other forms, is absolutely crucial, especially if you think about data science as a holistic ecosystem, because a lot of what we do is not just about doing the research; it's about communicating to others why it's important, what its impact is, and, most importantly, what the implications are. Everything we do, and this is also the nature of an ecosystem, comes with positive parts as well as negative or unintended consequences. Throughout my career I benefited tremendously from my professors and teachers helping me think about communication. If you'll allow me, let me share a little story: how I really screwed up the first time I gave a presentation, and how I learned from that point on why communication is absolutely crucial. When I was a student at Harvard, many years ago, we had a requirement of a post-qualifying paper, which had to be about 30 pages, as well as a post-qualifying presentation of about 30 minutes. The first time I was asked to present based on my paper, without knowing anything about what a presentation really means, I told myself: perfect, I have 30 pages and I have 30 minutes. It's a true story: I practiced reading one page per minute. At the time there were no fancy slides or anything, so I xeroxed every single page of my paper and started to read. In the middle of the first page, about one or two minutes in, Herman Chernoff stopped me and asked a question, and I immediately stopped and panicked. I hadn't budgeted any time for questions; I didn't even know professors were allowed to ask questions.
I just started to mumble something, trying to move on, and of course, Herman being Herman, he knew it was time to give me a lesson. He stopped me again, and to this day I remember exactly what he said; I've always used this story to motivate myself and all my students. He said, "Xiao-li, you're not answering my question." I don't know how I got out of that, but it was quite clear that I had no idea what I was doing. I now know that lots of students make the same mistake when presenting to professors: they try to show off how much they have done. That's exactly what I was doing, even if I couldn't see it that way at the time. I was trying to show off: I have a 30-page research article, there's lots of stuff in there, and I wanted to tell everyone what I had done. It was not about communicating; it was not about making people understand one or two key ideas. That was my starting point. At the time, I have to say, there was not as much emphasis on communication as there is now, but we managed; I talked to others, I learned, and I love writing, so that gradually gave me the sense that whether you're communicating orally or in writing, it is always about telling a story. So I love your title, Stats and Stories, because that is literally the metaphor I use with all my students. I say: you should always have a flow, and you should always have a punch line, just as when you tell a story. That is essentially how we approach publishing HDSR as well.
We also made it very clear to all the authors that we will publish many kinds of material, and I have only two requirements. I don't give them a cap on how long or how short an article should be, but two things are important: first, the content has to justify the length; second, it has to be engaging, even for very technical articles, which we will also publish, because that too is one aspect of data science. In our author guidelines we require them to write a summary, or abstract, that is media-friendly; that's the whole idea. I probably should stop here because I've gone on too long, but that story connects to the idea of a student competition, which I can come back to later.
Bailer: Oh, that's great. I have a story about Chernoff faces: someone I was in grad school with was printing off some examples, and the computer center killed the job because they thought someone was playing around; they didn't realize it was an actual analysis. It's great to hear your story. I'd like to ask you about the tagline you have underneath the description of the Harvard Data Science Review. I really love the way you've characterized it as microscopic, telescopic, and kaleidoscopic views of data science, and I was hoping you might give us a very short description, with an example, of each of those.
Meng: Oh, absolutely, though I shouldn't take full credit; I discussed this with a few of my editorial board members, and I know they're very much into finding crisp, catchy phrases. So I worked with them, and each one is actually quite simple. The microscopic one is obviously about examining details: taking a very careful look at what we're doing, because there is really too much hype out there. This is the kind of thing we will publish at the deep end, the research articles. Let me give you one specific example, which is not in the first issue but will be coming in the second or possibly the third, again from the scholar I mentioned, Christine Borgman from UCLA; she is really a guru in information and library science. She will publish an article about, I think, when, why, and how scientists reuse other people's data. You could easily say, well, sure, we use other people's data because the data are useful in some way, but she takes a very scholarly view. Whether you call it library science or a kind of social science study, in this work she classifies data reuse into foreground use and background use. Foreground uses are using other people's data to come up with new conclusions; background uses are using other people's data to make a comparison. Why is that important? Well, when you use data in the background, just to make comparisons, you probably don't need to know everything about it. Of course you should know as much as possible, but it's a lesser requirement for using other people's data.
Making a comparison is something we all do: reviewers will say, well, did you compare to somebody else's method? So you use somebody else's data to showcase what they have done versus yours, and most of the time, of course, our method is better; that's why we get published. But to use other people's data to try to make genuinely new discoveries, you have to understand far better how the data were collected; it's not just for illustrative purposes. One of her conclusions is that if you really want to use other people's data to the fullest, the typical practice has been for scientists to engage the person or team that originally collected the data as part of the core investigator team, whereas comparison users don't do that. There is a lot of interesting material in taking a very careful, scholarly look at even seemingly common, benign practices. That's what I mean by taking a microscopic view.
The telescopic view is also obvious, because we want to have vision: things we can probably only see from very far away, where we see a big picture and a lot of details still need to be worked out, but where at this moment there are a lot of issues people are speculating about. That's basically the kind of article we publish in the first section of HDSR, which is on perspectives. For example, there's Michael Jordan's article on artificial intelligence, "The Revolution Hasn't Happened Yet": his vision, his take that there has definitely been too much hype and a lot of things have not happened yet. But if you read the eleven discussions that accompany it, coming from a real variety of leading scholars as well as people from government sectors, they offer a variety of views, because they have access to different amounts of information, they have different perspectives, and they work with different people. That's what I call telescopic: none of them is sure what the future of AI looks like, but I'm hoping that by putting them together we will at least get a big picture of the scope of AI, and also statistically capture the variation, instead of presenting a single view.
The last one, I can't even pronounce the word correctly. [Laughter] I know the Chinese phrase for it, but I had to ask my colleagues for the English. My understanding of the phrase is that it's essentially about variety: you see all kinds of interesting shapes and colors, all moving. That's basically what we want to showcase: the Harvard Data Science Review will publish all kinds of articles. You can probably see from the first issue that we already have articles from philosophers, social scientists, and computer scientists; we have historians of science; we publish articles on computing as well as on education, an aspect I would love to emphasize more going forward. I look for all kinds of articles. Very briefly, and then I should stop: we are creating columns. For example, one I created is called "Recreations in Randomness": data science in sports, art, entertainment, pastimes, you name it, the kinds of things the general public finds interesting and can relate to. We will use those as pedagogical moments to engage readers to think a little more broadly about how to approach these things in a more rigorous way.
Pennington: That sounds great. Before we wrap up, I do want to ask: you mentioned something along the way about a student competition, so before we end our conversation, I'd like to hear about that.
Meng: Sure, I'll be brief. The idea is that we will publish very technical articles, and one way to handle them, as I said, is to ask the authors to write an executive summary, what we call media-friendly. But there's another way of doing it, which is to get students involved. We're thinking about holding a competition: once an article is posted, we run a competition open to any students, not necessarily PhDs; graduate students, high schoolers, whoever can write, because this will be a pedagogical moment in itself. The students need to read the article and understand it well enough to summarize it in plain language. I also have a feeling this might work better than having the authors do it themselves, because authors tend to overemphasize the aspects they are very fond of, and their summaries tend to be more technical and less accessible. Someone else reading it, particularly a student reader, may or may not get it right, but through the competition I can discover how people actually understand the article. If everyone comes back with something the author didn't intend, then I have to blame the author, not everyone else. That's the kind of process we're thinking about. We haven't put the mechanism together yet, but that's what we want to do, because it serves the general public, it has a strong educational component, and it goes so well with the idea of training students to write better. That kind of writing is not easy, but if somebody can do it well, it's incredibly useful.
Pennington: Well, Xiao-li that’s all the time we have for this episode of Stats and Stories. Thanks so much for being here.
Meng: Thank you. Time is too short.
Pennington: Stats and Stories is a partnership between Miami University's Departments of Statistics and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter, Apple Podcasts, or other places where you find podcasts. If you'd like to share your thoughts on the program, send your emails to firstname.lastname@example.org or check us out at statsandstories.net, and be sure to listen for future editions of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics.