R R R R R ... Not Just For Pirates Anymore | Stats + Stories Episode 62 / by Stats Stories


Hadley Wickham (@hadleywickham) is Chief Scientist at RStudio, a member of the R Foundation, and Adjunct Professor at Stanford University and the University of Auckland. He builds tools (both computational and cognitive) to make data science easier, faster, and more fun. His work includes packages for data science (the tidyverse, including ggplot2, dplyr, tidyr, purrr, and readr) and principled software development (roxygen2, testthat, devtools). He is also a writer, educator, and speaker promoting the use of R for data science (http://hadley.nz).

Full Transcript

(Background music plays)

Rosemary Pennington : The explosion of data available to virtually everyone highlights the need for tools to explore and understand it. Data analysis tools range from programming languages such as Python, C and Java to commercial software such as SPSS or SAS. A relatively new tool is the open source software platform R. Developers can create R packages dealing with everything from basic analysis to complicated data visualizations. R is twenty-five years old this year and is the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington. Stats and Stories is a partnership between Miami University's departments of statistics and media, journalism and film, as well as the American Statistical Association. Joining me in the studio as always is regular panelist John Bailer, chair of Miami's statistics department. Richard Campbell is away today. Our guest is Hadley Wickham. Wickham is chief scientist at RStudio, as well as an adjunct professor of statistics at the University of Auckland, Stanford University and Rice University. He's also a bit of a rock star in the R world, nerd famous for the R programs he's developed. Thank you so much for being here today, Hadley.

Hadley Wickham: Thanks for having me.

Pennington : I've heard R described as a programming language or as an environment. So, when you are explaining what R is to people who maybe have no concept of it, how do you explain R?

Wickham : To me, R is kind of like a tool for expressing your thoughts. To me, the interesting part is the language side of the programming language. I think what's most interesting is that it gives you this language, this tool, to get your thoughts about how to tackle the data analysis out of your head and into the computer, because the computer has to do it for you these days; you can't do statistics with pencil and paper. So how can you describe to the computer what you want to do, and how can you help the computer do it as efficiently as possible?

John Bailer: So, you mention the idea of tackling a data analysis problem. You know just for the folks that are listening, how would you describe some of the phases or components of a data analysis problem?

Wickham : So, I think the overarching goal is to take some raw data and turn it into something useful; maybe that's an insight or some knowledge or some action that allows you to make the world a better place in some small way. I have been talking about the seven stages of data analysis. So, you first import your data: you get it from whatever crazy format it's in into R or Python or whatever your data analysis environment is. Then you do some tidying to get it into a structure that makes the rest of your analysis easy, and then you do some data transformation, which is things like computing summaries like means, or even really simple things like counts. Then you go into what I think of as the two main engines of knowledge generation, which are visualization and modeling. Visualization is a fundamentally human activity, modeling is a fundamentally computational, mathematical activity, and together they give you these really powerful tools to understand what's going on in the data set. You are often going to iterate through those a few times, and then at some point you decide you've wrung some insights out of this data; hopefully it's not that you've tortured the data until it's fixed, but that you've found some real insight. Then you go on to the hardest part of the process, the communication, where you're going to take that mental model you have built up as a data scientist, a statistician, an expert, and explain it to people who may be very familiar with the domain but are probably not so familiar with the precise techniques used.
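
As an illustration of the stages Wickham lists (import, tidy, transform, visualize, model, communicate), here is a minimal sketch. It is written in Python with only the standard library rather than in R with the tidyverse, and the data, column names and function names are all invented for the example; it is not how any of Wickham's packages work internally.

```python
import csv
import io
from statistics import mean

# Hypothetical raw data in a "wide" layout: one row per city, one column per year.
RAW_CSV = """city,2015,2016,2017
Auckland,10,12,15
Houston,20,19,23
"""

def import_data(text):
    """Import: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def tidy(rows):
    """Tidy: reshape so that each row is one observation (city, year, value)."""
    return [
        {"city": row["city"], "year": int(col), "value": float(val)}
        for row in rows
        for col, val in row.items()
        if col != "city"
    ]

def group_by_city(observations):
    """Helper: collect the observations belonging to each city."""
    groups = {}
    for obs in observations:
        groups.setdefault(obs["city"], []).append(obs)
    return groups

def transform(observations):
    """Transform: compute a simple summary, the mean value per city."""
    return {
        city: mean(o["value"] for o in obs)
        for city, obs in group_by_city(observations).items()
    }

def visualize(summary):
    """Visualize: a crude text bar chart, one '#' per unit of the rounded mean."""
    return "\n".join(f"{city:<10}{'#' * round(value)}" for city, value in summary.items())

def model(observations):
    """Model: least-squares slope of value against year, per city."""
    slopes = {}
    for city, obs in group_by_city(observations).items():
        years = [o["year"] for o in obs]
        values = [o["value"] for o in obs]
        mean_year, mean_value = mean(years), mean(values)
        numerator = sum((y - mean_year) * (v - mean_value) for y, v in zip(years, values))
        denominator = sum((y - mean_year) ** 2 for y in years)
        slopes[city] = numerator / denominator
    return slopes

def communicate(summary, slopes):
    """Communicate: turn the numbers into sentences a non-specialist can read."""
    return [
        f"{city}: mean {summary[city]:.1f}, changing by {slopes[city]:+.1f} per year"
        for city in summary
    ]

observations = tidy(import_data(RAW_CSV))
summary = transform(observations)
slopes = model(observations)
print(visualize(summary))
for line in communicate(summary, slopes):
    print(line)
```

In practice the visualize and model steps would be repeated several times, as Wickham says, before anything is communicated.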

Pennington : I was reading that the packages that you've developed have sort of earned this moniker of the tidyverse. Could you sort of explain how that's come about?

Wickham : This is, I think, a really interesting phenomenon. So, the tidyverse as a thing has only been around for a couple of years now, and previously people kind of called my body of work the Hadleyverse, which I just cannot talk about; it just sounds overwhelmingly arrogant.

(Collective laughter)

Wickham : And so that kind of visceral dislike of that term, the Hadleyverse, prevented me from thinking about the collection of my work. So, a couple of years ago I said, I have to come up with a name for it so I can talk about it, think about it, and, you know, recognize that it's not just me: there's now a team of people working on this at RStudio and hundreds of contributors in the R community as a whole. And, you know, it took a surprising amount of time to come up with the tidyverse, which I think seems obvious in retrospect now, because some of my earlier work was around this idea of tidying data.

Bailer : What motivated you to develop a set of connected tools under this umbrella of the tidyverse?

Wickham : That's a good question. I don't know; some of it is just this weird…liking for things that snap together nicely. I mean, the goal of the individual packages, before I was thinking about them as a whole, was: how can we teach data analysis? I've been pretty good at data analysis for a long time now, and I could look at a data set and say, oh, you need to do X, Y and Z to it. But how could I explain those ideas to other people and give them the tools to express that? So the individual packages solved little bits of the problem, I think, about how you do visualization or data manipulation or strings or date-times, but I never really thought about them as a whole: how do all of these pieces fit together, what are the underlying ideas, the underlying philosophy, that help all these pieces fit together? So…

Bailer : As you reflect on your practice of statistics, I mean, we've talked about the R system hitting its twenty-fifth birthday. How has data analysis changed over time as you've practiced it?

Wickham : I think…I don't know, in some ways the biggest change is that it's become sexy.

(Collective laughter)

Wickham : You know, when I majored in statistics and computer science as an undergrad fifteen years ago, that was something like the nerdiest of nerds. The only people who majored in statistics were people who had felt, for whatever reason, a passion for statistics from high school. But now it's become so exciting, and so many people are understanding why being able to work with data is such a powerful tool. I think that's just so exciting, and it's moved from being something only for experts, people with a PhD in statistics, to something anyone can do. I just love that people in high school are learning R, learning data science; there are people in the humanities, journalists. That's just so fantastically exciting to me.

(Background music plays)

Pennington : You're listening to Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics. The topic today is the programming language R. Our guest is Hadley Wickham, chief scientist at RStudio. You mentioned the fact that a lot of different people are using R these days. How are journalists using it?

Wickham : I think I see journalists using it in two ways. One is using R as a powerful tool for investigative journalism. There are a few places, and ProPublica is one that I work with, doing a lot of sophisticated work, doing a lot of FOIA requests, thinking about how they can really tackle these big, important stories and use data to get at the heart of what's going on, and I think that's really exciting and awesome. And then we're also seeing newsrooms picking it up for small stories, using it for smaller local investigations, doing a little bit of exploratory visualization and checking out what's going on in the groups as well.

Bailer : You know, as Rosemary has mentioned a couple of times, you serve now in the role of Chief Scientist at RStudio. How do you define data science?

Wickham : That's a tough one. I don't know; I kind of like this broad idea that it's turning data into insight, or turning data into knowledge. I'm pretty happy with this sort of broader definition. The one thing that I do believe in pretty strongly is that to be a data scientist, or to do data science as opposed to doing data analysis, you have to be programming. You can't be pointing and clicking; you've got to be writing in a programming language. That's not to say you can't do a lot in tools like Excel and Tableau; you can do fantastic analysis, and it's not to put down the people using these tools in any way whatsoever. But to me that is one of the distinguishing features of data science: it's about using a programming language.

Pennington : I will jump in here, because I was talking to John about this earlier. I do some data analysis in my own research, but I use largely Excel or SPSS, and I've been intimidated by R. What I've seen people do with it is really interesting, but I don't program and I have no idea how to approach it. So how would you suggest someone in my situation, who wants to do more sophisticated analysis but is also kind of scared of the programming part of that, get over that hump?

Wickham : So, I mean, my sort of biased advice: I have a book out called R for Data Science, which is my attempt to do this. I think the key thing to know about programming is that it's going to be frustrating and it's going to be painful, and it's going to take you a while before you can do the things you can already do now faster, or even at the same speed. So I think it's really important to find some kind of motivating problem that you just can't tackle now, or that is really painful right now, where you might figure out a better way. And to me, a really great way to get into programming is to start with visualization, just because even when you're new you can start creating some really fantastic visualizations after learning just a few big ideas, and it's way, way easier to do that in R than it would be in Excel.

Bailer : I think she might know a statistician or two that would be willing to talk to her if she goes to them…

(Collective laughter heard)

Bailer : She could, she can phone a friend here Hadley.

Wickham : So, I recently went to NICAR, the big investigative data journalism conference. I did a one-day workshop on R there, and there were a bunch of other workshops, and it's just so neat to me to see journalists teaching other journalists how it's done with R, with statistics, asking all these questions. It's really, really neat.

Pennington : Since we're talking about journalism, I might…we ask this of a lot of the people that we have on Stats and Stories. When it comes to data visualizations or uses of R or data journalism, have you seen examples that you've found frustrating, or moments where you see a piece of data journalism and you think, oh, this could be so much better if they had done this? Is there something that you find frustrating in data journalism when you read it or see it?

Wickham : By and large, no, I think.

Pennington : Okay, that's good.

Wickham : I don't know…I'm just like, even if it's done sub-optimally or I could do it better, I don't think that's a bad thing. You try and you know maybe you fail along the way but it's better to do that than not try at all.

Bailer : So, in your course, the workshop you did for the journalists, what was kind of the syllabus, what would that look like for the journalist course?

Wickham : It was sort of excerpts from our data science course. We started out with visualization, talked a little bit about data manipulation, and then covered some tools for functional programming, which I think basically is a way to think about how to automate more of your work. So you solve one little problem; how can you then take that solution and generalize it to solve a whole class of problems, and also to solve many instances of the same basic problem?
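
The idea Wickham describes here, solving one small problem and then applying that solution to many instances, can be sketched as follows. This is a hedged illustration in Python rather than R (where the purrr package plays this role); the function names and the toy datasets are invented for the example.

```python
from statistics import mean

def summarize(values):
    """Solve the problem once: summarize a single dataset."""
    return {"n": len(values), "mean": mean(values)}

def apply_to_all(func, named_inputs):
    """Generalize: apply any one-dataset solution to every named dataset."""
    return {name: func(data) for name, data in named_inputs.items()}

# Hypothetical datasets standing in for many files, regions or groups.
datasets = {
    "north": [3, 5, 7],
    "south": [10, 20],
    "east": [4],
}

summaries = apply_to_all(summarize, datasets)
for name, result in summaries.items():
    print(name, result)

# The same helper works unchanged for a different one-dataset solution.
ranges = apply_to_all(lambda values: max(values) - min(values), datasets)
print(ranges)
```

The design choice is that `apply_to_all` knows nothing about summaries or ranges; it only knows how to repeat a solution, which is what lets one piece of work cover a whole class of problems.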

Pennington : I was reading an AMA that you did a couple of years ago where someone asked you about big data, and journalists, I know, have been covering a lot of big data stories, and you called it overhyped. So I was wondering if you still feel that way, because that was a couple of years ago, and if so, why do you think big data is overhyped?

Wickham : So, you know, yeah, I guess…I don't know. I mean, the thing that's overhyped now is machine learning, deep learning and AI. I think a few years ago it was kind of like, you know, have big data, wave a magic wand and all of your problems will be solved. And now that magic wand is machine learning and AI. So, I don't know; I think absolutely there are problems where having a huge amount of data is really important to solve them, but I think they make up a fairly small proportion of all problems, and even when you do have big data there is often a whole bunch of really, really simple stuff you can do. Like counting. You know, binning and counting, and you can reduce it down to something manageable and still insightful very, very quickly. So in general, I don't know, I kind of think there's a bit of a meme war going on: what is data science, what is big data, what is machine learning…I want data science to be defined as a big tent that's inclusive and welcoming. To me, you don't have to have a PhD in machine learning and thirty terabytes of data to be doing data science; you're doing data science if you've got thirty data points in a Google Sheet that you're pulling down and analyzing in R.
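
To make "binning and counting" concrete, here is a minimal sketch in Python, standard library only. The function name, the bin width and the sample values are assumptions made up for the illustration.

```python
from collections import Counter

def bin_and_count(values, bin_width):
    """Count how many values fall into each bin of the given width.

    Each bin is labelled by its lower edge, so with bin_width=10 the
    value 47 is counted under 40. Values are read one at a time, so the
    input can be any iterable, including one too large to hold in memory.
    """
    counts = Counter()
    for value in values:
        lower_edge = (value // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))

# Hypothetical data: ten ages reduced to a handful of bin counts.
ages = [3, 17, 21, 25, 29, 34, 41, 45, 47, 68]
print(bin_and_count(ages, 10))
# → {0: 1, 10: 1, 20: 3, 30: 1, 40: 3, 60: 1}
```

However many values come in, the result has at most one entry per bin, which is why this kind of reduction turns a very large dataset into something manageable.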

Bailer : You know, you talk about the idea of these new tools like machine learning and AI, these more complicated tools, being overhyped. It seems in some sense that the more inaccessible and black box-ish tools can be, the more they take on this aura of, I don't know, magic, and that leads to that hype. Do you think that's part of the story here?

Wickham : I think that's a part of it. The other thing that I've been interested and pleased to see is really trivial applications of deep learning. Two of them that I've seen recently: there's the hot dog detector app, have you seen this? It basically tells you if there's a hot dog in a photo.

(Laughter)

Pennington : Yes.

Wickham : It was like a spoof on the Silicon Valley TV show, and someone actually made the real app, which works reasonably well. And the other thing is someone made one of these style transfers. If you Google for something like dinosaur deep learning, someone's done this kind of crazy mash-up of dinosaur pictures in the style of, you know, Victorian botanical drawings. The thing I think is neat about these trivial uses is that they show the toolchain has matured enough that it's not so hard to do these things that you only do them for the big, important, serious projects; you can also do them for little fun projects. I think that's also really important, that these tools are sufficiently easy to use that you can start to have fun with them, regardless of what's going on under the hood. I just think that's a really positive development.

Bailer : So, I just want to let you know really quickly that it's like the engineer is now diving into Google to try to find the hot dog detector app as well as the dinosaur deep learning app.

Wickham : Awesome.

(Background music plays)

Pennington : You're listening to Stats and Stories. Today we're talking about the statistical programming language R. You just mentioned these two interesting apps. Are there interesting or surprising ways that people have been using R that have caught you off guard or that you've really enjoyed coming across?

Wickham : I don't know; I see such neat things over time. I just love to see the diversity of uses. I'm seeing them all the time and I just have this sort of background positive feeling. I think the thing that now stands out to me, which is my new criterion for the success of a software product, is when your tool is used to commit academic fraud.

(Collective laughter)

Wickham : So, you know, I don't know how much responsibility a tool builder has for what their tools are used to build, but I think that's sort of an interesting measure of success: the tools are so successful that people are using them to do, you know, unethical things. Which gives me a lot of conflicted feelings.

Bailer : That's an interesting feeling, I like it. So, let me ask you about all the new code development and some of the efforts to think about reproducibility of research. How do you see the role of some of the tools that are being built, given the call for accessibility of the research being done and the ability to replicate it?

Wickham : Yeah, it just seems so incredibly important. Even the issue of computational replicability, just that if you have the data and the code you can get the same results, seems like such a low bar compared to reproducing an entire experiment, re-collecting the data on a different set of people or observational units. And it just seems so important: how can you trust anything if you can't even, given the same code and the same data, recreate the same result? It just seems like building on a terribly shaky foundation. So that seems so important, and now working in any other way is so foreign to me that I can't understand it. But to circle back, I think reproducibility is one of the really big selling points of a programming language over a point-and-click interface: being able to just rerun your script again when the data changes, and then being able to look at what you've done and critique it and understand how you got from the raw data to the figure. I think that's one of the reasons I pretty strongly believe that scientists who are doing data analysis really should be programming.

Bailer : So, one question that we often ask guests when they come on Stats and Stories is what advice they might give to students who are interested in pursuing careers. So, I guess this is a two-part question: what kind of foundation would you recommend for someone who wants to do data science directly from the technical side, and, second, what advice might you give a journalism student who wants to do data journalism, in terms of the background and exposure they should have as part of their studies?

Wickham : Again, I'm going to give my biased answer, which is you should start by reading my R for Data Science book. Not just because it's my book, along with Garrett Grolemund, but because that is exactly the question I am trying to answer with that book. And, you know, while it is far from perfect, it does a pretty good job of at least telling you what the things are that you should know: you need tools for data import and tidying and transformation, visualization, modeling, communication, and those are the big things you should be thinking about. And obviously I'm a passionate believer in R and an evangelist for R, but it doesn't matter what you're using, whether it is Python or JavaScript or something more exotic; it's the programming and the thinking about those pieces that's really important.

Bailer : So, sort of the complement to that. I mean, I think that you've given some really good advice in terms of framing and approaching these problems, but I'm trying to think about a student who's going to sit down with an advisor in journalism and say, I want to do this kind of work, what kind of courses should I take? Do you have particular recommendations on that front?

Wickham : Yeah, so, I would say you need to take enough courses in which you have to program that you become a competent programmer. I don't think it really matters in what language, but I think being forced to program in a class environment is really helpful in getting over the initial hump. I think some of those should be classes where you have to work with databases. Much of the data lives in a database, and if you have the basic skills to get the data out of the database, you will be really, really valued. And similarly, again on the data access side, something about web APIs and web scraping, and how you turn things that look, you know, weird and wonderful into a nice tidy rectangular dataset. I think those are really valuable skills.
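
As a small illustration of "the basic skills to get the data out of the database" and into a tidy rectangular dataset, here is a sketch using Python's built-in sqlite3 module with an in-memory database. The table, its columns and the numbers are all invented for the example; a real project would connect to an existing database instead of creating one.

```python
import sqlite3

# An in-memory database standing in for one that already exists somewhere.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each result row be read by column name

# Hypothetical table: records requests filed with two agencies, by year.
conn.execute("CREATE TABLE requests (agency TEXT, year INTEGER, n_requests INTEGER)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?)",
    [("EPA", 2016, 12), ("EPA", 2017, 18), ("FDA", 2016, 7), ("FDA", 2017, 9)],
)

# Ask the database to do the grouping and counting, then pull out the result.
rows = conn.execute(
    "SELECT agency, SUM(n_requests) AS total "
    "FROM requests GROUP BY agency ORDER BY agency"
).fetchall()

# A tidy rectangular dataset: one dict per row, the same keys in every row.
tidy_rows = [dict(row) for row in rows]
print(tidy_rows)
conn.close()
```

Letting the query do the aggregation means only the small summarized table has to leave the database, which matters when the underlying table is large.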

Pennington : Hadley, in the setup to this episode we talked about the fact that it's the twenty-fifth anniversary year of R, so this is a program that's increasingly embraced in the academic world as well as outside of it. What do you see as the future for R?

Wickham : I think just more of the same. I think it's going to keep growing; the communities of people who use it are going to expand, and more and more people are going to use it. One thing that kind of blows me away at RStudio is that we're starting to talk to companies that have, you know, one hundred or two hundred or a thousand people who are using R at their company, or they're thinking, we've got four hundred people using SAS or some other older system, and how can we retrain all of those people to use R? There's this continued growth; it seems like the growth of R is still accelerating. I'm really, really excited.

Pennington : So, I was reading an interview where you said it's crazy that you've become famous for R. How is it that you have found yourself in this position, how did R become your life's work, and do you still find it crazy that this is what you're famous for?

Wickham : I'm still utterly blown away by the craziness of, you know, people wanting to take selfies with me, wanting me to autograph things, because I set out to get a PhD in statistics; this is not where I imagined my career going by any means. But equally, I've always loved making tools for other people to use, and I've been really fortunate to have turned that into a career and to be so supported by my current employer, RStudio, to do that. And the fact that it's now not just me doing this on my evenings and weekends, but I'm doing this full time and moreover have got a team of like seven people working on pure open source R development, it just, you know, blows me away. I find it so incredibly rewarding and enjoyable, and I hope I can keep on doing this for the rest of my life.

Pennington : Hadley, thank you so much for being here today.

Bailer : Yeah thanks, Hadley.

Wickham : You are welcome. Thanks for having me.

(Background music plays)

Pennington : That is all the time we have for this episode of Stats and Stories. Stats and Stories is a partnership between Miami University's departments of statistics and media, journalism and film, as well as the American Statistical Association. You can follow us on Twitter or iTunes. If you'd like to share your thoughts on the program, send your e-mail to statsandstories@miamioh.edu, and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.