Happy Birthday to R | Stats + Stories Episode 63 / by Stats Stories


Nick Thieme (@FurrierTransform) is a researcher and freelance reporter with writings appearing in Slate Magazine, BuzzFeed News, Significance Magazine, and Undark Magazine. Both his writings and research focus on technology, science, and statistics. Currently, he's working at New America’s Open Technology Institute investigating fair internet usage and net neutrality.

Full Transcript

(Background music plays)

Rosemary Pennington: Open source technologies have been hailed as democratizing spaces - spaces where users can help shape the development and evolution of technologies. If you've used Mozilla's Firefox to visit Stats and Stories, you've used an open source technology. Or if you've stalked a long-lost high school friend on Facebook, you've also used a kind of open source technology. What you may not know, or maybe you do because you listen to Stats and Stories, is that some of the data visualizations you come across as you surf the web using Firefox could have been created with another type of open source tech, this time a statistical programming language known as R. Since it was first launched, millions of users have adopted R for data analysis and presentation. The R community is marking a pretty big milestone in the life of the programming language. R’s twenty-fifth birthday is the focus of this episode of Stats and Stories. Stats and Stories is a partnership between Miami University's Departments of Statistics and Media, Journalism and Film, as well as the American Statistical Association. I'm Rosemary Pennington. Joining me in the studio are our regular panelists John Bailer, Chair of Miami's Statistics Department, and Richard Campbell, Chair of Media, Journalism and Film. Today's guest, making his second appearance on Stats and Stories, is Nick Thieme. Thieme is a freelance reporter whose work has appeared in Slate, BuzzFeed and Undark. He's also a researcher at New America's Open Technology Institute investigating fair internet usage and net neutrality. He's here today to talk about an article he's written for Significance magazine about R's twenty-fifth birthday. Thank you so much for being here again, Nick.

Nick Thieme: Yeah thank you so much for having me.

Pennington: So, I'm going to start right off the bat, why is the anniversary of R something worth celebrating?

Thieme: Oh wow! That’s a good question.

(Collective laughter)

John Bailer: So, you have thirty minutes to answer it.

Thieme: Yeah, I mean I think you summed it up perfectly in the introduction. R democratized statistical computing in a way that no other language really has. I mean, if you talk to data scientists, if you talk to biologists, if you talk to anyone who works with data on a professional or personal basis, they are either familiar with R or they would gush about how much they like it. I happen to fall into the second category. I think it's one of those things where this tool is useful enough and important enough, both in people's lives and in the work that affects the lives of people who aren't data scientists, that it's worth recognizing and celebrating.

Bailer: So, you got to answer the question for folks that have not encountered this before. Why is it called R?

Thieme: Yeah, this was something I actually didn't know prior to researching the piece. It's kind of funny. So, the language itself is a combination of two languages. The way it was created was by these two people, Ross Ihaka and Robert Gentleman, who were professors at the University of New Zealand in the early 1990s. So, they combined different characteristics of these two languages. Scheme was one of them and S was the other one. As you may have noticed, Ross and Robert, both of their names start with R, and Scheme and S, both of those languages start with S. So, as a reference to being sort of the next iteration of both Scheme and S, and in reference to their own names, you get R.

Richard Campbell: For those of us who are journalists on the panel, so what is R? How do you…how would I explain R to my ninety-two-year-old mother, who's a smart person but would need some help here?

Thieme: Yeah of course! So, I won't say statistical computing environment.

(Collective laughter)

Campbell: Very good!

Thieme: I would say it is a language that allows people to directly interact with data and, by typing commands, to turn data into visualizations and useful results. And one of the things that makes R special is that it is free, and that is to say, free as in ideas, not free as in beer. It does also happen to be free, you don't have to pay for it, but what's nice about it is that it's part of this free and open source software movement, so you're allowed to look into the building blocks that make R work and mix them around in a way that lets you produce useful tools for yourself.

Bailer: So why did R decide to roll out as an open source product?

Thieme: So, it was a combination of things, all of which I found kind of interesting. So, I came into R in 2012, and by the time that I'd started using it, right, it was a big enough programming language that I had sort of assumed it had always been the way that it was now. Originally, in the mid-nineties, Ross and Robert had considered selling it. They had originally developed it at the University of New Zealand, and for one reason or another they had used it to teach their undergraduates there how to program in a statistical environment, basically how to run a statistics class. In the process of doing that they realized that, you know, we are in possession of something kind of cool that could be useful to a lot more people. So, they put it online and people started using it and people started developing for it and started trying to contribute. And at some point, in the middle of all this, you know, Ross and Robert looked at one another and said, well, maybe we can sell this, like maybe we can do something. Maybe we can turn this into a product. So actually, they went so far as to buy a book on starting a business. They really, really considered turning this into a product that they could make money on, and it was in large part the contribution of one of those developers I was talking about, Martin Mächler at ETH Zurich, who had already been involved in the free and open source software community. He recommended to them that maybe they should make this free software, they should let anyone who wants to use this use it, anyone who wants to contribute, contribute. So, in an interview with ETH Zurich, actually, he came out and said he wanted to do good for the world, he wanted to do something good for the world, and I think that that impetus to try and do something good is why he recommended and why he pushed for R being open source.
But I also ultimately think it's why Ross and Robert took his recommendation. They wanted to, and I'm speculating here, but, you know, they did think about starting this business, but ultimately they thought it would be more valuable and essentially more successful if it was free and open source instead of a commercial product like, you know, SAS or SPSS.

(Background music plays)

Pennington: You're listening to Stats and Stories, where we discuss the Statistics behind the Stories and the Stories behind the Statistics. The topic today is the statistical programming language R. Our guest is journalist Nick Thieme, who has written an article about the twenty-fifth birthday of R for Significance magazine. Nick, I am someone who was trained in SPSS. That is the software I used to do statistical analysis in my own research. Why should someone like me, who knows how to use that tool, consider moving into R? Right? Hadley tried to convince me the last time we talked, when we talked to him. Why do you…do you have a reason why people sort of move from that space into R, or why people should consider it?

Thieme: Oh what…did he succeed?

Pennington: I'm contemplating. (Laughs)

Thieme: Well if he couldn't I don't know that I can but I…

(Collective laughter)

Thieme: What makes R special and the reason why I personally use R and think that a lot of other people should too is CRAN, it’s the packages. You know with SPSS or with other closed source software like that, if you want to do something new, something that isn't already programmed into it, you have to wait for the new release. You have to…you basically have to wait for the new functions to come out through the original channel, right?

Pennington: Yeah.

Thieme: What makes R special is that you don't. Instead of there being one team of developers and all new functions coming from them on high, with R, functions come from everyone. Everyone who's using it contributes, and you can, as a user, take advantage of, instead of, you know, fifty developers’ work, tens of thousands of developers’ work. So, for example, if you need a tool to scrape the Internet to get a particular kind of data, I can almost guarantee, if you google for long enough, someone has already put that in R for you. It's hard to find that in other languages.

Pennington: You've convinced me because I do a lot of Internet analysis.

Thieme: Oh nice!

(Collective laughter)

Campbell: Nick, I am somebody who is not trained in SPSS or in R. And I study narrative and stories, so I want you to tell me a story about how R helps us visualize data, which I think is one of its important contributions.

Thieme: One of the things that's going into this piece, the piece that's coming out in Significance in August, is a sort of insert that describes the process from data scraping to visualization, and I think that that's important because the narrative of…it's funny that you phrase it that way, because that's exactly, I think, how R works: you can view it as this narrative, right? You have different pieces of it. So, you know, you start from the data, and, for example, if we're talking about the tidyverse, which is Hadley Wickham’s contribution, you start from the raw data and you have things like dplyr. So, dplyr is the beginning of the story. In this case you start with your data and you can transform it into something more usable, something that resembles what clean data should be. Once you have that clean data, you can apply all sorts of transformations to it, anything that calculates the statistics that you might want to think about or visualize, and at this point you have, you know, things like ggplot come in. So, this is a package that takes that transformed data and easily allows us to visualize it. So, for example, in the version of this that is coming out in Significance, we look into video game ratings on Metacritic, and the video game ratings that are there come with user ratings and critic ratings and sales. And the data is not particularly great. It's a little bit ugly, actually. So, you know, you can transform it and make it look a little nicer. Well, once it looks a little nicer, you can start thinking about answers to interesting questions, start thinking about things like, well, is there a relationship between the number of copies of the game that sell and the ratings? Or, which developer has produced the most top ten selling games? And what's great about R and the tidyverse is that it allows you to very easily think about those questions, instead of thinking about the programming.
So, I get to type directly in and say, well, filter out these results and summarize them this way and send it to…and actually those are basically the commands that you type. You have filter and summarize, and it's these semantically meaningful functions that allow you to think about the data and end up visualizing it in an intuitive way, instead of thinking about what you should be typing.
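
The filter-then-summarize workflow described here can be sketched outside R as well. What follows is a minimal Python illustration, not the code from the Significance article: the game titles, developers, and numbers are invented, and the helper names `filter_rows` and `summarize_by` are hypothetical stand-ins that only loosely mirror what dplyr's filter and summarize verbs do.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: titles, developers, and numbers are invented for illustration.
games = [
    {"title": "Game A", "developer": "Studio X", "critic_score": 91, "sales_millions": 12.0},
    {"title": "Game B", "developer": "Studio X", "critic_score": 78, "sales_millions": 3.5},
    {"title": "Game C", "developer": "Studio Y", "critic_score": 85, "sales_millions": 7.0},
    {"title": "Game D", "developer": "Studio Y", "critic_score": None, "sales_millions": 1.0},
]


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def summarize_by(rows, key, column, func):
    """Group rows by `key`, then reduce `column` within each group using `func`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[column])
    return {group: func(values) for group, values in groups.items()}


# Step 1: drop rows with a missing critic score (the "cleaning" step).
rated = filter_rows(games, lambda row: row["critic_score"] is not None)

# Step 2: summarize, producing one value per developer.
mean_score = summarize_by(rated, "developer", "critic_score", mean)
total_sales = summarize_by(rated, "developer", "sales_millions", sum)

print(mean_score)
print(total_sales)
```

Each step is named after what it does to the data, so the script reads close to the question being asked, which is the point made above about semantically meaningful functions.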

Bailer: You know as part of that story or part of that process that you describe and in your Significance piece, you wrote about the idea of recasting messy data into a canonical shape. Would you like to describe what was meant by canonical shape in this context?

Thieme: Yeah absolutely I probably should have cut that.

(Collective laughter)

Bailer: And now you might, right?

Thieme: I’m sorry, what was that again?

Bailer: I said now you might.

Thieme: There are three rules for what's called tidy data. That's the nicer name for the canonical shape. And the rules are basically that every column contains a variable and only a variable, every row contains an observation and only an observation, and each measurement is of a single observational unit. And so, for example, a lot of the time you'll have count data for, say, income, and, you know, in your data sets, say the rows are measurements, right? Like how much money do the people in a particular country make…there we go, and the columns can be something sort of strange. So, we'll say each column is a level of income. So, like, from ten thousand to twenty thousand dollars, how many people make that much? From twenty thousand to thirty thousand, and all the way up the ladder. And if those are the columns, analysis can be kind of tricky, because if I want to know, say, how many people are in a particular range or what the distribution over countries looks like, I have to reshape that data in complicated ways to get to answer the question that I'd like to answer. So instead of thinking about my problem, I'm thinking about this programming problem. So, what you end up doing is, it allows you to write those counts in nicer ways. So instead of having that collection, which could be a very long collection of columns, you can reshape it in a way that takes one of the columns to be, well, these are the income levels. So instead of having them in the columns, we put them in one column, and that way, instead of having this oddly shaped dataset, you'll just have one column that's countries, one column that is income level, and one column that's counts. And that canonical shape ends up being a lot easier and more intuitive to work with.
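
The reshaping described in this answer, from one column per income bracket to one row per country and bracket, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the countries, brackets, and counts are made up, and `to_long` is a hypothetical helper written for this sketch (in R, this kind of reshaping is usually handled by the tidyr package).

```python
# Hypothetical wide table: one row per country, one column per income bracket.
wide = [
    {"country": "Country A", "10k-20k": 120, "20k-30k": 80, "30k-40k": 40},
    {"country": "Country B", "10k-20k": 200, "20k-30k": 150, "30k-40k": 90},
]


def to_long(rows, id_column, names_to, values_to):
    """Turn every non-id column into its own row: (id, column name, value)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], names_to: column, values_to: value}
            )
    return long_rows


# Tidy shape: one column for country, one for income level, one for count.
tidy = to_long(wide, "country", "income_level", "count")

for row in tidy:
    print(row)

# In the tidy shape, "how many people fall in the 20k-30k bracket across
# all countries?" becomes a plain filter followed by a sum.
in_bracket = sum(r["count"] for r in tidy if r["income_level"] == "20k-30k")
print(in_bracket)
```

The wide table needs a new column for every bracket, while the tidy table always has the same three columns, so the same filter-and-sum code keeps working however many brackets there are.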

Bailer: You know, one thing that you’ve mentioned in your Significance piece is the idea that R culture is evolving. I was wondering if you could comment on a couple of milestone dates within R's history and then particular aspects of how this culture has been evolving?

Thieme: Oh wow! Yeah, I would love to get to talk about that. So, there are, I think, two very related major developments in the change in R’s culture. The first happened in the late 2000s, basically, and this was the movement from more of a computer science and developer culture to a data science and user culture, and I'll explain why these two are related in a second. But more or less what happened, I think at least, is that R was a…it grew out of the open source community from the early nineties, and at that time, you know, Google and Facebook and these big data science companies, they weren't around. So, there was a different set of people who were using R, and computer science largely as a whole, but it was the case that many of them were men. There was a huge gender imbalance. And so as R, and these other companies like Google, Facebook and other data science companies, and data science as a field started to grow, it started to attract different users. It attracted a different user base, and one of the people who explained this well to me was Julia Silge, who is a data scientist at Stack Overflow, the developer of the package tidytext. She explained to me that other fields, and the example she used was the life sciences, had a better gender balance. So as R grew and as more users from other fields, like the life sciences, started to come in, the gender balance started to change, and as more women were using R, I think it got to the point where women's issues came to the forefront of the R culture. One of the people who played a huge role in this was Gabriela de Queiroz, who started R-Ladies, and so her story sort of starts in 2012. She moved from Brazil to San Francisco for her master's in statistics, and while she was attending a lot of data science and statistics meetups, the phrase she used, that I think is really important, is that she didn't see herself represented in the audience there.
And that lack of representation led her to partly not feel welcome and partly not want to engage in the way that she deserved to be able to engage, right? So, she created a space, R-Ladies, that’s the R-Ladies meetup, where she could see herself and present an environment for other women and people of underrepresented genders to see themselves represented and feel comfortable asking those questions and making suggestions that they might not otherwise feel comfortable making. And so, I think it's important to mention, as I talk about how R’s culture has approached gender parity, that it is only approaching gender parity and that there's still a lot of work to be done, right? The statistic that jumps out at me is that the R-Ladies task force on women and underrepresented genders, I believe is the name of it, collected seven years of demographic data from the useR! conferences, which are a big series of conferences that R users attend, and in 2004 absolutely none of the invited talks were given by women, which is…

Pennington: Wow!

Thieme: Yeah, it says a lot. And in 2016 that number jumped to…over time, and, you know, over that period, rather, the number jumped into the high thirties. That's a huge improvement and deserves to be commended, right? But parity is fifty percent, not thirty-seven percent. So, I think the work of de Queiroz and a lot of other people who worked hard to improve that gender parity has to be commended, right? You need to absolutely recognize the work that they did in R but at the same time recognize that there's still work to be done in the culture.

(Background music plays)

Pennington: You're listening to Stats and Stories and today we're talking R with journalist Nick Thieme.

Campbell: Nick you mentioned the distinction between data scientists and statisticians and maybe John can help here. So what is the difference?

Thieme: Oh wow! I’m more curious about John’s answer than I am…

(Collective laughter)

Bailer: Hey, I'm here, you're the guest and you go first.

Thieme: Fair enough. Yeah, so I think it might be useful to draw from some of the interviews that I have. So, a lot of the people who I talked to over the course…Hadley mentioned this, Robert Gentleman mentioned this, Thomas Lumley, who's one of the people on the R core team, mentioned this…they all mentioned, at one time or another in the past, sort of either being looked down on or having their work be sort of ignored because it was computational in focus, instead of, say, theoretically focused. I believe Lumley’s line was, "I got tenure despite working on R, not because of it." And so, to me, and I'm sure this is a very broad overgeneralization, but to me the difference between statistics and data science is exactly that emphasis on computation and computer science in statistics. So, the combination of computer science, statistics and mathematics is where I think of data science being, with statistics being one of those components.

Bailer: You know, I think, Nick, that I agree with you. I think you end up taking someone in, so that you end up enhancing a fair amount of computational foundations and skills, coupled with the mathematical foundations and skills, to build this data science kind of person.

Campbell: Is there still that hierarchy, that looking down on data scientists or is that gone away?

Bailer: I think it's just different. It's different. I think, you know, I think people are just on a different type of pursuit. I think in fact it's probably gained a lot of respect with it. What do you think?

Thieme: I think that sounds right to me. I know that some statisticians…so, for example, I don't want to take any of Hadley's stories you have in your mind before me. Oh, so I guess he gets the opportunity to say before this there…

Bailer: It’s true.

Thieme: So, something he might end up mentioning is that a lot of statisticians don't view the tidyverse and his contribution as necessarily being statistics. He's told me that they view it as being something else, maybe computer science or, you know, just sort of this game-playing with the shape of data. That, I think, is wrong, because, you know, statistics deals with data, so if your data is in the wrong shape or you can't use it, then you can't do statistics. But just the fact that people feel comfortable saying that, especially when, you know, RStudio and the tidyverse have had the success and the impact they have, shows that at least in some people's minds that hierarchy is still there.

Bailer: You know, I think that the…what comes to mind for me is to think about what's the history of statistics. I mean, if you think about experimental design, you go to Rothamsted, an experimental station. If you think about the t-test, you go to a brewery in Dublin. You know, this is perhaps just the modern flavor of this. The types of data that people are encountering require tools that are being built out that allow you to explore them computationally.