Data Science Pedagogy | Stats + Stories Episode 253 / by Stats Stories

Mine Çetinkaya-Rundel (@minebocek) is a senior lecturer at University of Edinburgh, an associate professor of the practice at Duke University, and a professional educator at RStudio. She is the author of three open-source statistics textbooks and is an instructor for Coursera. Her work focuses on innovation in statistics and data science pedagogy, with an emphasis on computing, reproducible research, student-centered learning, and open-source education. She works on integrating computation into the undergraduate statistics curriculum, using reproducible research methodologies and analysis of real and complex datasets. In addition to her academic position, she also works with RStudio, where her focus is primarily on education for open-source R packages as well as building resources and tools for educators teaching statistics and data science with R and RStudio.

Episode Description

In the past, Introduction to Statistics classes spent a lot of time covering distribution tables, teaching students to run stats by hand and focusing on statistical procedures. However, educators are continually considering new ways to teach stats, and the increasing popularity of data science makes it a more urgent prospect for some. That's the focus of this episode of Stats and Stories with guest Mine Çetinkaya-Rundel.

Full Transcript

Rosemary Pennington In the past, Introduction to Statistics classes spent a lot of time covering distribution tables, teaching students to run stats by hand and focusing on statistical procedures. However, educators are continually considering new ways to teach stats, and the increasing popularity of data science makes it a more urgent prospect for some. That's the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington. Stats and Stories is a production of Miami University's Departments of Statistics and Media, Journalism and Film, as well as the American Statistical Association. Joining me is regular panelist John Bailer, emeritus professor of statistics at Miami University. Our guest today is Mine Çetinkaya-Rundel. Mine Çetinkaya-Rundel is professor of the practice in the Department of Statistical Science at Duke University and developer educator at Posit, formerly RStudio. Her work focuses on innovation in statistics and data science pedagogy, with an emphasis on computing, reproducible research, student-centered learning and open-source education. She works on integrating computation into the undergraduate stats curriculum, using reproducible research methods and analysis of real and complex datasets. Çetinkaya-Rundel has also written articles for the Journal of Statistics and Data Science Education about the evolution of stats education. Mine, thank you so much for joining us today.

Mine Çetinkaya-Rundel Thanks for having me.

Rosemary Pennington How did stats pedagogy become a focus for you?

Mine Çetinkaya-Rundel
You know, when I decided to go to graduate school for statistics, I was not necessarily sure what I was going to do next with it. And thankfully for me, we had a TA requirement as part of my graduate program, and I realized that I really, really enjoyed the teaching aspect. I had done, you know, bits and pieces of teaching and tutoring as an undergraduate and enjoyed it, but I hadn't really thought about it as a profession for myself until I got a chance to do it as a graduate TA, really enjoyed it, and realized, hey, this might be a path forward. And I was lucky enough to be at UCLA, where Rob Gould is, whose focus is statistics education, and I realized this could be a pathway for me. So it was, you know, happenstance and something I enjoy, and then thinking about how I can turn this into my actual job.

John Bailer That's always great, to be able to turn this interest and this skill into work you can be very successful at. You know, one of the things that I've really enjoyed, I mean, I look back on decades of being an instructor where I'm teaching statistical ideas, and I kind of cringe at some of the aspects of the course, you know, decades ago. But in some sense, it was constrained by technology, it was constrained by the systems in which we work. I found it really interesting that I've gotten to witness things like even the ASA Section on Statistical Education changing its name. And I think you were at the helm when that change was made. So can you talk a little bit about why that was something that was important to do?

Mine Çetinkaya-Rundel Yeah, so a lot of statistics educators are also teaching data science courses, or integrating data science ideas into their statistics courses, and also integrating statistics ideas into their data science courses. Many of us have been tasked with such challenges in the recent past. And, you know, as the education section, we're always thinking about how we make sure that we are offering something that's actually of interest and benefit, that's worth the members' time. And we were seeing a growing number of, for example, JSM submissions that are more on data science education, you know, panels, poster sessions, whatever. And at some point, you want to make sure that you capture the identity of those people. And we really thought about whether it makes sense to support the formation of an interest group in data science education, which might down the line turn into its own section. And I think that was a thing that we sort of worried about, because many of us who perhaps started off as statistics educators, or nowadays are maybe just starting off as statistics and data science educators, don't see these as separate parts of our identity; we sort of do that together. And so we thought changing the name could signal this wider umbrella that we want to collect people under, and potentially also bring folks who are not statisticians by training, but are also teaching data science, into the section and hence the ASA as well.

John Bailer Yeah, I mean, it really resonates, and I'm going to follow up here quickly. In my department, for example, we looked at renaming a degree to data science and statistics, allowing for different tracks within it. We've also, in the course of introducing new courses, been introducing courses in statistical programming and data visualization, more towards the idea of drawing conclusions, you know, extracting information from data sources. What I wanted to ask as a follow-up in particular was, can you give an example or two? When you said, you know, introducing data science topics into an introductory statistics class, what does that look like? What are some of the topics that you think were important to infuse there?

Mine Çetinkaya-Rundel
For example, a larger emphasis on, and taking more time with, ideas around data wrangling. So, at least in my career in statistics education, we have always had this focus on, we should be teaching with real data. But since we did not necessarily always spend as much time teaching students how to clean up data and tidy it up and get it to a format where you can then visualize it and model with it, even if we were using real datasets, we would give students sort of curated real datasets that were ready to go, and maybe you needed to do just a couple of transformations to some columns. The change is stepping back and saying, well, maybe your dataset does not even come as a rectangle. You know, maybe it's actually a JSON file that you need to first change into a tabular format, or maybe you need to go get it yourself by scraping it off the web. So these are things that we're doing to prepare the data for the analysis that we want to do. And I will say that even beyond the data preparation stage, at the analysis stage, another idea that I feel like I have been infusing more and more in my intro courses, and I know others have as well, is going beyond the emphasis on statistical inference. So keeping that there, and the idea around uncertainty, but also starting to think about how we can discuss predictive modeling with students early on, because it turns out that's the sort of thing they're always interested in.
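
The JSON-to-table step described here can be sketched in a few lines. This is only an illustration in Python using the standard library, not code from the episode or from any course mentioned in it; the records, field names, and the helper functions `flatten` and `to_table` are all hypothetical.

```python
import json

# Hypothetical nested JSON, standing in for a data source that is not a rectangle.
raw = """
[
  {"id": 1, "name": "Ada", "scores": {"midterm": 88, "final": 93}},
  {"id": 2, "name": "Ben", "scores": {"midterm": 75}}
]
"""

def flatten(record, parent_key="", sep="_"):
    """Flatten nested dictionaries into one level, e.g. scores.final -> scores_final."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat

def to_table(records):
    """Return (columns, rows); fields missing from a record become None."""
    flat_records = [flatten(r) for r in records]
    columns = []
    for rec in flat_records:
        for col in rec:
            if col not in columns:
                columns.append(col)
    rows = [[rec.get(col) for col in columns] for rec in flat_records]
    return columns, rows

columns, rows = to_table(json.loads(raw))
print(columns)
for row in rows:
    print(row)
```

The second record has no final score, so its row carries a missing value in that column, which is the kind of gap a curated dataset would usually have hidden from students.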

Rosemary Pennington
You know, data is literally everywhere now. It's so easy for people to grab a public dataset, and data is used increasingly in journalism, blog posts and Twitter posts. How has this increasingly public nature of data, and statistical data in particular, affected your approach or your thinking about how to teach stats?

Mine Çetinkaya-Rundel
It takes me zillions of hours more to prepare my class. Because it is so hard to just say, I'll just use the same dataset I used last semester, when I know that the day before I ran into some cool tweet where someone was like, okay, I did this in my class, or I came across this paper, and here's the data that goes along with it. So I think it has had many, many positive impacts on my teaching as the instructor, because there are so many newer things that I can pull from. I mean, just the other day, this happened to me: a week ago, I saw a tweet from somebody saying that in their data visualization course they had scraped data on Yankee Candle reviews and put them together with the COVID data, and then showed them potentially syncing up. And so I thought, well, I'm teaching web scraping this week, I have to do this. So that's what my students will be doing tomorrow. At the same time, I should mention, though, that because datasets themselves are so easy to find, what I've seen is that the quality of data documentation has not gone up at the same pace as the amount of data that's available. So there are so many repositories of datasets, but they don't even have a readme that says how the data was collected, why it was collected, who it was collected from. So it is wonderful that we can tell our students, you know, for your final project, go get a dataset and do something with it. I mean, I love that approach. But it is so difficult to keep them disciplined in paying attention to the provenance of the data, when that information, it turns out, is much harder to come across.

John Bailer
Well, that's a great point. I mean, this issue of dataset availability, this is where I think that often, with convenience, you don't necessarily know the relevance for a population of interest. And here's where, you know, when you were talking about the statistical ideas that might be relevant for somebody who's teaching data science maybe from a computer science perspective, having this idea of, do I have a representative sample, becomes critical. I mean, it could be complete nonsense in the absence of that, so I can see that aspect really coming out. So you've talked about working with data that's kind of messy in a stat class to bring in that relevance, and then this idea of the statistical principles and ideas, making them relevant for a group that doesn't necessarily have that exposure. There's another aspect of using data that you've mentioned in some of your work, and it's the issues of ethics that come into play when you're using such data. So, you know, how do you talk about that? How do you get students to think about what the use agreements are when they're grabbing these data, and what's appropriate?

Mine Çetinkaya-Rundel Yeah, so we do this in two ways. One of them is sort of procedural, which is we actually teach them the tools to check, are you allowed to scrape data? So each website has this, you know, robots.txt file that actually tells a web crawler whether it can get data from there or not, and I teach with R, and there's an R package that allows you to do this sort of check. So the very beginning of our web scraping lesson, for example, is let's check this. But then, before teaching web scraping, we actually talk to students about data ethics, and give them some case studies that maybe many of us know about at this point, like the OkCupid one, where it was legal to get the data and release the data, but was it ethical? And there were, you know, quite different ideas from the researchers who released it versus the public and the other researchers who said, well, you know, these people are highly identifiable here. So we bring in examples like this and things that happened. And also we try to get students to think about the ways in which they emit data, and how they would feel if that data was used for various, you know, things. So oftentimes, they're like, I am fine if my social media profile is seen by someone, but not equally fine if my social media profile is being used to make hiring decisions about me. So I think getting them to think about their own data is also important, you know, to make them more responsible about their own data. But beyond that, sometimes it's a little harder to care about other people's data, but it can be a lot more emotional, and hence trigger a more immediate response, if you're talking about your own data. So we try to talk to them about the ethics, but also the tools that they can use, and then bring the two together in making a decision around whether they should grab that data, and once they do, how they should use it and how they should allow for the reuse of it.
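
The procedural check described here, reading a site's robots.txt before scraping, can be sketched as follows. The guest teaches this with an R package; the sketch below instead uses Python's standard `urllib.robotparser` module, and the robots.txt content, the example.com URLs, and the helper name `allowed_to_fetch` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. For a live site, RobotFileParser.set_url()
# followed by read() would fetch the site's own file instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def allowed_to_fetch(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

for url in ("https://example.com/reviews/page1",
            "https://example.com/private/users"):
    print(url, allowed_to_fetch(ROBOTS_TXT, url))
```

A passing check only answers the procedural question. As the OkCupid case study suggests, whether the data should be collected and released is a separate decision.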

Rosemary Pennington You're listening to Stats and Stories, and today we're talking to statistics pedagogy expert Mine Çetinkaya-Rundel.

John Bailer
Yeah, that's a hard thing to think about, you know, in terms of this engagement. I mean, it's such a critical aspect. Just before the break, you were talking about the issue of ethics, and I often think about, okay, we've got people that are currently instructing in this, whether they're teaching at the secondary level or university level and beyond, who may need to start upskilling themselves, you know, getting more foundations, perhaps, in helping with these types of ethical decisions and considerations, or developing additional competencies in some of the data science skills in one case or statistical skills in another. So how do you think about that? You know, from a teaching perspective, there's often the discussion of pre-service versus in-service. I think that the addition of data science principles and this sort of blurring of lines has led to maybe even greater needs for additional in-service training. And it sounds like some of what you're doing with RStudio might capture that, although it's got a new name now. So can you talk a little bit about current practitioners increasing their skill sets to be able to start to handle some of these problems?

Mine Çetinkaya-Rundel Yeah. So I think that one of the ways we try to capture this, like, what students could be doing once they become practitioners, is to bring that experience into the classroom, not just with practical little examples, but also, as much as possible, by adding a project component to their coursework, where they get to actually analyze a dataset from beginning to end and ask questions they are interested in answering. And I think this is good from a motivation perspective, but sometimes I find it's even harder to answer a question that someone else asked you to answer than one that you may have been interested in, which is probably more like what is happening in, you know, a real work scenario, where you can't just change your problem as you go to take the path of least resistance through it. Instead, you have to face the difficulties that come with it and really work hard at getting your data in shape, or getting a visualization to look a certain way, or building a model that actually is useful not just to you, but to other practitioners as well. So I think that our students get this sort of experience when they do things like internships and whatnot. But I also think, as much as possible, getting them involved in collaborative projects early on, where there may be a domain expert or practitioner, maybe in another department or with an outside business, who might have a question that they want answered, and getting the students involved in answering those, is really helpful. These experiences are not easy to do, and at scale they get even harder. And the increasing interest from the student body in data science means our classes are growing larger and larger. But as much as possible, trying to keep up with having these real experiences for students, I think, goes a really long way.
From my perspective, working with RStudio, soon to be Posit, what I really enjoy about it is having on my radar the types of problems many people are trying to solve using packages like tidyverse or systems like Quarto, and just seeing the questions that come from them. The software engineers are mostly the people who are working on solving those problems. I am not a software engineer myself, but I really enjoy being able to think about, well, maybe that's already a solved problem, and if this person is not getting it, that is clearly a documentation or a communication issue. And thinking about, we can't just write the documentation in all caps and expect it's going to work; we clearly have to take a different approach to communicating that, or maybe change the tooling, because it's simply difficult for folks to use.

John Bailer You know, one of the things that I was thinking about as you're describing this, I'm going to channel Rosemary in a question here. So this is gonna, that she's scared, and you're scared, I'm terrified. So one thing that I think about is that, you know, for a journalism student, for example, they may have only one stat class in their undergraduate career.

Rosemary Pennington
I only had one stat class in my undergraduate career.

John Bailer
So in some sense, you're going out to work now with this foundation. And, you know, I think about this infusion, as you've talked about it, whether it's the principles that are still relevant and important and critical from statistical thinking or some of the computing skills that are required. And you also mentioned predictive modeling. I think this is much more than just an evolution or change that's relevant for a major in a standard data science program; it's also relevant for the journalist who's in Rosemary's program. So when you think about this in terms of the one-course-only exposure to statistics and data science, why is it even more important, perhaps, to have some of these data science principles infused?

Mine Çetinkaya-Rundel Yeah, I mean, I think that, you know, if folks are going to take only one course throughout their undergraduate career, one, I think it's very important that they get exposed to tooling they could use afterwards if need be, so that they can actually, you know, pick things back up and continue to use them to the extent that they need to as part of their profession. So I love tools that are designed for teaching, because I think that they can be so effective in highlighting a particular concept, but I don't really like them for the practice of doing statistics and data science, because otherwise you take this one course, and then what? Then you can't actually use a tool you learned, that you invested time in, to solve the problems you want to solve. So I would say that from a tooling perspective, learning a programming language, even if you're learning it at a pretty high level, I think is really important, so that you have a good foundation to build on later if you choose to do so. Another thing, and, you know, I'm going to speak now not trying to solve all the problems, because we get into challenges around, okay, how do you run such a class, how do you grade such a class, whatever. But it is my experience, particularly running the DataFest event that I've been running over the decade, where we have students work on a real project over one weekend, that teams that are diverse in many ways, including the disciplines of the students, do so well. So incredibly well, because it's the journalism student maybe who asks the critical question, and then maybe the stats or the data science student who says, oh, I think I know how to join these two datasets to try to answer that. And the computer science student maybe says, okay, this is a large dataset, but I think I can make this code run faster. And maybe a student with, you know, a bit more of a communication background says, now that we have this stuff, I know how to present it. And I wish we could have more classes where these things happen. They do actually tend to happen in intro courses that students take before they've made up their minds about their disciplines. So it's one of the reasons why I enjoy teaching intro courses. But it would be great if even at a higher level we had more experiences that bring students with different backgrounds together and have them solve a common problem, bringing in their skills and hopefully, by osmosis, learning a little bit from each other as well.

John Bailer
Well, DataFest is clearly a brilliant effort. You know, we've hosted it on our campus, and your work in that is just deeply, deeply appreciated. So, before we get close to a close, I wanted to give you a chance to talk a little bit about some of your work with open-source textbooks. The idea of, you know, having this resource that is openly, publicly available: what led you to get involved in something like that? And why do you think that's important?

Mine Çetinkaya-Rundel Yeah, you know, I think I got involved with OpenIntro, the open-source textbook project, before I knew what open source meant, thanks to my collaborator, David Diez, whose original idea this whole thing was, and who knew a little bit more about this stuff than I did. I got excited about it because I thought, hey, if there's a way we can get a free resource out there to students, why not? I am already writing pretty extensive notes and exercises for the classes I'm teaching, so if there's a way to reuse these in a way where others can benefit from them, why not? That's really how I got into it. And it was through working on that project, starting in grad school but continuing until now, that I realized, hey, this is not just limited to providing one free PDF to people; there's actually a whole movement around open-source sharing of things, not just on the education side, but software as well. So it's almost like, over the years, many worlds that I've been involved in turn out to be things that are open source, and I found sort of a liking for that. I like developing things in an open way because of the feedback you receive, even during the time of development. It is hard to maintain open-source projects, I have to say. Sometimes other priorities come up and you feel bad that you can't resolve an issue that someone might expect you to. But if I can just set aside the guilt from those instances for a little bit, it is a really enjoyable process. And then to read from students, you know, the feedback that says, it's because this book was free that I feel like I was able to start learning from this, and not just folks enrolled in university, but even others who were just picking it up and reading it. I think that's, you know, it's been a really enjoyable experience for me. And ultimately, it's just kept me organized in terms of my own teaching. Much of what I generate and share as open-source educational materials are things that I'm already generating for myself. And yes, it does take a little bit of additional work to organize it for reuse by others, but I find that that's intellectually engaging work. And being an academic, you know, some of my time is dedicated to, obviously, creating things to share with others. So maybe it's one less paper a year, but I feel like it's been impactful nonetheless.

Rosemary Pennington
Mine, you've talked a lot about how you became a stats educator and how things are going now. And I wonder, where do you see stats education going in the future? You know, we've had this infusion of data science that has sort of changed what you're doing. What do you think's next?

Mine Çetinkaya-Rundel You know, that's a really good question, and it's a question that's hard to answer. Obviously, data science ideas are here to stay. I think one of the challenges that a lot of, let's say, statistics departments are going to tackle is how we handle data science. Do we actually, you know, form solid collaborations with other departments that also do data science? Or do we try to say, no, we are taking over data science? But one way or another, I think it's important that we as statisticians continue to stay in the mix and stay in the conversation, because I think the perspective that you bring as a statistician is so important. The other thing is, I mean, data science or not, there are going to continue to be interesting methodological developments. I think the advance of data science, and the advance of things you've mentioned around, like, there's data everywhere, means we are generating new problems to solve all the time. So I think there's ample work for statistics, and I don't think in any way that statistics as a discipline is becoming obsolete or anything. What I think might be changing, and I perhaps hope is changing, is an increase in the variety of entry points into the discipline of statistics. So instead of saying to students, you should have a bunch of calculus under your belt and be ready to tackle our probability course if you want to study statistics, I think being able to say, hey, do your calculus along the way, but also you can start in, you know, a data science course, where you don't perhaps need those math prereqs but you need to be ready to learn a little bit about computing, goes a long way. And I think it's the union of these two that hopefully widens the umbrella and maybe increases both the number and the variety of students we can welcome into the discipline.

Rosemary Pennington Well, that's all the time we have this episode of Stats and Stories. Mine, thank you so much for joining us today.

Mine Çetinkaya-Rundel Thank you.

Rosemary Pennington Stats and Stories is a partnership between Miami University’s Departments of Statistics, and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter, Apple podcasts, or other places you can find podcasts. If you’d like to share your thoughts on the program send your email to statsandstories@miamioh.edu or check us out at statsandstories.net, and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.