Julia Silge is a data scientist at Stack Overflow, with a PhD in astrophysics and an abiding love for Jane Austen. She is both an international speaker and a real-world practitioner focusing on data analysis and machine learning practice. She is the author of Text Mining with R, with her coauthor David Robinson. She loves making beautiful charts and communicating about technical topics with diverse audiences.
Full Transcript
Pennington: Get twenty individuals who study language, writing or media in a room and you're likely to get twenty different approaches to the subject. It could be anything from a quantitative content analysis to critical deconstruction of a text. Some scholars though are approaching the study of language and media as data science, mining texts for data that help them explain how a creator understood their world. Text mining and data science is the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington. Stats and Stories is a production of Miami University's departments of Statistics and Media, Journalism and Film as well as the American Statistical Association. Joining me in the studio are our regular panelists, John Bailer, chair of Miami's statistics department, and Richard Campbell of Media, Journalism and Film. Our guest today is Julia Silge. Silge is a data scientist at Stack Overflow. Thank you so much for being here, Julia!
Silge: Thank you so much for having me!
Pennington: Now I know your academic training is in astronomy. Actually you have a PhD in astronomy. So I want to ask this first question - how does someone who studied astronomy become a text miner?
Silge: That's a really great question and I think there's two parts, two things that came together that pushed me in that direction. And the first part of it is that I've always been a reader. I love books and I love language. From my childhood, I've always been someone who always had my nose in a book and reading is still what I do to relax. And the other piece of it here is that text is hugely important in the tech landscape…not even tech, like I'm a data scientist at like a classic tech company, but even if you work in healthcare or you work in finance or you work at a nonprofit or you are a journalist, like text as data is enormously important today. And so these two things together - both the need for tools to be built to be able to deal with all this text that has become so important and the fact that I've just always loved this and I've always had an affinity for books and reading and text - this came together and made it super appealing and kind of what I like…I'm the right person at the right time to build these kinds of tools.
Bailer: So you're talking to a group…none of us like books here!
(Collective laughter)
Bailer: There's no readers…I mean if we can concentrate on more than 140 characters at a time…
Pennington: Lies! It’s all lies!
Bailer: So yeah, you're talking to 3 passionate readers here as well. So one question that I would have is what was the first text problem that you ever worked on?
Silge: You know, so I have like a winding career path and a lot of people who have the title data scientist say that…they're like oh! I was a biostatistician or I was you know, maybe a software developer and then I moved into this. But I really do think I have a particularly wacky one. I was an astrophysicist doing research and teaching. I actually was a stay at home mom, I was totally out of the workforce for a few years. I came back and I worked at an ed tech start up, doing curriculum development and after like after kind of this weird winding path, I decided for a variety of personal and professional reasons, it was time for me. I wanted to make a transition into data science and I thought gosh! Look at this weird resume!
(Collective laughter)
Silge: Yeah…no one, like I had the confidence like this is going to be a good fit for me but I don't really have evidence that it is true! I need to build a portfolio and when I started building a portfolio of projects, one of the very first things that I did was a text mining project. And you know what they say, you should write what you know and I thought you know like oh gosh, what are some of my favorite things? And my very favorite author is Jane Austen.
Bailer: Oh okay.
Campbell: We knew that. You never want to guess…
(Voices overlap)
Silge: Yeah, yeah. It becomes obvious. But so Jane Austen's works are in the public domain and you can get the text of them quite easily and so they're like…literally this is one of the very first projects that I did with the goal of - I want to make a portfolio to show to hiring managers, to be able to get a job as a data scientist. So I said OK, I can take these works that are in the public domain and I am going to do this analysis and the very first thing I did was, I took some of the existing tools...using some of the data science tooling that existed at the time, I did some sentiment analysis of Jane Austen. I think the very first thing I did was just Pride and Prejudice. Like, let's do some sentiment analysis of Pride and Prejudice and make some data visualizations of it and when I look back on that now I actually think like, first of all, it was hard to do. But also, I’m like those visualizations are pretty nice actually. I'm still pretty proud of those first visualizations.
Campbell: That’s cool. So moving from Jane Austen who we’ll probably come back to, I'm really interested in your…you know we run a film program here, so I'm very interested in your screenplay, that database with over 2000 screenplays from I think 1929 to 2015?
Silge: Yes that was such a great project to work on. So this is a project that I worked on in collaboration with the data visualization experts at The Pudding. They are amazing at what they do and so I worked in partnership with some of them and I did the data analysis part and then they…in this project they did the front end, built the interactive visualizations. So they have this large database of film scripts and they had done some previous analysis of it, looking at various questions about it. But we worked together and we wanted to ask the question around gender roles with these film scripts. And so what we did was, we first stripped out all of the dialogue and when you take out the dialogue from film scripts, what is left over is the set direction. The set direction in the film script, that's the part of the film that tells the actors what to do. It prescribes what the actors should do, what they are thinking and feeling and looking like and how they're acting. And so what I did after that was say, OK, let's take all of that and let's divide it into bigrams. So a bigram…in text analysis sometimes you look at single words which you would call a unigram, and sometimes you look at N-grams which are a group of N words and so a bigram is a group of 2 words. So we said OK let's find all the bigrams in the set direction and then identify the ones where the first word in the bigram is either he or she and then look at that second word in the bigram and do some statistical analysis of it, basically look at odds ratios, what is more likely to come after he and what is more likely to come after she. This gets at questions to understand, how are men more likely to be portrayed compared to how are women more likely to be portrayed. We were able to get some pretty interesting insights into what are the stories that we are watching, that we're telling culturally about each other when it comes to who we are as men and women.
Some of the words that display the biggest differences are - words associated with women are words like snuggles, giggles, squeals, sobs, weeps and some of the words associated with men are words like gallops, shot, howls, and kills. And then you can also look at the words that are in the middle. So that means words that have log odds ratios near zero, where they're about equally likely to be associated with men or women. So there's words that are like walks, reads, glances, steady. So by looking at these overall, we're able to understand what are writers, what are films, what are we absorbing, what films are telling us about who we are, and what we think about that. I love this analysis so much because it uses this very detailed text mining approach to get us something that's so core.
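The bigram approach Silge describes can be illustrated with a minimal Python sketch. This is not the code behind the published analysis (Silge works mainly in R); the sample sentences are invented, the function names are hypothetical, and the add-one smoothing and base-2 logarithm are assumptions chosen to keep the ratio defined when a word never follows one of the pronouns.

```python
import math
import re
from collections import Counter


def pronoun_bigram_counts(text):
    """Count the second word of every bigram whose first word is 'he' or 'she'."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = {"he": Counter(), "she": Counter()}
    # zip(words, words[1:]) walks over every pair of adjacent words (bigrams).
    for first, second in zip(words, words[1:]):
        if first in counts:
            counts[first][second] += 1
    return counts


def log_odds_ratios(counts):
    """log2 of the smoothed rate after 'she' over the smoothed rate after 'he'."""
    she, he = counts["she"], counts["he"]
    she_total, he_total = sum(she.values()), sum(he.values())
    ratios = {}
    for word in set(she) | set(he):
        # Add-one smoothing (an assumption) avoids dividing by zero.
        she_rate = (she[word] + 1) / (she_total + 1)
        he_rate = (he[word] + 1) / (he_total + 1)
        ratios[word] = math.log2(she_rate / he_rate)
    return ratios


# Invented stand-in for set directions with the dialogue stripped out.
directions = (
    "She giggles. He gallops away. She sobs. He walks in. "
    "She walks out. He gallops back. She giggles again."
)
counts = pronoun_bigram_counts(directions)
ratios = log_odds_ratios(counts)
for word, value in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{word:10s} {value:+.2f}")
```

In this sketch a positive value means the word follows "she" more often, a negative value means it follows "he" more often, and values near zero correspond to the words "in the middle".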
Campbell: Right. I wanted to follow up on that because one of the things that you note in this study is that of these 2000 scripts, 85 percent were written by men and only 15 percent were written by women, and I also wondered, did you notice changes? Because you speculate on you know, that this is one of the factors, or variables that are involved in maybe the way this language, and the sort of scene directions are coded. But did you also notice a change more recently, in more recent scripts versus what you were finding back in the 1930s or ‘40s?
Silge: Yes. So that's a great question. So first of all we were able to look at how the gender of the writer impacted what kind of words were used for different kinds of characters. So women writers use different words to talk about women than male writers use to talk about women. And I was so happy to work with these talented data visualization experts, because we made this really cool interactive visualization, where you can see, you can really experience and see how this changes. Like women who write about women are more likely to…sorry. Women and men who write for…about their same gender are more likely to talk, to use very active verbs. Whereas when you write about the opposite gender, you are more likely to use a few romantic words, which is very interesting. So you asked about the change with time, which is another interesting thing to get at. Well, I did look at change with time and this didn't make it into the final cut because of space issues. The change with time, first of all there were very few women writers of film scripts in the early past, so I was not able to find anything statistically significant there, because there wasn't enough data to see (if) was there a change with gender of the writer with time, but the main change with time is that it is like a style difference, like there were things…I’m trying to remember…I think there were words like for example, some of the words that were very old fashioned, that were much more common in the past. So words, I think like squeals was one of them or...I'm trying to remember what it was…another one. There were some words that when I read them I was like oh yeah I know that, I could see that being used in like the thirties and not so much today. But it was a fun analysis to do, I was like oh yeah! Yeah, that sounds like something that would happen in the movies in the thirties and not so much today. It was a fun thing to be able to look at.
Pennington: You're listening to Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington, with Miami University statistics department chair John Bailer and Media, Journalism and Film's Richard Campbell. Today we're talking text and data with Stack Overflow data scientist Julia Silge.
Bailer: So Julia, we're going to continue this. This is such a cool example, such a neat application. I'm just curious about the genre of the film. Did you see differences in the type of language used?
Silge: Yes, so this was something else that we had data on. We had data on the writer's gender, on the genre of the film and this again was something that there were not huge surprises in what we found. And so this is part of why we didn't include it in the analysis, because we saw that women and men were more likely to have those…like women were more likely to have active words used with them in action type movies. So some of these…and men were more likely to have some of these…these differences that we see, you can control for some of them by looking at genre. Some of the differences shift depending on genre, perhaps unsurprisingly. This is partly because of what kinds of characters are in what kinds of movies. So I experimented with building a model. What is actually in here, in the final data visualization, is a fairly simple measurement. It's just the log odds ratio, but I experimented with building a model where we could control for say, the genre, the gender of the character and be able to see like OK, can we get at what…how much difference is there when we control for these things that also impact it. You can't account for all of the difference in these kinds of words just with genre. It's not just because men are more likely to be cast in action movies. It accounts for some of the difference, but not all of it so…and it was difficult to communicate in a visualization the output of that kind of model, so we went with this result here, with the more simple statistic to be able to communicate. And this is something I run into all the time, like from my day job, my open source work…it's so applicable to you know, people who work in journalism, like what do you decide to communicate, like something that's a more complex model or something that is a more simple, easy-to-explain statistic. You know it's something I think about all the time.
Campbell: So first for the journalist sitting here, tell us about an odds ratio. What is an odds ratio?
Silge: Ah! OK. So an odds ratio is when…it's a statistic, so it means like something you can measure about some kind of sample that you have. So it's a ratio. So on the top you put the odds meaning…gosh! What would be a good way to talk about the odds…so the odds is like if you think about it within gambling, it's like the likelihood that something is going to happen. So on the top you put like the odds of something happening with something else happening and then that goes on the top and then what goes on the bottom is the odds of that happening without something else. So it's a statistic, it's trying to get at what's the strength...like how strongly are those two things associated.
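Written out as a formula, the verbal definition above looks like the following. The symbols $A$ (the thing that happens) and $B$ (the "something else") are notation introduced here for illustration, not terms used in the conversation.

```latex
\[
\operatorname{odds}(A \mid B) = \frac{P(A \mid B)}{1 - P(A \mid B)},
\qquad
\text{odds ratio} = \frac{\operatorname{odds}(A \mid B)}{\operatorname{odds}(A \mid \text{not } B)}
\]
```

As a made-up example: if a verb follows "she" in 20 of 100 cases and follows "he" in 10 of 100 cases, the odds are $0.2/0.8 = 0.25$ and $0.1/0.9 \approx 0.111$, so the odds ratio is about $2.25$. A ratio of 1 would mean no association, and taking the logarithm makes "no association" equal to zero.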
Pennington: Julia, I'm going to swing us back around to the Jane Austen stuff, because I am also a big Jane Austen fan and I was reading some of your stuff about that. What for you, as you were you know, text mining in these books that you obviously love, what did you like…what emerged from these various analyses that you found most interesting or most surprising?
Silge: One thing that I love is there's a statistic that's called TF-IDF. There are two parts of it. The TF stands for term frequency. Term frequency is a fairly simple idea. It just means like how often are words used compared to how long is the document. So if you had a lot of…some words that are used a lot in a document, they would have a high term frequency. So in most language words like…in most documents words like…including Jane Austen's novels, words like we and to have very high term frequency. The other part of it, IDF, is a weight, and it's got like a natural log in it, and it's a little more complicated but it basically weights down. It applies to a document in a collection of documents and it weights down things that appear all the time, and it weights up things that appear in only one document or only a few documents in the collection. So TF and IDF, you multiply them together and you get something called TF-IDF and so it's a statistic. The point of it is, the goal of it is to find words that are important for one document in a collection of documents. And so it's important in like internet search, it's a flexible, powerful text mining technique. So if you calculate TF-IDF for all the words in Jane Austen's novels what you get are the proper nouns, like the names of the people and places in Jane Austen's novels. So like in Pride and Prejudice, it's like Darcy, Elizabeth, Longbourn, you know, it's all the names of like the most important proper nouns. It's no other words. It's just those proper nouns and I love it because when you…if you're someone who has any knowledge of how English works or specifically of Jane Austen’s novels, you see…like if I do like a code through with someone and we walk through how it works and then you see that graph at the end, I love it because most of the time people are like oh! I get it, I get it!
Like this is what TF-IDF does, it's because Jane Austen uses, you know, pretty similar language from book to book and what is most distinctive, like from a text standpoint, what most sets apart one book from the other books is the names. The names of the places and the people in them. And that's exactly what TF-IDF does.
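The calculation Silge walks through (term frequency multiplied by a natural-log inverse document frequency) can be sketched in a few lines of Python. The three "novels" below are invented one-line stand-ins rather than real Austen text, and the function name is hypothetical; the point is only to show why shared words score zero and names rise to the top.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return {doc_name: {word: tf-idf}} for a dict of tokenized documents."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))

    scores = {}
    for name, words in documents.items():
        counts = Counter(words)
        total = len(words)
        scores[name] = {
            # tf = share of the document; idf = ln(total docs / docs with word)
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Made-up miniature documents, not real text from the novels.
books = {
    "Pride and Prejudice": "elizabeth and darcy walk to longbourn and talk".split(),
    "Emma": "emma and knightley walk to highbury and talk".split(),
    "Persuasion": "anne and wentworth walk to bath and talk".split(),
}
scores = tf_idf(books)
top = sorted(scores["Pride and Prejudice"].items(), key=lambda kv: -kv[1])[:3]
print(top)
```

Words such as "and" or "walk" appear in every toy document, so their inverse document frequency is ln(1) = 0 and they drop out, while the names that occur in only one document get the highest scores.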
Bailer: Very good. So one thing we haven't talked about, is basically how do you go from a book, a collection of words and sentences and paragraphs, to a data set that you can do an analysis?
Silge: OK yeah so that is a question about like reading in your data. Like reading in your data and getting it into some kind of a data structure that you can do some kind of analysis in. So my work has focused on what people in the business call tidy data principles. So I work mainly with the programming language R. And mainly in the idiom of tidy data principles. So the first sort of step is that something has to be in some kind of electronic form. So for example with Pride and Prejudice the very first time I did it, I like Googled, you know, like how can I get Pride and Prejudice? It turns out it's on like…Project Gutenberg makes many books, like they are in a text form. So then you have to get it somehow into memory but then, what do you do? Like what do you do, so that you can do some kind of analysis? I’m a huge proponent of the power and the fluency that embracing tidy data principles gives you. And there's a number of people who work you know in this space and I know you had Hadley Wickham on, sometime in the past year, and he also is someone who has worked so much in this space. There are a couple of like principles here and one is that you have one observation per row, and the work that I've done in the open source world is to build tools, like build software to be a bridge so that you take your data or text data that you have in memory and that you can convert it into this tidy data structure and then you can then use this whole existing infrastructure that exists for data manipulation, data summarization, data visualization to be able to do the kind of analysis that you want to do. So there's more than one answer to that question, but I'm a big believer in the huge possibilities that exist for embracing tidy data principles from like the very first steps that someone might want to take in exploring texts all the way through to very sophisticated machine learning techniques.
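The "one observation per row" idea can be illustrated with a small Python sketch. Silge's own tools are written in R, so this is only an analogy under stated assumptions: the helper name `one_token_per_row` and the column names `book`, `line`, and `word` are hypothetical, and the input is the opening sentence of Pride and Prejudice split across two lines.

```python
import re
from collections import Counter


def one_token_per_row(lines, book):
    """Turn raw lines of text into rows holding exactly one word each."""
    rows = []
    for line_number, line in enumerate(lines, start=1):
        for word in re.findall(r"[a-z']+", line.lower()):
            # Each row is one observation: which book, which line, which word.
            rows.append({"book": book, "line": line_number, "word": word})
    return rows


raw_lines = [
    "It is a truth universally acknowledged, that a single man in",
    "possession of a good fortune, must be in want of a wife.",
]
tidy = one_token_per_row(raw_lines, book="Pride and Prejudice")

# Once the text is in rows, ordinary summarization tools apply directly.
word_counts = Counter(row["word"] for row in tidy)
print(tidy[:3])
print(word_counts.most_common(3))
```

Keeping the book and line number on every row is what lets later steps group, count, join, or plot by those columns without any text-specific machinery.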
Campbell: Do you have any suggestions about like some of your work could have implications for how people you know that when they write screenplays being sort of conscious of the kind of verbs that they use. I'm looking at your work on the screenplays. Do you have in the back of your head some notion of how you can, sort of effect change, a larger social change with your work?
Silge: That is something that I have been thinking about most regularly in the context of my day job. So I'm a data scientist at Stack Overflow and Stack Overflow is the world's largest community for developers. We are like the largest online community for developers. And we have 50 million visitors a month and people come and they type in text, right? I love our community, it is a place where so many people have come to help each other. But there have been ways in which our community has not been as healthy as people want it to be. And so this is something I've thought deeply about in that context and have spent a fair amount of work on using text analysis, using from simple to pretty sophisticated machine learning to understand where and in what ways is text being used in an unhealthy manner on our site and how can we detect it and how can we do something about it. So this is something that I care about a lot, both because I'm a woman in tech and this is something that I have felt personally, and something that I, as someone who is part of building, part of like contributing to this community that I care about a lot. It's something that I invest a fair amount of energy in.
Pennington: Julia, thank you so much for being here today. I think it's all the time we have for this conversation unfortunately.
Silge: Thank you so much for having me! It was great to get to chat about my work.
Pennington: Stats and Stories is a partnership between Miami University's departments of Statistics and Media, Journalism and Film as well as the American Statistical Association. You can follow us on Twitter, Apple Podcasts or other places you can find podcasts. If you'd like to share your thoughts on the program you can send your e-mail to statsandstories@miamioh.edu. Or check us out on our spiffy website statsandstories.net and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.