How To Become a Data Scientist | Stats + Stories Episode 82 / by Stats Stories


Julia Silge is a data scientist at Stack Overflow, with a PhD in astrophysics and an abiding love for Jane Austen. She is both an international speaker and a real-world practitioner focusing on data analysis and machine learning practice. She is the author of Text Mining with R, with her coauthor David Robinson. She loves making beautiful charts and communicating about technical topics with diverse audiences.

Full Transcript

Bailer: We've talked about text mining in data science previously with Julia Silge, but she's also consented to join us to continue a conversation about how she got involved in what she's doing. So on today's episode of Stats and Short Stories, a production of Miami University's Departments of Statistics and Media, Journalism and Film as well as the American Statistical Association, we'll be chatting with her a little more about how she got to do what she's doing, and what backgrounds and what things she might recommend. I'm joined in the studio by colleagues Rosemary Pennington and Richard Campbell from the Department of Media, Journalism and Film. I'm John Bailer, chair of Miami's Department of Statistics. Julia, welcome! It's a delight to be chatting with you again.

Silge: Thank you so much.

Bailer: I'd like to start with a quick question. What do you like best about the work that you do now?

Silge: I think…so Stack Overflow is maybe smaller than some people realize who have engaged with it. We're a very high-traffic website, like we have as much traffic as Hulu or the New York Times, but I'm actually the only data scientist there. The people who build the website, the engineering organization, number between 50 and 60. So I think what I like most about it is how much impact I have when I build a model or do a statistical analysis and deliver it to someone like a product manager or a software engineer or even a C-level person in my organization. They pay attention and usually do something with it, like change how something is built or how something is organized, and so that high level of impact is super motivating to me.

Pennington: Julia, I know you've done work on text mining. So my question for you is, for a scholar who maybe is not as familiar with R or the statistical stuff you use, but who is interested in analyzing texts (I would say I am one of those people), what advice would you have for them as they try to jump into this space?

Silge: I think for someone who wants to get started, the tidyverse ecosystem in R, plus tidytext, which is the work that I do, is a great place to start. The tidyverse ecosystem is beginner friendly; it is also a great fit for experts like me. I use it because I am super effective in it, but it is explicitly built with a “human friendly” API.

Pennington: That’s what I like to hear.

Silge: It is meant to be something for people to get started with and use, and my work around text is explicitly built to be part of that. For someone who wants to get started, I would recommend Hadley Wickham’s book, R for Data Science, to maybe look at the first couple chapters of that; it is entirely free and online. And then my book that I have written with my collaborator David Robinson, it's called Text Mining with R: A Tidy Approach. It's available at tidytextmining.com. The whole book is there, and so to start working through some of that as well. So those would be the things I would recommend.

Campbell: What do you think we should tell our students? You started your career as an academic and probably had to try to get people excited about data at some point. But what do you think students should be studying, undergrads today, to start on the meandering path that you took? What would you recommend that they do?

Silge: So students who are interested in roles as data scientists, I think it's still a little bit of the Wild West. People don't know…no, it really is! Like people don't know who gets to be a data scientist, what does it mean…it means different things at different places. So here are some of the things I think help people get hired…having something that publicly demonstrates the ability to do the job. There's not one single way that this has to be, so I’m about to list off some things and I don't mean that you have to have all of these things. I mean that you need to have one or two of these things to be able to show someone, like, "Sure! Yeah, I have a degree in stats or, you know, something related, but here is this evidence." So here are some things that could contribute to that. Do you have a GitHub account with code in it that someone can look at, so they can see that you know how to use GitHub and that you know how to write code? Have you spoken at local meetup groups about some analysis that you did, and are you able to communicate about that analysis? Do you have some kind of portfolio or blog where you have, say, beginning-to-end data analysis projects, where you used both text and code to walk through an analysis project and visualization and you explain what it is that you did? So these are the kind of…Stack Overflow has this phrase, public artifact, and I like that. It's like a public artifact for what you can do, and because the data scientist role is not 100 percent defined in industry currently and the training for it isn't really standardized, I think right now people who are hiring have to see those kinds of things to be able to say, yes, this person is hirable.

Bailer: I think you've described just a brilliant way to reconsider capstone experiences for our students that are interested in analytics and data science. So you know I think you've given us some great advice, Julia!

Silge: Excellent!

Bailer: That's all the time we have for this episode of Stats and Short Stories with Julia. Thank you once again for being here.

Silge: Thank you for having me!

Bailer: Sure. Stats and Short Stories is a partnership between Miami University’s Departments of Statistics and Media, Journalism and Film and the American Statistical Association. You can follow us on Twitter, Apple Podcasts or other places where you can find podcasts. If you'd like to share your thoughts on our program, send your e-mail to statsandstories@miamioh.edu or check us out at statsandstories.net, and be sure to listen for future episodes of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.