An Anti-Racist Approach to Data Science | Stats + Stories Episode 180 / by Stats Stories


Emily Hadley is a Research Data Scientist with the RTI International Center for Data Science. Her work spans several practice areas including health, education, social policy, and criminal justice. Emily holds a Bachelor of Science in Statistics with a second major in Public Policy Studies from Duke University and a Master of Science in Analytics from North Carolina State University.


Episode Description

Individuals and institutions around the United States are grappling with the history of racism in the country as well as the ways they themselves have contributed to it. Many are working to adopt anti-racist approaches to their work and in their everyday lives. How to be an anti-racist data scientist is the focus of this episode of Stats and Stories with guest Emily Hadley.

Full Transcript

Rosemary Pennington: Individuals and institutions around the US are grappling with the history of racism in the country, as well as the ways they themselves have contributed to it. Many are working to adopt anti-racist approaches to their work and in their daily lives. How to be an anti-racist data scientist is the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington. Stats and Stories is a production of Miami University's Departments of Statistics and Media, Journalism and Film, as well as the American Statistical Association. Joining me are our regular panelists, John Bailer, Chair of Miami's Statistics Department, and Richard Campbell, Professor Emeritus of Media, Journalism and Film. Our guest today is Emily Hadley. Hadley is a research data scientist with the RTI International Center for Data Science. Her work spans several practice areas, including health, education, social policy, and criminal justice. While working there, she's used a variety of programming languages to do work ranging from estimating the prevalence of teen vaping in Florida to forecasting demand for hospital beds during the COVID-19 pandemic in North Carolina. Emily, thank you so much for being here today.

Emily Hadley: Thank you for having me.

Pennington: How did anti-racist data science become something that you, I guess, became interested in and now speak on?

Hadley: Really great question, and I'll lead with this: it's not something that I grew up thinking a lot about. I grew up in New Hampshire, which is a very predominantly white state, I think 94 to 95% of the state is white, and I just did not really think about race a lot when I was growing up. It wasn't until I moved to North Carolina for college that I really began thinking about my place as a person who identifies as white. I did a lot of work in college and post-college where I began to recognize certain privileges that I enjoy because of my race. But it really wasn't until last summer, and the murder of George Floyd in particular, that I really recognized that I could be doing more in the professional space to address racism in data science. So that's when I began to think critically about the projects that I was a part of, the coworkers that I was working with, the data we were collecting, and how we were using and communicating that data. And I really realized that there were a lot of opportunities to take small actions as an analyst on your own computer that can have much larger ramifications and impacts, and be an important part of the anti-racist steps that, as you noted, the whole country is starting to take.

John Bailer: Can you talk a little bit about one of the projects, or how this thinking has informed the way that you're now, you know, approaching projects?

Hadley: Absolutely. So I'll say, you know, a large number of the projects that I work on utilize race or ethnicity as a variable or characteristic, either in model building or in descriptive statistics. And we use it when it comes to us because it seems very familiar: oh, we've collected this, we should do something with this variable. But race and ethnicity are actually a really complicated subject, everything from, you know, where the terms come from. Latinx and Hispanic are both terms that were invented by the US government; they're not from the communities themselves. So when you're grouping all these individuals together, making an analysis based on it, and synthesizing some sort of result for a community that may not always identify together, you're making some leaps and conclusions that might not be appropriate. So when I'm working on projects that use race or ethnicity as a component, I'm thinking about everything from the history and the context of how those categories were developed and defined to the fact that oftentimes we're dealing with small sample sizes: what groups am I grouping together? Why am I grouping them together? Is it appropriate to have an "other" category? And whose experiences am I grouping together when I do that? Perhaps the real takeaway, when we're using race and ethnicity, is the question of why we are even using this as a component, because we're making an assumption that people with the same race and ethnicity are all sharing the same experience. I recently went to a very informative workshop where they talked about, you know, if you're trying to capture something like marginalization, it's not necessarily appropriate to assume that everyone with the same race and ethnicity is either experiencing marginalization or experiencing it to the same degree. So a lot of what I do is asking people like me, data scientists and data analysts who are making decisions about how to use race and ethnicity in their work, sometimes alone on our own computers where we don't get a lot of feedback on the decisions we make, to be more critical about that work, and also to document it and communicate it to project staff, because it is a pretty key decision that has ramifications for the whole project.
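To make the grouping decision Hadley describes concrete, here is a minimal, hypothetical Python sketch. The data frame, category labels, and outcome variable are invented for illustration and are not drawn from any project she mentions; it simply shows how two defensible groupings of the same responses yield different summaries, which is why documenting and communicating the choice matters.

```python
# Hypothetical illustration: the same responses summarized under two
# different race/ethnicity groupings. All names and values are invented.
import pandas as pd

df = pd.DataFrame({
    "race_ethnicity": ["White", "Black", "Hispanic", "Asian", "AIAN", "NHPI",
                       "White", "Black", "Hispanic", "Asian", "AIAN", "NHPI"],
    "outcome":        [1, 0, 1, 1, 0, 1,
                       0, 1, 0, 1, 1, 0],
})

# Grouping A: keep every category, even where cell sizes are tiny.
summary_detailed = df.groupby("race_ethnicity")["outcome"].agg(["mean", "size"])

# Grouping B: collapse small categories into "Other" -- a common but
# consequential choice that merges groups with very different experiences.
collapse = {"AIAN": "Other", "NHPI": "Other", "Asian": "Other"}
df["collapsed"] = df["race_ethnicity"].replace(collapse)
summary_collapsed = df.groupby("collapsed")["outcome"].agg(["mean", "size"])

# Document the decision so project staff can see and critique it.
print("Detailed grouping:\n", summary_detailed, "\n")
print("Collapsed grouping (AIAN, NHPI, Asian -> Other):\n", summary_collapsed)
```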

Richard Campbell: Emily, your article in Towards Data Science actually reminded me of the show Northern Exposure, which was on in the early 90s. There was an episode in which they were arguing about the difference between truth and facts, and the doctor in the town sort of represents science in this episode. He says the job of the scientist is to reveal the data, reveal the evidence, and step away, get out of the way. The question I want to ask about that has to do with objectivity, and with the pushback that you might be getting from colleagues. Could you talk a little bit about that? Because I'm assuming, as a data scientist, you're in the minority here. So can you talk a little bit about that?

Hadley: Absolutely, and you're not far off from the experiences that I've had. I definitely received some pushback for initiating these conversations, and there are kind of two components to it. The first piece is I received feedback that, you know, just considering race in and of itself is racist. And I do push back pretty strongly against that, because it lends itself to this view that we should be moving towards a colorblind world, which I really disagree with, because there's a lot wrapped up in people's experiences related to their race and ethnicity that we're not just going to overcome; there's a whole history and a whole system that's been set up. There's a lot of literature in that space that I often refer people with that critique to, so they can observe and reflect on the fact that, actually, we do want to be considering race, and we want to be pretty direct about it. The second piece is the one that's a little bit more theoretical and harder to discuss, and that's this idea of objectivity. That's this idea of us being scientists, and there being one true result and one given fact. And I can see that, you know, when you and I run the same mathematical program on the same data, we should get the same result; that is the value of math and the value of statistics, that the same logic should hold true. But then there's the element of us being data storytellers, and data can tell multiple stories depending on how you look at it. That's really what I encourage people to dig deeper into. So, coming back to how you're using race and ethnicity: you can split up race and ethnicity groups in different ways to tell different stories, and it matters a lot how you choose to group or ungroup those categories. It's not that one is true or one is false; it's just that one might be more applicable and more reasonable given a question, or a given area of interest. So I do encourage people to move away from this idea that there is only one truth, because, especially in the data science field, we're making statements about the data that other people will interpret and use moving forward. And I don't think it's fair to just say, this is it, this is the truth, when there are multiple ways to look at the same data set.

Bailer: You know, Richard, thank you for the Northern Exposure reference. I'm probably going to ask, are you putting in this dated reference now so you can reach back next year to talk about The Prisoner? Okay, so, I think the comment you were making here is really important: the idea that you can get the same output, given the same input, if your methods are stable and you can sort of connect all the dots in that path. But the thing that seems critical, and part of what you're challenging, is the notion of what the inputs should be. What do those inputs represent? So when you're involved in projects and you start talking about what the inputs should be in the models that we're using (if you're doing data science, I assume some kind of prediction or some kind of classification is often an outcome), how does thinking about those inputs, or even defining the outputs, play out, given what you've been considering and reflecting on?

Hadley: Yeah, absolutely. I think this is a really good question of where the inputs are coming from and how involved the data scientists are at the point of collection. And I'll say we're moving towards a space where the field is becoming much more aware of how important it is to have the data scientists and data analysts involved much earlier in the process, everything from defining the particular question and how that information is collected to also ensuring representativeness in the data. That's a really big piece for us, because, as you pointed out, I'm on a lot of projects where we're doing prediction, and we're doing forecasting, and we're doing some models that have pretty high degrees of uncertainty and also have the ability to be pretty biased toward one group or another if we're not paying attention to where the data is coming from and what's being collected. So I'm a really big advocate of having a data scientist involved in that data collection process whenever possible. And then also, on the flip side, making sure that subject matter experts are involved. As a data scientist, my job is not to be a subject matter expert in a particular field; it's to have a skill set that I can apply to a lot of different areas. I think at the beginning of the episode the wide variety of spaces that I work in was mentioned, and right now I'm on a criminal justice project, an opioid project, and a project about infant mortality in low- and middle-income countries. These are all three topics that I don't know a lot about. So when I'm deciding what the outcome of the model is, that's when it becomes really, really important to talk with the subject matter experts and ask, what are we actually trying to predict? What are we actually trying to forecast? And how big is the cost of an error? Because that will also greatly impact how important some of those inputs are; if we have really skewed data, and the cost of an error is really high, we might not want to pursue the project.
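A quick, hypothetical illustration of that last point about skewed data and the cost of errors: with a rare outcome, a model can look highly accurate while missing every case that matters, and weighting errors by a made-up cost changes the conclusion. All numbers below are invented.

```python
# Toy example: accuracy on a rare outcome vs. cost-weighted errors.
import numpy as np

rng = np.random.default_rng(1)
y_true = (rng.random(10_000) < 0.02).astype(int)   # rare outcome: ~2% positives
y_pred = np.zeros_like(y_true)                      # a model that always predicts "no"

accuracy = (y_true == y_pred).mean()
false_negatives = int(((y_true == 1) & (y_pred == 0)).sum())

COST_FALSE_NEGATIVE = 100.0   # hypothetical: missing a true case is very costly
expected_cost = false_negatives * COST_FALSE_NEGATIVE

print(f"Accuracy looks great: {accuracy:.3f}")
print(f"But {false_negatives} missed cases cost roughly {expected_cost:.0f} units")
```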

Campbell: Another area that I would like you to weigh in on: you mentioned telling stories about data, and that's a phrase I've seen you use. This is another area where you might get pushback: what's the relationship between data and storytelling? I think a lot of scientists, many of them we've had on this program, are very good at telling stories about their data, and they do it very naturally. But some, I think a lot of scientists, are uncomfortable with this relationship. Can you talk a little bit about how you think about the relationship between your work and telling stories about your work?

Hadley: Certainly. Storytelling is a very key component of what I do, and I think it's a key component of what a lot of the data scientists I work with do. It's coming from a place where a lot of the groups that we're working with, so that could be government, that could be industry, that could be academia, want to use what we're doing to make decisions. At the end of the day, you know, I'm working on a project right now where somebody just wants to know, is the model biased? Can I use this? And at the end of the day, that's kind of the statement that I have to get down to. They're interested in the technical details being accurate and of good quality, but they don't have training in that area, and they're not necessarily that interested in digging really deep into the plots and into the numbers and into the p-values. They kind of just want that final piece. So that communication element is huge, and often a much larger piece of our projects than we anticipate: how can we get a plot that's user-friendly and expresses what we're trying to express in a way that's not biased based on what we want to tell, that conveys the story that's actually being told by the data, but is also told in a way that's useful for our stakeholders? I can't just give people a whole bunch of significant p-values and say, this is what's important, all right, you figure out what to do next; they want me to generally tell them what to do next. And so I think that's where that storytelling piece comes in. You've seen what's significant, you've seen what the model output looks like, and helping the client figure out what to do with it is a really key piece of applying these techniques in practice.

Pennington: You're listening to Stats and Stories, and today we're talking with Emily Hadley about how to be an anti-racist data scientist. Emily, I'm going to ask you a question about not the clients you work with, or data scientists, but maybe the public who is reading news stories about, you know, data that's portraying populations in particular ways. What sort of advice would you give to a news reader about how to suss out whether the data they're reading about is useful, or, I'm not even quite sure what adjective I want to use here. But how do they navigate that conversation around race and data, and is there advice you'd give to a news consumer about what to think about when reading these kinds of news stories?

Hadley: Absolutely, sure, that is a great question. The news has really gained a lot of interest in data, the way that data is collected, and the way that it's being used, and I'll talk about a couple of ideas on that spectrum. The first is related to how the data is collected, and in particular surveillance. There's more and more news reflecting that a lot of the surveillance that's happening is oftentimes in communities of color; for example, we see a lot more cameras that are tracking people and movement in communities of color, but there's also your cell phone and your data collection. I think that there's a level of education that comes with understanding exactly what tools are collecting your data and how they're using it, and I would love to see more of that in the school systems, because a lot of companies have ad profiles on you and you might not even know it, or they've been tracking your location for years and you don't even know it. I really encourage consumers of news, but also consumers more broadly, to become aware of what data is being collected on them, whether they can get rid of that data, and, if they can't and they're uncomfortable with it, the steps they might be able to take to advocate around the use of that data. And then the piece related to that is, every time you see machine learning and AI in the news, I urge you to think critically about what is actually happening and what the outcome of that model is. Because you do have to ask yourself, would I want my information fed into that model and some sort of decision made about me that's going to have a really big impact on my life? And if you wouldn't, you know, say something about it, or do something about it, because we're moving into a space where algorithms are really efficient and can make decisions quickly, but they're also impacting people's lives.

Bailer: You know, in your article, certainly we could walk through all five steps, but I'm going to leave that as a homework assignment for our listeners. I would like to explore at least a couple of the ideas that you bring up. Step two was to learn about how data and algorithms have been used to perpetuate racism. Could you give a couple of examples of how that has happened?

Hadley: Yeah, certainly. So there's a lot that's happening in this space right now. I would say criminal justice is one of the biggest spaces, and it's something that I do a lot of work in at the moment, and there are a lot of different pieces to it. I think one of the most common examples is going to be facial recognition algorithms. Facial recognition has gained a lot of notoriety, and it has gained that notoriety because some of the initial algorithms were based on training data, those input data sets, that were not representative of the general population. And so you're seeing better predictions for people who are white than for people who are Black. But also in the criminal justice space, something that I'm actively working on as part of my work is related to pretrial risk assessment algorithms. These are algorithms that predict the likelihood that somebody is going to fail to appear or go on to commit a crime after they've been arrested, and there's increasing interest in judges using these to make decisions about whether or not to keep an individual in jail before their first appearance and, if so, whether or not to set bail and how much for. They're pretty controversial. I should note that our involvement is from the validation side, not from the development side, not from the creation side, and we're really kind of asking, you know, are these tools biased? And are these tools going to be biased in the particular regions that are interested in using one?

Bailer: Can I interrupt just for one second, just to clarify? When you say that these algorithms are biased, could you talk a little bit about what that is? When you say an algorithm is biased, what does that mean?

Hadley: Certainly, yeah. A biased algorithm is one that's going to make a different decision for one individual than for another individual, where pretty much the only distinguishing factor between those two individuals is a protected class. Generally we're looking at race, ethnicity, and gender, but you could potentially look at others as well. Thank you.
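As a rough illustration of that definition, and not of any tool RTI validates, here is a hypothetical Python sketch of a "counterfactual flip" check: hold every feature fixed, change only the protected class, and see whether the model's decision changes. The features, data, and model are invented, and real fairness audits go well beyond this single test.

```python
# Toy illustration of the definition above: does the decision change when
# only the protected class changes? Everything here is hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Invented features: [any_prior_arrest (0/1), age, protected_class (0/1)]
X = rng.integers(0, 2, size=(200, 3)).astype(float)
X[:, 1] = rng.integers(18, 70, size=200)
# The simulated outcome deliberately depends on the protected class,
# so the fitted model will be biased by construction.
y = ((X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.3, 200)) > 0.7).astype(int)

model = LogisticRegression().fit(X, y)

# Counterfactual flip test: same person, only the protected class changed.
person = np.array([[1.0, 30.0, 0.0]])
flipped = person.copy()
flipped[0, 2] = 1.0
same_decision = model.predict(person)[0] == model.predict(flipped)[0]
print("Decision unchanged when only the protected class changes:", same_decision)
```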

Pennington: So Emily, I have kind of a larger question about the era of misinformation we're living in right now. Do you have any tips for us on countering that? Tips for journalists? This is a tough time, and I think things are getting better, but the disinformation that's passed along in social media is just so rapid and so fast. In your work and in your writing, how do you take on this challenge?

Hadley: It's been a really tough year and a really, really tough challenge, especially because data is often at the core of it. People put a lot of faith in statistics and numbers, and I see all sorts of numbers being associated with things that are blatantly untrue or false, and people will latch on to a number and use it to justify, you know, why they're not getting a particular vaccine, or why they're not showing up to vote, or all sorts of different things. It can be really, really challenging, as you pointed out, not just to tell true news from fake news, but also to tell what is an accurate statistic and what is valid. A lot of it has to do with education. I'm a big proponent of basic statistics and data science education for all students before college. I think having the ability to look at a piece of news that's come out and understand, well, who is the sample? Where did this number come from? Who funded this study? That type of information and that type of critical thinking is so important to stop that sort of reactionary moment where you're like, oh my gosh, that statistic is so terrible, now it's going to totally change my behavior. And then from a news perspective as well, I think, you know, journalists, I'm not a journalist by training, and I don't know how much training journalists have in statistics, but for the same reasons: what are you putting out? How strong is the data? What should you trust? And how do you help your readers and listeners make better-informed decisions? Getting back to the idea that there's not really always one objective truth: two people can be right at the same time with very different results, and the interpretation matters a lot, as does the context in which you're using them. So I think that we're hopefully going to move towards a space where we can put a little bit more trust in news and data, but it'll take kind of a village to come together and agree on that.

Pennington: You probably won't hear any arguments here against promoting statistical and data literacy. Yeah, I think you have a very sympathetic audience here.

Bailer: You know, one thing is, I was looking through some of your work, and I really liked this idea of thinking like an adversary as being part of building a model. And I see that also in terms of what I'm teaching about modeling; I've not thought about doing that in my classes, but that's kind of an interesting idea. Can you talk a little bit about what it means to think like an adversary when you're building a model?

Hadley: Certainly, yeah. This actually came from my undergrad experience; I was a statistics major and a public policy major, and we had this saying in our public policy major that good intentions were not enough. Just because you have good intentions about what you're building, or what you're using your data for, it does not mean that somebody won't come along and, honestly, do something pretty terrible with it. The classic example that I give in this area is the Twitter bot; I'm pretty sure it was Microsoft, it might have been Twitter, that put out this AI bot that people were supposed to be able to respond to a couple of years ago, and it had a female persona. The 4chan portion of the internet found out about this, and because it was AI and machine learning, it was learning from the information that it was being fed. So a whole bunch of users just started feeding it a lot of antisemitic and racist and other bits of information, and within about 24 hours you had a very racist, homophobic, antisemitic chatbot. That's one of those things where the intentions were probably really good, like, oh, look how far automated natural language processing and chatbots have come, this is very cool, which it is. But the adversarial piece is that a section of the internet decided they wanted to take this for their own purposes. So definitely, every time we build a model, it's thinking about both how users who are inputting data into the model could mess with the model itself, and also how the output could be used to justify something that we're not interested in justifying.
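The chatbot failure she describes is the kind of thing an adversarial mindset tries to anticipate. Below is a minimal, hypothetical Python sketch of one such safeguard: screening and rate-limiting user submissions before they ever become training data. The blocklist, thresholds, and function names are invented placeholders; real content moderation and data curation are far more involved.

```python
# Hypothetical safeguard: screen and rate-limit user input before a
# learning system ever trains on it. Placeholders only, not a real system.
from collections import Counter

BLOCKED_TERMS = {"slur1", "slur2"}       # placeholder for a real moderation list
MAX_SUBMISSIONS_PER_USER = 5             # limit how much one user can steer the model

submission_counts = Counter()

def accept_for_training(user_id: str, text: str) -> bool:
    """Return True only if a user submission passes basic adversarial screening."""
    if submission_counts[user_id] >= MAX_SUBMISSIONS_PER_USER:
        return False                     # one user cannot flood the training data
    if any(term in text.lower() for term in BLOCKED_TERMS):
        return False                     # obviously abusive content never enters training
    submission_counts[user_id] += 1
    return True

print(accept_for_training("user_a", "What a nice day"))   # True
print(accept_for_training("user_a", "contains slur1"))    # False
```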

Pennington: I'm curious, as you've evolved and developed on your journey in thinking about this and becoming an anti-racist data scientist, what advice do you have for those of us who are teaching the next generation, whether it's the next generation working in quantitative departments, in data science or statistics, or in journalism?

Hadley: For sure. Well, the first piece was, it was really important for me to recognize that I am a white woman, but as a white woman I have a responsibility to step up and do something about this every day. I think sometimes this work is placed on our colleagues of color in a way that burdens them with a lot of labor, and so it's stepping up and saying, I am somebody and I can do something about this, and then also following up with: it requires me to do something every day. I will never get to the point where I've checked off this box and I'm like, oh my gosh, I'm anti-racist. No, it's a conscious decision that I make every single day, and one I have to consistently reevaluate and really accept feedback on as well. The willingness to make mistakes, to know that I'm going to be wrong and sometimes I'm going to have behaved incorrectly, and to accept that I'm constantly learning is probably my biggest piece of recommendation. It's not easy, but it's so worth it to be building this field where we're moving towards a space where we're really working to be anti-racist and building the society we want to see. And I would say, kind of encouraging students in particular and your coworkers, the reason I do these conversations is I want other people to get excited about doing this work. I'm certainly not the only person doing this work; there's a large community that's existed long before I got into this. And I think making that active choice to have this be a part of your career is something I'd really like to see more people do.

Pennington: Well, that's all the time we have for this episode of Stats and Stories. Emily, thank you so much for being here today.

Hadley: Thank you so much for having me.

Bailer: Thank you, Emily.

Campbell: Thanks, Emily.

Pennington: Stats and Stories is a partnership between Miami University's Departments of Statistics and Media, Journalism and Film and the American Statistical Association. You can follow us on Twitter, Apple Podcasts, or other places you can find podcasts. If you'd like to share your thoughts on the program, send your email to statsandstories@miamioh.edu or check us out at statsandstories.net, and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.