Tarak Shah is a data scientist at HRDAG, where he cleans and processes data and fits models in order to understand evidence of human rights abuses.
Prior to his position at HRDAG, he was the Assistant Director of Prospect Analysis in University Development and Alumni Relations at the University of California, Berkeley, where he developed tools and analytics to support major gift fundraising.
Episode Description
Protesters have taken to the streets across the U.S. this summer in order to fight back against what they see as an unjust criminal justice system – one that treats People of Color in prejudicial and violent ways. Racial bias in policing has long been a concern of activists, but there’s an increasing focus on other ways racial bias might influence decisions made in America’s courts and police stations. The statistics related to race and the criminal justice system are the focus of this episode of Stats and Stories.
Timestamps
What spurred this research? (1:33)
What is a risk assessment model? (2:12)
What are these tools supposed to do? (4:00)
What is fairness? (5:18)
What did you learn? (10:12)
What is the takeaway for the layperson? (15:20)
What’re some parallels to this work? (19:35)
How do you make this interesting? (22:07)
What’s the flow of your work, for reproducibility? (25:00)
Full Transcript
Rosemary Pennington: Protesters have taken to streets across the U.S. this summer in order to fight back against what they see as an unjust criminal justice system. One that treats people of color in prejudicial and violent ways. The concern over racial bias in policing has long been something activists were thinking about, but there’s an increasing focus on the other ways racial bias might influence decisions made in America’s courts and police stations. The statistics related to race in the criminal justice system are the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I’m Rosemary Pennington. Stats and Stories is a production of Miami University’s Departments of Statistics and Media, Journalism and Film, and the American Statistical Association. Joining me are regular panelists John Bailer, Chair of Miami’s Statistics Department, and Richard Campbell, former Chair of Media, Journalism and Film. Our guest today is Tarak Shah. Shah is a data scientist at the Human Rights Data Analysis Group, or HRDAG, where he cleans, processes, and builds models from data in order to understand the evidence of human rights abuses. He was the co-author of a report released last fall that examined whether a particular risk assessment model reinforces racial inequalities in the criminal justice system. Tarak, thank you so much for being here today.
Tarak Shah: Thank you for having me.
Pennington: Could you explain what spurred this particular bit of research into this risk assessment model and what your report found?
Shah: Sure. So, there’s been interest in these pretrial risk assessment models in particular for a little while now. Partly because of how much public opposition has grown to the money bail system. And because of that these other risk assessment tools have been proposed as more objective or more neutral alternatives to decisions by judges which may be considered biased. And this kind of fits into that atmosphere.
John Bailer: So, can you talk- just to take a step back, just to help fill in the gaps for people that are new to this? I mean- this idea of what happens in a pretrial process. And then as- sort of to jump off of that, what does a risk assessment tool do in the context of this pretrial process?
Shah: Yeah. Excellent question. So, in general when a person is arrested they- a court must decide and depending on what state you live in they’ll have either one to two days to make this decision. Whether you can go home while you await the beginning of your trial, or whether they need to take some kind of action, whether that’s detention or some kind of supervisory condition in order to ensure that you will appear for your court date and or that you will not be a danger to your community during the time that the trial hasn’t happened yet. So, those decisions are being made, as I mentioned before, the judges have historically relied on bail to make sure that people appear for their court dates, there’s increasing recognition that that disproportionately harms people who are poor and so there’s been interest in alternatives, but the basic kind of decision that either a judge or some kind of decision-making system is required to make has to do with usually one or both of those two elements that I mentioned. So, either whether a person is going to be a danger to their community or whether they are going to flee the jurisdiction and escape accountability. So that’s kind of the decision in front of us, historically made by judges. More and more judges are getting information from these risk assessment tools as just additional information to make that decision.
Pennington: So, are these tools like a technology that they’re relying on to help them understand what possible behaviors of particular defendants might be?
Shah: Yeah exactly, and so these are tools that take in data about characteristics of the arrested person. So, things like their age and sex and other demographic information, as well as things like their arrest history or other kinds of encounters with the court system. And I mentioned those kinds of two high-level principles, like danger to the community and risk of flight. In practice, those are kind of fuzzy concepts that need to be made concrete when we’re talking about actual measurements, and so the way those things get measured is, in terms of danger to the community, those who developed these risk assessment tools look at rearrests. So, was a person who was going to face trial- were they rearrested before their trial concluded? Sometimes that’s narrowed down somewhat. So maybe in a given jurisdiction, they’ll only look at felony rearrests or violent rearrests, and there’s all sorts of logic that goes into what counts in each of these categories. Similarly, with flight risk- that’s also a little bit fuzzy in some ways, and so what we can measure is failure to appear for a court date.
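As a rough illustration of the kind of model Shah is describing, the sketch below fits a toy logistic regression that maps a defendant’s characteristics to a predicted probability of rearrest or failure to appear. The column names and data here are synthetic placeholders, not the inputs or outputs of any real tool.

```python
# Illustrative sketch only: a toy pretrial "risk assessment" model of the kind
# described above -- a logistic regression mapping defendant characteristics to
# an estimated probability of rearrest or failure to appear before trial.
# All column names and data are synthetic, not any real tool's inputs.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "prior_arrests": rng.poisson(1.5, n),
    "prior_fta": rng.binomial(1, 0.15, n),   # prior failure to appear (0/1)
})
# Synthetic outcome: 1 = rearrested or missed a court date before trial ended.
logit = -2.0 + 0.03 * (40 - df["age"]) + 0.4 * df["prior_arrests"] + 0.8 * df["prior_fta"]
df["bad_outcome"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

features = ["age", "prior_arrests", "prior_fta"]
model = LogisticRegression().fit(df[features], df["bad_outcome"])

# The "risk score" a judge would see is just the predicted probability,
# often bucketed into low / medium / high.
df["risk_score"] = model.predict_proba(df[features])[:, 1]
print(df["risk_score"].describe())
```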
Richard Campbell: I was interested that you talked about fuzzy right there, and you talk about a definition of fairness, which seems like- that’s not something I hear statisticians talking about very much, having done a hundred and fifty shows- No offense, John. But how do you- what is the definition of fairness in risk assessment modeling?
Shah: Yeah, that is an excellent question and in fact, there are multiple definitions of fairness in this context and I will give a couple of examples. So one is- maybe just going back to what these models look like, they are very often logistic regressions or some other kind of predictive model which will classify people into yes, they will re-offend, or no, they will not, or something like that. And so, within that context, there’s kind of basic notions that I think anybody without any kind of statistics background might be able to pick up. Things like demographic parity. So, do black individuals who appear before the court- are there similar decisions made about them versus white individuals? So, in practice there’s- we tend to rely on somewhat more complicated definitions of fairness. The idea being that- well, let me give you an example, so in addition to demographic parity, there’s things like equal false-positive rates. So, the argument here is that the biggest cost of one of these risk assessment decisions is when somebody has to be incarcerated or otherwise supervised as a result of that score. And so a false positive here is somebody who is determined to be high-risk by this tool, but who in fact would not have gone on to re-offend or miss their court date if they were left to go home, so that’s one example. Another example is just kind of equal calibration across race groups. So that means, like, if your logistic regression puts out the number 0.47, so like a 47% likelihood that you’re going to re-offend or something- among the white people who got that score versus the black people who got that score, of everybody who got a 0.47 in the white group, did about 47% of them re-offend, versus similar numbers for the black group? There is- so we have equal calibration, we also have equal false-positive rates. We also have a similar notion of equal false-negative rates- that is, people who did go on to re-offend or miss their court date; how often were they actually labeled high risk or low risk, and are those rates equal across race groups or other protected characteristics? The kind of challenge, well one challenge, is just what I mentioned, that there are multiple different definitions and there’s not an official correct definition, and in addition, for the examples I just happened to give, there is an important result in fairness which is that, under most realistic circumstances, they are mutually incompatible. That is, you can’t meet all three of them at the same time. Which I think makes sense from a non-statistical perspective. People have different notions of what fairness means, I think. But it does make it challenging to talk about fairness in these contexts, and I just want to kind of add, so everything I’ve been talking about is kind of fairness within the system defined by the model. Like, what are the outcomes that are measured in the model, and what were the data inputs that went into the model- and they’re kind of taking those data as a given. A separate level of analysis here for fairness is whether or not there is bias in the data itself. Whether these measures are fair measures of the thing that we’re interested in measuring. And so that kind of goes back to what I said before, where I said we have these notions of danger to the community or flight risk, but in practice, when we’re talking about creating a regression model, we need these measures, and what we have is rearrests or failure to appear for court, and often there’s problems with both of those measures.
A lot of people fail to appear not because they fled the jurisdiction but because they forgot, or the court date got changed, or they moved, so the postcard that was sent to them never reached them. And similarly, with re-arrest, the assumption there, if you’re using that data, is that an arrest is an unbiased measure of criminality or dangerousness, and there’s a lot of evidence that that’s not the case.
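The group-level fairness checks Shah lists can be made concrete with a short sketch. The function below computes demographic parity (flag rates), false-positive and false-negative rates, and a simple calibration comparison for each race group. It is a minimal illustration, not HRDAG’s code, and it assumes a data frame like the toy one above, extended with a group label column.

```python
# Minimal sketch (not HRDAG's actual code) of the group-fairness checks described
# above. Assumes a data frame with a binary ground-truth outcome, a model score
# in [0, 1], and a group label column (the toy `df` above would need a "race"
# column added for this to run).
import pandas as pd

def group_fairness_report(df, score="risk_score", outcome="bad_outcome",
                          group="race", threshold=0.5):
    rows = []
    for g, sub in df.groupby(group):
        flagged = sub[score] >= threshold           # labeled "high risk"
        pos, neg = sub[outcome] == 1, sub[outcome] == 0
        rows.append({
            group: g,
            # demographic parity: how often is this group flagged at all?
            "flag_rate": flagged.mean(),
            # false-positive rate: flagged high risk but did NOT reoffend / miss court
            "fpr": (flagged & neg).sum() / max(neg.sum(), 1),
            # false-negative rate: labeled low risk but DID reoffend / miss court
            "fnr": (~flagged & pos).sum() / max(pos.sum(), 1),
            # calibration: compare the mean score with the observed outcome rate
            "mean_score": sub[score].mean(),
            "observed_rate": sub[outcome].mean(),
        })
    return pd.DataFrame(rows)

# Example use (once a group column exists):
# print(group_fairness_report(df))
# Under realistic base-rate differences, equal flag rates, equal error rates,
# and equal calibration cannot all hold at once -- the incompatibility result
# mentioned in the conversation.
```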
Bailer: So, what did you learn? Let’s get back again to kind of the punchline to the work that you’ve done. You’ve helped us frame what a risk assessment model is and how that’s being used in the context of establishing or evaluating fairness.
Shah: So, in this particular research, we were looking at a tool used in New York City to determine eligibility for a supervised release program. This is kind of an alternative to being detained while you await your trial. So, the idea was that individuals who get a low-risk score would become eligible for this supervised release program, whereas those who received a high score would not, and the alternative is that you are detained. As I mentioned, we looked at some of these different fairness measures, such as false-positive rates and accuracy across race groups, and the particular model met some of those but not others. So it had much higher false-positive rates for black and Hispanic people than it did for white people, and also, in terms of demographic parity, it was much more likely to give black individuals a higher risk score than white individuals. But one kind of deeper thing that came out of that was a couple of things that we noticed about the data and the process used to build the model itself. So, in terms of the data, we had that question about whether- in this case felony re-arrest was the outcome variable that the developers were modeling in their regression- whether felony re-arrest is a fair measure of dangerousness. And that was an important question for us, because when we looked into it a little bit, the training data for this model had all been collected during the height of New York’s Stop and Frisk program. And sometime after that data was collected, New York courts themselves had determined that this was an unconstitutional program because it was disproportionately applied against black and Hispanic people. Something like 87% of people who were stopped under Stop and Frisk were black or Hispanic. So unfortunately for us, I guess, the arrest data that we got, that was the training data for the model, did not contain information about whether each arrest was a result of Stop and Frisk or anything else. However, we were able to kind of look into- like, make some inference about how many of these arrest outcomes could have been affected by the Stop and Frisk program. So we looked back at what the most common arrests resulting from a Stop and Frisk stop were, and they were either drug-related or weapons-related, so then we went back to our data and found that just under 40% of the arrests in our outcome data were either drug-related or weapons-possession related. So, we can’t say that all of those were Stop and Frisk, but we can guess that a good number of them were, because this was during that program. We also found- so one of the things that we had to do when we were writing this report was attempt to recreate the scoring model that’s used. So we kind of read the paper, and it goes through the logistic regression and the variables and so forth, and we had the same training data as the original developers did, so we were trying to replicate their model from the beginning. And we ran into some challenges. We had the same train and test split as the developers did, and we got the same coefficients for each of those different pieces of the data, and that all made sense, but then when we looked at the final model that New York was using, the point scores that they gave for each characteristic did not match up with the coefficients that we found in our regression, and in fact appeared to be kind of picked and chosen from the different splits of data.
So there were like some coefficients that were found from fitting the model to the training data, some coefficients that came from the test data, and some that came from fitting the model to the entire data set, and that, I don’t think we would have known if we had not tried to replicate the model to begin with. Once we saw that, we contacted the developers and tried to get more information about what was going on, and the best that we can find out is that there was like a decision process by a committee that was like, oh this looks good, this doesn’t look good, and they kind of worked their way to scores. And I point that out because that’s a little bit separate from the types of fairness that we were talking about, but I think these tools often get packaged as like an objective or neutral thing, and here we see an illustration that that packaging is really hiding a lot of political decisions that are going on.
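The replication exercise Shah describes can be sketched in a few lines: refit the documented regression on the same train/test split, then compare the fitted coefficients against the point scores actually deployed. Everything below is hypothetical- it reuses the toy `df` and `features` from the first sketch, and the `published_points` values are made-up placeholders, not the NYC tool’s numbers- so it only shows the shape of the check, not the real model.

```python
# Hedged sketch of a replication check: refit the regression on train, test,
# and full data, then line the coefficients up against the deployed point
# scores. Reuses the toy `df` and `features` from the earlier sketch; the
# "published_points" values are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["bad_outcome"], test_size=0.3, random_state=42)

fit_train = LogisticRegression().fit(X_train, y_train)
fit_test = LogisticRegression().fit(X_test, y_test)
fit_all = LogisticRegression().fit(df[features], df["bad_outcome"])

published_points = {"age": -0.02, "prior_arrests": 0.35, "prior_fta": 0.90}  # hypothetical

comparison = pd.DataFrame({
    "train_coef": fit_train.coef_[0],
    "test_coef": fit_test.coef_[0],
    "full_coef": fit_all.coef_[0],
    "deployed_points": [published_points[f] for f in features],
}, index=features)
print(comparison)
# If the deployed points track no single column -- some match the train fit,
# others the test fit or the full fit -- the scoring table was not produced by
# any one regression, which is the kind of discrepancy described above.
```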
Pennington: You’re listening to Stats and Stories, and today we are talking to Tarak Shah of the Human Rights Data Analysis Group. So, I’m going to go back to that sort of thing you were saying there at the end. So why should someone who is not a statistician, someone who is not engaged in activism around criminal justice- you know, what is the takeaway for just the layperson, as far as it relates to this particular report? Why should my mother or my best friend or my colleagues here care about what you found in this report?
Shah: Great, so I- that’s a good question, I would maybe think about that in a couple of ways. So one thing that I would think about, going back to what I was just saying, is that often data-informed tools are presented as kind of a more objective alternative to some other procedure, and I think it’s important for all of us, regardless of where we’re working, to be somewhat critical of that, because often that’s just a way of kind of sneaking in whatever kind of biases we already had in through this kind of packaging. And I think that applies not just to incarceration decisions but to any kind of automated decision-making systems. I also think just- maybe a lot of us are concerned about policing and incarceration right now, so as somebody who is concerned about that, especially with pretrial incarceration- and these are people who are presumed to be innocent by the legal system- I think worrying about how those people are treated and how decisions about their liberty are made is an important thing on its own. And in particular, I mentioned that the risk assessment tools are often positioned as neutral or objective alternatives to judges’ decisions, which are known to be biased- and there is a lot of evidence that they are- and so just to understand that we don’t get to wash our hands of the bias just by putting it through these systems. And maybe, hopefully, knowing that would lead people to think a little bit more- so I don’t know if this is helpful or not, but there’s kind of two ways to look at this. Like, are the decisions fair in terms of equal treatment under the law for different race groups? Another level of wondering about this is whether there’s kind of a larger issue here that’s at play, and sometimes I feel like the effect of these risk assessment tools is to kind of push through an idea that there’s a fairer way to incarcerate people who are presumed to be innocent. And so I hope that seeing that the biases we see everywhere else in our society also sneak their way into the models and the data that we use to make these kinds of data-based tools will force people to think about that larger picture and about what it actually means to fairly incarcerate innocent people.
Pennington: And it seems like it’s in line with a lot of research around technology that has shown that we have believed these technologies might, as you suggested, give us the get out of jail free card, to use that unfortunate phrase, right? When it comes to this issue of bias- oh, we’ll allow the AI or the technology to handle everything because it’s not biased- and what we’re increasingly coming to realize, whether in search engines or video games where you have certain avatars, is that the bias is built in because it’s not been challenged outside of technology. It would seem like your report is kind of suggesting that when it comes to this very important issue of incarceration of people before they are ever actually in court, right before they go to trial, that that bias has also been dealt into that technology. So, it feels like it’s within the framework of our larger understanding of the way AI has maybe not been as critically engaged with as other technologies, because we see them as these arbiters of truth in a way that perhaps they’re not.
Shah: Yeah, I think that’s right, and we see- I mean the data that we have are generated by existing social processes, and the existing social processes that we have are not at all free from racial bias. And so that works its way into the data and, like you say, into all these different AI systems. And of course, in this case, making very high stakes decisions about people’s freedom. Yeah, so I think that’s definitely an issue.
Bailer: So, what if I- I’m going to hire you now to build a risk assessment tool for me. So, as you think about this- and you know there are cases where we see this- in the banking industry, there are certain variables that are just not acceptable for use in prediction as inputs to the models. Are there some sort of parallels to this? I mean, if you were thinking about this as saying, I would like to try to build as fair a risk assessment tool as possible, what would be some of the steps that you would think about and that you would need to consider in doing so?
Shah: So there are like- there have been people who’ve kind of pointed out or asked that, or kind of defined fairness as the absence of certain predictors. I’m less familiar with the banking industry, but I think that they just don’t have race as a variable and that’s how they kind of deal with that. I assume this is true in banking, and it’s definitely true in criminal justice data, that there are lots of things that correlate along with race, and so you can’t really remove race from your predictors. You can remove that one column, but you can’t remove it from the zip code or from previous arrest history and stuff like that. So, I would start with that. Like, it’s not enough to sort of close your eyes and hope that you don’t see race, because it’s there. And then, you know, I think it would be important to kind of have- like we talked about these technical definitions of fairness within the model and the predictions it makes, and whether they are treating people equally across races. As I mentioned, there are multiple different definitions and some of them might be mutually incompatible, so it’s important, if you are going to go down this route, to pick one that makes sense for your application. And so I kind of- and I don’t have the absolute answer to this, but I kind of mentioned that one way to think about this is how the cost of incorrect predictions is distributed across races, and whether that cost and that burden is shared equally or not. And so in the case of- again, I don’t want to come down on the side of one version or the other, but in the case of these pretrial risk assessment tools, that seems to suggest something like false-positive rates as a more important thing to look at, maybe, because when you’re considered high-risk, that’s when you’re really paying the high cost of the predictions.
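A small experiment illustrates the proxy problem Shah raises: drop the race column, then see how well the remaining “race-blind” features predict race anyway. The data below are synthetic stand-ins (a segregation-driven zip-code feature and an enforcement-driven arrest count), chosen only to show the mechanism; they are not drawn from any real data set.

```python
# Sketch of the proxy problem: even with the race column dropped, other inputs
# can reconstruct it. A classifier is trained to predict the (held-out) race
# label from the remaining features; accuracy well above chance means the
# "race-blind" model still has access to race indirectly. Synthetic data only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 5_000
race = rng.binomial(1, 0.5, n)                          # 0/1 stand-in for race group
zip_feature = race * 2 + rng.normal(0, 0.5, n)          # proxy: residential segregation
prior_arrests = rng.poisson(1.0 + race)                 # proxy: biased enforcement
X = pd.DataFrame({"zip_feature": zip_feature, "prior_arrests": prior_arrests})

acc = cross_val_score(LogisticRegression(), X, race, cv=5, scoring="accuracy").mean()
print(f"race predicted from 'race-blind' features with accuracy {acc:.2f}")
# Accuracy far above 0.5 means removing the race column did not remove race
# from the model's effective inputs -- "closing your eyes" is not enough.
```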
Campbell: Some of your findings are really important. Even the phrase risk assessment modeling- I’m imagining- how do you get a journalist interested in that, and in what the implications of it are? Because I think these kinds of findings are really important for people to understand, especially now. And have you seen any good coverage of your findings, and how do you interest a journalist in this? I mean, the way that I would go about it, of course, is to find somebody who was really treated unfairly in the system and tell that story, but then how do you make sure that what you found in the data is understandable to a journalist who could then communicate that to the public?
Shah: So, one thing before I go into my answer- this is like an interesting question, because the field that I’m talking about, evaluating fairness in terms of these technical definitions and so forth- I don’t know if this is where I started, but definitely my introduction, and a lot of people’s introduction to it, was actually through journalism. It was a story in ProPublica about the COMPAS risk assessment tool that kind of set off a lot of this study. So, in some sense- like in that case, the journalists were ahead of me. I was learning from them. But I think you pointed out the language itself- risk assessment either sounds boring or technical, but it’s also kind of a useful entry point, because I think it’s a term that frames the decisions being made, or the scope of what decisions can be made. And so when we talk about pretrial decision making, if we frame this decision in terms of risk assessment, then the person in front of you- all they are is a possible risk. And there’s been some push back against that in various places, but I don’t know- so a different way to help think about what I mean when I say risk assessment frames the decision in a very particular way: another alternative might be something like a needs assessment. Like I mentioned before, one of the reasons a lot of people miss their court dates is not because they got on a private jet and went to the Bahamas, it’s because they don’t have a home address, so they’re not able to receive updates about when their court date is. They couldn’t get childcare, or time off work, or they forgot. And so all these things are preventable in all of these other ways, but if you only see the decision as a risk, you’re only thinking of it in terms of, well, regardless of what the reason the person didn’t show up is, that’s the only thing I can worry about. So, kind of opening up that decision to divert- not allowing it to necessarily be framed in terms of risk from the get-go- is a useful entry point, I think.
Bailer: You know, that’s almost impossible- well, the question might be short, but the answer might be long. I don’t know the risk of that response. You know, I guess one of the things that I have heard- and I’m going to sneak in here anyway and just make Rosemary mad at me, because she can’t hit me because we’re doing this all remotely.
Pennington: Right, I would never anyway.
Bailer: No that’s true. So, there’s a theme that you talked about here and that’s the transparency of research that came out. You know the importance of being able to reproduce what was done. And it sounds like you had to do some forensic analysis to figure out what occurred in this previous work. So, what that suggests is that you value this in what you do in your current work. So I was wondering in deference to my friend Rosemary here if you could give us a quick summary of what is the flow that helps ensure in your work that you make it reproducible and that you have accountability baked into it that others could follow.
Shah: Thank you for asking that question. It’s important to lots of people right now- like, lots of scientists are worried about reproducibility- and I think it’s particularly important to those of us who work in human rights, because it’s possible that- well, you want to get it right, first of all, and also you want it to stand up to scrutiny, because you may have results that don’t make people that happy, and so you need to be prepared for all kinds of scrutiny and attacks, and one way to do that is to be very confident in the work that you’ve done. And I found that having specific ways of working and specific structures for how I manage a project are one of the best ways that I can kind of guarantee those sorts of results. In particular, the way we do things at HRDAG is we use this system called principled data processing, where we very explicitly set up our pipeline so that we work on individual tasks in the pipeline. For instance, importing data from an Excel file or something is one task, maybe standardizing code values within the columns is its own task, and reproducing a regression model that we found in a paper is its own task. And kind of having- being able to do those things distinctly, so you’re not worried about whether the entire thing is reproducible or correct, but, like, is this link in the chain strong enough, and then moving on to the next piece. That’s helped us a lot. And then we kind of manage everything technically with these files, so I had to learn a little bit of computer engineering as part of this job. But I think starting with that idea of breaking things up into tasks that can be tested individually means that you have a little bit more confidence once the project starts to get bigger and more unwieldy, and you’re not worried about the code values when you’re working on the model, because you’ve already kind of tested those in an earlier step.
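Shah describes the workflow in terms of small, independently verifiable tasks. The Python sketch below only illustrates that idea- declared inputs, declared outputs, one step per task- and is not HRDAG’s actual tooling (their convention is built around per-task directories rather than a single script); file paths, column names, and the demo data are placeholders.

```python
# Illustrative sketch of the task-per-step idea behind "principled data
# processing": each task reads only its declared inputs, writes only its
# declared outputs, and can be run and checked on its own. Paths and the
# demo data are placeholders; reading/writing Excel requires openpyxl.
from pathlib import Path
import pandas as pd

def task_import(raw_xlsx: Path, out_csv: Path) -> None:
    """Task 1: import the Excel file and write a flat CSV, nothing else."""
    pd.read_excel(raw_xlsx).to_csv(out_csv, index=False)

def task_standardize(in_csv: Path, out_csv: Path) -> None:
    """Task 2: standardize code values in one column; check the result."""
    df = pd.read_csv(in_csv)
    df["charge_type"] = df["charge_type"].str.strip().str.lower()
    assert df["charge_type"].notna().all(), "unexpected missing codes"
    df.to_csv(out_csv, index=False)

def task_fit_model(in_csv: Path, out_txt: Path) -> None:
    """Task 3: reproduce the published regression from the standardized data."""
    # (model-fitting code would live here, reading only in_csv)
    out_txt.write_text("coefficients: ...\n")

if __name__ == "__main__":
    work = Path("pipeline")
    for d in ("import/input", "import/output", "standardize/output", "model/output"):
        (work / d).mkdir(parents=True, exist_ok=True)
    # Create a tiny placeholder "raw" file so the sketch runs end to end.
    pd.DataFrame({"charge_type": [" Felony", "misdemeanor "]}).to_excel(
        work / "import/input/raw.xlsx", index=False)
    task_import(work / "import/input/raw.xlsx", work / "import/output/raw.csv")
    task_standardize(work / "import/output/raw.csv", work / "standardize/output/clean.csv")
    task_fit_model(work / "standardize/output/clean.csv", work / "model/output/summary.txt")
```

Because each link in the chain is checked on its own, a problem in the model-fitting step can be investigated without re-auditing the import or standardization steps, which is the confidence Shah describes.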
Bailer: Thank you.
Pennington: Well, that is all the time that we have for this episode. Tarak thank you so much for being here today.
Shah: Thank you.
Pennington: Stats and Stories is a partnership between Miami University’s Departments of Statistics and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter, Apple Podcasts, or other places where you can find podcasts. If you’d like to share your thoughts on the program send your emails to statsandstories@miamioh.edu or check us out at statsandstories.net and be sure to listen for future editions of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics.