Multiple Systems Estimations Explained | Stats + Stories Episode 84 / by Stats Stories


Megan Price is the executive director of the Human Rights Data Analysis Group (HRDAG), and designs strategies and methods for statistical analysis of human rights data for projects in a variety of locations including Guatemala, Colombia, and Syria. She has contributed analyses submitted as evidence in two court cases in Guatemala and has served as the lead statistician and author on three UN reports documenting deaths in Syria.


John Bailer: I'd like to welcome you to today's Stats and Short Stories episode. Stats and Short Stories is a partnership between Miami University and the American Statistical Association. I'm John Bailer. I'm Chair of the Department of Statistics at Miami, and I'm joined by my colleagues, Rosemary Pennington, Professor in the Department of Media, Journalism and Film, and Richard Campbell, Chair of the Media, Journalism and Film department. We're fortunate to be joined today by guest Megan Price, co-founder of the Human Rights Data Analysis Group, HRDAG. Megan, welcome.

Megan Price: Thank you.

Bailer: Hey, Megan, you've been using some really interesting methods to try to estimate the size of hard-to-count populations. One of these methods you've described is multiple systems estimation. Can you give a short description of the mechanics of how such a method works?

Price: Sure. So, there are a couple of different metaphors that we usually use to think about the intuition behind this method. And one of them is this idea that you walk down a hallway and there are two rooms that are all dark, and the doors are closed, and you're trying to guess the size of each room. And you open one door, and you throw a handful of bouncy balls into the room, and you listen to hear the balls bounce off of the walls and off of each other. And in that room, you hear a lot of bounces. And then you walk down the hall to the next door, and you repeat the experiment. But in that room, you hear way fewer bounces. And intuitively, you would infer that that second room must be bigger; it must have bigger dimensions because the balls bounced off of each other less. And in essence, that's exactly what's happening when we use multiple systems estimation to estimate the size of a hard-to-reach population, where instead of balls, we have lists of named individuals or events, and instead of rooms, we have the size of the population that we're after. And the bounces of the balls are the overlaps between those lists. They are the number of times that the same person or the same event is recorded by those multiple sources. And through some very clever algebra and some very clever modelling, we can estimate the size of that population we're interested in.
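The simplest version of the "clever algebra" Price mentions is the classic two-list (Lincoln-Petersen) estimator, sketched below in Python. This is only an illustration of the overlap intuition, not the modelling HRDAG actually uses; the function names and the counts are hypothetical. The fewer records two lists share, the larger the estimated population, just as fewer bounces suggest a bigger room.

```python
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Two-list estimate of total population size.

    n1, n2: number of records on list 1 and list 2
    m:      number of records that appear on both lists (the overlap)
    """
    if m <= 0:
        raise ValueError("at least one overlapping record is needed")
    return n1 * n2 / m


def chapman(n1: int, n2: int, m: int) -> float:
    """Chapman's variant, which reduces bias when the overlap is small."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Hypothetical counts: same list sizes, different overlaps.
many_overlaps = lincoln_petersen(200, 150, 60)  # lots of "bounces"
few_overlaps = lincoln_petersen(200, 150, 20)   # few "bounces"
print(many_overlaps, few_overlaps)
```

With these made-up counts, the list pair with the smaller overlap yields the larger population estimate.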

Rosemary Pennington: So, could you explain a recent situation where this kind of methodology has been used that maybe the public would be familiar with?

Price: Sure. One of the ways that hopefully the public is a little bit familiar with is actually just a few years ago, the US Bureau of Justice Statistics was looking at the reporting of members of society who had been killed by the police. And this is something that is required to be reported by various jurisdictions, and there are two different sources that try to record this information. One of them is an FBI database and one of them is the ARD. I'm blanking -- actually I don't know what the acronym stands for, but the Bureau of Justice Statistics basically took these two sources of information and said, "We know that they are imperfect and incomplete, but we have these two lists that are supposed to be recording the same information. And so, let's use multiple systems estimation to look at the overlap between these two lists and then to estimate the total number of people who have been killed during interactions with police." That was a report published by the Bureau of Justice Statistics maybe three or four years ago.

Bailer: So, what are some of the key assumptions if you're going to use a model like this?

Price: One of the key assumptions to the kind of method that we use is that the population is closed. So, that means that it's sort of a snapshot of the population: there are no births or deaths, in terms of thinking of population dynamics, and no in-migration or out-migration. So, in our particular use case, that assumption is reasonable. There's a whole other class of methods, for open populations, for when you want to account for those sorts of processes. One of the most challenging assumptions is that you are able to correctly and accurately identify the individuals or the events that are the same across these multiple sources. And that can be very difficult to achieve, especially in many of our applications, where we have relatively thin data, meaning that we have a person's name, and a date and location, and that's usually about it. In some of these other more commercial applications, they also rely on things like addresses or phone numbers. So, that assumption can be challenging. And again, there's a whole class of methodological development, looking at ways to incorporate the uncertainty in matching those records into the uncertainty that is eventually reported in the estimate. And then two other assumptions get a little bit squishier, I suppose, would be the technical term. One is the assumption that the lists are collected independently. You have to make that strong assumption in the case of work like I just described from the Bureau of Justice Statistics, where you only have two lists. But when you have three or more lists or sources, then you can start to try to model or estimate the relationship between those lists. You don't have to make quite so strong of an assumption about independence. And then you also need to assume that the probability that any individual or event is recorded on a list is equal within each source. And again, you have to make that assumption very strongly when you just have two sources. When you have three or more, you can start to model some of those differences.
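To see why the equal-recording-probability assumption matters with only two lists, here is a small simulation sketch in Python. Everything in it is an assumption made for illustration: the population size, the recording probabilities, and the helper names are invented, and the simple two-list formula stands in for the richer models used in practice. When half the population is easy to record and half is hard to record, the two-list estimate falls well short of the true size.

```python
import random


def two_list_estimate(on_a, on_b):
    """Two-list estimate from per-person indicators of list membership."""
    n1 = sum(on_a)
    n2 = sum(on_b)
    overlap = sum(a and b for a, b in zip(on_a, on_b))
    return n1 * n2 / overlap


def simulate(record_probs, rng):
    """Draw two independent lists; person i is recorded with record_probs[i]."""
    on_a = [rng.random() < p for p in record_probs]
    on_b = [rng.random() < p for p in record_probs]
    return two_list_estimate(on_a, on_b)


rng = random.Random(0)  # fixed seed so the run is repeatable
N = 10_000              # hypothetical true population size

# Everyone equally likely to be recorded (assumption holds).
equal = [0.45] * N
# Half easy to record, half hard to record (assumption violated).
unequal = [0.8] * (N // 2) + [0.1] * (N // 2)

est_equal = simulate(equal, rng)
est_unequal = simulate(unequal, rng)
print(round(est_equal), round(est_unequal))
```

Both scenarios record roughly the same number of people per list, yet the unequal case overstates the overlap (the easy-to-record people dominate both lists) and so understates the population.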

Bailer: So, are these assumptions that can be checked, or are these assumptions that you have to design in?

Price: A little of both. Again, when you have three or more sources, you can run some tests and use some different modelling approaches to try and check for them, and to do some different comparisons to try and see if you think those assumptions are being violated.
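One informal way to picture such a comparison with three sources is to compute the two-list estimate for every pair of lists and see whether they roughly agree. The Python sketch below does this with made-up record IDs; it is an assumed, simplified diagnostic for illustration, not a description of the tests or models Price's team runs. Strong disagreement between pairs hints that some lists are related to each other or that recording probabilities differ.

```python
from itertools import combinations


def pairwise_estimates(lists):
    """Two-list population estimate for every pair of named lists.

    lists: dict mapping a list name to a set of matched record IDs.
    """
    estimates = {}
    for (name_a, a), (name_b, b) in combinations(lists.items(), 2):
        overlap = len(a & b)
        if overlap == 0:
            estimates[(name_a, name_b)] = float("inf")
        else:
            estimates[(name_a, name_b)] = len(a) * len(b) / overlap
    return estimates


# Hypothetical record IDs; lists B and C share far more records with each
# other than either shares with list A.
lists = {
    "A": set(range(0, 60)),
    "B": set(range(40, 100)),
    "C": set(range(45, 105)),
}

estimates = pairwise_estimates(lists)
for pair, value in estimates.items():
    print(pair, round(value, 1))
```

Here the B-C pair gives a much smaller estimate than the pairs involving A, which is the kind of pattern that would prompt a closer look at dependence between those two sources.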

Bailer: Oh, outstanding. Well, it's been a pleasure to have Megan join us on Stats and Short Stories. Stats + Stories is a partnership between Miami University's Departments of Statistics and Media, Journalism and Film, and the American Statistical Association. Stay tuned and keep following us on Twitter or Apple Podcasts, and on Stitcher now. If you'd like to share your thoughts on our program, send your email to statsandstories@miamioh.edu and be sure to listen for future episodes where we discuss the statistics behind the stories and the stories behind the statistics.