The Data Privacy Landscape is Changing | Stats + Stories Episode 183 by Stats Stories

Claire McKay Bowen is the Lead Data Scientist of Privacy and Data Security at the Urban Institute. Her research focuses on comparing and evaluating the quality of differentially private data synthesis methods and science communication. After completing her Ph.D. in Statistics from the University of Notre Dame, she worked at Los Alamos National Laboratory, where she investigated cosmic ray effects on supercomputers. She is also the recipient of the NSF Graduate Research Fellowship, Microsoft Graduate Women’s Fellowship, and Gertrude M. Cox Scholarship.

Joshua Snoke is an Associate Statistician at the RAND Corporation in Pittsburgh. His research focuses on applied statistical data privacy methods for increasing researchers' access to data restricted due to privacy concerns. He has published on various statistical data privacy topics, such as differential privacy, synthetic data, and privacy-preserving distributed estimation. He serves on the Privacy and Confidentiality Committee for the American Statistical Association and the RAND Human Subjects and Protections Committee. He received his Ph.D. in Statistics from the Pennsylvania State University.

Episode Description

Privacy is becoming an ever more potent concern as we grapple with the reality that our phones, computers, and browser histories are filled with data that could reveal a lot about who we are – sometimes things we'd rather keep private. The privacy of data is not a new concern for researchers – in fact, whenever someone wants to work with people, oversight boards ask them how they'll keep data about participants private. But the data landscape for researchers and statisticians is changing, and that's the focus of this episode of Stats+Stories, with guests Claire McKay Bowen and Joshua Snoke.

Full Transcript

Rosemary Pennington: Privacy is becoming an ever more potent concern as we grapple with the reality that our phones, computers and browser histories are filled with data that could reveal a lot about who we are. The privacy of data is not a new concern for researchers and statisticians; in fact, whenever someone wants to work with people, oversight boards often ask them how they'll keep participants' data private. But the data landscape for researchers and statisticians is changing, and that's the focus of this episode of Stats+Stories, where we explore the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington. Stats+Stories is a production of Miami University's Departments of Statistics and Media, Journalism and Film, as well as the American Statistical Association. Joining me is panelist John Bailer, chair of Miami's Statistics Department; Richard Campbell is away today. Our guests are Claire McKay Bowen, Lead Data Scientist of Privacy and Data Security at the Urban Institute, and Joshua Snoke, Associate Statistician at the RAND Corporation in Pittsburgh. Last year they published a journal article about the issue of data privacy and statistics, as well as an article in CHANCE magazine. Claire and Joshua, thank you so much for joining us today.

Joshua Snoke: Thanks for having us.

Claire McKay Bowen: Thank you.

Pennington: Your CHANCE article focused on how statisticians should approach privacy in, quote, "a changing data landscape," end quote. What drove you to write that particular article?

Snoke: Well, I think a large part of that was a desire to give people a general overview of where things are at in the privacy world. The field of data protection, statistical disclosure control, has been around for a while, and it mostly resided in statistical agencies such as the Census Bureau, who were tasked with collecting, you know, surveys and then disseminating them. And then really in the last 20 or so years, we've seen a huge increase in the amount of data that's being collected. So there's just a lot more need for people to know what to do with their data and how to share it, and it's becoming a much more talked-about problem. We kind of wanted to give people a better understanding of where everything stands in this new data universe that we're living in.

John Bailer: So can we take a step back and just define some terms? What is privacy, what is confidentiality, and what do you mean by data when you put that word in front of those terms? Could you help with that glossary for those who are coming to this new area?

Bowen: Oh, that's a good question, especially because there are so many different fields that contribute to data privacy and confidentiality: you have social science, statistics, computer science, demography, economics, and the list goes on and on. A lot of times when people hear privacy, they think of their personal privacy, whereas with confidentiality they think of the sensitivity of the information contained in the data. This is why you hear the expression data privacy and confidentiality used together. But sometimes that's a little bit wordy, so often I'll just say I work on data privacy because it's short and catchy.

Snoke: Yeah, one thing I would add that's really helpful to clarify when we talk about data privacy, especially for statisticians who work in this field, is that we're talking specifically about wanting to release a data set. We're not talking about encryption, we're not talking about someone hacking into your phone or your account or that kind of stuff. We're really talking about releasing data and being concerned about what is in that data that's being put out there, and whether or not that information could lead to some kind of negative consequence for someone who's in the data.

Pennington: Joshua, a minute ago you mentioned this concept that comes up in your CHANCE article of statistical disclosure control. Could you describe what that is?

Snoke: Yeah, so that's really the general idea of what I just described: protecting the information that you're releasing. That's the idea of disclosure control, you're controlling against disclosures from the data that you're releasing. So this could be, you know, if you were to publish financial records, and maybe it was anonymous in the sense that it didn't have people's names, but it had other things in there, such as geography and age and gender and those kinds of things, and then someone could figure out who that person was and learn their financial information. That would be what we call a disclosure. And so this field is basically using statistical methods to minimize those disclosures.

Bailer: I really liked it when your article talked about the social utility of information. I love that turn of phrase, and that you're talking about the utility of information balanced against privacy. So, you know, what is the utility in this data, and what does it mean to talk about that utility as the counterweight to the privacy concern?

Bowen: I'm going to try to take a stab at this using some motivating examples, especially with what's going on right now. There is a natural tension between trying to do a societal service with data while still protecting the people who participate in it. Some of the examples we thought about, or used in some other articles, are healthcare data and also taxpayer data. Both of those data sets are very sensitive pieces of information, right? I wouldn't want somebody to know what doctor I was going to for some kind of disease, and if it's taxpayer data, that's a lot of information you have to file; just think about tax day and imagine all the information you put in there. But the data is so useful for figuring out ways to combat COVID-19: how can we distribute the vaccine more quickly? The taxpayer data could be used to be more targeted with our stimulus packages. I mean, this last one was a $1.9 trillion stimulus package, and if we had direct access to that data, we could be much more impactful with the distribution of that money. And so, how do you keep the data useful for researchers to use for those social good purposes, while still balancing that against privacy, because you have to protect so much information that could identify an individual?

Bailer: You know, we had someone on who had been involved with national statistical offices in other countries, who talked about these national statistics, these products that are being produced centrally, as being almost a human right for communities, to know about themselves, to make decisions about themselves and to do planning. And Claire, you made the example of thinking about targeting, you know, where should the pandemic vaccines be distributed, or other kinds of needs, perhaps being built into that. So, you talk about data release here, and people talk about synthetic data, the idea that one of the ways you try to protect the release of information is that what you release is not exactly the original information. So when you're releasing to researchers, or to people who want to do analyses to support decisions they're making, what are some of the features that might be built in when you're constructing these synthetic data sets?

Bowen: What's interesting about it is that it really boils down to how you think people are going to use the data. And actually, this ties into another thing, which your question made me think about, which is that a lot of this dovetails with the call for increased transparency in research. I think there's been a shift where people now want more access to the data that a paper or the results of a study were based on. So a lot of times what people are looking for in sharing data is to be able to replicate results; they're looking to replicate some specific analysis that was done on the data, with some specific variables. And so a lot of times, what you're trying to replicate with synthetic data is just whatever relationship they were looking at in the original study. That actually makes it easier, if you know exactly what you're trying to replicate and what people are going to use the data for. The tricky part is the more general case, where you've just collected some data and put it out there and you have no idea what people are going to do with it; then it gets a lot trickier.
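
To make the "replicate a specific relationship" idea concrete, here is a minimal R sketch with made-up data and hypothetical variable names: a regression is fit on the confidential data, and a synthetic outcome is simulated from that fitted model so the published analysis can be re-run on the release. A real synthesis would also model the predictors and include disclosure checks; this only illustrates the targeted-utility idea.

```r
# Minimal sketch (toy data): preserve one reported relationship, y ~ x.
set.seed(7)
n <- 1000
confidential <- data.frame(x = rnorm(n))
confidential$y <- 2 + 1.5 * confidential$x + rnorm(n)

fit <- lm(y ~ x, data = confidential)      # the relationship the study reported

synthetic <- data.frame(x = rnorm(n))      # toy: resample the predictor
synthetic$y <- predict(fit, synthetic) +   # simulate y from the fitted model
  rnorm(n, sd = summary(fit)$sigma)

coef(lm(y ~ x, data = confidential))
coef(lm(y ~ x, data = synthetic))          # estimates should be close
```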

Pennington: Do you have any examples that might help our listeners, because we have a mixed audience of people who listen to Stats+Stories, to understand what we're talking about, like what kinds of things could someone access if someone is not careful? I mean, you mentioned financial records and medical records, but how would someone go about identifying people? Are there notable cases where data has been released that wasn't protected, that can illustrate why this is something we have to be concerned about, even if data is collected in a way that seems anonymous?

Bowen: Yeah, that is a good question. I get asked that a lot too, especially when somebody says, well, if I remove just the names and the Social Security numbers, that should be sufficient. That might have been sufficient decades ago, and when I say decades ago, I mean prior to about 2000, before we all had little computers in our pockets. There's just so much data out there now that it's making it so much easier to do what is called a record linkage attack, where you try to gather information externally from the target data and use it to identify a person. An example that came out recently that's very timely: in 2019, the Census Bureau did an attack on their own data. They decided to take the 2010 census and use social media data, such as Facebook, to see if they could identify people in their data set, and digging into some of the nitty gritty of what they did, it was roughly one in six people they were able to identify with that social media data. I believe there was another study that followed up, I can't remember which university it was, but they found they could identify over 99% of US citizens by looking at just 15 variables, which were zip code, gender, and some other ones that you would think are pretty innocent, that you generally wouldn't care if somebody knew about. But if it's tied to your health care information, you would care a lot more.
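
Here is a minimal R sketch of the kind of record linkage attack Bowen describes, using made-up records: an "anonymized" release still carries quasi-identifiers (zip code, gender, birth year), and joining it to an external source that has names re-attaches identities to the sensitive values.

```r
# Toy "anonymized" release: no names, but quasi-identifiers remain.
released <- data.frame(
  zip        = c("45056", "45056", "15213"),
  gender     = c("F", "M", "F"),
  birth_year = c(1982, 1990, 1975),
  diagnosis  = c("diabetes", "asthma", "hypertension")  # sensitive attribute
)

# Toy external source (e.g., a public profile or voter file) that has names.
external <- data.frame(
  name       = c("A. Smith", "B. Jones"),
  zip        = c("45056", "15213"),
  gender     = c("F", "F"),
  birth_year = c(1982, 1975)
)

# Joining on the quasi-identifiers links names back to diagnoses.
merge(released, external, by = c("zip", "gender", "birth_year"))
```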

Snoke: Yeah, just to add on to that, it's interesting. There have been some high profile cases that really pushed the field forward on this, and I think the ones that are really eye catching are the ones that are so relatable. Probably the most famous one is the Netflix situation, where Netflix released all this data, and it was anonymized in the sense that it didn't have names and all that stuff, but it had what people watched and the reviews that they'd given. What people did was, they probably linked on some kind of geography, but they mainly linked people's reviews from Netflix to reviews that they had given on other sites like IMDb, and those other sites had published more demographic information, but people had used the same reviews on both sites. What this revealed is what people were watching on Netflix, which I think for many of us would be a very sensitive thing. Maybe not first on our minds compared with financial information and other stuff, but if someone published all of your Netflix watching online, anybody would feel uncomfortable.

Pennington: A lot of murder shows on mine. You're listening to Stats+Stories, and today we're talking about data privacy and statistics with the Urban Institute's Claire McKay Bowen and the RAND Corporation's Joshua Snoke. The interesting thing is, you raised the issue of social media, Claire. I do a lot of qualitative research, a little bit of both really, and a lot of my qualitative research is on social media. One of the things that I often bump up against, and I think other people do too, is this desire on the part of journals, even though I collect my data in a way that's supposed to be anonymous and, you know, change some of the posts so they aren't exactly what people said, there's a push: they want what people actually shared to show up in the journal article. And it becomes this tension between what the journal expects and wants, as far as the issue of transparency, but then also what I as a researcher have promised the people who participated in my study. And I wonder, I mean, mine is qualitative and very small scale, but whether those kinds of tensions are out there too around academic research, as far as transparency and privacy when it comes to this kind of data.

Bowen: Yeah, you're bringing up another great point that is highly debated, actually, across all the fields that are involved; I mentioned there were social scientists and demographers and so on. So basically there are the users of the data, call them data users or data practitioners, those who want access to the data for whatever purpose they want to use it for. They have their agenda, and they think, hey, I'm just doing the analysis, this is the output I want, and they need it either to make a policy decision or, maybe like you said, for a journal publication. But again, there's this other tension with the expectations placed on the users, and so there have been discussions that we need to explore more the research question of, at what point is it for the greater good? Is it okay to say, I know your privacy is very important, but for the societal good we need access to this data, and at least we're being transparent about that? Those are some of the questions that we need to have a discussion about and actually research. And I think we talked about this in our article, that there aren't enough of us researching in this field, so there are a lot of these questions where you think, why haven't we solved these things? Well, there's frankly just not enough of us to tackle each of these problems. And these questions, while they sound simple, like at what point should we favor the societal benefit, or should we keep the data private because it's our duty to protect people, may seem like simple questions, but they're very hard problems to tackle.

Bailer: So, you know, one of the things that you mentioned in your CHANCE article is this idea of differential privacy as an emerging general concept and principle for thinking about this kind of protection of information. As part of differential privacy you talked about measuring privacy loss, and about the ways you need to think about defining privacy loss. Could you talk about that?

Snoke: You know, there are some really worthwhile reasons to be interested in differential privacy, and then there are some really hard parts about it. On one hand, it's really a big step forward in terms of trying to precisely define what privacy means. One of the difficulties with what has been done in the past is that the concept of privacy wasn't really rigorously defined; it was dependent on a lot of assumptions or a lot of context. And that could run into problems: if those assumptions were wrong, or if your context changed, suddenly the way that you constructed your privacy protections didn't work anymore. So really the power of differential privacy is that it tries to step away from that context-dependent definition and give a definition that is robust to a change in context and relies on fewer assumptions, and it formalizes privacy to the point where you have an actual definition you can write down mathematically. So that's the really cool part. The really hard part is explaining it, and actually understanding what it means. Because, and we talked about this in the article, the older context-dependent definitions were really nice because you could explain the context. For example, there is an older definition known as k-anonymity, which basically means that if you release a data set with something like age and gender in it, there are at least k people, say 10 people, who have the same demographics as you, so no one is going to be unique based on the data that you're releasing. That's really great because everyone can understand what it means. Differential privacy, on the other hand, basically means limiting the amount of information that you might learn about someone given that they are in the data set versus not being in the data set. The problem is that what that information is, is not well defined, and you're doing this for any possible individual, not just the people who are actually in the data.
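
For readers who want the formal statement behind what Snoke describes, here is the standard textbook definition of ε-differential privacy (supplied here for reference, not quoted from the guests): a randomized mechanism M satisfies ε-differential privacy if, for every pair of data sets D and D′ that differ in one person's record, and every set of possible outputs S,

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon}\, \Pr[\,M(D') \in S\,]
```

Smaller ε means the output distribution barely changes whether or not any one person's record is present, which is exactly the "in the data set versus not in the data set" comparison Snoke mentions; by contrast, k-anonymity only requires that each released combination of quasi-identifiers appear in at least k records.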

Bailer: I really liked it in the Amstat News column, when you talked about randomized response as equivalent to a form of local differential privacy. I remember when I saw randomized response methods for the first time, and I just thought that was magical, such a cool idea. Could the two of you help describe how randomized response effectively works and why it provides an example of a local form of differential privacy?

Bowen: I'm going to try to take another stab at this, especially because I think when people hear local differential privacy, they believe that it's really in the framework of differential privacy, but it's actually a bit different; it's about whether or not there is a trusted curator. So, stepping back a little bit, what do I mean by trusted curator? It's the person who is going to be collecting the data or is ultimately responsible for its security. Local differential privacy doesn't trust anybody, not even the curator. Because in differential privacy, if you're the owner of the data you get to see it, and you have to figure out, as Joshua said, all the possible versions of a person or a record that could be in the data and how the output could change for any possible information anybody in the universe could try to draw on; this is why it's so hard to achieve. Local differential privacy, to emphasize the difference, says it doesn't trust anybody: before you collect the information, it randomizes whether or not you get the true answer, or how noisy it is before you receive it. That's where it deviates quite a bit. An example where this was applied, and hopefully I'm remembering correctly here, was Apple's iOS text emoji suggestions. Somebody would be texting, and it would suggest a follow-up emoji, like, oh, do you mean a smiley face, or do you mean a dog, or something like that. I think they decided to use this data because it's pretty innocent, though maybe somebody is embarrassed about associating certain text with certain emojis, I don't know, but we'll just go with that. And so before Apple would receive the responses, this local differential privacy method that they created would randomize whether they got the true answer or something noisy back, but they would know how the randomization was done, so they could still learn about usage in aggregate.
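
As a concrete illustration of the randomized response mechanism described above, here is a minimal R sketch; the sample size, true proportion, and coin probabilities are made up for the example. Each respondent reports the truth only with probability p, yet the curator can still recover the population proportion by inverting the known randomization.

```r
set.seed(42)
n <- 10000
true_answer <- rbinom(n, 1, 0.3)   # hypothetical: 30% of people are truly "yes"
p <- 0.5                           # probability a respondent answers truthfully

tell_truth <- rbinom(n, 1, p)      # each respondent's private coin flip
random_ans <- rbinom(n, 1, 0.5)    # otherwise report a uniformly random yes/no
reported   <- ifelse(tell_truth == 1, true_answer, random_ans)

# The curator only ever sees `reported`. Since
#   E[reported] = p * pi + (1 - p) * 0.5,
# the population proportion pi can be estimated by inverting that relation:
pi_hat <- (mean(reported) - (1 - p) / 2) / p
pi_hat                             # close to 0.3, without trusting the curator
```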

Pennington: Claire, you mentioned there are not many people doing this kind of work around privacy and data. So for people who are not as well versed in these issues, what sorts of things can they do to make sure they are being careful in the release of their data? It seems like understandings of what privacy concerns are have had to change. So for someone who is graduating from a statistics program right now, or who is engaged in this research now, what can they do to help make sure they are being careful in how they release their data?

Bowen: That's a really good question. I keep saying that, but these are good questions, and I'll probably fumble a little bit on this because there are different ways to interpret your question: one is, do I have recommended materials to learn about this, or is there a certain class, or perhaps a person to talk to, or a set of rules that you should follow. I'll try to tackle at least a little bit of each one and tell you how it's not perfect for each of them. So first off, is there a book, or videos, or things like that? I wish there were more. Again, there are so many people who participate in the field, and a lot of the materials that have been coming out have been more heavily in the computer science area, and fewer have been coming from statistics. That actually was another motivation for Joshua and me to write this article in CHANCE: to open up another piece of communication material that helps convey what exactly we mean by data privacy, what the challenges are, and so on and so forth, at least more targeted toward statisticians, because, as I said, there is a lot from computer science, and when you read those articles they tend to lean heavily toward that viewpoint. And, I think we pointed this out in the article, there was a primer that some computer scientists wrote for lawyers, and I actually got to speak with one of those computer scientists and asked him about it, and he said, yeah, when we shared it with lawyers they still told us it was too mathy. I actually hear that pretty often, that, oh, you should read this book or this article, and when you read them you realize they're still too technical for a non-technical audience. And so, while things have gotten a little bit better, especially with Google, Apple, Microsoft, Uber, Mozilla, the Census, I'm listing all these bigger entities that are trying to adapt, or adopt, excuse me, differential privacy, they've been trying to release their own kind of communication materials. There's actually a really good YouTube video from MinutePhysics, it's actually one of my favorite YouTube series, that is a really great way to explain differential privacy, and he talks about it with a data set of people who like to eat ice cream.

So I thought that was a very friendly way to go through it, and something like that can give you an idea of what data privacy is. But that doesn't quite get to the point where maybe you're a data practitioner or a statistician who's asking, hey, where should I start here? I wish I could say, hey, this is the best coding package to get a basic running start. There's still, again, not enough. There is an R package, so for those who don't know, R is a statistical programming language and it's open source, and there's a package within it called synthpop, and actually Joshua is one of the co-authors on that, so I'm going to pick on him a little bit there. It does do some of the methods, the statistical disclosure control methods he talked about, specifically synthetic data generation, but it's still limited to certain kinds of data or certain kinds of utility metrics, and when I say utility, I mean the usefulness or quality of the data. There is a community effort from Harvard, called OpenDP, that's trying to collect a suite of open source code for others to use, and I think we talked about that in the article, but it's still in the works. They launched last year, in 2020, but they're still in process; we actually recently contacted them about getting some more code and they're still collecting, it's going to take quite a bit of time. So that's also an unfortunate thing: we just don't have enough time.
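
For readers who want to try the synthpop package Bowen mentions, here is a minimal sketch with a made-up data frame; the defaults shown here are for illustration only, and a real release would involve tuning the synthesis methods and running disclosure checks.

```r
# install.packages("synthpop")
library(synthpop)

set.seed(1)
original <- data.frame(
  age      = round(rnorm(500, 45, 12)),
  income   = exp(rnorm(500, 10, 0.5)),
  employed = factor(sample(c("yes", "no"), 500, replace = TRUE, prob = c(0.8, 0.2)))
)

synth <- syn(original)   # fits models sequentially and simulates new records
head(synth$syn)          # the file you would release is synth$syn, not `original`

# Check whether an analysis of interest survives synthesis, e.g. a regression:
summary(lm(income ~ age + employed, data = original))$coefficients
summary(lm(income ~ age + employed, data = synth$syn))$coefficients
```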

Snoke: Yeah, I might add one thing. As Claire was saying, there's a messy landscape in terms of materials out there, but one thing that I've found in my own experience, because I work in a more applied setting with people who are, you know, not statisticians, not privacy experts, who have data they're trying to figure out what to do with, is that a lot of people can actually pick up the ideas fairly easily. It's not that you necessarily have to be an expert to understand how to protect data; I think a lot of this stuff is fairly intuitive. What I've found is that people come to me with questions, and very often in a short conversation it's easy for me to point them in the right direction, and then they can do a lot of it themselves. And I think that's something we also wanted to get across with our article: just pointing people in this direction to say, this is out there, you can do it, and you should be interested in trying to do it.

Pennington: Well, that's all the time we have for this episode of Stats+Stories. Claire and Joshua, thank you so much for being here today. Thank you.

Snoke: Thanks for taking the time.

Bowen: Thank you so much for having us.

Pennington: Stats+Stories is a partnership between Miami University’s Departments of Statistics, and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter, Apple podcasts, or other places you can find podcasts. If you’d like to share your thoughts on the program send your email to statsandstories@miamioh.edu or check us out at statsandstories.net, and be sure to listen for future editions of Stats+Stories, where we discuss the statistics behind the stories and the stories behind the statistics.