Keeping Individual Data Private | Stats + Stories Episode 254

Claire McKay Bowen (@clairemkbowen) is a principal research associate in the Center on Labor, Human Services, and Population and leads the Statistical Methods Group at the Urban Institute. Her research focuses on developing and assessing the quality of differentially private data synthesis methods and science communication. In 2021, the Committee of Presidents of Statistical Societies identified her as an emerging leader in statistics for her technical contributions and leadership to statistics and the field of data privacy and confidentiality. She is also a member of the Census Scientific Advisory Committee and the Differential Privacy Working Group, an advisory board member of the Future of Privacy Forum, and an adjunct professor at Stonehill College.

Episode Description

The gathering of statistics of various kinds is vital to our understanding of the world around us. But some stats can communicate sensitive data about individuals even when statistical methods have been thoughtfully designed. The ability to keep data private is the focus of this episode of Stats+Stories with guest Claire McKay Bowen. Check out her full article in Significance Magazine now.

Full Transcript

Rosemary Pennington
Gathering of statistics of various kinds is vital to our understanding of the world around us. But some stats can communicate sensitive data about individuals, even with statistical methods that have been thoughtfully designed. Data privacy is the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington. Stats and Stories is a production of Miami University's Departments of Statistics and Media, Journalism and Film, as well as the American Statistical Association. Joining me as always is regular panelist John Bailer, emeritus professor of statistics at Miami University. Our guest today is a repeat visitor, Claire McKay Bowen. Bowen is a Principal Research Associate in the Center on Labor, Human Services, and Population and leads the Statistical Methods Group at the Urban Institute. Her research focuses on developing and assessing the quality of differentially private data synthesis methods, and science communication. In 2021, the Committee of Presidents of Statistical Societies identified her as an emerging leader in statistics for her technical contributions and leadership to stats and the field of data privacy and confidentiality. She's also the author of a Significance article on the art of data privacy. Claire, thank you so much for joining us again.

Claire Bowen
Well, thank you for having me again. I must have done something right the last time.

Rosemary Pennington
Well, I mean, this article that's in Significance, which I know is an excerpt from your book, is really so interesting. And I wondered if you could just talk us through how you decided to talk about data privacy through the lens of this particular piece of artwork.

Claire Bowen
I can't take full credit for this idea. When I was an intern at the Census Bureau, I attended a talk, and I can't remember who gave it. Oh, I wish I could give better credit to this person. They used a piece of artwork where they took the pixels and started kind of fuzzing them out a little bit to represent the idea of data privacy. And so I thought, well, why not use that example, but take it a step further to help explain other data privacy techniques? Because at the time, this person was just focused on thinking about noise infusion through the formal privacy definition, differential privacy, because that was the big focus that the Census Bureau was trying to push out at the time. This was all back in 2016, for those who are curious about when this all happened. So then, when trying to figure out how best to explain this for my book, to an audience who had no mathematical background, that's why I picked a piece of art. Now, why a Seurat painting? One, I actually love pointillism paintings; it's one of my favorite techniques in art. And I was also trying to think of a piece of art that a lot of people knew about, and that was also open to the public in terms of digital rights. So it was nice that the Art Institute of Chicago had all their artwork publicly available. That made it really nice to be able to pick that piece of art.

Rosemary Pennington
It is really interesting to see in the article, as you work through the various privacy approaches, what the distortions to the artwork do to help you think through and visualize those concepts. For someone like me, who gets concerned about data privacy but who is not immersed in it, it really helped me conceptualize what you were getting at.

John Bailer
Yeah, the other thing I found really nice about this painting, other than that I had the opportunity to see it at the Art Institute too, so it was really cool to be able to visit Chicago and see it. By the way, I was inspired to see it even before reading your paper. But reading your paper, I thought this was a brilliant selection, and I really loved how the paper evolved. It was really cool because I was thinking that you have not only the way you're going to be adjusting this image, but also a population of people in it, you know, who are out enjoying their day. And I thought that was an opportunity to really play with these ideas of capturing an image after starting to do some kind of adjustment. So how about talking us through maybe one of these ideas? There are certainly a lot of ideas about, you know, masking some information while still keeping the whole picture. So what's one of your favorites in terms of its impact on this image?

Claire Bowen
Oh, I think the coloring is a really big one. Because with pointillism, for those who don't know, it's basically taking the point of the paintbrush with one color and placing different dots to create the full image, and so you get different highlights. At least when I did a little bit of pointillism painting, sometimes when you mix together some of the colors you get a totally different tone, so you can get a different kind of highlighting of the imagery, at least for the painting. And so when I say changing the color, one of the techniques is called generalization, or recategorization, or sometimes thresholding. There are a lot of different terms, but hopefully I've covered most of them. Or, I guess, another one is aggregation. So this is to take really small groups of people or other kinds of records and put them into a higher-level grouping. A really classic example, which I believe I used in the paper, was education. You can imagine there are different levels of educational attainment: high school, GED, some college, community college, bachelor's, and then the bachelor's of arts versus science, and so on and so forth. But that can get so fine-grained that you can identify people. So you aggregate those individuals up, to saying, did they complete high school or not? Did they go into a bachelor's or a four-year program, or a two-year program, and things like that. And the analogy I use with the painting is that instead of having all those different cool shades that you might have with pointillism, you just have only one color green, one color blue, and one color red. And so if you look at the image created in the article, you still get the figures, you can still figure out that they're in a park, because you can still see the trees and the grass. It looks like there's still a lake in the background too. But it's distorted, because it's these extreme colors now. It was only one green and one blue.
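As a minimal sketch of what this kind of generalization can look like in code (the education categories and the grouping below are invented for illustration, not an official coding scheme):

```python
# A minimal sketch of generalization (recategorization): collapse
# fine-grained education categories into coarser groups so that rare
# values no longer single anyone out. The categories and the mapping
# are made up for illustration; they are not an official coding scheme.

EDUCATION_GROUPS = {
    "High school diploma": "Completed high school",
    "GED": "Completed high school",
    "Some college": "Some college or two-year degree",
    "Community college": "Some college or two-year degree",
    "Bachelor of Arts": "Four-year degree",
    "Bachelor of Science": "Four-year degree",
}

def generalize(records):
    """Replace each record's detailed education value with its coarser group."""
    return [
        {**r, "education": EDUCATION_GROUPS.get(r["education"], "Other")}
        for r in records
    ]

people = [
    {"id": 1, "education": "GED"},
    {"id": 2, "education": "Bachelor of Science"},
    {"id": 3, "education": "Community college"},
]
for person in generalize(people):
    print(person)  # detailed credentials now share coarser labels
```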

John Bailer
So, you know, one question that backs us up a little bit is: what's wrong with identifying people? Why is this important to do?

Claire Bowen
That's a great question. Because sometimes I hear some individuals say, well, doesn't the general populace know that this is for the great benefit of society? Knowing exactly how many people are in a certain area, and I'm going to keep playing on the education example here, has been really important for figuring out how to allocate resources for different schools, how to distribute teachers, or whether we need to divide up a school district because there are more children coming into the area, things like that. But at the same time, you don't need to know, and perhaps some individuals won't feel comfortable with others knowing, that there are so many kids in the neighborhood. Because what about stalkers? Or the fact that some datasets we're collecting can be used to figure out who is from a single-parent household, which can make those households more of a target in certain financial situations. So it gets a little bit tricky there, because you do want to help these individuals, but at the same time, you can't expose them to further harm. And it gets even more interesting, because I mentioned single parents: for some households the data is indicative of being a single-parent household, and those are likely from certain underrepresented communities, and they're more susceptible to attacks. There's a lot of research showing that if you're from certain kinds of communities, you're more susceptible to certain financial hits, or, again, like I said, different kinds of privacy attacks. Think about scams, or data leakage from, say, credit card companies or advertising, or a more recent data breach.

Rosemary Pennington
But there's so much data that is publicly accessible. How has that changed? Or has it changed the conversation around data privacy?

Claire Bowen
Oh, it's changed the conversation quite a bit. There are a lot of people very concerned that, oh, now we have social media, for instance, and that's a lot of extra information. A lot of it is consumer data. Some of the private companies might make it proprietary, so they keep it close to their chest. But some of them do publish their statistics or other datasets, maybe for the greater public good, or sometimes they run competitions so other researchers can contribute to improving their algorithms. I'm alluding to something like the Netflix Prize. That's a very classic example. It's a very old one, but it's a very salient one, because Netflix did release a dataset to improve their recommendation system. This was back in 2007 to 2009. And the thing is, they didn't realize you could attack their dataset using IMDb data, because there are similar ratings. So that's an example of public data being used against a dataset that was anonymized and supposedly protected. Now, the follow-up is that you, as probably the average person, will think, oh, who cares if anybody knows I gave five stars to the latest Top Gun movie, that was so cool, right? But for that dataset, later on there was a lawsuit against Netflix, back in, I believe, 2010, right after the announcement of the winner, because the data could be used to figure out if somebody was LGBTQ. So that was a big issue. And this goes into the point that you're not sure how somebody might attack the data with other datasets, or what malicious things they could do with a dataset. So something as innocent as a movie rating dataset has alarmed other entities, especially the federal statistical agencies, about what information they can release, even though there's an obvious public good.
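For a rough sense of how that kind of linkage attack works mechanically, here is a toy sketch; the users, movies, and ratings are entirely invented, and the matching rule is far cruder than what was used against the real Netflix data:

```python
# Toy linkage attack in the spirit of the Netflix/IMDb example above:
# match "anonymized" rating records to a public source by counting
# overlapping (movie, rating) pairs. All data here is invented.

anonymized = {  # pseudonymous user -> {movie: rating}
    "user_817": {"Top Gun": 5, "Amelie": 4, "Alien": 2},
    "user_203": {"Top Gun": 3, "Heat": 5},
}
public = {      # real name -> {movie: rating}, e.g. public reviews
    "Alice Smith": {"Top Gun": 5, "Amelie": 4},
    "Bob Jones": {"Heat": 5, "Alien": 5},
}

def best_match(pseudo_ratings):
    """Link a pseudonymous record to the public identity that shares
    the most exactly matching (movie, rating) pairs."""
    def overlap(name):
        return sum(1 for movie, rating in pseudo_ratings.items()
                   if public[name].get(movie) == rating)
    return max(public, key=overlap)

for pseudo, ratings in anonymized.items():
    print(pseudo, "->", best_match(ratings))
# Even a couple of matching ratings can be enough to re-identify a user.
```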

John Bailer
It seems like, you know, if we just back up for a second to talk about one of the key principles that you mentioned in your paper, and that we've talked about in other contexts: this balancing of privacy versus data utility. You know, what kind of information, what kind of accuracy do you require to make decisions, versus protecting the respondents, the anonymity of the respondent, or the identification of the respondent? So can you talk a little bit about why you don't necessarily need this kind of individual granularity to have something that would still be of utility for decision making?

Claire Bowen
Yeah, that's a great question. Because I am often asked, how do you know you made something accurate enough? Maybe you don't need that accurate information, but maybe you do. And just like I said it's hard to predict how someone might attack the data, it's also hard to predict all the use cases for the data. But you do try. And I say "you" as in those who have the dataset, who are responsible for the data itself; they'll try to figure out exactly what the salient use cases are, the use cases that are needed for that dataset to make those important decisions. I use this example in my book, and I can't remember if it's in the article, because again, the article is an excerpt. I talk about myself being an Asian American, and how knowing how many Asian Americans are in a certain area might be important: during COVID-19, a lot of Asian American owned businesses were hit harder than others, so you need to know if there is generally an Asian American population in an area. But you don't necessarily need to know who is East Asian versus Middle Eastern versus Pacific Islander, right? But then you can say, well, maybe for some use cases you do need to know that finer-grained detail, to figure out that there is a big difference between somebody being Japanese, Chinese, or Korean versus Filipino versus being from certain other regions of Asia.

Rosemary Pennington
You're listening to Stats and Stories, and today we're talking to Claire McKay Bowen about data privacy. Claire, one of the examples in this excerpt from your book in Significance is exploring data swapping, which is one of the ones that really caught my eye as I was scrolling through this to prepare, because it's these squares of images pulled from around the painting that are kind of moved around. Could you talk a little bit about what data swapping is, and how that might actually play out in a study, in order to still create usable data while protecting privacy?

Claire Bowen
Yeah, data swapping is actually the technique that was primarily used to protect the census data from 1990 all the way to 2010. When I say 1990 to 2010, this is the big decennial dataset, the one that everyone was talking about earlier for 2020, because that data is used for redistricting, among many other census data products. That was really top of mind, because the first data product that comes out of any of the decennial census data products is the one you might hear with the term P.L., which is short for, basically, the redistricting data. There are some numbers that go after it that I'm going to totally forget, but that doesn't matter. So why was data swapping the primary disclosure avoidance method? At a high level, data swapping tried to find households that had very similar characteristics across the United States, but swap them between regions where they were more sensitive. I guess I can use myself as an example, as an Asian American. I kind of stuck out growing up in Idaho. For those who don't know, it's a very rural state; I think they're still under two million people in the whole state. And I grew up in a town of 3,000 people, so I stuck out pretty well as an Asian American. So maybe my family might get swapped with somebody else in Washington, DC, where it's more representative to have Asian Americans. But only some of our information matches, because we're being swapped based on certain kinds of characteristics. So let's say, for the example here, that we're flagged as a primarily Asian American household, and when we got swapped, it was like, oh, the whole unit is Asian American. But for those who don't know, I actually have a stepfather who is white. So being swapped over to Washington, DC, maybe our household of my mother, my stepfather, and myself gets swapped with a family of four who are all Asian American. And maybe some people who live in the town would be like, that's weird, because we all know Claire's family. But at the high level, when somebody's analyzing the dataset, they're like, oh, well, we know there's an Asian American family, and that's all we needed to know in order to figure out some of the statistics or allocate resources.
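To give a flavor of the mechanics, here is a hedged sketch of geography swapping; the households, the at-risk rule, and the swap rate are all invented, since the Census Bureau's actual matching rules and rates are confidential:

```python
# Toy sketch of data swapping: households flagged as at risk (rare in
# their area) have their geography exchanged with a similar household
# elsewhere, so regional totals stay intact while identities are
# protected. The risk flag, swap key, and rate are all invented.
import random

def swap_geographies(households, swap_key, at_risk, rate, seed=0):
    rng = random.Random(seed)
    swapped_ids = set()
    # Flag at-risk households up front, before any swaps move them.
    risky = [h for h in households if at_risk(h) and rng.random() < rate]
    for house in risky:
        if house["id"] in swapped_ids:
            continue
        # Candidate partners share the swap key but live in another state.
        partners = [p for p in households
                    if p["id"] not in swapped_ids
                    and p["id"] != house["id"]
                    and p[swap_key] == house[swap_key]
                    and p["state"] != house["state"]]
        if partners:
            partner = rng.choice(partners)
            house["state"], partner["state"] = partner["state"], house["state"]
            swapped_ids.update({house["id"], partner["id"]})
    return households

households = [
    {"id": 1, "race": "Asian", "size": 3, "state": "ID"},
    {"id": 2, "race": "Asian", "size": 3, "state": "DC"},
    {"id": 3, "race": "White", "size": 4, "state": "ID"},
]
# Stand-in risk rule: a household that is rare for its state.
swapped = swap_geographies(
    households, swap_key="size",
    at_risk=lambda h: h["race"] == "Asian" and h["state"] == "ID",
    rate=1.0)
for house in swapped:
    print(house)  # household 1 now appears in DC, household 2 in ID
```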

John Bailer
Yeah, you know, using this particular painting as kind of a backdrop for illustration also illustrates the idea of how much of this addition of noise, or addition of masking, you can add before you lose the image. So I imagine that's a real challenge for some of the work that you're doing: thinking about how much of these various aspects, whether it's thresholding, or data swapping, or rounding, or top and bottom coding, all these other techniques that you mentioned. You know, how do you know that you've got just that sweet spot of enough noise?

Claire Bowen
It's a great question. So we'll focus on the swapping a little bit. There could be different rates of swapping. And if anybody who's listening is kind of curious what the Census Bureau did, that's actually a guarded secret; they don't tell you the rate at which they swapped. So one of the things people will do is try and test out different rates of swapping. And then they have to think about what their threshold is for what they define as a disclosure risk, and what their threshold is for accuracy, or data utility, or quality. These terms get used interchangeably, but you get the idea that there is this innate trade-off, like you said earlier. I talk about this throughout my whole book: that tension, they oppose one another. But there is that sweet spot, as you said, and determining it gets really tricky. Sometimes people think of it like a curve, right? You can imagine a graph where the x-axis would be disclosure risk and the y-axis data quality, and there's some sort of curve. It's typically not linear, it's not one-to-one; usually there's an inflection point. And some people are like, okay, I want to hit that inflection point. Or sometimes they think, well, no, the accuracy of the data is far more important than the privacy, and so maybe you push yourself a little bit further along that graph to figure out where there is higher accuracy but it still has a little bit of that privacy guarantee that you want. Or sometimes maybe the dataset is just too sensitive, so you think, okay, I'll just release some information, not the finer-grained detail, but some of it, so that people at least have an idea of, say, the structure of the dataset before, for example, trying to request access to the full dataset. Now, this is, I think, going a little bit beyond what you're asking here, but I think it's really important to note that some datasets are released for a certain kind of utility. And it might just be the structure of the dataset, where another researcher says, hey, I actually need to know more fine-grained detail to make certain kinds of policy decisions, so I'm requesting formal access to the data. And they're only able to do that because there is some sort of public file where they can say, hey, it's not as accurate as I need for what I need to do.
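To make that curve concrete, here is a hedged numerical sketch that sweeps a noise scale and tracks a toy disclosure-risk proxy against a toy error measure; both metrics are stand-ins, not the measures any agency actually uses:

```python
# Sketch of tracing a risk-utility trade-off: sweep the scale of noise
# added to a confidential count, and for each scale estimate a crude
# disclosure-risk proxy (release lands within 0.5 of the truth) and a
# crude utility proxy (average absolute error). Both are stand-ins.
import random

random.seed(0)
TRUE_COUNT = 120   # a confidential small-area count
TRIALS = 5000

for scale in [0.5, 1, 2, 5, 10, 25]:
    total_error = 0.0
    near_hits = 0
    for _ in range(TRIALS):
        # Difference of two Exp(1) draws is Laplace(0, 1) noise.
        noise = random.expovariate(1) - random.expovariate(1)
        release = TRUE_COUNT + scale * noise
        total_error += abs(release - TRUE_COUNT)
        near_hits += abs(release - TRUE_COUNT) < 0.5
    print(f"scale={scale:>4}  avg error={total_error / TRIALS:6.2f}  "
          f"risk proxy={near_hits / TRIALS:.2f}")
# As the scale grows, the risk proxy falls and the error rises:
# the privacy-utility tension in miniature.
```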

John Bailer
So I can well imagine that for the people making these decisions, for you and your colleagues, if you have access to the full dataset, you can compute the summaries of interest, and then start to look at various constructed datasets that reflect some of these data privacy interventions, if you will, and then see how close you get to matching the summaries for a given degree of modification. You could find that sweet spot just by analysis. Is that part of the exercise that you're involved with?

Claire Bowen
Yeah, so part of the exercise is to apply some of what we consider the benchmarks for quality. You apply them to the confidential dataset, and you apply them to the altered dataset or statistics you want to release, and see, well, would adding noise change people's decisions? You can imagine that in some cases, if you make the data so noisy that somebody is going to make an improper public policy decision, that's really, really bad. And so there's actually an interesting trade-off here: is it better to not release the data at all, which is very private but provides no information, or to release data? There are basically two extremes you can think of with the data. You can either not release any information, which is super private, but it's not at all useful. The other version is to release some form of the statistic, usually pretty noisy, or aggregated up, or altered by any of the techniques that I have described. But you have to think about, well, is it actually better that I release that information? Because if it's so noisy that somebody's going to make an improper decision from it, then maybe it was better to go back and release no information. There are obvious trade-offs for both scenarios. And so it's actually one of the things we're working on at Urban: figuring out at what point people would rather not actually have access to data, because they would make bad decisions.
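Here is a hedged sketch of that benchmarking step: run the same decision rule on the confidential value and on noisy releases, and estimate how often the noise flips the decision. The rule, the counts, and the noise model are assumptions for illustration:

```python
# Sketch of checking whether noise would change a decision: run the
# same stand-in policy rule on the confidential count and on noisy
# releases, and estimate how often the decision flips. The rule,
# counts, and noise scale are invented for illustration.
import random

random.seed(1)

def fund_program(count, threshold=100):
    """Stand-in decision rule: fund a program if the count clears the bar."""
    return count >= threshold

TRUE_COUNT = 104   # confidential value, just above the threshold
TRIALS = 10_000
flips = 0
for _ in range(TRIALS):
    noise = random.expovariate(1) - random.expovariate(1)  # Laplace(0, 1)
    noisy_release = TRUE_COUNT + 5 * noise                 # noise scale of 5
    if fund_program(noisy_release) != fund_program(TRUE_COUNT):
        flips += 1
print(f"Noise flipped the funding decision in {flips / TRIALS:.1%} of releases")
# If flips are common, releasing nothing may beat releasing noise.
```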

Rosemary Pennington
Your book is Protecting Your Privacy in a Data-Driven World; I wanted to make sure we got the title in. And I wonder, Claire, as you've been talking: is there something that you see kind of on the horizon when it comes to data privacy that people, whether statisticians, or journalists, or just everyday people, should be more aware of or sort of keeping an eye out for?

Claire Bowen
That's a great question, especially because I keep getting asked, what is the future? What should we focus on? What are the hot topics? So I'll try to cover, broadly, the ones I think are most interesting. And just to be clear, there are a lot of interesting topics, and we just don't have the time to go through everything in detail. One of the big things, and the main motivation for this book, is communication. I think a lot of times when you ask somebody, hey, what do you think of data privacy, what is data privacy to you, most people think of other things. But the topic that I'm specifically researching is releasing sensitive information in a way that can be used for better decision making, for example statistical work, or you can definitely think of public policy work, right, because I work with a lot of administrative or federal statistical datasets here. So I'm just trying to communicate that this is a field that impacts lots of people and is important to the future of data access. And one of the tricky things, going into the second challenge along with the communication piece, is education. For those who don't know, most courses on data privacy taught in the United States are done in computer science at a graduate level. And so despite the fact that statistical disclosure control, which is the field of data privacy and confidentiality, has existed for decades, we're not a big player in that discussion, because it's so dominated by computer science. It's actually been a big argument point, where a lot of statisticians who work in it are a little frustrated, knowing that our work is not impacting, or helping impact, the conversation as much as it should be. Because there are different perspectives and different priorities that people have, so why not make sure that you have more perspectives beyond just computer science? And you kind of see in the literature that the focus of the research is really driven by that particular field. Not to say that we shouldn't consider that, of course; they are contributing quite a bit, and they have advanced the field, but it's become a problem. So with that communication and education piece, we really should be thinking about how we are going to educate other fields than just those in this niche area, considering the impacts of data access in social science, economics, demography, and so on and so forth. I think this was on the other podcast episode I was on with my colleague, for our other article: we asked the question that if you asked people what education or other kind of research material they would recommend for machine learning versus data privacy, you would get very different responses. So that's that lack of the communication piece. And for education, how do you teach the next generation if you lack these resources? I guess the other part that I really want to talk about is the data equity side. Equity and data privacy go hand in hand. Again, there are a lot of different definitions of data equity, but one interpretation is having equal representation in the dataset. Certain individuals are more easily identifiable in the data, and when they are more easily identifiable, you need to protect those individuals more.

So we're going back to what I just said about whether it is better to not have certain people in the dataset because they're too identifiable, or to make them very noisy, so you kind of misrepresent those individuals. And that goes into this data equity piece. Because if there is a community for which it is really important to target certain kinds of policy decisions, and they're not present in the data, then you're not going to make the right ones. I talk about this in my book with the Navajo Nation: at some point, I believe it was May of 2020, they were impacted per capita by COVID-19 far more than New York City. And part of it was because of the health communication that people were spreading around; a policy like washing your hands for 20 seconds does not work when you're in an area where over 30% of the people don't have running water. So that's the part you have to think about: if these people are not properly represented in the data, you're not going to help them as much as you think you are.

John Bailer
You know, as you were talking about communication and education, I was also thinking about not just training the next generation of people who will work in this arena, but also helping people understand that if you respond to this survey, your responses will not be identifiable, as a way to increase trust and also increase the likelihood of participation, especially in a time when you might think your data will be used in some nefarious way. So have you seen any kind of impact of these descriptions or assurances of data privacy, in terms of responding to surveys or in terms of people engaging with these queries?

Claire Bowen
Yeah, I think the best example would be the Census Bureau, because when they do those surveys, let's say the American Community Survey or the Current Population Survey, they say that your responses will be kept confidential and, when reported out, will be altered. I can't remember the exact wording, but they make those assurances because those data are protected by law. And so making sure, like you said, to guarantee those protections so that people feel more comfortable responding is very important, especially given certain kinds of histories. Sometimes I'm surprised people don't know this, but one of the reasons why the Census Bureau has been very thoughtful about this is that there has been misuse of their data before, during World War Two. This is actually how the United States was able to identify who was Japanese American and put them into internment camps. I've actually been told by a few people that that's the reason why they don't respond to census surveys: because of this mistrust of the Census Bureau, they refuse to respond to any of the surveys.

Rosemary Pennington
Well, that's all the time we have for this episode of Stats and Stories. Claire, thank you so much for being here today.

John Bailer
Thanks, Claire.

Claire Bowen
Thanks for having me.

Rosemary Pennington
Stats and Stories is a partnership between Miami University’s Departments of Statistics, and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter, Apple Podcasts, or other places you can find podcasts. If you’d like to share your thoughts on the program, send your email to statsandstories@miamioh.edu or check us out at statsandstories.net, and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.