The Right to Be Left Alone | Stats + Stories Episode 205 / by Stats Stories

Christoph Kurz is a postdoc at the department of health economics at the Ludwig Maximilian University of Munich. His research includes statistical methods for health economics and health policy, especially Bayesian methods and causal inference. Recently, he's focused on privacy research because of the increased requirements demanded by EU legislators regarding the handling and processing of health data. Kurz has authored a piece for Significance magazine about the concept of differential privacy.


Episode Description

With the ubiquity of technology in our lives have come concerns over privacy, security, and surveillance. These are particularly potent in relation to what's come to be called Big Data. Navigating the complicated terrain is a constant conversation in some sectors of the tech industry, as well as academia. And it's the focus of this episode of Stats and Stories with Christoph Kurz.

Full Transcript

Pennington
With the ubiquity of technology in our lives have come concerns over privacy, security and surveillance. These are particularly potent in relation to what's come to be called Big Data. Navigating the complicated terrain is a constant conversation in some sectors of the tech industry, as well as academia. And it's the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington. Stats and Stories is a production of Miami University's Departments of Statistics and Media, Journalism and Film, as well as the American Statistical Association. Joining me is regular panelist John Bailer, chair of Miami's Statistics Department. Our guest today is Christoph Kurz. Kurz is a postdoc at the department of health economics at the Ludwig Maximilian University of Munich. His research includes statistical methods for health economics and health policy, especially Bayesian methods and causal inference. Recently, he's focused on privacy research because of the increased requirements demanded by EU legislators regarding the handling and processing of health data. Kurz has authored a piece for Significance magazine about the concept of differential privacy. Christoph, thank you so much for joining us today.

Christoph Kurz
Thanks for having me.

Pennington
I wonder, what is it about those new EU requirements or the pressures you're feeling from the EU that made privacy become a focus for you?

Christoph Kurz
Yeah, so I'm not a privacy researcher by training. My research, as was said in the introduction, is statistics for health economics and health policy. But I deal with a lot of health data, and there are a lot of new EU regulations on how privacy has to be ensured when handling health data, and that's what increased my interest in privacy research. But honestly, my background is that I used to consider privacy as an obstacle, something that restricts my scientific work. For example, when I did some data analysis on a health claims data set we obtained from an insurance company, I was only allowed to process the data on one very specific laptop, a laptop with a kind of secure WiFi connection. And this laptop was painfully slow, so I really did not enjoy working with all of this just because of privacy. It was super annoying. But then, about a year ago, I heard of a concept called differential privacy and read an article about it, and I thought, whoa, there's actually a mathematical definition of privacy. So maybe privacy is a really interesting area of research after all. That's kind of how I got into privacy research. And the EU legislation is one reason; we all see the effects of this European regulation. For example, when we visit a website, there's this new kind of pop-up where you have to accept or reject tracking. That is a result of the EU legislation. And basically, I think this is a good thing, so I think there should be more laws and more regulations on privacy. But on the other hand, it restricts my research. So how can I have both? I want to have the best research results and, of course, follow the privacy laws, and this is kind of a problem. Yeah.

Bailer
Well, I like how you started your Significance piece by suggesting the important balance that exists between, you know, data protection as an important asset, but also data analysis. I mean, we're collecting information not just for grins, but for a purpose. And to be able to use it for that purpose, it has to have some degree of fidelity, you know, even with protection. So this is the balance point. But before we push into a little more of the nuance of privacy, can you give an informal, colloquial, casual explanation of what is meant by privacy in this context?

Christoph Kurz
I think privacy can be defined as the right to be left alone, the right to be unobserved. I think that would be a good definition of privacy. And this is in contrast to secrecy, for example. Secrecy means you actively hide information, but privacy is more about not being observed, or the right to be unobserved.

Bailer
Oh, I've got a follow-up on that now. So, you know, we just had the decennial census in the US, and that's a mandated reporting requirement; you have to report. So in some sense there are times when decisions being made, you know, for the public good, for society, for the allocation of resources within many communities, require information. So while we value privacy, we also value that you will be represented, and your perspective heard, when certain decisions are made. So how does that work? That's kind of an explicit decision to give up some privacy in exchange for being represented in decisions.

Christoph Kurz
Yeah, exactly. I think in the scientific community, currently, two revolutions are kind of colliding. One revolution is the so-called open data revolution, or the replication crisis that caused this open data movement. And of course, this makes sense: researchers are encouraged to share data, to put it in open repositories so everybody can replicate or do some re-analysis. This makes scientific results transparent. It also saves money, because other research groups don't have to collect the same data again. And it's also nice for teaching, for instruction. At the same time, we have new kinds of concerns about privacy. And when it comes to data, the most interesting data is usually the most sensitive; my PhD supervisor told me, if you want to get into those high-ranking journals, you have to either have very good data or a very good idea and think ahead. So really, data about individuals is very valuable, and it cannot be shared freely, because very sensitive data about yourself can have negative consequences. For example, it can harm someone's reputation or someone's employability; if your employer finds out you have a very serious disease, they might not employ you anymore. Or it might limit your creditworthiness or your insurability. And these are serious harms to your privacy that really conflict with the open data movement.

Pennington
It's, yeah, it's an interesting conversation around this issue of privacy and data. I do a lot of work in the realm of qualitative research, and when we do ethnography or, you know, in-depth interviews, a lot of the conversations are about how do you protect the individuals in what you write up? How do you mask them? And there's a whole discussion about, like, do you falsify, not falsify, but do you alter quotes to be able to hide them? And so it's sort of interesting, because that's what we've been struggling with in qualitative work for a long time, particularly in this age of digital media, where if you are doing work online and you, you know, pull something from online, people can track you down if they Google savvily enough. And so this issue around privacy and data, I think, is really compelling. In that, I think that for a long time, people imagined data being this thing that kind of stripped you of your identities, right? And open data being open has sort of, you know, made people realize, in a way that I think we always knew, that actually you are identifiable if people are looking, right? And so I think figuring out how to mask individuals in these large data sets is really intriguing. And also super complicated, obviously.

Christoph Kurz
Exactly. And many people would say, okay, I removed the name, or like a personally identifiable ID or something, from the data set, and then privacy is ensured. But this is not the case; there's tons of research that shows that people can really be identified. I mean, there's this famous result by Latanya Sweeney. She showed that 87% of Americans can be uniquely identified by gender, birthday and zip code alone. Wow, this is really remarkable. And the reason for this is that we leave so many kinds of digital footprints, like combinations of variables that make us unique and identifiable.
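[Editor's note: a minimal sketch, not from the episode, of the point Kurz makes here. The toy table below is entirely invented; it only illustrates how records with no names at all can still be singled out by the combination of quasi-identifiers such as gender, date of birth, and ZIP code.]

```python
# Toy illustration (invented data): even with names removed, the combination of
# quasi-identifiers (gender, date of birth, ZIP code) can single out individuals.
from collections import Counter

records = [  # (gender, birth_date, zip_code) -- no names anywhere
    ("F", "1985-03-12", "45056"),
    ("M", "1990-07-01", "45056"),
    ("F", "1985-03-12", "45011"),
    ("M", "1990-07-01", "45056"),  # shares its combination with record 2
    ("F", "1972-11-30", "45014"),
]

combo_counts = Counter(records)
unique = [r for r in records if combo_counts[r] == 1]
print(f"{len(unique)} of {len(records)} records are unique on (gender, DOB, ZIP)")
# -> 3 of 5 records are unique, i.e. re-identifiable by anyone who knows those
#    three attributes from another source (voter rolls, social media, ...)
```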

Bailer
You know, that kind of leads into exploring a little bit more of the technical ideas here. When you're talking about this combination of information, that goes to what was going to be my second question. So I'm going to postpone this idea of robustness under composition as a concept for us to return to. But when you're talking about this balance between the protection of data and insights from data analysis, you have this formalization of privacy loss. So could you talk just a little bit about what's meant by privacy loss? And what are some of the mechanisms that you use to fuzz, I would say fuzz up, the data for the purposes of protection, while still having fidelity to the analysis? So could you talk a little bit about privacy loss?

Christoph Kurz
Yeah, so privacy loss is mostly used in the definition of differential privacy, actually. I already mentioned this. Differential privacy is kind of a mathematical definition that indicates when publishing a result or data can be considered private, in a specific sense. And a unique ability of differential privacy is that it can be tuned; there's kind of a tuning knob between privacy and getting good results, because differential privacy is about adding noise. That is kind of the simplest way to describe it, just adding noise. But there's this privacy loss parameter. This is the tuning knob that defines how much privacy is gained and, on the other hand, how much accuracy is lost.
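[Editor's note: a minimal sketch, not from the episode, of the "adding noise" idea and the tuning knob Kurz describes, using the standard Laplace mechanism for a count query. The parameter epsilon plays the role of the privacy-loss knob: smaller epsilon means more noise and more privacy, larger epsilon means less noise and more accuracy. The data and thresholds below are hypothetical.]

```python
import numpy as np

def dp_count(data, predicate, epsilon):
    """Differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so noise drawn from Laplace(scale = 1/epsilon)
    gives epsilon-differential privacy for this single query.
    """
    true_count = sum(predicate(x) for x in data)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical data: annual health costs (in euros) for 1,000 insured people.
rng = np.random.default_rng(0)
costs = rng.lognormal(mean=7.5, sigma=1.0, size=1000)

# How many "high-cost" cases (> 20,000 EUR)? Smaller epsilon = more privacy, noisier answer.
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps:>4}: noisy count = {dp_count(costs, lambda c: c > 20_000, eps):.1f}")
```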

Bailer
That seems like a really, really hard question. You know, to think about where you put that. I like this idea of a tuning parameter, or a knob that you're twisting, but just how far do you twist before you have all noise and no signal?

Christoph Kurz
Yeah. I think this is probably also the main difficulty of differential privacy. But because it is a mathematical definition, it can be defined. So sometimes this loss parameter is also described as a privacy budget: how many queries can you make on data that is differentially private before anything can be revealed? Like, how many queries can you make before you can kind of reconstruct who's who, and whether that person is in the data set or not? And the privacy loss is also kind of a budget you have on how often you can safely query.
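[Editor's note: a minimal sketch, not from the episode, of why repeated queries "use up" a budget. If an analyst could ask the same question again and again, averaging the noisy answers would wash the noise out and recover the true value. The counts and epsilon values are invented for illustration.]

```python
import numpy as np

# Toy illustration of why a privacy budget is needed: each repeated query on the
# same data leaks a little more, because averaging many noisy answers removes
# the protective noise again.
true_count = 137          # hypothetical true number of high-cost cases
epsilon_per_query = 0.1   # privacy loss spent on each individual query

rng = np.random.default_rng(1)
for n_queries in (1, 10, 100, 1000):
    noisy_answers = true_count + rng.laplace(scale=1.0 / epsilon_per_query, size=n_queries)
    print(f"{n_queries:>4} queries: averaged answer = {noisy_answers.mean():7.1f} "
          f"(total privacy loss ~ {n_queries * epsilon_per_query:.1f})")
# The averaged answer converges to the true 137 as queries accumulate, which is
# why the total privacy loss (the budget) has to be capped.
```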

Pennington
You're listening to Stats and Stories, and today we're talking with Christoph Kurz about differential privacy. Christoph, I wonder, now that you're really exploring this idea of differential privacy, and I know your work has been in the field of health economics, how do you imagine this impacting the work that you could continue doing in that particular field?

Christoph Kurz
I think it will be more and more required in the future to have some kind of privacy statement, or maybe some kind of privacy analysis, that you have to include in health economic research, not only because it's required by law, but also because personal health information is a very sensitive type of data. And I think the requirements to ensure privacy regarding health records, electronic health records, health claims data will be increasing. And I think this does not only affect health economics but also health policy and all types of biomedical research.

Bailer
So, can I follow up just a sec: what kind of problems are you working on now? I mean, can you give us an example of a health economics project that you're exploring, and where differential privacy comes into play?

Christoph Kurz
Yeah, so in health economics we are, for example, interested in high-cost cases, so, kind of, who is someone who costs the insurance company a lot, and these are very often outliers. And health insurance companies are interested in identifying those people and maybe offering different plans or, yeah, changing their insurance type or something like that. And of course, as someone who's researching this, we don't want to harm such a person; we don't want to make this person identifiable to the health insurance company, because this can present a serious harm to those people, and this is not our goal. We don't want to make those people identifiable. And by ensuring kind of privacy measures, privacy-focused analysis, we can guarantee that in a way.

Bailer
Although, just as a quick follow-up, if you go through and you're trying to predict who these outlying cases might be, one could easily imagine that some unique combination of input variables might really narrow this down, just by stratifying on those, you know, by collecting those together. If someone is highlighted as being a member of this intersection of all these categories, and you'd pick them out, it seems like you are then effectively identifying the people. Does that make sense?

Christoph Kurz
Yeah, I know what you mean. I mean, it's probably not too difficult to identify; there are certain diseases or certain kinds of comorbidities that really do cost a lot. And this is known by the health insurance company as well, so this is not a secret. That's a good point. Yeah. But still, there could be someone who doesn't have those typical high-cost comorbidities or something like that, and he or she is not on, like, the focus list of this health insurance company. So we don't want to give additional attention to someone who may not already be the focus of a high-cost case study or something.

Pennington
I wonder, Christoph, if you could explain this the way you maybe would to your mother or to a friend who's not a statistician, like why they should care about this issue. I mean, that's the thing. You know, as journalists, we're always kind of trying to think about why someone should care. And I wonder, you know, this is a story about how do you protect data when you're doing research. Why should the lay person care about any of this?

Christoph Kurz
Yeah, so I think, if I take the perspective of someone participating in a study, so as a study participant, I have to ask myself: can I be harmed by my participation in a study or by my inclusion in a data set? Can I be harmed even by my census report, or can I be harmed by my social media account, for example? And in many, many cases I can be harmed, and my only option is to not participate. So there's kind of two extremes. I don't want to harm myself by participating, but maybe I would have an incentive to participate, like getting a special treatment which I wouldn't get otherwise, for example, and my only two options are participating or not participating. And what if there were a third option, like participating but not being harmed by my participation? This would be the ideal case. And this can be achieved by ensuring your privacy. And this would be the main goal of ensuring privacy, and it should be the goal of anybody who analyzes data or creates a kind of survey or a study or a cohort or something.

Bailer
You know, I'd like to thank you for taking this in a direction that I hadn't thought about until you just made that remark, because now all of a sudden I'm thinking about informed consent. So, you know, one aspect of participating in a health study, which is a great example, is that you are informed of a lot of possible outcomes. And often with treatments, like you just mentioned, it could be side effects associated with some, you know, pharmaceutical intervention, or a surgical procedure for that matter. But there's also this potential adverse outcome of disclosure of medical records. And I'm just sort of imagining how the story is going to be told to research subjects in the future. I mean, it's easy to talk about the idea of risk, whether someone understands it or not is a different question, but of these adverse outcomes. So how do you imagine describing this to someone who would be a participant in some future study? Oh, by the way, your material might potentially be released, this electronic information might be released? How would you tell the story of the consequences of this kind of loss of data privacy?

Christoph Kurz
Yeah, I mean, I think it's general practice, if you participate in a study, that you are told, of course, your data is ensured to be kept private, and your name is removed, and everything like that. But I think the reality is different. We have all heard in the news about data leakage, data breaches, data that got released without authorization because of hacking or cyber attacks, or that was made public. And on the other hand, there's also kind of a multi-billion-dollar market for health data. I think I read a recent piece in the New England Journal, and this was also a commentary on the privacy laws in the US, which are kind of almost non-existent, I think. And this has enabled the rise of a multi-million or billion-dollar industry that sells and processes and links all kinds of health data. And as a participant, I don't want that. And even if the study says, okay, my privacy is ensured, there's really no guarantee it really is, because something like a release or leakage can happen. And I think the only guarantee would be the kind of statistical privacy measures that I mentioned. If your data is mathematically private, then it also would cover the future; everything that could happen in the future, like a future release of your data, would be covered, because your data is mathematically private and it has noise injected into it so that you're not uniquely identifiable.

Bailer
So it seems like one aspect of ensuring data privacy is this introduction of noise in a clever way, so that there's still fidelity, or integrity, of the analysis that occurs. And I thought your comment about an important feature of data privacy being, you know, robustness under composition, as you stated in your article, when you're accumulating risk across multiple analyses, was a pretty powerful statement, or a pretty powerful property. Can you talk about what that means and why this is important to be protected when you have this composition of analyses?

Christoph Kurz
Yeah, so this is closely related to the privacy loss we discussed earlier. Differential privacy has this parameter that defines privacy loss, but it's also kind of a privacy budget, as I said. And composition means, in this case, that I can do a differentially private analysis on a data set, and if I do a second analysis, I gain no new information on an individual, whether he or she is included in the data, or whether he or she can be uniquely identified. So there's a kind of guarantee that if I do more analyses, I don't reveal more information than the budget allows. However, this is not infinite; it has to be defined by the privacy loss parameter. And this is maybe also a problem, because you don't know in advance how many analyses are going to happen. But if you have defined such a budget, you can deplete it, and then it's kind of gone; every analysis after that would not be differentially private, in the mathematical sense that guarantees privacy.
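[Editor's note: a minimal sketch, not from the episode, of the budget bookkeeping Kurz describes. Under basic sequential composition, the privacy losses of individual analyses add up, and once the total budget is depleted, further analyses no longer carry the differential-privacy guarantee. The class name and numbers are hypothetical.]

```python
class PrivacyAccountant:
    """Tracks cumulative privacy loss under basic sequential composition:
    running analyses with losses eps_1, ..., eps_k consumes eps_1 + ... + eps_k
    of the total budget."""

    def __init__(self, total_budget: float):
        self.total_budget = total_budget
        self.spent = 0.0

    def spend(self, epsilon: float) -> bool:
        """Return True if the analysis may run within the remaining budget."""
        if self.spent + epsilon > self.total_budget:
            return False          # budget depleted: no further DP guarantee
        self.spent += epsilon
        return True

accountant = PrivacyAccountant(total_budget=1.0)
for i, eps in enumerate([0.3, 0.3, 0.3, 0.3], start=1):
    allowed = accountant.spend(eps)
    print(f"analysis {i}: epsilon={eps}, allowed={allowed}, spent={accountant.spent:.1f}")
# The fourth analysis is refused: 4 x 0.3 would exceed the total budget of 1.0.
```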

Pennington
Well, that's all the time we have for this episode of Stats and Stories. Christoph, thank you so much for joining us today. Stats and Stories is a partnership between Miami University’s Departments of Statistics, and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter, Apple Podcasts, or other places where you find podcasts. If you’d like to share your thoughts on the program, send your email to statsandstories@miamioh.edu or check us out at statsandstories.net, and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.