Public Health Equity | Stats + Stories Episode 388 / by Stats Stories

Dr. Bhramar Mukherjee is the inaugural Senior Associate Dean of Public Health Data Science and Data Equity and the Anna M. R. Lauder Professor of Biostatistics, as well as Professor of Epidemiology and of Statistics and Data Science at Yale University. Among her many honors, she was elected to the US National Academy of Medicine in 2022.

Episode Description

The World Health Organization defines health equity as a public health concept describing equity of access to health resources for genetic, socio-environmental, and economic determinants of health, varying according to individuals, families, and social or societal groups. Concerns about data equity have surfaced, which may result in many populations, including those in rural areas, with disabilities, experiencing homelessness, or living in low- and middle-income regions of the world, being underrepresented in health data sets. This can lead to biased findings and suboptimal health outcomes for certain subgroups, which is the focus of this episode of Stats + Stories with guest Bhramar Mukherjee.

Full Transcript

John Bailer

The World Health Organization defines health equity as a public health concept describing equity of access to health resources for genetic, socio-environmental, and economic determinants of health, varying according to individuals, families, and social or societal groups.

Concerns about data equity have surfaced, which may result in many populations—including those in rural areas, with disabilities, experiencing homelessness, or living in low- and middle-income regions of the world—being underrepresented in health data sets, leading to biased findings and suboptimal health outcomes for certain subgroups.

In today's episode of the podcast, we consider: Where do we encounter problems in data equity, and how can we address them?

I am John Bailer. Stats + Stories is a production of the American Statistical Association, as well as Miami University's Departments of Statistics and Media, Journalism, and Film. I'm joined by Rosemary Pennington, chair of the Department of Media, Journalism, and Film.

Our guest today is Dr. Bhramar Mukherjee, the inaugural senior associate dean of public health data science and data equity and the Anna M. R. Lauder Professor of Biostatistics, Professor of Epidemiology (Chronic Diseases), and Professor of Statistics and Data Science at Yale University.

Among her many honors, she was elected to the U.S. National Academy of Medicine in 2022. Bhramar, it is a delight to welcome you to the podcast.

Bhramar Mukherjee

Thank you, John and Rosemary. It's a pleasure to be on the podcast and thank you for all the work that you have been doing over the years to excite and inspire the next generation of statisticians.

John Bailer

Oh, you're too kind. I'm curious: When did you first start to encounter data equity issues and their impacts?

Bhramar Mukherjee

Yes, so I think a transformative experience happened for me during COVID, when, on 24 hours' notice, my team and I decided that we were going to start modeling the SARS-CoV-2 trajectory in India.

The reason for that was that there was an extreme form of data paucity and data opacity, and also a vacuum in the modeling space that we were noticing. Statisticians were largely absent from the scene. People who were talking about the models were clinicians, physicists, and economists, but this is our lane.

So, we started to model the pandemic in India, and gradually I realized that we had the sophistication of various compartments of a dynamical model, but we did not have the data to afford those models.

That really dawned on me—that this is the problem at the very start of the statistical process. If that is corrupted, or if that process is not strong, whatever we do downstream is going to suffer.

So, I looked at the data distribution of various things that I have worked on all my life across the world and over time and realized—I will just give you one number. I work a lot in genome-wide association studies, and since 2006, in the last 20 years, hundreds of thousands of human genomes have been genotyped, but 87% of them are from Caucasian ancestry.

And South Asia, where one out of five—or nearly 20%—of the people in this world live, has less than 0.4% of the share of data. So, I was just thinking about this: 20% of the population, 0.4% of the genetic data. How is this fair?

And if we are discovering treatments and preventions based on gene-guided strategies, is it optimal for my parents?

That very personal experience drove me to formalize this notion and study it in much more depth.
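[Illustration, not part of the spoken conversation] The comparison above (roughly 20% of the world's population, less than 0.4% of the genetic data) can be turned into a single number. A minimal sketch, assuming a "representation ratio" defined as data share divided by population share; the function name and the definition are illustrative assumptions, not a metric named in the episode:

```python
def representation_ratio(data_share, population_share):
    """Data share divided by population share; 1.0 means proportional representation."""
    if population_share <= 0:
        raise ValueError("population_share must be positive")
    return data_share / population_share


# Shares quoted in the conversation: 0.4% of the data, about 20% of the population.
# The speaker says "less than 0.4%", so this ratio is an upper bound.
ratio = representation_ratio(data_share=0.004, population_share=0.20)
shortfall_factor = 1 / ratio

print(f"representation ratio: {ratio:.2f}")
print(f"underrepresented by a factor of about {shortfall_factor:.0f}")
```

Under this definition a ratio of 1 would mean a group's share of the data matches its share of the population. As the speaker notes later in the episode, proportional representation is not necessarily the target, so the ratio is a description of the gap rather than a goal.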

Rosemary Pennington

This notion of data equity is interesting because, when I first heard this term, data equity, my assumption—given other conversations about equity—was about access, right? Do all researchers or scholars have access to the same kind of data?

But really, you're arguing that it's an issue of who is included in the data. So, could you maybe share another example where this issue of data equity might impact people in a way that would make them care? Because I imagine, for a non-statistically literate audience, this might be something they're like, "Well, why does that matter?"

Could you talk us through, in a concrete way, how this would matter and impact people?

Bhramar Mukherjee

Yes, definitely.

As John mentioned, health equity—we all agree that is a good notion. Data equity sort of mirrors the notion of health equity. This is not just in terms of access to data; it's fairness, transparency, and accessibility in terms of the collection, use, and distribution of data.

We believe that, in an ideal world, the altruistic goal would be that everybody, regardless of their ideology, circumstances, beliefs, or lived experiences, should benefit from this massive explosion that is happening in the information space, both in terms of the quality of the data and the availability of the data.

I'll just give you one example. This is the classic example of pulse oximetry data.

It was actually noted way back when pulse oximeters were being developed that these oximeters do not really work well for people with darker skin tones. But we never did a proper medical device study on people with different shades of skin tone.

During COVID, when the patient population was largely Black and Brown communities, we got lots of measurements and detected that there is a problem. When the oximeter is saying that your blood oxygen reading is 92 for a person with a darker skin tone, it's really 88, and so you're not getting to the hospital on time.

This was really discovered when we had enough data and the patient mix in the ICUs and hospitals shifted from what it typically is.

Had we collected that data, had there been representation of each of the subgroups, then we would have discovered this.

There is also the question of what happens after we discover it. If we discover it, that does not automatically empower a decision. So I also want to distinguish that data equity does not really empower somebody to take the decision that they deserve. That's a different question.
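[Illustration, not part of the spoken conversation] The pulse oximetry example is at heart a subgroup comparison of device readings against a reference measurement. A minimal sketch of such a check, assuming paired readings are available; the records below are hypothetical toy values, and only the 92-versus-88 pair echoes the example given above:

```python
from statistics import mean


def mean_bias_by_group(records):
    """Average (device reading - reference value) within each subgroup.

    records: iterable of (group, device_reading, reference_value) tuples.
    """
    differences = {}
    for group, reading, reference in records:
        differences.setdefault(group, []).append(reading - reference)
    return {group: mean(values) for group, values in differences.items()}


# Hypothetical toy data, for illustration only.
records = [
    ("darker skin tone", 92, 88),
    ("darker skin tone", 94, 91),
    ("lighter skin tone", 92, 92),
    ("lighter skin tone", 95, 94),
]

bias = mean_bias_by_group(records)
for group, value in bias.items():
    print(f"{group}: device reads {value:+.1f} points relative to reference")
```

A check like this can only detect a gap for subgroups that appear in the data with enough paired measurements, which is the representation point made in the conversation.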

John Bailer

So, with researchers that are encountering this, are they surprised by the idea that, "Oh no, I've been taking these data sets that are not representative"? They may recognize some of the limitations in the work that they do, but they may not recognize a fundamental limitation in how the data are even collected.

What is the reaction of researchers who are encountering this for the first time?

Bhramar Mukherjee

I'm just thinking that, by researchers, I'm going to broaden the horizon.

If we think about statisticians, we are often used to working on the data that's given to us. We often think about Y and X as given to us. But I think we should be involved from the very onset of the study and think about what is the target population and what is the inference.

I think statisticians and epidemiologists are probably the two disciplines of science that are very, very well trained to think about the origin of data, the collection of data, the sampling of data, and the representation of data.

In computer science, many times my friends actually take data at face value—that this is Y, this is X. If this is measured with error, or if it's coming from a selective population that's not representative of the target population, those notions naturally do not arise in their minds, and they just think, "I'm going to predict."

But I, as a statistician sitting in a school of public health, am going to think about my job not just to predict, but to prevent as well. And for that, I need to have data from the population in which I'm trying to prevent the disease.

So, if you want to connect prediction and prevention, then of course you need to think about this.

When you actually see the representation of data across the world, or you think about the training corpora of these massive AI tools that we are using—GPT, Gemini, Claude—who are in these data sets?

Just think about it. The latest Pew survey shows that 20% of Americans have never heard about a chatbot and 66% have never used a chatbot.

When we are using these tools, the user data is retraining these models again. So this cycle is reinforcing itself, and then people are left out when AI and all these tools are used for decision-making, education, and medical discoveries.

This issue is becoming quite acute, particularly in the last three or four years. People are becoming much more cognizant about the question of who is visible to AI and who AI is visible to.

Rosemary Pennington

Bhramar, you are one of the co-authors of a piece in JAMA Health Forum that came out earlier this year, "The 10 Core Concepts for Ensuring Data Equity in Public Health."

I wondered if you could talk us through how you and your co-authors came up with these core concepts and why you felt like you needed to publish this.

Bhramar Mukherjee

Yes, so thank you for the question, because I have been thinking about this, and this has been hibernating inside me for the last five years.

First of all, I have been thinking about data equity, and I did not know—I have been a practicing statistician for nearly 30 years, a student of statistics—but I had never encountered this term.

Then, when I researched it, I saw that in September 2024, the World Economic Forum actually released its first report on an action-oriented framework for data equity, but it did not have a statistical perspective on how, throughout the data lifecycle, I could implement strategies to prevent inequities.

Being a statistician, I always try to combine these qualitative, altruistic ideas with metrics, numbers, and processes that we can implement from the design phase to the data collection phase, to the prediction phase, to the dissemination phase, and to the policy adoption phase.

So I was thinking, how can I think about metrics?

We had a lot of discussion across clinicians and public health researchers to think about how we really need to blend ideas from computer science regarding fairness, access, transparency, ethics, privacy, and confidentiality of data with public health concepts such as selection bias, representativeness, generalizability, and causal inference.

These two have to be blended together, and then they have to be wrapped in the notion of reflexivity.

I realized that there is a whole field of philosophy called critical data studies. Critical data studies and critical computing have appeared in the literature, but as statisticians, we are not trained to think about data as a topic of debate and discourse from a social and epistemological point of view.

But when we wrap these notions of concrete metrics in the framework of critical data studies and reflexivity, that's where I feel we are closer to thinking about data equity.

We also wanted to distinguish that having enough data does not imply that you have prediction equity.

For example, with the South Asian example that I gave you, I'm not sure that we really need 20% of the data. We may be able to do it with much less because the human genome is largely similar across ancestries. But I can tell you that 0.4% is not enough.

So, what is the right, optimal percentage of data that we should collect on a subgroup, given the current distribution of data?

Then, even if we give somebody a prediction that they are at the highest risk, does that empower them to modify behavior or seek care? That is also decision equity.

So, we wanted to distinguish these three phases of equity and think about concrete metrics, frameworks, and principles that can be implemented so that you can say, "Okay, I was self-auditing my own work as a statistician to implement these principles."

John Bailer

So, given your awareness of data equity issues—and you've probably designed more studies since you've really embraced and investigated this—how has this changed the way that you've approached a new study?

Bhramar Mukherjee

Yes, so, you know, I think I have always believed that you have to walk the talk.

It is unusual for a statistician to go into the field and collect data, and I actually just decided to do that. If I am not happy with some of the data sets I'm using currently, I have to be part of efforts that are collecting more equitable data sets, because you cannot do everything from nothing.

If you have a lack of representation, yes, we can weave our statistical magic through weighting, calibration, and bringing in external data to fix the problem as much as we can. But unless we have enough data on different populations, there are limits to what we can do.

So, with my colleagues at the University of Michigan, where I was for 18 years, I launched a study called MI-CARES, the Michigan Cancer and Research on the Environment Study.

This is led by my colleagues Leigh Pearce, a cancer epidemiologist, and Dana Dolinoy from Environmental Health. It is recruiting participants from different parts of Michigan, and we consciously chose to have 25% White, 25% Black, 25% Latinx, and 25% Middle Eastern and North African participants, the last of these being a very special population living in the Dearborn area of Michigan.

So, I felt like efforts can be local to solve a global problem—that we all have to do our part.
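[Illustration, not part of the spoken conversation] The weighting mentioned in this answer can be sketched as simple post-stratification: each group's weight is its population share divided by its sample share. This is a minimal sketch under stated assumptions, not the method used in any study discussed here; the group labels, counts, shares, and outcome means are all hypothetical:

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight per sampled unit in each group: population share / sample share."""
    total = sum(sample_counts.values())
    weights = {}
    for group, count in sample_counts.items():
        if count == 0:
            # With no data on a group, no weight can recover it.
            raise ValueError(f"no sampled units in group {group!r}")
        weights[group] = population_shares[group] / (count / total)
    return weights


def weighted_mean(group_means, sample_counts, weights):
    """Weighted average of group-level means over all sampled units."""
    numerator = sum(weights[g] * sample_counts[g] * group_means[g] for g in group_means)
    denominator = sum(weights[g] * sample_counts[g] for g in group_means)
    return numerator / denominator


# Hypothetical numbers: group B is 40% of the population but 10% of the sample.
sample_counts = {"A": 90, "B": 10}
population_shares = {"A": 0.6, "B": 0.4}
group_means = {"A": 1.0, "B": 2.0}

weights = poststratification_weights(sample_counts, population_shares)
unweighted = sum(sample_counts[g] * group_means[g] for g in group_means) / sum(sample_counts.values())
estimate = weighted_mean(group_means, sample_counts, weights)

print(f"unweighted mean: {unweighted:.2f}, post-stratified mean: {estimate:.2f}")
```

The weight for group B is large because ten sampled units stand in for 40% of the population, so the corrected estimate leans heavily on very little data. When a group has no sampled units at all, the function raises an error, which mirrors the point that weighting has limits without enough data on each population.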

John Bailer

You're listening to Stats + Stories, and we're talking with Bhramar Mukherjee about data equity and more.

Bhramar, you do an amazing amount of work with public communication. You did this during COVID, and you continue this now. What led to this passion for communicating to a general audience?

Bhramar Mukherjee

I grew up in a family of liberal arts, and my father is actually a very well-known actor on the Calcutta stage. So, in our family, we never got credit for numbers. Words were important, and much more important than numbers.

At the dinner table, my father actually never looked at our grade reports. I was always good in terms of studies, so I wanted to show off, but he never paid any attention.

His only criterion was that, when people came over for dinner, the conversation at the table should make it clear that you were being educated. That's a very nebulous concept, but as we progressed through time, that criterion remained.

He's 86 years old now, and he just produced a new adaptation of Waiting for Godot. He's very active.

I grew up in that world of stage expression, entertainment, and that kind of philosophy. So I always wanted to make math and statistics more interesting.

I was very conscious of this in the classroom—that I should make my best effort to make complicated concepts easy. I feel like if you know things really well, you should be able to teach them well to people from different disciplines.

Biostatistics is a field that really relies on collaboration. I do think that the four Cs we need for a biostatistician of the next age are computation, collaboration, communication, and a little bit of courage to think about stretching outside our comfort zone.

It was really during COVID when I felt my communication skills came into play. I was speaking quite openly about the pandemic in India on television shows and writing columns in Indian newspapers.

All across the world, our model became the go-to model for journalists, stakeholders, and policymakers. One reason for that was effective communication and the use of different forms of communication.

It was not just going on a talk show or NPR but also having visual descriptions of the data that were updated daily.

We created a website called Covind19.org, which ran for more than 800 days with web scraping and automated updating. We have to think about sustainable ways of updating data and running models. If it required human intervention every day, it would not be possible.

Because we were updating and doing nowcasting for more than 800 days, that experience of building a website, visual dissemination, communication, and talking to policymakers was a priceless, invaluable experience for me.

So, I had the interest, but COVID was the catalyst that made it all happen. I also took a lot of media training through the University of Michigan and other opportunities.

Rosemary Pennington

Given what you told us about your family background, I am very curious how you got interested in biostatistics and public health.

Bhramar Mukherjee

Well, that's also sort of a reverse causality.

At our dinner table, there was never closure on any discussion. So I, as a person, somehow sought objectivity, and I felt that the language of mathematics was universal.

Because I was always focused on my grades in English essays, I felt that they depended on the teacher—whether they liked my style, liked my thoughts, and so on. With the same essay, you could get a score anywhere from two to 10 out of 10, depending on who the grader was.

So, I wanted a greater sense of neutrality in the answer—that regardless of who I am or who the instructor is, if I get the answer right, I should get points and I should get credit.

You can probably see this passion for equity through mathematical truth, because the notion of truth is getting so lost.

Probably there was always a passion in me for objective and concrete answers while swimming in this ocean of artists and critics who could never agree on anything.

John Bailer

This is a great story.

As you think about some of the communication challenges you faced when you really embraced the challenge of addressing COVID with journalists and broadcasters, what were some of the greatest communication challenges you encountered? And what were some of the ways that you addressed those challenges?

Bhramar Mukherjee

The first challenge that I faced is that I'm not a native English speaker.

There is a minute amount of micro-processing that goes on in my head, and I always feel that if I could speak in Bengali, I would be much more fluent. So I had to overcome the concern that people might judge me based on my accent, and instead focus on what I say, how I say it, and whether I can communicate through stories.

I will tell you that, during all of these interactions with the media, there was one episode with BBC Bangla in Bangladesh where I was allowed to speak in Bengali. That was my best communication piece because I could actually refer back to cultural anecdotes, politics, and poetry in order to explain a pandemic.

I realized after that episode that I'll probably never have that same level of comfort in English. But does that mean I should let perfect be the enemy of the good?

I think I'm okay as a storyteller, and I can still try. I would also say that knowing the Indian context really, really helped me engage with the media in India. So, I consciously decided to retain my focus on India.

This is something I would recommend to people who are non-native English speakers. You have to show that fourth C—a little bit of courage—and you will see that people are receptive to your stories.

The second thing I learned, really the hard way, is that live television is unforgiving. You say something, and you cannot take it back. So, you have to be very careful and prepare extremely well. If possible, ask journalists for the questions in advance so that you can be prepared. But even with all your best efforts, everything is politicized.

I was trolled so much because of my comments on COVID—when the next wave was going to come, because of the models, and everything else. So, you have to be very resilient. If you are in the public eye, there will be a lot of criticism. At some level, you have to adopt the principles of statistical learning and be both robust and efficient.

Rosemary Pennington

What advice would you have for journalists who are covering complicated public health stories filled with statistics?

You've had lots of experience being interviewed by journalists, but journalists are trying to do this work well. Given your work as an effective and successful communicator of scientific facts, what advice do you have for journalists?

Bhramar Mukherjee

That's a great question.

I think it's good to build relationships, just as I have developed very close friendships with journalists around the world and stayed in touch with them. If I have a communication question, I reach out to them.

I believe in expertise. Expertise can never be substituted, especially deep expertise. So similarly, I would recommend that journalists have data scientist friends they can lean on.

I also think there are many courses now on data translation and applied data science. Blending journalism training with some data science training—which most serious data journalists already do—could be a very good idea.

The third part is domain science. Even with the data and even with the story, clinical and contextual experience is priceless and cannot be replaced.

So, I think it's really a triad of domain knowledge, data science knowledge, and communication expertise that comes together to define a story.

John Bailer

So, in your work in data equity and in biostatistics, what's your next challenge? What are you working on now, and what do you anticipate working on in the near future?

Bhramar Mukherjee

Thank you for that question.

This is a very difficult question because I moved to Yale about two years ago, and the reason I moved to Yale was that I had data equity in my job title.

I think I posted on LinkedIn that I'm the only person in the world with data equity in my job title, and I did run an algorithm to search very carefully.

A place like Yale—an organization that has existed for 325 years, disseminating high-quality education and research—putting a name to a program was of great value to me, and that immediately drew attention.

So, we are trying to establish a program in data equity in terms of generating research and developing a concentration down the road in responsible AI and data equity.

What I'm working on right now is applying some of these classical statistical principles to generative AI.

I do think that the same problems of selection bias, of leaving people out of the training corpora and focusing only on the user samples of data sets, are happening over and over again.

I just launched a new course this semester based on that paper you mentioned, Rosemary, explaining each of the concepts so that it can serve as a guide toward practicing data equity.

The course is called Ethics and Equity in Data Science and Artificial Intelligence, and I'm writing an e-book on it with my teaching fellow so that people have guidelines and a roadmap.

I want people to see that this is not just abstract idealism—it can be practice. That has been my mission.

I also want to spend quite a bit of time in India. We just launched a summer program there.

India is very strong in medicine. India is very strong in statistics. But the two worlds coinciding, intersecting, and defining a strong model for biostatistics and health data science is still lacking.

One of my dreams in life is to start a high-quality biostatistics institute in India. So next year I'll be on sabbatical, planting the seeds of that idea and seeing where it takes me.

John Bailer

Well, I'm afraid that's all the time we have for this episode of Stats + Stories. Bhramar, thank you so much for joining us.

Rosemary Pennington

Yes, thank you so much.

Bhramar Mukherjee

Thank you so much, John and Rosemary, for your wonderful work and for giving me the opportunity to share with your audience.

John Bailer

Oh, you're welcome.

Stats + Stories is a partnership between the American Statistical Association and Miami University's Departments of Statistics and Media, Journalism, and Film. You can listen to us on Spotify, Apple Podcasts, or other places where you find podcasts. If you'd like to share your thoughts on our program, send your email to statsstories@amstat.org or check us out at StatsAndStories.net.

Be sure to listen for future editions of Stats + Stories, where we discuss the statistics behind the stories and the stories behind the statistics.