Caitlin Augustin (@augustincaitlin) is a Senior Director at DataKind and is responsible for delivering DataKind's core offerings, ensuring that high-quality, impactful data science interventions are created in partnership with social sector leaders. Prior to DataKind, Caitlin worked as a research scientist at a digital education company and as an engineering professor at NYU. A lifelong volunteer, she is engaged with Central Florida's nonprofit community and is the organizer of the Orlando Lady Developers Meetup.

Matt Brems is Managing Partner + Principal Data Scientist at BetaVector, a data science consultancy. His full-time professional data work spans computer vision, finance, education, consumer-packaged goods, and politics, and he earned General Assembly's 2019 "Distinguished Faculty Member of the Year" award. Matt is passionate about mentoring folks in data and tech careers. He volunteers for Statistics Without Borders and currently serves on their Executive Committee as the Marketing & Communications Director.

Episode Description

There's a lot of conversation happening about the ethical uses of data and statistics. How much weight should we put on numbers at all? How thoroughly should we investigate the methodologies used to create them? And who has access to the data? A special issue of Chance focuses on statistics and data science for good, and that is the topic of this episode of Stats and Stories with guests Caitlin Augustin and Matt Brems.

Read the Full Issue of Statistics and Data Science for Good

Learn more about Caitlin’s work at DataKind

Charting the Next Five Years of Colandr
DataKind Tools Help Tell the Story of Housing Loss in the Sun Belt
Engineering Scalable Data Quality Assessments for Frontline Health with Medic Mobile

Full Transcript

Pennington
There's a lot of conversation happening about the ethical uses of data and statistics. How much weight should we put on numbers at all? How thoroughly should we investigate the methodologies used to create them? And who has access to the data? A special issue of Chance focuses on statistics and data science for good, and that's the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington. Stats and Stories is a production of Miami University's Departments of Statistics and Media, Journalism and Film, as well as the American Statistical Association. Joining me is regular panelist John Bailer, chair of Miami's statistics department. We have two guests joining us today. Caitlin Augustin is responsible for delivering DataKind's core offerings, ensuring that high-quality data science interventions are created in partnership with social sector leaders. Before DataKind, Augustin worked as a research scientist at a digital education company and as an engineering professor at New York University. Matt Brems is senior manager of data science product and strategy with DataRobot and managing partner and principal data scientist at BetaVector, a data science consultancy. His full-time professional data work spans computer vision, finance, education, consumer packaged goods and politics. Brems volunteers with Statistics Without Borders and currently serves on the executive committee as vice chair. They, along with Davina Durgana, guest edited the special issue of Chance focused on stats for good. Caitlin and Matt, thank you so much for joining us today.

Matt Brems
Thank you for having us.

Pennington
I guess I'm just curious what prompted this particular special issue.

Brems
As you mentioned, I work or volunteer with Statistics Without Borders. And my colleague, Dr. Davina Durgana, who currently serves as chair of Statistics Without Borders, came up with the idea to get a group of individuals together to talk with DataKind, to talk with other organizations, and see: is this something that we can put together? Specifically, can we talk about how data science and statistics and this philanthropic work that we in these various organizations do can be used for good? So through conversations between her and Dr. Augustin, we had talked about, really, how can we publicize this a little bit more? And I believe it was Davina and Caitlin who reached out to Amanda at Chance to really set the wheels in motion here, and then started to broaden the network of people who we could bring in and the types of articles that we wanted to include, and went from there.

Bailer
So can you talk a little bit about the types of articles? And you know, perhaps, Caitlin, you could also add on to kind of the good reception and an early framing of this project?

Caitlin Augustin
Yeah, absolutely. So as Matt said, you know, this was really the brainchild of the team at SWB. And DataKind was thrilled to be invited early on to join the conversation. And what I think is so unique is the group of organizations represented in the edition. We span the pro bono data science movement: DataKind being a volunteer-driven organization, SWB being volunteer driven, representation from academia, from the AAAS, from other think tanks, from entities who are doing data for good work at many levels of both impact and organizational engagement, but also data maturity. So when you get into the articles in Chance, you'll see articles that are really about traditional statistics. You'll see data visualization, so how can we honestly make basic data and reporting accessible to organizations to help them drive data-driven decision making and increase their social impact, through to doing, you know, perhaps more advanced machine learning and artificial intelligence work. And I think that's the real power of bringing together so many organizations under this big tent or big umbrella of data for good.

Pennington
Maybe it would be a good time for you each to explain what Statistics Without Borders and DataKind are for the people who are not clued into that.

Brems
So Statistics Without Borders is an organization of, to date, about 1,800 volunteers around the globe. We were founded in 2008 to be an organization under the auspices of the American Statistical Association. And really the four people who co-founded Statistics Without Borders just said, hey, there's a lot of people with some extra time and some extra energy. There's certainly this enormous need in the world for people to make better data-driven decisions or statistically driven decisions. And how can we, as these people, as these volunteers, help address that need? So this group of volunteers came together. And Gary Shapiro is one of the people who co-founded SWB, and to this day he remarks about how Statistics Without Borders has scaled. He said that he thought that maybe this is something that would take on a project or two a year, that maybe Statistics Without Borders might grow to 30 or 40 volunteers. And now in 2021, we've got, I believe, north of 1,800 volunteers at this point. So it's a fully volunteer organization. And what we do is organizations come to us, specifically not-for-profit organizations, and say: we have a problem. We believe that we can use statistics or data science to solve it. But we need your help. We don't have the money, or we don't have the in-house talent, to be able to have a data scientist or statistician or anybody help us with this. We need your assistance. And so that project will then go through a process that we've defined, and Statistics Without Borders, in a nutshell, just helps these client organizations work on their projects and their work, focusing on social good and the betterment of society as a whole.

Bailer
So before we go to DataKind, could you give an example of one or two of the projects, maybe even recent projects, that Statistics Without Borders has worked on?

Brems
Yeah, the projects themselves are really diverse. So one example: when COVID-19 hit and racked the United States and the globe as a whole, one of the big challenges in the United States was that small businesses were shuttering. Small businesses needed to find ways to remain viable. Larger organizations had a bit more security, but small businesses, if they needed to lay off employees, or if they needed to make really hard decisions, how would they go about doing that? So the US government made funding available, and other organizations made funding available to small businesses as well. But that's tough for a small business to navigate. A small business is making the decision about who to lay off if needed. How do you, you know, scrimp and save to make sure that you can keep your doors open so that you can provide for yourself and provide for your employees? Small business owners in particular probably don't have a ton of time to go on Google and start looking for what funding sources are available, what grants are out there, and all of that. So one of the projects that Statistics Without Borders did, and it's actually been a series of projects now, was that a group of people developed a machine learning approach to identify grant-awarding organizations and pull all of that information together into one central repository. That way, if there was a small business owner who said, maybe I do want to try and get some sort of external funding to help me out, how can I do that in an easier way than googling 40 different organizations and filling out all these different applications, they would have somewhere to go. So that's just one of the many different types of projects that Statistics Without Borders has done.

Bailer
So Caitlin, do you want to help flesh out DataKind for us and some of the work that's done there?

Augustin
Sure, absolutely. So DataKind, we're a global nonprofit with the focus of using data science, machine learning and AI in the service of humanity. So different from SWB, we are an organization that does have a central staff. We used to be headquartered in New York City and are now headquartered all over the world. We were founded back in 2012, so we're coming up on our 10-year anniversary now. And we have a global network of five chapters worldwide and 20,000 volunteers over our time. Over the past nine years, we've done over 300 projects advancing the social impact space in various ways. And we work through a number of programmatic offerings. Our big statement is to start with the problem, not the data. And so our first way of engagement with social impact organizations is through something that we call discovery. And we run events called discovery days, which offer the chance for just a one-on-one conversation between an expert in data science and an organization that has a problem that maybe data science can solve. Now, we're all data scientists and technologists, so we think that that's the right lever to pull. But it's important to acknowledge that data science, machine learning, and AI can't solve everything. And that's been a lesson that I've certainly learned, both from my time as a DataKind volunteer and then the past four-ish years as a staff member. We'll work on projects where your answer is, you just have to pay health workers, you know, and there's no data science, machine learning or AI that's going to be able to fix that. So it's important to come to these problems and acknowledge what we can do and what we can't do. And that's something I'm really proud of us doing at DataKind. And I know one of the topics that you mentioned was sort of the space of data ethics. And really, the first ethical thing is, do we even need to do this thing in the first place?
Now, DataKind wouldn't be in existence if we didn't believe that there was a reason to do these things. And so we start with discovery, and we work to identify problems that we think are data-science-able. We have events that we call DataDives, which are weekend-long social impact hackathon-style events, which allow us to dive into problems. We just had one in September that had a focus on community-level data, where we brought nearly 300 technologists from around the world together to work on problems at a community level. In March, we had a separate event that was really focused on open data, and it had almost 600 attendees from across the globe, just really trying to identify the pathways to using data science for impact. DataKind's flagship program is the DataCorps model, which is where we build a partnership with a social impact organization for a period of months to years. These can run as fast as three months, and they can take as long as almost two years. And we work to co-create a solution with a social impact organization that is an end-to-end solution. So we start from that problem space, and we end up delivering a data product. And that could be something as complex as work that we're doing that we shared in Chance, building out tools to do data integrity testing across a health platform, or as simple as building Tableau dashboards so that a food pantry can check their inventories. So we really look to meet a partner organization at the level of data maturity that they're at, and deliver them something that is helpful, useful, and sustainable to them in the end.

Pennington
You're listening to Stats and Stories. And today we're talking about a special issue of Chance focused on statistics for good with two of the guest editors, Caitlin Augustin and Matt Brems. Now, when you were discussing your two organizations, you both were talking about this diverse array of projects that you've taken on. The special issue of Chance features a diverse array of articles talking about data and stats for good. So I wondered if you could talk us through how you decided what kinds of articles you wanted to include in this special issue.

Brems
One of the biggest challenges with this article was figuring out how we fit it into the number of pages that were permitted. There were a number of times when I wanted to reach out to Amanda Peterson Plunkett, the Executive Editor, and say, what if we just change the font to, like, size seven?

Bailer
Students have been doing this for decades, this is usually the other way.

Brems
Right. You know, usually it's making the periods larger and things like that. There were so many really, really good articles that were submitted. And I think that a lot of that came from the fact that, internal to DataKind and internal to Statistics Without Borders, we encouraged our members to submit to this and worked hand in hand with different groups within our organizations to get those papers ready for submission. In addition to that, we had an early conversation with other organizations, including the Royal Statistical Society, including LISA, including AAAS, and invited them to participate as well. And I think they did something similar, where the leaders or the representatives from those organizations spoke with their respective participants, or volunteers, or whomever, and encouraged them as well to submit articles. So we had a good number of submissions. I forget the exact number. But we started with that. And then once we were able to settle on what articles would be accepted, Caitlin did what I would describe as a masterful job of mapping all of those to a framework for social good centering around the United Nations Sustainable Development Goals, or SDGs.

Bailer
Well, that obviously leads so nicely to the next question. Thanks for the lob to the net there, Matt. So Caitlin, do you want to talk just a little bit about what the heck is an SDG? I mean, we've had some folks from the UN come in and talk a little bit about this a number of episodes ago, but I really liked that too when I was reading the intro to this Chance issue. I mean, you know, seeing that that was kind of the organizing principle really resonated with me. So Caitlin, would you like to talk a little bit about some of those SDGs that were particularly relevant for your organizing of these contributions?

Augustin
Sure. And so, you know, for your listeners who aren't as familiar with the SDGs, they're called a blueprint to achieve a better and more sustainable future for all. There are 17 different goals, from eliminating poverty to increasing access to education for all to building sustainable infrastructure. And they're really, truly the things that we are focusing on, or should be focusing on, as a world in order to really achieve positive social development and positive social impact. At DataKind we actually ask ourselves: if we can't map a project back to a Sustainable Development Goal, is it within our mandate? Should we be doing that? That works as an organizing idea in data for good. It's a really important philosophy for us at DataKind. What I thought was pretty amazing, going through the Chance submissions, is that we hit on so many SDGs. Often DataKind does a lot of our work in the frontline health space. So I was fairly certain we were going to see, let's see, reducing maternal mortality; let's see, indicators around access to health care. But it was really wonderful to see submissions that came in that were focused on the environment and sustainability and really building out access to infrastructure, with infrastructure being digital infrastructure and digital tools as well. And we used the SDGs as our organizing thought to say: these are problems that matter, they are the challenges of our time, and we think volunteerism and pro bono data science efforts have a real role to play in helping to solve those problems.

Pennington
You know, journalists like to think of themselves as being very important to democracy and democratic ideals. And so as I was reading your opening letter of the issue, there's this point where you raised the importance of the democratization of data. And I wonder, why is it important that we democratize access to data? Why should that matter to people outside of data and statistics?

Augustin
You want to empower people to make the most informed decisions possible, and you want to give decision makers the best chance at having and making something successful and sustainable. A project that we've been working on at DataKind has been with New America's Future of Land and Housing program. And one thing that we found when we started doing this work was that in something like only 70% of the United States can you actually access eviction data. So across the United States, you don't even uniformly know how many people are being evicted, how many people are losing their homes. And if you're an organization like New America, who's trying to push policy around housing security, not having that information, and not being able to say, here's the number you need to move, here's the magnitude of the problem, it's crushing. So we need to be able to open that information up. And then you take it another layer down. New America is doing great work at a national level pushing the policy. But I live in Orlando, and I sit on the evictions and foreclosures task force in my community. And for my evictions and foreclosures task force team, it was revolutionary when they started to get weekly reporting of those numbers, because they were trying to keep people in homes, they were trying to share out rental assistance. But without knowing where to target that aid, you know, you're flying blind and you're doing your best guess. So being able to open up data and make that information not just available, but accessible and visually useful to actors, will help us drive solution building at so many levels of society.

Brems
I'd like to jump on a specific phrase that Caitlin mentioned toward the start of what she was saying. She used the phrase informed decisions. We want people to be able to make informed decisions, more informed decisions. Oftentimes we think about things in terms of data-driven decisions. Well, if we think back to any sort of introductory statistics course, we have to define data for students. Or, if we're statisticians or data scientists, we try to articulate it when a family member says, well, what is that? That's something that I still grapple with. You know, every year at Thanksgiving, it's like, okay, but what do you actually do? Often, data starts with information. So when we think about this notion of informed decisions, really we can think about that as data-driven decisions, because data is information, that information is context, and that context is critical when making any sort of decision. And so by providing more data, by providing more information, by providing more context, we help people to make better decisions. We help people to avoid the challenges of asymmetric information. We help organizations get access to just more context in order for them to make their decisions. And so democratizing access to data in that regard is really important. I also think there's a recent discussion that's been going on. I saw it on Twitter, and I forget exactly who was talking about this. But I do want to add a caveat here that, especially in this realm of big data, which is the fanciful thing that everybody loves to talk about, and how big data will solve our problems, more data doesn't necessarily address our problem. More data, or in this context, big data, can move us toward the same answer with more precision. So for us, as statisticians, as data scientists, as people focused on how we can do the most good with our time and our energy and our resources, one key role that I think we play is helping people to distinguish between those. In this case, is more data, and democratizing access to data, necessarily better? And I think in many cases, the answer is yes. And in cases where the answer is no, we as statisticians and data scientists play a vital role in helping people to better understand that. We provide the context behind why more biased data, or more skewed data, isn't necessarily going to help you make a more data-driven decision or a better decision with better context or anything like that.
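Editor's illustration (not part of the recorded conversation): the point about biased data can be sketched with a small simulation. Everything here is a made-up assumption for illustration: a hypothetical population where 10% of records have some outcome, and a collection process in which records with that outcome are only half as likely to be captured. Larger samples make the estimate more stable, but it settles on the wrong value.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: 1 = record has the outcome of interest, 0 = it does not.
# The true rate is 10% by construction.
population = [1] * 10_000 + [0] * 90_000
true_rate = statistics.mean(population)


def biased_sample(n):
    """Draw n records, but records with the outcome are captured only half the time."""
    sample = []
    while len(sample) < n:
        x = random.choice(population)
        if x == 1 and random.random() < 0.5:
            continue  # this record never makes it into the dataset
        sample.append(x)
    return sample


def summarize(n, repeats=100):
    """Repeat the biased collection and return (average estimate, spread of estimates)."""
    estimates = [statistics.mean(biased_sample(n)) for _ in range(repeats)]
    return statistics.mean(estimates), statistics.stdev(estimates)


small_mean, small_sd = summarize(100)
large_mean, large_sd = summarize(2_500)

print(f"true rate:         {true_rate:.3f}")
print(f"n=100   estimate:  {small_mean:.3f} (spread {small_sd:.3f})")
print(f"n=2500  estimate:  {large_mean:.3f} (spread {large_sd:.3f})")
```

The spread shrinks as the sample grows, which is the "more precision" part, while the average estimate stays well below the true rate, which is why more of the same biased data does not lead to a better decision.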

Bailer
So that leads to a very natural follow-up, to me, about one of the particular articles, Caitlin, that you were involved with, titled "Pathways to Increasing Trust in Public Health Data." And you know, given that we still are walking lots of places indoors with masks on and still experiencing this pandemic, this idea of increasing public trust is such a critical idea, although you're doing it in a different kind of context. So I was very much intrigued by the data quality vicious cycle. I mean, most people are talking about cycles, you know, in the positive sense, but here it's the idea of breaking a vicious cycle, where if data were viewed as being, you know, inconsistent or problematic, in some sense it ends up destroying trust. So you're trying to address that directly. Could you just give a quick synopsis of what you were trying to address in that project?

Augustin
Absolutely. So the pathways paper is detailing our ongoing work between DataKind and a digital health organization named Medic. Medic provides a health platform and a toolkit for collecting health data by community health workers, so people who are truly providing care at that last mile, in hard-to-reach communities. And that data is collected by an application. An individual will do a routine care visit, capture data and go back. And what we heard was that the data isn't trusted. So something is causing a lack of trust in community health worker collected data. And once you hear that the data isn't trusted, well, then you start to make decisions based on something that isn't that data. So it's like, oh, well, it's a lost cause. I'm not going to use that data. I'm going to use my own mental model, I have some other heuristic. And then you just don't fix your data collection processes. So you continue to collect that data. So what DataKind and Medic have been working to do is to build, in a systematic way, data quality checks. It's a data integrity toolkit that identifies inconsistent or problematic data collected and surfaces that data for remediation. And that remediation can look like verification: no, that actually is the right data, but we're checking it, we're verifying it. It could be, okay, maybe that needs to be cleaned in a different way. Maybe it needs to be replaced. Maybe the toolkit needs to be cleaned. So one interesting thing that we found in doing this was you would have a recorded birth, two children born, and you'd have a recorded number of children alive, three children alive. Oh, well, that doesn't make sense. We just had two kids that were born. So why don't we build checks into our toolkits that eliminate the possibility of collecting inconsistent or problematic data in the first place?
And so we've built this ongoing relationship with Medic, an ongoing toolkit where we are building detection at their platform level on the Community Health Toolkit, rather than at an individual and post hoc level, which is where most data quality conversations in frontline health happen. They've sort of been at a research level, at a one-off level. You know, it's been diligent individuals with spreadsheets looking at it. And we think that data science and machine learning can provide a better way to attack this across the platform and across the deployment, with the end goal of making it as simple to debug your data as you would, say, debug your code. So you'd have the flag and the ability to remediate.
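Editor's illustration (not part of the recorded conversation): the children-born versus children-alive example can be sketched as a simple rule-based consistency check. This is a minimal sketch under stated assumptions, not the toolkit DataKind and Medic built. The record layout, field names, and the `find_inconsistent` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HouseholdRecord:
    # Hypothetical fields; a real health platform's schema will differ.
    record_id: str
    children_born: int
    children_alive: int


def find_inconsistent(records):
    """Return (record_id, reason) pairs for records that fail basic consistency rules."""
    flagged = []
    for r in records:
        if r.children_born < 0 or r.children_alive < 0:
            flagged.append((r.record_id, "negative count"))
        elif r.children_alive > r.children_born:
            flagged.append((r.record_id, "more children alive than born"))
    return flagged


records = [
    HouseholdRecord("a1", children_born=2, children_alive=3),  # the case described above
    HouseholdRecord("a2", children_born=2, children_alive=2),
    HouseholdRecord("a3", children_born=4, children_alive=1),
]

flagged = find_inconsistent(records)
for record_id, reason in flagged:
    print(f"{record_id}: {reason}")
```

Flagged records are only surfaced here, not altered, which matches the idea that remediation may turn out to be verification, cleaning, or replacement. The same rule could also run at entry time so that an inconsistent record is caught before it is saved.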

Pennington
As we wrap up, both of your organizations use volunteers, and I wonder if you could share how someone who was interested in becoming involved in DataKind or SWB might become involved.

Augustin
Sure. So we at DataKind actually have a call for 60 volunteers out right now. So if you would like to volunteer with DataKind, connect with us on Twitter or LinkedIn. We are actively recruiting volunteers now. Following us on social media is the best way to stay engaged and to learn about future volunteer opportunities.

Brems
Similarly, we recommend that if people want to get involved as volunteers with Statistics Without Borders, you can head to our site. We're always looking for new volunteers. And so if you head to statisticswithoutborders.org, toward the bottom of the first page, there's a volunteer application. And we're always looking for people to help assist on projects, like many of the projects that are described in the Chance magazine articles, but also in more diverse opportunities and roles. For example, I did a stint working in marketing and communications, and we want someone to run our social media accounts. And so it's helpful to have someone who has the skills and the time and the interest in doing that. So it need not be just a project-oriented role. We've got a lot of different roles available. There are other areas where we would love for people to help out, too. One, feel free to engage with us on social media. In addition to engaging with us on social media, one of the things that we'd love for people to do is help us find clients. So if you know of an organization that is a not-for-profit organization in the United States, or beyond the United States, and that has some sort of data science or statistical need, maybe they don't have the budget to hire someone like a consultant, maybe they don't have in-house statisticians to do the work that's necessary. Feel free to have them reach out to us as well. Also, on our website, there's a place where prospective clients can submit information about their project. Then, similar to what Caitlin shared earlier about how DataKind is structured, we've got a structure internal to SWB with a new client acquisition team. And then that moves to our project and client managers and our expert statistical consultants, and all of that.
But to kind of put a finer point on that for people who are interested in getting involved, you can go to our website to be involved as a volunteer, you can encourage people to go to our website if they are a prospective client in need of support, and then feel free to engage with us on most social media platforms.

Pennington
Well, that's all the time we have for this episode of Stats and Stories. I really hope people will pick up this issue of Chance magazine. There's some really interesting stuff in there. So congratulations on it. Thanks for the great work. And thank you both for being here today, Caitlin and Matt.

Augustin
Thanks so much.

Brems
Thank you.

Pennington
Stats and Stories is a partnership between Miami University’s Departments of Statistics, and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter, Apple podcasts, or other places you can find podcasts. If you’d like to share your thoughts on the program send your email to statsandstories@miamioh.edu or check us out at statsandstories.net, and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.