Voter Rolls and Big Data | Stats + Stories Episode 69 / by Stats Stories


Matt Dempsey is the data editor at the Houston Chronicle. He worked on projects involving wildfires, state pensions, and the chemical industry. His passion for public records frequently leads to disclosure of data from all levels of government. His series Chemical Breakdown won the 2016 IRE Innovation award and the National Press Foundation's "Feddie" award. His work was a key part of the Chronicle's Pulitzer Prize finalist entry for Breaking News.

Full Transcript

(Background music plays)

Rosemary Pennington: Every few years, Americans are inundated with campaign flyers and ads urging them to support one candidate over another, or to vote no on an issue instead of yes. Those ads and mailers are often later used by researchers interested in studying political communication. But the act of registering to vote or voting itself produces its own kind of data. That's the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I’m Rosemary Pennington. Stats and Stories is a production of Miami University's departments of statistics and media, journalism and film as well as the American Statistical Association. Joining me in the studio are our regular panelists John Bailer, Chair of Miami's statistics department, and Richard Campbell of media, journalism and film. Our guest today is Matt Dempsey. Dempsey is the Data Editor at the Houston Chronicle as well as a veteran data journalist. A coalition of twenty newsrooms including the Chronicle recently worked together to purchase the Texas voter registration database as well as voting history going back to 2010. They've been using that data to write stories about voting in Texas in the run-up to the election. Thank you so much for being here today, Matt.

Matt Dempsey: Thanks for having me.

Pennington: My first question to you is simply, how did you get this many newsrooms to work together on this and what kind of information does the data that you're looking at contain?

Dempsey: Right. So how it happened is a little bit of a mystery to me, honestly, if I’m being 100% honest.

Pennington: (Chuckles)

Dempsey: So essentially what happened was, I wanted to get this data because the election was coming up and I knew we could use this data to tell pretty interesting stories about the electorate and get some idea of where the electorate might be going based on the number of registered and where people are registering and things like that. So I had tried to get that data but I was told by the Secretary of State’s office, that it's going to cost, just the registration data alone, somewhere between thirteen and sixteen hundred dollars and I knew that there's no way my editor would sign up for this.

Pennington: Yeah!

Dempsey: And it would be that any…most newsrooms, not any, most newsrooms would go, that’s a lot for one dataset. So I tried to haggle with the secretary of state's office but I didn't get very far and then, which is unusual, usually you can get somewhere but they were pretty steadfast on this is how much it costs.

Pennington: Yeah.

Dempsey: So I tell my editor that, and I can see that we're not going to approve, you're not going to approve, right, and this is a…

(Collective laughter)

Dempsey: No, right, you are correct, that is…you're not going to pay that, so I…and I didn't quite know what I was going to say next, I just want to like, at first, just confirm that it wasn’t going to go anywhere, so I’m like guess this is dead, and in the moment, I go, well, what if we got someone to share the cost with us? He was like, what do you mean? I’m like, I'm not really…I think, what if I just call some newsrooms like, around Houston and Texas and see if anybody wants to share in the cost of getting this data. Again, no offence to my editor, he's kind of like, umm…sure, go for it! He was like, yeah, go for it. I doubt that this will work but I can't see a problem in trying, right?

Pennington: Yeah! (Chuckles)

Dempsey: So what happened from there is I started making calls. I started with the Texas Tribune and asked them because they are a big, Texas-wide organization, and they have a lot of collaborations with newsrooms. So I figured it would be the easiest.

Pennington: Yeah.

Dempsey: …something like this and they would be interested in the whole Texas dataset, that would be useful and once I got them to say, essentially, yes, this sounds pretty…we’re interested, and I’m like, great. So from there, actually my wife and I were driving to San Antonio for a work trip for her, and I’m like, well, it's a two, three hour drive. Let's start making calls, and my wife is looking at me and…

Pennington: (Chuckles)

Dempsey: My wife…I'm driving, my wife would look up the phone number of the TV station news directors in Houston, that's where we started with soliciting calls and I’d call up the T.V. stations and I’d make a very short pitch. The pitch was, hey! I have a deal for you. Let's get you $1,300 or $1,600 worth of data for less than two hundred dollars and if we just get enough people involved, that’s a pretty good deal, right? A thousand-plus dollars of data for less than a few hundred bucks! I mean you don't get that kind of a deal on Saturday, right?

(Collective laughter)

Dempsey: And then people, basically, I'm surprised! I was assuming most people would just say no and most of the people, the immediate response was like well yeah, that sounds interesting, shoot me an email. I was like oh, I'm driving so can you shoot me an email and I'll respond. By the time we made it to San Antonio we had responses, or at least interest, from almost every Houston media outlet. That's the local N.P.R. station, all the major networks, our local T.V., our local affiliates and obviously the Chronicle. And then we were in San Antonio, so we parked at our hotel. I rented one of the little scooters and drove out to our partner paper in San Antonio and met with the Editor in Chief, Mark, there, who I know because I used to work at the Chronicle for a little bit, and said, hey, this is what we're going to do. You cool with this? And he said yeah, this sounds fantastic. So I got him signed up and then the next week I started calling all the San Antonio papers, San Antonio T.V. stations, and then Austin and Dallas and then I started working my way down from there to El Paso and Corpus Christi and then started hitting the Rio Grande Valley area like Laredo and things like that so…

John Bailer: I was going to say, so why do people care, what's the big deal about this database?

Dempsey: Right. So voter registration data has a lot of information. It has the date of birth, name, address, the date of the registration. It has a field called the Hispanic surname, which I don't use actually. It's supposed to help people determine Hispanic ethnicity, because Texas does not ask race or ethnicity on voter registration, so I don't use it because I think Hispanic surname is really statistically kind of…I don’t think it really works.

Pennington: Yeah.

Dempsey: But it's something we get. But either way, there's a lot of information there and it is data on millions of people in the state of Texas. It's a really, really large data set when you get the whole thing unzipped. All the data we got, it’s like more than 40 GB of data.

Pennington: Oh wow!

Dempsey: It's pretty enormous! Usually I tell public agencies that tell me that dataset is really big, I kind of scoff at it, like yeah, I’m fine, I can handle a big data set.

Pennington: (Chuckles)

Dempsey: This one actually, it was really, really big and does take a little bit of managing to process it correctly and things like that.

Richard Campbell: Did you know going into this, what you were looking for, I mean what kind of data were you looking for and what kind of stories were you thinking about, as you put this coalition together?

Dempsey: Right. So the pitch that I made was in terms of stories that I could tell to the coalition partners was, obviously election stories, registrants, who's registering, what is the demographics of those who are registering in terms of age, so are we getting more young people registering? That was a big thing that I wanted to look at and I think a lot of us want to look into that. How does this compare…how many…what percentage of currently registered voters registered in 2018, or like new registrants and things like that? One of the other things that I wanted to do that I pitched particularly to T.V. stations which do a lot of breaking news is that voter registration database is a really good people finder and a backgrounder, because it has all this personal information legitimately information about lots and lots of people. If you're working on a breaking news story you want to verify information or track someone down, you can get the mailing address and how old they are and it's really good to confirm who people are, and track things down, it is a good primary data set in that regard.

Pennington: Yeah.

Dempsey: That's one of the things. Even if you're not interested in the elections, at least get into this, right?

Pennington: Yeah.

Dempsey: I guess…I wouldn’t lie or anything like that, but I would definitely be doing my best...what's the name of the guy…Henry Hill the music man?

Dempsey: That was our goal essentially was, you could get a lot of really interesting election stories, but it has use outside of that and we haven't had this data in years and years and years, like the whole breadth of the data.

Pennington: Yeah.

Dempsey: So getting this is a good baseline for where we use it, I mean for people finding purposes too.

Campbell: So are more young people registering? Did you find that?

(Collective laughter)

Dempsey: We are working on that! So I’ve only actually had all the data loaded up into MySQL since, like, late last week because of other assignments that I've had to finish up before I could spend all my time on elections for the next, you know, however you know the…march till the election and afterwards. But the stories we've done so far, so we are a partner paper with ProPublica's Electionland project.

Pennington: Yeah.

Bailer: Right.

Campbell: OK great.

Dempsey: So we get a little bit and we have a very enthusiastic partner. I’m coordinating all of the cooperation for the Chronicle. One of the tips we got when Electionland went live for 2018 was about a situation that happened in Waller County. Prairie View A&M, which is a historically black college, had an issue with voter registration. They were being told that, you know, the instructions they had given to people who were registering people to vote, or to college students on campus, were incorrect, and that was how they had been registering people all year and actually since 2016.

Pennington: Oh wow!

Dempsey: So now they wanted all those college students to fill out a separate, like, confirmation of residence or change of residence form before they can vote on Election Day or in early voting. That upset very many people. There’s a lot of misinformation around this that we sussed out and one of the things that we did was that we looked into the data to find out how many people had been registered under this incorrect information, I think the training that they provided registrars. So we could show that it was like a thousand people, and there's about eight thousand kids at Prairie View A&M, so that's like a pretty decent chunk of the student population just registered to the wrong address and that would have been a lot of people on Election Day. I mean, if they got an extra form, a whole bunch of things go wrong outside of that in the end and it's going to get to where the State or the county are just going to allow them without the change of address.

Pennington: Right. Matt, since you raised Electionland, could you explain what that is for people who are unfamiliar?

Dempsey: Absolutely. So Electionland is a collaboration that is headquartered by ProPublica, newsrooms are participating, they get tips from…some ProPublica, they get tips from the system to receive information like Reuters, or they monitor Facebook and some other places for complaints about voter integrity issues, inability to vote, you know, voter suppression, anything making it difficult or causes problems with voting.

(Background music plays)

Dempsey: So we were big participants in 2016, and we got a bunch of really interesting stories out of that, they are making a big effort to be more involved before voting starts and during early voting this year, and we’ve already gotten 3 or 4 tips in the first week in Texas, which has been pretty great.

Pennington: You're listening to Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington with Miami University statistics department chair John Bailer and media, journalism and film's Richard Campbell. Today we're talking about voting data with Matt Dempsey. So Matt, you talked about this issue at Prairie View A&M and the sort of difficulties for the students possibly that you've pulled out of this data. How do you go into a dataset this large and find these kinds of stories?

Dempsey: So how I approach my job is, honestly, not unlike how a normal reporter approaches a source, right?

Pennington: Yeah.

Dempsey: So I want to ask the questions I want to ask of the data and I know the data well enough to be able to ask the question effectively. So for this one, I knew exactly what the address I needed to look up was. I have it and I just run a few SQL queries and it will show me the first lists…it is a really large data set, so first just let us pull out the Waller County data.

Pennington: Yeah.

Dempsey: I don't need everybody else and then I just queried the Waller County data from there and said OK show me 100, 700 University Drive, you know, those two addresses where there were some questions. So I could pull all of that out and I could limit it to just those who had an effective registration date of 2018. That happened within 2018, so I had to create a separate field to show the year because it had the whole date of registration, just the whole date, and I just wanted to limit it to just 2018 registrations. That's essentially what I did. For other things like that, I mean, we had a different tip from Electionland that happened. The number of registrants, the registration site spiked around the deadline. I don't have deadline data yet, I have up to the registration deadline.
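
The filtering steps described above (narrow to one county, match the addresses in question, derive a year from the full registration date, keep only 2018) could be sketched roughly as follows. This is a minimal Python illustration, not the queries Dempsey actually ran. The field names (`county`, `address`, `reg_date`), the date format, the address spellings, and the sample rows are all assumptions made up for the example.

```python
from datetime import datetime

# Hypothetical sample rows; the real file layout is not described in the episode.
records = [
    {"county": "WALLER", "address": "100 UNIVERSITY DR", "reg_date": "2018-03-14"},
    {"county": "WALLER", "address": "700 UNIVERSITY DR", "reg_date": "2017-09-30"},
    {"county": "WALLER", "address": "12 MAIN ST", "reg_date": "2018-05-02"},
    {"county": "HARRIS", "address": "100 UNIVERSITY DR", "reg_date": "2018-01-20"},
]

# Assumed spellings of the two addresses mentioned in the conversation.
TARGET_ADDRESSES = {"100 UNIVERSITY DR", "700 UNIVERSITY DR"}


def registration_year(row):
    """Derive a separate year value from the full registration date."""
    return datetime.strptime(row["reg_date"], "%Y-%m-%d").year


def campus_registrations(rows, county, addresses, year):
    """Keep rows for one county, at the given addresses, registered in one year."""
    return [
        row
        for row in rows
        if row["county"] == county
        and row["address"] in addresses
        and registration_year(row) == year
    ]


matches = campus_registrations(records, "WALLER", TARGET_ADDRESSES, 2018)
print(len(matches))
```

In a real database the same idea would more likely be a single query with a county filter, an address filter, and a year extracted from the date column. The list comprehension here only stands in for that.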

Pennington: Yeah.

Dempsey: So I'm working with the Secretary of State to try to fill in that gap right now, but I could show like, by month, in specific, like some of the bigger counties in Texas, how many registrants we're getting by month and then compare that to what the county was telling us without being able to look at the data to show that, like in September, Travis County had like 15,000 voter registrations and they're telling us on the last day of registration, they had 36,000 registrants so they had more than double the number of registrations they had in the entire month of September in one day. So that is something that…that's an easy thing to explain and look at in the data, just by counting things that way.
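
The "counting things" approach described here can be illustrated with a short Python sketch. The registration dates below are invented for the example. The 15,000 and 36,000 figures are the approximate numbers quoted in the conversation and are used only to show the comparison.

```python
from collections import Counter
from datetime import date

# Hypothetical registration dates for one county (not real data).
reg_dates = [
    date(2018, 8, 30),
    date(2018, 9, 1),
    date(2018, 9, 15),
    date(2018, 9, 28),
    date(2018, 10, 9),
    date(2018, 10, 9),
]


def registrations_by_month(dates):
    """Count registrations per (year, month)."""
    return Counter((d.year, d.month) for d in dates)


monthly = registrations_by_month(reg_dates)
for (year, month), count in sorted(monthly.items()):
    print(f"{year}-{month:02d}: {count}")

# Comparing the county's reported single-day figure with the September total,
# using the approximate numbers quoted in the conversation.
september_total = 15_000
last_day_reported = 36_000
ratio = last_day_reported / september_total
print(f"Last day was about {ratio:.1f} times the September total")
```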

Bailer: So it sounds like you had to have some very strong database skills to be playing with this.

Dempsey: Well honestly, querying the data is not so bad, that's the easy part and that's something I tell my students when I adjunct or teach data journalism. The analysis part is usually the fun part, the easy part. The hard part is getting everything ready to go, to be analyzed.

Bailer: Yeah.

Pennington: Oh, that's true.

Bailer: Preparing the data and…

Dempsey: Right! The way the State provides this data it is helpful and kind of annoying. I get a zip file that has 254 zipped text files, one for each county in the state of Texas.

Pennington: Oh God!

Dempsey: And you add that together. So some newsrooms, I think, are going to probably just upload their counties and their metros first, and just work with that. You know, we have an Austin bureau, we have a sister paper in San Antonio, so I just loaded up the whole darn thing into MySQL on my home computer and my work computer. My home computer is a little bit more beastly, so the processes run a little bit faster, but it still takes a while because honestly, I'm not usually dealing with 254 text files and I'm certainly not dealing with that many zipped files, so I had to do a little research to make sure I was importing it the most efficient way possible.
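
One way to read a zip file of zipped text files without unpacking everything to disk first is sketched below in Python. This is an assumption-laden illustration rather than the import process Dempsey used (he loaded the data into MySQL). The file names, the comma delimiter, the UTF-8 encoding, and the single `name` column are all hypothetical, and the tiny archive is built in memory only so the example runs on its own.

```python
import csv
import io
import zipfile


def iter_rows(outer_zip_bytes, delimiter=","):
    """Yield (inner_archive_name, row_dict) for every row in every nested text file."""
    with zipfile.ZipFile(io.BytesIO(outer_zip_bytes)) as outer:
        for inner_name in outer.namelist():
            if not inner_name.lower().endswith(".zip"):
                continue
            # Open each inner archive from memory instead of extracting it.
            with zipfile.ZipFile(io.BytesIO(outer.read(inner_name))) as inner:
                for text_name in inner.namelist():
                    with inner.open(text_name) as raw:
                        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                        for row in csv.DictReader(text, delimiter=delimiter):
                            yield inner_name, row


def build_sample_archive():
    """Create a tiny in-memory stand-in for the state's file (two fake counties)."""
    outer_buf = io.BytesIO()
    with zipfile.ZipFile(outer_buf, "w") as outer:
        for county, names in {"county_a": ["ANN", "BOB"], "county_b": ["CAL"]}.items():
            inner_buf = io.BytesIO()
            with zipfile.ZipFile(inner_buf, "w") as inner:
                inner.writestr(f"{county}.txt", "name\n" + "\n".join(names) + "\n")
            outer.writestr(f"{county}.zip", inner_buf.getvalue())
    return outer_buf.getvalue()


rows = list(iter_rows(build_sample_archive()))
print(len(rows))
```

Because `iter_rows` is a generator, rows can be streamed into a database in batches rather than held in memory, which matters for a dataset of the size described.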

Campbell: You mentioned teaching, so Rosemary and I teach journalism and my question is how do you get students interested in data? A lot of journalism students…

Pennington: (Chuckles) are scared!

Campbell: This is an obstacle for them. So what do you do when you're in a classroom and you've got a majority of the students, I suspect, that find this daunting?

Dempsey: So I've been teaching data journalism, I haven't taught in a little bit since I've been in Houston. I’ve done like an online class for Arizona State once. But I really haven't taught consistently since I was back in Phoenix and working as an adjunct at ASU.

Campbell: OK.

Dempsey: I’ve been teaching mostly grad students and honors students, but I would say this, that I think a lot of people assume, still in 2018, that data means math and math means…no!! Math is where monsters are! This is a question I can ask any group of journalists, no matter what their age, if they're professionals and they're in their sixties or if they are brand new, if they are journalism students and haven’t worked a day at a newspaper or TV station – how many of you hate math, and I guarantee you at least ninety percent of the room - any room - will raise their hands. Most rooms, I should say. Data journalists a little bit less though, but most people will raise their hand that this is scary, math is bad, that is why I got into journalism, I like words. So what I like to do is say that one, this isn't a math class. We're not doing a ton of math. The math is usually pretty simple. Most data journalism math is not complicated and I myself am not great at math. I tell people all the time that people think that I'm a wizard at math, and I’m like, I'm not good at math. I just know how to tell my computer to do my math for me.

Pennington: (Chuckles)

Dempsey: So that's what I try to explain to my students is that data is no different. Data is just like people. It's just like us. You can ask your data questions, your data can mislead you, just like sort of people can mislead you. It can be wrong just like people can be wrong. You have to do all the same things you do to prepare for an interview. You have to do the same things when you're preparing to work with data sets. You have to know your research ahead of time. Just as it doesn't do you well to go into an interview unprepared, it doesn't do you well to go and find a dataset and just expect it to produce answers. That’s not how any of this works, right? So if you come into data journalism with the same mindset you come in to your “normal journalism”, you know you'd do well. And you will be…it's a way to get questions and answers that you might not have been able to get out of a human source. So the short version of how I got into this is I'm distrustful of people. When I was a journalism student I had a real hard time. Like, I'd interview people for non-consequential stories and I'd be like, well, how do I know they're not lying to me? Very distrustful and paranoid and kind of weird, and when I was working as a researcher at IRE at Mizzou, Investigative Reporters and Editors, that's when I discovered what data journalism was and I go, Oh! I can…I see this, this means I don't have to rely on a person to tell me what the information is. I can use documents and data as a resource and then I can ask them to respond to information I already have instead of asking them to tell me the things I don’t know.

Campbell: Very good.

Bailer: So I want to flip the question. So how do you get statisticians interested in data journalism?

(Collective laughter)

Bailer: Because there are many statisticians that are interested in trying to gain insight from data that's actually the world in which we live and many of our students are interested in that and some of us you know still can use words.

(Collective laughter)

Bailer: So how might we attract some from the stat side into the data journalism world?

Dempsey: Right. So my mentor Steve Doig who until fairly recently was a Knight Chair of Journalism at Arizona State, he used to tell me that one of the great things about being a journalist versus being an academic or being a statistician is that our barrier to publication is so much lower.

(Collective laughter)

Dempsey: So for example, there’s lots of times where I’ll be working on something and I'll be talking to an academic and they're like, well I wouldn’t…you know I don't know if that is exactly how I would, you know, I don't know if that is statistically significant and I’m like, OK, it doesn't have to be. I just need to show that there's a relationship there and we don't know the answer and we can report around it, use anecdotes to kind of fill in the gap where the data ends and like real life begins. I find that, in that regard, data journalism has a little bit more utility and flexibility than a statistician's work does. So in statistics, you are beholden to your academic standards in a good way, right? That's a good thing but it also means that when you…where a paper, a stats paper might stop at, we don't know the answer. The data seems to imply this but there's no…there are confounding variables you have, you know, it's unclear because we would have to replicate this, we'd have to do some other research, blah blah blah. As a journalist I don't have to wait for any of that. I can just talk to other people and tell them, well this implies this, this could be this and I can use my reporting skills to kind of fill in the gaps and just say this is what it looks like is happening. It’d be very surprising if it isn't. I'm not beholden to those other standards. You can publish a lot faster because of this pace. It is so much faster than academia and that's really nice. Also, impact. That's probably my favorite part about journalism. You have a genuine impact on the world on a day to day basis. People read your work! Even though maybe less than I would like, but they still do. I know that for example that Waller County story, within a week of us publishing that first story, the State, the county, everybody had all decided that here's a better way to handle it.

Pennington: Oh wow!

Dempsey: That impact, if a stats person does that story, it is not going to run before the election, one, I can’t talk to him before the election and two, because of that it won't have an impact and it stays in the ivory tower of academia. Not as an insult to academia at all, but this is more practical, more real.

Campbell: So working at that speed where mistakes…you could make mistakes, right? So what are the pitfalls? What do you see in other journalists that are problems and what are pitfalls for you? What are you careful about?

Dempsey: Well, I’ll be honest. I'll give you a really good example just for the voter registration data. Not this particular data set, but an earlier version, a smaller subset. I was working too fast, this has happened fairly recently. I was working too fast, I was doing too many different things at once because I'm usually doing lots of different things at once, and I was working with a reporter in our Austin bureau about new registrants versus, you know, younger registrants. She had some 2014 slices of data versus 2018 slices of data and I was looking at the ages so we could get, you know, and they just give it as D.O.B., not the age, so we have to calculate the age. You want to guess at the mistake I made when I wasn't paying attention?

Bailer: When you read it…

Dempsey: The data I got was 2014, I used 2018. So, I was calculating the age of 2014 registrants as if it was 2018, not 2014.
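
The mistake described here comes down to which reference date is used when turning a date of birth into an age. A minimal Python sketch follows. The date of birth and the two reference dates are made up for illustration, and the function name `age_on` is hypothetical.

```python
from datetime import date


def age_on(dob, as_of):
    """Age in whole years on the reference date `as_of`."""
    years = as_of.year - dob.year
    # Subtract one if the birthday has not yet occurred in the reference year.
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years


# Hypothetical registrant who appears in a 2014 slice of the data.
dob = date(1996, 3, 1)

correct_age = age_on(dob, date(2014, 11, 1))  # reference date matches the 2014 slice
wrong_age = age_on(dob, date(2018, 11, 1))    # reference date taken from the wrong year

print(correct_age, wrong_age)
```

Using a 2018 reference date on 2014 data makes every registrant look four years older, which is why no 18-year-olds would show up in the result.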

(Collective “Oh”)

Dempsey: So Alejandra, she is data fluent, and she knows this stuff, she knows how to use data and things like that. And she caught the mistake. She’s like, wait a second, how are there no 18 year old registrants in 2014? And I’m like what? What are you talking about? And she’s like yeah, there’s no 18 year olds. And I'm going, how could I not catch that? I wasn’t paying attention to that. I was moving too fast, I wasn’t paying attention. I didn't slow down and make sure I was doing all my integrity checks, right? So we caught this way before publication and that's usually where most mistakes get caught, is, you're doing your checks, you have editors who are supposed to ask about your numbers. That being said, data journalists frequently operate, I would say, without a net. Most of our editors, the people who are editing our work, don’t have data experience or aren’t data journalists themselves almost all the time. So imagine if you are a statistician or you're an academic, and the person you're reporting to doesn't know what you're doing, doesn't have an understanding necessarily of the details of the work, so they have no idea what questions to ask necessarily, that would help catch those things. So you have to be your own best back check. So one of the things you can do is like build into your workflow a series of integrity checks to make sure, have I done…you know, what are the common mistakes you make around age, what are the common mistakes around certain datasets. That's usually the way to go about it and then double check your work. If it sounds too good to be true it probably is, check it three or four times. Any time we do a publication, I start all the analysis all the way over again from scratch, and if I can do it on a different piece of software, I'll do it. Like if I did the work in Access or MySQL, I'll do it in Excel, if I can, if the file’s small enough. If I did it in Excel, I might do it in R. You know, like, do the same calculations in a different format and if I get the same result, now we are good. Essentially, think about it as rapid replication.
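
The "rapid replication" idea can be illustrated with a small Python sketch: compute the same summary by two independent routes and refuse to trust it unless they agree. Dempsey describes redoing the work in a different piece of software. The version below only approximates that by using two different methods in one language, and the list of registration years is invented for the example.

```python
from collections import Counter
from itertools import groupby

# Hypothetical registration years pulled from a dataset.
reg_years = [2014, 2018, 2018, 2016, 2018, 2014]


def counts_with_counter(years):
    """First pass: tally registrations per year with a Counter."""
    return dict(Counter(years))


def counts_with_groupby(years):
    """Second pass: sort, group, and count each group independently."""
    return {year: sum(1 for _ in group) for year, group in groupby(sorted(years))}


first = counts_with_counter(reg_years)
second = counts_with_groupby(reg_years)

# Integrity check: the two passes must agree before the number is used.
results_agree = first == second
print(results_agree)
```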

Pennington: Um hmm.

Bailer: Yeah I really like the idea that you're building in your analysis integrity checks and I think that's something we would certainly encourage in all of our analysis classes as well. I mean, you're building in some assurance that you have a greater likelihood of reproducibility of what you're doing because you won't have made the silly mistakes earlier that will come back and haunt you. Do you ever make the data from analyses you've done open to the public so that others could replicate your work or redo an analysis?

Dempsey: Right. So we have been moving to that more often. I think journalists, journalism in general has been moving towards that, more than we would have done 13 years ago when I got started. We here at the Chronicle have been using Data.World a lot more and GitHub. So we've been posting a lot more of our data there. We just did…it would have been three or four months ago. We did a project related to the National Flood Insurance Program. Now my former colleague Mark Colette who has recently left journalism to be a stay at home dad, he's my best friend and he's a fantastic reporter who has some data skills, he doesn't do data primarily, but he got all of this really fantastic data from FEMA, about the National Flood Insurance Program, about severe repetitive loss, and it was data at, like, individual addresses for severe repetitive loss, and you could show which properties had applied to FEMA or made a claim under the NFIP. We could show which areas in the nationwide data and we could show specifically which cities, counties, states had the most repetitive loss claims, how much those claims were worth, all sorts of really great fantastic stuff. He did a whole lot of data cleaning to get that data into shape to do the analysis and we shared the cleaned version of that dataset on GitHub and Data.World so that other papers, other media outlets could do the same kind of stories in their area. We already know that there are papers in West Virginia that used that data.

Pennington: Oh wow!

Dempsey: And I think a Louisiana paper did something with it, somebody in Illinois I think did something with it.

Bailer: Oh that’s cool.

Dempsey: I think data is going to be cool for a while. If I was in, for example if I was in North Carolina I would be using this data right now. Which places have severe repetitive loss, which areas you know, specific addresses, which towns have severe repetitive loss, in the areas that got hit by…what was the name of the hurricane already…Florence, and Michael was the one that just hit Florida. Florida should be looking at the data to see if it is usable and you don't have to go through the process. It took us months to get this data.

Pennington: Yeah.

Dempsey: Post Hurricane Harvey, so it's incredibly useful. Though we've been doing that, more of that, we're trying really hard. We have a lot of data in-house that we haven't shared that where it takes a little bit of time to get it clean, make sure it's all cleaned up and updated and that we're not like spreading mistakes, like make sure that the data is ready to be shared.

Pennington: Yeah.

Dempsey: And also figuring out what it is and how to work with it, you know, all of that is essentially institutional knowledge that we should already have, that we don't necessarily have.

Pennington: Yeah.

Dempsey: So we've been working on that as well, so it's not just good for the public, it's for us internally, it makes sure that we know what we have and it shows the public to a degree of what we can actually do.

Pennington: Well Matt thank you so much for being here with us today. This is a really interesting conversation.

Dempsey: Well thank you! I really appreciate it. I'm a little bit of a talkaholic.

Pennington: No no no! It’s good, it’s all good. That's all the time we have for this episode of Stats and Stories. Stats and Stories is a partnership between Miami University's department of statistics and media journalism and film as well as the American Statistical Association. You can follow us on Twitter, Apple podcast or other places you find podcasts. If you’d like to share your thoughts on the program, send your e-mail to or check us out at and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.

(Background music plays)