Understanding Data in the Digital Age | Stats and Stories Episode 70 / by Stats Stories


Mark Hansen is a professor of journalism at Columbia University, where he also serves as the Director of the David and Helen Gurley Brown Institute for Media Innovation. Founded in 2012, the Brown Institute is a bi-coastal collaboration between Columbia Journalism School and the School of Engineering at Stanford University. Its mission is to explore the interplay between technology and story.

Prior to joining Columbia, Hansen was a Professor in the Department of Statistics at UCLA. In addition to his technical work, Hansen also has an active art practice involving the presentation of data for the public. His work with the Office for Creative Research has been exhibited at the Museum of Modern Art in New York, the Whitney Museum, the Centro de Arte Reina Sofia, the London Science Museum, the Cartier Foundation in Paris, and the lobbies of the New York Times building and the Public Theater (permanent displays) in Manhattan. Hansen holds a BS in Applied Math from the University of California, Davis, and a PhD and MA in Statistics from the University of California, Berkeley.

Full Transcript

(Background music plays)

Rosemary Pennington: Almost since its birth, people have been trying to figure out how information flows in digital media. Some have described the flow as viral in nature, while others have used the concept of spreadability to make sense of the way people share material online. Among those trying to make sense of Web 2.0 are journalists, with reporters grappling with such issues as verification and sourcing online as well as trying to figure out how to capitalize on new media to reach a broader audience. The intersection of new technology, statistics and journalism is the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I am Rosemary Pennington. Stats and Stories is a production of Miami University's departments of statistics and media, journalism and film, as well as the American Statistical Association. Joining me in the studio are regular panelists John Bailer, chair of Miami's statistics department, and Richard Campbell, former chair of media, journalism and film. Our guest today is Mark Hansen, a professor of journalism at Columbia University and the director of the David and Helen Gurley Brown Institute for Media Innovation. Mark, thanks so much for being here.

Mark Hansen: Thank you.

Pennington: Before we start talking about the work you've been doing at Columbia, could you just explain briefly how someone who studied math and stats winds up in a journalism program teaching?

(Collective laughter)

Hansen: Oh! It is a long story. The short version is, I went to Berkeley and I was fortunate enough to get a summer internship at Bell Laboratories that secured me a place at the Labs after I graduated. Somewhere along the line Bell Labs shifted from AT&T to Lucent as a parent company. Lucent's stock soared, and someone got the bright idea that with all this newfound freedom that comes from a soaring stock price we could relive some of the glory days of Bell Labs, and so there was a revival of something called E.A.T., Experiments in Art and Technology, which was originally launched in the sixties and paired artists in New York with researchers at Bell Labs. They revived it in the late nineties, and I was one of the researchers who participated. I met sound artist Ben Rubin, and we went on to collaborate and produce a number of works, like lobby art for the New York Times building and lobby art for the Public Theater, and we made a piece for the 9/11 Memorial Museum. We ended up building lots of art installations that draw on live feeds of data in some way. And sort of one step led to another. I left Bell Labs, got a job at UCLA, and took my sabbatical at the New York Times R&D Lab, with people I met because we were installing a piece in the lobby. One thing led to another and there was a job here at Columbia, leading this institute that pairs technology and story in some way, and the media art that I'd been doing was close enough to what they were looking for, and they gave me a shot.

Bailer: What a cool history! That is awesome! So, what is done at the New York Times R&D Lab?

Hansen: Well, there have been two incarnations of it now, at least two that I know about. At the time its mandate was to look five to ten years out at how we would be accessing content, the New York Times content, and it was kind of half research lab, half show-and-tell space. So we would show off some things we would get, like one of Microsoft's touch tables, or there was a range of mobile devices, at various times trying to see what Times content could look like, and basically the group was built of designers and technologists who were thinking really hard about where media publications were going. That effort ended a couple of years ago and they now have a new R&D Lab that is situated a little more closely to the newsroom. We've been working with them on various things, everything from voice interfaces to one of our projects; they've been experimenting with one of our labs. So they are a really collaborative group, but the difference between the two labs is that the new one is really close to the newsroom, or it seems to be, and the other one was a little bit farther away. Both were very intentional in their design, both doing sort of different things, and both very interesting to watch.

Campbell: In one of your talks I watched, Mark, and this is a program called Stats and Stories, you talked about the narrative power of data. Can you talk a little bit about what that is and how you think about that?

Hansen: Yeah. Actually, usually in my talks, since there are chunks of them that get repeated season to season, the narrative power of data is actually a quote from Pulitzer. Pulitzer, of the Pulitzer Prizes, the founder of our school, wrote in 1904 about what he wanted his College of Journalism to teach, and he included statistics as something that should be taught, amongst law and history and other subjects that we still teach. And when Pulitzer wrote about teaching statistics, he said, and I am not quoting because I don't have it in front of me, but basically the idea was that in statistics you can find the truth, and that's what he was obviously going to do, but then also humor, romance, fascinating revelations of all kinds, he said. So he said data is an exceptionally expressive medium, and mind you, 1904. In 1904, I believe Gosset had just gotten to the brewery, Fisher was like twelve or something. Statistics wasn't what we think of statistics as being, and so, to be fair, he's probably describing the kind of verbal slippage that has happened now. People talk about data and they don't really mean statistics. Or they do, but they use the word data because they're kind of afraid of the word statistics or something like that. And so I think the way he spoke about statistics, we often see people talking about data today. But the idea is that as more and more of our lives are mediated by computers, mediated by computation, data, code, algorithms, there is a need on the part of journalists, as the explainers of last resort, to be able to interrogate those systems and to ask basic questions. Because the systems pool together and form systems of power, and journalists are here to hold power to account. So in many ways my job here is thinking about the narrative potentials: what's going on with these systems, how do we report on them, how do we hold power to account when it's entirely virtual.

Pennington: Mark, as someone who went through a journalism program in undergrad and in a graduate program and had to take statistics classes, and being the stereotypical journalism student who was terrified of statistics and math, how do you get journalism students to let go of the fear they may have of dealing with data and statistics, to dig into the stories that sit in it?

Hansen: That's a good question. So first of all, the fear no offense to John, because I know he is a wonderful teacher,

(Collective laughter)

Hansen: Fear is probably real because oftentimes we don't teach statistics very well, and that's not really a function of well-meaning instructors. It's a function of the standard narrative that's taught in the introductory stat class. Maybe you look at an introductory stat class, and it's kind of a race to the t-test with a little bit of regression sprinkled in, and it takes the form basically of the proof that you would have followed to establish the t-test, right? You wander about in probability for a while. Maybe you have some data graphics, like "data are everywhere" is chapter one, and then there's an extended grind into probability, and then eventually the data come back, but they are so abstract, it's hypothesis testing and so on, and you sort of lose it very quickly. And it's understandable that people are kind of turned off by it, or could be, given the standard narratives that they have been exposed to. I know it is heresy and I apologize.

Campbell: Should we let John defend statistics?

(Voices overlap)

Bailer: Wait a minute, wait a minute. Hold on. I don't think it's so radical. I mean I think that in fact we've…you would see a reboot in the way an introductory class would appear now versus you know, going…taking a time machine back twenty years and looking at it.

Hansen: Yeah, absolutely. I think that there's a real attempt to change things, but I think there's a difference when you teach. There's a real difference between data analysis and data journalism. Data analysis often, and this maybe goes back to your narrative potential question, and I apologize if I'm not answering these questions right. I spent the morning in the dentist's chair. Half of my face is numb, clear up to the tip of my nose, which is a disturbing feeling.

Bailer: Well we hope this is less painful than a dentist visit.

Hansen: Well, I will tell you at the end!

(Collective laughter)

Hansen: When I get out of the chair. So, the difference as I see it between data analysis and data journalism is that oftentimes data analysis generates stories from a dataset, about a dataset, as a dataset, whereas data journalism always links that data back to its origin, always links that data back to some fundamental question. There's also a real difference in terms of the agency that the person performing the analysis feels they have. As soon as I arrived at the J-school, I was told that journalism is not really an academic field, it's more of a habit of mind. It's a way of asking questions about the world, looking around and saying why is it like this, why is this like that. And a lot of the teaching that we've been doing with statistics, around computation, code, algorithms and so on, is providing students with the capacity to ask those questions of virtual things, of algorithms, of predictive policing scoring or parole scoring or whatever it might be, right? Let's start to ask those questions not of the physical world that I can see around me but of the virtual world that's controlling things. And that curiosity, that agency to ask questions, really does fundamentally change the way you teach. You don't teach looking at data; rather, you teach thinking about data, and it establishes a different kind of bar, I think. And one last thing I would add: last year about half of the class, from what I understand, made a request to our dean of academic affairs, and I should give you the scale, we have two hundred or so students in this program, for some kind of statistics training as part of the program. So I offered something called statistics breakfast, from seven to nine in the morning on a Monday. There were bagels, but still, it was seven to nine in the morning on a Monday, and we had like forty-five students showing up every week for I think five or six weeks to learn on their own. No credit, no nothing, just showing up because they wanted to learn something about statistics and how that functions in our knowledge about what is created with data. I think that says something about a sea change in the understanding that our students have about how…

Bailer: Yeah. I agree.

(Background music plays)

Pennington: You're listening to Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington, with Miami University statistics department chair John Bailer and media, journalism and film's Richard Campbell. Today we're talking journalism and tech with Columbia University's Mark Hansen. Mark, I want to ask you a question about something you worked on at the New York Times called Cascade. You were involved in the creation of that, yes?

Hansen: Yeah.

Pennington: So I do social media research and I went to grad school at IU where they've been doing some interesting stuff tracking Twitter data and so my question for you is I was looking at Cascade and the way you were tracking the way people share things. My question always goes back to, you can track how people share information online but how can you understand whether it's meaningful sharing, is kind of where I always go to say like, if you're a news organization and you're trying to track how your story spread, how can you figure out if there's meaning behind it or if it's just sort of like a reaction and not real engagement?

Hansen: Well, that seems to be the big question. Because it's not just the like or the share, it's also the page view or the click-through or any of these other metrics, and so there's a real attempt to take these somewhat standard metrics and figure out, can you learn a little bit more? Can you engineer the system so that you can see how far down people read, do they go to the next page? You see some of this in the access logs, but how far down the page are they reading? I am a member of a project at Cornell Tech which was looking at those kinds of statistics. So there's a kind of open research question about how best to characterize what engagement actually means and what you're getting from the statistics. And one of the things I'm hoping, by injecting a little more questioning spirit into the statistical side of things in our journalists, is that they ask: well, our newsroom gives us these statistics. What else could we know? What else can we know about how we're interacting with our audience? And I should say, when I first started at the New York Times R&D Lab, there was a real resistance on the part of the newsroom to showing any kind of usage statistics to the reporters. I think they felt that it might lead to pandering or something like that. But with Twitter, and reporters becoming more prominent in their own right, it's impossible for them not to know that a particular story has hit. So I think, I mean, this is sort of dodging your question, but I think,

Pennington: That’s okay, you can dodge it.

Hansen: I think there's real, a lot of work around what sort of meaningful statistics can be extracted and you know, what does that say about engagement. Now I will tell you that with Cascade we saw a lot of really interesting sort of communities that built up. So one of my collaborators Jeff who was responsible for the really beautiful graphics, I mean really beautiful.

Pennington: Yeah I know, those are amazing!

Hansen: Had noticed a network of rabbis that consistently share religion stories amongst themselves. And so there are these interesting communities that build up that I think become important as well. So it might not be just direct statistics, like how long or was it liked, but can you think about the communities where things sort of work their way through? And thinking about that statistically can be difficult, because it's lots of small-scale things that accumulate into the kind of sea of activity that you perceive. Anyway, a tool like Cascade lets you drill down and sort of have those quiet moments and see those things. I'm not quite sure what that would amount to for a marketing person. That's certainly not my side in the church and state divide; I was always on the church side.

Pennington: Yeah.

Bailer: You know, sort of a natural follow-up is to consider, in social media, whether or not something is a real post. I know some of the cool work that you speak about is the detection of bots and other artificial entities living in the social media space, and maybe looking at where likes are starting from, who's actually posting these kinds of comments and these likes. So can you talk a little about the story in which you and your class investigated these artificial entities on social media?

Campbell: And let's congratulate you too for your front page byline in the New York Times.

Hansen: That's A1 on a Sunday!

(Collective laughter)

Bailer: That’s hitting it out of the park!

Hansen: In fact there was a funny story about that, because my collaborators, who were so amazing at the time, were a little bit dodgy about when it was coming out. It was coming out this weekend. The thing is, you're online first, and I was actually giving a talk at Cornell Tech to a group of high school teachers and someone said, are you the same Mark Hansen that has this, like, thing in the Times?!

(Voices overlap)

Hansen: Sorry but I was like holy! And then you know I ran and got the Saturday paper and of course it wasn't there and then I went to the bodega Sunday and picked up from the stack and I sort of went through every single section and I didn't see it and I was like damn! And I throw it back in the pile, and it catches my eye in the corner and I go no! We are on the front!

(Collective laughter)

Bailer: How cool!

Campbell: That’s cool!

Hansen: But I think, just to back up half a second, the question that you ask about authenticity is fundamentally what it comes down to: how can you tell in social media whether this is an authentic news site, whether this is an authentic, like a real person, who's posting, who's followed you. With the news stories that you're getting, we talk about fake news, there are questions of authenticity and factualness. I think these are fundamental questions that we're grappling with as a society, not just journalism but as a society, and lots and lots of tech work is being proposed to automatically identify fake news and this and that. I don't know what the solution is, but what we're trying to do is chip off little pieces of it, so can we understand little pieces of what's there and maybe help the public make better judgments, and not necessarily pass everything off to some AI system in the sky. So what we did in my class, it's a computational journalism class, and we assume the students don't know anything about computation, so we teach them some Python. And two years ago, because of all the news around the 2016 election, we themed it on what we were calling computational propaganda and tried to get the journalists to ask questions about what does trending mean, how does something trend, what does Twitter say about when something trends or not. Here are two million tweets from the inauguration, can you tell what was trending at any given moment? Like, you try it. And one of the things we were interested in was influence and how voice spreads. So we got an account on Twitter, it didn't have any followers, all right, what are we going to do? So we Googled how do we have influence on Twitter. The first thing that came up was this company called Devumi, which said it could get you followers really fast, and we thought, well, that's kind of cool.
So we bought some followers, twenty-five hundred or so, and it looked kind of amazing, right? They looked like real people. All the pictures matched up, so if there was, you know, a teenage kid standing with skis in his bio pic, then maybe the pic behind him, the long rectangular one, is a longer one of him going down the slopes, and there's a link to a Facebook account with him and all of his friends skiing, and it just all looked perfect, right? So I go, well, these are real people, because they pass my authenticity test, and we go, hell, what's the economics here, because we paid pennies for this person to follow us. Why would they do that? And we haven't even tweeted from this account, so why would someone…I didn't get it. And then we noticed, it took us an embarrassingly long time, but we noticed that aspects of the person's account, the name, the login name, had changed. Lowercase I's become lowercase L's, zeroes are replaced with O's or vice versa, a single-underscore name became a double underscore. So basically what someone had done was create a bunch of bots from real accounts: they cloned all the data, grabbed all the pictures, grabbed all the text, created a new account with a slightly altered name, and now use that to programmatically follow or do whatever it is people bought the bots to do. So in this case twenty-five hundred then followed us. So the next question is, well, who else do these bots follow? So this was an opportunity to teach network analysis. So we scaled up and down the network and found tens of thousands of these accounts, and found lots of people who shared many of our bots. Hilary Rosen, the CNN commentator, had twenty-four hundred of our twenty-five hundred bots among her followers, which seemed suspicious. And so we got this long list, we found the bots and the tells, and then at that point we said, all right, we've got to make something out of this.
So I had my students formulate a pitch and brought in Gabe Dance, who is the deputy investigations editor at The Times, and gave him the whole story. Our students had found an example of some stolen identities where those identities were minors, and now those minors, or their doppelgangers, are retweeting really ridiculously hardcore porn. And for what is often an abstract idea, you know, hordes of Russian bots at the borders, like, what does that look like, what does that mean? Minors tweeting porn, that was the way in to talk about the influences on media and what it might be, and the impact of that story has been incredible. I won't bore you with steps along the way, but the latest was a couple of months ago: Twitter killed off tens of millions of accounts and cited our story as the reason. And so again, that curiosity, and taking data and weaving the stats into a story, and the graphics and the whole thing coming together, that's the best that I can think of in terms of data journalism. That's the thing that I wish we could teach more of in statistics: some of that basic curiosity and agency to follow a story all the way through.
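
The name "tells" and the shared-follower check Hansen describes can be sketched in Python, the language his class teaches. This is a minimal illustration under stated assumptions, not the method the class or The Times actually used: the function names, the sample handles, and the choice to treat i, I and l (and 0 and O) as interchangeable are all hypothetical. Collapsing look-alike characters this way will also flag innocent near-duplicates, so it can only nominate candidates for a human to inspect.

```python
import re
from collections import defaultdict

# Assumed look-alike substitutions, applied after lowercasing:
# i/I/l collapse to "l", and 0/o/O collapse to "o".
_LOOKALIKES = str.maketrans({"i": "l", "0": "o"})


def canonical_handle(handle):
    """Collapse a handle to a comparison key that ignores look-alike edits."""
    lowered = handle.lower()
    # Treat any run of underscores as a single underscore.
    single_underscores = re.sub(r"_+", "_", lowered)
    return single_underscores.translate(_LOOKALIKES)


def group_lookalikes(handles):
    """Return sorted groups of distinct handles that share a canonical key."""
    groups = defaultdict(set)
    for handle in handles:
        groups[canonical_handle(handle)].add(handle)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def shared_follower_counts(followers_by_account, suspect_handles):
    """Count how many suspect handles appear among each account's followers."""
    suspects = set(suspect_handles)
    return {
        account: len(suspects.intersection(followers))
        for account, followers in followers_by_account.items()
    }


# Made-up handles and follower lists, for illustration only.
handles = ["ski_kid01", "ski__kid01", "ski_kidO1", "other_user"]
print(group_lookalikes(handles))

followers = {
    "commentator": ["ski__kid01", "ski_kidO1", "real_fan"],
    "local_paper": ["real_fan"],
}
print(shared_follower_counts(followers, ["ski__kid01", "ski_kidO1"]))
```

A group with more than one handle is only a lead: deciding which account is the original and which is the clone still takes the kind of reporting described above.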

Campbell: OK. So Twitter estimates, I saw this in a business law blog somewhere, that fifteen percent of its accounts are automated, and I think Facebook now says that about sixty million of its users are fake. Does that sound about right? Do we have any sense of how accurate those estimates from those companies are?

Hansen: So let me put that back on you. How would you figure that out? Would we somehow form a sample and have a peek? I think some of these questions, let me step back a second. When our article came out, the day after, almost all of those accounts, the tens of thousands that we saw that were doppelgangers, that had stolen someone's identity, all of them disappeared in a heartbeat. Which means that Twitter, I mean, I don't know, but it just seems unlikely to me that Twitter didn't have a sense of what was going on there. So they were suspended or disappeared. And even on Hacker News, the kind of response we got was like, you know, Twitter, Facebook, these places have the best engineers in the world, and it took a group of journalists to do this, which I found a little insulting, because the whole point of this is to teach journalists to do this. I mean, they know full well. I feel like they have the information, they see the whole thing. They see the patterns of people joining, the timing of it, the ways in which things are used. So they're in a position to get rid of at least the most egregious bots, and I think they spend a lot of time doing that. According to the company, they spend a lot of time doing that. If we take that at face value, and I don't know that we should, but if we do, Twitter would be a whole lot messier if they weren't doing what they are doing. But I'll tell you this, this is where I was going with my first line of questioning.
It was like, just take a sample and let's see which accounts are bots and which aren't, right? Sometimes it's really hard to tell, because in these sort of authenticity questions where you're deciding if something's fake or not, you're inevitably led, and this is my own opinion, but you're inevitably led to a kind of arms race. So someone builds a detector that says this is what fake looks like, and the people who are making that fake thing go, oh! Well, part of the problem with our bots is that the detector software seems to be killing off our accounts that tweet constantly. So let's make the bots go to sleep and let's make them tweet a little less frequently on the weekends. And then a whole class of bots gets architected, until the owner of the detector says wait a minute, wait a minute, that rule isn't working anymore, let's retrain, and then…and so you are kind of going back and forth. And even these attempts at establishing credibility indices for news sources and news stories to try to combat fake news have to be a little careful that we don't get into an arms race and end up with really good fake news, like really well produced. So sometimes it can be a hard question, and I'm going to shut up about this, but the question in terms of influence, for me, is probably less about really purely automated systems, it's probably less about bots and more broadly about coordinated activity, and through coordination one can gain an outsized voice on these platforms.
So sometimes it's invoking a horde of bots of some kind to repeat or retweet your content, sometimes it's giving your account over to a third party who will tweet their message like crazy, and sometimes it's just a group of people organizing on some other platform and saying, all right, go, and then tweeting a whole lot on a particular topic. So it's something about the design of these systems that we have to question, and it's not necessarily just the kind of authenticity or the sort of bot thing that comes up. And I don't know that that directly answers your question, because I don't know how many bots there are on Twitter, so I thought I would just kind of waffle on about…
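
Hansen's "form a sample and have a peek" suggestion can be made concrete with a standard interval estimate for a proportion. The sketch below uses the Wilson score interval, a swapped-in textbook technique rather than anything Hansen or the platforms are said to use. The function name and the numbers (150 accounts judged to be bots out of a hypothetical simple random sample of 1,000) are assumptions chosen only to echo the fifteen percent figure Campbell cites. The interval reflects sampling error alone; it does nothing about the harder problem Hansen raises, that the bot-or-not labels themselves can be wrong.

```python
import math


def wilson_interval(flagged, sample_size, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= flagged <= sample_size:
        raise ValueError("flagged must be between 0 and sample_size")
    p_hat = flagged / sample_size
    z2 = z * z
    denom = 1 + z2 / sample_size
    centre = (p_hat + z2 / (2 * sample_size)) / denom
    half_width = (
        z
        * math.sqrt(
            p_hat * (1 - p_hat) / sample_size + z2 / (4 * sample_size ** 2)
        )
        / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical sample: 150 of 1,000 randomly drawn accounts labelled as bots.
low, high = wilson_interval(150, 1000)
print(f"sample share: {150 / 1000:.3f}, interval: ({low:.3f}, {high:.3f})")
```

A larger sample narrows the interval, but any systematic bias in how accounts are labelled passes straight through to the estimate.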

(Voices overlap)

Bailer: That's an important skill to have in answering a question. So I have just one last question for you. We were talking about some really outstanding and interesting work in data journalism. Thinking about our students who might want to be preparing for this, this is a simple two-part question: what does a stat or data science student need to know to be able to move into the space and work here, and what does a journalism student, someone who is majoring in journalism, need to know to move into the kind of work that you're describing?

Hansen: So I'll start with the second first, because I'm in a school of journalism and I teach journalists. I think there are some basic skills that a student needs to know, some basic coding skills. Being able to access data from an API, being able to do some basic computations. Inevitably journalists still end up, for whatever reason, doing a lot of web scraping, so some of those skills come in handy. Some basic understanding of statistics. I think there are going to be gradations of statistical analysis in newsrooms. If you assume that maybe statistical modeling goes the way that data visualization went, in that for a long time I remember people mocking the New York Times for their graphics, but now I feel like they're leading the pack. Journalists are inventing lots of tools for visualization that are helping the public's capacity for understanding visual stories, or data analysis told through visual means. And so in a newsroom you have people who will make a bar chart, and then you have people who will make this phenomenal sort of interactive map, whatever. And so you have kind of a range, I think, in the same way, of people who will quote a percentage or maybe consult the FactFinder for the census and quote some number, and then those who will be doing full-on regression analysis of various kinds or machine learning pieces of various kinds. So I think we'll start to see much more on the high end. So a journalism student kind of needs to figure out where they're going to fit in that ecosystem, and how they move, and how they gain their skills and so on. So there are some basic tools that we can start them with: a little Python helps, understanding how the web works helps. And I think the most important thing is having a story and someone to guide you or help you, that will help bring you through. Does that answer on the journalism side?

Bailer: Yep.

Hansen: OK. I wasn't interrupted, so that means I'm not waffling too badly! And then, the second, for the statistics side, I've got to say what it really comes down to is a sense of agency. I mentioned this before, but recognizing that the tools you are learning can provide you ridiculously powerful insights into the ways in which your neighborhood, your community, your city, your state, your country functions, and having the capacity to say, you know, wait! Like that curiosity to ask a question, figure out what data might be interesting to address that question, and take it on. The way I was taught, it was kind of dirty for a statistician to collect their own data. I was sort of taught that we're consultants, right, or we are in some way producing statistics that other people interpret. There's always the subject area expert that comes along with you, right? And maybe again John teaches this differently, and then I apologize, but my training didn't provide me with a lot of agency.

Bailer: Neither did mine.

Hansen: And I think that is a shame, right? In fact even the Royal Statistical Society has taken down from their logo the ribbon around the stand of wheat. If you remember, their logo is a stand of wheat, and the ribbon around it was a Latin phrase that said "to be threshed out by others." It was supposed to be just producing stuff and being very objective and what have you. And I'm not suggesting ditching the objectivity, but I am suggesting a sense of agency, and a recognition that the tools that we are learning and the frameworks that we know and the things that we do well and are taught are powerful, and can be used to question power in society if we just open our eyes and look around, right? The stories are there. And a statistician getting involved in joining a journalistic institution is as valid a career path as anything there is now, and they can have significant impact on helping, again, hold power to account, to question why things are the way they are. And I cannot encourage students enough to really grab on to that opportunity, because what they're being taught is powerful, really powerful.

Pennington: Well, Mark, thank you so much for being here. That's all the time we have for today, but before you go I do want to ask if this was less painful than the dentist.

(Collective laughter)

Hansen: And I apologize if I waffled.

Bailer: You're great.

Pennington: You're great, thank you again.

(Voices overlap)

Pennington: Thank you. Stats and Stories is a partnership between Miami University's departments of statistics and media, journalism and film and the American Statistical Association. You can follow us on Twitter or iTunes. If you'd like to share your thoughts on the program, send your e-mail to Stats and Stories at Miami OH dot edu, and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.

(Background music plays)