Social Media, Data, and Democracy | Stats + Stories Episode 96 / by Stats Stories

Dr. Steven Lloyd Wilson is an assistant professor of political science at the University of Nevada, Reno. He earned his Ph.D. in political science from the University of Wisconsin-Madison in 2016, and serves as the Project Manager of Computational Infrastructure for the Varieties of Democracy Institute at the University of Gothenburg. His research focuses on comparative democratization, cyber-security, and the effect of the Internet on authoritarian regimes. He also works on a variety of projects involving network and content analysis of social media around the world. 

Full Transcript

Rosemary Pennington: New media technologies can be imagined as friend or foe when it comes to politics, depending on who’s using the media and who’s framing that use. The last few years, in particular, have seen a rise in interest in the use of new media in the post-Soviet world. Politics, the internet, and data are the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I’m Rosemary Pennington. Stats and Stories is a production of Miami University’s Departments of Statistics and Media, Journalism, and Film, as well as the American Statistical Association. Joining me in the studio are regular panelists John Bailer, Chair of Miami’s Statistics Department, and Richard Campbell, former and Founding Chair of Media, Journalism, and Film. Our guest today is Steven Lloyd Wilson. Wilson is an Assistant Professor of Political Science at the University of Nevada, Reno and studies the intersection of the internet and politics, particularly as it relates to the former Soviet world. He’s joining us in the studio after traveling to Miami on a visit sponsored by the Havighurst Center for Russian and Post-Soviet Studies, as part of the Colloquium Series on Russian Media Strategies at Home and Abroad. Steven, thank you so much for being here today.

Steven Lloyd Wilson: Thanks for having me.

Pennington: I know when you were here at Miami, you were talking about social media use in the formerly Soviet countries, or the former Soviet world. How did you get interested in this particular area?

Wilson: Well, I have sort of a weird route where I ended up being a Doctor of Political Science, where that was never the plan at any point. So, when I was 13, I read The Gulag Archipelago for the first time, and I thought it was a novel when I happened to take it out of the library, and that kind of started this obsession and interest with Russia, but then I went and got an engineering degree, because I didn’t know what I wanted to do, so I might as well make a salary while I didn’t know what I wanted to do. But once I got sick of designing webpages and doing database work on the web, I went back to study political science and I thought that I was going to be a Russianist, and then the Arab Spring happened, and suddenly, knowing how the internet worked was very, very valuable for a political scientist, and so to this date, for the last few years, that’s sort of what I’ve stitched together, is that technical background along with knowing about, and really loving, kind of, the former Soviet world.

John Bailer: So, when you think of social media, what are some of the key features that you look at in social media as features that you would want to analyze?

Wilson: The simple fact that it’s normal people communicating, which, I think in the modern world, we really get spoiled by how social media’s just, sort of, become ubiquitous and we don’t even think about what a miracle it is in some way. A lot of historians, they would literally murder each other if they could get a diary from a normal person’s perspective from some countries for entire centuries of time, and over most of human history, normal people had absolutely no voice, and so in the modern world, it’s a small – or rather large miracle that we are absolutely recording in real time just normal people’s thoughts permanently, so we can read them, analyze them, use them statistically for political science or just use them qualitatively, reading through them as eyewitness testimony and stories.

Richard Campbell: So, you talked today a little bit about some of the limitations of doing this kind of work, particularly when you have populations, large populations, where not everybody’s using social media. How do you address that in your research?

Wilson: Well, there’s a temptation to try to use social media as a replacement for survey data, which I think is very methodologically misguided, because what we get into is that social media data can never truly be representative, and it’s not representative on many different levels. You have, first of all, the people that use social media are not representative of the population as a whole. The people who are vocal on social media aren’t even representative of people on social media, let alone the population as a whole. And you can keep going down this rabbit hole; trying to use it as some type of survey mechanism is really doomed to failure, and I think that the biggest problem with it is not understanding that you don’t understand what the denominator is. Since this is a stats podcast, I can be stats nerdy to a certain degree. There’s a tendency to simply look at social media numbers in the absolute, like 70 million Americans saw Russian-targeted propaganda on Facebook, and never have a denominator for, yes, that’s a large number, but what is it over?

Campbell: You’ve talked in your talk about the, sort of, two approaches to this. The first approach when social media came along was to study it as a causal variable, but you think it’s moved to a different level. Your approach is, using social media to measure something.

Wilson: Right.

Campbell: So, talk a little bit about what you would like to measure and what you are measuring.

Wilson: Sure, so the first generation of social media research really was about, does social media cause democratization? Does social media cause there to be more protest? And as social media’s become basically universal on some level or another, that’s become a less interesting question. It’ll be interesting for historians at some point, I’m sure, but in terms of the more interesting use is, what can we actually get out of social media data that we can’t get any other way? And so, this is where the survey temptation really comes in, because you think, ah, I can just look at the social media postings from random oblast and tell you what percentage of Russians in random part of Siberia think X, Y, and Z. And so, I don’t think it quite works that way, but it really works from that perspective of, can we hear voices that we otherwise wouldn’t be able to hear and see? And a different way of thinking about it is that, it’s fundamentally a bottom up approach when it’s working correctly. Rather than a researcher saying, I want to know about X, and therefore going and asking people about question X. Instead, we can really let the data speak for itself in terms of, if I want to know about Russian nationalism, instead of asking people, are you nationalist, I can go look at hundreds of millions of tweets in Russian and say, let’s look at how people talk about nationalism in various ways.

Bailer: I’m curious as you look at these ideas of not being representative, the participants in it, one aspect of that seems like you might be looking at people that are influencers within this scope. Is that one dimension of what you consider?

Wilson: That’s definitely one dimension of it, is that there are some people that are more vocal on purpose and have more followers, they have more people paying attention to what they say. And so, from a certain perspective, you don’t want to look at those people, if you’re looking at it from kind of a traditional survey research perspective, because, well, those people are just overly vocal and performative, why would I weight them differently? But that’s kind of the beauty of social media data, is we can see what people actually are following in the wake of, and one very interesting example we have of that is that, in old school, classic, political science research, we have this finding that the number one indicator of whether somebody will go to a protest is not how strongly they feel, how angry they are, any of the other obvious, kind of, political causal variables. It’s really simple: do they know someone else going to the protest? That is the predictor of whether they go. And so, what we found on social media is that, because of that whole power of weak ties cliché, it makes it way more likely for somebody to know that somebody is doing something, and so we can see stuff like, if you look at network-weighted Twitter data, for instance, just simply multiplying the number of tweets from a location by the number of followers the people have, you can see on a map mass protests happen, because you see all of the network-central actors geographically converge into a tiny area. And so, on the one hand we pick up mass protest that way, and we also pick up World Cup matches, for instance.
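The network weighting Wilson describes can be illustrated with a minimal sketch. It assumes one plausible reading of "multiply number of tweets from a location by the number of followers": each tweet is counted once per follower of its author, so the weighted total for a location is the sum of follower counts over its tweets. The sample records, place names, and function names are hypothetical and are not taken from his project.

```python
from collections import Counter

# Hypothetical sample: one (location, author_follower_count) pair per tweet.
tweets = [
    ("Moscow", 120),
    ("Moscow", 15000),
    ("Kazan", 300),
    ("Moscow", 80),
    ("Kazan", 45),
]

def raw_counts(records):
    """Plain tweet count per location, ignoring who is tweeting."""
    return dict(Counter(location for location, _ in records))

def network_weighted_counts(records):
    """Sum of author follower counts per location (assumed weighting scheme)."""
    totals = Counter()
    for location, followers in records:
        totals[location] += followers
    return dict(totals)

print(raw_counts(tweets))               # {'Moscow': 3, 'Kazan': 2}
print(network_weighted_counts(tweets))  # {'Moscow': 15200, 'Kazan': 345}
```

In this toy data the raw counts are nearly equal, while the weighted totals differ by a factor of more than forty, which is the kind of concentration of network-central accounts that would stand out on a map.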

Pennington: How do you sift through this? It’s so much material. You sort of raised the analogy of, right, historians looking back and gathering diaries from a single individual, this is an astronomical number of, if Twitter, just tweets alone, how do you sift through this to find the interesting stuff?

Wilson: With lots of computer programs [LAUGHTER]. So, this is kind of the weird paradox of social media data, is that it’s sort of the most dispersed, vague, qualitative data we’ve ever had, and yet, it begs for some type of quantitative approach because there’s so much of it. And so, we’re developing a lot of new, really fun statistical techniques that basically deal with doing automated text analysis, for instance. I do a lot of work with convolutional neural nets, in terms of basically doing stuff like, okay, let’s train a neural net on all of Russian language Wikipedia, so the neural net knows what Russian looks like, mathematically, sort of thing. And then, let’s go ahead and take nicely paid undergraduate RAs who are fluent in the language to code something like, here’s a set of 10,000 random tweets that all have a set of various words that we associate with something like nationalism, or some political issue. Now, go hand code it through a rubric saying, okay, this tweet is hyper-nationalist, this tweet is anti-nationalist, and code it that way, and then we just use the neural net to say, function on that, do a training set, do a test set, look, it’s working 95 percent well, if it’s not working very well, you do kind of iterative steps of training, and then you can throw that model at the whole stack of a hundred million tweets that you have.
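The workflow in this answer (hand-code a sample, split it into training and test sets, check accuracy, then label the rest) can be sketched in a few lines. This is only an illustration under stated assumptions: it swaps the convolutional neural net Wilson mentions for a small bag-of-words naive Bayes classifier so that it runs with the standard library alone, and the English example texts, the two labels, and all function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical hand-coded sample; real projects would use thousands of tweets.
labeled = [
    ("our nation above all others", "nationalist"),
    ("glory to our great nation", "nationalist"),
    ("defend the nation and its glory", "nationalist"),
    ("our great people above all", "nationalist"),
    ("borders are an outdated idea", "anti"),
    ("nationalism is an outdated myth", "anti"),
    ("flags and borders divide people", "anti"),
    ("the myth of glory divides us", "anti"),
]

def train(examples):
    """Count words per label (multinomial naive Bayes, not a neural net)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the most probable label, using add-one smoothing."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, examples):
    """Share of examples whose predicted label matches the hand code."""
    hits = sum(predict(model, text) == label for text, label in examples)
    return hits / len(examples)

# Shuffle with a fixed seed, then hold out a quarter of the sample for testing.
shuffled = labeled[:]
random.Random(0).shuffle(shuffled)
split = int(len(shuffled) * 0.75)
train_set, test_set = shuffled[:split], shuffled[split:]

model = train(train_set)
print("held-out accuracy:", accuracy(model, test_set))

# Once the held-out accuracy is acceptable, label the uncoded texts.
unlabeled = ["glory to the nation", "borders are a myth"]
predictions = [predict(model, text) for text in unlabeled]
print(list(zip(unlabeled, predictions)))
```

With a sample this small the held-out accuracy is not meaningful; the point is the shape of the loop, in which poor accuracy sends you back to coding more examples or adjusting the model before it is applied to the full collection.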

Bailer: So, what’s been the biggest success that you’ve had in this type of model-building and application?

Wilson: I think one of the most interesting ones is a current project that I’m working on with Doctor Gillman from University of Nevada also, where we’re trying to measure partisanship in congressional tweets. So, we’ve got a database of all tweets by members of Congress from the last, I think it’s six years going back, it comes out to about two and a half million, I think is how many tweets there are; congresspersons are very, very verbose on social media. And so, we’re doing a lot of work on doing this exact process of hand coding whether there’s partisan language being used, like angry partisan language, and can we train a neural net to do that. And it was amazing, because it was just the most default settings we possibly had thrown in the first model, and it was 85 percent accurate that it could just pick up – like, that’s how mathematically obvious it is when people are speaking in very partisan language.

Bailer: So, I’m curious when you get data like that, how do you know it’s not staffers that are doing this, or are you just assuming that the staffers are representing the message that the representative would want to have put forward?

Wilson: Well, in that case, we’re making that assumption, because these are the official Twitter accounts of the people in office. So, there’s some assumption that this is official and if it wasn’t official, that they would either delete those tweets or fire the person who was posting that type of thing. So, there’s definitely that going on.

Campbell: In your talk today, you talked about when new media came along, this happened with television too and other new media, there’s this sort of promise, this utopian vision that the internet’s going to spread democracy everywhere and it’s just going to be great, and the people that are voiceless will have a voice, and you’ve talked and studied how savvy authoritarian regimes are actually using this to their own benefit, and can you talk a little bit about what you’ve done on that front?

Wilson: Sure, this is a lot of what my dissertation work was actually on, and it’s this notion that it’s not as simple as, oh, the dictator has a bunch of people reading internet posts and sends the jack-booted thugs to their door in the middle of night. It’s far more sophisticated than that in a very interesting way. So, China’s sort of the prototypical example of doing this to perfection, which is, they’re using it to solve their information problem, which in dictatorships, there’s this issue that, they never know what people are actually unhappy about, because they’re not allowed to complain, they’re not allowed to write letters to the editor, they’re not allowed to vote out the corrupt local party official, and so over time, eventually, it reaches a boiling point, and people revolt, not because they want democracy, but because they’re sick of the local dog catcher murdering dogs, or the pothole that’s been in front of their house for 25 years and no one’s ever fixed it. And it’s the way I put it is that, the internet with savvy regimes, lets them know where the potholes are. Where what China does brilliantly, if maliciously, is that, they allow limited freedom of speech within, sort of, socially constrained understood norms that, no, you’re not allowed to agitate to overthrow the government, but you’re allowed to complain about the potholes, and then they can figure out what problems to fix, and therefore become, in the long run, far more stable.

Pennington: You’re listening to Stats and Stories, and today we’re with University of Nevada, Reno’s Steven Lloyd Wilson. He’s at Miami University on a visit sponsored by the Havighurst Center for Russian and Post-Soviet Studies as part of the Colloquium Series on Russian Media Strategies at Home and Abroad. Steve, a couple of times in your responses, you’ve mentioned issues of democracy, and I know you’ve written about something called Varieties of Democracy. Could you explain a little bit what that is exactly?

Wilson: Oh, it’s a fantastic research institute that operates in Sweden right now. It started over in the US. We have PIs all over the world at this point, Notre Dame in particular has some, University of Florida. And basically, the gist of it was being dissatisfied with, kind of, these big cross-country measures of democracy, and the idea was, instead of sort of internally having a few people create a measure, to ask very specific questions of experts in every country around the world. And so, they’ve got a network of some 3,200 PhDs around the world; two-thirds of them actually live in the country that they are nominally an expert of, so it’s really on-the-ground expertise. And the questions are basically based around, okay, you’re a media expert, so we’re going to ask you a series of questions on, like, media freedom in your country. And it’s designed as building block style. So, instead of a top-down, how democratic is the country, it’s questions more like, if someone posts something online that is critical of the government, how likely are they to be arrested? And there’s a lot of fun statistical models that are put together in order to use anchoring vignettes to make sure the thresholds are comparable between individuals, to do bridge coding, where people code more than one country so we can make sure that their thresholds transport beyond countries, and all sorts of stuff like that. So, there’s a very elaborate Bayesian IRT latent variable model, I think all the words in there for it, that was mostly put together by, I believe, Dan Pemstein it was at NDSU did a lot of it, and Kyle Marquardt, who’s actually in Moscow – or heading to Moscow for a job now, is kind of following up the work on that.

Campbell: Is that the group that’s been doing this work on – groups that are targeted most by hate speech? Did that grow out of this?

Wilson: It grew out of it in part, that’s sort of a new project that I’m working on, called the digital society project.

Campbell: Talk about that, because that’s really got some interesting stuff in your talk today.

Wilson: Sure, so that’s basically based on that Varieties of Democracy infrastructure. It’s the exact same approach, but it’s a new survey that we put together this year, in order to measure various aspects of digital society, so like, regime capacity for monitoring the internet, regime use of the internet to broadcast false information domestically, whether you’re likely to get arrested for speech online, how average people use the internet for organizational purposes or not. And we just finished up the survey in January, and so we actually just released the data set today, and I have a message from one of my co-PIs that the CSV link is broken, so I need to go fix that [LAUGHTER] immediately after this.

Bailer: Always happens, doesn’t it?

Wilson: Exactly. So, that’s putting together all of these measures. It’s 35 indicators across 180 countries, 2000 to 2018 time series, so.

Bailer: And tell more about the questions that you have to answer with this data.

Wilson: Sure, we’re trying to answer, for example, a lot of people come to me and it really is a once a month, I get a question that’s something along the lines of, you do internet stuff [LAUGHTER] it’s usually like that. I need a cross country measurement of level of government censorship of the internet. Which should be a perfectly reasonable question to ask, and my answer is, there isn’t one. And my answer is that to almost all of these different variables of things like, the government’s capacity for using the internet to do all of this stuff like China does, for instance. Right now, or until this afternoon with the CSV file, we had no cross-country answer. We just had, sort of, yeah, the China experts can tell you what China does. The Russian experts can tell you what Russia does. And then past that, you wouldn’t even know to ask that central Asian – sorry, it’s the Central African Republic right now is being hammered by Russian bots on social media. That would never occur to you, and we found it simply because the data showed that from the Central African Republic, and then when we dug into it to make sure that the statistical model wasn’t just being screwy, it was, huh, who would have thought that? We do now, because we asked all of these people, so.

Campbell: Now is this the study that has the hate speech literature?

Wilson: Yeah.

Campbell: So, talk a little bit about some of those early results from this.

Wilson: Sure, so one of the questions we asked was simply a multiple selection where you could answer as many as apply or not, of what groups are targeted by online hate speech and harassment in a given country. And I know the initial response is, what if there is no online speech? What is the denominator? You just told us [LAUGHTER] – well, we have another question that is about whether or not there is online organization in the first place. So just control for those with each other and you’ve got an answer. But there’s some surprising responses where, basically, almost entirely around the world, in almost every single country, LGBTQ individuals are targeted, which is probably not a fantastically surprising result, but it’s interesting just in how universal it is. And then one step down from that is ethnic groups. So, obviously, the ethnicity varies from country to country, and then it just kind of hammers down there. And one of the findings that I think that we’re getting here is that, I think there’s a perception, or at least an assumption, that social media and internet usage is a rich and western thing, and then there’s certain other cases that we know about like Russia and China, and then what we’re actually finding is that, no, this is really universal. Like our question about, is there a domestic online media presence, and how big is it? And we expected that to basically be our nice control variable for telling us where there just isn’t, everybody just uses Facebook or something like that, and the answer was, no, every place has domestic online media.

Bailer: I’m curious about something. I was looking through one of your papers describing the idea of different types of technical literacy and how that might play out, and really the idea of population technical literacy and how that might be related to the opposition and organization of the opposition, or regime technical literacy and how that might deal with this, sort of as a countervailing force that might be in place. Could you talk a little bit about the idea of technical literacy and that distinction that you make?

Wilson: Sure, the reason why I cast it like that was going back into the old school studies of actual literacy in terms of can you read. And today, this is almost a boring subject because it’s basically, it’s 99 percent everywhere, unless you have a lousy education system, in which case, there’s far more problems in whether or not you have literacy or not. But so, I wanted to capture something more nuanced than just percent of population using the internet. I wanted some measure of the kind of anecdotal examples that we got out of the Arab Spring for instance, where you had, in Egypt, for instance, people setting up wi-fi hotspots from their cellphones and then intentionally daisy-chaining them until they could get a connection across the border once the government shut down the internet. And to me, that was fascinating, because that was a – somebody has to have technical know-how to do that, but you don’t have to have that many people. It’s sort of like a force multiplier, where you don’t need that many hardcore computer programmers to do that, but you need some. And so, that’s why a lot of these – the measures I was using for population technical literacy, I was doing a Bayesian latent variable model that had a variety of inputs that were basically passive versus active, where passive being, percent of population with access to the internet, because you have to have the baseline, but then active measures like, how many – one really good proxy variable I had was collecting the number of posts to the Linux Kernel mailing list from every country in the world doing like an IP address back tracing to just what country was that from, because the idea was, it didn’t take that many, but if you had a few Linux hackers in your country, that probably meant you had the capacity to do the Egypt Wi-Fi hotspot thing, so.

Pennington: You raised this issue of the Russian bots in the Central African Republic, and I wonder, there’s been so much coverage of Russia’s influence in various media, not just in the United States, but obviously in Ukraine and in eastern Europe. Just sort of pull away from the research and think about what you see as an expert in this area in the news coverage. What about the news coverage of this issue have you found frustrating or problematic, if that’s not too strong of a word?

Wilson: I think it goes back to the denominator factor. So, the temptation with all social media research and data, especially on the journalistic side of things, is the numbers are really big, so you just report the numbers. And so, for instance, the 70 million Americans seeing the various Facebook ads that were from Russian bots was one that was really interesting to me, because if you dig into the details of how they got those ads – you think 70 million people and you think, wow, that is like presidential campaign level of whatever. They spent 200 thousand dollars to do those ads. If 200 thousand dollars of Facebook ads could have the influence that is being feared, they wouldn’t cost 200 thousand dollars, or rather Facebook would be the world’s smallest company. So, it’s sort of this gist of, when you look at these very large numbers, you have to have an appreciation for what you need to divide them by to figure out how relevant they actually are.
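The division Wilson is asking for can be done directly with the two figures he quotes, 200 thousand dollars of ads and 70 million people reached. The sketch below is only that arithmetic; the helper name `per_unit` is hypothetical, and the choice of spend per person reached as the ratio is an illustrative assumption rather than a measure he names.

```python
def per_unit(total, denominator):
    """Divide a headline total by the quantity it should be read against."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return total / denominator

ad_spend_usd = 200_000       # figure quoted in the episode
people_reached = 70_000_000  # figure quoted in the episode

cost_per_person = per_unit(ad_spend_usd, people_reached)
print(f"${cost_per_person:.4f} spent per person reached")  # $0.0029
```

Read this way, the headline number corresponds to less than a third of a cent of advertising per person reached, which is the sense in which a large absolute count can still describe a small intervention.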

Bailer: And even just the idea of translating that to – from exposure to action, to the change. I mean, we’ve had a previous guest that’s talked a little about that and that seems like that’s a real hard problem when you’re looking at this type of data.

Wilson: Yeah, definitely. Trying to figure out whether there’s an actual causal factor of, it’s one thing to say there’s a bunch of ads, can you actually say that this actually tipped something or changed something. And I think it is important to note that it’s bad, even so. Even if it didn’t, on a poli-sci level, cause a tipping point, having foreign intervention in democratic elections is bad, but on the other hand, one of the really funny experiences I had was, I was at a conference that was a mixture of policy-makers in the government working on these issues and academics, and the policy-makers were having an aside at one point, arguing about how frustrated they were that our allies didn’t see what a big deal it was and that they just weren’t giving us the proper, at least, emotional support on some level of, you’re right, heavens to Betsy, this is terrible. And one of the things that I pointed out to them was just that, yeah, all those Latin American countries you’re talking about, I can’t imagine why their response to, oh my gosh, a foreign power is intervening in the advertising and free media of our elections; why they wouldn’t immediately just be shocked and give you sympathy, and they weren’t particularly impressed by that, but.

Campbell: You talked today about we’re just, sort of, at the infancy stage of studying the way social media – the ways you can measure it. So, what can’t you know from this research when you talk about limitations, and what would you like to know that you hope will be revealed at some point?

Wilson: I think the main limitation, what we can’t know is, we can’t treat social media data as survey data. You can never use this, no matter how fancy of a mathematical model you build, you can’t say, I collected these tweets, and now I can tell you that there’s 85 percent odds that Hillary wins the election or something like that. Maybe we can’t do that with survey data either [LAUGHTER] but the temptation to do that is there, because it looks like that data’s there, but it really isn’t. And in terms of what I would really want to be able to get out of this data, I had talked briefly in the conversation earlier about really being able to turn this back to use it on a more qualitative level, because a lot of the social media data, in particular, that has, say, like, GPS coordinates attached from cell phones and stuff, we can actually say with essentially certainty that someone was actually at a physical place typing this thing at this time, which gives us, in the long run of history, this fascinating source of basically eyewitness testimony that we can confirm is real and in real time. And so, I would really, really love to have the type of system set up in some way that we can gradually collate this together into, this is what people on the ground said about this event, that there were no reporters at, and no one has done any surveys about after the fact, but this is what people said.

Pennington: Well, Steven, that’s all the time we have for this episode of Stats and Stories. Thank you so much for being here.

Wilson: Oh, thank you for having me. It was fantastic.

Pennington: Stats and Stories is a partnership between Miami University’s departments of Statistics and Media, Journalism, and Film, as well as the American Statistical Association. You can follow us on Twitter, Apple Podcasts, or other places where you find podcasts. If you’d like to share your thoughts on the program, send your email to StatsAndStories@MiamiOIH. Be sure to listen for future editions of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics.