Sports Analytics in the Classroom | Stats + Stories Episode 342 / by Stats Stories

Mark Glickman is a Fellow of the American Statistical Association, a Senior Lecturer on Statistics in the Harvard University Department of Statistics, and Senior Statistician at the Center for Healthcare Organization and Implementation Research. His research interests are primarily in the areas of statistical models for rating competitors in games and sports, and in statistical methods applied to problems in health services research. He served as an elected member of the American Statistical Association's Board of Directors as representative of the Council of Sections Governing Board from 2019 to 2021.

Episode Description

Sports generate a lot of data, including individual player metrics, team performance data, and specific game statistics. And there are a lot of tools to crunch all those numbers. Learning to use them can be a challenge and is the focus of many sports analytics classes offered in the United States. We hear about one professor’s approach to teaching sports stats in this episode of Stats and Stories, where we explore the statistics behind the stories with guest Mark Glickman.

Full Transcript

Rosemary Pennington
Sports generate a lot of data. Among them are individual player metrics, team performance data and specific game stats, and there are lots of tools to crunch all those numbers. Learning to use them can be a challenge, and is the focus of many sports analytics classes offered across the United States. We hear about one professor's approach to teaching sports stats in this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington. Stats and Stories is a production of Miami University's Departments of Statistics and Media, Journalism and Film, as well as the American Statistical Association. Joining me is regular panelist John Bailer, emeritus professor of statistics at Miami University, and today we have a repeat guest joining us on the show: Mark Glickman is here. He's a Fellow of the American Statistical Association, a senior lecturer on statistics in the Harvard University Department of Statistics, and senior statistician at the Center for Healthcare Organization and Implementation Research. His research interests are primarily in the areas of statistical models for rating competitors in games and sports, and in statistical methods applied to problems in health services research. He served as an elected member of the American Statistical Association's Board of Directors, as representative of the Council of Sections Governing Board, from 2019 to 2021. He's here today to talk with us about teaching sports analytics. Mark, thank you so much for coming here again.

Mark Glickman
Thank you so much for inviting me, Rosemary and John. I really appreciate it.

John Bailer
Oh, it's great to have you back. I don't know if you noticed I was doing the wave when you were being introduced. Rosemary didn't pick it up. She left me. She left me just hanging here, Mark.

Mark Glickman
I was wondering when you're going to stop reading about this, this crazy guy who just does all these silly things that just are not doing great things for the world, and get on with something that might be of interest to your audience.

John Bailer
Oh, I think this is great. You know, in fact, our very first episode that we ever did, in 2013, was a sports statistics-related episode, which was incredible.

Rosemary Pennington
And football season has just started, so I think people are going to want to hear what you have to say.

John Bailer
Yeah, absolutely. And you know, so before we dive into your class, I had this really burning question, which is, what was the first sports related data analysis that you ever remember doing?

Mark Glickman
So I have to say that I'm a pretty strange person when it comes to sports, because unlike most people who've gotten into sports analytics, I actually got into sports because of the analytics, rather than got into the analytics because of the sports. So I was actually, for most of my life, you know, fairly unschooled in sports. So actually, what got me into all of this was chess, which I suppose is, you know, some kind of, you know, some version of, like, a brain sport, I guess. If you want to be more, you know, more conventional about it, I mean, the first sport that I became really interested in was basketball. And the first application, I guess, of analytics for me was NFL football, and so, like, you know, I still think about those kinds of problems even now. But my main love in sports analytics, really, is everything having to do with chess, and analytics of chess, and being able to not only figure out, like, you know, how to rate chess players, how to assess how good they are, but even getting into areas like, how do you design tournaments that are done in a way to optimize certain criteria, making them fair, learning as much information as you can based on the pairings. So, you know, those kinds of questions have always fascinated me.

Rosemary Pennington
So how do you go from chess, which I would agree is a brain sport, I love chess, although I am decreasingly good at it, I think, as I get older. So how do you grow from a focus on chess to sort of opening it up to sort of other kinds of sports?

Mark Glickman
Well, one way that you could do it, which happens to be my way, is that, you know, I was pretty committed to writing a PhD dissertation on something having to do with rating chess players. And in fact, the whole point of what I was working on during my PhD studies was to develop from first principles a method for deriving a rating system or deriving a method to analyze strengths in chess tournaments. And in my case, I happened to have an advisor who basically said, you know, you can't just restrict this to chess. You might want to go out and apply this to sports. And I was like, oh, what sport do you suggest? And my advisor, Hal Stern, said, let's try football. And so, I basically got into football mainly because I was analyzing the data. I mean, I genuinely have an interest in football now, but that's actually what got me into it. So, you know, they're very related problems. I mean, for chess, you have people that are playing head-to-head in games, and, you know, the games are basically won, lost or tied. And the other feature of chess is that players' abilities change over time. So having basically a head-to-head setting where there's a win, loss or a tie outcome translates to football also, except with football you have this extra detail that you have point scores. And so that makes the problem, in one way, more complicated, because there are more results, but in another way, it also makes it easier, because the models for analyzing that kind of data tend to be easier to work with. But the basic structure is the same. And so that's kind of what got me into it. So I hope I'm not disappointing you by saying that, like, basically, my whole set of interests is guided by what I've been doing in statistics.

John Bailer
No, no, that's really cool. I was interested in you talking about the relatedness of problems in chess and football. And I was thinking, as you were describing these first three sports, that there's very different types of information that's obtained in these games. You know, with basketball, you have a paired competition, but yet, there's a trajectory of what's happening in the course of a game.

Mark Glickman
That's, yeah, although you could actually say that the same thing is happening in chess, because, like, over the course of a game of chess, you know, there are ebbs and flows, there's, you know, peaks and valleys. And you know, sometimes you make an extraordinarily good move or an extraordinarily bad move, and that corresponds roughly to point differences, you know, the gap in scores starting to increase within a game.

John Bailer
So that's great. I was suspecting that was true, but I was thinking that it's probably not easily available to have that kind of data to do analyses.

Mark Glickman
Although, these days, it actually is becoming much more available. And in fact, I think that's probably one of the areas that I think is going to be, you know, the real open problem in rating competitors, especially in board games like chess. In the old days, everything would be based on the game result. Yeah, so, like, if you and I, if the three of us were playing in, you know, say, chess tournaments, we would end up getting an evaluation of how good we are, just purely based on what the results of the games were. But now these days, with computer programs being so good, they're able to evaluate individual moves. So, like, I can basically, you know, tell you, for a particular game that you're playing, like, I can feed a game into a program, and it'll tell me, on a move-by-move basis, how good it thought the moves were. And so there's an opportunity there to be able to basically evaluate how strong chess players are based on individual moves. Now there's an interesting aspect to all this, I think, which makes it a tricky problem, which is that, just like in sports like, say, football or basketball, there's a concept of garbage time, which is to say that, like, in a chess game, you might be completely dominating your opponent by some point in the game and they're not going to give up, like, maybe they'll just keep on playing on, hoping that you're going to make a mistake and you won't win. Usually, that's not going to happen. But the question is, if you're going to be trying to evaluate how strong a player is on a move-by-move basis, do you really want to be including those moves at the end of the game that are basically garbage time? And that's a question that, you know, comes up both in chess and in conventional sports, when you want to be able to evaluate how strong teams are.

Rosemary Pennington
When John was talking about us talking to you about sports analytics, I was immediately thinking of, like, my brothers and how growing up, they had baseball cards, which I know is a cliche, but, like, were looking at all the player stats. And I used to play fantasy football pretty religiously during the football season, before I became a professor and had no free time. And I also would, like, pore over these stats to try to sort of figure out, like, what was the right player, what was the right lineup, which is sort of obviously a very cursory, rough and ready kind of approach to how to crunch stats. So when you have students coming into this sports analytics class that you teach at Harvard, what kinds of things are you teaching them to do, and I guess, what kind of careers are you potentially setting them up to sort of walk into?

Mark Glickman
Yeah, that's a really good question. So my overall sense from the students that took this course, just to back up, you know, I taught this sports analytics course for the first time this past spring. You know, this was in some ways a little bit of an experiment, because, first of all, I never taught a course in sports analytics before. There was a version of this course at Harvard a couple years prior, taught by somebody who, unfortunately for the department, ended up leaving right after he taught the course and ended up landing a position as, like, the head of analytics for Manchester City. So he's in charge of all Manchester, yeah, no, he's a really impressive guy. And so he put together this course. Again, you know what the goals are for a class like this, I think our general approach, and I would say probably most academics that end up teaching something like sports analytics, are often doing it, not so much because they have like particular career goals for the students. I think it's more a combination of giving students exposure to, like, applications of statistics. You know, in principle, there's, like, lots of different courses that, you know, we statisticians could be teaching that, like, basically, take a lot of the methods that they're learning in more foundational courses, and then apply them to, you know, substantive areas. And, you know, it could be about the environment, it could be various things like health, you know, medical and health kinds of settings. And so, you know, we basically were interested in applying it to sports, in part because we know that so many of our students are actually really interested in sports, and I think that's a pretty common phenomenon, you know, across the US. So we're basically serving two goals here. One, really was to, you know, give the opportunity for people to get some exposure to applications of the kinds of methods that we're teaching. 
And then secondly, you know, recognizing that we have this captive audience already that probably would be interested in taking a class like this anyway. We didn't have any particular, you know, aim for positioning people into careers.

Rosemary Pennington
Yeah, it's interesting because sports analytics does seem like it is this really blossoming space where, like, teams are hiring analysts, in addition to sort of newsrooms hiring sports analytics folks too. And it just does feel like it has grown in a way that I personally did not expect to see happen.

Mark Glickman
So, yeah, I mean, I know, just from the history of all this, that it had a pretty interesting trajectory, because I think pretty early on, right around the time, I think, that most professional teams started getting a little bit of a hint that there might be something to working with data and making data-based decisions, what they started doing was they started opening up to hiring, like, maybe one or maybe two analysts to bring into their teams. But what ended up happening is that they didn't really entirely know how to use these people. And in fact, I know that oftentimes they had these analysts in sort of dual roles, where they would, you know, do the sorts of things that the analysts wanted to do, like work with the sports data. But then they would also have them do, like, a lot of grunt work that had really nothing to do with the sports side of the business. And they also weren't being paid very much. So the retention, I think, for analysts on sports teams was pretty low. I think after, you know, the glamor of being associated with a professional team, I think when that wore off, then, you know, it became pretty clear that these teams really had to pay, like, serious salaries to qualified people. And I think that's where we are presently. So it really, it took a while to get there. But, like, I think originally, a lot of these teams were trying to, you know, sort of take advantage of this being a new area, and one that would be in very high demand by, you know, the kinds of people they'd be employing. But eventually, I think the market caught up with this phenomenon. Now, in fact, you know, there are plenty of faculty, statistician faculty, people that I'm aware of, that basically left their faculty positions to become full time in sports analytics.

Rosemary Pennington
You're listening to Stats and Stories, and our guest today is Harvard University's Mark Glickman.

John Bailer
So I'd like to loop back to one of the comments you were making about sports in this context. We're talking about some of the concepts that people are learning in stats classes, and I noticed from your syllabus that in one of the first couple of weeks you spend time on the different models that might be needed for high-scoring versus low-scoring games, you know. So why don't you flesh that out a little bit for people that are listening?

Mark Glickman
Sure. So, you know, for a bunch of different games, like, say, basketball, and you could probably throw in football as well, NFL football or American football, I guess I should say more generally, you know, those are games where point scores are pretty reasonably high. Like, certainly for basketball, that's probably even more compelling, because in typical games, you'll end up having scores where one team might get in the 90s and another might get in the hundreds. And so, you know, the kind of models that are traditional models for football scores would be to basically assume that the score difference between the two is approximately normally distributed, and it's centered at the difference in these unknown strengths of the teams. So like, say, I have a team. You have a team. Each of us has, like, some unknown value for what our strength is. And like, say, just to understand the principle, suppose that we were to play repeatedly, and I beat you by maybe five points in the first game, six points in the second game, and seven points in the third game. So I'm defeating you those three times, and then, on average, I'm defeating you by six points. So what it would mean is that whatever my parameter value is and your parameter value is, the best guess at their separation is six points. And so you could basically play this game by looking at lots of games played among different pairs of teams, and you could basically estimate the differences among all the parameters for all these different teams. And you could do that using a normal-based model. In fact, this whole framework really fits in quite nicely to, like, a normal least squares regression framework. And so that's basically the kind of tool that I introduce for these, like, high-scoring games.
Now, I'll mention, just, you know, for completeness, that there is an interesting twist to this situation, which is that, in this example that I just gave you, where the difference in our point scores averaged six points, well, that could mean that, for example, my parameter value could be positive three and yours could be minus three. But it could also be, like, mine is 10 and yours is four. So the point is, you can't actually, like, figure out what those parameter estimates are just purely based on these point score data, because there's an indeterminacy there, you know, there's basically a redundancy in what those different parameter value estimates could be. So what you typically do is you impose a constraint which says something like, whatever all those differences are between those estimated parameters, make sure that, on average, all the teams' parameters average to zero. And that's just an arbitrary number, like, I could have made it so that they average 10, or so that they average 100. It's conventional to have them average zero, but that can actually be included as a constraint in the least squares regression model, and that's one of the things that I actually go into quite a bit of detail teaching, because that idea comes up over and over again in some of these models that we work with.
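
Editor's sketch: the model described above (score margins centered at the difference in team strengths, fit by least squares, with the strengths constrained to average zero) can be illustrated with a few lines of Python. This is a minimal illustration and not code from the course; the function name `fit_strengths` and the `(team_a, team_b, margin)` record layout are assumptions made for this example, and it assumes every team is connected to every other through some chain of games.

```python
def fit_strengths(games):
    """Least-squares team strengths from (team_a, team_b, margin) records.

    margin is team_a's score minus team_b's score. The fitted strengths are
    constrained to average zero. Assumes every team is linked to every other
    through some chain of games; otherwise the system below is singular.
    """
    teams = sorted({t for a, b, _ in games for t in (a, b)})
    index = {team: i for i, team in enumerate(teams)}
    n = len(teams)

    # Normal equations for: margin ~ strength[a] - strength[b].
    # Starting every entry at 1.0 adds the all-ones matrix, which pins the
    # otherwise arbitrary level of the strengths so that they sum to zero.
    lhs = [[1.0] * n for _ in range(n)]
    rhs = [0.0] * n
    for team_a, team_b, margin in games:
        i, j = index[team_a], index[team_b]
        lhs[i][i] += 1.0
        lhs[j][j] += 1.0
        lhs[i][j] -= 1.0
        lhs[j][i] -= 1.0
        rhs[i] += margin
        rhs[j] -= margin

    # Solve by Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(lhs[r][col]))
        lhs[col], lhs[pivot] = lhs[pivot], lhs[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, n):
            factor = lhs[row][col] / lhs[col][col]
            for k in range(col, n):
                lhs[row][k] -= factor * lhs[col][k]
            rhs[row] -= factor * rhs[col]

    # Back substitution.
    strengths = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(lhs[row][k] * strengths[k] for k in range(row + 1, n))
        strengths[row] = (rhs[row] - tail) / lhs[row][row]
    return dict(zip(teams, strengths))


# The example from the conversation: three wins by 5, 6 and 7 points.
print(fit_strengths([("me", "you", 5), ("me", "you", 6), ("me", "you", 7)]))
```

With the three games from the example, the fit lands on plus three and minus three, the sum-to-zero version of a six-point separation. Choosing a different target average would shift every strength by the same amount and leave all the differences unchanged.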

Rosemary Pennington
So, I'm an American football fan and soccer fan, but grew up watching American football, and now I'm a big Bengals fan. And one thing that's been really interesting to watch over the last few years is coaches increasingly going for it on fourth down. That used to be something that was so rare, and then probably in the last two seasons, there has been, like, a real flurry of coaches who are, like, making that gamble and frequently winning, which is, you know, again, a surprise. And I notice on this syllabus that you have two lectures devoted to fourth down decision making in NFL football. And I would love to know what you are talking to your students about in those lectures.

Mark Glickman
Well, sadly, I mean, I could certainly explain it, but sadly, I don't actually get to that material. But that's a great example of, well, first of all, it's a great example of something that I really need to make the time to get to because, you know, basically, that's an example of decision making under uncertainty. And, you know, that's a clear example where analytics has, you know, just made such an enormous impact on strategy and coaching in American football. Because, as you point out, Rosemary, the conventional wisdom for, you know, for most of the life of football was that only in extremely rare situations would you ever go for it on fourth down. And so the analytics has shown that actually it's more often worth the gamble in certain situations where, you know, basically it would have been unheard of. And part of the calculation is that the way that you evaluate this kind of situation is that you look at all of the different decisions you can make, and then all of the potential outcomes and the probabilities associated with achieving each of those outcomes. And so it's fairly straightforward to do an expected value calculation in each of those situations, and that will end up guiding you to the decision you should select. And so it's a very basic application of decision making under uncertainty.
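
Editor's sketch: the expected value calculation described above (list the available decisions, the possible outcomes of each, and their probabilities, then pick the decision with the best expected value) might look like the following. Every probability and point value here is made up purely for illustration and does not come from real NFL data; the names `expected_value` and `best_decision` are likewise assumptions for this example.

```python
def expected_value(outcomes):
    """Expected value of one decision, given (probability, value) pairs."""
    return sum(prob * value for prob, value in outcomes)


def best_decision(options):
    """Return the decision with the highest expected value, plus all values."""
    values = {name: expected_value(outcomes) for name, outcomes in options.items()}
    return max(values, key=values.get), values


# Hypothetical fourth-down situation. Values are expected points, and every
# number below is invented for illustration only.
options = {
    "go_for_it": [(0.6, 3.5), (0.4, -1.5)],   # convert / fail to convert
    "field_goal": [(0.5, 3.0), (0.5, -1.0)],  # make / miss
    "punt": [(1.0, 0.3)],                     # treated as a single outcome
}

choice, values = best_decision(options)
print(choice, values)
```

The same structure works on any value scale: replacing the expected-points numbers with estimated chances of winning the game changes the inputs but not the calculation.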

John Bailer
So, I'm now picturing on the sideline this giant book of all these configurations with kind of probabilities there, because they're probably not, they're not doing real-time calculation of this, I assume.

Mark Glickman
Yeah, I don't think they're, I'm not even sure they're allowed. I don't think they're allowed to, yeah. Well, and what makes it actually even more complicated is, you know, even that technology has changed a bit, because, again, in the old days, and when I say the old days, I mean maybe, like, within the last 10 years, you know, the strategy would be to, you know, figure out, how does making one of these decisions, say, on fourth down impact what the likely score is going to be in some short amount of time? But the technology has really improved so much that the focus isn't so much on what the likely score is going to be, but what the probability is going to be of winning the game. So in other words, the focus now is on winning probabilities. And the reason that's very relevant is that if you're in exactly the same situation for fourth down, like, say you're on, like, your opponent's 30. You're on the 30 yard line with, like, you know, two yards to go, and it's fourth down. And we're talking about the first quarter of the game, where you're maybe up by seven. And you're now contrasting that to the same situation, but in the fourth quarter, with, like, two minutes to go. The winning probability implications are completely different, because in the first quarter, there's a lot more game left to affect what's going to happen in the rest of the game. So it's not a very high leverage situation, but in the fourth quarter, it's going to make much more of a difference. So it's not the score that ends up being, like, the crucial detail, it's the winning probability, and that's why so much attention in the last, you know, five years especially, has focused on developing good winning probability models as a function of all kinds of situations that you can imagine within a football game.

John Bailer
You know, there are other topics that you have in your class that I really thought were a lot of fun, because I think about these different sports. Some sports are very continuous and flowing. Others are, you know, very much state based. There's sort of this dynamism versus just stepping through a game in sequence, and I'm just, you know, when you talk about that, what are some of the issues that you bring up when you're describing these kinds of different types of games to students?

Mark Glickman
In the context of my course, we don't get too much into that, but certainly from a historical perspective, you know, one of the reasons why baseball was the first sport that basically got tackled, if you pardon my expression, you know, for analytical work, was that it's a very discretized game, and so it's very easy to analyze. There's, you know, very little data that accumulates within a game, at least in terms of just examining what the situation is like at the start of an at bat, and what happens at the end of an at bat. But like you said, getting into these sports like basketball and football and hockey and soccer, those are much more dynamic and much harder to analyze, and so those have really been on most people's back burner for a long time. Now, that doesn't have too much relevance in, like, a course that I teach, but certainly for other people who teach analytics courses, especially once they start getting into, like, say, player tracking, then it becomes very relevant, because, you know, really, the name of the game for these more dynamic sports is being able to, like, trace where players are at any given point in time, and being able to model the trajectories of their movement. There's been some really nice work by people like Luke Bornn and a lot of his colleagues, where they were clever enough to basically realize that you can model these different trajectories, which look completely different from each other, but you could represent them as, like, low dimensional curves. And so if you can represent them as low dimensional curves, I think these Bezier curves, then all you need to keep track of are, like, coefficients to the different terms in the model. And then all of a sudden, like, what started out being this mind-bogglingly complex trajectory all of a sudden becomes, like, a vector of just, like, maybe four or five numbers to represent, you know, an individual movement.
And then once you can do that, then you can start talking about modeling, you know, movements of players, abilities, and how people work with or against each other, and so that was a major breakthrough.
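
Editor's sketch: to make the idea of a low dimensional curve concrete, the snippet below evaluates a Bezier curve from a handful of control points, so that a whole path is summarized by a few coefficients. It is only an illustration of the general idea mentioned above, not the method used in the tracking research; the control points are invented, and the function name `bezier_point` is an assumption for this example.

```python
from math import comb


def bezier_point(control_points, t):
    """Point on a 2D Bezier curve at parameter t in [0, 1].

    control_points is a list of (x, y) pairs; the curve's degree is one less
    than the number of control points. Uses the Bernstein polynomial form.
    """
    n = len(control_points) - 1
    x = y = 0.0
    for k, (px, py) in enumerate(control_points):
        weight = comb(n, k) * (1 - t) ** (n - k) * t ** k
        x += weight * px
        y += weight * py
    return x, y


# Four invented control points describe an entire curved path on a court.
path = [(0.0, 0.0), (0.0, 2.0), (2.0, 2.0), (2.0, 0.0)]

# Sample the path at evenly spaced values of t.
samples = [bezier_point(path, i / 10) for i in range(11)]
print(samples[0], samples[5], samples[10])
```

However many tracking frames a movement spans, this representation stores only the control points, and the position at any moment can be recomputed from them.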

Rosemary Pennington
Well, that's all the time we have for this episode of Stats and Stories. I think both of us have a lot more questions. I could have talked to you all day, but thank you so much for being here.

Mark Glickman
My pleasure. Thanks for having me.

Rosemary Pennington
Stats and Stories is a partnership between Miami University’s Departments of Statistics, and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter, Apple Podcasts, or other places you can find podcasts. If you’d like to share your thoughts on the program, send your email to statsandstories@miamioh.edu or check us out at statsandstories.net, and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.