Alexandre Andorra is a Senior Applied Scientist for the Miami Marlins, a Bayesian modeler at the PyMC Labs consultancy firm that he cofounded, and the host of “Learning Bayesian Statistics,” a podcast dedicated to Bayesian inference. His areas of expertise include Hierarchical Models, Gaussian Processes and Causal Inference.
Episode Description
Sports analytics is a booming industry with new technologies allowing for the parsing of ever more sophisticated statistics. Analysts can now examine the height and the force of a gymnast's tumbling pass, the probability of going for it on 4th down in football actually working out, and the arc of the best swing for a baseball player. Analytics are also used in the conditioning of athletes, particularly for all the baseball players preparing for the start of the MLB's spring training. Sports analytics is the focus of this episode of Stats and Stories with guest Alexandre Andorra.
Full Transcript
————————
Rosemary Pennington
Sports analytics is a booming industry, with new technologies allowing for the parsing of ever more sophisticated statistics. Analysts can now examine the height and the force of a gymnast's tumbling pass, the probability of going for it on fourth down in football actually working out, and the arc of the best swing for a baseball player. Analytics are also used in the conditioning of athletes, particularly important for all the baseball players preparing for the start of MLB spring training. Sports analytics is the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington. Stats and Stories is a production of the American Statistical Association in partnership with Miami University's departments of Statistics and Media, Journalism and Film. Joining me, as always, is regular panelist John Bailer, emeritus professor of statistics at Miami University. Our guest today is Alex Andorra. Andorra is a senior applied scientist for the Miami Marlins. Formerly, he worked as a Bayesian modeler at the PyMC Labs consultancy firm he cofounded. Andorra is also the host of the podcast Learning Bayesian Statistics. His areas of expertise include hierarchical models, Gaussian processes and causal inference. Alex, thank you so much for joining us today.
Alexandre Andorra
Yeah, thank you so much for having me. I'm super excited, and I love the show, so it's really an honor to be here and that you folks invited me.
Rosemary Pennington
Don't shine John's ego more than you need to.
John Bailer
He's talking about you, Rosemary. How about the recent episodes where you have raised the game?
Rosemary Pennington
Alex, I know that you have worked as a political scientist or in political science before you moved into sports analytics. Why did you decide to make that leap? Because it seems like those are very different things.
Alexandre Andorra
Yeah, it's a good question, Rosemary. I'm not surprised. I could give you, you know, the Manifest Destiny answer that you have with backward bias, you know, hindsight bias. But the truth is, it was, and I guess it still is, extremely random. My career has been the result of random interests, persistence and luck. And so I started, indeed, with political science. That was my first love in the professional area. Actually, I was studying political science, and I was writing a book about the geopolitics of the US. At that time, it was 2016, a very consequential election. I remember it was July, and I discovered, while writing the book, Nate Silver's models for 538 at the time, and the nerd in me was awoken. I wasn't doing any statistics at the time. I was actually on a short-term contract at the French secretary of state, so, you know, working on foreign affairs, basically. But I saw the model, and I was like, oh, this is awesome. I need to do that for next year's elections in France. We had elections in 2017, also a big election, a very, even new one. And, yeah, so I spent the whole rest of the year 2016, until May 12th, 2017, learning Python and learning, again, Bayesian stats. So that's how I started in political science. Now, how I went into sports, this is still part of the story, but basically, with time, I started to realize I was more interested in the methods than in the field of political science. More and more, I became kind of disillusioned with the field, because I felt like it was more political than scientific, and also nobody in France was going to pay me for electoral forecasting, and I was broke.
I realized, actually, people would pay me for statistical methods and for statistical models, especially Bayesian models, because not a lot of people know about those and can do those. And I got very lucky, because I also started learning Python at the time, which was still kind of a niche language. Only the nerds knew it. I remember when I was telling my parents or my friends that I was learning Python, everybody was like, "What?" Now, if you mention Python to someone, I don't know, a lawyer, for instance, he'll know, or she'll know, what Python is. So I was very lucky in that sense, because the popularity of Python just burst through the roof at the time. And so that's when I started integrating into the PyMC core dev team, and then we started PyMC Labs. And this year, actually, I felt like it was a good time to get a new challenge, and working in a professional sports team has always been a dream for me. Since I was a child, I've always been very into sports. I've done plenty of sports. I've been good at none of them. So that was very sad, because I couldn't be a pro athlete in any sport I've tried. And now I live, actually, at a time where I'm lucky to be able to contribute to a sports team without being an athlete. What I know how to do and love to do, so Bayesian modeling and stats, I can apply to a sports team. And that's really awesome. And so, yeah, that's how, basically, I started in the sports industry.
John Bailer
That's a great story. Thanks for sharing that. I'm curious, are there things from your study of political science that apply directly to thinking about analytics and sports?
Alexandre Andorra
Yeah. I mean, in a sense, a lot, because data in both realms are imperfect, you know? Political science is way harder because this is social data, survey data. First, there isn't a lot. And second, it's really not reliable. Even the best ones we have are polls. I mean, RCTs, let's say, but then we have polls, and polls are the best thing we can do, but it's really bad. And so, in that sense, being very careful with your data, and actually modeling it, treating the data as just one node in your model and not the thing you use to directly get the results, is something you develop when you do electoral forecasting. And if I had to pick a second one, that would be hierarchical models, because there is not a lot of data, but there is a lot of structure and a lot of domain knowledge, so you have to put that into the model. Very often, these are hierarchical layers. You can share information between the different clusters, and that's very useful, because often you don't have enough data for a given cluster, like, let's say, a state of the United States, or a county where you don't have a lot of polls, but you can share what you learned from similar counties through the model. And that's extremely powerful. And in sports, especially baseball, which is extremely advanced in using data and using science, you have a lot of data. The amount of data is just incredible. This is like a candy shop. Each time I see the data, I'm just amazed by that. But it's not because you have a lot of years of data about a lot of players that you necessarily have a lot of data for each player. You have a lot of players who don't play a lot in the MLB, or they play a bit, then they disappear, then they come back. And the best players, they are not there for, you know, a long time either.
So, for each given player, you don't have that much data. And that's the important thing, even though, again, baseball is an extremely data-rich sport, and I'm not complaining at all. But that's something that you get from having worked in a very data-poor environment.
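The information sharing between clusters described in this answer is often called partial pooling. A minimal sketch of the idea follows, assuming a simple beta-binomial shrinkage rather than a full hierarchical model. The function name `partial_pool`, the `prior_strength` value and the hit counts are all hypothetical, chosen only for illustration, and nothing here reflects a model used by the Marlins or in electoral forecasting.

```python
import numpy as np

def partial_pool(successes, trials, prior_strength=50.0):
    """Shrink each group's raw rate toward the pooled rate.

    Assumption: a fixed prior_strength (in pseudo-observations) stands in
    for the group-level variance that a full hierarchical model would
    estimate from the data.
    """
    successes = np.asarray(successes, dtype=float)
    trials = np.asarray(trials, dtype=float)
    pooled_rate = successes.sum() / trials.sum()
    # Beta prior centered on the pooled rate.
    alpha = prior_strength * pooled_rate
    beta = prior_strength * (1.0 - pooled_rate)
    # Posterior mean of each group's rate under that prior.
    return (successes + alpha) / (trials + alpha + beta)

# Made-up data: three players with very different sample sizes.
hits = np.array([3, 45, 150])
at_bats = np.array([5, 150, 500])

raw = hits / at_bats
shrunk = partial_pool(hits, at_bats)
for n, r, s in zip(at_bats, raw, shrunk):
    print(f"at_bats={int(n):4d}  raw={r:.3f}  partially_pooled={s:.3f}")
```

The player with only five at-bats is pulled strongly toward the pooled rate, while the player with 500 at-bats barely moves, which is the behavior that makes this kind of model useful when individual clusters have little data.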
Rosemary Pennington
I wonder, is there some aspect of the sport that you're particularly excited to get your hands on the data of, whether it's batting or pitching or something? Is there something specific in the baseball data that is exciting you, or is it just everything?
Alexandre Andorra
Yeah, honestly, it's pretty much everything, because I don't know a lot about baseball, so I'm actually learning baseball through the job. I mean, I obviously have notions about baseball, but I'm working in a pro MLB team, so you can imagine the amount of knowledge that most of my coworkers have about the game, and I am nowhere near them for now. But I can learn about it very well and fast, thanks to them. And the data is also helping me. Working on the models is really helping me understand the dynamics of the game, mainly what makes a good player in some compartment that we're interested in. That's really what I get from working on these models. So, yeah, I think that's mainly that. You could say it's the excitement of the beginner, if you want.
John Bailer
That's a wonderful thing to possess. I think it's good to be a beginner at all points of your life.
Rosemary Pennington
I also want to point out that I also have notions of baseball. That's about as well educated as I am about the sport.
John Bailer
And I played for many years. That was something that I did do. And I'm curious, you alluded to one of the classes of problems or questions that you might address. Player evaluation is part of what you might do with analytics. What are some of the other kinds of questions that someone doing sports analytics might be working to answer for a professional baseball team?
Alexandre Andorra
So many,
John Bailer
Pick your favorites.
Alexandre Andorra
Yeah. I mean, of course, I would say the bread and butter would be player evaluation, because really, in the end, what matters for the team is being able to use the money you have as efficiently as possible, and buying players is an investment. So you want to be as sure as you can be that you want to buy this player or you want to sell this one, and why. Because it depends on, you know, your objectives, how you want to build the team, how the manager is talking with the GM and the rest of the team about that. So that, I would say, is really what I've seen being the most important thing. If you don't have that, you don't have the foundation, and the rest doesn't really matter. And it's not only in baseball, it's like that in a lot of sports. I know, John, you used to coach soccer, so you know soccer is really the same. And I would say it's even more important in soccer, for different reasons. We can talk about that a bit later if you want. So you need that foundation. And then there are so many data and new data sources that come in every year in baseball that you can really have a lot of fun. The latest arrival is joint data, right? You can track the knees, for instance, or the elbows of players, and you can basically recreate their path, their movement, either on the field or even during a pitch. So this is really cool. I don't remember the exact name, it's like biostats, or something like that. Biomechanics. I think it's biomechanical data. This is something some teams are pushing on right now, because this is the frontier. So if you're a biomechanician and you like baseball, I'm pretty sure you could get a job in the MLB if you wanted to. Computer vision is also something that I see used more and more.
Again, this is more at the frontier, but I think it gives you a panel of what you can do. Basically, you have so much data that, if I were given the power of building my R&D team, I would probably do something like a portfolio investment strategy, where I would want most of my modelers to work on the bread and butter stuff, really nailing down the player projection and making sure we get better every year. But I would also get some other modelers, maybe 20% of my portfolio, that do riskier things, more at the frontier. That takes more time, and it's not going to work a lot of the time, but it can give you some really good advantage compared to the other teams.
Rosemary Pennington
You're listening to Stats and Stories, and we're talking with the Miami Marlins' Alex Andorra. So you mentioned soccer a moment ago, and I know that you are doing research on soccer. Do you want to talk through what it is you've been working on?
Alexandre Andorra
Oh, yeah, sure. Always my pleasure to talk about soccer. I'm European, you know, so that is really the big sport over there. In France also, even though French people don't really like to say it because, you know, it's not very classy. They prefer to say it's tennis or rugby. But look at the numbers, and football is by far, by a factor of 10, more popular in France than rugby, which is the second sport by number of people actually playing the sport. So basically, I started working on a project that's completely open source, and we're working on a paper right now with Max Göbel, who is a junior analyst who just finished his PhD in economics at Bocconi University in Milan. We were curious about how you isolate the real skill of a player, trying to deconfound it from the skill of the team he's playing in. Because football, especially European football, or soccer, is extremely unfair. It's an extremely capitalist sport, in a way, where we don't have at all the mechanisms that US sports have to make the leagues more equal, to have a bit more suspense. European sports are very, very unequal, and the big teams are basically always the same. They don't always win, you know, but it's like 80 to 90% probability that PSG is going to win the league every year in France, for instance. And the rest of the continent is a bit the same. As usual, the English are a bit different, you know. But even there, the Premier League is getting more and more concentrated. And so it's very important in soccer to try and isolate the player effect.
And so that's what we try to do in this paper. We just submitted it to the MIT Sloan Sports Analytics Conference for 2025, so, you know, we'll see if we get selected, but if we do, then we'll be in Boston, probably in March. Basically, what the paper does is try to isolate that, and try to come up with the soccer way of computing a very cool metric that I really love in baseball that's called WAR, or wins above replacement. That's such a cool use of statistics to me, because it's telling you what the contribution of a player is compared to a replacement-level player, right? So not an average-level player, but actually a replacement-level player, so a player that is ready on the bench to take the place of that player if he gets injured or if he gets traded to another team. So that's usually a much better player than an average player, because obviously the replacement-level player in the MLB is really good. And so, basically, we're trying to replicate that in football, which is different because it's a much more continuous sport. So we've started isolating the problem with forwards. So center forwards, the ones who score the goals. And so, yeah, basically we did that, where we have the model which can isolate the contribution of the player and then the contribution of the team the player is playing in. That was a really cool project, very, very challenging. The main thing is, we're using Gaussian processes in this project, because I love them, and I thought it was very useful here. Basically, the idea is that you can decompose the time series of a player's career into at least three time components, right? You could think of a long-term component, which would be the aging curve of the player. Here you could imagine a parabolic shape, more or less. You know, they increase fast, and then they plateau, and then they decrease.
But they decrease less fast than they increase. Then, within each season of football, you could also imagine that they have a medium-term time component, right, where they get physically prepared, often to peak at some point in the season. In Europe, it's usually for February, March, when the biggest games happen. So usually they start slow, and then they get better, and then they can drop, also quite a lot, because they can get injured, it can get, you know, too intense, too many games, stuff like that. And then you can have a third Gaussian process, which is basically your garbage can of short-term variation that is not explained by the other components. So basically, that's what we're doing in this project. And that's pretty cool, because then we can get some cool plots that we have in the paper, where we can differentiate two things that we came up with, something we call the skills above replacement and the performance above replacement. So skills above replacement is really, if all players were playing in the same team, what would we expect their scoring rate to be? It's what you want to know if the playing field is leveled, so who is the best, per se. And then performance above replacement is the SAR, so skills above replacement, but taking into account the team the player is playing in. That gives you an indication of how the team is elevating or, on the contrary, downgrading the performance of the player you're interested in. And to give you some of the headline results, when we did that, interestingly, the best players, Messi, Ronaldo, Erling Haaland, all of them have, you know, the best SAR. They are the best of the best. But they get an even better PAR, so performance above replacement, metric, because they're also playing in teams which are just incredible.
So that means the teams they were playing in were actually allowing them to score even more goals than they would have scored in an average team.
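The three-timescale decomposition described in this answer can be illustrated with a small simulation. What follows is a minimal sketch, assuming squared-exponential kernels and made-up lengthscales and amplitudes. It is not the model from the paper, which is not specified in this conversation beyond its use of Gaussian processes, and it only draws from a prior rather than fitting any data. The helper `rbf_kernel` and the component names are hypothetical.

```python
import numpy as np

def rbf_kernel(t, lengthscale, amplitude):
    """Squared-exponential covariance matrix over the time points t."""
    diff = t[:, None] - t[None, :]
    return amplitude**2 * np.exp(-0.5 * (diff / lengthscale) ** 2)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 200)  # hypothetical career time axis, in seasons

# One kernel per timescale; all values below are illustrative assumptions.
components = {
    "long_term": rbf_kernel(t, lengthscale=5.0, amplitude=1.0),       # aging curve
    "within_season": rbf_kernel(t, lengthscale=0.5, amplitude=0.5),   # form over a season
    "short_term": rbf_kernel(t, lengthscale=0.05, amplitude=0.2),     # leftover variation
}

# Small diagonal jitter keeps the covariance matrices numerically well behaved.
jitter = 1e-6 * np.eye(t.size)
draws = {
    name: rng.multivariate_normal(np.zeros(t.size), cov + jitter)
    for name, cov in components.items()
}

# The sum of independent GPs is itself a GP whose kernel is the sum of the kernels.
latent_form = sum(draws.values())
total_cov = sum(components.values())
print(latent_form[:5])
```

Because the components are independent, their variances add, so the relative amplitudes control how much of a player's variation each timescale is allowed to explain.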
Rosemary Pennington
I was thinking of, like, players that I really loved watching, like Diego Forlán, who played for Uruguay, and Mesut Özil, who played for Germany and Arsenal, and thinking about how they, at times, can be so beautiful on the pitch and play so well, and then other times they can look like complete garbage. And I'm just thinking about the makeups of the teams when they were playing well and when they weren't playing well, just in my head, trying to think, oh, maybe that was a garbage national team that year, and it sort of brought down the performance of the player. So I think that's really interesting.
John Bailer
So, were you restricting your data to particular leagues and certain seasons within these leagues?
Alexandre Andorra
Right. So Max actually did that. He came up with a way to scrape all the data from a German website called Kicker, and he managed to get all the data from the four biggest leagues. So all the leagues, except for France. Obviously, we must have been on strike that day. But yeah, so basically, all the four big leagues. And I think we have almost 24 seasons for each league. So, yeah, the data set is really big. That's a really cool data set, and we have that game by game. So it's not, you know, just one row for one season. It's game by game: which team met whom, what was the result, who scored. That's a really cool data set. And Max, you know, came up with a way to scrape all that and save that and clean that. And so that's all in the GitHub repository for people who want to have fun with that.
John Bailer
So I'm really curious, thinking about, you know, 24 seasons of data, who were the players with the highest PARs or equivalent values, you know, 20 years ago or 10 years ago, and how do they compare to the players with the highest values today?
Alexandre Andorra
Right? Yeah, that's a cool thing to do that we didn't do yet.
John Bailer
I'm sure there's a lot. That's a really neat problem.
Alexandre Andorra
You know what? I think that we'll get to it for sure, because that's really fun. For that, we would need to compute the PAR for each season, which is definitely doable, because we have everything in the model to do that, and we have the Gaussian processes. So what I can tell you is that, I mean, definitely Messi and Ronaldo are, like, really above and beyond, you know, shoulders above the other ones. And then you have other ones where you're like, oh, yeah, that's normal, he's there. Alan Shearer, for instance, for the Newcastle fans. Yeah, Alan Shearer is among the best ones, of course. Diego Forlán, I don't think he's in there.
Rosemary Pennington
I was also thinking about Özil.
Alexandre Andorra
Oh, yeah, yeah. So he's not in there, but he didn't score a lot. He was more of a midfielder. He did score more, for a midfielder, but yeah, he's not in the forwards. But yeah, there are some, like one of my favorites, Edinson Cavani, Uruguayan, and also a legend of PSG. He's in there. But, yeah, you can see it's not Messi level. But I was happy he was at least above replacement.
John Bailer
I see Pelé. I want you to take a time machine back and get all the Pelé data in this. And, you know, the other thing that immediately comes to mind for me is, if this works out, it immediately would generalize to other games that have that same kind of continuous flow, particularly low-scoring games with the same context of who you're on the pitch with, or the ice with. So hockey is a pretty natural one. If this works out in this context, I could see it generalizing pretty nicely to that type of setting. Are there other ones that you could envision?
Alexandre Andorra
Yeah. I mean, there are way too many things I want to do. This is the problem. But yeah, Max and I are already thinking about extending that. An obvious thing we can do is improving the model we have right now. I have several ideas. Right now, we have one GP for all the players, I mean, three GPs for all the players. I think we could actually have three GPs per player, but also sharing information between these GPs, so having a hierarchical structure in there. And also, I would like to see if the model could do some clustering at inference time, so being able to tell you which are the elite players and the replacement-level players. Right now, it doesn't do that. You have to do that afterwards, which is also cool, because I think it's very good for a recruiter, for instance, a decision maker in a football club, to understand the model and use it, which is actually what you want, because if the model is too complicated, they are not going to use it, and then, well, it's money down the drain. So there is that. And of course, all the other positions that we're interested in, right? We only did forwards for now. But the problem with football is that you cannot have only one WAR metric. You have to have a different WAR metric for each position. So when we get to that, I think we'll go to goalkeepers, because that's going to be the easiest next one, and then defenders, and then the hardest is going to be midfielders, for sure. So we're thinking about that. I mean, if that's of interest to people, and we can publish stuff and people are interested in that, I can definitely work on that, you know, on the side, for years. That's fun. That being said, if people are interested in joining that effort, feel free to reach out.
I always love talking to motivated people.
Rosemary Pennington
Well, that's all the time we have for this episode of Stats and Stories. Alex, thank you so much for being here. It's great.
Alexandre Andorra
Yeah. Thank you so much, folks. I mean, I hope I didn't give you too long answers, but yeah, that was a pleasure. Really, thank you for all the work you do. And I'm really impressed by the setup. It's super professional. When I record an episode, I'm gonna feel like an imposter.
Rosemary Pennington
Stats and Stories is a partnership between the American Statistical Association and Miami University's departments of Statistics and Media, Journalism and Film. You can follow us on Spotify, Apple Podcasts or other places where you find podcasts. If you'd like to share your thoughts on the program, send your email to statsandstories@miamioh.edu and check us out at statsandstories.net, and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.
Transcribed by https://otter.ai
Stats and Stories is a partnership between Miami University’s Departments of Statistics, and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter, Apple podcasts, or other places you can find podcasts. If you’d like to share your thoughts on the program send your email to statsandstories@miamioh.edu or check us out at statsandstories.net, and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.
————————