The Wonders of Astrostatistics | Stats + Stories Episode 192 / by Stats Stories

Jessi Cisewski-Kehe is an assistant professor in the Department of Statistics at the University of Wisconsin-Madison. Her research focuses on methodological development for complicated datasets for which closed-form models and likelihood functions are not able to fully capture the desirable and interesting features of the observations. Statistical challenges in astronomy, astrophysics, and cosmology (i.e., astrostatistics) are a primary focus of her work.

Chad Schafer is a professor in the Department of Statistics & Data Science at Carnegie Mellon University. Since his Ph.D. work at the University of California at Berkeley, he has worked on statistical challenges that arise in astronomy, with a particular focus on the handling of complex estimation problems. He is currently involved with the Legacy Survey of Space and Time, to be conducted at the under-construction Vera C. Rubin Observatory, co-chairing its Informatics and Statistics Science Collaboration since 2015.

Full Transcript

Rosemary Pennington
The universe seems unbelievably vast, a sky filled with countless stars and worlds. Well, maybe not so countless, as there's a whole field devoted to crunching the numbers associated with an ever-expanding universe. Astrostatistics is the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington. Stats and Stories is a production of Miami University's Departments of Statistics and Media, Journalism and Film, as well as the American Statistical Association. Joining me is panelist John Bailer, Chair of Miami's Statistics Department. Richard Campbell is away today. We have two guests joining us today to talk about astrostats. Jessi Cisewski-Kehe is an assistant professor in the Department of Statistics at the University of Wisconsin-Madison. Her research focuses on methodological development for certain kinds of complicated datasets. Statistical challenges in astronomy, astrophysics, and cosmology are a primary focus of her work. Chad Schafer is a professor in the Department of Statistics & Data Science at Carnegie Mellon University. Schafer works on statistical challenges that arise in astronomy, with a particular focus on the handling of complex estimation problems. He's currently involved with the Legacy Survey of Space and Time, to be conducted at the under-construction Vera C. Rubin Observatory, co-chairing its Informatics and Statistics Science Collaboration since 2015. Jessi and Chad, thank you so much for being here today.

Chad Schafer
Thank you for having us. Thank you.

Rosemary Pennington
So John's dad listens to the show. And I wonder if you could take a moment: if you had to describe to John's dad what astrostatistics is, how would you explain it to him?

Chad Schafer
What I tell people is that modern astronomical experiments, you know, observatories, are gathering massive amounts of information on the observable universe. If we look at something like the Sloan Digital Sky Survey, it has observed almost half a billion galaxies and stars in the universe and measured their properties. So there is a lot of work needed in order to process and analyze this data to actually learn something about the universe scientifically. There was a time when astronomy was about studying individual objects out there in the universe. And there is still some of that work being done. But by and large today, the field of astronomy is focused on studying large samples of objects in the universe, studying their properties and learning what we can about the universe from them.

John Bailer
So Jessi, would you expand on that? Or do you have any kind of different take on that?

Jessi Cisewski-Kehe
Yeah, so how I often describe astrostatistics is just simply using statistical methodology for problems in astronomy, cosmology, and astrophysics. But yeah, Chad's was wonderful, far more detailed. I would say that there are some problems where we still study individual objects, things like the detection of exoplanets, where we might be observing, for example, a single star and trying to decide if a planet is orbiting it. But yeah, there's definitely a focus on massive amounts of data that need to be processed, analyzed, summarized. Another aspect of working in astrostatistics is that in the field of astronomy, there are many different types of data as well. So you'll have your usual point cloud data, maybe like the locations of galaxies in the universe, but there are also images, there are functions, and things of that nature. So there's a nice variety and plenty of room for statisticians to contribute.

John Bailer
Can I just do a quick follow-up for both of you? Could you describe just one question that you've worked on, and then a little bit about the kind of data that's collected to address that question?

Chad Schafer
I think one of the most fundamental types of questions we address is the estimation of so-called cosmological parameters. People are often surprised that a lot of the questions about the universe have been boiled down to the estimation of a handful of real-valued parameters, for instance, the Hubble constant: what's the current rate of expansion of the universe? So you can use observations of the so-called cosmic microwave background radiation, for example, to constrain those parameters. It sort of sets up a classic statistics problem from a statistical inference standpoint, but the interesting aspects come in the complexity of the relationship between the parameters you want to estimate and what you actually get to observe. You know, the mapping, if you will, that relates those parameters to what you get to observe is based on a lot of cosmological theory and is very complex, and working with those models presents a lot of statistical challenges. So that's actually the sort of problem that got me started: analyzing cosmic microwave background radiation data for the purposes of parameter estimation. So Jessi, how about you?

Jessi Cisewski-Kehe
Yeah, one of my first astrostatistics projects was really just trying to get a map of the distant universe. And so why is that hard? Well, you know, how do we observe anything in the universe? We have to measure something like the light output, or we have to be able to see the objects or somehow detect them. But the farther the objects are from us, the harder they are to see. And so at some point we stop seeing stars, at some point we stop seeing galaxies, or we only see very, very bright galaxies, for example. But astronomers, as it turns out, are very clever, and they have these different ways of observing different parts of the universe. And so they have this type of data that's known as the Lyman-alpha forest. So what is that? Well, the light output from very distant galaxies, actually, I shouldn't say galaxies, I should say quasars. Quasars are some of the most luminous objects in the universe, and so we can see them very far away, and the light that the quasar lets off then travels through the universe. And it turns out that as it travels through the intergalactic medium, so the gas and the stuff between us and the quasar, that stuff leaves an imprint on the light that we can measure from the quasar. And what we actually observe from that light is a spectrum. So you can just imagine it's a function, with wavelength on the horizontal axis and something like the brightness on the vertical axis. And so we have these functions, and we are able to pull certain properties from each function that can tell us something about the distribution of matter along the line of sight to the quasar. So you get this kind of interesting collection of functions that help us to understand regions of the very distant universe.
And so from that information, I've worked on trying to just produce a map, right? It's just like if you had a collection of points on the Earth and you wanted to produce a map of some aspect of it, let's say the temperature of the Earth, and you have the temperature at different locations. In this case, we have something that's like the amount of matter at certain points in a 3D region of the universe, and then we just created a map of that. And so we can then get a picture of what the distant universe looks like. It ends up being something analogous to the distribution of neutral hydrogen in that region.
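To make the temperature-map analogy concrete, here is a minimal sketch of turning scattered 3D measurements into a value at an arbitrary location. It uses inverse-distance weighting, which is chosen purely for illustration and is not described in the episode as the method used in this work. The function name `idw_estimate`, the sample positions, and the values are all hypothetical.

```python
import math

def idw_estimate(points, values, query, power=2.0):
    """Inverse-distance-weighted estimate at `query` from scattered 3D samples."""
    num = 0.0
    den = 0.0
    for p, v in zip(points, values):
        d = math.dist(p, query)
        if d == 0.0:
            return v  # query coincides with a sample, so return that sample's value
        w = 1.0 / d ** power  # nearer samples get larger weights
        num += w * v
        den += w
    return num / den

# Hypothetical samples: (x, y, z) positions and a density-like value at each one.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
values = [1.0, 2.0, 3.0, 4.0]

# This query point is equally far from all four samples, so the estimate is their average.
center = idw_estimate(points, values, (0.5, 0.5, 0.5))
```

Evaluating the function on a regular grid of query points would give a map. A real analysis would also need to account for measurement noise and for the fact that the data arrive as functions along lines of sight, which this sketch ignores.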

Rosemary Pennington
Something that comes up a lot, you know, so I'm a nerd, and grew up semi-obsessed with science and astronomy stuff. And something that comes up a lot when I read is this idea of noise, that there is noise in data. And I know it's something that's used in statistics broadly. But I wonder, when we're talking about astrostatistics and the kinds of things that you are measuring, how do you conceptualize noise in this situation? And how do you control it, when it seems like there's so much that's still kind of unknowable about the things that you need to measure?

Jessi Cisewski-Kehe
Yeah, no, noise in astronomy data can be very challenging, because sometimes you don't know: is it just this kind of random fluctuation? Is it due to the instrument? Is it due to these physical phenomena that we just don't know enough about? And it ends up being correlated with, perhaps, time. So I'd say, yeah, dealing with noise can be quite challenging. A number of astronomers, if you ask them, depending on what field in astronomy, will tell you that one astronomer's noise is another astronomer's signal. So, just as a brief example: I mentioned this already, but I work with exoplanets, and exoplanets are planets orbiting stars outside our solar system. If we want to detect exoplanets, there are a number of different methods for doing it. But one aspect of exoplanets is that they're orbiting a star, and a star is not just a solid ball. It's plasma, there's activity, there's lots going on. And so when you're trying to detect an exoplanet, the star's activity is actually more of a noise and a hindrance. But for those who study stars, that's what they're interested in, and so they want to understand those properties. So how you understand or conceptualize noise really depends on what one is working on, and on a lot of different things.

Chad Schafer
In terms of statistical problems, people often ask me, you know, what are the common statistical problems within astrostatistics, and really the methods that are used run the gamut of all sorts of things. But the handling of what we would refer to as measurement error is a big challenge within the field of astronomy, and Jessi was just referring to that. And this is one dimension in which I feel that astrostatistics really pushes the boundaries of what statistical methods can handle. You know, we think about all the measurements that are taken: they tend to have error bars attached to them, and we assume that they're reliably measured, although we should take into consideration that they can definitely be off. But they have complex distributions, and they're typically heteroskedastic. How is that incorporated into the downstream analyses? That is another thing that people working in astrostatistics are constantly taking into consideration.
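As a small illustration of carrying heteroskedastic error bars into a downstream estimate, here is a sketch of an inverse-variance weighted mean. It assumes the simplest possible case (independent Gaussian errors with known standard deviations), which is far more benign than the complex error distributions mentioned above. The function name and the numbers are hypothetical.

```python
import math

def weighted_mean(values, sigmas):
    """Inverse-variance weighted mean and its standard error.

    Assumes independent measurements whose error bars (sigmas) are known.
    """
    weights = [1.0 / s ** 2 for s in sigmas]  # precise measurements count more
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    return mean, math.sqrt(1.0 / total)

# Hypothetical repeated measurements of one quantity, each with its own error bar.
values = [10.0, 12.0, 11.0]
sigmas = [1.0, 2.0, 0.5]

mean, se = weighted_mean(values, sigmas)
```

The combined standard error is smaller than any single error bar, and the estimate is pulled toward the most precise measurement. When the error distributions are non-Gaussian or correlated, this simple weighting no longer applies.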

John Bailer
Yeah, as I was reading some of the background papers, and there are some really nice general papers in places like Significance and CHANCE where these are discussed, I found myself thinking more about detection methods, and the idea that there's almost this race between increasing precision in measurement and what that means for what you have to do in terms of the analysis. And that sounds really tricky. It seems like maybe it's sort of job security for working in astrostatistics, because the measurement instruments continue to evolve, and maybe even the strategies and what you measure seem to evolve. So how does this interplay work, between how well you can measure and what you do with those measurements?

Chad Schafer
You know, one thing I often repeat, probably too often, is that we are shifting from variance-dominated challenges in astronomy to bias-dominated ones. And what I mean by that is, when you have a small sample, you're mostly concerned with variability in the resulting estimates, but as you get larger samples with higher precision, you start to run up against the limitations of your misspecified models and the biases that result from that. And that's the generation of problems that we are facing now. And the question that surveys such as LSST will put to us is: how do we do inference? How do we get the most out of this data in cases where the sample sizes are so large that any model you assume can be shown to be a poor fit to the data?
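The shift from variance-dominated to bias-dominated error can be seen in a toy simulation. The setup below is an assumption made for illustration, not an example from the episode: data come from a skewed Exponential(1) distribution, but the analyst wrongly assumes symmetry and uses the sample mean to estimate the median. The variance shrinks as the sample grows, while the bias from the wrong model does not.

```python
import math
import random
import statistics

def bias_and_variance(n, reps, rng):
    """Simulate `reps` samples of size n; return (squared bias, variance) of the estimator."""
    true_median = math.log(2.0)  # median of an Exponential(1) distribution
    estimates = []
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        # Misspecified model: symmetry is assumed, so the sample mean stands in for the median.
        estimates.append(statistics.fmean(sample))
    bias = statistics.fmean(estimates) - true_median
    return bias ** 2, statistics.variance(estimates)

rng = random.Random(0)  # fixed seed so the run is reproducible
small = bias_and_variance(n=5, reps=1000, rng=rng)    # variance is the larger term
large = bias_and_variance(n=500, reps=1000, rng=rng)  # squared bias is the larger term
```

With five observations the estimator's variance exceeds its squared bias. With five hundred the variance is negligible and the error is almost entirely bias, which no amount of additional data removes.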

John Bailer
You know, so you mentioned this large-scale structure problem. Is that what you were referring to when you said LSS, or did I miss it?

Chad Schafer
Yes, I should be more careful. LSST stands for the Legacy Survey of Space and Time. Oh, sorry. There is another notion, so-called large-scale structure, which is a field that is studying, as the name implies, the structure of the universe. But the Legacy Survey of Space and Time is a survey that's going to be conducted on a telescope that's currently under construction in Chile, and it's going to completely revolutionize our understanding of the universe. Why? Well, it's, again, just orders of magnitude more data. Okay. It'll observe 20 billion galaxies. That's the projection. Wow. How do you crunch any of the data from that? Well, that's a big question. You know, obviously, to a certain extent, we're looking 10 years into the future, and the computing power that will be available then is, you know, difficult to conceive of at this time. And we assume that we will be able to do large-scale analyses of this data. But that's a lot of what people working in astrostatistics and other scientists in these areas are focused on: answering those questions of how to do the analyses in a proper way, to take these large datasets into consideration and not discard useful information in the process. So those are definitely the big challenges in the field these days.

Rosemary Pennington
You're listening to Stats and Stories, and today we're talking about astrostatistics with Jessi Cisewski-Kehe and Chad Schafer. Jessi, you've mentioned exoplanets a couple of times. So I'm wondering if you could just describe what an exoplanet is. And I think in some of your answers, you might have kind of danced around how you find them, but maybe spell it out: how do you find them?

Jessi Cisewski-Kehe
Yeah, absolutely. So an exoplanet is a planet that orbits a star that's not our Sun. So it's outside our solar system. The first exoplanet discovered orbiting a Sun-like star was 51 Pegasi b, in 1995. And that discovery actually led to the Nobel Prize in Physics for the discoverers in 2019. So in terms of astronomy timelines, exoplanet astronomy is a relatively young field, and there are different ways of detecting exoplanets. Two of the more popular ways, in the sense that they've discovered more exoplanets, are what's called the radial velocity method and the transit method. So I'll describe both, because I think they're easy enough to describe and hopefully will make sense. The discovery of 51 Pegasi b, the first planet orbiting a Sun-like star, used the radial velocity method. And what that method does is it looks for a wobble in the star. If there is an object orbiting a star, it can shift the star kind of off the center of mass; the more massive the object, the more that star will wobble. And so what astronomers do is they measure on different nights. In the ideal situation, a star would be observed every night for years, but that's not feasible; it might just be a few times a month, for several months a year. But with each observation, they get a spectrum from the star, and that gives a distribution of light. So again, you have wavelength on the horizontal axis and something like brightness on the vertical axis. And from that spectrum, there are ways of observing a redshift or blueshift. I can go into those details, but that might be too much information without pictures. So just say that from that spectrum, we can measure if the star is moving, and then from that motion, how much the spectrum is shifting left and right, we get an estimate of the radial velocity of that star for the night. And then if you plot the radial velocity across time, you look for a particular pattern in it.
So if a planet, for example, is orbiting the star, you'd see something kind of like a sinusoidal signal in the radial velocity data. And so you can fit a model to it and try to see if there's a planet present, but it ultimately comes down to trying to measure a systematic wobble in the star that we'd expect if there is an object orbiting. So that's one approach. And then there's the transit method. If you've ever heard of NASA's Kepler telescope, Kepler uses the transit method, and that's where, instead of looking at the distribution of light of a star, it monitors the total light output of the star across time. And so it's not as detailed information from each star, but because of that, it turns out, you can get a lot more measurements. And so Kepler was monitoring a field within the Milky Way for a number of years. And what you look for then is the brightness of the star having these little dips in it, and the dips would indicate that something has transited, or crossed in front of the star in the line of sight of the telescope. And so if a planet is orbiting a star, you'd see it block out, potentially, just a little bit of the light. So, pretending that the light output of the star was just a flat line, you'd see a little dip. And what you'd expect to see then, if there is a planet, is that dip occurring periodically, with a particular period. And so if you see that same dip, let's say, three times, spaced about the same time apart, then that could suggest that there's actually an object orbiting; it could be a planet. And from the shape of that dip, you can measure some properties of the exoplanet as well. So the transit method and the radial velocity method have been very successful at detecting exoplanets.
At this point, I believe there have been over 4000 confirmed exoplanets discovered, and most of them are from the transit method. And then the second most would be from the radial velocity method.
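The "fit a model and look for a sinusoid" step of the radial velocity method can be sketched as a least-squares fit of a sine curve over a grid of trial periods. Everything here is an illustrative assumption rather than the analysis used by any particular survey: a circular orbit, synthetic data, and hypothetical values for the period (4.2 days), amplitude (50 m/s), and noise level (5 m/s).

```python
import math
import random

def solve3(A, y):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, y)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):  # back substitution
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

def fit_sinusoid(times, rv, period):
    """Least-squares fit of a*sin + b*cos + c at one trial period; returns (coeffs, rss)."""
    rows = [(math.sin(2 * math.pi * t / period),
             math.cos(2 * math.pi * t / period), 1.0) for t in times]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    y = [sum(r[i] * v for r, v in zip(rows, rv)) for i in range(3)]
    coeffs = solve3(A, y)  # normal equations
    rss = sum((v - sum(c * x for c, x in zip(coeffs, r))) ** 2
              for r, v in zip(rows, rv))
    return coeffs, rss

# Synthetic data: 40 irregularly spaced nights over 100 days (all values hypothetical).
rng = random.Random(1)
true_period, true_amp = 4.2, 50.0
times = sorted(rng.uniform(0.0, 100.0) for _ in range(40))
rv = [true_amp * math.sin(2 * math.pi * t / true_period + 0.7) + rng.gauss(0.0, 5.0)
      for t in times]

# Try periods from 2 to 10 days and keep the one with the smallest residual sum of squares.
periods = [2.0 + 0.01 * k for k in range(801)]
best_period = min(periods, key=lambda p: fit_sinusoid(times, rv, p)[1])
(a, b, c), _ = fit_sinusoid(times, rv, best_period)
amplitude = math.hypot(a, b)  # size of the wobble, in the same units as rv
```

In this idealized setting the grid search recovers a period and amplitude close to the values used to generate the data. Real radial velocity data add eccentric orbits, multiple planets, and stellar activity, so a pure sinusoid is only a starting point.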

John Bailer
You know, one thing that's really striking to me when I look at this work is that there's this vastness of scale, with the data and with the objects that you're describing. And, you know, when people talk about quantitative literacy, there's often the idea of statistical literacy being part of it, understanding functional relationships being part of it, and understanding magnitudes. How do you communicate this idea of just the vastness of information that's being collected, or just the size of the systems that you're studying? You know, Chad, I think half a billion just tripped off your tongue as you were describing this. But how do you talk about this in ways that try to bring people along and help them gain intuition and insight?

Chad Schafer
It's difficult. One thing I would stress, and this is an overgeneralization, is that I like to think of common problems in astrostatistics as going through a sequence of stages. And one of the early stages is taking these massive collections of data, maybe you have a catalogue of galaxies that has tens of thousands, or maybe hundreds of thousands, of galaxies in it, and then coming up with some useful summary of its properties. Something I might refer to as a canonical parameter, a natural parameterization of this population, and its estimate would be a natural summary statistic, is what I'm trying to say. So, if I'm discussing these problems with statisticians, that's often how I try to describe them: we start with these massive data sets, and they are undoubtedly very large. But ultimately, if you think about the problem from a scientific standpoint and also a statistical standpoint, you can come up with a useful summary statistic, which serves as an intermediate point prior to the ultimate estimation of these cosmological parameters, for instance. So thinking about the massive amount of data can be overwhelming, and it can be very difficult to think about how you would actually do the analysis. But when you take it in steps, when you think about it in stages, I think that helps with intuition.

John Bailer
Jessi, I have a question about the role of simulation and measurement here. I mean, it seems like there are competing ideas about how the universe is built and how it works. You know, this is me, so I'm kind of skating on the thinnest ice that I can find right now. I'm gonna just bring out all of my dirty, ignorant laundry here. So if you have these different ideas, you're interacting with colleagues who say, well, it could be this, it could be this. And then you may end up simulating some of these systems based on some of the assumptions about the structure they have. But then you may also be collecting data. And I'm just wondering, how do you deal with both the idea of simulating systems and the observed characteristics of those systems? And how does one inform the other?

Jessi Cisewski-Kehe
First, simulations can play a very important role in astronomy and in astrostatistics. The issue is that getting simulations that are realistic enough to really match, or even come close to, what might be going on in our actual universe is very challenging, because if you want a lot of detail, you can imagine the simulation cube has to be small. But if the simulation cube is small, you're not really getting the big picture of what's going on in the universe. So these big cosmological simulations typically will leave out baryonic matter, which would be our usual matter, like galaxies, stars, the stuff we're made of, and just use dark matter. But when you are interested in things like galaxy formation and that sort of thing, you have to, of course, build that in, build in the baryonic physics, and then the simulations end up being smaller. And so simulations are important because they can give us a way of checking some aspects of what we're exploring. But they're definitely not a replacement for working with real data. But then the other problem is that with our real data, we have one universe that we can observe. And so if we try to do, you know, any sort of, let's say, large-sample theory, we don't have the repetitions that we typically rely on. Well, I guess you can cut up the universe and look at different parts; there are things you can do, of course, to mimic that. But if you're interested in things like the large-scale structure, you're often using cosmological simulations to see what happens if we tweak some cosmological parameters: how does that change what the large-scale structure looks like? So if you change some property of dark matter, what does the resulting cosmic web look like at our present time? And so using simulations can help us to say, well, what if the universe had some different underlying cosmological model, what happens?
But then we end up leaving out some important physics that could completely change the answers if we had it in.

Chad Schafer
Yeah, and just building on that, I completely agree with what Jessi is saying. But as I was saying earlier, as we get more and more data, the simple models that people may have relied upon start to show their flaws. Right, if we have enough data, you can show that maybe these normal approximations that you were using are no longer a good fit. So on smaller-scale problems, maybe, than the ones that Jessi was referring to, simulation models can be a very useful way of getting around that issue. So instead of assuming a classic normal distribution, or a certain parametric distribution, for your data, the things we typically do in a Statistics 101 course, we could use a simulation model instead. So you could think about it this way: as you vary the input parameters to the simulation model, you can generate datasets and compare them with what you have actually observed. So these sorts of techniques, including things that Jessi and I have worked on together, are becoming more commonly utilized in astrostatistics and in other fields. And again, this is for the reasons that I was describing: with the amount of data that we have these days, and the sorts of questions that people are asking of the data, making simple or simplified assumptions is no longer going to be adequate. So simulation models can be a useful tool in that way.
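One family of techniques that matches this description (vary the simulator's inputs, generate datasets, compare with the observations) is approximate Bayesian computation, or ABC. The episode does not name the method, so treating it as ABC is an assumption. The sketch below is a minimal rejection sampler in which the "simulator" is a trivial stand-in, and the prior range, tolerance, and summary statistic are all hypothetical choices.

```python
import random
import statistics

def simulate(theta, n, rng):
    """Hypothetical stand-in for an expensive forward simulator with input parameter theta."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def abc_rejection(observed, n_draws, tolerance, rng):
    """Keep parameter draws whose simulated summary lands close to the observed summary."""
    obs_summary = statistics.fmean(observed)  # summary statistic: the sample mean
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, 10.0)  # draw a candidate from a flat prior on [0, 10]
        fake = simulate(theta, len(observed), rng)
        if abs(statistics.fmean(fake) - obs_summary) < tolerance:
            accepted.append(theta)  # simulated data resemble the observations
    return accepted

rng = random.Random(42)
observed = simulate(3.0, 50, rng)  # pretend these are the real observations
accepted = abc_rejection(observed, n_draws=5000, tolerance=0.1, rng=rng)
posterior_mean = statistics.fmean(accepted)
```

The accepted draws approximate the posterior distribution of the parameter without ever writing down a likelihood. The choice of summary statistic and tolerance drives both the accuracy and the computational cost, which is where much of the practical difficulty lies.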

Rosemary Pennington
Chad, you mentioned this big observatory that is being built, which sort of suggests the potential for what could come from that. And I wonder, for both you and Jessi, as we're wrapping up our conversation, what are the things that you see on the horizon in this particular field that are exciting you or getting you jazzed about the work that you do?

Chad Schafer
Well, certainly LSST, this survey that's going to be conducted. As I said earlier, it's going to completely revolutionize our understanding of the universe. I mentioned the growth in the number of galaxies that will be observed. Another crucial aspect of this experiment is that it's going to measure things over time, to an extent that has not been done before. So we're actually going to be able to track individual objects and track how the sky is changing. It's even going to be used to track the movement of asteroids near Earth. So the time dimension that this is going to give us is going to be a tremendous source of additional scientific information. I'm involved with this project, so I tend to focus on it. But I do believe that in 10 years' time, this is going to have led to a huge number of additional discoveries regarding the universe. And that's certainly something that I'm excited about. I'm very much looking forward to taking part in those analyses.

Jessi Cisewski-Kehe
So one of the areas that I've been focusing on recently is detecting low-mass exoplanets. I should also say that this is using the radial velocity method, so basically trying to discover Earth analogs. When I started working on this, gosh, maybe five years ago, from a number of discussions with astronomers and the way it was described, I thought this was going to be such an easy problem. Yeah, famous last words. Because the way it was described, the issue with uncovering low-mass exoplanets, so just the smaller exoplanets, is that they leave a very small signal. And when you get down to the signal size that an Earth analog would induce, for the kind of radial velocity data that you'd get from an Earth analog orbiting a Sun-like star, that scale gets down to the scale of a lot of activity on the surface of a star. So things like sunspots, which you may have heard of; there are starspots on other stars. And it turns out that sort of activity can leave imprints that can actually look like an exoplanet. So the signal can mimic a signal from a planet. But it's different; there are reasons why they're different. And so ultimately, from initial discussions, I was like, oh, this is just a simple signal separation problem, we have all these things in statistics that could be applied. But that's just not the case; it is a very challenging problem. Because with the amount of information within the stellar spectrum, it's not completely clear yet how to use that in order to discriminate whether it's a planet or activity. And so, you know, there's one issue where the activity can hide the signal of a planet, which is bad, but, you know, you just don't detect it, that's the problem. I don't think that's as problematic as saying there is a planet when it's actually just a starspot.
And so there are a number of groups around the world who have been focusing on this statistical challenge. Most of them are astronomers, though there are definitely statisticians involved as well. But it's a really fun problem, because it's not clear how to solve it. And so people are throwing all these different techniques at this problem, trying them out, and seeing what's effective and what's not. But it's important, because trying to find this population of exoplanets that are Earth analogs has been a general focus or goal within exoplanet astronomy. We've found, like I said, over 4000 exoplanets, and many of them are going to be the, you know, more massive planets, but we're trying to find something that potentially looks like Earth.

There have been some discoveries that are described or marketed as Earth analogs, but we're just not quite there yet, I guess. And so I just think it's an interesting problem from the astronomy standpoint, but also statistically, there's a lot that can be done to contribute.

Rosemary Pennington
That’s all the time we have for this episode of Stats and Stories. Jessi and Chad, thank you so much for being here today.

Jessi and Chad
Thank you so much. It was a lot of fun. It was great. Thank you.

John Bailer
Thank you. You were great guests.

Rosemary Pennington
Stats and Stories is a partnership between Miami University’s Departments of Statistics, and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter, Apple podcasts, or other places you can find podcasts. If you’d like to share your thoughts on the program send your email to statsandstories@miamioh.edu or check us out at statsandstories.net, and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.