diff --git a/about/probabilistic_programming_with_pymc/transcript.md b/about/probabilistic_programming_with_pymc/transcript.md index b6724c5..f7c2254 100644 --- a/about/probabilistic_programming_with_pymc/transcript.md +++ b/about/probabilistic_programming_with_pymc/transcript.md @@ -1,657 +1,243 @@ # Video transcript -:::{note} -This "draft" transcript has been downloaded from YouTube and needs to be polished. -::: - Video: [Introduction to Probabilistic Programming with PyMC](https://youtu.be/Qu6-_AnRCs8) -Speakers: Austin Rochford (AR), Reshama Shaikh (RS) +## Speakers: Austin Rochford (AR), Reshama Shaikh (RS) **RS:** Hello, everyone. Welcome to Data Umbrella's webinar. I'm going to do a brief introduction about Data Umbrella and then Austin will do his talk and during the talk you are free to ask any questions in the chat and then we will answer them, you know just like whenever it's a good time, either every 10 minutes or towards the end. This webinar is being recorded it will be placed on our YouTube typically within 24 hours and I will post the link to our YouTube in the chat once I finish my brief presentation. -A little bit about Data Umbrella. We are a community for underrepresented persons in data science. About me: I am a statistician slash data scientist. I'm based in New York City as you can hear the sirens. That is quite normal from here. You can find me on Twitter, LinkedIn and GitHub at reshamas. We have a code of conduct and thank you for helping make this a welcoming friendly professional community for all of us. There are various ways that you can support Data Umbrella -- the first and foremost is to follow our code of conduct and contribute to making it a really you know welcoming collaborative space where people want to return. +A little bit about Data Umbrella. We are a community for underrepresented persons in data science. About me: I am a statistician slash data scientist. I'm based in New York City as you can hear the sirens. That is quite normal from here. You can find me on Twitter, LinkedIn and GitHub @reshamas. We have a code of conduct and thank you for helping make this a welcoming friendly professional community for all of us. There are various ways that you can support Data Umbrella -- the first and foremost is to follow our code of conduct and contribute to making it a really you know welcoming collaborative space where people want to return. -We also have a discord where you can join and ask and answer general questions and share events or information. We're also on Open Collective so if you would like, feel free to donate. If you work for a corporation -that uses Benevity, this is a company match platform, you can also find us on Benevity. We have various video playlists on YouTube. And one of them is contributing to open source and you can see here that we have videos for numpy and core python and scikit-learn and pandas. +We also have a discord where you can join and ask and answer general questions and share events or information. We're also on Open Collective so if you would like, feel free to donate. If you work for a corporation that uses Benevity, this is a company match platform, you can also find us on Benevity. We have various video playlists on YouTube and one of them is contributing to open source and you can see here that we have videos for numpy and core python and scikit-learn and pandas. We also have a playlist on career advice. We've had four fabulous speakers so check that out if you are looking for that and these are all the playlists that we have. We have career advice, data visualization, data science beginner friendly videos, contributing to open source, scikit-learn sprint. And we have just started a new playlist for the PyMC series and this will be added to that one and this is just a sampling of some of the videos that we have on our YouTube. -We also have a lot of resources on our website related to using inclusive language, allyship, burnout, AI ethics. You can check those out on your own. We are on all social media platforms as data umbrella. The ones that I have highlighted, Twitter, LinkedIn, are the most active. Meetup is the place to join to find out about upcoming events and updates. And we have a blog as well and we have a monthly newsletter so I will share the link to that. Feel free to sign up. We only send a newsletter once a month and we promise not to spam you in any way, so if you'd like to sign up for that that's also helpful. +We also have a lot of resources on our website related to using inclusive language, allyship, burnout, AI ethics. You can check those out on your own. We are on all social media platforms as data umbrella. +The ones that I have highlighted, Twitter, LinkedIn, are the most active. Meetup is the place to join to find out about upcoming events and updates. And we have a blog as well and we have a monthly newsletter so I will share the link to that. Feel free to sign up. We only send a newsletter once a month and we promise not to spam you in any way, so if you'd like to sign up for that that's also helpful. -We use BigMarker for our webinar platforms and there is live captioning available so if you go to the very top there is should be a button set that says cc and you can turn on the live captioning. From the feedback that I've heard the English live captioning is pretty good. You can try the other languages, I think they're not -they're not quite as accurate. +We use BigMarker for our webinar platforms and there is live captioning available so if you go to the very top there is should be a button set that says cc and you can turn on the live captioning. From the feedback that I've heard the English live captioning is pretty good. You can try the other languages, I think they're not quite as accurate. Our upcoming event that we have is "Automating Workflows with GitHub actions." That's going to be on January 18th, oops, lost my slides. Hang on just a minute, sorry about that, all right, there we go. Yeah sorry I don't know what's going on. -Today's talk is Introduction to Probabilistic Programming with PyMC with Austin Rochford. Austin is the chief data scientist at Kibo Commerce. He is a recovering mathematician who is passionate about math education, Bayesian statistics and machine learning, and you can find Austin on Linkedin, Twitter and GiHub at austinrochford. I don't know if I'm pronouncing the name right but Austin will let us know when when he speaks. And feel free to tweet about the event. Our twitter is at data umbrella and with that I am going to turn off my -mic and camera and hand it over to Austin. +Today's talk is Introduction to Probabilistic Programming with PyMC with Austin Rochford. Austin is the chief data scientist at Kibo Commerce. He is a recovering mathematician who is passionate about math education, Bayesian statistics and machine learning, and you can find Austin on Linkedin, Twitter and GitHub @austinrochford. I don't know if I'm pronouncing the name right but Austin will let us know when he speaks. And feel free to tweet about the event. Our twitter is @dataumbrella and with that I am going to turn off my mic and camera and hand it over to Austin. **AR:** Thank you Reshama for that introduction and thank you for organizing this event together with all of the other Data Umbrella folks on your team for organizing and promoting and it takes a lot of work to put an event together. And thanks to the folks on the PyMC side who's who have been coordinating with Reshama and her team with Data Umbrella so thank you for asking me to speak here today. I'm excited to share something I care a lot about with all of you and Oriol as well and everyone else who is contributing to that effort. -So my goal today is to introduce the idea of both probabilistic -programming and how it works in pi mc kind of from a knowledge of python a -little bit of numpy um up to an actually interesting application to give you an idea of what -is probabilistic programming why do we at pi mc find it exciting -um so about this talk as you can tell this talk is a jupiter notebook uh the slide -rendering the code was difficult to see so i'm just going to scroll through the notebook apologies for that it's available on the -pi mc data umbrella sprint github repo at this link i think oriole and i have both also -shared uh links in the chat with access to this notebook so please -do load it up on your own play with some of the examples modify them for yourself -so in terms of an outline where are we going to go for the next 40 minutes or so 45 minutes we're going to -talk about probabilistic programming from two different perspectives philosophical and mathematical this is a -little bit of motivation we're going to see how exactly does probabilistic programming play out in pi -mc we're going to use two somewhat simple examples here to illustrate the basic concepts one is one -of my favorite little probability exercises the monty hall problem which you've probably seen or heard about -before and the second is robust regression this will show how we can easily iterate on a -model using probabilistic programming one of its real strengths then we'll talk about why pymc why should you -actually care about pi mc as opposed to coding these inferences from scratch or some other system so we're going to talk -a little bit about the strengths of hamiltonian monte carlo and asera which is kind of pi mc's uh peer project that -that enables a lot of the exciting things that pi mc does then i'm going to give you a taste of -what a real uh you know a near real analysis that i've done around lego pricing -with this so if you can tell from my background i collect legos uh star wars -space dinosaurs a smattering of other ones uh so i'm very interested in what drives the prices of lego sets if you go -on my website which is just austin rochford.com i've written about this fair amount show how to analyze those -using pymc so a real world application and provide some resources and next steps right and i think rachma will ask -uh maybe mean all to to say what the next steps are with the sprint that we're enabling here but if you want to -learn more about either bayesian statistics probabilistic programming or pi mc i have some -resources linked at the end as well so the interest of time let's dive right in -so i take two different perspectives on probabilistic programming the first is a little philosophical right so people a -lot about a lot of the time talk about data science as doing inference to tell stories right data science is all about -telling stories with data i've included uh one of two favorite images here -um this is the size of napoleon's army right as he marches to and then back -from moscow during his ill-fated invasion of russia what's the story this data visualization tells you i think it -tells you not to invade russia during the winter uh but it's very much data goes to stories right that's the direction the -arrow flows you do some calculations on your data you use that to tell a story -probabilistic programming one of the real strengths about it and how i think about is it kind of turns that arrow around probabilistic programming says +So my goal today is to introduce the idea of both probabilistic programming and how it works in PyMC, kind of from a knowledge of python, a little bit of numpy, +up to an actually interesting application to give you an idea of what is probabilistic programming, why do we, at PyMC find it exciting. +So about this talk as you can tell this talk is a jupyter notebook, the slide rendering, the code was difficult to see so i'm just going to scroll through the notebook. +Apologies for that, it's available on the PyMC data umbrella sprint github repo at this link I think Oriol and I have both also shared links in the chat with access to this notebook, so please +do load it up on your own, play with some of the examples, modify them for yourself. +So, in terms of an outline where are we going to go for the next 40 minutes or so 45 minutes, we're going to +talk about probabilistic programming from two different perspectives-- philosophical and mathematical. +This is a little bit of motivation we're going to see how exactly does probabilistic programming play out in PyMC. +We're going to use two somewhat simple examples here to illustrate the basic concepts one is one +of my favorite little probability exercises-- the monty hall problem which you've probably seen or heard about +before and the second is robust regression. +This will show how we can easily iterate on a model using probabilistic programming, one of its real strengths, then we'll talk about why PyMC, why should you +actually care about PyMC as opposed to coding these inferences from scratch or some other system? +So we're going to talk a little bit about the strengths of hamiltonian monte carlo and asera which is kind of PyMC's peer project that +that enables a lot of the exciting things that PyMC does. Then I'm going to give you a taste of +what a real, you know a near real analysis that i've done around lego pricing with this. +So if you can tell from my background, i collect legos, star wars, space dinosaurs, a smattering of other ones. +So i'm very interested in what drives the prices of lego sets. +If you go on my website which is just austin rochford.com , i've written about this fair amount show how to analyze those +using PyMC, so a real world application and provide some resources and next steps right and i think Reshama will ask +maybe mean all to say what the next steps are with the sprint that we're enabling here but if you want to +learn more about either bayesian statistics probabilistic programming or PyMC, I have some +resources linked at the end as well. +So, in the interest of time let's dive right in! +So, i take two different perspectives on probabilistic programming-- the first is a little philosophical right, so people +a lot of the time talk about data science as doing inference to tell stories right. +Data science is all about telling stories with data. I've included one of two favorite images here-- this is the size of napoleon's army right as he marches to and then back +from moscow during his ill-fated invasion of russia. +What's the story this data visualization tells you, i think it tells you not to invade russia during the winter +but its very much data goes to stories right, that's the direction the +arrow flows you do some calculations on your data you use that to tell a story. +Probabilistic programming-- one of the real strengths about it and how i think about is, it kind of turns that arrow around probabilistic programming, says i'm going to tell the story of how my data is generated and then encode and then i'm going to -use that to perform inference about unknown quantities uh whether it's the size of napoleon's army at a certain -point in time whether it's the probability that a coin is flat fair whether it's the effect that the number -of pieces has on the price of a lego set you know there's some unknown quantity in the world i'm going to tell a story -about how it relates to the data i see and then i'm going to use the data i've seen to infer -some things about that unknown quantity so that is how probabilistic programming kind of turns that around -um and it turns that around if you if you put your you know mathematical hat on a -little bit and you think about turn that around okay so if i have uh -a story and i get data from that story how do i turn that conditional probability around -well that's bayesian statistics right bayesian statistics is all about how do i go from the probability of b -given a and turn that into the probability of a given b right so this is all about how do i reverse -conditional probabilities same idea here going from data to storytelling and turning that -out around is what really motivates me here so this is kind of the philosophical level could say a lot more -there i'm not going to dwell too deeply let's dive into the mathematical level so the mathematical level this all plays -out um in something called monte carlo methods right we're going to be talking about -advanced monte carlo methods here and so what's a simple example of a monte carlo method well i'll give you one right here -let's generate 5 000 points that are uniformly distributed in the square that -ranges from negative one to one right so that's shown here -okay and i apologize these some of these things will be duplicated because these were built to be rendered as slides but -the formatting wasn't making the code legible so we're being a little bit flexible here so we've got our five thousand points in -this square negative one to one right let's count the number of them that are inside the circle with radius -ones so these are inside the circle with radius one if x squared plus y squared is less than one -right this plot highlights those that are inside the circle of radius one uh the green ones are inside the circle -the orange ones aren't inside the circle uh and then you know -what proportion of these should we expect to be inside the circle right -well it probably has to do with the area of the circle right the area of the circle is one -right the area of the square is four so if we take you know number of points -green points inside the circle and we divide that by total number of points which is five thousand we should get the -ratio of the area of the circle to the area of the square which is pi over four -and sure enough we add up the number that are in the circle we divide it by five thousand and -we multiply it by four we get not an awful approximation to pi right so this is what a monte carlo -method is a monte carlo method is all about using random numbers cleverly to approximate the quantity -you're interested in and we'll get into how that connects to probabilistic programming and bayesian inference in a second -how does this connect to bayesian inference well what we've done here is actually approximate this integral -right if you think back to calculus this integral relationship is true right because the square root of one minus x -squared that defines this quarter circle right so this integral gives you the area of -this quarter circle here multiply that by four you should get the area of the whole circle which is pi -right so we instead of doing this integral through maybe a trigonometric substitution or whatever you might have -learned in calculus approximated its value through random number generation -and bayes theorem has often very many intractable -integrals here right one way to restate the bayes theorem as i showed above is this way where theta are your unknown -parameters and d is your data right so if we want to know about our -unknown parameters given our data it tells us that it's data given unknown parameters probabilistic programming is -all about specifying this quantity right prior is not that hard also -specify the prior but then calculating this integral at the bottom is intractable -not possible provably impossible for a lot of interesting models and if we restrict ourselves to models just models -where you can calculate this quantity life becomes a lot less interesting we close off a lot of potential -applications so that's what gets us to consider probabilistic programming using monte -carlo methods in this case with pi mc although there are many different uh packages out there to do it with -um so all of that was great theory let's get down to practice now and let's talk -about the monty hall problem so if you're not familiar with the monty hall problem right there's a famous game show -in the united states going back 50 years even maybe probably even longer than that called let's make a deal right the -host was named mommy hall for a number of years and so that's where this comes from and one of the games the contestants would play in -let's make a deal would be that monty would show them three doors closed right behind two doors were goats -and behind one door was a sports car right so obviously in -the end you want to open the door that has a sports car behind it so how this game would play out is that -monty would ask the contestant to choose a door right and not open it yet so say you -chose door number one the first door right what monty would then do is open one of the other doors -in this example door three but it could be whichever one to show you a goat and then this is where the drama comes -in he would ask you do you want to switch your choice of door yes or no -right and this is one of the kind of first examples where conditional probability becomes a little bit -counterintuitive uh probably a lot of you know is that the answer is yes you should switch your -choice of door right um there after he's opened a door to show -you a goat there's now a one-third chance that your door contains the sports car -and a two-third chance that the other unopened door contains a sports car so you should switch -in order to maximize the chance that you choose the prize you actually want right and we could work this out -mathematically here i'm not going to run through it because it's tedious the point is probabilistic programming will -enable us to write some code to avoid doing this right because it's possible in this situation but not particularly -interesting and a lot of situations it won't be possible to work out like this so it's always good when you're learning -a new numerical method to apply it to a problem where you can get the answer with pen and paper to make sure they -agree that gives you confidence when you're moving to situations that you can't just use pen and paper for right so you -should be able to get a known answer in a situation uh that you can work out -explicitly so that you build confidence when in situations where you can't work out the answer explicitly -so let's see how we solve this monty hall problem in pi mc -that was supposed to be a hidden cell um thank you for bearing with me so what does the pi mc solution looks like well -initially we have no information about which door the uh prize is behind so i'm importing -pi mc i'm creating a pi mc model with this context manager on normal you know python stuff we do and then i'm saying -hey this prize comes from a discrete uniform distribution all that means is -each number between 0 1 and 2 has the same chance of containing the door right there's a one-third chance -it's behind door zero uh the first door one third chance that's behind the second door a one-third chance it's -behind the third door so here i'm telling the story of how my data is generated before i get any -information don't know where the door is the prize is i should consider each door equally likely -then we need to think about things from the host perspective right which door is he going to open -based on the door we have chosen well we've chosen the first door so he's not going to open that door right there -would be no drama there if he opens it and there's a car we just stick with the open door if we open it and there's a goat -we switch doors and it doesn't really matter which one we choose so he's not gonna choose the door we chose which is -door one now if the prize is behind door one this -is the first row he won't open door one we've already said he could open door two or three -they both contain goats that's fine if the prize is behind door two he's not going to open door one again that -destroys any drama he's not going to open door two also because he doesn't want to show us the car because yeah we -would just switch to door to the door with the car okay so he's forced to open door three -right and similarly if the prize is behind door three he's forced to open door two so this is how monty makes his -decisions of which door to open so we can see if the prize is behind -door one there's a fifty percent chance he opens door two or door three if the prize is behind door two there's -a hundred percent chance he opens door three and similarly if the prize is behind door three there's a hundred percent chance he opens door two -so in the spirit of probabilistic programming we want to encode these chances he opens each door given where -the prize is as code so i'm going to import this package name asara uh and i'll i'll talk a little bit -about a sar in a second so just bear with me for the moment and i'm going to write p open which is the probability he opens each door as a -switch statement here so this is basically the asara version of an if statement -so this is saying if prize is behind the first door then we -know zero percent chance he opens the first door fifty percent he opens the second fifty percent he opens the third -that corresponds to this line right if it's not behind the first door a nested if statement if he's behind the -second door zero percent chance he opens door one zero percent chance he opens door two a -hundred percent chance he opens door three that corresponds to this row right and then this final thing if it's -not behind the first door not behind the second door gotta be behind the third door 100 chance he opens the second door -so why have i written this in terms of aet aceratenser.switch and it's -sarahtenser.equal instead of regular python if then else statements we'll answer that question in -a little bit and it's all about increasing the efficiency of our sampling algorithms through calculus -acera lets us differentiate expressions automatically without having to do any calculus with pen and paper in a way -normal python if then else statements do not and so that's the benefit we get by writing this -which totally could be an if-then-else statement and somewhat awkward uh syntax and we'll see that that's really -worthwhile once we move towards more complex examples so we've said our prior belief about -which door the prize is behind we've said what are the chances monty opens each door given where the prize is now -we know he opened the last door the third door and showed us a goat right -so which door he opened is now a categorical random variable it's either first second or third -the probability each door was open is that p open we just defined through those switch statements and we know he -opened the third door since this is zero index the third door corresponds to this two -right so we've said how where the prize is influences the -likelihood he opens each door and then what data we've observed now we're going to perform inference and -inference in pi mc is through sampling right this is where the mark this is where the monte carlo -uh inference comes into play so we sample from the posterior distribution to use the bayesian terminology -here and we get uh an x-array array of samples we're going to get in this case -um 20 000 samples from the posterior distribution of places the door of the -door the prize is behind in 20 000 different simulations of this right just like when we were using monte carlo -methods to approximate pi we had 5 000 different samples there a certain proportion of them were inside -the circle we use that proportion to determine pi we're going to do draw 20 000 random numbers here we're going to -see how many of them are zero saying the prize is behind the first door we're going to see how many of them are one -saying the prize is behind the second door and those proportions are what's going to tell us the probability that -determines whether or not we should switch doors so we get these traces here okay -uh these samples and i'm going to check the posterior prize so this says how -many times is the prize behind the first door right and this says one third of the -time the prize is behind the first door given the information we've input which means that two-thirds of the time it's -behind the second door so we should switch doors right this is the well-known result that we could derive uh with pen -and paper and i did on a previous slide so um -so it's good we've recreated this result using probabilistic programming by telling the story of how our data was -generated we have inferred what are the chances the sports car is behind the first or -second door and we know the correct answer is to switch door so let's step back a bit and talk about -the components i've just flashed in front of you two components we've talked about so far we've talked about pi mc -distributions right we saw discrete uniform and categorical distributions there so any -probability distribution almost any probability distribution you can think of from statistics -has been implemented in pi mc and if it hasn't they're actually quite straightforward to add those are some -good first contributions or to add uh new probability distributions it it's -kind of uh there's a lot of boilerplate code with um some small tweaks you can use and that's -actually really good and important work i like to think of these given that i'm a lego collector it's kind of the lego -bricks of probabilistic programming right distributions are the bricks pi mc lets -you put those bricks together in different ways to perform inference and so -in the case above we use the discrete uniform and the categorical distribution i've just chosen to highlight two to -show flexibility obviously there are normal distributions there are zero inflated poisson distributions if you -were to click on this link it takes you to the documentation shows you you know uh many dozens if not hundreds of -distributions that are available to build up your statistical models from and then what were those -aet.switch aet.eq things doing well they were invoking acera -which is pi mc's uh kind of tensor computational back-end right you can see -a little a little blurb from the acera documentation here i'm not going to read it out loud for you you can think of a -sarah as filling the same niche as uh you know pie torch -or tensorflow fill in uh various other you know maybe -deep learning ecosystems which is it will optimize your numerical computational graph -and it will also allow you to calculate derivatives of that computational graph without having to do the calculations -yourself and that is really key and what pi mc really leans on asera for to enable -efficient inference in high dimensions which we'll talk about in a little bit it's very much worth the uh -the cost of wrapping p open in these asera switch and sarah equality -statements to get uh you know automatic differentiation and optimization for -free uh once you start thinking in that way it's not hard to port a lot of the common code you would write to that and -the benefits are great so i'm going to walk through another kind of simple example to illustrate a -couple more key concepts and ingredients of pi mc -uh or dependencies perhaps before we ramp up to our more realistic lego examples so we're going to take a data -set from anscom's quartet an interesting you know instructional data set of four different groups of -data that have some similar summary statistics i want to talk a little bit about how we can use pi mc to do robust regression -today so let's focus on the third entry in this quartet that is almost exactly linear except for -one outlier i'm going to show you how you can really powerfully use these kind of swappable distributions in -pi mc to capture the linear trend here and ignore this outlier in kind of a toy example but -that's instructive for real life applications so let's see ordinary least squares -let's first do something we know will be uh prone to responding to this outlier -what are the mathematical assumptions behind your good old ordinary least squares regressions well -they are that all values of the slope and the intercept b and the um -[Music] the error the noise variants are equally likely so i'm going to build a pi mc -model that encode these things and flat here is the distribution the lego brick that -says all real numbers are equally likely to be the value of m all real numbers are -equally likely to be all the va uh the values of b all positive real numbers that's what -half here means are equally likely to be the value of theta right so you write out some math here -y is mx plus b plus some noise this is one way to view it this is equivalent -statistically to saying y has a normal distribution with mean mx plus b variance sigma squared so that's -how this is encoded in pi m c really really simple right and here you'll see -again we're saying hey i observed this data y 3 y 3 being the -y-coordinates of these data points in the third entry in the quartet -so we can come on down and we can sample from this model again you see pi mc's doing some sampling for us and it's -warning us some things there were five divergences after tuning so this is great pi mc is not only trying to infer -the values of m and b here but it's going to warn us when it's not doing a -great job when our monte carlo methods may be failing so this is a really powerful and user-friendly thing about -pi mc how can we visualize these divergences where inference is failing well we can -use one of the other important components of the the probabilistic programming ecosystem here which is rvs -a kind of sibling project to pi mc to plot the values of mb and theta that -are causing these divergences right so the the idea here is that rvs is a library -that contains a lot of pre-built visualizations and statistical routines -that will help you understand the results of your inference that you did with pi mc -and visualize debug those results right so what two more key components -that this uncovers one is rvs right and rvs is really cool because -it's agnostic about the probabilistic programming library you use to conduct the inference right the plots of -posterior distributions or divergences or you know many other things you want -to make are agnostic about whether or not pi mc is your program probabilistic -programming package of choice or stan is your probabilistic programming package of choice or -you know numpyro is your probabilistic programming package of choice there are many probabilistic programming languages -out there they all want similar visualizations rvs exists to say hey if -you provide the results of inference in a standard format we'll build those visualizations for you no matter what -uh upstream inference package you use so our visa is really awesome um i strongly encourage everyone to take -a look at it when we look at the results of inference right we took -we said pm.sample store this in ols ordinarily squares trace -we look at the type of that that's an rv's inference data object okay you just need to encode there was the samples -from your monte carlo uh approximations in this object get -uh rvs to work on it if we look at the posterior component of that trace that's an -x-array data set an x-ray is a really awesome library that's you know i think about it almost -as like pandas in more than two dimensions right it's a great way to define flexible -labels labeled arrays in arbitrarily many dimensions and this is exactly what -you want to do when you're doing invasion inference and visualization of bajan inference the results of bayesian -imprints excuse me and we will see uh where those labels are very powerful -um so there's a question why do we assume that it is normally distributed in the -ordinary least squares model that's a great question you can motivate the ordinary least squares model a bunch of different ways -uh right usually it's viewed as minimizing the mean squared error with a bunch of -restrictions applied that turns out to be equivalent to placing a a normal -likelihood on your data so it's just one of the assumptions that goes in here there are a bunch of different equivalent ways to frame it so -great question so what do we want to do well we want to -compare our inferred values of m and b to the robust values we would get if we -ignored the outlier and the outlier here happens when x is equal to 13. so we're going to use numpy's uh -polynomial fit of degree one same equivalent to ordinary least squares regression to get -the slope and intercept and we're going to compare those to our posterior distributions for m and b -that we got using pi mc right and you can see here -the reference value so the the robust slope is a little bit smaller than it's -a little bit towards the low end of our posterior credible interval we've slightly overestimated the slope -underestimated the intercept here this has really meant the details of this visualization aren't that important it's -just that all of this came out of you know one line essentially of rvs because rv's is this powerful showing what's the -true value that i'm trying to get out of inference versus what's the posterior distribution uh that happened when i fit -my model so we can also plot the lines here that come out of this inference right and we -can see okay the blue line is the average the posterior expected value -that you get using pi m c and the red line is what you get when you calculate the same thing using numpy -just to show that they agree when you when you match up your assumptions so that's good but we see you know this has -definitely been pulled upwards by the outlier underestimating the true intercept over estimating the -true slope due to this outlier how can we swap out some of those lego bricks -those distributions uh pi mc distributions in order to make -this less sensitive to that outlier well we can start out by saying let's -regularize a little bit let's move from ordinarily squares to ridge regression for those of you that are familiar with -the machine learning concept of ridge regression what is ridge regression it's really just normal priors on your -coefficients right so instead of saying m is a flat distribution here we're going to swap -that out and say it's normally distributed mean 0 standard deviation 1. how to choose these parameters as a -subject in and of itself um i've just chosen these here for convenience you could change them -um [Music] yes so reishima has asked if y were distributed -as a binomial yeah you could absolutely you know build say a logistic or a probit -regression model this way uh you know there are distributions for any kind of typical glm that you've seen out there -i've written up examples on my blog austinrockford.com around you know negative binomial regression using pi mc -or even you know ordinal regression to see what factors drive lego set star -ratings uh so the the number of distributions is really flexible really is legos the goal -is to enable you to build whatever models you want um using these building bricks here we're -sticking with normal observed likelihoods just for simplicity uh since this is an introductory talk -but this can get as complex as you like you can make a zero inflated binomial model or something similar uh with the -building blocks that pi mc provides a great question reishma um -so we're going to throw these on these pry these normal priors on the -uh slope and intercept this is equivalent to ridge regression i'd have to do some math to determine -what regularization penalty it is but it is equivalent uh we perform inference again we get a -little a few a couple fewer um a couple fewer warnings -uh but there are still some divergences we can check our plots again our true values our robust values of the slope -and intercept are actually even further from what we've inferred so huh well this makes sense -regularization of this sort tends to be more useful let me find the correct -thing when your x values are extreme and not your y values so we've added in -robustness to extreme x values but not to extreme y values actually reshma this will speak -to your next question which is -um which is how can we change that -observational likelihood to be more robust against this extreme y value well to get robust regression out of -this we can use the fact that students t distribution has fatter tails than a normal distribution -right and so here the blue line is the standard normal distribution the orange -line is a student's t distribution with two degrees of freedom we see the tails are fatter -this will respond due to those fatter tails this will respond be less sensitive to outlying y values -so we'll specify the model for m b and theta same as we did before but we'll change y obs here from a -normal random variable to a student t distribution we have to specify a prior on our -degrees of freedom here i say it's a uniform between one and ten -uh right because as the degrees of freedom goes to infinity the t distribution becomes a normal distribution it's less interesting if -the degrees of freedom is huge we expect it to be small here because there is an outlier but similarly you know mean is -mx plus b sigma is sigma we perform inference we see -in our posterior plots we've captured the true robust values of these things quite well -we can visualize the outcome lines here the green line is going right through -our uh you know non-outlier data points and the outlier has been ignored this is -a little bit of a silly example right because there's no noise in the these things outside of the one outlier that's -why these posterior distributions look real wacky you could add a little noise here to make it more realistic the point -is these distributions are easily swappable uh chandra asks can we model using -tweety regression i believe we can uh if not it's possible to add -just depends has the distribution necessary been pre-built -how do we choose hyper parameters and can we go wrong by choosing a distribution um -just like right this is just one kind of tool in your statistical toolbox out of many -uh you can always go wrong choosing your statistical tools so there is tons of -uh ways to choose these prior distributions and evaluate the goodness -of your choice that are really outside the scope of this talk at the end i will give -some references that will include um discussions of these sorts of things but -that's a great question so yes absolutely you can make bad decisions if you don't choose the appropriate tools -as with any statistical method um and how to how to decide whether you are -in a good spot or not is something that's outside the scope of this talk but there are many references out there -that i will share so why are we using a sarah again and -i'm going to move quickly here to make sure we have time for q a at the end um -so apologies but hopefully it will still be somewhat clear why are we using uh -a sarah why are we using that aet.switch aet.equal and under the covers here when -we write uh you know pm.normal there's a bunch of acera operations going under -the covers when we call this class constructor uh pi mcs just providing a nice layer -over those for you so you don't have to think about them unless you want to um -why we use the sarah is to do something called hamiltonian monte carlo and what makes it hamiltonian what makes it -hamiltonian is that you want to take into account the geometry of the model you're specifying -right and geometry in the sense of differential geometry is all about curvature -and curvature is about derivatives in a past life i was a differential geometer -and the way i think about it is curvature is always about the failure of partial derivatives to commute -so uh you know if you differentiate with x and then differentiate with respect to y -and then do the derivative of y with respect to y and then the derivative with respect to x -those two things are the same in flat space on the plane they are in general different if you're -on a curved surface um and that is how geo geometers or at -least a certain set of geometers think about curvature and that's actually really important for doing this kind of -inference in high dimensional models models that we're really interested in here as -accurate representations of reality due to the curse of dimensionality which you're probably familiar with from -machine learning what's the way the curse of dimensionality uh is reflected -geometrically is that the volume of the unit sphere as the number of dimensions -increases goes to zero exponentially fast right so -you can think of the unit circle and then the uni in two dimensions the unit sphere in four dimensions the unit hyper -the unit sphere in three dimensions there's a hyper sphere in four five six seven dimensions what's that volume -it drops to zero really fast as the number of dimensions increases even mildly here we can see that between 100 -and 1000 dimensions it's already you know 10 to the negative 196 -right what this means is that typical points in your data set near the origin there's a nice definition of typical but i i -won't get into what that exactly means here are hard to find in high dimensions -if we just treat the spaces flat when we take into account the curvature -of the space they become a lot easier to find and asera is what lets us do that by automating those derivatives we need -to cut um to quantify curvature -so where does acera really help us as a toy example let's show using a sarah that the derivative of x cubed is 3x -squared familiar to all of us who have taken calculus right so we're going to define x as a scalar y -is x cubed then we're i'm going to ask acera to pretty print the gradient of y -with respect to x that should be this derivative let's walk through this and i'll tell you how to read it -so this says fill x cubed with x is one okay so one cubed is one this term is -just one one times three okay so we've got this coefficient of 3 here -and then we've got times x to the 3 minus 1. so we've got 3 from this term we've got x cubed from this term -beautiful right so sarah is doing calculus under the covers -to quantify the curvature of space to make sure that we are not dealing with these absurdly small numbers -uh that we would if we didn't acknowledge the that geometry all a bit hand wavy right now -there are great uh references here uh there's no way to go super deep on that -theory in an introductory talk i just want to wave at why it's important so -let me take about five minutes to run through this final example where the shape of the posterior does -become important and in that tweety regression or that binomial or ordered logit or probit regression this becomes -a lot more important than our you know simple to you know univariate linear model -with like two or three parameters um it becomes a lot more important as your -model gets more complex and realistic so let's talk briefly about a bayesian analysis of lego prices so on the left -here you have a small sampling of my lego collection to my left here there is some of it as well -i came at this with a motivating example of here's darth vader's meditation chamber a set lego released uh last year -in 2021 is this model worth the 70 they were charging it or not how can we answer -that question with data so i went out there and -scraped a bunch of data from brickset which is just a lego fan site that has references all sorts of data you can see -it here i've made it available online you can find the data if you're interested in that sort of thing you'll see there's about -6400 sets from 1980 to 90 uh to 2021 -you can see there's a pretty good i'm gonna move quickly here because i know we have some other things to cover off in the end and i want to leave time for -q a um there's a log linear relationship between the piece count of the set and -the price of the set that's not surprising right lego has both fixed and variable costs in manufacturing it sets -larger um sets obviously will incur those larger variable costs so they're going -to charge us more to cover their costs plus whatever profit that that makes perfect sense we see where darth vader's meditation -chamber kind of fits in this we can look at to make this model a -little more interesting a little more realistic we can look at how these prices have varied over time -so how has the recommended retail price per piece varied over time and we could -see where darth vader's meditation chamber fits in there we can break this down -by different themes so we can see star wars creator harry potter and disney -we can build a model i don't know why my math is not working here apologies for the last minute -change that says you know basically each component the intercept and then -the price per piece uh has a time varying component so a gaussian random walk and you can come -into the notebook and see how this is built up here from our pm.normal lego bricks -and a theme component which is a hierarchical normal random variable you can see how that's built up here this is -not a ton of code to specify a fairly flexible model with both time and theme effects -if we try to sample this without hamiltonian monte carlo i can do this using a metropolis step -which is the simple naive thing to do i get a ton of errors if i calculate the r -hat statistics here i see they're enormous they're up above three for some of these things r hat statistics of one indicate -convergence so i shouldn't trust this these results this is one of the ways -that pi mc tries to help you decide if your results are trustworthy or not you can see this took about a minute -let me do this with right there 351 parameters here so there's a lot of them -uh let me do this with the more advanced samplers the hamiltonian monte carlo nut -stands for no u-turn sampler that that pi mc provides this takes nine -minutes ten minutes uh but produces much better r hat values that are close to -one indicating no problems with convergence there and therefore much more trustworthy -uh posterior inferences so the top is the using the metropolis step that falls -prey to the curse of dimensionality you can see that these samples are wild all over the place i have no reason to -expect a true posterior distribution would have this wild form so this shows inference problems whereas the -hamiltonian monte carlo ones have this beautiful shape much more plausible -um so even though it took about 10 times as long to sample the sampling efficiency -is uh just so much bigger right so the sampling time is an order of magnitude -longer for the adaptive hamiltonian monte carlo nuts is hmc yes for daniel's uh question nuts is -a form of hmc but the sampling efficiency in terms of effective sample size per second is an -order of magnitude larger so to get the same effective sample size with your -metropolis step you'd actually have to sample that model for 25 hours so 25 hours versus 10 minutes i'll take -10 minutes any day of the week so we can start to answer some of our questions pretty easily using pi mc here -we can look at the posterior distribution of what we would expect darth vader's meditation chamber to cost based on the -fact that it's a star wars set released in 2021 uh that has however many pieces i think -it's about 700 pieces so our models uh posterior expected value was 79 we -see that the 69.99 lego price to that is just below the median so it's pretty reasonably priced we can assess the -pretty easily the effect of different set uh types on the price of legos right so creator -sets are a little less expensive star wars sets are a little more expensive etc etc -here i've included some uh two books that are available online -for free this one on the right is also for purchase so the left is a classic reference -middle is a great reference written in r with stan but we have posted notebooks -implementing it in python with pi m c for all of the examples and exercises -here really great textbook to help you think about bayesian inference and then on the right is a textbook about -bayesian inference actually written by osvaldo raven and jung peng who are three piemce core -contributors that just came out uh has an online version available also available for purchase in print -i thank you for your attention again i thank reishma and the data umbrella team -for their hard work putting this together and oriole and nina for me -excuse me for their coordination uh there and at that point i will hand it back to reishma to perhaps say a -little bit more about the sprints this has been intended to introduce to you -uh and then i can take any q a after after reshma uh fills that in so thank -you all for your time i really appreciate it so yes actually me now is going to be -speaking a little bit about the upcoming sprint so i'll wait um if you know if you're able -to share your camera that would be great -right no -yep we can hear you so i'll share my screen -okay -oh i think my screen is visible yes it's loading yep now we can see the slides yep yeah so -uh yep uh so we're going to be organizing a sprint with data umbrella like the -entire biomc team on 18th and 19th february like that over that weekend -uh we haven't finalized the exact time slots yet but we're hoping to do like two hours uh -to our slots over the weekend so we can accommodate more time so -[Music] -two of these have already happened and two more of these will be happening and i really think you should be -like uh if you can make the time either attend them or watch a recorded version if you want to contribute to yfc like -austin gave a wonderful introduction on how to use pyeomc and why you should be interested in probabilistic programming -next we have a talk that ricardo will give which is more about you know how can you contribute to ymc -like how can you help out with the source code or other things and then martina will be giving a talk on -how do you contribute to the documentation like say if you're not really interested in contributing to code but you know a bit of statistics or -maybe you just want to contribute to the technical part which is again important -so these talks would help you out with that and finally then out month from now we have -the sprint uh a couple of us in the pi you should always be to help you through polar -request surveys and i mean i i think it would be really fun and -if you're new to open source i would really ask you to join in and -yeah that's that and thanks to vishwa for putting together this whole thing it's been like so much work -yes um let me turn on my camera um yes i want to thank um the entire um -pi mc team because um they initiated this sprint and there's a series of -webinars you can see that we've had two of them already that are on our youtube and we'll have two more upcoming which -will be available on our youtube and the meetup group is the best place to sign up -because that's where you're going to be able to find out about the upcoming events um -so meanwhile i have a couple of questions for you like for instance for the sprint how much of a background do -people need to have to be able to participate in the sprint -like uh i think uh people of like the background can be really varied like you can contribute to various areas -actually you can if you know some statistics you can contribute to documentation if you've already -contributed to open source but you don't know stuff about probabilistic programming there are still housekeeping issues that you can always contribute to -you know there's a bug there's a type or anything so as long as you're like slightly comfortable with using github -you have the local setup you'll be able to make at least some meaningful contribution and all of us will be there -so we'll try to help you out like you know according to whatever you know what can you best help out with so i think uh -it won't be like that much of a problem like background wouldn't be that much -with any background as long as you can like you have your github setup and everything that's something that we want -you to do and yeah after that it's all good from there -great um if anybody has any questions on the upcoming sprint feel free to post in -the chat and it will also be helpful for us because it means that we'll be able to sort of -include that information in the in the event so other people -probably if you have questions other people have questions as well -oh by the way uh austin we were speaking before but the question that i asked you um which would be great to share with -the group is um you know if you want to talk just a little bit about how this um the new version of pymc which is a major -major release if you just want to talk about that for people who weren't here before sure sure so if if you go and find on -the internet even even on my website for example you'll see a lot of -materials that reference pi mc3 and it's been quite an evolution to get -from pi mc3 to uh what we're currently calling pi mc which which under the covers is the -fourth version of pi mc the uh kind of saga has been that um -i talked about a sarah here for a while right um -a previous version of of acera was supported by a research group i believe out of the university of montreal -it was called fianno one of the first kind of tenser libraries for deep learning you could think like an -intellectual precursor to your tensorflows and and your pie torches and your terrace and all that -um a couple years back that lab announced that fianna would no longer be -supported by mc you know was kind of casting around for a new tensor back end -dabbled in in pi torch and mxnet and uh tensorflow for a while eventually you -know some great pi mc folks decided they would rather just take over sarah continue to maintain or take over thiano -rename it as sarah to show that it's a new project continue to evolve that to not just meet the needs -of pi mc although that's a big part but make it an interesting uh tensor package -and modern tensor package in and of itself and we've taken that uh that opportunity to call this pi mc v4 -introduce some modernizing changes around a focus on rvs -uh focus on x-ray that were not present you know 10 years ago or whatever when pi mc3 v3 were started so we kind of -took that back and changes an opportunity to make some lightly breaking changes -uh to make pi mc a little more modern and i do say lightly breaking because by and large most of the code that i have -on my website um that runs on pioneer c3 will run on pi -mc now if you just change uh thiano to acera in a few of the right -places so it was more of an opportunity to modernize when changing back ends than anything -else is what i would say and it's some really exciting stuff i think to to work with and the acera -folks are doing uh really great and interesting work not only to support pi mc but in kind of tensor computation in -general you can you can do things like use c as a back end you can use jacks as -a back end i think there are maybe it's a number back end for your acera code there's all sorts of interesting um -kind of lower level programming language focus things that they're doing there so it is -the foundation of pi mc v4 for sure but it's so much more so worth thinking -about in its own right so i just want to read a comment that oriole who is a um maintainer for pi mc -foot regarding the sprint which is as an example of how how wide the range is our docs have a glossary of statistical -terms to which you can contribute with no python or pi mc knowledge only stats -i think really what's going to be unique um what is unique about this pi mc sprint is that you know -people from all different backgrounds in terms of what knowledge they have whether it's -statistics or python or just very can can contribute to it you don't have to be a pi mc expert -to learn how to to contribute to pi mc i'm just going to turn my camera off and i'll let austin answer a few of the -questions that are in the chat -all right folks let me scroll up a little bit pi mc3 is not being continued i forget -to what extent it's going to get bug fixes i think that is addressed uh -either on the pi mc website or in uh the pi mc discourse -which is quite an active forum we have for answering questions but no most of the active development is happening on -pi mc v4 which like i said it is almost most models written in pi mc3 -require some very light like five minutes of porting to get them to work on v4 so it's all you know designed to -be as seamless as possible but you know we can't promise it's 100 seamless -um [Music] arielle mean all shared to answer ariel's -question i believe then claudia does the definition of custom probabilities change substantially so -yes one of the things that does change under the covers i've talked a lot about um -the interface the the user-facing interface has not changed a ton a little bit with the advent of rvs and and the -emphasis on x-ray uh the implementation of pi mc distributions under the covers has -changed a fair bit now if you've done it before it's not that hard to learn the new way -but it is not exactly what it was like before in v3 so it's where you've maybe implemented some custom distribution -there's a new pattern to follow i would say it's a better pattern now that sarah has kind of taken what thiano did and -made it more uh more suitable for pymc so i would say there's a newer better pattern to follow there but there is a -little bit of work if you've done something like implement a custom distribution -and uh as you say hopefully more documented um that's something that can definitely be -helped in this sprint i think there's a lot of uh my experience contributing new -distributions is a lot of find a distribution that's similar look at the patterns there -uh adapt it to your situation and test but i think that's a great place that our our documentation could for sure use -some improvement is in adding new distributions so i think it'd be a great uh sprint activity to contribute to -um also a great discourse topic so benjamin has asked are there differences -between probabilistic programming noteworthy significant reason why would you why you would choose pi mc over stan -uh so i'll answer this on a couple of different levels mathematically they're all kind of striving towards the similar -ideal of adaptive hmc when you really get down to the nuts and -bolts of it there's going to be some algorithmic differences between how they implement it but by and large those -should not matter to 95 99.9 percent of people -um i certainly couldn't tell you what they are off the top of my head and i use pi mc quite a bit -um you know stan stan versus pymc depends do you you know -do you want to have an all python interface to your programming language is that important for your application if so pymc might be a better fit for you -stan has its own you know model definition language um -strengths and minuses of each as a like a software library are you know -the language you use to specify models whether that's pure python for pi mc or you know stan's domain specific language -that's kind of a a subset of c plus plus neither is better than the other stan is -great software we are actually great friends with the folks on the stand team get dinner with them all the time when -i'm in new york um it's just it depends on your application -where you need to be running this code what kind of tasks you're doing with it so you know i -tend to be living in jupiter notebooks like everything to be concentrated in python like to be able to do my -contributions in python and not in c or c plus plus so i've chosen pimc -uh v3 and pymc as it's now renamed but you know i don't fault anyone who chooses to go down the stand rabbit hole -either i think they both have their strengths and weaknesses both uh numb focused supported projects so -and there's a lot of cross-pollination of ideas uh particularly manifested in rvs right rvs is a visualization -diagnostic library intended to be shared between pi and c and stan and several other -inference engines so really you know gets down to the nitty gritties of how do you choose any software package -uh what what your environment is like what your needs are um -chandra asks uh simulating conditional -probability can you share resources about that there's no resource that comes to my mind uh it's more about -taking those lego building blocks and putting them together if you had a specific question in mind you're saying -like a high school conditional probability problems probably you know involving colored balls in urns or -something like that uh i'd say that the best way to get help there is take a stab at it yourself -uh maybe you solve it great maybe you don't no worries uh there's a learning curve -here bring your you know partially worked example to discourse and there's an entire group of really -friendly people on discourse uh that sometimes includes me when i have the entire in the time but there's a great -community there that are very helpful in getting your questions answered so um -i don't know that there's a definitive reference for those sorts of problems my my advice in general would be take a -stab at it and then turn to discourse with your your partial progress and folks they are extremely friendly and -will help you um [Music] help you drive that to to resolution -and daniel points out yes adam downey's book i think think babies is probably the one -you're referencing is an excellent uh reference for those sorts of problems -um austin there's a couple of questions that i just put in the q a chat one is from luke and he writes hi austin great -talk have you played around with the infinite monty hall python problem with pi mc -so infinite doors rather than just three of them um infinite doors rather than just three -well pi mc does not deal well with infinities right just like any they're right you can't -specify an infinitely long numpy vector and if you think of acera as like numpy -fancy numpy not that's not technically correct but it's spiritually correct -it's hard to um answer that question -computationally i think and mathematically when i put my mathematician hat on because i am trained as a mathematician you know it -depends what kind of limiting process you're using to get to infinitely many rooms what the right answer is there so -the answer is no i have not done much thinking of it and when you start to deal with things that are infinite in pi -mc or any probabilistic programming language this isn't a an inherent limitation of pi m c you run the same -problems with stan when you try to implement things like gaussian processes or dirichlet processes -where there are you know potentially infinitely many parameters uh it generally takes a lot of hard work -um to find good approximations or truncations there so i think that's a subtle question and the answer is not -really um and then uh -there's sorry there's another question in the q and a tab maybe you answered it but i'm not sure does the best yes i i i -think i did get get to that one so if i come back i probably have time for about one more chandra's final question maybe -from the chat so pomegranate mc pyro tf probability how is it different -from pimc or stan yeah so these are all different things that let you specify -basically likelihood functions right and some of them are just specifying likelihood functions some are -performing inference on all on likelihood functions um so -they are all adjacent right if you think about pi mc it's a way to specify a model -perform inference on that model and then rvs is a way to visualize and quantify the uh the -inference you've done on that model uh each of these hits some or all of those -things that pi mc does and has flexibility more or less flexibility -in some of those places right there's no one probabilistic programming package to rule them all right so there's a couple -different things you need to specify your priors you need to specify your observational likelihood you need to -perform inference you need to draw conclusions from that inference if you think of those things as kind of layers -in getting value out of any probability model uh each of these packages speaks -to one or more of those layers it does then it does each of those layers differently -that has strengths and weaknesses in terms of completeness and flexibility that vary by those layers and so -uh not really possible to give an exhaustive answer here but cover some of the same ground is the -answer does some things more or less flexibly does some things more or less user -friendly right one thing that i love to buy mc is when you look at i said m is equal to pm.normal you know m as a -string zero one that's quite close to the mathematical notation you would use to write that model down very user -friendly is what pi mc goes for things like uh mc make you write a log -likelihood function on your own tf.probability a little less user friendly -um so just you know depends which trade-off is right for you but they play in similar areas and you can assemble -some of those other tools together to cover the same use cases as you would imc here it all depends on where you -need flexibility and where you need need or want simplicity and ease of use -i think that that's probably a uh a good place to call it a day and thank you again reishma and data -umbrella team and meanwhile and oriole on the pi mc side for organizing things again it's been a a real pleasure to -speak to all of you here today -oh ray schwein i think you're muted i wanted to thank you austin for your -presentation +use that to perform inference about unknown quantities, whether it's the size of napoleon's army at a certain +point in time, whether it's the probability that a coin is flat fair, whether it's the effect that the number +of pieces has on the price of a lego set, you know there's some unknown quantity in the world. + + +I'm going to tell a story about how it relates to the data I see and then I'm going to use the data I've seen to infer some things about that unknown quantity - so that is how probabilistic programming kind of turns that around. + +And it turns that around if you put your mathematical hat on a little bit and you think about turning that around. Okay, so if I have a story and I get data from that story, how do I turn that conditional probability around? + +Well, that's Bayesian statistics, right? Bayesian statistics is all about how do I go from the probability of B given A and turn that into the probability of A given B, right. So this is all about how do I reverse conditional probabilities? Same idea here, going from data to storytelling and turning that around is what really motivates me here. So this is kind of the philosophical level. I could say a lot more there, I'm not going to dwell too deeply. Let's dive into the mathematical level. + +So the mathematical level, this all plays out in something called Monte Carlo methods, right? We're going to be talking about advanced Monte Carlo methods here. And so what's a simple example of a Monte Carlo method? Well, I'll give you one right here. Let's generate 5,000 points that are uniformly distributed in the square that ranges from negative one to one, right? So that's shown here. + +Okay, and I apologize, some of these things will be duplicated because these were built to be rendered as slides but the formatting wasn't making the code legible, so we're being a little bit flexible here. So we've got our five thousand points in this square, negative one to one, right? Let's count the number of them that are inside the circle with radius one, so these are inside the circle with radius one if x squared plus y squared is less than one, right? This plot highlights those that are inside the circle of radius one. The green ones are inside the circle, the orange ones aren't inside the circle. And then you know, what proportion of these should we expect to be inside the circle, right? + +Well, it probably has to do with the area of the circle, right? The area of the circle is one, right? The area of the square is four. So if we take, you know, number of points, green points inside the circle, and we divide that by total number of points, which is five thousand, we should get the ratio of the area of the circle to the area of the square, which is pi over four. + +And sure enough, we add up the number that are in the circle, we divide it by five thousand, and we multiply it by four, we get not an awful approximation to pi, right? So this is what a Monte Carlo method is. A Monte Carlo method is all about using random numbers cleverly to approximate the quantity you're interested in, and we'll get into how that connects to probabilistic programming and Bayesian inference in a second. + + + +How does this connect to Bayesian inference? Well, what we've done here is actually approximate this integral, right? If you think back to calculus, this integral relationship is true, right? Because the square root of one minus x squared, that defines this quarter circle, right? So this integral gives you the area of this quarter circle here. Multiply that by four, you should get the area of the whole circle, which is pi. + +Right, so we, instead of doing this integral through maybe a trigonometric substitution or whatever you might have learned in calculus, approximated its value through random number generation. + +Bayes' theorem has often very many intractable integrals here, right? One way to restate the Bayes' theorem, as I showed above, is this way, where theta are your unknown parameters, and d is your data, right? So, if we want to know about our unknown parameters given our data, it tells us that it's data given unknown parameters. Probabilistic programming is all about specifying this quantity, right? Prior is not that hard; also specify the prior. But then, calculating this integral at the bottom is intractable, not possible, provably impossible for a lot of interesting models. And if we restrict ourselves to just models where you can calculate this quantity, life becomes a lot less interesting. We close off a lot of potential applications. So, that's what gets us to consider probabilistic programming using Monte Carlo methods, in this case with PyMC, although there are many different packages out there to do it with. + +So, all of that was great theory. Let's get down to practice now and let's talk about the Monty Hall problem. So, if you're not familiar with the Monty Hall problem, right, there's a famous game show in the United States going back 50 years, even maybe probably even longer than that, called Let's Make a Deal, right? The host was named Monty Hall for a number of years, and so that's where this comes from. And one of the games the contestants would play in Let's Make a Deal would be that Monty would show them three doors closed, right? Behind two doors were goats, and behind one door was a sports car, right? So, obviously, in the end, you want to open the door that has a sports car behind it. So, how this game would play out is that Monty would ask the contestant to choose a door, right, and not open it yet. So, say you chose door number one, the first door, right? What Monty would then do is open one of the other doors, in this example door three, but it could be whichever one, to show you a goat. And then this is where the drama comes in, he would ask you, do you want to switch your choice of door, yes or no? + + +Right, and this is one of the kind of first examples where conditional probability becomes a little bit counterintuitive, probably a lot of you know is that the answer is yes, you should switch your choice of door, right? Um, after he's opened a door to show you a goat, there's now a one-third chance that your door contains the sports car, and a two-third chance that the other unopened door contains a sports car. So, you should switch in order to maximize the chance that you choose the prize you actually want, right? And we could work this out mathematically here, I'm not going to run through it because it's tedious. The point is probabilistic programming will enable us to write some code to avoid doing this, right? Because it's possible in this situation but not particularly interesting, and a lot of situations it won't be possible to work out like this. So, it's always good when you're learning a new numerical method to apply it to a problem where you can get the answer with pen and paper to make sure they agree. That gives you confidence when you're moving to situations that you can't just use pen and paper for, right? So, you should be able to get a known answer in a situation that you can work out explicitly, so that you build confidence when in situations where you can't work out the answer explicitly. + +So, let's see how we solve this Monty Hall problem in PyMC. + +That was supposed to be a hidden cell. Thank you for bearing with me. So, what does the PyMC solution look like? Well, initially we have no information about which door the prize is behind, so I'm importing PyMC, I'm creating a PyMC model with this context manager on normal you know python stuff we do, and then I'm saying, hey, this prize comes from a discrete uniform distribution. All that means is each number between 0, 1, and 2 has the same chance of containing the door, right? There's a one-third chance it's behind door zero, the first door, one-third chance that's behind the second door, a one-third chance it's behind the third door. So here, I'm telling the story of how my data is generated before I get any information, don't know where the door is, the prize is, I should consider each door equally likely. + +Then we need to think about things from the host's perspective, right, which door is he going to open based on the door we have chosen? Well, we've chosen the first door, so he's not going to open that door, right? There would be no drama there if he opens it and there's a car, we just stick with the open door. If we open it and there's a goat, we switch doors and it doesn't really matter which one we choose. So, he's not gonna choose the door we chose, which is door one. Now, if the prize is behind door one, this is the first row, he won't open door one, we've already said, he could open door two or three, they both contain goats. That's fine. + +If the prize is behind door two, he's not going to open door one again that destroys any drama. He's not going to open door two also because he doesn't want to show us the car, because yeah, we would just switch to the door with the car. Okay, so he's forced to open door three, right? And similarly, if the prize is behind door three, he's forced to open door two. So this is how Monty makes his decisions of which door to open. So we can see if the prize is behind door one, there's a fifty percent chance he opens door two or door three. If the prize is behind door two, there's a hundred percent chance he opens door three. And similarly, if the prize is behind door three, there's a hundred percent chance he opens door two. + +So, in the spirit of probabilistic programming, we want to encode these chances he opens each door given where the prize is as code. So I'm going to import this package named asera, and I'll talk a little bit about Aesara in a second, so just bear with me for the moment. And I'm going to write p_open, which is the probability he opens each door, as a switch statement here. So this is basically the asera version of an if statement. + +So this is saying if prize is behind the first door, then we know zero percent chance he opens the first door, fifty percent he opens the second, fifty percent he opens the third. That corresponds to this line, right? If it's not behind the first door, a nested if statement: if he's behind the second door, zero percent chance he opens door one, zero percent chance he opens door two, a hundred percent chance he opens door three. That corresponds to this row, right? And then this final thing, if it's not behind the first door, not behind the second door, gotta be behind the third door, 100 percent chance he opens the second door. + +So why have I written this in terms of aet, asera tensor dot switch, asera tensor dot equal? Instead of regular Python if-else statements? We'll answer that question in a little bit, and it's all about increasing the efficiency of our sampling algorithms through calculus. Asera lets us differentiate expressions automatically without having to do any calculus with pen and paper in a way normal Python if-then-else statements do not. And so that's the benefit we get by writing this, which totally could be an if-then-else statement and somewhat awkward syntax, and we'll see that that's really worthwhile once we move towards more complex examples. So we've said our prior belief about which door the prize is behind. We've said what are the chances Monty opens each door given where the prize is. Now we know he opened the last door, the third door, and showed us a goat, right? + +So which door he opened is now a categorical random variable. It's either first, second, or third. The probability each door was open is that p_open we just defined through those switch statements, and we know he opened the third door since this is zero-indexed, the third door corresponds to this two, right, so we've said how where the prize is influences the likelihood he opens each door, and then what data we've observed. Now we're going to perform inference, and inference in PyMC is through sampling, right? This is where the Monte Carlo inference comes into play. So we sample from the posterior distribution. + +To use the Bayesian terminology here, we get an xarray array of samples. We're going to get, in this case, 20,000 samples from the posterior distribution of places the door the prize is behind in 20,000 different simulations of this, right? Just like when we were using Monte Carlo methods to approximate pi, we had 5,000 different samples there, a certain proportion of them were inside the circle. We use that proportion to determine pi. We're going to draw 20,000 random numbers here. We're going to see how many of them are zero, saying the prize is behind the first door. We're going to see how many of them are one, saying the prize is behind the second door, and those proportions are what's going to tell us the probability that determines whether or not we should switch doors. So we get these traces here, okay? + +These samples, and I'm going to check the posterior_prize. So this says how many times is the prize behind the first door, right? And this says one-third of the time the prize is behind the first door, given the information we've input, which means that two-thirds of the time it's behind the second door. So we should switch doors, right? This is the well-known result that we could derive with pen and paper, and I did on a previous slide. So, it's good we've recreated this result using probabilistic programming by telling the story of how our data was generated. We have inferred what are the chances the sports car is behind the first or second door, and we know the correct answer is to switch doors. So let's step back a bit and talk about the components I've just flashed in front of you. Two components we've talked about so far: we've talked about PyMC distributions, right? We saw discrete uniform and categorical distributions there. So any probability distribution almost any probability distribution you can think of from statistics has been implemented in PyMC, and if it hasn't, they're actually quite straightforward to add. Those are some good first contributions or to add new probability distributions. It's kind of, there's a lot of boilerplate code with some small tweaks you can use, and that's actually really good and important work. I like to think of these, given that I'm a Lego collector, it's kind of the Lego bricks of probabilistic programming, right? Distributions are the bricks. PyMC lets you put those bricks together in different ways to perform inference, and so in the case above, we used the discrete uniform and the categorical distribution. I've just chosen to highlight two to show flexibility. + +Obviously, there are normal distributions, there are zero-inflated Poisson distributions. If you were to click on this link, it takes you to the documentation shows you, you know, many dozens if not hundreds of distributions that are available to build up your statistical models from. + +And then what were those aet.switch, aet.eq things doing? Well, they were invoking asera, which is PyMC’s kind of tensor computational back-end, right? You can see a little blurb from the asera documentation here. I'm not going to read it out loud for you. You can think of asera as filling the same niche as, you know, PyTorch or TensorFlow fill in various other, you know, maybe deep learning ecosystems, which is it will optimize your numerical computational graph, and it will also allow you to calculate derivatives of that computational graph without having to do the calculations yourself. And that is really key, and what PyMC really leans on asera for, to enable efficient inference in high dimensions, which we'll talk about in a little bit. It's very much worth the cost of wrapping p_open in these asera.switch and asera.eq statements to get, you know, automatic differentiation and optimization for free. Once you start thinking in that way, it's not hard to port a lot of the common code you would write to that, and the benefits are great. + +So I'm going to walk through another kind of simple example to illustrate a couple more key concepts and ingredients of PyMC or dependencies, perhaps, before we ramp up to our more realistic Lego examples. + +We're going to take a data set from Anscombe's quartet, an interesting, you know, instructional data set of four different groups of data that have some similar summary statistics. I want to talk a little bit about how we can use PyMC to do robust regression today. So let's focus on the third entry in this quartet that is almost exactly linear except for one outlier. I'm going to show you how you can really powerfully use these kind of swappable distributions in PyMC to capture the linear trend here and ignore this outlier in kind of a toy example, but that's instructive for real-life applications. So let's see ordinary least squares. Let's first do something we know will be prone to responding to this outlier. + +What are the mathematical assumptions behind your good old ordinary least squares regressions? Well, they are that all values of the slope and the intercept b and the, the error, the noise variants are equally likely. So I'm going to build a PyMC model that encodes these things, and flat here is the distribution, the Lego brick that says all real numbers are equally likely to be the value of m, all real numbers are equally likely to be all the va, the values of b, all positive real numbers. That's what half here means are equally likely to be the value of theta, right? So you write out some math here: y is mx plus b plus some noise. This is one way to view it. This is equivalent statistically to saying y has a normal distribution with mean mx plus b, variance sigma squared. So that's how this is encoded in PyMC, really, really simple, right? And here you'll see again we're saying, "Hey, I observed this data y3, y3 being the y-coordinates of these data points in the third entry in the quartet." + +So we can come on down, and we can sample from this model again. You see PyMC's doing some sampling for us, and it's warning us some things. There were five divergences after tuning, so this is great. PyMC is not only trying to infer the values of m and b here, but it's going to warn us when it's not doing a great job, when our Monte Carlo methods may be failing. So this is a really powerful and user-friendly thing about PyMC. How can we visualize these divergences where inference is failing? Well, we can use one of the other important components of the probabilistic programming ecosystem here, which is ArviZ, a kind of sibling project to PyMC, to plot the values of m, b, and theta that are causing these divergences, right? + +So the idea here is that ArviZ is a library that contains a lot of pre-built visualizations and statistical routines that will help you understand the results of your inference that you did with PyMC and visualize, debug those results, right? So what two more key components that this uncovers: one is ArviZ, right? And ArviZ is really cool because it's agnostic about the probabilistic programming library you use to conduct the inference, right? The plots of posterior distributions or divergences or, you know, many other things you want to make are agnostic about whether or not PyMC is your probabilistic programming package of choice, or Stan is your probabilistic programming package of choice, or, you know, Numpyro is your probabilistic programming package of choice. There are many probabilistic programming languages out there. They all want similar visualizations. ArviZ exists to say, "Hey, if you provide the results of inference in a standard format, we'll build those visualizations for you no matter what upstream inference package you use." So ArviZ is really awesome. I strongly encourage everyone to take a look at it when we look at the results of inference, right? + +We took, we said pm.sample, stored this in ols, ordinary least squares trace. We look at the type of that, that's an ArviZ inference data object, okay? You just need to encode there was the samples from your Monte Carlo approximations in this object, get ArviZ to work on it. If we look at the posterior component of that trace, that's an xarray dataset. An xarray is a really awesome library that's, you know, I think about it almost as like pandas in more than two dimensions, right? It's a great way to define flexible, labeled arrays in arbitrarily many dimensions, and this is exactly what you want to do when you're doing invasion inference and visualization of Bayesian inference, the results of Bayesian inference, excuse me, and we will see where those labels are very powerful. + +So there's a question: why do we assume that it is normally distributed in the ordinary least squares model? That's a great question. You can motivate the ordinary least squares model a bunch of different ways, right? Usually, it's viewed as minimizing the mean squared error with a bunch of restrictions applied that turns out to be equivalent to placing a normal likelihood on your data. So it's just one of the assumptions that goes in here. There are a bunch of different equivalent ways to frame it, so great question. + +So what do we want to do? Well, we want to compare our inferred values of m and b to the robust values we would get if we ignored the outlier, and the outlier here happens when x is equal to 13. So we're going to use NumPy's polynomial fit of degree one, same equivalent to ordinary least squares regression, to get the slope and intercept, and we're going to compare those to our posterior distributions for m and b that we got using PyMC, right? And you can see here, the reference value, so the robust slope is a little bit smaller than, it's a little bit towards the low end of our posterior credible interval. We've slightly overestimated the slope, underestimated the intercept here. This has really meant the details of this visualization aren't that important. It's just that all of this came out of, you know, one line essentially of ArviZ because ArviZ is this powerful, showing what's the true value that I'm trying to get out of inference versus what's the posterior distribution that happened when I fit my model. + +So we can also plot the lines here that come out of this inference, right? And we can see, okay, the blue line is the average, the posterior expected value that you get using PyMC, and the red line is what you get when you calculate the same thing using NumPy, just to show that they agree when you when you match up your assumptions. So that's good, but we see, you know, this has definitely been pulled upwards by the outlier, underestimating the true intercept, overestimating the true slope due to this outlier. How can we swap out some of those Lego bricks, those PyMC distributions, in order to make this less sensitive to that outlier? Well, we can start out by saying let's regularize a little bit, let's move from ordinary least squares to ridge regression. For those of you that are familiar with the machine learning concept of ridge regression, what is ridge regression? It's really just normal priors on your coefficients, right? So instead of saying m is a flat distribution here, we're going to swap that out and say it's normally distributed mean 0, standard deviation 1. How to choose these parameters is a subject in and of itself. I've just chosen these here for convenience. You could change them. + +Yes, so Reshama has asked if y were distributed as a binomial, yeah, you could absolutely, you know, build say a logistic or a probit regression model this way. There are distributions for any kind of typical GLM that you've seen out there. I've written up examples on my blog austinrockford.com around, you know, negative binomial regression using PyMC or even, you know, ordinal regression to see what factors drive Lego set star ratings. So the number of distributions is really flexible, really is Legos. The goal is to enable you to build whatever models you want using these building bricks here. We're sticking with normal observed likelihoods just for simplicity since this is an introductory talk, but this can get as complex as you like. You can make a zero-inflated binomial model or something similar with the building blocks that PyMC provides. A great question, Reshama. + +So we're going to throw these on, these priors, these normal priors on the slope and intercept. This is equivalent to ridge regression. I'd have to do some math to determine what regularization penalty it is, but it is equivalent. We perform inference again. We get a little, a few, a couple fewer warnings, but there are still some divergences. We can check our plots again. Our true values, our robust values of the slope and intercept are actually even further from what we've inferred, so h. Well, this makes sense. Regularization of this sort tends to be more useful, let me find the correct thing, when your x values are extreme and not your y values. So we've added in robustness to extreme x values but not to extreme y values. Actually, Reshama, this will speak to your next question, which is how can we change that observational likelihood to be more robust against this extreme y value? Well, to get robust regression out of this, we can use the fact that StudentT distribution has fatter tails than a normal distribution, right? And so here the blue line is the standard normal distribution, the orange line is a Student's t-distribution with two degrees of freedom. We see the tails are fatter. This will respond, due to those fatter tails, this will respond, be less sensitive to outlying y values. + +So we'll specify the model for m, b, and theta same as we did before, but we'll change y_obs here from a normal random variable to a StudentT distribution. We have to specify a prior on our degrees of freedom here. I say it's a uniform between one and ten, right, because as the degrees of freedom goes to infinity, the t-distribution becomes a normal distribution. It's less interesting if the degrees of freedom is huge. We expect it to be small here because there is an outlier, but similarly, you know, mean is mx plus b, sigma is sigma. We perform inference. We see in our posterior plots, we've captured the true robust values of these things quite well. We can visualize the outcome lines here. The green line is going right through our, you know, non-outlier data points, and the outlier has been ignored. This is a little bit of a silly example, right, because there's no noise in these things outside of the one outlier, that's why these posterior distributions look real wacky. You could add a little noise here to make it more realistic. The point is these distributions are easily swappable. + +Chandra asks, "Can we model using Tweety regression?" I believe we can, if not, it's possible to add, just depends, has the distribution necessary been pre-built? + +How do we choose hyper-parameters, and can we go wrong by choosing a distribution? Well, just like, right, this is just one kind of tool in your statistical toolbox out of many. You can always go wrong choosing your statistical tools, so there are tons of ways to choose these prior distributions and evaluate the goodness of your choice that are really outside the scope of this talk. At the end, I will give some references that will include discussions of these sorts of things, but that's a great question. So yes, absolutely, you can make bad decisions if you don't choose the appropriate tools, as with any statistical method. And how to, how to decide whether you are in a good spot or not is something that's outside the scope of this talk, but there are many references out there that I will share. + +So why are we using Aesara again, and I'm going to move quickly here to make sure we have time for Q&A at the end. So apologies, but hopefully it will still be somewhat clear. Why are we using asera, why are we using that aet.switch, aet.eq, and under the covers here when we write, you know, pm.normal, there's a bunch of asera operations going under the covers when we call this class constructor? PyMC's just providing a nice layer over those for you so you don't have to think about them unless you want to. + +Why we use asera is to do something called Hamiltonian Monte Carlo, and what makes it Hamiltonian, what makes it Hamiltonian is that you want to take into account the geometry of the model you're specifying, right? And geometry in the sense of differential geometry is all about curvature, and curvature is about derivatives. In a past life, I was a differential geometer, and the way I think about it is curvature is always about the failure of partial derivatives to commute. So, you know, if you differentiate with x and then differentiate with respect to y, and then do the derivative of y with respect to y and then the derivative with respect to x, those two things are the same in flat space, on the plane they are. In general, different if you're on a curved surface. And that is how geometers, or at least a certain set of geometers, think about curvature, and that's actually really important for doing this kind of inference in high dimensional models, models that we're really interested in here as accurate representations of reality due to the curse of dimensionality, which you're probably familiar with from machine learning. What's the way the curse of dimensionality is reflected geometrically is that the volume of the unit sphere, as the number of dimensions increases, goes to zero exponentially fast, right? So you can think of the unit circle, and then the unit sphere in two dimensions, the unit sphere in three dimensions, the hyper-sphere in four, five, six, seven dimensions. What's that volume? It drops to zero really fast as the number of dimensions increases, even mildly. Here we can see that between 100 and 1000 dimensions, it's already, you know, 10 to the negative 196, right? What this means is that typical points in your data set near the origin, there's a nice definition of typical, but I won't get into what that exactly means here, are hard to find in high dimensions if we just treat the spaces flat. When we take into account the curvature of the space, they become a lot easier to find, and asera is what lets us do that by automating those derivatives we need to quantify curvature. + +So where does asera really help us? As a toy example, let's show using asera that the derivative of x cubed is 3x squared, familiar to all of us who have taken calculus, right? So we're going to define x as a scalar, y is x cubed, then I'm going to ask asera to pretty print the gradient of y with respect to x, that should be this derivative. Let's walk through this, and I'll tell you how to read it. + +So this says fill x cubed with x is one, okay? So one cubed is one, this term is just one, one times three, okay? So we've got this coefficient of 3 here, and then we've got times x to the 3 minus 1. So we've got 3 from this term, we've got x cubed from this term. Beautiful, right? So asera is doing calculus under the covers to quantify the curvature of space to make sure that we are not dealing with these absurdly small numbers that we would if we didn't acknowledge the geometry, all a bit hand-wavy right now. There are great references here. There's no way to go super deep on that theory in an introductory talk. I just want to wave at why it's important. + +So let me take about five minutes to run through this final example where the shape of the posterior does become important, and in that Tweety regression or that binomial or ordered logit or probit regression, this becomes a lot more important than our, you know, simple to, you know, univariate linear model with like two or three parameters. It becomes a lot more important as your model gets more complex and realistic. So let's talk briefly about a Bayesian analysis of Lego prices. So on the left here, you have a small sampling of my Lego collection. To my left here, there is some of it as well. + +I came at this with a motivating example of here's Darth Vader's Meditation Chamber, a set Lego released in 2021. Is this model worth the $70 they were charging it or not? How can we answer that question with data? So I went out there and scraped a bunch of data from Brickset, which is just a Lego fan site that has references all sorts of data. You can see it here. I've made it available online. You can find the data if you're interested in that sort of thing. You'll see there's about 6400 sets from 1980 to 2021. + +You can see there's a pretty good, I'm gonna move quickly here because I know we have some other things to cover off in the end, and I want to leave time for Q&A. There's a log-linear relationship between the piece count of the set and the price of the set. That's not surprising, right? Lego has both fixed and variable costs in manufacturing its sets. Larger sets obviously will incur those larger variable costs, so they're going to charge us more to cover their costs plus whatever profit that makes perfect sense. We see where Darth Vader's Meditation Chamber kind of fits in this. We can look at, to make this model a little more interesting, a little more realistic, we can look at how these prices have varied over time. + +So how has the recommended retail price per piece varied over time? And we could see where Darth Vader's Meditation Chamber fits in there. We can break this down by different themes, so we can see Star Wars, Creator, Harry Potter, and Disney. We can build a model, I don't know why my math is not working here, apologies for the last minute change, that says, you know, basically each component, the intercept, and then the price per piece has a time varying component, so a Gaussian random walk, and you can come into the notebook and see how this is built up here from our pm.normal Lego bricks and a theme component which is a hierarchical normal random variable. You can see how that's built up here. This is not a ton of code to specify a fairly flexible model with both time and theme effects. + +If we try to sample this without Hamiltonian Monte Carlo, I can do this using a Metropolis step, which is the simple naive thing to do. I get a ton of errors. If I calculate the r-hat statistics here, I see they're enormous, they're up above three for some of these things. R-hat statistics of one indicate convergence, so I shouldn't trust these results. This is one of the ways that PyMC tries to help you decide if your results are trustworthy or not. You can see this took about a minute. Let me do this with, right there, 351 parameters here, so there's a lot of them. Let me do this with the more advanced samplers, the Hamiltonian Monte Carlo, NUTS stands for No U-Turn Sampler, that PyMC provides. This takes nine minutes, ten minutes, but produces much better r-hat values that are close to one, indicating no problems with convergence there and therefore much more trustworthy posterior inferences. So the top is the using the Metropolis step that falls prey to the curse of dimensionality. You can see that these samples are wild all over the place. I have no reason to expect a true posterior distribution would have this wild form, so this shows inference problems, whereas the Hamiltonian Monte Carlo ones have this beautiful shape, much more plausible. + +Even though it took about 10 times as long to sample, the sampling efficiency is just so much bigger, right? So the sampling time is an order of magnitude longer for the adaptive Hamiltonian Monte Carlo. NUTS is HMC, yes, for Daniel's question. NUTS is a form of HMC, but the sampling efficiency in terms of effective sample size per second is an order of magnitude larger. So to get the same effective sample size with your Metropolis step, you'd actually have to sample that model for 25 hours, so 25 hours versus 10 minutes. I'll take 10 minutes any day of the week. + +So we can start to answer some of our questions pretty easily using PyMC here. We can look at the posterior distribution of what we would expect Darth Vader's Meditation Chamber to cost based on the fact that it's a Star Wars set released in 2021 that has, however many pieces, I think it's about 700 pieces. So our model's posterior expected value was $79. We see that the $69.99 Lego price to that is just below the median, so it's pretty reasonably priced. We can assess the pretty easily the effect of different set types on the price of Legos, right? So Creator sets are a little less expensive, Star Wars sets are a little more expensive, etc., etc. + +Here I've included some references, two books that are available online for free. This one on the left is a classic reference. The middle is a great reference written in R with Stan, but we have posted notebooks implementing it in Python with PyMC for all of the examples and exercises here, really great textbook to help you think about Bayesian inference. And then on the right is a textbook about Bayesian inference actually written by Osvaldo Martin, and Jung Peng, who are three PyMC core contributors that just came out, has an online version available, also available for purchase in print. + +I thank you for your attention again. I thank Reshama and the Data Umbrella team for their hard work putting this together, and Oriol and Meenal for their coordination there. And at that point, I will hand it back to Reshama to perhaps say a little bit more about the sprints. This has been intended to introduce to you, and then I can take any Q&A after Reshama fills that in, so thank you all for your time. I really appreciate it. + +**RS:** So yes, actually Meenal is going to be speaking a little bit about the upcoming sprint, so I'll wait. If you know if you're able to share your camera, that would be great. + +**Meenal:** Right, no? + +**RS:** Yep, we can hear you, so I'll share my screen. + +**Meenal:** Okay. Oh, I think my screen is visible. Yes, it's loading. Yep, now we can see the slides. Yep, yeah, so... +Yep, so we're going to be organizing a sprint with Data Umbrella like the entire PyMC team on 18th and 19th February, like that over that weekend. We haven't finalized the exact time slots yet, but we're hoping to do like two-hour slots over the weekend so we can accommodate more time. + +Two of these have already happened, and two more of these will be happening. I really think you should be able to, if you can make the time, either attend them or watch a recorded version if you want to contribute to PyMC. Like Austin gave a wonderful introduction on how to use PyMC and why you should be interested in probabilistic programming. + +Next, we have a talk that Ricardo will give, which is more about, you know, how can you contribute to PyMC, like how can you help out with the source code or other things? And then Martina will be giving a talk on how do you contribute to the documentation, like say if you're not really interested in contributing to code but you know a bit of statistics or maybe you just want to contribute to the technical part, which is again important. + +So these talks would help you out with that. And finally, then a month from now, we have the sprint. A couple of us in the PyMC team will be there to help you through pull requests, issues, and I mean, I think it would be really fun. And if you're new to open source, I would really ask you to join in. + +Yeah, that's that, and thanks to Reshma for putting together this whole thing. It's been like so much work. + +**RS:** Yes, , let me turn on my camera. Yes, I want to thank the entire PyMC team because they initiated this sprint, and there's a series of webinars. You can see that we've had two of them already that are on our YouTube, and we'll have two more upcoming, which will be available on our YouTube. And the meetup group is the best place to sign up because that's where you're going to be able to find out about the upcoming events. + +So Meenal, I have a couple of questions for you, like, for instance, for the sprint, how much of a background do people need to have to be able to participate in the sprint? + +**Meenal:** Like, I think people of varied backgrounds can contribute. You can contribute to various areas. Actually, you can, if you know some statistics, you can contribute to documentation. If you've already contributed to open source but you don't know stuff about probabilistic programming, there are still housekeeping issues that you can always contribute to, you know, there's a bug, there's a typo or anything. So as long as you're slightly comfortable with using GitHub, you have the local setup, you'll be able to make at least some meaningful contribution, and all of us will be there. So we'll try to help you out like you know, according to whatever you know. What can you best help out with? So I think, it won't be like that much of a problem. The background wouldn't be that much with any background as long as you can, like, you have your GitHub setup and everything. That's something that we want you to do, and yeah, after that, it's all good from there. + +Great! If anybody has any questions on the upcoming sprint, feel free to post in the chat, and it will also be helpful for us because it means that we'll be able to sort of include that information in the event. So, other people probably, if you have questions, other people have questions as well. + +Oh, by the way, Austin, we were speaking before, but the question that I asked you, which would be great to share with the group is, you know, if you want to talk just a little bit about how this, the new version of PyMC, which is a major major release, if you just want to talk about that for people who weren't here before? + +**AR:** Sure, sure! So if if you go and find on the internet, even even on my website, for example, you'll see a lot of materials that reference PyMC3, and it's been quite an evolution to get from PyMC3 to, what we're currently calling PyMC, which, which under the covers is the fourth version of PyMC. The, kind of saga has been that,I talked about Aserra here for a while, right? A previous version of Asera was supported by a research group, I believe out of the University of Montreal. It was called theano, one of the first kind of tensor libraries for deep learning. You could think like an intellectual precursor to your TensorFlows and and your PyTorches and your Terras and all that. + +A couple years back, that lab announced that theano would no longer be supported. PyMC, you know, was kind of casting around for a new tensor back end, dabbled in in PyTorch and MXNet and TensorFlow for a while. Eventually, you know, some great PyMC folks decided they would rather just take over asera, continue to maintain or take over theano, rename it as asera to show that it's a new project, continue to evolve that to not just meet the needs of PyMC, although that's a big part, but make it an interesting tensor package and modern tensor package in and of itself. And we've taken that, that opportunity to call this PyMC v4, introduce some modernizing changes around a focus on RVs, focus on Xarray that were not present, you know, 10 years ago or whatever when PyMC3 v3 were started. So we kind of took that back and changes an opportunity to make some lightly breaking changes,to make PyMC a little more modern. And I do say lightly breaking because by and large most of the code that I have on my website, that runs on PyMC3 will run on PyMC now if you just change theano to asera in a few of the right places. + +Same with places. So it was more of an opportunity to modernize when changing back ends than anything else is what I would say, and it's some really exciting stuff, I think, to to work with, and the asera folks are doing, really great and interesting work, not only to support PyMC but in kind of tensor computation in general. You can, you can do things like use C as a back end. You can use JAX as a back end. I think there are maybe, it's a NumPy back end for your Aesara code. There's all sorts of interesting, kind of lower level programming language focus things that they're doing there. So it is the foundation of PyMC v4 for sure, but it's so much more, so worth thinking about in its own right. + +**RS:** So I just want to read a comment that Oriol, who is a maintainer for PyMC, put regarding the sprint, which is as an example of how wide the range is: "Our docs have a glossary of statistical terms to which you can contribute with no Python or PyMC knowledge, only stats." + +I think really what's going to be unique, what is unique about this PyMC sprint is that you know, people from all different backgrounds in terms of what knowledge they have, whether it's statistics or Python or just very, can contribute to it. You don't have to be a PyMC expert to learn how to contribute to PyMC. I'm just going to turn my camera off, and I'll let Austin answer a few of the questions that are in the chat. + +**AR:** All right folks, let me scroll up a little bit. PyMC3 is not being continued. I forget to what extent it's going to get bug fixes. I think that is addressed, either on the PyMC website or in, the PyMC discourse, which is quite an active forum we have for answering questions, but no, most of the active development is happening on PyMC v4, which, like I said, it is almost, most models written in PyMC3 require some very light, like five minutes of porting to get them to work on v4, so it's all, you know, designed to be as seamless as possible, but you know, we can't promise it's 100% seamless. + +[Music] Arielle, Meenal, shared to answer Ariel's question. I believe then Claudia. Does, the definition of custom probabilities change substantially? So yes, one of the things that does change under the covers, I've talked a lot about, the interface, the the user-facing interface has not changed a ton, a little bit with the advent of RVs and and the emphasis on Xarray. The implementation of PyMC distributions under the covers has changed a fair bit. Now if you've done it before, it's not that hard to learn the new way, but it is not exactly what it was like before in v3, so it's where you've maybe implemented some custom distribution. There's a new pattern to follow. I would say it's a better pattern now that asera has kind of taken what theano did and made it more, more suitable for PyMC, so I would say there's a newer, better pattern to follow there, but there is a change. Yes. + + + +Same with a little bit of work if you've done something like implement a custom distribution. And, as you say, hopefully more documented, um, that's something that can definitely be helped in this sprint. I think there's a lot of, my experience contributing new distributions is a lot of find a distribution that's similar, look at the patterns there, adapt it to your situation and test, but I think that's a great place that our our documentation could for sure use some improvement is in adding new distributions. So I think it'd be a great, sprint activity to contribute to, also a great discourse topic. + +So Benjamin has asked, "Are there differences between probabilistic programming, noteworthy, significant reason why would you, why you would choose PyMC over Stan?" So I'll answer this on a couple of different levels. Mathematically, they're all kind of striving towards the similar ideal of adaptive HMC when you really get down to the nuts and bolts of it. There's going to be some algorithmic differences between how they implement it, but by and large, those should not matter to 95, 99.9 percent of people. I certainly couldn't tell you what they are off the top of my head, and I use PyMC quite a bit. + +You know, Stan versus PyMC depends, do you, you know, do you want to have an all Python interface to your programming language? Is that important for your application? If so, PyMC might be a better fit for you. Stan has its own, you know, model definition language, strengths and minuses of each as a, like a software library are you know, the language you use to specify models, whether that's pure Python for PyMC or you know, Stan's domain specific language that's kind of a, a subset of C++. Neither is better than the other. Stan is great software. We are actually great friends with the folks on the Stan team, get dinner with them all the time when I'm in New York, it's just it depends on your application, where you need to be running this code, what kind of tasks you're doing with it. So you know, I tend to be living in Jupyter notebooks, like everything to be concentrated in Python, like to be able to do my contributions in Python and not in C or C++, so I've chosen PyMC, v3 and PyMC as it's now renamed, but you know, I don't fault anyone who chooses to go down the Stan rabbit hole either. I think they both have their strengths and weaknesses, both, NumPy focused, supported projects. So and there's a lot of cross-pollination of ideas, particularly manifested in ArViZ, right? ArViZ is a visualization diagnostic library intended to be shared between PyMC and Stan and several other inference engines. So really, you know, gets down to the nitty-gritties of how do you choose any software package, what, what your environment is like, what your needs are. + +Chandra asks, "Simulating conditional probability. Can you share resources about that? There's no resource that comes to my mind, it's more about taking those Lego building blocks and putting them together. If you had a specific question in mind, you're saying like a high school conditional probability problem, probably you know, involving colored balls in urns or something like that, I'd say that the best way to get help there is take a stab at it yourself, maybe you solve it great, maybe you don't, no worries, there's a learning curve here. Bring your, you know, partially worked example to Discourse, and there's an entire group of really friendly people on Discourse, that sometimes includes me when I have the time in the time, but there's a great community there that are very helpful in getting your questions answered. So, I don't know that there's a definitive reference for those sorts of problems. My, advice in general would be take a stab at it and then turn to Discourse with your, your partial progress, and folks they are extremely friendly and will help you, help you drive that to, to resolution. + +And Daniel points out, yes, Adam Downey's book, I think "Think Bayes" is probably the one you're referencing, is an excellent, reference for those sorts of problems. + +**RS:** Austin, there's a couple of questions that I just put in the Q&A chat. One is from Luke, and he writes, "Hi Austin, great talk! Have you played around with the infinite Monty Hall Python problem with PyMC? So infinite doors rather than just three of them?" + +**AR:** Infinite doors rather than just three. + +Well, PyMC does not deal well with infinities, right? Just like any, they're right, you can't specify an infinitely long NumPy vector, and if you think of asera as like NumPy, fancy NumPy, not that's not technically correct, but it's spiritually correct. It's hard to, answer that question computationally. I think and mathematically, when I put my mathematician hat on, because I am trained as a mathematician, you know, it depends what kind of limiting process you're using to get to infinitely many rooms, what the right answer is there. So the answer is no, I have not done much thinking of it, and when you start to deal with things that are infinite in PyMC or any probabilistic programming language, this isn't a, an inherent limitation of PyMC, you run the same problems with Stan when you try to implement things like Gaussian processes or Dirichlet processes, where there are you know, potentially infinitely many parameters, it generally takes a lot of hard work, to find good approximations or truncations there. So I think that's a subtle question, and the answer is not really, and then + + +**RS:** There's, sorry, there's another question in the Q&A tab. Maybe you answered it, but I'm not sure. + +**AR:** Yes, I, I, I think I did get, get to that one. So if I come back, I probably have time for about one more. + +Chandra's final question maybe from the chat. "So Pomegranate, MC PyRO, TF Probability, how is it different from PyMC or Stan?" Yeah, so these are all different things that let you specify basically likelihood functions, right? And some of them are just specifying likelihood functions, some are performing inference on all on likelihood functions, so they are all adjacent, right? If you think about PyMC, it's a way to specify a model, perform inference on that model, and then ArViZ is a way to visualize and quantify the, the inference you've done on that model, each of these hits some or all of those things that PyMC does and has flexibility more or less flexibility in some of those places, right? There's no one probabilistic programming package to rule them all, right? So there's a couple different things you need to specify your priors, you need to specify your observational likelihood, you need to perform inference, you need to draw conclusions from that inference. If you think of those things as kind of layers in getting value out of any probability model, each of these packages speaks to one or more of those layers. It does then, it does each of those layers differently, that has strengths and weaknesses in terms of completeness and flexibility that vary by those layers. And so, not really possible to give an exhaustive answer here, but "cover some of the same ground" is the answer, does some things more or less flexibly, does some things more or less user friendly, right? One thing that I love about, PyMC is when you look at, I said "m = pm.normal , you know m as a string, zero, one", that's quite close to the mathematical notation you would use to write that model down, very user friendly is what PyMC goes for. Things like, MC make you write a log likelihood function on your own. tf.probability a little less user friendly, so just you know, depends which trade-off is right for you, but they play in similar areas, and you can assemble some of those other tools together to cover the same use cases as you would PyMC here. It all depends on where you need flexibility and where you need or want simplicity and ease of use. + +I think that that's probably a, a good place to call it a day, and thank you again Reshama and Data Umbrella team, and Meenal and Oriol, on the PyMC side for organizing things again. It's been, a real pleasure to speak to all of you here today. + +Oh Reshama, I think you're muted. + +**RS:** I wanted to thank you Austin for your presentation. + + +