- Transcript: https://github.com/data-umbrella/event-transcripts/blob/main/2020/08-matt-brems-missing-data.md
- Meetup Event: https://www.meetup.com/data-umbrella/events/271243174/
- Video: https://youtu.be/LJKYXq3WHTw
- Slides: https://github.com/matthewbrems/missing_data_short_version/blob/master/missing_data_slides.pdf
- GitHub repo: https://github.com/matthewbrems/missing_data_short_version
- Transcriber: Barbara Graniello Batlle
Ty: If you have any questions you can use the Q&A function. So, if you're on zoom, and you hover over the video, it should be on the bottom below the video. It would say Q&A, and you can ask questions there, and then we'll spend some time to answer them during a break in the middle of the talk, and then at the end of the talk as well.
Ty: Hey Matt, I saw you just logged in. Are you good to go?
Matt: Yeah, good to go whenever you are.
Ty: Cool, yeah, I just, I'll just do a quick intro talk about the meetup group, talk about GA. I'll take like two minutes. I'll start at 6:05 and then you can hop into your presentation.
Matt: Sounds good.
Ty: How's the new gig?
Matt: It's going pretty well. Week two, yeah no no complaints yet. It's, I've been dropped a little bit into the deep end, but that's okay. It's what I was sort of looking for.
Ty: Yeah, is it, what's it like onboarding during a pandemic?
Matt: So, in my current role, first off, it's interesting. Secondly, there wasn't a ton of onboarding. There's a, there's a bunch of stuff that I have to do kind of all on my own time you know, compliance and all of that, but there hasn't been a ton of onboarding, so I don't know if that's just the nature of me joining in the middle of this project or something else, but yeah, I know, I know about six different people and that's about it.
Ty: Okay yeah, that's interesting. It's, it's weird because in a sense, like remote work, or this environment we're in, you cut a lot of the fluff, cut a lot of things, and I guess some people might think they're fluff or not, but right, like a lot of the stuff you do the first day at work, or the first week, it's like meeting people, you know, all sorts of meetings and things like that, that may or may not have to do with actually the work you have to do, but they're just like work culture events right, and so it looks like some companies are just cutting it out, and you know, it's just like whatever.
Matt: Yeah. It'll be interesting when we have to start going back into the office, or rather, if we have to do that, what things will look like then. If there will be a, like a retro onboarding or anything like that.
Ty: Yeah.
Matt: Well, a lot of uncertainty there.
Ty: Is FINRA like, I don't know like the culture at FINRA, is it like, very much like, when the government opens they will open up? Because I know in New York City, you know, like offices opened up, and obviously companies like GA and other tech companies are like let's let's hold back, but I have some friends that work in investment banks, they have to like show up to work next week, like full time.
Matt: We, we haven't gotten anything about that yet. I think we're following the, to my knowledge, and what I've heard, we're following the, like the schools in DC, so ...
Ty: School schedule.
Matt: We'll see, but haven't heard anything yet, so I would be surprised if we go back before like August 15 just based on where we are now, but who knows.
Ty: Okay, well we have like, we have, yeah, we still have a couple of people trickling in, but I'll just start my intro. Can everyone, can you see my whole screen now?
Matt: Yep.
Ty: Okay. Sometimes I forget if you go fullscreen on Google Chrome. So, thanks everyone for attending during these uncertain times. I hope you'll get as much out of this talk as I'm planning to. So, the talk is called "Good, Fast, Cheap - How to do data science with missing data" and Matt Brems will actually be leading the talk. I'm just here to do an intro, and moderate, and login to my zoom account. So, just a quick agenda: we're gonna do an intro, I'm just gonna introduce Matt, I'm gonna introduce kind of our sponsors and the meetup group, and then we will open up for Q&A. As I mentioned before, zoom has like a built-in Q&A function, so if you hover over this video with your mouse, on the bottom below the video, you'll see a thing called Q&A, so you can ask questions directly there. This will be recorded. It will be sent out in a couple days, call it two or three days, automatically by the GA team. If you do not get the recording for some reason it could be in your spam box, but otherwise, just email your local GA kind of campus and, and they'll be able to sort it out, or they'll just email me and then I'll email them the recording.
Ty: So, Matt Brems is the speaker. He actually just moved to FINRA, maybe like two weeks ago, but before that, he was the global instructor for GA's Data Science immersive program across the United States, and before that he was working in the political consulting realm, and if you want to follow him on Twitter his symbol is, or his tag, his handle is @matthewbrems.
Ty: I wasn't able to figure out if your whole bio was, you know, 'twas too long for the slide.
Ty: Then the event is run by Data Umbrella. So, the mission of Data Umbrella is to provide a welcoming and educational space for URGs or underrepresented groups in the fields of data science and machine learning. You can learn about upcoming events and the mission at dataumbrella.org, and you can follow them @dataumbrella on Twitter. Data Umbrella is running weekly events during this time, like weekly virtual events, so if you check the page, there'll be another really interesting data science event within the next seven to ten days, call it. PyLadies is also a co-sponsor. It's the New York City chapter of a large national group. So, it's a group for Python ladies and non-binary people of all levels of programming experience, and you can check the home page of the national organization pyladies.com, and then their Twitter handle is @NYCPyLadies. Last but not least, the company where I work, where I met Matt Brems, and which is letting us borrow their webinar zoom account, is General Assembly. So, General Assembly is a pioneer in the education and career transformation space, specializing in today's most in-demand skills. So, the homepage is generalassemb.ly, or generalassemb-dot-l-y, and then the Twitter handle is @GA. We run a lot of cool classes in tech, data, marketing, product management. If you're interested in learning about the classes, you can also ask questions to me and I can help sort that out.
Ty: Without further ado, we will move on to Matt's talk.
Matt: Awesome.
Ty: I'll just stop sharing my screen and then you can yours.
Matt: Yeah. I'll go ahead and start sharing my screen as well. Good evening everybody, like, like Ty shared, my name is Matt Brems. I use he/him pronouns. I'll go ahead and share my screen here, so hopefully you'll be able to see it. I dropped a link in the chat box to github where you can find my content, so if you would like to follow along you're certainly welcome to. I'll be focusing exclusively on slides here; however, in the repository there's code, and there will be clear breaks in the code that show when we'll be shifting from one slide to the next, and where the code sort of fills in the blank.
Matt: I'm seeing that this is not sharing. So, let me, actually Ty, really quickly, are you able to see my slides?
Ty: So, no. It just says: "Matt Brems has started screen sharing."
Matt: Okay. Well let me go ahead and restart sharing that. I appreciate folks' patience here with this. This is my first time setting up this, my first time using this iPad. I switched iPads since I'm no longer using my GA one, so we will, let's see.
Ty: Yeah, if everyone's wondering, the slides from this presentation are in that link that Matt shared in the chat, in that github repo, there's a PDF.
Matt: Right, so this is not working. So, what I'm gonna do, I've got the slides on Google Slides, so I'll just go ahead and use the Google Slides version, and edit through that instead of, instead of working on my iPad. So, I will not get to use my, my stylus. That, that is, that is okay. So, let me stop sharing that, and I will go ahead and reshare my screen here. Again, appreciate folks' patience with this.
Ty: Oh! Were you trying to cast your iPad into the zoom?
Matt: Yeah. I've got a, okay, yeah, I've got it.
Ty: Nice desktop.
Matt: Thank you. Uh, you should be able to see my screen now. My Google Slides have now pulled up. Can you see them?
Ty: Yep, we can see it now.
Matt: Okay, I saw Ty's face light up, so I imagined that yeah, that, that's coming through now. So, I won't be able to annotate with my, with my Apple pencil, but, but that's okay.
So, thank you very much Ty, thank you very much everybody for, for having me join. Like I mentioned, my name is Matt Brems, I use he/him pronouns, and what I'm here to talk with you about is: how to do data science with missing data. This is a really challenging problem to work with, and to try and grapple with, but I think it's an important topic, because if you do data science you have inevitably run into challenges with missing data or missing information, and the ways that we try and handle that, probably are not the ways that we should be handling that. So, I've got some stuff here, you can see this in the slides. Ty has already talked about my background, so I'll go ahead and skip beyond this. You're welcome to, to hit me up afterwards if you'd like to talk about any of these experiences.
The lay of the land for tonight, so we're gonna start by talking about missing data, we'll get into strategies for doing data science with missing data, and, provided that we've got time, we're gonna wrap up with some practical considerations and warnings for you to keep in mind. As this goes on, please feel free to, at any moment, drop notes in the slack, or in the zoom channel, so I will keep an eye on that, as well as the Q&A. So, at any point feel free to do that, to drop stuff in there, and I'll try and respond kind of in real time.
So, how big of a problem is missing data? And this is just gonna be the, the very quick start of it. It's a really challenging question for us to answer, because what's going to happen when we are trying to work with missing data and we try and quantify how big of a problem it is, is that, from a practical point of view, and it sounds trivial to say this, we can only see what we observe. So we're only going to be able to actually see the data that we've gathered. We don't know the value of that missing thing itself. So the only way for us to be able to quantify, or understand, the magnitude of how big of a problem missing data is, is that we can use simulated data to try and help us answer that question. Now, in the interest of time, we're not going to go through actually generating the simulated data and looking at this, but if I move forward to this slide, if you would like to check this out, all of the code is pre-written for you in the notebook.
So, in the repository, I've got three different sets of notebooks. There's one with a prefix 00, one with a prefix 01, and one with a prefix 02. So, you can run that on your own, if you would like, but in short what you would see in that notebook is, we create some data, or we generate a complete dataset, and then we take twenty percent of those observations and turn them off, or we set those to be missing, and we see how much of an impact that would have on our model. And what we notice is that, if we were to look at the slope and the y-intercept of our simple linear regression model, there's actually a really, really, really large effect, and so that is a quick way for us to try and quantify how bad, or how big of a problem missing data is. Now that depends on a whole host of factors: How much data do you have? What is the type of missing data that you're dealing with? What type of model are you trying to fit? How many variables do you have? All sorts of things factor into that, but in a very, very simple case, we can see that missing data is really going to undermine a lot of our inferences and the conclusions that we may try to make.
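
A minimal sketch of the kind of simulation described above. This is not the notebook's code: the data-generating line, the noise level, and the rule for which twenty percent of rows go missing (here, the rows with the largest `y` values) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Generate a complete dataset from a known line: y = 2 + 3x + noise.
n = 1_000
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(0, 5, size=n)

# Fit on the complete data. For deg=1, np.polyfit returns [slope, intercept].
slope_full, intercept_full = np.polyfit(x, y, deg=1)

# "Turn off" 20% of the observations. Assumption: the missingness is not
# random; the rows with the largest y values are the ones that go missing.
cutoff = np.quantile(y, 0.80)
observed = y <= cutoff

# Refit using only the rows that are still observed.
slope_obs, intercept_obs = np.polyfit(x[observed], y[observed], deg=1)

print(f"complete data: slope={slope_full:.2f}, intercept={intercept_full:.2f}")
print(f"20% missing:   slope={slope_obs:.2f}, intercept={intercept_obs:.2f}")
```

How far the two fits diverge depends heavily on why the data are missing; if the twenty percent were dropped completely at random, the change would typically be much smaller.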
So, this brings me to what I want to get into, which is: What is a realistic approach for us? So, if you're familiar with the good, fast, cheap idea in project management, what that means is that you can come up with a project that is good and fast, and if you come up with a project that is good and fast, what's going to end up happening is it's not gonna be cheap. That is, it will be more expensive to be able to do that project, because if somebody wants something done quickly, and wants something done well, people will probably have to pay top dollar for that. On the other hand, you can think about a good and cheap project, so sometimes people will say: "hey I need a project that is high quality", and they don't want to pay a ton of money for it, and that's a very realistic scenario to come up, but the challenge is: if somebody's not willing to invest in it, and you want something that's high quality, generally, that will take a large amount of time in order to deliver that solution. So here we see the overlap between good and cheap, is that it will take time to deliver. And then finally at the bottom here, we see fast and cheap. Most frequently, in my personal experience, people will say: "hey I want to pay this low dollar amount for a project and I need it done tomorrow." Well, what's gonna happen is, is that if something is done fast, and that something is done on the cheap, it's not going to be the best quality in most cases. So, because of that we need to think about: How can we take this approach and apply it to missing data? The reason that I bring this up is that oftentimes what clients, or managers, or anybody else wants, they will say: "I want something that is good and fast and cheap" but that's not going to be feasible. Connecting this directly with missing data, we think about what missing data means in terms of a fast and cheap analysis, that's going to be: you just drop all of your missing values, or you do a single imputation. 
If you want an analysis that is done well, and is fairly inexpensive, then you have to fill in your missing values, or handle missing data in what I would call the proper way. So we can talk about proper imputation, or the pattern submodel approach, and we'll get into what those mean shortly. And then finally, if you want an analysis that's good, and your analysis to be very quick, then you should gather your data in a complete manner. The downside of that, of course, is it's incredibly expensive to do that. You have to figure out how can you collect complete data, without missing any data, and oftentimes, you have to pay top dollar for that, something that might be a hard sell when you're talking with your boss, or your client, or somebody else. So, what you should come away with this evening is not "this is the specific way I need to always handle my missing data", but a better understanding of: What are the trade-offs in missing data? What are the different challenges, or the trade-offs, if I go with one approach versus another approach?
So, given that introduction, what I'd like to do is talk about what are strategies for doing data science with missing data. So the first thing to do is, let's talk about how to avoid missing data. So, something that I think is really important to note is that it's usually going to be more expensive up front, but cheaper in the long run to avoid missing data. So, as you think, and this is probably the, this is not the best way of, sorry not best way, I'm trying to come up with the right phrase for it, this is not the, this is not the fanciest or coolest approach to missing data. For me, to spend time to talk about avoiding missing data, you're like: "yeah, like okay, I know that it's better for us to collect data as opposed to collect data that is missing", and I get that, but it's important to talk about this, because oftentimes, in the long run, it's going to be a better thing for you to avoid that missing data upfront. Depending on the jobs you have, or the jobs you want to have, if you're working in an organization where you're gathering survey data, or you're working to collect data in some capacity, if you have any control over that, there may be small to moderate design changes that can be implemented there, that allow you to gather significantly more data. If you gather more data you don't have to invest time in how to handle that missing data later, you can just use the entirety of your data. If you've got more data, your inferences, and your predictions, and everything, tend to be more precise. Your variance is lower, and so because of all of this, it's often better for us to try and avoid missing data upfront, if we can.
So, I want to take a moment to talk very briefly about some of these, again you can read all of these on your screen, but some of the things that I think about are for example: decreasing the burden on your respondent, or minimizing the number of questions somebody has to respond to. I responded to two surveys earlier today in Survey Monkey. A former colleague had posted some stuff on LinkedIn, and I filled those out, and they were, I was willing to do it even on my phone because they were relatively short surveys. I like the Chipotle app. I eat from Chipotle quite frequently, and up until actually like last week, what they would do is, if you ordered online through the app they would follow up about an hour later with a green smiley face or a red frowny face and say: "Hey how was your dinner this evening?", and you could click the smiley face or the frowny face, and if you clicked the smiley face they said: "Hey thank you so much." If you clicked the frowny face you got to put a couple of checkboxes and say this is what I, this is what I didn't like, or this was, this was not satisfactory, and that was it. So many other organizations, there are so many other data collection mechanisms, end up requiring you to fill out twenty, thirty, forty, fifty questions, and so what ends up happening is that you often will create missing data by the design of what you're looking at, as opposed to anything else. So making some changes on how you can decrease the burden on your respondent, maybe making questions closed-ended instead of open-ended, like, like a fill-in-the-blank question.
One other note that I want to make here is thinking about, thinking about improving accessibility, that's a very, very important point. So, there are lots of different ways you can get into this. We can think about language accessibility. We can think about readability. We can think about individuals who may be hard of hearing and ways to gather data from individuals, but it's important to think about accessibility and inclusivity when we design that. So, if you are part of an organization where you are gathering data in some capacity, is there a way to improve that accessibility to others? For example, when I was doing polling and surveys in the context of politics, we would administer surveys in multiple languages, specifically English and Spanish, when we were calling different populations, or specifically, different states that had a, a particularly large Hispanic population or Spanish-speaking population. That was something that we wanted to do, because otherwise we were just leaving out broad swaths of the population, which would, of course, down the road, compromise our inferences. So you can make a compelling business case for doing something like this. It's not, I mean, accessibility, in my opinion, is in and of itself a valuable goal. In addition to that, I think that it's important to recognize that when attempting to encourage other people, or share with other people, that they should take some action, or invest some funds, or some energy in that, that there are some positive business effects to it as well.
Moving on to the next slide. So we talked about avoiding missing data. How do we ignore missing data? Well, the very, very short summary is that we're going to assume that any observation that we've observed is similar to those observations for which we are missing data. When we ignore, we're making an implicit assumption, which may or may not be a valid thing to do. When I was in grad school, a professor shared that a very general rough guideline is that: if you are missing less than five percent of all of your data, you may be okay ignoring that data that's missing. Now if you are trying to do something like supervised learning, you fit a model where you've got a bunch of inputs and an output, your Y variable, if you're missing a ton of data from your Y variable, then, even if you're missing less than five percent of your data overall, that may compromise your inferences too much, you may not be willing to do that. Or if there are certain variables that you know, or believe, to be really meaningful and you're missing a lot of data from those, maybe ignoring missing data isn't the right way to go. Now, ignoring missing data is effectively what most software is going to do by default, in R, in Python, in something else. If you just put your data into your model and press GO, or dot fit, or whatever it is you choose to do, and didn't handle your missing data in some capacity or in some way, then that's effectively ignoring it. Your model, or your software, is almost certainly going to drop all of your observations that contain one or more missing values in it, and that may be okay to do if that number is relatively small, but I want to emphasize up here, there is an assumption that you are making with that, and that may or may not be a valid assumption to make.
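
A small sketch of what "ignoring" missing data looks like when done explicitly in pandas. The column names and values are hypothetical. Whether a particular library drops incomplete rows silently or raises an error on missing values varies, so making the drop explicit shows exactly how much data is lost.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with a few holes in it.
df = pd.DataFrame({
    "age":    [34, np.nan, 51, 29, 42, np.nan],
    "income": [52_000, 61_000, np.nan, 48_000, 75_000, 58_000],
    "score":  [6.1, 5.4, 7.2, np.nan, 6.8, 5.9],
})

# Share of individual cells that are missing, overall and per column.
overall_missing = df.isna().to_numpy().mean()
per_column_missing = df.isna().mean()

# Ignoring missing data means keeping only the fully observed rows.
complete_cases = df.dropna()
rows_lost = 1 - len(complete_cases) / len(df)

print(f"cells missing: {overall_missing:.0%}")
print(f"rows dropped:  {rows_lost:.0%}")
```

The share of rows dropped can be far larger than the share of cells missing, which is one reason a rule of thumb stated in terms of "percent of data missing" deserves a second look before relying on it.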
So, the last thing that I want to talk about is: how to account for missing data. I mentioned how to avoid missing data upfront. If you can't avoid it you might say: "can I ignore it?" Well, here before we account for it, and if we can't ignore it, then we have to account for it, but before getting into that, I want to shift our mindset a little bit because there is a, a belief that we can just plug in those gaps in our data, that you know, and perhaps you were, perhaps someone expected that you would be able to come here tonight and I would give you a new Python package that allows you to fill-in missing data, and, and you've got that technique you can put in your workflow, and in your wallet, and kind of move on your way, but the problem with that is that you have to do this in a specific way or we're really just making up data. Making up data has all sorts of issues: a) we might be wrong; b) it's not an ethical thing to do, in my opinion. So, because of this, we need to be very careful about how we would fill-in some of those gaps, or how we, how we tackle missing data, but I'd like to shift our mindset a little bit and say in most cases we're not really fixing missing data. It's not like we just have this new step in our workflow where I fit some, some method in pandas or in scikit-learn and then move on with the rest of my day. We're really just learning how to cope with missing data. So given that shift in our mindset that we're really just learning how to effectively and in a principled way cope with our missing data, let's move beyond.
So, I want to talk about how to account for missing data, and there is code in the repository to go through both unit missingness and item missingness. So I want to, I want to share that with you, and that's again in the repository, if you'd like to take a look. One note that I want to make is, please, again, feel free to drop questions in the chat, if there are questions that you have, because I want to make sure that I can answer them as we go. I really want this to be as helpful as possible for, for each of you. So, let's talk about unit and item missingness. There are a couple of different ways that data can be missing. Unit missingness is where we're missing all of our values from one observation. So, for example here, index 3, if I'm gathering data on individuals, and let's say that person 3 just did not respond to my survey, or for whatever reason, I have no information from this person. If I have NA's for all of those, that would be an example of unit missingness. This person did not share their information with me and so I have no information. Item missingness, or, I like to refer to it as Swiss cheese missingness, is where there are holes in your data. So indices 1, 2, and 10,000 have this. For example, for index 1, we do not have information on age or income, but we do have information on sex here. For individual 2, or index 2, we do not have sex, but we do have access to age and income data, and then, all the way down to row 10,000 we're missing one value here in the income column.
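
The distinction between unit and item missingness can be sketched in pandas as follows. The rows are hypothetical stand-ins for the slide's example (the values themselves are made up), with index 3 missing everything and indices 1 and 2 missing only some fields.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the spirit of the slide's example.
df = pd.DataFrame(
    {
        "age":    [np.nan, 27, np.nan, 45],
        "sex":    ["F", None, None, "M"],
        "income": [np.nan, 64_000, np.nan, 81_000],
    },
    index=[1, 2, 3, 4],
)

missing = df.isna()

# Unit missingness: every value in the row is missing.
unit_missing = missing.all(axis=1)

# Item ("Swiss cheese") missingness: some, but not all, values are missing.
item_missing = missing.any(axis=1) & ~unit_missing

print(df[unit_missing].index.tolist())  # rows with nothing observed
print(df[item_missing].index.tolist())  # rows with holes
```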
So, the way that we handle unit and item missingness is a little bit different. So, in terms of unit missingness, the very, very quick summary of how to handle unit missingness, and I am, let me actually go ahead and pull up this notebook just to, very quickly, show what this looks like. I'm gonna go ahead and pull open this Jupyter Notebook. If you do not have Jupyter Notebook on your computer, or if you're not familiar with Python or, anything like that, that's okay. I'm probably only gonna spend about two to three minutes talking about this, but I do want to pull this up as an example. So when we are, oops that is item missingness, what I meant to do is pull up the unit missingness one, I grabbed the wrong one, so I apologize. I'm gonna move back over here. I'm going to go ahead and shut that down and open up the Jupyter Notebook 01, unit missingness. I'll drop that in the chat here, 01_unit_missingness.ipynb, which can be found in that repository. In my experience, the most common method of handling unit missingness, where we're missing an entire row of data, is if we have supplemental data on that individual, to do something called weight class adjustments, where we take our observations and we break them into classes and then we will weight them before doing our analysis.
So, for example, let's say that I'm working in HR analytics, so I'm working in human resources and I want to understand how satisfied are individuals within our organization. Let's say that, to make this simple, we have two different departments, we have a finance department and an accounting department, for which I want to study individuals, and let's say that, maybe when I administer these surveys, that in the finance and accounting team, they're split perfectly evenly, fifty percent of people in finance, fifty percent of people in accounting. Let's say that maybe people in finance had too much other stuff to do, or were less responsible, or whatever kind of motivation you want to ascribe to them, and let's say that people in finance were less likely to respond to my survey, and let's say that people in accounting, whether it's because they had less on their plate, they're more organized, they're more conscientious of this, they just wanted to reply, whatever else, let's say that accounting people responded more to my survey. So, because of this, if I scroll down here, what we're going to do is, we're going to see that, if I look at all of my survey responses, so let's say I administer this survey, fifty percent of people are in accounting, fifty percent of people are in finance, but when I get my surveys back I get a disproportionate number of responses in the accounting department. Here, about seventy-seven percent of my responses are in accounting, meaning that only about twenty-two or twenty-three percent of my responses are in finance. Well, if I was going to just take these values and I was just gonna do a simple average to understand on average how happy are my employees, I might be putting some additional bias in my model here, and that bias may come in because I received way more responses from accounting than from finance.
So, what I would like to do, the strategy that we can employ is something called a weight class adjustment where I'm going to basically down-weight all of my people from accounting, I'm going to up-weight all of my respondents from finance, and what that's going to do is, going to put them back on an equal playing field, because again, fifty percent of people were in finance and fifty percent of people were in accounting.
So, the way that we do that is, we take our full sample of people. All of the one-hundred percent of people who we administered those surveys to, both the observed and the missing. We're going to lump them all together and we break them into subgroups based on characteristics that we know. In this case I know accounting and finance. I'm going to give every individual a weight as well. So the weight for people in group i is going to be what's the true percentage of people in that group divided by what's the percentage of observed responses in that group. So, for example, in the accounting group, the true percentage of people who were in accounting is one-half, divided by the percentage of responses from accounting, which was as we saw up here, about seventy-seven percent. I let Python do the math and so the weight for each accounting vote is about sixty-four point six percent. I do the exact same thing for finance, and what that means is that each finance vote gets a weight of two point two. So, for every finance person, this is finance, and this is accounting, every single finance person who replied, they get a vote that's two point two times. So one person submits a survey, I'm gonna up-weigh them two point two times. For every accounting person who responded, instead of each person getting one vote, they effectively get point six-four-five votes. If we want to take the weights in each of those groups and multiply that by the number of responses that we get, that ends up equalizing things so that the total weight from all of my accounting responses and the total weight from all of my finance responses end up being equal, or in this case, almost exactly equal.
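
The weight calculation described above, weight for group i = (true share of group i) / (observed share of group i), can be sketched in plain Python. The response counts below are hypothetical, chosen only so that the observed split lands near the 77 / 23 percent figures mentioned in the talk; they are not the notebook's numbers.

```python
# True share of each department among everyone who was sent the survey.
population_share = {"accounting": 0.5, "finance": 0.5}

# Hypothetical response counts (assumption), giving a 77.5% / 22.5% split.
responses = {"accounting": 310, "finance": 90}
total = sum(responses.values())

# Observed share of responses coming from each department.
observed_share = {g: n / total for g, n in responses.items()}

# Weight per respondent: true share divided by observed share.
weights = {g: population_share[g] / observed_share[g] for g in responses}

# Total weight carried by each department after the adjustment.
weighted_totals = {g: weights[g] * responses[g] for g in responses}

for g in responses:
    print(f"{g}: weight={weights[g]:.3f}, total weight={weighted_totals[g]:.1f}")
```

With these counts each accounting response counts for roughly 0.645 of a vote and each finance response for roughly 2.2, and the two departments end up carrying equal total weight.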
Once you have created those weights, what you can do is just pass them in to scikit-learn. So, you'll create a column of weights, I've done that here, just added a column in pandas, df bracket weights, and then what we can do is use that in order to do more complex analyses. So, if I were to just calculate, for example, the raw average of employee satisfaction, I get an employee satisfaction score of about five point seven, but if I calculate the weighted average based on my employee satisfaction score, it's significantly lower, I get here five point four-five. So, that, that average score went from five point seven down to five point four-five. A decent drop, and that's because people in accounting, this is based on the data I generated up-top, but people in accounting were on average happier with their jobs. People in finance were, on average, less satisfied with their jobs. What ends up happening though is, when accounting over-responds and finance under-responds, is that that's gonna skew our results. Now, if you were to take this information, and you want to build a more sophisticated model with this, you can do so by passing df weights, if you're, if you're a user of Python, and specifically scikit-learn, if you want to, you can pass df bracket weights, that column, or that vector of weights, when you fit your model, and it will weight your model's results based on those, those weights that you've given it. So, you could do this for a linear regression model, or you could do this for a random forest, or something else if you would like to.
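
A hedged sketch of using such a weights column, first for a weighted average and then when fitting a model. The data are simulated here and the `tenure_years` feature is hypothetical, so the numbers will not match the five point seven and five point four-five from the notebook. In scikit-learn, per-row weights are passed through the `sample_weight` argument of `fit`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical responses: accounting over-responds and is happier on average.
df = pd.DataFrame({
    "department": ["accounting"] * 310 + ["finance"] * 90,
    "tenure_years": rng.uniform(0, 10, size=400),
})
base = np.where(df["department"] == "accounting", 6.0, 4.5)
df["satisfaction"] = base + 0.05 * df["tenure_years"] + rng.normal(0, 1, size=400)

# Weight = true share (assumed 50/50) divided by observed share of responses.
observed_share = df["department"].map(df["department"].value_counts(normalize=True))
df["weights"] = 0.5 / observed_share

raw_mean = df["satisfaction"].mean()
weighted_mean = np.average(df["satisfaction"], weights=df["weights"])
print(f"raw mean={raw_mean:.2f}, weighted mean={weighted_mean:.2f}")

# Pass the same weights when fitting a model.
model = LinearRegression()
model.fit(df[["tenure_years"]], df["satisfaction"], sample_weight=df["weights"])
print(f"weighted slope={model.coef_[0]:.3f}, intercept={model.intercept_:.2f}")
```

Other estimators, such as `RandomForestRegressor`, take `sample_weight` in `fit` in the same way.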
Two quick things that I want to call out: one is that our goal with class weighting, when we do this weight class adjustment, is to decrease our bias, but what should we be concerned about? Well, when we decrease bias, we tend to increase variance, and so there's an article from the New York Times back in 2016, some of you may be familiar with this, there was an individual I believe, it was, the, the title of this article is: "How one nineteen year old Illinois man is distorting national polling averages." And what this ended up, and so I encourage you to take a look at this, if you would like to take a look at and read this, but effectively, they created these buckets, they weren't just looking at the accounting and finance department, but instead they looked at, for example: age, geography, sex, other information, maybe political party, all of these different buckets, and by creating so many different buckets and attaching a weight to those buckets based on response rates, what ends up happening is they decrease bias but there tends to be an increase in variance. So what ended up happening here, I would encourage you to perhaps take a look at that later if you would like, but what ends up happening is that the person who had that, who is distorting national polling averages, his weight as assigned by this approach was a, was thirty times higher than the average individual's weight, and was actually 300 times more than the person with the smallest weight in this poll. So this one individual had an enormously outsized influence on these polling averages. So it's something to, to keep in mind. Related to this, I'm making an assumption here, I'm making an assumption, and thank you Sam for, for sharing that link. I'm making an assumption here that I know that fifty percent of my people are in accounting and fifty percent of my people are in finance, but that's not always a realistic assumption.
So if I want to understand what percentage of people will support the Democratic candidate in the upcoming election and I want to look at things across age groups: eighteen to thirty-four, thirty-five to fifty-four, or fifty-five and up, I have to make a guess about that, as I write here, hopefully it's an educated guess, but what we see in past elections may not be indicative of what we're going to see in this election. Thinking about the 2016 Democratic presidential primary, my understanding is, far more young people came out and voted in that Democratic presidential primary largely in support of Bernie Sanders. Thinking about that, that's something that we may not have noticed in 2010, 2008, 2004. So if we use past data to predict the future, it can be really challenging to do that in a way that's principled. We think about when then-senator Obama was running in 2008, what ends up happening was, when, when he was running against secretary, or then, I should say then-senator Clinton in 08, what ends up happening is that, if you just looked at information in 2000 and 2004, you're likely going to dramatically underestimate the proportion of people of color, specifically black voters who came out in support of Obama during the 2008 election. So this, this weight class adjustment method is something that you can sometimes do, but you can't always do that, and so it's important to keep in mind some of the limitations of this. We would be assuming that we know what the distribution of, in this case what I've highlighted, the age groups are, but that's certainly not a guarantee.
I'm gonna go ahead and shut this, and I'm gonna move back over to the slides. So, to try and talk about how to pull some of these pieces together in a workflow, we have not talked about, we have not talked about imputation for, for unit non-response yet, but we'll get into that here. In terms of my workflow, I start by saying: "how much missing data do I have, and is it worth my time to try and address it?" Anytime I'm doing a data science problem, that's one of the first things I look at, then I say: "is it reasonable to attempt deductive imputation?" Which we're going to talk about momentarily. Then, if my goal is to generate predictions, then I'm going to use the pattern submodel approach. If I want to conduct inference, then I will use the best imputation method available, ideally proper imputation. So, we'll talk about what each of these are, because a lot of those bolded terms are things we haven't seen yet. In order to get into that though, one last thing that I need to talk about are three different types of missing data.
Now, you, when you look at your data and you see a bunch of NAs in your data, or a handful of NAs in your data, they all look the same to us, but there are actually three different types of missing data that are important to know. So I'm taking inspiration from my friend Allison here. My friend Allison, she is getting her PhD in biology at Notre Dame, very proud of her, so let's say that she's a grad student in a lab working late, and while she's pipetting in the lab, she reaches for her pen and accidentally knocks one petri dish off the desk. So from that petri dish, my friend Allison loses all of the data that she otherwise would have collected. So, over here, I'm looking at what maybe that might look like in terms of data gathering. So Allison was able to measure how much bacteria, or what was the width of the bacteria, in each petri dish on day one for all of these different petri dishes. Did the same thing on day two, but this here, you can see 16 millimeters, but really that would be an NA in our data, that's something that we do not have access to. We would call that: missing completely at random. This data is missing completely at random because there's no systematic differences between that data that's missing and the data that we've observed.
Moving on to the next example, there's something called: missing at random. And I apologize, please do not shoot the messenger, I was not in the room when people decided on these terms. I think these terms are silly and I wish there was a better way to describe them, but I apologize on behalf of statisticians here. So we talked about missing completely at random, here we've got missing at random. So let's say that we work for the Department of Transportation, and we're looking at the Pennsylvania Turnpike, a toll road, a highway, that has a toll booth, so that you can understand, people have to pay whenever they go through this toll booth in order to use the Pennsylvania Turnpike, and let's say that there's a, a sensor set up to track how many cars go through a given gate in a given time window. Well, that sensor breaks and doesn't gather any information between 7 and 10 a.m. What we would describe that as, is data that's missing at random, and the reason that we call it missing at random, is that conditional on some data that we do have in hand, the data of interest is not systematically different, so whether or not that data point is missing depends on data that we have observed. In this case we have observed time, that is information that we do have, and time contributes to that missingness. That missingness is based on, are contingent upon those specific hours that data is missing. You might imagine the solution for trying to tackle this type of missing data is, maybe, we want to use the time in order to help us fill-in, or generate some value for those number of vehicles that are, that we are missing.
The last type of data that's missing, the last type of missingness I should say, is data that's not missing at random. Let's say that I administer a survey, and that survey includes a question about income. People who have lower incomes are less likely to respond to that question about income. We call that data that's not missing at random because whether or not something is missing depends on the value of that missing thing itself. Here, for example, we see that people who have lower incomes are on average less likely to share their incomes with me. So if we wanted to do something simple, like calculate the average of this data, I can calculate the average of income but it's going to be skewed pretty significantly upward. We're gonna see a value that's much higher than it should be. Now, this is the most complicated type of missingness to work with because we don't have access to these incomes here.
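A small simulation, assuming NumPy and pandas, that contrasts missing completely at random with not missing at random for the income example. The income distribution and the missingness probabilities are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical incomes for every survey respondent.
income = pd.Series(rng.lognormal(mean=10.8, sigma=0.5, size=n))
true_mean = income.mean()

# Missing completely at random: every value has the same 30% chance of being lost.
mcar_mask = rng.random(n) < 0.3
mcar_mean = income.mask(mcar_mask).mean()

# Not missing at random: below-median incomes are withheld far more often,
# so missingness depends on the value that is missing.
p_missing = np.where(income < income.median(), 0.6, 0.1)
nmar_mask = rng.random(n) < p_missing
nmar_mean = income.mask(nmar_mask).mean()

print(f"true mean:            {true_mean:,.0f}")
print(f"observed mean (MCAR): {mcar_mean:,.0f}")
print(f"observed mean (NMAR): {nmar_mean:,.0f}")
```

Under the completely-at-random mask the observed average stays close to the truth, while under the not-at-random mask it is pulled upward, matching the skew described above.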
So those are the three different types of missingness, and I want to talk about a couple of ways that, in the 15 minutes we've got left, that we can handle. So there are five different methods here that I outline. I'm going to move quickly through them because there's a recording here and I want to be respectful of folks' time and not hold folks over, but again please ask any questions that you have in the chat.
So let's start by talking about deductive imputation. Deductive is, it has to do with logic. We're going to deduce values. We're going to use logical rules to understand how we can fill data in. So let's say that there was a survey that asks if somebody was the victim of a crime in the last 12 months, and that person says: "no" and then the same survey has a later question that says "were you the victim of a violent crime in the last 12 months?" and that respondent leaves the answer blank. We can use logic, we don't have to make any guesses, or we don't have to do any inference, we can use logic to say given the answer to my first question I know the answer to that, and I can fill in that missing value through logic. This requires specific coding, so you would have to, as we get new data, we have to recognize how do my variables, or my columns relate to one another? You would have to code that up. That's not something that is going to be consistent across all datasets so you can't just download a library to do that. It can be time-consuming, but it's good because it doesn't require any inference, and it does not matter what type of missing data you're working with, whether it's missing at random, not at random, completely at random, you can do this in any of those cases.
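A minimal sketch of the crime-survey rule in pandas. The column names and the tiny table are hypothetical; the point is that the rule is hand-written for this dataset rather than pulled from a library.

```python
import numpy as np
import pandas as pd

# Hypothetical survey answers; NaN marks a question left blank.
survey = pd.DataFrame({
    "victim_of_crime": ["no", "yes", "no", "yes"],
    "victim_of_violent_crime": [np.nan, "yes", "no", np.nan],
})

# Logical rule: no crime at all implies no violent crime.
can_deduce = (survey["victim_of_crime"] == "no") & survey["victim_of_violent_crime"].isna()
survey.loc[can_deduce, "victim_of_violent_crime"] = "no"

print(survey)
```

The first row is filled in by logic alone. The last row stays missing, because a "yes" to the first question does not determine the answer to the second.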
The next thing that I want to bring up is: mean, median, and mode imputation. So, I imagine that many of you have done this at some point or another. For any NA value or any missing value in your data you just replace your missing value with the mean or the median or the mode of that column. It's a quick fix. It's easy to implement and it seems reasonable, but it can really significantly distort your histogram, and it underestimates your variance, and we'll talk in a minute about why that variance or that variability is so important. It should only be considered if your data is missing completely at random. So, if you can say: "based on my understanding of my data I can truly believe", and maybe some quick analyses that I do in my data you can say, "look I believe that my data are missing completely at random." This would only be appropriate in that case, but even then you probably shouldn't do this, and there are better ways of handling missing data. So, this is an example, so what I have up top, and this is all in that 02 notebook if you want to take a look at that 02_item_missingness. These visuals come directly from that notebook. So here up top, this is the real histogram of data with, in blue, this blue vertical bar shows us the true average of that data. Down below, we're looking at the same data but, in blue I have which data I've observed and then any value that was missing I filled the mean in, so here this orange bar gives me the mean. You'll notice that that's a pretty dramatic difference between the top and the bottom. If you're missing ten, twenty, thirty percent of values in a column, which is not the craziest thing in the world to think through, you might get something like this. So first off, you're going to distort that histogram, so that's one challenge of working with this. The other thing that I want to run through is why is underestimating variance a bad thing.
So here I've got the formula for your sample standard deviation, and you can go through this if you would like, but in short, if you are working with the sample standard deviation, what you'll notice is that if you've observed your first k observations, and then observation k plus 1 all the way through n, those are missing, and you try and use mean imputation, what you'll do is for k plus 1 through n you fill the mean for those values, that formula shifts, instead of dividing by k you're gonna divide by n, that denominator gets bigger, but you can rewrite this formula by breaking it out for the first k values, those values you've observed, the actual real data, and here k plus 1 through n all of those values that were missing and you filled in the mean for. We'll notice here we've got x-bar, which is our sample mean, minus x-bar, which is our sample mean, that's zero. Zero squared is zero, and if you add a bunch of those up, that still gives you zero so what ends up happening is: that this part of your, this part of your variance, or this part of your standard deviation remains exactly the same, you don't add anything there, but your denominator gets bigger. The reason I walk through that is, by doing this, your standard deviation gets smaller. Your standard deviation is used in a number of ways. One if you try and generate a confidence interval then your standard, you're gonna get maybe a 95% confidence interval, but that's gonna get smaller. I apologize if you can, you can either see the lightning or hear the thunder outside. So, you may see your, your confidence interval get much smaller depending on how many of those values you imputed. That's not because you're getting more confident, or that's not because you decreased your level of confidence, let's say ninety-five to ninety percent confidence, that's just because we imputed our mean. So we might become falsely confident in our results.
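The argument above can be written out as follows. This is a sketch that uses the conventional k - 1 and n - 1 denominators of the sample standard deviation (the talk speaks of k and n informally), with x-bar the mean of the k observed values. Filling the n - k missing entries with x-bar leaves the overall mean equal to x-bar.

```latex
% Standard deviation of the k observed values
s_{\text{obs}} = \sqrt{\frac{1}{k-1}\sum_{i=1}^{k}\left(x_i-\bar{x}\right)^2}

% After mean imputation of observations k+1, ..., n
s_{\text{imp}}
  = \sqrt{\frac{1}{n-1}\left[\sum_{i=1}^{k}\left(x_i-\bar{x}\right)^2
      + \sum_{i=k+1}^{n}\left(\bar{x}-\bar{x}\right)^2\right]}
  = \sqrt{\frac{1}{n-1}\sum_{i=1}^{k}\left(x_i-\bar{x}\right)^2}
  = s_{\text{obs}}\sqrt{\frac{k-1}{n-1}}
```

Since k is smaller than n, the factor on the right is below one, so the imputed standard deviation is always smaller than the observed one, and it shrinks further as more values are imputed.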
Something similar happens to p-value where your p-value may shrink, so your p-value of point zero seven now becomes a p-value of say point zero three. All of a sudden we get a significant result, but even though we, but that significant result is only because we filled in this missing data.
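A short simulation of that shrinkage, assuming NumPy and pandas. The data and the 30% missingness rate are made up, and the interval uses a simple normal approximation (1.96 standard errors) purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical column with about 30% of values missing completely at random.
x = pd.Series(rng.normal(50, 10, 200))
x_missing = x.mask(rng.random(200) < 0.3)

# Mean imputation: fill every gap with the observed mean.
x_imputed = x_missing.fillna(x_missing.mean())

def ci_halfwidth(s):
    """Half-width of an approximate 95% confidence interval for the mean."""
    s = s.dropna()
    return 1.96 * s.std(ddof=1) / np.sqrt(len(s))

std_observed = x_missing.std(ddof=1)   # pandas skips NaN here
std_imputed = x_imputed.std(ddof=1)
print(round(std_observed, 2), round(std_imputed, 2))
print(round(ci_halfwidth(x_missing), 2), round(ci_halfwidth(x_imputed), 2))
```

The imputed column reports a smaller standard deviation and a narrower interval even though no new information was collected.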
So Umesh asks, and I apologize if I mispronounce anybody's name, thank you for your question Umesh. Do you see significant gains in model performance from imputing using methods like MissForest or other model based imputation? How do they compare with simpler imputing methods like mean imputation? So mean, median, and mode imputation, and if I move to the next slide this is a write-up of that, confidence interval gets smaller, the p-value gets smaller, but that's not something, that's, that's not real. That's not valid. If we look at mode imputation, we see a similar challenge where one value is artificially inflated a ton, but we're still going to see an effect where that standard deviation is almost certainly going to get smaller and artificially so. So when it comes to trying to impute properly, what you should do, and in the interest of time I'm gonna skip ahead, if you were to try and fit a single regression imputation where you fit a model to your data like MissForest or something else, you're gonna get values that look like this, where those imputed values are all lumped in the middle. It's still not going to work the way that you expect it to and that's because when you generate those predictions, that deterministic computation that you do, is usually going to fall on one line. Instead, you imagine that you probably want to generate values that look more like this, that resemble the true variability that you see in your data. So because of that, in order to properly impute, we need to impute many times. Because anytime you fill in a value with one number, you're treating it like you know that true number, and that's not the case. If you fill in any missing value NA with a zero, or a ten, or a thousand, you're acting like you know that value, and you don't.
So the way to properly impute missing data is to make like ten copies of your dataset to do imputation with some stochastic behavior, so you can add in like a random error if you were to do a regression model, or if you were to do MissForest, you could, you would be able to maybe add in some randomness into that. You would do that on all ten of your copies of your datasets. Once you've got ten copies of your dataset that are full, they've got some different values in them, then you would build like your final model, or do your final analysis on each of those datasets. Then combine your results together just like you would aggregate results in a random forest. So if you are doing a classification or a regression model you can average your predictions, for classification you would do like a vote based prediction across all ten of those models that you constructed. There's a visual here that I think helps to drive that point home. Again, you start with your data, there's a bunch of pound signs or hash tag symbols in here that represent missing data. You make a bunch of copies of that, this image has three copies of your data, and fills those in using some sort of random imputation method like a regression model with some random error. Then you build your final model or analysis on those three data sets. You get your results and then you combine them together. If your goal is to do prediction, you can do that. If your goal is to do inference, there's something called Rubin's rules, we're gonna drop that name in the channel here. Rubin's rules. In order to combine those estimates together. So I've got a little bit of context on that, on Rubin's rules, so there's documentation in the repository if that's something that you want to explore on your own. So there's content in the notebook on that.
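A simplified sketch of that workflow in NumPy: impute ten times with a regression plus random error, run the analysis on each completed copy, and pool with Rubin's rules. The data is simulated, the "final analysis" is just the mean of `y`, and the regression coefficients are held fixed across copies. A fully proper imputation would also draw those coefficients at random, so treat this as an illustration rather than a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: y depends on x, and about 30% of y is missing.
n = 300
x = rng.normal(0, 1, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)
missing = rng.random(n) < 0.3
y_obs = np.where(missing, np.nan, y)

# Fit a regression of y on x using the complete cases only.
X_cc = np.column_stack([np.ones((~missing).sum()), x[~missing]])
beta, *_ = np.linalg.lstsq(X_cc, y_obs[~missing], rcond=None)
resid = y_obs[~missing] - X_cc @ beta
sigma = resid.std(ddof=2)  # residual spread, used as the random error scale

m = 10  # number of imputed copies
estimates, variances = [], []
for _ in range(m):
    y_filled = y_obs.copy()
    # Stochastic imputation: regression prediction plus a random error.
    noise = rng.normal(0, sigma, missing.sum())
    y_filled[missing] = beta[0] + beta[1] * x[missing] + noise
    # Analysis on this completed copy: the mean of y and its squared standard error.
    estimates.append(y_filled.mean())
    variances.append(y_filled.var(ddof=1) / n)

estimates = np.array(estimates)
variances = np.array(variances)

# Rubin's rules: pool the estimates and combine within- and between-copy variance.
pooled_estimate = estimates.mean()
within = variances.mean()
between = estimates.var(ddof=1)
total_variance = within + (1 + 1 / m) * between
pooled_se = np.sqrt(total_variance)
print(round(pooled_estimate, 3), round(pooled_se, 3))
```

The between-copy term is what carries the uncertainty about the missing values into the final standard error, which a single filled-in dataset cannot do.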
The last thing that I want to go through is, I've talked about those first four methods of imputation, the pattern submodel approach is where I want to end up for today. And that pattern submodel approach for handling missing data is: you're gonna take your dataset and you break it into subsets of data based on how your data is missing, then what you're going to do is build one model on each of those subsets creating many different models, you won't combine those models together, you will end up with many different models. So, a visual example. Look at the dataset on the left hand side. I have a Y, I have an x1, and an x2. What I can do is, I can take my first two rows right here because I've observed the same data, I'm gonna call that pattern one. My next two rows I'm gonna call that pattern two. The next two rows are pattern three, and then my final row is pattern four. I'm gonna group my data into four chunks based on how my data are observed and missing together, and I'm gonna fit a different model on each of those. So you would end up with four different models here. One model would be, if you wanted to do a linear regression model Y equals beta naught plus beta one times x1 plus beta two times x2, and that would be fit on those first few observations. Then you would fit a separate linear regression model on these next observations where you just exclude any value for x2 because you don't have any value for x2, so that would be a model like y equals beta naught plus beta one times x1. So you would, based on this, come up with four different models. There are a lot of advantages to this and if your goal is just to make predictions then you should use the pattern submodel approach that I described here. That, this will outperform imputation methods if your data are not missing at random and it's gonna perform about on par with imputation methods or filling in methods if your data are missing at random or missing completely at random.
It does not require missingness assumptions, so that's one of, in my opinion, one of the cool things about that. It is, based on my understanding, a relatively new method, there was a paper released in I think September of 2018 on that. I think it was mentioned a, a long time ago in a more esoteric paper, but to my knowledge there's not a ton of machinery like in Python to implement this so it has to be a little bit more of a manual process.
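Since this has to be done by hand, here is one possible manual sketch using pandas and scikit-learn. The seven-row table mirrors the four patterns in the visual example but the numbers are invented, the `models` dictionary and the `predict_one` helper are hypothetical names, and a dataset this small is only good for showing the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data with four missingness patterns (two rows each, then one row).
df = pd.DataFrame({
    "y":  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "x1": [0.5, 1.0, 1.5, 2.0, np.nan, np.nan, np.nan],
    "x2": [1.0, 3.0, np.nan, np.nan, 2.0, 4.0, np.nan],
})
features = ["x1", "x2"]

# Label each row by which features are observed, e.g. "10" = x1 seen, x2 missing.
pattern_key = df[features].notna().apply(
    lambda row: "".join("1" if seen else "0" for seen in row), axis=1
)

# Fit one model per pattern, using only the features observed in that pattern.
models = {}
for key, rows in df.groupby(pattern_key):
    cols = [f for f, flag in zip(features, key) if flag == "1"]
    if cols:
        models[key] = (cols, LinearRegression().fit(rows[cols], rows["y"]))
    else:
        # Nothing observed: fall back to an intercept-only model (the mean of y).
        models[key] = (cols, rows["y"].mean())

def predict_one(new_row):
    """Route a new observation to the model that matches its missingness pattern."""
    key = "".join("0" if pd.isna(new_row[f]) else "1" for f in features)
    cols, model = models[key]
    if not cols:
        return float(model)
    return float(model.predict(pd.DataFrame([{c: new_row[c] for c in cols}]))[0])

print(sorted(models))
print(predict_one({"x1": 1.0, "x2": np.nan}))
```

The models are never combined. A new row is scored only by the model for its own pattern, and a pattern that never appeared in training would need its own fallback.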
So I know that we're right at the end of time here, and I do want to be respectful of folks' time. So first off, thank you so much for, for showing up, I totally understand if you need to hop off so please feel free to do that, I'm happy to stick around and Ty you feel free to come on if you need to give me the hook and pull me off.