-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathRegression.Rmd
122 lines (86 loc) · 4.25 KB
/
Regression.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
---
title: "Regression"
output:
html_document:
theme: journal
css: styles.css
code_folding: hide
---
<!-- Here you'll find a general idea of the project https://github.com/folaAj/AdventureWorks/blob/master/Regression.ipynb -->
```{r, echo=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
This is the part of the project where we will predict the average monthly spent amount based on the info we have. We still don't know which variables we will take into account, we just want to get a predictive model.
```{r load_libraries}
library(readr)
library(dbplyr)
library(ggplot2)
library(tidyverse)
library(ggplot2)
library(readr)
library(data.table)
library(DT)
library(pander)
library(scales)
library(cowplot)
library(shiny)
library(eeptools)#for changing the DOB to age
```
We download and clean the data (see the Visualization tab for a detailed explanation of this process).
```{r}
customer <- read_csv("~/CS 499 Senior Project/datasets/AdvWorksCusts.csv")
spend <- read_csv("~/CS 499 Senior Project/datasets/AW_AveMonthSpend.csv")
bikebuyer <- read_csv("~/CS 499 Senior Project/datasets/AW_BikeBuyer.csv")
three_datasets <- data.frame(customer, spend, bikebuyer)
data_clean <- select(three_datasets,-c(CustomerID.1, CustomerID.2))
missing_values <- sapply(data_clean, function(x) sum(is.na(x))) #it checks number of missing values by column
data_clean <- select(three_datasets,-c(CustomerID.1, CustomerID.2, Title, MiddleName, Suffix, AddressLine2))
data_clean <- data_clean[!duplicated(data_clean), ] #it removes duplicates
```
Here's a summary of the data. We can see that there abour 6 continuos variables. Let us remember that our response variable, what we're trying to predict, is the $AveMonthSpend$ variable.
```{r}
pander(summary(data_clean))
```
Here you can see the first few rows of the dataset to get a feel of the data.
```{r}
head(data_clean)
```
## Finding the predictive model
Our null hypothesis is that the variation in AveMOnthSpend is due to randomization, and not due to the variation of other variables. In other words, other variables cannot explain the AveMonthSpend.
Let's start at visualizing the variables.
```{r}
###Let's add up age column using the existing DOB column
#Change BirthDate from Character to Date format
data_clean$BirthDate <- as.Date(data_clean$BirthDate, format = "%m/%d/%Y")
#Append the new column called Age
data_clean$Age <- as.numeric(difftime("1998-01-01",data_clean$BirthDate, units = "weeks"))/52.25
data_qualitative <- data_clean %>% select(11:16)
data_quantitative <- data_clean %>% select(c(22,16:20))
plot(data_quantitative)
```
I see some sort of relationship between Age, YearlyIncome, TotalChildren, and NumberChildrenAtHome.
NOTE: We might consider NumberChildrenAtHome as a discrete variable or factor, but I think it's better if we keep it as continuous.
***
As per the qualitative variables, you can refer back to my EDA window.
(Summary: Categorical features such as occupation, gender, marital status and home owner flag have distinct relationships with average month spend. The quartiles are of different levels. It seems that Males spend more on average than females same for married and homeowners)
```{r}
#creating a smaller data frame with the features that seem to play a role
features <- cbind(data_qualitative,data_quantitative)
head(features)
```
Let's start with building our model
```{r}
#We'll use a forward approach. We'll start from most simple to most complex
model1 <- lm(AveMonthSpend ~ YearlyIncome, data = features)
summary(model1)
model2 <- lm(AveMonthSpend ~ YearlyIncome + NumberChildrenAtHome, data = features)
summary(model2)
model3 <- lm(AveMonthSpend ~ YearlyIncome + NumberChildrenAtHome + TotalChildren, data = features)
summary(model3)
model4 <- lm(AveMonthSpend ~ YearlyIncome + NumberChildrenAtHome + TotalChildren + Age, data = features)
summary(model4) #Notice how Total Children becomes insignificant!
model5 <- lm(AveMonthSpend ~ YearlyIncome + NumberChildrenAtHome + Age, data = features)
summary(model5)
model6 <- lm(AveMonthSpend ~ YearlyIncome + NumberChildrenAtHome + Age + NumberCarsOwned, data = features)
summary(model6)
```