Ben Turner
Predicting Housing Prices in Ames, Iowa
This project evaluates the 79 variables provided in a Kaggle competition dataset. The variables describe almost every aspect of a house. They range from obvious ones, like square footage, number of bedrooms, and number of bathrooms, to lesser-known ones, such as basement square footage, the shape of the lot the house sits on, and whether or not the house has alley access.
The dependent variable in this project is the Sale Price. The goal is to predict housing prices from these house variables.
The EDA will begin by trying to understand how the dependent variable and the independent variables relate to each other and the cause for that relationship. Our EDA will also involve some data cleaning: how to handle the missing data and how to deal with the categorical variables.
The data for this project is split into a train and a test data set. When running the initial EDA, we find that a lot of data is missing, so we will not be able to run a reliable predictive model until we fill those values in. Overall, 34 variables have missing data (NAs), for a grand total of 13,965 missing values.
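A count like this can be produced with a few lines of base R. The sketch below is only an illustration: `toy` is a small made-up data frame, not the Kaggle data, and the column values are hypothetical.

```r
# Toy data frame standing in for the combined train/test data (hypothetical values)
toy <- data.frame(
  LotFrontage = c(65, NA, 68, NA),
  PoolQC      = c(NA, NA, NA, "Gd"),
  LotArea     = c(8450, 9600, 11250, 9550),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Keep only the columns with at least one NA, largest first
na_counts <- sort(na_counts[na_counts > 0], decreasing = TRUE)
print(na_counts)

# How many columns are affected, and how many cells are missing overall
n_cols_with_na <- length(na_counts)
total_na <- sum(na_counts)
```

Run against the real data frame, the last two values would correspond to the "34 variables" and "13,965 missing values" figures reported above.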
For these variables, I imputed either the median value or whatever the majority value was. For example, the variable with the most missing values was the pool variable, Pool QC. Since the majority of houses came back with a “None” value, I replaced all missing values for Pool QC with None.
For numerical variables, such as Year Built (the year the house was built), we look at whether the distribution is normal or skewed. For roughly normal distributions, it is best to fill in missing values with the mean. If the variable has a skewed distribution, it is better to use the median. Based on the histogram, Year Built looks skewed: most of the houses were built around 1960 or later, and a large proportion of them after 1990, so we will use the median, 1979, for this variable.
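The mean-versus-median rule can be sketched as a small helper. This is only an illustration under stated assumptions: `skewness` and `impute_numeric` are hypothetical helper names, the cutoff of 1 is an arbitrary threshold, and the `lot` vector is made up.

```r
# Sample skewness: third standardized moment (0 for a symmetric distribution)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Fill NAs with the mean if the data look roughly symmetric, otherwise with the median.
# The cutoff of 1 is an arbitrary illustrative threshold.
impute_numeric <- function(x, cutoff = 1) {
  fill <- if (abs(skewness(x)) < cutoff) mean(x, na.rm = TRUE) else median(x, na.rm = TRUE)
  x[is.na(x)] <- fill
  x
}

# Hypothetical right-skewed values with two gaps
lot <- c(60, 65, 68, 70, 75, 313, NA, NA)
filled <- impute_numeric(lot)
filled
```

In practice the histogram is still worth looking at; a single skewness number can hide things like two separate peaks.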
Looking at the data
Price Vs. Overall Condition (1-10 scale, 10 being best condition)
Price Vs. Overall Quality (1-10 scale, 10 being highest quality)
Looking at Sale Price V. Year Built V. Overall Quality
This plot shows the relationship between SalePrice, YearBuilt, and OverallQuality. It seems that more recently built houses have a better quality grade.
Looking at the data, a large number of values are missing. I decided to impute the missing data points with whatever made the most sense for my model. For instance, the variable for Lot Frontage has missing values. When I run a summary for this variable, I get the following:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  21.00   60.00   68.00   69.09   78.00  313.00
In this case, it is probably best to use the median. I've decided to use the median of 68.00 for all missing values of this particular variable (69.09 is the mean). The two values are close, since the variable is nearly normally distributed apart from a few very large values (the maximum is 313), and the median is less affected by those.
In other cases, I just used the most frequent value. One example is the variable MSZoning, which refers to how a house is zoned (residential, commercial, etc.). As we can see from the table, the vast majority of houses are zoned 'RL', so we will use this value to fill in the missing values.
C (all)      FV      RH      RL      RM
     25     139      26    2269     460
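Filling in the most frequent category can be sketched in base R as follows. The helper name `most_frequent` and the `zoning` vector are hypothetical and only stand in for the real MSZoning column.

```r
# Return the most frequent non-missing value of a vector
most_frequent <- function(x) {
  counts <- table(x)                 # table() drops NA by default
  names(counts)[which.max(counts)]
}

# Hypothetical zoning column with missing entries
zoning <- c("RL", "RM", "RL", NA, "FV", "RL", NA)

mode_value <- most_frequent(zoning)
zoning[is.na(zoning)] <- mode_value
table(zoning)
```

If two categories are tied, `which.max` picks the first one, so ties would need a deliberate choice.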
Now that we have our data cleaned up, let's start putting the data into some models. The first model I will use is a very simple linear model using only month and year sold, lot area (square footage), and number of bedrooms. I deliberately chose variables that I assumed wouldn't be strong indicators, but would still yield some basic results that we can compare against later.
#Linear Graphs
Residuals vs. Fitted Plots
*This plot shows the residuals against the fitted values.
*The dotted line at y=0 marks where a residual is zero, that is, where the fitted value matches the observed value exactly.
*Points above the line have positive residuals and points below have negative residuals.
*The red line is a smoothed curve through the residuals that gives us an idea of the pattern of residual movement. In our case we can see that our residuals have a curved pattern. This could mean that we may get a better model if we try a model with a quadratic term included. We will explore this point further by actually trying this to see if it helps.
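Adding a quadratic term to a linear model can be sketched as follows. The data here are simulated with a deliberately curved relationship; they are not the Ames data, and the variable names `x` and `y` are placeholders.

```r
set.seed(1)

# Simulated data with a curved relationship (not the Ames data)
x <- runif(200, 0, 10)
y <- 5 + 2 * x + 1.5 * x^2 + rnorm(200, sd = 5)
sim <- data.frame(x = x, y = y)

fit_linear    <- lm(y ~ x, data = sim)
fit_quadratic <- lm(y ~ x + I(x^2), data = sim)   # I() protects ^ inside a formula

# Compare adjusted R-squared, then test whether the extra term helps
adj_linear    <- summary(fit_linear)$adj.r.squared
adj_quadratic <- summary(fit_quadratic)$adj.r.squared
c(linear = adj_linear, quadratic = adj_quadratic)
anova(fit_linear, fit_quadratic)
```

For the housing model, the same idea would mean adding a term such as `I(LotArea^2)` to the formula and checking whether the curve in the residual plot flattens out.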
Normal Q-Q Plot
*The Normal Q-Q plot is used to check whether our residuals follow a normal distribution.
*The residuals are normally distributed if the points follow the dotted line closely.
*In my graph most points fall close to the line, so my model appears to pass the test of normality.
Scale – Location Plot
*The Scale-Location plot shows the spread of the residuals across the range of predicted values.
*One of the assumptions of regression is homoscedasticity, meaning the variance of the residuals should be reasonably equal across the range of predicted values.
*As the residuals spread wider from each other, the red spread line goes up. In my graph, around the 100,000 mark the residuals get closer to each other, which causes the red spread line to go down. A horizontal red line is ideal and would indicate that the residuals have uniform variance across the range.
Residuals vs Leverage Plot
*This plot was a little harder to grasp when I first looked into what it’s telling us.
*Influence: The influence of an observation can be thought of in terms of how much the predicted scores would change if the observation were excluded. Cook’s Distance is a pretty good measure of the influence of an observation.
*Leverage: The leverage of an observation is based on how much the observation’s value on the predictor variable differs from the mean of the predictor variable. The more leverage an observation has, the greater its potential influence.
*What we are concerned with here is whether any points land outside the dotted lines (the Cook’s Distance contours), meaning those points have a very high influence on our model. Typically, if I did have points in that range, I would most likely want to exclude them. My model has all points within the desired range.
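The four plots described above are what R draws when `plot()` is called on a fitted `lm` object. The sketch below uses the built-in `mtcars` data as a stand-in for the housing data, and the 4/n cutoff for Cook's distance is a common rule of thumb that I am assuming here, not something taken from the project.

```r
# Built-in mtcars data used as a stand-in for the housing data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The four standard diagnostic plots in a 2x2 grid
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)

# Numeric versions of what the Residuals vs Leverage plot shows
lev  <- hatvalues(fit)        # leverage of each observation
cook <- cooks.distance(fit)   # influence of each observation

# Flag observations with Cook's distance above 4/n (rule of thumb, not a hard rule)
flagged <- names(cook)[cook > 4 / nrow(mtcars)]
flagged
```

Looking at the flagged rows before dropping them is usually safer than excluding them automatically, since an influential point can also be a legitimate, unusual house.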
#This is a very basic linear model using only year and month sold, the lot's square footage, and bedrooms
linearModel <- lm(SalePrice ~ YrSold + MoSold + LotArea + BedroomAbvGr, data=train)
linearPreds <- data.frame(Id = test$Id, SalePrice= predict(linearModel, test))
'data.frame': 1459 obs. of 2 variables:
$ Id : int 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 ...
$ SalePrice: num 169277 187758 183584 179317 150730 ...
> linearModel <- lm(SalePrice ~ YrSold + MoSold + LotArea + BedroomAbvGr, data=df)
> linearPreds <- data.frame(Id = df$Id, SalePrice= predict(linearModel, df))
> str(linearPreds)
'data.frame': 2919 obs. of 2 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ SalePrice: num 173686 180162 186931 177646 209445 ...
> head(linearPreds)
Id SalePrice
1 1 173686.1
2 2 180161.7
3 3 186930.9
4 4 177646.3
5 5 209445.2
6 6 166224.7
> summary(linearModel)

Call:
lm(formula = SalePrice ~ YrSold + MoSold + LotArea + BedroomAbvGr,
    data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-258793  -48846  -17832   30415  542493

Coefficients:
                  Estimate    Std. Error t value             Pr(>|t|)
(Intercept)   1917554.0845  3037799.2464   0.631                0.528
YrSold           -897.6770     1512.5335  -0.593                0.553
MoSold           1104.8926      743.2912   1.486                0.137
LotArea             1.9680        0.2005   9.817 < 0.0000000000000002 ***
BedroomAbvGr    13275.8801     2456.2101   5.405         0.0000000756 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 75870 on 1455 degrees of freedom
  (1459 observations deleted due to missingness)
Multiple R-squared: 0.09035, Adjusted R-squared: 0.08785
F-statistic: 36.13 on 4 and 1455 DF, p-value: < 0.00000000000000022
The result I focused on here was R-squared. R-squared is a statistical measure of how close the data are to the fitted regression line: it is the share of the variation in the sale price that the model explains. Typically, the higher the R-squared, the better the model fits your data.
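To make the definition concrete, R-squared can be computed by hand as one minus the ratio of unexplained to total variation. This sketch uses the built-in `mtcars` data as a stand-in, not the housing data.

```r
# Built-in mtcars data as a stand-in for the housing data
fit <- lm(mpg ~ wt, data = mtcars)

ss_res <- sum(residuals(fit)^2)                     # variation the model leaves unexplained
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)    # total variation around the mean
r2 <- 1 - ss_res / ss_tot

# Should agree with the value reported by summary()
all.equal(r2, summary(fit)$r.squared)
```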
The adjusted R-squared value for this simple linear regression is 0.08785, meaning the model explains less than 9% of the variation in sale price, so it is not very good. Let’s put in all the variables from our cleaned-up data set, where we filled in all the missing values, and see what happens:
The second time I ran the linear model, I used all the data and was able to get a better R-squared result:
Call:
lm(formula = SalePrice ~ ., data = clean)

Residuals:
    Min      1Q  Median      3Q     Max
-118577   -6221       0    5912  118577
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 43390 on 708 degrees of freedom
Multiple R-squared: 0.9597, Adjusted R-squared: 0.834
F-statistic: 7.635 on 2210 and 708 DF, p-value: < 0.00000000000000022
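The gap between the multiple R-squared (0.9597) and the adjusted R-squared (0.834) comes from the number of predictors: the adjustment penalizes every extra term. A short sketch of the standard adjustment formula, using only figures from the two summaries above (`adj_r2` is a hypothetical helper name):

```r
# Adjusted R-squared from R-squared, number of observations n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Simple model: "4 and 1455 DF" means p = 4 and n = 1455 + 4 + 1 = 1460
adj_simple <- adj_r2(0.09035, n = 1460, p = 4)

# Full model: "2210 and 708 DF" means p = 2210 and n = 708 + 2210 + 1 = 2919
adj_full <- adj_r2(0.9597, n = 2919, p = 2210)

round(c(simple = adj_simple, full = adj_full), 4)
```

With 2,210 terms for 2,919 rows, the full model has little data left per coefficient, which is why the adjusted value falls well below the raw one.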
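library(rpart)   # recursive partitioning (regression trees)
library(caret)   # provides postResample()
#Fit a regression tree on all predictors; method = "anova" selects regression
model <- rpart(SalePrice ~ ., data = clean, method = "anova")
predict1 <- predict(model, clean)
postResample(predict1, clean$SalePrice)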
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 0 134939 90492 134939 577382
1 2 3 4 5 6
195586.9 195586.9 195586.9 134938.7 195586.9 134938.7
RMSE Rsquared MAE
27861.88349 0.93153 14620.09323
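The three numbers that `postResample` reports can be reproduced by hand, which helps in reading them. The sketch below uses made-up observed and predicted prices and plain base R; as far as I understand caret, its Rsquared is the squared correlation between predictions and observations, which is what is computed here.

```r
# Hypothetical observed and predicted sale prices
obs  <- c(200000, 150000, 310000, 120000, 180000)
pred <- c(195000, 160000, 300000, 135000, 170000)

rmse <- sqrt(mean((pred - obs)^2))   # root mean squared error, in dollars
mae  <- mean(abs(pred - obs))        # mean absolute error, in dollars
rsq  <- cor(pred, obs)^2             # squared correlation between predictions and observations

c(RMSE = rmse, Rsquared = rsq, MAE = mae)
```

RMSE and MAE are in the same units as the sale price, so an RMSE of about 27,862 means typical errors in the tens of thousands of dollars.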
##RANDOM FOREST
One thing that I did have trouble with was converting the data from categorical values to numerical ones. When trying to run my random forest model, I kept getting errors stating that my data could not be read.
I then had to use the following functions to get it to work correctly. This took me a while to figure out, but with the help of the professor I used this code to process the data.
library(dplyr)   # provides %>% and mutate_if()
#First we have to make sure the column names in R don't start with numbers, so I'll change them
names(clean) <- make.names(names(clean))
#Then we convert character columns to factors; RF cannot handle factors with more than 53 levels
clean <- clean %>% mutate_if(is.character, as.factor)
With that, I had what I think was a little more success in my second try with the random forest. A couple of factors in my dataset had over 53 levels, so I converted those factors to integers and ran the model.
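The conversion of high-cardinality factors to integers can be sketched like this. The data frame `toy` and its columns are made up for illustration; only the 53-level limit comes from the text above.

```r
# Hypothetical data frame: one factor with 60 levels, one with 3
set.seed(42)
toy <- data.frame(
  Neighborhood = factor(sprintf("N%02d", rep(1:60, length.out = 120))),
  Zoning       = factor(sample(c("RL", "RM", "FV"), 120, replace = TRUE)),
  SalePrice    = runif(120, 50000, 400000)
)

# Find factor columns with more than 53 levels
too_many <- names(toy)[sapply(toy, function(col) is.factor(col) && nlevels(col) > 53)]

# Replace each of them with its integer level code
for (nm in too_many) {
  toy[[nm]] <- as.integer(toy[[nm]])
}
str(toy)
```

One thing to keep in mind is that the integer codes follow the alphabetical order of the levels, so the model treats the column as an ordered number even though the original categories had no order.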
> rf1 <- randomForest(SalePrice~.,data=clean, ntree=1000,proximity=TRUE)
> varImpPlot(rf1)
> predict_rf<-predict(rf1, clean) #Prediction
> head(predict_rf)
1 2 3 4 5 6
208014.0 175139.4 221352.1 158057.1 12000000000000 147914.5
> summary(predict_rf)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 0 50532 90522 163504 642316
> postResample(predict_rf, clean$SalePrice)
RMSE Rsquared MAE
8389.5080258 0.9941019 3464.3994625
I used the postResample function from the caret package in R to test the models. As we can see, the R-squared for this model, 0.9941019, is much better than that of the linear model or the rpart model. Here is the graph for the random forest:
As we made the models a bit more complicated, we were able to get better results for our predictions. In order from least effective to most effective, it was linear regression, then recursive partitioning (rpart), and finally the random forest, which gave us the best model for predicting a final house price. Given that this was my first time around Kaggle and a practical project, it was very interesting: I thought most of my work was going to be on the models, but the bulk of the time went into figuring out how to handle the missing data. There are much more sophisticated approaches out there to be used, but overall I believe the models used here explain
Next time, given this same problem, I would do some more feature engineering to zoom in on which variables have the largest effect on the sale price.