Skip to content

bjt4080/Data-Science-Project

Repository files navigation

MSDS 692 DATA SCIENCE PRACTICUM

Ben Turner

Predicting Housing Prices in Ames, Iowa

Project Overview

This project will be evaluating 79 different variables that are provided in a Kaggle competition dataset. The variables describe almost every different of aspect of a house. The data includes obvious variables like square footage, number of bedrooms, and number of bathrooms, to much more less known variables, such as basement square footage, to what shape the lot of the house is on, to whether or not the house has alley access.

The dependent variable in this project will be the Sale Price. In this project we are going to try to determine how to predict housing prices based on home variables.

EDA (EXPLATORY DATA ANALYSIS)

The EDA will begin by trying to understand how the dependent variables and independent variables relate to each other and the cause for that relationship. Our EDA will also involve some data cleaning, how to handle the missing data, and how to deal with the categorical variables.

The data for this project is split up between a test and a train data set. When running the first initial EDA, we find that we have A LOT of missing data. So we will not be able to run a reliable predictive model until we will those values in. Overall there are 34 different variables missing data or NA’s, and a grand total of 13,965 NA or missing values in total.

For the variables, I either input the median value or whatever the majority value was. For example, the highest variables with missing values was the pool area. Since the majority of houses came back with a “None” value, I will use that and replace all missing values for Pool QC with None.

For numerical variables, such as Year Built, referring to what year the house was built in, we look to see if it has a normal or skewed distribution. For normal distributions, it is best to use mean for our model. If the variable has a skewed distribution, then it is best is we use the median value. For this variable, based on the histogram this looks like a skewed distribution as most of the houses are built 1960 or so and later, and a large proportion of them built after 1990’s, so we will use the median for this value of 1979.

EDA (EXPLATORY DATA ANALYSIS)

Looking at the data

str(df)

      Id           MSSubClass     MSZoning          LotFrontage    
 Min.   :   1.0   20     :1079   Length:2919        Min.   : 21.00  
 1st Qu.: 730.5   60     : 575   Class :character   1st Qu.: 60.00  
 Median :1460.0   50     : 287   Mode  :character   Median : 68.00  
 Mean   :1460.0   120    : 182                      Mean   : 69.09  
 3rd Qu.:2189.5   30     : 139                      3rd Qu.: 78.00  
 Max.   :2919.0   70     : 128                      Max.   :313.00  
                  (Other): 529                                      
    LotArea          Street             Alley             LotShape        
 Min.   :  1300   Length:2919        Length:2919        Length:2919       
 1st Qu.:  7478   Class :character   Class :character   Class :character  
 Median :  9453   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 10168                                                           
 3rd Qu.: 11570                                                           
 Max.   :215245                                                           
                                                                          
 LandContour         Utilities          LotConfig          LandSlope        
 Length:2919        Length:2919        Length:2919        Length:2919       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 Neighborhood        Condition1         Condition2          BldgType        
 Length:2919        Length:2919        Length:2919        Length:2919       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                                                                  
  HouseStyle         OverallQual      OverallCond      YearBuilt   
 Length:2919        Min.   : 1.000   Min.   :1.000   Min.   :1872  
 Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
 Mode  :character   Median : 6.000   Median :5.000   Median :1973  
                    Mean   : 6.089   Mean   :5.565   Mean   :1971  
                    3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2001  
                    Max.   :10.000   Max.   :9.000   Max.   :2010  
                                                                   
  YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
 Min.   :1950   Length:2919        Length:2919        Length:2919       
 1st Qu.:1965   Class :character   Class :character   Class :character  
 Median :1993   Mode  :character   Mode  :character   Mode  :character  
 Mean   :1984                                                           
 3rd Qu.:2004                                                           
 Max.   :2010                                                           
                                                                        
 Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
 Length:2919        Length:2919        Min.   :   0.0   Length:2919       
 Class :character   Class :character   1st Qu.:   0.0   Class :character  
 Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
                                       Mean   : 101.4                     
                                       3rd Qu.: 163.5                     
                                       Max.   :1600.0                     
                                                                          
  ExterCond          Foundation          BsmtQual           BsmtCond        
 Length:2919        Length:2919        Length:2919        Length:2919       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
 Length:2919        Length:2919        Min.   :   0.0   Length:2919       
 Class :character   Class :character   1st Qu.:   0.0   Class :character  
 Mode  :character   Mode  :character   Median : 368.0   Mode  :character  
                                       Mean   : 441.3                     
                                       3rd Qu.: 733.0                     
                                       Max.   :5644.0                     
                                                                          
   BsmtFinSF2       BsmtUnfSF         TotalBsmtSF          Heating         
 Min.   :   0.00   Length:2919        Length:2919        Length:2919       
 1st Qu.:   0.00   Class :character   Class :character   Class :character  
 Median :   0.00   Mode  :character   Mode  :character   Mode  :character  
 Mean   :  49.57                                                           
 3rd Qu.:   0.00                                                           
 Max.   :1526.00                                                           
                                                                           
  HeatingQC          CentralAir         Electrical           1stFlrSF   
 Length:2919        Length:2919        Length:2919        Min.   : 334  
 Class :character   Class :character   Class :character   1st Qu.: 876  
 Mode  :character   Mode  :character   Mode  :character   Median :1082  
                                                          Mean   :1160  
                                                          3rd Qu.:1388  
                                                          Max.   :5095  
                                                                        
    2ndFlrSF       LowQualFinSF        GrLivArea     BsmtFullBath   
 Min.   :   0.0   Min.   :   0.000   Min.   : 334   Min.   :0.0000  
 1st Qu.:   0.0   1st Qu.:   0.000   1st Qu.:1126   1st Qu.:0.0000  
 Median :   0.0   Median :   0.000   Median :1444   Median :0.0000  
 Mean   : 336.5   Mean   :   4.694   Mean   :1501   Mean   :0.4296  
 3rd Qu.: 704.0   3rd Qu.:   0.000   3rd Qu.:1744   3rd Qu.:1.0000  
 Max.   :2065.0   Max.   :1064.000   Max.   :5642   Max.   :3.0000  
                                                                    
  BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr 
 Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.00  
 1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.00  
 Median :0.00000   Median :2.000   Median :0.0000   Median :3.00  
 Mean   :0.06132   Mean   :1.568   Mean   :0.3803   Mean   :2.86  
 3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.00  
 Max.   :2.00000   Max.   :4.000   Max.   :2.0000   Max.   :8.00  
                                                                  
  KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
 Min.   :0.000   Length:2919        Min.   : 2.000   Length:2919       
 1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
 Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
 Mean   :1.045                      Mean   : 6.452                     
 3rd Qu.:1.000                      3rd Qu.: 7.000                     
 Max.   :3.000                      Max.   :15.000                     
                                                                       
   Fireplaces     FireplaceQu         GarageType         GarageYrBlt  
 Min.   :0.0000   Length:2919        Length:2919        Min.   :1895  
 1st Qu.:0.0000   Class :character   Class :character   1st Qu.:1962  
 Median :1.0000   Mode  :character   Mode  :character   Median :1979  
 Mean   :0.5971                                         Mean   :1978  
 3rd Qu.:1.0000                                         3rd Qu.:2001  
 Max.   :4.0000                                         Max.   :2207  
                                                                      
 GarageFinish         GarageCars      GarageArea      GarageQual       
 Length:2919        Min.   :0.000   Min.   :   0.0   Length:2919       
 Class :character   1st Qu.:1.000   1st Qu.: 320.0   Class :character  
 Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
                    Mean   :1.767   Mean   : 472.9                     
                    3rd Qu.:2.000   3rd Qu.: 576.0                     
                    Max.   :5.000   Max.   :1488.0                     
                                                                       
  GarageCond         PavedDrive          WoodDeckSF       OpenPorchSF    
 Length:2919        Length:2919        Min.   :   0.00   Min.   :  0.00  
 Class :character   Class :character   1st Qu.:   0.00   1st Qu.:  0.00  
 Mode  :character   Mode  :character   Median :   0.00   Median : 26.00  
                                       Mean   :  93.71   Mean   : 47.49  
                                       3rd Qu.: 168.00   3rd Qu.: 70.00  
                                       Max.   :1424.00   Max.   :742.00  
                                                                         
 EnclosedPorch      3SsnPorch        ScreenPorch        PoolArea      
 Min.   :   0.0   Min.   :  0.000   Min.   :  0.00   Min.   :  0.000  
 1st Qu.:   0.0   1st Qu.:  0.000   1st Qu.:  0.00   1st Qu.:  0.000  
 Median :   0.0   Median :  0.000   Median :  0.00   Median :  0.000  
 Mean   :  23.1   Mean   :  2.602   Mean   : 16.06   Mean   :  2.252  
 3rd Qu.:   0.0   3rd Qu.:  0.000   3rd Qu.:  0.00   3rd Qu.:  0.000  
 Max.   :1012.0   Max.   :508.000   Max.   :576.00   Max.   :800.000  
                                                                      
    PoolQC             Fence           MiscFeature           MiscVal        
 Length:2919        Length:2919        Length:2919        Min.   :    0.00  
 Class :character   Class :character   Class :character   1st Qu.:    0.00  
 Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
                                                          Mean   :   50.83  
                                                          3rd Qu.:    0.00  
                                                          Max.   :17000.00  
                                                                            
     MoSold     YrSold      SaleType         SaleCondition        SalePrice     
 6      :503   2006:619   Length:2919        Length:2919        Min.   :     0  
 7      :446   2007:692   Class :character   Class :character   1st Qu.:     0  
 5      :394   2008:622   Mode  :character   Mode  :character   Median : 34900  
 4      :279   2009:647                                         Mean   : 90492  
 8      :233   2010:339                                         3rd Qu.:163000  
 3      :232                                                    Max.   :755000  
 (Other):832                                                                                                                                 

#Scatterplot

rplot01

Price Vs. Overall Condition (1-10 scale, 10 being best condition) condition

Price Vs. Overall Quality (1-10 scale, 10 being highest quality) quality

Looking at Sale Price V. Year Built V. Overall Quality salepricevquality

This plot show a the relationship between the SalePrice, YearBuilt and OverallQuality. It seems that houses from recent years have better grade of quality.

Looking at the data we have a ton of values that are missing data. I've decided to input the missing values into missing data points based on what would most make sense for my model. For instance, the variable for Lot Frontage has missing values. When I run a summary for this data, I get the following:

summary(df$LotFrontage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  21.00   60.00   68.00   69.09   78.00  313.00 

In this case, it is probably best to use the median, I've decided to use the median of 69.09 to use for all missing values for this partiuclar variable since it is normally distributed

In another case, I've used just the most frequent feature. For example, the varibale of MSZoing, which refers to how a house is zoned (residential, commerical, etc). As we can see from the table, the vast majority of data is zoned as 'RL', so we will use this to input for the missing values.

table(df$MSZoning)
C (all)      FV      RH      RL      RM 
     25     139      26    2269     460 

Now that we have our data cleaned up, let's start putting the data into some models. THe first model I will use is a very simple linear model using only month and year sold, square footage, number of bedrooms and lot size. I wanted to input some data values that I assumed wouldn't be a strong indicator, but would still yield some basic results that we can compare against later.

lineargraphs

#Linear Graphs

Residuals vs. Fitted Plots

*This plot shows error Residuals vs fitted values. *The dotted line at y=0 indicates our fit line. *Any point on fit line obviously has zero residual. Points above have positive residuals and points below have negative residuals. *The red line is the the smoothed high order polynomial curve to give us an idea of pattern of residual movement. In our case we can see that our residuals have curved pattern. This could mean that we may get a better model is we try a model with a quadratic term included. We will explore this point further by actually trying this to see if it helps

Normal Q-Q Plot

*The Normal Q-Q plot is used to check if our residuals follow Normal distribution or not. *The residuals are normally distributed if the points follow the dotted line closely *My graph indicates that most points are within the normal distribution, so my model appears to pass the test of normality

Scale – Location Plot

*Scale location plot indicates spread of points across predicted values range. *One of the assumptions for Regression is Homoscedasticity meaning variance should be reasonably equal across the predictor range. *As residuals spread wider from each other the red spread line goes up. In my graph, it appears that around the 100,000 mark, it looks like the residuals are getting closer to each other therefore causing the red spread line to go down A horizontal red line is ideal and would indicate that residuals have uniform variance across the range.

Residuals vs Leverage Plot

*This plot took was a little harder to grasp when I first looked into what it’s telling us. *Influence : The Influence of an observation can be thought of in terms of how much the predicted scores would change if the observation is excluded. Cook’s Distance is a pretty good measure of influence of an observation. *Leverage : The leverage of an observation is based on how much the observation’s value on the predictor variable differs from the mean of the predictor variable. The more the leverage of an observation , the greater potential that point has in terms of influence. *What we are concerned with here is that if any points land outside the two dotted lines, meaning those points has very high leverage or potential for influencing our model. So typically if I did have points in that range, I would most likely want to exclude those points. My model have all points within the desired range

#This is a very basic linear model using only lot's square footage, bedrooms, and month 
linearModel <- lm(SalePrice ~ YrSold + MoSold + LotArea + BedroomAbvGr, data=train)
linearPreds <- data.frame(Id = test$Id, SalePrice= predict(linearModel, test))
str(linearPreds)
head(linearPreds)
data.frame':	1459 obs. of  2 variables:
 $ Id       : int  1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 ...
 $ SalePrice: num  169277 187758 183584 179317 150730 ...
> linearModel <- lm(SalePrice ~ YrSold + MoSold + LotArea + BedroomAbvGr, data=df)
> linearPreds <- data.frame(Id = df$Id, SalePrice= predict(linearModel, df))
> str(linearPreds)
'data.frame':	2919 obs. of  2 variables:
 $ Id       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ SalePrice: num  173686 180162 186931 177646 209445 ...
> head(linearPreds)
  Id SalePrice
1  1  173686.1
2  2  180161.7
3  3  186930.9
4  4  177646.3
5  5  209445.2
6  6  166224.7
> summary(linearModel)

Call:
lm(formula = SalePrice ~ YrSold + MoSold + LotArea + BedroomAbvGr, 
    data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-258793  -48846  -17832   30415  542493 

Coefficients:
                 Estimate   Std. Error t value             Pr(>|t|)    
(Intercept)  1917554.0845 3037799.2464   0.631                0.528    
YrSold          -897.6770    1512.5335  -0.593                0.553    
MoSold          1104.8926     743.2912   1.486                0.137    
LotArea            1.9680       0.2005   9.817 < 0.0000000000000002 ***
BedroomAbvGr   13275.8801    2456.2101   5.405         0.0000000756 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 75870 on 1455 degrees of freedom
  (1459 observations deleted due to missingness)
Multiple R-squared:  0.09035,	Adjusted R-squared:  0.08785 
F-statistic: 36.13 on 4 and 1455 DF,  p-value: < 0.00000000000000022

The result I focused on here was R-Squared. R – squared is a statistical measure of how close the data are to the fitted regression line. Typically, the higher the R-squared, the better the model fits your data.

The adjusted R-squared value for this simple linear regression is 0.08785 which of course shows that this is close to random, which means our model is not very good. Let’s put all the variables with our cleaned up data set, to where we input values for all the missing values and see what happens:

The 2nd time I ran the linear model was with all the data and I was able to get a better R-squared result:

lm(formula = SalePrice ~ ., data = clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-118577   -6221       0    5912  118577 

Coefficients: (236 not defined because of singularities)
                             Estimate       Std. Error t value             Pr(>|t|)    
(Intercept)           1164216785.4293   538384358.0602   2.162               0.0309 *  
V1                           -82.8075           2.0041 -41.319 < 0.0000000000000002 ***
Id                                 NA               NA      NA                   NA    
MSSubClass                     0.1044         198.8921   0.001               0.9996    
MSZoningFV                -10617.3665       31325.3579  -0.339               0.7348    
MSZoningRH                -24678.6745       27181.6366  -0.908               0.3642    
MSZoningRL                -14723.3311       22104.7801  -0.666               0.5056    
MSZoningRM                -25838.4083       20187.7555  -1.280               0.2010    
LotFrontage                  -51.4765         133.3043  -0.386               0.6995    
LotArea                        0.6451           0.5885   1.096               0.2733    
StreetPave                  7385.5740       25515.8856   0.289               0.7723    
AlleyNone                   7706.9297        8805.1317   0.875               0.3817    
AlleyPave                   2650.6978       15719.8815   0.169               0.8661    
LotShapeIR2                17873.8286       11532.2747   1.550               0.1216    
LotShapeIR3                53072.7219       30457.5635   1.743               0.0819 .  
LotShapeReg                 1287.5956        4464.7641   0.288               0.7731    
LandContourHLS             -7983.9429       14102.6656  -0.566               0.5715    
LandContourLow             11892.3204       17301.7462   0.687               0.4921    
LandContourLvl              2961.2140        9213.5524   0.321               0.7480    
UtilitiesAllPub           -17272.8471       45867.3678  -0.377               0.7066    
UtilitiesNoSeWa        -33556838.5227    15807623.5116  -2.123               0.0341 *  
LotConfigCulDSac            4985.4118        9581.7688   0.520               0.6030    
LotConfigFR2               -7761.1473       11881.8112  -0.653               0.5138    
LotConfigFR3              -49349.5668       32451.3853  -1.521               0.1288    
LotConfigInside            -6772.5410        4623.4876  -1.465               0.1434    
LandSlopeMod               -4382.8643       10926.7009  -0.401               0.6885    
LandSlopeSev              -17179.1659       41874.7000  -0.410               0.6817    
NeighborhoodBlueste       -68034.4315       78225.5593  -0.870               0.3847    
NeighborhoodBrDale        -41660.5064       58669.4354  -0.710               0.4779    
NeighborhoodBrkSide       -35685.8981       54757.7243  -0.652               0.5148    
NeighborhoodClearCr       -92390.0710       56862.6723  -1.625               0.1047    
NeighborhoodCollgCr       -47614.8060       52447.0766  -0.908               0.3643    
NeighborhoodCrawfor       -31944.8667       55698.7957  -0.574               0.5665    
NeighborhoodEdwards       -39612.9742       54036.5637  -0.733               0.4638    
NeighborhoodGilbert       -55993.3896       52707.0328  -1.062               0.2884    
NeighborhoodIDOTRR        -41346.1235       54805.8424  -0.754               0.4509    
NeighborhoodMeadowV       -33875.6942       60538.2197  -0.560               0.5759    
NeighborhoodMitchel       -37272.3556       53921.9454  -0.691               0.4896    
NeighborhoodNAmes         -60078.9334       53859.0037  -1.115               0.2650    
NeighborhoodNoRidge       -17924.0873       55669.4850  -0.322               0.7476    
NeighborhoodNPkVill         3025.2816       83086.1408   0.036               0.9710    
NeighborhoodNridgHt       -65725.7277       52373.4609  -1.255               0.2099    
NeighborhoodNWAmes        -50152.3138       54121.5372  -0.927               0.3544    
NeighborhoodOldTown       -44097.4300       54968.1215  -0.802               0.4227    
NeighborhoodSawyer        -47668.1502       54156.7509  -0.880               0.3791    
NeighborhoodSawyerW       -47173.7293       53013.5279  -0.890               0.3739    
NeighborhoodSomerst       -30111.3266       55682.5285  -0.541               0.5888    
NeighborhoodStoneBr         7461.8993       56334.9390   0.132               0.8947    
NeighborhoodSWISU         -28153.6418       55339.8306  -0.509               0.6111    
NeighborhoodTimber        -53311.1227       54023.9606  -0.987               0.3241    
NeighborhoodVeenker       -87102.8633       60158.1530  -1.448               0.1481    
Condition1Feedr            16059.1340       11547.2432   1.391               0.1647    
Condition1Norm              8166.5046        9191.2330   0.889               0.3746    
Condition1PosA             -7680.2092       23972.7793  -0.320               0.7488    
Condition1PosN              5293.9235       18557.8575   0.285               0.7755    
Condition1RRAe             24565.3938       18475.5459   1.330               0.1841    
Condition1RRAn             -1167.1212       15462.0458  -0.075               0.9399    
Condition1RRNe            -10578.6499       37679.4257  -0.281               0.7790    
Condition1RRNn            -26758.5261       31274.8125  -0.856               0.3925    
Condition2Feedr           -21418.5901       41158.2757  -0.520               0.6030    
Condition2Norm              3433.4548       34686.9257   0.099               0.9212    
Condition2PosA             29900.3269       63573.3496   0.470               0.6383    
Condition2PosN           -429562.2318      101878.5944  -4.216             0.000028 ***
Condition2RRAe          67462576.7580    31949879.7265   2.112               0.0351 *  
Condition2RRAn             19643.7897       76008.2728   0.258               0.7961    
Condition2RRNn            -69656.6182       59073.7832  -1.179               0.2387    
BldgType2fmCon            -12472.9611       29066.0957  -0.429               0.6680    
BldgTypeDuplex              3198.9819       17377.7548   0.184               0.8540    
BldgTypeTwnhs             -27989.2109       25243.0317  -1.109               0.2679    
BldgTypeTwnhsE            -11947.1529       23071.8841  -0.518               0.6047    
HouseStyle1.5Unf          -33141.8881       18927.7058  -1.751               0.0804 .  
HouseStyle1Story             601.1175       10627.5369   0.057               0.9549    
HouseStyle2.5Fin           45702.5201       39620.4296   1.154               0.2491    
HouseStyle2.5Unf          -19410.9237       19932.0178  -0.974               0.3305    
HouseStyle2Story           -2402.9209        8768.9081  -0.274               0.7841    
HouseStyleSFoyer          -19636.9166       15217.8696  -1.290               0.1973    
HouseStyleSLvl             13724.3974       14668.6724   0.936               0.3498    
OverallQual                 4319.6679        2614.8978   1.652               0.0990 .  
OverallCond                 3807.6912        2036.8390   1.869               0.0620 .  
YearBuilt                      3.4009         184.5513   0.018               0.9853    
YearRemodAdd                  55.6391         132.1857   0.421               0.6739    
RoofStyleGable             25773.1549       67619.1264   0.381               0.7032    
RoofStyleGambrel           30665.8630       70639.7838   0.434               0.6643    
RoofStyleHip               37237.7926       68097.4638   0.547               0.5847    
RoofStyleMansard           25460.6468       74785.3188   0.340               0.7336    
RoofStyleShed              28551.5134       95391.3254   0.299               0.7648    
RoofMatlCompShg       -952946134.0674   450416818.4687  -2.116               0.0347 *  
RoofMatlMembran       -880796389.4894   416347915.5020  -2.116               0.0347 *  
RoofMatlMetal         -952810600.7390   450413213.9633  -2.115               0.0347 *  
RoofMatlRoll         -1084999040.0417   512804721.5731  -2.116               0.0347 *  
RoofMatlTar&Grv       -952902253.4954   450415713.4401  -2.116               0.0347 *  
RoofMatlWdShake       -953031615.8603   450417209.6204  -2.116               0.0347 *  
RoofMatlWdShngl       -952871640.9807   450415244.9863  -2.116               0.0347 *  
Exterior1stAsphShn         62530.6896      105696.2889   0.592               0.5543    
Exterior1stBrkComm         41829.6565       55043.9109   0.760               0.4475    
Exterior1stBrkFace          4383.4808       30754.1907   0.143               0.8867    
Exterior1stCBlock          91827.3304       53935.3508   1.703               0.0891 .  
Exterior1stCemntBd         21226.4337       70225.4884   0.302               0.7625    
Exterior1stHdBoard        -15899.0680       29049.2028  -0.547               0.5843    
Exterior1stImStucc     116646665.3149    55127753.6071   2.116               0.0347 *  
Exterior1stMetalSd         11548.1228       32725.4667   0.353               0.7243    
Exterior1stPlywood         -8368.3916       28341.4899  -0.295               0.7679    
Exterior1stStone       105750223.7299    49927039.2877   2.118               0.0345 *  
Exterior1stStucco         -13527.1818       31252.4665  -0.433               0.6653    
Exterior1stVinylSd         14522.3085       33039.5903   0.440               0.6604    
Exterior1stWd Sdng         -4926.3874       28411.0329  -0.173               0.8624    
Exterior1stWdShing         12917.3695       29701.1279   0.435               0.6638    
Exterior2ndAsphShn        -65724.0508       98733.1885  -0.666               0.5058    
Exterior2ndBrk Cmn       -107084.8991       64266.9862  -1.666               0.0961 .  
Exterior2ndBrkFace          9574.0650       36492.0495   0.262               0.7931    
Exterior2ndCBlock         -45210.3186       57755.7594  -0.783               0.4340    
Exterior2ndCmentBd        -55304.1212       71463.1418  -0.774               0.4393    
Exterior2ndHdBoard         10945.0082       30981.0321   0.353               0.7240    
Exterior2ndImStucc        -10597.8574       41157.4529  -0.257               0.7969    
Exterior2ndMetalSd         -9946.1829       34895.7041  -0.285               0.7757    
Exterior2ndOther       152139124.4370    71867969.6824   2.117               0.0346 *  
Exterior2ndPlywood         11378.8219       29469.6228   0.386               0.6995    
Exterior2ndStone           57381.5648       43181.9814   1.329               0.1843    
Exterior2ndStucco          19032.7176       33635.9358   0.566               0.5717    
Exterior2ndVinylSd        -23619.2951       34826.6562  -0.678               0.4979    
Exterior2ndWd Sdng          2413.2181       30041.6112   0.080               0.9360    
Exterior2ndWd Shng         -8282.8355       30584.8946  -0.271               0.7866    
MasVnrTypeBrkFace           4799.8273       18744.6208   0.256               0.7980    
MasVnrTypeNone             10460.2382       18690.1870   0.560               0.5759    
MasVnrTypeStone            16497.4966       20438.7556   0.807               0.4198    
MasVnrArea                     2.7837          18.6911   0.149               0.8816    
ExterQualFa                -3327.0696       27254.2708  -0.122               0.9029    
ExterQualGd                -3923.8495       20784.3755  -0.189               0.8503    
ExterQualTA                 2218.5994       21701.5458   0.102               0.9186    
ExterCondFa                31014.2939       25738.7549   1.205               0.2286    
ExterCondGd                34972.3985       23369.0616   1.497               0.1350    
ExterCondPo                -7107.1452       52127.2663  -0.136               0.8916    
ExterCondTA                36396.6554       23114.5323   1.575               0.1158    
FoundationCBlock           11188.9564        7516.3038   1.489               0.1370    
FoundationPConc            -4975.7737        7875.7679  -0.632               0.5277    
FoundationSlab             25575.4342       15046.6133   1.700               0.0896 .  
FoundationStone           -51135.2053       32093.6629  -1.593               0.1115    
FoundationWood            -15205.2422       35717.7354  -0.426               0.6705    
BsmtQualFa                -15402.1230       16095.4592  -0.957               0.3389    
BsmtQualGd                 -2986.0714       12308.4538  -0.243               0.8084    
BsmtQualTA                -14926.9416       13797.4930  -1.082               0.2797    
BsmtCondGd                 -7894.1414       13171.1495  -0.599               0.5491    
BsmtCondPo                -15855.4030       40506.5602  -0.391               0.6956    
BsmtCondTA                  2435.7646        9763.6902   0.249               0.8031    
BsmtExposureGd            -15164.9942        9474.4605  -1.601               0.1099    
BsmtExposureMn             -9820.2935        8432.3737  -1.165               0.2446    
BsmtExposureNo             -5354.2403        6648.7661  -0.805               0.4209    
BsmtFinType1BLQ             2113.5043        7792.3843   0.271               0.7863    
BsmtFinType1GLQ             6420.9048        7275.5580   0.883               0.3778    
BsmtFinType1LwQ              322.9165        9917.4537   0.033               0.9740    
BsmtFinType1None      -186293892.5764    87975096.9926  -2.118               0.0346 *  
BsmtFinType1Rec            -6791.5853        7430.9157  -0.914               0.3610    
BsmtFinType1Unf            18444.3294       11478.1338   1.607               0.1085    
BsmtFinSF1               -186492.0768       88116.6583  -2.116               0.0347 *  
BsmtFinType2BLQ           -11458.4811       16590.7542  -0.691               0.4900    
BsmtFinType2GLQ           -24463.1703       21296.2276  -1.149               0.2511    
BsmtFinType2LwQ             9552.6182       16238.3565   0.588               0.5565    
BsmtFinType2Rec             1020.9162       16499.2350   0.062               0.9507    
BsmtFinType2Unf             -424.3592       17366.6962  -0.024               0.9805    
BsmtFinSF2               -186478.6163       88116.5714  -2.116               0.0347 *  
BsmtUnfSF100           -18663768.1730     8811726.8560  -2.118               0.0345 *  
BsmtUnfSF1005         -187408337.3779    88556040.2841  -2.116               0.0347 *  
BsmtUnfSF1007         -186296951.0595    87975956.3637  -2.118               0.0346 *  
BsmtUnfSF1008         -188016661.3513    88821368.8741  -2.117               0.0346 *  
BsmtUnfSF1010         -188448777.6843    88998928.5763  -2.117               0.0346 *  
BsmtUnfSF1012         -188736257.3180    89173456.5911  -2.117               0.0347 *  
BsmtUnfSF1013         -189082853.7897    89263980.5765  -2.118               0.0345 *  
BsmtUnfSF1017         -189548303.7368    89615372.8304  -2.115               0.0348 *  
BsmtUnfSF1018         -189902056.3487    89703056.3257  -2.117               0.0346 *  
BsmtUnfSF102           -19037358.4322     8988638.0599  -2.118               0.0345 *  
BsmtUnfSF1020         -217715960.8248   102866332.5324  -2.116               0.0347 *  
BsmtUnfSF1022         -190675363.1085    90052308.4518  -2.117               0.0346 *  
BsmtUnfSF1026         -191258983.9583    90407435.4003  -2.116               0.0347 *  
BsmtUnfSF1028         -191981566.4502    90585295.2863  -2.119               0.0344 *  
BsmtUnfSF103           -19168251.9942     9075537.8515  -2.112               0.0350 *  
BsmtUnfSF1030         -186199192.9715    87973146.2279  -2.117               0.0346 *  
BsmtUnfSF1032         -192384501.2641    90935721.9938  -2.116               0.0347 *  
BsmtUnfSF1035          -16167728.9039     7611963.6659  -2.124               0.0340 *  
BsmtUnfSF104           -19417722.6190     9165473.9617  -2.119               0.0345 *  
BsmtUnfSF1040         -193995683.7465    91641284.9552  -2.117               0.0346 *  
BsmtUnfSF1041         -194106175.5555    91728968.2323  -2.116               0.0347 *  
BsmtUnfSF1042         -194394079.7524    91818741.5868  -2.117               0.0346 *  
BsmtUnfSF1043         -258527845.9919   122079653.8324  -2.118               0.0345 *  
BsmtUnfSF1045           12484957.4430     5868473.9496   2.127               0.0337 *  
BsmtUnfSF1046         -195033247.0370    92172298.3435  -2.116               0.0347 *  
BsmtUnfSF1048         -195608151.8548    92345480.0335  -2.118               0.0345 *  
BsmtUnfSF105           -19441645.9678     9251691.9467  -2.101               0.0360 *  
BsmtUnfSF1050         -195872223.8409    92521870.0670  -2.117               0.0346 *  
BsmtUnfSF1052         -196280426.5007    92697925.8616  -2.117               0.0346 *  
BsmtUnfSF1053         -196464832.0286    92787919.7502  -2.117               0.0346 *  
BsmtUnfSF1054         -196541852.5350    92874267.6700  -2.116               0.0347 *  
BsmtUnfSF1055         -256714682.5495   121282235.1483  -2.117               0.0346 *  
BsmtUnfSF1057         -197188332.5895    93139400.6456  -2.117               0.0346 *  
BsmtUnfSF1058         -197328504.4629    93229115.1321  -2.117               0.0346 *  
BsmtUnfSF106           -19796658.8315     9341853.2875  -2.119               0.0344 *  
BsmtUnfSF1063         -261770208.0044   123659247.1006  -2.117               0.0346 *  
BsmtUnfSF1064         -198425672.6499    93755825.3823  -2.116               0.0347 *  
BsmtUnfSF1065         -198717122.4848    93844045.4308  -2.118               0.0346 *  
BsmtUnfSF1066         -186357155.1981    87974819.3517  -2.118               0.0345 *  
BsmtUnfSF1068         -199119032.1902    94109829.4447  -2.116               0.0347 *  
 [ reached getOption("max.print") -- omitted 2247 rows ]
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 43390 on 708 degrees of freedom
Multiple R-squared:  0.9597,	Adjusted R-squared:  0.834 
F-statistic: 7.635 on 2210 and 708 DF,  p-value: < 0.00000000000000022

##RPART

model <- rpart(SalePrice ~.,data = clean, method = "anova")
predict1 <- predict(model, clean)
summary(predict1)
library(caret)
head(predict1)
postResample(predict1, clean$SalePrice)

Show in New WindowClear OutputExpand/Collapse Output
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0  134939   90492  134939  577382 
       1        2        3        4        5        6 
195586.9 195586.9 195586.9 134938.7 195586.9 134938.7

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0  134939   90492  134939  577382 
    
    RMSE    Rsquared         MAE 
27861.88349     0.93153 14620.09323

##RANDOM FOREST One thing that I did have trouble with was converting the data from a categorical value to a numerical. When trying to run my random forest model, I kept getting errors that stated that my data could not be read.

I then had to use the following functions to get it work correctly. This took me awhile to figure out, but then through the help of the professor and used this code to process it.

#First we have to make it so the the columns in R don't start with numbers, so I'll change
names(clean) <- make.names(names(clean))
#Then we have to change the # of categorial bc RF cannot hanlde over 53
clean <- clean %>% mutate_if(is.character,as.factor)

So I was able to have what I think was a little bit more success in my 2nd try with the random forest. I had a couple factors in my dataset that had over 53 levels, so I changed those values from factors with over 53 levels to integers and ran the model.

> rf1 <- randomForest(SalePrice~.,data=clean, ntree=1000,proximity=TRUE)
> varImpPlot(rf1)
> predict_rf<-predict(rf1, clean) #Prediction
> head(predict_rf)
      1        2        3        4        5        6 
208014.0 175139.4 221352.1 158057.1 12000000000000 147914.5 
> summary(predict_rf)
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     0       0   50532   90522  163504  642316 
> postResample(predict_rf, clean$SalePrice)
       RMSE     Rsquared          MAE 
8389.5080258    0.9941019 3464.3994625 

I used the caret package in R to test the models using the postResample function. As we can see, this R-squared for this model of 0.9941019 is much better than the Linear model or the Rpart model. Here is the graph for the RandomForest

rf model

CONCLUSIONS

As we made the models a bit more complicated we were able to get better results for our predictions. in order from least effective to most effective, it was linear regression, then using the recursive partitioning in caret, then finally the random forest gave us the best model to be able to predict a final house price. Given this was my first time around kaggle and a practical project, it was very interesting because what i thought i was going to be doing most of my work was the models, but the bulk of the time to get this data was just figuring out how to handle the missing data. there are much more sophisticated approaches out there to be used, but overall i believe the models used here explain

I believe next time given this same problem, i would do some more feature engineering to zoom in on what variables have the largest effect on the sale price.

About

Ben's Project

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages