The goal of this project is to build a regression model on the Ames Housing Dataset that accurately predicts the sale price of a house. User traffic to properties with 3-star accuracy Zestimates has increased, so we want to improve Zestimate accuracy (star rating) for areas with 3 stars (Good Zestimate).
We cleaned and analyzed the dataset's 81 features and engineered new ones by way of dummying, mapping, and polynomial features. This resulted in a final set of 20,706 features, which we fed into our LassoCV model. We chose a Lasso model because such a large number of features carries a high risk of overfitting, and Lasso's L1 penalty can shrink the coefficients of uninformative features all the way to zero.
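
As a rough illustration of this setup, the sketch below expands a small synthetic dataset with `PolynomialFeatures` and fits `LassoCV` on the result. Everything specific here is an assumption rather than the project's actual code: the data comes from `make_regression`, the degree is 2, the features are standardized first, and `n_poly_features` is a hypothetical helper. One observation from the arithmetic: 20,706 is exactly the number of columns a degree-2 expansion of 202 inputs produces when the bias column is included, which may be how the final count arose, though the write-up does not say so.

```python
from math import comb

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def n_poly_features(n_columns, degree=2):
    """Column count of a full polynomial expansion, bias column included."""
    return comb(n_columns + degree, degree)


# Synthetic stand-in for the cleaned housing data (assumption: 10 numeric columns).
X, y = make_regression(n_samples=300, n_features=10, n_informative=4,
                       noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Expand, scale, then let cross-validation pick the L1 penalty strength.
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LassoCV(cv=5, random_state=42, max_iter=10000),
)
pipe.fit(X_train, y_train)

lasso = pipe.named_steps["lassocv"]
n_expanded = pipe.named_steps["polynomialfeatures"].n_output_features_
n_kept = int(np.sum(lasso.coef_ != 0))

print(f"{n_expanded} expanded features, {n_kept} with non-zero coefficients")
print(f"train R^2: {pipe.score(X_train, y_train):.3f}")
print(f"test  R^2: {pipe.score(X_test, y_test):.3f}")
```

Counting the non-zero coefficients is a quick way to see how many of the expanded features the L1 penalty actually keeps.
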
Our model achieved an R² of 0.94 on training data and 0.86 on unseen data. This means two things: the model explains 86% of the variability in sale price on unseen data, and the 8-point gap between the training and unseen scores indicates that the model is overfit. Our final model was not our best model. Unfortunately, we overwrote the features from our best model on the assumption that converting all features to numeric and using polynomial features would yield even better results, which was not the case.
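
For reference, R² and the train/test gap can be computed as in the minimal sketch below. The function names `r_squared` and `overfit_gap` are hypothetical, and the only numbers used are the two scores reported above.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def overfit_gap(r2_train, r2_test):
    """Difference between training and unseen-data R^2; larger means more overfit."""
    return r2_train - r2_test


# Scores reported for the final model.
gap = overfit_gap(0.94, 0.86)
print(f"train/test R^2 gap: {gap:.2f}")
```

A gap near zero would suggest the model generalizes about as well as it fits; how large a gap is acceptable is a judgment call that depends on the use case.
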
- Continue to test and learn to improve the model by:
  - Revisiting our initial forward selection process
  - Selecting features highly correlated with Sale Price
  - Creating new features and checking their correlation with Sale Price
  - Feeding them to the model and checking the results
  - Keeping track of the features that improve the model and discarding (but still tracking) the features that don't
- We believe that more time, and more diverse perspectives in the test-and-learn process, will yield better results.
- We would also like more data from other cities in Story County and more data on the neighborhood (e.g. number of schools, school types, prisons/jails in the area, etc.)
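
The test-and-learn loop in the first bullet above could be sketched roughly as follows. This is one possible reading, not the project's code: candidates are ranked by absolute correlation with the target, each is tried in turn with a plain `LinearRegression` scored by cross-validated R², and both kept and discarded features are recorded. The function `forward_select`, the `min_gain` threshold, and the synthetic column names are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_select(X, y, names, min_gain=0.001, cv=5):
    """Greedy forward selection; returns (kept names, discarded log, best score)."""
    # Rank candidate columns by absolute correlation with the target.
    corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    order = np.argsort(corr)[::-1]

    kept, discarded, best = [], [], -np.inf
    for j in order:
        trial = kept + [int(j)]
        score = cross_val_score(LinearRegression(), X[:, trial], y,
                                cv=cv, scoring="r2").mean()
        if score - best > min_gain:
            kept, best = trial, score                    # feature helped: keep it
        else:
            discarded.append((names[j], float(score)))   # track the rejects too
    return [names[j] for j in kept], discarded, float(best)


# Synthetic demo data (assumption: two informative columns and one pure-noise column).
rng = np.random.default_rng(0)
n = 200
names = ["living_area", "quality", "noise"]
X = np.column_stack([
    rng.normal(1500, 400, n),
    rng.integers(1, 11, n).astype(float),
    rng.normal(0, 1, n),
])
y = 100 * X[:, 0] + 15000 * X[:, 1] + rng.normal(0, 10000, n)

kept, discarded, best = forward_select(X, y, names)
print("kept:", kept)
print("discarded:", discarded)
print(f"cross-validated R^2: {best:.3f}")
```

Logging the discarded features along with their scores keeps a record that can be revisited later, which guards against losing a good feature set when experimenting further.
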