Team Members: Shusaku Asai, Congjun Huang, Jingjing Wang
Many machine learning applications model home sale prices. We extended this work by applying several sale price models to Single-Family, Multi-Family, Mixed-Use, Industrial, Commercial, and Vacant Land property types in Philadelphia. Within each property type, we built and tuned the best-performing model and tested its generalizability on future unseen data. As measured by RMSE on the test data, the Single-Family model performed best and the Commercial model performed worst. Across all property types, XGBoost and Random Forest models showed superior performance. Future home buyers in Philadelphia may find our models useful for predicting home sale prices. Government and businesses should use the Commercial, Industrial, Mixed-Use, and Vacant Land models with caution due to their poorer generalizability.
We utilized the publicly available Philadelphia Properties and Assessment History dataset. Data source and data dictionary can be found below.
https://www.opendataphilly.org/dataset/opa-property-assessments
The project is visualized with the below flowchart:
The Single-Family property type model had the smallest test RMSE and thus the best generalizability for
future predictions. The Commercial property type model had the largest test RMSE and thus the worst
generalizability. The XGBoost and Random Forest models were superior in both validation-set performance
and generalizability.
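The comparison above rests on computing RMSE separately for each property type. The following is a minimal sketch of that calculation using only the Python standard library. The function names (`rmse`, `rmse_by_property_type`) and the sample numbers are hypothetical and chosen for illustration; they are not taken from the project code or its results.

```python
import math
from collections import defaultdict


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rmse_by_property_type(records):
    """Compute RMSE per property type.

    records: iterable of (property_type, actual_price, predicted_price).
    """
    grouped = defaultdict(lambda: ([], []))
    for property_type, actual, predicted in records:
        grouped[property_type][0].append(actual)
        grouped[property_type][1].append(predicted)
    return {ptype: rmse(a, p) for ptype, (a, p) in grouped.items()}


# Made-up numbers for illustration only; not results from the project.
example = [
    ("Single-Family", 200_000, 210_000),
    ("Single-Family", 150_000, 140_000),
    ("Commercial", 900_000, 600_000),
    ("Commercial", 1_200_000, 1_600_000),
]
scores = rmse_by_property_type(example)
for property_type, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{property_type}: test RMSE = {score:,.0f}")
```

Because RMSE is expressed in the units of the sale price, property types with higher and more variable prices will tend to show larger values, which is worth keeping in mind when comparing types.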
Our Single-Family and Multi-Family models may be useful to future home buyers in Philadelphia. With a
model trained on all available data, a user can input a desired home’s amenities and location
to predict the sale price. Government organizations and private companies in Philadelphia may find our
models for the Mixed-Use, Vacant Land, Commercial, and Industrial property types useful, but we recommend
using them with caution due to their larger test RMSE and poorer generalizability.
The poor generalizability of the Commercial and Industrial property type models may result from greater
underlying price variability and less available data. The superior performance of the Single-Family model,
in contrast, may be due to a larger sample and a tighter distribution of sale prices. Future
work that aims to predict Commercial and Industrial property prices should ensure that features are largely
free of missing values, or aggregate data points from several cities to increase the sample size.
git clone https://github.com/delashu/Philadelphia-Housing.git
pip install -r requirements.txt
Data is found here:
https://www.opendataphilly.org/dataset/opa-property-assessments
Direct download of the data: https://opendata-downloads.s3.amazonaws.com/opa_properties_public.csv
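Since data availability differs across property types, a quick first check after downloading is to count how many rows with a usable sale price exist per type. The sketch below uses only the standard library. The column names `category_code_description` and `sale_price` are assumptions and should be checked against the data dictionary; the function name and the inline sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Assumed column names; verify them against the data dictionary.
TYPE_COLUMN = "category_code_description"
PRICE_COLUMN = "sale_price"


def count_priced_sales_by_type(csv_file):
    """Count rows with a positive sale price for each property type."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        try:
            price = float(row.get(PRICE_COLUMN) or "")
        except ValueError:
            continue  # skip missing or non-numeric prices
        if price > 0:
            property_type = (row.get(TYPE_COLUMN) or "UNKNOWN").strip()
            counts[property_type] += 1
    return counts


# Tiny in-memory sample standing in for the real file (made-up rows).
sample = io.StringIO(
    "category_code_description,sale_price\n"
    "Single Family,250000\n"
    "Single Family,\n"
    "Commercial,1200000\n"
    "Vacant Land,0\n"
)
counts = count_priced_sales_by_type(sample)
print(counts.most_common())
```

To run the same count on the downloaded file, open it with `open("opa_properties_public.csv", newline="")` and pass the file object to `count_priced_sales_by_type`.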
All experiment results can be accessed through the following three parts of the code:
https://github.com/delashu/Philadelphia-Housing/tree/main/EDA
https://github.com/delashu/Philadelphia-Housing/tree/main/processing
https://github.com/delashu/Philadelphia-Housing/tree/main/model