In this project, we use street-level photos and machine learning methods to infer sociological information about neighborhoods. For example, we examine the relationship between the number of cars on a street and the average property value on that street. We also test other hypotheses, such as whether the percentage of Japanese cars on a street is related to the average education level on that street.
intro: FIPS place codes identify cities, towns, and villages in the United States. These codes are essential for describing and analyzing geographic location information in census data.
Image source: Google Street View API
Number of pictures: 20000
Number of FIPS codes: 200, randomly selected
Number of images per FIPS: 100
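The collection step can be sketched roughly as follows. This is a minimal illustration, not the project's actual download script: it only builds request URLs for the Street View Static API and makes no network calls. The helper names, the FIPS code, the coordinates, and the `YOUR_API_KEY` placeholder are all assumptions made for the example.

```python
from urllib.parse import urlencode

# Public endpoint of the Google Street View Static API.
STREET_VIEW_ENDPOINT = "https://maps.googleapis.com/maps/api/streetview"


def build_street_view_url(lat, lng, api_key, heading=0, size="640x640"):
    """Return the request URL for one Street View image at (lat, lng)."""
    params = {
        "size": size,
        "location": f"{lat},{lng}",
        "heading": heading,
        "key": api_key,
    }
    return f"{STREET_VIEW_ENDPOINT}?{urlencode(params)}"


def plan_requests(points_by_fips, api_key):
    """Map each FIPS code to the list of image URLs to download for it."""
    return {
        fips: [build_street_view_url(lat, lng, api_key) for lat, lng in points]
        for fips, points in points_by_fips.items()
    }


# Hypothetical input: one FIPS code with two sampled coordinates.
sample_points = {"12345": [(25.7617, -80.1918), (25.7700, -80.2000)]}
urls = plan_requests(sample_points, api_key="YOUR_API_KEY")
for url in urls["12345"]:
    print(url)
```

With 200 FIPS codes and 100 sampled coordinates each, a plan built this way would contain the 20000 image requests listed above.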
intro: For all downloaded street-view images, an API is used to recognize vehicles, including their number, brand, type, and series. The recognized data is stored in a database and, after the relevant statistical operations (such as counting the number of vehicles in each FIPS area), exported to a CSV table.
Step 1: use the API
API: tencentcloudapi
Step 2: store data in the database
Database: MySQL
Step 3:
Step 4: output to CSV
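The store-then-aggregate flow in these steps can be sketched as below. This is an illustration under stated assumptions: an in-memory SQLite database stands in for MySQL so the example runs without a server, and the table layout, column names, and sample rows are hypothetical rather than taken from the project. The recognition call itself is not shown.

```python
import csv
import io
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL database named above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vehicles (fips TEXT, image_id TEXT, brand TEXT, vehicle_type TEXT)"
)

# Hypothetical recognition results, one row per detected vehicle.
rows = [
    ("12001", "img_001", "Toyota", "sedan"),
    ("12001", "img_001", "Ford", "pickup"),
    ("12001", "img_002", "Honda", "SUV"),
    ("12002", "img_101", "Chevrolet", "SUV"),
]
conn.executemany("INSERT INTO vehicles VALUES (?, ?, ?, ?)", rows)

# Statistical step: count the vehicles detected in each FIPS area.
counts = conn.execute(
    "SELECT fips, COUNT(*) AS vehicle_count "
    "FROM vehicles GROUP BY fips ORDER BY fips"
).fetchall()
conn.close()

# Output step: write the aggregated table as CSV (to a string buffer here).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["FIPS", "vehicle_count"])
writer.writerows(counts)
csv_text = buffer.getvalue()
print(csv_text)
```

The same `GROUP BY` query works in MySQL; only the connection code would differ.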
After exploring and analyzing the dataset, we use both regression models and classifier models to investigate the relationship between the number and type of vehicles and the median property value in a given area.
For regression, we selected a baseline regression model and a regression model with multiple independent variables. The baseline model was used to examine the relationship between the number of vehicles and the median property value in a given area. The model with multiple independent variables was used to examine the relationship between vehicle series and the median property value, and between vehicle types and the median property value. Here, vehicle series refers to Japanese, American, and other series; vehicle types refers to pickups, SUVs, sedans, and others.
In the baseline model, the total number of vehicles in the region is the independent variable and the median property value in the region is the dependent variable. The result is shown below:
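For reference, a baseline fit of this kind can be sketched as follows. The data here is synthetic and generated only for illustration, so the printed numbers say nothing about the project's results. The variable names are assumptions, and `np.polyfit` is used as a simple stand-in for whichever regression library the project actually used.

```python
import numpy as np

# Synthetic data for illustration only (not the project's dataset).
rng = np.random.default_rng(0)
vehicle_count = rng.integers(50, 500, size=200).astype(float)
property_value_median = (
    150_000 + 100 * vehicle_count + rng.normal(0, 50_000, size=200)
)

# Fit property_value_median = slope * vehicle_count + intercept.
slope, intercept = np.polyfit(vehicle_count, property_value_median, deg=1)

# Coefficient of determination: share of variance explained by the line.
predicted = slope * vehicle_count + intercept
ss_res = np.sum((property_value_median - predicted) ** 2)
ss_tot = np.sum((property_value_median - property_value_median.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r_squared:.3f}")
```

A low R^2 from such a fit is one way to see that a single linear term explains little of the variation in property values.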
The two independent variables are the proportion of Japanese-series vehicles and the proportion of American-series vehicles, and the dependent variable is the median property value. The result is shown below:
The independent variables are the proportions of pickups, SUVs, and sedans, and the dependent variable is the median property value. The result is shown below:
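A regression with several proportion features can be sketched along these lines. Again the data is synthetic and the helper `fit_ols` is a hypothetical name introduced for the example. The "other" category is left out of the features because the proportions sum to one, which would otherwise make the design matrix collinear with the intercept.

```python
import numpy as np


def fit_ols(features, target):
    """Ordinary least squares with an intercept; returns (coefficients, R^2)."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r_squared = 1 - (residuals @ residuals) / np.sum((target - target.mean()) ** 2)
    return coef, r_squared


# Synthetic shares of pickups, SUVs, sedans, and others (each row sums to 1).
rng = np.random.default_rng(1)
shares = rng.dirichlet([2, 2, 2, 1], size=200)
type_features = shares[:, :3]  # drop "other" to avoid perfect collinearity

# Synthetic target for illustration only.
target = (
    200_000
    + 80_000 * shares[:, 1]
    - 40_000 * shares[:, 0]
    + rng.normal(0, 30_000, size=200)
)

coef, r2 = fit_ols(type_features, target)
print("intercept, pickup, SUV, sedan coefficients:", np.round(coef, 1))
print(f"R^2 = {r2:.3f}")
```

The same function applies to the series model by passing the Japanese and American proportions as the two feature columns.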
The script loads two CSV files (data.csv and Florida_ct.csv) using pandas. The data from the two DataFrames are merged based on the 'FIPS' column. Additional columns are created based on calculations involving existing columns. The relevant columns for the analysis are selected and stored in the test_df DataFrame. A new column 'property_value_discrete' is created based on a threshold value for 'property_value_median'.
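The preparation steps described above might look roughly like this. Small in-memory frames stand in for `data.csv` and `Florida_ct.csv` so the example is self-contained. The columns `total_vehicles`, `japanese_vehicles`, and `japanese_share` are hypothetical, and using the median as the threshold is an assumption, since the original threshold value is not stated.

```python
import pandas as pd

# Stand-ins for data.csv and Florida_ct.csv (hypothetical values and columns).
vehicles = pd.DataFrame({
    "FIPS": [12001, 12002, 12003, 12004],
    "total_vehicles": [120, 80, 200, 50],
    "japanese_vehicles": [40, 10, 90, 5],
})
census = pd.DataFrame({
    "FIPS": [12001, 12002, 12003, 12004],
    "property_value_median": [180_000, 95_000, 320_000, 140_000],
})

# Merge the two tables on the shared FIPS column.
merged = vehicles.merge(census, on="FIPS", how="inner")

# Derived column computed from existing columns.
merged["japanese_share"] = merged["japanese_vehicles"] / merged["total_vehicles"]

# Keep only the columns needed for the analysis.
test_df = merged[
    ["FIPS", "total_vehicles", "japanese_share", "property_value_median"]
].copy()

# Assumed threshold: the median of property_value_median.
threshold = test_df["property_value_median"].median()
test_df["property_value_discrete"] = (
    test_df["property_value_median"] > threshold
).astype(int)

print(test_df)
```

With real files, the two frames would come from `pd.read_csv("data.csv")` and `pd.read_csv("Florida_ct.csv")` instead.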
The data is split into features (X) and the target variable (y). The feature set consists of selected columns, and the target variable is 'property_value_discrete'.
A logistic regression model is instantiated, trained on the training data, and evaluated on both the training and testing sets. A list of classifier instances is then created, including three newly added models: XGBoost, LightGBM, and CatBoost. A loop iterates through the classifiers, fits each one to the training data, and evaluates its performance on both the training and testing sets. Accuracy and log loss are calculated for each classifier.
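The split-and-evaluate loop can be sketched as below. This is an illustration on synthetic data, not the project's script. To keep it runnable with scikit-learn alone, only scikit-learn classifiers are listed; XGBoost, LightGBM, and CatBoost classifiers could be appended to the same list because they expose compatible `fit`, `predict`, and `predict_proba` methods.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic features and binary target for illustration only.
rng = np.random.default_rng(42)
X = rng.random((300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.2, 300) > 0.75).astype(int)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

classifiers = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(n_neighbors=5),
    RandomForestClassifier(n_estimators=100, random_state=42),
]

# Fit each classifier and record accuracy and log loss on both splits.
results = {}
for clf in classifiers:
    clf.fit(X_train, y_train)
    results[type(clf).__name__] = {
        "train_accuracy": accuracy_score(y_train, clf.predict(X_train)),
        "test_accuracy": accuracy_score(y_test, clf.predict(X_test)),
        "train_log_loss": log_loss(y_train, clf.predict_proba(X_train), labels=[0, 1]),
        "test_log_loss": log_loss(y_test, clf.predict_proba(X_test), labels=[0, 1]),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Comparing training and testing metrics side by side helps spot overfitting, which matters when the training set is small.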
The script uses seaborn and matplotlib to create bar plots that visualize the performance of each classifier. Two plots are generated: one showing the test accuracy of each classifier and another showing the test log loss.
The script outputs the training and testing accuracy, as well as the training and testing log loss, for each classifier. It also generates visualizations of classifier performance in terms of accuracy and log loss.
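The two bar plots could be produced roughly as follows. This sketch uses matplotlib alone rather than seaborn to keep dependencies minimal, and the model names and metric values are placeholders, not results from this project.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Placeholder metrics (not project results), keyed by a hypothetical model name.
scores = {
    "ModelA": {"test_accuracy": 0.5, "test_log_loss": 0.9},
    "ModelB": {"test_accuracy": 0.6, "test_log_loss": 0.8},
    "ModelC": {"test_accuracy": 0.7, "test_log_loss": 0.7},
}
names = list(scores)

fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: test accuracy per classifier.
ax_acc.bar(names, [scores[n]["test_accuracy"] for n in names])
ax_acc.set_title("Test accuracy")
ax_acc.set_ylim(0, 1)

# Right panel: test log loss per classifier.
ax_loss.bar(names, [scores[n]["test_log_loss"] for n in names])
ax_loss.set_title("Test log loss")

fig.tight_layout()

# Save to an in-memory buffer; pass a file path instead to write a PNG to disk.
png_buffer = io.BytesIO()
fig.savefig(png_buffer, format="png")
plt.close(fig)
```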
- According to the regression results, the number of vehicles does not appear to have a linear relationship with the median property value. Likewise, neither the vehicle series nor the vehicle types show a simple linear relationship with the median property value.
- The classifier results (highest test-set accuracy of 71%) suggest that there is some correspondence between the independent variables (vehicle series and vehicle types in the region) and the dependent variable (median property value in the region). This supports the feasibility of using these independent variables to estimate regional property conditions.
- Because of the limited time available in this summer study and the difficulty of obtaining pre-processed data (transforming the data takes a lot of time), the final training set is small, which is likely the key factor limiting model performance. If the amount of training data is gradually increased in subsequent tests, the relationship between the independent and dependent variables should become clearer and test accuracy should improve.
- Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States
- Combining satellite imagery and machine learning to predict poverty
- Deep hybrid models with urban imagery
- Learning representations of satellite imagery by leveraging point of interests
Rongfei Zheng: Data collection, information extraction, and modeling
Jingcheng Wang: Model modification and training
Junxi Wu: Model modification and training