This project is a collaborative work by Jingwen Ni, Haohan Shi, Lu Zhang, and Zhe Zhang. This README will first present a brief Introduction to the project and state its significance. We will then focus on the Data, Method: Machine Learning Model, and Method: Topic Modeling sections and present our Discussion and Limitation.
In terms of large-scale data techniques, we used AWS Lambda and Step Functions to scrape Amazon data, Dask to conduct EDA, and Spark to perform feature engineering (including sentiment analysis), machine learning, and topic modeling.
Covid-19, the disease caused by a novel coronavirus, has gripped China since the beginning of 2020 (Qiu et al., 2020). By the end of November 2020, there were more than 63 million reported cases and 1.4 million deaths worldwide (Brodeur et al., 2021). The resulting financial crash was more severe than the 2008 crisis (Castillo & Melin, 2020). According to the European Commission's Spring 2020 Economic Forecast, total national output was projected to contract sharply: by 7.5% for the EU, 4% for Poland, and 9% for Italy, France, and Spain (Lai et al., 2021). Covid-19 spread rapidly worldwide and brought a crisis that no industry could ignore, including the pet industry.
Because Covid-19 is a respiratory infectious disease, countries introduced social distancing and quarantine policies to reduce infections. An online survey of 2,007 respondents in November 2020 showed that approximately 70% of Americans across all generations spent more time with their pets as a result of social distancing regulations. A previous study showed that the Relative Search Volume (RSV) for both dog and cat adoption in 2020 increased by up to 250% compared with the RSV in the same period in 2019 (Ho et al., 2021). U.S. News also reported an increase in pet adoption due to the pandemic. Two of our group members fostered pets for the first time during Covid (Bento on the right and Fortune on the left) and received huge emotional support from them.
We thus want to explore COVID-19's impact on the pet industry.
We think this study is critical to understanding how consumer behavior is influenced by major events such as Covid-19. The approach can also be applied to other industries. By applying large-scale computing, we minimized the computational time needed for each industry, so other researchers can use our code to analyze a whole market in a reasonable amount of time.
Lambda function (amazon_scraper.zip) and Step Functions setup (amazon_scraper_sfn_setup.py)
The resulting data is too big to be uploaded to GitHub. The Boto3 notebook is in scraper. This is the Google Drive Link to the data.
We used Beautiful Soup to write a scraper for Amazon products in the pet section. The scraper automatically goes through all products on each page and stores each product's information as a JSON file in AWS S3. To speed up scraping, we parallelized the process using AWS Lambda and Step Functions.
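The fan-out can be pictured as splitting the result pages into batches, with one Lambda invocation per batch. The sketch below only builds such an execution input. It assumes the state machine iterates over a list of page batches (for example with a Map state); the function name `build_batches`, the `pages` key, and the page counts are hypothetical and not taken from `amazon_scraper_sfn_setup.py`.

```python
import json
from typing import Dict, List


def build_batches(first_page: int, last_page: int, batch_size: int) -> List[Dict[str, List[int]]]:
    """Split an inclusive page range into batches, one per Lambda invocation."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = []
    for start in range(first_page, last_page + 1, batch_size):
        end = min(start + batch_size - 1, last_page)
        batches.append({"pages": list(range(start, end + 1))})
    return batches


# Hypothetical input: 10 result pages, at most 4 pages per worker.
execution_input = json.dumps(build_batches(1, 10, 4))
print(execution_input)
```

Smaller batches mean more concurrent Lambda invocations and shorter run time per invocation, at the cost of more requests hitting the site at once.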
For each product, we scraped information about:
- product id (String)
- product title (String)
- overall star rating (Double)
- number of ratings (Integer)
- price (Double)
- review text (String)
- time of the review (Date)
- location of the review (String)
- star rating of this review (Integer)
- number of helpful votes (Integer)
Each instance/observation in the data is a unique review of a pet-related Amazon product. In total, we have more than 330,000 reviews for 1,161 distinct products.
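As a rough illustration of the record layout listed above, the sketch below casts one scraped record to the listed types. The field names, the string formats (a `$` prefix on the price, thousands separators, ISO dates), and the example values are all assumptions made for illustration; the actual JSON keys written by the scraper may differ.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Dict


@dataclass
class Review:
    product_id: str
    product_title: str
    overall_rating: float  # overall star rating (Double)
    n_ratings: int         # number of ratings (Integer)
    price: float
    review_text: str
    review_date: date
    review_location: str
    review_stars: int      # star rating of this review (Integer)
    helpful_votes: int


def parse_review(raw: Dict[str, str]) -> Review:
    """Cast one scraped record (all values strings) to the listed types."""
    return Review(
        product_id=raw["product_id"],
        product_title=raw["product_title"],
        overall_rating=float(raw["overall_rating"]),
        n_ratings=int(raw["n_ratings"].replace(",", "")),
        price=float(raw["price"].lstrip("$")),
        review_text=raw["review_text"],
        review_date=datetime.strptime(raw["review_date"], "%Y-%m-%d").date(),
        review_location=raw["review_location"],
        review_stars=int(raw["review_stars"]),
        helpful_votes=int(raw["helpful_votes"]),
    )


# Made-up record for illustration only.
example = parse_review({
    "product_id": "B000000000",
    "product_title": "Example dog leash",
    "overall_rating": "4.5",
    "n_ratings": "1,234",
    "price": "$12.99",
    "review_text": "Sturdy and comfortable.",
    "review_date": "2021-05-01",
    "review_location": "the United States",
    "review_stars": "5",
    "helpful_votes": "3",
})
print(example)
```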
We obtained the aggregated Covid case data from the New York Times. Since Amazon only publishes review locations at the country level, we can only use Covid data at that level. Features of this data include:
- date
- aggregated number of Covid cases
- aggregated number of Covid death cases
Each instance/observation in the data is a unique day since 2020-01-21.
In this file, we first cleaned the `Amazon Data` and cast each feature to its corresponding type.
We then created the increased number of Covid cases per day and the increased number of Covid deaths per day for the `Covid Data`. We also cast each feature to its corresponding type.
We then merged the `Covid Data` with the `Amazon Data`. Reviews written before Covid have the `Covid Data` related features (`cases`, `death`, `increased cases`, and `increased death`) set to 0.
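The two steps above (daily increases from cumulative counts, then a left join with zeros for unmatched dates) can be sketched in plain Python as follows. This is only an illustration of the logic, not the Spark code used in the project; the function names, dictionary keys, and numbers are made up.

```python
from datetime import date

COVID_FEATURES = ("cases", "deaths", "increased_cases", "increased_deaths")


def add_daily_increase(covid_rows):
    """Index cumulative rows by date and add day-over-day increases."""
    by_date = {}
    prev_cases, prev_deaths = 0, 0
    for row in sorted(covid_rows, key=lambda r: r["date"]):
        by_date[row["date"]] = {
            "cases": row["cases"],
            "deaths": row["deaths"],
            "increased_cases": row["cases"] - prev_cases,
            "increased_deaths": row["deaths"] - prev_deaths,
        }
        prev_cases, prev_deaths = row["cases"], row["deaths"]
    return by_date


def merge_reviews(reviews, covid_by_date):
    """Left-join reviews with Covid features; dates without Covid data get zeros.

    Note: this also assigns zeros to reviews dated after the last Covid row,
    so the Covid series should cover the full review period.
    """
    zeros = dict.fromkeys(COVID_FEATURES, 0)
    return [{**review, **covid_by_date.get(review["date"], zeros)} for review in reviews]


# Made-up numbers for illustration only.
covid = [
    {"date": date(2020, 1, 21), "cases": 1, "deaths": 0},
    {"date": date(2020, 1, 22), "cases": 1, "deaths": 0},
    {"date": date(2020, 1, 23), "cases": 3, "deaths": 1},
]
reviews = [
    {"review_id": "a", "date": date(2019, 6, 1)},
    {"review_id": "b", "date": date(2020, 1, 23)},
]
merged = merge_reviews(reviews, add_daily_increase(covid))
print(merged)
```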
We visualized the variables in the datasets using Dask. Here are some visualizations that we think are worth noting (other visualizations were produced but are not presented here; all visualizations are stored here):
The graph shows the number of comments on the website over time, which increased gradually from 2004 until it peaked in 2021. Only 5 months of 2022 have passed, and the number of comments has already almost reached the 2020 level. This can be caused either by an increase in the number of new products or by an increase in the number of customers who are willing to leave comments under a product. Either explanation indicates prosperity in the pet industry.
The chart shows how the average product price changed across years. Although the average price had been declining since 2014, it increased slightly with the pandemic. One possible interpretation is that the pandemic increased people's demand for pet products.
According to our modeling of price against review stars, the higher the price, the worse the product's rating. However, this may also reflect the lack of samples for higher-priced products.
Most pet-related products are not expensive, and the distribution is as expected. However, the distribution is right-skewed, suggesting that we need to normalize the data before using it.
To test our hypothesis that reviews of pet-related products are significantly different after major events such as Covid-19, we decided to train machine learning models.
Originally, we wanted to set the target variable as the diversity of the reviews per day. We first classified the text with a sentiment analysis model pre-trained on IMDB reviews and a BERT-based emotion recognition model. We tested several Spark NLP classification models on sampled reviews, and these two models yielded the most accurate results. Each text is assigned a sentiment (`positive`, `neutral`, or `negative`) and an emotion (`sadness`, `love`, `joy`, `anger`, `fear`, or `surprise`). We assembled the two classification results into one vector. We then calculated the mean of the pair-wise similarity scores between the assembled vectors for each day. The resulting score represents the diversity of reviews for that particular day.
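A minimal sketch of this score for a single day is shown below. It assumes a one-hot encoding of the sentiment and emotion labels and cosine similarity as the similarity measure; the source does not state which encoding or measure was intended, and the function names are hypothetical.

```python
import math
from itertools import combinations

SENTIMENTS = ["positive", "neutral", "negative"]
EMOTIONS = ["sadness", "love", "joy", "anger", "fear", "surprise"]


def encode(sentiment, emotion):
    """Concatenate one-hot sentiment and one-hot emotion into one vector."""
    return [float(s == sentiment) for s in SENTIMENTS] + [float(e == emotion) for e in EMOTIONS]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs of reviews in one day."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two reviews to form a pair")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Three made-up reviews from the same day.
day = [encode("positive", "joy"), encode("positive", "love"), encode("negative", "anger")]
score = mean_pairwise_similarity(day)
print(score)
```

A lower mean similarity corresponds to a more diverse set of reviews for that day. The number of pairs grows quadratically with the number of reviews, which is what makes this step expensive on large days.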
However, we faced the following problems:
- The sentiment analysis requires a large amount of memory. We opened a new personal account and selected instances with large memory to perform the sentiment analysis, but it was impossible for us either to write the computed data out to S3 or to perform further steps on that set of instances. We eventually had to use m5 instances. When we increased the number of m5 cores, the machine learning models could be computed quickly, but the sentiment analysis took forever. We eventually had to perform the most basic sentiment analysis on 2 cores and train the machine learning models on 10 cores.
- Although we are able to convert categorical sentiment classes to numerical features in the Pipeline without explicitly writing it out using `withColumn`, it takes forever for PySpark to compute this step. We thus needed to take it out and create separate steps.
- When iterating through the data and calculating the pair-wise similarity scores, calling `collect()` on a large dataset results in the `OutOfMemory` error. This problem is also noted by many others here. We already used a personal account and instances with large memory, but the problem remained.
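One possible way around the `collect()` bottleneck, offered here only as an untested idea and not something the project implemented, applies if the similarity is cosine similarity. For unit vectors u_1, ..., u_n, the sum of all pairwise dot products equals (‖u_1 + ... + u_n‖² − n) / 2, so the daily mean can be obtained from a running vector sum and a count, both of which are ordinary aggregations. The sketch below shows the arithmetic in plain Python; the function name is hypothetical.

```python
import math


def mean_pairwise_cosine_streaming(vectors):
    """Mean pairwise cosine similarity using only a running sum and a count.

    Relies on: sum over pairs i<j of u_i . u_j == (||sum of u_i||^2 - n) / 2
    for unit vectors u_i, so no pair is ever materialized.
    """
    total = None
    n = 0
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        unit = [x / norm for x in v]
        total = unit if total is None else [t + u for t, u in zip(total, unit)]
        n += 1
    if n < 2:
        raise ValueError("need at least two vectors to form a pair")
    squared_norm = sum(t * t for t in total)
    return (squared_norm - n) / (n * (n - 1))


# Made-up vectors; the pairwise cosines are 0, 1/sqrt(2) and 1/sqrt(2).
result = mean_pairwise_cosine_streaming([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(result)
```

In a distributed setting, the per-day vector sum and count could in principle be computed with a grouped aggregation, keeping memory use independent of the number of reviews per day.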
Due to all the constraints we faced, we eventually had to give up our original idea. The implementations of the pre-trained sentiment analysis model and the bert-based emotion recognition are at the end of `machine_learning.ipynb`. When we have more funding, we can easily implement those models.
We thus decided to examine whether Covid has any influence on the sentiment of product reviews. The sentiment is estimated using a model developed by Vivek Narayanan, which requires less memory but is also less accurate.
We mainly focus on the COVID-related variables, so we acquired COVID-related deaths and cases from https://github.com/nytimes/covid-19-data/blob/master/us.csv. We have also constructed 2 new features: `increased deaths` and `increased cases`.
We have also included 2 variables that are related to the reviews and products, `review length` and `product price`, which serve as control variables in our logistic regression model.
We did not use tf-idf because the weight of the other features would then be small, and tf-idf can be highly correlated with the emotion labels, which would diverge from the goal of this project.
Here is an example of our dataframe:
We have also normalized the feature matrix so that the different scales of the features do not distort feature importance.
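To illustrate why scaling matters for comparing coefficients, here is a small sketch of column-wise z-score standardization. It assumes standardization (zero mean, unit variance) is what "normalized" refers to; the source does not say which scaler was used, and the numbers are made up.

```python
import statistics


def standardize_columns(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Made-up rows: (price, cumulative cases). The raw scales differ by orders of magnitude.
features = [[10.0, 100000.0], [20.0, 300000.0], [30.0, 500000.0]]
scaled = standardize_columns(features)
print(scaled)
```

After scaling, both columns take comparable values, so a coefficient's magnitude is no longer driven by the unit in which the feature was recorded.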
Before balancing the data, the counts of each label are shown below:
| sentiment_code | count |
|---|---|
| 1 | 11753 |
| 0 | 14654 |
After balancing the data, the counts of each label are shown below:
| sentiment_code | count |
|---|---|
| 1 | 11753 |
| 0 | 12249 |
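For reference, a common way to balance labels is to randomly downsample the majority class, sketched below. This version makes the classes exactly equal in size, whereas the counts above are close but not equal, so the procedure used in the project likely differs in detail (it is not described in this README). The function name and toy data are hypothetical.

```python
import random
from collections import defaultdict


def downsample_to_minority(rows, label_key, seed=0):
    """Randomly drop rows from larger classes until every class has the minority size."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(group if len(group) == target else rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy data: 30 negative (0) and 20 positive (1) rows.
rows = [{"sentiment_code": 0, "id": i} for i in range(30)]
rows += [{"sentiment_code": 1, "id": i} for i in range(30, 50)]
balanced = downsample_to_minority(rows, "sentiment_code")
print(len(balanced))
```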
In this task, we mainly want to see the importance of the features, so our main model is the logistic regression model.
We are also interested in how accurate this prediction can be, so we employed 4 other models to test this feature-label combination, namely random forest, Gaussian naive Bayes, gradient-boosted trees, and linear SVM.
Here we only closely evaluate the logistic regression model.
The AUC of the model is shown below:
We can see that the performance of the model is poor. In addition, the coefficients of the features are shown below:
| price | length | cases | deaths | increased cases | increased deaths |
|---|---|---|---|---|---|
| 1.0592 | -0.1002 | 0.0259 | -2.287 | 1.0431 | -31.2387 |
We can see that `increased deaths` has a strongly negative coefficient. Given that we encoded positive sentiment as 1 and negative sentiment as 0, this means that the larger the daily increase in deaths, the more the sentiment of the reviews tends to be negative.
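One way to read logistic regression coefficients is through odds ratios: a one-unit increase in a feature multiplies the odds of the positive label by exp(coefficient). The sketch below applies this to the coefficients reported above. Because the feature matrix was normalized, "one unit" refers to the normalized scale rather than to raw counts or dollars, and given the model's low AUC these values should be read as indicative only.

```python
import math

# Coefficients copied from the table above.
coefficients = {
    "price": 1.0592,
    "length": -0.1002,
    "cases": 0.0259,
    "deaths": -2.287,
    "increased cases": 1.0431,
    "increased deaths": -31.2387,
}


def odds_ratio(coefficient):
    """Multiplicative change in the odds of label 1 per one-unit feature increase."""
    return math.exp(coefficient)


for name, coef in coefficients.items():
    print(f"{name:>17}: coefficient {coef:+9.4f}, odds ratio {odds_ratio(coef):.3g}")
```

An odds ratio above 1 pushes predictions toward positive sentiment, and a ratio below 1 pushes them toward negative sentiment.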
We can see that all of the models perform equally badly on the test set. Random forest has comparatively high accuracy on the training set. This is expected because random forest is a very complex model, making it very easy to overfit on the training set.
Based on this comparison, we conclude that the sentiment of the reviews cannot be reliably predicted from our features, although the logistic regression coefficients suggest that increased deaths has a negative effect on the reviews' sentiment.
We then performed LDA Topic Modeling using PySpark to see if topics changed in the pet industry before and after Covid. We pre-processed the data using `documentAssembler`, `tokenizer`, `normalizer`, `lemmatizer`, `stopwords cleaner`, and `finisher`.
To determine which words are important, we used a TF-IDF vectorizer implemented by first using the `CountVectorizer` and then the `IDF` estimator.
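The two-stage weighting can be illustrated in plain Python: raw term counts per document (the counting stage), multiplied by an inverse document frequency. The sketch uses the smoothed form idf = log((m + 1) / (df + 1)) documented for Spark's `IDF`, where m is the number of documents and df the number of documents containing the term. It is a toy illustration on made-up tokens, not the project's pipeline.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return per-document TF-IDF weights and the IDF value of every term."""
    counts = [Counter(doc) for doc in documents]          # term frequencies
    m = len(documents)
    doc_freq = Counter(term for c in counts for term in c)  # documents per term
    idf = {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}
    weights = [{term: tf * idf[term] for term, tf in c.items()} for c in counts]
    return weights, idf


# Made-up tokenized reviews.
docs = [["dog", "leash", "leash"], ["dog", "toy"], ["dog", "bed"]]
weights, idf = tf_idf(docs)
print(weights[0])
```

A term that appears in every document (here `dog`) receives weight 0, while a term concentrated in one document (here `leash`) is weighted up.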
Due to the computational time limit, we set the number of topics to 6 and the maximum iteration to 50.
After the topic models before and after Covid were trained, we used a UDF that converts word ids (the actual output for a topic by a topic model) into the words that describe the derived topics. We looked at the top 10 words for each model and their corresponding `termWeights`.
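The id-to-word conversion amounts to indexing into the vocabulary learned by the count vectorizer. The sketch below shows the lookup as a plain Python function; in PySpark the same function would typically be wrapped as a UDF and applied to the `termIndices` column. The vocabulary order, topic rows, and weights here are made up for illustration.

```python
def indices_to_words(term_indices, vocabulary):
    """Map word ids to the words at those positions in the vocabulary."""
    return [vocabulary[i] for i in term_indices]


def describe_topics(topics, vocabulary, top_n=10):
    """Pair each topic's top words with their weights.

    `topics` holds (topic_id, term_indices, term_weights) rows, with indices
    already sorted by descending weight.
    """
    described = []
    for topic_id, term_indices, term_weights in topics:
        words = indices_to_words(term_indices[:top_n], vocabulary)
        described.append((topic_id, list(zip(words, term_weights[:top_n]))))
    return described


# Hypothetical vocabulary and a single made-up topic row.
vocabulary = ["dog", "seat", "leash", "car", "harness"]
topics = [(0, [1, 2, 4, 3], [0.04, 0.03, 0.02, 0.01])]
for topic_id, terms in describe_topics(topics, vocabulary, top_n=3):
    print(topic_id, terms)
```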
The results are visualized through Dask.
Most topics are similar before and after Covid. However, the 5th topic before Covid contains the top 10 words `seat`, `leash`, `diaper`, `car`, `harness`, `cover`, `u`, `velcro`, `strap`, and `belt`. This topic and these words do not appear in the topics after Covid. These are all outdoor-related words, suggesting that reviews of outdoor activities and related products were largely influenced by Covid.
We collected Amazon data and Covid data. We visualized the data and then performed machine learning and topic modeling to investigate Covid's influence on the pet industry. Our visualizations suggest that the pet industry has been flourishing in recent years and is likely to continue this upward trend. The machine learning results suggest that Covid does not significantly influence people's sentiment towards pet-related products. However, LDA topic modeling suggests that outdoor-related pet products lost prominence in pet-related review discourse after Covid.
As presented in the machine learning model section, we faced constraints with AWS. We implemented a better Pipeline and alternative ideas on a small scale through a personal account. We believe that with more resources in the future, we can improve the accuracy of our model and expand this project easily. This project can also be directly used to understand consumer behavior in other industries.
Brodeur, A., Clark, A. E., Fleche, S., & Powdthavee, N. (2021). COVID-19, lockdowns and well-being: Evidence from Google Trends. Journal of Public Economics, 193, 104346. https://doi.org/10.1016/j.jpubeco.2020.104346
Castillo, O., & Melin, P. (2020). Forecasting of COVID-19 time series for countries in the world based on a hybrid approach combining the fractal dimension and fuzzy logic. Chaos, Solitons & Fractals, 140, 110242. https://doi.org/10.1016/j.chaos.2020.110242
Ho, J., Hussain, S., & Sparagano, O. (2021). Did the COVID-19 Pandemic Spark a Public Interest in Pet Adoption? Frontiers in Veterinary Science, 8. https://www.frontiersin.org/article/10.3389/fvets.2021.647308
Lai, H., Khan, Y. A., Thaljaoui, A., Chammam, W., & Abbas, S. Z. (2021). COVID-19 pandemic and unemployment rate: A hybrid unemployment rate prediction approach for developed and developing countries of Asia. Soft Computing. https://doi.org/10.1007/s00500-021-05871-6
Qiu, Y., Chen, X., & Shi, W. (2020). Impacts of social and economic factors on the transmission of coronavirus disease 2019 (COVID-19) in China. Journal of Population Economics, 33(4), 1127–1172. https://doi.org/10.1007/s00148-020-00778-2
Haohan Shi: Data Collection, Machine Learning Models
Lu Zhang: LDA Topic Modeling, Data Cleaning, README
Jingwen Ni: Data Collection and Cleaning, Presentation
Zhe Zhang: Dask Visualizations, Presentation
We also thank Bento and Fortune for their inspiration and emotional support.