This repository contains sample code for a big data project in R. The research studies the impact of social networks on COVID-19 vaccine hesitancy. We use a sample of 10 million individual-level Twitter observations and: I) apply machine learning techniques to predict the gender, race, and age of each Twitter user; II) use econometric analysis (logistic regression, IV regression) for causal inference on the true impact.
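As a rough illustration of the two econometric tools named above, here is a minimal sketch in base R on simulated data. It is not the project's actual specification: the variable names (`exposure`, `z`, `hesitant`, `score`) and all coefficients are assumptions made up for this example. The IV part is done by hand as two-stage least squares so that the block needs no extra packages.

```r
# Simulated data only; every variable name and coefficient here is hypothetical.
set.seed(42)
n <- 5000
z <- rnorm(n)                               # instrument: moves exposure, not the outcome directly
u <- rnorm(n)                               # unobserved confounder
exposure <- 0.8 * z + 0.5 * u + rnorm(n)    # endogenous regressor

# Binary outcome for the logistic regression (true slope 0.6)
hesitant <- rbinom(n, 1, plogis(-0.5 + 0.6 * exposure))

# Continuous outcome for the IV example (true slope 0.5, confounded by u)
score <- 1 + 0.5 * exposure + u + rnorm(n)

# 1) Logistic regression of the binary outcome on exposure
logit_fit <- glm(hesitant ~ exposure, family = binomial())
logit_slope <- coef(logit_fit)[["exposure"]]

# 2) Naive OLS, biased here because u drives both exposure and score
ols_slope <- coef(lm(score ~ exposure))[["exposure"]]

# 3) Manual two-stage least squares
first_stage <- lm(exposure ~ z)             # stage 1: project exposure on the instrument
exposure_hat <- fitted(first_stage)
second_stage <- lm(score ~ exposure_hat)    # stage 2: regress outcome on fitted exposure
iv_slope <- coef(second_stage)[["exposure_hat"]]

print(c(logit = logit_slope, ols = ols_slope, iv = iv_slope))
```

The standard errors reported by the second-stage `lm` are not valid 2SLS standard errors, so this sketch is only suitable for point estimates. A dedicated routine such as `ivreg` from the AER package is the usual choice for inference.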
All of the work was originally run on a private R server. For privacy reasons, only the code is uploaded here, without the raw data. Sorry if this causes confusion.
The code is divided into two parts: data cleaning and data analysis. The data cleaning part mostly turns the massive raw datasets into the data structures needed for analysis. The data analysis part runs regression analyses under different specifications.
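Since the raw data is not included, the following is a minimal sketch of what a cleaning step of this kind could look like, using a tiny made-up table. The column names (`user_id`, `text`) and the keyword flag are assumptions for illustration, not the project's actual schema. It turns tweet-level rows into one row per user.

```r
# Made-up tweet-level records; column names are hypothetical.
raw <- data.frame(
  user_id = c("u1", "u1", "u2", "u3", "u3", "u3", NA),
  text = c("Got my vaccine today", "nice weather",
           "not sure about the vaccine", "VACCINE appointment booked",
           "lunch", "vaccine side effects?", "row with no user"),
  stringsAsFactors = FALSE
)

# Drop records that cannot be tied to a user
clean <- raw[!is.na(raw$user_id), ]

# Flag tweets that mention the keyword, ignoring case
clean$mentions_vaccine <- grepl("vaccine", clean$text, ignore.case = TRUE)

# Collapse to one row per user: tweet count and share of flagged tweets
n_tweets <- tapply(clean$mentions_vaccine, clean$user_id, length)
share <- tapply(clean$mentions_vaccine, clean$user_id, mean)
users <- data.frame(
  user_id = names(n_tweets),
  n_tweets = as.integer(n_tweets),
  vaccine_share = as.numeric(share),
  stringsAsFactors = FALSE
)

print(users)
```

At the scale of millions of rows, a package such as data.table would likely be more practical than base R for the grouping step. That is a suggestion rather than a description of what this project uses.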
A fuller introduction to the project and its results can be found in the slides: "Social Connection and Vaccine Hesitancy Slides"