This project is a time series analysis and forecast of the spread of COVID-19 and its financial impact. A video presentation is available at https://youtu.be/C3qMzDXyIaU
COVID-19 is a respiratory illness caused by a coronavirus believed to have originated in Wuhan, China, and was brought to the World Health Organization's attention on December 31, 2019. Person-to-person spread occurred at a rapid rate and has since been slowed somewhat by quarantine. The quarantine has had a dramatic impact on the financial markets, as many people and industries are unable to function in a work-from-home world riddled with travel bans. This project analyzes the virus's symptoms, doctor sentiment, who is at risk, geographic spread, total cases and their outcomes, and the financial impact on large tech companies.
The interactive plot below shows the spread of the virus over the past few months (last updated 4/25/2020).
Future deaths, recoveries and confirmed cases were modeled and forecast from current trends. Below is a graph of the forecasted confirmed cases. The initial exponential growth has subsided to linear growth.
Stock prices have taken a dive, as we see here with large tech companies, but are forecast to return to their previous levels. Some industries may never recover.
In conclusion, the main symptoms are fever, cough and sore throat. The initial exponential spread has slowed to linear growth, which is expected to continue in the short term. The U.S. and China have reported the most cases. Individual regions and provinces can be tracked to help determine when quarantine might be lifted and where additional medical support is most needed. Large tech companies have taken a big financial hit but are expected to make a decent recovery by the end of the year.
- Data Exploration and Cleaning
- Case Review Summary Sentiment Analysis
- Time Series Data Exploration and Forecasting
- Financial Data Analysis and Forecasting
- Conclusions
- Future Work
All the data files can be found in the "Data" folder.
The initial data cleaning and exploration can be found in "Covid-19 People & Symptom Analysis Practicum.ipynb" in the "Jupyter Notebooks" folder, with the corresponding Python files in the "Python files" folder.
Dataset: https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset
- This dataset contains time series data on the number of confirmed COVID-19 cases, deaths and recoveries. After some general cleaning, the "line list" data looks like this:
- To look for correlations among the numerical variables, I ran a pair plot:
We can see that the virus infects all age ranges, but almost all deaths are among older individuals:
- After further cleaning of the "line list" data, focusing on the patient summaries, we have:
- Next, I cleaned the summaries by removing punctuation and digits. I left pronouns as they were, lemmatized the remaining tokens, removed stop words and then joined the tokens back together. A word cloud of the results is below:
The word cloud reveals that the top words include confirm, covid, patient, new, symptom, male, onset, female, wuhan, fever and pneumonia. A bar chart shows this in more detail. I used matplotlib and seaborn to visualize the most frequent words after cleaning:
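The cleaning and counting steps described above can be sketched with the standard library alone. This is only an illustration, not the notebook's code: the stop-word list and the sample summaries are made up, and lemmatization is left out because it needs an NLP library (so "confirmed" is not reduced to "confirm" here).

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the notebook's list is not shown).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "with", "to", "from", "was", "is"}


def clean_summary(text):
    """Lowercase, replace punctuation and digits with spaces, and drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def top_words(summaries, n=5):
    """Count tokens across all cleaned summaries and return the n most common."""
    counts = Counter()
    for summary in summaries:
        counts.update(clean_summary(summary))
    return counts.most_common(n)


# Hypothetical summaries, not taken from the dataset.
sample = [
    "Male, 61, onset of fever and cough on 01/20/2020 in Wuhan.",
    "Female, 45, confirmed COVID-19; symptom onset with fever.",
]
print(top_words(sample, n=3))
```

The `(word, count)` pairs returned by `top_words` are the kind of input a bar chart of top words would take.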
I visualized the top symptoms:
The case review summary sentiment analysis work can be found in "Covid-19 People & Symptom Analysis Practicum.ipynb" in the "Jupyter Notebooks" folder, with the corresponding Python files in the "Python files" folder.
- In addition to the earlier text cleaning, bi-gram and tri-gram models were built and the text was lemmatized. TextBlob was used to score the sentiment of the summaries, and the results are plotted below:
The sentiment of the summaries is actually slightly positive overall.
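As a rough illustration of the bi-gram and tri-gram step, the sketch below builds plain sliding-window n-grams from a token list. It is an assumption about what such a step could look like: the notebook may instead use a phrase-detection model that only joins frequently co-occurring words, and the `ngrams` helper and sample tokens are hypothetical.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical cleaned tokens from one patient summary.
tokens = ["patient", "develop", "fever", "and", "cough"]
print(ngrams(tokens, 2))  # bi-grams
print(ngrams(tokens, 3))  # tri-grams
```

For the sentiment score itself, TextBlob reports a polarity between -1 (negative) and 1 (positive), so "slightly positive" corresponds to an average a little above zero.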
The time series data and forecasting work can be found in "Covid-19 Time Series and Prediction Practicum.ipynb" in the "Jupyter Notebooks" folder, with the corresponding Python files in the "Python files" folder.
- This dataset contains time series data on the number of confirmed COVID-19 cases, deaths and recoveries. After some general cleaning of the "covid_19_data" file, the output looks like this:
A good deal of further cleaning also went into the "time_series_covid_19_confirmed", "..deaths" and "..recovered" data files, which hold the critical time series data tracking the outbreak over time.
- A plot of the reported cases by country is below:
This plot shows that China, the United States, Australia, Canada and France have the most reported cases as of the last iteration.
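A per-country total like the one plotted above can be obtained by summing the province-level rows for each country. The sketch below assumes rows shaped like the wide time series files, with "Province/State" and "Country/Region" columns and one column per date; those column names, the `cases_by_country` helper and all of the numbers are illustrative assumptions rather than values from the dataset.

```python
from collections import defaultdict


def cases_by_country(rows, date_col):
    """Sum one date column over all provinces of each country, largest first."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["Country/Region"]] += int(row[date_col])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rows with made-up counts.
rows = [
    {"Province/State": "Hubei", "Country/Region": "China", "3/1/20": "100"},
    {"Province/State": "Guangdong", "Country/Region": "China", "3/1/20": "50"},
    {"Province/State": "New South Wales", "Country/Region": "Australia", "3/1/20": "20"},
    {"Province/State": "Victoria", "Country/Region": "Australia", "3/1/20": "10"},
]
for country, total in cases_by_country(rows, "3/1/20"):
    print(country, total)
```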
- I created an interactive Plotly stacked bar chart that shows the total reported cases over time and the number of patients who have recovered or died.
(insert method of linking to interactive plot outside of the jupyter notebook?) This plot shows that the number of cases is rising sharply, to over 350k, but that many patients recover.
- An interactive Plotly geo scatter plot depicts the top reported cases overlaid on their countries and sized by the total number of cases:
This plot shows that China has the most reported cases at 81k as of the last iteration.
- A similar plot built in Tableau:
- An interactive Plotly geo scatter plot, again sized by the total number of cases, that also shows the outbreak spreading over time:
- Forecasts of the virus's near-term outlook were made using fbprophet. The confirmed cases forecast is below:
The deaths forecast:
The recovered cases forecast:
- The final step was to forecast the next five days of new cases, deaths and recoveries for each country, region, state and province. Again using fbprophet and a couple of loops, we can model each location separately and combine the results back into one forecast. A portion of the results can be seen below, showing the forecast for New South Wales in Australia.
The Mean Absolute Error for this prediction was 70.9.
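The per-location loop and the Mean Absolute Error check can be sketched as follows. To keep the example dependency-free, a least-squares straight line stands in for the Prophet model, so this shows the shape of the loop rather than the notebook's actual forecasts. The helper names and the cumulative counts are hypothetical, and the resulting errors are unrelated to the 70.9 reported above.

```python
def fit_line(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast(values, horizon=5):
    """Extend the fitted line `horizon` steps past the end of the series."""
    slope, intercept = fit_line(values)
    start = len(values)
    return [intercept + slope * (start + step) for step in range(horizon)]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical cumulative case counts per location (made-up numbers).
history = {
    "New South Wales": [10, 20, 30, 40, 50, 60],
    "Victoria": [5, 7, 9, 11, 13, 15],
}

holdout = 2  # keep the last two days back to score the forecast
results = {}
for location, series in history.items():
    train, test = series[:-holdout], series[-holdout:]
    predicted = forecast(train, horizon=holdout)
    results[location] = {
        "forecast": predicted,
        "mae": mean_absolute_error(test, predicted),
    }

for location, result in results.items():
    print(location, result)
```

Collecting every location's output in one dictionary mirrors the "combine them back into one forecast" step; swapping `forecast` for a real model would leave the loop unchanged.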
The financial data analysis and forecasting work can be found in "Yahoo Finance API Data Practicum.ipynb" in the "Jupyter Notebooks" folder, with the corresponding Python files in the "Python files" folder.
- This notebook accesses the Yahoo Finance API, which provides time series data on company stocks. The Google data for the last five years looks like this:
A plot of this five year data:
We can see an upward trend in Google stock over the past five years, followed by a significant dip over the last couple of months, likely due to COVID-19.
Google stock five-year returns:
Google's stock returns saw their largest negative spike of the past five years, down to -10%, during the COVID-19 outbreak, and they appear quite volatile.
- To examine whether this is a trend across large technology companies, stock information for other major companies is brought in from the Yahoo Finance API:
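For reference, a daily return is the fractional change in price from one trading day to the next. A minimal sketch, assuming a plain list of closing prices (the prices below are made up, and the notebook likely relies on a library routine such as pandas' `pct_change` rather than a hand-written helper):

```python
def daily_returns(prices):
    """Fractional change between each pair of consecutive closing prices."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


# Hypothetical closing prices: a 2% gain followed by a 10% drop.
closes = [100.0, 102.0, 91.8]
for value in daily_returns(closes):
    print(f"{value:+.1%}")
```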
A correlation plot shows similarity between the companies:
A scatter plot of five years of Google and Microsoft stock data shows slightly above-average returns, with more high-return days than low-return days.
- The same analysis, restricted to 2020. Google stock this year:
The stock has greatly declined since February.
Google returns this year:
We can see that most days since the end of February have had a negative return.
Comparing Google and Microsoft over the past six months only, we see many negative returns, some of them quite large, including one of -15%.
Correlations between the companies are even higher now, suggesting an across-the-board decline:
- Forecasting the major tech companies' financial performance:
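The correlation between two companies' returns is the Pearson coefficient of their daily return series: values near 1 mean the stocks tend to rise and fall together. A minimal sketch with made-up return series (the variable names are hypothetical and the numbers are not taken from the Yahoo Finance data):

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical daily returns for two stocks that move mostly together.
stock_a = [0.01, -0.02, 0.03, -0.05, 0.02]
stock_b = [0.02, -0.01, 0.02, -0.06, 0.01]
print(round(pearson(stock_a, stock_b), 3))
```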
Expected returns over the past five years:
GE and IBM have negative expected returns, while the other major tech companies are positive, with Microsoft having the highest expected return (possibly due to being awarded the JEDI contract).
Expected returns using 2020 data only:
We can see here that only Microsoft has a positive expected return and low risk, while each of the other major tech companies has an expected loss.
Finally, we have the tech companies' forecasted stock prices:
The stocks are expected to recover. I did not run another forecast on this year's data alone, as I don't believe it is enough to forecast on. I also suspect the result would not be trustworthy: it would likely suggest that no recovery is possible, whereas logically, if COVID-19 eventually allows business to resume as usual, the market should begin an upward trend again.
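The README does not define "expected return" or "risk". A common convention, assumed here, is to take the mean of the daily returns as the expected return and their standard deviation as the risk. The sketch below applies that convention to made-up return series; the ticker labels and the `risk_return` helper are hypothetical.

```python
from statistics import mean, stdev


def risk_return(returns_by_ticker):
    """Map each ticker to (mean daily return, standard deviation of daily returns)."""
    return {
        ticker: (mean(returns), stdev(returns))
        for ticker, returns in returns_by_ticker.items()
    }


# Hypothetical daily returns, not real market data.
sample_returns = {
    "TICKER_A": [0.01, 0.03, 0.02, 0.02],
    "TICKER_B": [-0.04, 0.05, -0.06, 0.01],
}
for ticker, (expected, risk) in risk_return(sample_returns).items():
    print(f"{ticker}: expected return {expected:+.3f}, risk {risk:.3f}")
```

Plotting expected return against risk for each company gives the kind of comparison described above, where a stock with a positive mean and a small spread stands out.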
The main symptoms are fever, cough and sore throat. The initial exponential spread has slowed to linear growth, which is expected to continue in the short term. The U.S. and China have reported the most cases. Individual regions and provinces can be tracked to help determine when quarantine might be lifted and where additional medical support is most needed. Large tech companies have taken a big financial hit but are expected to make a decent recovery by the end of the year.
- Additional visualizations and dashboards
- Financial data for other industries
- Other forecasting methods