DAND Project 3 - Investigate a Dataset
This project uses NumPy, Pandas, Matplotlib, and Jupyter Notebook to analyze a dataset and communicate the findings. The dataset is the TMDb movie data curated by Udacity.
This project was created and tested on Windows 7 64-bit using Python 3.6.4 32-bit, NumPy 1.13.3, Pandas 0.22.0, Matplotlib 2.1.1, Jupyter 1.0.0, IPykernel 4.8.2, IPython 6.2.1, Jupyter-client 5.2.2, Jupyter-core 4.4.0, IPywidgets 7.1.2, nbformat 4.4.0, traitlets 4.3.2, widgetsnbextension 3.1.4, notebook 5.4.0, Jupyter-console 5.2.0, and nbconvert 5.3.1.
- Install Python
- Note: Python 3.6 or later is required because of the language features used
- Install NumPy, Pandas, Matplotlib, and Jupyter Notebook
- Download the Udacity curated TMDb movie data
- Clone this repo
- From the repo's Proj3 directory, run:
jupyter notebook
- From Jupyter, open
Project 3 - Investigate TMDb.ipynb
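Before launching the notebook, it can help to confirm that the environment meets the requirements listed above. The following is a minimal sketch, not part of the project itself: the `check_environment` helper is hypothetical, and the import names in `REQUIRED` (in particular `notebook` for Jupyter Notebook) are assumptions about how the packages are installed.

```python
import sys
from importlib import import_module

# Minimum Python version stated in the install steps.
MIN_PYTHON = (3, 6)

# Assumed import names for the packages the project requires.
REQUIRED = ["numpy", "pandas", "matplotlib", "notebook"]


def check_environment(packages=REQUIRED, min_python=MIN_PYTHON):
    """Return {package: version string, or None if the import fails}."""
    if sys.version_info < min_python:
        raise RuntimeError(
            "Python %d.%d or later is required" % min_python
        )
    found = {}
    for name in packages:
        try:
            module = import_module(name)
        except ImportError:
            # Package is not installed in this environment.
            found[name] = None
        else:
            # Not every module exposes __version__, so fall back to a label.
            found[name] = getattr(module, "__version__", "unknown")
    return found


if __name__ == "__main__":
    for name, version in check_environment().items():
        print(name, version if version is not None else "NOT INSTALLED")
```

Any package reported as `NOT INSTALLED` would need to be installed before running `jupyter notebook`.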
- Using the selected dataset (TMDb movie data), perform an analysis using descriptive statistics
- Chosen questions to explore:
- Overall statistics from movie titles in dataset:
- Most popular films?
- Highest budget films?
- Highest revenue films?
- Highest margin films?
- Most successful directors?
- Most popular genres?
- Popularity of genres over time?
- Most successful production companies?
- Most popular actors/actresses?
- What kinds of properties are associated with high-revenue movies? (revenue is the dependent variable)
- directors, genres, production companies, cast (features with multiple values - independent variables)
- runtime, budget, release month (features with single value - independent variables)
- Analysis will be performed and documented in a Jupyter Notebook
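As an illustration of how questions like "highest margin films" and "most popular genres" might be approached with Pandas, here is a minimal sketch. It is not the notebook's actual code. The rows are made-up placeholder data, the column names (`original_title`, `budget`, `revenue`, `genres`) and the `|` separator for multi-valued features are assumptions about the dataset layout, and margin is taken as revenue minus budget, which is only one possible definition. Note that `DataFrame.explode` requires Pandas 0.25 or later, which is newer than the 0.22.0 release this project was tested with.

```python
import pandas as pd

# Placeholder rows for illustration only; the real analysis would load
# the Udacity-curated TMDb CSV with pd.read_csv instead.
movies = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C"],
    "budget": [100, 50, 10],
    "revenue": [300, 40, 60],
    "genres": ["Action|Adventure", "Drama", "Action|Drama"],
})

# Single-valued features: derive a margin column and rank films by it.
# Margin is assumed here to mean revenue minus budget.
movies["margin"] = movies["revenue"] - movies["budget"]
top_margin = movies.nlargest(2, "margin")[["original_title", "margin"]]

# Multi-valued features: split the delimited string into a list, then
# expand to one row per (film, genre) pair so groupby can be applied.
by_genre = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")
mean_revenue_by_genre = (
    by_genre.groupby("genre")["revenue"].mean().sort_values(ascending=False)
)

print(top_margin)
print(mean_revenue_by_genre)
```

The same split-and-explode pattern would apply to the other multi-valued features (directors, production companies, cast), assuming they use the same delimiter.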
- Project 3 - Investigate TMDb.ipynb - Jupyter Notebook for project
- Project 3 - Investigate TMDb.html - Jupyter Notebook in Web (HTML) format