This repository contains an analysis of a movie dataset, covering ratings, revenue, production details, and related aspects of the movie industry. The analysis is performed in Python using data analysis libraries such as Pandas, Matplotlib, and Seaborn.
The dataset includes key features about movies such as:
- Title: The name of the movie
- Release Date: When the movie was released
- Revenue: Total revenue generated by the movie
- Runtime: Duration of the movie in minutes
- Genres: The genres the movie falls into
- Production Companies: The companies involved in producing the movie
- Languages Spoken: The languages featured in the movie
- Popularity: Popularity score
- Vote Count and Average: Number of votes and average rating of the movie
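
As a rough illustration of how these fields might look once loaded into Pandas, the sketch below builds a tiny made-up frame and coerces each column to a workable dtype. The column names (`title`, `release_date`, `revenue`, and so on), the pipe-separated genre format, and the sample rows are all assumptions for illustration; the actual file may use different names or encodings.

```python
import pandas as pd

# Made-up rows; column names are assumed, not taken from the actual dataset.
raw = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "release_date": ["2001-05-18", "2010-07-16", "not available"],
    "revenue": ["1000000", "250000000", None],
    "runtime": [95, 148, None],
    "genres": ["Comedy|Romance", "Action|Sci-Fi", "Drama"],
    "popularity": ["3.2", "27.9", "1.1"],
    "vote_count": [120, 9800, 15],
    "vote_average": [6.1, 8.3, 5.4],
})


def coerce_types(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with dates parsed and numeric columns converted."""
    out = df.copy()
    # Unparseable dates become NaT instead of raising.
    out["release_date"] = pd.to_datetime(out["release_date"], errors="coerce")
    for col in ["revenue", "runtime", "popularity", "vote_count", "vote_average"]:
        # Non-numeric entries become NaN.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Assumes genres are stored as a pipe-separated string.
    out["genres"] = out["genres"].str.split("|")
    return out


movies = coerce_types(raw)
print(movies.dtypes)
```

Coercing with `errors="coerce"` keeps malformed entries visible as missing values, so they can be handled explicitly during cleaning rather than silently breaking later steps.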
Key analyses performed in this notebook include:
- Data Cleaning: Handling missing values, data type conversions, and general preprocessing.
- Exploratory Data Analysis (EDA):
  - Distribution of movie release dates.
  - Revenue and runtime trends over time.
  - Popular genres and their correlation with revenue and ratings.
- Visualizations:
  - Box plots to show the distribution of revenue and runtime.
  - Scatter plots comparing various attributes such as revenue vs. runtime, and vote average vs. revenue.
  - Histograms and bar charts for categorical features like genres and production countries.
- Correlations:
  - Identifying correlations between different features such as revenue, vote count, and vote average.
  - Analysis of how runtime affects revenue and vote averages.
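
The cleaning, trend, genre, and correlation steps listed above could look roughly like the following. This is a minimal sketch on a made-up five-row sample, not the notebook's actual code: the column names, the choice to drop rows without revenue, and the median fill for runtime are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up sample; column names are assumed for illustration.
movies = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["1999-03-31", "1999-10-15", "2008-07-18", "2008-05-02", "2015-05-15"]),
    "revenue": [460e6, 100e6, 1000e6, np.nan, 375e6],
    "runtime": [136, 139, 152, 126, np.nan],
    "vote_count": [20000, 22000, 28000, 23000, 19000],
    "vote_average": [8.1, 8.4, 8.5, 7.6, 7.5],
    "genres": [["Action", "Sci-Fi"], ["Drama"], ["Action", "Drama"],
               ["Action", "Sci-Fi"], ["Action"]],
})

# Data cleaning: drop rows with no revenue, fill missing runtime with the median.
clean = movies.dropna(subset=["revenue"]).copy()
clean["runtime"] = clean["runtime"].fillna(clean["runtime"].median())

# Trend over time: mean revenue and runtime per release year.
clean["year"] = clean["release_date"].dt.year
yearly = clean.groupby("year")[["revenue", "runtime"]].mean()

# Genres: one row per (movie, genre), then mean revenue and rating per genre.
by_genre = (
    clean.explode("genres")
    .groupby("genres")[["revenue", "vote_average"]]
    .mean()
    .sort_values("revenue", ascending=False)
)

# Pairwise Pearson correlations between numeric features.
corr = clean[["revenue", "runtime", "vote_count", "vote_average"]].corr()

print(yearly, by_genre, corr.round(2), sep="\n\n")
```

The `yearly` and `by_genre` tables are the kind of aggregates that feed line and bar charts, and `corr` can be passed to a heatmap for the correlation analysis.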
As part of the analysis, a predictive model was built to forecast movie success. The goal was to predict movie revenue based on various features such as runtime, vote average, popularity, and production budget.
The model was built using a supervised learning approach, with the following steps:
- Feature Selection: Selected relevant features like `runtime`, `vote_average`, `popularity`, `budget`, and `genres`.
- Data Preprocessing:
  - Handled missing values.
  - Encoded categorical variables (such as `genres` and `production_companies`) using techniques like one-hot encoding.
  - Scaled numerical features where necessary.
- Model Training:
  - Used a Linear Regression model to predict the revenue.
  - Evaluated the model's performance using metrics like Mean Squared Error (MSE) and R-squared (R²).
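
The modelling steps above can be sketched end to end as follows. This is an illustration under stated assumptions rather than the notebook's actual code: the data is synthetic, the column names are assumed, each movie is given a single `genre` for simplicity, and ordinary least squares is fitted with `numpy.linalg.lstsq` instead of a dedicated modelling library, which the notebook may well have used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data; column names and the revenue formula are invented for illustration.
df = pd.DataFrame({
    "runtime": rng.normal(110, 15, n),
    "vote_average": rng.uniform(4, 9, n),
    "popularity": rng.exponential(10, n),
    "budget": rng.uniform(1e6, 2e8, n),
    "genre": rng.choice(["Action", "Comedy", "Drama"], n),
})
df["revenue"] = 2.5 * df["budget"] + 1e6 * df["popularity"] + rng.normal(0, 1e7, n)
df.loc[df.index % 25 == 0, "runtime"] = np.nan  # simulate missing values

numeric = ["runtime", "vote_average", "popularity", "budget"]
genre_levels = ["Comedy", "Drama"]  # "Action" acts as the baseline category

# Split before preprocessing so test rows do not influence the fitted statistics.
train, test = df.iloc[:160], df.iloc[160:]

# Imputation and scaling statistics come from the training rows only.
medians = train[numeric].median()
means = train[numeric].fillna(medians).mean()
stds = train[numeric].fillna(medians).std()


def design_matrix(frame: pd.DataFrame) -> np.ndarray:
    """Impute, scale, one-hot encode, and prepend an intercept column."""
    num = (frame[numeric].fillna(medians) - means) / stds
    # One-hot encode, keeping a fixed set of columns so train and test match.
    cat = pd.get_dummies(frame["genre"], dtype=float)
    cat = cat.reindex(columns=genre_levels, fill_value=0.0)
    return np.column_stack([np.ones(len(frame)), num.to_numpy(), cat.to_numpy()])


X_train, X_test = design_matrix(train), design_matrix(test)
y_train, y_test = train["revenue"].to_numpy(), test["revenue"].to_numpy()

# Ordinary least squares fit of revenue on the design matrix.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Evaluate on the held-out rows.
pred = X_test @ coef
mse = np.mean((y_test - pred) ** 2)
r2 = 1 - np.sum((y_test - pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)
print(f"MSE: {mse:.3e}  R^2: {r2:.3f}")
```

Because revenue is measured in dollars, MSE comes out in squared dollars and can look enormous even for a good fit; R² is usually the easier of the two metrics to read at a glance.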
The analysis relies on the following libraries:
- Pandas: For data manipulation and analysis.
- Matplotlib/Seaborn: For visualizing the data and generating plots.
- NumPy: For numerical operations.