Insights
- Data cleaning was a crucial first step: it involved correcting typos and handling missing and duplicated values.
- A temporal analysis showed that the dataset's earliest entry dates from 1931. Separately, the dataset includes a movie with a runtime of only 45 minutes.
- Mithun emerged as the most frequently featured lead actor.
- Both top-performing and poorly rated movies were identified based on votes and ratings.
- Directors with the highest and lowest movie counts were identified.
- The distribution of movies over the years exhibited a skewed pattern, with a concentration in the 2015-2019 period.
- Movies released in 2010 garnered the highest average votes.
- Short-duration movies tended to receive higher ratings and votes, suggesting a potential preference for concise films.
- Drama remained consistently popular, while the Comedy and Action genres first appeared in the dataset in 1953 and 1964, respectively.
- Ratings and votes displayed Gaussian-like patterns, with specific peaks and evolving trends over time.
- Random Forest regression outperformed Linear Regression, reaching an R-squared score of 0.79, meaning it explained roughly 79% of the variance in the target.
- The analysis provided a comprehensive understanding of the dataset and its trends, enabling informed decision-making in areas such as movie production and genre selection. Future endeavors could include the development of advanced machine learning models or a more in-depth exploration of specific genres or time periods to unveil additional insights.
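
To make the R-squared figure reported for the regression models easier to interpret, here is a minimal sketch of how the score is computed from true and predicted values. The function name `r_squared` and the sample ratings are hypothetical, invented for illustration; they are not taken from the dataset or from the original analysis code, which may well have used a library routine instead.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all true values are identical")
    return 1 - ss_res / ss_tot


# Hypothetical ratings, invented purely for illustration (not from the dataset).
actual_ratings = [5.0, 6.0, 7.0, 8.0]
predicted_ratings = [5.5, 5.5, 7.5, 7.5]

print(f"{r_squared(actual_ratings, predicted_ratings):.2f}")  # prints 0.80
```

A score of 1.0 means perfect predictions, while 0.0 means the model does no better than always predicting the mean rating. Under this reading, a score of 0.79 indicates that the model accounts for most, but not all, of the variation in the target.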