Skip to content

Messy data about Movies were extracted from Wikipedia, Kaggle, and local csv/json files, transformed to a well-structured dataset into SQL database for further queries.

Notifications You must be signed in to change notification settings

CelineWW/Movies_ETL

Repository files navigation

Movies-ETL

Overview of ETL process

  • This project is aiming to create a function that takes in the three files—Wikipedia data, Kaggle metadata, and the MovieLens rating data, then performs the ETL process to load the data into a PostgreSQL database.
  1. Write a function extract_transform_load(ETL) that reads in the three data files from appropriate paths.
  2. Convert three files to separate Pandas DataFrames.
  3. Clean up Wikipedia data and Kaggle metadata. Including dropping and combining columns, removing missing data, changing data type, and formatting messing data with Regex.
  4. Merge cleaned wiki_movies_df and kaggle_metadata to a new data frame movies__df.
  5. Transform and merge the ratings DataFrame into movies__df to get movies_with_ratings_df.
  6. Add movies_df DataFrame and MovieLens rating CSV data to a SQL database. Check movies table and ratings tabel to comfirm the import.

Results

  • Data frames

    1. wiki_movies_df data framewiki_movies_df
    2. movies_df data frame movies_df
    3. movies_with_ratings_df dataframe movies_with_ratings_df
  • Import into SQL Movie database

    1. ETL processing ratings_importing

    2. Movies table movies_query

    3. Ratings tabel ratings_query

Summary

  • Three file resouces with two different file types(csv and json files) were all imported into pandas library. File data were cleaned up with various method and converted into data frames, which were easier to read.
  • All useful data were imported into SQL database. This is a convenient way to hold the meta data.
  • To do more specific researches about on movie data, we can do neccessery queries from Movie database and export the results later.

About

Messy data about Movies were extracted from Wikipedia, Kaggle, and local csv/json files, transformed to a well-structured dataset into SQL database for further queries.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published