- This project is aiming to create a function that takes in the three files—Wikipedia data, Kaggle metadata, and the MovieLens rating data, then performs the ETL process to load the data into a PostgreSQL database.
- Write a function extract_transform_load(ETL) that reads in the three data files from appropriate paths.
- Convert three files to separate Pandas DataFrames.
- Clean up Wikipedia data and Kaggle metadata. Including dropping and combining columns, removing missing data, changing data type, and formatting messing data with Regex.
- Merge cleaned wiki_movies_df and kaggle_metadata to a new data frame movies__df.
- Transform and merge the ratings DataFrame into movies__df to get movies_with_ratings_df.
- Add movies_df DataFrame and MovieLens rating CSV data to a SQL database. Check movies table and ratings tabel to comfirm the import.
-
Data frames
-
Import into SQL Movie database
- Three file resouces with two different file types(csv and json files) were all imported into pandas library. File data were cleaned up with various method and converted into data frames, which were easier to read.
- All useful data were imported into SQL database. This is a convenient way to hold the meta data.
- To do more specific researches about on movie data, we can do neccessery queries from Movie database and export the results later.