- Inspired by https://www.kaggle.com/datasets/ruchi798/bookcrossing-dataset
- Note: ratings range from 0 to 10
- Add plausible timestamps to all three tables (books, ratings, users)
  - Book timestamps: e.g. based on `yearOfPublication` + x months
  - Ratings: 2020 - 2023
  - User signups: 2020 - 2023 (the earlier of a random date and the user's earliest rating)
- See `data_cleaning.ipynb` for cleaning steps
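
The timestamp rules listed above can be sketched roughly as follows. This is a minimal illustration, not the code from `data_cleaning.ipynb`: the field names (`isbn`, `user_id`, `timestamp`, `signup`), the sample rows, and the choice of a uniform random draw are all assumptions, and "x months" is taken to mean a random offset of 0 to 11 months.

```python
import random
from datetime import datetime, timedelta

# Assumed window for ratings and signups, taken from the notes (2020 - 2023).
WINDOW_START = datetime(2020, 1, 1)
WINDOW_END = datetime(2024, 1, 1)  # exclusive upper bound


def random_timestamp(rng, start=WINDOW_START, end=WINDOW_END):
    """Return a uniformly random timestamp in [start, end), at second resolution."""
    span_seconds = int((end - start).total_seconds())
    return start + timedelta(seconds=rng.randrange(span_seconds))


def book_timestamp(rng, year_of_publication):
    """Publication year plus a random offset of 0 to 11 months.

    Assumes the year has already been cleaned to a valid value (1 to 9999).
    """
    return datetime(year_of_publication, 1 + rng.randrange(12), 1)


def signup_timestamp(rng, rating_timestamps):
    """Earlier of a random date in the window and the user's first rating."""
    candidate = random_timestamp(rng)
    return min([candidate, *rating_timestamps])


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical sample rows; field names are illustrative only.
books = [
    {"isbn": "A", "yearOfPublication": 1999},
    {"isbn": "B", "yearOfPublication": 2004},
]
ratings = [
    {"user_id": 1, "isbn": "A", "rating": 7},
    {"user_id": 1, "isbn": "B", "rating": 0},
    {"user_id": 2, "isbn": "A", "rating": 10},
]

for book in books:
    book["timestamp"] = book_timestamp(rng, book["yearOfPublication"])
for rating in ratings:
    rating["timestamp"] = random_timestamp(rng)

# One signup per user, never later than that user's earliest rating.
users = []
for user_id in sorted({r["user_id"] for r in ratings}):
    own = [r["timestamp"] for r in ratings if r["user_id"] == user_id]
    users.append({"user_id": user_id, "signup": signup_timestamp(rng, own)})
```

Taking the minimum guarantees that no user appears to rate a book before signing up, which is presumably the reason for the rule.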
See files at s3://tecton-demo-data/apply-book-recsys/
Files are tracked with DVC too:
- If you have DVC, you can run `dvc pull` to get the data files

```
dvc init
dvc add books_data/
dvc remote add -d storage s3://tecton-demo-data/apply-book-recsys/
dvc push
```