Amazon Reviews Dataset
The original Amazon5core dataset, previously available at http://jmcauley.ucsd.edu/data/amazon/index_2014.html, is no longer available in its original form with the same indices for each example. For reproducibility, and so that the indices of the label errors match the dataset, we host the Amazon5core dataset here.
We made four modifications to the original Amazon5core dataset:
- Removed 2-star and 4-star reviews because of ambiguity with 1-star and 5-star reviews, respectively.
- Removed unhelpful reviews, i.e. we only kept reviews with more helpful votes than unhelpful votes.
- Removed reviews with zero helpful upvotes.
- Removed empty reviews.
The dataset has been formatted in fastText format, i.e. lines in the txt dataset file look like:
__label__5 I bought this for my husband who plays the piano.
__label__1 Both tutus were mailed in a flat plastic bag in a manila envelope.
__label__3 ...
The label number matches the number of stars (out of 5) associated with each review. Because 2-star and 4-star reviews were removed, only the labels 1, 3, and 5 appear.
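To get a feel for the format, the label distribution can be tallied directly from the text file. The following is a minimal sketch, assuming one review per line with the label as the first whitespace-delimited token; the function name count_labels is hypothetical and not part of this repository.

```shell
# Hypothetical helper: count how many reviews carry each label
# in a fastText-format file (label is the first token on each line).
count_labels() {
    awk '{ counts[$1]++ }
         END { for (label in counts) print label, counts[label] }' "$1" | sort
}
```

Once the dataset has been downloaded and extracted, it could be called as count_labels amazon5core.txt, printing one line per label followed by its count.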
Download the dataset files
Make sure pigz and wget are installed:
# on Mac OS
brew install wget pigz
# on Ubuntu
sudo apt-get install wget pigz
Download the pre-prepared Amazon5core reviews dataset files:
wget --continue https://github.com/cgnorthcutt/label-errors/releases/download/amazon-reviews-dataset/amazon5core.tar.gz-partaa
wget --continue https://github.com/cgnorthcutt/label-errors/releases/download/amazon-reviews-dataset/amazon5core.tar.gz-partab
To combine the tar.gz file parts into the pre-prepared amazon5core.txt dataset:
cat amazon5core.tar.gz-part?? | unpigz | tar -xvC .
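As an optional sanity check after extraction, one might confirm that every line of amazon5core.txt follows the expected format. The sketch below assumes each line starts with __label__1, __label__3, or __label__5, then a single space, then non-empty review text; the function name count_bad_lines is hypothetical.

```shell
# Hypothetical helper: print the number of lines that do NOT look like
# "__label__<1|3|5> <non-empty text>". A result of 0 means every line matched.
count_bad_lines() {
    awk '!/^__label__[135] ./ { bad++ }
         END { print bad + 0 }' "$1"
}
```

Running count_bad_lines amazon5core.txt should print 0 if the file matches these assumptions; a nonzero count may point to an incomplete download or to a formatting detail this sketch does not account for.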