Amazon Reviews Dataset

@cgnorthcutt cgnorthcutt released this 05 May 13:50
4260126

The original Amazon5core dataset, previously available at http://jmcauley.ucsd.edu/data/amazon/index_2014.html, is no longer available in its original form with the same indices for each example. For reproducibility, and so that the indices of the label errors match up with the dataset, we host the Amazon5core dataset here.

We made four modifications to the original Amazon5core dataset:

  • Removed 2-star and 4-star reviews because of ambiguity with 1-star and 5-star reviews, respectively.
  • Removed unhelpful reviews, i.e. we only kept reviews with more helpful votes than unhelpful votes.
  • Removed reviews with zero helpful upvotes.
  • Removed empty reviews.
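
As a rough illustration of the four filters above, the following sketch applies them with awk. The tab-separated input layout (stars, helpful votes, unhelpful votes, review text) and the function name `filter_reviews` are assumptions made for this example; they are not the format of the original data or the script that was actually used to build this dataset.

```shell
# Hypothetical input on stdin, one review per line, tab-separated:
#   stars <TAB> helpful_votes <TAB> unhelpful_votes <TAB> review text
# Output: surviving reviews, one per line, as "__label__<stars> <text>".
filter_reviews() {
  awk -F '\t' '
    $1 == 2 || $1 == 4     { next }  # drop 2-star and 4-star reviews
    $2 <= $3               { next }  # keep only more helpful than unhelpful votes
    $2 == 0                { next }  # drop zero helpful upvotes (already implied above)
    $4 ~ /^[[:space:]]*$/  { next }  # drop empty reviews
    { print "__label__" $1 " " $4 }
  '
}
```

With non-negative vote counts, the second filter already excludes reviews with zero helpful upvotes; the third check is kept only to mirror the list.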

The dataset has been formatted in fastText format, i.e., lines in the .txt dataset file look like:

__label__5 I bought this for my husband who plays the piano.
__label__1 Both tutus were mailed in a flat plastic bag in a manila envelope.
__label__3 ...

The label number matches the number of stars (out of 5) associated with each review. As a reminder, we removed 2-star and 4-star reviews because of ambiguity with 1-star and 5-star reviews, respectively.
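
Given this format, a quick sanity check on the extracted file might look like the sketch below. The function names `count_labels` and `count_unexpected_lines` are hypothetical helpers introduced for illustration; the check assumes that only the labels 1, 3, and 5 should remain, as follows from the removal of 2-star and 4-star reviews.

```shell
# Print how many lines carry each label, e.g. "__label__5 <count>".
# The label is the first whitespace-delimited token of a non-empty line.
count_labels() {
  awk 'NF { counts[$1]++ } END { for (label in counts) print label, counts[label] }' "$1" | sort
}

# Print the number of lines that do NOT look like "__label__<1|3|5> <text>".
count_unexpected_lines() {
  awk '!/^__label__[135] ./ { bad++ } END { print bad + 0 }' "$1"
}
```

For example, `count_labels amazon5core.txt` lists the label distribution, and `count_unexpected_lines amazon5core.txt` should print 0 if every line matches the expected pattern.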

Download the dataset files

Make sure pigz and wget are installed:

# on macOS
brew install wget pigz
# on Ubuntu
sudo apt-get install wget pigz

Download the pre-prepared Amazon5core reviews dataset files:

wget --continue https://github.com/cgnorthcutt/label-errors/releases/download/amazon-reviews-dataset/amazon5core.tar.gz-partaa
wget --continue https://github.com/cgnorthcutt/label-errors/releases/download/amazon-reviews-dataset/amazon5core.tar.gz-partab

To combine the tar.gz file parts into the pre-prepared amazon5core.txt dataset:

cat amazon5core.tar.gz-part?? | unpigz | tar -xvf - -C .
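
To see why concatenating the parts works, here is a small self-contained round trip on a throwaway file. It is only a sketch: the function name `demo_split_roundtrip` and the file names are made up, it uses gzip and gunzip in place of pigz so it runs without extra packages, and it assumes the parts were produced with something like `split`, whose default suffixes are aa, ab, and so on. The shell expands `part??` in sorted order, so `cat` reassembles the pieces in the right sequence.

```shell
# Build a tiny tar.gz, split it into parts, then reassemble and extract it.
demo_split_roundtrip() {
  workdir="$(mktemp -d)" || return 1
  (
    cd "$workdir" || exit 1
    printf '__label__5 example review\n' > demo.txt
    tar -cf - demo.txt | gzip > demo.tar.gz
    # split names the pieces demo.tar.gz-partaa, demo.tar.gz-partab, ...
    split -b 64 demo.tar.gz demo.tar.gz-part
    mkdir out
    # Same pipeline shape as above, reading the archive from standard input.
    cat demo.tar.gz-part?? | gunzip | tar -xf - -C out
    # Report success only if the extracted file is byte-identical.
    cmp -s demo.txt out/demo.txt && echo "round trip ok"
  )
  status=$?
  rm -rf "$workdir"
  return $status
}
```

Calling `demo_split_roundtrip` prints `round trip ok` when the extracted file matches the original.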