The process of extracting product data from Amazon using Python, including titles, ratings, prices, images, and descriptions.

oxylabs/how-to-scrape-amazon-product-data

Scraping Amazon Product Data With Python

You can find an extended version of this guide on our blog.

This guide uses Python to scrape the following data points from Amazon:

  • Product name
  • Product rating
  • Product price
  • Product images
  • Product description

Setting up

Create a folder to save your code files. Also, creating a virtual environment is generally a good practice.

The following commands work on macOS and Linux. The commands will create a virtual environment and activate it:

python3 -m venv .env
source .env/bin/activate

If you are on Windows, these commands will vary a little:

python -m venv .env
.env\scripts\activate

Installing packages

python3 -m pip install requests beautifulsoup4 lxml pandas

On Windows, use python instead of python3:

python -m pip install requests beautifulsoup4 lxml pandas

To try the Requests library, create a new file with the name amazon.py and enter the following:

import requests
url = 'https://www.amazon.com/Bose-QuietComfort-45-Bluetooth-Canceling-Headphones/dp/B098FKXT8L'

response = requests.get(url)

print(response.text)

Save the file and run it from the terminal:

python3 amazon.py

In most cases, you will not get the desired HTML. Amazon blocks the request, and the response contains the following text:

To discuss automated access to Amazon data, please contact [email protected].

If you print response.status_code, you will likely see 503 (Service Unavailable) instead of 200, which means success.

Amazon detects that the request did not come from a browser and blocks it.

Many websites employ this practice. A blocked request typically returns a 5xx status code, or sometimes a 4xx one.
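A quick way to tell a block page from a real product page is to check both the status code and the notice text. The following is a minimal sketch and not part of the original script: the helper name looks_blocked is hypothetical, and the notice substring is taken from the message quoted above.

```python
def looks_blocked(status_code, body):
    """Return True when a response looks like a block page, not a product page."""
    # Any 4xx or 5xx status means the request did not succeed.
    if status_code >= 400:
        return True
    # A block notice can also appear in the body, so check for its text too.
    return "automated access to amazon data" in body.lower()


# Usage with a requests response (assumed to exist as `response`):
# if looks_blocked(response.status_code, response.text):
#     print("Request was blocked; try different headers.")
```

Checking the body as well as the status code is a defensive choice; if the notice wording changes, only the status check will still apply.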

The solution is simple in most cases. You can send HTTP headers along with your request just like an actual browser.

Sometimes, sending only the user-agent is enough. At other times, you may need to send more headers. A good example is sending the accept-language header.

To identify the user-agent sent by your browser, press F12 and open the Network tab. Reload the page. Select the first request and examine Request Headers.

You can copy this user-agent and create a dictionary for the headers.

The following shows a dictionary with the user-agent and accept-language headers:

custom_headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'accept-language': 'en-GB,en;q=0.9',
}

You can pass this dictionary to the optional headers parameter of the get method as follows:

response = requests.get(url, headers=custom_headers)

Executing the code with these changes may show the expected HTML with the product details.

With a realistic set of headers, you usually will not need JavaScript rendering. If you do need rendering, use a tool such as Playwright or Selenium. If the User-Agent and Accept-Language headers still result in a 503 error, try the following, fuller set of headers:

custom_headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
    'Accept-Language': 'da, en-gb, en',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Referer': 'https://www.google.com/'
}

It’s also a good idea to rotate different User-Agent strings and try your requests again to overcome the 503 error.
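Rotation and retrying can be combined in one small loop. The sketch below is an illustration under stated assumptions: fetch_with_rotation and its send argument are hypothetical names, the pool holds only the two User-Agent strings shown in this guide, and the retry count and delay are arbitrary defaults.

```python
import itertools
import time

# Illustrative pool: the two User-Agent strings used earlier in this guide.
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
]


def fetch_with_rotation(url, send, base_headers, max_attempts=3, delay=0.0):
    """Call send(url, headers) with a different User-Agent on each attempt.

    `send` is a hypothetical callable that returns an object with a
    `status_code` attribute, for example:
        lambda u, h: requests.get(u, headers=h)
    """
    agents = itertools.cycle(USER_AGENTS)
    response = None
    for attempt in range(max_attempts):
        # Copy the base headers and swap in the next User-Agent.
        headers = {**base_headers, 'User-Agent': next(agents)}
        response = send(url, headers)
        if response.status_code == 200:
            return response
        # Wait a little longer after each failed attempt.
        time.sleep(delay * (attempt + 1))
    # Every attempt failed; return the last response for inspection.
    return response
```

Passing the sender in as an argument keeps the retry logic independent of the HTTP library, which also makes it easy to exercise without touching the network.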

Scraping product data

When scraping Amazon products, typically, you would work with two categories of pages — the category page and the product details page.

For example, search for Over-Ear Headphones on Amazon. The page that shows the search results is the category page.

The category page displays the product title, image, rating, price, and, most importantly, the URL of each product page. More details, such as the product description, are available only on the product details page.

Let's examine the structure of the product details page.

Open a product URL in Chrome or any other modern browser, right-click the product title, and select Inspect. The HTML markup of the product title is highlighted.

It is a span tag with its id attribute set to productTitle.

Similarly, if you right-click the price and select Inspect, you will see the HTML markup of the price.

You can see that the dollar component of the price is in a span tag with the class a-price-whole, and the cents component is in another span tag with the class set to a-price-fraction.

Similarly, you can locate the rating, image, and description.

1. Sending a GET request with custom headers

from bs4 import BeautifulSoup

response = requests.get(url, headers=custom_headers)
soup = BeautifulSoup(response.text, 'lxml')

This guide uses CSS selectors. You can now use the Soup object to query for specific information.

2. Locating and scraping product name

The product name or title is located in a span element with its id productTitle. It's easy to select elements using a unique ID.

title_element = soup.select_one('#productTitle')

Pass the CSS selector to the select_one method, which returns an element instance. You can then read the element's text using the text attribute.

title = title_element.text

Upon printing, you will see that the title is surrounded by whitespace. To fix that, add a .strip() method call as follows:

title = title_element.text.strip()

3. Locating and scraping product rating

Create a selector for rating:

#acrPopover

The following statement can select the element that contains the rating:

rating_element = soup.select_one('#acrPopover')

Note that the rating value is actually in the title attribute:

rating_text = rating_element.attrs.get('title')
print(rating_text)
# prints '4.6 out of 5 stars'

Lastly, use the replace method to remove the trailing text, and strip the space that is left behind:

rating = rating_text.replace('out of 5 stars', '').strip()
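If you want the rating as a number rather than a string, a regular expression is one option. This is a minimal sketch, assuming the English '4.6 out of 5 stars' format shown above; parse_rating is a hypothetical helper name, not part of the original script.

```python
import re


def parse_rating(rating_text):
    """Extract the numeric rating from text such as '4.6 out of 5 stars'."""
    if not rating_text:
        return None
    # Take the first number in the string, with an optional decimal part.
    match = re.search(r'\d+(?:\.\d+)?', rating_text)
    return float(match.group()) if match else None


print(parse_rating('4.6 out of 5 stars'))  # 4.6
```

Storefronts that use a decimal comma would need a different pattern, so treat this as a starting point only.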

4. Locating and scraping product price

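The product price is located in two places: below the product title and in the Buy Now box. You can use either of these tags.

Create a CSS selector for the price. Instead of joining the dollar and cents spans, you can select a span that holds the price as a single string: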

span.a-offscreen

The CSS selector can be passed to the select_one method of BeautifulSoup as follows:

price_element = soup.select_one('span.a-offscreen')

You can now print the price:

print(price_element.text)
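The scraped price is a string that includes the currency symbol. The sketch below shows one way to convert it to a number, assuming a US-style format such as '$1,299.99'; parse_price is a hypothetical helper and is not part of the original script.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(price_text):
    """Convert text such as '$1,299.99' to a Decimal, or None if it is not a price."""
    if not price_text:
        return None
    # Drop everything except digits and the decimal point.
    cleaned = re.sub(r'[^\d.]', '', price_text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Nothing numeric was left, for example 'N/A'.
        return None


print(parse_price('$329.00'))  # 329.00
```

Decimal is used instead of float so that monetary values are not affected by binary rounding.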

5. Locating and scraping product image

Let's scrape the default image. You can select it with the CSS selector #landingImage. Write the following to get the image URL from the src attribute:

image_element = soup.select_one('#landingImage')
image = image_element.attrs.get('src')

6. Locating and scraping product description

The methodology remains the same — create a CSS selector and use the select_one method.

#productDescription

You can extract the description text as follows:

description_element = soup.select_one('#productDescription')
description = description_element.text.strip()
print(description)
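Descriptions often contain line breaks and runs of spaces inside the text, which strip() alone does not remove. A minimal sketch of a clean-up step follows; clean_text is a hypothetical helper name.

```python
def clean_text(text):
    """Collapse runs of whitespace, including newlines, into single spaces."""
    if text is None:
        return None
    # split() with no arguments splits on any whitespace and drops empty parts.
    return ' '.join(text.split())


print(clean_text('  Bose\n   QuietComfort 45  '))  # Bose QuietComfort 45
```

The same helper could be applied to the title if you prefer a single normalization step for all text fields.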

7. Handling product listing

To reach the product information, begin with product listing or category pages.

For example, here is the category page for over-ear headphones.

Notice that all the products are contained in a div with the special attribute [data-asin]. In the div, all the product links are in an h2 tag.

The CSS Selector is as follows:

[data-asin] h2 a

You can read the href attribute of each matching element in a loop. However, note that the links will be relative. Use the urljoin function to convert them to absolute URLs.

from urllib.parse import urljoin

def parse_listing(listing_url):
    global visited_urls
    response = requests.get(listing_url, headers=custom_headers)
    print(response.status_code)
    soup_search = BeautifulSoup(response.text, "lxml")
    link_elements = soup_search.select("[data-asin] h2 a")
    page_data = []

    for link in link_elements:
        full_url = urljoin(listing_url, link.attrs.get("href"))
        if full_url not in visited_urls:
            visited_urls.add(full_url)
            print(f"Scraping product from {full_url[:100]}", flush=True)
            product_info = get_product_info(full_url)
            if product_info:
                page_data.append(product_info)

Handling pagination

The link to the next page has the CSS class s-pagination-next. Select it with the selector a.s-pagination-next, and continue inside parse_listing as follows:

    next_page_el = soup_search.select_one('a.s-pagination-next')
    if next_page_el:
        next_page_url = next_page_el.attrs.get('href')
        next_page_url = urljoin(listing_url, next_page_url)
        print(f'Scraping next page: {next_page_url}', flush=True)
        page_data += parse_listing(next_page_url)

    return page_data
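Search result links usually carry extra path segments and query parameters, so the same product can appear under several different URLs and slip past the visited_urls check. The sketch below reduces a link to a short canonical form. It is an illustration only: canonical_product_url is a hypothetical helper, and it assumes product links contain /dp/ followed by a 10-character ASIN, as the URLs in this guide do.

```python
import re
from urllib.parse import urljoin

# Assumption: product links contain '/dp/' followed by a 10-character ASIN.
ASIN_PATTERN = re.compile(r'/dp/([A-Z0-9]{10})')


def canonical_product_url(listing_url, href):
    """Resolve a relative link and reduce it to the short /dp/<ASIN> form."""
    # urljoin turns the relative href into an absolute URL.
    absolute = urljoin(listing_url, href)
    match = ASIN_PATTERN.search(absolute)
    if match:
        # Rebuild the URL from the site root, dropping tracking parameters.
        return urljoin(absolute, f'/dp/{match.group(1)}')
    # Not a product link; return the absolute URL unchanged.
    return absolute


print(canonical_product_url(
    'https://www.amazon.com/s?k=bose',
    '/Bose-QuietComfort-45/dp/B098FKXT8L/ref=sr_1_1?keywords=bose',
))  # https://www.amazon.com/dp/B098FKXT8L
```

Storing the canonical form in visited_urls would make the duplicate check compare products rather than raw link strings.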

8. Exporting scraped product data to a CSV file

Each scraped product is returned as a dictionary. This is intentional: a list of dictionaries maps directly onto the rows of a table.

You can create a list that contains all the scraped products:

def main():
    data = []
    search_url = "https://www.amazon.com/s?k=bose&rh=n%3A12097479011&ref=nb_sb_noss"
    data = parse_listing(search_url)

The data list can then be used to create a pandas DataFrame object (pandas must be imported as pd):

    df = pd.DataFrame(data)
    df.to_csv("headphones.csv", index=False)
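If you would rather not depend on pandas for this step, the standard csv module can write the same list of dictionaries. This is an alternative sketch rather than the guide's method; rows_to_csv is a hypothetical helper, and it returns the CSV as a string so that you can inspect it before writing it to a file.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample = [{'title': 'Bose QC 45', 'price': '$329.00'}]
print(rows_to_csv(sample, ['title', 'price']))
```

To save the result, write the returned string to a file opened with newline='' so that line endings are not translated twice.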

Reviewing the final script

Putting together everything, here is the final script:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd

custom_headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
    'Accept-Language': 'da, en-gb, en',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Referer': 'https://www.google.com/'
}

visited_urls = set()

def get_product_info(url):
    response = requests.get(url, headers=custom_headers)
    if response.status_code != 200:
        print(f"Error in getting webpage: {url}")
        return None

    soup = BeautifulSoup(response.text, "lxml")

    title_element = soup.select_one("#productTitle")
    title = title_element.text.strip() if title_element else None

    price_element = soup.select_one('span.a-offscreen')
    price = price_element.text if price_element else None

    rating_element = soup.select_one("#acrPopover")
    rating_text = rating_element.attrs.get("title") if rating_element else None
    rating = rating_text.replace("out of 5 stars", "").strip() if rating_text else None

    image_element = soup.select_one("#landingImage")
    image = image_element.attrs.get("src") if image_element else None

    description_element = soup.select_one("#productDescription")
    description = description_element.text.strip() if description_element else None

    return {
        "title": title,
        "price": price,
        "rating": rating,
        "image": image,
        "description": description,
        "url": url
    }


def parse_listing(listing_url):
    global visited_urls
    response = requests.get(listing_url, headers=custom_headers)
    print(response.status_code)
    soup_search = BeautifulSoup(response.text, "lxml")
    link_elements = soup_search.select("[data-asin] h2 a")
    page_data = []

    for link in link_elements:
        full_url = urljoin(listing_url, link.attrs.get("href"))
        if full_url not in visited_urls:
            visited_urls.add(full_url)
            print(f"Scraping product from {full_url[:100]}", flush=True)
            product_info = get_product_info(full_url)
            if product_info:
                page_data.append(product_info)

    next_page_el = soup_search.select_one('a.s-pagination-next')
    if next_page_el:
        next_page_url = next_page_el.attrs.get('href')
        next_page_url = urljoin(listing_url, next_page_url)
        print(f'Scraping next page: {next_page_url}', flush=True)
        page_data += parse_listing(next_page_url)

    return page_data


def main():
    data = []
    search_url = "https://www.amazon.com/s?k=bose&rh=n%3A12097479011&ref=nb_sb_noss"
    data = parse_listing(search_url)
    df = pd.DataFrame(data)
    df.to_csv("headphones.csv", index=False)


if __name__ == '__main__':
    main()

An easier solution to extract Amazon data

You can simplify the whole process with Oxylabs Amazon Scraper (a free trial is available).

Scraping products from search results

Extract product data with the following code:

import requests
from pprint import pprint

# Structure payload.
payload = {
    'source': 'amazon_search',
    'query': 'bose',  # Search for "bose"
    'start_page': 1,
    'pages': 10,
    'parse': True,
    'context': [
        {'key': 'category_id', 'value': 12097479011}  # category id for headphones
    ],
}

# Get response
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('USERNAME', 'PASSWORD'),
    json=payload,
)

# Print prettified response to stdout.
pprint(response.json())

Notice how it requests 10 pages, beginning with page 1. The search is also limited to category ID 12097479011, which is Amazon's category ID for headphones. The data is returned in JSON format.
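To reuse the request for other searches, the payload can be built by a small function and the credentials read from the environment instead of being hard-coded. This is a sketch under stated assumptions: the function names and the OXYLABS_USERNAME and OXYLABS_PASSWORD variable names are hypothetical, while the payload keys are the ones shown above.

```python
import os


def build_search_payload(query, category_id, start_page=1, pages=1):
    """Build a search payload with the same keys as the example above."""
    return {
        'source': 'amazon_search',
        'query': query,
        'start_page': start_page,
        'pages': pages,
        'parse': True,
        'context': [
            {'key': 'category_id', 'value': category_id},
        ],
    }


def credentials_from_env():
    """Read credentials from hypothetical environment variable names."""
    return (
        os.environ.get('OXYLABS_USERNAME', ''),
        os.environ.get('OXYLABS_PASSWORD', ''),
    )


payload = build_search_payload('bose', 12097479011, start_page=1, pages=10)
# The request itself is unchanged from the example above:
# response = requests.post('https://realtime.oxylabs.io/v1/queries',
#                          auth=credentials_from_env(), json=payload)
```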

Extracting product details

You only need the product URL, regardless of the country where the Amazon store is located. The only code change is the payload.

The following payload extracts details, such as name, price, stock availability, description, and more, for the Bose QC 45:

payload = {
    'source': 'amazon',
    'url': 'https://www.amazon.com/dp/B098FKXT8L',
    'parse': True
}

Scraping products by ASIN

Another way to get data is by the ASIN of a product. You need to modify the payload:

payload = {
    'source': 'amazon_product',
    'domain': 'co.uk',
    'query': 'B098FKXT8L',
    'parse': True,
    'context': [
        {'key': 'autoselect_variant', 'value': True}
    ]
}

Note the optional parameter domain. Use this parameter to get Amazon data from any domain, such as amazon.co.uk.
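Because a mistyped ASIN only fails after the request has been sent, it can help to validate it locally first. The following is a minimal sketch: build_product_payload is a hypothetical helper, the check assumes ASINs are 10 uppercase letters or digits, and the optional context entry from the payload above is left out for brevity.

```python
import re


def build_product_payload(asin, domain):
    """Validate the ASIN format, then build a product payload for a domain."""
    # Assumption: an ASIN is exactly 10 uppercase letters or digits.
    if not re.fullmatch(r'[A-Z0-9]{10}', asin):
        raise ValueError(f'Not a valid ASIN: {asin!r}')
    return {
        'source': 'amazon_product',
        'domain': domain,
        'query': asin,
        'parse': True,
    }


print(build_product_payload('B098FKXT8L', 'co.uk'))
```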

Looking to scrape other Amazon data? See also: Amazon Review Scraper, Amazon ASIN Scraper, Bypass Amazon CAPTCHA, How to Scrape Amazon Prices
