This repo contains a full-fledged Python-based pipeline that scrapes a JavaScript-rendered website, homzmart.com/en, cleans the data, and pushes the results to a cloud-based database. The workflow is orchestrated on Airflow so that it runs automatically without any human intervention.
The goal of the project was to scrape an e-commerce website that sells furniture online, homzmart.com/en, to get insights about the company's product assortment and pricing. The aim was to create a table containing all kinds of information about the products on the website, including:
- Product name + Product Display Page (PDP) URL
- Category name of the product + Category URL
- Sub-category name of the product + Sub-category URL
- Number of product pages under the product's sub-category
- Page rank of the product in the browsing area
- The link to the product's main image
- Product description
- Current Price
- Discount Tag
- Strikethrough Price
- Vendor Name
- Promised Delivery Time
- Availability Type (On-demand vs. In Stock)
- Dimensions (L x W x H)
- SKU Name
The website employed several throttling and anti-bot mechanisms and its content was rendered by JavaScript, making it a very challenging website to scrape effectively at scale.
The website has four sections:
- The Home Page --> This is the landing page of the website: https://homzmart.com/en
  - This page was used to scrape the category names and links
- The Category Pages --> These are pages that contain all products under a particular category (e.g., Beds, Sofas, Storage, etc.)
  - An example of a category page (Beds) --> https://homzmart.com/en/products/3#1
  - These pages were used to obtain the sub-category names and links
- The Sub-category Pages --> These are pages that contain all products under a particular sub-category (e.g., Future Furniture Bed Beige FF.A022 under King Beds)
  - An example of a sub-category page under Beds (King Beds) --> https://homzmart.com/en/products/4288#1
  - These pages were used to obtain the product page links
- The Product Pages --> These are the detail pages of the individual products
  - An example of a product page under King Beds --> https://homzmart.com/en/product-details/FF.A022
  - These pages were used to obtain all the remaining information from the list above (see the spider sketch after this list for how the four page types chain together)
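To make the hierarchy concrete, here is a condensed sketch of how the four page types chain together in a single Scrapy spider. The repo actually splits this logic across four separate spider scripts, and the CSS selectors and field names below are placeholders rather than the site's real markup:

```python
# Illustrative sketch only: the real project uses four separate spiders and
# different selectors; this just shows the home -> category -> sub-category -> product chain.
import scrapy


class HomzmartHierarchySketch(scrapy.Spider):
    name = "homzmart_hierarchy_sketch"  # hypothetical spider name
    start_urls = ["https://homzmart.com/en"]

    def parse(self, response):
        # Home page -> category pages (placeholder selector)
        for href in response.css("a.category::attr(href)").getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Category page -> sub-category pages (placeholder selector)
        for href in response.css("a.subcategory::attr(href)").getall():
            yield response.follow(href, callback=self.parse_subcategory)

    def parse_subcategory(self, response):
        # Sub-category page -> product detail pages (placeholder selector)
        for href in response.css("a.product::attr(href)").getall():
            yield response.follow(href, callback=self.parse_product)

    def parse_product(self, response):
        # Product page -> the fields listed above (illustrative field names)
        yield {
            "product_name": response.css("h1::text").get(),
            "pdp_url": response.url,
        }
```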
The result of the pipeline after it is executed looks as follows (only a snippet of the table is shown because it is huge):
Since the crawling process had to be done repeatedly, it was necessary to orchestrate the entire process via Airflow so that it runs at a regular cadence. I will explain the process of reproducing the code on your own machine below.
This section is split into two parts. Section one explains how to replicate the crawling code without the Airflow orchestration. Section two demonstrates how to create a pipeline out of the Scrapy spiders. The Airflow pipeline uses the PythonOperator, EmailOperator, and FileSensor to orchestrate the process.
Step 1: Clone the repo by running this command in your terminal:
```bash
git clone https://github.com/omar-elmaria/python_scrapy_airflow_pipeline.git
```
Step 2: Create a virtual environment by running this command: `python -m venv venv_scraping`
Step 3: Activate the virtual environment by typing `source venv_scraping/bin/activate` if you are on Mac/Linux or `source venv_scraping/Scripts/activate` if you are on Windows. You might need to replace the forward slashes with backslashes on Windows, depending on your shell.
Step 4: Double-check that you are using the correct Python path by typing `which python` (`which python3` on Mac/Linux) and pressing Enter. It should point to the Python executable in the virtual environment you just created.
Step 5: Press `Ctrl-Shift-P` to open the command palette in VSCode --> `Python: Select Interpreter` --> browse to the Python executable in your virtual environment so that the Jupyter notebook uses the correct Python interpreter.
Step 6: Run this command in the terminal to install the required dependencies: `pip install -r requirements.txt`
Step 7: This website is dynamically rendered by JavaScript, so there are three options to scrape it:
- scrapy-playwright
- scrapy-splash
- Proxy service with JS-rendering capability (e.g., ScraperAPI)
Since this is a high-volume scraping job, I opted for option #3. My preferred service is ScraperAPI. You can easily sign up in a couple of minutes with your email and get an API key with 5,000 API credits, which should be more than enough for testing purposes.
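Below is a minimal sketch, not the repo's exact implementation, of how a Scrapy spider can route its requests through ScraperAPI with JavaScript rendering enabled. The spider name and CSS selector are placeholders:

```python
# Sketch of routing Scrapy requests through ScraperAPI (render=true asks the
# service to execute the page's JavaScript before returning the HTML).
import os
from urllib.parse import urlencode

import scrapy
from dotenv import load_dotenv

load_dotenv()
API_KEY = os.getenv("SCRAPER_API_KEY")


def scraperapi_url(target_url: str) -> str:
    params = {"api_key": API_KEY, "url": target_url, "render": "true"}
    return "http://api.scraperapi.com/?" + urlencode(params)


class HomePageSketchSpider(scrapy.Spider):
    name = "home_page_sketch"  # hypothetical name, for illustration only

    def start_requests(self):
        yield scrapy.Request(scraperapi_url("https://homzmart.com/en"), callback=self.parse)

    def parse(self, response):
        # Placeholder selector; the real spiders use the site's actual markup
        for link in response.css("a::attr(href)").getall():
            yield {"category_url": response.urljoin(link)}
```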
Step 8: Create a .env file with the following parameters (without the curly braces):
```
SCRAPER_API_KEY={API_KEY_FROM_SCRAPER_API}
DATA_FOLDER_PATH_LOCAL="{INSERT_LOCAL_PATH_TO_FOLDER_CONTAINING_THE_JSON_FILES_GENERATED_FROM_SCRAPING}"
DATA_FOLDER_PATH_AIRFLOW="{INSERT_VIRTUAL_PATH_TO_FOLDER_CONTAINING_THE_JSON_FILES_GENERATED_FROM_SCRAPING}"
```
The local path can look something like this: `I:\scraping_gigs\python_scrapy_airflow_pipeline\homzmart_scraping\data`. Note that I used backslashes because I am using the Windows OS.
The virtual path is ONLY required for the Airflow step, so you can skip it if you don't want to orchestrate the process. That said, it can look something like this: `/opt/airflow/python_scrapy_airflow_pipeline/homzmart_scraping/data`. Note that I used forward slashes here because the Airflow container is usually created in a Linux environment. Also, keep in mind that the endings of both paths are the same. You are simply cloning the data folder on your local computer to the Airflow environment. If you want more elaboration on this step, please check out my guide on how to install Airflow locally on your machine and navigate to step 11 under section 1.
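For reference, here is a minimal sketch, assuming python-dotenv is installed, of how these values can be read at runtime. The repo's spiders may load them slightly differently, and the AIRFLOW_HOME check is an illustrative heuristic rather than the repo's logic:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory

scraper_api_key = os.getenv("SCRAPER_API_KEY")

# Use the Airflow path only when running inside the Airflow container (heuristic)
if os.getenv("AIRFLOW_HOME"):
    data_folder = os.getenv("DATA_FOLDER_PATH_AIRFLOW")
else:
    data_folder = os.getenv("DATA_FOLDER_PATH_LOCAL")

print(scraper_api_key is not None, data_folder)
```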
Step 9: Add a new `PYTHONPATH` environment variable pointing to the location of your `python_scrapy_airflow_pipeline` project folder. This screenshot demonstrates the process on the Windows operating system.
The new value that will be added under the Path variable is --> `%PYTHONPATH%`
N.B. You will need to restart your computer for this change to take effect.
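As a quick, illustrative sanity check (run it in a new terminal session after the restart), you can confirm that Python picks up the new variable; the folder name below assumes your clone is called python_scrapy_airflow_pipeline:

```python
import os
import sys

# Adjust the folder name if your clone lives under a different name
print(os.environ.get("PYTHONPATH"))
print(any("python_scrapy_airflow_pipeline" in p for p in sys.path))  # should print True
```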
Step 10: Delete the JSON files from the data folder to start on a clean slate
Step 11: Now, you are ready to scrape the website. The scripts should be run in the following order (see the sketch after this list for one way to run them back to back):
1. `homzmart_home_page_spider.py`
2. `homzmart_cat_page_spider.py`
3. `homzmart_subcat_page_spider.py`
4. `homzmart_prod_page_spider.py`
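Here is a minimal sketch (not part of the repo) of running the four spider scripts back to back from a small helper script; it assumes you run it from the folder that contains the spider scripts:

```python
import subprocess
import sys

SPIDER_SCRIPTS = [
    "homzmart_home_page_spider.py",
    "homzmart_cat_page_spider.py",
    "homzmart_subcat_page_spider.py",
    "homzmart_prod_page_spider.py",
]

for script in SPIDER_SCRIPTS:
    # check=True aborts the chain if any spider exits with a non-zero status
    subprocess.run([sys.executable, script], check=True)
```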
You can ignore the last two scripts, `homzmart_combine_jsons.py` and `homzmart_delete_tables.py`. These scripts are used to push the scraped data to an Airtable database hosted on the cloud. You will not be able to replicate these steps because you will not have the API keys required to access these private databases.
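For context only, here is a hedged sketch of how pyairtable (which the Dockerfile installs later in this guide) can push records to an Airtable base. The API key variable, base ID, table name, field names, and file name are all placeholders, not the repo's actual configuration:

```python
import json
import os

from pyairtable import Table

# All identifiers below are placeholders for illustration
table = Table(os.getenv("AIRTABLE_API_KEY"), "appXXXXXXXXXXXXXX", "products")

with open("prod_pages.json", encoding="utf-8") as f:  # hypothetical JSON output file
    records = json.load(f)

# batch_create chunks the records into Airtable-sized batches automatically
table.batch_create(
    [{"product_name": r.get("product_name"), "pdp_url": r.get("pdp_url")} for r in records]
)
```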
You can also ignore `test_crawlera.py` and `test_scraperapi.py`. These test scripts were created to play around with some popular proxy API services on the market, Zyte Smart Proxy Manager (formerly Crawlera) and ScraperAPI.
Step 12.1: The output of the `homzmart_home_page_spider.py` script should look something like this.
Step 12.2: The output of the `homzmart_cat_page_spider.py` script should look something like this. Please note that the screenshot is truncated to preserve space.
Step 12.3: The output of the `homzmart_subcat_page_spider.py` script should look something like this. Please note that the screenshot is truncated to preserve space.
Step 12.4: The output of the `homzmart_prod_page_spider.py` script should look something like this. Please note that the screenshot is truncated to preserve space.
N.B. I purposely adjusted the scripts to scrape only a small portion of the website, because the site has more than 60,000 pages and a full crawl takes several hours to complete. With this limit in place, the entire run should finish in under 5 minutes.
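The repo may cap the crawl differently (e.g., by only scheduling a subset of URLs), but for illustration, one common way to limit a Scrapy spider is the CloseSpider extension settings:

```python
import scrapy


class LimitedSpiderSketch(scrapy.Spider):
    name = "limited_sketch"  # hypothetical spider, for illustration only
    start_urls = ["https://homzmart.com/en"]

    # Stop the crawl automatically after roughly 50 responses have been processed
    custom_settings = {"CLOSESPIDER_PAGECOUNT": 50}

    def parse(self, response):
        yield {"url": response.url}
```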
The prerequisite for this section is installing Docker and Airflow on your local machine. Please follow the steps explained in section #1 of my other guide (excluding step 7) and come back to this part once you're done.
- After installing Airflow, you will need to add the following commands to a `Dockerfile`:
```dockerfile
FROM apache/airflow:2.3.4

# Install the libraries required for scraping
RUN pip3 install scrapy scrapy-playwright pyairtable scrapy-proxy-pool scrapy-user-agents scraperapi-sdk python-dotenv pandas numpy

# Install the playwright headless browsers
RUN playwright install

# Add a default path
ENV PYTHONPATH="$PYTHONPATH:/opt/airflow/python_scrapy_airflow_pipeline"
```
- You will also need to add the following parameters to a .env file within the Airflow directory if you haven't already. Note that this might be different from the .env file you used in the first part of this guide.
```
AIRFLOW_IMAGE_NAME=apache/airflow:2.3.4
AIRFLOW_UID=50000
AIRFLOW__SMTP__SMTP_PASSWORD={GOOGLE_PASSWORD_WITHOUT_THE_CURLY_BRACES}
```
To generate the `GOOGLE_PASSWORD` and be able to send emails via Airflow, please follow the steps in this guide. If you don't want to send a success email at the end of your Airflow pipeline, you can delete the last line of the .env file.
It is generally recommended to have one external directory to host the DAGs from all of your projects. I call it airflow-local and it looks something like this
- Finally, you will need to add a new volume to the docker-compose file under the `volumes:` section, like this:
```yaml
- {INSERT_LOCAL_PATH_TO_PYTHON_SCRAPY_AIRFLOW_PIPELINE_PROJECT}:/opt/airflow/python_scrapy_airflow_pipeline
```
- Now, you are ready to launch Airflow. Go to your browser, type in `localhost:8080`, and enter the credentials (username: airflow, password: airflow). You should land on a page that looks like this.
The DAG itself will look something like this.
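For orientation, here is a minimal sketch of the DAG structure described in this guide (PythonOperator + FileSensor + EmailOperator). The DAG ID, task names, callables, schedule, email address, and file names are illustrative rather than the repo's actual DAG definition:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def run_home_page_spider():
    # In the real DAG, this callable would invoke homzmart_home_page_spider.py
    pass


with DAG(
    dag_id="homzmart_scraping_sketch",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@weekly",  # placeholder cadence
    catchup=False,
) as dag:
    crawl_home_page = PythonOperator(
        task_id="crawl_home_page",
        python_callable=run_home_page_spider,
    )

    wait_for_home_page_json = FileSensor(
        task_id="wait_for_home_page_json",
        fs_conn_id="fs_default",       # the File (path) connection created in the step below
        filepath="home_page.json",     # hypothetical file name, relative to the connection path
        poke_interval=30,
    )

    notify_success = EmailOperator(
        task_id="notify_success",
        to="[email protected]",      # placeholder recipient
        subject="Homzmart scraping pipeline finished",
        html_content="All spiders ran successfully.",
    )

    crawl_home_page >> wait_for_home_page_json >> notify_success
```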
- Before triggering the DAG, you will need to set up a new connection for the file sensor steps. To do that, go to Admin --> Connections and click on the plus sign. Enter a new connection of type "File (path)" and enter the path to the `data` folder in the Airflow environment. Typically, you will not need to change the path shown in the screenshot if you followed the steps of this guide:
```json
{"path": "/opt/airflow/python_scrapy_airflow_pipeline/homzmart_scraping/data"}
```
- Now, you are ready to fire up the DAG. Navigate to the grid view, click on the play button, and then Trigger DAG. The steps should be executed sequentially as shown in the screenshot below. The JSON files will gradually appear in your local directory
- If you change anything in the `Dockerfile` (e.g., add more dependencies), you will need to re-build the Docker image. Run the commands below in your terminal from within the directory that hosts the Airflow folder structure:
```bash
docker-compose down --volumes --rmi all
docker-compose up -d
```
These two commands remove the Airflow image and re-build it using the new parameters in the Dockerfile and docker-compose file.
- If you change anything in the docker-compose file (e.g., add a new volume), you don't need to re-build the image. It is enough to stop the Docker containers and spin them up again by running the following commands, again from within the directory that hosts the Airflow folder structure:
```bash
docker-compose down
docker-compose up -d
```
- If you face a `ModuleNotFoundError`, it is usually because you are importing modules from a path that is not recognized by Python. To remedy that, you will need to add that path to an environment variable pointing to the location from which Python should start searching for and importing your module. The environment variable should be added in the following places:
  - On your operating system --> Check step 9 under section 2.1
  - In the Dockerfile --> Check step 1 under section 2.2. This is only necessary if you are using Airflow in the process
  - Possibly also as a new volume in the docker-compose file if you are importing from a folder that is different from `./dags`, `./plugins`, or `./logs` --> Check step 3 under section 2.2. This is only necessary if you are using Airflow in the process
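As a quick diagnostic (illustrative only, and not a substitute for setting `PYTHONPATH` properly), you can append the project folder to sys.path at runtime and confirm it is visible; adjust the path to your local clone or to the /opt/airflow/... path inside the Airflow container:

```python
import sys

# Temporary workaround for testing imports before PYTHONPATH is configured
sys.path.append("/opt/airflow/python_scrapy_airflow_pipeline")

print("/opt/airflow/python_scrapy_airflow_pipeline" in sys.path)  # should print True
```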
If you have any questions or wish to build a scraper for a particular use case, feel free to contact me on LinkedIn