This is a data pipeline project built around https://www.nepalyp.com/. The pipeline does the following:
- Scrape data from https://www.nepalyp.com with Scrapy (a spider sketch follows this list).
- Store the scraped data in an AWS RDS PostgreSQL database.
- Serve the data as an API using API Gateway and AWS Lambda.
- Request data from that API and create visualizations with Plotly in a Jupyter notebook.
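Below is a minimal sketch of the kind of Scrapy spider the scraping step uses; the spider name, start URL, CSS selectors, and item fields are illustrative assumptions rather than the project's actual code.

```python
# Hypothetical spider sketch; selectors and field names are assumptions.
import scrapy


class NepalYpSpider(scrapy.Spider):
    name = "nepalyp"
    start_urls = ["https://www.nepalyp.com/"]

    def parse(self, response):
        # Yield one item per business listing found on the page.
        for listing in response.css("div.company"):
            yield {
                "name": listing.css("h4 a::text").get(),
                "address": listing.css("div.address::text").get(),
            }

        # Follow pagination links, if present (selector is a placeholder).
        next_page = response.css("a.pagination_next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```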
- Scrapy: To scrape data from https://www.nepalyp.com
- PostgreSQL: To store the data scraped with Scrapy (a pipeline sketch follows this list).
- Node.js and Express.js: To create API endpoints from the data stored in the PostgreSQL database.
- Plotly: To create visualizations from the data returned by the API server (a request-and-plot sketch follows this list).
- Apache Airflow: To automate the ETL tasks and refresh the visualizations.
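A minimal sketch of how a Scrapy item pipeline could write the scraped data into PostgreSQL with psycopg2; the table name, columns, and connection settings are placeholders and would need to match the actual RDS instance.

```python
# Hypothetical Scrapy item pipeline; connection details and schema are assumptions.
import psycopg2


class PostgresPipeline:
    def open_spider(self, spider):
        self.conn = psycopg2.connect(
            host="your-rds-endpoint.amazonaws.com",  # placeholder RDS endpoint
            dbname="nepalyp",
            user="postgres",
            password="change-me",
        )
        self.cur = self.conn.cursor()
        self.cur.execute(
            "CREATE TABLE IF NOT EXISTS businesses (name TEXT, address TEXT)"
        )
        self.conn.commit()

    def process_item(self, item, spider):
        self.cur.execute(
            "INSERT INTO businesses (name, address) VALUES (%s, %s)",
            (item.get("name"), item.get("address")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```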
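And a minimal sketch of requesting data from the deployed API and charting it with Plotly in a notebook; the endpoint URL and the JSON field names (e.g. `category`) are assumptions, not the project's actual response schema.

```python
# Hypothetical request-and-plot example; API URL and fields are assumptions.
import pandas as pd
import plotly.express as px
import requests

API_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/businesses"  # placeholder

response = requests.get(API_URL, timeout=30)
response.raise_for_status()
df = pd.DataFrame(response.json())

# Example: number of listed businesses per category.
counts = df.groupby("category").size().reset_index(name="count")
fig = px.bar(counts, x="category", y="count", title="Businesses per category")
fig.show()
```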
Step 1: Install the Python dependencies
pip install -r requirements.txt
Step 2: Start the Airflow webserver and scheduler
airflow webserver
airflow scheduler
Step 3:
Trigger scrape_DAG from the Airflow web UI (a sketch of what the DAG might look like follows).
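A minimal sketch of what scrape_DAG might contain, assuming one task that runs the spider and one that re-executes the visualization notebook via nbconvert; the schedule, task ids, and shell commands are assumptions, not taken from the repository.

```python
# Hypothetical DAG sketch; commands and schedule are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="scrape_DAG",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run the Scrapy spider; the spider name is a placeholder.
    scrape = BashOperator(
        task_id="scrape_nepalyp",
        bash_command="scrapy crawl nepalyp",
    )

    # Re-execute the visualization notebook so the saved output stays current.
    render_notebook = BashOperator(
        task_id="render_visualizations",
        bash_command=(
            "jupyter nbconvert --to notebook --execute visualization.ipynb "
            "--output visualization.nbconvert.ipynb"
        ),
    )

    scrape >> render_notebook
```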
Step 4: View the results
See the Plotly dashboard at http://localhost:8050, or open visualization.nbconvert.ipynb to see the updated visualizations (a minimal dashboard sketch follows).
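Since localhost:8050 is Dash's default port, the dashboard is presumably a Dash app. A minimal sketch with placeholder data, not the project's actual dashboard code:

```python
# Hypothetical Dash app sketch; the figure and data are placeholders.
import plotly.express as px
from dash import Dash, dcc, html

app = Dash(__name__)

# Placeholder figure; the real dashboard would build figures from the API data.
fig = px.bar(
    x=["Kathmandu", "Pokhara"],
    y=[120, 45],
    title="Businesses per city (placeholder data)",
)

app.layout = html.Div([
    html.H1("Nepal Yellow Pages dashboard"),
    dcc.Graph(figure=fig),
])

if __name__ == "__main__":
    # Serves on http://localhost:8050 by default (older Dash versions use app.run_server).
    app.run(debug=True)
```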