This is a data pipeline project built around https://www.nepalyp.com/. The pipeline does the following:
- Scrape data from https://www.nepalyp.com with Scrapy (a spider sketch follows this list).
- Store the scraped data in an AWS RDS PostgreSQL database.
- Serve the data as an API using API Gateway and AWS Lambda.
- Request data from that API and create visualizations with Plotly in a Jupyter notebook.
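Below is a minimal sketch of the kind of Scrapy spider the scraping step uses; the spider name, start URL, CSS selectors, and item fields are illustrative assumptions rather than the project's actual code.

```python
# Hypothetical spider sketch; selectors and field names are assumptions.
import scrapy


class NepalYpSpider(scrapy.Spider):
    name = "nepalyp"
    start_urls = ["https://www.nepalyp.com/"]

    def parse(self, response):
        # Yield one item per business listing found on the page.
        for listing in response.css("div.company"):
            yield {
                "name": listing.css("h4 a::text").get(),
                "address": listing.css("div.address::text").get(),
            }

        # Follow pagination links, if present (selector is a placeholder).
        next_page = response.css("a.pagination_next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```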
- Scrapy: To scrape data from https://www.nepalyp.com
- PostgreSQL: To store the data scraped with Scrapy (a pipeline sketch follows this list).
- Node.js and Express.js: To create API endpoints from the data stored in the PostgreSQL database.
- Plotly: To create visualizations from the data returned by the API server (a request-and-plot sketch follows this list).
- Apache Airflow: To automate the ETL tasks and refresh the visualizations.
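A minimal sketch of how a Scrapy item pipeline could write the scraped data into PostgreSQL with psycopg2; the table name, columns, and connection settings are placeholders and would need to match the actual RDS instance.

```python
# Hypothetical Scrapy item pipeline; connection details and schema are assumptions.
import psycopg2


class PostgresPipeline:
    def open_spider(self, spider):
        self.conn = psycopg2.connect(
            host="your-rds-endpoint.amazonaws.com",  # placeholder RDS endpoint
            dbname="nepalyp",
            user="postgres",
            password="change-me",
        )
        self.cur = self.conn.cursor()
        self.cur.execute(
            "CREATE TABLE IF NOT EXISTS businesses (name TEXT, address TEXT)"
        )
        self.conn.commit()

    def process_item(self, item, spider):
        self.cur.execute(
            "INSERT INTO businesses (name, address) VALUES (%s, %s)",
            (item.get("name"), item.get("address")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```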
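And a minimal sketch of requesting data from the deployed API and charting it with Plotly in a notebook; the endpoint URL and the JSON field names (e.g. `category`) are assumptions, not the project's actual response schema.

```python
# Hypothetical request-and-plot example; API URL and fields are assumptions.
import pandas as pd
import plotly.express as px
import requests

API_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/businesses"  # placeholder

response = requests.get(API_URL, timeout=30)
response.raise_for_status()
df = pd.DataFrame(response.json())

# Example: number of listed businesses per category.
counts = df.groupby("category").size().reset_index(name="count")
fig = px.bar(counts, x="category", y="count", title="Businesses per category")
fig.show()
```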
Step 1: Install the Python dependencies
pip install -r requirements.txt
Step 2: Start the Airflow webserver and scheduler
airflow webserver
airflow scheduler
Step 3:
Trigger scrape_DAG from the Airflow web UI (a sketch of what the DAG might look like follows).
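A minimal sketch of what scrape_DAG might contain, assuming one task that runs the spider and one that re-executes the visualization notebook via nbconvert; the schedule, task ids, and shell commands are assumptions, not taken from the repository.

```python
# Hypothetical DAG sketch; commands and schedule are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="scrape_DAG",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run the Scrapy spider; the spider name is a placeholder.
    scrape = BashOperator(
        task_id="scrape_nepalyp",
        bash_command="scrapy crawl nepalyp",
    )

    # Re-execute the visualization notebook so the saved output stays current.
    render_notebook = BashOperator(
        task_id="render_visualizations",
        bash_command=(
            "jupyter nbconvert --to notebook --execute visualization.ipynb "
            "--output visualization.nbconvert.ipynb"
        ),
    )

    scrape >> render_notebook
```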
Step 4: View the results
See the Plotly dashboard at http://localhost:8050, or open visualization.nbconvert.ipynb to see the updated visualizations (a minimal dashboard sketch follows).
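Since localhost:8050 is Dash's default port, the dashboard is presumably a Dash app. A minimal sketch with placeholder data, not the project's actual dashboard code:

```python
# Hypothetical Dash app sketch; the figure and data are placeholders.
import plotly.express as px
from dash import Dash, dcc, html

app = Dash(__name__)

# Placeholder figure; the real dashboard would build figures from the API data.
fig = px.bar(
    x=["Kathmandu", "Pokhara"],
    y=[120, 45],
    title="Businesses per city (placeholder data)",
)

app.layout = html.Div([
    html.H1("Nepal Yellow Pages dashboard"),
    dcc.Graph(figure=fig),
])

if __name__ == "__main__":
    # Serves on http://localhost:8050 by default (older Dash versions use app.run_server).
    app.run(debug=True)
```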