This is a crawler that crawls the website https://sbmurban.org/rrr-centers and extracts all of its data.
- This crawler was built as a task for ATLAN.
- The complete data from the website is crawled and stored in a single CSV file.
- This was a new task, and I learned how to scrape ASP.NET websites that use __VIEWSTATE (I used https://blog.scrapinghub.com/2016/04/20/scrapy-tips-from-the-pros-april-2016-edition as a tutorial on how to crawl ASP.NET websites); a sketch of the technique is included after this list.
- At the end of the data scraping, a POST request is made to the specified URL; a sketch of this is also included after this list.
- A setup.py file has also been added.
- The extracted file contains the following columns:
- State
- District
- ULB Name
- Ward
- No. of Applications Received
- No. of Applications Not Verified
- No. of Applications Verified
- No. of Applications Approved
- No. of Applications Approved having Aadhar No.
- No. of Applications Rejected
- No. of Applications Pullback
- No. of Applications Closed
- No. of Constructed Toilet Photo
- No. of Commenced Toilet Photo
- No. of Constructed Toilet Photo through Swachhalaya
- DOUBT=> How can we make a POST request every 5 minutes, when the crawl itself takes much longer than that?
ASSUMPTION=> To make the POST request every 5 minutes, we can deploy the project to Scrapinghub and schedule it to run every 5 minutes. The crawler is built so that it makes a POST request on completing the crawl, so the data is posted automatically.
- DOUBT=> How many output files are required? One file containing all the information, or 4 files containing information for the 4 different levels (State, District, ULB, and Ward)?
ASSUMPTION=> I have made only 1 CSV file, containing all the information in the table shown in the task, since the other levels can easily be derived from that file (see the aggregation sketch at the end of this README).
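
Below is a minimal sketch of the __VIEWSTATE technique mentioned above: an ASP.NET postback is reproduced by echoing the page's hidden form fields back in a POST. The control id and page argument in `formdata` are hypothetical placeholders, not necessarily the ones the actual spider uses.

```python
# Minimal sketch of paginating an ASP.NET page that relies on __VIEWSTATE.
# Field values marked "hypothetical" are illustrative only.
import scrapy


class ViewStateSpider(scrapy.Spider):
    name = "viewstate_example"
    start_urls = ["https://sbmurban.org/rrr-centers"]

    def parse(self, response):
        # FormRequest.from_response() copies the hidden inputs
        # (__VIEWSTATE, __EVENTVALIDATION, ...) into the POST body,
        # which the server expects on every postback.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                "__EVENTTARGET": "ctl00$ContentPlaceHolder1$grid",  # hypothetical control id
                "__EVENTARGUMENT": "Page$2",                        # hypothetical page argument
            },
            callback=self.parse_page,
        )

    def parse_page(self, response):
        # Extract the table rows from the posted-back page here.
        for row in response.css("table tr"):
            yield {"cells": row.css("td::text").getall()}
```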
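
The final POST on crawl completion can be wired up through Scrapy's spider_closed signal; the sketch below assumes the data is uploaded with the `requests` library. The endpoint URL and upload field name are placeholders for the URL specified in the task.

```python
# Sketch of posting the scraped data once the crawl finishes.
# The target URL and file path are placeholders.
import requests
import scrapy
from scrapy import signals


class SwatchBharatSpider(scrapy.Spider):
    name = "swatchbharaturban_crawler"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Run spider_closed() when the crawl has completely finished.
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # Send the finished CSV to the specified endpoint in a single POST.
        with open("swatchbharaturban_crawler/data/swatchbharat_data.csv", "rb") as f:
            requests.post(
                "https://example.com/upload",  # placeholder for the specified URL
                files={"file": ("swatchbharat_data.csv", f, "text/csv")},
            )
```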
- Clone the repository
git clone https://github.com/sagar-sehgal/SwatchBharaturban_Crawler
- Make a Virtual Environment
virtualenv venv --python=python3
- Activate the virtualenv
source venv/bin/activate
- Change into the repository directory
cd SwatchBharaturban_Crawler
- Install the dependencies
pip install -r requirements.txt
- Run the Crawler
scrapy crawl swatchbharaturban_crawler
The crawled data will be stored in the swatchbharaturban_crawler/data/swatchbharat_data.csv file.
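
Since only one ward-level CSV is produced (see the assumption above), coarser State-, District-, and ULB-level summaries can be derived from it. The following is a small sketch using pandas; the exact column header strings are assumed to match the list earlier in this README.

```python
# Sketch: roll the ward-level CSV up to State, District, and ULB level.
# Column names are assumed to match the list in this README.
import pandas as pd

df = pd.read_csv("swatchbharaturban_crawler/data/swatchbharat_data.csv")

# All the count columns start with "No. of" in the column list above.
numeric_cols = [c for c in df.columns if c.startswith("No. of")]

state_level = df.groupby("State")[numeric_cols].sum()
district_level = df.groupby(["State", "District"])[numeric_cols].sum()
ulb_level = df.groupby(["State", "District", "ULB Name"])[numeric_cols].sum()

print(state_level.head())
```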