WebCrawler

Submission of Team MA-217164 to the Web Crawler competition organised by TechFest, IIT Bombay.

Contributed by:


Problem Statement

We had to develop a web crawler that checks a website for the following key components:

  • SSL certificate compliance – Check every link on the site for a secure URL (all hyperlinks should use https://), and verify that the site's SSL certificate is valid.
  • Cookie checker – Scan and verify the cookies used by the website, along with its cookie consent verification links.
  • ADA compliance
    • Alt text in all images.
    • Color contrast for the site as per w3.org guidelines.
    • Accessibility issues: check the site markup for null tab index.
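
The "all hyperlinks should be https://" requirement can be illustrated with a minimal sketch. It assumes the hrefs have already been extracted from the page. The function name `insecure_links` and the example URLs are hypothetical and are not taken from the repository.

```python
from urllib.parse import urljoin, urlparse


def insecure_links(base_url, hrefs):
    """Return the absolute URLs among hrefs that use plain http://."""
    flagged = []
    for href in hrefs:
        # Resolve relative links against the page URL before checking.
        absolute = urljoin(base_url, href)
        # Only plain HTTP is flagged; mailto:, tel: and similar are ignored.
        if urlparse(absolute).scheme == "http":
            flagged.append(absolute)
    return flagged
```

Relative links inherit the scheme of the page they appear on, so a page served over https:// only gets flagged for links that spell out http:// explicitly.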

Approach

  • The user can run the program individually for each type of problem. There is also a combined script that executes all the tasks.
  • We have used streamlit to render the results in a web interface instead of displaying them in the terminal. The SSL certificate details (if enabled), the cookies present, the verification attribute, information about null tab index, and the image tags without alt text are displayed at a local URL.
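
As a rough illustration of what a cookie report could contain, the sketch below parses a single Set-Cookie header value with the standard library and lists its security flags. This README does not specify which cookie attributes the script actually verifies, so the choice of Secure and HttpOnly is an assumption. The function name `summarize_set_cookie` is hypothetical too.

```python
from http.cookies import SimpleCookie


def summarize_set_cookie(header_value):
    """Parse one Set-Cookie header value and report its security flags."""
    jar = SimpleCookie()
    jar.load(header_value)
    report = []
    for name, morsel in jar.items():
        report.append({
            "name": name,
            # Flag attributes are True when present, "" when absent.
            "secure": bool(morsel["secure"]),
            "httponly": bool(morsel["httponly"]),
            "path": morsel["path"],
        })
    return report
```

A list of dictionaries like this one could then be passed to a table widget in the web interface.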

Libraries used:

ssl
socket
prettytable
streamlit
beautifulsoup
requests
urllib
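
The ssl and socket modules listed above can be combined to read a site's certificate and check how long it remains valid. The following is a minimal sketch under that assumption, not the code from script.py. The helper names `fetch_certificate` and `days_until_expiry` are hypothetical.

```python
import socket
import ssl
import time


def fetch_certificate(host, port=443, timeout=5.0):
    """Open a TLS connection and return the peer certificate as a dict.

    The default context verifies the certificate chain and the hostname,
    so an invalid certificate raises ssl.SSLCertVerificationError here.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def days_until_expiry(cert, now=None):
    """Return the number of days until the certificate's notAfter date."""
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    if now is None:
        now = time.time()
    return (expires - now) / 86400
```

Calling `days_until_expiry(fetch_certificate("github.com"))` would need network access. A negative result would mean the certificate has already expired.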

Usage

  • Clone the repository: git clone https://github.com/Rajarshi1001/webCrawler.git
  • Install the requirements: pip install -r requirements.txt
  • Install streamlit: pip install streamlit
  • Specify the URL using the --link option while executing the script.

This script displays the SSL details, the verification and details of the cookies used by the website, the img tags without alt text, and null tab index (e.g. for https://github.com).

Run the following command:

py -m streamlit run script.py -- --link https://github.com
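
Streamlit forwards everything after the bare `--` to the script itself, so the `--link` option can be read with an ordinary argument parser. The sketch below shows one way to do that. It is an assumption about how script.py might handle the option, and the function name `parse_args` is hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the script's own options; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Website compliance checks")
    parser.add_argument("--link", required=True, help="URL of the site to check")
    return parser.parse_args(argv)
```

With the command above, `parse_args().link` would be expected to hold `https://github.com`.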

For alt text:

First head to cd .\webCralTF\webCralTF\ (yes, twice), then run:

scrapy crawl spidey
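
The spider's code is not shown in this README. As a standalone illustration of the alt-text and null tab index checks, here is a minimal sketch that uses only the standard-library HTML parser instead of Scrapy. It assumes that "null tab index" means a tabindex attribute with no value or an empty value, and it also flags images whose alt attribute is empty. The names `AccessibilityScanner` and `scan` are hypothetical.

```python
from html.parser import HTMLParser


class AccessibilityScanner(HTMLParser):
    """Collect img tags lacking alt text and tags with an empty tabindex."""

    def __init__(self):
        super().__init__()
        self.images_missing_alt = []
        self.null_tabindex_tags = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Assumption: a missing or blank alt attribute counts as a problem.
        if tag == "img" and not (attributes.get("alt") or "").strip():
            self.images_missing_alt.append(attributes.get("src"))
        # A valueless attribute is reported by the parser as None.
        if "tabindex" in attributes and not (attributes["tabindex"] or "").strip():
            self.null_tabindex_tags.append(tag)


def scan(html):
    """Return (sources of images missing alt text, tags with null tabindex)."""
    scanner = AccessibilityScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.images_missing_alt, scanner.null_tabindex_tags
```

An empty alt attribute is sometimes used deliberately for decorative images, so a stricter or looser rule may suit a given site better.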

For color contrast:

Now head to the directory containing colContr.py, then run:

python colContr.py
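
For reference, the w3.org (WCAG 2.x) contrast ratio between two colors is computed from their relative luminance. The sketch below implements that published formula. It is not the code from colContr.py, and the function names are hypothetical.

```python
def _linearize(channel):
    """Convert an 8-bit sRGB channel value to its linear-light value."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) tuple with values in 0..255."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """Contrast ratio between two colors, ranging from 1 to 21."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground, background, large_text=False):
    """Check the WCAG AA threshold: 4.5 for normal text, 3 for large text."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)
```

Black text on a white background gives the maximum ratio of 21, while light gray text on white falls well below the AA threshold.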