Web scrapers for Machine Learning conference pages. Intended to support quantitative statistical analysis of the natural-language content of the research presented.
The codebase comprises a set of services managed by Docker. These are:
- A scrapy project containing the scraping code and associated execution environment
- A scrapy-splash server, running in the background, for scraping JavaScript-heavy pages (usually required on modern conference websites). Since AAAI2023 I have migrated SPA scraping to scrapy-playwright, which uses Playwright instead.
- A PostgreSQL database, running in the background, for storing the scraped conference data
- A pgAdmin server, running in the background, for querying and interacting with the above database from your web browser (optional)
- An analysis environment for interacting with the database in Python (e.g. via Jupyter notebooks) and doing some NLP/stats on its content
Each of these services has a corresponding container in the top-level compose file, which also defines environment variables, database credentials, etc. Together they provide a complete environment for running everything locally. In principle, individual components can also be spun up and used independently if you know what you're doing.
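For orientation, here is a minimal sketch of how code in the analysis environment might connect to the database using those credentials. The environment variable names, the `db` hostname, and the `papers` table are illustrative assumptions; check the compose file and the database readme for the values this repo actually uses.

```python
# Minimal sketch: connect to the conference database from the analysis container.
# The environment variable names, the service hostname ("db") and the table name
# ("papers") are assumptions -- see the compose file and database readme for the
# values actually used in this repo.
import os

import psycopg2

conn = psycopg2.connect(
    host=os.environ.get("POSTGRES_HOST", "db"),      # compose service name (assumed)
    port=int(os.environ.get("POSTGRES_PORT", 5432)),
    dbname=os.environ["POSTGRES_DB"],
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
)

with conn, conn.cursor() as cur:
    # Hypothetical query: count scraped papers per conference.
    cur.execute("SELECT conference, COUNT(*) FROM papers GROUP BY conference;")
    for conference, n_papers in cur.fetchall():
        print(f"{conference}: {n_papers} papers")

conn.close()
```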
To get started locally (assuming you have Docker and docker-compose installed):
Bring up the analysis environment by running:
docker-compose up -d analysis
This starts a notebook server; to access it, copy the URL printed in the logs:
docker-compose logs analysis
Bring up the complete environment (this is heavier on your computer) by running:
docker-compose up -d
from the root of the repo.
Once the environment is built and running, if you would like to:
- Run an existing scraper (and populate the database), implement a new scraper, or test/interactively scrape a conference website, refer to the scraper readme (a minimal spider sketch appears after this list)
- Perform analysis in python, refer to the analysis readme
- Interact directly with, or make changes to, the database, refer to the database readme
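The scraper readme is the authoritative reference for writing spiders; as a flavour of what one looks like, here is a minimal sketch. The start URL, CSS selectors and field names are placeholders, and the `"playwright": True` request meta only takes effect if the scrapy-playwright download handlers are enabled in the project settings (as noted above for SPA scraping since AAAI2023).

```python
# Minimal sketch of a conference spider, for flavour only -- the real spiders,
# item pipelines and settings live in the scrapy project (see the scraper readme).
# The start URL, CSS selectors and field names below are placeholders.
import scrapy


class ExampleConferenceSpider(scrapy.Spider):
    name = "example_conference"

    def start_requests(self):
        # "playwright": True asks scrapy-playwright to render the page in a
        # headless browser, which SPA-style conference sites typically need.
        # This assumes the scrapy-playwright download handlers are enabled in
        # the project settings.
        yield scrapy.Request(
            "https://example.org/conference/2023/accepted-papers",
            meta={"playwright": True},
        )

    def parse(self, response):
        for paper in response.css("div.paper"):
            yield {
                "title": paper.css("h3::text").get(),
                "authors": paper.css("span.authors::text").getall(),
                "abstract": paper.css("p.abstract::text").get(),
            }
```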