PSU Club Crawler

This simple script will crawl the PSU Student Affairs website and pull out all the data about student clubs:

http://studentaffairs.psu.edu/hub/studentorgs/orgdirectory/

Scraping this site is awkward because of the way its pages are generated. The site appears to keep stateful information that makes plain POST requests ineffective, and none of the links have href attributes; they are handled by a JavaScript callback instead.

So our approach is to use PhantomJS to navigate to each page the way a browser would, relying on the session and JavaScript support that it provides.
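The navigation loop described above might look roughly like the sketch below. It is not the code in crawl.rb. It assumes a `session` object with a Capybara-style interface (`visit`, `all`, `find`), for example a Capybara session backed by a PhantomJS driver such as Poltergeist. The method name `crawl_clubs` and both CSS selectors are hypothetical placeholders.

```ruby
DIRECTORY_URL = "http://studentaffairs.psu.edu/hub/studentorgs/orgdirectory/"

# Sketch only: `session` is assumed to respond to visit, all, and find
# the way a Capybara session does. The selectors are hypothetical and
# must be replaced with ones that match the real page.
def crawl_clubs(session, link_selector:, detail_selector:)
  session.visit(DIRECTORY_URL)
  count = session.all(link_selector).size

  (0...count).map do |index|
    # Reload the directory and look the links up again on every pass,
    # because nodes found before a navigation are no longer valid.
    session.visit(DIRECTORY_URL)
    link = session.all(link_selector)[index]
    name = link.text

    # Clicking runs the JavaScript callback that a plain HTTP request
    # cannot trigger.
    link.click

    { name: name, details: session.find(detail_selector).text }
  end
end
```

Passing the session in as an argument keeps the loop independent of the browser driver, so it can be exercised with a stand-in object before pointing it at the live site.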

Prerequisites

Ruby
PhantomJS

Run it

bundle install
ruby crawl.rb

The run can take 30 minutes. When it finishes, you'll have a CSV file with all the PSU club data.
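As an illustration of the CSV step, here is a minimal sketch using Ruby's standard csv library. The method name `write_clubs_csv`, the column names, and the hash keys are assumptions for the example; the actual columns produced by crawl.rb may differ.

```ruby
require "csv"

# Sketch only: writes one row per club. The name/details columns are
# hypothetical stand-ins for whatever fields the crawler collects.
def write_clubs_csv(path, clubs)
  CSV.open(path, "w") do |csv|
    csv << %w[name details]                  # header row
    clubs.each do |club|
      csv << [club[:name], club[:details]]   # quoting is handled by CSV
    end
  end
end
```

Using the CSV library rather than joining strings by hand means club names that contain commas or quotes are escaped correctly.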
