Add UTM important dates scraper #81

Merged 9 commits on May 15, 2016
69 changes: 65 additions & 4 deletions uoftscrapers/scrapers/calendar/utm.py
@@ -4,14 +4,75 @@
import json
import os
import requests
import datetime
now = datetime.datetime.now()


class UTMCalendar:
    '''Scraper for Important dates from UTM calendar found at https://www.utm.utoronto.ca/registrar/important-dates
    '''

    host = 'http://www.artsandscience.utoronto.ca/ofr/calendar/'

    summerLink = 'http://m.utm.utoronto.ca/importantDates.php?mode=full&session={0}5&header='
    fallLink = 'http://m.utm.utoronto.ca/importantDates.php?mode=full&session={0}9&header='
    sessionLinks = [summerLink, fallLink]
Contributor

You could just have one URL above, and have an array of [5, 9] so it's less repetitive.
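
A minimal sketch of that suggestion, assuming the two links differ only in the session suffix. The names `BASE_URL`, `SESSION_CODES`, and `build_session_urls` are hypothetical and not part of the PR:

```python
# Hypothetical refactor: one URL template plus a list of session codes.
BASE_URL = ('http://m.utm.utoronto.ca/importantDates.php'
            '?mode=full&session={year}{code}&header=')

# 5 and 9 are the session suffixes already used by summerLink and fallLink.
SESSION_CODES = [5, 9]


def build_session_urls(year):
    """Return one URL per session code for the given year."""
    return [BASE_URL.format(year=year, code=code) for code in SESSION_CODES]
```

Calling `build_session_urls(2016)` would give the same two URLs that `summerLink.format(2016)` and `fallLink.format(2016)` produce, so adding a session later only means adding a code to the list.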

    currentSession = "Summer"
Contributor

You can scrape this from the page.

Contributor Author

Are there any advantages to scraping it instead of just having it as is? Thanks.

    @staticmethod
    def scrape(location='.'):
    def scrape(location='.', year=None): #scrapes most current sessions by default

        year = year or now.year
Contributor

Call datetime.datetime.now() here instead if it's needed.
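
As a rough illustration of why this matters: a module-level `now` is evaluated once at import time, so a long-running process could keep a stale year. The sketch below looks the year up on each call instead; `resolve_year` is a hypothetical helper name, not something in the PR:

```python
import datetime


def resolve_year(year=None):
    """Return the given year, or the current year evaluated at call time."""
    if year is None:
        # Looked up on every call rather than once at import time.
        year = datetime.datetime.now().year
    return year
```

Inside `scrape`, the same effect could be had inline with `year = year or datetime.datetime.now().year`, which also removes the need for the module-level `now`.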


        calendar = OrderedDict()
        Scraper.logger.info('UTMCalendar initialized.')
        Scraper.logger.info('Not implemented.')
        for link in UTMCalendar.sessionLinks:
            html = Scraper.get(link.format(year))
            soup = BeautifulSoup(html, 'html.parser')
            content = soup.find('div', class_='content')
            dates = content.find_all('div', class_='title')
            i = 0
            currentDate = dates[i]
            while(i<len(dates)):
                date = dates[i].text
                events = []
                while (currentDate == dates[i]):
                    info = dates[i].find_next('div', class_='info')
                    description = info.text
                    eventStartEnd = date.split('-') #splits event dates over a period
                    eventStart = eventStartEnd[0].strip()
                    if len(eventStartEnd)>1:
                        eventEnd = eventStartEnd[1].strip()
                    else:
                        eventEnd = eventStart

                    events.append(OrderedDict([
                        ('end_date', eventEnd),
                        ('campus', 'UTM'),
                        ('description', description)
                    ]))
                    i+=1
                    if(i>=len(dates)):
Contributor

This is redundant with the while loop's predicate, isn't it? Also, this could be a for loop instead.

Contributor Author

It actually checks for the case where we reach the final date in the array. It's kind of a "hacky" way to keep the list index from going out of range.

I originally used a for loop but I thought a while loop was easier to implement and better in this case. Please let me know if I should change it.
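
For comparison, here is one way the grouping could be written with a `for` loop and no manual index, so the out-of-range check is not needed. This is only a sketch under assumptions: `rows` is a hypothetical list of `(date_text, description)` pairs standing in for the text of each `title` div and the `info` div that follows it, and `group_events_by_date` is not a function from the PR. It uses `itertools.groupby`, which groups consecutive items that share a key:

```python
from collections import OrderedDict
from itertools import groupby


def group_events_by_date(rows, session):
    """Group consecutive (date_text, description) pairs that share a date."""
    calendar = OrderedDict()
    for date_text, group in groupby(rows, key=lambda row: row[0]):
        # Split ranges such as "May 9, 2016 - May 13, 2016" into start and end.
        parts = [part.strip() for part in date_text.split('-')]
        start, end = parts[0], parts[-1]
        events = [
            OrderedDict([
                ('end_date', end),
                ('campus', 'UTM'),
                ('description', description),
            ])
            for _, description in group
        ]
        calendar[date_text] = OrderedDict([
            ('date', start),
            ('session', session),
            ('events', events),
        ])
    return calendar
```

Passing `session` as an argument would also avoid mutating `UTMCalendar.currentSession` between links, though that is a separate design choice from the loop itself.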

                        break;
                calendar[date] = OrderedDict([
                    ('date', eventStart),
                    ('session', UTMCalendar.currentSession),
                    ('events', events)
                ])
                if(i<len(dates)):
                    currentDate = dates[i]
            UTMCalendar.currentSession = "Fall/Winter"


        for date, info in calendar.items():
            Scraper.save_json(info, location, UTMCalendar.convert_date(date))

        Scraper.logger.info('UTMCalendar completed.')
        return calendar

    @staticmethod
    def convert_date(date):
Contributor

This would likely be useful for other scrapers too. Maybe consider moving it to the Scraper class?

Member @kashav, May 13, 2016

@arkon I think there's the problem that dates are formatted differently for each scraper, so we'd still have to format the dates before we can use that method to convert them. Because of this, I'd personally prefer keeping an individual method for each scraper – you think it'd still be better to move it?

Contributor

@kshvmdn Good point. Maybe we could add helper methods to alleviate the need for a lot of the date parsing boilerplate though.

        date_dict = {'January':'1', 'February':'2', 'March':'3', 'April':'4', 'May':'5', 'June':'6', 'July':'7',
                     'August':'8', 'September':'9', 'October':'10', 'November':'11', 'December':'12'}
        splitDate = date.split(' ')
        year = splitDate[2]
        day = splitDate[1].strip(',')
        month = date_dict[splitDate[0]]
        return("{0}-{1}-{2}".format(year, month, day))
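
On the helper-method idea from the thread above, one possible shape is a small function where each scraper supplies its own format string. This is a sketch rather than the project's API: `parse_date` and its `fmt` parameter are hypothetical, and `%B` assumes English month names in the current locale. It expects a single date such as `May 9, 2016`, so a range would need to be split first:

```python
from datetime import datetime


def parse_date(date_text, fmt='%B %d, %Y'):
    """Convert text such as 'May 9, 2016' to an ISO 8601 date string."""
    return datetime.strptime(date_text.strip(), fmt).date().isoformat()
```

Note that this returns zero-padded values (`2016-05-09`), whereas the `convert_date` above returns `2016-5-9`, so file names produced by `save_json` would change if it were swapped in.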