Add UTM important dates scraper #81
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,14 +4,75 @@ | |
import json | ||
import os | ||
import requests | ||
import datetime | ||
now = datetime.datetime.now() | ||
|
||
|
||
class UTMCalendar: | ||
'''Scraper for Important dates from UTM calendar found at https://www.utm.utoronto.ca/registrar/important-dates | ||
''' | ||
|
||
host = 'http://www.artsandscience.utoronto.ca/ofr/calendar/' | ||
|
||
summerLink = 'http://m.utm.utoronto.ca/importantDates.php?mode=full&session={0}5&header=' | ||
fallLink = 'http://m.utm.utoronto.ca/importantDates.php?mode=full&session={0}9&header=' | ||
sessionLinks = [summerLink, fallLink] | ||
currentSession = "Summer" | ||
You can scrape this from the page.
Are there any advantages to scraping it instead of just having it as is? Thanks. |
||
@staticmethod | ||
- def scrape(location='.'): | ||
def scrape(location='.', year=None): #scrapes most current sessions by default | ||
|
||
year = year or now.year | ||
Call |
||
|
||
calendar = OrderedDict() | ||
Scraper.logger.info('UTMCalendar initialized.') | ||
- Scraper.logger.info('Not implemented.') | ||
for link in UTMCalendar.sessionLinks: | ||
html = Scraper.get(link.format(year)) | ||
soup = BeautifulSoup(html, 'html.parser') | ||
content = soup.find('div', class_='content') | ||
dates = content.find_all('div', class_='title') | ||
i = 0 | ||
currentDate = dates[i] | ||
while(i<len(dates)): | ||
date = dates[i].text | ||
events = [] | ||
while (currentDate == dates[i]): | ||
info = dates[i].find_next('div', class_='info') | ||
description = info.text | ||
eventStartEnd = date.split('-') #splits event dates over a period | ||
eventStart = eventStartEnd[0].strip() | ||
if len(eventStartEnd)>1: | ||
eventEnd = eventStartEnd[1].strip() | ||
else: | ||
eventEnd = eventStart | ||
|
||
events.append(OrderedDict([ | ||
('end_date', eventEnd), | ||
('campus', 'UTM'), | ||
('description', description) | ||
])) | ||
i+=1 | ||
if(i>=len(dates)): | ||
This is redundant with the
It actually checks for the case where we reach the final date in the array. It's kind of a "hacky" way to keep the index from going out of range. I originally used a for loop, but I thought a while loop was easier to implement and better in this case. Please let me know if I should change it. |
||
break; | ||
calendar[date] = OrderedDict([ | ||
('date', eventStart), | ||
('session', UTMCalendar.currentSession), | ||
('events', events) | ||
]) | ||
if(i<len(dates)): | ||
currentDate = dates[i] | ||
UTMCalendar.currentSession = "Fall/Winter" | ||
|
||
|
||
for date, info in calendar.items(): | ||
Scraper.save_json(info, location, UTMCalendar.convert_date(date)) | ||
|
||
Scraper.logger.info('UTMCalendar completed.') | ||
return calendar | ||
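The manual index and the extra bounds check discussed in the review thread above could possibly be avoided by grouping consecutive rows that share a date with `itertools.groupby`. The following is only a minimal sketch, assuming the page has already been reduced to `(date_text, description)` pairs; `split_range` and `group_events` are hypothetical helper names that do not appear in this PR.

```python
from collections import OrderedDict
from itertools import groupby


def split_range(date_text):
    """Split 'start - end' into (start, end); a single date is its own end."""
    parts = [part.strip() for part in date_text.split('-')]
    start = parts[0]
    end = parts[1] if len(parts) > 1 else start
    return start, end


def group_events(rows, session):
    """Group consecutive (date_text, description) rows that share a date.

    `rows` is an assumed intermediate form; the PR reads the same
    information from the 'title' and 'info' divs directly.
    """
    calendar = OrderedDict()
    for date_text, group in groupby(rows, key=lambda row: row[0]):
        start, end = split_range(date_text)
        events = [
            OrderedDict([
                ('end_date', end),
                ('campus', 'UTM'),
                ('description', description),
            ])
            for _, description in group
        ]
        calendar[date_text] = OrderedDict([
            ('date', start),
            ('session', session),
            ('events', events),
        ])
    return calendar


# Illustrative input only; these rows are made up for the example.
example_rows = [
    ('May 1, 2016', 'First event'),
    ('May 1, 2016', 'Second event'),
    ('May 2, 2016 - May 6, 2016', 'Multi-day event'),
]
example_calendar = group_events(example_rows, 'Summer')
```

Because `groupby` stops each group when the key changes and ends with the input, no explicit out-of-range check is needed.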
|
||
@staticmethod | ||
def convert_date(date): | ||
This would likely be useful for other scrapers too. Maybe consider moving it to the
@arkon I think the problem is that dates are formatted differently for each scraper, so we'd still have to format the dates before we could use that method to convert them. Because of this, I'd personally prefer keeping an individual method for each scraper. Do you think it'd still be better to move it?
@kshvmdn Good point. Maybe we could add helper methods to cut down on the date-parsing boilerplate, though. |
||
date_dict = {'January':'1', 'February':'2', 'March':'3', 'April':'4', 'May':'5', 'June':'6', 'July':'7', | ||
'August':'8', 'September':'9', 'October':'10', 'November':'11', 'December':'12'} | ||
splitDate = date.split(' ') | ||
year = splitDate[2] | ||
day = splitDate[1].strip(',') | ||
month = date_dict[splitDate[0]] | ||
return("{0}-{1}-{2}".format(year, month, day)) |
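The shared helper floated in the thread above could perhaps take the date format as a parameter, so each scraper passes its own format string instead of maintaining a month lookup table. This is a sketch under that assumption; `to_iso_date` is a hypothetical name, and unlike `convert_date` it returns zero-padded ISO dates (`2016-05-01` rather than `2016-5-1`).

```python
import datetime


def to_iso_date(text, fmt='%B %d, %Y'):
    """Parse `text` using `fmt` and return it as 'YYYY-MM-DD'.

    The default format matches strings such as 'May 1, 2016'.
    Month-name parsing with %B depends on the active locale; an
    English (or default C) locale is assumed here.
    """
    parsed = datetime.datetime.strptime(text.strip(), fmt)
    return parsed.date().isoformat()


# Each scraper would supply the format its source page uses.
utm_style = to_iso_date('September 12, 2016')
other_style = to_iso_date('2016/09/12', fmt='%Y/%m/%d')
```

A format that does not match raises `ValueError`, which may be preferable to the `KeyError` or `IndexError` that a hand-rolled split would produce on unexpected input.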
You could just have one URL above, and have an array of [5, 9] so it's less repetitive.
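A rough sketch of that suggestion, assuming the session codes 5 and 9 map to the "Summer" and "Fall/Winter" labels used in the diff; `BASE_LINK`, `SESSION_CODES`, and `session_links` are hypothetical names, not part of the PR.

```python
from collections import OrderedDict

# Single URL template; only the year and the session code vary.
BASE_LINK = ('http://m.utm.utoronto.ca/importantDates.php'
             '?mode=full&session={year}{code}&header=')

# Session code -> label, in the order the sessions are scraped.
SESSION_CODES = OrderedDict([
    (5, 'Summer'),
    (9, 'Fall/Winter'),
])


def session_links(year):
    """Return (session_name, url) pairs for the given year."""
    return [
        (name, BASE_LINK.format(year=year, code=code))
        for code, name in SESSION_CODES.items()
    ]


for session_name, url in session_links(2016):
    print(session_name, url)
```

Pairing each link with its session name would also remove the need to mutate `UTMCalendar.currentSession` partway through the loop.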