Add UTM important dates scraper #81
Changes from all commits
13fdda0
9b7ff62
d4a02d7
0b017b9
158d743
c2744a4
d2d2082
c006dbe
5bafb09
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,14 +4,76 @@ | |
import json | ||
import os | ||
import requests | ||
import datetime | ||
|
||
|
||
class UTMCalendar: | ||
'''Scraper for Important dates from UTM calendar found at https://www.utm.utoronto.ca/registrar/important-dates | ||
''' | ||
|
||
host = 'http://www.artsandscience.utoronto.ca/ofr/calendar/' | ||
|
||
link = 'http://m.utm.utoronto.ca/importantDates.php?mode=full&session={0}{1}&header=' | ||
sessionNumber = [5, 9] | ||
@staticmethod | ||
def scrape(location='.'): | ||
def scrape(location='.', year=None): #scrapes most current sessions by default | ||
|
||
year = year or datetime.datetime.now().year | ||
|
||
currentSession = "{0} SUMMER" | ||
calendar = OrderedDict() | ||
Scraper.logger.info('UTMCalendar initialized.') | ||
Scraper.logger.info('Not implemented.') | ||
for session in UTMCalendar.sessionNumber: | ||
html = Scraper.get(UTMCalendar.link.format(year, session)) | ||
soup = BeautifulSoup(html, 'html.parser') | ||
content = soup.find('div', class_='content') | ||
dates = content.find_all('div', class_='title') | ||
i = 0 | ||
currentDate = dates[i] | ||
while(i<len(dates)): | ||
date = dates[i].text | ||
events = [] | ||
while (currentDate == dates[i]): | ||
info = dates[i].find_next('div', class_='info') | ||
description = info.text | ||
eventStartEnd = date.split('-') #splits event dates over a period | ||
eventStart = UTMCalendar.convert_date(eventStartEnd[0].strip()) | ||
if len(eventStartEnd)>1: | ||
eventEnd = UTMCalendar.convert_date(eventStartEnd[1].strip()) | ||
else: | ||
eventEnd = eventStart | ||
|
||
events.append(OrderedDict([ | ||
('end_date', eventEnd), | ||
('session', currentSession.format(UTMCalendar.get_year_from(eventStart))), | ||
('campus', 'UTM'), | ||
('description', description) | ||
])) | ||
i+=1 | ||
if(i>=len(dates)): | ||
break; | ||
calendar[date] = OrderedDict([ | ||
('date', eventStart), | ||
('events', events) | ||
]) | ||
if(i<len(dates)): | ||
currentDate = dates[i] | ||
currentSession = "{0} FALL/WINTER" | ||
|
||
|
||
for date, info in calendar.items(): | ||
Scraper.save_json(info, location, UTMCalendar.convert_date(date)) | ||
|
||
Scraper.logger.info('UTMCalendar completed.') | ||
return calendar | ||
|
||
@staticmethod | ||
def convert_date(date): | ||
This would likely be useful for other scrapers too. Maybe consider moving it to the
@arkon I think there's the problem that dates are formatted differently for each scraper, so we'd still have to format the dates before we can use that method to convert them. Because of this, I'd personally prefer keeping an individual method for each scraper – you think it'd still be better to move it?
@kshvmdn Good point. Maybe we could add helper methods to alleviate the need for a lot of the date parsing boilerplate though. |
||
splitDate = date.split(' ') | ||
year = splitDate[2] | ||
day = splitDate[1].strip(',') | ||
month = datetime.datetime.strptime(splitDate[0], '%B').strftime('%m') | ||
return("{0}-{1}-{2}".format(year, month, day.zfill(2))) | ||
|
||
@staticmethod | ||
def get_year_from(date): | ||
splitDate = date.split('-') | ||
return splitDate[0] |
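On the date-parsing thread above, one way the suggested helper could look is sketched below. It assumes each scraper passes in its own format string, so the shared code only handles the conversion to `YYYY-MM-DD`. The name `to_iso_date` and the default format are hypothetical and not part of this PR; the default is inferred from how `convert_date` splits its input (for example `September 5, 2016`).

```python
from datetime import datetime


def to_iso_date(text, fmt='%B %d, %Y'):
    """Convert a date string to 'YYYY-MM-DD'.

    Hypothetical shared helper: `fmt` is the strptime format of the
    scraper's own date strings, so each scraper only supplies a format.
    """
    return datetime.strptime(text.strip(), fmt).strftime('%Y-%m-%d')


# A scraper with a different source format just passes its own `fmt`.
print(to_iso_date('September 5, 2016'))
print(to_iso_date('2016/01/15', fmt='%Y/%m/%d'))
```

This keeps the per-scraper differences in a single argument instead of a separate method per scraper, though `%B` depends on the process locale being English.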
This is redundant with the `while` loop's predicate, isn't it? Also, this could be a `for` loop instead.
It actually checks for the case where we reach the final date in the array. It's a kind of "hacky" way to keep the index from going out of range.
I originally used a for loop, but I thought a while loop was easier to implement and better suited to this case. Please let me know if I should change it.
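For reference, the `for` loop variant discussed here could be sketched with `itertools.groupby`, which groups consecutive entries sharing the same date and needs no manual index or bounds check. This is only an illustration under assumptions: `group_events` is a hypothetical name, and the input is simplified to `(date_text, description)` pairs with made-up sample values rather than the BeautifulSoup tags used in the diff.

```python
from collections import OrderedDict
from itertools import groupby


def group_events(pairs):
    """Group (date_text, description) pairs by date, preserving page order.

    Consecutive pairs with the same date text form one group, which
    mirrors what the nested while loops do with an explicit index.
    """
    calendar = OrderedDict()
    for date_text, group in groupby(pairs, key=lambda pair: pair[0]):
        calendar[date_text] = [description for _, description in group]
    return calendar


# Made-up sample data standing in for the scraped title/info divs.
sample = [
    ('September 5, 2016', 'Event A'),
    ('September 5, 2016', 'Event B'),
    ('September 6, 2016', 'Event C'),
]
for date_text, descriptions in group_events(sample).items():
    print(date_text, descriptions)
```

Because the loop ends when the iterator is exhausted, the end-of-list check inside the inner loop is no longer needed.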