-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Similar Courses Algorithm #470
Conversation
[diff-counting] Significant lines: 211. |
I see that you mentioned storing the pre-processed + vectors for each course in our database. I am going to comment and say we might want to create a new collection (table) for this instead of putting it directly in the course collections itself. ie. courses collection = [] query between both of these mongo collections unless the pre-processing data isn't too large |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Jacqueline! Amazing work on the course similarity algorithm -- I'd love to chat with you to learn more about what goes into the preprocessing and computing similarity, and we could see if there are ways to improve accuracy and the score returned by the functions. If my comments are irrelevant, definitely disregard :)
* @param description A course description that needs to be preprocessed | ||
* @returns The processed description for a course | ||
*/ | ||
export const preprocess = (description: string) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed in the testing picture you uploaded that there are some strange text breaks or punctuation that now occurs between words? Not sure if that's still happening
export const preprocess = (description: string) => { | ||
let sentences = description.match(/[^.!?]*[.!?]\s+[A-Z]/g) || [description]; | ||
let processedText = sentences.map(sentence => { | ||
let words = sentence.match(/\b\w+\b/g) || []; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thought I had about preprocessing was getting rid of "filler words," (i.e. and, the, to, for, with...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice idea! Also i saw "this" and maybe any pronouns?
const descriptions = ["This course provides a detailed study on multiple financial markets including bonds, forwards, futures, swaps, and options and their role in addressing major issues facing humanity. In particular, we plan to study specific topics on the role of financial markets in addressing important issues like funding cancer cure, tackling climate change, and financing educational needs for the underserved. Relative to a traditional finance class, we take a broad approach and think of finance as a way to get things done and financial instruments as a way to solve problems. We exploreΒ topics related to diversification and purpose investing, including a highly innovative idea of a mega-fund developing cancer treatment. We examine how financial instruments can help solve or hedge some societal issues, particularly on climate change. As an example, we will beΒ studying a financial solution to deal with California forest fire.Β We also examine the potential for social impact bonds for educating pre-school children and reducing prisoners' recidivism.", | ||
"This course introduces and develops the leading modern theories of economies open to trade in financial assets and real goods. The goal is to understand how cross-country linkages in influence macroeconomic developments within individual countries; how financial markets distribute risk and wealth around the world; and how trade changes the effectiveness of national monetary and fiscal policies. In exploring these questions, we emphasize the role that exchange rates and exchange rate policy take in shaping the consequences of international linkages. We apply our theories to current and recent events, including growing geoeconomic conflict between Eastern and Western countries, hyperinflation in Argentina, Brexit, and recent Euro-area debt crises.", | ||
"The Corporate Finance Immersion (CFI) Practicum is designed to provide students with a real world and practical perspective on the activities, processes and critical questions faced by corporate finance executives. It is oriented around the key principles of shareholder value creation and the skills and processes corporations use to drive value. The CFI Practicum will help develop skills and executive judgement for students seeking roles in corporate finance, corporate strategy, business development, financial planning, treasury, and financial management training programs. The course can also help students pursuing consulting to sharpen their financial skills and get an excellent view of a corporation's strategic and financial objectives.Β The practicum will be comprised of a mix of lectures, cases, guest speakers, and team projects. Additionally, there will be training workshops to build your financial modelling skills.", | ||
"Environmental Finance & Impact Investing Practicum", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these test cases (the short ones that are just the course name) meant to be tested for similarity against the longer descriptions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are just courses without descriptions on the course roster API, so I used the course title as a filler for now.
* Gets the processed description to use for the similarity algorithm | ||
* Currently used for testing | ||
*/ | ||
courseRouter.post('/getPreDesc', async (req, res) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there be any errors to catch here? Also I think that a route name more like /preprocess
or /preprocess-desc
might fit more with our naming theme
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep we should handle errors at the router level
* @body courseId: a course's id field | ||
* Gets the array of the top 5 similar courses for the course with id = courseId | ||
*/ | ||
courseRouter.post('/getSimilarity', async (req, res) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, /api/courses/get/similarity
or something similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your hard work on the search algorithm Jacqueline! The similarity scores tested look really good and this will definitely add a lot of improvement to the results generated by users as they search. I left just a few comments!
if (idf && idf[term] === undefined) { | ||
idf[term] = 1; | ||
} | ||
d[term] *= idf[term]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you could also normalize by dividing by term frequency here to make sure that the tfidf score accounts for different lengths of the documents to reflect an accurate importance for each term no matter document length.
const dotProduct = dot(vecA, vecB); | ||
const magA = norm(vecA); | ||
const magB = norm(vecB); | ||
return dotProduct / (magA * magB); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you could also add a check here in case magA or magB is 0 to avoid dividing by 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome - I think this serves as a good start for the algorithm!
Q1. In the example descriptions, there are ones with just the title if there is no description? So you plan to just use the course title if there is no description?
Q1.1 But we don't use the course title for courses that have a description? Do you think we should use the course title + description (can be empty) in our vector to calculate similarity?
_id: { type: String }, | ||
classSub: { type: String }, | ||
classNum: { type: String }, | ||
processedDescriptions: { type: String }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be plural?
export const preprocess = (description: string) => { | ||
let sentences = description.match(/[^.!?]*[.!?]\s+[A-Z]/g) || [description]; | ||
let processedText = sentences.map(sentence => { | ||
let words = sentence.match(/\b\w+\b/g) || []; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice idea! Also i saw "this" and maybe any pronouns?
* Gets the processed description to use for the similarity algorithm | ||
* Currently used for testing | ||
*/ | ||
courseRouter.post('/getPreDesc', async (req, res) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep we should handle errors at the router level
Summary
This PR works on the similar courses feature.
PR Type
Mobile + Desktop Screenshots & Recordings
Preprocessing descriptions:
Sample similarity score array with hard-coded course description inputs:
QA - Test Plan
All testing is currently manual.
http://localhost:8080/api/courses/getPreDesc
- input a course description into the request bodyhttp://localhost:8080/api/courses/getSimilarity
- this currently tests the similarity between 5 different finance-related courses at CornellBreaking Changes & Notes
Added to documentation?
What GIF represents this PR?
gif