Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Similar Courses Algorithm #470

Merged
merged 6 commits into from
Nov 20, 2024
Merged

Similar Courses Algorithm #470

merged 6 commits into from
Nov 20, 2024

Conversation

jacquelinecai
Copy link
Contributor

Summary

This PR works on the similar courses feature.

PR Type

  • πŸ• Feature
  • πŸ› Bug Fix
  • πŸ“ Documentation Update
  • 🎨 Style
  • πŸ§‘β€πŸ’» Code Refactor
  • πŸ”₯ Performance Improvements
  • βœ… Test
  • πŸ€– Build
  • πŸ” CI
  • πŸ“¦ Chore (Release)
  • ⏩ Revert

Mobile + Desktop Screenshots & Recordings

Preprocessing descriptions:
image

Sample similarity score array with hard-coded course description inputs:
image

QA - Test Plan

All testing is currently manual.

  • POST http://localhost:8080/api/courses/getPreDesc - input a course description into the request body
  • POST http://localhost:8080/api/courses/getSimilarity - this currently tests the similarity between 5 different finance-related courses at Cornell

Breaking Changes & Notes

  • Implemented cosine similarity and TF-IDF algorithms for the course descriptions
  • Added preprocessing steps to increase accuracy of similarity measures
  • Sorted and sliced similarity array to output top 5 most similar courses
  • Next steps include preprocessing all descriptions, generating the TF-IDF vectors for each of them, and adding this information into new database fields

Added to documentation?

  • πŸ“œ README.md
  • πŸ““ notion
  • πŸ• codebase
  • πŸ“• ...
  • πŸ™… no documentation needed

What GIF represents this PR?

gif

@jacquelinecai jacquelinecai requested a review from a team as a code owner October 26, 2024 05:50
@dti-github-bot
Copy link
Member

dti-github-bot commented Oct 26, 2024

[diff-counting] Significant lines: 211.

@wizhaaa
Copy link
Collaborator

wizhaaa commented Oct 26, 2024

I see that you mentioned storing the pre-processed + vectors for each course in our database. I am going to comment and say we might want to create a new collection (table) for this instead of putting it directly in the course collections itself.

ie.

courses collection = []
course_recomendation_metadata = []

query between both of these mongo collections unless the pre-processing data isn't too large

Copy link
Contributor

@qiandrewj qiandrewj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Jacqueline! Amazing work on the course similarity algorithm -- I'd love to chat with you to learn more about what goes into the preprocessing and computing similarity, and we could see if there are ways to improve accuracy and the score returned by the functions. If my comments are irrelevant, definitely disregard :)

* @param description A course description that needs to be preprocessed
* @returns The processed description for a course
*/
export const preprocess = (description: string) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed in the testing picture you uploaded that there are some strange text breaks or punctuation that now occurs between words? Not sure if that's still happening

export const preprocess = (description: string) => {
let sentences = description.match(/[^.!?]*[.!?]\s+[A-Z]/g) || [description];
let processedText = sentences.map(sentence => {
let words = sentence.match(/\b\w+\b/g) || [];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought I had about preprocessing was getting rid of "filler words," (i.e. and, the, to, for, with...)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice idea! Also i saw "this" and maybe any pronouns?

const descriptions = ["This course provides a detailed study on multiple financial markets including bonds, forwards, futures, swaps, and options and their role in addressing major issues facing humanity. In particular, we plan to study specific topics on the role of financial markets in addressing important issues like funding cancer cure, tackling climate change, and financing educational needs for the underserved. Relative to a traditional finance class, we take a broad approach and think of finance as a way to get things done and financial instruments as a way to solve problems. We exploreΒ topics related to diversification and purpose investing, including a highly innovative idea of a mega-fund developing cancer treatment. We examine how financial instruments can help solve or hedge some societal issues, particularly on climate change. As an example, we will beΒ studying a financial solution to deal with California forest fire.Β We also examine the potential for social impact bonds for educating pre-school children and reducing prisoners' recidivism.",
"This course introduces and develops the leading modern theories of economies open to trade in financial assets and real goods. The goal is to understand how cross-country linkages in influence macroeconomic developments within individual countries; how financial markets distribute risk and wealth around the world; and how trade changes the effectiveness of national monetary and fiscal policies. In exploring these questions, we emphasize the role that exchange rates and exchange rate policy take in shaping the consequences of international linkages. We apply our theories to current and recent events, including growing geoeconomic conflict between Eastern and Western countries, hyperinflation in Argentina, Brexit, and recent Euro-area debt crises.",
"The Corporate Finance Immersion (CFI) Practicum is designed to provide students with a real world and practical perspective on the activities, processes and critical questions faced by corporate finance executives. It is oriented around the key principles of shareholder value creation and the skills and processes corporations use to drive value. The CFI Practicum will help develop skills and executive judgement for students seeking roles in corporate finance, corporate strategy, business development, financial planning, treasury, and financial management training programs. The course can also help students pursuing consulting to sharpen their financial skills and get an excellent view of a corporation's strategic and financial objectives.Β The practicum will be comprised of a mix of lectures, cases, guest speakers, and team projects. Additionally, there will be training workshops to build your financial modelling skills.",
"Environmental Finance & Impact Investing Practicum",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these test cases (the short ones that are just the course name) meant to be tested for similarity against the longer descriptions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are just courses without descriptions on the course roster API, so I used the course title as a filler for now.

* Gets the processed description to use for the similarity algorithm
* Currently used for testing
*/
courseRouter.post('/getPreDesc', async (req, res) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be any errors to catch here? Also I think that a route name more like /preprocess or /preprocess-desc might fit more with our naming theme

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep we should handle errors at the router level

* @body courseId: a course's id field
* Gets the array of the top 5 similar courses for the course with id = courseId
*/
courseRouter.post('/getSimilarity', async (req, res) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, /api/courses/get/similarity or something similar

Copy link
Contributor

@leihelen leihelen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your hard work on the search algorithm Jacqueline! The similarity scores tested look really good and this will definitely add a lot of improvement to the results generated by users as they search. I left just a few comments!

if (idf && idf[term] === undefined) {
idf[term] = 1;
}
d[term] *= idf[term];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you could also normalize by dividing by term frequency here to make sure that the tfidf score accounts for different lengths of the documents to reflect an accurate importance for each term no matter document length.

const dotProduct = dot(vecA, vecB);
const magA = norm(vecA);
const magB = norm(vecB);
return dotProduct / (magA * magB);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you could also add a check here in case magA or magB is 0 to avoid dividing by 0.

Copy link
Collaborator

@wizhaaa wizhaaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome - I think this serves as a good start for the algorithm!

Q1. In the example descriptions, there are ones with just the title if there is no description? So you plan to just use the course title if there is no description?

Q1.1 But we don't use the course title for courses that have a description? Do you think we should use the course title + description (can be empty) in our vector to calculate similarity?

_id: { type: String },
classSub: { type: String },
classNum: { type: String },
processedDescriptions: { type: String },
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be plural?

export const preprocess = (description: string) => {
let sentences = description.match(/[^.!?]*[.!?]\s+[A-Z]/g) || [description];
let processedText = sentences.map(sentence => {
let words = sentence.match(/\b\w+\b/g) || [];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice idea! Also i saw "this" and maybe any pronouns?

* Gets the processed description to use for the similarity algorithm
* Currently used for testing
*/
courseRouter.post('/getPreDesc', async (req, res) => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep we should handle errors at the router level

@jacquelinecai jacquelinecai merged commit 51c42cb into main Nov 20, 2024
4 checks passed
@jacquelinecai jacquelinecai deleted the jacqueline/rec-algo branch November 20, 2024 23:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants