Similar Courses Algorithm #470

jacquelinecai · 2024-10-26T05:50:39Z

Summary

This PR works on the similar courses feature.

PR Type

Mobile + Desktop Screenshots & Recordings

Preprocessing descriptions:

Sample similarity score array with hard-coded course description inputs:

QA - Test Plan

All testing is currently manual.

POST http://localhost:8080/api/courses/getPreDesc - input a course description into the request body
POST http://localhost:8080/api/courses/getSimilarity - this currently tests the similarity between 5 different finance-related courses at Cornell

Breaking Changes & Notes

Implemented cosine similarity and TF-IDF algorithms for the course descriptions
Added preprocessing steps to increase accuracy of similarity measures
Sorted and sliced similarity array to output top 5 most similar courses
Next steps include preprocessing all descriptions, generating the TF-IDF vectors for each of them, and adding this information into new database fields
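
The sort-and-slice step in the notes above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `ScoredCourse` and `topSimilar` are hypothetical names, and excluding the source course and any non-finite scores is an assumption about the desired behavior.

```typescript
// Hypothetical shape: one similarity score per candidate course.
type ScoredCourse = { courseId: string; score: number };

// Returns the `limit` highest-scoring courses, excluding the source course itself.
const topSimilar = (
  sourceId: string,
  scored: ScoredCourse[],
  limit = 5
): ScoredCourse[] =>
  scored
    .filter((c) => c.courseId !== sourceId && Number.isFinite(c.score))
    .sort((a, b) => b.score - a.score) // descending; filter() already returned a copy
    .slice(0, limit);

const ranked = topSimilar(
  "A",
  [
    { courseId: "A", score: 1 },
    { courseId: "B", score: 0.42 },
    { courseId: "C", score: 0.87 },
    { courseId: "D", score: NaN },
  ],
  2
);
console.log(ranked.map((c) => c.courseId)); // ["C", "B"]
```

Filtering before sorting keeps the input array untouched, which matters if the same score array is reused for other courses.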

Added to documentation?

What GIF represents this PR?

gif

dti-github-bot · 2024-10-26T05:50:54Z

[diff-counting] Significant lines: 211.

wizhaaa · 2024-10-26T07:16:29Z

I see that you mentioned storing the preprocessed descriptions and vectors for each course in our database. We might want to create a new collection (table) for this instead of putting it directly in the courses collection itself.

i.e.

courses collection = []
course_recomendation_metadata = []

We would then query across both of these Mongo collections, unless the preprocessed data is small enough to keep on the course documents.
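
A rough sketch of what the two-collection layout might look like, joined in memory by `_id`. The type names, the `tfidfVector` field, and `joinCourseMetadata` are all hypothetical and only illustrate the idea; they are not the project's schema.

```typescript
// Hypothetical document shapes; field names are illustrative only.
type Course = { _id: string; classSub: string; classNum: string };
type RecommendationMetadata = {
  _id: string; // assumed to match the _id of the course it describes
  processedDescription: string;
  tfidfVector: Record<string, number>;
};
type JoinedCourse = { course: Course; meta: RecommendationMetadata };

// Joins the two collections by _id; courses without metadata are skipped.
const joinCourseMetadata = (
  courses: Course[],
  metadata: RecommendationMetadata[]
): JoinedCourse[] => {
  const byId = new Map<string, RecommendationMetadata>();
  metadata.forEach((m) => byId.set(m._id, m));

  const joined: JoinedCourse[] = [];
  courses.forEach((course) => {
    const meta = byId.get(course._id);
    if (meta) joined.push({ course, meta });
  });
  return joined;
};
```

In MongoDB itself, the same join could be expressed with a `$lookup` aggregation stage rather than in application code.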

qiandrewj

Hi Jacqueline! Amazing work on the course similarity algorithm -- I'd love to chat with you to learn more about what goes into the preprocessing and computing similarity, and we could see if there are ways to improve accuracy and the score returned by the functions. If my comments are irrelevant, definitely disregard :)

qiandrewj · 2024-10-26T13:17:52Z

server/src/course/course.recalgo.ts

+ * @param description A course description that needs to be preprocessed
+ * @returns The processed description for a course
+ */
+export const preprocess = (description: string) => {


I noticed in the testing picture you uploaded that there are some strange text breaks or punctuation that now occur between words. Not sure if that's still happening?
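
One possible cause, judging only from the sentence regex visible in this PR's diff (`/[^.!?]*[.!?]\s+[A-Z]/g`): each match consumes the capital letter that starts the next sentence, so that letter ends up at the tail of the previous chunk, and a final sentence with nothing after it is never matched. The sketch below reproduces that and compares it with a lookahead-based pattern. `splitSentences` is a hypothetical helper, not part of the PR, and it ignores edge cases such as abbreviations.

```typescript
const text = "Hello world. This is a test. Another one.";

// Pattern from the diff: the trailing [A-Z] is part of each match.
const viaMatch: string[] = text.match(/[^.!?]*[.!?]\s+[A-Z]/g) || [text];
console.log(viaMatch); // ["Hello world. T", "his is a test. A"]

// Hypothetical alternative: end a sentence at punctuation that is followed by
// whitespace and a capital letter (or by the end of the text), using a
// lookahead so the next sentence's first letter is not consumed.
const splitSentences = (description: string): string[] => {
  const parts: string[] =
    description.match(/[\s\S]+?(?:[.!?](?=\s+[A-Z]|\s*$)|$)/g) || [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
};

console.log(splitSentences(text));
// ["Hello world.", "This is a test.", "Another one."]
```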

qiandrewj · 2024-10-26T13:18:49Z

server/src/course/course.recalgo.ts

+export const preprocess = (description: string) => {
+  let sentences = description.match(/[^.!?]*[.!?]\s+[A-Z]/g) || [description];
+  let processedText = sentences.map(sentence => {
+    let words = sentence.match(/\b\w+\b/g) || [];


Another thought I had about preprocessing was getting rid of "filler words" (e.g. and, the, to, for, with...)

Nice idea! Also, I saw "this", so maybe any pronouns too?
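
A minimal sketch of the filler-word filtering suggested in this thread. The `STOP_WORDS` list and the `removeStopWords` name are assumptions made for illustration; a real list would likely be longer, and this is not code from the PR.

```typescript
// Illustrative stop-word list covering the words mentioned in this thread
// plus a few common articles and pronouns.
const STOP_WORDS = new Set<string>([
  "a", "an", "and", "the", "to", "for", "with", "of", "in", "on",
  "this", "that", "these", "those", "we", "our", "you", "your",
  "it", "its", "they", "their",
]);

// Lowercases the text, tokenizes on word characters, and drops stop words.
const removeStopWords = (sentence: string): string[] => {
  const words: string[] = sentence.toLowerCase().match(/\b\w+\b/g) || [];
  return words.filter((w) => !STOP_WORDS.has(w));
};

console.log(removeStopWords("This course introduces the theory of financial markets."));
// ["course", "introduces", "theory", "financial", "markets"]
```

Lowercasing before the lookup means "This" and "this" are treated the same, which is why the list only needs lowercase entries.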

qiandrewj · 2024-10-26T13:20:12Z

server/src/course/course.controller.ts

+  const descriptions = ["This course provides a detailed study on multiple financial markets including bonds, forwards, futures, swaps, and options and their role in addressing major issues facing humanity. In particular, we plan to study specific topics on the role of financial markets in addressing important issues like funding cancer cure, tackling climate change, and financing educational needs for the underserved. Relative to a traditional finance class, we take a broad approach and think of finance as a way to get things done and financial instruments as a way to solve problems. We explore topics related to diversification and purpose investing, including a highly innovative idea of a mega-fund developing cancer treatment. We examine how financial instruments can help solve or hedge some societal issues, particularly on climate change. As an example, we will be studying a financial solution to deal with California forest fire. We also examine the potential for social impact bonds for educating pre-school children and reducing prisoners' recidivism.",
+    "This course introduces and develops the leading modern theories of economies open to trade in financial assets and real goods. The goal is to understand how cross-country linkages in influence macroeconomic developments within individual countries; how financial markets distribute risk and wealth around the world; and how trade changes the effectiveness of national monetary and fiscal policies. In exploring these questions, we emphasize the role that exchange rates and exchange rate policy take in shaping the consequences of international linkages. We apply our theories to current and recent events, including growing geoeconomic conflict between Eastern and Western countries, hyperinflation in Argentina, Brexit, and recent Euro-area debt crises.",
+    "The Corporate Finance Immersion (CFI) Practicum is designed to provide students with a real world and practical perspective on the activities, processes and critical questions faced by corporate finance executives. It is oriented around the key principles of shareholder value creation and the skills and processes corporations use to drive value. The CFI Practicum will help develop skills and executive judgement for students seeking roles in corporate finance, corporate strategy, business development, financial planning, treasury, and financial management training programs. The course can also help students pursuing consulting to sharpen their financial skills and get an excellent view of a corporation's strategic and financial objectives. The practicum will be comprised of a mix of lectures, cases, guest speakers, and team projects. Additionally, there will be training workshops to build your financial modelling skills.",
+    "Environmental Finance & Impact Investing Practicum",


Are these test cases (the short ones that are just the course name) meant to be tested for similarity against the longer descriptions?

These are just courses without descriptions on the course roster API, so I used the course title as a filler for now.

qiandrewj · 2024-10-26T13:22:25Z

server/src/course/course.router.ts

+ * Gets the processed description to use for the similarity algorithm
+ * Currently used for testing
+*/
+courseRouter.post('/getPreDesc', async (req, res) => {


Would there be any errors to catch here? Also I think that a route name more like /preprocess or /preprocess-desc might fit more with our naming theme

Yep we should handle errors at the router level
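
As a rough, framework-agnostic sketch of router-level error handling for this endpoint: the function below validates the body, calls the preprocessing step, and maps failures to status codes. The `description` field name, `HandlerResult`, and `handlePreprocess` are assumptions for illustration; the PR's actual request shape is not shown in this thread.

```typescript
// Hypothetical result shape mirroring an HTTP status plus a JSON body.
type HandlerResult = {
  status: number;
  body: { result?: string; error?: string };
};

// The preprocessing function is passed in so the sketch stays self-contained.
const handlePreprocess = (
  requestBody: unknown,
  preprocess: (description: string) => string
): HandlerResult => {
  // Assumed field name: `description`.
  const description = (requestBody as { description?: unknown } | null)?.description;
  if (typeof description !== "string" || description.trim() === "") {
    return { status: 400, body: { error: "description must be a non-empty string" } };
  }
  try {
    return { status: 200, body: { result: preprocess(description) } };
  } catch (err) {
    const message = err instanceof Error ? err.message : "unknown error";
    return { status: 500, body: { error: `Internal Server Error: ${message}` } };
  }
};
```

Inside an Express route, the result could then be sent with `res.status(result.status).json(result.body)`.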

qiandrewj · 2024-10-26T13:22:46Z

server/src/course/course.router.ts

+ * @body courseId: a course's id field
+ * Gets the array of the top 5 similar courses for the course with id = courseId
+*/
+courseRouter.post('/getSimilarity', async (req, res) => {


Here, /api/courses/get/similarity or something similar

leihelen

Thanks for your hard work on the search algorithm, Jacqueline! The similarity scores you tested look really good, and this will definitely improve the results users get as they search. I left just a few comments!

leihelen · 2024-10-27T00:33:34Z

server/src/course/course.recalgo.ts

+    if (idf && idf[term] === undefined) {
+      idf[term] = 1;
+    }
+    d[term] *= idf[term];


Maybe you could also normalize the term frequency here, dividing each term's count by the total number of terms in the document, so that the TF-IDF score accounts for different document lengths and reflects each term's importance regardless of how long the document is.
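
A length-normalized TF-IDF weighting along the lines of this suggestion could look like the sketch below. The helper names are hypothetical, the IDF uses the plain `log(N / df)` form (other variants exist), and the fallback IDF of 1 for unseen terms simply mirrors the fallback visible in the diff; none of this is the PR's actual implementation.

```typescript
// Term frequency normalized by document length: count(term) / total tokens.
const termFrequency = (tokens: string[]): Map<string, number> => {
  const counts = new Map<string, number>();
  tokens.forEach((t) => counts.set(t, (counts.get(t) ?? 0) + 1));
  const tf = new Map<string, number>();
  counts.forEach((count, t) => tf.set(t, count / tokens.length));
  return tf;
};

// Plain inverse document frequency: log(N / number of documents containing the term).
const inverseDocumentFrequency = (docs: string[][]): Map<string, number> => {
  const df = new Map<string, number>();
  docs.forEach((tokens) =>
    new Set(tokens).forEach((t) => df.set(t, (df.get(t) ?? 0) + 1))
  );
  const idf = new Map<string, number>();
  df.forEach((count, t) => idf.set(t, Math.log(docs.length / count)));
  return idf;
};

// TF-IDF weights for one document; terms missing from the IDF table fall back to 1.
const tfidf = (tokens: string[], idf: Map<string, number>): Map<string, number> => {
  const weights = new Map<string, number>();
  termFrequency(tokens).forEach((tf, t) => weights.set(t, tf * (idf.get(t) ?? 1)));
  return weights;
};

const docA = ["bond", "market", "market"];
const docB = ["market", "policy"];
const idfTable = inverseDocumentFrequency([docA, docB]);
const weightsA = tfidf(docA, idfTable);
console.log(weightsA.get("market")); // 0, because "market" appears in every document
```

Since cosine similarity ignores a vector's overall scale, this normalization mainly matters when the raw weights are stored or compared directly rather than only fed into the cosine step.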

leihelen · 2024-10-27T00:34:33Z

server/src/course/course.recalgo.ts

+  const dotProduct = dot(vecA, vecB);
+  const magA = norm(vecA);
+  const magB = norm(vecB);
+  return dotProduct / (magA * magB);


Maybe you could also add a check here in case magA or magB is 0 to avoid dividing by 0.
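
A sketch of the guarded version suggested here. The `dot` and `norm` helpers below are local stand-ins written for this example, since the PR's own helpers are not shown in the thread, and returning 0 for a zero-magnitude vector is one reasonable convention rather than the only option.

```typescript
// Local stand-ins for the dot product and Euclidean norm.
const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, value, i) => sum + value * (b[i] ?? 0), 0);
const norm = (v: number[]): number => Math.sqrt(dot(v, v));

// Cosine similarity that returns 0 instead of NaN when either vector is all zeros.
const cosineSimilarity = (vecA: number[], vecB: number[]): number => {
  const magA = norm(vecA);
  const magB = norm(vecB);
  if (magA === 0 || magB === 0) {
    return 0; // assumed convention: an empty vector is similar to nothing
  }
  return dot(vecA, vecB) / (magA * magB);
};

console.log(cosineSimilarity([1, 0], [0, 1])); // 0 (orthogonal vectors)
console.log(cosineSimilarity([0, 0], [1, 2])); // 0 (guard triggered, no division by zero)
```

Without the guard, the second call would evaluate `0 / 0`, which is `NaN` in JavaScript and would then sort unpredictably in the top-5 ranking.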

wizhaaa

Awesome - I think this serves as a good start for the algorithm!

Q1. In the example descriptions, some entries are just the course title because there is no description. So the plan is to use the course title whenever a description is missing?

Q1.1 But we don't use the course title for courses that have a description? Do you think we should use the course title + description (can be empty) in our vector to calculate similarity?
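
If the title-plus-description option were chosen, building the text to vectorize might look like this small sketch. `buildSimilarityText` is a hypothetical helper, and joining the two parts with a period is an arbitrary choice made for the example.

```typescript
// Combines a course title with an optional description into one string.
// When the description is missing or blank, only the title is used.
const buildSimilarityText = (title: string, description?: string | null): string =>
  [title, description ?? ""]
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .join(". ");

console.log(buildSimilarityText("Environmental Finance & Impact Investing Practicum", ""));
// "Environmental Finance & Impact Investing Practicum"
console.log(buildSimilarityText("Corporate Finance", "Covers valuation"));
// "Corporate Finance. Covers valuation"
```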

wizhaaa · 2024-11-05T01:47:27Z

server/db/schema.ts

+  _id: { type: String },
+  classSub: { type: String },
+  classNum: { type: String },
+  processedDescriptions: { type: String },


should this be plural?

wizhaaa · 2024-11-05T06:23:51Z

server/src/course/course.recalgo.ts

+export const preprocess = (description: string) => {
+  let sentences = description.match(/[^.!?]*[.!?]\s+[A-Z]/g) || [description];
+  let processedText = sentences.map(sentence => {
+    let words = sentence.match(/\b\w+\b/g) || [];


Nice idea! Also, I saw "this", so maybe any pronouns too?

wizhaaa · 2024-11-05T06:47:03Z

server/src/course/course.router.ts

+ * Gets the processed description to use for the similarity algorithm
+ * Currently used for testing
+*/
+courseRouter.post('/getPreDesc', async (req, res) => {


Yep we should handle errors at the router level

jacquelinecai added 5 commits October 23, 2024 11:40

Implement recommendations algorithms

d27f941

Add endpoints for testing

e2939c5

Add documentation

3159abe

Merge branch 'main' into jacqueline/rec-algo

5d285eb

Sort and slice similarity array

6963acc

jacquelinecai requested a review from a team as a code owner October 26, 2024 05:50

qiandrewj reviewed Oct 26, 2024

leihelen reviewed Oct 27, 2024

Set up new metadata collections

51c42cb

wizhaaa approved these changes Nov 5, 2024

jacquelinecai merged commit 51c42cb into main Nov 20, 2024
4 checks passed

jacquelinecai deleted the jacqueline/rec-algo branch November 20, 2024 23:48
