Feat: Spam Detection #675

aimura09 · 2023-10-11T07:33:14Z

Attempts to close https://github.com/comses/planning/issues/113

Squashed commits and resolved merge conflicts.

Summary

Management commands for Machine Learning spam detection.

Features

Before running the commands, make sure spam_dataset.csv is located in the curator folder.
spam_dataset.csv consists of the user_id and is_spam columns. The is_spam column contains 1 (spam) or 0 (ham).
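
As a rough illustration of the expected file layout, the sketch below reads a CSV with the two columns described above and rejects anything other than 0 or 1 in is_spam. The helper name `read_spam_labels` is hypothetical and not part of this PR, and treating user_id as an integer is an assumption.

```python
import csv
import io


def read_spam_labels(csv_file):
    """Return {user_id: is_spam} from a file-like object with user_id and is_spam columns."""
    reader = csv.DictReader(csv_file)
    missing = {"user_id", "is_spam"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    labels = {}
    for row in reader:
        value = row["is_spam"].strip()
        if value not in ("0", "1"):
            raise ValueError(f"is_spam must be 0 or 1, got {value!r}")
        # Assumption: user_id is an integer primary key.
        labels[int(row["user_id"])] = value == "1"
    return labels


# In-memory stand-in for spam_dataset.csv (made-up rows).
sample = io.StringIO("user_id,is_spam\n12,1\n34,0\n")
labels = read_spam_labels(sample)
print(labels)
```

A check like this could run before labels are loaded into the DB, so that a malformed dataset fails early instead of producing partial labels.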

UserMetadataSpamClassifier() ... Uses XGBoost as a classifier. Takes the "user_id", "labelled_by_curator", "first_name", "last_name", "is_active", "email", "affiliations", "bio", "research_interests" fields of the MemberProfiles as input.
TextSpamClassifier() ... Uses MultinomialNB as a classifier. Takes the "bio" and "research_interests" fields of the MemberProfiles as input. MemberProfiles that have NaN in these fields are disregarded.
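
The input filtering for the text classifier might look roughly like the sketch below. It is not the PR's code: profiles are plain dicts rather than MemberProfile rows, `text_classifier_inputs` is a hypothetical helper, and two details are assumptions, namely that a profile is dropped when either field is missing and that the two fields are joined with a space.

```python
import math

TEXT_FIELDS = ("bio", "research_interests")


def is_missing(value):
    """True for None or a float NaN, which is how pandas marks empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def text_classifier_inputs(profiles):
    """Map user_id to combined text, skipping profiles with a missing text field."""
    inputs = {}
    for profile in profiles:
        values = [profile.get(field) for field in TEXT_FIELDS]
        if any(is_missing(v) for v in values):
            continue  # Assumption: one missing field is enough to disregard the profile.
        inputs[profile["user_id"]] = " ".join(values)
    return inputs


# Made-up example profiles.
profiles = [
    {"user_id": 1, "bio": "Ecologist", "research_interests": "agent-based models"},
    {"user_id": 2, "bio": float("nan"), "research_interests": "cheap pills"},
    {"user_id": 3, "bio": "Modeler", "research_interests": None},
]
inputs = text_classifier_inputs(profiles)
print(inputs)
```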

To get a list of spam users (detected and labeled), run the following command.
./manage.py curator_spam_detection --exe
This will train the machine learning models if no instance file (pickle file) is found. Otherwise, it loads pre-trained model files and performs predictions on the users that haven't been labeled by a curator.
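
The load-or-train behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions: `load_or_train`, `train_stub`, and the `user_model.pkl` filename are hypothetical, and a plain dict stands in for a fitted model.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_train(model_path, train_fn):
    """Load a pickled model if the file exists; otherwise train one and save it.

    Returns (model, trained), where trained is True if train_fn was called.
    """
    model_path = Path(model_path)
    if model_path.exists():
        with model_path.open("rb") as f:
            return pickle.load(f), False
    model = train_fn()
    model_path.parent.mkdir(parents=True, exist_ok=True)
    with model_path.open("wb") as f:
        pickle.dump(model, f)
    return model, True


def train_stub():
    # Hypothetical stand-in for fitting a real classifier.
    return {"kind": "stub-model"}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "spam" / "user_model.pkl"
    _, trained_first = load_or_train(path, train_stub)       # no file yet: trains
    model, trained_second = load_or_train(path, train_stub)  # file exists: loads
print(trained_first, trained_second)
```

One consequence of this pattern is that a stale pickle is reused until it is deleted or a train option is run explicitly.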
Options
--get_model_metrics: To print the accuracy, precision, recall, and F1 scores of the spam detection models
--load_labels: To load the labels in spam_dataset.csv into the DB. Running --exe for the first time loads the labels without this option. It can be useful when the labels need to be updated separately.
--train_user: To train the UserMetadataSpamClassifier() individually
--predict_user: To get the predictions of UserMetadataSpamClassifier() individually
--train_text: To train the TextSpamClassifier() individually
--predict_text: To get the predictions of TextSpamClassifier() individually
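
For orientation, the options above could be declared as in the sketch below. It uses a standalone argparse parser so that it runs without Django; in a management command the same `add_argument` calls would go in `BaseCommand.add_arguments`, which receives an argparse parser. Treating every option as a boolean `store_true` flag is an assumption, and the help strings are paraphrased from the list above.

```python
import argparse

# Flag names are taken from the PR description; help text is paraphrased.
FLAGS = {
    "exe": "list spam users, training the models if no pickle file is found",
    "get_model_metrics": "print accuracy, precision, recall, and F1 scores",
    "load_labels": "load the labels in spam_dataset.csv into the DB",
    "train_user": "train UserMetadataSpamClassifier only",
    "predict_user": "get predictions from UserMetadataSpamClassifier only",
    "train_text": "train TextSpamClassifier only",
    "predict_text": "get predictions from TextSpamClassifier only",
}


def build_parser():
    parser = argparse.ArgumentParser(prog="curator_spam_detection")
    for name, help_text in FLAGS.items():
        # Assumption: each option is a simple on/off switch.
        parser.add_argument(f"--{name}", action="store_true", help=help_text)
    return parser


args = build_parser().parse_args(["--train_text", "--get_model_metrics"])
selected = [name for name in FLAGS if getattr(args, name)]
print(selected)
```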

Tests

Wrote tests using the Django test framework.

alee · 2023-10-17T23:25:43Z

closing for commit surgery on a second branch so we can integrate properly with Noel's tag changes

… 'email', 'affiliations', and 'bio' of MemberProfile
fix: update user pipeline to fix latency issues
fix: small fix in all_users_df().
- Convert df.value from markup to string
- Fix the name of df.columns
fix: modify the partial_train to use the correct tokenizer

feat:
- save to database from df
- save recommendations
- add load_labels() function to curator/spam_detect.py
- load_labels() takes the filepath of the dataset and loads a dataframe consisting of the user__id and is_spam columns. Then it initializes the SpamRecommendation table.
- get all unlabelled users in dataframe
chore:
- migrations for altering SpamRecommendation
fix: fixed SpamRecommendation __str__ function
fix:
- using None instead of an extra column in SpamRecommendation
- Aiko had the idea of using None instead of an extra field which specified if a model has been labelled before. So, we are going to switch to that.

refactor: Move BioSpamClassifier to spam_detection_model.py and change function names in SpamClassifier
chore: removed print statements
feat: add stub dataset for initial training. This dataset should be replaced by an actual dataset with correct labels. Add 'TODO' comments on the parts to be fixed.

feat: added model validation in TextClassifier
fix: fixed positional argument bug
- fixed positional argument bug
- dataset.csv replaced and created a directory in shared folder for spam detection related files
fix: fix typing issue in df[labelled_by_curator] column
- manual tests on curator_spam_detection management command passed only for UserMetadataSpamClassifier.

fix:
- move SPAM_DIR_PATH into settings
- set SPAM_DIR_PATH as a pathlib.Path
- remove last reference to update_labelled_by_curator
- adjust test curator labelling references
- use assertCountEqual for order independent comparison
- could also convert to sets because there shouldn't be any duplicates
refactor: restructure code and tests
- tests should use SpamDetector entrypoint instead of instantiating individual components to ensure proper initialization
- move initial training dataset path into settings
- move UserSpamStatusProcessor from detected file into curator/models.py as a collaborating class of UserSpamStatus. Should consider integrating more tightly into the UserSpamStatus objects manager
Co-Authored-By: Allen Lee <[email protected]>
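
The difference between the two comparison styles mentioned in this commit can be seen in a small standalone test. The ID lists below are made up for illustration and the test class is hypothetical; only `assertCountEqual` and set comparison come from the commit message.

```python
import unittest


class OrderIndependentComparison(unittest.TestCase):
    def test_same_ids_in_any_order(self):
        expected_spam_ids = [3, 1, 2]
        detected_spam_ids = [1, 2, 3]
        # Passes regardless of order; element counts must match.
        self.assertCountEqual(expected_spam_ids, detected_spam_ids)
        # Equivalent here because there are no duplicates.
        self.assertEqual(set(expected_spam_ids), set(detected_spam_ids))

    def test_duplicates_are_where_they_differ(self):
        # Sets collapse duplicates, so this comparison passes...
        self.assertEqual(set([1, 1, 2]), set([1, 2]))
        # ...while assertCountEqual notices the extra 1 and fails.
        with self.assertRaises(AssertionError):
            self.assertCountEqual([1, 1, 2], [1, 2])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(OrderIndependentComparison)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

If duplicate user IDs could ever appear in the results, `assertCountEqual` is the stricter choice because it would surface them.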

alee closed this Oct 17, 2023

alee reopened this Oct 18, 2023

aimura09 force-pushed the feat_spam_detection branch from 99dfa44 to 81ce216 Compare October 23, 2023 18:14

CharlesSheelam and others added 12 commits October 23, 2023 11:52

feat: create user pipeline for spam detection

e29f886

feat: rewrite UserPipeline to include user id
feat: correct user pipeline for user id
feat: fix user id column in dataframes

feat:created a model for storing spam recommendations

d55ab81

feat: added extra field in SpamRecommendation for user classifier

06a1ba6

chore: todo noel
chore: organized imports
fix: fix the issue that data in database is not updated

feat: initializing a new SpamRecommendation whenever a new MemberProf…

2341ce5

…ile is created
feat:
- fit text spam classifier
- prediction function in classifiers

fix/refactor: modifies UserPipeline functions and added an abstract c…

727d3de

…lass SpamClassifier
refactor: refactored UserMetadataSpamClassifier, integrate TextSpamClassifier, change filenames, UserPipeline to UserSpamStatusProcessor, and SpamRecommendation to UserSpamStatus

feat: unit tests added, comments added, SpamDetection class moved fro…

95db1f3

…m spam_detection_models.py to spam.py, dataset.csv added under django/curator/

fix/refactor: TextSpamClassifier and UserSpamStatusProcessor

539d738

fix: fixed KeyError bug in TextSpamClassifier
refactor:
- create new file 'spam_processor.py' for UserSpamStatusProcessor.
- change name from dataset.csv to spam_dataset.csv

aimura09 force-pushed the feat_spam_detection branch from 81ce216 to f8508db Compare October 23, 2023 18:55

fix: replace Tensorflow Tokenizer with CountVectorizer. Deleted parti…

365a5df

…al_fit() because CountVectorizer requires a model to fit the entire training dataset. Fix the management commands and tests accordingly. Also re-generated migration files

aimura09 force-pushed the feat_spam_detection branch from f8508db to 365a5df Compare October 24, 2023 21:47

alee and others added 4 commits October 24, 2023 16:43

fix: create UserSpamStatus with MemberProfiles

1b85945

also clean up duplicate / dead imports

style: black

52875e8

fix: fix the timing to create the dir to /shared/curator/spam

36a9b62

refactor: Cleaning and adding more comments for the functions related…

31b4795

… to the spam feature.
- adding headline comments for the functions.
- Cleaning up the management command code and clarifying code responsibilities.
- Bettering execution messages.

aimura09 closed this Nov 21, 2023

aimura09 deleted the feat_spam_detection branch November 21, 2023 20:20
