Feat: Spam Detection #675

Closed
wants to merge 17 commits into from


aimura09
Contributor

@aimura09 aimura09 commented Oct 11, 2023

Attempts to close https://github.com/comses/planning/issues/113

Squashed commits and resolved merge conflicts.

Summary

Management commands for Machine Learning spam detection.

Features

Before running the commands, make sure spam_dataset.csv is located in the curator folder.
spam_dataset.csv consists of user_id and is_spam columns. The is_spam column contains 1 (spam) or 0 (ham).
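For reference, a minimal sketch of reading a file in this format with the standard library. The function name `read_spam_labels` and the sample rows are hypothetical; the PR's own `load_labels()` works on a dataframe and is not reproduced here.

```python
import csv
import io


def read_spam_labels(fileobj):
    """Return {user_id: is_spam} from a CSV with user_id and is_spam columns."""
    labels = {}
    for row in csv.DictReader(fileobj):
        value = row["is_spam"].strip()
        if value not in ("0", "1"):
            # Reject anything other than the two documented label values.
            raise ValueError(f"unexpected is_spam value: {value!r}")
        labels[int(row["user_id"])] = value == "1"
    return labels


# Hypothetical sample content; the real spam_dataset.csv is not shown here.
sample = io.StringIO("user_id,is_spam\n101,1\n102,0\n103,1\n")
labels = read_spam_labels(sample)
print(labels)
```

Rejecting values other than 0 and 1 up front keeps a malformed row from silently becoming a ham label.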

  • UserMetadataSpamClassifier() ... Uses XGBoost as a classifier. Takes the "user_id", "labelled_by_curator", "first_name", "last_name", "is_active", "email", "affiliations", "bio", and "research_interests" fields of the MemberProfiles as input.
  • TextSpamClassifier() ... Uses MultinomialNB as a classifier. Takes the "bio" and "research_interests" fields of the MemberProfiles as input. MemberProfiles that have NaN in these fields are disregarded.
  1. To get a list of spam users (detected and labeled), run the following command.
    ./manage.py curator_spam_detection --exe
    This will train the machine learning models if no instance file (pickle file) is found. Otherwise, it loads pre-trained model files and performs predictions on the users that haven't been labeled by a curator.

  2. Options
    --get_model_metrics: To print the accuracy, precision, recall, and F1 scores of the spam detection models
    --load_labels: To load the labels in spam_dataset.csv into the DB. The first run of --exe loads the labels automatically, so this option is mainly useful for updating labels separately.
    --train_user: To train the UserMetadataSpamClassifier() individually
    --predict_user: To get the predictions of UserMetadataSpamClassifier() individually
    --train_text: To train the TextSpamClassifier() individually
    --predict_text: To get the predictions of TextSpamClassifier() individually
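The "train if no pickle file is found, otherwise load" behavior described for --exe in step 1 can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `load_or_train`, `train_stub`, and the file name are hypothetical, and a plain dict stands in for a trained classifier.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_train(model_path, train_fn):
    """Load a pickled model if the file exists; otherwise train and save one."""
    model_path = Path(model_path)
    if model_path.exists():
        with model_path.open("rb") as f:
            return pickle.load(f), False  # False: nothing was trained
    model = train_fn()
    with model_path.open("wb") as f:
        pickle.dump(model, f)
    return model, True  # True: trained and saved on this call


def train_stub():
    # Stand-in for real training; returns a dict instead of a classifier.
    return {"name": "stub-model", "threshold": 0.5}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"  # hypothetical file name
    first_model, first_trained = load_or_train(path, train_stub)
    second_model, second_trained = load_or_train(path, train_stub)

print(first_trained, second_trained)
```

The first call trains and writes the file; the second call finds the file and only loads it.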

Tests

Wrote tests using Django's test framework.

@alee alee closed this Oct 17, 2023
@alee
Member

alee commented Oct 17, 2023

closing for commit surgery on a second branch so we can integrate properly with Noel's tag changes

@alee alee reopened this Oct 18, 2023
@aimura09 aimura09 force-pushed the feat_spam_detection branch from 99dfa44 to 81ce216 Compare October 23, 2023 18:14
CharlesSheelam and others added 12 commits October 23, 2023 11:52
feat: rewrite UserPipeline to include user id

feat: correct user pipeline for user id

feat: fix user id column in dataframes
… 'email', 'affiliations', and 'bio' of MemberProfile

fix: update user pipeline to fix latency issues
fix: small fix in all_users_df().
   - Convert df.value from markup to string
   - Fix the name of df.columns

fix: modify the partial_train to use the correct tokenizer
feat:
 - save to database from df
 - save recommendations
 - add load_labels() function to curator/spam_detect.py
 - load_labels() will take the filepath of the dataset and load a dataframe consisting of user__id and is_spam columns. Then it will initialize the SpamRecommendation table.
 - get all unlabelled users in dataframe

chore:
 - migrations for altering SpamRecommendation
fix: fixed SpamRecommendation __str__ function

fix:
  - using None instead of an extra column in SpamRecommendation
  - Aiko had the idea of using None instead of an extra field which specified if a model has been labelled before. So, we are going to switch to that.
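A minimal sketch of the idea in the commit above, assuming the nullable value is a curator label such as labelled_by_curator (the commit does not name the field). `SpamStatusSketch` and `unlabelled` are hypothetical names, and a dataclass stands in for the Django model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpamStatusSketch:
    """Hypothetical stand-in for one row; not the PR's actual model."""
    user_id: int
    # None: not labelled yet; True: labelled spam; False: labelled ham.
    labelled_by_curator: Optional[bool] = None


def unlabelled(rows):
    # Compare with "is None": a plain truthiness check would also match False.
    return [row.user_id for row in rows if row.labelled_by_curator is None]


rows = [SpamStatusSketch(1, True), SpamStatusSketch(2), SpamStatusSketch(3, False)]
unlabelled_ids = unlabelled(rows)
print(unlabelled_ids)
```

One nullable value carries three states, so no separate "has been labelled" flag is needed.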
refactor: Move BioSpamClassifier to spam_detection_model.py and change functions names in SpamClassifier

chore: removed print statements

feat: add stub dataset for initial training. This dataset should be replaced by an actual dataset with correct labels. Add 'TODO' comments on the parts to be fixed.
chore: todo noel
chore: organized imports
fix: fix the issue that data in database is not updated
…ile is created

feat:
 - fit text spam classifier
 - prediction function in classifiers
…lass SpamClassifier

refactor: refactored UserMetadataSpamClassifier, integrate TextSpamClassifier, change filenames, UserPipeline to UserSpamStatusProcessor, and SpamRecommendation to UserSpamStatus
feat: added model validation in TextClassifier
fix: fixed positional argument bug
- fixed positional argument bug
- dataset.csv replaced and create a directory in shared folder for spam detection related files

fix: fix typing issue in df[labelled_by_curator] column
- manual tests on curator_spam_detection management command passed only for UserMetadataSpamClassifier.
…m spam_detection_models.py to spam.py, dataset.csv added under django/curator/
fix: fixed KeyError bug in TextSpamClassifier
refactor:
- create new file 'spam_processor.py' for UserSpamStatusProcessor.
- change name from dataset.csv to spam_dataset.csv
fix:
- move SPAM_DIR_PATH into settings
- set SPAM_DIR_PATH as a pathlib.Path
- remove last reference to update_labelled_by_curator
- adjust test curator labelling references
- use assertCountEqual for order independent comparison
- could also convert to sets because there shouldn't be any duplicates
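The last two points can be illustrated with a small standalone test case. The IDs are made up, and the PR's tests run under Django's test runner rather than this bare `unittest` setup.

```python
import unittest


class OrderIndependentComparison(unittest.TestCase):
    def test_same_ids_any_order(self):
        expected = [3, 1, 2]
        actual = [1, 2, 3]
        # Passes when both hold the same elements the same number of times.
        self.assertCountEqual(actual, expected)
        # Equivalent here because neither list has duplicates.
        self.assertEqual(set(actual), set(expected))

    def test_duplicates_are_counted(self):
        # Sets hide duplicates, while assertCountEqual does not.
        self.assertEqual(set([1, 1, 2]), set([1, 2]))
        with self.assertRaises(AssertionError):
            self.assertCountEqual([1, 1, 2], [1, 2])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(OrderIndependentComparison)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The set conversion is only safe while duplicates cannot occur; assertCountEqual would still catch an unexpected duplicate.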

refactor: restructure code and tests
- tests should use SpamDetector entrypoint instead of instantiating individual components to ensure proper initialization
- move initial training dataset path into settings
- move UserSpamStatusProcessor from detected file into curator/models.py as a collaborating class of UserSpamStatus. Should consider integrating more tightly into the UserSpamStatus objects manager

Co-Authored-By: Allen Lee <[email protected]>
@aimura09 aimura09 force-pushed the feat_spam_detection branch from 81ce216 to f8508db Compare October 23, 2023 18:55
…al_fit() because CountVectorizer requires a model to fit the entire training dataset. Fix the management commands and tests accordingly.

Also re-generated migration files
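The CountVectorizer remark in the commit above comes down to this: a count vectorizer learns a fixed vocabulary when it is fitted, and tokens outside that vocabulary are ignored afterwards. The toy below uses only the standard library to show the effect. It is not scikit-learn's implementation, and the function names and sample texts are invented.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each distinct token in docs to a column index."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(docs, vocabulary):
    """Count tokens per document; tokens outside the vocabulary are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.lower().split() if tok in vocabulary)
        rows.append([counts[tok] for tok in vocabulary])
    return rows


batch_1 = ["cheap pills online", "agent based modeling"]
batch_2 = ["casino bonus online"]

vocab_batch_1 = fit_vocabulary(batch_1)          # fitted on the first batch only
vocab_full = fit_vocabulary(batch_1 + batch_2)   # fitted on all the data

row_partial = transform(batch_2, vocab_batch_1)[0]
row_full = transform(batch_2, vocab_full)[0]
print(len(row_partial), sum(row_partial))
print(len(row_full), sum(row_full))
```

With the vocabulary fitted on the first batch alone, "casino" and "bonus" never reach the classifier. Refitting per batch would instead change the number of columns between batches.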
@aimura09 aimura09 force-pushed the feat_spam_detection branch from f8508db to 365a5df Compare October 24, 2023 21:47
alee and others added 4 commits October 24, 2023 16:43
also clean up duplicate / dead imports
… to the spam feature.

 - adding headline comments for the functions.

 - Cleaning up the management command code and clarifying code responsibilities.

 - Improving execution messages.
@aimura09 aimura09 closed this Nov 21, 2023
@aimura09 aimura09 deleted the feat_spam_detection branch November 21, 2023 20:20
4 participants