Feat: Spam Detection #675
Closed
closing for commit surgery on a second branch so we can integrate properly with Noel's tag changes
aimura09 force-pushed the feat_spam_detection branch from 99dfa44 to 81ce216 on October 23, 2023 18:14
feat: rewrite UserPipeline to include user id feat: correct user pipeline for user id feat: fix user id column in dataframes
… 'email', 'affiliations', and 'bio' of MemberProfile fix: update user pipeline to fix latency issues fix: small fix in all_users_df(). - Convert df.value from markup to string - Fix the name of df.columns fix: modify the partial_train to use the correct tokenizer
feat:
- save to database from df
- save recommendations
- add load_labels() function to curator/spam_detect.py
- load_labels() takes the filepath of the dataset and loads a dataframe consisting of user__id and is_spam columns, then initializes the SpamRecommendation table
- get all unlabelled users in dataframe
chore:
- migrations for altering SpamRecommendation
fix: fixed SpamRecommendation __str__ function
fix:
- using None instead of an extra column in SpamRecommendation
- Aiko had the idea of using None instead of an extra field which specified if a model has been labelled before. So, we are going to switch to that.
refactor: move BioSpamClassifier to spam_detection_model.py and change function names in SpamClassifier
chore: removed print statements
feat: add stub dataset for initial training. This dataset should be replaced by an actual dataset with correct labels. Add 'TODO' comments on the parts to be fixed.
chore: todo noel chore: organized imports fix: fix the issue that data in database is not updated
…ile is created feat: - fit text spam classifier - prediction function in classifiers
…lass SpamClassifier refactor: refactored UserMetadataSpamClassifier, integrate TextSpamClassifier, change filenames, UserPipeline to UserSpamStatusProcessor, and SpamRecommendation to UserSpamStatus
feat: added model validation in TextClassifier fix: fixed positional argument bug - fixed positional argument bug - dataset.csv replaced and create a directory in shared folder for spam detection related files fix: fix typing issue in df[labelled_by_curator] column - manual tests on curator_spam_detection management command passed only for UserMetadataSpamClassifier.
…m spam_detection_models.py to spam.py, dataset.csv added under django/curator/
fix: fixed KeyError bug in TextSpamClassifier
refactor:
- create new file 'spam_processor.py' for UserSpamStatusProcessor
- change name from dataset.csv to spam_dataset.csv
fix: - move SPAM_DIR_PATH into settings - set SPAM_DIR_PATH as a pathlib.Path - remove last reference to update_labelled_by_curator - adjust test curator labelling references - use assertCountEqual for order independent comparison - could also convert to sets because there shouldn't be any duplicates refactor: restructure code and tests - tests should use SpamDetector entrypoint instead of instantiating - individual components to ensure proper initialization - move initial training dataset path into settings - move UserSpamStatusProcessor from detected file into curator/models.py as a collaborating class of UserSpamStatus. Should consider integrating more tightly into the UserSpamStatus objects manager Co-Authored-By: Allen Lee <[email protected]>
aimura09 force-pushed the feat_spam_detection branch from 81ce216 to f8508db on October 23, 2023 18:55
…al_fit() because CountVectorizer requires a model to fit the entire training dataset. Fix the management commands and tests accordingly. Also re-generated migration files
aimura09 force-pushed the feat_spam_detection branch from f8508db to 365a5df on October 24, 2023 21:47
also clean up duplicate / dead imports
… to the spam feature.
- adding headline comments for the functions
- cleaning up the management command code and clarifying code responsibilities
- improving execution messages
Attempts to close https://github.com/comses/planning/issues/113
Squashed commits and solved merge conflicts.
Summary
Management commands for Machine Learning spam detection.
Features
Before running the commands, make sure spam_dataset.csv is located in the curator folder.
spam_dataset.csv consists of user_id and is_spam columns. The is_spam column contains 1 (spam) or 0 (ham).
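As a rough illustration of the expected file layout, the sketch below parses a CSV with `user_id` and `is_spam` columns into a dictionary. It is not the `load_labels()` implementation from this PR; the helper name `read_spam_labels` and the sample rows are hypothetical, and it assumes the file has a header row with exactly those column names.

```python
import csv
import io


def read_spam_labels(csv_file):
    """Parse a file-like object with user_id,is_spam columns.

    Hypothetical helper for illustration only; returns {user_id: bool}.
    """
    labels = {}
    for row in csv.DictReader(csv_file):
        value = int(row["is_spam"])
        if value not in (0, 1):
            # is_spam is documented as 1 (spam) or 0 (ham); reject anything else
            raise ValueError(f"unexpected is_spam value: {value!r}")
        labels[int(row["user_id"])] = bool(value)
    return labels


# Made-up sample data in the documented format
sample = io.StringIO("user_id,is_spam\n101,1\n102,0\n")
labels = read_spam_labels(sample)
print(labels)
```

Validating the label values up front makes a malformed dataset fail early, before anything is written to the database.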
To get a list of spam users (detected and labeled), run the following command.
./manage.py curator_spam_detection --exe
This will train the machine learning models if no instance file (pickle file) is found. Otherwise, it loads pre-trained model files and performs predictions on the users that haven't been labeled by a curator.
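The train-if-missing, otherwise-load behaviour described above can be sketched roughly as follows. This is only an assumption about the general pattern, not the code in this PR: `load_or_train`, `train_fn`, and the file name `classifier.pkl` are hypothetical, and a plain dict stands in for a trained model.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_train(model_path, train_fn):
    """Return (model, trained_now).

    Loads a pickled model if model_path exists; otherwise calls train_fn()
    and pickles the result. Hypothetical helper for illustration.
    """
    model_path = Path(model_path)
    if model_path.exists():
        with model_path.open("rb") as f:
            return pickle.load(f), False
    model = train_fn()
    model_path.parent.mkdir(parents=True, exist_ok=True)
    with model_path.open("wb") as f:
        pickle.dump(model, f)
    return model, True


# Usage: the first call trains and saves, the second call loads the saved file
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "classifier.pkl"  # hypothetical file name
    first_model, first_trained = load_or_train(path, lambda: {"weights": [0.1, 0.9]})
    second_model, second_trained = load_or_train(path, lambda: {"weights": []})
    print(first_trained, second_trained)
```

Note that pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.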
Options
--get_model_metrics: Prints the accuracy, precision, recall, and F1 scores of the spam detection models
--load_labels: Loads the labels in spam_dataset.csv into the DB. The first run of --exe loads the labels automatically, so this option is mainly useful for updating labels separately.
--train_user: To train the UserMetadataSpamClassifier() individually
--predict_user: To get the predictions of UserMetadataSpamClassifier() individually
--train_text: To train the TextSpamClassifier() individually
--predict_text: To get the predictions of TextSpamClassifier() individually
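For reference, the four scores reported by `--get_model_metrics` follow the standard binary-classification definitions. The sketch below computes them from scratch for made-up labels; it is not the PR's implementation (which may well rely on a library), and `binary_metrics` is a hypothetical name.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels (1 = spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up example: 5 users, 3 of them actually spam
metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(metrics)
```

For spam detection, precision matters when wrongly flagging legitimate users is costly, while recall matters when missed spam is the bigger problem.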
Tests
Tests are written with the Django test framework.