
Cleanup PR for existing Survey Assist RF PR #938 #995

Open
wants to merge 34 commits into base: master

Conversation

@MukuFlash03 (Contributor) commented Nov 22, 2024

This is a cleanup PR to address pending review comments in the Survey Assist using RF PR #938.

I approached this by going through all the review comments, then checking the later commits to see whether each comment had already been addressed or still needed improvements.

I will list the relevant information file by file below, along with my comments and any relevant changes I've made.

This information will be in a table format with these columns:

  • Summary of review comment
  • Links to the original review comments
  • Links to commits (if present) that already addressed the comments
  • Status: DONE (already addressed by the existing commits in the original PR, OR I committed new changes); PENDING (I have a query).

This will be followed by more details including my own changes and comments.

I have marked review comments / code changes where I need more clarification with the status tag: PENDING QUERY


Listing files involved in the PR:

A. Directly involved in original PR as well as this PR

  1. conf/analysis/trip_model.conf.json.sample
  2. emission/analysis/modelling/trip_model/forest_classifier.py
  3. emission/analysis/modelling/trip_model/model_type.py
  4. emission/analysis/modelling/trip_model/models.py
  5. emission/analysis/modelling/trip_model/run_model.py
  6. emission/tests/modellingTests/TestForestModelIntegration.py
  7. emission/tests/modellingTests/TestForestModelLoadandSave.py
  8. emission/tests/modellingTests/TestRunForestModel.py
  9. emission/tests/modellingTests/modellingTestAssets.py

B. Part of original PR but not a part of this PR

  1. emission/analysis/modelling/trip_model/util.py

C. Involved in original PR but removed later in some commits

  • These had some review comments
  • Code from some of these files was also moved to the other PR files
  • Hence they are analyzed here as well:
  1. emission/core/wrapper/entry.py
  2. emission/tests/modellingTests/TestForestModel.py
  3. emission/analysis/modelling/trip_model/dbscan_svm.py

D. Extra files that I made changes to

  1. emission/analysis/modelling/trip_model/clustering.py

hlu109 and others added 30 commits August 12, 2022 19:33
…VM_decision_boundaries` compatible with changes in `clustering.py` and `mapping.py` files. Also porting these 3 notebooks to trip_model

`cluster_performance.ipynb`, `generate_figs_for_poster` and `SVM_decision_boundaries` now have no dependence on the custom branch. Results of plots are attached to show no difference in their previous and current outputs.
Unified Interface for fit function across all models. Passing 'Entry' Type data from the notebooks till the Binning functions.  Default set to 'none'.
…results.py`

Prior to this update, `NaiveBinningClassifier` in 'models.py' had dependencies on both the tour model and the trip model. Now, this classifier is completely dependent on the trip model. All the other notebooks (except `classification_performance.ipynb`) were tested as well and they are working as usual.

 Other minor fixes to support previous changes.
1. removed mentions of `tour_model` or `tour_model_first_only` .

2. removed two reads from database.

3. Removed notebook outputs (this could be the reason a few diffs are too big to view)
RF initialisation and fit function. Build test written and tested. The fit of the Random Forest uses a df.
Predict is now included. Just need to figure out model storage and model testing.
Model loading and storing is now improved since it just stores the required predictors and encoders.

Regression  test and null value test included in tests.
1. switching a model is as simple as changing model_type in config file

2. ForestModel is now working. Main model is in model.py file which is copied from label_assist

3. TestRunForestModel.py is working.

4. Regression tests in TestForestModel.py are still under construction.
Removed redundancies and unnecessary code segments.
Config copy not required.
Updating branch with changes from master. Moved models from eval repo to server repo.
This will fail due to testForest.py file.
Changes here include :

1. Integrated the shifting of randomForest model from eval to server.

2. unit tests for Model save and load

3. RegressionTest for RF model in testRandomForest.py.
This is replaced by models.py (moved with history)
Improving the test file by changing the way previous predictions are stored.
Fixing circular import
The changes in this iteration are improvements in the tests for the forest model:

1. Post discussion last week, the regression test (`TestForestModel.py`) was removed since it won't be useful when model performance improves. Rather, the structure of the predictions is checked. This check is merged with TestForestModel.py

2. After e-mission#944 , `predict_labels_with_n` in `run_model.py` expects a list and then iterates over it. The forest model and the rest of the tests were updated accordingly.
1. Improved tests in `TestForestModelLoadandSave.py`

2. Better comments, imports and cleanup
While testing model integration, 2 forest-model-specific features are added in the `TestForestModelIntegration.py` file rather than in the `entry.py` file.
2 more (total 4) forest-model-specific features are now added after generating random trips for testing purposes.
Forest-model-specific values added in test setup for random trips.
The test in testRandomForestTypePreservation was using an `allceodata`-specific user id. The tests on GitHub use a different db. Fixed by generating random samples.
Mahadik, Mukul Chandrakant added 4 commits November 17, 2024 23:08
1. Newline fixes - reverted removal of line
a. util.py
e-mission#938 (comment)

b. run_model.py
e-mission#938 (comment)

2. Import format changed to "import X as Y" instead of "from X import Y"
Files: clustering.py, models.py, TestForestModelIntegration.py, TestForestModelLoadandSave.py, TestRunForestModel.py
e-mission#938 (comment)
Utility functions were added in this commit from the original PR
e-mission@e9abd51#diff-e910051a05987388fa06c96f484c23b3e491408a61136ed62235117c445fd664

However, these are already present in clustering.py, which is also imported and used in models.py.
But util.py functions are not being used anywhere.
a. Removed check for remaining test data from previous test runs.
- This should not be possible if data is cleared correctly in tearDown().
- Improved database clearing in tearDown() just to be sure.

b. Moved model build to setup() since all tests need this
- I did see Shankari's comment stating that model building is a heavyweight process (e-mission@104dd9a#r1486605432)
- But it is anyway required by all tests, and moving it to setUp() helps reduce duplicate code.

c. Merged EqualityTest with Type Preservation test
- Shankari had left a comment to check for values versus checking for types (e-mission#938 (comment)).
- Satyam had added changes to check the predictions list after serialization and deserialization respectively.
- However this equality test was already being done in a previous test.
- Hence merged these two.

d. Merged Serialization and Deserialization error handling test.
- These tests were identical and mock functions were being used to assert raised exceptions.
- Merged these as well.

Merging tests helps reduce the number of times we have to build the model, since for all tests the common steps involve building the model and fetching predictions.
…tance variables

1. Split up mock trips data into train / test data.
- Saw that this was being done in one of the tests in TestForestModelLoadandSave.py itself as well as in TestGreedySimilarityBinning.py
- Hence added it to all forest model tests for uniformity.

2. Reduced number of instance variables since they were used inside setUp() only.
This addresses review comment mentioned originally for TestForestModelIntegration
e-mission#938 (comment)

3. Cleaned up TestForestModelIntegration.py
- Added equality tests that check for prediction values generated in pipeline.
Address review comment:
e-mission#938 (comment)

- Added train / test data split.

- Removed check for empty data in setUp()
Addresses review comment:
e-mission#938 (comment)
@MukuFlash03 (Contributor Author) commented Nov 22, 2024

A. Files directly involved in original PR as well as this PR

  1. conf/analysis/trip_model.conf.json.sample
| S. No. | Review Summary | Review Comments | Commits | Status |
| --- | --- | --- | --- | --- |
| a. | Clustering features added to conf file; config.py file | Comment | | DONE |

a. Mukul’s comments / code changes

  • Confirmed, already resolved.
  • Existing config file will still work fine; it is not necessary to refactor the config file. A similar approach to the current one for reading the greedy parameters can then be taken:

greedy_similarity_binning.py

config = eamtc.get_config_value_or_raise('model_parameters.greedy')

forest_classifier.py

 config = eamtc.get_config_value_or_raise('model_parameters.forest')
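As a minimal sketch of that pattern (the module path behind the eamtc alias and the parameter names inside the forest block are assumptions, not the actual sample file; only the 'model_parameters.forest' key comes from the snippet above):

```python
# Hypothetical sketch of a model class pulling its own parameter block via the
# existing helper; parameter names below are placeholders.
import emission.analysis.modelling.trip_model.config as eamtc  # assumed module behind the eamtc alias

class ForestClassifierSketch:
    def __init__(self, config=None):
        if config is None:
            # mirrors how greedy_similarity_binning.py reads 'model_parameters.greedy'
            config = eamtc.get_config_value_or_raise('model_parameters.forest')
        self.n_estimators = config.get("n_estimators", 100)   # placeholder key
        self.max_depth = config.get("max_depth", None)        # placeholder key
```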

@MukuFlash03 (Contributor Author) commented Nov 22, 2024

  1. emission/analysis/modelling/trip_model/forest_classifier.py
| S. No. | Review Summary | Review Comments | Commits with fixes | Status |
| --- | --- | --- | --- | --- |
| a. | Recommended approach to serialize random forest model | Comment | | DONE |
| b. | Module import format not corrected | Comment | Commit | DONE |
| c. | cluster_expected_keys; forest classifier should be separate from cluster classifier | Comment | Commit | PENDING QUERY |
| d. | Use kwargs instead of listing all config values | Comment | Commit | DONE |
| e. | Why is serialization done attribute by attribute instead of using JSON; trip_grouper | Comment | | PENDING QUERY |
| f. | Commit history related to moving files over | Comment | | DONE |
| g. | Average of probabilities | Comment | | PENDING QUERY |
| h. | “pass” comment to be added | Comment | Commit | DONE |

a. Initial Comments

  • Comment - Satyam provided official documentation that lists joblib as one of the methods other than pickle.

Mukul’s comments / code changes

  • Confirmed, it is valid.
  • Link - Additionally, joblib uses pickle internally and is more efficient when working with numpy arrays.
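For reference, a minimal sketch of the joblib approach (toy training data and an in-memory buffer for illustration; this is not the PR's exact code):

```python
# Serializing a fitted sklearn RandomForestClassifier with joblib, which handles
# the numpy arrays inside the fitted trees more efficiently than raw pickle.
from io import BytesIO

import joblib
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=10, random_state=0).fit([[0], [1]], [0, 1])

buf = BytesIO()
joblib.dump(model, buf)               # dump to an in-memory buffer instead of a file path
serialized_bytes = buf.getvalue()     # bytes that could be stored with the rest of the model state

restored = joblib.load(BytesIO(serialized_bytes))
assert list(restored.predict([[0], [1]])) == list(model.predict([[0], [1]]))
```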

b. Mukul’s comments / code changes

  • Confirmed, fixed in this commit.

  • Commit - Additionally, I've made import format changes wherever necessary.


PENDING QUERY

c. Initial Comments

  • Satyam may have updated related code; need to check

Mukul’s comments / code changes

  • Confirmed, commented out code sections in this commit.
  • Is this what Shankari wanted though? Is there a better way to do it by removing unnecessary code?
    • Comment - Refers to more comments regarding removing loc_cluster to keep the code DRY while moving it over

d. Initial comments

  • Satyam has used kwargs; need to check code

Mukul’s comments / code changes

  • Confirmed, fixed in this commit.

PENDING QUERY

e. Initial Comments

  • Comment - Satyam’s reasoning for this serialization approach.

Mukul’s comments / code changes

  • trip_grouper was commented out already.
  • JSON dump only works for basic Python data types (dicts, lists, strings, numbers), so a fitted sklearn model cannot be dumped to JSON directly.

Satyam's explanation looks valid, but I will try alternative approaches if required.
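A quick illustration of that limitation (toy model, not the PR code):

```python
# json handles basic Python types but raises TypeError for a fitted sklearn
# estimator, which is why a binary serializer (joblib) or an attribute-by-attribute
# dump is needed instead.
import json

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=5).fit([[0], [1]], [0, 1])

print(json.dumps({"model_type": "forest", "n_estimators": 5}))  # fine: basic types only

try:
    json.dumps(model)
except TypeError as err:
    print("json.dumps failed:", err)  # the fitted estimator is not JSON serializable
```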


f. Initial comments

  • Seems complicated with managing the commit history of files while moving them across repositories.
  • Comment - This is the “earlier comment” which refers to util.py comment regarding moving files over.
  • eval-private-data repo
    • Issue, PR - for deleting files after moving them to server repo.
  • Comment - Related comment highlighting DRY

Mukul’s comments / code changes

  • Looks like it was fixed already.

PENDING QUERY

g. Comment, Code - average of probabilities

  • Satyam pointed out that currently predict_labels_with_n returns a list of predictions each with a dict of labels and a single prediction value.

Mukul’s comments / code changes

  • Confirmed that inferrers.py uses the single prediction value.
  • Need confirmation whether the approach to use average of the predicted probabilities is correct or not.
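To make the open question concrete, a hypothetical sketch (the label names and the averaging rule are assumptions about what "average of probabilities" would mean, not confirmed behavior):

```python
# Hypothetical: if each per-label classifier (mode / purpose / replaced mode)
# reports the probability of its own predicted class, one candidate for the
# single 'p' value is the mean of those probabilities.
import numpy as np

def combine_label_probabilities(per_label_probs):
    """per_label_probs: e.g. {'mode_confirm': 0.8, 'purpose_confirm': 0.6, 'replaced_mode': 0.7}"""
    return float(np.mean(list(per_label_probs.values())))

prediction = {
    "labels": {"mode_confirm": "bike", "purpose_confirm": "work", "replaced_mode": "no_travel"},
    "p": combine_label_probabilities(
        {"mode_confirm": 0.8, "purpose_confirm": 0.6, "replaced_mode": 0.7}
    ),  # -> 0.7
}
print(prediction["p"])
```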

h. Initial comments

  • Added in this commit on Mar 15, 2024
  • Just one line comment added.

Mukul’s comments / code changes

  • I’m unsure how the comment added by Satyam can be modified or expanded upon.
  • I cannot find the referenced extraction logic under ForestClassifier class in models.py.

@MukuFlash03 (Contributor Author) commented:

  1. emission/analysis/modelling/trip_model/model_type.py
| S. No. | Review Summary | Review Comments | Commits with fixes | Status |
| --- | --- | --- | --- | --- |
| a. | List all model types | Comment | Commit | DONE |
| b. | Reason for code changes related to model types dict | Comment | | DONE |

a. Initial comments

  • Looks like already fixed.

Mukul’s comments / code changes

  • Confirmed, changes made are present (see commit in table).

b. Initial comments

  • Comment - Satyam had replied already. Need to understand explanation.

Mukul’s comments / code changes

  • Satyam’s explanation is valid. The config parameters used to instantiate a model vary per model, so the same config variable cannot be used to initialize every model type in the dictionary definition (as sketched below).
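```python
# Self-contained sketch (all names here are illustrative, not the real
# model_type.py code): each model type needs different config keys, so the
# dictionary maps a type to a constructor/factory instead of sharing one
# pre-built config object across every entry.
class GreedyBinningStub:
    def __init__(self, config):
        self.sim_thresh = config["similarity_threshold_meters"]

class ForestStub:
    def __init__(self, config):
        self.n_estimators = config["n_estimators"]

MODEL_FACTORIES = {
    "greedy": GreedyBinningStub,
    "forest": ForestStub,
}

def build_model(model_type, per_model_config):
    return MODEL_FACTORIES[model_type](per_model_config)

greedy = build_model("greedy", {"similarity_threshold_meters": 500})
forest = build_model("forest", {"n_estimators": 100})
```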

@MukuFlash03 (Contributor Author) commented Nov 23, 2024

  1. emission/analysis/modelling/trip_model/models.py
| S. No. | Review Summary | Review Comments | Commits with fixes | Status |
| --- | --- | --- | --- | --- |
| a. | Module import format to be changed | Comment | | DONE |
| b. | Commit history to be merged correctly; were waiting on another PR to be merged | Comment | | DONE |
| c. | Change file name to “rf_for_label_models.py” | Comment | | PENDING QUERY - Can skip? |

a. Mukul’s comments / code changes
The imports in models.py were added before Satyam’s PR.

The “from X import Y” style imports are present in various files that are not part of this PR.
My plan now is to change this import style only in the files that are involved in the PR (see the sketch after this list):

  • Changes made in these files:
    • clustering.py, models.py, TestForestModelIntegration.py, TestForestModelLoadandSave.py, TestRunForestModel.py
      • models.py - sklearn, emission.analysis.modelling.trip_model.clustering as eamtcl
      • clustering.py - sklearn (file not a part of PR but was involved in moving over files from eval-private-data repository)
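```python
# Illustration of the "import X as Y" convention applied in the files above.
# The eamtcl alias is taken from the list; the RandomForestClassifier usage is
# just an example of referring to names through the module alias.

# before (the style flagged in review):
#   from sklearn.ensemble import RandomForestClassifier
# after:
import sklearn.ensemble
# import emission.analysis.modelling.trip_model.clustering as eamtcl  # within the e-mission tree

clf = sklearn.ensemble.RandomForestClassifier(n_estimators=100)
```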

b. Mukul’s comments / code changes

  • Commit history available for models.py
  • Comment - Detailed conversation regarding moving files, commit history
  • Comment - Mentions that some files can be moved over.
  • Comment - But this is dependent on another PR being merged.
  • clustering.py, mapping.py, data_wrangling.py - available in master branch of e-mission-server with commit history.
  • Looks like this was all resolved.
  • Hence marking it as done.

PENDING QUERY

c. Initial comments

  • Comment - Satyam replied that models.py contains other models too, not just the Random Forest model.

Mukul’s comments / code changes

  • The reasoning is valid; multiple models are present inside models.py.
  • So, can this renaming be skipped?

@MukuFlash03 (Contributor Author) commented:

  1. emission/analysis/modelling/trip_model/run_model.py
| S. No. | Review Summary | Review Comments | Commits with fixes | Status |
| --- | --- | --- | --- | --- |
| a. | Update comment with correct description of algorithm | Comment | Commit | DONE |
| b. | Newline, whitespace changes to be reverted | Comment | Commit | DONE |

a. Mukul’s comments / code changes

  • Commit - code comment updated on Nov 2nd, 2023 while review comment left on Oct 21st, 2023.
  • Did not find relevant code near the comment in the same file, probably refers to code elsewhere.
  • Resolved already.

b. Mukul’s comments / code changes

  • In the previous commits, newlines were removed. Added the newlines back.

@MukuFlash03 (Contributor Author) commented Nov 23, 2024

  1. emission/tests/modellingTests/TestForestModelIntegration.py
| S. No. | Review Summary | Review Comments | Commits with fixes | Status |
| --- | --- | --- | --- | --- |
| a. | Sample config; why duplicated | Comment | Commit | DONE |
| b. | Properties not needed to be stored under “self” | Comment | Commit | DONE |
| c. | Start, end timestamps to be handled in tests, not in core/wrapper/entry.py | Comment | | PENDING QUERY |
| d. | Insufficient to check existence of properties, need to check values as well; more tests needed | Comment | | DONE |
| e. | Building model in setup stage is not good unless other tests require it too; can add more tests | Comment | | DONE |

a. Mukul’s comments / code changes

  • Confirmed, review comments addressed with code changes.

b. Mukul’s comments / code changes

  • Confirmed, review comments addressed with code changes.

PENDING QUERY

c. Initial comments

  • Carried over from entry.py
  • Perhaps need to make changes in test files that use create_fake_entry() ?

Mukul’s comments / code changes

  • Satyam moved this over to this file from entry.py which is fine.
  • But it still does not address this concern by Shankari: “the trip's start and end timestamps are almost always guaranteed to not be the same as the time it was created.”
  • I’m trying to understand whether this means that the start and end timestamps should not be equal to each other, or whether they can be equal but the value they are storing (metadata.write_local_dt, i.e. the creation time, as mentioned by Shankari) is incorrect.
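For illustration, a hypothetical sketch of the second reading (none of these names come from create_fake_entry(); they are placeholders):

```python
# Hypothetical: give the fake trip explicit start/end timestamps that differ
# from the time the entry is created, instead of reusing metadata.write_local_dt.
import time

created_at = time.time()                 # when the fake entry is written
fake_trip = {
    "start_ts": created_at - 30 * 60,    # trip started 30 minutes before creation
    "end_ts": created_at - 5 * 60,       # trip ended 5 minutes before creation
}
assert fake_trip["start_ts"] < fake_trip["end_ts"] < created_at
```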

d. Initial comments

  • Comment - Satyam says that new tests TestForestModelLoadandStore and TestForestModelIntegration have been added already.
    • Checking model values would require hardcoding them, but those values won’t stay relevant in the future; hence model values cannot be checked directly.

Mukul’s comments / code changes

  • Need to see if there’s a better way to test.
    • Added some more tests based on existing tests in TestLabelInferencePipeline.py
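One hedged sketch of what such structural / value-range checks could look like (illustrative only, not the tests that were added):

```python
# Assert the shape and value range of a prediction instead of hard-coding
# model outputs that will change as the model improves.
import unittest

class PredictionStructureCheck(unittest.TestCase):
    def check_prediction(self, prediction):
        self.assertIn("labels", prediction)
        self.assertIsInstance(prediction["labels"], dict)
        self.assertIn("p", prediction)
        self.assertGreaterEqual(prediction["p"], 0.0)
        self.assertLessEqual(prediction["p"], 1.0)

    def test_example_prediction(self):
        self.check_prediction({"labels": {"mode_confirm": "bike"}, "p": 0.7})

if __name__ == "__main__":
    unittest.main()
```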

e. Mukul’s comments / code changes

  • Needs to be handled for TestFMLoadandStore.py as well
    • Moved model building to setUp() for this file since it is used by all test functions.
    • Need to confirm if this is fine.

@MukuFlash03 (Contributor Author) commented:

  1. emission/tests/modellingTests/TestForestModelLoadandSave.py
| S. No. | Review Summary | Review Comments | Commits with fixes | Status |
| --- | --- | --- | --- | --- |
| a. | Test data should not be needed to be saved by default | Comment | | DONE |
| b. | Model variable names not relevant | Comment | Commit | DONE |
| c. | Why checking types instead of values | Comment | Commit | DONE |
| d. | Move out of setup(); self.fail() not needed then | Comment | | DONE |

Extra changes made by me that were not explicitly a part of code review comments:

  • Unable to execute the file and see outputs.

    • The `if __name__ == "__main__":` block wasn’t present; added it.
  • Cleanup changes

    • Addressed review comments.
    • Also made cleanup fixes including typos.

PENDING QUERY

  • Moved common code in test functions to setup()

    • Comment - Shankari said that model building is heavy and not recommended to be in setup(), unless all functions need it.
      • This comment was for TestFMIntegration.py.
      • TestFMLoadandStore.py does need model_building for all test functions.
  • Confirm whether having model building is fine in setup() since all tests in this file require model building.


a. Initial comments.

  • Satyam has updated comment; need to check

Mukul’s comments / code changes

  • While Satyam updated the comment, it does not follow the correct approach.
  • There should not be any leftover test data between test runs / functions, since tearDown() clears all data (or at least it should).
  • So the correct fix would be to remove the check for any existing data and ensure that tearDown() clears all collections (see the sketch below).
  • Removed the `if len == 0` check; the tests still pass.
  • Using etc.dropAllCollections() in tearDown() just to be sure.
  • Also, this code was copied from emission/tests/modellingTests/TestRunGreedyModel.py .
    • Hence, future fixes could also be done in TestRunGreedyModel.py
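A generic sketch of that setUp()/tearDown() shape (an in-memory stand-in, not the actual e-mission test code; the real tests call the database-clearing helper such as etc.dropAllCollections()):

```python
# Build the shared fixture once per test and unconditionally clear storage
# afterwards, so no "leftover data" check is needed at the start of each test.
import unittest

FAKE_DB = {}   # stand-in for the real analysis database collections

class ForestModelTestShape(unittest.TestCase):
    def setUp(self):
        # every test in the file needs a built model, so it lives here
        FAKE_DB["trips"] = ["trip1", "trip2"]
        self.model = {"trained_on": list(FAKE_DB["trips"])}

    def tearDown(self):
        # clear everything the test wrote; the real tests would call the
        # database-clearing helper here instead
        FAKE_DB.clear()

    def test_model_built(self):
        self.assertEqual(self.model["trained_on"], ["trip1", "trip2"])

if __name__ == "__main__":
    unittest.main()
```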

b. Mukul’s comments / code changes

  • Confirmed, review comments addressed with code changes.

c. Initial comments.

  • Satyam has updated code; need to check

Mukul’s comments / code changes

  • Confirmed, fixes added in this commit.
  • But the changes added to the test seem repetitive and similar to the first test w.r.t. comparing the pre- / post-serialization prediction lists.
  • Need to check if the first and last tests can be merged somehow.
    • The only difference is that the test/train data is fetched, the split is done, and model.fit() is called in the test function. In other tests, this is all abstracted under eamur.update_trip_model()
      • Commit - Not clear why this last test does this differently.
        • It’s not different actually; other tests directly use test data like this:
          • test_data = esda.get_entries(key="analysis/confirmed_trip", user_id=user_id, time_query=None)
    • Additionally, we won’t need the repeated value check, which is already being done in the first test.
    • Hence, for uniformity, I added the train/test split to setUp().
    • Also, merged a couple of tests into one single test, which helps reduce calls to setUp() and tearDown() and hence the number of times model_build occurs.

d. Mukul’s comments / code changes

  • Copied over from TestRunGreedyModel.py
  • Can keep in setup() to avoid duplicate code in all test functions.
  • Now, I am thinking, why is there even a need for this check?
    • Isn’t it guaranteed that generate_mock_trips would return the same number of trips as total_trips?
      • If not, need to check whether unit tests for generate_mock_trips contain a test for this.
        • Commit - this is where generate_mock_trips() was added.
        • I just checked generate_mock_trips(), and it always produces a number of trips equal to the `trips` parameter.
        • Hence we don’t even need the check for total trips.
      • Keeping the check would mean we are just testing generate_mock_trips and not the RF survey assist.

@MukuFlash03 (Contributor Author) commented:

  1. emission/tests/modellingTests/TestRunForestModel.py
| S. No. | Review Summary | Review Comments | Commits with fixes | Status |
| --- | --- | --- | --- | --- |
| a. | Check model properties; do not simply assume success if no errors thrown | Comment | Commit | DONE |
| b. | Tests are insufficient and incomplete - should test pipeline invocation too; load/store should have tests as well | Comment | Commit, Commit | DONE |

a. Mukul’s comments / code changes

Confirmed, review comments addressed with code changes.


b. Mukul’s comments / code changes

  • Review comments to be addressed for TestFMIntegration.py and TestFMLoadAndStore.py
    • Confirmed, load / store tests added. TestFMLoadAndStore.py has its own set of review comments.
    • Pipeline invocation testing added in TestFMIntegration.py, which has its own set of review comments.

@MukuFlash03 (Contributor Author) commented:

  1. emission/tests/modellingTests/modellingTestAssets.py

N/A - No review comments


B. Part of original PR but not a part of this PR

  1. emission/analysis/modelling/trip_model/util.py
| S. No. | Review Summary | Review Comments | Commits with fixes | Status |
| --- | --- | --- | --- | --- |
| a. | Module import format; code moved to models.py | Comment | | DONE |
| b. | Import with full commit history | Comment | | DONE |
| c. | Newline, whitespace changes to be reverted | Comment | | DONE |

Changes I added that were not tied to any review comments:

| S. No. | Review Summary | Review Comments | Commits with fixes | Status |
| --- | --- | --- | --- | --- |
| d. | get_distance_matrix incorrect indentation fixed | | | DONE |
| e. | Reverting changes in this file | | | DONE |

a. Initial comments.

  • This comment points to a diff in forest_classifier.py, but the highlighted import is already in the correct import format.
  • Refer to models.py for changes

Mukul’s comments / code changes

  • For the PR files, I will make import format changes wherever necessary.
  • Commit history available for models.py

b. Initial comments

  • Comment - In another comment for models.py, Shankari said that commit history did not seem to be copied over. But this comment was before Satyam made the changes.
  • Comment - Satyam did say that commit history was copied over.

Mukul’s comments / code changes

  • Looks like relevant files were copied over with commit history. Also see this comment above.

c. Mukul’s comments / code changes

  • Same as in run_model.py, a newline was removed by Satyam.
  • Re-added the newline so that it matches the code in master, where two newlines are present after the set of imports.
  • A couple more newlines were added just below the ones referred to in the review comments.

Extra code fixes

d. Mukul’s comments / code changes

  • The code was unreachable, sitting after a return statement due to incorrect indentation; fixed the indentation (see the sketch below).
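A generic illustration of that kind of bug (not the actual get_distance_matrix body):

```python
# A line indented at the wrong level can end up after the return and never run.
def broken():
    result = 1
    return result
    result += 1        # unreachable: dead code after the return

def fixed():
    result = 1
    result += 1        # executed before returning
    return result
```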

e. Mukul’s comments / code changes

  • Reverted the changes since the code defined in this file is already present in clustering.py.

@MukuFlash03 (Contributor Author) commented:

C. Involved in original PR but removed later in some commits

  • These had some review comments
  • Code from some of these files was also moved to the other PR files
  • Hence they are analyzed here as well

  1. emission/core/wrapper/entry.py

There was a single change made related to start/end timestamps. This code is now moved to TestForestModelIntegration.py and discussed under that file.

Commit - First time changes made
Commit - Changes reverted

a. Comment - start, end timestamps to be handled in tests, not in core

  • Perhaps need to make changes in test files that use create_fake_entry() ?

Mukul’s comments / code changes

  • These changes were moved to TestForestModelIntegration.py; discussing in detail under that file.

@MukuFlash03 (Contributor Author) commented Nov 23, 2024

  1. TestForestModel.py
| S. No. | Review Summary | Review Comments | Commits with fixes | Status |
| --- | --- | --- | --- | --- |
| a. | metadata.key will never be “predictions” | Comment, Comment | Commit | DONE |
| b. | Use timeseries functions | Comment | Commit | DONE |
| c. | Obsolete params: tripsdf | Comment | Commit | DONE |
| d. | Prediction length message | Comment | Commit | DONE |
| e. | Not sure what test is doing; cannot check previous test run data | Comment | | DONE |
| f. | Class name different, comment incorrect, MAC failures | Comment | | DONE |
| g. | Ignoring comment, waiting for clarification; tests copied over during refactor | Comment | | DONE |

Commit - File removed

  • Regression test not needed.

  • Tests merged in TestRunForestModel.py

  • Comment - Satyam: Contents of file to be ignored

  • Comment - Satyam: Tests will fail


a. Initial comments

  • Commit - Removed metadata.key related code in this commit.

Mukul’s comments / code changes

  • Confirmed that the metadata.key related code was removed in that commit.
    • Not seeing this used anywhere else in the PR code.

b. Mukul’s comments / code changes

  • Used esda.get_entries() in Test files.

c. Mukul’s comments / code changes

Confirmed, review comments addressed with code changes.


d. Mukul’s comments / code changes

  • Corrected error message in TestRunForestModel.py

e. Mukul’s comments / code changes

  • Cannot see the referenced code in the PR branch code.
  • Will have to analyze the test file and see if improvements are required.
  • Commit - File removed; the regression test is not present now, and it is not even relevant since previous test run data won’t be available.
  • Tests were changed to run model_build multiple times in the same test function in TestRunForestModel.py, instead of trying to store/load data from previous test runs (see the sketch below).
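A hedged sketch of that approach (stand-in pipeline, not the real TestRunForestModel code):

```python
# Build and predict twice inside one test and compare the results directly,
# instead of comparing against data persisted by a previous test run.
import unittest

def build_and_predict(trips):
    # stand-in for the real model build + predict steps; deterministic here,
    # the real forest model would need a fixed random seed for this to hold
    return [{"labels": {"mode_confirm": t["mode"]}, "p": 1.0} for t in trips]

class RepeatedBuildConsistency(unittest.TestCase):
    def test_two_builds_agree(self):
        trips = [{"mode": "bike"}, {"mode": "walk"}]
        first = build_and_predict(trips)
        second = build_and_predict(trips)
        self.assertEqual(first, second)

if __name__ == "__main__":
    unittest.main()
```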

f. Mukul’s comments / code changes


g. Mukul’s comments / code changes

  • Same as f.

@MukuFlash03 (Contributor Author) commented:

  1. dbscan_svm.py
| S. No. | Review Summary | Review Comments | Commits with fixes | Status |
| --- | --- | --- | --- | --- |
| a. | File not needed; can remove | Comment | Commit | DONE |

a. Mukul’s comments / code changes

  • Confirmed, file not present anymore

D. Extra files that I made changes to:

  1. emission/analysis/modelling/trip_model/clustering.py

a. Mukul’s comments / code changes

  • This file was involved in moving over files from the eval-private-data repository.
  • Corrected the module import format.
