PR to filter big jumps even if all segments are in clusters #897

shankari · 2023-01-23T07:09:08Z

…nts are clusters

Once the actual issue is addressed, this will fix e-mission/e-mission-docs#843

For now, we load the location dataframes for the two use cases and verify that the returned values are the ones in the current implementation.

Procedure:
- Perturb the location points in the original use cases to avoid leaking information
- Load the location points into the test case
- Run the filtering code
- Verify that the output is consistent with
  - e-mission/e-mission-docs#843 (comment)
  - e-mission/e-mission-docs#843 (comment)

Also change the location smoothing code from `logging.info` to `logging.exception` so that we can see where the error is in a more meaningful way

Testing done:
- Test passes

```
----------------------------------------------------------------------
Ran 1 test in 0.387s
```

Note that due to the perturbation of the location points, the outliers no longer perfectly match the original use case, but are close enough

```
2023-01-22 22:37:57,262:INFO:4634275328:After first round, still have outliers
    accuracy   altitude  ...      distance         speed
17    70.051  88.551857  ...  8.468128e+06  50922.935508
26     3.778  66.404068  ...  8.467873e+06   2878.645674
49     3.900  72.118635  ...  4.673209e+00      2.336605
2023-01-22 22:37:57,308:INFO:4634275328:After first round, still have outliers
    Unnamed: 0  accuracy    altitude  ...    heading      distance          speed
14          14     5.638  470.899994  ...  88.989357  1.113137e+07  284923.028227
```

To make it easier to debug in case there are errors

- Since we have already implemented many different smoothing algorithms, we pick POSDAP to use as backup
- if we still have outliers after the first round, and the max value is over MACH1, we fall back to the backup algo
- after running the backup algo, if we don't have outliers, the backup algo has succeeded and we use its results
- if we do have outliers, but the max value is under MACH1, the backup algo has succeeded and we use its results
- if we have outliers, and the max is high (> MACH1), the backup algo has failed

With this change, both the tests also change to the correctly deleted values
- [16 17 18 19 20] for use case 1 (e-mission/e-mission-docs#843 (comment))
- [11] for use case 2 (e-mission/e-mission-docs#843 (comment))

In this commit, we also check in the csv data files for the two test cases
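The fallback rules in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the value of `MACH1` is assumed to be the speed of sound in m/s, and `needs_backup` and `backup_succeeded` are hypothetical helper names that take the speeds of the points still flagged as outliers after a round of filtering.

```python
import numpy as np

# Assumed value for this sketch: speed of sound in m/s.
MACH1 = 340.29


def needs_backup(outlier_speeds):
    """After the first round: fall back only if outliers remain
    and the largest remaining value is over MACH1."""
    outlier_speeds = np.asarray(outlier_speeds, dtype=float)
    return bool(outlier_speeds.size > 0 and outlier_speeds.max() > MACH1)


def backup_succeeded(outlier_speeds):
    """After the backup round: success if no outliers remain, or if the
    largest remaining value is not over MACH1; failure otherwise."""
    outlier_speeds = np.asarray(outlier_speeds, dtype=float)
    return bool(outlier_speeds.size == 0 or outlier_speeds.max() <= MACH1)
```

The size check runs before `max()` so that an empty array of outliers never reaches `max()`, which would raise an error.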

…moothing file

This addresses a long-term TODO
https://github.com/e-mission/e-mission-server/blob/master/emission/analysis/intake/cleaning/cleaning_methods/jump_smoothing.py#L262

It also:
- ensures that the individual algorithms are clean and modular and don't depend on other algorithms
- we can swap in any algorithm for the backup algo
- we can support more complex backups in the future

Testing done:
- modified the test to pass in the backup algo
- tests pass

Added a new unit test for the case of `backup_algo == None`, which should return the original algo results.

While testing, found that the ZigZag algo returns a pandas Series, while the Posdap algo returns a numpy array, which means that combining them could be problematic.

Changed ZigZag to also return a numpy array to unify the implementations.

Testing done:
- All tests now pass
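As a small illustration of why the mixed return types matter: a pandas Series carries an index, while a numpy array is purely positional, so code that handles both has to deal with two kinds of lookup. The masks below are made-up values, not data from the tests; the sketch only shows the `to_numpy()` conversion that puts both masks in the same form.

```python
import numpy as np
import pandas as pd

# Hypothetical masks: one algorithm yields a Series whose index follows the
# dataframe rows, the other yields a plain positional numpy array.
series_mask = pd.Series([True, False, True], index=[10, 11, 12])
array_mask = np.array([True, True, False])

# Converting the Series drops the index, so both masks are positional.
unified_mask = series_mask.to_numpy()

# Elementwise combination now works position by position.
combined = np.logical_and(unified_mask, array_mask)
print(type(unified_mask).__name__, combined)
```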

Before this change, we only used one algorithm, so we hardcoded it into the result. However, we can now use either the main algorithm or the backup algorithm. So we also return the algo from `get_points_to_filter` and attribute it correctly.

`get_points_to_filter` is used only in `location_smoothing` and in the tests. So also fix the tests to read both values and check the selected algo in each case.

Testing done:
- tests pass
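A rough sketch of the "return the algo along with the points" idea is below. It is not the signature of the real `get_points_to_filter`: the function name `get_points_to_filter_sketch`, the `ThresholdFilter` class, its `filter` method, and the `MACH1` value are all assumptions made for illustration. Only the `inlier_mask_` attribute name and the `backup_algo == None` case come from the commit messages.

```python
import numpy as np

MACH1 = 340.29  # assumed value (m/s) for this sketch


class ThresholdFilter:
    """Hypothetical stand-in for a smoothing algorithm."""

    def __init__(self, max_speed):
        self.max_speed = max_speed
        self.inlier_mask_ = None

    def filter(self, speeds):
        # Keep every point whose speed is at or below the limit.
        self.inlier_mask_ = np.asarray(speeds, dtype=float) <= self.max_speed


def get_points_to_filter_sketch(speeds, main_algo, backup_algo=None):
    """Return (selected algo, inlier mask) so the caller can attribute the result."""
    speeds = np.asarray(speeds, dtype=float)
    main_algo.filter(speeds)
    kept = speeds[main_algo.inlier_mask_]
    # Use the main result if there is no backup or nothing extreme remains.
    if backup_algo is None or kept.size == 0 or kept.max() <= MACH1:
        return (main_algo, main_algo.inlier_mask_)
    backup_algo.filter(speeds)
    return (backup_algo, backup_algo.inlier_mask_)
```

A caller would unpack both values, for example `sel_algo, mask = get_points_to_filter_sketch(speeds, main, backup)`, and record `sel_algo` alongside the filtered points.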

- Unify algo outputs: `self.inlier_mask_ = self.inlier_mask_.to_numpy()`
- remove `to_numpy()` from all the checks in the tests
- Return two outputs -> `return (None, None)`

Testing done:
- All tests in this file pass

When we moved the second round checks to the calling function in cebb81f, we caused a very subtle regression.

The filtering code had an early return if there were no jumps detected. So in that case, we would not try the second round of checks, or attempt to filter again.

However, when we moved the second round checking to the outer function, we called the second round anyway, even if the first round didn't detect any jumps. And in this one case, we actually found an outlier in the second round, which caused the test to fail.

Fixed by checking to see if there were no outliers in the first round and skipping the second round check in that case. Everything in the `else` for the `if outlier_arr[0].shape[0] == 0:` is unchanged, just indented.

The check for the length was unexpectedly complicated and took many hours to debug, so I added it as a simple use case.

Note also that it is not clear if this is the correct long-term approach. If there were no jumps, then why did using the backup change anything? Maybe we should always use the backup. But changing this to avoid the regression for now; will look at this the next time we look at smoothing.

Testing done:
- `TestPipelineRealData.testIosJumpsAndUntrackedSquishing` passes
- `TestLocationSmoothing` passes
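The guard described in this commit message might look something like the sketch below. The `if outlier_arr[0].shape[0] == 0:` condition is quoted from the message; everything else is assumed for illustration, including the function name `run_filtering`, the `second_round` callback, and the idea that `outlier_arr` comes from `np.nonzero` on the negated inlier mask.

```python
import numpy as np


def run_filtering(inlier_mask, second_round):
    """Run the second round only if the first round flagged at least one point."""
    inlier_mask = np.asarray(inlier_mask, dtype=bool)
    # np.nonzero returns a tuple of index arrays, one per dimension.
    outlier_arr = np.nonzero(np.logical_not(inlier_mask))
    if outlier_arr[0].shape[0] == 0:
        # No jumps detected in the first round: keep its result as-is.
        return inlier_mask
    else:
        # Jumps detected: hand over to the (hypothetical) second-round check.
        return second_round(inlier_mask)
```

Because `np.nonzero` returns a tuple, the length check has to index into it with `[0]` before reading `shape`.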

- `get_filtered_points` is not used anywhere else
- we don't need to print out the series and the numpy version any more now that we have added the unit test in 5a4ae3d

shankari added 4 commits January 22, 2023 22:53

Change the assertion checks to use the row index instead of the id

7d44d63

shankari changed the title ~~PR to fix an issue where if all the segments in a section are clusters, we don't filter big jumps~~ PR to filter big jumps even if all segments are in clusters Jan 24, 2023

shankari added 5 commits January 24, 2023 11:27

Fix regressions in tests

95f88c5

🔥 Remove unused function and extraneous logs

29e78de

shankari merged commit b7749d0 into e-mission:master Jan 25, 2023
