Giant, obvious GPS jump was not filtered #843
While processing, using the IQR to find potential clusters, we find 6 potential clusters.
The clusters that are most interesting to us are segments 1, 2 and 3
|
First, this seems to be the case where all the segments are clusters, which is a corner case, and probably why it is failing.
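For context, a minimal sketch of how an IQR-based threshold can collapse in this corner case; the threshold formula and variable names here are illustrative assumptions, not the actual e-mission implementation:

```python
import numpy as np

def iqr_threshold(segment_distances):
    # Classic IQR outlier threshold: Q3 + 1.5 * (Q3 - Q1)
    q1, q3 = np.percentile(segment_distances, [25, 75])
    return q3 + 1.5 * (q3 - q1)

# When every segment distance is small and similar ("all the segments are
# clusters"), the IQR collapses, the threshold sits barely above the typical
# distance, and almost everything gets flagged as a potential cluster.
print(iqr_threshold([120, 150, 135, 140, 130, 145]))   # tiny spread -> low threshold
print(iqr_threshold([120, 150, 135, 8000, 130, 145]))  # one real jump stands out
```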
But if we look at the speeds and distances of interest, we have
The distance from 16 to 17 is 166, not 0...
Stepping through the code, the detected threshold is
While computing distances, we recompute the distance between the start and end points, which is why we end up with a distance of 0 for the (16,17) segment, since the segment has only one location point if it uses the [16:17] slice. So [16:17] is indeed a one-trip cluster and we can assume that it is bad. But we then return the segment before it (11-16) as good, and following the expected switching, we get
And so we end up deleting the wrong points. The real problem is that we expect to find alternating GOOD and BAD segments, but we actually have two back-to-back BAD segments (16-17 and 17-21). And we should not delete the points from 21 to 26.
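A minimal sketch of why the alternation assumption goes wrong here, using the segment boundaries from the discussion above; the labelling function is a hypothetical stand-in for the real heuristic, not the server code:

```python
# Segments as (start_idx, end_idx), taken from the discussion above.
segments = [(11, 16), (16, 17), (17, 21), (21, 26)]

def label_by_alternation(segments, known_bad_pos):
    # Hypothetical alternation heuristic: once one segment is known BAD,
    # its neighbours are assumed to flip BAD/GOOD/BAD/... in both directions.
    labels = {}
    for offset, seg in enumerate(segments[known_bad_pos:]):
        labels[seg] = "BAD" if offset % 2 == 0 else "GOOD"
    for offset, seg in enumerate(reversed(segments[:known_bad_pos]), start=1):
        labels[seg] = "GOOD" if offset % 2 == 1 else "BAD"
    return labels

# (16, 17) is the known one-point BAD segment (position 1 in the list).
print(label_by_alternation(segments, 1))
# {(16, 17): 'BAD', (17, 21): 'GOOD', (21, 26): 'BAD', (11, 16): 'GOOD'}
# In reality (17, 21) is also BAD, so alternation keeps the jump segment
# and deletes the legitimate points from 21 to 26.
```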
Just to confirm, if I set the coordinates of 16 to
Segment 1 (11-16) is again GOOD
So we end up deleting the points in between. Maybe it would be easier to see that we are retaining [11 12 13 14 15 21 22 23 24 25], which are basically two clumps on opposite sides of a lake, and that is actually pretty correct. So really the problem is that we expect good and bad to alternate, and that doesn't always seem to be the case. Note that before we changed the lat/lon for point 16, we still had outliers after the filtering, and now we don't.
So in order to avoid regressions, let's try a second method if the first method still has outliers.
I can think of two possible second methods:
I'm tempted to go with (1) first, since it fits into our current heuristic-based solution better than (2). If we have (2), for example, do we still need the current alternation method? Let's implement (1) and see if it also fixes the large not-a-trip for mm_masscec.
The large trip for mm_masscec had only one point in the jump. Let's see why it wasn't captured.
Looking at the logs, this is another example where all the segments are clusters, so the IQR was super small.
Note also that there was a big gap between point 10 and point 11, so although the jump was large, the speed (590) is under Mach 2 (2 * 340.29 = 680.58). And then, since there are multiple zero-length clusters, we pick one of them as BAD and then do our alternating BAD and GOOD, which ends up with 2-11 as BAD and 11-12 as GOOD.
So we end up deleting almost everything other than the one item that we actually want to delete, aka point 11.
We do still have outliers after the first check is done
So again, the problem seems to be with these weird trips where everything is a cluster. Our previous plan to look at high-speed jumps should work, although given the big gap in time, we need to tone down the check to Mach 1 instead of Mach 2.
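A rough sketch of the kind of high-speed-jump check being discussed, with Mach 1 ≈ 340.29 m/s; the function and data layout are assumptions for illustration, not the actual smoothing code:

```python
MACH_1 = 340.29  # speed of sound in m/s, the proposed threshold

def has_impossible_jump(points, max_speed=MACH_1):
    """points: list of (timestamp_in_seconds, distance_from_previous_in_meters)."""
    for (prev_ts, _), (ts, dist) in zip(points, points[1:]):
        dt = ts - prev_ts
        if dt > 0 and dist / dt > max_speed:
            return True
    return False

# In use case 2 the time gap between points 10 and 11 was large, so even a
# huge jump only works out to ~590 m/s: over Mach 1 but under Mach 2 (~680),
# which is why the check needs to use Mach 1 rather than Mach 2.
```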
So the planned fix is:
…nts are clusters

Once the actual issue is addressed, this will fix e-mission/e-mission-docs#843

For now, we load the location dataframes for the two use cases and verify that the returned values are the ones in the current implementation.

Procedure:
- Perturb the location points in the original use cases to avoid leaking information
- Load the location points into the test case
- Run the filtering code
- Verify that the output is consistent with e-mission/e-mission-docs#843 (comment) and e-mission/e-mission-docs#843 (comment)

Also change the location smoothing code from `logging.info` to `logging.exception` so that we can see where the error is in a more meaningful way.

Testing done:
- Test passes

```
----------------------------------------------------------------------
Ran 1 test in 0.387s
```

Note that due to the perturbation of the location points, the outliers no longer perfectly match the original use case, but are close enough:

```
2023-01-22 22:37:57,262:INFO:4634275328:After first round, still have outliers
    accuracy   altitude  ...      distance         speed
17    70.051  88.551857  ...  8.468128e+06  50922.935508
26     3.778  66.404068  ...  8.467873e+06   2878.645674
49     3.900  72.118635  ...  4.673209e+00      2.336605
2023-01-22 22:37:57,308:INFO:4634275328:After first round, still have outliers
    Unnamed: 0  accuracy    altitude  ...    heading      distance          speed
14          14     5.638  470.899994  ...  88.989357  1.113137e+07  284923.028227
```
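As a hedged illustration of the perturbation step in the procedure above (the offset scale, column names, and function are assumptions; the actual test fixture may differ):

```python
import numpy as np
import pandas as pd

def perturb_locations(loc_df: pd.DataFrame, scale_deg=0.001, seed=42) -> pd.DataFrame:
    """Add small random offsets to lat/lon so the checked-in fixture does not
    leak the original user's locations while keeping the jump structure."""
    rng = np.random.default_rng(seed)
    out = loc_df.copy()
    out["latitude"] = out["latitude"] + rng.normal(0, scale_deg, len(out))
    out["longitude"] = out["longitude"] + rng.normal(0, scale_deg, len(out))
    return out
```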
We have already implemented multiple smoothing algorithms. So instead of adding a new one, let's see if one of the existing algorithms, e.g. POSDAP, will work as the backup. For use case 1:
So it looks like we removed everything other than 16-20? Maybe I just have to edit POSDAP to flip the signs. Although after the second round, our outliers are not the original ones?
Use case 2:
Ok, so printing out the outliers instead of the inliers indicates that this does seem to work
However, even after removing them, we still have "outliers". Since most of the points are in clusters, these outliers are pretty slow, with
For this second check, we should look to see whether the outliers are in fact large.
After adding that check, the correct values are deleted. Will need to pull changes to the GIS branch and re-run to confirm that it works correctly here as well.
- Since we have already implemented many different smoothing algorithms, we pick POSDAP to use as backup
- If we still have outliers after the first round, and the max value is over MACH1, we fall back to the backup algo
- After implementing the backup algo, if we don't have outliers, the backup algo has succeeded and we use its results
- If we do have outliers, but the max value is under MACH1, the backup algo has succeeded and we use its results
- If we have outliers, and the max is high (> MACH1), the backup algo has failed (see the sketch after this list)

With this change, both the tests also change to the correctly deleted values:
- [16 17 18 19 20] for use case 1 (e-mission/e-mission-docs#843 (comment))
- [11] for use case 2 (e-mission/e-mission-docs#843 (comment))

In this commit, we also check in the csv data files for the two test cases.
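A minimal sketch of the fallback decision described in the list above (as referenced there); the function names and the DataFrame-of-outliers interface are assumptions, not the actual smoothing module:

```python
MACH_1 = 340.29  # m/s

def smooth_with_fallback(points, primary_algo, backup_algo):
    # Each algo is assumed to return (filtered_points, outliers_df), where
    # outliers_df has a `speed` column, matching the logs above.
    filtered, outliers = primary_algo(points)
    if outliers.empty or outliers.speed.max() <= MACH_1:
        # First round is clean enough: keep its results.
        return filtered
    # Still have an impossibly fast point: retry with the backup (POSDAP).
    backup_filtered, backup_outliers = backup_algo(points)
    if backup_outliers.empty or backup_outliers.speed.max() <= MACH_1:
        return backup_filtered  # backup succeeded
    # Backup also failed; assumption here is that we keep the primary results.
    return filtered
```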
We now have a fix for this issue, but it is a bit ugly. Planning to move the retry code out to
There was a regression caused by e-mission/e-mission-server@cebb81f and fixed by
If there were no jumps in the data, then why did we even call the backup algo, and why did it filter out values? Investigating further. Before the change:
With the new one, we have
And then, it looks like the max is over the speed of sound. Maybe we should filter this after all?
The section is:
It would be good to check whether it indeed has a jump at the end.
OK, I read the values and calculated the speeds, and none of them are that high.
Aha! The big calculated outlier is around point 8
And we apparently "filled in" a large gap on iOS at around point 8
Note, however, that this happens for the next section as well
But it doesn't result in outliers
I am not going to dig deeper into the iOS fill code right now because, given our much finer-grained default data collection on iOS, this is not likely to happen in practice. Maybe we should remove it later?
To finish up this issue, we should not remove this newly added point, so the original implementation, without the regression, is the correct one. One final cleanup and then we can merge.
Re-running with the GIS branch, I still get the same trip/section segmentation.
In the re-run, we have
Before the logs are deleted, let's see how this was processed initially.
In the original processing, the previous trip had not ended
Let's look at where we detected a trip end in the re-run
In the original run, we had finished processing
How did we process that point in the original run? We closed out the trip as expected at 2022-12-27T19:07:29.865000-04:00, and when we got the new points at 2022-12-27T18:56:04.894000-04:00, we did not continue the old trip, but started a new one.
Tracking final issue with two batches vs. one batch in
Resetting use case 1, last trip:
place exit times: delete everything after the place exit...
Reset the pipeline for use case 1; it works fine.
For use case 2, resetting to
For use case 1, we do seem to have some issue with multiple matches. This may in fact be related, since we reset the pipeline to
and there are in fact multiple inferred sections for the cleaned section although they are all identical
Checked the confirmed trips and there are no overlaps. Only one entry even seems to be out of order and on checking the
quick check on the discrepancy in use case 1, given that we have no incoming data...
I cannot quite understand the discrepancy. We reset the pipeline only for one user. That user has the same number of trips before and after the reset, for a time range that spans the reset timestamp (
I can't think of how to test this any further, but we should take a mongodump before we reset the pipeline again and investigate further.