Giant, obvious GPS jump was not filtered #843

Closed

shankari opened this issue Dec 24, 2022 · 30 comments · Fixed by e-mission/e-mission-server#897

shankari commented Dec 24, 2022

[Screenshot: the trip with the giant, unfiltered GPS jump]

Two points after a big jump | one point | other point
[Three screenshots: zoomed-in views of the points around the jump]
@shankari

Those points are at the trip level; at the raw section level, they are:

One point | Other point
[Two screenshots: the same points at the raw section level]

@shankari

While processing, we use the IQR to find potential clusters, and find 6 of them:

2023-01-21 17:54:25,005:DEBUG:4599795200:For cluster 0 - 11, distance = 3.065360237392023, is_cluster = True
2023-01-21 17:54:25,006:DEBUG:4599795200:For cluster 11 - 16, distance = 40.23523836440616, is_cluster = True
2023-01-21 17:54:25,006:DEBUG:4599795200:For cluster 16 - 17, distance = 0.0, is_cluster = True
2023-01-21 17:54:25,006:DEBUG:4599795200:For cluster 17 - 21, distance = 1.179366164322961, is_cluster = True
2023-01-21 17:54:25,007:DEBUG:4599795200:For cluster 21 - 26, distance = 9.268393002564368, is_cluster = True
2023-01-21 17:54:25,007:DEBUG:4599795200:For cluster 26 - 89, distance = 29.561381182750548, is_cluster = True
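
For reference, here is a minimal sketch of the IQR-based thresholding that produces the quartile/iqr/maxSpeed log lines in this thread. It assumes a pandas DataFrame with a speed column; the multiplier k = 3 is consistent with at least one maxSpeed value in the logs below, but both it and the function name are illustrative, not the exact server code.

```
import pandas as pd

def iqr_jump_indices(with_speeds_df: pd.DataFrame, k: float = 3.0) -> pd.Index:
    """Flag points whose speed exceeds Q3 + k * IQR (illustrative sketch)."""
    quartiles = with_speeds_df.speed.quantile([0.25, 0.75])
    iqr = quartiles.loc[0.75] - quartiles.loc[0.25]
    max_speed = quartiles.loc[0.75] + k * iqr
    return with_speeds_df.index[with_speeds_df.speed > max_speed]
```

The flagged jump points then act as boundaries for the candidate segments (0-11, 11-16, ...) listed above, each of which is checked for whether it is a cluster.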

The clusters that are most interesting to us are segments 1, 2 and 3

fig.add_subplot(1,3,1).add_child(map_list[1][0])
fig.add_subplot(1,3,2).add_child(map_list[2][0])
fig.add_subplot(1,3,3).add_child(map_list[3][0])

[Screenshot: maps of segments 1, 2 and 3, plotted side by side]

@shankari

First, this seems to be the case where all the segments are clusters, which is a corner case, and probably why it is failing.
In this case, the 16-17 segment was computed to have a length of 0, so it was picked as a bad cluster.

        if len(non_cluster_segments) == 0:
            # If every segment is a cluster, then it is very hard to
            # distinguish between them for zigzags. Let us see if there is any
            # one point cluster - i.e. where the distance is zero. If so, that is likely
            # to be a bad cluster, so we return the one to the right or left of it
            minDistanceCluster = segment_distance_df.distance.idxmin()
            if minDistanceCluster == 0:
                goodCluster = minDistanceCluster + 1
                assert(goodCluster < len(segment_list))
                return goodCluster
            else:
                goodCluster = minDistanceCluster - 1
                assert(goodCluster >= 0)
                return goodCluster
2023-01-21 17:54:25,008:DEBUG:4599795200:non_cluster_segments Empty DataFrame
Columns: [distance, is_cluster]
Index: []
2023-01-21 17:54:25,009:DEBUG:4599795200:Processing segment 2: Segment(16, 17, 0.0), expecting state Segment_State.BAD
2023-01-21 17:54:25,009:DEBUG:4599795200:At the end of the loop for direction IterationDirection.RIGHT, i = 3

But if we look at the speeds and distances of interest, we have

id latitude longitude fmt_time distance speed
15 XXXX YYYY 2022-10-23T08:15:44.029000-04:00 4.473726e+01 0.111098
16 -0.490372 1.069103 2022-10-23T08:18:25.431000-04:00 9.128037e+06 56554.671637
17 -0.497175 1.071761 2022-10-23T08:18:30.322000-04:00 8.121292e+02 166.045639
18 -0.497173 1.071759 2022-10-23T08:18:32.742000-04:00 3.024755e-01 0.124990
19 -0.497171 1.071755 2022-10-23T08:18:35.222000-04:00 4.475375e-01 0.180459
20 -0.497169 1.071752 2022-10-23T08:18:37.383000-04:00 4.326480e-01 0.200207
21 XXXX YYYY 2022-10-23T09:07:26.414000-04:00 9.128553e+06 3116.577828

The distance from 16 to 17 is actually 812 m (covered at 166 m/s), not 0...


shankari commented Jan 22, 2023

Stepping through the code, the detected threshold is

2023-01-21 17:54:25,004:DEBUG:4599795200:maxSpeed = 2.669519511324665

While computing distances, we recompute the distance between the start and end points of each segment, which is why we end up with a distance of 0 for the (16, 17) segment: if it uses loc[16:17], the segment contains only one location point.

So [16:17] is indeed a one-point cluster, and we can assume that it is bad.
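
A minimal, self-contained illustration of that slicing behavior; the DataFrame here is made up for the example, but any distance function, haversine included, returns 0 for identical points.

```
import pandas as pd

# two rows standing in for points 16 and 17 of the section
points = pd.DataFrame({"latitude": [-0.490372, -0.497175],
                       "longitude": [1.069103, 1.071761]})

# an exclusive-end [16:17] style slice keeps only the first of the two points
segment = points.iloc[0:1]
start, end = segment.iloc[0], segment.iloc[-1]   # the same row
print(start.equals(end))   # True -> the recomputed segment distance is 0.0
```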

But we then return the segment before it (11-16) as good, and following the expected alternation, we get

2023-01-21 17:54:25,009:DEBUG:4599795200:Processing segment 2: Segment(16, 17, 0.0), expecting state Segment_State.BAD
2023-01-21 17:54:25,009:DEBUG:4599795200:Processing segment 3: Segment(17, 21, 1.179366164322961), expecting state Segment_State.GOOD
2023-01-21 17:54:25,009:DEBUG:4599795200:Processing segment 4: Segment(21, 26, 9.268393002564368), expecting state Segment_State.BAD
2023-01-21 17:54:25,009:DEBUG:4599795200:Processing segment 5: Segment(26, 89, 29.561381182750548), expecting state Segment_State.GOOD
2023-01-21 17:54:25,009:DEBUG:4599795200:Processing segment 0: Segment(0, 11, 3.065360237392023), expecting state Segment_State.BAD

And so we end up deleting [ 0 1 2 3 4 5 6 7 8 9 10 16 21 22 23 24 25] when we should in fact only delete [16, 17, 18, 19, 20]

The real problem is that we expect to find alternating GOOD and BAD segments, but we actually have two back-to-back BAD segments (16-17 and 17-21). And we should not delete the points from 21 to 26.


shankari commented Jan 22, 2023

Just to confirm, if I set the coordinates of 16 to -0.497175, 1.071761 and re-run the pipeline, it is merged into the 17-21 cluster and we end up with the following

2023-01-22 08:36:50,565:DEBUG:4615695872:For cluster 0 - 11, distance = 3.065360237392023, is_cluster = True
2023-01-22 08:36:50,565:DEBUG:4615695872:For cluster 11 - 16, distance = 40.23523836440616, is_cluster = True
2023-01-22 08:36:50,566:DEBUG:4615695872:For cluster 16 - 21, distance = 1.1822994501886754, is_cluster = True
2023-01-22 08:36:50,566:DEBUG:4615695872:For cluster 21 - 26, distance = 9.268393002564368, is_cluster = True
2023-01-22 08:36:50,566:DEBUG:4615695872:For cluster 26 - 89, distance = 29.561381182750548, is_cluster = True

Segment 1 (11-16) is again GOOD

2023-01-22 08:36:50,568:DEBUG:4615695872:Processing segment 2: Segment(16, 21, 1.1822994501886754), expecting state Segment_State.BAD
2023-01-22 08:36:50,568:DEBUG:4615695872:At the end of the loop for direction IterationDirection.RIGHT, i = 3
2023-01-22 08:36:50,568:DEBUG:4615695872:Processing segment 3: Segment(21, 26, 9.268393002564368), expecting state Segment_State.GOOD
2023-01-22 08:36:50,568:DEBUG:4615695872:At the end of the loop for direction IterationDirection.RIGHT, i = 4
2023-01-22 08:36:50,568:DEBUG:4615695872:Processing segment 4: Segment(26, 89, 29.561381182750548), expecting state Segment_State.BAD
2023-01-22 08:36:50,568:DEBUG:4615695872:At the end of the loop for direction IterationDirection.RIGHT, i = 5
2023-01-22 08:36:50,568:DEBUG:4615695872:Finished marking segment states for direction IterationDirection.RIGHT
2023-01-22 08:36:50,568:DEBUG:4615695872:Processing segment 0: Segment(0, 11, 3.065360237392023), expecting state Segment_State.BAD
2023-01-22 08:36:50,568:DEBUG:4615695872:At the end of the loop for direction IterationDirection.LEFT, i = -1
2023-01-22 08:36:50,568:DEBUG:4615695872:Finished marking segment states for direction IterationDirection.LEFT

So we end up deleting points [ 0 1 2 3 4 5 6 7 8 9 10 16 17 18 19 20 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88]

Maybe it would be easier to see that we are retaining [11 12 13 14 15 21 22 23 24 25], which are basically two clumps on opposite sides of a lake, which is actually pretty correct. So really the problem is that we expect good and bad segments to alternate, and that doesn't always seem to be the case.

Note that before we changed the lat/lon for point 16, we still had outliers after the filtering; now we don't:

2023-01-22 10:14:28,694:DEBUG:4560678400:quartile values are 0.25    0.149342
0.75    0.712335
Name: speed, dtype: float64
2023-01-22 10:14:28,694:DEBUG:4560678400:iqr 0.5629931387473088
2023-01-22 10:14:28,715:INFO:4560678400:After first round, still have outliers     accuracy   altitude  ...      distance         speed
17    70.051  88.551857  ...  9.128723e+06  54895.414928
26     3.778  66.404068  ...  9.128552e+06   3103.242852
49     3.900  72.118635  ...  5.182353e+00      2.591177

[3 rows x 24 columns]
2023-01-22 08:36:50,576:DEBUG:4615695872:quartile values are 0.25    0.415091
0.75    1.433652
Name: speed, dtype: float64

So in order to avoid regressions, let's try a second method if the first method still has outliers.

@shankari

I can think of two possible second methods:
(1) find a pair of super-fast speed jumps and delete everything in between. The speeds here are 54895.414928 and 3103.242852 m/s; the speed of sound is 340.29 m/s. The fastest commercial planes were around Mach 2, and none of them are flying now. Everything else is test aircraft, which are not going to show up in our dataset:
https://internationalaviationhq.com/2020/06/27/17-fastest-aircraft/
So if we find a matched pair of jumps (> Mach 2) that are < 5 mins apart, we delete them and any points associated with them.
(2) try to use clustering to find clusters of bad clusters instead of assuming that there is always a good/bad alternation.

I'm tempted to go with (1) first since it fits into our current heuristic-based solution better than (2). If we have (2) for example, do we still need the current alternation method?

Let's implement (1) and see if it also fixes the large "not a trip" for mm_masscec

[Screenshot: the large "not a trip" for mm_masscec]


shankari commented Jan 22, 2023

The large trip for mm_masscec had only one point in the jump. Let's see why it wasn't captured.

_id latitude longitude fmt_time distance speed
10 63abd6b780ea0c4fbb3a8612 XXXX YYYY 2022-12-27T19:11:31.852000-04:00 2.677547e-01 8.367072e-03
11 63abd6ba80ea0c4fbb3a865a 33.652812 73.087085 2022-12-28T01:18:22.615000-04:00 1.299755e+07 5.905088e+02
12 63abd6ba80ea0c4fbb3a865c XXXX YYYY 2022-12-28T01:18:25.744000-04:00 1.299756e+07 4.153902e+06
13 63abd6ba80ea0c119978254e XXXX YYYY 2022-12-28T01:18:56.608000-04:00 0.000000e+00 0.000000e+00

@shankari

Looking at the logs, this is another example where all the segments are clusters, so the IQR was super small.

2022-12-28 06:39:06,288:DEBUG:140290522912576:For cluster 0 - 1, distance = 0.0, is_cluster = True
2022-12-28 06:39:06,289:DEBUG:140290522912576:For cluster 1 - 2, distance = 0.0, is_cluster = True
2022-12-28 06:39:06,290:DEBUG:140290522912576:For cluster 2 - 11, distance = 1.888352145235123, is_cluster = True
2022-12-28 06:39:06,290:DEBUG:140290522912576:For cluster 11 - 12, distance = 0.0, is_cluster = True
2022-12-28 06:39:06,291:DEBUG:140290522912576:For cluster 12 - 14, distance = 0.0, is_cluster = True
2022-12-28 06:39:06,292:DEBUG:140290522912576:For cluster 14 - 21, distance = 2.2736647837268644, is_cluster = True

Note also that there was a big gap between point 10 and point 11, so although the jump was large (~13,000 km over ~22,000 s), the speed (590 m/s) is under Mach 2 (2 × 340.29 = 680.58 m/s).

And then, since there are multiple zero-length clusters, we pick one of them as BAD and then do our alternating bad and good, which ends up with 2-11 as BAD and 11-12 as GOOD

2022-12-28 06:39:06,294:DEBUG:140290522912576:non_cluster_segments Empty DataFrame
Columns: [distance, is_cluster]
Index: []
2022-12-28 06:39:06,294:DEBUG:140290522912576:Processing segment 2: Segment(2, 11, 1.888352145235123), expecting state Segment_State.BAD
2022-12-28 06:39:06,294:DEBUG:140290522912576:At the end of the loop for direction IterationDirection.RIGHT, i = 3
2022-12-28 06:39:06,295:DEBUG:140290522912576:Processing segment 3: Segment(11, 12, 0.0), expecting state Segment_State.GOOD
2022-12-28 06:39:06,295:DEBUG:140290522912576:At the end of the loop for direction IterationDirection.RIGHT, i = 4
2022-12-28 06:39:06,295:DEBUG:140290522912576:Processing segment 4: Segment(12, 14, 0.0), expecting state Segment_State.BAD
2022-12-28 06:39:06,295:DEBUG:140290522912576:At the end of the loop for direction IterationDirection.RIGHT, i = 5
2022-12-28 06:39:06,295:DEBUG:140290522912576:Processing segment 5: Segment(14, 21, 2.2736647837268644), expecting state Segment_State.GOOD
2022-12-28 06:39:06,295:DEBUG:140290522912576:At the end of the loop for direction IterationDirection.RIGHT, i = 6
2022-12-28 06:39:06,295:DEBUG:140290522912576:Finished marking segment states for direction IterationDirection.RIGHT
2022-12-28 06:39:06,295:DEBUG:140290522912576:Processing segment 0: Segment(0, 1, 0.0), expecting state Segment_State.BAD
2022-12-28 06:39:06,295:DEBUG:140290522912576:At the end of the loop for direction IterationDirection.LEFT, i = -1
2022-12-28 06:39:06,296:DEBUG:140290522912576:Finished marking segment states for direction IterationDirection.LEFT

So we end up deleting almost everything other than the one item that we actually want to delete, namely point 11.

2022-12-28 06:39:06,297:DEBUG:140290522912576:after setting values, outlier_mask = [ 0  2  3  4  5  6  7  8  9 10 12 13]

We do still have outliers after the first check is done

2022-12-28 06:39:06,311:DEBUG:140290522912576:quartile values are 0.25      0.013428
0.75    438.430600
Name: speed, dtype: float64
2022-12-28 06:39:06,312:DEBUG:140290522912576:iqr 438.4171720372147
2022-12-28 06:39:06,344:INFO:140290522912576:After first round, still have outliers     accuracy    altitude  ...      distance          speed
14     5.638  470.899994  ...  1.299756e+07  332690.597033
[1 rows x 24 columns]

So again, the problem seems to be with these weird trips where everything is a cluster. Our previous plan to look at high-speed jumps should work, although given the big gap in time, we need to tone down the check to Mach 1 instead of Mach 2.

@shankari

So the planned fix is:

  • if there are still outliers after the first round, identify any jumps that are over the speed of sound (Mach 1, 340.29 m/s).
  • if there are two of these, and they are within 10 mins of each other, and the points just before and just after them are within 100 meters of each other, delete all points between them. Then re-run the zig-zag algorithm on the resulting points and append the newly deleted points. A rough sketch of this follows.
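
Here is that sketch, assuming a pandas DataFrame with latitude/longitude/ts/speed columns where speed[i] is computed from point i-1 to point i. The names and structure are illustrative, not the final server implementation.

```
import numpy as np
import pandas as pd

MACH1 = 340.29            # m/s
PAIR_GAP_SECS = 10 * 60   # the two jumps must be within 10 minutes
LOOP_DIST_M = 100         # bracketing points must be within 100 meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters (standard haversine formula)."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dp, dl = np.radians(lat2 - lat1), np.radians(lon2 - lon1)
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * 6371000 * np.arcsin(np.sqrt(a))

def find_supersonic_loop(with_speeds_df: pd.DataFrame):
    """Return (start, end) row positions of a matched pair of supersonic
    jumps, or None if there is no such pair."""
    jumps = [i for i, s in enumerate(with_speeds_df.speed) if s > MACH1]
    if len(jumps) < 2:
        return None
    start, end = jumps[0], jumps[-1]
    if start == 0:
        return None  # need a point just before the outbound jump
    if with_speeds_df.ts.iloc[end] - with_speeds_df.ts.iloc[start] > PAIR_GAP_SECS:
        return None
    # point just before the outbound jump vs. point just after the return jump
    before, after = with_speeds_df.iloc[start - 1], with_speeds_df.iloc[end]
    d = haversine_m(before.latitude, before.longitude,
                    after.latitude, after.longitude)
    return (start, end) if d <= LOOP_DIST_M else None
```

If a pair is found, the rows from start to end - 1 would be deleted, the zig-zag algorithm re-run on the remainder, and the newly deleted points appended, as described above.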

shankari added a commit to shankari/e-mission-server that referenced this issue Jan 23, 2023
…nts are clusters

Once the actual issue is addressed, this will fix
e-mission/e-mission-docs#843

For now, we load the location dataframes for the two use cases and verify that
the returned values are the ones in the current implementation.

Procedure:
- Perturb the location points in the original use cases to avoid leaking information
- Load the location points into the test case
- Run the filtering code
- Verify that the output is consistent with
e-mission/e-mission-docs#843 (comment)
e-mission/e-mission-docs#843 (comment)

Also change the location smoothing code from `logging.info` to
`logging.exception` so that we can see where the error is in a more meaningful way

Testing done:
- Test passes

```
----------------------------------------------------------------------
Ran 1 test in 0.387s
```

Note that due to the perturbation of the location points, the outliers no
longer perfectly match the original use case, but are close enough

```
2023-01-22 22:37:57,262:INFO:4634275328:After first round, still have outliers     accuracy   altitude  ...      distance         speed
17    70.051  88.551857  ...  8.468128e+06  50922.935508
26     3.778  66.404068  ...  8.467873e+06   2878.645674
49     3.900  72.118635  ...  4.673209e+00      2.336605

2023-01-22 22:37:57,308:INFO:4634275328:After first round, still have outliers     Unnamed: 0  accuracy    altitude  ...    heading      distance          speed
14          14     5.638  470.899994  ...  88.989357  1.113137e+07  284923.028227

```
@shankari

We have already implemented multiple smoothing algorithms, so instead of adding a new one, let's see if one of them, e.g. POSDAP, will work as the backup

For use case 1:

while considering point 16, speed = 52461.47140936417
currSpeed > 340, starting new quality segment at index 16
while considering point 17, speed = 165.74234069705636
currSpeed < 340, retaining index 17 in existing quality segment

while considering point 21, speed = 2891.014996913109
currSpeed > 340, starting new quality segment at index 21
while considering point 22, speed = 0.13761215470733973
currSpeed < 340, retaining index 22 in existing quality segment

Number of quality segments is 3
Considering segments [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] and [16, 17, 18, 19, 20]
About to compare curr_segment duration 11.95199990272522 with last segment duration 644.021999835968
curr segment [16, 17, 18, 19, 20] is shorter, cut it

Considering segments [16, 17, 18, 19, 20] and [21, 22, ... 88]
About to compare curr_segment duration 827.8680000305176 with last segment duration 11.95199990272522
prev segment [16, 17, 18, 19, 20] is shorter, cut it

Filtering complete, removed indices = [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 21 22 23 24 25 26 27 28
 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
 77 78 79 80 81 82 83 84 85 86 87 88]

So it looks like we removed everything other than 16-20? Maybe I just have to edit POSDAP to flip the signs.
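
For context, here is a minimal sketch of the quality-segment logic visible in this trace: start a new segment at every supersonic point, then between each pair of adjacent segments cut the one with the shorter duration. The names are illustrative and POSDAP's actual implementation differs in detail; as the trace shows, the sign of the result also needs flipping.

```
def split_quality_segments(speeds, threshold=340.0):
    """Start a new quality segment at each point whose speed exceeds
    the threshold (illustrative)."""
    segments, current = [], [0]
    for i in range(1, len(speeds)):
        if speeds[i] > threshold:
            segments.append(current)
            current = [i]
        else:
            current.append(i)
    segments.append(current)
    return segments

def mark_shorter_segments(segments, ts):
    """For each adjacent pair of segments, mark the one with the shorter
    duration as bad; return all bad indices (illustrative)."""
    def duration(seg):
        return ts[seg[-1]] - ts[seg[0]]
    bad = set()
    for prev, curr in zip(segments, segments[1:]):
        bad.update(curr if duration(curr) < duration(prev) else prev)
    return sorted(bad)
```

On the trace above, this marks [16, 17, 18, 19, 20] as bad in both comparisons, which matches the set we actually want to remove.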

Although after the second round, our outliers are not the original ones?

2023-01-23 16:05:41,143:INFO:4427972096:After second round, still have outliers     accuracy   altitude  elapsedRealtimeNanos  ...     heading   distance     speed
11   164.400  52.826053       127394353000000  ...   74.375112  17.876278  3.558176
26     3.778  66.404068       130926514000000  ... -170.883629   8.164765  2.721588

[2 rows x 24 columns]

Use case 2:

while considering point 11, speed = 505.7237830222791
currSpeed > 340, starting new quality segment at index 11

while considering point 12, speed = 3557486.793555044
currSpeed > 340, starting new quality segment at index 12

Number of quality segments is 3

2023-01-23 16:05:41,205:INFO:4427972096:Filtering complete, removed indices = [ 0  1  2  3  4  5  6  7  8  9 10 12 13 14 15 16 17 18 19 20]

@shankari

Ok, so printing out the outliers instead of the inliers indicates that this does seem to work

2023-01-23 20:55:05,705:INFO:4580654592:Filtering complete, retained indices = (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 21,
       22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
       39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
       56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
       73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88]),), removed indices = (array([16, 17, 18, 19, 20]),)

2023-01-23 20:55:05,794:INFO:4580654592:Filtering complete, retained indices = (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 12, 13, 14, 15, 16, 17,
       18, 19, 20]),), removed indices = (array([11]),)

However, even after removing them, we still have "outliers". Since most of the points are in clusters, these outliers are pretty slow:

2023-01-23 20:55:05,717:DEBUG:4580654592:quartile values are 0.25    0.137612
0.75    0.724411
Name: speed, dtype: float64
2023-01-23 20:55:05,717:DEBUG:4580654592:iqr 0.5867984458938249
2023-01-23 20:55:05,738:INFO:4580654592:After second round, still have outliers     accuracy   altitude  elapsedRealtimeNanos  ...     heading   distance     speed
11   164.400  52.826053       127394353000000  ...   74.375112  17.876278  3.558176
26     3.778  66.404068       130926514000000  ... -170.883629   8.164765  2.721588

[2 rows x 24 columns]
2023-01-23 20:55:05,804:DEBUG:4580654592:quartile values are 0.25    0.006758
0.75    0.037754
Name: speed, dtype: float64
2023-01-23 20:55:05,804:DEBUG:4580654592:iqr 0.030995491580030114
2023-01-23 20:55:05,825:INFO:4580654592:After second round, still have outliers     Unnamed: 0  accuracy    altitude  ...     heading   distance     speed
1            1   135.600    2.600000  ...   57.784408  29.242851  1.635506
2            2     8.817    2.600000  ... -122.554813  29.260349  2.230890
14          14     5.638  470.899994  ...   88.989357   6.934772  1.366458

[3 rows x 25 columns]

For this second check, we should verify that the remaining outliers are in fact large.

@shankari

After adding that check, the correct values are deleted.
For case 1, the trip goes from one side of a lake to another
For case 2, while re-running the pipeline, we were running on a branch forked from master, so we ended up with a slightly different trip and section segmentation.

Will need to pull changes to the GIS branch and re-run to confirm that it works correctly here as well.

shankari added a commit to shankari/e-mission-server that referenced this issue Jan 24, 2023
- Since we have already implemented many different smoothing algorithms, we
  pick POSDAP to use as backup
- if we still have outliers after the first round, and the max value is over
  MACH1, we fall back to the backup algo
- after running the backup algo, if we don't have outliers,
  the backup algo has succeeded and we use its results
- if we do have outliers, but the max value is under MACH1,
  the backup algo has still succeeded and we use its results
- if we have outliers, and the max is high (> MACH1),
  the backup algo has failed

With this change, both the tests also change to the correctly deleted values
- [16 17 18 19 20] for use case 1 (e-mission/e-mission-docs#843 (comment))
- [11] for use case 2 (e-mission/e-mission-docs#843 (comment))

In this commit, we also check in the csv data files for the two test cases
@shankari

We now have a fix for this issue, but it is a bit ugly
e-mission/e-mission-server@67f5c86

Planning to move the retry code out to LocationSmoothing instead of keeping it in the algo.
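
A minimal sketch of what that refactor could look like, with the retry living in the caller rather than inside the algorithm. primary_algo, backup_algo, and recompute_speeds are hypothetical stand-ins for the server's zigzag filter, the POSDAP filter, and the speed recomputation.

```
MACH1 = 340.29  # m/s

def filter_jumps(with_speeds_df, primary_algo, backup_algo, recompute_speeds):
    """Illustrative retry wrapper implementing the decision logic from the
    commit message above; all three callables are hypothetical stand-ins."""
    mask = primary_algo(with_speeds_df)
    if recompute_speeds(with_speeds_df[mask]).speed.max() < MACH1:
        return mask           # no supersonic outliers left after the first round

    backup_mask = backup_algo(with_speeds_df)
    if recompute_speeds(with_speeds_df[backup_mask]).speed.max() < MACH1:
        return backup_mask    # backup succeeded (no outliers, or only slow ones)

    return mask               # backup failed too; fall back to the primary result
```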


shankari commented Jan 25, 2023

There was a regression caused by e-mission/e-mission-server@cebb81f and fixed by
e-mission/e-mission-server@5a4ae3d
which still seems a bit suspicious.

If there were no jumps in the data, then why did we even call the backup algo and why did it filter out values?

Investigating further:

Before the change:

quartile values are 0.25    1.216382
0.75    1.351544
Name: speed, dtype: float64
iqr 0.13516174772424905
maxSpeed = 1.7570292255633762
After first step, jumps = Int64Index([], dtype='int64')
for ios, returning all_jumps = []
 last_point index 12 not in found points [0]
 added new entry 12
For cluster 0 - 12, distance = 388.3474593735831, is_cluster = False
 segment_list = [Segment(0, 12, 388.3474593735831)]
After splitting, segment list is [Segment(0, 12, 388.3474593735831)] with size 1
No jumps, nothing to filter, early return, series = 0     True

With the new one, we have

quartile values are 0.25    1.216382
0.75    1.351544
Name: speed, dtype: float64
iqr 0.13516174772424905
maxSpeed = 1.7570292255633762
After first step, jumps = Int64Index([], dtype='int64')
for ios, returning all_jumps = []
 last_point index 12 not in found points [0]
 added new entry 12
For cluster 0 - 12, distance = 388.3474593735831, is_cluster = False
 segment_list = [Segment(0, 12, 388.3474593735831)]
After splitting, segment list is [Segment(0, 12, 388.3474593735831)] with size 1
No jumps, nothing to filter, early return

And then, it looks like the max is over the speed of sound. Maybe we should filter this after all?

quartile values are 0.25    1.216382
0.75    1.351544
Name: speed, dtype: float64
iqr 0.13516174772424905
After first round, recomputed max = 34113180.058238134, recomputed threshold = 1.7570292255633762
After first round, still have outliers    index  ...         speed
1      1  ...  6.616681e+00
9      8  ...  3.411318e+07

2023-01-24 18:35:13,599:INFO:4522135040:Filtering complete, retained indices = (array([0, 1, 2, 3, 4, 5, 6, 7, 8]),), removed indices = (array([ 9, 10, 11]),)

2023-01-24 18:35:13,610:INFO:4522135040:After second round, max = 6.6166812363619245, recomputed threshold = 1.7168245431845124

2023-01-24 18:35:13,635:DEBUG:4522135040:But they are all < 340.29, so returning backup to delete (array([ 9, 10, 11]),)
2023-01-24 18:35:13,635:INFO:4522135040:After all checks, inlier mask = (array([0, 1, 2, 3, 4, 5, 6, 7, 8]),), outlier_mask = (array([ 9, 10, 11]),)

The section is:

{'key': 'segmentation/raw_section', 'start_fmt_time': '2016-07-20T21:25:45.641876-07:00', 'end_fmt_time': '2016-07-20T21:32:04.995923-07:00'}

Would be good to check if it indeed has a jump at the end

@shankari

OK.. I read the values and calculated the speeds, and none of them are that high.
Poking around a bit more to check that we are in fact recomputing them correctly...

_id latitude longitude fmt_time distance speed
63d099cd1118ee44abc85f95 37.349984 -122.032194 2016-07-20T21:25:45.641876-07:00 0.000000 0.000000
63d099cd1118ee44abc85f97 37.349891 -122.032133 2016-07-20T21:25:47.396046-07:00 11.606783 6.616681
63d099ce1118ee44abc85f99 37.349431 -122.032164 2016-07-20T21:26:29.999997-07:00 51.215544 1.202131
63d099ce1118ee44abc85f9b 37.348980 -122.032154 2016-07-20T21:27:07.999998-07:00 50.210762 1.321336
63d099ce1118ee44abc85f9f 37.348529 -122.032218 2016-07-20T21:27:48.999998-07:00 50.482474 1.231280
63d099ce1118ee44abc85fa1 37.348075 -122.032218 2016-07-20T21:28:25.999998-07:00 50.399271 1.362142
63d099ce1118ee44abc85fa3 37.347621 -122.032208 2016-07-20T21:29:03.999997-07:00 50.570575 1.330805
63d099ce1118ee44abc85fa5 37.347160 -122.032182 2016-07-20T21:29:45.999996-07:00 51.287565 1.221133
63d099ce1118ee44abc85fa7 37.346809 -122.032548 2016-07-20T21:30:45.999997-07:00 50.735176 0.845586
63d099ce1118ee44abc85fa9 37.346723 -122.033107 2016-07-20T21:31:26.995922-07:00 50.286425 1.226620
63d099ce1118ee44abc85fae 37.346698 -122.033686 2016-07-20T21:32:04.995923-07:00 51.224425 1.348011
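
For reference, a minimal sketch of that recomputation, reusing the haversine_m helper from the sketch a few comments up; ts is the epoch timestamp column, and the exact server helper may differ.

```
import pandas as pd

def add_dist_and_speed(df: pd.DataFrame) -> pd.DataFrame:
    """Recompute per-point distance (m) and speed (m/s) from consecutive
    lat/lon fixes; the first point gets 0 for both (illustrative).
    haversine_m is the great-circle helper defined in the earlier sketch."""
    dist = haversine_m(df.latitude.shift(), df.longitude.shift(),
                       df.latitude, df.longitude)
    speed = dist / df.ts.diff()
    return df.assign(distance=dist.fillna(0), speed=speed.fillna(0))
```

Applied to the table above, the largest recomputed speed is about 6.6 m/s, matching the values shown.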

@shankari

Aha! The big calculated outlier is around point 8

2023-01-24 21:54:17,737:INFO:4623691264:After first round, still have outliers    index  ...         speed
1      1  ...  6.616681e+00
9      8  ...  3.411318e+07

And we apparently "filled in" a large gap on iOS at around point 8

2023-01-24 21:54:17,644:DEBUG:4623691264:Found 1 large gaps, filling them all
2023-01-24 21:54:17,644:DEBUG:4623691264:Found large gap ending at 8, filling it
2023-01-24 21:54:17,644:DEBUG:4623691264:start = 7, end = 8, generating entries between 1469075385.999996 and 1469075445.999997

Note, however, that this happens for the next section as well

2023-01-24 21:54:17,817:DEBUG:4623691264:Found iOS section, filling in gaps with fake data
2023-01-24 21:54:17,818:DEBUG:4623691264:Found 1 large gaps, filling them all
2023-01-24 21:54:17,818:DEBUG:4623691264:Found large gap ending at 28, filling it
2023-01-24 21:54:17,818:DEBUG:4623691264:start = 27, end = 28, generating entries between 1469076690.999049 and 1469076766.999048
2023-01-24 21:54:17,819:DEBUG:4623691264:Fill lengths are: dist 1, angle 1, lat 1, lng 1

But it doesn't result in outliers

2023-01-24 21:54:17,873:INFO:4623691264:After first round, recomputed max = 1.3911640011230477, recomputed threshold = 1.625639405160407


shankari commented Jan 25, 2023

The filled-in entry is not an outlier and should not be deleted.

Before fill | After fill
[Two screenshots: the section before and after gap filling]

Final question: why is the calculated speed so large?
Because the time delta for the filled-in point is very small; in the third delta below, 32.44 m in ~9.5e-07 s works out to roughly 3.4e+07 m/s, consistent with the recomputed max of 34113180 above.

Distance delta = 51.28756532378865 and time delta = 41.999999046325684
Distance delta = 22.49458343565475 and time delta = 60.0
Distance delta = 32.436446584498526 and time delta = 9.5367431640625e-07
Distance delta = 50.28642549839274 and time delta = 40.995925188064575

@shankari

I am not going to dig deeper into the iOS fill code right now because, given our much finer-grained default data collection on iOS, this is not likely to happen in practice. Maybe we should remove it later?
Let's file an issue to follow up on this later:
#848

@shankari

To finish up this issue, we should not remove this newly added point, so the original implementation, without the regression, is the correct one.

One final cleanup and then we can merge.


shankari commented Jan 25, 2023

Re-running with the GIS branch, I still get the same trip/section segmentation.
Note that there is in fact a big gap between the two points, so not sure why they were part of a trip in the first place.

idx id lat lon fmt_time distance speed
10 63abd6b780ea0c4fbb3a8612 XXXX YYYY 2022-12-27T19:11:31.852000-04:00 2.677547e-01 8.367072e-03
11 63abd6ba80ea0c4fbb3a865a 33.652812 73.087085 2022-12-28T01:18:22.615000-04:00 1.299755e+07 5.905088e+02

In the re-run, we have

2023-01-25 07:03:17,245:DEBUG:4580113920:------------------------------2022-12-27T19:11:31.852000-04:00------------------------------
2023-01-25 07:03:17,246:INFO:4580113920:Points ... are within the distance filter and only 1 min apart so part of the same trip

2023-01-25 07:03:17,246:DEBUG:4580113920:------------------------------2022-12-28T01:18:22.615000-04:00------------------------------
2023-01-25 07:03:17,247:DEBUG:4580113920:Setting new trip start point with idx 327

2023-01-25 07:03:17,249:DEBUG:4580113920:------------------------------2022-12-28T01:18:25.744000-04:00------------------------------
2023-01-25 07:03:17,267:DEBUG:4580113920:Too few points to make a decision, continuing

...

2023-01-25 07:03:17,910:DEBUG:4580113920:------------------------------2022-12-28T01:23:31.686000-04:00------------------------------
2023-01-25 07:03:17,929:DEBUG:4580113920:prev_point.ts = 1672204982.682, curr_point.ts = 1672205011.686, time gap = 29.004000186920166 (vs 300), distance_gap = 0.5535023040251027 (vs 100), speed_gap = 0.019083653994551888 (vs 0.3333333333333333) continuing trip
2023-01-25 07:03:17,929:DEBUG:4580113920:last5MinsDistances.max() = 12.12844463675069, last10PointsDistance.max() = 2.3905017930227643
2023-01-25 07:03:17,930:DEBUG:4580113920:last5MinsPoints and last10PointsMedian found, last_trip_end_index = 336
2023-01-25 07:03:17,930:INFO:4580113920:Found trip end at 2022-12-28T01:21:01.685000-04:00
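
For context, a heavily simplified sketch of the dwell check these log lines reflect: a trip end is declared when all recent points stay within the distance filter. The real segmenter also consults transitions, motion activity, and the huge-gap logic, so treat this as illustrative only.

```
def dwell_detected(last_5min_distances, last_10pt_distances, dist_filter_m=100):
    """Illustrative: the device has not moved beyond the distance filter
    in the last five minutes / last ten points, so end the trip here."""
    return (len(last_5min_distances) > 0
            and max(last_5min_distances) < dist_filter_m
            and max(last_10pt_distances) < dist_filter_m)
```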

Before the logs are deleted, let's see how this was processed initially

@shankari

In the original processing, the previous trip had not ended

2022-12-28 06:37:55,880:DEBUG:140290522912576:------------------------------2022-12-27T19:09:30.849000-04:00------------------------------
2022-12-28 06:37:56,749:DEBUG:140290522912576:prev_point.ts = 1672182539.86, curr_point.ts = 1672182570.849, time gap = 30.98900008201599 (vs 300), distance_gap = 0.07455484590077509 (vs 100), speed_gap = 0.0024058487109444326 (vs 0.3333333333333333) continuing trip

2022-12-28 06:37:59,396:DEBUG:140290522912576:------------------------------2022-12-27T19:11:31.852000-04:00------------------------------
2022-12-28 06:37:59,400:DEBUG:140290522912576:last5MinsDistances = [ 1.85212427 32.75416379  1.88835215  1.83614084  1.83614084  1.74143561
1.6698634   1.36609206  0.98469344  0.26775467] with length 10
2022-12-28 06:37:59,403:DEBUG:140290522912576:last10PointsDistances = [32.75416379  1.88835215  1.83614084  1.83614084  1.74143561  1.6698634
1.36609206  0.98469344  0.26775467  0.        ] with length 10, shape (10,)
2022-12-28 06:37:59,405:DEBUG:140290522912576:len(last10PointsDistances) = 10, len(last5MinsDistances) = 10
2022-12-28 06:37:59,405:DEBUG:140290522912576:last5MinsTimes.max() = 241.9869999885559, time_threshold = 300
2022-12-28 06:38:00,273:DEBUG:140290522912576:prev_point.ts = 1672182659.851, curr_point.ts = 1672182691.852, time gap = 32.00099992752075 (vs 300), distance_gap = 0.26775466865075986 (vs 100), speed_gap = 0.008367071943289239 (vs 0.3333333333333333) continuing trip
2022-12-28 06:38:00,274:DEBUG:140290522912576:Too few points to make a decision, continuing

2022-12-28 06:38:00,274:DEBUG:140290522912576:------------------------------2022-12-28T01:18:22.615000-04:00------------------------------
2022-12-28 06:38:00,279:DEBUG:140290522912576:len(last10PointsDistances) = 10, len(last5MinsDistances) = 0
2022-12-28 06:38:00,279:DEBUG:140290522912576:last5MinsTimes.max() = nan, time_threshold = 300

...... Lots of tracking error transitions (id: 18)......

2022-12-28 06:38:01,174:DEBUG:140290522912576:prev_point.ts = 1672182691.852, curr_point.ts = 1672204702.615, time gap = 22010.763000011444 (vs 300), distance_gap = 12997548.695653537 (vs 100), speed_gap = 590.5087749864364 (vs 0.3333333333333333) continuing trip
2022-12-28 06:38:01,174:DEBUG:140290522912576:Too few points to make a decision, continuing

Let's look at where we detected a trip end in the re-run

2023-01-25 07:03:17,186:DEBUG:4580113920:------------------------------2022-12-27T18:56:04.894000-04:00------------------------------
2023-01-25 07:03:17,191:DEBUG:4580113920:last5MinsDistances = [29.70525623 29.70525623 29.70525623 17.30257545  0.79725283 29.24549103
 32.17041795 29.54412285  3.55062247 29.65201247 29.65201247 29.70897293
 31.04866665  0.         29.50565321] with length 15
2023-01-25 07:03:17,193:DEBUG:4580113920:last10PointsDistances = [32.17041795 29.54412285  3.55062247 29.65201247 29.65201247 29.70897293
 31.04866665  0.         29.50565321  0.        ] with length 10, shape (10,)
2023-01-25 07:03:17,195:DEBUG:4580113920:len(last10PointsDistances) = 10, len(last5MinsDistances) = 15
2023-01-25 07:03:17,195:DEBUG:4580113920:last5MinsTimes.max() = 288.01399993896484, time_threshold = 300
2023-01-25 07:03:17,195:DEBUG:4580113920:curr_query = ..., sort_key = data.ts
2023-01-25 07:03:17,195:DEBUG:4580113920:orig_ts_db_keys = ['statemachine/transition'], analysis_ts_db_keys = []
2023-01-25 07:03:17,197:DEBUG:4580113920:finished querying values for ['statemachine/transition'], count = 0
2023-01-25 07:03:17,198:DEBUG:4580113920:finished querying values for [], count = 0
2023-01-25 07:03:17,198:DEBUG:4580113920:orig_ts_db_matches = 0, analysis_ts_db_matches = 0
2023-01-25 07:03:17,201:DEBUG:4580113920:Found 0 results
2023-01-25 07:03:17,201:DEBUG:4580113920:In range 1672181741.861 -> 1672181764.894 found no transitions
2023-01-25 07:03:17,201:DEBUG:4580113920:curr_query = ..., sort_key = data.ts
2023-01-25 07:03:17,201:DEBUG:4580113920:orig_ts_db_keys = ['background/motion_activity'], analysis_ts_db_keys = []
2023-01-25 07:03:17,204:DEBUG:4580113920:finished querying values for ['background/motion_activity'], count = 0
2023-01-25 07:03:17,204:DEBUG:4580113920:finished querying values for [], count = 0
2023-01-25 07:03:17,204:DEBUG:4580113920:orig_ts_db_matches = 0, analysis_ts_db_matches = 0
2023-01-25 07:03:17,207:DEBUG:4580113920:Found 0 motion_activity entries in range 1672181741.861 -> 1672181764.894
2023-01-25 07:03:17,207:DEBUG:4580113920:sample activities are []
2023-01-25 07:03:17,207:DEBUG:4580113920:prev_point.ts = 1672181741.861, curr_point.ts = 1672181764.894, time gap = 23.032999992370605 (vs 300), distance_gap = 29.505653212172994 (vs 100), speed_gap = 1.281016507704006 (vs 0.3333333333333333) continuing trip
2023-01-25 07:03:17,208:DEBUG:4580113920:last5MinsDistances.max() = 32.170417952445504, last10PointsDistance.max() = 32.170417952445504
2023-01-25 07:03:17,208:DEBUG:4580113920:last5MinsPoints and last10PointsMedian found, last_trip_end_index = 272
2023-01-25 07:03:17,209:DEBUG:4580113920:Appending last_trip_end_point with index 272
2023-01-25 07:03:17,209:INFO:4580113920:Found trip end at 2022-12-27T18:53:13.860000-04:00

In the original run, we had already finished processing by this point.

How did we process that point in the original run?
Aha! In the original run, we got a data push between 2022-12-27T18:56:04.894000-04:00 and 2022-12-27T19:07:29.865000-04:00

So we closed out the trip as expected while processing 2022-12-27T18:56:04.894000-04:00, and when we got the new points starting at 2022-12-27T19:07:29.865000-04:00, we did not continue the old trip, but started a new one

  1. Online version (on server)
2022-12-27 23:48:44,605:DEBUG:140720199223104:------------------------------2022-12-27T18:51:16.880000-04:00------------------------------
2022-12-27 23:48:44,606:DEBUG:140720199223104:Comparing with prev_point =
2022-12-27 23:48:44,606:DEBUG:140720199223104:Setting new trip start point with idx 276
2022-12-27 23:48:44,609:DEBUG:140720199223104:last5MinsDistances = [] with length 0
2022-12-27 23:48:44,610:DEBUG:140720199223104:last10PointsDistances = [0.] with length 1, shape (1,)
2022-12-27 23:48:44,611:DEBUG:140720199223104:len(last10PointsDistances) = 1, len(last5MinsDistances) = 0
2022-12-27 23:48:44,611:DEBUG:140720199223104:last5MinsTimes.max() = nan, time_threshold = 300
2022-12-27 23:48:44,611:DEBUG:140720199223104:prev_point is None, continuing trip
2022-12-27 23:48:44,611:DEBUG:140720199223104:Too few points to make a decision, continuing



2022-12-27 23:48:57,182:DEBUG:140720199223104:------------------------------2022-12-27T18:56:04.894000-04:00------------------------------
2022-12-27 23:48:57,187:DEBUG:140720199223104:last5MinsDistances = [29.70525623 29.70525623 29.70525623 17.30257545  0.79725283 29.24549103
32.17041795 29.54412285  3.55062247 29.65201247 29.65201247 29.70897293
31.04866665  0.         29.50565321] with length 15
2022-12-27 23:48:57,189:DEBUG:140720199223104:last10PointsDistances = [32.17041795 29.54412285  3.55062247 29.65201247 29.65201247 29.70897293
31.04866665  0.         29.50565321  0.        ] with length 10, shape (10,)
2022-12-27 23:48:57,191:DEBUG:140720199223104:len(last10PointsDistances) = 10, len(last5MinsDistances) = 15
2022-12-27 23:48:57,191:DEBUG:140720199223104:last5MinsTimes.max() = 288.01399993896484, time_threshold = 300
2022-12-27 23:48:57,191:DEBUG:140720199223104:curr_query = .... sort_key = data.ts
2022-12-27 23:48:57,192:DEBUG:140720199223104:orig_ts_db_keys = ['statemachine/transition'], analysis_ts_db_keys = []
2022-12-27 23:48:57,307:DEBUG:140720199223104:finished querying values for ['statemachine/transition'], count = 0
2022-12-27 23:48:57,307:DEBUG:140720199223104:finished querying values for [], count = 0
2022-12-27 23:48:57,307:DEBUG:140720199223104:orig_ts_db_matches = 0, analysis_ts_db_matches = 0
2022-12-27 23:48:57,429:DEBUG:140720199223104:Found 0 results
2022-12-27 23:48:57,429:DEBUG:140720199223104:In range 1672181741.861 -> 1672181764.894 found no transitions
2022-12-27 23:48:57,430:DEBUG:140720199223104:curr_query = ..., sort_key = data.ts
2022-12-27 23:48:57,430:DEBUG:140720199223104:orig_ts_db_keys = ['background/motion_activity'], analysis_ts_db_keys = []
2022-12-27 23:48:57,761:DEBUG:140720199223104:finished querying values for ['background/motion_activity'], count = 0
2022-12-27 23:48:57,761:DEBUG:140720199223104:finished querying values for [], count = 0
2022-12-27 23:48:57,761:DEBUG:140720199223104:orig_ts_db_matches = 0, analysis_ts_db_matches = 0
2022-12-27 23:48:58,112:DEBUG:140720199223104:Found 0 motion_activity entries in range 1672181741.861 -> 1672181764.894
2022-12-27 23:48:58,112:DEBUG:140720199223104:sample activities are []
2022-12-27 23:48:58,112:DEBUG:140720199223104:prev_point.ts = 1672181741.861, curr_point.ts = 1672181764.894, time gap = 23.032999992370605 (vs 300), distance_gap = 29.505653212172994 (vs 100), speed_gap = 1.281016507704006 (vs 0.3333333333333333) continuing trip
2022-12-27 23:48:58,112:DEBUG:140720199223104:last5MinsDistances.max() = 32.170417952445504, last10PointsDistance.max() = 32.170417952445504
2022-12-27 23:48:58,113:DEBUG:140720199223104:last5MinsPoints and last10PointsMedian found, last_trip_end_index = 283
2022-12-27 23:48:58,114:DEBUG:140720199223104:Appending last_trip_end_point ... with index 283
2022-12-27 23:48:58,114:INFO:140720199223104:Found trip end at 2022-12-27T18:53:13.860000-04:00

2022-12-28 06:37:44,461:INFO:140290522912576:Last ts processed = None
2022-12-28 06:37:44,462:DEBUG:140290522912576:------------------------------2022-12-27T19:07:29.865000-04:00------------------------------
2022-12-28 06:37:44,463:DEBUG:140290522912576:Appending currPoint because the current start point is None
2022-12-28 06:37:44,463:DEBUG:140290522912576:Setting new trip start point with idx 0
2022-12-28 06:37:44,471:DEBUG:140290522912576:last5MinsDistances = [] with length 0
2022-12-28 06:37:44,474:DEBUG:140290522912576:last10PointsDistances = [0.] with length 1, shape (1,)
2022-12-28 06:37:44,475:DEBUG:140290522912576:len(last10PointsDistances) = 1, len(last5MinsDistances) = 0
2022-12-28 06:37:44,475:DEBUG:140290522912576:last5MinsTimes.max() = nan, time_threshold = 300
2022-12-28 06:37:44,475:DEBUG:140290522912576:prev_point is None, continuing trip
2022-12-28 06:37:44,476:DEBUG:140290522912576:Too few points to make a decision, continuing
  2. Rerun with full data
2023-01-25 07:03:16,898:DEBUG:4580113920:------------------------------2022-12-27T18:51:16.880000-04:00------------------------------
2023-01-25 07:03:16,898:DEBUG:4580113920:Comparing with prev_point =
2023-01-25 07:03:16,898:DEBUG:4580113920:Setting new trip start point with idx 265
2023-01-25 07:03:16,900:DEBUG:4580113920:last5MinsDistances = [] with length 0
2023-01-25 07:03:16,901:DEBUG:4580113920:last10PointsDistances = [0.] with length 1, shape (1,)
2023-01-25 07:03:16,901:DEBUG:4580113920:len(last10PointsDistances) = 1, len(last5MinsDistances) = 0
2023-01-25 07:03:16,901:DEBUG:4580113920:last5MinsTimes.max() = nan, time_threshold = 300
2023-01-25 07:03:16,901:DEBUG:4580113920:prev_point is None, continuing trip
2023-01-25 07:03:16,901:DEBUG:4580113920:Too few points to make a decision, continuing

2023-01-25 07:03:17,186:DEBUG:4580113920:------------------------------2022-12-27T18:56:04.894000-04:00------------------------------
2023-01-25 07:03:17,191:DEBUG:4580113920:last5MinsDistances = [29.70525623 29.70525623 29.70525623 17.30257545  0.79725283 29.24549103
 32.17041795 29.54412285  3.55062247 29.65201247 29.65201247 29.70897293
 31.04866665  0.         29.50565321] with length 15
2023-01-25 07:03:17,193:DEBUG:4580113920:last10PointsDistances = [32.17041795 29.54412285  3.55062247 29.65201247 29.65201247 29.70897293
 31.04866665  0.         29.50565321  0.        ] with length 10, shape (10,)
2023-01-25 07:03:17,195:DEBUG:4580113920:len(last10PointsDistances) = 10, len(last5MinsDistances) = 15
2023-01-25 07:03:17,195:DEBUG:4580113920:last5MinsTimes.max() = 288.01399993896484, time_threshold = 300
2023-01-25 07:03:17,195:DEBUG:4580113920:curr_query = , sort_key = data.ts
2023-01-25 07:03:17,195:DEBUG:4580113920:orig_ts_db_keys = ['statemachine/transition'], analysis_ts_db_keys = []
2023-01-25 07:03:17,197:DEBUG:4580113920:finished querying values for ['statemachine/transition'], count = 0
2023-01-25 07:03:17,198:DEBUG:4580113920:finished querying values for [], count = 0
2023-01-25 07:03:17,198:DEBUG:4580113920:orig_ts_db_matches = 0, analysis_ts_db_matches = 0
2023-01-25 07:03:17,201:DEBUG:4580113920:Found 0 results
2023-01-25 07:03:17,201:DEBUG:4580113920:In range 1672181741.861 -> 1672181764.894 found no transitions
2023-01-25 07:03:17,201:DEBUG:4580113920:curr_query = , sort_key = data.ts
2023-01-25 07:03:17,201:DEBUG:4580113920:orig_ts_db_keys = ['background/motion_activity'], analysis_ts_db_keys = []
2023-01-25 07:03:17,204:DEBUG:4580113920:finished querying values for ['background/motion_activity'], count = 0
2023-01-25 07:03:17,204:DEBUG:4580113920:finished querying values for [], count = 0
2023-01-25 07:03:17,204:DEBUG:4580113920:orig_ts_db_matches = 0, analysis_ts_db_matches = 0
2023-01-25 07:03:17,207:DEBUG:4580113920:Found 0 motion_activity entries in range 1672181741.861 -> 1672181764.894
2023-01-25 07:03:17,207:DEBUG:4580113920:sample activities are []
2023-01-25 07:03:17,207:DEBUG:4580113920:prev_point.ts = 1672181741.861, curr_point.ts = 1672181764.894, time gap = 23.032999992370605 (vs 300), distance_gap = 29.505653212172994 (vs 100), speed_gap = 1.281016507704006 (vs
 0.3333333333333333) continuing trip
2023-01-25 07:03:17,208:DEBUG:4580113920:last5MinsDistances.max() = 32.170417952445504, last10PointsDistance.max() = 32.170417952445504
2023-01-25 07:03:17,208:DEBUG:4580113920:last5MinsPoints and last10PointsMedian found, last_trip_end_index = 272
2023-01-25 07:03:17,209:DEBUG:4580113920:Appending last_trip_end_point with index 272
2023-01-25 07:03:17,209:INFO:4580113920:Found trip end at 2022-12-27T18:53:13.860000-04:00

2023-01-25 07:03:17,238:DEBUG:4580113920:------------------------------2022-12-27T19:07:29.865000-04:00------------------------------
2023-01-25 07:03:17,238:DEBUG:4580113920:Comparing with prev_point = 
2023-01-25 07:03:17,238:INFO:4580113920:Points are within the distance filter and only 1 min apart so part of the same trip

@shankari

Tracking final issue with two batches vs. one batch in
#849


shankari commented Jan 26, 2023

Resetting use case 1, last trip:

INFO:root:ret_trip_doc start = 2022-10-22T19:00:01-04:00, end = 2022-10-23T07:47:25.788000-04:00

place exit times: 'exit_fmt_time': '2022-10-22T19:00:01-04:00'

delete everything after the place exit...

>>> arrow.get(1666479601).to("America/New_York")
<Arrow [2022-10-22T19:00:01-04:00]>

@shankari

Reset the pipeline for use case 1; it works fine and the entry was removed.
Will double-check the dashboard tomorrow, when it should have been updated, and then reset use case 2 as well

@shankari

Dashboard has been updated

Looks like we ended up with a few more trips after the reset, but the number of labeled trips did not change.
[Two screenshots: dashboard on 2023-01-25 and on 2023-01-27, after the reset]

@shankari

For use case 2, resetting to

DEBUG:root:last_place_enter_ts = 1672177147.9
DEBUG:root:reset_ts = 1672177147.9

@shankari

For use case 1, we do seem to have some issue with multiple matches. This may in fact be related, since we reset the pipeline to '2022-10-22T19:00:01-04:00'

2023-01-28 03:50:02,024:ERROR:140494563346240:Found error Found len(ret_list) = 3, expected <=1 while processing trip Entry({'_id': ObjectId('6354ad0e74264950e7b89e11'), 'end_fmt_time': '2022-10-22T19:00:01-04:00', 'start_fmt_time': '2022-10-22T16:52:18.729293-04:00', 'key': 'analysis/confirmed_trip'})

and there are in fact multiple inferred sections for the cleaned section, although they are all identical:

   data.duration  data.distance  data.sensed_mode      data.cleaned_section
0    7662.270707  237386.026129                 5  6354ad0974264950e7b89d04
1    7662.270707  237386.026129                 5  6354ad0974264950e7b89d04
2    7662.270707  237386.026129                 5  6354ad0974264950e7b89d04


shankari commented Jan 28, 2023

Checked the confirmed trips and there are no overlaps. Only one entry even seems to be out of order, and on checking the write_fmt_time, the order for both the write_fmt_time and the start_fmt_time is the same. So there doesn't seem to be an issue with the confirmed trips.

15  2023-01-26T19:22:40.747753-08:00  2022-10-25T16
16  2023-01-26T19:22:40.793668-08:00  2022-10-26T13   <---- out of order
17  2023-01-26T19:22:40.767221-08:00  2022-10-26T11
18  2023-01-26T19:22:40.775635-08:00  2022-10-26T13
19  2023-01-26T19:22:40.801774-08:00  2022-10-26T14

@shankari

Fixed the second use case as well

[Two screenshots: dashboard on 2023-01-27 and on 2023-01-28, after resetting use case 2]


shankari commented Feb 1, 2023

quick check on the discrepancy in use case 1, given that we have no incoming data...
There is no discrepancy! We generate the same number of trips!

2023-01-20 08:35:17,590:DEBUG:140180398004032:keys = (analysis/cleaned_place, analysis/confirmed_trip), len(places) = 55, len(trips) = 54
2023-02-01 01:49:06,269:DEBUG:139888238438208:keys = (analysis/cleaned_place, analysis/confirmed_trip), len(places) = 55, len(trips) = 54

I cannot quite understand the discrepancy. We reset the pipeline only for one user. That user has the same number of trips before and after the reset, for a time range that spans the reset timestamp (2022-10-20 -> 2022-10-27, reset timestamp: 2022-10-22T19:00:01-04:00)

I can't think of how to test this any further, but we should take a mongodump before we reset the pipeline again and investigate further.
