Giant, obvious GPS jump was not filtered #843
While processing, using the IQR to find potential clusters, we find 6 potential clusters.
The clusters that are most interesting to us are segments 1, 2 and 3
|
First, this seems to be the case where all the segments are clusters, which is a corner case, and probably why it is failing.
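For context, a minimal sketch of how an IQR-based threshold can collapse in this corner case; the threshold formula and variable names here are illustrative assumptions, not the actual e-mission implementation:

```python
import numpy as np

def iqr_threshold(segment_distances):
    # Classic IQR outlier threshold: Q3 + 1.5 * (Q3 - Q1)
    q1, q3 = np.percentile(segment_distances, [25, 75])
    return q3 + 1.5 * (q3 - q1)

# When every segment distance is small and similar ("all the segments are
# clusters"), the IQR collapses, the threshold sits barely above the typical
# distance, and almost everything gets flagged as a potential cluster.
print(iqr_threshold([120, 150, 135, 140, 130, 145]))   # tiny spread -> low threshold
print(iqr_threshold([120, 150, 135, 8000, 130, 145]))  # one real jump stands out
```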
But if we look at the speeds and distances of interest, we have
The distance from 16 to 17 is 166, not 0...
Stepping through the code, the detected threshold is
While computing distances, we recompute the distance between the start and end points, which is why we end up with a distance of 0 for the (16,17) segment, since the segment has only one location point if it uses the [16:17] slice. So [16:17] is indeed a one-trip cluster and we can assume that it is bad. But we then return the segment before it (11-16) as good, and following the expected switching, we get
And so we end up deleting the wrong points. The real problem is that we expect to find alternating GOOD and BAD segments, but we actually have two back-to-back BAD segments (16-17 and 17-21). And we should not delete the points from 21 to 26.
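A minimal sketch of why the alternation assumption goes wrong here, using the segment boundaries from the discussion above; the labelling function is a hypothetical stand-in for the real heuristic, not the server code:

```python
# Segments as (start_idx, end_idx), taken from the discussion above.
segments = [(11, 16), (16, 17), (17, 21), (21, 26)]

def label_by_alternation(segments, known_bad_pos):
    # Hypothetical alternation heuristic: once one segment is known BAD,
    # its neighbours are assumed to flip BAD/GOOD/BAD/... in both directions.
    labels = {}
    for offset, seg in enumerate(segments[known_bad_pos:]):
        labels[seg] = "BAD" if offset % 2 == 0 else "GOOD"
    for offset, seg in enumerate(reversed(segments[:known_bad_pos]), start=1):
        labels[seg] = "GOOD" if offset % 2 == 1 else "BAD"
    return labels

# (16, 17) is the known one-point BAD segment (position 1 in the list).
print(label_by_alternation(segments, 1))
# {(16, 17): 'BAD', (17, 21): 'GOOD', (21, 26): 'BAD', (11, 16): 'GOOD'}
# In reality (17, 21) is also BAD, so alternation keeps the jump segment
# and deletes the legitimate points from 21 to 26.
```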
Just to confirm, if I set the coordinates of 16 to
Segment 1 (11-16) is again GOOD
So we end up deleting the points in between. Maybe it would be easier to see that we are retaining [11 12 13 14 15 21 22 23 24 25], which are basically two clumps on opposite sides of a lake, and that is actually pretty correct. So really the problem is that we expect good and bad to alternate, and that doesn't always seem to be the case. Note that before we changed the lat/lon for point 16, we still had outliers after the filtering, and now we don't.
So in order to avoid regressions, let's try a second method if the first method still has outliers.
I can think of two possible second methods:
I'm tempted to go with (1) first, since it fits into our current heuristic-based solution better than (2). If we have (2), for example, do we still need the current alternation method? Let's implement (1) and see if it also fixes the large not-a-trip for mm_masscec.
The large trip for mm_masscec had only one point in the jump. Let's see why it wasn't captured.
Looking at the logs, this is another example where all the segments are clusters, so the IQR was super small.
Note also that there was a big gap between point 10 and point 11, so although the jump was large, the speed (590) is under Mach 2 (2 * 340.29 = 680.58). And then, since there are multiple zero-length clusters, we pick one of them as BAD and then do our alternating BAD and GOOD, which ends up with 2-11 as BAD and 11-12 as GOOD.
So we end up deleting almost everything other than the one item that we actually want to delete, aka point 11.
We do still have outliers after the first check is done
So again, the problem seems to be with these weird trips where everything is a cluster. Our previous plan to look at high-speed jumps should work, although given the big gap in time, we need to tone down the check to Mach 1 instead of Mach 2.
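A rough sketch of the kind of high-speed-jump check being discussed, with Mach 1 ≈ 340.29 m/s; the function and data layout are assumptions for illustration, not the actual smoothing code:

```python
MACH_1 = 340.29  # speed of sound in m/s, the proposed threshold

def has_impossible_jump(points, max_speed=MACH_1):
    """points: list of (timestamp_in_seconds, distance_from_previous_in_meters)."""
    for (prev_ts, _), (ts, dist) in zip(points, points[1:]):
        dt = ts - prev_ts
        if dt > 0 and dist / dt > max_speed:
            return True
    return False

# In use case 2 the time gap between points 10 and 11 was large, so even a
# huge jump only works out to ~590 m/s: over Mach 1 but under Mach 2 (~680),
# which is why the check needs to use Mach 1 rather than Mach 2.
```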
So the planned fix is:
…nts are clusters

Once the actual issue is addressed, this will fix e-mission/e-mission-docs#843

For now, we load the location dataframes for the two use cases and verify that the returned values are the ones in the current implementation.

Procedure:
- Perturb the location points in the original use cases to avoid leaking information
- Load the location points into the test case
- Run the filtering code
- Verify that the output is consistent with e-mission/e-mission-docs#843 (comment) and e-mission/e-mission-docs#843 (comment)

Also change the location smoothing code from `logging.info` to `logging.exception` so that we can see where the error is in a more meaningful way.

Testing done:
- Test passes

```
----------------------------------------------------------------------
Ran 1 test in 0.387s
```

Note that due to the perturbation of the location points, the outliers no longer perfectly match the original use case, but are close enough:

```
2023-01-22 22:37:57,262:INFO:4634275328:After first round, still have outliers
    accuracy   altitude  ...      distance         speed
17    70.051  88.551857  ...  8.468128e+06  50922.935508
26     3.778  66.404068  ...  8.467873e+06   2878.645674
49     3.900  72.118635  ...  4.673209e+00      2.336605
2023-01-22 22:37:57,308:INFO:4634275328:After first round, still have outliers
    Unnamed: 0  accuracy    altitude  ...    heading      distance          speed
14          14     5.638  470.899994  ...  88.989357  1.113137e+07  284923.028227
```
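As a hedged illustration of the perturbation step in the procedure above (the offset scale, column names, and function are assumptions; the actual test fixture may differ):

```python
import numpy as np
import pandas as pd

def perturb_locations(loc_df: pd.DataFrame, scale_deg=0.001, seed=42) -> pd.DataFrame:
    """Add small random offsets to lat/lon so the checked-in fixture does not
    leak the original user's locations while keeping the jump structure."""
    rng = np.random.default_rng(seed)
    out = loc_df.copy()
    out["latitude"] = out["latitude"] + rng.normal(0, scale_deg, len(out))
    out["longitude"] = out["longitude"] + rng.normal(0, scale_deg, len(out))
    return out
```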
We have already implemented multiple smoothing algorithms. So instead of adding a new one, let's see if one of the existing algorithms, e.g. POSDAP, will work as the backup. For use case 1:
So it looks like we removed everything other than 16-20? Maybe I just have to edit POSDAP to flip the signs. Although after the second round, our outliers are not the original ones?
Use case 2:
Ok, so printing out the outliers instead of the inliers indicates that this does seem to work
However, even after removing them, we still have "outliers". Since most of the points are in clusters, these outliers are pretty slow, with
For this second check, we should look to see whether the outliers are in fact large.
After adding that check, the correct values are deleted. Will need to pull changes to the GIS branch and re-run to confirm that it works correctly here as well.
- Since we have already implemented many different smoothing algorithms, we pick POSDAP to use as backup
- If we still have outliers after the first round, and the max value is over MACH1, we fall back to the backup algo
- After implementing the backup algo, if we don't have outliers, the backup algo has succeeded and we use its results
- If we do have outliers, but the max value is under MACH1, the backup algo has succeeded and we use its results
- If we have outliers, and the max is high (> MACH1), the backup algo has failed (see the sketch after this list)

With this change, both the tests also change to the correctly deleted values:
- [16 17 18 19 20] for use case 1 (e-mission/e-mission-docs#843 (comment))
- [11] for use case 2 (e-mission/e-mission-docs#843 (comment))

In this commit, we also check in the csv data files for the two test cases.
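A minimal sketch of the fallback decision described in the list above (as referenced there); the function names and the DataFrame-of-outliers interface are assumptions, not the actual smoothing module:

```python
MACH_1 = 340.29  # m/s

def smooth_with_fallback(points, primary_algo, backup_algo):
    # Each algo is assumed to return (filtered_points, outliers_df), where
    # outliers_df has a `speed` column, matching the logs above.
    filtered, outliers = primary_algo(points)
    if outliers.empty or outliers.speed.max() <= MACH_1:
        # First round is clean enough: keep its results.
        return filtered
    # Still have an impossibly fast point: retry with the backup (POSDAP).
    backup_filtered, backup_outliers = backup_algo(points)
    if backup_outliers.empty or backup_outliers.speed.max() <= MACH_1:
        return backup_filtered  # backup succeeded
    # Backup also failed; assumption here is that we keep the primary results.
    return filtered
```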
We now have a fix for this issue, but it is a bit ugly. Planning to move the retry code out to
There was a regression caused by e-mission/e-mission-server@cebb81f and fixed by
If there were no jumps in the data, then why did we even call the backup algo, and why did it filter out values? Investigating further. Before the change:
With the new one, we have
And then, it looks like the max is over the speed of sound. Maybe we should filter this after all?
The section is:
It would be good to check whether it indeed has a jump at the end.
OK, I read the values and calculated the speeds, and none of them are that high.
Aha! The big calculated outlier is around point 8
And we apparently "filled in" a large gap on iOS at around point 8
Note, however, that this happens for the next section as well
But it doesn't result in outliers
I am not going to dig deeper into the iOS fill code right now because, given our much finer-grained default data collection on iOS, this is not likely to happen in practice. Maybe we should remove it later?
To finish up this issue, we should not remove this newly added point, so the original implementation, without the regression, is the correct one. One final cleanup and then we can merge.
Re-running with the GIS branch, I still get the same trip/section segmentation.
In the re-run, we have
Before the logs are deleted, let's see how this was processed initially.
In the original processing, the previous trip had not ended
Let's look at where we detected a trip end in the re-run
In the original run, we had finished processing
How did we process that point in the original run? We closed out the trip as expected at 2022-12-27T19:07:29.865000-04:00, and when we got the new points at 2022-12-27T18:56:04.894000-04:00, we did not continue the old trip, but started a new one.
Tracking final issue with two batches vs. one batch in
Resetting use case 1, last trip:
place exit times: delete everything after the place exit...
Reset the pipeline for use case 1; it works fine.
For use case 2, resetting to
For use case 1, we do seem to have some issue with multiple matches. This may in fact be related, since we reset the pipeline to
and there are in fact multiple inferred sections for the cleaned section although they are all identical
Checked the confirmed trips and there are no overlaps. Only one entry even seems to be out of order and on checking the
quick check on the discrepancy in use case 1, given that we have no incoming data...
I cannot quite understand the discrepancy. We reset the pipeline only for one user. That user has the same number of trips before and after the reset, for a time range that spans the reset timestamp (
I can't think of how to test this any further, but we should take a mongodump before we reset the pipeline again and investigate further.