-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trip Segmentation Optimization #1014
Trip Segmentation Optimization #1014
Conversation
740c75d
to
6243b40
Compare
I noticed that in trip_segmentation, we query the user timeseries for background/filtered_location. Then we call segment_into_trips which performs the same query on the same timeseries with the same timequery. Passing the loc_df as an argument eliminates the need for this query
restart_checking has 2 functions that are called repeatedly from both segmentation filter classes: `is_tracking_restarted_in_range` and `get_ongoing_motion_in_range`. Both of these functions performed a DB query (for `statemachine/transition` and `background/motion_activity` entries, respectively) In the case of the former, we already have all the `statemachine/transition` in the current processing time range kept in memory as a dataframe. This can be passed as an (optional) argument; we just have to then filter it down to the range of start_ts -> end_ts. We do not have all the `background/motion_activity` kept as a dataframe but we can load them all at once and filter down later, just as we do with the`statemachine/transition`. This will use more memory but I think it is likely still more efficient than multiple queries because DB calls are a bottleneck on production. I will investigate the effect of these changes further, but by my inital estimates, this drastically reduces the number of DB queries during trip segmentation (by a magnitude of ~100)
6243b40
to
11fd09e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change seems fine; it is a pretty straightforward change in which we pre-load a bunch of values and then re-use them instead of loading them lazily.
I am fine with merging this so we can see the impact on the DB queries.
As part of cleanup, I would like to see:
- what was justification for this change? In particular, what were the timing results from our sample programs (stage, ccebike, smart commute, stm-community) that led you to focus on these areas?
- I see that we are querying for loc_df in
trip_segmentation
and passing it in the time/distance filters. And then we are querying for transition and motion activity insidesegment_into_trips
. Why? It seems like we can just read all input data from that time range and pass it intosegment_into_trips
at the same time, in three different dataframes. This will not have an impact on performance, but it is cleaner and easier to understand.
I checked staging, and this seems to be working correctly. The logs show a successful run
And I see two composite trips retrieved by the UI
I can't see them in the UI because of e-mission/e-mission-docs#1108 @JGreenlee can you verify that you saw the trips from yesterday (Wednesday) in the UI? |
@TeachMeTW as an aside, while evaluating this fix, I notice that we store the |
Yes, it is there and processed on all the phones I brought |
@shankari @JGreenlee Is no longer using |
No description provided.