quick fixes for scalability? #261

Closed · shankari opened this issue Jul 5, 2017 · 12 comments

shankari commented Jul 5, 2017

Right now, the pipeline is so slow that we can only run it once a day. It would be nice to be able to run it more frequently. Let's take a quick look at what's taking the time, and see whether we can do a quick fix to speed it up.

shankari commented Jul 11, 2018

I thought that the slowness might be because there is too much monitoring data (e.g. API calls, pipeline_time, etc.). But no: the bulk of the data is real, and it is in the background/location, background/filtered_location and background/motion_activity categories.

stats/server_api_time -> 1,776,462
statemachine/transition -> 885,091
background/location -> 11,900,375
background/motion_activity -> 15,228,251
background/filtered_location_ignore -> 7,657
background/filtered_location -> 10,078,204
background/battery -> 1,892,910
config/sensor_config -> 905
config/sync_config -> 298
config/consent -> 1,427
stats/server_api_error -> 29,673
stats/client_time -> 2,513,973
stats/client_nav_event -> 325,137
stats/pipeline_time -> 1,125,218
stats/client_error -> 11,780
stats/pipeline_error -> 1,453
manual/incident -> 90
manual/mode_confirm -> 118
manual/purpose_confirm -> 111
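
For reference, a minimal sketch (not from the codebase) of how per-key counts like these can be reproduced in a single aggregation pass, assuming the usual `import emission.core.get_database as edb` helper:

```python
import emission.core.get_database as edb

# Group the timeseries by metadata.key and count entries per key,
# instead of issuing one count per key.
pipeline = [
    {"$group": {"_id": "$metadata.key", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for doc in edb.get_timeseries_db().aggregate(pipeline, allowDiskUse=True):
    print("%s -> %s" % (doc["_id"], doc["count"]))
```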

shankari transferred this issue from e-mission/e-mission-server Feb 11, 2019

shankari commented Mar 1, 2019

@jf87 this is the previous bug I had filed for scalability. As you can see, I didn't get very far in my checking...

I also wrote up the goals to motivate an undergrad to work on them
https://github.com/e-mission/e-mission-docs/blob/master/docs/future_work/scalability.md

and he actually wrote up a trace generator that (unfortunately) needs some cleanup before it can be merged.
e-mission/e-mission-server#622

shankari commented Mar 1, 2019

The last time I looked at this, I thought that the issue was with the database. Here's a quick check.
We have ~46 million entries in the database. Counting them takes ~1.4 ms, which seems reasonable.

In [243]: edb.get_timeseries_db().find().count()
Out[243]: 46416708

In [244]: %timeit edb.get_timeseries_db().find().count()
1000 loops, best of 3: 1.39 ms per loop

But although the user_id field is indexed,

    TimeSeries = _get_current_db().Stage_timeseries
    TimeSeries.create_index([("user_id", pymongo.HASHED)])

Calling distinct on it takes roughly 30,000 times as long (43 seconds).

In [241]: len(edb.get_timeseries_db().distinct("user_id"))
Out[241]: 1291

In [242]: %timeit edb.get_timeseries_db().distinct("user_id")
1 loop, best of 3: 43 s per loop
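
For comparison, a hedged sketch of timing distinct against an equivalent $group aggregation on the same field. Whether the aggregation fares any better depends on whether mongodb can answer it from the hashed index, so treat this as a measurement aid rather than a fix:

```python
import timeit
import emission.core.get_database as edb

ts = edb.get_timeseries_db()

def distinct_users():
    return ts.distinct("user_id")

def grouped_users():
    # Equivalent result via aggregation: one group per distinct user_id
    return [d["_id"] for d in ts.aggregate([{"$group": {"_id": "$user_id"}}])]

print("distinct: %.1f s" % timeit.timeit(distinct_users, number=1))
print("group:    %.1f s" % timeit.timeit(grouped_users, number=1))
```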

shankari commented Mar 1, 2019

I looked at the database in the first place because of logs like these. They are the logs for a single call to read the metrics and show them in the dashboard.

2019-03-01 00:51:08,775:DEBUG:140220397770496:START POST /result/metrics/timestamp
2019-03-01 00:51:08,775:DEBUG:140220397770496:metric_list = ['duration', 'median_speed', 'count', 'distance']
2019-03-01 00:51:08,775:DEBUG:140220397770496:['duration -> <function get_duration at 0x7f87a1802950>', 'median_speed -> <function get_median_speed at 0x7f87a18029d8>', 'count -> <function get_count at 0x7f87a1802840>', 'distance -> <function get_distance at 0x7f87a18028c8>']
2019-03-01 00:51:08,775:DEBUG:140220397770496:for user None, returning timeseries <emission.storage.timeseries.aggregate_timeseries.AggregateTimeSeries object at 0x7f8776b957f0>
2019-03-01 00:51:08,775:DEBUG:140220397770496:curr_query = {'$or': [{'metadata.key': 'analysis/inferred_section'}], 'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}}, sort_key = None
2019-03-01 00:51:08,775:DEBUG:140220397770496:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/inferred_section']
2019-03-01 00:51:53,449:DEBUG:140220397770496:finished querying values for [], count = 0
2019-03-01 00:51:53,451:DEBUG:140220397770496:finished querying values for ['analysis/inferred_section'], count = 0
2019-03-01 00:52:36,638:DEBUG:140220397770496:orig_ts_db_matches = 0, analysis_ts_db_matches = 0
2019-03-01 00:53:16,953:DEBUG:140220397770496:Found 0 results
2019-03-01 00:53:16,954:DEBUG:140220397770496:Returning entry with length 0 result
2019-03-01 00:53:16,954:INFO:140220397770496:Found no entries for user None, time_query <emission.storage.timeseries.timequery.TimeQuery object at 0x7f8776b953c8>
2019-03-01 00:53:16,954:DEBUG:140220397770496:END POST /result/metrics/timestamp  128.1795883178711

The query that we run against mongodb includes the keys metadata.key and data.start_ts, both of which should be covered by indexes. Further, metadata.key is HASHED and we look it up directly, while data.start_ts is DESCENDING but we do a range query over it.

def _create_analysis_result_indices(tscoll):
    tscoll.create_index([("metadata.key", pymongo.HASHED)])

    # trips and sections
    tscoll.create_index([("data.start_ts", pymongo.DESCENDING)], sparse=True)

And yet, the first query takes 00:51:08 -> 00:51:53 = ~45 seconds. There are two more gaps of roughly 40 seconds each (presumably further calls to the database), 00:51:53 -> 00:52:36 and 00:52:36 -> 00:53:16, so the overall response time is 00:51:08 -> 00:53:16 = 2 mins 8 secs.
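
A hedged way to check whether mongodb is actually using these indexes for the slow query is to ask for the query plan with explain() (the query values are copied from the log above):

```python
import emission.core.get_database as edb

query = {'$or': [{'metadata.key': 'analysis/inferred_section'}],
         'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}}

# The winningPlan should show an IXSCAN stage; a COLLSCAN would explain the ~45 s
plan = edb.get_analysis_timeseries_db().find(query).explain()
print(plan["queryPlanner"]["winningPlan"])
```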

shankari commented Mar 1, 2019

Hm, but when I tried to run the queries directly against the database, they were much faster.

In [245]: %timeit edb.get_analysis_timeseries_db().find({'$or': [{'metadata.key': 'analysis/inferred_section'}], 'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}})
100 loops, best of 3: 11.3 ms per loop

In [246]: %timeit edb.get_timeseries_db().find({'$or': [{'metadata.key': 'analysis/inferred_section'}], 'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}})
1000 loops, best of 3: 1.19 ms per loop

That is pretty bizarre because between the two lines, I literally do nothing but query the database.

        logging.debug("orig_ts_db_keys = %s, analysis_ts_db_keys = %s" %
            (orig_ts_db_keys, analysis_ts_db_keys))

        orig_ts_db_result = self._get_entries_for_timeseries(self.timeseries_db,
                                                             orig_ts_db_keys,
                                                             time_query,
                                                             geo_query,
                                                             extra_query_list,
                                                             sort_key)

        analysis_ts_db_result = self._get_entries_for_timeseries(self.analysis_timeseries_db,
                                                                 analysis_ts_db_keys,
                                                                 time_query,
                                                                 geo_query,
                                                                 extra_query_list,
                                                                 sort_key)
        logging.debug("orig_ts_db_matches = %s, analysis_ts_db_matches = %s" %
            (orig_ts_db_result.count(), analysis_ts_db_result.count()))
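
Note that pymongo's find() only builds a cursor and does no server-side work until the cursor is iterated or counted, so the %timeit numbers in the previous comment may largely be measuring cursor construction rather than the query itself. A hypothetical wrapper (not in the codebase, reusing the parameter names above) that times each of the two calls and forces the query to execute might look like:

```python
import logging
import time

def timed_entries(ts, db, keys, time_query, geo_query, extra_query_list, sort_key):
    # Time a single _get_entries_for_timeseries call; calling count() on the
    # result makes the server actually execute the query.
    start = time.time()
    result = ts._get_entries_for_timeseries(db, keys, time_query,
                                            geo_query, extra_query_list, sort_key)
    logging.debug("querying %s took %.2f s, count = %s",
                  keys, time.time() - start, result.count())
    return result
```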

shankari commented Mar 1, 2019

Even if I sort (which shouldn't apply since the sort_key is None), it doesn't get much worse, which is not surprising since no entries are returned.

In [247]: %timeit edb.get_analysis_timeseries_db().find({'$or': [{'metadata.key': 'analysis/inferred_section'}], 'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}}).sort("metadata.write_ts")
100 loops, best of 3: 14.8 ms per loop

shankari commented Mar 1, 2019

This raises the hope that there is a simple fix in my implementation of the timeseries interface that can address the really bad performance problems. But I don't have a staging server where I can copy all this data easily...

shankari commented Mar 1, 2019

OK, so although this is bad practice in general, I added some logs directly to the timeseries code, and I think I know what's going on. Here are the logs without the fix, for the record; note the giant jumps in time between the log statements.

2019-03-01 05:09:47,488:DEBUG:140666554283776:finished querying values for ['manual/incident'], count = 0
2019-03-01 05:10:06,167:DEBUG:140666554283776:finished querying values for [], count = 0
2019-03-01 05:10:24,545:DEBUG:140666554283776:orig_ts_db_matches = 0, analysis_ts_db_matches = 0
2019-03-01 05:10:41,938:DEBUG:140666554283776:Returning entry with length 0 result

shankari commented Mar 1, 2019

ok, with a one-line fix, the response time drops to ~ 150ms. And it was such a stupid reason too.
@jf87 thank you for giving me enough motivation to take a closer look.

I will check in this fix now, although I think that the underlying code was only there to handle a use case that does not exist, and this can be simplified even further later.

2019-03-01 05:27:42,647:DEBUG:140164536702720:START POST /result/metrics/timestamp
2019-03-01 05:27:42,805:DEBUG:140164536702720:END POST /result/metrics/timestamp ea59084e-11d4-4076-9252-3b9a29ce35e0 0.15816450119018555

shankari commented Mar 1, 2019

I will commit the fix tomorrow because I want to carefully compose a commit message. If anybody wants to try out the fix, it is below. And it is also very close to the height of stupidity.

--- a/emission/storage/timeseries/builtin_timeseries.py
+++ b/emission/storage/timeseries/builtin_timeseries.py
@@ -20,7 +20,7 @@ ts_enum_map = {
     esta.EntryType.ANALYSIS_TYPE: edb.get_analysis_timeseries_db()
 }

-INVALID_QUERY = {'1': '2'}
+INVALID_QUERY = {'metadata.key': 'invalid'}

 class BuiltinTimeSeries(esta.TimeSeries):
     def __init__(self, user_id):
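
For anyone who wants to see the difference without applying the patch, a hedged sketch (assuming a pymongo version that provides count_documents) comparing the old, unindexed predicate against the new, indexed one:

```python
import timeit
import emission.core.get_database as edb

coll = edb.get_analysis_timeseries_db()

old_invalid = {'1': '2'}                    # no index on "1" -> collection scan
new_invalid = {'metadata.key': 'invalid'}   # metadata.key has a HASHED index

print("old: %.2f s" % timeit.timeit(lambda: coll.count_documents(old_invalid), number=1))
print("new: %.2f s" % timeit.timeit(lambda: coll.count_documents(new_invalid), number=1))
```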

shankari added a commit to shankari/e-mission-server that referenced this issue Mar 1, 2019
This one line fix improves performance by 99% on large databases :) And it is
also an object lesson in the law of unintended consequences, so here's a
summary to use as a test case.

Some background:
- the timeseries interface allows users to specify a set of keys that they are querying
- the underlying mongodb implementation splits the keys into the
  `timeseries_db` which stores raw data, and `analysis_timeseries_db` which
  stores processed data
- these were originally in the same database. so when we split them, to make
  sure that we don't lose any data inadvertently, we query both collections and
  merge the results
- in response to a query, instead of returning the full results in memory,
  mongodb returns a cursor that you can iterate over
- To ensure that we didn't have to read the entire results into memory every
  time, we chained the values returned from the two queries
    e-mission@5367a01

    ```
    return itertools.chain(orig_ts_db_result, analysis_ts_db_result)
    ```

So far so good. But then we ran into a series of bugs, with fixes that built on
each other.

1. If the entries are only in one database, the other database is queried with
   an empty array for the key, which returns all values
    (e-mission/e-mission-docs#168)
    - so we added a check - if there are no keys queried, we return an empty
        iterator that can be chained
        e-mission@b7f835a
1. But then, the empty iterator is not a cursor, so we can't display counts
    returned from each database (e-mission#599)
    - we fix this by performing an invalid query so that we get an empty cursor (e-mission@14aa503)
    - This is purely a nice to have, and the PR even says that the changes to
      enable it can be reverted if needed.
    - But the changes were correct, and the tests passed so we retained them

However, the INVALID_QUERY that we used was {"1": "2"}, and we do not have an
index in the database on the key "1". So as the database size grew, mongodb was
taking 45 seconds to iterate over every record and determine that there were no "1"s.

Switching from "1" -> "metadata.key", which *is* indexed, dramatically improves
performance from 2 mins 8 secs to 150 ms.

e-mission/e-mission-docs#261 (comment)
to
e-mission/e-mission-docs#261 (comment)
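
To summarize the pattern in code, here is a condensed, hedged sketch (simplified from builtin_timeseries.py, with helper names shortened) of how the two collections are queried and chained, and where the always-empty cursor comes in:

```python
import itertools

# metadata.key is indexed (HASHED), so this "always empty" query stays cheap
INVALID_QUERY = {'metadata.key': 'invalid'}

def _get_entries(coll, key_list, extra_query):
    if len(key_list) == 0:
        # Return an empty *cursor* rather than iter([]) so that callers can
        # still ask this collection for a count of matches.
        return coll.find(INVALID_QUERY)
    query = {"$or": [{"metadata.key": key} for key in key_list]}
    query.update(extra_query)
    return coll.find(query)

def find_entries(timeseries_db, analysis_db, orig_keys, analysis_keys, extra_query):
    orig_result = _get_entries(timeseries_db, orig_keys, extra_query)
    analysis_result = _get_entries(analysis_db, analysis_keys, extra_query)
    # Chain the two cursors so the caller iterates lazily over both result sets
    return itertools.chain(orig_result, analysis_result)
```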

shankari commented Mar 1, 2019

PR for the one-liner is e-mission/e-mission-server#638

However, the invalid query change was introduced in Aug 2018, while this issue was filed in Jul 2017 and edited in Jul 2018, so I don't think the PR fixes everything. There are probably some other issues with retrieving non-blank data.

@shankari

Closing this for now as fixed by e-mission/e-mission-server#638

jf87 pushed a commit to jf87/e-mission-server that referenced this issue Jun 21, 2021