quick fixes for scalability? #261

Closed · shankari opened this issue Jul 5, 2017 · 12 comments

shankari commented Jul 5, 2017

Right now, the pipeline is so slow that we can only run it once a day. It would be nice to be able to run it more frequently. Let's take a quick look at what's taking the time, and see whether we can do a quick fix to speed it up.

shankari commented Jul 11, 2018

I thought that the slowness might be because there is too much monitoring data (e.g. API calls, pipeline_time, etc.). But no: the bulk of the data is real, and it is in the background/location, background/filtered_location and background/motion_activity categories.

stats/server_api_time -> 1,776,462
statemachine/transition -> 885,091
background/location -> 11,900,375
background/motion_activity -> 15,228,251
background/filtered_location_ignore -> 7,657
background/filtered_location -> 10,078,204
background/battery -> 1,892,910
config/sensor_config -> 905
config/sync_config -> 298
config/consent -> 1,427
stats/server_api_error -> 29,673
stats/client_time -> 2,513,973
stats/client_nav_event -> 325,137
stats/pipeline_time -> 1,125,218
stats/client_error -> 11,780
stats/pipeline_error -> 1,453
manual/incident -> 90
manual/mode_confirm -> 118
manual/purpose_confirm -> 111
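
For reference, a minimal sketch (not from the codebase) of how per-key counts like these can be reproduced in a single aggregation pass, assuming the usual `import emission.core.get_database as edb` helper:

```python
import emission.core.get_database as edb

# Group the timeseries by metadata.key and count entries per key,
# instead of issuing one count per key.
pipeline = [
    {"$group": {"_id": "$metadata.key", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for doc in edb.get_timeseries_db().aggregate(pipeline, allowDiskUse=True):
    print("%s -> %s" % (doc["_id"], doc["count"]))
```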

shankari transferred this issue from e-mission/e-mission-server Feb 11, 2019

shankari commented Mar 1, 2019

@jf87 this is the previous bug I had filed for scalability. As you can see, I didn't get very far in my checking...

I also wrote up the goals to motivate an undergrad to work on them
https://github.com/e-mission/e-mission-docs/blob/master/docs/future_work/scalability.md

and he actually wrote up a trace generator that (unfortunately) needs some cleanup before it can be merged.
e-mission/e-mission-server#622

shankari commented Mar 1, 2019

The last time I looked at this, I thought that the issue was with the database. Here's a quick check.
We have ~46 million entries in the database. Counting them takes ~1.4 ms, which seems reasonable.

In [243]: edb.get_timeseries_db().find().count()
Out[243]: 46416708

In [244]: %timeit edb.get_timeseries_db().find().count()
1000 loops, best of 3: 1.39 ms per loop

But although the user_id field is indexed,

    TimeSeries = _get_current_db().Stage_timeseries
    TimeSeries.create_index([("user_id", pymongo.HASHED)])

Calling distinct on it takes roughly 30,000 times as long (43 seconds).

In [241]: len(edb.get_timeseries_db().distinct("user_id"))
Out[241]: 1291

In [242]: %timeit edb.get_timeseries_db().distinct("user_id")
1 loop, best of 3: 43 s per loop
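
For comparison, a hedged sketch of timing distinct against an equivalent $group aggregation on the same field. Whether the aggregation fares any better depends on whether mongodb can answer it from the hashed index, so treat this as a measurement aid rather than a fix:

```python
import timeit
import emission.core.get_database as edb

ts = edb.get_timeseries_db()

def distinct_users():
    return ts.distinct("user_id")

def grouped_users():
    # Equivalent result via aggregation: one group per distinct user_id
    return [d["_id"] for d in ts.aggregate([{"$group": {"_id": "$user_id"}}])]

print("distinct: %.1f s" % timeit.timeit(distinct_users, number=1))
print("group:    %.1f s" % timeit.timeit(grouped_users, number=1))
```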

shankari commented Mar 1, 2019

I looked at the database in the first place because of logs like these. They are the logs for a single call to read the metrics and show them in the dashboard.

2019-03-01 00:51:08,775:DEBUG:140220397770496:START POST /result/metrics/timestamp
2019-03-01 00:51:08,775:DEBUG:140220397770496:metric_list = ['duration', 'median_speed', 'count', 'distance']
2019-03-01 00:51:08,775:DEBUG:140220397770496:['duration -> <function get_duration at 0x7f87a1802950>', 'median_speed -> <function get_median_speed at 0x7f87a18029d8>', 'count -> <function get_count at 0x7f87a1802840>', 'distance -> <function get_distance at 0x7f87a18028c8>']
2019-03-01 00:51:08,775:DEBUG:140220397770496:for user None, returning timeseries <emission.storage.timeseries.aggregate_timeseries.AggregateTimeSeries object at 0x7f8776b957f0>
2019-03-01 00:51:08,775:DEBUG:140220397770496:curr_query = {'$or': [{'metadata.key': 'analysis/inferred_section'}], 'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}}, sort_key = None
2019-03-01 00:51:08,775:DEBUG:140220397770496:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/inferred_section']
2019-03-01 00:51:53,449:DEBUG:140220397770496:finished querying values for [], count = 0
2019-03-01 00:51:53,451:DEBUG:140220397770496:finished querying values for ['analysis/inferred_section'], count = 0
2019-03-01 00:52:36,638:DEBUG:140220397770496:orig_ts_db_matches = 0, analysis_ts_db_matches = 0
2019-03-01 00:53:16,953:DEBUG:140220397770496:Found 0 results
2019-03-01 00:53:16,954:DEBUG:140220397770496:Returning entry with length 0 result
2019-03-01 00:53:16,954:INFO:140220397770496:Found no entries for user None, time_query <emission.storage.timeseries.timequery.TimeQuery object at 0x7f8776b953c8>
2019-03-01 00:53:16,954:DEBUG:140220397770496:END POST /result/metrics/timestamp  128.1795883178711

The query that we run against mongodb includes the keys metadata.key and data.start_ts, both of which should be covered by indexes. Further, metadata.key is HASHED and we look it up directly, while data.start_ts is DESCENDING but we do a range query over it.

def _create_analysis_result_indices(tscoll):
    tscoll.create_index([("metadata.key", pymongo.HASHED)])

    # trips and sections
    tscoll.create_index([("data.start_ts", pymongo.DESCENDING)], sparse=True)

And yet, the first query takes 00:51:08 -> 00:51:53 = ~45 seconds. There are two more gaps of roughly 40 seconds each (presumably further calls to the database), 00:51:53 -> 00:52:36 and 00:52:36 -> 00:53:16, so the overall response time is 00:51:08 -> 00:53:16 = 2 mins 8 secs.
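
A hedged way to check whether mongodb is actually using these indexes for the slow query is to ask for the query plan with explain() (the query values are copied from the log above):

```python
import emission.core.get_database as edb

query = {'$or': [{'metadata.key': 'analysis/inferred_section'}],
         'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}}

# The winningPlan should show an IXSCAN stage; a COLLSCAN would explain the ~45 s
plan = edb.get_analysis_timeseries_db().find(query).explain()
print(plan["queryPlanner"]["winningPlan"])
```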

shankari commented Mar 1, 2019

Hm, but when I tried to run the queries directly against the database, they were much faster.

In [245]: %timeit edb.get_analysis_timeseries_db().find({'$or': [{'metadata.key': 'analysis/inferred_section'}], 'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}})
100 loops, best of 3: 11.3 ms per loop

In [246]: %timeit edb.get_timeseries_db().find({'$or': [{'metadata.key': 'analysis/inferred_section'}], 'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}})
1000 loops, best of 3: 1.19 ms per loop

That is pretty bizarre because between the two lines, I literally do nothing but query the database.

        logging.debug("orig_ts_db_keys = %s, analysis_ts_db_keys = %s" %
            (orig_ts_db_keys, analysis_ts_db_keys))

        orig_ts_db_result = self._get_entries_for_timeseries(self.timeseries_db,
                                                             orig_ts_db_keys,
                                                             time_query,
                                                             geo_query,
                                                             extra_query_list,
                                                             sort_key)

        analysis_ts_db_result = self._get_entries_for_timeseries(self.analysis_timeseries_db,
                                                                 analysis_ts_db_keys,
                                                                 time_query,
                                                                 geo_query,
                                                                 extra_query_list,
                                                                 sort_key)
        logging.debug("orig_ts_db_matches = %s, analysis_ts_db_matches = %s" %
            (orig_ts_db_result.count(), analysis_ts_db_result.count()))
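
Note that pymongo's find() only builds a cursor and does no server-side work until the cursor is iterated or counted, so the %timeit numbers in the previous comment may largely be measuring cursor construction rather than the query itself. A hypothetical wrapper (not in the codebase, reusing the parameter names above) that times each of the two calls and forces the query to execute might look like:

```python
import logging
import time

def timed_entries(ts, db, keys, time_query, geo_query, extra_query_list, sort_key):
    # Time a single _get_entries_for_timeseries call; calling count() on the
    # result makes the server actually execute the query.
    start = time.time()
    result = ts._get_entries_for_timeseries(db, keys, time_query,
                                            geo_query, extra_query_list, sort_key)
    logging.debug("querying %s took %.2f s, count = %s",
                  keys, time.time() - start, result.count())
    return result
```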

shankari commented Mar 1, 2019

Even if I sort (which shouldn't apply since the sort_key is None), it doesn't get much worse, which is not surprising since no entries are returned.

In [247]: %timeit edb.get_analysis_timeseries_db().find({'$or': [{'metadata.key': 'analysis/inferred_section'}], 'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}}).sort("metadata.write_ts")
100 loops, best of 3: 14.8 ms per loop

shankari commented Mar 1, 2019

This raises the hope that there is a simple fix in my implementation of the timeseries interface that can address the really bad performance problems. But I don't have a staging server where I can copy all this data easily...

shankari commented Mar 1, 2019

OK, so although this is bad practice in general, I added some logs directly to the timeseries code, and I think I know what's going on. Here are the logs without the fix, for the record; note the giant jumps in time between the log statements.

2019-03-01 05:09:47,488:DEBUG:140666554283776:finished querying values for ['manual/incident'], count = 0
2019-03-01 05:10:06,167:DEBUG:140666554283776:finished querying values for [], count = 0
2019-03-01 05:10:24,545:DEBUG:140666554283776:orig_ts_db_matches = 0, analysis_ts_db_matches = 0
2019-03-01 05:10:41,938:DEBUG:140666554283776:Returning entry with length 0 result

shankari commented Mar 1, 2019

ok, with a one-line fix, the response time drops to ~ 150ms. And it was such a stupid reason too.
@jf87 thank you for giving me enough motivation to take a closer look.

I will check in this fix now, although I think that the underlying code was only there to handle a use case that does not exist, and this can be simplified even further later.

2019-03-01 05:27:42,647:DEBUG:140164536702720:START POST /result/metrics/timestamp
2019-03-01 05:27:42,805:DEBUG:140164536702720:END POST /result/metrics/timestamp ea59084e-11d4-4076-9252-3b9a29ce35e0 0.15816450119018555

shankari commented Mar 1, 2019

I will commit the fix tomorrow because I want to carefully compose a commit message. If anybody wants to try out the fix, it is below. And it is also very close to the height of stupidity.

--- a/emission/storage/timeseries/builtin_timeseries.py
+++ b/emission/storage/timeseries/builtin_timeseries.py
@@ -20,7 +20,7 @@ ts_enum_map = {
     esta.EntryType.ANALYSIS_TYPE: edb.get_analysis_timeseries_db()
 }

-INVALID_QUERY = {'1': '2'}
+INVALID_QUERY = {'metadata.key': 'invalid'}

 class BuiltinTimeSeries(esta.TimeSeries):
     def __init__(self, user_id):
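
For anyone who wants to see the difference without applying the patch, a hedged sketch (assuming a pymongo version that provides count_documents) comparing the old, unindexed predicate against the new, indexed one:

```python
import timeit
import emission.core.get_database as edb

coll = edb.get_analysis_timeseries_db()

old_invalid = {'1': '2'}                    # no index on "1" -> collection scan
new_invalid = {'metadata.key': 'invalid'}   # metadata.key has a HASHED index

print("old: %.2f s" % timeit.timeit(lambda: coll.count_documents(old_invalid), number=1))
print("new: %.2f s" % timeit.timeit(lambda: coll.count_documents(new_invalid), number=1))
```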

shankari added a commit to shankari/e-mission-server that referenced this issue Mar 1, 2019
This one line fix improves performance by 99% on large databases :) And it is
also an object lesson in the law of unintended consequences, so here's a
summary to use as a test case.

Some background:
- the timeseries interface allows users to specify a set of keys that they are querying
- the underlying mongodb implementation splits the keys into the
  `timeseries_db` which stores raw data, and `analysis_timeseries_db` which
  stores processed data
- these were originally in the same database. so when we split them, to make
  sure that we don't lose any data inadvertently, we query both collections and
  merge the results
- in response to a query, instead of returning the full results in memory,
  mongodb returns a cursor that you can iterate over
- To ensure that we didn't have to read the entire results into memory every
  time, we chained the values returned from the two queries
    e-mission@5367a01

    ```
    return itertools.chain(orig_ts_db_result, analysis_ts_db_result)
    ```

So far so good. But then we ran into a series of bugs, with fixes that built on
each other.

1. If the entries are only in one database, the other database is queried with
   an empty array for the key, which returns all values
    (e-mission/e-mission-docs#168)
    - so we added a check - if there are no keys queried, we return an empty
        iterator that can be chained
        e-mission@b7f835a
1. But then, the empty iterator is not a cursor, so we can't display counts
    returned from each database (e-mission#599)
    - we fix this by performing an invalid query so that we get an empty cursor (e-mission@14aa503)
    - This is purely a nice to have, and the PR even says that the changes to
      enable it can be reverted if needed.
    - But the changes were correct, and the tests passed so we retained them

However, the INVALID_QUERY that we used was {"1": "2"}, and we do not have an
index in the database on the key "1". So as the database size grew, mongodb was
taking 45 seconds to iterate over every record and determine that there were no "1"s.

Switching from "1" -> "metadata.key", which *is* indexed, dramatically improves
performance from 2 mins 8 secs to 150 ms.

e-mission/e-mission-docs#261 (comment)
to
e-mission/e-mission-docs#261 (comment)
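
To summarize the pattern in code, here is a condensed, hedged sketch (simplified from builtin_timeseries.py, with helper names shortened) of how the two collections are queried and chained, and where the always-empty cursor comes in:

```python
import itertools

# metadata.key is indexed (HASHED), so this "always empty" query stays cheap
INVALID_QUERY = {'metadata.key': 'invalid'}

def _get_entries(coll, key_list, extra_query):
    if len(key_list) == 0:
        # Return an empty *cursor* rather than iter([]) so that callers can
        # still ask this collection for a count of matches.
        return coll.find(INVALID_QUERY)
    query = {"$or": [{"metadata.key": key} for key in key_list]}
    query.update(extra_query)
    return coll.find(query)

def find_entries(timeseries_db, analysis_db, orig_keys, analysis_keys, extra_query):
    orig_result = _get_entries(timeseries_db, orig_keys, extra_query)
    analysis_result = _get_entries(analysis_db, analysis_keys, extra_query)
    # Chain the two cursors so the caller iterates lazily over both result sets
    return itertools.chain(orig_result, analysis_result)
```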

shankari commented Mar 1, 2019

PR for the one-liner is e-mission/e-mission-server#638

However, the invalid query change was introduced in Aug 2018, while this issue was filed in Jul 2017 and edited in Jul 2018, so I don't think the PR fixes everything. There are probably some other issues with retrieving non-blank data.

@shankari

Closing this for now as fixed by e-mission/e-mission-server#638

jf87 pushed a commit to jf87/e-mission-server that referenced this issue Jun 21, 2021