quick fixes for scalability? #261

Right now, the pipeline is so slow that we can only run it once a day. It would be nice to be able to run it more frequently. Let's take a quick look at what's taking the time, and see whether we can do a quick fix to speed it up.

Comments
I thought that the slowness might be because there is too much monitoring data (e.g. API calls, pipeline_time, etc.). But no, the bulk of the data is real and in the …
@jf87 this is the previous bug I had filed for scalability. As you can see, I didn't get very far in my checking... I also wrote up the goals to motivate an undergrad to work on them, and he actually built a trace generator that (unfortunately) needs some cleanup before it can be merged.
The last time I looked at this, I thought that the issue was with the database. Here's a quick check.
But although the …
Calling …
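The quick-check snippet itself is missing above; as a stand-in, here is a minimal sketch of one way to sanity-check the sizes of the two timeseries collections. The database and collection names (Stage_database, Stage_timeseries, Stage_analysis_timeseries) are assumptions based on the e-mission-server layout, not something quoted from this issue.

```
import pymongo

# Hypothetical quick check: how big are the two collections that back the
# timeseries interface? Names below are assumptions, not quoted from the issue.
db = pymongo.MongoClient("localhost").Stage_database

for name in ["Stage_timeseries", "Stage_analysis_timeseries"]:
    # estimated_document_count() reads collection metadata, so it stays fast
    # even when the collections are large
    print(name, db[name].estimated_document_count())
```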
The reason why I looked at the database in the first place is because of logs like this. These are the logs for a single call to read the metrics and show them in the dashboard.
The query that we run against mongodb includes the keys …, and yet the query takes ~ …
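The actual key list and timing were also lost above. For orientation, here is a hedged sketch of roughly what such a metrics query looks like; the field names "metadata.key" and "data.start_ts", and the example key, are assumptions based on the summary later in this thread, not a copy of the real query.

```
import pymongo

db = pymongo.MongoClient("localhost").Stage_database
start_ts, end_ts = 1500000000, 1500086400  # arbitrary one-day window for illustration

# Hypothetical shape of the dashboard metrics query: a set of metadata keys
# plus a time range. Field names and the key list are assumptions.
query = {
    "metadata.key": {"$in": ["analysis/inferred_section"]},
    "data.start_ts": {"$gte": start_ts, "$lte": end_ts},
}
print(db.Stage_analysis_timeseries.count_documents(query))
```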
hm, but when I tried to run the queries directly against the database, it was much faster.
That is pretty bizarre because between the two lines, I literally do nothing but query the database.
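A minimal sketch of how that direct-against-the-database timing check can be reproduced, with and without a sort. The collection, field names, and query are again assumptions; the point is only the time.time() bracketing around the query.

```
import time
import pymongo

db = pymongo.MongoClient("localhost").Stage_database
query = {"metadata.key": {"$in": ["analysis/inferred_section"]}}  # assumed query

start = time.time()
docs = list(db.Stage_analysis_timeseries.find(query))
print("direct query: %d docs in %.3f s" % (len(docs), time.time() - start))

# Repeating the check with an explicit sort; with an index on the sort field,
# this should not change the timing much.
start = time.time()
docs = list(db.Stage_analysis_timeseries.find(query)
              .sort("metadata.write_ts", pymongo.ASCENDING))
print("with sort:    %d docs in %.3f s" % (len(docs), time.time() - start))
```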
even if I sort (which shouldn't apply, since the sort_key is …)
this raises the hope that there is a simple fix in my implementation of the timeseries interface that can address the really bad performance problems. But I don't have a staging server where I can copy all this data easily...
ok so although this is bad practice in general, I added some logs directly to the timeseries. And I think I know what's going on. Here are the logs without the fix, for the record - see the giant jumps in time between the log statements.
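The log output itself was not captured here. As an illustration of the kind of instrumentation being described, here is a hedged sketch of timing log statements around the two per-collection queries inside a timeseries-style lookup; the function and variable names are made up for the example.

```
import logging
import time
import pymongo

logging.basicConfig(level=logging.DEBUG)
db = pymongo.MongoClient("localhost").Stage_database

def find_entries(query):
    # Hypothetical version of the timeseries lookup with timing logs added,
    # mirroring the "log before / after each step" debugging described above.
    t0 = time.time()
    raw = db.Stage_timeseries.find(query)
    logging.debug("raw query issued after %.3f s", time.time() - t0)

    t1 = time.time()
    analysis = db.Stage_analysis_timeseries.find(query)
    logging.debug("analysis query issued after %.3f s", time.time() - t1)

    t2 = time.time()
    results = list(raw) + list(analysis)
    logging.debug("iterating both cursors took %.3f s", time.time() - t2)
    return results
```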
ok, with a one-line fix, the response time drops to ~150ms. And it was such a stupid reason too. I will check in this fix now, although I think that the underlying cause was code added to handle a use case that does not exist, and this can be simplified even further later.
I will commit the fix tomorrow because I want to carefully compose a commit message. If anybody wants to try out the fix, it is below. And it is also very close to the height of stupidity.
This one-line fix improves performance by 99% on large databases :) And it is also an object lesson in the law of unintended consequences, so here's a summary to use as a test case.

Some background:
- the timeseries interface allows users to specify a set of keys that they are querying
- the underlying mongodb implementation splits the keys into the `timeseries_db`, which stores raw data, and the `analysis_timeseries_db`, which stores processed data
- these were originally in the same database, so when we split them, to make sure that we don't lose any data inadvertently, we query both collections and merge the results
- in response to a query, instead of returning the full results in memory, mongodb returns a cursor that you can iterate over
- to ensure that we didn't have to read the entire result set into memory every time, we chained the values returned from the two queries (e-mission@5367a01)

```
return itertools.chain(orig_ts_db_result, analysis_ts_db_result)
```

So far so good. But then we ran into a series of bugs whose fixes built on each other.
1. If the entries are only in one database, the other database is queried with an empty array for the key, which returns all values (e-mission/e-mission-docs#168)
   - so we added a check: if there are no keys queried, we return an empty iterator that can be chained (e-mission@b7f835a)
2. But then, the empty iterator is not a cursor, so we can't display counts returned from each database (e-mission#599)
   - we fix this by performing an invalid query so that we get an empty cursor (e-mission@14aa503)
   - this is purely a nice-to-have, and the PR even says that the changes to enable it can be reverted if needed
   - but the changes were correct and the tests passed, so we retained them

However, the INVALID_QUERY that we used was {"1": "2"}, and we do not have an index in the database on the key "1". So as the database size grew, mongodb was taking 45 seconds to iterate over records and determine that there were no "1"s.

Switching from "1" -> "metadata.key", which *is* indexed, dramatically improves performance from 2 mins 6 secs to 150 ms. e-mission/e-mission-docs#261 (comment) to e-mission/e-mission-docs#261 (comment)
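For reference, a minimal sketch of the pattern described above. The constant name INVALID_QUERY and the "1" -> "metadata.key" switch come from the summary; the replacement value, the collection setup, and the helper function are assumptions, not a copy of the actual patch in e-mission/e-mission-server#638.

```
import pymongo

db = pymongo.MongoClient("localhost").Stage_database

# Before: "1" is not an indexed field, so mongodb has to scan every document
# just to conclude that nothing matches -> minutes on a large collection.
INVALID_QUERY_OLD = {"1": "2"}

# After: "metadata.key" is indexed, so the impossible value is rejected via an
# index lookup and the empty cursor comes back in milliseconds.
# (The exact replacement value here is an assumption.)
INVALID_QUERY_NEW = {"metadata.key": "invalid_key_that_matches_nothing"}

def empty_cursor(collection):
    # Returns a genuine pymongo cursor with zero results, so callers can still
    # use cursor-only operations (e.g. per-database result counts).
    return collection.find(INVALID_QUERY_NEW)

print(sum(1 for _ in empty_cursor(db.Stage_timeseries)))  # 0
```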
The PR for the one-liner is e-mission/e-mission-server#638. However, the invalid query change was introduced in Aug 2018, and this issue was filed in Jul 2017 and edited in Jul 2018, so I don't think the PR fixes everything. There are probably some other issues with retrieving non-blank data.
Closing this for now as fixed by e-mission/e-mission-server#638.