Fix the invalid query to use an indexed key #638

Merged: 1 commit, Mar 5, 2019

Commits on Mar 1, 2019

  1. Fix the invalid query to use an indexed key

    This one-line fix improves performance by 99% on large databases :) It is
    also an object lesson in the law of unintended consequences, so here's a
    summary to use as a test case.
    
    Some background:
    - the timeseries interface allows users to specify a set of keys that they are querying
    - the underlying mongodb implementation splits the keys between
      `timeseries_db`, which stores raw data, and `analysis_timeseries_db`, which
      stores processed data
    - these were originally in the same database, so when we split them, we query
      both collections and merge the results to make sure that we don't lose any
      data inadvertently
    - in response to a query, instead of returning the full results in memory,
      mongodb returns a cursor that you can iterate over
    - To ensure that we didn't have to read the entire results into memory every
      time, we chained the values returned from the two queries
        e-mission@5367a01
    
        ```
        return itertools.chain(orig_ts_db_result, analysis_ts_db_result)
        ```
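
A minimal sketch of why chaining avoids loading everything into memory: `itertools.chain` only pulls from the second iterable once the first is exhausted, so neither result set is materialized up front. The generator `fake_cursor` and the sample values are hypothetical stand-ins for MongoDB cursors, not part of the e-mission code.

```python
import itertools

def fake_cursor(name, entries, log):
    """Hypothetical stand-in for a database cursor: yields entries lazily."""
    for entry in entries:
        log.append(name)  # record which "database" was actually read
        yield entry

log = []
orig_ts_db_result = fake_cursor("timeseries_db", [1, 2], log)
analysis_ts_db_result = fake_cursor("analysis_timeseries_db", [3], log)

combined = itertools.chain(orig_ts_db_result, analysis_ts_db_result)
first = next(combined)         # reads a single entry, from the first cursor only
reads_after_first = list(log)  # snapshot of what has been read so far
rest = list(combined)          # drains the first cursor, then the second
```

After the first `next()` call, only one read from `timeseries_db` has been logged; the second "database" is not touched until the first is exhausted.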
    
    So far so good. But then we ran into a series of bugs, and each fix built on
    the one before it.
    
    1. If the entries are only in one database, the other database is queried with
       an empty array for the key, which returns all values
        (e-mission/e-mission-docs#168)
        - so we added a check - if there are no keys queried, we return an empty
            iterator that can be chained
            e-mission@b7f835a
    1. But then, the empty iterator is not a cursor, so we can't display counts
        returned from each database (e-mission#599)
        - we fixed this by performing an invalid query, which returns an empty cursor (e-mission@14aa503)
        - This is purely a nice to have, and the PR even says that the changes to
          enable it can be reverted if needed.
        - But the changes were correct, and the tests passed so we retained them
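
The chain of fixes above can be sketched with a toy in-memory model. Everything here is hypothetical: `ToyCursor`, `toy_find`, `query_for`, the three `find_*` functions and the key names are illustrations, and the assumption that an empty key list leads to a query with no key filter is mine, not taken from the e-mission code. Only the `{"1": "2"}` value comes from the commit message.

```python
INVALID_QUERY = {"1": "2"}  # from the commit message: no document has a field "1"

class ToyCursor:
    """Hypothetical stand-in for a cursor: iterable, and able to report a count."""
    def __init__(self, docs):
        self._docs = list(docs)
    def __iter__(self):
        return iter(self._docs)
    def count(self):
        return len(self._docs)

def toy_find(docs, query):
    """Toy matcher on flat fields; supports equality and {"$in": [...]} only."""
    def matches(doc):
        for field, cond in query.items():
            value = doc.get(field)
            if isinstance(cond, dict):
                if value not in cond["$in"]:
                    return False
            elif value != cond:
                return False
        return True
    return ToyCursor(d for d in docs if matches(d))

def query_for(key_list):
    # Assumption: with no keys, no key filter is added, so the query is empty
    return {"metadata.key": {"$in": key_list}} if key_list else {}

def find_unguarded(docs, key_list):
    return toy_find(docs, query_for(key_list))  # empty key_list matches everything

def find_empty_iterator(docs, key_list):
    if not key_list:
        return iter([])                         # chainable, but has no count()
    return toy_find(docs, query_for(key_list))

def find_empty_cursor(docs, key_list):
    if not key_list:
        return toy_find(docs, INVALID_QUERY)    # a cursor with zero results
    return toy_find(docs, query_for(key_list))

docs = [{"metadata.key": "key_a"}, {"metadata.key": "key_b"}]
```

In this model, `find_unguarded(docs, [])` returns both documents, `find_empty_iterator(docs, [])` returns nothing but cannot report a count, and `find_empty_cursor(docs, [])` returns nothing and still answers `count()`.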
    
    However, the INVALID_QUERY that we used was {"1": "2"}, and we do not have an
    index in the database on the key "1". So as the database size grew, mongodb was
    taking 45 seconds to iterate over every record and determine that there were no "1"s.
    
    Switching from "1" -> "metadata.key", which *is* indexed, dramatically improves
    performance from 2 mins 6 secs to 150 ms.
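
A toy model of why the indexed key matters, under stated assumptions: `ToyCollection` is a hypothetical in-memory class, not MongoDB, and the fixed query value `"2"` is assumed, since the commit message only says that the key changed. The point is how many documents each query has to examine before concluding that nothing matches.

```python
from collections import defaultdict

class ToyCollection:
    """Toy collection with a single index on "metadata.key"; not MongoDB."""
    def __init__(self, docs):
        self.docs = list(docs)
        self.index = defaultdict(list)
        for doc in self.docs:
            self.index[doc["metadata.key"]].append(doc)
        self.examined = 0  # documents looked at by the most recent find()

    def find(self, query):
        (field, value), = query.items()  # single-field equality queries only
        if field == "metadata.key":
            candidates = self.index.get(value, [])  # index lookup
        else:
            candidates = self.docs                  # no index: full scan
        self.examined = len(candidates)
        return [d for d in candidates if d.get(field) == value]

coll = ToyCollection({"metadata.key": "key_%d" % (i % 10)} for i in range(100_000))

unindexed = coll.find({"1": "2"})           # original INVALID_QUERY
examined_unindexed = coll.examined

indexed = coll.find({"metadata.key": "2"})  # assumed form of the fixed query
examined_indexed = coll.examined
```

Both queries return an empty result, but the unindexed one examines every document in the toy collection while the indexed one examines none.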
    
    e-mission/e-mission-docs#261 (comment)
    to
    e-mission/e-mission-docs#261 (comment)
    shankari committed Mar 1, 2019
    7f8ccbf