Speed up client-side bulk-handling #890
Conversation
With this commit we speed up preparing bulk requests from data files by implementing several optimizations:

* Bulk data are passed as a string instead of a list to the runner. This avoids the cost of converting the raw list to a string in the Python Elasticsearch client.
* Lines are read in bulk from the data source instead of line by line. This avoids many method calls (see the sketch below).
* We provide a special implementation for the common case (ids are autogenerated, no conflicts) to make the hot code path as simple as possible.

This commit also adds a microbenchmark that measures the speedup. The following table shows a comparison of the throughput of the bulk reader for various bulk sizes:

| Bulk Size | master [ops/s] | This PR [ops/s] | Speedup |
|-----------|----------------|-----------------|---------|
| 100       | 14829          | 92395           | 6.23    |
| 1000      | 1448           | 10953           | 7.56    |
| 10000     | 148            | 1100            | 7.43    |
| 100000    | 15             | 107             | 7.13    |

All data have been measured using Python 3.8 on Linux.
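To make the description concrete, here is a minimal sketch of the batching idea. It is not the actual Rally implementation: the function name, the hard-coded index name, and the assumption of one JSON document per line are invented for illustration, and it only covers the autogenerated-id case:

```python
import itertools


def read_bulk(file_source, bulk_size):
    """Return up to bulk_size documents as a single bulk request body string.

    Sketch only: ids are autogenerated, so the action metadata line is a
    constant that is simply prepended to every document line.
    """
    action_metadata = '{"index": {"_index": "test-index"}}\n'
    # Pull a whole batch from the iterator at once instead of issuing one
    # Python-level call per document.
    docs = list(itertools.islice(file_source, bulk_size))
    if not docs:
        return None
    # Build a single string up front; handing a string (rather than a list of
    # lines) to the Elasticsearch client saves it from joining the list itself.
    return "".join(action_metadata + doc for doc in docs)


# Example usage, assuming documents.json contains one JSON document per line:
# with open("documents.json") as f:
#     while (body := read_bulk(f, 1000)) is not None:
#         ...  # send body to the bulk API
```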
This is an incredible performance improvement!
LGTM, left a nit and a question.
current_bulk = []
for action_metadata_item, document in zip(self.action_metadata, self.file_source):
    # hoist
At first I had to look this up; my understanding is that this is mostly JS terminology, whereas in Python we talk about block scoping.
But what does this comment refer to here, for example? Which variable, in which block, is getting hoisted?
I know that term from JVM optimizations. In any case, we ensure that the field access is turned into a local variable access because it is used in the loop on the hot code path; that is what I was referring to here.
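For illustration, "hoisting" here means binding an instance attribute to a local variable once, before the hot loop, so the loop body does local lookups instead of repeated attribute lookups. The sketch below uses made-up class and method names, not the actual Rally code:

```python
class BulkReader:
    """Illustrative only: shows hoisting a field access into a local variable."""

    def __init__(self, action_metadata, file_source):
        self.action_metadata = action_metadata
        self.file_source = file_source

    def read_bulk(self):
        # "hoist": bind the attributes to locals once, outside the loop. In
        # CPython a local variable lookup (LOAD_FAST) is cheaper than an
        # attribute lookup on self (LOAD_ATTR) repeated on every iteration.
        action_metadata = self.action_metadata
        file_source = self.file_source
        current_bulk = []
        for action_metadata_item, document in zip(action_metadata, file_source):
            current_bulk.append(action_metadata_item)
            current_bulk.append(document)
        return current_bulk
```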
I see. For a person like me it doesn't add any additional info (it seems obvious from the implementation), but if it's valuable to someone else, that's fine.
self.meta_data_index_no_id = '{"index": {"_index": "%s"}}' % index_name
self.meta_data_index_with_id = '{"index": {"_index": "%s", "_id": "%s"}}\n' % (index_name, "%s")
self.meta_data_update_with_id = '{"update": {"_index": "%s", "_id": "%s"}}\n' % (index_name, "%s")
self.meta_data_index_no_id = '{"index": {"_index": "%s"}}\n' % index_name
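For context on these templates: the first % substitution fills in the index name but passes the literal string "%s" as the id, leaving a placeholder that can be filled cheaply once per document. A minimal illustration with made-up values:

```python
index_name = "geonames"  # made-up index name
meta_data_index_with_id = '{"index": {"_index": "%s", "_id": "%s"}}\n' % (index_name, "%s")
# Only a single formatting call remains on the per-document path:
print(meta_data_index_with_id % "42")
# -> {"index": {"_index": "geonames", "_id": "42"}}
```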
nit: this might become something to fix when we enable C4001 in pylintrc; should we add # pylint: disable=invalid-string-quote right after __init__? (I found a ref here.)
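For what it's worth, a block-level disable placed just inside __init__ would look like the snippet below; this is illustrative only, and whether to add it at all is the open question above:

```python
def __init__(self, index_name):
    # pylint: disable=invalid-string-quote
    self.meta_data_index_no_id = '{"index": {"_index": "%s"}}\n' % index_name
    self.meta_data_index_with_id = '{"index": {"_index": "%s", "_id": "%s"}}\n' % (index_name, "%s")
```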
How about we deal with this when we enable C4001?
That's fine.
Thanks for your review!