Simplify create_readers #1659
This also fixes the bug identified by the last commit.
Force-pushed from 7c9f658 to 8df6b29.
This LGTM, though I haven't done any local testing, and I'm temporarily
without GH access. I can circle back in a few days if you'd like
…On Mon, Jan 30, 2023, 7:29 AM Quentin Pradet ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In esrally/track/params.py
<#1659 (comment)>:
> - target = docs.target_data_stream
- logger.debug(
- "Task-relative clients at index [%d-%d] will bulk index [%d] docs starting from line offset [%d] for [%s] "
- "from corpus [%s].",
- start_client_index,
- end_client_index,
- num_docs,
- offset,
- target,
- corpus.name,
- )
- readers[len(corpora) * entry + group] = create_reader(
- docs, offset, num_lines, num_docs, batch_size, bulk_size, id_conflicts, conflict_probability, on_conflict, recency
- )
- else:
+ if num_docs == 0:
logger.debug(
"Task-relative clients at index [%d-%d] skip [%s] (no documents to read).",
start_client_index,
end_client_index,
corpus.name,
)
Good catch, thanks. Fixed in a9c0ce7 (#1659) and added a test.
I only wanted to leave a few high-level comments: For better or worse, bulk-indexing is among the most complex parts of Rally so IMHO it pays off to be a bit more explicit than usual when doing changes. I'm thinking specifically of:
- Adding more comments on the algorithm in this function. What is it doing on a high-level?
- Reducing noise as much as possible, e.g. do we need the debug log statements?
- Simplifying control structures, e.g. can we reimplement this so we don't need a `continue` statement?
- Can we pick even more descriptive variable names (e.g. what differentiates `readers` from `staggered_readers`? The similar name implies e.g. similarity in data structure which is not the case)?
staggered_readers = []
while total_readers > 0:
    for reader_queue in readers:
        if reader_queue:
I think this is only needed because we eagerly add the deque to `readers`? If we'd only add it when there are actually documents that need to be read, we could get rid of this `if`.
No, actually this is needed because not all queues necessarily have the same length. Say I have the following `readers` list of deques: [[doc1], [doc1, doc2]] and total_reads = 3. After one iteration I will have the following `readers`: [[], [doc2]] and total_reads = 1. On the next iteration I need to skip the empty deque.
I tried to explain this in 8f05ed2 (#1659).
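The round-robin behaviour described in this reply can be illustrated with a minimal sketch. This is not the code from esrally/track/params.py: the function name `stagger` and the string placeholders standing in for readers are hypothetical, chosen only to show why the emptiness check is needed when the deques have different lengths.

```python
from collections import deque


def stagger(reader_queues):
    """Interleave items from several deques round-robin (hypothetical helper)."""
    remaining = sum(len(queue) for queue in reader_queues)
    staggered = []
    while remaining > 0:
        for queue in reader_queues:
            # Queues may have different lengths, so a shorter one can be
            # exhausted while others still hold items; skip it in that case.
            if queue:
                staggered.append(queue.popleft())
                remaining -= 1
    return staggered


# Placeholder strings stand in for reader objects.
queues = [deque(["a1"]), deque(["b1", "b2"])]
result = stagger(queues)
print(result)
```

In the second pass over the queues the first deque is already empty, which is the case the `if` guards against; calling `popleft()` on it without the check would raise `IndexError`.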
esrally/track/params.py
Outdated
reader = create_reader(
    docs, offset, num_lines, num_docs, batch_size, bulk_size, id_conflicts, conflict_probability, on_conflict, recency
)
readers[-1].append(reader)
How about we assign the deque to a variable? This might be easier to read?
Done in 6870764 (#1659).
esrally/track/params.py
Outdated
return readers
continue

target = f"{docs.target_index}/{docs.target_type}" if docs.target_index else "/"
This and the following two lines add a bit of noise in a quite complicated function. Maybe we can help the reader by moving this into a separate function?
This was only needed for the debug logs, which gave me one more reason to remove them. I have done so in a26e18c (#1659).
They added more noise and complexity (for example, computing the `target` variable) than actual value.
Done in
As mentioned above, this is done in
Done in
Done in
Can you please take another look?
@DJRickyB I appreciate the offer but you don't have to do this now, you know :)
@elasticmachine run rally/it-python38 please
Great work. It looks much simpler now. LGTM
Closing/reopening to get a RtD status.
@elasticmachine run rally/rally-tracks-compat please now that elastic/rally-tracks#375 was merged.
This builds upon #1657. The second commit shows a bug in the current implementation of `create_readers`. It was hidden by the fact that we multiply by `num_clients`, which is often more than 1 in multi-corpora implementations.