Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify create_readers #1659

Merged
merged 13 commits into from
Feb 5, 2023
Merged

Conversation

pquentin
Copy link
Member

This builds upon #1657. The second commit shows a bug in the current implementation of create_readers. It was hidden by the fact we multiply by num_clients which is often more than 1 in multi-corpora implementations.

This also fixes the bug identified by the last commit.
@pquentin pquentin changed the title Simplify create readers Simplify create_readers Jan 26, 2023
@pquentin pquentin added this to the 2.7.1 milestone Jan 26, 2023
@pquentin pquentin closed this Jan 30, 2023
@pquentin pquentin reopened this Jan 30, 2023
@DJRickyB
Copy link
Contributor

DJRickyB commented Jan 30, 2023 via email

Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only wanted to leave a few high-level comments: For better or worse, bulk-indexing is among the most complex parts of Rally so IMHO it pays off to be a bit more explicit than usual when doing changes. I'm thinking specifically of:

  • Adding more comments on the algorithm in this function. What is it doing on a high-level?
  • Reducing noise as much as possible, e.g. do we need the debug log statements?
  • Simplifying control structures, e.g. can we reimplement this so we don't need a continue statement?
  • Can we pick even more descriptive variable names (e.g. what differentiates readers from staggered_readers? The similar name implies e.g. similarity in data structure which is not the case)?

staggered_readers = []
while total_readers > 0:
for reader_queue in readers:
if reader_queue:
Copy link
Member

@danielmitterdorfer danielmitterdorfer Jan 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is only needed because we eagerly add the deque to readers? If we'd only add it when there are actually documents that need to be read, we could get rid of this if.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, actually this is needed because all queues don't necessarily have the same length. Say I have the following readers list of deques: [[doc1], [doc1, doc2]] and total_reads = 3. After one iteration I will have the following readers: [[], [doc2]] and total_reads = 1. I need to skip the empty deque.

I tried to explain this in 8f05ed2 (#1659).

reader = create_reader(
docs, offset, num_lines, num_docs, batch_size, bulk_size, id_conflicts, conflict_probability, on_conflict, recency
)
readers[-1].append(reader)
Copy link
Member

@danielmitterdorfer danielmitterdorfer Jan 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we assign the deque to a variable? This might be easier to read?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 6870764 (#1659)

return readers
continue

target = f"{docs.target_index}/{docs.target_type}" if docs.target_index else "/"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the following two lines add a bit of noise in a quite complicated function. Maybe we can help the reader by moving this into a separate function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this was only needed for the debug logs, which gave me one more reason to remove them, which I have done in a26e18c (#1659)

@pquentin
Copy link
Member Author

  • Adding more comments on the algorithm in this function. What is it doing on a high-level?

Done in 8f05ed2 (#1659). This helped me find a bug which I fixed in dca5271 (#1659).

  • Reducing noise as much as possible, e.g. do we need the debug log statements?

As mentioned above, this is done in a26e18c (#1659).

  • Simplifying control structures, e.g. can we reimplement this so we don't need a continue statement?

Done in dcee647 (#1659). This was much easier with the debug logs gone!

  • Can we pick even more descriptive variable names (e.g. what differentiates readers from staggered_readers? The similar name implies e.g. similarity in data structure which is not the case)?

Done in 8f05ed2 (#1659). I also added types in 6e35a40 (#1659) to clarify further the contents of the variables.

Can you please take another look?

@pquentin
Copy link
Member Author

This LGTM, though I haven't done any local testing, and I'm temporarily without GH access. I can circle back in a few days if you'd like

@DJRickyB I appreciate the offer but you don't have to do this now, you know :)

@pquentin
Copy link
Member Author

pquentin commented Feb 1, 2023

@elasticmachine run rally/it-python38 please

Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. It looks much simpler now. LGTM

@pquentin
Copy link
Member Author

pquentin commented Feb 2, 2023

Closing/reopening to get a RtD status.

@pquentin pquentin closed this Feb 2, 2023
@pquentin pquentin reopened this Feb 2, 2023
@pquentin
Copy link
Member Author

pquentin commented Feb 2, 2023

@elasticmachine run rally/rally-tracks-compat please now that elastic/rally-tracks#375 was merged.

@pquentin pquentin merged commit e302a3d into elastic:master Feb 5, 2023
@pquentin pquentin deleted the simplify-create-readers branch February 16, 2023 06:31
@pquentin pquentin added the :misc Changes that don't affect users directly: linter fixes, test improvements, etc. label Mar 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:misc Changes that don't affect users directly: linter fixes, test improvements, etc. tech debt
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants