Simplify create_readers #1659

pquentin · 2023-01-26T11:58:13Z

This builds on #1657. The second commit demonstrates a bug in the current implementation of create_readers. The bug was hidden because we multiply by num_clients, which is often greater than 1 in multi-corpora implementations.

This also fixes the bug identified by the last commit.


DJRickyB · 2023-01-30T13:39:50Z

This LGTM, though I haven't done any local testing, and I'm temporarily without GH access. I can circle back in a few days if you'd like

On Mon, Jan 30, 2023, 7:29 AM Quentin Pradet wrote:

In esrally/track/params.py:

-            target = docs.target_data_stream
-            logger.debug(
-                "Task-relative clients at index [%d-%d] will bulk index [%d] docs starting from line offset [%d] for [%s] "
-                "from corpus [%s].",
-                start_client_index,
-                end_client_index,
-                num_docs,
-                offset,
-                target,
-                corpus.name,
-            )
-            readers[len(corpora) * entry + group] = create_reader(
-                docs, offset, num_lines, num_docs, batch_size, bulk_size, id_conflicts, conflict_probability, on_conflict, recency
-            )
-        else:
+        if num_docs == 0:
             logger.debug(
                 "Task-relative clients at index [%d-%d] skip [%s] (no documents to read).",
                 start_client_index,
                 end_client_index,
                 corpus.name,
             )

Good catch, thanks. Fixed in a9c0ce7 (#1659) and added a test.

danielmitterdorfer

I only wanted to leave a few high-level comments. For better or worse, bulk indexing is among the most complex parts of Rally, so IMHO it pays off to be a bit more explicit than usual when making changes. I'm thinking specifically of:

Adding more comments on the algorithm in this function. What is it doing at a high level?
Reducing noise as much as possible, e.g. do we need the debug log statements?
Simplifying control structures, e.g. can we reimplement this so we don't need a continue statement?
Can we pick even more descriptive variable names (e.g. what differentiates readers from staggered_readers? The similar name implies e.g. similarity in data structure which is not the case)?

danielmitterdorfer · 2023-01-30T16:36:15Z

esrally/track/params.py

+    staggered_readers = []
+    while total_readers > 0:
+        for reader_queue in readers:
+            if reader_queue:


I think this is only needed because we eagerly add the deque to readers? If we'd only add it when there are actually documents that need to be read, we could get rid of this if.

No, actually this is needed because not all queues necessarily have the same length. Say I have the following readers list of deques: [[doc1], [doc1, doc2]] and total_readers = 3. After one iteration I will have the following readers: [[], [doc2]] and total_readers = 1. I need to skip the empty deque.

I tried to explain this in 8f05ed2 (#1659).
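For readers following the thread, the round-robin step discussed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name stagger, the string placeholders standing in for readers, and the way total_readers is computed and decremented are all assumptions made for the example.

```python
from collections import deque
from typing import Deque, List


def stagger(reader_queues: List[Deque[str]]) -> List[str]:
    """Take one item from each queue in turn until all queues are empty.

    Hypothetical helper for illustration only; strings stand in for readers.
    """
    # Assumption: the total is simply the number of queued items.
    total_readers = sum(len(queue) for queue in reader_queues)
    staggered_readers = []
    while total_readers > 0:
        for reader_queue in reader_queues:
            # Queues may have different lengths, so a shorter queue can
            # already be empty while longer ones still hold items.
            if reader_queue:
                staggered_readers.append(reader_queue.popleft())
                total_readers -= 1
    return staggered_readers


queues = [deque(["a1"]), deque(["b1", "b2"])]
result = stagger(queues)
print(result)  # ['a1', 'b1', 'b2']
```

Without the emptiness check, the second pass over the queues would call popleft() on the exhausted first deque and raise IndexError.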

danielmitterdorfer · 2023-01-30T16:36:42Z

esrally/track/params.py

+            reader = create_reader(
+                docs, offset, num_lines, num_docs, batch_size, bulk_size, id_conflicts, conflict_probability, on_conflict, recency
+            )
+            readers[-1].append(reader)


How about we assign the deque to a variable? This might be easier to read?

Done in 6870764 (#1659)

danielmitterdorfer · 2023-01-30T16:37:38Z

esrally/track/params.py

-    return readers
+                continue
+
+            target = f"{docs.target_index}/{docs.target_type}" if docs.target_index else "/"


This and the following two lines add a bit of noise in a quite complicated function. Maybe we can help the reader by moving this into a separate function?

This was only needed for the debug logs, which gave me one more reason to remove them. I have done so in a26e18c (#1659).

They added more noise and complexity (for example to compute the target variable) than actual value.
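For context, the reviewer's suggestion of moving the target computation into a separate function could have looked roughly like the sketch below, had the logs been kept. Everything here is hypothetical: the helper name describe_target and the Docs stand-in class are invented for the example, and only the single line visible in the diff is covered (the two following lines mentioned in the review are not shown in this thread).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Docs:
    """Hypothetical stand-in with only the attributes used in the diff line."""

    target_index: Optional[str] = None
    target_type: Optional[str] = None


def describe_target(docs: Docs) -> str:
    """Return a printable target, mirroring the one-line expression in the diff."""
    if docs.target_index:
        return f"{docs.target_index}/{docs.target_type}"
    return "/"


with_index = describe_target(Docs(target_index="logs", target_type="_doc"))
without_index = describe_target(Docs())
print(with_index, without_index)
```

Since the value was only consumed by the debug logs, removing the logs made such a helper unnecessary.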

pquentin · 2023-01-31T14:51:50Z

Adding more comments on the algorithm in this function. What is it doing at a high level?

Done in 8f05ed2 (#1659). This helped me find a bug which I fixed in dca5271 (#1659).

Reducing noise as much as possible, e.g. do we need the debug log statements?

As mentioned above, this is done in a26e18c (#1659).

Simplifying control structures, e.g. can we reimplement this so we don't need a continue statement?

Done in dcee647 (#1659). This was much easier with the debug logs gone!

Can we pick even more descriptive variable names (e.g. what differentiates readers from staggered_readers? The similar name implies e.g. similarity in data structure which is not the case)?

Done in 8f05ed2 (#1659). I also added types in 6e35a40 (#1659) to further clarify the contents of the variables.

Can you please take another look?

pquentin · 2023-01-31T14:52:54Z

This LGTM, though I haven't done any local testing, and I'm temporarily without GH access. I can circle back in a few days if you'd like

@DJRickyB I appreciate the offer but you don't have to do this now, you know :)

pquentin · 2023-02-01T06:14:53Z

@elasticmachine run rally/it-python38 please

danielmitterdorfer

Great work. It looks much simpler now. LGTM

pquentin · 2023-02-02T09:47:36Z

Closing/reopening to get a RtD status.

pquentin · 2023-02-02T19:04:06Z

@elasticmachine run rally/rally-tracks-compat please now that elastic/rally-tracks#375 was merged.

pquentin added 2 commits January 26, 2023 10:51

Cleanup params_test.py file

5417c1d

Show bug in create_readers

05f9a00

pquentin added the tech debt label Jan 26, 2023

pquentin requested a review from DJRickyB January 26, 2023 11:58

pquentin self-assigned this Jan 26, 2023

pquentin mentioned this pull request Jan 26, 2023

Reduce the size of the readers list in create_readers #1658

Closed

pquentin added 2 commits January 26, 2023 16:11

Simplify create_readers method

c972319

This also fixes the bug identified by the last commit.

Reduce indentation in create_readers

8df6b29

pquentin force-pushed the simplify-create-readers branch from 7c9f658 to 8df6b29 Compare January 26, 2023 12:15

pquentin changed the title ~~Simplify create readers~~ Simplify create_readers Jan 26, 2023

pquentin added this to the 2.7.1 milestone Jan 26, 2023

Merge remote-tracking branch 'origin/master' into simplify-create-rea…

d255780

…ders

pquentin mentioned this pull request Jan 27, 2023

Allow indexing data in order with multiple indexing clients #1650

Closed

DJRickyB reviewed Jan 27, 2023


pquentin added 2 commits January 30, 2023 11:42

Add missing continue

a9c0ce7

Fix precommit by updating isort

f7f2a80

See PyCQA/isort#2078

pquentin closed this Jan 30, 2023

pquentin reopened this Jan 30, 2023

danielmitterdorfer reviewed Jan 30, 2023


pquentin added 6 commits January 31, 2023 16:00

Remove debug logs from create_readers

a26e18c

They added more noise and complexity (for example to compute the target variable) than actual value.

Assign reader_queue to a variable

6870764

Remove continue statement

dcee647

Add type annotations to create_readers

6e35a40

Improve comments and variable names

8f05ed2

Fix reordered_corpora bug

dca5271

pquentin requested a review from danielmitterdorfer January 31, 2023 14:51

danielmitterdorfer approved these changes Feb 2, 2023


pquentin closed this Feb 2, 2023

pquentin reopened this Feb 2, 2023

pquentin merged commit e302a3d into elastic:master Feb 5, 2023

pquentin deleted the simplify-create-readers branch February 16, 2023 06:31

pquentin added the :misc Changes that don't affect users directly: linter fixes, test improvements, etc. label Mar 2, 2023
