Batch Enumerator Yielding Relations #91
Co-authored-by: Étienne Barrié <[email protected]>
cc @Shopify/rails @Shopify/job-patterns
@etiennebarrie shall we ship this?
Some propositions.
83b4ad2 to 7c9a1be
      yield relation, cursor_value
    end
  else
    to_enum(:each)
You forgot the size here 😄
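For reference, a minimal sketch of what supplying the size could look like, assuming the comment refers to the optional size block that Ruby's `to_enum` accepts. `BatchSketch` and its in-memory batches are hypothetical stand-ins for illustration, not the class under review.

```ruby
# Hypothetical stand-in for an enumerator class; not the PR's implementation.
class BatchSketch
  def initialize(batches)
    @batches = batches
  end

  def each
    # Without a block, return an Enumerator. The block passed to to_enum
    # lets Enumerator#size answer lazily, without iterating.
    return to_enum(:each) { size } unless block_given?

    @batches.each { |batch| yield batch }
  end

  def size
    @batches.size
  end
end

with_size = BatchSketch.new([[1, 2], [3]]).each
without_size = [[1, 2], [3]].to_enum(:each)

puts with_size.size.inspect     # the size block was given
puts without_size.size.inspect  # no size block was given
```

Without the block, `Enumerator#size` has nothing to compute from and reports `nil`, which is presumably what the comment is pointing at.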
def each
  if block_given?
    while (relation = next_batch)
      break if @cursor.nil?
When is @cursor nil? Every place that sets it uses Array.wrap(), and if you pass nil, that returns an empty array.
In #next_batch:
cursor = cursor_values.last
return unless cursor.present?
# The primary key was plucked, but original cursor did not include it, so we should remove it
cursor.pop unless @primary_key_index
@cursor = Array.wrap(cursor)
If the last batch is empty, we'll return early here, so @cursor will be nil.
That is not clear at all. You are relying on hoisting to define the value of a variable, and you need to remember that this is how the interpreter works. In that case it would be better to do:
cursor = cursor_values.last
if cursor.present?
# The primary key was plucked, but original cursor did not include it, so we should remove it
cursor.pop unless @primary_key_index
@cursor = Array.wrap(cursor)
else
@cursor = nil
end
Revisited the code and realized that @cursor actually can't be nil 🤦‍♀️ The value will carry over from the previous batch, even if we return early. The check is unnecessary, so we'll put something up to remove it.
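The carry-over can be checked with a small sketch. `CursorSketch` is a hypothetical stand-in for the enumerator: plain arrays replace the plucked cursor values, and `Array()` with `nil?`/`empty?` checks replaces ActiveSupport's `Array.wrap` and `present?`, so this only approximates the code quoted above.

```ruby
# Hypothetical stand-in; plain Ruby replaces ActiveSupport's Array.wrap / present?.
class CursorSketch
  attr_reader :cursor

  def initialize(batches, start_cursor = nil)
    @batches = batches               # each batch is a list of cursor values
    @cursor = Array(start_cursor)    # Array(nil) is [], so never nil
  end

  def next_batch
    cursor_values = @batches.shift || []
    cursor = cursor_values.last
    # Early return on an empty batch: @cursor is not touched here,
    # so it keeps whatever the previous call assigned.
    return if cursor.nil? || cursor.empty?

    @cursor = Array(cursor)
    cursor_values
  end
end

sketch = CursorSketch.new([[[1], [2]], []])
before_first = sketch.cursor     # value set by the constructor
sketch.next_batch
after_first = sketch.cursor      # last cursor of the first batch
last_result = sketch.next_batch  # empty batch, early return
after_empty = sketch.cursor      # unchanged by the early return
```

In this sketch the early return makes `next_batch` itself return `nil`, which is already enough to end a `while (relation = next_batch)` loop, so a separate `@cursor.nil?` check would never fire.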
Take two of #86
Context
- What #active_record_on_batches yields is an array of records.
- Some jobs want to operate on the batch as a relation (e.g. #update_all or #delete_all) and convert this to a relation: Model.where(id: records.map(&:id)).update_all. This produces an extra query.

Proposed Solution
- Add #active_record_on_batch_relations, leaving the existing batch enumerator API for records intact.
- Add a new class, ActiveRecordBatchEnumerator. The existing ActiveRecordEnumerator enumerator and ActiveRecordCursor classes have a lot of duplication and should probably be refactored. I intend to go back and abstract away cursor-related details from this new ActiveRecordBatchEnumerator class, while also fixing up the existing ActiveRecordCursor, in a separate PR.
- … Enumerator and defines #each.
- Instead of loading record = relation.last and then querying the cursor columns on the record to construct the cursor, we pluck only the cursor columns. relation.last will load all of the records because we are using a LIMIT, as described here. Consequently, we optimize by only plucking what we need.
- We yield a new relation to #each_iteration rather than reusing the original relation (which may have had complex logic / joins / etc). This was actually inspired by how Rails does #in_batches.
- The cursor logic follows ActiveRecordEnumerator and ActiveRecordCursor, with some simplifications, given that now everything is happening within a single object.

End Result
SELECT <cursor_columns> FROM relation LIMIT <batch_size>
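
The pluck-then-yield flow described in this PR can be sketched without a database. This is an illustration under stated assumptions only: `RelationBatchSketch`, `ROWS`, and `pluck_ids` are hypothetical names, rows are in-memory hashes, a single `id` column stands in for the cursor columns, and the yielded "relation" is just the list of ids that a real implementation would presumably scope a query to.

```ruby
# Hypothetical in-memory illustration; not the PR's ActiveRecordBatchEnumerator.
ROWS = (1..7).map { |id| { id: id, name: "row #{id}" } }.freeze

class RelationBatchSketch
  def initialize(rows, batch_size:, cursor: nil)
    @rows = rows
    @batch_size = batch_size
    @cursor = cursor
  end

  def each
    return to_enum(:each) unless block_given?

    loop do
      ids = pluck_ids
      break if ids.empty?

      @cursor = ids.last
      # Stand-in for yielding a relation scoped to these ids, plus the cursor.
      yield ids, @cursor
    end
  end

  private

  # Mimics plucking only the cursor column for one batch:
  # ids greater than the cursor, ordered, limited to batch_size.
  def pluck_ids
    ids = @rows.map { |row| row[:id] }.sort
    ids = ids.select { |id| id > @cursor } if @cursor
    ids.first(@batch_size)
  end
end

batches = RelationBatchSketch.new(ROWS, batch_size: 3).each.to_a
batches.each { |ids, cursor| puts "ids=#{ids.inspect} cursor=#{cursor}" }
```

Only the cursor column is read to decide each batch boundary; the full rows are never loaded by the enumerator itself, which is the saving the description is after.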