Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batch Enumerator Yielding Relations #91

Merged
merged 4 commits into from
May 20, 2021
Merged

Conversation

adrianna-chang-shopify
Copy link
Contributor

Take two of #86

Context

  • Currently, the resulting batch from #active_record_on_batches is an array of records
  • Many job authors ultimately want a relation (ie. to perform a batch operation such as #update_all or #delete_all) and convert this to a relation: Model.where(id: records.map(&:id)).update_all. This produces an extra query
  • Not only do we perform this extra query, but we've already loaded the full records into memory, when instead we could optimize and only pluck the cursor-related columns to produce the cursor

Proposed Solution

  • New API for returning an enumerator yielding batch relations: #active_record_on_batch_relations, leaving existing batch enumerator API for records intact
  • We're written this as a new enumerator class, ActiveRecordBatchEnumerator. The existing ActiveRecordEnumerator enumerator and ActiveRecordCursor classes have a lot of duplication and should probably be refactored. I intend to go back and abstract away cursor-related details from this new ActiveRecordBatchEnumerator class, while also fixing up the existing ActiveRecordCursor, in a separate PR.
  • This new class is an actual Enumerator and defines #each
  • This new batch enumerator has two key optimizations to note:
    • Instead of using record = relation.last and then querying the cursor columns on the record to construct the cursor, we pluck only the cursor columns. relation.last will load all of the records because we are using a LIMIT, as described here. Consequently, we optimize by only plucking what we need.
    • We build the relation to yield using the primary keys of the records in the original relation. Since we have these values anyways, building a relation based on the PKs ends up being much more efficient for the user when they load the relation inside #each_iteration than reusing the original relation (which may have had complex logic / joins / etc). This was actually inspired by how Rails does #in_batches
  • The rest of the logic is pretty similar to what's in ActiveRecordEnumerator and ActiveRecordCursor, with some simplifications, given that now everything is happening within a single object.

End Result

  • Users can enumerate on relations directly
  • These relations are not loaded when they are yielded
  • We perform a single query per batch, SELECT <cursor_columns> FROM relation LIMIT <batch_size>

@adrianna-chang-shopify
Copy link
Contributor Author

cc @Shopify/rails @Shopify/job-patterns

@adrianna-chang-shopify
Copy link
Contributor Author

@etiennebarrie shall we ship this?

Copy link
Member

@etiennebarrie etiennebarrie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some propositions.

test/unit/active_record_batch_enumerator_test.rb Outdated Show resolved Hide resolved
test/unit/active_job_iteration_test.rb Outdated Show resolved Hide resolved
test/unit/active_record_batch_enumerator_test.rb Outdated Show resolved Hide resolved
test/unit/active_record_batch_enumerator_test.rb Outdated Show resolved Hide resolved
lib/job-iteration/active_record_batch_enumerator.rb Outdated Show resolved Hide resolved
@adrianna-chang-shopify adrianna-chang-shopify merged commit 01463b7 into master May 20, 2021
@adrianna-chang-shopify adrianna-chang-shopify deleted the batch-enumerator branch May 20, 2021 19:45
yield relation, cursor_value
end
else
to_enum(:each)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You forgot the size here 😄

@shopify-shipit shopify-shipit bot temporarily deployed to rubygems May 20, 2021 19:48 Inactive
def each
if block_given?
while (relation = next_batch)
break if @cursor.nil?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is the @cursor nil? All places that set it set to Array.wrap() and if you pass nil that will return an empty array.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #next_batch:

      cursor = cursor_values.last
      return unless cursor.present?
      # The primary key was plucked, but original cursor did not include it, so we should remove it
      cursor.pop unless @primary_key_index
      @cursor = Array.wrap(cursor)

If the last batch is empty, we'll return early here, so @cursor will be nil.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not clear at all. You are relying in hoisting to define the value of a variable and you need to remember that is how the interpreter works. In that case it would be better to do:

      cursor = cursor_values.last
      if cursor.present?
        # The primary key was plucked, but original cursor did not include it, so we should remove it
        cursor.pop unless @primary_key_index
        @cursor = Array.wrap(cursor)
     else
        @cursor = nil
     end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revisited the code and realized that @cursor actually can't be nil 🤦‍♀️ The value will carry over from the previous batch, even if we return early. The check is unnecessary, so we'll put something up to remove it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants