Better support for sharding in lifecycle callbacks #1166

lunaru · 2020-06-04T19:04:10Z

@pat I wanted to create this issue to start this discussion to make sure we've exhausted all of the documented solutions.

Right now, we have a model with enough records that a single index is not practical, so instead, we have something like this:

N.times do |i|
  ThinkingSphinx::Index.define :document, name: "document_#{i}", :with => :active_record, :delta => ThinkingSphinx::Deltas::SidekiqDelta do
   
    where "MOD(...) = #{i}"
  end
end

However, this as the problem that when a record is modified, it ends up creating N delta jobs, one for each index. If N is large enough, this becomes a problem.

It looks like the offending code is here:

thinking-sphinx/lib/thinking_sphinx/active_record/callbacks/delta_callbacks.rb

Line 16 in 1adcf81

index.delta_processor.delete index, instance

So the question is: Is there a better way to do sharding that avoids this landmine? Or conversely, if this is the best way to shard, then would a PR to help solve this problem be welcome?

The text was updated successfully, but these errors were encountered:

pat · 2020-06-05T04:36:46Z

Hi @lunaru

This is a great point, and you're right that the existing solutions aren't ideal. I was thinking about whether a custom IndexSet class could help, or a subclass of ThinkingSphinx::Deltas::SidekiqDelta, but neither is going to provide the ability to filter indices based on the current model instance, just the model.

If there are ideas for PRs, please do share, because I'd love your experience (especially given I'm not using a sharding approach for any of my apps). I'm currently wondering if changing IndexSet filtering to accept an instance (much like it accepts a model class, see line 46 of that delta callback file), and then that would allow picking and choosing specific shards based on that instance. Other ideas are welcome too!

Also: I guess another way is to use real-time indices rather than deltas, and then you've got a touch more control over the callbacks in the model, and can invoke them a little more specifically. But that is a significant change if you're comfortable with SQL-backed indices.

lunaru · 2020-06-05T20:48:17Z

@pat thanks for sharing your thoughts. The idea that we have to solve this would be to allow the model to optionally implement ts_index_names.

Each of the Callback classes would then check instance.respond_to? :ts_index_names and if it does, it will select the indices against those names to narrow the indices that get update/delete/etc calls.

If you think this approach is agreeable, we'll take a stab at the PR ASAP.

pat · 2020-06-22T02:56:45Z

Hi @lunaru - sorry for the delay in getting back to you, but I've just merged some changes into the develop branch that should help solve this issue: I've reworked IndexSet so it accepts specific instances, and that's what the callbacks pass through. Which means that something like the following should help:

class IndexSet < ThinkingSphinx::IndexSet
  private

  def indices
    return super if instances.empty?

    super.select do |index|
      # this block below will return all core and delta indices for the
      # instance's model, filtered by the shard id for the instance.
      instances.any? { |instance| index.name[/_#{instance.shard_id}_$/] }
    end
  end
end

# This line would go in an initialiser:
ThinkingSphinx::Configuration.instance.index_set_class = IndexSet

The shard_id method in the example could be replaced with your MOD logic, and thus returns the integer of the index an instance is tied to. If you'd like to give this a shot and confirm if it works for you, that'd be great! Then I can look into getting a new gem release published.

lunaru · 2020-06-22T03:19:48Z

@pat awesome, we’ll give this a try this week (hopefully Monday) and let you know. I take it that your refactor is already live on the master branch and we’d just need to pin our gemfile to master to use it?

pat · 2020-06-22T03:29:45Z

master is for released gem versions, so it's only in develop for now, but that's considered stable so there shouldn't be any issues using that branch as the source.

lunaru · 2020-06-22T23:53:03Z

@pat Good news: It looks like this works brilliantly. The only difference with the index_set_class call had to be done differently because it looks like the configuration was getting wiped upon initialization. Here's how we ended up doing it:

class ShardedIndexSet < ThinkingSphinx::IndexSet
  private

  def indices
    return super if instances.empty?

    super.select do |index|
      # this block below will return all core and delta indices for the
      # instance's model, filtered by the shard id for the instance.
      instances.any? { |instance| !instance.respond_to?(:shard_id) || instance.shard_id.blank? || index.name[/_#{instance.shard_id}_/] }
    end
  end
end

# this needs to be set after initializers for the thinking-sphinx gem are run
Rails.application.config.to_prepare do
  ThinkingSphinx::Configuration.instance.index_set_class = ShardedIndexSet
end

Specifically Rails.application.config.to_prepare runs late enough to let us override the default index set class.

lunaru · 2020-06-22T23:54:22Z

I'll go ahead and close this issue and await the next release version

pat · 2020-06-23T03:34:56Z

Thanks for the confirmation, and the note about the initialiser. I'll try to get the new gem release out soon!

pat · 2020-07-20T09:06:35Z

I know I've already noted this in #1173, but wanted also to mention that the newly released v5.0 includes the change discussed here with instances being available in custom IndexSet subclasses :)

pat mentioned this issue Jun 22, 2020

Use instances for index filtering from callbacks. #1171

Merged

lunaru closed this as completed Jun 22, 2020

pat mentioned this issue Jun 24, 2020

Proposal: no more automatic callbacks inserted into models. #1173

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better support for sharding in lifecycle callbacks #1166

Better support for sharding in lifecycle callbacks #1166

lunaru commented Jun 4, 2020

pat commented Jun 5, 2020

lunaru commented Jun 5, 2020

pat commented Jun 22, 2020

lunaru commented Jun 22, 2020

pat commented Jun 22, 2020

lunaru commented Jun 22, 2020

lunaru commented Jun 22, 2020

pat commented Jun 23, 2020

pat commented Jul 20, 2020

Better support for sharding in lifecycle callbacks #1166

Better support for sharding in lifecycle callbacks #1166

Comments

lunaru commented Jun 4, 2020

pat commented Jun 5, 2020

lunaru commented Jun 5, 2020

pat commented Jun 22, 2020

lunaru commented Jun 22, 2020

pat commented Jun 22, 2020

lunaru commented Jun 22, 2020

lunaru commented Jun 22, 2020

pat commented Jun 23, 2020

pat commented Jul 20, 2020