Rolling index rebuild #1134
Comments
Hi Nicholas - thanks for this suggestion. It's definitely a fair request, but I'm not sure it's possible. A rebuild is (or at least, should be) mostly used when at least one index's structure is changed. Given the configuration is all in a single file, it's not really possible to keep legacy data files around - all of the configuration is updated at once, rather than on a per-index basis. If index A was kept the same, and index B was changed, then it's not going to be possible for Sphinx to run reliably with old B data while processing A's records, because the configuration will have the new B schema.

I'll keep thinking this through in my head, but right now I don't see a clear path forward.

Also: how often are you rebuilding? If it's more than just when the structure of indices is changing, I'm curious as to why that's the case.
Hi Pat, thanks for the quick reply. I wasn't using the proper jargon when I posted. I want to initiate a rebuild that updates the config once, then clears one index at a time as it re-indexes it. That way we can continue to serve results from the untouched indices while we repopulate the current index, avoiding empty result sets for the majority of queries. The use case is that we add a new RT index and deploy; our config is updated on deploy, but that has the effect of breaking the existing indices until the full rebuild completes.
As a sidenote, I have implemented two monkey-patches to improve this situation; perhaps you can weigh in on them.
This isn't going to work, I'm afraid, because as you've noted, the offsets change: the existing index files will have old offsets, but any updates that happen will expect new offsets, so you're going to find the wrong records get updated (via callbacks) in Sphinx.

I'm definitely interested in seeing the code for your monkey-patches, and I'm pondering how reindexing could become a multi-process thing, so all indices are processed in parallel (for a specified number of threads), which could hopefully speed things up.

As a manual workaround in the meantime, per-index processing may help:

```
bundle exec rake ts:rebuild
# And then, once that starts on adding data to indices:
bundle exec rake ts:index INDEX_FILTER=user_core
bundle exec rake ts:index INDEX_FILTER=purchase_core
# etc
```
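To make the offset issue concrete: if Sphinx document ids are derived from the ActiveRecord id, the number of indices, and a per-index offset, then adding or removing an index changes every existing document id. The sketch below is illustrative only, an assumed scheme rather than Thinking Sphinx's exact formula.

```ruby
# Illustrative only - a simplified offset scheme, not necessarily
# Thinking Sphinx's exact formula. Suppose each Sphinx document id is
# derived from the ActiveRecord id, the total index count, and the
# index's position (its offset).
def sphinx_document_id(record_id, index_count, offset)
  (record_id * index_count) + offset
end

sphinx_document_id(42, 2, 1) # => 85  (record 42, two indices, offset 1)
sphinx_document_id(42, 3, 1) # => 127 (same record after adding a third index)
# The id shifts, so data written under the old scheme is never matched again.
```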
Automatic removal of stale Sphinx documents during Realtime Indexing

The index priority selection is not as interesting; I simply sort the list of indices as desired at the end of `ThinkingSphinx::Index.define`.
I was going to suggest you submit this as a PR - but frustratingly, Sphinx 2.1.x doesn't support anything but `id`-based WHERE clauses for deletions.

Given Sphinx 2.2.x has been around for over five years, I'm thinking I should probably drop 2.1 support soon. I'll keep this patch in mind for then.
Sounds good. FWIW, while modifying
In particular, this opens the door to parallel processing of indices. Prompted by discussions in #1134. Use your own Processor class, which can either inherit from `ThinkingSphinx::RealTime::Processor` or be whatever you want, as long as it responds to `call`, accepting an array of indices, and a block that gets called after each index is processed.

I've made a few attempts at parallel processing over the past couple of months, but have found a reasonably simple solution just by refactoring the current code and allowing custom processors. This leaves the door open to others building their own implementations. All wrapped up in f0a7c12. And as noted in that commit, I've put together a quick example using the parallel gem: https://gist.github.com/pat/53dbe982a2b311f5f7294809109419d2

In a test app locally, I'm finding this shaves almost 50% off indexing times, but I'm dealing with very simple indices (and only three of them). Certainly, I'd be keen to know whether such an approach is helpful in your app!
This worked well, though I ran into an issue that could potentially be addressed in a similar way. As I reach the end of the list, unless the indexing time is well balanced, we end up with several CPUs sitting idle while any long indices are completed. This led me to try parallelizing within the Populator instead.

The performance difference for my project on a MacBook Pro with 4 cores + hyperthreading was:
Ah, nice to know the per-index option at least made things a decent bit faster on your machine. Fair point about the balancing of indices by size, though - that's a bit tricky. Can you share your changes to the Populator code? I'm curious to see your approach to shifting the parallelisation into there.
Be warned, it's a super janky proof of concept. I just threw this into my config/initializers to test it.
I've just added a commit that allows similar customisation in a slightly more native approach, using your code:

```ruby
class ParallelPopulator < ThinkingSphinx::RealTime::Populator
  def populate
    instrument 'start_populating'

    Parallel.each(scope.find_in_batches(batch_size: batch_size)) do |instances|
      transcriber.copy *instances
      instrument 'populated', :instances => instances
      true # Don't emit any large object because results are accumulated
    end

    ActiveRecord::Base.connection.reconnect!

    instrument 'finish_populating'
  end
end

ThinkingSphinx::RealTime.populator = ParallelPopulator
```

That said, when I gave it a spin, I didn't find any significant speed improvements over the standard non-parallel approach, let alone the Processor parallelisation. So maybe it's far more dependent on the index definitions and the Rails app?
I think it's going to be very dependent on how the indexes in one's app are balanced. We have some very large and very small indexes, so I ended up with the imbalance. However, combined with the reindexing priority monkey-patch I described in #1134 (comment), I ended up prioritizing the large index so it gets a head start while other, shorter indices are being processed in parallel. Therefore I probably won't pursue within-index parallelization, since the gain is minor compared to the simpler approach, and can be achieved with a manual tweaking of indexing order. I'm also a little worried about the whole connection-reconnecting business.

FWIW, here is a simplification of how I'm implementing index prioritization:

```ruby
module IndexExtensions
  def define(*, &block)
    super
    sort_indices
  end

  private

  def sort_indices
    ThinkingSphinx::Configuration.instance.indices.sort_by! do |index|
      index.options[:reindex_priority]
    end
  end
end

ThinkingSphinx::Index.singleton_class.prepend(IndexExtensions)

ThinkingSphinx::Index.define :page, reindex_priority: 5 do
  # your index here
end
```
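One caveat not raised in the thread: `sort_by!` will raise an ArgumentError if any index omits `:reindex_priority`, since `nil` can't be compared with an integer. A defensive variant of that sort, assuming not every index sets the option, might look like this:

```ruby
# Fall back to a default priority so indices that don't set
# :reindex_priority still sort cleanly.
ThinkingSphinx::Configuration.instance.indices.sort_by! do |index|
  index.options.fetch(:reindex_priority, 0)
end
```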
Yeah, avoiding the reconnecting is not a bad idea. For what it's worth, you could remove that monkey-patch and shift the sorting into the custom processor instead? e.g.

```ruby
class ParallelProcessor < ThinkingSphinx::RealTime::Processor
  def call(&block)
    Parallel.map(indices) do |index|
      puts "Populating index #{index.name}"
      ThinkingSphinx::RealTime::Populator.populate index
      puts "Populated index #{index.name}"
      block.call
    end
  end

  private

  def indices
    super.sort_by { |index| index.options[:reindex_priority] }
  end
end

ThinkingSphinx::RealTime.processor = ParallelProcessor
```
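As an aside beyond what's in the thread: the parallel gem also lets you cap the worker count, which is one way to get the "specified number of threads" idea mentioned earlier. A hypothetical variant of the processor above (`ThrottledProcessor` is just an illustrative name):

```ruby
class ThrottledProcessor < ThinkingSphinx::RealTime::Processor
  def call(&block)
    # Cap the worker count; :in_threads is also available if forking a
    # process per worker is undesirable.
    Parallel.map(indices, in_processes: 4) do |index|
      ThinkingSphinx::RealTime::Populator.populate index
      block.call
    end
  end
end

ThinkingSphinx::RealTime.processor = ThrottledProcessor
```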
I guess I could sort on-the-fly each time, since there's no public interface for extending the processor's index list.
I'm a little wary of adding the new feature, just because this hasn't been requested before in TS' long history. Fair point about the above solution… here's a simpler approach, though, that avoids the reliance on a private method and the superclass:

```ruby
ThinkingSphinx::RealTime.processor = Proc.new do |indices, &block|
  Parallel.map(indices) do |index|
    puts "Populating index #{index.name}"
    ThinkingSphinx::RealTime.populator.populate index
    puts "Populated index #{index.name}"
    block.call
  end
end
```
Ah, but I forgot to add the sorting in there. Still, easy enough to add :) The processor needs to be something that responds to `call`.
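A sketch of that same Proc with the sorting folded in, reusing the `:reindex_priority` option from the monkey-patch earlier in the thread (with a default of 0 assumed for indices that don't set it):

```ruby
ThinkingSphinx::RealTime.processor = Proc.new do |indices, &block|
  # Order indices by :reindex_priority, matching the earlier monkey-patch;
  # indices without the option fall back to 0.
  sorted = indices.sort_by { |index| index.options.fetch(:reindex_priority, 0) }

  Parallel.map(sorted) do |index|
    puts "Populating index #{index.name}"
    ThinkingSphinx::RealTime.populator.populate index
    puts "Populated index #{index.name}"
    block.call
  end
end
```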
Yeah, makes sense, thanks for all your help!
No worries, great to get some solutions sorted that help you :)
These changes are now available in the freshly-released v4.4.0 :)
Sphinx 2.1.x only allowed WHERE clauses for deletions to be on the `id` attribute, but 2.2.x onwards is more flexible, and saves us on calculations. This change was suggested by @njakobsen in #1134.
This was suggested by @njakobsen in #1134, and as he says: “This avoids the need to rebuild the index after adding or removing an index. Without this, the old indexed record would not be deleted since Sphinx's calculation of that record's id would not match now that the number/order of indices has changed.” It does mean a touch more overhead when transcribing records, but I think it’s worth it.
Hi Nicholas - just wanted to let you know that the newly released v5.0.0 of Thinking Sphinx includes two changes you'd flagged in this discussion: real-time data is deleted before being inserted again (so old data at different offset values is avoided), and the SphinxQL deletion calls are by ActiveRecord id rather than Sphinx's primary keys (as Sphinx 2.2.11 or newer is now required). There's also a significant change with needing to specify callbacks - though this is more of a big deal for those not using real-time indices: https://github.com/pat/thinking-sphinx/releases/tag/v5.0.0
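For anyone upgrading, the callbacks change means declaring them per model. A rough sketch for a real-time index, based on the v5.0.0 release notes (the `ThinkingSphinx::Callbacks.append` call and `:behaviours` option should be double-checked against the linked release):

```ruby
class Page < ActiveRecord::Base
  # In v5, index callbacks are no longer wired up automatically; each
  # model declares the behaviours it needs (here, real-time updates).
  ThinkingSphinx::Callbacks.append(self, :behaviours => [:real_time])
end
```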
I saw that! Thanks for the update @pat. I'll update our code as soon as I'm able and let you know how it goes.
It would be useful to rebuild indices in a rolling manner, only dropping each index right before it's rebuilt. This is an issue specifically when using realtime indices, where some indices take a long time to rebuild, forcing later indices to sit empty until their turn. Though each index would still return an empty or partial list of results while it is being rebuilt, it could continue serving stale results while the other indices are being rebuilt.