Rolling index rebuild #1134
Comments
Hi Nicholas - thanks for this suggestion. It's definitely a fair request, but I'm not sure it's possible. A rebuild is (or at least, should be) mostly used when at least one index's structure is changed. Given the configuration is all in a single file, it's not really possible to keep legacy data files around - all of the configuration is updated at once, rather than on a per-index basis. If index A was kept the same, and index B was changed, then it's not going to be possible for Sphinx to run reliably with old B data while processing A's records, because the configuration will have the new B schema.

I'll keep thinking this through in my head, but right now I don't see a clear path forward.

Also: how often are you rebuilding? If it's more than just when the structure of indices is changing, I'm curious as to why that's the case.
Hi Pat, thanks for the quick reply. I wasn't using the proper jargon when I posted. I want to initiate a rebuild that updates the config once, then clears one index at a time as it re-indexes it. That way we can continue to serve results from the untouched indices while we repopulate the current index, avoiding empty result sets for the majority of queries. The use case is that we add a new RT index and deploy; our config is updated on deploy, but that has the effect of breaking the existing indices until the full rebuild completes.
As a sidenote, I have implemented two monkey-patches to improve this situation; perhaps you can weigh in on them.
This isn't going to work, I'm afraid, because as you've noted, the offsets change: the existing index files will have old offsets, but any updates that happen will expect new offsets, so you're going to find the wrong records get updated (via callbacks) in Sphinx.

I'm definitely interested in seeing the code for your monkey-patches, and I'm pondering how reindexing could become a multi-process thing, so all indices are processed in parallel (for a specified number of threads), which could hopefully speed things up.

As a manual workaround in the meantime, per-index processing may help:

```
bundle exec rake ts:rebuild
# And then, once that starts on adding data to indices:
bundle exec rake ts:index INDEX_FILTER=user_core
bundle exec rake ts:index INDEX_FILTER=purchase_core
# etc
```
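To make the offset issue concrete: if Sphinx document ids are derived from the ActiveRecord id, the number of indices, and a per-index offset, then adding or removing an index changes every existing document id. The sketch below is illustrative only, an assumed scheme rather than Thinking Sphinx's exact formula.

```ruby
# Illustrative only - a simplified offset scheme, not necessarily
# Thinking Sphinx's exact formula. Suppose each Sphinx document id is
# derived from the ActiveRecord id, the total index count, and the
# index's position (its offset).
def sphinx_document_id(record_id, index_count, offset)
  (record_id * index_count) + offset
end

sphinx_document_id(42, 2, 1) # => 85  (record 42, two indices, offset 1)
sphinx_document_id(42, 3, 1) # => 127 (same record after adding a third index)
# The id shifts, so data written under the old scheme is never matched again.
```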
Automatic removal of stale Sphinx documents during Realtime Indexing

The index priority selection is not as interesting; I simply sort the list of indices as desired at the end of `ThinkingSphinx::Index.define`.
I was going to suggest you submit this as a PR - but frustratingly, Sphinx 2.1.x doesn't support anything but `id`-based WHERE clauses for deletions.

Given Sphinx 2.2.x has been around for over five years, I'm thinking I should probably drop 2.1 support soon. I'll keep this patch in mind for then.
Sounds good. FWIW, while modifying
In particular, this opens the door to parallel processing of indices. Prompted by discussions in #1134. Use your own Processor class, which can either inherit from `ThinkingSphinx::RealTime::Processor` or be whatever you want, as long as it responds to `call`, accepting an array of indices, and a block that gets called after each index is processed.

I've made a few attempts at parallel processing over the past couple of months, but have found a reasonably simple solution just by refactoring the current code and allowing custom processors. This leaves the door open to others building their own implementations. All wrapped up in f0a7c12. And as noted in that commit, I've put together a quick example using the parallel gem: https://gist.github.com/pat/53dbe982a2b311f5f7294809109419d2

In a test app locally, I'm finding this shaves almost 50% off indexing times, but I'm dealing with very simple indices (and only three of them). Certainly, I'd be keen to know whether such an approach is helpful in your app!
This worked well, though I ran into an issue that could potentially be addressed in a similar way. As I reach the end of the list, unless the indexing time is well balanced, we end up with several CPUs sitting idle while any long indices are completed. This led me to try parallelizing within the Populator instead.

The performance difference for my project on a MacBook Pro with 4 cores + hyperthreading was:
Ah, nice to know the per-index option at least made things a decent bit faster on your machine. Fair point about the balancing of indices by size, though - that's a bit tricky. Can you share your changes to the Populator code? I'm curious to see your approach to shifting the parallelisation into there.
Be warned, it's a super janky proof of concept. I just threw this into my config/initializers to test it.
I've just added a commit that allows similar customisation in a slightly more native approach, using your code:

```ruby
class ParallelPopulator < ThinkingSphinx::RealTime::Populator
  def populate
    instrument 'start_populating'

    Parallel.each(scope.find_in_batches(batch_size: batch_size)) do |instances|
      transcriber.copy *instances
      instrument 'populated', :instances => instances
      true # Don't emit any large object because results are accumulated
    end

    ActiveRecord::Base.connection.reconnect!

    instrument 'finish_populating'
  end
end

ThinkingSphinx::RealTime.populator = ParallelPopulator
```

That said, when I gave it a spin, I didn't find any significant speed improvements over the standard non-parallel approach, let alone the Processor parallelisation. So maybe it's far more dependent on the index definitions and the Rails app?
I think it's going to be very dependent on how the indexes in one's app are balanced. We have some very large and very small indexes, so I ended up with the imbalance. However, combined with the reindexing priority monkey-patch I described in #1134 (comment), I ended up prioritizing the large index so it gets a head start while other, shorter indices are being processed in parallel. Therefore I probably won't pursue within-index parallelization, since the gain is minor compared to the simpler approach, and can be achieved with a manual tweaking of indexing order. I'm also a little worried about the whole connection-reconnecting business.

FWIW, here is a simplification of how I'm implementing index prioritization:

```ruby
module IndexExtensions
  def define(*, &block)
    super
    sort_indices
  end

  private

  def sort_indices
    ThinkingSphinx::Configuration.instance.indices.sort_by! do |index|
      index.options[:reindex_priority]
    end
  end
end

ThinkingSphinx::Index.singleton_class.prepend(IndexExtensions)

ThinkingSphinx::Index.define :page, reindex_priority: 5 do
  # your index here
end
```
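One caveat not raised in the thread: `sort_by!` will raise an ArgumentError if any index omits `:reindex_priority`, since `nil` can't be compared with an integer. A defensive variant of that sort, assuming not every index sets the option, might look like this:

```ruby
# Fall back to a default priority so indices that don't set
# :reindex_priority still sort cleanly.
ThinkingSphinx::Configuration.instance.indices.sort_by! do |index|
  index.options.fetch(:reindex_priority, 0)
end
```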
Yeah, avoiding the reconnecting is not a bad idea. For what it's worth, you could remove that monkey-patch and shift the sorting into the custom processor instead? e.g.

```ruby
class ParallelProcessor < ThinkingSphinx::RealTime::Processor
  def call(&block)
    Parallel.map(indices) do |index|
      puts "Populating index #{index.name}"
      ThinkingSphinx::RealTime::Populator.populate index
      puts "Populated index #{index.name}"
      block.call
    end
  end

  private

  def indices
    super.sort_by { |index| index.options[:reindex_priority] }
  end
end

ThinkingSphinx::RealTime.processor = ParallelProcessor
```
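As an aside beyond what's in the thread: the parallel gem also lets you cap the worker count, which is one way to get the "specified number of threads" idea mentioned earlier. A hypothetical variant of the processor above (`ThrottledProcessor` is just an illustrative name):

```ruby
class ThrottledProcessor < ThinkingSphinx::RealTime::Processor
  def call(&block)
    # Cap the worker count; :in_threads is also available if forking a
    # process per worker is undesirable.
    Parallel.map(indices, in_processes: 4) do |index|
      ThinkingSphinx::RealTime::Populator.populate index
      block.call
    end
  end
end

ThinkingSphinx::RealTime.processor = ThrottledProcessor
```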
I guess I could sort on-the-fly each time, since there's no public interface for extending the processor's index list.
I'm a little wary of adding the new feature, just because this hasn't been requested before in TS' long history. Fair point about the above solution… here's a simpler approach, though, that avoids the reliance on a private method and the superclass:

```ruby
ThinkingSphinx::RealTime.processor = Proc.new do |indices, &block|
  Parallel.map(indices) do |index|
    puts "Populating index #{index.name}"
    ThinkingSphinx::RealTime.populator.populate index
    puts "Populated index #{index.name}"
    block.call
  end
end
```
Ah, but I forgot to add the sorting in there. Still, easy enough to add :) The processor needs to be something that responds to `call`.
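A sketch of that same Proc with the sorting folded in, reusing the `:reindex_priority` option from the monkey-patch earlier in the thread (with a default of 0 assumed for indices that don't set it):

```ruby
ThinkingSphinx::RealTime.processor = Proc.new do |indices, &block|
  # Order indices by :reindex_priority, matching the earlier monkey-patch;
  # indices without the option fall back to 0.
  sorted = indices.sort_by { |index| index.options.fetch(:reindex_priority, 0) }

  Parallel.map(sorted) do |index|
    puts "Populating index #{index.name}"
    ThinkingSphinx::RealTime.populator.populate index
    puts "Populated index #{index.name}"
    block.call
  end
end
```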
Yeah, makes sense, thanks for all your help!
No worries, great to get some solutions sorted that help you :)
These changes are now available in the freshly-released v4.4.0 :)
Sphinx 2.1.x only allowed WHERE clauses for deletions to be on the `id` attribute, but 2.2.x onwards is more flexible, and saves us on calculations. This change was suggested by @njakobsen in #1134.
This was suggested by @njakobsen in #1134, and as he says: “This avoids the need to rebuild the index after adding or removing an index. Without this, the old indexed record would not be deleted since Sphinx's calculation of that record's id would not match now that the number/order of indices has changed.” It does mean a touch more overhead when transcribing records, but I think it’s worth it.
Hi Nicholas - just wanted to let you know that the newly released v5.0.0 of Thinking Sphinx includes two changes you'd flagged in this discussion: real-time data is deleted before being inserted again (so old data at different offset values is avoided), and the SphinxQL deletion calls are by ActiveRecord id rather than Sphinx's primary keys (as Sphinx 2.2.11 or newer is now required). There's also a significant change with needing to specify callbacks - though this is more of a big deal for those not using real-time indices: https://github.com/pat/thinking-sphinx/releases/tag/v5.0.0
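For anyone upgrading, the callbacks change means declaring them per model. A rough sketch for a real-time index, based on the v5.0.0 release notes (the `ThinkingSphinx::Callbacks.append` call and `:behaviours` option should be double-checked against the linked release):

```ruby
class Page < ActiveRecord::Base
  # In v5, index callbacks are no longer wired up automatically; each
  # model declares the behaviours it needs (here, real-time updates).
  ThinkingSphinx::Callbacks.append(self, :behaviours => [:real_time])
end
```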
I saw that! Thanks for the update @pat. I'll update our code as soon as I'm able and let you know how it goes.
It would be useful to rebuild indices in a rolling manner, only dropping each index right before it's rebuilt. This is an issue specifically when using realtime indices, where some indices take a long time to rebuild, forcing later indices to sit empty until their turn. Though each index would still return an empty or partial list of results while it is being rebuilt, it could continue serving stale results while the other indices are being rebuilt.