Indexing: Support automatic reindex for objects created while Solr is down (Near Realtime Search) #702

Closed
eaquigley opened this issue Jul 9, 2014 · 9 comments

Comments

@eaquigley
Contributor

eaquigley commented Jul 9, 2014


Author Name: Philip Durbin (@pdurbin)
Original Redmine Issue: 4160, https://redmine.hmdc.harvard.edu/issues/4160
Original Date: 2014-06-30
Original Assignee: Philip Durbin


Dataverse 4.0 requires "near realtime search" because the moment dataverses, datasets, or files are added, updated, or deleted, the "cards" and facet counts must immediately reflect the change.

"Near realtime search means that documents are available for search almost immediately after being indexed - additions and updates to documents are seen in 'near' realtime." -- http://wiki.apache.org/solr/NearRealtimeSearch

In order to support near realtime search, we must handle indexing failure and re-try the indexing operation.

As we are designing this system, we should probably consider other cases where detecting failure of a network service and re-trying is desirable, such as:

  • registering DOIs
  • posting to Twitter

We should also consider using notifications for cases where re-indexing was attempted several times but continues to fail.
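
To make the retry-then-notify idea concrete, here is a minimal sketch in plain Java. Everything in it is hypothetical and for illustration only: the class name `RetryingIndexer`, the method `runWithRetry`, the doubling delay, and the `notifier` callback are assumptions, not existing Dataverse code. The same wrapper could in principle be used for other network calls such as DOI registration.

```java
import java.util.concurrent.Callable;
import java.util.function.Consumer;

// Hypothetical helper, not part of Dataverse: retries an operation a bounded
// number of times and sends a notification if every attempt fails.
public class RetryingIndexer {

    /**
     * Runs the operation up to maxAttempts times, doubling the delay after
     * each failure. Returns true on success. If all attempts fail (or the
     * thread is interrupted while waiting), calls the notifier and returns false.
     */
    public static boolean runWithRetry(Callable<Void> operation, int maxAttempts,
                                       long initialDelayMillis, Consumer<String> notifier) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                operation.call();
                return true; // indexing (or other network call) succeeded
            } catch (Exception e) {
                last = e; // remember the most recent failure for the notification
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    break;
                }
                delay *= 2; // simple exponential backoff
            }
        }
        notifier.accept("Giving up after repeated failures: " + last);
        return false;
    }
}
```

A caller would pass the actual indexing call as the `Callable` and something like an email or in-app notification as the `notifier`; both are left abstract here on purpose.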

In DVN 3.x there is a method called getUnindexedStudies at https://github.com/IQSS/dvn/blob/3.6.1/DVN-root/DVN-web/src/main/java/edu/harvard/iq/dvn/core/index/IndexServiceBean.java#L1061 that uses the following query to determine which studies need to be re-indexed:

List<Study> studies = (List<Study>) em.createQuery("SELECT s from Study s where s.lastIndexTime < s.lastUpdateTime OR s.lastIndexTime is NULL").getResultList();
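
The condition in that query can also be written as an ordinary Java predicate, which may be easier to unit test. The sketch below is only an illustration: the `StaleIndexCheck` class and its `Item` type are hypothetical stand-ins for an entity with `lastUpdateTime` and `lastIndexTime` fields, and it assumes those fields are stored as `java.time.Instant`.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of the "stale index" test used by the query above.
public class StaleIndexCheck {

    // Stand-in for an indexed entity; lastIndexTime is null if never indexed.
    public static final class Item {
        final String id;
        final Instant lastUpdateTime;
        final Instant lastIndexTime;

        public Item(String id, Instant lastUpdateTime, Instant lastIndexTime) {
            this.id = id;
            this.lastUpdateTime = lastUpdateTime;
            this.lastIndexTime = lastIndexTime;
        }
    }

    // True when the item was never indexed, or was updated after it was last indexed.
    public static boolean needsReindex(Item item) {
        if (item.lastIndexTime == null) {
            return true;
        }
        // A null lastUpdateTime cannot be compared, so it does not count as stale,
        // which matches how the comparison in the query treats NULL.
        return item.lastUpdateTime != null
                && item.lastIndexTime.isBefore(item.lastUpdateTime);
    }

    // Returns the ids of all items that need re-indexing, in input order.
    public static List<String> findStale(List<Item> items) {
        return items.stream()
                .filter(StaleIndexCheck::needsReindex)
                .map(item -> item.id)
                .collect(Collectors.toList());
    }
}
```

This only works if `lastIndexTime` is written after Solr confirms the update; if it were set before the indexing call, a failed call would leave the item looking fresh.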

Another approach could be to use a database table as a queue (though this approach could be problematic: https://blog.engineyard.com/2011/5-subtle-ways-youre-using-mysql-as-a-queue-and-why-itll-bite-you/ )

See also:

http://lucene.472066.n3.nabble.com/strategies-for-managing-Solr-indexing-failures-and-retries-td4139186.html


Related issue(s): #229
Redmine related issue(s): 3643


@eaquigley eaquigley added this to the Dataverse 4.0: In Review milestone Jul 9, 2014
@scolapasta scolapasta modified the milestones: Beta 4 - Dataverse 4.0, In Review - Dataverse 4.0 Jul 15, 2014
@eaquigley eaquigley modified the milestones: Long Term Issues-Dataverse 4.0, Beta 8 - Dataverse 4.0 Sep 3, 2014
@pdurbin pdurbin modified the milestones: In Review - Dataverse 4.0, Beta 8 - Dataverse 4.0 Oct 14, 2014
@pdurbin pdurbin modified the milestones: Beta 9 - Dataverse 4.0, In Review - Dataverse 4.0 Nov 12, 2014
pdurbin added a commit that referenced this issue Nov 14, 2014
We'll compare these fields to know if Solr data is stale
pdurbin added a commit that referenced this issue Dec 15, 2014
@scolapasta scolapasta removed this from the Beta 11 - Dataverse 4.0 milestone Jan 23, 2015
pdurbin added a commit that referenced this issue Mar 19, 2015
@pdurbin pdurbin assigned pdurbin and unassigned scolapasta Mar 19, 2015
pdurbin added a commit that referenced this issue Mar 24, 2015
Log instead of throwing an Exception.
@scolapasta scolapasta modified the milestones: In Review - Short Term, 4.0.1 Apr 18, 2015
@pdurbin pdurbin removed their assignment Jan 21, 2016
@scolapasta scolapasta removed this from the Not Assigned to a Release milestone Jan 28, 2016
@kcondon
Contributor

kcondon commented May 9, 2016

@pdurbin This appears to have been partly addressed by asynchronous indexing. Closing, but feel free to comment and reopen as needed with more detail on what needs to happen.

@kcondon kcondon closed this as completed May 9, 2016
@pdurbin
Member

pdurbin commented May 9, 2016

@kcondon this is the issue I was using to track the idea of handling indexing failure. For example, what if Solr is down? Some reindexing should happen, hopefully automatically (maybe it gets put in a queue?). If this is not a priority, I'm fine with leaving this issue closed. I defer to you on this. #2322 about fault tolerance is related and still open (addressed by pull request #2985).
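
As a rough sketch of the "put it in a queue" idea: ids of objects whose indexing failed are remembered and retried later, for example once Solr is reachable again. The `PendingReindexQueue` class and its methods are hypothetical, not existing Dataverse code. It is deliberately in-memory, so pending ids would be lost on restart; a durable variant would need to persist them.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical in-memory queue of object ids waiting to be re-indexed.
public class PendingReindexQueue {

    // LinkedHashSet keeps insertion order and ignores duplicate ids.
    private final Set<String> pending = new LinkedHashSet<>();

    // Records that indexing failed for this object and should be retried.
    public synchronized void enqueue(String objectId) {
        pending.add(objectId);
    }

    public synchronized int size() {
        return pending.size();
    }

    /**
     * Tries each pending id once. The indexer returns true when the object was
     * indexed successfully; those ids are removed and returned. Ids that fail
     * again stay in the queue for the next drain.
     */
    public synchronized List<String> drain(Predicate<String> indexer) {
        List<String> done = new ArrayList<>();
        for (String id : new ArrayList<>(pending)) {
            if (indexer.test(id)) {
                pending.remove(id);
                done.add(id);
            }
        }
        return done;
    }
}
```

A scheduled job could call `drain` periodically; whether that is preferable to comparing timestamps in the database is an open design question in this issue.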

@kcondon kcondon changed the title Support Near Realtime Search (handle indexing failures) Indexing: Support automatic reindex for objects created while Solr is down. May 9, 2016
@pdurbin pdurbin changed the title Indexing: Support automatic reindex for objects created while Solr is down. Indexing: Support automatic reindex for objects created while Solr is down (Near Realtime Search) Oct 13, 2017

5 participants