Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up orphaned real-time records after reindexing. #1192

Merged
merged 3 commits into from
Mar 13, 2021
Merged

Conversation

pat
Copy link
Owner

@pat pat commented Mar 13, 2021

This PR adds a new internal attribute - sphinx_updated_at - to real-time indices, and sets that with the current timestamp whenever that record is inserted or updated. And then, when a reindex happens, any record that has not changed since the reindexing began is removed - as it's not in the known dataset. Likely, it was deleted but the callbacks failed for some reason.

The advantage of this is that running ts:index will remove any records that are no longer indexed - thus keeping index sizes to what they should be, avoiding any bloat that may crop up from situations where records haven't been removed (bulk deletions, for example) - and this is done without the need for deleting all records to begin with (the behaviour of ts:rebuild).

This is currently only enabled with real_time_tidy is set to true for the appropriate environment in config/thinking_sphinx.yml. I see it defaulting to true in the near future, but that might not be until v6.0.

pat added 3 commits March 13, 2021 11:26
Anything that hasn’t been inserted or updated will have an ‘old’ timestamp - the record will be from a deleted model instance that wasn’t cleaned up correctly.
@pat pat merged commit d1f0940 into develop Mar 13, 2021
@pat pat deleted the realtime-cleaning branch March 13, 2021 03:55
@akostadinov
Copy link
Contributor

akostadinov commented Apr 10, 2023

Hello, as a matter of upgrading. I applied the real_time_tidy option to an existing server. It properly generated the rt_attr_uint = sphinx_updated_at lines in the resulting sphinx.conf file.

But then trying to reindex the system, it fails with:

$ rake ts:index
Removing :environment prerequisite from db:create
Generating index files for account_core
rake aborted!
ThinkingSphinx::SphinxError: unknown column: 'sphinx_updated_at' - REPLACE INTO account_core (id, `name`, `account_id`, `username`, `user_full_name`, `email`, `user_key`, `app_id`, `app_name`, `user_id`, `sphinx_internal_id`, `sphinx_internal_class`, `sphinx_deleted`, `sphinx_updated_at`, `provider_account_id`, `tenant_id`, `state`) VALUES (36, 'Provider Name', '2', 'admin 3scaleadmin', ' 3scale Admin', '[email protected] [email protected]', '570a3e1fbaab0446bab40a36abf8bd33', 'a1c74f5e', 'Provider Name\'s App', '2 3', 2, 'Account', 0, 1681142319, 1, 2, 'approved'), (54, 'Developer', '3', 'john', 'John Doe', '[email protected]', '1735cf5c6d2bca70d2c719fbc132ce75', 'bb1df627', 'Developer\'s App', '4', 3, 'Account', 0, 1681142319, 2, 2, 'approved')


Caused by:
Innertube::Pool::BadResource: Innertube::Pool::BadResource


Caused by:
ThinkingSphinx::QueryExecutionError: unknown column: 'sphinx_updated_at' - REPLACE INTO account_core (id, `name`, `account_id`, `username`, `user_full_name`, `email`, `user_key`, `app_id`, `app_name`, `user_id`, `sphinx_internal_id`, `sphinx_internal_class`, `sphinx_deleted`, `sphinx_updated_at`, `provider_account_id`, `tenant_id`, `state`) VALUES (36, 'Provider Name', '2', 'admin 3scaleadmin', ' 3scale Admin', '[email protected] [email protected]', '570a3e1fbaab0446bab40a36abf8bd33', 'a1c74f5e', 'Provider Name\'s App', '2 3', 2, 'Account', 0, 1681142319, 1, 2, 'approved'), (54, 'Developer', '3', 'john', 'John Doe', '[email protected]', '1735cf5c6d2bca70d2c719fbc132ce75', 'bb1df627', 'Developer\'s App', '4', 3, 'Account', 0, 1681142319, 2, 2, 'approved')


Caused by:
Mysql2::Error: unknown column: 'sphinx_updated_at'

Tasks: TOP => ts:index => ts:rt:index
(See full trace by running task with --trace)

What is the suggested upgrade path for this option? Is it possible to upgrade without clearing the indices first?

This is the index file:

index account_core
{ 
  type = rt
  path = /home/avalon/ws/repos/porta/db/sphinx/development/account_core
  charset_table = 0..9, A..Z->a..z, a..z, U+23, U+25, U+27, U+2A..U+2C, U+2E, U+3A, U+3B, U+3F, U+5F, U+60, U+7B, U+7D
  min_infix_len = 3
  html_strip = 1
  html_index_attrs = img=alt,title; a=title
  html_remove_elements = style, script
  rt_field = name
  rt_field = account_id
  rt_field = username
  rt_field = user_full_name
  rt_field = email
  rt_field = user_key
  rt_field = app_id
  rt_field = app_name
  rt_field = user_id
  rt_attr_uint = sphinx_deleted
  rt_attr_uint = sphinx_updated_at
  rt_attr_bigint = sphinx_internal_id
  rt_attr_bigint = provider_account_id
  rt_attr_bigint = tenant_id
  rt_attr_string = sphinx_internal_class
  rt_attr_string = state
}

@akostadinov
Copy link
Contributor

It seems like the index needs to be wiped :(
https://sphinxsearch.com/forum/view.html?id=11061

If you know any other way, please let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants