Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tuning for Solr to improve indexing latency #529

Merged
merged 9 commits into from
May 23, 2017

Conversation

byxorna
Copy link
Contributor

@byxorna byxorna commented Mar 10, 2017

Fixes #526

This PR has a bunch of fixes for how solr behaves to improve indexing latency for assets after write.

  1. Removes using explicit commit() calls to the Solr server, which reduces the load on solr, given the default 10ms window of batching asset updates (only a handful of assets are reindexed at a time at this interval during mass tag updates).
  2. Increased assetBatchUpdateWindowMs from 10ms to 30ms to catch more assets in a single batch to add to solr.
  3. Added a commitWithinMs tunable (default 50ms) which lets solr decide on the best time to commit updates to the index, instead of us forcing commits (of possibly very small batches of docs). This means that an asset will be available for search in assetBatchUpdateWindowMs+commitWithinMs=80ms default from update time.
  4. Moves away from hard commits in solr, to using softAutoCommit. This means a document is searchable before solr performs an fsync on the index, reducing the IO load on solr and improving latency between doc update and searchability.

These tunables can be twiddled to balance throughput vs search after write latency for new attributes (i.e. collins.set_attribute!(tag, :foo, :bar), collins.find(foo: :bar)). Using the softAutoCommit feature will improve latency (but not entirely remove it!) when updating assets, then immediately searching for them.

@evanelias @bobpattersonjr @defect @roymarantz would love to hear any suggestions! This isnt a perfect fix for the jetpants issue, but its a start.

@byxorna byxorna requested review from defect and roymarantz March 14, 2017 16:59
@byxorna
Copy link
Contributor Author

byxorna commented Apr 6, 2017

@defect @roymarantz RFR again?

// TODO(gabe): if this log is indexed (serialized first) before solr is
// updated with the asset document, this will throw! This can happen when
// creating a new asset then immediately performing some attribute sets
// which create logs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an issue/ticket to fix this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this could break things if we create a config asset and then immediately perform sets on it? I think jetpants may do that in some cases but can't recall by memory. What exactly would break / what are the implications / is this a new problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@evanelias this isn't new; it's been lurking since inception. Implications are that the tag set will fall (can't find asset to modify in solr index) until solr populates. Not cataclysmic, but definitely not a great consistency model to expose

def defaultMaxConnectionsPerHost = getInt("defaultMaxConnectionsPerHost", 100)
def assetBatchUpdateWindow = getInt("assetBatchUpdateWindowMs", 10) milliseconds
def assetBatchUpdateWindow = getInt("assetBatchUpdateWindowMs", 30) milliseconds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where did these magic numbers come from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same place the first numbers came from (the ether). its discussed in the description of the PR why these numbers were chosen as a default

@@ -32,8 +32,9 @@ object SolrConfig extends Configurable {
def socketTimeout = getInt("socketTimeout", 1000)
def connectionTimeout = getInt("connectionTimeout", 5000)
def maxTotalConnections = getInt("maxTotalConnections", 100)
def commitWithin = getInt("commitWithinMs", 50)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any docs explaining what commitWithinMs is about?

Copy link
Contributor Author

@byxorna byxorna Apr 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, its discussed in bullet 3 in the description of the PR, but also here

fuckingJava.size,
serializer.docType.name.toLowerCase,
SolrConfig.commitWithin))
// dont explicitly hard commit, let solr figure it out and make docs available
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there any downsides to doing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, because a hard commit isnt actually synchronous because this is happening via actor in background thread

val scheduled = new AtomicBoolean(false)
case object Reindex

// TODO(gabe): perform batch updating on asset logs as well to prevent as much churn in solr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is "asset logs" the same as log: AssetLog
If so is this todo note needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is unneeded now

@roymarantz
Copy link
Contributor

👍

@byxorna
Copy link
Contributor Author

byxorna commented May 18, 2017

@michaeljs1990 wanna give this a once-over before i land?

@byxorna
Copy link
Contributor Author

byxorna commented May 20, 2017

also @defect, 👍 ?

@michaeljs1990
Copy link
Contributor

👍 on this. Curious about the jetpants issue mentioned in the summary though but didn't see any open issues about it on that repo.

@byxorna
Copy link
Contributor Author

byxorna commented May 21, 2017

@michaeljs1990 if you set FOO=xyz on an asset, then immediately query for FOO=xyz, the solr index update may not yet be finalized (used for queries) from the write, so the query will return an "older" set of assets. A workaround is to either slow down how quickly jetpants writes/queries collins, or add some logic to set unique marker tags (or perhaps use modification time) when updating assets and retry queries for those marker tags until a non-empty set is returned.

@evanelias @roymarantz @defect @bobpattersonjr ill land this in a day unless i hear any vocal objections?

@byxorna byxorna merged commit e27fb98 into tumblr:master May 23, 2017
@grahamc
Copy link

grahamc commented Jun 1, 2017

I just updated the container used in some jetpants integration testing, and wanted to report this made a direct improvement.

To improve read after write consistency we've added a read-loop after each write, busy-waiting until we can read back what we just wrote. Even with this fix in place, sometimes we would update the secondary role on an asset from A to B and then search for assets where the secondary role is A, and the asset would be returned. After updating the container, under these circumstances with the RaW busy loop, the asset is no longer returned.

Thank you!

@byxorna
Copy link
Contributor Author

byxorna commented Jun 1, 2017

@grahamc glad to hear it! This should be a good first (faux) step towards a truly consistent RW interface :)

@komapa
Copy link
Contributor

komapa commented Jun 1, 2017

Oh, I cc'ed @evanelias but it seems like @grahamc is already basking in the benefits of this :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tune solr for realtime searching
7 participants