
Deprecating _primary preference makes getting consistent results impossible(?) #31929

Closed · bra-fsn opened this issue Jul 10, 2018 · 17 comments
Labels: :Distributed Indexing/Distributed, >docs General docs changes, team-discuss

@bra-fsn commented Jul 10, 2018

Elasticsearch up to 6.x has a preference setting (e.g. for get and search operations) named _primary.
According to the docs, this makes the query run on the primary shard:
"The operation will go and be executed only on the primary shards."

On master (7.x) this setting has disappeared (on 6.x it is deprecated), and for the doc API only _local remains.

This makes it impossible to use the doc (and search) APIs to get consistent results. A subsequent query could return stale data. This can happen even if a custom preference value is used, because it just hashes the client to a given shard copy, which may not have been updated yet (while asking the primary would return the correct result). Two different clients could get two different results for the same query at the same time this way (or, if all clients use the same custom preference, they could all see the same stale data).

One could argue that wait_for_active_shards is the solution here, but even if it worked that way (i.e. all replicas and the primary were updated atomically from the client's point of view, which I guess is not the case), it would require all of the shards to always be available for writes, which makes replication somewhat useless, or at least much less useful.

Given all of this, I would like to ask you to restore the _primary option for preference.
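
For reference, this is the kind of request that loses the option - a minimal sketch in Python with requests (the cluster address, index name, and document id are made up):

```python
import requests

ES = "http://localhost:9200"  # made-up local cluster address

# On 6.x this pins the read to the primary shard copy; the setting is
# deprecated there and removed on master (7.x).
resp = requests.get(
    f"{ES}/my-index/_doc/1",
    params={"preference": "_primary"},
)
print(resp.json())
```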

@jbaiera added the discuss and :Search/Search labels Jul 10, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-search-aggs

@colings86 added the :Distributed Indexing/Distributed label Jul 11, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

@colings86 (Contributor)

@bra-fsn could you explain a bit about your use case and why you want to retrieve results only from the primary?

If the use case comes down to wanting every client to see the exact same view of the data, with no stale reads, I think you'd have a hard time achieving this even with _primary. This is because of two things:

  • Shard failure - A primary shard failure will permanently change the primary, and if not all shard copies are up to date you may still see stale data anyway, since the new primary becomes the source of truth. If the new primary is up to date with the failed one then you would be OK, but then using those shard copies in your requests would have been equivalent anyway.
  • Refreshes - If your index is changing then your clients may see different views of the data anyway, depending on whether a refresh happened between one request being serviced and the other.

I also wonder why you don't consider a custom preference an option: if you wanted to replicate the behaviour of _primary and have all requests from all clients go to the same shard copies (assuming no shard failures), you could use the same preference value for all requests across all your clients, instead of creating a preference value per client or session. This would not guarantee that the primary is used, but it would mean that all requests from all clients should use the same shard copies (ignoring shard failures), which seems to be what you are after. A sketch of this is below.
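
A sketch of that suggestion (Python with requests; the cluster address, index name, and preference string are arbitrary):

```python
import requests

ES = "http://localhost:9200"  # made-up cluster address
SHARED_PREFERENCE = "app-wide-fixed-value"  # the same string in every client

# Every client sends the same custom preference, so (absent shard failures)
# all searches are routed to the same shard copies.
resp = requests.get(
    f"{ES}/my-index/_search",
    params={"preference": SHARED_PREFERENCE, "q": "user:kimchy"},
)
print(resp.json()["hits"]["total"])
```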

@DaveCTurner (Contributor)

What @colings86 said is true: using _primary gives no stronger consistency guarantees than using a custom preference.

it just hashes the client to a given shard copy, which may not have been updated yet (while asking the primary would return the correct result).

It is not the case that the primary always serves all acknowledged writes, because it might not have refreshed. It is also not the case that replicas lag behind the primary like this: writes are only acknowledged once written to all copies, not just to the primary.

@bleskes (Contributor) commented Jul 13, 2018

We discussed this on Fix-it Friday. The main use cases for preference are data locality (for attribute-based preferences) and more efficient use of shard caches (where a custom preference value sends requests to the same shard copies, whose caches are potentially hot). We should update the docs to reflect that better, and also to clarify that this is why the _primary_* options were deprecated: they help with neither.

@bleskes added the >docs label and removed the discuss label Jul 13, 2018
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jul 16, 2018
Today it is unclear what guarantees are offered by the search preference
feature, and we claim a guarantee that is stronger than what we really offer:

> A custom value will be used to guarantee that the same shards will be used
> for the same custom value.

This commit clarifies this documentation and explains more clearly why
`_primary`, `_replica`, etc. are deprecated in `6.x` and removed in `master`.

Relates elastic#31929 elastic#26335 elastic#26791.
DaveCTurner added several more commits that referenced this issue Jul 18, 2018, all with the same commit message as above.
@bra-fsn (Author) commented Jul 29, 2018

@colings86, @bleskes
I use Elasticsearch like a database.
For operations which need consistency I use only the doc APIs (index, update, and get).
The way this API works has changed multiple times (most recently in 6.3.0, where update reintroduced reading from the translog in #29264; get still does a refresh, according to the docs), but AFAIK the get APIs have remained consistent in all recent Elasticsearch versions.
From the documentation I assume the following:

  • an index/update of a doc is written to the translog and is acknowledged only after this happens. Depending on the value of wait_for_active_shards, this happens only on the primary or on the given number of replicas. If I have number_of_replicas: 2 and wait_for_active_shards: 2, this should mean a majority of the shard copies must be updated for a write to be acknowledged. I guess that if only one shard copy (the primary) is available (with wait_for_active_shards > 1), the write will happen there but won't be acknowledged as successful, but I'm not sure about that (can you clarify, please?).
  • the doc APIs represent a consistent view, which means that after a write has been acknowledged on the given nodes, if a get/update request arrives on a node which already acknowledged the write (the primary or one of the given number of replicas), it will either do a refresh (get) or read from the translog (update); either way, if there was a modify operation, it can't give back stale data.
  • after reading the resiliency page I would think that with a replica count of 2 and wait_for_active_shards: 2, promoting a stale replica shouldn't be possible. Isn't this why allocation IDs were introduced?

So if I'm right, using the doc APIs in the above way, a consistent view can be achieved; or at least that is how I understand it after reading the docs.
The only thing needed for this to be true is the use of _primary on get operations, because that guarantees the query will hit a node which has already acknowledged the latest writes. (A sketch of the pattern follows below.)
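
Concretely, the pattern I have in mind looks roughly like this (a sketch in Python with requests; the cluster address, index name, and document are made up):

```python
import requests

ES = "http://localhost:9200"  # made-up cluster address

# Write: with number_of_replicas: 2, require two active copies up front,
# so an acknowledged write has reached a majority of the three copies.
requests.put(
    f"{ES}/my-index/_doc/1",
    params={"wait_for_active_shards": 2},
    json={"counter": 42},
)

# Read: pin the get to the primary (works on 6.x, removed on master),
# which has acknowledged every successful write.
resp = requests.get(
    f"{ES}/my-index/_doc/1",
    params={"preference": "_primary"},
)
print(resp.json()["_source"])
```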

Could you please tell me what I'm getting wrong?

Thanks,

@bra-fsn (Author) commented Jul 29, 2018

@DaveCTurner for searches, right. But for the doc APIs, in-flight data will either be served from the translog or trigger a refresh (which means no stale data). BTW, this could even be true for searches: where I need consistency, I use refresh=wait_for on index operations, so consistency here means that after an index/update operation with refresh=wait_for has returned, a new search should include that data (and currently you can only assume this by setting _primary).
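
For example (again a sketch; the cluster address and names are made up):

```python
import requests

ES = "http://localhost:9200"  # made-up cluster address

# Block the index call until the write is visible to search.
requests.put(
    f"{ES}/my-index/_doc/1",
    params={"refresh": "wait_for"},
    json={"status": "published"},
)

# After the call above returns, I expect this search to see the document.
resp = requests.get(
    f"{ES}/my-index/_search",
    params={"q": "status:published"},
)
print(resp.json()["hits"]["total"])
```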

"writes are only acknowledged once written to all copies, not just to the primary"
If this is true, what is wait_for_active_shards for (and why does the doc say its default value is 1, which means waiting only for the primary)?
From:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html

By default, write operations only wait for the primary shards to be active before proceeding (i.e. wait_for_active_shards=1).

@DaveCTurner (Contributor)

Where I need consistency, I use refresh=wait_for on index operations, so consistency here means that after an index/update operation with refresh=wait_for has returned, a new search should include that data (and currently you can only assume this by setting _primary).

When an indexing request with ?refresh=wait_for returns, all active shard copies have replicated the write and exposed it to searches, not just the primary.

writes are only acknowledged once written to all copies, not just to the primary

If this is true, what is wait_for_active_shards for (and why does the doc say its default value is 1, which means waiting only for the primary)?

wait_for_active_shards performs a preflight check on the number of active shard copies before starting work on an indexing operation. The default value of 1 means the check passes even if no replicas are active, but doesn't change the fact that write operations are only acknowledged after they have been replicated to all of the active shard copies. Note also the following paragraph from further down the docs:

It is important to note that this setting greatly reduces the chances of the write operation not writing to the requisite number of shard copies, but it does not completely eliminate the possibility, because this check occurs before the write operation commences. Once the write operation is underway, it is still possible for replication to fail on any number of shard copies but still succeed on the primary.
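
If it matters how many copies a particular write actually reached, the index response reports this in its _shards section; a minimal sketch (Python with requests; the cluster address and index name are made up):

```python
import requests

ES = "http://localhost:9200"  # made-up cluster address

resp = requests.put(
    f"{ES}/my-index/_doc/1",
    params={"wait_for_active_shards": 2},  # preflight check only
    json={"counter": 7},
)
shards = resp.json()["_shards"]
# "successful" counts the copies that actually applied the write;
# replication can still fail on some copies after the preflight passes.
print(shards["total"], shards["successful"], shards["failed"])
```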

@DaveCTurner (Contributor)

The docs were updated in #32098 to address the original question. I can see a case for addressing the followup questions with further doc improvements, so I'm leaving this open and marking it for further discussion.

@bleskes (Contributor) commented Jul 30, 2018

@bra-fsn thanks for explaining. I think you can do what you're doing, but you have to change a bit how you reason about the API. Also, there are some edge cases, when things fail, that you should be aware of. @DaveCTurner explained some details of the API, and the higher-level constructs are explained here: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-replication.html . Please let me know if you find anything unclear in that document. It's very important to us that users understand what they can (and cannot) expect from Elasticsearch.

@bra-fsn (Author) commented Jul 30, 2018

wait_for_active_shards performs a preflight check on the number of active shard copies before starting work on an indexing operation. The default value of 1 means the check passes even if no replicas are active, but doesn't change the fact that write operations are only acknowledged after they have been replicated to all of the active shard copies. Note also the following paragraph from further down the docs:

Oh, I missed those, thanks for pointing that out!

@bra-fsn (Author) commented Jul 30, 2018

@bleskes What's still unclear to me is the answer to the original question.
Could you please tell me whether these are valid or not (and why)?

  1. if an operation is acknowledged, it must not be lost (for example, by electing a stale shard copy)
  2. due to the replication model, the following could happen to a document named doc with an original version of 1 - referred to as doc(version) from now on:
    2.1 there is doc(1) - the initial version
    2.2 an update comes in for doc; the primary applies it, resulting in doc(2)
    2.3 at this point, any new doc API request will see doc(2) when it hits the primary and doc(1) when it hits the replica copies
    2.4 the primary is lost before it can replicate the update to the shard copies
    2.5 the master notices this and promotes one of the shard copies to primary
    2.6 all further requests will see doc(1)
    2.7 the original client which made the doc(1)->doc(2) update will see a failure
  3. the get APIs provide linearizability on a given shard copy: if that copy has received a modification request, any get or update after that will see it applied (whether it is served from the translog or via a refresh is irrelevant here)

Given all of these are true:

  1. by using _primary, linearizability (all later reads return the value of a write) across all requests (no matter which client issued them) can only fail (and only for unacknowledged writes) if the primary fails before it can replicate all in-flight modifications, and if the in-sync allocation IDs can't otherwise prevent a copy that is stale (with respect to unacknowledged writes already applied by the primary) from being promoted to primary. This is far from impossible, but with enough care it can be made a pretty rare event.
  2. without the possibility of using _primary, linearizability will fail all the time (within the replication time window), because clients can connect to any of the shard holders (the primary or the copies).

So I think it would be nicer to keep _primary and change this sentence (from #32098):

They do not help to avoid
inconsistent results that arise from the use of shards that have different
refresh states, and Elasticsearch uses synchronous replication so the primary
does not in general hold fresher data than its replicas.

to something like this: _primary helps to avoid inconsistent results, but with a caveat: if a primary fails and can't update the in-sync allocation IDs, unacknowledged writes may have broken linearizability.

@DaveCTurner (Contributor)

The situation you are describing is a dirty read, which is indeed a counterexample to linearizability. The documentation there indicates that dirty reads can be exposed by an isolated primary, but in fact there is nothing special about primaries: dirty reads can be exposed by any shard copy. The _primary preference doesn't help here: a situation which could lead to a dirty read involves the failure of a primary, so preference=_primary will silently shift to a different shard copy.

  2. without the possibility of using _primary, linearizability will fail all the time (within the replication time window), because clients can connect to any of the shard holders (the primary or the copies).

This isn't the case - there are also prefer_nodes, only_nodes, and custom preference values to guide clients towards reusing the same shard copies each time.
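
For illustration, a sketch of those alternatives (Python with requests; the cluster address, index name, and node IDs are placeholders):

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# Prefer specific nodes when they hold a copy of the relevant shards;
# the node IDs here are placeholders for real IDs from the cluster.
requests.get(
    f"{ES}/my-index/_search",
    params={"preference": "_prefer_nodes:node-id-1,node-id-2", "q": "*:*"},
)

# Or use one fixed custom string in every client for stable routing.
requests.get(
    f"{ES}/my-index/_search",
    params={"preference": "stable-shared-value", "q": "*:*"},
)
```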

@colings86 removed the :Search/Search label Jul 30, 2018
@bleskes (Contributor) commented Jul 30, 2018

To add to what David said - in your description the update to doc(2) was never acknowledged (because it was not fully replicated), but all of the above can happen even for acked writes. Say a primary got isolated but doesn't yet know that this is the case. Another replica is promoted, and that one happens to receive and process the update to doc(2). Now you issue a get and it happens to hit the node holding the old primary. It will happily respond (at that moment) with doc(1).

Bottom line - whatever you do, the model we use (for good reasons; see the doc I linked to) doesn't offer linearizability under failure modes, but it is far more efficient under normal operations. Instead of fighting to reduce the error window (it's already pretty small) and making the system more complex, we prefer to communicate this clearly and make it explicit via the API. The goal is that people build systems that take it into account (potentially by not caring) rather than assuming we give guarantees that only seem to be true.

@DaveCTurner (Contributor)

Thanks for your questions @bra-fsn; this kind of discussion is a good way to show us where our docs might be improved. If there's anything more that we can clarify then don't hold back. I'm closing this now, as there's been no activity for a while and there's no further action to be taken at the moment.

