[BUG] Settings should be synced before syncing mappings #992

Closed
soosinha opened this issue Jun 13, 2023 · 5 comments
Labels: bug, must_fix, v2.10.0

@soosinha
Member

What is the bug?
When a customer adds new mappings to the leader index and these mappings depend on analyzers that are newly defined in the settings, the replay fails on the follower side.
As per this logic, the follower tries to apply the operations directly. If the operations need a mapping update, it then tries to sync the remote mapping. But syncing the remote mapping fails if the settings have not yet been synced by the metadata polling task, which runs every 1 minute.

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Start replication on the follower.
  2. Define an analyzer in the leader index settings and add new mappings which use this analyzer.
  3. Immediately after step 2, start indexing documents which contain the new mappings.
  4. IndexReplicationTask will encounter an exception and replication will be auto-paused.

What is the expected behavior?
Replication should succeed, with all the settings and mappings synced to the follower.

Do you have any additional context?
This problem can be solved by syncing the remote settings before syncing the mappings here
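
The proposed ordering can be modeled with a small shell sketch. This is only an illustration of the control flow, not plugin code: the function names and the two state variables are hypothetical stand-ins for the follower's view of the index.

```shell
#!/bin/sh
# Hypothetical stand-ins for follower-side state; not taken from the plugin.
follower_has_analyzer=no
follower_has_mapping=no
call_log=""

# Stand-in for pulling the leader's index settings (including analyzers).
sync_remote_settings() {
  follower_has_analyzer=yes
  call_log="$call_log settings"
}

# Stand-in for pulling the leader's mappings. A mapping that references
# an analyzer the follower does not know about is rejected.
sync_remote_mappings() {
  [ "$follower_has_analyzer" = yes ] || return 1
  follower_has_mapping=yes
  call_log="$call_log mappings"
}

# Stand-in for applying replicated operations; needs the mapping in place.
apply_operations() {
  [ "$follower_has_mapping" = yes ]
}

# Proposed order: try the operations; if they need a mapping update,
# sync settings first, then mappings, then retry.
replay() {
  apply_operations && return 0
  sync_remote_settings && sync_remote_mappings && apply_operations
}
```

In this model, calling `sync_remote_mappings` before `sync_remote_settings` fails, which mirrors the failure described in this issue.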

@soosinha soosinha added bug Something isn't working untriaged must_fix v2.9.0 and removed untriaged labels Jun 13, 2023
@monusingh-1 monusingh-1 self-assigned this Jun 13, 2023
@monusingh-1
Collaborator

Able to reproduce this locally

curl -XPOST http://${LEADER}/fruit-1/_close
curl -u 'admin:admin' -XPUT "http://${LEADER}/fruit-1/_settings" -H 'Content-Type: application/json' -d \
'{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_folded": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}'

curl -XPUT "http://${LEADER}/fruit-1/_mapping?pretty" -H 'Content-type: application/json' \
-d '
{
  "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "std_folded"
      }
    }
}'

curl -XPOST http://${LEADER}/fruit-1/_open

After indexing documents on the leader, the replication status on the follower shows:

 curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_status?pretty"
{
  "status" : "PAUSED",
  "reason" : "AutoPaused:  + [[fruit-1][0] - org.opensearch.OpenSearchException - \"analyzer [std_folded] has not been configured in mappings\"], ",
  "leader_alias" : "leader-cluster",
  "leader_index" : "fruit-1",
  "follower_index" : "fruit-1"
}
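
A script that drives this reproduction could detect the auto-paused state from the status response. The helper below is a minimal sketch: `is_auto_paused` is a hypothetical name, and it uses pattern matching on the `"status"` and `"reason"` fields shown above rather than a real JSON parser.

```shell
#!/bin/sh
# Reads a _status response on stdin.
# Succeeds (exit 0) only if status is PAUSED and the reason starts with
# "AutoPaused:", as in the output above.
is_auto_paused() {
  body=$(cat)
  printf '%s\n' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"PAUSED"' &&
    printf '%s\n' "$body" | grep -Eq '"reason"[[:space:]]*:[[:space:]]*"AutoPaused:'
}
```

It could be used as `curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_status" | is_auto_paused && echo "replication was auto-paused"`.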

@monusingh-1 monusingh-1 mentioned this issue Jun 14, 2023
5 tasks
@monusingh-1 monusingh-1 added the v2.10.0 Issues targeting release v2.10.0 label Jul 24, 2023
@monusingh-1
Collaborator

monusingh-1 commented Aug 18, 2023

Hi @soosinha,
A user can update the static settings of the leader index only after closing the index.
When cross-cluster replication is set up for an index and the index is closed on the leader, any getChanges request that arrives during this time causes the replication to go into the auto-paused state. For example:

curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_status?pretty"
{
  "status" : "SYNCING",
  "reason" : "User initiated",
  "leader_alias" : "leader-cluster",
  "leader_index" : "fruit-1",
  "follower_index" : "fruit-1",
  "syncing_details" : {
    "leader_checkpoint" : 0,
    "follower_checkpoint" : 0,
    "seq_no" : 1
  }
}
❯ curl -XPOST http://localhost:9200/fruit-1/_close
{"acknowledged":true,"shards_acknowledged":true,"indices":{"fruit-1":{"closed":true}}}%
❯ curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_status?pretty"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "replication_exception",
        "reason" : "failed to fetch replication status"
      }
    ],
    "type" : "replication_exception",
    "reason" : "failed to fetch replication status"
  },
  "status" : 500
}
❯ curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_status?pretty"
{
  "status" : "PAUSED",
  "reason" : "AutoPaused:  + [[fruit-1][0] - org.opensearch.indices.IndexClosedException - \"closed\"], ",
  "leader_alias" : "leader-cluster",
  "leader_index" : "fruit-1",
  "follower_index" : "fruit-1"
}

@monusingh-1
Collaborator

monusingh-1 commented Aug 18, 2023

If the user closes the index and opens it again, a getChanges request may arrive in between and auto-pause the replication. However, the close and open may also happen so quickly that no getChanges request falls between them, leaving the replication in the SYNCING state.

If a user does the following:

  1. closes the index
  2. updates the settings to create a new analyzer
  3. updates the index mapping to use that analyzer
  4. opens the index
  5. ingests data using the new mapping

and the above is performed in quick succession, then the replication goes to auto-pause with a different reason:

 curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_status?pretty"
{
  "status" : "PAUSED",
  "reason" : "AutoPaused:  + [[fruit-1][0] - org.opensearch.OpenSearchException - \"analyzer [std_folded] has not been configured in mappings\"], ",
  "leader_alias" : "leader-cluster",
  "leader_index" : "fruit-1",
  "follower_index" : "fruit-1"
}
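
The two auto-pause reasons seen in this thread can be told apart by the text of the `reason` field. The function below is a sketch under that assumption; `classify_pause_reason` and its output labels are hypothetical names, and the matched substrings are copied from the status outputs above.

```shell
#!/bin/sh
# $1 = the "reason" string from a _status response.
# Prints a short label for the cause of the pause.
classify_pause_reason() {
  case "$1" in
    *IndexClosedException*)
      # getChanges hit the leader while the index was closed
      echo leader_closed ;;
    *"has not been configured in mappings"*)
      # mapping reached the follower before the analyzer settings did
      echo missing_analyzer ;;
    AutoPaused:*)
      echo other_auto_pause ;;
    *)
      echo not_auto_paused ;;
  esac
}
```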

@monusingh-1
Collaborator

To overcome the above, the user must simply pause the replication, then update the index settings on the leader index, and then resume the replication. This causes the leader index settings to be replicated to the follower index.

Testing details

  1. start replication
  2. index documents on the leader index
  3. pause replication on the follower index
  4. close the leader index
  5. update the settings to create the analyzer
  6. update the index mapping to use the analyzer
  7. open the leader index
  8. ingest data (optional)
  9. resume replication on the follower index
  10. ingest data (if not done in step 8)

When resume replication is triggered, new persistent tasks are spun up and the leader index settings are synced by IndexReplicationTask.
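
Steps 3 to 9 can be wrapped in one function so they always run in the same order. This is a sketch, not a supported tool: the function names, the `LEADER`/`FOLLOWER`/`INDEX` variables and the `admin:admin` credentials are assumptions based on the local test setup in this thread, and `--fail` is added so that the chain stops on the first HTTP error.

```shell
#!/bin/sh
# Assumed endpoints and index name, matching the local setup in this thread.
LEADER=${LEADER:-http://localhost:9200}
FOLLOWER=${FOLLOWER:-http://localhost:9201}
INDEX=${INDEX:-fruit-1}

# Hypothetical wrapper so credentials and headers live in one place.
http_call() {
  curl -sS --fail -k -u 'admin:admin' -H 'Content-Type: application/json' "$@"
}

# $1 = JSON body for _settings (defines the analyzer)
# $2 = JSON body for _mapping (uses the analyzer)
update_static_settings() {
  settings_json=$1
  mapping_json=$2
  # Pause first, so no getChanges request hits the closed leader index.
  http_call -XPOST "$FOLLOWER/_plugins/_replication/$INDEX/_pause" -d '{}' &&
    http_call -XPOST "$LEADER/$INDEX/_close" &&
    http_call -XPUT "$LEADER/$INDEX/_settings" -d "$settings_json" &&
    http_call -XPUT "$LEADER/$INDEX/_mapping" -d "$mapping_json" &&
    http_call -XPOST "$LEADER/$INDEX/_open" &&
    # Resume last; the settings are synced when replication restarts.
    http_call -XPOST "$FOLLOWER/_plugins/_replication/$INDEX/_resume" -d '{}'
}
```

A call might look like `update_static_settings "$(cat settings.json)" "$(cat mapping.json)"`, where the two files hold the request bodies used in the transcript below.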

Adding testing details below:

{
  "status" : "SYNCING",
  "reason" : "User initiated",
  "leader_alias" : "leader-cluster",
  "leader_index" : "fruit-1",
  "follower_index" : "fruit-1",
  "syncing_details" : {
    "leader_checkpoint" : 0,
    "follower_checkpoint" : 0,
    "seq_no" : 1
  }
}
❯ chmod 777 pause_resume.sh
❯ curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_pause"  -H 'Content-Type: application/json' -d  '{}'

{"acknowledged":true}%
❯ curl -XPOST http://localhost:9200/fruit-1/_close
{"acknowledged":true,"shards_acknowledged":true,"indices":{"fruit-1":{"closed":true}}}%
❯ curl -u 'admin:admin' -XPUT "http://localhost:9200/fruit-1/_settings" -H 'Content-Type: application/json' -d \
'{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_folded": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}'

curl -XPUT "http://localhost:9200/fruit-1/_mapping?pretty" -H 'Content-type: application/json' \
-d '
{
  "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "std_folded"
      }
    }
}'

{"acknowledged":true}{
  "acknowledged" : true
}
❯ curl -XPOST http://localhost:9200/fruit-1/_open
{"acknowledged":true,"shards_acknowledged":true}%
❯ curl -XPOST "http://localhost:9200/fruit-1/_doc/99" -H 'Content-Type: application/json' -d '{"value" : "data99", "my_text": "monu singh"}'
{"_index":"fruit-1","_id":"99","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":1,"_primary_term":3}%
❯ curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_resume"  -H 'Content-Type: application/json' -d  '{}'

{"acknowledged":true}%
❯ curl -XPOST "http://localhost:9200/fruit-1/_doc/98" -H 'Content-Type: application/json' -d '{"value" : "data98", "my_text": "monu singh"}'

{"_index":"fruit-1","_id":"98","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":2,"_primary_term":3}%
❯ curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_status?pretty"

{
  "status" : "SYNCING",
  "reason" : "User initiated",
  "leader_alias" : "leader-cluster",
  "leader_index" : "fruit-1",
  "follower_index" : "fruit-1",
  "syncing_details" : {
    "leader_checkpoint" : 2,
    "follower_checkpoint" : 2,
    "seq_no" : 3
  }
}
❯ curl "localhost:9201/fruit-1/_settings?include_defaults=true"
{"fruit-1":{"settings":{"index":{"replication":{"type":"DOCUMENT"},"number_of_shards":"1","translog":{"generation_threshold_size":"32mb"},"plugins":{"replication":{"follower":{"leader_index":"leader-cluster:fruit-1"}}},"provided_name":"fruit-1","creation_date":"1692334252284","analysis":{"analyzer":{"std_folded":{"filter":["lowercase"],"type":"custom","tokenizer":"standard"}}},"number_of_replicas":"1","uuid":"8SlRCbGTQ5mfaq_YAcgq2A","version":{"created":"137217827"}}},"defaults":{"index":{"flush_after_merge":"512mb","plugins":{"replication":{"translog":{"retention_size":"536870912b","retention_lease":{"pruning":{"enabled":"false"}}}}},"final_pipeline":"_none","max_inner_result_window":"100","unassigned":{"node_left":{"delayed_timeout":"1m"}},"max_terms_count":"65536","routing_partition_size":"1","force_memory_term_dictionary":"false","max_docvalue_fields_search":"100","merge":{"scheduler":{"max_thread_count":"4","auto_throttle":"true","max_merge_count":"9"},"policy":{"reclaim_deletes_weight":"2.0","floor_segment":"2097152b","max_merge_at_once":"10","max_merged_segment":"5368709120b","expunge_deletes_allowed":"10.0","segments_per_tier":"10.0","deletes_pct_allowed":"20.0"}},"max_refresh_listeners":"1000","max_regex_length":"1000","load_fixed_bitset_filters_eagerly":"true","number_of_routing_shards":"1","write":{"wait_for_active_shards":"1"},"verified_before_close":"false","mapping":{"coerce":"false","nested_fields":{"limit":"50"},"depth":{"limit":"20"},"field_name_length":{"limit":"9223372036854775807"},"total_fields":{"limit":"1000"},"nested_objects":{"limit":"10000"},"ignore_malformed":"false"},"soft_deletes":{"enabled":"true","retention":{"operations":"0"},"retention_lease":{"period":"12h"}},"max_script_fields":"32","query":{"default_field":["*"],"parse":{"allow_unmapped_fields":"true"}},"format":"0","sort":{"missing":[],"mode":[],"field":[],"order":[]},"priority":"1","codec":"default","max_rescore_window":"10000","max_adjacency_matrix_filters":"100","analyze":{"max_token_count":"10000"},"gc_deletes":"60s","searchable_snapshot":{"repository":"","index":{"id":""},"snapshot_id":{"name":"","uuid":""}},"optimize_auto_generated_id":"true","max_ngram_diff":"1","hidden":"false","translog":{"flush_threshold_size":"512mb","sync_interval":"5s","retention":{"size":"-1","age":"-1"},"durability":"REQUEST"},"auto_expand_replicas":"false","mapper":{"dynamic":"true"},"recovery":{"type":""},"requests":{"cache":{"enable":"true"}},"data_path":"","merge_on_flush":{"enabled":"true","max_full_flush_merge_wait_time":"10s","policy":"default"},"highlight":{"max_analyzed_offset":"1000000"},"routing":{"rebalance":{"enable":"all"},"allocation":{"enable":"all","total_shards_per_node":"-1"}},"search":{"slowlog":{"level":"TRACE","threshold":{"fetch":{"warn":"-1","trace":"-1","debug":"-1","info":"-1"},"query":{"warn":"-1","trace":"-1","debug":"-1","info":"-1"}}},"default_pipeline":"_none","idle":{"after":"30s"},"throttled":"false"},"fielddata":{"cache":"node"},"codec.compression_level":"3","default_pipeline":"_none","max_slices_per_scroll":"1024","shard":{"check_on_startup":"false"},"max_slices_per_pit":"1024","allocation":{"max_retries":"5","existing_shards_allocator":"gateway_allocator"},"refresh_interval":"1s","indexing":{"slowlog":{"reformat":"true","threshold":{"index":{"warn":"-1","trace":"-1","debug":"-1","info":"-1"}},"source":"1000","level":"TRACE"}},"remote_store":{"translog":{"buffer_interval":"650ms"}},"compound_format":"0.1","blocks":{"metadata":"false","read":"false","read_only_allow_delete":"false","read_only":"false","write":"false"},"max_result_window":"10000","store":{"hybrid":{"mmap":{"extensions":["nvd","dvd","tim","tip","dim","kdd","kdi","cfs","doc"]},"nio":{"extensions":["segments_N","write.lock","si","cfe","fnm","fdx","fdt","pos","pay","nvm","dvm","tvx","tvd","liv","dii","vec","vem"]}},"stats_refresh_interval":"10s","type":"","fs":{"fs_lock":"native"},"preload":[]},"queries":{"cache":{"enabled":"true"}},"warmer":{"enabled":"true"},"max_shingle_diff":"3","query_string":{"lenient":"false"}}}}}%

As the last two outputs show, replication is in the SYNCING state and the std_folded analyzer from the leader index settings is now synced to the follower index.
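
Finding the analyzer in a response of that size by eye is error-prone, so the check could be scripted. The helper below is a rough sketch: `has_analyzer` is a hypothetical name, and it assumes the compact single-line `_settings` output shown above and uses a plain pattern match, so it is not a substitute for a real JSON parser.

```shell
#!/bin/sh
# $1 = analyzer name; reads a compact (single-line) _settings response on stdin.
# Succeeds if the name appears as a key somewhere after "analyzer":{ .
has_analyzer() {
  grep -q "\"analyzer\":{.*\"$1\":{"
}
```

For example: `curl -s "localhost:9201/fruit-1/_settings" | has_analyzer std_folded && echo "analyzer synced"`.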

@soosinha
Member Author

Thanks @monusingh-1 for working on this and verifying the behavior.
This would have been a bug only if the analyzer settings were dynamic. Since the index has to be closed before the analyzer settings can be updated, auto-pause of replication is the expected behavior.
